2005-04-24 20:08:48

by Miklos Szeredi

Subject: [PATCH] private mounts

This simple patch adds support for private (or invisible) mounts. The
rationale is to allow mounts to be private for a user but still in the
global namespace.

An immediate user of this would be FUSE, which currently achieves the
hiding of data with inode->permission(), which is less elegant.

Christoph, I'm especially interested in your opinion, since you were so
strongly opposed to the current solution in FUSE.

Performance measurements indicate that the overhead is about 2% of the
time spent following mounts, or 6 ns per mount on a 533 MHz Celeron.

This patch does:

- add new mount flag: MS_PRIVATE / MNT_PRIVATE
- add new member in struct vfsmount: mnt_uid
- if MNT_PRIVATE is set, set mnt_uid to current->fsuid in
do_add_mount() and do_remount()
- in clone_mnt() copy mnt_uid to the new mount
- in lookup_mnt() while looping through the hash chain for the
mountpoint, check if the mount is "visible" for this process, and
skip it if not
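
As a rough illustration of the last step, here is a user-space model of
the lookup described above. It is only a sketch: `struct model_mount`,
`model_visible()` and `model_lookup()` are hypothetical stand-ins (a
string replaces the parent/dentry pair, and a plain linked list replaces
the kernel hash chain); only the `MNT_PRIVATE` value and the shape of
the visibility test are taken from the patch.

```c
#include <stddef.h>
#include <string.h>

#define MNT_PRIVATE 8   /* same value the patch adds to mount.h */

/* Hypothetical, simplified stand-in for struct vfsmount. */
struct model_mount {
    const char *mountpoint;     /* stands in for mnt_parent + mnt_mountpoint */
    int flags;                  /* stands in for mnt_flags */
    unsigned int uid;           /* stands in for mnt_uid, set at mount time */
    struct model_mount *next;   /* next entry on the same chain */
};

/* Same test as mnt_visible(): a mount without MNT_PRIVATE is visible to
 * everyone, a private mount only to the uid that created it. */
static int model_visible(const struct model_mount *m, unsigned int fsuid)
{
    return !(m->flags & MNT_PRIVATE) || m->uid == fsuid;
}

/* Walk the chain and return the first mount on this mountpoint that the
 * caller may see; invisible mounts are skipped, not reported as errors. */
static const struct model_mount *model_lookup(const struct model_mount *chain,
                                              const char *mountpoint,
                                              unsigned int fsuid)
{
    for (const struct model_mount *p = chain; p != NULL; p = p->next) {
        if (strcmp(p->mountpoint, mountpoint) == 0 && model_visible(p, fsuid))
            return p;
    }
    return NULL;
}
```

In this model a user who does not own the private mount simply falls
through to whatever else is mounted there, or to the underlying
directory when nothing visible is left.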

Comments are appreciated. If there are no vetoes against the patch, I
think it's suitable for -mm.

Thanks,
Miklos

Signed-off-by: Miklos Szeredi <[email protected]>

diff -rup orig/linux-2.6.11/fs/namespace.c linux-2.6.11/fs/namespace.c
--- orig/linux-2.6.11/fs/namespace.c 2005-03-04 23:18:48.000000000 +0100
+++ linux-2.6.11/fs/namespace.c 2005-04-24 12:44:41.000000000 +0200
@@ -81,6 +81,15 @@ void free_vfsmnt(struct vfsmount *mnt)
}

/*
+ * Check if this mount should be skipped or not
+ */
+static inline int mnt_visible(struct vfsmount *mnt)
+{
+ return !(mnt->mnt_flags & MNT_PRIVATE) ||
+ mnt->mnt_uid == current->fsuid;
+}
+
+/*
* Now, lookup_mnt increments the ref count before returning
* the vfsmount struct.
*/
@@ -97,7 +106,8 @@ struct vfsmount *lookup_mnt(struct vfsmo
if (tmp == head)
break;
p = list_entry(tmp, struct vfsmount, mnt_hash);
- if (p->mnt_parent == mnt && p->mnt_mountpoint == dentry) {
+ if (p->mnt_parent == mnt && p->mnt_mountpoint == dentry &&
+ mnt_visible(p)) {
found = mntget(p);
break;
}
@@ -155,6 +165,7 @@ clone_mnt(struct vfsmount *old, struct d

if (mnt) {
mnt->mnt_flags = old->mnt_flags;
+ mnt->mnt_uid = old->mnt_uid;
atomic_inc(&sb->s_active);
mnt->mnt_sb = sb;
mnt->mnt_root = dget(root);
@@ -234,6 +245,7 @@ static int show_vfsmnt(struct seq_file *
{ MNT_NOSUID, ",nosuid" },
{ MNT_NODEV, ",nodev" },
{ MNT_NOEXEC, ",noexec" },
+ { MNT_PRIVATE, ",private" },
{ 0, NULL }
};
struct proc_fs_info *fs_infop;
@@ -252,6 +264,8 @@ static int show_vfsmnt(struct seq_file *
if (mnt->mnt_flags & fs_infop->flag)
seq_puts(m, fs_infop->str);
}
+ if (mnt->mnt_flags & MNT_PRIVATE)
+ seq_printf(m, ",mnt_uid=%u", mnt->mnt_uid);
if (mnt->mnt_sb->s_op->show_options)
err = mnt->mnt_sb->s_op->show_options(m, mnt);
seq_puts(m, " 0 0\n");
@@ -684,8 +698,11 @@ static int do_remount(struct nameidata *

down_write(&sb->s_umount);
err = do_remount_sb(sb, flags, data, 0);
- if (!err)
+ if (!err) {
nd->mnt->mnt_flags=mnt_flags;
+ if (mnt_flags & MNT_PRIVATE)
+ nd->mnt->mnt_uid = current->fsuid;
+ }
up_write(&sb->s_umount);
if (!err)
security_sb_post_remount(nd->mnt, flags, data);
@@ -807,6 +824,8 @@ int do_add_mount(struct vfsmount *newmnt
goto unlock;

newmnt->mnt_flags = mnt_flags;
+ if (mnt_flags & MNT_PRIVATE)
+ newmnt->mnt_uid = current->fsuid;
err = graft_tree(newmnt, nd);

if (err == 0 && fslist) {
@@ -1033,7 +1052,9 @@ long do_mount(char * dev_name, char * di
mnt_flags |= MNT_NODEV;
if (flags & MS_NOEXEC)
mnt_flags |= MNT_NOEXEC;
- flags &= ~(MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_ACTIVE);
+ if (flags & MS_PRIVATE)
+ mnt_flags |= MNT_PRIVATE;
+ flags &= ~(MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_PRIVATE|MS_ACTIVE);

/* ... and get the mountpoint */
retval = path_lookup(dir_name, LOOKUP_FOLLOW, &nd);
diff -rup orig/linux-2.6.11/include/linux/fs.h linux-2.6.11/include/linux/fs.h
--- orig/linux-2.6.11/include/linux/fs.h 2005-03-04 23:19:05.000000000 +0100
+++ linux-2.6.11/include/linux/fs.h 2005-04-24 10:23:33.000000000 +0200
@@ -96,6 +96,7 @@ extern int dir_notify_enable;
#define MS_REMOUNT 32 /* Alter flags of a mounted FS */
#define MS_MANDLOCK 64 /* Allow mandatory locks on an FS */
#define MS_DIRSYNC 128 /* Directory modifications are synchronous */
+#define MS_PRIVATE 256 /* Make this mount invisible to other users */
#define MS_NOATIME 1024 /* Do not update access times. */
#define MS_NODIRATIME 2048 /* Do not update directory access times */
#define MS_BIND 4096
diff -rup orig/linux-2.6.11/include/linux/mount.h linux-2.6.11/include/linux/mount.h
--- orig/linux-2.6.11/include/linux/mount.h 2004-12-25 11:52:55.000000000 +0100
+++ linux-2.6.11/include/linux/mount.h 2005-04-24 10:24:29.000000000 +0200
@@ -19,6 +19,7 @@
#define MNT_NOSUID 1
#define MNT_NODEV 2
#define MNT_NOEXEC 4
+#define MNT_PRIVATE 8

struct vfsmount
{
@@ -31,6 +32,7 @@ struct vfsmount
struct list_head mnt_child; /* and going through their mnt_child */
atomic_t mnt_count;
int mnt_flags;
+ uid_t mnt_uid;
int mnt_expiry_mark; /* true if marked for expiry */
char *mnt_devname; /* Name of device e.g. /dev/dsk/hda1 */
struct list_head mnt_list;


2005-04-24 20:13:45

by Al Viro

Subject: Re: [PATCH] private mounts

On Sun, Apr 24, 2005 at 10:08:13PM +0200, Miklos Szeredi wrote:
> Comments are appreciated. If there are no vetoes against the patch, I
> think it's suitable for -mm.

Vetoed. Having suid application with different pathname resolution than
that of parent just because it is suid is not acceptable. I'm sorry,
but breaking hell knows how many existing applications is not an option.

2005-04-24 20:18:36

by Christoph Hellwig

Subject: Re: [PATCH] private mounts

On Sun, Apr 24, 2005 at 10:08:13PM +0200, Miklos Szeredi wrote:
> This simple patch adds support for private (or invisible) mounts. The
> rationale is to allow mounts to be private for a user but still in the
> global namespace.

As mentioned in the last -fsdevel thread a few times the idea of per-user
mounts is fundamentally flawed. Crossing a namespace boundary must be
explicit - using clone or a new unshare() syscall.

2005-04-24 20:46:00

by Miklos Szeredi

Subject: Re: [PATCH] private mounts

> > Comments are appreciated. If there are no vetoes against the patch, I
> > think it's suitable for -mm.
>
> Vetoed. Having suid application with different pathname resolution than
> that of parent just because it is suid is not acceptable. I'm sorry,
> but breaking hell knows how many existing applications is not an option.

I'm pretty sure any suid program doing path resolution and other
filesystem operations on _behalf_ of the original user will do them
with fsuid, fsgid set to the original. Otherwise they are bound to
break in other cases too (NFS export with root_squash, etc).

Have any counterexamples?

Thanks,
Miklos

2005-04-24 20:50:16

by Miklos Szeredi

Subject: Re: [PATCH] private mounts

> > This simple patch adds support for private (or invisible) mounts. The
> > rationale is to allow mounts to be private for a user but still in the
> > global namespace.
>
> As mentioned in the last -fsdevel thread a few times the idea of per-user
> mounts is fundamentally flawed. Crossing a namespace boundary must be
> explicit - using clone or a new unshare() syscall.

Also mentioned in that thread quite a few times is the fact that the
clone() and unshare() model does not solve people's requirements.

Care to read through that thread and suggest an alternative solution?

Thanks,
Miklos

2005-04-24 20:54:11

by Al Viro

Subject: Re: [PATCH] private mounts

On Sun, Apr 24, 2005 at 10:50:04PM +0200, Miklos Szeredi wrote:
> Also mentioned in that thread quite a few times is the fact that the
> clone() and unshare() model does not solve people's requirements.

Could we please get rid of references to requirements without a rationale?
There's quite enough of that from Carrion-Grade Linux crowd, TYVM.

2005-04-24 21:00:12

by Miklos Szeredi

Subject: Re: [PATCH] private mounts

> > Also mentioned in that thread quite a few times is the fact that the
> > clone() and unshare() model does not solve people's requirements.
>
> Could we please get rid of references to requirements without a rationale?
> There's quite enough of that from Carrion-Grade Linux crowd, TYVM.

The rationale has been explained in that thread. E.g. this quote from
Jamie Lokier in an answer to you:

> I believe the point is:
>
> 1. Person is logged from client Y to server X, and mounts something on
> $HOME/mnt/private (that's on X).
>
> 2. On client Y, person does "scp X:mnt/private/secrets.txt ."
> and wants it to work.
>
> The second operation is a separate login to the first.

Solution?

Thanks,
Miklos

2005-04-24 21:06:07

by Al Viro

Subject: Re: [PATCH] private mounts

On Sun, Apr 24, 2005 at 10:59:46PM +0200, Miklos Szeredi wrote:
> > I believe the point is:
> >
> > 1. Person is logged from client Y to server X, and mounts something on
> > $HOME/mnt/private (that's on X).
> >
> > 2. On client Y, person does "scp X:mnt/private/secrets.txt ."
> > and wants it to work.
> >
> > The second operation is a separate login to the first.
>
> Solution?

... is the same as for the same question with "set of mounts" replaced
with "environment variables".

2005-04-24 21:06:25

by Christoph Hellwig

Subject: Re: [PATCH] private mounts

On Sun, Apr 24, 2005 at 10:59:46PM +0200, Miklos Szeredi wrote:
> > Could we please get rid of references to requirements without a rationale?
> > There's quite enough of that from Carrion-Grade Linux crowd, TYVM.
>
> The rationale has been explained in that thread. E.g. this quote from
> Jamie Lokier in an answer to you:

You still haven't written down coherent requirements.

>
> > I believe the point is:
> >
> > 1. Person is logged from client Y to server X, and mounts something on
> > $HOME/mnt/private (that's on X).
> >
> > 2. On client Y, person does "scp X:mnt/private/secrets.txt ."
> > and wants it to work.
> >
> > The second operation is a separate login to the first.
>
> Solution?

just restart your shell. Same way you do that after adjusting $PATH.

2005-04-24 21:13:13

by Jamie Lokier

Subject: Re: [PATCH] private mounts

Christoph Hellwig wrote:
> > > I believe the point is:
> > >
> > > 1. Person is logged from client Y to server X, and mounts something on
> > > $HOME/mnt/private (that's on X).
> > >
> > > 2. On client Y, person does "scp X:mnt/private/secrets.txt ."
> > > and wants it to work.
> > >
> > > The second operation is a separate login to the first.
> >
> > Solution?
>
> just restart your shell. Same way you do that after adjusting $PATH.

What do you mean?

I cannot think of any way restarting the shell would solve the above.

-- Jamie

2005-04-24 21:16:17

by Miklos Szeredi

Subject: Re: [PATCH] private mounts

> > > I believe the point is:
> > >
> > > 1. Person is logged from client Y to server X, and mounts something on
> > > $HOME/mnt/private (that's on X).
> > >
> > > 2. On client Y, person does "scp X:mnt/private/secrets.txt ."
> > > and wants it to work.
> > >
> > > The second operation is a separate login to the first.
> >
> > Solution?
>
> ... is the same as for the same question with "set of mounts" replaced
> with "environment variables".

No. You can't set "mount environment" in scp.

Otherwise your analogy is nice, but misses a few points. The usage of
mounts that we are talking about is much more dynamic than usage of
environment variables. You wouldn't want to set an environment
variable in all your shells just to access a remote system through
sshfs for example. It _is_ possible (except the ftp, scp case) but
_very_ inconvenient.

I ask again, what solution would you suggest?

Thanks,
Miklos

2005-04-24 21:19:44

by Al Viro

Subject: Re: [PATCH] private mounts

On Sun, Apr 24, 2005 at 11:15:35PM +0200, Miklos Szeredi wrote:
> No. You can't set "mount environment" in scp.

Of course you can. It does execute the obvious set of rc files.

> Otherwise your analogy is nice, but misses a few points. The usage of
> mounts that we are talking about is much more dynamic than usage of
> environment variables.

What the hell are you smoking and just how are you using shell?

2005-04-24 21:29:54

by Miklos Szeredi

Subject: Re: [PATCH] private mounts

> On Sun, Apr 24, 2005 at 11:15:35PM +0200, Miklos Szeredi wrote:
> > No. You can't set "mount environment" in scp.
>
> Of course you can. It does execute the obvious set of rc files.

Don't think so. ftp server and sftp server sure as hell don't.

> > Otherwise your analogy is nice, but misses a few points. The usage of
> > mounts that we are talking about is much more dynamic than usage of
> > environment variables.
>
> What the hell are you smoking and just how are you using shell?

Maybe differently from you :). It's not that often that I have to
tweak environment variables. They are usually set by scripts.

However if you write me a script that reads my mind as to which server
I want to mount with sshfs at which time, I give you all my respect.

Miklos

2005-04-24 21:38:44

by Jamie Lokier

Subject: Re: [PATCH] private mounts

Al Viro wrote:
> > > I believe the point is:
> > >
> > > 1. Person is logged from client Y to server X, and mounts something on
> > > $HOME/mnt/private (that's on X).
> > >
> > > 2. On client Y, person does "scp X:mnt/private/secrets.txt ."
> > > and wants it to work.
> > >
> > > The second operation is a separate login to the first.
> >
> > Solution?
>
> ... is the same as for the same question with "set of mounts" replaced
> with "environment variables".

Not quite.

After changing environment variables in .profile, you can copy them to
other shells using ". ~/.profile".

There is no analogous mechanism to copy namespaces.

I agree with you that Miklos' patch is not the right way to do it.

Much better is the proposal to make namespaces first-class objects,
that can be switched to. Then users can choose to have themselves a
namespace containing their private mounts, if they want it, with
login/libpam or even a program run from .profile switching into it.

While users can be allowed to create their own namespaces which affect
the path traversal of their _own_ directories, it's important that the
existence of such namespaces cannot affect path traversal of other
directories such as /etc, or /autofs/whatever - and that creation of
namespaces by a user cannot prevent the unmounting of a non-user
filesystem either.

The way to do that is shared subtrees, or something along those lines.

Here is one possible implementation:

As far as I can tell, namespaces are equivalent to predicates attached
to every mount - the predicate being "this mount intercepts path
traversal at this point if current namespace == X".

It makes sense, when users can create namespaces for themselves, that
the predicate be changed to "this mount valid if [list of current
namespace and all parent namespaces] contains X". Parent namespace
means the namespace from which a CLONE_NS namespace inherits.
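
To make the two predicates concrete, here is a small user-space sketch.
The types and function names (`model_ns`, `model_nsmount`,
`valid_exact()`, `valid_inherited()`) are hypothetical and exist only
for illustration; nothing here is an existing kernel interface. The
only assumption is that each namespace remembers the namespace it was
cloned from.

```c
#include <stddef.h>

/* Hypothetical namespace object: only the parent link matters here. */
struct model_ns {
    const struct model_ns *parent;  /* namespace this one was cloned from,
                                       NULL for the initial namespace */
};

/* Hypothetical mount: "X" in the predicate is the owning namespace. */
struct model_nsmount {
    const struct model_ns *owner;
};

/* First predicate: the mount intercepts path traversal only if the
 * current namespace is exactly X. */
static int valid_exact(const struct model_nsmount *m,
                       const struct model_ns *current_ns)
{
    return m->owner == current_ns;
}

/* Changed predicate: the mount is valid if X appears in the list made
 * of the current namespace and all of its parent namespaces. */
static int valid_inherited(const struct model_nsmount *m,
                           const struct model_ns *current_ns)
{
    for (const struct model_ns *ns = current_ns; ns != NULL; ns = ns->parent) {
        if (ns == m->owner)
            return 1;
    }
    return 0;
}
```

With the inherited form, a mount owned by the system namespace stays
valid inside a user's cloned namespace, while a mount owned by the
user's namespace is never valid in the system namespace.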

Then it would be safe (i.e. secure) to allow ordinary users to use
CLONE_NS for the purpose of establishing private namespace(s), within
which they can mount things on directories they own. But those users
would continue to see mounts & unmounts done by the system in other
directories such as /mnt and /autofs. Effectively this confines the
new namespace to only affecting directories owned by the user.

That would work properly with suid programs, properly with autofs and
also manual system-wide administration, and it is general enough that
it doesn't force any particular policy. Also, it would be usable for
partial sharing of resources in virtual server and chroot scenarios.
What's not to like? :)

-- Jamie

2005-04-24 21:40:12

by Jamie Lokier

Subject: Re: [PATCH] private mounts

Miklos Szeredi wrote:
> > On Sun, Apr 24, 2005 at 11:15:35PM +0200, Miklos Szeredi wrote:
> > > No. You can't set "mount environment" in scp.
> >
> > Of course you can. It does execute the obvious set of rc files.
>
> Don't think so. ftp server and sftp server sure as hell don't.

That's no argument, because you are free to change the ftp and sftp
servers to add this behaviour if you want it.

-- Jamie

2005-04-24 21:43:51

by Jamie Lokier

Subject: Re: [PATCH] private mounts

Al Viro wrote:
> On Sun, Apr 24, 2005 at 11:15:35PM +0200, Miklos Szeredi wrote:
> > No. You can't set "mount environment" in scp.
>
> Of course you can. It does execute the obvious set of rc files.

It doesn't work for the specified use-scenario. The reason is that
there is no command or system call that can be executed from those rc
files to join an existing namespace.

He wants to do this:

1. From client, login to server and do a usermount on $HOME/private.

2. From client, login to server and read the files previously mounted.

-- Jamie

2005-04-24 22:06:34

by maciek

Subject: Re: [PATCH] private mounts

Hi, Jamie Lokier wrote:

>Al Viro wrote:
>
>
>>On Sun, Apr 24, 2005 at 11:15:35PM +0200, Miklos Szeredi wrote:
>>
>>
>>>No. You can't set "mount environment" in scp.
>>>
>>>
>>Of course you can. It does execute the obvious set of rc files.
>>
>>
>
>It doesn't work for the specified use-scenario. The reason is that
>there is no command or system call that can be executed from those rc
>files to join an existing namespace.
>
>He wants to do this:
>
> 1. From client, login to server and do a usermount on $HOME/private.
>
> 2. From client, login to server and read the files previously mounted.
>
>-- Jamie
>
>
Maybe, pam_mount would be the solution?

http://www.flyn.org/projects/pam_mount/

It mounts a filesystem when you log in and unmounts it on exit; you
just set the mount options.

Maciek Stopa


2005-04-24 22:20:30

by Ram Pai

Subject: Re: [PATCH] private mounts

On Sun, 2005-04-24 at 14:38, Jamie Lokier wrote:
> Al Viro wrote:
> > > > I believe the point is:
> > > >
> > > > 1. Person is logged from client Y to server X, and mounts something on
> > > > $HOME/mnt/private (that's on X).
> > > >
> > > > 2. On client Y, person does "scp X:mnt/private/secrets.txt ."
> > > > and wants it to work.
> > > >
> > > > The second operation is a separate login to the first.
> > >
> > > Solution?
> >
> > ... is the same as for the same question with "set of mounts" replaced
> > with "environment variables".
>
> Not quite.
>
> After changing environment variables in .profile, you can copy them to
> other shells using ". ~/.profile".
>
> There is no analogous mechanism to copy namespaces.
>
> I agree with you that Miklos' patch is not the right way to do it.
>
> Much better is the proposal to make namespaces first-class objects,
> that can be switched to. Then users can choose to have themselves a
> namespace containing their private mounts, if they want it, with
> login/libpam or even a program run from .profile switching into it.
>
> While users can be allowed to create their own namespaces which affect
> the path traversal of their _own_ directories, it's important that the
> existence of such namespaces cannot affect path traversal of other
> directories such as /etc, or /autofs/whatever - and that creation of
> namespaces by a user cannot prevent the unmounting of a non-user
> filesystem either.
>
> The way to do that is shared subtrees, or something along those lines.

Right. To add to that: to begin with, the system namespace has its
entire tree shared. So when a new namespace is cloned, it also sees
any new mounts, unmounts, and binds done in the system namespace.
(The system namespace is the initial namespace, created by default.)

Any private mount done by the user in his private namespace will
first make that part of the tree private and then continue with the
mount. Otherwise the private mount would end up showing in the system
namespace (since it is shared).
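
The ordering described above can be sketched with a toy model. It is
deliberately crude and every name in it (`model_dir`, `mount_in_system()`,
`mount_in_user()`, `make_private()`) is hypothetical: each directory
just counts the mounts each of two namespaces can see, and a `shared`
flag decides whether a mount in one namespace shows up in the other.
Real shared subtrees would need much more than this.

```c
/* What one namespace sees on a directory: just a mount count here. */
struct model_view {
    int nmounts;
};

/* One directory as seen from the system namespace and from a
 * namespace cloned from it. */
struct model_dir {
    int shared;                   /* 1 while the subtree is still shared */
    struct model_view system_ns;  /* view from the system namespace */
    struct model_view user_ns;    /* view from the user's cloned namespace */
};

/* A mount done in the system namespace shows up in the clone as long
 * as the subtree is shared. */
static void mount_in_system(struct model_dir *d)
{
    d->system_ns.nmounts++;
    if (d->shared)
        d->user_ns.nmounts++;
}

/* A mount done in the user's namespace leaks back into the system
 * namespace unless the subtree was made private first. */
static void mount_in_user(struct model_dir *d)
{
    d->user_ns.nmounts++;
    if (d->shared)
        d->system_ns.nmounts++;
}

/* Step one of a private mount: stop sharing this part of the tree. */
static void make_private(struct model_dir *d)
{
    d->shared = 0;
}
```

Calling `make_private()` before `mount_in_user()` keeps the mount out
of the system namespace, while directories that stay shared (such as
/mnt) keep receiving system mounts in the clone.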

RP
>
> Here is one possible implementation:
>
> As far as I can tell, namespaces are equivalent to predicates attached
> to every mount - the predicate being "this mount intercepts path
> traversal at this point if current namespace == X".
>
> It makes sense, when users can create namespaces for themselves, that
> the predicate be changed to "this mount valid if [list of current
> namespace and all parent namespaces] contains X". Parent namespace
> means the namespace from which a CLONE_NS namespace inherits.
>
> Then it would be safe (i.e. secure) to allow ordinary users to use
> CLONE_NS for the purpose of establishing private namespace(s), within
> which they can mount things on directories they own. But those users
> would continue to see mounts & unmounts done by the system in other
> directories such as /mnt and /autofs. Effectively this confines the
> new namespace to only affecting directories owned by the user.
>
> That would work properly with suid programs, properly with autofs and
> also manual system-wide administration, and it is general enough that
> it doesn't force any particular policy. Also, it would be usable for
> partial sharing of resources in virtual server and chroot scenarios.
> What's not to like? :)


>
> -- Jamie
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2005-04-24 22:23:20

by Jamie Lokier

Subject: Re: [PATCH] private mounts

Ram wrote:
> > Much better is the proposal to make namespaces first-class objects,
> > that can be switched to. Then users can choose to have themselves a
> > namespace containing their private mounts, if they want it, with
> > login/libpam or even a program run from .profile switching into it.
> >
> > While users can be allowed to create their own namespaces which affect
> > the path traversal of their _own_ directories, it's important that the
> > existence of such namespaces cannot affect path traversal of other
> > directories such as /etc, or /autofs/whatever - and that creation of
> > namespaces by a user cannot prevent the unmounting of a non-user
> > filesystem either.
> >
> > The way to do that is shared subtrees, or something along those lines.
>
> Right. To add to that: to begin with, the system namespace has its
> entire tree shared. So when a new namespace is cloned, it also sees
> any new mounts, unmounts, and binds done in the system namespace.
> (The system namespace is the initial namespace, created by default.)
>
> Any private mount done by the user in his private namespace will
> first make that part of the tree private and then continue with the
> mount. Otherwise the private mount would end up showing in the system
> namespace (since it is shared).

Yes, exactly that.

-- Jamie

2005-04-25 06:00:59

by Miklos Szeredi

Subject: Re: [PATCH] private mounts

> >
> > ... is the same as for the same question with "set of mounts" replaced
> > with "environment variables".
>
> Not quite.
>
> After changing environment variables in .profile, you can copy them to
> other shells using ". ~/.profile".
>
> There is no analogous mechanism to copy namespaces.
>
> I agree with you that Miklos' patch is not the right way to do it.

I'm not sure that it is either. But, see below...

> Much better is the proposal to make namespaces first-class objects,
> that can be switched to. Then users can choose to have themselves a
> namespace containing their private mounts, if they want it, with
> login/libpam or even a program run from .profile switching into it.

It would be good if it could be done just in libpam. But that would
require every libpam user to call into it after the fork() or
whatever, so unshare() and join_namespace() don't mess up the server
running environment.

If not, then it would mean modifying numerous programs, having these
modifications integrated, then having distributions pick up the
changes, etc. I would imagine quite a long cycle for this to be
actually useful.

> While users can be allowed to create their own namespaces which affect
> the path traversal of their _own_ directories, it's important that the
> existence of such namespaces cannot affect path traversal of other
> directories such as /etc, or /autofs/whatever - and that creation of
> namespaces by a user cannot prevent the unmounting of a non-user
> filesystem either.
>
> The way to do that is shared subtrees, or something along those lines.

Yes, but we would be achieving essentially the same as my patch, just
with more complexity. And my patch achieves what FUSE does in 2 lines
of code, namely hide the mount from other users by returning -EACCES
in case fsuid does not match the mount owner.
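
For readers unfamiliar with that check, the permission-based hiding
amounts to something like the following. This is a user-space sketch
only: `model_fuse_permission()` is a hypothetical name and not the
actual FUSE code; it just compares the caller's fsuid against the uid
that made the mount.

```c
#include <errno.h>

/* Hypothetical model of permission-based hiding: the mount stays in
 * place for everyone, but any access by a different fsuid is refused. */
static int model_fuse_permission(unsigned int mount_owner, unsigned int fsuid)
{
    if (fsuid != mount_owner)
        return -EACCES;   /* not the mount owner: deny access */
    return 0;             /* owner: continue with the normal checks */
}
```

The visible difference from the lookup-based approach is that other
users still cross into the mount and get an error, instead of seeing
the underlying directory.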

I agree that your solution is more flexible, but it's also hugely
more complex. If somebody wants to implement it, fine. But don't
expect me to do it, unless some company hires me for fs development
(hint, hint ;)

Thanks,
Miklos

2005-04-25 06:41:54

by Ram Pai

Subject: Re: [PATCH] private mounts

On Sun, 2005-04-24 at 23:00, Miklos Szeredi wrote:
> > >
> > > ... is the same as for the same question with "set of mounts" replaced
> > > with "environment variables".
> >
> > Not quite.
> >
> > After changing environment variables in .profile, you can copy them to
> > other shells using ". ~/.profile".
> >
> > There is no analogous mechanism to copy namespaces.
> >
> > I agree with you that Miklos' patch is not the right way to do it.
>
> I'm not sure that it is either. But, see below...
>
> > Much better is the proposal to make namespaces first-class objects,
> > that can be switched to. Then users can choose to have themselves a
> > namespace containing their private mounts, if they want it, with
> > login/libpam or even a program run from .profile switching into it.
>
> It would be good if it could be done just in libpam. But that would
> require every libpam user to call into it after the fork() or
> whatever, so unshare() and join_namespace() don't mess up the server
> running environment.
>
> If not, then it would mean modifying numerous programs, having these
> modifications integrated, then having distributions pick up the
> changes, etc. I would imagine quite a long cycle for this to be
> actually useful.
>
> > While users can be allowed to create their own namespaces which affect
> > the path traversal of their _own_ directories, it's important that the
> > existence of such namespaces cannot affect path traversal of other
> > directories such as /etc, or /autofs/whatever - and that creation of
> > namespaces by a user cannot prevent the unmounting of a non-user
> > filesystem either.
> >
> > The way to do that is shared subtrees, or something along those lines.
>
> Yes, but we would be achieving essentially the same as my patch, just
> with more complexity. And my patch achieves what FUSE does in 2 lines
> of code, namely hide the mount from other users by returning -EACCES
> in case fsuid does not match the mount owner.
>

I am not yet sure how invisible mounts can be used to solve the FUSE
problem.

Again my understanding of the basic requirement of FUSE is:

1. A user being able to set up his own VFS-mount environment which
is only visible to the user.
2. The same user being able to see exactly the same VFS-mount
environment from any login session.

RP

> I agree that your solution is more flexible, but it's also hugely
> more complex. If somebody wants to implement it, fine. But don't
> expect me to do it, unless some company hires me for fs development
> (hint, hint ;)



>
> Thanks,
> Miklos

2005-04-25 07:11:49

by Jan Hudec

Subject: Re: [PATCH] private mounts

On Sun, Apr 24, 2005 at 23:29:22 +0200, Miklos Szeredi wrote:
> > On Sun, Apr 24, 2005 at 11:15:35PM +0200, Miklos Szeredi wrote:
> > > No. You can't set "mount environment" in scp.
> >
> > Of course you can. It does execute the obvious set of rc files.
>
> Don't think so. ftp server and sftp server sure as hell don't.

Sftp sure *DOES*. It is invoked by a shell, which is not run as a login
shell, but even a non-login shell sources an rc file.

> > > Otherwise your analogy is nice, but misses a few points. The usage of
> > > mounts that we are talking about is much more dynamic than usage of
> > > environment variables.
> >
> > What the hell are you smoking and just how are you using shell?
>
> Maybe differently from you :). It's not that often that I have to
> tweak environment variables. They are usually set by scripts.
>
> However if you write me a script that reads my mind as to which server
> I want to mount with sshfs at which time, I give you all my respect.

I can't write a script that reads your mind. But I sure can write
a script that finds out what you mounted in the other shells (with help
of a little wrapper around the mount command).

-------------------------------------------------------------------------------
Jan 'Bulb' Hudec <[email protected]>



2005-04-25 07:15:48

by Jan Hudec

Subject: Re: [PATCH] private mounts

On Sun, Apr 24, 2005 at 22:43:39 +0100, Jamie Lokier wrote:
> Al Viro wrote:
> > On Sun, Apr 24, 2005 at 11:15:35PM +0200, Miklos Szeredi wrote:
> > > No. You can't set "mount environment" in scp.
> >
> > Of course you can. It does execute the obvious set of rc files.
>
> It doesn't work for the specified use-scenario. The reason is that
> there is no command or system call that can be executed from those rc
> files to join an existing namespace.
>
> He wants to do this:
>
> 1. From client, login to server and do a usermount on $HOME/private.
>
> 2. From client, login to server and read the files previously mounted.

OK, that can almost be done. All that is needed from the kernel is the
ability to bind mount from an open directory handle instead of a path!
The rest is doable in userland.

-------------------------------------------------------------------------------
Jan 'Bulb' Hudec <[email protected]>



2005-04-25 07:23:02

by Jan Hudec

Subject: Re: [PATCH] private mounts

On Mon, Apr 25, 2005 at 08:00:20 +0200, Miklos Szeredi wrote:
> > Much better is the proposal to make namespaces first-class objects,
> > that can be switched to. Then users can choose to have themselves a
> > namespace containing their private mounts, if they want it, with
> > login/libpam or even a program run from .profile switching into it.
>
> It would be good if it could be done just in libpam. But that would
> require every libpam user to call into it after the fork() or
> whatever, so unshare() and join_namespace() don't mess up the server
> running environment.

They do. They *HAVE* to! The 'session' stage modifies the environment,
so it must be done after the fork. So if it modifies the namespace in
addition to the environment, it won't make a difference.

> If not, then it would mean modifying numerous programs, having these
> modifications integrated, then having distributions pick up the
> changes, etc. I would imagine quite a long cycle for this to be
> actually useful.

-------------------------------------------------------------------------------
Jan 'Bulb' Hudec <[email protected]>



2005-04-25 09:48:14

by Olivier Galibert

Subject: Re: [PATCH] private mounts

On Sun, Apr 24, 2005 at 10:19:42PM +0100, Al Viro wrote:
> Of course you can. It does execute the obvious set of rc files.

Is there a possibility for a process to change its namespace to
another existing one? That would be needed to have a per-user
namespace you go to from rc files or pam.

I'd kinda wonder what happens to pwd.

OG.

2005-04-25 09:56:14

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] private mounts

> I am not yet sure how invisible mounts can be used to solve the FUSE
> problem.
>
> Again my understanding of the basic requirement of FUSE is:
>
> 1. A user being able to setup his own VFS-mount environment which
> is only visible to the user.
> 2. The same user being able to see exactly the same VFS-mount
> environment from any login session.

More generally:

1. the files exported by the FUSE filesystem should not be accessible
by other users.

2. The user should see exactly the same files from any login session.

These can be satisfied in various ways. Permission checking, or by
making FUSE mounts invisible to other users, or with private
namespaces (in increasing complexity).

Thanks,
Miklos

2005-04-25 09:59:23

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] private mounts

> > Don't think so. ftp server and sftp server sure as hell don't.
>
> Sftp sure *DOES*. It is invoked by shell, which is not run as login one,
> but even non-login shell sources an rc file.

You win :)

> > However if you write me a script that reads my mind as to which server
> > I want to mount with sshfs at which time, I give you all my respect.
>
> I can't write a script that reads your mind. But I sure can write
> a script that finds out what you mounted in the other shells (with help
> of a little wrapper around the mount command).

How do you bind mount it from a different namespace? You _do_ need
bind mount, since a new mount might require password input, etc...

Thanks,
Miklos

2005-04-25 10:09:12

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] private mounts

> They do. They *HAVE* to! The 'session' stage modifies the environment,
> so it must be done after the fork. So if it, in addition to environment,
> modifies namespace, it won't make a difference.

That is good news.

So in theory it's doable. Anyone willing to help putting it all
together?

Thanks,
Miklos

2005-04-25 10:49:19

by Heikki Orsila

[permalink] [raw]
Subject: Re: [PATCH] private mounts

Miklos Szeredi <[email protected]> wrote:
> More generally:
>
> 1. the files exported by the FUSE filesystem should not be accessible
> by other users.

Not even by root. If an admin of a remote system runs a backup script, one
does not want private home computer files to go with that. The admin of the
remote system definitely does not want to make backups of my home files.

Heikki Orsila Barbie's law:
[email protected] "Math is hard, let's go shopping!"
http://www.iki.fi/shd

2005-04-25 11:47:09

by Jan Hudec

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On Mon, Apr 25, 2005 at 11:58:50 +0200, Miklos Szeredi wrote:
> > > However if you write me a script that reads my mind as to which server
> > > I want to mount with sshfs at which time, I give you all my respect.
> >
> > I can't write a script that reads your mind. But I sure can write
> > a script that finds out what you mounted in the other shells (with help
> > of a little wrapper around the mount command).
>
> How do you bind mount it from a different namespace? You _do_ need
> bind mount, since a new mount might require password input, etc...

Yes, I would need one thing from kernel. That one thing would be to
mount bind a directory handle, instead of path.

And if you wonder how I get the handle, that's what SCM_RIGHTS message
of unix-domain sockets is for.

-------------------------------------------------------------------------------
Jan 'Bulb' Hudec <[email protected]>



2005-04-25 15:38:39

by Bodo Eggert

[permalink] [raw]
Subject: Re: [PATCH] private mounts

Jan Hudec <[email protected]> wrote:
> On Mon, Apr 25, 2005 at 11:58:50 +0200, Miklos Szeredi wrote:

>> How do you bind mount it from a different namespace? You _do_ need
>> bind mount, since a new mount might require password input, etc...
>
> Yes, I would need one thing from kernel. That one thing would be to
> mount bind a directory handle, instead of path.
>
> And if you wonder how I get the handle, that's what SCM_RIGHTS message
> of unix-domain sockets is for.

You'll need something to get the FD from. What will that be if the mount
was done from a subshell of the midnight commander run in a screen session?

What about X sessions? Open a xterm, do the mount and then do what to get
the mount working for the programs run from the window manager?
Relogin? The xterm with the mount will be gone.
Use a daemon to keep an additional reference to the namespace? That's UGLY.

With attachable namespaces, the whole thing should be as simple as
(pseudocode)
mknamespace -p users/$UID # (like mkdir -p)
setnamespace users/$UID # (like cd)

Optionally, the namespaces and their private mounts might be scheduled to
be removed if the last user is gone, or they need to be persistent,
depending on the application (e.g. ssh used as rexec or shared mounts).
--
Remember, a retreating enemy is probably just falling back and regrouping.

2005-04-25 15:52:51

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] private mounts

Hi!

> > > > I believe the point is:
> > > >
> > > > 1. Person is logged from client Y to server X, and mounts something on
> > > > $HOME/mnt/private (that's on X).
> > > >
> > > > 2. On client Y, person does "scp X:mnt/private/secrets.txt ."
> > > > and wants it to work.
> > > >
> > > > The second operation is a separate login to the first.
> > >
> > > Solution?
> >
> > ... is the same as for the same question with "set of mounts" replaced
> > with "environment variables".
>
> Not quite.
>
> After changing environment variables in .profile, you can copy them to
> other shells using ". ~/.profile".
>
> There is no analogous mechanism to copy namespaces.

Actually, after you add right mount xyzzy /foo lines into .profile,
you can just . ~/.profile ;-).
Pavel

--
Boycott Kodak -- for their patent abuse against Java.

2005-04-25 16:24:59

by Ram Pai

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On Mon, 2005-04-25 at 08:17, Bodo Eggert wrote:
> Jan Hudec <[email protected]> wrote:
> > On Mon, Apr 25, 2005 at 11:58:50 +0200, Miklos Szeredi wrote:
>
> >> How do you bind mount it from a different namespace? You _do_ need
> >> bind mount, since a new mount might require password input, etc...
> >
> > Yes, I would need one thing from kernel. That one thing would be to
> > mount bind a directory handle, instead of path.
> >
> > And if you wonder how I get the handle, that's what SCM_RIGHTS message
> > of unix-domain sockets is for.
>
> You'll need something to get the FD from. What will that be if the mount
> was done from a subshell of the midnight commander run in a screen session?
>
> What about X sessions? Open a xterm, do the mount and then do what to get
> the mount working for the programs run from the window manager?
> Relogin? The xterm with the mount will be gone.
> Use a daemon to keep an additional reference to the namespace? That's UGLY.
>
> With attachable namespaces, the whole thing should be as simple as
> (pseudocode)
> mknamespace -p users/$UID # (like mkdir -p)
> setnamespace users/$UID # (like cd)
>
> Optionally, the namespaces and their private mounts might be scheduled to
> be removed if the last user is gone, or they need to be persistent,
> depending on the application (e.g. ssh used as rexec or shared mounts).

Agreed.

I guess for this thread to make any progress, we need a set of coherent
requirements from FUSE team.

RP


2005-04-25 16:40:16

by Tim Hockin

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On Mon, Apr 25, 2005 at 11:48:04AM +0200, Olivier Galibert wrote:
> Is there a possibility for a process to change its namespace to
> another existing one? That would be needed to have a per-user
> namespace you go to from rc files or pam.

I haven't looked at this in about a year, but as of a year ago, no.
Namespaces are/were second-class objects that exist only as referenced by
tasks. I played with implementing a newns PAM module. It worked, but was
full of holes. I started writing a paper on it, but never got around to
finishing it, for various reasons.

Tim

2005-04-25 19:01:08

by Bryan Henderson

[permalink] [raw]
Subject: Re: [PATCH] private mounts

>mknamespace -p users/$UID # (like mkdir -p)
>setnamespace users/$UID # (like cd)
^^^^^^^^

You realize that 'cd' is a shell command, and has to be, I hope. That
little fact has thrown a wrench into many of the ideas in this thread.

2005-04-25 19:03:43

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH] private mounts

Bodo Eggert <[email protected]> wrote:
> Use a daemon to keep an additional reference to the namespace? That's UGLY.

Why not? There's a daemon already: the FUSE daemon. It's an ideal
candidate for passing out the information about the mounts it's
involved in.

-- Jamie

2005-04-25 19:10:07

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH] private mounts

Pavel Machek wrote:
> > > ... is the same as for the same question with "set of mounts" replaced
> > > with "environment variables".
> >
> > Not quite.
> >
> > After changing environment variables in .profile, you can copy them to
> > other shells using ". ~/.profile".
> >
> > There is no analogous mechanism to copy namespaces.
>
> Actually, after you add right mount xyzzy /foo lines into .profile,
> you can just . ~/.profile ;-).

Is there a mount command that can do that? We're talking about
private mounts - invisible to other namespaces, which includes the
other shells.

If there was a /proc/NNN/namespace, that would do the trick :)

-- Jamie

2005-04-25 19:18:02

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH] private mounts

Ram wrote:
> I guess for this thread to make any progress, we need a set of coherent
> requirements from FUSE team.

Yes. A list of use-cases from the FUSE team which would be nice
to use would be a good start. Then people who aren't so close to FUSE
can suggest alternative ways of doing those, until we whittle down to
the essential features that aren't already available in the kernel, if
any.

-- Jamie

2005-04-25 21:08:53

by Bryan Henderson

[permalink] [raw]
Subject: Re: [PATCH] private mounts

>> No. You can't set "mount environment" in scp.
>
>Of course you can. It does execute the obvious set of rc files.

Incidentally, there is no obvious set of files. The only relevant one
that gets executed does so by accident because of a side effect of an ugly
hack.

Jamie pointed out that such files wouldn't really help anyway, because
there isn't a shell command that can affect the mounts seen by the copy
server process it forks. And others have noted that some such remote
processes don't run shells at all. But in case anyone is thinking of
shell rc files as an architectural solution to the scp problem, let me
explain shell rc files, in particular Bash's:

.profile runs when a login shell starts, which is supposed to be when you
start a work session with the computer. You put stuff in there like an
announcement of mail, displaying reminders, reading news, etc.

/etc/profile is the same, but for everyone.

.bashrc runs when an interactive shell starts that isn't a login shell,
which is supposed to be as in opening a new shell window. You put stuff
in there to customize your interactive experience -- key binding, screen
colors, aliases, and the like.

Some builds of Bash have a system level version of this as
/etc/bash.bashrc.

All of these are for shells that are being used by a human. They can
really mess up a "user" that is a machine. The most important case of a
non-human user is a shell script.

The rc file named by the BASH_ENV environment variable runs for every
shell, interactive or not. But this is hard to use for personalization
because you need a place to personalize BASH_ENV. It's also hard to use
for anything else, because so many programs (including some Ssh daemons)
cut off environment variable inheritance.

Now for the ugly hack: An interactive shell is normally one whose
Standard Input is a terminal. But when rsh came about, Standard Input was
a socket, even though the shell session was quite interactive. So Bash
contains code that looks at several conditions consistent with an rsh
session and if it determines that it is probably being run as the backend
of an rsh session, it treats the shell as interactive. Openssh 'ssh'
doesn't need this hack, because Sshd uses a pseudo-terminal instead of a
socket as the shell's Standard Input. But Openssh's 'scp' falls into the
trap and gets taken as an interactive human user of the shell. So .bashrc
runs. Many are the scp sessions I've tortured with my .bashrc, and spent
hours debugging. (I finally removed the hack from Bash and regained
sanity).

A design for user-specific namespaces that relies on this particular hack
would not be clean.

On the other hand, it is possible to customize any scp backend session
just by making a personal wrapper for the scp backend program. The
wrapper can do the setup -- either directly or by running an "scprc" file.
With Openssh, you can choose the backend program in various places.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems

2005-04-26 08:59:08

by Jan Hudec

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On Mon, Apr 25, 2005 at 12:02:43 -0700, Bryan Henderson wrote:
> >mknamespace -p users/$UID # (like mkdir -p)
> >setnamespace users/$UID # (like cd)
> ^^^^^^^^
>
> You realize that 'cd' is a shell command, and has to be, I hope. That
> little fact has thrown a wrench into many of the ideas in this thread.

You don't want to have such a command, really! What you want is a PAM
module that looks whether there is already a session for that user and
switches to its namespace, if there is. Remember that it's a namespace:
it can first be created, then attached and then populated with mounts.

-------------------------------------------------------------------------------
Jan 'Bulb' Hudec <[email protected]>



2005-04-26 09:06:38

by Jan Hudec

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On Mon, Apr 25, 2005 at 17:17:35 +0200, Bodo Eggert <[email protected]> wrote:
> Jan Hudec <[email protected]> wrote:
> > On Mon, Apr 25, 2005 at 11:58:50 +0200, Miklos Szeredi wrote:
> Use a daemon to keep an additional reference to the namespace? That's UGLY.

It's as ugly as ssh-agent.

But I have to say that I really like attachable namespaces better than
descriptor mount bind. It's a hell of a lot simpler to work with.

> With attachable namespaces, the whole thing should be as simple as
> (pseudocode)
> mknamespace -p users/$UID # (like mkdir -p)
> setnamespace users/$UID # (like cd)

Well, yes and no. We should probably just have a syscall
int join_namespace(pid_t pid)
which would join the namespace process pid uses. And then have a PAM
session module, that would attach the namespace of the first user's
session (creating new namespace if this is the first session).
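
For illustration only, a sketch of what join_namespace() could look like in
userspace. It assumes the setns(2) call and the /proc/<pid>/ns/mnt handle
found in later Linux kernels; neither existed when this thread was written,
so this is not the interface being proposed here. The helper mnt_ns_path()
is a hypothetical name, and the call needs sufficient privilege to succeed.

```c
#define _GNU_SOURCE             /* for setns() and CLONE_NEWNS */
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Build "/proc/<pid>/ns/mnt" in buf. Returns 0, or -1 if buf is too small. */
int mnt_ns_path(char *buf, size_t len, pid_t pid)
{
    int n = snprintf(buf, len, "/proc/%ld/ns/mnt", (long)pid);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Join the mount namespace used by process pid.
 * Returns 0 on success, -1 with errno set on failure.
 */
int join_namespace(pid_t pid)
{
    char path[64];
    int fd, ret, saved;

    if (mnt_ns_path(path, sizeof(path), pid) < 0) {
        errno = ENAMETOOLONG;
        return -1;
    }

    /* The open file is a handle on the target's mount namespace. */
    fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    ret = setns(fd, CLONE_NEWNS);

    saved = errno;              /* close() must not clobber setns()'s errno */
    close(fd);
    errno = saved;
    return ret;
}
```

A PAM session module along the lines described above would call this with
the pid of the user's first session, and fall back to creating a fresh
namespace when no such session exists.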

> Optionally, the namespaces and their private mounts might be scheduled to
> be removed if the last user is gone, or they need to be persistent,
> depending on the application (e.g. ssh used as rexec or shared mounts).

I'd have them garbage-collected. When last process using them goes away,
so does the namespace. If that's not what you want, start a persistent
process for the user.

-------------------------------------------------------------------------------
Jan 'Bulb' Hudec <[email protected]>



2005-04-26 09:17:59

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] private mounts

> > I guess for this thread to make any progress, we need a set of coherent
> > requirements from FUSE team.
>
> Yes. A list of use-cases from the FUSE team which would be nice
> to use would be a good start.

What kind of usecases would you like to see? FUSE filesystems are
used just like any other filesystem, so listing "copy file from x to
z" and suchlike seems pointless ;)

> Then people who aren't so close to FUSE can suggest alternative ways
> of doing those, until we whittle down to the essential features that
> aren't already available in the kernel, if any.

The most important difference between ordinary filesystems and FUSE is
the fact that the filesystem data/metadata is provided by a userspace
process run with the privileges of the mount "owner" instead of the
kernel, or some remote entity usually running with elevated
privileges.

The security implication of this is that a non-privileged user must
not be able to use this capability to compromise the system. Obvious
requirements arising from this are:

- mount owner should not be able to get elevated privileges with the
help of the mounted filesystem

- mount owner should not be able to induce undesired behavior in
other users' or the super user's processes

- mount owner should not get illegitimate access to information from
other users' and the super user's processes

These are currently ensured with the following constraints:

1) mount is only allowed to directory or file which the mount owner
can modify without limitation (write access + no sticky bit for
directories)

2) nosuid,nodev mount options are forced

3) any process running with fsuid different from the owner is denied
all access to the filesystem

1) and 2) are ensured by the "fusermount" mount utility which is a
setuid root application doing the actual mount operation.

3) is ensured by a check in the permission() method in kernel

I started thinking about doing 3) in a different way because Christoph
H. made a big deal out of it, saying that FUSE is unacceptable into
mainline in this form.

The suggested use of private namespaces would be OK, but in their
current form have many limitations that make their use impractical (as
discussed in this thread).

Suggested improvements that would address these limitations:

- implement shared subtrees

- allow a process to join an existing namespace (make namespaces
first-class objects)

- implement the namespace creation/joining in a PAM module

With all that in place the check of owner against current->fsuid may
be removed from the FUSE kernel module, without compromising the
security requirements.

Suid programs still raise interesting questions, since they get access even
to the private namespace causing some information leak (exact
order/timing of filesystem operations performed), giving some
ptrace-like capabilities to unprivileged users. BTW this problem is
not strictly limited to the namespace approach, since suid programs
setting fsuid and accessing users' files will succeed with the current
approach too.

Is this information enough for further progress to be made?

Thanks for the help,
Miklos

2005-04-26 09:20:52

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On Tue, Apr 26, 2005 at 11:16:18AM +0200, Miklos Szeredi wrote:
> The most important difference between ordinary filesystems and FUSE is
> the fact that the filesystem data/metadata is provided by a userspace
> process run with the privileges of the mount "owner" instead of the
> kernel, or some remote entity usually running with elevated
> privileges.

define "mount owner". Right now mount requires CAP_SYS_ADMIN which means
fairly privileged. Having some kind of user mounts would be nice, but
needs a fair amount of auditing first.

2005-04-26 09:22:39

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] private mounts


> > The most important difference between ordinary filesystems and FUSE is
> > the fact that the filesystem data/metadata is provided by a userspace
> > process run with the privileges of the mount "owner" instead of the
> > kernel, or some remote entity usually running with elevated
> > privileges.
>
> define "mount owner". Right now mount requires CAP_SYS_ADMIN which means
> fairly privileged.

FUSE uses a suid root helper (as explained below). Please read the
whole mail.

Thanks,
Miklos

2005-04-26 09:29:48

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] private mounts

Hi!

> > > > ... is the same as for the same question with "set of mounts" replaced
> > > > with "environment variables".
> > >
> > > Not quite.
> > >
> > > After changing environment variables in .profile, you can copy them to
> > > other shells using ". ~/.profile".
> > >
> > > There is no analogous mechanism to copy namespaces.
> >
> > Actually, after you add right mount xyzzy /foo lines into .profile,
> > you can just . ~/.profile ;-).
>
> Is there a mount command that can do that? We're talking about
> private mounts - invisible to other namespaces, which includes the
> other shells.
>
> If there was a /proc/NNN/namespace, that would do the trick :)

Sounds like the solution, then. I do not think Al Viro is going to
kill you for /proc/NNN/namespace...
Pavel
--
Boycott Kodak -- for their patent abuse against Java.

2005-04-26 09:30:19

by Martin Mares

[permalink] [raw]
Subject: Re: [PATCH] private mounts

Hello!

> - mount owner should not get illegitimate access to information from
> other users' and the super user's processes
[...]
> 3) any process running with fsuid different from the owner is denied
> all access to the filesystem

This smells. Denying access to root doesn't make any sense. I agree
that it could help in some corner cases (like avoiding automated backup
from backing up user filesystems), but in the end it's going to be
an annoyance.

Per-user namespaces (set up by PAM) look as a very reasonable solution.

Have a nice fortnight
--
Martin `MJ' Mares <[email protected]> http://atrey.karlin.mff.cuni.cz/~mj/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
"A semicolon. Another line ends in the dance of camel." -- Kabir Ahuja

2005-04-26 09:36:37

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On Tue, Apr 26, 2005 at 11:22:03AM +0200, Miklos Szeredi wrote:
> > define "mount owner". Right now mount requires CAP_SYS_ADMIN which means
> > fairly privileged.
>
> FUSE uses a suid root helper (as explained below). Please read the
> whole mail.

In that case you're totally out of luck. This is not a setup we want to
account for.

2005-04-26 09:44:44

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] private mounts

> > > define "mount owner". Right now mount requires CAP_SYS_ADMIN which means
> > > fairly privileged.
> >
> > FUSE uses a suid root helper (as explained below). Please read the
> > whole mail.
>
> In that case you're totally out of luck. This is not a setup we want to
> account for.

Christoph, you are being thickheaded, and this is not the first time.
Please go away.

Thanks,
Miklos

2005-04-26 09:52:40

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On Tue, Apr 26, 2005 at 11:41:00AM +0200, Miklos Szeredi wrote:
> Christoph, you are being thickheaded, and this is not the first time.
> Please go away.

Please stop the flaming. You're adding the equivalent of "I've added
a suid shell, please make sure it can only affect the caller's files".

Do you really think we want to add such crap?

You're really falling into the Hans Reiser trap - if you just wanted to
add a simple userland filesystem you'd be done by now, but you're trying
to funnel new semantics in through it. Which is by far not as easy as
adding a simple file system driver and needs a lot more thought.

2005-04-26 09:57:06

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On Tue, Apr 26, 2005 at 11:53:01AM +0200, Miklos Szeredi wrote:
> No. FUSE is not a suid shell, and it's definitely not crap. You are
> being impolite and without a reason. So don't be surprised if you get
> burned.

Your model for user mounts is total crap. The rest of the fuse kernel
code is quite nice (1).

> > You're really falling into the Hans Reiser trap - if you just wanted to
> > add a simple userland filesystem you'd be done by now, but you're trying
> > to funnel new semantics in through it. Which is by far not as easy as
> > adding a simple file system driver and needs a lot more thought.
>
> I've started FUSE in 2001. Did you think it was easy getting this far?

And that matters how exactly?


(1) except the rather fishy get_user_pages variant. but I'm not expert
enough there to comment more

2005-04-26 10:01:43

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] private mounts

Miklos Szeredi <[email protected]> wrote:
>
> > > > define "mount owner". Right now mount requires CAP_SYS_ADMIN which means
> > > > fairly privileged.
> > >
> > > FUSE uses a suid root helper (as explained below). Please read the
> > > whole mail.
> >
> > In that case you're totally out of luck. This is not a setup we want to
> > account for.
>
> Christoph, you are being thickheaded,

Not as thick as mine! Could someone please explain in small words what's
wrong with an suid mount helper?

2005-04-26 10:01:01

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] private mounts

> > Christoph, you are being thickheaded, and this is not the first time.
> > Please go away.
>
> Please stop the flaming. You're adding the equivalent of "I've added
> a suid shell, please make sure it can only affect the caller's files".
>
> Do you really think we want to add such crap?

Do you think I'd want such crap?

No. FUSE is not a suid shell, and it's definitely not crap. You are
being impolite and without a reason. So don't be surprised if you get
burned.

> You're really falling into the Hans Reiser trap - if you just wanted to
> add a simple userland filesystem you'd be done by now, but you're trying
> to funnel new semantics in through it. Which is by far not as easy as
> adding a simple file system driver and needs a lot more thought.

I've started FUSE in 2001. Did you think it was easy getting this far?

Come on!

Miklos

2005-04-26 10:05:15

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On Tue, Apr 26, 2005 at 10:56:08AM +0100, Christoph Hellwig wrote:
> On Tue, Apr 26, 2005 at 11:53:01AM +0200, Miklos Szeredi wrote:
> > No. FUSE is not a suid shell, and it's definitely not crap. You are
> > being impolite and without a reason. So don't be surprised if you get
> > burned.
>
> Your model for user mounts is total crap. The rest of the fuse kernel
> code is quite nice (1).

And btw, in case you think this flaming here is going to bring you forward
in any way: it's not. User mounts and namespace-related changes don't
belong into a low-level filesystem driver no matter what. Whatever way
to handle user-private mount we're going to agree on it's not going to be
inside the fuse code. So you're really better off stopping the flaming,
stripping out those bits and help writing down a scheme that you'd want
to see in the VFS so we can see whether it makes sense and is implementable.

2005-04-26 10:08:51

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On Tue, Apr 26, 2005 at 03:00:10AM -0700, Andrew Morton wrote:
> Not as thick as mine! Could someone please explain in small words what's
> wrong with an suid mount helper?

Nothing per se. What makes it bad is the context of a userland filesystem
where the actual filesystem operations in the mounted filesystem happen
in the context of a non-privileged user.

2005-04-26 10:09:27

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] private mounts

> > No. FUSE is not a suid shell, and it's definitely not crap. You are
> > being impolite and without a reason. So don't be surprised if you get
> > burned.
>
> Your model for user mounts is total crap. The rest of the fuse kernel
> code is quite nice (1).

Oh, a compliment from Christoph H. himself, how flattering :)

And for the first part, please _explain_ why you think it's crap.
Calling something crap, will surely not make it less crap, will it?

Miklos

2005-04-26 10:17:14

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] private mounts

Christoph Hellwig <[email protected]> wrote:
>
> On Tue, Apr 26, 2005 at 03:00:10AM -0700, Andrew Morton wrote:
> > Not as thick as mine! Could someone please explain in small words what's
> > wrong with an suid mount helper?
>
> Nothing per se. What makes it bad is the context of a userland filesystem
> where the actual filesystem operations in the mounted filesystem happen
> in the context of a non-privileged user.

That's one of the major points of FUSE, isn't it? So that unprivileged
users can do interesting things.

Or are you saying that that's a desirable objective, but it should be
implemented differently?

2005-04-26 10:16:54

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On Tue, Apr 26, 2005 at 12:01:17PM +0200, Miklos Szeredi wrote:
> And for the first part, please _explain_ why you think it's crap.

Problem 1:

- you're mounting things into the global namespace, but expect it only
be visible to a certain subset of processes. these processes are also
not specified by a traditional unix session / process group / etc but
against all the process attributes we have based on the uid

Problem 2, which is related:

- in fuse you're re-routing filesystem requests to userspace, so far so good
- mount is currently a privileged operation, and expects a privileged
filesystem implementation, not an ordinary user
- to bypass that you have a suid mount wrapper
- now you need various hacks to make sure this can't be used by other users

in short you are hacking around the namespace management which sits above
the filesystems in a rather broken way.

2005-04-26 10:41:39

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On Tue, Apr 26, 2005 at 03:14:14AM -0700, Andrew Morton wrote:
> That's one of the major points of FUSE, isn't it? So that unprivileged
> users can do interesting things.
>
> Or are you saying that that's a desirable objective, but it should be
> implemented differently?

It's a desirable objective, but the implementation is wrong. If we have
a user mount that must be known to the VFS so that the VFS can enforce
the right restrictions instead of leaving various crude hacks in lowlevel
filesystem drivers. Especially as fuse isn't the only filesystem for which
this makes sense - smbfs or v9fs want the same features as well.

2005-04-26 11:48:00

by Bodo Eggert

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On Tue, 26 Apr 2005, Jan Hudec wrote:
> On Mon, Apr 25, 2005 at 17:17:35 +0200, Bodo Eggert <[email protected]> wrote:

> > With attachable namespaces, the whole thing should be as simple as
> > (pseudocode)
> > mknamespace -p users/$UID # (like mkdir -p)
> > setnamespace users/$UID # (like cd)
>
> Well, yes and no. We should probably just have a syscall
> int join_namespace(pid_t pid)
> which would join the namespace process pid uses. And then have a PAM
> session module, that would attach the namespace of the first user's
> session (creating new namespace if this is the first session).

This will help for the fuse case, but since namespaces are hierarchical
(as I understand them), you can as well make the structure visible and
thereby turn a feature for one user into a feature for general use.
--
Programming is an art form that fights back.

2005-04-26 11:49:05

by Bodo Eggert

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On Mon, 25 Apr 2005, Bryan Henderson wrote:

> >mknamespace -p users/$UID # (like mkdir -p)
> >setnamespace users/$UID # (like cd)
> ^^^^^^^^
>
> You realize that 'cd' is a shell command, and has to be, I hope. That
> little fact has thrown a wrench into many of the ideas in this thread.

I suppose it will be called by the login process or by wrappers like
'nice'.
--
Never stand when you can sit, never sit when you can lie down, never stay
awake when you can sleep.

2005-04-26 12:09:01

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] private mounts

> Problem 1:
>
> - you're mounting things into the global namespace, but expect them to
> be visible only to a certain subset of processes. These processes are also
> not specified by a traditional unix session / process group / etc. but
> against all the process attributes we have based on the uid

Yes, so? You are harping on this without ever telling a concrete
technical problem with it.

Yes, root will assume it's a directory that can be read. And it will
get -EACCES. Is there a problem with this? Is there a problem with
NFS exports with root_squash (which is the default btw)? See the
similarity?

> Problem 2, which is related:
>
> - in fuse you're re-routing filesystem requests to userspace, so far so good
> - mount is currently a privileged operation, and expects a privileged
> filesystem implementation, not an ordinary user
> - to bypass that you have a suid mount wrapper
> - now you need various hacks to make sure this can't be used by other users

Why is this a hack? What is the problem with it?

NFSv3 implements its own permission checking based on credentials
returned by the server. If the client doesn't support POSIX ACLs and the
server does, the permission bits _will_ be out of sync with the actual
permission you get on the file, and you will never know why. Is it a
problem? Can it be avoided?

How would an sshfs client create permission bits for files which match
the exact permissions that the user has on the server? The client
doesn't even know the uid of the user on the remote system, let alone
what groups it belongs to.

Think about these problems, and if you have a _solution_ that you
think is better, then pray tell me. If you don't have a solution, but
just want to bad-mouth the current FUSE implementation, I'm not
interested.

Miklos

2005-04-26 13:07:54

by Eric Van Hensbergen

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On 4/26/05, Christoph Hellwig <[email protected]> wrote:
> On Tue, Apr 26, 2005 at 03:14:14AM -0700, Andrew Morton wrote:
> > That's one of the major points of FUSE, isn't it? So that unprivileged
> > users can do interesting things.
> >
> > Or are you saying that that's a desirable objective, but it should be
> > implemented differently?
>
> It's a desirable objective, but the implementation is wrong. If we have
> a user mount, that must be known to the VFS so that the VFS can enforce
> the right restrictions instead of leaving various crude hacks in low-level
> filesystem drivers. Especially as fuse isn't the only filesystem for which
> this makes sense - smbfs and v9fs want the same features as well.
>

As far as I can see, there are (at least) two distinct discussions
going on. For the sake of clarity I'd like to get the security
concerns/requirements laid out for each:

TYPE A) general purpose user-mountable file systems

This seems to be the feature that would be useful to many of the
different file systems (fuse, v9fs, smbfs, etc). What security
restrictions need to be in place if we were to take the CAP_SYS_ADMIN
check out of sys_mount? From what I've gleaned from the discussion so
far they would include:
1) Restricting where the user could mount
- the suggestion so far is that a user could only mount/bind to a
directory he could write to and without the sticky bit
2) Restricting what the user could mount
- mounting arbitrary file systems could expose a vulnerability
3) Restricting how the user can mount (nosuid, nogid enforced)
4) Restricting user mount visibility (in private namespaces) so as not
to pollute the global namespace
5) Restricting how much the user can mount (restricting number of
mounts and/or number of namespaces with a ulimit)

(1), (3), (4), and (5) seem straightforward to me. (2) seems a little
less-so. I understand a little bit of the vulnerability (specifically
when mounting physical devices with file systems that may or may not
be tolerant to malicious formats), but I hate restricting the user. I
guess perhaps we could have something in the file system type
information which describes whether or not it should be user
mountable.

Implementation wise, (3), (4), and (5) seem pretty straightforward to
implement in the kernel. (1) and (2) wouldn't be that bad if the
policy were kept simple, but any sort of an advanced policy would seem
to require a user-space application to assist -- but that seems to
require an suid mount app. Is it better to come up with a simple
universal policy and implement it within VFS, or allow for a more
complex policy that would require user-space application assistance?
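
To make the "simple universal policy" option concrete, here is a minimal
user-space sketch of rules (1), (2), (3) and (5) from the list above as a
single check. It is a model only, not kernel code: the name
may_user_mount, the file system whitelist, and the cap of 16 mounts are
all assumptions made for illustration, and rule (4) is left out because it
needs namespace support.

```python
import os
import stat

# All three values are illustrative assumptions, not settled policy.
USER_MOUNTABLE = {"fuse", "v9fs", "smbfs"}  # rule (2): "user mountable" fs types
FORCED_FLAGS = {"nosuid"}                   # rule (3): flags the user cannot drop
MAX_USER_MOUNTS = 16                        # rule (5): ulimit-style cap


def may_user_mount(mountpoint, fstype, flags, mounts_owned):
    """Return (allowed, reason, effective_flags) for the calling user."""
    effective = set(flags) | FORCED_FLAGS
    st = os.stat(mountpoint)
    if not stat.S_ISDIR(st.st_mode):
        return False, "mountpoint is not a directory", effective
    # Rule (1): the caller must be able to write to the directory,
    # and the directory must not have the sticky bit set.
    if st.st_mode & stat.S_ISVTX:
        return False, "mountpoint has the sticky bit set", effective
    if not os.access(mountpoint, os.W_OK):
        return False, "mountpoint is not writable by the caller", effective
    if fstype not in USER_MOUNTABLE:
        return False, "file system type is not user mountable", effective
    if mounts_owned >= MAX_USER_MOUNTS:
        return False, "per-user mount limit reached", effective
    return True, "ok", effective
```

Anything beyond a fixed table like this (per-user whitelists, per-directory
rules) is where the user-space helper question above starts to matter.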

Have I missed something from the security angle?

TYPE B) per-user namespace / attachable namespace / etc.

This argument seems to come mostly from the FUSE camp, but the goal
seems noble enough: given enforcement of requiring private namespaces
for user mounts in (A), how can we create a user-environment similar
to what the user would expect without private mounts (ie. a global
namespace per user).

The main security concern here has been stated in detail before, so
I'll only summarize: only the user who mounted the file system should
be granted access to it. Private namespaces in (A) seem to grant that
security, however, the (B) requirement of a global user namespace
invalidates that as a new login (or su) would attach to the private
namespace (and if I'm not mistaken a root su'ing to the user would
also get around the currently implemented permission scheme). I don't
think anyone has come up with a good solution here.

My hack at a solution for this (even though I don't see this as a big
requirement):
Proper namespace inheritance (meaning changes to the parent are
propagated to the child as references not copies -- I believe the
shared subtree RFC covers the right semantics) along with establishing
a new private namespace for each login session. As far as accessing
already mounted FUSE file systems between different login sessions --
I see this as a really obscure requirement that complicates things a
great deal, however -- if you split the concept of "srv points" from
file system mounts and remount the file system (perhaps automatically
as part of initiating the session) for every new login -- then you can
revalidate security at each of these mounts. This sounds somewhat
extreme, but with a proper keyring style authentication management
system it could be made fairly transparent to the user. I believe
others in this thread have proposed something similar, I'm just
weighing in that if this must be a requirement (and I don't think it
should), this is how I think it should be done.

In general (TYPE A) and (TYPE B) are related but separate
implementation efforts, I think we should focus on getting (TYPE A)
right (due to the system security implications) and evolve a solution
to (TYPE B) based on use.

-eric

2005-04-26 13:24:58

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] private mounts

Hi!

> > Christoph, you are being thickheaded, and this is not the first time.
> > Please go away.
>
> Please stop the flaming. You're adding the equivalent of "I've added
...
> You're really falling into the Hans Reiser trap - if you just wanted to
> add a simple userland filesystem you'd be done by now, but you're trying
> to funnel new semantics in through it. Which is by far not as easy as
> adding a simple file system driver and needs a lot more thought.

Could we get root-only fuse in, please? It is useful on its own...

In many cases you can just run one fused for all the users and handle
privileges inside fused...
Pavel
--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms

2005-04-26 13:29:36

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] private mounts

> Could we get root-only fuse in, please?

chmod u-s /usr/bin/fusermount

Miklos

2005-04-26 13:47:03

by Ville Herva

[permalink] [raw]
Subject: filesystem transactions API

Apparently, Windows Longhorn will include something called "transactional
NTFS". It's explained pretty well in

http://blogs.msdn.com/because_we_can/

Basically, a process can create a fs transaction, and all fs changes made
between start of the transaction and commit are atomic - meaning nothing
is visible until commit, and if commit fails, everything is rolled back.

Sounds useful... Although there are no service pack installs that could fail
in Linux, the same thing could be useful in rpm, yum, almost anything.

What do you think?



-- v --

[email protected]

2005-04-26 14:07:45

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH] private mounts

Pavel Machek wrote:
> > > Actually, after you add right mount xyzzy /foo lines into .profile,
> > > you can just . ~/.profile ;-).
> >
> > Is there a mount command that can do that? We're talking about
> > private mounts - invisible to other namespaces, which includes the
> > other shells.
> >
> > If there was a /proc/NNN/namespace, that would do the trick :)
>
> Sounds like the solution, then. I do not think Al Viro is going to
> kill you for /proc/NNN/namespace...

Looking closer, I think we already have it.

It's called /proc/NNN/root.

Does chroot into /proc/NNN/root cause the chroot'ing process to adopt
the namespace of NNN? Looking at the code, I think it does.

Furthermore, I think a daemon can acquire file descriptors for
multiple namespaces already, by open("/") and passing descriptors
between processes. And the chroot can be done using /proc/self/fd/N
after receiving a descriptor.

This is because file descriptors, and current->fs->pwd and
current->fs->root, record the vfsmnt as well as the dentry that they
opened.

So no new system calls are needed. A daemon to hand out per-user
namespaces (or any other policy) can be written using existing
kernels, and those namespaces can be joined using chroot.

That's the theory anyway. It's always possible I misread the code (as
I don't use namespaces and don't have tools handy to try them).
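
The mechanics described above are small enough to write down. The
following is a sketch only: it shows the calls involved, not that they have
the namespace-switching effect, which is exactly the untested part of the
theory. chroot() needs CAP_SYS_CHROOT, so adopt_root() fails for ordinary
users. All function names are invented here.

```python
import os


def proc_root_path(pid):
    """Path through which process `pid` exposes its root directory."""
    return f"/proc/{pid}/root"


def open_root_fd():
    """Open the caller's own root directory.

    The descriptor can then be handed to another process, for example
    as SCM_RIGHTS ancillary data over a unix socket.
    """
    return os.open("/", os.O_RDONLY | os.O_DIRECTORY)


def fd_path(fd):
    """Path that names an already open descriptor of this process."""
    return f"/proc/self/fd/{fd}"


def adopt_root(path):
    """chroot into `path` and move the working directory inside it.

    Whether this also moves the caller into the namespace that `path`
    belongs to is the assumption under discussion, not a known fact.
    """
    os.chroot(path)
    os.chdir("/")
```

A daemon would call open_root_fd() inside each namespace it manages, and a
client would call adopt_root(fd_path(fd)) on a received descriptor, or
adopt_root(proc_root_path(pid)) directly.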

-- Jamie

2005-04-26 14:14:41

by Jamie Lokier

[permalink] [raw]
Subject: Re: filesystem transactions API

Ville Herva wrote:
> Apparently, Windows Longhorn will include something called "transactional
> NTFS". It's explained pretty well in
>
> http://blogs.msdn.com/because_we_can/
>
> Basically, a process can create a fs transaction, and all fs changes made
> between start of the transaction and commit are atomic - meaning nothing
> is visible until commit, and if commit fails, everything is rolled back.
>
> Sounds useful... Although there are no service pack installs that could fail
> in Linux, the same thing could be useful in rpm, yum, almost anything.
>
> What do you think?

I think I've wanted something like that for _years_ in unix.

It's an old, old idea, and I've often wondered why we haven't implemented it.

-- Jamie

2005-04-26 14:15:42

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] private mounts

> TYPE A) general purpose user-mountable file systems
>
> This seems to be the feature that would be useful to many of the
> different file systems
> (fuse, v9fs, smbfs, etc). What security restrictions need to be in
> place if we were to take the CAP_SYS_ADMIN check out of sys_mount?
> From what I've gleaned from the discussion so far they would include:
> 1) Restricting where the user could mount
> - the suggestion so far is that a user could only mount/bind to a
> directory he could write to and without the sticky bit
> 2) Restricting what the user could mount
> - mounting arbitrary file systems could expose a vulnerability
> 3) Restricting how the user can mount (nosuid, nogid enforced)
> 4) Restricting user mount visibility (in private namespaces) so as not
> to pollute the global namespace
> 5) Restricting how much the user can mount (restricting number of
> mounts and/or number of namespaces with a ulimit)
>
> (1), (3), (4), and (5) seem straightforward to me. (2) seems a little
> less-so. I understand a little bit of the vulnerability (specifically
> when mounting physical devices with file systems that may or may not
> be tolerant to malicious formats), but I hate restricting the user. I
> guess perhaps we could have something in the file system type
> information which describes whether or not it should be user
> mountable.
>
> Implementation wise, (3), (4), and (5) seem pretty straightforward to
> implement in the kernel.

Umm, yes. Here's one I prepared earlier:

http://marc.theaimsgroup.com/?l=linux-fsdevel&m=107701207710525&w=2

I stopped maintaining it, because as you can see getting something
accepted to mainline is not as easy as it first sounds :)

> (1) and (2) wouldn't be that bad if the policy were kept simple,
> but any sort of an advanced policy would seem to require a
> user-space application to assist -- but that seems to require an
> suid mount app. Is it better to come up with a simple universal
> policy and implement it within VFS, or allow for a more complex
> policy that would require user-space application assistance?

Good question. I'm undecided on the suid/nosuid mount issue. It sure
would be nicer not to need a mount helper...

> Have I missed something from the security angle?
>
> TYPE B) per-user namespace / attachable namespace / etc.
>
> This argument seems to come mostly from the FUSE camp, but the goal
> seems noble enough: given enforcement of requiring private namespaces
> for user mounts in (A), how can we create a user-environment similar
> to what the user would expect without private mounts (ie. a global
> namespace per user).
>
> The main security concern here has been stated in detail before, so
> I'll only summarize: only the user who mounted the file system should
> be granted access to it. Private namespaces in (A) seem to grant that
> security, however, the (B) requirement of a global user namespace
> invalidates that as a new login (or su) would attach to the private
> namespace (and if I'm not mistaken a root su'ing to the user would
> also get around the currently implemented permission scheme).

Note: I'm mostly concerned with system security not user security.
Protecting user data from root is a treacherous thing to attempt.

> I don't think anyone has come up with a good solution here.
>
> My hack at a solution for this (even though I don't see this as a big
> requirement):
> Proper namespace inheritance (meaning changes to the parent are
> propagated to the child as references not copies -- I believe the
> shared subtree RFC covers the right semantics) along with establishing
> a new private namespace for each login session. As far as accessing
> already mounted FUSE file systems between different login sessions --
> I see this as a really obscure requirement

It's not that obscure. Scp, sftp each will be a separate session, and
you can't set up mounts within an scp.

> that complicates things a great deal, however -- if you split the
> concept of "srv points" from file system mounts and remount the file
> system (perhaps automatically as part of initiating the session) for
> every new login -- then you can revalidate security at each of these
> mounts.

Why would you have to revalidate? A simple bind mount would suffice.
However, joining another session's namespace makes more sense than
copying the mounts individually.

Miklos

2005-04-26 14:23:12

by Artem B. Bityutskiy

[permalink] [raw]
Subject: Re: filesystem transactions API

Jamie Lokier wrote:
> I think I've wanted something like that for _years_ in unix.
>
> It's an old, old idea, and I've often wondered why we haven't implemented it.
>

I thought it would be possible to implement this rather easily on top
of non-transactional FS (albeit I didn't try) and there is no need
to overcomplicate an FS. Just implement a specialized user-space
library and utilize it.


--
Best regards, Artem B. Bityuckiy
Oktet Labs (St. Petersburg), Software Engineer.
+78124286709 (office) +79112449030 (mobile)
E-mail: [email protected], web: http://www.oktetlabs.ru

2005-04-26 14:26:57

by Trond Myklebust

[permalink] [raw]
Subject: Re: filesystem transactions API

On Tue, 26.04.2005 at 16:46 (+0300), Ville Herva wrote:
> Apparently, Windows Longhorn will include something called "transactional
> NTFS". It's explained pretty well in
>
> http://blogs.msdn.com/because_we_can/
>
> Basically, a process can create a fs transaction, and all fs changes made
> between start of the transaction and commit are atomic - meaning nothing
> is visible until commit, and if commit fails, everything is rolled back.
>
> Sounds useful... Although there are no service pack installs that could fail
> in Linux, the same thing could be useful in rpm, yum, almost anything.
>
> What do you think?

NetApp have implemented something similar in their DAFS filesystem
called "rollback locks" (or autorecover locks).

http://www.watersprings.org/pub/id/draft-wittle-dafs-00.txt

Very useful for database apps etc.

Cheers,
Trond
--
Trond Myklebust <[email protected]>

2005-04-26 14:33:06

by Jamie Lokier

[permalink] [raw]
Subject: Re: filesystem transactions API

Artem B. Bityuckiy wrote:
> Jamie Lokier wrote:
> >I think I've wanted something like that for _years_ in unix.
> >
> >It's an old, old idea, and I've often wondered why we haven't implemented
> >it.
> >
>
> I thought it would be possible to implement this rather easily on top
> of non-transactional FS (albeit I didn't try) and there is no need
> to overcomplicate an FS. Just implement a specialized user-space
> library and utilize it.

No. A transaction means that _all_ processes will see the whole
transaction or not.

It does _not_ mean that only a subset of programs, which happen to
link with a particular user-space library, will see it or not.

For example, you can use transactions for distro package management: a
whole update of a package would be a single transaction, so that at no
time does any program see an inconsistent set of files. See why
_every_ process in the system must have the same view?

[ If you meant that you can implement it with a user-space library
that every process in the system links to, that's true. But it would
rather miss the point of having filesystems in the kernel at all :) ]

-- Jamie

2005-04-26 14:46:07

by Artem B. Bityutskiy

[permalink] [raw]
Subject: Re: filesystem transactions API

Jamie Lokier wrote:
> Artem B. Bityuckiy wrote:
>
> No. A transaction means that _all_ processes will see the whole
> transaction or not.
>
> It does _not_ mean that only a subset of programs, which happen to
> link with a particular user-space library, will see it or not.
>
> For example, you can use transactions for distro package management: a
> whole update of a package would be a single transaction, so that at no
> time does any program see an inconsistent set of files. See why
> _every_ process in the system must have the same view?
>
> [ If you meant that you can implement it with a user-space library
> that every process in the system links to, that's true. But it would
> rather miss the point of having filesystems in the kernel at all :) ]
>
Hmm, so the whole point of implementing transactions in kernel space is
to do the transactions in a way that nobody can see any intermediate
inconsistent state?


--
Best regards, Artem B. Bityuckiy
Oktet Labs (St. Petersburg), Software Engineer.
+78124286709 (office) +79112449030 (mobile)
E-mail: [email protected], web: http://www.oktetlabs.ru

2005-04-26 15:01:49

by Eric Van Hensbergen

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On 4/26/05, Miklos Szeredi <[email protected]> wrote:
> > that complicates things a great deal, however -- if you split the
> > concept of "srv points" from file system mounts and remount the file
> > system (perhaps automatically as part of initiating the session) for
> > every new login -- then you can revalidate security at each of these
> > mounts.
>
> Why would you have to revalidate? A simple bind mount would suffice.
> However, joining another session's namespace makes more sense than
> copying the mounts individually.
>

Well, the forced revalidation was an attempt to protect "user-data"
from root, which, as you pointed out in your reply, is a somewhat
sketchy thing. It may also be useful if you wish to share a
filesystem/namespace with a subset of users with a permissions model
outside of the normal user/groups model (which the user doesn't really
have any control over).

Anyways, just an additional idea for consideration -- as I said, I
don't really feel a strong need for this, so perhaps it's best
forgotten for now.

-eric

2005-04-26 15:02:03

by John Stoffel

[permalink] [raw]
Subject: Re: filesystem transactions API

>>>>> "Jamie" == Jamie Lokier <[email protected]> writes:

Jamie> No. A transaction means that _all_ processes will see the
Jamie> whole transaction or not.

This is really hard. How do you handle the case where process X
starts a transaction, modifies files a, b & c, but process Y has file b
open for writing, and never lets it go? Or the file gets unlinked?

Jamie> For example, you can use transactions for distro package
Jamie> management: a whole update of a package would be a single
Jamie> transaction, so that at no time does any program see an
Jamie> inconsistent set of files. See why _every_ process in the
Jamie> system must have the same view?

What about programs that are already open and running?

It might be doable in some sense, but I can see that details are
really hard to get right. Esp without breaking existing Unix
semantics.

But then again, I could be smoking something good (or bad :-) here, so
take what I say with a grain of salt.

John

2005-04-26 15:13:07

by Lars Marowsky-Bree

[permalink] [raw]
Subject: Re: filesystem transactions API

On 2005-04-26T11:01:54, John Stoffel <[email protected]> wrote:

> Jamie> No. A transaction means that _all_ processes will see the
> Jamie> whole transaction or not.
> This is really hard. How do you handle the case where process X
> starts a transaction, modifies files a, b & c, but process Y has file b
> open for writing, and never lets it go? Or the file gets unlinked?

I suggest you ask Hans, reiser4 does have such a feature if I recall
correctly.

It gets a whole lot more interesting if you want the sucker to span
more than one mount though.


Sincerely,
Lars Marowsky-Brée <[email protected]>

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business

2005-04-26 15:19:28

by Jamie Lokier

[permalink] [raw]
Subject: Re: filesystem transactions API

Artem B. Bityuckiy wrote:
> Hmm, so the whole point of implementing transactions in kernel space is
> to do the transactions in a way that nobody can see any intermediate
> inconsistent state?

Yes.

-- Jamie

2005-04-26 15:20:22

by Trond Myklebust

[permalink] [raw]
Subject: Re: filesystem transactions API

On Tue, 26.04.2005 at 11:01 (-0400), John Stoffel wrote:
> >>>>> "Jamie" == Jamie Lokier <[email protected]> writes:
>
> Jamie> No. A transaction means that _all_ processes will see the
> Jamie> whole transaction or not.
>
> This is really hard. How do you handle the case where process X
> starts a transaction, modifies files a, b & c, but process Y has file b
> open for writing, and never lets it go? Or the file gets unlinked?

That is why implementing it as a form of lock makes sense.

> Jamie> For example, you can use transactions for distro package
> Jamie> management: a whole update of a package would be a single
> Jamie> transaction, so that at no time does any program see an
> Jamie> inconsistent set of files. See why _every_ process in the
> Jamie> system must have the same view?
>
> What about programs that are already open and running?
>
> It might be doable in some sense, but I can see that details are
> really hard to get right. Esp without breaking existing Unix
> semantics.

Wrong.

Cheers,
Trond
--
Trond Myklebust <[email protected]>

2005-04-26 15:25:05

by Jamie Lokier

[permalink] [raw]
Subject: Re: filesystem transactions API

John Stoffel wrote:
> >>>>> "Jamie" == Jamie Lokier <[email protected]> writes:
>
> Jamie> No. A transaction means that _all_ processes will see the
> Jamie> whole transaction or not.
>
> This is really hard. How do you handle the case where process X
> starts a transaction, modifies files a, b & c, but process Y has file b
> open for writing, and never lets it go? Or the file gets unlinked?

Then it starts to depend on what kind of transactions you want to
implement.

You can say that a transaction isn't allowed when a process has one of
the files opened for writing. Or you can say a transaction is
equivalent to calling all of the I/O system calls at once. You can
also decide if you want the reads and directory lookups performed in
the transactions to become prerequisites for the transaction
completing (so it's aborted if another process writes to those file
regions or changes the directory structure in a way which breaks a
prerequisite), or if you want those to lock the things which are read
for the duration of the transaction, or even just ignore reads for
transaction purposes. Or, you can say that transactions are limited
to just directory structure, and not file contents (that's good enough
for package management), or you can say they're limited to just file
contents (that's good enough for databases and text file edits).

Etc, etc, quite a lot of semantic choices.

> What about programs that are already open and running?
>
> It might be doable in some sense, but I can see that details are
> really hard to get right. Esp without breaking existing Unix
> semantics.

It's even harder without kernel support! :)

-- Jamie

2005-04-26 15:43:12

by Charles P. Wright

[permalink] [raw]
Subject: Re: filesystem transactions API

On Tue, 2005-04-26 at 18:22 +0400, Artem B. Bityuckiy wrote:
> Jamie Lokier wrote:
> > I think I've wanted something like that for _years_ in unix.
> >
> > It's an old, old idea, and I've often wondered why we haven't implemented it.
> >
>
> I thought it would be possible to implement this rather easily on top
> of non-transactional FS (albeit I didn't try) and there is no need
> to overcomplicate an FS. Just implement a specialized user-space
> library and utilize it.
There are actually plenty of things that make it harder than it first
seems to provide ACID transactions. The two most difficult things are
going to be atomicity and isolation.

Atomicity is difficult, because you have lots of caches each with their
own bits of state (e.g., the inode/dentry caches). Assuming your
transaction is committed that isn't so much of a problem, but once you
have a rollback you need to undo any changes to those caches.

Isolation (the property that concurrent transactions should produce the
same result as some serial execution) is also tricky to get right. A
transaction can touch any number of objects, and user applications may
not respect any lock ordering --- which means you will have deadlocks,
and you must detect and resolve them (probably by aborting one of the
transactions).
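
One common way to do the detection is a wait-for graph: record which
transaction each blocked transaction is waiting on, and look for a cycle.
The sketch below assumes exclusive locks only (so a blocked transaction
waits for exactly one holder) and aborts the youngest transaction in the
cycle. Both choices, and the function names, are assumptions made for this
example rather than a description of any existing implementation.

```python
def find_deadlock(waits_for):
    """Return the transactions forming a cycle, or None.

    `waits_for` maps a blocked transaction to the single transaction
    that holds the lock it is waiting on.
    """
    for start in waits_for:
        path = []
        node = start
        # Follow the chain of waiters until it ends or repeats.
        while node in waits_for and node not in path:
            path.append(node)
            node = waits_for[node]
        if node in path:
            # Everything from the first repeated node onwards is the cycle.
            return path[path.index(node):]
    return None


def choose_victim(cycle, start_time):
    """Pick the transaction to abort: here, the most recently started."""
    return max(cycle, key=lambda txn: start_time[txn])
```

For example, with T1 waiting on T2 and T2 waiting on T1, find_deadlock()
returns both, and choose_victim() aborts whichever started later so that
the older transaction can finish.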

None of these problems are insurmountable, and there are definitely good
reasons to use transactions. For example, RPM uses transactions to
update its own databases, it would be great if it could use transactions
to update the whole file system. Mail servers also have to jump through
hoops to provide atomic updates. Isolation takes care of race
conditions.

At our lab, we've been experimenting with transactional file systems.
We've ported the Berkeley database to the kernel, because it already
provides ACID transactions. We've also built a simple file system on
top of it, with a rudimentary transactions API that is exposed to
user level. One of the key things that we've learned is that it isn't very
easy to just "bolt" transactions onto your file system after the fact,
because there are just so many interactions between the file system,
caches, and the transaction manager.

Charles

2005-04-26 15:49:45

by Jamie Lokier

[permalink] [raw]
Subject: Re: filesystem transactions API

Trond Myklebust wrote:
> > Jamie> No. A transaction means that _all_ processes will see the
> > Jamie> whole transaction or not.
> >
> > This is really hard. How do you handle the case where process X
> > starts a transaction, modifies files a, b & c, but process Y has file b
> > open for writing, and never lets it go? Or the file gets unlinked?
>
> That is why implementing it as a form of lock makes sense.

The problem with making them exclusive locks is that you halt the
system for the duration of the transaction. If it's a big transaction
such as updating 1000 files for a package update, that blocks a lot of
programs for a long time, and it's not necessary.

And, because that's a potential denial of service, you have to limit
the size of transactions and their duration, especially for ordinary
users. That makes transactions a lot less useful than they can be.

I would implement them as a combination of time-limited lock, and
abortable transaction with file & directory reads establishing
prerequisites.

While the transaction lock is held, everything read (i.e. read byte
ranges, lock byte ranges, directory lookups, and stat results) causes
the corresponding range or inode to be exclusively locked for this
transaction, and also to be recorded in the prerequisite
set for this transaction. Everything written (i.e. byte ranges or any
other filesystem modifying operation) is queued.

If the transaction lock timeout is reached before the transaction is
closed, all the exclusive locks for this transaction are released, and
the transaction lock itself is released, and the prerequisite set
continues to be recorded.

If at any time, another process tries to modify any of the information
in the transaction's prerequisite set, then firstly: if the
transaction lock is held, the other process is blocked until that lock
is released. Secondly: if the other process successfully modifies
information in the transaction's prerequisite set, the transaction is
aborted. All further operations in this transaction will fail,
including reads, writes, and the final close which commits writes.

Finally, when the transaction is closed, either it fails because
prerequisites were modified, or it commits all the pending filesystem
modifications of this transaction.

Why two phases?

The second phase, with no exclusive locking, is to allow ordinary
users to use transactions without blocking other processes or hogging
excessive system resources. It allows other processes to progress
while a big transaction is in progress. In other words, it prevents
some kinds of denial-of-service, allows arbitrarily large transactions
as long as there's enough space in the filesystem, and is generally
better.

The first phase, with exclusive locking, uses a randomised timeout for
the lock. This is to prevent starvation of transacting processes by
other processes. It's analogous to the problem of readers starving
writers in some kinds of read-write locks. The randomised timeout is
to prevent mutual starvation between two or more transacting
processes, which might otherwise get into synchronised livelock.
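
The lock-free second phase is essentially optimistic concurrency control,
and a toy model of it fits in a few lines. The sketch below is a
single-threaded illustration with invented names (Store, Transaction,
TransactionAborted); it stands in for the filesystem with a dictionary,
leaves out the timed exclusive lock of the first phase, and notices a
broken prerequisite at the transaction's next operation rather than at the
moment of the conflicting write.

```python
class TransactionAborted(Exception):
    """Raised once a prerequisite of the transaction has been modified."""


class Store:
    """Toy stand-in for the filesystem: values plus a per-key change counter."""

    def __init__(self):
        self.data = {}
        self.version = {}

    def read(self, key):
        return self.data.get(key)

    def write(self, key, value):
        self.data[key] = value
        self.version[key] = self.version.get(key, 0) + 1


class Transaction:
    def __init__(self, store):
        self.store = store
        self.prereq = {}    # key -> version observed when first read
        self.pending = {}   # key -> queued new value

    def _check(self):
        # Fail if anything we have read was changed by someone else.
        for key, seen in self.prereq.items():
            if self.store.version.get(key, 0) != seen:
                raise TransactionAborted(key)

    def read(self, key):
        self._check()
        if key in self.pending:
            return self.pending[key]          # see our own queued write
        self.prereq.setdefault(key, self.store.version.get(key, 0))
        return self.store.read(key)

    def write(self, key, value):
        self._check()
        self.pending[key] = value             # queued, invisible to others

    def commit(self):
        self._check()
        for key, value in self.pending.items():
            self.store.write(key, value)
        self.pending.clear()
```

A transaction that reads "a" and queues a write to "b" commits normally if
"a" is untouched; if another writer changes "a" first, commit() raises
TransactionAborted and "b" keeps its old value.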

Enjoy :)
-- Jamie

2005-04-26 15:54:25

by Artem B. Bityutskiy

[permalink] [raw]
Subject: Re: filesystem transactions API

Jamie Lokier wrote:
> The problem with making them exclusive locks is that you halt the
> system for the duration of the transaction. If it's a big transaction
> such as updating 1000 files for a package update, that blocks a lot of
> programs for a long time, and it's not necessary.

Surely we'll block others anyway if we have kernel-level transaction
support?
What difference does it make in which layer we block?

--
Best regards, Artem B. Bityuckiy
Oktet Labs (St. Petersburg), Software Engineer.
+78124286709 (office) +79112449030 (mobile)
E-mail: [email protected], web: http://www.oktetlabs.ru

2005-04-26 16:04:22

by Jamie Lokier

Subject: Re: filesystem transactions API

Artem B. Bityuckiy wrote:
> Jamie Lokier wrote:
> >The problem with making them exclusive locks is that you halt the
> >system for the duration of the transaction. If it's a big transaction
> >such as updating 1000 files for a package update, that blocks a lot of
> >programs for a long time, and it's not necessary.
>
> Surely we'll anyway block others if we have a kernel-level
> transaction support? What is the difference in which layer to
> block?

No. Why would you block? You can have transactions without blocking
other processes.

When updating, say, the core-utils package (which contains cat),
there's no reason why a program which executes "cat" should have to
block during the update. It can simply execute the old one until the
new one is committed at the end of the update.

It's analogous to RCU for protecting kernel data structures without
blocking readers.
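
A userspace sketch of that analogy, using C11 atomics rather than the kernel's RCU API: readers pick up whichever version is currently published and keep using it, while the updater swaps in the new one with a single atomic exchange. The names (struct binary, lookup, publish) are hypothetical, and reclaiming the old version safely (the grace period in real RCU) is deliberately left out.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for an installed program image. */
struct binary {
    int version;
};

/* The currently published image; readers never block on it. */
static _Atomic(struct binary *) installed;

/* Reader side: fetch whatever is published right now. */
static struct binary *lookup(void)
{
    return atomic_load_explicit(&installed, memory_order_acquire);
}

/* Updater side: publish the new image in one step and return the old one.
 * A real implementation must wait for readers before freeing the old copy. */
static struct binary *publish(struct binary *next)
{
    return atomic_exchange_explicit(&installed, next, memory_order_acq_rel);
}
```

A process that called lookup() before the update keeps running the old image; any lookup() after publish() sees the new one.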

-- Jamie

2005-04-26 16:06:27

by Artem B. Bityutskiy

Subject: Re: filesystem transactions API

Jamie Lokier wrote:
> No. Why would you block? You can have transactions without blocking
> other processes.
>
> When updating, say, the core-utils package (which contains cat),
> there's no reason why a program which executes "cat" should have to
> block during the update. It can simply execute the old one until the
> new one is committed at the end of the update.
>
> It's analogous to RCU for protecting kernel data structures without
> blocking readers.
>
Hmm, can't we implement a user-space locking system which admits of
readers during transactions? I guess we can.

--
Best regards, Artem B. Bityuckiy
Oktet Labs (St. Petersburg), Software Engineer.
+78124286709 (office) +79112449030 (mobile)
E-mail: [email protected], web: http://www.oktetlabs.ru

2005-04-26 16:15:11

by Artem B. Bityutskiy

Subject: Re: filesystem transactions API

Charles P. Wright wrote:
> Atomicity is difficult, because you have lots of caches each with their
> own bits of state (e.g., the inode/dentry caches). Assuming your
> transaction is committed that isn't so much of a problem, but once you
> have a rollback you need to undo any changes to those caches.
I guess if you synchronize before unlocking, all is OK. Roll-back
means deleting partially written things, restoring the old things, and
then running fsync. Why might this not be enough?

--
Best regards, Artem B. Bityuckiy
Oktet Labs (St. Petersburg), Software Engineer.
+78124286709 (office) +79112449030 (mobile)
E-mail: [email protected], web: http://www.oktetlabs.ru

2005-04-26 16:28:06

by Erik Hensema

Subject: Re: filesystem transactions API

Ville Herva ([email protected]) wrote:
> Apparently, Windows Longhorn will include something called "transactional
> NTFS". It's explained pretty well in
>
> http://blogs.msdn.com/because_we_can/
>
> Basically, a process can create a fs transaction, and all fs changes made
> between start of the transaction and commit are atomic - meaning nothing
> is visible until commit, and if commit fails, everything is rolled back.
>
> Sounds useful... Although there are no service pack installs that could fail
> in Linux, the same thing could be useful in rpm, yum, almost anything.
>
> What do you think?

Doesn't reiser4 do this?

--
Erik Hensema <[email protected]>

2005-04-26 17:10:40

by Bryan Henderson

Subject: Re: [PATCH] private mounts

>> >mknamespace -p users/$UID # (like mkdir -p)
>> >setnamespace users/$UID # (like cd)
>> ^^^^^^^^
>>
>> You realize that 'cd' is a shell command, and has to be, I hope. That
>> little fact has thrown a wrench into many of the ideas in this thread.
>
>I suppose it will be called by the login process or by wrappers like
>'nice'.

Just to be clear, then: this idea is fundamentally different from the
mkdir/cd analogy the thread starts with above. And it misses one rather
important requirement compared to mkdir/cd: You can't add a new mount to
an existing shell.

Several more complicated schemes that may achieve that are being discussed
in this thread.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems

2005-04-26 17:24:15

by Diego Calleja

Subject: Re: filesystem transactions API

On Tue, 26 Apr 2005 16:24:34 +0100,
Jamie Lokier <[email protected]> wrote:

> It's even harder without kernel support! :)

This seems to implement something in userspace which might be interesting:
http://users.auriga.wearlab.de/~alb/libjio/

2005-04-26 17:24:59

by Charles P. Wright

Subject: Re: filesystem transactions API

On Tue, 2005-04-26 at 20:07 +0400, Artem B. Bityuckiy wrote:
> Charles P. Wright wrote:
> > Atomicity is difficult, because you have lots of caches each with their
> > own bits of state (e.g., the inode/dentry caches). Assuming your
> > transaction is committed that isn't so much of a problem, but once you
> > have a rollback you need to undo any changes to those caches.
> I guess if you do synchronization before unlocking all is OK. Roll-back
> means deleting partially written things and restore old things, then run
> fsyncs. Why might this not be enough?
That would be fine for the on-disk image of the file system, but the in-
memory image also needs to be handled. Keeping track of all of these
objects and their changes is not a simple task.

Charles

2005-04-26 17:47:51

by Jamie Lokier

Subject: Re: filesystem transactions API

Diego Calleja wrote:
> > It's even harder without kernel support! :)
>
> This seems to implement something in userspace which might be interesting:
> http://users.auriga.wearlab.de/~alb/libjio/

Thanks. That looks like a handy little library.

It doesn't do full filesystem transactions, obviously. Just
transactions within a single file, and requiring all processes using
the file to cooperate.

-- Jamie

2005-04-26 18:55:45

by Bryan Henderson

Subject: Re: [PATCH] private mounts

>On Tue, Apr 26, 2005 at 03:00:10AM -0700, Andrew Morton wrote:
>> Not as thick as mine! Could someone please explain in small words
>> what's wrong with an suid mount helper?
>
>Nothing per-se. What makes it bad is the context of a userland
>filesystem where the actual filesystem operations in the mounted
>filesystem happen in context of a non-privileged user.

How did the fact that the file access system calls involve user-controlled
code come into this? I thought the FUSE kernel code already shielded the
system from said code to everyone's satisfaction.

We've been talking, rather, about the namespace changes. The exact same
issue exists with a non-userspace filesystem where the user controls the
filesystem contents. For example, a filesystem on a user-supplied CD. A
system administrator -- personally or through his setuid proxy -- might
want to mount this CD for the benefit of some users/processes/whatever but
not add it to the global namespace.

The issue of private mounts (mount = namespace change) would be good to
resolve separately from any problem with bringing user space code into the
kernel.

BTW, since Miklos said "mount helper" and others have said "mount
wrapper," I think some of us may not be familiar with mount helpers. It's
irrelevant to this discussion, but: util-linux 'mount' has a little-known
feature wherein it can run a filesystem-type-specific program in a child
process to do some of the mount function. A "mount wrapper" would be the
opposite -- a filesystem-type-specific program that runs the generic
'mount' program in a child process.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems

2005-04-26 20:09:53

by Bodo Eggert

Subject: Re: [PATCH] private mounts

On Tue, 26 Apr 2005, Bryan Henderson wrote:

> >> >mknamespace -p users/$UID # (like mkdir -p)
> >> >setnamespace users/$UID # (like cd)
> >> ^^^^^^^^
> >>
> >> You realize that 'cd' is a shell command, and has to be, I hope. That
> >> little fact has thrown a wrench into many of the ideas in this thread.
> >
> >I suppose it will be called by the login process or by wrappers like
> >'nice'.
>
> Just to be clear, then: this idea is fundamentally different from the
> mkdir/cd analogy the thread starts with above.

NACK, it's very similar to the cd "$HOME" (or ulimit calls) done by the
login mechanism, except for the fact that no shell does implement a
setnamespace command and therefore can't leave that namespace. If the
shell were actually modified to implement setnamespace, that command would
be exactly like the cd command.

The wrapper I mentioned will usually not be needed for normal operation,
but if users want additional private namespaces, they'll need this
separate wrapper (or to modify the application or the shell) in order to
switch into them.

> And it misses one rather
> important requirement compared to mkdir/cd: You can't add a new mount to
> an existing shell.

The mount would be a part of the current namespace, which is shared across
all current user processes unless they are started without login (e.g.
procmail[0]) or running in a different namespace (the user called
setnamespace).



[0] If you want procmail in a user namespace, use a wrapper like
---/usr/bin/procmail---
#!/bin/sh
exec /usr/bin/setnamespace /users/"$UID" -- /usr/bin/procmail.bin "$@"
---

BTW: I think the namespaces will need the normal file permissions.

--
Fun things to slip into your budget
Paradigm pro-activator (a whole pack)
(you mean beer?)

2005-04-26 20:16:57

by Pavel Machek

Subject: Re: [PATCH] private mounts

Hi!

> > Could we get root-only fuse in, please?
>
> chmod u-s /usr/bin/fusermount

:-)))). I meant merging patches that are not controversial into
mainline. AFAICT only controversial pieces are "make it safe for
non-root users"...

Pavel
--
Boycott Kodak -- for their patent abuse against Java.

2005-04-26 22:08:30

by Bryan Henderson

Subject: Re: [PATCH] private mounts

>> Just to be clear, then: this idea is fundamentally different from the
>> mkdir/cd analogy the thread starts with above.
>
>NACK, it's very similar to the cd "$HOME" (or ulimit calls) done by the
>login mechanism,

That's not a NACK. The cd "$HOME" and ulimit calls done by the login
process (more precisely, by a shell profile) are quite different from the
mkdir/cd the thread started with. Who creates a new directory in his
shell profile? I assume the mkdir/cd analogy is a case of a person doing
a mkdir and cd in a running shell. (That is indeed analogous to what one
would like to do with a private mount).

When you said "by the login process or by wrappers like nice," in response
to my pointing out that setnamespace would need to be a shell builtin
command, I assumed you were talking about putting it in the code that
execs the shell as opposed to in the shell profile, thus eliminating the
need for a shell builtin.

But the important thing is just to recognize, as you say explicitly now,
that setnamespace has to be a shell builtin command for
setnamespace/mknamespace to be analogous to mkdir/cd. That was my
original statement, if somewhat indirect:

>> >> >mknamespace -p users/$UID # (like mkdir -p)
>> >> >setnamespace users/$UID # (like cd)
>> >> ^^^^^^^^
>> >> You realize that 'cd' is a shell command, and has to be, I hope.
>> >> That little fact has thrown a wrench into many of the ideas in
>> >> this thread.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems

2005-04-27 08:19:42

by Bodo Eggert

Subject: Re: [PATCH] private mounts

On Tue, 26 Apr 2005, Bryan Henderson wrote:

> >> Just to be clear, then: this idea is fundamentally different from the
> >> mkdir/cd analogy the thread starts with above.
> >
> >NACK, it's very similar to the cd "$HOME" (or ulimit calls) done by the
> >login mechanism,
>
> That's not a NACK. The cd "$HOME" and ulimit calls done by the login
> process (more precisely, by a shell profile) are quite different from the
> mkdir/cd the thread started with. Who creates a new directory in his
> shell profile?

I create a directory in /tmp and set $TMP to that directory, because I
can't just mount a private tmpfs. But that's another topic.

> I assume the mkdir/cd analogy is a case of a person doing
> a mkdir and cd in a running shell. (That is indeed analogous to what one
> would like to do with a private mount).

ACK, with respect to lifetime and processes affected, it will be exactly
like creating/using a directory in a tmpfs. But as you noticed, you'd need
the shell builtin command to make this analogy complete. That's not going
to happen, but it's not needed for operation.

> When you said "by the login process or by wrappers like nice," in response
> to my pointing out that setnamespace would need to be a shell builtin
> command, I assumed you were talking about putting it in the code that
> execs the shell as opposed to in the shell profile, thus eliminating the
> need for a shell builtin.

Exactly. You can't patch all login daemons, so you'll need pam to do the
initial setup.

After that, the users may decide to ignore having a private namespace (it
will just DTRT), or they can decide to use that feature to lock in some of
their programs. Obviously pam won't allow private sub-namespaces at random
times, while the general system call would support this, and their shell
won't do that either. In the same way you'll need a wrapper like "#!/bin/sh
cd $dir&&exec $prog" for doing the initial chdir on behalf of
chdir-ignorant programs, you'll need a wrapper for setnamespace-ignorant
programs. The only difference is that chdir-ignorant programs are rare.
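
For comparison, the chdir wrapper written out in C. This is a sketch only: run_in_dir is a made-up name, and the exit codes 126 and 127 merely follow the usual shell convention. A setnamespace wrapper would have the same shape, with the (so far hypothetical) namespace-switching call in place of chdir().

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv[0] with argv in directory dir, on behalf of a program that
 * does not change directory itself. Returns the child's exit status,
 * or -1 if it could not be started or did not exit normally. */
static int run_in_dir(const char *dir, char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: do the "cd $dir && exec $prog" part. */
        if (chdir(dir) != 0)
            _exit(126);
        execvp(argv[0], argv);
        _exit(127);             /* only reached if exec failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The working directory changes only in the child, which is why this cannot replace a shell builtin: the calling shell stays where it was.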

> But the important thing is just to recognize, as you say explicitly now,
> that setnamespace has to be shell builtin command for
> setnamespace/mknamespace to be analogous to mkdir/cd. That was my
> original statement, if somewhat indirect:

For the analogy yes, for usage no.
--
The secret of the universe is #@*%! NO CARRIER

2005-04-27 08:50:42

by Miklos Szeredi

Subject: Re: [PATCH] private mounts

> > > Could we get root-only fuse in, please?
> >
> > chmod u-s /usr/bin/fusermount
>
> :-)))). I meant merging patches that are not controversial into
> mainline. AFAICT only controversial pieces are "make it safe for
> non-root users"...

This is the controversial part in all its glory:

if (!(fc->flags & FUSE_ALLOW_OTHER) && current->fsuid != fc->user_id)
return -EACCES;

Leaving it out would gain us what exactly?
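
To make the effect of those two lines concrete, here is a userspace model of the check. It is a sketch, not the FUSE source: struct fuse_conn_model, fuse_check_access() and the numeric value given to FUSE_ALLOW_OTHER are assumptions for illustration, and the caller's fsuid is passed in as a parameter instead of being read from current.

```c
#include <errno.h>
#include <sys/types.h>

/* Flag value assumed for this example only. */
#define FUSE_ALLOW_OTHER (1u << 1)

/* Minimal stand-in for the per-mount connection state. */
struct fuse_conn_model {
    unsigned int flags;
    uid_t user_id;      /* the user who did the mount */
};

/* Same condition as the two lines quoted above: unless the mount was
 * made with "allow other", only the mounting user gets through. */
static int fuse_check_access(const struct fuse_conn_model *fc, uid_t fsuid)
{
    if (!(fc->flags & FUSE_ALLOW_OTHER) && fsuid != fc->user_id)
        return -EACCES;
    return 0;
}
```

Note that the comparison is on the uid alone, so a process with fsuid 0 is refused as well unless it is the mount owner.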

I'm not trying to say that this is somehow better than the
pam+shared-subtrees solution discussed. That certainly has advantages
over this (e.g. suid programs get permission to fuse mounted
filesystems).

But leaving it out makes no sense. Zero, zilch, none.

Maybe I'm totally dumb, but I just don't get Christoph's argument.

Thanks,
Miklos

2005-04-27 09:10:06

by Helge Hafting

Subject: Re: [PATCH] private mounts

Jamie Lokier wrote:

> He wants to do this:
>
> 1. From client, login to server and do a usermount on $HOME/private.
>
> 2. From client, login to server and read the files previously mounted.
>
>
This works fine with plain "mount", except that the mount isn't
hidden from others. Why hide it? Permissions can be used to prevent
others from looking at the mounted stuff if need be. I.e. put
the mountpoint in a directory not readable by others, or
have the root of that fs unreadable by others.
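
The permission-based approach needs nothing beyond ordinary mode bits. A small sketch, with hypothetical helper names, that restricts a directory to its owner and checks the result:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Restrict a directory (e.g. the parent of a mountpoint) to its owner. */
static int make_owner_only(const char *path)
{
    return chmod(path, S_IRWXU);    /* mode 0700 */
}

/* True if neither group nor others have any permission bit on path. */
static bool is_owner_only(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return false;
    return (st.st_mode & (S_IRWXG | S_IRWXO)) == 0;
}
```

This hides the contents from other ordinary users, but unlike a private mount it does not keep root out, and the mount still shows up in the global mount table.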

Helge Hafting

2005-04-27 09:15:24

by Jan Hudec

Subject: Re: filesystem transactions API

On Tue, Apr 26, 2005 at 20:01:45 +0400, Artem B. Bityuckiy wrote:
> Jamie Lokier wrote:
> >No. Why would you block? You can have transactions without blocking
> >other processes.
> >
> >When updating, say, the core-utils package (which contains cat),
> >there's no reason why a program which executes "cat" should have to
> >block during the update. It can simply execute the old one until the
> >new one is committed at the end of the update.
> >
> >It's analogous to RCU for protecting kernel data structures without
> >blocking readers.
> >
> Hmm, can't we implement a user-space locking system which admits of
> readers during transactions? I guess we can.

The problem with implementing in userland, as was already said in the
thread, is, that if some process does not use the library, it can
completely mess it up. It is only safe in kernel.

-------------------------------------------------------------------------------
Jan 'Bulb' Hudec <[email protected]>



2005-04-27 09:25:21

by Pavel Machek

Subject: Re: [PATCH] private mounts

Hi!

> > > > Could we get root-only fuse in, please?
> > >
> > > chmod u-s /usr/bin/fusermount
> >
> > :-)))). I meant merging patches that are not controversial into
> > mainline. AFAICT only controversial pieces are "make it safe for
> > non-root users"...
>
> This is the controversial part in all its glory:
>
> if (!(fc->flags & FUSE_ALLOW_OTHER) && current->fsuid != fc->user_id)
> return -EACCES;
>
> Leaving it out would gain us what exactly?

Well, if it brings us ugly semantics, keeping those two lines out for
a while can help merge a lot...
Pavel
--
Boycott Kodak -- for their patent abuse against Java.

2005-04-27 09:34:46

by Jan Hudec

Subject: Re: filesystem transactions API

On Tue, Apr 26, 2005 at 16:24:34 +0100, Jamie Lokier wrote:
> John Stoffel wrote:
> > >>>>> "Jamie" == Jamie Lokier <[email protected]> writes:
> >
> > Jamie> No. A transaction means that _all_ processes will see the
> > Jamie> whole transaction or not.
> >
> > This is really hard. How do you handle the case where process X
> > starts a transaction modifies files a, b & c, but process Y has file b
> > open for writing, and never lets it go? Or the file gets unlinked?
>
> Then it starts to depend on what kind of transactions you want to
> implement.
>
> You can say that a transaction isn't allowed when a process has one of
> the files opened for writing. Or you can say a transaction is
> equivalent to calling all of the I/O system calls at once. You can
> also decide if you want the reads and directory lookups performed in
> the transactions to become prerequisites for the transaction
> completing (so it's aborted if another process writes to those file
> regions or changes the directory structure in a way which breaks a
> prerequisite), or if you want those to lock the things which are read
> for the duration of the transaction, or even just ignore reads for
> transaction purposes. Or, you can say that transactions are limited
> to just directory structure, and not file contents (that's good enough
> for package management), or you can say they're limited to just file
> contents (that's good enough for databases and text file edits).
>
> Etc, etc, quite a lot of semantic choices.

How do we specify which calls belong to a transaction? By some kind of
extra file handle?

I'd think having global per-process transaction is not the best way.
So I think we should have some kind of transaction handle (probably in
the file handle space) and a way to say that a syscall is done within
a transaction. To avoid duplicating all syscalls, we could have
set_active_transaction() operation.

Now I think the criteria for semantics should be serializability. That
would mean, that lookup paths would have to be locked IFF the lookup was
done within the transaction -- but you would be free to open a file
without transaction, then set_active_transaction and write that file.
That way the write would become atomic, but someone else could freely
rename the file from under you.

Note: Editors currently write to a temporary file and rename over the
original (if they have permissions to do it), which is as good a
transaction as they need.
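
The write-then-rename pattern looks roughly like this in C. It is a minimal sketch: atomic_replace is a made-up name, and the fixed ".tmp" suffix is a simplification (a careful program would create a unique temporary file, e.g. with mkstemp(), and preserve the original's ownership and mode).

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Replace the contents of path with data so that readers see either the
 * old file or the new one, never a partial write. Returns 0 or -1. */
static int atomic_replace(const char *path, const char *data)
{
    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    FILE *f = fopen(tmp, "w");
    if (!f)
        return -1;

    /* Write everything and push it to disk before making it visible. */
    size_t len = strlen(data);
    if (fwrite(data, 1, len, f) != len ||
        fflush(f) != 0 || fsync(fileno(f)) != 0) {
        fclose(f);
        remove(tmp);
        return -1;
    }
    if (fclose(f) != 0) {
        remove(tmp);
        return -1;
    }

    /* rename() replaces the target atomically on the same filesystem. */
    if (rename(tmp, path) != 0) {
        remove(tmp);
        return -1;
    }
    return 0;
}
```

This covers a single file only; the point of the thread is that nothing comparable exists for a change spanning several files.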

> > What about programs that are already open and running?
> >
> > It might be doable in some sense, but I can see that details are
> > really hard to get right. Esp without breaking existing Unix
> > semantics.
>
> It's even harder without kernel support! :)

If every syscall (touching the filesystem) was turned into a transaction of
its own, it wouldn't break any semantics.

-------------------------------------------------------------------------------
Jan 'Bulb' Hudec <[email protected]>



2005-04-27 09:37:59

by Lars Marowsky-Bree

Subject: Re: filesystem transactions API

On 2005-04-26T11:40:02, "Charles P. Wright" <[email protected]> wrote:

> Atomicity is difficult, because you have lots of caches each with their
> own bits of state (e.g., the inode/dentry caches). Assuming your
> transaction is committed that isn't so much of a problem, but once you
> have a rollback you need to undo any changes to those caches.
>
> Isolation (this is the property that says that concurrent transactions
> should be the same as if there was a serial execution) is also tricky to
> get right. A transaction can touch any number of objects, and user-
> applications may not respect any lock ordering --- which means you will
> have deadlocks, and you must detect and resolve them (probably by
> aborting one of the transactions).

Just as a weird idea, spawned by the FUSE thread.

"Transactions happen in their own namespace".

Besides having a namespace_(create|join) as needed for FUSE (or
similar), there'd be a privileged namespace_replace(target, source) (or
_merge, if you prefer - that however seems to imply that a namespace was
actually forked off another).

So, you want transactions for testing some software update, you create
your new one, mount stuff, do the update, and then "commit" it by
replacing the global namespace by it.

If you want to discard, just exit it. As soon as no further references
to a namespace exist, it can be cleaned up (and non-persistent
transactions will be 'unrolled' and thrown away).

Now where's that pipe of mine... ;-)


Sincerely,
Lars Marowsky-Brée <[email protected]>

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business

2005-04-27 10:43:34

by Miklos Szeredi

Subject: Re: [PATCH] private mounts

> > This is the controversial part in all its glory:
> >
> > if (!(fc->flags & FUSE_ALLOW_OTHER) && current->fsuid != fc->user_id)
> > return -EACCES;
> >
> > Leaving it out would gain us what exactly?
>
> Well, if it brings us ugly semantics, keeping those two lines out for
> a while can help merge a lot...

To the mount owner the semantics are quite normal. Others will be
denied access to the mountpoint, which doesn't introduce any new
semantics either.

If you look at it from the POV of _any_ process, there are NO NEW
SEMANTICS. Nothing that programs, scripts or anything has to be
modified for. Nothing that could cause _any_ problems later, if this
check was removed.

Prove me wrong!

Thanks,
Miklos

2005-04-27 10:58:59

by Bernd Eckenfels

Subject: Re: filesystem transactions API

In article <20050427091402.GA1904@vagabond> you wrote:
> The problem with implementing in userland, as was already said in the
> thread, is, that if some process does not use the library, it can
> completely mess it up. It is only safe in kernel.

Only if all processes use the transacted API. There is no reason to fear that
some malicious user is messing with your DB files.

Greetings
Bernd

2005-04-27 12:00:38

by Jan Hudec

Subject: Re: [PATCH] private mounts

On Wed, Apr 27, 2005 at 12:42:04 +0200, Miklos Szeredi wrote:
> > > This is the controversial part in all its glory:
> > >
> > > if (!(fc->flags & FUSE_ALLOW_OTHER) && current->fsuid != fc->user_id)
> > > return -EACCES;
> > >
> > > Leaving it out would gain us what exactly?
> >
> > Well, if it brings us ugly semantics, keeping those two lines out for
> > a while can help merge a lot...
>
> To the mount owner the semantics are quite normal. Others will be
> denied access to the mountpoint, which doesn't introduce any new
> semantics either.

What makes you think Pavel was talking about semantics?!

The point was that:
Ok, there is a strong disagreement about these two lines. Could we have
a patch with everything but these two lines, so it can be integrated
immediately to benefit from the testing and generally be useful, and then
the controversial bits when the issue is beaten to death?

So, please, could you send a stripped-down version, that is not safe for
mounting by users, but can be tested for many cases where that is
sufficient?

> If you look at it from the POV of _any_ process, there are NO NEW
> SEMANTICS. Nothing that programs, scripts or anything has to be
> modified for. Nothing that could cause _any_ problems later, if this
> check was removed.
>
> Prove me wrong!

As I understand it, doing things like this is butt ugly. Not just in
fuse -- in NFS, in samba, everywhere where such hacks are employed. But
now they just have enough of those hacks and want a cleaner solution.

-------------------------------------------------------------------------------
Jan 'Bulb' Hudec <[email protected]>



2005-04-27 12:25:12

by Miklos Szeredi

Subject: Re: [PATCH] private mounts

> What makes you think Pavel was talking about semantics?!

Well, if it brings us ugly semantics, keeping those two lines out for
^^^^^^^^^
a while can help merge a lot...

> The point was that:
> Ok, there is a strong disagreement about these two lines. Could we have
> a patch with everything but these two lines, so it can be integrated
> immediately to profit of the testing and generally be useful, and then
> the controversial bits when the issue is beaten to death?

I could remove this check.

But it would only cause confusion. How would the userspace utilities
differentiate between the safe out-of-kernel and the unsafe in-kernel
module? Adding hacks to make this possible is far more ugly IMO than
integrating the current well tested solution.

It makes no sense. If someone would give me a rational explanation
why it is bad, I would be content. But you just tell me it's
terrible, ugly, crap, which may well be true, but those are not
technical terms that I can relate to.

> As I understand it, doing things like this is butt ugly. Not just in
> fuse -- in NFS, in samba, everywhere where such hacks are employed. But
> now they just have enough of those hacks and want a cleaner solution.

Please do. I want it too.

_When_ we have a better solution, all the hacks can be removed, and
the world will rejoice.

Until then, let the hacks live! Please!

Thanks,
Miklos

2005-04-27 12:43:05

by Jan Hudec

Subject: Re: [PATCH] private mounts

On Wed, Apr 27, 2005 at 14:23:48 +0200, Miklos Szeredi wrote:
> > What makes you think Pavel was talking about semantics?!
>
> Well, if it brings us ugly semantics, keeping those two lines out for
> ^^^^^^^^^
> a while can help merge a lot...
>
> > The point was that:
> > Ok, there is a strong disagreement about these two lines. Could we have
> > a patch with everything but these two lines, so it can be integrated
> > immediately to profit of the testing and generally be useful, and then
> > the controversial bits when the issue is beaten to death?
>
> I could remove this check.
>
> But it would only cause confusion. How would the userspace utilities
> differentiate between the safe out-of-kernel and the unsafe in-kernel
> module? Adding hacks to make this possible is far more ugly IMO than
> integrating the current well tested solution.
>
> It makes no sense. If someone would give me a rational explanation
> why it is bad, I would be content. But you just tell me it's
> terrible, ugly, crap which may well be true, but are not technical
> terms, which I can relate to.

Where the hell do you see it above. The only thing I said above is it is
controversial.

The userland tools don't need to know. They just need to not be suid.

> > As I understand it, doing things like this is butt ugly. Not just in
> > fuse -- in NFS, in samba, everywhere where such hacks are employed. But
> > now they just have enough of those hacks and want a cleaner solution.
>
> Please do. I want it too.
>
> _When_ we have a better solution, all the hacks can be removed, and
> the world will rejoice.
>
> Until then, let the hacks live! Please!

Ok, here I say it is ugly (but not that it's crap). And the reason is,
that there is a permission system, with some semantics, and then various
filesystems adapt it in various ways to fit what they want. So every
filesystem ends up with its own slightly different behaviour.

That being said, fuse does just about the same as NFS, samba and others
and I don't really see the reason why it couldn't be integrated. But
I am not the one to decide.

-------------------------------------------------------------------------------
Jan 'Bulb' Hudec <[email protected]>



2005-04-27 13:23:50

by Miklos Szeredi

Subject: Re: [PATCH] private mounts

> > It makes no sense. If someone would give me a rational explanation
> > why it is bad, I would be content. But you just tell me it's
> > terrible, ugly, crap which may well be true, but are not technical
> > terms, which I can relate to.
>
> Where the hell do you see it above.

I meant it as a plural "you" mainly referring to Christoph, who did use
those words, and not too sparingly either :)

I did not mean it personally, please accept my apologies.

> The userland tools don't need to know. They just need to not be suid.

But I'd want to continue to distribute the non-crippled kernel module
too, with suid fusermount. Then fusermount _has_ to know which kernel
module is currently active.

> Ok, here I say it is ugly (but not that it's crap). And the reason is,
> that there is a permission system, with some semantics, and then various
> filesystems adapt it in various ways to fit what they want. So every
> filesystem ends up with its own slightly different behaviour.
>
> That being said, fuse does just about the same as NFS, samba and others
> and I don't really see the reason why it couldn't be integrated. But
> I am not the one to decide.

Every opinion counts.

I'm not trying to convince people that the current solution is
perfect. What I'm saying is that it's

a) not harmful

b) it makes non-privileged mounts possible

And b) is _the_ most important feature IMO, so the argument for
stripping it out has to be very good.

Thanks,
Miklos

2005-04-27 13:36:48

by Andi Kleen

Subject: Re: filesystem transactions API

"Artem B. Bityuckiy" <[email protected]> writes:

> Jamie Lokier wrote:
>> I think I've wanted something like that for _years_ in unix.
>> It's an old, old idea, and I've often wondered why we haven't
>> implemented it.
>>
>
> I thought it would be possible to implement this rather easily on
> top of a non-transactional FS (albeit I didn't try), and there is no
> need to overcomplicate an FS. Just implement a specialized
> user-space library and utilize it.

Yes, it is. E.g. newer Sleepycat DB has a nice library for this.
It should be somewhere on your distribution.

-Andi

2005-04-27 13:44:46

by Ville Herva

Subject: Re: filesystem transactions API

On Wed, Apr 27, 2005 at 11:34:12AM +0200, you [Jan Hudec] wrote:
> On Tue, Apr 26, 2005 at 16:24:34 +0100, Jamie Lokier wrote:
> > John Stoffel wrote:
> > > >>>>> "Jamie" == Jamie Lokier <[email protected]> writes:
> > >
> > > Jamie> No. A transaction means that _all_ processes will see the
> > > Jamie> whole transaction or not.
> > >
> > > This is really hard. How do you handle the case where process X
> > > starts a transaction, modifies files a, b & c, but process Y has file b
> > > open for writing, and never lets it go? Or the file gets unlinked?
> >
> > Then it starts to depend on what kind of transactions you want to
> > implement.
> >
> > You can say that a transaction isn't allowed when a process has one of
> > the files opened for writing. Or you can say a transaction is
> > equivalent to calling all of the I/O system calls at once. You can
> > also decide if you want the reads and directory lookups performed in
> > the transactions to become prerequisites for the transaction
> > completing (so it's aborted if another process writes to those file
> > regions or changes the directory structure in a way which breaks a
> > prerequisite), or if you want those to lock the things which are read
> > for the duration of the transaction, or even just ignore reads for
> > transaction purposes. Or, you can say that transactions are limited
> > to just directory structure, and not file contents (that's good enough
> > for package management), or you can say they're limited to just file
> > contents (that's good enough for databases and text file edits).
> >
> > Etc, etc, quite a lot of semantic choices.
>
> How do we specify which calls belong to a transaction? By some kind of
> extra file handle?
>
> I'd think having global per-process transaction is not the best way.
> So I think we should have some kind of transaction handle (probably in
> the file handle space) and a way to say that a syscall is done within
> a transaction. To avoid duplicating all syscalls, we could have
> set_active_transaction() operation.

That's more or less what NTFS does. See the example at
http://blogs.msdn.com/because_we_can/



-- v --

[email protected]

2005-04-27 14:32:53

by Jamie Lokier

Subject: Re: [PATCH] private mounts

Miklos Szeredi wrote:
> > > This is the controversial part in all its glory:
> > >
> > >     if (!(fc->flags & FUSE_ALLOW_OTHER) && current->fsuid != fc->user_id)
> > >         return -EACCES;
> > >
> > > Leaving it out would gain us what exactly?
> >
> > Well, if it brings us ugly semantics, keeping those two lines out for
> > a while can help merge a lot...
>
> To the mount owner the semantics are quite normal. Others will be
> denied access to the mountpoint, which doesn't introduce any new
> semantics either.

Why, exactly, is this check in the kernel and not the FUSE daemon?

Someone said the FUSE daemon knows which user is making filesystem
requests, and can therefore do this. Is it true?

-- Jamie

2005-04-27 14:40:49

by Jamie Lokier

Subject: Re: [PATCH] private mounts

Miklos Szeredi wrote:
> > The userland tools don't need to know. They just need to not be suid.
>
> But I'd want to continue to distribute the non-crippled kernel module
> too, with suid fusermount. Then fusermount _has_ to know which kernel
> module is currently active.

You can use a version number or feature-bitmask for that.
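
A feature bitmask of the kind suggested here might look like the sketch below. The flag names, and the idea that the loaded module reports such a mask to fusermount, are assumptions made for illustration only; this is not an existing FUSE interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bits a kernel module might advertise. */
#define FEAT_USER_MOUNT  (1u << 0)  /* unprivileged mounts are safe */
#define FEAT_ALLOW_OTHER (1u << 1)  /* the allow_other option is understood */

/* True if every bit in `required` is also set in `advertised`. */
static bool has_features(uint32_t advertised, uint32_t required)
{
    return (advertised & required) == required;
}

/*
 * Decide whether a suid helper may perform an unprivileged mount,
 * given the mask reported by the currently loaded module.
 */
static bool may_user_mount(uint32_t advertised)
{
    return has_features(advertised, FEAT_USER_MOUNT);
}
```

A bitmask lets the helper test for individual capabilities instead of comparing version numbers, so a module that drops one feature does not need a special version.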

-- Jamie

2005-04-27 14:47:25

by Miklos Szeredi

Subject: Re: [PATCH] private mounts

> Why, exactly, is this check in the kernel and not the FUSE daemon?
>
> Someone said the FUSE daemon knows which user is making filesystem
> requests, and can therefore do this. Is it true?

Yes.

The check is in the kernel, because otherwise it couldn't be enforced.

It is not there for the purpose of protecting user's data. Rather for
protecting other users (including root) from unknowingly entering the
FUSE directory and thus leaking otherwise inaccessible information
(exact file operations performed) to the mount owner.
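
For illustration, the two-line check quoted earlier in the thread can be modelled in plain user-space C. The struct, the function name and the numeric flag value are stand-ins invented for this sketch; only the condition itself is taken from the quoted code.

```c
#include <errno.h>
#include <sys/types.h>

#define FUSE_ALLOW_OTHER (1 << 1)   /* flag value chosen for this sketch */

/* Minimal stand-in for the kernel's per-connection state. */
struct fuse_conn_model {
    unsigned flags;
    uid_t user_id;      /* uid of the mount owner */
};

/*
 * Returns 0 if a caller with the given fsuid may enter the mount,
 * -EACCES otherwise. The condition mirrors the quoted check.
 */
static int fuse_access_model(const struct fuse_conn_model *fc, uid_t fsuid)
{
    if (!(fc->flags & FUSE_ALLOW_OTHER) && fsuid != fc->user_id)
        return -EACCES;
    return 0;
}
```

Note that the comparison has no special case for uid 0, which is why root is kept out along with every other non-owner unless allow_other is set.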

It's probably not a great security risk, but it's better to be safe
than sorry. If a sysadmin decides it's not problematic, he can
relax this by

echo user_allow_other >> /etc/fuse.conf

Miklos

2005-04-27 14:56:18

by Miklos Szeredi

Subject: Re: [PATCH] private mounts

> > Why, exactly, is this check in the kernel and not the FUSE daemon?
> >
> > Someone said the FUSE daemon knows which user is making filesystem
> > requests, and can therefore do this. Is it true?
>
> Yes.
>
> The check is in the kernel, because otherwise it couldn't be enforced.

I'm going to compile a fuse-user-mount FAQ. This is about the 4th
time I've answered this question in this thread :)

Miklos

2005-04-27 15:18:32

by Jamie Lokier

Subject: Re: filesystem transactions API

Ville Herva wrote:
> > How do we specify which calls belong to a transaction? By some kind of
> > extra file handle?
> >
> > I'd think having global per-process transaction is not the best way.
> > So I think we should have some kind of transaction handle (probably in
> > the file handle space) and a way to say that a syscall is done within
> > a transaction. To avoid duplicating all syscalls, we could have
> > set_active_transaction() operation.
>
> That's more or less what NTFS does. See the example at
> http://blogs.msdn.com/because_we_can/

That's the obvious choice but it limits the usefulness quite a lot.

If we have transactions, then I'd like to be able to do this from a shell:

transaction_open t

tar xvpSfz blahblah.tar.gz
cd blahblah
patch -p1 -E < foo.patch
# etc.

transaction_close $t

I'd also like to write inside a single C program:

transaction * t = transaction_open ();

/* Ordinary complicated filesystem operations here... */
link (a, b);
rename (c, d);
/* read, write, stat etc. */
conf = open ("/etc/blahblah.conf", O_RDONLY);
read (conf, ...);
close (conf);
/* If /etc/blahblah.conf is changed by another program during
the transaction, the transaction is invalidated, because the
dbm update below is dependent on what was read... */
dbm_open (...);
do_dbm_stuff (...);
dbm_close (...);
/* Whatever this command does, I'd like to include in the transaction. */
system ("perl -pi -e 's/old_value/new_value/g' /etc/another.conf");

transaction_close (t);

Fundamentally, if transactions are supported in the kernel then these
two usages are easy to offer:

1. Ordinary file system calls as part of a transaction.

This allows libraries which are not transaction-aware to be
used, such as the dbm example above, and other things like XML
parsers/writers.

2. Subprocesses inherit a transaction, so a program can execute
complex transactions by using other programs.

It's useful, and there is no good reason to disallow that.

Nonetheless, there's a need for some kind of transaction handles. A
file descriptor representing a transaction seems like a natural fit.

Complex programs will want to have multiple transactions at the same
time: For example, any program structured using event-driven logic or
async I/O may have multiple independent state machines per thread,
each wanting to be able to have their own transactions.

This suggests a few things:

- Transactions have a file descriptor to represent them.

- Each thread has a "current transaction" that applies to all filesystem
operations.

- Concurrent threads will need their own current transactions, even
while keeping "current directory" global to the whole process for
POSIX reasons. A process wide "current transaction" is too coarse.

- Transactions should be automatically nestable: a program or
library which uses transactions should itself be callable from a
program or library which is using a transaction.

- Transactions should record whether they cannot provide
transactions for some operation that is attempted (e.g. writing to
a file on a remote filesystem), aborting the transaction.

- When a transaction aborts due to the actions of _another_ process
(or thread) which is outside the transaction, that abort is an
event which should be detectable synchronously (by polling the
transaction fd) or asynchronously (by a signal - the SIGIO
mechanism is fine for this).

- An exclusive locking period should be optional, requested by a
flag when opening the transaction. Most usages will want the
locking period with its default parameters.

- Ideally, programs or mechanisms which provide alternative views of
part of a filesystem, such as search results (Beagle), tarfs, or
mailfs, should be able to update synchronously with transactions
that affect whatever the view is watching, so that the view
changes are effectively part of the transaction. This does _not_
mean that a transaction must wait for watchers to calculate
anything. It does mean a transaction must synchronously and
simultaneously invalidate caches held by watchers during the
atomic commit.
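
Two of the points above, a per-thread "current transaction" and automatic nesting, can be sketched as a small user-space model. Everything here is hypothetical: the proposal describes a file descriptor, which is replaced by a plain struct, and the rule that an aborted inner transaction also aborts its parent is an assumption of this sketch, not something the list above specifies.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical transaction object standing in for a transaction fd. */
struct txn {
    struct txn *parent;   /* enclosing transaction, or NULL */
    int aborted;          /* non-zero once the transaction cannot commit */
};

/* Each thread has its own current transaction (C11 thread-local storage). */
static _Thread_local struct txn *current_txn;

/* Open a transaction nested inside the thread's current one, if any. */
static struct txn *txn_open(void)
{
    struct txn *t = calloc(1, sizeof(*t));
    if (t == NULL)
        return NULL;
    t->parent = current_txn;
    current_txn = t;
    return t;
}

/* Mark a transaction as unable to commit. */
static void txn_abort(struct txn *t)
{
    t->aborted = 1;
}

/*
 * Close the innermost transaction. Returns 0 on commit, -1 if the
 * transaction was aborted or if t is not the innermost one. In this
 * sketch an aborted inner transaction also aborts its parent.
 */
static int txn_close(struct txn *t)
{
    int aborted;

    if (t == NULL || t != current_txn)
        return -1;
    aborted = t->aborted;
    current_txn = t->parent;
    if (aborted && t->parent != NULL)
        t->parent->aborted = 1;
    free(t);
    return aborted ? -1 : 0;
}
```

Keeping the pointer thread-local is what lets independent threads run separate transactions while a library called from inside a transaction simply nests its own.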

-- Jamie

2005-04-27 15:33:35

by Martin Mares

Subject: Re: [PATCH] private mounts

Hello!

> It is not there for the purpose of protecting user's data. Rather for
> protecting other users (including root) from unknowingly entering the
> FUSE directory and thus leaking otherwise inaccessible information
> (exact file operations performed) to the mount owner.

Huh? Do you really suppose that there could be anything secret in the
operations somebody else is performing on your files?

I still don't see any real problem this check could ever solve.

Have a nice fortnight
--
Martin `MJ' Mares <[email protected]> http://atrey.karlin.mff.cuni.cz/~mj/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
God is real, unless declared integer.

2005-04-27 15:50:51

by Lars Marowsky-Bree

Subject: Re: [PATCH] private mounts

On 2005-04-27T17:33:20, Martin Mares <[email protected]> wrote:

> > It is not there for the purpose of protecting user's data. Rather for
> > protecting other users (including root) from unknowingly entering the
> > FUSE directory and thus leaking otherwise inaccessible information
> > (exact file operations performed) to the mount owner.
>
> Huh? Do you really suppose that there could be anything secret in the
> operations somebody else is performing on your files?

It is certainly an information leak not otherwise available. And with
the ability to change the layout underneath, you might trigger bugs in
root programs: Are they really capable of seeing the same filename
twice, or can you throw them into a deep recursion by simulating
infinitely deep directories/circular hardlinks...?

Certainly a useful tool for hardening applications, but I can see the
point of not wanting to let unwary applications run into a namespace
controlled by a user. Of course, this is sort-of similar to "find
-xdev", but I'm not sure whether it is not indeed new behaviour.



Sincerely,
Lars Marowsky-Brée <[email protected]>

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business

2005-04-27 16:46:55

by Martin Mares

Subject: Re: [PATCH] private mounts

Hello!

> It is certainly an information leak not otherwise available. And with
> the ability to change the layout underneath, you might trigger bugs in
> root programs: Are they really capable of seeing the same filename
> twice, or can you throw them into a deep recursion by simulating
> infinitely deep directories/circular hardlinks...?

Yes, it can help you trigger bugs, but all these bugs are triggerable
without user filesystems as well, although it's harder to do so.

Have a nice fortnight
--
Martin `MJ' Mares <[email protected]> http://atrey.karlin.mff.cuni.cz/~mj/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
If the government wants us to respect the law, it should set a better example.

2005-04-27 16:59:05

by Bernd Eckenfels

Subject: Re: filesystem transactions API

In article <[email protected]> you wrote:
> If we have transactions, then I'd like to be able to do this from a shell:
...
> I'd also like to write inside a single C program:

perhaps you will need to use plan9 or hurd? :)

Because this is pretty much virtualisation/snapshots. Anyway, it would be a
nice thing to have, for sure (I am not sure if all the technical
implications like deadlocks and serialisations can be solved in a
unix-compatible manner, especially for more than one local and
networked file system).

> It's useful, and there is no good reason to disallow that.

There might be no good reasons, but a lot of hard problems.

> Nonetheless, there's a need for some kind of transaction handles. A
> file descriptor representing a transaction seems like a natural fit.

Yes, that might be a good thing, because it can be passed, inherited,
access-controlled and possessed.

Greetings
Bernd

2005-04-27 17:34:42

by Miklos Szeredi

Subject: Re: [PATCH] private mounts

> It is certainly an information leak not otherwise available. And with
> the ability to change the layout underneath, you might trigger bugs in
> root programs: Are they really capable of seeing the same filename
> twice, or can you throw them into a deep recursion by simulating
> infinitely deep directories/circular hardlinks...?

Circular or otherwise hardlinked directories are not allowed since it
would not only confuse applications but the VFS as well.

Hmm, looking at the code it seems that for some reason I removed this
check from the 2.6 version of FUSE. Stupid me!

Thanks for the reminder :)

> Certainly a useful tool for hardening applications, but I can see the
> point of not wanting to let unwary applications run into a namespace
> controlled by a user. Of course, this is sort-of similar to "find
> -xdev", but I'm not sure whether it is not indeed new behaviour.

A trivial DoS against any process entering the userspace filesystem is
just not to answer the filesystem request.

So it's not just information leak, but also a fine way to _control_
certain behavior of applications.

Thanks,
Miklos

2005-04-27 17:40:05

by Miklos Szeredi

Subject: Re: [PATCH] private mounts

> > It is certainly an information leak not otherwise available. And with
> > the ability to change the layout underneath, you might trigger bugs in
> > root programs: Are they really capable of seeing the same filename
> > twice, or can you throw them into a deep recursion by simulating
> > infinitely deep directories/circular hardlinks...?
>
> Yes, it can help you trigger bugs, but all these bugs are triggerable
> without user filesystems as well, although it's harder to do so.

It's not just triggering bugs. You have very fine control over what
you present in your filesystem. Examples are huge files, huge
directories, operations that complete slowly or never at all.

Is it possible to limit all these from kernelspace? Probably yes,
although a timeout for operations is something that cuts either way.
And the complexity of these checks would probably be orders of
magnitude higher than the check we are currently discussing.

So this check _is_ needed on systems where the users cannot be trusted.

Thanks,
Miklos

2005-04-27 17:42:44

by Ram Pai

Subject: Re: [PATCH] private mounts

On Wed, 2005-04-27 at 10:33, Miklos Szeredi wrote:
> > It is certainly an information leak not otherwise available. And with
> > the ability to change the layout underneath, you might trigger bugs in
> > root programs: Are they really capable of seeing the same filename
> > twice, or can you throw them into a deep recursion by simulating
> > infinitely deep directories/circular hardlinks...?
>
> Circular or otherwise hardlinked directories are not allowed since it
> would not only confuse applications but the VFS as well.
>
> Hmm, looking at the code it seems that for some reason I removed this
> check from the 2.6 version of FUSE. Stupid me!
>
> Thanks for the reminder :)
>
> > Certainly a useful tool for hardening applications, but I can see the
> > point of not wanting to let unwary applications run into a namespace
> > controlled by a user. Of course, this is sort-of similar to "find
> > -xdev", but I'm not sure whether it is not indeed new behaviour.
>
> A trivial DoS against any process entering the userspace filesystem is
> just not to answer the filesystem request.
>
> So it's not just information leak, but also a fine way to _control_
> certain behavior of applications.
>

I think you need to disallow overmounts on invisible mounts by any user
other than the owner. If not, some other user (including root) can
overmount on your mount and the user will end up with DoS.

RP

> Thanks,
> Miklos

2005-04-27 17:49:00

by Miklos Szeredi

Subject: Re: [PATCH] private mounts

> I think you need to disallow overmounts on invisible mounts by any user
> other than the owner. If not, some other user (including root) can
> overmount on your mount and the user will end up with DoS.

I'm not following you here. How would an overmount cause DoS?

Thanks,
Miklos

2005-04-27 17:54:49

by Miklos Szeredi

Subject: Re: [PATCH] private mounts

> > > the ability to change the layout underneath, you might trigger bugs in
> > > root programs: Are they really capable of seeing the same filename
> > > twice, or can you throw them into a deep recursion by simulating
> > > infinitely deep directories/circular hardlinks...?
> > Circular or otherwise hardlinked directories are not allowed since it
> > would not only confuse applications but the VFS as well.
>
> Right, that you can catch. But can you prevent a user fs module from
> creating an infinitely deep directory structure out of thin air? Do you
> limit the maximum path length / depth?

No.

> (Sending this privately and not to LKML, because I first wanted to check
> the facts ;-)

OK, CC restored. You shouldn't be afraid to send to LKML. It's the
ultimate spam list ;)

> > > Certainly a useful tool for hardening applications, but I can see the
> > > point of not wanting to let unwary applications run into a namespace
> > > controlled by a user. Of course, this is sort-of similar to "find
> > > -xdev", but I'm not sure whether it is not indeed new behaviour.
> >
> > A trivial DoS against any process entering the userspace filesystem is
> > just not to answer the filesystem request.
> >
> > So it's not just information leak, but also a fine way to _control_
> > certain behavior of applications.
>
> Yes. I first thought the check was superfluous, because hey, why
> shouldn't root be able to access everything... But then it struck me
> that that might actually be a good idea for all those reasons. root's
> tools don't expect that the namespace they are traversing is
> _completely_ controlled by a user.

Exactly.

Thanks,
Miklos

2005-04-27 17:56:47

by Martin Mares

Subject: Re: [PATCH] private mounts

Hi Miklos!

> Is it possible to limit all these from kernelspace? Probably yes,
> although a timeout for operations is something that cuts either way.
> And the complexity of these checks would probably be orders of
> magnitude higher than the check we are currently discussing.

Yes ... but does the check we are discussing really solve the problem?

Let's say that you attempt to export home directories of users by a user-space
NFS daemon. This daemon probably changes its fsuid to match the remote user,
so the check happily accepts the access and the user is able to lock up the
daemon.

It doesn't seem that there is any simple and universal cure -- root programs
or setuid programs altering their fsuid are just too similar to the real user
programs to separate them cleanly.

I see a lot of similarities with symlinks -- many programs also need to take
extra care of symlinks to be safe. However, symlinks are already senior
citizens of Unix systems and programs have known how to cope with them for ages.

Maybe this could be taken advantage of by keeping all user mounts in a separate
directory like /mnt/usr (and /mnt is very likely to be avoided by all programs
traversing directory structure automatically) and symlinking from the requested
mount points there (with symlinks naturally not followed by automatic traversals).

I agree it isn't a neat solution, but it seems to be the first one which is
close to working.

Have a nice fortnight
--
Martin `MJ' Mares <[email protected]> http://atrey.karlin.mff.cuni.cz/~mj/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
Lisp Users: Due to the holiday, there will be no garbage collection on Monday.

2005-04-27 17:59:23

by Ram Pai

Subject: Re: [PATCH] private mounts

On Wed, 2005-04-27 at 10:47, Miklos Szeredi wrote:
> > I think you need to disallow overmounts on invisible mounts by any user
> > other than the owner. If not, some other user (including root) can
> > overmount on your mount and the user will end up with DoS.
>
> I'm not following you here. How would an overmount cause DoS?

eg:

user 1 does an invisible mount on /mnt/mnt1
root does a visible mount on /mnt/mnt1

user 1 will no longer be able to access his /mnt/mnt1

in fact even if root mounts something on /mnt, the problem still exists.

RP

>
> Thanks,
> Miklos

2005-04-27 18:09:50

by Miklos Szeredi

Subject: Re: [PATCH] private mounts

> > Is it possible to limit all these from kernelspace? Probably yes,
> > although a timeout for operations is something that cuts either way.
> > And the complexity of these checks would probably be orders of
> > magnitude higher than the check we are currently discussing.
>
> Yes ... but does the check we are discussing really solve the
> problem?
>
> Let's say that you attempt to export home directories of users by a
> user-space NFS daemon. This daemon probably changes its fsuid to
> match the remote user, so the check happily accepts the access and
> the user is able to lock up the daemon.

Valid point. The only defense is that when a program sets fsuid,
it's performing the operation "on behalf of the user". So the user is
actually doing DoS against himself.

Of course this is not strictly true. E.g. in the userspace NFS case
it's probably a DoS against all users of the mount.

> It doesn't seem that there is any simple and universal cure -- root
> programs or setuid programs altering their fsuid are just too
> similar to the real user programs to separate them cleanly.

Root programs setting fsuid are relatively rare, most are suid
programs originally started from the user (nfsd is an exception).

So yes, the fsuid check is not the perfect solution. However let me
remind you that neither is the one with private namespace.

> I see a lot of similarities with symlinks -- many programs also need
> to take extra care of symlinks to be safe. However, symlinks are
> already senior citizens of Unix systems and programs have known how
> to cope with them for ages.
>
> Maybe this could be taken advantage of by keeping all user mounts in
> a separate directory like /mnt/usr (and /mnt is very likely to be
> avoided by all programs traversing directory structure
> automatically) and symlinking from the requested mount points there
> (with symlinks naturally not followed by automatic traversals).

Maybe. It would be trivial to add a config option to fuse.conf to
limit user mounts to some directory.
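
A minimal sketch of such a restriction follows, assuming the helper already holds a canonical absolute mountpoint and a configured base directory. The function name is hypothetical, and no such fuse.conf option is described in this thread beyond the sentence above.

```c
#include <stdbool.h>
#include <string.h>

/*
 * True if `path` is `base` itself or lies below it. Both are assumed to
 * be absolute and already canonicalised (no "..", no symlinks, no
 * trailing slash), and `base` is assumed not to be "/".
 */
static bool path_is_under(const char *path, const char *base)
{
    size_t n = strlen(base);

    if (strncmp(path, base, n) != 0)
        return false;
    /* Require a component boundary so "/mnt/usrlocal" does not match. */
    return path[n] == '\0' || path[n] == '/';
}
```

A real helper would have to canonicalise the requested mountpoint first (for example with realpath(3)), otherwise a symlink or ".." component could step outside the base directory.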

Thanks,
Miklos



2005-04-27 18:12:28

by Miklos Szeredi

Subject: Re: [PATCH] private mounts

> eg:
>
> user 1 does an invisible mount on /mnt/mnt1
> root does a visible mount on /mnt/mnt1
>
> user 1 will no longer be able to access his /mnt/mnt1
>
> in fact even if root mounts something on /mnt, the problem still exists.

This is not something specific to FUSE. Root can overmount any of
your directories after which you won't be able to access it (unless
some of your processes have a CWD there).

Miklos

2005-04-27 18:26:13

by Martin Mares

Subject: Re: [PATCH] private mounts

Hello!

> So yes, the fsuid check is not the perfect solution. However let me
> remind you that neither is the one with private namespace.

What I'm arguing about is that the fsuid check is obscure (it breaks
traditional semantics of file permissions [*], it doesn't allow a user
to grant access to his user mount to other users, even if the permissions
allow that and so on) and it doesn't fully solve the problem anyway.

For similar reasons, I don't advocate for private namespaces either.

The cure more likely lies in simple policy rules like the "all user mounts
belong to /mnt/usr" one, instead of putting dubious policy into the kernel.

Have a nice fortnight
--
Martin `MJ' Mares <[email protected]> http://atrey.karlin.mff.cuni.cz/~mj/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
"Mr. Worf, scan that ship." "Aye, Captain... 600 DPI?"

2005-04-27 18:47:49

by Miklos Szeredi

Subject: Re: [PATCH] private mounts

> > So yes, the fsuid check is not the perfect solution. However let me
> > remind you that neither is the one with private namespace.
>
> What I'm arguing about is that the fsuid check is obscure (it breaks
> traditional semantics of file permissions [*],

No, the permissions are not visible to any other user. So there are
no semantics to break.

> it doesn't allow a user to grant access to his user mount to other
> users,

Yes, but that granting must be explicitly acknowledged by the grantee,
to avoid the problems previously discussed. It's probably something
possible to do with the private namespaces (sending mounts to other
user's namespaces, etc)

> even if the permissions allow that and so on) and it doesn't
> fully solve the problem anyway.

I think I know how to fully solve the problem. If the user has
permission to ptrace the process in question then he can already do
whatever he likes with that process, so userspace filesystem operation
can unconditionally be allowed. Otherwise it's no-no by default.
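
The ptrace idea can be sketched as a user-space model. The credential structs and function names are invented for this illustration, and the ptrace rule shown (owner's uid and gid must match the real, effective and saved ids of the caller, the caller must be dumpable, and a ptrace capability overrides both) is a simplified approximation rather than the kernel's exact test.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Credentials of the process issuing the filesystem request (stand-in). */
struct caller_creds {
    uid_t uid, euid, suid;
    gid_t gid, egid, sgid;
    bool dumpable;          /* false e.g. after exec of a suid binary */
};

/* Credentials of the mount owner (stand-in). */
struct owner_creds {
    uid_t uid;
    gid_t gid;
    bool cap_sys_ptrace;    /* owner holds the ptrace capability */
};

/* Simplified approximation of "the owner could ptrace this process". */
static bool owner_may_ptrace(const struct owner_creds *o,
                             const struct caller_creds *c)
{
    if (o->cap_sys_ptrace)
        return true;
    if (o->uid != c->uid || o->uid != c->euid || o->uid != c->suid)
        return false;
    if (o->gid != c->gid || o->gid != c->egid || o->gid != c->sgid)
        return false;
    return c->dumpable;
}

/* Proposed rule: allow if allow_other is set or the owner could ptrace. */
static bool request_allowed(const struct owner_creds *o,
                            const struct caller_creds *c,
                            bool allow_other)
{
    return allow_other || owner_may_ptrace(o, c);
}
```

Under this model a suid-root program started by the owner is refused, since its effective uid no longer matches, which is the case the plain fsuid comparison cannot distinguish.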

This thread is proving to be ever more useful :) Thanks everyone!

> For similar reasons, I don't advocate for private namespaces either.
>
> The cure more likely lies in simple policy rules like the "all user mounts
> belong to /mnt/usr" one, instead of putting dubious policy into the kernel.

I'm keeping policy out of the kernel by making the check optional.
Then the userspace daemon can enforce such policies as /mnt/usr.

I prefer the checking one, since I'm all alone on my machine and
don't want to share anything, but _do_ want to have mounts under my
home directory. You prefer the /mnt/usr one, since you want to share it
with others. A combination is also possible: the user chooses for each
mount which is preferable.

Agreed?

Miklos

2005-04-27 19:40:29

by Ram Pai

Subject: Re: [PATCH] private mounts

On Wed, 2005-04-27 at 11:09, Miklos Szeredi wrote:
> > eg:
> >
> > user 1 does an invisible mount on /mnt/mnt1
> > root does a visible mount on /mnt/mnt1
> >
> > user 1 will no longer be able to access his /mnt/mnt1
> >
> > in fact even if root mounts something on /mnt, the problem still exists.
>
> This is not something specific to FUSE. Root can overmount any of
> your directories after which you won't be able to access it (unless
> some of your processes have a CWD there).

Sorry, I think I have not raised my concern clearly.

I am mostly talking about the semantics of 'invisible/private mount' not
FUSE in particular, since the kernel patch brings in new feature
to VFS.

My understanding of private mount is:
1. The contents of the private mount is visible only to the
mount owner.
2. The vfsmount of the private mount is only accessible to
the mount owner, and only the mount owner can mount anything
on top of it.

But I don't see (2) being checked for.

I can overmount something on top of a private mount owned by some other
user. I verified that with your patch.

1. do an invisible mount as user 'x' on /mnt
2. do a visible mount as root on /mnt and it *succeeds* and also masks
the earlier mount to the user 'x'.

I am not concerned about the masking effect so much. But I am concerned
that the private vfsmount at /mnt is accessible to some other user
to mount something else on top of it. **The dentry on top of which the
new vfsmount is done belongs to the private vfsmount**.


Am I making sense? If I do make sense, then all we need is a patch on
top of your patch which disallows a non-owner from mounting something on
top of a private/invisible vfsmount owned by someone else.

If I am not making sense, I'll keep quiet :)
RP



>
> Miklos

2005-04-27 20:04:08

by Miklos Szeredi

Subject: Re: [PATCH] private mounts

> Sorry, I think I have not raised my concern clearly.
>
> I am mostly talking about the semantics of 'invisible/private mount' not
> FUSE in particular, since the kernel patch brings in new feature
> to VFS.
>
> My understanding of private mount is:
> 1. The contents of the private mount is visible only to the
> mount owner.
> 2. The vfsmount of the private mount is only accessible to
> the mount owner, and only the mount owner can mount anything
> on top of it.
>
> But I don't see (2) being checked for.

It's automatically enforced, since the mount syscall itself will use
the same path lookup mechanism as any other filesystem operation.

> I can overmount something on top of a private mount owned by some other
> user. I verified that with your patch.
>
> 1. do an invisible mount as user 'x' on /mnt
> 2. do a visible mount as root on /mnt and it *succeeds* and also masks
> the earlier mount to the user 'x'.

Yes, because a later mount on a _same_ dentry will mask an earlier
mount. But that does not mean, that the mount happened on the private
mount's root.

You can check where the mount ended up, by having a shell of 'x' cd to
the private mount. Then do the overmount. If the shell can still see
the private mount, then the overmount did not in fact mount over the
private root.
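
The masking behaviour can be illustrated with a small user-space model of the lookup. The visibility test follows mnt_visible() from the patch; the array standing in for the hash chain, the flag value, and the assumption that the newest mount on a mountpoint is found first are simplifications made for this sketch.

```c
#include <stddef.h>
#include <sys/types.h>

#define MNT_PRIVATE 0x1   /* flag value chosen for this sketch */

/* Stand-in for struct vfsmount, reduced to what the lookup needs. */
struct mount_model {
    int flags;
    uid_t mnt_uid;        /* owner, meaningful when MNT_PRIVATE is set */
    const char *name;
};

/* Same condition as mnt_visible() in the patch, with fsuid passed in. */
static int mnt_visible_model(const struct mount_model *m, uid_t fsuid)
{
    return !(m->flags & MNT_PRIVATE) || m->mnt_uid == fsuid;
}

/*
 * chain[0] is assumed to be the newest mount on this mountpoint.
 * Returns the first mount visible to fsuid, or NULL if there is none.
 */
static const struct mount_model *
lookup_mnt_model(const struct mount_model *chain, size_t n, uid_t fsuid)
{
    for (size_t i = 0; i < n; i++)
        if (mnt_visible_model(&chain[i], fsuid))
            return &chain[i];
    return NULL;
}
```

With only x's private mount present, a lookup by root finds nothing and stays on the underlying dentry, so root's later mount lands on that same dentry; afterwards x's lookup hits root's newer, visible mount first, which is the masking described above.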

> Am I making sense? If I do make sense, then all we need is a patch on
> top of your patch which disallows a non-owner from mounting something on
> top of a private/invisible vfsmount owned by someone else.

Yes it makes sense, but I think what you want is already the case.

Thanks,
Miklos

2005-04-27 20:56:04

by Bill Davidsen

Subject: Re: [PATCH] private mounts

Ram wrote:
> On Wed, 2005-04-27 at 11:09, Miklos Szeredi wrote:
>
>>>eg:
>>>
>>>user 1 does an invisible mount on /mnt/mnt1
>>>root does a visible mount on /mnt/mnt1
>>>
>>>user 1 will no longer be able to access his /mnt/mnt1
>>>
>>>in fact even if root mounts something on /mnt, the problem still exists.
>>
>>This is not something specific to FUSE. Root can overmount any of
>>your directories after which you won't be able to access it (unless
>>some of your processes have a CWD there).
>
>
> Sorry, I think I have not raised my concern clearly.
>
> I am mostly talking about the semantics of 'invisible/private mount' not
> FUSE in particular, since the kernel patch brings in new feature
> to VFS.
>
> My understanding of private mount is:
> 1. The contents of the private mount is visible only to the
> mount owner.
> 2. The vfsmount of the private mount is only accessible to
> the mount owner, and only the mount owner can mount anything
> on top of it.
>
> But I don't see (2) being checked for.
>
> I can overmount something on top of a private mount owned by some other
> user. I verified that with your patch.
>
> 1. do an invisible mount as user 'x' on /mnt
> 2. do a visible mount as root on /mnt and it *succeeds* and also masks
> the earlier mount to the user 'x'.
>
> I am not concerned about the masking effect so much. But I am concerned
> that the private vfsmount at /mnt is accessible to some other user
> to mount something else on top of it. **The dentry on top of which the
> new vfsmount is done belongs to the private vfsmount**.
>
>
> Am I making sense? If I do make sense, then all we need is a patch on
> top of your patch which disallows a non-owner from mounting something on
> top of a private/invisible vfsmount owned by someone else.
>
> If I am not making sense, I'll keep quiet :)

I think you point out that a solution could be worse than what it cures.
There are clearly problems with mount over, but imagine that a user does
an invisible mount over /mnt, doesn't that prevent other mounts which
are usually made, like /mnt/cdrom, /mnt/loopN, etc?

Every time someone suggests a solution it seems to open a new path to
possible abuse. And features which only work with a monolithic kernel
rather than modules would seem to indicate that the feature is nice but
the implementation might benefit from more thinking time.

Frankly the whole statement that the controversial code MUST go in now
and could be removed later sounds like a salesman telling me I MUST sign
the contract today, but he will let me out of it if I decide it was a
mistake.

I'm not against the feature, but a lot of people I consider competent
seem to find the implementation controversial, which argues for waiting
until more eyes are on the code. If the rest of the code is useless
without the controversial part, maybe it should all stay a patch to use
or not as people decide.

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2005-04-27 21:38:39

by Ram Pai

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On Wed, 2005-04-27 at 13:03, Miklos Szeredi wrote:
> > sorry, I think I have not raised by concern clearly.
> >
> > I am mostly talking about the semantics of 'invisible/private mount' not
> > FUSE in particular, since the kernel patch brings in new feature
> > to VFS.
> >
> > My understanding of private mount is:
> > 1. The contents of the private mount is visible only to the
> > mount owner.
> > 2. The vfsmount of the private mount is only accessible to
> > the mount owner, and only the mount owner can mount anything
> > on top of it.
> >
> > But I dont see (2) is being checked for.
>
> It's automatically enforced, since the mount syscall itself will use
> the same path lookup mechanism as any other filesystem operation.
>
> > I can overmount something on top of a private mount owned by someother
> > user. I verified that with your patch.
> >
> > 1. do a invisible mount as user 'x' on /mnt
> > 2. do a visible mount as root on /mnt and it *succeeds* and also masks
> > the earlier mount to the user 'x'.
>
> Yes, because a later mount on a _same_ dentry will mask an earlier
> mount. But that does not mean, that the mount happened on the private
> mount's root.
>
> You can check where the mount ended up, by having a shell of 'x' cd to
> the private mount. Then do the overmount. If the shell can still see
> the private mount, then the overmount did not in fact mount over the
> private root.

OK. Generally overmounts are done on the root dentry of the topmost
vfsmount. But in this case, your patch mounts it on the same dentry
as that of the private mount.

Essentially I was always under the assumption that 'a dentry can hold
only one vfsmount'. But invisible mounts seem to invalidate that
assumption. I don't see any bad effects of that. Probably some VFS
experts may. (Or probably my assumption is wrong to begin with.)
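
For readers following along, the masking behaviour discussed here can be modelled in user space. The sketch below is only a toy: the struct, the flag value and the names ending in _model are invented for illustration, and it assumes the hash chain is searched newest mount first. The visibility test itself mirrors mnt_visible() from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MNT_PRIVATE 0x1 /* illustrative value, not the kernel's */

/* Stand-in for the few struct vfsmount fields the patch looks at. */
struct mount_model {
    const struct mount_model *parent; /* mnt_parent */
    int mountpoint;                   /* id standing in for mnt_mountpoint */
    int flags;                        /* mnt_flags */
    unsigned int uid;                 /* mnt_uid, added by the patch */
};

/* Same test as mnt_visible() in the patch, with fsuid passed explicitly. */
static int mnt_visible_model(const struct mount_model *m, unsigned int fsuid)
{
    return !(m->flags & MODEL_MNT_PRIVATE) || m->uid == fsuid;
}

/* Return the first visible mount on (parent, dentry).
 * Assumption: the chain is ordered newest mount first. */
static const struct mount_model *
lookup_mnt_model(const struct mount_model *const *chain, size_t n,
                 const struct mount_model *parent, int dentry,
                 unsigned int fsuid)
{
    for (size_t i = 0; i < n; i++) {
        const struct mount_model *p = chain[i];

        if (p->parent == parent && p->mountpoint == dentry &&
            mnt_visible_model(p, fsuid))
            return p;
    }
    return NULL;
}

void private_mount_demo(void)
{
    const struct mount_model rootfs = { NULL, 0, 0, 0 };
    const struct mount_model priv = { &rootfs, 7, MODEL_MNT_PRIVATE, 1000 };
    const struct mount_model pub = { &rootfs, 7, 0, 0 };

    /* Only x's private mount exists: x (uid 1000) sees it, root does not. */
    const struct mount_model *const before[] = { &priv };
    assert(lookup_mnt_model(before, 1, &rootfs, 7, 1000) == &priv);
    assert(lookup_mnt_model(before, 1, &rootfs, 7, 0) == NULL);

    /* Root then mounts on the same dentry: both users now get root's mount. */
    const struct mount_model *const after[] = { &pub, &priv };
    assert(lookup_mnt_model(after, 2, &rootfs, 7, 1000) == &pub);
    assert(lookup_mnt_model(after, 2, &rootfs, 7, 0) == &pub);
}
```

In this model both mounts hang off the same (parent, dentry) pair, which is the situation described above: the later public mount masks the private one without being mounted on the private mount's root.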

RP


2005-04-27 23:00:07

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] private mounts

Hi!

> > The userland tools don't need to know. They just need to not be suid.
>
> But I'd want to continue to distribute the non-crippled kernel module
> too, with suid fusermount. Then fusermount _has_ to know which kernel
> module is currently active.

Add a mount flag and make kernel refuse mount on unknown flags?

> > Ok, here I say it is ugly (but not that it's crap). And the reason is,
> > that there is a permission system, with some semantics, and then various
> > filesystems adapt it in various ways to fit what they want. So every
> > filesystem ends up with its own little different behaviour.
> >
> > That being said, fuse does just about the same as NFS, samba and others
> > and I don't really see the reason why it couldn't be integrated. But
> > I am not the one to decide.
>
> Every opinion counts.
>
> I'm not trying to convince people that the current solution is
> perfect. What I'm saying is that it's
>
> a) not harmful
>
> b) it makes non-privileged mounts possible
>
> And b) is _the_ most important feature IMO, so the argument for
> stripping it out has to be very good.

Well, you'll have problems with suid programs suddenly not being able
to access files. nfs gets away with it, but nfs is perceived as
"broken" anyway...

Pavel
--
Boycott Kodak -- for their patent abuse against Java.

2005-04-27 23:22:21

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On Wed, 27.04.2005 at 16:58 (+0200), Pavel Machek wrote:

> >
> > And b) is _the_ most important feature IMO, so the argument for
> > stripping it out has to be very good.
>
> Well, you'll have problems with suid programs suddenly not being able
> to access files. nfs gets away with it, but nfs is perceived as
> "broken" anyway...

Really?

The NFS security model is based on the principle that the administrator
of the SERVER can override access permissions on his/her hardware. Pray
tell why you think that is "broken"?

Cheers,
Trond

--
Trond Myklebust <[email protected]>

2005-04-28 07:03:43

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] private mounts

> ok. Generally overmounts are done on the root dentry of the topmost
> vfsmount. But in this case, your patch mounts it on the same dentry
> as that of the private mount.
>
> Essentially I was always under the assertion that 'a dentry can hold
> only one vfsmount'. But invisible mount seem to invalidate that
> assertion.

You can do that without an invisible mount:

mkdir /tmp/mnt
mkdir /tmp/dir1
mkdir /tmp/dir1/subdir1
mkdir /tmp/dir2
mkdir /tmp/dir2/subdir2

cd /tmp/mnt
mount --bind /tmp/dir1 .
mount --bind /tmp/dir2 .

Now you have both /tmp/dir1 and /tmp/dir2 rooted at the same dentry.

To test this, in another shell do this just after the first bind mount:

cd /tmp/mnt

Then after the second mount do

ls -l subdir1/..

Now unmount everything and repeat the experiment, but do the mounts
this way:

mount --bind /tmp/dir1 /tmp/mnt
mount --bind /tmp/dir2 /tmp/mnt

Now the second mount is an overmount of the first, and you will
actually get different result from the "ls".
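
A rough way to see why the two experiments differ is to model mount crossing as a lookup keyed on (parent vfsmount, dentry). The sketch below is a user-space toy under several assumptions: all names are made up, the table is searched newest mount first, and "subdir1/.." is assumed to land on the root of the first bind mount, after which any mount stacked on that root is followed.

```c
#include <assert.h>
#include <stddef.h>

enum { D_NONE, D_TMP_MNT, D_DIR1, D_DIR2 }; /* toy dentry ids */

struct mnt_model {
    const struct mnt_model *parent; /* mnt_parent */
    int mountpoint;                 /* dentry this mount sits on */
    int root;                       /* dentry at the root of this mount */
};

struct pos_model {
    const struct mnt_model *mnt;
    int dentry;
};

/* Keep crossing into child mounts while one is mounted on (mnt, dentry). */
static struct pos_model
follow_mounts_model(const struct mnt_model *const *table, size_t n,
                    struct pos_model at)
{
    for (;;) {
        const struct mnt_model *child = NULL;

        for (size_t i = 0; i < n && !child; i++)
            if (table[i]->parent == at.mnt &&
                table[i]->mountpoint == at.dentry)
                child = table[i];
        if (!child)
            return at;
        at.mnt = child;
        at.dentry = child->root;
    }
}

void bind_mount_demo(void)
{
    const struct mnt_model rootfs = { NULL, D_NONE, D_NONE };
    const struct mnt_model m1 = { &rootfs, D_TMP_MNT, D_DIR1 };
    /* Where the second shell is after "subdir1/..": the root of mount 1. */
    const struct pos_model shell = { &m1, D_DIR1 };
    /* A fresh path lookup of /tmp/mnt starts at the underlying dentry. */
    const struct pos_model fresh = { &rootfs, D_TMP_MNT };

    /* Experiment 1: both mounts made on ".", the original /tmp/mnt dentry. */
    const struct mnt_model m2_same = { &rootfs, D_TMP_MNT, D_DIR2 };
    const struct mnt_model *const exp1[] = { &m2_same, &m1 };
    assert(follow_mounts_model(exp1, 2, shell).dentry == D_DIR1);
    assert(follow_mounts_model(exp1, 2, fresh).dentry == D_DIR2);

    /* Experiment 2: the path is resolved first, so mount 2 lands on
     * the root of mount 1 and is a true overmount. */
    const struct mnt_model m2_over = { &m1, D_DIR1, D_DIR2 };
    const struct mnt_model *const exp2[] = { &m2_over, &m1 };
    assert(follow_mounts_model(exp2, 2, shell).dentry == D_DIR2);
}
```

In the first experiment the shell sitting inside mount 1 keeps seeing dir1, because nothing is keyed on mount 1's root. In the second it is carried over into dir2.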

Playing with mounts is fun :)

Miklos

2005-04-28 07:25:52

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] private mounts

> I think you point out that a solution could be worse than what it cures.
> There are clearly problems with mount over, but imagine that a user does
> an invisible mount over /mnt, doesn't that prevent other mounts which
> are usually made, like /mnt/cdrom, /mnt/loopN, etc?

As previously explained, user mounts are only allowed on directories
for which the user has full write access. Exactly for this reason.

> Every time someone suggests a solution it seems to open a new path to
> possible abuse. And features which only work with a monolithic kernel
> rather than modules would seem to indicate that the feature is nice but
> the implementation might benefit from more thinking time.

Huh? Where did modularity come into this?

> Frankly the whole statement that the controversial code MUST go in now
> and could be removed later sounds like a salesman telling me I MUST sign
> the contract today, but he will let me out of it if I decide it was a
> mistake.

The point of this thread is to find a solution to a problem. The
discussion is turning up very interesting viewpoints and I'm
understanding the problem better and better, and I think other people
are too.

While I disagree with the view taken by Christoph H., I'm now also
thankful to him for stirring up the mud, because it ended up with a lot
of useful ideas.

In the end I'd like a solution that everybody is happy with. That
means I'm not going to give up searching because someone said that
the current solution is crappy.

Do you understand my position?

> I'm not against the feature, but a lot of people I consider competent
> seem to find the implementation controversial, which argues for waiting
> until more eyes are on the code.

Yes. I'm not going to ask Andrew to merge the code until I feel that
everybody concerned is happy with it. No matter how many release
cycles it takes.

> If the rest of the code is useless without the controversial part,
> maybe it should all stay a patch to use or not as people decide.

It has been distributed separately from the kernel for 3 years now.
So people _can_ try it out.

Thanks,
Miklos

2005-04-28 08:40:05

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On Wed 27-04-05 19:21:56, Trond Myklebust wrote:
> On Wed, 27.04.2005 at 16:58 (+0200), Pavel Machek wrote:
>
> > >
> > > And b) is _the_ most important feature IMO, so the argument for
> > > stripping it out has to be very good.
> >
> > Well, you'll have problems with suid programs suddenly not being able
> > to access files. nfs gets away with it, but nfs is perceived as
> > "broken" anyway...
>
> Really?
>
> The NFS security model is based on the principle that the administrator
> of the SERVER can override access permissions on his/her hardware. Pray
> tell why you think that is "broken"?

Well, administrator on CLIENT can impersonate whoever he wants, and if
data happens to be cached, he can just read them from local memory. So
whatever SERVER administrator does, CLIENT administrator can work
around.
Pavel
--
Boycott Kodak -- for their patent abuse against Java.

2005-04-28 08:40:06

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] private mounts

> > The NFS security model is based on the principle that the administrator
> > of the SERVER can override access permissions on his/her hardware. Pray
> > tell why you think that is "broken"?
>
> Well, administrator on CLIENT can impersonate whoever he wants,

Not really. Root squash has the very important effect that whatever
the client does, it cannot impersonate "root".

Miklos

2005-04-28 11:36:11

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On Thu, 28.04.2005 at 10:24 (+0200), Pavel Machek wrote:

> Well, administrator on CLIENT can impersonate whoever he wants, and if
> data happens to be cached, he can just read them from local memory. So
> whatever SERVER administrator does, CLIENT administrator can work
> around.

This is why you have identity squashing and/or strong security: to stop
the CLIENT administrator impersonating whoever he wants and working
around your security measures.

Yes there's all the FUD about how the administrator can still take over
your RPCSEC_GSS creds and/or read cached data once you have logged in.
If you log into a compromised client then you're screwed. What's new?

Trond

--
Trond Myklebust <[email protected]>

2005-04-28 13:28:49

by Eric Van Hensbergen

[permalink] [raw]
Subject: Re: [PATCH] private mounts

>
> Looking closer, I think we already have it.
>
> It's called /proc/NNN/root.
>
> Does chroot into /proc/NNN/root cause the chroot'ing process to adopt
> the namespace of NNN? Looking at the code, I think it does.
>
...
>
> So no new system calls are needed. A daemon to hand out per-user
> namespaces (or any other policy) can be written using existing
> kernels, and those namespaces can be joined using chroot.
>
> That's the theory anyway. It's always possible I misread the code (as
> I don't use namespaces and don't have tools handy to try them).
>

I've been thinking about this a bit more...would you even need chroot?
(wouldn't exposing chroot functionality to a user incur additional
security risk? I guess it would be okay as long as you were only
chrooting to one of your other process' roots?)

If you were organized about where the mounts in your private namespace
were done, you could just mount --bind them from
/proc/NNN/root/home/$USER/mnt (or something). That requires a certain
amount of discipline in your mounts (or maybe not -- just diff
/proc/NNN/mounts to see what you are missing and bind the
differences).

-eric

2005-04-28 13:47:48

by Eric Van Hensbergen

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On 4/26/05, Jamie Lokier <[email protected]> wrote:
>
> It's called /proc/NNN/root.
>
> So no new system calls are needed. A daemon to hand out per-user
> namespaces (or any other policy) can be written using existing
> kernels, and those namespaces can be joined using chroot.
>
> That's the theory anyway. It's always possible I misread the code (as
> I don't use namespaces and don't have tools handy to try them).
>

Should have checked myself before posting my previous reply -- but
this doesn't seem to work. /proc/NNN/root is represented as a
symlink, but when you CLONE_NS and then try to look at another one of
your process' /proc/NNN/root the link doesn't seem to have a target
and you get permission denied on all accesses. I haven't looked at
the underlying procfs code, but adapting procfs for this sort of
purpose feels wrong.

-eric

2005-04-28 17:59:04

by Bryan Henderson

[permalink] [raw]
Subject: Re: [PATCH] private mounts

>This is why you have identity squashing and/or strong security: to stop
>the CLIENT administrator impersonating whoever he wants and working
>around your security measures.

That's more of a confirmation than a refutation of the statement that NFS
root squashing is broken. Root squashing itself simply does not squash a
typical system administrator's ability to get at other people's files.
"broken" isn't the right word, because as long as you recognize root
squashing for what it is, it's working as designed. It just isn't what it
appears to be.

But, in the context of the current thread, I think the perception of NFS
root squashing as something broken and not to be built upon with private
mounts has to do with the fact that it messes up Linux's basic file
permission scheme: a process with CAP_DAC_OVERRIDE can get EACCES.
EACCES means discretionary access controls (DAC) prevent access. So this
behavior is unexpected and unnatural. Worse, an operation can succeed
_without_ CAP_DAC_OVERRIDE, but not _with_ it. I've seen this behavior
cause trouble a number of times -- mostly because it's entirely
unanticipated.
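
To make the surprise concrete, here is a toy sketch of root squashing in C. It assumes the server maps uid 0 to an anonymous uid before doing an ordinary owner/other permission check; the value 65534, the names squash_uid and server_read_check, and the omission of groups are all simplifications for illustration, not NFS server code.

```c
#include <assert.h>
#include <errno.h>

#define MODEL_ANON_UID 65534u /* assumption: a commonly used anonymous uid */

struct file_model {
    unsigned int owner; /* uid owning the file */
    unsigned int mode;  /* permission bits, e.g. 0600 */
};

/* Root squashing: requests from uid 0 are treated as the anonymous uid. */
static unsigned int squash_uid(unsigned int client_uid)
{
    return client_uid == 0 ? MODEL_ANON_UID : client_uid;
}

/* Server-side read check on the squashed uid (owner/other bits only). */
static int server_read_check(const struct file_model *f,
                             unsigned int client_uid)
{
    unsigned int uid = squash_uid(client_uid);
    unsigned int bit = (uid == f->owner) ? 0400u : 0004u;

    return (f->mode & bit) ? 0 : -EACCES;
}

void root_squash_demo(void)
{
    const struct file_model user_file = { 1000, 0600 };
    const struct file_model root_file = { 0, 0600 };

    /* The owner can read the file... */
    assert(server_read_check(&user_file, 1000) == 0);
    /* ...but client root, whatever its local capabilities, cannot. */
    assert(server_read_check(&user_file, 0) == -EACCES);
    /* Files owned by uid 0 on the server are out of reach as well. */
    assert(server_read_check(&root_file, 0) == -EACCES);
}
```

The second assertion is the case described above: the same read succeeds for the unprivileged owner and fails for a process that holds CAP_DAC_OVERRIDE on the client.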

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems

2005-04-28 18:14:44

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] private mounts

Hi!

> > > Is it possible to limit all these from kernelspace? Probably yes,
> > > although a timeout for operations is something that cuts either way.
> > > And the complexity of these checks would probably be orders of
> > > magnitude higher than the check we are currently discussing.
> >
> > Yes ... but does the check we are discussing really solve the
> > problem?
> >
> > Let's say that you attempt to export home directories of users by a
> > user-space NFS daemon. This daemon probably changes its fsuid to
> > match the remote user, so the check happily accepts the access and
> > the user is able to lock up the daemon.
>
> Valid point. The only defense is that when a program sets fsuid,
> it's performing the operation "on behalf of the user". So the user is
> actually doing DoS against himself.
>
> Of course this is not strictly true. E.g. in the userspace NFS case
> it's probably a DoS against all users of the mount.

Exactly. So can we simply merge root-only fuse, and then worry about
how to make it safe with user-mounted fuse? See your own unfsd example
for why user-mounting is bad.

One possible solution would be to have root-owned fused that
talks to user-owned fused-s and checks they are behaving correctly?

Second is somehow improving those two lines this long thread is all about...

Pavel
--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms

2005-04-28 19:21:18

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH] private mounts

Eric Van Hensbergen wrote:
> > It's called /proc/NNN/root.
> >
> > So no new system calls are needed. A daemon to hand out per-user
> > namespaces (or any other policy) can be written using existing
> > kernels, and those namespaces can be joined using chroot.
> >
> > That's the theory anyway. It's always possible I misread the code (as
> > I don't use namespaces and don't have tools handy to try them).
> >
>
> Should have checked myself before posting my previous reply -- but
> this doesn't seem to work. /proc/NNN/root is represented as a
> symlink, but when you CLONE_NS and then try to look at another one of
> your process' /proc/NNN/root the link doesn't seem to have a target
> and you get permission denied on all accesses.

I've looked at the code. Look in fs/proc/base.c (Linux 2.6.10),
proc_root_link().

I don't see anything there to prevent you from traversing to the
mounts in the other namespace.

So why is it failing? Any idea?

> I haven't looked at the underlying procfs code, but adapting procfs
> for this sort of purpose feels wrong.

Having a file/directory which represents namespaces held by another
process makes much more sense to me than new system calls and
inventing yet another id space to represent namespaces.

And, given that you can look at the filesystems another process can
see by doing ptrace on it, it might as well be accessible in a more
natural way.

-- Jamie

2005-04-28 19:23:26

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH] private mounts

Eric Van Hensbergen wrote:
> > Does chroot into /proc/NNN/root cause the chroot'ing process to adopt
> > the namespace of NNN? Looking at the code, I think it does.
>
> I've been thinking about this a bit more...would you even need chroot?
> (wouldn't exposing chroot functionality to a user incur additional
> security risk? I guess it would be okay as long as you were only
> chrooting to one of your other process' roots?)

You don't need to let an ordinary user do chroot.

The login process can do it before it changes uid to the user, the
same as it does to set up all the other per-user parameters.

-- Jamie

2005-04-28 19:30:40

by Ram Pai

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On Thu, 2005-04-28 at 00:00, Miklos Szeredi wrote:
> > ok. Generally overmounts are done on the root dentry of the topmost
> > vfsmount. But in this case, your patch mounts it on the same dentry
> > as that of the private mount.
> >
> > Essentially I was always under the assertion that 'a dentry can hold
> > only one vfsmount'. But invisible mount seem to invalidate that
> > assertion.
>
> You can do that without an invisible mount:
>
> mkdir /tmp/mnt
> mkdir /tmp/dir1
> mkdir /tmp/dir1/subdir1
> mkdir /tmp/dir2
> mkdir /tmp/dir2/subdir2
>
> cd /tmp/mnt
> mount --bind /tmp/dir1 .
> mount --bind /tmp/dir2 .
>
> Now you have both /tmp/dir1 and /tmp/dir2 rooted at the same dentry.

Ok. got it!. Agreed. great example!

Thanks,
RP

2005-04-28 19:39:59

by Ram Pai

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On Thu, 2005-04-28 at 12:20, Jamie Lokier wrote:
> Eric Van Hensbergen wrote:
> > > It's called /proc/NNN/root.
> > >
> > > So no new system calls are needed. A daemon to hand out per-user
> > > namespaces (or any other policy) can be written using existing
> > > kernels, and those namespaces can be joined using chroot.
> > >
> > > That's the theory anyway. It's always possible I misread the code (as
> > > I don't use namespaces and don't have tools handy to try them).
> > >
> >
> > Should have checked myself before posting my previous reply -- but
> > this doesn't seem to work. /proc/NNN/root is represented as a
> > symlink, but when you CLONE_NS and then try to look at another one of
> > your process' /proc/NNN/root the link doesn't seem to have a target
> > and you get permission denied on all accesses.
>
> I've looked at the code. Look in fs/proc/base.c (Linux 2.6.10),
> proc_root_link().
>
> I don't see anything there to prevent you from traversing to the
> mounts in the other namespace.
>
> So why is it failing? Any idea?

Since you are traversing a symlink, you will be traversing the symlink
in the context of the traversing process's namespace.

If process 'x' is traversing /proc/y/root, the lookup for the root
dentry will happen in the context of process x's namespace, and not
process y's namespace. Hence process 'x' won't really get into
the namespace of process y.

RP

2005-04-28 19:42:19

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] private mounts

> Exactly. So can we simply merge root-only fuse, and then worry
> how to make it safe with user-mounted fuse. See your own unfsd example
> why user-mounting is bad.
>
> One possible solution would be to have root-owned fused that
> talks to user-owned fused-s and checks they are behaving correctly?

It's very hard to do that. What should be the timeout for requests,
so that valid filesystems don't break, yet it's not possible to do a
fairly ugly DoS? It's almost impossible I'd say.

> Second is somehow improving those two lines this long thread is all about...

That's what I did. See the recent documentation and code patches
(cc-d to -fsdevel). I'm pretty convinced it's the right thing to do.
OK, I was with the previous solution too, but anyway ;)

Thanks,
Miklos

2005-04-28 19:46:39

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On Thu, 28.04.2005 at 10:58 (-0700), Bryan Henderson wrote:
> >This is why you have identity squashing and/or strong security: to stop
> >the CLIENT administrator impersonating whoever he wants and working
> >around your security measures.
>
> That's more of a confirmation than a refutation of the statement that NFS
> root squashing is broken. Root squashing itself simply does not squash a
> typical system administrator's ability to get at other people's files.
> "broken" isn't the right word, because as long as you recognize root
> squashing for what it is, it's working as designed. It just isn't what it
> appears to be.

Root squashing is there to enforce the policy that nobody gets to access
any files with uid=0,gid=0. IOW it is a policy that is first and
foremost meant to make root-owned files untouchable.

Strong security, OTOH, enforces the policy that you need to authenticate
as a given person in order to get at that person's files.

Neither can prevent man-in-the-middle style attacks by root on a client
that is compromised nor can they stop someone who has managed to lift
your username+password from somewhere. Every security policy has its
limitations.

> But, in the context of the current thread, I think the perception of NFS
> root squashing as something broken and not to be built upon with private
> mounts has to do with the fact that it messes up Linux's basic file
> permission scheme: a process with CAP_DAC_OVERRIDE can get EACCES.
> EACCES means discretionary access controls (DAC) prevent access. So this
> behavior is unexpected and unnatural. Worse, an operation can succeed
> _without_ CAP_DAC_OVERRIDE, but not _with_ it. I've seen this behavior
> cause trouble a number of times -- mostly because it's entirely
> unanticipated.

Tough. Your administrator has set a certain policy on the fileserver,
and it is being correctly enforced. If that policy decision turns out to
be unnecessarily strict then you are quite free to plead with the
administrator to change it.

Note that these days CAP_DAC_OVERRIDE is no longer guaranteed to be
sufficient even for local disk if the administrator is using LSM to set
up custom policies.

Cheers,
Trond
--
Trond Myklebust <[email protected]>

2005-04-28 20:21:51

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] private mounts

Hi!

> > Exactly. So can we simply merge root-only fuse, and then worry
> > how to make it safe with user-mounted fuse. See your own unfsd example
> > why user-mounting is bad.
> >
> > One possible solution would be to have root-owned fused that
> > talks to user-owned fused-s and checks they are behaving correctly?
>
> It's very hard to do that. What should be the timeout for requests,
> so that valid filesystems don't break, yet it's not possible to do a
> fairly ugly DoS? It's almost impossible I'd say.

You can still put those two lines into root-owned fused, where people
are less likely to notice them ;-).
Pavel
--
Boycott Kodak -- for their patent abuse against Java.

2005-04-28 22:09:19

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH] private mounts

Ram wrote:
> > I've looked at the code. Look in fs/proc/base.c (Linux 2.6.10),
> > proc_root_link().
> >
> > I don't see anything there to prevent you from traversing to the
> > mounts in the other namespace.
> >
> > So why is it failing? Any idea?
>
> Since you are traversing a symlink, you will be traversing the symlink
> in the context of the traversing process's namespace.
>
> If process 'x' is traversing /proc/y/root, the lookup for the root
> dentry will happen in the context of process x's namespace, and not
> process y's namespace. Hence process 'x' won't really get into
> the namespace of process y.

Lookups don't happen in the context of a namespace.

They happen in the context of a vfsmnt. And the switch to a new
vfsmnt is done by matching against (dentry,parent-vfsmnt) pairs.
current->namespace is only checked for mount & unmount operations, not
for path lookups.

Which means proc_root_link, when it switches to the vfsmnt at the root
of the other process, should traverse into the tree of vfsmnts which
make up the other namespace.

-- Jamie

2005-04-28 22:40:22

by Bryan Henderson

[permalink] [raw]
Subject: Re: [PATCH] private mounts

>Root squashing is there to enforce the policy that nobody gets to access
>any files with uid=0,gid=0. IOW it is a policy that is first and
>foremost meant to make root-owned files untouchable.

That's the only thing it does well, but you'd have to convince me that
that's what it was designed for and that's what everyone expects out of
it. The most salient effect of root squashing -- the one that takes
people by surprise -- is that it removes the special rights an NFS server
otherwise accords to uid 0. If protecting files owned by uid=0, gid=0
were the original design goal, the protocol could have been designed to do
that while still giving uid 0 access to everybody else's files.

>>a process with CAP_DAC_OVERRIDE can get EACCES. ... Whine, whine...
>Tough.

This is actually off-topic. We're not talking about whether root
squashing is a good compromise. We started with the statement that the
only existing thing like (some private mount proposal) is NFS root
squashing and the statement that some people consider that broken. That
elicited a response from you that suggested you were unaware there was
anything not to like about root squashing ("Really?") and then some
descriptions of the objections. The fact is that negative perceptions of
root squashing exist. I know you know that. There are respectable
technical people who don't agree with the compromise. So if one is
looking for a broadly acceptable design of private mounts, one might want
to find one that doesn't use NFS root squashing as its precedent.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems

2005-04-29 00:36:29

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On Thu, 28.04.2005 at 15:38 (-0700), Bryan Henderson wrote:
> >Root squashing is there to enforce the policy that nobody gets to access
> >any files with uid=0,gid=0. IOW it is a policy that is first and
> >foremost meant to make root-owned files untouchable.
>
> That's the only thing it does well, but you'd have to convince me that
> that's what it was designed for and that's what everyone expects out of
> it. The most salient effect of root squashing -- the one that takes
> people by surprise -- is that it removes the special rights an NFS server
> otherwise accords to uid 0. If protecting files owned by uid=0, gid=0
> were the original design goal, the protocol could have been designed to do
> that while still giving uid 0 access to everybody else's files.

That is much harder to do. The nfs server would have to take over the
permissions checking on behalf of whatever it is exporting for all
operations.

> >>a process with CAP_DAC_OVERRIDE can get EACCES. ... Whine, whine...
> >Tough.
>
> This is actually off-topic. We're not talking about whether root
> squashing is a good compromise. We started with the statement that the
> only existing thing like (some private mount proposal) is NFS root
> squashing and the statement that some people consider that broken. That
> elicited a response from you that suggested you were unaware there was
> anything not to like about root squashing ("Really?") and then some
> descriptions of the objections. The fact is that negative perceptions of
> root squashing exist. I know you know that. There are respectable
> technical people who don't agree with the compromise. So if one is
> looking for a broadly acceptable design of private mounts, one might want
> to find one that doesn't use NFS root squashing as its precedent.

The lack of agreement on root squashing is a reason for it to be a
matter of administrator-defined policy, and why the solution chosen
_should_ allow for that kind of behaviour.

If the user is free to futz around with the namespace, then it makes a
lot of sense for administrators to want to restrict access to this
user-defined namespace to non-suid programs that won't start screwing
round with opening files on arbitrary filesystems using the wrong
credentials and/or capabilities.
Particularly so if the user is capable of mounting remote filesystems.

Trond
--
Trond Myklebust <[email protected]>

2005-04-29 07:57:25

by Ram Pai

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On Thu, 2005-04-28 at 15:08, Jamie Lokier wrote:
> Ram wrote:
> > > I've looked at the code. Look in fs/proc/base.c (Linux 2.6.10),
> > > proc_root_link().
> > >
> > > I don't see anything there to prevent you from traversing to the
> > > mounts in the other namespace.
> > >
> > > So why is it failing? Any idea?
> >
> > Since you are traversing a symlink, you will be traversing the symlink
> > in the context of the traversing process's namespace.
> >
> > If process 'x' is traversing /proc/y/root, the lookup for the root
> > dentry will happen in the context of process x's namespace, and not
> > process y's namespace. Hence process 'x' won't really get into
> > the namespace of process y.
>
> Lookups don't happen in the context of a namespace.
>
> They happen in the context of a vfsmnt. And the switch to a new
> vfsmnt is done by matching against (dentry,parent-vfsmnt) pairs.
> current->namespace is only checked for mount & unmount operations, not
> for path lookups.

I looked deeper into the code and realized that in procfs the symlink is
not followed through link_path_walk(). Instead, it is expected to
return the root vfsmount of the traversed process, as you rightly
pointed out.


>
> Which means proc_root_link, when it switches to the vfsmnt at the root
> of the other process, should traverse into the tree of vfsmnts which
> make up the other namespace.

Yes. But proc_check_root() in proc_pid_follow_link() is failing the
traversal, because it is expecting the root vfsmount of the traversed
process to belong to the vfsmount tree of the traversing process.
In other words, it's expecting them both to be in the same namespace.

The permissions get denied by this code in proc_check_root():

	while (vfsmnt != our_vfsmnt) {
		if (vfsmnt == vfsmnt->mnt_parent)
			goto out;
		de = vfsmnt->mnt_mountpoint;
		vfsmnt = vfsmnt->mnt_parent;
	}
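
The effect of that loop can be reproduced with a small user-space model. This is only a sketch: struct mnt_model and check_root_model are invented names, only the parent walk is modelled (not the rest of proc_check_root()), and the top of each mount tree is assumed to be its own parent, which is what the quoted test relies on.

```c
#include <assert.h>
#include <errno.h>

struct mnt_model {
    const struct mnt_model *parent; /* mnt_parent; a tree's top points at itself */
};

/* Walk up from the target's root mount until we meet our own root mount.
 * Reaching the top of a tree first means the target is in another tree. */
static int check_root_model(const struct mnt_model *vfsmnt,
                            const struct mnt_model *our_vfsmnt)
{
    while (vfsmnt != our_vfsmnt) {
        if (vfsmnt == vfsmnt->parent)
            return -EACCES;
        vfsmnt = vfsmnt->parent;
    }
    return 0;
}

void check_root_demo(void)
{
    struct mnt_model root_x, sub_x, root_y;

    root_x.parent = &root_x; /* top of process x's tree */
    sub_x.parent = &root_x;  /* a mount inside x's tree */
    root_y.parent = &root_y; /* top of a separate tree (after CLONE_NS) */

    /* Targets inside our own tree pass the check. */
    assert(check_root_model(&sub_x, &root_x) == 0);
    assert(check_root_model(&root_x, &root_x) == 0);
    /* A root mount from a different tree is refused. */
    assert(check_root_model(&root_y, &root_x) == -EACCES);
}
```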

RP
> -- Jamie

2005-04-29 14:13:58

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] private mounts

> >
> > Which means proc_root_link, when it switches to the vfsmnt at the root
> > of the other process, should traverse into the tree of vfsmnts which
> > make up the other namespace.
>
> Yes. But proc_check_root() in proc_pid_follow_link() is failing the
> traversal, because it is expecting the root vfsmount of the traversed
> process to belong to the vfsmount tree of the traversing process.
> In other words, it's expecting both of them to be in the same namespace.
>
> The permissions get denied by this code in proc_check_root():
>

Removing the check makes chroot enter the tree under the other
process's namespace. However it does not actually change the
namespace, hence mount/umount won't work.

So joining a namespace does need a new syscall, unfortunately.

Miklos

2005-04-29 14:44:40

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH] private mounts

Miklos Szeredi wrote:
> Removing the check makes chroot enter the tree under the other
> process's namespace. However it does not actually change the
> namespace, hence mount/umount won't work.
>
> So joining a namespace does need a new syscall, unfortunately.

It would be trivial to copy mnt->mnt_namespace to current->namespace
in set_fs_root. No need for a syscall just for that.

Given that it works, the right place to decide whether it's allowed is
the permissions on /proc/NNN/root. But remember that you can already
access another process' namespace using ptrace on that process, so
this doesn't relax security if /proc/NNN/root can be entered whenever
ptrace is allowed.

I would really like to know what the purpose of check_mnt() is in
namespace.c. In standard kernels you can't enter another process'
namespace (without the change you tried in proc/base.c), so I don't see
how check_mnt() can _ever_ fail. Can it?

And if it can't fail, is there any need for current->namespace, or can
it just be removed?

-- Jamie

2005-04-29 14:57:06

by Jamie Lokier

[permalink] [raw]
Subject: Question about current->namespace and check_mnt()

Hi Al,

I have a specific namespace.c question:

I would really like to know what the purpose of check_mnt() is in
namespace.c. In standard kernels you can't enter another process'
namespace so I don't see how check_mnt() can _ever_ fail. Can it
fail, or in other words, what is the purpose of that check?

And if it can't fail, is there really a need for current->namespace, or
can it just be removed?

Also, I would think the current process' rootmnt->mnt_namespace would
adequately define the "current process namespace", so making
current->namespace redundant in that way. Is that right?

Thanks,
-- Jamie

2005-04-30 08:33:22

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On Mon, Apr 25, 2005 at 08:07:34PM +0100, Jamie Lokier wrote:
> Pavel Machek wrote:
> > > > ... is the same as for the same question with "set of mounts" replaced
> > > > with "environment variables".
> > >
> > > Not quite.
> > >
> > > After changing environment variables in .profile, you can copy them to
> > > other shells using ". ~/.profile".
> > >
> > > There is no analogous mechanism to copy namespaces.
> >
> > Actually, after you add right mount xyzzy /foo lines into .profile,
> > you can just . ~/.profile ;-).
>
> Is there a mount command that can do that? We're talking about
> private mounts - invisible to other namespaces, which includes the
> other shells.
>
> If there was a /proc/NNN/namespace, that would do the trick :)

I don't think you need a /proc/NNN/namespace; /proc/NNN/mounts already
contains a mount table. It's pretty trivial to write a small shell script
to parse that, compare it with the current namespace and do all the
mounts/umounts needed to make it match the other process's namespace. The
real problem here is filesystems that don't implement ->show_options, or
do so only partially,
so that some options are lost.
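
The first step of such a script is reading the other process's mount
table. A rough sketch in C rather than shell, assuming the usual
"device mountpoint fstype options ..." line layout of /proc/NNN/mounts;
struct mount_entry, parse_mount_line() and list_mounts() are names made
up for this illustration.

```c
#include <stdio.h>

/* One line of /proc/<pid>/mounts: device, mount point, type, options.
 * (Hypothetical helper type for this sketch.) */
struct mount_entry {
	char dev[256];
	char dir[256];
	char type[64];
	char opts[512];
};

/* Parse a single line; returns 1 on success, 0 if it is malformed. */
static int parse_mount_line(const char *line, struct mount_entry *e)
{
	return sscanf(line, "%255s %255s %63s %511s",
		      e->dev, e->dir, e->type, e->opts) == 4;
}

/* Print the mount table of process 'pid'.
 * Returns the number of entries printed, or -1 if it cannot be read. */
static int list_mounts(int pid)
{
	char path[64], line[1200];
	struct mount_entry e;
	int n = 0;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/mounts", pid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (parse_mount_line(line, &e)) {
			printf("%s on %s type %s (%s)\n",
			       e.dev, e.dir, e.type, e.opts);
			n++;
		}
	}
	fclose(f);
	return n;
}
```

The options column is only as complete as each filesystem's
->show_options output, which is exactly the limitation noted above.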

2005-04-30 08:35:24

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On Mon, Apr 25, 2005 at 11:58:50AM +0200, Miklos Szeredi wrote:
> > I can't write a script that reads your mind. But I sure can write
> > a script that finds out what you mounted in the other shells (with help
> > of a little wrapper around the mount command).
>
> How do you bind mount it from a different namespace? You _do_ need
> bind mount, since a new mount might require password input, etc...

Not necessarily. The filesystem gets called into ->get_sb for every mount,
and can then decide whether to return an existing superblock instance or
set up a new one. If the credentials for the new mount match an old one
it can just reuse it. (E.g. a block-based filesystem will always reuse
it right now.)

2005-04-30 08:37:58

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On Mon, Apr 25, 2005 at 11:48:04AM +0200, Olivier Galibert wrote:
> On Sun, Apr 24, 2005 at 10:19:42PM +0100, Al Viro wrote:
> > Of course you can. It does execute the obvious set of rc files.
>
> Is there a possibility for a process to change its namespace to
> another existing one? That would be needed to have a per-user
> namespace you go to from rc files or pam.

It is not possible right now, and I don't think joining a namespace is a concept
that fits very well into our architecture. What does make sense is an
unshare() syscall that takes the CLONE_* argument and unshares those in
the current process from the parent without creating a new process. Then
you can easily reproduce another namespace by value instead of by reference.
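
For illustration, here is roughly what such a call looks like from user
space. At the time of this message unshare() was only a proposal; the
sketch assumes the unshare(2) interface that later Linux kernels and
glibc provide, and unshare_mount_ns() is a hypothetical wrapper name.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <errno.h>

/* Give the calling process a private copy of its mount namespace
 * without creating a new process.
 * Returns 0 on success, or -errno on failure (typically -EPERM when
 * the caller lacks CAP_SYS_ADMIN). */
static int unshare_mount_ns(void)
{
	if (unshare(CLONE_NEWNS) == -1)
		return -errno;
	return 0;
}
```

After a successful call, mounts made by the process stay private to it,
so another namespace would be reproduced by repeating its mounts ("by
value") rather than by attaching to it.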

2005-04-30 09:26:25

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] private mounts

> > > I can't write a script that reads your mind. But I sure can write
> > > a script that finds out what you mounted in the other shells (with help
> > > of a little wrapper around the mount command).
> >
> > How do you bind mount it from a different namespace? You _do_ need
> > bind mount, since a new mount might require password input, etc...
>
> Not necessarily. The filesystem gets called into ->get_sb for every mount,
> and can then decide whether to return an existing superblock instance or
> set up a new one. If the credentials for the new mount match an old one
> it can just reuse it. (E.g. a block-based filesystem will always reuse
> it right now.)

And if the credentials are checked in userspace (sshfs)?

Miklos

2005-04-30 09:43:24

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH] private mounts

Miklos Szeredi wrote:
> > > How do you bind mount it from a different namespace? You _do_ need
> > > bind mount, since a new mount might require password input, etc...
> >
> > Not necessarily. The filesystem gets called into ->get_sb for every mount,
> > and can then decide whether to return an existing superblock instance or
> > set up a new one. If the credentials for the new mount match an old one
> > it can just reuse it. (E.g. a block-based filesystem will always reuse
> > it right now.)
>
> And if the credentials are checked in userspace (sshfs)?

Well, if you can find a way to tell the userspace FUSE daemon to know
that the mount is being done by the same user as the existing mount,
you don't need (or want) to check the credentials - you want the FUSE
daemon to tell the kernel code which superblock to reuse.

This hack is a bit nasty - namespace per login, copying mounts
from another login's namespace - but it would work.

-- Jamie

2005-04-30 10:16:00

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] private mounts

> Well, if you can find a way to tell the userspace FUSE daemon to know
> that the mount is being done by the same user as the existing mount,
> you don't need (or want) to check the credentials - you want the FUSE
> daemon to tell the kernel code which superblock to reuse.

It sounds very _very_ complicated compared to just using bind mounts.

And maybe the user _does_ want a new connection to the same server
(for whatever reason). Why should we _force_ a sharing of
superblocks?

Miklos

2005-04-30 14:36:34

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH] private mounts

Miklos Szeredi wrote:
> > Well, if you can find a way to tell the userspace FUSE daemon to know
> > that the mount is being done by the same user as the existing mount,
> > you don't need (or want) to check the credentials - you want the FUSE
> > daemon to tell the kernel code which superblock to reuse.
>
> It sounds very _very_ complicated compared to just using bind mounts.
>
> And maybe the user _does_ want a new connection to the same server
> (for whatever reason). Why should we _force_ a sharing of
> superblocks?

The point is that you can decide whether to do that in userspace.
It's up to whatever code you put in the _userspace_ FUSE commands.

No kernel support for bind mounts from another namespace is required.

Actually, in terms of complexity, it's not much different from using
bind mounts. Either way involves finding all the mounts of another
session and copying them one by one: either by getting confirmation
from the daemon to attach to the same superblock, or by getting
handles from the daemon for all the individual directories to bind
mount.

In all, I think private namespaces are still the cleaner way to do it
_when_ a user wants their mounts to appear in multiple sessions anyway.

But bind mounts or superblock sharing are more flexible, at the same
time as being more cumbersome as a user interface.

-- Jamie

2005-04-30 16:00:39

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] private mounts

> Actually, in terms of complexity, it's not much different from using
> bind mounts.

As has been suggested by Pavel, bind mounting foreign namespaces could
just be done with a new bind_fd(fd, path) syscall and file descriptor
passing with SCM_RIGHTS.

That sounds to me orders of magnitude less complex (on the kernel side
at least) than sb sharing.

Miklos



2005-04-30 16:44:45

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH] private mounts

Miklos Szeredi wrote:
> > Actually, in terms of complexity, it's not much different from using
> > bind mounts.
>
> As has been suggested by Pavel, bind mounting foreign namespaces could
> just be done with a new bind_fd(fd, path) syscall and file descriptor
> passing with SCM_RIGHTS.

Yes, he's right.

But you don't need a new system call to bind an fd.

"mount --bind /proc/self/fd/N mount_point" works, try it.

> That sounds to me orders of magnitude less complex (on the kernel side
> at least) than sb sharing.

In terms of what happens in the kernel, they're almost exactly the
same: either way, a super block ends up shared by two mounts. That's
what I meant.

I agree that in terms of what userspace has to do, if just binding
works that's simpler. And it does seem to work with the above mount
command.

-- Jamie
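
In C, the mount command above amounts to a mount(2) call with MS_BIND
whose source is the /proc/self/fd/N path. A small sketch, assuming the
caller has the privilege to mount; fd_proc_path() and bind_mount_fd()
are hypothetical helper names.

```c
#include <stdio.h>
#include <sys/mount.h>

/* Build the /proc path that names an open descriptor.
 * Returns 0 on success, -1 if 'buf' is too small. */
static int fd_proc_path(int fd, char *buf, size_t len)
{
	int n = snprintf(buf, len, "/proc/self/fd/%d", fd);

	return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Rough equivalent of "mount --bind /proc/self/fd/N target".
 * Returns 0 on success, -1 on error (errno set by mount(2)). */
static int bind_mount_fd(int fd, const char *target)
{
	char src[64];

	if (fd_proc_path(fd, src, sizeof(src)) != 0)
		return -1;
	return mount(src, target, NULL, MS_BIND, NULL);
}
```

Combined with a descriptor received over a Unix socket, this is the
userspace-only route to binding a directory that lives in another
namespace.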

2005-04-30 16:49:00

by Ram Pai

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On Sat, 2005-04-30 at 01:33, Christoph Hellwig wrote:
> On Mon, Apr 25, 2005 at 08:07:34PM +0100, Jamie Lokier wrote:
> > Pavel Machek wrote:
> > > > > ... is the same as for the same question with "set of mounts" replaced
> > > > > with "environment variables".
> > > >
> > > > Not quite.
> > > >
> > > > After changing environment variables in .profile, you can copy them to
> > > > other shells using ". ~/.profile".
> > > >
> > > > There is no analogous mechanism to copy namespaces.
> > >
> > > Actually, after you add right mount xyzzy /foo lines into .profile,
> > > you can just . ~/.profile ;-).
> >
> > Is there a mount command that can do that? We're talking about
> > private mounts - invisible to other namespaces, which includes the
> > other shells.
> >
> > If there was a /proc/NNN/namespace, that would do the trick :)
>
> I don't think you need a /proc/NNN/namespace; /proc/NNN/mounts already
> contains a mount table. It's pretty trivial to write a small shell script
> to parse that, compare it with the current namespace and do all the
> mounts/umounts needed to make it match the other process's namespace. The
> real problem here is filesystems that don't implement ->show_options, or
> do so only partially, so that some options are lost.

The other problem is: how would new mounts in any of these namespaces
propagate to other namespaces owned by the same user?

I mean, how will the other namespaces belonging to the same user be
able to pull the mounts into their namespaces? Shared subtrees won't be
a solution because these namespaces won't have a parent-child
relationship to begin with, for the propagation to be set up.


RP




2005-04-30 17:08:33

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] private mounts

> But you don't need a new system call to bind an fd.
>
> "mount --bind /proc/self/fd/N mount_point" works, try it.

Ahh, yes :)

Still, proc_check_root() has to be relaxed to allow dereferencing a link
under a different namespace. Maybe the check should be skipped for
capable(CAP_SYS_ADMIN) or similar.

What do people think about that?

Miklos

2005-04-30 18:20:32

by Olivier Galibert

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On Sat, Apr 30, 2005 at 07:07:56PM +0200, Miklos Szeredi wrote:
> > But you don't need a new system call to bind an fd.
> >
> > "mount --bind /proc/self/fd/N mount_point" works, try it.
>
> Ahh, yes :)
>
> Still, proc_check_root() has to be relaxed to allow dereferencing a link
> under a different namespace. Maybe the check should be skipped for
> capable(CAP_SYS_ADMIN) or similar.
>
> What do people think about that?

To me it looks like an atrocious hack that works only because of the
way the implementation is done and not really by design. A well
defined interface where what you want to do is explicitly stated is way
less annoying long term. I don't know what the right approach would be
(join <ns> vs. exec in <ns> vs. clone in <ns>) or even what a
namespace reference should look like (fd, pid, something else), and
probably only Al has a good idea of that. Al, you've been quite
silent here. What do you think the right method/interface would be to
start an interactive shell in a pre-existing different namespace?

OG.

2005-04-30 23:55:17

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH] private mounts

Miklos Szeredi wrote:
> > But you don't need a new system call to bind an fd.
> >
> > "mount --bind /proc/self/fd/N mount_point" works, try it.
>
> Ahh, yes :)
>
> Still, proc_check_root() has to be relaxed to allow dereferencing a link
> under a different namespace.

Not necessary.

Why not have the FUSE daemon keep open a file descriptor for the
directory it's mounted on, and have it send that to new would-be
mounters of the same directory using a unix domain socket (rather as
Pavel suggested)?

> Maybe the check should be skipped for
> capable(CAP_SYS_ADMIN) or similar.

No. The check is to prevent processes in chroot jails from accessing
directories outside their jail. Even CAP_SYS_ADMIN processes must be
forbidden from doing that.

But proc_check_root is unnecessarily strict, in that it prevents a
process from traversing into a "child" namespace.

IMHO, a better security restriction anyway would be for processes in
chroot jails to not be able to see processes outside the jail in /proc
- only processes inside the jail should be visible. I think everyone
agrees that would be best.

If that were implemented, then proc_check_root would be redundant and
could be removed entirely.

-- Jamie

2005-04-30 23:58:43

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH] private mounts

Olivier Galibert wrote:
> > > "mount --bind /proc/self/fd/N mount_point" works, try it.
> >
> > What do people think about that?
>
> To me it looks like an atrocious hack that works only because of the
> way the implementation is done and not really by design.

From fs/namespace.c:do_loopback, the function which does bind mounts:

if (check_mnt(nd->mnt) && (!recurse || check_mnt(old_nd.mnt))) {

check_mnt() verifies that a mountpoint is in the same namespace as the
current process. recurse is set for --rbind mounts, but not --bind mounts.

Notice how old_nd.mnt is explicitly _not_ checked for being in the current
namespace when doing --bind?

That says to me that Al thought about this case, and coded for it...

(I'm still not clear why the check_mnt() calls are needed at all, though).

-- Jamie
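
The asymmetry in that condition is easier to see in isolation. Below is
a user-space model, assuming (as described above) that check_mnt()
compares a mount's namespace with the current process's namespace; the
*_model types and functions are hypothetical and only mimic the logic
of the quoted line.

```c
/* Hypothetical stand-ins for struct namespace and struct vfsmount. */
struct ns_model {
	int id;
};

struct vfsmnt_model {
	const struct ns_model *mnt_namespace;
};

/* Model of check_mnt(): is 'mnt' in the caller's namespace? */
static int check_mnt_model(const struct vfsmnt_model *mnt,
			   const struct ns_model *current_ns)
{
	return mnt->mnt_namespace == current_ns;
}

/* Model of the quoted do_loopback() condition. The destination must
 * always be in the caller's namespace; the source only has to be when
 * 'recurse' is set (--rbind). Returns 1 if the bind is permitted. */
static int loopback_allowed_model(const struct vfsmnt_model *dest,
				  const struct vfsmnt_model *src,
				  int recurse,
				  const struct ns_model *current_ns)
{
	return check_mnt_model(dest, current_ns) &&
	       (!recurse || check_mnt_model(src, current_ns));
}
```

Under this model a plain --bind from a mount in a foreign namespace
passes the check, while --rbind from the same source does not.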

2005-05-01 02:40:00

by Ram Pai

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On Sat, 2005-04-30 at 16:58, Jamie Lokier wrote:
> Olivier Galibert wrote:
> > > > "mount --bind /proc/self/fd/N mount_point" works, try it.
> > >
> > > What do people think about that?
> >
> > To me it looks like an atrocious hack that works only because of the
> > way the implementation is done and not really by design.
>
> From fs/namespace.c:do_loopback, the function which does bind mounts:
>
> if (check_mnt(nd->mnt) && (!recurse || check_mnt(old_nd.mnt))) {
>
> check_mnt() verifies that a mountpoint is in the same namespace as the
> current process. recurse is set for --rbind mounts, but not --bind mounts.
>
> Notice how old_nd.mnt is explicitly _not_ checked for being in the current
> namespace when doing --bind?

> That says to me that Al thought about this case, and coded for it...
>
> (I'm still not clear why the check_mnt() calls are needed at all, though).
>
Making a wild guess.

What if some filesystem allowed access to a vfsmount in another namespace?
Just like the proc filesystem, which has the ability to do so but
marginally stops it through the check in proc_check_root().

However, the check you mentioned above, where a bind mount across
namespaces is allowed, implies that there is some legal way of getting
access to vfsmounts in another namespace. Or maybe there is a remote
possibility that it's a bug?

RP


> -- Jamie

2005-05-01 05:56:55

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] private mounts

> Not necessary.
>
> Why not have the FUSE daemon keep open a file descriptor for the
> directory it's mounted on, and have it send that to new would-be
> mounters of the same directory using a unix domain socket (rather as
> Pavel suggested)?

How does that help? It doesn't matter _which_ process you try to bind
mount /proc/XXX/fd/N from, the result will be the same.

> No. The check is to prevent processes in chroot jails from accessing
> directories outside their jail. Even CAP_SYS_ADMIN processes must be
> forbidden from doing that.

As someone pointed out, CAP_SYS_ADMIN processes can already escape the
chroot jail with CLONE_NEWNS. (fd=open("."); clone(CLONE_NEWNS);
[child:] fchdir(fd); chdir(".."))

> But proc_check_root is unnecessarily strict, in that it prevents a
> process from traversing into a "child" namespace.
>
> IMHO, a better security restriction anyway would be for processes in
> chroot jails to not be able to see processes outside the jail in /proc
> - only processes inside the jail should be visible. I think everyone
> agrees that would be best.

Dunno. It's a big change possibly breaking existing applications.
Chroot probably has other uses than jailing.

> If that were implemented, then proc_check_root would be redundant and
> could be removed entirely.

Yes.

Miklos

2005-05-01 06:39:48

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] private mounts

> But proc_check_root is unnecessarily strict, in that it prevents a
> process from traversing into a "child" namespace.
>
> IMHO, a better security restriction anyway would be for processes in
> chroot jails to not be able to see processes outside the jail in /proc
> - only processes inside the jail should be visible. I think everyone
> agrees that would be best.

Creating a new namespace would also have the same effect (only
processes using that namespace are visible). It would be rather ugly,
if a user could not see processes in other login sessions, just
because he uses private namespaces.

Miklos

2005-05-01 15:41:34

by Eric Van Hensbergen

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On 5/1/05, Miklos Szeredi <[email protected]> wrote:
>
> As someone pointed out, CAP_SYS_ADMIN processes can already escape the
> chroot jail with CLONE_NEWNS. (fd=open("."); clone(CLONE_NEWNS);
> [child:] fchdir(fd); chdir(".."))
>

This really does seem like a bug. Is there a reason behind this
"feature", or should one of us be looking into a patch to correct
this?

Miklos you earlier suggested:
>>>How about fixing fchdir, so it checks whether you've gone outside the
>>>tree under current->fs->rootmnt? Should be fairly easy to do.

-eric

2005-05-11 09:05:19

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] private mounts

On Sat, Apr 30, 2005 at 11:25:10AM +0200, Miklos Szeredi wrote:
> > > > I can't write a script that reads your mind. But I sure can write
> > > > a script that finds out what you mounted in the other shells (with help
> > > > of a little wrapper around the mount command).
> > >
> > > How do you bind mount it from a different namespace? You _do_ need
> > > bind mount, since a new mount might require password input, etc...
> >
> > Not necessarily. The filesystem gets called into ->get_sb for every mount,
> > and can then decide whether to return an existing superblock instance or
> > set up a new one. If the credentials for the new mount match an old one
> > it can just reuse it. (E.g. a block-based filesystem will always reuse
> > it right now.)
>
> And if the credentials are checked in userspace (sshfs)?

Then it needs to call out to userspace in ->get_sb.

2005-05-11 10:43:27

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] private mounts

> > > > > I can't write a script that reads your mind. But I sure can
> > > > > write a script that finds out what you mounted in the other
> > > > > shells (with help of a little wrapper around the mount
> > > > > command).
> > > >
> > > > How do you bind mount it from a different namespace? You _do_
> > > > need bind mount, since a new mount might require password
> > > > input, etc...
> > >
> > > Not necessarily. The filesystem gets called into ->get_sb for
> > > every mount, and can then decide whether to return an existing
> > > superblock instance or set up a new one. If the credentials for
> > > the new mount match an old one it can just reuse it. (E.g. a
> > > block-based filesystem will always reuse it right now.)
> >
> > And if the credentials are checked in userspace (sshfs)?
>
> Then it needs to call out to userspace in ->get_sb.

That's clear.

What I don't get is what's the point in adding complexity to the
kernel and userspace programs, when it can be done without _any_
changes, just by doing a bind mount.

It's not just calling ->get_sb. It's finding the right filesystem
daemon, that has been started with the exact same command line
arguments, environment etc.

It's just not practical.

Miklos