2008-02-13 05:45:41

by Paul Menage

[permalink] [raw]
Subject: [PATCH] Add MS_BIND_FLAGS mount flag

From: Paul Menage <[email protected]>

Add a new mount() flag, MS_BIND_FLAGS.

MS_BIND_FLAGS indicates that a bind mount should take its per-mount flags
from the arguments passed to mount() rather than from the source
mountpoint.

This flag allows you to create a bind mount with the desired per-mount
flags in a single operation, rather than having to do a bind mount
followed by a remount, which is fiddly and can block for non-trivial
periods of time (on sb->s_umount?).

For recursive bind mounts, only the root of the tree being bound
inherits the per-mount flags from the mount() arguments; sub-mounts
inherit their per-mount flags from the source tree as usual.

Signed-off-by: Paul Menage <[email protected]>


---
fs/namespace.c | 36 +++++++++++++++++++++++++-----------
include/linux/fs.h | 2 ++
2 files changed, 27 insertions(+), 11 deletions(-)

Index: 2.6.24-mm1-bindflags/fs/namespace.c
===================================================================
--- 2.6.24-mm1-bindflags.orig/fs/namespace.c
+++ 2.6.24-mm1-bindflags/fs/namespace.c
@@ -512,13 +512,13 @@ static struct vfsmount *skip_mnt_tree(st
}

static struct vfsmount *clone_mnt(struct vfsmount *old, struct dentry *root,
- int flag)
+ int flag, int mnt_flags)
{
struct super_block *sb = old->mnt_sb;
struct vfsmount *mnt = alloc_vfsmnt(old->mnt_devname);

if (mnt) {
- mnt->mnt_flags = old->mnt_flags;
+ mnt->mnt_flags = mnt_flags;
atomic_inc(&sb->s_active);
mnt->mnt_sb = sb;
mnt->mnt_root = dget(root);
@@ -1095,8 +1095,9 @@ static int lives_below_in_same_fs(struct
}
}

-struct vfsmount *copy_tree(struct vfsmount *mnt, struct dentry *dentry,
- int flag)
+static struct vfsmount *
+__copy_tree(struct vfsmount *mnt, struct dentry *dentry,
+ int flag, int mnt_flags)
{
struct vfsmount *res, *p, *q, *r, *s;
struct nameidata nd;
@@ -1104,7 +1105,7 @@ struct vfsmount *copy_tree(struct vfsmou
if (!(flag & CL_COPY_ALL) && IS_MNT_UNBINDABLE(mnt))
return NULL;

- res = q = clone_mnt(mnt, dentry, flag);
+ res = q = clone_mnt(mnt, dentry, flag, mnt_flags);
if (!q)
goto Enomem;
q->mnt_mountpoint = mnt->mnt_mountpoint;
@@ -1126,7 +1127,7 @@ struct vfsmount *copy_tree(struct vfsmou
p = s;
nd.path.mnt = q;
nd.path.dentry = p->mnt_mountpoint;
- q = clone_mnt(p, p->mnt_root, flag);
+ q = clone_mnt(p, p->mnt_root, flag, p->mnt_flags);
if (!q)
goto Enomem;
spin_lock(&vfsmount_lock);
@@ -1146,6 +1147,11 @@ Enomem:
}
return NULL;
}
+struct vfsmount *copy_tree(struct vfsmount *mnt, struct dentry *dentry,
+ int flag)
+{
+ return __copy_tree(mnt, dentry, flag, mnt->mnt_flags);
+}

struct vfsmount *collect_mounts(struct vfsmount *mnt, struct dentry *dentry)
{
@@ -1320,7 +1326,8 @@ static int do_change_type(struct nameida
/*
* do loopback mount.
*/
-static int do_loopback(struct nameidata *nd, char *old_name, int recurse)
+static int do_loopback(struct nameidata *nd, char *old_name, int flags,
+ int mnt_flags)
{
struct nameidata old_nd;
struct vfsmount *mnt = NULL;
@@ -1342,10 +1349,15 @@ static int do_loopback(struct nameidata
goto out;

err = -ENOMEM;
- if (recurse)
- mnt = copy_tree(old_nd.path.mnt, old_nd.path.dentry, 0);
+ /* Use the source mount flags unless the user passed MS_BIND_FLAGS */
+ if (!(flags & MS_BIND_FLAGS))
+ mnt_flags = old_nd.path.mnt->mnt_flags;
+ if (flags & MS_REC)
+ mnt = __copy_tree(old_nd.path.mnt, old_nd.path.dentry, 0,
+ mnt_flags);
else
- mnt = clone_mnt(old_nd.path.mnt, old_nd.path.dentry, 0);
+ mnt = clone_mnt(old_nd.path.mnt, old_nd.path.dentry, 0,
+ mnt_flags);

if (!mnt)
goto out;
@@ -1874,7 +1886,9 @@ long do_mount(char *dev_name, char *dir_
retval = do_remount(&nd, flags & ~MS_REMOUNT, mnt_flags,
data_page);
else if (flags & MS_BIND)
- retval = do_loopback(&nd, dev_name, flags & MS_REC);
+ retval = do_loopback(&nd, dev_name,
+ flags & (MS_REC | MS_BIND_FLAGS),
+ mnt_flags);
else if (flags & (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
retval = do_change_type(&nd, flags);
else if (flags & MS_MOVE)
Index: 2.6.24-mm1-bindflags/include/linux/fs.h
===================================================================
--- 2.6.24-mm1-bindflags.orig/include/linux/fs.h
+++ 2.6.24-mm1-bindflags/include/linux/fs.h
@@ -125,6 +125,8 @@ extern int dir_notify_enable;
#define MS_RELATIME (1<<21) /* Update atime relative to mtime/ctime. */
#define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */
#define MS_I_VERSION (1<<23) /* Update inode I_version field */
+#define MS_BIND_FLAGS (1<<24) /* For MS_BIND, take mnt_flags from
+ * args, not from source mountpoint */
#define MS_ACTIVE (1<<30)
#define MS_NOUSER (1<<31)


2008-02-14 06:02:19

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] Add MS_BIND_FLAGS mount flag

On Tue, Feb 12, 2008 at 09:45:15PM -0800, Paul Menage wrote:
> From: Paul Menage <[email protected]>
>
> Add a new mount() flag, MS_BIND_FLAGS.
>
> MS_BIND_FLAGS indicates that a bind mount should take its per-mount flags
> from the arguments passed to mount() rather than from the source
> mountpoint.
>
> This flag allows you to create a bind mount with the desired per-mount
> flags in a single operation, rather than having to do a bind mount
> followed by a remount, which is fiddly and can block for non-trivial
> periods of time (on sb->s_umount?).
>
> For recursive bind mounts, only the root of the tree being bound
> inherits the per-mount flags from the mount() arguments; sub-mounts
> inherit their per-mount flags from the source tree as usual.

I think this concept is reasonable, but I don't think MS_BIND_FLAGS
is a descriptive name for this flag. MS_EXPLICIT_FLAGS might be better
but still isn't optimal.

> +static struct vfsmount *
> +__copy_tree(struct vfsmount *mnt, struct dentry *dentry,
> + int flag, int mnt_flags)

> +struct vfsmount *copy_tree(struct vfsmount *mnt, struct dentry *dentry,
> + int flag)
> +{
> + return __copy_tree(mnt, dentry, flag, mnt->mnt_flags);
> +}

Please just change copy_tree to have an additional parameter. There's
just four callers of it in the tree, and three of them want the new
parameter.

> + /* Use the source mount flags unless the user passed MS_BIND_FLAGS */

I think this comment could be made a little more explicit.

/*
* If MS_EXPLICIT_FLAGS is passed in we will take the paramters
* the user has specified. The default behaviour for bind
* mounts is however to take the flags from existing mount
* instances for the same superblock.
*/

> #define MS_RELATIME (1<<21) /* Update atime relative to mtime/ctime. */
> #define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */
> #define MS_I_VERSION (1<<23) /* Update inode I_version field */
> +#define MS_BIND_FLAGS (1<<24) /* For MS_BIND, take mnt_flags from
> + * args, not from source mountpoint */
> #define MS_ACTIVE (1<<30)
> #define MS_NOUSER (1<<31)

this looks like whitespace damage to me.

2008-02-14 08:31:20

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] Add MS_BIND_FLAGS mount flag

Please always CC linux-fsdevel on VFS patches!

> From: Paul Menage <[email protected]>
>
> Add a new mount() flag, MS_BIND_FLAGS.
>
> MS_BIND_FLAGS indicates that a bind mount should take its per-mount flags
> from the arguments passed to mount() rather than from the source
> mountpoint.

This is useful, but...

> This flag allows you to create a bind mount with the desired per-mount
> flags in a single operation, rather than having to do a bind mount
> followed by a remount, which is fiddly and can block for non-trivial
> periods of time (on sb->s_umount?).
>
> For recursive bind mounts, only the root of the tree being bound
> inherits the per-mount flags from the mount() arguments; sub-mounts
> inherit their per-mount flags from the source tree as usual.

This is rather strange behavior. I think it would be much better, if
setting mount flags would work for recursive operations as well. Also
what we really need is not resetting all the mount flags to some
predetermined values, but to be able to set or clear each flag
individually.

For example, with the per-mount-read-only thing the most useful
application would be to just set the read-only flag and leave the
others alone.

And this is where we usually conclude, that a new userspace mount API
is long overdue. So for starters, how about a new syscall for bind
mounts:

int mount_bind(const char *src, const char *dst, unsigned flags,
unsigned mnt_flags);

or something?

Miklos

>
> Signed-off-by: Paul Menage <[email protected]>
>
>
> ---
> fs/namespace.c | 36 +++++++++++++++++++++++++-----------
> include/linux/fs.h | 2 ++
> 2 files changed, 27 insertions(+), 11 deletions(-)
>
> Index: 2.6.24-mm1-bindflags/fs/namespace.c
> ===================================================================
> --- 2.6.24-mm1-bindflags.orig/fs/namespace.c
> +++ 2.6.24-mm1-bindflags/fs/namespace.c
> @@ -512,13 +512,13 @@ static struct vfsmount *skip_mnt_tree(st
> }
>
> static struct vfsmount *clone_mnt(struct vfsmount *old, struct dentry *root,
> - int flag)
> + int flag, int mnt_flags)
> {
> struct super_block *sb = old->mnt_sb;
> struct vfsmount *mnt = alloc_vfsmnt(old->mnt_devname);
>
> if (mnt) {
> - mnt->mnt_flags = old->mnt_flags;
> + mnt->mnt_flags = mnt_flags;
> atomic_inc(&sb->s_active);
> mnt->mnt_sb = sb;
> mnt->mnt_root = dget(root);
> @@ -1095,8 +1095,9 @@ static int lives_below_in_same_fs(struct
> }
> }
>
> -struct vfsmount *copy_tree(struct vfsmount *mnt, struct dentry *dentry,
> - int flag)
> +static struct vfsmount *
> +__copy_tree(struct vfsmount *mnt, struct dentry *dentry,
> + int flag, int mnt_flags)
> {
> struct vfsmount *res, *p, *q, *r, *s;
> struct nameidata nd;
> @@ -1104,7 +1105,7 @@ struct vfsmount *copy_tree(struct vfsmou
> if (!(flag & CL_COPY_ALL) && IS_MNT_UNBINDABLE(mnt))
> return NULL;
>
> - res = q = clone_mnt(mnt, dentry, flag);
> + res = q = clone_mnt(mnt, dentry, flag, mnt_flags);
> if (!q)
> goto Enomem;
> q->mnt_mountpoint = mnt->mnt_mountpoint;
> @@ -1126,7 +1127,7 @@ struct vfsmount *copy_tree(struct vfsmou
> p = s;
> nd.path.mnt = q;
> nd.path.dentry = p->mnt_mountpoint;
> - q = clone_mnt(p, p->mnt_root, flag);
> + q = clone_mnt(p, p->mnt_root, flag, p->mnt_flags);
> if (!q)
> goto Enomem;
> spin_lock(&vfsmount_lock);
> @@ -1146,6 +1147,11 @@ Enomem:
> }
> return NULL;
> }
> +struct vfsmount *copy_tree(struct vfsmount *mnt, struct dentry *dentry,
> + int flag)
> +{
> + return __copy_tree(mnt, dentry, flag, mnt->mnt_flags);
> +}
>
> struct vfsmount *collect_mounts(struct vfsmount *mnt, struct dentry *dentry)
> {
> @@ -1320,7 +1326,8 @@ static int do_change_type(struct nameida
> /*
> * do loopback mount.
> */
> -static int do_loopback(struct nameidata *nd, char *old_name, int recurse)
> +static int do_loopback(struct nameidata *nd, char *old_name, int flags,
> + int mnt_flags)
> {
> struct nameidata old_nd;
> struct vfsmount *mnt = NULL;
> @@ -1342,10 +1349,15 @@ static int do_loopback(struct nameidata
> goto out;
>
> err = -ENOMEM;
> - if (recurse)
> - mnt = copy_tree(old_nd.path.mnt, old_nd.path.dentry, 0);
> + /* Use the source mount flags unless the user passed MS_BIND_FLAGS */
> + if (!(flags & MS_BIND_FLAGS))
> + mnt_flags = old_nd.path.mnt->mnt_flags;
> + if (flags & MS_REC)
> + mnt = __copy_tree(old_nd.path.mnt, old_nd.path.dentry, 0,
> + mnt_flags);
> else
> - mnt = clone_mnt(old_nd.path.mnt, old_nd.path.dentry, 0);
> + mnt = clone_mnt(old_nd.path.mnt, old_nd.path.dentry, 0,
> + mnt_flags);
>
> if (!mnt)
> goto out;
> @@ -1874,7 +1886,9 @@ long do_mount(char *dev_name, char *dir_
> retval = do_remount(&nd, flags & ~MS_REMOUNT, mnt_flags,
> data_page);
> else if (flags & MS_BIND)
> - retval = do_loopback(&nd, dev_name, flags & MS_REC);
> + retval = do_loopback(&nd, dev_name,
> + flags & (MS_REC | MS_BIND_FLAGS),
> + mnt_flags);
> else if (flags & (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
> retval = do_change_type(&nd, flags);
> else if (flags & MS_MOVE)
> Index: 2.6.24-mm1-bindflags/include/linux/fs.h
> ===================================================================
> --- 2.6.24-mm1-bindflags.orig/include/linux/fs.h
> +++ 2.6.24-mm1-bindflags/include/linux/fs.h
> @@ -125,6 +125,8 @@ extern int dir_notify_enable;
> #define MS_RELATIME (1<<21) /* Update atime relative to mtime/ctime. */
> #define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */
> #define MS_I_VERSION (1<<23) /* Update inode I_version field */
> +#define MS_BIND_FLAGS (1<<24) /* For MS_BIND, take mnt_flags from
> + * args, not from source mountpoint */
> #define MS_ACTIVE (1<<30)
> #define MS_NOUSER (1<<31)
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2008-02-14 13:07:04

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH] Add MS_BIND_FLAGS mount flag


On Thu, 2008-02-14 at 09:30 +0100, Miklos Szeredi wrote:

> And this is where we usually conclude, that a new userspace mount API
> is long overdue. So for starters, how about a new syscall for bind
> mounts:
>
> int mount_bind(const char *src, const char *dst, unsigned flags,
> unsigned mnt_flags);

If you're going to add a new bind mount interface, then please consider
adding 'openat'-like file descriptors. I'm already looking into doing
this for the main 'mount' interface in order to solve the automounting
problem when dealing with arbitrary namespaces.

Cheers
Trond

---
From: Trond Myklebust <[email protected]>
VFS: Add support for a new mountat() system call

sys_mountat() basically adds an openat()-like 'dirfd' argument to the mount
system call interface.
This is needed in order to support automounting in the presence of
arbitrary user namespaces. It does so by allowing the kernel to encode the
namespace+directory information in a directory file descriptor that may be
passed to an automounter daemon that is running in a different namespace.

Signed-off-by: Trond Myklebust <[email protected]>
---

fs/namespace.c | 47 ++++++++++++++++++++++++++++++++++------------
include/linux/syscalls.h | 5 +++++
2 files changed, 40 insertions(+), 12 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 9025d22..53b5943 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -670,7 +670,7 @@ static int do_umount(struct vfsmount *mnt, int flags)
* unixes. Our API is identical to OSF/1 to avoid making a mess of AMD
*/

-asmlinkage long sys_umount(char __user * name, int flags)
+asmlinkage long sys_umountat(int dfd, char __user * name, int flags)
{
struct nameidata nd;
int retval;
@@ -695,6 +695,11 @@ out:
return retval;
}

+asmlinkage long sys_umount(char __user * name, int flags)
+{
+ return sys_umountat(AT_FDCWD, name, flags);
+}
+
#ifdef __ARCH_WANT_SYS_OLDUMOUNT

/*
@@ -963,7 +968,7 @@ static noinline int do_change_type(struct nameidata *nd, int flag)
* do loopback mount.
* noinline this do_mount helper to save do_mount stack space.
*/
-static noinline int do_loopback(struct nameidata *nd, char *old_name,
+static noinline int do_loopback(struct nameidata *nd, int dfd, char *old_name,
int recurse)
{
struct nameidata old_nd;
@@ -973,7 +978,7 @@ static noinline int do_loopback(struct nameidata *nd, char *old_name,
return err;
if (!old_name || !*old_name)
return -EINVAL;
- err = path_lookup(old_name, LOOKUP_FOLLOW, &old_nd);
+ err = path_lookup_fd(dfd, old_name, LOOKUP_FOLLOW, &old_nd);
if (err)
return err;

@@ -1053,7 +1058,7 @@ static inline int tree_contains_unbindable(struct vfsmount *mnt)
/*
* noinline this do_mount helper to save do_mount stack space.
*/
-static noinline int do_move_mount(struct nameidata *nd, char *old_name)
+static noinline int do_move_mount(struct nameidata *nd, int dfd, char *old_name)
{
struct nameidata old_nd, parent_nd;
struct vfsmount *p;
@@ -1062,7 +1067,7 @@ static noinline int do_move_mount(struct nameidata *nd, char *old_name)
return -EPERM;
if (!old_name || !*old_name)
return -EINVAL;
- err = path_lookup(old_name, LOOKUP_FOLLOW, &old_nd);
+ err = path_lookup_fd(dfd, old_name, LOOKUP_FOLLOW, &old_nd);
if (err)
return err;

@@ -1445,7 +1450,8 @@ int copy_mount_options(const void __user * data, unsigned long *where)
* Therefore, if this magic number is present, it carries no information
* and must be discarded.
*/
-long do_mount(char *dev_name, char *dir_name, char *type_page,
+static long do_mountat(int devdfd, char *dev_name,
+ int dirdfd, char *dir_name, char *type_page,
unsigned long flags, void *data_page)
{
struct nameidata nd;
@@ -1484,7 +1490,8 @@ long do_mount(char *dev_name, char *dir_name, char *type_page,
MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT);

/* ... and get the mountpoint */
- retval = path_lookup(dir_name, LOOKUP_FOLLOW|LOOKUP_MOUNT, &nd);
+ retval = path_lookup_fd(dirdfd, dir_name, LOOKUP_FOLLOW|LOOKUP_MOUNT,
+ &nd);
if (retval)
return retval;

@@ -1496,11 +1503,11 @@ long do_mount(char *dev_name, char *dir_name, char *type_page,
retval = do_remount(&nd, flags & ~MS_REMOUNT, mnt_flags,
data_page);
else if (flags & MS_BIND)
- retval = do_loopback(&nd, dev_name, flags & MS_REC);
+ retval = do_loopback(&nd, devdfd, dev_name, flags & MS_REC);
else if (flags & (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
retval = do_change_type(&nd, flags);
else if (flags & MS_MOVE)
- retval = do_move_mount(&nd, dev_name);
+ retval = do_move_mount(&nd, devdfd, dev_name);
else
retval = do_new_mount(&nd, type_page, flags, mnt_flags,
dev_name, data_page);
@@ -1509,6 +1516,13 @@ dput_out:
return retval;
}

+long do_mount(char *dev_name, char *dir_name, char *type_page,
+ unsigned long flags, void *data_page)
+{
+ return do_mountat(AT_FDCWD, dev_name, AT_FDCWD, dir_name,
+ type_page, flags, data_page);
+}
+
/*
* Allocate a new namespace structure and populate it with contents
* copied from the namespace of the passed in task structure.
@@ -1597,7 +1611,8 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns,
return new_ns;
}

-asmlinkage long sys_mount(char __user * dev_name, char __user * dir_name,
+asmlinkage long sys_mountat(int devdfd, char __user * dev_name,
+ int dirdfd, char __user * dir_name,
char __user * type, unsigned long flags,
void __user * data)
{
@@ -1625,8 +1640,8 @@ asmlinkage long sys_mount(char __user * dev_name, char __user * dir_name,
goto out3;

lock_kernel();
- retval = do_mount((char *)dev_page, dir_page, (char *)type_page,
- flags, (void *)data_page);
+ retval = do_mountat(devdfd, (char *)dev_page, dirdfd, dir_page,
+ (char *)type_page, flags, (void *)data_page);
unlock_kernel();
free_page(data_page);

@@ -1639,6 +1654,14 @@ out1:
return retval;
}

+asmlinkage long sys_mount(char __user * dev_name, char __user * dir_name,
+ char __user * type, unsigned long flags,
+ void __user * data)
+{
+ return sys_mountat(AT_FDCWD, dev_name, AT_FDCWD, dir_name,
+ type, flags, data);
+}
+
/*
* Replace the fs->{rootmnt,root} with {mnt,dentry}. Put the old values.
* It can block. Requires the big lock held.
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 4c2577b..de1dc02 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -614,6 +614,11 @@ asmlinkage long sys_timerfd_settime(int ufd, int flags,
asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
asmlinkage long sys_eventfd(unsigned int count);
asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
+asmlinkage long sys_mountat(int devdfd, char __user *dev_name,
+ int dirdfd, char __user *dir_name,
+ char __user *type, unsigned long flags,
+ void __user *data);
+asmlinkage long sys_umountat(int dfd, char __user *name, int flags);

int kernel_execve(const char *filename, char *const argv[], char *const envp[]);


2008-02-14 15:19:57

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH] Add MS_BIND_FLAGS mount flag

On Thu, Feb 14, 2008 at 12:30 AM, Miklos Szeredi <[email protected]> wrote:
> > For recursive bind mounts, only the root of the tree being bound
> > inherits the per-mount flags from the mount() arguments; sub-mounts
> > inherit their per-mount flags from the source tree as usual.
>
> This is rather strange behavior. I think it would be much better, if
> setting mount flags would work for recursive operations as well. Also
> what we really need is not resetting all the mount flags to some
> predetermined values, but to be able to set or clear each flag
> individually.

This is certainly true, but as you observe below it's a fair bit more
fiddly to specify in the API. I wasn't sure how much people recursive
bind mounts, so I figured I'd throw out this simpler version first.

>
> For example, with the per-mount-read-only thing the most useful
> application would be to just set the read-only flag and leave the
> others alone.
>
> And this is where we usually conclude, that a new userspace mount API
> is long overdue. So for starters, how about a new syscall for bind
> mounts:
>
> int mount_bind(const char *src, const char *dst, unsigned flags,
> unsigned mnt_flags);

The "flags" argument could be the same as for regular mount, and
contain the mnt_flags - so the extra argument could maybe usefully be
a "mnt_flags_mask", to indicate which flags we actually care about
overriding.

What would happen when an existing super-block flag changes to become
a per-mount flag (e.g. per-mount read-only)? I think that would just
fit in with the "mask" idea, as long as we complained if any bits in
mnt_flags_mask weren't actually per-mount settable.

Being able to mask/set mount flags might be useful on a remount too,
since there's no clean way to get the existing mount flags for a mount
other than by scanning /proc/mounts. So an alternative to a separate
system call would be a new mnt_flag_mask argument to mount() (whose
presence would be indicated by a flag bit being set in the main flags)
which would be used to control which bits were set cleared for
remount/bind calls. Seems a bit wasteful of bits though. If we turned
"flags" into an (optionally) 64-bit argument then we'd have plenty of
bits to be able to specify both a "set" bit and a "mask" bit for each,
without needing a new syscall.

Paul

2008-02-14 15:22:32

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH] Add MS_BIND_FLAGS mount flag

On Wed, Feb 13, 2008 at 10:02 PM, Christoph Hellwig <[email protected]> wrote:
>
> I think this concept is reasonable, but I don't think MS_BIND_FLAGS
> is a descriptive name for this flag. MS_EXPLICIT_FLAGS might be better
> but still isn't optimal.
>

MS_BIND_FLAGS_OVERRIDE ?

Paul

2008-02-14 15:23:42

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH] Add MS_BIND_FLAGS mount flag

[ cc: linux-fsdevel ]

On Thu, Feb 14, 2008 at 7:22 AM, Paul Menage <[email protected]> wrote:
> On Wed, Feb 13, 2008 at 10:02 PM, Christoph Hellwig <[email protected]> wrote:
> >
> > I think this concept is reasonable, but I don't think MS_BIND_FLAGS
> > is a descriptive name for this flag. MS_EXPLICIT_FLAGS might be better
> > but still isn't optimal.
> >
>
> MS_BIND_FLAGS_OVERRIDE ?
>
> Paul
>

2008-02-14 16:05:38

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] Add MS_BIND_FLAGS mount flag

> On Thu, Feb 14, 2008 at 12:30 AM, Miklos Szeredi <[email protected]> wrote:
> > > For recursive bind mounts, only the root of the tree being bound
> > > inherits the per-mount flags from the mount() arguments; sub-mounts
> > > inherit their per-mount flags from the source tree as usual.
> >
> > This is rather strange behavior. I think it would be much better, if
> > setting mount flags would work for recursive operations as well. Also
> > what we really need is not resetting all the mount flags to some
> > predetermined values, but to be able to set or clear each flag
> > individually.
>
> This is certainly true, but as you observe below it's a fair bit more
> fiddly to specify in the API. I wasn't sure how much people recursive
> bind mounts, so I figured I'd throw out this simpler version first.
>
> >
> > For example, with the per-mount-read-only thing the most useful
> > application would be to just set the read-only flag and leave the
> > others alone.
> >
> > And this is where we usually conclude, that a new userspace mount API
> > is long overdue. So for starters, how about a new syscall for bind
> > mounts:
> >
> > int mount_bind(const char *src, const char *dst, unsigned flags,
> > unsigned mnt_flags);
>
> The "flags" argument could be the same as for regular mount, and
> contain the mnt_flags - so the extra argument could maybe usefully be
> a "mnt_flags_mask", to indicate which flags we actually care about
> overriding.

The way I imagined it, is that mnt_flags is a mask, and the operation
(determined by flags) is either:

- set bits in mask
- clear bits in mask (or not in mask)
- set flags to mask

It doesn't allow setting some bits, clearing some others, and leaving
alone the rest. But I think such flexibility isn't really needed.
>
> What would happen when an existing super-block flag changes to become
> a per-mount flag (e.g. per-mount read-only)? I think that would just
> fit in with the "mask" idea, as long as we complained if any bits in
> mnt_flags_mask weren't actually per-mount settable.
>
> Being able to mask/set mount flags might be useful on a remount too,
> since there's no clean way to get the existing mount flags for a mount
> other than by scanning /proc/mounts. So an alternative to a separate
> system call would be a new mnt_flag_mask argument to mount() (whose
> presence would be indicated by a flag bit being set in the main flags)
> which would be used to control which bits were set cleared for
> remount/bind calls. Seems a bit wasteful of bits though. If we turned
> "flags" into an (optionally) 64-bit argument then we'd have plenty of
> bits to be able to specify both a "set" bit and a "mask" bit for each,
> without needing a new syscall.

The big problem, is that current sys_mount() interface is a really big
pile of random things thrown together, and not a proper API. And just
introducing a new mount64 syscall is really making the problem worse,
by allowing more random things to be added to it.

So while creating a new clean API is harder work, it should be worth
it in the long run.

Maybe instead of messing with masks, it's better to introduce a
get_flags() or a more general mount_stat() operation, and let
userspace deal with setting and clearing flags, just as we do for
stat/chmod?

So we'd have

mount_stat(path, stat);
mount_bind(from, to, flags);
mount_set_flags(path, flags);
mount_move(from, to);

and perhaps

mount_remount(path, opt_string, flags);

And I'm not against doing it with the "at*" variants, as Trond
suggested.

Hmm?

Miklos

2008-02-14 16:13:31

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH] Add MS_BIND_FLAGS mount flag

On Thu, Feb 14, 2008 at 8:03 AM, Miklos Szeredi <[email protected]> wrote:
> > The "flags" argument could be the same as for regular mount, and
> > contain the mnt_flags - so the extra argument could maybe usefully be
> > a "mnt_flags_mask", to indicate which flags we actually care about
> > overriding.
>
> The way I imagined it, is that mnt_flags is a mask, and the operation
> (determined by flags) is either:
>
> - set bits in mask
> - clear bits in mask (or not in mask)
> - set flags to mask
>
> It doesn't allow setting some bits, clearing some others, and leaving
> alone the rest. But I think such flexibility isn't really needed.

I think I'd suggest something like:

new_mnt->mnt_flags = (old_mnt->mnt_flags & ~arg_mask) | (arg_flags & mask)

> Maybe instead of messing with masks, it's better to introduce a
> get_flags() or a more general mount_stat() operation, and let
> userspace deal with setting and clearing flags, just as we do for
> stat/chmod?
>
> So we'd have
>
> mount_stat(path, stat);
> mount_bind(from, to, flags);
> mount_set_flags(path, flags);
> mount_move(from, to);
>
> and perhaps
>
> mount_remount(path, opt_string, flags);

Sounds reasonable to me. But it wouldn't directly solve the "do a
recursive bind mount setting the MS_READONLY flag on all children"
problem, so we'd need some of the earlier suggestions too.

Paul

2008-02-14 16:27:25

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH] Add MS_BIND_FLAGS mount flag


On Thu, 2008-02-14 at 17:03 +0100, Miklos Szeredi wrote:

> And I'm not against doing it with the "at*" variants, as Trond
> suggested.

If you're going to change the syscall, then you should ensure that it
solves _all_ the problems that are known at this time. Ignoring the
automounter issue is just going to force us to redo the syscall in a
couple of months...

Trond

2008-02-14 17:32:00

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] Add MS_BIND_FLAGS mount flag

> > Maybe instead of messing with masks, it's better to introduce a
> > get_flags() or a more general mount_stat() operation, and let
> > userspace deal with setting and clearing flags, just as we do for
> > stat/chmod?
> >
> > So we'd have
> >
> > mount_stat(path, stat);
> > mount_bind(from, to, flags);
> > mount_set_flags(path, flags);
> > mount_move(from, to);
> >
> > and perhaps
> >
> > mount_remount(path, opt_string, flags);
>
> Sounds reasonable to me. But it wouldn't directly solve the "do a
> recursive bind mount setting the MS_READONLY flag on all children"
> problem, so we'd need some of the earlier suggestions too.

Doh, you're right.

Let's try the original idea, but a bit cleaner:

/* flags: */
#define MNT_CTRL_RECURSE (1 << 0)

/* mnt_flags: */
#define MNT_NOSUID 0x01
#define MNT_NODEV 0x02
#define MNT_NOEXEC 0x04
#define MNT_NOATIME 0x08
#define MNT_NODIRATIME 0x10
#define MNT_RELATIME 0x20

#define MNT_SHARED 0x1000
#define MNT_UNBINDABLE 0x2000
#define MNT_PNODE_MASK 0x3000

struct mount_param {
u64 flags; /* control flags */
u64 mnt_flags; /* new mount flags */
u64 mnt_flags_mask; /* mask for new mount flags */
};

int mount_bindat(int fromfd, const char *frompath,
int tofd, const char *topath,
struct mount_param *param);

int mount_setflagsat(int fd, const char *path,
struct mount_param *param);

int mount_moveat(int fromfd, const char *frompath,
int tofd, const char *topath);

...

I deliberately not used the MS_* flags, which is currently a messy mix
of things with totally different meanings.

Does this solve all the issues?

Miklos

2008-02-14 17:40:29

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] Add MS_BIND_FLAGS mount flag

> > And I'm not against doing it with the "at*" variants, as Trond
> > suggested.
>
> If you're going to change the syscall, then you should ensure that it
> solves _all_ the problems that are known at this time. Ignoring the
> automounter issue is just going to force us to redo the syscall in a
> couple of months...

Sure.

Although, an (almost) equivalent userspace code would be:

mount_fooat(int fd, const char *path)
{
char tmpbuf[64];
int tmpfd = openat(fd, path);

sprintf(tmpbuf, "/proc/self/fd/%i", tmpfd);
return mount_foo(tmpbuf, ...);
}

Or is there something (other than not requiring proc) that the *at
variant gives?

Miklos

2008-02-14 18:05:35

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH] Add MS_BIND_FLAGS mount flag

On Thu, Feb 14, 2008 at 9:31 AM, Miklos Szeredi <[email protected]> wrote:
>
> I deliberately not used the MS_* flags, which is currently a messy mix
> of things with totally different meanings.
>
> Does this solve all the issues?

We should add a size parameter either in the mount_params or as a
final argument, for future extensibility.

And we might as well include MNT_READONLY in the API on the assumption
that per-mount readonly will be available soon.

Paul

2008-02-14 19:18:23

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH] Add MS_BIND_FLAGS mount flag


On Thu, 2008-02-14 at 18:39 +0100, Miklos Szeredi wrote:
> > > And I'm not against doing it with the "at*" variants, as Trond
> > > suggested.
> >
> > If you're going to change the syscall, then you should ensure that it
> > solves _all_ the problems that are known at this time. Ignoring the
> > automounter issue is just going to force us to redo the syscall in a
> > couple of months...
>
> Sure.
>
> Although, an (almost) equivalent userspace code would be:
>
> mount_fooat(int fd, const char *path)
> {
> char tmpbuf[64];
> int tmpfd = openat(fd, path);
>
> sprintf(tmpbuf, "/proc/self/fd/%i", tmpfd);
> return mount_foo(tmpbuf, ...);
> }
>
> Or is there something (other than not requiring proc) that the *at
> variant gives?

The ability to have a daemon handle mounting onto a directory that only
exists in another process's private namespace.

Say I'm running in my private namespace that contains paths that are not
shared by the trusted 'init' namespace. If I were to step on an
autofs-like trap, I'd like for the kernel to be able to notify the
automounter that is running in the trusted namespace set up by 'init',
and have it mount the directory onto my namespace. This should happen
even if the path is not shared.

With mountat() the kernel can still pass the necessary information to
the automounter by giving it a directory file descriptor 'fd' that
points to the directory on top of which it wants the mount to occur.
Then automounter then executes

mountat(AT_FDCWD, dev_name, fd, '.', type, flags, data);

and hey presto, the magic occurs.

2008-02-14 22:19:30

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] Add MS_BIND_FLAGS mount flag

> On Thu, 2008-02-14 at 18:39 +0100, Miklos Szeredi wrote:
> > > > And I'm not against doing it with the "at*" variants, as Trond
> > > > suggested.
> > >
> > > If you're going to change the syscall, then you should ensure that it
> > > solves _all_ the problems that are known at this time. Ignoring the
> > > automounter issue is just going to force us to redo the syscall in a
> > > couple of months...
> >
> > Sure.
> >
> > Although, an (almost) equivalent userspace code would be:
> >
> > mount_fooat(int fd, const char *path)
> > {
> > char tmpbuf[64];
> > int tmpfd = openat(fd, path);
> >
> > sprintf(tmpbuf, "/proc/self/fd/%i", tmpfd);
> > return mount_foo(tmpbuf, ...);
> > }
> >
> > Or is there something (other than not requiring proc) that the *at
> > variant gives?
>
> The ability to have a daemon handle mounting onto a directory that only
> exists in another process's private namespace.
>
> Say I'm running in my private namespace that contains paths that are not
> shared by the trusted 'init' namespace. If I were to step on an
> autofs-like trap, I'd like for the kernel to be able to notify the
> automounter that is running in the trusted namespace set up by 'init',
> and have it mount the directory onto my namespace. This should happen
> even if the path is not shared.
>
> With mountat() the kernel can still pass the necessary information to
> the automounter by giving it a directory file descriptor 'fd' that
> points to the directory on top of which it wants the mount to occur.
> Then automounter then executes
>
> mountat(AT_FDCWD, dev_name, fd, '.', type, flags, data);
>
> and hey presto, the magic occurs.

I understand perfectly that this is what you want to do. And I'm
saying that the following code snippet should do exactly the same,
without having to add a new syscall:

char tmpbuf[64];
sprintf(tmpbuf, "/proc/self/fd/%i", fd);
mount(dev_name, tmpbuf, type, flags, data);

[ You could actually try to read people's responses, instead of
immediately assuming they don't understand :-/ ]

So again, what is the advantage of a mountat() syscall over just using
the proc trick?

Miklos

2008-02-14 22:31:21

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] Add MS_BIND_FLAGS mount flag

> On Thu, Feb 14, 2008 at 9:31 AM, Miklos Szeredi <[email protected]> wrote:
> >
> > I deliberately not used the MS_* flags, which is currently a messy mix
> > of things with totally different meanings.
> >
> > Does this solve all the issues?
>
> We should add a size parameter either in the mount_params or as a
> final argument, for future extensibility.

OK, let's add it to mount_params then.

> And we might as well include MNT_READONLY in the API on the assumption
> that per-mount readonly will be available soon.

Right. That patch-set should already have been merged into 2.6.25...

Miklos

2008-02-14 23:33:28

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH] Add MS_BIND_FLAGS mount flag


On Thu, 2008-02-14 at 23:18 +0100, Miklos Szeredi wrote:

> I understand perfectly that this is what you want to do. And I'm
> saying that the following code snippet should do exactly the same,
> without having to add a new syscall:
>
> char tmpbuf[64];
> sprintf(tmpbuf, "/proc/self/fd/%i", fd);
> mount(dev_name, tmpbuf, type, flags, data);
>
> [ You could actually try to read people's responses, instead of
> immediately assuming they don't understand :-/ ]

I did read your reply, but the follow-link magic that is performed in
proc_fd_link() wasn't immediately obvious to someone who is unfamiliar
with that code. I understand where you're coming from now.

Anyhow, that does indeed look as if it will do what is needed for the
automounter.

Trond