LinuxLists.cc - [patch 1/2] VFS: new fgetattr() file operation

2007-09-24 12:26:19

Subject: [patch 1/2] VFS: new fgetattr() file operation

Thanks to everyone for the feedback. Here's two of the VFS patches
reworked according to comments. I also plan to rework the setattr()
patch accordingly and perhaps the xattr patch, altough that is the
lowest priority.

Christoph, are these OK with you in this form?

----
From: Miklos Szeredi <[email protected]>

Add a new file operation: f_op->fgetattr(), that is invoked by
fstat(). Fall back to i_op->getattr() if it is not defined.

This is useful for filesystems such as sshfs, which don't have a state
associated with inodes, but do have a state associated with open file
handles, and can perform a getattr operation using these handles.

In these cases there are basically two ways to correctly implement
open-unlink-fstat semantics:

1) use "sillyrenaming"

2) keep track of open files for each inode within the filesystem, and
randomly choose one for each getattr() request on an inode with
i_nlink == 0

3) VFS passes the open file to the filesystem, which can be used to
perform a getattr operation on the file handle

No 3. is by far the simplest solution, but it does require this
interface change in the VFS. It is also the only one that takes care
of the case, when a regular file is unlinked or renamed on the remote
host, while it is still open locally.

Signed-off-by: Miklos Szeredi <[email protected]>
---

Index: linux/fs/stat.c
===================================================================
--- linux.orig/fs/stat.c 2007-09-24 13:04:37.000000000 +0200
+++ linux/fs/stat.c 2007-09-24 13:06:10.000000000 +0200
@@ -55,6 +55,27 @@ int vfs_getattr(struct vfsmount *mnt, st

EXPORT_SYMBOL(vfs_getattr);

+static int vfs_fgetattr(struct file *file, struct kstat *stat)
+{
+ struct vfsmount *mnt = file->f_path.mnt;
+ struct dentry *dentry = file->f_path.dentry;
+ struct inode *inode = dentry->d_inode;
+ int retval;
+
+ retval = security_inode_getattr(mnt, dentry);
+ if (retval)
+ return retval;
+
+ if (file->f_op && file->f_op->fgetattr) {
+ return file->f_op->fgetattr(file, stat);
+ } else if (inode->i_op->getattr) {
+ return inode->i_op->getattr(mnt, dentry, stat);
+ } else {
+ generic_fillattr(inode, stat);
+ return 0;
+ }
+}
+
int vfs_stat_fd(int dfd, char __user *name, struct kstat *stat)
{
struct nameidata nd;
@@ -101,7 +122,7 @@ int vfs_fstat(unsigned int fd, struct ks
int error = -EBADF;

if (f) {
- error = vfs_getattr(f->f_path.mnt, f->f_path.dentry, stat);
+ error = vfs_fgetattr(f, stat);
fput(f);
}
return error;
Index: linux/include/linux/fs.h
===================================================================
--- linux.orig/include/linux/fs.h 2007-09-24 13:04:37.000000000 +0200
+++ linux/include/linux/fs.h 2007-09-24 13:06:10.000000000 +0200
@@ -1193,6 +1193,7 @@ struct file_operations {
ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
int (*setlease)(struct file *, long, struct file_lock **);
int (*revoke)(struct file *, struct address_space *);
+ int (*fgetattr)(struct file *, struct kstat *);
};

struct inode_operations {

2007-09-24 12:26:43

by Miklos Szeredi

[permalink] [raw]

Subject: [patch 2/2] VFS: allow filesystem to override mknod capability checks

From: Miklos Szeredi <[email protected]>

Add a new super block flag, that results in the VFS not checking if
the current process has enough privileges to do an mknod().

If this flag is set, all mounts for this super block will have the
"nodev" flag implied.

This is needed on filesystems, where an unprivileged user may be able
to create a device node, without causing security problems.

One such example is "mountlo" a loopback mount utility implemented
with fuse and UML, which runs as an unprivileged userspace process.
In this case the user does in fact have the right to create device
nodes within the filesystem image, as long as the user has write
access to the image. Since the filesystem is mounted with "nodev",
adding device nodes is not a security concern.

Signed-off-by: Miklos Szeredi <[email protected]>
---

Index: linux/fs/namei.c
===================================================================
--- linux.orig/fs/namei.c 2007-09-24 13:52:17.000000000 +0200
+++ linux/fs/namei.c 2007-09-24 13:54:57.000000000 +0200
@@ -1617,7 +1617,7 @@ int may_open(struct nameidata *nd, int a
if (S_ISFIFO(inode->i_mode) || S_ISSOCK(inode->i_mode)) {
flag &= ~O_TRUNC;
} else if (S_ISBLK(inode->i_mode) || S_ISCHR(inode->i_mode)) {
- if (nd->mnt->mnt_flags & MNT_NODEV)
+ if (IS_MNT_NODEV(nd->mnt))
return -EACCES;

flag &= ~O_TRUNC;
@@ -1920,7 +1920,8 @@ int vfs_mknod(struct inode *dir, struct
if (error)
return error;

- if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
+ if (!(dir->i_sb->s_flags & MS_MKNOD_NOCAP) &&
+ (S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
return -EPERM;

if (!dir->i_op || !dir->i_op->mknod)
Index: linux/include/linux/fs.h
===================================================================
--- linux.orig/include/linux/fs.h 2007-09-24 13:52:17.000000000 +0200
+++ linux/include/linux/fs.h 2007-09-24 13:54:57.000000000 +0200
@@ -130,6 +130,8 @@ extern int dir_notify_enable;
#define MS_SETUSER (1<<23) /* set mnt_uid to current user */
#define MS_NOMNT (1<<24) /* don't allow unprivileged submounts */
#define MS_KERNMOUNT (1<<25) /* this is a kern_mount call */
+#define MS_MKNOD_NOCAP (1<<26) /* no capability check in mknod,
+ implies "nodev" */
#define MS_ACTIVE (1<<30)
#define MS_NOUSER (1<<31)

@@ -190,6 +192,10 @@ extern int dir_notify_enable;
#define IS_SWAPFILE(inode) ((inode)->i_flags & S_SWAPFILE)
#define IS_PRIVATE(inode) ((inode)->i_flags & S_PRIVATE)

+#define IS_MNT_NODEV(mnt) (((mnt)->mnt_flags & MNT_NODEV) || \
+ ((mnt)->mnt_sb->s_flags & MS_MKNOD_NOCAP))
+
+
/* the read-only stuff doesn't really belong here, but any other place is
probably as bad and I don't want to create yet another include file. */

Index: linux/drivers/mtd/mtdsuper.c
===================================================================
--- linux.orig/drivers/mtd/mtdsuper.c 2007-09-24 13:52:17.000000000 +0200
+++ linux/drivers/mtd/mtdsuper.c 2007-09-24 13:54:57.000000000 +0200
@@ -194,7 +194,7 @@ int get_sb_mtd(struct file_system_type *
if (!S_ISBLK(nd.dentry->d_inode->i_mode))
goto out;

- if (nd.mnt->mnt_flags & MNT_NODEV) {
+ if (IS_MNT_NODEV(nd.mnt)) {
ret = -EACCES;
goto out;
}
Index: linux/fs/block_dev.c
===================================================================
--- linux.orig/fs/block_dev.c 2007-09-24 13:52:17.000000000 +0200
+++ linux/fs/block_dev.c 2007-09-24 13:54:57.000000000 +0200
@@ -1408,7 +1408,7 @@ struct block_device *lookup_bdev(const c
if (!S_ISBLK(inode->i_mode))
goto fail;
error = -EACCES;
- if (nd.mnt->mnt_flags & MNT_NODEV)
+ if (IS_MNT_NODEV(nd.mnt))
goto fail;
error = -ENOMEM;
bdev = bd_acquire(inode);
Index: linux/fs/namespace.c
===================================================================
--- linux.orig/fs/namespace.c 2007-09-24 13:52:17.000000000 +0200
+++ linux/fs/namespace.c 2007-09-24 13:54:57.000000000 +0200
@@ -431,7 +431,6 @@ static int show_vfsmnt(struct seq_file *
};
static struct proc_fs_info mnt_info[] = {
{ MNT_NOSUID, ",nosuid" },
- { MNT_NODEV, ",nodev" },
{ MNT_NOEXEC, ",noexec" },
{ MNT_NOATIME, ",noatime" },
{ MNT_NODIRATIME, ",nodiratime" },
@@ -459,6 +458,8 @@ static int show_vfsmnt(struct seq_file *
if (mnt->mnt_flags & fs_infop->flag)
seq_puts(m, fs_infop->str);
}
+ if (IS_MNT_NODEV(mnt))
+ seq_puts(m, ",nodev");
if (mnt->mnt_flags & MNT_USER)
seq_printf(m, ",user=%i", mnt->mnt_uid);
if (mnt->mnt_sb->s_op->show_options)

2007-09-24 12:36:55

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [patch 1/2] VFS: new fgetattr() file operation

On Mon, Sep 24, 2007 at 02:24:54PM +0200, Miklos Szeredi wrote:
> Thanks to everyone for the feedback. Here's two of the VFS patches
> reworked according to comments. I also plan to rework the setattr()
> patch accordingly and perhaps the xattr patch, altough that is the
> lowest priority.
>
> Christoph, are these OK with you in this form?

Not at all. Attribute operations like this have no business at all
looking at the struct file. Please fix your dreaded filesystem to
implement proper unix semantics intead, and if that means adding
silly rename support so be it.

2007-09-24 12:38:25

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [patch 2/2] VFS: allow filesystem to override mknod capability checks

On Mon, Sep 24, 2007 at 02:25:54PM +0200, Miklos Szeredi wrote:
> From: Miklos Szeredi <[email protected]>
>
> Add a new super block flag, that results in the VFS not checking if
> the current process has enough privileges to do an mknod().
>
> If this flag is set, all mounts for this super block will have the
> "nodev" flag implied.
>
> This is needed on filesystems, where an unprivileged user may be able
> to create a device node, without causing security problems.
>
> One such example is "mountlo" a loopback mount utility implemented
> with fuse and UML, which runs as an unprivileged userspace process.
> In this case the user does in fact have the right to create device
> nodes within the filesystem image, as long as the user has write
> access to the image. Since the filesystem is mounted with "nodev",
> adding device nodes is not a security concern.

This one looks okay, but I'd prefer to not put it in until we actually
have proper non-privilegued mounts.

2007-09-24 12:49:47

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [patch 1/2] VFS: new fgetattr() file operation

> On Mon, Sep 24, 2007 at 02:24:54PM +0200, Miklos Szeredi wrote:
> > Thanks to everyone for the feedback. Here's two of the VFS patches
> > reworked according to comments. I also plan to rework the setattr()
> > patch accordingly and perhaps the xattr patch, altough that is the
> > lowest priority.
> >
> > Christoph, are these OK with you in this form?
>
> Not at all. Attribute operations like this have no business at all
> looking at the struct file. Please fix your dreaded filesystem to
> implement proper unix semantics intead,

It's a fixed protocol, with servers installed on millions of servers.
The protocol is called SFTP and the server is part of the OpenSSH
package. There's nothing I can change there.

And even if I could change the protocol, it's impossible to implement
full UNIX semantics with a userspace server. Please think a bit about
it.

> and if that means adding silly rename support so be it.

That's what is done currently.

But it's has various dawbacks, like rmdir doesn't work if there are
open files within an otherwise empty directory.

I'd happily accept suggestions on how to deal with this differenty.

Thanks,
Miklos

2007-09-24 12:59:59

by Matthew Wilcox

[permalink] [raw]

Subject: Re: [patch 1/2] VFS: new fgetattr() file operation

On Mon, Sep 24, 2007 at 02:48:08PM +0200, Miklos Szeredi wrote:
> > and if that means adding silly rename support so be it.
>
> That's what is done currently.
>
> But it's has various dawbacks, like rmdir doesn't work if there are
> open files within an otherwise empty directory.
>
> I'd happily accept suggestions on how to deal with this differenty.

Only sillyrename files with nlink > 1? I don't see how attributes can
change anything for a deleted file.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2007-09-24 13:07:25

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [patch 1/2] VFS: new fgetattr() file operation

> > > and if that means adding silly rename support so be it.
> >
> > That's what is done currently.
> >
> > But it's has various dawbacks, like rmdir doesn't work if there are
> > open files within an otherwise empty directory.
> >
> > I'd happily accept suggestions on how to deal with this differenty.
>
> Only sillyrename files with nlink > 1? I don't see how attributes can
> change anything for a deleted file.

I don't quite understand your suggestion.

A file isn't deleted while there are still links or open files
refering to it. So getting the attributes for a file with nlink==0 is
perfectly valid while the file is still open.

If a network filesystem protocol can't handle operations (be it data
or metadata) on an unlinked file, we must do sillirenaming, so that
the file is not actually unlinked.

Miklos

2007-09-24 13:11:07

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [patch 1/2] VFS: new fgetattr() file operation

On Mon, Sep 24, 2007 at 03:06:06PM +0200, Miklos Szeredi wrote:
> If a network filesystem protocol can't handle operations (be it data
> or metadata) on an unlinked file, we must do sillirenaming, so that
> the file is not actually unlinked.

Or not support such a broken protocol at all.

2007-09-24 13:16:27

by Matthew Wilcox

[permalink] [raw]

Subject: Re: [patch 1/2] VFS: new fgetattr() file operation

On Mon, Sep 24, 2007 at 03:06:06PM +0200, Miklos Szeredi wrote:
> A file isn't deleted while there are still links or open files
> refering to it. So getting the attributes for a file with nlink==0 is
> perfectly valid while the file is still open.

Is it? Why not just pretend that the attributes are wiped when the file
is deleted. Effectively, they are, since they can't affect anything.

> If a network filesystem protocol can't handle operations (be it data
> or metadata) on an unlinked file, we must do sillirenaming, so that
> the file is not actually unlinked.

Or you could call getattr right before you unlink and cache the result
in the client.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2007-09-24 13:19:03

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [patch 1/2] VFS: new fgetattr() file operation

> > If a network filesystem protocol can't handle operations (be it data
> > or metadata) on an unlinked file, we must do sillirenaming, so that
> > the file is not actually unlinked.
>
> Or not support such a broken protocol at all.

Wonder what people would say if we removed support for NFSv[23].

Just because a protocol does not support "perfect" UNIX semantics, it
doesn't mean it's broken. By that standard almost all network
filesystem protocols are severely broken.

Miklos

2007-09-24 13:22:37

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [patch 1/2] VFS: new fgetattr() file operation

> On Mon, Sep 24, 2007 at 03:06:06PM +0200, Miklos Szeredi wrote:
> > A file isn't deleted while there are still links or open files
> > refering to it. So getting the attributes for a file with nlink==0 is
> > perfectly valid while the file is still open.
>
> Is it? Why not just pretend that the attributes are wiped when the file
> is deleted.

You mean "when finally unlinked"? Delete happens when the file is
closed.

> Effectively, they are, since they can't affect anything.

Sure it can. It may be open on the server as well.

> > If a network filesystem protocol can't handle operations (be it data
> > or metadata) on an unlinked file, we must do sillirenaming, so that
> > the file is not actually unlinked.
>
> Or you could call getattr right before you unlink and cache the result
> in the client.

The file can still be modified after being unlinked.

And even if we did this caching thing and modify the attributes when
the file is modified, it would not deal with access on the remote end,
and would be much more complex than the other alternatives.

Miklos

2007-09-24 13:27:40

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [patch 1/2] VFS: new fgetattr() file operation

On Mon, Sep 24, 2007 at 03:18:10PM +0200, Miklos Szeredi wrote:
> > Or not support such a broken protocol at all.
>
> Wonder what people would say if we removed support for NFSv[23].
>
> Just because a protocol does not support "perfect" UNIX semantics, it
> doesn't mean it's broken. By that standard almost all network
> filesystem protocols are severely broken.

Well, they are broken by these and other standards. At least nfs and
cifs maintainers do the workarounds for this brokeness where they belong.

2007-09-24 13:37:27

by Alan

[permalink] [raw]

Subject: Re: [patch 1/2] VFS: new fgetattr() file operation

> But it's has various dawbacks, like rmdir doesn't work if there are
> open files within an otherwise empty directory.
>
> I'd happily accept suggestions on how to deal with this differenty.

NFS has that problem because it really has to sillyrename into the same
directory. I don't see that ssh/sftp needs to do that. Instead it can
sillyrename anywhere in the filesystem.

Alan

2007-09-24 13:38:58

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [patch 1/2] VFS: new fgetattr() file operation

> On Mon, Sep 24, 2007 at 03:18:10PM +0200, Miklos Szeredi wrote:
> > > Or not support such a broken protocol at all.
> >
> > Wonder what people would say if we removed support for NFSv[23].
> >
> > Just because a protocol does not support "perfect" UNIX semantics, it
> > doesn't mean it's broken. By that standard almost all network
> > filesystem protocols are severely broken.
>
> Well, they are broken by these and other standards. At least nfs and
> cifs maintainers do the workarounds for this brokeness where they belong.

And my patch is not working around a problem, rather solving a problem
in a correct way.

Let me summarise it:

There are valid reasons we have fstat() in addition to stat/lstat.
For example we want to protect agains races involving unlink/rename on
an open file.

Say I want to implement a network filesystem with a pure unprivileged
userspace sever (this is basically what sshfs does).

I want my filesystem client implementation to keep all these
advantages of fstat(). So how can I do that? There's a simple way:
implement this operation with fstat() on the server. I get all the
advantages on the remote system automatically.

But for that the filesystem needs to have the open file that the
fstat() on the client was performed on.

It's that simple. There's really no ugly hacks going on behind the
scenes. It's just that we do want to delegate some properties of this
operation onto the server, and the simplest and best implementation is
to just let the filesystem have the information it needs.

Why is that such a big problem?

Miklos

2007-09-24 13:41:43

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [patch 1/2] VFS: new fgetattr() file operation

> > But it's has various dawbacks, like rmdir doesn't work if there are
> > open files within an otherwise empty directory.
> >
> > I'd happily accept suggestions on how to deal with this differenty.
>
> NFS has that problem because it really has to sillyrename into the same
> directory. I don't see that ssh/sftp needs to do that. Instead it can
> sillyrename anywhere in the filesystem.

I don't think it can. How can we find in a reliable way another
directory, which is writable by the user?

Miklos

2007-09-28 02:40:47

by NeilBrown

[permalink] [raw]

Subject: Re: [patch 2/2] VFS: allow filesystem to override mknod capability checks

On Monday September 24, [email protected] wrote:
> From: Miklos Szeredi <[email protected]>
>
> Add a new super block flag, that results in the VFS not checking if
> the current process has enough privileges to do an mknod().
>
> If this flag is set, all mounts for this super block will have the
> "nodev" flag implied.
>
> This is needed on filesystems, where an unprivileged user may be able
> to create a device node, without causing security problems.
>
> One such example is "mountlo" a loopback mount utility implemented
> with fuse and UML, which runs as an unprivileged userspace process.
> In this case the user does in fact have the right to create device
> nodes within the filesystem image, as long as the user has write
> access to the image. Since the filesystem is mounted with "nodev",
> adding device nodes is not a security concern.

I must admit that I don't feel very comfortable about this. I wonder
how many more flags we might be tempted to add to allow
user-controlled filesystems to do interesting things. Somehow I doubt
this will be the last, so we should be very careful allowing it to be
the first (or is it the second already?)

A more concrete comment on the patch: Is it really necessary to
introduce IS_MNT_NODEV?? Why not simply test both the flags
(MS_MKNOD_NOCAP and MNT_NODEV) before allowing the mknod? That would
localise the change to where is it really relevant.

Do we actually need a new flag? Would not a combination of MS_NODEV
and MS_SETUSER achieve the same thing (near enough)?

Do you imagine this flag being set as a mount option (-o unprivmknod)
or does the filesystem set it itself?
If the latter, maybe this test should be moved down into the
filesystems ->mknod operation. Most filesystems get

if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
return -EPERM;

at the top of ->mknod. fuse can do whatever it likes without
bothering common code.

According to fs.h, we only support 32 fs-independent mount-flags, and
over half are in use. I'm not convinced we should spend one on such a
narrow requirement.

NeilBrown

>
> Signed-off-by: Miklos Szeredi <[email protected]>
> ---
>
> Index: linux/fs/namei.c
> ===================================================================
> --- linux.orig/fs/namei.c 2007-09-24 13:52:17.000000000 +0200
> +++ linux/fs/namei.c 2007-09-24 13:54:57.000000000 +0200
> @@ -1617,7 +1617,7 @@ int may_open(struct nameidata *nd, int a
> if (S_ISFIFO(inode->i_mode) || S_ISSOCK(inode->i_mode)) {
> flag &= ~O_TRUNC;
> } else if (S_ISBLK(inode->i_mode) || S_ISCHR(inode->i_mode)) {
> - if (nd->mnt->mnt_flags & MNT_NODEV)
> + if (IS_MNT_NODEV(nd->mnt))
> return -EACCES;
>
> flag &= ~O_TRUNC;
> @@ -1920,7 +1920,8 @@ int vfs_mknod(struct inode *dir, struct
> if (error)
> return error;
>
> - if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
> + if (!(dir->i_sb->s_flags & MS_MKNOD_NOCAP) &&
> + (S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
> return -EPERM;
>
> if (!dir->i_op || !dir->i_op->mknod)
> Index: linux/include/linux/fs.h
> ===================================================================
> --- linux.orig/include/linux/fs.h 2007-09-24 13:52:17.000000000 +0200
> +++ linux/include/linux/fs.h 2007-09-24 13:54:57.000000000 +0200
> @@ -130,6 +130,8 @@ extern int dir_notify_enable;
> #define MS_SETUSER (1<<23) /* set mnt_uid to current user */
> #define MS_NOMNT (1<<24) /* don't allow unprivileged submounts */
> #define MS_KERNMOUNT (1<<25) /* this is a kern_mount call */
> +#define MS_MKNOD_NOCAP (1<<26) /* no capability check in mknod,
> + implies "nodev" */
> #define MS_ACTIVE (1<<30)
> #define MS_NOUSER (1<<31)
>
> @@ -190,6 +192,10 @@ extern int dir_notify_enable;
> #define IS_SWAPFILE(inode) ((inode)->i_flags & S_SWAPFILE)
> #define IS_PRIVATE(inode) ((inode)->i_flags & S_PRIVATE)
>
> +#define IS_MNT_NODEV(mnt) (((mnt)->mnt_flags & MNT_NODEV) || \
> + ((mnt)->mnt_sb->s_flags & MS_MKNOD_NOCAP))
> +
> +
> /* the read-only stuff doesn't really belong here, but any other place is
> probably as bad and I don't want to create yet another include file. */
>
> Index: linux/drivers/mtd/mtdsuper.c
> ===================================================================
> --- linux.orig/drivers/mtd/mtdsuper.c 2007-09-24 13:52:17.000000000 +0200
> +++ linux/drivers/mtd/mtdsuper.c 2007-09-24 13:54:57.000000000 +0200
> @@ -194,7 +194,7 @@ int get_sb_mtd(struct file_system_type *
> if (!S_ISBLK(nd.dentry->d_inode->i_mode))
> goto out;
>
> - if (nd.mnt->mnt_flags & MNT_NODEV) {
> + if (IS_MNT_NODEV(nd.mnt)) {
> ret = -EACCES;
> goto out;
> }
> Index: linux/fs/block_dev.c
> ===================================================================
> --- linux.orig/fs/block_dev.c 2007-09-24 13:52:17.000000000 +0200
> +++ linux/fs/block_dev.c 2007-09-24 13:54:57.000000000 +0200
> @@ -1408,7 +1408,7 @@ struct block_device *lookup_bdev(const c
> if (!S_ISBLK(inode->i_mode))
> goto fail;
> error = -EACCES;
> - if (nd.mnt->mnt_flags & MNT_NODEV)
> + if (IS_MNT_NODEV(nd.mnt))
> goto fail;
> error = -ENOMEM;
> bdev = bd_acquire(inode);
> Index: linux/fs/namespace.c
> ===================================================================
> --- linux.orig/fs/namespace.c 2007-09-24 13:52:17.000000000 +0200
> +++ linux/fs/namespace.c 2007-09-24 13:54:57.000000000 +0200
> @@ -431,7 +431,6 @@ static int show_vfsmnt(struct seq_file *
> };
> static struct proc_fs_info mnt_info[] = {
> { MNT_NOSUID, ",nosuid" },
> - { MNT_NODEV, ",nodev" },
> { MNT_NOEXEC, ",noexec" },
> { MNT_NOATIME, ",noatime" },
> { MNT_NODIRATIME, ",nodiratime" },
> @@ -459,6 +458,8 @@ static int show_vfsmnt(struct seq_file *
> if (mnt->mnt_flags & fs_infop->flag)
> seq_puts(m, fs_infop->str);
> }
> + if (IS_MNT_NODEV(mnt))
> + seq_puts(m, ",nodev");
> if (mnt->mnt_flags & MNT_USER)
> seq_printf(m, ",user=%i", mnt->mnt_uid);
> if (mnt->mnt_sb->s_op->show_options)
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2007-10-01 18:08:01

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [patch 2/2] VFS: allow filesystem to override mknod capability checks

> On Monday September 24, [email protected] wrote:
> > From: Miklos Szeredi <[email protected]>
> >
> > Add a new super block flag, that results in the VFS not checking if
> > the current process has enough privileges to do an mknod().
> >
> > If this flag is set, all mounts for this super block will have the
> > "nodev" flag implied.
> >
> > This is needed on filesystems, where an unprivileged user may be able
> > to create a device node, without causing security problems.
> >
> > One such example is "mountlo" a loopback mount utility implemented
> > with fuse and UML, which runs as an unprivileged userspace process.
> > In this case the user does in fact have the right to create device
> > nodes within the filesystem image, as long as the user has write
> > access to the image. Since the filesystem is mounted with "nodev",
> > adding device nodes is not a security concern.
>
> I must admit that I don't feel very comfortable about this. I wonder
> how many more flags we might be tempted to add to allow
> user-controlled filesystems to do interesting things. Somehow I doubt
> this will be the last, so we should be very careful allowing it to be
> the first (or is it the second already?)

Or third ;)

Yes, I've always argued, that permission checking needs to be pushed
to the filesystem, since the VFS can't always do a good enough job.

My usual example is sshfs, where it is just impossible to know the
uid/gid mapping between the server and the client. So any permission
checking based on uid or gid on the client simply can't work, the only
sane thing to do is to delegate the permission checking to the server.
Which works beautifully, since the server is an unprivileged process,
and everything automatically works out.

All the fuse kernel module has to do is to basically define
->permission() to always return success, and let the userspace
filesystem do the permission checking.

This works fine, except a couple of things, like checking the sticky
bit on a directory, and mknod().

> A more concrete comment on the patch: Is it really necessary to
> introduce IS_MNT_NODEV?? Why not simply test both the flags
> (MS_MKNOD_NOCAP and MNT_NODEV) before allowing the mknod?

Because we need MNT_NODEV on _all_ mounts belonging to a superblock,
not just the one on which mknod was performed on, which would get
really ugly. This way it's simple: if the MS_MKNOD_NOCAP is specified
for the super block, that implies, that devices can't be opened.

> Do we actually need a new flag? Would not a combination of MS_NODEV
> and MS_SETUSER achieve the same thing (near enough)?

See above.

> Do you imagine this flag being set as a mount option (-o unprivmknod)
> or does the filesystem set it itself?

I imagine this flag to usually be set by the filesystem itself. But
it could be a separate mount option. I guess it could make sense in
some non-fuse cases as well.

> If the latter, maybe this test should be moved down into the
> filesystems ->mknod operation. Most filesystems get
>
> if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
> return -EPERM;
>
> at the top of ->mknod. fuse can do whatever it likes without
> bothering common code.

Yes, that's one of the options, but it would be a huge change, with
nasty security implications for past and future filesystems.

> According to fs.h, we only support 32 fs-independent mount-flags, and
> over half are in use. I'm not convinced we should spend one on such a
> narrow requirement.

I think there's consensus on the need for a new mount API. Not just
because the 32 bits for the flags will be exhausted sooner or later
anyway, but the current API is limited in lots of other ways.

Miklos