2017-03-06 23:32:59

by Alexey Gladkov

[permalink] [raw]
Subject: [RFC] Add option to mount only a pids subset


After discussion with Oleg Nesterov I reimplement my patch as an additional
option for /proc. This option affects the mountpoint. It means that in one
pid namespace it possible to have both the whole traditional /proc and
/proc with only pids subset.

However, it remains an open question about overlayfs support in the /proc.

== Overview ==

Some of the container virtualization systems are mounted /proc inside
the container. This is done in most cases to operate with information
about the processes. Knowing that /proc filesystem is not fully
virtualized they are mounted on top of dangerous places empty files or
directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.).

The structure of this filesystem is dynamic and any module can create a
new object which will not necessarily be virtualized. There are
proprietary modules that aren't in the mainline whose work we can not
verify.

This opens up a potential threat to the system. The developers of the
virtualization system can't predict all dangerous places in /proc by
definition.

A more effective solution would be to mount into the container only what
is necessary and ignore the rest.

Right now there is the opportunity to pass in the container any port of
the /proc filesystem using mount --bind expect the pids.

This patch allows to mount only the part of /proc related to pids
without rest objects. Since this is an option for /proc, flags applied to
/proc have an effect on this subset of filesystem.

Originally the idea was that the container will be mounted only pid sunset
and additional required files will be mounted on top using the overlayfs.
But I found out that /proc does not support overlayfs and does not allow
to mount anything on top or under it.

== TODO ==

There is still work to do:

* Show pidonly via proc_show_options.
* Add overlayfs support.

---
fs/internal.h | 1 +
fs/namespace.c | 9 ++++++++
fs/proc/generic.c | 4 ++++
fs/proc/inode.c | 7 +++++-
fs/proc/internal.h | 8 +++++++
fs/proc/root.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++----
fs/stat.c | 4 ++++
fs/super.c | 20 +++++++++++++++++
include/linux/fs.h | 2 ++
include/linux/mount.h | 1 +
10 files changed, 113 insertions(+), 5 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index 4fcf517..cb44ca7 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -88,6 +88,7 @@ extern struct file *get_empty_filp(void);
* super.c
*/
extern int do_remount_sb(struct super_block *, int, void *, int);
+extern int do_mount_sb(struct vfsmount *mnt, int flags, void *data);
extern bool trylock_super(struct super_block *sb);
extern struct dentry *mount_fs(struct file_system_type *,
int, const char *, void *);
diff --git a/fs/namespace.c b/fs/namespace.c
index e6c234b..6cb2bcb 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -938,6 +938,7 @@ static struct mount *skip_mnt_tree(struct mount *p)
struct vfsmount *
vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void *data)
{
+ int err;
struct mount *mnt;
struct dentry *root;

@@ -962,6 +963,14 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
mnt->mnt.mnt_sb = root->d_sb;
mnt->mnt_mountpoint = mnt->mnt.mnt_root;
mnt->mnt_parent = mnt;
+
+ err = do_mount_sb(&mnt->mnt, flags, data);
+ if(err) {
+ mnt_free_id(mnt);
+ free_vfsmnt(mnt);
+ return ERR_PTR(err);
+ }
+
lock_mount_hash();
list_add_tail(&mnt->mnt_instance, &root->d_sb->s_mounts);
unlock_mount_hash();
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 7eb3cef..dd3da60 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -306,6 +306,10 @@ int proc_readdir_de(struct proc_dir_entry *de, struct file *file,
int proc_readdir(struct file *file, struct dir_context *ctx)
{
struct inode *inode = file_inode(file);
+ struct proc_options *opts = file->f_path.mnt->fs_data;
+
+ if (opts && opts->pid_only)
+ return 1;

return proc_readdir_de(PDE(inode), file, ctx);
}
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 783bc19..51b1712 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -108,7 +108,6 @@ static int proc_show_options(struct seq_file *seq, struct dentry *root)
seq_printf(seq, ",gid=%u", from_kgid_munged(&init_user_ns, pid->pid_gid));
if (pid->hide_pid != 0)
seq_printf(seq, ",hidepid=%u", pid->hide_pid);
-
return 0;
}

@@ -118,6 +117,8 @@ static const struct super_operations proc_sops = {
.drop_inode = generic_delete_inode,
.evict_inode = proc_evict_inode,
.statfs = simple_statfs,
+ .getattr_fs = proc_getattrfs,
+ .mount_fs = proc_mount_cb,
.remount_fs = proc_remount,
.show_options = proc_show_options,
};
@@ -323,6 +324,10 @@ static int proc_reg_open(struct inode *inode, struct file *file)
int (*open)(struct inode *, struct file *);
int (*release)(struct inode *, struct file *);
struct pde_opener *pdeo;
+ struct proc_options *opts = file->f_path.mnt->fs_data;
+
+ if (opts && opts->pid_only)
+ return -ENOENT;

/*
* Ensure that
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 2de5194..7fdbfee 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -267,11 +267,19 @@ static inline void proc_tty_init(void) {}
/*
* root.c
*/
+struct proc_options {
+ kgid_t pid_gid;
+ int hide_pid;
+ int pid_only;
+};
+
extern struct proc_dir_entry proc_root;
extern int proc_parse_options(char *options, struct pid_namespace *pid);

extern void proc_self_init(void);
extern int proc_remount(struct super_block *, int *, char *);
+extern int proc_mount_cb(struct vfsmount *mnt, int *flags, char *data);
+extern int proc_getattrfs(struct vfsmount *, struct dentry *, struct kstat *);

/*
* task_[no]mmu.c
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 4bd0373..e07f37a 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -14,6 +14,7 @@
#include <linux/stat.h>
#include <linux/init.h>
#include <linux/sched.h>
+#include <linux/slab.h>
#include <linux/module.h>
#include <linux/bitops.h>
#include <linux/user_namespace.h>
@@ -24,16 +25,17 @@
#include "internal.h"

enum {
- Opt_gid, Opt_hidepid, Opt_err,
+ Opt_gid, Opt_hidepid, Opt_pidonly, Opt_err,
};

static const match_table_t tokens = {
{Opt_hidepid, "hidepid=%u"},
{Opt_gid, "gid=%u"},
+ {Opt_pidonly, "pidonly"},
{Opt_err, NULL},
};

-int proc_parse_options(char *options, struct pid_namespace *pid)
+static int proc_fill_options(char *options, struct proc_options *fs_opts)
{
char *p;
substring_t args[MAX_OPT_ARGS];
@@ -53,7 +55,7 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
case Opt_gid:
if (match_int(&args[0], &option))
return 0;
- pid->pid_gid = make_kgid(current_user_ns(), option);
+ fs_opts->pid_gid = make_kgid(current_user_ns(), option);
break;
case Opt_hidepid:
if (match_int(&args[0], &option))
@@ -62,7 +64,10 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
pr_err("proc: hidepid value must be between 0 and 2.\n");
return 0;
}
- pid->hide_pid = option;
+ fs_opts->hide_pid = option;
+ break;
+ case Opt_pidonly:
+ fs_opts->pid_only = 1;
break;
default:
pr_err("proc: unrecognized mount option \"%s\" "
@@ -74,6 +79,19 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
return 1;
}

+int proc_parse_options(char *options, struct pid_namespace *pid)
+{
+ struct proc_options opts;
+
+ if (!proc_fill_options(options, &opts))
+ return 0;
+
+ pid->pid_gid = opts.pid_gid;
+ pid->hide_pid = opts.hide_pid;
+
+ return 1;
+}
+
int proc_remount(struct super_block *sb, int *flags, char *data)
{
struct pid_namespace *pid = sb->s_fs_info;
@@ -82,6 +100,42 @@ int proc_remount(struct super_block *sb, int *flags, char *data)
return !proc_parse_options(data, pid);
}

+int proc_getattrfs(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
+{
+ struct inode *inode = d_inode(dentry);
+ struct pid *pid = proc_pid(dentry->d_inode);
+ struct proc_options *opts = mnt->fs_data;
+
+ if (opts && opts->pid_only && mnt->mnt_root != dentry && !pid)
+ return -ENOENT;
+
+ if (!inode->i_op->getattr) {
+ generic_fillattr(inode, stat);
+ return 0;
+ }
+
+ return inode->i_op->getattr(mnt, dentry, stat);
+}
+
+int proc_mount_cb(struct vfsmount *mnt, int *flags, char *data)
+{
+ struct proc_options *opts;
+
+ if (!data || *flags & MS_KERNMOUNT)
+ return 0;
+
+ opts = kzalloc(sizeof(struct proc_options), GFP_KERNEL);
+ if (!opts)
+ return -ENOMEM;
+
+ if (!proc_fill_options(data, opts))
+ return -EINVAL;
+
+ mnt->fs_data = opts;
+
+ return 0;
+}
+
static struct dentry *proc_mount(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data)
{
diff --git a/fs/stat.c b/fs/stat.c
index bc045c7..1e26308 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -10,6 +10,7 @@
#include <linux/file.h>
#include <linux/highuid.h>
#include <linux/fs.h>
+#include <linux/mount.h>
#include <linux/namei.h>
#include <linux/security.h>
#include <linux/syscalls.h>
@@ -53,6 +54,9 @@ int vfs_getattr_nosec(struct path *path, struct kstat *stat)
{
struct inode *inode = d_backing_inode(path->dentry);

+ if (path->mnt->mnt_sb->s_op->getattr_fs)
+ return path->mnt->mnt_sb->s_op->getattr_fs(path->mnt, path->dentry, stat);
+
if (inode->i_op->getattr)
return inode->i_op->getattr(path->mnt, path->dentry, stat);

diff --git a/fs/super.c b/fs/super.c
index c183835..478dd5b 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -832,6 +832,26 @@ int do_remount_sb(struct super_block *sb, int flags, void *data, int force)
return retval;
}

+/**
+ * do_mount_sb - asks filesystem to finish mount with options.
+ * @mnt: vfsmount in question
+ * @flags: numeric part of options
+ * @data: the rest of options
+ *
+ * Alters the mount options of a mounted filesystem.
+ */
+int do_mount_sb(struct vfsmount *mnt, int flags, void *data)
+{
+ if (mnt->mnt_sb->s_writers.frozen != SB_UNFROZEN)
+ return -EBUSY;
+
+ if (mnt->mnt_sb->s_op->mount_fs) {
+ return mnt->mnt_sb->s_op->mount_fs(mnt, &flags, data);
+ }
+
+ return 0;
+}
+
static void do_emergency_remount(struct work_struct *work)
{
struct super_block *sb, *p = NULL;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 83de8b6..5bd1b84 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1759,6 +1759,8 @@ struct super_operations {
int (*thaw_super) (struct super_block *);
int (*unfreeze_fs) (struct super_block *);
int (*statfs) (struct dentry *, struct kstatfs *);
+ int (*getattr_fs) (struct vfsmount *, struct dentry *, struct kstat *);
+ int (*mount_fs) (struct vfsmount *, int *, char *);
int (*remount_fs) (struct super_block *, int *, char *);
void (*umount_begin) (struct super_block *);

diff --git a/include/linux/mount.h b/include/linux/mount.h
index 1172cce..4bd364e 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -67,6 +67,7 @@ struct vfsmount {
struct dentry *mnt_root; /* root of the mounted tree */
struct super_block *mnt_sb; /* pointer to superblock */
int mnt_flags;
+ void *fs_data; /* fs-specific data */
};

struct file; /* forward dec */
--
2.10.2


2017-03-07 16:24:43

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC] Add option to mount only a pids subset

On Mon, Mar 6, 2017 at 3:05 PM, Alexey Gladkov <[email protected]> wrote:
>
> After discussion with Oleg Nesterov I reimplement my patch as an additional
> option for /proc. This option affects the mountpoint. It means that in one
> pid namespace it possible to have both the whole traditional /proc and
> /proc with only pids subset.
>

I like this. I think you should split it into two patches, though:
one that reworks how procfs gets mounted and one that makes adds the
new functionality.

Djajal had some concerns about the first part breaking applications
that use stat and expect certain behavior. This should be manageable,
though, but making stat work appropriately.

2017-03-07 17:52:04

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC] Add option to mount only a pids subset

I can't really review this... but in any case I think you should split
this patch to separate the vfs and proc changes.

On 03/07, Alexey Gladkov wrote:
>
> @@ -962,6 +963,14 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
> mnt->mnt.mnt_sb = root->d_sb;
> mnt->mnt_mountpoint = mnt->mnt.mnt_root;
> mnt->mnt_parent = mnt;
> +
> + err = do_mount_sb(&mnt->mnt, flags, data);
> + if(err) {
> + mnt_free_id(mnt);
> + free_vfsmnt(mnt);
> + return ERR_PTR(err);
> + }

This duplicates the error handling, we do the same if mount_fs() fails.
Perhaps you should move these 2 lines into cleanup block and add goto's.

> +int proc_getattrfs(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
> +{
> + struct inode *inode = d_inode(dentry);
> + struct pid *pid = proc_pid(dentry->d_inode);
> + struct proc_options *opts = mnt->fs_data;
> +
> + if (opts && opts->pid_only && mnt->mnt_root != dentry && !pid)
> + return -ENOENT;

Hmm. I don't quite understand why do we need this, and how this should work.

Yes, "/bin/ls /pidonly-proc/sys" or opendir(/pidonly-proc/sys) should fail,
but only because they both do stat() ?

Afaics you still can do open("/pidonly-proc/sys") + getdents() and this should
work ?

I still think proc_dir_operations.open() makes more sense. Yes, as you pointed
out we also need to update proc_sys_dir_file_operations too and may be something
else...

> +
> + if (!inode->i_op->getattr) {
> + generic_fillattr(inode, stat);
> + return 0;
> + }
> +
> + return inode->i_op->getattr(mnt, dentry, stat);
> +}

Oh, it would be nice to not duplicate the code from the caller, imo.

Oleg.


2017-03-09 11:28:36

by Djalal Harouni

[permalink] [raw]
Subject: Re: [RFC] Add option to mount only a pids subset

On Tue, Mar 7, 2017 at 5:24 PM, Andy Lutomirski <[email protected]> wrote:
>
> On Mon, Mar 6, 2017 at 3:05 PM, Alexey Gladkov <[email protected]> wrote:
> >
> > After discussion with Oleg Nesterov I reimplement my patch as an additional
> > option for /proc. This option affects the mountpoint. It means that in one
> > pid namespace it possible to have both the whole traditional /proc and
> > /proc with only pids subset.
> >
>
> I like this. I think you should split it into two patches, though:
> one that reworks how procfs gets mounted and one that makes adds the
> new functionality.
>
> Djajal had some concerns about the first part breaking applications
> that use stat and expect certain behavior. This should be manageable,
> though, but making stat work appropriately.

I'm bit lost in the two discussion, however the main concern I was
discussing with Andy was if you have per superblock proc mounts then
each mount will end up with its own device ID st_dev, right now they
share the same ID if they are in the same pid namespace, but if we
change that then we may break the following:
http://man7.org/linux/man-pages/man7/namespaces.7.html

Both new NS_GET_PARENT and NS_GET_USERNS ioctl() that return an fd,
suggests to follow up with fstat() to identify the namespaces..
"By applying fstat(2) to the returned file descriptor, one obtains a
stat structure whose st_dev (resident device) and st_ino (inode
number) fields together identify the owning/parent namespace."

Other /proc/self/ns/* comparison and stat() logic...

Andy suggested that we may have the same st_dev for mounts in the same
pid namespace... I'm not sure which side effect this may bring!

Thanks!

--
tixxdz

2017-03-09 20:57:46

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC] Add option to mount only a pids subset

Djalal Harouni <[email protected]> writes:

> On Tue, Mar 7, 2017 at 5:24 PM, Andy Lutomirski <[email protected]> wrote:
>>
>> On Mon, Mar 6, 2017 at 3:05 PM, Alexey Gladkov <[email protected]> wrote:
>> >
>> > After discussion with Oleg Nesterov I reimplement my patch as an additional
>> > option for /proc. This option affects the mountpoint. It means that in one
>> > pid namespace it possible to have both the whole traditional /proc and
>> > /proc with only pids subset.
>> >
>>
>> I like this. I think you should split it into two patches, though:
>> one that reworks how procfs gets mounted and one that makes adds the
>> new functionality.
>>
>> Djajal had some concerns about the first part breaking applications
>> that use stat and expect certain behavior. This should be manageable,
>> though, but making stat work appropriately.
>
> I'm bit lost in the two discussion, however the main concern I was
> discussing with Andy was if you have per superblock proc mounts then
> each mount will end up with its own device ID st_dev, right now they
> share the same ID if they are in the same pid namespace, but if we
> change that then we may break the following:
> http://man7.org/linux/man-pages/man7/namespaces.7.html
>
> Both new NS_GET_PARENT and NS_GET_USERNS ioctl() that return an fd,
> suggests to follow up with fstat() to identify the namespaces..
> "By applying fstat(2) to the returned file descriptor, one obtains a
> stat structure whose st_dev (resident device) and st_ino (inode
> number) fields together identify the owning/parent namespace."
>
> Other /proc/self/ns/* comparison and stat() logic...
>
> Andy suggested that we may have the same st_dev for mounts in the same
> pid namespace... I'm not sure which side effect this may bring!

Well right now it is less of an issue than you image. I suggest you
stat /proc and /proc/self/ns/* and look at st_dev.

That said anything that changes today's proc needs to be tested with a
range of distro's especially those using selinux and apparmor as they
tend to have policies that are very sensitive to the implementation
details. Such that seemingly innocent changes can cause systems to stop
booting.

Eric

2017-03-10 23:38:25

by Alexey Gladkov

[permalink] [raw]
Subject: Re: [RFC] Add option to mount only a pids subset

On Tue, Mar 07, 2017 at 06:49:09PM +0100, Oleg Nesterov wrote:
> I can't really review this... but in any case I think you should split
> this patch to separate the vfs and proc changes.
>
> On 03/07, Alexey Gladkov wrote:
> >
> > @@ -962,6 +963,14 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
> > mnt->mnt.mnt_sb = root->d_sb;
> > mnt->mnt_mountpoint = mnt->mnt.mnt_root;
> > mnt->mnt_parent = mnt;
> > +
> > + err = do_mount_sb(&mnt->mnt, flags, data);
> > + if(err) {
> > + mnt_free_id(mnt);
> > + free_vfsmnt(mnt);
> > + return ERR_PTR(err);
> > + }
>
> This duplicates the error handling, we do the same if mount_fs() fails.
> Perhaps you should move these 2 lines into cleanup block and add goto's.
>
> > +int proc_getattrfs(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
> > +{
> > + struct inode *inode = d_inode(dentry);
> > + struct pid *pid = proc_pid(dentry->d_inode);
> > + struct proc_options *opts = mnt->fs_data;
> > +
> > + if (opts && opts->pid_only && mnt->mnt_root != dentry && !pid)
> > + return -ENOENT;
>
> Hmm. I don't quite understand why do we need this, and how this should work.
>
> Yes, "/bin/ls /pidonly-proc/sys" or opendir(/pidonly-proc/sys) should fail,
> but only because they both do stat() ?
>
> Afaics you still can do open("/pidonly-proc/sys") + getdents() and this should
> work ?

Yes, you're right! I thought that getattr is called always together with
open(). I wanted to prevent all attempts open() for not-pid directories.

> I still think proc_dir_operations.open() makes more sense. Yes, as you pointed
> out we also need to update proc_sys_dir_file_operations too and may be something
> else...

My main task was to hide all possible direcitrices from the /proc
(in pidonly mode)... even those which we do not know. In this case we
can't rely on the fact that everyone will follow the rules and to
properly handle open().

My current attempt was to force filesystem level check of mountpoint flag.
This is necessary to avoid even the theoretical possibility of ignoring
"pidonly" parameter.

I guess I need to add callback to vfs_open or something to can be sure
that we will not open the wrong file or directory in pidonly mode.

--
Rgrds, legion

2017-03-10 23:57:52

by Alexey Gladkov

[permalink] [raw]
Subject: Re: [RFC] Add option to mount only a pids subset

On Tue, Mar 07, 2017 at 08:24:12AM -0800, Andy Lutomirski wrote:
> On Mon, Mar 6, 2017 at 3:05 PM, Alexey Gladkov <[email protected]> wrote:
> >
> > After discussion with Oleg Nesterov I reimplement my patch as an additional
> > option for /proc. This option affects the mountpoint. It means that in one
> > pid namespace it possible to have both the whole traditional /proc and
> > /proc with only pids subset.
> >
>
> I like this. I think you should split it into two patches, though:
> one that reworks how procfs gets mounted and one that makes adds the
> new functionality.

Sure, but first I wanted to discuss the idea. My patch isn't very useful
without the ability to add additional files.

I will split it into parts in the new version.

--
Rgrds, legion

2017-03-11 21:43:03

by Alexey Gladkov

[permalink] [raw]
Subject: Re: [RFC] Add option to mount only a pids subset

On Thu, Mar 09, 2017 at 12:26:49PM +0100, Djalal Harouni wrote:
> I'm bit lost in the two discussion, however the main concern I was
> discussing with Andy was if you have per superblock proc mounts then
> each mount will end up with its own device ID st_dev, right now they
> share the same ID if they are in the same pid namespace, but if we
> change that then we may break the following:
> http://man7.org/linux/man-pages/man7/namespaces.7.html

In fact, nothing has changed. I added a parameter that affects
the mountpoint, not the entire pid namespace.

The procfs will still be global. The device ID will be the same as before.

> Both new NS_GET_PARENT and NS_GET_USERNS ioctl() that return an fd,
> suggests to follow up with fstat() to identify the namespaces..
> "By applying fstat(2) to the returned file descriptor, one obtains a
> stat structure whose st_dev (resident device) and st_ino (inode
> number) fields together identify the owning/parent namespace."
>
> Other /proc/self/ns/* comparison and stat() logic...
>
> Andy suggested that we may have the same st_dev for mounts in the same
> pid namespace... I'm not sure which side effect this may bring!

Basically we have here a issue because other proc options (hidepid for
example) affect the entire pid namespace, but, I guess, have to affect
the mountpoint.

# grep ^proc /proc/mounts
proc /proc proc rw,relatime 0 0

# mount -t proc proc /tmp/proc
# mount -o remount,hidepid=2 -t proc proc /tmp/proc

# grep ^proc /proc/mounts
proc /proc proc rw,relatime,hidepid=2 0 0
proc /tmp/proc proc rw,relatime,hidepid=2 0 0

--
Rgrds, legion

2017-03-12 01:55:03

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] Add option to mount only a pids subset

On Tue, Mar 07, 2017 at 12:05:16AM +0100, Alexey Gladkov wrote:

> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 83de8b6..5bd1b84 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1759,6 +1759,8 @@ struct super_operations {
> int (*thaw_super) (struct super_block *);
> int (*unfreeze_fs) (struct super_block *);
> int (*statfs) (struct dentry *, struct kstatfs *);
> + int (*getattr_fs) (struct vfsmount *, struct dentry *, struct kstat *);
> + int (*mount_fs) (struct vfsmount *, int *, char *);
> int (*remount_fs) (struct super_block *, int *, char *);
> void (*umount_begin) (struct super_block *);
>
> diff --git a/include/linux/mount.h b/include/linux/mount.h
> index 1172cce..4bd364e 100644
> --- a/include/linux/mount.h
> +++ b/include/linux/mount.h
> @@ -67,6 +67,7 @@ struct vfsmount {
> struct dentry *mnt_root; /* root of the mounted tree */
> struct super_block *mnt_sb; /* pointer to superblock */
> int mnt_flags;
> + void *fs_data; /* fs-specific data */

No. This is really asking for trouble - you *can't* hang
fs-specific data structures off vfsmount. Lifetimes make no
sense whatsoever.

BTW, your patch leaks supreblock reference on failure, and
is kludgy as hell wrt to what it's doing in procfs itself, but
that's secondary - the quoted part is enough for a hard NAK on
architectural grounds. Don't go there.

2017-03-12 02:13:17

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] Add option to mount only a pids subset

On Sun, Mar 12, 2017 at 01:54:38AM +0000, Al Viro wrote:
> On Tue, Mar 07, 2017 at 12:05:16AM +0100, Alexey Gladkov wrote:
>
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 83de8b6..5bd1b84 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1759,6 +1759,8 @@ struct super_operations {
> > int (*thaw_super) (struct super_block *);
> > int (*unfreeze_fs) (struct super_block *);
> > int (*statfs) (struct dentry *, struct kstatfs *);
> > + int (*getattr_fs) (struct vfsmount *, struct dentry *, struct kstat *);
> > + int (*mount_fs) (struct vfsmount *, int *, char *);
> > int (*remount_fs) (struct super_block *, int *, char *);
> > void (*umount_begin) (struct super_block *);
> >
> > diff --git a/include/linux/mount.h b/include/linux/mount.h
> > index 1172cce..4bd364e 100644
> > --- a/include/linux/mount.h
> > +++ b/include/linux/mount.h
> > @@ -67,6 +67,7 @@ struct vfsmount {
> > struct dentry *mnt_root; /* root of the mounted tree */
> > struct super_block *mnt_sb; /* pointer to superblock */
> > int mnt_flags;
> > + void *fs_data; /* fs-specific data */
>
> No. This is really asking for trouble - you *can't* hang
> fs-specific data structures off vfsmount. Lifetimes make no
> sense whatsoever.
>
> BTW, your patch leaks supreblock reference on failure, and
> is kludgy as hell wrt to what it's doing in procfs itself, but
> that's secondary - the quoted part is enough for a hard NAK on
> architectural grounds. Don't go there.

PS: AFAICS, simple mount --bind of your pid-only mount will suddenly
expose the full thing. And as for the lifetimes making no sense...
note that you are simply not freeing these structures of yours.
Try to handle that and you'll get a serious PITA all over the
place.

What are you trying to achieve, anyway? Why not add a second vfsmount
pointer per pid_namespace and make it initialized on demand, at the
first attempt of no-pid mount? Just have a separate no-pid instance
created for those namespaces where it had been asked for, with
separate superblock and dentry tree not containing anything other
that pid-only parts + self + thread-self...

2017-03-13 03:20:39

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC] Add option to mount only a pids subset

On Sat, Mar 11, 2017 at 6:13 PM, Al Viro <[email protected]> wrote:
> PS: AFAICS, simple mount --bind of your pid-only mount will suddenly
> expose the full thing. And as for the lifetimes making no sense...
> note that you are simply not freeing these structures of yours.
> Try to handle that and you'll get a serious PITA all over the
> place.
>
> What are you trying to achieve, anyway? Why not add a second vfsmount
> pointer per pid_namespace and make it initialized on demand, at the
> first attempt of no-pid mount? Just have a separate no-pid instance
> created for those namespaces where it had been asked for, with
> separate superblock and dentry tree not containing anything other
> that pid-only parts + self + thread-self...

Can't we just make procfs work like most other filesystems and have
each mount have its own superblock? If we need to do something funky
to stat() output to keep existing userspace working, I think that's
okay.

As far as I can tell, proc_mnt is very nearly useless -- it seems to
be used for proc_flush_task (which claims to be purely an optimization
and could be preserved in the common case where there's only one
relevant mount) and for sysctl_binary. For the latter, we could
create proc_mnt but make actual user-initiated mounts be new
superblocks anyway.

2017-03-13 13:27:54

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] Add option to mount only a pids subset

On Sun, Mar 12, 2017 at 08:19:33PM -0700, Andy Lutomirski wrote:
> On Sat, Mar 11, 2017 at 6:13 PM, Al Viro <[email protected]> wrote:
> > PS: AFAICS, simple mount --bind of your pid-only mount will suddenly
> > expose the full thing. And as for the lifetimes making no sense...
> > note that you are simply not freeing these structures of yours.
> > Try to handle that and you'll get a serious PITA all over the
> > place.
> >
> > What are you trying to achieve, anyway? Why not add a second vfsmount
> > pointer per pid_namespace and make it initialized on demand, at the
> > first attempt of no-pid mount? Just have a separate no-pid instance
> > created for those namespaces where it had been asked for, with
> > separate superblock and dentry tree not containing anything other
> > that pid-only parts + self + thread-self...
>
> Can't we just make procfs work like most other filesystems and have
> each mount have its own superblock? If we need to do something funky
> to stat() output to keep existing userspace working, I think that's
> okay.

First of all, most of the filesystems do *NOT* guarantee anything of
that sort. And what's the point of having more instances than
necessary, anyway?

> As far as I can tell, proc_mnt is very nearly useless -- it seems to
> be used for proc_flush_task (which claims to be purely an optimization
> and could be preserved in the common case where there's only one
> relevant mount) and for sysctl_binary. For the latter, we could
> create proc_mnt but make actual user-initiated mounts be new
> superblocks anyway.

Again, what for? It won't salvage that kludge... It's not as if it
had been hard to have separate pid-only instance created when asked
for (and reused every time when we are asked for pid-only). What's
the point of ever having more than two instances per pidns? IDGI...

Folks, there is no one-to-one correspondence between mountpoints and
superblocks. Not since 2000 or so. Just don't try to shove your
per-superblock stuff into vfsmount; it simply won't work. If you
want a separate instance for that thing, then just go ahead and
have ->mount() decide which one to use (and whether to create a new
one). All there is to it...

2017-03-13 15:25:12

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC] Add option to mount only a pids subset

On Mon, Mar 13, 2017 at 6:27 AM, Al Viro <[email protected]> wrote:
> On Sun, Mar 12, 2017 at 08:19:33PM -0700, Andy Lutomirski wrote:
>> On Sat, Mar 11, 2017 at 6:13 PM, Al Viro <[email protected]> wrote:
>> > PS: AFAICS, simple mount --bind of your pid-only mount will suddenly
>> > expose the full thing. And as for the lifetimes making no sense...
>> > note that you are simply not freeing these structures of yours.
>> > Try to handle that and you'll get a serious PITA all over the
>> > place.
>> >
>> > What are you trying to achieve, anyway? Why not add a second vfsmount
>> > pointer per pid_namespace and make it initialized on demand, at the
>> > first attempt of no-pid mount? Just have a separate no-pid instance
>> > created for those namespaces where it had been asked for, with
>> > separate superblock and dentry tree not containing anything other
>> > that pid-only parts + self + thread-self...
>>
>> Can't we just make procfs work like most other filesystems and have
>> each mount have its own superblock? If we need to do something funky
>> to stat() output to keep existing userspace working, I think that's
>> okay.
>
> First of all, most of the filesystems do *NOT* guarantee anything of
> that sort. And what's the point of having more instances than
> necessary, anyway?

I mean that, if I do:

mount -t proc -o foobar none a
mount -t proc -o baz none b

Then I think that the second mount should create a whole new proc
instance rather than just a new vfsmount. Then the options could
differ, which would solve a bunch of problems.

>
> Again, what for? It won't salvage that kludge... It's not as if it
> had been hard to have separate pid-only instance created when asked
> for (and reused every time when we are asked for pid-only). What's
> the point of ever having more than two instances per pidns? IDGI...

I can easily procfs growing more than one interesting option.
Pid-only and hidepid come to mind, and that's already six possible
combinations. The current hidepid implementation is really awful.

>
> Folks, there is no one-to-one correspondence between mountpoints and
> superblocks. Not since 2000 or so. Just don't try to shove your
> per-superblock stuff into vfsmount; it simply won't work. If you
> want a separate instance for that thing, then just go ahead and
> have ->mount() decide which one to use (and whether to create a new
> one). All there is to it...

That's what I mean. I just don't see the point of going all-out in
trying to reuse superblocks.

2017-03-20 13:00:11

by Alexey Gladkov

[permalink] [raw]
Subject: [RFC] Add option to mount only a pids subset


Al Viro, this patch looks better ?

== Overview ==

Some of the container virtualization systems are mounted /proc inside
the container. This is done in most cases to operate with information
about the processes. Knowing that /proc filesystem is not fully
virtualized they are mounted on top of dangerous places empty files or
directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.).

The structure of this filesystem is dynamic and any module can create a
new object which will not necessarily be virtualized. There are
proprietary modules that aren't in the mainline whose work we can not
verify.

This opens up a potential threat to the system. The developers of the
virtualization system can't predict all dangerous places in /proc by
definition.

A more effective solution would be to mount into the container only what
is necessary and ignore the rest.

Right now there is the opportunity to pass in the container any port of
the /proc filesystem using mount --bind expect the pids.

This patch allows to mount only the part of /proc related to pids without
rest objects. Since this is an option for /proc, flags applied to /proc
have an effect on this subset of filesystem.

Originally the idea was that the container will be mounted only pid
sunset and additional required files will be mounted on top using the
overlayfs.

But I found out that /proc does not support overlayfs and does not allow
to mount anything on top or under it.

== TODO ==

There is still work to do:

* Add overlayfs support.

Signed-off-by: Alexey Gladkov <[email protected]>
---
fs/proc/generic.c | 5 ++
fs/proc/inode.c | 2 +
fs/proc/internal.h | 7 +++
fs/proc/root.c | 113 ++++++++++++++++++++++++++++++++++++++++--
include/linux/pid_namespace.h | 1 +
5 files changed, 123 insertions(+), 5 deletions(-)

diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index ee27feb..50bb1e9 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -23,6 +23,7 @@
#include <linux/spinlock.h>
#include <linux/completion.h>
#include <linux/uaccess.h>
+#include <linux/pid_namespace.h>

#include "internal.h"

@@ -307,6 +308,10 @@ int proc_readdir_de(struct proc_dir_entry *de, struct file *file,
int proc_readdir(struct file *file, struct dir_context *ctx)
{
struct inode *inode = file_inode(file);
+ struct pid_namespace *ns = inode->i_sb->s_fs_info;
+
+ if (ns->pidfs && inode == d_inode(ns->pidfs))
+ return 1;

return proc_readdir_de(PDE(inode), file, ctx);
}
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 2cc7a80..0c9be65 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -109,6 +109,8 @@ static int proc_show_options(struct seq_file *seq, struct dentry *root)
seq_printf(seq, ",gid=%u", from_kgid_munged(&init_user_ns, pid->pid_gid));
if (pid->hide_pid != HIDEPID_OFF)
seq_printf(seq, ",hidepid=%u", pid->hide_pid);
+ if (root == pid->pidfs)
+ seq_printf(seq, ",pidonly");

return 0;
}
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index c5ae09b..a5a4bf1 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -260,7 +260,14 @@ static inline void proc_tty_init(void) {}
/*
* root.c
*/
+struct proc_options {
+ kgid_t pid_gid;
+ int hide_pid;
+ int pid_only;
+};
+
extern struct proc_dir_entry proc_root;
+extern struct proc_dir_entry pidfs_root;
extern int proc_parse_options(char *options, struct pid_namespace *pid);

extern void proc_self_init(void);
diff --git a/fs/proc/root.c b/fs/proc/root.c
index deecb39..c2443d5 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -26,16 +26,17 @@
#include "internal.h"

enum {
- Opt_gid, Opt_hidepid, Opt_err,
+ Opt_gid, Opt_hidepid, Opt_pidonly, Opt_err,
};

static const match_table_t tokens = {
{Opt_hidepid, "hidepid=%u"},
{Opt_gid, "gid=%u"},
+ {Opt_pidonly, "pidonly"},
{Opt_err, NULL},
};

-int proc_parse_options(char *options, struct pid_namespace *pid)
+static int proc_fill_options(char *options, struct proc_options *fs_opts)
{
char *p;
substring_t args[MAX_OPT_ARGS];
@@ -55,7 +56,7 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
case Opt_gid:
if (match_int(&args[0], &option))
return 0;
- pid->pid_gid = make_kgid(current_user_ns(), option);
+ fs_opts->pid_gid = make_kgid(current_user_ns(), option);
break;
case Opt_hidepid:
if (match_int(&args[0], &option))
@@ -65,7 +66,10 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
pr_err("proc: hidepid value must be between 0 and 2.\n");
return 0;
}
- pid->hide_pid = option;
+ fs_opts->hide_pid = option;
+ break;
+ case Opt_pidonly:
+ fs_opts->pid_only = 1;
break;
default:
pr_err("proc: unrecognized mount option \"%s\" "
@@ -77,6 +81,72 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
return 1;
}

+int proc_parse_options(char *options, struct pid_namespace *pid)
+{
+ struct proc_options opts = { 0 };
+
+ if (!proc_fill_options(options, &opts))
+ return 0;
+
+ pid->pid_gid = opts.pid_gid;
+ pid->hide_pid = opts.hide_pid;
+
+ return 1;
+}
+
+static int pidfs_register_dir(struct dentry *root, char *name, struct inode *inode)
+{
+ struct inode *root_inode = d_inode(root);
+ struct dentry *child;
+
+ inode_lock(root_inode);
+ child = d_alloc_name(root, name);
+ if (child) {
+ d_add(child, inode);
+ } else {
+ child = ERR_PTR(-ENOMEM);
+ }
+ inode_unlock(root_inode);
+ if (IS_ERR(child)) {
+ pr_err("pidfs_register_dir: can't allocate /pidfs/%s\n", name);
+ return PTR_ERR(child);
+ }
+ return 0;
+}
+
+static int fill_pidfs_root(struct super_block *s)
+{
+ struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
+ struct inode *root_inode;
+ struct dentry *root;
+ int ret;
+
+ pde_get(&pidfs_root);
+ root_inode = proc_get_inode(s, &pidfs_root);
+ if (!root_inode) {
+ pr_err("pidfs_fill_root: get root inode failed\n");
+ return -ENOMEM;
+ }
+
+ root = d_make_root(root_inode);
+ if (!root) {
+ pr_err("pidfs_fill_root: allocate dentry failed\n");
+ return -ENOMEM;
+ }
+
+ ret = pidfs_register_dir(root, "self", d_inode(ns->proc_self));
+ if (ret)
+ return ret;
+
+ ret = pidfs_register_dir(root, "thread-self", d_inode(ns->proc_thread_self));
+ if (ret)
+ return ret;
+
+ ns->pidfs = root;
+
+ return 0;
+}
+
int proc_remount(struct super_block *sb, int *flags, char *data)
{
struct pid_namespace *pid = sb->s_fs_info;
@@ -89,6 +159,8 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data)
{
struct pid_namespace *ns;
+ static struct dentry *root;
+ struct proc_options opts = { 0 };

if (flags & MS_KERNMOUNT) {
ns = data;
@@ -97,7 +169,23 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
ns = task_active_pid_ns(current);
}

- return mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
+ root = mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
+
+ if (!IS_ERR(root)) {
+ if (!proc_fill_options(data, &opts))
+ return ERR_PTR(-EINVAL);
+
+ if (opts.pid_only) {
+ int ret;
+
+ if (!ns->pidfs && (ret = fill_pidfs_root(root->d_sb)))
+ return ERR_PTR(ret);
+
+ root = ns->pidfs;
+ }
+ }
+
+ return root;
}

static void proc_kill_sb(struct super_block *sb)
@@ -109,6 +197,8 @@ static void proc_kill_sb(struct super_block *sb)
dput(ns->proc_self);
if (ns->proc_thread_self)
dput(ns->proc_thread_self);
+ if (ns->pidfs)
+ dput(ns->pidfs);
kill_anon_super(sb);
put_pid_ns(ns);
}
@@ -214,6 +304,19 @@ struct proc_dir_entry proc_root = {
.name = "/proc",
};

+struct proc_dir_entry pidfs_root = {
+ .low_ino = PROC_ROOT_INO,
+ .namelen = 5,
+ .mode = S_IFDIR | S_IRUGO | S_IXUGO,
+ .nlink = 2,
+ .count = ATOMIC_INIT(1),
+ .proc_iops = &proc_root_inode_operations,
+ .proc_fops = &proc_root_operations,
+ .parent = &pidfs_root,
+ .subdir = RB_ROOT,
+ .name = "/pidfs",
+};
+
int pid_ns_prepare_proc(struct pid_namespace *ns)
{
struct vfsmount *mnt;
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index c2a989d..e7b4d64 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -41,6 +41,7 @@ struct pid_namespace {
struct vfsmount *proc_mnt;
struct dentry *proc_self;
struct dentry *proc_thread_self;
+ struct dentry *pidfs;
#endif
#ifdef CONFIG_BSD_PROCESS_ACCT
struct fs_pin *bacct;
--
2.10.2

2017-03-23 15:59:51

by Djalal Harouni

[permalink] [raw]
Subject: [PATCH] proc: allow to change proc mount options per mount

As per the discussion with Andy, and following what Al Viro suggested
maybe this can work ? the patch is still buggy on top of Linus' tree 093b995e3b

Currently hidepid mount option is propagated to all proc mounts that are
in the same pid namespace. This patch make it possible to have proc
mounts with different options inside the same pid namespace.

Since this may break userspace or code that checks/expects some device
identifiers, this mode is only supported when theo ption "version=2" is
passed.

[Buggy patch]

[tixxdz@fedora-kvm linux]$ sudo mount -t proc none /test
[tixxdz@fedora-kvm linux]$ sudo mount -t proc -o hidepid=2,version=2 none /test2
[tixxdz@fedora-kvm linux]$ ls -l /proc | head -6
total 0
dr-xr-xr-x. 9 root root 0 Mar 23 17:25 1
dr-xr-xr-x. 9 root root 0 Mar 23 17:25 10
dr-xr-xr-x. 9 root root 0 Mar 23 17:25 100
dr-xr-xr-x. 9 gdm gdm 0 Mar 23 17:25 1005
dr-xr-xr-x. 9 root root 0 Mar 23 17:25 101
[tixxdz@fedora-kvm linux]$ ls -l /test | head -16
total 0
dr-xr-xr-x. 9 root root 0 Mar 23 17:25 1
dr-xr-xr-x. 9 root root 0 Mar 23 17:25 10
dr-xr-xr-x. 9 root root 0 Mar 23 17:25 100
dr-xr-xr-x. 9 gdm gdm 0 Mar 23 17:25 1005
dr-xr-xr-x. 9 root root 0 Mar 23 17:25 101
[tixxdz@fedora-kvm linux]$ ls -l /test2 | head -6
total 0
dr-xr-xr-x. 9 tixxdz tixxdz 0 Mar 23 17:27 1182
dr-xr-xr-x. 9 tixxdz tixxdz 0 Mar 23 17:27 1197
dr-xr-xr-x. 9 tixxdz tixxdz 0 Mar 23 17:27 1199
dr-xr-xr-x. 9 tixxdz tixxdz 0 Mar 23 17:27 1222
dr-xr-xr-x. 9 tixxdz tixxdz 0 Mar 23 17:27 1225


Signed-off-by: Djalal Harouni <[email protected]>
---
fs/locks.c | 6 +-
fs/proc/base.c | 51 +++++++++-----
fs/proc/generic.c | 5 ++
fs/proc/inode.c | 10 +--
fs/proc/root.c | 181 ++++++++++++++++++++++++++++++++++++++++++++++--
fs/proc/self.c | 8 ++-
fs/proc/thread_self.c | 6 +-
fs/proc_namespace.c | 14 ++--
include/linux/proc_fs.h | 9 +++
9 files changed, 248 insertions(+), 42 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 2681132..dab5058 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -2617,7 +2617,8 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
unsigned int fl_pid;

if (fl->fl_nspid) {
- struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
+ struct proc_fs_info *fs_info = proc_sb(file_inode(f->file)->i_sb);
+ struct pid_namespace *proc_pidns = fs_info->pid_ns;

/* Don't let fl_pid change based on who is reading the file */
fl_pid = pid_nr_ns(fl->fl_nspid, proc_pidns);
@@ -2701,7 +2702,8 @@ static int locks_show(struct seq_file *f, void *v)
{
struct locks_iterator *iter = f->private;
struct file_lock *fl, *bfl;
- struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
+ struct proc_fs_info *fs_info = proc_sb(file_inode(f->file)->i_sb);
+ struct pid_namespace *proc_pidns = fs_info->pid_ns;

fl = hlist_entry(v, struct file_lock, fl_link);

diff --git a/fs/proc/base.c b/fs/proc/base.c
index c87b6b9..74b389d 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -681,13 +681,24 @@ int proc_setattr(struct dentry *dentry, struct iattr *attr)
* May current process learn task's sched/cmdline info (for hide_pid_min=1)
* or euid/egid (for hide_pid_min=2)?
*/
-static bool has_pid_permissions(struct pid_namespace *pid,
+static bool has_pid_permissions(struct proc_fs_info *fs_info,
struct task_struct *task,
int hide_pid_min)
{
- if (pid->hide_pid < hide_pid_min)
+ int hide_pid;
+ kgid_t pid_gid;
+ int version = fs_info->version;
+
+ if (version == 2) {
+ hide_pid = fs_info->hide_pid;
+ pid_gid = fs_info->pid_gid;
+ } else {
+ hide_pid = fs_info->pid_ns->hide_pid;
+ pid_gid = fs_info->pid_ns->pid_gid;
+ }
+ if (hide_pid < hide_pid_min)
return true;
- if (in_group_p(pid->pid_gid))
+ if (in_group_p(pid_gid))
return true;
return ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
}
@@ -695,14 +706,15 @@ static bool has_pid_permissions(struct pid_namespace *pid,

static int proc_pid_permission(struct inode *inode, int mask)
{
- struct pid_namespace *pid = inode->i_sb->s_fs_info;
+ struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
+ struct pid_namespace *pid = fs_info->pid_ns;
struct task_struct *task;
bool has_perms;

task = get_proc_task(inode);
if (!task)
return -ESRCH;
- has_perms = has_pid_permissions(pid, task, HIDEPID_NO_ACCESS);
+ has_perms = has_pid_permissions(fs_info, task, HIDEPID_NO_ACCESS);
put_task_struct(task);

if (!has_perms) {
@@ -730,12 +742,12 @@ static const struct inode_operations proc_def_inode_operations = {
static int proc_single_show(struct seq_file *m, void *v)
{
struct inode *inode = m->private;
- struct pid_namespace *ns;
struct pid *pid;
struct task_struct *task;
int ret;

- ns = inode->i_sb->s_fs_info;
+ struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
+ struct pid_namespace *ns = fs_info->pid_ns;
pid = proc_pid(inode);
task = get_pid_task(pid, PIDTYPE_PID);
if (!task)
@@ -1734,7 +1746,7 @@ int pid_getattr(const struct path *path, struct kstat *stat,
{
struct inode *inode = d_inode(path->dentry);
struct task_struct *task;
- struct pid_namespace *pid = path->dentry->d_sb->s_fs_info;
+ struct proc_fs_info *fs_info = proc_sb(inode->i_sb);

generic_fillattr(inode, stat);

@@ -1743,7 +1755,7 @@ int pid_getattr(const struct path *path, struct kstat *stat,
stat->gid = GLOBAL_ROOT_GID;
task = pid_task(proc_pid(inode), PIDTYPE_PID);
if (task) {
- if (!has_pid_permissions(pid, task, HIDEPID_INVISIBLE)) {
+ if (!has_pid_permissions(fs_info, task, HIDEPID_INVISIBLE)) {
rcu_read_unlock();
/*
* This doesn't prevent learning whether PID exists,
@@ -2249,6 +2261,8 @@ static const struct seq_operations proc_timers_seq_ops = {
static int proc_timers_open(struct inode *inode, struct file *file)
{
struct timers_private *tp;
+ struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
+ struct pid_namespace *ns = fs_info->pid_ns;

tp = __seq_open_private(file, &proc_timers_seq_ops,
sizeof(struct timers_private));
@@ -2256,7 +2270,7 @@ static int proc_timers_open(struct inode *inode, struct file *file)
return -ENOMEM;

tp->pid = proc_pid(inode);
- tp->ns = inode->i_sb->s_fs_info;
+ tp->ns = ns;
return 0;
}

@@ -3077,13 +3091,13 @@ struct dentry *proc_pid_lookup(struct inode *dir, struct dentry * dentry, unsign
int result = -ENOENT;
struct task_struct *task;
unsigned tgid;
- struct pid_namespace *ns;
+ struct proc_fs_info *fs_info = proc_sb(dir->i_sb);
+ struct pid_namespace *ns = fs_info->pid_ns;

tgid = name_to_int(&dentry->d_name);
if (tgid == ~0U)
goto out;

- ns = dentry->d_sb->s_fs_info;
rcu_read_lock();
task = find_task_by_pid_ns(tgid, ns);
if (task)
@@ -3147,7 +3161,8 @@ static struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter ite
int proc_pid_readdir(struct file *file, struct dir_context *ctx)
{
struct tgid_iter iter;
- struct pid_namespace *ns = file_inode(file)->i_sb->s_fs_info;
+ struct proc_fs_info *fs_info = proc_sb(file_inode(file)->i_sb);
+ struct pid_namespace *ns = fs_info->pid_ns;
loff_t pos = ctx->pos;

if (pos >= PID_MAX_LIMIT + TGID_OFFSET)
@@ -3174,7 +3189,7 @@ int proc_pid_readdir(struct file *file, struct dir_context *ctx)
int len;

cond_resched();
- if (!has_pid_permissions(ns, iter.task, HIDEPID_INVISIBLE))
+ if (!has_pid_permissions(fs_info, iter.task, HIDEPID_INVISIBLE))
continue;

len = snprintf(name, sizeof(name), "%d", iter.tgid);
@@ -3371,7 +3386,8 @@ static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry
struct task_struct *task;
struct task_struct *leader = get_proc_task(dir);
unsigned tid;
- struct pid_namespace *ns;
+ struct proc_fs_info *fs_info = proc_sb(dentry->d_sb);
+ struct pid_namespace *ns = fs_info->pid_ns;

if (!leader)
goto out_no_task;
@@ -3380,7 +3396,6 @@ static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry
if (tid == ~0U)
goto out;

- ns = dentry->d_sb->s_fs_info;
rcu_read_lock();
task = find_task_by_pid_ns(tid, ns);
if (task)
@@ -3482,7 +3497,8 @@ static int proc_task_readdir(struct file *file, struct dir_context *ctx)
{
struct inode *inode = file_inode(file);
struct task_struct *task;
- struct pid_namespace *ns;
+ struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
+ struct pid_namespace *ns = fs_info->pid_ns;
int tid;

if (proc_inode_is_dead(inode))
@@ -3494,7 +3510,6 @@ static int proc_task_readdir(struct file *file, struct dir_context *ctx)
/* f_version caches the tgid value that the last readdir call couldn't
* return. lseek aka telldir automagically resets f_version to 0.
*/
- ns = inode->i_sb->s_fs_info;
tid = (int)file->f_version;
file->f_version = 0;
for (task = first_tid(proc_pid(inode), tid, ctx->pos - 2, ns);
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index ee27feb..49c8cb9 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -28,6 +28,11 @@

static DEFINE_RWLOCK(proc_subdir_lock);

+struct proc_fs_info *proc_sb(struct super_block *sb)
+{
+ return sb->s_fs_info;
+}
+
static int proc_match(unsigned int len, const char *name, struct proc_dir_entry *de)
{
if (len < de->namelen)
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 2cc7a80..4743943 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -103,7 +103,8 @@ void __init proc_init_inodecache(void)
static int proc_show_options(struct seq_file *seq, struct dentry *root)
{
struct super_block *sb = root->d_sb;
- struct pid_namespace *pid = sb->s_fs_info;
+ struct proc_fs_info *fs_info = proc_sb(sb);
+ struct pid_namespace *pid = fs_info->pid_ns;

if (!gid_eq(pid->pid_gid, GLOBAL_ROOT_GID))
seq_printf(seq, ",gid=%u", from_kgid_munged(&init_user_ns, pid->pid_gid));
@@ -473,11 +474,12 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)

int proc_fill_super(struct super_block *s, void *data, int silent)
{
- struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
+ struct proc_fs_info *fs_info = proc_sb(s);
+ struct pid_namespace *ns = get_pid_ns(fs_info->pid_ns);
struct inode *root_inode;
int ret;

- if (!proc_parse_options(data, ns))
+ if (fs_info->version == 1 && !proc_parse_options(data, ns))
return -EINVAL;

/* User space would break if executables or devices appear on proc */
@@ -495,7 +497,7 @@ int proc_fill_super(struct super_block *s, void *data, int silent)
* top of it
*/
s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
-
+
pde_get(&proc_root);
root_inode = proc_get_inode(s, &proc_root);
if (!root_inode) {
diff --git a/fs/proc/root.c b/fs/proc/root.c
index deecb39..d4047ef 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -15,6 +15,7 @@
#include <linux/init.h>
#include <linux/sched.h>
#include <linux/sched/stat.h>
+#include <linux/slab.h>
#include <linux/module.h>
#include <linux/bitops.h>
#include <linux/user_namespace.h>
@@ -26,12 +27,13 @@
#include "internal.h"

enum {
- Opt_gid, Opt_hidepid, Opt_err,
+ Opt_gid, Opt_hidepid, Opt_version ,Opt_err,
};

static const match_table_t tokens = {
{Opt_hidepid, "hidepid=%u"},
{Opt_gid, "gid=%u"},
+ {Opt_version, "version=%u"},
{Opt_err, NULL},
};

@@ -67,6 +69,8 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
}
pid->hide_pid = option;
break;
+ case Opt_version:
+ break;
default:
pr_err("proc: unrecognized mount option \"%s\" "
"or missing value\n", p);
@@ -77,40 +81,205 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
return 1;
}

+struct proc_options {
+ int version; /* version field auto set to 1 to not break userspace */
+ kgid_t pid_gid;
+ int hide_pid;
+};
+
+int proc_parse_early_options(char *options, void *holder,
+ struct proc_options *fs_options)
+{
+ char *p, *opts, *orig;
+ substring_t args[MAX_OPT_ARGS];
+ int option, ret = 0;
+
+ if (!options)
+ return 0;
+
+ opts = kstrdup(options, GFP_KERNEL);
+ if (!opts)
+ return -ENOMEM;
+
+ orig = opts;
+
+ while ((p = strsep(&opts, ",")) != NULL) {
+ int token;
+ if (!*p)
+ continue;
+
+ token = match_token(p, tokens, args);
+ switch (token) {
+ case Opt_version:
+ if (match_int(&args[0], &option)) {
+ ret = -EINVAL;
+ goto out;
+ }
+ if (option < 1 || option > 2) {
+ pr_err("proc: version value must be 1 or 2.\n");
+ ret = -EINVAL;
+ goto out;
+ }
+ fs_options->version = option;
+ break;
+ case Opt_gid:
+ if (match_int(&args[0], &option)) {
+ ret = -EINVAL;
+ goto out;
+ }
+ fs_options->pid_gid = make_kgid(current_user_ns(), option);
+ break;
+ case Opt_hidepid:
+ if (match_int(&args[0], &option)) {
+ ret = -EINVAL;
+ goto out;
+ }
+ if (option < 0 || option > 2) {
+ pr_err("proc: hidepid value must be between 0 and 2.\n");
+ ret = -EINVAL;
+ goto out;
+ }
+ fs_options->hide_pid = option;
+ break;
+ case Opt_err:
+ /*
+ * pr_err("proc: unrecognized mount option \"%s\" ", p);
+ * ret = -EINVAL;
+ * goto out;
+ */
+ default:
+ break;
+ }
+ }
+
+out:
+ kfree(orig);
+ return ret;
+}
+
+static int proc_test_super(struct super_block *s, void *data)
+{
+ int ret = 0;
+ struct proc_fs_info *p = data;
+ struct proc_fs_info *fs_info = proc_sb(s);
+
+ if (p->version == 1 && p->pid_ns == fs_info->pid_ns)
+ ret = 1;
+ /*
+ if (p->version == 2 && p->pid_ns == fs_info->pid_ns &&
+ p->hide_pid == fs_info->hide_pid &&
+ gid_eq(p->pid_gid, fs_info->pid_gid))
+ ret = 1;
+ */
+ return ret;
+}
+
+static int proc_set_super(struct super_block *sb, void *data)
+{
+ sb->s_fs_info = data;
+ return set_anon_super(sb, NULL);
+}
+
int proc_remount(struct super_block *sb, int *flags, char *data)
{
- struct pid_namespace *pid = sb->s_fs_info;
+ int error;
+ struct proc_fs_info *fs_info = proc_sb(sb);
+ struct pid_namespace *pid = fs_info->pid_ns;
+ struct proc_options fs_options = { 1, GLOBAL_ROOT_GID, 0 };

sync_filesystem(sb);
- return !proc_parse_options(data, pid);
+ error = proc_parse_early_options(data, sb->s_type, &fs_options);
+ if (error < 0)
+ return error;
+
+ if (fs_options.version == 1) {
+ error = proc_parse_options(data, pid);
+ if (!error)
+ return -EINVAL;
+ }
+
+ fs_info->version = fs_options.version;
+ fs_info->pid_gid = fs_options.pid_gid;
+ fs_info->hide_pid = fs_options.hide_pid;
+
+ return 0;
}

static struct dentry *proc_mount(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data)
{
struct pid_namespace *ns;
+ struct super_block *sb;
+ struct proc_fs_info *fs_info = NULL;
+ struct proc_options fs_options = { 1, GLOBAL_ROOT_GID, 0 };
+ int error = 0;
+
+ if (!(flags & MS_KERNMOUNT)) {
+ if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
+ return ERR_PTR(-EPERM);
+
+
+ error = proc_parse_early_options(data, fs_type, &fs_options);
+ if (error)
+ return ERR_PTR(error);
+ }
+
+ fs_info = kzalloc(sizeof(struct proc_fs_info), GFP_NOFS);
+ if (!fs_info)
+ return ERR_PTR(-ENOMEM);
+
+ fs_info->version = fs_options.version;
+ fs_info->pid_gid = fs_options.pid_gid;
+ fs_info->hide_pid = fs_options.hide_pid;

if (flags & MS_KERNMOUNT) {
ns = data;
data = NULL;
+ fs_info->version = 1; /* Lets restore this */
} else {
ns = task_active_pid_ns(current);
}

- return mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
+ fs_info->pid_ns = ns;
+
+ sb = sget_userns(fs_type, proc_test_super, proc_set_super, flags,
+ ns->user_ns, fs_info);
+ if (IS_ERR(sb)) {
+ error = PTR_ERR(sb);
+ goto error_fs_info;
+ }
+
+ if (sb->s_root) {
+ kfree(fs_info);
+ } else {
+ error = proc_fill_super(sb, data, flags & MS_SILENT ? 1 : 0);
+ if (error) {
+ deactivate_locked_super(sb);
+ goto error_fs_info;
+ }
+
+ sb->s_flags |= MS_ACTIVE;
+ }
+
+ return dget(sb->s_root);
+
+error_fs_info:
+ kfree(fs_info);
+ return ERR_PTR(error);
}

static void proc_kill_sb(struct super_block *sb)
{
- struct pid_namespace *ns;
+ struct proc_fs_info *fs_info = proc_sb(sb);
+ struct pid_namespace *ns = (struct pid_namespace *)fs_info->pid_ns;

- ns = (struct pid_namespace *)sb->s_fs_info;
if (ns->proc_self)
dput(ns->proc_self);
if (ns->proc_thread_self)
dput(ns->proc_thread_self);
kill_anon_super(sb);
put_pid_ns(ns);
+ kfree(fs_info);
}

static struct file_system_type proc_fs_type = {
diff --git a/fs/proc/self.c b/fs/proc/self.c
index 39857f6..9f95174 100644
--- a/fs/proc/self.c
+++ b/fs/proc/self.c
@@ -10,7 +10,8 @@ static const char *proc_self_get_link(struct dentry *dentry,
struct inode *inode,
struct delayed_call *done)
{
- struct pid_namespace *ns = inode->i_sb->s_fs_info;
+ struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
+ struct pid_namespace *ns = fs_info->pid_ns;
pid_t tgid = task_tgid_nr_ns(current, ns);
char *name;

@@ -34,9 +35,10 @@ static unsigned self_inum;
int proc_setup_self(struct super_block *s)
{
struct inode *root_inode = d_inode(s->s_root);
- struct pid_namespace *ns = s->s_fs_info;
+ struct proc_fs_info *fs_info = proc_sb(s);
+ struct pid_namespace *ns = fs_info->pid_ns;
struct dentry *self;
-
+
inode_lock(root_inode);
self = d_alloc_name(s->s_root, "self");
if (self) {
diff --git a/fs/proc/thread_self.c b/fs/proc/thread_self.c
index 20614b6..13d9aef 100644
--- a/fs/proc/thread_self.c
+++ b/fs/proc/thread_self.c
@@ -10,7 +10,8 @@ static const char *proc_thread_self_get_link(struct dentry *dentry,
struct inode *inode,
struct delayed_call *done)
{
- struct pid_namespace *ns = inode->i_sb->s_fs_info;
+ struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
+ struct pid_namespace *ns = fs_info->pid_ns;
pid_t tgid = task_tgid_nr_ns(current, ns);
pid_t pid = task_pid_nr_ns(current, ns);
char *name;
@@ -34,8 +35,9 @@ static unsigned thread_self_inum;

int proc_setup_thread_self(struct super_block *s)
{
+ struct proc_fs_info *fs_info = proc_sb(s);
+ struct pid_namespace *ns = fs_info->pid_ns;
struct inode *root_inode = d_inode(s->s_root);
- struct pid_namespace *ns = s->s_fs_info;
struct dentry *thread_self;

inode_lock(root_inode);
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index b5713fe..d0ae937 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -36,23 +36,23 @@ static unsigned mounts_poll(struct file *file, poll_table *wait)
return res;
}

-struct proc_fs_info {
+struct proc_fs_opts {
int flag;
const char *str;
};

static int show_sb_opts(struct seq_file *m, struct super_block *sb)
{
- static const struct proc_fs_info fs_info[] = {
+ static const struct proc_fs_opts fs_opts[] = {
{ MS_SYNCHRONOUS, ",sync" },
{ MS_DIRSYNC, ",dirsync" },
{ MS_MANDLOCK, ",mand" },
{ MS_LAZYTIME, ",lazytime" },
{ 0, NULL }
};
- const struct proc_fs_info *fs_infop;
+ const struct proc_fs_opts *fs_infop;

- for (fs_infop = fs_info; fs_infop->flag; fs_infop++) {
+ for (fs_infop = fs_opts; fs_infop->flag; fs_infop++) {
if (sb->s_flags & fs_infop->flag)
seq_puts(m, fs_infop->str);
}
@@ -62,7 +62,7 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb)

static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
{
- static const struct proc_fs_info mnt_info[] = {
+ static const struct proc_fs_opts mnt_opts[] = {
{ MNT_NOSUID, ",nosuid" },
{ MNT_NODEV, ",nodev" },
{ MNT_NOEXEC, ",noexec" },
@@ -71,9 +71,9 @@ static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
{ MNT_RELATIME, ",relatime" },
{ 0, NULL }
};
- const struct proc_fs_info *fs_infop;
+ const struct proc_fs_opts *fs_infop;

- for (fs_infop = mnt_info; fs_infop->flag; fs_infop++) {
+ for (fs_infop = mnt_opts; fs_infop->flag; fs_infop++) {
if (mnt->mnt_flags & fs_infop->flag)
seq_puts(m, fs_infop->str);
}
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 2d2bf59..27a1e85 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -7,10 +7,18 @@
#include <linux/types.h>
#include <linux/fs.h>

+struct proc_fs_info {
+ struct pid_namespace *pid_ns;
+ int version;
+ kgid_t pid_gid;
+ int hide_pid;
+};
+
struct proc_dir_entry;

#ifdef CONFIG_PROC_FS

+extern struct proc_fs_info *proc_sb(struct super_block *sb);
extern void proc_root_init(void);
extern void proc_flush_task(struct task_struct *);

@@ -53,6 +61,7 @@ static inline void proc_flush_task(struct task_struct *task)
{
}

+extern inline struct proc_fs_info *proc_sb(struct super_block *sb) { return NULL;}
static inline struct proc_dir_entry *proc_symlink(const char *name,
struct proc_dir_entry *parent,const char *dest) { return NULL;}
static inline struct proc_dir_entry *proc_mkdir(const char *name,
--
2.10.2

2017-03-23 16:06:33

by Djalal Harouni

[permalink] [raw]
Subject: Re: [RFC] Add option to mount only a pids subset

Hi Alexey,

On Mon, Mar 20, 2017 at 1:58 PM, Alexey Gladkov
<[email protected]> wrote:
>
>
> Al Viro, this patch looks better ?
>
> == Overview ==
>
> Some of the container virtualization systems are mounted /proc inside
> the container. This is done in most cases to operate with information
> about the processes. Knowing that /proc filesystem is not fully
> virtualized they are mounted on top of dangerous places empty files or
> directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.).
>
> The structure of this filesystem is dynamic and any module can create a
> new object which will not necessarily be virtualized. There are
> proprietary modules that aren't in the mainline whose work we can not
> verify.
>
> This opens up a potential threat to the system. The developers of the
> virtualization system can't predict all dangerous places in /proc by
> definition.
>
> A more effective solution would be to mount into the container only what
> is necessary and ignore the rest.
>
> Right now there is the opportunity to pass in the container any port of
> the /proc filesystem using mount --bind expect the pids.
>
> This patch allows to mount only the part of /proc related to pids without
> rest objects. Since this is an option for /proc, flags applied to /proc
> have an effect on this subset of filesystem.

I just sent a patch that also has to deal with proc hidepid here:
https://lkml.org/lkml/2017/3/23/505

I'm not sure if that's the right approach, it is still buggy, however
seems that your patch also stores the mount option inside the
pid_namespace which may get propagated to all mounts inside same pidns
?

I didn't have enough time but maybe if they are related we can work it
out together ?

Thank you!


--
tixxdz

2017-03-23 16:07:16

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC] Add option to mount only a pids subset

Again, I can't really review this, I know nothing about vfs, but since
nobody else replied...

On 03/20, Alexey Gladkov wrote:
>
> @@ -97,7 +169,23 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
> ns = task_active_pid_ns(current);
> }
>
> - return mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
> + root = mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
> +
> + if (!IS_ERR(root)) {
> + if (!proc_fill_options(data, &opts))
> + return ERR_PTR(-EINVAL);

So we have to call proc_fill_options() twice, not good... Yes, I understand
why, but perhaps we factor it out somehow, we can pack options + pid_ns into
sb->s_fs_info. Nevermind, this is minor.

> + if (opts.pid_only) {
> + int ret;
> +
> + if (!ns->pidfs && (ret = fill_pidfs_root(root->d_sb)))
> + return ERR_PTR(ret);
> +
> + root = ns->pidfs;

Afaics this lacks dget(ns->pidfs) which should pair with dput(mnt.mnt_root)
in cleanup_mnt(). IIUC otherwise ns->pidfs can go away after umount, OTOH,
if we return ns->pidfs then dget(sb->s_root) in mount_ns() is not balanced.
But this all is fixeable.

So with this change "mount -opidonly" creates another IS_ROOT() dentry which
is not equal to sb->s_root. I simply do not know if this is technically
correct or not... but, say, the "Only bind mounts can have disconnected paths"
comment in path_connected() makes me worry ;)

And this obviously means that /path-to-pidonly-mnt/ won't share dentries with
the normal /proc mount. Not really good imo even if not really wrong... Lets
look at proc_flush_task(). The exiting task will flush its $pid dentries in
/proc/ but not in /path-to-pidonly-mnt/ iiuc. Again, not really a bug, but
still...

Oleg.

2017-03-23 22:07:33

by Alexey Gladkov

[permalink] [raw]
Subject: Re: [RFC] Add option to mount only a pids subset

On Thu, Mar 23, 2017 at 05:06:28PM +0100, Djalal Harouni wrote:
> Hi Alexey,
>
> On Mon, Mar 20, 2017 at 1:58 PM, Alexey Gladkov
> <[email protected]> wrote:
> >
> >
> > Al Viro, this patch looks better ?
> >
> > == Overview ==
> >
> > Some of the container virtualization systems are mounted /proc inside
> > the container. This is done in most cases to operate with information
> > about the processes. Knowing that /proc filesystem is not fully
> > virtualized they are mounted on top of dangerous places empty files or
> > directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.).
> >
> > The structure of this filesystem is dynamic and any module can create a
> > new object which will not necessarily be virtualized. There are
> > proprietary modules that aren't in the mainline whose work we can not
> > verify.
> >
> > This opens up a potential threat to the system. The developers of the
> > virtualization system can't predict all dangerous places in /proc by
> > definition.
> >
> > A more effective solution would be to mount into the container only what
> > is necessary and ignore the rest.
> >
> > Right now there is the opportunity to pass in the container any port of
> > the /proc filesystem using mount --bind expect the pids.
> >
> > This patch allows to mount only the part of /proc related to pids without
> > rest objects. Since this is an option for /proc, flags applied to /proc
> > have an effect on this subset of filesystem.
>
> I just sent a patch that also has to deal with proc hidepid here:
> https://lkml.org/lkml/2017/3/23/505

I completely agree with you that it looks wrong when options of one
mountpoint affect the others mountpoints.

> I'm not sure if that's the right approach, it is still buggy, however
> seems that your patch also stores the mount option inside the
> pid_namespace which may get propagated to all mounts inside same pidns
> ?

I don't store 'pidonly' option in my current patch. I mean in:
https://lkml.org/lkml/2017/3/20/363

I parse options twice at first mount of procfs. It happens before
the mounting /proc in userspace.

I know it's bad, but I couldn't find place to store this option.

> I didn't have enough time but maybe if they are related we can work it
> out together ?

I don't have enough experience in kenrel hacking, but I would happily do
my best :)

I also tring to mention it in every patch, as my changes almost completely
useless without the ability to use the overlayfs.

Now if you remove the restriction:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/proc/inode.c#n497

and mount procfs as lowerdir in overlayfs then you get NULL pointer
dereference at:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/namei.c#n891

I got it when I tried to do `ls -la /overlay/proc/self/`.

--
Rgrds, legion

2017-03-23 22:57:20

by Alexey Gladkov

[permalink] [raw]
Subject: Re: [RFC] Add option to mount only a pids subset

On Thu, Mar 23, 2017 at 05:05:07PM +0100, Oleg Nesterov wrote:
> Again, I can't really review this, I know nothing about vfs, but since
> nobody else replied...

Thanks anyway :)

> On 03/20, Alexey Gladkov wrote:
> >
> > @@ -97,7 +169,23 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
> > ns = task_active_pid_ns(current);
> > }
> >
> > - return mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
> > + root = mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
> > +
> > + if (!IS_ERR(root)) {
> > + if (!proc_fill_options(data, &opts))
> > + return ERR_PTR(-EINVAL);
>
> So we have to call proc_fill_options() twice, not good... Yes, I understand
> why, but perhaps we factor it out somehow, we can pack options + pid_ns into
> sb->s_fs_info. Nevermind, this is minor.

It happens only once, when we don't have the s_root yet.

> > + if (opts.pid_only) {
> > + int ret;
> > +
> > + if (!ns->pidfs && (ret = fill_pidfs_root(root->d_sb)))
> > + return ERR_PTR(ret);
> > +
> > + root = ns->pidfs;
>
> Afaics this lacks dget(ns->pidfs) which should pair with dput(mnt.mnt_root)
> in cleanup_mnt(). IIUC otherwise ns->pidfs can go away after umount, OTOH,
> if we return ns->pidfs then dget(sb->s_root) in mount_ns() is not balanced.
> But this all is fixeable.
>
> So with this change "mount -opidonly" creates another IS_ROOT() dentry which
> is not equal to sb->s_root. I simply do not know if this is technically
> correct or not... but, say, the "Only bind mounts can have disconnected paths"
> comment in path_connected() makes me worry ;)
>
> And this obviously means that /path-to-pidonly-mnt/ won't share dentries with
> the normal /proc mount. Not really good imo even if not really wrong... Lets
> look at proc_flush_task(). The exiting task will flush its $pid dentries in
> /proc/ but not in /path-to-pidonly-mnt/ iiuc. Again, not really a bug, but
> still...

I know that I'm cheater, but I did not start first :)

--
Rgrds, legion

2017-03-26 07:13:44

by Djalal Harouni

[permalink] [raw]
Subject: Re: [RFC] Add option to mount only a pids subset

(Cc'ed Kees and Jann for the procfs stacking issue)


On Thu, Mar 23, 2017 at 11:07 PM, Alexey Gladkov
<[email protected]> wrote:
> On Thu, Mar 23, 2017 at 05:06:28PM +0100, Djalal Harouni wrote:
>> Hi Alexey,
>>
>> On Mon, Mar 20, 2017 at 1:58 PM, Alexey Gladkov
>> <[email protected]> wrote:
>> >
>> >
>> > Al Viro, this patch looks better ?
>> >
>> > == Overview ==
>> >
>> > Some of the container virtualization systems are mounted /proc inside
>> > the container. This is done in most cases to operate with information
>> > about the processes. Knowing that /proc filesystem is not fully
>> > virtualized they are mounted on top of dangerous places empty files or
>> > directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.).
>> >
>> > The structure of this filesystem is dynamic and any module can create a
>> > new object which will not necessarily be virtualized. There are
>> > proprietary modules that aren't in the mainline whose work we can not
>> > verify.
>> >
>> > This opens up a potential threat to the system. The developers of the
>> > virtualization system can't predict all dangerous places in /proc by
>> > definition.
>> >
>> > A more effective solution would be to mount into the container only what
>> > is necessary and ignore the rest.
>> >
>> > Right now there is the opportunity to pass in the container any port of
>> > the /proc filesystem using mount --bind expect the pids.
>> >
>> > This patch allows to mount only the part of /proc related to pids without
>> > rest objects. Since this is an option for /proc, flags applied to /proc
>> > have an effect on this subset of filesystem.
>>
>> I just sent a patch that also has to deal with proc hidepid here:
>> https://lkml.org/lkml/2017/3/23/505
>
> I completely agree with you that it looks wrong when options of one
> mountpoint affect the others mountpoints.
>
>> I'm not sure if that's the right approach, it is still buggy, however
>> seems that your patch also stores the mount option inside the
>> pid_namespace which may get propagated to all mounts inside same pidns
>> ?
>
> I don't store 'pidonly' option in my current patch. I mean in:
> https://lkml.org/lkml/2017/3/20/363
>
> I parse options twice at first mount of procfs. It happens before
> the mounting /proc in userspace.
>
> I know it's bad, but I couldn't find place to store this option.

Ok, then maybe that approach of having a procfs struct holder instead
of using pid namespace may help!


>> I didn't have enough time but maybe if they are related we can work it
>> out together ?
>
> I don't have enough experience in kenrel hacking, but I would happily do
> my best :)

OK, let's sync on this then.

> I also tring to mention it in every patch, as my changes almost completely
> useless without the ability to use the overlayfs.
>
> Now if you remove the restriction:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/proc/inode.c#n497

This protection was introduced in e54ad7f1ee263

That's a security fix, ecryptfs seems to schedule some code through
kthreads which always run in the initial namespaces with full
capabilities, bypassing all security checks.

I don't know if overlayfs has something similar, if not then maybe if
it's possible in ecryptfs to check the lower mount and the fs type and
move this procfs there... Cc'ed Jann in case he may comment.

> and mount procfs as lowerdir in overlayfs then you get NULL pointer
> dereference at:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/namei.c#n891
>
> I got it when I tried to do `ls -la /overlay/proc/self/`.
>
> --
> Rgrds, legion

2017-03-30 21:36:47

by Alexey Gladkov

[permalink] [raw]
Subject: Re: [RFC] Add option to mount only a pids subset

On Sun, Mar 26, 2017 at 09:03:33AM +0200, Djalal Harouni wrote:
> (Cc'ed Kees and Jann for the procfs stacking issue)
>
> > I completely agree with you that it looks wrong when options of one
> > mountpoint affect the others mountpoints.
> >
> >> I'm not sure if that's the right approach, it is still buggy, however
> >> seems that your patch also stores the mount option inside the
> >> pid_namespace which may get propagated to all mounts inside same pidns
> >> ?
> >
> > I don't store 'pidonly' option in my current patch. I mean in:
> > https://lkml.org/lkml/2017/3/20/363
> >
> > I parse options twice at first mount of procfs. It happens before
> > the mounting /proc in userspace.
> >
> > I know it's bad, but I couldn't find place to store this option.
>
> Ok, then maybe that approach of having a procfs struct holder instead
> of using pid namespace may help!

I deside to stop doing my patch. I talked with a few people and found out
that the overlayfs doesn't feel very well if on the lower level filesystem
appear/disappear files.

In addition with the pidfs isn't so simple. Separate the root will lead to
a doubling of memory consumption and restrictions on the filesystem operations
level can easily be skipped.

It means that even I can do this pidfs (or pid subset in /proc), it
will be pointless.

--
Rgrds, legion