2017-03-30 15:28:18

by Djalal Harouni

[permalink] [raw]
Subject: [PATCH RFC 0/4] proc: support multiple separate proc instances per pidnamespace

Hi,

This RFC can be applied on top of Linus' tree 89970a04d7

This RFC implements support for multiple separate proc instances inside
the same pid namespace. This allows to solve lot of problems that
today's use case face.

Historically procfs was tied to pid namespaces, and mount options were
propagated to all other procfs instances in the same pid namespace. This
solved several use cases in that time. However today we face new
problems, there are mutliple container implementations there, some of
them want to hide pid entries, others want to hide non-pid entries,
others want to have sysctlfs, others want to share pid namespace with
private procfs mounts. All these with current implementation won't work
since all options will be propagated to all procfs mounts.

This series allow to have new instances of procfs per pid namespace where
each instance can have its own mount option inside the same pid namespace.
This was also suggested by Andy Lutomirski.


Now:
$ sudo mount -t proc -o unshare,hidepid=2 none /test

The option 'unshare' will allow to mount a new instance of procfs inside
the same pid namespace.

Before:
$ stat /proc/slabinfo

File: ‘/proc/slabinfo’
Size: 0 Blocks: 0 IO Block: 1024 regular empty file
Device: 4h/4d Inode: 4026532046 Links: 1

$ stat /test3/slabinfo

File: ‘/test3/slabinfo’
Size: 0 Blocks: 0 IO Block: 1024 regular empty file
Device: 4h/4d Inode: 4026532046 Links: 1


After:
$ stat /proc/slabinfo

File: ‘/proc/slabinfo’
Size: 0 Blocks: 0 IO Block: 1024 regular empty file
Device: 4h/4d Inode: 4026532046 Links: 1

$ stat /test3/slabinfo

File: ‘/test3/slabinfo’
Size: 0 Blocks: 0 IO Block: 1024 regular empty file
Device: 31h/49d Inode: 4026532046 Links: 1


Any better name for the option 'unshare' ? suggestions ?

I was going to use 'version=2' but then this may sound more like a
proc2 fs which currently impossible to implement since it will share
locks with the old proc.


Al, Eric any comments please ?

[Patch RFC 4/4] proc: support flushing dcache entries of a task on multiple procfs mounts
Is maybe not needed, and I have to test it further.

Thanks!


2017-03-30 15:23:43

by Djalal Harouni

[permalink] [raw]
Subject: [PATCH RFC 3/4] proc: support mounting new procfs instances inside same pid namespace

This patch adds support for 'unshare' mount option to have multiple
separated procfs inside the same pid namespace. This allows to solve lot
of problem for containers and their specific use cases.

Signed-off-by: Djalal Harouni <[email protected]>
---
fs/proc/generic.c | 10 +++++++++
fs/proc/inode.c | 3 +++
fs/proc/root.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++--
include/linux/proc_fs.h | 12 ++++++++++
4 files changed, 82 insertions(+), 2 deletions(-)

diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 7e5e419..7ae5377 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -52,6 +52,11 @@ void proc_fs_set_pid_gid(struct proc_fs_info *fs_info, kgid_t gid)
fs_info->pid_gid = gid;
}

+void proc_fs_set_unshare(struct proc_fs_info *fs_info, int version)
+{
+ fs_info->version = version;
+}
+
int proc_fs_get_hide_pid(struct proc_fs_info *fs_info)
{
/* For backward compatibility */
@@ -70,6 +75,11 @@ kgid_t proc_fs_get_pid_gid(struct proc_fs_info *fs_info)
return fs_info->pid_gid;
}

+int proc_fs_get_unshare(struct proc_fs_info *fs_info)
+{
+ return fs_info->version;
+}
+
static int proc_match(unsigned int len, const char *name, struct proc_dir_entry *de)
{
if (len < de->namelen)
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index ca47a0a..5f7557d 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -111,6 +111,9 @@ static int proc_show_options(struct seq_file *seq, struct dentry *root)
if (pid->hide_pid != HIDEPID_OFF)
seq_printf(seq, ",hidepid=%u", pid->hide_pid);

+ if (proc_fs_get_unshare(fs_info) == PROC_FS_V2)
+ seq_printf(seq, ",unshare");
+
return 0;
}

diff --git a/fs/proc/root.c b/fs/proc/root.c
index 6a96c02..7a8f425 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -27,15 +27,52 @@
#include "internal.h"

enum {
- Opt_gid, Opt_hidepid, Opt_err,
+ Opt_gid, Opt_hidepid, Opt_unshare, Opt_err,
};

static const match_table_t tokens = {
{Opt_hidepid, "hidepid=%u"},
{Opt_gid, "gid=%u"},
+ {Opt_unshare, "unshare"},
{Opt_err, NULL},
};

+/* We only parse 'unshare' option here */
+int proc_parse_early_options(char *options, struct proc_fs_info *fs_info)
+{
+ char *p, *opts, *orig;
+ substring_t args[MAX_OPT_ARGS];
+
+ if (!options)
+ return 0;
+
+ opts = kstrdup(options, GFP_KERNEL);
+ if (!opts)
+ return -ENOMEM;
+
+ orig = opts;
+
+ while ((p = strsep(&opts, ",")) != NULL) {
+ int token;
+
+ if (!*p)
+ continue;
+
+ token = match_token(p, tokens, args);
+ switch (token) {
+ case Opt_unshare:
+ pr_info("proc: mounting a new procfs instance ");
+ proc_fs_set_unshare(fs_info, PROC_FS_V2);
+ break;
+ default:
+ break;
+ }
+ }
+
+ kfree(orig);
+ return 0;
+}
+
int proc_parse_options(char *options, struct proc_fs_info *fs_info)
{
char *p;
@@ -70,6 +107,8 @@ int proc_parse_options(char *options, struct proc_fs_info *fs_info)
}
proc_fs_set_hide_pid(fs_info, option);
break;
+ case Opt_unshare:
+ break;
default:
pr_err("proc: unrecognized mount option \"%s\" "
"or missing value\n", p);
@@ -82,9 +121,19 @@ int proc_parse_options(char *options, struct proc_fs_info *fs_info)

int proc_remount(struct super_block *sb, int *flags, char *data)
{
+ int error, version;
struct proc_fs_info *fs_info = proc_sb(sb);

+ version = proc_fs_get_unshare(fs_info);
+
sync_filesystem(sb);
+
+ if (version == PROC_FS_V2) {
+ error = proc_parse_early_options(data, fs_info);
+ if (error < 0)
+ return error;
+ }
+
return !proc_parse_options(data, fs_info);
}

@@ -122,15 +171,21 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
if (!fs_info)
return ERR_PTR(-ENOMEM);

+ /* Set it as early as possible */
+ proc_fs_set_unshare(fs_info, PROC_FS_V1);
+
if (flags & MS_KERNMOUNT) {
ns = data;
data = NULL;
} else {
+ error = proc_parse_early_options(data, fs_info);
+ if (error < 0)
+ goto error_fs_info;
+
ns = task_active_pid_ns(current);
}

fs_info->pid_ns = ns;
- fs_info->version = PROC_FS_V1;
fs_info->hide_pid = HIDEPID_OFF;
fs_info->pid_gid = GLOBAL_ROOT_GID;
refcount_set(&fs_info->users, 1);
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index c23299d..e3a78a5 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -31,8 +31,11 @@ extern void proc_fs_set_hide_pid(struct proc_fs_info *fs_info, int hide_pid);

extern void proc_fs_set_pid_gid(struct proc_fs_info *fs_info, kgid_t gid);

+extern void proc_fs_set_unshare(struct proc_fs_info *fs_info, int version);
+
extern int proc_fs_get_hide_pid(struct proc_fs_info *fs_info);
extern kgid_t proc_fs_get_pid_gid(struct proc_fs_info *fs_info);
+extern int proc_fs_get_unshare(struct proc_fs_info *fs_info);

extern void proc_root_init(void);
extern void proc_flush_task(struct task_struct *);
@@ -84,6 +87,10 @@ static inline void proc_fs_set_hide_pid(struct proc_fs_info *fs_info, int hide_p
{
}

+static inline void proc_fs_set_unshare(struct proc_fs_info *fs_info, int version)
+{
+}
+
static inline void proc_fs_set_pid_gid(struct proc_info_fs *fs_info, kgid_t gid)
{
}
@@ -98,6 +105,11 @@ extern kgid_t proc_fs_get_pid_gid(struct proc_fs_info *fs_info)
return GLOBAL_ROOT_GID;
}

+static inline int proc_fs_get_unshare(struct proc_fs_info *fs_info)
+{
+ return PROC_FS_V1;
+}
+
extern inline struct proc_fs_info *proc_sb(struct super_block *sb) { return NULL;}
static inline struct proc_dir_entry *proc_symlink(const char *name,
struct proc_dir_entry *parent,const char *dest) { return NULL;}
--
2.10.2

2017-03-30 15:24:13

by Djalal Harouni

[permalink] [raw]
Subject: [PATCH RFC 4/4] proc: support flushing dcache entries of a task on multiple procfs mounts

This allows to flush dcache entries of a task on multiple procfs mounts
per pid namespace.

Maybe this patch is nedded since this is just an optimization and maybe
it contains bugs.

Signed-off-by: Djalal Harouni <[email protected]>
---
fs/proc/base.c | 28 ++++++++++++++++++++++------
fs/proc/inode.c | 13 +++++++++++--
fs/proc/root.c | 13 +++++++++++++
include/linux/pid_namespace.h | 42 ++++++++++++++++++++++++++++++++++++++++++
include/linux/proc_fs.h | 5 +++++
5 files changed, 93 insertions(+), 8 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index fd16566..0b96eb1 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1739,7 +1739,6 @@ int pid_getattr(const struct path *path, struct kstat *stat,
struct task_struct *task;
struct inode *inode = d_inode(path->dentry);
struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
- struct pid_namespace *pid = fs_info->pid_ns;

generic_fillattr(inode, stat);

@@ -2967,7 +2966,8 @@ static const struct inode_operations proc_tgid_base_inode_operations = {
.permission = proc_pid_permission,
};

-static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
+static void proc_flush_task_mnt_root(struct dentry *mnt_root,
+ pid_t pid, pid_t tgid)
{
struct dentry *dentry, *leader, *dir;
char buf[PROC_NUMBUF];
@@ -2976,7 +2976,7 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
name.name = buf;
name.len = snprintf(buf, sizeof(buf), "%d", pid);
/* no ->d_hash() rejects on procfs */
- dentry = d_hash_and_lookup(mnt->mnt_root, &name);
+ dentry = d_hash_and_lookup(mnt_root, &name);
if (dentry) {
d_invalidate(dentry);
dput(dentry);
@@ -2987,7 +2987,7 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)

name.name = buf;
name.len = snprintf(buf, sizeof(buf), "%d", tgid);
- leader = d_hash_and_lookup(mnt->mnt_root, &name);
+ leader = d_hash_and_lookup(mnt_root, &name);
if (!leader)
goto out;

@@ -3042,14 +3042,30 @@ void proc_flush_task(struct task_struct *task)
int i;
struct pid *pid, *tgid;
struct upid *upid;
+ struct proc_fs_info *fs_info_entry;
+ struct pid_namespace *pid_ns;
+ struct dentry *mnt_root;

pid = task_pid(task);
tgid = task_tgid(task);

for (i = 0; i <= pid->level; i++) {
upid = &pid->numbers[i];
- proc_flush_task_mnt(upid->ns->proc_mnt, upid->nr,
- tgid->numbers[i].nr);
+ pid_ns = upid->ns;
+
+ pidns_procfs_lock_shared(pid_ns);
+ list_for_each_entry(fs_info_entry, &pid_ns->procfs_mounts,
+ pidns_entry) {
+ if (proc_fs_get_unshare(fs_info_entry) == PROC_FS_V2) {
+ mnt_root = fs_info_entry->sb->s_root;
+ proc_flush_task_mnt_root(mnt_root, upid->nr,
+ tgid->numbers[i].nr);
+ }
+ }
+ pidns_procfs_unlock_shared(pid_ns);
+
+ mnt_root = pid_ns->proc_mnt->mnt_root;
+ proc_flush_task_mnt_root(mnt_root, upid->nr, tgid->numbers[i].nr);
}
}

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 5f7557d..73e49b3 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -478,10 +478,19 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
int proc_fill_super(struct super_block *s, void *data, int silent)
{
struct proc_fs_info *fs_info = proc_sb(s);
+ struct pid_namespace *ns = get_pid_ns(fs_info->pid_ns);
struct inode *root_inode;
- int ret;
+ int ret, version;

- get_pid_ns(fs_info->pid_ns);
+ fs_info->sb = s;
+
+ version = proc_fs_get_unshare(fs_info);
+
+ if (version == PROC_FS_V2) {
+ pidns_procfs_lock(ns);
+ list_add_tail(&fs_info->pidns_entry, &ns->procfs_mounts);
+ pidns_procfs_unlock(ns);
+ }

if (!proc_parse_options(data, fs_info))
return -EINVAL;
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 7a8f425..73f972f 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -222,6 +222,7 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,

static void proc_destroy_sb(struct super_block *sb)
{
+ int version;
struct proc_fs_info *fs_info = proc_sb(sb);
struct pid_namespace *ns = (struct pid_namespace *)fs_info->pid_ns;

@@ -229,6 +230,15 @@ static void proc_destroy_sb(struct super_block *sb)
dput(ns->proc_self);
if (ns->proc_thread_self)
dput(ns->proc_thread_self);
+
+ version = proc_fs_get_unshare(fs_info);
+
+ if (version == PROC_FS_V2) {
+ pidns_procfs_lock(ns);
+ list_del(&fs_info->pidns_entry);
+ pidns_procfs_unlock(ns);
+ }
+
kill_anon_super(sb);
put_pid_ns(ns);
kfree(fs_info);
@@ -352,6 +362,9 @@ int pid_ns_prepare_proc(struct pid_namespace *ns)
return PTR_ERR(mnt);

ns->proc_mnt = mnt;
+ init_rwsem(&ns->rw_procfs_mnts);
+ INIT_LIST_HEAD(&ns->procfs_mounts);
+
return 0;
}

diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index c2a989d..05639c8 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -41,6 +41,8 @@ struct pid_namespace {
struct vfsmount *proc_mnt;
struct dentry *proc_self;
struct dentry *proc_thread_self;
+ struct rw_semaphore rw_procfs_mnts;
+ struct list_head procfs_mounts; /* list of separated procfs mounts */
#endif
#ifdef CONFIG_BSD_PROCESS_ACCT
struct fs_pin *bacct;
@@ -107,4 +109,44 @@ extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk);
void pidhash_init(void);
void pidmap_init(void);

+#ifdef CONFIG_PROC_FS
+static inline void pidns_procfs_lock(struct pid_namespace *pid_ns)
+{
+ down_write(&pid_ns->rw_procfs_mnts);
+}
+
+static inline void pidns_procfs_unlock(struct pid_namespace *pid_ns)
+{
+ up_write(&pid_ns->rw_procfs_mnts);
+}
+
+static inline void pidns_procfs_lock_shared(struct pid_namespace *pid_ns)
+{
+ down_read(&pid_ns->rw_procfs_mnts);
+}
+
+static inline void pidns_procfs_unlock_shared(struct pid_namespace *pid_ns)
+{
+ up_read(&pid_ns->rw_procfs_mnts);
+}
+#else /* !CONFIG_PROC_FS */
+
+static inline void pidns_procfs_lock(struct pid_namespace *pid_ns)
+{
+}
+
+static inline void pidns_procfs_unlock(struct pid_namespace *pid_ns)
+{
+}
+
+static inline void pidns_procfs_lock_shared(struct pid_namespace *pid_ns)
+{
+}
+
+static inline void pidns_procfs_unlock_shared(struct pid_namespace *pid_ns)
+{
+}
+
+#endif /* CONFIG_PROC_FS */
+
#endif /* _LINUX_PID_NS_H */
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index e3a78a5..b2200e3 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -15,7 +15,12 @@ enum {

struct proc_fs_info {
refcount_t users;
+
+ struct super_block *sb;
struct pid_namespace *pid_ns;
+
+ struct list_head pidns_entry; /* Node in procfs_mounts of a pidns */
+
kgid_t pid_gid;
int hide_pid;
int version;
--
2.10.2

2017-03-30 15:24:43

by Djalal Harouni

[permalink] [raw]
Subject: [PATCH RFC 2/4] proc: add helpers to set/get hidepid and gid mount options

This is a preparation patch to allow to set and get hidepid and gid
mount options correctly

Signed-off-by: Djalal Harouni <[email protected]>
---
fs/proc/base.c | 15 +++++++++------
fs/proc/generic.c | 37 +++++++++++++++++++++++++++++++++++++
fs/proc/inode.c | 5 +++--
fs/proc/internal.h | 2 +-
fs/proc/root.c | 13 ++++++++-----
include/linux/proc_fs.h | 35 ++++++++++++++++++++++++++++++++---
6 files changed, 90 insertions(+), 17 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index cd16979..fd16566 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -681,13 +681,16 @@ int proc_setattr(struct dentry *dentry, struct iattr *attr)
* May current process learn task's sched/cmdline info (for hide_pid_min=1)
* or euid/egid (for hide_pid_min=2)?
*/
-static bool has_pid_permissions(struct pid_namespace *pid,
+static bool has_pid_permissions(struct proc_fs_info *fs_info,
struct task_struct *task,
int hide_pid_min)
{
- if (pid->hide_pid < hide_pid_min)
+ int hide_pid = proc_fs_get_hide_pid(fs_info);
+ kgid_t gid = proc_fs_get_pid_gid(fs_info);
+
+ if (hide_pid < hide_pid_min)
return true;
- if (in_group_p(pid->pid_gid))
+ if (in_group_p(gid))
return true;
return ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
}
@@ -703,7 +706,7 @@ static int proc_pid_permission(struct inode *inode, int mask)
task = get_proc_task(inode);
if (!task)
return -ESRCH;
- has_perms = has_pid_permissions(pid, task, HIDEPID_NO_ACCESS);
+ has_perms = has_pid_permissions(fs_info, task, HIDEPID_NO_ACCESS);
put_task_struct(task);

if (!has_perms) {
@@ -1745,7 +1748,7 @@ int pid_getattr(const struct path *path, struct kstat *stat,
stat->gid = GLOBAL_ROOT_GID;
task = pid_task(proc_pid(inode), PIDTYPE_PID);
if (task) {
- if (!has_pid_permissions(pid, task, HIDEPID_INVISIBLE)) {
+ if (!has_pid_permissions(fs_info, task, HIDEPID_INVISIBLE)) {
rcu_read_unlock();
/*
* This doesn't prevent learning whether PID exists,
@@ -3179,7 +3182,7 @@ int proc_pid_readdir(struct file *file, struct dir_context *ctx)
int len;

cond_resched();
- if (!has_pid_permissions(ns, iter.task, HIDEPID_INVISIBLE))
+ if (!has_pid_permissions(fs_info, iter.task, HIDEPID_INVISIBLE))
continue;

len = snprintf(name, sizeof(name), "%d", iter.tgid);
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 49c8cb9..7e5e419 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -16,6 +16,7 @@
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/printk.h>
+#include <linux/pid_namespace.h>
#include <linux/mount.h>
#include <linux/init.h>
#include <linux/idr.h>
@@ -33,6 +34,42 @@ struct proc_fs_info *proc_sb(struct super_block *sb)
return sb->s_fs_info;
}

+void proc_fs_set_hide_pid(struct proc_fs_info *fs_info, int hide_pid)
+{
+ /* For backward compatibility */
+ if (fs_info->version == PROC_FS_V1)
+ fs_info->pid_ns->hide_pid = hide_pid;
+ else if (fs_info->version == PROC_FS_V2)
+ fs_info->hide_pid = hide_pid;
+}
+
+void proc_fs_set_pid_gid(struct proc_fs_info *fs_info, kgid_t gid)
+{
+ /* For backward compatibility */
+ if (fs_info->version == PROC_FS_V1)
+ fs_info->pid_ns->pid_gid = gid;
+ else if (fs_info->version == PROC_FS_V2)
+ fs_info->pid_gid = gid;
+}
+
+int proc_fs_get_hide_pid(struct proc_fs_info *fs_info)
+{
+ /* For backward compatibility */
+ if (fs_info->version == PROC_FS_V1)
+ return fs_info->pid_ns->hide_pid;
+
+ return fs_info->hide_pid;
+}
+
+kgid_t proc_fs_get_pid_gid(struct proc_fs_info *fs_info)
+{
+ /* For backward compatibility */
+ if (fs_info->version == PROC_FS_V1)
+ return fs_info->pid_ns->pid_gid;
+
+ return fs_info->pid_gid;
+}
+
static int proc_match(unsigned int len, const char *name, struct proc_dir_entry *de)
{
if (len < de->namelen)
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index e708288..ca47a0a 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -475,11 +475,12 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
int proc_fill_super(struct super_block *s, void *data, int silent)
{
struct proc_fs_info *fs_info = proc_sb(s);
- struct pid_namespace *ns = get_pid_ns(fs_info->pid_ns);
struct inode *root_inode;
int ret;

- if (!proc_parse_options(data, ns))
+ get_pid_ns(fs_info->pid_ns);
+
+ if (!proc_parse_options(data, fs_info))
return -EINVAL;

/* User space would break if executables or devices appear on proc */
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index c5ae09b..126fa31 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -261,7 +261,7 @@ static inline void proc_tty_init(void) {}
* root.c
*/
extern struct proc_dir_entry proc_root;
-extern int proc_parse_options(char *options, struct pid_namespace *pid);
+extern int proc_parse_options(char *options, struct proc_fs_info *fs_info);

extern void proc_self_init(void);
extern int proc_remount(struct super_block *, int *, char *);
diff --git a/fs/proc/root.c b/fs/proc/root.c
index a683e93..6a96c02 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -36,11 +36,12 @@ static const match_table_t tokens = {
{Opt_err, NULL},
};

-int proc_parse_options(char *options, struct pid_namespace *pid)
+int proc_parse_options(char *options, struct proc_fs_info *fs_info)
{
char *p;
substring_t args[MAX_OPT_ARGS];
int option;
+ kgid_t gid;

if (!options)
return 1;
@@ -56,7 +57,8 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
case Opt_gid:
if (match_int(&args[0], &option))
return 0;
- pid->pid_gid = make_kgid(current_user_ns(), option);
+ gid = make_kgid(current_user_ns(), option);
+ proc_fs_set_pid_gid(fs_info, gid);
break;
case Opt_hidepid:
if (match_int(&args[0], &option))
@@ -66,7 +68,7 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
pr_err("proc: hidepid value must be between 0 and 2.\n");
return 0;
}
- pid->hide_pid = option;
+ proc_fs_set_hide_pid(fs_info, option);
break;
default:
pr_err("proc: unrecognized mount option \"%s\" "
@@ -81,10 +83,9 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
int proc_remount(struct super_block *sb, int *flags, char *data)
{
struct proc_fs_info *fs_info = proc_sb(sb);
- struct pid_namespace *pid = fs_info->pid_ns;

sync_filesystem(sb);
- return !proc_parse_options(data, pid);
+ return !proc_parse_options(data, fs_info);
}

static int proc_test_super(struct super_block *s, void *data)
@@ -130,6 +131,8 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,

fs_info->pid_ns = ns;
fs_info->version = PROC_FS_V1;
+ fs_info->hide_pid = HIDEPID_OFF;
+ fs_info->pid_gid = GLOBAL_ROOT_GID;
refcount_set(&fs_info->users, 1);

sb = sget_userns(fs_type, proc_test_super, proc_set_super, flags,
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index e1cb9c3..c23299d 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -9,8 +9,8 @@
#include <linux/refcount.h>

enum {
- PROC_FS_V1 = 1,
- PROC_FS_V2 = 2,
+ PROC_FS_V1 = 1, /* Inside same pidns procfs mounts are shared */
+ PROC_FS_V2 = 2, /* New procfs mounts are separated by default */
};

struct proc_fs_info {
@@ -27,6 +27,13 @@ struct proc_dir_entry;

extern struct proc_fs_info *proc_sb(struct super_block *sb);

+extern void proc_fs_set_hide_pid(struct proc_fs_info *fs_info, int hide_pid);
+
+extern void proc_fs_set_pid_gid(struct proc_fs_info *fs_info, kgid_t gid);
+
+extern int proc_fs_get_hide_pid(struct proc_fs_info *fs_info);
+extern kgid_t proc_fs_get_pid_gid(struct proc_fs_info *fs_info);
+
extern void proc_root_init(void);
extern void proc_flush_task(struct task_struct *);

@@ -38,7 +45,7 @@ extern struct proc_dir_entry *proc_mkdir_data(const char *, umode_t,
extern struct proc_dir_entry *proc_mkdir_mode(const char *, umode_t,
struct proc_dir_entry *);
struct proc_dir_entry *proc_create_mount_point(const char *name);
-
+
extern struct proc_dir_entry *proc_create_data(const char *, umode_t,
struct proc_dir_entry *,
const struct file_operations *,
@@ -69,6 +76,28 @@ static inline void proc_flush_task(struct task_struct *task)
{
}

+static inline void proc_fs_set_hide_pid(struct proc_fs_info *fs_info, int hide_pid)
+{
+}
+
+static inline void proc_fs_set_hide_pid(struct proc_fs_info *fs_info, int hide_pid)
+{
+}
+
+static inline void proc_fs_set_pid_gid(struct proc_info_fs *fs_info, kgid_t gid)
+{
+}
+
+static inline int proc_fs_get_hide_pid(struct proc_fs_info *fs_info)
+{
+ return 0;
+}
+
+extern kgid_t proc_fs_get_pid_gid(struct proc_fs_info *fs_info)
+{
+ return GLOBAL_ROOT_GID;
+}
+
extern inline struct proc_fs_info *proc_sb(struct super_block *sb) { return NULL;}
static inline struct proc_dir_entry *proc_symlink(const char *name,
struct proc_dir_entry *parent,const char *dest) { return NULL;}
--
2.10.2

2017-03-30 15:25:47

by Djalal Harouni

[permalink] [raw]
Subject: [PATCH RFC 1/4] proc: add proc_fs_info struct to store proc options

This is a preparation patch that adds a proc_fs_info to be able to store
different procfs options. Right now some mount options are stored inside
the pid namespace which make multiple proc share the same mount options.
This patch will help also to fix this.

Signed-off-by: Djalal Harouni <[email protected]>
---
fs/locks.c | 6 ++--
fs/proc/base.c | 31 +++++++++++--------
fs/proc/generic.c | 5 +++
fs/proc/inode.c | 8 +++--
fs/proc/root.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++---
fs/proc/self.c | 8 +++--
fs/proc/thread_self.c | 6 ++--
fs/proc_namespace.c | 14 ++++-----
include/linux/proc_fs.h | 17 +++++++++++
9 files changed, 141 insertions(+), 35 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 2681132..dab5058 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -2617,7 +2617,8 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
unsigned int fl_pid;

if (fl->fl_nspid) {
- struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
+ struct proc_fs_info *fs_info = proc_sb(file_inode(f->file)->i_sb);
+ struct pid_namespace *proc_pidns = fs_info->pid_ns;

/* Don't let fl_pid change based on who is reading the file */
fl_pid = pid_nr_ns(fl->fl_nspid, proc_pidns);
@@ -2701,7 +2702,8 @@ static int locks_show(struct seq_file *f, void *v)
{
struct locks_iterator *iter = f->private;
struct file_lock *fl, *bfl;
- struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
+ struct proc_fs_info *fs_info = proc_sb(file_inode(f->file)->i_sb);
+ struct pid_namespace *proc_pidns = fs_info->pid_ns;

fl = hlist_entry(v, struct file_lock, fl_link);

diff --git a/fs/proc/base.c b/fs/proc/base.c
index c87b6b9..cd16979 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -695,7 +695,8 @@ static bool has_pid_permissions(struct pid_namespace *pid,

static int proc_pid_permission(struct inode *inode, int mask)
{
- struct pid_namespace *pid = inode->i_sb->s_fs_info;
+ struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
+ struct pid_namespace *pid = fs_info->pid_ns;
struct task_struct *task;
bool has_perms;

@@ -730,12 +731,12 @@ static const struct inode_operations proc_def_inode_operations = {
static int proc_single_show(struct seq_file *m, void *v)
{
struct inode *inode = m->private;
- struct pid_namespace *ns;
struct pid *pid;
struct task_struct *task;
int ret;

- ns = inode->i_sb->s_fs_info;
+ struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
+ struct pid_namespace *ns = fs_info->pid_ns;
pid = proc_pid(inode);
task = get_pid_task(pid, PIDTYPE_PID);
if (!task)
@@ -1732,9 +1733,10 @@ struct inode *proc_pid_make_inode(struct super_block * sb,
int pid_getattr(const struct path *path, struct kstat *stat,
u32 request_mask, unsigned int query_flags)
{
- struct inode *inode = d_inode(path->dentry);
struct task_struct *task;
- struct pid_namespace *pid = path->dentry->d_sb->s_fs_info;
+ struct inode *inode = d_inode(path->dentry);
+ struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
+ struct pid_namespace *pid = fs_info->pid_ns;

generic_fillattr(inode, stat);

@@ -2249,6 +2251,8 @@ static const struct seq_operations proc_timers_seq_ops = {
static int proc_timers_open(struct inode *inode, struct file *file)
{
struct timers_private *tp;
+ struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
+ struct pid_namespace *ns = fs_info->pid_ns;

tp = __seq_open_private(file, &proc_timers_seq_ops,
sizeof(struct timers_private));
@@ -2256,7 +2260,7 @@ static int proc_timers_open(struct inode *inode, struct file *file)
return -ENOMEM;

tp->pid = proc_pid(inode);
- tp->ns = inode->i_sb->s_fs_info;
+ tp->ns = ns;
return 0;
}

@@ -3077,13 +3081,13 @@ struct dentry *proc_pid_lookup(struct inode *dir, struct dentry * dentry, unsign
int result = -ENOENT;
struct task_struct *task;
unsigned tgid;
- struct pid_namespace *ns;
+ struct proc_fs_info *fs_info = proc_sb(dir->i_sb);
+ struct pid_namespace *ns = fs_info->pid_ns;

tgid = name_to_int(&dentry->d_name);
if (tgid == ~0U)
goto out;

- ns = dentry->d_sb->s_fs_info;
rcu_read_lock();
task = find_task_by_pid_ns(tgid, ns);
if (task)
@@ -3147,7 +3151,8 @@ static struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter ite
int proc_pid_readdir(struct file *file, struct dir_context *ctx)
{
struct tgid_iter iter;
- struct pid_namespace *ns = file_inode(file)->i_sb->s_fs_info;
+ struct proc_fs_info *fs_info = proc_sb(file_inode(file)->i_sb);
+ struct pid_namespace *ns = fs_info->pid_ns;
loff_t pos = ctx->pos;

if (pos >= PID_MAX_LIMIT + TGID_OFFSET)
@@ -3371,7 +3376,8 @@ static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry
struct task_struct *task;
struct task_struct *leader = get_proc_task(dir);
unsigned tid;
- struct pid_namespace *ns;
+ struct proc_fs_info *fs_info = proc_sb(dentry->d_sb);
+ struct pid_namespace *ns = fs_info->pid_ns;

if (!leader)
goto out_no_task;
@@ -3380,7 +3386,6 @@ static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry
if (tid == ~0U)
goto out;

- ns = dentry->d_sb->s_fs_info;
rcu_read_lock();
task = find_task_by_pid_ns(tid, ns);
if (task)
@@ -3482,7 +3487,8 @@ static int proc_task_readdir(struct file *file, struct dir_context *ctx)
{
struct inode *inode = file_inode(file);
struct task_struct *task;
- struct pid_namespace *ns;
+ struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
+ struct pid_namespace *ns = fs_info->pid_ns;
int tid;

if (proc_inode_is_dead(inode))
@@ -3494,7 +3500,6 @@ static int proc_task_readdir(struct file *file, struct dir_context *ctx)
/* f_version caches the tgid value that the last readdir call couldn't
* return. lseek aka telldir automagically resets f_version to 0.
*/
- ns = inode->i_sb->s_fs_info;
tid = (int)file->f_version;
file->f_version = 0;
for (task = first_tid(proc_pid(inode), tid, ctx->pos - 2, ns);
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index ee27feb..49c8cb9 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -28,6 +28,11 @@

static DEFINE_RWLOCK(proc_subdir_lock);

+struct proc_fs_info *proc_sb(struct super_block *sb)
+{
+ return sb->s_fs_info;
+}
+
static int proc_match(unsigned int len, const char *name, struct proc_dir_entry *de)
{
if (len < de->namelen)
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 2cc7a80..e708288 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -103,7 +103,8 @@ void __init proc_init_inodecache(void)
static int proc_show_options(struct seq_file *seq, struct dentry *root)
{
struct super_block *sb = root->d_sb;
- struct pid_namespace *pid = sb->s_fs_info;
+ struct proc_fs_info *fs_info = proc_sb(sb);
+ struct pid_namespace *pid = fs_info->pid_ns;

if (!gid_eq(pid->pid_gid, GLOBAL_ROOT_GID))
seq_printf(seq, ",gid=%u", from_kgid_munged(&init_user_ns, pid->pid_gid));
@@ -473,7 +474,8 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)

int proc_fill_super(struct super_block *s, void *data, int silent)
{
- struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
+ struct proc_fs_info *fs_info = proc_sb(s);
+ struct pid_namespace *ns = get_pid_ns(fs_info->pid_ns);
struct inode *root_inode;
int ret;

@@ -495,7 +497,7 @@ int proc_fill_super(struct super_block *s, void *data, int silent)
* top of it
*/
s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
-
+
pde_get(&proc_root);
root_inode = proc_get_inode(s, &proc_root);
if (!root_inode) {
diff --git a/fs/proc/root.c b/fs/proc/root.c
index deecb39..a683e93 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -15,6 +15,7 @@
#include <linux/init.h>
#include <linux/sched.h>
#include <linux/sched/stat.h>
+#include <linux/slab.h>
#include <linux/module.h>
#include <linux/bitops.h>
#include <linux/user_namespace.h>
@@ -79,16 +80,46 @@ int proc_parse_options(char *options, struct pid_namespace *pid)

int proc_remount(struct super_block *sb, int *flags, char *data)
{
- struct pid_namespace *pid = sb->s_fs_info;
+ struct proc_fs_info *fs_info = proc_sb(sb);
+ struct pid_namespace *pid = fs_info->pid_ns;

sync_filesystem(sb);
return !proc_parse_options(data, pid);
}

+static int proc_test_super(struct super_block *s, void *data)
+{
+ int ret = 0;
+ struct proc_fs_info *p = data;
+ struct proc_fs_info *fs_info = proc_sb(s);
+
+ if (p->version == PROC_FS_V1 && fs_info->version == PROC_FS_V1 &&
+ p->pid_ns == fs_info->pid_ns)
+ ret = 1;
+
+ return ret;
+}
+
+static int proc_set_super(struct super_block *sb, void *data)
+{
+ sb->s_fs_info = data;
+ return set_anon_super(sb, NULL);
+}
+
static struct dentry *proc_mount(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data)
{
+ int error;
+ struct super_block *sb;
struct pid_namespace *ns;
+ struct proc_fs_info *fs_info;
+
+ if (!(flags & MS_KERNMOUNT) && !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
+ return ERR_PTR(-EPERM);
+
+ fs_info = kzalloc(sizeof(*fs_info), GFP_NOFS);
+ if (!fs_info)
+ return ERR_PTR(-ENOMEM);

if (flags & MS_KERNMOUNT) {
ns = data;
@@ -97,20 +128,60 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
ns = task_active_pid_ns(current);
}

- return mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
+ fs_info->pid_ns = ns;
+ fs_info->version = PROC_FS_V1;
+ refcount_set(&fs_info->users, 1);
+
+ sb = sget_userns(fs_type, proc_test_super, proc_set_super, flags,
+ ns->user_ns, fs_info);
+ if (IS_ERR(sb)) {
+ error = PTR_ERR(sb);
+ goto error_fs_info;
+ }
+
+ if (sb->s_root) {
+ struct proc_fs_info *old = proc_sb(sb);
+
+ refcount_inc(&old->users);
+ kfree(fs_info);
+ } else {
+ error = proc_fill_super(sb, data, flags & MS_SILENT ? 1 : 0);
+ if (error) {
+ deactivate_locked_super(sb);
+ goto error;
+ }
+
+ sb->s_flags |= MS_ACTIVE;
+ }
+
+ return dget(sb->s_root);
+
+error_fs_info:
+ kfree(fs_info);
+error:
+ return ERR_PTR(error);
}

-static void proc_kill_sb(struct super_block *sb)
+static void proc_destroy_sb(struct super_block *sb)
{
- struct pid_namespace *ns;
+ struct proc_fs_info *fs_info = proc_sb(sb);
+ struct pid_namespace *ns = (struct pid_namespace *)fs_info->pid_ns;

- ns = (struct pid_namespace *)sb->s_fs_info;
if (ns->proc_self)
dput(ns->proc_self);
if (ns->proc_thread_self)
dput(ns->proc_thread_self);
kill_anon_super(sb);
put_pid_ns(ns);
+ kfree(fs_info);
+}
+
+static void proc_kill_sb(struct super_block *sb)
+{
+ struct proc_fs_info *fs_info = proc_sb(sb);
+
+ if (refcount_dec_and_test(&fs_info->users))
+ proc_destroy_sb(sb);
}

static struct file_system_type proc_fs_type = {
diff --git a/fs/proc/self.c b/fs/proc/self.c
index 39857f6..9f95174 100644
--- a/fs/proc/self.c
+++ b/fs/proc/self.c
@@ -10,7 +10,8 @@ static const char *proc_self_get_link(struct dentry *dentry,
struct inode *inode,
struct delayed_call *done)
{
- struct pid_namespace *ns = inode->i_sb->s_fs_info;
+ struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
+ struct pid_namespace *ns = fs_info->pid_ns;
pid_t tgid = task_tgid_nr_ns(current, ns);
char *name;

@@ -34,9 +35,10 @@ static unsigned self_inum;
int proc_setup_self(struct super_block *s)
{
struct inode *root_inode = d_inode(s->s_root);
- struct pid_namespace *ns = s->s_fs_info;
+ struct proc_fs_info *fs_info = proc_sb(s);
+ struct pid_namespace *ns = fs_info->pid_ns;
struct dentry *self;
-
+
inode_lock(root_inode);
self = d_alloc_name(s->s_root, "self");
if (self) {
diff --git a/fs/proc/thread_self.c b/fs/proc/thread_self.c
index 20614b6..13d9aef 100644
--- a/fs/proc/thread_self.c
+++ b/fs/proc/thread_self.c
@@ -10,7 +10,8 @@ static const char *proc_thread_self_get_link(struct dentry *dentry,
struct inode *inode,
struct delayed_call *done)
{
- struct pid_namespace *ns = inode->i_sb->s_fs_info;
+ struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
+ struct pid_namespace *ns = fs_info->pid_ns;
pid_t tgid = task_tgid_nr_ns(current, ns);
pid_t pid = task_pid_nr_ns(current, ns);
char *name;
@@ -34,8 +35,9 @@ static unsigned thread_self_inum;

int proc_setup_thread_self(struct super_block *s)
{
+ struct proc_fs_info *fs_info = proc_sb(s);
+ struct pid_namespace *ns = fs_info->pid_ns;
struct inode *root_inode = d_inode(s->s_root);
- struct pid_namespace *ns = s->s_fs_info;
struct dentry *thread_self;

inode_lock(root_inode);
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index b5713fe..d0ae937 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -36,23 +36,23 @@ static unsigned mounts_poll(struct file *file, poll_table *wait)
return res;
}

-struct proc_fs_info {
+struct proc_fs_opts {
int flag;
const char *str;
};

static int show_sb_opts(struct seq_file *m, struct super_block *sb)
{
- static const struct proc_fs_info fs_info[] = {
+ static const struct proc_fs_opts fs_opts[] = {
{ MS_SYNCHRONOUS, ",sync" },
{ MS_DIRSYNC, ",dirsync" },
{ MS_MANDLOCK, ",mand" },
{ MS_LAZYTIME, ",lazytime" },
{ 0, NULL }
};
- const struct proc_fs_info *fs_infop;
+ const struct proc_fs_opts *fs_infop;

- for (fs_infop = fs_info; fs_infop->flag; fs_infop++) {
+ for (fs_infop = fs_opts; fs_infop->flag; fs_infop++) {
if (sb->s_flags & fs_infop->flag)
seq_puts(m, fs_infop->str);
}
@@ -62,7 +62,7 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb)

static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
{
- static const struct proc_fs_info mnt_info[] = {
+ static const struct proc_fs_opts mnt_opts[] = {
{ MNT_NOSUID, ",nosuid" },
{ MNT_NODEV, ",nodev" },
{ MNT_NOEXEC, ",noexec" },
@@ -71,9 +71,9 @@ static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
{ MNT_RELATIME, ",relatime" },
{ 0, NULL }
};
- const struct proc_fs_info *fs_infop;
+ const struct proc_fs_opts *fs_infop;

- for (fs_infop = mnt_info; fs_infop->flag; fs_infop++) {
+ for (fs_infop = mnt_opts; fs_infop->flag; fs_infop++) {
if (mnt->mnt_flags & fs_infop->flag)
seq_puts(m, fs_infop->str);
}
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 2d2bf59..e1cb9c3 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -6,11 +6,27 @@

#include <linux/types.h>
#include <linux/fs.h>
+#include <linux/refcount.h>
+
+enum {
+ PROC_FS_V1 = 1,
+ PROC_FS_V2 = 2,
+};
+
+struct proc_fs_info {
+ refcount_t users;
+ struct pid_namespace *pid_ns;
+ kgid_t pid_gid;
+ int hide_pid;
+ int version;
+};

struct proc_dir_entry;

#ifdef CONFIG_PROC_FS

+extern struct proc_fs_info *proc_sb(struct super_block *sb);
+
extern void proc_root_init(void);
extern void proc_flush_task(struct task_struct *);

@@ -53,6 +69,7 @@ static inline void proc_flush_task(struct task_struct *task)
{
}

+extern inline struct proc_fs_info *proc_sb(struct super_block *sb) { return NULL;}
static inline struct proc_dir_entry *proc_symlink(const char *name,
struct proc_dir_entry *parent,const char *dest) { return NULL;}
static inline struct proc_dir_entry *proc_mkdir(const char *name,
--
2.10.2

2017-03-30 19:11:10

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH RFC 1/4] proc: add proc_fs_info struct to store proc options

On Thu, Mar 30, 2017 at 8:22 AM, Djalal Harouni <[email protected]> wrote:
> This is a preparation patch that adds a proc_fs_info to be able to store
> different procfs options. Right now some mount options are stored inside
> the pid namespace which make multiple proc share the same mount options.
> This patch will help also to fix this.
>
> Signed-off-by: Djalal Harouni <[email protected]>

> static struct dentry *proc_mount(struct file_system_type *fs_type,
> int flags, const char *dev_name, void *data)
> {
> + int error;
> + struct super_block *sb;
> struct pid_namespace *ns;
> + struct proc_fs_info *fs_info;
> +
> + if (!(flags & MS_KERNMOUNT) && !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> + return ERR_PTR(-EPERM);

Why is this check needed?

> diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> index 2d2bf59..e1cb9c3 100644
> --- a/include/linux/proc_fs.h
> +++ b/include/linux/proc_fs.h
> @@ -6,11 +6,27 @@
>
> #include <linux/types.h>
> #include <linux/fs.h>
> +#include <linux/refcount.h>
> +
> +enum {
> + PROC_FS_V1 = 1,
> + PROC_FS_V2 = 2,
> +};
> +
> +struct proc_fs_info {
> + refcount_t users;
> + struct pid_namespace *pid_ns;
> + kgid_t pid_gid;
> + int hide_pid;
> + int version;
> +};

What is version?

Should this patch have just users and pid_ns and move the other stuff
to patch 2?

2017-03-30 19:10:36

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH RFC 3/4] proc: support mounting new procfs instances inside same pid namespace

On Thu, Mar 30, 2017 at 8:22 AM, Djalal Harouni <[email protected]> wrote:
> This patch adds support for 'unshare' mount option to have multiple
> separated procfs inside the same pid namespace. This allows to solve lot
> of problem for containers and their specific use cases.

It would be nice if we could make this work without 'unshare'. How about:

hidepid still sets pid_ns->hidepid. "this_mount_hidepid" (or whatever
you want to call it), if set, overrides pid_ns->hidepid.

--Andy

>
> Signed-off-by: Djalal Harouni <[email protected]>
> ---
> fs/proc/generic.c | 10 +++++++++
> fs/proc/inode.c | 3 +++
> fs/proc/root.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++--
> include/linux/proc_fs.h | 12 ++++++++++
> 4 files changed, 82 insertions(+), 2 deletions(-)
>
> diff --git a/fs/proc/generic.c b/fs/proc/generic.c
> index 7e5e419..7ae5377 100644
> --- a/fs/proc/generic.c
> +++ b/fs/proc/generic.c
> @@ -52,6 +52,11 @@ void proc_fs_set_pid_gid(struct proc_fs_info *fs_info, kgid_t gid)
> fs_info->pid_gid = gid;
> }
>
> +void proc_fs_set_unshare(struct proc_fs_info *fs_info, int version)
> +{
> + fs_info->version = version;
> +}
> +
> int proc_fs_get_hide_pid(struct proc_fs_info *fs_info)
> {
> /* For backward compatibility */
> @@ -70,6 +75,11 @@ kgid_t proc_fs_get_pid_gid(struct proc_fs_info *fs_info)
> return fs_info->pid_gid;
> }
>
> +int proc_fs_get_unshare(struct proc_fs_info *fs_info)
> +{
> + return fs_info->version;
> +}
> +
> static int proc_match(unsigned int len, const char *name, struct proc_dir_entry *de)
> {
> if (len < de->namelen)
> diff --git a/fs/proc/inode.c b/fs/proc/inode.c
> index ca47a0a..5f7557d 100644
> --- a/fs/proc/inode.c
> +++ b/fs/proc/inode.c
> @@ -111,6 +111,9 @@ static int proc_show_options(struct seq_file *seq, struct dentry *root)
> if (pid->hide_pid != HIDEPID_OFF)
> seq_printf(seq, ",hidepid=%u", pid->hide_pid);
>
> + if (proc_fs_get_unshare(fs_info) == PROC_FS_V2)
> + seq_printf(seq, ",unshare");
> +
> return 0;
> }
>
> diff --git a/fs/proc/root.c b/fs/proc/root.c
> index 6a96c02..7a8f425 100644
> --- a/fs/proc/root.c
> +++ b/fs/proc/root.c
> @@ -27,15 +27,52 @@
> #include "internal.h"
>
> enum {
> - Opt_gid, Opt_hidepid, Opt_err,
> + Opt_gid, Opt_hidepid, Opt_unshare, Opt_err,
> };
>
> static const match_table_t tokens = {
> {Opt_hidepid, "hidepid=%u"},
> {Opt_gid, "gid=%u"},
> + {Opt_unshare, "unshare"},
> {Opt_err, NULL},
> };
>
> +/* We only parse 'unshare' option here */
> +int proc_parse_early_options(char *options, struct proc_fs_info *fs_info)
> +{
> + char *p, *opts, *orig;
> + substring_t args[MAX_OPT_ARGS];
> +
> + if (!options)
> + return 0;
> +
> + opts = kstrdup(options, GFP_KERNEL);
> + if (!opts)
> + return -ENOMEM;
> +
> + orig = opts;
> +
> + while ((p = strsep(&opts, ",")) != NULL) {
> + int token;
> +
> + if (!*p)
> + continue;
> +
> + token = match_token(p, tokens, args);
> + switch (token) {
> + case Opt_unshare:
> + pr_info("proc: mounting a new procfs instance ");
> + proc_fs_set_unshare(fs_info, PROC_FS_V2);
> + break;
> + default:
> + break;
> + }
> + }
> +
> + kfree(orig);
> + return 0;
> +}
> +
> int proc_parse_options(char *options, struct proc_fs_info *fs_info)
> {
> char *p;
> @@ -70,6 +107,8 @@ int proc_parse_options(char *options, struct proc_fs_info *fs_info)
> }
> proc_fs_set_hide_pid(fs_info, option);
> break;
> + case Opt_unshare:
> + break;
> default:
> pr_err("proc: unrecognized mount option \"%s\" "
> "or missing value\n", p);
> @@ -82,9 +121,19 @@ int proc_parse_options(char *options, struct proc_fs_info *fs_info)
>
> int proc_remount(struct super_block *sb, int *flags, char *data)
> {
> + int error, version;
> struct proc_fs_info *fs_info = proc_sb(sb);
>
> + version = proc_fs_get_unshare(fs_info);
> +
> sync_filesystem(sb);
> +
> + if (version == PROC_FS_V2) {
> + error = proc_parse_early_options(data, fs_info);
> + if (error < 0)
> + return error;
> + }
> +
> return !proc_parse_options(data, fs_info);
> }
>
> @@ -122,15 +171,21 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
> if (!fs_info)
> return ERR_PTR(-ENOMEM);
>
> + /* Set it as early as possible */
> + proc_fs_set_unshare(fs_info, PROC_FS_V1);
> +
> if (flags & MS_KERNMOUNT) {
> ns = data;
> data = NULL;
> } else {
> + error = proc_parse_early_options(data, fs_info);
> + if (error < 0)
> + goto error_fs_info;
> +
> ns = task_active_pid_ns(current);
> }
>
> fs_info->pid_ns = ns;
> - fs_info->version = PROC_FS_V1;
> fs_info->hide_pid = HIDEPID_OFF;
> fs_info->pid_gid = GLOBAL_ROOT_GID;
> refcount_set(&fs_info->users, 1);
> diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> index c23299d..e3a78a5 100644
> --- a/include/linux/proc_fs.h
> +++ b/include/linux/proc_fs.h
> @@ -31,8 +31,11 @@ extern void proc_fs_set_hide_pid(struct proc_fs_info *fs_info, int hide_pid);
>
> extern void proc_fs_set_pid_gid(struct proc_fs_info *fs_info, kgid_t gid);
>
> +extern void proc_fs_set_unshare(struct proc_fs_info *fs_info, int version);
> +
> extern int proc_fs_get_hide_pid(struct proc_fs_info *fs_info);
> extern kgid_t proc_fs_get_pid_gid(struct proc_fs_info *fs_info);
> +extern int proc_fs_get_unshare(struct proc_fs_info *fs_info);
>
> extern void proc_root_init(void);
> extern void proc_flush_task(struct task_struct *);
> @@ -84,6 +87,10 @@ static inline void proc_fs_set_hide_pid(struct proc_fs_info *fs_info, int hide_p
> {
> }
>
> +static inline void proc_fs_set_unshare(struct proc_fs_info *fs_info, int version)
> +{
> +}
> +
> static inline void proc_fs_set_pid_gid(struct proc_info_fs *fs_info, kgid_t gid)
> {
> }
> @@ -98,6 +105,11 @@ extern kgid_t proc_fs_get_pid_gid(struct proc_fs_info *fs_info)
> return GLOBAL_ROOT_GID;
> }
>
> +static inline int proc_fs_get_unshare(struct proc_fs_info *fs_info)
> +{
> + return PROC_FS_V1;
> +}
> +
> extern inline struct proc_fs_info *proc_sb(struct super_block *sb) { return NULL;}
> static inline struct proc_dir_entry *proc_symlink(const char *name,
> struct proc_dir_entry *parent,const char *dest) { return NULL;}
> --
> 2.10.2
>



--
Andy Lutomirski
AMA Capital Management, LLC

2017-03-30 19:12:55

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH RFC 0/4] proc: support multiple separate proc instances per pidnamespace

On Thu, Mar 30, 2017 at 8:22 AM, Djalal Harouni <[email protected]> wrote:
> Hi,
>
> This RFC can be applied on top of Linus' tree 89970a04d7
>
> This RFC implements support for multiple separate proc instances inside
> the same pid namespace. This allows to solve lot of problems that
> today's use case face.
>
> Historically procfs was tied to pid namespaces, and mount options were
> propagated to all other procfs instances in the same pid namespace. This
> solved several use cases in that time. However today we face new
> problems, there are mutliple container implementations there, some of
> them want to hide pid entries, others want to hide non-pid entries,
> others want to have sysctlfs, others want to share pid namespace with
> private procfs mounts. All these with current implementation won't work
> since all options will be propagated to all procfs mounts.
>
> This series allow to have new instances of procfs per pid namespace where
> each instance can have its own mount option inside the same pid namespace.
> This was also suggested by Andy Lutomirski.
>
>
> Now:
> $ sudo mount -t proc -o unshare,hidepid=2 none /test
>
> The option 'unshare' will allow to mount a new instance of procfs inside
> the same pid namespace.
>
> Before:
> $ stat /proc/slabinfo
>
> File: ‘/proc/slabinfo’
> Size: 0 Blocks: 0 IO Block: 1024 regular empty file
> Device: 4h/4d Inode: 4026532046 Links: 1
>
> $ stat /test3/slabinfo
>
> File: ‘/test3/slabinfo’
> Size: 0 Blocks: 0 IO Block: 1024 regular empty file
> Device: 4h/4d Inode: 4026532046 Links: 1
>
>
> After:
> $ stat /proc/slabinfo
>
> File: ‘/proc/slabinfo’
> Size: 0 Blocks: 0 IO Block: 1024 regular empty file
> Device: 4h/4d Inode: 4026532046 Links: 1
>
> $ stat /test3/slabinfo
>
> File: ‘/test3/slabinfo’
> Size: 0 Blocks: 0 IO Block: 1024 regular empty file
> Device: 31h/49d Inode: 4026532046 Links: 1
>
>
> Any better name for the option 'unshare' ? suggestions ?
>
> I was going to use 'version=2' but then this may sound more like a
> proc2 fs which currently impossible to implement since it will share
> locks with the old proc.
>
>
> Al, Eric any comments please ?

I like the concept, except that I think it would be nice to avoid
needing 'unshare', perhaps by making unsharing the default and making
hidepid work backwards compatibly if needed.

2017-03-30 22:07:51

by Alexey Gladkov

[permalink] [raw]
Subject: Re: [PATCH RFC 0/4] proc: support multiple separate proc instances per pidnamespace

On Thu, Mar 30, 2017 at 05:22:55PM +0200, Djalal Harouni wrote:
> Hi,
>
> This RFC can be applied on top of Linus' tree 89970a04d7
>
> This RFC implements support for multiple separate proc instances inside
> the same pid namespace. This allows to solve lot of problems that
> today's use case face.
>
> Historically procfs was tied to pid namespaces, and mount options were
> propagated to all other procfs instances in the same pid namespace. This
> solved several use cases in that time. However today we face new
> problems, there are mutliple container implementations there, some of
> them want to hide pid entries, others want to hide non-pid entries,
> others want to have sysctlfs, others want to share pid namespace with
> private procfs mounts. All these with current implementation won't work
> since all options will be propagated to all procfs mounts.
>
> This series allow to have new instances of procfs per pid namespace where
> each instance can have its own mount option inside the same pid namespace.
> This was also suggested by Andy Lutomirski.
>
>
> Now:
> $ sudo mount -t proc -o unshare,hidepid=2 none /test
>
> The option 'unshare' will allow to mount a new instance of procfs inside
> the same pid namespace.
>
> Before:
> $ stat /proc/slabinfo
>
> File: ‘/proc/slabinfo’
> Size: 0 Blocks: 0 IO Block: 1024 regular empty file
> Device: 4h/4d Inode: 4026532046 Links: 1
>
> $ stat /test3/slabinfo
>
> File: ‘/test3/slabinfo’
> Size: 0 Blocks: 0 IO Block: 1024 regular empty file
> Device: 4h/4d Inode: 4026532046 Links: 1
>
>
> After:
> $ stat /proc/slabinfo
>
> File: ‘/proc/slabinfo’
> Size: 0 Blocks: 0 IO Block: 1024 regular empty file
> Device: 4h/4d Inode: 4026532046 Links: 1
>
> $ stat /test3/slabinfo
>
> File: ‘/test3/slabinfo’
> Size: 0 Blocks: 0 IO Block: 1024 regular empty file
> Device: 31h/49d Inode: 4026532046 Links: 1
>
>
> Any better name for the option 'unshare' ? suggestions ?
>
> I was going to use 'version=2' but then this may sound more like a
> proc2 fs which currently impossible to implement since it will share
> locks with the old proc.
>
>
> Al, Eric any comments please ?

Multiple mnt_root's lead us to significant memory costs for storing dentry
of tasks. I mean what we will get as many copies of the tasks dentry as many
times we have mounted the procfs with 'unshare' flag. No?

--
Rgrds, legion

2017-03-31 10:49:41

by Djalal Harouni

[permalink] [raw]
Subject: Re: [PATCH RFC 1/4] proc: add proc_fs_info struct to store proc options

On Thu, Mar 30, 2017 at 9:10 PM, Andy Lutomirski <[email protected]> wrote:
> On Thu, Mar 30, 2017 at 8:22 AM, Djalal Harouni <[email protected]> wrote:
>> This is a preparation patch that adds a proc_fs_info to be able to store
>> different procfs options. Right now some mount options are stored inside
>> the pid namespace which make multiple proc share the same mount options.
>> This patch will help also to fix this.
>>
>> Signed-off-by: Djalal Harouni <[email protected]>
>
>> static struct dentry *proc_mount(struct file_system_type *fs_type,
>> int flags, const char *dev_name, void *data)
>> {
>> + int error;
>> + struct super_block *sb;
>> struct pid_namespace *ns;
>> + struct proc_fs_info *fs_info;
>> +
>> + if (!(flags & MS_KERNMOUNT) && !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
>> + return ERR_PTR(-EPERM);
>
> Why is this check needed?

This is the same check that we used to have, from mount_ns(). I think
we have to keep the same semantics for now. Later we may adapt it
according to that suggestion of procfs with 'scope=x' mount options
feature where you want a specific feature of procfs that needs X
capability ?

>
>> diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
>> index 2d2bf59..e1cb9c3 100644
>> --- a/include/linux/proc_fs.h
>> +++ b/include/linux/proc_fs.h
>> @@ -6,11 +6,27 @@
>>
>> #include <linux/types.h>
>> #include <linux/fs.h>
>> +#include <linux/refcount.h>
>> +
>> +enum {
>> + PROC_FS_V1 = 1,
>> + PROC_FS_V2 = 2,
>> +};
>> +
>> +struct proc_fs_info {
>> + refcount_t users;
>> + struct pid_namespace *pid_ns;
>> + kgid_t pid_gid;
>> + int hide_pid;
>> + int version;
>> +};
>
> What is version?

This is just a name to mirror 'unshare' option, please ignore it, I
will change the var name.

>
> Should this patch have just users and pid_ns and move the other stuff
> to patch 2?

Indeed, will fix it.

Thanks!

--
tixxdz

2017-03-31 11:26:49

by Djalal Harouni

[permalink] [raw]
Subject: Re: [PATCH RFC 0/4] proc: support multiple separate proc instances per pidnamespace

On Fri, Mar 31, 2017 at 12:16 AM, Alexey Gladkov
<[email protected]> wrote:
> On Thu, Mar 30, 2017 at 05:22:55PM +0200, Djalal Harouni wrote:
>> Hi,
>>
>> This RFC can be applied on top of Linus' tree 89970a04d7
>>
>> This RFC implements support for multiple separate proc instances inside
>> the same pid namespace. This allows to solve lot of problems that
>> today's use case face.
>>
>> Historically procfs was tied to pid namespaces, and mount options were
>> propagated to all other procfs instances in the same pid namespace. This
>> solved several use cases in that time. However today we face new
>> problems, there are mutliple container implementations there, some of
>> them want to hide pid entries, others want to hide non-pid entries,
>> others want to have sysctlfs, others want to share pid namespace with
>> private procfs mounts. All these with current implementation won't work
>> since all options will be propagated to all procfs mounts.
>>
>> This series allow to have new instances of procfs per pid namespace where
>> each instance can have its own mount option inside the same pid namespace.
>> This was also suggested by Andy Lutomirski.
>>
>>
>> Now:
>> $ sudo mount -t proc -o unshare,hidepid=2 none /test
>>
>> The option 'unshare' will allow to mount a new instance of procfs inside
>> the same pid namespace.
>>
>> Before:
>> $ stat /proc/slabinfo
>>
>> File: ‘/proc/slabinfo’
>> Size: 0 Blocks: 0 IO Block: 1024 regular empty file
>> Device: 4h/4d Inode: 4026532046 Links: 1
>>
>> $ stat /test3/slabinfo
>>
>> File: ‘/test3/slabinfo’
>> Size: 0 Blocks: 0 IO Block: 1024 regular empty file
>> Device: 4h/4d Inode: 4026532046 Links: 1
>>
>>
>> After:
>> $ stat /proc/slabinfo
>>
>> File: ‘/proc/slabinfo’
>> Size: 0 Blocks: 0 IO Block: 1024 regular empty file
>> Device: 4h/4d Inode: 4026532046 Links: 1
>>
>> $ stat /test3/slabinfo
>>
>> File: ‘/test3/slabinfo’
>> Size: 0 Blocks: 0 IO Block: 1024 regular empty file
>> Device: 31h/49d Inode: 4026532046 Links: 1
>>
>>
>> Any better name for the option 'unshare' ? suggestions ?
>>
>> I was going to use 'version=2' but then this may sound more like a
>> proc2 fs which currently impossible to implement since it will share
>> locks with the old proc.
>>
>>
>> Al, Eric any comments please ?
>
> Multiple mnt_root's lead us to significant memory costs for storing dentry
> of tasks. I mean what we will get as many copies of the tasks dentry as many
> times we have mounted the procfs with 'unshare' flag. No?

With current implementation, that's true. However I think that we
should not sacrifice usage for optimization, currently it is
practically impossible to improve procfs, support new options or make
use of the current ones without affecting other procfs mounts. Andy
also suggested to have a mini-proc without non-pid stuff inside, and
without a new disconnected instance, new mounts or bind mounts may
expose the non-pid stuff.

Also we can improve this, right now it is not implemented but we may
can change how we do lookups, instead of doing a ptrace task after
instantiating a pid dentry we may do a ptrace permission check on task
there then create its related proc inode. With this all new procfs
instances with hidepid option set, will only have dentries of tasks
that the caller can ptrace. Also there is already the code to flush
the related task when it dies, tough, it needs further testing.

Also as with tmpfs where inodes are accounted by the memory
controller, I'm not sure if it's possible to account the same in
procfs during the first access ?

I don't see a better way to solve the current procfs problems that we
face or how to modernize it and add new options... in the end users
can always chose to use it or not.

--
tixxdz

2017-03-31 11:45:58

by Djalal Harouni

[permalink] [raw]
Subject: Re: [PATCH RFC 0/4] proc: support multiple separate proc instances per pidnamespace

On Thu, Mar 30, 2017 at 9:12 PM, Andy Lutomirski <[email protected]> wrote:
> On Thu, Mar 30, 2017 at 8:22 AM, Djalal Harouni <[email protected]> wrote:
>> Hi,
>>
>> This RFC can be applied on top of Linus' tree 89970a04d7
>>
>> This RFC implements support for multiple separate proc instances inside
>> the same pid namespace. This allows to solve lot of problems that
>> today's use case face.
>>
>> Historically procfs was tied to pid namespaces, and mount options were
>> propagated to all other procfs instances in the same pid namespace. This
>> solved several use cases in that time. However today we face new
>> problems, there are mutliple container implementations there, some of
>> them want to hide pid entries, others want to hide non-pid entries,
>> others want to have sysctlfs, others want to share pid namespace with
>> private procfs mounts. All these with current implementation won't work
>> since all options will be propagated to all procfs mounts.
>>
>> This series allow to have new instances of procfs per pid namespace where
>> each instance can have its own mount option inside the same pid namespace.
>> This was also suggested by Andy Lutomirski.
>>
>>
>> Now:
>> $ sudo mount -t proc -o unshare,hidepid=2 none /test
>>
>> The option 'unshare' will allow to mount a new instance of procfs inside
>> the same pid namespace.
>>
>> Before:
>> $ stat /proc/slabinfo
>>
>> File: ‘/proc/slabinfo’
>> Size: 0 Blocks: 0 IO Block: 1024 regular empty file
>> Device: 4h/4d Inode: 4026532046 Links: 1
>>
>> $ stat /test3/slabinfo
>>
>> File: ‘/test3/slabinfo’
>> Size: 0 Blocks: 0 IO Block: 1024 regular empty file
>> Device: 4h/4d Inode: 4026532046 Links: 1
>>
>>
>> After:
>> $ stat /proc/slabinfo
>>
>> File: ‘/proc/slabinfo’
>> Size: 0 Blocks: 0 IO Block: 1024 regular empty file
>> Device: 4h/4d Inode: 4026532046 Links: 1
>>
>> $ stat /test3/slabinfo
>>
>> File: ‘/test3/slabinfo’
>> Size: 0 Blocks: 0 IO Block: 1024 regular empty file
>> Device: 31h/49d Inode: 4026532046 Links: 1
>>
>>
>> Any better name for the option 'unshare' ? suggestions ?
>>
>> I was going to use 'version=2' but then this may sound more like a
>> proc2 fs which currently impossible to implement since it will share
>> locks with the old proc.
>>
>>
>> Al, Eric any comments please ?
>
> I like the concept, except that I think it would be nice to avoid
> needing 'unshare', perhaps by making unsharing the default and making
> hidepid work backwards compatibly if needed.

Of course I can update, but as said lot of stuff may already depend on
the behaviour of procfs mount options, and the device ID inside same
pid namespace where is it is always the same ID. With this change we
will have different IDs inside same pid namespace, and I don't have
the knowledge to predict if something will break. I remember you said
that we may update stat() to show same ID but I really don't know if
that's a good idea, what if userspace now tries to check if these are
two separate proc mounts ? also maybe there is third party kernel code
that does something with these device IDs ?

For the namespace introspection, Eric suggested that we may stat /proc
before stating /proc/pid/ns/x , however he also concluded that some
security or other obscure stuff may break! I basically don't have the
resources nor the knowledge to predict what errors may come due to
changing the default behaviour of procfs. From my position it is
better to introduce a new option for it so users are warned. Is this
reasonable ? or do we have to make it default ?

Thanks!

--
tixxdz