2020-07-30 12:00:13

by Kirill Tkhai

Subject: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces linearly

Currently, there is no way to list or iterate over all namespaces in the
system, or over a subset of them. Some namespaces are exposed in
/proc/[pid]/ns/ directories, but others may exist only as open files that
are not attached to any process. When an open namespace fd is sent over a
unix socket and then closed, it is impossible to know whether the namespace
still exists.

Also, even if a namespace is exposed as attached to a process or as an open
file, iterating over /proc/*/ns/* or /proc/*/fd/* is not fast, because the
work scales with the number of tasks and fds.
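
To illustrate the cost described above, here is a minimal userspace sketch of the per-task walk that is needed today. The helper name count_ns_links() is hypothetical and not part of this series; it visits /proc/<pid>/ns/<type> for every task, so a namespace shared by N tasks is reported N times.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): read /proc/<pid>/ns/<type>
 * for every task under @proc_root. Returns the number of links read,
 * or -1 if @proc_root cannot be opened.
 */
int count_ns_links(const char *proc_root, const char *type)
{
	struct dirent *de;
	int seen = 0;
	DIR *dir;

	assert(proc_root && type);

	dir = opendir(proc_root);
	if (!dir)
		return -1;

	while ((de = readdir(dir)) != NULL) {
		char path[PATH_MAX], target[64];
		ssize_t len;

		/* Only numeric entries are tasks. */
		if (!isdigit((unsigned char)de->d_name[0]))
			continue;

		snprintf(path, sizeof(path), "%s/%s/ns/%s",
			 proc_root, de->d_name, type);
		len = readlink(path, target, sizeof(target) - 1);
		if (len < 0)
			continue;	/* task exited or access denied */
		target[len] = '\0';

		/* The link target has the form "<type>:[<inum>]". */
		printf("%s -> %s\n", path, target);
		seen++;
	}

	closedir(dir);
	return seen;
}
```

Calling count_ns_links("/proc", "mnt") performs one readlink() per task, and it still cannot see namespaces that are held only by an open fd.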

This patchset introduces a new /proc/namespaces/ directory, which exposes
the subset of permitted namespaces in a linear view:

# ls /proc/namespaces/ -l
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'cgroup:[4026531835]' -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'ipc:[4026531839]' -> 'ipc:[4026531839]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531840]' -> 'mnt:[4026531840]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531861]' -> 'mnt:[4026531861]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532133]' -> 'mnt:[4026532133]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532134]' -> 'mnt:[4026532134]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532135]' -> 'mnt:[4026532135]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532136]' -> 'mnt:[4026532136]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'net:[4026531993]' -> 'net:[4026531993]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'pid:[4026531836]' -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'time:[4026531834]' -> 'time:[4026531834]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'user:[4026531837]' -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'uts:[4026531838]' -> 'uts:[4026531838]'

A namespace ns is exposed if its user_ns is permitted from /proc's pid_ns.
I.e., /proc is related to a pid_ns, so in /proc/namespaces/ we show only a ns
for which the following holds:

in_userns(pid_ns->user_ns, ns->user_ns).

If ns is itself a user_ns:

in_userns(pid_ns->user_ns, ns).
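
The visibility rule can be modelled in userspace as a walk up the user namespace hierarchy. The sketch below is modelled on the level-based walk of in_userns(); struct user_ns_model and in_userns_model() are hypothetical stand-ins that keep only the two fields the check needs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct user_namespace (illustration only). */
struct user_ns_model {
	struct user_ns_model *parent;	/* NULL for the initial namespace */
	int level;			/* 0 for the initial namespace */
};

/* True if @child is @ancestor itself or one of its descendants. */
bool in_userns_model(const struct user_ns_model *ancestor,
		     const struct user_ns_model *child)
{
	const struct user_ns_model *ns;

	assert(ancestor && child);

	/* Climb from @child until we reach @ancestor's depth. */
	for (ns = child; ns->level > ancestor->level; ns = ns->parent)
		;

	return ns == ancestor;
}
```

With this rule a /proc mounted for a nested container lists only namespaces owned by that container's user namespace or its descendants, while a /proc in the initial user namespace lists everything.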

The patchset follows these steps:

1) A generic counter in ns_common is introduced instead of separate
counters for every ns type (net::count, uts_namespace::kref,
user_namespace::count, etc). Patches [1-8];
2) Patch [9] introduces an IDR to link and iterate alive namespaces;
3) Patch [10] is refactoring;
4) Patch [11] actually adds the /proc/namespaces/ directory and fs methods;
5) Patches [12-23] make every namespace use the added methods
and appear in the /proc/namespaces/ directory.
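
Step 1 can be pictured with a small userspace model: every namespace type embeds the same common structure, so generic code can take and drop references without knowing the concrete type. All names below (ns_common_model, uts_ns_model, ns_get(), ns_put(), ns_container_of()) are hypothetical, and C11 atomics stand in for the kernel's refcount_t.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Model of the part shared by all namespace types. */
struct ns_common_model {
	unsigned int inum;
	atomic_uint count;	/* stand-in for refcount_t */
};

/* Model of one concrete namespace type embedding the common part. */
struct uts_ns_model {
	char nodename[65];
	struct ns_common_model ns;
};

/* Recover the containing structure from a pointer to its member. */
#define ns_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static inline void ns_get(struct ns_common_model *ns)
{
	atomic_fetch_add(&ns->count, 1);
}

/* Returns true when the caller dropped the last reference. */
static inline bool ns_put(struct ns_common_model *ns)
{
	unsigned int old = atomic_fetch_sub(&ns->count, 1);

	assert(old > 0);	/* a put without a matching get is a bug */
	return old == 1;
}
```

Because the counter lives at a fixed place in the common part, an iterator that only holds a pointer to the common structure can pin any namespace type in the same way.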

This may be useful for writing efficient debug utilities (say, quickly
building a network topology) and checkpoint/restore software.
---

Kirill Tkhai (23):
ns: Add common refcount into ns_common and use it as counter for net_ns
uts: Use generic ns_common::count
ipc: Use generic ns_common::count
pid: Use generic ns_common::count
user: Use generic ns_common::count
mnt: Use generic ns_common::count
cgroup: Use generic ns_common::count
time: Use generic ns_common::count
ns: Introduce ns_idr to be able to iterate all allocated namespaces in the system
fs: Rename fs/proc/namespaces.c into fs/proc/task_namespaces.c
fs: Add /proc/namespaces/ directory
user: Free user_ns one RCU grace period after final counter put
user: Add user namespaces into ns_idr
net: Add net namespaces into ns_idr
pid: Extract child_reaper check from pidns_for_children_get()
proc_ns_operations: Add can_get method
pid: Add pid namespaces into ns_idr
uts: Free uts namespace one RCU grace period after final counter put
uts: Add uts namespaces into ns_idr
ipc: Add ipc namespaces into ns_idr
mnt: Add mount namespaces into ns_idr
cgroup: Add cgroup namespaces into ns_idr
time: Add time namespaces into ns_idr


fs/mount.h | 4
fs/namespace.c | 14 +
fs/nsfs.c | 78 ++++++++
fs/proc/Makefile | 1
fs/proc/internal.h | 18 +-
fs/proc/namespaces.c | 382 +++++++++++++++++++++++++++-------------
fs/proc/root.c | 17 ++
fs/proc/task_namespaces.c | 183 +++++++++++++++++++
include/linux/cgroup.h | 6 -
include/linux/ipc_namespace.h | 3
include/linux/ns_common.h | 11 +
include/linux/pid_namespace.h | 4
include/linux/proc_fs.h | 1
include/linux/proc_ns.h | 12 +
include/linux/time_namespace.h | 10 +
include/linux/user_namespace.h | 10 +
include/linux/utsname.h | 10 +
include/net/net_namespace.h | 11 -
init/version.c | 2
ipc/msgutil.c | 2
ipc/namespace.c | 17 +-
ipc/shm.c | 1
kernel/cgroup/cgroup.c | 2
kernel/cgroup/namespace.c | 25 ++-
kernel/pid.c | 2
kernel/pid_namespace.c | 46 +++--
kernel/time/namespace.c | 20 +-
kernel/user.c | 2
kernel/user_namespace.c | 23 ++
kernel/utsname.c | 23 ++
net/core/net-sysfs.c | 6 -
net/core/net_namespace.c | 18 +-
net/ipv4/inet_timewait_sock.c | 4
net/ipv4/tcp_metrics.c | 2
34 files changed, 746 insertions(+), 224 deletions(-)
create mode 100644 fs/proc/task_namespaces.c

--
Signed-off-by: Kirill Tkhai <[email protected]>


2020-07-30 12:00:15

by Kirill Tkhai

Subject: [PATCH 03/23] ipc: Use generic ns_common::count

Convert ipc namespace to use generic counter.

Signed-off-by: Kirill Tkhai <[email protected]>
---
include/linux/ipc_namespace.h | 3 +--
ipc/msgutil.c | 2 +-
ipc/namespace.c | 4 ++--
3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index a06a78c67f19..05e22770af51 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -27,7 +27,6 @@ struct ipc_ids {
};

struct ipc_namespace {
- refcount_t count;
struct ipc_ids ids[3];

int sem_ctls[4];
@@ -128,7 +127,7 @@ extern struct ipc_namespace *copy_ipcs(unsigned long flags,
static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
{
if (ns)
- refcount_inc(&ns->count);
+ refcount_inc(&ns->ns.count);
return ns;
}

diff --git a/ipc/msgutil.c b/ipc/msgutil.c
index 3149b4a379de..d0a0e877cadd 100644
--- a/ipc/msgutil.c
+++ b/ipc/msgutil.c
@@ -26,7 +26,7 @@ DEFINE_SPINLOCK(mq_lock);
* and not CONFIG_IPC_NS.
*/
struct ipc_namespace init_ipc_ns = {
- .count = REFCOUNT_INIT(1),
+ .ns.count = REFCOUNT_INIT(1),
.user_ns = &init_user_ns,
.ns.inum = PROC_IPC_INIT_INO,
#ifdef CONFIG_IPC_NS
diff --git a/ipc/namespace.c b/ipc/namespace.c
index 24e7b45320f7..7bd0766ddc3b 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -51,7 +51,7 @@ static struct ipc_namespace *create_ipc_ns(struct user_namespace *user_ns,
goto fail_free;
ns->ns.ops = &ipcns_operations;

- refcount_set(&ns->count, 1);
+ refcount_set(&ns->ns.count, 1);
ns->user_ns = get_user_ns(user_ns);
ns->ucounts = ucounts;

@@ -164,7 +164,7 @@ static DECLARE_WORK(free_ipc_work, free_ipc);
*/
void put_ipc_ns(struct ipc_namespace *ns)
{
- if (refcount_dec_and_lock(&ns->count, &mq_lock)) {
+ if (refcount_dec_and_lock(&ns->ns.count, &mq_lock)) {
mq_clear_sbinfo(ns);
spin_unlock(&mq_lock);



2020-07-30 12:00:43

by Kirill Tkhai

Subject: [PATCH 05/23] user: Use generic ns_common::count

Convert user namespace to use generic counter.

Signed-off-by: Kirill Tkhai <[email protected]>
---
include/linux/user_namespace.h | 5 ++---
kernel/user.c | 2 +-
kernel/user_namespace.c | 4 ++--
3 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 6ef1c7109fc4..64cf8ebdc4ec 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -57,7 +57,6 @@ struct user_namespace {
struct uid_gid_map uid_map;
struct uid_gid_map gid_map;
struct uid_gid_map projid_map;
- atomic_t count;
struct user_namespace *parent;
int level;
kuid_t owner;
@@ -109,7 +108,7 @@ void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
{
if (ns)
- atomic_inc(&ns->count);
+ refcount_inc(&ns->ns.count);
return ns;
}

@@ -119,7 +118,7 @@ extern void __put_user_ns(struct user_namespace *ns);

static inline void put_user_ns(struct user_namespace *ns)
{
- if (ns && atomic_dec_and_test(&ns->count))
+ if (ns && refcount_dec_and_test(&ns->ns.count))
__put_user_ns(ns);
}

diff --git a/kernel/user.c b/kernel/user.c
index b1635d94a1f2..a2478cddf536 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -55,7 +55,7 @@ struct user_namespace init_user_ns = {
},
},
},
- .count = ATOMIC_INIT(3),
+ .ns.count = REFCOUNT_INIT(3),
.owner = GLOBAL_ROOT_UID,
.group = GLOBAL_ROOT_GID,
.ns.inum = PROC_USER_INIT_INO,
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 87804e0371fe..7c2bbe8f3e45 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -111,7 +111,7 @@ int create_user_ns(struct cred *new)
goto fail_free;
ns->ns.ops = &userns_operations;

- atomic_set(&ns->count, 1);
+ refcount_set(&ns->ns.count, 1);
/* Leave the new->user_ns reference with the new user namespace. */
ns->parent = parent_ns;
ns->level = parent_ns->level + 1;
@@ -197,7 +197,7 @@ static void free_user_ns(struct work_struct *work)
kmem_cache_free(user_ns_cachep, ns);
dec_user_namespaces(ucounts);
ns = parent;
- } while (atomic_dec_and_test(&parent->count));
+ } while (refcount_dec_and_test(&parent->ns.count));
}

void __put_user_ns(struct user_namespace *ns)


2020-07-30 12:00:53

by Kirill Tkhai

Subject: [PATCH 06/23] mnt: Use generic ns_common::count

Convert mount namespace to use generic counter.

Signed-off-by: Kirill Tkhai <[email protected]>
---
fs/mount.h | 3 +--
fs/namespace.c | 4 ++--
2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index c3e0bb6e5782..f296862032ec 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -7,7 +7,6 @@
#include <linux/watch_queue.h>

struct mnt_namespace {
- atomic_t count;
struct ns_common ns;
struct mount * root;
/*
@@ -130,7 +129,7 @@ static inline void detach_mounts(struct dentry *dentry)

static inline void get_mnt_ns(struct mnt_namespace *ns)
{
- atomic_inc(&ns->count);
+ refcount_inc(&ns->ns.count);
}

extern seqlock_t mount_lock;
diff --git a/fs/namespace.c b/fs/namespace.c
index 31c387794fbd..8c39810e6ec3 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3296,7 +3296,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns, bool a
new_ns->ns.ops = &mntns_operations;
if (!anon)
new_ns->seq = atomic64_add_return(1, &mnt_ns_seq);
- atomic_set(&new_ns->count, 1);
+ refcount_set(&new_ns->ns.count, 1);
INIT_LIST_HEAD(&new_ns->list);
init_waitqueue_head(&new_ns->poll);
spin_lock_init(&new_ns->ns_lock);
@@ -3870,7 +3870,7 @@ void __init mnt_init(void)

void put_mnt_ns(struct mnt_namespace *ns)
{
- if (!atomic_dec_and_test(&ns->count))
+ if (!refcount_dec_and_test(&ns->ns.count))
return;
drop_collected_mounts(&ns->root->mnt);
free_mnt_ns(ns);


2020-07-30 12:00:59

by Kirill Tkhai

Subject: [PATCH 04/23] pid: Use generic ns_common::count

Convert pid namespace to use generic counter.

Signed-off-by: Kirill Tkhai <[email protected]>
---
include/linux/pid_namespace.h | 4 +---
kernel/pid.c | 2 +-
kernel/pid_namespace.c | 13 +++----------
3 files changed, 5 insertions(+), 14 deletions(-)

diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 5a5cb45ac57e..7c7e627503d2 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -8,7 +8,6 @@
#include <linux/workqueue.h>
#include <linux/threads.h>
#include <linux/nsproxy.h>
-#include <linux/kref.h>
#include <linux/ns_common.h>
#include <linux/idr.h>

@@ -18,7 +17,6 @@
struct fs_pin;

struct pid_namespace {
- struct kref kref;
struct idr idr;
struct rcu_head rcu;
unsigned int pid_allocated;
@@ -43,7 +41,7 @@ extern struct pid_namespace init_pid_ns;
static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
{
if (ns != &init_pid_ns)
- kref_get(&ns->kref);
+ refcount_inc(&ns->ns.count);
return ns;
}

diff --git a/kernel/pid.c b/kernel/pid.c
index de9d29c41d77..3b9e67736ef4 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -72,7 +72,7 @@ int pid_max_max = PID_MAX_LIMIT;
* the scheme scales to up to 4 million PIDs, runtime.
*/
struct pid_namespace init_pid_ns = {
- .kref = KREF_INIT(2),
+ .ns.count = REFCOUNT_INIT(2),
.idr = IDR_INIT(init_pid_ns.idr),
.pid_allocated = PIDNS_ADDING,
.level = 0,
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 0e5ac162c3a8..d02dc1696edf 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -102,7 +102,7 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
goto out_free_idr;
ns->ns.ops = &pidns_operations;

- kref_init(&ns->kref);
+ refcount_set(&ns->ns.count, 1);
ns->level = level;
ns->parent = get_pid_ns(parent_pid_ns);
ns->user_ns = get_user_ns(user_ns);
@@ -148,22 +148,15 @@ struct pid_namespace *copy_pid_ns(unsigned long flags,
return create_pid_namespace(user_ns, old_ns);
}

-static void free_pid_ns(struct kref *kref)
-{
- struct pid_namespace *ns;
-
- ns = container_of(kref, struct pid_namespace, kref);
- destroy_pid_namespace(ns);
-}
-
void put_pid_ns(struct pid_namespace *ns)
{
struct pid_namespace *parent;

while (ns != &init_pid_ns) {
parent = ns->parent;
- if (!kref_put(&ns->kref, free_pid_ns))
+ if (!refcount_dec_and_test(&ns->ns.count))
break;
+ destroy_pid_namespace(ns);
ns = parent;
}
}


2020-07-30 12:01:09

by Kirill Tkhai

Subject: [PATCH 07/23] cgroup: Use generic ns_common::count

Convert cgroup namespace to use generic counter.

Signed-off-by: Kirill Tkhai <[email protected]>
---
include/linux/cgroup.h | 5 ++---
kernel/cgroup/cgroup.c | 2 +-
kernel/cgroup/namespace.c | 2 +-
3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 618838c48313..451c2d26a5db 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -854,7 +854,6 @@ static inline void cgroup_sk_free(struct sock_cgroup_data *skcd) {}
#endif /* CONFIG_CGROUP_DATA */

struct cgroup_namespace {
- refcount_t count;
struct ns_common ns;
struct user_namespace *user_ns;
struct ucounts *ucounts;
@@ -889,12 +888,12 @@ copy_cgroup_ns(unsigned long flags, struct user_namespace *user_ns,
static inline void get_cgroup_ns(struct cgroup_namespace *ns)
{
if (ns)
- refcount_inc(&ns->count);
+ refcount_inc(&ns->ns.count);
}

static inline void put_cgroup_ns(struct cgroup_namespace *ns)
{
- if (ns && refcount_dec_and_test(&ns->count))
+ if (ns && refcount_dec_and_test(&ns->ns.count))
free_cgroup_ns(ns);
}

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index dd247747ec14..22e466926853 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -199,7 +199,7 @@ static u16 have_canfork_callback __read_mostly;

/* cgroup namespace for init task */
struct cgroup_namespace init_cgroup_ns = {
- .count = REFCOUNT_INIT(2),
+ .ns.count = REFCOUNT_INIT(2),
.user_ns = &init_user_ns,
.ns.ops = &cgroupns_operations,
.ns.inum = PROC_CGROUP_INIT_INO,
diff --git a/kernel/cgroup/namespace.c b/kernel/cgroup/namespace.c
index 812a61afd538..f5e8828c109c 100644
--- a/kernel/cgroup/namespace.c
+++ b/kernel/cgroup/namespace.c
@@ -32,7 +32,7 @@ static struct cgroup_namespace *alloc_cgroup_ns(void)
kfree(new_ns);
return ERR_PTR(ret);
}
- refcount_set(&new_ns->count, 1);
+ refcount_set(&new_ns->ns.count, 1);
new_ns->ns.ops = &cgroupns_operations;
return new_ns;
}


2020-07-30 12:01:17

by Kirill Tkhai

Subject: [PATCH 09/23] ns: Introduce ns_idr to be able to iterate all allocated namespaces in the system

This patch introduces a new IDR and functions to add, remove and iterate
registered namespaces in the system. It will be used to list namespaces
in /proc/namespaces/ in the next patches.

The IDR is protected by ns_lock, which is chosen to be a spinlock (not a
mutex) to allow calling ns_idr_unregister() from put_xxx_ns() methods,
which may be called from (say) softirq context. A spinlock allows us
to avoid introducing a kwork on top of put_xxx_ns() just to call mutex_lock().

We introduce a new IDR because there is no appropriate existing structure
to reuse. The closest one, proc_inum_ida, does not fit our goals:
it is an IDA, and converting it to an IDR would add significant overhead for
proc entries that have no use for IDR functionality (pointers).

Read access to ns_idr is lockless (see ns_get_next()). This is done
for better parallelism and better performance from the start. One new
requirement this brings is that namespace memory must be freed no earlier
than one RCU grace period after ns_idr_unregister(). Some namespace types
already do this (say, net); the rest will be converted to use
kfree_rcu()/etc. in the patches where they become linked to the IDR.
See the next patches in this series for details.
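
The lockless read side depends on taking a reference only if the counter has not already dropped to zero. Below is a minimal userspace sketch of that "increment unless zero" step using C11 atomics; get_unless_zero() is a hypothetical stand-in for refcount_inc_not_zero(), and it does not model the RCU grace period that keeps the memory valid while the counter is inspected.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Take a reference only if the object is still alive.
 * Returns false if the counter had already reached zero,
 * in which case the object must not be used.
 */
bool get_unless_zero(atomic_uint *count)
{
	unsigned int old;

	assert(count != NULL);

	old = atomic_load(count);
	while (old != 0) {
		/* On failure, @old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(count, &old, old + 1))
			return true;
	}

	return false;
}
```

A reader that gets false simply skips the entry and continues with the next id, which is the behaviour of the loop in ns_get_next().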

Signed-off-by: Kirill Tkhai <[email protected]>
---
fs/nsfs.c | 76 +++++++++++++++++++++++++++++++++++++++++++++
include/linux/ns_common.h | 10 ++++++
include/linux/proc_ns.h | 11 ++++---
3 files changed, 92 insertions(+), 5 deletions(-)

diff --git a/fs/nsfs.c b/fs/nsfs.c
index 800c1d0eb0d0..ee4be67d3a0b 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -11,10 +11,13 @@
#include <linux/user_namespace.h>
#include <linux/nsfs.h>
#include <linux/uaccess.h>
+#include <linux/idr.h>

#include "internal.h"

static struct vfsmount *nsfs_mnt;
+static DEFINE_SPINLOCK(ns_lock);
+static DEFINE_IDR(ns_idr);

static long ns_ioctl(struct file *filp, unsigned int ioctl,
unsigned long arg);
@@ -304,3 +307,76 @@ void __init nsfs_init(void)
panic("can't set nsfs up\n");
nsfs_mnt->mnt_sb->s_flags &= ~SB_NOUSER;
}
+
+/*
+ * Add a newly created ns to ns_idr. The ns must be fully
+ * initialized since it becomes available for ns_get_next()
+ * right after we exit this function.
+ */
+int ns_idr_register(struct ns_common *ns)
+{
+ int ret, id = ns->inum - PROC_NS_MIN_INO;
+
+ if (WARN_ON(id < 0))
+ return -EINVAL;
+
+ idr_preload(GFP_KERNEL);
+ spin_lock_irq(&ns_lock);
+ ret = idr_alloc(&ns_idr, ns, id, id + 1, GFP_ATOMIC);
+ spin_unlock_irq(&ns_lock);
+ idr_preload_end();
+
+ return ret < 0 ? ret : 0;
+}
+
+/*
+ * Remove a dead ns from ns_idr. Note that ns memory must
+ * be freed no earlier than one RCU grace period after
+ * this function, since ns_get_next() uses RCU to iterate the IDR.
+ */
+void ns_idr_unregister(struct ns_common *ns)
+{
+ int id = ns->inum - PROC_NS_MIN_INO;
+ unsigned long flags;
+
+ if (WARN_ON(id < 0))
+ return;
+
+ spin_lock_irqsave(&ns_lock, flags);
+ idr_remove(&ns_idr, id);
+ spin_unlock_irqrestore(&ns_lock, flags);
+}
+
+/*
+ * This returns ns with inum greater than @id or NULL.
+ * @id is updated to refer the ns inum.
+ */
+struct ns_common *ns_get_next(unsigned int *id)
+{
+ struct ns_common *ns;
+
+ if (*id < PROC_NS_MIN_INO - 1)
+ *id = PROC_NS_MIN_INO - 1;
+
+ *id += 1;
+ *id -= PROC_NS_MIN_INO;
+
+ rcu_read_lock();
+ do {
+ ns = idr_get_next(&ns_idr, id);
+ if (!ns)
+ break;
+ if (!refcount_inc_not_zero(&ns->count)) {
+ ns = NULL;
+ *id += 1;
+ }
+ } while (!ns);
+ rcu_read_unlock();
+
+ if (ns) {
+ *id += PROC_NS_MIN_INO;
+ WARN_ON(*id != ns->inum);
+ }
+
+ return ns;
+}
diff --git a/include/linux/ns_common.h b/include/linux/ns_common.h
index 27db02ebdf36..5f460e97151a 100644
--- a/include/linux/ns_common.h
+++ b/include/linux/ns_common.h
@@ -4,6 +4,12 @@

struct proc_ns_operations;

+/*
+ * Common part of all namespaces. Note, that we link namespaces
+ * into IDR, and they are dereferenced via RCU. So, a namespace
+ * memory is allowed to be freed one RCU grace period after final
+ * .count put. See ns_get_next() for the details.
+ */
struct ns_common {
atomic_long_t stashed;
const struct proc_ns_operations *ops;
@@ -11,4 +17,8 @@ struct ns_common {
refcount_t count;
};

+extern int ns_idr_register(struct ns_common *ns);
+extern void ns_idr_unregister(struct ns_common *ns);
+extern struct ns_common *ns_get_next(unsigned int *id);
+
#endif
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 75807ecef880..906e6ebb43e4 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -40,12 +40,13 @@ extern const struct proc_ns_operations timens_for_children_operations;
*/
enum {
PROC_ROOT_INO = 1,
- PROC_IPC_INIT_INO = 0xEFFFFFFFU,
- PROC_UTS_INIT_INO = 0xEFFFFFFEU,
- PROC_USER_INIT_INO = 0xEFFFFFFDU,
- PROC_PID_INIT_INO = 0xEFFFFFFCU,
- PROC_CGROUP_INIT_INO = 0xEFFFFFFBU,
PROC_TIME_INIT_INO = 0xEFFFFFFAU,
+ PROC_NS_MIN_INO = PROC_TIME_INIT_INO,
+ PROC_CGROUP_INIT_INO = 0xEFFFFFFBU,
+ PROC_PID_INIT_INO = 0xEFFFFFFCU,
+ PROC_USER_INIT_INO = 0xEFFFFFFDU,
+ PROC_UTS_INIT_INO = 0xEFFFFFFEU,
+ PROC_IPC_INIT_INO = 0xEFFFFFFFU,
};

#ifdef CONFIG_PROC_FS
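
The id arithmetic in ns_idr_register() and ns_get_next() maps inode numbers onto IDR slots by subtracting PROC_NS_MIN_INO. A small userspace sketch of that mapping follows; the function names and the NS_MIN_INO_MODEL constant are hypothetical, and the value is copied from the enum above.

```c
#include <assert.h>
#include <limits.h>

/* Same value as PROC_NS_MIN_INO (PROC_TIME_INIT_INO) in the patch. */
#define NS_MIN_INO_MODEL 0xEFFFFFFAU

/*
 * First IDR slot to search when looking for a namespace whose
 * inum is strictly greater than @inum.
 */
unsigned int first_slot_after(unsigned int inum)
{
	/* Anything below the namespace range starts at slot 0. */
	if (inum < NS_MIN_INO_MODEL - 1)
		inum = NS_MIN_INO_MODEL - 1;

	return inum + 1 - NS_MIN_INO_MODEL;
}

/* Convert an IDR slot back to the namespace inode number. */
unsigned int slot_to_inum(unsigned int slot)
{
	assert(slot <= UINT_MAX - NS_MIN_INO_MODEL);	/* no wraparound */
	return slot + NS_MIN_INO_MODEL;
}
```

For example, the time namespace with inum 4026531834 lands in slot 0, and the mount namespace with inum 4026531840 lands in slot 6.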


2020-07-30 12:01:26

by Kirill Tkhai

Subject: [PATCH 08/23] time: Use generic ns_common::count

Convert time namespace to use generic counter.

Signed-off-by: Kirill Tkhai <[email protected]>
---
include/linux/time_namespace.h | 9 ++++-----
kernel/time/namespace.c | 9 +++------
2 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
index 5b6031385db0..a51ffc089219 100644
--- a/include/linux/time_namespace.h
+++ b/include/linux/time_namespace.h
@@ -4,7 +4,6 @@


#include <linux/sched.h>
-#include <linux/kref.h>
#include <linux/nsproxy.h>
#include <linux/ns_common.h>
#include <linux/err.h>
@@ -18,7 +17,6 @@ struct timens_offsets {
};

struct time_namespace {
- struct kref kref;
struct user_namespace *user_ns;
struct ucounts *ucounts;
struct ns_common ns;
@@ -37,20 +35,21 @@ extern void timens_commit(struct task_struct *tsk, struct time_namespace *ns);

static inline struct time_namespace *get_time_ns(struct time_namespace *ns)
{
- kref_get(&ns->kref);
+ refcount_inc(&ns->ns.count);
return ns;
}

struct time_namespace *copy_time_ns(unsigned long flags,
struct user_namespace *user_ns,
struct time_namespace *old_ns);
-void free_time_ns(struct kref *kref);
+void free_time_ns(struct time_namespace *ns);
int timens_on_fork(struct nsproxy *nsproxy, struct task_struct *tsk);
struct vdso_data *arch_get_vdso_data(void *vvar_page);

static inline void put_time_ns(struct time_namespace *ns)
{
- kref_put(&ns->kref, free_time_ns);
+ if (refcount_dec_and_test(&ns->ns.count))
+ free_time_ns(ns);
}

void proc_timens_show_offsets(struct task_struct *p, struct seq_file *m);
diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c
index afc65e6be33e..c4c829eb3511 100644
--- a/kernel/time/namespace.c
+++ b/kernel/time/namespace.c
@@ -92,7 +92,7 @@ static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
if (!ns)
goto fail_dec;

- kref_init(&ns->kref);
+ refcount_set(&ns->ns.count, 1);

ns->vvar_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
if (!ns->vvar_page)
@@ -226,11 +226,8 @@ static void timens_set_vvar_page(struct task_struct *task,
mutex_unlock(&offset_lock);
}

-void free_time_ns(struct kref *kref)
+void free_time_ns(struct time_namespace *ns)
{
- struct time_namespace *ns;
-
- ns = container_of(kref, struct time_namespace, kref);
dec_time_namespaces(ns->ucounts);
put_user_ns(ns->user_ns);
ns_free_inum(&ns->ns);
@@ -464,7 +461,7 @@ const struct proc_ns_operations timens_for_children_operations = {
};

struct time_namespace init_time_ns = {
- .kref = KREF_INIT(3),
+ .ns.count = REFCOUNT_INIT(3),
.user_ns = &init_user_ns,
.ns.inum = PROC_TIME_INIT_INO,
.ns.ops = &timens_operations,


2020-07-30 12:01:46

by Kirill Tkhai

Subject: [PATCH 11/23] fs: Add /proc/namespaces/ directory

This is a new directory that shows all namespaces which are accessible
with the credentials tied to this /proc instance.

Every /proc is related to a pid_namespace, and the pid_namespace
is related to a user_namespace. The items we show in this
/proc/namespaces/ directory are the namespaces
whose user_namespaces are the same as /proc's user_namespace,
or its descendants.

Say, /proc has pid_ns->user_ns, so in /proc/namespaces/ we show
only a ns for which in_userns(pid_ns->user_ns, ns->user_ns) holds.

The final result looks like this:

# ls /proc/namespaces/ -l
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'cgroup:[4026531835]' -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'ipc:[4026531839]' -> 'ipc:[4026531839]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531840]' -> 'mnt:[4026531840]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531861]' -> 'mnt:[4026531861]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532133]' -> 'mnt:[4026532133]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532134]' -> 'mnt:[4026532134]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532135]' -> 'mnt:[4026532135]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532136]' -> 'mnt:[4026532136]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'net:[4026531993]' -> 'net:[4026531993]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'pid:[4026531836]' -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'time:[4026531834]' -> 'time:[4026531834]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'user:[4026531837]' -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 Jul 29 16:50 'uts:[4026531838]' -> 'uts:[4026531838]'

Every namespace may be opened just like an ordinary file in /proc/[pid]/ns/.
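
Entry names follow the "<type>:[<inum>]" format shown in the listing. As an illustration of how such a name can be split, here is a minimal userspace sketch; parse_ns_name() is a hypothetical helper, and unlike the kernel-side parser it does not reject inode numbers below PROC_NS_MIN_INO.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Split "<type>:[<inum>]" into the length of the type prefix and the
 * inode number. Returns 0 on success, -1 on a malformed name.
 */
int parse_ns_name(const char *name, size_t *type_len, unsigned int *inum)
{
	const char *p;
	int consumed = 0;	/* stays 0 if the closing ']' is missing */

	assert(name && type_len && inum);

	p = strchr(name, ':');
	if (!p || p == name)
		return -1;	/* no separator, or empty type */
	*type_len = (size_t)(p - name);

	p++;
	/* %n records how much input was consumed, including the ']'. */
	if (sscanf(p, "[%u]%n", inum, &consumed) != 1 || p[consumed] != '\0')
		return -1;

	return 0;
}
```

Initialising consumed to zero matters here: when the closing bracket is missing, sscanf() still reports one conversion but never stores %n, so the trailing-character check must not read an uninitialised value.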

Signed-off-by: Kirill Tkhai <[email protected]>
---
fs/nsfs.c | 2
fs/proc/Makefile | 1
fs/proc/internal.h | 16 ++
fs/proc/namespaces.c | 314 +++++++++++++++++++++++++++++++++++++++++++++++
fs/proc/root.c | 17 ++-
include/linux/proc_fs.h | 1
6 files changed, 345 insertions(+), 6 deletions(-)
create mode 100644 fs/proc/namespaces.c

diff --git a/fs/nsfs.c b/fs/nsfs.c
index ee4be67d3a0b..61b789d2089c 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -58,7 +58,7 @@ static void nsfs_evict(struct inode *inode)
ns->ops->put(ns);
}

-static int __ns_get_path(struct path *path, struct ns_common *ns)
+int __ns_get_path(struct path *path, struct ns_common *ns)
{
struct vfsmount *mnt = nsfs_mnt;
struct dentry *dentry;
diff --git a/fs/proc/Makefile b/fs/proc/Makefile
index dc2d51f42905..34ff671c6d59 100644
--- a/fs/proc/Makefile
+++ b/fs/proc/Makefile
@@ -25,6 +25,7 @@ proc-y += util.o
proc-y += version.o
proc-y += softirqs.o
proc-y += task_namespaces.o
+proc-y += namespaces.o
proc-y += self.o
proc-y += thread_self.o
proc-$(CONFIG_PROC_SYSCTL) += proc_sysctl.o
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 572757ff97be..d19fe5574799 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -134,10 +134,11 @@ void task_dump_owner(struct task_struct *task, umode_t mode,
kuid_t *ruid, kgid_t *rgid);

unsigned name_to_int(const struct qstr *qstr);
-/*
- * Offset of the first process in the /proc root directory..
- */
-#define FIRST_PROCESS_ENTRY 256
+
+/* Offset of "namespaces" entry in /proc root directory */
+#define NAMESPACES_ENTRY 256
+/* Offset of the first process in the /proc root directory */
+#define FIRST_PROCESS_ENTRY (NAMESPACES_ENTRY + 1)

/* Worst case buffer size needed for holding an integer. */
#define PROC_NUMBUF 13
@@ -168,6 +169,7 @@ extern void proc_pid_evict_inode(struct proc_inode *);
extern struct inode *proc_pid_make_inode(struct super_block *, struct task_struct *, umode_t);
extern void pid_update_inode(struct task_struct *, struct inode *);
extern int pid_delete_dentry(const struct dentry *);
+extern int proc_emit_namespaces(struct file *, struct dir_context *);
extern int proc_pid_readdir(struct file *, struct dir_context *);
struct dentry *proc_pid_lookup(struct dentry *, unsigned int);
extern loff_t mem_lseek(struct file *, loff_t, int);
@@ -222,6 +224,12 @@ void set_proc_pid_nlink(void);
extern struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry *);
extern void proc_entry_rundown(struct proc_dir_entry *);

+/*
+ * namespaces.c
+ */
+extern int proc_setup_namespaces(struct super_block *);
+extern void proc_namespaces_init(void);
+
/*
* task_namespaces.c
*/
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
new file mode 100644
index 000000000000..ab47e1555619
--- /dev/null
+++ b/fs/proc/namespaces.c
@@ -0,0 +1,314 @@
+#include <linux/pid_namespace.h>
+#include <linux/user_namespace.h>
+#include <linux/namei.h>
+#include "internal.h"
+
+static unsigned namespaces_inum __ro_after_init;
+
+int proc_emit_namespaces(struct file *file, struct dir_context *ctx)
+{
+ struct proc_fs_info *fs_info = proc_sb_info(file_inode(file)->i_sb);
+ struct inode *inode = d_inode(fs_info->proc_namespaces);
+
+ return dir_emit(ctx, "namespaces", 10, inode->i_ino, DT_DIR);
+}
+
+static int parse_namespace_dentry_name(const struct dentry *dentry,
+ const char **type, unsigned int *type_len, unsigned int *inum)
+{
+ const char *p, *name;
+ int count = 0;
+
+ *type = name = dentry->d_name.name;
+ p = strchr(name, ':');
+ if (!p || p == name)
+ return -ENOENT;
+ *type_len = p - name;
+
+ p += 1;
+ if (sscanf(p, "[%u]%n", inum, &count) != 1 || *(p + count) != '\0' ||
+ *inum < PROC_NS_MIN_INO)
+ return -ENOENT;
+
+ return 0;
+}
+
+static struct ns_common *get_namespace_by_dentry(struct pid_namespace *pid_ns,
+ const struct dentry *dentry)
+{
+ unsigned int type_len, inum, p_inum;
+ struct user_namespace *user_ns;
+ struct ns_common *ns;
+ const char *type;
+
+ if (parse_namespace_dentry_name(dentry, &type, &type_len, &inum) < 0)
+ return NULL;
+
+ p_inum = inum - 1;
+ ns = ns_get_next(&p_inum);
+ if (!ns)
+ return NULL;
+
+ if (ns->inum != inum || strncmp(type, ns->ops->name, type_len) != 0 ||
+ ns->ops->name[type_len] != '\0') {
+ ns->ops->put(ns);
+ return NULL;
+ }
+
+ if (ns->ops != &userns_operations)
+ user_ns = ns->ops->owner(ns);
+ else
+ user_ns = container_of(ns, struct user_namespace, ns);
+
+ if (!in_userns(pid_ns->user_ns, user_ns)) {
+ ns->ops->put(ns);
+ return NULL;
+ }
+
+ return ns;
+}
+
+static struct dentry *proc_namespace_instantiate(struct dentry *dentry,
+ struct task_struct *task, const void *ptr);
+
+static struct dentry *proc_namespaces_lookup(struct inode *dir, struct dentry *dentry,
+ unsigned int flags)
+{
+ struct pid_namespace *pid_ns = proc_pid_ns(dir->i_sb);
+ struct task_struct *task;
+ struct ns_common *ns;
+
+ ns = get_namespace_by_dentry(pid_ns, dentry);
+ if (!ns)
+ return ERR_PTR(-ENOENT);
+
+ read_lock(&tasklist_lock);
+ task = get_task_struct(pid_ns->child_reaper);
+ read_unlock(&tasklist_lock);
+
+ dentry = proc_namespace_instantiate(dentry, task, ns);
+ put_task_struct(task);
+ ns->ops->put(ns);
+
+ return dentry;
+}
+
+static int proc_namespaces_permission(struct inode *inode, int mask)
+{
+ if ((mask & MAY_EXEC) && S_ISLNK(inode->i_mode))
+ return -EACCES;
+
+ return 0;
+}
+
+static int proc_namespaces_getattr(const struct path *path, struct kstat *stat,
+ u32 request_mask, unsigned int query_flags)
+{
+ struct inode *inode = d_inode(path->dentry);
+
+ generic_fillattr(inode, stat);
+ return 0;
+}
+
+static const struct inode_operations proc_namespaces_inode_operations = {
+ .lookup = proc_namespaces_lookup,
+ .permission = proc_namespaces_permission,
+ .getattr = proc_namespaces_getattr,
+};
+
+static int proc_namespaces_readlink(struct dentry *dentry, char __user *buffer, int buflen)
+{
+ struct inode *dir = dentry->d_parent->d_inode;
+ struct pid_namespace *pid_ns = proc_pid_ns(dir->i_sb);
+ struct ns_common *ns;
+
+ ns = get_namespace_by_dentry(pid_ns, dentry);
+ if (!ns)
+ return -ENOENT;
+ ns->ops->put(ns);
+
+ /* proc_namespaces_readdir() creates dentry names in namespace format */
+ return readlink_copy(buffer, buflen, dentry->d_iname);
+}
+
+int __ns_get_path(struct path *path, struct ns_common *ns);
+
+static const char *proc_namespaces_getlink(struct dentry *dentry,
+ struct inode *inode, struct delayed_call *done)
+{
+ struct pid_namespace *pid_ns = proc_pid_ns(inode->i_sb);
+ struct ns_common *ns;
+ struct path path;
+ int ret;
+
+ if (!dentry)
+ return ERR_PTR(-ECHILD);
+
+ while (1) {
+ ret = -ENOENT;
+ ns = get_namespace_by_dentry(pid_ns, dentry);
+ if (!ns)
+ goto out;
+
+ ret = __ns_get_path(&path, ns);
+ if (ret == -EAGAIN)
+ continue;
+ if (ret)
+ goto out;
+ break;
+ }
+
+ ret = nd_jump_link(&path);
+out:
+ return ERR_PTR(ret);
+}
+
+static const struct inode_operations proc_namespaces_link_inode_operations = {
+ .readlink = proc_namespaces_readlink,
+ .get_link = proc_namespaces_getlink,
+};
+
+static int namespace_delete_dentry(const struct dentry *dentry)
+{
+ struct inode *dir = dentry->d_parent->d_inode;
+ struct pid_namespace *pid_ns = proc_pid_ns(dir->i_sb);
+ struct ns_common *ns;
+
+ ns = get_namespace_by_dentry(pid_ns, dentry);
+ if (!ns)
+ return 1;
+
+ ns->ops->put(ns);
+ return 0;
+}
+
+const struct dentry_operations namespaces_dentry_operations = {
+ .d_delete = namespace_delete_dentry,
+};
+
+static void namespace_update_inode(struct inode *inode)
+{
+ struct user_namespace *user_ns = proc_pid_ns(inode->i_sb)->user_ns;
+
+ inode->i_uid = make_kuid(user_ns, 0);
+ if (!uid_valid(inode->i_uid))
+ inode->i_uid = GLOBAL_ROOT_UID;
+
+ inode->i_gid = make_kgid(user_ns, 0);
+ if (!gid_valid(inode->i_gid))
+ inode->i_gid = GLOBAL_ROOT_GID;
+}
+
+static struct dentry *proc_namespace_instantiate(struct dentry *dentry,
+ struct task_struct *task, const void *ptr)
+{
+ const struct ns_common *ns = ptr;
+ struct inode *inode;
+ struct proc_inode *ei;
+
+ /*
+ * Create inode with credentials of @task, and add it to @task's
+ * quick removal list.
+ */
+ inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK | S_IRWXUGO);
+ if (!inode)
+ return ERR_PTR(-ENOENT);
+
+ ei = PROC_I(inode);
+ inode->i_op = &proc_namespaces_link_inode_operations;
+ ei->ns_ops = ns->ops;
+ namespace_update_inode(inode);
+
+ d_set_d_op(dentry, &namespaces_dentry_operations);
+ return d_splice_alias(inode, dentry);
+}
+
+static int proc_namespaces_readdir(struct file *file, struct dir_context *ctx)
+{
+ struct pid_namespace *pid_ns = proc_pid_ns(file_inode(file)->i_sb);
+ struct user_namespace *user_ns;
+ struct task_struct *task;
+ struct ns_common *ns;
+ unsigned int inum;
+
+ read_lock(&tasklist_lock);
+ task = get_task_struct(pid_ns->child_reaper);
+ read_unlock(&tasklist_lock);
+
+ if (!dir_emit_dots(file, ctx))
+ goto out;
+
+ inum = ctx->pos - 2;
+ while ((ns = ns_get_next(&inum)) != NULL) {
+ unsigned int len;
+ char name[32];
+
+ if (ns->ops != &userns_operations)
+ user_ns = ns->ops->owner(ns);
+ else
+ user_ns = container_of(ns, struct user_namespace, ns);
+
+ if (!in_userns(pid_ns->user_ns, user_ns))
+ goto next;
+
+ len = snprintf(name, sizeof(name), "%s:[%u]", ns->ops->name, inum);
+
+ if (!proc_fill_cache(file, ctx, name, len,
+ proc_namespace_instantiate, task, ns)) {
+ ns->ops->put(ns);
+ break;
+ }
+next:
+ ns->ops->put(ns);
+ ctx->pos = inum + 2;
+ }
+out:
+ put_task_struct(task);
+ return 0;
+}
+
+static const struct file_operations proc_namespaces_file_operations = {
+ .read = generic_read_dir,
+ .iterate_shared = proc_namespaces_readdir,
+ .llseek = generic_file_llseek,
+};
+
+int proc_setup_namespaces(struct super_block *s)
+{
+ struct proc_fs_info *fs_info = proc_sb_info(s);
+ struct inode *root_inode = d_inode(s->s_root);
+ struct dentry *namespaces;
+ int ret = -ENOMEM;
+
+ inode_lock(root_inode);
+ namespaces = d_alloc_name(s->s_root, "namespaces");
+ if (namespaces) {
+ struct inode *inode = new_inode_pseudo(s);
+ if (inode) {
+ inode->i_ino = namespaces_inum;
+ inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
+ inode->i_mode = S_IFDIR | S_IRUGO | S_IXUGO;
+ inode->i_uid = GLOBAL_ROOT_UID;
+ inode->i_gid = GLOBAL_ROOT_GID;
+ inode->i_op = &proc_namespaces_inode_operations;
+ inode->i_fop = &proc_namespaces_file_operations;
+ d_add(namespaces, inode);
+ ret = 0;
+ } else {
+ dput(namespaces);
+ }
+ }
+ inode_unlock(root_inode);
+
+ if (ret)
+ pr_err("proc_setup_namespaces: can't allocate /proc/namespaces\n");
+ else
+ fs_info->proc_namespaces = namespaces;
+
+ return ret;
+}
+
+void __init proc_namespaces_init(void)
+{
+ proc_alloc_inum(&namespaces_inum);
+}
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 5e444d4f9717..e4e4f90fca3d 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -206,6 +206,10 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
return -ENOMEM;
}

+ ret = proc_setup_namespaces(s);
+ if (ret)
+ return ret;
+
ret = proc_setup_self(s);
if (ret) {
return ret;
@@ -272,6 +276,9 @@ static void proc_kill_sb(struct super_block *sb)
dput(fs_info->proc_self);
dput(fs_info->proc_thread_self);

+ if (fs_info->proc_namespaces)
+ dput(fs_info->proc_namespaces);
+
kill_anon_super(sb);
put_pid_ns(fs_info->pid_ns);
kfree(fs_info);
@@ -289,6 +296,7 @@ void __init proc_root_init(void)
{
proc_init_kmemcache();
set_proc_pid_nlink();
+ proc_namespaces_init();
proc_self_init();
proc_thread_self_init();
proc_symlink("mounts", NULL, "self/mounts");
@@ -326,8 +334,15 @@ static struct dentry *proc_root_lookup(struct inode * dir, struct dentry * dentr

static int proc_root_readdir(struct file *file, struct dir_context *ctx)
{
- if (ctx->pos < FIRST_PROCESS_ENTRY) {
+ if (ctx->pos < NAMESPACES_ENTRY) {
int error = proc_readdir(file, ctx);
+ if (unlikely(error <= 0))
+ return error;
+ ctx->pos = NAMESPACES_ENTRY;
+ }
+
+ if (ctx->pos == NAMESPACES_ENTRY) {
+ int error = proc_emit_namespaces(file, ctx);
if (unlikely(error <= 0))
return error;
ctx->pos = FIRST_PROCESS_ENTRY;
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 97b3f5f06db9..8b0002a6cacf 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -61,6 +61,7 @@ struct proc_fs_info {
struct pid_namespace *pid_ns;
struct dentry *proc_self; /* For /proc/self */
struct dentry *proc_thread_self; /* For /proc/thread-self */
+ struct dentry *proc_namespaces; /* For /proc/namespaces */
kgid_t pid_gid;
enum proc_hidepid hide_pid;
enum proc_pidonly pidonly;
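
A note for readers of the diff above: proc_namespaces_readdir() builds each
entry name with snprintf(name, sizeof(name), "%s:[%u]", ...), which gives names
such as 'mnt:[4026531840]'. Below is a minimal userspace sketch of formatting
and parsing that name format. It is not part of the patch; format_ns_name() and
parse_ns_name() are hypothetical helpers invented for illustration, and the way
the kernel side parses dentry names may well differ.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from the patch): build "type:[inum]". */
static int format_ns_name(char *buf, size_t len, const char *type,
			  unsigned int inum)
{
	int n = snprintf(buf, len, "%s:[%u]", type, inum);

	/* Treat truncation as an error rather than emitting a partial name. */
	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Hypothetical helper (not from the patch): split "type:[inum]". */
static int parse_ns_name(const char *name, char *type, size_t type_len,
			 unsigned int *inum)
{
	const char *colon = strchr(name, ':');
	int consumed = 0;
	size_t n;

	if (!colon || colon == name)
		return -1;
	n = (size_t)(colon - name);
	if (n >= type_len)
		return -1;

	/* %n is only stored if the closing ']' was matched. */
	if (sscanf(colon, ":[%u]%n", inum, &consumed) != 1)
		return -1;
	if (consumed == 0 || colon[consumed] != '\0')
		return -1;

	memcpy(type, name, n);
	type[n] = '\0';
	return 0;
}
```

A caller would pass a buffer of at least 32 bytes for the name, matching the
char name[32] used in the readdir code above.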


2020-07-30 12:01:53

by Kirill Tkhai

Subject: [PATCH 12/23] user: Free user_ns one RCU grace period after final counter put

This is needed to link user_ns into ns_idr in the next patch.

Signed-off-by: Kirill Tkhai <[email protected]>
---
include/linux/user_namespace.h | 5 ++++-
kernel/user_namespace.c | 9 ++++++++-
2 files changed, 12 insertions(+), 2 deletions(-)
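
For readers less familiar with the pattern, here is a userspace sketch of the
deferred-free idea this patch relies on: the object embeds a callback head,
the final put only queues the callback, and the memory is released later.
Everything in it is an assumption made for illustration (struct cb_head,
defer_free(), run_grace_period() and struct demo_ns are made-up stand-ins for
rcu_head, call_rcu(), a grace period and user_namespace); it does not implement
real RCU.

```c
#include <stddef.h>
#include <stdlib.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Stand-in for struct rcu_head: a singly linked list of callbacks. */
struct cb_head {
	struct cb_head *next;
	void (*func)(struct cb_head *head);
};

struct demo_ns {
	unsigned int inum;
	union {
		long work;		/* stand-in for struct work_struct */
		struct cb_head rcu;	/* reused once 'work' is finished */
	};
};

static struct cb_head *pending;
static int freed_count;

/* Stand-in for call_rcu(): queue the callback, do not run it yet. */
static void defer_free(struct cb_head *head, void (*func)(struct cb_head *))
{
	head->func = func;
	head->next = pending;
	pending = head;
}

/* Stand-in for the end of a grace period: run all queued callbacks. */
static void run_grace_period(void)
{
	while (pending) {
		struct cb_head *head = pending;

		pending = head->next;
		head->func(head);
	}
}

/* Recover the enclosing object from the embedded head, then free it. */
static void free_demo_ns_cb(struct cb_head *head)
{
	struct demo_ns *ns = container_of(head, struct demo_ns, rcu);

	free(ns);
	freed_count++;
}
```

The union mirrors the change to struct user_namespace: the work item has
already run by the time the callback head is queued, so the two never need
their storage at the same time.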

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 64cf8ebdc4ec..58fede304201 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -79,7 +79,10 @@ struct user_namespace {
#ifdef CONFIG_PERSISTENT_KEYRINGS
struct key *persistent_keyring_register;
#endif
- struct work_struct work;
+ union {
+ struct work_struct work;
+ struct rcu_head rcu;
+ };
#ifdef CONFIG_SYSCTL
struct ctl_table_set set;
struct ctl_table_header *sysctls;
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 7c2bbe8f3e45..367a942bb484 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -171,6 +171,13 @@ int unshare_userns(unsigned long unshare_flags, struct cred **new_cred)
return err;
}

+static void free_user_ns_rcu(struct rcu_head *head)
+{
+ struct user_namespace *ns = container_of(head, struct user_namespace,
+ rcu);
+ kmem_cache_free(user_ns_cachep, ns);
+}
+
static void free_user_ns(struct work_struct *work)
{
struct user_namespace *parent, *ns =
@@ -194,7 +201,7 @@ static void free_user_ns(struct work_struct *work)
retire_userns_sysctls(ns);
key_free_user_ns(ns);
ns_free_inum(&ns->ns);
- kmem_cache_free(user_ns_cachep, ns);
+ call_rcu(&ns->rcu, free_user_ns_rcu);
dec_user_namespaces(ucounts);
ns = parent;
} while (refcount_dec_and_test(&parent->ns.count));


2020-07-30 12:01:55

by Kirill Tkhai

Subject: [PATCH 13/23] user: Add user namespaces into ns_idr

Now they are exposed in the /proc/namespaces/ directory.

Signed-off-by: Kirill Tkhai <[email protected]>
---
kernel/user_namespace.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 367a942bb484..bbfd7f0f9e7c 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -137,7 +137,13 @@ int create_user_ns(struct cred *new)
goto fail_keyring;

set_cred_user_ns(new, ns);
+
+ if (ns_idr_register(&ns->ns))
+ goto fail_sysctl;
+
return 0;
+fail_sysctl:
+ retire_userns_sysctls(ns);
fail_keyring:
#ifdef CONFIG_PERSISTENT_KEYRINGS
key_put(ns->persistent_keyring_register);
@@ -186,6 +192,7 @@ static void free_user_ns(struct work_struct *work)
do {
struct ucounts *ucounts = ns->ucounts;
parent = ns->parent;
+ ns_idr_unregister(&ns->ns);
if (ns->gid_map.nr_extents > UID_GID_MAP_MAX_BASE_EXTENTS) {
kfree(ns->gid_map.forward);
kfree(ns->gid_map.reverse);
@@ -1327,6 +1334,7 @@ const struct proc_ns_operations userns_operations = {
static __init int user_namespaces_init(void)
{
user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC);
- return 0;
+
+ return ns_idr_register(&init_user_ns.ns);
}
subsys_initcall(user_namespaces_init);


2020-07-30 12:01:58

by Kirill Tkhai

Subject: [PATCH 14/23] net: Add net namespaces into ns_idr

Now they are exposed in the /proc/namespaces/ directory.

We already wait for an RCU grace period in cleanup_net()
before calling the pernet_operations exit methods, so
ns_idr_unregister() works as expected.

Signed-off-by: Kirill Tkhai <[email protected]>
---
net/core/net_namespace.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 5f658cbedd34..f78655a670e5 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -701,14 +701,24 @@ EXPORT_SYMBOL_GPL(get_net_ns_by_pid);

static __net_init int net_ns_net_init(struct net *net)
{
+ int ret;
#ifdef CONFIG_NET_NS
net->ns.ops = &netns_operations;
#endif
- return ns_alloc_inum(&net->ns);
+ ret = ns_alloc_inum(&net->ns);
+ if (ret)
+ return ret;
+
+ ret = ns_idr_register(&net->ns);
+ if (ret < 0)
+ ns_free_inum(&net->ns);
+
+ return ret;
}

static __net_exit void net_ns_net_exit(struct net *net)
{
+ ns_idr_unregister(&net->ns);
ns_free_inum(&net->ns);
}



2020-07-30 12:02:07

by Kirill Tkhai

Subject: [PATCH 16/23] proc_ns_operations: Add can_get method

This is a new method to prohibit access to namespaces in an intermediate state.
Currently, it is used to prohibit access to a pid namespace whose child reaper
has not been created yet (similar to what we have for
/proc/[pid]/ns/pid_for_children).

Signed-off-by: Kirill Tkhai <[email protected]>
---
fs/proc/namespaces.c | 5 +++++
include/linux/proc_ns.h | 1 +
kernel/pid_namespace.c | 1 +
3 files changed, 7 insertions(+)
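
A small userspace sketch of the optional-callback check added to
proc_namespaces_getlink() may help: an ops table may leave can_get unset, in
which case the namespace is always allowed, and a set callback can veto access
with -ESRCH. The types and functions below (struct demo_ns, struct demo_ns_ops,
demo_pidns_can_get(), demo_get_link()) are simplified stand-ins invented for
illustration, not the kernel definitions, and locking is left out.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_ns;

/* Simplified stand-in for struct proc_ns_operations. */
struct demo_ns_ops {
	const char *name;
	bool (*can_get)(struct demo_ns *ns);	/* optional, may be NULL */
};

/* Simplified stand-in for a namespace with an optional child reaper. */
struct demo_ns {
	const struct demo_ns_ops *ops;
	void *child_reaper;
};

/* Mirrors the idea of pidns_can_get(): no reaper yet means "not ready". */
static bool demo_pidns_can_get(struct demo_ns *ns)
{
	return ns->child_reaper != NULL;
}

/* Mirrors the check in the hunk: a missing callback never blocks access. */
static int demo_get_link(struct demo_ns *ns)
{
	if (ns->ops->can_get && !ns->ops->can_get(ns))
		return -ESRCH;

	return 0;
}
```

Making the method optional keeps the other namespace types unchanged: only an
ops table that sets can_get takes part in the check.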

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index ab47e1555619..70fc23295315 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -149,6 +149,11 @@ static const char *proc_namespaces_getlink(struct dentry *dentry,
ns = get_namespace_by_dentry(pid_ns, dentry);
if (!ns)
goto out;
+ ret = -ESRCH;
+ if (ns->ops->can_get && !ns->ops->can_get(ns)) {
+ ns->ops->put(ns);
+ goto out;
+ }

ret = __ns_get_path(&path, ns);
if (ret == -EAGAIN)
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 906e6ebb43e4..e44ec466711a 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -19,6 +19,7 @@ struct proc_ns_operations {
int type;
struct ns_common *(*get)(struct task_struct *task);
void (*put)(struct ns_common *ns);
+ bool (*can_get)(struct ns_common *ns);
int (*install)(struct nsset *nsset, struct ns_common *ns);
struct user_namespace *(*owner)(struct ns_common *ns);
struct ns_common *(*get_parent)(struct ns_common *ns);
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 4a01328e8763..da8490390f51 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -452,6 +452,7 @@ const struct proc_ns_operations pidns_for_children_operations = {
.real_ns_name = "pid",
.type = CLONE_NEWPID,
.get = pidns_for_children_get,
+ .can_get = pidns_can_get,
.put = pidns_put,
.install = pidns_install,
.owner = pidns_owner,


2020-07-30 12:02:07

by Kirill Tkhai

Subject: [PATCH 15/23] pid: Extract child_reaper check from pidns_for_children_get()

This check is for prohibiting access to /proc/[pid]/ns/pid_for_children
before the first task of the pid namespace is created.

/proc/namespaces/ code will use this check too, so we move it into
a separate function.

Signed-off-by: Kirill Tkhai <[email protected]>
---
kernel/pid_namespace.c | 25 ++++++++++++++++++-------
1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index d02dc1696edf..4a01328e8763 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -343,6 +343,21 @@ static struct ns_common *pidns_get(struct task_struct *task)
return ns ? &ns->ns : NULL;
}

+static bool pidns_can_get(struct ns_common *ns)
+{
+ struct pid_namespace *pid_ns;
+ bool ret = true;
+
+ pid_ns = container_of(ns, struct pid_namespace, ns);
+
+ read_lock(&tasklist_lock);
+ if (!pid_ns->child_reaper)
+ ret = false;
+ read_unlock(&tasklist_lock);
+
+ return ret;
+}
+
static struct ns_common *pidns_for_children_get(struct task_struct *task)
{
struct pid_namespace *ns = NULL;
@@ -354,13 +369,9 @@ static struct ns_common *pidns_for_children_get(struct task_struct *task)
}
task_unlock(task);

- if (ns) {
- read_lock(&tasklist_lock);
- if (!ns->child_reaper) {
- put_pid_ns(ns);
- ns = NULL;
- }
- read_unlock(&tasklist_lock);
+ if (ns && !pidns_can_get(&ns->ns)) {
+ put_pid_ns(ns);
+ ns = NULL;
}

return ns ? &ns->ns : NULL;


2020-07-30 12:02:13

by Kirill Tkhai

Subject: [PATCH 18/23] uts: Free uts namespace one RCU grace period after final counter put

This is needed to link uts_ns into ns_idr in the next patch.

Signed-off-by: Kirill Tkhai <[email protected]>
---
include/linux/utsname.h | 1 +
kernel/utsname.c | 10 +++++++++-
2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/utsname.h b/include/linux/utsname.h
index 2b1737c9b244..b783d0fe6ca4 100644
--- a/include/linux/utsname.h
+++ b/include/linux/utsname.h
@@ -25,6 +25,7 @@ struct uts_namespace {
struct user_namespace *user_ns;
struct ucounts *ucounts;
struct ns_common ns;
+ struct rcu_head rcu;
} __randomize_layout;
extern struct uts_namespace init_uts_ns;

diff --git a/kernel/utsname.c b/kernel/utsname.c
index b1ac3ca870f2..aebf4df5f592 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -103,12 +103,20 @@ struct uts_namespace *copy_utsname(unsigned long flags,
return new_ns;
}

+static void free_uts_ns_rcu(struct rcu_head *head)
+{
+ struct uts_namespace *ns;
+
+ ns = container_of(head, struct uts_namespace, rcu);
+ kmem_cache_free(uts_ns_cache, ns);
+}
+
void free_uts_ns(struct uts_namespace *ns)
{
dec_uts_namespaces(ns->ucounts);
put_user_ns(ns->user_ns);
ns_free_inum(&ns->ns);
- kmem_cache_free(uts_ns_cache, ns);
+ call_rcu(&ns->rcu, free_uts_ns_rcu);
}

static inline struct uts_namespace *to_uts_ns(struct ns_common *ns)


2020-07-30 12:02:25

by Kirill Tkhai

Subject: [PATCH 21/23] mnt: Add mount namespaces into ns_idr

Now they are exposed in the /proc/namespaces/ directory.

Signed-off-by: Kirill Tkhai <[email protected]>
---
fs/mount.h | 1 +
fs/namespace.c | 10 +++++++++-
2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/mount.h b/fs/mount.h
index f296862032ec..cde7f7bed8ec 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -23,6 +23,7 @@ struct mnt_namespace {
u64 event;
unsigned int mounts; /* # of mounts in the namespace */
unsigned int pending_mounts;
+ struct rcu_head rcu;
} __randomize_layout;

struct mnt_pcp {
diff --git a/fs/namespace.c b/fs/namespace.c
index 8c39810e6ec3..756e43fd21f3 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3258,7 +3258,7 @@ static void free_mnt_ns(struct mnt_namespace *ns)
ns_free_inum(&ns->ns);
dec_mnt_namespaces(ns->ucounts);
put_user_ns(ns->user_ns);
- kfree(ns);
+ kfree_rcu(ns, rcu);
}

/*
@@ -3382,6 +3382,12 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns,
if (pwdmnt)
mntput(pwdmnt);

+ if (ns_idr_register(&new_ns->ns) < 0) {
+ drop_collected_mounts(&new_ns->root->mnt);
+ free_mnt_ns(new_ns);
+ new_ns = ERR_PTR(-ENOMEM);
+ }
+
return new_ns;
}

@@ -3824,6 +3830,7 @@ static void __init init_mount_tree(void)
list_add(&m->mnt_list, &ns->list);
init_task.nsproxy->mnt_ns = ns;
get_mnt_ns(ns);
+ WARN_ON(ns_idr_register(&ns->ns) < 0);

root.mnt = mnt;
root.dentry = mnt->mnt_root;
@@ -3872,6 +3879,7 @@ void put_mnt_ns(struct mnt_namespace *ns)
{
if (!refcount_dec_and_test(&ns->ns.count))
return;
+ ns_idr_unregister(&ns->ns);
drop_collected_mounts(&ns->root->mnt);
free_mnt_ns(ns);
}


2020-07-30 12:02:42

by Kirill Tkhai

Subject: [PATCH 23/23] time: Add time namespaces into ns_idr

Now they are exposed in the /proc/namespaces/ directory.

Signed-off-by: Kirill Tkhai <[email protected]>
---
include/linux/time_namespace.h | 1 +
kernel/time/namespace.c | 11 ++++++++++-
2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
index a51ffc089219..18eb8a9f7d68 100644
--- a/include/linux/time_namespace.h
+++ b/include/linux/time_namespace.h
@@ -24,6 +24,7 @@ struct time_namespace {
struct page *vvar_page;
/* If set prevents changing offsets after any task joined namespace. */
bool frozen_offsets;
+ struct rcu_head rcu;
} __randomize_layout;

extern struct time_namespace init_time_ns;
diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c
index c4c829eb3511..164a057ccbfc 100644
--- a/kernel/time/namespace.c
+++ b/kernel/time/namespace.c
@@ -107,8 +107,15 @@ static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
ns->user_ns = get_user_ns(user_ns);
ns->offsets = old_ns->offsets;
ns->frozen_offsets = false;
+
+ err = ns_idr_register(&ns->ns);
+ if (err)
+ goto fail_put_userns;
return ns;

+fail_put_userns:
+ put_user_ns(user_ns);
+ ns_free_inum(&ns->ns);
fail_free_page:
__free_page(ns->vvar_page);
fail_free:
@@ -228,11 +235,12 @@ static void timens_set_vvar_page(struct task_struct *task,

void free_time_ns(struct time_namespace *ns)
{
+ ns_idr_unregister(&ns->ns);
dec_time_namespaces(ns->ucounts);
put_user_ns(ns->user_ns);
ns_free_inum(&ns->ns);
__free_page(ns->vvar_page);
- kfree(ns);
+ kfree_rcu(ns, rcu);
}

static struct time_namespace *to_time_ns(struct ns_common *ns)
@@ -470,6 +478,7 @@ struct time_namespace init_time_ns = {

static int __init time_ns_init(void)
{
+ WARN_ON(ns_idr_register(&init_time_ns.ns) < 0);
return 0;
}
subsys_initcall(time_ns_init);


2020-07-30 12:03:36

by Kirill Tkhai

Subject: [PATCH 10/23] fs: Rename fs/proc/namespaces.c into fs/proc/task_namespaces.c

This file is about task namespaces, so we rename it.
No functional changes.

Signed-off-by: Kirill Tkhai <[email protected]>
---
fs/proc/Makefile | 2
fs/proc/internal.h | 2
fs/proc/namespaces.c | 183 ---------------------------------------------
fs/proc/task_namespaces.c | 183 +++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 185 insertions(+), 185 deletions(-)
delete mode 100644 fs/proc/namespaces.c
create mode 100644 fs/proc/task_namespaces.c

diff --git a/fs/proc/Makefile b/fs/proc/Makefile
index bd08616ed8ba..dc2d51f42905 100644
--- a/fs/proc/Makefile
+++ b/fs/proc/Makefile
@@ -24,7 +24,7 @@ proc-y += uptime.o
proc-y += util.o
proc-y += version.o
proc-y += softirqs.o
-proc-y += namespaces.o
+proc-y += task_namespaces.o
proc-y += self.o
proc-y += thread_self.o
proc-$(CONFIG_PROC_SYSCTL) += proc_sysctl.o
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 917cc85e3466..572757ff97be 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -223,7 +223,7 @@ extern struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry
extern void proc_entry_rundown(struct proc_dir_entry *);

/*
- * proc_namespaces.c
+ * task_namespaces.c
*/
extern const struct inode_operations proc_ns_dir_inode_operations;
extern const struct file_operations proc_ns_dir_operations;
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
deleted file mode 100644
index 8e159fc78c0a..000000000000
--- a/fs/proc/namespaces.c
+++ /dev/null
@@ -1,183 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-#include <linux/proc_fs.h>
-#include <linux/nsproxy.h>
-#include <linux/ptrace.h>
-#include <linux/namei.h>
-#include <linux/file.h>
-#include <linux/utsname.h>
-#include <net/net_namespace.h>
-#include <linux/ipc_namespace.h>
-#include <linux/pid_namespace.h>
-#include <linux/user_namespace.h>
-#include "internal.h"
-
-
-static const struct proc_ns_operations *ns_entries[] = {
-#ifdef CONFIG_NET_NS
- &netns_operations,
-#endif
-#ifdef CONFIG_UTS_NS
- &utsns_operations,
-#endif
-#ifdef CONFIG_IPC_NS
- &ipcns_operations,
-#endif
-#ifdef CONFIG_PID_NS
- &pidns_operations,
- &pidns_for_children_operations,
-#endif
-#ifdef CONFIG_USER_NS
- &userns_operations,
-#endif
- &mntns_operations,
-#ifdef CONFIG_CGROUPS
- &cgroupns_operations,
-#endif
-#ifdef CONFIG_TIME_NS
- &timens_operations,
- &timens_for_children_operations,
-#endif
-};
-
-static const char *proc_ns_get_link(struct dentry *dentry,
- struct inode *inode,
- struct delayed_call *done)
-{
- const struct proc_ns_operations *ns_ops = PROC_I(inode)->ns_ops;
- struct task_struct *task;
- struct path ns_path;
- int error = -EACCES;
-
- if (!dentry)
- return ERR_PTR(-ECHILD);
-
- task = get_proc_task(inode);
- if (!task)
- return ERR_PTR(-EACCES);
-
- if (!ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS))
- goto out;
-
- error = ns_get_path(&ns_path, task, ns_ops);
- if (error)
- goto out;
-
- error = nd_jump_link(&ns_path);
-out:
- put_task_struct(task);
- return ERR_PTR(error);
-}
-
-static int proc_ns_readlink(struct dentry *dentry, char __user *buffer, int buflen)
-{
- struct inode *inode = d_inode(dentry);
- const struct proc_ns_operations *ns_ops = PROC_I(inode)->ns_ops;
- struct task_struct *task;
- char name[50];
- int res = -EACCES;
-
- task = get_proc_task(inode);
- if (!task)
- return res;
-
- if (ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS)) {
- res = ns_get_name(name, sizeof(name), task, ns_ops);
- if (res >= 0)
- res = readlink_copy(buffer, buflen, name);
- }
- put_task_struct(task);
- return res;
-}
-
-static const struct inode_operations proc_ns_link_inode_operations = {
- .readlink = proc_ns_readlink,
- .get_link = proc_ns_get_link,
- .setattr = proc_setattr,
-};
-
-static struct dentry *proc_ns_instantiate(struct dentry *dentry,
- struct task_struct *task, const void *ptr)
-{
- const struct proc_ns_operations *ns_ops = ptr;
- struct inode *inode;
- struct proc_inode *ei;
-
- inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK | S_IRWXUGO);
- if (!inode)
- return ERR_PTR(-ENOENT);
-
- ei = PROC_I(inode);
- inode->i_op = &proc_ns_link_inode_operations;
- ei->ns_ops = ns_ops;
- pid_update_inode(task, inode);
-
- d_set_d_op(dentry, &pid_dentry_operations);
- return d_splice_alias(inode, dentry);
-}
-
-static int proc_ns_dir_readdir(struct file *file, struct dir_context *ctx)
-{
- struct task_struct *task = get_proc_task(file_inode(file));
- const struct proc_ns_operations **entry, **last;
-
- if (!task)
- return -ENOENT;
-
- if (!dir_emit_dots(file, ctx))
- goto out;
- if (ctx->pos >= 2 + ARRAY_SIZE(ns_entries))
- goto out;
- entry = ns_entries + (ctx->pos - 2);
- last = &ns_entries[ARRAY_SIZE(ns_entries) - 1];
- while (entry <= last) {
- const struct proc_ns_operations *ops = *entry;
- if (!proc_fill_cache(file, ctx, ops->name, strlen(ops->name),
- proc_ns_instantiate, task, ops))
- break;
- ctx->pos++;
- entry++;
- }
-out:
- put_task_struct(task);
- return 0;
-}
-
-const struct file_operations proc_ns_dir_operations = {
- .read = generic_read_dir,
- .iterate_shared = proc_ns_dir_readdir,
- .llseek = generic_file_llseek,
-};
-
-static struct dentry *proc_ns_dir_lookup(struct inode *dir,
- struct dentry *dentry, unsigned int flags)
-{
- struct task_struct *task = get_proc_task(dir);
- const struct proc_ns_operations **entry, **last;
- unsigned int len = dentry->d_name.len;
- struct dentry *res = ERR_PTR(-ENOENT);
-
- if (!task)
- goto out_no_task;
-
- last = &ns_entries[ARRAY_SIZE(ns_entries)];
- for (entry = ns_entries; entry < last; entry++) {
- if (strlen((*entry)->name) != len)
- continue;
- if (!memcmp(dentry->d_name.name, (*entry)->name, len))
- break;
- }
- if (entry == last)
- goto out;
-
- res = proc_ns_instantiate(dentry, task, *entry);
-out:
- put_task_struct(task);
-out_no_task:
- return res;
-}
-
-const struct inode_operations proc_ns_dir_inode_operations = {
- .lookup = proc_ns_dir_lookup,
- .getattr = pid_getattr,
- .setattr = proc_setattr,
-};
diff --git a/fs/proc/task_namespaces.c b/fs/proc/task_namespaces.c
new file mode 100644
index 000000000000..8e159fc78c0a
--- /dev/null
+++ b/fs/proc/task_namespaces.c
@@ -0,0 +1,183 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/proc_fs.h>
+#include <linux/nsproxy.h>
+#include <linux/ptrace.h>
+#include <linux/namei.h>
+#include <linux/file.h>
+#include <linux/utsname.h>
+#include <net/net_namespace.h>
+#include <linux/ipc_namespace.h>
+#include <linux/pid_namespace.h>
+#include <linux/user_namespace.h>
+#include "internal.h"
+
+
+static const struct proc_ns_operations *ns_entries[] = {
+#ifdef CONFIG_NET_NS
+ &netns_operations,
+#endif
+#ifdef CONFIG_UTS_NS
+ &utsns_operations,
+#endif
+#ifdef CONFIG_IPC_NS
+ &ipcns_operations,
+#endif
+#ifdef CONFIG_PID_NS
+ &pidns_operations,
+ &pidns_for_children_operations,
+#endif
+#ifdef CONFIG_USER_NS
+ &userns_operations,
+#endif
+ &mntns_operations,
+#ifdef CONFIG_CGROUPS
+ &cgroupns_operations,
+#endif
+#ifdef CONFIG_TIME_NS
+ &timens_operations,
+ &timens_for_children_operations,
+#endif
+};
+
+static const char *proc_ns_get_link(struct dentry *dentry,
+ struct inode *inode,
+ struct delayed_call *done)
+{
+ const struct proc_ns_operations *ns_ops = PROC_I(inode)->ns_ops;
+ struct task_struct *task;
+ struct path ns_path;
+ int error = -EACCES;
+
+ if (!dentry)
+ return ERR_PTR(-ECHILD);
+
+ task = get_proc_task(inode);
+ if (!task)
+ return ERR_PTR(-EACCES);
+
+ if (!ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS))
+ goto out;
+
+ error = ns_get_path(&ns_path, task, ns_ops);
+ if (error)
+ goto out;
+
+ error = nd_jump_link(&ns_path);
+out:
+ put_task_struct(task);
+ return ERR_PTR(error);
+}
+
+static int proc_ns_readlink(struct dentry *dentry, char __user *buffer, int buflen)
+{
+ struct inode *inode = d_inode(dentry);
+ const struct proc_ns_operations *ns_ops = PROC_I(inode)->ns_ops;
+ struct task_struct *task;
+ char name[50];
+ int res = -EACCES;
+
+ task = get_proc_task(inode);
+ if (!task)
+ return res;
+
+ if (ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS)) {
+ res = ns_get_name(name, sizeof(name), task, ns_ops);
+ if (res >= 0)
+ res = readlink_copy(buffer, buflen, name);
+ }
+ put_task_struct(task);
+ return res;
+}
+
+static const struct inode_operations proc_ns_link_inode_operations = {
+ .readlink = proc_ns_readlink,
+ .get_link = proc_ns_get_link,
+ .setattr = proc_setattr,
+};
+
+static struct dentry *proc_ns_instantiate(struct dentry *dentry,
+ struct task_struct *task, const void *ptr)
+{
+ const struct proc_ns_operations *ns_ops = ptr;
+ struct inode *inode;
+ struct proc_inode *ei;
+
+ inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK | S_IRWXUGO);
+ if (!inode)
+ return ERR_PTR(-ENOENT);
+
+ ei = PROC_I(inode);
+ inode->i_op = &proc_ns_link_inode_operations;
+ ei->ns_ops = ns_ops;
+ pid_update_inode(task, inode);
+
+ d_set_d_op(dentry, &pid_dentry_operations);
+ return d_splice_alias(inode, dentry);
+}
+
+static int proc_ns_dir_readdir(struct file *file, struct dir_context *ctx)
+{
+ struct task_struct *task = get_proc_task(file_inode(file));
+ const struct proc_ns_operations **entry, **last;
+
+ if (!task)
+ return -ENOENT;
+
+ if (!dir_emit_dots(file, ctx))
+ goto out;
+ if (ctx->pos >= 2 + ARRAY_SIZE(ns_entries))
+ goto out;
+ entry = ns_entries + (ctx->pos - 2);
+ last = &ns_entries[ARRAY_SIZE(ns_entries) - 1];
+ while (entry <= last) {
+ const struct proc_ns_operations *ops = *entry;
+ if (!proc_fill_cache(file, ctx, ops->name, strlen(ops->name),
+ proc_ns_instantiate, task, ops))
+ break;
+ ctx->pos++;
+ entry++;
+ }
+out:
+ put_task_struct(task);
+ return 0;
+}
+
+const struct file_operations proc_ns_dir_operations = {
+ .read = generic_read_dir,
+ .iterate_shared = proc_ns_dir_readdir,
+ .llseek = generic_file_llseek,
+};
+
+static struct dentry *proc_ns_dir_lookup(struct inode *dir,
+ struct dentry *dentry, unsigned int flags)
+{
+ struct task_struct *task = get_proc_task(dir);
+ const struct proc_ns_operations **entry, **last;
+ unsigned int len = dentry->d_name.len;
+ struct dentry *res = ERR_PTR(-ENOENT);
+
+ if (!task)
+ goto out_no_task;
+
+ last = &ns_entries[ARRAY_SIZE(ns_entries)];
+ for (entry = ns_entries; entry < last; entry++) {
+ if (strlen((*entry)->name) != len)
+ continue;
+ if (!memcmp(dentry->d_name.name, (*entry)->name, len))
+ break;
+ }
+ if (entry == last)
+ goto out;
+
+ res = proc_ns_instantiate(dentry, task, *entry);
+out:
+ put_task_struct(task);
+out_no_task:
+ return res;
+}
+
+const struct inode_operations proc_ns_dir_inode_operations = {
+ .lookup = proc_ns_dir_lookup,
+ .getattr = pid_getattr,
+ .setattr = proc_setattr,
+};


2020-07-30 12:03:41

by Kirill Tkhai

Subject: [PATCH 22/23] cgroup: Add cgroup namespaces into ns_idr

Now they are exposed in the /proc/namespaces/ directory.

Signed-off-by: Kirill Tkhai <[email protected]>
---
include/linux/cgroup.h | 1 +
kernel/cgroup/namespace.c | 23 +++++++++++++++++++----
2 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 451c2d26a5db..38913d91fa92 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -858,6 +858,7 @@ struct cgroup_namespace {
struct user_namespace *user_ns;
struct ucounts *ucounts;
struct css_set *root_cset;
+ struct rcu_head rcu;
};

extern struct cgroup_namespace init_cgroup_ns;
diff --git a/kernel/cgroup/namespace.c b/kernel/cgroup/namespace.c
index f5e8828c109c..64393bbafb2c 100644
--- a/kernel/cgroup/namespace.c
+++ b/kernel/cgroup/namespace.c
@@ -39,11 +39,12 @@ static struct cgroup_namespace *alloc_cgroup_ns(void)

void free_cgroup_ns(struct cgroup_namespace *ns)
{
+ ns_idr_unregister(&ns->ns);
put_css_set(ns->root_cset);
dec_cgroup_namespaces(ns->ucounts);
put_user_ns(ns->user_ns);
ns_free_inum(&ns->ns);
- kfree(ns);
+ kfree_rcu(ns, rcu);
}
EXPORT_SYMBOL(free_cgroup_ns);

@@ -54,6 +55,7 @@ struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
struct cgroup_namespace *new_ns;
struct ucounts *ucounts;
struct css_set *cset;
+ int err;

BUG_ON(!old_ns);

@@ -78,16 +80,28 @@ struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,

new_ns = alloc_cgroup_ns();
if (IS_ERR(new_ns)) {
- put_css_set(cset);
- dec_cgroup_namespaces(ucounts);
- return new_ns;
+ err = PTR_ERR(new_ns);
+ goto err_put_css_set;
}

new_ns->user_ns = get_user_ns(user_ns);
new_ns->ucounts = ucounts;
new_ns->root_cset = cset;

+ err = ns_idr_register(&new_ns->ns);
+ if (err < 0)
+ goto err_put_user_ns;
+
return new_ns;
+
+err_put_user_ns:
+ put_user_ns(new_ns->user_ns);
+ ns_free_inum(&new_ns->ns);
+ kfree(new_ns);
+err_put_css_set:
+ put_css_set(cset);
+ dec_cgroup_namespaces(ucounts);
+ return ERR_PTR(err);
}

static inline struct cgroup_namespace *to_cg_ns(struct ns_common *ns)
@@ -152,6 +166,7 @@ const struct proc_ns_operations cgroupns_operations = {

static __init int cgroup_namespaces_init(void)
{
+ WARN_ON(ns_idr_register(&init_cgroup_ns.ns) < 0);
return 0;
}
subsys_initcall(cgroup_namespaces_init);


2020-07-30 12:03:56

by Kirill Tkhai

Subject: [PATCH 17/23] pid: Add pid namespaces into ns_idr

Now they are exposed in the /proc/namespaces/ directory.

Note that we already wait for an RCU grace period before
the pid namespace's memory is freed.

Signed-off-by: Kirill Tkhai <[email protected]>
---
kernel/pid_namespace.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index da8490390f51..06398a7c4c59 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -109,8 +109,13 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
ns->ucounts = ucounts;
ns->pid_allocated = PIDNS_ADDING;

+ if (ns_idr_register(&ns->ns) < 0)
+ goto out_free_inum;
+
return ns;

+out_free_inum:
+ ns_free_inum(&ns->ns);
out_free_idr:
idr_destroy(&ns->idr);
kmem_cache_free(pid_ns_cachep, ns);
@@ -132,6 +137,7 @@ static void delayed_free_pidns(struct rcu_head *p)

static void destroy_pid_namespace(struct pid_namespace *ns)
{
+ ns_idr_unregister(&ns->ns);
ns_free_inum(&ns->ns);

idr_destroy(&ns->idr);
@@ -466,6 +472,7 @@ static __init int pid_namespaces_init(void)
#ifdef CONFIG_CHECKPOINT_RESTORE
register_sysctl_paths(kern_path, pid_ns_ctl_table);
#endif
+ WARN_ON(ns_idr_register(&init_pid_ns.ns) < 0);
return 0;
}



2020-07-30 12:04:20

by Kirill Tkhai

Subject: [PATCH 20/23] ipc: Add ipc namespaces into ns_idr

Now they are exposed in the /proc/namespaces/ directory.

Signed-off-by: Kirill Tkhai <[email protected]>
---
ipc/namespace.c | 13 ++++++++++++-
ipc/shm.c | 1 +
2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/ipc/namespace.c b/ipc/namespace.c
index 7bd0766ddc3b..ce6f87dd6d08 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -63,8 +63,17 @@ static struct ipc_namespace *create_ipc_ns(struct user_namespace *user_ns,
msg_init_ns(ns);
shm_init_ns(ns);

+ err = ns_idr_register(&ns->ns);
+ if (err)
+ goto fail_exit;
+
return ns;

+fail_exit:
+ mq_put_mnt(ns);
+ sem_exit_ns(ns);
+ msg_exit_ns(ns);
+ shm_exit_ns(ns);
fail_put:
put_user_ns(ns->user_ns);
ns_free_inum(&ns->ns);
@@ -117,6 +126,8 @@ void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,

static void free_ipc_ns(struct ipc_namespace *ns)
{
+ ns_idr_unregister(&ns->ns);
+
/* mq_put_mnt() waits for a grace period as kern_unmount()
* uses synchronize_rcu().
*/
@@ -128,7 +139,7 @@ static void free_ipc_ns(struct ipc_namespace *ns)
dec_ipc_namespaces(ns->ucounts);
put_user_ns(ns->user_ns);
ns_free_inum(&ns->ns);
- kfree(ns);
+ kfree(ns); /* RCU grace period wait is done in mq_put_mnt */
}

static LLIST_HEAD(free_ipc_list);
diff --git a/ipc/shm.c b/ipc/shm.c
index 6cf24a5994ec..9e83556d9dcb 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -137,6 +137,7 @@ void shm_exit_ns(struct ipc_namespace *ns)
static int __init ipc_ns_init(void)
{
shm_init_ns(&init_ipc_ns);
+ WARN_ON(ns_idr_register(&init_ipc_ns.ns) < 0);
return 0;
}



2020-07-30 12:04:30

by Kirill Tkhai

Subject: [PATCH 19/23] uts: Add uts namespaces into ns_idr

Now they are exposed in the /proc/namespaces/ directory.

Signed-off-by: Kirill Tkhai <[email protected]>
---
kernel/utsname.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/kernel/utsname.c b/kernel/utsname.c
index aebf4df5f592..883855ca16cd 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -70,8 +70,16 @@ static struct uts_namespace *clone_uts_ns(struct user_namespace *user_ns,
memcpy(&ns->name, &old_ns->name, sizeof(ns->name));
ns->user_ns = get_user_ns(user_ns);
up_read(&uts_sem);
+
+ err = ns_idr_register(&ns->ns);
+ if (err)
+ goto fail_put;
+
return ns;

+fail_put:
+ put_user_ns(user_ns);
+ ns_free_inum(&ns->ns);
fail_free:
kmem_cache_free(uts_ns_cache, ns);
fail_dec:
@@ -113,6 +121,7 @@ static void free_uts_ns_rcu(struct rcu_head *head)

void free_uts_ns(struct uts_namespace *ns)
{
+ ns_idr_unregister(&ns->ns);
dec_uts_namespaces(ns->ucounts);
put_user_ns(ns->user_ns);
ns_free_inum(&ns->ns);
@@ -182,4 +191,5 @@ void __init uts_ns_init(void)
offsetof(struct uts_namespace, name),
sizeof_field(struct uts_namespace, name),
NULL);
+ WARN_ON(ns_idr_register(&init_uts_ns.ns) < 0);
}


2020-07-30 12:04:54

by Kirill Tkhai

Subject: [PATCH 02/23] uts: Use generic ns_common::count

Convert the uts namespace to use the generic ns_common counter instead of kref.

Signed-off-by: Kirill Tkhai <[email protected]>
---
include/linux/utsname.h | 9 ++++-----
init/version.c | 2 +-
kernel/utsname.c | 7 ++-----
3 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/include/linux/utsname.h b/include/linux/utsname.h
index 44429d9142ca..2b1737c9b244 100644
--- a/include/linux/utsname.h
+++ b/include/linux/utsname.h
@@ -4,7 +4,6 @@


#include <linux/sched.h>
-#include <linux/kref.h>
#include <linux/nsproxy.h>
#include <linux/ns_common.h>
#include <linux/err.h>
@@ -22,7 +21,6 @@ struct user_namespace;
extern struct user_namespace init_user_ns;

struct uts_namespace {
- struct kref kref;
struct new_utsname name;
struct user_namespace *user_ns;
struct ucounts *ucounts;
@@ -33,16 +31,17 @@ extern struct uts_namespace init_uts_ns;
#ifdef CONFIG_UTS_NS
static inline void get_uts_ns(struct uts_namespace *ns)
{
- kref_get(&ns->kref);
+ refcount_inc(&ns->ns.count);
}

extern struct uts_namespace *copy_utsname(unsigned long flags,
struct user_namespace *user_ns, struct uts_namespace *old_ns);
-extern void free_uts_ns(struct kref *kref);
+extern void free_uts_ns(struct uts_namespace *ns);

static inline void put_uts_ns(struct uts_namespace *ns)
{
- kref_put(&ns->kref, free_uts_ns);
+ if (refcount_dec_and_test(&ns->ns.count))
+ free_uts_ns(ns);
}

void uts_ns_init(void);
diff --git a/init/version.c b/init/version.c
index cba341161b58..80d2b7566b39 100644
--- a/init/version.c
+++ b/init/version.c
@@ -25,7 +25,7 @@ int version_string(LINUX_VERSION_CODE);
#endif

struct uts_namespace init_uts_ns = {
- .kref = KREF_INIT(2),
+ .ns.count = REFCOUNT_INIT(2),
.name = {
.sysname = UTS_SYSNAME,
.nodename = UTS_NODENAME,
diff --git a/kernel/utsname.c b/kernel/utsname.c
index e488d0e2ab45..b1ac3ca870f2 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -33,7 +33,7 @@ static struct uts_namespace *create_uts_ns(void)

uts_ns = kmem_cache_alloc(uts_ns_cache, GFP_KERNEL);
if (uts_ns)
- kref_init(&uts_ns->kref);
+ refcount_set(&uts_ns->ns.count, 1);
return uts_ns;
}

@@ -103,11 +103,8 @@ struct uts_namespace *copy_utsname(unsigned long flags,
return new_ns;
}

-void free_uts_ns(struct kref *kref)
+void free_uts_ns(struct uts_namespace *ns)
{
- struct uts_namespace *ns;
-
- ns = container_of(kref, struct uts_namespace, kref);
dec_uts_namespaces(ns->ucounts);
put_user_ns(ns->user_ns);
ns_free_inum(&ns->ns);


2020-07-30 12:21:06

by Alexey Dobriyan

Subject: Re: [PATCH 11/23] fs: Add /proc/namespaces/ directory

On Thu, Jul 30, 2020 at 03:00:19PM +0300, Kirill Tkhai wrote:

> # ls /proc/namespaces/ -l
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'cgroup:[4026531835]' -> 'cgroup:[4026531835]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'ipc:[4026531839]' -> 'ipc:[4026531839]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531840]' -> 'mnt:[4026531840]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531861]' -> 'mnt:[4026531861]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532133]' -> 'mnt:[4026532133]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532134]' -> 'mnt:[4026532134]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532135]' -> 'mnt:[4026532135]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532136]' -> 'mnt:[4026532136]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'net:[4026531993]' -> 'net:[4026531993]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'pid:[4026531836]' -> 'pid:[4026531836]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'time:[4026531834]' -> 'time:[4026531834]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'user:[4026531837]' -> 'user:[4026531837]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'uts:[4026531838]' -> 'uts:[4026531838]'

I'd say make it '%s-%llu'. The brackets don't carry any information.
And ':' forces quoting with recent coreutils.

> +static int parse_namespace_dentry_name(const struct dentry *dentry,
> + const char **type, unsigned int *type_len, unsigned int *inum)
> +{
> + const char *p, *name;
> + int count;
> +
> + *type = name = dentry->d_name.name;
> + p = strchr(name, ':');
> + *type_len = p - name;
> + if (!p || p == name)
> + return -ENOENT;
> +
> + p += 1;
> + if (sscanf(p, "[%u]%n", inum, &count) != 1 || *(p + count) != '\0' ||
> + *inum < PROC_NS_MIN_INO)
> + return -ENOENT;

sscanf is banned from lookup code due to lax whitespace rules.
See

commit ac7f1061c2c11bb8936b1b6a94cdb48de732f7a4
proc: fix /proc/*/map_files lookup

Of course someone sneaked in 1 instance, yikes.

$ grep -e scanf -n -r fs/proc/
fs/proc/base.c:1596: err = sscanf(pos, "%9s %lld %lu", clock,
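
To illustrate the whitespace point, here is a minimal userspace sketch. It uses the C library's sscanf(), not the kernel's vsscanf(), so details such as sign handling may differ; lax_match() is a hypothetical helper that mirrors the "[%u]%n" check from the quoted patch.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper mirroring the quoted check: returns 1 if "[%u]%n"
 * consumes the whole string, 0 otherwise. count starts at 0 because %n
 * is never stored when the closing ']' fails to match.
 */
int lax_match(const char *s, unsigned int *inum)
{
	int count = 0;

	return sscanf(s, "[%u]%n", inum, &count) == 1 && s[count] == '\0';
}
```

With the C library, lax_match() accepts "[42]", but also "[ 42]" and "[+42]", because %u skips leading whitespace and allows a sign. Several distinct names would therefore resolve to the same object, which is what a lookup path wants to avoid.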

> +static int proc_namespaces_readdir(struct file *file, struct dir_context *ctx)

> + len = snprintf(name, sizeof(name), "%s:[%u]", ns->ops->name, inum);

[] -- no need.

2020-07-30 12:26:50

by Matthew Wilcox

Subject: Re: [PATCH 09/23] ns: Introduce ns_idr to be able to iterate all allocated namespaces in the system

On Thu, Jul 30, 2020 at 03:00:08PM +0300, Kirill Tkhai wrote:
> This patch introduces a new IDR and functions to add/remove and iterate
> registered namespaces in the system. It will be used to list namespaces
> in /proc/namespaces/... in next patches.

Looks like you could use an XArray for this and it would be fewer lines of
code.

>
> static struct vfsmount *nsfs_mnt;
> +static DEFINE_SPINLOCK(ns_lock);
> +static DEFINE_IDR(ns_idr);

XArray includes its own spinlock.

> +/*
> + * Add a newly created ns to ns_idr. The ns must be fully
> + * initialized since it becomes available for ns_get_next()
> + * right after we exit this function.
> + */
> +int ns_idr_register(struct ns_common *ns)
> +{
> + int ret, id = ns->inum - PROC_NS_MIN_INO;
> +
> + if (WARN_ON(id < 0))
> + return -EINVAL;
> +
> + idr_preload(GFP_KERNEL);
> + spin_lock_irq(&ns_lock);
> + ret = idr_alloc(&ns_idr, ns, id, id + 1, GFP_ATOMIC);
> + spin_unlock_irq(&ns_lock);
> + idr_preload_end();
> + return ret < 0 ? ret : 0;

This would simply be return xa_insert_irq(...);

> +}
> +
> +/*
> + * Remove a dead ns from ns_idr. Note, that ns memory must
> + * be freed not earlier then one RCU grace period after
> + * this function, since ns_get_next() uses RCU to iterate the IDR.
> + */
> +void ns_idr_unregister(struct ns_common *ns)
> +{
> + int id = ns->inum - PROC_NS_MIN_INO;
> + unsigned long flags;
> +
> + if (WARN_ON(id < 0))
> + return;
> +
> + spin_lock_irqsave(&ns_lock, flags);
> + idr_remove(&ns_idr, id);
> + spin_unlock_irqrestore(&ns_lock, flags);
> +}

xa_erase_irqsave();

> +
> +/*
> + * This returns ns with inum greater than @id or NULL.
> + * @id is updated to refer the ns inum.
> + */
> +struct ns_common *ns_get_next(unsigned int *id)
> +{
> + struct ns_common *ns;
> +
> + if (*id < PROC_NS_MIN_INO - 1)
> + *id = PROC_NS_MIN_INO - 1;
> +
> + *id += 1;
> + *id -= PROC_NS_MIN_INO;
> +
> + rcu_read_lock();
> + do {
> + ns = idr_get_next(&ns_idr, id);
> + if (!ns)
> + break;

xa_find_after();

You'll want a temporary unsigned long to work with ...

> + if (!refcount_inc_not_zero(&ns->count)) {
> + ns = NULL;
> + *id += 1;

you won't need this increment.
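
For readers unfamiliar with the primitive the quoted loop depends on: whichever container is used, the RCU-side walk may only hand out a namespace whose count has not yet dropped to zero. Below is a rough userspace sketch of that "get unless zero" idea using C11 atomics. struct obj, obj_get_unless_zero() and obj_put() are made-up names for illustration, and the kernel's refcount_t additionally saturates on overflow, which is omitted here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for the counter embedded in struct ns_common. */
struct obj {
	atomic_uint count;
};

/* Take a reference only if the object is still alive (count > 0). */
bool obj_get_unless_zero(struct obj *o)
{
	unsigned int old = atomic_load(&o->count);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&o->count, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true when the caller dropped the last one. */
bool obj_put(struct obj *o)
{
	return atomic_fetch_sub(&o->count, 1) == 1;
}
```

Once obj_put() has returned true, a concurrent obj_get_unless_zero() fails, so an iterator simply skips that entry and moves on to the next index.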

2020-07-30 13:10:35

by Christian Brauner

Subject: Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

On Thu, Jul 30, 2020 at 02:59:20PM +0300, Kirill Tkhai wrote:
> Currently, there is no a way to list or iterate all or subset of namespaces
> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
> but some also may be as open files, which are not attached to a process.
> When a namespace open fd is sent over unix socket and then closed, it is
> impossible to know whether the namespace exists or not.
>
> Also, even if namespace is exposed as attached to a process or as open file,
> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
> this multiplies at tasks and fds number.
>
> This patchset introduces a new /proc/namespaces/ directory, which exposes
> subset of permitted namespaces in linear view:
>
> # ls /proc/namespaces/ -l
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'cgroup:[4026531835]' -> 'cgroup:[4026531835]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'ipc:[4026531839]' -> 'ipc:[4026531839]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531840]' -> 'mnt:[4026531840]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531861]' -> 'mnt:[4026531861]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532133]' -> 'mnt:[4026532133]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532134]' -> 'mnt:[4026532134]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532135]' -> 'mnt:[4026532135]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532136]' -> 'mnt:[4026532136]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'net:[4026531993]' -> 'net:[4026531993]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'pid:[4026531836]' -> 'pid:[4026531836]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'time:[4026531834]' -> 'time:[4026531834]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'user:[4026531837]' -> 'user:[4026531837]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'uts:[4026531838]' -> 'uts:[4026531838]'
>
> Namespace ns is exposed, in case of its user_ns is permitted from /proc's pid_ns.
> I.e., /proc is related to pid_ns, so in /proc/namespace we show only a ns, which is
>
> in_userns(pid_ns->user_ns, ns->user_ns).
>
> In case of ns is a user_ns:
>
> in_userns(pid_ns->user_ns, ns).
>
> The patchset follows this steps:
>
> 1)A generic counter in ns_common is introduced instead of separate
> counters for every ns type (net::count, uts_namespace::kref,
> user_namespace::count, etc). Patches [1-8];
> 2)Patch [9] introduces IDR to link and iterate alive namespaces;
> 3)Patch [10] is refactoring;
> 4)Patch [11] actually adds /proc/namespace directory and fs methods;
> 5)Patches [12-23] make every namespace to use the added methods
> and to appear in /proc/namespace directory.
>
> This may be usefull to write effective debug utils (say, fast build
> of networks topology) and checkpoint/restore software.

Kirill,

Thanks for working on this!
We have a need for this functionality too for namespace introspection.
I actually had a prototype of this as well, but mine was based on debugfs;
/proc/namespaces seems like a good place.

Christian

2020-07-30 13:23:07

by Kirill Tkhai

Subject: Re: [PATCH 11/23] fs: Add /proc/namespaces/ directory

On 30.07.2020 15:18, Alexey Dobriyan wrote:
> On Thu, Jul 30, 2020 at 03:00:19PM +0300, Kirill Tkhai wrote:
>
>> # ls /proc/namespaces/ -l
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'cgroup:[4026531835]' -> 'cgroup:[4026531835]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'ipc:[4026531839]' -> 'ipc:[4026531839]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531840]' -> 'mnt:[4026531840]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531861]' -> 'mnt:[4026531861]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532133]' -> 'mnt:[4026532133]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532134]' -> 'mnt:[4026532134]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532135]' -> 'mnt:[4026532135]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532136]' -> 'mnt:[4026532136]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'net:[4026531993]' -> 'net:[4026531993]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'pid:[4026531836]' -> 'pid:[4026531836]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'time:[4026531834]' -> 'time:[4026531834]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'user:[4026531837]' -> 'user:[4026531837]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'uts:[4026531838]' -> 'uts:[4026531838]'
>
> I'd say make it '%s-%llu'. The brackets don't carry any information.
> And ':' forces quoting with recent coreutils.
>
>> +static int parse_namespace_dentry_name(const struct dentry *dentry,
>> + const char **type, unsigned int *type_len, unsigned int *inum)
>> +{
>> + const char *p, *name;
>> + int count;
>> +
>> + *type = name = dentry->d_name.name;
>> + p = strchr(name, ':');
>> + *type_len = p - name;
>> + if (!p || p == name)
>> + return -ENOENT;
>> +
>> + p += 1;
>> + if (sscanf(p, "[%u]%n", inum, &count) != 1 || *(p + count) != '\0' ||
>> + *inum < PROC_NS_MIN_INO)
>> + return -ENOENT;
>
> sscanf is banned from lookup code due to lax whitespace rules.
> See
>
> commit ac7f1061c2c11bb8936b1b6a94cdb48de732f7a4
> proc: fix /proc/*/map_files lookup

OK, thanks for pointing this out.

> Of course someone sneaked in 1 instance, yikes.
>
> $ grep -e scanf -n -r fs/proc/
> fs/proc/base.c:1596: err = sscanf(pos, "%9s %lld %lu", clock,
>
>> +static int proc_namespaces_readdir(struct file *file, struct dir_context *ctx)
>
>> + len = snprintf(name, sizeof(name), "%s:[%u]", ns->ops->name, inum);
>
> [] -- no need.
>

2020-07-30 13:29:40

by Christian Brauner

Subject: Re: [PATCH 11/23] fs: Add /proc/namespaces/ directory

On Thu, Jul 30, 2020 at 03:00:19PM +0300, Kirill Tkhai wrote:
> This is a new directory to show all namespaces, which can be
> accessed from this /proc tasks credentials.
>
> Every /proc is related to a pid_namespace, and the pid_namespace
> is related to a user_namespace. The items, we show in this
> /proc/namespaces/ directory, are the namespaces,
> whose user_namespaces are the same as /proc's user_namespace,
> or their descendants.
>
> Say, /proc has pid_ns->user_ns, so in /proc/namespace we show
> only a ns, which is in_userns(pid_ns->user_ns, ns->user_ns).
>
> The final result is like below:
>
> # ls /proc/namespaces/ -l
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'cgroup:[4026531835]' -> 'cgroup:[4026531835]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'ipc:[4026531839]' -> 'ipc:[4026531839]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531840]' -> 'mnt:[4026531840]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531861]' -> 'mnt:[4026531861]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532133]' -> 'mnt:[4026532133]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532134]' -> 'mnt:[4026532134]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532135]' -> 'mnt:[4026532135]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532136]' -> 'mnt:[4026532136]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'net:[4026531993]' -> 'net:[4026531993]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'pid:[4026531836]' -> 'pid:[4026531836]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'time:[4026531834]' -> 'time:[4026531834]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'user:[4026531837]' -> 'user:[4026531837]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'uts:[4026531838]' -> 'uts:[4026531838]'

So usually, the /proc/<pid>/ns entries are guarded by
ptrace_may_access() but from skimming the patch it seems that
/proc/namespaces/ would be accessible by any user.

I think we should guard /proc/namespaces/. Either by restricting it to
userns CAP_SYS_ADMIN or - to make it work with unprivileged CRIU - by
ns_capable(proc's_pid_ns->user_ns, CAP_SYS_PTRACE).


This should probably also be a mount option on procfs given that we now
allow a restricted view of procfs.

Christian

>
> Every namespace may be open like ordinary file in /proc/[pid]/ns.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---
> fs/nsfs.c | 2
> fs/proc/Makefile | 1
> fs/proc/internal.h | 16 ++
> fs/proc/namespaces.c | 314 +++++++++++++++++++++++++++++++++++++++++++++++
> fs/proc/root.c | 17 ++-
> include/linux/proc_fs.h | 1
> 6 files changed, 345 insertions(+), 6 deletions(-)
> create mode 100644 fs/proc/namespaces.c
>
> diff --git a/fs/nsfs.c b/fs/nsfs.c
> index ee4be67d3a0b..61b789d2089c 100644
> --- a/fs/nsfs.c
> +++ b/fs/nsfs.c
> @@ -58,7 +58,7 @@ static void nsfs_evict(struct inode *inode)
> ns->ops->put(ns);
> }
>
> -static int __ns_get_path(struct path *path, struct ns_common *ns)
> +int __ns_get_path(struct path *path, struct ns_common *ns)
> {
> struct vfsmount *mnt = nsfs_mnt;
> struct dentry *dentry;
> diff --git a/fs/proc/Makefile b/fs/proc/Makefile
> index dc2d51f42905..34ff671c6d59 100644
> --- a/fs/proc/Makefile
> +++ b/fs/proc/Makefile
> @@ -25,6 +25,7 @@ proc-y += util.o
> proc-y += version.o
> proc-y += softirqs.o
> proc-y += task_namespaces.o
> +proc-y += namespaces.o
> proc-y += self.o
> proc-y += thread_self.o
> proc-$(CONFIG_PROC_SYSCTL) += proc_sysctl.o
> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index 572757ff97be..d19fe5574799 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -134,10 +134,11 @@ void task_dump_owner(struct task_struct *task, umode_t mode,
> kuid_t *ruid, kgid_t *rgid);
>
> unsigned name_to_int(const struct qstr *qstr);
> -/*
> - * Offset of the first process in the /proc root directory..
> - */
> -#define FIRST_PROCESS_ENTRY 256
> +
> +/* Offset of "namespaces" entry in /proc root directory */
> +#define NAMESPACES_ENTRY 256
> +/* Offset of the first process in the /proc root directory */
> +#define FIRST_PROCESS_ENTRY (NAMESPACES_ENTRY + 1)
>
> /* Worst case buffer size needed for holding an integer. */
> #define PROC_NUMBUF 13
> @@ -168,6 +169,7 @@ extern void proc_pid_evict_inode(struct proc_inode *);
> extern struct inode *proc_pid_make_inode(struct super_block *, struct task_struct *, umode_t);
> extern void pid_update_inode(struct task_struct *, struct inode *);
> extern int pid_delete_dentry(const struct dentry *);
> +extern int proc_emit_namespaces(struct file *, struct dir_context *);
> extern int proc_pid_readdir(struct file *, struct dir_context *);
> struct dentry *proc_pid_lookup(struct dentry *, unsigned int);
> extern loff_t mem_lseek(struct file *, loff_t, int);
> @@ -222,6 +224,12 @@ void set_proc_pid_nlink(void);
> extern struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry *);
> extern void proc_entry_rundown(struct proc_dir_entry *);
>
> +/*
> + * namespaces.c
> + */
> +extern int proc_setup_namespaces(struct super_block *);
> +extern void proc_namespaces_init(void);
> +
> /*
> * task_namespaces.c
> */
> diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
> new file mode 100644
> index 000000000000..ab47e1555619
> --- /dev/null
> +++ b/fs/proc/namespaces.c
> @@ -0,0 +1,314 @@
> +#include <linux/pid_namespace.h>
> +#include <linux/user_namespace.h>
> +#include <linux/namei.h>
> +#include "internal.h"
> +
> +static unsigned namespaces_inum __ro_after_init;
> +
> +int proc_emit_namespaces(struct file *file, struct dir_context *ctx)
> +{
> + struct proc_fs_info *fs_info = proc_sb_info(file_inode(file)->i_sb);
> + struct inode *inode = d_inode(fs_info->proc_namespaces);
> +
> + return dir_emit(ctx, "namespaces", 10, inode->i_ino, DT_DIR);
> +}
> +
> +static int parse_namespace_dentry_name(const struct dentry *dentry,
> + const char **type, unsigned int *type_len, unsigned int *inum)
> +{
> + const char *p, *name;
> + int count;
> +
> + *type = name = dentry->d_name.name;
> + p = strchr(name, ':');
> + *type_len = p - name;
> + if (!p || p == name)
> + return -ENOENT;

Hm, rather:

p = strchr(name, ':');
if (!p || p == name)
return -ENOENT;
*type_len = p - name;
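
As a userspace sketch only, the reordered check could be folded into a parser for the "type:[inum]" form that avoids sscanf() altogether. parse_ns_name() and NS_MIN_INO are hypothetical names; the threshold is just the smallest inode number in the listing above, not the kernel's PROC_NS_MIN_INO definition.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Placeholder threshold taken from the listing; not the kernel's value. */
#define NS_MIN_INO 4026531834U

/*
 * Strictly parse "<type>:[<inum>]": non-empty type, decimal digits only,
 * no sign, whitespace, leading zero or trailing characters.
 * Returns 0 on success or -ENOENT.
 */
int parse_ns_name(const char *name, size_t *type_len, unsigned int *inum)
{
	const char *p = strchr(name, ':');
	unsigned long long val = 0;

	/* Validate p before it is used in pointer arithmetic. */
	if (!p || p == name)
		return -ENOENT;
	*type_len = (size_t)(p - name);

	if (*++p != '[')
		return -ENOENT;
	p++;
	if (*p < '1' || *p > '9')
		return -ENOENT;
	for (; *p >= '0' && *p <= '9'; p++) {
		val = val * 10 + (unsigned int)(*p - '0');
		if (val > UINT_MAX)
			return -ENOENT;
	}
	if (p[0] != ']' || p[1] != '\0' || val < NS_MIN_INO)
		return -ENOENT;

	*inum = (unsigned int)val;
	return 0;
}
```

For example, "uts:[4026531838]" parses to a type length of 3 and that inode number, while "uts:[ 4026531838]" and "uts:[4026531838]x" are rejected.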

> +
> + p += 1;
> + if (sscanf(p, "[%u]%n", inum, &count) != 1 || *(p + count) != '\0' ||
> + *inum < PROC_NS_MIN_INO)
> + return -ENOENT;
> +
> + return 0;
> +}
> +
> +static struct ns_common *get_namespace_by_dentry(struct pid_namespace *pid_ns,
> + const struct dentry *dentry)
> +{
> + unsigned int type_len, inum, p_inum;
> + struct user_namespace *user_ns;
> + struct ns_common *ns;
> + const char *type;
> +
> + if (parse_namespace_dentry_name(dentry, &type, &type_len, &inum) < 0)
> + return NULL;
> +
> + p_inum = inum - 1;
> + ns = ns_get_next(&p_inum);
> + if (!ns)
> + return NULL;
> +
> + if (ns->inum != inum || strncmp(type, ns->ops->name, type_len) != 0 ||
> + ns->ops->name[type_len] != '\0') {
> + ns->ops->put(ns);
> + return NULL;
> + }
> +
> + if (ns->ops != &userns_operations)
> + user_ns = ns->ops->owner(ns);
> + else
> + user_ns = container_of(ns, struct user_namespace, ns);
> +
> + if (!in_userns(pid_ns->user_ns, user_ns)) {
> + ns->ops->put(ns);
> + return NULL;
> + }
> +
> + return ns;
> +}
> +
> +static struct dentry *proc_namespace_instantiate(struct dentry *dentry,
> + struct task_struct *task, const void *ptr);
> +
> +static struct dentry *proc_namespaces_lookup(struct inode *dir, struct dentry *dentry,
> + unsigned int flags)
> +{
> + struct pid_namespace *pid_ns = proc_pid_ns(dir->i_sb);
> + struct task_struct *task;
> + struct ns_common *ns;
> +
> + ns = get_namespace_by_dentry(pid_ns, dentry);
> + if (!ns)
> + return ERR_PTR(-ENOENT);
> +
> + read_lock(&tasklist_lock);
> + task = get_task_struct(pid_ns->child_reaper);
> + read_unlock(&tasklist_lock);
> +
> + dentry = proc_namespace_instantiate(dentry, task, ns);
> + put_task_struct(task);
> + ns->ops->put(ns);
> +
> + return dentry;
> +}
> +
> +static int proc_namespaces_permission(struct inode *inode, int mask)
> +{
> + if ((mask & MAY_EXEC) && S_ISLNK(inode->i_mode))
> + return -EACCES;
> +
> + return 0;
> +}
> +
> +static int proc_namespaces_getattr(const struct path *path, struct kstat *stat,
> + u32 request_mask, unsigned int query_flags)
> +{
> + struct inode *inode = d_inode(path->dentry);
> +
> + generic_fillattr(inode, stat);
> + return 0;
> +}
> +
> +static const struct inode_operations proc_namespaces_inode_operations = {
> + .lookup = proc_namespaces_lookup,
> + .permission = proc_namespaces_permission,
> + .getattr = proc_namespaces_getattr,
> +};
> +
> +static int proc_namespaces_readlink(struct dentry *dentry, char __user *buffer, int buflen)
> +{
> + struct inode *dir = dentry->d_parent->d_inode;
> + struct pid_namespace *pid_ns = proc_pid_ns(dir->i_sb);
> + struct ns_common *ns;
> +
> + ns = get_namespace_by_dentry(pid_ns, dentry);
> + if (!ns)
> + return -ENOENT;
> + ns->ops->put(ns);
> +
> + /* proc_namespaces_readdir() creates dentry names in namespace format */
> + return readlink_copy(buffer, buflen, dentry->d_iname);
> +}
> +
> +int __ns_get_path(struct path *path, struct ns_common *ns);
> +
> +static const char *proc_namespaces_getlink(struct dentry *dentry,
> + struct inode *inode, struct delayed_call *done)
> +{
> + struct pid_namespace *pid_ns = proc_pid_ns(inode->i_sb);
> + struct ns_common *ns;
> + struct path path;
> + int ret;
> +
> + if (!dentry)
> + return ERR_PTR(-ECHILD);
> +
> + while (1) {
> + ret = -ENOENT;
> + ns = get_namespace_by_dentry(pid_ns, dentry);
> + if (!ns)
> + goto out;
> +
> + ret = __ns_get_path(&path, ns);
> + if (ret == -EAGAIN)
> + continue;
> + if (ret)
> + goto out;
> + break;
> + }
> +
> + ret = nd_jump_link(&path);
> +out:
> + return ERR_PTR(ret);
> +}
> +
> +static const struct inode_operations proc_namespaces_link_inode_operations = {
> + .readlink = proc_namespaces_readlink,
> + .get_link = proc_namespaces_getlink,
> +};
> +
> +static int namespace_delete_dentry(const struct dentry *dentry)
> +{
> + struct inode *dir = dentry->d_parent->d_inode;
> + struct pid_namespace *pid_ns = proc_pid_ns(dir->i_sb);
> + struct ns_common *ns;
> +
> + ns = get_namespace_by_dentry(pid_ns, dentry);
> + if (!ns)
> + return 1;
> +
> + ns->ops->put(ns);
> + return 0;
> +}
> +
> +const struct dentry_operations namespaces_dentry_operations = {
> + .d_delete = namespace_delete_dentry,
> +};
> +
> +static void namespace_update_inode(struct inode *inode)
> +{
> + struct user_namespace *user_ns = proc_pid_ns(inode->i_sb)->user_ns;
> +
> + inode->i_uid = make_kuid(user_ns, 0);
> + if (!uid_valid(inode->i_uid))
> + inode->i_uid = GLOBAL_ROOT_UID;
> +
> + inode->i_gid = make_kgid(user_ns, 0);
> + if (!gid_valid(inode->i_gid))
> + inode->i_gid = GLOBAL_ROOT_GID;
> +}
> +
> +static struct dentry *proc_namespace_instantiate(struct dentry *dentry,
> + struct task_struct *task, const void *ptr)
> +{
> + const struct ns_common *ns = ptr;
> + struct inode *inode;
> + struct proc_inode *ei;
> +
> + /*
> + * Create inode with credentials of @task, and add it to @task's
> + * quick removal list.
> + */
> + inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK | S_IRWXUGO);
> + if (!inode)
> + return ERR_PTR(-ENOENT);
> +
> + ei = PROC_I(inode);
> + inode->i_op = &proc_namespaces_link_inode_operations;
> + ei->ns_ops = ns->ops;
> + namespace_update_inode(inode);
> +
> + d_set_d_op(dentry, &namespaces_dentry_operations);
> + return d_splice_alias(inode, dentry);
> +}
> +
> +static int proc_namespaces_readdir(struct file *file, struct dir_context *ctx)
> +{
> + struct pid_namespace *pid_ns = proc_pid_ns(file_inode(file)->i_sb);
> + struct user_namespace *user_ns;
> + struct task_struct *task;
> + struct ns_common *ns;
> + unsigned int inum;
> +
> + read_lock(&tasklist_lock);
> + task = get_task_struct(pid_ns->child_reaper);
> + read_unlock(&tasklist_lock);
> +
> + if (!dir_emit_dots(file, ctx))
> + goto out;
> +
> + inum = ctx->pos - 2;
> + while ((ns = ns_get_next(&inum)) != NULL) {
> + unsigned int len;
> + char name[32];
> +
> + if (ns->ops != &userns_operations)
> + user_ns = ns->ops->owner(ns);
> + else
> + user_ns = container_of(ns, struct user_namespace, ns);
> +
> + if (!in_userns(pid_ns->user_ns, user_ns))
> + goto next;
> +
> + len = snprintf(name, sizeof(name), "%s:[%u]", ns->ops->name, inum);
> +
> + if (!proc_fill_cache(file, ctx, name, len,
> + proc_namespace_instantiate, task, ns)) {
> + ns->ops->put(ns);
> + break;
> + }
> +next:
> + ns->ops->put(ns);
> + ctx->pos = inum + 2;
> + }
> +out:
> + put_task_struct(task);
> + return 0;
> +}
> +
> +static const struct file_operations proc_namespaces_file_operations = {
> + .read = generic_read_dir,
> + .iterate_shared = proc_namespaces_readdir,
> + .llseek = generic_file_llseek,
> +};
> +
> +int proc_setup_namespaces(struct super_block *s)
> +{
> + struct proc_fs_info *fs_info = proc_sb_info(s);
> + struct inode *root_inode = d_inode(s->s_root);
> + struct dentry *namespaces;
> + int ret = -ENOMEM;
> +
> + inode_lock(root_inode);
> + namespaces = d_alloc_name(s->s_root, "namespaces");
> + if (namespaces) {
> + struct inode *inode = new_inode_pseudo(s);
> + if (inode) {
> + inode->i_ino = namespaces_inum;
> + inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
> + inode->i_mode = S_IFDIR | S_IRUGO | S_IXUGO;
> + inode->i_uid = GLOBAL_ROOT_UID;
> + inode->i_gid = GLOBAL_ROOT_GID;
> + inode->i_op = &proc_namespaces_inode_operations;
> + inode->i_fop = &proc_namespaces_file_operations;
> + d_add(namespaces, inode);
> + ret = 0;
> + } else {
> + dput(namespaces);
> + }
> + }
> + inode_unlock(root_inode);
> +
> + if (ret)
> + pr_err("proc_setup_namespaces: can't allocate /proc/namespaces\n");
> + else
> + fs_info->proc_namespaces = namespaces;
> +
> + return ret;
> +}
> +
> +void __init proc_namespaces_init(void)
> +{
> + proc_alloc_inum(&namespaces_inum);
> +}
> diff --git a/fs/proc/root.c b/fs/proc/root.c
> index 5e444d4f9717..e4e4f90fca3d 100644
> --- a/fs/proc/root.c
> +++ b/fs/proc/root.c
> @@ -206,6 +206,10 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
> return -ENOMEM;
> }
>
> + ret = proc_setup_namespaces(s);
> + if (ret)
> + return ret;
> +
> ret = proc_setup_self(s);
> if (ret) {
> return ret;
> @@ -272,6 +276,9 @@ static void proc_kill_sb(struct super_block *sb)
> dput(fs_info->proc_self);
> dput(fs_info->proc_thread_self);
>
> + if (fs_info->proc_namespaces)
> + dput(fs_info->proc_namespaces);
> +
> kill_anon_super(sb);
> put_pid_ns(fs_info->pid_ns);
> kfree(fs_info);
> @@ -289,6 +296,7 @@ void __init proc_root_init(void)
> {
> proc_init_kmemcache();
> set_proc_pid_nlink();
> + proc_namespaces_init();
> proc_self_init();
> proc_thread_self_init();
> proc_symlink("mounts", NULL, "self/mounts");
> @@ -326,8 +334,15 @@ static struct dentry *proc_root_lookup(struct inode * dir, struct dentry * dentr
>
> static int proc_root_readdir(struct file *file, struct dir_context *ctx)
> {
> - if (ctx->pos < FIRST_PROCESS_ENTRY) {
> + if (ctx->pos < NAMESPACES_ENTRY) {
> int error = proc_readdir(file, ctx);
> + if (unlikely(error <= 0))
> + return error;
> + ctx->pos = NAMESPACES_ENTRY;
> + }
> +
> + if (ctx->pos == NAMESPACES_ENTRY) {
> + int error = proc_emit_namespaces(file, ctx);
> if (unlikely(error <= 0))
> return error;
> ctx->pos = FIRST_PROCESS_ENTRY;
> diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> index 97b3f5f06db9..8b0002a6cacf 100644
> --- a/include/linux/proc_fs.h
> +++ b/include/linux/proc_fs.h
> @@ -61,6 +61,7 @@ struct proc_fs_info {
> struct pid_namespace *pid_ns;
> struct dentry *proc_self; /* For /proc/self */
> struct dentry *proc_thread_self; /* For /proc/thread-self */
> + struct dentry *proc_namespaces; /* For /proc/namespaces */
> kgid_t pid_gid;
> enum proc_hidepid hide_pid;
> enum proc_pidonly pidonly;
>
>

2020-07-30 13:33:03

by Kirill Tkhai

Subject: Re: [PATCH 09/23] ns: Introduce ns_idr to be able to iterate all allocated namespaces in the system

On 30.07.2020 15:23, Matthew Wilcox wrote:
> On Thu, Jul 30, 2020 at 03:00:08PM +0300, Kirill Tkhai wrote:
>> This patch introduces a new IDR and functions to add/remove and iterate
>> registered namespaces in the system. It will be used to list namespaces
>> in /proc/namespaces/... in next patches.
>
> Looks like you could use an XArray for this and it would be fewer lines of
> code.
>
>>
>> static struct vfsmount *nsfs_mnt;
>> +static DEFINE_SPINLOCK(ns_lock);
>> +static DEFINE_IDR(ns_idr);
>
> XArray includes its own spinlock.
>
>> +/*
>> + * Add a newly created ns to ns_idr. The ns must be fully
>> + * initialized since it becomes available for ns_get_next()
>> + * right after we exit this function.
>> + */
>> +int ns_idr_register(struct ns_common *ns)
>> +{
>> + int ret, id = ns->inum - PROC_NS_MIN_INO;
>> +
>> + if (WARN_ON(id < 0))
>> + return -EINVAL;
>> +
>> + idr_preload(GFP_KERNEL);
>> + spin_lock_irq(&ns_lock);
>> + ret = idr_alloc(&ns_idr, ns, id, id + 1, GFP_ATOMIC);
>> + spin_unlock_irq(&ns_lock);
>> + idr_preload_end();
>> + return ret < 0 ? ret : 0;
>
> This would simply be return xa_insert_irq(...);
>
>> +}
>> +
>> +/*
>> + * Remove a dead ns from ns_idr. Note, that ns memory must
>> + * be freed not earlier then one RCU grace period after
>> + * this function, since ns_get_next() uses RCU to iterate the IDR.
>> + */
>> +void ns_idr_unregister(struct ns_common *ns)
>> +{
>> + int id = ns->inum - PROC_NS_MIN_INO;
>> + unsigned long flags;
>> +
>> + if (WARN_ON(id < 0))
>> + return;
>> +
>> + spin_lock_irqsave(&ns_lock, flags);
>> + idr_remove(&ns_idr, id);
>> + spin_unlock_irqrestore(&ns_lock, flags);
>> +}
>
> xa_erase_irqsave();

static inline void *xa_erase_irqsave(struct xarray *xa, unsigned long index)
{
unsigned long flags;
void *entry;

xa_lock_irqsave(xa, flags);
entry = __xa_erase(xa, index);
xa_unlock_irqrestore(xa, flags);

return entry;
}

>> +
>> +/*
>> + * This returns ns with inum greater than @id or NULL.
>> + * @id is updated to refer the ns inum.
>> + */
>> +struct ns_common *ns_get_next(unsigned int *id)
>> +{
>> + struct ns_common *ns;
>> +
>> + if (*id < PROC_NS_MIN_INO - 1)
>> + *id = PROC_NS_MIN_INO - 1;
>> +
>> + *id += 1;
>> + *id -= PROC_NS_MIN_INO;
>> +
>> + rcu_read_lock();
>> + do {
>> + ns = idr_get_next(&ns_idr, id);
>> + if (!ns)
>> + break;
>
> xa_find_after();
>
> You'll want a temporary unsigned long to work with ...
>
>> + if (!refcount_inc_not_zero(&ns->count)) {
>> + ns = NULL;
>> + *id += 1;
>
> you won't need this increment.

Why? I don't see how XArray lets us avoid this.

2020-07-30 13:39:06

by Christian Brauner

Subject: Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

[Cc: linux-api]

On Thu, Jul 30, 2020 at 03:08:53PM +0200, Christian Brauner wrote:
> On Thu, Jul 30, 2020 at 02:59:20PM +0300, Kirill Tkhai wrote:
> > Currently, there is no a way to list or iterate all or subset of namespaces
> > in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
> > but some also may be as open files, which are not attached to a process.
> > When a namespace open fd is sent over unix socket and then closed, it is
> > impossible to know whether the namespace exists or not.
> >
> > Also, even if namespace is exposed as attached to a process or as open file,
> > iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
> > this multiplies at tasks and fds number.
> >
> > This patchset introduces a new /proc/namespaces/ directory, which exposes
> > subset of permitted namespaces in linear view:
> >
> > # ls /proc/namespaces/ -l
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'cgroup:[4026531835]' -> 'cgroup:[4026531835]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'ipc:[4026531839]' -> 'ipc:[4026531839]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531840]' -> 'mnt:[4026531840]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531861]' -> 'mnt:[4026531861]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532133]' -> 'mnt:[4026532133]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532134]' -> 'mnt:[4026532134]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532135]' -> 'mnt:[4026532135]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532136]' -> 'mnt:[4026532136]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'net:[4026531993]' -> 'net:[4026531993]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'pid:[4026531836]' -> 'pid:[4026531836]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'time:[4026531834]' -> 'time:[4026531834]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'user:[4026531837]' -> 'user:[4026531837]'
> > lrwxrwxrwx 1 root root 0 Jul 29 16:50 'uts:[4026531838]' -> 'uts:[4026531838]'
> >
> > Namespace ns is exposed, in case of its user_ns is permitted from /proc's pid_ns.
> > I.e., /proc is related to pid_ns, so in /proc/namespace we show only a ns, which is
> >
> > in_userns(pid_ns->user_ns, ns->user_ns).
> >
> > In case of ns is a user_ns:
> >
> > in_userns(pid_ns->user_ns, ns).
> >
> > The patchset follows this steps:
> >
> > 1)A generic counter in ns_common is introduced instead of separate
> > counters for every ns type (net::count, uts_namespace::kref,
> > user_namespace::count, etc). Patches [1-8];
> > 2)Patch [9] introduces IDR to link and iterate alive namespaces;
> > 3)Patch [10] is refactoring;
> > 4)Patch [11] actually adds /proc/namespace directory and fs methods;
> > 5)Patches [12-23] make every namespace to use the added methods
> > and to appear in /proc/namespace directory.
> >
> > This may be usefull to write effective debug utils (say, fast build
> > of networks topology) and checkpoint/restore software.
>
> Kirill,
>
> Thanks for working on this!
> We have a need for this functionality too for namespace introspection.
> I actually had a prototype of this as well but mine was based on debugfs
> but /proc/namespaces seems like a good place.
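
The visibility rule from the cover letter quoted above can be modelled in userspace. The following is only a sketch under stated assumptions: `struct user_ns`, `struct other_ns`, `in_userns_model()`, `ns_visible()` and `userns_visible()` are hypothetical stand-ins, not the kernel's types or functions. It assumes each user namespace records its parent and its depth, and that "permitted" means the /proc mount's user namespace is the target's owner or an ancestor of it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Reduced, hypothetical stand-in for the kernel's struct user_namespace. */
struct user_ns {
	const struct user_ns *parent;	/* NULL for the initial namespace */
	int level;			/* 0 for the initial namespace */
};

/* Stand-in for any non-user namespace: it only records its owning user_ns. */
struct other_ns {
	const struct user_ns *owner;
};

/* True if @child is @ancestor itself or one of its descendants. */
static bool in_userns_model(const struct user_ns *ancestor,
			    const struct user_ns *child)
{
	const struct user_ns *ns = child;

	/* Walk up the parent chain until we reach the ancestor's depth. */
	while (ns->level > ancestor->level)
		ns = ns->parent;
	return ns == ancestor;
}

/* A non-user namespace is judged by the user namespace that owns it. */
static bool ns_visible(const struct user_ns *proc_user_ns,
		       const struct other_ns *ns)
{
	return in_userns_model(proc_user_ns, ns->owner);
}

/* A user namespace is judged by itself rather than by an owner. */
static bool userns_visible(const struct user_ns *proc_user_ns,
			   const struct user_ns *ns)
{
	return in_userns_model(proc_user_ns, ns);
}
```

With a tree init -> a -> a1 and init -> b, a /proc mounted for `a` would list namespaces owned by `a` or `a1`, but nothing owned by `b` or by `init`.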

2020-07-30 14:00:09

by Matthew Wilcox

Subject: Re: [PATCH 09/23] ns: Introduce ns_idr to be able to iterate all allocated namespaces in the system

On Thu, Jul 30, 2020 at 04:32:22PM +0300, Kirill Tkhai wrote:
> On 30.07.2020 15:23, Matthew Wilcox wrote:
> > xa_erase_irqsave();
>
> static inline void *xa_erase_irqsave(struct xarray *xa, unsigned long index)
> {
> unsigned long flags;
> void *entry;
>
> xa_lock_irqsave(xa, flags);
> entry = __xa_erase(xa, index);
> xa_unlock_irqrestore(xa, flags);
>
> return entry;
> }

Was there a question here?

> >> +struct ns_common *ns_get_next(unsigned int *id)
> >> +{
> >> + struct ns_common *ns;
> >> +
> >> + if (*id < PROC_NS_MIN_INO - 1)
> >> + *id = PROC_NS_MIN_INO - 1;
> >> +
> >> + *id += 1;
> >> + *id -= PROC_NS_MIN_INO;
> >> +
> >> + rcu_read_lock();
> >> + do {
> >> + ns = idr_get_next(&ns_idr, id);
> >> + if (!ns)
> >> + break;
> >
> > xa_find_after();
> >
> > You'll want a temporary unsigned long to work with ...
> >
> >> + if (!refcount_inc_not_zero(&ns->count)) {
> >> + ns = NULL;
> >> + *id += 1;
> >
> > you won't need this increment.
>
> Why? I don't see a way xarray allows to avoid this.

It's embedded in xa_find_after().

2020-07-30 14:12:46

by Kirill Tkhai

Subject: Re: [PATCH 09/23] ns: Introduce ns_idr to be able to iterate all allocated namespaces in the system

On 30.07.2020 16:56, Matthew Wilcox wrote:
> On Thu, Jul 30, 2020 at 04:32:22PM +0300, Kirill Tkhai wrote:
>> On 30.07.2020 15:23, Matthew Wilcox wrote:
>>> xa_erase_irqsave();
>>
>> static inline void *xa_erase_irqsave(struct xarray *xa, unsigned long index)
>> {
>> unsigned long flags;
>> void *entry;
>>
>> xa_lock_irqsave(xa, flags);
>> entry = __xa_erase(xa, index);
>> xa_unlock_irqrestore(xa, flags);
>>
>> return entry;
>> }
>
> was there a question here?

No, I just meant that I will add this in a separate patch.

>>>> +struct ns_common *ns_get_next(unsigned int *id)
>>>> +{
>>>> + struct ns_common *ns;
>>>> +
>>>> + if (*id < PROC_NS_MIN_INO - 1)
>>>> + *id = PROC_NS_MIN_INO - 1;
>>>> +
>>>> + *id += 1;
>>>> + *id -= PROC_NS_MIN_INO;
>>>> +
>>>> + rcu_read_lock();
>>>> + do {
>>>> + ns = idr_get_next(&ns_idr, id);
>>>> + if (!ns)
>>>> + break;
>>>
>>> xa_find_after();
>>>
>>> You'll want a temporary unsigned long to work with ...
>>>
>>>> + if (!refcount_inc_not_zero(&ns->count)) {
>>>> + ns = NULL;
>>>> + *id += 1;
>>>
>>> you won't need this increment.
>>
>> Why? I don't see a way xarray allows to avoid this.
>
> It's embedded in xa_find_after().

How can it be embedded? xa_find_after() knows nothing about ns->count, so how would it check it?

2020-07-30 14:16:33

by Matthew Wilcox

Subject: Re: [PATCH 09/23] ns: Introduce ns_idr to be able to iterate all allocated namespaces in the system

On Thu, Jul 30, 2020 at 05:12:09PM +0300, Kirill Tkhai wrote:
> On 30.07.2020 16:56, Matthew Wilcox wrote:
> > On Thu, Jul 30, 2020 at 04:32:22PM +0300, Kirill Tkhai wrote:
> >> On 30.07.2020 15:23, Matthew Wilcox wrote:
> >>> xa_erase_irqsave();
> >>
> >> static inline void *xa_erase_irqsave(struct xarray *xa, unsigned long index)
> >> {
> >> unsigned long flags;
> >> void *entry;
> >>
> >> xa_lock_irqsave(xa, flags);
> >> entry = __xa_erase(xa, index);
> >> xa_unlock_irqrestore(xa, flags);
> >>
> >> return entry;
> >> }
> >
> > was there a question here?
>
> No, I just I will add this in separate patch.

Ah, yes. Thanks!

> >>>> +struct ns_common *ns_get_next(unsigned int *id)
> >>>> +{
> >>>> + struct ns_common *ns;
> >>>> +
> >>>> + if (*id < PROC_NS_MIN_INO - 1)
> >>>> + *id = PROC_NS_MIN_INO - 1;
> >>>> +
> >>>> + *id += 1;
> >>>> + *id -= PROC_NS_MIN_INO;
> >>>> +
> >>>> + rcu_read_lock();
> >>>> + do {
> >>>> + ns = idr_get_next(&ns_idr, id);
> >>>> + if (!ns)
> >>>> + break;
> >>>
> >>> xa_find_after();
> >>>
> >>> You'll want a temporary unsigned long to work with ...
> >>>
> >>>> + if (!refcount_inc_not_zero(&ns->count)) {
> >>>> + ns = NULL;
> >>>> + *id += 1;
> >>>
> >>> you won't need this increment.
> >>
> >> Why? I don't see a way xarray allows to avoid this.
> >
> > It's embedded in xa_find_after().
>
> How is it embedded to check ns->count that it knows nothing?

I meant you won't need to increment '*id'. The refcount is, of course,
your business.
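
To illustrate the point about not incrementing '*id', here is a minimal userspace model. It does not use the kernel XArray API: `slots`, `find_after()`, `get_unless_zero()` and `get_next()` are hypothetical stand-ins. The only property borrowed from xa_find_after() is that the lookup takes a pointer to an unsigned long index and advances it past the entry it returns, so a caller that rejects an entry just calls the lookup again.

```c
#include <stddef.h>

#define NR_SLOTS 16

struct ns_model {
	unsigned int refcount;	/* 0 means the object is already dying */
};

/* Sparse table indexed by id; empty slots are NULL. */
static struct ns_model *slots[NR_SLOTS];

/*
 * Model of a "find after" lookup: return the first present entry whose
 * index is strictly greater than *index, and store that index back.
 */
static struct ns_model *find_after(unsigned long *index)
{
	unsigned long i;

	for (i = *index + 1; i < NR_SLOTS; i++) {
		if (slots[i]) {
			*index = i;
			return slots[i];
		}
	}
	return NULL;
}

/* Model of refcount_inc_not_zero(): take a reference unless it is zero. */
static int get_unless_zero(struct ns_model *ns)
{
	if (ns->refcount == 0)
		return 0;
	ns->refcount++;
	return 1;
}

/*
 * Return the next live entry after *id with a reference held, or NULL.
 * A dying entry is skipped by looking up again; the index was already
 * advanced by find_after(), so no manual increment is needed.
 */
static struct ns_model *get_next(unsigned int *id)
{
	unsigned long index = *id;	/* temporary unsigned long */
	struct ns_model *ns;

	while ((ns = find_after(&index)) != NULL) {
		if (get_unless_zero(ns))
			break;
	}
	if (ns)
		*id = (unsigned int)index;
	return ns;
}
```

The refcount check stays in the caller's loop, which matches the division of labour described above: the container advances the index, the caller decides whether the object is still usable.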

2020-07-30 14:23:59

by Kirill Tkhai

Subject: Re: [PATCH 09/23] ns: Introduce ns_idr to be able to iterate all allocated namespaces in the system

On 30.07.2020 17:15, Matthew Wilcox wrote:
> On Thu, Jul 30, 2020 at 05:12:09PM +0300, Kirill Tkhai wrote:
>> On 30.07.2020 16:56, Matthew Wilcox wrote:
>>> On Thu, Jul 30, 2020 at 04:32:22PM +0300, Kirill Tkhai wrote:
>>>> On 30.07.2020 15:23, Matthew Wilcox wrote:
>>>>> xa_erase_irqsave();
>>>>
>>>> static inline void *xa_erase_irqsave(struct xarray *xa, unsigned long index)
>>>> {
>>>> unsigned long flags;
>>>> void *entry;
>>>>
>>>> xa_lock_irqsave(xa, flags);
>>>> entry = __xa_erase(xa, index);
>>>> xa_unlock_irqrestore(xa, flags);
>>>>
>>>> return entry;
>>>> }
>>>
>>> was there a question here?
>>
>> No, I just I will add this in separate patch.
>
> Ah, yes. Thanks!
>
>>>>>> +struct ns_common *ns_get_next(unsigned int *id)
>>>>>> +{
>>>>>> + struct ns_common *ns;
>>>>>> +
>>>>>> + if (*id < PROC_NS_MIN_INO - 1)
>>>>>> + *id = PROC_NS_MIN_INO - 1;
>>>>>> +
>>>>>> + *id += 1;
>>>>>> + *id -= PROC_NS_MIN_INO;
>>>>>> +
>>>>>> + rcu_read_lock();
>>>>>> + do {
>>>>>> + ns = idr_get_next(&ns_idr, id);
>>>>>> + if (!ns)
>>>>>> + break;
>>>>>
>>>>> xa_find_after();
>>>>>
>>>>> You'll want a temporary unsigned long to work with ...
>>>>>
>>>>>> + if (!refcount_inc_not_zero(&ns->count)) {
>>>>>> + ns = NULL;
>>>>>> + *id += 1;
>>>>>
>>>>> you won't need this increment.
>>>>
>>>> Why? I don't see a way xarray allows to avoid this.
>>>
>>> It's embedded in xa_find_after().
>>
>> How is it embedded to check ns->count that it knows nothing?
>
> I meant you won't need to increment '*id'. The refcount is, of course,
> your business.

OK, that is a relief: at first I thought XArray was a Big Brother
that knows everything about my counters :)

2020-07-30 14:31:32

by Christian Brauner

Subject: Re: [PATCH 02/23] uts: Use generic ns_common::count

On Thu, Jul 30, 2020 at 02:59:31PM +0300, Kirill Tkhai wrote:
> Convert uts namespace to use generic counter instead of kref.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---

(sidenote: given that kref is implemented on top of refcount_t I wonder
whether we shouldn't just slowly convert all places where kref is used
to refcount_t and remove the kref api.)
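
For readers unfamiliar with the relationship mentioned in the sidenote, here is a userspace sketch of the two styles. Everything in it is a hypothetical model built on C11 atomics (`refcount_model_t`, `kref_model`, `uts_old`, `uts_new`); it omits the saturation and underflow checks that the real refcount_t performs and is not the kernel implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for refcount_t (no saturation or underflow checks). */
typedef struct { atomic_int refs; } refcount_model_t;

static void refcount_model_set(refcount_model_t *r, int n)
{
	atomic_store(&r->refs, n);
}

static void refcount_model_inc(refcount_model_t *r)
{
	atomic_fetch_add(&r->refs, 1);
}

/* Returns true when the caller dropped the last reference. */
static bool refcount_model_dec_and_test(refcount_model_t *r)
{
	return atomic_fetch_sub(&r->refs, 1) == 1;
}

/* kref-style wrapper: a refcount plus a release callback passed at put time. */
struct kref_model { refcount_model_t refcount; };

static int kref_model_put(struct kref_model *k,
			  void (*release)(struct kref_model *))
{
	if (refcount_model_dec_and_test(&k->refcount)) {
		release(k);
		return 1;
	}
	return 0;
}

/* Old style: the release callback must recover the enclosing object. */
struct uts_old { struct kref_model kref; int freed; };

static void uts_old_release(struct kref_model *k)
{
	/* Equivalent of container_of(k, struct uts_old, kref). */
	struct uts_old *ns =
		(struct uts_old *)((char *)k - offsetof(struct uts_old, kref));
	ns->freed = 1;
}

static void put_uts_old(struct uts_old *ns)
{
	kref_model_put(&ns->kref, uts_old_release);
}

/* New style: the free function takes the object directly. */
struct uts_new { refcount_model_t count; int freed; };

static void free_uts_new(struct uts_new *ns)
{
	ns->freed = 1;
}

static void put_uts_new(struct uts_new *ns)
{
	if (refcount_model_dec_and_test(&ns->count))
		free_uts_new(ns);
}
```

Both put paths free on the last reference; the direct form simply drops the callback and the container_of() step, which is the shape the patch below gives put_uts_ns() and free_uts_ns().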

Looks good!
Acked-by: Christian Brauner <[email protected]>

> include/linux/utsname.h | 9 ++++-----
> init/version.c | 2 +-
> kernel/utsname.c | 7 ++-----
> 3 files changed, 7 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/utsname.h b/include/linux/utsname.h
> index 44429d9142ca..2b1737c9b244 100644
> --- a/include/linux/utsname.h
> +++ b/include/linux/utsname.h
> @@ -4,7 +4,6 @@
>
>
> #include <linux/sched.h>
> -#include <linux/kref.h>
> #include <linux/nsproxy.h>
> #include <linux/ns_common.h>
> #include <linux/err.h>
> @@ -22,7 +21,6 @@ struct user_namespace;
> extern struct user_namespace init_user_ns;
>
> struct uts_namespace {
> - struct kref kref;
> struct new_utsname name;
> struct user_namespace *user_ns;
> struct ucounts *ucounts;
> @@ -33,16 +31,17 @@ extern struct uts_namespace init_uts_ns;
> #ifdef CONFIG_UTS_NS
> static inline void get_uts_ns(struct uts_namespace *ns)
> {
> - kref_get(&ns->kref);
> + refcount_inc(&ns->ns.count);
> }
>
> extern struct uts_namespace *copy_utsname(unsigned long flags,
> struct user_namespace *user_ns, struct uts_namespace *old_ns);
> -extern void free_uts_ns(struct kref *kref);
> +extern void free_uts_ns(struct uts_namespace *ns);
>
> static inline void put_uts_ns(struct uts_namespace *ns)
> {
> - kref_put(&ns->kref, free_uts_ns);
> + if (refcount_dec_and_test(&ns->ns.count))
> + free_uts_ns(ns);
> }
>
> void uts_ns_init(void);
> diff --git a/init/version.c b/init/version.c
> index cba341161b58..80d2b7566b39 100644
> --- a/init/version.c
> +++ b/init/version.c
> @@ -25,7 +25,7 @@ int version_string(LINUX_VERSION_CODE);
> #endif
>
> struct uts_namespace init_uts_ns = {
> - .kref = KREF_INIT(2),
> + .ns.count = REFCOUNT_INIT(2),
> .name = {
> .sysname = UTS_SYSNAME,
> .nodename = UTS_NODENAME,
> diff --git a/kernel/utsname.c b/kernel/utsname.c
> index e488d0e2ab45..b1ac3ca870f2 100644
> --- a/kernel/utsname.c
> +++ b/kernel/utsname.c
> @@ -33,7 +33,7 @@ static struct uts_namespace *create_uts_ns(void)
>
> uts_ns = kmem_cache_alloc(uts_ns_cache, GFP_KERNEL);
> if (uts_ns)
> - kref_init(&uts_ns->kref);
> + refcount_set(&uts_ns->ns.count, 1);
> return uts_ns;
> }
>
> @@ -103,11 +103,8 @@ struct uts_namespace *copy_utsname(unsigned long flags,
> return new_ns;
> }
>
> -void free_uts_ns(struct kref *kref)
> +void free_uts_ns(struct uts_namespace *ns)
> {
> - struct uts_namespace *ns;
> -
> - ns = container_of(kref, struct uts_namespace, kref);
> dec_uts_namespaces(ns->ucounts);
> put_user_ns(ns->user_ns);
> ns_free_inum(&ns->ns);
>
>

2020-07-30 14:33:33

by Kirill Tkhai

Subject: Re: [PATCH 11/23] fs: Add /proc/namespaces/ directory

On 30.07.2020 16:26, Christian Brauner wrote:
> On Thu, Jul 30, 2020 at 03:00:19PM +0300, Kirill Tkhai wrote:
>> This is a new directory to show all namespaces, which can be
>> accessed from this /proc tasks credentials.
>>
>> Every /proc is related to a pid_namespace, and the pid_namespace
>> is related to a user_namespace. The items, we show in this
>> /proc/namespaces/ directory, are the namespaces,
>> whose user_namespaces are the same as /proc's user_namespace,
>> or their descendants.
>>
>> Say, /proc has pid_ns->user_ns, so in /proc/namespace we show
>> only a ns, which is in_userns(pid_ns->user_ns, ns->user_ns).
>>
>> The final result is like below:
>>
>> # ls /proc/namespaces/ -l
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'cgroup:[4026531835]' -> 'cgroup:[4026531835]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'ipc:[4026531839]' -> 'ipc:[4026531839]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531840]' -> 'mnt:[4026531840]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531861]' -> 'mnt:[4026531861]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532133]' -> 'mnt:[4026532133]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532134]' -> 'mnt:[4026532134]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532135]' -> 'mnt:[4026532135]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532136]' -> 'mnt:[4026532136]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'net:[4026531993]' -> 'net:[4026531993]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'pid:[4026531836]' -> 'pid:[4026531836]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'time:[4026531834]' -> 'time:[4026531834]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'user:[4026531837]' -> 'user:[4026531837]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'uts:[4026531838]' -> 'uts:[4026531838]'
>
> So usually, the /proc/<pid>/ns entries are guarded by
> ptrace_may_access() but from skimming the patch it seems that
> /proc/namespaces/ would be accessible by any user.
>
> I think we should guard /proc/namespaces/. Either by restricting it to
> userns CAP_SYS_ADMIN or - to make it work with unprivileged CRIU - by
> ns_capable(proc's_pid_ns->user_ns, CAP_SYS_PTRACE).

I agree with you: the restrictions have to be strict.

When you suggest this, do you mean restricting only open() on /proc/namespaces/* files?
I'm not sure we should prohibit a plain readdir of this directory. What do you think?

> This should probably also be a mount option on procfs given that we now
> allow a restricted view of procfs.
>
> Christian
>
>>
>> Every namespace may be open like ordinary file in /proc/[pid]/ns.
>>
>> Signed-off-by: Kirill Tkhai <[email protected]>
>> ---
>> fs/nsfs.c | 2
>> fs/proc/Makefile | 1
>> fs/proc/internal.h | 16 ++
>> fs/proc/namespaces.c | 314 +++++++++++++++++++++++++++++++++++++++++++++++
>> fs/proc/root.c | 17 ++-
>> include/linux/proc_fs.h | 1
>> 6 files changed, 345 insertions(+), 6 deletions(-)
>> create mode 100644 fs/proc/namespaces.c
>>
>> diff --git a/fs/nsfs.c b/fs/nsfs.c
>> index ee4be67d3a0b..61b789d2089c 100644
>> --- a/fs/nsfs.c
>> +++ b/fs/nsfs.c
>> @@ -58,7 +58,7 @@ static void nsfs_evict(struct inode *inode)
>> ns->ops->put(ns);
>> }
>>
>> -static int __ns_get_path(struct path *path, struct ns_common *ns)
>> +int __ns_get_path(struct path *path, struct ns_common *ns)
>> {
>> struct vfsmount *mnt = nsfs_mnt;
>> struct dentry *dentry;
>> diff --git a/fs/proc/Makefile b/fs/proc/Makefile
>> index dc2d51f42905..34ff671c6d59 100644
>> --- a/fs/proc/Makefile
>> +++ b/fs/proc/Makefile
>> @@ -25,6 +25,7 @@ proc-y += util.o
>> proc-y += version.o
>> proc-y += softirqs.o
>> proc-y += task_namespaces.o
>> +proc-y += namespaces.o
>> proc-y += self.o
>> proc-y += thread_self.o
>> proc-$(CONFIG_PROC_SYSCTL) += proc_sysctl.o
>> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
>> index 572757ff97be..d19fe5574799 100644
>> --- a/fs/proc/internal.h
>> +++ b/fs/proc/internal.h
>> @@ -134,10 +134,11 @@ void task_dump_owner(struct task_struct *task, umode_t mode,
>> kuid_t *ruid, kgid_t *rgid);
>>
>> unsigned name_to_int(const struct qstr *qstr);
>> -/*
>> - * Offset of the first process in the /proc root directory..
>> - */
>> -#define FIRST_PROCESS_ENTRY 256
>> +
>> +/* Offset of "namespaces" entry in /proc root directory */
>> +#define NAMESPACES_ENTRY 256
>> +/* Offset of the first process in the /proc root directory */
>> +#define FIRST_PROCESS_ENTRY (NAMESPACES_ENTRY + 1)
>>
>> /* Worst case buffer size needed for holding an integer. */
>> #define PROC_NUMBUF 13
>> @@ -168,6 +169,7 @@ extern void proc_pid_evict_inode(struct proc_inode *);
>> extern struct inode *proc_pid_make_inode(struct super_block *, struct task_struct *, umode_t);
>> extern void pid_update_inode(struct task_struct *, struct inode *);
>> extern int pid_delete_dentry(const struct dentry *);
>> +extern int proc_emit_namespaces(struct file *, struct dir_context *);
>> extern int proc_pid_readdir(struct file *, struct dir_context *);
>> struct dentry *proc_pid_lookup(struct dentry *, unsigned int);
>> extern loff_t mem_lseek(struct file *, loff_t, int);
>> @@ -222,6 +224,12 @@ void set_proc_pid_nlink(void);
>> extern struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry *);
>> extern void proc_entry_rundown(struct proc_dir_entry *);
>>
>> +/*
>> + * namespaces.c
>> + */
>> +extern int proc_setup_namespaces(struct super_block *);
>> +extern void proc_namespaces_init(void);
>> +
>> /*
>> * task_namespaces.c
>> */
>> diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
>> new file mode 100644
>> index 000000000000..ab47e1555619
>> --- /dev/null
>> +++ b/fs/proc/namespaces.c
>> @@ -0,0 +1,314 @@
>> +#include <linux/pid_namespace.h>
>> +#include <linux/user_namespace.h>
>> +#include <linux/namei.h>
>> +#include "internal.h"
>> +
>> +static unsigned namespaces_inum __ro_after_init;
>> +
>> +int proc_emit_namespaces(struct file *file, struct dir_context *ctx)
>> +{
>> + struct proc_fs_info *fs_info = proc_sb_info(file_inode(file)->i_sb);
>> + struct inode *inode = d_inode(fs_info->proc_namespaces);
>> +
>> + return dir_emit(ctx, "namespaces", 10, inode->i_ino, DT_DIR);
>> +}
>> +
>> +static int parse_namespace_dentry_name(const struct dentry *dentry,
>> + const char **type, unsigned int *type_len, unsigned int *inum)
>> +{
>> + const char *p, *name;
>> + int count;
>> +
>> + *type = name = dentry->d_name.name;
>> + p = strchr(name, ':');
>> + *type_len = p - name;
>> + if (!p || p == name)
>> + return -ENOENT;
>
> Hm, rather:
>
> p = strchr(name, ':');
> if (!p || p == name)
> return -ENOENT;
> *type_len = p - name;
>
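
The reordering suggested above can be tried out in userspace. This is a sketch only: `parse_ns_name()` and its `min_ino` parameter are hypothetical, and libc sscanf() is more permissive than the kernel's (it accepts leading whitespace and a sign for %u), so it is not a drop-in equivalent. It computes the type length only after the ':' check, and it initializes `count`, because "%n" is never stored when the closing ']' fails to match.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse a name of the form "type:[inum]".
 * On success, store the length of "type" and the inode number and return 0.
 * Return -ENOENT for malformed names or for inum below @min_ino.
 */
static int parse_ns_name(const char *name, unsigned int min_ino,
			 size_t *type_len, unsigned int *inum)
{
	const char *p = strchr(name, ':');
	int count = 0;	/* stays 0 if sscanf() stops before "%n" */

	/* Reject a missing ':' or an empty type before doing pointer math. */
	if (!p || p == name)
		return -ENOENT;
	*type_len = (size_t)(p - name);

	p += 1;
	/* "%n" records the characters consumed, so trailing junk is caught. */
	if (sscanf(p, "[%u]%n", inum, &count) != 1 || p[count] != '\0' ||
	    *inum < min_ino)
		return -ENOENT;

	return 0;
}
```

For example, "net:[4026531993]" parses to a type length of 3 and the given inode number, while "net", ":[4026531993]", "net:[4026531993" and "net:[4026531993]x" are all rejected.
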
>> +
>> + p += 1;
>> + if (sscanf(p, "[%u]%n", inum, &count) != 1 || *(p + count) != '\0' ||
>> + *inum < PROC_NS_MIN_INO)
>> + return -ENOENT;
>> +
>> + return 0;
>> +}
>> +
>> +static struct ns_common *get_namespace_by_dentry(struct pid_namespace *pid_ns,
>> + const struct dentry *dentry)
>> +{
>> + unsigned int type_len, inum, p_inum;
>> + struct user_namespace *user_ns;
>> + struct ns_common *ns;
>> + const char *type;
>> +
>> + if (parse_namespace_dentry_name(dentry, &type, &type_len, &inum) < 0)
>> + return NULL;
>> +
>> + p_inum = inum - 1;
>> + ns = ns_get_next(&p_inum);
>> + if (!ns)
>> + return NULL;
>> +
>> + if (ns->inum != inum || strncmp(type, ns->ops->name, type_len) != 0 ||
>> + ns->ops->name[type_len] != '\0') {
>> + ns->ops->put(ns);
>> + return NULL;
>> + }
>> +
>> + if (ns->ops != &userns_operations)
>> + user_ns = ns->ops->owner(ns);
>> + else
>> + user_ns = container_of(ns, struct user_namespace, ns);
>> +
>> + if (!in_userns(pid_ns->user_ns, user_ns)) {
>> + ns->ops->put(ns);
>> + return NULL;
>> + }
>> +
>> + return ns;
>> +}
>> +
>> +static struct dentry *proc_namespace_instantiate(struct dentry *dentry,
>> + struct task_struct *task, const void *ptr);
>> +
>> +static struct dentry *proc_namespaces_lookup(struct inode *dir, struct dentry *dentry,
>> + unsigned int flags)
>> +{
>> + struct pid_namespace *pid_ns = proc_pid_ns(dir->i_sb);
>> + struct task_struct *task;
>> + struct ns_common *ns;
>> +
>> + ns = get_namespace_by_dentry(pid_ns, dentry);
>> + if (!ns)
>> + return ERR_PTR(-ENOENT);
>> +
>> + read_lock(&tasklist_lock);
>> + task = get_task_struct(pid_ns->child_reaper);
>> + read_unlock(&tasklist_lock);
>> +
>> + dentry = proc_namespace_instantiate(dentry, task, ns);
>> + put_task_struct(task);
>> + ns->ops->put(ns);
>> +
>> + return dentry;
>> +}
>> +
>> +static int proc_namespaces_permission(struct inode *inode, int mask)
>> +{
>> + if ((mask & MAY_EXEC) && S_ISLNK(inode->i_mode))
>> + return -EACCES;
>> +
>> + return 0;
>> +}
>> +
>> +static int proc_namespaces_getattr(const struct path *path, struct kstat *stat,
>> + u32 request_mask, unsigned int query_flags)
>> +{
>> + struct inode *inode = d_inode(path->dentry);
>> +
>> + generic_fillattr(inode, stat);
>> + return 0;
>> +}
>> +
>> +static const struct inode_operations proc_namespaces_inode_operations = {
>> + .lookup = proc_namespaces_lookup,
>> + .permission = proc_namespaces_permission,
>> + .getattr = proc_namespaces_getattr,
>> +};
>> +
>> +static int proc_namespaces_readlink(struct dentry *dentry, char __user *buffer, int buflen)
>> +{
>> + struct inode *dir = dentry->d_parent->d_inode;
>> + struct pid_namespace *pid_ns = proc_pid_ns(dir->i_sb);
>> + struct ns_common *ns;
>> +
>> + ns = get_namespace_by_dentry(pid_ns, dentry);
>> + if (!ns)
>> + return -ENOENT;
>> + ns->ops->put(ns);
>> +
>> + /* proc_namespaces_readdir() creates dentry names in namespace format */
>> + return readlink_copy(buffer, buflen, dentry->d_iname);
>> +}
>> +
>> +int __ns_get_path(struct path *path, struct ns_common *ns);
>> +
>> +static const char *proc_namespaces_getlink(struct dentry *dentry,
>> + struct inode *inode, struct delayed_call *done)
>> +{
>> + struct pid_namespace *pid_ns = proc_pid_ns(inode->i_sb);
>> + struct ns_common *ns;
>> + struct path path;
>> + int ret;
>> +
>> + if (!dentry)
>> + return ERR_PTR(-ECHILD);
>> +
>> + while (1) {
>> + ret = -ENOENT;
>> + ns = get_namespace_by_dentry(pid_ns, dentry);
>> + if (!ns)
>> + goto out;
>> +
>> + ret = __ns_get_path(&path, ns);
>> + if (ret == -EAGAIN)
>> + continue;
>> + if (ret)
>> + goto out;
>> + break;
>> + }
>> +
>> + ret = nd_jump_link(&path);
>> +out:
>> + return ERR_PTR(ret);
>> +}
>> +
>> +static const struct inode_operations proc_namespaces_link_inode_operations = {
>> + .readlink = proc_namespaces_readlink,
>> + .get_link = proc_namespaces_getlink,
>> +};
>> +
>> +static int namespace_delete_dentry(const struct dentry *dentry)
>> +{
>> + struct inode *dir = dentry->d_parent->d_inode;
>> + struct pid_namespace *pid_ns = proc_pid_ns(dir->i_sb);
>> + struct ns_common *ns;
>> +
>> + ns = get_namespace_by_dentry(pid_ns, dentry);
>> + if (!ns)
>> + return 1;
>> +
>> + ns->ops->put(ns);
>> + return 0;
>> +}
>> +
>> +const struct dentry_operations namespaces_dentry_operations = {
>> + .d_delete = namespace_delete_dentry,
>> +};
>> +
>> +static void namespace_update_inode(struct inode *inode)
>> +{
>> + struct user_namespace *user_ns = proc_pid_ns(inode->i_sb)->user_ns;
>> +
>> + inode->i_uid = make_kuid(user_ns, 0);
>> + if (!uid_valid(inode->i_uid))
>> + inode->i_uid = GLOBAL_ROOT_UID;
>> +
>> + inode->i_gid = make_kgid(user_ns, 0);
>> + if (!gid_valid(inode->i_gid))
>> + inode->i_gid = GLOBAL_ROOT_GID;
>> +}
>> +
>> +static struct dentry *proc_namespace_instantiate(struct dentry *dentry,
>> + struct task_struct *task, const void *ptr)
>> +{
>> + const struct ns_common *ns = ptr;
>> + struct inode *inode;
>> + struct proc_inode *ei;
>> +
>> + /*
>> + * Create inode with credentials of @task, and add it to @task's
>> + * quick removal list.
>> + */
>> + inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK | S_IRWXUGO);
>> + if (!inode)
>> + return ERR_PTR(-ENOENT);
>> +
>> + ei = PROC_I(inode);
>> + inode->i_op = &proc_namespaces_link_inode_operations;
>> + ei->ns_ops = ns->ops;
>> + namespace_update_inode(inode);
>> +
>> + d_set_d_op(dentry, &namespaces_dentry_operations);
>> + return d_splice_alias(inode, dentry);
>> +}
>> +
>> +static int proc_namespaces_readdir(struct file *file, struct dir_context *ctx)
>> +{
>> + struct pid_namespace *pid_ns = proc_pid_ns(file_inode(file)->i_sb);
>> + struct user_namespace *user_ns;
>> + struct task_struct *task;
>> + struct ns_common *ns;
>> + unsigned int inum;
>> +
>> + read_lock(&tasklist_lock);
>> + task = get_task_struct(pid_ns->child_reaper);
>> + read_unlock(&tasklist_lock);
>> +
>> + if (!dir_emit_dots(file, ctx))
>> + goto out;
>> +
>> + inum = ctx->pos - 2;
>> + while ((ns = ns_get_next(&inum)) != NULL) {
>> + unsigned int len;
>> + char name[32];
>> +
>> + if (ns->ops != &userns_operations)
>> + user_ns = ns->ops->owner(ns);
>> + else
>> + user_ns = container_of(ns, struct user_namespace, ns);
>> +
>> + if (!in_userns(pid_ns->user_ns, user_ns))
>> + goto next;
>> +
>> + len = snprintf(name, sizeof(name), "%s:[%u]", ns->ops->name, inum);
>> +
>> + if (!proc_fill_cache(file, ctx, name, len,
>> + proc_namespace_instantiate, task, ns)) {
>> + ns->ops->put(ns);
>> + break;
>> + }
>> +next:
>> + ns->ops->put(ns);
>> + ctx->pos = inum + 2;
>> + }
>> +out:
>> + put_task_struct(task);
>> + return 0;
>> +}
>> +
>> +static const struct file_operations proc_namespaces_file_operations = {
>> + .read = generic_read_dir,
>> + .iterate_shared = proc_namespaces_readdir,
>> + .llseek = generic_file_llseek,
>> +};
>> +
>> +int proc_setup_namespaces(struct super_block *s)
>> +{
>> + struct proc_fs_info *fs_info = proc_sb_info(s);
>> + struct inode *root_inode = d_inode(s->s_root);
>> + struct dentry *namespaces;
>> + int ret = -ENOMEM;
>> +
>> + inode_lock(root_inode);
>> + namespaces = d_alloc_name(s->s_root, "namespaces");
>> + if (namespaces) {
>> + struct inode *inode = new_inode_pseudo(s);
>> + if (inode) {
>> + inode->i_ino = namespaces_inum;
>> + inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
>> + inode->i_mode = S_IFDIR | S_IRUGO | S_IXUGO;
>> + inode->i_uid = GLOBAL_ROOT_UID;
>> + inode->i_gid = GLOBAL_ROOT_GID;
>> + inode->i_op = &proc_namespaces_inode_operations;
>> + inode->i_fop = &proc_namespaces_file_operations;
>> + d_add(namespaces, inode);
>> + ret = 0;
>> + } else {
>> + dput(namespaces);
>> + }
>> + }
>> + inode_unlock(root_inode);
>> +
>> + if (ret)
>> + pr_err("proc_setup_namespaces: can't allocate /proc/namespaces\n");
>> + else
>> + fs_info->proc_namespaces = namespaces;
>> +
>> + return ret;
>> +}
>> +
>> +void __init proc_namespaces_init(void)
>> +{
>> + proc_alloc_inum(&namespaces_inum);
>> +}
>> diff --git a/fs/proc/root.c b/fs/proc/root.c
>> index 5e444d4f9717..e4e4f90fca3d 100644
>> --- a/fs/proc/root.c
>> +++ b/fs/proc/root.c
>> @@ -206,6 +206,10 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
>> return -ENOMEM;
>> }
>>
>> + ret = proc_setup_namespaces(s);
>> + if (ret)
>> + return ret;
>> +
>> ret = proc_setup_self(s);
>> if (ret) {
>> return ret;
>> @@ -272,6 +276,9 @@ static void proc_kill_sb(struct super_block *sb)
>> dput(fs_info->proc_self);
>> dput(fs_info->proc_thread_self);
>>
>> + if (fs_info->proc_namespaces)
>> + dput(fs_info->proc_namespaces);
>> +
>> kill_anon_super(sb);
>> put_pid_ns(fs_info->pid_ns);
>> kfree(fs_info);
>> @@ -289,6 +296,7 @@ void __init proc_root_init(void)
>> {
>> proc_init_kmemcache();
>> set_proc_pid_nlink();
>> + proc_namespaces_init();
>> proc_self_init();
>> proc_thread_self_init();
>> proc_symlink("mounts", NULL, "self/mounts");
>> @@ -326,8 +334,15 @@ static struct dentry *proc_root_lookup(struct inode * dir, struct dentry * dentr
>>
>> static int proc_root_readdir(struct file *file, struct dir_context *ctx)
>> {
>> - if (ctx->pos < FIRST_PROCESS_ENTRY) {
>> + if (ctx->pos < NAMESPACES_ENTRY) {
>> int error = proc_readdir(file, ctx);
>> + if (unlikely(error <= 0))
>> + return error;
>> + ctx->pos = NAMESPACES_ENTRY;
>> + }
>> +
>> + if (ctx->pos == NAMESPACES_ENTRY) {
>> + int error = proc_emit_namespaces(file, ctx);
>> if (unlikely(error <= 0))
>> return error;
>> ctx->pos = FIRST_PROCESS_ENTRY;
>> diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
>> index 97b3f5f06db9..8b0002a6cacf 100644
>> --- a/include/linux/proc_fs.h
>> +++ b/include/linux/proc_fs.h
>> @@ -61,6 +61,7 @@ struct proc_fs_info {
>> struct pid_namespace *pid_ns;
>> struct dentry *proc_self; /* For /proc/self */
>> struct dentry *proc_thread_self; /* For /proc/thread-self */
>> + struct dentry *proc_namespaces; /* For /proc/namespaces */
>> kgid_t pid_gid;
>> enum proc_hidepid hide_pid;
>> enum proc_pidonly pidonly;
>>
>>

2020-07-30 14:35:53

by Christian Brauner

Subject: Re: [PATCH 03/23] ipc: Use generic ns_common::count

On Thu, Jul 30, 2020 at 02:59:36PM +0300, Kirill Tkhai wrote:
> Convert ipc namespace to use generic counter.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---

Acked-by: Christian Brauner <[email protected]>

> include/linux/ipc_namespace.h | 3 +--
> ipc/msgutil.c | 2 +-
> ipc/namespace.c | 4 ++--
> 3 files changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
> index a06a78c67f19..05e22770af51 100644
> --- a/include/linux/ipc_namespace.h
> +++ b/include/linux/ipc_namespace.h
> @@ -27,7 +27,6 @@ struct ipc_ids {
> };
>
> struct ipc_namespace {
> - refcount_t count;
> struct ipc_ids ids[3];
>
> int sem_ctls[4];
> @@ -128,7 +127,7 @@ extern struct ipc_namespace *copy_ipcs(unsigned long flags,
> static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
> {
> if (ns)
> - refcount_inc(&ns->count);
> + refcount_inc(&ns->ns.count);
> return ns;
> }
>
> diff --git a/ipc/msgutil.c b/ipc/msgutil.c
> index 3149b4a379de..d0a0e877cadd 100644
> --- a/ipc/msgutil.c
> +++ b/ipc/msgutil.c
> @@ -26,7 +26,7 @@ DEFINE_SPINLOCK(mq_lock);
> * and not CONFIG_IPC_NS.
> */
> struct ipc_namespace init_ipc_ns = {
> - .count = REFCOUNT_INIT(1),
> + .ns.count = REFCOUNT_INIT(1),
> .user_ns = &init_user_ns,
> .ns.inum = PROC_IPC_INIT_INO,
> #ifdef CONFIG_IPC_NS
> diff --git a/ipc/namespace.c b/ipc/namespace.c
> index 24e7b45320f7..7bd0766ddc3b 100644
> --- a/ipc/namespace.c
> +++ b/ipc/namespace.c
> @@ -51,7 +51,7 @@ static struct ipc_namespace *create_ipc_ns(struct user_namespace *user_ns,
> goto fail_free;
> ns->ns.ops = &ipcns_operations;
>
> - refcount_set(&ns->count, 1);
> + refcount_set(&ns->ns.count, 1);
> ns->user_ns = get_user_ns(user_ns);
> ns->ucounts = ucounts;
>
> @@ -164,7 +164,7 @@ static DECLARE_WORK(free_ipc_work, free_ipc);
> */
> void put_ipc_ns(struct ipc_namespace *ns)
> {
> - if (refcount_dec_and_lock(&ns->count, &mq_lock)) {
> + if (refcount_dec_and_lock(&ns->ns.count, &mq_lock)) {
> mq_clear_sbinfo(ns);
> spin_unlock(&mq_lock);
>
>
>

2020-07-30 14:37:50

by Eric W. Biederman

Subject: Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

Kirill Tkhai <[email protected]> writes:

> Currently, there is no a way to list or iterate all or subset of namespaces
> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
> but some also may be as open files, which are not attached to a process.
> When a namespace open fd is sent over unix socket and then closed, it is
> impossible to know whether the namespace exists or not.
>
> Also, even if namespace is exposed as attached to a process or as open file,
> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
> this multiplies at tasks and fds number.

I am very dubious about this.

I have been avoiding exactly this kind of interface because it can
create rather fundamental problems with checkpoint restart.

You do have some filtering and the filtering is not based on current.
Which is good.

A view that is relative to a user namespace might be ok. It almost
certainly does better as its own little filesystem than as an extension
to proc though.

The big thing we want to ensure is that if you migrate you can restore
everything. I don't see how you will be able to restore these files
after migration. Anything like this without having a complete
checkpoint/restore story is a non-starter.

Further by not going through the processes it looks like you are
bypassing the existing permission checks. Which has the potential
to allow someone to use a namespace who would not be able to otherwise.

So I think this goes one step too far but I am willing to be persuaded
otherwise.

Eric




> This patchset introduces a new /proc/namespaces/ directory, which exposes
> subset of permitted namespaces in linear view:
>
> # ls /proc/namespaces/ -l
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'cgroup:[4026531835]' -> 'cgroup:[4026531835]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'ipc:[4026531839]' -> 'ipc:[4026531839]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531840]' -> 'mnt:[4026531840]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531861]' -> 'mnt:[4026531861]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532133]' -> 'mnt:[4026532133]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532134]' -> 'mnt:[4026532134]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532135]' -> 'mnt:[4026532135]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532136]' -> 'mnt:[4026532136]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'net:[4026531993]' -> 'net:[4026531993]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'pid:[4026531836]' -> 'pid:[4026531836]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'time:[4026531834]' -> 'time:[4026531834]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'user:[4026531837]' -> 'user:[4026531837]'
> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'uts:[4026531838]' -> 'uts:[4026531838]'
>
> Namespace ns is exposed, in case of its user_ns is permitted from /proc's pid_ns.
> I.e., /proc is related to pid_ns, so in /proc/namespace we show only a ns, which is
>
> in_userns(pid_ns->user_ns, ns->user_ns).
>
> In case of ns is a user_ns:
>
> in_userns(pid_ns->user_ns, ns).
>
> The patchset follows this steps:
>
> 1)A generic counter in ns_common is introduced instead of separate
> counters for every ns type (net::count, uts_namespace::kref,
> user_namespace::count, etc). Patches [1-8];
> 2)Patch [9] introduces IDR to link and iterate alive namespaces;
> 3)Patch [10] is refactoring;
> 4)Patch [11] actually adds /proc/namespace directory and fs methods;
> 5)Patches [12-23] make every namespace to use the added methods
> and to appear in /proc/namespace directory.
>
> This may be usefull to write effective debug utils (say, fast build
> of networks topology) and checkpoint/restore software.
> ---
>
> Kirill Tkhai (23):
> ns: Add common refcount into ns_common add use it as counter for net_ns
> uts: Use generic ns_common::count
> ipc: Use generic ns_common::count
> pid: Use generic ns_common::count
> user: Use generic ns_common::count
> mnt: Use generic ns_common::count
> cgroup: Use generic ns_common::count
> time: Use generic ns_common::count
> ns: Introduce ns_idr to be able to iterate all allocated namespaces in the system
> fs: Rename fs/proc/namespaces.c into fs/proc/task_namespaces.c
> fs: Add /proc/namespaces/ directory
> user: Free user_ns one RCU grace period after final counter put
> user: Add user namespaces into ns_idr
> net: Add net namespaces into ns_idr
> pid: Eextract child_reaper check from pidns_for_children_get()
> proc_ns_operations: Add can_get method
> pid: Add pid namespaces into ns_idr
> uts: Free uts namespace one RCU grace period after final counter put
> uts: Add uts namespaces into ns_idr
> ipc: Add ipc namespaces into ns_idr
> mnt: Add mount namespaces into ns_idr
> cgroup: Add cgroup namespaces into ns_idr
> time: Add time namespaces into ns_idr
>
>
> fs/mount.h | 4
> fs/namespace.c | 14 +
> fs/nsfs.c | 78 ++++++++
> fs/proc/Makefile | 1
> fs/proc/internal.h | 18 +-
> fs/proc/namespaces.c | 382 +++++++++++++++++++++++++++-------------
> fs/proc/root.c | 17 ++
> fs/proc/task_namespaces.c | 183 +++++++++++++++++++
> include/linux/cgroup.h | 6 -
> include/linux/ipc_namespace.h | 3
> include/linux/ns_common.h | 11 +
> include/linux/pid_namespace.h | 4
> include/linux/proc_fs.h | 1
> include/linux/proc_ns.h | 12 +
> include/linux/time_namespace.h | 10 +
> include/linux/user_namespace.h | 10 +
> include/linux/utsname.h | 10 +
> include/net/net_namespace.h | 11 -
> init/version.c | 2
> ipc/msgutil.c | 2
> ipc/namespace.c | 17 +-
> ipc/shm.c | 1
> kernel/cgroup/cgroup.c | 2
> kernel/cgroup/namespace.c | 25 ++-
> kernel/pid.c | 2
> kernel/pid_namespace.c | 46 +++--
> kernel/time/namespace.c | 20 +-
> kernel/user.c | 2
> kernel/user_namespace.c | 23 ++
> kernel/utsname.c | 23 ++
> net/core/net-sysfs.c | 6 -
> net/core/net_namespace.c | 18 +-
> net/ipv4/inet_timewait_sock.c | 4
> net/ipv4/tcp_metrics.c | 2
> 34 files changed, 746 insertions(+), 224 deletions(-)
> create mode 100644 fs/proc/task_namespaces.c
>
> --
> Signed-off-by: Kirill Tkhai <[email protected]>
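
The visibility rule quoted above (a namespace is listed only when in_userns(pid_ns->user_ns, ns->user_ns) holds) can be illustrated outside the kernel. What follows is a minimal user-space sketch, not the patchset's code: struct user_ns_sketch, in_userns_sketch() and ns_visible_sketch() are hypothetical names, and the ancestry walk is assumed to mirror the kernel's in_userns(), which climbs the parent chain from the candidate until it reaches the ancestor's level.

```c
/* User-space sketch only; every name ending in _sketch is hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Just enough of a user namespace to express ancestry. */
struct user_ns_sketch {
	struct user_ns_sketch *parent;  /* NULL for the initial namespace */
	int level;                      /* 0 for the initial namespace */
};

/* True if 'child' is 'ancestor' itself or one of its descendants. */
static bool in_userns_sketch(const struct user_ns_sketch *ancestor,
			     const struct user_ns_sketch *child)
{
	const struct user_ns_sketch *ns;

	assert(ancestor && child);
	/* Climb from the candidate until we are at the ancestor's depth. */
	for (ns = child; ns->level > ancestor->level; ns = ns->parent)
		;
	return ns == ancestor;
}

/*
 * Filter assumed from the cover letter: a namespace is shown in a given
 * /proc instance only if its owning user namespace lies inside the user
 * namespace of that /proc's pid namespace.
 */
static bool ns_visible_sketch(const struct user_ns_sketch *proc_user_ns,
			      const struct user_ns_sketch *owner_user_ns)
{
	return in_userns_sketch(proc_user_ns, owner_user_ns);
}
```

Under this rule the check does not depend on the calling task at all, only on which /proc instance is being read, which matches the remark that the filtering is not based on current.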

2020-07-30 14:40:50

by Christian Brauner

Subject: Re: [PATCH 04/23] pid: Use generic ns_common::count

On Thu, Jul 30, 2020 at 02:59:41PM +0300, Kirill Tkhai wrote:
> Convert pid namespace to use generic counter.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---

Looks good!
Acked-by: Christian Brauner <[email protected]>

> include/linux/pid_namespace.h | 4 +---
> kernel/pid.c | 2 +-
> kernel/pid_namespace.c | 13 +++----------
> 3 files changed, 5 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index 5a5cb45ac57e..7c7e627503d2 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -8,7 +8,6 @@
> #include <linux/workqueue.h>
> #include <linux/threads.h>
> #include <linux/nsproxy.h>
> -#include <linux/kref.h>
> #include <linux/ns_common.h>
> #include <linux/idr.h>
>
> @@ -18,7 +17,6 @@
> struct fs_pin;
>
> struct pid_namespace {
> - struct kref kref;
> struct idr idr;
> struct rcu_head rcu;
> unsigned int pid_allocated;
> @@ -43,7 +41,7 @@ extern struct pid_namespace init_pid_ns;
> static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
> {
> if (ns != &init_pid_ns)
> - kref_get(&ns->kref);
> + refcount_inc(&ns->ns.count);
> return ns;
> }
>
> diff --git a/kernel/pid.c b/kernel/pid.c
> index de9d29c41d77..3b9e67736ef4 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -72,7 +72,7 @@ int pid_max_max = PID_MAX_LIMIT;
> * the scheme scales to up to 4 million PIDs, runtime.
> */
> struct pid_namespace init_pid_ns = {
> - .kref = KREF_INIT(2),
> + .ns.count = REFCOUNT_INIT(2),
> .idr = IDR_INIT(init_pid_ns.idr),
> .pid_allocated = PIDNS_ADDING,
> .level = 0,
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index 0e5ac162c3a8..d02dc1696edf 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -102,7 +102,7 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
> goto out_free_idr;
> ns->ns.ops = &pidns_operations;
>
> - kref_init(&ns->kref);
> + refcount_set(&ns->ns.count, 1);
> ns->level = level;
> ns->parent = get_pid_ns(parent_pid_ns);
> ns->user_ns = get_user_ns(user_ns);
> @@ -148,22 +148,15 @@ struct pid_namespace *copy_pid_ns(unsigned long flags,
> return create_pid_namespace(user_ns, old_ns);
> }
>
> -static void free_pid_ns(struct kref *kref)
> -{
> - struct pid_namespace *ns;
> -
> - ns = container_of(kref, struct pid_namespace, kref);
> - destroy_pid_namespace(ns);
> -}
> -
> void put_pid_ns(struct pid_namespace *ns)
> {
> struct pid_namespace *parent;
>
> while (ns != &init_pid_ns) {
> parent = ns->parent;
> - if (!kref_put(&ns->kref, free_pid_ns))
> + if (!refcount_dec_and_test(&ns->ns.count))
> break;
> + destroy_pid_namespace(ns);
> ns = parent;
> }
> }
>
>
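
The shape of this conversion can be tried outside the kernel. The block below is a minimal user-space sketch under stated assumptions, not the kernel code: C11 atomics stand in for refcount_t, and ns_common_sketch, pid_ns_sketch, pid_ns_get(), pid_ns_put() and the destroyed counter are all hypothetical names. It shows why the kref release callback and its container_of() lookup become unnecessary: once the counter lives in the embedded common structure, the put path can do a plain decrement-and-test and destroy the object inline while walking up the parent chain. Unlike the kernel, where init_pid_ns is never freed, the sketch also frees the root.

```c
/* User-space sketch only: C11 atomics stand in for the kernel's refcount_t. */
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ns_common: the shared counter lives here. */
struct ns_common_sketch {
	atomic_uint count;
};

/* Hypothetical namespace type embedding the common part. */
struct pid_ns_sketch {
	struct ns_common_sketch ns;
	struct pid_ns_sketch *parent;   /* NULL for the root namespace */
};

static int destroyed;   /* counts frees so the behaviour can be observed */

static struct pid_ns_sketch *pid_ns_get(struct pid_ns_sketch *ns)
{
	unsigned int old = atomic_fetch_add(&ns->ns.count, 1);

	/* Taking a reference on an already dead object is a bug. */
	assert(old > 0);
	(void)old;
	return ns;
}

static struct pid_ns_sketch *pid_ns_create(struct pid_ns_sketch *parent)
{
	struct pid_ns_sketch *ns = calloc(1, sizeof(*ns));

	if (!ns)
		return NULL;
	atomic_init(&ns->ns.count, 1);
	/* A child pins its parent for as long as the child exists. */
	ns->parent = parent ? pid_ns_get(parent) : NULL;
	return ns;
}

static void pid_ns_put(struct pid_ns_sketch *ns)
{
	while (ns) {
		struct pid_ns_sketch *parent = ns->parent;

		/* fetch_sub returns the old value: 1 means we dropped the last ref. */
		if (atomic_fetch_sub(&ns->ns.count, 1) != 1)
			break;
		free(ns);
		destroyed++;
		ns = parent;    /* now drop the reference the child held */
	}
}
```

Creating a root and a child, then putting the root first, frees nothing because the child still pins it; putting the child afterwards frees both in one loop.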

2020-07-30 14:43:51

by Christian Brauner

Subject: Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

On Thu, Jul 30, 2020 at 09:34:01AM -0500, Eric W. Biederman wrote:
> Kirill Tkhai <[email protected]> writes:
>
> > Currently, there is no a way to list or iterate all or subset of namespaces
> > in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
> > but some also may be as open files, which are not attached to a process.
> > When a namespace open fd is sent over unix socket and then closed, it is
> > impossible to know whether the namespace exists or not.
> >
> > Also, even if namespace is exposed as attached to a process or as open file,
> > iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
> > this multiplies at tasks and fds number.
>
> I am very dubious about this.
>
> I have been avoiding exactly this kind of interface because it can
> create rather fundamental problems with checkpoint restart.
>
> You do have some filtering and the filtering is not based on current.
> Which is good.
>
> A view that is relative to a user namespace might be ok. It almost
> certainly does better as its own little filesystem than as an extension
> to proc though.
>
> The big thing we want to ensure is that if you migrate you can restore
> everything. I don't see how you will be able to restore these files
> after migration. Anything like this without having a complete
> checkpoint/restore story is a non-starter.
>
> Further by not going through the processes it looks like you are
> bypassing the existing permission checks. Which has the potential
> to allow someone to use a namespace who would not be able to otherwise.
>
> So I think this goes one step too far but I am willing to be persuaded
> otherwise.

I think we discussed this at Plumbers (last year I want to say?) and
you were against making this a part of procfs already back then, I
think. The last known idea we could agree on was debugfs (shudder). But
a tiny separate fs might work as well.

We really would want those introspection abilities this provides though.
For us it was for debugging when namespaces linger and also to crawl
and inspect namespaces from LXD and various other use-cases. So if we
could make this happen in some form that'd be great.

Thanks!
Christian

2020-07-30 14:49:31

by Christian Brauner

Subject: Re: [PATCH 05/23] user: Use generic ns_common::count

On Thu, Jul 30, 2020 at 02:59:47PM +0300, Kirill Tkhai wrote:
> Convert user namespace to use generic counter.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---

Looks good!
Acked-by: Christian Brauner <[email protected]>

> include/linux/user_namespace.h | 5 ++---
> kernel/user.c | 2 +-
> kernel/user_namespace.c | 4 ++--
> 3 files changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index 6ef1c7109fc4..64cf8ebdc4ec 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -57,7 +57,6 @@ struct user_namespace {
> struct uid_gid_map uid_map;
> struct uid_gid_map gid_map;
> struct uid_gid_map projid_map;
> - atomic_t count;
> struct user_namespace *parent;
> int level;
> kuid_t owner;
> @@ -109,7 +108,7 @@ void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
> static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
> {
> if (ns)
> - atomic_inc(&ns->count);
> + refcount_inc(&ns->ns.count);
> return ns;
> }
>
> @@ -119,7 +118,7 @@ extern void __put_user_ns(struct user_namespace *ns);
>
> static inline void put_user_ns(struct user_namespace *ns)
> {
> - if (ns && atomic_dec_and_test(&ns->count))
> + if (ns && refcount_dec_and_test(&ns->ns.count))
> __put_user_ns(ns);
> }
>
> diff --git a/kernel/user.c b/kernel/user.c
> index b1635d94a1f2..a2478cddf536 100644
> --- a/kernel/user.c
> +++ b/kernel/user.c
> @@ -55,7 +55,7 @@ struct user_namespace init_user_ns = {
> },
> },
> },
> - .count = ATOMIC_INIT(3),
> + .ns.count = REFCOUNT_INIT(3),

Note-to-self: I really wish we'd had a comment in cases where the
refcount isn't set to 1 but to something like 3. Otherwise one always
needs to dig up the reasons why. :)

> .owner = GLOBAL_ROOT_UID,
> .group = GLOBAL_ROOT_GID,
> .ns.inum = PROC_USER_INIT_INO,
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 87804e0371fe..7c2bbe8f3e45 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -111,7 +111,7 @@ int create_user_ns(struct cred *new)
> goto fail_free;
> ns->ns.ops = &userns_operations;
>
> - atomic_set(&ns->count, 1);
> + refcount_set(&ns->ns.count, 1);
> /* Leave the new->user_ns reference with the new user namespace. */
> ns->parent = parent_ns;
> ns->level = parent_ns->level + 1;
> @@ -197,7 +197,7 @@ static void free_user_ns(struct work_struct *work)
> kmem_cache_free(user_ns_cachep, ns);
> dec_user_namespaces(ucounts);
> ns = parent;
> - } while (atomic_dec_and_test(&parent->count));
> + } while (refcount_dec_and_test(&parent->ns.count));
> }
>
> void __put_user_ns(struct user_namespace *ns)
>
>

2020-07-30 14:51:00

by Christian Brauner

Subject: Re: [PATCH 06/23] mnt: Use generic ns_common::count

On Thu, Jul 30, 2020 at 02:59:52PM +0300, Kirill Tkhai wrote:
> Convert mount namespace to use generic counter.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---

Looks good!
Acked-by: Christian Brauner <[email protected]>

> fs/mount.h | 3 +--
> fs/namespace.c | 4 ++--
> 2 files changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/fs/mount.h b/fs/mount.h
> index c3e0bb6e5782..f296862032ec 100644
> --- a/fs/mount.h
> +++ b/fs/mount.h
> @@ -7,7 +7,6 @@
> #include <linux/watch_queue.h>
>
> struct mnt_namespace {
> - atomic_t count;
> struct ns_common ns;
> struct mount * root;
> /*
> @@ -130,7 +129,7 @@ static inline void detach_mounts(struct dentry *dentry)
>
> static inline void get_mnt_ns(struct mnt_namespace *ns)
> {
> - atomic_inc(&ns->count);
> + refcount_inc(&ns->ns.count);
> }
>
> extern seqlock_t mount_lock;
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 31c387794fbd..8c39810e6ec3 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -3296,7 +3296,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns, bool a
> new_ns->ns.ops = &mntns_operations;
> if (!anon)
> new_ns->seq = atomic64_add_return(1, &mnt_ns_seq);
> - atomic_set(&new_ns->count, 1);
> + refcount_set(&new_ns->ns.count, 1);
> INIT_LIST_HEAD(&new_ns->list);
> init_waitqueue_head(&new_ns->poll);
> spin_lock_init(&new_ns->ns_lock);
> @@ -3870,7 +3870,7 @@ void __init mnt_init(void)
>
> void put_mnt_ns(struct mnt_namespace *ns)
> {
> - if (!atomic_dec_and_test(&ns->count))
> + if (!refcount_dec_and_test(&ns->ns.count))
> return;
> drop_collected_mounts(&ns->root->mnt);
> free_mnt_ns(ns);
>
>

2020-07-30 14:51:31

by Christian Brauner

Subject: Re: [PATCH 07/23] cgroup: Use generic ns_common::count

On Thu, Jul 30, 2020 at 02:59:57PM +0300, Kirill Tkhai wrote:
> Convert cgroup namespace to use generic counter.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---

Looks good!
Acked-by: Christian Brauner <[email protected]>

> include/linux/cgroup.h | 5 ++---
> kernel/cgroup/cgroup.c | 2 +-
> kernel/cgroup/namespace.c | 2 +-
> 3 files changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 618838c48313..451c2d26a5db 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -854,7 +854,6 @@ static inline void cgroup_sk_free(struct sock_cgroup_data *skcd) {}
> #endif /* CONFIG_CGROUP_DATA */
>
> struct cgroup_namespace {
> - refcount_t count;
> struct ns_common ns;
> struct user_namespace *user_ns;
> struct ucounts *ucounts;
> @@ -889,12 +888,12 @@ copy_cgroup_ns(unsigned long flags, struct user_namespace *user_ns,
> static inline void get_cgroup_ns(struct cgroup_namespace *ns)
> {
> if (ns)
> - refcount_inc(&ns->count);
> + refcount_inc(&ns->ns.count);
> }
>
> static inline void put_cgroup_ns(struct cgroup_namespace *ns)
> {
> - if (ns && refcount_dec_and_test(&ns->count))
> + if (ns && refcount_dec_and_test(&ns->ns.count))
> free_cgroup_ns(ns);
> }
>
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index dd247747ec14..22e466926853 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -199,7 +199,7 @@ static u16 have_canfork_callback __read_mostly;
>
> /* cgroup namespace for init task */
> struct cgroup_namespace init_cgroup_ns = {
> - .count = REFCOUNT_INIT(2),
> + .ns.count = REFCOUNT_INIT(2),
> .user_ns = &init_user_ns,
> .ns.ops = &cgroupns_operations,
> .ns.inum = PROC_CGROUP_INIT_INO,
> diff --git a/kernel/cgroup/namespace.c b/kernel/cgroup/namespace.c
> index 812a61afd538..f5e8828c109c 100644
> --- a/kernel/cgroup/namespace.c
> +++ b/kernel/cgroup/namespace.c
> @@ -32,7 +32,7 @@ static struct cgroup_namespace *alloc_cgroup_ns(void)
> kfree(new_ns);
> return ERR_PTR(ret);
> }
> - refcount_set(&new_ns->count, 1);
> + refcount_set(&new_ns->ns.count, 1);
> new_ns->ns.ops = &cgroupns_operations;
> return new_ns;
> }
>
>

2020-07-30 14:55:36

by Christian Brauner

Subject: Re: [PATCH 08/23] time: Use generic ns_common::count

On Thu, Jul 30, 2020 at 03:00:03PM +0300, Kirill Tkhai wrote:
> Convert time namespace to use generic counter.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---

Looks good!
Acked-by: Christian Brauner <[email protected]>

> include/linux/time_namespace.h | 9 ++++-----
> kernel/time/namespace.c | 9 +++------
> 2 files changed, 7 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
> index 5b6031385db0..a51ffc089219 100644
> --- a/include/linux/time_namespace.h
> +++ b/include/linux/time_namespace.h
> @@ -4,7 +4,6 @@
>
>
> #include <linux/sched.h>
> -#include <linux/kref.h>
> #include <linux/nsproxy.h>
> #include <linux/ns_common.h>
> #include <linux/err.h>
> @@ -18,7 +17,6 @@ struct timens_offsets {
> };
>
> struct time_namespace {
> - struct kref kref;
> struct user_namespace *user_ns;
> struct ucounts *ucounts;
> struct ns_common ns;
> @@ -37,20 +35,21 @@ extern void timens_commit(struct task_struct *tsk, struct time_namespace *ns);
>
> static inline struct time_namespace *get_time_ns(struct time_namespace *ns)
> {
> - kref_get(&ns->kref);
> + refcount_inc(&ns->ns.count);
> return ns;
> }
>
> struct time_namespace *copy_time_ns(unsigned long flags,
> struct user_namespace *user_ns,
> struct time_namespace *old_ns);
> -void free_time_ns(struct kref *kref);
> +void free_time_ns(struct time_namespace *ns);
> int timens_on_fork(struct nsproxy *nsproxy, struct task_struct *tsk);
> struct vdso_data *arch_get_vdso_data(void *vvar_page);
>
> static inline void put_time_ns(struct time_namespace *ns)
> {
> - kref_put(&ns->kref, free_time_ns);
> + if (refcount_dec_and_test(&ns->ns.count))
> + free_time_ns(ns);
> }
>
> void proc_timens_show_offsets(struct task_struct *p, struct seq_file *m);
> diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c
> index afc65e6be33e..c4c829eb3511 100644
> --- a/kernel/time/namespace.c
> +++ b/kernel/time/namespace.c
> @@ -92,7 +92,7 @@ static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
> if (!ns)
> goto fail_dec;
>
> - kref_init(&ns->kref);
> + refcount_set(&ns->ns.count, 1);
>
> ns->vvar_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> if (!ns->vvar_page)
> @@ -226,11 +226,8 @@ static void timens_set_vvar_page(struct task_struct *task,
> mutex_unlock(&offset_lock);
> }
>
> -void free_time_ns(struct kref *kref)
> +void free_time_ns(struct time_namespace *ns)
> {
> - struct time_namespace *ns;
> -
> - ns = container_of(kref, struct time_namespace, kref);
> dec_time_namespaces(ns->ucounts);
> put_user_ns(ns->user_ns);
> ns_free_inum(&ns->ns);
> @@ -464,7 +461,7 @@ const struct proc_ns_operations timens_for_children_operations = {
> };
>
> struct time_namespace init_time_ns = {
> - .kref = KREF_INIT(3),
> + .ns.count = REFCOUNT_INIT(3),
> .user_ns = &init_user_ns,
> .ns.inum = PROC_TIME_INIT_INO,
> .ns.ops = &timens_operations,
>
>

2020-07-30 15:02:04

by Kirill Tkhai

Subject: Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

On 30.07.2020 17:34, Eric W. Biederman wrote:
> Kirill Tkhai <[email protected]> writes:
>
>> Currently, there is no a way to list or iterate all or subset of namespaces
>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
>> but some also may be as open files, which are not attached to a process.
>> When a namespace open fd is sent over unix socket and then closed, it is
>> impossible to know whether the namespace exists or not.
>>
>> Also, even if namespace is exposed as attached to a process or as open file,
>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
>> this multiplies at tasks and fds number.
>
> I am very dubious about this.
>
> I have been avoiding exactly this kind of interface because it can
> create rather fundamental problems with checkpoint restart.

restart/restore :)

> You do have some filtering and the filtering is not based on current.
> Which is good.
>
> A view that is relative to a user namespace might be ok. It almost
> certainly does better as its own little filesystem than as an extension
> to proc though.
>
> The big thing we want to ensure is that if you migrate you can restore
> everything. I don't see how you will be able to restore these files
> after migration. Anything like this without having a complete
> checkpoint/restore story is a non-starter.

There is no difference between files in the /proc/namespaces/ directory and those in /proc/[pid]/ns/.

CRIU can restore open files in /proc/[pid]/ns/; the same will work for /proc/namespaces/ files.
As someone who worked extensively on pid_ns and user_ns support in CRIU, I don't see any
problem here.

If you have specific worries, let's discuss them.

CC: Pavel Tikhomirov, CRIU maintainer, who knows everything about namespaces C/R.

> Further by not going through the processes it looks like you are
> bypassing the existing permission checks. Which has the potential
> to allow someone to use a namespace who would not be able to otherwise.

I agree, and as I wrote to Christian, the permissions should be stricter.
This just needs to be formalized. Let's discuss this.

> So I think this goes one step too far but I am willing to be persuaded
> otherwise.
>
> Eric
>
>
>
>
>> This patchset introduces a new /proc/namespaces/ directory, which exposes
>> subset of permitted namespaces in linear view:
>>
>> # ls /proc/namespaces/ -l
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'cgroup:[4026531835]' -> 'cgroup:[4026531835]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'ipc:[4026531839]' -> 'ipc:[4026531839]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531840]' -> 'mnt:[4026531840]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026531861]' -> 'mnt:[4026531861]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532133]' -> 'mnt:[4026532133]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532134]' -> 'mnt:[4026532134]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532135]' -> 'mnt:[4026532135]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'mnt:[4026532136]' -> 'mnt:[4026532136]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'net:[4026531993]' -> 'net:[4026531993]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'pid:[4026531836]' -> 'pid:[4026531836]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'time:[4026531834]' -> 'time:[4026531834]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'user:[4026531837]' -> 'user:[4026531837]'
>> lrwxrwxrwx 1 root root 0 Jul 29 16:50 'uts:[4026531838]' -> 'uts:[4026531838]'
>>
>> Namespace ns is exposed, in case of its user_ns is permitted from /proc's pid_ns.
>> I.e., /proc is related to pid_ns, so in /proc/namespace we show only a ns, which is
>>
>> in_userns(pid_ns->user_ns, ns->user_ns).
>>
>> In case of ns is a user_ns:
>>
>> in_userns(pid_ns->user_ns, ns).
>>
>> The patchset follows this steps:
>>
>> 1)A generic counter in ns_common is introduced instead of separate
>> counters for every ns type (net::count, uts_namespace::kref,
>> user_namespace::count, etc). Patches [1-8];
>> 2)Patch [9] introduces IDR to link and iterate alive namespaces;
>> 3)Patch [10] is refactoring;
>> 4)Patch [11] actually adds /proc/namespace directory and fs methods;
>> 5)Patches [12-23] make every namespace to use the added methods
>> and to appear in /proc/namespace directory.
>>
>> This may be usefull to write effective debug utils (say, fast build
>> of networks topology) and checkpoint/restore software.
>> ---
>>
>> Kirill Tkhai (23):
>> ns: Add common refcount into ns_common add use it as counter for net_ns
>> uts: Use generic ns_common::count
>> ipc: Use generic ns_common::count
>> pid: Use generic ns_common::count
>> user: Use generic ns_common::count
>> mnt: Use generic ns_common::count
>> cgroup: Use generic ns_common::count
>> time: Use generic ns_common::count
>> ns: Introduce ns_idr to be able to iterate all allocated namespaces in the system
>> fs: Rename fs/proc/namespaces.c into fs/proc/task_namespaces.c
>> fs: Add /proc/namespaces/ directory
>> user: Free user_ns one RCU grace period after final counter put
>> user: Add user namespaces into ns_idr
>> net: Add net namespaces into ns_idr
>> pid: Eextract child_reaper check from pidns_for_children_get()
>> proc_ns_operations: Add can_get method
>> pid: Add pid namespaces into ns_idr
>> uts: Free uts namespace one RCU grace period after final counter put
>> uts: Add uts namespaces into ns_idr
>> ipc: Add ipc namespaces into ns_idr
>> mnt: Add mount namespaces into ns_idr
>> cgroup: Add cgroup namespaces into ns_idr
>> time: Add time namespaces into ns_idr
>>
>>
>> fs/mount.h | 4
>> fs/namespace.c | 14 +
>> fs/nsfs.c | 78 ++++++++
>> fs/proc/Makefile | 1
>> fs/proc/internal.h | 18 +-
>> fs/proc/namespaces.c | 382 +++++++++++++++++++++++++++-------------
>> fs/proc/root.c | 17 ++
>> fs/proc/task_namespaces.c | 183 +++++++++++++++++++
>> include/linux/cgroup.h | 6 -
>> include/linux/ipc_namespace.h | 3
>> include/linux/ns_common.h | 11 +
>> include/linux/pid_namespace.h | 4
>> include/linux/proc_fs.h | 1
>> include/linux/proc_ns.h | 12 +
>> include/linux/time_namespace.h | 10 +
>> include/linux/user_namespace.h | 10 +
>> include/linux/utsname.h | 10 +
>> include/net/net_namespace.h | 11 -
>> init/version.c | 2
>> ipc/msgutil.c | 2
>> ipc/namespace.c | 17 +-
>> ipc/shm.c | 1
>> kernel/cgroup/cgroup.c | 2
>> kernel/cgroup/namespace.c | 25 ++-
>> kernel/pid.c | 2
>> kernel/pid_namespace.c | 46 +++--
>> kernel/time/namespace.c | 20 +-
>> kernel/user.c | 2
>> kernel/user_namespace.c | 23 ++
>> kernel/utsname.c | 23 ++
>> net/core/net-sysfs.c | 6 -
>> net/core/net_namespace.c | 18 +-
>> net/ipv4/inet_timewait_sock.c | 4
>> net/ipv4/tcp_metrics.c | 2
>> 34 files changed, 746 insertions(+), 224 deletions(-)
>> create mode 100644 fs/proc/task_namespaces.c
>>
>> --
>> Signed-off-by: Kirill Tkhai <[email protected]>

2020-07-30 20:56:59

by kernel test robot

Subject: Re: [PATCH 11/23] fs: Add /proc/namespaces/ directory

Hi Kirill,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on next-20200729]
[also build test ERROR on v5.8-rc7]
[cannot apply to cgroup/for-next tip/timers/core net-next/master sparc-next/master net/master linus/master v5.8-rc7 v5.8-rc6 v5.8-rc5]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Kirill-Tkhai/proc-Introduce-proc-namespaces-directory-to-expose-namespaces-lineary/20200730-200346
base: 04b4571786305a76ad81757bbec78eb16a5de582
config: m68k-defconfig (attached as .config)
compiler: m68k-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=m68k

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

m68k-linux-ld: fs/proc/namespaces.o: in function `get_namespace_by_dentry.isra.0':
>> namespaces.c:(.text+0x170): undefined reference to `userns_operations'
m68k-linux-ld: fs/proc/namespaces.o: in function `proc_namespaces_readdir':
namespaces.c:(.text+0x3d2): undefined reference to `userns_operations'

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]


Attachments:
(No filename) (1.63 kB)
.config.gz (16.59 kB)

2020-07-30 22:17:16

by Eric W. Biederman

Subject: Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

Kirill Tkhai <[email protected]> writes:

> On 30.07.2020 17:34, Eric W. Biederman wrote:
>> Kirill Tkhai <[email protected]> writes:
>>
>>> Currently, there is no a way to list or iterate all or subset of namespaces
>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
>>> but some also may be as open files, which are not attached to a process.
>>> When a namespace open fd is sent over unix socket and then closed, it is
>>> impossible to know whether the namespace exists or not.
>>>
>>> Also, even if namespace is exposed as attached to a process or as open file,
>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
>>> this multiplies at tasks and fds number.
>>
>> I am very dubious about this.
>>
>> I have been avoiding exactly this kind of interface because it can
>> create rather fundamental problems with checkpoint restart.
>
> restart/restore :)
>
>> You do have some filtering and the filtering is not based on current.
>> Which is good.
>>
>> A view that is relative to a user namespace might be ok. It almost
>> certainly does better as it's own little filesystem than as an extension
>> to proc though.
>>
>> The big thing we want to ensure is that if you migrate you can restore
>> everything. I don't see how you will be able to restore these files
>> after migration. Anything like this without having a complete
>> checkpoint/restore story is a non-starter.
>
> There is no difference between files in /proc/namespaces/ directory and /proc/[pid]/ns/.
>
> CRIU can restore open files in /proc/[pid]/ns, the same will be with /proc/namespaces/ files.
> As a person who worked deeply for pid_ns and user_ns support in CRIU, I don't see any
> problem here.

An obvious difference is that you are adding the inode number to
the file name. Which means that now you really do have to preserve the
inode numbers during process migration.

Which means now we have to do all of the work to make inode number
restoration possible. Which means now we need to have multiple
instances of nsfs so that we can restore inode numbers.

I think this is still possible but we have been delaying figuring out
how to restore inode numbers long enough that there may be actual technical
problems making it happen.

Now maybe CRIU can handle the names of the files changing during
migration but you have just increased the level of difficulty for doing
that.

> If you have a specific worries about, let's discuss them.

I was asking and I am asking that it be described in the patch
description how a container using this feature can be migrated
from one machine to another. This code is so close to being problematic
that we need to be very careful we don't fundamentally break CRIU while
trying to make its job simpler and easier.

> CC: Pavel Tikhomirov CRIU maintainer, who knows everything about namespaces C/R.
>
>> Further by not going through the processes it looks like you are
>> bypassing the existing permission checks. Which has the potential
>> to allow someone to use a namespace who would not be able to otherwise.
>
> I agree, and I wrote to Christian, that permissions should be more strict.
> This just should be formalized. Let's discuss this.
>
>> So I think this goes one step too far but I am willing to be persuaded
>> otherwise.
>>

Eric

2020-07-30 22:24:26

by kernel test robot

Subject: Re: [PATCH 11/23] fs: Add /proc/namespaces/ directory

Hi Kirill,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on next-20200729]
[also build test ERROR on v5.8-rc7]
[cannot apply to cgroup/for-next tip/timers/core net-next/master sparc-next/master net/master linus/master v5.8-rc7 v5.8-rc6 v5.8-rc5]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Kirill-Tkhai/proc-Introduce-proc-namespaces-directory-to-expose-namespaces-lineary/20200730-200346
base: 04b4571786305a76ad81757bbec78eb16a5de582
config: csky-defconfig (attached as .config)
compiler: csky-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=csky

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

csky-linux-ld: fs/proc/namespaces.o: in function `$d':
namespaces.c:(.text+0x1d0): undefined reference to `userns_operations'
>> csky-linux-ld: namespaces.c:(.text+0x370): undefined reference to `userns_operations'

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]


Attachments:
(No filename) (1.54 kB)
.config.gz (9.87 kB)

2020-07-31 08:50:05

by Pavel Tikhomirov

Subject: Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary



On 7/31/20 1:13 AM, Eric W. Biederman wrote:
> Kirill Tkhai <[email protected]> writes:
>
>> On 30.07.2020 17:34, Eric W. Biederman wrote:
>>> Kirill Tkhai <[email protected]> writes:
>>>
>>>> Currently, there is no a way to list or iterate all or subset of namespaces
>>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
>>>> but some also may be as open files, which are not attached to a process.
>>>> When a namespace open fd is sent over unix socket and then closed, it is
>>>> impossible to know whether the namespace exists or not.
>>>>
>>>> Also, even if namespace is exposed as attached to a process or as open file,
>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
>>>> this multiplies at tasks and fds number.
>>>
>>> I am very dubious about this.
>>>
>>> I have been avoiding exactly this kind of interface because it can
>>> create rather fundamental problems with checkpoint restart.
>>
>> restart/restore :)
>>
>>> You do have some filtering and the filtering is not based on current.
>>> Which is good.
>>>
>>> A view that is relative to a user namespace might be ok. It almost
>>> certainly does better as it's own little filesystem than as an extension
>>> to proc though.
>>>
>>> The big thing we want to ensure is that if you migrate you can restore
>>> everything. I don't see how you will be able to restore these files
>>> after migration. Anything like this without having a complete
>>> checkpoint/restore story is a non-starter.
>>
>> There is no difference between files in /proc/namespaces/ directory and /proc/[pid]/ns/.
>>
>> CRIU can restore open files in /proc/[pid]/ns, the same will be with /proc/namespaces/ files.
>> As a person who worked deeply for pid_ns and user_ns support in CRIU, I don't see any
>> problem here.
>
> An obvious difference is that you are adding the inode number to
> the file name. Which means that now you really do have to preserve the
> inode numbers during process migration.
>
> Which means now we have to do all of the work to make inode number
> restoration possible. Which means now we need to have multiple
> instances of nsfs so that we can restore inode numbers.
>
> I think this is still possible but we have been delaying figuring out
> how to restore inode numbers long enough that may be actual technical
> problems making it happen.
>
> Now maybe CRIU can handle the names of the files changing during
> migration but you have just increased the level of difficulty for doing
> that.

Yes, adding /proc/namespaces/<ns_name>:[<ns_ino>] files may be a problem
for CRIU.

First I would like to highlight that open files are not a problem,
because open files from /proc/namespaces/* are exactly the same as open
files from /proc/<pid>/ns/<ns_name>. When we c/r an nsfs open file fd,
on dump we readlink the fd and get <ns_name>:[<ns_ino>], and on restore
we recreate each dumped namespace and open an fd to each, so we can
'dup' it when restoring the open file. It will be an fd to a topologically
identical namespace, though ns_ino will be newly generated.
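
The dump/restore bookkeeping described above can be sketched in user space. This is only an illustration, not CRIU code: the helper names parse_ns_link and remap_after_restore are hypothetical, and the only thing taken from the thread is the <ns_name>:[<ns_ino>] link format.

```python
import re

# Format of an nsfs symlink target as shown in the cover letter,
# e.g. "mnt:[4026531840]".
NS_LINK_RE = re.compile(r"^([a-z_]+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split '<ns_name>:[<ns_ino>]' into (ns_name, ns_ino)."""
    m = NS_LINK_RE.match(target)
    if m is None:
        raise ValueError(f"not an nsfs link: {target!r}")
    return m.group(1), int(m.group(2))


def remap_after_restore(dumped, restored):
    """Hypothetical helper: pair dumped namespaces with recreated ones.

    Both arguments are lists of link targets in the same (dump) order.
    Returns {old_target: new_target}; the type must match, while the
    inode number is allowed to differ.
    """
    mapping = {}
    for old, new in zip(dumped, restored):
        old_type, _ = parse_ns_link(old)
        new_type, _ = parse_ns_link(new)
        if old_type != new_type:
            raise ValueError(f"type mismatch: {old} vs {new}")
        mapping[old] = new
    return mapping
```

The point of the mapping is that an open fd can be re-attached by namespace identity, whereas a file name that embeds the old inode number no longer exists after restore.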

But the problem I see is with readdir. What if some task is reading the
/proc/namespaces/ directory at the time of dump? After restore the directory
will contain new names for namespaces, possibly in a different order,
so if the process continues to readdir it can miss some namespaces or
read some twice.
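
A toy model of that readdir hazard, under the simplifying assumption that the saved directory position behaves like a plain index into the listing (real directory offsets are filesystem specific). The names and inode numbers are made up.

```python
def resume_listing(entries, offset):
    """Model a directory stream resumed at a saved position."""
    return entries[offset:]


before = ["mnt:[10]", "net:[20]", "pid:[30]", "uts:[40]"]
# After restore the same namespaces have new inode numbers, hence new
# names, and the directory may return them in a different order.
after = ["uts:[41]", "mnt:[11]", "net:[21]", "pid:[31]"]

# The task read two entries before the dump ...
seen_types = [name.split(":")[0] for name in before[:2]]
# ... and continues from the same position after the restore.
seen_types += [name.split(":")[0] for name in resume_listing(after, 2)]
```

In this run the task sees the net namespace twice and never sees the uts namespace, which is the failure mode described above.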

Maybe instead of multiple files in a /proc/namespaces directory, we can
leave just one file /proc/namespaces, and when we open it we would return
e.g. a unix socket filled with fds of all the namespaces visible at
this point. It looks like a possible solution to the above problem.

CRIU can restore unix sockets with fds inside, so we should be able to
dump a process using this functionality.
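
For readers unfamiliar with the mechanism, here is a minimal user-space sketch of fds queued inside a unix socket via SCM_RIGHTS. It does not model the proposed /proc/namespaces file, which does not exist; it only shows the transport such a file would hand back. The helper names are hypothetical and a pipe stands in for a namespace fd.

```python
import array
import os
import socket


def send_fds(sock, fds):
    """Queue file descriptors into a unix socket as SCM_RIGHTS data."""
    payload = array.array("i", fds)
    sock.sendmsg([b"F"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, payload)])


def recv_fds(sock, maxfds):
    """Receive up to maxfds descriptors queued by send_fds()."""
    fds = array.array("i")
    _, ancdata, _, _ = sock.recvmsg(1, socket.CMSG_LEN(maxfds * fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore a trailing partial int, if any.
            fds.frombytes(data[: len(data) - (len(data) % fds.itemsize)])
    return list(fds)


def same_file(fd_a, fd_b):
    """Two fds refer to the same inode if device and inode numbers match."""
    a, b = os.fstat(fd_a), os.fstat(fd_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)
```

A received fd is a new descriptor number for the same open file, so the receiver identifies the namespace by fstat() or readlink rather than by any file name, which is why this variant avoids the name-stability problem.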

>
>> If you have a specific worries about, let's discuss them.
>
> I was asking and I am asking that it be described in the patch
> description how a container using this feature can be migrated
> from one machine to another. This code is so close to being problematic
> that we need be very careful we don't fundamentally break CRIU while
> trying to make it's job simpler and easier.
>
>> CC: Pavel Tikhomirov CRIU maintainer, who knows everything about namespaces C/R.
>>
>>> Further by not going through the processes it looks like you are
>>> bypassing the existing permission checks. Which has the potential
>>> to allow someone to use a namespace who would not be able to otherwise.
>>
>> I agree, and I wrote to Christian, that permissions should be more strict.
>> This just should be formalized. Let's discuss this.
>>
>>> So I think this goes one step too far but I am willing to be persuaded
>>> otherwise.
>>>
>
> Eric
>

--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.

2020-08-03 10:05:19

by Kirill Tkhai

Subject: Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

On 31.07.2020 01:13, Eric W. Biederman wrote:
> Kirill Tkhai <[email protected]> writes:
>
>> On 30.07.2020 17:34, Eric W. Biederman wrote:
>>> Kirill Tkhai <[email protected]> writes:
>>>
>>>> Currently, there is no a way to list or iterate all or subset of namespaces
>>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
>>>> but some also may be as open files, which are not attached to a process.
>>>> When a namespace open fd is sent over unix socket and then closed, it is
>>>> impossible to know whether the namespace exists or not.
>>>>
>>>> Also, even if namespace is exposed as attached to a process or as open file,
>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
>>>> this multiplies at tasks and fds number.
>>>
>>> I am very dubious about this.
>>>
>>> I have been avoiding exactly this kind of interface because it can
>>> create rather fundamental problems with checkpoint restart.
>>
>> restart/restore :)
>>
>>> You do have some filtering and the filtering is not based on current.
>>> Which is good.
>>>
>>> A view that is relative to a user namespace might be ok. It almost
>>> certainly does better as it's own little filesystem than as an extension
>>> to proc though.
>>>
>>> The big thing we want to ensure is that if you migrate you can restore
>>> everything. I don't see how you will be able to restore these files
>>> after migration. Anything like this without having a complete
>>> checkpoint/restore story is a non-starter.
>>
>> There is no difference between files in /proc/namespaces/ directory and /proc/[pid]/ns/.
>>
>> CRIU can restore open files in /proc/[pid]/ns, the same will be with /proc/namespaces/ files.
>> As a person who worked deeply for pid_ns and user_ns support in CRIU, I don't see any
>> problem here.
>
> An obvious difference is that you are adding the inode number to
> the file name. Which means that now you really do have to preserve the
> inode numbers during process migration.
>
> Which means now we have to do all of the work to make inode number
> restoration possible. Which means now we need to have multiple
> instances of nsfs so that we can restore inode numbers.
>
> I think this is still possible but we have been delaying figuring out
> how to restore inode numbers long enough that may be actual technical
> problems making it happen.

Yeah, this matters. But it looks like this is not a dead end. We just need
to change the names under which the namespaces are exported in a particular
fs, and to support rename().

Before introducing a principally new filesystem type for this, can't
this be solved in current /proc?

Alexey, is rename() prohibited for /proc fs?

> Now maybe CRIU can handle the names of the files changing during
> migration but you have just increased the level of difficulty for doing
> that.
>
>> If you have a specific worries about, let's discuss them.
>
> I was asking and I am asking that it be described in the patch
> description how a container using this feature can be migrated
> from one machine to another. This code is so close to being problematic
> that we need be very careful we don't fundamentally break CRIU while
> trying to make it's job simpler and easier.
>
>> CC: Pavel Tikhomirov CRIU maintainer, who knows everything about namespaces C/R.
>>
>>> Further by not going through the processes it looks like you are
>>> bypassing the existing permission checks. Which has the potential
>>> to allow someone to use a namespace who would not be able to otherwise.
>>
>> I agree, and I wrote to Christian, that permissions should be more strict.
>> This just should be formalized. Let's discuss this.
>>
>>> So I think this goes one step too far but I am willing to be persuaded
>>> otherwise.
>>>
>
> Eric
>

2020-08-03 10:53:34

by Alexey Dobriyan

Subject: Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
> On 31.07.2020 01:13, Eric W. Biederman wrote:
> > Kirill Tkhai <[email protected]> writes:
> >
> >> On 30.07.2020 17:34, Eric W. Biederman wrote:
> >>> Kirill Tkhai <[email protected]> writes:
> >>>
> >>>> Currently, there is no a way to list or iterate all or subset of namespaces
> >>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
> >>>> but some also may be as open files, which are not attached to a process.
> >>>> When a namespace open fd is sent over unix socket and then closed, it is
> >>>> impossible to know whether the namespace exists or not.
> >>>>
> >>>> Also, even if namespace is exposed as attached to a process or as open file,
> >>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
> >>>> this multiplies at tasks and fds number.
> >>>
> >>> I am very dubious about this.
> >>>
> >>> I have been avoiding exactly this kind of interface because it can
> >>> create rather fundamental problems with checkpoint restart.
> >>
> >> restart/restore :)
> >>
> >>> You do have some filtering and the filtering is not based on current.
> >>> Which is good.
> >>>
> >>> A view that is relative to a user namespace might be ok. It almost
> >>> certainly does better as it's own little filesystem than as an extension
> >>> to proc though.
> >>>
> >>> The big thing we want to ensure is that if you migrate you can restore
> >>> everything. I don't see how you will be able to restore these files
> >>> after migration. Anything like this without having a complete
> >>> checkpoint/restore story is a non-starter.
> >>
> >> There is no difference between files in /proc/namespaces/ directory and /proc/[pid]/ns/.
> >>
> >> CRIU can restore open files in /proc/[pid]/ns, the same will be with /proc/namespaces/ files.
> >> As a person who worked deeply for pid_ns and user_ns support in CRIU, I don't see any
> >> problem here.
> >
> > An obvious difference is that you are adding the inode number to
> > the file name. Which means that now you really do have to preserve the
> > inode numbers during process migration.
> >
> > Which means now we have to do all of the work to make inode number
> > restoration possible. Which means now we need to have multiple
> > instances of nsfs so that we can restore inode numbers.
> >
> > I think this is still possible but we have been delaying figuring out
> > how to restore inode numbers long enough that may be actual technical
> > problems making it happen.
>
> Yeah, this matters. But it looks like here is not a dead end. We just need
> change the names the namespaces are exported to particular fs and to support
> rename().
>
> Before introduction a principally new filesystem type for this, can't
> this be solved in current /proc?
>
> Alexey, does rename() is prohibited for /proc fs?

Technically it is allowed: add ->rename to /proc/ns inode.
But nobody does it.

2020-08-04 05:44:27

by Andrei Vagin

Subject: Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

On Thu, Jul 30, 2020 at 06:01:20PM +0300, Kirill Tkhai wrote:
> On 30.07.2020 17:34, Eric W. Biederman wrote:
> > Kirill Tkhai <[email protected]> writes:
> >
> >> Currently, there is no a way to list or iterate all or subset of namespaces
> >> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
> >> but some also may be as open files, which are not attached to a process.
> >> When a namespace open fd is sent over unix socket and then closed, it is
> >> impossible to know whether the namespace exists or not.
> >>
> >> Also, even if namespace is exposed as attached to a process or as open file,
> >> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
> >> this multiplies at tasks and fds number.

Could you describe in more detail when you need to iterate
namespaces?

There are three ways to hold namespaces.

* processes
* bind-mounts
* file descriptors

When CRIU dumps a container, it enumerates all processes, collects file
descriptors and mounts. This means that we will be able to collect all
namespaces, doesn't it?
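
As a rough illustration of the first of those three sources, the sketch below walks <proc_root>/<pid>/ns/* and collects the distinct link targets. It is a simplified stand-in, not how CRIU does it: the function name is hypothetical, and namespaces held only by bind mounts or by fds (including fds queued in unix sockets) are not found this way.

```python
import os


def collect_process_namespaces(proc_root="/proc"):
    """Return the set of nsfs link targets reachable via <proc_root>/<pid>/ns/*."""
    found = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only per-process directories
        ns_dir = os.path.join(proc_root, entry, "ns")
        try:
            names = os.listdir(ns_dir)
        except OSError:
            continue  # process exited or access denied
        for name in names:
            try:
                found.add(os.readlink(os.path.join(ns_dir, name)))
            except OSError:
                continue
    return found
```

The walk costs one readlink per task per namespace type, which is the "multiplies at tasks and fds number" overhead the cover letter complains about.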

2020-08-04 12:23:26

by Pavel Tikhomirov

Subject: Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary



On 8/4/20 8:43 AM, Andrei Vagin wrote:
> On Thu, Jul 30, 2020 at 06:01:20PM +0300, Kirill Tkhai wrote:
>> On 30.07.2020 17:34, Eric W. Biederman wrote:
>>> Kirill Tkhai <[email protected]> writes:
>>>
>>>> Currently, there is no a way to list or iterate all or subset of namespaces
>>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
>>>> but some also may be as open files, which are not attached to a process.
>>>> When a namespace open fd is sent over unix socket and then closed, it is
>>>> impossible to know whether the namespace exists or not.
>>>>
>>>> Also, even if namespace is exposed as attached to a process or as open file,
>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
>>>> this multiplies at tasks and fds number.
>
> Could you describe with more details when you need to iterate
> namespaces?
>
> There are three ways to hold namespaces.
>
> * processes
> * bind-mounts
> * file descriptors
>
> When CRIU dumps a container, it enumirates all processes, collects file
> descriptors and mounts. This means that we will be able to collect all
> namespaces, doesn't it?

Yes we can. But it would be much easier for us to have all namespaces in
one place, wouldn't it?

And this patch-set has another non-CRIU use case. It can simplify the view
of namespaces for a normal user. Let's consider some cases:

Let's assume we have an empty (no processes) mount namespace M which is
held by a single open fd, which was put in a unix socket and closed; the unix
socket has a single open fd to it, which was in its turn put into another
unix socket, and again and again until we reach the unix socket max depth...
How should a normal user find this mount namespace M?

Let's assume that M also has an nsfs bind mount which holds some empty
network namespace N... How should a normal user find N?

Let's also assume that M has overmounted "/":

mount -t tmpfs tmpfs /

Now if you enter M you would see a single tmpfs (because of the implicit
chroot to the overmount on setns) in mountinfo, and there is no way to see
the full mountinfo if you do not know the real root dentry... How should
a normal user (or even CRIU) find N?

So my personal opinion is that we need this interface; maybe it should
be done somehow differently, but we need it.

>

--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.

2020-08-04 14:48:50

by Kirill Tkhai

Subject: Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

On 04.08.2020 08:43, Andrei Vagin wrote:
> On Thu, Jul 30, 2020 at 06:01:20PM +0300, Kirill Tkhai wrote:
>> On 30.07.2020 17:34, Eric W. Biederman wrote:
>>> Kirill Tkhai <[email protected]> writes:
>>>
>>>> Currently, there is no a way to list or iterate all or subset of namespaces
>>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
>>>> but some also may be as open files, which are not attached to a process.
>>>> When a namespace open fd is sent over unix socket and then closed, it is
>>>> impossible to know whether the namespace exists or not.
>>>>
>>>> Also, even if namespace is exposed as attached to a process or as open file,
>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
>>>> this multiplies at tasks and fds number.
>
> Could you describe with more details when you need to iterate
> namespaces?
>
> There are three ways to hold namespaces.
>
> * processes
> * bind-mounts
> * file descriptors
>
> When CRIU dumps a container, it enumirates all processes, collects file
> descriptors and mounts. This means that we will be able to collect all
> namespaces, doesn't it?

1) It's not only for CRIU. No other utility can read the content of another task's unix socket like CRIU does.
Sometimes we may just want to see all mount namespaces to find a mount which holds a reference to
a device.
2) In the case of CRIU, a recursive dump (when you iterate unix socket content, then you find another
namespace and iterate another unix socket's content, then you find one more namespace) is less
efficient and slower than dumping different types sequentially: first namespaces, second fds, etc.
3) It's still impossible to collect all namespaces, as Pasha wrote.

2020-08-05 08:20:37

by kernel test robot

Subject: [RFC PATCH] fs: namespaces_dentry_operations can be static


Signed-off-by: kernel test robot <[email protected]>
---
namespaces.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index ab47e15556197..63f27f66a96a4 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -182,7 +182,7 @@ static int namespace_delete_dentry(const struct dentry *dentry)
 	return 0;
 }
 
-const struct dentry_operations namespaces_dentry_operations = {
+static const struct dentry_operations namespaces_dentry_operations = {
 	.d_delete = namespace_delete_dentry,
 };

2020-08-05 08:36:04

by kernel test robot

Subject: Re: [PATCH 11/23] fs: Add /proc/namespaces/ directory

Hi Kirill,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on next-20200729]
[also build test WARNING on v5.8]
[cannot apply to cgroup/for-next tip/timers/core net-next/master sparc-next/master net/master linus/master v5.8-rc7 v5.8-rc6 v5.8-rc5]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Kirill-Tkhai/proc-Introduce-proc-namespaces-directory-to-expose-namespaces-lineary/20200730-200346
base: 04b4571786305a76ad81757bbec78eb16a5de582
config: i386-randconfig-s001-20200805 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce:
# apt-get install sparse
# sparse version: v0.6.2-117-g8c7aee71-dirty
# save the attached .config to linux build tree
make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=i386

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>


sparse warnings: (new ones prefixed by >>)

>> fs/proc/namespaces.c:185:32: sparse: sparse: symbol 'namespaces_dentry_operations' was not declared. Should it be static?

Please review and possibly fold the followup patch.

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]


Attachments:
(No filename) (1.44 kB)
.config.gz (25.58 kB)

2020-08-06 08:09:06

by Andrei Vagin

Subject: Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
> On 31.07.2020 01:13, Eric W. Biederman wrote:
> > Kirill Tkhai <[email protected]> writes:
> >
> >> On 30.07.2020 17:34, Eric W. Biederman wrote:
> >>> Kirill Tkhai <[email protected]> writes:
> >>>
> >>>> Currently, there is no a way to list or iterate all or subset of namespaces
> >>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
> >>>> but some also may be as open files, which are not attached to a process.
> >>>> When a namespace open fd is sent over unix socket and then closed, it is
> >>>> impossible to know whether the namespace exists or not.
> >>>>
> >>>> Also, even if namespace is exposed as attached to a process or as open file,
> >>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
> >>>> this multiplies at tasks and fds number.
> >>>
> >>> I am very dubious about this.
> >>>
> >>> I have been avoiding exactly this kind of interface because it can
> >>> create rather fundamental problems with checkpoint restart.
> >>
> >> restart/restore :)
> >>
> >>> You do have some filtering and the filtering is not based on current.
> >>> Which is good.
> >>>
> >>> A view that is relative to a user namespace might be ok. It almost
> >>> certainly does better as it's own little filesystem than as an extension
> >>> to proc though.
> >>>
> >>> The big thing we want to ensure is that if you migrate you can restore
> >>> everything. I don't see how you will be able to restore these files
> >>> after migration. Anything like this without having a complete
> >>> checkpoint/restore story is a non-starter.
> >>
> >> There is no difference between files in /proc/namespaces/ directory and /proc/[pid]/ns/.
> >>
> >> CRIU can restore open files in /proc/[pid]/ns, the same will be with /proc/namespaces/ files.
> >> As a person who worked deeply for pid_ns and user_ns support in CRIU, I don't see any
> >> problem here.
> >
> > An obvious difference is that you are adding the inode number to
> > the file name. Which means that now you really do have to preserve the
> > inode numbers during process migration.
> >
> > Which means now we have to do all of the work to make inode number
> > restoration possible. Which means now we need to have multiple
> > instances of nsfs so that we can restore inode numbers.
> >
> > I think this is still possible but we have been delaying figuring out
> > how to restore inode numbers long enough that may be actual technical
> > problems making it happen.
>
> Yeah, this matters. But it looks like here is not a dead end. We just need
> change the names the namespaces are exported to particular fs and to support
> rename().
>
> Before introduction a principally new filesystem type for this, can't
> this be solved in current /proc?

Do you mean to introduce names for namespaces which users will be able
to change? By default, this can be a uuid.

And I have a suggestion about the structure of /proc/namespaces/.

Each namespace is owned by one of the user namespaces. Maybe it makes sense
to group namespaces by their user namespaces?

/proc/namespaces/
    user
    mnt-X
    mnt-Y
    pid-X
    uts-Z
    user-X/
        user
        mnt-A
        mnt-B
        user-C
        user-C/
            user
    user-Y/
        user
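
One way to picture that grouping is a small sketch that turns flat (namespace, owning user namespace) pairs into a nested structure. The input format, the function name and the "user-" prefix test are assumptions made for illustration; the per-directory "user" entry from the layout above is left out for brevity.

```python
def group_by_owner(namespaces, root):
    """Build a nested dict grouping namespaces under their owning user namespace.

    namespaces: iterable of (name, owner) pairs, where owner is the name
    of the owning user namespace (hypothetical input format).
    root: name of the user namespace at the top of the view.
    """
    children = {}
    for name, owner in namespaces:
        children.setdefault(owner, []).append(name)

    def build(userns):
        node = {}
        for name in children.get(userns, []):
            # A user namespace becomes a directory; others are leaves.
            node[name] = build(name) if name.startswith("user-") else None
        return node

    return build(root)
```

With such a layout, a viewer restricted to user-X would naturally see only the subtree below it.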

Are we trying to invent a cgroupfs for namespaces?

Thanks,
Andrei

2020-08-07 08:49:48

by Kirill Tkhai

Subject: Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

On 06.08.2020 11:05, Andrei Vagin wrote:
> On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
>> On 31.07.2020 01:13, Eric W. Biederman wrote:
>>> Kirill Tkhai <[email protected]> writes:
>>>
>>>> On 30.07.2020 17:34, Eric W. Biederman wrote:
>>>>> Kirill Tkhai <[email protected]> writes:
>>>>>
>>>>>> Currently, there is no a way to list or iterate all or subset of namespaces
>>>>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
>>>>>> but some also may be as open files, which are not attached to a process.
>>>>>> When a namespace open fd is sent over unix socket and then closed, it is
>>>>>> impossible to know whether the namespace exists or not.
>>>>>>
>>>>>> Also, even if namespace is exposed as attached to a process or as open file,
>>>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
>>>>>> this multiplies at tasks and fds number.
>>>>>
>>>>> I am very dubious about this.
>>>>>
>>>>> I have been avoiding exactly this kind of interface because it can
>>>>> create rather fundamental problems with checkpoint restart.
>>>>
>>>> restart/restore :)
>>>>
>>>>> You do have some filtering and the filtering is not based on current.
>>>>> Which is good.
>>>>>
>>>>> A view that is relative to a user namespace might be ok. It almost
>>>>> certainly does better as it's own little filesystem than as an extension
>>>>> to proc though.
>>>>>
>>>>> The big thing we want to ensure is that if you migrate you can restore
>>>>> everything. I don't see how you will be able to restore these files
>>>>> after migration. Anything like this without having a complete
>>>>> checkpoint/restore story is a non-starter.
>>>>
>>>> There is no difference between files in /proc/namespaces/ directory and /proc/[pid]/ns/.
>>>>
>>>> CRIU can restore open files in /proc/[pid]/ns, the same will be with /proc/namespaces/ files.
>>>> As a person who worked deeply for pid_ns and user_ns support in CRIU, I don't see any
>>>> problem here.
>>>
>>> An obvious difference is that you are adding the inode number to
>>> the file name. Which means that now you really do have to preserve the
>>> inode numbers during process migration.
>>>
>>> Which means now we have to do all of the work to make inode number
>>> restoration possible. Which means now we need to have multiple
>>> instances of nsfs so that we can restore inode numbers.
>>>
>>> I think this is still possible but we have been delaying figuring out
>>> how to restore inode numbers long enough that may be actual technical
>>> problems making it happen.
>>
>> Yeah, this matters. But it looks like here is not a dead end. We just need
>> change the names the namespaces are exported to particular fs and to support
>> rename().
>>
>> Before introduction a principally new filesystem type for this, can't
>> this be solved in current /proc?
>
> do you mean to introduce names for namespaces which users will be able
> to change? By default, this can be uuid.

Yes, I mean this.

Currently I won't give a final answer about UUID, but I planned to show some
default names, which based on namespace type and inode num. Completely custom
names for any /proc by default will waste too much memory.

So, I think a good way would be:

1) Introduce a function that returns a hash/uuid based on the ino, the ns type and a static
random seed, which is generated at boot;

2) Use the hash/uuid as the default name in a newly created /proc/namespaces: pid-{hash/uuid(ino, "pid")}

3) Allow rename, and allocate space only for renamed names.

Maybe 2 and 3 will be implemented as shrinkable and non-shrinkable dentries.
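
The three steps above can be sketched in userspace Python. This is only an illustration of the idea, not kernel code: the HMAC-SHA256 construction, the 16-hex-digit truncation, and the names BOOT_SEED, default_ns_name and NamespaceDir are all assumptions made for the example.

```python
import hashlib
import hmac
import os

# Hypothetical stand-in for the static random seed generated at boot.
BOOT_SEED = os.urandom(16)


def default_ns_name(ino: int, ns_type: str, seed: bytes = BOOT_SEED) -> str:
    """Step 1 and 2: derive a default name like 'pid-<hash>' from (ino, type, seed)."""
    msg = f"{ns_type}:{ino}".encode()
    digest = hmac.new(seed, msg, hashlib.sha256).hexdigest()[:16]
    return f"{ns_type}-{digest}"


class NamespaceDir:
    """Toy model of one /proc/namespaces instance.

    Default names are recomputed on demand; memory is used only for
    entries that were explicitly renamed (step 3).
    """

    def __init__(self, seed: bytes = BOOT_SEED):
        self.seed = seed
        self.renamed = {}  # (ns_type, ino) -> custom name

    def name_of(self, ino: int, ns_type: str) -> str:
        key = (ns_type, ino)
        if key in self.renamed:
            return self.renamed[key]
        return default_ns_name(ino, ns_type, self.seed)

    def rename(self, ino: int, ns_type: str, new_name: str) -> None:
        self.renamed[(ns_type, ino)] = new_name
```

Because the default name depends on the seed, the same inode number gets a different default name after a reboot or on another host, so nothing outside the renamed entries has to be preserved across migration.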

> And I have a suggestion about the structure of /proc/namespaces/.
>
> Each namespace is owned by one of user namespaces. Maybe it makes sense
> to group namespaces by their user-namespaces?
>
> /proc/namespaces/
> user
> mnt-X
> mnt-Y
> pid-X
> uts-Z
> user-X/
> user
> mnt-A
> mnt-B
> user-C
> user-C/
> user
> user-Y/
> user

Hm, I don't think the user namespace is a generic key for everybody.
For most people's tasks, a user namespace is just one namespace type among
the others. To me it would look a bit strange to iterate over user namespaces
in order to build a container's net topology.

> Do we try to invent cgroupfs for namespaces?

Could you clarify your thought?

2020-08-10 17:38:04

by Andrei Vagin

Subject: Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

On Fri, Aug 07, 2020 at 11:47:57AM +0300, Kirill Tkhai wrote:
> On 06.08.2020 11:05, Andrei Vagin wrote:
> > On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
> >> On 31.07.2020 01:13, Eric W. Biederman wrote:
> >>> Kirill Tkhai writes:
> >>>
> >>>> On 30.07.2020 17:34, Eric W. Biederman wrote:
> >>>>> Kirill Tkhai writes:
> >>>>>
> >>>>>> Currently, there is no a way to list or iterate all or subset of namespaces
> >>>>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
> >>>>>> but some also may be as open files, which are not attached to a process.
> >>>>>> When a namespace open fd is sent over unix socket and then closed, it is
> >>>>>> impossible to know whether the namespace exists or not.
> >>>>>>
> >>>>>> Also, even if namespace is exposed as attached to a process or as open file,
> >>>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
> >>>>>> this multiplies at tasks and fds number.
> >>>>>
> >>>>> I am very dubious about this.
> >>>>>
> >>>>> I have been avoiding exactly this kind of interface because it can
> >>>>> create rather fundamental problems with checkpoint restart.
> >>>>
> >>>> restart/restore :)
> >>>>
> >>>>> You do have some filtering and the filtering is not based on current.
> >>>>> Which is good.
> >>>>>
> >>>>> A view that is relative to a user namespace might be ok. It almost
> >>>>> certainly does better as it's own little filesystem than as an extension
> >>>>> to proc though.
> >>>>>
> >>>>> The big thing we want to ensure is that if you migrate you can restore
> >>>>> everything. I don't see how you will be able to restore these files
> >>>>> after migration. Anything like this without having a complete
> >>>>> checkpoint/restore story is a non-starter.
> >>>>
> >>>> There is no difference between files in /proc/namespaces/ directory and /proc/[pid]/ns/.
> >>>>
> >>>> CRIU can restore open files in /proc/[pid]/ns, the same will be with /proc/namespaces/ files.
> >>>> As a person who worked deeply for pid_ns and user_ns support in CRIU, I don't see any
> >>>> problem here.
> >>>
> >>> An obvious diffference is that you are adding the inode to the inode to
> >>> the file name. Which means that now you really do have to preserve the
> >>> inode numbers during process migration.
> >>>
> >>> Which means now we have to do all of the work to make inode number
> >>> restoration possible. Which means now we need to have multiple
> >>> instances of nsfs so that we can restore inode numbers.
> >>>
> >>> I think this is still possible but we have been delaying figuring out
> >>> how to restore inode numbers long enough that may be actual technical
> >>> problems making it happen.
> >>
> >> Yeah, this matters. But it looks like here is not a dead end. We just need
> >> change the names the namespaces are exported to particular fs and to support
> >> rename().
> >>
> >> Before introduction a principally new filesystem type for this, can't
> >> this be solved in current /proc?
> >
> > do you mean to introduce names for namespaces which users will be able
> > to change? By default, this can be uuid.
>
> Yes, I mean this.
>
> Currently I won't give a final answer about UUID, but I planned to show some
> default names, which based on namespace type and inode num. Completely custom
> names for any /proc by default will waste too much memory.
>
> So, I think the good way will be:
>
> 1)Introduce a function, which returns a hash/uuid based on ino, ns type and some static
> random seed, which is generated on boot;
>
> 2)Use the hash/uuid as default names in newly create /proc/namespaces: pid-{hash/uuid(ino, "pid")}
>
> 3)Allow rename, and allocate space only for renamed names.
>
> Maybe 2 and 3 will be implemented as shrinkable dentries and non-shrinkable.
>
> > And I have a suggestion about the structure of /proc/namespaces/.
> >
> > Each namespace is owned by one of user namespaces. Maybe it makes sense
> > to group namespaces by their user-namespaces?
> >
> > /proc/namespaces/
> > user
> > mnt-X
> > mnt-Y
> > pid-X
> > uts-Z
> > user-X/
> > user
> > mnt-A
> > mnt-B
> > user-C
> > user-C/
> > user
> > user-Y/
> > user
>
> Hm, I don't think that user namespace is a generic key value for everybody.
> For generic people tasks a user namespace is just a namespace among another
> namespace types. For me it will look a bit strage to iterate some user namespaces
> to build container net topology.

I can't agree that the user namespace is just one namespace among others. It is
the namespace for namespaces. It sets the security boundaries in the system,
and we need to know them to understand the whole system.

If user namespaces are not used in the system or in a container, you
will see all namespaces in one directory. But if the system has a more
complicated structure, you will be able to build a full picture of it.

You said that one of the users of this feature is CRIU (the tool to
checkpoint/restore containers) and that it would be good if CRIU were
able to collect all container namespaces before dumping processes,
sockets, files, etc. But how will we be able to do this if we list all
namespaces in one directory?

Here are my thoughts on why the suggested structure is better than just
a list of namespaces:

* Users will be able to understand the security boundaries in the system.
Each namespace in the system is owned by a user namespace, and we
need to know these relationships to understand the whole system.

* This simplifies collecting the namespaces that belong to one container.

For example, CRIU collects all namespaces before dumping file
descriptors. Then it collects all sockets with socket-diag in network
namespaces and collects mount points via /proc/pid/mountinfo in mount
namespaces. This information is then used to dump socket file
descriptors and opened files.

* We are going to assign names to namespaces. But this means that we
need to guarantee that all names in one directory are unique. The
initial proposal was to enumerate all namespaces in one proc directory,
which means the names of all namespaces have to be unique. This can be
problematic in some cases. For example, we may want to dump a container
and then restore it more than once on the same host. How are we going to
avoid namespace name conflicts in such cases?

If we have per-user-namespace directories, we will need to
guarantee that names are unique only inside one user namespace.

* With the suggested structure, for each user namespace, we will show
only its subtree of namespaces. This looks more natural than
filtering the content of one directory.
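
To make the suggested layout concrete, here is a small Python sketch that turns a flat list of namespaces into the per-user-namespace view. It is only a model of the proposal: the (name, owner) records, the "user-" prefix convention and the function names group_by_owner and render are assumptions made for this example, and the per-directory "user" entry from the listing above is left out for brevity.

```python
from collections import defaultdict


def group_by_owner(namespaces):
    """Group (name, owner) pairs by owning user namespace.

    `owner` is the name of the owning user namespace, or None for
    namespaces owned by the initial user namespace.
    """
    tree = defaultdict(list)
    for name, owner in namespaces:
        tree[owner].append(name)
    return {owner: sorted(names) for owner, names in tree.items()}


def render(tree, owner=None, depth=0):
    """Return the directory-like listing below `owner` as indented lines."""
    lines = []
    for name in tree.get(owner, []):
        is_userns = name.startswith("user-")
        lines.append("  " * depth + name + ("/" if is_userns else ""))
        if is_userns:
            # A user namespace is shown as a directory holding what it owns.
            lines.extend(render(tree, name, depth + 1))
    return lines


flat = [
    ("mnt-X", None), ("mnt-Y", None), ("pid-X", None), ("uts-Z", None),
    ("user-X", None), ("user-Y", None),
    ("mnt-A", "user-X"), ("mnt-B", "user-X"), ("user-C", "user-X"),
]
print("\n".join(render(group_by_owner(flat))))
```

In this model a name only has to be unique among the entries that share an owner, and a viewer rooted at user-X would call render(tree, "user-X") and see just that subtree.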


>
> > Do we try to invent cgroupfs for namespaces?
>
> Could you clarify your thought?

2020-08-11 10:26:47

by Kirill Tkhai

Subject: Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

On 10.08.2020 20:34, Andrei Vagin wrote:
> On Fri, Aug 07, 2020 at 11:47:57AM +0300, Kirill Tkhai wrote:
>> On 06.08.2020 11:05, Andrei Vagin wrote:
>>> On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
>>>> On 31.07.2020 01:13, Eric W. Biederman wrote:
>>>>> Kirill Tkhai writes:
>>>>>
>>>>>> On 30.07.2020 17:34, Eric W. Biederman wrote:
>>>>>>> Kirill Tkhai writes:
>>>>>>>
>>>>>>>> Currently, there is no a way to list or iterate all or subset of namespaces
>>>>>>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
>>>>>>>> but some also may be as open files, which are not attached to a process.
>>>>>>>> When a namespace open fd is sent over unix socket and then closed, it is
>>>>>>>> impossible to know whether the namespace exists or not.
>>>>>>>>
>>>>>>>> Also, even if namespace is exposed as attached to a process or as open file,
>>>>>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
>>>>>>>> this multiplies at tasks and fds number.
>>>>>>>
>>>>>>> I am very dubious about this.
>>>>>>>
>>>>>>> I have been avoiding exactly this kind of interface because it can
>>>>>>> create rather fundamental problems with checkpoint restart.
>>>>>>
>>>>>> restart/restore :)
>>>>>>
>>>>>>> You do have some filtering and the filtering is not based on current.
>>>>>>> Which is good.
>>>>>>>
>>>>>>> A view that is relative to a user namespace might be ok. It almost
>>>>>>> certainly does better as it's own little filesystem than as an extension
>>>>>>> to proc though.
>>>>>>>
>>>>>>> The big thing we want to ensure is that if you migrate you can restore
>>>>>>> everything. I don't see how you will be able to restore these files
>>>>>>> after migration. Anything like this without having a complete
>>>>>>> checkpoint/restore story is a non-starter.
>>>>>>
>>>>>> There is no difference between files in /proc/namespaces/ directory and /proc/[pid]/ns/.
>>>>>>
>>>>>> CRIU can restore open files in /proc/[pid]/ns, the same will be with /proc/namespaces/ files.
>>>>>> As a person who worked deeply for pid_ns and user_ns support in CRIU, I don't see any
>>>>>> problem here.
>>>>>
>>>>> An obvious diffference is that you are adding the inode to the inode to
>>>>> the file name. Which means that now you really do have to preserve the
>>>>> inode numbers during process migration.
>>>>>
>>>>> Which means now we have to do all of the work to make inode number
>>>>> restoration possible. Which means now we need to have multiple
>>>>> instances of nsfs so that we can restore inode numbers.
>>>>>
>>>>> I think this is still possible but we have been delaying figuring out
>>>>> how to restore inode numbers long enough that may be actual technical
>>>>> problems making it happen.
>>>>
>>>> Yeah, this matters. But it looks like here is not a dead end. We just need
>>>> change the names the namespaces are exported to particular fs and to support
>>>> rename().
>>>>
>>>> Before introduction a principally new filesystem type for this, can't
>>>> this be solved in current /proc?
>>>
>>> do you mean to introduce names for namespaces which users will be able
>>> to change? By default, this can be uuid.
>>
>> Yes, I mean this.
>>
>> Currently I won't give a final answer about UUID, but I planned to show some
>> default names, which based on namespace type and inode num. Completely custom
>> names for any /proc by default will waste too much memory.
>>
>> So, I think the good way will be:
>>
>> 1)Introduce a function, which returns a hash/uuid based on ino, ns type and some static
>> random seed, which is generated on boot;
>>
>> 2)Use the hash/uuid as default names in newly create /proc/namespaces: pid-{hash/uuid(ino, "pid")}
>>
>> 3)Allow rename, and allocate space only for renamed names.
>>
>> Maybe 2 and 3 will be implemented as shrinkable dentries and non-shrinkable.
>>
>>> And I have a suggestion about the structure of /proc/namespaces/.
>>>
>>> Each namespace is owned by one of user namespaces. Maybe it makes sense
>>> to group namespaces by their user-namespaces?
>>>
>>> /proc/namespaces/
>>> user
>>> mnt-X
>>> mnt-Y
>>> pid-X
>>> uts-Z
>>> user-X/
>>> user
>>> mnt-A
>>> mnt-B
>>> user-C
>>> user-C/
>>> user
>>> user-Y/
>>> user
>>
>> Hm, I don't think that user namespace is a generic key value for everybody.
>> For generic people tasks a user namespace is just a namespace among another
>> namespace types. For me it will look a bit strage to iterate some user namespaces
>> to build container net topology.
>
> I can’t agree with you that the user namespace is one of others. It is
> the namespace for namespaces. It sets security boundaries in the system
> and we need to know them to understand the whole system.
>
> If user namespaces are not used in the system or on a container, you
> will see all namespaces in one directory. But if the system has a more
> complicated structure, you will be able to build a full picture of it.
>
> You said that one of the users of this feature is CRIU (the tool to
> checkpoint/restore containers) and you said that it would be good if
> CRIU will be able to collect all container namespaces before dumping
> processes, sockets, files etc. But how will we be able to do this if we
> will list all namespaces in one directory?

There is no problem here, this looks rather simple. Two cases are possible:

1) The container has a dedicated set of namespaces, and CRIU just has to iterate
over the files in /proc/namespaces of the container's root pid namespace.
The parent/child relationships between pid and user namespaces
are found via ioctl(NS_GET_PARENT).

2) The container has no dedicated set of namespaces. Then CRIU just has to iterate
over all host namespaces. There is no other way to do that, because the container
may use any of the host namespaces, and a hierarchy in /proc/namespaces won't
help you.

> Here are my thoughts why we need to the suggested structure is better
> than just a list of namespaces:
>
> * Users will be able to understand securies bondaries in the system.
> Each namespace in the system is owned by one of user namespace and we
> need to know these relationshipts to understand the whole system.

There are already NS_GET_PARENT and NS_GET_USERNS. What is the problem with
using these interfaces?

> * This is simplify collecting namespaces which belong to one container.
>
> For example, CRIU collects all namespaces before dumping file
> descriptors. Then it collects all sockets with socket-diag in network
> namespaces and collects mount points via /proc/pid/mountinfo in mount
> namesapces. Then these information is used to dump socket file
> descriptors and opened files.

This is exactly what I am saying. It allows us to avoid writing a recursive dump.
But it says nothing about the advantages of a hierarchy in /proc/namespaces.

> * We are going to assign names to namespaces. But this means that we
> need to guarantee that all names in one directory are unique. The
> initial proposal was to enumerate all namespaces in one proc directory,
> that means names of all namespaces have to be unique. This can be
> problematic in some cases. For example, we may want to dump a container
> and then restore it more than once on the same host. How are we going to
> avoid namespace name conficts in such cases?

In a previous message I wrote about .rename for proc files, and Alexey Dobriyan
said this is not taboo. Is there some reason this doesn't cover the case
you point out?

> If we will have per-user-namespace directories, we will need to
> guarantee that names are unique only inside one user namespace.

Unique names inside one user namespace won't introduce a new /proc
mount. You can't pass a sub-directory of /proc/namespaces/ to a specific
container. To give a virtualized name you have to have a dedicated pid ns.

Suppose that in one /proc mount we have:

/mnt1/proc/namespaces/userns1/.../[namespaceX_name1 -- inode XXX]

In another /proc mount we have:

/mnt2/proc/namespaces/userns1/.../[namespaceX_name1_synonym -- inode XXX]

The virtualization is done per /proc (i.e., per pid ns). On restore, the container
should receive either /mnt1/proc or /mnt2/proc as its /proc.

A directory hierarchy makes no sense for virtualization, since
you can't use a specific sub-directory as the root directory of /proc/namespaces
for a container. You still have to introduce a new pid ns to have a virtualized
/proc.

> * With the suggested structure, for each user namepsace, we will show
> only its subtree of namespaces. This looks more natural than
> filltering content of one directory.

That is rather subjective, I think. /proc is related to the pid ns, and a user ns
hierarchy does not look more natural to me.

2020-08-12 17:54:54

by Andrei Vagin

Subject: Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

On Tue, Aug 11, 2020 at 01:23:35PM +0300, Kirill Tkhai wrote:
> On 10.08.2020 20:34, Andrei Vagin wrote:
> > On Fri, Aug 07, 2020 at 11:47:57AM +0300, Kirill Tkhai wrote:
> >> On 06.08.2020 11:05, Andrei Vagin wrote:
> >>> On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
> >>>> On 31.07.2020 01:13, Eric W. Biederman wrote:
> >>>>> Kirill Tkhai writes:
> >>>>>
> >>>>>> On 30.07.2020 17:34, Eric W. Biederman wrote:
> >>>>>>> Kirill Tkhai writes:
> >>>>>>>
> >>>>>>>> Currently, there is no a way to list or iterate all or subset of namespaces
> >>>>>>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
> >>>>>>>> but some also may be as open files, which are not attached to a process.
> >>>>>>>> When a namespace open fd is sent over unix socket and then closed, it is
> >>>>>>>> impossible to know whether the namespace exists or not.
> >>>>>>>>
> >>>>>>>> Also, even if namespace is exposed as attached to a process or as open file,
> >>>>>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
> >>>>>>>> this multiplies at tasks and fds number.
> >>>>>>>
> >>>>>>> I am very dubious about this.
> >>>>>>>
> >>>>>>> I have been avoiding exactly this kind of interface because it can
> >>>>>>> create rather fundamental problems with checkpoint restart.
> >>>>>>
> >>>>>> restart/restore :)
> >>>>>>
> >>>>>>> You do have some filtering and the filtering is not based on current.
> >>>>>>> Which is good.
> >>>>>>>
> >>>>>>> A view that is relative to a user namespace might be ok. It almost
> >>>>>>> certainly does better as it's own little filesystem than as an extension
> >>>>>>> to proc though.
> >>>>>>>
> >>>>>>> The big thing we want to ensure is that if you migrate you can restore
> >>>>>>> everything. I don't see how you will be able to restore these files
> >>>>>>> after migration. Anything like this without having a complete
> >>>>>>> checkpoint/restore story is a non-starter.
> >>>>>>
> >>>>>> There is no difference between files in /proc/namespaces/ directory and /proc/[pid]/ns/.
> >>>>>>
> >>>>>> CRIU can restore open files in /proc/[pid]/ns, the same will be with /proc/namespaces/ files.
> >>>>>> As a person who worked deeply for pid_ns and user_ns support in CRIU, I don't see any
> >>>>>> problem here.
> >>>>>
> >>>>> An obvious diffference is that you are adding the inode to the inode to
> >>>>> the file name. Which means that now you really do have to preserve the
> >>>>> inode numbers during process migration.
> >>>>>
> >>>>> Which means now we have to do all of the work to make inode number
> >>>>> restoration possible. Which means now we need to have multiple
> >>>>> instances of nsfs so that we can restore inode numbers.
> >>>>>
> >>>>> I think this is still possible but we have been delaying figuring out
> >>>>> how to restore inode numbers long enough that may be actual technical
> >>>>> problems making it happen.
> >>>>
> >>>> Yeah, this matters. But it looks like here is not a dead end. We just need
> >>>> change the names the namespaces are exported to particular fs and to support
> >>>> rename().
> >>>>
> >>>> Before introduction a principally new filesystem type for this, can't
> >>>> this be solved in current /proc?
> >>>
> >>> do you mean to introduce names for namespaces which users will be able
> >>> to change? By default, this can be uuid.
> >>
> >> Yes, I mean this.
> >>
> >> Currently I won't give a final answer about UUID, but I planned to show some
> >> default names, which based on namespace type and inode num. Completely custom
> >> names for any /proc by default will waste too much memory.
> >>
> >> So, I think the good way will be:
> >>
> >> 1)Introduce a function, which returns a hash/uuid based on ino, ns type and some static
> >> random seed, which is generated on boot;
> >>
> >> 2)Use the hash/uuid as default names in newly create /proc/namespaces: pid-{hash/uuid(ino, "pid")}
> >>
> >> 3)Allow rename, and allocate space only for renamed names.
> >>
> >> Maybe 2 and 3 will be implemented as shrinkable dentries and non-shrinkable.
> >>
> >>> And I have a suggestion about the structure of /proc/namespaces/.
> >>>
> >>> Each namespace is owned by one of user namespaces. Maybe it makes sense
> >>> to group namespaces by their user-namespaces?
> >>>
> >>> /proc/namespaces/
> >>> user
> >>> mnt-X
> >>> mnt-Y
> >>> pid-X
> >>> uts-Z
> >>> user-X/
> >>> user
> >>> mnt-A
> >>> mnt-B
> >>> user-C
> >>> user-C/
> >>> user
> >>> user-Y/
> >>> user
> >>
> >> Hm, I don't think that user namespace is a generic key value for everybody.
> >> For generic people tasks a user namespace is just a namespace among another
> >> namespace types. For me it will look a bit strage to iterate some user namespaces
> >> to build container net topology.
> >
> > I can’t agree with you that the user namespace is one of others. It is
> > the namespace for namespaces. It sets security boundaries in the system
> > and we need to know them to understand the whole system.
> >
> > If user namespaces are not used in the system or on a container, you
> > will see all namespaces in one directory. But if the system has a more
> > complicated structure, you will be able to build a full picture of it.
> >
> > You said that one of the users of this feature is CRIU (the tool to
> > checkpoint/restore containers) and you said that it would be good if
> > CRIU will be able to collect all container namespaces before dumping
> > processes, sockets, files etc. But how will we be able to do this if we
> > will list all namespaces in one directory?
>
> There is no a problem, this looks rather simple. Two cases are possible:
>
> 1)a container has dedicated namespaces set, and CRIU just has to iterate
> files in /proc/namespaces of root pid namespace of the container.
> The relationships between parents and childs of pid and user namespaces
> are founded via ioctl(NS_GET_PARENT).
>
> 2)container has no dedicated namespaces set. Then CRIU just has to iterate
> all host namespaces. There is no another way to do that, because container
> may have any host namespaces, and hierarchy in /proc/namespaces won't
> help you.
>
> > Here are my thoughts why we need to the suggested structure is better
> > than just a list of namespaces:
> >
> > * Users will be able to understand securies bondaries in the system.
> > Each namespace in the system is owned by one of user namespace and we
> > need to know these relationshipts to understand the whole system.
>
> Here are already NS_GET_PARENT and NS_GET_USERNS. What is the problem to use
> this interfaces?

We can use these ioctl-s, but we will need to enumerate all namespaces in
the system to build a view of the namespace hierarchy. This will be very
expensive. The kernel can show this hierarchy without additional cost.

>
> > * This is simplify collecting namespaces which belong to one container.
> >
> > For example, CRIU collects all namespaces before dumping file
> > descriptors. Then it collects all sockets with socket-diag in network
> > namespaces and collects mount points via /proc/pid/mountinfo in mount
> > namesapces. Then these information is used to dump socket file
> > descriptors and opened files.
>
> This is just the thing I say. This allows to avoid writing recursive dump.

I don't understand this. How are you going to collect namespaces in CRIU
without knowing which are used by a dumped container?

> But this has nothing about advantages of hierarchy in /proc/namespaces.

Really? You said that you implemented this series to help CRIU dump
namespaces. I think we need to implement the CRIU part to prove that
this interface is usable for this case. Right now, I have doubts about
this.

>
> > * We are going to assign names to namespaces. But this means that we
> > need to guarantee that all names in one directory are unique. The
> > initial proposal was to enumerate all namespaces in one proc directory,
> > that means names of all namespaces have to be unique. This can be
> > problematic in some cases. For example, we may want to dump a container
> > and then restore it more than once on the same host. How are we going to
> > avoid namespace name conficts in such cases?
>
> Previous message I wrote about .rename of proc files, Alexey Dobriyan
> said this is not a taboo. Are there problem which doesn't cover the case
> you point?

Yes, there is. Namespace names will be visible from a container, so they
have to be restored. But this means that two containers can't be
restored from the same snapshot due to namespace name conflicts.

But if we show namespaces the way I suggest, each container will see
only its sub-tree of namespaces and we will be able to specify any name
for the container's root user namespace.
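
The name-conflict argument can be illustrated with a toy Python model. Plain dictionaries stand in for directories here, and restore_flat and restore_nested are hypothetical helpers made up for this sketch, not CRIU or kernel interfaces.

```python
def restore_flat(directory: dict, snapshot_names):
    """Restore names into one flat directory; fail on the first existing name."""
    for name in snapshot_names:
        if name in directory:
            raise FileExistsError(name)
        directory[name] = object()


def restore_nested(root: dict, container_root: str, snapshot_names):
    """Restore names under a per-container user-namespace directory.

    Only `container_root`, which the restorer is free to choose, has to be
    unique in the parent directory; the names inside are kept as dumped.
    """
    if container_root in root:
        raise FileExistsError(container_root)
    root[container_root] = {name: object() for name in snapshot_names}


snapshot = ["pid-ct", "net-ct", "mnt-ct"]

flat = {}
restore_flat(flat, snapshot)
try:
    restore_flat(flat, snapshot)      # second restore of the same snapshot
except FileExistsError as err:
    print("flat layout conflict:", err)

root = {}
restore_nested(root, "user-ct1", snapshot)
restore_nested(root, "user-ct2", snapshot)   # same inner names, no conflict
print(sorted(root))
```

The sketch only models name uniqueness; whether a container can actually be handed its own sub-tree as its view is a separate question.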

>
> > If we will have per-user-namespace directories, we will need to
> > guarantee that names are unique only inside one user namespace.
>
> Unique names inside one user namespace won't introduce a new /proc
> mount. You can't pass a sub-directory of /proc/namespaces/ to a specific
> container. To give a virtualized name you have to have a dedicated pid ns.
>
> Let we have in one /proc mount:
>
> /mnt1/proc/namespaces/userns1/.../[namespaceX_name1 -- inode XXX]
>
> In another another /proc mount we have:
>
> /mnt2/proc/namespaces/userns1/.../[namespaceX_name1_synonym -- inode XXX]
>
> The virtualization is made per /proc (i.e., per pid ns). Container should
> receive either /mnt1/proc or /mnt2/proc on restore as it's /proc.
>
> There is no a sense of directory hierarchy for virtualization, since
> you can't use specific sub-directory as a root directory of /proc/namespaces
> to a container. You still have to introduce a new pid ns to have virtualized
> /proc.

I think we can figure out how to implement this. As a first idea, we
can do it the same way /proc/net is implemented.

>
> > * With the suggested structure, for each user namepsace, we will show
> > only its subtree of namespaces. This looks more natural than
> > filltering content of one directory.
>
> It's rather subjectively I think. /proc is related to pid ns, and user ns
> hierarchy does not look more natural for me.

Or /proc is the wrong place for this.

2020-08-13 08:14:00

by Kirill Tkhai

Subject: Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

On 12.08.2020 20:53, Andrei Vagin wrote:
> On Tue, Aug 11, 2020 at 01:23:35PM +0300, Kirill Tkhai wrote:
>> On 10.08.2020 20:34, Andrei Vagin wrote:
>>> On Fri, Aug 07, 2020 at 11:47:57AM +0300, Kirill Tkhai wrote:
>>>> On 06.08.2020 11:05, Andrei Vagin wrote:
>>>>> On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
>>>>>> On 31.07.2020 01:13, Eric W. Biederman wrote:
>>>>>>> Kirill Tkhai writes:
>>>>>>>
>>>>>>>> On 30.07.2020 17:34, Eric W. Biederman wrote:
>>>>>>>>> Kirill Tkhai writes:
>>>>>>>>>
>>>>>>>>>> Currently, there is no a way to list or iterate all or subset of namespaces
>>>>>>>>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
>>>>>>>>>> but some also may be as open files, which are not attached to a process.
>>>>>>>>>> When a namespace open fd is sent over unix socket and then closed, it is
>>>>>>>>>> impossible to know whether the namespace exists or not.
>>>>>>>>>>
>>>>>>>>>> Also, even if namespace is exposed as attached to a process or as open file,
>>>>>>>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
>>>>>>>>>> this multiplies at tasks and fds number.
>>>>>>>>>
>>>>>>>>> I am very dubious about this.
>>>>>>>>>
>>>>>>>>> I have been avoiding exactly this kind of interface because it can
>>>>>>>>> create rather fundamental problems with checkpoint restart.
>>>>>>>>
>>>>>>>> restart/restore :)
>>>>>>>>
>>>>>>>>> You do have some filtering and the filtering is not based on current.
>>>>>>>>> Which is good.
>>>>>>>>>
>>>>>>>>> A view that is relative to a user namespace might be ok. It almost
>>>>>>>>> certainly does better as it's own little filesystem than as an extension
>>>>>>>>> to proc though.
>>>>>>>>>
>>>>>>>>> The big thing we want to ensure is that if you migrate you can restore
>>>>>>>>> everything. I don't see how you will be able to restore these files
>>>>>>>>> after migration. Anything like this without having a complete
>>>>>>>>> checkpoint/restore story is a non-starter.
>>>>>>>>
>>>>>>>> There is no difference between files in /proc/namespaces/ directory and /proc/[pid]/ns/.
>>>>>>>>
>>>>>>>> CRIU can restore open files in /proc/[pid]/ns, the same will be with /proc/namespaces/ files.
>>>>>>>> As a person who worked deeply for pid_ns and user_ns support in CRIU, I don't see any
>>>>>>>> problem here.
>>>>>>>
>>>>>>> An obvious diffference is that you are adding the inode to the inode to
>>>>>>> the file name. Which means that now you really do have to preserve the
>>>>>>> inode numbers during process migration.
>>>>>>>
>>>>>>> Which means now we have to do all of the work to make inode number
>>>>>>> restoration possible. Which means now we need to have multiple
>>>>>>> instances of nsfs so that we can restore inode numbers.
>>>>>>>
>>>>>>> I think this is still possible but we have been delaying figuring out
>>>>>>> how to restore inode numbers long enough that may be actual technical
>>>>>>> problems making it happen.
>>>>>>
>>>>>> Yeah, this matters. But it looks like here is not a dead end. We just need
>>>>>> change the names the namespaces are exported to particular fs and to support
>>>>>> rename().
>>>>>>
>>>>>> Before introduction a principally new filesystem type for this, can't
>>>>>> this be solved in current /proc?
>>>>>
>>>>> do you mean to introduce names for namespaces which users will be able
>>>>> to change? By default, this can be uuid.
>>>>
>>>> Yes, I mean this.
>>>>
>>>> Currently I won't give a final answer about UUID, but I planned to show some
>>>> default names, which based on namespace type and inode num. Completely custom
>>>> names for any /proc by default will waste too much memory.
>>>>
>>>> So, I think the good way will be:
>>>>
>>>> 1)Introduce a function, which returns a hash/uuid based on ino, ns type and some static
>>>> random seed, which is generated on boot;
>>>>
>>>> 2)Use the hash/uuid as default names in newly create /proc/namespaces: pid-{hash/uuid(ino, "pid")}
>>>>
>>>> 3)Allow rename, and allocate space only for renamed names.
>>>>
>>>> Maybe 2 and 3 will be implemented as shrinkable dentries and non-shrinkable.
>>>>
>>>>> And I have a suggestion about the structure of /proc/namespaces/.
>>>>>
>>>>> Each namespace is owned by one of user namespaces. Maybe it makes sense
>>>>> to group namespaces by their user-namespaces?
>>>>>
>>>>> /proc/namespaces/
>>>>> user
>>>>> mnt-X
>>>>> mnt-Y
>>>>> pid-X
>>>>> uts-Z
>>>>> user-X/
>>>>> user
>>>>> mnt-A
>>>>> mnt-B
>>>>> user-C
>>>>> user-C/
>>>>> user
>>>>> user-Y/
>>>>> user
>>>>
>>>> Hm, I don't think that user namespace is a generic key value for everybody.
>>>> For generic people tasks a user namespace is just a namespace among another
>>>> namespace types. For me it will look a bit strage to iterate some user namespaces
>>>> to build container net topology.
>>>
>>> I can’t agree with you that the user namespace is one of others. It is
>>> the namespace for namespaces. It sets security boundaries in the system
>>> and we need to know them to understand the whole system.
>>>
>>> If user namespaces are not used in the system or on a container, you
>>> will see all namespaces in one directory. But if the system has a more
>>> complicated structure, you will be able to build a full picture of it.
>>>
>>> You said that one of the users of this feature is CRIU (the tool to
>>> checkpoint/restore containers) and you said that it would be good if
>>> CRIU will be able to collect all container namespaces before dumping
>>> processes, sockets, files etc. But how will we be able to do this if we
>>> will list all namespaces in one directory?
>>
>> There is no a problem, this looks rather simple. Two cases are possible:
>>
>> 1)a container has dedicated namespaces set, and CRIU just has to iterate
>> files in /proc/namespaces of root pid namespace of the container.
>> The relationships between parents and childs of pid and user namespaces
>> are founded via ioctl(NS_GET_PARENT).
>>
>> 2)container has no dedicated namespaces set. Then CRIU just has to iterate
>> all host namespaces. There is no another way to do that, because container
>> may have any host namespaces, and hierarchy in /proc/namespaces won't
>> help you.
>>
>>> Here are my thoughts why we need to the suggested structure is better
>>> than just a list of namespaces:
>>>
>>> * Users will be able to understand securies bondaries in the system.
>>> Each namespace in the system is owned by one of user namespace and we
>>> need to know these relationshipts to understand the whole system.
>>
>> Here are already NS_GET_PARENT and NS_GET_USERNS. What is the problem to use
>> this interfaces?
>
> We can use these ioctl-s, but we will need to enumerate all namespaces in
> the system to build a view of the namespace hierarchy. This will be very
> expensive. The kernel can show this hierarchy without additional cost.

No. We will only have to iterate /proc/namespaces of a specific container to
get its namespaces. It's a subset of all namespaces in the system, and these are
all the namespaces which are potentially allowed for the container.

>>
>>> * This is simplify collecting namespaces which belong to one container.
>>>
>>> For example, CRIU collects all namespaces before dumping file
>>> descriptors. Then it collects all sockets with socket-diag in network
>>> namespaces and collects mount points via /proc/pid/mountinfo in mount
>>> namesapces. Then these information is used to dump socket file
>>> descriptors and opened files.
>>
>> This is just the thing I say. This allows to avoid writing recursive dump.
>
> I don't understand this. How are you going to collect namespaces in CRIU
> without knowing which are used by a dumped container?

My patchset exports only the namespaces which are allowed for a specific
container, and nothing beyond that. All exported namespaces are alive,
so someone holds a reference on each of them. So they are used.

It seems you haven't understood the way I suggested here. See patch [11/23]
for the details. It's about permissions, and the subset of exported namespaces
is formalized there.
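
For illustration only, and not the code from [11/23]: per the patch
description, a namespace is shown when its owning user namespace is the same
as /proc's user namespace or one of its descendants. A minimal Python model of
that check, where the classes UserNS and NS and all names are hypothetical
stand-ins for the kernel structures:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class UserNS:
    """Toy user namespace: a name and a parent pointer (hypothetical model)."""
    name: str
    parent: Optional["UserNS"] = None


@dataclass(eq=False)
class NS:
    """Toy namespace of any type, owned by exactly one user namespace."""
    name: str
    owner: UserNS


def is_same_or_descendant(candidate: UserNS, ancestor: UserNS) -> bool:
    """Walk candidate's parent chain until ancestor or the root is reached."""
    cur: Optional[UserNS] = candidate
    while cur is not None:
        if cur is ancestor:
            return True
        cur = cur.parent
    return False


def visible(namespaces, proc_userns):
    """Names that a /proc instance tied to proc_userns would list."""
    return [ns.name for ns in namespaces
            if is_same_or_descendant(ns.owner, proc_userns)]


init_user = UserNS("user-init")
ct_user = UserNS("user-ct", parent=init_user)
nested_user = UserNS("user-nested", parent=ct_user)

all_ns = [NS("net-host", init_user),
          NS("mnt-ct", ct_user),
          NS("uts-nested", nested_user)]

print(visible(all_ns, ct_user))    # ['mnt-ct', 'uts-nested']
print(visible(all_ns, init_user))  # ['net-host', 'mnt-ct', 'uts-nested']
```

In this model the container's /proc never lists net-host, because its owner
lies above the container's user namespace.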

>> But this has nothing about advantages of hierarchy in /proc/namespaces.
>
> Really? You said that you implemented this series to help CRIU dumping
> namespaces. I think we need to implement the CRIU part to prove that
> this interface is usable for this case. Right now, I have doubts about
> this.

Yes, really. See my comment above and patch [11/23].

>>
>>> * We are going to assign names to namespaces. But this means that we
>>> need to guarantee that all names in one directory are unique. The
>>> initial proposal was to enumerate all namespaces in one proc directory,
>>> that means names of all namespaces have to be unique. This can be
>>> problematic in some cases. For example, we may want to dump a container
>>> and then restore it more than once on the same host. How are we going to
>>> avoid namespace name conficts in such cases?
>>
>> Previous message I wrote about .rename of proc files, Alexey Dobriyan
>> said this is not a taboo. Are there problem which doesn't cover the case
>> you point?
>
> Yes, there is. Namespace names will be visible from a container, so they
> have to be restored. But this means that two containers can't be
> restored from the same snapshot due to namespace name conflicts.
>
> But if we will show namespaces how I suggest, each container will see
> only its sub-tree of namespaces and we will be able to specify any name
> for the container root user namespace.

Now I'm sure you missed my idea. See proc_namespaces_readdir() in [11/23].

I do export sub-tree.

>>
>>> If we will have per-user-namespace directories, we will need to
>>> guarantee that names are unique only inside one user namespace.
>>
>> Unique names inside one user namespace won't introduce a new /proc
>> mount. You can't pass a sub-directory of /proc/namespaces/ to a specific
>> container. To give a virtualized name you have to have a dedicated pid ns.
>>
>> Let we have in one /proc mount:
>>
>> /mnt1/proc/namespaces/userns1/.../[namespaceX_name1 -- inode XXX]
>>
>> In another another /proc mount we have:
>>
>> /mnt2/proc/namespaces/userns1/.../[namespaceX_name1_synonym -- inode XXX]
>>
>> The virtualization is made per /proc (i.e., per pid ns). Container should
>> receive either /mnt1/proc or /mnt2/proc on restore as it's /proc.
>>
>> There is no a sense of directory hierarchy for virtualization, since
>> you can't use specific sub-directory as a root directory of /proc/namespaces
>> to a container. You still have to introduce a new pid ns to have virtualized
>> /proc.
>
> I think we can figure out how to implement this. As the first idea, we
> can use the same way how /proc/net is implemented.
>
>>
>>> * With the suggested structure, for each user namepsace, we will show
>>> only its subtree of namespaces. This looks more natural than
>>> filltering content of one directory.
>>
>> It's rather subjectively I think. /proc is related to pid ns, and user ns
>> hierarchy does not look more natural for me.
>
> or /proc is wrong place for this

2020-08-14 01:17:54

by Andrei Vagin

Subject: Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

On Thu, Aug 13, 2020 at 11:12:45AM +0300, Kirill Tkhai wrote:
> On 12.08.2020 20:53, Andrei Vagin wrote:
> > On Tue, Aug 11, 2020 at 01:23:35PM +0300, Kirill Tkhai wrote:
> >> On 10.08.2020 20:34, Andrei Vagin wrote:
> >>> On Fri, Aug 07, 2020 at 11:47:57AM +0300, Kirill Tkhai wrote:
> >>>> On 06.08.2020 11:05, Andrei Vagin wrote:
> >>>>> On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
> >>>>>> On 31.07.2020 01:13, Eric W. Biederman wrote:
> >>>>>>> Kirill Tkhai <[email protected]> writes:
> >>>>>>>
> >>>>>>>> On 30.07.2020 17:34, Eric W. Biederman wrote:
> >>>>>>>>> Kirill Tkhai <[email protected]> writes:
> >>>>>>>>>
> >>>>>>>>>> Currently, there is no a way to list or iterate all or subset of namespaces
> >>>>>>>>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
> >>>>>>>>>> but some also may be as open files, which are not attached to a process.
> >>>>>>>>>> When a namespace open fd is sent over unix socket and then closed, it is
> >>>>>>>>>> impossible to know whether the namespace exists or not.
> >>>>>>>>>>
> >>>>>>>>>> Also, even if namespace is exposed as attached to a process or as open file,
> >>>>>>>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
> >>>>>>>>>> this multiplies at tasks and fds number.
> >>>>>>>>>
> >>>>>>>>> I am very dubious about this.
> >>>>>>>>>
> >>>>>>>>> I have been avoiding exactly this kind of interface because it can
> >>>>>>>>> create rather fundamental problems with checkpoint restart.
> >>>>>>>>
> >>>>>>>> restart/restore :)
> >>>>>>>>
> >>>>>>>>> You do have some filtering and the filtering is not based on current.
> >>>>>>>>> Which is good.
> >>>>>>>>>
> >>>>>>>>> A view that is relative to a user namespace might be ok. It almost
> >>>>>>>>> certainly does better as it's own little filesystem than as an extension
> >>>>>>>>> to proc though.
> >>>>>>>>>
> >>>>>>>>> The big thing we want to ensure is that if you migrate you can restore
> >>>>>>>>> everything. I don't see how you will be able to restore these files
> >>>>>>>>> after migration. Anything like this without having a complete
> >>>>>>>>> checkpoint/restore story is a non-starter.
> >>>>>>>>
> >>>>>>>> There is no difference between files in /proc/namespaces/ directory and /proc/[pid]/ns/.
> >>>>>>>>
> >>>>>>>> CRIU can restore open files in /proc/[pid]/ns, the same will be with /proc/namespaces/ files.
> >>>>>>>> As a person who worked deeply for pid_ns and user_ns support in CRIU, I don't see any
> >>>>>>>> problem here.
> >>>>>>>
> >>>>>>> An obvious diffference is that you are adding the inode to the inode to
> >>>>>>> the file name. Which means that now you really do have to preserve the
> >>>>>>> inode numbers during process migration.
> >>>>>>>
> >>>>>>> Which means now we have to do all of the work to make inode number
> >>>>>>> restoration possible. Which means now we need to have multiple
> >>>>>>> instances of nsfs so that we can restore inode numbers.
> >>>>>>>
> >>>>>>> I think this is still possible but we have been delaying figuring out
> >>>>>>> how to restore inode numbers long enough that may be actual technical
> >>>>>>> problems making it happen.
> >>>>>>
> >>>>>> Yeah, this matters. But it looks like here is not a dead end. We just need
> >>>>>> change the names the namespaces are exported to particular fs and to support
> >>>>>> rename().
> >>>>>>
> >>>>>> Before introduction a principally new filesystem type for this, can't
> >>>>>> this be solved in current /proc?
> >>>>>
> >>>>> do you mean to introduce names for namespaces which users will be able
> >>>>> to change? By default, this can be uuid.
> >>>>
> >>>> Yes, I mean this.
> >>>>
> >>>> Currently I won't give a final answer about UUID, but I planned to show some
> >>>> default names, which based on namespace type and inode num. Completely custom
> >>>> names for any /proc by default will waste too much memory.
> >>>>
> >>>> So, I think the good way will be:
> >>>>
> >>>> 1)Introduce a function, which returns a hash/uuid based on ino, ns type and some static
> >>>> random seed, which is generated on boot;
> >>>>
> >>>> 2)Use the hash/uuid as default names in newly create /proc/namespaces: pid-{hash/uuid(ino, "pid")}
> >>>>
> >>>> 3)Allow rename, and allocate space only for renamed names.
> >>>>
> >>>> Maybe 2 and 3 will be implemented as shrinkable dentries and non-shrinkable.
> >>>>
> >>>>> And I have a suggestion about the structure of /proc/namespaces/.
> >>>>>
> >>>>> Each namespace is owned by one of user namespaces. Maybe it makes sense
> >>>>> to group namespaces by their user-namespaces?
> >>>>>
> >>>>> /proc/namespaces/
> >>>>> user
> >>>>> mnt-X
> >>>>> mnt-Y
> >>>>> pid-X
> >>>>> uts-Z
> >>>>> user-X/
> >>>>> user
> >>>>> mnt-A
> >>>>> mnt-B
> >>>>> user-C
> >>>>> user-C/
> >>>>> user
> >>>>> user-Y/
> >>>>> user
> >>>>
> >>>> Hm, I don't think that user namespace is a generic key value for everybody.
> >>>> For generic people tasks a user namespace is just a namespace among another
> >>>> namespace types. For me it will look a bit strage to iterate some user namespaces
> >>>> to build container net topology.
> >>>
> >>> I can’t agree with you that the user namespace is one of others. It is
> >>> the namespace for namespaces. It sets security boundaries in the system
> >>> and we need to know them to understand the whole system.
> >>>
> >>> If user namespaces are not used in the system or on a container, you
> >>> will see all namespaces in one directory. But if the system has a more
> >>> complicated structure, you will be able to build a full picture of it.
> >>>
> >>> You said that one of the users of this feature is CRIU (the tool to
> >>> checkpoint/restore containers) and you said that it would be good if
> >>> CRIU will be able to collect all container namespaces before dumping
> >>> processes, sockets, files etc. But how will we be able to do this if we
> >>> will list all namespaces in one directory?
> >>
> >> There is no a problem, this looks rather simple. Two cases are possible:
> >>
> >> 1)a container has dedicated namespaces set, and CRIU just has to iterate
> >> files in /proc/namespaces of root pid namespace of the container.
> >> The relationships between parents and childs of pid and user namespaces
> >> are founded via ioctl(NS_GET_PARENT).
> >>
> >> 2)container has no dedicated namespaces set. Then CRIU just has to iterate
> >> all host namespaces. There is no another way to do that, because container
> >> may have any host namespaces, and hierarchy in /proc/namespaces won't
> >> help you.
> >>
> >>> Here are my thoughts why we need to the suggested structure is better
> >>> than just a list of namespaces:
> >>>
> >>> * Users will be able to understand securies bondaries in the system.
> >>> Each namespace in the system is owned by one of user namespace and we
> >>> need to know these relationshipts to understand the whole system.
> >>
> >> Here are already NS_GET_PARENT and NS_GET_USERNS. What is the problem to use
> >> this interfaces?
> >
> > We can use these ioctl-s, but we will need to enumerate all namespaces in
> > the system to build a view of the namespace hierarchy. This will be very
> > expensive. The kernel can show this hierarchy without additional cost.
>
> No. We will have to iterate /proc/namespaces of a specific container to get
> its namespaces. It's a subset of all namespaces in system, and these all the
> namespaces, which are potentially allowed for the container.

"""
Every /proc is related to a pid_namespace, and the pid_namespace
is related to a user_namespace. The items, we show in this
/proc/namespaces/ directory, are the namespaces,
whose user_namespaces are the same as /proc's user_namespace,
or their descendants.
""" // [PATCH 11/23] fs: Add /proc/namespaces/ directory

This means that if a user wants to find out all container namespaces, they
have to have access to the container procfs and the container should
have a separate pid namespace.

I would say these are two big limitations. The first one will not affect
CRIU and I agree CRIU can use this interface in its current form.

The second one will still be an issue for CRIU. And they both will
affect other users.

For end users, it will be a pain. They will need to create a pid
namespace in a specified user namespace, if a container doesn't have
its own. Then they will need to mount /proc from the container pid
namespace and only then they will be able to enumerate namespaces.

But to build a view of a hierarchy of these namespaces, they will need to
use a binary tool which will open each of these namespaces, call
NS_GET_PARENT and NS_GET_USERNS ioctl-s and build a tree.
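
A rough sketch of what such a tool has to do for each namespace file, written
in Python for brevity. The request numbers are taken from <linux/nsfs.h>
(_IO(0xb7, 0x1) and _IO(0xb7, 0x2)) and assume an architecture where _IO()
encodes as (type << 8) | nr, such as x86 or arm. The helper name
related_ns_ino is made up for this example.

```python
import fcntl
import os

# Assumed encodings of the nsfs ioctls on x86/arm (see lead-in).
NS_GET_USERNS = 0xB701
NS_GET_PARENT = 0xB702


def related_ns_ino(path, request):
    """Return the inode number of the namespace that `request` yields for the
    namespace file at `path`, or None if it cannot be queried (no such file,
    no permission, no parent, or ioctl not supported)."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError:
        return None
    try:
        # On success the ioctl returns a new fd referring to the
        # owning user namespace or to the parent namespace.
        rel_fd = fcntl.ioctl(fd, request)
    except OSError:
        return None
    finally:
        os.close(fd)
    try:
        return os.fstat(rel_fd).st_ino
    finally:
        os.close(rel_fd)


for kind in ("uts", "net", "pid", "user"):
    path = f"/proc/self/ns/{kind}"
    print(kind, "owning userns inode:", related_ns_ino(path, NS_GET_USERNS))
```

NS_GET_PARENT is only meaningful for pid and user namespaces, so a tree
builder would call it for those two types and NS_GET_USERNS for the rest,
repeating this for every entry it enumerates.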

>
> >>
> >>> * This is simplify collecting namespaces which belong to one container.
> >>>
> >>> For example, CRIU collects all namespaces before dumping file
> >>> descriptors. Then it collects all sockets with socket-diag in network
> >>> namespaces and collects mount points via /proc/pid/mountinfo in mount
> >>> namesapces. Then these information is used to dump socket file
> >>> descriptors and opened files.
> >>
> >> This is just the thing I say. This allows to avoid writing recursive dump.
> >
> > I don't understand this. How are you going to collect namespaces in CRIU
> > without knowing which are used by a dumped container?
>
> My patchset exports only the namespaces, which are allowed for a specific
> container, and no more above this. All exported namespaces are alive,
> so someone holds a reference on every of it. So they are used.
>
> It seems you haven't understood the way I suggested here. See patch [11/23]
> for the details. It's about permissions, and the subset of exported namespaces
> is formalized there.

Honestly, I have not read all patches in this series and you didn't
describe this behavior in the cover letter. Thank you for pointing me
to patch 11, but I still think it doesn't solve the problem
completely. More details are in the comment which is a few lines above
this one.

>
> >> But this has nothing about advantages of hierarchy in /proc/namespaces.

Yes, it has. For example, in cases when a container doesn't have its own
pid namespace.

> >
> > Really? You said that you implemented this series to help CRIU dumping
> > namespaces. I think we need to implement the CRIU part to prove that
> > this interface is usable for this case. Right now, I have doubts about
> > this.
>
> Yes, really. See my comment above and patch [11/23].
>
> >>
> >>> * We are going to assign names to namespaces. But this means that we
> >>> need to guarantee that all names in one directory are unique. The
> >>> initial proposal was to enumerate all namespaces in one proc directory,
> >>> that means names of all namespaces have to be unique. This can be
> >>> problematic in some cases. For example, we may want to dump a container
> >>> and then restore it more than once on the same host. How are we going to
> >>> avoid namespace name conficts in such cases?
> >>
> >> Previous message I wrote about .rename of proc files, Alexey Dobriyan
> >> said this is not a taboo. Are there problem which doesn't cover the case
> >> you point?
> >
> > Yes, there is. Namespace names will be visible from a container, so they
> > have to be restored. But this means that two containers can't be
> > restored from the same snapshot due to namespace name conflicts.
> >
> > But if we will show namespaces how I suggest, each container will see
> > only its sub-tree of namespaces and we will be able to specify any name
> > for the container root user namespace.
>
> Now I'm sure you missed my idea. See proc_namespaces_readdir() in [11/23].
>
> I do export sub-tree.

I got your idea, but it is unclear how you are going to avoid name
conflicts.

In the root container, you will show all namespaces in the system. This
means that all namespaces have to have unique names. This means we will
not be able to restore two containers from the same snapshot without
renaming namespaces. But we can't change namespace names, because they
are visible from containers and container processes can use them.
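
A toy model of this concern, only as an illustration and not how the patchset
stores names: a flat directory behaves like a single name -> namespace map, so
restoring the same snapshot twice tries to register the same names twice. The
class FlatDir and the names net-web and mnt-web are hypothetical.

```python
class FlatDir:
    """Toy model of one directory: every name in it must be unique."""

    def __init__(self):
        self.entries = {}

    def add(self, name, ns_id):
        if name in self.entries:
            raise FileExistsError(name)
        self.entries[name] = ns_id


def restore(directory, snapshot, first_id):
    """Register every name from `snapshot` under fresh ids and
    return the list of names that collided with existing entries."""
    conflicts = []
    for offset, name in enumerate(snapshot):
        try:
            directory.add(name, first_id + offset)
        except FileExistsError:
            conflicts.append(name)
    return conflicts


snapshot = ["net-web", "mnt-web"]  # names the container saw before the dump

host = FlatDir()
print(restore(host, snapshot, 100))  # []
print(restore(host, snapshot, 200))  # ['net-web', 'mnt-web']

# One directory per restored container scopes the names instead.
ct1, ct2 = FlatDir(), FlatDir()
print(restore(ct1, snapshot, 100), restore(ct2, snapshot, 200))  # [] []
```

The last line models the per-user-namespace grouping only in the sense that
each restored container gets its own name scope.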

>
> >>
> >>> If we will have per-user-namespace directories, we will need to
> >>> guarantee that names are unique only inside one user namespace.
> >>
> >> Unique names inside one user namespace won't introduce a new /proc
> >> mount. You can't pass a sub-directory of /proc/namespaces/ to a specific
> >> container. To give a virtualized name you have to have a dedicated pid ns.
> >>
> >> Let we have in one /proc mount:
> >>
> >> /mnt1/proc/namespaces/userns1/.../[namespaceX_name1 -- inode XXX]
> >>
> >> In another another /proc mount we have:
> >>
> >> /mnt2/proc/namespaces/userns1/.../[namespaceX_name1_synonym -- inode XXX]
> >>
> >> The virtualization is made per /proc (i.e., per pid ns). Container should
> >> receive either /mnt1/proc or /mnt2/proc on restore as it's /proc.
> >>
> >> There is no a sense of directory hierarchy for virtualization, since
> >> you can't use specific sub-directory as a root directory of /proc/namespaces
> >> to a container. You still have to introduce a new pid ns to have virtualized
> >> /proc.
> >
> > I think we can figure out how to implement this. As the first idea, we
> > can use the same way how /proc/net is implemented.
> >
> >>
> >>> * With the suggested structure, for each user namepsace, we will show
> >>> only its subtree of namespaces. This looks more natural than
> >>> filltering content of one directory.
> >>
> >> It's rather subjectively I think. /proc is related to pid ns, and user ns
> >> hierarchy does not look more natural for me.
> >
> > or /proc is wrong place for this

2020-08-14 16:08:33

by Kirill Tkhai

Subject: Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

On 14.08.2020 04:16, Andrei Vagin wrote:
> On Thu, Aug 13, 2020 at 11:12:45AM +0300, Kirill Tkhai wrote:
>> On 12.08.2020 20:53, Andrei Vagin wrote:
>>> On Tue, Aug 11, 2020 at 01:23:35PM +0300, Kirill Tkhai wrote:
>>>> On 10.08.2020 20:34, Andrei Vagin wrote:
>>>>> On Fri, Aug 07, 2020 at 11:47:57AM +0300, Kirill Tkhai wrote:
>>>>>> On 06.08.2020 11:05, Andrei Vagin wrote:
>>>>>>> On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
>>>>>>>> On 31.07.2020 01:13, Eric W. Biederman wrote:
>>>>>>>>> Kirill Tkhai <[email protected]> writes:
>>>>>>>>>
>>>>>>>>>> On 30.07.2020 17:34, Eric W. Biederman wrote:
>>>>>>>>>>> Kirill Tkhai <[email protected]> writes:
>>>>>>>>>>>
>>>>>>>>>>>> Currently, there is no a way to list or iterate all or subset of namespaces
>>>>>>>>>>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
>>>>>>>>>>>> but some also may be as open files, which are not attached to a process.
>>>>>>>>>>>> When a namespace open fd is sent over unix socket and then closed, it is
>>>>>>>>>>>> impossible to know whether the namespace exists or not.
>>>>>>>>>>>>
>>>>>>>>>>>> Also, even if namespace is exposed as attached to a process or as open file,
>>>>>>>>>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
>>>>>>>>>>>> this multiplies at tasks and fds number.
>>>>>>>>>>>
>>>>>>>>>>> I am very dubious about this.
>>>>>>>>>>>
>>>>>>>>>>> I have been avoiding exactly this kind of interface because it can
>>>>>>>>>>> create rather fundamental problems with checkpoint restart.
>>>>>>>>>>
>>>>>>>>>> restart/restore :)
>>>>>>>>>>
>>>>>>>>>>> You do have some filtering and the filtering is not based on current.
>>>>>>>>>>> Which is good.
>>>>>>>>>>>
>>>>>>>>>>> A view that is relative to a user namespace might be ok. It almost
>>>>>>>>>>> certainly does better as it's own little filesystem than as an extension
>>>>>>>>>>> to proc though.
>>>>>>>>>>>
>>>>>>>>>>> The big thing we want to ensure is that if you migrate you can restore
>>>>>>>>>>> everything. I don't see how you will be able to restore these files
>>>>>>>>>>> after migration. Anything like this without having a complete
>>>>>>>>>>> checkpoint/restore story is a non-starter.
>>>>>>>>>>
>>>>>>>>>> There is no difference between files in /proc/namespaces/ directory and /proc/[pid]/ns/.
>>>>>>>>>>
>>>>>>>>>> CRIU can restore open files in /proc/[pid]/ns, the same will be with /proc/namespaces/ files.
>>>>>>>>>> As a person who worked deeply for pid_ns and user_ns support in CRIU, I don't see any
>>>>>>>>>> problem here.
>>>>>>>>>
>>>>>>>>> An obvious diffference is that you are adding the inode to the inode to
>>>>>>>>> the file name. Which means that now you really do have to preserve the
>>>>>>>>> inode numbers during process migration.
>>>>>>>>>
>>>>>>>>> Which means now we have to do all of the work to make inode number
>>>>>>>>> restoration possible. Which means now we need to have multiple
>>>>>>>>> instances of nsfs so that we can restore inode numbers.
>>>>>>>>>
>>>>>>>>> I think this is still possible but we have been delaying figuring out
>>>>>>>>> how to restore inode numbers long enough that may be actual technical
>>>>>>>>> problems making it happen.
>>>>>>>>
>>>>>>>> Yeah, this matters. But it looks like here is not a dead end. We just need
>>>>>>>> change the names the namespaces are exported to particular fs and to support
>>>>>>>> rename().
>>>>>>>>
>>>>>>>> Before introduction a principally new filesystem type for this, can't
>>>>>>>> this be solved in current /proc?
>>>>>>>
>>>>>>> do you mean to introduce names for namespaces which users will be able
>>>>>>> to change? By default, this can be uuid.
>>>>>>
>>>>>> Yes, I mean this.
>>>>>>
>>>>>> Currently I won't give a final answer about UUID, but I planned to show some
>>>>>> default names, which based on namespace type and inode num. Completely custom
>>>>>> names for any /proc by default will waste too much memory.
>>>>>>
>>>>>> So, I think the good way will be:
>>>>>>
>>>>>> 1)Introduce a function, which returns a hash/uuid based on ino, ns type and some static
>>>>>> random seed, which is generated on boot;
>>>>>>
>>>>>> 2)Use the hash/uuid as default names in newly create /proc/namespaces: pid-{hash/uuid(ino, "pid")}
>>>>>>
>>>>>> 3)Allow rename, and allocate space only for renamed names.
>>>>>>
>>>>>> Maybe 2 and 3 will be implemented as shrinkable dentries and non-shrinkable.
>>>>>>
>>>>>>> And I have a suggestion about the structure of /proc/namespaces/.
>>>>>>>
>>>>>>> Each namespace is owned by one of user namespaces. Maybe it makes sense
>>>>>>> to group namespaces by their user-namespaces?
>>>>>>>
>>>>>>> /proc/namespaces/
>>>>>>> user
>>>>>>> mnt-X
>>>>>>> mnt-Y
>>>>>>> pid-X
>>>>>>> uts-Z
>>>>>>> user-X/
>>>>>>> user
>>>>>>> mnt-A
>>>>>>> mnt-B
>>>>>>> user-C
>>>>>>> user-C/
>>>>>>> user
>>>>>>> user-Y/
>>>>>>> user
>>>>>>
>>>>>> Hm, I don't think that user namespace is a generic key value for everybody.
>>>>>> For generic people tasks a user namespace is just a namespace among another
>>>>>> namespace types. For me it will look a bit strage to iterate some user namespaces
>>>>>> to build container net topology.
>>>>>
>>>>> I can’t agree with you that the user namespace is one of others. It is
>>>>> the namespace for namespaces. It sets security boundaries in the system
>>>>> and we need to know them to understand the whole system.
>>>>>
>>>>> If user namespaces are not used in the system or on a container, you
>>>>> will see all namespaces in one directory. But if the system has a more
>>>>> complicated structure, you will be able to build a full picture of it.
>>>>>
>>>>> You said that one of the users of this feature is CRIU (the tool to
>>>>> checkpoint/restore containers) and you said that it would be good if
>>>>> CRIU will be able to collect all container namespaces before dumping
>>>>> processes, sockets, files etc. But how will we be able to do this if we
>>>>> will list all namespaces in one directory?
>>>>
>>>> There is no a problem, this looks rather simple. Two cases are possible:
>>>>
>>>> 1)a container has dedicated namespaces set, and CRIU just has to iterate
>>>> files in /proc/namespaces of root pid namespace of the container.
>>>> The relationships between parents and childs of pid and user namespaces
>>>> are founded via ioctl(NS_GET_PARENT).
>>>>
>>>> 2)container has no dedicated namespaces set. Then CRIU just has to iterate
>>>> all host namespaces. There is no another way to do that, because container
>>>> may have any host namespaces, and hierarchy in /proc/namespaces won't
>>>> help you.
>>>>
>>>>> Here are my thoughts why we need to the suggested structure is better
>>>>> than just a list of namespaces:
>>>>>
>>>>> * Users will be able to understand securies bondaries in the system.
>>>>> Each namespace in the system is owned by one of user namespace and we
>>>>> need to know these relationshipts to understand the whole system.
>>>>
>>>> Here are already NS_GET_PARENT and NS_GET_USERNS. What is the problem to use
>>>> this interfaces?
>>>
>>> We can use these ioctl-s, but we will need to enumerate all namespaces in
>>> the system to build a view of the namespace hierarchy. This will be very
>>> expensive. The kernel can show this hierarchy without additional cost.
>>
>> No. We will have to iterate /proc/namespaces of a specific container to get
>> its namespaces. It's a subset of all namespaces in system, and these all the
>> namespaces, which are potentially allowed for the container.
>
> """
> Every /proc is related to a pid_namespace, and the pid_namespace
> is related to a user_namespace. The items, we show in this
> /proc/namespaces/ directory, are the namespaces,
> whose user_namespaces are the same as /proc's user_namespace,
> or their descendants.
> """ // [PATCH 11/23] fs: Add /proc/namespaces/ directory
>
> This means that if a user want to find out all container namespaces, it
> has to have access to the container procfs and the container should
> a separate pid namespace.
>
> I would say these are two big limitations. The first one will not affect
> CRIU and I agree CRIU can use this interface in its current form.
>
> The second one will be still the issue for CRIU. And they both will
> affect other users.
>
> For end users, it will be a pain. They will need to create a pid
> namespaces in a specified user-namespace, if a container doesn't have
> its own. Then they will need to mount /proc from the container pid
> namespace and only then they will be able to enumerate namespaces.

If a container does not have its own pid namespace, CRIU already
sucks. No file in the /proc directory is reliable after restore,
so /proc/namespaces is just one of them. A container that may access files
in /proc has to have its own pid namespace.

Even if we imagine an unreal situation where the rest of the /proc files are
reliable, sub-directories won't help in this case either. If we introduce a
user ns hierarchy, the names of the namespaces above the container's user ns
will still be unchangeable:

/proc/namespaces/parent_user_ns/container_user_ns/...

The path to container_user_ns is fixed. If the container accesses the
/proc/namespaces/parent_user_ns file, it will break after restore again.

So, the suggested sub-directories just don't work.

> But to build a view of a hierarchy of these namespaces, they will need to
> use a binary tool which will open each of these namespaces, call
> NS_GET_PARENT and NS_GET_USERNS ioctl-s and build a tree.

Yes, it's the same way a tasks tree is constructed.

Linear /proc/namespaces is a rather natural way. The meaning is "all namespaces
which are available for tasks in this /proc directory".

Grouping by user ns directories looks odd. CRIU is the only util that needs
such grouping. But even for CRIU the performance advantages look dubious.

For other utils, preferring grouping by user ns over grouping by the other
hierarchical namespaces just looks weird.

I can agree with the idea of separate top-level sub-directories for different
namespace types like:

/proc/namespaces/uts/
/proc/namespaces/user/
/proc/namespaces/pid/
...

But grouping all other namespaces by user ns sub-directories absolutely
does not look sane to me.
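
For what it's worth, a per-type view like the one above can already be derived
in user space from the flat 'type:[ino]' names shown in the cover letter. A
small sketch, where the helper group_by_type is made up for illustration and
the input list is a sample taken from the cover letter:

```python
import re
from collections import defaultdict

# Flat entry names look like 'mnt:[4026531840]' in the cover letter.
NAME_RE = re.compile(r"^(?P<type>[a-z_]+):\[(?P<ino>\d+)\]$")


def group_by_type(names):
    """Map namespace type -> list of inode numbers, skipping malformed names."""
    groups = defaultdict(list)
    for name in names:
        m = NAME_RE.match(name)
        if m is None:
            continue
        groups[m.group("type")].append(int(m.group("ino")))
    return dict(groups)


names = ["mnt:[4026531840]", "mnt:[4026531861]",
         "uts:[4026531838]", "user:[4026531837]"]

for ns_type, inos in sorted(group_by_type(names).items()):
    print(f"/proc/namespaces/{ns_type}/", inos)
```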

>>
>>>>
>>>>> * This is simplify collecting namespaces which belong to one container.
>>>>>
>>>>> For example, CRIU collects all namespaces before dumping file
>>>>> descriptors. Then it collects all sockets with socket-diag in network
>>>>> namespaces and collects mount points via /proc/pid/mountinfo in mount
>>>>> namesapces. Then these information is used to dump socket file
>>>>> descriptors and opened files.
>>>>
>>>> This is just the thing I say. This allows to avoid writing recursive dump.
>>>
>>> I don't understand this. How are you going to collect namespaces in CRIU
>>> without knowing which are used by a dumped container?
>>
>> My patchset exports only the namespaces, which are allowed for a specific
>> container, and no more above this. All exported namespaces are alive,
>> so someone holds a reference on every of it. So they are used.
>>
>> It seems you haven't understood the way I suggested here. See patch [11/23]
>> for the details. It's about permissions, and the subset of exported namespaces
>> is formalized there.
>
> Honestly, I have not read all patches in this series and you didn't
> describe this behavior in the cover letter. Thank you for pointing out
> to the 11 patch, but I still think it doesn't solve the problem
> completely. More details is in the comment which is a few lines above
> this one.
>
>>
>>>> But this has nothing about advantages of hierarchy in /proc/namespaces.
>
> Yes, it has. For example, in cases when a container doesn't have its own
> pid namespaces.
>
>>>
>>> Really? You said that you implemented this series to help CRIU dumping
>>> namespaces. I think we need to implement the CRIU part to prove that
>>> this interface is usable for this case. Right now, I have doubts about
>>> this.
>>
>> Yes, really. See my comment above and patch [11/23].
>>
>>>>
>>>>> * We are going to assign names to namespaces. But this means that we
>>>>> need to guarantee that all names in one directory are unique. The
>>>>> initial proposal was to enumerate all namespaces in one proc directory,
>>>>> that means names of all namespaces have to be unique. This can be
>>>>> problematic in some cases. For example, we may want to dump a container
>>>>> and then restore it more than once on the same host. How are we going to
>>>>> avoid namespace name conflicts in such cases?
>>>>
>>>> Previous message I wrote about .rename of proc files, Alexey Dobriyan
>>>> said this is not a taboo. Are there problem which doesn't cover the case
>>>> you point?
>>>
>>> Yes, there is. Namespace names will be visible from a container, so they
>>> have to be restored. But this means that two containers can't be
>>> restored from the same snapshot due to namespace name conflicts.
>>>
>>> But if we will show namespaces how I suggest, each container will see
>>> only its sub-tree of namespaces and we will be able to specify any name
>>> for the container root user namespace.
>>
>> Now I'm sure you missed my idea. See proc_namespaces_readdir() in [11/23].
>>
>> I do export sub-tree.
>
> I got your idea, but it is unclear how you are going to avoid name
> conflicts.
>
> In the root container, you will show all namespaces in the system. These
> means that all namespaces have to have unique names. This means we will
> not able to restore two containers from the same snapshot without
> renaming namespaces. But we can't change namespace names, because they
> are visible from containers and container processes can use them.

Grouping by user ns sub-directories does not solve the problem with names
for containers w/o their own pid ns. See above.

>>
>>>>
>>>>> If we will have per-user-namespace directories, we will need to
>>>>> guarantee that names are unique only inside one user namespace.
>>>>
>>>> Unique names inside one user namespace won't introduce a new /proc
>>>> mount. You can't pass a sub-directory of /proc/namespaces/ to a specific
>>>> container. To give a virtualized name you have to have a dedicated pid ns.
>>>>
>>>> Let we have in one /proc mount:
>>>>
>>>> /mnt1/proc/namespaces/userns1/.../[namespaceX_name1 -- inode XXX]
>>>>
>>>> In another /proc mount we have:
>>>>
>>>> /mnt2/proc/namespaces/userns1/.../[namespaceX_name1_synonym -- inode XXX]
>>>>
>>>> The virtualization is made per /proc (i.e., per pid ns). Container should
>>>> receive either /mnt1/proc or /mnt2/proc on restore as it's /proc.
>>>>
>>>> There is no a sense of directory hierarchy for virtualization, since
>>>> you can't use specific sub-directory as a root directory of /proc/namespaces
>>>> to a container. You still have to introduce a new pid ns to have virtualized
>>>> /proc.
>>>
>>> I think we can figure out how to implement this. As the first idea, we
>>> can use the same way how /proc/net is implemented.
>>>
>>>>
>>>>> * With the suggested structure, for each user namespace, we will show
>>>>> only its subtree of namespaces. This looks more natural than
>>>>> filtering content of one directory.
>>>>
>>>> It's rather subjectively I think. /proc is related to pid ns, and user ns
>>>> hierarchy does not look more natural for me.
>>>
>>> or /proc is wrong place for this

2020-08-14 21:27:37

by Andrei Vagin

Subject: Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

On Fri, Aug 14, 2020 at 06:11:58PM +0300, Kirill Tkhai wrote:
> On 14.08.2020 04:16, Andrei Vagin wrote:
> > On Thu, Aug 13, 2020 at 11:12:45AM +0300, Kirill Tkhai wrote:
> >> On 12.08.2020 20:53, Andrei Vagin wrote:
> >>> On Tue, Aug 11, 2020 at 01:23:35PM +0300, Kirill Tkhai wrote:
> >>>> On 10.08.2020 20:34, Andrei Vagin wrote:
> >>>>> On Fri, Aug 07, 2020 at 11:47:57AM +0300, Kirill Tkhai wrote:
> >>>>>> On 06.08.2020 11:05, Andrei Vagin wrote:
> >>>>>>> On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
> >>>>>>>> On 31.07.2020 01:13, Eric W. Biederman wrote:
> >>>>>>>>> Kirill Tkhai writes:
> >>>>>>>>>
> >>>>>>>>>> On 30.07.2020 17:34, Eric W. Biederman wrote:
> >>>>>>>>>>> Kirill Tkhai writes:
> >>>>>>>>>>>
> >>>>>>>>>>>> Currently, there is no a way to list or iterate all or subset of namespaces
> >>>>>>>>>>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
> >>>>>>>>>>>> but some also may be as open files, which are not attached to a process.
> >>>>>>>>>>>> When a namespace open fd is sent over unix socket and then closed, it is
> >>>>>>>>>>>> impossible to know whether the namespace exists or not.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Also, even if namespace is exposed as attached to a process or as open file,
> >>>>>>>>>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
> >>>>>>>>>>>> this multiplies at tasks and fds number.
> >>>>>>>>>>>
> >>>>>>>>>>> I am very dubious about this.
> >>>>>>>>>>>
> >>>>>>>>>>> I have been avoiding exactly this kind of interface because it can
> >>>>>>>>>>> create rather fundamental problems with checkpoint restart.
> >>>>>>>>>>
> >>>>>>>>>> restart/restore :)
> >>>>>>>>>>
> >>>>>>>>>>> You do have some filtering and the filtering is not based on current.
> >>>>>>>>>>> Which is good.
> >>>>>>>>>>>
> >>>>>>>>>>> A view that is relative to a user namespace might be ok. It almost
> >>>>>>>>>>> certainly does better as it's own little filesystem than as an extension
> >>>>>>>>>>> to proc though.
> >>>>>>>>>>>
> >>>>>>>>>>> The big thing we want to ensure is that if you migrate you can restore
> >>>>>>>>>>> everything. I don't see how you will be able to restore these files
> >>>>>>>>>>> after migration. Anything like this without having a complete
> >>>>>>>>>>> checkpoint/restore story is a non-starter.
> >>>>>>>>>>
> >>>>>>>>>> There is no difference between files in /proc/namespaces/ directory and /proc/[pid]/ns/.
> >>>>>>>>>>
> >>>>>>>>>> CRIU can restore open files in /proc/[pid]/ns, the same will be with /proc/namespaces/ files.
> >>>>>>>>>> As a person who worked deeply for pid_ns and user_ns support in CRIU, I don't see any
> >>>>>>>>>> problem here.
> >>>>>>>>>
> >>>>>>>>> An obvious difference is that you are adding the inode to
> >>>>>>>>> the file name. Which means that now you really do have to preserve the
> >>>>>>>>> inode numbers during process migration.
> >>>>>>>>>
> >>>>>>>>> Which means now we have to do all of the work to make inode number
> >>>>>>>>> restoration possible. Which means now we need to have multiple
> >>>>>>>>> instances of nsfs so that we can restore inode numbers.
> >>>>>>>>>
> >>>>>>>>> I think this is still possible but we have been delaying figuring out
> >>>>>>>>> how to restore inode numbers long enough that may be actual technical
> >>>>>>>>> problems making it happen.
> >>>>>>>>
> >>>>>>>> Yeah, this matters. But it looks like here is not a dead end. We just need
> >>>>>>>> change the names the namespaces are exported to particular fs and to support
> >>>>>>>> rename().
> >>>>>>>>
> >>>>>>>> Before introduction a principally new filesystem type for this, can't
> >>>>>>>> this be solved in current /proc?
> >>>>>>>
> >>>>>>> do you mean to introduce names for namespaces which users will be able
> >>>>>>> to change? By default, this can be uuid.
> >>>>>>
> >>>>>> Yes, I mean this.
> >>>>>>
> >>>>>> Currently I won't give a final answer about UUID, but I planned to show some
> >>>>>> default names, which based on namespace type and inode num. Completely custom
> >>>>>> names for any /proc by default will waste too much memory.
> >>>>>>
> >>>>>> So, I think the good way will be:
> >>>>>>
> >>>>>> 1)Introduce a function, which returns a hash/uuid based on ino, ns type and some static
> >>>>>> random seed, which is generated on boot;
> >>>>>>
> >>>>>> 2)Use the hash/uuid as default names in newly create /proc/namespaces: pid-{hash/uuid(ino, "pid")}
> >>>>>>
> >>>>>> 3)Allow rename, and allocate space only for renamed names.
> >>>>>>
> >>>>>> Maybe 2 and 3 will be implemented as shrinkable dentries and non-shrinkable.
> >>>>>>
> >>>>>>> And I have a suggestion about the structure of /proc/namespaces/.
> >>>>>>>
> >>>>>>> Each namespace is owned by one of user namespaces. Maybe it makes sense
> >>>>>>> to group namespaces by their user-namespaces?
> >>>>>>>
> >>>>>>> /proc/namespaces/
> >>>>>>> user
> >>>>>>> mnt-X
> >>>>>>> mnt-Y
> >>>>>>> pid-X
> >>>>>>> uts-Z
> >>>>>>> user-X/
> >>>>>>> user
> >>>>>>> mnt-A
> >>>>>>> mnt-B
> >>>>>>> user-C
> >>>>>>> user-C/
> >>>>>>> user
> >>>>>>> user-Y/
> >>>>>>> user
> >>>>>>
> >>>>>> Hm, I don't think that user namespace is a generic key value for everybody.
> >>>>>> For generic people tasks a user namespace is just a namespace among another
> >>>>>> namespace types. For me it will look a bit strange to iterate some user namespaces
> >>>>>> to build container net topology.
> >>>>>
> >>>>> I can’t agree with you that the user namespace is one of others. It is
> >>>>> the namespace for namespaces. It sets security boundaries in the system
> >>>>> and we need to know them to understand the whole system.
> >>>>>
> >>>>> If user namespaces are not used in the system or on a container, you
> >>>>> will see all namespaces in one directory. But if the system has a more
> >>>>> complicated structure, you will be able to build a full picture of it.
> >>>>>
> >>>>> You said that one of the users of this feature is CRIU (the tool to
> >>>>> checkpoint/restore containers) and you said that it would be good if
> >>>>> CRIU will be able to collect all container namespaces before dumping
> >>>>> processes, sockets, files etc. But how will we be able to do this if we
> >>>>> will list all namespaces in one directory?
> >>>>
> >>>> There is no a problem, this looks rather simple. Two cases are possible:
> >>>>
> >>>> 1)a container has dedicated namespaces set, and CRIU just has to iterate
> >>>> files in /proc/namespaces of root pid namespace of the container.
> >>>> The relationships between parents and childs of pid and user namespaces
> >>>> are founded via ioctl(NS_GET_PARENT).
> >>>>
> >>>> 2)container has no dedicated namespaces set. Then CRIU just has to iterate
> >>>> all host namespaces. There is no another way to do that, because container
> >>>> may have any host namespaces, and hierarchy in /proc/namespaces won't
> >>>> help you.
> >>>>
> >>>>> Here are my thoughts why we need to the suggested structure is better
> >>>>> than just a list of namespaces:
> >>>>>
> >>>>> * Users will be able to understand security boundaries in the system.
> >>>>> Each namespace in the system is owned by one of user namespace and we
> >>>>> need to know these relationships to understand the whole system.
> >>>>
> >>>> Here are already NS_GET_PARENT and NS_GET_USERNS. What is the problem to use
> >>>> this interfaces?
> >>>
> >>> We can use these ioctl-s, but we will need to enumerate all namespaces in
> >>> the system to build a view of the namespace hierarchy. This will be very
> >>> expensive. The kernel can show this hierarchy without additional cost.
> >>
> >> No. We will have to iterate /proc/namespaces of a specific container to get
> >> its namespaces. It's a subset of all namespaces in system, and these all the
> >> namespaces, which are potentially allowed for the container.
> >
> > """
> > Every /proc is related to a pid_namespace, and the pid_namespace
> > is related to a user_namespace. The items, we show in this
> > /proc/namespaces/ directory, are the namespaces,
> > whose user_namespaces are the same as /proc's user_namespace,
> > or their descendants.
> > """ // [PATCH 11/23] fs: Add /proc/namespaces/ directory
> >
> > This means that if a user wants to find out all container namespaces, it
> > has to have access to the container procfs and the container should
> > have a separate pid namespace.
> >
> > I would say these are two big limitations. The first one will not affect
> > CRIU and I agree CRIU can use this interface in its current form.
> >
> > The second one will be still the issue for CRIU. And they both will
> > affect other users.
> >
> > For end users, it will be a pain. They will need to create a pid
> > namespaces in a specified user-namespace, if a container doesn't have
> > its own. Then they will need to mount /proc from the container pid
> > namespace and only then they will be able to enumerate namespaces.
>
> In case of a container does not have its own pid namespace, CRIU already
> sucks. Every file in /proc directory is not reliable after restore,
> so /proc/namespaces is just one of them. Container, who may access files
> in /proc, does have to have its own pid namespace.

Can you be more specific here? Which files are not reliable? And why don't
we need to think about this use case? If we have any issues here,
maybe we need to think about how to fix them instead of adding a new one.

>
> Even if we imagine an unreal situation, when the rest of /proc files are reliable,
> sub-directories won't help in this case also. In case of we introduce user ns
> hierarchy, the namespaces names above container's user ns, will still
> be unchangeable:
>
> /proc/namespaces/parent_user_ns/container_user_ns/...
>
> Path to container_user_ns is fixed. If container accesses /proc/namespace/parent_user_ns
> file, it will suck a pow after restore again.


With a user ns hierarchy, a container will see only its own sub-tree and
it will not know the name of its root namespace. It will look like this:

From host:
/proc/namespaces/user_ns_ct1/user1
                             user2

/proc/namespaces/user_ns_ct2/user1
                             user2

From ct1:
/proc/namespaces/user1
                 user2
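
The two views above can be reproduced with a small sketch. It assumes a
hypothetical ownership tree of user namespaces; the node ids (`host`, `ct1`,
`ct1.u1`, ...) and the helper `visible_paths` are illustrative only and are
not part of any patch. The point it shows is that the root of a view never
appears by name in that view.

```python
# Hypothetical tree: node id -> ids of the user namespaces it owns.
CHILDREN = {
    "host": ["ct1", "ct2"],
    "ct1": ["ct1.u1", "ct1.u2"],
    "ct2": ["ct2.u1", "ct2.u2"],
}

# Name under which each node is shown; names only need to be
# unique among siblings, not across the whole tree.
LABEL = {
    "ct1": "user_ns_ct1", "ct2": "user_ns_ct2",
    "ct1.u1": "user1", "ct1.u2": "user2",
    "ct2.u1": "user1", "ct2.u2": "user2",
}


def visible_paths(root, children=CHILDREN, label=LABEL, base="/proc/namespaces"):
    """List leaf paths seen by a /proc whose root user namespace is `root`."""
    paths = []
    for child in children.get(root, []):
        path = f"{base}/{label[child]}"
        if children.get(child):
            # A namespace that owns others is shown as a directory: recurse.
            paths.extend(visible_paths(child, children, label, path))
        else:
            paths.append(path)
    return paths
```

Calling `visible_paths("host")` yields the four host-side paths, while
`visible_paths("ct1")` yields only `/proc/namespaces/user1` and
`/proc/namespaces/user2`.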

And now could you explain how you are going to solve this problem with
your interface?

>
> So, the suggested sub-directories just don't work.

I am sure it will work.

>
> > But to build a view of a hierarchy of these namespaces, they will need to
> > use a binary tool which will open each of these namespaces, call
> > NS_GET_PARENT and NS_GET_USERNS ioctl-s and build a tree.
>
> Yes, it's the same way we have on a construction of tasks tree.
>
> Linear /proc/namespaces is rather natural way. The sense is "all namespaces,
> which are available for tasks in this /proc directory".
>
> Grouping by user ns directories looks odd. CRIU is only util, who needs
> such the grouping. But even for CRIU performance advantages look dubious.

I can't agree with you here. This isn't about CRIU. Grouping by user ns
doesn't look odd to me, because this is how namespaces are grouped in
the kernel.
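
For reference, the ioctl-based approach quoted above (open each namespace,
ask for its owner, build the tree in userspace) could look roughly like the
sketch below. The request numbers are `_IO(0xb7, 0x1)` and `_IO(0xb7, 0x2)`
from `<linux/nsfs.h>`; the helper names are hypothetical, and the calls may
fail with EPERM when the owning namespace is outside the caller's scope.

```python
import fcntl
import os

# ioctl request numbers from <linux/nsfs.h>.
NS_GET_USERNS = 0xB701  # fd of the owning user namespace
NS_GET_PARENT = 0xB702  # fd of the parent (pid and user namespaces only)


def related_inode(ns_path, request=NS_GET_USERNS):
    """Return the inode number of the namespace related to ns_path."""
    fd = os.open(ns_path, os.O_RDONLY)
    try:
        # With an integer argument, fcntl.ioctl returns the ioctl result,
        # which here is a new file descriptor.
        rel_fd = fcntl.ioctl(fd, request)
        try:
            return os.fstat(rel_fd).st_ino
        finally:
            os.close(rel_fd)
    finally:
        os.close(fd)


def owners_of_self():
    """Map each /proc/self/ns entry to the inode of its owning user ns."""
    ns_dir = "/proc/self/ns"
    return {name: related_inode(os.path.join(ns_dir, name))
            for name in os.listdir(ns_dir)}


def group_by_owner(owner_of):
    """Invert {namespace: owner} into {owner: sorted list of namespaces}."""
    tree = {}
    for ns, owner in owner_of.items():
        tree.setdefault(owner, []).append(ns)
    return {owner: sorted(members) for owner, members in tree.items()}
```

`group_by_owner(owners_of_self())` gives a one-level grouping for the current
task; a full tree needs the same walk over every namespace of interest, which
is the cost being argued about here.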

>
> For another utils, the preference of user ns grouping over another hierarchy
> namespaces looks just weirdy weird.
>
> I can agree with an idea of separate top-level sub-directories for different
> namespaces types like:
>
> /proc/namespaces/uts/
> /proc/namespaces/user/
> /proc/namespaces/pid/
> ...
>
> But grouping of all another namespaces by user ns sub-directories absolutely
> does not look sane for me.

I think we are stuck here and we need to ask for someone else's opinion.

>
> >>
> >>>>
> >>>>> * This is simplify collecting namespaces which belong to one container.
> >>>>>
> >>>>> For example, CRIU collects all namespaces before dumping file
> >>>>> descriptors. Then it collects all sockets with socket-diag in network
> >>>>> namespaces and collects mount points via /proc/pid/mountinfo in mount
> >>>>> namespaces. Then this information is used to dump socket file
> >>>>> descriptors and opened files.
> >>>>
> >>>> This is just the thing I say. This allows to avoid writing recursive dump.
> >>>
> >>> I don't understand this. How are you going to collect namespaces in CRIU
> >>> without knowing which are used by a dumped container?
> >>
> >> My patchset exports only the namespaces, which are allowed for a specific
> >> container, and no more above this. All exported namespaces are alive,
> >> so someone holds a reference on every of it. So they are used.
> >>
> >> It seems you haven't understood the way I suggested here. See patch [11/23]
> >> for the details. It's about permissions, and the subset of exported namespaces
> >> is formalized there.
> >
> > Honestly, I have not read all patches in this series and you didn't
> > describe this behavior in the cover letter. Thank you for pointing out
> > to the 11 patch, but I still think it doesn't solve the problem
> > completely. More details is in the comment which is a few lines above
> > this one.
> >
> >>
> >>>> But this has nothing about advantages of hierarchy in /proc/namespaces.
> >
> > Yes, it has. For example, in cases when a container doesn't have its own
> > pid namespaces.
> >
> >>>
> >>> Really? You said that you implemented this series to help CRIU dumping
> >>> namespaces. I think we need to implement the CRIU part to prove that
> >>> this interface is usable for this case. Right now, I have doubts about
> >>> this.
> >>
> >> Yes, really. See my comment above and patch [11/23].
> >>
> >>>>
> >>>>> * We are going to assign names to namespaces. But this means that we
> >>>>> need to guarantee that all names in one directory are unique. The
> >>>>> initial proposal was to enumerate all namespaces in one proc directory,
> >>>>> that means names of all namespaces have to be unique. This can be
> >>>>> problematic in some cases. For example, we may want to dump a container
> >>>>> and then restore it more than once on the same host. How are we going to
> >>>>> avoid namespace name conflicts in such cases?
> >>>>
> >>>> Previous message I wrote about .rename of proc files, Alexey Dobriyan
> >>>> said this is not a taboo. Are there problem which doesn't cover the case
> >>>> you point?
> >>>
> >>> Yes, there is. Namespace names will be visible from a container, so they
> >>> have to be restored. But this means that two containers can't be
> >>> restored from the same snapshot due to namespace name conflicts.
> >>>
> >>> But if we will show namespaces how I suggest, each container will see
> >>> only its sub-tree of namespaces and we will be able to specify any name
> >>> for the container root user namespace.
> >>
> >> Now I'm sure you missed my idea. See proc_namespaces_readdir() in [11/23].
> >>
> >> I do export sub-tree.
> >
> > I got your idea, but it is unclear how you are going to avoid name
> > conflicts.
> >
> > In the root container, you will show all namespaces in the system. These
> > means that all namespaces have to have unique names. This means we will
> > not able to restore two containers from the same snapshot without
> > renaming namespaces. But we can't change namespace names, because they
> > are visible from containers and container processes can use them.
>
> Grouping by user ns sub-directories does not solve a problem with names
> of containers w/o own pid ns. See above.

It does solve it; you just don't understand how it works. See above.

>
> >>
> >>>>
> >>>>> If we will have per-user-namespace directories, we will need to
> >>>>> guarantee that names are unique only inside one user namespace.
> >>>>
> >>>> Unique names inside one user namespace won't introduce a new /proc
> >>>> mount. You can't pass a sub-directory of /proc/namespaces/ to a specific
> >>>> container. To give a virtualized name you have to have a dedicated pid ns.
> >>>>
> >>>> Let we have in one /proc mount:
> >>>>
> >>>> /mnt1/proc/namespaces/userns1/.../[namespaceX_name1 -- inode XXX]
> >>>>
> >>>> In another /proc mount we have:
> >>>>
> >>>> /mnt2/proc/namespaces/userns1/.../[namespaceX_name1_synonym -- inode XXX]
> >>>>
> >>>> The virtualization is made per /proc (i.e., per pid ns). Container should
> >>>> receive either /mnt1/proc or /mnt2/proc on restore as it's /proc.
> >>>>
> >>>> There is no a sense of directory hierarchy for virtualization, since
> >>>> you can't use specific sub-directory as a root directory of /proc/namespaces
> >>>> to a container. You still have to introduce a new pid ns to have virtualized
> >>>> /proc.
> >>>
> >>> I think we can figure out how to implement this. As the first idea, we
> >>> can use the same way how /proc/net is implemented.
> >>>
> >>>>
> >>>>> * With the suggested structure, for each user namespace, we will show
> >>>>> only its subtree of namespaces. This looks more natural than
> >>>>> filtering content of one directory.
> >>>>
> >>>> It's rather subjectively I think. /proc is related to pid ns, and user ns
> >>>> hierarchy does not look more natural for me.
> >>>
> >>> or /proc is wrong place for this
>

2020-08-17 14:07:12

by Kirill Tkhai

Subject: Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

On 14.08.2020 22:21, Andrei Vagin wrote:
> On Fri, Aug 14, 2020 at 06:11:58PM +0300, Kirill Tkhai wrote:
>> On 14.08.2020 04:16, Andrei Vagin wrote:
>>> On Thu, Aug 13, 2020 at 11:12:45AM +0300, Kirill Tkhai wrote:
>>>> On 12.08.2020 20:53, Andrei Vagin wrote:
>>>>> On Tue, Aug 11, 2020 at 01:23:35PM +0300, Kirill Tkhai wrote:
>>>>>> On 10.08.2020 20:34, Andrei Vagin wrote:
>>>>>>> On Fri, Aug 07, 2020 at 11:47:57AM +0300, Kirill Tkhai wrote:
>>>>>>>> On 06.08.2020 11:05, Andrei Vagin wrote:
>>>>>>>>> On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
>>>>>>>>>> On 31.07.2020 01:13, Eric W. Biederman wrote:
>>>>>>>>>>> Kirill Tkhai writes:
>>>>>>>>>>>
>>>>>>>>>>>> On 30.07.2020 17:34, Eric W. Biederman wrote:
>>>>>>>>>>>>> Kirill Tkhai writes:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Currently, there is no a way to list or iterate all or subset of namespaces
>>>>>>>>>>>>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
>>>>>>>>>>>>>> but some also may be as open files, which are not attached to a process.
>>>>>>>>>>>>>> When a namespace open fd is sent over unix socket and then closed, it is
>>>>>>>>>>>>>> impossible to know whether the namespace exists or not.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also, even if namespace is exposed as attached to a process or as open file,
>>>>>>>>>>>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
>>>>>>>>>>>>>> this multiplies at tasks and fds number.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am very dubious about this.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have been avoiding exactly this kind of interface because it can
>>>>>>>>>>>>> create rather fundamental problems with checkpoint restart.
>>>>>>>>>>>>
>>>>>>>>>>>> restart/restore :)
>>>>>>>>>>>>
>>>>>>>>>>>>> You do have some filtering and the filtering is not based on current.
>>>>>>>>>>>>> Which is good.
>>>>>>>>>>>>>
>>>>>>>>>>>>> A view that is relative to a user namespace might be ok. It almost
>>>>>>>>>>>>> certainly does better as it's own little filesystem than as an extension
>>>>>>>>>>>>> to proc though.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The big thing we want to ensure is that if you migrate you can restore
>>>>>>>>>>>>> everything. I don't see how you will be able to restore these files
>>>>>>>>>>>>> after migration. Anything like this without having a complete
>>>>>>>>>>>>> checkpoint/restore story is a non-starter.
>>>>>>>>>>>>
>>>>>>>>>>>> There is no difference between files in /proc/namespaces/ directory and /proc/[pid]/ns/.
>>>>>>>>>>>>
>>>>>>>>>>>> CRIU can restore open files in /proc/[pid]/ns, the same will be with /proc/namespaces/ files.
>>>>>>>>>>>> As a person who worked deeply for pid_ns and user_ns support in CRIU, I don't see any
>>>>>>>>>>>> problem here.
>>>>>>>>>>>
>>>>>>>>>>> An obvious difference is that you are adding the inode to
>>>>>>>>>>> the file name. Which means that now you really do have to preserve the
>>>>>>>>>>> inode numbers during process migration.
>>>>>>>>>>>
>>>>>>>>>>> Which means now we have to do all of the work to make inode number
>>>>>>>>>>> restoration possible. Which means now we need to have multiple
>>>>>>>>>>> instances of nsfs so that we can restore inode numbers.
>>>>>>>>>>>
>>>>>>>>>>> I think this is still possible but we have been delaying figuring out
>>>>>>>>>>> how to restore inode numbers long enough that may be actual technical
>>>>>>>>>>> problems making it happen.
>>>>>>>>>>
>>>>>>>>>> Yeah, this matters. But it looks like here is not a dead end. We just need
>>>>>>>>>> change the names the namespaces are exported to particular fs and to support
>>>>>>>>>> rename().
>>>>>>>>>>
>>>>>>>>>> Before introduction a principally new filesystem type for this, can't
>>>>>>>>>> this be solved in current /proc?
>>>>>>>>>
>>>>>>>>> do you mean to introduce names for namespaces which users will be able
>>>>>>>>> to change? By default, this can be uuid.
>>>>>>>>
>>>>>>>> Yes, I mean this.
>>>>>>>>
>>>>>>>> Currently I won't give a final answer about UUID, but I planned to show some
>>>>>>>> default names, which based on namespace type and inode num. Completely custom
>>>>>>>> names for any /proc by default will waste too much memory.
>>>>>>>>
>>>>>>>> So, I think the good way will be:
>>>>>>>>
>>>>>>>> 1)Introduce a function, which returns a hash/uuid based on ino, ns type and some static
>>>>>>>> random seed, which is generated on boot;
>>>>>>>>
>>>>>>>> 2)Use the hash/uuid as default names in newly create /proc/namespaces: pid-{hash/uuid(ino, "pid")}
>>>>>>>>
>>>>>>>> 3)Allow rename, and allocate space only for renamed names.
>>>>>>>>
>>>>>>>> Maybe 2 and 3 will be implemented as shrinkable dentries and non-shrinkable.
>>>>>>>>
>>>>>>>>> And I have a suggestion about the structure of /proc/namespaces/.
>>>>>>>>>
>>>>>>>>> Each namespace is owned by one of user namespaces. Maybe it makes sense
>>>>>>>>> to group namespaces by their user-namespaces?
>>>>>>>>>
>>>>>>>>> /proc/namespaces/
>>>>>>>>> user
>>>>>>>>> mnt-X
>>>>>>>>> mnt-Y
>>>>>>>>> pid-X
>>>>>>>>> uts-Z
>>>>>>>>> user-X/
>>>>>>>>> user
>>>>>>>>> mnt-A
>>>>>>>>> mnt-B
>>>>>>>>> user-C
>>>>>>>>> user-C/
>>>>>>>>> user
>>>>>>>>> user-Y/
>>>>>>>>> user
>>>>>>>>
>>>>>>>> Hm, I don't think that user namespace is a generic key value for everybody.
>>>>>>>> For generic people tasks a user namespace is just a namespace among another
>>>>>>>> namespace types. For me it will look a bit strange to iterate some user namespaces
>>>>>>>> to build container net topology.
>>>>>>>
>>>>>>> I can’t agree with you that the user namespace is one of others. It is
>>>>>>> the namespace for namespaces. It sets security boundaries in the system
>>>>>>> and we need to know them to understand the whole system.
>>>>>>>
>>>>>>> If user namespaces are not used in the system or on a container, you
>>>>>>> will see all namespaces in one directory. But if the system has a more
>>>>>>> complicated structure, you will be able to build a full picture of it.
>>>>>>>
>>>>>>> You said that one of the users of this feature is CRIU (the tool to
>>>>>>> checkpoint/restore containers) and you said that it would be good if
>>>>>>> CRIU will be able to collect all container namespaces before dumping
>>>>>>> processes, sockets, files etc. But how will we be able to do this if we
>>>>>>> will list all namespaces in one directory?
>>>>>>
>>>>>> There is no a problem, this looks rather simple. Two cases are possible:
>>>>>>
>>>>>> 1)a container has dedicated namespaces set, and CRIU just has to iterate
>>>>>> files in /proc/namespaces of root pid namespace of the container.
>>>>>> The relationships between parents and childs of pid and user namespaces
>>>>>> are founded via ioctl(NS_GET_PARENT).
>>>>>>
>>>>>> 2)container has no dedicated namespaces set. Then CRIU just has to iterate
>>>>>> all host namespaces. There is no another way to do that, because container
>>>>>> may have any host namespaces, and hierarchy in /proc/namespaces won't
>>>>>> help you.
>>>>>>
>>>>>>> Here are my thoughts why we need to the suggested structure is better
>>>>>>> than just a list of namespaces:
>>>>>>>
>>>>>>> * Users will be able to understand security boundaries in the system.
>>>>>>> Each namespace in the system is owned by one of user namespace and we
>>>>>>> need to know these relationships to understand the whole system.
>>>>>>
>>>>>> Here are already NS_GET_PARENT and NS_GET_USERNS. What is the problem to use
>>>>>> this interfaces?
>>>>>
>>>>> We can use these ioctl-s, but we will need to enumerate all namespaces in
>>>>> the system to build a view of the namespace hierarchy. This will be very
>>>>> expensive. The kernel can show this hierarchy without additional cost.
>>>>
>>>> No. We will have to iterate /proc/namespaces of a specific container to get
>>>> its namespaces. It's a subset of all namespaces in system, and these all the
>>>> namespaces, which are potentially allowed for the container.
>>>
>>> """
>>> Every /proc is related to a pid_namespace, and the pid_namespace
>>> is related to a user_namespace. The items, we show in this
>>> /proc/namespaces/ directory, are the namespaces,
>>> whose user_namespaces are the same as /proc's user_namespace,
>>> or their descendants.
>>> """ // [PATCH 11/23] fs: Add /proc/namespaces/ directory
>>>
>>> This means that if a user wants to find out all container namespaces, it
>>> has to have access to the container procfs and the container should
>>> have a separate pid namespace.
>>>
>>> I would say these are two big limitations. The first one will not affect
>>> CRIU and I agree CRIU can use this interface in its current form.
>>>
>>> The second one will be still the issue for CRIU. And they both will
>>> affect other users.
>>>
>>> For end users, it will be a pain. They will need to create a pid
>>> namespaces in a specified user-namespace, if a container doesn't have
>>> its own. Then they will need to mount /proc from the container pid
>>> namespace and only then they will be able to enumerate namespaces.
>>
>> In case of a container does not have its own pid namespace, CRIU already
>> sucks. Every file in /proc directory is not reliable after restore,
>> so /proc/namespaces is just one of them. Container, who may access files
>> in /proc, does have to have its own pid namespace.
>
> Can you be more detailed here? What files are not reliable? And why we
> don't need to think about this use-case? If we have any issues here,
> maybe we need to think how to fix them instead of adding a new one.

No file in /proc is reliable. You can't guarantee that the pid you need
will be free at restore time. A simple example: a program reading information
about its threads. It can't trust /proc/XXX/task/YYY/ after restore;
any access will result in an error. The same is true of other files in /proc.
Why do you require additional guarantees from just one directory in /proc?
This is a really strange approach.

The issue is already fixed, and the fix is called a pid namespace.

Did you get my proposal? Any container will be able to rename namespaces as it
wants in its own /proc. The current patchset does not contain this, but I wrote
about it in my replies. Maybe you missed that.

>>
>> Even if we imagine an unreal situation, when the rest of /proc files are reliable,
>> sub-directories won't help in this case also. In case of we introduce user ns
>> hierarchy, the namespaces names above container's user ns, will still
>> be unchangeable:
>>
>> /proc/namespaces/parent_user_ns/container_user_ns/...
>>
>> Path to container_user_ns is fixed. If container accesses /proc/namespace/parent_user_ns
>> file, it will suck a pow after restore again.
>
>
> In case of user ns hierarchy, a container will see only its sub-tree and
> it will not know a name of its root namespace. It will look like this:
>
> From host:
> /proc/namespaces/user_ns_ct1/user1
>                              user2
>
> /proc/namespaces/user_ns_ct2/user1
>                              user2
>
> From ct1:
> /proc/namespaces/user1
>                  user2

This is not expedient. You can't reliably restore a certain pid in the same pid namespace,
and that is very commonly used information.

But you request this strange functionality from the rarely used /proc/namespaces,
which is only for system utilities. This is really strange and useless.

A hierarchy by user namespace is complete crap IMO. The world does not
revolve around CRIU. It would be really strange to analyze a container's net
namespace topology (say, where a veth is connected) by iterating over user
namespace directories. What is this information for? Nobody needs it.
It is just bad design and an ugly interface, which will make users curse
the inventor of such an interface.

> And now could you explain how you are going to solve this problem with
> your interface?

I don't give any more guarantees than you get for pid restore.
What do you have on restore w/o a pid namespace now?! If the pid is free,
you restore your program with this pid number. Otherwise, you restore with
another pid, or do not restore. The same goes for namespace aliases. No more.
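
A minimal sketch of that rule, assuming a hypothetical per-/proc alias table.
The AliasTable class, claim() and restore_alias() are names invented here for
illustration and are not part of the patchset. Restore asks for the old alias
and either gets it, falls back to another one, or gives up, just as with a pid:

```python
class AliasTable:
    """Hypothetical table mapping alias names to namespace inode numbers."""

    def __init__(self):
        self._by_name = {}

    def claim(self, name, ns_inode):
        """Bind `name` to `ns_inode`; return False if another namespace holds it."""
        holder = self._by_name.get(name)
        if holder is not None and holder != ns_inode:
            return False
        self._by_name[name] = ns_inode
        return True


def restore_alias(table, wanted, ns_inode, fallback=None):
    """Return the alias actually obtained, or None if restore must be refused."""
    if table.claim(wanted, ns_inode):
        return wanted
    if fallback is not None and table.claim(fallback, ns_inode):
        return fallback
    return None
```

Whether to fall back or to refuse the restore is left to the caller, which is
the same choice one already has today when the wanted pid is taken.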

>>
>> So, the suggested sub-directories just don't work.
>
> I am sure it will work.
>
>>
>>> But to build a view of a hierarchy of these namespaces, they will need to
>>> use a binary tool which will open each of these namespaces, call
>>> NS_GET_PARENT and NS_GET_USERNS ioctl-s and build a tree.
>>
>> Yes, it's the same way we have on a construction of tasks tree.
>>
>> A linear /proc/namespaces is a rather natural way. The meaning is "all namespaces
>> which are available for tasks in this /proc directory".
>>
>> Grouping by user ns directories looks odd. CRIU is the only utility that needs
>> such grouping. But even for CRIU the performance advantages look dubious.
>
> I can't agree with you here. This isn't about CRIU. Grouping by user ns
> doesn't look odd for me, because this is how namespaces are grouped in
> the kernel.

Nope. Namespaces are not grouped by user namespace hierarchy. Pid and
user namespaces use their own parent/child grouping; all other namespace
types are linked in doubly linked lists.

>>
>> For other utilities, preferring user ns grouping over the other hierarchical
>> namespaces looks just plain weird.
>>
>> I can agree with the idea of separate top-level sub-directories for different
>> namespace types like:
>>
>> /proc/namespaces/uts/
>> /proc/namespaces/user/
>> /proc/namespaces/pid/
>> ...
>>
>> But grouping all other namespaces by user ns sub-directories absolutely
>> does not look sane to me.
>
> I think we are stuck here and we need to ask an opinion of someone else.
>
>>
>>>>
>>>>>>
>>>>>>> * This simplifies collecting namespaces which belong to one container.
>>>>>>>
>>>>>>> For example, CRIU collects all namespaces before dumping file
>>>>>>> descriptors. Then it collects all sockets with socket-diag in network
>>>>>>> namespaces and collects mount points via /proc/pid/mountinfo in mount
>>>>>>> namespaces. Then this information is used to dump socket file
>>>>>>> descriptors and opened files.
>>>>>>
>>>>>> This is just the thing I say. This allows to avoid writing recursive dump.
>>>>>
>>>>> I don't understand this. How are you going to collect namespaces in CRIU
>>>>> without knowing which are used by a dumped container?
>>>>
>>>> My patchset exports only the namespaces, which are allowed for a specific
>>>> container, and no more above this. All exported namespaces are alive,
>>>> so someone holds a reference on each of them. So they are used.
>>>>
>>>> It seems you haven't understood the way I suggested here. See patch [11/23]
>>>> for the details. It's about permissions, and the subset of exported namespaces
>>>> is formalized there.
>>>
>>> Honestly, I have not read all patches in this series and you didn't
>>> describe this behavior in the cover letter. Thank you for pointing out
>>> to the 11 patch, but I still think it doesn't solve the problem
>>> completely. More details are in the comment which is a few lines above
>>> this one.
>>>
>>>>
>>>>>> But this has nothing about advantages of hierarchy in /proc/namespaces.
>>>
>>> Yes, it has. For example, in cases when a container doesn't have its own
>>> pid namespaces.
>>>
>>>>>
>>>>> Really? You said that you implemented this series to help CRIU dumping
>>>>> namespaces. I think we need to implement the CRIU part to prove that
>>>>> this interface is usable for this case. Right now, I have doubts about
>>>>> this.
>>>>
>>>> Yes, really. See my comment above and patch [11/23].
>>>>
>>>>>>
>>>>>>> * We are going to assign names to namespaces. But this means that we
>>>>>>> need to guarantee that all names in one directory are unique. The
>>>>>>> initial proposal was to enumerate all namespaces in one proc directory,
>>>>>>> that means names of all namespaces have to be unique. This can be
>>>>>>> problematic in some cases. For example, we may want to dump a container
>>>>>>> and then restore it more than once on the same host. How are we going to
>>>>>>> avoid namespace name conflicts in such cases?
>>>>>>
>>>>>> In a previous message I wrote about .rename for proc files; Alexey Dobriyan
>>>>>> said this is not taboo. Is there a problem in the case you point to
>>>>>> that this does not cover?
>>>>>
>>>>> Yes, there is. Namespace names will be visible from a container, so they
>>>>> have to be restored. But this means that two containers can't be
>>>>> restored from the same snapshot due to namespace name conflicts.
>>>>>
>>>>> But if we will show namespaces how I suggest, each container will see
>>>>> only its sub-tree of namespaces and we will be able to specify any name
>>>>> for the container root user namespace.
>>>>
>>>> Now I'm sure you missed my idea. See proc_namespaces_readdir() in [11/23].
>>>>
>>>> I do export sub-tree.
>>>
>>> I got your idea, but it is unclear how you are going to avoid name
>>> conflicts.
>>>
>>> In the root container, you will show all namespaces in the system. This
>>> means that all namespaces have to have unique names. This means we will
>>> not be able to restore two containers from the same snapshot without
>>> renaming namespaces. But we can't change namespace names, because they
>>> are visible from containers and container processes can use them.
>>
>> Grouping by user ns sub-directories does not solve a problem with names
>> of containers w/o own pid ns. See above.
>
> It does solve it; you just don't understand how it works. See above.
>
>>
>>>>
>>>>>>
>>>>>>> If we will have per-user-namespace directories, we will need to
>>>>>>> guarantee that names are unique only inside one user namespace.
>>>>>>
>>>>>> Unique names inside one user namespace won't introduce a new /proc
>>>>>> mount. You can't pass a sub-directory of /proc/namespaces/ to a specific
>>>>>> container. To give a virtualized name you have to have a dedicated pid ns.
>>>>>>
>>>>>> Suppose we have in one /proc mount:
>>>>>>
>>>>>> /mnt1/proc/namespaces/userns1/.../[namespaceX_name1 -- inode XXX]
>>>>>>
>>>>>> In another /proc mount we have:
>>>>>>
>>>>>> /mnt2/proc/namespaces/userns1/.../[namespaceX_name1_synonym -- inode XXX]
>>>>>>
>>>>>> The virtualization is made per /proc (i.e., per pid ns). Container should
>>>>>> receive either /mnt1/proc or /mnt2/proc on restore as its /proc.
>>>>>>
>>>>>> There is no point in a directory hierarchy for virtualization, since
>>>>>> you can't use specific sub-directory as a root directory of /proc/namespaces
>>>>>> to a container. You still have to introduce a new pid ns to have virtualized
>>>>>> /proc.
>>>>>
>>>>> I think we can figure out how to implement this. As the first idea, we
>>>>> can use the same way how /proc/net is implemented.
>>>>>
>>>>>>
>>>>>>> * With the suggested structure, for each user namespace, we will show
>>>>>>> only its subtree of namespaces. This looks more natural than
>>>>>>> filtering the content of one directory.
>>>>>>
>>>>>> It's rather subjective, I think. /proc is related to pid ns, and user ns
>>>>>> hierarchy does not look more natural for me.
>>>>>
>>>>> or /proc is wrong place for this

2020-08-17 17:50:54

by Christian Brauner

Subject: Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

On Mon, Aug 17, 2020 at 10:48:01AM -0500, Eric W. Biederman wrote:
>
> Creating names in the kernel for namespaces is very difficult and
> problematic. I have not seen anything that looks like all of the
> problems have been solved with restoring these new names.
>
> When your filter for your list of namespaces is user namespace creating
> a new directory in proc is highly questionable.
>
> As everyone uses proc placing this functionality in proc also amplifies
> the problem of creating names.
>
>
> Rather than proc, having a way to mount a namespace filesystem filtered by
> the user namespace of the mounter is likely to have many fewer
> problems. Especially as we are limiting/not allowing new non-process
> things and ideally finding a way to remove the non-process things.
>
>
> Kirill you have a good point that taking the case where a pid namespace
> does not exist in a user namespace is likely quite unrealistic.
>
> Kirill mentioned upthread that the list of namespaces is the list that
> can appear in a container. Except by discipline in creating containers
> it is not possible to know which namespaces may appear attached to a
> process. It is possible to be very creative with setns, and violate any
> constraint you may have. Which means your filtered list of namespaces
> may not contain all of the namespaces used by a set of processes. This

Indeed. We use setns() quite creatively when intercepting syscalls and
when attaching to a container.

> further argues that attaching the list of namespaces to proc does not
> make sense.
>
> Andrei has a good point that placing the names in a hierarchy by
> user namespace has the potential to create more freedom when
> assigning names to namespaces, as it means the names for namespaces
> do not need to be globally unique, while still allowing the names
> to stay the same.
>
>
> To recap the possibilities for names for namespaces that I have seen
> mentioned in this thread are:
> - Names per mount
> - Names per user namespace
>
> I personally suspect that names per mount are likely to be so flexible
> they are confusing, while names per user namespace are likely to be
> rigid, possibly too rigid to use.
>
> It all depends upon how everything is used. I have yet to see a
> complete story of how these names will be generated and used. So I can
> not really judge.

So I haven't fully understood either what the motivation for this
patchset is.
I can just speak to the use-case I had when I started prototyping
something similar: We needed a way to get a view on all namespaces
that exist on the system because we wanted a way to do namespace
debugging on a live system. This interface could've easily lived in
debugfs. The main point was that it should contain all namespaces.
Note that it wasn't supposed to be a hierarchical format; it was only
meant to list all namespaces and be accessible to real root.
The interface here is way more flexible/complex and I haven't yet
figured out what exactly it is supposed to be used for.

>
>
> Let me add another take on this idea that might give this work a path
> forward. If I were solving this I would explore giving nsfs directories
> per user namespace, and a way to mount it that exposed the directory of
> the mounters current user namespace (something like btrfs snapshots).
>
> Hmm. For the user namespace directory I think I would give it a file
> "ns" that can be opened to get a file handle on the user namespace.
> Plus a set of subdirectories ("cgroup", "ipc", "mnt", "net", "pid",
> "user", "uts") for each type of namespace. In each directory I think
> I would just have a 64bit counter and each new entry I would assign the
> next number from that counter.
>
> The restore could either have the ability to rename files or simply the
> ability to bump the counter (like we do with pids) so the names of the
> namespaces can be restored.
>
> That winds up making a user namespace the namespace of namespaces, so
> I am not 100% sure about the idea.

I think you're right that we need to understand better what the use-case
is. If I understand your suggestion correctly it wouldn't allow showing
nested user namespaces if the nsfs mount is per-user namespace.

Let me throw in a crazy idea: couldn't we just make the ioctl_ns() walk
a namespace hierarchy? For example, you could pass in a user namespace
fd and then you'd get back a struct with handles for fds for the
namespaces owned by that user namespace and then you could use
NS_GET_USERNS/NS_GET_PARENT to walk upwards from the user namespace fd
passed in initially and so on? Or something similar/simpler. This would
also decouple this from procfs somewhat.
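
The upward part of such a walk can already be sketched with the existing
ioctls. This is only an illustration in Python: walk_up(), ns_inode() and
userns_parent_fd() are helper names made up here, and the request values
assume an architecture where _IO(0xb7, nr) encodes to (0xb7 << 8) | nr
(x86 and arm64, for example). The downward "give me the owned namespaces"
call is the part that does not exist yet and is not shown.

```python
import errno
import fcntl
import os

# Request numbers from <linux/nsfs.h>; the encoding is an assumption, see above.
NS_GET_USERNS = 0xB701  # _IO(0xb7, 0x1): owning user namespace
NS_GET_PARENT = 0xB702  # _IO(0xb7, 0x2): parent of a pid or user namespace


def walk_up(start, get_parent):
    """Return [start, parent, grandparent, ...]; get_parent returns None at the top."""
    chain = [start]
    while True:
        parent = get_parent(chain[-1])
        if parent is None:
            return chain
        chain.append(parent)


def ns_inode(fd):
    """Inode number that identifies the namespace behind an nsfs fd."""
    return os.fstat(fd).st_ino


def userns_parent_fd(fd):
    """get_parent for real namespace fds; None once no parent is visible."""
    try:
        return fcntl.ioctl(fd, NS_GET_PARENT)
    except OSError as exc:
        # EPERM: parent is outside the caller's scope (or fd is the initial ns).
        # EINVAL: fd refers to a non-hierarchical namespace type.
        if exc.errno in (errno.EPERM, errno.EINVAL):
            return None
        raise
```

Calling walk_up(os.open("/proc/self/ns/user", os.O_RDONLY), userns_parent_fd)
would return a list of fds, which the caller has to close; passing a
function that uses NS_GET_USERNS instead walks owners rather than parents.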

Christian

2020-08-17 18:56:08

by Eric W. Biederman

Subject: Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary


Creating names in the kernel for namespaces is very difficult and
problematic. I have not seen anything that looks like all of the
problems have been solved with restoring these new names.

When your filter for your list of namespaces is user namespace creating
a new directory in proc is highly questionable.

As everyone uses proc placing this functionality in proc also amplifies
the problem of creating names.


Rather than proc, having a way to mount a namespace filesystem filtered by
the user namespace of the mounter is likely to have many fewer
problems. Especially as we are limiting/not allowing new non-process
things and ideally finding a way to remove the non-process things.


Kirill you have a good point that taking the case where a pid namespace
does not exist in a user namespace is likely quite unrealistic.

Kirill mentioned upthread that the list of namespaces is the list that
can appear in a container. Except by discipline in creating containers
it is not possible to know which namespaces may appear attached to a
process. It is possible to be very creative with setns, and violate any
constraint you may have. Which means your filtered list of namespaces
may not contain all of the namespaces used by a set of processes. This
further argues that attaching the list of namespaces to proc does not
make sense.

Andrei has a good point that placing the names in a hierarchy by
user namespace has the potential to create more freedom when
assigning names to namespaces, as it means the names for namespaces
do not need to be globally unique, while still allowing the names
to stay the same.


To recap the possibilities for names for namespaces that I have seen
mentioned in this thread are:
- Names per mount
- Names per user namespace

I personally suspect that names per mount are likely to be so flexible
they are confusing, while names per user namespace are likely to be
rigid, possibly too rigid to use.

It all depends upon how everything is used. I have yet to see a
complete story of how these names will be generated and used. So I can
not really judge.


Let me add another take on this idea that might give this work a path
forward. If I were solving this I would explore giving nsfs directories
per user namespace, and a way to mount it that exposed the directory of
the mounter's current user namespace (something like btrfs snapshots).

Hmm. For the user namespace directory I think I would give it a file
"ns" that can be opened to get a file handle on the user namespace.
Plus a set of subdirectories ("cgroup", "ipc", "mnt", "net", "pid",
"user", "uts") for each type of namespace. In each directory I think
I would just have a 64bit counter and each new entry I would assign the
next number from that counter.

The restore could either have the ability to rename files or simply the
ability to bump the counter (like we do with pids) so the names of the
namespaces can be restored.
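
As a rough model of the counter idea (plain Python, not kernel code; the
NsTypeDir class and its create()/bump() methods are hypothetical names used
only for this sketch), each per-type directory hands out the next number and
restore may move the counter forward before recreating a namespace:

```python
class NsTypeDir:
    """Hypothetical per-type directory ("net", "mnt", ...) inside one user ns."""

    def __init__(self):
        self.next_id = 1      # next name to hand out
        self.entries = {}     # name (int) -> namespace object

    def create(self, ns):
        """Register a new namespace under the next counter value."""
        name = self.next_id
        self.entries[name] = ns
        self.next_id += 1
        return name

    def bump(self, next_id):
        """Restore helper: move the counter forward, never backward."""
        if next_id < self.next_id:
            raise ValueError("counter can only be bumped forward")
        self.next_id = next_id
```

Because the counter only grows, a name is never reused inside one directory.
A real implementation would also have to cap the value at 64 bits, which
Python integers do not do on their own.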

That winds up making a user namespace the namespace of namespaces, so
I am not 100% sure about the idea.

Eric


2020-08-17 22:44:17

by Eric W. Biederman

Subject: Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

Christian Brauner writes:

> On Mon, Aug 17, 2020 at 10:48:01AM -0500, Eric W. Biederman wrote:
>>
>> Creating names in the kernel for namespaces is very difficult and
>> problematic. I have not seen anything that looks like all of the
>> problems have been solved with restoring these new names.
>>
>> When your filter for your list of namespaces is user namespace creating
>> a new directory in proc is highly questionable.
>>
>> As everyone uses proc placing this functionality in proc also amplifies
>> the problem of creating names.
>>
>>
>> Rather than proc, having a way to mount a namespace filesystem filtered by
>> the user namespace of the mounter is likely to have many fewer
>> problems. Especially as we are limiting/not allowing new non-process
>> things and ideally finding a way to remove the non-process things.
>>
>>
>> Kirill you have a good point that taking the case where a pid namespace
>> does not exist in a user namespace is likely quite unrealistic.
>>
>> Kirill mentioned upthread that the list of namespaces is the list that
>> can appear in a container. Except by discipline in creating containers
>> it is not possible to know which namespaces may appear attached to a
>> process. It is possible to be very creative with setns, and violate any
>> constraint you may have. Which means your filtered list of namespaces
>> may not contain all of the namespaces used by a set of processes. This
>
> Indeed. We use setns() quite creatively when intercepting syscalls and
> when attaching to a container.
>
>> further argues that attaching the list of namespaces to proc does not
>> make sense.
>>
>> Andrei has a good point that placing the names in a hierarchy by
>> user namespace has the potential to create more freedom when
>> assigning names to namespaces, as it means the names for namespaces
>> do not need to be globally unique, while still allowing the names
>> to stay the same.
>>
>>
>> To recap the possibilities for names for namespaces that I have seen
>> mentioned in this thread are:
>> - Names per mount
>> - Names per user namespace
>>
>> I personally suspect that names per mount are likely to be so flexible
>> they are confusing, while names per user namespace are likely to be
>> rigid, possibly too rigid to use.
>>
>> It all depends upon how everything is used. I have yet to see a
>> complete story of how these names will be generated and used. So I can
>> not really judge.
>
> So I haven't fully understood either what the motivation for this
> patchset is.
> I can just speak to the use-case I had when I started prototyping
> something similar: We needed a way to get a view on all namespaces
> that exist on the system because we wanted a way to do namespace
> debugging on a live system. This interface could've easily lived in
> debugfs. The main point was that it should contain all namespaces.
> Note that it wasn't supposed to be a hierarchical format; it was only
> meant to list all namespaces and be accessible to real root.
> The interface here is way more flexible/complex and I haven't yet
> figured out what exactly it is supposed to be used for.
>
>>
>>
>> Let me add another take on this idea that might give this work a path
>> forward. If I were solving this I would explore giving nsfs directories
>> per user namespace, and a way to mount it that exposed the directory of
>> the mounters current user namespace (something like btrfs snapshots).
>>
>> Hmm. For the user namespace directory I think I would give it a file
>> "ns" that can be opened to get a file handle on the user namespace.
>> Plus a set of subdirectories ("cgroup", "ipc", "mnt", "net", "pid",
>> "user", "uts") for each type of namespace. In each directory I think
>> I would just have a 64bit counter and each new entry I would assign the
>> next number from that counter.
>>
>> The restore could either have the ability to rename files or simply the
>> ability to bump the counter (like we do with pids) so the names of the
>> namespaces can be restored.
>>
>> That winds up making a user namespace the namespace of namespaces, so
>> I am not 100% sure about the idea.
>
> I think you're right that we need to understand better what the use-case
> is. If I understand your suggestion correctly it wouldn't allow showing
> nested user namespaces if the nsfs mount is per-user namespace.

So what I was thinking is that we have the user namespace directories
and that the mount code would perform a bind mount such that the
directory that matches the mounter's user namespace is the root
directory.

> Let me throw in a crazy idea: couldn't we just make the ioctl_ns() walk
> a namespace hierarchy? For example, you could pass in a user namespace
> fd and then you'd get back a struct with handles for fds for the
> namespaces owned by that user namespace and then you could use
> NS_GET_USERNS/NS_GET_PARENT to walk upwards from the user namespace fd
> passed in initially and so on? Or something similar/simpler. This would
> also decouple this from procfs somewhat.

Hmm.

That would remove the need to have names. We could just keep a list
of the namespaces in creation order. Hopefully the CRIU folks could
preserve that creation order without too much trouble.

Say with an ioctl NS_NEXT_CREATION which takes two fds, and returns
a new file descriptor. The arguments would be the user namespace
and -1 or the file descriptor last returned from NS_NEXT_CREATION.


Assuming that is not difficult for CRIU to restore that would be a very
simple patch.
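
From userspace, the iteration would look roughly like the loop below. This is
only a simulation in Python: NS_NEXT_CREATION does not exist, so
ns_next_creation() is a stand-in that walks an ordinary list, and returning
None at the end of the list is an assumption about how the end would be
signalled.

```python
def ns_next_creation(created, last):
    """Stand-in for the proposed ioctl.

    `created` plays the role of the user namespace's creation-ordered list,
    `last` is -1 to start or the value returned by the previous call.
    """
    index = 0 if last == -1 else created.index(last) + 1
    return created[index] if index < len(created) else None


def iter_creation_order(created):
    """Yield every namespace of one user namespace in creation order."""
    last = -1
    while True:
        nxt = ns_next_creation(created, last)
        if nxt is None:
            return
        yield nxt
        last = nxt
```

No names are involved anywhere in the loop; the only thing a restore would
have to reproduce is the order of the list.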

Eric