2018-04-04 19:18:26

by Konstantin Khlebnikov

[permalink] [raw]
Subject: [PATCH RFC v5] pidns: introduce syscall translate_pid

Each process have different pids, one for each pid namespace it belongs.
When interaction happens within single pid-ns translation isn't required.
More complicated scenarios needs special handling.

For example:
- reading pid-files or logs written inside container with pid namespace
- attaching with ptrace to tasks from different pid namespace
- passing pids across pid namespaces in any kind of API

Currently there are several interfaces that could be used here:

Pid namespaces are identified by inode number of /proc/[pid]/ns/pid.

Pids for nested Pid namespaces are shown in file /proc/[pid]/status.
In some cases conversion pid -> vpid could be easily done using this
information, but backward translation requires scanning all tasks.

Unix socket automatically translates pid attached to SCM_CREDENTIALS.
This requires CAP_SYS_ADMIN for sending arbitrary pids and entering
into pid namespace, this expose process and could be insecure.

This patch adds new syscall for converting pids between pid namespaces:

pid_t translate_pid(pid_t pid, int source_type, int source,
int target_type, int target);

@source_type and @target_type defines type of following arguments:

TRANSLATE_PID_CURRENT_PIDNS - current pid namespace, argument is unused
TRANSLATE_PID_TASK_PIDNS - task pid-ns, argument is task pid
TRANSLATE_PID_FD_PIDNS - pidns fd, argument is file descriptor

Syscall returns pid in target pid-ns or zero if task have no pid there.

Error codes:
-EINVAL - @source or @target couldn't be resolved into pid namespace
-ESRCH - task with @pid is not found in @source pid-namespace

Other pid namespaces are referenced either by pid of any process who
lives inside it or by file descriptor pointing to /proc/[pid]/ns/pid.
Latter method provides better protection against races but in some
cases requires CAP_SYS_PTRACE.

Translate_pid could breach pid isolation and return pids from outer pid
namespaces iff process already has file descriptor pointing to them.


Examples:

- get pid in current pid namespace

translate_pid(pid, TRANSLATE_PID_FD_PIDNS, ns_fd,
TRANSLATE_PID_CURRENT_PIDNS, 0)
or
translate_pid(pid, TRANSLATE_PID_TASK_PIDNS, ns_pid,
TRANSLATE_PID_CURRENT_PIDNS, 0)

- get pid in other pid namespace

translate_pid(pid, TRANSLATE_PID_CURRENT_PIDNS, 0,
TRANSLATE_PID_FD_PIDNS, ns_fd)
or
translate_pid(pid, TRANSLATE_PID_CURRENT_PIDNS, 0,
TRANSLATE_PID_TASK_PIDNS, ns_pid)

- get deepest pid

translate_pid(pid, TRANSLATE_PID_CURRENT_PIDNS, 0,
TRANSLATE_PID_TASK_PIDNS, pid)

- get pid of init task for namespace

translate_pid(1, TRANSLATE_PID_FD_PIDNS, ns_fd,
TRANSLATE_PID_CURRENT_PIDNS, 0)


This syscall also could be used for checking topology of pid namespaces:

- ns1 nests inside ns2

translate_pid(1, TRANSLATE_PID_FD_PIDNS, ns1_fd,
TRANSLATE_PID_FD_PIDNS, ns2_fd) > 1

- task1 lives in same pid-namespace as task2

translate_pid(1, TRANSLATE_PID_TASK_PIDNS, task1_pid,
TRANSLATE_PID_TASK_PIDNS, task2_pid) == 1

- task1 is isolated from task2

translate_pid(task1_pid, TRANSLATE_PID_CURRENT_PIDNS, 0,
TRANSLATE_PID_TASK_PIDNS, task2_pid) == 0

- pid is reachable from ns

translate_pid(pid, TRANSLATE_PID_CURRENT_PIDNS, 0,
TRANSLATE_PID_FD_PIDNS, ns_fd) > 0

Signed-off-by: Konstantin Khlebnikov <[email protected]>

---

v1: https://lkml.org/lkml/2015/9/15/411
v2: https://lkml.org/lkml/2015/9/24/278
* use namespace-fd as second/third argument
* add -pid for getting parent pid
* move code into kernel/sys.c next to getppid
* drop ifdef CONFIG_PID_NS
* add generic syscall
v3: https://lkml.org/lkml/2015/9/28/3
* use proc_ns_fdget()
* update description
* rebase to next-20150925
* fix conflict with mlock2
v4: https://lkml.org/lkml/2017/10/16/852
* rename into translate_pid()
* remove syscall if CONFIG_PID_NS=n
* drop -pid for parent task
* drop fget-fdget optimizations
* add helper get_pid_ns_by_fd()
* wire only into x86
v5:
* rewrite commit message
* resolve pidns by task pid or by pidns fd
* add arguments source_type and target_type

--- sample tool translate_pid.c ---

#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sched.h>
#include <fcntl.h>
#include <err.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

#ifndef SYS_translate_pid
#ifdef __x86_64__
#define SYS_translate_pid 333
#endif
#endif

#ifndef TRANSLATE_PID_CURRENT_PIDNS
#define TRANSLATE_PID_CURRENT_PIDNS 0
#define TRANSLATE_PID_TASK_PIDNS 1
#define TRANSLATE_PID_FD_PIDNS 2
#endif

pid_t translate_pid(pid_t pid, int source_type, int source,
int target_type, int target) {
return syscall(SYS_translate_pid, pid, source_type, source,
target_type, target);
}

int main(int argc, char **argv) {
int pid, source, target;
char buf[64];

if (argc != 4)
errx(1, "usage: %s <pid> <source> <traget>", argv[0]);

pid = atoi(argv[1]);
int source_type, target_type;
source = atoi(argv[2]);
target = atoi(argv[3]);

if (source < 0) {
source_type = TRANSLATE_PID_TASK_PIDNS;
source = -source;
} else if (source > 0) {
source_type = TRANSLATE_PID_FD_PIDNS;
sprintf(buf, "/proc/%d/ns/pid", source);
source = open(buf, O_RDONLY);
if (source < 0)
err(2, "open source %s", buf);
} else {
source_type = TRANSLATE_PID_CURRENT_PIDNS;
}

if (target < 0) {
target_type = TRANSLATE_PID_TASK_PIDNS;
target = -target;
} else if (target > 0) {
target_type = TRANSLATE_PID_FD_PIDNS;
sprintf(buf, "/proc/%d/ns/pid", target);
target = open(buf, O_RDONLY);
if (target < 0)
err(2, "open target %s", buf);
} else {
target_type = TRANSLATE_PID_CURRENT_PIDNS;
}

pid = translate_pid(pid, source_type, source, target_type, target);
if (pid < 0)
err(2, "translate");

printf("%d\n", pid);
return 0;
}

---
---
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
include/linux/syscalls.h | 4 ++
include/uapi/linux/sched.h | 7 ++++
kernel/pid_namespace.c | 64 ++++++++++++++++++++++++++++++++
kernel/sys_ni.c | 3 ++
6 files changed, 80 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index c58f75b088c5..aef52c709845 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -391,3 +391,4 @@
382 i386 pkey_free sys_pkey_free
383 i386 statx sys_statx
384 i386 arch_prctl sys_arch_prctl compat_sys_arch_prctl
+385 i386 translate_pid sys_translate_pid
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 5aef183e2f85..1ebdab83c6f4 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -339,6 +339,7 @@
330 common pkey_alloc sys_pkey_alloc
331 common pkey_free sys_pkey_free
332 common statx sys_statx
+333 common translate_pid sys_translate_pid

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index b961184f597a..d189a1f61160 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -553,6 +553,10 @@ asmlinkage long sys_clock_nanosleep(clockid_t which_clock, int flags,
/* kernel/printk.c */
asmlinkage long sys_syslog(int type, char __user *buf, int len);

+/* kernel/pid_namespace.c */
+asmlinkage long sys_translate_pid(pid_t pid, int source_type, int source,
+ int target_type, int target);
+
/* kernel/ptrace.c */
asmlinkage long sys_ptrace(long request, long pid, unsigned long addr,
unsigned long data);
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 22627f80063e..7c45fd8d33d7 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -55,4 +55,11 @@
SCHED_FLAG_RECLAIM | \
SCHED_FLAG_DL_OVERRUN)

+/*
+ * For translate_pid()
+ */
+#define TRANSLATE_PID_CURRENT_PIDNS 0 /* Current pid namespace */
+#define TRANSLATE_PID_TASK_PIDNS 1 /* Namespace by task pid */
+#define TRANSLATE_PID_FD_PIDNS 2 /* Namespace by pidns fd */
+
#endif /* _UAPI_LINUX_SCHED_H */
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 2a2ac53d8b8b..84c8b47289d5 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -13,6 +13,7 @@
#include <linux/user_namespace.h>
#include <linux/syscalls.h>
#include <linux/cred.h>
+#include <linux/file.h>
#include <linux/err.h>
#include <linux/acct.h>
#include <linux/slab.h>
@@ -380,6 +381,69 @@ static void pidns_put(struct ns_common *ns)
put_pid_ns(to_pid_ns(ns));
}

+/* Under rcu_read_lock(). Returns pointer to pid_namespace or NULL. */
+static struct pid_namespace *resolve_pid_ns(int type, int fd_or_pid)
+{
+ struct pid_namespace *current_ns = task_active_pid_ns(current);
+ struct pid_namespace *pidns = NULL;
+ struct ns_common *ns;
+ struct file *file;
+
+ switch (type) {
+ case TRANSLATE_PID_CURRENT_PIDNS:
+ pidns = current_ns;
+ break;
+ case TRANSLATE_PID_TASK_PIDNS:
+ pidns = ns_of_pid(find_pid_ns(fd_or_pid, current_ns));
+ break;
+ case TRANSLATE_PID_FD_PIDNS:
+ file = proc_ns_fget(fd_or_pid);
+ if (!IS_ERR(file)) {
+ ns = get_proc_ns(file_inode(file));
+ if (ns->ops->type == CLONE_NEWPID)
+ pidns = to_pid_ns(ns);
+ fput(file);
+ }
+ break;
+ }
+
+ return pidns;
+}
+
+/*
+ * translate_pid - convert pid in source pid-ns into target pid-ns.
+ * @pid: pid for translation
+ * @source_type: one of TRANSLATE_PID_*
+ * @source: depending on @source_type pid-ns fd, pid, or nothing
+ * @target_type: one of TRANSLATE_PID_*
+ * @target: depending on @target_type pid-ns fd, pid, or nothing
+ *
+ * Returns pid in @target pid-ns, zero if task have no pid there,
+ * or -ESRCH if task with @pid does not found in @source pid-ns,
+ * or -EINVAL if @source or @target couldn't be resolved into pid-ns.
+ */
+SYSCALL_DEFINE5(translate_pid, pid_t, pid,
+ int, source_type, int, source,
+ int, target_type, int, target)
+{
+ struct pid_namespace *source_ns, *target_ns;
+ struct pid *struct_pid;
+ pid_t result = -EINVAL;
+
+ rcu_read_lock();
+ source_ns = resolve_pid_ns(source_type, source);
+ if (!source_ns)
+ goto out;
+ target_ns = resolve_pid_ns(target_type, target);
+ if (!target_ns)
+ goto out;
+ struct_pid = find_pid_ns(pid, source_ns);
+ result = struct_pid ? pid_nr_ns(struct_pid, target_ns) : -ESRCH;
+out:
+ rcu_read_unlock();
+ return result;
+}
+
static int pidns_install(struct nsproxy *nsproxy, struct ns_common *ns)
{
struct pid_namespace *active = task_active_pid_ns(current);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 6cafc008f6db..777689bce406 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -146,6 +146,9 @@ COND_SYSCALL(delete_module);
/* kernel/printk.c */
COND_SYSCALL(syslog);

+/* kernel/pid_namespace.c */
+COND_SYSCALL(sys_translate_pid);
+
/* kernel/ptrace.c */

/* kernel/sched/core.c */



2018-04-04 20:38:59

by Nagarathnam Muthusamy

[permalink] [raw]
Subject: Re: [PATCH RFC v5] pidns: introduce syscall translate_pid



On 04/04/2018 12:11 PM, Konstantin Khlebnikov wrote:
> Each process have different pids, one for each pid namespace it belongs.
> When interaction happens within single pid-ns translation isn't required.
> More complicated scenarios needs special handling.
>
> For example:
> - reading pid-files or logs written inside container with pid namespace
> - attaching with ptrace to tasks from different pid namespace
> - passing pids across pid namespaces in any kind of API
>
> Currently there are several interfaces that could be used here:
>
> Pid namespaces are identified by inode number of /proc/[pid]/ns/pid.
>
> Pids for nested Pid namespaces are shown in file /proc/[pid]/status.
> In some cases conversion pid -> vpid could be easily done using this
> information, but backward translation requires scanning all tasks.
>
> Unix socket automatically translates pid attached to SCM_CREDENTIALS.
> This requires CAP_SYS_ADMIN for sending arbitrary pids and entering
> into pid namespace, this expose process and could be insecure.
>
> This patch adds new syscall for converting pids between pid namespaces:
>
> pid_t translate_pid(pid_t pid, int source_type, int source,
> int target_type, int target);
>
> @source_type and @target_type defines type of following arguments:
>
> TRANSLATE_PID_CURRENT_PIDNS - current pid namespace, argument is unused
> TRANSLATE_PID_TASK_PIDNS - task pid-ns, argument is task pid

I believe using pid to represent the namespace has been already
discussed in V1 of this patch in https://lkml.org/lkml/2015/9/22/1087
after which we moved on to fd based version of this interface.

-Nagarathnam.

> TRANSLATE_PID_FD_PIDNS - pidns fd, argument is file descriptor
>
> Syscall returns pid in target pid-ns or zero if task have no pid there.
>
> Error codes:
> -EINVAL - @source or @target couldn't be resolved into pid namespace
> -ESRCH - task with @pid is not found in @source pid-namespace
>
> Other pid namespaces are referenced either by pid of any process who
> lives inside it or by file descriptor pointing to /proc/[pid]/ns/pid.
> Latter method provides better protection against races but in some
> cases requires CAP_SYS_PTRACE.
>
> Translate_pid could breach pid isolation and return pids from outer pid
> namespaces iff process already has file descriptor pointing to them.
>
>
> Examples:
>
> - get pid in current pid namespace
>
> translate_pid(pid, TRANSLATE_PID_FD_PIDNS, ns_fd,
> TRANSLATE_PID_CURRENT_PIDNS, 0)
> or
> translate_pid(pid, TRANSLATE_PID_TASK_PIDNS, ns_pid,
> TRANSLATE_PID_CURRENT_PIDNS, 0)
>
> - get pid in other pid namespace
>
> translate_pid(pid, TRANSLATE_PID_CURRENT_PIDNS, 0,
> TRANSLATE_PID_FD_PIDNS, ns_fd)
> or
> translate_pid(pid, TRANSLATE_PID_CURRENT_PIDNS, 0,
> TRANSLATE_PID_TASK_PIDNS, ns_pid)
>
> - get deepest pid
>
> translate_pid(pid, TRANSLATE_PID_CURRENT_PIDNS, 0,
> TRANSLATE_PID_TASK_PIDNS, pid)
>
> - get pid of init task for namespace
>
> translate_pid(1, TRANSLATE_PID_FD_PIDNS, ns_fd,
> TRANSLATE_PID_CURRENT_PIDNS, 0)
>
>
> This syscall also could be used for checking topology of pid namespaces:
>
> - ns1 nests inside ns2
>
> translate_pid(1, TRANSLATE_PID_FD_PIDNS, ns1_fd,
> TRANSLATE_PID_FD_PIDNS, ns2_fd) > 1
>
> - task1 lives in same pid-namespace as task2
>
> translate_pid(1, TRANSLATE_PID_TASK_PIDNS, task1_pid,
> TRANSLATE_PID_TASK_PIDNS, task2_pid) == 1
>
> - task1 is isolated from task2
>
> translate_pid(task1_pid, TRANSLATE_PID_CURRENT_PIDNS, 0,
> TRANSLATE_PID_TASK_PIDNS, task2_pid) == 0
>
> - pid is reachable from ns
>
> translate_pid(pid, TRANSLATE_PID_CURRENT_PIDNS, 0,
> TRANSLATE_PID_FD_PIDNS, ns_fd) > 0
>
> Signed-off-by: Konstantin Khlebnikov <[email protected]>
>
> ---
>
> v1: https://lkml.org/lkml/2015/9/15/411
> v2: https://lkml.org/lkml/2015/9/24/278
> * use namespace-fd as second/third argument
> * add -pid for getting parent pid
> * move code into kernel/sys.c next to getppid
> * drop ifdef CONFIG_PID_NS
> * add generic syscall
> v3: https://lkml.org/lkml/2015/9/28/3
> * use proc_ns_fdget()
> * update description
> * rebase to next-20150925
> * fix conflict with mlock2
> v4: https://lkml.org/lkml/2017/10/16/852
> * rename into translate_pid()
> * remove syscall if CONFIG_PID_NS=n
> * drop -pid for parent task
> * drop fget-fdget optimizations
> * add helper get_pid_ns_by_fd()
> * wire only into x86
> v5:
> * rewrite commit message
> * resolve pidns by task pid or by pidns fd
> * add arguments source_type and target_type
>
> --- sample tool translate_pid.c ---
>
> #define _GNU_SOURCE
> #include <sys/syscall.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <sched.h>
> #include <fcntl.h>
> #include <err.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <stdio.h>
>
> #ifndef SYS_translate_pid
> #ifdef __x86_64__
> #define SYS_translate_pid 333
> #endif
> #endif
>
> #ifndef TRANSLATE_PID_CURRENT_PIDNS
> #define TRANSLATE_PID_CURRENT_PIDNS 0
> #define TRANSLATE_PID_TASK_PIDNS 1
> #define TRANSLATE_PID_FD_PIDNS 2
> #endif
>
> pid_t translate_pid(pid_t pid, int source_type, int source,
> int target_type, int target) {
> return syscall(SYS_translate_pid, pid, source_type, source,
> target_type, target);
> }
>
> int main(int argc, char **argv) {
> int pid, source, target;
> char buf[64];
>
> if (argc != 4)
> errx(1, "usage: %s <pid> <source> <traget>", argv[0]);
>
> pid = atoi(argv[1]);
> int source_type, target_type;
> source = atoi(argv[2]);
> target = atoi(argv[3]);
>
> if (source < 0) {
> source_type = TRANSLATE_PID_TASK_PIDNS;
> source = -source;
> } else if (source > 0) {
> source_type = TRANSLATE_PID_FD_PIDNS;
> sprintf(buf, "/proc/%d/ns/pid", source);
> source = open(buf, O_RDONLY);
> if (source < 0)
> err(2, "open source %s", buf);
> } else {
> source_type = TRANSLATE_PID_CURRENT_PIDNS;
> }
>
> if (target < 0) {
> target_type = TRANSLATE_PID_TASK_PIDNS;
> target = -target;
> } else if (target > 0) {
> target_type = TRANSLATE_PID_FD_PIDNS;
> sprintf(buf, "/proc/%d/ns/pid", target);
> target = open(buf, O_RDONLY);
> if (target < 0)
> err(2, "open target %s", buf);
> } else {
> target_type = TRANSLATE_PID_CURRENT_PIDNS;
> }
>
> pid = translate_pid(pid, source_type, source, target_type, target);
> if (pid < 0)
> err(2, "translate");
>
> printf("%d\n", pid);
> return 0;
> }
>
> ---
> ---
> arch/x86/entry/syscalls/syscall_32.tbl | 1 +
> arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> include/linux/syscalls.h | 4 ++
> include/uapi/linux/sched.h | 7 ++++
> kernel/pid_namespace.c | 64 ++++++++++++++++++++++++++++++++
> kernel/sys_ni.c | 3 ++
> 6 files changed, 80 insertions(+)
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index c58f75b088c5..aef52c709845 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -391,3 +391,4 @@
> 382 i386 pkey_free sys_pkey_free
> 383 i386 statx sys_statx
> 384 i386 arch_prctl sys_arch_prctl compat_sys_arch_prctl
> +385 i386 translate_pid sys_translate_pid
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 5aef183e2f85..1ebdab83c6f4 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -339,6 +339,7 @@
> 330 common pkey_alloc sys_pkey_alloc
> 331 common pkey_free sys_pkey_free
> 332 common statx sys_statx
> +333 common translate_pid sys_translate_pid
>
> #
> # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index b961184f597a..d189a1f61160 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -553,6 +553,10 @@ asmlinkage long sys_clock_nanosleep(clockid_t which_clock, int flags,
> /* kernel/printk.c */
> asmlinkage long sys_syslog(int type, char __user *buf, int len);
>
> +/* kernel/pid_namespace.c */
> +asmlinkage long sys_translate_pid(pid_t pid, int source_type, int source,
> + int target_type, int target);
> +
> /* kernel/ptrace.c */
> asmlinkage long sys_ptrace(long request, long pid, unsigned long addr,
> unsigned long data);
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 22627f80063e..7c45fd8d33d7 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -55,4 +55,11 @@
> SCHED_FLAG_RECLAIM | \
> SCHED_FLAG_DL_OVERRUN)
>
> +/*
> + * For translate_pid()
> + */
> +#define TRANSLATE_PID_CURRENT_PIDNS 0 /* Current pid namespace */
> +#define TRANSLATE_PID_TASK_PIDNS 1 /* Namespace by task pid */
> +#define TRANSLATE_PID_FD_PIDNS 2 /* Namespace by pidns fd */
> +
> #endif /* _UAPI_LINUX_SCHED_H */
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index 2a2ac53d8b8b..84c8b47289d5 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -13,6 +13,7 @@
> #include <linux/user_namespace.h>
> #include <linux/syscalls.h>
> #include <linux/cred.h>
> +#include <linux/file.h>
> #include <linux/err.h>
> #include <linux/acct.h>
> #include <linux/slab.h>
> @@ -380,6 +381,69 @@ static void pidns_put(struct ns_common *ns)
> put_pid_ns(to_pid_ns(ns));
> }
>
> +/* Under rcu_read_lock(). Returns pointer to pid_namespace or NULL. */
> +static struct pid_namespace *resolve_pid_ns(int type, int fd_or_pid)
> +{
> + struct pid_namespace *current_ns = task_active_pid_ns(current);
> + struct pid_namespace *pidns = NULL;
> + struct ns_common *ns;
> + struct file *file;
> +
> + switch (type) {
> + case TRANSLATE_PID_CURRENT_PIDNS:
> + pidns = current_ns;
> + break;
> + case TRANSLATE_PID_TASK_PIDNS:
> + pidns = ns_of_pid(find_pid_ns(fd_or_pid, current_ns));
> + break;
> + case TRANSLATE_PID_FD_PIDNS:
> + file = proc_ns_fget(fd_or_pid);
> + if (!IS_ERR(file)) {
> + ns = get_proc_ns(file_inode(file));
> + if (ns->ops->type == CLONE_NEWPID)
> + pidns = to_pid_ns(ns);
> + fput(file);
> + }
> + break;
> + }
> +
> + return pidns;
> +}
> +
> +/*
> + * translate_pid - convert pid in source pid-ns into target pid-ns.
> + * @pid: pid for translation
> + * @source_type: one of TRANSLATE_PID_*
> + * @source: depending on @source_type pid-ns fd, pid, or nothing
> + * @target_type: one of TRANSLATE_PID_*
> + * @target: depending on @target_type pid-ns fd, pid, or nothing
> + *
> + * Returns pid in @target pid-ns, zero if task have no pid there,
> + * or -ESRCH if task with @pid does not found in @source pid-ns,
> + * or -EINVAL if @source or @target couldn't be resolved into pid-ns.
> + */
> +SYSCALL_DEFINE5(translate_pid, pid_t, pid,
> + int, source_type, int, source,
> + int, target_type, int, target)
> +{
> + struct pid_namespace *source_ns, *target_ns;
> + struct pid *struct_pid;
> + pid_t result = -EINVAL;
> +
> + rcu_read_lock();
> + source_ns = resolve_pid_ns(source_type, source);
> + if (!source_ns)
> + goto out;
> + target_ns = resolve_pid_ns(target_type, target);
> + if (!target_ns)
> + goto out;
> + struct_pid = find_pid_ns(pid, source_ns);
> + result = struct_pid ? pid_nr_ns(struct_pid, target_ns) : -ESRCH;
> +out:
> + rcu_read_unlock();
> + return result;
> +}
> +
> static int pidns_install(struct nsproxy *nsproxy, struct ns_common *ns)
> {
> struct pid_namespace *active = task_active_pid_ns(current);
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 6cafc008f6db..777689bce406 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -146,6 +146,9 @@ COND_SYSCALL(delete_module);
> /* kernel/printk.c */
> COND_SYSCALL(syslog);
>
> +/* kernel/pid_namespace.c */
> +COND_SYSCALL(sys_translate_pid);
> +
> /* kernel/ptrace.c */
>
> /* kernel/sched/core.c */
>


2018-04-04 22:33:12

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH RFC v5] pidns: introduce syscall translate_pid

Nagarathnam Muthusamy <[email protected]> writes:

> On 04/04/2018 12:11 PM, Konstantin Khlebnikov wrote:
>> Each process have different pids, one for each pid namespace it belongs.
>> When interaction happens within single pid-ns translation isn't required.
>> More complicated scenarios needs special handling.
>>
>> For example:
>> - reading pid-files or logs written inside container with pid namespace
>> - attaching with ptrace to tasks from different pid namespace
>> - passing pids across pid namespaces in any kind of API
>>
>> Currently there are several interfaces that could be used here:
>>
>> Pid namespaces are identified by inode number of /proc/[pid]/ns/pid.

Using the inode number in interfaces is not an option. Especially not
withou referencing the device number for the filesystem as well.

>> Pids for nested Pid namespaces are shown in file /proc/[pid]/status.
>> In some cases conversion pid -> vpid could be easily done using this
>> information, but backward translation requires scanning all tasks.
>>
>> Unix socket automatically translates pid attached to SCM_CREDENTIALS.
>> This requires CAP_SYS_ADMIN for sending arbitrary pids and entering
>> into pid namespace, this expose process and could be insecure.
>>
>> This patch adds new syscall for converting pids between pid namespaces:
>>
>> pid_t translate_pid(pid_t pid, int source_type, int source,
>> int target_type, int target);
>>
>> @source_type and @target_type defines type of following arguments:
>>
>> TRANSLATE_PID_CURRENT_PIDNS - current pid namespace, argument is unused
>> TRANSLATE_PID_TASK_PIDNS - task pid-ns, argument is task pid
>
> I believe using pid to represent the namespace has been already
> discussed in V1 of this patch in https://lkml.org/lkml/2015/9/22/1087
> after which we moved on to fd based version of this interface.

Or in short why is the case of pids important?

You Konstantin you almost said why they were important in your message
saying you were going to send this one. However you don't explain in
your description why you want to identify pid namespaces by pid.

Eric

2018-04-05 07:04:33

by Konstantin Khlebnikov

[permalink] [raw]
Subject: Re: [PATCH RFC v5] pidns: introduce syscall translate_pid

On 05.04.2018 01:29, Eric W. Biederman wrote:
> Nagarathnam Muthusamy <[email protected]> writes:
>
>> On 04/04/2018 12:11 PM, Konstantin Khlebnikov wrote:
>>> Each process have different pids, one for each pid namespace it belongs.
>>> When interaction happens within single pid-ns translation isn't required.
>>> More complicated scenarios needs special handling.
>>>
>>> For example:
>>> - reading pid-files or logs written inside container with pid namespace
>>> - attaching with ptrace to tasks from different pid namespace
>>> - passing pids across pid namespaces in any kind of API
>>>
>>> Currently there are several interfaces that could be used here:
>>>
>>> Pid namespaces are identified by inode number of /proc/[pid]/ns/pid.
>
> Using the inode number in interfaces is not an option. Especially not
> withou referencing the device number for the filesystem as well.

This is supposed to be single-instance fs,
not part of proc but referenced but its magic "symlinks".

Device numbers are not mentioned in "man namespaces".

>
>>> Pids for nested Pid namespaces are shown in file /proc/[pid]/status.
>>> In some cases conversion pid -> vpid could be easily done using this
>>> information, but backward translation requires scanning all tasks.
>>>
>>> Unix socket automatically translates pid attached to SCM_CREDENTIALS.
>>> This requires CAP_SYS_ADMIN for sending arbitrary pids and entering
>>> into pid namespace, this expose process and could be insecure.
>>>
>>> This patch adds new syscall for converting pids between pid namespaces:
>>>
>>> pid_t translate_pid(pid_t pid, int source_type, int source,
>>> int target_type, int target);
>>>
>>> @source_type and @target_type defines type of following arguments:
>>>
>>> TRANSLATE_PID_CURRENT_PIDNS - current pid namespace, argument is unused
>>> TRANSLATE_PID_TASK_PIDNS - task pid-ns, argument is task pid
>>
>> I believe using pid to represent the namespace has been already
>> discussed in V1 of this patch in https://lkml.org/lkml/2015/9/22/1087
>> after which we moved on to fd based version of this interface.
>
> Or in short why is the case of pids important?
>
> You Konstantin you almost said why they were important in your message
> saying you were going to send this one. However you don't explain in
> your description why you want to identify pid namespaces by pid.
>

Open of /proc/[pid]/ns/pid requires same permissions as ptrace,
pid based variant doesn't have such restrictions.
Most pid-based syscalls are racy in some cases but they are
here for decades and everybody knowns how to deal with it.
So, I've decided to merge both worlds in one interface which clearly tells what to expect.

2018-04-23 17:44:18

by Nagarathnam Muthusamy

[permalink] [raw]
Subject: Re: [PATCH RFC v5] pidns: introduce syscall translate_pid



On 04/05/2018 12:02 AM, Konstantin Khlebnikov wrote:
> On 05.04.2018 01:29, Eric W. Biederman wrote:
>> Nagarathnam Muthusamy <[email protected]> writes:
>>
>>> On 04/04/2018 12:11 PM, Konstantin Khlebnikov wrote:
>>>> Each process have different pids, one for each pid namespace it
>>>> belongs.
>>>> When interaction happens within single pid-ns translation isn't
>>>> required.
>>>> More complicated scenarios needs special handling.
>>>>
>>>> For example:
>>>> - reading pid-files or logs written inside container with pid
>>>> namespace
>>>> - attaching with ptrace to tasks from different pid namespace
>>>> - passing pids across pid namespaces in any kind of API
>>>>
>>>> Currently there are several interfaces that could be used here:
>>>>
>>>> Pid namespaces are identified by inode number of /proc/[pid]/ns/pid.
>>
>> Using the inode number in interfaces is not an option. Especially not
>> withou referencing the device number for the filesystem as well.
>
> This is supposed to be single-instance fs,
> not part of proc but referenced but its magic "symlinks".
>
> Device numbers are not mentioned in "man namespaces".
>
>>
>>>> Pids for nested Pid namespaces are shown in file /proc/[pid]/status.
>>>> In some cases conversion pid -> vpid could be easily done using this
>>>> information, but backward translation requires scanning all tasks.
>>>>
>>>> Unix socket automatically translates pid attached to SCM_CREDENTIALS.
>>>> This requires CAP_SYS_ADMIN for sending arbitrary pids and entering
>>>> into pid namespace, this expose process and could be insecure.
>>>>
>>>> This patch adds new syscall for converting pids between pid
>>>> namespaces:
>>>>
>>>> pid_t translate_pid(pid_t pid, int source_type, int source,
>>>>                                  int target_type, int target);
>>>>
>>>> @source_type and @target_type defines type of following arguments:
>>>>
>>>> TRANSLATE_PID_CURRENT_PIDNS  - current pid namespace, argument is
>>>> unused
>>>> TRANSLATE_PID_TASK_PIDNS     - task pid-ns, argument is task pid
>>>
>>> I believe using pid to represent the namespace has been already
>>> discussed in V1 of this patch in https://lkml.org/lkml/2015/9/22/1087
>>> after which we moved on to fd based version of this interface.
>>
>> Or in short why is the case of pids important?
>>
>> You Konstantin you almost said why they were important in your message
>> saying you were going to send this one.  However you don't explain in
>> your description why you want to identify pid namespaces by pid.
>>
>
> Open of /proc/[pid]/ns/pid requires same permissions as ptrace,
> pid based variant doesn't have such restrictions.

Can you provide more information on usecase requiring PID translation
but not used for tracing related purposes?
On a side note, can we have the types TRANSLATE_PID_CURRENT_PIDNS and
TRANSLATE_PID_FD_PIDNS integrated first and then possibly extend the
interface to include TRANSLATE_PID_TASK_PIDNS in future?

Thanks,
Nagarathnam.
> Most pid-based syscalls are racy in some cases but they are
> here for decades and everybody knowns how to deal with it.
> So, I've decided to merge both worlds in one interface which clearly
> tells what to expect.


2018-04-25 05:38:23

by Konstantin Khlebnikov

[permalink] [raw]
Subject: Re: [PATCH RFC v5] pidns: introduce syscall translate_pid

On 23.04.2018 20:37, Nagarathnam Muthusamy wrote:
>
>
> On 04/05/2018 12:02 AM, Konstantin Khlebnikov wrote:
>> On 05.04.2018 01:29, Eric W. Biederman wrote:
>>> Nagarathnam Muthusamy <[email protected]> writes:
>>>
>>>> On 04/04/2018 12:11 PM, Konstantin Khlebnikov wrote:
>>>>> Each process have different pids, one for each pid namespace it belongs.
>>>>> When interaction happens within single pid-ns translation isn't required.
>>>>> More complicated scenarios needs special handling.
>>>>>
>>>>> For example:
>>>>> - reading pid-files or logs written inside container with pid namespace
>>>>> - attaching with ptrace to tasks from different pid namespace
>>>>> - passing pids across pid namespaces in any kind of API
>>>>>
>>>>> Currently there are several interfaces that could be used here:
>>>>>
>>>>> Pid namespaces are identified by inode number of /proc/[pid]/ns/pid.
>>>
>>> Using the inode number in interfaces is not an option. Especially not
>>> withou referencing the device number for the filesystem as well.
>>
>> This is supposed to be single-instance fs,
>> not part of proc but referenced but its magic "symlinks".
>>
>> Device numbers are not mentioned in "man namespaces".
>>
>>>
>>>>> Pids for nested Pid namespaces are shown in file /proc/[pid]/status.
>>>>> In some cases conversion pid -> vpid could be easily done using this
>>>>> information, but backward translation requires scanning all tasks.
>>>>>
>>>>> Unix socket automatically translates pid attached to SCM_CREDENTIALS.
>>>>> This requires CAP_SYS_ADMIN for sending arbitrary pids and entering
>>>>> into pid namespace, this expose process and could be insecure.
>>>>>
>>>>> This patch adds new syscall for converting pids between pid namespaces:
>>>>>
>>>>> pid_t translate_pid(pid_t pid, int source_type, int source,
>>>>>                                  int target_type, int target);
>>>>>
>>>>> @source_type and @target_type defines type of following arguments:
>>>>>
>>>>> TRANSLATE_PID_CURRENT_PIDNS  - current pid namespace, argument is unused
>>>>> TRANSLATE_PID_TASK_PIDNS     - task pid-ns, argument is task pid
>>>>
>>>> I believe using pid to represent the namespace has been already
>>>> discussed in V1 of this patch in https://lkml.org/lkml/2015/9/22/1087
>>>> after which we moved on to fd based version of this interface.
>>>
>>> Or in short why is the case of pids important?
>>>
>>> You Konstantin you almost said why they were important in your message
>>> saying you were going to send this one.  However you don't explain in
>>> your description why you want to identify pid namespaces by pid.
>>>
>>
>> Open of /proc/[pid]/ns/pid requires same permissions as ptrace,
>> pid based variant doesn't have such restrictions.
>
> Can you provide more information on usecase requiring PID translation but not used for tracing related purposes?

Any introspection for [nested] containers. It's easier to work when you have all information when you don't have any.
For example our CMS https://github.com/yandex/porto allows to start nested sub-container (or even deeper) by request from any container and
have to tell back which pid task is have. And it could translate any pid inside into accessible by client and vice versa.

> On a side note, can we have the types TRANSLATE_PID_CURRENT_PIDNS and TRANSLATE_PID_FD_PIDNS integrated first and then possibly extend the
> interface to include TRANSLATE_PID_TASK_PIDNS in future?

I don't see reason for this separation.
Pids and pid namespaces are part of the API for a long time.

>
> Thanks,
> Nagarathnam.
>> Most pid-based syscalls are racy in some cases but they are
>> here for decades and everybody knowns how to deal with it.
>> So, I've decided to merge both worlds in one interface which clearly tells what to expect.
>

Subject: Re: [PATCH RFC v5] pidns: introduce syscall translate_pid

On 04/05/2018 09:02 AM, Konstantin Khlebnikov wrote:
> On 05.04.2018 01:29, Eric W. Biederman wrote:
>> Nagarathnam Muthusamy <[email protected]> writes:
>>
>>> On 04/04/2018 12:11 PM, Konstantin Khlebnikov wrote:
>>>> Each process have different pids, one for each pid namespace it belongs.
>>>> When interaction happens within single pid-ns translation isn't required.
>>>> More complicated scenarios needs special handling.
>>>>
>>>> For example:
>>>> - reading pid-files or logs written inside container with pid namespace
>>>> - attaching with ptrace to tasks from different pid namespace
>>>> - passing pids across pid namespaces in any kind of API
>>>>
>>>> Currently there are several interfaces that could be used here:
>>>>
>>>> Pid namespaces are identified by inode number of /proc/[pid]/ns/pid.
>>
>> Using the inode number in interfaces is not an option. Especially not
>> withou referencing the device number for the filesystem as well.
>
> This is supposed to be single-instance fs,
> not part of proc but referenced but its magic "symlinks".
>
> Device numbers are not mentioned in "man namespaces".

Thanks for the heads-up!

That was a bug in the man-page. ioctl_ns(2) already says the right thing.
Now I patches namespaces(7), as below.

Cheers,

Michael

diff --git a/man7/namespaces.7 b/man7/namespaces.7
index 725ebaff6..3c155de7e 100644
--- a/man7/namespaces.7
+++ b/man7/namespaces.7
@@ -154,11 +154,14 @@ In Linux 3.7 and earlier, these files were visible as hard links.
Since Linux 3.8,
.\" commit bf056bfa80596a5d14b26b17276a56a0dcb080e5
they appear as symbolic links.
-If two processes are in the same namespace, then the inode numbers of their
+If two processes are in the same namespace,
+then the device IDs and inode numbers of their
.IR /proc/[pid]/ns/xxx
symbolic links will be the same; an application can check this using the
+.I stat.st_dev
+and
.I stat.st_ino
-field returned by
+fields returned by
.BR stat (2).
The content of this symbolic link is a string containing
the namespace type and inode number as in the following example:


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2018-05-15 17:26:05

by Nagarathnam Muthusamy

[permalink] [raw]
Subject: Re: [PATCH RFC v5] pidns: introduce syscall translate_pid



On 04/24/2018 10:36 PM, Konstantin Khlebnikov wrote:
> On 23.04.2018 20:37, Nagarathnam Muthusamy wrote:
>>
>>
>> On 04/05/2018 12:02 AM, Konstantin Khlebnikov wrote:
>>> On 05.04.2018 01:29, Eric W. Biederman wrote:
>>>> Nagarathnam Muthusamy <[email protected]> writes:
>>>>
>>>>> On 04/04/2018 12:11 PM, Konstantin Khlebnikov wrote:
>>>>>> Each process have different pids, one for each pid namespace it
>>>>>> belongs.
>>>>>> When interaction happens within single pid-ns translation isn't
>>>>>> required.
>>>>>> More complicated scenarios needs special handling.
>>>>>>
>>>>>> For example:
>>>>>> - reading pid-files or logs written inside container with pid
>>>>>> namespace
>>>>>> - attaching with ptrace to tasks from different pid namespace
>>>>>> - passing pids across pid namespaces in any kind of API
>>>>>>
>>>>>> Currently there are several interfaces that could be used here:
>>>>>>
>>>>>> Pid namespaces are identified by inode number of /proc/[pid]/ns/pid.
>>>>
>>>> Using the inode number in interfaces is not an option. Especially not
>>>> withou referencing the device number for the filesystem as well.
>>>
>>> This is supposed to be single-instance fs,
>>> not part of proc but referenced but its magic "symlinks".
>>>
>>> Device numbers are not mentioned in "man namespaces".
>>>
>>>>
>>>>>> Pids for nested Pid namespaces are shown in file /proc/[pid]/status.
>>>>>> In some cases conversion pid -> vpid could be easily done using this
>>>>>> information, but backward translation requires scanning all tasks.
>>>>>>
>>>>>> Unix socket automatically translates pid attached to
>>>>>> SCM_CREDENTIALS.
>>>>>> This requires CAP_SYS_ADMIN for sending arbitrary pids and entering
>>>>>> into pid namespace, this expose process and could be insecure.
>>>>>>
>>>>>> This patch adds new syscall for converting pids between pid
>>>>>> namespaces:
>>>>>>
>>>>>> pid_t translate_pid(pid_t pid, int source_type, int source,
>>>>>>                                  int target_type, int target);
>>>>>>
>>>>>> @source_type and @target_type defines type of following arguments:
>>>>>>
>>>>>> TRANSLATE_PID_CURRENT_PIDNS  - current pid namespace, argument is
>>>>>> unused
>>>>>> TRANSLATE_PID_TASK_PIDNS     - task pid-ns, argument is task pid
>>>>>
>>>>> I believe using pid to represent the namespace has been already
>>>>> discussed in V1 of this patch in https://lkml.org/lkml/2015/9/22/1087
>>>>> after which we moved on to fd based version of this interface.
>>>>
>>>> Or in short why is the case of pids important?
>>>>
>>>> You Konstantin you almost said why they were important in your message
>>>> saying you were going to send this one.  However you don't explain in
>>>> your description why you want to identify pid namespaces by pid.
>>>>
>>>
>>> Open of /proc/[pid]/ns/pid requires same permissions as ptrace,
>>> pid based variant doesn't have such restrictions.
>>
>> Can you provide more information on usecase requiring PID translation
>> but not used for tracing related purposes?
>
> Any introspection for [nested] containers. It's easier to work when
> you have all information when you don't have any.
> For example our CMS https://github.com/yandex/porto allows to start
> nested sub-container (or even deeper) by request from any container
> and have to tell back which pid task is have. And it could translate
> any pid inside into accessible by client and vice versa.
>

I still dont get the exact reason why PID based approach to identify the
namespace during pid translation process is absolutely required compared
to fd based approach. From your version of TranslatePid in

https://github.com/yandex/porto/blob/0d7e6e7e1830dcd0038a057b2ab9964cec5b8fab/src/util/unix.cpp

I see that you are going through the trouble of forking a process and
sending SMC_CREDENTIALS for pid translation. Even your existing API
could be extremely simplified if translate_pid based on file descriptors
make it to the gate and I believe from the last discussion it was almost
there https://patchwork.kernel.org/patch/10305439/


>> On a side note, can we have the types TRANSLATE_PID_CURRENT_PIDNS and
>> TRANSLATE_PID_FD_PIDNS integrated first and then possibly extend the
>> interface to include TRANSLATE_PID_TASK_PIDNS in future?
>
> I don't see reason for this separation.
> Pids and pid namespaces are part of the API for a long time.

If you are talking about the translate_pid API proposed, I believe the
V4 proposed under https://patchwork.kernel.org/patch/10003935/ had only
fd based API before a mix of PID and fd based is proposed in V5. Again,
I was just wondering if we can get the FD based approach in first and
then extend the API to include PID based approach later as fd based
approach could provide a lot of immediate benefits?

Thanks,
Nagarathnam.
>
>>
>> Thanks,
>> Nagarathnam.
>>> Most pid-based syscalls are racy in some cases but they are
>>> here for decades and everybody knowns how to deal with it.
>>> So, I've decided to merge both worlds in one interface which clearly
>>> tells what to expect.
>>


2018-05-15 17:37:36

by Konstantin Khlebnikov

[permalink] [raw]
Subject: Re: [PATCH RFC v5] pidns: introduce syscall translate_pid



On 15.05.2018 20:19, Nagarathnam Muthusamy wrote:
>
>
> On 04/24/2018 10:36 PM, Konstantin Khlebnikov wrote:
>> On 23.04.2018 20:37, Nagarathnam Muthusamy wrote:
>>>
>>>
>>> On 04/05/2018 12:02 AM, Konstantin Khlebnikov wrote:
>>>> On 05.04.2018 01:29, Eric W. Biederman wrote:
>>>>> Nagarathnam Muthusamy <[email protected]> writes:
>>>>>
>>>>>> On 04/04/2018 12:11 PM, Konstantin Khlebnikov wrote:
>>>>>>> Each process have different pids, one for each pid namespace it belongs.
>>>>>>> When interaction happens within single pid-ns translation isn't required.
>>>>>>> More complicated scenarios needs special handling.
>>>>>>>
>>>>>>> For example:
>>>>>>> - reading pid-files or logs written inside container with pid namespace
>>>>>>> - attaching with ptrace to tasks from different pid namespace
>>>>>>> - passing pids across pid namespaces in any kind of API
>>>>>>>
>>>>>>> Currently there are several interfaces that could be used here:
>>>>>>>
>>>>>>> Pid namespaces are identified by inode number of /proc/[pid]/ns/pid.
>>>>>
>>>>> Using the inode number in interfaces is not an option. Especially not
>>>>> withou referencing the device number for the filesystem as well.
>>>>
>>>> This is supposed to be single-instance fs,
>>>> not part of proc but referenced but its magic "symlinks".
>>>>
>>>> Device numbers are not mentioned in "man namespaces".
>>>>
>>>>>
>>>>>>> Pids for nested Pid namespaces are shown in file /proc/[pid]/status.
>>>>>>> In some cases conversion pid -> vpid could be easily done using this
>>>>>>> information, but backward translation requires scanning all tasks.
>>>>>>>
>>>>>>> Unix socket automatically translates pid attached to SCM_CREDENTIALS.
>>>>>>> This requires CAP_SYS_ADMIN for sending arbitrary pids and entering
>>>>>>> into pid namespace, this expose process and could be insecure.
>>>>>>>
>>>>>>> This patch adds new syscall for converting pids between pid namespaces:
>>>>>>>
>>>>>>> pid_t translate_pid(pid_t pid, int source_type, int source,
>>>>>>>                                  int target_type, int target);
>>>>>>>
>>>>>>> @source_type and @target_type defines type of following arguments:
>>>>>>>
>>>>>>> TRANSLATE_PID_CURRENT_PIDNS  - current pid namespace, argument is unused
>>>>>>> TRANSLATE_PID_TASK_PIDNS     - task pid-ns, argument is task pid
>>>>>>
>>>>>> I believe using pid to represent the namespace has been already
>>>>>> discussed in V1 of this patch in https://lkml.org/lkml/2015/9/22/1087
>>>>>> after which we moved on to fd based version of this interface.
>>>>>
>>>>> Or in short why is the case of pids important?
>>>>>
>>>>> You Konstantin you almost said why they were important in your message
>>>>> saying you were going to send this one.  However you don't explain in
>>>>> your description why you want to identify pid namespaces by pid.
>>>>>
>>>>
>>>> Open of /proc/[pid]/ns/pid requires same permissions as ptrace,
>>>> pid based variant doesn't have such restrictions.
>>>
>>> Can you provide more information on usecase requiring PID translation but not used for tracing related purposes?
>>
>> Any introspection for [nested] containers. It's easier to work when you have all information when you don't have any.
>> For example our CMS https://github.com/yandex/porto allows to start nested sub-container (or even deeper) by request from any container
>> and have to tell back which pid task is have. And it could translate any pid inside into accessible by client and vice versa.
>>
>
> I still dont get the exact reason why PID based approach to identify the namespace during pid translation process is absolutely required
> compared to fd based approach.

As I told open(/proc/%d/ns/pid) have security restrictions - same uid/CAP_SYS_PTRACE/whatever
Pidns-fd holds pid-namespace and without restrictions could be abused.
Pid based API is racy but always available without any restrictions.


> From your version of TranslatePid in
>
> https://github.com/yandex/porto/blob/0d7e6e7e1830dcd0038a057b2ab9964cec5b8fab/src/util/unix.cpp
>
> I see that you are going through the trouble of forking a process and sending SMC_CREDENTIALS for pid translation. Even your existing API
> could be extremely simplified if translate_pid based on file descriptors make it to the gate and I believe from the last discussion it was
> almost there https://patchwork.kernel.org/patch/10305439/
>
>
>>> On a side note, can we have the types TRANSLATE_PID_CURRENT_PIDNS and TRANSLATE_PID_FD_PIDNS integrated first and then possibly extend
>>> the interface to include TRANSLATE_PID_TASK_PIDNS in future?
>>
>> I don't see reason for this separation.
>> Pids and pid namespaces are part of the API for a long time.
>
> If you are talking about the translate_pid API proposed, I believe the V4 proposed under https://patchwork.kernel.org/patch/10003935/ had
> only fd based API before a mix of PID and fd based is proposed in V5. Again, I was just wondering if we can get the FD based approach in
> first and then extend the API to include PID based approach later as fd based approach could provide a lot of immediate benefits?
>
> Thanks,
> Nagarathnam.
>>
>>>
>>> Thanks,
>>> Nagarathnam.
>>>> Most pid-based syscalls are racy in some cases but they are
>>>> here for decades and everybody knowns how to deal with it.
>>>> So, I've decided to merge both worlds in one interface which clearly tells what to expect.
>>>
>

2018-05-15 17:46:39

by Nagarathnam Muthusamy

[permalink] [raw]
Subject: Re: [PATCH RFC v5] pidns: introduce syscall translate_pid



On 05/15/2018 10:36 AM, Konstantin Khlebnikov wrote:
>
>
> On 15.05.2018 20:19, Nagarathnam Muthusamy wrote:
>>
>>
>> On 04/24/2018 10:36 PM, Konstantin Khlebnikov wrote:
>>> On 23.04.2018 20:37, Nagarathnam Muthusamy wrote:
>>>>
>>>>
>>>> On 04/05/2018 12:02 AM, Konstantin Khlebnikov wrote:
>>>>> On 05.04.2018 01:29, Eric W. Biederman wrote:
>>>>>> Nagarathnam Muthusamy <[email protected]> writes:
>>>>>>
>>>>>>> On 04/04/2018 12:11 PM, Konstantin Khlebnikov wrote:
>>>>>>>> Each process have different pids, one for each pid namespace it
>>>>>>>> belongs.
>>>>>>>> When interaction happens within single pid-ns translation isn't
>>>>>>>> required.
>>>>>>>> More complicated scenarios needs special handling.
>>>>>>>>
>>>>>>>> For example:
>>>>>>>> - reading pid-files or logs written inside container with pid
>>>>>>>> namespace
>>>>>>>> - attaching with ptrace to tasks from different pid namespace
>>>>>>>> - passing pids across pid namespaces in any kind of API
>>>>>>>>
>>>>>>>> Currently there are several interfaces that could be used here:
>>>>>>>>
>>>>>>>> Pid namespaces are identified by inode number of
>>>>>>>> /proc/[pid]/ns/pid.
>>>>>>
>>>>>> Using the inode number in interfaces is not an option. Especially
>>>>>> not
>>>>>> withou referencing the device number for the filesystem as well.
>>>>>
>>>>> This is supposed to be single-instance fs,
>>>>> not part of proc but referenced but its magic "symlinks".
>>>>>
>>>>> Device numbers are not mentioned in "man namespaces".
>>>>>
>>>>>>
>>>>>>>> Pids for nested Pid namespaces are shown in file
>>>>>>>> /proc/[pid]/status.
>>>>>>>> In some cases conversion pid -> vpid could be easily done using
>>>>>>>> this
>>>>>>>> information, but backward translation requires scanning all tasks.
>>>>>>>>
>>>>>>>> Unix socket automatically translates pid attached to
>>>>>>>> SCM_CREDENTIALS.
>>>>>>>> This requires CAP_SYS_ADMIN for sending arbitrary pids and
>>>>>>>> entering
>>>>>>>> into pid namespace, this expose process and could be insecure.
>>>>>>>>
>>>>>>>> This patch adds new syscall for converting pids between pid
>>>>>>>> namespaces:
>>>>>>>>
>>>>>>>> pid_t translate_pid(pid_t pid, int source_type, int source,
>>>>>>>>                                  int target_type, int target);
>>>>>>>>
>>>>>>>> @source_type and @target_type defines type of following arguments:
>>>>>>>>
>>>>>>>> TRANSLATE_PID_CURRENT_PIDNS  - current pid namespace, argument
>>>>>>>> is unused
>>>>>>>> TRANSLATE_PID_TASK_PIDNS     - task pid-ns, argument is task pid
>>>>>>>
>>>>>>> I believe using pid to represent the namespace has been already
>>>>>>> discussed in V1 of this patch in
>>>>>>> https://lkml.org/lkml/2015/9/22/1087
>>>>>>> after which we moved on to fd based version of this interface.
>>>>>>
>>>>>> Or in short why is the case of pids important?
>>>>>>
>>>>>> You Konstantin you almost said why they were important in your
>>>>>> message
>>>>>> saying you were going to send this one.  However you don't
>>>>>> explain in
>>>>>> your description why you want to identify pid namespaces by pid.
>>>>>>
>>>>>
>>>>> Open of /proc/[pid]/ns/pid requires same permissions as ptrace,
>>>>> pid based variant doesn't have such restrictions.
>>>>
>>>> Can you provide more information on usecase requiring PID
>>>> translation but not used for tracing related purposes?
>>>
>>> Any introspection for [nested] containers. It's easier to work when
>>> you have all information when you don't have any.
>>> For example our CMS https://github.com/yandex/porto allows to start
>>> nested sub-container (or even deeper) by request from any container
>>> and have to tell back which pid task is have. And it could translate
>>> any pid inside into accessible by client and vice versa.
>>>
>>
>> I still dont get the exact reason why PID based approach to identify
>> the namespace during pid translation process is absolutely required
>> compared to fd based approach.
>
> As I told open(/proc/%d/ns/pid) have security restrictions - same
> uid/CAP_SYS_PTRACE/whatever
> Pidns-fd holds pid-namespace and without restrictions could be abused.
> Pid based API is racy but always available without any restrictions.

I get that Pid based API is available without any restrictions but do we
have any existing usecase which requires Pid based API but cannot use
Pidns-fd based API? Most of the usecases discussed in this thread deals
with introspection of a process by another process and I believe that
security requirement for opening (/proc/%d/ns/pid) is required for all
such usecases. In other words, Why would a process which does not belong
to same uid of the process observed or have CAP_SYS_PTRACE be allowed to
translate PID?

Thanks,
Nagarathnam.
>
>
>> From your version of TranslatePid in
>>
>> https://github.com/yandex/porto/blob/0d7e6e7e1830dcd0038a057b2ab9964cec5b8fab/src/util/unix.cpp
>>
>>
>> I see that you are going through the trouble of forking a process and
>> sending SMC_CREDENTIALS for pid translation. Even your existing API
>> could be extremely simplified if translate_pid based on file
>> descriptors make it to the gate and I believe from the last
>> discussion it was almost there
>> https://patchwork.kernel.org/patch/10305439/
>>
>>
>>>> On a side note, can we have the types TRANSLATE_PID_CURRENT_PIDNS
>>>> and TRANSLATE_PID_FD_PIDNS integrated first and then possibly
>>>> extend the interface to include TRANSLATE_PID_TASK_PIDNS in future?
>>>
>>> I don't see reason for this separation.
>>> Pids and pid namespaces are part of the API for a long time.
>>
>> If you are talking about the translate_pid API proposed, I believe
>> the V4 proposed under https://patchwork.kernel.org/patch/10003935/
>> had only fd based API before a mix of PID and fd based is proposed in
>> V5. Again, I was just wondering if we can get the FD based approach
>> in first and then extend the API to include PID based approach later
>> as fd based approach could provide a lot of immediate benefits?
>>
>> Thanks,
>> Nagarathnam.
>>>
>>>>
>>>> Thanks,
>>>> Nagarathnam.
>>>>> Most pid-based syscalls are racy in some cases but they are
>>>>> here for decades and everybody knowns how to deal with it.
>>>>> So, I've decided to merge both worlds in one interface which
>>>>> clearly tells what to expect.
>>>>
>>


2018-05-15 17:52:46

by Nagarathnam Muthusamy

[permalink] [raw]
Subject: Re: [PATCH RFC v5] pidns: introduce syscall translate_pid



On 05/15/2018 10:40 AM, Nagarathnam Muthusamy wrote:
>
>
> On 05/15/2018 10:36 AM, Konstantin Khlebnikov wrote:
>>
>>
>> On 15.05.2018 20:19, Nagarathnam Muthusamy wrote:
>>>
>>>
>>> On 04/24/2018 10:36 PM, Konstantin Khlebnikov wrote:
>>>> On 23.04.2018 20:37, Nagarathnam Muthusamy wrote:
>>>>>
>>>>>
>>>>> On 04/05/2018 12:02 AM, Konstantin Khlebnikov wrote:
>>>>>> On 05.04.2018 01:29, Eric W. Biederman wrote:
>>>>>>> Nagarathnam Muthusamy <[email protected]> writes:
>>>>>>>
>>>>>>>> On 04/04/2018 12:11 PM, Konstantin Khlebnikov wrote:
>>>>>>>>> Each process have different pids, one for each pid namespace
>>>>>>>>> it belongs.
>>>>>>>>> When interaction happens within single pid-ns translation
>>>>>>>>> isn't required.
>>>>>>>>> More complicated scenarios needs special handling.
>>>>>>>>>
>>>>>>>>> For example:
>>>>>>>>> - reading pid-files or logs written inside container with pid
>>>>>>>>> namespace
>>>>>>>>> - attaching with ptrace to tasks from different pid namespace
>>>>>>>>> - passing pids across pid namespaces in any kind of API
>>>>>>>>>
>>>>>>>>> Currently there are several interfaces that could be used here:
>>>>>>>>>
>>>>>>>>> Pid namespaces are identified by inode number of
>>>>>>>>> /proc/[pid]/ns/pid.
>>>>>>>
>>>>>>> Using the inode number in interfaces is not an option.
>>>>>>> Especially not
>>>>>>> withou referencing the device number for the filesystem as well.
>>>>>>
>>>>>> This is supposed to be single-instance fs,
>>>>>> not part of proc but referenced but its magic "symlinks".
>>>>>>
>>>>>> Device numbers are not mentioned in "man namespaces".
>>>>>>
>>>>>>>
>>>>>>>>> Pids for nested Pid namespaces are shown in file
>>>>>>>>> /proc/[pid]/status.
>>>>>>>>> In some cases conversion pid -> vpid could be easily done
>>>>>>>>> using this
>>>>>>>>> information, but backward translation requires scanning all
>>>>>>>>> tasks.
>>>>>>>>>
>>>>>>>>> Unix socket automatically translates pid attached to
>>>>>>>>> SCM_CREDENTIALS.
>>>>>>>>> This requires CAP_SYS_ADMIN for sending arbitrary pids and
>>>>>>>>> entering
>>>>>>>>> into pid namespace, this expose process and could be insecure.
>>>>>>>>>
>>>>>>>>> This patch adds new syscall for converting pids between pid
>>>>>>>>> namespaces:
>>>>>>>>>
>>>>>>>>> pid_t translate_pid(pid_t pid, int source_type, int source,
>>>>>>>>>                                  int target_type, int target);
>>>>>>>>>
>>>>>>>>> @source_type and @target_type defines type of following
>>>>>>>>> arguments:
>>>>>>>>>
>>>>>>>>> TRANSLATE_PID_CURRENT_PIDNS  - current pid namespace, argument
>>>>>>>>> is unused
>>>>>>>>> TRANSLATE_PID_TASK_PIDNS     - task pid-ns, argument is task pid
>>>>>>>>
>>>>>>>> I believe using pid to represent the namespace has been already
>>>>>>>> discussed in V1 of this patch in
>>>>>>>> https://lkml.org/lkml/2015/9/22/1087
>>>>>>>> after which we moved on to fd based version of this interface.
>>>>>>>
>>>>>>> Or in short why is the case of pids important?
>>>>>>>
>>>>>>> You Konstantin you almost said why they were important in your
>>>>>>> message
>>>>>>> saying you were going to send this one.  However you don't
>>>>>>> explain in
>>>>>>> your description why you want to identify pid namespaces by pid.
>>>>>>>
>>>>>>
>>>>>> Open of /proc/[pid]/ns/pid requires same permissions as ptrace,
>>>>>> pid based variant doesn't have such restrictions.
>>>>>
>>>>> Can you provide more information on usecase requiring PID
>>>>> translation but not used for tracing related purposes?
>>>>
>>>> Any introspection for [nested] containers. It's easier to work when
>>>> you have all information when you don't have any.
>>>> For example our CMS https://github.com/yandex/porto allows to start
>>>> nested sub-container (or even deeper) by request from any container
>>>> and have to tell back which pid task is have. And it could
>>>> translate any pid inside into accessible by client and vice versa.
>>>>
>>>
>>> I still dont get the exact reason why PID based approach to identify
>>> the namespace during pid translation process is absolutely required
>>> compared to fd based approach.
>>
>> As I told open(/proc/%d/ns/pid) have security restrictions - same
>> uid/CAP_SYS_PTRACE/whatever
>> Pidns-fd holds pid-namespace and without restrictions could be abused.
>> Pid based API is racy but always available without any restrictions.
>
> I get that Pid based API is available without any restrictions but do
> we have any existing usecase which requires Pid based API but cannot
> use Pidns-fd based API? Most of the usecases discussed in this thread
> deals with introspection of a process by another process and I believe
> that security requirement for opening (/proc/%d/ns/pid) is required
> for all such usecases. In other words, Why would a process which does
> not belong to same uid

Typo: inspection of a process by another process

Thanks,
Nagarathnam.

> of the process observed or have CAP_SYS_PTRACE be allowed to translate
> PID?
>
> Thanks,
> Nagarathnam.
>>
>>
>>> From your version of TranslatePid in
>>>
>>> https://github.com/yandex/porto/blob/0d7e6e7e1830dcd0038a057b2ab9964cec5b8fab/src/util/unix.cpp
>>>
>>>
>>> I see that you are going through the trouble of forking a process
>>> and sending SMC_CREDENTIALS for pid translation. Even your existing
>>> API could be extremely simplified if translate_pid based on file
>>> descriptors make it to the gate and I believe from the last
>>> discussion it was almost there
>>> https://patchwork.kernel.org/patch/10305439/
>>>
>>>
>>>>> On a side note, can we have the types TRANSLATE_PID_CURRENT_PIDNS
>>>>> and TRANSLATE_PID_FD_PIDNS integrated first and then possibly
>>>>> extend the interface to include TRANSLATE_PID_TASK_PIDNS in future?
>>>>
>>>> I don't see reason for this separation.
>>>> Pids and pid namespaces are part of the API for a long time.
>>>
>>> If you are talking about the translate_pid API proposed, I believe
>>> the V4 proposed under https://patchwork.kernel.org/patch/10003935/
>>> had only fd based API before a mix of PID and fd based is proposed
>>> in V5. Again, I was just wondering if we can get the FD based
>>> approach in first and then extend the API to include PID based
>>> approach later as fd based approach could provide a lot of immediate
>>> benefits?
>>>
>>> Thanks,
>>> Nagarathnam.
>>>>
>>>>>
>>>>> Thanks,
>>>>> Nagarathnam.
>>>>>> Most pid-based syscalls are racy in some cases but they are
>>>>>> here for decades and everybody knowns how to deal with it.
>>>>>> So, I've decided to merge both worlds in one interface which
>>>>>> clearly tells what to expect.
>>>>>
>>>
>


2018-05-31 17:47:47

by Nagarathnam Muthusamy

[permalink] [raw]
Subject: Re: [PATCH RFC v5] pidns: introduce syscall translate_pid



On 05/15/2018 10:36 AM, Konstantin Khlebnikov wrote:
>
>
> On 15.05.2018 20:19, Nagarathnam Muthusamy wrote:
>>
>>
>> On 04/24/2018 10:36 PM, Konstantin Khlebnikov wrote:
>>> On 23.04.2018 20:37, Nagarathnam Muthusamy wrote:
>>>>
>>>>
>>>> On 04/05/2018 12:02 AM, Konstantin Khlebnikov wrote:
>>>>> On 05.04.2018 01:29, Eric W. Biederman wrote:
>>>>>> Nagarathnam Muthusamy <[email protected]> writes:
>>>>>>
>>>>>>> On 04/04/2018 12:11 PM, Konstantin Khlebnikov wrote:
>>>>>>>> Each process have different pids, one for each pid namespace it
>>>>>>>> belongs.
>>>>>>>> When interaction happens within single pid-ns translation isn't
>>>>>>>> required.
>>>>>>>> More complicated scenarios needs special handling.
>>>>>>>>
>>>>>>>> For example:
>>>>>>>> - reading pid-files or logs written inside container with pid
>>>>>>>> namespace
>>>>>>>> - attaching with ptrace to tasks from different pid namespace
>>>>>>>> - passing pids across pid namespaces in any kind of API
>>>>>>>>
>>>>>>>> Currently there are several interfaces that could be used here:
>>>>>>>>
>>>>>>>> Pid namespaces are identified by inode number of
>>>>>>>> /proc/[pid]/ns/pid.
>>>>>>
>>>>>> Using the inode number in interfaces is not an option. Especially
>>>>>> not
>>>>>> withou referencing the device number for the filesystem as well.
>>>>>
>>>>> This is supposed to be single-instance fs,
>>>>> not part of proc but referenced but its magic "symlinks".
>>>>>
>>>>> Device numbers are not mentioned in "man namespaces".
>>>>>
>>>>>>
>>>>>>>> Pids for nested Pid namespaces are shown in file
>>>>>>>> /proc/[pid]/status.
>>>>>>>> In some cases conversion pid -> vpid could be easily done using
>>>>>>>> this
>>>>>>>> information, but backward translation requires scanning all tasks.
>>>>>>>>
>>>>>>>> Unix socket automatically translates pid attached to
>>>>>>>> SCM_CREDENTIALS.
>>>>>>>> This requires CAP_SYS_ADMIN for sending arbitrary pids and
>>>>>>>> entering
>>>>>>>> into pid namespace, this expose process and could be insecure.
>>>>>>>>
>>>>>>>> This patch adds new syscall for converting pids between pid
>>>>>>>> namespaces:
>>>>>>>>
>>>>>>>> pid_t translate_pid(pid_t pid, int source_type, int source,
>>>>>>>>                                  int target_type, int target);
>>>>>>>>
>>>>>>>> @source_type and @target_type defines type of following arguments:
>>>>>>>>
>>>>>>>> TRANSLATE_PID_CURRENT_PIDNS  - current pid namespace, argument
>>>>>>>> is unused
>>>>>>>> TRANSLATE_PID_TASK_PIDNS     - task pid-ns, argument is task pid
>>>>>>>
>>>>>>> I believe using pid to represent the namespace has been already
>>>>>>> discussed in V1 of this patch in
>>>>>>> https://lkml.org/lkml/2015/9/22/1087
>>>>>>> after which we moved on to fd based version of this interface.
>>>>>>
>>>>>> Or in short why is the case of pids important?
>>>>>>
>>>>>> You Konstantin you almost said why they were important in your
>>>>>> message
>>>>>> saying you were going to send this one.  However you don't
>>>>>> explain in
>>>>>> your description why you want to identify pid namespaces by pid.
>>>>>>
>>>>>
>>>>> Open of /proc/[pid]/ns/pid requires same permissions as ptrace,
>>>>> pid based variant doesn't have such restrictions.
>>>>
>>>> Can you provide more information on usecase requiring PID
>>>> translation but not used for tracing related purposes?
>>>
>>> Any introspection for [nested] containers. It's easier to work when
>>> you have all information when you don't have any.
>>> For example our CMS https://github.com/yandex/porto allows to start
>>> nested sub-container (or even deeper) by request from any container
>>> and have to tell back which pid task is have. And it could translate
>>> any pid inside into accessible by client and vice versa.
>>>
>>
>> I still dont get the exact reason why PID based approach to identify
>> the namespace during pid translation process is absolutely required
>> compared to fd based approach.
>
> As I told open(/proc/%d/ns/pid) have security restrictions - same
> uid/CAP_SYS_PTRACE/whatever
> Pidns-fd holds pid-namespace and without restrictions could be abused.
> Pid based API is racy but always available without any restrictions.
>
>
>> From your version of TranslatePid in
>>
>> https://github.com/yandex/porto/blob/0d7e6e7e1830dcd0038a057b2ab9964cec5b8fab/src/util/unix.cpp
>>
>>
>> I see that you are going through the trouble of forking a process and
>> sending SMC_CREDENTIALS for pid translation. Even your existing API
>> could be extremely simplified if translate_pid based on file
>> descriptors make it to the gate and I believe from the last
>> discussion it was almost there
>> https://patchwork.kernel.org/patch/10305439/
>>
>>
>>>> On a side note, can we have the types TRANSLATE_PID_CURRENT_PIDNS
>>>> and TRANSLATE_PID_FD_PIDNS integrated first and then possibly
>>>> extend the interface to include TRANSLATE_PID_TASK_PIDNS in future?
>>>
>>> I don't see reason for this separation.
>>> Pids and pid namespaces are part of the API for a long time.
>>
>> If you are talking about the translate_pid API proposed, I believe
>> the V4 proposed under https://patchwork.kernel.org/patch/10003935/
>> had only fd based API before a mix of PID and fd based is proposed in
>> V5. Again, I was just wondering if we can get the FD based approach
>> in first and then extend the API to include PID based approach later
>> as fd based approach could provide a lot of immediate benefits?
>>
>> Thanks,
>> Nagarathnam.
>>>
>>>>
>>>> Thanks,
>>>> Nagarathnam.
>>>>> Most pid-based syscalls are racy in some cases but they are
>>>>> here for decades and everybody knowns how to deal with it.
>>>>> So, I've decided to merge both worlds in one interface which
>>>>> clearly tells what to expect.
>>>>
>>

Ping? Any additional comments on this patch?

Thanks,
Nagarathnam.

2018-05-31 17:54:07

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH RFC v5] pidns: introduce syscall translate_pid

Nagarathnam Muthusamy <[email protected]> writes:

> On 05/15/2018 10:36 AM, Konstantin Khlebnikov wrote:
>>
>>
>> On 15.05.2018 20:19, Nagarathnam Muthusamy wrote:
>>>
>>>
>>> On 04/24/2018 10:36 PM, Konstantin Khlebnikov wrote:
>>>> On 23.04.2018 20:37, Nagarathnam Muthusamy wrote:
>>>>>
>>>>>
>>>>> On 04/05/2018 12:02 AM, Konstantin Khlebnikov wrote:
>>>>>> On 05.04.2018 01:29, Eric W. Biederman wrote:
>>>>>>> Nagarathnam Muthusamy <[email protected]> writes:
>>>>>>>
>>>>>>>> On 04/04/2018 12:11 PM, Konstantin Khlebnikov wrote:
>>>>>>>>> Each process have different pids, one for each pid namespace
>>>>>>>>> it belongs.
>>>>>>>>> When interaction happens within single pid-ns translation
>>>>>>>>> isn't required.
>>>>>>>>> More complicated scenarios needs special handling.
>>>>>>>>>
>>>>>>>>> For example:
>>>>>>>>> - reading pid-files or logs written inside container with pid
>>>>>>>>> namespace
>>>>>>>>> - attaching with ptrace to tasks from different pid namespace
>>>>>>>>> - passing pids across pid namespaces in any kind of API
>>>>>>>>>
>>>>>>>>> Currently there are several interfaces that could be used here:
>>>>>>>>>
>>>>>>>>> Pid namespaces are identified by inode number of
>>>>>>>>> /proc/[pid]/ns/pid.
>>>>>>>
>>>>>>> Using the inode number in interfaces is not an
>>>>>>> option. Especially not
>>>>>>> withou referencing the device number for the filesystem as well.
>>>>>>
>>>>>> This is supposed to be single-instance fs,
>>>>>> not part of proc but referenced but its magic "symlinks".
>>>>>>
>>>>>> Device numbers are not mentioned in "man namespaces".
>>>>>>
>>>>>>>
>>>>>>>>> Pids for nested Pid namespaces are shown in file
>>>>>>>>> /proc/[pid]/status.
>>>>>>>>> In some cases conversion pid -> vpid could be easily done
>>>>>>>>> using this
>>>>>>>>> information, but backward translation requires scanning all tasks.
>>>>>>>>>
>>>>>>>>> Unix socket automatically translates pid attached to
>>>>>>>>> SCM_CREDENTIALS.
>>>>>>>>> This requires CAP_SYS_ADMIN for sending arbitrary pids and
>>>>>>>>> entering
>>>>>>>>> into pid namespace, this expose process and could be insecure.
>>>>>>>>>
>>>>>>>>> This patch adds new syscall for converting pids between pid
>>>>>>>>> namespaces:
>>>>>>>>>
>>>>>>>>> pid_t translate_pid(pid_t pid, int source_type, int source,
>>>>>>>>>                                  int target_type, int target);
>>>>>>>>>
>>>>>>>>> @source_type and @target_type defines type of following arguments:
>>>>>>>>>
>>>>>>>>> TRANSLATE_PID_CURRENT_PIDNS  - current pid namespace,
>>>>>>>>> argument is unused
>>>>>>>>> TRANSLATE_PID_TASK_PIDNS     - task pid-ns, argument is task pid
>>>>>>>>
>>>>>>>> I believe using pid to represent the namespace has been already
>>>>>>>> discussed in V1 of this patch in
>>>>>>>> https://lkml.org/lkml/2015/9/22/1087
>>>>>>>> after which we moved on to fd based version of this interface.
>>>>>>>
>>>>>>> Or in short why is the case of pids important?
>>>>>>>
>>>>>>> You Konstantin you almost said why they were important in your
>>>>>>> message
>>>>>>> saying you were going to send this one.  However you don't
>>>>>>> explain in
>>>>>>> your description why you want to identify pid namespaces by pid.
>>>>>>>
>>>>>>
>>>>>> Open of /proc/[pid]/ns/pid requires same permissions as ptrace,
>>>>>> pid based variant doesn't have such restrictions.
>>>>>
>>>>> Can you provide more information on usecase requiring PID
>>>>> translation but not used for tracing related purposes?
>>>>
>>>> Any introspection for [nested] containers. It's easier to work
>>>> when you have all information when you don't have any.
>>>> For example our CMS https://github.com/yandex/porto allows to
>>>> start nested sub-container (or even deeper) by request from any
>>>> container and have to tell back which pid task is have. And it
>>>> could translate any pid inside into accessible by client and vice
>>>> versa.
>>>>
>>>
>>> I still dont get the exact reason why PID based approach to
>>> identify the namespace during pid translation process is absolutely
>>> required compared to fd based approach.
>>
>> As I told open(/proc/%d/ns/pid) have security restrictions - same
>> uid/CAP_SYS_PTRACE/whatever
>> Pidns-fd holds pid-namespace and without restrictions could be abused.
>> Pid based API is racy but always available without any restrictions.
>>
>>
>>> From your version of TranslatePid in
>>>
>>> https://github.com/yandex/porto/blob/0d7e6e7e1830dcd0038a057b2ab9964cec5b8fab/src/util/unix.cpp
>>>
>>>
>>> I see that you are going through the trouble of forking a process
>>> and sending SMC_CREDENTIALS for pid translation. Even your existing
>>> API could be extremely simplified if translate_pid based on file
>>> descriptors make it to the gate and I believe from the last
>>> discussion it was almost there
>>> https://patchwork.kernel.org/patch/10305439/
>>>
>>>
>>>>> On a side note, can we have the types TRANSLATE_PID_CURRENT_PIDNS
>>>>> and TRANSLATE_PID_FD_PIDNS integrated first and then possibly
>>>>> extend the interface to include TRANSLATE_PID_TASK_PIDNS in
>>>>> future?
>>>>
>>>> I don't see reason for this separation.
>>>> Pids and pid namespaces are part of the API for a long time.
>>>
>>> If you are talking about the translate_pid API proposed, I believe
>>> the V4 proposed under https://patchwork.kernel.org/patch/10003935/
>>> had only fd based API before a mix of PID and fd based is proposed
>>> in V5. Again, I was just wondering if we can get the FD based
>>> approach in first and then extend the API to include PID based
>>> approach later as fd based approach could provide a lot of
>>> immediate benefits?
>>>
>>> Thanks,
>>> Nagarathnam.
>>>>
>>>>>
>>>>> Thanks,
>>>>> Nagarathnam.
>>>>>> Most pid-based syscalls are racy in some cases but they are
>>>>>> here for decades and everybody knowns how to deal with it.
>>>>>> So, I've decided to merge both worlds in one interface which
>>>>>> clearly tells what to expect.
>>>>>
>>>
>
> Ping? Any additional comments on this patch?

I have totally lost the thread. Let me see if I can find enough of the
thread to see what is going on.

The whole let's use pids instead of fds was a major distraction.

Eric

2018-05-31 18:04:15

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH RFC v5] pidns: introduce syscall translate_pid

"Michael Kerrisk (man-pages)" <[email protected]> writes:

> On 04/05/2018 09:02 AM, Konstantin Khlebnikov wrote:
>> On 05.04.2018 01:29, Eric W. Biederman wrote:
>>> Nagarathnam Muthusamy <[email protected]> writes:
>>>
>>>> On 04/04/2018 12:11 PM, Konstantin Khlebnikov wrote:
>>>>> Each process have different pids, one for each pid namespace it belongs.
>>>>> When interaction happens within single pid-ns translation isn't required.
>>>>> More complicated scenarios needs special handling.
>>>>>
>>>>> For example:
>>>>> - reading pid-files or logs written inside container with pid namespace
>>>>> - attaching with ptrace to tasks from different pid namespace
>>>>> - passing pids across pid namespaces in any kind of API
>>>>>
>>>>> Currently there are several interfaces that could be used here:
>>>>>
>>>>> Pid namespaces are identified by inode number of /proc/[pid]/ns/pid.
>>>
>>> Using the inode number in interfaces is not an option. Especially not
>>> withou referencing the device number for the filesystem as well.
>>
>> This is supposed to be single-instance fs,
>> not part of proc but referenced but its magic "symlinks".
>>
>> Device numbers are not mentioned in "man namespaces".
>
> Thanks for the heads-up!
>
> That was a bug in the man-page. ioctl_ns(2) already says the right thing.
> Now I patches namespaces(7), as below.

Acked-by: "Eric W. Biederman" <[email protected]>

For the changes to namespaces.7. I suspect you have already applied
them by now, but if not.

Eric


> Cheers,
>
> Michael
>
> diff --git a/man7/namespaces.7 b/man7/namespaces.7
> index 725ebaff6..3c155de7e 100644
> --- a/man7/namespaces.7
> +++ b/man7/namespaces.7
> @@ -154,11 +154,14 @@ In Linux 3.7 and earlier, these files were visible as hard links.
> Since Linux 3.8,
> .\" commit bf056bfa80596a5d14b26b17276a56a0dcb080e5
> they appear as symbolic links.
> -If two processes are in the same namespace, then the inode numbers of their
> +If two processes are in the same namespace,
> +then the device IDs and inode numbers of their
> .IR /proc/[pid]/ns/xxx
> symbolic links will be the same; an application can check this using the
> +.I stat.st_dev
> +and
> .I stat.st_ino
> -field returned by
> +fields returned by
> .BR stat (2).
> The content of this symbolic link is a string containing
> the namespace type and inode number as in the following example:

2018-05-31 18:06:12

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH RFC v5] pidns: introduce syscall translate_pid

Nagarathnam Muthusamy <[email protected]> writes:

> Ping? Any additional comments on this patch?

Konstantin's v5 no.

I am uncomfortable with unnecessary flexibility. I don't want to
encourage the increased usage of pids to identify namespaces. I saw
no arguments in favor of pids to identify namespaces from Konstantin.

Your v4 updated to reflect Andrew Morton's concerns in the changelog
yes.

Eric

2018-06-01 07:00:40

by Konstantin Khlebnikov

[permalink] [raw]
Subject: Re: [PATCH RFC v5] pidns: introduce syscall translate_pid

On Thu, May 31, 2018 at 9:05 PM, Eric W. Biederman
<[email protected]> wrote:
> Nagarathnam Muthusamy <[email protected]> writes:
>
>> Ping? Any additional comments on this patch?
>
> Konstantin's v5 no.
>
> I am uncomfortable with unnecessary flexibility. I don't want to
> encourage the increased usage of pids to identify namespaces. I saw
> no arguments in favor of pids to identify namespaces from Konstantin.
>
> Your v4 updated to reflect Andrew Morton's concerns in the changelog
> yes.

Ok, whatever. If I the only blocker then let it be v4 design.

>
> Eric
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2018-06-01 15:57:34

by Nagarathnam Muthusamy

[permalink] [raw]
Subject: Re: [PATCH RFC v5] pidns: introduce syscall translate_pid



On 5/31/2018 11:58 PM, Konstantin Khlebnikov wrote:
> On Thu, May 31, 2018 at 9:05 PM, Eric W. Biederman
> <[email protected]> wrote:
>> Nagarathnam Muthusamy <[email protected]> writes:
>>
>>> Ping? Any additional comments on this patch?
>> Konstantin's v5 no.
>>
>> I am uncomfortable with unnecessary flexibility. I don't want to
>> encourage the increased usage of pids to identify namespaces. I saw
>> no arguments in favor of pids to identify namespaces from Konstantin.
>>
>> Your v4 updated to reflect Andrew Morton's concerns in the changelog
>> yes.
> Ok, whatever. If I the only blocker then let it be v4 design.

Thanks! I will resend the V4 patch with a plain text man page for
translate_pid.

Thanks,
Nagarathnam.
>
>> Eric
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-api" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html


2018-06-01 16:16:16

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH RFC v5] pidns: introduce syscall translate_pid

NAGARATHNAM MUTHUSAMY <[email protected]> writes:

> On 5/31/2018 11:58 PM, Konstantin Khlebnikov wrote:
>> On Thu, May 31, 2018 at 9:05 PM, Eric W. Biederman
>> <[email protected]> wrote:
>>> Nagarathnam Muthusamy <[email protected]> writes:
>>>
>>>> Ping? Any additional comments on this patch?
>>> Konstantin's v5 no.
>>>
>>> I am uncomfortable with unnecessary flexibility. I don't want to
>>> encourage the increased usage of pids to identify namespaces. I saw
>>> no arguments in favor of pids to identify namespaces from Konstantin.
>>>
>>> Your v4 updated to reflect Andrew Morton's concerns in the changelog
>>> yes.
>> Ok, whatever. If I the only blocker then let it be v4 design.
>
> Thanks! I will resend the V4 patch with a plain text man page for
> translate_pid.

I believe Andrew was also asking for the motivating use case in the
description of the patch.

Thank you,
Eric

2018-06-01 16:17:34

by Konstantin Khlebnikov

[permalink] [raw]
Subject: Re: [PATCH RFC v5] pidns: introduce syscall translate_pid

On 01.06.2018 18:55, NAGARATHNAM MUTHUSAMY wrote:
>
>
> On 5/31/2018 11:58 PM, Konstantin Khlebnikov wrote:
>> On Thu, May 31, 2018 at 9:05 PM, Eric W. Biederman
>> <[email protected]> wrote:
>>> Nagarathnam Muthusamy <[email protected]> writes:
>>>
>>>> Ping? Any additional comments on this patch?
>>> Konstantin's v5 no.
>>>
>>>    I am uncomfortable with unnecessary flexibility.  I don't want to
>>>    encourage the increased usage of pids to identify namespaces.  I saw
>>>    no arguments in favor of pids to identify namespaces from Konstantin.
>>>
>>> Your v4 updated to reflect Andrew Morton's concerns in the changelog
>>> yes.
>> Ok, whatever. If I the only blocker then let it be v4 design.
>
> Thanks! I will resend the V4 patch with a plain text man page for translate_pid.

Syscall integration have changed since then.
I've prepared v6 patch with v4 design and updated comment.

>
> Thanks,
> Nagarathnam.
>>
>>> Eric
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-api" in
>>> the body of a message to [email protected]
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>