Subject: [PATCH v6] pidns: introduce syscall translate_pid
From: Konstantin Khlebnikov
To: linux-api@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Jann Horn, Serge Hallyn, Prakash Sangappa, Oleg Nesterov,
    Nagarathnam Muthusamy, "Eric W. Biederman", Andrew Morton,
    Andy Lutomirski, "Michael Kerrisk \(man-pages\)"
Date: Fri, 01 Jun 2018 22:18:02 +0300
Message-ID: <152788068212.768348.15192457501079586650.stgit@buzz>
X-Mailing-List: linux-kernel@vger.kernel.org

Each process has a different pid in every pid namespace it belongs to.
When interaction happens within a single pid-ns, translation isn't required.
More complicated scenarios need special handling.

For example:
- reading pid-files or logs written inside a container with its own pid namespace
- writing logs with internal pids outside container for pushing them into
- attaching with ptrace to tasks from a different pid namespace

Generally speaking, any cross pid-ns API that passes pids needs translation.

Currently there are several interfaces that could be used here:

Pid namespaces are identified by the device and inode of /proc/[pid]/ns/pid.

Pids for nested pid namespaces are shown in the file /proc/[pid]/status.
In some cases pid translation can easily be done using this information.
Backward translation requires scanning all tasks and becomes really
complicated for deeper namespace nesting.

A unix socket automatically translates the pid attached to SCM_CREDENTIALS.
This requires CAP_SYS_ADMIN for sending arbitrary pids and for entering the
pid namespace, which exposes the process and could be insecure.
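
For illustration, forward translation via /proc/[pid]/status boils down to
parsing the NSpid line, which lists the task's pid in each nested pid
namespace, starting with the namespace of the procfs mount and moving inwards.
A minimal sketch, assuming a hypothetical helper named parse_nspid() that is
not part of this patch:

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not part of the patch): parse one "NSpid:" line
 * taken from /proc/[pid]/status. Stores up to @max pids in @out, outermost
 * visible namespace first, and returns how many were stored, or -1 if the
 * line is not an NSpid line.
 */
int parse_nspid(const char *line, pid_t *out, int max)
{
	static const char key[] = "NSpid:";
	const char *p;
	char *end;
	int n = 0;

	if (strncmp(line, key, sizeof(key) - 1) != 0)
		return -1;

	p = line + sizeof(key) - 1;
	while (n < max) {
		/* strtol() skips the tabs that separate the fields */
		long v = strtol(p, &end, 10);

		if (end == p)	/* no more numbers on the line */
			break;
		out[n++] = (pid_t)v;
		p = end;
	}
	return n;
}
```

Running this over each line of one task's status file yields its inner pids
directly. Going the other way, from an inner pid to an outer one, still means
doing this for every task, which is the scan described above.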

This patch adds a new syscall for converting pids between pid namespaces:

pid_t translate_pid(pid_t pid, int source, int target);

Pid namespaces are referred to by file descriptors opened on the proc files
/proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. A negative argument
refers to the current pid namespace.

The syscall returns the pid in the target pid-ns, or zero if the task has no
pid there.

Error codes:
EBADF  - file descriptor is closed
EINVAL - file descriptor isn't a pid namespace
ESRCH  - task not found in @source namespace

Translation could breach pid-ns isolation and return pids from outer pid
namespaces iff the process already has a file descriptor for these namespaces.

Examples:
translate_pid(pid, ns, -1)      - get pid in our pid namespace
translate_pid(pid, -1, ns)      - get pid in other pid namespace
translate_pid(1, ns, -1)        - get pid of init task for namespace
translate_pid(pid, -1, ns) > 0  - is pid reachable from ns?
translate_pid(1, ns1, ns2) > 0  - is ns1 inside ns2?
translate_pid(1, ns1, ns2) == 0 - is ns1 outside ns2?
translate_pid(1, ns1, ns2) == 1 - is ns1 equal to ns2?
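
The three namespace-relation checks in the examples above can be folded into
one helper that classifies the return value of translate_pid(1, ns1, ns2).
This is only a sketch: the enum and classify_pidns() are made-up names for
illustration, and the syscall itself is deliberately not invoked here.

```c
/*
 * Hypothetical helper, not part of the patch: interpret the value returned
 * by translate_pid(1, ns1, ns2), following the examples in the description.
 */
enum pidns_relation {
	PIDNS_ERROR,	/* negative return: the call failed, check errno */
	PIDNS_OUTSIDE,	/* 0: init of ns1 has no pid in ns2 */
	PIDNS_EQUAL,	/* 1: init of ns1 is pid 1 in ns2, same namespace */
	PIDNS_INSIDE,	/* >1: ns1 is nested somewhere below ns2 */
};

enum pidns_relation classify_pidns(long ret)
{
	if (ret < 0)
		return PIDNS_ERROR;
	if (ret == 0)
		return PIDNS_OUTSIDE;
	if (ret == 1)
		return PIDNS_EQUAL;
	return PIDNS_INSIDE;
}
```

Note that the "> 0" test from the examples treats equal namespaces as inside;
the sketch separates the two cases so that callers can tell them apart.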

Signed-off-by: Konstantin Khlebnikov
Reanimated-by: Nagarathnam Muthusamy

---

v1: https://lkml.org/lkml/2015/9/15/411

v2: https://lkml.org/lkml/2015/9/24/278
 * use namespace-fd as second/third argument
 * add -pid for getting parent pid
 * move code into kernel/sys.c next to getppid
 * drop ifdef CONFIG_PID_NS
 * add generic syscall

v3: https://lkml.org/lkml/2015/9/28/3
 * use proc_ns_fdget()
 * update description
 * rebase to next-20150925
 * fix conflict with mlock2

v4: https://lkml.org/lkml/2017/10/13/177
 * rename from getvpid() into translate_pid()
 * remove syscall if CONFIG_PID_NS=n
 * drop -pid for parent task
 * drop fget-fdget optimizations
 * add helper get_pid_ns_by_fd()
 * wire only into x86

v5: https://lkml.org/lkml/2018/4/4/677
 * rewrite commit message
 * resolve pidns by task pid or by pidns fd
 * add arguments source_type and target_type

v6:
 * revert back to the minimized v4 design
 * rebase to next-20180601
 * fix COND_SYSCALL stub
 * use next syscall number, old used for io_pgetevents

--- sample tool ---

#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <err.h>

/* Syscall numbers as wired up by this patch; only valid with it applied. */
#ifndef SYS_translate_pid
#ifdef __x86_64__
#define SYS_translate_pid 334
#elif defined __i386__
#define SYS_translate_pid 386
#else
#error "translate_pid syscall number is unknown for this architecture"
#endif
#endif

pid_t translate_pid(pid_t pid, int source, int target)
{
	return syscall(SYS_translate_pid, pid, source, target);
}

int main(int argc, char **argv)
{
	int pid, source, target;
	char buf[64];

	if (argc != 4)
		errx(1, "usage: %s <pid> <source> <target>", argv[0]);

	pid = atoi(argv[1]);
	source = atoi(argv[2]);
	target = atoi(argv[3]);

	/* A positive source/target names a task whose pid-ns is opened. */
	if (source > 0) {
		snprintf(buf, sizeof(buf), "/proc/%d/ns/pid", source);
		source = open(buf, O_RDONLY);
		if (source < 0)
			err(2, "open source %s", buf);
	}

	if (target > 0) {
		snprintf(buf, sizeof(buf), "/proc/%d/ns/pid", target);
		target = open(buf, O_RDONLY);
		if (target < 0)
			err(2, "open target %s", buf);
	}

	pid = translate_pid(pid, source, target);
	if (pid < 0)
		err(2, "translate_pid");

	printf("%d\n", pid);
	return 0;
}

---
---
 arch/x86/entry/syscalls/syscall_32.tbl |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    1 
 include/linux/syscalls.h               |    1 
 kernel/pid_namespace.c                 |   66 ++++++++++++++++++++++++++++++++
 kernel/sys_ni.c                        |    3 +
 5 files changed, 72 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 14a2f996e543..e70685750d43 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -397,3 +397,4 @@
 383	i386	statx			sys_statx			__ia32_sys_statx
 384	i386	arch_prctl		sys_arch_prctl			__ia32_compat_sys_arch_prctl
 385	i386	io_pgetevents		sys_io_pgetevents		__ia32_compat_sys_io_pgetevents
+386	i386	translate_pid		sys_translate_pid		__ia32_sys_translate_pid
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index cd36232ab62f..ebfd89055424 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -342,6 +342,7 @@
 331	common	pkey_free		__x64_sys_pkey_free
 332	common	statx			__x64_sys_statx
 333	common	io_pgetevents		__x64_sys_io_pgetevents
+334	common	translate_pid		__x64_sys_translate_pid
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 390e814fdc8d..3f33971cf1c8 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -843,6 +843,7 @@ asmlinkage long sys_clock_adjtime(clockid_t which_clock,
 				struct timex __user *tx);
 asmlinkage long sys_syncfs(int fd);
 asmlinkage long sys_setns(int fd, int nstype);
+asmlinkage long sys_translate_pid(pid_t pid, int source, int target);
 asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg,
 			     unsigned int vlen, unsigned flags);
 asmlinkage long sys_process_vm_readv(pid_t pid,
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 2a2ac53d8b8b..3b872cbbe264 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -13,6 +13,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -380,6 +381,71 @@ static void pidns_put(struct ns_common *ns)
 	put_pid_ns(to_pid_ns(ns));
 }
 
+static struct pid_namespace *get_pid_ns_by_fd(int fd)
+{
+	struct pid_namespace *pidns;
+	struct ns_common *ns;
+	struct file *file;
+
+	file = proc_ns_fget(fd);
+	if (IS_ERR(file))
+		return ERR_CAST(file);
+
+	ns = get_proc_ns(file_inode(file));
+	if (ns->ops->type == CLONE_NEWPID)
+		pidns = get_pid_ns(to_pid_ns(ns));
+	else
+		pidns = ERR_PTR(-EINVAL);
+
+	fput(file);
+	return pidns;
+}
+
+/*
+ * translate_pid - convert pid in source pid-ns into target pid-ns.
+ * @pid:    pid for translation
+ * @source: pid-ns file descriptor or -1 for active namespace
+ * @target: pid-ns file descriptor or -1 for active namespace
+ *
+ * Returns pid in @target pid-ns, zero if task has no pid there,
+ * or -ESRCH if task with @pid is not found in @source pid-ns.
+ */
+SYSCALL_DEFINE3(translate_pid, pid_t, pid, int, source, int, target)
+{
+	struct pid_namespace *source_ns, *target_ns;
+	struct pid *struct_pid;
+	pid_t result;
+
+	if (source >= 0) {
+		source_ns = get_pid_ns_by_fd(source);
+		result = PTR_ERR(source_ns);
+		if (IS_ERR(source_ns))
+			goto err_source;
+	} else
+		source_ns = task_active_pid_ns(current);
+
+	if (target >= 0) {
+		target_ns = get_pid_ns_by_fd(target);
+		result = PTR_ERR(target_ns);
+		if (IS_ERR(target_ns))
+			goto err_target;
+	} else
+		target_ns = task_active_pid_ns(current);
+
+	rcu_read_lock();
+	struct_pid = find_pid_ns(pid, source_ns);
+	result = struct_pid ? pid_nr_ns(struct_pid, target_ns) : -ESRCH;
+	rcu_read_unlock();
+
+	if (target >= 0)
+		put_pid_ns(target_ns);
+err_target:
+	if (source >= 0)
+		put_pid_ns(source_ns);
+err_source:
+	return result;
+}
+
 static int pidns_install(struct nsproxy *nsproxy, struct ns_common *ns)
 {
 	struct pid_namespace *active = task_active_pid_ns(current);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 06b4ccee0047..bf276e9ace9a 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -153,6 +153,9 @@ COND_SYSCALL_COMPAT(kexec_load);
 COND_SYSCALL(init_module);
 COND_SYSCALL(delete_module);
 
+/* kernel/pid_namespace.c */
+COND_SYSCALL(translate_pid);
+
 /* kernel/posix-timers.c */
 
 /* kernel/printk.c */
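
The zero return ("task has no pid there") comes from the pid_nr_ns() call in
the syscall body: a struct pid carries one number per namespace level, and the
lookup only succeeds if the target namespace is the one recorded at its own
level. A simplified userspace model of that rule follows; all toy_ names are
hypothetical and only mirror the kernel structures in outline.

```c
#include <stddef.h>

#define TOY_MAX_LEVEL 4

struct toy_pidns {
	int level;			/* 0 for the root namespace */
};

struct toy_upid {
	int nr;				/* pid number in this namespace */
	const struct toy_pidns *ns;	/* namespace the number belongs to */
};

struct toy_pid {
	int level;			/* level of the innermost namespace */
	struct toy_upid numbers[TOY_MAX_LEVEL];	/* one entry per level */
};

/* Return the number of @pid as seen from @ns, or 0 if it has none there. */
int toy_pid_nr_ns(const struct toy_pid *pid, const struct toy_pidns *ns)
{
	if (pid && ns->level <= pid->level &&
	    pid->numbers[ns->level].ns == ns)
		return pid->numbers[ns->level].nr;
	return 0;
}
```

In this model a task created in a child namespace is visible from the root and
from the child, but not from a sibling namespace at the same level, and a task
living only in the root is not visible from any child.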