Subject: Re: [RESEND RFC] translate_pid API
To: Jann Horn
Cc: kernel list, Linux API, Konstantin Khlebnikov, Nagarajan.Muthukrishnan@oracle.com, Prakash Sangappa, Andy Lutomirski, Andrew Morton, Oleg Nesterov, Serge Hallyn, "Eric W. Biederman", Eugene Syromiatnikov, xemul@parallels.com
References: <1520875093-18174-1-git-send-email-nagarathnam.muthusamy@oracle.com>
From: Nagarathnam Muthusamy
Message-ID: <69f13674-7f84-5dc7-0bd7-e5e65e9cb3b0@oracle.com>
Date: Tue, 13 Mar 2018 14:20:16 -0700
X-Mailing-List: linux-kernel@vger.kernel.org

On 03/13/2018 01:47 PM, Jann Horn wrote:
> On Mon, Mar 12, 2018 at 10:18 AM, wrote:
>> Resending the RFC with participants of previous discussions
>> in the list.
>>
>> Following patch which is a variation of a solution discussed
>> in https://lwn.net/Articles/736330/ provides the users of
>> pid namespace, the functionality of pid translation between
>> namespaces using a namespace identifier. The topic of
>> pid translation has been discussed in the community few times
>> but there has always been a resistance to adding new solution
>> for this problem.
>> I will outline the planned usecase of pid namespace by oracle
>> database and explain why any of the existing solution cannot
>> be used to solve their problem.
>>
>> Consider a system in which several PID namespaces with multiple
>> nested levels exists in parallel with monitor processes managing
>> all the namespaces. PID translation is required for controlling
>> and accessing information about the processes by the monitors
>> and other processes down the hierarchy of namespaces. Controlling
>> primarily involves sending signals or using ptrace by a process in
>> parent namespace on any of the processes in its child namespace.
>> Accessing information deals with the reading /proc//* files
>> of processes in child namespace. None of the processes have
>> root/CAP_SYS_ADMIN privileges.
> How are you dealing with PID reuse?

We have a monitor process which keeps track of the aliveness of
important processes. When a process dies, the monitor makes a note
of it and can therefore detect whether its pid has been reused.

>
> [...]
>> diff --git a/fs/nsfs.c b/fs/nsfs.c
>> index 36b0772..c635465 100644
>> --- a/fs/nsfs.c
>> +++ b/fs/nsfs.c
>> @@ -222,8 +222,13 @@ int ns_get_name(char *buf, size_t size, struct task_struct *task,
>>         const char *name;
>>         ns = ns_ops->get(task);
>>         if (ns) {
>> -               name = ns_ops->real_ns_name ? : ns_ops->name;
>> -               res = snprintf(buf, size, "%s:[%u]", name, ns->inum);
>> +               if (!strcmp(ns_ops->name, "pidns_id")) {
> Wouldn't it be cleaner to check for "ns_ops==&pidns_id_operations"?

Yup. Will fix it.
>
>> +                       res = snprintf(buf, size, "[%llu]",
>> +                                      (unsigned long long)ns->ns_id);
>> +               } else {
>> +                       name = ns_ops->real_ns_name ? : ns_ops->name;
>> +                       res = snprintf(buf, size, "%s:[%u]", name, ns->inum);
>> +               }
>>                 ns_ops->put(ns);
>>         }
>>         return res;
> [...]
>> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
>> index 49538b1..11d1d57 100644
>> --- a/include/linux/pid_namespace.h
>> +++ b/include/linux/pid_namespace.h
>> @@ -11,6 +11,7 @@
>>  #include
>>  #include
>>  #include
>> +#include
>>
>>
>>  struct fs_pin;
>> @@ -44,6 +45,8 @@ struct pid_namespace {
>>         kgid_t pid_gid;
>>         int hide_pid;
>>         int reboot;     /* group exit code if this pidns was rebooted */
>> +       struct hlist_bl_node node;
>> +       atomic_t lookups_pending;
>>         struct ns_common ns;
>>  } __randomize_layout;
>>
> [...]
>> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
>> index 0b53eef..ff83aa8 100644
>> --- a/kernel/pid_namespace.c
>> +++ b/kernel/pid_namespace.c
> [...]
>> @@ -159,6 +201,30 @@ static void delayed_free_pidns(struct rcu_head *p)
>>
>>  static void destroy_pid_namespace(struct pid_namespace *ns)
>>  {
>> +       struct pid_namespace *ph;
>> +       struct hlist_bl_head *head;
>> +       struct hlist_bl_node *dup_node;
>> +
>> +       /*
>> +        * Remove the namespace structure from hash table so
>> +        * now new lookups can start on it.
> s/now new/no new/

Will fix it.

>
> [...]
>> @@ -474,9 +551,116 @@ static struct user_namespace *pidns_owner(struct ns_common *ns)
>>         .get_parent = pidns_get_parent,
>>  };
>>
>> +/*
>> + * translate_pid - convert pid in source pid-ns into target pid-ns.
>> + * @pid:    pid for translation
>> + * @source: pid-ns id
>> + * @target: pid-ns id
>> + *
>> + * Return pid in @target pid-ns, zero if task have no pid there,
>> + * or -ESRCH of task with @pid is not found in @source pid-ns.
> s/of/if/

Will fix it.
>
>> + */
>> +SYSCALL_DEFINE3(translate_pid, pid_t, pid, u64, source,
>> +               u64, target)
>> +{
>> +       struct pid_namespace *source_ns = NULL, *target_ns = NULL;
>> +       struct pid *struct_pid;
>> +       struct pid_namespace *ph;
>> +       struct hlist_bl_head *shead = NULL;
>> +       struct hlist_bl_head *thead = NULL;
>> +       struct hlist_bl_node *dup_node;
>> +       pid_t result;
>> +
>> +       if (!source) {
>> +               source_ns = &init_pid_ns;
>> +       } else {
>> +               shead = pid_ns_hash_head(pid_ns_hash, source);
>> +               hlist_bl_lock(shead);
>> +               hlist_bl_for_each_entry(ph, dup_node, shead, node) {
>> +                       if (source == ph->ns.ns_id) {
>> +                               source_ns = ph;
>> +                               break;
>> +                       }
>> +               }
>> +               if (!source_ns) {
>> +                       hlist_bl_unlock(shead);
>> +                       return -EINVAL;
>> +               }
>> +       }
>> +       if (!ptrace_may_access(source_ns->child_reaper,
>> +                              PTRACE_MODE_READ_FSCREDS)) {
> AFAICS this proposal breaks the visibility restrictions that
> namespaces normally create. If there are two namespaces-based
> containers that use the same UID range, I don't think they should be
> able to learn information about each other, such as which PIDs are in
> use in the other container; but as far as I can tell, your proposal
> makes it possible to do that (unless an LSM or so is interfering). I
> would prefer it if this API required visibility of the targeted PID
> namespaces in the caller's PID namespace.

I am trying to mirror the access restrictions that apply to a
process's /proc//ns/pid file. If the translator has access to the
/proc//ns/pid files of both the source and destination namespaces,
shouldn't it be allowed to translate a pid between them?

>
> When doing ptrace access checks, please use the real creds in syscalls
> like this one, not the fs creds. The fs creds are for filesystem
> syscalls (in particular sys_open()), not for specialized syscalls like
> ptrace() or this one.

Will fix this.

Thanks,
Nagarathnam.
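
For reference, a minimal userspace sketch of the existing access path that the reply above compares against: reading a process's pid namespace link and extracting the inode number from the "name:[inum]" string that the unmodified ns_get_name() branch formats with "%s:[%u]". The helper names parse_ns_link() and pidns_inum_of() are hypothetical and are not part of the patch. The sketch does not call the proposed translate_pid syscall; it only assumes the usual /proc/<pid>/ns/pid symlink.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: parse a namespace link target such as
 * "pid:[4026531836]". Returns 0 and stores the inode number on
 * success, -1 if the string is malformed or names another namespace.
 */
int parse_ns_link(const char *link, const char *expected_name,
                  unsigned int *inum)
{
    char name[32];
    unsigned int value;
    int end = -1;

    /* %n is only reached if the closing ']' matched. */
    if (sscanf(link, "%31[^:]:[%u]%n", name, &value, &end) != 2)
        return -1;
    if (end < 0 || link[end] != '\0')
        return -1;              /* missing ']' or trailing characters */
    if (strcmp(name, expected_name) != 0)
        return -1;
    *inum = value;
    return 0;
}

/*
 * Hypothetical helper: read /proc/<pid>/ns/pid for the given pid.
 * Returns -1 if the link cannot be read, for example when the caller
 * is not permitted to access it.
 */
int pidns_inum_of(pid_t pid, unsigned int *inum)
{
    char path[64];
    char target[64];
    ssize_t len;

    snprintf(path, sizeof(path), "/proc/%ld/ns/pid", (long)pid);
    len = readlink(path, target, sizeof(target) - 1);
    if (len < 0)
        return -1;
    target[len] = '\0';
    return parse_ns_link(target, "pid", inum);
}
```

A caller would pass a pid that is visible in its own namespace, for example pidns_inum_of(getpid(), &inum). Two processes that yield the same inode number share a pid namespace, and a failed readlink() corresponds to the case where the caller has no access to that process's namespace link.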