Subject: Re: [PATCH RFC v5] pidns: introduce syscall translate_pid
From: Nagarathnam Muthusamy
To: Konstantin Khlebnikov, "Eric W. Biederman"
Cc: linux-api@vger.kernel.org, linux-kernel@vger.kernel.org, Jann Horn, Serge Hallyn, Oleg Nesterov, Andy Lutomirski, Prakash Sangappa, Andrew Morton
References: <152286911105.615669.14053871624892399807.stgit@buzz> <87h8oqhagl.fsf@xmission.com> <112c7cac-1982-3a2e-ffc0-878bc5ae4bb6@yandex-team.ru> <778ab3d0-b6bc-fdb5-669a-40222e5020d4@yandex-team.ru> <3e2c285a-1bf8-f71d-1b74-4d6465c29a54@yandex-team.ru> <4b21a648-0abc-e92d-ec37-681dff63bb55@oracle.com>
Message-ID: <32016c04-8835-bf8f-74c7-254e28d815aa@oracle.com>
In-Reply-To: <4b21a648-0abc-e92d-ec37-681dff63bb55@oracle.com>
Date: Tue, 15 May 2018 10:44:59 -0700
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

On 05/15/2018 10:40 AM, Nagarathnam Muthusamy wrote:
>
>
> On 05/15/2018 10:36 AM, Konstantin Khlebnikov wrote:
>>
>>
>> On 15.05.2018 20:19, Nagarathnam Muthusamy wrote:
>>>
>>>
>>> On 04/24/2018 10:36 PM, Konstantin Khlebnikov wrote:
>>>> On 23.04.2018 20:37, Nagarathnam Muthusamy wrote:
>>>>>
>>>>>
>>>>> On 04/05/2018 12:02 AM, Konstantin Khlebnikov wrote:
>>>>>> On 05.04.2018 01:29, Eric W. Biederman wrote:
>>>>>>> Nagarathnam Muthusamy writes:
>>>>>>>
>>>>>>>> On 04/04/2018 12:11 PM, Konstantin Khlebnikov wrote:
>>>>>>>>> Each process has a different pid in each pid namespace it
>>>>>>>>> belongs to.
>>>>>>>>> When interaction happens within a single pid-ns, translation
>>>>>>>>> isn't required.
>>>>>>>>> More complicated scenarios need special handling.
>>>>>>>>>
>>>>>>>>> For example:
>>>>>>>>> - reading pid-files or logs written inside a container with
>>>>>>>>>   a pid namespace
>>>>>>>>> - attaching with ptrace to tasks from a different pid namespace
>>>>>>>>> - passing pids across pid namespaces in any kind of API
>>>>>>>>>
>>>>>>>>> Currently there are several interfaces that could be used here:
>>>>>>>>>
>>>>>>>>> Pid namespaces are identified by the inode number of
>>>>>>>>> /proc/[pid]/ns/pid.
>>>>>>>
>>>>>>> Using the inode number in interfaces is not an option.
>>>>>>> Especially not without referencing the device number for the
>>>>>>> filesystem as well.
>>>>>>
>>>>>> This is supposed to be a single-instance fs,
>>>>>> not part of proc but referenced by its magic "symlinks".
>>>>>>
>>>>>> Device numbers are not mentioned in "man namespaces".
>>>>>>
>>>>>>>
>>>>>>>>> Pids for nested pid namespaces are shown in the file
>>>>>>>>> /proc/[pid]/status.
>>>>>>>>> In some cases the conversion pid -> vpid could easily be done
>>>>>>>>> using this information, but backward translation requires
>>>>>>>>> scanning all tasks.
>>>>>>>>>
>>>>>>>>> A Unix socket automatically translates the pid attached to
>>>>>>>>> SCM_CREDENTIALS.
>>>>>>>>> This requires CAP_SYS_ADMIN for sending arbitrary pids and
>>>>>>>>> entering the pid namespace; this exposes the process and
>>>>>>>>> could be insecure.
>>>>>>>>>
>>>>>>>>> This patch adds a new syscall for converting pids between pid
>>>>>>>>> namespaces:
>>>>>>>>>
>>>>>>>>> pid_t translate_pid(pid_t pid, int source_type, int source,
>>>>>>>>>                     int target_type, int target);
>>>>>>>>>
>>>>>>>>> @source_type and @target_type define the type of the
>>>>>>>>> following arguments:
>>>>>>>>>
>>>>>>>>> TRANSLATE_PID_CURRENT_PIDNS  - current pid namespace, argument
>>>>>>>>>                                is unused
>>>>>>>>> TRANSLATE_PID_TASK_PIDNS     - task pid-ns, argument is task pid
>>>>>>>>
>>>>>>>> I believe using a pid to represent the namespace has already
>>>>>>>> been discussed in V1 of this patch in
>>>>>>>> https://lkml.org/lkml/2015/9/22/1087
>>>>>>>> after which we moved on to the fd based version of this
>>>>>>>> interface.
>>>>>>>
>>>>>>> Or in short, why is the case of pids important?
>>>>>>>
>>>>>>> Konstantin, you almost said why they were important in your
>>>>>>> message saying you were going to send this one.  However, you
>>>>>>> don't explain in your description why you want to identify pid
>>>>>>> namespaces by pid.
>>>>>>>
>>>>>>
>>>>>> Opening /proc/[pid]/ns/pid requires the same permissions as
>>>>>> ptrace; the pid based variant doesn't have such restrictions.
>>>>>
>>>>> Can you provide more information on a usecase requiring PID
>>>>> translation but not used for tracing related purposes?
>>>>
>>>> Any introspection for [nested] containers. It's easier to work when
>>>> you have all the information than when you don't have any.
>>>> For example our CMS https://github.com/yandex/porto allows starting
>>>> a nested sub-container (or even deeper) by request from any
>>>> container and has to report back which pid the task has. And it
>>>> could translate any pid inside into one accessible by the client
>>>> and vice versa.
>>>>
>>>
>>> I still don't get the exact reason why a PID based approach to
>>> identifying the namespace during pid translation is absolutely
>>> required compared to the fd based approach.
>>
>> As I said, open(/proc/%d/ns/pid) has security restrictions - same
>> uid/CAP_SYS_PTRACE/whatever.
>> A pidns-fd holds the pid namespace and without restrictions could be
>> abused.
>> The pid based API is racy but always available without any
>> restrictions.
>
> I get that Pid based API is available without any restrictions but do
> we have any existing usecase which requires Pid based API but cannot
> use Pidns-fd based API? Most of the usecases discussed in this thread
> deal with introspection of a process by another process and I believe
> that security requirement for opening (/proc/%d/ns/pid) is required
> for all such usecases. In other words, why would a process which does
> not belong to same uid

Typo: inspection of a process by another process

Thanks,
Nagarathnam.

> of the process observed or have CAP_SYS_PTRACE be allowed to translate
> PID?
>
> Thanks,
> Nagarathnam.
>>
>>
>>> From your version of TranslatePid in
>>>
>>> https://github.com/yandex/porto/blob/0d7e6e7e1830dcd0038a057b2ab9964cec5b8fab/src/util/unix.cpp
>>>
>>>
>>> I see that you are going through the trouble of forking a process
>>> and sending SCM_CREDENTIALS for pid translation. Even your existing
>>> API could be extremely simplified if translate_pid based on file
>>> descriptors makes it through the gate, and I believe from the last
>>> discussion it was almost there
>>> https://patchwork.kernel.org/patch/10305439/
>>>
>>>
>>>>> On a side note, can we have the types TRANSLATE_PID_CURRENT_PIDNS
>>>>> and TRANSLATE_PID_FD_PIDNS integrated first and then possibly
>>>>> extend the interface to include TRANSLATE_PID_TASK_PIDNS in future?
>>>>
>>>> I don't see a reason for this separation.
>>>> Pids and pid namespaces have been part of the API for a long time.
>>>
>>> If you are talking about the translate_pid API proposed, I believe
>>> the V4 proposed under https://patchwork.kernel.org/patch/10003935/
>>> had only an fd based API before a mix of PID and fd based was
>>> proposed in V5. Again, I was just wondering if we can get the fd
>>> based approach in first and then extend the API to include the PID
>>> based approach later, as the fd based approach could provide a lot
>>> of immediate benefits?
>>>
>>> Thanks,
>>> Nagarathnam.
>>>>
>>>>>
>>>>> Thanks,
>>>>> Nagarathnam.
>>>>>> Most pid-based syscalls are racy in some cases but they have been
>>>>>> here for decades and everybody knows how to deal with it.
>>>>>> So, I've decided to merge both worlds in one interface which
>>>>>> clearly tells what to expect.
>>>>>
>>>
>
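
For readers following the thread: the pid -> vpid conversion via
/proc/[pid]/status mentioned in the quoted patch description can be
sketched roughly as below. This is only an illustration in Python, not
code from the patch or from porto. The helper names parse_nspid and
pid_to_vpid are made up for this sketch, and it assumes a kernel that
emits the NSpid line in /proc/[pid]/status (Linux 4.1 or later).

```python
from typing import List, Optional


def parse_nspid(status_text: str) -> Optional[List[int]]:
    """Return the pids listed on the NSpid line of a /proc/[pid]/status dump.

    The first entry is the pid as seen from the namespace of the procfs
    mount, followed by the pid in each successively nested namespace.
    Returns None when there is no NSpid line.
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return None


def pid_to_vpid(pid: int) -> Optional[int]:
    """Hypothetical helper: pid of the task inside its own (innermost) pid-ns.

    This is inherently racy, like any pid based lookup: the task can exit
    and the pid can be reused between the caller's check and this read.
    """
    with open(f"/proc/{pid}/status") as status_file:
        pids = parse_nspid(status_file.read())
    return pids[-1] if pids else None
```

The backward direction (vpid -> pid) has no such shortcut: as the patch
description notes, it means scanning every task's status file, which is
part of the motivation for a dedicated syscall.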