Subject: Re: [PATCH RFC v5] pidns: introduce syscall translate_pid
To: Nagarathnam Muthusamy, "Eric W. Biederman"
Cc: linux-api@vger.kernel.org, linux-kernel@vger.kernel.org, Jann Horn, Serge Hallyn, Oleg Nesterov, Andy Lutomirski, Prakash Sangappa, Andrew Morton
References: <152286911105.615669.14053871624892399807.stgit@buzz> <87h8oqhagl.fsf@xmission.com> <112c7cac-1982-3a2e-ffc0-878bc5ae4bb6@yandex-team.ru> <778ab3d0-b6bc-fdb5-669a-40222e5020d4@yandex-team.ru>
From: Konstantin Khlebnikov
Message-ID: <3e2c285a-1bf8-f71d-1b74-4d6465c29a54@yandex-team.ru>
Date: Tue, 15 May 2018 20:36:39 +0300

On 15.05.2018 20:19, Nagarathnam Muthusamy wrote:
>
>
> On 04/24/2018 10:36 PM, Konstantin Khlebnikov wrote:
>> On 23.04.2018 20:37, Nagarathnam Muthusamy wrote:
>>>
>>>
>>> On 04/05/2018 12:02 AM, Konstantin Khlebnikov wrote:
>>>> On 05.04.2018 01:29, Eric W. Biederman wrote:
>>>>> Nagarathnam Muthusamy writes:
>>>>>
>>>>>> On 04/04/2018 12:11 PM, Konstantin Khlebnikov wrote:
>>>>>>> Each process has different pids, one for each pid namespace it belongs to.
>>>>>>> When interaction happens within a single pid-ns, translation isn't required.
>>>>>>> More complicated scenarios need special handling.
>>>>>>>
>>>>>>> For example:
>>>>>>> - reading pid-files or logs written inside a container with a pid namespace
>>>>>>> - attaching with ptrace to tasks from a different pid namespace
>>>>>>> - passing pids across pid namespaces in any kind of API
>>>>>>>
>>>>>>> Currently there are several interfaces that could be used here:
>>>>>>>
>>>>>>> Pid namespaces are identified by the inode number of /proc/[pid]/ns/pid.
>>>>>
>>>>> Using the inode number in interfaces is not an option. Especially not
>>>>> without referencing the device number for the filesystem as well.
>>>>
>>>> This is supposed to be a single-instance fs,
>>>> not part of proc but referenced by its magic "symlinks".
>>>>
>>>> Device numbers are not mentioned in "man namespaces".
>>>>
>>>>>
>>>>>>> Pids for nested pid namespaces are shown in the file /proc/[pid]/status.
>>>>>>> In some cases the conversion pid -> vpid could be easily done using this
>>>>>>> information, but backward translation requires scanning all tasks.
>>>>>>>
>>>>>>> A unix socket automatically translates the pid attached to SCM_CREDENTIALS.
>>>>>>> This requires CAP_SYS_ADMIN for sending arbitrary pids and entering
>>>>>>> the pid namespace; this exposes the process and could be insecure.
>>>>>>>
>>>>>>> This patch adds a new syscall for converting pids between pid namespaces:
>>>>>>>
>>>>>>> pid_t translate_pid(pid_t pid, int source_type, int source,
>>>>>>>                     int target_type, int target);
>>>>>>>
>>>>>>> @source_type and @target_type define the type of the following arguments:
>>>>>>>
>>>>>>> TRANSLATE_PID_CURRENT_PIDNS  - current pid namespace, argument is unused
>>>>>>> TRANSLATE_PID_TASK_PIDNS     - task pid-ns, argument is task pid
>>>>>>
>>>>>> I believe using pid to represent the namespace has been already
>>>>>> discussed in V1 of this patch in https://lkml.org/lkml/2015/9/22/1087
>>>>>> after which we moved on to fd based version of this interface.
>>>>>
>>>>> Or in short why is the case of pids important?
>>>>>
>>>>> Konstantin, you almost said why they were important in your message
>>>>> saying you were going to send this one.  However you don't explain in
>>>>> your description why you want to identify pid namespaces by pid.
>>>>>
>>>>
>>>> Open of /proc/[pid]/ns/pid requires the same permissions as ptrace,
>>>> the pid based variant doesn't have such restrictions.
>>>
>>> Can you provide more information on a usecase requiring PID translation but not used for tracing related purposes?
>>
>> Any introspection for [nested] containers. It's easier to work when you have all the information than when you don't have any.
>> For example our CMS https://github.com/yandex/porto allows starting a nested sub-container (or even deeper) by request from any container
>> and has to tell back which pid the task has. And it could translate any pid inside into one accessible by the client and vice versa.
>>
>
> I still don't get the exact reason why a PID based approach to identify the namespace during the pid translation process is absolutely required
> compared to the fd based approach.

As I said, open(/proc/%d/ns/pid) has security restrictions: same uid, CAP_SYS_PTRACE, whatever.
A pidns fd holds the pid namespace, and without restrictions this could be abused.
A pid based API is racy but always available without any restrictions.

> From your version of TranslatePid in
>
> https://github.com/yandex/porto/blob/0d7e6e7e1830dcd0038a057b2ab9964cec5b8fab/src/util/unix.cpp
>
> I see that you are going through the trouble of forking a process and sending SCM_CREDENTIALS for pid translation. Even your existing API
> could be extremely simplified if translate_pid based on file descriptors makes it to the gate and I believe from the last discussion it was
> almost there https://patchwork.kernel.org/patch/10305439/
>
>
>>> On a side note, can we have the types TRANSLATE_PID_CURRENT_PIDNS and TRANSLATE_PID_FD_PIDNS integrated first and then possibly extend
>>> the interface to include TRANSLATE_PID_TASK_PIDNS in future?
>>
>> I don't see a reason for this separation.
>> Pids and pid namespaces have been part of the API for a long time.
>
> If you are talking about the translate_pid API proposed, I believe the V4 proposed under https://patchwork.kernel.org/patch/10003935/ had
> only an fd based API before a mix of PID and fd based was proposed in V5. Again, I was just wondering if we can get the FD based approach in
> first and then extend the API to include the PID based approach later, as the fd based approach could provide a lot of immediate benefits?
>
> Thanks,
> Nagarathnam.
>>
>>>
>>> Thanks,
>>> Nagarathnam.
>>>> Most pid-based syscalls are racy in some cases but they have been
>>>> here for decades and everybody knows how to deal with it.
>>>> So, I've decided to merge both worlds in one interface which clearly tells what to expect.
>>>
>
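
For reference, the /proc/[pid]/status route mentioned in the patch
description can be sketched roughly as below. This is only an illustration
of the existing interface, not part of the patch: it assumes a kernel that
exports the NSpid field (4.1 or later), and the helper names parse_nspid,
pid_to_vpid and find_by_vpid are made up for this example. Forward
translation is a single read; backward translation has to scan all tasks.

```python
import os


def parse_nspid(status_text):
    """Return the NSpid values from the text of /proc/[pid]/status.

    The first value is the pid as seen through this procfs mount; each
    following value is the pid in the next nested pid namespace.
    Returns an empty list when the field is absent.
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []


def pid_to_vpid(pid):
    """Forward translation: pid visible here -> pid in the task's own pid-ns."""
    with open(f"/proc/{pid}/status") as status:
        nspid = parse_nspid(status.read())
    return nspid[-1] if nspid else None


def find_by_vpid(vpid):
    """Backward translation: no direct lookup exists, so scan every task.

    Tasks in different nested namespaces can share the same vpid, so this
    returns all candidates rather than a single answer.
    """
    matches = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/status") as status:
                nspid = parse_nspid(status.read())
        except OSError:
            continue  # the task exited during the scan (the usual pid race)
        if len(nspid) > 1 and nspid[-1] == vpid:
            matches.append(int(entry))
    return matches
```

Telling the candidates from find_by_vpid apart would need the namespace
identity from /proc/[pid]/ns/pid, which is exactly the open that needs
ptrace-level permissions.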