Message-ID: <59E689F5.2080706@oracle.com>
Date: Tue, 17 Oct 2017 15:53:41 -0700
From: prakash sangappa
Organization: Oracle Corporation
To: Andy Lutomirski
CC: Nagarathnam Muthusamy, Andrew Morton, Konstantin Khlebnikov,
 Oleg Nesterov, Linux API, "linux-kernel@vger.kernel.org",
 Serge Hallyn, "Eric W. Biederman", Eugene Syromiatnikov
Subject: Re: [PATCH v4] pidns: introduce syscall translate_pid
References: <150788678482.924140.11785205105514746135.stgit@buzz>
 <20171013160514.GA27812@redhat.com>
 <3bdb5341-9ae6-265a-ce5b-45c2cfc76fad@yandex-team.ru>
 <20171016143628.b2ef80a9ef16d4345889b4d9@linux-foundation.org>
 <59E685B3.1000200@oracle.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 10/17/2017 3:40 PM, Andy Lutomirski wrote:
> On Tue, Oct 17, 2017 at 3:35 PM, prakash sangappa wrote:
>> On 10/17/2017 3:02 PM, Andy Lutomirski wrote:
>>> On Tue, Oct 17, 2017 at 8:38 AM, Prakash Sangappa wrote:
>>>>
>>>> On 10/16/17 5:52 PM, Andy Lutomirski wrote:
>>>>> On Mon, Oct 16, 2017 at 3:54 PM, prakash.sangappa wrote:
>>>>>>
>>>>>> On 10/16/2017 03:07 PM, Nagarathnam Muthusamy wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 10/16/2017 02:36 PM, Andrew Morton wrote:
>>>>>>>> On Sat, 14 Oct 2017 11:17:47 +0300 Konstantin Khlebnikov wrote:
>>>>>>>>
>>>>>>>>>>>> pid_t translate_pid(pid_t pid, int source, int target);
>>>>>>>>>>>>
>>>>>>>>>>>> This syscall converts pid from source pid-ns into pid in target
>>>>>>>>>>>> pid-ns.
>>>>>>>>>>>> If pid is unreachable from target pid-ns it returns zero.
>>>>>>>>>>>>
>>>>>>>>>>>> Pid-namespaces are referred file descriptors opened to proc files
>>>>>>>>>>>> /proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. Negative
>>>>>>>>>>>> argument refers to current pid namespace, same as file
>>>>>>>>>>>> /proc/self/ns/pid.
>>>>>>>>>>>>
>>>>>>>>>>>> Kernel expose virtual pids in /proc/[pid]/status:NSpid, but
>>>>>>>>>>>> backward translation requires scanning all tasks. Also pids
>>>>>>>>>>>> could be translated by sending them through unix socket between
>>>>>>>>>>>> namespaces, this method is slow and insecure because other side
>>>>>>>>>>>> is exposed inside pid namespace.
>>>>>>>>> Andrew asked why we might need this.
>>>>>>>>>
>>>>>>>>> Such conversion is required for interaction between processes across
>>>>>>>>> pid-namespaces.
>>>>>>>>> For example to identify process in container by pid file looking
>>>>>>>>> from outside.
>>>>>>>>>
>>>>>>>>> Two years ago I've solved this in project of mine with monstrous
>>>>>>>>> code which forks couple times just to convert pid, lucky for me
>>>>>>>>> performance wasn't important.
>>>>>>>> That's a single user who needed this a single time, and found a
>>>>>>>> userspace-based solution anyway. This is not exactly compelling!
>>>>>>>>
>>>>>>>> Is there a stronger case to be made? How does this change benefit
>>>>>>>> our users? Sell it to us!
>>>>>>> Oracle database is planning to use pid namespace for sandboxing
>>>>>>> database instances and they need an API similar to translate_pid to
>>>>>>> effectively translate process IDs from other pid namespaces. Prakash
>>>>>>> (cced in mail) can provide more details on this usecase.
>>>>>>
>>>>>> As Nagarathnam indicated, Oracle Database will be using pid namespaces
>>>>>> and needs a direct method of converting pids of processes in the pid
>>>>>> namespace hierarchy. In this use case multiple nested PID namespaces
>>>>>> will be used. The currently available mechanism are not very efficient
>>>>>> for this use case. For ex.
>>>>>> as Konstantin described, using /proc//status would require the
>>>>>> application to scan all the pid's status files to determine the pid
>>>>>> of given process in a child namespace.
>>>>>>
>>>>>> Use of SCM_CREDENTIALS's socket message is another way, which would
>>>>>> require every process starting inside a pid namespace to send this
>>>>>> message and the receiving process in the target namespace would have
>>>>>> to save the converted pid and reference it. This mechanism becomes
>>>>>> cumbersome especially if the application has to deal with multiple
>>>>>> nested pid namespaces. Also, the Database needs to be able to convert
>>>>>> a thread's global pid(gettid()). Passing the thread's pid(gettid())
>>>>>> in SCM_CREDENTIALS message requires CAP_SYS_ADMIN, which is an issue.
>>>>>>
>>>>>> So having a direct method, like the API that Konstantin is proposing,
>>>>>> will work best for the Database since pid of a process in any of the
>>>>>> nested pid namespaces can be converted as and when required. I think
>>>>>> with the proposed API, the application should be able to convert pid
>>>>>> of a process or tid(gettid()) of a thread as well.
>>>>>>
>>>>> Can you explain what Oracle's database is planning to do with this
>>>>> information?
>>>>
>>>> Database uses the PID to programmatically find out if the process/thread
>>>> is alive(kill 0) also send signals to the processes requesting it to
>>>> dump status/debug information and kill the processes in case of a
>>>> shutdown abort of the instance.
>>> What I'm wondering is: how does the caller of kill() end up
>>> controlling a task whose pid it doesn't know in its own namespace?
>>
>> I was generally describing how DB would use the PID of process. The above
>> description was in the case when no namespaces are used.
>>
>> With use of namespaces, the DB would convert the PID of processes inside
>> its children namespaces to PID in its namespace and use that pid to issue
>> kill().
> Seems vaguely sensible.
>
> If I were designing this type of system, I'd have a manager process in
> each namespace running as PID 1, though -- PID 1 is special and needs
> to understand what's going on anyway. Then PID 1 would do the kill()
> calls and wouldn't need translate_pid().

Yes, this has been tried out with the prototype use of PID namespaces in
the DB. It works, but it would be slow, because the manager would have to
exchange messages with the controlling processes, which would be in the
parent namespace. The DB could instead use the API to convert the pid
directly.
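
For reference, the /proc-based lookup that the quoted messages describe as
inefficient could look roughly like the sketch below. It is only an
illustration, not code from the patch or from the database:
parse_nspid(), find_outer_pid() and NSPID_MAX are hypothetical names made
up here. It assumes a kernel that reports the NSpid field in
/proc/[pid]/status and a procfs mounted in the caller's pid namespace. It
matches on the innermost pid only, so it cannot tell apart two child
namespaces that both contain that pid, and it looks only at the top-level
/proc entries, not at the per-thread ones under /proc/[pid]/task.

```c
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

#define NSPID_MAX 64	/* assumed generous bound, not a kernel constant */

/*
 * Parse one "NSpid:" line from /proc/[pid]/status.  The field lists the
 * task's pid in each pid namespace it belongs to, starting with the
 * namespace of the procfs mount and ending with the innermost one.
 * Returns the number of pids stored in out[], or 0 if this is not an
 * NSpid line.
 */
size_t parse_nspid(const char *line, pid_t *out, size_t max)
{
	const char *p;
	size_t n = 0;

	assert(line != NULL && out != NULL);
	if (strncmp(line, "NSpid:", 6) != 0)
		return 0;
	p = line + 6;
	while (n < max) {
		char *end;
		long v = strtol(p, &end, 10);

		if (end == p)	/* no more numbers on this line */
			break;
		out[n++] = (pid_t)v;
		p = end;
	}
	return n;
}

/*
 * Scan every numeric entry in /proc for a task that lives in a nested
 * namespace and whose innermost pid equals inner_pid.  Returns that
 * task's pid as seen through this procfs mount, or 0 if none is found.
 * Every lookup walks the whole task list.
 */
pid_t find_outer_pid(pid_t inner_pid)
{
	DIR *proc = opendir("/proc");
	struct dirent *de;
	pid_t found = 0;

	if (proc == NULL)
		return 0;
	while (found == 0 && (de = readdir(proc)) != NULL) {
		char path[288], line[256];
		pid_t ids[NSPID_MAX];
		FILE *f;

		if (!isdigit((unsigned char)de->d_name[0]))
			continue;
		snprintf(path, sizeof(path), "/proc/%s/status", de->d_name);
		f = fopen(path, "r");
		if (f == NULL)
			continue;	/* task may have exited meanwhile */
		while (fgets(line, sizeof(line), f) != NULL) {
			size_t n = parse_nspid(line, ids, NSPID_MAX);

			if (n == 0)
				continue;
			if (n >= 2 && ids[n - 1] == inner_pid)
				found = ids[0];
			break;	/* a status file has one NSpid line */
		}
		fclose(f);
	}
	closedir(proc);
	return found;
}
```

A caller in the parent namespace could pass the value returned by
find_outer_pid() to kill(pid, 0) for the liveness check mentioned above.
The proposed translate_pid() would replace the scan with a single call
that also names the source and target namespaces explicitly.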