Reply-To: prakash.sangappa@oracle.com
Subject: Re: [PATCH v4] pidns: introduce syscall translate_pid
References: <150788678482.924140.11785205105514746135.stgit@buzz> <20171013160514.GA27812@redhat.com>
 <3bdb5341-9ae6-265a-ce5b-45c2cfc76fad@yandex-team.ru> <20171016143628.b2ef80a9ef16d4345889b4d9@linux-foundation.org>
To: Jann Horn
Cc: Andy Lutomirski, Nagarathnam Muthusamy, Andrew Morton, Konstantin Khlebnikov, Oleg Nesterov, Linux API, "linux-kernel@vger.kernel.org", Serge Hallyn, "Eric W. Biederman", Eugene Syromiatnikov
From: "prakash.sangappa"
Date: Wed, 1 Nov 2017 17:38:10 -0700
X-Mailing-List: linux-kernel@vger.kernel.org

On 11/01/2017 10:43 AM, Jann Horn wrote:
> On Tue, Oct 17, 2017 at 5:38 PM, Prakash Sangappa wrote:
>>
>> On 10/16/17 5:52 PM, Andy Lutomirski wrote:
>>> On Mon, Oct 16, 2017 at 3:54 PM, prakash.sangappa wrote:
>>>>
>>>> On 10/16/2017 03:07 PM, Nagarathnam Muthusamy wrote:
>>>>>
>>>>> On 10/16/2017 02:36 PM, Andrew Morton wrote:
>>>>>> On Sat, 14 Oct 2017 11:17:47 +0300 Konstantin Khlebnikov wrote:
>>>>>>
>>>>>>>>>> pid_t translate_pid(pid_t pid, int source, int target);
>>>>>>>>>>
>>>>>>>>>> This syscall converts pid from source pid-ns into pid in target pid-ns.
>>>>>>>>>> If pid is unreachable from target pid-ns it returns zero.
>>>>>>>>>>
>>>>>>>>>> Pid-namespaces are referred file descriptors opened to proc files
>>>>>>>>>> /proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. Negative argument
>>>>>>>>>> refers to current pid namespace, same as file /proc/self/ns/pid.
>>>>>>>>>>
>>>>>>>>>> Kernel expose virtual pids in /proc/[pid]/status:NSpid, but backward
>>>>>>>>>> translation requires scanning all tasks.
>>>>>>>>>> Also pids could be translated
>>>>>>>>>> by sending them through unix socket between namespaces, this method is
>>>>>>>>>> slow and insecure because other side is exposed inside pid namespace.
>>>>>>> Andrew asked why we might need this.
>>>>>>>
>>>>>>> Such conversion is required for interaction between processes across
>>>>>>> pid-namespaces.
>>>>>>> For example to identify process in container by pid file looking from
>>>>>>> outside.
>>>>>>>
>>>>>>> Two years ago I've solved this in project of mine with monstrous code which
>>>>>>> forks couple times just to convert pid, lucky for me performance wasn't
>>>>>>> important.
>>>>>> That's a single user who needed this a single time, and found a
>>>>>> userspace-based solution anyway. This is not exactly compelling!
>>>>>>
>>>>>> Is there a stronger case to be made? How does this change benefit our
>>>>>> users? Sell it to us!
>>>>> Oracle database is planning to use pid namespace for sandboxing database
>>>>> instances and they need an API similar to translate_pid to effectively
>>>>> translate process IDs from other pid namespaces. Prakash (cced in mail) can
>>>>> provide more details on this usecase.
>>>>
>>>> As Nagarathnam indicated, Oracle Database will be using pid namespaces and
>>>> needs a direct method of converting pids of processes in the pid namespace
>>>> hierarchy. In this use case multiple nested PID namespaces will be used.
>>>> The currently available mechanism are not very efficient for this use case.
>>>> For ex. as Konstantin described, using /proc//status would require the
>>>> application to scan all the pid's status files to determine the pid of
>>>> given process in a child namespace.
>>>>
>>>> Use of SCM_CREDENTIALS's socket message is another way, which would require
>>>> every process starting inside a pid namespace to send this message and the
>>>> receiving process in the target namespace would have to save the converted
>>>> pid and reference it. This mechanism becomes cumbersome especially if the
>>>> application has to deal with multiple nested pid namespaces. Also, the
>>>> Database needs to be able to convert a thread's global pid(gettid()).
>>>> Passing the thread's pid(gettid()) in SCM_CREDENTIALS message requires
>>>> CAP_SYS_ADMIN, which is an issue.
>>>>
>>>> So having a direct method, like the API that Konstantin is proposing, will
>>>> work best for the Database
>>>> since pid of a process in any of the nested pid namespaces can be converted
>>>> as and when required. I think with the proposed API, the application should
>>>> be able to convert pid of a process or tid(gettid()) of a thread as well.
>>>>
>>> Can you explain what Oracle's database is planning to do with this
>>> information?
>>
>> Database uses the PID to programmatically find out if the process/thread is
>> alive(kill 0) also send signals to the processes requesting it to dump
>> status/debug information and kill the processes in case of a shutdown abort
>> of the instance.
> But if kill(pid, 0) returns 0, that doesn't tell you anything, right?
> It could be that the process you're trying to check is still alive, but it
> could also be that it has died, ns_last_pid has wrapped around, and the PID
> is now being reused by another process, right?

That is true. The database checks the process start time by reading the
/proc//stat file to verify that it is the correct process.

>
> Wouldn't it be more reliable to open("/proc/self", O_RDONLY)
> (or /proc/thread-self) in the process you want to monitor, then send
> the resulting file descriptor to the monitoring process with SCM_RIGHTS?
> Then something like this should work for checking whether the process
> is still alive without relying on PIDs at all:
>
> int retval = faccessat(child_proc_self_fd, "stat", F_OK, 0);
> if (retval == 0) {
>   /* process still exists */
> } else if (retval == -1 && errno == ESRCH) {
>   /* process is gone */
> } else {
>   err(1, "unexpected fstatat result");
> }

Yes, but there will be a large number of processes to deal with and only a
few monitoring processes. Every monitored process would have to open
/proc/self and send the fd to all the monitoring processes. In the database
case, there is one fixed monitoring process, but the other monitoring
processes can exit and new ones can be started.
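
For reference, the start-time check mentioned earlier in this reply could look roughly like the sketch below. It is only an illustration, not the database's actual code: the helper names read_start_time() and same_process_alive() are made up here, and it assumes a Linux /proc where starttime is field 22 of the per-process stat file, as documented in proc(5).

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read the start time (field 22 of the per-process
 * stat file under /proc, in clock ticks since boot).
 * Returns 0 on success, -1 on failure.
 */
static int read_start_time(pid_t pid, unsigned long long *start)
{
    char path[64], buf[4096];
    size_t n;
    FILE *f;
    char *p;
    int field;

    snprintf(path, sizeof(path), "/proc/%ld/stat", (long)pid);
    f = fopen(path, "r");
    if (!f)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';

    /* comm (field 2) may contain spaces or ')', so start after the last ')'. */
    p = strrchr(buf, ')');
    if (!p)
        return -1;
    p++;

    /* Field 3 follows the ')'. Skip fields 3..21, then parse field 22. */
    for (field = 3; field < 22; field++) {
        p += strspn(p, " ");
        p += strcspn(p, " ");
    }
    return sscanf(p, " %llu", start) == 1 ? 0 : -1;
}

/*
 * Hypothetical helper: returns 1 if pid exists and still has the start
 * time recorded earlier, 0 if it is gone or the pid has been reused.
 */
static int same_process_alive(pid_t pid, unsigned long long expected_start)
{
    unsigned long long now;

    /* EPERM would mean the pid exists but we may not signal it. */
    if (kill(pid, 0) == -1 && errno == ESRCH)
        return 0;
    /* Assumption of this sketch: an unreadable stat file means "gone". */
    if (read_start_time(pid, &now) == -1)
        return 0;
    return now == expected_start;
}
```

The monitor would call read_start_time() once when it first learns a pid and pass the saved value to same_process_alive() on every later check. This narrows the pid reuse window but does not close it: the pid can still be recycled between the check and a subsequent kill().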