From: Andy Lutomirski
Date: Tue, 17 Oct 2017 15:40:17 -0700
Subject: Re: [PATCH v4] pidns: introduce syscall translate_pid
To: prakash sangappa
Cc: Andy Lutomirski, Nagarathnam Muthusamy, Andrew Morton, Konstantin Khlebnikov, Oleg Nesterov, Linux API, "linux-kernel@vger.kernel.org", Serge Hallyn, "Eric W. Biederman", Eugene Syromiatnikov
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Oct 17, 2017 at 3:35 PM, prakash sangappa wrote:
>
> On 10/17/2017 3:02 PM, Andy Lutomirski wrote:
>>
>> On Tue, Oct 17, 2017 at 8:38 AM, Prakash Sangappa wrote:
>>>
>>>
>>> On 10/16/17 5:52 PM, Andy Lutomirski wrote:
>>>>
>>>> On Mon, Oct 16, 2017 at 3:54 PM, prakash.sangappa wrote:
>>>>>
>>>>>
>>>>> On 10/16/2017 03:07 PM, Nagarathnam Muthusamy wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 10/16/2017 02:36 PM, Andrew Morton wrote:
>>>>>>>
>>>>>>> On Sat, 14 Oct 2017 11:17:47 +0300 Konstantin Khlebnikov wrote:
>>>>>>>
>>>>>>>>>>> pid_t translate_pid(pid_t pid, int source, int target);
>>>>>>>>>>>
>>>>>>>>>>> This syscall converts pid from source pid-ns into pid in target
>>>>>>>>>>> pid-ns. If pid is unreachable from target pid-ns it returns zero.
>>>>>>>>>>>
>>>>>>>>>>> Pid-namespaces are referred file descriptors opened to proc files
>>>>>>>>>>> /proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. Negative
>>>>>>>>>>> argument refers to current pid namespace, same as file
>>>>>>>>>>> /proc/self/ns/pid.
>>>>>>>>>>>
>>>>>>>>>>> Kernel expose virtual pids in /proc/[pid]/status:NSpid, but
>>>>>>>>>>> backward translation requires scanning all tasks. Also pids could
>>>>>>>>>>> be translated by sending them through unix socket between
>>>>>>>>>>> namespaces, this method is slow and insecure because other side is
>>>>>>>>>>> exposed inside pid namespace.
>>>>>>>>
>>>>>>>> Andrew asked why we might need this.
>>>>>>>>
>>>>>>>> Such conversion is required for interaction between processes across
>>>>>>>> pid-namespaces.
>>>>>>>> For example to identify process in container by pid file looking from
>>>>>>>> outside.
>>>>>>>>
>>>>>>>> Two years ago I've solved this in project of mine with monstrous code
>>>>>>>> which forks couple times just to convert pid, lucky for me
>>>>>>>> performance wasn't important.
>>>>>>>
>>>>>>> That's a single user who needed this a single time, and found a
>>>>>>> userspace-based solution anyway. This is not exactly compelling!
>>>>>>>
>>>>>>> Is there a stronger case to be made? How does this change benefit
>>>>>>> our users? Sell it to us!
>>>>>>
>>>>>> Oracle database is planning to use pid namespace for sandboxing
>>>>>> database instances and they need an API similar to translate_pid to
>>>>>> effectively translate process IDs from other pid namespaces. Prakash
>>>>>> (cced in mail) can provide more details on this usecase.
>>>>>
>>>>>
>>>>> As Nagarathnam indicated, Oracle Database will be using pid namespaces
>>>>> and needs a direct method of converting pids of processes in the pid
>>>>> namespace hierarchy. In this use case multiple nested PID namespaces
>>>>> will be used. The currently available mechanism are not very efficient
>>>>> for this use case. For ex. as Konstantin described, using
>>>>> /proc//status would require the application to scan all the pid's
>>>>> status files to determine the pid of given process in a child
>>>>> namespace.
>>>>>
>>>>> Use of SCM_CREDENTIALS's socket message is another way, which would
>>>>> require every process starting inside a pid namespace to send this
>>>>> message and the receiving process in the target namespace would have
>>>>> to save the converted pid and reference it. This mechanism becomes
>>>>> cumbersome especially if the application has to deal with multiple
>>>>> nested pid namespaces. Also, the Database needs to be able to convert
>>>>> a thread's global pid(gettid()). Passing the thread's pid(gettid()) in
>>>>> SCM_CREDENTIALS message requires CAP_SYS_ADMIN, which is an issue.
>>>>>
>>>>> So having a direct method, like the API that Konstantin is proposing,
>>>>> will work best for the Database since pid of a process in any of the
>>>>> nested pid namespaces can be converted as and when required. I think
>>>>> with the proposed API, the application should be able to convert pid
>>>>> of a process or tid(gettid()) of a thread as well.
>>>>>
>>>> Can you explain what Oracle's database is planning to do with this
>>>> information?
>>>
>>>
>>> Database uses the PID to programmatically find out if the process/thread
>>> is alive(kill 0) also send signals to the processes requesting it to dump
>>> status/debug information and kill the processes in case of a shutdown
>>> abort of the instance.
>>
>> What I'm wondering is: how does the caller of kill() end up
>> controlling a task whose pid it doesn't know in its own namespace?
>
>
> I was generally describing how DB would use the PID of process. The above
> description was in the case when no namespaces are used.
>
> With use of namespaces, the DB would convert the PID of processes inside
> its children namespaces to PID in its namespace and use that pid to issue
> kill().

Seems vaguely sensible. If I were designing this type of system, I'd
have a manager process in each namespace running as PID 1, though --
PID 1 is special and needs to understand what's going on anyway. Then
PID 1 would do the kill() calls and wouldn't need translate_pid().

>
> -Prakash.
>
>>
>>> -Prakash.
>>>
>>>
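
For reference, the /proc/[pid]/status:NSpid route mentioned in the quoted
patch description could look roughly like the C sketch below. It is only an
illustration of why backward translation means scanning every task, not code
from the patch or from anyone in this thread. The helper names
parse_nspid_line(), read_nspid() and find_outer_pid() and the MAX_DEPTH
constant are made up for the example. It assumes a kernel new enough to
export the NSpid field (Linux 4.1 or later) and a /proc mounted in the
caller's pid namespace. It also does not check which child namespace a match
lives in, so an inner pid that exists in two sibling namespaces at the same
depth would be ambiguous.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

#define MAX_DEPTH 32	/* hypothetical cap on namespace levels we record */

/*
 * Parse one "NSpid:" line from /proc/[pid]/status.  The line lists the
 * task's pid in each pid namespace it belongs to: the namespace of the
 * procfs mount first, the innermost namespace last.
 * Returns the number of values stored in out[], 0 if this is another line.
 */
int parse_nspid_line(const char *line, pid_t *out, int max)
{
	const char *p;
	int n = 0;

	if (strncmp(line, "NSpid:", 6) != 0)
		return 0;
	p = line + 6;
	while (n < max) {
		char *end;
		long v = strtol(p, &end, 10);	/* skips tabs and spaces */

		if (end == p)			/* no more numbers */
			break;
		out[n++] = (pid_t)v;
		p = end;
	}
	return n;
}

/* Read the NSpid values of one task; returns the count, or -1 on error. */
int read_nspid(pid_t pid, pid_t *out, int max)
{
	char path[64], line[512];
	FILE *f;
	int n = 0;

	snprintf(path, sizeof(path), "/proc/%ld/status", (long)pid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	while (n == 0 && fgets(line, sizeof(line), f))
		n = parse_nspid_line(line, out, max);
	fclose(f);
	return n;
}

/*
 * Backward translation by brute force: walk every numeric entry in /proc
 * and look for a task whose pid at nesting level `depth` (0 = our own
 * namespace) equals `inner`.  Returns that task's pid as seen from our
 * namespace, or 0 if nothing matched.  Only thread-group leaders show up
 * as /proc/[pid] directory entries, so other threads are not covered.
 */
pid_t find_outer_pid(pid_t inner, int depth)
{
	DIR *d = opendir("/proc");
	struct dirent *e;
	pid_t found = 0;

	if (!d)
		return 0;
	while (!found && (e = readdir(d)) != NULL) {
		pid_t ns[MAX_DEPTH];
		int n;

		if (!isdigit((unsigned char)e->d_name[0]))
			continue;
		n = read_nspid((pid_t)atol(e->d_name), ns, MAX_DEPTH);
		if (n > depth && ns[depth] == inner)
			found = ns[0];
	}
	closedir(d);
	return found;
}
```

A caller one level above a container would use find_outer_pid(inner, 1) to
get a pid it can pass to kill(). The cost is one open and read of a status
file per task on the system for every lookup, which is the inefficiency the
quoted messages complain about.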