From: Jann Horn
Date: Mon, 25 Mar 2019 21:13:50 +0100
Subject: Re: pidfd design
To: Andy Lutomirski, Christian Brauner
Cc: Daniel Colascione, Joel Fernandes, Suren Baghdasaryan, Steven Rostedt,
    Sultan Alsawaf, Tim Murray, Michal Hocko, Greg Kroah-Hartman,
    Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Ingo Molnar, Peter Zijlstra,
    LKML, "open list:ANDROID DRIVERS", kernel-team, Oleg Nesterov,
    "Serge E. Hallyn", Kees Cook, Jonathan Kowalski, Linux API
References: <20190319231020.tdcttojlbmx57gke@brauner.io>
    <20190320015249.GC129907@google.com>
    <20190320035953.mnhax3vd47ya4zzm@brauner.io>
    <4A06C5BB-9171-4E70-BE31-9574B4083A9F@joelfernandes.org>
    <20190320182649.spryp5uaeiaxijum@brauner.io>
    <20190320185156.7bq775vvtsxqlzfn@brauner.io>
    <20190320191412.5ykyast3rgotz3nu@brauner.io>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Mar 25, 2019 at 8:44 PM Andy Lutomirski wrote:
> On Wed, Mar 20, 2019 at 12:40 PM Daniel Colascione wrote:
> > On Wed, Mar 20, 2019 at 12:14 PM Christian Brauner wrote:
> > > On Wed, Mar 20, 2019 at 11:58:57AM -0700, Andy Lutomirski wrote:
> > > > On Wed, Mar 20, 2019 at 11:52 AM Christian Brauner wrote:
> > > > >
> > > > > You're misunderstanding.
> > > > > Again, I said in my previous mails it should
> > > > > accept pidfds optionally as arguments, yes. But I don't want it to
> > > > > return the status fds that you previously wanted pidfd_wait() to return.
> > > > > I really want to see Joel's pidfd_wait() patchset and have more people
> > > > > review the actual code.
> > > >
> > > > Just to make sure that no one is forgetting a material security consideration:
> > >
> > > Andy, thanks for commenting!
> > >
> > > >
> > > > $ ls /proc/self
> > > > attr             exe        mountinfo      projid_map    status
> > > > autogroup        fd         mounts         root          syscall
> > > > auxv             fdinfo     mountstats     sched         task
> > > > cgroup           gid_map    net            schedstat     timers
> > > > clear_refs       io         ns             sessionid     timerslack_ns
> > > > cmdline          latency    numa_maps      setgroups     uid_map
> > > > comm             limits     oom_adj        smaps         wchan
> > > > coredump_filter  loginuid   oom_score      smaps_rollup
> > > > cpuset           map_files  oom_score_adj  stack
> > > > cwd              maps       pagemap        stat
> > > > environ          mem        personality    statm
> > > >
> > > > A bunch of this stuff makes sense to make accessible through a syscall
> > > > interface that we expect to be used even in sandboxes. But a bunch of
> > > > it does not. For example, *_map, mounts, mountstats, and net are all
> > > > namespace-wide things that certain policies expect to be unavailable.
> > > > stack, for example, is a potential attack surface. Etc.
> >
> > If you can access these files sources via open(2) on /proc/<pid>, you
> > should be able to access them via a pidfd. If you can't, you
> > shouldn't. Which /proc? The one you'd get by mounting procfs. I don't
> > see how pidfd makes any material changes to anyone's security. As far
> > as I'm concerned, if a sandbox can't mount /proc at all, it's just a
> > broken and unsupported configuration.
>
> It's not "broken and unsupported". I know of an actual working,
> deployed container-ish sandbox that does exactly this. I would also
> guess that quite a few not-at-all-container-like sandboxes work like
> this.
> (The obvious seccomp + unshare + pivot_root
> deny-myself-access-to-lots-of-things trick results in no /proc, which
> is by design.)
>
> >
> > An actual threat model and real thought paid to access capabilities
> > would help. Almost everything around the interaction of Linux kernel
> > namespaces and security feels like a jumble of ad-hoc patches added as
> > afterthoughts in response to random objections.
>
> I fully agree. But if you start thinking for real about access
> capabilities, there's no way that you're going to conclude that a
> capability to access some process implies a capability to access the
> settings of its network namespace.
>
> >
> > > > All these new APIs either need to
> > > > return something more restrictive than a proc dirfd or they need to
> > > > follow the same rules.
> > > > ...
>
> > What's special about libraries? How is a library any worse-off using
> > openat(2) on a pidfd than it would be just opening the file called
> > "/proc/$apid"?
>
> Because most libraries actually work, right now, without /proc. Even
> libraries that spawn subprocesses. If we make the new API have the
> property that it doesn't work if you're in a non-root user namespace
> and /proc isn't mounted, the result will be an utter mess.
>
> >
> > > > Yes, this is unfortunate, but it is indeed the current situation. I
> > > > suppose that we could return magic restricted dirfds, or we could
> > > > return things that aren't dirfds at all and have some API that gives
> > > > you the dirfd associated with a procfd but only if you can see
> > > > /proc/PID.
> > >
> > > What would be your opinion to having a
> > > /proc/<pid>/handle
> > > file instead of having a dirfd. Essentially, what I initially proposed
> > > at LPC.
> > > The change on what we currently have in master would be:
> > > https://gist.github.com/brauner/59eec91550c5624c9999eaebd95a70df
> >
> > And how do you propose, given one of these handle objects, getting a
> > process's current priority, or its current oom score, or its list of
> > memory maps? As I mentioned in my original email, and which nobody has
> > addressed, if you don't use a dirfd as your process handle or you
> > don't provide an easy way to get one of these proc directory FDs, you
> > need to duplicate a lot of metadata access interfaces.
>
> An API that takes a process handle object and an fd pointing at /proc
> (the root of the proc fs) and gives you back a proc dirfd would do the
> trick. You could do this with no new kernel features at all if you're
> willing to read the pid, call openat(2), and handle the races in user
> code.

This seems like something that might be a good fit for two ioctls?

One ioctl on procfs roots to translate pidfds into that procfs,
subject to both the normal lookup permission checks and only working
if the pidfd has a translation into the procfs:

    int proc_root_fd = open("/proc", O_RDONLY);
    int proc_dir_fd = ioctl(proc_root_fd, PROC_PIDFD_TO_PROCFSFD, pidfd);

And one ioctl on procfs directories to translate from PGIDs and PIDs to pidfds:

    int proc_pgid_fd = open("/proc/self", O_RDONLY);
    int self_pg_pidfd = ioctl(proc_pgid_fd, PROC_PROCFSFD_TO_PIDFD, 0);
    int proc_pid_fd = open("/proc/thread-self", O_RDONLY);
    int self_p_pidfd = ioctl(proc_pid_fd, PROC_PROCFSFD_TO_PIDFD, 0);

And then, as you proposed, the new sys_clone() can just return a
pidfd, and you can convert it into a procfs fd yourself if you want.
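
For comparison, the userspace-only route Andy mentions in the quoted
text (read the pid, call openat(2) against an fd for the procfs root)
might look roughly like the sketch below. It uses only existing
syscalls; open_proc_dir_by_pid() is a made-up helper name, not an
existing or proposed API, and the sketch deliberately leaves the race
handling to the caller.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): open /proc/<pid> relative to
 * an already-open procfs root fd. Returns a directory fd, or -1 on error
 * with errno set by openat().
 *
 * This is racy by construction: the PID may be recycled between the
 * moment the caller learned it and the openat() below, so the returned
 * fd can refer to an unrelated process. The caller has to revalidate.
 */
int open_proc_dir_by_pid(int proc_root_fd, pid_t pid)
{
    char name[32];
    int len = snprintf(name, sizeof(name), "%ld", (long)pid);

    if (len < 0 || (size_t)len >= sizeof(name))
        return -1;

    /* Normal procfs lookup permission checks apply here. */
    return openat(proc_root_fd, name, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
}
```

A caller would open("/proc", O_RDONLY | O_DIRECTORY) once, pass that fd
together with a pid, and then use openat(2) on the result to reach files
such as "status". An ioctl that takes a pidfd instead of a numeric pid
would avoid the window in which the pid can be reused.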