From: Daniel Colascione
Date: Mon, 25 Mar 2019 13:23:21 -0700
Subject: Re: pidfd design
To: Jann Horn
Cc: Andy Lutomirski, Christian Brauner, Joel Fernandes, Suren Baghdasaryan,
 Steven Rostedt, Sultan Alsawaf, Tim Murray, Michal Hocko,
 Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos, Martijn Coenen,
 Ingo Molnar, Peter Zijlstra, LKML, "open list:ANDROID DRIVERS",
 kernel-team, Oleg Nesterov, "Serge E. Hallyn", Kees Cook,
 Jonathan Kowalski, Linux API
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Mar 25, 2019 at 1:14 PM Jann Horn wrote:
>
> On Mon, Mar 25, 2019 at 8:44 PM Andy Lutomirski wrote:
> > On Wed, Mar 20, 2019 at 12:40 PM Daniel Colascione wrote:
> > > On Wed, Mar 20, 2019 at 12:14 PM Christian Brauner wrote:
> > > > On Wed, Mar 20, 2019 at 11:58:57AM -0700, Andy Lutomirski wrote:
> > > > > On Wed, Mar 20, 2019 at 11:52 AM Christian Brauner wrote:
> > > > > >
> > > > > > You're misunderstanding.
> > > > > > Again, I said in my previous mails it should accept pidfds
> > > > > > optionally as arguments, yes. But I don't want it to return the
> > > > > > status fds that you previously wanted pidfd_wait() to return. I
> > > > > > really want to see Joel's pidfd_wait() patchset and have more
> > > > > > people review the actual code.
> > > > >
> > > > > Just to make sure that no one is forgetting a material security
> > > > > consideration:
> > > >
> > > > Andy, thanks for commenting!
> > > >
> > > > > $ ls /proc/self
> > > > > attr             exe        mountinfo      projid_map    status
> > > > > autogroup        fd         mounts         root          syscall
> > > > > auxv             fdinfo     mountstats     sched         task
> > > > > cgroup           gid_map    net            schedstat     timers
> > > > > clear_refs       io         ns             sessionid     timerslack_ns
> > > > > cmdline          latency    numa_maps      setgroups     uid_map
> > > > > comm             limits     oom_adj        smaps         wchan
> > > > > coredump_filter  loginuid   oom_score      smaps_rollup
> > > > > cpuset           map_files  oom_score_adj  stack
> > > > > cwd              maps       pagemap        stat
> > > > > environ          mem        personality    statm
> > > > >
> > > > > A bunch of this stuff makes sense to make accessible through a
> > > > > syscall interface that we expect to be used even in sandboxes. But
> > > > > a bunch of it does not. For example, *_map, mounts, mountstats, and
> > > > > net are all namespace-wide things that certain policies expect to
> > > > > be unavailable. stack, for example, is a potential attack surface.
> > > > > Etc.
> > >
> > > If you can access these files via open(2) on /proc/<pid>, you should
> > > be able to access them via a pidfd. If you can't, you shouldn't.
> > > Which /proc? The one you'd get by mounting procfs. I don't see how
> > > pidfd makes any material changes to anyone's security. As far as I'm
> > > concerned, if a sandbox can't mount /proc at all, it's just a broken
> > > and unsupported configuration.
> >
> > It's not "broken and unsupported". I know of an actual working,
> > deployed container-ish sandbox that does exactly this.
> > I would also guess that quite a few not-at-all-container-like
> > sandboxes work like this. (The obvious seccomp + unshare + pivot_root
> > deny-myself-access-to-lots-of-things trick results in no /proc, which
> > is by design.)
> >
> > > An actual threat model and real thought paid to access capabilities
> > > would help. Almost everything around the interaction of Linux kernel
> > > namespaces and security feels like a jumble of ad-hoc patches added
> > > as afterthoughts in response to random objections.
> >
> > I fully agree. But if you start thinking for real about access
> > capabilities, there's no way that you're going to conclude that a
> > capability to access some process implies a capability to access the
> > settings of its network namespace.
> >
> > > > > All these new APIs either need to return something more
> > > > > restrictive than a proc dirfd or they need to follow the same
> > > > > rules.
> > >
> > > ...
> > >
> > > What's special about libraries? How is a library any worse-off using
> > > openat(2) on a pidfd than it would be just opening the file called
> > > "/proc/$apid"?
> >
> > Because most libraries actually work, right now, without /proc. Even
> > libraries that spawn subprocesses. If we make the new API have the
> > property that it doesn't work if you're in a non-root user namespace
> > and /proc isn't mounted, the result will be an utter mess.
> >
> > > > > Yes, this is unfortunate, but it is indeed the current situation.
> > > > > I suppose that we could return magic restricted dirfds, or we
> > > > > could return things that aren't dirfds at all and have some API
> > > > > that gives you the dirfd associated with a procfd but only if you
> > > > > can see /proc/PID.
> > > >
> > > > What would be your opinion to having a /proc/<pid>/handle file
> > > > instead of having a dirfd. Essentially, what I initially proposed
> > > > at LPC.
> > > > The change on what we currently have in master would be:
> > > > https://gist.github.com/brauner/59eec91550c5624c9999eaebd95a70df
> > >
> > > And how do you propose, given one of these handle objects, getting a
> > > process's current priority, or its current oom score, or its list of
> > > memory maps? As I mentioned in my original email, and which nobody
> > > has addressed, if you don't use a dirfd as your process handle or
> > > you don't provide an easy way to get one of these proc directory
> > > FDs, you need to duplicate a lot of metadata access interfaces.
> >
> > An API that takes a process handle object and an fd pointing at /proc
> > (the root of the proc fs) and gives you back a proc dirfd would do
> > the trick. You could do this with no new kernel features at all if
> > you're willing to read the pid, call openat(2), and handle the races
> > in user code.
>
> This seems like something that might be a good fit for two ioctls?

As an aside, we had a long discussion about why fundamental facilities
like this should be system calls, not ioctls. I think the arguments
still apply.

> One ioctl on procfs roots to translate pidfds into that procfs,
> subject to both the normal lookup permission checks and only working
> if the pidfd has a translation into the procfs:
>
> int proc_root_fd = open("/proc", O_RDONLY);
> int proc_dir_fd = ioctl(proc_root_fd, PROC_PIDFD_TO_PROCFSFD, pidfd);
>
> And one ioctl on procfs directories to translate from PGIDs and PIDs
> to pidfds:
>
> int proc_pgid_fd = open("/proc/self", O_RDONLY);
> int self_pg_pidfd = ioctl(proc_pgid_fd, PROC_PROCFSFD_TO_PIDFD, 0);
> int proc_pid_fd = open("/proc/thread-self", O_RDONLY);
> int self_p_pidfd = ioctl(proc_pid_fd, PROC_PROCFSFD_TO_PIDFD, 0);
>
> And then, as you proposed, the new sys_clone() can just return a
> pidfd, and you can convert it into a procfs fd yourself if you want.

I think that's the consensus we reached on the other thread.
The O_DIRECTORY open on /proc/self/fd/mypidfd seems like it'd work well enough.