Received: by 2002:ad5:474a:0:0:0:0:0 with SMTP id i10csp2139722imu; Sun, 18 Nov 2018 16:55:22 -0800 (PST) X-Google-Smtp-Source: AJdET5d1/F5M/ZLsun0F/ufLi8zdMljGbRooXZ1jfiuTa0LhVb5O5Ce0zAozD0pxmvgHxT0cM5F8 X-Received: by 2002:a63:6704:: with SMTP id b4mr18281502pgc.100.1542588922533; Sun, 18 Nov 2018 16:55:22 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1542588922; cv=none; d=google.com; s=arc-20160816; b=l/bj/CBWdri8TcWxCQiJU3FhG4pjNBEl+enAtTHt7X3HsU7xFakEeYcGwIeRVc8UXh g+mYlZV68BXlQc/za/XzysR3JWlDz//tswZPBs2X8DKBK6nS4JWMF4o/tRhaGuyqbNnP x2gKUsBnTt/U3Owdcex0LHotlYAhslDSbPl8YbFfmZ1hHUbK72mBLBrMHqam9CwiyZJa 17Mg92Lz4wtwazlyJqt3plN5ojk3jWFKGG2cMQ5LnREbzS0YjE9EWUOtahNHfmbcyPvJ /iXsYKZsZWbSq70eWnpLEDZeRdjeOq/CtgvP/f925aPrFoPrD8g5X6iBCLzL02s87uDP 6Ylg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :references:in-reply-to:mime-version:dkim-signature; bh=XMrpTzq27oMBdd9mAMCvpReP8LWXh9Cz6YyI2Pj02kk=; b=ggctthc2UF85RZ1SHI4oluH7rVv2axX+D4+91T+F7MaL30j/QHL+e8SHa+ufMhcoZI tZh2y+LSLamE7TcTad0pTPzFjinFsqOyVylPkqqUnn0NbmDw6gii1kob4fUIxqHas1WR 2gP6mbQD6k9UsT7VfTvCFIz2IQOejVNw9ahWjzYPX2uEvMa0B7cqwT4hlPqO8PEh0qPg 57bdHsrsld8/v+3vXdTHCUGQRFE9tTKhOJM9hzArxWv2USk96Aiy2bQl9GsV3rWgNN25 wSrA7t37WmnB/o790T0LbcS6fUqMl3oosQuOChA8T5mPmB6aWkLDAr4CfEpaZqTvX+UZ m/Rg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=A6l652Cb; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id p14si5459783pgf.52.2018.11.18.16.55.06; Sun, 18 Nov 2018 16:55:22 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=A6l652Cb; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726542AbeKSLPx (ORCPT + 99 others); Mon, 19 Nov 2018 06:15:53 -0500 Received: from mail-ua1-f68.google.com ([209.85.222.68]:34820 "EHLO mail-ua1-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725919AbeKSLPw (ORCPT ); Mon, 19 Nov 2018 06:15:52 -0500 Received: by mail-ua1-f68.google.com with SMTP id d2so6086892ual.2 for ; Sun, 18 Nov 2018 16:53:59 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc; bh=XMrpTzq27oMBdd9mAMCvpReP8LWXh9Cz6YyI2Pj02kk=; b=A6l652CbOFbkGkQdu/XGN4qPXIzXc0rDTk/1K8/LnoElCW6pjCkg3fbyYZOSog/cVL ePsS3Ljo6JA2zLSX2+YvUJSrBnA62wXQia80VqFkGLwRn+93i3tXYoHvWcfrgLFKPu+v lWeEpFqxqKr3cI86m2yfcbK9bIwhp4StzvZd+/l3jU4Na8AVypUcH1vpfJXFWOiVj0om O1HY/T2A4ayTUKfOt8p0wUi5XHNFvRrScerxfHcoxLdUHsmQxqBpasNRDnuQfBHxtxIi C6+6kg+8IxOwxm1sXAsVtlttwYAEwd/8zW32pIGhf58HI/ECQmZ3NwrHom1S5/dd2oXg V7MA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc; bh=XMrpTzq27oMBdd9mAMCvpReP8LWXh9Cz6YyI2Pj02kk=; b=tVpFfmQg/nnNBPwTUsthwAmUAttDkH82kB8BJMkecxqjhdXWHx8D4Vbhm7lI66h5ji E9r6F3cdlawlmhIrBKP80EmQ/WsyWJZQMTe+REdUKOcy+EZgsf722D77uZzpcqDHXFFc QV7jTadP5zH7swNCOAHAxRV6mxgsxXhRiTwqa27wsVXv1dJhA9zPb/AGzY72IEQpcQCZ Kr3zf0oT0bz2GgH2VFSDsFZntQxsKzdNX1AFOjPDMPGSUJESlQ6lD66UK2AwtskscVaQ DBhksrHhMQ1z061PBNHLiRe4RFFy1Y5/fc3tnG3qcyCfaicVVvz1zyOsVkZ5ibGEMDYq N3zA== X-Gm-Message-State: AGRZ1gKXnK98Emno3wRevQMz2jnPeR45wJUE5ckFho8dFwFyL3Oj7NWy TlmXT/0FngLVc6ilN4P2+YilLxj0UeNNeThagauU4Q== X-Received: by 2002:ab0:72a:: with SMTP id h39mr9059753uah.11.1542588838292; Sun, 18 Nov 2018 16:53:58 -0800 (PST) MIME-Version: 1.0 Received: by 2002:a67:f48d:0:0:0:0:0 with HTTP; Sun, 18 Nov 2018 16:53:57 -0800 (PST) In-Reply-To: <20181119000928.h2wp2rn2pnlfysp7@yavin> References: <20181118190504.ixglsqbn6mxkcdzu@yavin> <20181119000928.h2wp2rn2pnlfysp7@yavin> From: Daniel Colascione Date: Sun, 18 Nov 2018 16:53:57 -0800 Message-ID: Subject: Re: [PATCH] proc: allow killing processes via file descriptors To: Aleksa Sarai Cc: Andy Lutomirski , Randy Dunlap , Christian Brauner , "Eric W. Biederman" , LKML , "Serge E. Hallyn" , Jann Horn , Andrew Morton , Oleg Nesterov , Al Viro , Linux FS Devel , Linux API , Tim Murray , Kees Cook , Jan Engelhardt Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun, Nov 18, 2018 at 4:09 PM, Aleksa Sarai wrote: > On 2018-11-18, Daniel Colascione wrote: >> On Sun, Nov 18, 2018 at 11:05 AM, Aleksa Sarai wrote: >> > On 2018-11-18, Daniel Colascione wrote: >> >> > Here's my point: if we're really going to make a new API to manipulate >> >> > processes by their fd, I think we should have at least a decent idea >> >> > of how that API will get extended in the future. Right now, we have >> >> > an extremely awkward situation where opening an fd in /proc requires >> >> > certain capabilities or uids, and using those fds often also checks >> >> > current's capabilities, and the target process may have changed its >> >> > own security context, including gaining privilege via SUID, SGID, or >> >> > LSM transition rules in the mean time. This has been a huge source of >> >> > security bugs. It would be nice to have a model for future APIs that >> >> > avoids these problems. >> >> > >> >> > And I didn't say in my proposal that a process's identity should >> >> > fundamentally change when it calls execve(). I'm suggesting that >> >> > certain operations that could cause a process to gain privilege or >> >> > otherwise require greater permission to introspect (mainly execve) >> >> > could be handled by invalidating the new process management fds. >> >> > Sure, if init re-execs itself, it's still PID 1, but that doesn't >> >> > necessarily mean that: >> >> > >> >> > fd = process_open_management_fd(1); >> >> > [init reexecs] >> >> > process_do_something(fd); >> >> > >> >> > needs to work. >> >> >> >> PID 1 is a bad example here, because it doesn't get recycled. Other >> >> PIDs do. The snippet you gave *does* need to work, in general, because >> >> if exec invalidates the handle, and you need to reopen by PID to >> >> re-establish your right to do something with the process, that process >> >> may in fact have died between the invalidation and your reopen, and >> >> your reopened FD may refer to some other random process. >> > >> > I imagine the error would be -EPERM rather than -ESRCH in this case, >> > which would be incredibly trivial for userspace to differentiate >> > between. >> >> Why would userspace necessarily see EPERM? The PID might get recycled >> into a different random process that the caller has the ability to >> affect. > > I'm not sure what you're talking about. execve() doesn't change the PID > of a process, and in the case we are talking about: > > pidX_handle = open_pid_handle(pidX); > [ pidX execs a setuid binary ] > do_something(pidX_handle); > > pidX still has the same PID (so PID recycling is irrelevant in this > case). The key point is whether do_something() should give you an error > in such a state transition, and in that case I would say you'd get > -EPERM which would indicate (obviously) insufficient privileges. EPERM is the wrong error. All that's happened here is that the process has execed itself; you may still have permission to operate on the post-execve process. ESTALE is the right error here. But yes, there is a PID trace. What do you do after getting ESTALE? You reopen the handle and retry your operation. How do you open a new handle? Unless you're using some awful /proc/self/fd/... hack, you reopen by PID. And at that point, you've introduced a PID race again. That's why, in my sketch below, I imagined creating the capability handle from the process-identity handle and not, as in the snippet above, directly from the PID. >> Anyway: what other API requires, for correct operation, occasional >> reopening through /proc/self/fd? It's cumbersome, and it doesn't add >> anything. If we invalidate process handles on execve, and processes >> are legally allowed to re-exec themselves for arbitrary reasons at any >> time, that's tantamount to saying that handles might become invalid at >> any time and that all callers must be prepared to go through the >> reopen-and-retry path before any operation. > > O_PATH. In container runtimes this is necessary for several reasons to > protect against malicious container root filesystems as well as avoiding > exposing a dirfd to the container. > > In LXC, O_PATH re-opening is used for /dev/ptmx as well as some other > operations. In runc we use it for FIFO re-opening so that we can signal > pid1 in a container to execve() into user code. > > So this isn't a new thing. Yuck. I'd still argue that 1) the reopen trick isn't really intended as the mainline path for that kernel functionality, and 2) there ought to be a way to do what you're describing in a cleaner way. I'd classify this approach as a hack. It's one thing to require a hack in specialized container initialization code, but it's another to bake it into a hopefully-common API for something as fundamental as process management, especially when there's a perfectly good alternative that doesn't require this hack. >> Why are we making them do that? So that a process can have an open FD >> that represents a process-operation capability? Which capability does >> the open FD represent? > > The re-opening part was just an argument to show that there isn't a > condition where you wouldn't be able to get access to the 'struct pid'. > I doubt that anyone would actually need to use this -- since you'd need > to pass "/proc/pid/fd/..." to a more privileged process in order to use > the re-opening. > > But this also means that we don't need to have a concept of a pidfd that > isn't actually associated with a PID but is instead associated with > current->mm (which is what you appear to be proposing with the whole > "identity fd" concept). Not current->mm; that can be shared with clone. struct signal is the right long-term identity. It's usually easier to keep the struct pid around though, which is exactly what a procfs FD is today: just a lightweight handle to a struct pid. >> I think when you and Andy must be talking about is an API that looks like this: >> >> int open_process_operation_handle(int procfs_dirfd, int capability_bitmask) >> >> capability_bitmask would have bits like >> >> PROCESS_CAPABILITY_KILL --- send a signal >> PROCESS_CAPABILITY_PTRACE --- attach to a process >> PROCESS_CAPABILITY_READ_EXIT_STATUS --- what it says on the tin >> PROCESS_CAPABILITY_READ_CMDLINE --- etc. >> >> Then you'd have system calls like >> >> int process_kill(int process_capability_fd, int signo, const union sigval data) >> int process_ptrace_attach(int process_capability_fd) >> int process_wait_for_exit(int process_capability_fd, siginfo_t* exit_info) >> >> that worked on these capability bits. If a process execs or does >> something else to change its security capabilities, operations on >> these capability FDs would fail with ESTALE or something and callers >> would have to re-acquire their capabilities. >> >> This approach works fine. It has some nice theoretical properties, and >> could allow for things like nicer privilege separation for debuggers. >> I wouldn't mind something like this getting into the kernel. > > Andy might be arguing for this (and as you said, I can see the benefit > of doing it this way). > > I'm not convinced that doing permission checks on-open is necessary here > -- I get Andy's point about write(2) semantics but I think a new set of > proc_* syscalls wouldn't need to follow those semantics. I might be > wrong though. For now, it's fine to just expose system calls that operate directly on the procfs dfd. >> I just don't think this model is necessary right now. I want a small >> change from what we have today, one likely to actually make it into >> the tree. And bypassing the capability FDs and just allowing callers >> to operate directly on process *identity* FDs, using privileges in >> effect at the time of all, is at least no worse than what we have now. >> >> That is, I'm proposing an API that looks like this: >> >> int process_kill(int procfs_dfd, int signo, const union sigval value) >> >> If, later, process_kill were to *also* accept process-capability FDs, >> nothing would break. > > Again, I think we should agree on whether it's necessary to have both > types of fds before we commit to maintaining both APIs forever... I don't think noting that an API *could* be extended in a certain way in the future creates any obligation to decide, immediately, whether that extension will ever be needed. Right now, I don't see a reason to supply the capability FD API I described. I'm just saying that it could be added in a low-friction way if necessary one day. >> >> > Similarly, it seems like >> >> > it's probably safe to be able to open an fd that lets you watch the >> >> > exit status of a process, have the process call setresuid(), and still >> >> > see the exit status. >> >> >> >> Is it? That's an open question. >> > >> > Well, if we consider wait4(2) it seems that this is already the case. >> > If you fork+exec a setuid binary you can definitely see its exit code. >> >> Only if you're the parent. Otherwise, you can't see the process exit >> status unless you pass a ptrace access check and consult >> /proc/pid/stat after the process dies, but before the zombie >> disappears. Random unrelated and unprivileged processes can't see exit >> statuses from distant parts of the system. > > Sure, I'd propose that ptrace_may_access() is what we should use for > operation permission checks. The tricky part is that ptrace_may_access takes a struct task. We want logic that's *like* ptrace_may_access, but that works posthumously. It's especially tricky because there's an LSM hook that lets __ptrace_may_access do arbitrary things. And we can't just run that hook upon process death, since *after* a process dies, a process holding an exithand FD (or whatever we call it) may pass that FD to another process, and *that* process can read(2) from it. Another option is doing the exithand access check at open time. I think that's probably fine, and it would make things a lot simpler. But if we use this option, we should understand what we're doing, and get some security-conscious people to think through the implications.