Received: by 2002:ad5:474a:0:0:0:0:0 with SMTP id i10csp2112093imu; Sun, 18 Nov 2018 16:12:47 -0800 (PST) X-Google-Smtp-Source: AJdET5er28gbuAdaryXQjFeZt82RsclbSu/3d0N6B8GPwPTlCmwlM+VCKj1Mc3aOiOTltQ+mVb7D X-Received: by 2002:a62:f247:: with SMTP id y7mr11080817pfl.25.1542586367460; Sun, 18 Nov 2018 16:12:47 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1542586367; cv=none; d=google.com; s=arc-20160816; b=tbh1ADYfTk4OGgHzdM0ko9SW4Ms+ih/SiMKOP2I3r3IlCF4UZfdY2EdXSsdsCJXZcI 4KLM8jEJf2TjB/cWIc96enPZB8yo56QK8rQuGUYp0TPX/wbnoirZR5fNcxBBkbW6NaSD 0p2wsAJTVKZFKfP2XRhNjAcSffSvDluXvF+b8Jl/usaxtocmmKFWODV3KF1Wtydbb1lI pt3wwzbaqL51X+RiN+ArrADtUSlg52+idYzcpyrdCwmA+7dxO4K9bp7XgFADX6voWszM vejn3dg8JrigCr7udROfllwy9hyfE2gb6jstEzJ7a9JoNmBXj6oC7+D/tPgnD5pXEmek ZSdg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:in-reply-to:content-disposition :mime-version:references:message-id:subject:cc:to:from:date; bh=RgD9Zx7MvFNN/eQZ5kwSWRWl6nSyVCexK8gdtpzu298=; b=fHhItB6k7ojojpG5jU5Hmir6EMNXlg9gOwETxDUvh9FxTUe5vbyCwx3lagAF6FXHMt 1mWfPh+W6XjCEmAdahh6qddocdbp6aH916LzZVs91SEn47KSaJBTjjfR5tIvW96viPcF 9mEIFf2h4Iv17HdZ5e4zvYKMRQeO3Mnk+81gyWAN0D9+gZySeSAQMXIa7yQ/Xs8e3Ww0 NGU+Fv3WY0dGy6TFHUE8xjwgEt5h/Sp7jitk42bOoyY9bY/h2fUvxW7zrIIRPMU3UTGw qchYFIfmwKofZ3X+macHmNHxVXza6mRBzZdt6yi16jsmnJdZzvZTNkysFKBuf5+qMLP9 E7Jw== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id t127si11093223pfd.21.2018.11.18.16.12.32; Sun, 18 Nov 2018 16:12:47 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727951AbeKSKba (ORCPT + 99 others); Mon, 19 Nov 2018 05:31:30 -0500 Received: from mx2.mailbox.org ([80.241.60.215]:19900 "EHLO mx2.mailbox.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727068AbeKSKba (ORCPT ); Mon, 19 Nov 2018 05:31:30 -0500 Received: from smtp1.mailbox.org (smtp1.mailbox.org [80.241.60.240]) (using TLSv1.2 with cipher ECDHE-RSA-CHACHA20-POLY1305 (256/256 bits)) (No client certificate requested) by mx2.mailbox.org (Postfix) with ESMTPS id 390E5A106F; Mon, 19 Nov 2018 01:09:41 +0100 (CET) X-Virus-Scanned: amavisd-new at heinlein-support.de Received: from smtp1.mailbox.org ([80.241.60.240]) by hefe.heinlein-support.de (hefe.heinlein-support.de [91.198.250.172]) (amavisd-new, port 10030) with ESMTP id IR3ae9qkkTNp; Mon, 19 Nov 2018 01:09:39 +0100 (CET) Date: Mon, 19 Nov 2018 11:09:28 +1100 From: Aleksa Sarai To: Daniel Colascione Cc: Andy Lutomirski , Randy Dunlap , Christian Brauner , "Eric W. Biederman" , LKML , "Serge E. Hallyn" , Jann Horn , Andrew Morton , Oleg Nesterov , Al Viro , Linux FS Devel , Linux API , Tim Murray , Kees Cook , Jan Engelhardt Subject: Re: [PATCH] proc: allow killing processes via file descriptors Message-ID: <20181119000928.h2wp2rn2pnlfysp7@yavin> References: <20181118190504.ixglsqbn6mxkcdzu@yavin> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="xnndpxdsburq7xhv" Content-Disposition: inline In-Reply-To: Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org --xnndpxdsburq7xhv Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On 2018-11-18, Daniel Colascione wrote: > On Sun, Nov 18, 2018 at 11:05 AM, Aleksa Sarai wrote: > > On 2018-11-18, Daniel Colascione wrote: > >> > Here's my point: if we're really going to make a new API to manipula= te > >> > processes by their fd, I think we should have at least a decent idea > >> > of how that API will get extended in the future. Right now, we have > >> > an extremely awkward situation where opening an fd in /proc requires > >> > certain capabilities or uids, and using those fds often also checks > >> > current's capabilities, and the target process may have changed its > >> > own security context, including gaining privilege via SUID, SGID, or > >> > LSM transition rules in the mean time. This has been a huge source = of > >> > security bugs. It would be nice to have a model for future APIs that > >> > avoids these problems. > >> > > >> > And I didn't say in my proposal that a process's identity should > >> > fundamentally change when it calls execve(). I'm suggesting that > >> > certain operations that could cause a process to gain privilege or > >> > otherwise require greater permission to introspect (mainly execve) > >> > could be handled by invalidating the new process management fds. > >> > Sure, if init re-execs itself, it's still PID 1, but that doesn't > >> > necessarily mean that: > >> > > >> > fd =3D process_open_management_fd(1); > >> > [init reexecs] > >> > process_do_something(fd); > >> > > >> > needs to work. > >> > >> PID 1 is a bad example here, because it doesn't get recycled. Other > >> PIDs do. The snippet you gave *does* need to work, in general, because > >> if exec invalidates the handle, and you need to reopen by PID to > >> re-establish your right to do something with the process, that process > >> may in fact have died between the invalidation and your reopen, and > >> your reopened FD may refer to some other random process. > > > > I imagine the error would be -EPERM rather than -ESRCH in this case, > > which would be incredibly trivial for userspace to differentiate > > between. >=20 > Why would userspace necessarily see EPERM? The PID might get recycled > into a different random process that the caller has the ability to > affect. I'm not sure what you're talking about. execve() doesn't change the PID of a process, and in the case we are talking about: pidX_handle =3D open_pid_handle(pidX); [ pidX execs a setuid binary ] do_something(pidX_handle); pidX still has the same PID (so PID recycling is irrelevant in this case). The key point is whether do_something() should give you an error in such a state transition, and in that case I would say you'd get -EPERM which would indicate (obviously) insufficient privileges. If the PID has died you'd get -ESRCH. Even if it was eventually recycled. Because you've pinned a 'struct pid'. > > If you wish to re-open the path that is also trivial by > > re-opening through /proc/self/fd/$fd -- which will re-do any permission > > checks and will guarantee that you are re-opening the same 'struct file' > > and thus the same 'struct pid'. >=20 > When you reopen via /proc/self/fd, you get a new struct file > referencing the existing inode, not the same struct file. A new > reference to the old struct file would just be dup. I don't think this is really relevant to what I'm trying to say... > Anyway: what other API requires, for correct operation, occasional > reopening through /proc/self/fd? It's cumbersome, and it doesn't add > anything. If we invalidate process handles on execve, and processes > are legally allowed to re-exec themselves for arbitrary reasons at any > time, that's tantamount to saying that handles might become invalid at > any time and that all callers must be prepared to go through the > reopen-and-retry path before any operation. O_PATH. In container runtimes this is necessary for several reasons to protect against malicious container root filesystems as well as avoiding exposing a dirfd to the container. In LXC, O_PATH re-opening is used for /dev/ptmx as well as some other operations. In runc we use it for FIFO re-opening so that we can signal pid1 in a container to execve() into user code. So this isn't a new thing. > Why are we making them do that? So that a process can have an open FD > that represents a process-operation capability? Which capability does > the open FD represent? The re-opening part was just an argument to show that there isn't a condition where you wouldn't be able to get access to the 'struct pid'. I doubt that anyone would actually need to use this -- since you'd need to pass "/proc/pid/fd/..." to a more privileged process in order to use the re-opening. But this also means that we don't need to have a concept of a pidfd that isn't actually associated with a PID but is instead associated with current->mm (which is what you appear to be proposing with the whole "identity fd" concept). > I think when you and Andy must be talking about is an API that looks like= this: >=20 > int open_process_operation_handle(int procfs_dirfd, int capability_bitmas= k) >=20 > capability_bitmask would have bits like >=20 > PROCESS_CAPABILITY_KILL --- send a signal > PROCESS_CAPABILITY_PTRACE --- attach to a process > PROCESS_CAPABILITY_READ_EXIT_STATUS --- what it says on the tin > PROCESS_CAPABILITY_READ_CMDLINE --- etc. >=20 > Then you'd have system calls like >=20 > int process_kill(int process_capability_fd, int signo, const union sigval= data) > int process_ptrace_attach(int process_capability_fd) > int process_wait_for_exit(int process_capability_fd, siginfo_t* exit_info) >=20 > that worked on these capability bits. If a process execs or does > something else to change its security capabilities, operations on > these capability FDs would fail with ESTALE or something and callers > would have to re-acquire their capabilities. >=20 > This approach works fine. It has some nice theoretical properties, and > could allow for things like nicer privilege separation for debuggers. > I wouldn't mind something like this getting into the kernel. Andy might be arguing for this (and as you said, I can see the benefit of doing it this way). I'm not convinced that doing permission checks on-open is necessary here -- I get Andy's point about write(2) semantics but I think a new set of proc_* syscalls wouldn't need to follow those semantics. I might be wrong though. > I just don't think this model is necessary right now. I want a small > change from what we have today, one likely to actually make it into > the tree. And bypassing the capability FDs and just allowing callers > to operate directly on process *identity* FDs, using privileges in > effect at the time of all, is at least no worse than what we have now. >=20 > That is, I'm proposing an API that looks like this: >=20 > int process_kill(int procfs_dfd, int signo, const union sigval value) >=20 > If, later, process_kill were to *also* accept process-capability FDs, > nothing would break. Again, I think we should agree on whether it's necessary to have both types of fds before we commit to maintaining both APIs forever... > >> The only way around this problem is to have two separate FDs --- one > >> to represent process identity, which *must* be continuous across > >> execve, and the other to represent some specific capability, some > >> ability to do something to that process. It's reasonable to invalidate > >> capability after execve, but it's not reasonable to invalidate > >> identity. In concrete terms, I don't see a big advantage to this > >> separation, and I think a single identity FD combined with > >> per-operation capability checks is sufficient. And much simpler. > > > > I think that the error separation above would trivially allow user-space > > to know whether the identity or capability of a process being monitored > > has changed. > > > > Currently, all operations on a '/proc/$pid' which you've previously > > opened and has died will give you -ESRCH. >=20 > Not the case. Zombies have died, but profs operations work fine on zombie= s. It is the case if the process is dead in the sense that the PID might be re-used. That is what I meant be "dead" here, not semi-dead in the sense that zombies are. > >> > Similarly, it seems like > >> > it's probably safe to be able to open an fd that lets you watch the > >> > exit status of a process, have the process call setresuid(), and sti= ll > >> > see the exit status. > >> > >> Is it? That's an open question. > > > > Well, if we consider wait4(2) it seems that this is already the case. > > If you fork+exec a setuid binary you can definitely see its exit code. >=20 > Only if you're the parent. Otherwise, you can't see the process exit > status unless you pass a ptrace access check and consult > /proc/pid/stat after the process dies, but before the zombie > disappears. Random unrelated and unprivileged processes can't see exit > statuses from distant parts of the system. Sure, I'd propose that ptrace_may_access() is what we should use for operation permission checks. --=20 Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH --xnndpxdsburq7xhv Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEb6Gz4/mhjNy+aiz1Snvnv3Dem58FAlvx/zcACgkQSnvnv3De m5/y1A/7BsU8nrbbk4WkM6zHX4iWZMUB0JrNF66rv8nPYq16fDSrEoUtSxbpYVgq UXkodHIZNoxkLE39NhP1z3cr1GjSwRifB++3XYSWgTA6R72U+hCDkL3VGiZOIV+g OPkthne/oZNK8oGUFFGkSIFe1zAJ3ApCwzf6FMcRm7UphpOgRgdiSli/SdxXy11X wVPoJObkHgcuKoqfii59xQU7aEi//D7MYvgDCMZ7h5baJ484+AQG2aXtdsGhNJM0 jxC7+0wF9yZ+JzSUw5tweBFHWzrHAZBvOpyP9ia2LKg3jy0lRXCAm9qL9jMqtwYk boVYm1p4406zWSkBJa7lJ0lgvKBX07jq844P2VKSHFsiFXYxYc9t/rNmO6KhkL5D u3xxSRRnYGqbAFuAnTxMJk5VQmwz8/jDzhKskh+wdMIvbNZrdiFLtQM0s2hQcWiL 0X+77/JRJh6ujf2g3I7c7NPZAA8jqbGB3tNDKVOOMoUUxMsQuhph0AJj81u4F42Q J5s48tdpS5Qev3AAXnCAtSWAUpQqZj4apJ9q+iKeyLW7r1/+bCE++Bi/Ska1e2aa Dm9ByuVn1yVcPUjH6A9pEltJoFklAT/sPdCRBbECgcOKQRD0btmzyBEVF2haGwqv TfoCm+L/hxNVBL1+AUs9/wF465LkCY+MrymQsCFqwr30Wi9Npxk= =O+a0 -----END PGP SIGNATURE----- --xnndpxdsburq7xhv--