Received: by 2002:ac0:a5a6:0:0:0:0:0 with SMTP id m35-v6csp682468imm; Fri, 21 Sep 2018 06:39:58 -0700 (PDT) X-Google-Smtp-Source: ANB0VdYGaDC6rwCRny0tXhQjtUWvkmvLMGZGN9e7SJKBgvjKNG4PYr5LgrvUUMVh48y8wNz/Uq1s X-Received: by 2002:a17:902:1566:: with SMTP id b35-v6mr44425056plh.135.1537537198812; Fri, 21 Sep 2018 06:39:58 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1537537198; cv=none; d=google.com; s=arc-20160816; b=iBn5HhHN53ha0pwn7Ut7eHU8TPxkstXrtHBDRSxu1KSQqf1fH854XCch7w73NB/jBU 1JktNpfCcUPzaIQWdnhrbZnyzRjaz8RRUZJN3vP5zNSyzoED5GLX+wwDxjaY2kTlL+Ws I9celwbBN92kEU04TZDvAxpdsO2vJspEwuvJcBxMGFtbrRSm3tBtubQWOun87Id7TvWc APW2HMqMY41PWx50kmDPhm7BtG4++G9hLeT3W2080AHIYNw+zuHc8bnCCtFI0VDDMqpZ 6hcOYS27GxpfI/zZQnQd+1U8xgyFU62mdJ3TNws7zXVdLrQb2Sm+jaiVMwyrwsh2WBRx kCiw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-transfer-encoding:content-disposition:mime-version :references:message-id:subject:cc:to:from:date:dkim-signature; bh=dEDi/RdhKg1iw0zuffJVQigaZvB50LUeWGo7/t1is1s=; b=TEUYMhXqrUiWzG2xeD/hTfDpr9h0fKxc93G1NCmICyVJEssRADCPhfoclkrnJ4xCBZ GMj+x30nDsklgnK0zqwY2uT9SLXP/6Gs/vEvQ5wjqeTlhHymbwP5eFiAk+T6Ceomi22k lqiJt3Kdu1NY9vcPjdEULz4ZMNpH9PGOUrxJ1xQoTxDe63a7zkyntHg57fMpIGYpWW0o EdJRdjtsFFjBo5iO3aOyHfmW2tt98dvzLzubczVf8C9N+zEoKs5VWhpudRlDAiKEYGgQ P9ftICziqI7cpmiWJvN9OXXrute+hJpa0cOhuyDaGxKEWsr/HETRGkSTXWxQK9vq8tyu mzZA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@tycho-ws.20150623.gappssmtp.com header.s=20150623 header.b=gMc2CsnQ; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id p2-v6si26553168plo.246.2018.09.21.06.39.43; Fri, 21 Sep 2018 06:39:58 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@tycho-ws.20150623.gappssmtp.com header.s=20150623 header.b=gMc2CsnQ; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2389795AbeIUT2a (ORCPT + 99 others); Fri, 21 Sep 2018 15:28:30 -0400 Received: from mail-pf1-f195.google.com ([209.85.210.195]:36699 "EHLO mail-pf1-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2389702AbeIUT21 (ORCPT ); Fri, 21 Sep 2018 15:28:27 -0400 Received: by mail-pf1-f195.google.com with SMTP id b11-v6so6014202pfo.3 for ; Fri, 21 Sep 2018 06:39:33 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=tycho-ws.20150623.gappssmtp.com; s=20150623; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:content-transfer-encoding:in-reply-to :user-agent; bh=dEDi/RdhKg1iw0zuffJVQigaZvB50LUeWGo7/t1is1s=; b=gMc2CsnQ8CKDEe7/hyDvOxc+Lrp//6g/hY9YYFa/sredKTnHfA5qBb5TbeeqoUjARt jS9FPLQnYN8TA+Y4fXKjlD8D8H7YgDKyEaXGthJH24ITi7OzdqRxtF3HRe70thK1FB5z e1V+tAteQa4xe0QPGIcjsVNO1eW70gAHULrl8nMRx1AJ0ZuhQFvlCIj10NkelxFUzhMO Dnbc7G19/0AsW+Jq3teE6CIJ2BnSfA1H+JdVPe6P4Uj/JZgkyvM6WLbvFXep9VmTnP0H 4QlXCMUq7vJZ7QzpmqCp96eGbSUWN3oReDHMjeOH3qE/ez0oPsWRiLIiFzpHDJD4TBuk HYFg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:content-transfer-encoding :in-reply-to:user-agent; bh=dEDi/RdhKg1iw0zuffJVQigaZvB50LUeWGo7/t1is1s=; b=oi2YrLwmEj8cfeVPs1ZiBr/Vc1vhzsQQTwqpQl2rGHlPJBRvHlL1Uk1G0xwXLW5RdY yBxj8GI4o92Qke+sSWPFev45I3sWyC0je7hh9Hl3p1oP6aR00o/gevrMeH/Jhk3mXWu5 igDPnOivbqIX8pXv1roMSWPcltps2qnQVoqPK9cSiK38mWdUPY9naXgGZ1izkBxJHf07 j22sSFrqvNlMPSNz9pBR8f0Rcjzj4DhMh5qIs2+PVdgIVqfyIOPSM7aZxMfRYnovnC+F VGHWTFpaW5Tr4UZrjo+Sum2GSw62td7tp+2nnDvPwyb2WRcxIHVtdqi2B0BZyuJNSiQR bkSA== X-Gm-Message-State: APzg51A+fvArMCV7OttxrR4jke+e+YQgsuQGR0O+rVgXrJ9hoQkXumsu zmqhe5jtGtaXK9AWAdtdBL6nYQ== X-Received: by 2002:a62:435c:: with SMTP id q89-v6mr47150478pfa.135.1537537172397; Fri, 21 Sep 2018 06:39:32 -0700 (PDT) Received: from cisco ([128.107.241.177]) by smtp.gmail.com with ESMTPSA id b203-v6sm42279265pfb.174.2018.09.21.06.39.29 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Fri, 21 Sep 2018 06:39:31 -0700 (PDT) Date: Fri, 21 Sep 2018 07:39:28 -0600 From: Tycho Andersen To: Andy Lutomirski Cc: Kees Cook , LKML , Linux Containers , Linux API , Oleg Nesterov , "Eric W. Biederman" , "Serge E. Hallyn" , Christian Brauner , Tyler Hicks , Akihiro Suda , Jann Horn Subject: Re: [PATCH v6 4/5] seccomp: add support for passing fds via USER_NOTIF Message-ID: <20180921133928.GS4672@cisco> References: <20180906152859.7810-1-tycho@tycho.ws> <20180906152859.7810-5-tycho@tycho.ws> <20180919095536.GM4672@cisco> <20180919143842.GN4672@cisco> <20180920234240.GR4672@cisco> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: User-Agent: Mutt/1.9.4 (2018-02-28) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Sep 20, 2018 at 07:18:45PM -0700, Andy Lutomirski wrote: > On Thu, Sep 20, 2018 at 4:42 PM Tycho Andersen wrote: > > > > On Wed, Sep 19, 2018 at 12:58:20PM -0700, Andy Lutomirski wrote: > > > On Wed, Sep 19, 2018 at 7:38 AM, Tycho Andersen wrote: > > > > On Wed, Sep 19, 2018 at 07:19:56AM -0700, Andy Lutomirski wrote: > > > >> > > > >> > > > >> > On Sep 19, 2018, at 2:55 AM, Tycho Andersen wrote: > > > >> > > > > >> >> On Wed, Sep 12, 2018 at 04:52:38PM -0700, Andy Lutomirski wrote: > > > >> >>> On Thu, Sep 6, 2018 at 8:28 AM, Tycho Andersen wrote: > > > >> >>> The idea here is that the userspace handler should be able to pass an fd > > > >> >>> back to the trapped task, for example so it can be returned from socket(). > > > >> >>> > > > >> >>> I've proposed one API here, but I'm open to other options. In particular, > > > >> >>> this only lets you return an fd from a syscall, which may not be enough in > > > >> >>> all cases. For example, if an fd is written to an output parameter instead > > > >> >>> of returned, the current API can't handle this. Another case is that > > > >> >>> netlink takes as input fds sometimes (IFLA_NET_NS_FD, e.g.). If netlink > > > >> >>> ever decides to install an fd and output it, we wouldn't be able to handle > > > >> >>> this either. > > > >> >> > > > >> >> An alternative could be to have an API (an ioctl on the listener, > > > >> >> perhaps) that just copies an fd into the tracee. There would be the > > > >> >> obvious set of options: do we replace an existing fd or allocate a new > > > >> >> one, and is it CLOEXEC. Then the tracer could add an fd and then > > > >> >> return it just like it's a regular number. > > > >> >> > > > >> >> I feel like this would be more flexible and conceptually simpler, but > > > >> >> maybe a little slower for the common cases. What do you think? > > > >> > > > > >> > I'm just implementing this now, and there's one question: when do we > > > >> > actually do the fd install? Should we do it when the user calls > > > >> > SECCOMP_NOTIF_PUT_FD, or when the actual response is sent? It feels > > > >> > like we should do it when the response is sent, instead of doing it > > > >> > right when SECCOMP_NOTIF_PUT_FD is called, since if there's a > > > >> > subsequent signal and the tracer decides to discard the response, > > > >> > we'll have to implement some delete mechanism to delete the fd, but it > > > >> > would have already been visible to the process, etc. So I'll go > > > >> > forward with this unless there are strong objections, but I thought > > > >> > I'd point it out just to avoid another round trip. > > > >> > > > > >> > > > > >> > > > >> Can you do that non-racily? That is, you need to commit to an fd *number* right away, but what if another thread uses the number before you actually install the fd? > > > > > > > > I was thinking we could just do an __alloc_fd() and then do the > > > > fd_install() when the response is sent or clean up the case that the > > > > listener or task dies. I haven't actually tried to run the code yet, > > > > so it's possible the locking won't work :) > > > > > > I would be very surprised if the locking works. How can you run a > > > thread in a process when another thread has allocated but not > > > installed an fd and is blocked for an arbitrarily long time? > > > > I think the trick is that there's no actual locking required (except > > for a brief locking of task->files). I've run the patch below and it > > seems to work. But perhaps that's abusing __alloc_fd a little too > > hard, I don't really know. > > > > Hmm. This makes me highly nervous. If nothing else, what releases > the busy-but-not-open fd if the whole process aborts? Nothing right now, it gets installed even though the syscall gets -ENOSYS. So not ideal, but that's why I was thinking we needed some form of delete support. But, > > > > > > > >> Do we really allow non-“kill” signals to interrupt the whole process? It might be the case that we don’t really need to clean up from signals if there’s a guarantee that the thread dies. > > > > > > > > Yes, we do, because of this: https://lkml.org/lkml/2018/3/15/1122 > > > > > > > > > > I'm still not sure I see the problem. Suppose I'm implementing a user > > > notifier for a nasty syscall like recvmsg(). If I'm the tracer, by > > > the time I decide to install an fd, I've committed to returning > > > something other than -EINTR, even if a non-fatal signal is sent before > > > I finish. No rollback should be necessary. > > > > I don't understand why this is true. Surely you could stop a handler > > on receipt of a new signal, and have it do something else entirely? > > I think you *could*, but I'm not sure why you would. In general, > syscalls never execute signal handlers mid-syscall. There is a very > small number of syscalls that use sys_restart_syscall(), but I don't > think any of them allocate fds, and I'm not sure we need or want to > support them with user notifiers. The rest of the syscalls will, if > they're behaving correct, either do *something* (like reading some or > all of a buffer) and return success or they'll do nothing and return > -EINTR. Or they return an -ERESTARTSYS variant. And then, only > *after* the syscall logically returns (i.e. completely finishes > processing and puts its return code into the relevant register) will a > signal be delivered. In other words, the case where something like > recv() gets interrupted but still returns a success code does not mean > that a signal handler was called and then recv() resumed. It means > that recv() noticed the signal, stopped receiving, returned the number > of bytes read, and then allowed the signal to be delivered. > > In the -ERESTARTSYS case, the syscall returns -ERESTARTSYS (or a > variant) and returns without doing anything. But it returns in a > special case where, after the signal returns, the syscall will happen > again. > > So, for user notifiers, I think that any sane handler that notices a > non-fatal signal will do one of these things: > > - Return -EINTR without changing any tracee state. > > - Return success, possibly without blocking as long as it would have > without the signal. > > - Return -ERESTARTSYS without changing any tracee state. > > - Kill the tracee. > > None of these would involve backing out an fd that was already > installed. I suppose another way of looking at this is that. > > Although... now that I think about it, there are some special cases, > like socketpair(). Look for put_unused_fd(). So maybe we need to > expose get_unused_fd_flags() and put_unused_fd(), but I think that > these are exceptions and will be very uncommon in the context of > seccomp user notifiers. (For example, socketpair() can be implemented > almost correctly without put_unused_fd().) socketpair() is a good point. In particular, if we use this queuing thing I've done above, then you can only ever send one fd, and you'll need to send two here. So perhaps we really do need to do this as soon as the tracer calls ioctl(), vs queuing and waiting. > Hmm. This does mean that we need a test case for a user notifier > returning -ERESTARTSYS. It should Just Work (tm), but those are > famous last words. > > -ERESTARTSYS_RESTARTBLOCK is the case that I don't think we need to worry about. > > > > > > In the (unlikely?) event that some tracer needs to be able to rollback > > > an fd installation to return -EINTR, a SECCOMP_NOTIF_CLOSE_FD > > > operation should be good enough, I think. Or maybe PUT_FD can put -1 > > > to delete an fd. > > > > Yes, I think even with something like what I did below we'd need some > > sort of REMOVE_FD option, because otherwise there's no way to change > > your mind and send -EINTR without the fd you just PUT_FD'd. > > > > I think we just want the operation to cover all the cases. Let PUT_FD > take a source fd and a dest fd. If the source fd is -1, the dest is > closed. If the source is -1 and the dest is -1, return -EINVAL. If > the dest is -1, allocate an fd. If the dest is >= 0, work like > dup2(). (The latter could be necessary to emulate things like, say, > dup2 :)) ...then if we're going to allow overwriting fds, we'd need to lift out the logic from do_dup2 somewhere? Is this getting too complicated? :) Tycho