From: Jann Horn
Date: Wed, 28 Oct 2020 19:20:55 +0100
Subject: Re: For review: seccomp_user_notif(2) manual page
To: Sargun Dhillon
Cc: "Michael Kerrisk (man-pages)", Tycho Andersen, Kees Cook,
    Christian Brauner, linux-man, lkml, Aleksa Sarai,
    Alexei Starovoitov, Will Drewry, bpf, Song Liu, Daniel Borkmann,
    Andy Lutomirski, Linux Containers, Giuseppe Scrivano, Robert Sesek
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Oct 28, 2020 at 6:44 PM Sargun Dhillon wrote:
> On Wed, Oct 28, 2020 at 2:43 AM Jann Horn wrote:
> > On Wed, Oct 28, 2020 at 7:32 AM Sargun Dhillon wrote:
> > > On Tue, Oct 27, 2020 at 3:28 AM Jann Horn wrote:
> > > > On Tue, Oct 27, 2020 at 7:14 AM Michael Kerrisk (man-pages)
> > > > wrote:
> > > > > On 10/26/20 4:54 PM, Jann Horn wrote:
> > > > > > I'm a bit on the fence now on whether non-blocking mode should use
> > > > > > ENOTCONN or not...
> > > > > > I guess if we returned ENOENT even when there are
> > > > > > no more listeners, you'd have to disambiguate through the poll()
> > > > > > revents, which would be kinda ugly?
> > > > >
> > > > > I must confess, I'm not quite clear on which two cases you
> > > > > are trying to distinguish. Can you elaborate?
> > > >
> > > > Let's say someone writes a program whose responsibilities are just to
> > > > handle seccomp events and to listen on some other fd for commands. And
> > > > this is implemented with an event loop. Then once all the target
> > > > processes are gone (including zombie reaping), we'll start getting
> > > > EPOLLERR.
> > > >
> > > > If NOTIF_RECV starts returning -ENOTCONN at this point, the event loop
> > > > can just call into the seccomp logic without any arguments; it can
> > > > just call NOTIF_RECV one more time, see the -ENOTCONN, and terminate.
> > > > The downside is that there's one more error code userspace has to
> > > > special-case.
> > > > This would be more consistent with what we'd be doing in the blocking case.
> > > >
> > > > If NOTIF_RECV keeps returning -ENOENT, the event loop has to also tell
> > > > the seccomp logic what the revents are.
> > > >
> > > > I guess it probably doesn't really matter much.
> > >
> > > So, in practice, if you're emulating a blocking syscall (such as open,
> > > perf_event_open, or any of a number of other syscalls), you probably
> > > have to do it on a separate thread in the supervisor because you want
> > > to continue to be able to receive new notifications if any other process
> > > generates a seccomp notification event that you need to handle.
> > >
> > > In addition to that, some of these syscalls are preemptible, so you need
> > > to poll SECCOMP_IOCTL_NOTIF_ID_VALID to make sure that the program
> > > under supervision hasn't left the syscall.
> > >
> > > If we're to implement a mechanism that makes the seccomp ioctl receive
> > > non-blocking, it would be valuable to address this problem as well (getting
> > > a notification when the supervisor is processing a syscall and needs to
> > > preempt it). In the best case, this can be a minor inconvenience, and
> > > in the worst case this can result in weird errors where you're keeping
> > > resources open that the container expects to be closed.
> >
> > Does "a notification" mean signals? Or would you want to have a second
> > thread in userspace that poll()s for cancellation events on the
> > seccomp fd and then somehow takes care of interrupting the first
> > thread, or something like that?
>
> I would be reluctant to be prescriptive in that it be a signal. Right
> now, it's implemented as a second thread in userspace that does a
> ioctl(...) and checks if the notification is valid / alive, and does
> what's required if the notification has died (interrupting the first
> thread).
>
> >
> > Either way, I think your proposal goes beyond the scope of patching
> > the existing weirdness, and should be a separate patch.
>
> I agree it should be a separate patch, but I think that it'd be nice if there
> was a way to do something like:
> * opt-in to getting another message after receiving the notification
>   that indicates the program has left the syscall

I guess to do that cleanly, we'd want something like an array
associated with the seccomp filter that has a size N that's determined
when the filter is set up... and then when a received but unanswered
notification is cancelled, we'd insert its identifier into that array.
And if we enforce that the supervisor can never have more than N
pending messages (by just not delivering new ones if there are N old
ones pending), we'll know that any possible cancellation will always
fit, and we don't need to worry about dynamic memory allocation.
And we could raise EPOLLPRI on the file descriptor when the array is
non-empty, so that if userspace doesn't currently want to handle new
notifications (because it's already dealing with a bunch of them),
userspace can do that, too.

> * when you do the RECV, you can specify a flag or some such asking
>   that you get signaled / notified about the program leaving the syscall

I think filter setup time is easier to deal with than RECV time.

> * a multiplexed receive that can say if an existing notification in progress
>   has left the valid state.

Or alternatively a separate ioctl for receiving cancellation messages,
which you'd only call on EPOLLPRI.
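
To make the bookkeeping above a bit more concrete, here is a rough
userspace model of the fixed-size cancellation array. It is only a
sketch for illustration, not kernel code and not an existing API: the
class and method names (NotifModel, syscall_enter, recv, respond,
syscall_leave, epollpri, recv_cancel) are all made up. It also assumes
that cancellation entries the supervisor has not drained yet still
count toward N; the proposal above does not spell that out, but
without it a second batch of N cancellations might not fit.

```python
from collections import deque


class NotifModel:
    """Toy model of the proposed scheme (hypothetical names, not kernel code)."""

    def __init__(self, n):
        self.n = n                # N, fixed when the filter is set up
        self.queued = deque()     # notifications not yet received
        self.pending = set()      # received but unanswered notification ids
        self.cancelled = deque()  # the "array": never holds more than N ids

    def syscall_enter(self, notif_id):
        """The target entered a syscall that triggers a notification."""
        self.queued.append(notif_id)

    def recv(self):
        """Model of NOTIF_RECV: returns an id, or None if nothing is delivered."""
        if not self.queued:
            return None
        # Assumption: undrained cancellations still occupy a slot, so the
        # sum of both counts is what must stay below N.
        if len(self.pending) + len(self.cancelled) >= self.n:
            return None
        notif_id = self.queued.popleft()
        self.pending.add(notif_id)
        return notif_id

    def respond(self, notif_id):
        """The supervisor answered the notification; its slot is freed."""
        self.pending.discard(notif_id)

    def syscall_leave(self, notif_id):
        """The target left the syscall before an answer arrived."""
        if notif_id in self.pending:
            # Received but unanswered: record the cancellation. This always
            # fits, because recv() never lets the two counts exceed N.
            self.pending.remove(notif_id)
            self.cancelled.append(notif_id)
        elif notif_id in self.queued:
            # Never delivered, so the supervisor has nothing to cancel.
            self.queued.remove(notif_id)

    def epollpri(self):
        """EPOLLPRI would be raised while the array is non-empty."""
        return bool(self.cancelled)

    def recv_cancel(self):
        """Model of a separate ioctl that drains one cancellation message."""
        return self.cancelled.popleft() if self.cancelled else None
```

With N = 2 and three queued notifications, the third recv() delivers
nothing until one of the first two has been answered or its
cancellation has been drained via recv_cancel(), so no allocation is
ever needed on the cancellation path.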