From: Jann Horn
Date: Thu, 29 Oct 2020 02:25:34 +0100
Subject: Re: For review: seccomp_user_notif(2) manual page
To: Kees Cook, "Michael Kerrisk (man-pages)"
Cc: Tycho Andersen, Sargun Dhillon, Christian Brauner, linux-man, lkml,
    Aleksa Sarai, Alexei Starovoitov, Will Drewry, bpf, Song Liu,
    Daniel Borkmann, Andy Lutomirski, Linux Containers,
    Giuseppe Scrivano, Robert Sesek
In-Reply-To: <202010281548.CCA92731F@keescook>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Oct 28, 2020 at 11:53 PM Kees Cook wrote:
> On Mon, Oct 26, 2020 at 10:51:02AM +0100, Jann Horn wrote:
> > The problem is the scenario where a process is interrupted while it's
> > waiting for the supervisor to reply.
> >
> > Consider the following scenario (with supervisor "S" and target "T"; S
> > wants to wait for events on two file descriptors seccomp_fd and
> > other_fd):
> >
> > S: starts poll() to wait for events on seccomp_fd and other_fd
> > T: performs a syscall that's filtered with RET_USER_NOTIF
> > S: poll() returns and signals readiness of seccomp_fd
> > T: receives signal SIGUSR1
> > T: syscall aborts, enters signal handler
> > T: signal handler blocks on unfiltered syscall (e.g. write())
> > S: starts SECCOMP_IOCTL_NOTIF_RECV
> > S: blocks because no syscalls are pending
>
> Oooh, yes, ew. Thanks for the illustration.
>
> Thinking about this from userspace's least-surprise view, I would expect
> the "recv" to stay "queued", in the sense we'd see this:
>
> S: starts poll() to wait for events on seccomp_fd and other_fd
> T: performs a syscall that's filtered with RET_USER_NOTIF
> S: poll() returns and signals readiness of seccomp_fd
> T: receives signal SIGUSR1
> T: syscall aborts, enters signal handler
> T: signal handler blocks on unfiltered syscall (e.g. write())
> S: starts SECCOMP_IOCTL_NOTIF_RECV
> S: gets (stale) seccomp_notif from seccomp_fd
> S: sends seccomp_notif_resp, receives ENOENT (or some better errno?)
>
> This is not at all how things are designed internally right now, but
> that behavior would work, yes?

It would be really ugly, but it could theoretically be made to work,
to some degree.

The first bit of trouble is that currently the notification lives on
the stack of the target process. If we want to be able to show
userspace the stale notification, we'd have to store it elsewhere. And
since we really don't want to start randomly throwing -ENOMEM in any
of this stuff, we'd basically have to store it in pre-allocated memory
inside the filter.

The second bit of trouble is that if the supervisor is so oblivious
that it doesn't realize that syscalls can be interrupted, it'll run
into other problems.
Let's say the target process does something like this:

int func(void) {
  char pathbuf[4096];
  sprintf(pathbuf, "/tmp/blah.%d", some_number);
  mount("foo", pathbuf, ...);
}

and mount() is handled with a notification. If the supervisor just
reads the path string and immediately passes it into the real mount()
syscall, something like this can happen:

target: starts mount()
target: receives signal, aborts mount()
target: runs signal handler, returns from signal handler
target: returns out of func()
supervisor: receives notification
supervisor: reads path from remote buffer
supervisor: calls mount()

but because the stack allocation has already been freed by the time
the supervisor reads it, the supervisor just reads random garbage, and
beautiful fireworks ensue.

So the supervisor *fundamentally* has to be written to expect that at
*any* time, the target can abandon a syscall. And every read of remote
memory has to be separated from uses of that remote memory by a
notification ID recheck.

And at that point, I think it's reasonable to expect the supervisor to
also be able to handle that a syscall can be aborted before the
notification is delivered.
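
The "read, then recheck the notification ID, then use" rule above could
be sketched roughly as follows. This is only an illustration under
stated assumptions, not code from this thread: the helper name
read_target_string() and its signature are hypothetical, and reading
the target's memory through /proc/<pid>/mem is just one possible way
to do the remote read. The only kernel interface relied on for the
recheck is the SECCOMP_IOCTL_NOTIF_ID_VALID ioctl from
<linux/seccomp.h>.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/seccomp.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (name and signature are illustrative only).
 *
 * Copies up to len-1 bytes from the target's memory at remote_addr into
 * buf, NUL-terminates the copy, and then asks the kernel whether the
 * notification req->id is still live.
 *
 * Returns 0 only if the recheck succeeded, i.e. the target was still
 * blocked in the syscall after the read finished, so buf was not read
 * from an abandoned stack frame. Returns -1 otherwise; the caller must
 * then discard buf and must not act on it.
 */
static int read_target_string(int notify_fd, const struct seccomp_notif *req,
                              uint64_t remote_addr, char *buf, size_t len)
{
    char mem_path[64];
    ssize_t n;
    int mem_fd, saved_errno;
    __u64 id = req->id;

    assert(buf != NULL && len > 0);

    /* Assumption: remote memory is read via /proc/<pid>/mem. */
    snprintf(mem_path, sizeof(mem_path), "/proc/%u/mem", req->pid);
    mem_fd = open(mem_path, O_RDONLY | O_CLOEXEC);
    if (mem_fd < 0)
        return -1;

    n = pread(mem_fd, buf, len - 1, (off_t)remote_addr);
    saved_errno = errno;
    close(mem_fd);
    if (n < 0) {
        errno = saved_errno;
        return -1;
    }
    buf[n] = '\0';

    /*
     * Recheck AFTER the read and BEFORE any use of buf. If the target
     * has abandoned the syscall in the meantime, this ioctl fails and
     * whatever was copied into buf must be treated as garbage.
     */
    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) < 0)
        return -1;

    return 0;
}
```

A supervisor would call this with the notification it got from
SECCOMP_IOCTL_NOTIF_RECV and the pointer argument of the syscall (for
the mount() example, something like req->data.args[1]), and only pass
buf to the real mount() when the helper returns 0. Note that this only
covers the window up to the recheck; the target can still abandon the
syscall afterwards, so the reply can still fail.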