From: Jann Horn
Date: Mon, 26 Oct 2020 10:51:02 +0100
Subject: Re: For review: seccomp_user_notif(2) manual page
To: Kees Cook
Cc: Tycho Andersen, "Michael Kerrisk (man-pages)", Sargun Dhillon,
    Christian Brauner, linux-man, lkml, Aleksa Sarai, Alexei Starovoitov,
    Will Drewry, bpf, Song Liu, Daniel Borkmann, Andy Lutomirski,
    Linux Containers, Giuseppe Scrivano, Robert Sesek
References: <45f07f17-18b6-d187-0914-6f341fe90857@gmail.com>
    <20200930150330.GC284424@cisco>
    <8bcd956f-58d2-d2f0-ca7c-0a30f3fcd5b8@gmail.com>
    <20200930230327.GA1260245@cisco>
    <20200930232456.GB1260245@cisco>
    <202010251725.2BD96926E3@keescook>
In-Reply-To: <202010251725.2BD96926E3@keescook>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Oct 26, 2020 at 1:32 AM Kees Cook wrote:
> On Thu, Oct 01, 2020 at 03:52:02AM +0200, Jann Horn wrote:
> > On Thu, Oct 1, 2020 at 1:25 AM Tycho Andersen wrote:
> > > On Thu, Oct 01, 2020 at 01:11:33AM +0200, Jann Horn wrote:
> > > > On Thu, Oct 1, 2020 at 1:03 AM Tycho Andersen wrote:
> > > > > On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
> > > > > > On 9/30/20 5:03 PM, Tycho Andersen wrote:
> > > > > > > On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> > > > > > >> ┌─────────────────────────────────────────────────────┐
> > > > > > >> │FIXME                                                │
> > > > > > >> ├─────────────────────────────────────────────────────┤
> > > > > > >> │From my experiments, it appears that if a SEC‐       │
> > > > > > >> │COMP_IOCTL_NOTIF_RECV is done after the target       │
> > > > > > >> │process terminates, then the ioctl() simply blocks   │
> > > > > > >> │(rather than returning an error to indicate that the │
> > > > > > >> │target process no longer exists).                    │
> > > > > > >
> > > > > > > Yeah, I think Christian wanted to fix this at some point,
> > > > > >
> > > > > > Do you have a pointer that discussion? I could not find it with a
> > > > > > quick search.
> > > > > >
> > > > > > > but it's a
> > > > > > > bit sticky to do.
> > > > > >
> > > > > > Can you say a few words about the nature of the problem?
> > > > >
> > > > > I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
> > > > > notify about unused filter"). So maybe there's a bug here?
> > > >
> > > > That thing only notifies on ->poll, it doesn't unblock ioctls; and
> > > > Michael's sample code uses SECCOMP_IOCTL_NOTIF_RECV to wait. So that
> > > > commit doesn't have any effect on this kind of usage.
> > >
> > > Yes, thanks. And the ones stuck in RECV are waiting on a semaphore so
> > > we don't have a count of all of them, unfortunately.
> > >
> > > We could maybe look inside the wait_list, but that will probably make
> > > people angry :)
> >
> > The easiest way would probably be to open-code the semaphore-ish part,
> > and let the semaphore and poll share the waitqueue. The current code
> > kind of mirrors the semaphore's waitqueue in the wqh - open-coding the
> > entire semaphore would IMO be cleaner than that. And it's not like
> > semaphore semantics are even a good fit for this code anyway.
> >
> > Let's see... if we didn't have the existing UAPI to worry about, I'd
> > do it as follows (*completely* untested). That way, the ioctl would
> > block exactly until either there actually is a request to deliver or
> > there are no more users of the filter. The problem is that if we just
> > apply this patch, existing users of SECCOMP_IOCTL_NOTIF_RECV that use
> > an event loop and don't set O_NONBLOCK will be screwed. So we'd
>
> Wait, why? Do you mean a ioctl calling loop (rather than a poll event
> loop)?

No, I'm talking about poll event loops.

> I think poll would be fine, but a "try calling RECV and expect to
> return ENOENT" loop would change. But I don't think anyone would do this
> exactly because it _currently_ acts like O_NONBLOCK, yes?
>
> > probably also have to add some stupid counter in place of the
> > semaphore's counter that we can use to preserve the old behavior of
> > returning -ENOENT once for each cancelled request. :(
>
> I only see this in Debian Code Search:
> https://sources.debian.org/src/crun/0.15+dfsg-1/src/libcrun/seccomp_notify.c/?hl=166#L166
> which is using epoll_wait():
> https://sources.debian.org/src/crun/0.15+dfsg-1/src/libcrun/container.c/?hl=1326#L1326
>
> I expect LXC is using it. :)

The problem is the scenario where a process is interrupted while it's
waiting for the supervisor to reply.

Consider the following scenario (with supervisor "S" and target "T"; S
wants to wait for events on two file descriptors seccomp_fd and
other_fd):

S: starts poll() to wait for events on seccomp_fd and other_fd
T: performs a syscall that's filtered with RET_USER_NOTIF
S: poll() returns and signals readiness of seccomp_fd
T: receives signal SIGUSR1
T: syscall aborts, enters signal handler
T: signal handler blocks on unfiltered syscall (e.g. write())
S: starts SECCOMP_IOCTL_NOTIF_RECV
S: blocks because no syscalls are pending

Depending on what other_fd is, this could in a worst case even lead to
a deadlock (if e.g. the signal handler wants to write to stdout, but
the stdout fd is hooked up to other_fd in the supervisor, but the
supervisor can't consume the data written because it's stuck in
seccomp handling).

So we have to ensure that when existing code (like that crun code you
linked to) triggers this case, SECCOMP_IOCTL_NOTIF_RECV returns
immediately instead of blocking.

(Oh, but by the way, that crun code looks broken anyway, because
AFAICS it treats all error returns from SECCOMP_IOCTL_NOTIF_RECV
equally by bailing out; and it kinda looks like that bailout path then
nukes the container, or something? So that needs to be fixed either
way.)
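
To illustrate that last point, here is a minimal sketch of how a
supervisor's event loop might tell the error returns of
SECCOMP_IOCTL_NOTIF_RECV apart instead of bailing out on all of them.
This is not the crun code and not a tested fix. The enum, the name
classify_recv() and the action chosen for each errno are assumptions
made for illustration. ENOENT is the value mentioned above for a
cancelled request; treating EINTR as "retry" is just the usual
convention for an interrupted blocking call.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical actions for a supervisor after SECCOMP_IOCTL_NOTIF_RECV. */
enum recv_action {
    RECV_HANDLE,  /* ioctl succeeded: a notification is ready to process */
    RECV_RETRY,   /* interrupted by a signal: issue the ioctl again */
    RECV_REPOLL,  /* request was cancelled: go back to poll() */
    RECV_FATAL    /* anything else: treat as a real failure */
};

/*
 * ret is the ioctl() return value (0 or -1); err is errno as captured
 * immediately after the call. The mapping below is an assumption for
 * illustration, not behavior taken from crun or LXC.
 */
static enum recv_action classify_recv(int ret, int err)
{
    assert(ret == 0 || ret == -1);

    if (ret == 0)
        return RECV_HANDLE;

    switch (err) {
    case EINTR:
        return RECV_RETRY;
    case ENOENT:
        /* The target's syscall was aborted before we picked it up. */
        return RECV_REPOLL;
    default:
        return RECV_FATAL;
    }
}
```

With something like this, the scenario above would end with the
supervisor returning to poll() (and draining other_fd) rather than
tearing down the container, provided the ioctl returns ENOENT instead
of blocking.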