Received: by 2002:a05:6a10:22f:0:0:0:0 with SMTP id 15csp556562pxk; Thu, 1 Oct 2020 08:49:57 -0700 (PDT) X-Google-Smtp-Source: ABdhPJzJrOVLwRuq9yj5hLFyFwj3GtzBsn/AYo6fqp2ldS/HP4jM5fsyxlVC8fE5BgZPk8Ux9Gwa X-Received: by 2002:aa7:d501:: with SMTP id y1mr8876462edq.29.1601567397580; Thu, 01 Oct 2020 08:49:57 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1601567397; cv=none; d=google.com; s=arc-20160816; b=sKcopz6qnJcd9ISA5fFl7RTSCA27v21DHvkt5JglFMvDuu8Exua5VZ8Pw1WIL5wn5j YyrNDmrg9fdy2TDBshIfjuZn+1tBTazAqtLn2ozMgDRdxY60HjWThPFRUc9Sg1pvlgXT xGLWu4iNXarbB1zIXY8eIEPcmGlCugCOEcXyW+XBq4KP2PnXmLhZy+IbOH57Wg6EvYrE 3rF+igHTE4tSKLPJ47uyCDwft6mFigL7aMRN/IUJZZpIeNNXwmIcIds7yNtIYlKLrjcj oq3MjCqwVKN8JalDZ6CJwzYK2i1VlycU6f65lHit6US+io5dwN1kwA9LHgDhapyrqqnY /DqA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:subject :message-id:date:from:in-reply-to:references:mime-version :dkim-signature; bh=8IFUkJPw/kRc69ZxpOZH7IdKr2uNB4GcMACMPjhB0mM=; b=NJjUvYdx1F9IT8XLNuI5wCMz/u0t0j1mIUXnWWK/ICOhfPtUq5ttrDNdo79NnLnuLw NS2LymCblAdqs+MNFOgR82/XHHivf5ZMZRQ9tnlaOBYA4wbmFlsqSVW8sEHOTBJEp3vr mIx5Fm12PtUbrBfs+cGjFuPDT2vweW90djwlPd1gQ1C0nE9Adirh7I6IZmgk0VTSMFjh QJSP/J49CewzFe7H5UisKUuW7jEO7cNMDadzn+4jqAxwHogQ2AW0e1JQeBZ/ycLADMH6 687vQtuZ8dX1K+4N2Mwjdlhq7QqByzJE8xrrYA1sIfQ+dwI3GycBye5meSnUN7HidJ3m PPEw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=sY7GNFIv; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id m23si3319624ejr.382.2020.10.01.08.49.23; Thu, 01 Oct 2020 08:49:57 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=sY7GNFIv; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1732360AbgJAPsW (ORCPT + 99 others); Thu, 1 Oct 2020 11:48:22 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41860 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1732308AbgJAPsW (ORCPT ); Thu, 1 Oct 2020 11:48:22 -0400 Received: from mail-ej1-x644.google.com (mail-ej1-x644.google.com [IPv6:2a00:1450:4864:20::644]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 505E8C0613E3 for ; Thu, 1 Oct 2020 08:48:22 -0700 (PDT) Received: by mail-ej1-x644.google.com with SMTP id p15so8776144ejm.7 for ; Thu, 01 Oct 2020 08:48:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc:content-transfer-encoding; bh=8IFUkJPw/kRc69ZxpOZH7IdKr2uNB4GcMACMPjhB0mM=; b=sY7GNFIvYaNfMnuHljMi/SlQZ6n4BcGZhCTAkvb73jwrTAP++F0FA96srmG4n+saym LPs0Qo1rf+Gc/KbMjrfFVvv0dVBlQo5xErKWSXr8vfIbgudw/fJB81Prkbi9Pii7XtRh bwufKgriflGSXWbjmbN1+qvivFzU+VKLk390fTxTMNBo7z+CsZ1XFTp/HjN3R699Gw+C vfciTTmpKgNNA11LmOMtAz8ZR9nakBZvoaMyZp8Hz1UVlkM5vLsghRNGZtT0pDQ9PVkF +xdKQnhRbsvPbhl29QQtDseBMWcFAJ8tpq31IZJtch0aWdpaPLCLP06UMF/rwp7udtiZ WVOg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=8IFUkJPw/kRc69ZxpOZH7IdKr2uNB4GcMACMPjhB0mM=; b=El7i8N7N5vST/3xm/HcTszpG9I7v6WAYdIArk0qW3LZuNJvlLCpDju76TmHkX9FfKk b79ZilE635nIF+9g0ZpsD4Yn3NgqtQ2hkanWibnM+hwVgOFS2tfDnbj70auvloks7M5Y Ov1RecWCGdnMLhogshJrfkgYd6ALe3gJKzkYZZgKkRGHngYGBOFFR/qukSxpy1WxE9Yu weJhSVgIYcrXGSPqde8N4j0ZbOql4PxPovCf/giGYvQk7DlxPlPPqzN9LdNgKGPbRh7t 4ueghp4zfCZgkWdeBGJhJx/7xAG/7JDSwEwyITairX+k+BjGzkNSVEOGGY9vjfF/igwE 8pyA== X-Gm-Message-State: AOAM533/psRwKGqo4Vwq+4EQmKY/bn+jq2JIMtlPlhExYkdUz3pmP/G7 sc8AvBbQYPATSOoKBfE8UL2HdRDtOzXilWAbAG6Jhw== X-Received: by 2002:a17:906:1f94:: with SMTP id t20mr8931066ejr.493.1601567300584; Thu, 01 Oct 2020 08:48:20 -0700 (PDT) MIME-Version: 1.0 References: <45f07f17-18b6-d187-0914-6f341fe90857@gmail.com> <20201001125043.dj6taeieatpw3a4w@gmail.com> In-Reply-To: <20201001125043.dj6taeieatpw3a4w@gmail.com> From: Jann Horn Date: Thu, 1 Oct 2020 17:47:54 +0200 Message-ID: Subject: Re: For review: seccomp_user_notif(2) manual page To: Christian Brauner Cc: "Michael Kerrisk (man-pages)" , linux-man , Song Liu , Will Drewry , Kees Cook , Daniel Borkmann , Giuseppe Scrivano , Robert Sesek , Linux Containers , lkml , Alexei Starovoitov , bpf , Andy Lutomirski , Christian Brauner Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Oct 1, 2020 at 2:54 PM Christian Brauner wrote: > On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers wrote: > > On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages) > > wrote: > > > NOTES > > > The file descriptor returned when seccomp(2) is employed with = the > > > SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored us= ing > > > poll(2), epoll(7), and select(2). When a notification is pe= nd=E2=80=90 > > > ing, these interfaces indicate that the file descriptor is re= ad=E2=80=90 > > > able. > > > > We should probably also point out somewhere that, as > > include/uapi/linux/seccomp.h says: > > > > * Similar precautions should be applied when stacking SECCOMP_RET_USER= _NOTIF > > * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on t= he > > * same syscall, the most recently added filter takes precedence. This = means > > * that the new SECCOMP_RET_USER_NOTIF filter can override any > > * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing = all > > * such filtered syscalls to be executed by sending the response > > * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can eq= ually > > * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE. > > > > In other words, from a security perspective, you must assume that the > > target process can bypass any SECCOMP_RET_USER_NOTIF (or > > SECCOMP_RET_TRACE) filters unless it is completely prohibited from > > calling seccomp(). This should also be noted over in the main > > seccomp(2) manpage, especially the SECCOMP_RET_TRACE part. > > So I was actually wondering about this when I skimmed this and a while > ago but forgot about this again... Afaict, you can only ever load a > single filter with SECCOMP_FILTER_FLAG_NEW_LISTENER set. If there > already is a filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER property > in the tasks filter hierarchy then the kernel will refuse to load a new > one? > > static struct file *init_listener(struct seccomp_filter *filter) > { > struct file *ret =3D ERR_PTR(-EBUSY); > struct seccomp_filter *cur; > > for (cur =3D current->seccomp.filter; cur; cur =3D cur->prev) { > if (cur->notif) > goto out; > } > > shouldn't that be sufficient to guarantee that USER_NOTIF filters can't > override each other for the same task simply because there can only ever > be a single one? Good point. Exceeeept that that check seems ineffective because this happens before we take the locks that guard against TSYNC, and also before we decide to which existing filter we want to chain the new filter. So if two threads race with TSYNC, I think they'll be able to chain two filters with listeners together. I don't know whether we want to eternalize this "only one listener across all the filters" restriction in the manpage though, or whether the man page should just say that the kernel currently doesn't support it but that security-wise you should assume that it might at some point. [...] > > > if (procMemFd =3D=3D -1) > > > errExit("Supervisor: open"); > > > > > > /* Check that the process whose info we are accessing is s= till alive. > > > If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performe= d > > > in checkNotificationIdIsValid()) succeeds, we know that= the > > > /proc/PID/mem file descriptor that we opened correspond= s to the > > > process for which we received a notification. If that p= rocess > > > subsequently terminates, then read() on that file descr= iptor > > > will return 0 (EOF). */ > > > > > > checkNotificationIdIsValid(notifyFd, req->id); > > > > > > /* Seek to the location containing the pathname argument (= i.e., the > > > first argument) of the mkdir(2) call and read that path= name */ > > > > > > if (lseek(procMemFd, req->data.args[0], SEEK_SET) =3D=3D -= 1) > > > errExit("Supervisor: lseek"); > > > > > > ssize_t s =3D read(procMemFd, path, PATH_MAX); > > > if (s =3D=3D -1) > > > errExit("read"); > > > > Why not pread() instead of lseek()+read()? > > With multiple arguments to be read process_vm_readv() should also be > considered. process_vm_readv() can end up doing each read against a different process, which is sort of weird semantically. You would end up taking page faults at random addresses in unrelated processes, blocking on their mmap locks, potentially triggering their userfaultfd notifiers, and so on. Whereas if you first open /proc/$tid/mem, then re-check SECCOMP_IOCTL_NOTIF_ID_VALID, and then do the read, you know that you're only taking page faults on the process where you intended to do it. So until there is a variant of process_vm_readv() that operates on pidfds, I would not recommend using that here.