Received: by 2002:a05:6a10:9e8c:0:0:0:0 with SMTP id y12csp269668pxx; Thu, 29 Oct 2020 02:04:57 -0700 (PDT) X-Google-Smtp-Source: ABdhPJxu6vghVS5NCIVyGYy4YCKaHBYYGwfr9uDQgyYu61mHmt/G+1lxobAxoGWuv1jHpaVBLhxo X-Received: by 2002:a17:906:6d89:: with SMTP id h9mr2945656ejt.152.1603962297514; Thu, 29 Oct 2020 02:04:57 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1603962297; cv=none; d=google.com; s=arc-20160816; b=xlCJZPH8b6f44XboYtZy4jv6yqRpJ/lvIkPaFpO+pi7lI//j4wd2CYJ6F0hOYWb5qQ 0EGRBaArbEhfN41EB0WSGYqRAQ6OtbbvFOW/2Bo49VZ7E72wR4pLHYyoU7RzA7N1TG+d HG2L+VL9XQ6h50UsKybK5Z29LzqnUVutzJlcKGThR+vAnFa1oNx++T0ZqzbARHpRbjzr ynOjiOzjCeO3FY9Hmn6hWPzvLXdVhL2zK9/F8uvHk6AQRBZVR3UDB73LX2k/ov052HdV qqUMX2Iy34Qu9a9Lb8ZUUVB/bU4Zh0Y4Uj8HK1su9sJ4kpB4mjr2chZThnY6i12KlBoV Wxag== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:subject:message-id:date:from:in-reply-to :references:mime-version:dkim-signature; bh=8vJY7X+2qf9mk4yjNjSimkTKUqnRW4h3rIif/0iLOrE=; b=mWFKdd4jPq/ZMhD8B8V6Eed7/vymaiUgSTuBv1Q21GYxlwH+VdvG0IJ9Ar9l2F0hg3 lryOeLL0n48JnlGwxP3cbLDyAX+DDX6HjK4vL6oGtqvueuINMhQjhbd+YXeH/kpBg1RN imz9JL0JpiLCoigMw5HNUdOIgvN9t1zwuT+pn2ylGDhSiqcuD2sXjxnGG42gUWiDWE8E c+fQYcwDeKzp+A9mh/2JWS0/2+xNXNxtpDZiUDN9JA6mtU8jRR75L5VBU4MYUEorEFmk AvByVdxdNk0bIeBbNDXWyGyl2lkgczU3F9yW+ibVHn+oBhh5eFw5qM46dcxvqb8r8pYk gXkw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b="gZ8H/p46"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id y16si1622368edm.444.2020.10.29.02.04.35; Thu, 29 Oct 2020 02:04:57 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b="gZ8H/p46"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727250AbgJ2HYr (ORCPT + 99 others); Thu, 29 Oct 2020 03:24:47 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55362 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726731AbgJ2HYe (ORCPT ); Thu, 29 Oct 2020 03:24:34 -0400 Received: from mail-lj1-x242.google.com (mail-lj1-x242.google.com [IPv6:2a00:1450:4864:20::242]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EB1F6C0613AC for ; Wed, 28 Oct 2020 21:44:03 -0700 (PDT) Received: by mail-lj1-x242.google.com with SMTP id p15so1653021ljj.8 for ; Wed, 28 Oct 2020 21:44:03 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=8vJY7X+2qf9mk4yjNjSimkTKUqnRW4h3rIif/0iLOrE=; b=gZ8H/p46CECRx9wDoVffbN/h0TJ0bUslUCW4zEC9UGCLKm/cXqu5uiZC1A8cryr+E8 e0dLqac5CrQ0wZy68hcC5mDMEnQW3ZBUwc7+nrlVGLwjzOWHmxGB84B6A7tq9XDnv19s M7QjYPJ1Rgm+8fE1FFuZrcNgCnJkq5XlKbHAN6QN93QfeuU4o+r0DZj/rSPyFxLyR9HX goLYDldHw4PKN4C16oNQfIjUL+sxQTmKcTIGcQlhOv8tjXF8krfMfUE47zD8S0ABnIzk C8yOWDT4K5Mx4UbUn4ZaqYyEFrICmzbt6zNb1oXbkYLIFdMceLLg+ASw3I9w2T4L15rH 1NBg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=8vJY7X+2qf9mk4yjNjSimkTKUqnRW4h3rIif/0iLOrE=; b=M5ncuy3imvTQJ9FXPfEd+JsOPoWSCm2WcUzV4UKMDNui5Jlodj8YjbWhE8NoKi/tKQ lFjsRuGJtYf2ZH48ymQyBSVJUo7qfHsIMIYFeK/8oCMt9opeMIN0DC+h4disttQ4dkZU KFPjPpyReDA3UcXfS+oJf81hfYSijllnv91SOVBS3Bn8ln7ktzsMrDoz/wgZfCpY1pWw VqmbRn6waoyp4FAhKclrsOH2b8loSzQdDWnMRMNlHGxMyZOHRPR7ZdKWSEtAwc56nPfn ygarp8Npx78G7bFkG935KKxEk/YL1HeM7SoS/v9UOOzLMxkBTSdqtL6aD92/K1g5Apab FpDg== X-Gm-Message-State: AOAM533G3OvFj5pr7bwYDEVw2zhv/cRG/z/LkQvL8S88x23/U1G6sqm/ 9VMHhXEPFn0Y8gxrbKgJ1INtin5yupAMWsK8ZjzF9g== X-Received: by 2002:a2e:9c84:: with SMTP id x4mr981272lji.326.1603946642027; Wed, 28 Oct 2020 21:44:02 -0700 (PDT) MIME-Version: 1.0 References: <63598b4f-6ce3-5a11-4552-cdfe308f68e4@gmail.com> <20201029020438.GA25673@cisco> In-Reply-To: <20201029020438.GA25673@cisco> From: Jann Horn Date: Thu, 29 Oct 2020 05:43:35 +0100 Message-ID: Subject: Re: For review: seccomp_user_notif(2) manual page [v2] To: Tycho Andersen , "Michael Kerrisk (man-pages)" Cc: Kees Cook , Sargun Dhillon , Christian Brauner , Daniel Borkmann , Giuseppe Scrivano , Song Liu , Robert Sesek , Containers , linux-man , lkml , Aleksa Sarai , Alexei Starovoitov , Will Drewry , bpf , Andy Lutomirski Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Oct 29, 2020 at 3:04 AM Tycho Andersen wrote: > On Thu, Oct 29, 2020 at 02:42:58AM +0100, Jann Horn wrote: > > On Mon, Oct 26, 2020 at 10:55 AM Michael Kerrisk (man-pages) > > wrote: > > > static bool > > > getTargetPathname(struct seccomp_notif *req, int notifyFd, > > > char *path, size_t len) > > > { > > > char procMemPath[PATH_MAX]; > > > > > > snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", req->pid); > > > > > > int procMemFd = open(procMemPath, O_RDONLY); > > > if (procMemFd == -1) > > > errExit("\tS: open"); > > > > > > /* Check that the process whose info we are accessing is still alive. > > > If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed > > > in checkNotificationIdIsValid()) succeeds, we know that the > > > /proc/PID/mem file descriptor that we opened corresponds to the > > > process for which we received a notification. If that process > > > subsequently terminates, then read() on that file descriptor > > > will return 0 (EOF). */ > > > > > > checkNotificationIdIsValid(notifyFd, req->id); > > > > > > /* Read bytes at the location containing the pathname argument > > > (i.e., the first argument) of the mkdir(2) call */ > > > > > > ssize_t nread = pread(procMemFd, path, len, req->data.args[0]); > > > if (nread == -1) > > > errExit("pread"); > > > > As discussed at > > , > > we need to re-check checkNotificationIdIsValid() after reading remote > > memory but before using the read value in any way. Otherwise, the > > syscall could in the meantime get interrupted by a signal handler, the > > signal handler could return, and then the function that performed the > > syscall could free() allocations or return (thereby freeing buffers on > > the stack). > > > > In essence, this pread() is (unavoidably) a potential use-after-free > > read; and to make that not have any security impact, we need to check > > whether UAF read occurred before using the read value. This should > > probably be called out elsewhere in the manpage, too... > > > > Now, of course, **reading** is the easy case. The difficult case is if > > we have to **write** to the remote process... because then we can't > > play games like that. If we write data to a freed pointer, we're > > screwed, that's it. (And for somewhat unrelated bonus fun, consider > > that /proc/$pid/mem is originally intended for process debugging, > > including installing breakpoints, and will therefore happily write > > over "readonly" private mappings, such as typical mappings of > > executable code.) > > > > So, uuuuh... I guess if anyone wants to actually write memory back to > > the target process, we'd better come up with some dedicated API for > > that, using an ioctl on the seccomp fd that magically freezes the > > By freeze here you mean a killable wait instead of an interruptible > wait, right? Nope, nonkillable. Consider the case of vfork(), where a target process does something like this: void spawn_executable(char **argv, char **envv) { pid_t child = vfork(); if (child == 0) { char path[1000]; sprintf(path, ...); execve(path, argv, envv); } } and the seccomp notifier wants to look at the execve() path (as a somewhat silly example). The child process is just borrowing the parent's stack, and as soon as the child either gets far enough into execve() or dies, the parent continues using that stack. So keeping the child in killable sleep would not be enough to prevent reuse of the child's stack. But conceptually that's not really a big problem - we already have a way to force the target task to stay inside the seccomp code no matter if it gets SIGKILLed or whatever, and that is to take the notify_lock. When the target task wakes up and wants to continue executing, it has to first get through mutex_lock(&match->notify_lock) - and that will always block until the lock is free. So we could e.g. do something like: - Grab references to the source pages in the supervisor's address space with get_user_pages_fast(). - Take mmap_sem on the target. - Grab references to pages in the relevant range with pin_user_pages_remote(). - Drop the mmap_sem. - Take the notify_lock. - Recheck whether the notification with the right ID is still there. - Copy data from the pinned source pages to the pinned target pages. - Drop the notify_lock. - Drop the page references. and this way we would still guarantee that the target process would only be blocked in noninterruptible sleep for a small amount of time (and would not be indirectly blocked on sleeping operations through the mutex). It'd be pretty straightforward, I think. But as long as we don't actually need it, it might be easier to just note in the manpage that this is not currently supported.