Received: by 2002:a05:6a10:a0d1:0:0:0:0 with SMTP id j17csp2676807pxa; Mon, 17 Aug 2020 16:20:03 -0700 (PDT) X-Google-Smtp-Source: ABdhPJwulhkXjJLcUp56iFX7eGsktHUiqop0RH/6SmnrmRu2awkvS76sy4zqbturxd8dawmPl/T9 X-Received: by 2002:aa7:d145:: with SMTP id r5mr16873888edo.323.1597706403186; Mon, 17 Aug 2020 16:20:03 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1597706403; cv=none; d=google.com; s=arc-20160816; b=sHEvcTpNnXbcyU32Vt6wAcYuUIPQWngMgWOtTMi6FXS8B5FI+lx9nETywAgieXE3xi ltK6Qx33M3oxlin3lXR3bLkVGIfWjgb00gm02t9td8HYh7zzgB46e2GmmadwaPe5QVvY 5DTPm/8WnoObCQHWoR3jE/hdJmUgbwhnHH/Anu7nwAtAHwT500yT6uQtXrv7Y9w67Gq2 b6o5M43NovOLqdTzL0mkNdXR1BY6jSSY6O3IIFufDTPaFIHJ19zPQygVZHyL2ht566cU tu2Y/TTZ3PGe+85JpN3WbLifPJpMTFahB9oGjV1k9PwE7kCB02z7HIQBQ9DEyS46Ui2M 66gA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:cc:to:subject :message-id:date:from:in-reply-to:references:mime-version :dkim-signature; bh=2tREdSh21Iyn6r0BayW1rlUc24M2xAB6JGj1Q/yEJOs=; b=BD5gwCxSgtaJzHcQBkpsD0yrZbDZ6KkWE5/gjJBepu+4Npnwsbs9/ZEvtbxut8Ls0g ltaZUqfd0zz/oKKyJxk98r2qQjO9d/zdPz8/shux5iRf+2c/9fYItd1BrmZJcN2ia1mT XbVxyUC24xQZToSNrUqm/cwq2JqzFjokDlTbZ5YOsaW5miO/JIoDB5XBmXSS4vYyXWjn P0L7WatGX7YdJOGCSK4u5WiRe/G+FHgjO0z5oirjEfrPEisvr8Nky3J1qQcJu5iZFfnR 9d0zTd+Vc4bBuh6xmzle7nW8ieyPhnkCPu+J1o7dCzxjHa2kslNVcZCiqwkxCUJA13ZE umyw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=Jmac+Jdu; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id l11si13278403ejx.280.2020.08.17.16.19.39; Mon, 17 Aug 2020 16:20:03 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=Jmac+Jdu; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730240AbgHQWLi (ORCPT + 99 others); Mon, 17 Aug 2020 18:11:38 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40220 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730236AbgHQWL3 (ORCPT ); Mon, 17 Aug 2020 18:11:29 -0400 Received: from mail-io1-xd44.google.com (mail-io1-xd44.google.com [IPv6:2607:f8b0:4864:20::d44]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 211CDC061343 for ; Mon, 17 Aug 2020 15:11:28 -0700 (PDT) Received: by mail-io1-xd44.google.com with SMTP id a5so19128342ioa.13 for ; Mon, 17 Aug 2020 15:11:28 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc:content-transfer-encoding; bh=2tREdSh21Iyn6r0BayW1rlUc24M2xAB6JGj1Q/yEJOs=; b=Jmac+Jdu6CNOqZau1ltTM5nvHI5mWcU9B7UeZu+aPGa0aBCL3YLZDKsu8sh07S1Sun J0N8UQ0QMfghLdToAbmo8jUiCiqEtbyYUcO9BrZEF7bjRqiP7AgjKz1Z7sf7Tor/aoJv k96jBhA38Yqnu1gYkzso3NW7/mj9yvMN460rtw+LEHwJzVwqd9k5/jl6giWJYsN1iVQ8 Hv92+xZY7oVEhqG7F24kTrWbkXwaTMOHU2SkK5JPuYzcQQ6kITngfyGIvelpXRYDJCFb 4B42k5lL6RTW8M6Wj3Lsw/3LfM5wLBWgj15JhgdKbTx+vP9RqvBsNC8oOXRWunqi4aYP tTTg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=2tREdSh21Iyn6r0BayW1rlUc24M2xAB6JGj1Q/yEJOs=; b=sJuFgngSlHkuR8rDqXu5gC6p9x20LnuFDpZFysFrXay/woChhtpoZI2f8k4vIZ3f56 MWsoS14cLG4GXpjfmRtpKMMGQhg0Ku3aNcg7WoD2oKDLbRARJ6lidxid+5/cjn2wnI6+ USIS5q0TARaR1u4+8u+z6B5zqY3K27Jh4pKP5e8+jQ6eCfU0+2I9ud89sc6bcGYeQROL b381LrdkutkZfRxJth/tuOuqVjTZmvuH5gfWYB5Wb4i3iwzdWDC3ZzZ5FmygKaVk5KGV qKfrw9r7BNqfX4hBNDg0ONWlaC1psT/kVC28WlctJjVkv0vqtjCMbg2jfVZD2oPTH5vD bU7g== X-Gm-Message-State: AOAM533oGNt4ldFHLjv0yRoOTraE93LhpOWzt/F/c/6f232C8sKk6pRp xt9avLwV0a+6AoaZK9/ecMOic9oUot9v2AwLg9NbJA== X-Received: by 2002:a05:6638:138a:: with SMTP id w10mr16083796jad.36.1597702287823; Mon, 17 Aug 2020 15:11:27 -0700 (PDT) MIME-Version: 1.0 References: <202005200921.2BD5A0ADD@keescook> <20200520194804.GJ26186@redhat.com> <20200520195134.GK26186@redhat.com> <20200520211634.GL26186@redhat.com> <20200724093852-mutt-send-email-mst@kernel.org> <20200806004351-mutt-send-email-mst@kernel.org> In-Reply-To: <20200806004351-mutt-send-email-mst@kernel.org> From: Lokesh Gidra Date: Mon, 17 Aug 2020 15:11:16 -0700 Message-ID: Subject: Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only To: "Michael S. Tsirkin" Cc: Nick Kralevich , Jeffrey Vander Stoep , Andrea Arcangeli , Suren Baghdasaryan , Kees Cook , Jonathan Corbet , Alexander Viro , Luis Chamberlain , Iurii Zaikin , Mauro Carvalho Chehab , Andrew Morton , Andy Shevchenko , Vlastimil Babka , Mel Gorman , Sebastian Andrzej Siewior , Peter Xu , Mike Rapoport , Jerome Glisse , Shaohua Li , linux-doc@vger.kernel.org, LKML , Linux FS Devel , Tim Murray , Minchan Kim , Sandeep Patil , kernel@android.com, Daniel Colascione , Kalesh Singh Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Aug 5, 2020 at 10:44 PM Michael S. Tsirkin wrote: > > On Wed, Aug 05, 2020 at 05:43:02PM -0700, Nick Kralevich wrote: > > On Fri, Jul 24, 2020 at 6:40 AM Michael S. Tsirkin wro= te: > > > > > > On Thu, Jul 23, 2020 at 05:13:28PM -0700, Nick Kralevich wrote: > > > > On Thu, Jul 23, 2020 at 10:30 AM Lokesh Gidra wrote: > > > > > From the discussion so far it seems that there is a consensus tha= t > > > > > patch 1/2 in this series should be upstreamed in any case. Is the= re > > > > > anything that is pending on that patch? > > > > > > > > That's my reading of this thread too. > > > > > > > > > > > Unless I'm mistaken that you can already enforce bit 1 of the= second > > > > > > > parameter of the userfaultfd syscall to be set with seccomp-b= pf, this > > > > > > > would be more a question to the Android userland team. > > > > > > > > > > > > > > The question would be: does it ever happen that a seccomp fil= ter isn't > > > > > > > already applied to unprivileged software running without > > > > > > > SYS_CAP_PTRACE capability? > > > > > > > > > > > > Yes. > > > > > > > > > > > > Android uses selinux as our primary sandboxing mechanism. We do= use > > > > > > seccomp on a few processes, but we have found that it has a > > > > > > surprisingly high performance cost [1] on arm64 devices so turn= ing it > > > > > > on system wide is not a good option. > > > > > > > > > > > > [1] https://lore.kernel.org/linux-security-module/202006011116.= 3F7109A@keescook/T/#m82ace19539ac595682affabdf652c0ffa5d27dad > > > > > > > > As Jeff mentioned, seccomp is used strategically on Android, but is > > > > not applied to all processes. It's too expensive and impractical wh= en > > > > simpler implementations (such as this sysctl) can exist. It's also > > > > significantly simpler to test a sysctl value for correctness as > > > > opposed to a seccomp filter. > > > > > > Given that selinux is already used system-wide on Android, what is wr= ong > > > with using selinux to control userfaultfd as opposed to seccomp? > > > > Userfaultfd file descriptors will be generally controlled by SELinux. > > You can see the patchset at > > https://lore.kernel.org/lkml/20200401213903.182112-3-dancol@google.com/ > > (which is also referenced in the original commit message for this > > patchset). However, the SELinux patchset doesn't include the ability > > to control FAULT_FLAG_USER / UFFD_USER_MODE_ONLY directly. > > > > SELinux already has the ability to control who gets CAP_SYS_PTRACE, > > which combined with this patch, is largely equivalent to direct > > UFFD_USER_MODE_ONLY checks. Additionally, with the SELinux patch > > above, movement of userfaultfd file descriptors can be mediated by > > SELinux, preventing one process from acquiring userfaultfd descriptors > > of other processes unless allowed by security policy. > > > > It's an interesting question whether finer-grain SELinux support for > > controlling UFFD_USER_MODE_ONLY should be added. I can see some > > advantages to implementing this. However, we don't need to decide that > > now. > > > > Kernel security checks generally break down into DAC (discretionary > > access control) and MAC (mandatory access control) controls. Most > > kernel security features check via both of these mechanisms. Security > > attributes of the system should be settable without necessarily > > relying on an LSM such as SELinux. This patch follows the same basic > > model -- system wide control of a hardening feature is provided by the > > unprivileged_userfaultfd_user_mode_only sysctl (DAC), and if needed, > > SELinux support for this can also be implemented on top of the DAC > > controls. > > > > This DAC/MAC split has been successful in several other security > > features. For example, the ability to map at page zero is controlled > > in DAC via the mmap_min_addr sysctl [1], and via SELinux via the > > mmap_zero access vector [2]. Similarly, access to the kernel ring > > buffer is controlled both via DAC as the dmesg_restrict sysctl [3], as > > well as the SELinux syslog_read [2] check. Indeed, the dmesg_restrict > > sysctl is very similar to this patch -- it introduces a capability > > (CAP_SYSLOG, CAP_SYS_PTRACE) check on access to a sensitive resource. > > > > If we want to ensure that a security feature will be well tested and > > vetted, it's important to not limit its use to LSMs only. This ensures > > that kernel and application developers will always be able to test the > > effects of a security feature, without relying on LSMs like SELinux. > > It also ensures that all distributions can enable this security > > mitigation should it be necessary for their unique environments, > > without introducing an SELinux dependency. And this patch does not > > preclude an SELinux implementation should it be necessary. > > > > Even if we decide to implement fine-grain SELinux controls on > > UFFD_USER_MODE_ONLY, we still need this patch. We shouldn't make this > > an either/or choice between SELinux and this patch. Both are > > necessary. > > > > -- Nick > > > > [1] https://wiki.debian.org/mmap_min_addr > > [2] https://selinuxproject.org/page/NB_ObjectClassesPermissions > > [3] https://www.kernel.org/doc/Documentation/sysctl/kernel.txt > > I am not sure I agree this is similar to dmesg access. > > The reason I say it is this: it is pretty easy for admins to know > whether they run something that needs to access the kernel ring buffer. > Or if it's a tool developer poking at dmesg, they can tell admins "we > need these permissions". But it seems impossible for either an admin to > know that a userfaultfd page e.g. used with shared memory is accessed > from the kernel. > > So I guess the question is: how does anyone not running Android > know to set this flag? > > I got the feeling it's not really possible, and so for a single-user > feature like this a single API seems enough. Given a choice between a > knob an admin is supposed to set and selinux policy written by > presumably knowledgeable OS vendors, I'd opt for a second option. > > Hope this helps. > There has been an emphasis that Android is probably the only user for the restriction of userfaults from kernel-space and that it wouldn=E2=80=99= t be useful anywhere else. I humbly disagree! There are various areas where the PROT_NONE+SIGSEGV trick is (and can be) used in a purely user-space setting. Basically, any lazy, on-demand, initialization/decompression/loading could be a good candidate for this trick. My project happens to be one of them. In fact, in Android we are also thinking of using it in some other places, all in user-space. And given that userfaultfd is an efficient replacement for this trick [1], there are various scenarios which would benefit from the restriction of userfaults from kernel-space, provided the admins care about security on such devices. IIUC, a security admin would never trust an unprivileged process with userfaults from kernel space. Therefore, a sysctl knob restriction with CAP_SYS_PTRACE for privileged processes seems like the right choice to me. Coming to sysctl vs. SELinux debate, I think wherever the role of OS vendor and admin is played by different people, I doubt a generic SELinux policy set by the former will be blindly acceptable to the latter. Furthermore, I=E2=80=99m not sure if an admin is expected to even k= now which packages running on their system are using userfaultfd. So they anyway have to rely on developers reaching out to get the required permission. With the new sysctl knob enabled, the number of such requests is only going to decrease. [1] https://www.kernel.org/doc/Documentation/vm/userfaultfd.txt > > > > > > > > > > > > > > > > > > > > > > > > > > > If answer is "no" the behavior of the new sysctl in patch 2/2= (in > > > > > > > subject) should be enforceable with minor changes to the BPF > > > > > > > assembly. Otherwise it'd require more changes. > > > > > > > > It would be good to understand what these changes are. > > > > > > > > > > > Why exactly is it preferable to enlarge the surface of attack= of the > > > > > > > kernel and take the risk there is a real bug in userfaultfd c= ode (not > > > > > > > just a facilitation of exploiting some other kernel bug) that= leads to > > > > > > > a privilege escalation, when you still break 99% of userfault= fd users, > > > > > > > if you set with option "2"? > > > > > > > > I can see your point if you think about the feature as a whole. > > > > However, distributions (such as Android) have specialized knowledge= of > > > > their security environments, and may not want to support the typica= l > > > > usages of userfaultfd. For such distributions, providing a mechanis= m > > > > to prevent userfaultfd from being useful as an exploit primitive, > > > > while still allowing the very limited use of userfaultfd for usersp= ace > > > > faults only, is desirable. Distributions shouldn't be forced into > > > > supporting 100% of the use cases envisioned by userfaultfd when the= ir > > > > needs may be more specialized, and this sysctl knob empowers > > > > distributions to make this choice for themselves. > > > > > > > > > > > Is the system owner really going to purely run on his systems= CRIU > > > > > > > postcopy live migration (which already runs with CAP_SYS_PTRA= CE) and > > > > > > > nothing else that could break? > > > > > > > > This is a great example of a capability which a distribution may no= t > > > > want to support, due to distribution specific security policies. > > > > > > > > > > > > > > > > > > Option "2" to me looks with a single possible user, and incid= entally > > > > > > > this single user can already enforce model "2" by only tweaki= ng its > > > > > > > seccomp-bpf filters without applying 2/2. It'd be a bug if an= droid > > > > > > > apps runs unprotected by seccomp regardless of 2/2. > > > > > > > > Can you elaborate on what bug is present by processes being > > > > unprotected by seccomp? > > > > > > > > Seccomp cannot be universally applied on Android due to previously > > > > mentioned performance concerns. Seccomp is used in Android primaril= y > > > > as a tool to enforce the list of allowed syscalls, so that such > > > > syscalls can be audited before being included as part of the Androi= d > > > > API. > > > > > > > > -- Nick > > > > > > > > -- > > > > Nick Kralevich | nnk@google.com > > > > > > > > > -- > > Nick Kralevich | nnk@google.com >