Received: by 2002:a05:6a10:a0d1:0:0:0:0 with SMTP id j17csp961586pxa; Wed, 5 Aug 2020 17:44:05 -0700 (PDT) X-Google-Smtp-Source: ABdhPJyO+73QdrOPxfGn3e84RbR+pacHCeXDpmTlvS79OtojokADZQDFKMjO4LWShKHpl9aER5Ze X-Received: by 2002:a50:e846:: with SMTP id k6mr1720172edn.27.1596674645310; Wed, 05 Aug 2020 17:44:05 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1596674645; cv=none; d=google.com; s=arc-20160816; b=II6+u1vG1suW2jxEAD2IEHSDJDNRivuVTFf/opACyfT5irkLsbFWseGlaUi9tIBYp8 ZocQj5oKaY8iJNhw94p9wsKuwIH+uShDsfLcMhM/X3krgNtewSKTEb4LS5GXhWQxAIv6 O09Sh7ezWCxxvxwOPUAE9M1Le6SxVDtnzDjfIXyE3HpQ2RrOQdSZvYvKI57SdI2ylZDh 0zkvIGmoyp5U92WWRD7oQx88J8OTiN4wf7gkNCXcdo5sUNYS1dpN/7+TYI4lwv7IZSGC bkcoiyKsCS0oxTLKkfYzoUYWr2FkQU8LypIWQIrmiePxb0RdjMFkMyULFKS00aRWF4CR 6PaA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=XqxrCBcebzwTBSlBbZQuXod3klRM9Vw/uQ+yD1TuH7k=; b=KASG0Ao1xJSUFCxuTg2K8ngTRvMh3Na/lcOM14FD/C7gydoBirydnJbCT5ttKytThq 7tc+bdKWyb8XNZKX2TrvI158xU2kKdHe5PQGszt1B1Kn7PItZ0f31al394aYBqMD/WJg uaHoD2qxp0LAQhYk9F4G2s0Pj2o4ETf0+nV8nqAF/mKQ5bP2KlVF2F6MqvpyXZzNmnve UWwyI6nFA3NnIgQcE2nq/939ryGeUyIoJX6YQCU5VWfQ5sv5v01HM7a7JRtofAoyhRM1 9Dnvrbwur5NpNqm/ooQMJhhsVtcMNe0S3RtMaMLZ7McaL71yLOOWuwcVPsyvtxEWJac1 RHwQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=pOlNC5FY; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id ga5si2278114ejb.689.2020.08.05.17.43.40; Wed, 05 Aug 2020 17:44:05 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=pOlNC5FY; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726797AbgHFAnX (ORCPT + 99 others); Wed, 5 Aug 2020 20:43:23 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35926 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726026AbgHFAnR (ORCPT ); Wed, 5 Aug 2020 20:43:17 -0400 Received: from mail-wr1-x444.google.com (mail-wr1-x444.google.com [IPv6:2a00:1450:4864:20::444]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 97B17C061575 for ; Wed, 5 Aug 2020 17:43:16 -0700 (PDT) Received: by mail-wr1-x444.google.com with SMTP id a15so42316149wrh.10 for ; Wed, 05 Aug 2020 17:43:16 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=XqxrCBcebzwTBSlBbZQuXod3klRM9Vw/uQ+yD1TuH7k=; b=pOlNC5FY3C1j9IhvS3DZJb3ce7tSIc2UV2rOiD19AqeDsDU6X2ZtoTUxi0h30RYqEK gYmhMR6IVLr2uD47iFLfmwma8Ek3R5C3GGzB976sl8id8ELyIIW3wAyBWhhLyNLO+Wd+ 6SwPLjCD9e35MNeMm/FsFQ+2lesscQoPVJdGgjT+3sjNQGE/gmTFjE0nW5hwe4BASOeV L+DVV3e4ZyL0aJjl7Rc8/0BhWFHXQEu4ZyOIItROM2xfIt7uJ+EimpiVziSllgi8FQKX 5wgoTsrhWUMIqK7EQND6bJdn1LIIWxZ8MPML/bzXGk8/lDZMI89Q3rrI5P/v7BPKKR+k DuLA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=XqxrCBcebzwTBSlBbZQuXod3klRM9Vw/uQ+yD1TuH7k=; b=K90omSEPI2DHXCtAe9D0a5wxvuAZXOt+ojqKe1F367iClhB5TUWZ4Ibzmx3Tpt6tER q0tniZPtth+DDl4/W7Cc0vk/YVMWSdxMhLj5yDyiRwVz5LWG5d10NfLSVcYx6pA4LDR8 5H+H3NYLZdIeNVbKd70udyDJkPRDpmRJrZFg8HRvp4jJbMWhWp9zBiMIklj5CAB3unFX ONtU/0cWk0M+XGinEhhDbaHX7+UMf06sjCatEkXaa8a6AuVd4EHnJW2c0+QeCV7JU94I 12MTS1uOepYeZnEb8+utkJfvetjFQb10SeskyNAz+Sun2mlkg1tUS2LGcYI4Xc0tk/J+ rVLg== X-Gm-Message-State: AOAM532YEyZoc0kdFLCYKRRWNhB8Ga0GulgWkJevlM5DGm1x7u44bmtK lTzajk2pVLg8eJ5URo2FBV9AwOnle+9aDkwSvjD4qg== X-Received: by 2002:a5d:5704:: with SMTP id a4mr4760122wrv.318.1596674594491; Wed, 05 Aug 2020 17:43:14 -0700 (PDT) MIME-Version: 1.0 References: <20200508125314-mutt-send-email-mst@kernel.org> <20200520045938.GC26186@redhat.com> <202005200921.2BD5A0ADD@keescook> <20200520194804.GJ26186@redhat.com> <20200520195134.GK26186@redhat.com> <20200520211634.GL26186@redhat.com> <20200724093852-mutt-send-email-mst@kernel.org> In-Reply-To: <20200724093852-mutt-send-email-mst@kernel.org> From: Nick Kralevich Date: Wed, 5 Aug 2020 17:43:02 -0700 Message-ID: Subject: Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only To: "Michael S. Tsirkin" Cc: Lokesh Gidra , Jeffrey Vander Stoep , Andrea Arcangeli , Suren Baghdasaryan , Kees Cook , Jonathan Corbet , Alexander Viro , Luis Chamberlain , Iurii Zaikin , Mauro Carvalho Chehab , Andrew Morton , Andy Shevchenko , Vlastimil Babka , Mel Gorman , Sebastian Andrzej Siewior , Peter Xu , Mike Rapoport , Jerome Glisse , Shaohua Li , linux-doc@vger.kernel.org, LKML , Linux FS Devel , Tim Murray , Minchan Kim , Sandeep Patil , kernel@android.com, Daniel Colascione , Kalesh Singh Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jul 24, 2020 at 6:40 AM Michael S. Tsirkin wrote: > > On Thu, Jul 23, 2020 at 05:13:28PM -0700, Nick Kralevich wrote: > > On Thu, Jul 23, 2020 at 10:30 AM Lokesh Gidra wrote: > > > From the discussion so far it seems that there is a consensus that > > > patch 1/2 in this series should be upstreamed in any case. Is there > > > anything that is pending on that patch? > > > > That's my reading of this thread too. > > > > > > > Unless I'm mistaken that you can already enforce bit 1 of the second > > > > > parameter of the userfaultfd syscall to be set with seccomp-bpf, this > > > > > would be more a question to the Android userland team. > > > > > > > > > > The question would be: does it ever happen that a seccomp filter isn't > > > > > already applied to unprivileged software running without > > > > > SYS_CAP_PTRACE capability? > > > > > > > > Yes. > > > > > > > > Android uses selinux as our primary sandboxing mechanism. We do use > > > > seccomp on a few processes, but we have found that it has a > > > > surprisingly high performance cost [1] on arm64 devices so turning it > > > > on system wide is not a good option. > > > > > > > > [1] https://lore.kernel.org/linux-security-module/202006011116.3F7109A@keescook/T/#m82ace19539ac595682affabdf652c0ffa5d27dad > > > > As Jeff mentioned, seccomp is used strategically on Android, but is > > not applied to all processes. It's too expensive and impractical when > > simpler implementations (such as this sysctl) can exist. It's also > > significantly simpler to test a sysctl value for correctness as > > opposed to a seccomp filter. > > Given that selinux is already used system-wide on Android, what is wrong > with using selinux to control userfaultfd as opposed to seccomp? Userfaultfd file descriptors will be generally controlled by SELinux. You can see the patchset at https://lore.kernel.org/lkml/20200401213903.182112-3-dancol@google.com/ (which is also referenced in the original commit message for this patchset). However, the SELinux patchset doesn't include the ability to control FAULT_FLAG_USER / UFFD_USER_MODE_ONLY directly. SELinux already has the ability to control who gets CAP_SYS_PTRACE, which combined with this patch, is largely equivalent to direct UFFD_USER_MODE_ONLY checks. Additionally, with the SELinux patch above, movement of userfaultfd file descriptors can be mediated by SELinux, preventing one process from acquiring userfaultfd descriptors of other processes unless allowed by security policy. It's an interesting question whether finer-grain SELinux support for controlling UFFD_USER_MODE_ONLY should be added. I can see some advantages to implementing this. However, we don't need to decide that now. Kernel security checks generally break down into DAC (discretionary access control) and MAC (mandatory access control) controls. Most kernel security features check via both of these mechanisms. Security attributes of the system should be settable without necessarily relying on an LSM such as SELinux. This patch follows the same basic model -- system wide control of a hardening feature is provided by the unprivileged_userfaultfd_user_mode_only sysctl (DAC), and if needed, SELinux support for this can also be implemented on top of the DAC controls. This DAC/MAC split has been successful in several other security features. For example, the ability to map at page zero is controlled in DAC via the mmap_min_addr sysctl [1], and via SELinux via the mmap_zero access vector [2]. Similarly, access to the kernel ring buffer is controlled both via DAC as the dmesg_restrict sysctl [3], as well as the SELinux syslog_read [2] check. Indeed, the dmesg_restrict sysctl is very similar to this patch -- it introduces a capability (CAP_SYSLOG, CAP_SYS_PTRACE) check on access to a sensitive resource. If we want to ensure that a security feature will be well tested and vetted, it's important to not limit its use to LSMs only. This ensures that kernel and application developers will always be able to test the effects of a security feature, without relying on LSMs like SELinux. It also ensures that all distributions can enable this security mitigation should it be necessary for their unique environments, without introducing an SELinux dependency. And this patch does not preclude an SELinux implementation should it be necessary. Even if we decide to implement fine-grain SELinux controls on UFFD_USER_MODE_ONLY, we still need this patch. We shouldn't make this an either/or choice between SELinux and this patch. Both are necessary. -- Nick [1] https://wiki.debian.org/mmap_min_addr [2] https://selinuxproject.org/page/NB_ObjectClassesPermissions [3] https://www.kernel.org/doc/Documentation/sysctl/kernel.txt > > > > > > > > > > > > > > > > > If answer is "no" the behavior of the new sysctl in patch 2/2 (in > > > > > subject) should be enforceable with minor changes to the BPF > > > > > assembly. Otherwise it'd require more changes. > > > > It would be good to understand what these changes are. > > > > > > > Why exactly is it preferable to enlarge the surface of attack of the > > > > > kernel and take the risk there is a real bug in userfaultfd code (not > > > > > just a facilitation of exploiting some other kernel bug) that leads to > > > > > a privilege escalation, when you still break 99% of userfaultfd users, > > > > > if you set with option "2"? > > > > I can see your point if you think about the feature as a whole. > > However, distributions (such as Android) have specialized knowledge of > > their security environments, and may not want to support the typical > > usages of userfaultfd. For such distributions, providing a mechanism > > to prevent userfaultfd from being useful as an exploit primitive, > > while still allowing the very limited use of userfaultfd for userspace > > faults only, is desirable. Distributions shouldn't be forced into > > supporting 100% of the use cases envisioned by userfaultfd when their > > needs may be more specialized, and this sysctl knob empowers > > distributions to make this choice for themselves. > > > > > > > Is the system owner really going to purely run on his systems CRIU > > > > > postcopy live migration (which already runs with CAP_SYS_PTRACE) and > > > > > nothing else that could break? > > > > This is a great example of a capability which a distribution may not > > want to support, due to distribution specific security policies. > > > > > > > > > > > > Option "2" to me looks with a single possible user, and incidentally > > > > > this single user can already enforce model "2" by only tweaking its > > > > > seccomp-bpf filters without applying 2/2. It'd be a bug if android > > > > > apps runs unprotected by seccomp regardless of 2/2. > > > > Can you elaborate on what bug is present by processes being > > unprotected by seccomp? > > > > Seccomp cannot be universally applied on Android due to previously > > mentioned performance concerns. Seccomp is used in Android primarily > > as a tool to enforce the list of allowed syscalls, so that such > > syscalls can be audited before being included as part of the Android > > API. > > > > -- Nick > > > > -- > > Nick Kralevich | nnk@google.com > -- Nick Kralevich | nnk@google.com