References: <20200520195134.GK26186@redhat.com>
    <20200520211634.GL26186@redhat.com>
    <20200724093852-mutt-send-email-mst@kernel.org>
    <20200806004351-mutt-send-email-mst@kernel.org>
    <20200904033438.GI9411@redhat.com>
From: Nick Kralevich
Date: Sat, 19 Sep 2020 11:14:03 -0700
Subject: Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only
To: Lokesh Gidra
Cc: Andrea Arcangeli, "Michael S. Tsirkin", Jeffrey Vander Stoep,
    Suren Baghdasaryan, Kees Cook, Jonathan Corbet, Alexander Viro,
    Luis Chamberlain, Iurii Zaikin, Mauro Carvalho Chehab,
    Andrew Morton, Andy Shevchenko, Vlastimil Babka, Mel Gorman,
    Sebastian Andrzej Siewior, Peter Xu, Mike Rapoport, Jerome Glisse,
    Shaohua Li, linux-doc@vger.kernel.org, LKML, Linux FS Devel,
    Tim Murray, Minchan Kim, Sandeep Patil, kernel@android.com,
    Daniel Colascione, Kalesh Singh, "Dr. David Alan Gilbert",
    Tetsuo Handa, Dmitry Vyukov
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Sep 4, 2020 at 5:36 PM Lokesh Gidra wrote:
>
> On Thu, Sep 3, 2020 at 8:34 PM Andrea Arcangeli wrote:
> >
> > 1) why don't you enforce the block of kernel initiated faults with
> > seccomp-bpf instead of adding a sysctl value 2?
> > Is the sysctl just
> > an optimization to remove a few instructions per syscall in the bpf
> > execution of Android unprivileged apps? You should block a lot of
> > other syscalls by default to all unprivileged processes, including
> > vmsplice.
> >
> > In other words if it's just for Android, why can't Android solve it
> > with only patch 1/2 by tweaking the seccomp filter?
>
> I would let Nick (nnk@) and Jeff (jeffv@) respond to this.
>
> The previous responses from both of them on this email thread
> (https://lore.kernel.org/lkml/CABXk95A-E4NYqA5qVrPgDF18YW-z4_udzLwa0cdo2OfqVsy=SQ@mail.gmail.com/
> and https://lore.kernel.org/lkml/CAFJ0LnGfrzvVgtyZQ+UqRM6F3M7iXOhTkUBTc+9sV+=RrFntyQ@mail.gmail.com/)
> suggest that the performance overhead of seccomp-bpf is too much. Kees
> also objected to it
> (https://lore.kernel.org/lkml/202005200921.2BD5A0ADD@keescook/)
>
> I'm not familiar with how seccomp-bpf works. All that I can add here
> is that userfaultfd syscall is usually not invoked in a performance
> critical code path. So, if the performance overhead of seccomp-bpf (if
> enabled) is observed on all syscalls originating from a process, then
> I'd say patch 2/2 is essential. Otherwise, it should be ok to let
> seccomp perform the same functionality instead.
>

There are two primary reasons why seccomp isn't viable here.

1) Seccomp was never designed for whole-of-system protections, and is
impractical to deploy for anything other than "leaf" processes.

2) Attempts to enable seccomp on Android have run into performance
problems, even for trivial seccomp filters.

Let's go into each one.

Issue #1: Seccomp was never designed for whole-of-system protections,
and is impractical to deploy for anything other than "leaf" processes.

Andrea suggests deploying a seccomp filter purely focused on Android
unprivileged[1] (third party installed) apps. However, the intention
is for this security control to be used system-wide[2].
Only processes which have a need for kernel initiated faults should be
allowed to use them; all other processes should be denied by default.
And when I say "all" processes, I mean "all" processes, even those
which run with UID=0. Andrea's proposal is akin to a denylist, whereas
many modern distributions (such as Android) use allowlists.

The seemingly obvious solution is to apply a global seccomp filter in
init (PID=1), but it falls down in practice. Seccomp is an incredibly
useful tool, but it wasn't designed to be applied system-wide. Seccomp
is fundamentally hierarchical in nature. A seccomp filter, once
applied, cannot be subsequently relaxed or removed in child processes.
While this restriction is great for leaf processes, it causes problems
for OS designers - a parent process must maintain an unused capability
if any process in the parent's process tree uses that capability. This
makes applying a userfaultfd seccomp filter in init impossible, since
we expect a few of init's children (but not init itself or most of
init's children) to use userfaultfd kernel faults. We end up back at a
whack-a-mole (denylist) problem of trying to modify each individual
process to block userfaultfd kernel faults, defeating the goals of
system-wide protection, and introducing significant complexity into
the system design.

Seccomp should be used in the context where it provides the most value
-- process leaf nodes. But trying to apply seccomp as a system-wide
control just isn't viable.

Lokesh's sysctl proposal doesn't have these problems. When the sysctl
is set to 2 by the OS distributor, all processes which don't have
CAP_SYS_PTRACE are denied kernel generated faults, making the system
safe-by-default. Only processes which are on the OS distributor's
CAP_SYS_PTRACE allowlist (see Android's allowlist at [3]) can generate
these faults, and capabilities can be managed without regard to
process hierarchy. This keeps the system minimally privileged and
safe.
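
To make the contrast concrete, here is a toy Python model of the two
approaches. It is only a sketch, not kernel code: the Process class,
the "uffd_kernel_faults" label and both helper functions are
hypothetical names invented for illustration, and only sysctl value 2
as described above is modelled.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical labels used only in this sketch.
UFFD_KERNEL_FAULTS = "uffd_kernel_faults"
CAP_SYS_PTRACE = "CAP_SYS_PTRACE"


@dataclass
class Process:
    name: str
    parent: Optional["Process"] = None
    own_blocked: frozenset = frozenset()  # what this process's own filter blocks
    caps: frozenset = frozenset()         # capabilities granted to this process

    def seccomp_blocked(self) -> set:
        """Filters accumulate down the tree and are never relaxed."""
        blocked = set(self.own_blocked)
        ancestor = self.parent
        while ancestor is not None:
            blocked |= ancestor.own_blocked
            ancestor = ancestor.parent
        return blocked


def seccomp_allows(proc: Process, operation: str) -> bool:
    """Hierarchical model: anything blocked by an ancestor stays blocked."""
    return operation not in proc.seccomp_blocked()


def sysctl_allows_kernel_faults(proc: Process, sysctl_value: int) -> bool:
    """Per-process model of sysctl value 2: decided by capability alone."""
    if sysctl_value != 2:
        raise ValueError("only sysctl value 2 is modelled in this sketch")
    return CAP_SYS_PTRACE in proc.caps


# Seccomp filter installed in init: every descendant inherits the block,
# including the one daemon that legitimately needs kernel faults.
filtered_init = Process("init", own_blocked=frozenset({UFFD_KERNEL_FAULTS}))
filtered_daemon = Process("daemon", parent=filtered_init)
filtered_app = Process("app", parent=filtered_init)

# Sysctl set to 2: the decision depends only on each process's capabilities.
init = Process("init")
daemon = Process("daemon", parent=init, caps=frozenset({CAP_SYS_PTRACE}))
app = Process("app", parent=init)
```

In this model seccomp_allows(filtered_daemon, UFFD_KERNEL_FAULTS) is
False even though the daemon needs the feature, whereas
sysctl_allows_kernel_faults(daemon, 2) is True while init and app are
both denied.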
Seccomp isn't a viable solution here.

Issue #2: Attempts to enable seccomp on Android globally have run into
performance problems, even for trivial seccomp filters.

Android has tried a few times to enable seccomp globally, but even
excluding the above-mentioned hierarchical process problems, we've
seen performance regressions across the board. Imposing a seccomp
filter merely for userfaultfd imposes a tax on every syscall, even if
the process never makes use of userfaultfd. Lokesh's sysctl proposal
avoids this tax and places the check where it's most effective, with
the rest of the userfaultfd functionality.

See also the threads that Lokesh mentioned above:

* https://lore.kernel.org/lkml/CABXk95A-E4NYqA5qVrPgDF18YW-z4_udzLwa0cdo2OfqVsy=SQ@mail.gmail.com/
* https://lore.kernel.org/lkml/CAFJ0LnGfrzvVgtyZQ+UqRM6F3M7iXOhTkUBTc+9sV+=RrFntyQ@mail.gmail.com/
* https://lore.kernel.org/lkml/202005200921.2BD5A0ADD@keescook/

Thanks,
-- Nick

[1] The use of the term "unprivileged" is unfortunate. In Android,
there's no coarse-grained privileged vs unprivileged process. Each
process, including root processes, has only the privileges it needs,
and not a bit more. As a concrete example, Android's init process
(PID=1) is not allowed to open TCP/UDP sockets, but is allowed to
spawn children which can do so. Having each process be differently
privileged, and ensuring that functionality is only given out on a
need-to-have basis, is an important part of modern OS design.

[2] The trend in modern exploits isn't to perform attacks directly
from untrusted code to the kernel. A lot of the attack surface needed
by an attacker isn't reachable directly from untrusted code, but only
indirectly through other processes. The attacker moves laterally
through the system, exploiting a process which has the necessary
capabilities, then escalating to the kernel.
Enforcing security controls system-wide is an important part of
denying an attacker the tools for an effective exploit and preventing
this kind of lateral movement from being useful. Denying an attacker
access to kernel initiated faults in userfaultfd system-wide (except
for authorized processes) is doubly important, as these kinds of
faults are extremely valuable to an exploit writer (see explanation at
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cefdca0a86be517bc390fc4541e3674b8e7803b0
or https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit)

[3] https://android.googlesource.com/platform/system/sepolicy/+/7be9e9e372c70a5518f729a0cdcb0d39a28be377/private/domain.te#107
line 107

--
Nick Kralevich | nnk@google.com