Received: by 2002:a05:6a10:22f:0:0:0:0 with SMTP id 15csp1710973pxk; Fri, 4 Sep 2020 17:38:52 -0700 (PDT) X-Google-Smtp-Source: ABdhPJxb4zSHlO1ix/zl2WDYRsr91yCANSOaj6FMZxBjbFtYlxqLArXVyPlpXX/kk3f2eV2VyTgq X-Received: by 2002:a17:906:d9da:: with SMTP id qk26mr9965572ejb.435.1599266332047; Fri, 04 Sep 2020 17:38:52 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1599266332; cv=none; d=google.com; s=arc-20160816; b=y5hbT+Kj4yVlOzoiWvHM1m+aDJH+fIgziCdXVrxoHmugJWmNB1My7FxhBiKDdnc2wL KERyxeK8s6ErK7Z1djn6KLiPKJ61Svvwz0ElFd1/9cIDXUMkkSqf5luOdY8TPln2X+Wq IN+KGuEZyT4vUiJ/mQ4PHk3YTkoP/1R2uX5bxhVMp9KGXTft9y79VirY6z34kRu2VDLg 4EpZP2SURQsJmh7rBu/+tGKW0wP0zoTOeyx0E/xNG/q7za4qyAonOsyu+CdEwdpY2CkL lJipZWoauDWkEiRAO+71C7nkmJufuMVBd6YleUnS3jk/C5b0DgrCk7wpEa9mbeZNC5Hb Ey5A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:cc:to:subject :message-id:date:from:in-reply-to:references:mime-version :dkim-signature; bh=MIVXqzuKXJr11dfoTHnTlrj+2SwhBNnLLWr83/0KD3w=; b=np2Bk+p89909uojCPGfKTeL3WzGkBx7kvKIwsfWONk+9qn86H0OEjseNc2DyZD8ef6 K8Ikcr+tr6dSgCrlYJyhY6kQTDjmlUSD2clX1bY58oL2H0VPamp08tWnC1YEEc25M+0t 3RiyK5O+P7qpMkBhAaYQvBdnf+kwKVHaescupW7jOvGVUNGyN9TU45Sfqqk5esKooTQO sn7FBnjJgYxKolirPYTl8G40L+gIStoAZAcbpMaU3+QXAejbkhuMk531OBrDKIbF94b/ mwsOBQLjQnqsvOcV7OOT1+UNYapNLdWIADCHY2LOVYe28VjzuPsBeCfVm4T4k7FXn9hK n82A== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=E1y2EOsE; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id mb24si5050937ejb.583.2020.09.04.17.38.27; Fri, 04 Sep 2020 17:38:52 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=E1y2EOsE; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727860AbgIEAgR (ORCPT + 99 others); Fri, 4 Sep 2020 20:36:17 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40826 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726208AbgIEAgQ (ORCPT ); Fri, 4 Sep 2020 20:36:16 -0400 Received: from mail-io1-xd42.google.com (mail-io1-xd42.google.com [IPv6:2607:f8b0:4864:20::d42]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1E803C061244 for ; Fri, 4 Sep 2020 17:36:15 -0700 (PDT) Received: by mail-io1-xd42.google.com with SMTP id m17so8884400ioo.1 for ; Fri, 04 Sep 2020 17:36:15 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc:content-transfer-encoding; bh=MIVXqzuKXJr11dfoTHnTlrj+2SwhBNnLLWr83/0KD3w=; b=E1y2EOsETk3pbdJ6/iRUlu7O0cJxQvNJIexWcxc141uC4ftHh2EdX/3IBHJl/mgpox +evW4Te59Tfu4p7IYZm9RL9vW1U2LXnqbpRr10oMGX1NGMRhKF09mvOZvd8EC56mjJCa KiFXsjTz4muznj2ugtZnS0NGZyEzOMK8Zv29Iw+ThzgRjM/c5VQWdrrhIh+wlJbrQeeW 1Lf9+/fvyMhxTPBF+qtwn6TgwCt1RE7sfl+FmP2vaFVSP0++cw0m1cOIpSOmka4nYI8F aHjaRZkBmzzhrmiCSGbWvbOwG5dQeKDyRKTsOLaFVNbXY6S1I/JjEc6CULIjK9WCJOgO Facg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=MIVXqzuKXJr11dfoTHnTlrj+2SwhBNnLLWr83/0KD3w=; b=AdORsVcUnw3y0AHn8bw3ssPL2APWyPRlON83LeYdgT/mLsnxsZWp1IF4zdzghvdmfd oKQVOBOusmxhi5UVwn66cvGcXN/YaVrSMqFYN/yAbh5RUE4h7w98S7rXeHLjXfjVIPcK GMixhF7BNNSCWBqyss8tlNyWb0r0xNBq4jvkJcx+264ZisDTE7OjaOMJlHzTYdLIDfvL ihe+1Q1CHJLCB1s3z1sUJVCxYs8J6/M/GCdu03fDMWjQUGReBYxd7BlqF4+QwHAIqPJS 1/YN1S0Kwukf+svdrid1l2Mi1WEj3disskBJAainvNJ/3Guqe0O3l5f6R1+5tybPdWef qORg== X-Gm-Message-State: AOAM530cfZcNcXd46gLFgZZ7cF5yvlkVZ3K4w+WKRnKjwOdWeKDlk1uV uZ0ZYJuW86yswRh9hpik9ZuAf9hotQ+DKQyaZQoStA== X-Received: by 2002:a02:780e:: with SMTP id p14mr2360306jac.144.1599266173913; Fri, 04 Sep 2020 17:36:13 -0700 (PDT) MIME-Version: 1.0 References: <20200520195134.GK26186@redhat.com> <20200520211634.GL26186@redhat.com> <20200724093852-mutt-send-email-mst@kernel.org> <20200806004351-mutt-send-email-mst@kernel.org> <20200904033438.GI9411@redhat.com> In-Reply-To: <20200904033438.GI9411@redhat.com> From: Lokesh Gidra Date: Fri, 4 Sep 2020 17:36:02 -0700 Message-ID: Subject: Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only To: Andrea Arcangeli Cc: "Michael S. Tsirkin" , Nick Kralevich , Jeffrey Vander Stoep , Suren Baghdasaryan , Kees Cook , Jonathan Corbet , Alexander Viro , Luis Chamberlain , Iurii Zaikin , Mauro Carvalho Chehab , Andrew Morton , Andy Shevchenko , Vlastimil Babka , Mel Gorman , Sebastian Andrzej Siewior , Peter Xu , Mike Rapoport , Jerome Glisse , Shaohua Li , linux-doc@vger.kernel.org, LKML , Linux FS Devel , Tim Murray , Minchan Kim , Sandeep Patil , kernel@android.com, Daniel Colascione , Kalesh Singh , "Dr. David Alan Gilbert" , Tetsuo Handa , Dmitry Vyukov Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Sep 3, 2020 at 8:34 PM Andrea Arcangeli wrote= : > > Hello, > > On Mon, Aug 17, 2020 at 03:11:16PM -0700, Lokesh Gidra wrote: > > There has been an emphasis that Android is probably the only user for > > the restriction of userfaults from kernel-space and that it wouldn=E2= =80=99t > > be useful anywhere else. I humbly disagree! There are various areas > > where the PROT_NONE+SIGSEGV trick is (and can be) used in a purely > > user-space setting. Basically, any lazy, on-demand, > > For the record what I said is quoted below > https://lkml.kernel.org/r/20200520194804.GJ26186@redhat.com : > > """It all boils down of how peculiar it is to be able to leverage only > the acceleration [..] Right now there's a single user that can cope > with that limitation [..] If there will be more users [..] it'd be > fine to add a value "2" later.""" > > Specifically I never said "that it wouldn=E2=80=99t be useful anywhere el= se.". > Thanks a lot for clarifying. > Also I'm only arguing about the sysctl visible kABI change in patch > 2/2: the flag passed as parameter to the syscall in patch 1/2 is all > great, because seccomp needs it in the scalar parameter of the syscall > to implement a filter equivalent to your sysctl "2" policy with only > patch 1/2 applied. > > I've two more questions now: > > 1) why don't you enforce the block of kernel initiated faults with > seccomp-bpf instead of adding a sysctl value 2? Is the sysctl just > an optimization to remove a few instructions per syscall in the bpf > execution of Android unprivileged apps? You should block a lot of > other syscalls by default to all unprivileged processes, including > vmsplice. > > In other words if it's just for Android, why can't Android solve it > with only patch 1/2 by tweaking the seccomp filter? I would let Nick (nnk@) and Jeff (jeffv@) respond to this. The previous responses from both of them on this email thread (https://lore.kernel.org/lkml/CABXk95A-E4NYqA5qVrPgDF18YW-z4_udzLwa0cdo2Ofq= Vsy=3DSQ@mail.gmail.com/ and https://lore.kernel.org/lkml/CAFJ0LnGfrzvVgtyZQ+UqRM6F3M7iXOhTkUBTc+9sV= +=3DRrFntyQ@mail.gmail.com/) suggest that the performance overhead of seccomp-bpf is too much. Kees also objected to it (https://lore.kernel.org/lkml/202005200921.2BD5A0ADD@keescook/) I'm not familiar with how seccomp-bpf works. All that I can add here is that userfaultfd syscall is usually not invoked in a performance critical code path. So, if the performance overhead of seccomp-bpf (if enabled) is observed on all syscalls originating from a process, then I'd say patch 2/2 is essential. Otherwise, it should be ok to let seccomp perform the same functionality instead. > > 2) given that Android is secure enough with the sysctl at value 2, why > should we even retain the current sysctl 0 semantics? Why can't > more secure systems just use seccomp and block userfaultfd, as it > is already happens by default in the podman default seccomp > whitelist (for those containers that don't define a new json > whitelist in the OCI schema)? Shouldn't we focus our energy in > making containers more secure by preventing the OCI schema of a > random container to re-enable userfaultfd in the container seccomp > filter instead of trying to solve this with a global sysctl? > > What's missing in my view is a kubernetes hard allowlist/denylist > that cannot be overridden with the OCI schema in case people has > the bad idea of running containers downloaded from a not fully > trusted source, without adding virt isolation and that's an > userland problem to be solved in the container runtime, not a > kernel issue. Then you'd just add userfaultfd to the json of the > k8s hard seccomp denylist instead of going around tweaking sysctl. > > What's your take in changing your 2/2 patch to just replace value "0" > and avoid introducing a new value "2"? SGTM. Disabling uffd completely for unprivileged processes can be achieved either using seccomp-bpf, or via SELinux, once the following patch series is upstreamed https://lore.kernel.org/lkml/20200827063522.2563293-1-lokeshgidra@google.co= m/ > > The value "0" was motivated by the concern that uffd can enlarge the > race window for use after free by providing one more additional way to > block kernel faults, but value "2" is already enough to solve that > concern completely and it'll be the default on all Android. > > In other words by adding "2" you're effectively doing a more > finegrined and more optimal implementation of "0" that remains useful > and available to unprivileged apps and it already resolves all > "robustness against side effects other kernel bugs" concerns. Clearly > "0" is even more secure statistically but that would apply to every > other syscall including vmsplice, and there's no > /proc/sys/vm/unprivileged_vmsplice sysctl out there. > > The next issue we have now is with the pipe mutex (which is not a > major concern but we need to solve it somehow for correctness). So I > wonder if should make the default value to be "0" (or "2" if think we > should not replace "0") and to allow only user initiated faults by > default. > > Changing the sysctl default to be 0, will make live migration fail to > switch to postcopy which will be (unnoticeable to the guest), instead > of risking the VM to be killed because of network latency > outlier. Then we wouldn't need to change the pipe code at all. > SGTM. I can change the default value to '0' (or '2') in the next revision of patch 2/2, unless somebody objects to this. > Alternatively we could still fix the pipe code so it runs better (but > it'll be more complex) or to disable uffd faults only in the pipe > code. > > One thing to keep in mind is that if we change the default, then > hypervisor hosts running QEMU would need to set: > > vm.userfaultfd =3D 1 > > in /etc/sysctl.conf if postcopy live migration is required, that's not > particularly concerning constraint for qemu (there are likely other > tweaks required and it looks less risky than an arbitrary timeout > which could kill the VM: if the above is forgotten the postcopy live > migration won't even start and it'll be unnoticeable to the guest). > > The main concern really are future apps that may want to use uffd for > kernel initiated faults won't be allowed to do so by default anymore, > those apps will be heavily incentivated to use bounce buffers before > passing data to syscalls, similarly to the current use case of patch 2/2. > > Comments welcome, > Andrea > > PS. Another usage of uffd that remains possible without privilege with > the 2/2 patch sysctl "2" behavior (besides the strict SIGSEGV > acceleration) is the UFFD_FEATURE_SIGBUS. That's good so a malloc lib > will remain possible without requiring extra privileges, by adding a > UFFDIO_POPULATE to use in combination with UFFD_FEATURE_SIGBUS > (UFFDIO_POPULATE just needs to zero out a page and map it, it'll be > indistinguishable to UFFDIO_ZEROPAGE but it will solve the last > performance bottleneck by avoiding a wrprotect fault after the > allocation and it will be THP capable too). Memory will be freed with > MADV_DONTNEED, without ever having to call mmap/mumap. It could move > memory around with UFFDIO_COPY+MADV_DONTNEED or by adding UFFDIO_REMAP > which already exists. > UFFDIO_POPULATE sounds like a really useful feature. I don't see it in the kernel yet. Is there a patch under work on this? If so, kindly share.