From: Axel Rasmussen
Date: Tue, 19 Jul 2022 15:45:37 -0700
Subject: Re: [PATCH v4 2/5] userfaultfd: add /dev/userfaultfd for fine grained access control
To: Nadav Amit
Cc: Alexander Viro, Andrew Morton, Dave Hansen, "Dmitry V. Levin", Gleb Fotengauer-Malinovskiy, Hugh Dickins, Jan Kara, Jonathan Corbet, Mel Gorman, Mike Kravetz, Mike Rapoport, Peter Xu, Shuah Khan, Suren Baghdasaryan, Vlastimil Babka, zhangyi, "linux-doc@vger.kernel.org", linux-fsdevel, LKML, Linux MM, "linux-kselftest@vger.kernel.org"
References: <20220719195628.3415852-1-axelrasmussen@google.com> <20220719195628.3415852-3-axelrasmussen@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Jul 19, 2022 at 3:32 PM Nadav Amit wrote:
>
> On Jul 19, 2022, at 12:56 PM, Axel Rasmussen wrote:
>
> > Historically, it has been shown that intercepting kernel faults with
> > userfaultfd (thereby forcing the kernel to wait for an arbitrary amount
> > of time) can be exploited, or at least can make some kinds of exploits
> > easier. So, in 37cd0575b8 "userfaultfd: add UFFD_USER_MODE_ONLY" we
> > changed things so, in order for kernel faults to be handled by
> > userfaultfd, either the process needs CAP_SYS_PTRACE, or this sysctl
> > must be configured so that any unprivileged user can do it.
> >
> > In a typical implementation of a hypervisor with live migration (take
> > QEMU/KVM as one such example), we do indeed need to be able to handle
> > kernel faults. But, both options above are less than ideal:
> >
> > - Toggling the sysctl increases attack surface by allowing any
> >   unprivileged user to do it.
> >
> > - Granting the live migration process CAP_SYS_PTRACE gives it this
> >   ability, but *also* the ability to "observe and control the
> >   execution of another process [...], and examine and change [its]
> >   memory and registers" (from ptrace(2)). This isn't something we need
> >   or want to be able to do, so granting this permission violates the
> >   "principle of least privilege".
> >
> > This is all a long winded way to say: we want a more fine-grained way to
> > grant access to userfaultfd, without granting other additional
> > permissions at the same time.
> >
> > To achieve this, add a /dev/userfaultfd misc device. This device
> > provides an alternative to the userfaultfd(2) syscall for the creation
> > of new userfaultfds. The idea is, any userfaultfds created this way will
> > be able to handle kernel faults, without the caller having any special
> > capabilities. Access to this mechanism is instead restricted using e.g.
> > standard filesystem permissions.
>
> Are there any other "devices" that when opened by different processes
> provide such isolated interfaces in each process? I.e., devices that if you
> read from them in different processes you get completely unrelated data?
> (putting aside namespaces).
>
> It all sounds so wrong to me, that I am going to try again to pushback
> (sorry).

No need to be sorry. :)

>
> From a semantic point of view - userfaultfd is process specific. It is
> therefore similar to /proc/[pid]/mem (or /proc/[pid]/pagemap and so on).
>
> So why can't we put it there? I saw that you argued against it in your
> cover-letter, and I think that your argument is you would need
> CAP_SYS_PTRACE if you want to access userfaultfd of other processes. But
> this is EXACTLY the way opening /proc/[pid]/mem is performed - see
> proc_mem_open().
>
> So instead of having some strange device that behaves differently in the
> context of each process, you can just have /proc/[pid]/userfaultfd and then
> use mm_access() to check if you have permissions to access userfaultfd (just
> like proc_mem_open() does). This would be more intuitive for users as it is
> similar to other /proc/[pid]/X, and would cover both local and remote
> use-cases.

Ah, so actually I find this argument much more compelling. I don't
find it persuasive that we should put it in /proc for the purpose of
supporting cross-process memory manipulation, because I think the
syscall works better for that, and in that case we don't mind
depending on CAP_SYS_PTRACE.

But, what you've argued here I do find persuasive. :) You are right, I
can't think of any other example of a device node in /dev that works
like this, where it is completely independent on a per-process basis.
The closest I could come up with was /dev/zero or /dev/null or
similar. You won't affect any other process by touching these, but I
don't think these are good examples.

I'll send a v5 which does this. I do worry that cross-process support
is probably complex to get right, so I might leave that out and only
allow a process to open its own device for now.
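
For readers following the thread, the access rules under discussion can be
summarized as a toy model. This is a minimal Python sketch, not kernel code
and not part of the patch. The names Caller, may_handle_kernel_faults,
can_open_node and sysctl_unprivileged are hypothetical, and the model assumes
the file-based path (whether the node ends up in /dev or under /proc/[pid])
is gated only by the caller's ability to open that node.

```python
from dataclasses import dataclass


@dataclass
class Caller:
    """Hypothetical description of the process asking for a userfaultfd."""
    has_cap_sys_ptrace: bool = False  # capability held by the process
    can_open_node: bool = False       # file permissions allow opening the node


def may_handle_kernel_faults(caller: Caller,
                             sysctl_unprivileged: bool,
                             via_node: bool) -> bool:
    """Toy model: may a userfaultfd created this way handle kernel faults?"""
    if via_node:
        # Proposed path: access is decided by permissions on the node alone
        # (an assumption of this sketch).
        return caller.can_open_node
    # Existing syscall path, as described in the quoted commit message:
    # either the process holds CAP_SYS_PTRACE or the sysctl allows everyone.
    return caller.has_cap_sys_ptrace or sysctl_unprivileged


if __name__ == "__main__":
    migration_proc = Caller(can_open_node=True)
    print(may_handle_kernel_faults(migration_proc, False, via_node=False))
    print(may_handle_kernel_faults(migration_proc, False, via_node=True))
```

The point of the model is that the file-based path lets an administrator
grant this one ability to a specific process without also granting the
wider powers that come with CAP_SYS_PTRACE, and without opening the sysctl
to every unprivileged user.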