From: Fuad Tabba
Date: Mon, 26 Sep 2022 16:51:44 +0100
Subject: Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
To: Chao Peng
Cc: Sean Christopherson, David Hildenbrand, kvm@vger.kernel.org,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
    linux-doc@vger.kernel.org, qemu-devel@nongnu.org, Paolo Bonzini,
    Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
    Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
    x86@kernel.org, H. Peter Anvin, Hugh Dickins, Jeff Layton,
    J. Bruce Fields, Andrew Morton, Shuah Khan, Mike Rapoport,
    Steven Price, Maciej S. Szmigiero, Vlastimil Babka,
    Vishal Annapurve, Yu Zhang, Kirill A. Shutemov, luto@kernel.org,
    jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com,
    aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com,
    Quentin Perret, Michael Roth, mhocko@suse.com, Muchun Song,
    wei.w.wang@intel.com, Will Deacon, Marc Zyngier
In-Reply-To: <20220926142330.GC2658254@chaop.bj.intel.com>
References: <20220915142913.2213336-1-chao.p.peng@linux.intel.com>
    <20220915142913.2213336-2-chao.p.peng@linux.intel.com>
    <20220926142330.GC2658254@chaop.bj.intel.com>

Hi,

On Mon, Sep 26, 2022 at 3:28 PM Chao Peng wrote:
>
> On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
> > > Regarding pKVM's use case, with the shim approach I believe this can
> > > be done by allowing userspace to mmap() the "hidden" memfd, but with
> > > a ton of restrictions piled on top.
> > >
> > > My first thought was to make the uAPI a set of KVM ioctls so that KVM
> > > could tightly control usage without taking on too much complexity in
> > > the kernel, but working through things, routing the behavior through
> > > the shim itself might not be all that horrific.
> > >
> > > IIRC, we discarded the idea of allowing userspace to map the
> > > "private" fd because things got too complex, but with the shim it
> > > doesn't seem _that_ bad.
> > >
> > > E.g. on the memfd side:
> > >
> > >   1. The entire memfd must be mapped, and at most one mapping is
> > >      allowed, i.e. mapping is all or nothing.
> > >
> > >   2. Acquiring a reference via get_pfn() is disallowed if there's a
> > >      mapping for the restricted memfd.
> > >
> > >   3. Add notifier hooks to allow downstream users to further restrict
> > >      things.
> > >
> > >   4. Disallow splitting VMAs, e.g. to force userspace to munmap()
> > >      everything in one shot.
> > >
> > >   5. Require that there are no outstanding references at munmap().
> > >      Or if this can't be guaranteed by userspace, maybe add some way
> > >      for userspace to wait until it's ok to convert to private? E.g.
> > >      so that get_pfn() doesn't need to do an expensive check every
> > >      time.
> > >
> > > static int memfd_restricted_mmap(struct file *file, struct vm_area_struct *vma)
> > > {
> > > 	if (vma->vm_pgoff)
> > > 		return -EINVAL;
> > >
> > > 	if ((vma->vm_end - vma->vm_start) != <size>)
> > > 		return -EINVAL;
> > >
> > > 	mutex_lock(&data->lock);
> > >
> > > 	if (data->has_mapping) {
> > > 		r = -EINVAL;
> > > 		goto err;
> > > 	}
> > > 	list_for_each_entry(notifier, &data->notifiers, list) {
> > > 		r = notifier->ops->mmap_start(notifier, ...);
> > > 		if (r)
> > > 			goto abort;
> > > 	}
> > >
> > > 	notifier->ops->mmap_end(notifier, ...);
> > > 	mutex_unlock(&data->lock);
> > > 	return 0;
> > >
> > > abort:
> > > 	list_for_each_entry_continue_reverse(notifier, &data->notifiers, list)
> > > 		notifier->ops->mmap_abort(notifier, ...);
> > > err:
> > > 	mutex_unlock(&data->lock);
> > > 	return r;
> > > }
> > >
> > > static void memfd_restricted_close(struct vm_area_struct *vma)
> > > {
> > > 	mutex_lock(...);
> > >
> > > 	/*
> > > 	 * Destroy the memfd and disable all future accesses if there are
> > > 	 * outstanding refcounts (or other unsatisfied restrictions?).
> > > 	 */
> > > 	if (<outstanding refcounts> || ???)
> > > 		memfd_restricted_destroy(...);
> > > 	else
> > > 		data->has_mapping = false;
> > >
> > > 	mutex_unlock(...);
> > > }
> > >
> > > static int memfd_restricted_may_split(struct vm_area_struct *area, unsigned long addr)
> > > {
> > > 	return -EINVAL;
> > > }
> > >
> > > static int memfd_restricted_mapping_mremap(struct vm_area_struct *new_vma)
> > > {
> > > 	return -EINVAL;
> > > }
> > >
> > > Then on the KVM side, its mmap_start() + mmap_end() sequence would:
> > >
> > >   1. Not be supported for TDX or SEV-SNP because they don't allow
> > >      adding non-zero memory into the guest (after the pre-boot
> > >      phase).
> > >
> > >   2. Be mutually exclusive with shared<=>private conversions, and be
> > >      allowed if and only if the entire gfn range of the associated
> > >      memslot is shared.
> >
> > In general I think that this would work with pKVM. However, limiting
> > private<->shared conversions to the granularity of a whole memslot
> > might be difficult to handle in pKVM, since the guest doesn't have the
> > concept of memslots. For example, in pKVM right now, when a guest
> > shares back its restricted DMA pool with the host, it does so at the
> > page level. pKVM would also need a way to make an fd accessible again
> > when shared back, which I think isn't possible with this patch.
>
> But does pKVM really want to mmap/munmap a new region at the page
> level? That can cause VMA fragmentation if the conversion is frequent,
> as I see it. Even with a KVM ioctl for mapping, as mentioned below, I
> think there will be the same issue.

pKVM doesn't really need to unmap the memory. What is really important
is that the memory is not GUP'able. Having private memory mapped and
then accessed by a misbehaving/malicious process will reinject a fault
into the misbehaving process.

Cheers,
/fuad

> >
> > You were initially considering a KVM ioctl for mapping, which might be
> > better suited for this since KVM knows which pages are shared and
> > which ones are private. So routing things through KVM might simplify
> > things and allow it to enforce all the necessary restrictions (e.g.,
> > private memory cannot be mapped). What do you think?
> >
> > Thanks,
> > /fuad