Received: by 2002:a05:622a:251a:b0:39a:b4a2:e86 with SMTP id cm26csp558340qtb; Wed, 19 Oct 2022 08:53:23 -0700 (PDT) X-Google-Smtp-Source: AMsMyM553xGMJsTj8aWhP9S52LYiBXIBNFq01EVWNj23sFYcVJS+bhUEdfnODFNhkZtiikY0E9SQ X-Received: by 2002:a17:90b:3e8d:b0:20d:2ab9:517f with SMTP id rj13-20020a17090b3e8d00b0020d2ab9517fmr10286993pjb.88.1666194802738; Wed, 19 Oct 2022 08:53:22 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1666194802; cv=none; d=google.com; s=arc-20160816; b=UuBmjXLzmYKAciKWEIaTzobO9M6+T6rK7Z95X/XcbNm93E6ZfF42FN/fX69a2sK25H Z6lXxBOp0+YTUngZMIvOLX2LKdjFz2jU/zdJBKPxpucmTjp+twZ05fKPHSQQKuHxOEKS 9J/fufOrz6Zw49FHALIDkZK/HS80QkQIcLzmQHrdqzXE1je4FaNSkWRABcAedI5mLvLI HqJod7rvmc50iXGEJD8M75UUuIoic+Aoq5ozNA7mURmkKQioCGjKM2cmTtVLz0WJM1nG eJ0AtOROwfTtQBLmDIJlz/wiYlAuf69ON75V/7/sG3J560ke0crJe7kJf0ZpHkCEu1LM 5O6w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:subject:message-id:date:from:in-reply-to :references:mime-version:dkim-signature; bh=GQN3xPWEz80R9HLK7JV5q12hLSAfmmt3AkrUC4EVJho=; b=NN0bLSygLjDA58geJvSUcnX3rnBx8eWlWNWDiiHs2peRKN77m7GAK8ZaSW5pQE41Gc NWyhpH/CmDfUy7htFJEFElC6jq/UuU+PYn0FqB+DnBT5X17h5D8TmrEq1ReyBYkMFQpX s/huSY2vrLK4Ht+NRs8OmdBATXKqq+9DZ62o+zII3WL15bKVDuZ/xLV4GxhlkMN8QjIe jMivmrjachwcvpvJQOOK8czEQppuV34tXxIZUhQkILQe8ifcswAH0H3o4qI7xNuwhUZm KbujdoZnrnBI7jKb69SbIqndzhUik7BWnKQOxHKpVM33yvwKLQIUuhmm0XZyWO5KZFJz TXHA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=ZtzKb9bf; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id c20-20020a170902849400b001748ede4e05si18188374plo.119.2022.10.19.08.53.09; Wed, 19 Oct 2022 08:53:22 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=ZtzKb9bf; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232363AbiJSPNd (ORCPT + 99 others); Wed, 19 Oct 2022 11:13:33 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46936 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232301AbiJSPNL (ORCPT ); Wed, 19 Oct 2022 11:13:11 -0400 Received: from mail-lj1-x22e.google.com (mail-lj1-x22e.google.com [IPv6:2a00:1450:4864:20::22e]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id AFFD1BC1 for ; Wed, 19 Oct 2022 08:05:40 -0700 (PDT) Received: by mail-lj1-x22e.google.com with SMTP id r22so22537750ljn.10 for ; Wed, 19 Oct 2022 08:05:40 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=GQN3xPWEz80R9HLK7JV5q12hLSAfmmt3AkrUC4EVJho=; b=ZtzKb9bf0kl/pILB/MreCGWs3HY3xz/J5znIIt+ih1I+DS9Baz4ZCKTyXqVnYQOAS+ V6QVBNa/A5tGi+XgQ3wQ/nMzWhSb4oPpDy5DVP/oTOR1D99/6mXkdRjTjPGkgyFCVd/k MeCz2GQlahKw0B4SPorBsy6aaMo1Y8vetsa5D14CxosC/gFGr1U7i/O2pf1WWTzAYw+t ISKJm3BRcC3fYJeGm6J60yWx9ZOpYS1Z+ehehirudx28YZ914nLWPcDEakMmtkEMsxra 1zFnLTP6AZbcwQ6m5YMxlfTE2slt6esDIpPTrJAG7863klJALjIVm0v020eIkoRGYSkV O4Ew== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=GQN3xPWEz80R9HLK7JV5q12hLSAfmmt3AkrUC4EVJho=; b=XoDBXM6UL2JQdgbKt6fj8peZVo0qbTHJ9xnXHmkFQY7APksqaQIO8JVZSthoAdvtzJ yCu9VIO02FY4EQALQMVHneyTrR8GCgmBi5UjqQEa9goIn3e1e6GlN0X30A6LoAhA+Q4L rXKVu3oZR+VcqSMWzQcLCL3xoWmWu6mqb7DtJEnZih50/zzNyg3fPZ/lHv9oFOFLjJ4r TkYhiFF1Z8SqXA2dbaxwR2XYx2FXWaIjuYZB3FF+mF7OlucYjSHWtFR0k15kie9gxSC6 fMzzJXRWaKEOjhx6EvgCK7ZUuVf44yWmeuR8kpHbSuIol5hZ4rOvZ0cMQZAX5/SwuVR7 FbUw== X-Gm-Message-State: ACrzQf1ATVjHfb4Z/W8ExutaL4qSDZIufoaDXwVYtCqQ7LZaj/zTzulF UhAgmw8vJ0jwLyl5DJloIpCmDF7h3vbAernMvyzM0Q== X-Received: by 2002:a2e:bd12:0:b0:264:7373:3668 with SMTP id n18-20020a2ebd12000000b0026473733668mr2834642ljq.18.1666191934004; Wed, 19 Oct 2022 08:05:34 -0700 (PDT) MIME-Version: 1.0 References: <20220915142913.2213336-1-chao.p.peng@linux.intel.com> <20220915142913.2213336-2-chao.p.peng@linux.intel.com> <20220926142330.GC2658254@chaop.bj.intel.com> In-Reply-To: From: Fuad Tabba Date: Wed, 19 Oct 2022 16:04:57 +0100 Message-ID: Subject: Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd To: Sean Christopherson Cc: Chao Peng , David Hildenbrand , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-doc@vger.kernel.org, qemu-devel@nongnu.org, Paolo Bonzini , Jonathan Corbet , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Shuah Khan , Mike Rapoport , Steven Price , "Maciej S . Szmigiero" , Vlastimil Babka , Vishal Annapurve , Yu Zhang , "Kirill A . Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com, Quentin Perret , Michael Roth , mhocko@suse.com, Muchun Song , wei.w.wang@intel.com, Will Deacon , Marc Zyngier Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-17.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF, ENV_AND_HDR_SPF_MATCH,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS, USER_IN_DEF_DKIM_WL,USER_IN_DEF_SPF_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi, On Tue, Oct 18, 2022 at 1:34 AM Sean Christopherson wrote: > > On Fri, Sep 30, 2022, Fuad Tabba wrote: > > > > > > pKVM would also need a way to make an fd accessible again > > > > > > when shared back, which I think isn't possible with this patch. > > > > > > > > > > But does pKVM really want to mmap/munmap a new region at the page-level, > > > > > that can cause VMA fragmentation if the conversion is frequent as I see. > > > > > Even with a KVM ioctl for mapping as mentioned below, I think there will > > > > > be the same issue. > > > > > > > > pKVM doesn't really need to unmap the memory. What is really important > > > > is that the memory is not GUP'able. > > > > > > Well, not entirely unguppable, just unguppable without a magic FOLL_* flag, > > > otherwise KVM wouldn't be able to get the PFN to map into guest memory. > > > > > > The problem is that gup() and "mapped" are tied together. So yes, pKVM doesn't > > > strictly need to unmap memory _in the untrusted host_, but since mapped==guppable, > > > the end result is the same. > > > > > > Emphasis above because pKVM still needs unmap the memory _somehwere_. IIUC, the > > > current approach is to do that only in the stage-2 page tables, i.e. only in the > > > context of the hypervisor. Which is also the source of the gup() problems; the > > > untrusted kernel is blissfully unaware that the memory is inaccessible. > > > > > > Any approach that moves some of that information into the untrusted kernel so that > > > the kernel can protect itself will incur fragmentation in the VMAs. Well, unless > > > all of guest memory becomes unguppable, but that's likely not a viable option. > > > > Actually, for pKVM, there is no need for the guest memory to be GUP'able at > > all if we use the new inaccessible_get_pfn(). > > Ya, I was referring to pKVM without UPM / inaccessible memory. > > Jumping back to blocking gup(), what about using the same tricks as secretmem to > block gup()? E.g. compare vm_ops to block regular gup() and a_ops to block fast > gup() on struct page? With a Kconfig that's selected by pKVM (which would also > need its own Kconfig), e.g. CONFIG_INACCESSIBLE_MAPPABLE_MEM, there would be zero > performance overhead for non-pKVM kernels, i.e. hooking gup() shouldn't be > controversial. > > I suspect the fast gup() path could even be optimized to avoid the page_mapping() > lookup by adding a PG_inaccessible flag that's defined iff the TBD Kconfig is > selected. I'm guessing pKVM isn't expected to be deployed on massivve NUMA systems > anytime soon, so there should be plenty of page flags to go around. > > Blocking gup() instead of trying to play refcount games when converting back to > private would eliminate the need to put heavy restrictions on mapping, as the goal > of those were purely to simplify the KVM implementation, e.g. the "one mapping per > memslot" thing would go away entirely. My implementation of mmap for inaccessible_fops was setting VM_PFNMAP. That said, I realized that that might be adding an unnecessary restriction, and now have changed it to do it the secretmem way. That's straightforward and works well. > > This of course goes back to what I'd mentioned before in v7; it seems that > > representing the memslot memory as a file descriptor should be orthogonal to > > whether the memory is shared or private, rather than a private_fd for private > > memory and the userspace_addr for shared memory. > > I also explored the idea of backing any guest memory with an fd, but came to > the conclusion that private memory needs a separate handle[1], at least on x86. > > For SNP and TDX, even though the GPA is the same (ignoring the fact that SNP and > TDX steal GPA bits to differentiate private vs. shared), the two types need to be > treated as separate mappings[2]. Post-boot, converting is lossy in both directions, > so even conceptually they are two disctint pages that just happen to share (some) > GPA bits. > > To allow conversions, i.e. changing which mapping to use, without memslot updates, > KVM needs to let userspace provide both mappings in a single memslot. So while > fd-based memory is an orthogonal concept, e.g. we could add fd-based shared memory, > KVM would still need a dedicated private handle. > > For pKVM, the fd doesn't strictly need to be mutually exclusive with the existing > userspace_addr, but since the private_fd is going to be added for x86, I think it > makes sense to use that instead of adding generic fd-based memory for pKVM's use > case (which is arguably still "private" memory but with special semantics). > > [1] https://lore.kernel.org/all/YulTH7bL4MwT5v5K@google.com > [2] https://lore.kernel.org/all/869622df-5bf6-0fbb-cac4-34c6ae7df119@kernel.org As long as the API does not impose this limit, which would imply pKVM is misusing it, then I agree. I think that's why renaming it to something like "restricted" might be clearer. Thanks, /fuad