Date: Fri, 1 Apr 2022 14:59:25 +0000
From: Quentin Perret
To: Andy Lutomirski
Cc: Sean Christopherson, Steven Price, Chao Peng, kvm list,
    Linux Kernel Mailing List, linux-mm@kvack.org,
    linux-fsdevel@vger.kernel.org, Linux API, qemu-devel@nongnu.org,
    Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
    Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
    Borislav Petkov, the arch/x86 maintainers, "H. Peter Anvin",
    Hugh Dickins, Jeff Layton, "J . Bruce Fields", Andrew Morton,
    Mike Rapoport, "Maciej S . Szmigiero", Vlastimil Babka,
    Vishal Annapurve, Yu Zhang, "Kirill A. Shutemov", "Nakajima, Jun",
    Dave Hansen, Andi Kleen, David Hildenbrand, Marc Zyngier,
    Will Deacon
Subject: Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting
    KVM guest private memory
References: <20220310140911.50924-1-chao.p.peng@linux.intel.com>
    <88620519-029e-342b-0a85-ce2a20eaf41b@arm.com>
    <80aad2f9-9612-4e87-a27a-755d3fa97c92@www.fastmail.com>
In-Reply-To: <80aad2f9-9612-4e87-a27a-755d3fa97c92@www.fastmail.com>

On Thursday 31 Mar 2022 at 09:04:56 (-0700), Andy Lutomirski wrote:
> On Wed, Mar 30, 2022, at 10:58 AM, Sean Christopherson wrote:
> > On Wed, Mar 30, 2022, Quentin Perret wrote:
> >> On Wednesday 30 Mar 2022 at 09:58:27 (+0100), Steven Price wrote:
> >> > On 29/03/2022 18:01, Quentin Perret wrote:
> >> > > Is implicit sharing a thing? E.g., if a guest makes a memory access in
> >> > > the shared gpa range at an address that doesn't have a backing memslot,
> >> > > will KVM check whether there is a corresponding private memslot at the
> >> > > right offset with a hole punched and report a KVM_EXIT_MEMORY_ERROR? Or
> >> > > would that just generate an MMIO exit as usual?
> >> >
> >> > My understanding is that the guest needs some way of tagging whether a
> >> > page is expected to be shared or private. On the architectures I'm aware
> >> > of this is done by effectively stealing a bit from the IPA space and
> >> > pretending it's a flag bit.
> >>
> >> Right, and that is in fact the main point of divergence we have I think.
> >> While I understand this might be necessary for TDX and the likes, this
> >> makes little sense for pKVM. This would effectively embed into the IPA a
> >> purely software-defined non-architectural property/protocol although we
> >> don't actually need to: we (pKVM) can reasonably expect the guest to
> >> explicitly issue hypercalls to share pages in-place. So I'd be really
> >> keen to avoid baking in assumptions about that model too deep in the
> >> host mm bits if at all possible.
> >
> > There is no assumption about stealing PA bits baked into this API. Even within
> > x86 KVM, I consider it a hard requirement that the common flows not assume the
> > private vs. shared information is communicated through the PA.
>
> Quentin, I think we might need a clarification. The API in this patchset indeed has no requirement that a PA bit distinguish between private and shared, but I think it makes at least a weak assumption that *something*, a priori, distinguishes them. In particular, there are private memslots and shared memslots, so the logical flow of resolving a guest memory access looks like:
>
> 1. guest accesses a GVA
>
> 2. read guest paging structures
>
> 3. determine whether this is a shared or private access
>
> 4. read host (KVM memslots and anything else, EPT, NPT, RMP, etc) structures accordingly. In particular, the memslot to reference is different depending on the access type.
>
> For TDX, this maps on to the fd-based model perfectly: the host-side paging structures for the shared and private slots are completely separate. For SEV, the structures are shared and KVM will need to figure out what to do in case a private and shared memslot overlap. Presumably it's sufficient to declare that one of them wins, although actually determining which one is active for a given GPA may involve checking whether the backing store for a given page actually exists.
>
> But I don't understand pKVM well enough to understand how it fits in. Quentin, how is the shared vs private mode of a memory access determined? How do the paging structures work? Can a guest switch between shared and private by issuing a hypercall without changing any guest-side paging structures or anything else?

My apologies, I've indeed shared very few details about how pKVM
works. We'll be posting patches upstream really soon that will hopefully
help with this, but in the meantime, here is the idea.

pKVM is designed around MMU-based protection as opposed to encryption as
is the case for many confidential computing solutions. It's probably
worth mentioning that, although it targets arm64, pKVM is distinct from
the Arm CCA stuff and requires no fancy hardware extensions -- it is
applicable all the way back to Arm v8.0 which makes it an interesting
solution for mobile.

Another particularity of the pKVM approach is that the code of the
hypervisor itself lives in the kernel source tree (see
arch/arm64/kvm/hyp/nvhe/). The hypervisor is built with the rest of the
kernel but as a self-sufficient object, and ends up in its own dedicated
ELF section (.hyp.*) in the kernel image. The main requirement for pKVM
(and KVM on arm64 in general) is to have the bootloader enter the kernel
at the hypervisor exception level (a.k.a. EL2). The boot procedure is a
bit involved, but eventually the hypervisor object is installed at EL2,
and the kernel is deprivileged to EL1 and proceeds to boot. From that
point on the hypervisor no longer trusts the kernel and will enable the
stage-2 MMU to impose access-control restrictions on all memory accesses
from the host.

All that to say: the pKVM approach offers a great deal of flexibility
when it comes to hypervisor behaviour. We have control over the
hypervisor code and can change it as we see fit.
Since both the hypervisor and the host kernel are part of the same
image, the ABI between them is very much *not* stable and can be
adjusted to whatever makes the most sense. So, I think we'd be quite
keen to use that flexibility to align some of the pKVM behaviours with
other players (TDX, SEV, CCA), especially when it comes to host mm APIs.
But that flexibility also means we can do some things a bit better (e.g.
pKVM can handle illegal accesses from the host mostly fine -- the
hypervisor can re-inject the fault into the host) so I would definitely
like to use this to our advantage and not be held back by unrelated
constraints.

To answer your original question about memory 'conversion', the key
thing is that the pKVM hypervisor controls the stage-2 page-tables for
everyone in the system, all guests as well as the host. As such, a page
'conversion' is nothing more than a permission change in the relevant
page-tables. The typical flow is as follows:

 - the host asks the hypervisor to run a guest;

 - the hypervisor does the context switch, which includes switching
   stage-2 page-tables;

 - initially the guest has an empty stage-2 (we don't require
   pre-faulting everything), which means it'll immediately fault;

 - the hypervisor switches back to host context to handle the guest
   fault;

 - the host handler finds the corresponding memslot and does the
   ipa->hva conversion.
   In our current implementation it uses a longterm GUP pin on the
   corresponding page;

 - once it has a page, the host handler issues a hypercall to donate the
   page to the guest;

 - the hypervisor does a bunch of checks to make sure the host owns the
   page, and if all is fine it will unmap it from the host stage-2 and
   map it in the guest stage-2, and do some bookkeeping as it needs to
   track page ownership, etc;

 - the guest can then proceed to run, and possibly faults in many more
   pages;

 - when it wants to, the guest can then issue a hypercall to share a
   page back with the host;

 - the hypervisor checks the request, maps the page back in the host
   stage-2, does more bookkeeping and returns to the host to notify it
   of the share;

 - the host kernel at that point can exit back to userspace to relay
   that information to the VMM;

 - rinse and repeat.

We currently don't allow the host to punch holes in the guest IPA space.
Once it has donated a page to a guest, it can't have it back until the
guest has been entirely torn down (at which point all of its memory is
poisoned by the hypervisor, obviously). But we could certainly
reconsider that part. OTOH, I'm still inclined to think that in-place
sharing is desirable. In our case it's dirt cheap, and could even work
on huge pages, which would allow very efficient sharing of large amounts
of data. So, I'm a bit hesitant to use the private-fd approach as-is
since it's not immediately obvious how we'll ever be able to reconcile
these things if mmap-ing the fd is a firm no. With that said, I don't
think our *current* use-cases have a strong need for this, so I mostly
agree with Sean's point earlier. But since we're talking about
committing to a userspace ABI, I would feel better if there was a clear
path towards having support for in-place sharing -- I can certainly see
it being useful. I'll think about it, but if folks have ideas in the
meantime I'll be happy to discuss.

I hope the above was useful and clears up the confusion.

Thanks,
Quentin
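
For illustration only, the donate/share flow listed above can be
modelled as a toy page-ownership state machine. This is a sketch in
Python, not pKVM code: the names ToyHypervisor, donate_to_guest,
share_with_host, host_can_access and teardown_guest are all
hypothetical, pages are plain integers, and the "shared" state assumes
a shared page stays mapped in the guest while also being mapped in the
host. It only captures the rules stated above: the host can donate
only pages it owns, a donated page is not host-accessible until the
guest shares it, and the host gets pages back only at guest teardown,
after poisoning.

```python
from enum import Enum


class Owner(Enum):
    HOST = "host"            # mapped only in the host stage-2
    GUEST = "guest"          # mapped only in the guest stage-2
    GUEST_SHARED = "shared"  # guest-owned, also mapped in the host stage-2


class ToyHypervisor:
    """Toy model of stage-2 ownership tracking (all names hypothetical)."""

    def __init__(self, host_pages):
        # Every page starts out owned by the host.
        self.state = {pfn: Owner.HOST for pfn in host_pages}
        self.poisoned = set()

    def donate_to_guest(self, pfn):
        # Host hypercall: only pages the host currently owns can be donated.
        if self.state.get(pfn) is not Owner.HOST:
            raise PermissionError(f"host does not own page {pfn}")
        # Models "unmap from the host stage-2, map in the guest stage-2".
        self.state[pfn] = Owner.GUEST

    def share_with_host(self, pfn):
        # Guest hypercall: map a guest-private page back in the host stage-2.
        if self.state.get(pfn) is not Owner.GUEST:
            raise PermissionError(f"page {pfn} is not private to the guest")
        self.state[pfn] = Owner.GUEST_SHARED

    def host_can_access(self, pfn):
        # The host may touch pages it owns or pages the guest has shared.
        return self.state.get(pfn) in (Owner.HOST, Owner.GUEST_SHARED)

    def teardown_guest(self):
        # The host only gets pages back here, and only after poisoning.
        for pfn, owner in self.state.items():
            if owner is not Owner.HOST:
                self.poisoned.add(pfn)
                self.state[pfn] = Owner.HOST
```

Note that the model deliberately has no "host reclaims a page" call:
that mirrors the current restriction on punching holes in the guest IPA
space, and is the part that would change if that restriction were
reconsidered.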