References: <20231105163040.14904-1-pbonzini@redhat.com>
    <20231105163040.14904-19-pbonzini@redhat.com>
In-Reply-To: <20231105163040.14904-19-pbonzini@redhat.com>
From: Fuad Tabba
Date: Mon, 6 Nov 2023 10:54:30 +0000
Subject: Re: [PATCH 18/34] KVM: x86/mmu: Handle page fault for private memory
To: Paolo Bonzini
Cc: Marc Zyngier, Oliver Upton, Huacai Chen, Michael Ellerman, Anup Patel,
    Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
    Alexander Viro, Christian Brauner, "Matthew Wilcox (Oracle)",
    Andrew Morton, kvm@vger.kernel.org,
    linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
    linux-mips@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
    kvm-riscv@lists.infradead.org, linux-riscv@lists.infradead.org,
    linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, Xiaoyao Li, Xu Yilun, Chao Peng,
    Jarkko Sakkinen, Anish Moorthy, David Matlack, Yu Zhang, Isaku Yamahata,
    Mickaël Salaün, Vlastimil Babka, Vishal Annapurve, Ackerley Tng,
    Maciej Szmigiero, David Hildenbrand, Quentin Perret, Michael Roth, Wang,
    Liam Merwick, Isaku Yamahata, "Kirill A. Shutemov"
X-Mailing-List: linux-kernel@vger.kernel.org

Hi,

On Sun, Nov 5, 2023 at 4:33 PM Paolo Bonzini wrote:
>
> From: Chao Peng
>
> Add support for resolving page faults on guest private memory for VMs
> that differentiate between "shared" and "private" memory. For such VMs,
> KVM_MEM_PRIVATE memslots can include both fd-based private memory and

KVM_MEM_PRIVATE -> KVM_MEM_GUEST_MEMFD

Cheers,
/fuad

> hva-based shared memory, and KVM needs to map in the "correct" variant,
> i.e. KVM needs to map the gfn shared/private as appropriate based on the
> current state of the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE flag.
>
> For AMD's SEV-SNP and Intel's TDX, the guest effectively gets to request
> shared vs. private via a bit in the guest page tables, i.e. what the guest
> wants may conflict with the current memory attributes. To support such
> "implicit" conversion requests, exit to user with KVM_EXIT_MEMORY_FAULT
> to forward the request to userspace. Add a new flag for memory faults,
> KVM_MEMORY_EXIT_FLAG_PRIVATE, to communicate whether the guest wants to
> map memory as shared vs. private.
>
> Like KVM_MEMORY_ATTRIBUTE_PRIVATE, use bit 3 for flagging private memory
> so that KVM can use bits 0-2 for capturing RWX behavior if/when userspace
> needs such information, e.g. a likely user of KVM_EXIT_MEMORY_FAULT is to
> exit on missing mappings when handling guest page fault VM-Exits. In
> that case, userspace will want to know RWX information in order to
> correctly/precisely resolve the fault.
>
> Note, private memory *must* be backed by guest_memfd, i.e. shared mappings
> always come from the host userspace page tables, and private mappings
> always come from a guest_memfd instance.
>
> Co-developed-by: Yu Zhang
> Signed-off-by: Yu Zhang
> Signed-off-by: Chao Peng
> Co-developed-by: Sean Christopherson
> Signed-off-by: Sean Christopherson
> Reviewed-by: Fuad Tabba
> Tested-by: Fuad Tabba
> Message-Id: <20231027182217.3615211-21-seanjc@google.com>
> Signed-off-by: Paolo Bonzini
> ---
>  Documentation/virt/kvm/api.rst  |   8 ++-
>  arch/x86/kvm/mmu/mmu.c          | 101 ++++++++++++++++++++++++++++++--
>  arch/x86/kvm/mmu/mmu_internal.h |   1 +
>  include/linux/kvm_host.h        |   8 ++-
>  include/uapi/linux/kvm.h        |   1 +
>  5 files changed, 110 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 6d681f45969e..4a9a291380ad 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6953,6 +6953,7 @@ spec refer, https://github.com/riscv/riscv-sbi-doc.
>
>                  /* KVM_EXIT_MEMORY_FAULT */
>                  struct {
> +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 3)
>                          __u64 flags;
>                          __u64 gpa;
>                          __u64 size;
> @@ -6961,8 +6962,11 @@ spec refer, https://github.com/riscv/riscv-sbi-doc.
>  KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
>  could not be resolved by KVM. The 'gpa' and 'size' (in bytes) describe the
>  guest physical address range [gpa, gpa + size) of the fault. The 'flags' field
> -describes properties of the faulting access that are likely pertinent.
> -Currently, no flags are defined.
> +describes properties of the faulting access that are likely pertinent:
> +
> + - KVM_MEMORY_EXIT_FLAG_PRIVATE - When set, indicates the memory fault occurred
> +   on a private memory access. When clear, indicates the fault occurred on a
> +   shared access.
>
>  Note! KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
>  accompanies a return code of '-1', not '0'! errno will always be set to EFAULT
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index f5c6b0643645..754a5aaebee5 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3147,9 +3147,9 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
>          return level;
>  }
>
> -int kvm_mmu_max_mapping_level(struct kvm *kvm,
> -                              const struct kvm_memory_slot *slot, gfn_t gfn,
> -                              int max_level)
> +static int __kvm_mmu_max_mapping_level(struct kvm *kvm,
> +                                       const struct kvm_memory_slot *slot,
> +                                       gfn_t gfn, int max_level, bool is_private)
>  {
>          struct kvm_lpage_info *linfo;
>          int host_level;
> @@ -3161,6 +3161,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
>                          break;
>          }
>
> +        if (is_private)
> +                return max_level;
> +
>          if (max_level == PG_LEVEL_4K)
>                  return PG_LEVEL_4K;
>
> @@ -3168,6 +3171,16 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
>          return min(host_level, max_level);
>  }
>
> +int kvm_mmu_max_mapping_level(struct kvm *kvm,
> +                              const struct kvm_memory_slot *slot, gfn_t gfn,
> +                              int max_level)
> +{
> +        bool is_private = kvm_slot_can_be_private(slot) &&
> +                          kvm_mem_is_private(kvm, gfn);
> +
> +        return __kvm_mmu_max_mapping_level(kvm, slot, gfn, max_level, is_private);
> +}
> +
>  void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
>          struct kvm_memory_slot *slot = fault->slot;
> @@ -3188,8 +3201,9 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>           * Enforce the iTLB multihit workaround after capturing the requested
>           * level, which will be used to do precise, accurate accounting.
>           */
> -        fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> -                                                     fault->gfn, fault->max_level);
> +        fault->req_level = __kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> +                                                       fault->gfn, fault->max_level,
> +                                                       fault->is_private);
>          if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
>                  return;
>
> @@ -4269,6 +4283,55 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>          kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true, NULL);
>  }
>
> +static inline u8 kvm_max_level_for_order(int order)
> +{
> +        BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> +
> +        KVM_MMU_WARN_ON(order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G) &&
> +                        order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M) &&
> +                        order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_4K));
> +
> +        if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
> +                return PG_LEVEL_1G;
> +
> +        if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
> +                return PG_LEVEL_2M;
> +
> +        return PG_LEVEL_4K;
> +}
> +
> +static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> +                                              struct kvm_page_fault *fault)
> +{
> +        kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
> +                                      PAGE_SIZE, fault->write, fault->exec,
> +                                      fault->is_private);
> +}
> +
> +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> +                                   struct kvm_page_fault *fault)
> +{
> +        int max_order, r;
> +
> +        if (!kvm_slot_can_be_private(fault->slot)) {
> +                kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> +                return -EFAULT;
> +        }
> +
> +        r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
> +                             &max_order);
> +        if (r) {
> +                kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> +                return r;
> +        }
> +
> +        fault->max_level = min(kvm_max_level_for_order(max_order),
> +                               fault->max_level);
> +        fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
> +
> +        return RET_PF_CONTINUE;
> +}
> +
>  static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
>          struct kvm_memory_slot *slot = fault->slot;
> @@ -4301,6 +4364,14 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>                  return RET_PF_EMULATE;
>          }
>
> +        if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> +                kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> +                return -EFAULT;
> +        }
> +
> +        if (fault->is_private)
> +                return kvm_faultin_pfn_private(vcpu, fault);
> +
>          async = false;
>          fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
>                                            fault->write, &fault->map_writable,
> @@ -7188,6 +7259,26 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
>  }
>
>  #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> +bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
> +                                        struct kvm_gfn_range *range)
> +{
> +        /*
> +         * Zap SPTEs even if the slot can't be mapped PRIVATE. KVM x86 only
> +         * supports KVM_MEMORY_ATTRIBUTE_PRIVATE, and so it *seems* like KVM
> +         * can simply ignore such slots. But if userspace is making memory
> +         * PRIVATE, then KVM must prevent the guest from accessing the memory
> +         * as shared. And if userspace is making memory SHARED and this point
> +         * is reached, then at least one page within the range was previously
> +         * PRIVATE, i.e. the slot's possible hugepage ranges are changing.
> +         * Zapping SPTEs in this case ensures KVM will reassess whether or not
> +         * a hugepage can be used for affected ranges.
> +         */
> +        if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
> +                return false;
> +
> +        return kvm_unmap_gfn_range(kvm, range);
> +}
> +
>  static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
>                                  int level)
>  {
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index decc1f153669..86c7cb692786 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -201,6 +201,7 @@ struct kvm_page_fault {
>
>          /* Derived from mmu and global state. */
>          const bool is_tdp;
> +        const bool is_private;
>          const bool nx_huge_page_workaround_enabled;
>
>          /*
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index a6de526c0426..67dfd4d79529 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2357,14 +2357,18 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
>  #define KVM_DIRTY_RING_MAX_ENTRIES 65536
>
>  static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> -                                                 gpa_t gpa, gpa_t size)
> +                                                 gpa_t gpa, gpa_t size,
> +                                                 bool is_write, bool is_exec,
> +                                                 bool is_private)
>  {
>          vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
>          vcpu->run->memory_fault.gpa = gpa;
>          vcpu->run->memory_fault.size = size;
>
> -        /* Flags are not (yet) defined or communicated to userspace. */
> +        /* RWX flags are not (yet) defined or communicated to userspace. */
>          vcpu->run->memory_fault.flags = 0;
> +        if (is_private)
> +                vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
>  }
>
>  #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 2802d10aa88c..8eb10f560c69 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -535,6 +535,7 @@ struct kvm_run {
>                  } notify;
>                  /* KVM_EXIT_MEMORY_FAULT */
>                  struct {
> +#define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 3)
>                          __u64 flags;
>                          __u64 gpa;
>                          __u64 size;
> --
> 2.39.1
>
>
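
For readers following the uAPI side of the quoted patch, here is a minimal sketch of how a VMM might interpret the new flag after a KVM_EXIT_MEMORY_FAULT exit. It relies only on what the patch and commit message state: bit 3 of 'flags' marks a private access, KVM_MEMORY_ATTRIBUTE_PRIVATE uses the same bit, and bits 0-2 are left for RWX information that is not defined yet. Everything else is an assumption for illustration: struct memory_fault_info is a hypothetical stand-in for the memory_fault member of struct kvm_run, the SKETCH_* macros are local copies of the values rather than the real definitions from <linux/kvm.h>, and the helper names are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Local copies of the bit values given in the patch and commit message.
 * Assumption: a real VMM would use the definitions from <linux/kvm.h>.
 */
#define SKETCH_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 3)
#define SKETCH_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)

/* The commit message says both use bit 3; check that the copies agree. */
static_assert(SKETCH_MEMORY_EXIT_FLAG_PRIVATE == SKETCH_MEMORY_ATTRIBUTE_PRIVATE,
              "exit flag and memory attribute are expected to share bit 3");

/* Hypothetical stand-in for the memory_fault member of struct kvm_run. */
struct memory_fault_info {
        uint64_t flags;
        uint64_t gpa;
        uint64_t size;
};

/* True if the faulting guest access was a private access. */
static bool fault_is_private(const struct memory_fault_info *mf)
{
        return (mf->flags & SKETCH_MEMORY_EXIT_FLAG_PRIVATE) != 0;
}

/*
 * Attribute value a VMM could request for [gpa, gpa + size) to honor an
 * implicit conversion: private if the guest faulted on a private access,
 * shared (no attribute bits set) otherwise.
 */
static uint64_t attributes_for_fault(const struct memory_fault_info *mf)
{
        return fault_is_private(mf) ? SKETCH_MEMORY_ATTRIBUTE_PRIVATE : 0;
}
```

A VMM would then hand the returned value, together with gpa and size, to whatever mechanism it uses to update memory attributes. That step is left out here because it is not part of this patch.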