2023-12-02 09:41:40

by Yan Zhao

[permalink] [raw]
Subject: [RFC PATCH 00/42] Sharing KVM TDP to IOMMU

This RFC series proposes a framework that shares the KVM TDP (Two
Dimensional Paging) page table with the IOMMU as its stage 2 paging
structure, so that IO page faults (IOPF) on the IOMMU's stage 2 paging
structure can be resolved through KVM.

Previously, all guest pages had to be pinned and mapped in the IOMMU stage 2
paging structures once a pass-through device was attached, even if the
device is IOPF-capable. Such all-guest-memory pinning can be avoided when
IOPF handling for the stage 2 paging structure is supported and only
IOPF-capable devices are attached to a VM.

There are 2 approaches to support IOPF on IOMMU stage 2 paging structures:
- Supporting IOPF in IOMMUFD/IOMMU alone
  IOMMUFD handles IO page faults on the stage-2 HWPT by calling GUP and
  then iommu_map() to set up IOVA mappings. (An IOAS is required to keep
  the GPA to HVA info, but page pinning/unpinning needs to be skipped.)
  Then, upon MMU notifier events on the host primary MMU, iommu_unmap() is
  called to adjust IOVA mappings accordingly.
  The IOMMU driver needs to support unmapping sub-ranges of a previously
  mapped range and to take care of huge page merge and split in an atomic
  way [1][2].

- Sharing KVM TDP
  IOMMUFD sets the root of the KVM TDP page table (EPT/NPT in x86) as the
  root of the IOMMU stage 2 paging structure, and routes IO page faults to
  KVM.
  (This assumes that the IOMMU hardware supports the same stage-2 page
  table format as the CPU.)
  In this model the page table is centrally managed by KVM (MMU notifier,
  page mapping, subpage unmapping, atomic huge page split/merge, etc.),
  while IOMMUFD only needs to invalidate the iotlb/devtlb properly.

Currently, there's no upstream code available to support stage 2 IOPF yet.

This RFC chooses to implement the "Sharing KVM TDP" approach, which has the
following main benefits:

- Unified page table management:
  The complexity of allocating guest pages per GPA, registering with the
  MMU notifier on the host primary MMU, sub-page unmapping, and atomic page
  merge/split only needs to be handled on the KVM side, which has been
  doing that well for a long time.

- Reduced page faults:
  Only one page fault is triggered for a single GPA, whether caused by IO
  access or by vCPU access. (Compared to one IO page fault for DMA plus one
  CPU page fault for vCPUs in the non-shared approach.)

- Reduced memory consumption:
  The memory for one page table is saved.


Design
==
In this series, the term "exported" is used in place of "shared" to avoid
confusion with the "shared EPT" terminology in TDX.

The framework contains 3 main objects:

"KVM TDP FD" object - The interface of KVM to export TDP page tables.
With this object, KVM allows external components to
access a TDP page table exported by KVM.

"IOMMUFD KVM HWPT" object - A proxy connecting KVM TDP FD to IOMMU driver.
This HWPT has no IOAS associated.

"KVM domain" in IOMMU driver - Stage 2 domain in IOMMU driver whose paging
structures are managed by KVM.
Its hardware TLB invalidation requests are
notified from KVM via IOMMUFD KVM HWPT
object.



        2. IOMMU_HWPT_ALLOC(fd)                     1. KVM_CREATE_TDP_FD
                              .------.
                       +------| QEMU |--------------------------------+
                       |      '------'<-------------+ fd              |
                       |                            |                 v
                       |                            |             .-------.
                       v                            |    create   |  KVM  |
              .------------------.            .------------.<-----'-------'
              | IOMMUFD KVM HWPT |            | KVM TDP FD |          |
              '------------------'            '------------'          |
                       |     kvm_tdp_fd_get(fd)     |                 |
                       |--------------------------->|                 |
IOMMU                  |                            |                 |
driver   alloc(meta)   |---------get meta---------->|                 |
.------------.<--------|                            |                 |
| KVM Domain |         |----register_importer------>|                 |
'------------'         |                            |                 |
      |                |                            |                 |
      |     3.         |                            |                 |
      |--iopf handler->|----------fault------------>|------map------->|
      |                |                            |       4.        |
      |<---invalidate--|<--------invalidate---------|<---TLB flush----|
      |                |                            |                 |
      |<-----free------|            5.              |                 |
                       |----unregister_importer---->|                 |
                       |                            |                 |
                       |--------------------------->|                 |
                              kvm_tdp_fd_put()


1. QEMU calls KVM_CREATE_TDP_FD to create a TDP FD object.
   The address space must be specified to identify the exported TDP page
   table (e.g. system memory or SMM mode system memory on x86).

2. QEMU calls IOMMU_HWPT_ALLOC to create a KVM-type HWPT.
   The KVM-type HWPT is created upon an exported KVM TDP FD (rather than
   upon an IOAS), acting as the proxy between the KVM TDP and the IOMMU
   driver:
   - Obtain a reference on the exported KVM TDP FD.
   - Get and pass meta data of the KVM TDP page tables to the IOMMU driver
     for KVM domain allocation.
   - Register importer callbacks to KVM for invalidation notification.
   - Register an IOPF handler for the IOMMU's KVM domain.

   Upon device attachment, the root HPA of the exported TDP page table is
   installed into the IOMMU hardware.

3. When an IO page fault arrives, the IOMMUFD fault handler forwards it to
   KVM.

4. When KVM performs a TLB flush, it notifies all importers of the KVM TDP
   FD object. The IOMMUFD KVM HWPT, as an importer, passes the notification
   on to the IOMMU driver for hardware TLB invalidations.

5. On destruction of the IOMMUFD KVM HWPT, it frees the IOMMU's KVM domain,
   unregisters itself as an importer from the KVM TDP FD object and puts
   the reference count of the KVM TDP FD object.
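
To make the flow above concrete, here is a rough, hypothetical sketch of
how an importer is expected to consume the KVM TDP FD interface added in
patch 01 (the real IOMMUFD KVM HWPT implementation is in patches 12-16;
the error unwinding, the example GFN and the my_* names are made up purely
for illustration):

#include <linux/err.h>
#include <linux/kvm_tdp_fd.h>

static void my_invalidate(void *data, unsigned long start, unsigned long size)
{
	/* Step 4: propagate KVM's TLB flush to the IOMMU hardware TLBs. */
}

static struct kvm_tdp_importer_ops my_importer_ops = {
	.invalidate = my_invalidate,
};

static int my_import_kvm_tdp(int fd, struct mm_struct *mm)
{
	struct kvm_tdp_fault_type type = { .read = 1, .write = 1 };
	struct kvm_tdp_fd *tdp_fd;
	void *meta;
	int ret;

	tdp_fd = kvm_tdp_fd_get(fd);			/* step 2 */
	if (IS_ERR(tdp_fd))
		return PTR_ERR(tdp_fd);

	meta = tdp_fd->ops->get_metadata(tdp_fd);	/* root HPA, level, ... */
	if (IS_ERR(meta)) {
		ret = PTR_ERR(meta);
		goto out_put;
	}

	ret = tdp_fd->ops->register_importer(tdp_fd, &my_importer_ops, NULL);
	if (ret)
		goto out_put;

	/* Step 3: an IO page fault on a GFN would be forwarded like this. */
	ret = tdp_fd->ops->fault(tdp_fd, mm, 0x1000, type);

	tdp_fd->ops->unregister_importer(tdp_fd, &my_importer_ops);
out_put:
	kvm_tdp_fd_put(tdp_fd);				/* step 5 */
	return ret;
}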


Status
==
The current support of IOPF on the IOMMU stage 2 paging structure is
verified with Intel DSA devices on an Intel SPR platform. There's no vIOMMU
for the guest, and the Intel DSA devices run in-kernel DMA tests
successfully with IOPFs handled in the host.

- Nested translation in IOMMU is currently not supported.

- The QEMU code to create a KVM HWPT via IOMMUFD is just a temporary hack.
  As the KVM HWPT has no IOAS associated, the current QEMU code needs to be
  adapted to create a KVM HWPT without an IOAS and to ensure the address
  space is from GPA to HPA.

- DSA IOPF hack in the guest driver.
  Although the DSA hardware tolerates IOPF in all DMA paths, the DSA driver
  has the flexibility to turn off IOPF in certain paths.
  This RFC currently hacks the guest driver to always turn on IOPF.


Note
==
- KVM page write-tracking

Unlike write-protection which usually adds back the write permission upon
a write fault and re-executes the faulting instruction, KVM page
write-tracking keeps the write permission disabled for the tracked pages
and instead always emulates the faulting instruction upon fault.
There is no way to emulate a faulting DMA request so IOPF and KVM page
write-tracking are incompatible.

This RFC doesn't handle the conflict, given that write-tracking is so far
only applied to guest page table pages, which are unlikely to be used as
DMA buffers.

- IOMMU page-walk coherency

It's about whether IOMMU hardware will snoop the processor cache of the
I/O paging structures. If IOMMU page-walk is non-coherent, the software
needs to do clflush after changing the I/O paging structures.

Supporting non-coherent IOMMU page-walk adds an extra burden (i.e. clflush)
to the KVM mmu in this shared model, which we don't plan to support.
Fortunately most Intel platforms do support coherent page-walk in the
IOMMU, so this exception should not be a big issue.

- Non-coherent DMA

Non-coherent DMA requires the KVM mmu to align the effective memory type
with the guest memory type (CR0.CD, vPAT, vMTRR) instead of forcing all
guest memory to be WB. It further complicates the fault handler, which
would need to check the guest memory type, and that requires a vCPU
context.

There is certainly no vCPU context in an IO page fault. So this RFC doesn't
support devices that cannot be forced to do coherent DMA.

If there is interest in supporting non-coherent DMA in this shared model,
there's a discussion about removing the vMTRR handling from the KVM page
fault handler [3], which would also make it possible to further remove the
vCPU context requirement there.

- Enforce DMA cache coherency

This design requires the IOMMU to support a configuration forcing all DMAs
to be coherent (even if the PCI request coming out of the device sets the
non-snoop bit), for the reason mentioned above.

The control of enforcing cache coherency could be per-IOPT or per-page.
e.g. Intel VT-d defines a per-page format (bit 11 in PTE represents the
enforce-snoop bit) in legacy mode and a per-IOPT format (control bit in
the pasid entry) in scalable mode.

Supporting per-page format requires KVM mmu to disable any software use
of bit 11 and also provide additional ops for on-demand set/clear-snp
requests from iommufd. It's complex and dirty.

Therefore the per-IOPT scheme is assumed in this design. For Intel IOMMU,
the scalable mode is the default mode for all new IOMMU features (nested
translation, pasid, etc.) anyway.


- About devices which partially support IOPF

Many devices claiming PCIe PRS capability actually only tolerate IOPF in
certain paths (e.g. DMA paths for SVM applications, but not for non-SVM
applications or driver data such as ring descriptors). But the PRS
capability doesn't include a bit to tell whether a device 100% tolerates
IOPF in all DMA paths.

This makes it hard for the userspace driver framework (e.g. VFIO) to know
that a device with PRS can really avoid static-pinning of the entire guest
memory, and to report such knowledge to the VMM.

A simple way is to maintain an allow-list of devices which are known to be
100% IOPF-friendly in VFIO. Another option is to extend the PCIe spec to
allow a device to report whether it fully or partially supports IOPF in the
PRS capability.

Another interesting option is to explore supporting partial-IOPF in this
sharing model:
* Create a VFIO variant driver to intercept guest operations which register
  non-faultable memory to the device, and to call KVM TDP ops to request
  on-demand pinning of the trapped memory pages in the KVM mmu. This allows
  the VMM to start with zero pinning, as for a 100%-faultable device, with
  on-demand pinning initiated by the variant driver.

* Supporting on-demand pinning in the KVM mmu however requires non-trivial
  effort. Besides introducing logic to pin pages long term and manage the
  list of pinned GFNs, more care is required to avoid breaking the
  implications of page pinning, e.g.:

a. PTE updates in a pinned GFN range must be atomic, otherwise an
in-flight DMA might be broken.

b. PTE zap in a pinned GFN range is allowed only when the related
memory slot is removed (indicating guest won't use it for DMA).
The PTE zap for the affected range must be either disabled or
replaced by an atomic update.

c. any feature related to write-protecting the pinned GFN range is
not allowed. This implies that live migration is also broken in its
current form, as it starts with write-protection even when TDP dirty bit
tracking is enabled. Supporting on-demand pinning then requires relying
on a less efficient way of always walking the TDP dirty bits instead of
using write-protection. Or, we may enhance the live migration code to
always treat pinned ranges as dirty.

d. Auto NUMA balance also needs to be disabled. [4]

If the above trickiness can be resolved cleanly, this sharing model could
in theory also support a non-faultable device, by pinning/unpinning guest
memory on slot addition/removal.


- How to map the MSI page on ARM platforms demands discussion.


Patches layout
==
[01-08]: Skeleton implementation of KVM's TDP FD object.
         Patches 1 and 2 are for the public and arch specific headers.
         Patch 4's commit message outlines the overall data structure
         hierarchy on x86 for preview.

[09-23]: IOMMU, IOMMUFD and Intel vt-d.
         - 09-11: IOMMU core part
         - 12-16: IOMMUFD part
                  Patch 13 is the main patch in IOMMUFD to implement KVM
                  HWPT.
         - 17-23: Intel vt-d part for the KVM domain
                  Patch 18 is the main patch to implement the KVM domain.

[24-42]: KVM x86 and VMX part
         - 24-34: KVM x86 preparation patches.
                  Patch 24: Let KVM reserve bit 11, since bit 11 is
                            reserved as 0 on the IOMMU side.
                  Patch 25: Abstract "struct kvm_mmu_common" from
                            "struct kvm_mmu" for "kvm_exported_tdp_mmu".
                  Patches 26~34: Prepare for page fault in non-vCPU context.

         - 35-38: Core part in KVM x86
                  Patch 35: x86 MMU core part to show how the exported TDP
                            root page is shared between KVM external
                            components and vCPUs.
                  Patch 37: TDP FD fault op implementation.

         - 39-42: KVM VMX part for meta data composing and TLB flush
                  notification.


Code base
==
The code base is commit b85ea95d08647 ("Linux 6.7-rc1") +
Yi Liu's v7 series "Add Intel VT-d nested translation (part 2/2)" [5] +
Baolu's v7 series "iommu: Prepare to deliver page faults to user space" [6]

The complete code can be found at [7], QEMU can be found at [8], and the
guest test script and workaround patch are at [9].

[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/BN9PR11MB5276D897431C7E1399EFFF338C14A@BN9PR11MB5276.namprd11.prod.outlook.com/
[3] https://lore.kernel.org/all/[email protected]/
[4] https://lore.kernel.org/all/[email protected]/
[5] https://lore.kernel.org/linux-iommu/[email protected]/
[6] https://lore.kernel.org/linux-iommu/[email protected]/
[7] https://github.com/yanhwizhao/linux_kernel/tree/sharept_iopt
[8] https://github.com/yanhwizhao/qemu/tree/sharept_iopf
[9] https://github.com/yanhwizhao/misc/tree/master


Yan Zhao (42):
KVM: Public header for KVM to export TDP
KVM: x86: Arch header for kvm to export TDP for Intel
KVM: Introduce VM ioctl KVM_CREATE_TDP_FD
KVM: Skeleton of KVM TDP FD object
KVM: Embed "arch" object and call arch init/destroy in TDP FD
KVM: Register/Unregister importers to KVM exported TDP
KVM: Forward page fault requests to arch specific code for exported
TDP
KVM: Add a helper to notify importers that KVM exported TDP is flushed
iommu: Add IOMMU_DOMAIN_KVM
iommu: Add new iommu op to create domains managed by KVM
iommu: Add new domain op cache_invalidate_kvm
iommufd: Introduce allocation data info and flag for KVM managed HWPT
iommufd: Add a KVM HW pagetable object
iommufd: Enable KVM HW page table object to be proxy between KVM and
IOMMU
iommufd: Add iopf handler to KVM hw pagetable
iommufd: Enable device feature IOPF during device attachment to KVM
HWPT
iommu/vt-d: Make some macros and helpers to be extern
iommu/vt-d: Support of IOMMU_DOMAIN_KVM domain in Intel IOMMU
iommu/vt-d: Set bit PGSNP in PASIDTE if domain cache coherency is
enforced
iommu/vt-d: Support attach devices to IOMMU_DOMAIN_KVM domain
iommu/vt-d: Check reserved bits for IOMMU_DOMAIN_KVM domain
iommu/vt-d: Support cache invalidate of IOMMU_DOMAIN_KVM domain
iommu/vt-d: Allow pasid 0 in IOPF
KVM: x86/mmu: Move bit SPTE_MMU_PRESENT from bit 11 to bit 59
KVM: x86/mmu: Abstract "struct kvm_mmu_common" from "struct kvm_mmu"
KVM: x86/mmu: introduce new op get_default_mt_mask to kvm_x86_ops
KVM: x86/mmu: change param "vcpu" to "kvm" in
kvm_mmu_hugepage_adjust()
KVM: x86/mmu: change "vcpu" to "kvm" in page_fault_handle_page_track()
KVM: x86/mmu: remove param "vcpu" from kvm_mmu_get_tdp_level()
KVM: x86/mmu: remove param "vcpu" from
kvm_calc_tdp_mmu_root_page_role()
KVM: x86/mmu: add extra param "kvm" to kvm_faultin_pfn()
KVM: x86/mmu: add extra param "kvm" to make_mmio_spte()
KVM: x86/mmu: add extra param "kvm" to make_spte()
KVM: x86/mmu: add extra param "kvm" to
tdp_mmu_map_handle_target_level()
KVM: x86/mmu: Get/Put TDP root page to be exported
KVM: x86/mmu: Keep exported TDP root valid
KVM: x86: Implement KVM exported TDP fault handler on x86
KVM: x86: "compose" and "get" interface for meta data of exported TDP
KVM: VMX: add config KVM_INTEL_EXPORTED_EPT
KVM: VMX: Compose VMX specific meta data for KVM exported TDP
KVM: VMX: Implement ops .flush_remote_tlbs* in VMX when EPT is on
KVM: VMX: Notify importers of exported TDP to flush TLBs on KVM
flushes EPT

arch/x86/include/asm/kvm-x86-ops.h | 4 +
arch/x86/include/asm/kvm_exported_tdp.h | 43 +++
arch/x86/include/asm/kvm_host.h | 48 ++-
arch/x86/kvm/Kconfig | 13 +
arch/x86/kvm/mmu.h | 12 +-
arch/x86/kvm/mmu/mmu.c | 434 +++++++++++++++++------
arch/x86/kvm/mmu/mmu_internal.h | 8 +-
arch/x86/kvm/mmu/paging_tmpl.h | 15 +-
arch/x86/kvm/mmu/spte.c | 31 +-
arch/x86/kvm/mmu/spte.h | 82 ++++-
arch/x86/kvm/mmu/tdp_mmu.c | 209 +++++++++--
arch/x86/kvm/mmu/tdp_mmu.h | 9 +
arch/x86/kvm/svm/svm.c | 2 +-
arch/x86/kvm/vmx/nested.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 56 ++-
arch/x86/kvm/x86.c | 68 +++-
drivers/iommu/intel/Kconfig | 9 +
drivers/iommu/intel/Makefile | 1 +
drivers/iommu/intel/iommu.c | 68 ++--
drivers/iommu/intel/iommu.h | 47 +++
drivers/iommu/intel/kvm.c | 185 ++++++++++
drivers/iommu/intel/pasid.c | 3 +-
drivers/iommu/intel/svm.c | 37 +-
drivers/iommu/iommufd/Kconfig | 10 +
drivers/iommu/iommufd/Makefile | 1 +
drivers/iommu/iommufd/device.c | 31 +-
drivers/iommu/iommufd/hw_pagetable.c | 29 +-
drivers/iommu/iommufd/hw_pagetable_kvm.c | 270 ++++++++++++++
drivers/iommu/iommufd/iommufd_private.h | 44 +++
drivers/iommu/iommufd/main.c | 4 +
include/linux/iommu.h | 18 +
include/linux/kvm_host.h | 58 +++
include/linux/kvm_tdp_fd.h | 137 +++++++
include/linux/kvm_types.h | 12 +
include/uapi/linux/iommufd.h | 15 +
include/uapi/linux/kvm.h | 19 +
virt/kvm/Kconfig | 6 +
virt/kvm/Makefile.kvm | 1 +
virt/kvm/kvm_main.c | 24 ++
virt/kvm/tdp_fd.c | 344 ++++++++++++++++++
virt/kvm/tdp_fd.h | 15 +
41 files changed, 2177 insertions(+), 247 deletions(-)
create mode 100644 arch/x86/include/asm/kvm_exported_tdp.h
create mode 100644 drivers/iommu/intel/kvm.c
create mode 100644 drivers/iommu/iommufd/hw_pagetable_kvm.c
create mode 100644 include/linux/kvm_tdp_fd.h
create mode 100644 virt/kvm/tdp_fd.c
create mode 100644 virt/kvm/tdp_fd.h

--
2.17.1


2023-12-02 09:42:44

by Yan Zhao

[permalink] [raw]
Subject: [RFC PATCH 01/42] KVM: Public header for KVM to export TDP

Introduce a public header for the data structures and interfaces for KVM to
export TDP page tables (EPT/NPT in x86) to components outside of KVM.

KVM exposes a TDP FD object which allows external components to get page
table meta data, request mappings, and register invalidation callbacks for
the TDP page table exported by KVM.

Two symbols, kvm_tdp_fd_get() and kvm_tdp_fd_put(), are exported by KVM for
external components to get/put the TDP FD object.

A new header file kvm_tdp_fd.h is added because kvm_host.h is not expected
to be included from outside of KVM in the future, AFAIK.

Signed-off-by: Yan Zhao <[email protected]>
---
include/linux/kvm_tdp_fd.h | 137 +++++++++++++++++++++++++++++++++++++
1 file changed, 137 insertions(+)
create mode 100644 include/linux/kvm_tdp_fd.h

diff --git a/include/linux/kvm_tdp_fd.h b/include/linux/kvm_tdp_fd.h
new file mode 100644
index 0000000000000..3661779dd8cf5
--- /dev/null
+++ b/include/linux/kvm_tdp_fd.h
@@ -0,0 +1,137 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __KVM_TDP_FD_H
+#define __KVM_TDP_FD_H
+
+#include <linux/types.h>
+#include <linux/mm.h>
+
+struct kvm_exported_tdp;
+struct kvm_exported_tdp_ops;
+struct kvm_tdp_importer_ops;
+
+/**
+ * struct kvm_tdp_fd - KVM TDP FD object
+ *
+ * Interface of exporting KVM TDP page table to external components of KVM.
+ *
+ * This KVM TDP FD object is created by KVM VM ioctl KVM_CREATE_TDP_FD.
+ * On object creation, KVM will find or create a TDP page table, mark it as
+ * exported and increase reference count of this exported TDP page table.
+ *
+ * On object destroy, the exported TDP page table is unmarked as exported with
+ * its reference count decreased.
+ *
+ * During the life cycle of the KVM TDP FD object, a reference to the KVM VM is held.
+ *
+ * Components outside of KVM can get meta data (e.g. page table type, levels,
+ * root HPA,...), request page fault on the exported TDP page table and register
+ * themselves as importers to receive notification through kvm_exported_tdp_ops
+ * @ops.
+ *
+ * @file: struct file object associated with the KVM TDP FD object.
+ * @ops: kvm_exported_tdp_ops associated with the exported TDP page table.
+ * @priv: internal data structures used by KVM to manage TDP page table
+ * exported by KVM.
+ *
+ */
+struct kvm_tdp_fd {
+ /* Public */
+ struct file *file;
+ const struct kvm_exported_tdp_ops *ops;
+
+ /* private to KVM */
+ struct kvm_exported_tdp *priv;
+};
+
+/**
+ * kvm_tdp_fd_get - Public interface to get KVM TDP FD object.
+ *
+ * @fd: fd of the KVM TDP FD object.
+ * @return: KVM TDP FD object if @fd corresponds to a valid KVM TDP FD file.
+ * -EBADF if @fd does not correspond a struct file.
+ * -EINVAL if @fd does not correspond to a KVM TDP FD file.
+ *
+ * Callers of this interface will get a KVM TDP FD object with ref count
+ * increased.
+ */
+struct kvm_tdp_fd *kvm_tdp_fd_get(int fd);
+
+/**
+ * kvm_tdp_fd_put - Public interface to put ref count of a KVM TDP FD object.
+ *
+ * @tdp: KVM TDP FD object.
+ *
+ * Put reference count of the KVM TDP FD object.
+ * After the last reference count of the TDP FD object goes away,
+ * kvm_tdp_fd_release() will be called to decrease KVM VM ref count and destroy
+ * the KVM TDP FD object.
+ */
+void kvm_tdp_fd_put(struct kvm_tdp_fd *tdp);
+
+struct kvm_tdp_fault_type {
+ u32 read:1;
+ u32 write:1;
+ u32 exec:1;
+};
+
+/**
+ * struct kvm_exported_tdp_ops - operations possible on KVM TDP FD object.
+ * @register_importer: This is called from components outside of KVM to register
+ * importer callback ops and the importer data.
+ * This callback is a must.
+ * Returns: 0 on success, negative error code on failure.
+ * -EBUSY if the importer ops is already registered.
+ * @unregister_importer:This is called from components outside of KVM if it does
+ * not want to receive importer callbacks any more.
+ * This callback is a must.
+ * @fault: This is called from components outside of KVM to trigger
+ * page fault on a GPA and to map physical page into the
+ * TDP page tables exported by KVM.
+ * This callback is optional.
+ * If this callback is absent, components outside KVM will
+ * not be able to trigger page fault and map physical pages
+ * into the TDP page tables exported by KVM.
+ * @get_metadata: This is called from components outside of KVM to retrieve
+ * meta data of the TDP page tables exported by KVM, e.g.
+ * page table type, root HPA, levels, reserved zero bits...
+ * Returns: pointer to a vendor meta data on success.
+ * Error PTR on error.
+ * This callback is a must.
+ */
+struct kvm_exported_tdp_ops {
+ int (*register_importer)(struct kvm_tdp_fd *tdp_fd,
+ struct kvm_tdp_importer_ops *ops,
+ void *importer_data);
+
+ void (*unregister_importer)(struct kvm_tdp_fd *tdp_fd,
+ struct kvm_tdp_importer_ops *ops);
+
+ int (*fault)(struct kvm_tdp_fd *tdp_fd, struct mm_struct *mm,
+ unsigned long gfn, struct kvm_tdp_fault_type type);
+
+ void *(*get_metadata)(struct kvm_tdp_fd *tdp_fd);
+};
+
+/**
+ * struct kvm_tdp_importer_ops - importer callbacks
+ *
+ * Components outside of KVM can be registered as importers of KVM's exported
+ * TDP page tables via register_importer op in kvm_exported_tdp_ops of a KVM TDP
+ * FD object.
+ *
+ * Each importer must define its own importer callbacks and KVM will notify
+ * importers of changes of the exported TDP page tables.
+ */
+struct kvm_tdp_importer_ops {
+ /**
+ * This is called by KVM to notify the importer that a range of KVM
+ * TDP has been invalidated.
+ * When @start is 0 and @size is -1, the whole KVM TDP is invalidated.
+ *
+ * @data: the importer private data.
+ * @start: start GPA of the invalidated range.
+ * @size: length of the invalidated range.
+ */
+ void (*invalidate)(void *data, unsigned long start, unsigned long size);
+};
+#endif /* __KVM_TDP_FD_H */
--
2.17.1

2023-12-02 09:44:17

by Yan Zhao

[permalink] [raw]
Subject: [RFC PATCH 02/42] KVM: x86: Arch header for kvm to export TDP for Intel

Headers to define Intel specific meta data for TDP page tables exported by
KVM.
The meta data includes page table type, level, HPA of root page, max huge
page level, and reserved zero bits currently.
(Note: each vendor can define its own meta data format, e.g. it could be
kvm_exported_tdp_meta_svm on an AMD platform.)

The consumer of the exported TDP (e.g. Intel vt-d driver) can retrieve and
check the vendor specific meta data before loading the KVM exported TDP
page tables to their own secondary MMU.
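
As an illustration only, an importer such as the Intel vt-d driver (patch
18) might sanity-check the meta data roughly as below before loading the
root page; the exact checks and the helper name are assumptions, not code
from this series:

#include <linux/bits.h>
#include <linux/errno.h>
#include <asm/kvm_exported_tdp.h>

static int example_check_tdp_meta(const struct kvm_exported_tdp_meta_vmx *meta)
{
	int i;

	if (meta->type != KVM_TDP_TYPE_EPT)
		return -EOPNOTSUPP;

	/* e.g. only 4- or 5-level paging structures fit the IOMMU here. */
	if (meta->level != 4 && meta->level != 5)
		return -EOPNOTSUPP;

	/*
	 * Bit 11 must be a must-be-zero bit on the KVM side, since VT-d
	 * treats bit 11 as reserved and would otherwise see DMAR faults.
	 */
	for (i = 0; i < meta->level; i++) {
		if (!(meta->rsvd_bits_mask[0][i] & BIT_ULL(11)) ||
		    !(meta->rsvd_bits_mask[1][i] & BIT_ULL(11)))
			return -EOPNOTSUPP;
	}

	return 0;
}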

Signed-off-by: Yan Zhao <[email protected]>
---
arch/x86/include/asm/kvm_exported_tdp.h | 43 +++++++++++++++++++++++++
include/linux/kvm_types.h | 12 +++++++
2 files changed, 55 insertions(+)
create mode 100644 arch/x86/include/asm/kvm_exported_tdp.h

diff --git a/arch/x86/include/asm/kvm_exported_tdp.h b/arch/x86/include/asm/kvm_exported_tdp.h
new file mode 100644
index 0000000000000..c7fe3f3cf89fb
--- /dev/null
+++ b/arch/x86/include/asm/kvm_exported_tdp.h
@@ -0,0 +1,43 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_KVM_EXPORTED_TDP_H
+#define _ASM_X86_KVM_EXPORTED_TDP_H
+#define PT64_ROOT_MAX_LEVEL 5
+
+#include <linux/kvm_types.h>
+/**
+ * struct kvm_exported_tdp_meta_vmx - Intel specific meta data format of TDP
+ * page tables exported by KVM.
+ *
+ * Importers of KVM exported TDPs can decode meta data of the page tables with
+ * this structure.
+ *
+ * @type: Type defined across platforms to identify hardware
+ * platform of a KVM exported TDP. Importers of KVM
+ * exported TDP need to first check the type before
+ * decoding page table meta data.
+ * @level: Levels of the TDP exported by KVM.
+ * @root_hpa: HPA of the root page of TDP exported by KVM.
+ * @max_huge_page_level: Max huge page level allowed on the TDP exported by KVM.
+ * @rsvd_bits_mask: The must-be-zero bits of leaf and non-leaf PTEs.
+ * rsvd_bits_mask[0] or rsvd_bits_mask[1] is selected by
+ * bit 7 of a PTE.
+ * This field is provided as a way for importers to check
+ * if the must-be-zero bits from KVM is compatible to the
+ * importer side. KVM will ensure that the must-be-zero
+ * bits must not be set even for software purpose.
+ * (e.g. on Intel platform, bit 11 is usually used by KVM
+ * to identify a present SPTE, though bit 11 is ignored by
+ * EPT. However, Intel vt-d requires the bit 11 to be 0.
+ * Before importing KVM TDP, Intel vt-d driver needs to
+ * check if bit 11 is set in the must-be-zero bits by KVM
+ * to avoid possible DMAR fault.)
+ */
+struct kvm_exported_tdp_meta_vmx {
+ enum kvm_exported_tdp_type type;
+ int level;
+ hpa_t root_hpa;
+ int max_huge_page_level;
+ u64 rsvd_bits_mask[2][PT64_ROOT_MAX_LEVEL];
+};
+
+#endif
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 6f4737d5046a4..04deb8334ce42 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -123,4 +123,16 @@ struct kvm_vcpu_stat_generic {

#define KVM_STATS_NAME_SIZE 48

+/**
+ * enum kvm_exported_tdp_type - Type defined across platforms for TDP exported
+ * by KVM.
+ *
+ * @KVM_TDP_TYPE_EPT: The TDP is of type EPT running on Intel platform.
+ *
+ * Currently, @KVM_TDP_TYPE_EPT is the only supported type for TDPs exported by
+ * KVM.
+ */
+enum kvm_exported_tdp_type {
+ KVM_TDP_TYPE_EPT = 1,
+};
#endif /* __KVM_TYPES_H__ */
--
2.17.1

2023-12-02 09:44:50

by Yan Zhao

[permalink] [raw]
Subject: [RFC PATCH 03/42] KVM: Introduce VM ioctl KVM_CREATE_TDP_FD

Introduce VM ioctl KVM_CREATE_TDP_FD to create KVM TDP FD object, which
will act as an interface of KVM to export TDP page tables and communicate
with external components of KVM.
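
For illustration, a VMM is expected to use the new ioctl together with the
KVM-type HWPT added in patches 12/13 roughly as in the sketch below. Error
handling is omitted, the helper name is hypothetical, and the
iommu_hwpt_alloc usage follows the iommufd nesting series this RFC is based
on:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <linux/iommufd.h>

static int create_kvm_hwpt(int vm_fd, int iommufd, __u32 dev_id, __u32 *hwpt_id)
{
	struct kvm_create_tdp_fd ct = { .as_id = 0, .mode = 0, .pad = 0 };
	struct iommu_hwpt_kvm_info kvm_data = {};
	struct iommu_hwpt_alloc alloc = {};

	/* 1. Ask KVM to export the TDP of address space 0 and return an fd. */
	if (ioctl(vm_fd, KVM_CREATE_TDP_FD, &ct))
		return -1;

	/* 2. Create a KVM-type HWPT on top of the exported TDP fd. */
	kvm_data.fd = ct.fd;
	alloc.size = sizeof(alloc);
	alloc.dev_id = dev_id;
	alloc.data_type = IOMMU_HWPT_DATA_KVM;
	alloc.data_len = sizeof(kvm_data);
	alloc.data_uptr = (uintptr_t)&kvm_data;
	if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &alloc))
		return -1;

	*hwpt_id = alloc.out_hwpt_id;
	return 0;
}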

Signed-off-by: Yan Zhao <[email protected]>
---
include/uapi/linux/kvm.h | 19 +++++++++++++++++++
virt/kvm/kvm_main.c | 19 +++++++++++++++++++
virt/kvm/tdp_fd.h | 10 ++++++++++
3 files changed, 48 insertions(+)
create mode 100644 virt/kvm/tdp_fd.h

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 211b86de35ac5..f181883c60fed 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1582,6 +1582,9 @@ struct kvm_s390_ucas_mapping {
#define KVM_GET_DEVICE_ATTR _IOW(KVMIO, 0xe2, struct kvm_device_attr)
#define KVM_HAS_DEVICE_ATTR _IOW(KVMIO, 0xe3, struct kvm_device_attr)

+/* ioctl for vm fd to create tdp fd */
+#define KVM_CREATE_TDP_FD _IOWR(KVMIO, 0xe4, struct kvm_create_tdp_fd)
+
/*
* ioctls for vcpu fds
*/
@@ -2267,4 +2270,20 @@ struct kvm_s390_zpci_op {
/* flags for kvm_s390_zpci_op->u.reg_aen.flags */
#define KVM_S390_ZPCIOP_REGAEN_HOST (1 << 0)

+/**
+ * struct kvm_create_tdp_fd - VM ioctl(KVM_CREATE_TDP_FD)
+ * Create a TDP fd object for a TDP exported by KVM.
+ *
+ * @as_id: in: Address space ID for this TDP.
+ * @mode: in: Mode of this tdp.
+ * Reserved for future usage. Currently, this field must be 0.
+ * @fd: out: fd of TDP fd object for a TDP exported by KVM.
+ * @pad: in: Reserved as 0.
+ */
+struct kvm_create_tdp_fd {
+ __u32 as_id;
+ __u32 mode;
+ __u32 fd;
+ __u32 pad;
+};
#endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 486800a7024b3..494b6301a6065 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -61,6 +61,7 @@
#include "async_pf.h"
#include "kvm_mm.h"
#include "vfio.h"
+#include "tdp_fd.h"

#include <trace/events/ipi.h>

@@ -4973,6 +4974,24 @@ static long kvm_vm_ioctl(struct file *filp,
case KVM_GET_STATS_FD:
r = kvm_vm_ioctl_get_stats_fd(kvm);
break;
+ case KVM_CREATE_TDP_FD: {
+ struct kvm_create_tdp_fd ct;
+
+ r = -EFAULT;
+ if (copy_from_user(&ct, argp, sizeof(ct)))
+ goto out;
+
+ r = kvm_create_tdp_fd(kvm, &ct);
+ if (r)
+ goto out;
+
+ r = -EFAULT;
+ if (copy_to_user(argp, &ct, sizeof(ct)))
+ goto out;
+
+ r = 0;
+ break;
+ }
default:
r = kvm_arch_vm_ioctl(filp, ioctl, arg);
}
diff --git a/virt/kvm/tdp_fd.h b/virt/kvm/tdp_fd.h
new file mode 100644
index 0000000000000..05c8a6d767469
--- /dev/null
+++ b/virt/kvm/tdp_fd.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __TDP_FD_H
+#define __TDP_FD_H
+
+static inline int kvm_create_tdp_fd(struct kvm *kvm, struct kvm_create_tdp_fd *ct)
+{
+ return -EOPNOTSUPP;
+}
+
+#endif /* __TDP_FD_H */
--
2.17.1

2023-12-02 09:45:28

by Yan Zhao

[permalink] [raw]
Subject: [RFC PATCH 04/42] KVM: Skeleton of KVM TDP FD object

This is a skeleton implementation of the KVM TDP FD object.
The KVM TDP FD object is created by the ioctl KVM_CREATE_TDP_FD in
kvm_create_tdp_fd(), and contains:

Public part (defined in <linux/kvm_tdp_fd.h>):
- A file object for reference counting.
  The file reference count is 1 when the KVM TDP FD object is created.
  Once the reference count of the file object goes to 0, its .release()
  handler will destroy the KVM TDP FD object.
- ops kvm_exported_tdp_ops (empty implementation in this patch).

Private part (the kvm_exported_tdp object defined in this patch):
The kvm_exported_tdp object is linked in kvm->exported_tdp_list, one for
each KVM address space. It records the address space id and the "kvm"
pointer for the TDP FD object, and a KVM VM reference is held during the
object's life cycle.
In later patches, this kvm_exported_tdp object will be associated with a
TDP page table exported by KVM.

Two symbols kvm_tdp_fd_get() and kvm_tdp_fd_put() are implemented and
exported to external components to get/put KVM TDP FD object.

Signed-off-by: Yan Zhao <[email protected]>
---
include/linux/kvm_host.h | 18 ++++
virt/kvm/Kconfig | 3 +
virt/kvm/Makefile.kvm | 1 +
virt/kvm/kvm_main.c | 5 +
virt/kvm/tdp_fd.c | 208 +++++++++++++++++++++++++++++++++++++++
virt/kvm/tdp_fd.h | 5 +
6 files changed, 240 insertions(+)
create mode 100644 virt/kvm/tdp_fd.c

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4944136efaa22..122f47c94ecae 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -44,6 +44,7 @@

#include <asm/kvm_host.h>
#include <linux/kvm_dirty_ring.h>
+#include <linux/kvm_tdp_fd.h>

#ifndef KVM_MAX_VCPU_IDS
#define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
@@ -808,6 +809,11 @@ struct kvm {
struct notifier_block pm_notifier;
#endif
char stats_id[KVM_STATS_NAME_SIZE];
+
+#ifdef CONFIG_HAVE_KVM_EXPORTED_TDP
+ struct list_head exported_tdp_list;
+ spinlock_t exported_tdplist_lock;
+#endif
};

#define kvm_err(fmt, ...) \
@@ -2318,4 +2324,16 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
/* Max number of entries allowed for each kvm dirty ring */
#define KVM_DIRTY_RING_MAX_ENTRIES 65536

+#ifdef CONFIG_HAVE_KVM_EXPORTED_TDP
+
+struct kvm_exported_tdp {
+ struct kvm_tdp_fd *tdp_fd;
+
+ struct kvm *kvm;
+ u32 as_id;
+ /* head at kvm->exported_tdp_list */
+ struct list_head list_node;
+};
+
+#endif /* CONFIG_HAVE_KVM_EXPORTED_TDP */
#endif
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 484d0873061ca..63b5d55c84e95 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -92,3 +92,6 @@ config HAVE_KVM_PM_NOTIFIER

config KVM_GENERIC_HARDWARE_ENABLING
bool
+
+config HAVE_KVM_EXPORTED_TDP
+ bool
diff --git a/virt/kvm/Makefile.kvm b/virt/kvm/Makefile.kvm
index 2c27d5d0c367c..fad4638e407c5 100644
--- a/virt/kvm/Makefile.kvm
+++ b/virt/kvm/Makefile.kvm
@@ -12,3 +12,4 @@ kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
kvm-$(CONFIG_HAVE_KVM_IRQ_ROUTING) += $(KVM)/irqchip.o
kvm-$(CONFIG_HAVE_KVM_DIRTY_RING) += $(KVM)/dirty_ring.o
kvm-$(CONFIG_HAVE_KVM_PFNCACHE) += $(KVM)/pfncache.o
+kvm-$(CONFIG_HAVE_KVM_EXPORTED_TDP) += $(KVM)/tdp_fd.o
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 494b6301a6065..9fa9132055807 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1232,6 +1232,11 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
INIT_HLIST_HEAD(&kvm->irq_ack_notifier_list);
#endif

+#ifdef CONFIG_HAVE_KVM_EXPORTED_TDP
+ INIT_LIST_HEAD(&kvm->exported_tdp_list);
+ spin_lock_init(&kvm->exported_tdplist_lock);
+#endif
+
r = kvm_init_mmu_notifier(kvm);
if (r)
goto out_err_no_mmu_notifier;
diff --git a/virt/kvm/tdp_fd.c b/virt/kvm/tdp_fd.c
new file mode 100644
index 0000000000000..a5c4c3597e94f
--- /dev/null
+++ b/virt/kvm/tdp_fd.c
@@ -0,0 +1,208 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * KVM TDP FD
+ *
+ */
+#include <linux/anon_inodes.h>
+#include <uapi/linux/kvm.h>
+#include <linux/kvm_host.h>
+
+#include "tdp_fd.h"
+
+static inline int is_tdp_fd_file(struct file *file);
+static const struct file_operations kvm_tdp_fd_fops;
+static const struct kvm_exported_tdp_ops exported_tdp_ops;
+
+int kvm_create_tdp_fd(struct kvm *kvm, struct kvm_create_tdp_fd *ct)
+{
+ struct kvm_exported_tdp *tdp;
+ struct kvm_tdp_fd *tdp_fd;
+ int as_id = ct->as_id;
+ int ret, fd;
+
+ if (as_id >= KVM_ADDRESS_SPACE_NUM || ct->pad || ct->mode)
+ return -EINVAL;
+
+ /* for each address space, only one exported tdp is allowed */
+ spin_lock(&kvm->exported_tdplist_lock);
+ list_for_each_entry(tdp, &kvm->exported_tdp_list, list_node) {
+ if (tdp->as_id != as_id)
+ continue;
+
+ spin_unlock(&kvm->exported_tdplist_lock);
+ return -EEXIST;
+ }
+ spin_unlock(&kvm->exported_tdplist_lock);
+
+ tdp_fd = kzalloc(sizeof(*tdp_fd), GFP_KERNEL_ACCOUNT);
+ if (!tdp_fd)
+ return -ENOMEM;
+
+ tdp = kzalloc(sizeof(*tdp), GFP_KERNEL_ACCOUNT);
+ if (!tdp) {
+ kfree(tdp_fd);
+ return -ENOMEM;
+ }
+ tdp_fd->priv = tdp;
+ tdp->tdp_fd = tdp_fd;
+ tdp->as_id = as_id;
+
+ if (!kvm_get_kvm_safe(kvm)) {
+ ret = -ENODEV;
+ goto out;
+ }
+ tdp->kvm = kvm;
+
+ tdp_fd->file = anon_inode_getfile("tdp_fd", &kvm_tdp_fd_fops,
+ tdp_fd, O_RDWR | O_CLOEXEC);
+ if (!tdp_fd->file) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ fd = get_unused_fd_flags(O_RDWR | O_CLOEXEC);
+ if (fd < 0)
+ goto out;
+
+ fd_install(fd, tdp_fd->file);
+ ct->fd = fd;
+ tdp_fd->ops = &exported_tdp_ops;
+
+ spin_lock(&kvm->exported_tdplist_lock);
+ list_add(&tdp->list_node, &kvm->exported_tdp_list);
+ spin_unlock(&kvm->exported_tdplist_lock);
+ return 0;
+
+out:
+ if (tdp_fd->file)
+ fput(tdp_fd->file);
+
+ if (tdp->kvm)
+ kvm_put_kvm_no_destroy(tdp->kvm);
+ kfree(tdp);
+ kfree(tdp_fd);
+ return ret;
+}
+
+static int kvm_tdp_fd_release(struct inode *inode, struct file *file)
+{
+ struct kvm_exported_tdp *tdp;
+ struct kvm_tdp_fd *tdp_fd;
+
+ if (!is_tdp_fd_file(file))
+ return -EINVAL;
+
+ tdp_fd = file->private_data;
+ tdp = tdp_fd->priv;
+
+ if (WARN_ON(!tdp || !tdp->kvm))
+ return -EFAULT;
+
+ spin_lock(&tdp->kvm->exported_tdplist_lock);
+ list_del(&tdp->list_node);
+ spin_unlock(&tdp->kvm->exported_tdplist_lock);
+
+ kvm_put_kvm(tdp->kvm);
+ kfree(tdp);
+ kfree(tdp_fd);
+ return 0;
+}
+
+static long kvm_tdp_fd_ioctl(struct file *file, unsigned int cmd,
+ unsigned long arg)
+{
+ /* Do not support ioctl currently. May add it in future */
+ return -ENODEV;
+}
+
+static int kvm_tdp_fd_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+ return -ENODEV;
+}
+
+static const struct file_operations kvm_tdp_fd_fops = {
+ .unlocked_ioctl = kvm_tdp_fd_ioctl,
+ .compat_ioctl = compat_ptr_ioctl,
+ .release = kvm_tdp_fd_release,
+ .mmap = kvm_tdp_fd_mmap,
+};
+
+static inline int is_tdp_fd_file(struct file *file)
+{
+ return file->f_op == &kvm_tdp_fd_fops;
+}
+
+static int kvm_tdp_register_importer(struct kvm_tdp_fd *tdp_fd,
+ struct kvm_tdp_importer_ops *ops, void *data)
+{
+ return -EOPNOTSUPP;
+}
+
+static void kvm_tdp_unregister_importer(struct kvm_tdp_fd *tdp_fd,
+ struct kvm_tdp_importer_ops *ops)
+{
+}
+
+static void *kvm_tdp_get_metadata(struct kvm_tdp_fd *tdp_fd)
+{
+ return ERR_PTR(-EOPNOTSUPP);
+}
+
+static int kvm_tdp_fault(struct kvm_tdp_fd *tdp_fd, struct mm_struct *mm,
+ unsigned long gfn, struct kvm_tdp_fault_type type)
+{
+ return -EOPNOTSUPP;
+}
+
+static const struct kvm_exported_tdp_ops exported_tdp_ops = {
+ .register_importer = kvm_tdp_register_importer,
+ .unregister_importer = kvm_tdp_unregister_importer,
+ .get_metadata = kvm_tdp_get_metadata,
+ .fault = kvm_tdp_fault,
+};
+
+/**
+ * kvm_tdp_fd_get - Public interface to get KVM TDP FD object.
+ *
+ * @fd: fd of the KVM TDP FD object.
+ * @return: KVM TDP FD object if @fd corresponds to a valid KVM TDP FD file.
+ * -EBADF if @fd does not correspond a struct file.
+ * -EINVAL if @fd does not correspond to a KVM TDP FD file.
+ *
+ * Callers of this interface will get a KVM TDP FD object with ref count
+ * increased.
+ */
+struct kvm_tdp_fd *kvm_tdp_fd_get(int fd)
+{
+ struct file *file;
+
+ file = fget(fd);
+ if (!file)
+ return ERR_PTR(-EBADF);
+
+ if (!is_tdp_fd_file(file)) {
+ fput(file);
+ return ERR_PTR(-EINVAL);
+ }
+ return file->private_data;
+}
+EXPORT_SYMBOL_GPL(kvm_tdp_fd_get);
+
+/**
+ * kvm_tdp_fd_put - Public interface to put ref count of a KVM TDP FD object.
+ *
+ * @tdp_fd: KVM TDP FD object.
+ *
+ * Put reference count of the KVM TDP FD object.
+ * After the last reference count of the TDP fd goes away,
+ * kvm_tdp_fd_release() will be called to decrease KVM VM ref count and destroy
+ * the KVM TDP FD object.
+ */
+void kvm_tdp_fd_put(struct kvm_tdp_fd *tdp_fd)
+{
+ if (WARN_ON(!tdp_fd || !tdp_fd->file || !is_tdp_fd_file(tdp_fd->file)))
+ return;
+
+ fput(tdp_fd->file);
+}
+EXPORT_SYMBOL_GPL(kvm_tdp_fd_put);
diff --git a/virt/kvm/tdp_fd.h b/virt/kvm/tdp_fd.h
index 05c8a6d767469..85da9d8cc1ce4 100644
--- a/virt/kvm/tdp_fd.h
+++ b/virt/kvm/tdp_fd.h
@@ -2,9 +2,14 @@
#ifndef __TDP_FD_H
#define __TDP_FD_H

+#ifdef CONFIG_HAVE_KVM_EXPORTED_TDP
+int kvm_create_tdp_fd(struct kvm *kvm, struct kvm_create_tdp_fd *ct);
+
+#else
static inline int kvm_create_tdp_fd(struct kvm *kvm, struct kvm_create_tdp_fd *ct)
{
return -EOPNOTSUPP;
}
+#endif /* CONFIG_HAVE_KVM_EXPORTED_TDP */

#endif /* __TDP_FD_H */
--
2.17.1

2023-12-02 09:46:49

by Yan Zhao

[permalink] [raw]
Subject: [RFC PATCH 05/42] KVM: Embed "arch" object and call arch init/destroy in TDP FD

Embed "arch" object in private "kvm_exported_tdp" object of KVM TDP FD
object in order to associate a TDP page table to this private object.

With later patches for arch x86, the overall data structure hierarchy on
x86 for TDP FD to export TDP is outlined below for preview.

 kvm_tdp_fd
 .-------.
 | ops  -|-->kvm_exported_tdp_ops
 | file  |                                                      public
----------------------------------------------------------------------
 | priv -|-->kvm_exported_tdp                                  private
 '-------'   .-----------.
             | tdp_fd    |
             | as_id     |
             | kvm       |
             | importers |
             | arch     -|-->kvm_arch_exported_tdp
             | list_node |   .------.
             '-----------'   | mmu -|--> kvm_exported_tdp_mmu
                             | meta |    .-----------.
                             '--|---'    | common   -|--> kvm_mmu_common
                                |        | root_page |
                                |        '-----------'
                                |
                                +-->kvm_exported_tdp_meta_vmx
                                    .--------------------.
                                    | type               |
                                    | level              |
                                    | root_hpa           |
                                    | max_huge_page_level|
                                    | rsvd_bits_mask     |
                                    '--------------------'

Signed-off-by: Yan Zhao <[email protected]>
---
include/linux/kvm_host.h | 17 +++++++++++++++++
virt/kvm/tdp_fd.c | 12 +++++++++---
2 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 122f47c94ecae..5a74b2b0ac81f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2327,6 +2327,9 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
#ifdef CONFIG_HAVE_KVM_EXPORTED_TDP

struct kvm_exported_tdp {
+#ifdef __KVM_HAVE_ARCH_EXPORTED_TDP
+ struct kvm_arch_exported_tdp arch;
+#endif
struct kvm_tdp_fd *tdp_fd;

struct kvm *kvm;
@@ -2335,5 +2338,19 @@ struct kvm_exported_tdp {
struct list_head list_node;
};

+#ifdef __KVM_HAVE_ARCH_EXPORTED_TDP
+int kvm_arch_exported_tdp_init(struct kvm *kvm, struct kvm_exported_tdp *tdp);
+void kvm_arch_exported_tdp_destroy(struct kvm_exported_tdp *tdp);
+#else
+static inline int kvm_arch_exported_tdp_init(struct kvm *kvm,
+ struct kvm_exported_tdp *tdp)
+{
+ return -EOPNOTSUPP;
+}
+static inline void kvm_arch_exported_tdp_destroy(struct kvm_exported_tdp *tdp)
+{
+}
+#endif /* __KVM_HAVE_ARCH_EXPORTED_TDP */
+
#endif /* CONFIG_HAVE_KVM_EXPORTED_TDP */
#endif
diff --git a/virt/kvm/tdp_fd.c b/virt/kvm/tdp_fd.c
index a5c4c3597e94f..7e68199ea9643 100644
--- a/virt/kvm/tdp_fd.c
+++ b/virt/kvm/tdp_fd.c
@@ -52,17 +52,20 @@ int kvm_create_tdp_fd(struct kvm *kvm, struct kvm_create_tdp_fd *ct)
goto out;
}
tdp->kvm = kvm;
+ ret = kvm_arch_exported_tdp_init(kvm, tdp);
+ if (ret)
+ goto out;

tdp_fd->file = anon_inode_getfile("tdp_fd", &kvm_tdp_fd_fops,
tdp_fd, O_RDWR | O_CLOEXEC);
if (!tdp_fd->file) {
ret = -EFAULT;
- goto out;
+ goto out_uninit;
}

fd = get_unused_fd_flags(O_RDWR | O_CLOEXEC);
if (fd < 0)
- goto out;
+ goto out_uninit;

fd_install(fd, tdp_fd->file);
ct->fd = fd;
@@ -73,10 +76,12 @@ int kvm_create_tdp_fd(struct kvm *kvm, struct kvm_create_tdp_fd *ct)
spin_unlock(&kvm->exported_tdplist_lock);
return 0;

-out:
+out_uninit:
if (tdp_fd->file)
fput(tdp_fd->file);

+ kvm_arch_exported_tdp_destroy(tdp);
+out:
if (tdp->kvm)
kvm_put_kvm_no_destroy(tdp->kvm);
kfree(tdp);
@@ -102,6 +107,7 @@ static int kvm_tdp_fd_release(struct inode *inode, struct file *file)
list_del(&tdp->list_node);
spin_unlock(&tdp->kvm->exported_tdplist_lock);

+ kvm_arch_exported_tdp_destroy(tdp);
kvm_put_kvm(tdp->kvm);
kfree(tdp);
kfree(tdp_fd);
--
2.17.1

2023-12-02 09:46:51

by Yan Zhao

[permalink] [raw]
Subject: [RFC PATCH 06/42] KVM: Register/Unregister importers to KVM exported TDP

Each TDP exported by KVM has its own list of importers. External components
can register/unregister themselves as importers, each with a unique
importer ops.

The sequence for an external component to register/unregister as an
importer is as follows:
1. call kvm_tdp_fd_get() to get a KVM TDP fd object.
2. call tdp_fd->ops->register_importer() to register itself as an importer.
3. call tdp_fd->ops->unregister_importer() to unregister itself as an
   importer.
4. call kvm_tdp_fd_put() to put the KVM TDP fd object.

When destroying a KVM TDP fd object, all importers are force-unregistered.
There's no extra notification to the importers at that time, because the
force-unregister should only happen when an importer calls kvm_tdp_fd_put()
without first calling tdp_fd->ops->unregister_importer().

Signed-off-by: Yan Zhao <[email protected]>
---
include/linux/kvm_host.h | 5 +++
virt/kvm/tdp_fd.c | 68 +++++++++++++++++++++++++++++++++++++++-
2 files changed, 72 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 5a74b2b0ac81f..f73d32eef8833 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2334,6 +2334,11 @@ struct kvm_exported_tdp {

struct kvm *kvm;
u32 as_id;
+
+ /* protect importers list */
+ spinlock_t importer_lock;
+ struct list_head importers;
+
/* head at kvm->exported_tdp_list */
struct list_head list_node;
};
diff --git a/virt/kvm/tdp_fd.c b/virt/kvm/tdp_fd.c
index 7e68199ea9643..3271da1a4b2c1 100644
--- a/virt/kvm/tdp_fd.c
+++ b/virt/kvm/tdp_fd.c
@@ -13,6 +13,13 @@ static inline int is_tdp_fd_file(struct file *file);
static const struct file_operations kvm_tdp_fd_fops;
static const struct kvm_exported_tdp_ops exported_tdp_ops;

+struct kvm_tdp_importer {
+ struct kvm_tdp_importer_ops *ops;
+ void *data;
+ struct list_head node;
+};
+static void kvm_tdp_unregister_all_importers(struct kvm_exported_tdp *tdp);
+
int kvm_create_tdp_fd(struct kvm *kvm, struct kvm_create_tdp_fd *ct)
{
struct kvm_exported_tdp *tdp;
@@ -56,6 +63,9 @@ int kvm_create_tdp_fd(struct kvm *kvm, struct kvm_create_tdp_fd *ct)
if (ret)
goto out;

+ INIT_LIST_HEAD(&tdp->importers);
+ spin_lock_init(&tdp->importer_lock);
+
tdp_fd->file = anon_inode_getfile("tdp_fd", &kvm_tdp_fd_fops,
tdp_fd, O_RDWR | O_CLOEXEC);
if (!tdp_fd->file) {
@@ -107,6 +117,7 @@ static int kvm_tdp_fd_release(struct inode *inode, struct file *file)
list_del(&tdp->list_node);
spin_unlock(&tdp->kvm->exported_tdplist_lock);

+ kvm_tdp_unregister_all_importers(tdp);
kvm_arch_exported_tdp_destroy(tdp);
kvm_put_kvm(tdp->kvm);
kfree(tdp);
@@ -141,12 +152,67 @@ static inline int is_tdp_fd_file(struct file *file)
static int kvm_tdp_register_importer(struct kvm_tdp_fd *tdp_fd,
struct kvm_tdp_importer_ops *ops, void *data)
{
- return -EOPNOTSUPP;
+ struct kvm_tdp_importer *importer, *tmp;
+ struct kvm_exported_tdp *tdp;
+
+ if (!tdp_fd || !tdp_fd->priv || !ops)
+ return -EINVAL;
+
+ tdp = tdp_fd->priv;
+ importer = kzalloc(sizeof(*importer), GFP_KERNEL);
+ if (!importer)
+ return -ENOMEM;
+
+ spin_lock(&tdp->importer_lock);
+ list_for_each_entry(tmp, &tdp->importers, node) {
+ if (tmp->ops != ops)
+ continue;
+
+ kfree(importer);
+ spin_unlock(&tdp->importer_lock);
+ return -EBUSY;
+ }
+
+ importer->ops = ops;
+ importer->data = data;
+ list_add(&importer->node, &tdp->importers);
+
+ spin_unlock(&tdp->importer_lock);
+
+ return 0;
}

static void kvm_tdp_unregister_importer(struct kvm_tdp_fd *tdp_fd,
struct kvm_tdp_importer_ops *ops)
{
+ struct kvm_tdp_importer *importer, *n;
+ struct kvm_exported_tdp *tdp;
+
+ if (!tdp_fd || !tdp_fd->priv)
+ return;
+
+ tdp = tdp_fd->priv;
+ spin_lock(&tdp->importer_lock);
+ list_for_each_entry_safe(importer, n, &tdp->importers, node) {
+ if (importer->ops != ops)
+ continue;
+
+ list_del(&importer->node);
+ kfree(importer);
+ }
+ spin_unlock(&tdp->importer_lock);
+}
+
+static void kvm_tdp_unregister_all_importers(struct kvm_exported_tdp *tdp)
+{
+ struct kvm_tdp_importer *importer, *n;
+
+ spin_lock(&tdp->importer_lock);
+ list_for_each_entry_safe(importer, n, &tdp->importers, node) {
+ list_del(&importer->node);
+ kfree(importer);
+ }
+ spin_unlock(&tdp->importer_lock);
}

static void *kvm_tdp_get_metadata(struct kvm_tdp_fd *tdp_fd)
--
2.17.1

2023-12-02 09:47:23

by Yan Zhao

[permalink] [raw]
Subject: [RFC PATCH 07/42] KVM: Forward page fault requests to arch specific code for exported TDP

Implement .fault op of KVM TDP FD object and pass page fault requests from
importers of KVM TDP FD to KVM arch specific code.

Since the thread in which an importer calls the .fault op is not a vCPU
thread and could be a kernel thread, the thread's "mm" is checked and
kthread_use_mm() is called when necessary.

Signed-off-by: Yan Zhao <[email protected]>
---
include/linux/kvm_host.h | 9 +++++++++
virt/kvm/tdp_fd.c | 28 +++++++++++++++++++++++++++-
2 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f73d32eef8833..b76919eec9b72 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2346,6 +2346,8 @@ struct kvm_exported_tdp {
#ifdef __KVM_HAVE_ARCH_EXPORTED_TDP
int kvm_arch_exported_tdp_init(struct kvm *kvm, struct kvm_exported_tdp *tdp);
void kvm_arch_exported_tdp_destroy(struct kvm_exported_tdp *tdp);
+int kvm_arch_fault_exported_tdp(struct kvm_exported_tdp *tdp, unsigned long gfn,
+ struct kvm_tdp_fault_type type);
#else
static inline int kvm_arch_exported_tdp_init(struct kvm *kvm,
struct kvm_exported_tdp *tdp)
@@ -2355,6 +2357,13 @@ static inline int kvm_arch_exported_tdp_init(struct kvm *kvm,
static inline void kvm_arch_exported_tdp_destroy(struct kvm_exported_tdp *tdp)
{
}
+
+static inline int kvm_arch_fault_exported_tdp(struct kvm_exported_tdp *tdp,
+ unsigned long gfn,
+ struct kvm_tdp_fault_type type)
+{
+ return -EOPNOTSUPP;
+}
#endif /* __KVM_HAVE_ARCH_EXPORTED_TDP */

#endif /* CONFIG_HAVE_KVM_EXPORTED_TDP */
diff --git a/virt/kvm/tdp_fd.c b/virt/kvm/tdp_fd.c
index 3271da1a4b2c1..02c9066391ebe 100644
--- a/virt/kvm/tdp_fd.c
+++ b/virt/kvm/tdp_fd.c
@@ -223,7 +223,33 @@ static void *kvm_tdp_get_metadata(struct kvm_tdp_fd *tdp_fd)
static int kvm_tdp_fault(struct kvm_tdp_fd *tdp_fd, struct mm_struct *mm,
unsigned long gfn, struct kvm_tdp_fault_type type)
{
- return -EOPNOTSUPP;
+ bool kthread = current->mm == NULL;
+ int ret = -EINVAL;
+
+ if (!tdp_fd || !tdp_fd->priv || !tdp_fd->priv->kvm)
+ return -EINVAL;
+
+ if (!type.read && !type.write && !type.exec)
+ return -EINVAL;
+
+ if (!mm || tdp_fd->priv->kvm->mm != mm)
+ return -EINVAL;
+
+ if (!mmget_not_zero(mm))
+ return -EPERM;
+
+ if (kthread)
+ kthread_use_mm(mm);
+ else if (current->mm != mm)
+ goto out;
+
+ ret = kvm_arch_fault_exported_tdp(tdp_fd->priv, gfn, type);
+
+ if (kthread)
+ kthread_unuse_mm(mm);
+out:
+ mmput(mm);
+ return ret;
}

static const struct kvm_exported_tdp_ops exported_tdp_ops = {
--
2.17.1

2023-12-02 09:48:02

by Yan Zhao

[permalink] [raw]
Subject: [RFC PATCH 08/42] KVM: Add a helper to notify importers that KVM exported TDP is flushed

Introduce a helper in KVM TDP FD to notify importers that TDP page tables
are invalidated. This helper will be called by arch code (e.g. VMX specific
code).

Currently, the helper will notify all importers of all KVM exported TDPs.
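
For illustration only, an arch-side caller (the real callers are added in
patches 39-42) would use the helper roughly as below; the surrounding
functions are hypothetical:

/* Hypothetical callers; only kvm_tdp_fd_flush_notify() is from this patch. */
static void example_flush_exported_tdp_range(struct kvm *kvm,
					      unsigned long gfn,
					      unsigned long npages)
{
	/* Notify importers that the GFN range KVM just zapped is stale. */
	kvm_tdp_fd_flush_notify(kvm, gfn, npages);
}

static void example_flush_exported_tdp_all(struct kvm *kvm)
{
	/* gfn == 0 and npages == -1UL are interpreted as "invalidate all". */
	kvm_tdp_fd_flush_notify(kvm, 0, -1UL);
}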

Signed-off-by: Yan Zhao <[email protected]>
---
include/linux/kvm_host.h | 3 +++
virt/kvm/tdp_fd.c | 38 ++++++++++++++++++++++++++++++++++++++
2 files changed, 41 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b76919eec9b72..a8af95194767f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2366,5 +2366,8 @@ static inline int kvm_arch_fault_exported_tdp(struct kvm_exported_tdp *tdp,
}
#endif /* __KVM_HAVE_ARCH_EXPORTED_TDP */

+void kvm_tdp_fd_flush_notify(struct kvm *kvm, unsigned long gfn, unsigned long npages);
+
#endif /* CONFIG_HAVE_KVM_EXPORTED_TDP */
+
#endif
diff --git a/virt/kvm/tdp_fd.c b/virt/kvm/tdp_fd.c
index 02c9066391ebe..8c16af685a061 100644
--- a/virt/kvm/tdp_fd.c
+++ b/virt/kvm/tdp_fd.c
@@ -304,3 +304,41 @@ void kvm_tdp_fd_put(struct kvm_tdp_fd *tdp_fd)
fput(tdp_fd->file);
}
EXPORT_SYMBOL_GPL(kvm_tdp_fd_put);
+
+static void kvm_tdp_fd_flush(struct kvm_exported_tdp *tdp, unsigned long gfn,
+ unsigned long npages)
+{
+#define INVALID_NPAGES (-1UL)
+ bool all = (gfn == 0) && (npages == INVALID_NPAGES);
+ struct kvm_tdp_importer *importer;
+ unsigned long start, size;
+
+ if (all) {
+ start = 0;
+ size = -1UL;
+ } else {
+ start = gfn << PAGE_SHIFT;
+ size = npages << PAGE_SHIFT;
+ }
+
+ spin_lock(&tdp->importer_lock);
+
+ list_for_each_entry(importer, &tdp->importers, node) {
+ if (!importer->ops->invalidate)
+ continue;
+
+ importer->ops->invalidate(importer->data, start, size);
+ }
+ spin_unlock(&tdp->importer_lock);
+}
+
+void kvm_tdp_fd_flush_notify(struct kvm *kvm, unsigned long gfn, unsigned long npages)
+{
+ struct kvm_exported_tdp *tdp;
+
+ spin_lock(&kvm->exported_tdplist_lock);
+ list_for_each_entry(tdp, &kvm->exported_tdp_list, list_node)
+ kvm_tdp_fd_flush(tdp, gfn, npages);
+ spin_unlock(&kvm->exported_tdplist_lock);
+}
+EXPORT_SYMBOL_GPL(kvm_tdp_fd_flush_notify);
--
2.17.1

2023-12-02 09:48:38

by Yan Zhao

[permalink] [raw]
Subject: [RFC PATCH 09/42] iommu: Add IOMMU_DOMAIN_KVM

Introduce a new domain type to share stage 2 mappings from KVM.

Paging structure allocation/free of this new domain is managed by KVM.
The IOMMU side just gets the page table root address from KVM by parsing
vendor specific data passed in from KVM through IOMMUFD, and sets it in the
IOMMU hardware.

This new domain can be allocated by domain_alloc_kvm op, and attached to
a device through the existing iommu_attach_device/group() interfaces.

Page mapping/unmapping are managed by KVM too, therefore map/unmap ops are
not implemented.

Signed-off-by: Yan Zhao <[email protected]>
---
include/linux/iommu.h | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index c79378833c758..9ecee72e2d6c4 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -171,6 +171,8 @@ struct iommu_domain_geometry {
#define __IOMMU_DOMAIN_NESTED (1U << 6) /* User-managed address space nested
on a stage-2 translation */

+#define __IOMMU_DOMAIN_KVM (1U << 7) /* KVM-managed stage-2 translation */
+
#define IOMMU_DOMAIN_ALLOC_FLAGS ~__IOMMU_DOMAIN_DMA_FQ
/*
* This are the possible domain-types
@@ -187,6 +189,7 @@ struct iommu_domain_geometry {
* invalidation.
* IOMMU_DOMAIN_SVA - DMA addresses are shared process addresses
* represented by mm_struct's.
+ * IOMMU_DOMAIN_KVM - DMA mappings on stage 2, managed by KVM.
* IOMMU_DOMAIN_PLATFORM - Legacy domain for drivers that do their own
* dma_api stuff. Do not use in new drivers.
*/
@@ -201,6 +204,7 @@ struct iommu_domain_geometry {
#define IOMMU_DOMAIN_SVA (__IOMMU_DOMAIN_SVA)
#define IOMMU_DOMAIN_PLATFORM (__IOMMU_DOMAIN_PLATFORM)
#define IOMMU_DOMAIN_NESTED (__IOMMU_DOMAIN_NESTED)
+#define IOMMU_DOMAIN_KVM (__IOMMU_DOMAIN_KVM)

struct iommu_domain {
unsigned type;
--
2.17.1

2023-12-02 09:49:12

by Yan Zhao

[permalink] [raw]
Subject: [RFC PATCH 10/42] iommu: Add new iommu op to create domains managed by KVM

Introduce a new iommu_domain op to create domains managed by KVM through
IOMMUFD.

These domains have a few properties that differ from kernel-owned domains
and user-owned domains:

- They must not be PAGING domains. Page mapping/unmapping is controlled by
KVM.

- They must be stage 2 mappings translating GPA to HPA.

- Paging structure allocation/free is not managed by IOMMU driver, but
by KVM.

- TLBs flushes are notified by KVM.

The new op clearly states that the domain is being created by IOMMUFD.
A driver-specific structure describing the meta data of the paging
structures from KVM is passed in via the op param "data".

IOMMU drivers that cannot support VFIO/IOMMUFD should not support this op.

This new op for now is only supposed to be used by IOMMUFD, hence no
wrapper for it. IOMMUFD would call the callback directly. As for domain
free, IOMMUFD would use iommu_domain_free().

Signed-off-by: Yan Zhao <[email protected]>
---
include/linux/iommu.h | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 9ecee72e2d6c4..0ce23ee399d35 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -522,6 +522,13 @@ __iommu_copy_struct_from_user_array(void *dst_data,
* @domain_alloc_paging: Allocate an iommu_domain that can be used for
* UNMANAGED, DMA, and DMA_FQ domain types.
* @domain_alloc_sva: Allocate an iommu_domain for Shared Virtual Addressing.
+ * @domain_alloc_kvm: Allocate an iommu domain with type IOMMU_DOMAIN_KVM.
+ * It's called by IOMMUFD and must fully initialize the new
+ * domain before return.
+ * The @data is of type "const void *" whose format is defined
+ * in kvm arch specific header "asm/kvm_exported_tdp.h".
+ * Upon success, a domain of type IOMMU_DOMAIN_KVM is returned.
+ * Upon failure, ERR_PTR is returned.
* @probe_device: Add device to iommu driver handling
* @release_device: Remove device from iommu driver handling
* @probe_finalize: Do final setup work after the device is added to an IOMMU
@@ -564,6 +571,8 @@ struct iommu_ops {
struct iommu_domain *(*domain_alloc_paging)(struct device *dev);
struct iommu_domain *(*domain_alloc_sva)(struct device *dev,
struct mm_struct *mm);
+ struct iommu_domain *(*domain_alloc_kvm)(struct device *dev, u32 flags,
+ const void *data);

struct iommu_device *(*probe_device)(struct device *dev);
void (*release_device)(struct device *dev);
--
2.17.1

2023-12-02 09:49:57

by Yan Zhao

[permalink] [raw]
Subject: [RFC PATCH 11/42] iommu: Add new domain op cache_invalidate_kvm

When KVM invalidates mappings that are shared as IOMMU stage 2 paging
structures, the IOMMU driver needs to invalidate hardware TLBs accordingly.

The new op cache_invalidate_kvm is called from IOMMUFD to invalidate
hardware TLBs upon receiving invalidation notifications from KVM.
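
For illustration, the IOMMUFD KVM HWPT (patches 13/14) is expected to wire
KVM's invalidation notification (patch 08) to this op roughly as below; the
callback name is illustrative:

static void example_hwpt_invalidate(void *data, unsigned long start,
				    unsigned long size)
{
	struct iommufd_hw_pagetable *hwpt = data;

	if (hwpt->domain->ops->cache_invalidate_kvm)
		hwpt->domain->ops->cache_invalidate_kvm(hwpt->domain,
							start, size);
}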

Signed-off-by: Yan Zhao <[email protected]>
---
include/linux/iommu.h | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 0ce23ee399d35..0b056d5a6b3a3 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -636,6 +636,9 @@ struct iommu_ops {
* forward a driver specific error code to user space.
* Both the driver data structure and the error code
* must be defined in include/uapi/linux/iommufd.h
+ * @cache_invalidate_kvm: Synchronously flush hardware TLBs for KVM managed
+ * stage 2 IO page tables.
+ * The @domain must be IOMMU_DOMAIN_KVM.
* @iova_to_phys: translate iova to physical address
* @enforce_cache_coherency: Prevent any kind of DMA from bypassing IOMMU_CACHE,
* including no-snoop TLPs on PCIe or other platform
@@ -665,6 +668,8 @@ struct iommu_domain_ops {
int (*cache_invalidate_user)(struct iommu_domain *domain,
struct iommu_user_data_array *array,
u32 *error_code);
+ void (*cache_invalidate_kvm)(struct iommu_domain *domain,
+ unsigned long iova, unsigned long size);

phys_addr_t (*iova_to_phys)(struct iommu_domain *domain,
dma_addr_t iova);
--
2.17.1

2023-12-02 09:50:25

by Yan Zhao

Subject: [RFC PATCH 12/42] iommufd: Introduce allocation data info and flag for KVM managed HWPT

Add allocation data info iommu_hwpt_kvm_info to allow IOMMUFD to create a
KVM managed HWPT via ioctl IOMMU_HWPT_ALLOC.

As a KVM managed HWPT serves as stage-2 page tables whose paging structure
and page mapping/unmapping are managed by KVM, there's no need to connect a
KVM managed HWPT to an IOAS or a parent HWPT.
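
For illustration, a hedged userspace fragment that allocates such a HWPT
("iommufd", "dev_id" and "kvm_tdp_fd" are assumed to come from earlier
IOMMUFD/VFIO setup and from KVM_CREATE_TDP_FD; the iommu_hwpt_alloc field
layout follows the nesting series this RFC builds on):

	struct iommu_hwpt_kvm_info kvm_info = { .fd = kvm_tdp_fd };
	struct iommu_hwpt_alloc alloc = {
		.size = sizeof(alloc),
		.dev_id = dev_id,
		/* .pt_id is left 0: it is not queried for IOMMU_HWPT_DATA_KVM */
		.data_type = IOMMU_HWPT_DATA_KVM,
		.data_len = sizeof(kvm_info),
		.data_uptr = (uintptr_t)&kvm_info,
	};
	__u32 kvm_hwpt_id;

	if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &alloc) < 0)
		return -errno;
	kvm_hwpt_id = alloc.out_hwpt_id;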

Signed-off-by: Yan Zhao <[email protected]>
---
include/uapi/linux/iommufd.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 71c009cc614a4..08570f3a751fc 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -390,6 +390,15 @@ struct iommu_hwpt_vtd_s1 {
__u32 __reserved;
};

+/**
+ * struct iommu_hwpt_kvm_info - KVM managed stage-2 page table info
+ * (IOMMU_HWPT_DATA_KVM)
+ * @fd: The fd of the page table shared from KVM
+ */
+struct iommu_hwpt_kvm_info {
+ __aligned_u64 fd;
+};
+
/**
* struct iommu_hwpt_arm_smmuv3 - ARM SMMUv3 Context Descriptor Table info
* (IOMMU_HWPT_DATA_ARM_SMMUV3)
@@ -413,11 +422,13 @@ struct iommu_hwpt_arm_smmuv3 {
* @IOMMU_HWPT_DATA_NONE: no data
* @IOMMU_HWPT_DATA_VTD_S1: Intel VT-d stage-1 page table
* @IOMMU_HWPT_DATA_ARM_SMMUV3: ARM SMMUv3 Context Descriptor Table
+ * @IOMMU_HWPT_DATA_KVM: KVM managed stage-2 page table
*/
enum iommu_hwpt_data_type {
IOMMU_HWPT_DATA_NONE,
IOMMU_HWPT_DATA_VTD_S1,
IOMMU_HWPT_DATA_ARM_SMMUV3,
+ IOMMU_HWPT_DATA_KVM,
};

/**
@@ -447,6 +458,10 @@ enum iommu_hwpt_data_type {
* must be set to a pre-defined type corresponding to an I/O page table
* type supported by the underlying IOMMU hardware.
*
+ * A KVM-managed HWPT will be created if @data_type is IOMMU_HWPT_DATA_KVM.
+ * @pt_id is not queried if data_type is IOMMU_HWPT_DATA_KVM because KVM-managed
+ * HWPT doesn't have any IOAS or parent HWPT associated.
+ *
* If the @data_type is set to IOMMU_HWPT_DATA_NONE, @data_len and
* @data_uptr should be zero. Otherwise, both @data_len and @data_uptr
* must be given.
--
2.17.1

2023-12-02 09:51:02

by Yan Zhao

Subject: [RFC PATCH 13/42] iommufd: Add a KVM HW pagetable object

Add new obj type IOMMUFD_OBJ_HWPT_KVM for KVM HW page tables, which
correspond to iommu stage 2 domains whose paging structures and mappings
are managed by KVM.

Extend the IOMMU_HWPT_ALLOC ioctl to accept KVM HW page table specific
data of "struct iommu_hwpt_kvm_info".

The real allocator iommufd_hwpt_kvm_alloc() is an empty stub for now and
will be implemented in the next patch when config IOMMUFD_KVM_HWPT is on.

Signed-off-by: Yan Zhao <[email protected]>
---
drivers/iommu/iommufd/device.c | 13 +++++----
drivers/iommu/iommufd/hw_pagetable.c | 29 +++++++++++++++++++-
drivers/iommu/iommufd/iommufd_private.h | 35 +++++++++++++++++++++++++
drivers/iommu/iommufd/main.c | 4 +++
4 files changed, 75 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 59d3a07300d93..83af6b7e2784b 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -629,7 +629,8 @@ static int iommufd_device_change_pt(struct iommufd_device *idev, u32 *pt_id,

switch (pt_obj->type) {
case IOMMUFD_OBJ_HWPT_NESTED:
- case IOMMUFD_OBJ_HWPT_PAGING: {
+ case IOMMUFD_OBJ_HWPT_PAGING:
+ case IOMMUFD_OBJ_HWPT_KVM: {
struct iommufd_hw_pagetable *hwpt =
container_of(pt_obj, struct iommufd_hw_pagetable, obj);

@@ -667,8 +668,9 @@ static int iommufd_device_change_pt(struct iommufd_device *idev, u32 *pt_id,
/**
* iommufd_device_attach - Connect a device to an iommu_domain
* @idev: device to attach
- * @pt_id: Input a IOMMUFD_OBJ_IOAS, or IOMMUFD_OBJ_HWPT_PAGING
- * Output the IOMMUFD_OBJ_HWPT_PAGING ID
+ * @pt_id: Input a IOMMUFD_OBJ_IOAS, or IOMMUFD_OBJ_HWPT_PAGING, or
+ * IOMMUFD_OBJ_HWPT_KVM
+ * Output the IOMMUFD_OBJ_HWPT_PAGING ID or IOMMUFD_OBJ_HWPT_KVM ID
*
* This connects the device to an iommu_domain, either automatically or manually
* selected. Once this completes the device could do DMA.
@@ -696,8 +698,9 @@ EXPORT_SYMBOL_NS_GPL(iommufd_device_attach, IOMMUFD);
/**
* iommufd_device_replace - Change the device's iommu_domain
* @idev: device to change
- * @pt_id: Input a IOMMUFD_OBJ_IOAS, or IOMMUFD_OBJ_HWPT_PAGING
- * Output the IOMMUFD_OBJ_HWPT_PAGING ID
+ * @pt_id: Input a IOMMUFD_OBJ_IOAS, or IOMMUFD_OBJ_HWPT_PAGING, or
+ * IOMMUFD_OBJ_HWPT_KVM
+ * Output the IOMMUFD_OBJ_HWPT_PAGING ID or IOMMUFD_OBJ_HWPT_KVM ID
*
* This is the same as::
*
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 367459d92f696..c8430ec42cdf8 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -273,6 +273,31 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
if (IS_ERR(idev))
return PTR_ERR(idev);

+ if (cmd->data_type == IOMMU_HWPT_DATA_KVM) {
+ struct iommu_hwpt_kvm_info kvm_data;
+ struct iommufd_hwpt_kvm *hwpt_kvm;
+
+ if (!cmd->data_len || cmd->data_len != sizeof(kvm_data) ||
+ !cmd->data_uptr) {
+ rc = -EINVAL;
+ goto out_put_idev;
+ }
+ rc = copy_struct_from_user(&kvm_data, sizeof(kvm_data),
+ u64_to_user_ptr(cmd->data_uptr),
+ cmd->data_len);
+ if (rc)
+ goto out_put_idev;
+
+ hwpt_kvm = iommufd_hwpt_kvm_alloc(ucmd->ictx, idev, cmd->flags,
+ &kvm_data);
+ if (IS_ERR(hwpt_kvm)) {
+ rc = PTR_ERR(hwpt_kvm);
+ goto out_put_idev;
+ }
+ hwpt = &hwpt_kvm->common;
+ goto out_respond;
+ }
+
pt_obj = iommufd_get_object(ucmd->ictx, cmd->pt_id, IOMMUFD_OBJ_ANY);
if (IS_ERR(pt_obj)) {
rc = -EINVAL;
@@ -310,6 +335,7 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
goto out_put_pt;
}

+out_respond:
cmd->out_hwpt_id = hwpt->obj.id;
rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
if (rc)
@@ -323,7 +349,8 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
if (ioas)
mutex_unlock(&ioas->mutex);
out_put_pt:
- iommufd_put_object(pt_obj);
+ if (cmd->data_type != IOMMU_HWPT_DATA_KVM)
+ iommufd_put_object(pt_obj);
out_put_idev:
iommufd_put_object(&idev->obj);
return rc;
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 160521800d9b4..a46a6e3e537f9 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -125,6 +125,7 @@ enum iommufd_object_type {
IOMMUFD_OBJ_DEVICE,
IOMMUFD_OBJ_HWPT_PAGING,
IOMMUFD_OBJ_HWPT_NESTED,
+ IOMMUFD_OBJ_HWPT_KVM,
IOMMUFD_OBJ_IOAS,
IOMMUFD_OBJ_ACCESS,
#ifdef CONFIG_IOMMUFD_TEST
@@ -266,17 +267,33 @@ struct iommufd_hwpt_nested {
struct iommufd_hwpt_paging *parent;
};

+struct iommufd_hwpt_kvm {
+ struct iommufd_hw_pagetable common;
+ void *context;
+};
+
static inline bool hwpt_is_paging(struct iommufd_hw_pagetable *hwpt)
{
return hwpt->obj.type == IOMMUFD_OBJ_HWPT_PAGING;
}

+static inline bool hwpt_is_kvm(struct iommufd_hw_pagetable *hwpt)
+{
+ return hwpt->obj.type == IOMMUFD_OBJ_HWPT_KVM;
+}
+
static inline struct iommufd_hwpt_paging *
to_hwpt_paging(struct iommufd_hw_pagetable *hwpt)
{
return container_of(hwpt, struct iommufd_hwpt_paging, common);
}

+static inline struct iommufd_hwpt_kvm *
+to_hwpt_kvm(struct iommufd_hw_pagetable *hwpt)
+{
+ return container_of(hwpt, struct iommufd_hwpt_kvm, common);
+}
+
static inline struct iommufd_hwpt_paging *
iommufd_get_hwpt_paging(struct iommufd_ucmd *ucmd, u32 id)
{
@@ -413,4 +430,22 @@ static inline bool iommufd_selftest_is_mock_dev(struct device *dev)
return false;
}
#endif
+
+struct iommu_hwpt_kvm_info;
+static inline struct iommufd_hwpt_kvm *
+iommufd_hwpt_kvm_alloc(struct iommufd_ctx *ictx,
+ struct iommufd_device *idev, u32 flags,
+ const struct iommu_hwpt_kvm_info *kvm_data)
+{
+ return ERR_PTR(-EOPNOTSUPP);
+}
+
+static inline void iommufd_hwpt_kvm_abort(struct iommufd_object *obj)
+{
+}
+
+static inline void iommufd_hwpt_kvm_destroy(struct iommufd_object *obj)
+{
+}
+
#endif
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 6edef860f91cc..0798c1279133f 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -499,6 +499,10 @@ static const struct iommufd_object_ops iommufd_object_ops[] = {
.destroy = iommufd_hwpt_nested_destroy,
.abort = iommufd_hwpt_nested_abort,
},
+ [IOMMUFD_OBJ_HWPT_KVM] = {
+ .destroy = iommufd_hwpt_kvm_destroy,
+ .abort = iommufd_hwpt_kvm_abort,
+ },
#ifdef CONFIG_IOMMUFD_TEST
[IOMMUFD_OBJ_SELFTEST] = {
.destroy = iommufd_selftest_destroy,
--
2.17.1

2023-12-02 09:51:30

by Yan Zhao

Subject: [RFC PATCH 14/42] iommufd: Enable KVM HW page table object to be proxy between KVM and IOMMU

Enable the IOMMUFD KVM HW page table object to serve as a proxy between KVM
and the IOMMU driver. Config IOMMUFD_KVM_HWPT is added to turn this ability
on/off.

The KVM HW page table object first gets the KVM TDP fd object via the KVM
exported interface kvm_tdp_fd_get() and then queries KVM for the vendor meta
data of the page tables exported (shared) by KVM. It then passes the meta
data to the IOMMU driver to create an IOMMU_DOMAIN_KVM domain via op
domain_alloc_kvm. The IOMMU driver is responsible for checking compatibility
between the IOMMU hardware and the KVM exported page tables.

After successfully creating the IOMMU_DOMAIN_KVM domain, the IOMMUFD KVM HW
page table object registers an invalidation callback with KVM to receive
invalidation notifications. It then passes the notifications to the IOMMU
driver via op cache_invalidate_kvm to invalidate hardware TLBs.
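
In condensed form, the allocation path implemented below is (error handling
and the symbol_get() plumbing omitted):

	tdp_fd = kvm_tdp_fd_get(kvm_data->fd);            /* take a KVM TDP FD reference */
	meta   = tdp_fd->ops->get_metadata(tdp_fd);       /* vendor paging structure meta data */
	hwpt->domain = ops->domain_alloc_kvm(idev->dev, flags, meta);
	tdp_fd->ops->register_importer(tdp_fd, &iommufd_import_ops, hwpt);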

Signed-off-by: Yan Zhao <[email protected]>
---
drivers/iommu/iommufd/Kconfig | 10 ++
drivers/iommu/iommufd/Makefile | 1 +
drivers/iommu/iommufd/hw_pagetable_kvm.c | 183 +++++++++++++++++++++++
drivers/iommu/iommufd/iommufd_private.h | 9 ++
4 files changed, 203 insertions(+)
create mode 100644 drivers/iommu/iommufd/hw_pagetable_kvm.c

diff --git a/drivers/iommu/iommufd/Kconfig b/drivers/iommu/iommufd/Kconfig
index 99d4b075df49e..d79e0c1e00a4d 100644
--- a/drivers/iommu/iommufd/Kconfig
+++ b/drivers/iommu/iommufd/Kconfig
@@ -32,6 +32,16 @@ config IOMMUFD_VFIO_CONTAINER

Unless testing IOMMUFD, say N here.

+config IOMMUFD_KVM_HWPT
+ bool "Supports KVM managed HW page tables"
+ default n
+ help
+ Selecting this option will allow IOMMUFD to create IOMMU stage 2
+ page tables whose paging structure and mappings are managed by
+ KVM MMU. IOMMUFD serves as a proxy between KVM and the IOMMU driver to
+ allow the IOMMU driver to get paging structure meta data and cache
+ invalidate notifications from KVM.
+
config IOMMUFD_TEST
bool "IOMMU Userspace API Test support"
depends on DEBUG_KERNEL
diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index 34b446146961c..ae1e0b5c300dc 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -8,6 +8,7 @@ iommufd-y := \
pages.o \
vfio_compat.o

+iommufd-$(CONFIG_IOMMUFD_KVM_HWPT) += hw_pagetable_kvm.o
iommufd-$(CONFIG_IOMMUFD_TEST) += selftest.o

obj-$(CONFIG_IOMMUFD) += iommufd.o
diff --git a/drivers/iommu/iommufd/hw_pagetable_kvm.c b/drivers/iommu/iommufd/hw_pagetable_kvm.c
new file mode 100644
index 0000000000000..e0e205f384ed5
--- /dev/null
+++ b/drivers/iommu/iommufd/hw_pagetable_kvm.c
@@ -0,0 +1,183 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/iommu.h>
+#include <uapi/linux/iommufd.h>
+#include <linux/kvm_tdp_fd.h>
+
+#include "../iommu-priv.h"
+#include "iommufd_private.h"
+
+static void iommufd_kvmtdp_invalidate(void *data,
+ unsigned long start, unsigned long size)
+{
+ void (*invalidate_fn)(struct iommu_domain *domain,
+ unsigned long iova, unsigned long size);
+ struct iommufd_hw_pagetable *hwpt = data;
+
+ if (!hwpt || !hwpt_is_kvm(hwpt))
+ return;
+
+ invalidate_fn = hwpt->domain->ops->cache_invalidate_kvm;
+
+ if (!invalidate_fn)
+ return;
+
+ invalidate_fn(hwpt->domain, start, size);
+
+}
+
+struct kvm_tdp_importer_ops iommufd_import_ops = {
+ .invalidate = iommufd_kvmtdp_invalidate,
+};
+
+static inline int kvmtdp_register(struct kvm_tdp_fd *tdp_fd, void *data)
+{
+ if (!tdp_fd->ops->register_importer || !tdp_fd->ops->unregister_importer)
+ return -EOPNOTSUPP;
+
+ return tdp_fd->ops->register_importer(tdp_fd, &iommufd_import_ops, data);
+}
+
+static inline void kvmtdp_unregister(struct kvm_tdp_fd *tdp_fd)
+{
+ WARN_ON(!tdp_fd->ops->unregister_importer);
+
+ tdp_fd->ops->unregister_importer(tdp_fd, &iommufd_import_ops);
+}
+
+static inline void *kvmtdp_get_metadata(struct kvm_tdp_fd *tdp_fd)
+{
+ if (!tdp_fd->ops->get_metadata)
+ return ERR_PTR(-EOPNOTSUPP);
+
+ return tdp_fd->ops->get_metadata(tdp_fd);
+}
+
+/*
+ * Get KVM TDP FD object and ensure tdp_fd->ops is available
+ */
+static inline struct kvm_tdp_fd *kvmtdp_get(int fd)
+{
+ struct kvm_tdp_fd *tdp_fd = NULL;
+ struct kvm_tdp_fd *(*get_func)(int fd) = NULL;
+ void (*put_func)(struct kvm_tdp_fd *) = NULL;
+
+ get_func = symbol_get(kvm_tdp_fd_get);
+
+ if (!get_func)
+ goto out;
+
+ put_func = symbol_get(kvm_tdp_fd_put);
+ if (!put_func)
+ goto out;
+
+ tdp_fd = get_func(fd);
+ if (!tdp_fd)
+ goto out;
+
+ if (tdp_fd->ops) {
+ /* success */
+ goto out;
+ }
+
+ put_func(tdp_fd);
+ tdp_fd = NULL;
+
+out:
+ if (get_func)
+ symbol_put(kvm_tdp_fd_get);
+
+ if (put_func)
+ symbol_put(kvm_tdp_fd_put);
+
+ return tdp_fd;
+}
+
+static void kvmtdp_put(struct kvm_tdp_fd *tdp_fd)
+{
+ void (*put_func)(struct kvm_tdp_fd *) = NULL;
+
+ put_func = symbol_get(kvm_tdp_fd_put);
+ WARN_ON(!put_func);
+
+ put_func(tdp_fd);
+
+ symbol_put(kvm_tdp_fd_put);
+}
+
+void iommufd_hwpt_kvm_destroy(struct iommufd_object *obj)
+{
+ struct kvm_tdp_fd *tdp_fd;
+ struct iommufd_hwpt_kvm *hwpt_kvm =
+ container_of(obj, struct iommufd_hwpt_kvm, common.obj);
+
+ if (hwpt_kvm->common.domain)
+ iommu_domain_free(hwpt_kvm->common.domain);
+
+ tdp_fd = hwpt_kvm->context;
+ kvmtdp_unregister(tdp_fd);
+ kvmtdp_put(tdp_fd);
+}
+
+void iommufd_hwpt_kvm_abort(struct iommufd_object *obj)
+{
+ iommufd_hwpt_kvm_destroy(obj);
+}
+
+struct iommufd_hwpt_kvm *
+iommufd_hwpt_kvm_alloc(struct iommufd_ctx *ictx,
+ struct iommufd_device *idev, u32 flags,
+ const struct iommu_hwpt_kvm_info *kvm_data)
+{
+
+ const struct iommu_ops *ops = dev_iommu_ops(idev->dev);
+ struct iommufd_hwpt_kvm *hwpt_kvm;
+ struct iommufd_hw_pagetable *hwpt;
+ struct kvm_tdp_fd *tdp_fd;
+ void *meta_data;
+ int rc;
+
+ if (!ops->domain_alloc_kvm)
+ return ERR_PTR(-EOPNOTSUPP);
+
+ if (kvm_data->fd < 0)
+ return ERR_PTR(-EINVAL);
+
+ tdp_fd = kvmtdp_get(kvm_data->fd);
+ if (!tdp_fd)
+ return ERR_PTR(-EOPNOTSUPP);
+
+ meta_data = kvmtdp_get_metadata(tdp_fd);
+ if (!meta_data || IS_ERR(meta_data)) {
+ rc = -EFAULT;
+ goto out_put_tdp;
+ }
+
+ hwpt_kvm = __iommufd_object_alloc(ictx, hwpt_kvm, IOMMUFD_OBJ_HWPT_KVM,
+ common.obj);
+ if (IS_ERR(hwpt_kvm)) {
+ rc = PTR_ERR(hwpt_kvm);
+ goto out_put_tdp;
+ }
+
+ hwpt_kvm->context = tdp_fd;
+ hwpt = &hwpt_kvm->common;
+
+ hwpt->domain = ops->domain_alloc_kvm(idev->dev, flags, meta_data);
+ if (IS_ERR(hwpt->domain)) {
+ rc = PTR_ERR(hwpt->domain);
+ hwpt->domain = NULL;
+ goto out_abort;
+ }
+
+ rc = kvmtdp_register(tdp_fd, hwpt);
+ if (rc)
+ goto out_abort;
+
+ return hwpt_kvm;
+
+out_abort:
+ iommufd_object_abort_and_destroy(ictx, &hwpt->obj);
+out_put_tdp:
+ kvmtdp_put(tdp_fd);
+ return ERR_PTR(rc);
+}
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index a46a6e3e537f9..2c3149b1d5b55 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -432,6 +432,14 @@ static inline bool iommufd_selftest_is_mock_dev(struct device *dev)
#endif

struct iommu_hwpt_kvm_info;
+#ifdef CONFIG_IOMMUFD_KVM_HWPT
+struct iommufd_hwpt_kvm *
+iommufd_hwpt_kvm_alloc(struct iommufd_ctx *ictx,
+ struct iommufd_device *idev, u32 flags,
+ const struct iommu_hwpt_kvm_info *kvm_data);
+void iommufd_hwpt_kvm_abort(struct iommufd_object *obj);
+void iommufd_hwpt_kvm_destroy(struct iommufd_object *obj);
+#else
static inline struct iommufd_hwpt_kvm *
iommufd_hwpt_kvm_alloc(struct iommufd_ctx *ictx,
struct iommufd_device *idev, u32 flags,
@@ -447,5 +455,6 @@ static inline void iommufd_hwpt_kvm_abort(struct iommufd_object *obj)
static inline void iommufd_hwpt_kvm_destroy(struct iommufd_object *obj)
{
}
+#endif

#endif
--
2.17.1

2023-12-02 09:51:57

by Yan Zhao

Subject: [RFC PATCH 15/42] iommufd: Add iopf handler to KVM hw pagetable

Add an iopf handler to the KVM HW page table. The iopf handler forwards IO
page fault requests to KVM and returns the completion status back to the
IOMMU driver via iommu_page_response().

Signed-off-by: Yan Zhao <[email protected]>
---
drivers/iommu/iommufd/hw_pagetable_kvm.c | 87 ++++++++++++++++++++++++
1 file changed, 87 insertions(+)

diff --git a/drivers/iommu/iommufd/hw_pagetable_kvm.c b/drivers/iommu/iommufd/hw_pagetable_kvm.c
index e0e205f384ed5..bff9fa3d9f703 100644
--- a/drivers/iommu/iommufd/hw_pagetable_kvm.c
+++ b/drivers/iommu/iommufd/hw_pagetable_kvm.c
@@ -6,6 +6,89 @@
#include "../iommu-priv.h"
#include "iommufd_private.h"

+static int iommufd_kvmtdp_fault(void *data, struct mm_struct *mm,
+ unsigned long addr, u32 perm)
+{
+ struct iommufd_hw_pagetable *hwpt = data;
+ struct kvm_tdp_fault_type fault_type = {0};
+ unsigned long gfn = addr >> PAGE_SHIFT;
+ struct kvm_tdp_fd *tdp_fd;
+ int ret;
+
+ if (!hwpt || !hwpt_is_kvm(hwpt))
+ return IOMMU_PAGE_RESP_INVALID;
+
+ tdp_fd = to_hwpt_kvm(hwpt)->context;
+ if (!tdp_fd->ops->fault)
+ return IOMMU_PAGE_RESP_INVALID;
+
+ fault_type.read = !!(perm & IOMMU_FAULT_PERM_READ);
+ fault_type.write = !!(perm & IOMMU_FAULT_PERM_WRITE);
+ fault_type.exec = !!(perm & IOMMU_FAULT_PERM_EXEC);
+
+ ret = tdp_fd->ops->fault(tdp_fd, mm, gfn, fault_type);
+ return ret ? IOMMU_PAGE_RESP_FAILURE : IOMMU_PAGE_RESP_SUCCESS;
+}
+
+static int iommufd_kvmtdp_complete_group(struct device *dev, struct iopf_fault *iopf,
+ enum iommu_page_response_code status)
+{
+ struct iommu_page_response resp = {
+ .pasid = iopf->fault.prm.pasid,
+ .grpid = iopf->fault.prm.grpid,
+ .code = status,
+ };
+
+ if ((iopf->fault.prm.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID) &&
+ (iopf->fault.prm.flags & IOMMU_FAULT_PAGE_RESPONSE_NEEDS_PASID))
+ resp.flags = IOMMU_PAGE_RESP_PASID_VALID;
+
+ return iommu_page_response(dev, &resp);
+}
+
+static void iommufd_kvmtdp_handle_iopf(struct work_struct *work)
+{
+ struct iopf_fault *iopf;
+ struct iopf_group *group;
+ enum iommu_page_response_code status = IOMMU_PAGE_RESP_SUCCESS;
+ struct iommu_domain *domain;
+ void *fault_data;
+ int ret;
+
+ group = container_of(work, struct iopf_group, work);
+ domain = group->domain;
+ fault_data = domain->fault_data;
+
+ list_for_each_entry(iopf, &group->faults, list) {
+ /*
+ * For the moment, errors are sticky: don't handle subsequent
+ * faults in the group if there is an error.
+ */
+ if (status != IOMMU_PAGE_RESP_SUCCESS)
+ break;
+
+ status = iommufd_kvmtdp_fault(fault_data, domain->mm,
+ iopf->fault.prm.addr,
+ iopf->fault.prm.perm);
+ }
+
+ ret = iommufd_kvmtdp_complete_group(group->dev, &group->last_fault, status);
+
+ iopf_free_group(group);
+
+}
+
+static int iommufd_kvmtdp_iopf_handler(struct iopf_group *group)
+{
+ struct iommu_fault_param *fault_param = group->dev->iommu->fault_param;
+
+ INIT_WORK(&group->work, iommufd_kvmtdp_handle_iopf);
+ if (!queue_work(fault_param->queue->wq, &group->work))
+ return -EBUSY;
+
+ return 0;
+}
+
static void iommufd_kvmtdp_invalidate(void *data,
unsigned long start, unsigned long size)
{
@@ -169,6 +252,10 @@ iommufd_hwpt_kvm_alloc(struct iommufd_ctx *ictx,
goto out_abort;
}

+ hwpt->domain->mm = current->mm;
+ hwpt->domain->iopf_handler = iommufd_kvmtdp_iopf_handler;
+ hwpt->domain->fault_data = hwpt;
+
rc = kvmtdp_register(tdp_fd, hwpt);
if (rc)
goto out_abort;
--
2.17.1

2023-12-02 09:52:21

by Yan Zhao

Subject: [RFC PATCH 16/42] iommufd: Enable device feature IOPF during device attachment to KVM HWPT

Enable device feature IOPF during device attachment to the KVM HWPT and
abort the attachment if feature enabling fails.

"pin" is not done by KVM HWPT. If VMM wants to create KVM HWPT, it must
know that all devices attached to this HWPT support IOPF so that pin-all
is skipped.

Signed-off-by: Yan Zhao <[email protected]>
---
drivers/iommu/iommufd/device.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 83af6b7e2784b..4ea447e052ce1 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -381,10 +381,26 @@ int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
goto err_unresv;
idev->igroup->hwpt = hwpt;
}
+ if (hwpt_is_kvm(hwpt)) {
+ /*
+ * Feature IOPF requires ats is enabled which is true only
+ * after device is attached to iommu domain.
+ * So enable dev feature IOPF after iommu_attach_group().
+ * -EBUSY will be returned if feature IOPF is already on.
+ */
+ rc = iommu_dev_enable_feature(idev->dev, IOMMU_DEV_FEAT_IOPF);
+ if (rc && rc != -EBUSY)
+ goto err_detach;
+ }
refcount_inc(&hwpt->obj.users);
list_add_tail(&idev->group_item, &idev->igroup->device_list);
mutex_unlock(&idev->igroup->lock);
return 0;
+err_detach:
+ if (list_empty(&idev->igroup->device_list)) {
+ iommu_detach_group(hwpt->domain, idev->igroup->group);
+ idev->igroup->hwpt = NULL;
+ }
err_unresv:
if (hwpt_is_paging(hwpt))
iopt_remove_reserved_iova(&to_hwpt_paging(hwpt)->ioas->iopt,
@@ -408,6 +424,8 @@ iommufd_hw_pagetable_detach(struct iommufd_device *idev)
if (hwpt_is_paging(hwpt))
iopt_remove_reserved_iova(&to_hwpt_paging(hwpt)->ioas->iopt,
idev->dev);
+ if (hwpt_is_kvm(hwpt))
+ iommu_dev_disable_feature(idev->dev, IOMMU_DEV_FEAT_IOPF);
mutex_unlock(&idev->igroup->lock);

/* Caller must destroy hwpt */
--
2.17.1

2023-12-02 09:53:06

by Yan Zhao

Subject: [RFC PATCH 17/42] iommu/vt-d: Make some macros and helpers to be extern

This makes the macros and helpers visible outside of iommu.c, as a
preparation for the next patch to create domains of type IOMMU_DOMAIN_KVM.

Signed-off-by: Yan Zhao <[email protected]>
---
drivers/iommu/intel/iommu.c | 39 +++----------------------------------
drivers/iommu/intel/iommu.h | 35 +++++++++++++++++++++++++++++++++
2 files changed, 38 insertions(+), 36 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 5df6c21781e1c..924006cda18c5 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -49,7 +49,6 @@
#define MAX_AGAW_PFN_WIDTH (MAX_AGAW_WIDTH - VTD_PAGE_SHIFT)

#define __DOMAIN_MAX_PFN(gaw) ((((uint64_t)1) << ((gaw) - VTD_PAGE_SHIFT)) - 1)
-#define __DOMAIN_MAX_ADDR(gaw) ((((uint64_t)1) << (gaw)) - 1)

/* We limit DOMAIN_MAX_PFN to fit in an unsigned long, and DOMAIN_MAX_ADDR
to match. That way, we can use 'unsigned long' for PFNs with impunity. */
@@ -62,10 +61,6 @@

#define IOVA_PFN(addr) ((addr) >> PAGE_SHIFT)

-/* page table handling */
-#define LEVEL_STRIDE (9)
-#define LEVEL_MASK (((u64)1 << LEVEL_STRIDE) - 1)
-
static inline int agaw_to_level(int agaw)
{
return agaw + 2;
@@ -76,11 +71,6 @@ static inline int agaw_to_width(int agaw)
return min_t(int, 30 + agaw * LEVEL_STRIDE, MAX_AGAW_WIDTH);
}

-static inline int width_to_agaw(int width)
-{
- return DIV_ROUND_UP(width - 30, LEVEL_STRIDE);
-}
-
static inline unsigned int level_to_offset_bits(int level)
{
return (level - 1) * LEVEL_STRIDE;
@@ -281,8 +271,6 @@ static LIST_HEAD(dmar_satc_units);
#define for_each_rmrr_units(rmrr) \
list_for_each_entry(rmrr, &dmar_rmrr_units, list)

-static void intel_iommu_domain_free(struct iommu_domain *domain);
-
int dmar_disabled = !IS_ENABLED(CONFIG_INTEL_IOMMU_DEFAULT_ON);
int intel_iommu_sm = IS_ENABLED(CONFIG_INTEL_IOMMU_SCALABLE_MODE_DEFAULT_ON);

@@ -450,12 +438,6 @@ int iommu_calculate_agaw(struct intel_iommu *iommu)
return __iommu_calculate_agaw(iommu, DEFAULT_DOMAIN_ADDRESS_WIDTH);
}

-static inline bool iommu_paging_structure_coherency(struct intel_iommu *iommu)
-{
- return sm_supported(iommu) ?
- ecap_smpwc(iommu->ecap) : ecap_coherent(iommu->ecap);
-}
-
static void domain_update_iommu_coherency(struct dmar_domain *domain)
{
struct iommu_domain_info *info;
@@ -1757,7 +1739,7 @@ static bool first_level_by_default(unsigned int type)
return type != IOMMU_DOMAIN_UNMANAGED;
}

-static struct dmar_domain *alloc_domain(unsigned int type)
+struct dmar_domain *alloc_domain(unsigned int type)
{
struct dmar_domain *domain;

@@ -1842,20 +1824,6 @@ void domain_detach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)
spin_unlock(&iommu->lock);
}

-static inline int guestwidth_to_adjustwidth(int gaw)
-{
- int agaw;
- int r = (gaw - 12) % 9;
-
- if (r == 0)
- agaw = gaw;
- else
- agaw = gaw + 9 - r;
- if (agaw > 64)
- agaw = 64;
- return agaw;
-}
-
static void domain_exit(struct dmar_domain *domain)
{
if (domain->pgd) {
@@ -4106,7 +4074,7 @@ intel_iommu_domain_alloc_user(struct device *dev, u32 flags,
return domain;
}

-static void intel_iommu_domain_free(struct iommu_domain *domain)
+void intel_iommu_domain_free(struct iommu_domain *domain)
{
if (domain != &si_domain->domain)
domain_exit(to_dmar_domain(domain));
@@ -4155,8 +4123,7 @@ int prepare_domain_attach_device(struct iommu_domain *domain,
return 0;
}

-static int intel_iommu_attach_device(struct iommu_domain *domain,
- struct device *dev)
+int intel_iommu_attach_device(struct iommu_domain *domain, struct device *dev)
{
struct device_domain_info *info = dev_iommu_priv_get(dev);
int ret;
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 6acb0211e85fe..c76f558ae6323 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -1021,4 +1021,39 @@ static inline const char *decode_prq_descriptor(char *str, size_t size,
return str;
}

+#define __DOMAIN_MAX_ADDR(gaw) ((((uint64_t)1) << (gaw)) - 1)
+
+/* page table handling */
+#define LEVEL_STRIDE (9)
+#define LEVEL_MASK (((u64)1 << LEVEL_STRIDE) - 1)
+
+int intel_iommu_attach_device(struct iommu_domain *domain, struct device *dev);
+void intel_iommu_domain_free(struct iommu_domain *domain);
+struct dmar_domain *alloc_domain(unsigned int type);
+
+static inline int guestwidth_to_adjustwidth(int gaw)
+{
+ int agaw;
+ int r = (gaw - 12) % 9;
+
+ if (r == 0)
+ agaw = gaw;
+ else
+ agaw = gaw + 9 - r;
+ if (agaw > 64)
+ agaw = 64;
+ return agaw;
+}
+
+static inline bool iommu_paging_structure_coherency(struct intel_iommu *iommu)
+{
+ return sm_supported(iommu) ?
+ ecap_smpwc(iommu->ecap) : ecap_coherent(iommu->ecap);
+}
+
+static inline int width_to_agaw(int width)
+{
+ return DIV_ROUND_UP(width - 30, LEVEL_STRIDE);
+}
+
#endif
--
2.17.1

2023-12-02 09:53:26

by Yan Zhao

Subject: [RFC PATCH 18/42] iommu/vt-d: Support of IOMMU_DOMAIN_KVM domain in Intel IOMMU

Add support for the IOMMU_DOMAIN_KVM domain. Paging structure
allocation/free and page mapping/unmapping of this domain are managed by KVM
rather than by the Intel IOMMU driver.

The meta data of the KVM domain's paging structures is read from the
allocation "data" passed in from KVM through IOMMUFD. The format of the meta
data is defined in the arch header "asm/kvm_exported_tdp.h".

The KVM domain's gaw (guest address width), agaw, pgd, max_addr and max
super page level are all read from the paging structure meta data from KVM.
Snooping and paging structure coherency are forced to be true.
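
As a worked example of the address-width derivation below (taking
ADDR_WIDTH_4LEVEL/ADDR_WIDTH_5LEVEL as 48/57, as in the Intel IOMMU driver):

	tdp->level == 4: gaw = 48, guestwidth_to_adjustwidth(48) = 48,
	                 agaw = width_to_agaw(48) = 2
	tdp->level == 5: gaw = 57, guestwidth_to_adjustwidth(57) = 57,
	                 agaw = width_to_agaw(57) = 3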

IOMMU hardware is checked against the requirements of the KVM domain at
domain allocation phase and later at device attachment phase (in a later
patch).

CONFIG_INTEL_IOMMU_KVM is provided to turn on/off KVM domain support.

Signed-off-by: Yan Zhao <[email protected]>
---
drivers/iommu/intel/Kconfig | 9 +++
drivers/iommu/intel/Makefile | 1 +
drivers/iommu/intel/iommu.c | 18 ++++-
drivers/iommu/intel/iommu.h | 5 ++
drivers/iommu/intel/kvm.c | 128 +++++++++++++++++++++++++++++++++++
5 files changed, 160 insertions(+), 1 deletion(-)
create mode 100644 drivers/iommu/intel/kvm.c

diff --git a/drivers/iommu/intel/Kconfig b/drivers/iommu/intel/Kconfig
index a4a125666293f..78078103d4280 100644
--- a/drivers/iommu/intel/Kconfig
+++ b/drivers/iommu/intel/Kconfig
@@ -108,4 +108,13 @@ config INTEL_IOMMU_PERF_EVENTS
to aid performance tuning and debug. These are available on modern
processors which support Intel VT-d 4.0 and later.

+config INTEL_IOMMU_KVM
+ bool "Support of stage 2 paging structures/mappings managed by KVM"
+ help
+ Selecting this option will enable Intel IOMMU to use paging
+ structures shared from KVM MMU as the stage 2 paging structures
+ in IOMMU hardware. The page mapping/unmapping and paging structure
+ allocation/free of these stage 2 paging structures are not managed
+ by the Intel IOMMU driver, but by KVM MMU.
+
endif # INTEL_IOMMU
diff --git a/drivers/iommu/intel/Makefile b/drivers/iommu/intel/Makefile
index 5dabf081a7793..c097bdd6ee13d 100644
--- a/drivers/iommu/intel/Makefile
+++ b/drivers/iommu/intel/Makefile
@@ -7,3 +7,4 @@ obj-$(CONFIG_INTEL_IOMMU_DEBUGFS) += debugfs.o
obj-$(CONFIG_INTEL_IOMMU_SVM) += svm.o
obj-$(CONFIG_IRQ_REMAP) += irq_remapping.o
obj-$(CONFIG_INTEL_IOMMU_PERF_EVENTS) += perfmon.o
+obj-$(CONFIG_INTEL_IOMMU_KVM) += kvm.o
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 924006cda18c5..fcdee40f30ed1 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -375,6 +375,15 @@ static inline int domain_type_is_si(struct dmar_domain *domain)
return domain->domain.type == IOMMU_DOMAIN_IDENTITY;
}

+static inline int domain_type_is_kvm(struct dmar_domain *domain)
+{
+#ifdef CONFIG_INTEL_IOMMU_KVM
+ return domain->domain.type == IOMMU_DOMAIN_KVM;
+#else
+ return false;
+#endif
+}
+
static inline int domain_pfn_supported(struct dmar_domain *domain,
unsigned long pfn)
{
@@ -1735,6 +1744,9 @@ static bool first_level_by_default(unsigned int type)
if (intel_cap_flts_sanity() ^ intel_cap_slts_sanity())
return intel_cap_flts_sanity();

+ if (type == IOMMU_DOMAIN_KVM)
+ return false;
+
/* Both levels are available, decide it based on domain type */
return type != IOMMU_DOMAIN_UNMANAGED;
}
@@ -1826,7 +1838,8 @@ void domain_detach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)

static void domain_exit(struct dmar_domain *domain)
{
- if (domain->pgd) {
+ /* pgd of kvm domain is managed by KVM */
+ if (!domain_type_is_kvm(domain) && (domain->pgd)) {
LIST_HEAD(freelist);

domain_unmap(domain, 0, DOMAIN_MAX_PFN(domain->gaw), &freelist);
@@ -4892,6 +4905,9 @@ const struct iommu_ops intel_iommu_ops = {
.hw_info = intel_iommu_hw_info,
.domain_alloc = intel_iommu_domain_alloc,
.domain_alloc_user = intel_iommu_domain_alloc_user,
+#ifdef CONFIG_INTEL_IOMMU_KVM
+ .domain_alloc_kvm = intel_iommu_domain_alloc_kvm,
+#endif
.probe_device = intel_iommu_probe_device,
.probe_finalize = intel_iommu_probe_finalize,
.release_device = intel_iommu_release_device,
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index c76f558ae6323..8826e9248f6ed 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -1056,4 +1056,9 @@ static inline int width_to_agaw(int width)
return DIV_ROUND_UP(width - 30, LEVEL_STRIDE);
}

+#ifdef CONFIG_INTEL_IOMMU_KVM
+struct iommu_domain *
+intel_iommu_domain_alloc_kvm(struct device *dev, u32 flags, const void *data);
+#endif
+
#endif
diff --git a/drivers/iommu/intel/kvm.c b/drivers/iommu/intel/kvm.c
new file mode 100644
index 0000000000000..188ec90083051
--- /dev/null
+++ b/drivers/iommu/intel/kvm.c
@@ -0,0 +1,128 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/iommu.h>
+#include <asm/kvm_exported_tdp.h>
+#include "iommu.h"
+
+/**
+ * Check IOMMU hardware Snoop related caps
+ *
+ * - force_snooping: Force snoop cpu caches per current KVM implementation.
+ * - scalable-mode: To enable PGSNP bit in PASIDTE to overwrite SNP
+ * bit (bit 11) in stage 2 leaves.
+ * - paging structure coherency: As KVM will not call clflush_cache_range()
+ */
+static bool is_coherency(struct intel_iommu *iommu)
+{
+ return ecap_sc_support(iommu->ecap) && sm_supported(iommu) &&
+ iommu_paging_structure_coherency(iommu);
+}
+
+static bool is_iommu_cap_compatible_to_kvm_domain(struct dmar_domain *domain,
+ struct intel_iommu *iommu)
+{
+ if (!is_coherency(iommu))
+ return false;
+
+ if (domain->iommu_superpage > fls(cap_super_page_val(iommu->cap)))
+ return false;
+
+ if (domain->agaw > iommu->agaw || domain->agaw > cap_mgaw(iommu->cap))
+ return false;
+
+ return true;
+}
+
+/*
+ * Cache coherency is always enforced in KVM domain.
+ * IOMMU hardware caps will be checked to allow the cache coherency before
+ * device attachment to the KVM domain.
+ */
+static bool kvm_domain_enforce_cache_coherency(struct iommu_domain *domain)
+{
+ return true;
+}
+
+static const struct iommu_domain_ops intel_kvm_domain_ops = {
+ .free = intel_iommu_domain_free,
+ .enforce_cache_coherency = kvm_domain_enforce_cache_coherency,
+};
+
+struct iommu_domain *
+intel_iommu_domain_alloc_kvm(struct device *dev, u32 flags, const void *data)
+{
+ bool request_nest_parent = flags & IOMMU_HWPT_ALLOC_NEST_PARENT;
+ const struct kvm_exported_tdp_meta_vmx *tdp = data;
+ struct dmar_domain *dmar_domain;
+ struct iommu_domain *domain;
+ struct intel_iommu *iommu;
+ int adjust_width;
+
+ iommu = device_to_iommu(dev, NULL, NULL);
+
+ if (!iommu)
+ return ERR_PTR(-ENODEV);
+ /*
+ * In theory, a KVM domain can be nested as a parent domain to a user
+ * domain. Turn it off as we don't want to handle cases like IO page
+ * fault on nested domain for now.
+ */
+ if (request_nest_parent) {
+ pr_err("KVM domain does not work as nested parent currently\n");
+ return ERR_PTR(-EOPNOTSUPP);
+ }
+
+ if (!tdp || tdp->type != KVM_TDP_TYPE_EPT) {
+ pr_err("No meta data or wrong KVM TDP type\n");
+ return ERR_PTR(-EINVAL);
+ }
+
+ if (tdp->level != 4 && tdp->level != 5) {
+ pr_err("Unsupported KVM TDP level %d in IOMMU\n", tdp->level);
+ return ERR_PTR(-EOPNOTSUPP);
+ }
+
+ dmar_domain = alloc_domain(IOMMU_DOMAIN_KVM);
+ if (!dmar_domain)
+ return ERR_PTR(-ENOMEM);
+
+ if (dmar_domain->use_first_level)
+ WARN_ON("KVM domain is applying to IOMMU flpt\n");
+
+ domain = &dmar_domain->domain;
+ domain->ops = &intel_kvm_domain_ops;
+ domain->type = IOMMU_DOMAIN_KVM;
+
+ /* read dmar domain meta data from "tdp" */
+ dmar_domain->gaw = tdp->level == 4 ? ADDR_WIDTH_4LEVEL : ADDR_WIDTH_5LEVEL;
+ adjust_width = guestwidth_to_adjustwidth(dmar_domain->gaw);
+ dmar_domain->agaw = width_to_agaw(adjust_width);
+ dmar_domain->iommu_superpage = tdp->max_huge_page_level - 1;
+ dmar_domain->max_addr = 1ULL << dmar_domain->gaw;
+ dmar_domain->pgd = phys_to_virt(tdp->root_hpa);
+
+ dmar_domain->nested_parent = false;
+ dmar_domain->dirty_tracking = false;
+
+ /*
+ * force_snooping and paging structure coherency are set in the KVM domain;
+ * IOMMU hardware caps will be checked before device attach.
+ */
+ dmar_domain->force_snooping = true;
+ dmar_domain->iommu_coherency = true;
+
+ /* no need to let iommu_map/unmap see pgsize_bitmap */
+ domain->pgsize_bitmap = 0;
+
+ /* force aperture */
+ domain->geometry.aperture_start = 0;
+ domain->geometry.aperture_end = __DOMAIN_MAX_ADDR(dmar_domain->gaw);
+ domain->geometry.force_aperture = true;
+
+ if (!is_iommu_cap_compatible_to_kvm_domain(dmar_domain, iommu)) {
+ pr_err("Unsupported KVM TDP\n");
+ kfree(dmar_domain);
+ return ERR_PTR(-EOPNOTSUPP);
+ }
+
+ return domain;
+}
--
2.17.1

2023-12-02 09:53:57

by Yan Zhao

Subject: [RFC PATCH 19/42] iommu/vt-d: Set bit PGSNP in PASIDTE if domain cache coherency is enforced

Set bit PGSNP (Page Snoop, bit 88) in PASIDTE when attaching device to a
domain whose cache coherency is enforced.

Signed-off-by: Yan Zhao <[email protected]>
---
drivers/iommu/intel/pasid.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 74e8e4c17e814..a42955b5e666f 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -679,10 +679,11 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
pasid_set_address_width(pte, agaw);
pasid_set_translation_type(pte, PASID_ENTRY_PGTT_SL_ONLY);
pasid_set_fault_enable(pte);
+ if (domain->force_snooping)
+ pasid_set_pgsnp(pte);
pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
if (domain->dirty_tracking)
pasid_set_ssade(pte);
-
pasid_set_present(pte);
spin_unlock(&iommu->lock);

--
2.17.1

2023-12-02 09:54:35

by Yan Zhao

Subject: [RFC PATCH 20/42] iommu/vt-d: Support attach devices to IOMMU_DOMAIN_KVM domain

IOMMU_DOMAIN_KVM domain reuses intel_iommu_attach_device() for device
attachment. But unlike attaching to other dmar_domains, domain caps (e.g.
iommu_superpage) are not updated after device attach. Instead, IOMMU caps
are checked for compatibility before domain attachment.

Signed-off-by: Yan Zhao <[email protected]>
---
drivers/iommu/intel/iommu.c | 11 +++++++++++
drivers/iommu/intel/iommu.h | 7 +++++++
drivers/iommu/intel/kvm.c | 9 +++++++++
3 files changed, 27 insertions(+)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index fcdee40f30ed1..9cc42b3d24f65 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -552,6 +552,13 @@ static unsigned long domain_super_pgsize_bitmap(struct dmar_domain *domain)
/* Some capabilities may be different across iommus */
void domain_update_iommu_cap(struct dmar_domain *domain)
{
+ /*
+ * No need to adjust iommu cap of kvm domain.
+ * Instead, iommu will be checked in pre-attach phase.
+ */
+ if (domain_type_is_kvm(domain))
+ return;
+
domain_update_iommu_coherency(domain);
domain->iommu_superpage = domain_update_iommu_superpage(domain, NULL);

@@ -4104,6 +4111,9 @@ int prepare_domain_attach_device(struct iommu_domain *domain,
if (!iommu)
return -ENODEV;

+ if (domain_type_is_kvm(dmar_domain))
+ return prepare_kvm_domain_attach(dmar_domain, iommu);
+
if (dmar_domain->force_snooping && !ecap_sc_support(iommu->ecap))
return -EINVAL;

@@ -4117,6 +4127,7 @@ int prepare_domain_attach_device(struct iommu_domain *domain,

if (dmar_domain->max_addr > (1LL << addr_width))
return -EINVAL;
+
dmar_domain->gaw = addr_width;

/*
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 8826e9248f6ed..801700bc7d820 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -1059,6 +1059,13 @@ static inline int width_to_agaw(int width)
#ifdef CONFIG_INTEL_IOMMU_KVM
struct iommu_domain *
intel_iommu_domain_alloc_kvm(struct device *dev, u32 flags, const void *data);
+int prepare_kvm_domain_attach(struct dmar_domain *domain, struct intel_iommu *iommu);
+#else
+static inline int prepare_kvm_domain_attach(struct dmar_domain *domain,
+ struct intel_iommu *iommu)
+{
+ return 0;
+}
#endif

#endif
diff --git a/drivers/iommu/intel/kvm.c b/drivers/iommu/intel/kvm.c
index 188ec90083051..1ce334785430b 100644
--- a/drivers/iommu/intel/kvm.c
+++ b/drivers/iommu/intel/kvm.c
@@ -32,6 +32,14 @@ static bool is_iommu_cap_compatible_to_kvm_domain(struct dmar_domain *domain,
return true;
}

+int prepare_kvm_domain_attach(struct dmar_domain *domain, struct intel_iommu *iommu)
+{
+ if (is_iommu_cap_compatible_to_kvm_domain(domain, iommu))
+ return 0;
+
+ return -EINVAL;
+}
+
/*
* Cache coherency is always enforced in KVM domain.
* IOMMU hardware caps will be checked to allow the cache coherency before
@@ -43,6 +51,7 @@ static bool kvm_domain_enforce_cache_coherency(struct iommu_domain *domain)
}

static const struct iommu_domain_ops intel_kvm_domain_ops = {
+ .attach_dev = intel_iommu_attach_device,
.free = intel_iommu_domain_free,
.enforce_cache_coherency = kvm_domain_enforce_cache_coherency,
};
--
2.17.1

2023-12-02 09:55:11

by Yan Zhao

Subject: [RFC PATCH 21/42] iommu/vt-d: Check reserved bits for IOMMU_DOMAIN_KVM domain

Add a compatibility check between the IOMMU driver and KVM.
rsvd_bits_mask is provided by KVM to guarantee that the set bits are
must-be-zero bits in PTEs. The Intel vt-d driver can check it to see if all
must-be-zero bits required by the IOMMU side are included.

In this RFC, only bit 11 is checked for simplicity and demo purpose.

Signed-off-by: Yan Zhao <[email protected]>
---
drivers/iommu/intel/kvm.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)

diff --git a/drivers/iommu/intel/kvm.c b/drivers/iommu/intel/kvm.c
index 1ce334785430b..998d6daaf7ea1 100644
--- a/drivers/iommu/intel/kvm.c
+++ b/drivers/iommu/intel/kvm.c
@@ -32,6 +32,18 @@ static bool is_iommu_cap_compatible_to_kvm_domain(struct dmar_domain *domain,
return true;
}

+static int check_tdp_reserved_bits(const struct kvm_exported_tdp_meta_vmx *tdp)
+{
+ int i;
+
+ for (i = PT64_ROOT_MAX_LEVEL; --i >= 0;) {
+ if (!(tdp->rsvd_bits_mask[0][i] & BIT(11)) ||
+ !(tdp->rsvd_bits_mask[1][i] & BIT(11)))
+ return -EFAULT;
+ }
+ return 0;
+}
+
int prepare_kvm_domain_attach(struct dmar_domain *domain, struct intel_iommu *iommu)
{
if (is_iommu_cap_compatible_to_kvm_domain(domain, iommu))
@@ -90,6 +102,11 @@ intel_iommu_domain_alloc_kvm(struct device *dev, u32 flags, const void *data)
return ERR_PTR(-EOPNOTSUPP);
}

+ if (check_tdp_reserved_bits(tdp)) {
+ pr_err("Reserved bits incompatible between KVM and IOMMU\n");
+ return ERR_PTR(-EOPNOTSUPP);
+ }
+
dmar_domain = alloc_domain(IOMMU_DOMAIN_KVM);
if (!dmar_domain)
return ERR_PTR(-ENOMEM);
--
2.17.1

2023-12-02 09:56:10

by Yan Zhao

Subject: [RFC PATCH 23/42] iommu/vt-d: Allow pasid 0 in IOPF

PASID 0 is allowed when IOPFs are triggered on second level page tables.
Page requests/responses with PASID 0 or without a PASID are also permitted
by the vt-d hardware spec.

FIXME:
Currently, .page_response and intel_svm_enable_prq() are bound to SVM and
are compiled only with CONFIG_INTEL_IOMMU_SVM.
e.g.
.page_response = intel_svm_page_response,

Need to move the PRQ enabling and page response code out of svm.c and make
it SVM independent.

Signed-off-by: Yan Zhao <[email protected]>
---
drivers/iommu/intel/svm.c | 37 ++++++++++++++++++++-----------------
1 file changed, 20 insertions(+), 17 deletions(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 659de9c160241..a2a63a85baa9f 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -628,6 +628,7 @@ static irqreturn_t prq_event_thread(int irq, void *d)
int head, tail, handled;
struct pci_dev *pdev;
u64 address;
+ bool bad_req = false;

/*
* Clear PPR bit before reading head/tail registers, to ensure that
@@ -642,30 +643,29 @@ static irqreturn_t prq_event_thread(int irq, void *d)
req = &iommu->prq[head / sizeof(*req)];
address = (u64)req->addr << VTD_PAGE_SHIFT;

- if (unlikely(!req->pasid_present)) {
- pr_err("IOMMU: %s: Page request without PASID\n",
+ if (unlikely(!req->pasid_present))
+ pr_info("IOMMU: %s: Page request without PASID\n",
iommu->name);
-bad_req:
- handle_bad_prq_event(iommu, req, QI_RESP_INVALID);
- goto prq_advance;
- }

if (unlikely(!is_canonical_address(address))) {
pr_err("IOMMU: %s: Address is not canonical\n",
iommu->name);
- goto bad_req;
+ bad_req = true;
+ goto prq_advance;
}

if (unlikely(req->pm_req && (req->rd_req | req->wr_req))) {
pr_err("IOMMU: %s: Page request in Privilege Mode\n",
iommu->name);
- goto bad_req;
+ bad_req = true;
+ goto prq_advance;
}

if (unlikely(req->exe_req && req->rd_req)) {
pr_err("IOMMU: %s: Execution request not supported\n",
iommu->name);
- goto bad_req;
+ bad_req = true;
+ goto prq_advance;
}

/* Drop Stop Marker message. No need for a response. */
@@ -679,8 +679,10 @@ static irqreturn_t prq_event_thread(int irq, void *d)
* If prq is to be handled outside iommu driver via receiver of
* the fault notifiers, we skip the page response here.
*/
- if (!pdev)
- goto bad_req;
+ if (!pdev) {
+ bad_req = true;
+ goto prq_advance;
+ }

if (intel_svm_prq_report(iommu, &pdev->dev, req))
handle_bad_prq_event(iommu, req, QI_RESP_INVALID);
@@ -688,8 +690,14 @@ static irqreturn_t prq_event_thread(int irq, void *d)
trace_prq_report(iommu, &pdev->dev, req->qw_0, req->qw_1,
req->priv_data[0], req->priv_data[1],
iommu->prq_seq_number++);
+
pci_dev_put(pdev);
+
prq_advance:
+ if (bad_req) {
+ handle_bad_prq_event(iommu, req, QI_RESP_INVALID);
+ bad_req = false;
+ }
head = (head + sizeof(*req)) & PRQ_RING_MASK;
}

@@ -747,12 +755,7 @@ int intel_svm_page_response(struct device *dev,
private_present = prm->flags & IOMMU_FAULT_PAGE_REQUEST_PRIV_DATA;
last_page = prm->flags & IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE;

- if (!pasid_present) {
- ret = -EINVAL;
- goto out;
- }
-
- if (prm->pasid == 0 || prm->pasid >= PASID_MAX) {
+ if (prm->pasid >= PASID_MAX) {
ret = -EINVAL;
goto out;
}
--
2.17.1

2023-12-02 09:56:11

by Yan Zhao

Subject: [RFC PATCH 22/42] iommu/vt-d: Support cache invalidate of IOMMU_DOMAIN_KVM domain

Support invalidation of hardware TLBs when KVM invalidates mappings on a
domain of type IOMMU_DOMAIN_KVM.

Signed-off-by: Yan Zhao <[email protected]>
---
drivers/iommu/intel/kvm.c | 31 +++++++++++++++++++++++++++++++
1 file changed, 31 insertions(+)

diff --git a/drivers/iommu/intel/kvm.c b/drivers/iommu/intel/kvm.c
index 998d6daaf7ea1..56cb8f9bf1da0 100644
--- a/drivers/iommu/intel/kvm.c
+++ b/drivers/iommu/intel/kvm.c
@@ -62,10 +62,41 @@ static bool kvm_domain_enforce_cache_coherency(struct iommu_domain *domain)
return true;
}

+static void domain_flush_iotlb_psi(struct dmar_domain *domain,
+ unsigned long iova, unsigned long size)
+{
+ struct iommu_domain_info *info;
+ unsigned long i;
+
+ if (!IS_ALIGNED(size, VTD_PAGE_SIZE) ||
+ !IS_ALIGNED(iova, VTD_PAGE_SIZE)) {
+ pr_err("Invalid KVM domain invalidation: iova=0x%lx, size=0x%lx\n",
+ iova, size);
+ return;
+ }
+
+ xa_for_each(&domain->iommu_array, i, info)
+ iommu_flush_iotlb_psi(info->iommu, domain,
+ iova >> VTD_PAGE_SHIFT,
+ size >> VTD_PAGE_SHIFT, 1, 0);
+}
+
+static void kvm_domain_cache_invalidate(struct iommu_domain *domain,
+ unsigned long iova, unsigned long size)
+{
+ struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+
+ if (iova == 0 && size == -1UL)
+ intel_flush_iotlb_all(domain);
+ else
+ domain_flush_iotlb_psi(dmar_domain, iova, size);
+}
+
static const struct iommu_domain_ops intel_kvm_domain_ops = {
.attach_dev = intel_iommu_attach_device,
.free = intel_iommu_domain_free,
.enforce_cache_coherency = kvm_domain_enforce_cache_coherency,
+ .cache_invalidate_kvm = kvm_domain_cache_invalidate,
};

struct iommu_domain *
--
2.17.1

2023-12-02 09:56:41

by Yan Zhao

Subject: [RFC PATCH 24/42] KVM: x86/mmu: Move bit SPTE_MMU_PRESENT from bit 11 to bit 59

Add a config CONFIG_HAVE_KVM_MMU_PRESENT_HIGH to support moving the
SPTE_MMU_PRESENT bit from bit 11 to bit 59 and marking bit 11 as reserved 0.

Though locating the SPTE_MMU_PRESENT bit at low bit 11 has a smaller
footprint, sometimes bit 11 is not allowed to be set, e.g. when KVM's TDP is
exported and shared to the IOMMU as stage 2 page tables, bit 11 must be
reserved as 0 for Intel vt-d.

For the 19 bits MMIO GEN masks,
w/o CONFIG_HAVE_KVM_MMU_PRESENT_HIGH, it's divided into 2 parts,
Low: bit 3 - 10
High: bit 52 - 62

w/ CONFIG_HAVE_KVM_MMU_PRESENT_HIGH, it's divided into 3 parts,
Low: bit 3 - 11
Mid: bit 52 - 58
High: bit 60 - 62

It is OK for the MMIO GEN mask to take bit 11 because the MMIO GEN mask
carries generation info for emulated MMIOs and therefore will not be
directly accessed by Intel vt-d hardware.
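
As a quick consistency check of the split layout (inclusive bit ranges):
bits 3-11 give 9 bits, bits 52-58 give 7 bits, and bits 60-62 give 3 bits,
so 9 + 7 + 3 = 19, matching the 8 + 11 = 19 bits of the non-split layout.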

Signed-off-by: Yan Zhao <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 7 ++++
arch/x86/kvm/mmu/spte.c | 3 ++
arch/x86/kvm/mmu/spte.h | 77 ++++++++++++++++++++++++++++++++++++-----
virt/kvm/Kconfig | 3 ++
4 files changed, 81 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c57e181bba21b..69af78e508197 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4926,6 +4926,13 @@ static void reset_tdp_shadow_zero_bits_mask(struct kvm_mmu *context)
reserved_hpa_bits(), false,
max_huge_page_level);

+ if (IS_ENABLED(CONFIG_HAVE_KVM_MMU_PRESENT_HIGH)) {
+ for (i = PT64_ROOT_MAX_LEVEL; --i >= 0;) {
+ shadow_zero_check->rsvd_bits_mask[0][i] |= rsvd_bits(11, 11);
+ shadow_zero_check->rsvd_bits_mask[1][i] |= rsvd_bits(11, 11);
+ }
+ }
+
if (!shadow_me_mask)
return;

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 4a599130e9c99..179156cd995df 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -64,6 +64,9 @@ static u64 generation_mmio_spte_mask(u64 gen)
WARN_ON_ONCE(gen & ~MMIO_SPTE_GEN_MASK);

mask = (gen << MMIO_SPTE_GEN_LOW_SHIFT) & MMIO_SPTE_GEN_LOW_MASK;
+#ifdef CONFIG_HAVE_KVM_MMU_PRESENT_HIGH
+ mask |= (gen << MMIO_SPTE_GEN_MID_SHIFT) & MMIO_SPTE_GEN_MID_MASK;
+#endif
mask |= (gen << MMIO_SPTE_GEN_HIGH_SHIFT) & MMIO_SPTE_GEN_HIGH_MASK;
return mask;
}
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index a129951c9a885..b88b686a4ecbc 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -7,13 +7,20 @@
#include "mmu_internal.h"

/*
- * A MMU present SPTE is backed by actual memory and may or may not be present
- * in hardware. E.g. MMIO SPTEs are not considered present. Use bit 11, as it
- * is ignored by all flavors of SPTEs and checking a low bit often generates
- * better code than for a high bit, e.g. 56+. MMU present checks are pervasive
- * enough that the improved code generation is noticeable in KVM's footprint.
- */
+ * A MMU present SPTE is backed by actual memory and may or may not be present
+ * in hardware. E.g. MMIO SPTEs are not considered present. Use bit 11, as it
+ * is ignored by all flavors of SPTEs and checking a low bit often generates
+ * better code than for a high bit, e.g. 56+. MMU present checks are pervasive
+ * enough that the improved code generation is noticeable in KVM's footprint.
+ * However, sometimes it's desirable to have the present bit in a high bit, e.g.
+ * if a KVM TDP is exported to the IOMMU, bit 11 could be a reserved bit on the
+ * IOMMU side. Add a config to decide whether the MMU present bit is bit 11 or bit 59.
+ */
+#ifdef CONFIG_HAVE_KVM_MMU_PRESENT_HIGH
+#define SPTE_MMU_PRESENT_MASK BIT_ULL(59)
+#else
#define SPTE_MMU_PRESENT_MASK BIT_ULL(11)
+#endif

/*
* TDP SPTES (more specifically, EPT SPTEs) may not have A/D bits, and may also
@@ -111,19 +118,66 @@ static_assert(!(EPT_SPTE_MMU_WRITABLE & SHADOW_ACC_TRACK_SAVED_MASK));
* checking for MMIO spte cache hits.
*/

+#ifdef CONFIG_HAVE_KVM_MMU_PRESENT_HIGH
+
#define MMIO_SPTE_GEN_LOW_START 3
-#define MMIO_SPTE_GEN_LOW_END 10
+#define MMIO_SPTE_GEN_LOW_END 11
+#define MMIO_SPTE_GEN_MID_START 52
+#define MMIO_SPTE_GEN_MID_END 58
+#define MMIO_SPTE_GEN_HIGH_START 60
+#define MMIO_SPTE_GEN_HIGH_END 62
+#define MMIO_SPTE_GEN_LOW_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_END, \
+ MMIO_SPTE_GEN_LOW_START)
+#define MMIO_SPTE_GEN_MID_MASK GENMASK_ULL(MMIO_SPTE_GEN_MID_END, \
+ MMIO_SPTE_GEN_MID_START)
+#define MMIO_SPTE_GEN_HIGH_MASK GENMASK_ULL(MMIO_SPTE_GEN_HIGH_END, \
+ MMIO_SPTE_GEN_HIGH_START)
+static_assert(!(SPTE_MMU_PRESENT_MASK &
+ (MMIO_SPTE_GEN_LOW_MASK | MMIO_SPTE_GEN_MID_MASK |
+ MMIO_SPTE_GEN_HIGH_MASK)));
+/*
+ * The SPTE MMIO mask must NOT overlap the MMIO generation bits or the
+ * MMU-present bit. The generation obviously co-exists with the magic MMIO
+ * mask/value, and MMIO SPTEs are considered !MMU-present.
+ *
+ * The SPTE MMIO mask is allowed to use hardware "present" bits (i.e. all EPT
+ * RWX bits), all physical address bits (legal PA bits are used for "fast" MMIO
+ * and so they're off-limits for generation; additional checks ensure the mask
+ * doesn't overlap legal PA bits), and bit 63 (carved out for future usage).
+ */
+#define SPTE_MMIO_ALLOWED_MASK (BIT_ULL(63) | GENMASK_ULL(51, 12) | GENMASK_ULL(2, 0))
+static_assert(!(SPTE_MMIO_ALLOWED_MASK &
+ (SPTE_MMU_PRESENT_MASK | MMIO_SPTE_GEN_LOW_MASK | MMIO_SPTE_GEN_MID_MASK |
+ MMIO_SPTE_GEN_HIGH_MASK)));
+
+#define MMIO_SPTE_GEN_LOW_BITS (MMIO_SPTE_GEN_LOW_END - MMIO_SPTE_GEN_LOW_START + 1)
+#define MMIO_SPTE_GEN_MID_BITS (MMIO_SPTE_GEN_MID_END - MMIO_SPTE_GEN_MID_START + 1)
+#define MMIO_SPTE_GEN_HIGH_BITS (MMIO_SPTE_GEN_HIGH_END - MMIO_SPTE_GEN_HIGH_START + 1)

+/* remember to adjust the comment above as well if you change these */
+static_assert(MMIO_SPTE_GEN_LOW_BITS == 9 && MMIO_SPTE_GEN_MID_BITS == 7 &&
+ MMIO_SPTE_GEN_HIGH_BITS == 3);
+
+#define MMIO_SPTE_GEN_LOW_SHIFT (MMIO_SPTE_GEN_LOW_START - 0)
+#define MMIO_SPTE_GEN_MID_SHIFT (MMIO_SPTE_GEN_MID_START - MMIO_SPTE_GEN_LOW_BITS)
+#define MMIO_SPTE_GEN_HIGH_SHIFT (MMIO_SPTE_GEN_HIGH_START - MMIO_SPTE_GEN_MID_BITS - \
+ MMIO_SPTE_GEN_LOW_BITS)
+
+#define MMIO_SPTE_GEN_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_BITS + \
+ MMIO_SPTE_GEN_MID_BITS + MMIO_SPTE_GEN_HIGH_BITS - 1, 0)
+
+#else /* !CONFIG_HAVE_KVM_MMU_PRESENT_HIGH */
+
+#define MMIO_SPTE_GEN_LOW_START 3
+#define MMIO_SPTE_GEN_LOW_END 10
#define MMIO_SPTE_GEN_HIGH_START 52
#define MMIO_SPTE_GEN_HIGH_END 62
-
#define MMIO_SPTE_GEN_LOW_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_END, \
MMIO_SPTE_GEN_LOW_START)
#define MMIO_SPTE_GEN_HIGH_MASK GENMASK_ULL(MMIO_SPTE_GEN_HIGH_END, \
MMIO_SPTE_GEN_HIGH_START)
static_assert(!(SPTE_MMU_PRESENT_MASK &
(MMIO_SPTE_GEN_LOW_MASK | MMIO_SPTE_GEN_HIGH_MASK)));
-
/*
* The SPTE MMIO mask must NOT overlap the MMIO generation bits or the
* MMU-present bit. The generation obviously co-exists with the magic MMIO
@@ -149,6 +203,8 @@ static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);

#define MMIO_SPTE_GEN_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_BITS + MMIO_SPTE_GEN_HIGH_BITS - 1, 0)

+#endif /* #ifdef CONFIG_HAVE_KVM_MMU_PRESENT_HIGH */
+
extern u64 __read_mostly shadow_host_writable_mask;
extern u64 __read_mostly shadow_mmu_writable_mask;
extern u64 __read_mostly shadow_nx_mask;
@@ -465,6 +521,9 @@ static inline u64 get_mmio_spte_generation(u64 spte)
u64 gen;

gen = (spte & MMIO_SPTE_GEN_LOW_MASK) >> MMIO_SPTE_GEN_LOW_SHIFT;
+#ifdef CONFIG_HAVE_KVM_MMU_PRESENT_HIGH
+ gen |= (spte & MMIO_SPTE_GEN_MID_MASK) >> MMIO_SPTE_GEN_MID_SHIFT;
+#endif
gen |= (spte & MMIO_SPTE_GEN_HIGH_MASK) >> MMIO_SPTE_GEN_HIGH_SHIFT;
return gen;
}
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 63b5d55c84e95..b00f9f5180292 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -95,3 +95,6 @@ config KVM_GENERIC_HARDWARE_ENABLING

config HAVE_KVM_EXPORTED_TDP
bool
+
+config HAVE_KVM_MMU_PRESENT_HIGH
+ bool
--
2.17.1

2023-12-02 09:57:30

by Yan Zhao

Subject: [RFC PATCH 25/42] KVM: x86/mmu: Abstract "struct kvm_mmu_common" from "struct kvm_mmu"

Abstract "struct kvm_mmu_common" and move 3 common fields "root, root_role,
shadow_zero_check" from "struct kvm_mmu" to "struct kvm_mmu_common".

"struct kvm_mmu_common" is a preparation for later patches to introduce
"struct kvm_exported_tdp_mmu" which is used by KVM to export TDP.

Opportunistically, a new param "struct kvm_mmu_common *mmu_common" is added
to make_spte(), so that is_rsvd_spte() in make_spte() can use
&mmu_common->shadow_zero_check directly without asking it from vcpu.

No functional changes expected.

Signed-off-by: Yan Zhao <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 22 +++--
arch/x86/kvm/mmu.h | 6 +-
arch/x86/kvm/mmu/mmu.c | 168 ++++++++++++++++----------------
arch/x86/kvm/mmu/mmu_internal.h | 2 +-
arch/x86/kvm/mmu/paging_tmpl.h | 9 +-
arch/x86/kvm/mmu/spte.c | 7 +-
arch/x86/kvm/mmu/spte.h | 3 +-
arch/x86/kvm/mmu/tdp_mmu.c | 13 +--
arch/x86/kvm/svm/svm.c | 2 +-
arch/x86/kvm/vmx/nested.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 4 +-
arch/x86/kvm/x86.c | 8 +-
12 files changed, 127 insertions(+), 119 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d7036982332e3..16e01eee34a99 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -437,12 +437,25 @@ struct kvm_mmu_root_info {
struct kvm_mmu_page;
struct kvm_page_fault;

+struct kvm_mmu_common {
+ struct kvm_mmu_root_info root;
+ union kvm_mmu_page_role root_role;
+
+ /*
+ * check zero bits on shadow page table entries, these
+ * bits include not only hardware reserved bits but also
+ * the bits spte never used.
+ */
+ struct rsvd_bits_validate shadow_zero_check;
+};
+
/*
* x86 supports 4 paging modes (5-level 64-bit, 4-level 64-bit, 3-level 32-bit,
* and 2-level 32-bit). The kvm_mmu structure abstracts the details of the
* current mmu mode.
*/
struct kvm_mmu {
+ struct kvm_mmu_common common;
unsigned long (*get_guest_pgd)(struct kvm_vcpu *vcpu);
u64 (*get_pdptr)(struct kvm_vcpu *vcpu, int index);
int (*page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
@@ -453,9 +466,7 @@ struct kvm_mmu {
struct x86_exception *exception);
int (*sync_spte)(struct kvm_vcpu *vcpu,
struct kvm_mmu_page *sp, int i);
- struct kvm_mmu_root_info root;
union kvm_cpu_role cpu_role;
- union kvm_mmu_page_role root_role;

/*
* The pkru_mask indicates if protection key checks are needed. It
@@ -478,13 +489,6 @@ struct kvm_mmu {
u64 *pml4_root;
u64 *pml5_root;

- /*
- * check zero bits on shadow page table entries, these
- * bits include not only hardware reserved bits but also
- * the bits spte never used.
- */
- struct rsvd_bits_validate shadow_zero_check;
-
struct rsvd_bits_validate guest_rsvd_check;

u64 pdptrs[4]; /* pae */
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index bb8c86eefac04..e9631cc23a594 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -126,7 +126,7 @@ void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,

static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu)
{
- if (likely(vcpu->arch.mmu->root.hpa != INVALID_PAGE))
+ if (likely(vcpu->arch.mmu->common.root.hpa != INVALID_PAGE))
return 0;

return kvm_mmu_load(vcpu);
@@ -148,13 +148,13 @@ static inline unsigned long kvm_get_active_pcid(struct kvm_vcpu *vcpu)

static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu)
{
- u64 root_hpa = vcpu->arch.mmu->root.hpa;
+ u64 root_hpa = vcpu->arch.mmu->common.root.hpa;

if (!VALID_PAGE(root_hpa))
return;

static_call(kvm_x86_load_mmu_pgd)(vcpu, root_hpa,
- vcpu->arch.mmu->root_role.level);
+ vcpu->arch.mmu->common.root_role.level);
}

static inline void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 69af78e508197..cfeb066f38687 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -643,7 +643,7 @@ static bool mmu_spte_age(u64 *sptep)

static inline bool is_tdp_mmu_active(struct kvm_vcpu *vcpu)
{
- return tdp_mmu_enabled && vcpu->arch.mmu->root_role.direct;
+ return tdp_mmu_enabled && vcpu->arch.mmu->common.root_role.direct;
}

static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
@@ -1911,7 +1911,7 @@ static bool sp_has_gptes(struct kvm_mmu_page *sp)

static bool kvm_sync_page_check(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
{
- union kvm_mmu_page_role root_role = vcpu->arch.mmu->root_role;
+ union kvm_mmu_page_role root_role = vcpu->arch.mmu->common.root_role;

/*
* Ignore various flags when verifying that it's safe to sync a shadow
@@ -2363,11 +2363,11 @@ static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterato
{
iterator->addr = addr;
iterator->shadow_addr = root;
- iterator->level = vcpu->arch.mmu->root_role.level;
+ iterator->level = vcpu->arch.mmu->common.root_role.level;

if (iterator->level >= PT64_ROOT_4LEVEL &&
vcpu->arch.mmu->cpu_role.base.level < PT64_ROOT_4LEVEL &&
- !vcpu->arch.mmu->root_role.direct)
+ !vcpu->arch.mmu->common.root_role.direct)
iterator->level = PT32E_ROOT_LEVEL;

if (iterator->level == PT32E_ROOT_LEVEL) {
@@ -2375,7 +2375,7 @@ static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterato
* prev_root is currently only used for 64-bit hosts. So only
* the active root_hpa is valid here.
*/
- BUG_ON(root != vcpu->arch.mmu->root.hpa);
+ BUG_ON(root != vcpu->arch.mmu->common.root.hpa);

iterator->shadow_addr
= vcpu->arch.mmu->pae_root[(addr >> 30) & 3];
@@ -2389,7 +2389,7 @@ static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterato
static void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator,
struct kvm_vcpu *vcpu, u64 addr)
{
- shadow_walk_init_using_root(iterator, vcpu, vcpu->arch.mmu->root.hpa,
+ shadow_walk_init_using_root(iterator, vcpu, vcpu->arch.mmu->common.root.hpa,
addr);
}

@@ -2771,7 +2771,7 @@ static int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva)
gpa_t gpa;
int r;

- if (vcpu->arch.mmu->root_role.direct)
+ if (vcpu->arch.mmu->common.root_role.direct)
return 0;

gpa = kvm_mmu_gva_to_gpa_read(vcpu, gva, NULL);
@@ -2939,7 +2939,8 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
was_rmapped = 1;
}

- wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, prefetch,
+ wrprot = make_spte(vcpu, &vcpu->arch.mmu->common,
+ sp, slot, pte_access, gfn, pfn, *sptep, prefetch,
true, host_writable, &spte);

if (*sptep == spte) {
@@ -3577,7 +3578,7 @@ void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu,

/* Before acquiring the MMU lock, see if we need to do any real work. */
free_active_root = (roots_to_free & KVM_MMU_ROOT_CURRENT)
- && VALID_PAGE(mmu->root.hpa);
+ && VALID_PAGE(mmu->common.root.hpa);

if (!free_active_root) {
for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
@@ -3597,10 +3598,10 @@ void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu,
&invalid_list);

if (free_active_root) {
- if (kvm_mmu_is_dummy_root(mmu->root.hpa)) {
+ if (kvm_mmu_is_dummy_root(mmu->common.root.hpa)) {
/* Nothing to cleanup for dummy roots. */
- } else if (root_to_sp(mmu->root.hpa)) {
- mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list);
+ } else if (root_to_sp(mmu->common.root.hpa)) {
+ mmu_free_root_page(kvm, &mmu->common.root.hpa, &invalid_list);
} else if (mmu->pae_root) {
for (i = 0; i < 4; ++i) {
if (!IS_VALID_PAE_ROOT(mmu->pae_root[i]))
@@ -3611,8 +3612,8 @@ void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu,
mmu->pae_root[i] = INVALID_PAE_ROOT;
}
}
- mmu->root.hpa = INVALID_PAGE;
- mmu->root.pgd = 0;
+ mmu->common.root.hpa = INVALID_PAGE;
+ mmu->common.root.pgd = 0;
}

kvm_mmu_commit_zap_page(kvm, &invalid_list);
@@ -3631,7 +3632,7 @@ void kvm_mmu_free_guest_mode_roots(struct kvm *kvm, struct kvm_mmu *mmu)
* This should not be called while L2 is active, L2 can't invalidate
* _only_ its own roots, e.g. INVVPID unconditionally exits.
*/
- WARN_ON_ONCE(mmu->root_role.guest_mode);
+ WARN_ON_ONCE(mmu->common.root_role.guest_mode);

for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
root_hpa = mmu->prev_roots[i].hpa;
@@ -3650,7 +3651,7 @@ EXPORT_SYMBOL_GPL(kvm_mmu_free_guest_mode_roots);
static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
u8 level)
{
- union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
+ union kvm_mmu_page_role role = vcpu->arch.mmu->common.root_role;
struct kvm_mmu_page *sp;

role.level = level;
@@ -3668,7 +3669,7 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
- u8 shadow_root_level = mmu->root_role.level;
+ u8 shadow_root_level = mmu->common.root_role.level;
hpa_t root;
unsigned i;
int r;
@@ -3680,10 +3681,10 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)

if (tdp_mmu_enabled) {
root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
- mmu->root.hpa = root;
+ mmu->common.root.hpa = root;
} else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level);
- mmu->root.hpa = root;
+ mmu->common.root.hpa = root;
} else if (shadow_root_level == PT32E_ROOT_LEVEL) {
if (WARN_ON_ONCE(!mmu->pae_root)) {
r = -EIO;
@@ -3698,7 +3699,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
mmu->pae_root[i] = root | PT_PRESENT_MASK |
shadow_me_value;
}
- mmu->root.hpa = __pa(mmu->pae_root);
+ mmu->common.root.hpa = __pa(mmu->pae_root);
} else {
WARN_ONCE(1, "Bad TDP root level = %d\n", shadow_root_level);
r = -EIO;
@@ -3706,7 +3707,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
}

/* root.pgd is ignored for direct MMUs. */
- mmu->root.pgd = 0;
+ mmu->common.root.pgd = 0;
out_unlock:
write_unlock(&vcpu->kvm->mmu_lock);
return r;
@@ -3785,7 +3786,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
root_gfn = root_pgd >> PAGE_SHIFT;

if (!kvm_vcpu_is_visible_gfn(vcpu, root_gfn)) {
- mmu->root.hpa = kvm_mmu_get_dummy_root();
+ mmu->common.root.hpa = kvm_mmu_get_dummy_root();
return 0;
}

@@ -3819,8 +3820,8 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
*/
if (mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) {
root = mmu_alloc_root(vcpu, root_gfn, 0,
- mmu->root_role.level);
- mmu->root.hpa = root;
+ mmu->common.root_role.level);
+ mmu->common.root.hpa = root;
goto set_root_pgd;
}

@@ -3835,7 +3836,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
* the shadow page table may be a PAE or a long mode page table.
*/
pm_mask = PT_PRESENT_MASK | shadow_me_value;
- if (mmu->root_role.level >= PT64_ROOT_4LEVEL) {
+ if (mmu->common.root_role.level >= PT64_ROOT_4LEVEL) {
pm_mask |= PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;

if (WARN_ON_ONCE(!mmu->pml4_root)) {
@@ -3844,7 +3845,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
}
mmu->pml4_root[0] = __pa(mmu->pae_root) | pm_mask;

- if (mmu->root_role.level == PT64_ROOT_5LEVEL) {
+ if (mmu->common.root_role.level == PT64_ROOT_5LEVEL) {
if (WARN_ON_ONCE(!mmu->pml5_root)) {
r = -EIO;
goto out_unlock;
@@ -3876,15 +3877,15 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
mmu->pae_root[i] = root | pm_mask;
}

- if (mmu->root_role.level == PT64_ROOT_5LEVEL)
- mmu->root.hpa = __pa(mmu->pml5_root);
- else if (mmu->root_role.level == PT64_ROOT_4LEVEL)
- mmu->root.hpa = __pa(mmu->pml4_root);
+ if (mmu->common.root_role.level == PT64_ROOT_5LEVEL)
+ mmu->common.root.hpa = __pa(mmu->pml5_root);
+ else if (mmu->common.root_role.level == PT64_ROOT_4LEVEL)
+ mmu->common.root.hpa = __pa(mmu->pml4_root);
else
- mmu->root.hpa = __pa(mmu->pae_root);
+ mmu->common.root.hpa = __pa(mmu->pae_root);

set_root_pgd:
- mmu->root.pgd = root_pgd;
+ mmu->common.root.pgd = root_pgd;
out_unlock:
write_unlock(&vcpu->kvm->mmu_lock);

@@ -3894,7 +3895,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
static int mmu_alloc_special_roots(struct kvm_vcpu *vcpu)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
- bool need_pml5 = mmu->root_role.level > PT64_ROOT_4LEVEL;
+ bool need_pml5 = mmu->common.root_role.level > PT64_ROOT_4LEVEL;
u64 *pml5_root = NULL;
u64 *pml4_root = NULL;
u64 *pae_root;
@@ -3905,9 +3906,9 @@ static int mmu_alloc_special_roots(struct kvm_vcpu *vcpu)
* equivalent level in the guest's NPT to shadow. Allocate the tables
* on demand, as running a 32-bit L1 VMM on 64-bit KVM is very rare.
*/
- if (mmu->root_role.direct ||
+ if (mmu->common.root_role.direct ||
mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL ||
- mmu->root_role.level < PT64_ROOT_4LEVEL)
+ mmu->common.root_role.level < PT64_ROOT_4LEVEL)
return 0;

/*
@@ -4003,16 +4004,16 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
int i;
struct kvm_mmu_page *sp;

- if (vcpu->arch.mmu->root_role.direct)
+ if (vcpu->arch.mmu->common.root_role.direct)
return;

- if (!VALID_PAGE(vcpu->arch.mmu->root.hpa))
+ if (!VALID_PAGE(vcpu->arch.mmu->common.root.hpa))
return;

vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);

if (vcpu->arch.mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) {
- hpa_t root = vcpu->arch.mmu->root.hpa;
+ hpa_t root = vcpu->arch.mmu->common.root.hpa;

if (!is_unsync_root(root))
return;
@@ -4134,7 +4135,7 @@ static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
if (!is_shadow_present_pte(sptes[leaf]))
leaf++;

- rsvd_check = &vcpu->arch.mmu->shadow_zero_check;
+ rsvd_check = &vcpu->arch.mmu->common.shadow_zero_check;

for (level = root; level >= leaf; level--)
reserved |= is_rsvd_spte(rsvd_check, sptes[level], level);
@@ -4233,7 +4234,7 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,

arch.token = alloc_apf_token(vcpu);
arch.gfn = gfn;
- arch.direct_map = vcpu->arch.mmu->root_role.direct;
+ arch.direct_map = vcpu->arch.mmu->common.root_role.direct;
arch.cr3 = kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu);

return kvm_setup_async_pf(vcpu, cr2_or_gpa,
@@ -4244,7 +4245,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
{
int r;

- if ((vcpu->arch.mmu->root_role.direct != work->arch.direct_map) ||
+ if ((vcpu->arch.mmu->common.root_role.direct != work->arch.direct_map) ||
work->wakeup_all)
return;

@@ -4252,7 +4253,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
if (unlikely(r))
return;

- if (!vcpu->arch.mmu->root_role.direct &&
+ if (!vcpu->arch.mmu->common.root_role.direct &&
work->arch.cr3 != kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu))
return;

@@ -4348,7 +4349,7 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
struct kvm_page_fault *fault)
{
- struct kvm_mmu_page *sp = root_to_sp(vcpu->arch.mmu->root.hpa);
+ struct kvm_mmu_page *sp = root_to_sp(vcpu->arch.mmu->common.root.hpa);

/* Special roots, e.g. pae_root, are not backed by shadow pages. */
if (sp && is_obsolete_sp(vcpu->kvm, sp))
@@ -4374,7 +4375,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
int r;

/* Dummy roots are used only for shadowing bad guest roots. */
- if (WARN_ON_ONCE(kvm_mmu_is_dummy_root(vcpu->arch.mmu->root.hpa)))
+ if (WARN_ON_ONCE(kvm_mmu_is_dummy_root(vcpu->arch.mmu->common.root.hpa)))
return RET_PF_RETRY;

if (page_fault_handle_page_track(vcpu, fault))
@@ -4555,9 +4556,9 @@ static inline bool is_root_usable(struct kvm_mmu_root_info *root, gpa_t pgd,
/*
* Find out if a previously cached root matching the new pgd/role is available,
* and insert the current root as the MRU in the cache.
- * If a matching root is found, it is assigned to kvm_mmu->root and
+ * If a matching root is found, it is assigned to kvm_mmu->common.root and
* true is returned.
- * If no match is found, kvm_mmu->root is left invalid, the LRU root is
+ * If no match is found, kvm_mmu->common.root is left invalid, the LRU root is
* evicted to make room for the current root, and false is returned.
*/
static bool cached_root_find_and_keep_current(struct kvm *kvm, struct kvm_mmu *mmu,
@@ -4566,7 +4567,7 @@ static bool cached_root_find_and_keep_current(struct kvm *kvm, struct kvm_mmu *m
{
uint i;

- if (is_root_usable(&mmu->root, new_pgd, new_role))
+ if (is_root_usable(&mmu->common.root, new_pgd, new_role))
return true;

for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
@@ -4578,8 +4579,8 @@ static bool cached_root_find_and_keep_current(struct kvm *kvm, struct kvm_mmu *m
* 2 C 0 1 3
* 3 C 0 1 2 (on exit from the loop)
*/
- swap(mmu->root, mmu->prev_roots[i]);
- if (is_root_usable(&mmu->root, new_pgd, new_role))
+ swap(mmu->common.root, mmu->prev_roots[i]);
+ if (is_root_usable(&mmu->common.root, new_pgd, new_role))
return true;
}

@@ -4589,10 +4590,11 @@ static bool cached_root_find_and_keep_current(struct kvm *kvm, struct kvm_mmu *m

/*
* Find out if a previously cached root matching the new pgd/role is available.
- * On entry, mmu->root is invalid.
- * If a matching root is found, it is assigned to kvm_mmu->root, the LRU entry
- * of the cache becomes invalid, and true is returned.
- * If no match is found, kvm_mmu->root is left invalid and false is returned.
+ * On entry, mmu->common.root is invalid.
+ * If a matching root is found, it is assigned to kvm_mmu->common.root, the LRU
+ * entry of the cache becomes invalid, and true is returned.
+ * If no match is found, kvm_mmu->common.root is left invalid and false is
+ * returned.
*/
static bool cached_root_find_without_current(struct kvm *kvm, struct kvm_mmu *mmu,
gpa_t new_pgd,
@@ -4607,7 +4609,7 @@ static bool cached_root_find_without_current(struct kvm *kvm, struct kvm_mmu *mm
return false;

hit:
- swap(mmu->root, mmu->prev_roots[i]);
+ swap(mmu->common.root, mmu->prev_roots[i]);
/* Bubble up the remaining roots. */
for (; i < KVM_MMU_NUM_PREV_ROOTS - 1; i++)
mmu->prev_roots[i] = mmu->prev_roots[i + 1];
@@ -4622,10 +4624,10 @@ static bool fast_pgd_switch(struct kvm *kvm, struct kvm_mmu *mmu,
* Limit reuse to 64-bit hosts+VMs without "special" roots in order to
* avoid having to deal with PDPTEs and other complexities.
*/
- if (VALID_PAGE(mmu->root.hpa) && !root_to_sp(mmu->root.hpa))
+ if (VALID_PAGE(mmu->common.root.hpa) && !root_to_sp(mmu->common.root.hpa))
kvm_mmu_free_roots(kvm, mmu, KVM_MMU_ROOT_CURRENT);

- if (VALID_PAGE(mmu->root.hpa))
+ if (VALID_PAGE(mmu->common.root.hpa))
return cached_root_find_and_keep_current(kvm, mmu, new_pgd, new_role);
else
return cached_root_find_without_current(kvm, mmu, new_pgd, new_role);
@@ -4634,7 +4636,7 @@ static bool fast_pgd_switch(struct kvm *kvm, struct kvm_mmu *mmu,
void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
- union kvm_mmu_page_role new_role = mmu->root_role;
+ union kvm_mmu_page_role new_role = mmu->common.root_role;

/*
* Return immediately if no usable root was found, kvm_mmu_reload()
@@ -4669,7 +4671,7 @@ void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd)
* count. Otherwise, clear the write flooding count.
*/
if (!new_role.direct) {
- struct kvm_mmu_page *sp = root_to_sp(vcpu->arch.mmu->root.hpa);
+ struct kvm_mmu_page *sp = root_to_sp(vcpu->arch.mmu->common.root.hpa);

if (!WARN_ON_ONCE(!sp))
__clear_sp_write_flooding_count(sp);
@@ -4863,7 +4865,7 @@ static inline u64 reserved_hpa_bits(void)
* follow the features in guest.
*/
static void reset_shadow_zero_bits_mask(struct kvm_vcpu *vcpu,
- struct kvm_mmu *context)
+ struct kvm_mmu_common *context)
{
/* @amd adds a check on bit of SPTEs, which KVM shouldn't use anyways. */
bool is_amd = true;
@@ -4909,7 +4911,7 @@ static inline bool boot_cpu_is_amd(void)
* the direct page table on host, use as much mmu features as
* possible, however, kvm currently does not do execution-protection.
*/
-static void reset_tdp_shadow_zero_bits_mask(struct kvm_mmu *context)
+static void reset_tdp_shadow_zero_bits_mask(struct kvm_mmu_common *context)
{
struct rsvd_bits_validate *shadow_zero_check;
int i;
@@ -4947,7 +4949,7 @@ static void reset_tdp_shadow_zero_bits_mask(struct kvm_mmu *context)
* is the shadow page table for intel nested guest.
*/
static void
-reset_ept_shadow_zero_bits_mask(struct kvm_mmu *context, bool execonly)
+reset_ept_shadow_zero_bits_mask(struct kvm_mmu_common *context, bool execonly)
{
__reset_rsvds_bits_mask_ept(&context->shadow_zero_check,
reserved_hpa_bits(), execonly,
@@ -5223,11 +5225,11 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu,
union kvm_mmu_page_role root_role = kvm_calc_tdp_mmu_root_page_role(vcpu, cpu_role);

if (cpu_role.as_u64 == context->cpu_role.as_u64 &&
- root_role.word == context->root_role.word)
+ root_role.word == context->common.root_role.word)
return;

context->cpu_role.as_u64 = cpu_role.as_u64;
- context->root_role.word = root_role.word;
+ context->common.root_role.word = root_role.word;
context->page_fault = kvm_tdp_page_fault;
context->sync_spte = NULL;
context->get_guest_pgd = get_guest_cr3;
@@ -5242,7 +5244,7 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu,
context->gva_to_gpa = paging32_gva_to_gpa;

reset_guest_paging_metadata(vcpu, context);
- reset_tdp_shadow_zero_bits_mask(context);
+ reset_tdp_shadow_zero_bits_mask(&context->common);
}

static void shadow_mmu_init_context(struct kvm_vcpu *vcpu, struct kvm_mmu *context,
@@ -5250,11 +5252,11 @@ static void shadow_mmu_init_context(struct kvm_vcpu *vcpu, struct kvm_mmu *conte
union kvm_mmu_page_role root_role)
{
if (cpu_role.as_u64 == context->cpu_role.as_u64 &&
- root_role.word == context->root_role.word)
+ root_role.word == context->common.root_role.word)
return;

context->cpu_role.as_u64 = cpu_role.as_u64;
- context->root_role.word = root_role.word;
+ context->common.root_role.word = root_role.word;

if (!is_cr0_pg(context))
nonpaging_init_context(context);
@@ -5264,7 +5266,7 @@ static void shadow_mmu_init_context(struct kvm_vcpu *vcpu, struct kvm_mmu *conte
paging32_init_context(context);

reset_guest_paging_metadata(vcpu, context);
- reset_shadow_zero_bits_mask(vcpu, context);
+ reset_shadow_zero_bits_mask(vcpu, &context->common);
}

static void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu,
@@ -5356,7 +5358,7 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
if (new_mode.as_u64 != context->cpu_role.as_u64) {
/* EPT, and thus nested EPT, does not consume CR0, CR4, nor EFER. */
context->cpu_role.as_u64 = new_mode.as_u64;
- context->root_role.word = new_mode.base.word;
+ context->common.root_role.word = new_mode.base.word;

context->page_fault = ept_page_fault;
context->gva_to_gpa = ept_gva_to_gpa;
@@ -5365,7 +5367,7 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
update_permission_bitmask(context, true);
context->pkru_mask = 0;
reset_rsvds_bits_mask_ept(vcpu, context, execonly, huge_page_level);
- reset_ept_shadow_zero_bits_mask(context, execonly);
+ reset_ept_shadow_zero_bits_mask(&context->common, execonly);
}

kvm_mmu_new_pgd(vcpu, new_eptp);
@@ -5451,9 +5453,9 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
* that problem is swept under the rug; KVM's CPUID API is horrific and
* it's all but impossible to solve it without introducing a new API.
*/
- vcpu->arch.root_mmu.root_role.word = 0;
- vcpu->arch.guest_mmu.root_role.word = 0;
- vcpu->arch.nested_mmu.root_role.word = 0;
+ vcpu->arch.root_mmu.common.root_role.word = 0;
+ vcpu->arch.guest_mmu.common.root_role.word = 0;
+ vcpu->arch.nested_mmu.common.root_role.word = 0;
vcpu->arch.root_mmu.cpu_role.ext.valid = 0;
vcpu->arch.guest_mmu.cpu_role.ext.valid = 0;
vcpu->arch.nested_mmu.cpu_role.ext.valid = 0;
@@ -5477,13 +5479,13 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
{
int r;

- r = mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->root_role.direct);
+ r = mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->common.root_role.direct);
if (r)
goto out;
r = mmu_alloc_special_roots(vcpu);
if (r)
goto out;
- if (vcpu->arch.mmu->root_role.direct)
+ if (vcpu->arch.mmu->common.root_role.direct)
r = mmu_alloc_direct_roots(vcpu);
else
r = mmu_alloc_shadow_roots(vcpu);
@@ -5511,9 +5513,9 @@ void kvm_mmu_unload(struct kvm_vcpu *vcpu)
struct kvm *kvm = vcpu->kvm;

kvm_mmu_free_roots(kvm, &vcpu->arch.root_mmu, KVM_MMU_ROOTS_ALL);
- WARN_ON_ONCE(VALID_PAGE(vcpu->arch.root_mmu.root.hpa));
+ WARN_ON_ONCE(VALID_PAGE(vcpu->arch.root_mmu.common.root.hpa));
kvm_mmu_free_roots(kvm, &vcpu->arch.guest_mmu, KVM_MMU_ROOTS_ALL);
- WARN_ON_ONCE(VALID_PAGE(vcpu->arch.guest_mmu.root.hpa));
+ WARN_ON_ONCE(VALID_PAGE(vcpu->arch.guest_mmu.common.root.hpa));
vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
}

@@ -5549,7 +5551,7 @@ static void __kvm_mmu_free_obsolete_roots(struct kvm *kvm, struct kvm_mmu *mmu)
unsigned long roots_to_free = 0;
int i;

- if (is_obsolete_root(kvm, mmu->root.hpa))
+ if (is_obsolete_root(kvm, mmu->common.root.hpa))
roots_to_free |= KVM_MMU_ROOT_CURRENT;

for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
@@ -5719,7 +5721,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
void *insn, int insn_len)
{
int r, emulation_type = EMULTYPE_PF;
- bool direct = vcpu->arch.mmu->root_role.direct;
+ bool direct = vcpu->arch.mmu->common.root_role.direct;

/*
* IMPLICIT_ACCESS is a KVM-defined flag used to correctly perform SMAP
@@ -5732,7 +5734,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
if (WARN_ON_ONCE(error_code & PFERR_IMPLICIT_ACCESS))
error_code &= ~PFERR_IMPLICIT_ACCESS;

- if (WARN_ON_ONCE(!VALID_PAGE(vcpu->arch.mmu->root.hpa)))
+ if (WARN_ON_ONCE(!VALID_PAGE(vcpu->arch.mmu->common.root.hpa)))
return RET_PF_RETRY;

r = RET_PF_INVALID;
@@ -5762,7 +5764,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
* paging in both guests. If true, we simply unprotect the page
* and resume the guest.
*/
- if (vcpu->arch.mmu->root_role.direct &&
+ if (vcpu->arch.mmu->common.root_role.direct &&
(error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE) {
kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa));
return 1;
@@ -5844,7 +5846,7 @@ void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
return;

if (roots & KVM_MMU_ROOT_CURRENT)
- __kvm_mmu_invalidate_addr(vcpu, mmu, addr, mmu->root.hpa);
+ __kvm_mmu_invalidate_addr(vcpu, mmu, addr, mmu->common.root.hpa);

for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
if (roots & KVM_MMU_ROOT_PREVIOUS(i))
@@ -5990,8 +5992,8 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
struct page *page;
int i;

- mmu->root.hpa = INVALID_PAGE;
- mmu->root.pgd = 0;
+ mmu->common.root.hpa = INVALID_PAGE;
+ mmu->common.root.pgd = 0;
for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index decc1f1536694..7699596308386 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -299,7 +299,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
};
int r;

- if (vcpu->arch.mmu->root_role.direct) {
+ if (vcpu->arch.mmu->common.root_role.direct) {
fault.gfn = fault.addr >> PAGE_SHIFT;
fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
}
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index c85255073f672..84509af0d7f9d 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -648,7 +648,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
if (FNAME(gpte_changed)(vcpu, gw, top_level))
goto out_gpte_changed;

- if (WARN_ON_ONCE(!VALID_PAGE(vcpu->arch.mmu->root.hpa)))
+ if (WARN_ON_ONCE(!VALID_PAGE(vcpu->arch.mmu->common.root.hpa)))
goto out_gpte_changed;

/*
@@ -657,7 +657,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
* loading a dummy root and handling the resulting page fault, e.g. if
* userspace create a memslot in the interim.
*/
- if (unlikely(kvm_mmu_is_dummy_root(vcpu->arch.mmu->root.hpa))) {
+ if (unlikely(kvm_mmu_is_dummy_root(vcpu->arch.mmu->common.root.hpa))) {
kvm_make_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu);
goto out_gpte_changed;
}
@@ -960,9 +960,8 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int
spte = *sptep;
host_writable = spte & shadow_host_writable_mask;
slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
- make_spte(vcpu, sp, slot, pte_access, gfn,
- spte_to_pfn(spte), spte, true, false,
- host_writable, &spte);
+ make_spte(vcpu, &vcpu->arch.mmu->common, sp, slot, pte_access,
+ gfn, spte_to_pfn(spte), spte, true, false, host_writable, &spte);

return mmu_spte_update(sptep, spte);
}
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 179156cd995df..9060a56e45569 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -137,7 +137,8 @@ bool spte_has_volatile_bits(u64 spte)
return false;
}

-bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+bool make_spte(struct kvm_vcpu *vcpu,
+ struct kvm_mmu_common *mmu_common, struct kvm_mmu_page *sp,
const struct kvm_memory_slot *slot,
unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
u64 old_spte, bool prefetch, bool can_unsync,
@@ -237,9 +238,9 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
if (prefetch)
spte = mark_spte_for_access_track(spte);

- WARN_ONCE(is_rsvd_spte(&vcpu->arch.mmu->shadow_zero_check, spte, level),
+ WARN_ONCE(is_rsvd_spte(&mmu_common->shadow_zero_check, spte, level),
"spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
- get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));
+ get_rsvd_bits(&mmu_common->shadow_zero_check, spte, level));

if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
/* Enforced by kvm_mmu_hugepage_adjust. */
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index b88b686a4ecbc..8f747268a4874 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -530,7 +530,8 @@ static inline u64 get_mmio_spte_generation(u64 spte)

bool spte_has_volatile_bits(u64 spte);

-bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+bool make_spte(struct kvm_vcpu *vcpu,
+ struct kvm_mmu_common *mmu_common, struct kvm_mmu_page *sp,
const struct kvm_memory_slot *slot,
unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
u64 old_spte, bool prefetch, bool can_unsync,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 6cd4dd631a2fa..6657685a28709 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -219,7 +219,7 @@ static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp,

hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
{
- union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
+ union kvm_mmu_page_role role = vcpu->arch.mmu->common.root_role;
struct kvm *kvm = vcpu->kvm;
struct kvm_mmu_page *root;

@@ -640,7 +640,7 @@ static inline void tdp_mmu_iter_set_spte(struct kvm *kvm, struct tdp_iter *iter,
else

#define tdp_mmu_for_each_pte(_iter, _mmu, _start, _end) \
- for_each_tdp_pte(_iter, root_to_sp(_mmu->root.hpa), _start, _end)
+ for_each_tdp_pte(_iter, root_to_sp(_mmu->common.root.hpa), _start, _end)

/*
* Yield if the MMU lock is contended or this thread needs to return control
@@ -964,9 +964,10 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
if (unlikely(!fault->slot))
new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
else
- wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
- fault->pfn, iter->old_spte, fault->prefetch, true,
- fault->map_writable, &new_spte);
+ wrprot = make_spte(vcpu, &vcpu->arch.mmu->common, sp, fault->slot,
+ ACC_ALL, iter->gfn, fault->pfn, iter->old_spte,
+ fault->prefetch, true, fault->map_writable,
+ &new_spte);

if (new_spte == iter->old_spte)
ret = RET_PF_SPURIOUS;
@@ -1769,7 +1770,7 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
gfn_t gfn = addr >> PAGE_SHIFT;
int leaf = -1;

- *root_level = vcpu->arch.mmu->root_role.level;
+ *root_level = vcpu->arch.mmu->common.root_role.level;

tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
leaf = iter.level;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 7121463123584..4941f53234a00 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3900,7 +3900,7 @@ static void svm_flush_tlb_asid(struct kvm_vcpu *vcpu)

static void svm_flush_tlb_current(struct kvm_vcpu *vcpu)
{
- hpa_t root_tdp = vcpu->arch.mmu->root.hpa;
+ hpa_t root_tdp = vcpu->arch.mmu->common.root.hpa;

/*
* When running on Hyper-V with EnlightenedNptTlb enabled, explicitly
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index c5ec0ef51ff78..43451fca00605 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -5720,7 +5720,7 @@ static int handle_invept(struct kvm_vcpu *vcpu)
VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);

roots_to_free = 0;
- if (nested_ept_root_matches(mmu->root.hpa, mmu->root.pgd,
+ if (nested_ept_root_matches(mmu->common.root.hpa, mmu->common.root.pgd,
operand.eptp))
roots_to_free |= KVM_MMU_ROOT_CURRENT;

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index be20a60047b1f..1cc717a718e9c 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3190,7 +3190,7 @@ static inline int vmx_get_current_vpid(struct kvm_vcpu *vcpu)
static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
- u64 root_hpa = mmu->root.hpa;
+ u64 root_hpa = mmu->common.root.hpa;

/* No flush required if the current context is invalid. */
if (!VALID_PAGE(root_hpa))
@@ -3198,7 +3198,7 @@ static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)

if (enable_ept)
ept_sync_context(construct_eptp(vcpu, root_hpa,
- mmu->root_role.level));
+ mmu->common.root_role.level));
else
vpid_sync_context(vmx_get_current_vpid(vcpu));
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2c924075f6f11..9ac8682c70ae7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8688,7 +8688,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
WARN_ON_ONCE(!(emulation_type & EMULTYPE_PF)))
return false;

- if (!vcpu->arch.mmu->root_role.direct) {
+ if (!vcpu->arch.mmu->common.root_role.direct) {
/*
* Write permission should be allowed since only
* write access need to be emulated.
@@ -8721,7 +8721,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
kvm_release_pfn_clean(pfn);

/* The instructions are well-emulated on direct mmu. */
- if (vcpu->arch.mmu->root_role.direct) {
+ if (vcpu->arch.mmu->common.root_role.direct) {
unsigned int indirect_shadow_pages;

write_lock(&vcpu->kvm->mmu_lock);
@@ -8789,7 +8789,7 @@ static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
vcpu->arch.last_retry_eip = ctxt->eip;
vcpu->arch.last_retry_addr = cr2_or_gpa;

- if (!vcpu->arch.mmu->root_role.direct)
+ if (!vcpu->arch.mmu->common.root_role.direct)
gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2_or_gpa, NULL);

kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
@@ -9089,7 +9089,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
ctxt->exception.address = cr2_or_gpa;

/* With shadow page tables, cr2 contains a GVA or nGPA. */
- if (vcpu->arch.mmu->root_role.direct) {
+ if (vcpu->arch.mmu->common.root_role.direct) {
ctxt->gpa_available = true;
ctxt->gpa_val = cr2_or_gpa;
}
--
2.17.1

2023-12-02 09:57:32

by Yan Zhao

[permalink] [raw]
Subject: [RFC PATCH 26/42] KVM: x86/mmu: introduce new op get_default_mt_mask to kvm_x86_ops

Introduce a new op get_default_mt_mask to kvm_x86_ops to get the default
memory type when no non-coherent DMA devices are attached.

For VMX, when there are no non-coherent DMA devices, guest MTRRs and the
vCPU's CR0.CD mode are not queried to determine EPT memory types. So,
introduce a new op get_default_mt_mask that can return the memory type
without requiring param "vcpu".

This is a preparation patch for later KVM MMU to export TDP, because IO
page fault requests arrive in non-vCPU context and have no "vcpu" from
which op get_mt_mask could derive the memory type.
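
A minimal sketch of how a non-vCPU path is expected to use the new op (the
wrapper sketch_default_mt() is hypothetical; the static_call form matches
how a later patch in this series invokes the op from make_spte()):

static u8 sketch_default_mt(struct kvm *kvm, bool is_mmio)
{
        /* Derive the EPT memory type from "kvm" alone, no vcpu needed. */
        return static_call(kvm_x86_get_default_mt_mask)(kvm, is_mmio);
}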

Signed-off-by: Yan Zhao <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/vmx/vmx.c | 11 +++++++++++
3 files changed, 13 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 26b628d84594b..d751407b1056c 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -92,6 +92,7 @@ KVM_X86_OP_OPTIONAL(sync_pir_to_irr)
KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
+KVM_X86_OP_OPTIONAL_RET0(get_default_mt_mask)
KVM_X86_OP(load_mmu_pgd)
KVM_X86_OP(has_wbinvd_exit)
KVM_X86_OP(get_l2_tsc_offset)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 16e01eee34a99..1f6ac04e0f952 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1679,6 +1679,7 @@ struct kvm_x86_ops {
int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
int (*set_identity_map_addr)(struct kvm *kvm, u64 ident_addr);
u8 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
+ u8 (*get_default_mt_mask)(struct kvm *kvm, bool is_mmio);

void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
int root_level);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 1cc717a718e9c..f290dd3094da6 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7614,6 +7614,16 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
}

+static u8 vmx_get_default_mt_mask(struct kvm *kvm, bool is_mmio)
+{
+ WARN_ON(kvm_arch_has_noncoherent_dma(kvm));
+
+ if (is_mmio)
+ return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
+
+ return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
+}
+
static void vmcs_set_secondary_exec_control(struct vcpu_vmx *vmx, u32 new_ctl)
{
/*
@@ -8295,6 +8305,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
.set_tss_addr = vmx_set_tss_addr,
.set_identity_map_addr = vmx_set_identity_map_addr,
.get_mt_mask = vmx_get_mt_mask,
+ .get_default_mt_mask = vmx_get_default_mt_mask,

.get_exit_info = vmx_get_exit_info,

--
2.17.1

2023-12-02 09:57:55

by Yan Zhao

[permalink] [raw]
Subject: [RFC PATCH 27/42] KVM: x86/mmu: change param "vcpu" to "kvm" in kvm_mmu_hugepage_adjust()

kvm_mmu_hugepage_adjust() requires "vcpu" only to get "vcpu->kvm".
Switch to passing in "kvm" directly.

No functional changes expected.

Signed-off-by: Yan Zhao <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 8 ++++----
arch/x86/kvm/mmu/mmu_internal.h | 2 +-
arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index cfeb066f38687..b461bab51255e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3159,7 +3159,7 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
return min(host_level, max_level);
}

-void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+void kvm_mmu_hugepage_adjust(struct kvm *kvm, struct kvm_page_fault *fault)
{
struct kvm_memory_slot *slot = fault->slot;
kvm_pfn_t mask;
@@ -3179,8 +3179,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
* Enforce the iTLB multihit workaround after capturing the requested
* level, which will be used to do precise, accurate accounting.
*/
- fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
- fault->gfn, fault->max_level);
+ fault->req_level = kvm_mmu_max_mapping_level(kvm, slot, fault->gfn,
+ fault->max_level);
if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
return;

@@ -3222,7 +3222,7 @@ static int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
int ret;
gfn_t base_gfn = fault->gfn;

- kvm_mmu_hugepage_adjust(vcpu, fault);
+ kvm_mmu_hugepage_adjust(vcpu->kvm, fault);

trace_kvm_mmu_spte_requested(fault);
for_each_shadow_entry(vcpu, fault->addr, it) {
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 7699596308386..1e9be0604e348 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -339,7 +339,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
int kvm_mmu_max_mapping_level(struct kvm *kvm,
const struct kvm_memory_slot *slot, gfn_t gfn,
int max_level);
-void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
+void kvm_mmu_hugepage_adjust(struct kvm *kvm, struct kvm_page_fault *fault);
void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);

void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 84509af0d7f9d..13c6390824a3e 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -716,7 +716,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
* are being shadowed by KVM, i.e. allocating a new shadow page may
* affect the allowed hugepage size.
*/
- kvm_mmu_hugepage_adjust(vcpu, fault);
+ kvm_mmu_hugepage_adjust(vcpu->kvm, fault);

trace_kvm_mmu_spte_requested(fault);

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 6657685a28709..5d76d4849e8aa 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1047,7 +1047,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
struct kvm_mmu_page *sp;
int ret = RET_PF_RETRY;

- kvm_mmu_hugepage_adjust(vcpu, fault);
+ kvm_mmu_hugepage_adjust(vcpu->kvm, fault);

trace_kvm_mmu_spte_requested(fault);

--
2.17.1

2023-12-02 09:58:35

by Yan Zhao

[permalink] [raw]
Subject: [RFC PATCH 28/42] KVM: x86/mmu: change "vcpu" to "kvm" in page_fault_handle_page_track()

page_fault_handle_page_track() only uses param "vcpu" to refer to
"vcpu->kvm", so change the param to "kvm" directly.

No functional changes expected.

Signed-off-by: Yan Zhao <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 8 ++++----
arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b461bab51255e..73437c1b1943e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4186,7 +4186,7 @@ static int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr, bool direct)
return RET_PF_RETRY;
}

-static bool page_fault_handle_page_track(struct kvm_vcpu *vcpu,
+static bool page_fault_handle_page_track(struct kvm *kvm,
struct kvm_page_fault *fault)
{
if (unlikely(fault->rsvd))
@@ -4199,7 +4199,7 @@ static bool page_fault_handle_page_track(struct kvm_vcpu *vcpu,
* guest is writing the page which is write tracked which can
* not be fixed by page fault handler.
*/
- if (kvm_gfn_is_write_tracked(vcpu->kvm, fault->slot, fault->gfn))
+ if (kvm_gfn_is_write_tracked(kvm, fault->slot, fault->gfn))
return true;

return false;
@@ -4378,7 +4378,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
if (WARN_ON_ONCE(kvm_mmu_is_dummy_root(vcpu->arch.mmu->common.root.hpa)))
return RET_PF_RETRY;

- if (page_fault_handle_page_track(vcpu, fault))
+ if (page_fault_handle_page_track(vcpu->kvm, fault))
return RET_PF_EMULATE;

r = fast_page_fault(vcpu, fault);
@@ -4458,7 +4458,7 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
{
int r;

- if (page_fault_handle_page_track(vcpu, fault))
+ if (page_fault_handle_page_track(vcpu->kvm, fault))
return RET_PF_EMULATE;

r = fast_page_fault(vcpu, fault);
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 13c6390824a3e..f685b036f6637 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -803,7 +803,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
fault->max_level = walker.level;
fault->slot = kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn);

- if (page_fault_handle_page_track(vcpu, fault)) {
+ if (page_fault_handle_page_track(vcpu->kvm, fault)) {
shadow_page_table_clear_flood(vcpu, fault->addr);
return RET_PF_EMULATE;
}
--
2.17.1

2023-12-02 09:58:53

by Yan Zhao

[permalink] [raw]
Subject: [RFC PATCH 29/42] KVM: x86/mmu: remove param "vcpu" from kvm_mmu_get_tdp_level()

kvm_mmu_get_tdp_level() only requires param "vcpu" for cpuid_maxphyaddr().
So, pass in the value of cpuid_maxphyaddr() directly and drop param
"vcpu".

This is a preparation patch for later KVM MMU to export TDP.

No functional changes expected.
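
As a worked example of the new signature (assuming tdp_root_level == 0 and
max_tdp_level == 5, which depend on hardware and module setup):

        kvm_mmu_get_tdp_level(46);      /* <= 48, 4-level TDP suffices */
        kvm_mmu_get_tdp_level(52);      /* > 48, returns max_tdp_level (5) */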

Signed-off-by: Yan Zhao <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 73437c1b1943e..abdf49b5cdd79 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5186,14 +5186,14 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
reset_guest_paging_metadata(vcpu, mmu);
}

-static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
+static inline int kvm_mmu_get_tdp_level(int maxphyaddr)
{
/* tdp_root_level is architecture forced level, use it if nonzero */
if (tdp_root_level)
return tdp_root_level;

/* Use 5-level TDP if and only if it's useful/necessary. */
- if (max_tdp_level == 5 && cpuid_maxphyaddr(vcpu) <= 48)
+ if (max_tdp_level == 5 && maxphyaddr <= 48)
return 4;

return max_tdp_level;
@@ -5211,7 +5211,7 @@ kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
role.smm = cpu_role.base.smm;
role.guest_mode = cpu_role.base.guest_mode;
role.ad_disabled = !kvm_ad_enabled();
- role.level = kvm_mmu_get_tdp_level(vcpu);
+ role.level = kvm_mmu_get_tdp_level(cpuid_maxphyaddr(vcpu));
role.direct = true;
role.has_4_byte_gpte = false;

@@ -5310,7 +5310,7 @@ void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
WARN_ON_ONCE(cpu_role.base.direct);

root_role = cpu_role.base;
- root_role.level = kvm_mmu_get_tdp_level(vcpu);
+ root_role.level = kvm_mmu_get_tdp_level(cpuid_maxphyaddr(vcpu));
if (root_role.level == PT64_ROOT_5LEVEL &&
cpu_role.base.level == PT64_ROOT_4LEVEL)
root_role.passthrough = 1;
@@ -6012,7 +6012,8 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
* other exception is for shadowing L1's 32-bit or PAE NPT on 64-bit
* KVM; that horror is handled on-demand by mmu_alloc_special_roots().
*/
- if (tdp_enabled && kvm_mmu_get_tdp_level(vcpu) > PT32E_ROOT_LEVEL)
+ if (tdp_enabled &&
+ kvm_mmu_get_tdp_level(cpuid_maxphyaddr(vcpu)) > PT32E_ROOT_LEVEL)
return 0;

page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_DMA32);
--
2.17.1

2023-12-02 09:59:31

by Yan Zhao

[permalink] [raw]
Subject: [RFC PATCH 30/42] KVM: x86/mmu: remove param "vcpu" from kvm_calc_tdp_mmu_root_page_role()

kvm_calc_tdp_mmu_root_page_role() only requires "vcpu" to get the
maxphyaddr for kvm_mmu_get_tdp_level(). So, just pass in the value of
maxphyaddr from the caller and get rid of param "vcpu".

This is a preparation patch for later KVM MMU to export TDP.

No functional changes expected.

Signed-off-by: Yan Zhao <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index abdf49b5cdd79..bcf17aef29119 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5200,7 +5200,7 @@ static inline int kvm_mmu_get_tdp_level(int maxphyaddr)
}

static union kvm_mmu_page_role
-kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
+kvm_calc_tdp_mmu_root_page_role(int maxphyaddr,
union kvm_cpu_role cpu_role)
{
union kvm_mmu_page_role role = {0};
@@ -5211,7 +5211,7 @@ kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
role.smm = cpu_role.base.smm;
role.guest_mode = cpu_role.base.guest_mode;
role.ad_disabled = !kvm_ad_enabled();
- role.level = kvm_mmu_get_tdp_level(cpuid_maxphyaddr(vcpu));
+ role.level = kvm_mmu_get_tdp_level(maxphyaddr);
role.direct = true;
role.has_4_byte_gpte = false;

@@ -5222,7 +5222,9 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu,
union kvm_cpu_role cpu_role)
{
struct kvm_mmu *context = &vcpu->arch.root_mmu;
- union kvm_mmu_page_role root_role = kvm_calc_tdp_mmu_root_page_role(vcpu, cpu_role);
+ union kvm_mmu_page_role root_role;
+
+ root_role = kvm_calc_tdp_mmu_root_page_role(cpuid_maxphyaddr(vcpu), cpu_role);

if (cpu_role.as_u64 == context->cpu_role.as_u64 &&
root_role.word == context->common.root_role.word)
--
2.17.1

2023-12-02 10:00:40

by Yan Zhao

[permalink] [raw]
Subject: [RFC PATCH 32/42] KVM: x86/mmu: add extra param "kvm" to make_mmio_spte()

Add an extra param "kvm" to make_mmio_spte() so that param "vcpu" can be
NULL in the future, allowing MMIO SPTEs to be generated in non-vCPU
context.

When "vcpu" is NULL, kvm_memslots() rather than kvm_vcpu_memslots() is
called to get memslots pointer, so MMIO SPTEs are not allowed to be
generated for SMM mode in non-vCPU context.

This is a preparation patch for later KVM MMU to export TDP.

Note: in practice, if the exported TDP is mapped in non-vCPU context, it
will not reach make_mmio_spte() due to the earlier failure in
kvm_handle_noslot_fault(). make_mmio_spte() is modified in this patch
anyway so that the caller does not need to check "vcpu".
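
A condensed restatement of the memslots selection this patch adds to
make_mmio_spte() (the helper name sketch_mmio_memslots() is made up for
illustration only):

static struct kvm_memslots *sketch_mmio_memslots(struct kvm *kvm,
                                                 struct kvm_vcpu *vcpu)
{
        /* NULL vcpu: use the default address space, never SMM memslots. */
        return vcpu ? kvm_vcpu_memslots(vcpu) : kvm_memslots(kvm);
}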

Signed-off-by: Yan Zhao <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 2 +-
arch/x86/kvm/mmu/spte.c | 5 +++--
arch/x86/kvm/mmu/spte.h | 2 +-
arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
4 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index df5651ea99139..e4cae4ff20770 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -296,7 +296,7 @@ static void kvm_flush_remote_tlbs_sptep(struct kvm *kvm, u64 *sptep)
static void mark_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 gfn,
unsigned int access)
{
- u64 spte = make_mmio_spte(vcpu, gfn, access);
+ u64 spte = make_mmio_spte(vcpu->kvm, vcpu, gfn, access);

trace_mark_mmio_spte(sptep, gfn, spte);
mmu_spte_set(sptep, spte);
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 9060a56e45569..daeab3b9eee1e 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -71,9 +71,10 @@ static u64 generation_mmio_spte_mask(u64 gen)
return mask;
}

-u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
+u64 make_mmio_spte(struct kvm *kvm, struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
{
- u64 gen = kvm_vcpu_memslots(vcpu)->generation & MMIO_SPTE_GEN_MASK;
+ struct kvm_memslots *memslots = vcpu ? kvm_vcpu_memslots(vcpu) : kvm_memslots(kvm);
+ u64 gen = memslots->generation & MMIO_SPTE_GEN_MASK;
u64 spte = generation_mmio_spte_mask(gen);
u64 gpa = gfn << PAGE_SHIFT;

diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 8f747268a4874..4ad19c469bd73 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -539,7 +539,7 @@ bool make_spte(struct kvm_vcpu *vcpu,
u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte,
union kvm_mmu_page_role role, int index);
u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
-u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
+u64 make_mmio_spte(struct kvm *kvm, struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
u64 mark_spte_for_access_track(u64 spte);

/* Restore an acc-track PTE back to a regular PTE */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 5d76d4849e8aa..892cf1f5b57a8 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -962,7 +962,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
return RET_PF_RETRY;

if (unlikely(!fault->slot))
- new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
+ new_spte = make_mmio_spte(vcpu->kvm, vcpu, iter->gfn, ACC_ALL);
else
wrprot = make_spte(vcpu, &vcpu->arch.mmu->common, sp, fault->slot,
ACC_ALL, iter->gfn, fault->pfn, iter->old_spte,
--
2.17.1

2023-12-02 10:01:05

by Yan Zhao

[permalink] [raw]
Subject: [RFC PATCH 33/42] KVM: x86/mmu: add extra param "kvm" to make_spte()

Add an extra param "kvm" to make_spte() so that param "vcpu" can be NULL
in the future, allowing SPTEs to be generated in non-vCPU context.

"vcpu" is only used in make_spte() to get memory type mask if
shadow_memtype_mask is true, which applies only to VMX when EPT is enabled.
VMX only requires param "vcpu" when non-coherent DMA devices are attached
to check vcpu's CR0.CD and guest MTRRs.
So, if non-coherent DMAs are not attached, make_spte() can call
kvm_x86_get_default_mt_mask() to get default memory type for non-vCPU
context.

This is a preparation patch for later KVM MMU to export TDP.
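
For clarity, a condensed restatement of the memory-type selection the
spte.c hunk below performs (the helper sketch_memtype_bits() is made up;
it assumes access to the same helpers make_spte() already uses, e.g.
kvm_is_mmio_pfn()):

static u64 sketch_memtype_bits(struct kvm *kvm, struct kvm_vcpu *vcpu,
                               gfn_t gfn, kvm_pfn_t pfn)
{
        /* Only VMX with EPT enabled sets shadow_memtype_mask. */
        if (!shadow_memtype_mask)
                return 0;

        /* vCPU context: may consult the vCPU's CR0.CD and guest MTRRs. */
        if (vcpu)
                return static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
                                                        kvm_is_mmio_pfn(pfn));

        /* Non-vCPU context: no non-coherent DMA, use the default type. */
        return static_call(kvm_x86_get_default_mt_mask)(kvm,
                                                        kvm_is_mmio_pfn(pfn));
}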

Signed-off-by: Yan Zhao <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 2 +-
arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
arch/x86/kvm/mmu/spte.c | 18 ++++++++++++------
arch/x86/kvm/mmu/spte.h | 2 +-
arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
5 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e4cae4ff20770..c9b587b30dae3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2939,7 +2939,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
was_rmapped = 1;
}

- wrprot = make_spte(vcpu, &vcpu->arch.mmu->common,
+ wrprot = make_spte(vcpu->kvm, vcpu, &vcpu->arch.mmu->common,
sp, slot, pte_access, gfn, pfn, *sptep, prefetch,
true, host_writable, &spte);

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 054d1a203f0ca..fb4767a9e966e 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -960,7 +960,7 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int
spte = *sptep;
host_writable = spte & shadow_host_writable_mask;
slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
- make_spte(vcpu, &vcpu->arch.mmu->common, sp, slot, pte_access,
+ make_spte(vcpu->kvm, vcpu, &vcpu->arch.mmu->common, sp, slot, pte_access,
gfn, spte_to_pfn(spte), spte, true, false, host_writable, &spte);

return mmu_spte_update(sptep, spte);
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index daeab3b9eee1e..5e73a679464c0 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -138,7 +138,7 @@ bool spte_has_volatile_bits(u64 spte)
return false;
}

-bool make_spte(struct kvm_vcpu *vcpu,
+bool make_spte(struct kvm *kvm, struct kvm_vcpu *vcpu,
struct kvm_mmu_common *mmu_common, struct kvm_mmu_page *sp,
const struct kvm_memory_slot *slot,
unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
@@ -179,7 +179,7 @@ bool make_spte(struct kvm_vcpu *vcpu,
* just to optimize a mode that is anything but performance critical.
*/
if (level > PG_LEVEL_4K && (pte_access & ACC_EXEC_MASK) &&
- is_nx_huge_page_enabled(vcpu->kvm)) {
+ is_nx_huge_page_enabled(kvm)) {
pte_access &= ~ACC_EXEC_MASK;
}

@@ -194,9 +194,15 @@ bool make_spte(struct kvm_vcpu *vcpu,
if (level > PG_LEVEL_4K)
spte |= PT_PAGE_SIZE_MASK;

- if (shadow_memtype_mask)
- spte |= static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
+ if (shadow_memtype_mask) {
+ if (vcpu)
+ spte |= static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
kvm_is_mmio_pfn(pfn));
+ else
+ spte |= static_call(kvm_x86_get_default_mt_mask)(kvm,
+ kvm_is_mmio_pfn(pfn));
+ }
+
if (host_writable)
spte |= shadow_host_writable_mask;
else
@@ -225,7 +231,7 @@ bool make_spte(struct kvm_vcpu *vcpu,
* e.g. it's write-tracked (upper-level SPs) or has one or more
* shadow pages and unsync'ing pages is not allowed.
*/
- if (mmu_try_to_unsync_pages(vcpu->kvm, slot, gfn, can_unsync, prefetch)) {
+ if (mmu_try_to_unsync_pages(kvm, slot, gfn, can_unsync, prefetch)) {
wrprot = true;
pte_access &= ~ACC_WRITE_MASK;
spte &= ~(PT_WRITABLE_MASK | shadow_mmu_writable_mask);
@@ -246,7 +252,7 @@ bool make_spte(struct kvm_vcpu *vcpu,
if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
/* Enforced by kvm_mmu_hugepage_adjust. */
WARN_ON_ONCE(level > PG_LEVEL_4K);
- mark_page_dirty_in_slot(vcpu->kvm, slot, gfn);
+ mark_page_dirty_in_slot(kvm, slot, gfn);
}

*new_spte = spte;
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 4ad19c469bd73..f1532589b7083 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -530,7 +530,7 @@ static inline u64 get_mmio_spte_generation(u64 spte)

bool spte_has_volatile_bits(u64 spte);

-bool make_spte(struct kvm_vcpu *vcpu,
+bool make_spte(struct kvm *kvm, struct kvm_vcpu *vcpu,
struct kvm_mmu_common *mmu_common, struct kvm_mmu_page *sp,
const struct kvm_memory_slot *slot,
unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 892cf1f5b57a8..a45d1b71cd62a 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -964,7 +964,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
if (unlikely(!fault->slot))
new_spte = make_mmio_spte(vcpu->kvm, vcpu, iter->gfn, ACC_ALL);
else
- wrprot = make_spte(vcpu, &vcpu->arch.mmu->common, sp, fault->slot,
+ wrprot = make_spte(vcpu->kvm, vcpu, &vcpu->arch.mmu->common, sp, fault->slot,
ACC_ALL, iter->gfn, fault->pfn, iter->old_spte,
fault->prefetch, true, fault->map_writable,
&new_spte);
--
2.17.1

2023-12-02 10:01:08

by Yan Zhao

[permalink] [raw]
Subject: [RFC PATCH 31/42] KVM: x86/mmu: add extra param "kvm" to kvm_faultin_pfn()

Add an extra param "kvm" to kvm_faultin_pfn() so that param "vcpu" can be
NULL in the future, allowing page faults in non-vCPU context.

This is a preparation patch for later KVM MMU to export TDP.

No-slot mapping (for the emulated MMIO cache), async page fault, and
sig-pending PFNs are not compatible with page faults in non-vCPU context.
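
A minimal sketch of the intended non-vCPU usage (the wrapper and its name
are hypothetical; the IO page fault path added later in this series is the
expected caller):

static int sketch_faultin_non_vcpu(struct kvm *kvm,
                                   struct kvm_page_fault *fault)
{
        /*
         * With vcpu == NULL, async page faults are never set up (the
         * fault is resolved synchronously), and sig-pending or no-slot
         * (emulated MMIO) PFNs are reported as errors.
         */
        return kvm_faultin_pfn(kvm, NULL, fault, ACC_ALL);
}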

Signed-off-by: Yan Zhao <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 35 +++++++++++++++++++---------------
arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
2 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index bcf17aef29119..df5651ea99139 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3266,9 +3266,10 @@ static void kvm_send_hwpoison_signal(struct kvm_memory_slot *slot, gfn_t gfn)
send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva, PAGE_SHIFT, current);
}

-static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+static int kvm_handle_error_pfn(struct kvm *kvm, struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
{
- if (is_sigpending_pfn(fault->pfn)) {
+ if (is_sigpending_pfn(fault->pfn) && vcpu) {
kvm_handle_signal_exit(vcpu);
return -EINTR;
}
@@ -3289,12 +3290,15 @@ static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
return -EFAULT;
}

-static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
+static int kvm_handle_noslot_fault(struct kvm *kvm, struct kvm_vcpu *vcpu,
struct kvm_page_fault *fault,
unsigned int access)
{
gva_t gva = fault->is_tdp ? 0 : fault->addr;

+ if (!vcpu)
+ return -EFAULT;
+
vcpu_cache_mmio_info(vcpu, gva, fault->gfn,
access & shadow_mmio_access_mask);

@@ -4260,7 +4264,8 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true, NULL);
}

-static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+static int __kvm_faultin_pfn(struct kvm *kvm, struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
{
struct kvm_memory_slot *slot = fault->slot;
bool async;
@@ -4275,7 +4280,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault

if (!kvm_is_visible_memslot(slot)) {
/* Don't expose private memslots to L2. */
- if (is_guest_mode(vcpu)) {
+ if (vcpu && is_guest_mode(vcpu)) {
fault->slot = NULL;
fault->pfn = KVM_PFN_NOSLOT;
fault->map_writable = false;
@@ -4288,7 +4293,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
* when the AVIC is re-enabled.
*/
if (slot && slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT &&
- !kvm_apicv_activated(vcpu->kvm))
+ !kvm_apicv_activated(kvm))
return RET_PF_EMULATE;
}

@@ -4299,7 +4304,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
if (!async)
return RET_PF_CONTINUE; /* *pfn has correct page already */

- if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) {
+ if (!fault->prefetch && vcpu && kvm_can_do_async_pf(vcpu)) {
trace_kvm_try_async_get_page(fault->addr, fault->gfn);
if (kvm_find_async_pf_gfn(vcpu, fault->gfn)) {
trace_kvm_async_pf_repeated_fault(fault->addr, fault->gfn);
@@ -4321,23 +4326,23 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
return RET_PF_CONTINUE;
}

-static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
- unsigned int access)
+static int kvm_faultin_pfn(struct kvm *kvm, struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault, unsigned int access)
{
int ret;

- fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
+ fault->mmu_seq = kvm->mmu_invalidate_seq;
smp_rmb();

- ret = __kvm_faultin_pfn(vcpu, fault);
+ ret = __kvm_faultin_pfn(kvm, vcpu, fault);
if (ret != RET_PF_CONTINUE)
return ret;

if (unlikely(is_error_pfn(fault->pfn)))
- return kvm_handle_error_pfn(vcpu, fault);
+ return kvm_handle_error_pfn(kvm, vcpu, fault);

if (unlikely(!fault->slot))
- return kvm_handle_noslot_fault(vcpu, fault, access);
+ return kvm_handle_noslot_fault(kvm, vcpu, fault, access);

return RET_PF_CONTINUE;
}
@@ -4389,7 +4394,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
if (r)
return r;

- r = kvm_faultin_pfn(vcpu, fault, ACC_ALL);
+ r = kvm_faultin_pfn(vcpu->kvm, vcpu, fault, ACC_ALL);
if (r != RET_PF_CONTINUE)
return r;

@@ -4469,7 +4474,7 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
if (r)
return r;

- r = kvm_faultin_pfn(vcpu, fault, ACC_ALL);
+ r = kvm_faultin_pfn(vcpu->kvm, vcpu, fault, ACC_ALL);
if (r != RET_PF_CONTINUE)
return r;

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index f685b036f6637..054d1a203f0ca 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -812,7 +812,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
if (r)
return r;

- r = kvm_faultin_pfn(vcpu, fault, walker.pte_access);
+ r = kvm_faultin_pfn(vcpu->kvm, vcpu, fault, walker.pte_access);
if (r != RET_PF_CONTINUE)
return r;

--
2.17.1

2023-12-02 10:01:49

by Yan Zhao

[permalink] [raw]
Subject: [RFC PATCH 34/42] KVM: x86/mmu: add extra param "kvm" to tdp_mmu_map_handle_target_level()

Add an extra param "kvm" to tdp_mmu_map_handle_target_level() to allow for
mapping in non-vCPU context in future.

"vcpu" is only required in tdp_mmu_map_handle_target_level() for accounting
of MMIO SPTEs. As kvm_faultin_pfn() now will return error for non-slot
PFNs, no MMIO SPTEs should be generated and accounted in non-vCPU context.
So, just let tdp_mmu_map_handle_target_level() warn if MMIO SPTEs are
encountered in non-vCPU context.

This is a preparation patch for later KVM MMU to export TDP.
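
For reference, the later non-vCPU caller (the exported TDP fault handler)
is expected to invoke it roughly as below (sketch):

	ret = tdp_mmu_map_handle_target_level(kvm, NULL, &mmu->common,
					      fault, &iter);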

Signed-off-by: Yan Zhao <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 26 +++++++++++++++++---------
1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index a45d1b71cd62a..5edff3b4698b7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -949,7 +949,9 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
* Installs a last-level SPTE to handle a TDP page fault.
* (NPT/EPT violation/misconfiguration)
*/
-static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
+static int tdp_mmu_map_handle_target_level(struct kvm *kvm,
+ struct kvm_vcpu *vcpu,
+ struct kvm_mmu_common *mmu_common,
struct kvm_page_fault *fault,
struct tdp_iter *iter)
{
@@ -958,24 +960,26 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
int ret = RET_PF_FIXED;
bool wrprot = false;

+ WARN_ON(!kvm);
+
if (WARN_ON_ONCE(sp->role.level != fault->goal_level))
return RET_PF_RETRY;

if (unlikely(!fault->slot))
new_spte = make_mmio_spte(vcpu->kvm, vcpu, iter->gfn, ACC_ALL);
else
- wrprot = make_spte(vcpu->kvm, vcpu, &vcpu->arch.mmu->common, sp, fault->slot,
+ wrprot = make_spte(kvm, vcpu, mmu_common, sp, fault->slot,
ACC_ALL, iter->gfn, fault->pfn, iter->old_spte,
fault->prefetch, true, fault->map_writable,
&new_spte);

if (new_spte == iter->old_spte)
ret = RET_PF_SPURIOUS;
- else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
+ else if (tdp_mmu_set_spte_atomic(kvm, iter, new_spte))
return RET_PF_RETRY;
else if (is_shadow_present_pte(iter->old_spte) &&
!is_last_spte(iter->old_spte, iter->level))
- kvm_flush_remote_tlbs_gfn(vcpu->kvm, iter->gfn, iter->level);
+ kvm_flush_remote_tlbs_gfn(kvm, iter->gfn, iter->level);

/*
* If the page fault was caused by a write but the page is write
@@ -989,10 +993,13 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,

/* If a MMIO SPTE is installed, the MMIO will need to be emulated. */
if (unlikely(is_mmio_spte(new_spte))) {
- vcpu->stat.pf_mmio_spte_created++;
- trace_mark_mmio_spte(rcu_dereference(iter->sptep), iter->gfn,
- new_spte);
- ret = RET_PF_EMULATE;
+ /* if without vcpu, no mmio spte should be installed */
+ if (!WARN_ON(!vcpu)) {
+ vcpu->stat.pf_mmio_spte_created++;
+ trace_mark_mmio_spte(rcu_dereference(iter->sptep), iter->gfn,
+ new_spte);
+ ret = RET_PF_EMULATE;
+ }
} else {
trace_kvm_mmu_set_spte(iter->level, iter->gfn,
rcu_dereference(iter->sptep));
@@ -1114,7 +1121,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
goto retry;

map_target_level:
- ret = tdp_mmu_map_handle_target_level(vcpu, fault, &iter);
+ ret = tdp_mmu_map_handle_target_level(vcpu->kvm, vcpu, &vcpu->arch.mmu->common,
+ fault, &iter);

retry:
rcu_read_unlock();
--
2.17.1

2023-12-02 10:02:16

by Yan Zhao

[permalink] [raw]
Subject: [RFC PATCH 35/42] KVM: x86/mmu: Get/Put TDP root page to be exported

Get/put the root page of a KVM exported TDP page table based on the KVM TDP
MMU.

When KVM TDP FD requests a TDP to export, it provides an address space to
indicate the page role of the TDP to export. In this RFC, only KVM address
space 0 is supported, so the TDP MMU selects a root page role with smm=0
and guest_mode=0. (The level of the root page role comes from
kvm->arch.maxphyaddr, based on the assumption that vCPUs are homogeneous.)

The TDP MMU then searches the list tdp_mmu_roots for an existing root, or
creates a new root if none is found.
A dedicated kvm->arch.exported_tdp_header_cache is used to allocate the
root page in non-vCPU context.
The found/created root page is marked as "exported".

When KVM TDP FD puts the exported TDP, the "exported" mark on the root page
is removed.

Whether a root page is exported or not, vCPUs just load the TDP root
according to their vCPU modes.

In this way, KVM is able to share the TDP page tables of KVM address space
0 with the IOMMU side.

tdp_mmu_roots
|
role | smm | guest_mode +------+-----------+----------+
------|----------------- | | | |
0 | 0 | 0 ==> address space 0 | v v v
1 | 1 | 0 | .--------. .--------. .--------.
2 | 0 | 1 | | root | | root | | root |
3 | 1 | 1 | |(role 1)| |(role 2)| |(role 3)|
| '--------' '--------' '--------'
| ^
| | create or get .------.
| +--------------------| vCPU |
| fault '------'
| smm=1
| guest_mode=0
|
(set root as exported) v
.--------. create or get .---------------. create or get .------.
| TDP FD |------------------->| root (role 0) |<-----------------| vCPU |
'--------' fault '---------------' fault '------'
. smm=0
. guest_mode=0
.
non-vCPU context <---|---> vCPU context
.
.

This patch actually needs to be split into several smaller ones; it is kept
as one big patch here to show the bigger picture. It will be split into
smaller patches in the next version.
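
A minimal usage sketch of the new interface (field and function names
follow this patch; the real caller is the KVM TDP FD code, and error
handling is omitted):

	struct kvm_exported_tdp tdp = {
		.kvm   = kvm,
		.as_id = 0,	/* only KVM address space 0 is supported */
	};

	ret = kvm_mmu_get_exported_tdp(kvm, &tdp);
	if (!ret) {
		/* tdp.arch.mmu.common.root.hpa now holds the shared root HPA */
		kvm_mmu_put_exported_tdp(&tdp);	/* unmark and put the root */
	}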

Signed-off-by: Yan Zhao <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 18 +++++
arch/x86/kvm/mmu.h | 5 ++
arch/x86/kvm/mmu/mmu.c | 129 ++++++++++++++++++++++++++++++++
arch/x86/kvm/mmu/mmu_internal.h | 4 +
arch/x86/kvm/mmu/tdp_mmu.c | 47 ++++++++++++
arch/x86/kvm/mmu/tdp_mmu.h | 6 ++
arch/x86/kvm/x86.c | 17 +++++
7 files changed, 226 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1f6ac04e0f952..860502720e3e7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1476,7 +1476,25 @@ struct kvm_arch {
*/
#define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
struct kvm_mmu_memory_cache split_desc_cache;
+
+#ifdef CONFIG_HAVE_KVM_EXPORTED_TDP
+ struct kvm_mmu_memory_cache exported_tdp_header_cache;
+ struct kvm_mmu_memory_cache exported_tdp_page_cache;
+ struct mutex exported_tdp_cache_lock;
+ int maxphyaddr;
+#endif
+};
+
+#ifdef CONFIG_HAVE_KVM_EXPORTED_TDP
+#define __KVM_HAVE_ARCH_EXPORTED_TDP
+struct kvm_exported_tdp_mmu {
+ struct kvm_mmu_common common;
+ struct kvm_mmu_page *root_page;
};
+struct kvm_arch_exported_tdp {
+ struct kvm_exported_tdp_mmu mmu;
+};
+#endif

struct kvm_vm_stat {
struct kvm_vm_stat_generic generic;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index e9631cc23a594..3d11f2068572d 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -251,6 +251,11 @@ int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
int kvm_mmu_post_init_vm(struct kvm *kvm);
void kvm_mmu_pre_destroy_vm(struct kvm *kvm);

+#ifdef CONFIG_HAVE_KVM_EXPORTED_TDP
+int kvm_mmu_get_exported_tdp(struct kvm *kvm, struct kvm_exported_tdp *tdp);
+void kvm_mmu_put_exported_tdp(struct kvm_exported_tdp *tdp);
+#endif
+
static inline bool kvm_shadow_root_allocated(struct kvm *kvm)
{
/*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c9b587b30dae3..3e2475c678c27 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5468,6 +5468,13 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
vcpu->arch.nested_mmu.cpu_role.ext.valid = 0;
kvm_mmu_reset_context(vcpu);

+#ifdef CONFIG_HAVE_KVM_EXPORTED_TDP
+ if (vcpu->kvm->arch.maxphyaddr)
+ vcpu->kvm->arch.maxphyaddr = min(vcpu->kvm->arch.maxphyaddr,
+ vcpu->arch.maxphyaddr);
+ else
+ vcpu->kvm->arch.maxphyaddr = vcpu->arch.maxphyaddr;
+#endif
/*
* Changing guest CPUID after KVM_RUN is forbidden, see the comment in
* kvm_arch_vcpu_ioctl().
@@ -6216,6 +6223,13 @@ void kvm_mmu_init_vm(struct kvm *kvm)

kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
+
+#ifdef CONFIG_HAVE_KVM_EXPORTED_TDP
+ mutex_init(&kvm->arch.exported_tdp_cache_lock);
+ kvm->arch.exported_tdp_header_cache.kmem_cache = mmu_page_header_cache;
+ kvm->arch.exported_tdp_header_cache.gfp_zero = __GFP_ZERO;
+ kvm->arch.exported_tdp_page_cache.gfp_zero = __GFP_ZERO;
+#endif
}

static void mmu_free_vm_memory_caches(struct kvm *kvm)
@@ -7193,3 +7207,118 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
if (kvm->arch.nx_huge_page_recovery_thread)
kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
}
+
+#ifdef CONFIG_HAVE_KVM_EXPORTED_TDP
+static bool kvm_mmu_is_expoted_allowed(struct kvm *kvm, int as_id)
+{
+ if (as_id != 0) {
+ pr_err("unsupported address space to export TDP\n");
+ return false;
+ }
+
+ /*
+ * Currently, exporting TDP is based on TDP MMU and is not enabled on
+ * hyperv, one of the reasons is because of hyperv's tlb flush way
+ */
+ if (!tdp_mmu_enabled || IS_ENABLED(CONFIG_HYPERV) ||
+ !IS_ENABLED(CONFIG_HAVE_KVM_MMU_PRESENT_HIGH)) {
+ pr_err("Not allowed to create exported tdp, please check config\n");
+ return false;
+ }
+
+ /* we need max phys addr of vcpus, so online vcpus must be > 0 */
+ if (!atomic_read(&kvm->online_vcpus)) {
+ pr_err("Exported tdp must be created after vCPUs created\n");
+ return false;
+ }
+
+ if (kvm->arch.maxphyaddr < 32) {
+ pr_err("Exported tdp must be created on 64-bit platform\n");
+ return false;
+ }
+ /*
+ * Do not allow noncoherent DMA if TDP is exported, because mapping of
+ * the exported TDP may not be at vCPU context, but noncoherent DMA
+ * requires vCPU mode and guest vCPU MTRRs to get the right memory type.
+ */
+ if (kvm_arch_has_noncoherent_dma(kvm)) {
+ pr_err("Not allowed to create exported tdp for noncoherent DMA\n");
+ return false;
+ }
+
+ return true;
+}
+
+static void init_kvm_exported_tdp_mmu(struct kvm *kvm, int as_id,
+ struct kvm_exported_tdp_mmu *mmu)
+{
+ WARN_ON(!kvm->arch.maxphyaddr);
+
+ union kvm_cpu_role cpu_role = { 0 };
+
+ cpu_role.base.smm = !!as_id;
+ cpu_role.base.guest_mode = 0;
+
+ mmu->common.root_role = kvm_calc_tdp_mmu_root_page_role(kvm->arch.maxphyaddr,
+ cpu_role);
+ reset_tdp_shadow_zero_bits_mask(&mmu->common);
+}
+
+static int mmu_topup_exported_tdp_caches(struct kvm *kvm)
+{
+ int r;
+
+ lockdep_assert_held(&kvm->arch.exported_tdp_cache_lock);
+
+ r = kvm_mmu_topup_memory_cache(&kvm->arch.exported_tdp_header_cache,
+ PT64_ROOT_MAX_LEVEL);
+ if (r)
+ return r;
+
+ return kvm_mmu_topup_memory_cache(&kvm->arch.exported_tdp_page_cache,
+ PT64_ROOT_MAX_LEVEL);
+}
+
+int kvm_mmu_get_exported_tdp(struct kvm *kvm, struct kvm_exported_tdp *tdp)
+{
+ struct kvm_exported_tdp_mmu *mmu = &tdp->arch.mmu;
+ struct kvm_mmu_page *root;
+ int ret;
+
+ if (!kvm_mmu_is_expoted_allowed(kvm, tdp->as_id))
+ return -EINVAL;
+
+ init_kvm_exported_tdp_mmu(kvm, tdp->as_id, mmu);
+
+ mutex_lock(&kvm->arch.exported_tdp_cache_lock);
+ ret = mmu_topup_exported_tdp_caches(kvm);
+ if (ret) {
+ mutex_unlock(&kvm->arch.exported_tdp_cache_lock);
+ return ret;
+ }
+ write_lock(&kvm->mmu_lock);
+ root = kvm_tdp_mmu_get_exported_root(kvm, mmu);
+ WARN_ON(root->exported);
+ root->exported = true;
+ mmu->common.root.hpa = __pa(root->spt);
+ mmu->root_page = root;
+ write_unlock(&kvm->mmu_lock);
+
+ mutex_unlock(&kvm->arch.exported_tdp_cache_lock);
+
+ return 0;
+}
+
+void kvm_mmu_put_exported_tdp(struct kvm_exported_tdp *tdp)
+{
+ struct kvm_exported_tdp_mmu *mmu = &tdp->arch.mmu;
+ struct kvm *kvm = tdp->kvm;
+
+ write_lock(&kvm->mmu_lock);
+ mmu->root_page->exported = false;
+ kvm_tdp_mmu_put_exported_root(kvm, mmu->root_page);
+ mmu->common.root.hpa = INVALID_PAGE;
+ mmu->root_page = NULL;
+ write_unlock(&kvm->mmu_lock);
+}
+#endif
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 1e9be0604e348..9294bb7e56c08 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -130,6 +130,10 @@ struct kvm_mmu_page {
/* Used for freeing the page asynchronously if it is a TDP MMU page. */
struct rcu_head rcu_head;
#endif
+
+#ifdef CONFIG_HAVE_KVM_EXPORTED_TDP
+ bool exported;
+#endif
};

extern struct kmem_cache *mmu_page_header_cache;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 5edff3b4698b7..47edf54961e89 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1824,3 +1824,50 @@ u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr,
*/
return rcu_dereference(sptep);
}
+
+#ifdef CONFIG_HAVE_KVM_EXPORTED_TDP
+static struct kvm_mmu_page *tdp_mmu_alloc_sp_exported_cache(struct kvm *kvm)
+{
+ struct kvm_mmu_page *sp;
+
+ sp = kvm_mmu_memory_cache_alloc(&kvm->arch.exported_tdp_header_cache);
+ sp->spt = kvm_mmu_memory_cache_alloc(&kvm->arch.exported_tdp_page_cache);
+
+ return sp;
+}
+
+struct kvm_mmu_page *kvm_tdp_mmu_get_exported_root(struct kvm *kvm,
+ struct kvm_exported_tdp_mmu *mmu)
+{
+ union kvm_mmu_page_role role = mmu->common.root_role;
+ struct kvm_mmu_page *root;
+
+ lockdep_assert_held_write(&kvm->mmu_lock);
+
+ for_each_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
+ if (root->role.word == role.word &&
+ kvm_tdp_mmu_get_root(root))
+ goto out;
+
+ }
+
+ root = tdp_mmu_alloc_sp_exported_cache(kvm);
+ tdp_mmu_init_sp(root, NULL, 0, role);
+
+ refcount_set(&root->tdp_mmu_root_count, 2);
+
+ spin_lock(&kvm->arch.tdp_mmu_pages_lock);
+ list_add_rcu(&root->link, &kvm->arch.tdp_mmu_roots);
+ spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+
+out:
+ return root;
+}
+
+void kvm_tdp_mmu_put_exported_root(struct kvm *kvm, struct kvm_mmu_page *root)
+{
+ tdp_mmu_zap_root(kvm, root, false);
+ kvm_tdp_mmu_put_root(kvm, root, false);
+}
+
+#endif
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 733a3aef3a96e..1d36ed378848b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -75,4 +75,10 @@ static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return sp->tdp_mmu
static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return false; }
#endif

+#ifdef CONFIG_HAVE_KVM_EXPORTED_TDP
+struct kvm_mmu_page *kvm_tdp_mmu_get_exported_root(struct kvm *kvm,
+ struct kvm_exported_tdp_mmu *mmu);
+void kvm_tdp_mmu_put_exported_root(struct kvm *kvm, struct kvm_mmu_page *root);
+#endif
+
#endif /* __KVM_X86_MMU_TDP_MMU_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9ac8682c70ae7..afc0e5372ddce 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13429,6 +13429,23 @@ bool kvm_arch_no_poll(struct kvm_vcpu *vcpu)
}
EXPORT_SYMBOL_GPL(kvm_arch_no_poll);

+#ifdef CONFIG_HAVE_KVM_EXPORTED_TDP
+int kvm_arch_exported_tdp_init(struct kvm *kvm, struct kvm_exported_tdp *tdp)
+{
+ int ret;
+
+ ret = kvm_mmu_get_exported_tdp(kvm, tdp);
+ if (ret)
+ return ret;
+
+ return 0;
+}
+
+void kvm_arch_exported_tdp_destroy(struct kvm_exported_tdp *tdp)
+{
+ kvm_mmu_put_exported_tdp(tdp);
+}
+#endif

int kvm_spec_ctrl_test_value(u64 value)
{
--
2.17.1

2023-12-02 10:02:40

by Yan Zhao

[permalink] [raw]
Subject: [RFC PATCH 36/42] KVM: x86/mmu: Keep exported TDP root valid

Keep the exported TDP root always valid and zap all its leaf entries
instead of performing the "invalidate root" operation on it.

Unlike TDP roots accessed by vCPUs only, replacing a TDP root exported to
external components would have to be done atomically, i.e.
1. allocate a new root,
2. update and notify the new root to external components,
3. invalidate the old root.

So, it's more efficient to just zap all leaf entries of the exported TDP.

Though zapping all leaf entries makes "fast zap" not as fast, with commit
0df9dab891ff ("KVM: x86/mmu: Stop zapping invalidated TDP MMU roots
asynchronously"), zapping the root is anyway required to be done
synchronously in kvm_mmu_zap_all_fast() before memslot removal completes.

Besides, it's also safe to skip invalidating the "exported" root in
kvm_tdp_mmu_invalidate_all_roots() for the kvm_mmu_uninit_tdp_mmu() path:
when the VM is shutting down, since the TDP FD holds a reference count on
the kvm, kvm_mmu_uninit_tdp_mmu() --> kvm_tdp_mmu_invalidate_all_roots()
cannot run until the TDP root has been unmarked as "exported" and put, and
all child entries are zapped before the root is put.
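
With this patch, the ordering in the memslot-removal path becomes roughly
the below (a sketch, not a literal diff):

	/* kvm_mmu_zap_all_fast(), with mmu_lock held for write */
	kvm_tdp_mmu_invalidate_all_roots(kvm);	/* now skips "exported" roots */
	kvm_zap_obsolete_pages(kvm);
	kvm_tdp_mmu_zap_exported_roots(kvm);	/* zap leaf SPTEs, root stays valid */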

Signed-off-by: Yan Zhao <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 3 +++
arch/x86/kvm/mmu/tdp_mmu.c | 40 +++++++++++++++++++++++++++++++++-----
arch/x86/kvm/mmu/tdp_mmu.h | 1 +
3 files changed, 39 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3e2475c678c27..37a903fff582a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6187,6 +6187,9 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)

kvm_zap_obsolete_pages(kvm);

+ if (tdp_mmu_enabled)
+ kvm_tdp_mmu_zap_exported_roots(kvm);
+
write_unlock(&kvm->mmu_lock);

/*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 47edf54961e89..36a309ad27d47 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -897,12 +897,38 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
read_unlock(&kvm->mmu_lock);
}

+void kvm_tdp_mmu_zap_exported_roots(struct kvm *kvm)
+{
+#ifdef CONFIG_HAVE_KVM_EXPORTED_TDP
+ struct kvm_mmu_page *root;
+ bool flush;
+
+ lockdep_assert_held_write(&kvm->mmu_lock);
+
+ rcu_read_lock();
+
+ list_for_each_entry_rcu(root, &kvm->arch.tdp_mmu_roots, link) {
+ if (!root->exported)
+ continue;
+
+ flush = tdp_mmu_zap_leafs(kvm, root, 0, -1ULL, false, false);
+ if (flush)
+ kvm_flush_remote_tlbs(kvm);
+ }
+
+ rcu_read_unlock();
+#endif
+}
+
/*
- * Mark each TDP MMU root as invalid to prevent vCPUs from reusing a root that
- * is about to be zapped, e.g. in response to a memslots update. The actual
- * zapping is done separately so that it happens with mmu_lock with read,
- * whereas invalidating roots must be done with mmu_lock held for write (unless
- * the VM is being destroyed).
+ * Mark each TDP MMU root (except exported root) as invalid to prevent vCPUs from
+ * reusing a root that is about to be zapped, e.g. in response to a memslots
+ * update.
+ * The actual zapping is done separately so that it happens with mmu_lock
+ * with read, whereas invalidating roots must be done with mmu_lock held for write
+ * (unless the VM is being destroyed).
+ * For exported root, zap is done in kvm_tdp_mmu_zap_exported_roots() before
+ * the memslot update completes with mmu_lock held for write.
*
* Note, kvm_tdp_mmu_zap_invalidated_roots() is gifted the TDP MMU's reference.
* See kvm_tdp_mmu_get_vcpu_root_hpa().
@@ -932,6 +958,10 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
* or get/put references to roots.
*/
list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
+#ifdef CONFIG_HAVE_KVM_EXPORTED_TDP
+ if (root->exported)
+ continue;
+#endif
/*
* Note, invalid roots can outlive a memslot update! Invalid
* roots must be *zapped* before the memslot update completes,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 1d36ed378848b..df42350022a3f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -25,6 +25,7 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
void kvm_tdp_mmu_zap_all(struct kvm *kvm);
void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);
+void kvm_tdp_mmu_zap_exported_roots(struct kvm *kvm);

int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);

--
2.17.1

2023-12-02 10:03:08

by Yan Zhao

[permalink] [raw]
Subject: [RFC PATCH 37/42] KVM: x86: Implement KVM exported TDP fault handler on x86

Implement the fault handler of KVM exported TDP on x86.
The fault handler fails if the GFN to be faulted is in an emulated MMIO
range or in a write-tracked range.

kvm_tdp_mmu_map_exported_root() is actually a duplicate of
kvm_tdp_mmu_map(), except that its shadow pages are allocated from the
exported-TDP-specific header/page caches in kvm arch rather than from each
vCPU's header/page caches.

The exported-TDP-specific header/page caches are used because the fault
handler of KVM exported TDP is not called in a vCPU thread.
Removing this duplication will be sought in a future version.
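
For reference, the call chain from an importer down to the TDP MMU looks
roughly like below (sketch):

	kvm_arch_fault_exported_tdp()			/* KVM TDP FD fault path */
	  -> kvm_mmu_fault_exported_tdp()
	       -> kvm_faultin_pfn(kvm, NULL, ...)	/* non-vCPU faultin */
	       -> kvm_tdp_mmu_map_exported_root()
	            -> tdp_mmu_map_handle_target_level(kvm, NULL, ...)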

Signed-off-by: Yan Zhao <[email protected]>
---
arch/x86/kvm/mmu.h | 1 +
arch/x86/kvm/mmu/mmu.c | 57 +++++++++++++++++++++++++++
arch/x86/kvm/mmu/tdp_mmu.c | 81 ++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/mmu/tdp_mmu.h | 2 +
arch/x86/kvm/x86.c | 22 +++++++++++
5 files changed, 163 insertions(+)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 3d11f2068572d..a6e6802fb4d56 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -254,6 +254,7 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm);
#ifdef CONFIG_HAVE_KVM_EXPORTED_TDP
int kvm_mmu_get_exported_tdp(struct kvm *kvm, struct kvm_exported_tdp *tdp);
void kvm_mmu_put_exported_tdp(struct kvm_exported_tdp *tdp);
+int kvm_mmu_fault_exported_tdp(struct kvm_exported_tdp *tdp, unsigned long gfn, u32 err);
#endif

static inline bool kvm_shadow_root_allocated(struct kvm *kvm)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 37a903fff582a..b4b1ede30642d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7324,4 +7324,61 @@ void kvm_mmu_put_exported_tdp(struct kvm_exported_tdp *tdp)
mmu->root_page = NULL;
write_unlock(&kvm->mmu_lock);
}
+
+int kvm_mmu_fault_exported_tdp(struct kvm_exported_tdp *tdp, unsigned long gfn, u32 err)
+{
+ struct kvm *kvm = tdp->kvm;
+ struct kvm_page_fault fault = {
+ .addr = gfn << PAGE_SHIFT,
+ .error_code = err,
+ .prefetch = false,
+ .exec = err & PFERR_FETCH_MASK,
+ .write = err & PFERR_WRITE_MASK,
+ .present = err & PFERR_PRESENT_MASK,
+ .rsvd = err & PFERR_RSVD_MASK,
+ .user = err & PFERR_USER_MASK,
+ .is_tdp = true,
+ .nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(kvm),
+ .max_level = KVM_MAX_HUGEPAGE_LEVEL,
+ .req_level = PG_LEVEL_4K,
+ .goal_level = PG_LEVEL_4K,
+ .gfn = gfn,
+ .slot = gfn_to_memslot(kvm, gfn),
+ };
+ struct kvm_exported_tdp_mmu *mmu = &tdp->arch.mmu;
+ int r;
+
+ if (page_fault_handle_page_track(kvm, &fault))
+ return -EINVAL;
+retry:
+ r = kvm_faultin_pfn(kvm, NULL, &fault, ACC_ALL);
+ if (r != RET_PF_CONTINUE)
+ goto out;
+
+ mutex_lock(&kvm->arch.exported_tdp_cache_lock);
+ r = mmu_topup_exported_tdp_caches(kvm);
+ if (r)
+ goto out_cache;
+
+ r = RET_PF_RETRY;
+ read_lock(&kvm->mmu_lock);
+ if (fault.slot && mmu_invalidate_retry_hva(kvm, fault.mmu_seq, fault.hva))
+ goto out_mmu;
+
+ if (mmu->root_page && is_obsolete_sp(kvm, mmu->root_page))
+ goto out_mmu;
+
+ r = kvm_tdp_mmu_map_exported_root(kvm, mmu, &fault);
+
+out_mmu:
+ read_unlock(&kvm->mmu_lock);
+out_cache:
+ mutex_unlock(&kvm->arch.exported_tdp_cache_lock);
+ kvm_release_pfn_clean(fault.pfn);
+out:
+ if (r == RET_PF_RETRY)
+ goto retry;
+
+ return r == RET_PF_FIXED ? 0 : -EFAULT;
+}
#endif
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 36a309ad27d47..e7587aefc3304 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1900,4 +1900,85 @@ void kvm_tdp_mmu_put_exported_root(struct kvm *kvm, struct kvm_mmu_page *root)
kvm_tdp_mmu_put_root(kvm, root, false);
}

+int kvm_tdp_mmu_map_exported_root(struct kvm *kvm, struct kvm_exported_tdp_mmu *mmu,
+ struct kvm_page_fault *fault)
+{
+ struct tdp_iter iter;
+ struct kvm_mmu_page *sp;
+ int ret = RET_PF_RETRY;
+
+ kvm_mmu_hugepage_adjust(kvm, fault);
+
+ trace_kvm_mmu_spte_requested(fault);
+
+ rcu_read_lock();
+
+ tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
+ int r;
+
+ if (fault->nx_huge_page_workaround_enabled)
+ disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
+
+ /*
+ * If SPTE has been frozen by another thread, just give up and
+ * retry, avoiding unnecessary page table allocation and free.
+ */
+ if (is_removed_spte(iter.old_spte))
+ goto retry;
+
+ if (iter.level == fault->goal_level)
+ goto map_target_level;
+
+ /* Step down into the lower level page table if it exists. */
+ if (is_shadow_present_pte(iter.old_spte) &&
+ !is_large_pte(iter.old_spte))
+ continue;
+
+ /*
+ * The SPTE is either non-present or points to a huge page that
+ * needs to be split.
+ */
+ sp = tdp_mmu_alloc_sp_exported_cache(kvm);
+ tdp_mmu_init_child_sp(sp, &iter);
+
+ sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
+
+ if (is_shadow_present_pte(iter.old_spte))
+ r = tdp_mmu_split_huge_page(kvm, &iter, sp, true);
+ else
+ r = tdp_mmu_link_sp(kvm, &iter, sp, true);
+
+ /*
+ * Force the guest to retry if installing an upper level SPTE
+ * failed, e.g. because a different task modified the SPTE.
+ */
+ if (r) {
+ tdp_mmu_free_sp(sp);
+ goto retry;
+ }
+
+ if (fault->huge_page_disallowed &&
+ fault->req_level >= iter.level) {
+ spin_lock(&kvm->arch.tdp_mmu_pages_lock);
+ if (sp->nx_huge_page_disallowed)
+ track_possible_nx_huge_page(kvm, sp);
+ spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+ }
+ }
+
+ /*
+ * The walk aborted before reaching the target level, e.g. because the
+ * iterator detected an upper level SPTE was frozen during traversal.
+ */
+ WARN_ON_ONCE(iter.level == fault->goal_level);
+ goto retry;
+
+map_target_level:
+ ret = tdp_mmu_map_handle_target_level(kvm, NULL, &mmu->common, fault, &iter);
+
+retry:
+ rcu_read_unlock();
+ return ret;
+}
+
#endif
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index df42350022a3f..a3ea418aaffed 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -80,6 +80,8 @@ static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return false; }
struct kvm_mmu_page *kvm_tdp_mmu_get_exported_root(struct kvm *kvm,
struct kvm_exported_tdp_mmu *mmu);
void kvm_tdp_mmu_put_exported_root(struct kvm *kvm, struct kvm_mmu_page *root);
+int kvm_tdp_mmu_map_exported_root(struct kvm *kvm, struct kvm_exported_tdp_mmu *mmu,
+ struct kvm_page_fault *fault);
#endif

#endif /* __KVM_X86_MMU_TDP_MMU_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index afc0e5372ddce..2886eac0590d8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13445,6 +13445,28 @@ void kvm_arch_exported_tdp_destroy(struct kvm_exported_tdp *tdp)
{
kvm_mmu_put_exported_tdp(tdp);
}
+
+int kvm_arch_fault_exported_tdp(struct kvm_exported_tdp *tdp, unsigned long gfn,
+ struct kvm_tdp_fault_type type)
+{
+ u32 err = 0;
+ int ret;
+
+ if (type.read)
+ err |= PFERR_PRESENT_MASK | PFERR_USER_MASK;
+
+ if (type.write)
+ err |= PFERR_WRITE_MASK;
+
+ if (type.exec)
+ err |= PFERR_FETCH_MASK;
+
+ mutex_lock(&tdp->kvm->slots_lock);
+ ret = kvm_mmu_fault_exported_tdp(tdp, gfn, err);
+ mutex_unlock(&tdp->kvm->slots_lock);
+ return ret;
+}
+
#endif

int kvm_spec_ctrl_test_value(u64 value)
--
2.17.1

2023-12-02 10:04:21

by Yan Zhao

[permalink] [raw]
Subject: [RFC PATCH 38/42] KVM: x86: "compose" and "get" interface for meta data of exported TDP

Add two fields, .exported_tdp_meta_size and .exported_tdp_meta_compose, to
kvm_x86_ops to allow vendor-specific code to compose the meta data of an
exported TDP, and provide an arch interface for external components to get
the composed meta data.

As the meta data is consumed by the IOMMU vendor driver to check whether
the exported TDP is compatible with the IOMMU hardware before reusing it as
the IOMMU's stage 2 page table, it's better to compose it in KVM's
vendor-specific code too.
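
On the importer side, consumption is expected to look roughly like below
(a sketch only; "get_metadata" stands for the KVM TDP FD op wired up in
virt/kvm/tdp_fd.c, and the VMX meta layout comes from a later patch):

	const struct kvm_exported_tdp_meta_vmx *meta = get_metadata(tdp_fd);

	if (meta->type != KVM_TDP_TYPE_EPT)
		return -EOPNOTSUPP;	/* page table format not usable by this IOMMU */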

Signed-off-by: Yan Zhao <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 3 +++
arch/x86/include/asm/kvm_host.h | 7 +++++++
arch/x86/kvm/x86.c | 23 ++++++++++++++++++++++-
include/linux/kvm_host.h | 6 ++++++
virt/kvm/tdp_fd.c | 2 +-
5 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index d751407b1056c..baf3efaa148c2 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -136,6 +136,9 @@ KVM_X86_OP(msr_filter_changed)
KVM_X86_OP(complete_emulated_msr)
KVM_X86_OP(vcpu_deliver_sipi_vector)
KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
+#if IS_ENABLED(CONFIG_HAVE_KVM_EXPORTED_TDP)
+KVM_X86_OP_OPTIONAL(exported_tdp_meta_compose);
+#endif

#undef KVM_X86_OP
#undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 860502720e3e7..412a1b2088f09 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -26,6 +26,7 @@
#include <linux/irqbypass.h>
#include <linux/hyperv.h>
#include <linux/kfifo.h>
+#include <linux/kvm_tdp_fd.h>

#include <asm/apic.h>
#include <asm/pvclock-abi.h>
@@ -1493,6 +1494,7 @@ struct kvm_exported_tdp_mmu {
};
struct kvm_arch_exported_tdp {
struct kvm_exported_tdp_mmu mmu;
+ void *meta;
};
#endif

@@ -1784,6 +1786,11 @@ struct kvm_x86_ops {
* Returns vCPU specific APICv inhibit reasons
*/
unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
+
+#ifdef CONFIG_HAVE_KVM_EXPORTED_TDP
+ unsigned long exported_tdp_meta_size;
+ void (*exported_tdp_meta_compose)(struct kvm_exported_tdp *tdp);
+#endif
};

struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2886eac0590d8..468bcde414691 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13432,18 +13432,39 @@ EXPORT_SYMBOL_GPL(kvm_arch_no_poll);
#ifdef CONFIG_HAVE_KVM_EXPORTED_TDP
int kvm_arch_exported_tdp_init(struct kvm *kvm, struct kvm_exported_tdp *tdp)
{
+ void *meta;
int ret;

+ if (!kvm_x86_ops.exported_tdp_meta_size ||
+ !kvm_x86_ops.exported_tdp_meta_compose)
+ return -EOPNOTSUPP;
+
+ meta = __vmalloc(kvm_x86_ops.exported_tdp_meta_size,
+ GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+ if (!meta)
+ return -ENOMEM;
+
+ tdp->arch.meta = meta;
+
ret = kvm_mmu_get_exported_tdp(kvm, tdp);
- if (ret)
+ if (ret) {
+ kvfree(meta);
return ret;
+ }

+ static_call(kvm_x86_exported_tdp_meta_compose)(tdp);
return 0;
}

void kvm_arch_exported_tdp_destroy(struct kvm_exported_tdp *tdp)
{
kvm_mmu_put_exported_tdp(tdp);
+ kvfree(tdp->arch.meta);
+}
+
+void *kvm_arch_exported_tdp_get_metadata(struct kvm_exported_tdp *tdp)
+{
+ return tdp->arch.meta;
}

int kvm_arch_fault_exported_tdp(struct kvm_exported_tdp *tdp, unsigned long gfn,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a8af95194767f..48324c846d90b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2348,6 +2348,7 @@ int kvm_arch_exported_tdp_init(struct kvm *kvm, struct kvm_exported_tdp *tdp);
void kvm_arch_exported_tdp_destroy(struct kvm_exported_tdp *tdp);
int kvm_arch_fault_exported_tdp(struct kvm_exported_tdp *tdp, unsigned long gfn,
struct kvm_tdp_fault_type type);
+void *kvm_arch_exported_tdp_get_metadata(struct kvm_exported_tdp *tdp);
#else
static inline int kvm_arch_exported_tdp_init(struct kvm *kvm,
struct kvm_exported_tdp *tdp)
@@ -2364,6 +2365,11 @@ static inline int kvm_arch_fault_exported_tdp(struct kvm_exported_tdp *tdp,
{
return -EOPNOTSUPP;
}
+
+static inline void *kvm_arch_exported_tdp_get_metadata(struct kvm_exported_tdp *tdp)
+{
+ return NULL;
+}
#endif /* __KVM_HAVE_ARCH_EXPORTED_TDP */

void kvm_tdp_fd_flush_notify(struct kvm *kvm, unsigned long gfn, unsigned long npages);
diff --git a/virt/kvm/tdp_fd.c b/virt/kvm/tdp_fd.c
index 8c16af685a061..e4a2453a5547f 100644
--- a/virt/kvm/tdp_fd.c
+++ b/virt/kvm/tdp_fd.c
@@ -217,7 +217,7 @@ static void kvm_tdp_unregister_all_importers(struct kvm_exported_tdp *tdp)

static void *kvm_tdp_get_metadata(struct kvm_tdp_fd *tdp_fd)
{
- return ERR_PTR(-EOPNOTSUPP);
+ return kvm_arch_exported_tdp_get_metadata(tdp_fd->priv);
}

static int kvm_tdp_fault(struct kvm_tdp_fd *tdp_fd, struct mm_struct *mm,
--
2.17.1

2023-12-02 10:04:44

by Yan Zhao

[permalink] [raw]
Subject: [RFC PATCH 39/42] KVM: VMX: add config KVM_INTEL_EXPORTED_EPT

Add config KVM_INTEL_EXPORTED_EPT to let kvm_intel.ko support exporting EPT
to components external to KVM (e.g. Intel VT-d).

This config turns on HAVE_KVM_EXPORTED_TDP and HAVE_KVM_MMU_PRESENT_HIGH
automatically.

HAVE_KVM_MMU_PRESENT_HIGH makes SPTE bit 11 reserved as 0.

Signed-off-by: Yan Zhao <[email protected]>
---
arch/x86/kvm/Kconfig | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 950c12868d304..7126344077ab5 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -99,6 +99,19 @@ config X86_SGX_KVM

If unsure, say N.

+config KVM_INTEL_EXPORTED_EPT
+ bool "export EPT to be used by other modules (e.g. iommufd)"
+ depends on KVM_INTEL
+ select HAVE_KVM_EXPORTED_TDP
+ select HAVE_KVM_MMU_PRESENT_HIGH if X86_64
+ help
+ Intel EPT is architecturally guaranteed to be compatible with stage 2
+ page tables in the Intel IOMMU.
+
+ Enable this feature to allow Intel EPT to be exported and used
+ directly as stage 2 page tables in Intel IOMMU.
+
+
config KVM_AMD
tristate "KVM for AMD processors support"
depends on KVM && (CPU_SUP_AMD || CPU_SUP_HYGON)
--
2.17.1

2023-12-02 10:05:45

by Yan Zhao

[permalink] [raw]
Subject: [RFC PATCH 40/42] KVM: VMX: Compose VMX specific meta data for KVM exported TDP

Compose the VMX-specific meta data of KVM exported TDP. The format of the
meta data is defined in "asm/kvm_exported_tdp.h".

The Intel VT-d driver can include "asm/kvm_exported_tdp.h" to decode this
meta data in order to check the page table format, level and reserved zero
bits before loading the KVM page table via its root HPA.

Signed-off-by: Yan Zhao <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f290dd3094da6..7965bc32f87de 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -48,6 +48,7 @@
#include <asm/mwait.h>
#include <asm/spec-ctrl.h>
#include <asm/vmx.h>
+#include <asm/kvm_exported_tdp.h>

#include "capabilities.h"
#include "cpuid.h"
@@ -8216,6 +8217,22 @@ static void vmx_vm_destroy(struct kvm *kvm)
free_pages((unsigned long)kvm_vmx->pid_table, vmx_get_pid_table_order(kvm));
}

+#ifdef CONFIG_KVM_INTEL_EXPORTED_EPT
+void kvm_exported_tdp_compose_meta(struct kvm_exported_tdp *tdp)
+{
+ struct kvm_exported_tdp_meta_vmx *meta = tdp->arch.meta;
+ struct kvm_mmu_common *context = &tdp->arch.mmu.common;
+ void *rsvd_bits_mask = context->shadow_zero_check.rsvd_bits_mask;
+
+ meta->root_hpa = context->root.hpa;
+ meta->level = context->root_role.level;
+ meta->max_huge_page_level = min(ept_caps_to_lpage_level(vmx_capability.ept),
+ KVM_MAX_HUGEPAGE_LEVEL);
+ memcpy(meta->rsvd_bits_mask, rsvd_bits_mask, sizeof(meta->rsvd_bits_mask));
+ meta->type = KVM_TDP_TYPE_EPT;
+}
+#endif
+
static struct kvm_x86_ops vmx_x86_ops __initdata = {
.name = KBUILD_MODNAME,

@@ -8357,6 +8374,11 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
.complete_emulated_msr = kvm_complete_insn_gp,

.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+
+#ifdef CONFIG_KVM_INTEL_EXPORTED_EPT
+ .exported_tdp_meta_size = sizeof(struct kvm_exported_tdp_meta_vmx),
+ .exported_tdp_meta_compose = kvm_exported_tdp_compose_meta,
+#endif
};

static unsigned int vmx_handle_intel_pt_intr(void)
--
2.17.1

2023-12-02 10:05:59

by Yan Zhao

[permalink] [raw]
Subject: [RFC PATCH 41/42] KVM: VMX: Implement ops .flush_remote_tlbs* in VMX when EPT is on

Add VMX implementations of the flush_remote_tlbs* ops in kvm_x86_ops when
enable_ept is on and CONFIG_HYPERV is off.

Without the flush_remote_tlbs* ops in VMX, kvm_flush_remote_tlbs*() just
makes all CPUs request KVM_REQ_TLB_FLUSH after finding the two ops absent.
So, by also making all CPUs request KVM_REQ_TLB_FLUSH in the VMX
flush_remote_tlbs* ops, no functional change is introduced.

The two ops allow vendor code (e.g. VMX) to control when to notify the
IOMMU to flush TLBs. This is useful for conditions where the sequence of
flushing CPU TLBs and IOTLBs matters.

Signed-off-by: Yan Zhao <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 7965bc32f87de..2fec351a3fa5b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7544,6 +7544,17 @@ static int vmx_vcpu_create(struct kvm_vcpu *vcpu)
return err;
}

+static int vmx_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, gfn_t nr_pages)
+{
+ kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH);
+ return 0;
+}
+
+static int vmx_flush_remote_tlbs(struct kvm *kvm)
+{
+ return vmx_flush_remote_tlbs_range(kvm, 0, -1ULL);
+}
+
#define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
#define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"

@@ -8528,6 +8539,11 @@ static __init int hardware_setup(void)
vmx_x86_ops.flush_remote_tlbs = hv_flush_remote_tlbs;
vmx_x86_ops.flush_remote_tlbs_range = hv_flush_remote_tlbs_range;
}
+#else
+ if (enable_ept) {
+ vmx_x86_ops.flush_remote_tlbs = vmx_flush_remote_tlbs;
+ vmx_x86_ops.flush_remote_tlbs_range = vmx_flush_remote_tlbs_range;
+ }
#endif

if (!cpu_has_vmx_ple()) {
--
2.17.1

2023-12-02 10:06:44

by Yan Zhao

[permalink] [raw]
Subject: [RFC PATCH 42/42] KVM: VMX: Notify importers of exported TDP to flush TLBs on KVM flushes EPT

Call TDP FD helper to notify importers of exported TDP to flush TLBs when
KVM flushes EPT.

Signed-off-by: Yan Zhao <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 2fec351a3fa5b..3a2b6ddcde108 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7547,6 +7547,9 @@ static int vmx_vcpu_create(struct kvm_vcpu *vcpu)
static int vmx_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, gfn_t nr_pages)
{
kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH);
+#if IS_ENABLED(CONFIG_KVM_INTEL_EXPORTED_EPT)
+ kvm_tdp_fd_flush_notify(kvm, gfn, nr_pages);
+#endif
return 0;
}

--
2.17.1

2023-12-04 15:08:27

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC PATCH 00/42] Sharing KVM TDP to IOMMU

On Sat, Dec 02, 2023 at 05:12:11PM +0800, Yan Zhao wrote:
> In this series, term "exported" is used in place of "shared" to avoid
> confusion with terminology "shared EPT" in TDX.
>
> The framework contains 3 main objects:
>
> "KVM TDP FD" object - The interface of KVM to export TDP page tables.
> With this object, KVM allows external components to
> access a TDP page table exported by KVM.

I don't know much about the internals of kvm, but why have this extra
user visible piece? Isn't there only one "TDP" per kvm fd? Why not
just use the KVM FD as a handle for the TDP?

> "IOMMUFD KVM HWPT" object - A proxy connecting KVM TDP FD to IOMMU driver.
> This HWPT has no IOAS associated.
>
> "KVM domain" in IOMMU driver - Stage 2 domain in IOMMU driver whose paging
> structures are managed by KVM.
> Its hardware TLB invalidation requests are
> notified from KVM via IOMMUFD KVM HWPT
> object.

This seems broadly the right direction

> - About device which partially supports IOPF
>
> Many devices claiming PCIe PRS capability actually only tolerate IOPF in
> certain paths (e.g. DMA paths for SVM applications, but not for non-SVM
> applications or driver data such as ring descriptors). But the PRS
> capability doesn't include a bit to tell whether a device 100% tolerates
> IOPF in all DMA paths.

The lack of tolerance for truely DMA pinned guest memory is a
significant problem for any real deployment, IMHO. I am aware of no
device that can handle PRI on every single DMA path. :(

> A simple way is to track an allowed list of devices which are known 100%
> IOPF-friendly in VFIO. Another option is to extend PCIe spec to allow
> device reporting whether it fully or partially supports IOPF in the PRS
> capability.

I think we need something like this.

> - How to map MSI page on arm platform demands discussions.

Yes, the recurring problem :(

Probably the same approach as nesting would work for a hack - map the
ITS page into the fixed reserved slot and tell the guest not to touch
it and to identity map it.

Jason

2023-12-04 15:09:16

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC PATCH 10/42] iommu: Add new iommu op to create domains managed by KVM

On Sat, Dec 02, 2023 at 05:20:07PM +0800, Yan Zhao wrote:
> @@ -522,6 +522,13 @@ __iommu_copy_struct_from_user_array(void *dst_data,
> * @domain_alloc_paging: Allocate an iommu_domain that can be used for
> * UNMANAGED, DMA, and DMA_FQ domain types.
> * @domain_alloc_sva: Allocate an iommu_domain for Shared Virtual Addressing.
> + * @domain_alloc_kvm: Allocate an iommu domain with type IOMMU_DOMAIN_KVM.
> + * It's called by IOMMUFD and must fully initialize the new
> + * domain before return.
> + * The @data is of type "const void *" whose format is defined
> + * in kvm arch specific header "asm/kvm_exported_tdp.h".
> + * Unpon success, domain of type IOMMU_DOMAIN_KVM is returned.
> + * Upon failure, ERR_PTR is returned.
> * @probe_device: Add device to iommu driver handling
> * @release_device: Remove device from iommu driver handling
> * @probe_finalize: Do final setup work after the device is added to an IOMMU
> @@ -564,6 +571,8 @@ struct iommu_ops {
> struct iommu_domain *(*domain_alloc_paging)(struct device *dev);
> struct iommu_domain *(*domain_alloc_sva)(struct device *dev,
> struct mm_struct *mm);
> + struct iommu_domain *(*domain_alloc_kvm)(struct device *dev, u32 flags,
> + const void *data);

This should pass in some kvm related struct here, it should not be
buried in data

Jason

2023-12-04 15:10:01

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC PATCH 11/42] iommu: Add new domain op cache_invalidate_kvm

On Sat, Dec 02, 2023 at 05:20:41PM +0800, Yan Zhao wrote:
> On KVM invalidates mappings that are shared to IOMMU stage 2 paging
> structures, IOMMU driver needs to invalidate hardware TLBs accordingly.
>
> The new op cache_invalidate_kvm is called from IOMMUFD to invalidate
> hardware TLBs upon receiving invalidation notifications from KVM.

Why?

SVA hooks the invalidation directly to the mm, shouldn't KVM also hook
the invalidation directly from the kvm? Why do we need to call a chain
of function pointers? iommufd isn't adding any value in the chain
here.

Jason

2023-12-04 16:38:31

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 00/42] Sharing KVM TDP to IOMMU

On Mon, Dec 04, 2023, Jason Gunthorpe wrote:
> On Sat, Dec 02, 2023 at 05:12:11PM +0800, Yan Zhao wrote:
> > In this series, term "exported" is used in place of "shared" to avoid
> > confusion with terminology "shared EPT" in TDX.
> >
> > The framework contains 3 main objects:
> >
> > "KVM TDP FD" object - The interface of KVM to export TDP page tables.
> > With this object, KVM allows external components to
> > access a TDP page table exported by KVM.
>
> I don't know much about the internals of kvm, but why have this extra
> user visible piece?

That I don't know, I haven't looked at the gory details of this RFC.

> Isn't there only one "TDP" per kvm fd?

No. In steady state, with TDP (EPT) enabled and assuming homogeneous capabilities
across all vCPUs, KVM will have 3+ sets of TDP page tables *active* at any given time:

1. "Normal"
2. SMM
3-N. Guest (for L2, i.e. nested, VMs)

The number of possible TDP page tables used for nested VMs is well bounded, but
since devices obviously can't be nested VMs, I won't bother trying to explain the
various possibilities (nested NPT on AMD is downright ridiculous).

Nested virtualization aside, devices are obviously not capable of running in SMM
and so they all need to use the "normal" page tables.

I highlighted "active" above because if _any_ memslot is deleted, KVM will invalidate
*all* existing page tables and rebuild new page tables as needed. So over the
lifetime of a VM, KVM could theoretically use an infinite number of page tables.

2023-12-04 17:01:23

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 00/42] Sharing KVM TDP to IOMMU

On Sat, Dec 02, 2023, Yan Zhao wrote:
> This RFC series proposes a framework to resolve IOPF by sharing KVM TDP
> (Two Dimensional Paging) page table to IOMMU as its stage 2 paging
> structure to support IOPF (IO page fault) on IOMMU's stage 2 paging
> structure.
>
> Previously, all guest pages have to be pinned and mapped in IOMMU stage 2
> paging structures after pass-through devices attached, even if the device
> has IOPF capability. Such all-guest-memory pinning can be avoided when IOPF
> handling for stage 2 paging structure is supported and if there are only
> IOPF-capable devices attached to a VM.
>
> There are 2 approaches to support IOPF on IOMMU stage 2 paging structures:
> - Supporting by IOMMUFD/IOMMU alone
> IOMMUFD handles IO page faults on stage-2 HWPT by calling GUPs and then
> iommu_map() to setup IOVA mappings. (IOAS is required to keep info of GPA
> to HVA, but page pinning/unpinning needs to be skipped.)
> Then upon MMU notifiers on host primary MMU, iommu_unmap() is called to
> adjust IOVA mappings accordingly.
> IOMMU driver needs to support unmapping sub-ranges of a previous mapped
> range and take care of huge page merge and split in atomic way. [1][2].
>
> - Sharing KVM TDP
> IOMMUFD sets the root of KVM TDP page table (EPT/NPT in x86) as the root
> of IOMMU stage 2 paging structure, and routes IO page faults to KVM.
> (This assumes that the iommu hw supports the same stage-2 page table
> format as CPU.)
> In this model the page table is centrally managed by KVM (mmu notifier,
> page mapping, subpage unmapping, atomic huge page split/merge, etc.),
> while IOMMUFD only needs to invalidate iotlb/devtlb properly.

There are more approaches beyond having IOMMUFD and KVM be completely separate
entities. E.g. extract the bulk of KVM's "TDP MMU" implementation to common code
so that IOMMUFD doesn't need to reinvent the wheel.

> Currently, there's no upstream code available to support stage 2 IOPF yet.
>
> This RFC chooses to implement "Sharing KVM TDP" approach which has below
> main benefits:

Please list out the pros and cons for each. In the cons column for piggybacking
KVM's page tables:

- *Significantly* increases the complexity in KVM
- Puts constraints on what KVM can/can't do in the future (see the movement
of SPTE_MMU_PRESENT).
- Subjects IOMMUFD to all of KVM's historical baggage, e.g. the memslot deletion
mess, the truly nasty MTRR emulation (which I still hope to delete), the NX
hugepage mitigation, etc.

Please also explain the intended/expected/targeted use cases. E.g. if the main
use case is for device passthrough to slice-of-hardware VMs that aren't memory
oversubscribed,

> - Unified page table management
> The complexity of allocating guest pages per GPAs, registering to MMU
> notifier on host primary MMU, sub-page unmapping, atomic page merge/split

Please find different terminology than "sub-page". With Sub-Page Protection, Intel
has more or less established "sub-page" to mean "less than 4KiB granularity". But
that can't possibly what you mean here because KVM doesn't support (un)mapping
memory at <4KiB granularity. Based on context above, I assume you mean "unmapping
arbitrary pages within a given range".

> are only required to by handled in KVM side, which has been doing that
> well for a long time.
>
> - Reduced page faults:
> Only one page fault is triggered on a single GPA, either caused by IO
> access or by vCPU access. (compared to one IO page fault for DMA and one
> CPU page fault for vCPUs in the non-shared approach.)

This would be relatively easy to solve with bi-directional notifiers, i.e. KVM
notifies IOMMUFD when a vCPU faults in a page, and vice versa.

> - Reduced memory consumption:
> Memory of one page table are saved.

I'm not convinced that memory consumption is all that interesting. If a VM is
mapping the majority of memory into a device, then odds are good that the guest
is backed with at least 2MiB page, if not 1GiB pages, at which point the memory
overhead for pages tables is quite small, especially relative to the total amount
of memory overheads for such systems.

If a VM is mapping only a small subset of its memory into devices, then the IOMMU
page tables should be sparsely populated, i.e. won't consume much memory.

2023-12-04 17:31:21

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC PATCH 00/42] Sharing KVM TDP to IOMMU

On Mon, Dec 04, 2023 at 09:00:55AM -0800, Sean Christopherson wrote:

> There are more approaches beyond having IOMMUFD and KVM be
> completely separate entities. E.g. extract the bulk of KVM's "TDP
> MMU" implementation to common code so that IOMMUFD doesn't need to
> reinvent the wheel.

We've pretty much done this already, it is called "hmm" and it is what
the IO world uses. Merging/splitting huge page is just something that
needs some coding in the page table code, that people want for other
reasons anyhow.

> - Subjects IOMMUFD to all of KVM's historical baggage, e.g. the memslot deletion
> mess, the truly nasty MTRR emulation (which I still hope to delete), the NX
> hugepage mitigation, etc.

Does it? I think that just remains isolated in kvm. The output from
KVM is only a radix table top pointer, it is up to KVM how to manage
it still.

> I'm not convinced that memory consumption is all that interesting. If a VM is
> mapping the majority of memory into a device, then odds are good that the guest
> is backed with at least 2MiB page, if not 1GiB pages, at which point the memory
> overhead for pages tables is quite small, especially relative to the total amount
> of memory overheads for such systems.

AFAIK the main argument is performance. It is similar to why we want
to do IOMMU SVA with MM page table sharing.

If IOMMU mirrors/shadows/copies a page table using something like HMM
techniques then the invalidations will mark ranges of IOVA as
non-present and faults will occur to trigger hmm_range_fault to do the
shadowing.

This means that pretty much all IO will always encounter a non-present
fault, certainly at the start and maybe worse while ongoing.

On the other hand, if we share the exact page table then natural CPU
touches will usually make the page present before an IO happens in
almost all cases and we don't have to take the horribly expensive IO
page fault at all.

We were not able to make bi-dir notifiers with the CPU mm, I'm
not sure that is "relatively easy" :(

Jason

2023-12-04 18:29:44

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC PATCH 12/42] iommufd: Introduce allocation data info and flag for KVM managed HWPT

On Sat, Dec 02, 2023 at 05:21:13PM +0800, Yan Zhao wrote:

> @@ -413,11 +422,13 @@ struct iommu_hwpt_arm_smmuv3 {
> * @IOMMU_HWPT_DATA_NONE: no data
> * @IOMMU_HWPT_DATA_VTD_S1: Intel VT-d stage-1 page table
> * @IOMMU_HWPT_DATA_ARM_SMMUV3: ARM SMMUv3 Context Descriptor Table
> + * @IOMMU_HWPT_DATA_KVM: KVM managed stage-2 page table
> */
> enum iommu_hwpt_data_type {
> IOMMU_HWPT_DATA_NONE,
> IOMMU_HWPT_DATA_VTD_S1,
> IOMMU_HWPT_DATA_ARM_SMMUV3,
> + IOMMU_HWPT_DATA_KVM,
> };

Definately no, the HWPT_DATA is for the *driver* - it should not be
"kvm".

Add the kvm fd to the main structure

Jason

2023-12-04 18:34:24

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC PATCH 14/42] iommufd: Enable KVM HW page table object to be proxy between KVM and IOMMU

On Sat, Dec 02, 2023 at 05:22:16PM +0800, Yan Zhao wrote:
> +config IOMMUFD_KVM_HWPT
> + bool "Supports KVM managed HW page tables"
> + default n
> + help
> + Selecting this option will allow IOMMUFD to create IOMMU stage 2
> + page tables whose paging structure and mappings are managed by
> + KVM MMU. IOMMUFD serves as proxy between KVM and IOMMU driver to
> + allow IOMMU driver to get paging structure meta data and cache
> + invalidate notifications from KVM.

I'm not sure we need a user selectable kconfig for this..

Just turn it on if we have kvm turned on an iommu driver implements it

Jason

2023-12-04 18:36:39

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC PATCH 16/42] iommufd: Enable device feature IOPF during device attachment to KVM HWPT

On Sat, Dec 02, 2023 at 05:23:11PM +0800, Yan Zhao wrote:
> Enable device feature IOPF during device attachment to KVM HWPT and abort
> the attachment if feature enabling is failed.
>
> "pin" is not done by KVM HWPT. If VMM wants to create KVM HWPT, it must
> know that all devices attached to this HWPT support IOPF so that pin-all
> is skipped.
>
> Signed-off-by: Yan Zhao <[email protected]>
> ---
> drivers/iommu/iommufd/device.c | 18 ++++++++++++++++++
> 1 file changed, 18 insertions(+)
>
> diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
> index 83af6b7e2784b..4ea447e052ce1 100644
> --- a/drivers/iommu/iommufd/device.c
> +++ b/drivers/iommu/iommufd/device.c
> @@ -381,10 +381,26 @@ int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
> goto err_unresv;
> idev->igroup->hwpt = hwpt;
> }
> + if (hwpt_is_kvm(hwpt)) {
> + /*
> + * Feature IOPF requires ats is enabled which is true only
> + * after device is attached to iommu domain.
> + * So enable dev feature IOPF after iommu_attach_group().
> + * -EBUSY will be returned if feature IOPF is already on.
> + */
> + rc = iommu_dev_enable_feature(idev->dev, IOMMU_DEV_FEAT_IOPF);
> + if (rc && rc != -EBUSY)
> + goto err_detach;

I would like to remove IOMMU_DEV_FEAT_IOPF completely please

Jason

2023-12-04 19:22:59

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 00/42] Sharing KVM TDP to IOMMU

On Mon, Dec 04, 2023, Jason Gunthorpe wrote:
> On Mon, Dec 04, 2023 at 09:00:55AM -0800, Sean Christopherson wrote:
>
> > There are more approaches beyond having IOMMUFD and KVM be
> > completely separate entities. E.g. extract the bulk of KVM's "TDP
> > MMU" implementation to common code so that IOMMUFD doesn't need to
> > reinvent the wheel.
>
> We've pretty much done this already, it is called "hmm" and it is what
> the IO world uses. Merging/splitting huge page is just something that
> needs some coding in the page table code, that people want for other
> reasons anyhow.

Not really. HMM is a wildly different implementation than KVM's TDP MMU. At a
glance, HMM is basically a variation on the primary MMU, e.g. deals with VMAs,
runs under mmap_lock (or per-VMA locks?), and faults memory into the primary MMU
while walking the "secondary" HMM page tables.

KVM's TDP MMU (and all of KVM's flavors of MMUs) is much more of a pure secondary
MMU. The core of a KVM MMU maps GFNs to PFNs, the intermediate steps that involve
the primary MMU are largely orthogonal. E.g. getting a PFN from guest_memfd
instead of the primary MMU essentially boils down to invoking kvm_gmem_get_pfn()
instead of __gfn_to_pfn_memslot(), the MMU proper doesn't care how the PFN was
resolved. I.e. 99% of KVM's MMU logic has no interaction with the primary MMU.

> > - Subjects IOMMUFD to all of KVM's historical baggage, e.g. the memslot deletion
> > mess, the truly nasty MTRR emulation (which I still hope to delete), the NX
> > hugepage mitigation, etc.
>
> Does it? I think that just remains isolated in kvm. The output from
> KVM is only a radix table top pointer, it is up to KVM how to manage
> it still.

Oh, I didn't mean from a code perspective, I meant from a behaviorial perspective.
E.g. there's no reason to disallow huge mappings in the IOMMU because the CPU is
vulnerable to the iTLB multi-hit mitigation.

> > I'm not convinced that memory consumption is all that interesting. If a VM is
> > mapping the majority of memory into a device, then odds are good that the guest
> > is backed with at least 2MiB page, if not 1GiB pages, at which point the memory
> > overhead for pages tables is quite small, especially relative to the total amount
> > of memory overheads for such systems.
>
> AFAIK the main argument is performance. It is similar to why we want
> to do IOMMU SVA with MM page table sharing.
>
> If IOMMU mirrors/shadows/copies a page table using something like HMM
> techniques then the invalidations will mark ranges of IOVA as
> non-present and faults will occur to trigger hmm_range_fault to do the
> shadowing.
>
> This means that pretty much all IO will always encounter a non-present
> fault, certainly at the start and maybe worse while ongoing.
>
> On the other hand, if we share the exact page table then natural CPU
> touches will usually make the page present before an IO happens in
> almost all cases and we don't have to take the horribly expensive IO
> page fault at all.

I'm not advocating mirroring/copying/shadowing page tables between KVM and the
IOMMU. I'm suggesting managing IOMMU page tables mostly independently, but reusing
KVM code to do so.

I wouldn't even be opposed to KVM outright managing the IOMMU's page tables. E.g.
add an "iommu" flag to "union kvm_mmu_page_role" and then the implementation looks
rather similar to this series.

What terrifies is me sharing page tables between the CPU and the IOMMU verbatim.

Yes, sharing page tables will Just Work for faulting in memory, but the downside
is that _when_, not if, KVM modifies PTEs for whatever reason, those modifications
will also impact the IO path. My understanding is that IO page faults are at least
an order of magnitude more expensive than CPU page faults. That means that what's
optimal for CPU page tables may not be optimal, or even _viable_, for IOMMU page
tables.

E.g. based on our conversation at LPC, write-protecting guest memory to do dirty
logging is not a viable option for the IOMMU because the latency of the resulting
IOPF is too high. Forcing KVM to use D-bit dirty logging for CPUs just because
the VM has passthrough (mediated?) devices would be likely a non-starter.

One of my biggest concerns with sharing page tables between KVM and IOMMUs is that
we will end up having to revert/reject changes that benefit KVM's usage due to
regressing the IOMMU usage.

If instead KVM treats IOMMU page tables as their own thing, then we can have
divergent behavior as needed, e.g. different dirty logging algorithms, different
software-available bits, etc. It would also allow us to define new ABI instead
of trying to reconcile the many incompatibilies and warts in KVM's existing ABI.
E.g. off the top of my head:

- The virtual APIC page shouldn't be visible to devices, as it's not "real" guest
memory.

- Access tracking, i.e. page aging, by making PTEs !PRESENT because the CPU
doesn't support A/D bits or because the admin turned them off via KVM's
enable_ept_ad_bits module param.

- Write-protecting GFNs for shadow paging when L1 is running nested VMs. KVM's
ABI can be that device writes to L1's page tables are exempt.

- KVM can exempt IOMMU page tables from KVM's awful "drop all page tables if
any memslot is deleted" ABI.

> We were not able to make bi-dir notifiers with with the CPU mm, I'm
> not sure that is "relatively easy" :(

I'm not suggesting full blown mirroring, all I'm suggesting is a fire-and-forget
notifier for KVM to tell IOMMUFD "I've faulted in GFN A, you might want to do the
same".

It wouldn't even necessarily need to be a notifier per se, e.g. if we taught KVM
to manage IOMMU page tables, then KVM could simply install mappings for multiple
sets of page tables as appropriate.

2023-12-04 19:51:20

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC PATCH 00/42] Sharing KVM TDP to IOMMU

On Mon, Dec 04, 2023 at 11:22:49AM -0800, Sean Christopherson wrote:
> On Mon, Dec 04, 2023, Jason Gunthorpe wrote:
> > On Mon, Dec 04, 2023 at 09:00:55AM -0800, Sean Christopherson wrote:
> >
> > > There are more approaches beyond having IOMMUFD and KVM be
> > > completely separate entities. E.g. extract the bulk of KVM's "TDP
> > > MMU" implementation to common code so that IOMMUFD doesn't need to
> > > reinvent the wheel.
> >
> > We've pretty much done this already, it is called "hmm" and it is what
> > the IO world uses. Merging/splitting huge page is just something that
> > needs some coding in the page table code, that people want for other
> > reasons anyhow.
>
> Not really. HMM is a wildly different implementation than KVM's TDP MMU. At a
> glance, HMM is basically a variation on the primary MMU, e.g. deals with VMAs,
> runs under mmap_lock (or per-VMA locks?), and faults memory into the primary MMU
> while walking the "secondary" HMM page tables.

hmm supports the essential idea of shadowing parts of the primary
MMU. This is a big chunk of what kvm is doing, just differently.

> KVM's TDP MMU (and all of KVM's flavors of MMUs) is much more of a pure secondary
> MMU. The core of a KVM MMU maps GFNs to PFNs, the intermediate steps that involve
> the primary MMU are largely orthogonal. E.g. getting a PFN from guest_memfd
> instead of the primary MMU essentially boils down to invoking kvm_gmem_get_pfn()
> instead of __gfn_to_pfn_memslot(), the MMU proper doesn't care how the PFN was
> resolved. I.e. 99% of KVM's MMU logic has no interaction with the primary MMU.

Hopefully the memfd stuff we be generalized so we can use it in
iommufd too, without relying on kvm. At least the first basic stuff
should be doable fairly soon.

> I'm not advocating mirroring/copying/shadowing page tables between KVM and the
> IOMMU. I'm suggesting managing IOMMU page tables mostly independently, but reusing
> KVM code to do so.

I guess from my POV, if KVM has two copies of the logically same radix
tree then that is fine too.

> Yes, sharing page tables will Just Work for faulting in memory, but the downside
> is that _when_, not if, KVM modifies PTEs for whatever reason, those modifications
> will also impact the IO path. My understanding is that IO page faults are at least
> an order of magnitude more expensive than CPU page faults. That means that what's
> optimal for CPU page tables may not be optimal, or even _viable_, for IOMMU page
> tables.

Yes, you wouldn't want to do some of the same KVM techniques today in
a shared mode.

> E.g. based on our conversation at LPC, write-protecting guest memory to do dirty
> logging is not a viable option for the IOMMU because the latency of the resulting
> IOPF is too high. Forcing KVM to use D-bit dirty logging for CPUs just because
> the VM has passthrough (mediated?) devices would be likely a
> non-starter.

Yes

> One of my biggest concerns with sharing page tables between KVM and IOMMUs is that
> we will end up having to revert/reject changes that benefit KVM's usage due to
> regressing the IOMMU usage.

It is certainly a strong argument

> I'm not suggesting full blown mirroring, all I'm suggesting is a fire-and-forget
> notifier for KVM to tell IOMMUFD "I've faulted in GFN A, you might want to do the
> same".

If we say the only thing this works with is the memfd version of KVM,
could we design the memfd stuff to not have the same challenges with
mirroring as normal VMAs?

> It wouldn't even necessarily need to be a notifier per se, e.g. if we taught KVM
> to manage IOMMU page tables, then KVM could simply install mappings for multiple
> sets of page tables as appropriate.

This somehow feels more achievable to me since KVM already has all the
code to handle multiple TDPs, having two parallel ones is probably
much easier than trying to weld KVM to a different page table
implementation through some kind of loose coupled notifier.

Jason

2023-12-04 20:12:00

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 00/42] Sharing KVM TDP to IOMMU

On Mon, Dec 04, 2023, Jason Gunthorpe wrote:
> On Mon, Dec 04, 2023 at 11:22:49AM -0800, Sean Christopherson wrote:
> > I'm not suggesting full blown mirroring, all I'm suggesting is a fire-and-forget
> > notifier for KVM to tell IOMMUFD "I've faulted in GFN A, you might want to do the
> > same".
>
> If we say the only thing this works with is the memfd version of KVM,

That's likely a big "if", as guest_memfd is not and will not be a wholesale
replacement of VMA-based guest memory, at least not in the forseeable future.
I would be quite surprised if the target use cases for this could be moved to
guest_memfd without losing required functionality.

> could we design the memfd stuff to not have the same challenges with
> mirroring as normal VMAs?

What challenges in particular are you concerned about? And maybe also define
"mirroring"? E.g. ensuring that the CPU and IOMMU page tables are synchronized
is very different than ensuring that the IOMMU page tables can only map memory
that is mappable by the guest, i.e. that KVM can map into the CPU page tables.

2023-12-04 23:50:47

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC PATCH 00/42] Sharing KVM TDP to IOMMU

On Mon, Dec 04, 2023 at 12:11:46PM -0800, Sean Christopherson wrote:

> > could we design the memfd stuff to not have the same challenges with
> > mirroring as normal VMAs?
>
> What challenges in particular are you concerned about? And maybe also define
> "mirroring"? E.g. ensuring that the CPU and IOMMU page tables are synchronized
> is very different than ensuring that the IOMMU page tables can only map memory
> that is mappable by the guest, i.e. that KVM can map into the CPU page tables.

IIRC, it has been awhile, it is difficult to get a new populated PTE
out of the MM side and into an hmm user and get all the invalidation
locking to work as well. Especially when the devices want to do
sleeping invalidations.

kvm doesn't solve this problem either, but pushing populated TDP PTEs
to another observer may be simpler, as perhaps would pushing populated
memfd pages or something like that?

"mirroring" here would simply mean that if the CPU side has a
popoulated page then the hmm side copying it would also have a
populated page. Instead of a fault on use model.

Jason

2023-12-05 02:01:56

by Yan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH 00/42] Sharing KVM TDP to IOMMU

On Mon, Dec 04, 2023 at 08:38:17AM -0800, Sean Christopherson wrote:
> On Mon, Dec 04, 2023, Jason Gunthorpe wrote:
> > On Sat, Dec 02, 2023 at 05:12:11PM +0800, Yan Zhao wrote:
> > > In this series, term "exported" is used in place of "shared" to avoid
> > > confusion with terminology "shared EPT" in TDX.
> > >
> > > The framework contains 3 main objects:
> > >
> > > "KVM TDP FD" object - The interface of KVM to export TDP page tables.
> > > With this object, KVM allows external components to
> > > access a TDP page table exported by KVM.
> >
> > I don't know much about the internals of kvm, but why have this extra
> > user visible piece?
>
> That I don't know, I haven't looked at the gory details of this RFC.
>
> > Isn't there only one "TDP" per kvm fd?
>
> No. In steady state, with TDP (EPT) enabled and assuming homogeneous capabilities
> across all vCPUs, KVM will have 3+ sets of TDP page tables *active* at any given time:
>
> 1. "Normal"
> 2. SMM
> 3-N. Guest (for L2, i.e. nested, VMs)
Yes, the reason to introduce KVM TDP FD is to let KVM know which TDP the user
wants to export(share).

For as_id=0 (which is currently the only supported as_id to share), a TDP with
smm=0, guest_mode=0 will be chosen.

Upon receiving the KVM_CREATE_TDP_FD ioctl, KVM will try to find an existing
TDP root with role specified by as_id 0. If there's existing TDP with the target
role found, KVM will just export this one; if no existing one found, KVM will
create a new TDP root in non-vCPU context.
Then, KVM will mark the exported TDP as "exported".


tdp_mmu_roots
|
role | smm | guest_mode +------+-----------+----------+
------|----------------- | | | |
0 | 0 | 0 ==> address space 0 | v v v
1 | 1 | 0 | .--------. .--------. .--------.
2 | 0 | 1 | | root | | root | | root |
3 | 1 | 1 | |(role 1)| |(role 2)| |(role 3)|
| '--------' '--------' '--------'
| ^
| | create or get .------.
| +--------------------| vCPU |
| fault '------'
| smm=1
| guest_mode=0
|
(set root as exported) v
.--------. create or get .---------------. create or get .------.
| TDP FD |------------------->| root (role 0) |<-----------------| vCPU |
'--------' fault '---------------' fault '------'
. smm=0
. guest_mode=0
.
non-vCPU context <---|---> vCPU context
.
.

No matter the TDP is exported or not, vCPUs just load TDP root according to its
vCPU modes.
In this way, KVM is able to share the TDP in KVM address space 0 to IOMMU side.

> The number of possible TDP page tables used for nested VMs is well bounded, but
> since devices obviously can't be nested VMs, I won't bother trying to explain the
> the various possibilities (nested NPT on AMD is downright ridiculous).
In future, if possible, I wonder if we can export an TDP for nested VM too.
E.g. in scenarios where TDP is partitioned, and one piece is for L2 VM.
Maybe we can specify that and tell KVM the very piece of TDP to export.

> Nested virtualization aside, devices are obviously not capable of running in SMM
> and so they all need to use the "normal" page tables.
>
> I highlighted "active" above because if _any_ memslot is deleted, KVM will invalidate
> *all* existing page tables and rebuild new page tables as needed. So over the
> lifetime of a VM, KVM could theoretically use an infinite number of page tables.
Right. In patch 36, the TDP root which is marked as "exported" will be exempted
from "invalidate". Instead, an "exported" TDP just zaps all leaf entries upon
memory slot removal.
That is to say, for an exported TDP, it can be "active" until it's unmarked as
exported.

2023-12-05 02:21:52

by Yan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH 00/42] Sharing KVM TDP to IOMMU

On Mon, Dec 04, 2023 at 11:08:00AM -0400, Jason Gunthorpe wrote:
> On Sat, Dec 02, 2023 at 05:12:11PM +0800, Yan Zhao wrote:
> > In this series, term "exported" is used in place of "shared" to avoid
> > confusion with terminology "shared EPT" in TDX.
> >
> > The framework contains 3 main objects:
> >
> > "KVM TDP FD" object - The interface of KVM to export TDP page tables.
> > With this object, KVM allows external components to
> > access a TDP page table exported by KVM.
>
> I don't know much about the internals of kvm, but why have this extra
> user visible piece? Isn't there only one "TDP" per kvm fd? Why not
> just use the KVM FD as a handle for the TDP?
As explained in a parallel mail, the reason to introduce KVM TDP FD is to let
KVM know which TDP the user wants to export(share).
And another reason is wrap the exported TDP with its exported ops in a
single structure. So, components outside of KVM can query meta data and
request page fault, register invalidate callback through the exported ops.

struct kvm_tdp_fd {
/* Public */
struct file *file;
const struct kvm_exported_tdp_ops *ops;

/* private to KVM */
struct kvm_exported_tdp *priv;
};
For KVM, it only needs to expose this struct kvm_tdp_fd and two symbols
kvm_tdp_fd_get() and kvm_tdp_fd_put().


>
> > "IOMMUFD KVM HWPT" object - A proxy connecting KVM TDP FD to IOMMU driver.
> > This HWPT has no IOAS associated.
> >
> > "KVM domain" in IOMMU driver - Stage 2 domain in IOMMU driver whose paging
> > structures are managed by KVM.
> > Its hardware TLB invalidation requests are
> > notified from KVM via IOMMUFD KVM HWPT
> > object.
>
> This seems broadly the right direction
>
> > - About device which partially supports IOPF
> >
> > Many devices claiming PCIe PRS capability actually only tolerate IOPF in
> > certain paths (e.g. DMA paths for SVM applications, but not for non-SVM
> > applications or driver data such as ring descriptors). But the PRS
> > capability doesn't include a bit to tell whether a device 100% tolerates
> > IOPF in all DMA paths.
>
> The lack of tolerance for truely DMA pinned guest memory is a
> significant problem for any real deployment, IMHO. I am aware of no
> device that can handle PRI on every single DMA path. :(
DSA actaully can handle PRI on all DMA paths. But it requires driver to turn on
this capability :(

> > A simple way is to track an allowed list of devices which are known 100%
> > IOPF-friendly in VFIO. Another option is to extend PCIe spec to allow
> > device reporting whether it fully or partially supports IOPF in the PRS
> > capability.
>
> I think we need something like this.
>
> > - How to map MSI page on arm platform demands discussions.
>
> Yes, the recurring problem :(
>
> Probably the same approach as nesting would work for a hack - map the
> ITS page into the fixed reserved slot and tell the guest not to touch
> it and to identity map it.
Ok.

2023-12-05 04:21:45

by Yan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH 00/42] Sharing KVM TDP to IOMMU

On Mon, Dec 04, 2023 at 09:00:55AM -0800, Sean Christopherson wrote:
> On Sat, Dec 02, 2023, Yan Zhao wrote:
> Please list out the pros and cons for each. In the cons column for piggybacking
> KVM's page tables:
>
> - *Significantly* increases the complexity in KVM
The complexity to KVM (up to now) are
a. fault in non-vCPU context
b. keep exported root always "active"
c. disallow non-coherent DMAs
d. movement of SPTE_MMU_PRESENT

for a, I think it's accepted, and we can see eager page split allocates
non-leaf pages in non-vCPU context already.
for b, it requires exported TDP root to keep "active" in KVM's "fast zap" (which
invalidates all active TDP roots). And instead, the exported TDP's leaf
entries are all zapped.
Though it looks not "fast" enough, it avoids an unnecessary root page
zap, and it's actually not frequent --
- one for memslot removal (IO page fault is unlikey to happen during VM
boot-up)
- one for MMIO gen wraparound (which is rare)
- one for nx huge page mode change (which is rare too)
for c, maybe we can work out a way to remove the MTRR stuffs.
for d, I added a config to turn on/off this movement. But right, KVM side will
have to sacrifice a bit for software usage and take care of it when the
config is on.

> - Puts constraints on what KVM can/can't do in the future (see the movement
> of SPTE_MMU_PRESENT).
> - Subjects IOMMUFD to all of KVM's historical baggage, e.g. the memslot deletion
> mess, the truly nasty MTRR emulation (which I still hope to delete), the NX
> hugepage mitigation, etc.
NX hugepage mitigation only exists on certain CPUs. I don't see it in recent
Intel platforms, e.g. SPR and GNR...
We can disallow sharing approach if NX huge page mitigation is enabled.
But if pinning or partial pinning are not involved, nx huge page will only cause
unnecessary zap to reduce performance, but functionally it still works well.

Besides, for the extra IO invalidation involved in TDP zap, I think SVM has the
same issue. i.e. each zap in primary MMU is also accompanied by a IO invalidation.

>
> Please also explain the intended/expected/targeted use cases. E.g. if the main
> use case is for device passthrough to slice-of-hardware VMs that aren't memory
> oversubscribed,
>
The main use case is for device passthrough with all devices supporting full
IOPF.
Opportunistically, we hope it can be used in trusted IO, where TDP are shared
to IO side. So, there's only one page table audit required and out-of-sync
window for mappings between CPU and IO side can also be eliminated.

> > - Unified page table management
> > The complexity of allocating guest pages per GPAs, registering to MMU
> > notifier on host primary MMU, sub-page unmapping, atomic page merge/split
>
> Please find different terminology than "sub-page". With Sub-Page Protection, Intel
> has more or less established "sub-page" to mean "less than 4KiB granularity". But
> that can't possibly what you mean here because KVM doesn't support (un)mapping
> memory at <4KiB granularity. Based on context above, I assume you mean "unmapping
> arbitrary pages within a given range".
>
Ok, sorry for this confusion.
By "sub-page unmapping", I mean atomic huge page splitting and unmapping smaller
range in the previous huge page.

2023-12-05 06:23:38

by Yan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH 00/42] Sharing KVM TDP to IOMMU

On Mon, Dec 04, 2023 at 11:22:49AM -0800, Sean Christopherson wrote:
> On Mon, Dec 04, 2023, Jason Gunthorpe wrote:
> > On Mon, Dec 04, 2023 at 09:00:55AM -0800, Sean Christopherson wrote:

> > > I'm not convinced that memory consumption is all that interesting. If a VM is
> > > mapping the majority of memory into a device, then odds are good that the guest
> > > is backed with at least 2MiB page, if not 1GiB pages, at which point the memory
> > > overhead for pages tables is quite small, especially relative to the total amount
> > > of memory overheads for such systems.
> >
> > AFAIK the main argument is performance. It is similar to why we want
> > to do IOMMU SVA with MM page table sharing.
> >
> > If IOMMU mirrors/shadows/copies a page table using something like HMM
> > techniques then the invalidations will mark ranges of IOVA as
> > non-present and faults will occur to trigger hmm_range_fault to do the
> > shadowing.
> >
> > This means that pretty much all IO will always encounter a non-present
> > fault, certainly at the start and maybe worse while ongoing.
> >
> > On the other hand, if we share the exact page table then natural CPU
> > touches will usually make the page present before an IO happens in
> > almost all cases and we don't have to take the horribly expensive IO
> > page fault at all.
>
> I'm not advocating mirroring/copying/shadowing page tables between KVM and the
> IOMMU. I'm suggesting managing IOMMU page tables mostly independently, but reusing
> KVM code to do so.
>
> I wouldn't even be opposed to KVM outright managing the IOMMU's page tables. E.g.
> add an "iommu" flag to "union kvm_mmu_page_role" and then the implementation looks
> rather similar to this series.
Yes, very similar to current implementation, which added a "exported" flag to
"union kvm_mmu_page_role".
>
> What terrifies is me sharing page tables between the CPU and the IOMMU verbatim.
>
> Yes, sharing page tables will Just Work for faulting in memory, but the downside
> is that _when_, not if, KVM modifies PTEs for whatever reason, those modifications
> will also impact the IO path. My understanding is that IO page faults are at least
> an order of magnitude more expensive than CPU page faults. That means that what's
> optimal for CPU page tables may not be optimal, or even _viable_, for IOMMU page
> tables.
>
> E.g. based on our conversation at LPC, write-protecting guest memory to do dirty
> logging is not a viable option for the IOMMU because the latency of the resulting
> IOPF is too high. Forcing KVM to use D-bit dirty logging for CPUs just because
> the VM has passthrough (mediated?) devices would be likely a non-starter.
>
> One of my biggest concerns with sharing page tables between KVM and IOMMUs is that
> we will end up having to revert/reject changes that benefit KVM's usage due to
> regressing the IOMMU usage.
>
As the TDP shared by IOMMU is marked by KVM, could we limit the changes (that
benefic KVM but regress IOMMU) to TDPs not shared?

> If instead KVM treats IOMMU page tables as their own thing, then we can have
> divergent behavior as needed, e.g. different dirty logging algorithms, different
> software-available bits, etc. It would also allow us to define new ABI instead
> of trying to reconcile the many incompatibilies and warts in KVM's existing ABI.
> E.g. off the top of my head:
>
> - The virtual APIC page shouldn't be visible to devices, as it's not "real" guest
> memory.
>
> - Access tracking, i.e. page aging, by making PTEs !PRESENT because the CPU
> doesn't support A/D bits or because the admin turned them off via KVM's
> enable_ept_ad_bits module param.
>
> - Write-protecting GFNs for shadow paging when L1 is running nested VMs. KVM's
> ABI can be that device writes to L1's page tables are exempt.
>
> - KVM can exempt IOMMU page tables from KVM's awful "drop all page tables if
> any memslot is deleted" ABI.
>
> > We were not able to make bi-dir notifiers with with the CPU mm, I'm
> > not sure that is "relatively easy" :(
>
> I'm not suggesting full blown mirroring, all I'm suggesting is a fire-and-forget
> notifier for KVM to tell IOMMUFD "I've faulted in GFN A, you might want to do the
> same".
>
> It wouldn't even necessarily need to be a notifier per se, e.g. if we taught KVM
> to manage IOMMU page tables, then KVM could simply install mappings for multiple
> sets of page tables as appropriate.
Not sure which approach below is the one you are referring to by "fire-and-forget
notifier" and "if we taught KVM to manage IOMMU page tables".

Approach A:
1. User space or IOMMUFD tells KVM which address space to share to IOMMUFD.
2. KVM create a special TDP, and maps this page table whenever a GFN in the
specified address space is faulted to PFN in vCPU side.
3. IOMMUFD imports this special TDP and receives zaps notification from KVM.
KVM will only send the zap notification for memslot removal or for certain MMU
zap notifications

Approach B:
1. User space or IOMMUFD tells KVM which address space to notify.
2. KVM notifies IOMMUFD whenever a GFN in the specified address space is faulted
to PFN in vCPU side.
3. IOMMUFD translates GFN to PFN in its own way (though VMA or through certain
new memfd interface), and maps IO PTEs by itself.
4. IOMMUFD zaps IO PTEs when a memslot is removed and interacts with MMU notifier
for zap notification in the primary MMU.


If approach A is preferred, could vCPUs also be allowed to attach to this
special TDP in VMs that don't suffer from NX hugepage mitigation, and do not
want live migration with passthrough devices, and don't rely on write-protection
for nested VMs.

2023-12-05 06:31:18

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC PATCH 00/42] Sharing KVM TDP to IOMMU

> From: Jason Gunthorpe <[email protected]>
> Sent: Monday, December 4, 2023 11:08 PM
>
> On Sat, Dec 02, 2023 at 05:12:11PM +0800, Yan Zhao wrote:
> > - How to map MSI page on arm platform demands discussions.
>
> Yes, the recurring problem :(
>
> Probably the same approach as nesting would work for a hack - map the
> ITS page into the fixed reserved slot and tell the guest not to touch
> it and to identity map it.
>

yes logically it should follow what is planned for nesting.

just that kvm needs to involve more iommu specific knowledge e.g.
iommu_get_msi_cookie() to reserve the slot.

2023-12-05 06:46:55

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC PATCH 00/42] Sharing KVM TDP to IOMMU

> From: Zhao, Yan Y <[email protected]>
> Sent: Tuesday, December 5, 2023 9:32 AM
>
> On Mon, Dec 04, 2023 at 08:38:17AM -0800, Sean Christopherson wrote:
> > The number of possible TDP page tables used for nested VMs is well
> bounded, but
> > since devices obviously can't be nested VMs, I won't bother trying to
> explain the
> > the various possibilities (nested NPT on AMD is downright ridiculous).
> In future, if possible, I wonder if we can export an TDP for nested VM too.
> E.g. in scenarios where TDP is partitioned, and one piece is for L2 VM.
> Maybe we can specify that and tell KVM the very piece of TDP to export.
>

nesting is tricky.

The reason why the sharing (w/o nesting) is logically ok is that both IOMMU
and KVM page tables are for the same GPA address space created by
the host.

for nested VM together with vIOMMU, the same sharing story holds if the
stage-2 page table in both sides still translates GPA. It implies vIOMMU is
enabled in nested translation mode and L0 KVM doesn't expose vEPT to
L1 VMM (which then uses shadow instead).

things become tricky when vIOMMU is working in a shadowing mode or
when L0 KVM exposes vEPT to L1 VMM. In either case the stage-2 page
table of L0 IOMMU/KVM actually translates a guest address space then
sharing becomes problematic (on figuring out whether both refers to the
same guest address space while that fact might change at any time).

2023-12-05 07:09:45

by Yan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH 11/42] iommu: Add new domain op cache_invalidate_kvm

On Mon, Dec 04, 2023 at 11:09:45AM -0400, Jason Gunthorpe wrote:
> On Sat, Dec 02, 2023 at 05:20:41PM +0800, Yan Zhao wrote:
> > On KVM invalidates mappings that are shared to IOMMU stage 2 paging
> > structures, IOMMU driver needs to invalidate hardware TLBs accordingly.
> >
> > The new op cache_invalidate_kvm is called from IOMMUFD to invalidate
> > hardware TLBs upon receiving invalidation notifications from KVM.
>
> Why?
>
> SVA hooks the invalidation directly to the mm, shouldn't KVM also hook
> the invalidation directly from the kvm? Why do we need to call a chain
> of function pointers? iommufd isn't adding any value in the chain
> here.
Do you prefer IOMMU vendor driver to register as importer to KVM directly?
Then IOMMUFD just passes "struct kvm_tdp_fd" to IOMMU vendor driver for domain
creation.
Actually both ways are ok for us.
The current chaining way is just to let IOMMU domain only managed by IOMMUFD and
decoupled to KVM.

2023-12-05 07:18:03

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC PATCH 00/42] Sharing KVM TDP to IOMMU

> From: Jason Gunthorpe <[email protected]>
> Sent: Tuesday, December 5, 2023 3:51 AM
>
> On Mon, Dec 04, 2023 at 11:22:49AM -0800, Sean Christopherson wrote:
> > It wouldn't even necessarily need to be a notifier per se, e.g. if we taught
> KVM
> > to manage IOMMU page tables, then KVM could simply install mappings for
> multiple
> > sets of page tables as appropriate.

iommu driver still needs to be notified to invalidate the iotlb, unless we want
KVM to directly call IOMMU API instead of going through iommufd.

>
> This somehow feels more achievable to me since KVM already has all the
> code to handle multiple TDPs, having two parallel ones is probably
> much easier than trying to weld KVM to a different page table
> implementation through some kind of loose coupled notifier.
>

yes performance-wise this can also reduce the I/O page faults as the
sharing approach achieves.

but how is it compared to another way of supporting IOPF natively in
iommufd and iommu drivers? Note that iommufd also needs to support
native vfio applications e.g. dpdk. I'm not sure whether there will be
strong interest in enabling IOPF for those applications. But if the
answer is yes then it's inevitable to have such logic implemented in
the iommu stack given KVM is not in the picture there.

With that is it more reasonable to develop the IOPF support natively
in iommu side, plus an optional notifier mechanism to sync with
KVM-induced host PTE installation as optimization?

2023-12-05 07:37:25

by Yan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH 12/42] iommufd: Introduce allocation data info and flag for KVM managed HWPT

On Mon, Dec 04, 2023 at 02:29:28PM -0400, Jason Gunthorpe wrote:
> On Sat, Dec 02, 2023 at 05:21:13PM +0800, Yan Zhao wrote:
>
> > @@ -413,11 +422,13 @@ struct iommu_hwpt_arm_smmuv3 {
> > * @IOMMU_HWPT_DATA_NONE: no data
> > * @IOMMU_HWPT_DATA_VTD_S1: Intel VT-d stage-1 page table
> > * @IOMMU_HWPT_DATA_ARM_SMMUV3: ARM SMMUv3 Context Descriptor Table
> > + * @IOMMU_HWPT_DATA_KVM: KVM managed stage-2 page table
> > */
> > enum iommu_hwpt_data_type {
> > IOMMU_HWPT_DATA_NONE,
> > IOMMU_HWPT_DATA_VTD_S1,
> > IOMMU_HWPT_DATA_ARM_SMMUV3,
> > + IOMMU_HWPT_DATA_KVM,
> > };
>
> Definately no, the HWPT_DATA is for the *driver* - it should not be
> "kvm".
>
> Add the kvm fd to the main structure
>
Do you mean add a "int kvm_fd" to "struct iommu_hwpt_alloc" ?
struct iommu_hwpt_alloc {
__u32 size;
__u32 flags;
__u32 dev_id;
__u32 pt_id;
__u32 out_hwpt_id;
__u32 __reserved;
__u32 data_type;
__u32 data_len;
__aligned_u64 data_uptr;
};

Then always create the HWPT as IOMMUFD_OBJ_HWPT_KVM as long as kvm_fd > 0 ?

2023-12-05 07:38:47

by Yan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH 14/42] iommufd: Enable KVM HW page table object to be proxy between KVM and IOMMU

On Mon, Dec 04, 2023 at 02:34:10PM -0400, Jason Gunthorpe wrote:
> On Sat, Dec 02, 2023 at 05:22:16PM +0800, Yan Zhao wrote:
> > +config IOMMUFD_KVM_HWPT
> > + bool "Supports KVM managed HW page tables"
> > + default n
> > + help
> > + Selecting this option will allow IOMMUFD to create IOMMU stage 2
> > + page tables whose paging structure and mappings are managed by
> > + KVM MMU. IOMMUFD serves as proxy between KVM and IOMMU driver to
> > + allow IOMMU driver to get paging structure meta data and cache
> > + invalidate notifications from KVM.
>
> I'm not sure we need a user selectable kconfig for this..
>
> Just turn it on if we have kvm turned on an iommu driver implements it
>
Got it, thanks!

2023-12-05 07:43:36

by Yan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH 16/42] iommufd: Enable device feature IOPF during device attachment to KVM HWPT

On Mon, Dec 04, 2023 at 02:36:03PM -0400, Jason Gunthorpe wrote:
> On Sat, Dec 02, 2023 at 05:23:11PM +0800, Yan Zhao wrote:
> > Enable device feature IOPF during device attachment to KVM HWPT and abort
> > the attachment if feature enabling is failed.
> >
> > "pin" is not done by KVM HWPT. If VMM wants to create KVM HWPT, it must
> > know that all devices attached to this HWPT support IOPF so that pin-all
> > is skipped.
> >
> > Signed-off-by: Yan Zhao <[email protected]>
> > ---
> > drivers/iommu/iommufd/device.c | 18 ++++++++++++++++++
> > 1 file changed, 18 insertions(+)
> >
> > diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
> > index 83af6b7e2784b..4ea447e052ce1 100644
> > --- a/drivers/iommu/iommufd/device.c
> > +++ b/drivers/iommu/iommufd/device.c
> > @@ -381,10 +381,26 @@ int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
> > goto err_unresv;
> > idev->igroup->hwpt = hwpt;
> > }
> > + if (hwpt_is_kvm(hwpt)) {
> > + /*
> > + * Feature IOPF requires ats is enabled which is true only
> > + * after device is attached to iommu domain.
> > + * So enable dev feature IOPF after iommu_attach_group().
> > + * -EBUSY will be returned if feature IOPF is already on.
> > + */
> > + rc = iommu_dev_enable_feature(idev->dev, IOMMU_DEV_FEAT_IOPF);
> > + if (rc && rc != -EBUSY)
> > + goto err_detach;
>
> I would like to remove IOMMU_DEV_FEAT_IOPF completely please

So, turn on device PRI during device attachment in IOMMU vendor driver?

2023-12-05 14:52:44

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC PATCH 11/42] iommu: Add new domain op cache_invalidate_kvm

On Tue, Dec 05, 2023 at 02:40:28PM +0800, Yan Zhao wrote:
> On Mon, Dec 04, 2023 at 11:09:45AM -0400, Jason Gunthorpe wrote:
> > On Sat, Dec 02, 2023 at 05:20:41PM +0800, Yan Zhao wrote:
> > > On KVM invalidates mappings that are shared to IOMMU stage 2 paging
> > > structures, IOMMU driver needs to invalidate hardware TLBs accordingly.
> > >
> > > The new op cache_invalidate_kvm is called from IOMMUFD to invalidate
> > > hardware TLBs upon receiving invalidation notifications from KVM.
> >
> > Why?
> >
> > SVA hooks the invalidation directly to the mm, shouldn't KVM also hook
> > the invalidation directly from the kvm? Why do we need to call a chain
> > of function pointers? iommufd isn't adding any value in the chain
> > here.
> Do you prefer IOMMU vendor driver to register as importer to KVM directly?
> Then IOMMUFD just passes "struct kvm_tdp_fd" to IOMMU vendor driver for domain
> creation.

Yes, this is what we did for SVA

Function pointers are slow these days, so it is preferred to go
directly.

Jason

2023-12-05 14:53:15

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC PATCH 12/42] iommufd: Introduce allocation data info and flag for KVM managed HWPT

On Tue, Dec 05, 2023 at 03:08:03PM +0800, Yan Zhao wrote:
> On Mon, Dec 04, 2023 at 02:29:28PM -0400, Jason Gunthorpe wrote:
> > On Sat, Dec 02, 2023 at 05:21:13PM +0800, Yan Zhao wrote:
> >
> > > @@ -413,11 +422,13 @@ struct iommu_hwpt_arm_smmuv3 {
> > > * @IOMMU_HWPT_DATA_NONE: no data
> > > * @IOMMU_HWPT_DATA_VTD_S1: Intel VT-d stage-1 page table
> > > * @IOMMU_HWPT_DATA_ARM_SMMUV3: ARM SMMUv3 Context Descriptor Table
> > > + * @IOMMU_HWPT_DATA_KVM: KVM managed stage-2 page table
> > > */
> > > enum iommu_hwpt_data_type {
> > > IOMMU_HWPT_DATA_NONE,
> > > IOMMU_HWPT_DATA_VTD_S1,
> > > IOMMU_HWPT_DATA_ARM_SMMUV3,
> > > + IOMMU_HWPT_DATA_KVM,
> > > };
> >
> > Definately no, the HWPT_DATA is for the *driver* - it should not be
> > "kvm".
> >
> > Add the kvm fd to the main structure
> >
> Do you mean add a "int kvm_fd" to "struct iommu_hwpt_alloc" ?
> struct iommu_hwpt_alloc {
> __u32 size;
> __u32 flags;
> __u32 dev_id;
> __u32 pt_id;
> __u32 out_hwpt_id;
> __u32 __reserved;
> __u32 data_type;
> __u32 data_len;
> __aligned_u64 data_uptr;
> };
>
> Then always create the HWPT as IOMMUFD_OBJ_HWPT_KVM as long as kvm_fd > 0 ?

Yes, but 0 is a valid FD so you need to add a flag 'kvm_fd valid'

Jason

2023-12-05 14:54:00

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC PATCH 16/42] iommufd: Enable device feature IOPF during device attachment to KVM HWPT

On Tue, Dec 05, 2023 at 03:14:09PM +0800, Yan Zhao wrote:

> > I would like to remove IOMMU_DEV_FEAT_IOPF completely please
>
> So, turn on device PRI during device attachment in IOMMU vendor driver?

If a fault requesting domain is attached then PRI should just be
enabled in the driver

Jason

2023-12-06 01:26:05

by Yan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH 16/42] iommufd: Enable device feature IOPF during device attachment to KVM HWPT

On Tue, Dec 05, 2023 at 10:53:41AM -0400, Jason Gunthorpe wrote:
> On Tue, Dec 05, 2023 at 03:14:09PM +0800, Yan Zhao wrote:
>
> > > I would like to remove IOMMU_DEV_FEAT_IOPF completely please
> >
> > So, turn on device PRI during device attachment in IOMMU vendor driver?
>
> If a fault requesting domain is attached then PRI should just be
> enabled in the driver
>
Right, it makes sense!

2023-12-06 01:27:46

by Yan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH 12/42] iommufd: Introduce allocation data info and flag for KVM managed HWPT

On Tue, Dec 05, 2023 at 10:53:04AM -0400, Jason Gunthorpe wrote:
> On Tue, Dec 05, 2023 at 03:08:03PM +0800, Yan Zhao wrote:
> > On Mon, Dec 04, 2023 at 02:29:28PM -0400, Jason Gunthorpe wrote:
> > > On Sat, Dec 02, 2023 at 05:21:13PM +0800, Yan Zhao wrote:
> > >
> > > > @@ -413,11 +422,13 @@ struct iommu_hwpt_arm_smmuv3 {
> > > > * @IOMMU_HWPT_DATA_NONE: no data
> > > > * @IOMMU_HWPT_DATA_VTD_S1: Intel VT-d stage-1 page table
> > > > * @IOMMU_HWPT_DATA_ARM_SMMUV3: ARM SMMUv3 Context Descriptor Table
> > > > + * @IOMMU_HWPT_DATA_KVM: KVM managed stage-2 page table
> > > > */
> > > > enum iommu_hwpt_data_type {
> > > > IOMMU_HWPT_DATA_NONE,
> > > > IOMMU_HWPT_DATA_VTD_S1,
> > > > IOMMU_HWPT_DATA_ARM_SMMUV3,
> > > > + IOMMU_HWPT_DATA_KVM,
> > > > };
> > >
> > > Definately no, the HWPT_DATA is for the *driver* - it should not be
> > > "kvm".
> > >
> > > Add the kvm fd to the main structure
> > >
> > Do you mean add a "int kvm_fd" to "struct iommu_hwpt_alloc" ?
> > struct iommu_hwpt_alloc {
> > __u32 size;
> > __u32 flags;
> > __u32 dev_id;
> > __u32 pt_id;
> > __u32 out_hwpt_id;
> > __u32 __reserved;
> > __u32 data_type;
> > __u32 data_len;
> > __aligned_u64 data_uptr;
> > };
> >
> > Then always create the HWPT as IOMMUFD_OBJ_HWPT_KVM as long as kvm_fd > 0 ?
>
> Yes, but 0 is a valid FD so you need to add a flag 'kvm_fd valid'

Got it, thanks!

2023-12-06 01:29:56

by Yan Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH 11/42] iommu: Add new domain op cache_invalidate_kvm

On Tue, Dec 05, 2023 at 10:52:27AM -0400, Jason Gunthorpe wrote:
> On Tue, Dec 05, 2023 at 02:40:28PM +0800, Yan Zhao wrote:
> > On Mon, Dec 04, 2023 at 11:09:45AM -0400, Jason Gunthorpe wrote:
> > > On Sat, Dec 02, 2023 at 05:20:41PM +0800, Yan Zhao wrote:
> > > > On KVM invalidates mappings that are shared to IOMMU stage 2 paging
> > > > structures, IOMMU driver needs to invalidate hardware TLBs accordingly.
> > > >
> > > > The new op cache_invalidate_kvm is called from IOMMUFD to invalidate
> > > > hardware TLBs upon receiving invalidation notifications from KVM.
> > >
> > > Why?
> > >
> > > SVA hooks the invalidation directly to the mm, shouldn't KVM also hook
> > > the invalidation directly from the kvm? Why do we need to call a chain
> > > of function pointers? iommufd isn't adding any value in the chain
> > > here.
> > Do you prefer IOMMU vendor driver to register as importer to KVM directly?
> > Then IOMMUFD just passes "struct kvm_tdp_fd" to IOMMU vendor driver for domain
> > creation.
>
> Yes, this is what we did for SVA
>
> Function pointers are slow these days, so it is preferred to go
> directly.

Ok. Will do in this way. thanks!