LinuxLists.cc - [PATCH v3 0/6] Verify dirty logging works properly with page stats

2022-04-01 15:55:28

Subject: [PATCH v3 0/6] Verify dirty logging works properly with page stats

This patch set aims to fix a bug in which KVM incorrectly assumes a large
page as a NX huge page. The bug would prevent guest VM from regaining large
pages and cause performance issue. We fix the bug by explicitly checking
the lpage_disallowed field in the shadow page. Moreover, to fix the bug
properly for TDP MMU, we integrate two patches from Sean that ensures that
we update lpage_disallowed in shadow page before making spte visible to
guest.

To verify the bug fixed, we use dirty logging as the testing target and
dirty_log_perf_test as the selftest binary. By adding the code to check the
page stats from the per-VM interface, we discovered that VMs could regain
large pages after dirty logging disabled. We also verify the existence of
the bug if running with unpatched kernels.

To make the selftest working properly with per-VM stats interface, we
borrowes two patches come from Ben's series: "[PATCH 00/13] KVM: x86: Add a
cap to disable NX hugepages on a VM" [1].

[1] https://lore.kernel.org/all/[email protected]/T/

v2 -> v3:
- Update lpage_disallowed before making spte visible [seanjc].
- Adding tdp_mmu_pages stats [seanjc]
- update comments in selftest [bgardon]

v2: https://lore.kernel.org/lkml/[email protected]/T/

v1 -> v2:
- Update the commit message. [dmatlack]
- Update the comments in patch 3/4 to clarify the motivation. [bgardon]
- Add another iteration in dirty_log_perf_test to regain pages [bgardon]

Ben Gardon (2):
KVM: selftests: Dump VM stats in binary stats test
KVM: selftests: Test reading a single stat

Mingwei Zhang (2):
KVM: x86/mmu: explicitly check nx_hugepage in
disallowed_hugepage_adjust()
selftests: KVM: use page stats to check if dirty logging works
properly

Sean Christopherson (2):
KVM: x86/mmu: Set lpage_disallowed in TDP MMU before setting SPTE
KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual
pages

arch/x86/include/asm/kvm_host.h | 11 +-
arch/x86/kvm/mmu/mmu.c | 28 ++-
arch/x86/kvm/mmu/mmu_internal.h | 2 +-
arch/x86/kvm/mmu/tdp_mmu.c | 36 ++--
.../selftests/kvm/dirty_log_perf_test.c | 53 +++++
.../selftests/kvm/include/kvm_util_base.h | 2 +
.../selftests/kvm/kvm_binary_stats_test.c | 6 +
tools/testing/selftests/kvm/lib/kvm_util.c | 196 ++++++++++++++++++
8 files changed, 303 insertions(+), 31 deletions(-)

--
2.35.1.1094.g7c7d902a7c-goog

2022-04-02 07:06:02

by Mingwei Zhang

[permalink] [raw]

Subject: [PATCH v3 6/6] selftests: KVM: use page stats to check if dirty logging works properly

When dirty logging is enabled, KVM will remap all accessed pages in
NPT/EPT at 4K. This property could be used to check if
the page stats metrics work properly in KVM mmu. At the same time, this
logic might be used the other way around: using page stats to verify if
dirty logging really splits all huge pages. Moreover, when dirty logging is
disabled, KVM zaps corresponding SPTEs and we could check whether the large
pages come back when guest touches the pages again.

So add page stats checking in dirty logging performance selftest. In
particular, add checks in three locations:
- just after vm is created;
- after populating memory into vm but before enabling dirty logging;
- finish dirty logging but before disabling it;
- behind the final iteration after disabling dirty logging.

Tested using commands:
- ./dirty_log_perf_test -s anonymous_hugetlb_1gb
- ./dirty_log_perf_test -s anonymous_hugetlb_2mb
- ./dirty_log_perf_test -s anonymous_thp

Cc: Sean Christopherson <[email protected]>
Cc: David Matlack <[email protected]>
Cc: Jing Zhang <[email protected]>
Cc: Peter Xu <[email protected]>

Suggested-by: Ben Gardon <[email protected]>
Reviewed-by: Ben Gardon <[email protected]>
Signed-off-by: Mingwei Zhang <[email protected]>
---
.../selftests/kvm/dirty_log_perf_test.c | 53 +++++++++++++++++++
1 file changed, 53 insertions(+)

diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
index c9d9e513ca04..dd48aabfff5c 100644
--- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
@@ -25,6 +25,10 @@
#define GICR_BASE_GPA 0x80A0000ULL
#endif

+#ifdef __x86_64__
+#include "processor.h"
+#endif
+
/* How many host loops to run by default (one KVM_GET_DIRTY_LOG for each loop)*/
#define TEST_HOST_LOOP_N 2UL

@@ -191,6 +195,14 @@ static void run_test(enum vm_guest_mode mode, void *arg)
p->slots, p->backing_src,
p->partition_vcpu_memory_access);

+#ifdef __x86_64__
+ TEST_ASSERT(vm_get_single_stat(vm, "pages_4k") == 0,
+ "4K page is non zero");
+ TEST_ASSERT(vm_get_single_stat(vm, "pages_2m") == 0,
+ "2M page is non zero");
+ TEST_ASSERT(vm_get_single_stat(vm, "pages_1g") == 0,
+ "1G page is non zero");
+#endif
perf_test_set_wr_fract(vm, p->wr_fract);

guest_num_pages = (nr_vcpus * guest_percpu_mem_size) >> vm_get_page_shift(vm);
@@ -232,6 +244,17 @@ static void run_test(enum vm_guest_mode mode, void *arg)
pr_info("Populate memory time: %ld.%.9lds\n",
ts_diff.tv_sec, ts_diff.tv_nsec);

+#ifdef __x86_64__
+ TEST_ASSERT(vm_get_single_stat(vm, "pages_4k") != 0,
+ "4K page is zero");
+ if (p->backing_src == VM_MEM_SRC_ANONYMOUS_THP ||
+ p->backing_src == VM_MEM_SRC_ANONYMOUS_HUGETLB_2MB)
+ TEST_ASSERT(vm_get_single_stat(vm, "pages_2m") != 0,
+ "2M page is zero");
+ if (p->backing_src == VM_MEM_SRC_ANONYMOUS_HUGETLB_1GB)
+ TEST_ASSERT(vm_get_single_stat(vm, "pages_1g") != 0,
+ "1G page is zero");
+#endif
/* Enable dirty logging */
clock_gettime(CLOCK_MONOTONIC, &start);
enable_dirty_logging(vm, p->slots);
@@ -277,6 +300,14 @@ static void run_test(enum vm_guest_mode mode, void *arg)
iteration, ts_diff.tv_sec, ts_diff.tv_nsec);
}
}
+#ifdef __x86_64__
+ TEST_ASSERT(vm_get_single_stat(vm, "pages_4k") != 0,
+ "4K page is zero after dirty logging");
+ TEST_ASSERT(vm_get_single_stat(vm, "pages_2m") == 0,
+ "2M page is non-zero after dirty logging");
+ TEST_ASSERT(vm_get_single_stat(vm, "pages_1g") == 0,
+ "1G page is non-zero after dirty logging");
+#endif

/* Disable dirty logging */
clock_gettime(CLOCK_MONOTONIC, &start);
@@ -285,6 +316,28 @@ static void run_test(enum vm_guest_mode mode, void *arg)
pr_info("Disabling dirty logging time: %ld.%.9lds\n",
ts_diff.tv_sec, ts_diff.tv_nsec);

+ /*
+ * Increment iteration to run the vcpus again to ensure all pages come
+ * back.
+ */
+ iteration++;
+ pr_info("Starting the final iteration to get all pages back.\n");
+ for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++) {
+ while (READ_ONCE(vcpu_last_completed_iteration[vcpu_id])
+ != iteration)
+ ;
+ }
+
+#ifdef __x86_64__
+ if (p->backing_src == VM_MEM_SRC_ANONYMOUS_THP ||
+ p->backing_src == VM_MEM_SRC_ANONYMOUS_HUGETLB_2MB)
+ TEST_ASSERT(vm_get_single_stat(vm, "pages_2m") != 0,
+ "2M page is zero");
+ if (p->backing_src == VM_MEM_SRC_ANONYMOUS_HUGETLB_1GB)
+ TEST_ASSERT(vm_get_single_stat(vm, "pages_1g") != 0,
+ "1G page is zero");
+#endif
+
/* Tell the vcpu thread to quit */
host_quit = true;
perf_test_join_vcpu_threads(nr_vcpus);
--
2.35.1.1094.g7c7d902a7c-goog

2022-04-03 12:51:40

by Mingwei Zhang

[permalink] [raw]

Subject: [PATCH v3 5/6] KVM: selftests: Test reading a single stat

From: Ben Gardon <[email protected]>

Retrieve the value of a single stat by name in the binary stats test to
ensure the kvm_util library functions work.

CC: Jing Zhang <[email protected]>
Signed-off-by: Ben Gardon <[email protected]>
---
.../selftests/kvm/include/kvm_util_base.h | 1 +
.../selftests/kvm/kvm_binary_stats_test.c | 3 ++
tools/testing/selftests/kvm/lib/kvm_util.c | 53 +++++++++++++++++++
3 files changed, 57 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index c5f4a67772cb..09ee70c0df26 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -401,6 +401,7 @@ void assert_on_unhandled_exception(struct kvm_vm *vm, uint32_t vcpuid);
int vm_get_stats_fd(struct kvm_vm *vm);
int vcpu_get_stats_fd(struct kvm_vm *vm, uint32_t vcpuid);
void dump_vm_stats(struct kvm_vm *vm);
+uint64_t vm_get_single_stat(struct kvm_vm *vm, const char *stat_name);

uint32_t guest_get_vcpuid(void);

diff --git a/tools/testing/selftests/kvm/kvm_binary_stats_test.c b/tools/testing/selftests/kvm/kvm_binary_stats_test.c
index afc4701ce8dd..97bde355f105 100644
--- a/tools/testing/selftests/kvm/kvm_binary_stats_test.c
+++ b/tools/testing/selftests/kvm/kvm_binary_stats_test.c
@@ -177,6 +177,9 @@ static void vm_stats_test(struct kvm_vm *vm)

/* Dump VM stats */
dump_vm_stats(vm);
+
+ /* Read a single stat. */
+ printf("remote_tlb_flush: %lu\n", vm_get_single_stat(vm, "remote_tlb_flush"));
}

static void vcpu_stats_test(struct kvm_vm *vm, int vcpu_id)
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 4d21c3b46780..1d3493d7fd55 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -2699,3 +2699,56 @@ void dump_vm_stats(struct kvm_vm *vm)
close(stats_fd);
}

+static int vm_get_stat_data(struct kvm_vm *vm, const char *stat_name,
+ uint64_t **data)
+{
+ struct kvm_stats_desc *stats_desc;
+ struct kvm_stats_header *header;
+ struct kvm_stats_desc *desc;
+ size_t size_desc;
+ int stats_fd;
+ int ret = -EINVAL;
+ int i;
+
+ *data = NULL;
+
+ stats_fd = vm_get_stats_fd(vm);
+
+ header = read_vm_stats_header(stats_fd);
+
+ stats_desc = read_vm_stats_desc(stats_fd, header);
+
+ size_desc = stats_desc_size(header);
+
+ /* Read kvm stats data one by one */
+ for (i = 0; i < header->num_desc; ++i) {
+ desc = (void *)stats_desc + (i * size_desc);
+
+ if (strcmp(desc->name, stat_name))
+ continue;
+
+ ret = read_stat_data(stats_fd, header, desc, data);
+ }
+
+ free(stats_desc);
+ free(header);
+
+ close(stats_fd);
+
+ return ret;
+}
+
+uint64_t vm_get_single_stat(struct kvm_vm *vm, const char *stat_name)
+{
+ uint64_t *data;
+ uint64_t value;
+ int ret;
+
+ ret = vm_get_stat_data(vm, stat_name, &data);
+ TEST_ASSERT(ret == 1, "Stat %s expected to have 1 element, but has %d",
+ stat_name, ret);
+ value = *data;
+ free(data);
+ return value;
+}
+
--
2.35.1.1094.g7c7d902a7c-goog

2022-04-04 06:22:02

by Mingwei Zhang

[permalink] [raw]

Subject: [PATCH v3 1/6] KVM: x86/mmu: Set lpage_disallowed in TDP MMU before setting SPTE

From: Sean Christopherson <[email protected]>

Set lpage_disallowed in TDP MMU shadow pages before making the SP visible
to other readers, i.e. before setting its SPTE. This will allow KVM to
query lpage_disallowed when determining if a shadow page can be replaced
by a NX huge page without violating the rules of the mitigation.

Reviewed-by: Mingwei Zhang <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 14 ++++++++++----
arch/x86/kvm/mmu/mmu_internal.h | 2 +-
arch/x86/kvm/mmu/tdp_mmu.c | 20 ++++++++++++--------
3 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1361eb4599b4..5cb845fae56e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -812,14 +812,20 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
kvm_mmu_gfn_disallow_lpage(slot, gfn);
}

-void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+void __account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
- if (sp->lpage_disallowed)
- return;
-
++kvm->stat.nx_lpage_splits;
list_add_tail(&sp->lpage_disallowed_link,
&kvm->arch.lpage_disallowed_mmu_pages);
+}
+
+static void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ if (sp->lpage_disallowed)
+ return;
+
+ __account_huge_nx_page(kvm, sp);
+
sp->lpage_disallowed = true;
}

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 1bff453f7cbe..4a0087efa1e3 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -168,7 +168,7 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_

void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);

-void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+void __account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);

#endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b3b6426725d4..f05423545e6d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1122,16 +1122,13 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
* @kvm: kvm instance
* @iter: a tdp_iter instance currently on the SPTE that should be set
* @sp: The new TDP page table to install.
- * @account_nx: True if this page table is being installed to split a
- * non-executable huge page.
* @shared: This operation is running under the MMU lock in read mode.
*
* Returns: 0 if the new page table was installed. Non-0 if the page table
* could not be installed (e.g. the atomic compare-exchange failed).
*/
static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
- struct kvm_mmu_page *sp, bool account_nx,
- bool shared)
+ struct kvm_mmu_page *sp, bool shared)
{
u64 spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);
int ret = 0;
@@ -1146,8 +1143,6 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,

spin_lock(&kvm->arch.tdp_mmu_pages_lock);
list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
- if (account_nx)
- account_huge_nx_page(kvm, sp);
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);

return 0;
@@ -1160,6 +1155,7 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
+ struct kvm *kvm = vcpu->kvm;
struct tdp_iter iter;
struct kvm_mmu_page *sp;
int ret;
@@ -1210,10 +1206,18 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
sp = tdp_mmu_alloc_sp(vcpu);
tdp_mmu_init_child_sp(sp, &iter);

- if (tdp_mmu_link_sp(vcpu->kvm, &iter, sp, account_nx, true)) {
+ sp->lpage_disallowed = account_nx;
+
+ if (tdp_mmu_link_sp(kvm, &iter, sp, true)) {
tdp_mmu_free_sp(sp);
break;
}
+
+ if (account_nx) {
+ spin_lock(&kvm->arch.tdp_mmu_pages_lock);
+ __account_huge_nx_page(kvm, sp);
+ spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+ }
}
}

@@ -1501,7 +1505,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
* correctness standpoint since the translation will be the same either
* way.
*/
- ret = tdp_mmu_link_sp(kvm, iter, sp, false, shared);
+ ret = tdp_mmu_link_sp(kvm, iter, sp, shared);
if (ret)
goto out;

--
2.35.1.1094.g7c7d902a7c-goog

2022-04-05 00:26:26

by Ben Gardon

[permalink] [raw]

Subject: Re: [PATCH v3 1/6] KVM: x86/mmu: Set lpage_disallowed in TDP MMU before setting SPTE

On Mon, Apr 4, 2022 at 11:20 AM Sean Christopherson <[email protected]> wrote:
>
> On Mon, Apr 04, 2022, Ben Gardon wrote:
> > On Thu, Mar 31, 2022 at 11:36 PM Mingwei Zhang <[email protected]> wrote:
> > > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > > index 1bff453f7cbe..4a0087efa1e3 100644
> > > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > > @@ -168,7 +168,7 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
> > >
> > > void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> > >
> > > -void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> > > +void __account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> >
> > I believe we need to modify the usage of this function in
> > paging_tmpl.h as well, at which point there should be no users of
> > account_huge_nx_page, so we can just modify the function directly
> > instead of adding a __helper.
> > (Disregard if the source I was looking at was out of date. Lots of
> > churn in this code recently.)
>
> paging_tmpl.h is shadow paging only, i.e. will always handled page faults with
> mmu_lock held for write and it also needs the check for sp->lpage_disallowed
> already being set. Only the TDP MMU code is special in that (a) it holds mmu_lock
> for read and (b) never reuses shadow pages when inserting into the page tables.
>
> Or did I completely misunderstand what you meant by "need to modify the usage"?

Ah right duh. For some reason I thought we were modifying __direct_map
in this commit too. Nevermind, no change needed.

2022-04-05 01:09:42

by Ben Gardon

[permalink] [raw]

Subject: Re: [PATCH v3 1/6] KVM: x86/mmu: Set lpage_disallowed in TDP MMU before setting SPTE

On Thu, Mar 31, 2022 at 11:36 PM Mingwei Zhang <[email protected]> wrote:
>
> From: Sean Christopherson <[email protected]>
>
> Set lpage_disallowed in TDP MMU shadow pages before making the SP visible
> to other readers, i.e. before setting its SPTE. This will allow KVM to
> query lpage_disallowed when determining if a shadow page can be replaced
> by a NX huge page without violating the rules of the mitigation.
>
> Reviewed-by: Mingwei Zhang <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 14 ++++++++++----
> arch/x86/kvm/mmu/mmu_internal.h | 2 +-
> arch/x86/kvm/mmu/tdp_mmu.c | 20 ++++++++++++--------
> 3 files changed, 23 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 1361eb4599b4..5cb845fae56e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -812,14 +812,20 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
> kvm_mmu_gfn_disallow_lpage(slot, gfn);
> }
>
> -void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> +void __account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
> - if (sp->lpage_disallowed)
> - return;
> -
> ++kvm->stat.nx_lpage_splits;
> list_add_tail(&sp->lpage_disallowed_link,
> &kvm->arch.lpage_disallowed_mmu_pages);
> +}
> +
> +static void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> +{
> + if (sp->lpage_disallowed)
> + return;
> +
> + __account_huge_nx_page(kvm, sp);
> +
> sp->lpage_disallowed = true;
> }
>
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 1bff453f7cbe..4a0087efa1e3 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -168,7 +168,7 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
>
> void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>
> -void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> +void __account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);

I believe we need to modify the usage of this function in
paging_tmpl.h as well, at which point there should be no users of
account_huge_nx_page, so we can just modify the function directly
instead of adding a __helper.
(Disregard if the source I was looking at was out of date. Lots of
churn in this code recently.)

> void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>
> #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index b3b6426725d4..f05423545e6d 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1122,16 +1122,13 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
> * @kvm: kvm instance
> * @iter: a tdp_iter instance currently on the SPTE that should be set
> * @sp: The new TDP page table to install.
> - * @account_nx: True if this page table is being installed to split a
> - * non-executable huge page.
> * @shared: This operation is running under the MMU lock in read mode.
> *
> * Returns: 0 if the new page table was installed. Non-0 if the page table
> * could not be installed (e.g. the atomic compare-exchange failed).
> */
> static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
> - struct kvm_mmu_page *sp, bool account_nx,
> - bool shared)
> + struct kvm_mmu_page *sp, bool shared)
> {
> u64 spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);
> int ret = 0;
> @@ -1146,8 +1143,6 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
>
> spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
> - if (account_nx)
> - account_huge_nx_page(kvm, sp);
> spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>
> return 0;
> @@ -1160,6 +1155,7 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
> int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> {
> struct kvm_mmu *mmu = vcpu->arch.mmu;
> + struct kvm *kvm = vcpu->kvm;
> struct tdp_iter iter;
> struct kvm_mmu_page *sp;
> int ret;
> @@ -1210,10 +1206,18 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> sp = tdp_mmu_alloc_sp(vcpu);
> tdp_mmu_init_child_sp(sp, &iter);
>
> - if (tdp_mmu_link_sp(vcpu->kvm, &iter, sp, account_nx, true)) {
> + sp->lpage_disallowed = account_nx;
> +
> + if (tdp_mmu_link_sp(kvm, &iter, sp, true)) {
> tdp_mmu_free_sp(sp);
> break;
> }
> +
> + if (account_nx) {
> + spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> + __account_huge_nx_page(kvm, sp);
> + spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> + }
> }
> }
>
> @@ -1501,7 +1505,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
> * correctness standpoint since the translation will be the same either
> * way.
> */
> - ret = tdp_mmu_link_sp(kvm, iter, sp, false, shared);
> + ret = tdp_mmu_link_sp(kvm, iter, sp, shared);
> if (ret)
> goto out;
>
> --
> 2.35.1.1094.g7c7d902a7c-goog
>

2022-04-05 03:05:33

by Mingwei Zhang

[permalink] [raw]

Subject: [PATCH v3 3/6] KVM: x86/mmu: explicitly check nx_hugepage in disallowed_hugepage_adjust()

Explicitly check if a NX huge page is disallowed when determining if a page
fault needs to be forced to use a smaller sized page. KVM incorrectly
assumes that the NX huge page mitigation is the only scenario where KVM
will create a shadow page instead of a huge page. Any scenario that causes
KVM to zap leaf SPTEs may result in having a SP that can be made huge
without violating the NX huge page mitigation. E.g. disabling of dirty
logging, zapping from mmu_notifier due to page migration, guest MTRR
changes that affect the viability of a huge page, etc...

Fixes: b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation")

Reviewed-by: Ben Gardon <[email protected]>
Signed-off-by: Mingwei Zhang <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5cb845fae56e..033609e8b332 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2896,6 +2896,16 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
cur_level == fault->goal_level &&
is_shadow_present_pte(spte) &&
!is_large_pte(spte)) {
+ struct kvm_mmu_page *sp;
+ u64 page_mask;
+
+ sp = to_shadow_page(spte & PT64_BASE_ADDR_MASK);
+
+ /* Prevent lpage_disallowed read from moving ahead. */
+ smp_rmb();
+
+ if (!sp->lpage_disallowed)
+ return;
/*
* A small SPTE exists for this pfn, but FNAME(fetch)
* and __direct_map would like to create a large PTE
@@ -2903,8 +2913,8 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
* patching back for them into pfn the next 9 bits of
* the address.
*/
- u64 page_mask = KVM_PAGES_PER_HPAGE(cur_level) -
- KVM_PAGES_PER_HPAGE(cur_level - 1);
+ page_mask = KVM_PAGES_PER_HPAGE(cur_level) -
+ KVM_PAGES_PER_HPAGE(cur_level - 1);
fault->pfn |= fault->gfn & page_mask;
fault->goal_level--;
}
--
2.35.1.1094.g7c7d902a7c-goog

2022-04-05 03:06:54

by Sean Christopherson

[permalink] [raw]

Subject: Re: [PATCH v3 1/6] KVM: x86/mmu: Set lpage_disallowed in TDP MMU before setting SPTE

On Mon, Apr 04, 2022, Ben Gardon wrote:
> On Thu, Mar 31, 2022 at 11:36 PM Mingwei Zhang <[email protected]> wrote:
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index 1bff453f7cbe..4a0087efa1e3 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -168,7 +168,7 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
> >
> > void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> >
> > -void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> > +void __account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>
> I believe we need to modify the usage of this function in
> paging_tmpl.h as well, at which point there should be no users of
> account_huge_nx_page, so we can just modify the function directly
> instead of adding a __helper.
> (Disregard if the source I was looking at was out of date. Lots of
> churn in this code recently.)

paging_tmpl.h is shadow paging only, i.e. will always handled page faults with
mmu_lock held for write and it also needs the check for sp->lpage_disallowed
already being set. Only the TDP MMU code is special in that (a) it holds mmu_lock
for read and (b) never reuses shadow pages when inserting into the page tables.

Or did I completely misunderstand what you meant by "need to modify the usage"?