2022-03-24 01:42:11

by Mingwei Zhang

Subject: [PATCH 0/4] Verify dirty logging works properly with page stats

This patchset verifies that dirty logging works properly by checking page
stats from the per-VM stats interface. We discovered a performance bug in
disallowed_hugepage_adjust() which prevents KVM from recovering large pages
for the guest. The selftest logic added in the last patch helps validate
the problem.

The patchset borrows two patches from Ben's series "[PATCH 00/13]
KVM: x86: Add a cap to disable NX hugepages on a VM" [1], which complete
the selftest library functions needed to use the stats interface.

[1] https://lore.kernel.org/all/[email protected]/T/
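
As a preview of the approach (an illustrative sketch, not part of the diffs
below), the checks added in patch 4 boil down to assertions like the
following, which assumes the vm_get_single_stat() helper introduced in
patch 2 and a VM whose memory has just been split by dirty logging:

#ifdef __x86_64__
	/* Dirty logging forces 4K mappings, so no huge pages should remain. */
	TEST_ASSERT(vm_get_single_stat(vm, "pages_4k") != 0,
		    "expected 4K mappings after dirty logging");
	TEST_ASSERT(vm_get_single_stat(vm, "pages_2m") == 0,
		    "unexpected 2M mappings after dirty logging");
	TEST_ASSERT(vm_get_single_stat(vm, "pages_1g") == 0,
		    "unexpected 1G mappings after dirty logging");
#endif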

Ben Gardon (2):
selftests: KVM: Dump VM stats in binary stats test
selftests: KVM: Test reading a single stat

Mingwei Zhang (2):
KVM: x86/mmu: explicitly check nx_hugepage in
disallowed_hugepage_adjust()
selftests: KVM: use dirty logging to check if page stats work
correctly

arch/x86/kvm/mmu/mmu.c | 14 +-
.../selftests/kvm/dirty_log_perf_test.c | 52 +++++
.../selftests/kvm/include/kvm_util_base.h | 2 +
.../selftests/kvm/kvm_binary_stats_test.c | 6 +
tools/testing/selftests/kvm/lib/kvm_util.c | 196 ++++++++++++++++++
5 files changed, 268 insertions(+), 2 deletions(-)

--
2.35.1.894.gb6a874cedc-goog


2022-03-24 02:09:26

by Mingwei Zhang

Subject: [PATCH v2 0/4] Verify dirty logging works properly with page stats

This patchset verifies that dirty logging works properly by checking page
stats from the per-VM stats interface. We discovered a performance bug in
disallowed_hugepage_adjust() which prevents KVM from recovering large pages
for the guest. The selftest logic added in the last patch helps validate
the problem.

The patchset borrows two patches from Ben's series "[PATCH 00/13]
KVM: x86: Add a cap to disable NX hugepages on a VM" [1], which complete
the selftest library functions needed to use the stats interface.

[1] https://lore.kernel.org/all/[email protected]/T/

v1 -> v2:
- Update the commit message. [dmatlack]
- Update the comments in patch 3/4 to clarify the motivation. [bgardon]
- Add another iteration in dirty_log_perf_test to regain pages. [bgardon]

v1: https://lore.kernel.org/all/[email protected]/

Ben Gardon (2):
selftests: KVM: Dump VM stats in binary stats test
selftests: KVM: Test reading a single stat

Mingwei Zhang (2):
KVM: x86/mmu: explicitly check nx_hugepage in
disallowed_hugepage_adjust()
selftests: KVM: use dirty logging to check if page stats work
correctly

arch/x86/kvm/mmu/mmu.c | 14 +-
.../selftests/kvm/dirty_log_perf_test.c | 53 +++++
.../selftests/kvm/include/kvm_util_base.h | 2 +
.../selftests/kvm/kvm_binary_stats_test.c | 6 +
tools/testing/selftests/kvm/lib/kvm_util.c | 196 ++++++++++++++++++
5 files changed, 269 insertions(+), 2 deletions(-)

--
2.35.1.1021.g381101b075-goog

2022-03-24 16:54:09

by Mingwei Zhang

Subject: [PATCH v2 4/4] selftests: KVM: use dirty logging to check if page stats work correctly

When dirty logging is enabled, KVM remaps all accessed pages in NPT/EPT
at 4K. This property can be used to check whether the page stats metrics
work properly in the KVM MMU. At the same time, the logic can be used the
other way around: page stats can verify whether dirty logging really
splits all huge pages. Moreover, when dirty logging is disabled, KVM zaps
the corresponding SPTEs, so we can check whether the large pages come back
when the guest touches the pages again.

So add page stats checks to the dirty logging performance selftest. In
particular, add checks in four locations:
- just after the VM is created;
- after populating memory into the VM but before enabling dirty logging;
- after finishing dirty logging but before disabling it;
- after the final iteration that follows disabling dirty logging.

Tested using commands:
- ./dirty_log_perf_test -s anonymous_hugetlb_1gb
- ./dirty_log_perf_test -s anonymous_hugetlb_2mb
- ./dirty_log_perf_test -s anonymous_thp

Cc: Sean Christopherson <[email protected]>
Cc: David Matlack <[email protected]>
Cc: Jing Zhang <[email protected]>
Cc: Peter Xu <[email protected]>

Suggested-by: Ben Gardon <[email protected]>
Signed-off-by: Mingwei Zhang <[email protected]>
---
.../selftests/kvm/dirty_log_perf_test.c | 53 +++++++++++++++++++
1 file changed, 53 insertions(+)

diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
index 1954b964d1cf..21431b0f5547 100644
--- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
@@ -19,6 +19,10 @@
#include "perf_test_util.h"
#include "guest_modes.h"

+#ifdef __x86_64__
+#include "processor.h"
+#endif
+
/* How many host loops to run by default (one KVM_GET_DIRTY_LOG for each loop)*/
#define TEST_HOST_LOOP_N 2UL

@@ -185,6 +189,14 @@ static void run_test(enum vm_guest_mode mode, void *arg)
p->slots, p->backing_src,
p->partition_vcpu_memory_access);

+#ifdef __x86_64__
+ TEST_ASSERT(vm_get_single_stat(vm, "pages_4k") == 0,
+ "4K page is non zero");
+ TEST_ASSERT(vm_get_single_stat(vm, "pages_2m") == 0,
+ "2M page is non zero");
+ TEST_ASSERT(vm_get_single_stat(vm, "pages_1g") == 0,
+ "1G page is non zero");
+#endif
perf_test_set_wr_fract(vm, p->wr_fract);

guest_num_pages = (nr_vcpus * guest_percpu_mem_size) >> vm_get_page_shift(vm);
@@ -222,6 +234,17 @@ static void run_test(enum vm_guest_mode mode, void *arg)
pr_info("Populate memory time: %ld.%.9lds\n",
ts_diff.tv_sec, ts_diff.tv_nsec);

+#ifdef __x86_64__
+ TEST_ASSERT(vm_get_single_stat(vm, "pages_4k") != 0,
+ "4K page is zero");
+ if (p->backing_src == VM_MEM_SRC_ANONYMOUS_THP ||
+ p->backing_src == VM_MEM_SRC_ANONYMOUS_HUGETLB_2MB)
+ TEST_ASSERT(vm_get_single_stat(vm, "pages_2m") != 0,
+ "2M page is zero");
+ if (p->backing_src == VM_MEM_SRC_ANONYMOUS_HUGETLB_1GB)
+ TEST_ASSERT(vm_get_single_stat(vm, "pages_1g") != 0,
+ "1G page is zero");
+#endif
/* Enable dirty logging */
clock_gettime(CLOCK_MONOTONIC, &start);
enable_dirty_logging(vm, p->slots);
@@ -267,6 +290,14 @@ static void run_test(enum vm_guest_mode mode, void *arg)
iteration, ts_diff.tv_sec, ts_diff.tv_nsec);
}
}
+#ifdef __x86_64__
+ TEST_ASSERT(vm_get_single_stat(vm, "pages_4k") != 0,
+ "4K page is zero after dirty logging");
+ TEST_ASSERT(vm_get_single_stat(vm, "pages_2m") == 0,
+ "2M page is non-zero after dirty logging");
+ TEST_ASSERT(vm_get_single_stat(vm, "pages_1g") == 0,
+ "1G page is non-zero after dirty logging");
+#endif

/* Disable dirty logging */
clock_gettime(CLOCK_MONOTONIC, &start);
@@ -275,6 +306,28 @@ static void run_test(enum vm_guest_mode mode, void *arg)
pr_info("Disabling dirty logging time: %ld.%.9lds\n",
ts_diff.tv_sec, ts_diff.tv_nsec);

+ /*
+ * Increment iteration to run the vcpus again to ensure all pages come
+ * back.
+ */
+ iteration++;
+ pr_info("Starting the final iteration to get all pages back.\n");
+ for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++) {
+ while (READ_ONCE(vcpu_last_completed_iteration[vcpu_id])
+ != iteration)
+ ;
+ }
+
+#ifdef __x86_64__
+ if (p->backing_src == VM_MEM_SRC_ANONYMOUS_THP ||
+ p->backing_src == VM_MEM_SRC_ANONYMOUS_HUGETLB_2MB)
+ TEST_ASSERT(vm_get_single_stat(vm, "pages_2m") != 0,
+ "2M page is zero");
+ if (p->backing_src == VM_MEM_SRC_ANONYMOUS_HUGETLB_1GB)
+ TEST_ASSERT(vm_get_single_stat(vm, "pages_1g") != 0,
+ "1G page is zero");
+#endif
+
/* Tell the vcpu thread to quit */
host_quit = true;
perf_test_join_vcpu_threads(nr_vcpus);
--
2.35.1.1021.g381101b075-goog

2022-03-24 19:00:25

by Mingwei Zhang

Subject: [PATCH v2 1/4] selftests: KVM: Dump VM stats in binary stats test

From: Ben Gardon <[email protected]>

Add kvm_util library functions to read KVM stats through the binary
stats interface and then dump them to stdout when running the binary
stats test. Subsequent commits will extend the kvm_util code and use it
to make assertions in a test for NX hugepages.
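
For orientation (an editor's sketch, not part of the patch), the binary stats
file the helpers below walk is laid out as a kvm_stats_header, an id string,
an array of kvm_stats_desc entries, and the stat data itself. A condensed,
self-contained read of the first stat might look like the following; the
helper name show_first_stat() is hypothetical, and the structs come from
<linux/kvm.h> on kernels that support KVM_GET_STATS_FD:

#include <linux/kvm.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Print the name and value of the first stat exposed by a KVM stats fd. */
static void show_first_stat(int stats_fd)
{
	struct kvm_stats_header hdr;
	struct kvm_stats_desc *desc;
	uint64_t value;

	/* The header sits at offset 0 and locates the other regions. */
	if (read(stats_fd, &hdr, sizeof(hdr)) != sizeof(hdr))
		return;

	/* Each descriptor occupies sizeof(*desc) + name_size bytes. */
	desc = malloc(sizeof(*desc) + hdr.name_size);
	if (!desc)
		return;
	if (pread(stats_fd, desc, sizeof(*desc) + hdr.name_size,
		  hdr.desc_offset) < 0)
		goto out;

	/* A stat's data lives at data_offset + desc->offset. */
	if (pread(stats_fd, &value, sizeof(value),
		  hdr.data_offset + desc->offset) == sizeof(value))
		printf("%s = %lu\n", desc->name, (unsigned long)value);
out:
	free(desc);
}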

CC: Jing Zhang <[email protected]>
Signed-off-by: Ben Gardon <[email protected]>
---
.../selftests/kvm/include/kvm_util_base.h | 1 +
.../selftests/kvm/kvm_binary_stats_test.c | 3 +
tools/testing/selftests/kvm/lib/kvm_util.c | 143 ++++++++++++++++++
3 files changed, 147 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 4ed6aa049a91..160b9ad8474a 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -392,6 +392,7 @@ void assert_on_unhandled_exception(struct kvm_vm *vm, uint32_t vcpuid);

int vm_get_stats_fd(struct kvm_vm *vm);
int vcpu_get_stats_fd(struct kvm_vm *vm, uint32_t vcpuid);
+void dump_vm_stats(struct kvm_vm *vm);

uint32_t guest_get_vcpuid(void);

diff --git a/tools/testing/selftests/kvm/kvm_binary_stats_test.c b/tools/testing/selftests/kvm/kvm_binary_stats_test.c
index 17f65d514915..afc4701ce8dd 100644
--- a/tools/testing/selftests/kvm/kvm_binary_stats_test.c
+++ b/tools/testing/selftests/kvm/kvm_binary_stats_test.c
@@ -174,6 +174,9 @@ static void vm_stats_test(struct kvm_vm *vm)
stats_test(stats_fd);
close(stats_fd);
TEST_ASSERT(fcntl(stats_fd, F_GETFD) == -1, "Stats fd not freed");
+
+ /* Dump VM stats */
+ dump_vm_stats(vm);
}

static void vcpu_stats_test(struct kvm_vm *vm, int vcpu_id)
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index d8cf851ab119..d9c660913403 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -2517,3 +2517,146 @@ int vcpu_get_stats_fd(struct kvm_vm *vm, uint32_t vcpuid)

return ioctl(vcpu->fd, KVM_GET_STATS_FD, NULL);
}
+
+/* Caller is responsible for freeing the returned kvm_stats_header. */
+static struct kvm_stats_header *read_vm_stats_header(int stats_fd)
+{
+ struct kvm_stats_header *header;
+ ssize_t ret;
+
+ /* Read kvm stats header */
+ header = malloc(sizeof(*header));
+ TEST_ASSERT(header, "Allocate memory for stats header");
+
+ ret = read(stats_fd, header, sizeof(*header));
+ TEST_ASSERT(ret == sizeof(*header), "Read stats header");
+
+ return header;
+}
+
+static void dump_header(int stats_fd, struct kvm_stats_header *header)
+{
+ ssize_t ret;
+ char *id;
+
+ printf("flags: %u\n", header->flags);
+ printf("name size: %u\n", header->name_size);
+ printf("num_desc: %u\n", header->num_desc);
+ printf("id_offset: %u\n", header->id_offset);
+ printf("desc_offset: %u\n", header->desc_offset);
+ printf("data_offset: %u\n", header->data_offset);
+
+ /* Read kvm stats id string */
+ id = malloc(header->name_size);
+ TEST_ASSERT(id, "Allocate memory for id string");
+ ret = pread(stats_fd, id, header->name_size, header->id_offset);
+ TEST_ASSERT(ret == header->name_size, "Read id string");
+
+ printf("id: %s\n", id);
+
+ free(id);
+}
+
+static ssize_t stats_desc_size(struct kvm_stats_header *header)
+{
+ return sizeof(struct kvm_stats_desc) + header->name_size;
+}
+
+/* Caller is responsible for freeing the returned kvm_stats_desc. */
+static struct kvm_stats_desc *read_vm_stats_desc(int stats_fd,
+ struct kvm_stats_header *header)
+{
+ struct kvm_stats_desc *stats_desc;
+ size_t size_desc;
+ ssize_t ret;
+
+ size_desc = header->num_desc * stats_desc_size(header);
+
+ /* Allocate memory for stats descriptors */
+ stats_desc = malloc(size_desc);
+ TEST_ASSERT(stats_desc, "Allocate memory for stats descriptors");
+
+ /* Read kvm stats descriptors */
+ ret = pread(stats_fd, stats_desc, size_desc, header->desc_offset);
+ TEST_ASSERT(ret == size_desc, "Read KVM stats descriptors");
+
+ return stats_desc;
+}
+
+/* Caller is responsible for freeing the memory *data. */
+static int read_stat_data(int stats_fd, struct kvm_stats_header *header,
+ struct kvm_stats_desc *desc, uint64_t **data)
+{
+ u64 *stats_data;
+ ssize_t ret;
+
+ stats_data = malloc(desc->size * sizeof(*stats_data));
+
+ ret = pread(stats_fd, stats_data, desc->size * sizeof(*stats_data),
+ header->data_offset + desc->offset);
+
+ /* ret is in bytes. */
+ ret = ret / sizeof(*stats_data);
+
+ TEST_ASSERT(ret == desc->size,
+ "Read data of KVM stats: %s", desc->name);
+
+ *data = stats_data;
+
+ return ret;
+}
+
+static void dump_stat(int stats_fd, struct kvm_stats_header *header,
+ struct kvm_stats_desc *desc)
+{
+ u64 *stats_data;
+ ssize_t ret;
+ int i;
+
+ printf("\tflags: %u\n", desc->flags);
+ printf("\texponent: %u\n", desc->exponent);
+ printf("\tsize: %u\n", desc->size);
+ printf("\toffset: %u\n", desc->offset);
+ printf("\tbucket_size: %u\n", desc->bucket_size);
+ printf("\tname: %s\n", (char *)&desc->name);
+
+ ret = read_stat_data(stats_fd, header, desc, &stats_data);
+
+ printf("\tdata: %lu", *stats_data);
+ for (i = 1; i < ret; i++)
+ printf(", %lu", *(stats_data + i));
+ printf("\n\n");
+
+ free(stats_data);
+}
+
+void dump_vm_stats(struct kvm_vm *vm)
+{
+ struct kvm_stats_desc *stats_desc;
+ struct kvm_stats_header *header;
+ struct kvm_stats_desc *desc;
+ size_t size_desc;
+ int stats_fd;
+ int i;
+
+ stats_fd = vm_get_stats_fd(vm);
+
+ header = read_vm_stats_header(stats_fd);
+ dump_header(stats_fd, header);
+
+ stats_desc = read_vm_stats_desc(stats_fd, header);
+
+ size_desc = stats_desc_size(header);
+
+ /* Read kvm stats data one by one */
+ for (i = 0; i < header->num_desc; ++i) {
+ desc = (void *)stats_desc + (i * size_desc);
+ dump_stat(stats_fd, header, desc);
+ }
+
+ free(stats_desc);
+ free(header);
+
+ close(stats_fd);
+}
+
--
2.35.1.1021.g381101b075-goog

2022-03-24 23:46:30

by Mingwei Zhang

Subject: [PATCH v2 2/4] selftests: KVM: Test reading a single stat

From: Ben Gardon <[email protected]>

Retrieve the value of a single stat by name in the binary stats test to
ensure the kvm_util library functions work.

CC: Jing Zhang <[email protected]>
Signed-off-by: Ben Gardon <[email protected]>
---
.../selftests/kvm/include/kvm_util_base.h | 1 +
.../selftests/kvm/kvm_binary_stats_test.c | 3 ++
tools/testing/selftests/kvm/lib/kvm_util.c | 53 +++++++++++++++++++
3 files changed, 57 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 160b9ad8474a..a07964c95941 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -393,6 +393,7 @@ void assert_on_unhandled_exception(struct kvm_vm *vm, uint32_t vcpuid);
int vm_get_stats_fd(struct kvm_vm *vm);
int vcpu_get_stats_fd(struct kvm_vm *vm, uint32_t vcpuid);
void dump_vm_stats(struct kvm_vm *vm);
+uint64_t vm_get_single_stat(struct kvm_vm *vm, const char *stat_name);

uint32_t guest_get_vcpuid(void);

diff --git a/tools/testing/selftests/kvm/kvm_binary_stats_test.c b/tools/testing/selftests/kvm/kvm_binary_stats_test.c
index afc4701ce8dd..97bde355f105 100644
--- a/tools/testing/selftests/kvm/kvm_binary_stats_test.c
+++ b/tools/testing/selftests/kvm/kvm_binary_stats_test.c
@@ -177,6 +177,9 @@ static void vm_stats_test(struct kvm_vm *vm)

/* Dump VM stats */
dump_vm_stats(vm);
+
+ /* Read a single stat. */
+ printf("remote_tlb_flush: %lu\n", vm_get_single_stat(vm, "remote_tlb_flush"));
}

static void vcpu_stats_test(struct kvm_vm *vm, int vcpu_id)
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index d9c660913403..dad54f5d57e7 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -2660,3 +2660,56 @@ void dump_vm_stats(struct kvm_vm *vm)
close(stats_fd);
}

+static int vm_get_stat_data(struct kvm_vm *vm, const char *stat_name,
+ uint64_t **data)
+{
+ struct kvm_stats_desc *stats_desc;
+ struct kvm_stats_header *header;
+ struct kvm_stats_desc *desc;
+ size_t size_desc;
+ int stats_fd;
+ int ret = -EINVAL;
+ int i;
+
+ *data = NULL;
+
+ stats_fd = vm_get_stats_fd(vm);
+
+ header = read_vm_stats_header(stats_fd);
+
+ stats_desc = read_vm_stats_desc(stats_fd, header);
+
+ size_desc = stats_desc_size(header);
+
+ /* Read kvm stats data one by one */
+ for (i = 0; i < header->num_desc; ++i) {
+ desc = (void *)stats_desc + (i * size_desc);
+
+ if (strcmp(desc->name, stat_name))
+ continue;
+
+ ret = read_stat_data(stats_fd, header, desc, data);
+ }
+
+ free(stats_desc);
+ free(header);
+
+ close(stats_fd);
+
+ return ret;
+}
+
+uint64_t vm_get_single_stat(struct kvm_vm *vm, const char *stat_name)
+{
+ uint64_t *data;
+ uint64_t value;
+ int ret;
+
+ ret = vm_get_stat_data(vm, stat_name, &data);
+ TEST_ASSERT(ret == 1, "Stat %s expected to have 1 element, but has %d",
+ stat_name, ret);
+ value = *data;
+ free(data);
+ return value;
+}
+
--
2.35.1.1021.g381101b075-goog

2022-03-25 16:06:42

by Ben Gardon

Subject: Re: [PATCH v2 4/4] selftests: KVM: use dirty logging to check if page stats work correctly

On Wed, Mar 23, 2022 at 11:49 AM Mingwei Zhang <[email protected]> wrote:
>
> When dirty logging is enabled, KVM remaps all accessed pages in NPT/EPT
> at 4K. This property can be used to check whether the page stats metrics
> work properly in the KVM MMU. At the same time, the logic can be used the
> other way around: page stats can verify whether dirty logging really
> splits all huge pages. Moreover, when dirty logging is disabled, KVM zaps
> the corresponding SPTEs, so we can check whether the large pages come back
> when the guest touches the pages again.
>
> So add page stats checks to the dirty logging performance selftest. In
> particular, add checks in four locations:
> - just after the VM is created;
> - after populating memory into the VM but before enabling dirty logging;
> - after finishing dirty logging but before disabling it;
> - after the final iteration that follows disabling dirty logging.
>
> Tested using commands:
> - ./dirty_log_perf_test -s anonymous_hugetlb_1gb
> - ./dirty_log_perf_test -s anonymous_hugetlb_2mb
> - ./dirty_log_perf_test -s anonymous_thp
>
> Cc: Sean Christopherson <[email protected]>
> Cc: David Matlack <[email protected]>
> Cc: Jing Zhang <[email protected]>
> Cc: Peter Xu <[email protected]>
>
> Suggested-by: Ben Gardon <[email protected]>
> Signed-off-by: Mingwei Zhang <[email protected]>

Reviewed-by: Ben Gardon <[email protected]>

> ---
> .../selftests/kvm/dirty_log_perf_test.c | 53 +++++++++++++++++++
> 1 file changed, 53 insertions(+)
>
> diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
> index 1954b964d1cf..21431b0f5547 100644
> --- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
> +++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
> @@ -19,6 +19,10 @@
> #include "perf_test_util.h"
> #include "guest_modes.h"
>
> +#ifdef __x86_64__
> +#include "processor.h"
> +#endif
> +
> /* How many host loops to run by default (one KVM_GET_DIRTY_LOG for each loop)*/
> #define TEST_HOST_LOOP_N 2UL
>
> @@ -185,6 +189,14 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> p->slots, p->backing_src,
> p->partition_vcpu_memory_access);
>
> +#ifdef __x86_64__
> + TEST_ASSERT(vm_get_single_stat(vm, "pages_4k") == 0,
> + "4K page is non zero");
> + TEST_ASSERT(vm_get_single_stat(vm, "pages_2m") == 0,
> + "2M page is non zero");
> + TEST_ASSERT(vm_get_single_stat(vm, "pages_1g") == 0,
> + "1G page is non zero");
> +#endif
> perf_test_set_wr_fract(vm, p->wr_fract);
>
> guest_num_pages = (nr_vcpus * guest_percpu_mem_size) >> vm_get_page_shift(vm);
> @@ -222,6 +234,17 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> pr_info("Populate memory time: %ld.%.9lds\n",
> ts_diff.tv_sec, ts_diff.tv_nsec);
>
> +#ifdef __x86_64__
> + TEST_ASSERT(vm_get_single_stat(vm, "pages_4k") != 0,
> + "4K page is zero");
> + if (p->backing_src == VM_MEM_SRC_ANONYMOUS_THP ||
> + p->backing_src == VM_MEM_SRC_ANONYMOUS_HUGETLB_2MB)
> + TEST_ASSERT(vm_get_single_stat(vm, "pages_2m") != 0,
> + "2M page is zero");
> + if (p->backing_src == VM_MEM_SRC_ANONYMOUS_HUGETLB_1GB)
> + TEST_ASSERT(vm_get_single_stat(vm, "pages_1g") != 0,
> + "1G page is zero");
> +#endif
> /* Enable dirty logging */
> clock_gettime(CLOCK_MONOTONIC, &start);
> enable_dirty_logging(vm, p->slots);
> @@ -267,6 +290,14 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> iteration, ts_diff.tv_sec, ts_diff.tv_nsec);
> }
> }
> +#ifdef __x86_64__
> + TEST_ASSERT(vm_get_single_stat(vm, "pages_4k") != 0,
> + "4K page is zero after dirty logging");
> + TEST_ASSERT(vm_get_single_stat(vm, "pages_2m") == 0,
> + "2M page is non-zero after dirty logging");
> + TEST_ASSERT(vm_get_single_stat(vm, "pages_1g") == 0,
> + "1G page is non-zero after dirty logging");
> +#endif
>
> /* Disable dirty logging */
> clock_gettime(CLOCK_MONOTONIC, &start);
> @@ -275,6 +306,28 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> pr_info("Disabling dirty logging time: %ld.%.9lds\n",
> ts_diff.tv_sec, ts_diff.tv_nsec);
>
> + /*
> + * Increment iteration to run the vcpus again to ensure all pages come
> + * back.
> + */
> + iteration++;
> + pr_info("Starting the final iteration to get all pages back.\n");
> + for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++) {
> + while (READ_ONCE(vcpu_last_completed_iteration[vcpu_id])
> + != iteration)
> + ;
> + }
> +
> +#ifdef __x86_64__
> + if (p->backing_src == VM_MEM_SRC_ANONYMOUS_THP ||
> + p->backing_src == VM_MEM_SRC_ANONYMOUS_HUGETLB_2MB)
> + TEST_ASSERT(vm_get_single_stat(vm, "pages_2m") != 0,
> + "2M page is zero");
> + if (p->backing_src == VM_MEM_SRC_ANONYMOUS_HUGETLB_1GB)
> + TEST_ASSERT(vm_get_single_stat(vm, "pages_1g") != 0,
> + "1G page is zero");
> +#endif
> +
> /* Tell the vcpu thread to quit */
> host_quit = true;
> perf_test_join_vcpu_threads(nr_vcpus);
> --
> 2.35.1.1021.g381101b075-goog
>

2022-03-25 18:24:30

by Mingwei Zhang

[permalink] [raw]
Subject: [PATCH v2 3/4] KVM: x86/mmu: explicitly check nx_hugepage in disallowed_hugepage_adjust()

Add an extra check for the nx hugepage case and allow KVM to reconstruct
large mappings after dirty logging is disabled. The existing code works
only for nx hugepages, but its condition is too general in that it does
not consider other use cases (such as dirty logging). Note that when dirty
logging is disabled, KVM calls kvm_mmu_zap_collapsible_sptes(), which only
zaps leaf SPTEs. Moreover, the existing code assumes that a present PMD or
PUD indicates that there exist 'smaller SPTEs' under the paging structure.
This assumption may not be true if KVM zaps only leaf SPTEs in the MMU.

Missing the check causes KVM to incorrectly regard the faulting page as an
NX huge page and refuse to map it at the desired level, which leads to a
bad performance issue in the shadow MMU and potentially in the TDP MMU as
well.

Fixes: b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation")
Cc: [email protected]

Reviewed-by: Ben Gardon <[email protected]>
Signed-off-by: Mingwei Zhang <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5628d0ba637e..d9b2001d8217 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2919,6 +2919,16 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
cur_level == fault->goal_level &&
is_shadow_present_pte(spte) &&
!is_large_pte(spte)) {
+ struct kvm_mmu_page *sp;
+ u64 page_mask;
+ /*
+ * When nx hugepage flag is not set, there is no reason to go
+ * down to another level. This helps KVM re-generate large
+ * mappings after dirty logging disabled.
+ */
+ sp = to_shadow_page(spte & PT64_BASE_ADDR_MASK);
+ if (!sp->lpage_disallowed)
+ return;
/*
* A small SPTE exists for this pfn, but FNAME(fetch)
* and __direct_map would like to create a large PTE
@@ -2926,8 +2936,8 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
* patching back for them into pfn the next 9 bits of
* the address.
*/
- u64 page_mask = KVM_PAGES_PER_HPAGE(cur_level) -
- KVM_PAGES_PER_HPAGE(cur_level - 1);
+ page_mask = KVM_PAGES_PER_HPAGE(cur_level) -
+ KVM_PAGES_PER_HPAGE(cur_level - 1);
fault->pfn |= fault->gfn & page_mask;
fault->goal_level--;
}
--
2.35.1.1021.g381101b075-goog

2022-03-30 01:43:17

by Sean Christopherson

Subject: Re: [PATCH v2 3/4] KVM: x86/mmu: explicitly check nx_hugepage in disallowed_hugepage_adjust()

+Yosry

On Wed, Mar 23, 2022, Mingwei Zhang wrote:
> Add an extra check for the nx hugepage case and allow KVM to reconstruct
> large mappings after dirty logging is disabled. The existing code works
> only for nx hugepages, but its condition is too general in that it does
> not consider other use cases (such as dirty logging). Note that when
> dirty logging is disabled, KVM calls kvm_mmu_zap_collapsible_sptes(),
> which only zaps leaf SPTEs.

This opening is very, very confusing. A big part of the confusion is due to poor
naming in KVM, where kvm_mmu_page.lpage_disallowed really should be
nx_huge_page_disalowed. Enabling dirty logging also disables huge pages, but that
goes through the memslot's disallow_lpage. Reading through this paragraph, and
looking at the code, it sounds like disallowed_hugepage_adjust() is explicitly
checking if a page is disallowed due to dirty logging, which is not the case.

I'd prefer the changelog lead with the bug it's fixing and only briefly mention
dirty logging as a scenario where KVM can end up with shadow pages without child
SPTEs. Such scenarios can also happen with mmu_notifier calls, etc...

E.g.

Explicitly check if a NX huge page is disallowed when determining if a page
fault needs to be forced to use a smaller sized page. KVM incorrectly
assumes that the NX huge page mitigation is the only scenario where KVM
will create a shadow page instead of a huge page. Any scenario that causes
KVM to zap leaf SPTEs may result in having a SP that can be made huge
without violating the NX huge page mitigation. E.g. disabling of dirty
logging, zapping from mmu_notifier due to page migration, guest MTRR changes
that affect the viability of a huge page, etc...

> Moreover, the existing code assumes that a present PMD or PUD indicates
> that there exist 'smaller SPTEs' under the paging structure. This
> assumption may not be true if KVM zaps only leaf SPTEs in the MMU.
>
> Missing the check causes KVM to incorrectly regard the faulting page as
> an NX huge page and refuse to map it at the desired level, which leads
> to a bad performance issue in the shadow MMU and potentially in the TDP
> MMU as well.
>
> Fixes: b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation")
> Cc: [email protected]

I vote to keep the Fixes tag, but drop the Cc: stable@ and NAK any MANUALSEL/AUTOSEL
for backporting this to stable trees. Yes, it's a bug, but the NX huge page zapping
kthread will (eventually) reclaim the lost performance. On the flip side, if there's
an edge case we mess up (see below), then we've introduced a far worse bug.

> Reviewed-by: Ben Gardon <[email protected]>
> Signed-off-by: Mingwei Zhang <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 14 ++++++++++++--
> 1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 5628d0ba637e..d9b2001d8217 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2919,6 +2919,16 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
> cur_level == fault->goal_level &&
> is_shadow_present_pte(spte) &&
> !is_large_pte(spte)) {
> + struct kvm_mmu_page *sp;
> + u64 page_mask;
> + /*
> + * When nx hugepage flag is not set, there is no reason to go
> + * down to another level. This helps KVM re-generate large
> + * mappings after dirty logging disabled.
> + */

Honestly, I'd just omit the comment. Again, IMO the blurb about dirty logging
does more harm than good because it makes it sound like this is somehow unique to
dirty logging, which it is not. Lack of a check was really just a simple goof.

> + sp = to_shadow_page(spte & PT64_BASE_ADDR_MASK);
> + if (!sp->lpage_disallowed)

This is unsafe for the TDP MMU. If mmu_lock is held for read, tdp_mmu_link_sp()
could be in the process of creating the shadow page for a different fault.
Critically, tdp_mmu_link_sp() sets lpage_disallowed _after_ installing the SPTE.
Thus this code could see the present shadow page, with the correct old_spte (and
thus succeed its eventual tdp_mmu_set_spte_atomic()), but with the wrong
lpage_disallowed.

To fix, tdp_mmu_link_sp() needs to tag the shadow page with lpage_disallowed before
setting the SPTE, and also needs appropriate memory barriers. It's a bit messy at
first glance, but actually presents an opportunity to further improve TDP MMU
performance. tdp_mmu_pages can be turned into a counter, at which point the
link/unlink paths don't need to acquire the spinlock except in the NX huge page case.

Yosry (Cc'd) is also looking into adding stats for the number of page table pages
consumed by KVM, turning tdp_mmu_pages into a counter would trivialize that for
the TDP MMU (the stats for the TDP MMU and old MMU are best kept separated, see
the details in the attached patch).

Unlike the legacy MMU, the TDP MMU never re-links existing shadow pages. This means
it doesn't need to worry about the scenario where lpage_disallowed was already set.
So, setting the flag prior to the SPTE is trivial and doesn't need to be unwound.

Disclaimer: attached patches are lightly tested.

On top, this patch would need to add barriers, e.g.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5cb845fae56e..0bf85bf3da64 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2896,6 +2896,12 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
cur_level == fault->goal_level &&
is_shadow_present_pte(spte) &&
!is_large_pte(spte)) {
+ /* comment goes here. */
+ smp_rmb();
+
+ if (!sp->lpage_disallowed)
+ return;
+
/*
* A small SPTE exists for this pfn, but FNAME(fetch)
* and __direct_map would like to create a large PTE
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 5ca78a89d8ed..9a0bc19179b0 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1207,6 +1207,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
tdp_mmu_init_child_sp(sp, &iter);

sp->lpage_disallowed = account_nx;
+ /* comment goes here. */
+ smp_wmb();

if (tdp_mmu_link_sp(kvm, &iter, sp, true)) {
tdp_mmu_free_sp(sp);


Attachments:
0001-KVM-x86-mmu-Set-lpage_disallowed-in-TDP-MMU-before-s.patch (4.80 kB)
0002-KVM-x86-mmu-Track-the-number-of-TDP-MMU-pages-but-no.patch (4.36 kB)

2022-03-31 04:16:00

by Sean Christopherson

Subject: Re: [PATCH v2 3/4] KVM: x86/mmu: explicitly check nx_hugepage in disallowed_hugepage_adjust()

On Tue, Mar 29, 2022, Sean Christopherson wrote:
> > Fixes: b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation")
> > Cc: [email protected]
>
> I vote to keep the Fixes tag, but drop the Cc: stable@ and NAK any MANUALSEL/AUTOSEL
> for backporting this to stable trees. Yes, it's a bug, but the NX huge page zapping
> kthread will (eventually) reclaim the lost performance. On the flip side, if there's
> an edge case we mess up (see below), then we've introduced a far worse bug.

Doh, that's wrong. The whole problem is that these pages aren't on the NX list...
I think I'd still vote to not send this to stable, but it's a weak vote.

2022-04-01 13:53:50

by Mingwei Zhang

Subject: Re: [PATCH v2 3/4] KVM: x86/mmu: explicitly check nx_hugepage in disallowed_hugepage_adjust()

On Tue, Mar 29, 2022, Sean Christopherson wrote:
> +Yosry
>
> On Wed, Mar 23, 2022, Mingwei Zhang wrote:
> > Add an extra check for the nx hugepage case and allow KVM to reconstruct
> > large mappings after dirty logging is disabled. The existing code works
> > only for nx hugepages, but its condition is too general in that it does
> > not consider other use cases (such as dirty logging). Note that when
> > dirty logging is disabled, KVM calls kvm_mmu_zap_collapsible_sptes(),
> > which only zaps leaf SPTEs.
>
> This opening is very, very confusing. A big part of the confusion is due to poor
> naming in KVM, where kvm_mmu_page.lpage_disallowed really should be
> nx_huge_page_disallowed. Enabling dirty logging also disables huge pages, but that
> goes through the memslot's disallow_lpage. Reading through this paragraph, and
> looking at the code, it sounds like disallowed_hugepage_adjust() is explicitly
> checking if a page is disallowed due to dirty logging, which is not the case.
>
> I'd prefer the changelog lead with the bug it's fixing and only briefly mention
> dirty logging as a scenario where KVM can end up with shadow pages without child
> SPTEs. Such scenarios can also happen with mmu_notifier calls, etc...
>
> E.g.
>
> Explicitly check if a NX huge page is disallowed when determining if a page
> fault needs to be forced to use a smaller sized page. KVM incorrectly
> assumes that the NX huge page mitigation is the only scenario where KVM
> will create a shadow page instead of a huge page. Any scenario that causes
> KVM to zap leaf SPTEs may result in having a SP that can be made huge
> without violating the NX huge page mitigation. E.g. disabling of dirty
> logging, zapping from mmu_notifier due to page migration, guest MTRR changes
> that affect the viability of a huge page, etc...
>
Thanks for the correction. Yeah, I hit this bug when rebasing the internal
demand paging work. Other than that, the only user that will trigger this
issue is dirty logging, so that is the motivation.

> > Moreover, the existing code assumes that a present PMD or PUD indicates
> > that there exist 'smaller SPTEs' under the paging structure. This
> > assumption may not be true if KVM zaps only leaf SPTEs in the MMU.
> >
> > Missing the check causes KVM to incorrectly regard the faulting page as
> > an NX huge page and refuse to map it at the desired level, which leads
> > to a bad performance issue in the shadow MMU and potentially in the TDP
> > MMU as well.
> >
> > Fixes: b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation")
> > Cc: [email protected]
I will remove the Cc to stable, since this patch alone may have a race in the
TDP MMU, and will include both patches you attached in the next version.

>
> I vote to keep the Fixes tag, but drop the Cc: stable@ and NAK any MANUALSEL/AUTOSEL
> for backporting this to stable trees. Yes, it's a bug, but the NX huge page zapping
> kthread will (eventually) reclaim the lost performance. On the flip side, if there's
> an edge case we mess up (see below), then we've introduced a far worse bug.
>
> > Reviewed-by: Ben Gardon <[email protected]>
> > Signed-off-by: Mingwei Zhang <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 14 ++++++++++++--
> > 1 file changed, 12 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 5628d0ba637e..d9b2001d8217 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -2919,6 +2919,16 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
> > cur_level == fault->goal_level &&
> > is_shadow_present_pte(spte) &&
> > !is_large_pte(spte)) {
> > + struct kvm_mmu_page *sp;
> > + u64 page_mask;
> > + /*
> > + * When nx hugepage flag is not set, there is no reason to go
> > + * down to another level. This helps KVM re-generate large
> > + * mappings after dirty logging disabled.
> > + */
>
> Honestly, I'd just omit the comment. Again, IMO the blurb about dirty logging
> does more harm than good because it makes it sound like this is somehow unique to
> dirty logging, which it is not. Lack of a check was really just a simple goof.
>
> > + sp = to_shadow_page(spte & PT64_BASE_ADDR_MASK);
> > + if (!sp->lpage_disallowed)
>
> This is unsafe for the TDP MMU. If mmu_lock is held for read, tdp_mmu_link_sp()
> could be in the process of creating the shadow page for a different fault.
> Critically, tdp_mmu_link_sp() sets lpage_disallowed _after_ installing the SPTE.
> Thus this code could see the present shadow page, with the correct old_spte (and
> thus succeed its eventual tdp_mmu_set_spte_atomic()), but with the wrong
> lpage_disallowed.
>
> To fix, tdp_mmu_link_sp() needs to tag the shadow page with lpage_disallowed before
> setting the SPTE, and also needs appropriate memory barriers. It's a bit messy at
> first glance, but actually presents an opportunity to further improve TDP MMU
> performance. tdp_mmu_pages can be turned into a counter, at which point the
> link/unlink paths don't need to acquire the spinlock except in the NX huge page case.
>
> Yosry (Cc'd) is also looking into adding stats for the number of page table pages
> consumed by KVM, turning tdp_mmu_pages into a counter would trivialize that for
> the TDP MMU (the stats for the TDP MMU and old MMU are best kept separated, see
> the details in the attached patch).
>
> Unlike the legacy MMU, the TDP MMU never re-links existing shadow pages. This means
> it doesn't need to worry about the scenario where lpage_disallowed was already set.
> So, setting the flag prior to the SPTE is trivial and doesn't need to be unwound.
>
> Disclaimer: attached patches are lightly tested.

ack.
>
> On top, this patch would need to add barriers, e.g.
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 5cb845fae56e..0bf85bf3da64 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2896,6 +2896,12 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
> cur_level == fault->goal_level &&
> is_shadow_present_pte(spte) &&
> !is_large_pte(spte)) {
> + /* comment goes here. */
> + smp_rmb();
> +
> + if (!sp->lpage_disallowed)
> + return;
> +
> /*
> * A small SPTE exists for this pfn, but FNAME(fetch)
> * and __direct_map would like to create a large PTE
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 5ca78a89d8ed..9a0bc19179b0 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1207,6 +1207,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> tdp_mmu_init_child_sp(sp, &iter);
>
> sp->lpage_disallowed = account_nx;
> + /* comment goes here. */
> + smp_wmb();
>
> if (tdp_mmu_link_sp(kvm, &iter, sp, true)) {
> tdp_mmu_free_sp(sp);
>

> From 80d8bbd4a92faabc65cc5047c6b8ff1468cda1fa Mon Sep 17 00:00:00 2001
> From: Sean Christopherson <[email protected]>
> Date: Tue, 29 Mar 2022 13:25:34 -0700
> Subject: [PATCH 1/2] KVM: x86/mmu: Set lpage_disallowed in TDP MMU before
> setting SPTE
>
> Set lpage_disallowed in TDP MMU shadow pages before making the SP visible
> to other readers, i.e. before setting its SPTE. This will allow KVM to
> query lpage_disallowed when determining if a shadow page can be replaced
> by a NX huge page without violating the rules of the mitigation.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 14 ++++++++++----
> arch/x86/kvm/mmu/mmu_internal.h | 2 +-
> arch/x86/kvm/mmu/tdp_mmu.c | 20 ++++++++++++--------
> 3 files changed, 23 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 1361eb4599b4..5cb845fae56e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -812,14 +812,20 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
> kvm_mmu_gfn_disallow_lpage(slot, gfn);
> }
>
> -void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> +void __account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
> - if (sp->lpage_disallowed)
> - return;
> -
> ++kvm->stat.nx_lpage_splits;
> list_add_tail(&sp->lpage_disallowed_link,
> &kvm->arch.lpage_disallowed_mmu_pages);
> +}
> +
> +static void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> +{
> + if (sp->lpage_disallowed)
> + return;
> +
> + __account_huge_nx_page(kvm, sp);
> +
> sp->lpage_disallowed = true;
> }
>
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 1bff453f7cbe..4a0087efa1e3 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -168,7 +168,7 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
>
> void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>
> -void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> +void __account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>
> #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index b3b6426725d4..f05423545e6d 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1122,16 +1122,13 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
> * @kvm: kvm instance
> * @iter: a tdp_iter instance currently on the SPTE that should be set
> * @sp: The new TDP page table to install.
> - * @account_nx: True if this page table is being installed to split a
> - * non-executable huge page.
> * @shared: This operation is running under the MMU lock in read mode.
> *
> * Returns: 0 if the new page table was installed. Non-0 if the page table
> * could not be installed (e.g. the atomic compare-exchange failed).
> */
> static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
> - struct kvm_mmu_page *sp, bool account_nx,
> - bool shared)
> + struct kvm_mmu_page *sp, bool shared)
> {
> u64 spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);
> int ret = 0;
> @@ -1146,8 +1143,6 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
>
> spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
> - if (account_nx)
> - account_huge_nx_page(kvm, sp);
> spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>
> return 0;
> @@ -1160,6 +1155,7 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
> int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> {
> struct kvm_mmu *mmu = vcpu->arch.mmu;
> + struct kvm *kvm = vcpu->kvm;
> struct tdp_iter iter;
> struct kvm_mmu_page *sp;
> int ret;
> @@ -1210,10 +1206,18 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> sp = tdp_mmu_alloc_sp(vcpu);
> tdp_mmu_init_child_sp(sp, &iter);
>
> - if (tdp_mmu_link_sp(vcpu->kvm, &iter, sp, account_nx, true)) {
> + sp->lpage_disallowed = account_nx;
> +
> + if (tdp_mmu_link_sp(kvm, &iter, sp, true)) {
> tdp_mmu_free_sp(sp);
> break;
> }
> +
> + if (account_nx) {
> + spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> + __account_huge_nx_page(kvm, sp);
> + spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> + }
> }
> }
>
> @@ -1501,7 +1505,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
> * correctness standpoint since the translation will be the same either
> * way.
> */
> - ret = tdp_mmu_link_sp(kvm, iter, sp, false, shared);
> + ret = tdp_mmu_link_sp(kvm, iter, sp, shared);
> if (ret)
> goto out;
>
>
> base-commit: 19164ad08bf668bca4f4bfbaacaa0a47c1b737a6
> --
> 2.35.1.1021.g381101b075-goog
>

> From 18eaf3b86dfd01a5b58d9755ba76fe8ff80702ab Mon Sep 17 00:00:00 2001
> From: Sean Christopherson <[email protected]>
> Date: Tue, 29 Mar 2022 14:41:40 -0700
> Subject: [PATCH 2/2] KVM: x86/mmu: Track the number of TDP MMU pages, but not
> the actual pages
>
> Track the number of TDP MMU "shadow" pages instead of tracking the pages
> themselves. With the NX huge page list manipulation moved out of the
> common linking flow, eliminating the list-based tracking means the happy
> path of adding a shadow page doesn't need to acquire a spinlock and can
> instead inc/dec an atomic.
>
> Keep the tracking as the WARN during TDP MMU teardown on leaked shadow
> pages is very, very useful for detecting KVM bugs.
>
> Tracking the number of pages will also make it trivial to expose the
> counter to userspace as a stat in the future, which may or may not be
> desirable.
>
> Note, the TDP MMU needs to use a separate counter (and stat if that ever
> comes to be) from the existing n_used_mmu_pages. The TDP MMU doesn't
> bother supporting the shrinker nor does it honor KVM_SET_NR_MMU_PAGES
> (because the TDP MMU consumes so few pages relative to shadow paging),
> and including TDP MMU pages in that counter would break both the shrinker
> and shadow MMUs, e.g. if a VM is using nested TDP.
>
> Cc: Yosry Ahmed <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 11 +++--------
> arch/x86/kvm/mmu/tdp_mmu.c | 16 ++++++++--------
> 2 files changed, 11 insertions(+), 16 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 9694dd5e6ccc..d0dd5ed2e209 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1192,6 +1192,9 @@ struct kvm_arch {
> */
> bool tdp_mmu_enabled;
>
> + /* The number of TDP MMU pages across all roots. */
> + atomic64_t tdp_mmu_pages;
> +
> /*
> * List of struct kvm_mmu_pages being used as roots.
> * All struct kvm_mmu_pages in the list should have
> @@ -1212,18 +1215,10 @@ struct kvm_arch {
> */
> struct list_head tdp_mmu_roots;
>
> - /*
> - * List of struct kvmp_mmu_pages not being used as roots.
> - * All struct kvm_mmu_pages in the list should have
> - * tdp_mmu_page set and a tdp_mmu_root_count of 0.
> - */
> - struct list_head tdp_mmu_pages;
> -
> /*
> * Protects accesses to the following fields when the MMU lock
> * is held in read mode:
> * - tdp_mmu_roots (above)
> - * - tdp_mmu_pages (above)
> * - the link field of struct kvm_mmu_pages used by the TDP MMU
> * - lpage_disallowed_mmu_pages
> * - the lpage_disallowed_link field of struct kvm_mmu_pages used
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index f05423545e6d..5ca78a89d8ed 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -24,7 +24,6 @@ bool kvm_mmu_init_tdp_mmu(struct kvm *kvm)
>
> INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
> spin_lock_init(&kvm->arch.tdp_mmu_pages_lock);
> - INIT_LIST_HEAD(&kvm->arch.tdp_mmu_pages);
> kvm->arch.tdp_mmu_zap_wq =
> alloc_workqueue("kvm", WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE, 0);
>
> @@ -51,7 +50,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
> flush_workqueue(kvm->arch.tdp_mmu_zap_wq);
> destroy_workqueue(kvm->arch.tdp_mmu_zap_wq);
>
> - WARN_ON(!list_empty(&kvm->arch.tdp_mmu_pages));
> + WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
> WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
>
> /*
> @@ -381,14 +380,17 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
> static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
> bool shared)
> {
> + atomic64_dec(&kvm->arch.tdp_mmu_pages);
> +
> + if (!sp->lpage_disallowed)
> + return;
> +
> if (shared)
> spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> else
> lockdep_assert_held_write(&kvm->mmu_lock);
>
> - list_del(&sp->link);
> - if (sp->lpage_disallowed)
> - unaccount_huge_nx_page(kvm, sp);
> + unaccount_huge_nx_page(kvm, sp);
>
> if (shared)
> spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> @@ -1141,9 +1143,7 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
> tdp_mmu_set_spte(kvm, iter, spte);
> }
>
> - spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> - list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
> - spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> + atomic64_inc(&kvm->arch.tdp_mmu_pages);
>
> return 0;
> }
> --
> 2.35.1.1021.g381101b075-goog
>