2022-08-19 00:58:11

by Gavin Shan

Subject: [PATCH v1 0/5] KVM: arm64: Enable ring-based dirty memory tracking

This series enables the ring-based dirty memory tracking for ARM64.
The feature has been available and enabled on x86 for a while. It
is beneficial when the number of dirty pages is small in a checkpointing
system or live migration scenario. More details can be found in
fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking").

The generic part is already comprehensive enough, so not much work is
needed to extend it to ARM64.

- PATCH[1] enables the feature on ARM64
- PATCH[2-5] improve kvm/selftests/dirty_log_test

Testing
=======

- kvm/selftests/dirty_log_test
- Live migration by QEMU
- Host with 4KB or 64KB base page size

Gavin Shan (5):
KVM: arm64: Enable ring-based dirty memory tracking
KVM: selftests: Use host page size to map ring buffer in
dirty_log_test
KVM: selftests: Dirty host pages in dirty_log_test
KVM: selftests: Clear dirty ring states between two modes in
dirty_log_test
KVM: selftests: Automate choosing dirty ring size in dirty_log_test

Documentation/virt/kvm/api.rst | 2 +-
arch/arm64/include/uapi/asm/kvm.h | 1 +
arch/arm64/kvm/Kconfig | 1 +
arch/arm64/kvm/arm.c | 8 ++
tools/testing/selftests/kvm/dirty_log_test.c | 101 ++++++++++++++-----
tools/testing/selftests/kvm/lib/kvm_util.c | 2 +-
6 files changed, 88 insertions(+), 27 deletions(-)

--
2.23.0


2022-08-19 00:58:20

by Gavin Shan

Subject: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

The ring-based dirty memory tracking has been available and enabled
on x86 for a while. The feature is beneficial when the number of
dirty pages is small in a checkpointing system or live migration
scenario. More details can be found from fb04a1eddb1a ("KVM: X86:
Implement ring-based dirty memory tracking").

This enables the ring-based dirty memory tracking on ARM64. It's
notable that no extra reserved ring entries are needed on ARM64
because the huge pages are always split into base pages when page
dirty tracking is enabled.

Signed-off-by: Gavin Shan <[email protected]>
---
Documentation/virt/kvm/api.rst | 2 +-
arch/arm64/include/uapi/asm/kvm.h | 1 +
arch/arm64/kvm/Kconfig | 1 +
arch/arm64/kvm/arm.c | 8 ++++++++
4 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index abd7c32126ce..19fa1ac017ed 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8022,7 +8022,7 @@ regardless of what has actually been exposed through the CPUID leaf.
8.29 KVM_CAP_DIRTY_LOG_RING
---------------------------

-:Architectures: x86
+:Architectures: x86, arm64
:Parameters: args[0] - size of the dirty log ring

KVM is capable of tracking dirty memory using ring buffers that are
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index 3bb134355874..7e04b0b8d2b2 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -43,6 +43,7 @@
#define __KVM_HAVE_VCPU_EVENTS

#define KVM_COALESCED_MMIO_PAGE_OFFSET 1
+#define KVM_DIRTY_LOG_PAGE_OFFSET 64

#define KVM_REG_SIZE(id) \
(1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT))
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 815cc118c675..0309b2d0f2da 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -32,6 +32,7 @@ menuconfig KVM
select KVM_VFIO
select HAVE_KVM_EVENTFD
select HAVE_KVM_IRQFD
+ select HAVE_KVM_DIRTY_RING
select HAVE_KVM_MSI
select HAVE_KVM_IRQCHIP
select HAVE_KVM_IRQ_ROUTING
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 986cee6fbc7f..3de6b9b39db7 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -866,6 +866,14 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
if (!ret)
ret = 1;

+ /* Force vcpu exit if its dirty ring is soft-full */
+ if (unlikely(vcpu->kvm->dirty_ring_size &&
+ kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
+ vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
+ trace_kvm_dirty_ring_exit(vcpu);
+ ret = 0;
+ }
+
if (ret > 0)
ret = check_vcpu_requests(vcpu);

--
2.23.0

2022-08-19 00:58:38

by Gavin Shan

Subject: [PATCH v1 3/5] KVM: selftests: Dirty host pages in dirty_log_test

It's assumed that 1024 host pages, instead of guest pages, are dirtied
in each iteration of guest_code(). The current implementation misses
the case of mismatched page sizes on host and guest. For example,
ARM64 could have a 64KB page size in the guest, but a 4KB page size on
the host. In that case, only (TEST_PAGES_PER_LOOP / 16) host pages,
instead of TEST_PAGES_PER_LOOP, are dirtied in every iteration.

Fix the issue by touching all sub-pages when the host and guest page
sizes are mismatched.

Signed-off-by: Gavin Shan <[email protected]>
---
tools/testing/selftests/kvm/dirty_log_test.c | 50 +++++++++++++++-----
1 file changed, 39 insertions(+), 11 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 9c883c94d478..50b02186ce12 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -70,6 +70,7 @@
* that may change.
*/
static uint64_t host_page_size;
+static uint64_t host_num_pages;
static uint64_t guest_page_size;
static uint64_t guest_num_pages;
static uint64_t random_array[TEST_PAGES_PER_LOOP];
@@ -94,8 +95,23 @@ static uint64_t guest_test_virt_mem = DEFAULT_GUEST_TEST_MEM;
*/
static void guest_code(void)
{
+ uint64_t num_pages, page_size, sub_page_size;
uint64_t addr;
- int i;
+ int pages_per_loop, i, j;
+
+ /*
+ * The page sizes on host and VM could be different. We need
+ * to perform writing on all sub-pages.
+ */
+ if (host_page_size >= guest_page_size) {
+ num_pages = host_num_pages;
+ page_size = host_page_size;
+ sub_page_size = host_page_size;
+ } else {
+ num_pages = guest_num_pages;
+ page_size = guest_page_size;
+ sub_page_size = host_page_size;
+ }

/*
* On s390x, all pages of a 1M segment are initially marked as dirty
@@ -103,18 +119,29 @@ static void guest_code(void)
* To compensate this specialty in this test, we need to touch all
* pages during the first iteration.
*/
- for (i = 0; i < guest_num_pages; i++) {
- addr = guest_test_virt_mem + i * guest_page_size;
- *(uint64_t *)addr = READ_ONCE(iteration);
+ for (i = 0; i < num_pages; i++) {
+ addr = guest_test_virt_mem + i * page_size;
+ addr = align_down(addr, page_size);
+
+ for (j = 0; j < page_size / sub_page_size; j++) {
+ *(uint64_t *)(addr + j * sub_page_size) =
+ READ_ONCE(iteration);
+ }
}

+ pages_per_loop = (TEST_PAGES_PER_LOOP * sub_page_size) / page_size;
+
while (true) {
- for (i = 0; i < TEST_PAGES_PER_LOOP; i++) {
+ for (i = 0; i < pages_per_loop; i++) {
addr = guest_test_virt_mem;
- addr += (READ_ONCE(random_array[i]) % guest_num_pages)
- * guest_page_size;
- addr = align_down(addr, host_page_size);
- *(uint64_t *)addr = READ_ONCE(iteration);
+ addr += (READ_ONCE(random_array[i]) % num_pages)
+ * page_size;
+ addr = align_down(addr, page_size);
+
+ for (j = 0; j < page_size / sub_page_size; j++) {
+ *(uint64_t *)(addr + j * sub_page_size) =
+ READ_ONCE(iteration);
+ }
}

/* Tell the host that we need more random numbers */
@@ -713,14 +740,14 @@ static void run_test(enum vm_guest_mode mode, void *arg)
2ul << (DIRTY_MEM_BITS - PAGE_SHIFT_4K), guest_code);

guest_page_size = vm->page_size;
+ host_page_size = getpagesize();
+
/*
* A little more than 1G of guest page sized pages. Cover the
* case where the size is not aligned to 64 pages.
*/
guest_num_pages = (1ul << (DIRTY_MEM_BITS - vm->page_shift)) + 3;
guest_num_pages = vm_adjust_num_guest_pages(mode, guest_num_pages);
-
- host_page_size = getpagesize();
host_num_pages = vm_num_host_pages(mode, guest_num_pages);

if (!p->phys_offset) {
@@ -760,6 +787,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
sync_global_to_guest(vm, host_page_size);
sync_global_to_guest(vm, guest_page_size);
sync_global_to_guest(vm, guest_test_virt_mem);
+ sync_global_to_guest(vm, host_num_pages);
sync_global_to_guest(vm, guest_num_pages);

/* Start the iterations */
--
2.23.0

2022-08-19 00:58:47

by Gavin Shan

Subject: [PATCH v1 4/5] KVM: selftests: Clear dirty ring states between two modes in dirty_log_test

There are two states that need to be cleared before the next mode
is executed. Otherwise, we will hit the failure that the following
messages indicate.

- The variable 'dirty_ring_vcpu_ring_full' is shared by the main and
vcpu threads. It indicates whether the vcpu exited due to a full ring
buffer. Its value can be carried over from the previous mode
(VM_MODE_P40V48_4K) to the current one (VM_MODE_P40V48_64K) when
VM_MODE_P40V48_16K isn't supported.

- The current ring buffer index needs to be reset before the next mode
(VM_MODE_P40V48_64K) is executed. Otherwise, a stale value is carried
over from the previous mode (VM_MODE_P40V48_4K).

# ./dirty_log_test -M dirty-ring
Setting log mode to: 'dirty-ring'
Test iterations: 32, interval: 10 (ms)
Testing guest mode: PA-bits:40, VA-bits:48, 4K pages
guest physical test memory offset: 0xffbfffc000
:
Dirtied 995328 pages
Total bits checked: dirty (1012434), clear (7114123), track_next (966700)
Testing guest mode: PA-bits:40, VA-bits:48, 64K pages
guest physical test memory offset: 0xffbffc0000
vcpu stops because vcpu is kicked out...
vcpu continues now.
Notifying vcpu to continue
Iteration 1 collected 0 pages
vcpu stops because dirty ring is full...
vcpu continues now.
vcpu stops because dirty ring is full...
vcpu continues now.
vcpu stops because dirty ring is full...
==== Test Assertion Failure ====
dirty_log_test.c:369: cleared == count
pid=10541 tid=10541 errno=22 - Invalid argument
1 0x0000000000403087: dirty_ring_collect_dirty_pages at dirty_log_test.c:369
2 0x0000000000402a0b: log_mode_collect_dirty_pages at dirty_log_test.c:492
3 (inlined by) run_test at dirty_log_test.c:795
4 (inlined by) run_test at dirty_log_test.c:705
5 0x0000000000403a37: for_each_guest_mode at guest_modes.c:100
6 0x0000000000401ccf: main at dirty_log_test.c:938
7 0x0000ffff9ecd279b: ?? ??:0
8 0x0000ffff9ecd286b: ?? ??:0
9 0x0000000000401def: _start at ??:?
Reset dirty pages (0) mismatch with collected (35566)

Fix the issues by clearing 'dirty_ring_vcpu_ring_full' and the ring
buffer index before a new mode is executed.

Signed-off-by: Gavin Shan <[email protected]>
---
tools/testing/selftests/kvm/dirty_log_test.c | 27 ++++++++++++--------
1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 50b02186ce12..450e97d10de7 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -252,13 +252,15 @@ static void clear_log_create_vm_done(struct kvm_vm *vm)
}

static void dirty_log_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,
- void *bitmap, uint32_t num_pages)
+ void *bitmap, uint32_t num_pages,
+ uint32_t *unused)
{
kvm_vm_get_dirty_log(vcpu->vm, slot, bitmap);
}

static void clear_log_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,
- void *bitmap, uint32_t num_pages)
+ void *bitmap, uint32_t num_pages,
+ uint32_t *unused)
{
kvm_vm_get_dirty_log(vcpu->vm, slot, bitmap);
kvm_vm_clear_dirty_log(vcpu->vm, slot, bitmap, 0, num_pages);
@@ -354,10 +356,9 @@ static void dirty_ring_continue_vcpu(void)
}

static void dirty_ring_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,
- void *bitmap, uint32_t num_pages)
+ void *bitmap, uint32_t num_pages,
+ uint32_t *ring_buf_idx)
{
- /* We only have one vcpu */
- static uint32_t fetch_index = 0;
uint32_t count = 0, cleared;
bool continued_vcpu = false;

@@ -374,7 +375,8 @@ static void dirty_ring_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,

/* Only have one vcpu */
count = dirty_ring_collect_one(vcpu_map_dirty_ring(vcpu),
- slot, bitmap, num_pages, &fetch_index);
+ slot, bitmap, num_pages,
+ ring_buf_idx);

cleared = kvm_vm_reset_dirty_ring(vcpu->vm);

@@ -431,7 +433,8 @@ struct log_mode {
void (*create_vm_done)(struct kvm_vm *vm);
/* Hook to collect the dirty pages into the bitmap provided */
void (*collect_dirty_pages) (struct kvm_vcpu *vcpu, int slot,
- void *bitmap, uint32_t num_pages);
+ void *bitmap, uint32_t num_pages,
+ uint32_t *ring_buf_idx);
/* Hook to call when after each vcpu run */
void (*after_vcpu_run)(struct kvm_vcpu *vcpu, int ret, int err);
void (*before_vcpu_join) (void);
@@ -496,13 +499,14 @@ static void log_mode_create_vm_done(struct kvm_vm *vm)
}

static void log_mode_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,
- void *bitmap, uint32_t num_pages)
+ void *bitmap, uint32_t num_pages,
+ uint32_t *ring_buf_idx)
{
struct log_mode *mode = &log_modes[host_log_mode];

TEST_ASSERT(mode->collect_dirty_pages != NULL,
"collect_dirty_pages() is required for any log mode!");
- mode->collect_dirty_pages(vcpu, slot, bitmap, num_pages);
+ mode->collect_dirty_pages(vcpu, slot, bitmap, num_pages, ring_buf_idx);
}

static void log_mode_after_vcpu_run(struct kvm_vcpu *vcpu, int ret, int err)
@@ -721,6 +725,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
struct kvm_vcpu *vcpu;
struct kvm_vm *vm;
unsigned long *bmap;
+ uint32_t ring_buf_idx = 0;

if (!log_mode_supported()) {
print_skip("Log mode '%s' not supported",
@@ -797,6 +802,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
host_dirty_count = 0;
host_clear_count = 0;
host_track_next_count = 0;
+ WRITE_ONCE(dirty_ring_vcpu_ring_full, false);

pthread_create(&vcpu_thread, NULL, vcpu_worker, vcpu);

@@ -804,7 +810,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
/* Give the vcpu thread some time to dirty some pages */
usleep(p->interval * 1000);
log_mode_collect_dirty_pages(vcpu, TEST_MEM_SLOT_INDEX,
- bmap, host_num_pages);
+ bmap, host_num_pages,
+ &ring_buf_idx);

/*
* See vcpu_sync_stop_requested definition for details on why
--
2.23.0

2022-08-19 01:10:56

by Gavin Shan

Subject: [PATCH v1 5/5] KVM: selftests: Automate choosing dirty ring size in dirty_log_test

In the dirty ring case, we rely on the vcpu exit due to the full dirty
ring state. On an ARM64 system, there are 4096 host pages when the host
page size is 64KB. In this case, the vcpu never exits due to the full
dirty ring state. The vcpu keeps dirtying the same set of pages, but the
dirty page information isn't collected in the main thread. This leads
to an infinite loop, as the following log shows.

# ./dirty_log_test -M dirty-ring -c 65536 -m 5
Setting log mode to: 'dirty-ring'
Test iterations: 32, interval: 10 (ms)
Testing guest mode: PA-bits:40, VA-bits:48, 4K pages
guest physical test memory offset: 0xffbffe0000
vcpu stops because vcpu is kicked out...
Notifying vcpu to continue
vcpu continues now.
Iteration 1 collected 576 pages
<No more output afterwards>

Fix the issue by automatically choosing the best dirty ring size, to
ensure the vcpu exit due to the full dirty ring state. The '-c' option
now provides a hint for the ring size, instead of its exact value.

Signed-off-by: Gavin Shan <[email protected]>
---
tools/testing/selftests/kvm/dirty_log_test.c | 24 ++++++++++++++++----
1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 450e97d10de7..ad31b6e3fe6a 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -23,6 +23,9 @@
#include "guest_modes.h"
#include "processor.h"

+#define DIRTY_MEM_BITS 30 /* 1G */
+#define PAGE_SHIFT_4K 12
+
/* The memory slot index to track dirty pages */
#define TEST_MEM_SLOT_INDEX 1

@@ -298,6 +301,22 @@ static bool dirty_ring_supported(void)

static void dirty_ring_create_vm_done(struct kvm_vm *vm)
{
+ uint64_t pages;
+ uint32_t limit;
+
+ /*
+ * We rely on VM_EXIT due to full dirty ring state. Adjust
+ * the ring buffer size to ensure we're able to reach the
+ * full dirty ring state.
+ */
+ pages = (1ul << (DIRTY_MEM_BITS - vm->page_shift)) + 3;
+ pages = vm_adjust_num_guest_pages(vm->mode, pages);
+ pages = vm_num_host_pages(vm->mode, pages);
+
+ limit = 1 << (31 - __builtin_clz(pages));
+ test_dirty_ring_count = 1 << (31 - __builtin_clz(test_dirty_ring_count));
+ test_dirty_ring_count = min(limit, test_dirty_ring_count);
+
/*
* Switch to dirty ring mode after VM creation but before any
* of the vcpu creation.
@@ -710,9 +729,6 @@ static struct kvm_vm *create_vm(enum vm_guest_mode mode, struct kvm_vcpu **vcpu,
return vm;
}

-#define DIRTY_MEM_BITS 30 /* 1G */
-#define PAGE_SHIFT_4K 12
-
struct test_params {
unsigned long iterations;
unsigned long interval;
@@ -856,7 +872,7 @@ static void help(char *name)
printf("usage: %s [-h] [-i iterations] [-I interval] "
"[-p offset] [-m mode]\n", name);
puts("");
- printf(" -c: specify dirty ring size, in number of entries\n");
+ printf(" -c: hint to dirty ring size, in number of entries\n");
printf(" (only useful for dirty-ring test; default: %"PRIu32")\n",
TEST_DIRTY_RING_COUNT);
printf(" -i: specify iteration counts (default: %"PRIu64")\n",
--
2.23.0

2022-08-19 01:31:22

by Gavin Shan

Subject: [PATCH v1 2/5] KVM: selftests: Use host page size to map ring buffer in dirty_log_test

In vcpu_map_dirty_ring(), the guest's page size is used to figure out
the offset in the virtual area. It works fine when the host and guest
have the same page size. However, it fails when their page sizes are
different, as the error messages below indicate. The offset should
instead be figured out according to the host's page size; otherwise,
the virtual area associated with the ring buffer can't be identified
by the host.

# ./dirty_log_test -M dirty-ring -m 7
Setting log mode to: 'dirty-ring'
Test iterations: 32, interval: 10 (ms)
Testing guest mode: PA-bits:40, VA-bits:48, 64K pages
guest physical test memory offset: 0xffbffc0000
vcpu stops because vcpu is kicked out...
Notifying vcpu to continue
vcpu continues now.
==== Test Assertion Failure ====
lib/kvm_util.c:1477: addr == MAP_FAILED
pid=9000 tid=9000 errno=0 - Success
1 0x0000000000405f5b: vcpu_map_dirty_ring at kvm_util.c:1477
2 0x0000000000402ebb: dirty_ring_collect_dirty_pages at dirty_log_test.c:349
3 0x00000000004029b3: log_mode_collect_dirty_pages at dirty_log_test.c:478
4 (inlined by) run_test at dirty_log_test.c:778
5 (inlined by) run_test at dirty_log_test.c:691
6 0x0000000000403a57: for_each_guest_mode at guest_modes.c:105
7 0x0000000000401ccf: main at dirty_log_test.c:921
8 0x0000ffffb06ec79b: ?? ??:0
9 0x0000ffffb06ec86b: ?? ??:0
10 0x0000000000401def: _start at ??:?
Dirty ring mapped private

Fix the issue by using the host's page size to map the ring buffer.

Signed-off-by: Gavin Shan <[email protected]>
---
tools/testing/selftests/kvm/lib/kvm_util.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 9889fe0d8919..4e823cbe6b48 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1464,7 +1464,7 @@ struct kvm_reg_list *vcpu_get_reg_list(struct kvm_vcpu *vcpu)

void *vcpu_map_dirty_ring(struct kvm_vcpu *vcpu)
{
- uint32_t page_size = vcpu->vm->page_size;
+ uint32_t page_size = getpagesize();
uint32_t size = vcpu->vm->dirty_ring_size;

TEST_ASSERT(size > 0, "Should enable dirty ring first");
--
2.23.0

2022-08-19 05:40:25

by Andrew Jones

Subject: Re: [PATCH v1 3/5] KVM: selftests: Dirty host pages in dirty_log_test

On Fri, Aug 19, 2022 at 08:55:59AM +0800, Gavin Shan wrote:
> It's assumed that 1024 host pages, instead of guest pages, are dirtied
> in each iteration in guest_code(). The current implementation misses
> the case of mismatched page sizes in host and guest. For example,
> ARM64 could have 64KB page size in guest, but 4KB page size in host.
> (TEST_PAGES_PER_LOOP / 16), instead of TEST_PAGES_PER_LOOP, host pages
> are dirtied in every iteration.
>
> Fix the issue by touching all sub-pages when we have mismatched
> page sizes in host and guest.

I'll let the dirty-log test authors decide what's best to do for this
test, but I'd think we should let the guest continue dirtying its
pages without knowledge of the host pages. Then, adjust the host test
code to assert all sub-pages, other than the ones it expects the guest
to have written, remain untouched.

Thanks,
drew

2022-08-19 08:34:01

by Marc Zyngier

Subject: Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

On Fri, 19 Aug 2022 01:55:57 +0100,
Gavin Shan <[email protected]> wrote:
>
> The ring-based dirty memory tracking has been available and enabled
> on x86 for a while. The feature is beneficial when the number of
> dirty pages is small in a checkpointing system or live migration
> scenario. More details can be found from fb04a1eddb1a ("KVM: X86:
> Implement ring-based dirty memory tracking").
>
> This enables the ring-based dirty memory tracking on ARM64. It's
> notable that no extra reserved ring entries are needed on ARM64
> because the huge pages are always split into base pages when page
> dirty tracking is enabled.

Can you please elaborate on this? Adding a per-CPU ring of course
results in extra memory allocation, so there must be a subtle
x86-specific detail that I'm not aware of...

>
> Signed-off-by: Gavin Shan <[email protected]>
> ---
> Documentation/virt/kvm/api.rst | 2 +-
> arch/arm64/include/uapi/asm/kvm.h | 1 +
> arch/arm64/kvm/Kconfig | 1 +
> arch/arm64/kvm/arm.c | 8 ++++++++
> 4 files changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index abd7c32126ce..19fa1ac017ed 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -8022,7 +8022,7 @@ regardless of what has actually been exposed through the CPUID leaf.
> 8.29 KVM_CAP_DIRTY_LOG_RING
> ---------------------------
>
> -:Architectures: x86
> +:Architectures: x86, arm64
> :Parameters: args[0] - size of the dirty log ring
>
> KVM is capable of tracking dirty memory using ring buffers that are
> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
> index 3bb134355874..7e04b0b8d2b2 100644
> --- a/arch/arm64/include/uapi/asm/kvm.h
> +++ b/arch/arm64/include/uapi/asm/kvm.h
> @@ -43,6 +43,7 @@
> #define __KVM_HAVE_VCPU_EVENTS
>
> #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
> +#define KVM_DIRTY_LOG_PAGE_OFFSET 64

For context, the documentation says:

<quote>
- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. [...]
</quote>

What is the reason for picking this particular value?


>
> #define KVM_REG_SIZE(id) \
> (1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT))
> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> index 815cc118c675..0309b2d0f2da 100644
> --- a/arch/arm64/kvm/Kconfig
> +++ b/arch/arm64/kvm/Kconfig
> @@ -32,6 +32,7 @@ menuconfig KVM
> select KVM_VFIO
> select HAVE_KVM_EVENTFD
> select HAVE_KVM_IRQFD
> + select HAVE_KVM_DIRTY_RING
> select HAVE_KVM_MSI
> select HAVE_KVM_IRQCHIP
> select HAVE_KVM_IRQ_ROUTING
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 986cee6fbc7f..3de6b9b39db7 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -866,6 +866,14 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> if (!ret)
> ret = 1;
>
> + /* Force vcpu exit if its dirty ring is soft-full */
> + if (unlikely(vcpu->kvm->dirty_ring_size &&
> + kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
> + vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> + trace_kvm_dirty_ring_exit(vcpu);
> + ret = 0;
> + }
> +

Why can't this be moved to kvm_vcpu_exit_request() instead? I would
also very much like the check to be made a common helper with x86.

A seemingly better approach would be to make this a request on dirty log
insertion, and avoid the whole "check the log size" on every run,
which adds pointless overhead to unsuspecting users (aka everyone).

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2022-08-22 02:10:14

by Gavin Shan

Subject: Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

Hi Marc,

On 8/19/22 6:00 PM, Marc Zyngier wrote:
> On Fri, 19 Aug 2022 01:55:57 +0100,
> Gavin Shan <[email protected]> wrote:
>>
>> The ring-based dirty memory tracking has been available and enabled
>> on x86 for a while. The feature is beneficial when the number of
>> dirty pages is small in a checkpointing system or live migration
>> scenario. More details can be found from fb04a1eddb1a ("KVM: X86:
>> Implement ring-based dirty memory tracking").
>>
>> This enables the ring-based dirty memory tracking on ARM64. It's
>> notable that no extra reserved ring entries are needed on ARM64
>> because the huge pages are always split into base pages when page
>> dirty tracking is enabled.
>
> Can you please elaborate on this? Adding a per-CPU ring of course
> results in extra memory allocation, so there must be a subtle
> x86-specific detail that I'm not aware of...
>

Sure. I guess it's helpful to explain how it works in the next revision.
Something like below:

This enables the ring-based dirty memory tracking on ARM64. The feature
depends on CONFIG_HAVE_KVM_DIRTY_RING, and is detected and enabled by
userspace through the KVM_CAP_DIRTY_LOG_RING capability. A ring buffer
is created on every vcpu and each entry is described by 'struct
kvm_dirty_gfn'. The ring buffer is pushed to by the host when a page
becomes dirty and pulled from by userspace. A vcpu exit is forced when
the ring buffer becomes full. The ring buffers on all vcpus can be
reset by the ioctl command KVM_RESET_DIRTY_RINGS.
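
To make this concrete, below is a rough sketch of how userspace could
drain one vcpu's ring. It's only an illustration from my side, not code
from this series; the struct layout, flag names and KVM_RESET_DIRTY_RINGS
ioctl come from the existing dirty-ring uapi, while the function and
variable names are made up, and memory barriers are omitted for brevity:

#include <linux/kvm.h>
#include <stdint.h>
#include <sys/ioctl.h>

/*
 * 'ring' is the area mmap()ed from the vcpu fd at offset
 * KVM_DIRTY_LOG_PAGE_OFFSET * page_size, 'nents' is the ring size in
 * entries (a power of two), 'fetch' is the fetch index userspace
 * maintains for this vcpu.
 */
static uint32_t collect_dirty_gfns(int vm_fd, struct kvm_dirty_gfn *ring,
                                   uint32_t nents, uint32_t *fetch)
{
        uint32_t count = 0;

        for (;;) {
                struct kvm_dirty_gfn *e = &ring[*fetch % nents];

                if (!(e->flags & KVM_DIRTY_GFN_F_DIRTY))
                        break;

                /*
                 * e->slot and e->offset identify the dirty page; mark
                 * it in whatever bitmap the VMM maintains.
                 */

                e->flags = KVM_DIRTY_GFN_F_RESET; /* hand it back to KVM */
                (*fetch)++;
                count++;
        }

        /* Ask KVM to recycle the harvested entries on all rings. */
        ioctl(vm_fd, KVM_RESET_DIRTY_RINGS, 0);
        return count;
}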

Yes, I think so. Adding a per-CPU ring results in extra memory allocation.
However, it avoids synchronization among multiple vcpus when dirty
pages are produced on multiple vcpus. More discussion can be found in [1]

[1] https://patchwork.kernel.org/project/kvm/patch/BL2PR08MB4812F929A2760BC40EA757CF0630@BL2PR08MB481.namprd08.prod.outlook.com/
(comment#8 from Radim Krčmář on May 3, 2016, 2:11 p.m. UTC)


>>
>> Signed-off-by: Gavin Shan <[email protected]>
>> ---
>> Documentation/virt/kvm/api.rst | 2 +-
>> arch/arm64/include/uapi/asm/kvm.h | 1 +
>> arch/arm64/kvm/Kconfig | 1 +
>> arch/arm64/kvm/arm.c | 8 ++++++++
>> 4 files changed, 11 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index abd7c32126ce..19fa1ac017ed 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -8022,7 +8022,7 @@ regardless of what has actually been exposed through the CPUID leaf.
>> 8.29 KVM_CAP_DIRTY_LOG_RING
>> ---------------------------
>>
>> -:Architectures: x86
>> +:Architectures: x86, arm64
>> :Parameters: args[0] - size of the dirty log ring
>>
>> KVM is capable of tracking dirty memory using ring buffers that are
>> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
>> index 3bb134355874..7e04b0b8d2b2 100644
>> --- a/arch/arm64/include/uapi/asm/kvm.h
>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>> @@ -43,6 +43,7 @@
>> #define __KVM_HAVE_VCPU_EVENTS
>>
>> #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
>> +#define KVM_DIRTY_LOG_PAGE_OFFSET 64
>
> For context, the documentation says:
>
> <quote>
> - if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
> KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. [...]
> </quote>
>
> What is the reason for picking this particular value?
>

It's inherited from x86. I don't think it has to be this particular value.
The value is used to distinguish the regions' owners, like kvm_run, KVM_PIO_PAGE_OFFSET,
KVM_COALESCED_MMIO_PAGE_OFFSET, and KVM_DIRTY_LOG_PAGE_OFFSET.

How about having 2 for KVM_DIRTY_LOG_PAGE_OFFSET in the next revision?
The virtual area is cheap, though I guess it's also nice to follow x86's
pattern and keep 64 for KVM_DIRTY_LOG_PAGE_OFFSET.

#define KVM_COALESCED_MMIO_PAGE_OFFSET 1
#define KVM_DIRTY_LOG_PAGE_OFFSET 2

>>
>> #define KVM_REG_SIZE(id) \
>> (1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT))
>> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
>> index 815cc118c675..0309b2d0f2da 100644
>> --- a/arch/arm64/kvm/Kconfig
>> +++ b/arch/arm64/kvm/Kconfig
>> @@ -32,6 +32,7 @@ menuconfig KVM
>> select KVM_VFIO
>> select HAVE_KVM_EVENTFD
>> select HAVE_KVM_IRQFD
>> + select HAVE_KVM_DIRTY_RING
>> select HAVE_KVM_MSI
>> select HAVE_KVM_IRQCHIP
>> select HAVE_KVM_IRQ_ROUTING
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index 986cee6fbc7f..3de6b9b39db7 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -866,6 +866,14 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>> if (!ret)
>> ret = 1;
>>
>> + /* Force vcpu exit if its dirty ring is soft-full */
>> + if (unlikely(vcpu->kvm->dirty_ring_size &&
>> + kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
>> + vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
>> + trace_kvm_dirty_ring_exit(vcpu);
>> + ret = 0;
>> + }
>> +
>
> Why can't this be moved to kvm_vcpu_exit_request() instead? I would
> also very much like the check to be made a common helper with x86.
>
> A seemingly approach would be to make this a request on dirty log
> insertion, and avoid the whole "check the log size" on every run,
> which adds pointless overhead to unsuspecting users (aka everyone).
>

I thought of having the check in kvm_vcpu_exit_request(). The various
exit reasons are prioritized; x86 gives KVM_EXIT_DIRTY_RING_FULL the
highest priority and ARM64 just follows. I don't think it really
matters. I will improve it accordingly in the next revision:

- Change kvm_dirty_ring_soft_full() to something like below in dirty_ring.c

bool kvm_dirty_ring_soft_full(struct kvm_vcpu *vcpu)
{
        struct kvm *kvm = vcpu->kvm;
        struct kvm_dirty_ring *ring = &vcpu->dirty_ring;

        if (unlikely(kvm->dirty_ring_size &&
                     kvm_dirty_ring_used(ring) >= ring->soft_limit)) {
                vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
                trace_kvm_dirty_ring_exit(vcpu);
                return true;
        }

        return false;
}

- Use the modified kvm_dirty_ring_soft_full() in kvm_vcpu_exit_request().

Userspace needs KVM_EXIT_DIRTY_RING_FULL to collect the dirty log in time.
Otherwise, the dirty log in the ring buffer will be overwritten. I'm not
sure if there is anything else I missed?
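
Just to illustrate what I mean, the vcpu thread in the VMM would react
to the exit reason roughly as below. This is a hand-written sketch, not
code from this series; harvest_ring() is a placeholder for whatever
routine walks the kvm_dirty_gfn entries:

#include <linux/kvm.h>
#include <sys/ioctl.h>

void harvest_ring(int vcpu_fd);         /* placeholder */

static void vcpu_run_loop(int vm_fd, int vcpu_fd, struct kvm_run *run)
{
        for (;;) {
                ioctl(vcpu_fd, KVM_RUN, 0);

                if (run->exit_reason == KVM_EXIT_DIRTY_RING_FULL) {
                        /* Collect the dirty log before it can be overwritten. */
                        harvest_ring(vcpu_fd);
                        ioctl(vm_fd, KVM_RESET_DIRTY_RINGS, 0);
                        continue;
                }

                /* ... handle the other exit reasons ... */
        }
}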

Thanks,
Gavin





2022-08-22 06:47:31

by Gavin Shan

Subject: Re: [PATCH v1 3/5] KVM: selftests: Dirty host pages in dirty_log_test

Hi Drew,

On 8/19/22 3:28 PM, Andrew Jones wrote:
> On Fri, Aug 19, 2022 at 08:55:59AM +0800, Gavin Shan wrote:
>> It's assumed that 1024 host pages, instead of guest pages, are dirtied
>> in each iteration in guest_code(). The current implementation misses
>> the case of mismatched page sizes in host and guest. For example,
>> ARM64 could have 64KB page size in guest, but 4KB page size in host.
>> (TEST_PAGES_PER_LOOP / 16), instead of TEST_PAGES_PER_LOOP, host pages
>> are dirtied in every iteration.
>>
>> Fix the issue by touching all sub-pages when we have mismatched
>> page sizes in host and guest.
>
> I'll let the dirty-log test authors decide what's best to do for this
> test, but I'd think we should let the guest continue dirtying its
> pages without knowledge of the host pages. Then, adjust the host test
> code to assert all sub-pages, other than the ones it expects the guest
> to have written, remain untouched.
>

I don't think what is stated in the change log is correct. The current
implementation already has the logic to handle mismatched page sizes
in vm_dirty_log_verify(), where 'step' is derived from
vm_num_host_pages(mode, 1). Please ignore this patch for now, as
explained below.

The issue I have is that 'dirty_log_test' hangs when I have a 4KB host
page size and a 64KB guest page size. It seems the vcpu never exits due
to a full ring buffer or a kick. I will investigate further to figure
out the root cause.

# ./dirty_log_test -M dirty-ring -m 7
Setting log mode to: 'dirty-ring'
Test iterations: 32, interval: 10 (ms)
Testing guest mode: PA-bits:40, VA-bits:48, 64K pages
guest physical test memory offset: 0xffbffc0000
vcpu stops because vcpu is kicked out...
Notifying vcpu to continue
vcpu continues now.
Iteration 1 collected 1903 pages
<no more output>

'dirty_log_test' works well when both host and guest have 4KB page size.

# ./dirty_log_test -M dirty-ring -m 5
Setting log mode to: 'dirty-ring'
Test iterations: 32, interval: 10 (ms)
Testing guest mode: PA-bits:40, VA-bits:48, 4K pages
guest physical test memory offset: 0xffbfffc000
vcpu stops because vcpu is kicked out...
Notifying vcpu to continue
vcpu continues now.
:
Dirtied 1006592 pages
Total bits checked: dirty (1020487), clear (7106070), track_next (974104)

Thanks,
Gavin



2022-08-22 19:51:28

by Peter Xu

Subject: Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

Hi, Gavin,

On Mon, Aug 22, 2022 at 11:58:20AM +1000, Gavin Shan wrote:
> > For context, the documentation says:
> >
> > <quote>
> > - if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
> > KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. [...]
> > </quote>
> >
> > What is the reason for picking this particular value?
> >
>
> It's inherited from x86. I don't think it has to be this particular value.
> The value is used to distinguish the region's owners like kvm_run, KVM_PIO_PAGE_OFFSET,
> KVM_COALESCED_MMIO_PAGE_OFFSET, and KVM_DIRTY_LOG_PAGE_OFFSET.
>
> How about to have 2 for KVM_DIRTY_LOG_PAGE_OFFSET in next revision?
> The virtual area is cheap, I guess it's also nice to use x86's
> pattern to have 64 for KVM_DIRTY_LOG_PAGE_OFFSET.
>
> #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
> #define KVM_DIRTY_LOG_PAGE_OFFSET 2

It was chosen not to be contiguous with the previously used offsets
because it'll be the 1st vcpu region that can cover multiple and a
dynamic number of pages. I wanted to leave 3-63 (x86 already used
offset 2) for small fields so they can stay contiguous, which looks a
little bit cleaner.

Currently, how many pages it'll use depends on the size set by the user,
though there's a max size limited by KVM_DIRTY_RING_MAX_ENTRIES, which
corresponds to a maximum of 1MB of memory.

So I think setting it to 2 is okay, as long as we keep the rest of the
1MB address space for the per-vcpu ring structure, so any new vcpu
fields (even if only 1 page is needed) must come after that maximum ring
size.
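
For reference, the rough arithmetic behind that 1MB figure; the two
constants below are my reading of the current uapi headers, not
something this series changes:

/* KVM_DIRTY_RING_MAX_ENTRIES */
#define MAX_ENTRIES     65536
/* sizeof(struct kvm_dirty_gfn) */
#define ENTRY_SIZE      16

_Static_assert(MAX_ENTRIES * ENTRY_SIZE == 1024 * 1024,
               "largest possible ring is 1MB, i.e. 256 pages of 4KB");

/*
 * With KVM_DIRTY_LOG_PAGE_OFFSET == 64, vcpu-fd page offsets 64..319
 * are therefore reserved for the ring, and any new per-vcpu region
 * should start at offset 320 or later.
 */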

Thanks,

--
Peter Xu

2022-08-22 22:00:24

by Marc Zyngier

Subject: Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

Hi Gavin,

On Mon, 22 Aug 2022 02:58:20 +0100,
Gavin Shan <[email protected]> wrote:
>
> Hi Marc,
>
> On 8/19/22 6:00 PM, Marc Zyngier wrote:
> > On Fri, 19 Aug 2022 01:55:57 +0100,
> > Gavin Shan <[email protected]> wrote:
> >>
> >> The ring-based dirty memory tracking has been available and enabled
> >> on x86 for a while. The feature is beneficial when the number of
> >> dirty pages is small in a checkpointing system or live migration
> >> scenario. More details can be found from fb04a1eddb1a ("KVM: X86:
> >> Implement ring-based dirty memory tracking").
> >>
> >> This enables the ring-based dirty memory tracking on ARM64. It's
> >> notable that no extra reserved ring entries are needed on ARM64
> >> because the huge pages are always split into base pages when page
> >> dirty tracking is enabled.
> >
> > Can you please elaborate on this? Adding a per-CPU ring of course
> > results in extra memory allocation, so there must be a subtle
> > x86-specific detail that I'm not aware of...
> >
>
> Sure. I guess it's helpful to explain how it works in next revision.
> Something like below:
>
> This enables the ring-based dirty memory tracking on ARM64. The feature
> is enabled by CONFIG_HAVE_KVM_DIRTY_RING, detected and enabled by
> CONFIG_HAVE_KVM_DIRTY_RING. A ring buffer is created on every vcpu and
> each entry is described by 'struct kvm_dirty_gfn'. The ring buffer is
> pushed by host when page becomes dirty and pulled by userspace. A vcpu
> exit is forced when the ring buffer becomes full. The ring buffers on
> all vcpus can be reset by ioctl command KVM_RESET_DIRTY_RINGS.
>
> Yes, I think so. Adding a per-CPU ring results in extra memory allocation.
> However, it's avoiding synchronization among multiple vcpus when dirty
> pages happen on multiple vcpus. More discussion can be found from [1]

Oh, I totally buy the relaxation of the synchronisation (though I
doubt this will have any visible effect until we have something like
Oliver's patches to allow parallel faulting).

But it is the "no extra reserved ring entries are needed on ARM64"
argument that I don't get yet.


>
> [1] https://patchwork.kernel.org/project/kvm/patch/BL2PR08MB4812F929A2760BC40EA757CF0630@BL2PR08MB481.namprd08.prod.outlook.com/
> (comment#8 from Radim Krčmář on May 3, 2016, 2:11 p.m. UTC)
>
>
> >>
> >> Signed-off-by: Gavin Shan <[email protected]>
> >> ---
> >> Documentation/virt/kvm/api.rst | 2 +-
> >> arch/arm64/include/uapi/asm/kvm.h | 1 +
> >> arch/arm64/kvm/Kconfig | 1 +
> >> arch/arm64/kvm/arm.c | 8 ++++++++
> >> 4 files changed, 11 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> >> index abd7c32126ce..19fa1ac017ed 100644
> >> --- a/Documentation/virt/kvm/api.rst
> >> +++ b/Documentation/virt/kvm/api.rst
> >> @@ -8022,7 +8022,7 @@ regardless of what has actually been exposed through the CPUID leaf.
> >> 8.29 KVM_CAP_DIRTY_LOG_RING
> >> ---------------------------
> >> -:Architectures: x86
> >> +:Architectures: x86, arm64
> >> :Parameters: args[0] - size of the dirty log ring
> >> KVM is capable of tracking dirty memory using ring buffers that
> >> are
> >> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
> >> index 3bb134355874..7e04b0b8d2b2 100644
> >> --- a/arch/arm64/include/uapi/asm/kvm.h
> >> +++ b/arch/arm64/include/uapi/asm/kvm.h
> >> @@ -43,6 +43,7 @@
> >> #define __KVM_HAVE_VCPU_EVENTS
> >> #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
> >> +#define KVM_DIRTY_LOG_PAGE_OFFSET 64
> >
> > For context, the documentation says:
> >
> > <quote>
> > - if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
> > KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. [...]
> > </quote>
> >
> > What is the reason for picking this particular value?
> >
>
> It's inherited from x86. I don't think it has to be this particular
> value. The value is used to distinguish the region's owners like
> kvm_run, KVM_PIO_PAGE_OFFSET, KVM_COALESCED_MMIO_PAGE_OFFSET, and
> KVM_DIRTY_LOG_PAGE_OFFSET.
>
> How about to have 2 for KVM_DIRTY_LOG_PAGE_OFFSET in next revision?
> The virtual area is cheap, I guess it's also nice to use x86's
> pattern to have 64 for KVM_DIRTY_LOG_PAGE_OFFSET.
>
> #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
> #define KVM_DIRTY_LOG_PAGE_OFFSET 2

Given that this is just an offset in the vcpu "file", I don't think it
matters that much. 64 definitely allows for some struct vcpu growth,
and it doesn't hurt to be compatible with x86 (for once...).

>
> >> #define KVM_REG_SIZE(id)
> >> \
> >> (1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT))
> >> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> >> index 815cc118c675..0309b2d0f2da 100644
> >> --- a/arch/arm64/kvm/Kconfig
> >> +++ b/arch/arm64/kvm/Kconfig
> >> @@ -32,6 +32,7 @@ menuconfig KVM
> >> select KVM_VFIO
> >> select HAVE_KVM_EVENTFD
> >> select HAVE_KVM_IRQFD
> >> + select HAVE_KVM_DIRTY_RING
> >> select HAVE_KVM_MSI
> >> select HAVE_KVM_IRQCHIP
> >> select HAVE_KVM_IRQ_ROUTING
> >> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> >> index 986cee6fbc7f..3de6b9b39db7 100644
> >> --- a/arch/arm64/kvm/arm.c
> >> +++ b/arch/arm64/kvm/arm.c
> >> @@ -866,6 +866,14 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> >> if (!ret)
> >> ret = 1;
> >> + /* Force vcpu exit if its dirty ring is soft-full */
> >> + if (unlikely(vcpu->kvm->dirty_ring_size &&
> >> + kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
> >> + vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> >> + trace_kvm_dirty_ring_exit(vcpu);
> >> + ret = 0;
> >> + }
> >> +
> >
> > Why can't this be moved to kvm_vcpu_exit_request() instead? I would
> > also very much like the check to be made a common helper with x86.
> >
> > A seemingly approach would be to make this a request on dirty log
> > insertion, and avoid the whole "check the log size" on every run,
> > which adds pointless overhead to unsuspecting users (aka everyone).
> >
>
> I though of having the check in kvm_vcpu_exit_request(). The various
> exit reasons are prioritized. x86 gives KVM_EXIT_DIRTY_RING_FULL the
> highest priority and ARM64 is just to follow. I don't think it really
> matters. I will improve it accordingly in next revision:
>
> - Change kvm_dirty_ring_soft_full() to something as below in dirty_ring.c
>
> bool kvm_dirty_ring_soft_full(struct kvm_vcpu *vcpu)
> {
> struct kvm *kvm = vcpu->vcpu;
> struct kvm_dirty_ring *ring = &vcpu->dirty_ring;
>
> if (unlikely(kvm->dirty_ring_size &&
> kvm_dirty_ring_used(ring) >= ring->soft_limit)) {
> vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> trace_kvm_dirty_ring_exit(vcpu);
> return true;
> }
>
> return false;
> }
>
> - Use the modified kvm_dirty_ring_soft_full() in kvm_vcpu_exit_request().
>
> Userspace needs KVM_EXIT_DIRTY_RING_FULL to collect the dirty log in time.
> Otherwise, the dirty log in the ring buffer will be overwritten. I'm not
> sure if anything else I missed?

I'm fine with the above, but what I really wanted is a request from
the dirty-ring insertion, instead of a check in kvm_vcpu_exit_request.
Something like this (which obviously doesn't compile, but you'll get
the idea):

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 986cee6fbc7f..0b41feb6fb7d 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -747,6 +747,12 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)

if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
return kvm_vcpu_suspend(vcpu);
+
+ if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu)) {
+ vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
+ trace_kvm_dirty_ring_exit(vcpu);
+ return 0;
+ }
}

return 1;
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index f4c2a6eb1666..08b2f01164fa 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -149,6 +149,7 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)

void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
{
+ struct kvm_vcpu *vcpu = container_of(ring, struct kvm_vcpu, dirty_ring);
struct kvm_dirty_gfn *entry;

/* It should never get full */
@@ -166,6 +167,9 @@ void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
kvm_dirty_gfn_set_dirtied(entry);
ring->dirty_index++;
trace_kvm_dirty_ring_push(ring, slot, offset);
+
+ if (kvm_dirty_ring_soft_full(vcpu))
+ kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
}

struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset)

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2022-08-23 03:22:54

by Gavin Shan

Subject: Re: [PATCH v1 3/5] KVM: selftests: Dirty host pages in dirty_log_test

Hi Drew,

On 8/22/22 4:29 PM, Gavin Shan wrote:
> On 8/19/22 3:28 PM, Andrew Jones wrote:
>> On Fri, Aug 19, 2022 at 08:55:59AM +0800, Gavin Shan wrote:
>>> It's assumed that 1024 host pages, instead of guest pages, are dirtied
>>> in each iteration in guest_code(). The current implementation misses
>>> the case of mismatched page sizes in host and guest. For example,
>>> ARM64 could have 64KB page size in guest, but 4KB page size in host.
>>> (TEST_PAGES_PER_LOOP / 16), instead of TEST_PAGES_PER_LOOP, host pages
>>> are dirtied in every iteration.
>>>
>>> Fix the issue by touching all sub-pages when we have mismatched
>>> page sizes in host and guest.
>>
>> I'll let the dirty-log test authors decide what's best to do for this
>> test, but I'd think we should let the guest continue dirtying its
>> pages without knowledge of the host pages. Then, adjust the host test
>> code to assert all sub-pages, other than the ones it expects the guest
>> to have written, remain untouched.
>>
>
> I don't think what is clarified in the change log is correct. The current
> implementation already had the logic to handle the mismatched page sizes
> in vm_dirty_log_verify() where 'step' is used for it by fetching value
> from vm_num_host_pages(mode, 1). Please ignore this patch for now, as
> explained below.
>
> The issue I have is the 'dirty_log_test' hangs when I have 4KB host page size
> and 64KB guest page size. It seems the vcpu doesn't exit due to full ring
> buffer state or kick-off. I will have more investigations to figure out the
> root cause.
>

[...]

Please ignore this PATCH[3/5]. I think this should be fixed by selecting
a correct dirty ring count, and the fix will be folded into PATCH[5/5]
in the next revision.

In dirty_log_test, the guest has 1GB of memory to write and dirty. When
the page sizes on host and guest mismatch, which is either
4KB-host/64KB-guest or 64KB-host/4KB-guest (apart from the 16KB case),
16384 host pages are dirtied in each iteration. The default dirty ring
count is 65536, so the vcpu never exits due to the full-dirty-ring-buffer
state. This leaves the guest code running forever while the dirty log is
never collected by the main thread.

#define TEST_DIRTY_RING_COUNT 65536

dirty_pages_per_iteration = (0x40000000 / 0x10000)
= 0x4000
= 16384
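
As a rough illustration of how the logic in PATCH[5/5] handles this,
assuming the clamp is driven by the number of host pages the test
actually dirties:

#include <stdint.h>

/*
 * Pick the largest power-of-two ring size that doesn't exceed the
 * number of host pages, so the soft-full vcpu exit can trigger.
 */
static uint32_t choose_ring_count(uint32_t host_pages, uint32_t requested)
{
        uint32_t limit = 1u << (31 - __builtin_clz(host_pages));

        requested = 1u << (31 - __builtin_clz(requested));
        return requested < limit ? requested : limit;
}

/*
 * For the numbers above: host_pages = 16384 and the default request is
 * 65536 (TEST_DIRTY_RING_COUNT), so the ring ends up with 16384 entries
 * and the dirty-ring-full exit can fire again.
 */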

Thanks,
Gavin

2022-08-23 04:24:09

by Gavin Shan

Subject: Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

Hi Peter,

On 8/23/22 4:55 AM, Peter Xu wrote:
> On Mon, Aug 22, 2022 at 11:58:20AM +1000, Gavin Shan wrote:
>>> For context, the documentation says:
>>>
>>> <quote>
>>> - if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
>>> KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. [...]
>>> </quote>
>>>
>>> What is the reason for picking this particular value?
>>>
>>
>> It's inherited from x86. I don't think it has to be this particular value.
>> The value is used to distinguish the region's owners like kvm_run, KVM_PIO_PAGE_OFFSET,
>> KVM_COALESCED_MMIO_PAGE_OFFSET, and KVM_DIRTY_LOG_PAGE_OFFSET.
>>
>> How about to have 2 for KVM_DIRTY_LOG_PAGE_OFFSET in next revision?
>> The virtual area is cheap, I guess it's also nice to use x86's
>> pattern to have 64 for KVM_DIRTY_LOG_PAGE_OFFSET.
>>
>> #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
>> #define KVM_DIRTY_LOG_PAGE_OFFSET 2
>
> It was chosen not to be continuous of previous used offset because it'll be
> the 1st vcpu region that can cover multiple & dynamic number of pages. I
> wanted to leave the 3-63 (x86 used offset 2 already) for small fields so
> they can be continuous, which looks a little bit cleaner.
>
> Currently how many pages it'll use depends on the size set by the user,
> though there's a max size limited by KVM_DIRTY_RING_MAX_ENTRIES, which is a
> maximum of 1MB memory.
>
> So I think setting it to 2 is okay, as long as we keep the rest 1MB address
> space for the per-vcpu ring structure, so any new vcpu fields (even if only
> 1 page will be needed) need to be after that maximum size of the ring.
>

Thanks for the details. I think we'd better keep the layout since the
virtual area is really cheap. So let's keep it as is if Marc doesn't
object. In this way, a new area, which is usually one page in size, can
go after KVM_COALESCED_MMIO_PAGE_OFFSET.

#define KVM_DIRTY_LOG_PAGE_OFFSET 64

Thanks,
Gavin



2022-08-23 05:35:38

by Gavin Shan

Subject: Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

Hi Marc,

On 8/23/22 7:42 AM, Marc Zyngier wrote:
> On Mon, 22 Aug 2022 02:58:20 +0100,
> Gavin Shan <[email protected]> wrote:
>> On 8/19/22 6:00 PM, Marc Zyngier wrote:
>>> On Fri, 19 Aug 2022 01:55:57 +0100,
>>> Gavin Shan <[email protected]> wrote:
>>>>
>>>> The ring-based dirty memory tracking has been available and enabled
>>>> on x86 for a while. The feature is beneficial when the number of
>>>> dirty pages is small in a checkpointing system or live migration
>>>> scenario. More details can be found from fb04a1eddb1a ("KVM: X86:
>>>> Implement ring-based dirty memory tracking").
>>>>
>>>> This enables the ring-based dirty memory tracking on ARM64. It's
>>>> notable that no extra reserved ring entries are needed on ARM64
>>>> because the huge pages are always split into base pages when page
>>>> dirty tracking is enabled.
>>>
>>> Can you please elaborate on this? Adding a per-CPU ring of course
>>> results in extra memory allocation, so there must be a subtle
>>> x86-specific detail that I'm not aware of...
>>>
>>
>> Sure. I guess it's helpful to explain how it works in next revision.
>> Something like below:
>>
>> This enables the ring-based dirty memory tracking on ARM64. The feature
>> is enabled by CONFIG_HAVE_KVM_DIRTY_RING, detected and enabled by
>> CONFIG_HAVE_KVM_DIRTY_RING. A ring buffer is created on every vcpu and
>> each entry is described by 'struct kvm_dirty_gfn'. The ring buffer is
>> pushed by host when page becomes dirty and pulled by userspace. A vcpu
>> exit is forced when the ring buffer becomes full. The ring buffers on
>> all vcpus can be reset by ioctl command KVM_RESET_DIRTY_RINGS.
>>
>> Yes, I think so. Adding a per-CPU ring results in extra memory allocation.
>> However, it's avoiding synchronization among multiple vcpus when dirty
>> pages happen on multiple vcpus. More discussion can be found from [1]
>
> Oh, I totally buy the relaxation of the synchronisation (though I
> doubt this will have any visible effect until we have something like
> Oliver's patches to allow parallel faulting).
>
> But it is the "no extra reserved ring entries are needed on ARM64"
> argument that I don't get yet.
>

Ok. The extra reserved ring entries are x86 specific. When x86's PML
(Page Modification Logging) hardware capability is enabled, the vcpu
exits due to a full PML buffer, which holds 512 entries. All the
information in the PML buffer is pushed to the dirty ring buffer in one
shot. To avoid overrunning the dirty ring buffer, 512 entries are
reserved.

=== include/linux/kvm_host.h

#define KVM_DIRTY_RING_RSVD_ENTRIES 64 // fixed and reserved ring entries

=== virt/kvm/dirty_ring.c

int __weak kvm_cpu_dirty_log_size(void)
{
        return 0;
}

u32 kvm_dirty_ring_get_rsvd_entries(void)
{
        return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
}

=== arch/x86/kvm/mmu/mmu.c

int kvm_cpu_dirty_log_size(void)
{
        return kvm_x86_ops.cpu_dirty_log_size; // Set to 512 when PML is enabled
}


kvm_cpu_dirty_log_size() isn't overridden by ARM64, meaning it returns
zero on ARM64. On x86, it returns 512 when PML is enabled.
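
The reserved entries then feed into the soft limit roughly as below.
This is my paraphrase of kvm_dirty_ring_alloc() in virt/kvm/dirty_ring.c,
shown only for illustration; nothing in this series changes it:

int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size)
{
        ring->dirty_gfns = vzalloc(size);
        if (!ring->dirty_gfns)
                return -ENOMEM;

        ring->size = size / sizeof(struct kvm_dirty_gfn);
        /* 64 fixed entries everywhere, plus 512 more on x86 with PML */
        ring->soft_limit = ring->size - kvm_dirty_ring_get_rsvd_entries();
        ring->dirty_index = 0;
        ring->reset_index = 0;
        ring->index = index;

        return 0;
}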

>>
>> [1] https://patchwork.kernel.org/project/kvm/patch/BL2PR08MB4812F929A2760BC40EA757CF0630@BL2PR08MB481.namprd08.prod.outlook.com/
>> (comment#8 from Radim Krčmář on May 3, 2016, 2:11 p.m. UTC)
>>
>>
>>>>
>>>> Signed-off-by: Gavin Shan <[email protected]>
>>>> ---
>>>> Documentation/virt/kvm/api.rst | 2 +-
>>>> arch/arm64/include/uapi/asm/kvm.h | 1 +
>>>> arch/arm64/kvm/Kconfig | 1 +
>>>> arch/arm64/kvm/arm.c | 8 ++++++++
>>>> 4 files changed, 11 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>>>> index abd7c32126ce..19fa1ac017ed 100644
>>>> --- a/Documentation/virt/kvm/api.rst
>>>> +++ b/Documentation/virt/kvm/api.rst
>>>> @@ -8022,7 +8022,7 @@ regardless of what has actually been exposed through the CPUID leaf.
>>>> 8.29 KVM_CAP_DIRTY_LOG_RING
>>>> ---------------------------
>>>> -:Architectures: x86
>>>> +:Architectures: x86, arm64
>>>> :Parameters: args[0] - size of the dirty log ring
>>>> KVM is capable of tracking dirty memory using ring buffers that
>>>> are
>>>> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
>>>> index 3bb134355874..7e04b0b8d2b2 100644
>>>> --- a/arch/arm64/include/uapi/asm/kvm.h
>>>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>>>> @@ -43,6 +43,7 @@
>>>> #define __KVM_HAVE_VCPU_EVENTS
>>>> #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
>>>> +#define KVM_DIRTY_LOG_PAGE_OFFSET 64
>>>
>>> For context, the documentation says:
>>>
>>> <quote>
>>> - if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
>>> KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. [...]
>>> </quote>
>>>
>>> What is the reason for picking this particular value?
>>>
>>
>> It's inherited from x86. I don't think it has to be this particular
>> value. The value is used to distinguish the region's owners like
>> kvm_run, KVM_PIO_PAGE_OFFSET, KVM_COALESCED_MMIO_PAGE_OFFSET, and
>> KVM_DIRTY_LOG_PAGE_OFFSET.
>>
>> How about to have 2 for KVM_DIRTY_LOG_PAGE_OFFSET in next revision?
>> The virtual area is cheap, I guess it's also nice to use x86's
>> pattern to have 64 for KVM_DIRTY_LOG_PAGE_OFFSET.
>>
>> #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
>> #define KVM_DIRTY_LOG_PAGE_OFFSET 2
>
> Given that this is just an offset in the vcpu "file", I don't think it
> matters that much. 64 definitely allows for some struct vcpu growth,
> and it doesn't hurt to be compatible with x86 (for once...).
>

Sure, thanks. I think it'd be better to have the same pattern as x86 anyway.

>>
>>>> #define KVM_REG_SIZE(id)
>>>> \
>>>> (1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT))
>>>> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
>>>> index 815cc118c675..0309b2d0f2da 100644
>>>> --- a/arch/arm64/kvm/Kconfig
>>>> +++ b/arch/arm64/kvm/Kconfig
>>>> @@ -32,6 +32,7 @@ menuconfig KVM
>>>> select KVM_VFIO
>>>> select HAVE_KVM_EVENTFD
>>>> select HAVE_KVM_IRQFD
>>>> + select HAVE_KVM_DIRTY_RING
>>>> select HAVE_KVM_MSI
>>>> select HAVE_KVM_IRQCHIP
>>>> select HAVE_KVM_IRQ_ROUTING
>>>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>>>> index 986cee6fbc7f..3de6b9b39db7 100644
>>>> --- a/arch/arm64/kvm/arm.c
>>>> +++ b/arch/arm64/kvm/arm.c
>>>> @@ -866,6 +866,14 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>>>> if (!ret)
>>>> ret = 1;
>>>> + /* Force vcpu exit if its dirty ring is soft-full */
>>>> + if (unlikely(vcpu->kvm->dirty_ring_size &&
>>>> + kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
>>>> + vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
>>>> + trace_kvm_dirty_ring_exit(vcpu);
>>>> + ret = 0;
>>>> + }
>>>> +
>>>
>>> Why can't this be moved to kvm_vcpu_exit_request() instead? I would
>>> also very much like the check to be made a common helper with x86.
>>>
>>> A seemingly approach would be to make this a request on dirty log
>>> insertion, and avoid the whole "check the log size" on every run,
>>> which adds pointless overhead to unsuspecting users (aka everyone).
>>>
>>
>> I though of having the check in kvm_vcpu_exit_request(). The various
>> exit reasons are prioritized. x86 gives KVM_EXIT_DIRTY_RING_FULL the
>> highest priority and ARM64 is just to follow. I don't think it really
>> matters. I will improve it accordingly in next revision:
>>
>> - Change kvm_dirty_ring_soft_full() to something as below in dirty_ring.c
>>
>> bool kvm_dirty_ring_soft_full(struct kvm_vcpu *vcpu)
>> {
>> struct kvm *kvm = vcpu->vcpu;
>> struct kvm_dirty_ring *ring = &vcpu->dirty_ring;
>>
>> if (unlikely(kvm->dirty_ring_size &&
>> kvm_dirty_ring_used(ring) >= ring->soft_limit)) {
>> vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
>> trace_kvm_dirty_ring_exit(vcpu);
>> return true;
>> }
>>
>> return false;
>> }
>>
>> - Use the modified kvm_dirty_ring_soft_full() in kvm_vcpu_exit_request().
>>
>> Userspace needs KVM_EXIT_DIRTY_RING_FULL to collect the dirty log in time.
>> Otherwise, the dirty log in the ring buffer will be overwritten. I'm not
>> sure if anything else I missed?
>
> I'm fine with the above, but what I really wanted is a request from
> the dirty-ring insertion, instead of a check in kvm_vpcu_exit_request.
> Something like this (which obviously doesn't compile, but you'll get
> the idea):
>
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 986cee6fbc7f..0b41feb6fb7d 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -747,6 +747,12 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
>
> if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
> return kvm_vcpu_suspend(vcpu);
> +
> + if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu)) {
> + vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> + trace_kvm_dirty_ring_exit(vcpu);
> + return 0;
> + }
> }
>
> return 1;
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> index f4c2a6eb1666..08b2f01164fa 100644
> --- a/virt/kvm/dirty_ring.c
> +++ b/virt/kvm/dirty_ring.c
> @@ -149,6 +149,7 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
>
> void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
> {
> + struct kvm_vcpu *vcpu = container_of(ring, struct kvm_vcpu, dirty_ring);
> struct kvm_dirty_gfn *entry;
>
> /* It should never get full */
> @@ -166,6 +167,9 @@ void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
> kvm_dirty_gfn_set_dirtied(entry);
> ring->dirty_index++;
> trace_kvm_dirty_ring_push(ring, slot, offset);
> +
> + if (kvm_dirty_ring_soft_full(vcpu))
> + kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
> }
>
> struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset)
>

Ok, thanks for the details, Marc. I will adopt your code in next revision :)

Thanks,
Gavin

2022-08-23 16:48:00

by Oliver Upton

[permalink] [raw]
Subject: Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

On Mon, Aug 22, 2022 at 10:42:15PM +0100, Marc Zyngier wrote:
> Hi Gavin,
>
> On Mon, 22 Aug 2022 02:58:20 +0100,
> Gavin Shan <[email protected]> wrote:
> >
> > Hi Marc,
> >
> > On 8/19/22 6:00 PM, Marc Zyngier wrote:
> > > On Fri, 19 Aug 2022 01:55:57 +0100,
> > > Gavin Shan <[email protected]> wrote:
> > >>
> > >> The ring-based dirty memory tracking has been available and enabled
> > >> on x86 for a while. The feature is beneficial when the number of
> > >> dirty pages is small in a checkpointing system or live migration
> > >> scenario. More details can be found from fb04a1eddb1a ("KVM: X86:
> > >> Implement ring-based dirty memory tracking").
> > >>
> > >> This enables the ring-based dirty memory tracking on ARM64. It's
> > >> notable that no extra reserved ring entries are needed on ARM64
> > >> because the huge pages are always split into base pages when page
> > >> dirty tracking is enabled.
> > >
> > > Can you please elaborate on this? Adding a per-CPU ring of course
> > > results in extra memory allocation, so there must be a subtle
> > > x86-specific detail that I'm not aware of...
> > >
> >
> > Sure. I guess it's helpful to explain how it works in next revision.
> > Something like below:
> >
> > This enables the ring-based dirty memory tracking on ARM64. The feature
> > is enabled by CONFIG_HAVE_KVM_DIRTY_RING, detected and enabled by the
> > KVM_CAP_DIRTY_LOG_RING capability. A ring buffer is created on every vcpu and
> > each entry is described by 'struct kvm_dirty_gfn'. The ring buffer is
> > pushed by the host when a page becomes dirty and pulled by userspace. A vcpu
> > exit is forced when the ring buffer becomes full. The ring buffers on
> > all vcpus can be reset by ioctl command KVM_RESET_DIRTY_RINGS.
> >
> > Yes, I think so. Adding a per-CPU ring results in extra memory allocation.
> > However, it's avoiding synchronization among multiple vcpus when dirty
> > pages happen on multiple vcpus. More discussion can be found from [1]
>
> Oh, I totally buy the relaxation of the synchronisation (though I
> doubt this will have any visible effect until we have something like
> Oliver's patches to allow parallel faulting).
>

Heh, yeah I need to get that out the door. I'll also note that Gavin's
changes are still relevant without that series, as we do write unprotect
in parallel at PTE granularity after commit f783ef1c0e82 ("KVM: arm64:
Add fast path to handle permission relaxation during dirty logging").

--
Thanks,
Oliver

2022-08-23 17:24:28

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

On Tue, Aug 23, 2022 at 03:22:17PM +1000, Gavin Shan wrote:
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index 986cee6fbc7f..0b41feb6fb7d 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -747,6 +747,12 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
> > if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
> > return kvm_vcpu_suspend(vcpu);
> > +
> > + if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu)) {
> > + vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> > + trace_kvm_dirty_ring_exit(vcpu);
> > + return 0;
> > + }
> > }
> > return 1;
> > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > index f4c2a6eb1666..08b2f01164fa 100644
> > --- a/virt/kvm/dirty_ring.c
> > +++ b/virt/kvm/dirty_ring.c
> > @@ -149,6 +149,7 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
> > void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
> > {
> > + struct kvm_vcpu *vcpu = container_of(ring, struct kvm_vcpu, dirty_ring);
> > struct kvm_dirty_gfn *entry;
> > /* It should never get full */
> > @@ -166,6 +167,9 @@ void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
> > kvm_dirty_gfn_set_dirtied(entry);
> > ring->dirty_index++;
> > trace_kvm_dirty_ring_push(ring, slot, offset);
> > +
> > + if (kvm_dirty_ring_soft_full(vcpu))
> > + kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
> > }
> > struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset)
> >
>
> Ok, thanks for the details, Marc. I will adopt your code in next revision :)

Note that there can be a slight difference with the old/new code, in that
an (especially malicious) userapp can logically ignore the DIRTY_RING_FULL
vmexit and keep kicking VCPU_RUN with the new code.

Unlike the old code, the 2nd/3rd/... KVM_RUN will still run in the new code
until the next dirty pfn being pushed to the ring, then it'll request ring
full exit again.

Each time it exits, the ring grows by one entry.

Eventually, IIUC, it can easily hit the ring-full condition and trigger the
warning at the entry of kvm_dirty_ring_push():

/* It should never get full */
WARN_ON_ONCE(kvm_dirty_ring_full(ring));

We did that because kvm_dirty_ring_push() was previously designed to not be
able to fail at all (e.g., in the old bitmap world we never will fail too).
We can't because we can't lose any dirty page or migration could silently
fail too (consider when we do user exit due to ring full and migration just
completed; there could be unsynced pages on src/dst).

So even though the old approach needs to read kvm->dirty_ring_size on
every entry, which is a pity, it avoids the issue above.

Side note: for x86 the dirty ring check was put at the entry point not because
it needs to be the highest priority - it would really be the same if checked
along with the other kvm requests. It's just that it's the fastest way to fail
properly, if needed, before loading the mmu, disabling preemption/irq, etc.

Thanks,

--
Peter Xu

2022-08-23 20:06:18

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

On Tue, 23 Aug 2022 14:58:19 +0100,
Peter Xu <[email protected]> wrote:
>
> On Tue, Aug 23, 2022 at 03:22:17PM +1000, Gavin Shan wrote:
> > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > > index 986cee6fbc7f..0b41feb6fb7d 100644
> > > --- a/arch/arm64/kvm/arm.c
> > > +++ b/arch/arm64/kvm/arm.c
> > > @@ -747,6 +747,12 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
> > > if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
> > > return kvm_vcpu_suspend(vcpu);
> > > +
> > > + if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu)) {
> > > + vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> > > + trace_kvm_dirty_ring_exit(vcpu);
> > > + return 0;
> > > + }
> > > }
> > > return 1;
> > > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > > index f4c2a6eb1666..08b2f01164fa 100644
> > > --- a/virt/kvm/dirty_ring.c
> > > +++ b/virt/kvm/dirty_ring.c
> > > @@ -149,6 +149,7 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
> > > void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
> > > {
> > > + struct kvm_vcpu *vcpu = container_of(ring, struct kvm_vcpu, dirty_ring);
> > > struct kvm_dirty_gfn *entry;
> > > /* It should never get full */
> > > @@ -166,6 +167,9 @@ void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
> > > kvm_dirty_gfn_set_dirtied(entry);
> > > ring->dirty_index++;
> > > trace_kvm_dirty_ring_push(ring, slot, offset);
> > > +
> > > + if (kvm_dirty_ring_soft_full(vcpu))
> > > + kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
> > > }
> > > struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset)
> > >
> >
> > Ok, thanks for the details, Marc. I will adopt your code in next revision :)
>
> Note that there can be a slight difference with the old/new code, in that
> an (especially malicious) userapp can logically ignore the DIRTY_RING_FULL
> vmexit and keep kicking VCPU_RUN with the new code.
>
> Unlike the old code, the 2nd/3rd/... KVM_RUN will still run in the new code
> until the next dirty pfn being pushed to the ring, then it'll request ring
> full exit again.
>
> Each time it exits, the ring grows by one entry.
>
> Eventually, IIUC, it can easily hit the ring-full condition and trigger the
> warning at the entry of kvm_dirty_ring_push():
>
> /* It should never get full */
> WARN_ON_ONCE(kvm_dirty_ring_full(ring));

Hmmm, yes. Well spotted.

> We did that because kvm_dirty_ring_push() was previously designed to not be
> able to fail at all (e.g., in the old bitmap world we never will fail too).
> We can't because we can't lose any dirty page or migration could silently
> fail too (consider when we do user exit due to ring full and migration just
> completed; there could be unsynced pages on src/dst).
>
> So even though the old approach needs to read kvm->dirty_ring_size on
> every entry, which is a pity, it avoids the issue above.

I don't think we really need this check on the hot path. All we need
is to make the request sticky until userspace gets their act together
and consumes elements in the ring. Something like:

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 986cee6fbc7f..e8ed5e1af159 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -747,6 +747,14 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)

if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
return kvm_vcpu_suspend(vcpu);
+
+ if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu) &&
+ kvm_dirty_ring_soft_full(vcpu)) {
+ kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
+ vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
+ trace_kvm_dirty_ring_exit(vcpu);
+ return 0;
+ }
}

return 1;


However, I'm a bit concerned by the reset side of things. It iterates
over the vcpus and expects the view of each ring to be consistent,
even if userspace is hacking at it from another CPU. For example, I
can't see what guarantees that the kernel observes the writes from
userspace in the order they are being performed (the documentation
provides no requirements other than "it must collect the dirty GFNs in
sequence", which doesn't mean much from an ordering perspective).

I can see that working on a strongly ordered architecture, but on
something as relaxed as ARM, the CPUs may^Wwill aggressively reorder
stuff that isn't explicitly ordered. I have the feeling that a CAS
operation on both sides would be enough, but someone who actually
understands how this works should have a look...

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2022-08-23 20:54:05

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

On Tue, 23 Aug 2022 15:44:42 +0100,
Oliver Upton <[email protected]> wrote:
>
> On Mon, Aug 22, 2022 at 10:42:15PM +0100, Marc Zyngier wrote:
> > Hi Gavin,
> >
> > On Mon, 22 Aug 2022 02:58:20 +0100,
> > Gavin Shan <[email protected]> wrote:
> > >
> > > Hi Marc,
> > >
> > > On 8/19/22 6:00 PM, Marc Zyngier wrote:
> > > > On Fri, 19 Aug 2022 01:55:57 +0100,
> > > > Gavin Shan <[email protected]> wrote:
> > > >>
> > > >> The ring-based dirty memory tracking has been available and enabled
> > > >> on x86 for a while. The feature is beneficial when the number of
> > > >> dirty pages is small in a checkpointing system or live migration
> > > >> scenario. More details can be found from fb04a1eddb1a ("KVM: X86:
> > > >> Implement ring-based dirty memory tracking").
> > > >>
> > > >> This enables the ring-based dirty memory tracking on ARM64. It's
> > > >> notable that no extra reserved ring entries are needed on ARM64
> > > >> because the huge pages are always split into base pages when page
> > > >> dirty tracking is enabled.
> > > >
> > > > Can you please elaborate on this? Adding a per-CPU ring of course
> > > > results in extra memory allocation, so there must be a subtle
> > > > x86-specific detail that I'm not aware of...
> > > >
> > >
> > > Sure. I guess it's helpful to explain how it works in next revision.
> > > Something like below:
> > >
> > > This enables the ring-based dirty memory tracking on ARM64. The feature
> > > is enabled by CONFIG_HAVE_KVM_DIRTY_RING, detected and enabled by the
> > > KVM_CAP_DIRTY_LOG_RING capability. A ring buffer is created on every vcpu and
> > > each entry is described by 'struct kvm_dirty_gfn'. The ring buffer is
> > > pushed by the host when a page becomes dirty and pulled by userspace. A vcpu
> > > exit is forced when the ring buffer becomes full. The ring buffers on
> > > all vcpus can be reset by ioctl command KVM_RESET_DIRTY_RINGS.
> > >
> > > Yes, I think so. Adding a per-CPU ring results in extra memory allocation.
> > > However, it's avoiding synchronization among multiple vcpus when dirty
> > > pages happen on multiple vcpus. More discussion can be found from [1]
> >
> > Oh, I totally buy the relaxation of the synchronisation (though I
> > doubt this will have any visible effect until we have something like
> > Oliver's patches to allow parallel faulting).
> >
>
> Heh, yeah I need to get that out the door. I'll also note that Gavin's
> changes are still relevant without that series, as we do write unprotect
> in parallel at PTE granularity after commit f783ef1c0e82 ("KVM: arm64:
> Add fast path to handle permission relaxation during dirty logging").

Ah, true. Now if only someone could explain how the whole
producer-consumer thing works without a trace of a barrier, that'd be
great...

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2022-08-23 21:23:31

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

On Tue, Aug 23, 2022 at 08:17:03PM +0100, Marc Zyngier wrote:
> I don't think we really need this check on the hot path. All we need
> is to make the request sticky until userspace gets their act together
> and consumes elements in the ring. Something like:
>
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 986cee6fbc7f..e8ed5e1af159 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -747,6 +747,14 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
>
> if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
> return kvm_vcpu_suspend(vcpu);
> +
> + if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu) &&
> + kvm_dirty_ring_soft_full(vcpu)) {
> + kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
> + vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> + trace_kvm_dirty_ring_exit(vcpu);
> + return 0;
> + }
> }
>
> return 1;

Right, this seems working. We can also use kvm_test_request() here.

>
>
> However, I'm a bit concerned by the reset side of things. It iterates
> over the vcpus and expects the view of each ring to be consistent,
> even if userspace is hacking at it from another CPU. For example, I
> can't see what guarantees that the kernel observes the writes from
> userspace in the order they are being performed (the documentation
> provides no requirements other than "it must collect the dirty GFNs in
> sequence", which doesn't mean much from an ordering perspective).
>
> I can see that working on a strongly ordered architecture, but on
> something as relaxed as ARM, the CPUs may^Wwill aggressively reorder
> stuff that isn't explicitly ordered. I have the feeling that a CAS
> operation on both sides would be enough, but someone who actually
> understands how this works should have a look...

I definitely don't think I 100% understand all the ordering things since
they're complicated.. but my understanding is that the reset procedure
didn't need memory barrier (unlike pushing, where we have explicit wmb),
because we assumed the userapp is not hostile so logically it should only
modify the flags which is a 32bit field, assuming atomicity guaranteed.

IIRC we used to discuss similar questions on "what if the user is hostile
and wants to hack the process by messing up with the ring", and our
conclusion was as long as the process wouldn't mess up anything outside
itself it should be okay. E.g. It should not be able to either cause the
host to misfunction, or trigger kernel warnings in dmesg, etc..

Thanks,

--
Peter Xu

2022-08-23 23:25:45

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

On Tue, 23 Aug 2022 22:20:32 +0100,
Peter Xu <[email protected]> wrote:
>
> On Tue, Aug 23, 2022 at 08:17:03PM +0100, Marc Zyngier wrote:
> > I don't think we really need this check on the hot path. All we need
> > is to make the request sticky until userspace gets their act together
> > and consumes elements in the ring. Something like:
> >
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index 986cee6fbc7f..e8ed5e1af159 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -747,6 +747,14 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
> >
> > if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
> > return kvm_vcpu_suspend(vcpu);
> > +
> > + if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu) &&
> > + kvm_dirty_ring_soft_full(vcpu)) {
> > + kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
> > + vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> > + trace_kvm_dirty_ring_exit(vcpu);
> > + return 0;
> > + }
> > }
> >
> > return 1;
>
> Right, this seems working. We can also use kvm_test_request() here.
>
> >
> >
> > However, I'm a bit concerned by the reset side of things. It iterates
> > over the vcpus and expects the view of each ring to be consistent,
> > even if userspace is hacking at it from another CPU. For example, I
> > can't see what guarantees that the kernel observes the writes from
> > userspace in the order they are being performed (the documentation
> > provides no requirements other than "it must collect the dirty GFNs in
> > sequence", which doesn't mean much from an ordering perspective).
> >
> > I can see that working on a strongly ordered architecture, but on
> > something as relaxed as ARM, the CPUs may^Wwill aggressively reorder
> > stuff that isn't explicitly ordered. I have the feeling that a CAS
> > operation on both sides would be enough, but someone who actually
> > understands how this works should have a look...
>
> I definitely don't think I 100% understand all the ordering things since
> they're complicated.. but my understanding is that the reset procedure
> didn't need memory barrier (unlike pushing, where we have explicit wmb),
> because we assumed the userapp is not hostile so logically it should only
> modify the flags which is a 32bit field, assuming atomicity guaranteed.

Atomicity doesn't guarantee ordering, unfortunately. Take the
following example: CPU0 is changing a bunch of flags for GFNs A, B, C,
D that exist in the ring in that order, and CPU1 performs an ioctl to
reset the page state.

CPU0:
write_flag(A, KVM_DIRTY_GFN_F_RESET)
write_flag(B, KVM_DIRTY_GFN_F_RESET)
write_flag(C, KVM_DIRTY_GFN_F_RESET)
write_flag(D, KVM_DIRTY_GFN_F_RESET)
[...]

CPU1:
ioctl(KVM_RESET_DIRTY_RINGS)

Since CPU0 writes do not have any ordering, CPU1 can observe the
writes in a sequence that have nothing to do with program order, and
could for example observe that GFN A and D have been reset, but not B
and C. This in turn breaks the logic in the reset code (B, C, and D
don't get reset), despite userspace having followed the spec to the
letter. If each was a store-release (which is the case on x86), it
wouldn't be a problem, but nothing calls it in the documentation.

Maybe that's not a big deal if it is expected that each CPU will issue
a KVM_RESET_DIRTY_RINGS itself, ensuring that it observe its own
writes. But expecting this to work across CPUs without any barrier is
wishful thinking.
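
To sketch what a correctly ordered userspace harvester could look like (this
is illustrative only: the struct kvm_dirty_gfn layout and the
KVM_DIRTY_GFN_F_* flags are the real UAPI, but collect_dirty_page() and the
loop structure are made up for the example and don't come from any existing
VMM):

#include <stdint.h>
#include <linux/kvm.h>          /* struct kvm_dirty_gfn, KVM_DIRTY_GFN_F_* */

/* Hypothetical consumer helper, declared only for the example. */
extern void collect_dirty_page(uint32_t slot, uint64_t offset);

/*
 * Harvest one vcpu's dirty ring. The acquire load pairs with the kernel's
 * publication of the entry; the release store is what would let a reset
 * performed from another CPU observe the RESET flags in program order.
 */
static uint32_t harvest_ring(struct kvm_dirty_gfn *ring, uint32_t ring_size,
                             uint32_t *fetch_index)
{
        uint32_t count = 0;

        for (;;) {
                struct kvm_dirty_gfn *gfn = &ring[*fetch_index % ring_size];

                /* Acquire: don't read slot/offset before seeing DIRTY. */
                if (!(__atomic_load_n(&gfn->flags, __ATOMIC_ACQUIRE) &
                      KVM_DIRTY_GFN_F_DIRTY))
                        break;

                collect_dirty_page(gfn->slot, gfn->offset);

                /* Release: order this flag write after the reads above. */
                __atomic_store_n(&gfn->flags, KVM_DIRTY_GFN_F_RESET,
                                 __ATOMIC_RELEASE);
                (*fetch_index)++;
                count++;
        }

        return count;
}

With the flag updates done as release stores, the scenario above cannot be
observed: if the reset side reads GFN D's flag with acquire semantics and
sees RESET, the earlier resets of A, B and C are guaranteed to be visible
as well.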

> IIRC we used to discuss similar questions on "what if the user is hostile
> and wants to hack the process by messing up with the ring", and our
> conclusion was as long as the process wouldn't mess up anything outside
> itself it should be okay. E.g. It should not be able to either cause the
> host to misfunction, or trigger kernel warnings in dmesg, etc..

I'm not even discussing safety here. I'm purely discussing the
interactions between userspace and kernel based on the documentation
and the existing kernel code.

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2022-08-23 23:44:32

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

On Tue, Aug 23, 2022 at 11:47:03PM +0100, Marc Zyngier wrote:
> On Tue, 23 Aug 2022 22:20:32 +0100,
> Peter Xu <[email protected]> wrote:
> >
> > On Tue, Aug 23, 2022 at 08:17:03PM +0100, Marc Zyngier wrote:
> > > I don't think we really need this check on the hot path. All we need
> > > is to make the request sticky until userspace gets their act together
> > > and consumes elements in the ring. Something like:
> > >
> > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > > index 986cee6fbc7f..e8ed5e1af159 100644
> > > --- a/arch/arm64/kvm/arm.c
> > > +++ b/arch/arm64/kvm/arm.c
> > > @@ -747,6 +747,14 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
> > >
> > > if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
> > > return kvm_vcpu_suspend(vcpu);
> > > +
> > > + if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu) &&
> > > + kvm_dirty_ring_soft_full(vcpu)) {
> > > + kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
> > > + vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> > > + trace_kvm_dirty_ring_exit(vcpu);
> > > + return 0;
> > > + }
> > > }
> > >
> > > return 1;
> >
> > Right, this seems working. We can also use kvm_test_request() here.
> >
> > >
> > >
> > > However, I'm a bit concerned by the reset side of things. It iterates
> > > over the vcpus and expects the view of each ring to be consistent,
> > > even if userspace is hacking at it from another CPU. For example, I
> > > can't see what guarantees that the kernel observes the writes from
> > > userspace in the order they are being performed (the documentation
> > > provides no requirements other than "it must collect the dirty GFNs in
> > > sequence", which doesn't mean much from an ordering perspective).
> > >
> > > I can see that working on a strongly ordered architecture, but on
> > > something as relaxed as ARM, the CPUs may^Wwill aggressively reorder
> > > stuff that isn't explicitly ordered. I have the feeling that a CAS
> > > operation on both sides would be enough, but someone who actually
> > > understands how this works should have a look...
> >
> > I definitely don't think I 100% understand all the ordering things since
> > they're complicated.. but my understanding is that the reset procedure
> > didn't need memory barrier (unlike pushing, where we have explicit wmb),
> > because we assumed the userapp is not hostile so logically it should only
> > modify the flags which is a 32bit field, assuming atomicity guaranteed.
>
> Atomicity doesn't guarantee ordering, unfortunately.

Right, sorry to be misleading. The "atomicity" part I was trying to say
the kernel will always see consistent update on the fields.

The ordering should also be guaranteed, because things must happen with
below sequence:

(1) kernel publish dirty GFN data (slot, offset)
(2) kernel publish dirty GFN flag (set to DIRTY)
(3) user sees DIRTY, collects (slots, offset)
(4) user sets it to RESET
(5) kernel reads RESET

So the ordering of a single entry is guaranteed, in that when (5) happens it
must happen after (1+2) have stabilized.

> Take the
> following example: CPU0 is changing a bunch of flags for GFNs A, B, C,
> D that exist in the ring in that order, and CPU1 performs an ioctl to
> reset the page state.
>
> CPU0:
> write_flag(A, KVM_DIRTY_GFN_F_RESET)
> write_flag(B, KVM_DIRTY_GFN_F_RESET)
> write_flag(C, KVM_DIRTY_GFN_F_RESET)
> write_flag(D, KVM_DIRTY_GFN_F_RESET)
> [...]
>
> CPU1:
> ioctl(KVM_RESET_DIRTY_RINGS)
>
> Since CPU0 writes do not have any ordering, CPU1 can observe the
> writes in a sequence that have nothing to do with program order, and
> could for example observe that GFN A and D have been reset, but not B
> and C. This in turn breaks the logic in the reset code (B, C, and D
> don't get reset), despite userspace having followed the spec to the
> letter. If each was a store-release (which is the case on x86), it
> wouldn't be a problem, but nothing calls it in the documentation.
>
> Maybe that's not a big deal if it is expected that each CPU will issue
> a KVM_RESET_DIRTY_RINGS itself, ensuring that it observe its own
> writes. But expecting this to work across CPUs without any barrier is
> wishful thinking.

I see what you meant...

Firstly I'm actually curious whether that'll really happen if the gfns are
collected in something like a for loop:

for(i = 0; i < N; i++)
collect_dirty_gfn(ring, i);

Because since all the gfns to be read will depend on variable "i", IIUC no
reordering should happen, but I'm not really sure, so more of a pure
question.

Besides, the other thing to mention is that I think it is fine the RESET
ioctl didn't recycle all the gfns got set to reset state. Taking above
example of GFNs A-D, if when reaching the RESET ioctl only A & D's flags
are updated, the ioctl will recycle gfn A but stop at gfn B assuming B-D
are not reset. But IMHO it's okay because it means we reset partial of the
gfns not all of them, and it's safe to do so. It means the next ring full
event can come earlier because we recycled less, but that's functionally
safe to me.

>
> > IIRC we used to discuss similar questions on "what if the user is hostile
> > and wants to hack the process by messing up with the ring", and our
> > conclusion was as long as the process wouldn't mess up anything outside
> > itself it should be okay. E.g. It should not be able to either cause the
> > host to misfunction, or trigger kernel warnings in dmesg, etc..
>
> I'm not even discussing safety here. I'm purely discussing the
> interactions between userspace and kernel based on the documentation
> and the existing kernel code.

I see. Please let me know if above makes sense.

Thanks!

--
Peter Xu

2022-08-24 15:53:22

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

On Wed, 24 Aug 2022 00:19:04 +0100,
Peter Xu <[email protected]> wrote:
>
> On Tue, Aug 23, 2022 at 11:47:03PM +0100, Marc Zyngier wrote:
> > On Tue, 23 Aug 2022 22:20:32 +0100,
> > Peter Xu <[email protected]> wrote:
> > >
> > > On Tue, Aug 23, 2022 at 08:17:03PM +0100, Marc Zyngier wrote:
> > > > I don't think we really need this check on the hot path. All we need
> > > > is to make the request sticky until userspace gets their act together
> > > > and consumes elements in the ring. Something like:
> > > >
> > > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > > > index 986cee6fbc7f..e8ed5e1af159 100644
> > > > --- a/arch/arm64/kvm/arm.c
> > > > +++ b/arch/arm64/kvm/arm.c
> > > > @@ -747,6 +747,14 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
> > > >
> > > > if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
> > > > return kvm_vcpu_suspend(vcpu);
> > > > +
> > > > + if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu) &&
> > > > + kvm_dirty_ring_soft_full(vcpu)) {
> > > > + kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
> > > > + vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> > > > + trace_kvm_dirty_ring_exit(vcpu);
> > > > + return 0;
> > > > + }
> > > > }
> > > >
> > > > return 1;
> > >
> > > Right, this seems working. We can also use kvm_test_request() here.
> > >
> > > >
> > > >
> > > > However, I'm a bit concerned by the reset side of things. It iterates
> > > > over the vcpus and expects the view of each ring to be consistent,
> > > > even if userspace is hacking at it from another CPU. For example, I
> > > > can't see what guarantees that the kernel observes the writes from
> > > > userspace in the order they are being performed (the documentation
> > > > provides no requirements other than "it must collect the dirty GFNs in
> > > > sequence", which doesn't mean much from an ordering perspective).
> > > >
> > > > I can see that working on a strongly ordered architecture, but on
> > > > something as relaxed as ARM, the CPUs may^Wwill aggressively reorder
> > > > stuff that isn't explicitly ordered. I have the feeling that a CAS
> > > > operation on both sides would be enough, but someone who actually
> > > > understands how this works should have a look...
> > >
> > > I definitely don't think I 100% understand all the ordering things since
> > > they're complicated.. but my understanding is that the reset procedure
> > > didn't need memory barrier (unlike pushing, where we have explicit wmb),
> > > because we assumed the userapp is not hostile so logically it should only
> > > modify the flags which is a 32bit field, assuming atomicity guaranteed.
> >
> > Atomicity doesn't guarantee ordering, unfortunately.
>
> Right, sorry to be misleading. The "atomicity" part I was trying to say
> the kernel will always see consistent update on the fields.
>
> The ordering should also be guaranteed, because things must happen with
> below sequence:
>
> (1) kernel publish dirty GFN data (slot, offset)
> (2) kernel publish dirty GFN flag (set to DIRTY)
> (3) user sees DIRTY, collects (slots, offset)
> (4) user sets it to RESET
> (5) kernel reads RESET

Maybe. Maybe not. The reset could well be sitting in the CPU write
buffer for as long as it wants and not be seen by the kernel if the
read occurs on another CPU. And that's the crucial bit: single-CPU is
fine, but cross CPU isn't. Unfortunately, the userspace API is per-CPU
on collection, and global on reset (this seems like a bad decision,
but it is too late to fix this).

>
> > So the ordering of a single entry is guaranteed, in that when (5) happens it
> > must happen after (1+2) have stabilized.
>
> > Take the
> > following example: CPU0 is changing a bunch of flags for GFNs A, B, C,
> > D that exist in the ring in that order, and CPU1 performs an ioctl to
> > reset the page state.
> >
> > CPU0:
> > write_flag(A, KVM_DIRTY_GFN_F_RESET)
> > write_flag(B, KVM_DIRTY_GFN_F_RESET)
> > write_flag(C, KVM_DIRTY_GFN_F_RESET)
> > write_flag(D, KVM_DIRTY_GFN_F_RESET)
> > [...]
> >
> > CPU1:
> > ioctl(KVM_RESET_DIRTY_RINGS)
> >
> > Since CPU0 writes do not have any ordering, CPU1 can observe the
> > writes in a sequence that have nothing to do with program order, and
> > could for example observe that GFN A and D have been reset, but not B
> > and C. This in turn breaks the logic in the reset code (B, C, and D
> > don't get reset), despite userspace having followed the spec to the
> > letter. If each was a store-release (which is the case on x86), it
> > wouldn't be a problem, but nothing calls it in the documentation.
> >
> > Maybe that's not a big deal if it is expected that each CPU will issue
> > a KVM_RESET_DIRTY_RINGS itself, ensuring that it observe its own
> > writes. But expecting this to work across CPUs without any barrier is
> > wishful thinking.
>
> I see what you meant...
>
> Firstly I'm actually curious whether that'll really happen if the gfns are
> collected in something like a for loop:
>
> for(i = 0; i < N; i++)
> collect_dirty_gfn(ring, i);
>
> > Because since all the gfns to be read will depend on variable "i", IIUC no
> reordering should happen, but I'm not really sure, so more of a pure
> question.

'i' has no influence on the write ordering. Each write targets a
different address, there is no inter-write dependencies (this concept
doesn't exist other than for writes to the same address), so they can
be reordered at will.

If you want a proof of this, head to http://diy.inria.fr/www/ and run
the MP.litmus test (which conveniently gives you a reduction of this
problem) on both the x86 and AArch64 models. You will see that the
reordering isn't allowed on x86, but definitely allowed on arm64.

> Besides, the other thing to mention is that I think it is fine the RESET
> ioctl didn't recycle all the gfns got set to reset state. Taking above
> example of GFNs A-D, if when reaching the RESET ioctl only A & D's flags
> are updated, the ioctl will recycle gfn A but stop at gfn B assuming B-D
> are not reset. But IMHO it's okay because it means we reset partial of the
> gfns not all of them, and it's safe to do so. It means the next ring full
> event can come earlier because we recycled less, but that's functionally
> safe to me.

It may be safe, but it isn't what the userspace API promises. In other
words, without further straightening of the API, this doesn't work as
expected on relaxed memory architectures. So before this gets enabled
on arm64, this whole ordering issue must be addressed.

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2022-08-24 16:28:33

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

On Wed, Aug 24, 2022 at 03:45:11PM +0100, Marc Zyngier wrote:
> On Wed, 24 Aug 2022 00:19:04 +0100,
> Peter Xu <[email protected]> wrote:
> >
> > On Tue, Aug 23, 2022 at 11:47:03PM +0100, Marc Zyngier wrote:
> > > On Tue, 23 Aug 2022 22:20:32 +0100,
> > > Peter Xu <[email protected]> wrote:
> > > >
> > > > On Tue, Aug 23, 2022 at 08:17:03PM +0100, Marc Zyngier wrote:
> > > > > I don't think we really need this check on the hot path. All we need
> > > > > is to make the request sticky until userspace gets their act together
> > > > > and consumes elements in the ring. Something like:
> > > > >
> > > > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > > > > index 986cee6fbc7f..e8ed5e1af159 100644
> > > > > --- a/arch/arm64/kvm/arm.c
> > > > > +++ b/arch/arm64/kvm/arm.c
> > > > > @@ -747,6 +747,14 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
> > > > >
> > > > > if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
> > > > > return kvm_vcpu_suspend(vcpu);
> > > > > +
> > > > > + if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu) &&
> > > > > + kvm_dirty_ring_soft_full(vcpu)) {
> > > > > + kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
> > > > > + vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> > > > > + trace_kvm_dirty_ring_exit(vcpu);
> > > > > + return 0;
> > > > > + }
> > > > > }
> > > > >
> > > > > return 1;
> > > >
> > > > Right, this seems working. We can also use kvm_test_request() here.
> > > >
> > > > >
> > > > >
> > > > > However, I'm a bit concerned by the reset side of things. It iterates
> > > > > over the vcpus and expects the view of each ring to be consistent,
> > > > > even if userspace is hacking at it from another CPU. For example, I
> > > > > can't see what guarantees that the kernel observes the writes from
> > > > > userspace in the order they are being performed (the documentation
> > > > > provides no requirements other than "it must collect the dirty GFNs in
> > > > > sequence", which doesn't mean much from an ordering perspective).
> > > > >
> > > > > I can see that working on a strongly ordered architecture, but on
> > > > > something as relaxed as ARM, the CPUs may^Wwill aggressively reorder
> > > > > stuff that isn't explicitly ordered. I have the feeling that a CAS
> > > > > operation on both sides would be enough, but someone who actually
> > > > > understands how this works should have a look...
> > > >
> > > > I definitely don't think I 100% understand all the ordering things since
> > > > they're complicated.. but my understanding is that the reset procedure
> > > > didn't need memory barrier (unlike pushing, where we have explicit wmb),
> > > > because we assumed the userapp is not hostile so logically it should only
> > > > modify the flags which is a 32bit field, assuming atomicity guaranteed.
> > >
> > > Atomicity doesn't guarantee ordering, unfortunately.
> >
> > Right, sorry to be misleading. The "atomicity" part I was trying to say
> > the kernel will always see consistent update on the fields.
> >
> > The ordering should also be guaranteed, because things must happen with
> > below sequence:
> >
> > (1) kernel publish dirty GFN data (slot, offset)
> > (2) kernel publish dirty GFN flag (set to DIRTY)
> > (3) user sees DIRTY, collects (slots, offset)
> > (4) user sets it to RESET
> > (5) kernel reads RESET
>
> Maybe. Maybe not. The reset could well be sitting in the CPU write
> buffer for as long as it wants and not be seen by the kernel if the
> read occurs on another CPU. And that's the crucial bit: single-CPU is
> fine, but cross CPU isn't. Unfortunately, the userspace API is per-CPU
> on collection, and global on reset (this seems like a bad decision,
> but it is too late to fix this).

Regarding the last statement, that's something I had questions about too and
discussed with Paolo, even though at that time it wasn't an outcome of
discussing memory ordering issues.

IIUC the initial design was trying to avoid tlb flush flood when vcpu
number is large (each RESET per ring even for one page will need all vcpus
to flush, so O(N^2) flushing needed). With global RESET it's O(N). So
it's kind of a trade-off, and indeed until now I'm not sure which one is
better. E.g., with per-ring reset, we can have locality too in userspace,
e.g. the vcpu thread might be able to recycle without holding global locks.

Regarding safety I hope I covered that below in previous reply.

>
> >
> > So the ordering of a single entry is guaranteed, in that when (5) happens it
> > must happen after (1+2) have stabilized.
> >
> > > Take the
> > > following example: CPU0 is changing a bunch of flags for GFNs A, B, C,
> > > D that exist in the ring in that order, and CPU1 performs an ioctl to
> > > reset the page state.
> > >
> > > CPU0:
> > > write_flag(A, KVM_DIRTY_GFN_F_RESET)
> > > write_flag(B, KVM_DIRTY_GFN_F_RESET)
> > > write_flag(C, KVM_DIRTY_GFN_F_RESET)
> > > write_flag(D, KVM_DIRTY_GFN_F_RESET)
> > > [...]
> > >
> > > CPU1:
> > > ioctl(KVM_RESET_DIRTY_RINGS)
> > >
> > > Since CPU0 writes do not have any ordering, CPU1 can observe the
> > > writes in a sequence that have nothing to do with program order, and
> > > could for example observe that GFN A and D have been reset, but not B
> > > and C. This in turn breaks the logic in the reset code (B, C, and D
> > > don't get reset), despite userspace having followed the spec to the
> > > letter. If each was a store-release (which is the case on x86), it
> > > wouldn't be a problem, but nothing calls it in the documentation.
> > >
> > > Maybe that's not a big deal if it is expected that each CPU will issue
> > > a KVM_RESET_DIRTY_RINGS itself, ensuring that it observe its own
> > > writes. But expecting this to work across CPUs without any barrier is
> > > wishful thinking.
> >
> > I see what you meant...
> >
> > Firstly I'm actually curious whether that'll really happen if the gfns are
> > collected in something like a for loop:
> >
> > for(i = 0; i < N; i++)
> > collect_dirty_gfn(ring, i);
> >
> > Because since all the gfns to be read will depend on variable "i", IIUC no
> > reordering should happen, but I'm not really sure, so more of a pure
> > question.
>
> 'i' has no influence on the write ordering. Each write targets a
> different address, there is no inter-write dependencies (this concept
> doesn't exist other than for writes to the same address), so they can
> be reordered at will.
>
> If you want a proof of this, head to http://diy.inria.fr/www/ and run
> the MP.litmus test (which conveniently gives you a reduction of this
> problem) on both the x86 and AArch64 models. You will see that the
> reordering isn't allowed on x86, but definitely allowed on arm64.
>
> > Besides, the other thing to mention is that I think it is fine the RESET
> > ioctl didn't recycle all the gfns got set to reset state. Taking above
> > example of GFNs A-D, if when reaching the RESET ioctl only A & D's flags
> > are updated, the ioctl will recycle gfn A but stop at gfn B assuming B-D
> > are not reset. But IMHO it's okay because it means we reset partial of the
> > gfns not all of them, and it's safe to do so. It means the next ring full
> > event can come earlier because we recycled less, but that's functionally
> > safe to me.
>
> It may be safe, but it isn't what the userspace API promises.

The document says:

After processing one or more entries in the ring buffer, userspace calls
the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about it, so that
the kernel will reprotect those collected GFNs. Therefore, the ioctl
must be called *before* reading the content of the dirty pages.

I'd say it's not an explicit promise, but I think I agree with you that at
least it's unclear on the behavior.

Since we have a global recycle mechanism, most likely the app (e.g. current
qemu impl) will use the same thread to collect/reset dirty GFNs, and
trigger the RESET ioctl(). In that case it's safe, IIUC, because no
cross-core ops.

QEMU even guarantees this by checking it (kvm_dirty_ring_reap_locked):

if (total) {
    ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS);
    assert(ret == total);
}

I think the assert() should never trigger as mentioned above. But ideally
maybe it should just be a loop until cleared gfns match total.
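
For illustration, such a loop could look like the sketch below (written
against the snippet above; the function name is made up, error handling is
minimal, and this is not the current QEMU implementation):

/* Illustrative variant of the tail of QEMU's kvm_dirty_ring_reap_locked():
 * keep resetting until the kernel has recycled as many entries as userspace
 * harvested, instead of asserting that a single ioctl covered them all. */
static void kvm_dirty_ring_reset_all(KVMState *s, uint64_t total)
{
    while (total > 0) {
        int ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS);

        if (ret <= 0) {
            break;          /* error, or nothing recycled: don't spin */
        }
        total -= ret;       /* entries the kernel recycled in this pass */
    }
}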

> In other words, without further straightening of the API, this doesn't
> work as expected on relaxed memory architectures. So before this gets
> enabled on arm64, this whole ordering issue must be addressed.

How about adding some more documentation for KVM_RESET_DIRTY_RINGS on the
possibility of recycling partial of the pages, especially when collection
and the ioctl() aren't done from the same thread?

Any suggestions will be greatly welcomed.

Thanks,

--
Peter Xu

2022-08-24 21:19:34

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

On Wed, 24 Aug 2022 17:21:50 +0100,
Peter Xu <[email protected]> wrote:
>
> On Wed, Aug 24, 2022 at 03:45:11PM +0100, Marc Zyngier wrote:
> > On Wed, 24 Aug 2022 00:19:04 +0100,
> > Peter Xu <[email protected]> wrote:
> > >
> > > On Tue, Aug 23, 2022 at 11:47:03PM +0100, Marc Zyngier wrote:
> > > > Atomicity doesn't guarantee ordering, unfortunately.
> > >
> > > Right, sorry to be misleading. The "atomicity" part I was trying to say
> > > the kernel will always see consistent update on the fields.
> > >
> > > The ordering should also be guaranteed, because things must happen with
> > > below sequence:
> > >
> > > (1) kernel publish dirty GFN data (slot, offset)
> > > (2) kernel publish dirty GFN flag (set to DIRTY)
> > > (3) user sees DIRTY, collects (slots, offset)
> > > (4) user sets it to RESET
> > > (5) kernel reads RESET
> >
> > Maybe. Maybe not. The reset could well be sitting in the CPU write
> > buffer for as long as it wants and not be seen by the kernel if the
> > read occurs on another CPU. And that's the crucial bit: single-CPU is
> > fine, but cross CPU isn't. Unfortunately, the userspace API is per-CPU
> > on collection, and global on reset (this seems like a bad decision,
> > but it is too late to fix this).
>
> Regarding the last statement, that's something I had questions about too and
> discussed with Paolo, even though at that time it wasn't an outcome of
> discussing memory ordering issues.
>
> IIUC the initial design was trying to avoid tlb flush flood when vcpu
> number is large (each RESET per ring even for one page will need all vcpus
> to flush, so O(N^2) flushing needed). With global RESET it's O(N). So
> it's kind of a trade-off, and indeed until now I'm not sure which one is
> better. E.g., with per-ring reset, we can have locality too in userspace,
> e.g. the vcpu thread might be able to recycle without holding global locks.

I don't get that. On x86, each CPU must perform the TLB invalidation
(there is an IPI for that). So whether you do a per-CPU scan of the
ring or a global scan is irrelevant: each entry you find in any of the
rings must result in a global invalidation, since you've updated the
PTE to make the page RO.

The same is true on ARM, except that the broadcast is done in HW
instead of being tracked in SW.

But anyway, this is all moot. The API is what it is, and it isn't
going to change any time soon. All we can do is add some
clarifications to the API for the more relaxed architectures, and make
sure the kernel behaves accordingly.

[...]

> > It may be safe, but it isn't what the userspace API promises.
>
> The document says:
>
> After processing one or more entries in the ring buffer, userspace calls
> the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about it, so that
> the kernel will reprotect those collected GFNs. Therefore, the ioctl
> must be called *before* reading the content of the dirty pages.
>
> I'd say it's not an explicit promise, but I think I agree with you that at
> least it's unclear on the behavior.

This is the least problematic part of the documentation. The bit I
literally choke on is this:

<quote>
It's not necessary for userspace to harvest the all dirty GFNs at once.
However it must collect the dirty GFNs in sequence, i.e., the userspace
program cannot skip one dirty GFN to collect the one next to it.
</quote>

This is the core of the issue. Without ordering rules, the consumer on
the other side cannot observe the updates correctly, even if userspace
follows the above to the letter. Of course, the kernel itself must do
the right thing (I guess it amounts to the kernel doing a
load-acquire, and userspace doing a store-release -- effectively
emulating x86...).
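
On the kernel side, the acquire half of that pairing could be as small as
the sketch below. The existing helper in virt/kvm/dirty_ring.c reads
gfn->flags with a plain load; this only shows where the acquire would go,
it is not the final patch:

/*
 * Sketch: read the entry flags with acquire semantics in the reset path,
 * pairing with a store-release done by userspace when it sets
 * KVM_DIRTY_GFN_F_RESET.
 */
static inline bool kvm_dirty_gfn_harvested(struct kvm_dirty_gfn *gfn)
{
        return smp_load_acquire(&gfn->flags) & KVM_DIRTY_GFN_F_RESET;
}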

> Since we have a global recycle mechanism, most likely the app (e.g. current
> qemu impl) will use the same thread to collect/reset dirty GFNs, and
> trigger the RESET ioctl(). In that case it's safe, IIUC, because no
> cross-core ops.
>
> QEMU even guarantees this by checking it (kvm_dirty_ring_reap_locked):
>
> if (total) {
> ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS);
> assert(ret == total);
> }
>
> I think the assert() should never trigger as mentioned above. But ideally
> maybe it should just be a loop until cleared gfns match total.

Right. If userspace calls the ioctl on every vcpu, then things should
work correctly. It is only that the overhead is higher than what it
should be if multiple vcpus perform a reset at the same time.

>
> > In other words, without further straightening of the API, this doesn't
> > work as expected on relaxed memory architectures. So before this gets
> > enabled on arm64, this whole ordering issue must be addressed.
>
> How about adding some more documentation for KVM_RESET_DIRTY_RINGS on the
> possibility of recycling partial of the pages, especially when collection
> and the ioctl() aren't done from the same thread?

I'd rather tell people about the ordering rules. That will come at
zero cost on x86.

> Any suggestions will be greatly welcomed.

I'll write a couple of patches when I get the time, most likely next
week. Gavin will hopefully be able to take them as part of his series.

M.

--
Without deviation from the norm, progress is not possible.

2022-08-26 06:34:43

by Gavin Shan

[permalink] [raw]
Subject: Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

Hi Marc,

On 8/25/22 6:57 AM, Marc Zyngier wrote:
> On Wed, 24 Aug 2022 17:21:50 +0100,
> Peter Xu <[email protected]> wrote:
>>
>> On Wed, Aug 24, 2022 at 03:45:11PM +0100, Marc Zyngier wrote:
>>> On Wed, 24 Aug 2022 00:19:04 +0100,
>>> Peter Xu <[email protected]> wrote:
>>>>
>>>> On Tue, Aug 23, 2022 at 11:47:03PM +0100, Marc Zyngier wrote:
>>>>> Atomicity doesn't guarantee ordering, unfortunately.
>>>>
>>>> Right, sorry to be misleading. The "atomicity" part I was trying to say
>>>> the kernel will always see consistent update on the fields.
>>>>
>>>> The ordering should also be guaranteed, because things must happen with
>>>> below sequence:
>>>>
>>>> (1) kernel publish dirty GFN data (slot, offset)
>>>> (2) kernel publish dirty GFN flag (set to DIRTY)
>>>> (3) user sees DIRTY, collects (slots, offset)
>>>> (4) user sets it to RESET
>>>> (5) kernel reads RESET
>>>
>>> Maybe. Maybe not. The reset could well be sitting in the CPU write
>>> buffer for as long as it wants and not be seen by the kernel if the
>>> read occurs on another CPU. And that's the crucial bit: single-CPU is
>>> fine, but cross CPU isn't. Unfortunately, the userspace API is per-CPU
>>> on collection, and global on reset (this seems like a bad decision,
>>> but it is too late to fix this).
>>
>> Regarding the last statement, that's something I had questions about too and
>> discussed with Paolo, even though at that time it wasn't an outcome of
>> discussing memory ordering issues.
>>
>> IIUC the initial design was trying to avoid tlb flush flood when vcpu
>> number is large (each RESET per ring even for one page will need all vcpus
>> to flush, so O(N^2) flushing needed). With global RESET it's O(N). So
>> it's kind of a trade-off, and indeed until now I'm not sure which one is
>> better. E.g., with per-ring reset, we can have locality too in userspace,
>> e.g. the vcpu thread might be able to recycle without holding global locks.
>
> I don't get that. On x86, each CPU must perform the TLB invalidation
> (there is an IPI for that). So whether you do a per-CPU scan of the
> ring or a global scan is irrelevant: each entry you find in any of the
> rings must result in a global invalidation, since you've updated the
> PTE to make the page RO.
>
> The same is true on ARM, except that the broadcast is done in HW
> instead of being tracked in SW.
>
> But anyway, this is all moot. The API is what it is, and it isn't
> going to change any time soon. All we can do is add some
> clarifications to the API for the more relaxed architectures, and make
> sure the kernel behaves accordingly.
>
> [...]
>
>>> It may be safe, but it isn't what the userspace API promises.
>>
>> The document says:
>>
>> After processing one or more entries in the ring buffer, userspace calls
>> the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about it, so that
>> the kernel will reprotect those collected GFNs. Therefore, the ioctl
>> must be called *before* reading the content of the dirty pages.
>>
>> I'd say it's not an explicit promise, but I think I agree with you that at
>> least it's unclear on the behavior.
>
> This is the least problematic part of the documentation. The bit I
> literally choke on is this:
>
> <quote>
> It's not necessary for userspace to harvest the all dirty GFNs at once.
> However it must collect the dirty GFNs in sequence, i.e., the userspace
> program cannot skip one dirty GFN to collect the one next to it.
> </quote>
>
> This is the core of the issue. Without ordering rules, the consumer on
> the other side cannot observe the updates correctly, even if userspace
> follows the above to the letter. Of course, the kernel itself must do
> the right thing (I guess it amounts to the kernel doing a
> load-acquire, and userspace doing a store-release -- effectively
> emulating x86...).
>
>> Since we have a global recycle mechanism, most likely the app (e.g. current
>> qemu impl) will use the same thread to collect/reset dirty GFNs, and
>> trigger the RESET ioctl(). In that case it's safe, IIUC, because no
>> cross-core ops.
>>
>> QEMU even guarantees this by checking it (kvm_dirty_ring_reap_locked):
>>
>> if (total) {
>> ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS);
>> assert(ret == total);
>> }
>>
>> I think the assert() should never trigger as mentioned above. But ideally
>> maybe it should just be a loop until cleared gfns match total.
>
> Right. If userspace calls the ioctl on every vcpu, then things should
> work correctly. It is only that the overhead is higher than what it
> should be if multiple vcpus perform a reset at the same time.
>
>>
>>> In other words, without further straightening of the API, this doesn't
>>> work as expected on relaxed memory architectures. So before this gets
>>> enabled on arm64, this whole ordering issue must be addressed.
>>
>> How about adding some more documentation for KVM_RESET_DIRTY_RINGS on the
>> possibility of recycling partial of the pages, especially when collection
>> and the ioctl() aren't done from the same thread?
>
> I'd rather tell people about the ordering rules. That will come at
> zero cost on x86.
>
>> Any suggestions will be greatly welcomed.
>
> I'll write a couple of patches when I get the time, most likely next
> week. Gavin will hopefully be able to take them as part of his series.
>

Thanks, Marc. Please let me know where I can check out the patches
when they're ready. I can include the patches in this series in the
next revision :)

Thanks,
Gavin


2022-08-26 10:59:55

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

On 8/24/22 00:47, Marc Zyngier wrote:
>> I definitely don't think I 100% understand all the ordering things since
>> they're complicated.. but my understanding is that the reset procedure
>> didn't need memory barrier (unlike pushing, where we have explicit wmb),
>> because we assumed the userapp is not hostile so logically it should only
>> modify the flags which is a 32bit field, assuming atomicity guaranteed.
> Atomicity doesn't guarantee ordering, unfortunately. Take the
> following example: CPU0 is changing a bunch of flags for GFNs A, B, C,
> D that exist in the ring in that order, and CPU1 performs an ioctl to
> reset the page state.
>
> CPU0:
> write_flag(A, KVM_DIRTY_GFN_F_RESET)
> write_flag(B, KVM_DIRTY_GFN_F_RESET)
> write_flag(C, KVM_DIRTY_GFN_F_RESET)
> write_flag(D, KVM_DIRTY_GFN_F_RESET)
> [...]
>
> CPU1:
> ioctl(KVM_RESET_DIRTY_RINGS)
>
> Since CPU0 writes do not have any ordering, CPU1 can observe the
> writes in a sequence that have nothing to do with program order, and
> could for example observe that GFN A and D have been reset, but not B
> and C. This in turn breaks the logic in the reset code (B, C, and D
> don't get reset), despite userspace having followed the spec to the
> letter. If each was a store-release (which is the case on x86), it
> wouldn't be a problem, but nothing calls it in the documentation.
>
> Maybe that's not a big deal if it is expected that each CPU will issue
> a KVM_RESET_DIRTY_RINGS itself, ensuring that it observe its own
> writes. But expecting this to work across CPUs without any barrier is
> wishful thinking.

Agreed, but that's a problem for userspace to solve. If userspace wants
to reset the fields in different CPUs, it has to synchronize with its
own invoking of the ioctl.

That is, CPU0 must ensure that a ioctl(KVM_RESET_DIRTY_RINGS) is done
after (in the memory-ordering sense) its last write_flag(D,
KVM_DIRTY_GFN_F_RESET). If there's no such ordering, there's no
guarantee that the write_flag will have any effect.

The main reason why I preferred a global KVM_RESET_DIRTY_RINGS ioctl was
because it takes kvm->slots_lock so the execution would be serialized
anyway. Turning slots_lock into an rwsem would be even worse because it
also takes kvm->mmu_lock (since slots_lock is a mutex, at least two
concurrent invocations won't clash with each other on the mmu_lock).

Paolo

2022-08-26 11:04:25

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

On 8/23/22 22:35, Marc Zyngier wrote:
>> Heh, yeah I need to get that out the door. I'll also note that Gavin's
>> changes are still relevant without that series, as we do write unprotect
>> in parallel at PTE granularity after commit f783ef1c0e82 ("KVM: arm64:
>> Add fast path to handle permission relaxation during dirty logging").
>
> Ah, true. Now if only someone could explain how the whole
> producer-consumer thing works without a trace of a barrier, that'd be
> great...

Do you mean this?

void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
{
        struct kvm_dirty_gfn *entry;

        /* It should never get full */
        WARN_ON_ONCE(kvm_dirty_ring_full(ring));

        entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];

        entry->slot = slot;
        entry->offset = offset;
        /*
         * Make sure the data is filled in before we publish this to
         * the userspace program. There's no paired kernel-side reader.
         */
        smp_wmb();
        kvm_dirty_gfn_set_dirtied(entry);
        ring->dirty_index++;
        trace_kvm_dirty_ring_push(ring, slot, offset);
}

The matching smp_rmb() is in userspace.

Paolo

2022-08-26 15:44:58

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

On Fri, 26 Aug 2022 11:58:08 +0100,
Paolo Bonzini <[email protected]> wrote:
>
> On 8/23/22 22:35, Marc Zyngier wrote:
> >> Heh, yeah I need to get that out the door. I'll also note that Gavin's
> >> changes are still relevant without that series, as we do write unprotect
> >> in parallel at PTE granularity after commit f783ef1c0e82 ("KVM: arm64:
> >> Add fast path to handle permission relaxation during dirty logging").
> >
> > Ah, true. Now if only someone could explain how the whole
> > producer-consumer thing works without a trace of a barrier, that'd be
> > great...
>
> Do you mean this?
>
> void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)

Of course not. I mean this:

static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
{
        unsigned long i;
        struct kvm_vcpu *vcpu;
        int cleared = 0;

        if (!kvm->dirty_ring_size)
                return -EINVAL;

        mutex_lock(&kvm->slots_lock);

        kvm_for_each_vcpu(i, vcpu, kvm)
                cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring);
[...]
}

and this

int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
{
        u32 cur_slot, next_slot;
        u64 cur_offset, next_offset;
        unsigned long mask;
        int count = 0;
        struct kvm_dirty_gfn *entry;
        bool first_round = true;

        /* This is only needed to make compilers happy */
        cur_slot = cur_offset = mask = 0;

        while (true) {
                entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];

                if (!kvm_dirty_gfn_harvested(entry))
                        break;
[...]

}

which provides no ordering whatsoever when a ring is updated from one
CPU and reset from another.

M.

--
Without deviation from the norm, progress is not possible.

2022-08-26 16:07:44

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

On Fri, 26 Aug 2022 11:50:24 +0100,
Paolo Bonzini <[email protected]> wrote:
>
> On 8/24/22 00:47, Marc Zyngier wrote:
> >> I definitely don't think I 100% understand all the ordering things since
> >> they're complicated.. but my understanding is that the reset procedure
> >> didn't need a memory barrier (unlike pushing, where we have an explicit wmb),
> >> because we assumed the userapp is not hostile, so logically it should only
> >> modify the flags, which is a 32-bit field, assuming atomicity is guaranteed.
> > Atomicity doesn't guarantee ordering, unfortunately. Take the
> > following example: CPU0 is changing a bunch of flags for GFNs A, B, C,
> > D that exist in the ring in that order, and CPU1 performs an ioctl to
> > reset the page state.
> >
> > CPU0:
> > write_flag(A, KVM_DIRTY_GFN_F_RESET)
> > write_flag(B, KVM_DIRTY_GFN_F_RESET)
> > write_flag(C, KVM_DIRTY_GFN_F_RESET)
> > write_flag(D, KVM_DIRTY_GFN_F_RESET)
> > [...]
> >
> > CPU1:
> > ioctl(KVM_RESET_DIRTY_RINGS)
> >
> > Since CPU0 writes do not have any ordering, CPU1 can observe the
> > writes in a sequence that has nothing to do with program order, and
> > could for example observe that GFN A and D have been reset, but not B
> > and C. This in turn breaks the logic in the reset code (B, C, and D
> > don't get reset), despite userspace having followed the spec to the
> > letter. If each was a store-release (which is the case on x86), it
> > wouldn't be a problem, but nothing calls it out in the documentation.
> >
> > Maybe that's not a big deal if it is expected that each CPU will issue
> > a KVM_RESET_DIRTY_RINGS itself, ensuring that it observes its own
> > writes. But expecting this to work across CPUs without any barrier is
> > wishful thinking.
>
> Agreed, but that's a problem for userspace to solve. If userspace
> wants to reset the fields in different CPUs, it has to synchronize
> with its own invoking of the ioctl.

userspace has no choice. It cannot order on its own the reads that the
kernel will do to *other* rings.

> That is, CPU0 must ensure that a ioctl(KVM_RESET_DIRTY_RINGS) is done
> after (in the memory-ordering sense) its last write_flag(D,
> KVM_DIRTY_GFN_F_RESET). If there's no such ordering, there's no
> guarantee that the write_flag will have any effect.

The problem isn't on CPU0. The problem is that CPU1 does observe
inconsistent data on arm64, and I don't think this difference in
behaviour is acceptable. Nothing documents this, and there is a baked-in
assumption that there is a strong ordering between writes as well
as between writes and reads.

> The main reason why I preferred a global KVM_RESET_DIRTY_RINGS ioctl
> was because it takes kvm->slots_lock so the execution would be
> serialized anyway. Turning slots_lock into an rwsem would be even
> worse because it also takes kvm->mmu_lock (since slots_lock is a
> mutex, at least two concurrent invocations won't clash with each other
> on the mmu_lock).

Whatever the reason, the behaviour should be identical on all
architectures. As it is, it only really works on x86, and I contend
this is a bug that needs fixing.

Thankfully, this can be done at zero cost for x86, and at that of a
set of load-acquires on other architectures.

M.

--
Without deviation from the norm, progress is not possible.

2022-08-27 08:37:46

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

On 8/26/22 17:49, Marc Zyngier wrote:
>> Agreed, but that's a problem for userspace to solve. If userspace
>> wants to reset the fields in different CPUs, it has to synchronize
>> with its own invoking of the ioctl.
>
> userspace has no choice. It cannot order on its own the reads that the
> kernel will do to *other* rings.

Those reads will never see KVM_DIRTY_GFN_F_RESET in the flags however,
if userspace has never interacted with the ring. So there will be
exactly one read on those rings, and there's nothing to reorder.

If that's too tricky and you want to add a load-acquire, I have no
objection though. It also helps avoid read-read reordering between
one entry's flags and the next one's, so it's a good idea to have it anyway.

>> The main reason why I preferred a global KVM_RESET_DIRTY_RINGS ioctl
>> was because it takes kvm->slots_lock so the execution would be
>> serialized anyway. Turning slots_lock into an rwsem would be even
>> worse because it also takes kvm->mmu_lock (since slots_lock is a
>> mutex, at least two concurrent invocations won't clash with each other
>> on the mmu_lock).
>
> Whatever the reason, the behaviour should be identical on all
> architectures. As it is, it only really works on x86, and I contend
> this is a bug that needs fixing.
>
> Thankfully, this can be done at zero cost for x86, and at that of a
> set of load-acquires on other architectures.

Yes, the global-ness of the API is orthogonal to the memory ordering
issue. I just wanted to explain why a per-vCPU API probably isn't going
to work great.

Paolo

2022-08-30 14:54:22

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

On Fri, Aug 26, 2022 at 04:28:51PM +0100, Marc Zyngier wrote:
> On Fri, 26 Aug 2022 11:58:08 +0100,
> Paolo Bonzini <[email protected]> wrote:
> >
> > On 8/23/22 22:35, Marc Zyngier wrote:
> > >> Heh, yeah I need to get that out the door. I'll also note that Gavin's
> > >> changes are still relevant without that series, as we do write unprotect
> > >> in parallel at PTE granularity after commit f783ef1c0e82 ("KVM: arm64:
> > >> Add fast path to handle permission relaxation during dirty logging").
> > >
> > > Ah, true. Now if only someone could explain how the whole
> > > producer-consumer thing works without a trace of a barrier, that'd be
> > > great...
> >
> > Do you mean this?
> >
> > void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
>
> Of course not. I mean this:
>
> static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
> {
>         unsigned long i;
>         struct kvm_vcpu *vcpu;
>         int cleared = 0;
>
>         if (!kvm->dirty_ring_size)
>                 return -EINVAL;
>
>         mutex_lock(&kvm->slots_lock);
>
>         kvm_for_each_vcpu(i, vcpu, kvm)
>                 cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring);
> [...]
> }
>
> and this
>
> int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
> {
>         u32 cur_slot, next_slot;
>         u64 cur_offset, next_offset;
>         unsigned long mask;
>         int count = 0;
>         struct kvm_dirty_gfn *entry;
>         bool first_round = true;
>
>         /* This is only needed to make compilers happy */
>         cur_slot = cur_offset = mask = 0;
>
>         while (true) {
>                 entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
>
>                 if (!kvm_dirty_gfn_harvested(entry))
>                         break;
> [...]
>
> }
>
> which provides no ordering whatsoever when a ring is updated from one
> CPU and reset from another.

Marc,

I thought we wouldn't hit this as long as we properly take care of other
orderings of (a) gfn push, and (b) gfn collect, but after a second thought
I think it's indeed logically possible that with a reversed ordering here
we can be reading some garbage gfn before (a) happens but also read the
valid flag after (b).

It seems we must have all the barriers correctly applied always. If that's
correct, do you perhaps mean something like this to just add the last piece
of barrier?

===8<===
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index f4c2a6eb1666..ea620bfb012d 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -84,7 +84,7 @@ static inline void kvm_dirty_gfn_set_dirtied(struct kvm_dirty_gfn *gfn)

static inline bool kvm_dirty_gfn_harvested(struct kvm_dirty_gfn *gfn)
{
- return gfn->flags & KVM_DIRTY_GFN_F_RESET;
+ return smp_load_acquire(&gfn->flags) & KVM_DIRTY_GFN_F_RESET;
}

int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
===8<===

Thanks,

--
Peter Xu

2022-09-02 00:46:32

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking

On 8/30/22 16:42, Peter Xu wrote:
> Marc,
>
> I thought we wouldn't hit this as long as we properly take care of other
> orderings of (a) gfn push, and (b) gfn collect, but after a second thought
> I think it's indeed logically possible that with a reversed ordering here
> we can be reading some garbage gfn before (a) happens but also read the
> valid flag after (b).
>
> It seems we must have all the barriers correctly applied always. If that's
> correct, do you perhaps mean something like this to just add the last piece
> of barrier?

Okay, so I thought about it some more and it's quite tricky.

Strictly speaking, the synchronization is just between userspace and
kernel. The fact that the actual producer of dirty pages is in another
CPU is a red herring, because reset only cares about harvested pages.

In other words, the dirty page ring is essentially two ring buffers in
one and we only care about the "harvested ring", not the "produced ring".

On the other hand, it may happen that userspace has set more RESET flags
while the ioctl is ongoing:


     CPU0                        CPU1                        CPU2
                                                             fill gfn0
                                                             store-rel flags for gfn0
                                                             fill gfn1
                                                             store-rel flags for gfn1
     load-acq flags for gfn0
     set RESET for gfn0
     load-acq flags for gfn1
     set RESET for gfn1
     do ioctl! ----------->
                                 ioctl(RESET_RINGS)
                                                             fill gfn2
                                                             store-rel flags for gfn2
     load-acq flags for gfn2
     set RESET for gfn2
                                 process gfn0
                                 process gfn1
                                 process gfn2
     do ioctl!
     etc.

The three load-acquires in CPU0 synchronize with the three store-releases
in CPU2, but CPU0 and CPU1 are only synchronized up to gfn1 and CPU1 may
miss gfn2's fields other than flags.

The kernel must be able to cope with invalid values of the fields, and
userspace will invoke the ioctl once more. However, once the RESET flag
is cleared on gfn2, it is lost forever; therefore, in the above scenario
CPU1 must read the correct value of gfn2's fields.

Therefore RESET must be set with a store-release, which will synchronize
with the load-acquire in CPU1, as you suggested.
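
A hedged sketch of what the userspace half of that pairing could look like
(the helper name is invented; the struct and flag value come from the uAPI,
and the release store is what pairs with the proposed smp_load_acquire() in
kvm_dirty_gfn_harvested()):

/* Sketch only: publish the RESET flag with release semantics. */
#include <linux/kvm.h>

static void kvm_dirty_gfn_set_harvested(struct kvm_dirty_gfn *gfn)
{
        /*
         * Release ensures the reads of gfn->slot and gfn->offset done
         * while harvesting cannot be reordered past this store, so the
         * flag only becomes visible once the entry is fully consumed.
         */
        __atomic_store_n(&gfn->flags, KVM_DIRTY_GFN_F_RESET, __ATOMIC_RELEASE);
}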

Paolo

> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> index f4c2a6eb1666..ea620bfb012d 100644
> --- a/virt/kvm/dirty_ring.c
> +++ b/virt/kvm/dirty_ring.c
> @@ -84,7 +84,7 @@ static inline void kvm_dirty_gfn_set_dirtied(struct kvm_dirty_gfn *gfn)
>
> static inline bool kvm_dirty_gfn_harvested(struct kvm_dirty_gfn *gfn)
> {
> - return gfn->flags & KVM_DIRTY_GFN_F_RESET;
> + return smp_load_acquire(&gfn->flags) & KVM_DIRTY_GFN_F_RESET;
> }
>
> int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
> ===8<===
>
> Thanks,
>
> --
> Peter Xu
>