2019-11-29 21:39:09

by Peter Xu

Subject: [PATCH RFC 00/15] KVM: Dirty ring interface

Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring

Overview
============

This is continued work from Lei Cao <[email protected]> and Paolo
on the KVM dirty ring interface. To keep it simple, I'll still start
from version 1 as an RFC.

The new dirty ring interface is another way to collect dirty pages for
the virtual machine, and it differs from the existing dirty logging
interface in a few ways, mainly:

- Data format: The dirty data is in a ring format rather than a
bitmap format, so the amount of data to sync for dirty logging no
longer depends on the size of guest memory, but on the speed of
dirtying. Also, the dirty ring is per-vcpu (currently plus
another per-vm ring, so the total number of rings is N+1), while the
dirty bitmap is per-vm.

- Data copy: Syncing dirty pages no longer needs a data copy;
instead, the ring is shared between userspace and the kernel by
page sharing (mmap() on either the vm fd or the vcpu fd).

- Interface: Instead of using the old KVM_GET_DIRTY_LOG and
KVM_CLEAR_DIRTY_LOG interfaces, the new ring uses a new ioctl
called KVM_RESET_DIRTY_RINGS when we want to put the collected
dirty pages back into write-protected mode (it works like
KVM_CLEAR_DIRTY_LOG, but is ring based). See the usage sketch below.

And more.
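
To make the flow concrete, here is a rough sketch of how userspace is
expected to drive the new interface (illustrative only: error handling
is omitted, run_size comes from KVM_GET_VCPU_MMAP_SIZE, and the helper
names are placeholders, not anything defined by this series):

/*
 * Rough sketch only; assumes vm_fd and vcpu_fd already exist and that
 * <linux/kvm.h> has this series applied.
 */
#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

static void dirty_ring_enable(int vm_fd, uint32_t ring_bytes)
{
        struct kvm_enable_cap cap = {
                .cap = KVM_CAP_DIRTY_LOG_RING,
                .args[0] = ring_bytes,  /* power of two, >= one page */
        };

        /* Must happen after KVM_CREATE_VM but before creating any vcpu. */
        ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}

static void dirty_ring_mmap_vcpu(int vcpu_fd, size_t run_size,
                                 uint32_t ring_bytes,
                                 struct kvm_run **run,
                                 struct kvm_dirty_gfn **gfns)
{
        long psz = sysconf(_SC_PAGESIZE);

        /* kvm_run carries the per-vcpu kvm_dirty_ring_indexes... */
        *run = mmap(NULL, run_size, PROT_READ | PROT_WRITE, MAP_SHARED,
                    vcpu_fd, 0);
        /* ...while the kvm_dirty_gfn[] array sits at the new page offset. */
        *gfns = mmap(NULL, ring_bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
                     vcpu_fd, KVM_DIRTY_LOG_PAGE_OFFSET * psz);
}

/* After harvesting entries and advancing fetch_index: */
static void dirty_ring_reset(int vm_fd)
{
        /* Re-protects all pages collected so far, across all rings. */
        ioctl(vm_fd, KVM_RESET_DIRTY_RINGS, 0);
}

The per-vm ring is mapped the same way from the vm fd (kvm_vm_run at
offset 0, the gfn array at KVM_DIRTY_LOG_PAGE_OFFSET pages).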

I would appreciate it if reviewers could start with the patch "KVM:
Implement ring-based dirty memory tracking", especially its
documentation update, to get the big picture. That way I avoid
copying most of it into the cover letter again.

I marked this series as RFC because I'm at least uncertain about this
change to vcpu_enter_guest():

    if (kvm_check_request(KVM_REQ_DIRTY_RING_FULL, vcpu)) {
            vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
            /*
             * If this is requested, it means that we've
             * marked the dirty bit in the dirty ring BUT
             * we've not written the data yet. Do it now.
             */
            r = kvm_emulate_instruction(vcpu, 0);
            r = r >= 0 ? 0 : r;
            goto out;
    }

I added a kvm_emulate_instruction() for when the dirty ring reaches the
soft limit and we want to exit to userspace; however, I'm not really
sure whether it could have any side effects. I'd appreciate any
comments on the above, or anything else.

Tests
===========

I wanted to continue working on the QEMU part, but since I noticed that
the interface might still be prone to change, I'm posting this series
first. To make sure it's at least working, I've provided unit tests
together with the series. The unit tests should be able to exercise
the series along at least three major paths:

(1) ./dirty_log_test -M dirty-ring

This tests async ring operations: this should be the major working
mode for the dirty ring interface, i.e. while the kernel is queuing
more data, userspace is collecting at the same time. The ring hardly
ever becomes full when working like this, because in most cases the
collection is fast.

(2) ./dirty_log_test -M dirty-ring -c 1024

This sets the ring size to be very small so that the ring soft-full
condition always triggers (soft-full is a soft limit on the ring
state; when the dirty ring reaches the soft limit it does a userspace
exit and lets userspace collect the data).

(3) ./dirty_log_test -M dirty-ring-wait-queue

This solely tests the extreme case where the ring is full. When the
ring is completely full, the thread (whether a vcpu thread or not)
will be put onto a per-vm waitqueue, and KVM_RESET_DIRTY_RINGS will
wake the threads up (by which point the ring should no longer be
full).

Thanks,

Cao, Lei (2):
KVM: Add kvm/vcpu argument to mark_dirty_page_in_slot
KVM: X86: Implement ring-based dirty memory tracking

Paolo Bonzini (1):
KVM: Move running VCPU from ARM to common code

Peter Xu (12):
KVM: Add build-time error check on kvm_run size
KVM: Implement ring-based dirty memory tracking
KVM: Make dirty ring exclusive to dirty bitmap log
KVM: Introduce dirty ring wait queue
KVM: selftests: Always clear dirty bitmap after iteration
KVM: selftests: Sync uapi/linux/kvm.h to tools/
KVM: selftests: Use a single binary for dirty/clear log test
KVM: selftests: Introduce after_vcpu_run hook for dirty log test
KVM: selftests: Add dirty ring buffer test
KVM: selftests: Let dirty_log_test async for dirty ring test
KVM: selftests: Add "-c" parameter to dirty log test
KVM: selftests: Test dirty ring waitqueue

Documentation/virt/kvm/api.txt | 116 +++++
arch/arm/include/asm/kvm_host.h | 2 -
arch/arm64/include/asm/kvm_host.h | 2 -
arch/x86/include/asm/kvm_host.h | 5 +
arch/x86/include/uapi/asm/kvm.h | 1 +
arch/x86/kvm/Makefile | 3 +-
arch/x86/kvm/mmu/mmu.c | 6 +
arch/x86/kvm/vmx/vmx.c | 7 +
arch/x86/kvm/x86.c | 12 +
include/linux/kvm_dirty_ring.h | 67 +++
include/linux/kvm_host.h | 37 ++
include/linux/kvm_types.h | 1 +
include/uapi/linux/kvm.h | 36 ++
tools/include/uapi/linux/kvm.h | 47 ++
tools/testing/selftests/kvm/Makefile | 2 -
.../selftests/kvm/clear_dirty_log_test.c | 2 -
tools/testing/selftests/kvm/dirty_log_test.c | 452 ++++++++++++++++--
.../testing/selftests/kvm/include/kvm_util.h | 6 +
tools/testing/selftests/kvm/lib/kvm_util.c | 103 ++++
.../selftests/kvm/lib/kvm_util_internal.h | 5 +
virt/kvm/arm/arm.c | 29 --
virt/kvm/arm/perf.c | 6 +-
virt/kvm/arm/vgic/vgic-mmio.c | 15 +-
virt/kvm/dirty_ring.c | 156 ++++++
virt/kvm/kvm_main.c | 315 +++++++++++-
25 files changed, 1329 insertions(+), 104 deletions(-)
create mode 100644 include/linux/kvm_dirty_ring.h
delete mode 100644 tools/testing/selftests/kvm/clear_dirty_log_test.c
create mode 100644 virt/kvm/dirty_ring.c

--
2.21.0


2019-11-29 21:40:33

by Peter Xu

Subject: [PATCH RFC 07/15] KVM: X86: Implement ring-based dirty memory tracking

From: "Cao, Lei" <[email protected]>

Add the new KVM exit reason KVM_EXIT_DIRTY_RING_FULL and connect
KVM_REQ_DIRTY_RING_FULL to it.

Signed-off-by: Lei Cao <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
[peterx: rebase, return 0 instead of -EINTR for user exits,
emul_insn before exit to userspace]
Signed-off-by: Peter Xu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 5 +++++
arch/x86/include/uapi/asm/kvm.h | 1 +
arch/x86/kvm/mmu/mmu.c | 6 ++++++
arch/x86/kvm/vmx/vmx.c | 7 +++++++
arch/x86/kvm/x86.c | 12 ++++++++++++
5 files changed, 31 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b79cd6aa4075..67521627f9e4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -49,6 +49,8 @@

#define KVM_IRQCHIP_NUM_PINS KVM_IOAPIC_NUM_PINS

+#define KVM_DIRTY_RING_VERSION 1
+
/* x86-specific vcpu->requests bit members */
#define KVM_REQ_MIGRATE_TIMER KVM_ARCH_REQ(0)
#define KVM_REQ_REPORT_TPR_ACCESS KVM_ARCH_REQ(1)
@@ -1176,6 +1178,7 @@ struct kvm_x86_ops {
struct kvm_memory_slot *slot,
gfn_t offset, unsigned long mask);
int (*write_log_dirty)(struct kvm_vcpu *vcpu);
+ int (*cpu_dirty_log_size)(void);

/* pmu operations of sub-arch */
const struct kvm_pmu_ops *pmu_ops;
@@ -1661,4 +1664,6 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
#define GET_SMSTATE(type, buf, offset) \
(*(type *)((buf) + (offset) - 0x7e00))

+int kvm_cpu_dirty_log_size(void);
+
#endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 503d3f42da16..b59bf356c478 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -12,6 +12,7 @@

#define KVM_PIO_PAGE_OFFSET 1
#define KVM_COALESCED_MMIO_PAGE_OFFSET 2
+#define KVM_DIRTY_LOG_PAGE_OFFSET 64

#define DE_VECTOR 0
#define DB_VECTOR 1
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6f92b40d798c..f7efb69b089e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1818,7 +1818,13 @@ int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu)
{
if (kvm_x86_ops->write_log_dirty)
return kvm_x86_ops->write_log_dirty(vcpu);
+ return 0;
+}

+int kvm_cpu_dirty_log_size(void)
+{
+ if (kvm_x86_ops->cpu_dirty_log_size)
+ return kvm_x86_ops->cpu_dirty_log_size();
return 0;
}

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d175429c91b0..871489d92d3c 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7710,6 +7710,7 @@ static __init int hardware_setup(void)
kvm_x86_ops->slot_disable_log_dirty = NULL;
kvm_x86_ops->flush_log_dirty = NULL;
kvm_x86_ops->enable_log_dirty_pt_masked = NULL;
+ kvm_x86_ops->cpu_dirty_log_size = NULL;
}

if (!cpu_has_vmx_preemption_timer())
@@ -7774,6 +7775,11 @@ static __exit void hardware_unsetup(void)
free_kvm_area();
}

+static int vmx_cpu_dirty_log_size(void)
+{
+ return enable_pml ? PML_ENTITY_NUM : 0;
+}
+
static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
.cpu_has_kvm_support = cpu_has_kvm_support,
.disabled_by_bios = vmx_disabled_by_bios,
@@ -7896,6 +7902,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
.flush_log_dirty = vmx_flush_log_dirty,
.enable_log_dirty_pt_masked = vmx_enable_log_dirty_pt_masked,
.write_log_dirty = vmx_write_pml_buffer,
+ .cpu_dirty_log_size = vmx_cpu_dirty_log_size,

.pre_block = vmx_pre_block,
.post_block = vmx_post_block,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3ed167e039e5..03ff34783fa1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8094,6 +8094,18 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
*/
if (kvm_check_request(KVM_REQ_HV_STIMER, vcpu))
kvm_hv_process_stimers(vcpu);
+
+ if (kvm_check_request(KVM_REQ_DIRTY_RING_FULL, vcpu)) {
+ vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
+ /*
+ * If this is requested, it means that we've
+ * marked the dirty bit in the dirty ring BUT
+ * we've not written the data yet. Do it now.
+ */
+ r = kvm_emulate_instruction(vcpu, 0);
+ r = r >= 0 ? 0 : r;
+ goto out;
+ }
}

if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
--
2.21.0

2019-11-29 21:40:59

by Peter Xu

Subject: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

This patch is heavily based on previous work from Lei Cao
<[email protected]> and Paolo Bonzini <[email protected]>. [1]

KVM currently uses large bitmaps to track dirty memory. These bitmaps
are copied to userspace when userspace queries KVM for its dirty page
information. The use of bitmaps is mostly sufficient for live
migration, as large parts of memory are dirtied from one log-dirty
pass to another. However, in a checkpointing system, the number of
dirty pages is small and in fact it is often bounded---the VM is
paused when it has dirtied a pre-defined number of pages. Traversing a
large, sparsely populated bitmap to find set bits is time-consuming,
as is copying the bitmap to user-space.

A similar issue exists for live migration when guest memory is huge
while the rate of dirtying is low. In that case, for each dirty sync
we need to pull the whole dirty bitmap to userspace and analyse every
bit, even if it's mostly zeros.

The preferred data structure for above scenarios is a dense list of
guest frame numbers (GFN). This patch series stores the dirty list in
kernel memory that can be memory mapped into userspace to allow speedy
harvesting.

We defined two new data structures:

struct kvm_dirty_ring;
struct kvm_dirty_ring_indexes;

Firstly, kvm_dirty_ring is defined to represent a ring of dirty
pages. When dirty tracking is enabled, we can push dirty GFNs onto
the ring.

Secondly, kvm_dirty_ring_indexes is defined to represent the
user/kernel interface of each ring. Currently it contains two
indexes: (1) avail_index represents where the kernel should push the
next dirty GFN (written by the kernel), while (2) fetch_index represents
where userspace should fetch the next dirty GFN (written by userspace).

One complete ring is composed of one kvm_dirty_ring plus its
corresponding kvm_dirty_ring_indexes.
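
As a rough illustration of how a consumer is expected to use the two
indexes (a sketch only: collect_dirty_gfn() is a placeholder, and the
atomics are the usual compiler builtins, not anything defined by this
series):

/* Userspace-side sketch; assumes <linux/kvm.h> with this series applied. */
#include <stdint.h>
#include <linux/kvm.h>

/* Placeholder: whatever the VMM does with one dirty page. */
extern void collect_dirty_gfn(uint32_t slot, uint64_t offset);

static uint32_t dirty_ring_harvest(struct kvm_dirty_gfn *ring,
                                   struct kvm_dirty_ring_indexes *ix,
                                   uint32_t nentries)  /* power of two */
{
        uint32_t fetch = ix->fetch_index;  /* only userspace writes this */
        uint32_t avail = __atomic_load_n(&ix->avail_index, __ATOMIC_ACQUIRE);
        uint32_t count = 0;

        while (fetch != avail) {
                /* Both counters are free running; mask to get the slot. */
                struct kvm_dirty_gfn *e = &ring[fetch & (nentries - 1)];

                collect_dirty_gfn(e->slot, e->offset);
                fetch++;
                count++;
        }

        /* Publish progress; KVM_RESET_DIRTY_RINGS consumes fetch_index. */
        __atomic_store_n(&ix->fetch_index, fetch, __ATOMIC_RELEASE);
        return count;
}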

Currently, we have N+1 rings for each VM of N vcpus:

- for each vcpu, we have 1 per-vcpu dirty ring,
- for each vm, we have 1 per-vm dirty ring

Please refer to the documentation update in this patch for more
details.

Note that this patch implements the core logic of the dirty ring
buffer. It is still disabled for all archs for now. We'll also
address some of the other issues in follow-up patches before it is
first enabled on x86.

[1] https://patchwork.kernel.org/patch/10471409/

Signed-off-by: Lei Cao <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
Documentation/virt/kvm/api.txt | 109 +++++++++++++++
arch/x86/kvm/Makefile | 3 +-
include/linux/kvm_dirty_ring.h | 67 +++++++++
include/linux/kvm_host.h | 33 +++++
include/linux/kvm_types.h | 1 +
include/uapi/linux/kvm.h | 36 +++++
virt/kvm/dirty_ring.c | 156 +++++++++++++++++++++
virt/kvm/kvm_main.c | 240 ++++++++++++++++++++++++++++++++-
8 files changed, 642 insertions(+), 3 deletions(-)
create mode 100644 include/linux/kvm_dirty_ring.h
create mode 100644 virt/kvm/dirty_ring.c

diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
index 49183add44e7..fa622c9a2eb8 100644
--- a/Documentation/virt/kvm/api.txt
+++ b/Documentation/virt/kvm/api.txt
@@ -231,6 +231,7 @@ Based on their initialization different VMs may have different capabilities.
It is thus encouraged to use the vm ioctl to query for capabilities (available
with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)

+
4.5 KVM_GET_VCPU_MMAP_SIZE

Capability: basic
@@ -243,6 +244,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
memory region. This ioctl returns the size of that region. See the
KVM_RUN documentation for details.

+Besides the size of the KVM_RUN communication region, other areas of
+the VCPU file descriptor can be mmap-ed, including:
+
+- if KVM_CAP_COALESCED_MMIO is available, a page at
+ KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
+ this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
+ KVM_CAP_COALESCED_MMIO is not documented yet.
+
+- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
+ KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. For more information on
+ KVM_CAP_DIRTY_LOG_RING, see section 8.3.
+

4.6 KVM_SET_MEMORY_REGION

@@ -5358,6 +5371,7 @@ CPU when the exception is taken. If this virtual SError is taken to EL1 using
AArch64, this value will be reported in the ISS field of ESR_ELx.

See KVM_CAP_VCPU_EVENTS for more details.
+
8.20 KVM_CAP_HYPERV_SEND_IPI

Architectures: x86
@@ -5365,6 +5379,7 @@ Architectures: x86
This capability indicates that KVM supports paravirtualized Hyper-V IPI send
hypercalls:
HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
+
8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH

Architecture: x86
@@ -5378,3 +5393,97 @@ handling by KVM (as some KVM hypercall may be mistakenly treated as TLB
flush hypercalls by Hyper-V) so userspace should disable KVM identification
in CPUID and only exposes Hyper-V identification. In this case, guest
thinks it's running on Hyper-V and only use Hyper-V hypercalls.
+
+8.22 KVM_CAP_DIRTY_LOG_RING
+
+Architectures: x86
+Parameters: args[0] - size of the dirty log ring
+
+KVM is capable of tracking dirty memory using ring buffers that are
+mmaped into userspace; there is one dirty ring per vcpu and one global
+ring per vm.
+
+One dirty ring has the following two major structures:
+
+struct kvm_dirty_ring {
+        u32 dirty_index;
+        u32 reset_index;
+        u32 size;
+        u32 soft_limit;
+        spinlock_t lock;
+        struct kvm_dirty_gfn *dirty_gfns;
+};
+
+struct kvm_dirty_ring_indexes {
+        __u32 avail_index;   /* set by kernel */
+        __u32 fetch_index;   /* set by userspace */
+};
+
+Each dirty entry is defined as:
+
+struct kvm_dirty_gfn {
+        __u32 pad;
+        __u32 slot;          /* as_id | slot_id */
+        __u64 offset;
+};
+
+The fields in kvm_dirty_ring are internal to KVM itself, while the
+fields in kvm_dirty_ring_indexes are exposed to userspace and can be
+either read or written.
+
+The two indices in the ring buffer are free running counters.
+
+In pseudocode, processing the ring buffer looks like this:
+
+    idx = load-acquire(&ring->fetch_index);
+    while (idx != ring->avail_index) {
+            struct kvm_dirty_gfn *entry;
+            entry = &ring->dirty_gfns[idx & (size - 1)];
+            ...
+
+            idx++;
+    }
+    ring->fetch_index = idx;
+
+Userspace calls KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM ioctl
+to enable this capability for the new guest and set the size of the
+rings. It is only allowed before creating any vCPU, and the size of
+the ring must be a power of two. The larger the ring buffer, the less
+likely the ring is full and the VM is forced to exit to userspace. The
+optimal size depends on the workload, but it is recommended that it be
+at least 64 KiB (4096 entries).
+
+After the capability is enabled, userspace can mmap the global ring
+buffer (kvm_dirty_gfn[], offset KVM_DIRTY_LOG_PAGE_OFFSET) and the
+indexes (kvm_dirty_ring_indexes, offset 0) from the VM file
+descriptor. The per-vcpu dirty ring is instead mmapped when the vcpu
+is created, similar to the kvm_run struct (kvm_dirty_ring_indexes is
+located inside kvm_run, while kvm_dirty_gfn[] is at offset
+KVM_DIRTY_LOG_PAGE_OFFSET).
+
+Just like for dirty page bitmaps, the buffer tracks writes to
+all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
+set in KVM_SET_USER_MEMORY_REGION. Once a memory region is registered
+with the flag set, userspace can start harvesting dirty pages from the
+ring buffer.
+
+To harvest the dirty pages, userspace accesses the mmaped ring buffer
+to read the dirty GFNs up to avail_index, and sets the fetch_index
+accordingly. This can be done when the guest is running or paused,
+and dirty pages need not be collected all at once. After processing
+one or more entries in the ring buffer, userspace calls the VM ioctl
+KVM_RESET_DIRTY_RINGS to notify the kernel that it has updated
+fetch_index and to mark those pages clean. Therefore, the ioctl
+must be called *before* reading the content of the dirty pages.
+
+However, there is a major difference compared to the
+KVM_GET_DIRTY_LOG interface: when reading the dirty ring from
+userspace, it is still possible that the kernel has not yet flushed
+the hardware dirty buffers into the kernel buffer. To flush them, one
+needs to kick the vcpu out of guest mode (a vmexit).
+
+If one of the ring buffers is full, the guest will exit to userspace
+with the exit reason set to KVM_EXIT_DIRTY_RING_FULL, and the
+KVM_RUN ioctl will return 0 to userspace. Once that happens, userspace
+should pause all the vcpus, then harvest all the dirty pages and
+rearm the dirty traps. It can unpause the guest after that.
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index b19ef421084d..0acee817adfb 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -5,7 +5,8 @@ ccflags-y += -Iarch/x86/kvm
KVM := ../../../virt/kvm

kvm-y += $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
- $(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
+ $(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \
+ $(KVM)/dirty_ring.o
kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o

kvm-y += x86.o emulate.o i8259.o irq.o lapic.o \
diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
new file mode 100644
index 000000000000..8335635b7ff7
--- /dev/null
+++ b/include/linux/kvm_dirty_ring.h
@@ -0,0 +1,67 @@
+#ifndef KVM_DIRTY_RING_H
+#define KVM_DIRTY_RING_H
+
+/*
+ * struct kvm_dirty_ring is defined in include/uapi/linux/kvm.h.
+ *
+ * dirty_ring: shared with userspace via mmap. It is the compact list
+ * that holds the dirty pages.
+ * dirty_index: free running counter that points to the next slot in
+ * dirty_ring->dirty_gfns where a new dirty page should go.
+ * reset_index: free running counter that points to the next dirty page
+ * in dirty_ring->dirty_gfns for which dirty trap needs to
+ * be reenabled
+ * size: size of the compact list, dirty_ring->dirty_gfns
+ * soft_limit: when the number of dirty pages in the list reaches this
+ * limit, vcpu that owns this ring should exit to userspace
+ * to allow userspace to harvest all the dirty pages
+ * lock: protects dirty_ring, only in use if this is the global
+ * ring
+ *
+ * The number of dirty pages in the ring is calculated by,
+ * dirty_index - reset_index
+ *
+ * kernel increments dirty_ring->indices.avail_index after dirty index
+ * is incremented. When userspace harvests the dirty pages, it increments
+ * dirty_ring->indices.fetch_index up to dirty_ring->indices.avail_index.
+ * When kernel reenables dirty traps for the dirty pages, it increments
+ * reset_index up to dirty_ring->indices.fetch_index.
+ *
+ */
+struct kvm_dirty_ring {
+ u32 dirty_index;
+ u32 reset_index;
+ u32 size;
+ u32 soft_limit;
+ spinlock_t lock;
+ struct kvm_dirty_gfn *dirty_gfns;
+};
+
+u32 kvm_dirty_ring_get_rsvd_entries(void);
+int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring);
+
+/*
+ * called with kvm->slots_lock held, returns the number of
+ * processed pages.
+ */
+int kvm_dirty_ring_reset(struct kvm *kvm,
+ struct kvm_dirty_ring *ring,
+ struct kvm_dirty_ring_indexes *indexes);
+
+/*
+ * returns 0: successfully pushed
+ * 1: successfully pushed, soft limit reached,
+ * vcpu should exit to userspace
+ * -EBUSY: unable to push, dirty ring full.
+ */
+int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
+ struct kvm_dirty_ring_indexes *indexes,
+ u32 slot, u64 offset, bool lock);
+
+/* for use in vm_operations_struct */
+struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i);
+
+void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
+bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring);
+
+#endif
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 498a39462ac1..7b747bc9ff3e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -34,6 +34,7 @@
#include <linux/kvm_types.h>

#include <asm/kvm_host.h>
+#include <linux/kvm_dirty_ring.h>

#ifndef KVM_MAX_VCPU_ID
#define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
@@ -146,6 +147,7 @@ static inline bool is_error_page(struct page *page)
#define KVM_REQ_MMU_RELOAD (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
#define KVM_REQ_PENDING_TIMER 2
#define KVM_REQ_UNHALT 3
+#define KVM_REQ_DIRTY_RING_FULL 4
#define KVM_REQUEST_ARCH_BASE 8

#define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
@@ -321,6 +323,7 @@ struct kvm_vcpu {
bool ready;
struct kvm_vcpu_arch arch;
struct dentry *debugfs_dentry;
+ struct kvm_dirty_ring dirty_ring;
};

static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
@@ -501,6 +504,10 @@ struct kvm {
struct srcu_struct srcu;
struct srcu_struct irq_srcu;
pid_t userspace_pid;
+ /* Data structure to be exported by mmap(kvm->fd, 0) */
+ struct kvm_vm_run *vm_run;
+ u32 dirty_ring_size;
+ struct kvm_dirty_ring vm_dirty_ring;
};

#define kvm_err(fmt, ...) \
@@ -832,6 +839,8 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
gfn_t gfn_offset,
unsigned long mask);

+void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask);
+
int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
struct kvm_dirty_log *log);
int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
@@ -1411,4 +1420,28 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
uintptr_t data, const char *name,
struct task_struct **thread_ptr);

+/*
+ * This defines how many reserved entries we want to keep before we
+ * kick the vcpu to the userspace to avoid dirty ring full. This
+ * value can be tuned to higher if e.g. PML is enabled on the host.
+ */
+#define KVM_DIRTY_RING_RSVD_ENTRIES 64
+
+/* Max number of entries allowed for each kvm dirty ring */
+#define KVM_DIRTY_RING_MAX_ENTRIES 65536
+
+/*
+ * Arch needs to define these macro after implementing the dirty ring
+ * feature. KVM_DIRTY_LOG_PAGE_OFFSET should be defined as the
+ * starting page offset of the dirty ring structures, while
+ * KVM_DIRTY_RING_VERSION should be defined as >=1. By default, this
+ * feature is off on all archs.
+ */
+#ifndef KVM_DIRTY_LOG_PAGE_OFFSET
+#define KVM_DIRTY_LOG_PAGE_OFFSET 0
+#endif
+#ifndef KVM_DIRTY_RING_VERSION
+#define KVM_DIRTY_RING_VERSION 0
+#endif
+
#endif
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 1c88e69db3d9..d9d03eea145a 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -11,6 +11,7 @@ struct kvm_irq_routing_table;
struct kvm_memory_slot;
struct kvm_one_reg;
struct kvm_run;
+struct kvm_vm_run;
struct kvm_userspace_memory_region;
struct kvm_vcpu;
struct kvm_vcpu_init;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index e6f17c8e2dba..0b88d76d6215 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -236,6 +236,7 @@ struct kvm_hyperv_exit {
#define KVM_EXIT_IOAPIC_EOI 26
#define KVM_EXIT_HYPERV 27
#define KVM_EXIT_ARM_NISV 28
+#define KVM_EXIT_DIRTY_RING_FULL 29

/* For KVM_EXIT_INTERNAL_ERROR */
/* Emulate instruction failed. */
@@ -247,6 +248,11 @@ struct kvm_hyperv_exit {
/* Encounter unexpected vm-exit reason */
#define KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON 4

+struct kvm_dirty_ring_indexes {
+ __u32 avail_index; /* set by kernel */
+ __u32 fetch_index; /* set by userspace */
+};
+
/* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
struct kvm_run {
/* in */
@@ -421,6 +427,13 @@ struct kvm_run {
struct kvm_sync_regs regs;
char padding[SYNC_REGS_SIZE_BYTES];
} s;
+
+ struct kvm_dirty_ring_indexes vcpu_ring_indexes;
+};
+
+/* Returned by mmap(kvm->fd, offset=0) */
+struct kvm_vm_run {
+ struct kvm_dirty_ring_indexes vm_ring_indexes;
};

/* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
@@ -1009,6 +1022,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_PPC_GUEST_DEBUG_SSTEP 176
#define KVM_CAP_ARM_NISV_TO_USER 177
#define KVM_CAP_ARM_INJECT_EXT_DABT 178
+#define KVM_CAP_DIRTY_LOG_RING 179

#ifdef KVM_CAP_IRQ_ROUTING

@@ -1472,6 +1486,9 @@ struct kvm_enc_region {
/* Available with KVM_CAP_ARM_SVE */
#define KVM_ARM_VCPU_FINALIZE _IOW(KVMIO, 0xc2, int)

+/* Available with KVM_CAP_DIRTY_LOG_RING */
+#define KVM_RESET_DIRTY_RINGS _IO(KVMIO, 0xc3)
+
/* Secure Encrypted Virtualization command */
enum sev_cmd_id {
/* Guest initialization commands */
@@ -1622,4 +1639,23 @@ struct kvm_hyperv_eventfd {
#define KVM_HYPERV_CONN_ID_MASK 0x00ffffff
#define KVM_HYPERV_EVENTFD_DEASSIGN (1 << 0)

+/*
+ * The following are the requirements for supporting dirty log ring
+ * (by enabling KVM_DIRTY_LOG_PAGE_OFFSET).
+ *
+ * 1. Memory accesses by KVM should call kvm_vcpu_write_* instead
+ * of kvm_write_* so that the global dirty ring is not filled up
+ * too quickly.
+ * 2. kvm_arch_mmu_enable_log_dirty_pt_masked should be defined for
+ * enabling dirty logging.
+ * 3. There should not be a separate step to synchronize hardware
+ * dirty bitmap with KVM's.
+ */
+
+struct kvm_dirty_gfn {
+ __u32 pad;
+ __u32 slot;
+ __u64 offset;
+};
+
#endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
new file mode 100644
index 000000000000..9264891f3c32
--- /dev/null
+++ b/virt/kvm/dirty_ring.c
@@ -0,0 +1,156 @@
+#include <linux/kvm_host.h>
+#include <linux/kvm.h>
+#include <linux/vmalloc.h>
+#include <linux/kvm_dirty_ring.h>
+
+u32 kvm_dirty_ring_get_rsvd_entries(void)
+{
+ return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
+}
+
+int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring)
+{
+ u32 size = kvm->dirty_ring_size;
+
+ ring->dirty_gfns = vmalloc(size);
+ if (!ring->dirty_gfns)
+ return -ENOMEM;
+ memset(ring->dirty_gfns, 0, size);
+
+ ring->size = size / sizeof(struct kvm_dirty_gfn);
+ ring->soft_limit =
+ (kvm->dirty_ring_size / sizeof(struct kvm_dirty_gfn)) -
+ kvm_dirty_ring_get_rsvd_entries();
+ ring->dirty_index = 0;
+ ring->reset_index = 0;
+ spin_lock_init(&ring->lock);
+
+ return 0;
+}
+
+int kvm_dirty_ring_reset(struct kvm *kvm,
+ struct kvm_dirty_ring *ring,
+ struct kvm_dirty_ring_indexes *indexes)
+{
+ u32 cur_slot, next_slot;
+ u64 cur_offset, next_offset;
+ unsigned long mask;
+ u32 fetch;
+ int count = 0;
+ struct kvm_dirty_gfn *entry;
+
+ fetch = READ_ONCE(indexes->fetch_index);
+ if (fetch == ring->reset_index)
+ return 0;
+
+ entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
+ /*
+ * The ring buffer is shared with userspace, which might mmap
+ * it and concurrently modify slot and offset. Userspace must
+ * not be trusted! READ_ONCE prevents the compiler from changing
+ * the values after they've been range-checked (the checks are
+ * in kvm_reset_dirty_gfn).
+ */
+ smp_read_barrier_depends();
+ cur_slot = READ_ONCE(entry->slot);
+ cur_offset = READ_ONCE(entry->offset);
+ mask = 1;
+ count++;
+ ring->reset_index++;
+ while (ring->reset_index != fetch) {
+ entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
+ smp_read_barrier_depends();
+ next_slot = READ_ONCE(entry->slot);
+ next_offset = READ_ONCE(entry->offset);
+ ring->reset_index++;
+ count++;
+ /*
+ * Try to coalesce the reset operations when the guest is
+ * scanning pages in the same slot.
+ */
+ if (next_slot == cur_slot) {
+ int delta = next_offset - cur_offset;
+
+ if (delta >= 0 && delta < BITS_PER_LONG) {
+ mask |= 1ull << delta;
+ continue;
+ }
+
+ /* Backwards visit, careful about overflows! */
+ if (delta > -BITS_PER_LONG && delta < 0 &&
+ (mask << -delta >> -delta) == mask) {
+ cur_offset = next_offset;
+ mask = (mask << -delta) | 1;
+ continue;
+ }
+ }
+ kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
+ cur_slot = next_slot;
+ cur_offset = next_offset;
+ mask = 1;
+ }
+ kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
+
+ return count;
+}
+
+static inline u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
+{
+ return ring->dirty_index - ring->reset_index;
+}
+
+bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring)
+{
+ return kvm_dirty_ring_used(ring) >= ring->size;
+}
+
+/*
+ * Returns:
+ * >0 if we should kick the vcpu out,
+ * =0 if the gfn pushed successfully, or,
+ * <0 if error (e.g. ring full)
+ */
+int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
+ struct kvm_dirty_ring_indexes *indexes,
+ u32 slot, u64 offset, bool lock)
+{
+ int ret;
+ struct kvm_dirty_gfn *entry;
+
+ if (lock)
+ spin_lock(&ring->lock);
+
+ if (kvm_dirty_ring_full(ring)) {
+ ret = -EBUSY;
+ goto out;
+ }
+
+ entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
+ entry->slot = slot;
+ entry->offset = offset;
+ smp_wmb();
+ ring->dirty_index++;
+ WRITE_ONCE(indexes->avail_index, ring->dirty_index);
+ ret = kvm_dirty_ring_used(ring) >= ring->soft_limit;
+ pr_info("%s: slot %u offset %llu used %u\n",
+ __func__, slot, offset, kvm_dirty_ring_used(ring));
+
+out:
+ if (lock)
+ spin_unlock(&ring->lock);
+
+ return ret;
+}
+
+struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i)
+{
+ return vmalloc_to_page((void *)ring->dirty_gfns + i * PAGE_SIZE);
+}
+
+void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
+{
+ if (ring->dirty_gfns) {
+ vfree(ring->dirty_gfns);
+ ring->dirty_gfns = NULL;
+ }
+}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 681452d288cd..8642c977629b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -64,6 +64,8 @@
#define CREATE_TRACE_POINTS
#include <trace/events/kvm.h>

+#include <linux/kvm_dirty_ring.h>
+
/* Worst case buffer size needed for holding an integer. */
#define ITOA_MAX_LEN 12

@@ -149,6 +151,10 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
struct kvm_vcpu *vcpu,
struct kvm_memory_slot *memslot,
gfn_t gfn);
+static void mark_page_dirty_in_ring(struct kvm *kvm,
+ struct kvm_vcpu *vcpu,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn);

__visible bool kvm_rebooting;
EXPORT_SYMBOL_GPL(kvm_rebooting);
@@ -359,11 +365,22 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
vcpu->preempted = false;
vcpu->ready = false;

+ if (kvm->dirty_ring_size) {
+ r = kvm_dirty_ring_alloc(vcpu->kvm, &vcpu->dirty_ring);
+ if (r) {
+ kvm->dirty_ring_size = 0;
+ goto fail_free_run;
+ }
+ }
+
r = kvm_arch_vcpu_init(vcpu);
if (r < 0)
- goto fail_free_run;
+ goto fail_free_ring;
return 0;

+fail_free_ring:
+ if (kvm->dirty_ring_size)
+ kvm_dirty_ring_free(&vcpu->dirty_ring);
fail_free_run:
free_page((unsigned long)vcpu->run);
fail:
@@ -381,6 +398,8 @@ void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
put_pid(rcu_dereference_protected(vcpu->pid, 1));
kvm_arch_vcpu_uninit(vcpu);
free_page((unsigned long)vcpu->run);
+ if (vcpu->kvm->dirty_ring_size)
+ kvm_dirty_ring_free(&vcpu->dirty_ring);
}
EXPORT_SYMBOL_GPL(kvm_vcpu_uninit);

@@ -690,6 +709,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
struct kvm *kvm = kvm_arch_alloc_vm();
int r = -ENOMEM;
int i;
+ struct page *page;

if (!kvm)
return ERR_PTR(-ENOMEM);
@@ -705,6 +725,14 @@ static struct kvm *kvm_create_vm(unsigned long type)

BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);

+ page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+ if (!page) {
+ r = -ENOMEM;
+ goto out_err_alloc_page;
+ }
+ kvm->vm_run = page_address(page);
+ BUILD_BUG_ON(sizeof(struct kvm_vm_run) > PAGE_SIZE);
+
if (init_srcu_struct(&kvm->srcu))
goto out_err_no_srcu;
if (init_srcu_struct(&kvm->irq_srcu))
@@ -775,6 +803,9 @@ static struct kvm *kvm_create_vm(unsigned long type)
out_err_no_irq_srcu:
cleanup_srcu_struct(&kvm->srcu);
out_err_no_srcu:
+ free_page((unsigned long)page);
+ kvm->vm_run = NULL;
+out_err_alloc_page:
kvm_arch_free_vm(kvm);
mmdrop(current->mm);
return ERR_PTR(r);
@@ -800,6 +831,15 @@ static void kvm_destroy_vm(struct kvm *kvm)
int i;
struct mm_struct *mm = kvm->mm;

+ if (kvm->dirty_ring_size) {
+ kvm_dirty_ring_free(&kvm->vm_dirty_ring);
+ }
+
+ if (kvm->vm_run) {
+ free_page((unsigned long)kvm->vm_run);
+ kvm->vm_run = NULL;
+ }
+
kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
kvm_destroy_vm_debugfs(kvm);
kvm_arch_sync_events(kvm);
@@ -2301,7 +2341,7 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
{
if (memslot && memslot->dirty_bitmap) {
unsigned long rel_gfn = gfn - memslot->base_gfn;
-
+ mark_page_dirty_in_ring(kvm, vcpu, memslot, gfn);
set_bit_le(rel_gfn, memslot->dirty_bitmap);
}
}
@@ -2649,6 +2689,13 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
}
EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);

+static bool kvm_fault_in_dirty_ring(struct kvm *kvm, struct vm_fault *vmf)
+{
+ return (vmf->pgoff >= KVM_DIRTY_LOG_PAGE_OFFSET) &&
+ (vmf->pgoff < KVM_DIRTY_LOG_PAGE_OFFSET +
+ kvm->dirty_ring_size / PAGE_SIZE);
+}
+
static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
{
struct kvm_vcpu *vcpu = vmf->vma->vm_file->private_data;
@@ -2664,6 +2711,10 @@ static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
else if (vmf->pgoff == KVM_COALESCED_MMIO_PAGE_OFFSET)
page = virt_to_page(vcpu->kvm->coalesced_mmio_ring);
#endif
+ else if (kvm_fault_in_dirty_ring(vcpu->kvm, vmf))
+ page = kvm_dirty_ring_get_page(
+ &vcpu->dirty_ring,
+ vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
else
return kvm_arch_vcpu_fault(vcpu, vmf);
get_page(page);
@@ -3259,12 +3310,162 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
#endif
case KVM_CAP_NR_MEMSLOTS:
return KVM_USER_MEM_SLOTS;
+ case KVM_CAP_DIRTY_LOG_RING:
+ /* Version will be zero if arch didn't implement it */
+ return KVM_DIRTY_RING_VERSION;
default:
break;
}
return kvm_vm_ioctl_check_extension(kvm, arg);
}

+static void mark_page_dirty_in_ring(struct kvm *kvm,
+ struct kvm_vcpu *vcpu,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn)
+{
+ u32 as_id = 0;
+ u64 offset;
+ int ret;
+ struct kvm_dirty_ring *ring;
+ struct kvm_dirty_ring_indexes *indexes;
+ bool is_vm_ring;
+
+ if (!kvm->dirty_ring_size)
+ return;
+
+ offset = gfn - slot->base_gfn;
+
+ if (vcpu) {
+ as_id = kvm_arch_vcpu_memslots_id(vcpu);
+ } else {
+ as_id = 0;
+ vcpu = kvm_get_running_vcpu();
+ }
+
+ if (vcpu) {
+ ring = &vcpu->dirty_ring;
+ indexes = &vcpu->run->vcpu_ring_indexes;
+ is_vm_ring = false;
+ } else {
+ /*
+ * Put onto per vm ring because no vcpu context. Kick
+ * vcpu0 if ring is full.
+ */
+ vcpu = kvm->vcpus[0];
+ ring = &kvm->vm_dirty_ring;
+ indexes = &kvm->vm_run->vm_ring_indexes;
+ is_vm_ring = true;
+ }
+
+ ret = kvm_dirty_ring_push(ring, indexes,
+ (as_id << 16)|slot->id, offset,
+ is_vm_ring);
+ if (ret < 0) {
+ if (is_vm_ring)
+ pr_warn_once("vcpu %d dirty log overflow\n",
+ vcpu->vcpu_id);
+ else
+ pr_warn_once("per-vm dirty log overflow\n");
+ return;
+ }
+
+ if (ret)
+ kvm_make_request(KVM_REQ_DIRTY_RING_FULL, vcpu);
+}
+
+void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
+{
+ struct kvm_memory_slot *memslot;
+ int as_id, id;
+
+ as_id = slot >> 16;
+ id = (u16)slot;
+ if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
+ return;
+
+ memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
+ if (offset >= memslot->npages)
+ return;
+
+ spin_lock(&kvm->mmu_lock);
+ /* FIXME: we should use a single AND operation, but there is no
+ * applicable atomic API.
+ */
+ while (mask) {
+ clear_bit_le(offset + __ffs(mask), memslot->dirty_bitmap);
+ mask &= mask - 1;
+ }
+
+ kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
+ spin_unlock(&kvm->mmu_lock);
+}
+
+static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
+{
+ int r;
+
+ /* the size should be power of 2 */
+ if (!size || (size & (size - 1)))
+ return -EINVAL;
+
+ /* Should be bigger to keep the reserved entries, or a page */
+ if (size < kvm_dirty_ring_get_rsvd_entries() *
+ sizeof(struct kvm_dirty_gfn) || size < PAGE_SIZE)
+ return -EINVAL;
+
+ if (size > KVM_DIRTY_RING_MAX_ENTRIES *
+ sizeof(struct kvm_dirty_gfn))
+ return -E2BIG;
+
+ /* We only allow it to set once */
+ if (kvm->dirty_ring_size)
+ return -EINVAL;
+
+ mutex_lock(&kvm->lock);
+
+ if (kvm->created_vcpus) {
+ /* We don't allow to change this value after vcpu created */
+ r = -EINVAL;
+ } else {
+ kvm->dirty_ring_size = size;
+ r = kvm_dirty_ring_alloc(kvm, &kvm->vm_dirty_ring);
+ if (r) {
+ /* Unset dirty ring */
+ kvm->dirty_ring_size = 0;
+ }
+ }
+
+ mutex_unlock(&kvm->lock);
+ return r;
+}
+
+static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
+{
+ int i;
+ struct kvm_vcpu *vcpu;
+ int cleared = 0;
+
+ if (!kvm->dirty_ring_size)
+ return -EINVAL;
+
+ mutex_lock(&kvm->slots_lock);
+
+ cleared += kvm_dirty_ring_reset(kvm, &kvm->vm_dirty_ring,
+ &kvm->vm_run->vm_ring_indexes);
+
+ kvm_for_each_vcpu(i, vcpu, kvm)
+ cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring,
+ &vcpu->run->vcpu_ring_indexes);
+
+ mutex_unlock(&kvm->slots_lock);
+
+ if (cleared)
+ kvm_flush_remote_tlbs(kvm);
+
+ return cleared;
+}
+
int __attribute__((weak)) kvm_vm_ioctl_enable_cap(struct kvm *kvm,
struct kvm_enable_cap *cap)
{
@@ -3282,6 +3483,8 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
kvm->manual_dirty_log_protect = cap->args[0];
return 0;
#endif
+ case KVM_CAP_DIRTY_LOG_RING:
+ return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
default:
return kvm_vm_ioctl_enable_cap(kvm, cap);
}
@@ -3469,6 +3672,9 @@ static long kvm_vm_ioctl(struct file *filp,
case KVM_CHECK_EXTENSION:
r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
break;
+ case KVM_RESET_DIRTY_RINGS:
+ r = kvm_vm_ioctl_reset_dirty_pages(kvm);
+ break;
default:
r = kvm_arch_vm_ioctl(filp, ioctl, arg);
}
@@ -3517,9 +3723,39 @@ static long kvm_vm_compat_ioctl(struct file *filp,
}
#endif

+static vm_fault_t kvm_vm_fault(struct vm_fault *vmf)
+{
+ struct kvm *kvm = vmf->vma->vm_file->private_data;
+ struct page *page = NULL;
+
+ if (vmf->pgoff == 0)
+ page = virt_to_page(kvm->vm_run);
+ else if (kvm_fault_in_dirty_ring(kvm, vmf))
+ page = kvm_dirty_ring_get_page(
+ &kvm->vm_dirty_ring,
+ vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
+ else
+ return VM_FAULT_SIGBUS;
+
+ get_page(page);
+ vmf->page = page;
+ return 0;
+}
+
+static const struct vm_operations_struct kvm_vm_vm_ops = {
+ .fault = kvm_vm_fault,
+};
+
+static int kvm_vm_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ vma->vm_ops = &kvm_vm_vm_ops;
+ return 0;
+}
+
static struct file_operations kvm_vm_fops = {
.release = kvm_vm_release,
.unlocked_ioctl = kvm_vm_ioctl,
+ .mmap = kvm_vm_mmap,
.llseek = noop_llseek,
KVM_COMPAT(kvm_vm_compat_ioctl),
};
--
2.21.0

2019-11-30 08:32:41

by Paolo Bonzini

Subject: Re: [PATCH RFC 00/15] KVM: Dirty ring interface

Hi Peter,

thanks for the RFC! Just a couple comments before I look at the series
(for which I don't expect many surprises).

On 29/11/19 22:34, Peter Xu wrote:
> I marked this series as RFC because I'm at least uncertain on this
> change of vcpu_enter_guest():
>
> if (kvm_check_request(KVM_REQ_DIRTY_RING_FULL, vcpu)) {
> vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> /*
> * If this is requested, it means that we've
> * marked the dirty bit in the dirty ring BUT
> * we've not written the data yet. Do it now.
> */
> r = kvm_emulate_instruction(vcpu, 0);
> r = r >= 0 ? 0 : r;
> goto out;
> }

This is not needed; it will just be a false negative (a dirty page that
actually isn't dirty). The dirty bit will be cleared when userspace
resets the ring buffer; then the instruction will be executed again and
will mark the page dirty again. Since a full ring is not a common
condition, it's not a big deal.

> I did a kvm_emulate_instruction() when dirty ring reaches softlimit
> and want to exit to userspace, however I'm not really sure whether
> there could have any side effect. I'd appreciate any comment of
> above, or anything else.
>
> Tests
> ===========
>
> I wanted to continue work on the QEMU part, but after I noticed that
> the interface might still be prone to change, I posted this series first.
> However to make sure it's at least working, I've provided unit tests
> together with the series. The unit tests should be able to test the
> series in at least three major paths:
>
> (1) ./dirty_log_test -M dirty-ring
>
> This tests async ring operations: this should be the major work
> mode for the dirty ring interface, say, when the kernel is
> queuing more data, the userspace is collecting too. The ring can
> hardly reach full when working like this, because in most
> cases the collection could be fast.
>
> (2) ./dirty_log_test -M dirty-ring -c 1024
>
> This sets the ring size to be very small so that ring soft-full
> always triggers (soft-full is a soft limit of the ring state,
> when the dirty ring reaches the soft limit it'll do a userspace
> exit and let the userspace to collect the data).
>
> (3) ./dirty_log_test -M dirty-ring-wait-queue
>
> This solely tests the extreme case where the ring is full. When the
> ring is completely full, the thread (no matter vcpu or not) will
> be put onto a per-vm waitqueue, and KVM_RESET_DIRTY_RINGS will
> wake the threads up (assuming until which the ring will not be
> full any more).

One question about this testcase: why does the task get into
uninterruptible wait?

Paolo

>
> Thanks,
>
> Cao, Lei (2):
> KVM: Add kvm/vcpu argument to mark_dirty_page_in_slot
> KVM: X86: Implement ring-based dirty memory tracking
>
> Paolo Bonzini (1):
> KVM: Move running VCPU from ARM to common code
>
> Peter Xu (12):
> KVM: Add build-time error check on kvm_run size
> KVM: Implement ring-based dirty memory tracking
> KVM: Make dirty ring exclusive to dirty bitmap log
> KVM: Introduce dirty ring wait queue
> KVM: selftests: Always clear dirty bitmap after iteration
> KVM: selftests: Sync uapi/linux/kvm.h to tools/
> KVM: selftests: Use a single binary for dirty/clear log test
> KVM: selftests: Introduce after_vcpu_run hook for dirty log test
> KVM: selftests: Add dirty ring buffer test
> KVM: selftests: Let dirty_log_test async for dirty ring test
> KVM: selftests: Add "-c" parameter to dirty log test
> KVM: selftests: Test dirty ring waitqueue
>
> Documentation/virt/kvm/api.txt | 116 +++++
> arch/arm/include/asm/kvm_host.h | 2 -
> arch/arm64/include/asm/kvm_host.h | 2 -
> arch/x86/include/asm/kvm_host.h | 5 +
> arch/x86/include/uapi/asm/kvm.h | 1 +
> arch/x86/kvm/Makefile | 3 +-
> arch/x86/kvm/mmu/mmu.c | 6 +
> arch/x86/kvm/vmx/vmx.c | 7 +
> arch/x86/kvm/x86.c | 12 +
> include/linux/kvm_dirty_ring.h | 67 +++
> include/linux/kvm_host.h | 37 ++
> include/linux/kvm_types.h | 1 +
> include/uapi/linux/kvm.h | 36 ++
> tools/include/uapi/linux/kvm.h | 47 ++
> tools/testing/selftests/kvm/Makefile | 2 -
> .../selftests/kvm/clear_dirty_log_test.c | 2 -
> tools/testing/selftests/kvm/dirty_log_test.c | 452 ++++++++++++++++--
> .../testing/selftests/kvm/include/kvm_util.h | 6 +
> tools/testing/selftests/kvm/lib/kvm_util.c | 103 ++++
> .../selftests/kvm/lib/kvm_util_internal.h | 5 +
> virt/kvm/arm/arm.c | 29 --
> virt/kvm/arm/perf.c | 6 +-
> virt/kvm/arm/vgic/vgic-mmio.c | 15 +-
> virt/kvm/dirty_ring.c | 156 ++++++
> virt/kvm/kvm_main.c | 315 +++++++++++-
> 25 files changed, 1329 insertions(+), 104 deletions(-)
> create mode 100644 include/linux/kvm_dirty_ring.h
> delete mode 100644 tools/testing/selftests/kvm/clear_dirty_log_test.c
> create mode 100644 virt/kvm/dirty_ring.c
>

2019-12-02 02:15:35

by Peter Xu

Subject: Re: [PATCH RFC 00/15] KVM: Dirty ring interface

On Sat, Nov 30, 2019 at 09:29:42AM +0100, Paolo Bonzini wrote:
> Hi Peter,
>
> thanks for the RFC! Just a couple comments before I look at the series
> (for which I don't expect many surprises).
>
> On 29/11/19 22:34, Peter Xu wrote:
> > I marked this series as RFC because I'm at least uncertain on this
> > change of vcpu_enter_guest():
> >
> > if (kvm_check_request(KVM_REQ_DIRTY_RING_FULL, vcpu)) {
> > vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> > /*
> > * If this is requested, it means that we've
> > * marked the dirty bit in the dirty ring BUT
> > * we've not written the data yet. Do it now.
> > */
> > r = kvm_emulate_instruction(vcpu, 0);
> > r = r >= 0 ? 0 : r;
> > goto out;
> > }
>
> This is not needed, it will just be a false negative (dirty page that
> actually isn't dirty). The dirty bit will be cleared when userspace
> resets the ring buffer; then the instruction will be executed again and
> mark the page dirty again. Since ring full is not a common condition,
> it's not a big deal.

Actually I added this only because it failed one of the unit tests
when verifying the dirty bits... But now, on second thought, I
probably agree with you that we can change the userspace side too to
fix this.

I think the steps of the failing test case can be simplified into
something like this (assuming the QEMU migration context, which might
be easier to understand):

1. page P has data P1
2. vcpu writes to page P, with data P2
3. vmexit (P is still with data P1)
4. mark P as dirty, ring full, user exit
5. collect dirty bit P, migrate P with data P1
6. vcpu runs again for some reason, P is written with P2, user exit again
(because the ring is already reaching the soft limit)
7. do KVM_RESET_DIRTY_RINGS
8. never write to P again

Then P will always be P1 on the destination, while it'll be P2 on the source.

I think maybe that's why we need to be very sure that when the
userspace exit happens (soft limit reached), we kick all the vcpus
out, and more importantly we must _not_ let them run again before
KVM_RESET_DIRTY_RINGS, otherwise we might face data corruption. I'm
not sure whether we should mention this in the documentation so that
userspace is aware of the issue.
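
In userspace terms, the ordering I mean is roughly the following
(placeholder helper names, only to show the sequence; includes and
error handling omitted):

/*
 * Sketch of the required ordering only; pause_all_vcpus(),
 * harvest_all_rings() and resume_all_vcpus() are placeholders for
 * whatever mechanisms the VMM already has.
 */
static void handle_dirty_ring_soft_full(int vm_fd)
{
        pause_all_vcpus();              /* no vcpu may re-enter KVM_RUN... */
        harvest_all_rings();            /* ...until every ring is collected */
        ioctl(vm_fd, KVM_RESET_DIRTY_RINGS, 0);
        resume_all_vcpus();             /* only now is it safe to run again */
}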

On the other hand, I tried removing the kvm_emulate_instruction()
above and fixing the test case, but I found that the last address
before the user exit is not actually written again after the next
vmenter right after KVM_RESET_DIRTY_RINGS, so the dirty bit was truly
lost... I'm pasting some traces below (I added some tracepoints too;
I think I'll just keep them for v2):

...
dirty_log_test-29003 [001] 184503.384328: kvm_entry: vcpu 1
dirty_log_test-29003 [001] 184503.384329: kvm_exit: reason EPT_VIOLATION rip 0x40359f info 582 0
dirty_log_test-29003 [001] 184503.384329: kvm_page_fault: address 7fc036d000 error_code 582
dirty_log_test-29003 [001] 184503.384331: kvm_entry: vcpu 1
dirty_log_test-29003 [001] 184503.384332: kvm_exit: reason EPT_VIOLATION rip 0x40359f info 582 0
dirty_log_test-29003 [001] 184503.384332: kvm_page_fault: address 7fc036d000 error_code 582
dirty_log_test-29003 [001] 184503.384332: kvm_dirty_ring_push: ring 1: dirty 0x37f reset 0x1c0 slot 1 offset 0x37e ret 0 (used 447)
dirty_log_test-29003 [001] 184503.384333: kvm_entry: vcpu 1
dirty_log_test-29003 [001] 184503.384334: kvm_exit: reason EPT_VIOLATION rip 0x40359f info 582 0
dirty_log_test-29003 [001] 184503.384334: kvm_page_fault: address 7fc036e000 error_code 582
dirty_log_test-29003 [001] 184503.384336: kvm_entry: vcpu 1
dirty_log_test-29003 [001] 184503.384336: kvm_exit: reason EPT_VIOLATION rip 0x40359f info 582 0
dirty_log_test-29003 [001] 184503.384336: kvm_page_fault: address 7fc036e000 error_code 582
dirty_log_test-29003 [001] 184503.384337: kvm_dirty_ring_push: ring 1: dirty 0x380 reset 0x1c0 slot 1 offset 0x37f ret 1 (used 448)
dirty_log_test-29003 [001] 184503.384337: kvm_dirty_ring_exit: vcpu 1
dirty_log_test-29003 [001] 184503.384338: kvm_fpu: unload
dirty_log_test-29003 [001] 184503.384340: kvm_userspace_exit: reason 0x1d (29)
dirty_log_test-29000 [006] 184503.505103: kvm_dirty_ring_reset: ring 1: dirty 0x380 reset 0x380 (used 0)
dirty_log_test-29003 [001] 184503.505184: kvm_fpu: load
dirty_log_test-29003 [001] 184503.505187: kvm_entry: vcpu 1
dirty_log_test-29003 [001] 184503.505193: kvm_exit: reason EPT_VIOLATION rip 0x40359f info 582 0
dirty_log_test-29003 [001] 184503.505194: kvm_page_fault: address 7fc036f000 error_code 582 <-------- [1]
dirty_log_test-29003 [001] 184503.505206: kvm_entry: vcpu 1
dirty_log_test-29003 [001] 184503.505207: kvm_exit: reason EPT_VIOLATION rip 0x40359f info 582 0
dirty_log_test-29003 [001] 184503.505207: kvm_page_fault: address 7fc036f000 error_code 582
dirty_log_test-29003 [001] 184503.505226: kvm_dirty_ring_push: ring 1: dirty 0x381 reset 0x380 slot 1 offset 0x380 ret 0 (used 1)
dirty_log_test-29003 [001] 184503.505226: kvm_entry: vcpu 1
dirty_log_test-29003 [001] 184503.505227: kvm_exit: reason EPT_VIOLATION rip 0x40359f info 582 0
dirty_log_test-29003 [001] 184503.505228: kvm_page_fault: address 7fc0370000 error_code 582
dirty_log_test-29003 [001] 184503.505231: kvm_entry: vcpu 1
...

The test was continuously writing to pages; in the log above it
starts from 7fc036d000. Reason 0x1d (29) is the new dirty-ring-full
exit reason.

So far I'm still unsure of two things:

1. Why each page faulted twice rather than once. Take the example of
the page at 7fc036e000 above: the first fault didn't trigger the
mark-dirty path, and only on the 2nd EPT violation did we trigger
kvm_dirty_ring_push.

2. Why the last page wasn't written again after kvm_userspace_exit
(the last page was 7fc036e000, and the test failed because 7fc036e000
was detected as changed but its dirty bit was unset). In this case
the first write after KVM_RESET_DIRTY_RINGS is the line pointed to by
[1]. I thought it should be a rewrite of page 7fc036e000, because
when the user exit happens the write should logically not have
happened yet and the instruction pointer should be unchanged.
However, at [1] it's already writing to a new page.

I'll continue to dig tomorrow, but quick answers would be greatly
welcomed too. :)

>
> > I did a kvm_emulate_instruction() when dirty ring reaches softlimit
> > and want to exit to userspace, however I'm not really sure whether
> > there could have any side effect. I'd appreciate any comment of
> > above, or anything else.
> >
> > Tests
> > ===========
> >
> > I wanted to continue work on the QEMU part, but after I noticed that
> > the interface might still be prone to change, I posted this series first.
> > However to make sure it's at least working, I've provided unit tests
> > together with the series. The unit tests should be able to test the
> > series in at least three major paths:
> >
> > (1) ./dirty_log_test -M dirty-ring
> >
> > This tests async ring operations: this should be the major work
> > mode for the dirty ring interface, say, when the kernel is
> > queuing more data, the userspace is collecting too. The ring can
> > hardly reach full when working like this, because in most
> > cases the collection could be fast.
> >
> > (2) ./dirty_log_test -M dirty-ring -c 1024
> >
> > This sets the ring size to be very small so that ring soft-full
> > always triggers (soft-full is a soft limit of the ring state,
> > when the dirty ring reaches the soft limit it'll do a userspace
> > exit and let the userspace to collect the data).
> >
> > (3) ./dirty_log_test -M dirty-ring-wait-queue
> >
> > This solely tests the extreme case where the ring is full. When the
> > ring is completely full, the thread (no matter vcpu or not) will
> > be put onto a per-vm waitqueue, and KVM_RESET_DIRTY_RINGS will
> > wake the threads up (assuming until which the ring will not be
> > full any more).
>
> One question about this testcase: why does the task get into
> uninterruptible wait?

Because I'm using wait_event_killable() to wait when the ring is
completely full. I thought we should be strict there because it's
rare after all (even rarer than reaching the soft limit), and with
that we will never have a chance to lose a dirty bit accidentally. Or
do you think we should still respond to non-fatal signals for some
reason even during that wait period?
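
For reference, the full-ring wait is conceptually just the following
(a sketch only; the waitqueue field name below is an assumption, not
necessarily what the wait queue patch actually uses):

	/*
	 * Conceptual sketch only -- "dirty_ring_waitq" is an assumed
	 * field name, not the exact code in the wait queue patch.
	 */
	if (kvm_dirty_ring_full(ring)) {
		int r;

		/*
		 * wait_event_killable() sleeps in TASK_KILLABLE, so only
		 * fatal signals can interrupt it; that is why the task
		 * shows up in uninterruptible (D) state rather than an
		 * interruptible sleep.
		 */
		r = wait_event_killable(kvm->dirty_ring_waitq,
					!kvm_dirty_ring_full(ring));
		if (r)
			return r;	/* fatal signal pending */
	}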

Thanks,

--
Peter Xu

2019-12-02 20:13:02

by Sean Christopherson

Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On Fri, Nov 29, 2019 at 04:34:54PM -0500, Peter Xu wrote:
> This patch is heavily based on previous work from Lei Cao
> <[email protected]> and Paolo Bonzini <[email protected]>. [1]
>
> KVM currently uses large bitmaps to track dirty memory. These bitmaps
> are copied to userspace when userspace queries KVM for its dirty page
> information. The use of bitmaps is mostly sufficient for live
> migration, as large parts of memory are dirtied from one log-dirty
> pass to another. However, in a checkpointing system, the number of
> dirty pages is small and in fact it is often bounded---the VM is
> paused when it has dirtied a pre-defined number of pages. Traversing a
> large, sparsely populated bitmap to find set bits is time-consuming,
> as is copying the bitmap to user-space.
>
> A similar issue will be there for live migration when the guest memory
> is huge while the page dirty procedure is trivial. In that case for
> each dirty sync we need to pull the whole dirty bitmap to userspace
> and analyse every bit even if it's mostly zeros.
>
> The preferred data structure for above scenarios is a dense list of
> guest frame numbers (GFN). This patch series stores the dirty list in
> kernel memory that can be memory mapped into userspace to allow speedy
> harvesting.
>
> We defined two new data structures:
>
> struct kvm_dirty_ring;
> struct kvm_dirty_ring_indexes;
>
> Firstly, kvm_dirty_ring is defined to represent a ring of dirty
> pages. When dirty tracking is enabled, we can push dirty gfn onto the
> ring.
>
> Secondly, kvm_dirty_ring_indexes is defined to represent the
> user/kernel interface of each ring. Currently it contains two
> indexes: (1) avail_index represents where we should push our next
> PFN (written by kernel), while (2) fetch_index represents where the
> userspace should fetch the next dirty PFN (written by userspace).
>
> One complete ring is composed by one kvm_dirty_ring plus its
> corresponding kvm_dirty_ring_indexes.
>
> Currently, we have N+1 rings for each VM of N vcpus:
>
> - for each vcpu, we have 1 per-vcpu dirty ring,
> - for each vm, we have 1 per-vm dirty ring

Why? I assume the purpose of per-vcpu rings is to avoid contention between
threads, but the motivation needs to be explicitly stated. And why is a
per-vm fallback ring needed?

If my assumption is correct, have other approaches been tried/profiled?
E.g. using cmpxchg to reserve N number of entries in a shared ring. IMO,
adding kvm_get_running_vcpu() is a hack that is just asking for future
abuse and the vcpu/vm/as_id interactions in mark_page_dirty_in_ring()
look extremely fragile. I also dislike having two different mechanisms
for accessing the ring (lock for per-vm, something else for per-vcpu).
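
Roughly something like the following (a sketch of the idea only, not
tested against this series; publishing avail_index in order is the
harder part and is left out here):

/*
 * Kernel-side sketch of reserving a slot in one shared ring with
 * cmpxchg; dirty_index/reset_index are free running as in the series.
 */
static int dirty_ring_reserve(struct kvm_dirty_ring *ring, u32 *slot)
{
	u32 old, new;

	do {
		old = READ_ONCE(ring->dirty_index);
		if (old - READ_ONCE(ring->reset_index) >= ring->size)
			return -EBUSY;		/* ring full */
		new = old + 1;
	} while (cmpxchg(&ring->dirty_index, old, new) != old);

	*slot = old & (ring->size - 1);		/* caller fills this entry */
	return 0;
}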

> Please refer to the documentation update in this patch for more
> details.
>
> Note that this patch implements the core logic of dirty ring buffer.
> It's still disabled for all archs for now. Also, we'll address some
> of the other issues in follow up patches before it's firstly enabled
> on x86.
>
> [1] https://patchwork.kernel.org/patch/10471409/
>
> Signed-off-by: Lei Cao <[email protected]>
> Signed-off-by: Paolo Bonzini <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
> ---

...

> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> new file mode 100644
> index 000000000000..9264891f3c32
> --- /dev/null
> +++ b/virt/kvm/dirty_ring.c
> @@ -0,0 +1,156 @@
> +#include <linux/kvm_host.h>
> +#include <linux/kvm.h>
> +#include <linux/vmalloc.h>
> +#include <linux/kvm_dirty_ring.h>
> +
> +u32 kvm_dirty_ring_get_rsvd_entries(void)
> +{
> + return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
> +}
> +
> +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring)
> +{
> + u32 size = kvm->dirty_ring_size;

Just pass in @size, that way you don't need @kvm. And the callers will be
less ugly, e.g. the initial allocation won't need to speculatively set
kvm->dirty_ring_size.
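
For illustration, a minimal sketch of the suggested change (only the
signature and the zeroing differ from the patch; vzalloc() stands in for
the vmalloc()+memset() pair):

int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, u32 size)
{
	/* @size is in bytes and is passed in by the caller. */
	ring->dirty_gfns = vzalloc(size);
	if (!ring->dirty_gfns)
		return -ENOMEM;

	ring->size = size / sizeof(struct kvm_dirty_gfn);
	ring->soft_limit = ring->size - kvm_dirty_ring_get_rsvd_entries();
	ring->dirty_index = 0;
	ring->reset_index = 0;
	spin_lock_init(&ring->lock);

	return 0;
}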

> +
> + ring->dirty_gfns = vmalloc(size);
> + if (!ring->dirty_gfns)
> + return -ENOMEM;
> + memset(ring->dirty_gfns, 0, size);
> +
> + ring->size = size / sizeof(struct kvm_dirty_gfn);
> + ring->soft_limit =
> + (kvm->dirty_ring_size / sizeof(struct kvm_dirty_gfn)) -

And passing @size avoids issues like this where a local var is ignored.

> + kvm_dirty_ring_get_rsvd_entries();
> + ring->dirty_index = 0;
> + ring->reset_index = 0;
> + spin_lock_init(&ring->lock);
> +
> + return 0;
> +}
> +

...

> +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
> +{
> + if (ring->dirty_gfns) {

Why condition freeing the dirty ring on kvm->dirty_ring_size? This
function obviously protects itself. Not to mention vfree() also plays
nice with a NULL input.
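
That is, the whole function can probably shrink to something like this
(sketch only):

void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
{
	/* vfree(NULL) is a no-op, so no explicit check is needed. */
	vfree(ring->dirty_gfns);
	ring->dirty_gfns = NULL;
}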

> + vfree(ring->dirty_gfns);
> + ring->dirty_gfns = NULL;
> + }
> +}
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 681452d288cd..8642c977629b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -64,6 +64,8 @@
> #define CREATE_TRACE_POINTS
> #include <trace/events/kvm.h>
>
> +#include <linux/kvm_dirty_ring.h>
> +
> /* Worst case buffer size needed for holding an integer. */
> #define ITOA_MAX_LEN 12
>
> @@ -149,6 +151,10 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
> struct kvm_vcpu *vcpu,
> struct kvm_memory_slot *memslot,
> gfn_t gfn);
> +static void mark_page_dirty_in_ring(struct kvm *kvm,
> + struct kvm_vcpu *vcpu,
> + struct kvm_memory_slot *slot,
> + gfn_t gfn);
>
> __visible bool kvm_rebooting;
> EXPORT_SYMBOL_GPL(kvm_rebooting);
> @@ -359,11 +365,22 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
> vcpu->preempted = false;
> vcpu->ready = false;
>
> + if (kvm->dirty_ring_size) {
> + r = kvm_dirty_ring_alloc(vcpu->kvm, &vcpu->dirty_ring);
> + if (r) {
> + kvm->dirty_ring_size = 0;
> + goto fail_free_run;

This looks wrong, kvm->dirty_ring_size is used to free allocations, i.e.
previous allocations will leak if a vcpu allocation fails.
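
One way to avoid the leak (sketch, keeping the labels and helper names
from the patch) is to leave kvm->dirty_ring_size untouched and only
unwind this vcpu:

	if (kvm->dirty_ring_size) {
		r = kvm_dirty_ring_alloc(vcpu->kvm, &vcpu->dirty_ring);
		if (r)
			/*
			 * Do not clear kvm->dirty_ring_size here; it is
			 * still needed to free the rings of the vcpus
			 * created before this one.
			 */
			goto fail_free_run;
	}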

> + }
> + }
> +
> r = kvm_arch_vcpu_init(vcpu);
> if (r < 0)
> - goto fail_free_run;
> + goto fail_free_ring;
> return 0;
>
> +fail_free_ring:
> + if (kvm->dirty_ring_size)
> + kvm_dirty_ring_free(&vcpu->dirty_ring);
> fail_free_run:
> free_page((unsigned long)vcpu->run);
> fail:
> @@ -381,6 +398,8 @@ void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
> put_pid(rcu_dereference_protected(vcpu->pid, 1));
> kvm_arch_vcpu_uninit(vcpu);
> free_page((unsigned long)vcpu->run);
> + if (vcpu->kvm->dirty_ring_size)
> + kvm_dirty_ring_free(&vcpu->dirty_ring);
> }
> EXPORT_SYMBOL_GPL(kvm_vcpu_uninit);
>
> @@ -690,6 +709,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
> struct kvm *kvm = kvm_arch_alloc_vm();
> int r = -ENOMEM;
> int i;
> + struct page *page;
>
> if (!kvm)
> return ERR_PTR(-ENOMEM);
> @@ -705,6 +725,14 @@ static struct kvm *kvm_create_vm(unsigned long type)
>
> BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
>
> + page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> + if (!page) {
> + r = -ENOMEM;
> + goto out_err_alloc_page;
> + }
> + kvm->vm_run = page_address(page);
> + BUILD_BUG_ON(sizeof(struct kvm_vm_run) > PAGE_SIZE);
> +
> if (init_srcu_struct(&kvm->srcu))
> goto out_err_no_srcu;
> if (init_srcu_struct(&kvm->irq_srcu))
> @@ -775,6 +803,9 @@ static struct kvm *kvm_create_vm(unsigned long type)
> out_err_no_irq_srcu:
> cleanup_srcu_struct(&kvm->srcu);
> out_err_no_srcu:
> + free_page((unsigned long)page);
> + kvm->vm_run = NULL;

No need to nullify vm_run.

> +out_err_alloc_page:
> kvm_arch_free_vm(kvm);
> mmdrop(current->mm);
> return ERR_PTR(r);
> @@ -800,6 +831,15 @@ static void kvm_destroy_vm(struct kvm *kvm)
> int i;
> struct mm_struct *mm = kvm->mm;
>
> + if (kvm->dirty_ring_size) {
> + kvm_dirty_ring_free(&kvm->vm_dirty_ring);
> + }

Unnecessary parentheses.

> +
> + if (kvm->vm_run) {
> + free_page((unsigned long)kvm->vm_run);
> + kvm->vm_run = NULL;
> + }
> +
> kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
> kvm_destroy_vm_debugfs(kvm);
> kvm_arch_sync_events(kvm);
> @@ -2301,7 +2341,7 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
> {
> if (memslot && memslot->dirty_bitmap) {
> unsigned long rel_gfn = gfn - memslot->base_gfn;
> -
> + mark_page_dirty_in_ring(kvm, vcpu, memslot, gfn);
> set_bit_le(rel_gfn, memslot->dirty_bitmap);
> }
> }
> @@ -2649,6 +2689,13 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
> }
> EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);

2019-12-02 20:30:01

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH RFC 00/15] KVM: Dirty ring interface

On Sat, Nov 30, 2019 at 09:29:42AM +0100, Paolo Bonzini wrote:
> Hi Peter,
>
> thanks for the RFC! Just a couple comments before I look at the series
> (for which I don't expect many surprises).
>
> On 29/11/19 22:34, Peter Xu wrote:
> > I marked this series as RFC because I'm at least uncertain on this
> > change of vcpu_enter_guest():
> >
> > if (kvm_check_request(KVM_REQ_DIRTY_RING_FULL, vcpu)) {
> > vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> > /*
> > * If this is requested, it means that we've
> > * marked the dirty bit in the dirty ring BUT
> > * we've not written the date. Do it now.
> > */
> > r = kvm_emulate_instruction(vcpu, 0);
> > r = r >= 0 ? 0 : r;
> > goto out;
> > }
>
> This is not needed, it will just be a false negative (dirty page that
> actually isn't dirty). The dirty bit will be cleared when userspace
> resets the ring buffer; then the instruction will be executed again and
> mark the page dirty again. Since ring full is not a common condition,
> it's not a big deal.

Side topic, KVM_REQ_DIRTY_RING_FULL is misnamed, it's set when a ring goes
above its soft limit, not when the ring is actually full. It took quite a
bit of digging to figure out whether or not PML was broken...

2019-12-02 20:45:02

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH RFC 00/15] KVM: Dirty ring interface

On Mon, Dec 02, 2019 at 12:21:19PM -0800, Sean Christopherson wrote:
> On Sat, Nov 30, 2019 at 09:29:42AM +0100, Paolo Bonzini wrote:
> > Hi Peter,
> >
> > thanks for the RFC! Just a couple comments before I look at the series
> > (for which I don't expect many surprises).
> >
> > On 29/11/19 22:34, Peter Xu wrote:
> > > I marked this series as RFC because I'm at least uncertain on this
> > > change of vcpu_enter_guest():
> > >
> > > if (kvm_check_request(KVM_REQ_DIRTY_RING_FULL, vcpu)) {
> > > vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> > > /*
> > > * If this is requested, it means that we've
> > > * marked the dirty bit in the dirty ring BUT
> > > * we've not written the date. Do it now.
> > > */
> > > r = kvm_emulate_instruction(vcpu, 0);
> > > r = r >= 0 ? 0 : r;
> > > goto out;
> > > }
> >
> > This is not needed, it will just be a false negative (dirty page that
> > actually isn't dirty). The dirty bit will be cleared when userspace
> > resets the ring buffer; then the instruction will be executed again and
> > mark the page dirty again. Since ring full is not a common condition,
> > it's not a big deal.
>
> Side topic, KVM_REQ_DIRTY_RING_FULL is misnamed, it's set when a ring goes
> above its soft limit, not when the ring is actually full. It took quite a
> bit of digging to figure out whether or not PML was broken...

Yeah, it's indeed a bit confusing.

Do you like KVM_REQ_DIRTY_RING_COLLECT, paired with
KVM_EXIT_DIRTY_RING_COLLECT? Or do you have other suggestions?

--
Peter Xu

2019-12-02 21:18:01

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On Mon, Dec 02, 2019 at 12:10:36PM -0800, Sean Christopherson wrote:
> On Fri, Nov 29, 2019 at 04:34:54PM -0500, Peter Xu wrote:
> > This patch is heavily based on previous work from Lei Cao
> > <[email protected]> and Paolo Bonzini <[email protected]>. [1]
> >
> > KVM currently uses large bitmaps to track dirty memory. These bitmaps
> > are copied to userspace when userspace queries KVM for its dirty page
> > information. The use of bitmaps is mostly sufficient for live
> > migration, as large parts of memory are be dirtied from one log-dirty
> > pass to another. However, in a checkpointing system, the number of
> > dirty pages is small and in fact it is often bounded---the VM is
> > paused when it has dirtied a pre-defined number of pages. Traversing a
> > large, sparsely populated bitmap to find set bits is time-consuming,
> > as is copying the bitmap to user-space.
> >
> > A similar issue will be there for live migration when the guest memory
> > is huge while the page dirty procedure is trivial. In that case for
> > each dirty sync we need to pull the whole dirty bitmap to userspace
> > and analyse every bit even if it's mostly zeros.
> >
> > The preferred data structure for above scenarios is a dense list of
> > guest frame numbers (GFN). This patch series stores the dirty list in
> > kernel memory that can be memory mapped into userspace to allow speedy
> > harvesting.
> >
> > We defined two new data structures:
> >
> > struct kvm_dirty_ring;
> > struct kvm_dirty_ring_indexes;
> >
> > Firstly, kvm_dirty_ring is defined to represent a ring of dirty
> > pages. When dirty tracking is enabled, we can push dirty gfn onto the
> > ring.
> >
> > Secondly, kvm_dirty_ring_indexes is defined to represent the
> > user/kernel interface of each ring. Currently it contains two
> > indexes: (1) avail_index represents where we should push our next
> > PFN (written by kernel), while (2) fetch_index represents where the
> > userspace should fetch the next dirty PFN (written by userspace).
> >
> > One complete ring is composed by one kvm_dirty_ring plus its
> > corresponding kvm_dirty_ring_indexes.
> >
> > Currently, we have N+1 rings for each VM of N vcpus:
> >
> > - for each vcpu, we have 1 per-vcpu dirty ring,
> > - for each vm, we have 1 per-vm dirty ring
>
> Why? I assume the purpose of per-vcpu rings is to avoid contention between
> threads, but the motiviation needs to be explicitly stated. And why is a
> per-vm fallback ring needed?

Yes, as explained in the previous reply, the problem is that there can
be guest memory writes without a vcpu context.

>
> If my assumption is correct, have other approaches been tried/profiled?
> E.g. using cmpxchg to reserve N number of entries in a shared ring.

Not yet, but I'd be fine to try anything if there are better
alternatives. Besides, could you help explain why sharing one ring and
letting each vcpu reserve a region in it would be helpful from a
performance point of view?

> IMO,
> adding kvm_get_running_vcpu() is a hack that is just asking for future
> abuse and the vcpu/vm/as_id interactions in mark_page_dirty_in_ring()
> look extremely fragile.

I agree. Another way is to put the heavier traffic onto the per-vm
ring, but the downside could be that the per-vm ring would fill up more
easily (I haven't tested this, though).

> I also dislike having two different mechanisms
> for accessing the ring (lock for per-vm, something else for per-vcpu).

Actually I proposed to drop the per-vm ring (I had a version that
implemented this, and only changed it back to the per-vm ring later on,
see below), and for the no-vcpu-context case I thought about:

(1) using the vcpu0 ring, or

(2) a better algorithm to pick a per-vcpu ring (e.g. the least full
ring; we could do many things here, e.g., easily maintain a structure
to track this so we get an O(1) search, I think).

I discussed this with Paolo, but I think Paolo preferred the per-vm
ring because there's no good reason to choose vcpu0 as (1) suggested.
While if we choose (2) we would probably need a lock even for the
per-vcpu rings, so it could be a bit slower.

Since this is still RFC, I think we still have chance to change this,
depending on how the discussion goes.

>
> > Please refer to the documentation update in this patch for more
> > details.
> >
> > Note that this patch implements the core logic of dirty ring buffer.
> > It's still disabled for all archs for now. Also, we'll address some
> > of the other issues in follow up patches before it's firstly enabled
> > on x86.
> >
> > [1] https://patchwork.kernel.org/patch/10471409/
> >
> > Signed-off-by: Lei Cao <[email protected]>
> > Signed-off-by: Paolo Bonzini <[email protected]>
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
>
> ...
>
> > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > new file mode 100644
> > index 000000000000..9264891f3c32
> > --- /dev/null
> > +++ b/virt/kvm/dirty_ring.c
> > @@ -0,0 +1,156 @@
> > +#include <linux/kvm_host.h>
> > +#include <linux/kvm.h>
> > +#include <linux/vmalloc.h>
> > +#include <linux/kvm_dirty_ring.h>
> > +
> > +u32 kvm_dirty_ring_get_rsvd_entries(void)
> > +{
> > + return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
> > +}
> > +
> > +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring)
> > +{
> > + u32 size = kvm->dirty_ring_size;
>
> Just pass in @size, that way you don't need @kvm. And the callers will be
> less ugly, e.g. the initial allocation won't need to speculatively set
> kvm->dirty_ring_size.

Sure.

>
> > +
> > + ring->dirty_gfns = vmalloc(size);
> > + if (!ring->dirty_gfns)
> > + return -ENOMEM;
> > + memset(ring->dirty_gfns, 0, size);
> > +
> > + ring->size = size / sizeof(struct kvm_dirty_gfn);
> > + ring->soft_limit =
> > + (kvm->dirty_ring_size / sizeof(struct kvm_dirty_gfn)) -
>
> And passing @size avoids issues like this where a local var is ignored.
>
> > + kvm_dirty_ring_get_rsvd_entries();
> > + ring->dirty_index = 0;
> > + ring->reset_index = 0;
> > + spin_lock_init(&ring->lock);
> > +
> > + return 0;
> > +}
> > +
>
> ...
>
> > +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
> > +{
> > + if (ring->dirty_gfns) {
>
> Why condition freeing the dirty ring on kvm->dirty_ring_size, this
> obviously protects itself. Not to mention vfree() also plays nice with a
> NULL input.

Ok I can drop this check.

>
> > + vfree(ring->dirty_gfns);
> > + ring->dirty_gfns = NULL;
> > + }
> > +}
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 681452d288cd..8642c977629b 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -64,6 +64,8 @@
> > #define CREATE_TRACE_POINTS
> > #include <trace/events/kvm.h>
> >
> > +#include <linux/kvm_dirty_ring.h>
> > +
> > /* Worst case buffer size needed for holding an integer. */
> > #define ITOA_MAX_LEN 12
> >
> > @@ -149,6 +151,10 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
> > struct kvm_vcpu *vcpu,
> > struct kvm_memory_slot *memslot,
> > gfn_t gfn);
> > +static void mark_page_dirty_in_ring(struct kvm *kvm,
> > + struct kvm_vcpu *vcpu,
> > + struct kvm_memory_slot *slot,
> > + gfn_t gfn);
> >
> > __visible bool kvm_rebooting;
> > EXPORT_SYMBOL_GPL(kvm_rebooting);
> > @@ -359,11 +365,22 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
> > vcpu->preempted = false;
> > vcpu->ready = false;
> >
> > + if (kvm->dirty_ring_size) {
> > + r = kvm_dirty_ring_alloc(vcpu->kvm, &vcpu->dirty_ring);
> > + if (r) {
> > + kvm->dirty_ring_size = 0;
> > + goto fail_free_run;
>
> This looks wrong, kvm->dirty_ring_size is used to free allocations, i.e.
> previous allocations will leak if a vcpu allocation fails.

You are right. That's overkill.

>
> > + }
> > + }
> > +
> > r = kvm_arch_vcpu_init(vcpu);
> > if (r < 0)
> > - goto fail_free_run;
> > + goto fail_free_ring;
> > return 0;
> >
> > +fail_free_ring:
> > + if (kvm->dirty_ring_size)
> > + kvm_dirty_ring_free(&vcpu->dirty_ring);
> > fail_free_run:
> > free_page((unsigned long)vcpu->run);
> > fail:
> > @@ -381,6 +398,8 @@ void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
> > put_pid(rcu_dereference_protected(vcpu->pid, 1));
> > kvm_arch_vcpu_uninit(vcpu);
> > free_page((unsigned long)vcpu->run);
> > + if (vcpu->kvm->dirty_ring_size)
> > + kvm_dirty_ring_free(&vcpu->dirty_ring);
> > }
> > EXPORT_SYMBOL_GPL(kvm_vcpu_uninit);
> >
> > @@ -690,6 +709,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
> > struct kvm *kvm = kvm_arch_alloc_vm();
> > int r = -ENOMEM;
> > int i;
> > + struct page *page;
> >
> > if (!kvm)
> > return ERR_PTR(-ENOMEM);
> > @@ -705,6 +725,14 @@ static struct kvm *kvm_create_vm(unsigned long type)
> >
> > BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
> >
> > + page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> > + if (!page) {
> > + r = -ENOMEM;
> > + goto out_err_alloc_page;
> > + }
> > + kvm->vm_run = page_address(page);
> > + BUILD_BUG_ON(sizeof(struct kvm_vm_run) > PAGE_SIZE);
> > +
> > if (init_srcu_struct(&kvm->srcu))
> > goto out_err_no_srcu;
> > if (init_srcu_struct(&kvm->irq_srcu))
> > @@ -775,6 +803,9 @@ static struct kvm *kvm_create_vm(unsigned long type)
> > out_err_no_irq_srcu:
> > cleanup_srcu_struct(&kvm->srcu);
> > out_err_no_srcu:
> > + free_page((unsigned long)page);
> > + kvm->vm_run = NULL;
>
> No need to nullify vm_run.

Ok.

>
> > +out_err_alloc_page:
> > kvm_arch_free_vm(kvm);
> > mmdrop(current->mm);
> > return ERR_PTR(r);
> > @@ -800,6 +831,15 @@ static void kvm_destroy_vm(struct kvm *kvm)
> > int i;
> > struct mm_struct *mm = kvm->mm;
> >
> > + if (kvm->dirty_ring_size) {
> > + kvm_dirty_ring_free(&kvm->vm_dirty_ring);
> > + }
>
> Unnecessary parantheses.

True.

Thanks,

>
> > +
> > + if (kvm->vm_run) {
> > + free_page((unsigned long)kvm->vm_run);
> > + kvm->vm_run = NULL;
> > + }
> > +
> > kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
> > kvm_destroy_vm_debugfs(kvm);
> > kvm_arch_sync_events(kvm);
> > @@ -2301,7 +2341,7 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
> > {
> > if (memslot && memslot->dirty_bitmap) {
> > unsigned long rel_gfn = gfn - memslot->base_gfn;
> > -
> > + mark_page_dirty_in_ring(kvm, vcpu, memslot, gfn);
> > set_bit_le(rel_gfn, memslot->dirty_bitmap);
> > }
> > }
> > @@ -2649,6 +2689,13 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
> > }
> > EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
>

--
Peter Xu

2019-12-02 21:52:18

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On Mon, Dec 02, 2019 at 04:16:40PM -0500, Peter Xu wrote:
> On Mon, Dec 02, 2019 at 12:10:36PM -0800, Sean Christopherson wrote:
> > On Fri, Nov 29, 2019 at 04:34:54PM -0500, Peter Xu wrote:
> > > Currently, we have N+1 rings for each VM of N vcpus:
> > >
> > > - for each vcpu, we have 1 per-vcpu dirty ring,
> > > - for each vm, we have 1 per-vm dirty ring
> >
> > Why? I assume the purpose of per-vcpu rings is to avoid contention between
> > threads, but the motiviation needs to be explicitly stated. And why is a
> > per-vm fallback ring needed?
>
> Yes, as explained in previous reply, the problem is there could have
> guest memory writes without vcpu contexts.
>
> >
> > If my assumption is correct, have other approaches been tried/profiled?
> > E.g. using cmpxchg to reserve N number of entries in a shared ring.
>
> Not yet, but I'd be fine to try anything if there's better
> alternatives. Besides, could you help explain why sharing one ring
> and let each vcpu to reserve a region in the ring could be helpful in
> the pov of performance?

The goal would be to avoid taking a lock, or at least to avoid holding a
lock for an extended duration, e.g. some sort of multi-step process where
entries in the ring are first reserved, then filled, and finally marked
valid. That'd allow the "fill" action to be done in parallel.
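
Roughly, such a multi-producer scheme on a single shared ring might look
like the following (sketch only; the structure, the valid flag and the
helper names are made up here, and ring-full handling is omitted):

struct shared_dirty_gfn {
	u32 flags;		/* bit 0 set => entry is published/valid */
	u32 slot;
	u64 offset;
};

struct shared_dirty_ring {
	u32 size;		/* number of entries, power of two */
	atomic_t reserve_index;	/* next slot handed out to a producer */
	struct shared_dirty_gfn *gfns;
};

static void shared_ring_push(struct shared_dirty_ring *ring, u32 slot, u64 offset)
{
	u32 old, new, index;

	/* Step 1: reserve a slot with cmpxchg, no lock held. */
	do {
		old = atomic_read(&ring->reserve_index);
		new = old + 1;
	} while (atomic_cmpxchg(&ring->reserve_index, old, new) != old);

	index = old & (ring->size - 1);

	/* Step 2: fill the reserved entry; producers run in parallel here. */
	ring->gfns[index].slot = slot;
	ring->gfns[index].offset = offset;

	/* Step 3: publish; the consumer only trusts entries marked valid. */
	smp_wmb();
	WRITE_ONCE(ring->gfns[index].flags, 1);
}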

In case it isn't clear, I haven't thought through an actual solution :-).

My point is that I think it's worth exploring and profiling other
implementations because the dual per-vm and per-vcpu rings has a few warts
that we'd be stuck with forever.

> > IMO,
> > adding kvm_get_running_vcpu() is a hack that is just asking for future
> > abuse and the vcpu/vm/as_id interactions in mark_page_dirty_in_ring()
> > look extremely fragile.
>
> I agree. Another way is to put heavier traffic to the per-vm ring,
> but the downside could be that the per-vm ring could get full easier
> (but I haven't tested).

There's nothing that prevents increasing the size of the common ring each
time a new vCPU is added. Alternatively, userspace could explicitly
request or hint the desired ring size.

> > I also dislike having two different mechanisms
> > for accessing the ring (lock for per-vm, something else for per-vcpu).
>
> Actually I proposed to drop the per-vm ring (actually I had a version
> that implemented this.. and I just changed it back to the per-vm ring
> later on, see below) and when there's no vcpu context I thought about:
>
> (1) use vcpu0 ring
>
> (2) or a better algo to pick up a per-vcpu ring (like, the less full
> ring, we can do many things here, e.g., we can easily maintain a
> structure track this so we can get O(1) search, I think)
>
> I discussed this with Paolo, but I think Paolo preferred the per-vm
> ring because there's no good reason to choose vcpu0 as what (1)
> suggested. While if to choose (2) we probably need to lock even for
> per-cpu ring, so could be a bit slower.

Ya, per-vm is definitely better than dumping on vcpu0. I'm hoping we can
find a third option that provides comparable performance without using any
per-vcpu rings.

2019-12-02 23:11:50

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On Mon, Dec 02, 2019 at 01:50:49PM -0800, Sean Christopherson wrote:
> On Mon, Dec 02, 2019 at 04:16:40PM -0500, Peter Xu wrote:
> > On Mon, Dec 02, 2019 at 12:10:36PM -0800, Sean Christopherson wrote:
> > > On Fri, Nov 29, 2019 at 04:34:54PM -0500, Peter Xu wrote:
> > > > Currently, we have N+1 rings for each VM of N vcpus:
> > > >
> > > > - for each vcpu, we have 1 per-vcpu dirty ring,
> > > > - for each vm, we have 1 per-vm dirty ring
> > >
> > > Why? I assume the purpose of per-vcpu rings is to avoid contention between
> > > threads, but the motiviation needs to be explicitly stated. And why is a
> > > per-vm fallback ring needed?
> >
> > Yes, as explained in previous reply, the problem is there could have
> > guest memory writes without vcpu contexts.
> >
> > >
> > > If my assumption is correct, have other approaches been tried/profiled?
> > > E.g. using cmpxchg to reserve N number of entries in a shared ring.
> >
> > Not yet, but I'd be fine to try anything if there's better
> > alternatives. Besides, could you help explain why sharing one ring
> > and let each vcpu to reserve a region in the ring could be helpful in
> > the pov of performance?
>
> The goal would be to avoid taking a lock, or at least to avoid holding a
> lock for an extended duration, e.g. some sort of multi-step process where
> entries in the ring are first reserved, then filled, and finally marked
> valid. That'd allow the "fill" action to be done in parallel.

Considering that a per-vcpu ring should be no worse than this, IIUC you
prefer a single per-vm ring here, without any per-vcpu rings. However I
don't see a good reason to manually split a per-vm resource into
per-vcpu regions instead of using the per-vcpu structure directly like
this series does... Or could you show me what I've missed?

IMHO it's a natural thought to use kvm_vcpu to split the ring, as long
as we still want the vcpus to work in parallel.

>
> In case it isn't clear, I haven't thought through an actual solution :-).

Feel free to shoot when the ideas come. :) I'd be glad to test your
idea, especially where it could be better!

>
> My point is that I think it's worth exploring and profiling other
> implementations because the dual per-vm and per-vcpu rings has a few warts
> that we'd be stuck with forever.

I do agree that keeping these two rings makes the interface a bit
awkward. Besides this, do you still have other concerns?

And regarding profiling, I hope I understand it right that it should be
about something unrelated to the specific issue we're discussing (i.e.
whether to use a per-vm ring, or per-vm + per-vcpu rings), because for
performance IMHO what matters more is the layout of the ring and how
the ring is shared and accessed between userspace and the kernel.

For the current implementation (I'm not sure whether it came from Lei's
initial version or from Paolo, anyway...), IMHO it's good enough from a
performance point of view in that it at least supports:

(1) zero copy,
(2) a completely asynchronous model,
(3) per-vcpu isolation.

None of these exists for KVM_GET_DIRTY_LOG. Not to mention that
tracking dirty bits is not really that "performance critical" - in QEMU
we have plenty of ways to explicitly throttle the CPU (like
cpu-throttle), precisely because dirtying pages, even with the whole
tracking overhead and even with KVM_GET_DIRTY_LOG, is already too fast,
and the slow part is QEMU collecting and sending the pages! :)

>
> > > IMO,
> > > adding kvm_get_running_vcpu() is a hack that is just asking for future
> > > abuse and the vcpu/vm/as_id interactions in mark_page_dirty_in_ring()
> > > look extremely fragile.
> >
> > I agree. Another way is to put heavier traffic to the per-vm ring,
> > but the downside could be that the per-vm ring could get full easier
> > (but I haven't tested).
>
> There's nothing that prevents increasing the size of the common ring each
> time a new vCPU is added. Alternatively, userspace could explicitly
> request or hint the desired ring size.

Yeah, I don't have a strong opinion on this, but I just don't see it as
greatly helpful to explicitly expose such an API to userspace. IMHO a
global ring size should be good enough for now. If userspace wants to
be fast, the ring can hardly get full (because collection of the dirty
ring can be really, really fast if userspace wants it to be).

>
> > > I also dislike having two different mechanisms
> > > for accessing the ring (lock for per-vm, something else for per-vcpu).
> >
> > Actually I proposed to drop the per-vm ring (actually I had a version
> > that implemented this.. and I just changed it back to the per-vm ring
> > later on, see below) and when there's no vcpu context I thought about:
> >
> > (1) use vcpu0 ring
> >
> > (2) or a better algo to pick up a per-vcpu ring (like, the less full
> > ring, we can do many things here, e.g., we can easily maintain a
> > structure track this so we can get O(1) search, I think)
> >
> > I discussed this with Paolo, but I think Paolo preferred the per-vm
> > ring because there's no good reason to choose vcpu0 as what (1)
> > suggested. While if to choose (2) we probably need to lock even for
> > per-cpu ring, so could be a bit slower.
>
> Ya, per-vm is definitely better than dumping on vcpu0. I'm hoping we can
> find a third option that provides comparable performance without using any
> per-vcpu rings.

I'm still uncertain whether it's a good idea to drop the per-vcpu rings
(as stated above), but I'm open to any further thoughts as long as I
can start to understand when a per-vm-only ring would be better.

Thanks!

--
Peter Xu

2019-12-03 13:49:10

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On 02/12/19 22:50, Sean Christopherson wrote:
>>
>> I discussed this with Paolo, but I think Paolo preferred the per-vm
>> ring because there's no good reason to choose vcpu0 as what (1)
>> suggested. While if to choose (2) we probably need to lock even for
>> per-cpu ring, so could be a bit slower.
> Ya, per-vm is definitely better than dumping on vcpu0. I'm hoping we can
> find a third option that provides comparable performance without using any
> per-vcpu rings.
>

The advantage of per-vCPU rings is that they naturally: 1) parallelize
the processing of dirty pages; 2) make the userspace vCPU thread do
more work for vCPUs that dirty more pages.

I agree that on the producer side we could reserve multiple entries in
the case of PML (and without PML only one entry should be added at a
time). But I'm afraid that things get ugly when the ring is full,
because you'd have to wait for all vCPUs to finish publishing the
entries they have reserved.

It's ugly that we _also_ need a per-VM ring, but unfortunately some
operations do not really have a vCPU that they can refer to.

Paolo

2019-12-03 14:01:01

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH RFC 00/15] KVM: Dirty ring interface

On 02/12/19 03:13, Peter Xu wrote:
>> This is not needed, it will just be a false negative (dirty page that
>> actually isn't dirty). The dirty bit will be cleared when userspace
>> resets the ring buffer; then the instruction will be executed again and
>> mark the page dirty again. Since ring full is not a common condition,
>> it's not a big deal.
>
> Actually I added this only because it failed one of the unit tests
> when verifying the dirty bits.. But now after a second thought, I
> probably agree with you that we can change the userspace too to fix
> this.

I think there is already a similar case in dirty_log_test when a page is
dirty but we called KVM_GET_DIRTY_LOG just before it got written to.

> I think the steps of the failed test case could be simplified into
> something like this (assuming the QEMU migration context, might be
> easier to understand):
>
> 1. page P has data P1
> 2. vcpu writes to page P, with date P2
> 3. vmexit (P is still with data P1)
> 4. mark P as dirty, ring full, user exit
> 5. collect dirty bit P, migrate P with data P1
> 6. vcpu run due to some reason, P was written with P2, user exit again
> (because ring is already reaching soft limit)
> 7. do KVM_RESET_DIRTY_RINGS

Migration should only be done after KVM_RESET_DIRTY_RINGS (think of
KVM_RESET_DIRTY_RINGS as the equivalent of KVM_CLEAR_DIRTY_LOG).

> dirty_log_test-29003 [001] 184503.384328: kvm_entry: vcpu 1
> dirty_log_test-29003 [001] 184503.384329: kvm_exit: reason EPT_VIOLATION rip 0x40359f info 582 0
> dirty_log_test-29003 [001] 184503.384329: kvm_page_fault: address 7fc036d000 error_code 582
> dirty_log_test-29003 [001] 184503.384331: kvm_entry: vcpu 1
> dirty_log_test-29003 [001] 184503.384332: kvm_exit: reason EPT_VIOLATION rip 0x40359f info 582 0
> dirty_log_test-29003 [001] 184503.384332: kvm_page_fault: address 7fc036d000 error_code 582
> dirty_log_test-29003 [001] 184503.384332: kvm_dirty_ring_push: ring 1: dirty 0x37f reset 0x1c0 slot 1 offset 0x37e ret 0 (used 447)
> dirty_log_test-29003 [001] 184503.384333: kvm_entry: vcpu 1
> dirty_log_test-29003 [001] 184503.384334: kvm_exit: reason EPT_VIOLATION rip 0x40359f info 582 0
> dirty_log_test-29003 [001] 184503.384334: kvm_page_fault: address 7fc036e000 error_code 582
> dirty_log_test-29003 [001] 184503.384336: kvm_entry: vcpu 1
> dirty_log_test-29003 [001] 184503.384336: kvm_exit: reason EPT_VIOLATION rip 0x40359f info 582 0
> dirty_log_test-29003 [001] 184503.384336: kvm_page_fault: address 7fc036e000 error_code 582
> dirty_log_test-29003 [001] 184503.384337: kvm_dirty_ring_push: ring 1: dirty 0x380 reset 0x1c0 slot 1 offset 0x37f ret 1 (used 448)
> dirty_log_test-29003 [001] 184503.384337: kvm_dirty_ring_exit: vcpu 1
> dirty_log_test-29003 [001] 184503.384338: kvm_fpu: unload
> dirty_log_test-29003 [001] 184503.384340: kvm_userspace_exit: reason 0x1d (29)
> dirty_log_test-29000 [006] 184503.505103: kvm_dirty_ring_reset: ring 1: dirty 0x380 reset 0x380 (used 0)
> dirty_log_test-29003 [001] 184503.505184: kvm_fpu: load
> dirty_log_test-29003 [001] 184503.505187: kvm_entry: vcpu 1
> dirty_log_test-29003 [001] 184503.505193: kvm_exit: reason EPT_VIOLATION rip 0x40359f info 582 0
> dirty_log_test-29003 [001] 184503.505194: kvm_page_fault: address 7fc036f000 error_code 582 <-------- [1]
> dirty_log_test-29003 [001] 184503.505206: kvm_entry: vcpu 1
> dirty_log_test-29003 [001] 184503.505207: kvm_exit: reason EPT_VIOLATION rip 0x40359f info 582 0
> dirty_log_test-29003 [001] 184503.505207: kvm_page_fault: address 7fc036f000 error_code 582
> dirty_log_test-29003 [001] 184503.505226: kvm_dirty_ring_push: ring 1: dirty 0x381 reset 0x380 slot 1 offset 0x380 ret 0 (used 1)
> dirty_log_test-29003 [001] 184503.505226: kvm_entry: vcpu 1
> dirty_log_test-29003 [001] 184503.505227: kvm_exit: reason EPT_VIOLATION rip 0x40359f info 582 0
> dirty_log_test-29003 [001] 184503.505228: kvm_page_fault: address 7fc0370000 error_code 582
> dirty_log_test-29003 [001] 184503.505231: kvm_entry: vcpu 1
> ...
>
> The test was trying to continuously write to pages, from above log
> starting from 7fc036d000. The reason 0x1d (29) is the new dirty ring
> full exit reason.
>
> So far I'm still unsure of two things:
>
> 1. Why for each page we faulted twice rather than once. Take the
> example of page at 7fc036e000 above, the first fault didn't
> trigger the marking dirty path, while only until the 2nd ept
> violation did we trigger kvm_dirty_ring_push.

Not sure about that. Try enabling kvmmmu tracepoints too, it will tell
you more of the path that was taken while processing the EPT violation.

If your machine has PML, what you're seeing is likely not-present
violation, not dirty-protect violation. Try disabling pml and see if
the trace changes.

> 2. Why we didn't get the last page written again after
> kvm_userspace_exit (last page was 7fc036e000, and the test failed
> because 7fc036e000 detected change however dirty bit unset). In
> this case the first write after KVM_RESET_DIRTY_RINGS is the line
> pointed by [1], I thought it should be a rewritten of page
> 7fc036e000 because when the user exit happens logically the write
> should not happen yet and eip should keep. However at [1] it's
> already writting to a new page.

IIUC you should get, with PML enabled:

- guest writes to page
- PML marks dirty bit, causes vmexit
- host copies PML log to ring, causes userspace exit
- userspace calls KVM_RESET_DIRTY_RINGS
- host marks page as clean
- userspace calls KVM_RUN
- guest writes again to page

but the page won't be in the ring until after another vmexit happens.
Therefore, it's okay to reap the pages in the ring asynchronously, but
there must be a synchronization point in the testcase sooner or later,
where all CPUs are kicked out of KVM_RUN. This synchronization point
corresponds to the migration downtime.

Thanks,

Paolo

2019-12-03 18:46:49

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On Tue, Dec 03, 2019 at 02:48:10PM +0100, Paolo Bonzini wrote:
> On 02/12/19 22:50, Sean Christopherson wrote:
> >>
> >> I discussed this with Paolo, but I think Paolo preferred the per-vm
> >> ring because there's no good reason to choose vcpu0 as what (1)
> >> suggested. While if to choose (2) we probably need to lock even for
> >> per-cpu ring, so could be a bit slower.
> > Ya, per-vm is definitely better than dumping on vcpu0. I'm hoping we can
> > find a third option that provides comparable performance without using any
> > per-vcpu rings.
> >
>
> The advantage of per-vCPU rings is that it naturally: 1) parallelizes
> the processing of dirty pages; 2) makes userspace vCPU thread do more
> work on vCPUs that dirty more pages.
>
> I agree that on the producer side we could reserve multiple entries in
> the case of PML (and without PML only one entry should be added at a
> time). But I'm afraid that things get ugly when the ring is full,
> because you'd have to wait for all vCPUs to finish publishing the
> entries they have reserved.

Ah, I take it the intended model is that userspace will only start pulling
entries off the ring when KVM explicitly signals that the ring is "full"?

Rather than reserve entries, what if vCPUs reserved an entire ring? Create
a pool of N=nr_vcpus rings that are shared by all vCPUs. To mark pages
dirty, a vCPU claims a ring, pushes the pages into the ring, and then
returns the ring to the pool. If pushing pages hits the soft limit, a
request is made to drain the ring and the ring is not returned to the pool
until it is drained.

Except for acquiring a ring, which likely can be heavily optimized, that'd
allow parallel processing (#1), and would provide a facsimile of #2 as
pushing more pages onto a ring would naturally increase the likelihood of
triggering a drain. And it might be interesting to see the effect of using
different methods of ring selection, e.g. pure round robin, LRU, last used
on the current vCPU, etc...
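
Very roughly, the claim/return part of that idea could look like this
(sketch only; all names here are illustrative and the drain logic is
left out):

struct dirty_ring_pool {
	spinlock_t lock;
	unsigned long free_bitmap;	 /* bit set => ring is available */
	struct kvm_dirty_ring rings[BITS_PER_LONG];
};

static struct kvm_dirty_ring *claim_ring(struct dirty_ring_pool *pool)
{
	struct kvm_dirty_ring *ring = NULL;
	int idx;

	spin_lock(&pool->lock);
	idx = find_first_bit(&pool->free_bitmap, BITS_PER_LONG);
	if (idx < BITS_PER_LONG) {
		clear_bit(idx, &pool->free_bitmap);
		ring = &pool->rings[idx];
	}
	spin_unlock(&pool->lock);
	return ring;	/* NULL means every ring is waiting to be drained */
}

static void return_ring(struct dirty_ring_pool *pool, struct kvm_dirty_ring *ring)
{
	/*
	 * In the full idea, a ring that hit the soft limit would not be
	 * returned here until userspace has drained it.
	 */
	spin_lock(&pool->lock);
	set_bit(ring - pool->rings, &pool->free_bitmap);
	spin_unlock(&pool->lock);
}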

> It's ugly that we _also_ need a per-VM ring, but unfortunately some
> operations do not really have a vCPU that they can refer to.

2019-12-03 19:15:32

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On Fri, Nov 29, 2019 at 04:34:54PM -0500, Peter Xu wrote:
> +static void mark_page_dirty_in_ring(struct kvm *kvm,
> + struct kvm_vcpu *vcpu,
> + struct kvm_memory_slot *slot,
> + gfn_t gfn)
> +{
> + u32 as_id = 0;

Redundant initialization of as_id.

> + u64 offset;
> + int ret;
> + struct kvm_dirty_ring *ring;
> + struct kvm_dirty_ring_indexes *indexes;
> + bool is_vm_ring;
> +
> + if (!kvm->dirty_ring_size)
> + return;
> +
> + offset = gfn - slot->base_gfn;
> +
> + if (vcpu) {
> + as_id = kvm_arch_vcpu_memslots_id(vcpu);
> + } else {
> + as_id = 0;

The setting of as_id is wrong, both with and without a vCPU. as_id should
come from slot->as_id. It may not be actually broken in the current code
base, but at best it's fragile, e.g. Ben's TDP MMU rewrite[*] adds a call
to mark_page_dirty_in_slot() with a potentially non-zero as_id.

[*] https://lkml.kernel.org/r/[email protected]

> + vcpu = kvm_get_running_vcpu();
> + }
> +
> + if (vcpu) {
> + ring = &vcpu->dirty_ring;
> + indexes = &vcpu->run->vcpu_ring_indexes;
> + is_vm_ring = false;
> + } else {
> + /*
> + * Put onto per vm ring because no vcpu context. Kick
> + * vcpu0 if ring is full.
> + */
> + vcpu = kvm->vcpus[0];

Is this a rare event?

> + ring = &kvm->vm_dirty_ring;
> + indexes = &kvm->vm_run->vm_ring_indexes;
> + is_vm_ring = true;
> + }
> +
> + ret = kvm_dirty_ring_push(ring, indexes,
> + (as_id << 16)|slot->id, offset,
> + is_vm_ring);
> + if (ret < 0) {
> + if (is_vm_ring)
> + pr_warn_once("vcpu %d dirty log overflow\n",
> + vcpu->vcpu_id);
> + else
> + pr_warn_once("per-vm dirty log overflow\n");
> + return;
> + }
> +
> + if (ret)
> + kvm_make_request(KVM_REQ_DIRTY_RING_FULL, vcpu);
> +}

2019-12-04 10:06:57

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On 03/12/19 19:46, Sean Christopherson wrote:
> On Tue, Dec 03, 2019 at 02:48:10PM +0100, Paolo Bonzini wrote:
>> On 02/12/19 22:50, Sean Christopherson wrote:
>>>>
>>>> I discussed this with Paolo, but I think Paolo preferred the per-vm
>>>> ring because there's no good reason to choose vcpu0 as what (1)
>>>> suggested. While if to choose (2) we probably need to lock even for
>>>> per-cpu ring, so could be a bit slower.
>>> Ya, per-vm is definitely better than dumping on vcpu0. I'm hoping we can
>>> find a third option that provides comparable performance without using any
>>> per-vcpu rings.
>>>
>>
>> The advantage of per-vCPU rings is that it naturally: 1) parallelizes
>> the processing of dirty pages; 2) makes userspace vCPU thread do more
>> work on vCPUs that dirty more pages.
>>
>> I agree that on the producer side we could reserve multiple entries in
>> the case of PML (and without PML only one entry should be added at a
>> time). But I'm afraid that things get ugly when the ring is full,
>> because you'd have to wait for all vCPUs to finish publishing the
>> entries they have reserved.
>
> Ah, I take it the intended model is that userspace will only start pulling
> entries off the ring when KVM explicitly signals that the ring is "full"?

No, it's not. But perhaps in the asynchronous case you can delay
pushing the reserved entries to the consumer until a moment where no
CPUs have left empty slots in the ring buffer (somebody must have done
multi-producer ring buffers before). In the ring-full case that is
harder because it requires synchronization.

> Rather than reserve entries, what if vCPUs reserved an entire ring? Create
> a pool of N=nr_vcpus rings that are shared by all vCPUs. To mark pages
> dirty, a vCPU claims a ring, pushes the pages into the ring, and then
> returns the ring to the pool. If pushing pages hits the soft limit, a
> request is made to drain the ring and the ring is not returned to the pool
> until it is drained.
>
> Except for acquiring a ring, which likely can be heavily optimized, that'd
> allow parallel processing (#1), and would provide a facsimile of #2 as
> pushing more pages onto a ring would naturally increase the likelihood of
> triggering a drain. And it might be interesting to see the effect of using
> different methods of ring selection, e.g. pure round robin, LRU, last used
> on the current vCPU, etc...

If you are creating nr_vcpus rings, and draining is done on the vCPU
thread that has filled the ring, why not create nr_vcpus+1? The current
code then is exactly the same as pre-claiming a ring per vCPU and never
releasing it, and using a spinlock to claim the per-VM ring.

However, we could build on top of my other suggestion to add
slot->as_id, and wrap kvm_get_running_vcpu() with a nice API, mimicking
exactly what you've suggested. Maybe even add a scary comment around
kvm_get_running_vcpu() suggesting that users only do so to avoid locking
and wrap it with a nice API. Similar to what get_cpu/put_cpu do with
smp_processor_id.

1) Add a pointer from struct kvm_dirty_ring to struct
kvm_dirty_ring_indexes:

	vcpu->dirty_ring->data = &vcpu->run->vcpu_ring_indexes;
	kvm->vm_dirty_ring->data = &kvm->vm_run->vm_ring_indexes;

2) push the ring choice and locking to two new functions

struct kvm_ring *kvm_get_dirty_ring(struct kvm *kvm)
{
	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

	if (vcpu && !WARN_ON_ONCE(vcpu->kvm != kvm)) {
		return &vcpu->dirty_ring;
	} else {
		/*
		 * Put onto per vm ring because no vcpu context.
		 * We'll kick vcpu0 if ring is full.
		 */
		spin_lock(&kvm->vm_dirty_ring->lock);
		return &kvm->vm_dirty_ring;
	}
}

void kvm_put_dirty_ring(struct kvm *kvm,
			struct kvm_dirty_ring *ring)
{
	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
	bool full = kvm_dirty_ring_used(ring) >= ring->soft_limit;

	if (ring == &kvm->vm_dirty_ring) {
		if (vcpu == NULL)
			vcpu = kvm->vcpus[0];
		spin_unlock(&kvm->vm_dirty_ring->lock);
	}

	if (full)
		kvm_make_request(KVM_REQ_DIRTY_RING_FULL, vcpu);
}

3) simplify kvm_dirty_ring_push to

void kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
			 u32 slot, u64 offset)
{
	/* left as an exercise to the reader */
}

and mark_page_dirty_in_ring to

static void mark_page_dirty_in_ring(struct kvm *kvm,
				    struct kvm_memory_slot *slot,
				    gfn_t gfn)
{
	struct kvm_dirty_ring *ring;

	if (!kvm->dirty_ring_size)
		return;

	ring = kvm_get_dirty_ring(kvm);
	kvm_dirty_ring_push(ring, (slot->as_id << 16) | slot->id,
			    gfn - slot->base_gfn);
	kvm_put_dirty_ring(kvm, ring);
}
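
For completeness, one possible body for the "exercise" above, following
the push logic of the original patch but using the ring->data pointer
from step 1 (sketch only; the ring-full and soft-limit handling is
assumed to live in kvm_get/put_dirty_ring()):

void kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
			 u32 slot, u64 offset)
{
	struct kvm_dirty_gfn *entry;

	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
	entry->slot = slot;
	entry->offset = offset;
	/* Make the entry visible before publishing the new avail_index. */
	smp_wmb();
	ring->dirty_index++;
	WRITE_ONCE(ring->data->avail_index, ring->dirty_index);
}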

Paolo

>> It's ugly that we _also_ need a per-VM ring, but unfortunately some
>> operations do not really have a vCPU that they can refer to.
>

2019-12-04 10:16:37

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On 03/12/19 20:13, Sean Christopherson wrote:
> The setting of as_id is wrong, both with and without a vCPU. as_id should
> come from slot->as_id.

Which doesn't exist, but is an excellent suggestion nevertheless.

>> + /*
>> + * Put onto per vm ring because no vcpu context. Kick
>> + * vcpu0 if ring is full.
>> + */
>> + vcpu = kvm->vcpus[0];
>
> Is this a rare event?

Yes, every time a vCPU exit happens, the vCPU is supposed to reap the VM
ring as well. (Most of the time it will be empty, and while the reaping
of VM ring entries needs locking, the emptiness check doesn't).
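
In userspace that could look roughly like the following (sketch only;
the struct mirrors the two-index layout of kvm_dirty_ring_indexes from
the series, while vm_ring_lock and harvest_entries() are illustrative
names):

#include <pthread.h>
#include <stdint.h>

struct ring_indexes {
	volatile uint32_t avail_index;	/* written by the kernel */
	uint32_t fetch_index;		/* written by userspace */
};

extern pthread_mutex_t vm_ring_lock;			/* illustrative */
extern void harvest_entries(struct ring_indexes *);	/* illustrative */

static void maybe_reap_vm_ring(struct ring_indexes *vm)
{
	/* Lockless emptiness check; most vCPU exits see an empty VM ring. */
	if (vm->avail_index == vm->fetch_index)
		return;

	pthread_mutex_lock(&vm_ring_lock);
	harvest_entries(vm);
	pthread_mutex_unlock(&vm_ring_lock);
}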

Paolo

>> + ring = &kvm->vm_dirty_ring;
>> + indexes = &kvm->vm_run->vm_ring_indexes;
>> + is_vm_ring = true;
>> + }
>> +
>> + ret = kvm_dirty_ring_push(ring, indexes,
>> + (as_id << 16)|slot->id, offset,
>> + is_vm_ring);
>> + if (ret < 0) {
>> + if (is_vm_ring)
>> + pr_warn_once("vcpu %d dirty log overflow\n",
>> + vcpu->vcpu_id);
>> + else
>> + pr_warn_once("per-vm dirty log overflow\n");
>> + return;
>> + }
>> +
>> + if (ret)
>> + kvm_make_request(KVM_REQ_DIRTY_RING_FULL, vcpu);
>> +}
>

2019-12-04 10:39:42

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking


On 2019/11/30 5:34 AM, Peter Xu wrote:
> +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> + struct kvm_dirty_ring_indexes *indexes,
> + u32 slot, u64 offset, bool lock)
> +{
> + int ret;
> + struct kvm_dirty_gfn *entry;
> +
> + if (lock)
> + spin_lock(&ring->lock);
> +
> + if (kvm_dirty_ring_full(ring)) {
> + ret = -EBUSY;
> + goto out;
> + }
> +
> + entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> + entry->slot = slot;
> + entry->offset = offset;


Haven't gone through the whole series, sorry if it is a silly question,
but I wonder whether things like this will suffer from a similar issue
on virtually tagged archs, as mentioned in [1].

Would it be better to allocate the ring from userspace and hand it to
KVM instead? Then we could use the copy_to/from_user() friends (which
are a little bit slow on recent CPUs, though).

[1] https://lkml.org/lkml/2019/4/9/5

Thanks


> + smp_wmb();
> + ring->dirty_index++;
> + WRITE_ONCE(indexes->avail_index, ring->dirty_index);
> + ret = kvm_dirty_ring_used(ring) >= ring->soft_limit;
> + pr_info("%s: slot %u offset %llu used %u\n",
> + __func__, slot, offset, kvm_dirty_ring_used(ring));
> +
> +out:

2019-12-04 10:41:04

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH RFC 00/15] KVM: Dirty ring interface


On 2019/11/30 5:34 AM, Peter Xu wrote:
> Branch is here:https://github.com/xzpeter/linux/tree/kvm-dirty-ring
>
> Overview
> ============
>
> This is a continued work from Lei Cao<[email protected]> and Paolo
> on the KVM dirty ring interface. To make it simple, I'll still start
> with version 1 as RFC.
>
> The new dirty ring interface is another way to collect dirty pages for
> the virtual machine, but it is different from the existing dirty
> logging interface in a few ways, majorly:
>
> - Data format: The dirty data was in a ring format rather than a
> bitmap format, so the size of data to sync for dirty logging does
> not depend on the size of guest memory any more, but speed of
> dirtying. Also, the dirty ring is per-vcpu (currently plus
> another per-vm ring, so total ring number is N+1), while the dirty
> bitmap is per-vm.
>
> - Data copy: The sync of dirty pages does not need data copy any more,
> but instead the ring is shared between the userspace and kernel by
> page sharings (mmap() on either the vm fd or vcpu fd)
>
> - Interface: Instead of using the old KVM_GET_DIRTY_LOG,
> KVM_CLEAR_DIRTY_LOG interfaces, the new ring uses a new interface
> called KVM_RESET_DIRTY_RINGS when we want to reset the collected
> dirty pages to protected mode again (works like
> KVM_CLEAR_DIRTY_LOG, but ring based)
>
> And more.


Looks really interesting. I wonder if we can turn this into a library
so we can reuse it for vhost.

Thanks

2019-12-04 11:06:06

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On 04/12/19 11:38, Jason Wang wrote:
>>
>> +    entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
>> +    entry->slot = slot;
>> +    entry->offset = offset;
>
>
> Haven't gone through the whole series, sorry if it was a silly question
> but I wonder things like this will suffer from similar issue on
> virtually tagged archs as mentioned in [1].

There is no new infrastructure to track the dirty pages---it's just a
different way to pass them to userspace.

> Is this better to allocate the ring from userspace and set to KVM
> instead? Then we can use copy_to/from_user() friends (a little bit slow
> on recent CPUs).

Yeah, I don't think that would be better than mmap.

Paolo


> [1] https://lkml.org/lkml/2019/4/9/5

2019-12-04 14:34:08

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On Wed, Dec 04, 2019 at 11:14:19AM +0100, Paolo Bonzini wrote:
> On 03/12/19 20:13, Sean Christopherson wrote:
> > The setting of as_id is wrong, both with and without a vCPU. as_id should
> > come from slot->as_id.
>
> Which doesn't exist, but is an excellent suggestion nevertheless.

Huh, I explicitly looked at the code to make sure as_id existed before
making this suggestion. No idea what code I actually pulled up.

2019-12-04 19:40:04

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH RFC 00/15] KVM: Dirty ring interface

On Wed, Dec 04, 2019 at 06:39:48PM +0800, Jason Wang wrote:
>
> On 2019/11/30 上午5:34, Peter Xu wrote:
> > Branch is here:https://github.com/xzpeter/linux/tree/kvm-dirty-ring
> >
> > Overview
> > ============
> >
> > This is a continued work from Lei Cao<[email protected]> and Paolo
> > on the KVM dirty ring interface. To make it simple, I'll still start
> > with version 1 as RFC.
> >
> > The new dirty ring interface is another way to collect dirty pages for
> > the virtual machine, but it is different from the existing dirty
> > logging interface in a few ways, majorly:
> >
> > - Data format: The dirty data was in a ring format rather than a
> > bitmap format, so the size of data to sync for dirty logging does
> > not depend on the size of guest memory any more, but speed of
> > dirtying. Also, the dirty ring is per-vcpu (currently plus
> > another per-vm ring, so total ring number is N+1), while the dirty
> > bitmap is per-vm.
> >
> > - Data copy: The sync of dirty pages does not need data copy any more,
> > but instead the ring is shared between the userspace and kernel by
> > page sharings (mmap() on either the vm fd or vcpu fd)
> >
> > - Interface: Instead of using the old KVM_GET_DIRTY_LOG,
> > KVM_CLEAR_DIRTY_LOG interfaces, the new ring uses a new interface
> > called KVM_RESET_DIRTY_RINGS when we want to reset the collected
> > dirty pages to protected mode again (works like
> > KVM_CLEAR_DIRTY_LOG, but ring based)
> >
> > And more.
>
>
> Looks really interesting, I wonder if we can make this as a library then we
> can reuse it for vhost.

So IIUC this ring is mainly for (1) data exchange between kernel and
user, and (2) shared memory. From that point of view, yes, it should
work for vhost too.

It shouldn't be hard to refactor the interfaces to drop the
kvm-specific elements, however I'm not sure how best to do that. Maybe
do it like irqbypass and put it into virt/lib/ as a standalone module?
Would it be worth it?

Paolo, what's your take?

--
Peter Xu

2019-12-04 19:53:26

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On Wed, Dec 04, 2019 at 12:04:53PM +0100, Paolo Bonzini wrote:
> On 04/12/19 11:38, Jason Wang wrote:
> >>
> >> +    entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> >> +    entry->slot = slot;
> >> +    entry->offset = offset;
> >
> >
> > Haven't gone through the whole series, sorry if it was a silly question
> > but I wonder things like this will suffer from similar issue on
> > virtually tagged archs as mentioned in [1].
>
> There is no new infrastructure to track the dirty pages---it's just a
> different way to pass them to userspace.
>
> > Is this better to allocate the ring from userspace and set to KVM
> > instead? Then we can use copy_to/from_user() friends (a little bit slow
> > on recent CPUs).
>
> Yeah, I don't think that would be better than mmap.

Yeah, I agree, because I don't see how copy_to/from_user() would help
with icache/dcache flushing...

Some context here: Jason first raised this question offlist, on whether
we also need the flush_dcache_page() helpers for operations like kvm
dirty ring accesses. I feel like we should, however I've got two other
questions:

- if we need to do flush_dcache_page() on kernel-modified pages
(assuming the same page is also mapped into userspace), then why don't
we need flush_cache_page() on the page too, on those archs where
flush_cache_page() is not a nop?

- assuming an arch has a non-nop implementation of
flush_[d]cache_page(), would atomic operations like cmpxchg really work
there (assuming that instructions like cmpxchg depend on cache
consistency)?

Sorry, these are admittedly a bit off topic for the kvm dirty ring
patchset, but since we're at it, I'm raising the questions in case
there are answers..

Thanks,

--
Peter Xu

2019-12-05 06:51:29

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH RFC 00/15] KVM: Dirty ring interface


On 2019/12/5 3:33 AM, Peter Xu wrote:
> On Wed, Dec 04, 2019 at 06:39:48PM +0800, Jason Wang wrote:
>> On 2019/11/30 上午5:34, Peter Xu wrote:
>>> Branch is here:https://github.com/xzpeter/linux/tree/kvm-dirty-ring
>>>
>>> Overview
>>> ============
>>>
>>> This is a continued work from Lei Cao<[email protected]> and Paolo
>>> on the KVM dirty ring interface. To make it simple, I'll still start
>>> with version 1 as RFC.
>>>
>>> The new dirty ring interface is another way to collect dirty pages for
>>> the virtual machine, but it is different from the existing dirty
>>> logging interface in a few ways, majorly:
>>>
>>> - Data format: The dirty data was in a ring format rather than a
>>> bitmap format, so the size of data to sync for dirty logging does
>>> not depend on the size of guest memory any more, but speed of
>>> dirtying. Also, the dirty ring is per-vcpu (currently plus
>>> another per-vm ring, so total ring number is N+1), while the dirty
>>> bitmap is per-vm.
>>>
>>> - Data copy: The sync of dirty pages does not need data copy any more,
>>> but instead the ring is shared between the userspace and kernel by
>>> page sharings (mmap() on either the vm fd or vcpu fd)
>>>
>>> - Interface: Instead of using the old KVM_GET_DIRTY_LOG,
>>> KVM_CLEAR_DIRTY_LOG interfaces, the new ring uses a new interface
>>> called KVM_RESET_DIRTY_RINGS when we want to reset the collected
>>> dirty pages to protected mode again (works like
>>> KVM_CLEAR_DIRTY_LOG, but ring based)
>>>
>>> And more.
>>
>> Looks really interesting, I wonder if we can make this as a library then we
>> can reuse it for vhost.
> So iiuc this ring will majorly for (1) data exchange between kernel
> and user, and (2) shared memory. I think from that pov yeh it should
> work even for vhost.
>
> It shouldn't be hard to refactor the interfaces to avoid kvm elements,
> however I'm not sure how to do that best. Maybe like irqbypass and
> put it into virt/lib/ as a standlone module? Would it worth it?


Maybe, and it looks to me like a dirty page reporting API for VFIO is
being proposed at the same time. It would be helpful to unify them (or
at least leave a chance for other users).

Thanks


>
> Paolo, what's your take?
>

2019-12-05 06:52:12

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking


On 2019/12/5 3:52 AM, Peter Xu wrote:
> On Wed, Dec 04, 2019 at 12:04:53PM +0100, Paolo Bonzini wrote:
>> On 04/12/19 11:38, Jason Wang wrote:
>>>> +    entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
>>>> +    entry->slot = slot;
>>>> +    entry->offset = offset;
>>>
>>> Haven't gone through the whole series, sorry if it was a silly question
>>> but I wonder things like this will suffer from similar issue on
>>> virtually tagged archs as mentioned in [1].
>> There is no new infrastructure to track the dirty pages---it's just a
>> different way to pass them to userspace.
>>
>>> Is this better to allocate the ring from userspace and set to KVM
>>> instead? Then we can use copy_to/from_user() friends (a little bit slow
>>> on recent CPUs).
>> Yeah, I don't think that would be better than mmap.
> Yeah I agree, because I didn't see how copy_to/from_user() helped to
> do icache/dcache flushings...


It looks to me like one advantage is that exactly the same VA would be
used by both userspace and the kernel, so there would be no aliasing.

Thanks


>
> Some context here: Jason raised this question offlist first on whether
> we should also need these flush_dcache_cache() helpers for operations
> like kvm dirty ring accesses. I feel like it should, however I've got
> two other questions, on:
>
> - if we need to do flush_dcache_page() on kernel modified pages
> (assuming the same page has mapped to userspace), then why don't
> we need flush_cache_page() too on the page, where
> flush_cache_page() is defined not-a-nop on those archs?
>
> - assuming an arch has not-a-nop impl for flush_[d]cache_page(),
> would atomic operations like cmpxchg really work for them
> (assuming that ISAs like cmpxchg should depend on cache
> consistency).
>
> Sorry I think these are for sure a bit out of topic for kvm dirty ring
> patchset, but since we're at it, I'm raising the questions up in case
> there're answers..
>
> Thanks,
>

2019-12-05 12:09:26

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On Thu, Dec 05, 2019 at 02:51:15PM +0800, Jason Wang wrote:
>
> On 2019/12/5 上午3:52, Peter Xu wrote:
> > On Wed, Dec 04, 2019 at 12:04:53PM +0100, Paolo Bonzini wrote:
> > > On 04/12/19 11:38, Jason Wang wrote:
> > > > > +    entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> > > > > +    entry->slot = slot;
> > > > > +    entry->offset = offset;
> > > >
> > > > Haven't gone through the whole series, sorry if it was a silly question
> > > > but I wonder things like this will suffer from similar issue on
> > > > virtually tagged archs as mentioned in [1].
> > > There is no new infrastructure to track the dirty pages---it's just a
> > > different way to pass them to userspace.
> > >
> > > > Is this better to allocate the ring from userspace and set to KVM
> > > > instead? Then we can use copy_to/from_user() friends (a little bit slow
> > > > on recent CPUs).
> > > Yeah, I don't think that would be better than mmap.
> > Yeah I agree, because I didn't see how copy_to/from_user() helped to
> > do icache/dcache flushings...
>
>
> It looks to me that one advantage is that exactly the same VA is used
> by both userspace and kernel, so there will be no aliasing.

Hmm.. but what if the page is mapped more than once in userspace? Thanks,

--
Peter Xu

2019-12-05 13:14:11

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking


On 2019/12/5 下午8:08, Peter Xu wrote:
> On Thu, Dec 05, 2019 at 02:51:15PM +0800, Jason Wang wrote:
>> On 2019/12/5 上午3:52, Peter Xu wrote:
>>> On Wed, Dec 04, 2019 at 12:04:53PM +0100, Paolo Bonzini wrote:
>>>> On 04/12/19 11:38, Jason Wang wrote:
>>>>>> +    entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
>>>>>> +    entry->slot = slot;
>>>>>> +    entry->offset = offset;
>>>>> Haven't gone through the whole series, sorry if it was a silly question
>>>>> but I wonder things like this will suffer from similar issue on
>>>>> virtually tagged archs as mentioned in [1].
>>>> There is no new infrastructure to track the dirty pages---it's just a
>>>> different way to pass them to userspace.
>>>>
>>>>> Is this better to allocate the ring from userspace and set to KVM
>>>>> instead? Then we can use copy_to/from_user() friends (a little bit slow
>>>>> on recent CPUs).
>>>> Yeah, I don't think that would be better than mmap.
>>> Yeah I agree, because I didn't see how copy_to/from_user() helped to
>>> do icache/dcache flushings...
>>
>> It looks to me that one advantage is that exactly the same VA is used
>> by both userspace and kernel, so there will be no aliasing.
> Hmm.. but what if the page is mapped more than once in user? Thanks,


Then it's the responsibility of the userspace program to do the flush, I think.

Thanks

>

2019-12-05 19:33:46

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH RFC 00/15] KVM: Dirty ring interface

On Tue, Dec 03, 2019 at 02:59:14PM +0100, Paolo Bonzini wrote:
> On 02/12/19 03:13, Peter Xu wrote:
> >> This is not needed, it will just be a false negative (dirty page that
> >> actually isn't dirty). The dirty bit will be cleared when userspace
> >> resets the ring buffer; then the instruction will be executed again and
> >> mark the page dirty again. Since ring full is not a common condition,
> >> it's not a big deal.
> >
> > Actually I added this only because it failed one of the unit tests
> > when verifying the dirty bits.. But now after a second thought, I
> > probably agree with you that we can change the userspace too to fix
> > this.
>
> I think there is already a similar case in dirty_log_test when a page is
> dirty but we called KVM_GET_DIRTY_LOG just before it got written to.

If you mean the host_bmap_track (in dirty_log_test.c), that should be
a reversed version of this race (that's where the data is written,
while we didn't see the dirty bit set). But yes I think I can
probably use the same bitmap to fix the test case, because in both
cases what we want to do is to make sure "the dirty bit of this page
should be set in the next round".

>
> > I think the steps of the failed test case could be simplified into
> > something like this (assuming the QEMU migration context, might be
> > easier to understand):
> >
> > 1. page P has data P1
> > 2. vcpu writes to page P, with date P2
> > 3. vmexit (P is still with data P1)
> > 4. mark P as dirty, ring full, user exit
> > 5. collect dirty bit P, migrate P with data P1
> > 6. vcpu run due to some reason, P was written with P2, user exit again
> > (because ring is already reaching soft limit)
> > 7. do KVM_RESET_DIRTY_RINGS
>
> Migration should only be done after KVM_RESET_DIRTY_RINGS (think of
> KVM_RESET_DIRTY_RINGS as the equivalent of KVM_CLEAR_DIRTY_LOG).

Totally agree for migration. It's probably just that the test case
needs fixing.

>
> > dirty_log_test-29003 [001] 184503.384328: kvm_entry: vcpu 1
> > dirty_log_test-29003 [001] 184503.384329: kvm_exit: reason EPT_VIOLATION rip 0x40359f info 582 0
> > dirty_log_test-29003 [001] 184503.384329: kvm_page_fault: address 7fc036d000 error_code 582
> > dirty_log_test-29003 [001] 184503.384331: kvm_entry: vcpu 1
> > dirty_log_test-29003 [001] 184503.384332: kvm_exit: reason EPT_VIOLATION rip 0x40359f info 582 0
> > dirty_log_test-29003 [001] 184503.384332: kvm_page_fault: address 7fc036d000 error_code 582
> > dirty_log_test-29003 [001] 184503.384332: kvm_dirty_ring_push: ring 1: dirty 0x37f reset 0x1c0 slot 1 offset 0x37e ret 0 (used 447)
> > dirty_log_test-29003 [001] 184503.384333: kvm_entry: vcpu 1
> > dirty_log_test-29003 [001] 184503.384334: kvm_exit: reason EPT_VIOLATION rip 0x40359f info 582 0
> > dirty_log_test-29003 [001] 184503.384334: kvm_page_fault: address 7fc036e000 error_code 582
> > dirty_log_test-29003 [001] 184503.384336: kvm_entry: vcpu 1
> > dirty_log_test-29003 [001] 184503.384336: kvm_exit: reason EPT_VIOLATION rip 0x40359f info 582 0
> > dirty_log_test-29003 [001] 184503.384336: kvm_page_fault: address 7fc036e000 error_code 582
> > dirty_log_test-29003 [001] 184503.384337: kvm_dirty_ring_push: ring 1: dirty 0x380 reset 0x1c0 slot 1 offset 0x37f ret 1 (used 448)
> > dirty_log_test-29003 [001] 184503.384337: kvm_dirty_ring_exit: vcpu 1
> > dirty_log_test-29003 [001] 184503.384338: kvm_fpu: unload
> > dirty_log_test-29003 [001] 184503.384340: kvm_userspace_exit: reason 0x1d (29)
> > dirty_log_test-29000 [006] 184503.505103: kvm_dirty_ring_reset: ring 1: dirty 0x380 reset 0x380 (used 0)
> > dirty_log_test-29003 [001] 184503.505184: kvm_fpu: load
> > dirty_log_test-29003 [001] 184503.505187: kvm_entry: vcpu 1
> > dirty_log_test-29003 [001] 184503.505193: kvm_exit: reason EPT_VIOLATION rip 0x40359f info 582 0
> > dirty_log_test-29003 [001] 184503.505194: kvm_page_fault: address 7fc036f000 error_code 582 <-------- [1]
> > dirty_log_test-29003 [001] 184503.505206: kvm_entry: vcpu 1
> > dirty_log_test-29003 [001] 184503.505207: kvm_exit: reason EPT_VIOLATION rip 0x40359f info 582 0
> > dirty_log_test-29003 [001] 184503.505207: kvm_page_fault: address 7fc036f000 error_code 582
> > dirty_log_test-29003 [001] 184503.505226: kvm_dirty_ring_push: ring 1: dirty 0x381 reset 0x380 slot 1 offset 0x380 ret 0 (used 1)
> > dirty_log_test-29003 [001] 184503.505226: kvm_entry: vcpu 1
> > dirty_log_test-29003 [001] 184503.505227: kvm_exit: reason EPT_VIOLATION rip 0x40359f info 582 0
> > dirty_log_test-29003 [001] 184503.505228: kvm_page_fault: address 7fc0370000 error_code 582
> > dirty_log_test-29003 [001] 184503.505231: kvm_entry: vcpu 1
> > ...
> >
> > The test was trying to continuously write to pages, from above log
> > starting from 7fc036d000. The reason 0x1d (29) is the new dirty ring
> > full exit reason.
> >
> > So far I'm still unsure of two things:
> >
> > 1. Why for each page we faulted twice rather than once. Take the
> > example of page at 7fc036e000 above, the first fault didn't
> > trigger the marking dirty path, while only until the 2nd ept
> > violation did we trigger kvm_dirty_ring_push.
>
> Not sure about that. Try enabling kvmmmu tracepoints too, it will tell
> you more of the path that was taken while processing the EPT violation.

These new tracepoints are extremely useful (which I didn't notice
before).

So here's the final culprit...

void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
{
        ...
        spin_lock(&kvm->mmu_lock);
        /* FIXME: we should use a single AND operation, but there is no
         * applicable atomic API.
         */
        while (mask) {
                clear_bit_le(offset + __ffs(mask), memslot->dirty_bitmap);
                mask &= mask - 1;
        }

        kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
        spin_unlock(&kvm->mmu_lock);
}

The mask is cleared before reaching
kvm_arch_mmu_enable_log_dirty_pt_masked()..
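
(A trivial way to fix it, for the record, would be to keep the original
mask in a local before the loop clobbers it; something like the below,
untested:)

void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
{
        u64 reprotect_mask = mask;      /* keep the original mask around */
        ...
        spin_lock(&kvm->mmu_lock);
        while (mask) {
                clear_bit_le(offset + __ffs(mask), memslot->dirty_bitmap);
                mask &= mask - 1;
        }

        /* pass the saved mask, not the (now zero) local copy */
        kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset,
                                                reprotect_mask);
        spin_unlock(&kvm->mmu_lock);
}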

The funny thing is that I did have a few more patches to even skip
allocate the dirty_bitmap when dirty ring is enabled (hence in that
tree I removed this while loop too, so that has no such problem).
However I dropped those patches when I posted the RFC because I don't
think it's mature, and the selftest didn't complain about that
either.. Though, I do plan to redo that in v2 if you don't disagree.
The major question would be whether the dirty_bitmap could still be
for any use if dirty ring is enabled.

>
> If your machine has PML, what you're seeing is likely not-present
> violation, not dirty-protect violation. Try disabling pml and see if
> the trace changes.
>
> > 2. Why we didn't get the last page written again after
> > kvm_userspace_exit (last page was 7fc036e000, and the test failed
> > because 7fc036e000 detected change however dirty bit unset). In
> > this case the first write after KVM_RESET_DIRTY_RINGS is the line
> > pointed by [1], I thought it should be a rewritten of page
> > 7fc036e000 because when the user exit happens logically the write
> > should not happen yet and eip should keep. However at [1] it's
> > already writting to a new page.
>
> IIUC you should get, with PML enabled:
>
> - guest writes to page
> - PML marks dirty bit, causes vmexit
> - host copies PML log to ring, causes userspace exit
> - userspace calls KVM_RESET_DIRTY_RINGS
> - host marks page as clean
> - userspace calls KVM_RUN
> - guest writes again to page
>
> but the page won't be in the ring until after another vmexit happens.
> Therefore, it's okay to reap the pages in the ring asynchronously, but
> there must be a synchronization point in the testcase sooner or later,
> where all CPUs are kicked out of KVM_RUN. This synchronization point
> corresponds to the migration downtime.

Yep, currently in the test case I used the same signal trick to kick
the vcpu out to make sure PML buffers are flushed during the vmexit,
before the main thread starts to collect dirty bits.

Thanks,

--
Peter Xu

2019-12-05 20:00:19

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH RFC 00/15] KVM: Dirty ring interface

On 05/12/19 20:30, Peter Xu wrote:
>> Try enabling kvmmmu tracepoints too, it will tell
>> you more of the path that was taken while processing the EPT violation.
>
> These new tracepoints are extremely useful (which I didn't notice
> before).

Yes, they are!

> So here's the final culprit...
>
> void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
> {
> ...
> spin_lock(&kvm->mmu_lock);
> /* FIXME: we should use a single AND operation, but there is no
> * applicable atomic API.
> */
> while (mask) {
> clear_bit_le(offset + __ffs(mask), memslot->dirty_bitmap);
> mask &= mask - 1;
> }
>
> kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
> spin_unlock(&kvm->mmu_lock);
> }
>
> The mask is cleared before reaching
> kvm_arch_mmu_enable_log_dirty_pt_masked()..

I'm not sure why that results in two vmexits? (clearing before
kvm_arch_mmu_enable_log_dirty_pt_masked is also what
KVM_{GET,CLEAR}_DIRTY_LOG does).

> The funny thing is that I did have a few more patches to even skip
> allocate the dirty_bitmap when dirty ring is enabled (hence in that
> tree I removed this while loop too, so that has no such problem).
> However I dropped those patches when I posted the RFC because I don't
> think it's mature, and the selftest didn't complain about that
> either.. Though, I do plan to redo that in v2 if you don't disagree.
> The major question would be whether the dirty_bitmap could still be
> for any use if dirty ring is enabled.

Userspace may want a dirty bitmap in addition to a list (for example:
list for migration, bitmap for framebuffer update), but it can also do a
pass over the dirty rings in order to update an internal bitmap.

So I think it makes sense to make it either one or the other.

Paolo

2019-12-05 20:54:04

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH RFC 00/15] KVM: Dirty ring interface

On Thu, Dec 05, 2019 at 08:59:33PM +0100, Paolo Bonzini wrote:
> On 05/12/19 20:30, Peter Xu wrote:
> >> Try enabling kvmmmu tracepoints too, it will tell
> >> you more of the path that was taken while processing the EPT violation.
> >
> > These new tracepoints are extremely useful (which I didn't notice
> > before).
>
> Yes, they are!

(I forgot to say thanks for teaching me that! :)

>
> > So here's the final culprit...
> >
> > void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
> > {
> > ...
> > spin_lock(&kvm->mmu_lock);
> > /* FIXME: we should use a single AND operation, but there is no
> > * applicable atomic API.
> > */
> > while (mask) {
> > clear_bit_le(offset + __ffs(mask), memslot->dirty_bitmap);
> > mask &= mask - 1;
> > }
> >
> > kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
> > spin_unlock(&kvm->mmu_lock);
> > }
> >
> > The mask is cleared before reaching
> > kvm_arch_mmu_enable_log_dirty_pt_masked()..
>
> I'm not sure why that results in two vmexits? (clearing before
> kvm_arch_mmu_enable_log_dirty_pt_masked is also what
> KVM_{GET,CLEAR}_DIRTY_LOG does).

Sorry, my fault for not being clear on this.

The kvm_arch_mmu_enable_log_dirty_pt_masked() issue only explains why
the same page is not written again after the ring-full userspace exit
(which is what triggered the real missing dirty bit), and that's
because the write bit is not removed during KVM_RESET_DIRTY_RINGS, so
the next vmenter will directly write to the previous page without a
vmexit.

The two vmexits are another story - I tracked it down to the fault
being retried because mmu_notifier_seq has changed, hence it goes
through this path:

        if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
                goto out_unlock;

It's because in try_async_pf() we will do a writable page fault, which
probably triggers both the invalidate_range_end and change_pte
notifiers. A reference trace with EPT enabled:

kvm_mmu_notifier_change_pte+1
__mmu_notifier_change_pte+82
wp_page_copy+1907
do_wp_page+478
__handle_mm_fault+3395
handle_mm_fault+196
__get_user_pages+681
get_user_pages_unlocked+172
__gfn_to_pfn_memslot+290
try_async_pf+141
tdp_page_fault+326
kvm_mmu_page_fault+115
kvm_arch_vcpu_ioctl_run+2675
kvm_vcpu_ioctl+536
do_vfs_ioctl+1029
ksys_ioctl+94
__x64_sys_ioctl+22
do_syscall_64+91

I'm not sure whether that's ideal, but it makes sense to me.
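
(To sum up my understanding of the relevant part of the fault path, as
a rough sketch from memory rather than the exact code:)

        mmu_seq = vcpu->kvm->mmu_notifier_seq;
        smp_rmb();

        /*
         * try_async_pf() resolves the pfn; the gup may COW the page
         * (the wp_page_copy in the trace above) and fire the change_pte
         * and invalidate_range_end notifiers, bumping mmu_notifier_seq.
         */

        spin_lock(&vcpu->kvm->mmu_lock);
        /* seq changed in between: install nothing, let the guest refault */
        if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
                goto out_unlock;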

>
> > The funny thing is that I did have a few more patches to even skip
> > allocate the dirty_bitmap when dirty ring is enabled (hence in that
> > tree I removed this while loop too, so that has no such problem).
> > However I dropped those patches when I posted the RFC because I don't
> > think it's mature, and the selftest didn't complain about that
> > either.. Though, I do plan to redo that in v2 if you don't disagree.
> > The major question would be whether the dirty_bitmap could still be
> > for any use if dirty ring is enabled.
>
> Userspace may want a dirty bitmap in addition to a list (for example:
> list for migration, bitmap for framebuffer update), but it can also do a
> pass over the dirty rings in order to update an internal bitmap.
>
> So I think it makes sense to make it either one or the other.

Ok, then I'll do that.

Thanks,

--
Peter Xu

2019-12-07 00:29:42

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On Wed, Dec 04, 2019 at 11:05:47AM +0100, Paolo Bonzini wrote:
> On 03/12/19 19:46, Sean Christopherson wrote:
> > Rather than reserve entries, what if vCPUs reserved an entire ring? Create
> > a pool of N=nr_vcpus rings that are shared by all vCPUs. To mark pages
> > dirty, a vCPU claims a ring, pushes the pages into the ring, and then
> > returns the ring to the pool. If pushing pages hits the soft limit, a
> > request is made to drain the ring and the ring is not returned to the pool
> > until it is drained.
> >
> > Except for acquiring a ring, which likely can be heavily optimized, that'd
> > allow parallel processing (#1), and would provide a facsimile of #2 as
> > pushing more pages onto a ring would naturally increase the likelihood of
> > triggering a drain. And it might be interesting to see the effect of using
> > different methods of ring selection, e.g. pure round robin, LRU, last used
> > on the current vCPU, etc...
>
> If you are creating nr_vcpus rings, and draining is done on the vCPU
> thread that has filled the ring, why not create nr_vcpus+1? The current
> code then is exactly the same as pre-claiming a ring per vCPU and never
> releasing it, and using a spinlock to claim the per-VM ring.

Because I really don't like kvm_get_running_vcpu() :-)

Binding the rings to vCPUs also makes for an inflexible API, e.g. the
amount of memory required for the rings scales linearly with the number of
vCPUs, or maybe there's a use case for having M:N vCPUs:rings.

That being said, I'm pretty clueless when it comes to implementing and
tuning the userspace side of this type of stuff, so feel free to ignore my
thoughts on the API.

2019-12-09 09:51:14

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On 07/12/19 01:29, Sean Christopherson wrote:
> On Wed, Dec 04, 2019 at 11:05:47AM +0100, Paolo Bonzini wrote:
>> On 03/12/19 19:46, Sean Christopherson wrote:
>>> Rather than reserve entries, what if vCPUs reserved an entire ring? Create
>>> a pool of N=nr_vcpus rings that are shared by all vCPUs. To mark pages
>>> dirty, a vCPU claims a ring, pushes the pages into the ring, and then
>>> returns the ring to the pool. If pushing pages hits the soft limit, a
>>> request is made to drain the ring and the ring is not returned to the pool
>>> until it is drained.
>>>
>>> Except for acquiring a ring, which likely can be heavily optimized, that'd
>>> allow parallel processing (#1), and would provide a facsimile of #2 as
>>> pushing more pages onto a ring would naturally increase the likelihood of
>>> triggering a drain. And it might be interesting to see the effect of using
>>> different methods of ring selection, e.g. pure round robin, LRU, last used
>>> on the current vCPU, etc...
>>
>> If you are creating nr_vcpus rings, and draining is done on the vCPU
>> thread that has filled the ring, why not create nr_vcpus+1? The current
>> code then is exactly the same as pre-claiming a ring per vCPU and never
>> releasing it, and using a spinlock to claim the per-VM ring.
>
> Because I really don't like kvm_get_running_vcpu() :-)

I also don't like it particularly, but I think it's okay to wrap it into
a nicer API.

> Binding the rings to vCPUs also makes for an inflexible API, e.g. the
> amount of memory required for the rings scales linearly with the number of
> vCPUs, or maybe there's a use case for having M:N vCPUs:rings.

If we can get rid of the dirty bitmap, the amount of memory is probably
going to be smaller anyway. For example at 64k per ring, 256 rings
occupy 16 MiB of memory, and that is the cost of dirty bitmaps for 512
GiB of guest memory, and that's probably what you can expect for the
memory of a 256-vCPU guest (at least roughly: if the memory is 128 GiB,
the extra 12 MiB for dirty page rings don't really matter).
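
(Spelling out the arithmetic, assuming 4 KiB pages and the usual one
bit per page for the bitmap: 256 rings * 64 KiB = 16 MiB of rings; a
bitmap for 512 GiB is 512 GiB / 4 KiB / 8 = 16 MiB as well; for a
128 GiB guest the bitmap would only be 4 MiB, hence the ~12 MiB extra
for the rings.)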

Paolo

> That being said, I'm pretty clueless when it comes to implementing and
> tuning the userspace side of this type of stuff, so feel free to ignore my
> thoughts on the API.
>

2019-12-09 21:56:05

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On Wed, Dec 04, 2019 at 11:05:47AM +0100, Paolo Bonzini wrote:
> On 03/12/19 19:46, Sean Christopherson wrote:
> > On Tue, Dec 03, 2019 at 02:48:10PM +0100, Paolo Bonzini wrote:
> >> On 02/12/19 22:50, Sean Christopherson wrote:
> >>>>
> >>>> I discussed this with Paolo, but I think Paolo preferred the per-vm
> >>>> ring because there's no good reason to choose vcpu0 as what (1)
> >>>> suggested. While if to choose (2) we probably need to lock even for
> >>>> per-cpu ring, so could be a bit slower.
> >>> Ya, per-vm is definitely better than dumping on vcpu0. I'm hoping we can
> >>> find a third option that provides comparable performance without using any
> >>> per-vcpu rings.
> >>>
> >>
> >> The advantage of per-vCPU rings is that it naturally: 1) parallelizes
> >> the processing of dirty pages; 2) makes userspace vCPU thread do more
> >> work on vCPUs that dirty more pages.
> >>
> >> I agree that on the producer side we could reserve multiple entries in
> >> the case of PML (and without PML only one entry should be added at a
> >> time). But I'm afraid that things get ugly when the ring is full,
> >> because you'd have to wait for all vCPUs to finish publishing the
> >> entries they have reserved.
> >
> > Ah, I take it the intended model is that userspace will only start pulling
> > entries off the ring when KVM explicitly signals that the ring is "full"?
>
> No, it's not. But perhaps in the asynchronous case you can delay
> pushing the reserved entries to the consumer until a moment where no
> CPUs have left empty slots in the ring buffer (somebody must have done
> multi-producer ring buffers before). In the ring-full case that is
> harder because it requires synchronization.
>
> > Rather than reserve entries, what if vCPUs reserved an entire ring? Create
> > a pool of N=nr_vcpus rings that are shared by all vCPUs. To mark pages
> > dirty, a vCPU claims a ring, pushes the pages into the ring, and then
> > returns the ring to the pool. If pushing pages hits the soft limit, a
> > request is made to drain the ring and the ring is not returned to the pool
> > until it is drained.
> >
> > Except for acquiring a ring, which likely can be heavily optimized, that'd
> > allow parallel processing (#1), and would provide a facsimile of #2 as
> > pushing more pages onto a ring would naturally increase the likelihood of
> > triggering a drain. And it might be interesting to see the effect of using
> > different methods of ring selection, e.g. pure round robin, LRU, last used
> > on the current vCPU, etc...
>
> If you are creating nr_vcpus rings, and draining is done on the vCPU
> thread that has filled the ring, why not create nr_vcpus+1? The current
> code then is exactly the same as pre-claiming a ring per vCPU and never
> releasing it, and using a spinlock to claim the per-VM ring.
>
> However, we could build on top of my other suggestion to add
> slot->as_id, and wrap kvm_get_running_vcpu() with a nice API, mimicking
> exactly what you've suggested. Maybe even add a scary comment around
> kvm_get_running_vcpu() suggesting that users only do so to avoid locking
> and wrap it with a nice API. Similar to what get_cpu/put_cpu do with
> smp_processor_id.
>
> 1) Add a pointer from struct kvm_dirty_ring to struct
> kvm_dirty_ring_indexes:
>
> vcpu->dirty_ring->data = &vcpu->run->vcpu_ring_indexes;
> kvm->vm_dirty_ring->data = *kvm->vm_run->vm_ring_indexes;
>
> 2) push the ring choice and locking to two new functions
>
> struct kvm_ring *kvm_get_dirty_ring(struct kvm *kvm)
> {
> struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
>
> if (vcpu && !WARN_ON_ONCE(vcpu->kvm != kvm)) {
> return &vcpu->dirty_ring;
> } else {
> /*
> * Put onto per vm ring because no vcpu context.
> * We'll kick vcpu0 if ring is full.
> */
> spin_lock(&kvm->vm_dirty_ring->lock);
> return &kvm->vm_dirty_ring;
> }
> }
>
> void kvm_put_dirty_ring(struct kvm *kvm,
> struct kvm_dirty_ring *ring)
> {
> struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> bool full = kvm_dirty_ring_used(ring) >= ring->soft_limit;
>
> if (ring == &kvm->vm_dirty_ring) {
> if (vcpu == NULL)
> vcpu = kvm->vcpus[0];
> spin_unlock(&kvm->vm_dirty_ring->lock);
> }
>
> if (full)
> kvm_make_request(KVM_REQ_DIRTY_RING_FULL, vcpu);
> }
>
> 3) simplify kvm_dirty_ring_push to
>
> void kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> u32 slot, u64 offset)
> {
> /* left as an exercise to the reader */
> }
>
> and mark_page_dirty_in_ring to
>
> static void mark_page_dirty_in_ring(struct kvm *kvm,
> struct kvm_memory_slot *slot,
> gfn_t gfn)
> {
> struct kvm_dirty_ring *ring;
>
> if (!kvm->dirty_ring_size)
> return;
>
> ring = kvm_get_dirty_ring(kvm);
> kvm_dirty_ring_push(ring, (slot->as_id << 16) | slot->id,
> gfn - slot->base_gfn);
> kvm_put_dirty_ring(kvm, ring);
> }

I think I got the major point here. Unless Sean has some better idea
in the future I'll go with this.
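
(Just to check that I got the idea, the simplified push would then
roughly be the below, reusing the dirty_index/avail_index publish order
already described in the document and ignoring the full/soft-limit
checks; completely untested:)

void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
{
        struct kvm_dirty_gfn *entry;

        /* grab the next free slot (ring size is a power of two) */
        entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
        entry->slot = slot;
        entry->offset = offset;
        /* make the entry visible before publishing the new index */
        smp_wmb();
        ring->dirty_index++;
        ring->data->avail_index = ring->dirty_index;
}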

Just until recently I noticed that actually kvm_get_running_vcpu() has
a real benefit in that it gives a very solid result on whether we're
with the vcpu context, even more accurate than when we pass vcpu
pointers around (because sometimes we just passed the kvm pointer
along the stack even if we're with a vcpu context, just like what we
did with mark_page_dirty_in_slot). I'm thinking whether I can start
to use this information in the next post on solving an issue I
encountered with the waitqueue.

Current waitqueue is still problematic in that it could wait even with
the mmu lock held when with vcpu context.

The issue is KVM_RESET_DIRTY_RINGS needs the mmu lock to manipulate
the write bits, while it's the only interface to also wake up the
dirty ring sleepers. They could dead lock like this:

      main thread                          vcpu thread
      ===========                          ===========
                                           kvm page fault
                                             mark_page_dirty_in_slot
                                               mmu lock taken
                                               mark dirty, ring full
                                               queue on waitqueue
                                               (with mmu lock)
      KVM_RESET_DIRTY_RINGS
        take mmu lock      <------------ deadlock here
        reset ring gfns
        wakeup dirty ring sleepers

And if we see if the mark_page_dirty_in_slot() is not with a vcpu
context (e.g. kvm_mmu_page_fault) but with an ioctl context (those
cases we'll use per-vm dirty ring) then it's probably fine.

My planned solution:

- When kvm_get_running_vcpu() != NULL, we postpone the waitqueue waits
until we finished handling this page fault, probably in somewhere
around vcpu_enter_guest, so that we can do wait_event() after the
mmu lock released

- For per-vm ring full, I'll do what we do now (wait_event() as long
in mark_page_dirty_in_ring) assuming it should not be with the mmu
lock held

To achieve the above, I think I really need to know exactly whether
we're in a vcpu context, where I suppose kvm_get_running_vcpu() would
work for me, rather than checking against the vcpu pointer passed in.

I also wanted to let KVM_RUN always return immediately if either the
per-vm ring or a per-vcpu ring reaches the softlimit, instead of
continuing execution until the next dirty-ring-full event.

I'd be glad to receive any early comment before I move on to these.

Thanks!

--
Peter Xu

2019-12-10 10:08:31

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On 09/12/19 22:54, Peter Xu wrote:
> Just until recently I noticed that actually kvm_get_running_vcpu() has
> a real benefit in that it gives a very solid result on whether we're
> with the vcpu context, even more accurate than when we pass vcpu
> pointers around (because sometimes we just passed the kvm pointer
> along the stack even if we're with a vcpu context, just like what we
> did with mark_page_dirty_in_slot).

Right, that's the point.

> I'm thinking whether I can start
> to use this information in the next post on solving an issue I
> encountered with the waitqueue.
>
> Current waitqueue is still problematic in that it could wait even with
> the mmu lock held when with vcpu context.

I think the idea of the soft limit is that the waiting just cannot
happen. That is, the number of dirtied pages _outside_ the guest (guest
accesses are taken care of by PML, and are subtracted from the soft
limit) cannot exceed hard_limit - (soft_limit + pml_size).
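
(Concretely, assuming soft_limit = size - KVM_DIRTY_RING_RSVD_ENTRIES
as in the RFC, that headroom is 64 entries: everything the kernel can
dirty between the soft-full check and the actual exit, minus what PML
itself accounts for, has to fit into those 64 slots, which is also why
the RFC comment says the reserved count may need to be raised when PML
is enabled.)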

> The issue is KVM_RESET_DIRTY_RINGS needs the mmu lock to manipulate
> the write bits, while it's the only interface to also wake up the
> dirty ring sleepers. They could dead lock like this:
>
> main thread vcpu thread
> =========== ===========
> kvm page fault
> mark_page_dirty_in_slot
> mmu lock taken
> mark dirty, ring full
> queue on waitqueue
> (with mmu lock)
> KVM_RESET_DIRTY_RINGS
> take mmu lock <------------ deadlock here
> reset ring gfns
> wakeup dirty ring sleepers
>
> And if we see if the mark_page_dirty_in_slot() is not with a vcpu
> context (e.g. kvm_mmu_page_fault) but with an ioctl context (those
> cases we'll use per-vm dirty ring) then it's probably fine.
>
> My planned solution:
>
> - When kvm_get_running_vcpu() != NULL, we postpone the waitqueue waits
> until we finished handling this page fault, probably in somewhere
> around vcpu_enter_guest, so that we can do wait_event() after the
> mmu lock released

I think this can cause a race:

        vCPU 1                  vCPU 2                  host
        ---------------------------------------------------------------
        mark page dirty
                                write to page
                                                        treat page as not dirty
        add page to ring

where vCPU 2 skips the clean-page slow path entirely.

Paolo

2019-12-10 13:26:57

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On Wed, Dec 04, 2019 at 12:04:53PM +0100, Paolo Bonzini wrote:
> On 04/12/19 11:38, Jason Wang wrote:
> >>
> >> +    entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> >> +    entry->slot = slot;
> >> +    entry->offset = offset;
> >
> >
> > Haven't gone through the whole series, sorry if it was a silly question
> > but I wonder things like this will suffer from similar issue on
> > virtually tagged archs as mentioned in [1].
>
> There is no new infrastructure to track the dirty pages---it's just a
> different way to pass them to userspace.

Did you guys consider using one of the virtio ring formats?
Maybe reusing vhost code?

If you did and it's not a good fit, this is something good to mention
in the commit log.

I also wonder about performance numbers - any data here?


> > Is this better to allocate the ring from userspace and set to KVM
> > instead? Then we can use copy_to/from_user() friends (a little bit slow
> > on recent CPUs).
>
> Yeah, I don't think that would be better than mmap.
>
> Paolo
>
>
> > [1] https://lkml.org/lkml/2019/4/9/5

2019-12-10 13:32:46

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On 10/12/19 14:25, Michael S. Tsirkin wrote:
>> There is no new infrastructure to track the dirty pages---it's just a
>> different way to pass them to userspace.
> Did you guys consider using one of the virtio ring formats?
> Maybe reusing vhost code?

There are no used/available entries here, it's unidirectional
(kernel->user).

> If you did and it's not a good fit, this is something good to mention
> in the commit log.
>
> I also wonder about performance numbers - any data here?

Yes some numbers would be useful. Note however that the improvement is
asymptotical, O(#dirtied pages) vs O(#total pages) so it may differ
depending on the workload.

Paolo

2019-12-10 15:53:57

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On Tue, Dec 10, 2019 at 11:07:31AM +0100, Paolo Bonzini wrote:
> > I'm thinking whether I can start
> > to use this information in the next post on solving an issue I
> > encountered with the waitqueue.
> >
> > Current waitqueue is still problematic in that it could wait even with
> > the mmu lock held when with vcpu context.
>
> I think the idea of the soft limit is that the waiting just cannot
> happen. That is, the number of dirtied pages _outside_ the guest (guest
> accesses are taken care of by PML, and are subtracted from the soft
> limit) cannot exceed hard_limit - (soft_limit + pml_size).

So the question goes back to: is this guaranteed somehow? Or do you
prefer us to keep the warn_on_once until it triggers, then we can
analyze (which I doubt..)?

One thing to mention is that for with-vcpu cases, we probably can even
stop KVM_RUN immediately as long as either the per-vm or per-vcpu ring
reaches the softlimit, then for vcpu case it should be easier to
guarantee that. What I want to know is the rest of cases like ioctls
or even something not from the userspace (which I think I should read
more later..).

If the answer is yes, I'd be more than glad to drop the waitqueue.

>
> > The issue is KVM_RESET_DIRTY_RINGS needs the mmu lock to manipulate
> > the write bits, while it's the only interface to also wake up the
> > dirty ring sleepers. They could dead lock like this:
> >
> > main thread vcpu thread
> > =========== ===========
> > kvm page fault
> > mark_page_dirty_in_slot
> > mmu lock taken
> > mark dirty, ring full
> > queue on waitqueue
> > (with mmu lock)
> > KVM_RESET_DIRTY_RINGS
> > take mmu lock <------------ deadlock here
> > reset ring gfns
> > wakeup dirty ring sleepers
> >
> > And if we see if the mark_page_dirty_in_slot() is not with a vcpu
> > context (e.g. kvm_mmu_page_fault) but with an ioctl context (those
> > cases we'll use per-vm dirty ring) then it's probably fine.
> >
> > My planned solution:
> >
> > - When kvm_get_running_vcpu() != NULL, we postpone the waitqueue waits
> > until we finished handling this page fault, probably in somewhere
> > around vcpu_enter_guest, so that we can do wait_event() after the
> > mmu lock released
>
> I think this can cause a race:
>
> vCPU 1 vCPU 2 host
> ---------------------------------------------------------------
> mark page dirty
> write to page
> treat page as not dirty
> add page to ring
>
> where vCPU 2 skips the clean-page slow path entirely.

If we're still with the rule in userspace that we first do RESET then
collect and send the pages (just like what we've discussed before),
then IMHO it's fine to have vcpu2 to skip the slow path? Because
RESET happens at "treat page as not dirty", then if we are sure that
we only collect and send pages after that point, then the latest
"write to page" data from vcpu2 won't be lost even if vcpu2 is not
blocked by vcpu1's ring full?

Maybe we can also consider to let mark_page_dirty_in_slot() return a
value, then the upper layer could have a chance to skip the spte
update if mark_page_dirty_in_slot() fails to mark the dirty bit, so it
can return directly with RET_PF_RETRY.

Thanks,

--
Peter Xu

2019-12-10 16:03:21

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On Tue, Dec 10, 2019 at 02:31:54PM +0100, Paolo Bonzini wrote:
> On 10/12/19 14:25, Michael S. Tsirkin wrote:
> >> There is no new infrastructure to track the dirty pages---it's just a
> >> different way to pass them to userspace.
> > Did you guys consider using one of the virtio ring formats?
> > Maybe reusing vhost code?
>
> There are no used/available entries here, it's unidirectional
> (kernel->user).

Agreed. Vring could be an overkill IMHO (the whole dirty_ring.c is
100+ LOC only).

>
> > If you did and it's not a good fit, this is something good to mention
> > in the commit log.
> >
> > I also wonder about performance numbers - any data here?
>
> Yes some numbers would be useful. Note however that the improvement is
> asymptotical, O(#dirtied pages) vs O(#total pages) so it may differ
> depending on the workload.

Yes. I plan to give some numbers when I start to work on the QEMU
series (after this lands). However as Paolo said, those numbers would
probably only be for some special cases where I know the dirty ring
could win. Frankly speaking I don't even know whether we should
change the default logging mode when the QEMU work is done - I feel
like the old logging interface is still good in many major cases
(small vms, or high dirty rates). It could be that we just offer
another option that the user could consider for solving specific
problems.

Thanks,

--
Peter Xu

2019-12-10 17:10:58

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On 10/12/19 16:52, Peter Xu wrote:
> On Tue, Dec 10, 2019 at 11:07:31AM +0100, Paolo Bonzini wrote:
>>> I'm thinking whether I can start
>>> to use this information in the next post on solving an issue I
>>> encountered with the waitqueue.
>>>
>>> Current waitqueue is still problematic in that it could wait even with
>>> the mmu lock held when with vcpu context.
>>
>> I think the idea of the soft limit is that the waiting just cannot
>> happen. That is, the number of dirtied pages _outside_ the guest (guest
>> accesses are taken care of by PML, and are subtracted from the soft
>> limit) cannot exceed hard_limit - (soft_limit + pml_size).
>
> So the question goes back to: is this guaranteed somehow? Or do you
> prefer us to keep the warn_on_once until it triggers, then we can
> analyze (which I doubt..)?

Yes, I would like to keep the WARN_ON_ONCE just because you never know.

Of course it would be much better to audit the calls to kvm_write_guest
and figure out how many could trigger (e.g. two from the operands of an
emulated instruction, 5 from a nested EPT walk, 1 from a page walk, etc.).

> One thing to mention is that for with-vcpu cases, we probably can even
> stop KVM_RUN immediately as long as either the per-vm or per-vcpu ring
> reaches the softlimit, then for vcpu case it should be easier to
> guarantee that. What I want to know is the rest of cases like ioctls
> or even something not from the userspace (which I think I should read
> more later..).

Which ioctls? Most ioctls shouldn't dirty memory at all.

>>> And if we see if the mark_page_dirty_in_slot() is not with a vcpu
>>> context (e.g. kvm_mmu_page_fault) but with an ioctl context (those
>>> cases we'll use per-vm dirty ring) then it's probably fine.
>>>
>>> My planned solution:
>>>
>>> - When kvm_get_running_vcpu() != NULL, we postpone the waitqueue waits
>>> until we finished handling this page fault, probably in somewhere
>>> around vcpu_enter_guest, so that we can do wait_event() after the
>>> mmu lock released
>>
>> I think this can cause a race:
>>
>> vCPU 1 vCPU 2 host
>> ---------------------------------------------------------------
>> mark page dirty
>> write to page
>> treat page as not dirty
>> add page to ring
>>
>> where vCPU 2 skips the clean-page slow path entirely.
>
> If we're still with the rule in userspace that we first do RESET then
> collect and send the pages (just like what we've discussed before),
> then IMHO it's fine to have vcpu2 to skip the slow path? Because
> RESET happens at "treat page as not dirty", then if we are sure that
> we only collect and send pages after that point, then the latest
> "write to page" data from vcpu2 won't be lost even if vcpu2 is not
> blocked by vcpu1's ring full?

Good point, the race would become

        vCPU 1                  vCPU 2                  host
        ---------------------------------------------------------------
        mark page dirty
                                write to page
                                                        reset rings
                                                          wait for mmu lock
        add page to ring
        release mmu lock
                                                          ...do reset...
                                                          release mmu lock
                                                        page is now dirty

> Maybe we can also consider to let mark_page_dirty_in_slot() return a
> value, then the upper layer could have a chance to skip the spte
> update if mark_page_dirty_in_slot() fails to mark the dirty bit, so it
> can return directly with RET_PF_RETRY.

I don't think that's possible, most writes won't come from a page fault
path and cannot retry.

Paolo

2019-12-10 21:50:34

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On Tue, Dec 10, 2019 at 02:31:54PM +0100, Paolo Bonzini wrote:
> On 10/12/19 14:25, Michael S. Tsirkin wrote:
> >> There is no new infrastructure to track the dirty pages---it's just a
> >> different way to pass them to userspace.
> > Did you guys consider using one of the virtio ring formats?
> > Maybe reusing vhost code?
>
> There are no used/available entries here, it's unidirectional
> (kernel->user).

Didn't look at the design yet, but flow control (to prevent overflow)
goes the other way, doesn't it? That's what used is, essentially.

> > If you did and it's not a good fit, this is something good to mention
> > in the commit log.
> >
> > I also wonder about performance numbers - any data here?
>
> Yes some numbers would be useful. Note however that the improvement is
> asymptotical, O(#dirtied pages) vs O(#total pages) so it may differ
> depending on the workload.
>
> Paolo

2019-12-10 21:54:06

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On Tue, Dec 10, 2019 at 11:02:11AM -0500, Peter Xu wrote:
> On Tue, Dec 10, 2019 at 02:31:54PM +0100, Paolo Bonzini wrote:
> > On 10/12/19 14:25, Michael S. Tsirkin wrote:
> > >> There is no new infrastructure to track the dirty pages---it's just a
> > >> different way to pass them to userspace.
> > > Did you guys consider using one of the virtio ring formats?
> > > Maybe reusing vhost code?
> >
> > There are no used/available entries here, it's unidirectional
> > (kernel->user).
>
> Agreed. Vring could be an overkill IMHO (the whole dirty_ring.c is
> 100+ LOC only).


I guess you don't do polling/ event suppression and other tricks that
virtio came up with for speed then? Why won't they be helpful for kvm?
To put it another way, LOC is irrelevant, virtio is already in the
kernel.

Anyway, this is something to be discussed in the cover letter.

> >
> > > If you did and it's not a good fit, this is something good to mention
> > > in the commit log.
> > >
> > > I also wonder about performance numbers - any data here?
> >
> > Yes some numbers would be useful. Note however that the improvement is
> > asymptotical, O(#dirtied pages) vs O(#total pages) so it may differ
> > depending on the workload.
>
> Yes. I plan to give some numbers when I start to work on the QEMU
> series (after this lands). However as Paolo said, those numbers would
> probably only be for some special cases where I know the dirty ring
> could win. Frankly speaking I don't even know whether we should
> change the default logging mode when the QEMU work is done - I feel
> like the old logging interface is still good in many major cases
> (small vms, or high dirty rates). It could be that we just offer
> another option that the user could consider for solving specific
> problems.
>
> Thanks,
>
> --
> Peter Xu

2019-12-11 09:07:38

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On 10/12/19 22:53, Michael S. Tsirkin wrote:
> On Tue, Dec 10, 2019 at 11:02:11AM -0500, Peter Xu wrote:
>> On Tue, Dec 10, 2019 at 02:31:54PM +0100, Paolo Bonzini wrote:
>>> On 10/12/19 14:25, Michael S. Tsirkin wrote:
>>>>> There is no new infrastructure to track the dirty pages---it's just a
>>>>> different way to pass them to userspace.
>>>> Did you guys consider using one of the virtio ring formats?
>>>> Maybe reusing vhost code?
>>>
>>> There are no used/available entries here, it's unidirectional
>>> (kernel->user).
>>
>> Agreed. Vring could be an overkill IMHO (the whole dirty_ring.c is
>> 100+ LOC only).
>
> I guess you don't do polling/ event suppression and other tricks that
> virtio came up with for speed then?

There are no interrupts either, so no need for event suppression. You
have vmexits when the ring gets full (and that needs to be synchronous),
but apart from that the migration thread will poll the rings once when
it needs to send more pages.

Paolo

2019-12-11 12:54:58

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On Fri, Nov 29, 2019 at 04:34:54PM -0500, Peter Xu wrote:
> This patch is heavily based on previous work from Lei Cao
> <[email protected]> and Paolo Bonzini <[email protected]>. [1]
>
> KVM currently uses large bitmaps to track dirty memory. These bitmaps
> are copied to userspace when userspace queries KVM for its dirty page
> information. The use of bitmaps is mostly sufficient for live
> migration, as large parts of memory are be dirtied from one log-dirty
> pass to another. However, in a checkpointing system, the number of
> dirty pages is small and in fact it is often bounded---the VM is
> paused when it has dirtied a pre-defined number of pages. Traversing a
> large, sparsely populated bitmap to find set bits is time-consuming,
> as is copying the bitmap to user-space.
>
> A similar issue will be there for live migration when the guest memory
> is huge while the page dirty procedure is trivial. In that case for
> each dirty sync we need to pull the whole dirty bitmap to userspace
> and analyse every bit even if it's mostly zeros.
>
> The preferred data structure for above scenarios is a dense list of
> guest frame numbers (GFN). This patch series stores the dirty list in
> kernel memory that can be memory mapped into userspace to allow speedy
> harvesting.
>
> We defined two new data structures:
>
> struct kvm_dirty_ring;
> struct kvm_dirty_ring_indexes;
>
> Firstly, kvm_dirty_ring is defined to represent a ring of dirty
> pages. When dirty tracking is enabled, we can push dirty gfn onto the
> ring.
>
> Secondly, kvm_dirty_ring_indexes is defined to represent the
> user/kernel interface of each ring. Currently it contains two
> indexes: (1) avail_index represents where we should push our next
> PFN (written by kernel), while (2) fetch_index represents where the
> userspace should fetch the next dirty PFN (written by userspace).
>
> One complete ring is composed by one kvm_dirty_ring plus its
> corresponding kvm_dirty_ring_indexes.
>
> Currently, we have N+1 rings for each VM of N vcpus:
>
> - for each vcpu, we have 1 per-vcpu dirty ring,
> - for each vm, we have 1 per-vm dirty ring
>
> Please refer to the documentation update in this patch for more
> details.
>
> Note that this patch implements the core logic of dirty ring buffer.
> It's still disabled for all archs for now. Also, we'll address some
> of the other issues in follow up patches before it's firstly enabled
> on x86.
>
> [1] https://patchwork.kernel.org/patch/10471409/
>
> Signed-off-by: Lei Cao <[email protected]>
> Signed-off-by: Paolo Bonzini <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>


Thanks, that's interesting.

> ---
> Documentation/virt/kvm/api.txt | 109 +++++++++++++++
> arch/x86/kvm/Makefile | 3 +-
> include/linux/kvm_dirty_ring.h | 67 +++++++++
> include/linux/kvm_host.h | 33 +++++
> include/linux/kvm_types.h | 1 +
> include/uapi/linux/kvm.h | 36 +++++
> virt/kvm/dirty_ring.c | 156 +++++++++++++++++++++
> virt/kvm/kvm_main.c | 240 ++++++++++++++++++++++++++++++++-
> 8 files changed, 642 insertions(+), 3 deletions(-)
> create mode 100644 include/linux/kvm_dirty_ring.h
> create mode 100644 virt/kvm/dirty_ring.c
>
> diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
> index 49183add44e7..fa622c9a2eb8 100644
> --- a/Documentation/virt/kvm/api.txt
> +++ b/Documentation/virt/kvm/api.txt
> @@ -231,6 +231,7 @@ Based on their initialization different VMs may have different capabilities.
> It is thus encouraged to use the vm ioctl to query for capabilities (available
> with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
>
> +
> 4.5 KVM_GET_VCPU_MMAP_SIZE
>
> Capability: basic
> @@ -243,6 +244,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
> memory region. This ioctl returns the size of that region. See the
> KVM_RUN documentation for details.
>
> +Besides the size of the KVM_RUN communication region, other areas of
> +the VCPU file descriptor can be mmap-ed, including:
> +
> +- if KVM_CAP_COALESCED_MMIO is available, a page at
> + KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
> + this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
> + KVM_CAP_COALESCED_MMIO is not documented yet.
> +
> +- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
> + KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. For more information on
> + KVM_CAP_DIRTY_LOG_RING, see section 8.3.
> +
>
> 4.6 KVM_SET_MEMORY_REGION
>

PAGE_SIZE being which value? It's not always trivial for
userspace to know what's the PAGE_SIZE for the kernel ...


> @@ -5358,6 +5371,7 @@ CPU when the exception is taken. If this virtual SError is taken to EL1 using
> AArch64, this value will be reported in the ISS field of ESR_ELx.
>
> See KVM_CAP_VCPU_EVENTS for more details.
> +
> 8.20 KVM_CAP_HYPERV_SEND_IPI
>
> Architectures: x86
> @@ -5365,6 +5379,7 @@ Architectures: x86
> This capability indicates that KVM supports paravirtualized Hyper-V IPI send
> hypercalls:
> HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
> +
> 8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
>
> Architecture: x86
> @@ -5378,3 +5393,97 @@ handling by KVM (as some KVM hypercall may be mistakenly treated as TLB
> flush hypercalls by Hyper-V) so userspace should disable KVM identification
> in CPUID and only exposes Hyper-V identification. In this case, guest
> thinks it's running on Hyper-V and only use Hyper-V hypercalls.
> +
> +8.22 KVM_CAP_DIRTY_LOG_RING
> +
> +Architectures: x86
> +Parameters: args[0] - size of the dirty log ring
> +
> +KVM is capable of tracking dirty memory using ring buffers that are
> +mmaped into userspace; there is one dirty ring per vcpu and one global
> +ring per vm.
> +
> +One dirty ring has the following two major structures:
> +
> +struct kvm_dirty_ring {
> + u16 dirty_index;
> + u16 reset_index;
> + u32 size;
> + u32 soft_limit;
> + spinlock_t lock;
> + struct kvm_dirty_gfn *dirty_gfns;
> +};
> +
> +struct kvm_dirty_ring_indexes {
> + __u32 avail_index; /* set by kernel */
> + __u32 fetch_index; /* set by userspace */

Sticking these next to each other seems to guarantee cache conflicts.

Avail/Fetch seems to mimic Virtio's avail/used exactly. I am not saying
you must reuse the code really, but I think you should take a hard look
at e.g. the virtio packed ring structure. We spent a bunch of time
optimizing it for cache utilization. It seems the kernel is the
driver, making entries available, and userspace the device, using them.
Again let's not develop a thread about this, but I think
this is something to consider and discuss in future versions
of the patches.


> +};
> +
> +While for each of the dirty entry it's defined as:
> +
> +struct kvm_dirty_gfn {

What does GFN stand for?

> + __u32 pad;
> + __u32 slot; /* as_id | slot_id */
> + __u64 offset;
> +};

offset of what? a 4K page right? Seems like a waste e.g. for
hugetlbfs... How about replacing pad with size instead?

> +
> +The fields in kvm_dirty_ring will be only internal to KVM itself,
> +while the fields in kvm_dirty_ring_indexes will be exposed to
> +userspace to be either read or written.

I'm not sure what you are trying to say here. kvm_dirty_gfn
seems to be part of UAPI.

> +
> +The two indices in the ring buffer are free running counters.
> +
> +In pseudocode, processing the ring buffer looks like this:
> +
> + idx = load-acquire(&ring->fetch_index);
> + while (idx != ring->avail_index) {
> + struct kvm_dirty_gfn *entry;
> + entry = &ring->dirty_gfns[idx & (size - 1)];
> + ...
> +
> + idx++;
> + }
> + ring->fetch_index = idx;
> +
> +Userspace calls KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM ioctl
> +to enable this capability for the new guest and set the size of the
> +rings. It is only allowed before creating any vCPU, and the size of
> +the ring must be a power of two.

All these seem like arbitrary limitations to me.

Sizing the ring correctly might prove to be a challenge.

Thus I think there's value in being able to resize the rings
without destroying the vCPUs.

Also, power of two just saves a branch here and there,
but wastes lots of memory. Just wrap the index around to
0 and then users can select any size?
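
Something like this, just to illustrate the idea (not tested against
your structures):

        /* what the RFC does: power-of-two size, free-running index plus masking */
        entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];

        /* the alternative I mean: arbitrary size, wrap the index explicitly */
        entry = &ring->dirty_gfns[ring->dirty_index];
        if (++ring->dirty_index == ring->size)
                ring->dirty_index = 0;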



> The larger the ring buffer, the less
> +likely the ring is full and the VM is forced to exit to userspace. The
> +optimal size depends on the workload, but it is recommended that it be
> +at least 64 KiB (4096 entries).

OTOH larger buffers put lots of pressure on the system cache.


> +
> +After the capability is enabled, userspace can mmap the global ring
> +buffer (kvm_dirty_gfn[], offset KVM_DIRTY_LOG_PAGE_OFFSET) and the
> +indexes (kvm_dirty_ring_indexes, offset 0) from the VM file
> +descriptor. The per-vcpu dirty ring instead is mmapped when the vcpu
> +is created, similar to the kvm_run struct (kvm_dirty_ring_indexes
> +locates inside kvm_run, while kvm_dirty_gfn[] at offset
> +KVM_DIRTY_LOG_PAGE_OFFSET).
> +
> +Just like for dirty page bitmaps, the buffer tracks writes to
> +all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
> +set in KVM_SET_USER_MEMORY_REGION. Once a memory region is registered
> +with the flag set, userspace can start harvesting dirty pages from the
> +ring buffer.
> +
> +To harvest the dirty pages, userspace accesses the mmaped ring buffer
> +to read the dirty GFNs up to avail_index, and sets the fetch_index
> +accordingly. This can be done when the guest is running or paused,
> +and dirty pages need not be collected all at once. After processing
> +one or more entries in the ring buffer, userspace calls the VM ioctl
> +KVM_RESET_DIRTY_RINGS to notify the kernel that it has updated
> +fetch_index and to mark those pages clean. Therefore, the ioctl
> +must be called *before* reading the content of the dirty pages.
> +
> +However, there is a major difference comparing to the
> +KVM_GET_DIRTY_LOG interface in that when reading the dirty ring from
> +userspace it's still possible that the kernel has not yet flushed the
> +hardware dirty buffers into the kernel buffer. To achieve that, one
> +needs to kick the vcpu out for a hardware buffer flush (vmexit).
> +
> +If one of the ring buffers is full, the guest will exit to userspace
> +with the exit reason set to KVM_EXIT_DIRTY_LOG_FULL, and the
> +KVM_RUN ioctl will return -EINTR. Once that happens, userspace
> +should pause all the vcpus, then harvest all the dirty pages and
> +rearm the dirty traps. It can unpause the guest after that.

This last item means that the performance impact of the feature is
really hard to predict. Can improve some workloads drastically. Or can
slow some down.


One solution could be to actually allow using this together with the
existing bitmap. Userspace can then decide whether it wants to block
VCPU on ring full, or just record ring full condition and recover by
bitmap scanning.


> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index b19ef421084d..0acee817adfb 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -5,7 +5,8 @@ ccflags-y += -Iarch/x86/kvm
> KVM := ../../../virt/kvm
>
> kvm-y += $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> - $(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
> + $(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \
> + $(KVM)/dirty_ring.o
> kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
>
> kvm-y += x86.o emulate.o i8259.o irq.o lapic.o \
> diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
> new file mode 100644
> index 000000000000..8335635b7ff7
> --- /dev/null
> +++ b/include/linux/kvm_dirty_ring.h
> @@ -0,0 +1,67 @@
> +#ifndef KVM_DIRTY_RING_H
> +#define KVM_DIRTY_RING_H
> +
> +/*
> + * struct kvm_dirty_ring is defined in include/uapi/linux/kvm.h.
> + *
> + * dirty_ring: shared with userspace via mmap. It is the compact list
> + * that holds the dirty pages.
> + * dirty_index: free running counter that points to the next slot in
> + * dirty_ring->dirty_gfns where a new dirty page should go.
> + * reset_index: free running counter that points to the next dirty page
> + * in dirty_ring->dirty_gfns for which dirty trap needs to
> + * be reenabled
> + * size: size of the compact list, dirty_ring->dirty_gfns
> + * soft_limit: when the number of dirty pages in the list reaches this
> + * limit, vcpu that owns this ring should exit to userspace
> + * to allow userspace to harvest all the dirty pages
> + * lock: protects dirty_ring, only in use if this is the global
> + * ring
> + *
> + * The number of dirty pages in the ring is calculated by,
> + * dirty_index - reset_index
> + *
> + * kernel increments dirty_ring->indices.avail_index after dirty index
> + * is incremented. When userspace harvests the dirty pages, it increments
> + * dirty_ring->indices.fetch_index up to dirty_ring->indices.avail_index.
> + * When kernel reenables dirty traps for the dirty pages, it increments
> + * reset_index up to dirty_ring->indices.fetch_index.
> + *
> + */
> +struct kvm_dirty_ring {
> + u32 dirty_index;
> + u32 reset_index;
> + u32 size;
> + u32 soft_limit;
> + spinlock_t lock;
> + struct kvm_dirty_gfn *dirty_gfns;
> +};
> +
> +u32 kvm_dirty_ring_get_rsvd_entries(void);
> +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring);
> +
> +/*
> + * called with kvm->slots_lock held, returns the number of
> + * processed pages.
> + */
> +int kvm_dirty_ring_reset(struct kvm *kvm,
> + struct kvm_dirty_ring *ring,
> + struct kvm_dirty_ring_indexes *indexes);
> +
> +/*
> + * returns 0: successfully pushed
> + * 1: successfully pushed, soft limit reached,
> + * vcpu should exit to userspace
> + * -EBUSY: unable to push, dirty ring full.
> + */
> +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> + struct kvm_dirty_ring_indexes *indexes,
> + u32 slot, u64 offset, bool lock);
> +
> +/* for use in vm_operations_struct */
> +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i);
> +
> +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
> +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring);
> +
> +#endif
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 498a39462ac1..7b747bc9ff3e 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -34,6 +34,7 @@
> #include <linux/kvm_types.h>
>
> #include <asm/kvm_host.h>
> +#include <linux/kvm_dirty_ring.h>
>
> #ifndef KVM_MAX_VCPU_ID
> #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
> @@ -146,6 +147,7 @@ static inline bool is_error_page(struct page *page)
> #define KVM_REQ_MMU_RELOAD (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> #define KVM_REQ_PENDING_TIMER 2
> #define KVM_REQ_UNHALT 3
> +#define KVM_REQ_DIRTY_RING_FULL 4
> #define KVM_REQUEST_ARCH_BASE 8
>
> #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
> @@ -321,6 +323,7 @@ struct kvm_vcpu {
> bool ready;
> struct kvm_vcpu_arch arch;
> struct dentry *debugfs_dentry;
> + struct kvm_dirty_ring dirty_ring;
> };
>
> static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
> @@ -501,6 +504,10 @@ struct kvm {
> struct srcu_struct srcu;
> struct srcu_struct irq_srcu;
> pid_t userspace_pid;
> + /* Data structure to be exported by mmap(kvm->fd, 0) */
> + struct kvm_vm_run *vm_run;
> + u32 dirty_ring_size;
> + struct kvm_dirty_ring vm_dirty_ring;
> };
>
> #define kvm_err(fmt, ...) \
> @@ -832,6 +839,8 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> gfn_t gfn_offset,
> unsigned long mask);
>
> +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask);
> +
> int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
> struct kvm_dirty_log *log);
> int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> @@ -1411,4 +1420,28 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
> uintptr_t data, const char *name,
> struct task_struct **thread_ptr);
>
> +/*
> + * This defines how many reserved entries we want to keep before we
> + * kick the vcpu to the userspace to avoid dirty ring full. This
> + * value can be tuned to higher if e.g. PML is enabled on the host.
> + */
> +#define KVM_DIRTY_RING_RSVD_ENTRIES 64
> +
> +/* Max number of entries allowed for each kvm dirty ring */
> +#define KVM_DIRTY_RING_MAX_ENTRIES 65536
> +
> +/*
> + * Arch needs to define these macro after implementing the dirty ring
> + * feature. KVM_DIRTY_LOG_PAGE_OFFSET should be defined as the
> + * starting page offset of the dirty ring structures,

Confused: offset where? You set a default for everyone - where would an
arch want to override it?

> while
> + * KVM_DIRTY_RING_VERSION should be defined as >=1. By default, this
> + * feature is off on all archs.
> + */
> +#ifndef KVM_DIRTY_LOG_PAGE_OFFSET
> +#define KVM_DIRTY_LOG_PAGE_OFFSET 0
> +#endif
> +#ifndef KVM_DIRTY_RING_VERSION
> +#define KVM_DIRTY_RING_VERSION 0
> +#endif

One-way versioning, with no feature bits and no negotiation, will make
it hard to change things down the road. What's wrong with the existing
KVM capabilities that you feel there's a need for dedicated versioning
for this?

> +
> #endif
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index 1c88e69db3d9..d9d03eea145a 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -11,6 +11,7 @@ struct kvm_irq_routing_table;
> struct kvm_memory_slot;
> struct kvm_one_reg;
> struct kvm_run;
> +struct kvm_vm_run;
> struct kvm_userspace_memory_region;
> struct kvm_vcpu;
> struct kvm_vcpu_init;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index e6f17c8e2dba..0b88d76d6215 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -236,6 +236,7 @@ struct kvm_hyperv_exit {
> #define KVM_EXIT_IOAPIC_EOI 26
> #define KVM_EXIT_HYPERV 27
> #define KVM_EXIT_ARM_NISV 28
> +#define KVM_EXIT_DIRTY_RING_FULL 29
>
> /* For KVM_EXIT_INTERNAL_ERROR */
> /* Emulate instruction failed. */
> @@ -247,6 +248,11 @@ struct kvm_hyperv_exit {
> /* Encounter unexpected vm-exit reason */
> #define KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON 4
>
> +struct kvm_dirty_ring_indexes {
> + __u32 avail_index; /* set by kernel */
> + __u32 fetch_index; /* set by userspace */
> +};
> +
> /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
> struct kvm_run {
> /* in */
> @@ -421,6 +427,13 @@ struct kvm_run {
> struct kvm_sync_regs regs;
> char padding[SYNC_REGS_SIZE_BYTES];
> } s;
> +
> + struct kvm_dirty_ring_indexes vcpu_ring_indexes;
> +};
> +
> +/* Returned by mmap(kvm->fd, offset=0) */
> +struct kvm_vm_run {
> + struct kvm_dirty_ring_indexes vm_ring_indexes;
> };
>
> /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
> @@ -1009,6 +1022,7 @@ struct kvm_ppc_resize_hpt {
> #define KVM_CAP_PPC_GUEST_DEBUG_SSTEP 176
> #define KVM_CAP_ARM_NISV_TO_USER 177
> #define KVM_CAP_ARM_INJECT_EXT_DABT 178
> +#define KVM_CAP_DIRTY_LOG_RING 179
>
> #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -1472,6 +1486,9 @@ struct kvm_enc_region {
> /* Available with KVM_CAP_ARM_SVE */
> #define KVM_ARM_VCPU_FINALIZE _IOW(KVMIO, 0xc2, int)
>
> +/* Available with KVM_CAP_DIRTY_LOG_RING */
> +#define KVM_RESET_DIRTY_RINGS _IO(KVMIO, 0xc3)
> +
> /* Secure Encrypted Virtualization command */
> enum sev_cmd_id {
> /* Guest initialization commands */
> @@ -1622,4 +1639,23 @@ struct kvm_hyperv_eventfd {
> #define KVM_HYPERV_CONN_ID_MASK 0x00ffffff
> #define KVM_HYPERV_EVENTFD_DEASSIGN (1 << 0)
>
> +/*
> + * The following are the requirements for supporting dirty log ring
> + * (by enabling KVM_DIRTY_LOG_PAGE_OFFSET).
> + *
> + * 1. Memory accesses by KVM should call kvm_vcpu_write_* instead
> + * of kvm_write_* so that the global dirty ring is not filled up
> + * too quickly.
> + * 2. kvm_arch_mmu_enable_log_dirty_pt_masked should be defined for
> + * enabling dirty logging.
> + * 3. There should not be a separate step to synchronize hardware
> + * dirty bitmap with KVM's.
> + */
> +
> +struct kvm_dirty_gfn {
> + __u32 pad;
> + __u32 slot;
> + __u64 offset;
> +};
> +
> #endif /* __LINUX_KVM_H */
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> new file mode 100644
> index 000000000000..9264891f3c32
> --- /dev/null
> +++ b/virt/kvm/dirty_ring.c
> @@ -0,0 +1,156 @@
> +#include <linux/kvm_host.h>
> +#include <linux/kvm.h>
> +#include <linux/vmalloc.h>
> +#include <linux/kvm_dirty_ring.h>
> +
> +u32 kvm_dirty_ring_get_rsvd_entries(void)
> +{
> + return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
> +}
> +
> +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring)
> +{
> + u32 size = kvm->dirty_ring_size;
> +
> + ring->dirty_gfns = vmalloc(size);

So that's half a megabyte of kernel memory per VM that userspace ties up.
Do we really have to, though? Why not take a userspace pointer,
write it with copy_to_user(), and sidestep all this?
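
Roughly something like the sketch below, assuming userspace registers a
__user pointer when it enables the capability (the ugfns and uavail
fields are made up for illustration):

	static int push_dirty_gfn_to_user(struct kvm_dirty_ring *ring,
					  u32 slot, u64 offset)
	{
		struct kvm_dirty_gfn gfn = { .slot = slot, .offset = offset };
		u32 idx = ring->dirty_index & (ring->size - 1);

		/* ring->ugfns is a struct kvm_dirty_gfn __user * (hypothetical). */
		if (copy_to_user(&ring->ugfns[idx], &gfn, sizeof(gfn)))
			return -EFAULT;

		ring->dirty_index++;
		/* Publish the new producer index (ring->uavail is hypothetical). */
		return put_user(ring->dirty_index, ring->uavail);
	}

One wrinkle with that approach is that copy_to_user()/put_user() can fault
and sleep, so it cannot be used with spinlocks held or from atomic context,
which matters for some of the paths that mark pages dirty.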

> + if (!ring->dirty_gfns)
> + return -ENOMEM;
> + memset(ring->dirty_gfns, 0, size);
> +
> + ring->size = size / sizeof(struct kvm_dirty_gfn);
> + ring->soft_limit =
> + (kvm->dirty_ring_size / sizeof(struct kvm_dirty_gfn)) -
> + kvm_dirty_ring_get_rsvd_entries();
> + ring->dirty_index = 0;
> + ring->reset_index = 0;
> + spin_lock_init(&ring->lock);
> +
> + return 0;
> +}
> +
> +int kvm_dirty_ring_reset(struct kvm *kvm,
> + struct kvm_dirty_ring *ring,
> + struct kvm_dirty_ring_indexes *indexes)
> +{
> + u32 cur_slot, next_slot;
> + u64 cur_offset, next_offset;
> + unsigned long mask;
> + u32 fetch;
> + int count = 0;
> + struct kvm_dirty_gfn *entry;
> +
> + fetch = READ_ONCE(indexes->fetch_index);
> + if (fetch == ring->reset_index)
> + return 0;
> +
> + entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> + /*
> + * The ring buffer is shared with userspace, which might mmap
> + * it and concurrently modify slot and offset. Userspace must
> + * not be trusted! READ_ONCE prevents the compiler from changing
> + * the values after they've been range-checked (the checks are
> + * in kvm_reset_dirty_gfn).

What it doesn't do is prevent speculative attacks. That's why things like
copy_from_user() have a speculation barrier. Instead of worrying about
that, unless it's really performance-critical, I think you'd do well to
just use copy to/from user.
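
If the shared mapping is kept, the usual way to close the speculation
window is array_index_nospec() from <linux/nospec.h> after the range
check, e.g. in kvm_reset_dirty_gfn (just a sketch, not in the posted
patch):

	as_id = slot >> 16;
	id = (u16)slot;
	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
		return;
	/* Clamp so the indices cannot be used speculatively out of bounds. */
	as_id = array_index_nospec(as_id, KVM_ADDRESS_SPACE_NUM);
	id = array_index_nospec(id, KVM_USER_MEM_SLOTS);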

> + */
> + smp_read_barrier_depends();

What depends on what here? Looks suspicious ...

> + cur_slot = READ_ONCE(entry->slot);
> + cur_offset = READ_ONCE(entry->offset);
> + mask = 1;
> + count++;
> + ring->reset_index++;
> + while (ring->reset_index != fetch) {
> + entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> + smp_read_barrier_depends();

Same concerns here.

> + next_slot = READ_ONCE(entry->slot);
> + next_offset = READ_ONCE(entry->offset);
> + ring->reset_index++;
> + count++;
> + /*
> + * Try to coalesce the reset operations when the guest is
> + * scanning pages in the same slot.

What does "guest scanning" mean here?

> + */
> + if (next_slot == cur_slot) {
> + int delta = next_offset - cur_offset;
> +
> + if (delta >= 0 && delta < BITS_PER_LONG) {
> + mask |= 1ull << delta;
> + continue;
> + }
> +
> + /* Backwards visit, careful about overflows! */
> + if (delta > -BITS_PER_LONG && delta < 0 &&
> + (mask << -delta >> -delta) == mask) {
> + cur_offset = next_offset;
> + mask = (mask << -delta) | 1;
> + continue;
> + }
> + }
> + kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> + cur_slot = next_slot;
> + cur_offset = next_offset;
> + mask = 1;
> + }
> + kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> +
> + return count;
> +}
> +
> +static inline u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
> +{
> + return ring->dirty_index - ring->reset_index;
> +}
> +
> +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring)
> +{
> + return kvm_dirty_ring_used(ring) >= ring->size;
> +}
> +
> +/*
> + * Returns:
> + * >0 if we should kick the vcpu out,
> + * =0 if the gfn pushed successfully, or,
> + * <0 if error (e.g. ring full)
> + */
> +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> + struct kvm_dirty_ring_indexes *indexes,
> + u32 slot, u64 offset, bool lock)
> +{
> + int ret;
> + struct kvm_dirty_gfn *entry;
> +
> + if (lock)
> + spin_lock(&ring->lock);

What's the story around locking here? Why is it safe
not to take the lock sometimes?

> +
> + if (kvm_dirty_ring_full(ring)) {
> + ret = -EBUSY;
> + goto out;
> + }
> +
> + entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> + entry->slot = slot;
> + entry->offset = offset;
> + smp_wmb();
> + ring->dirty_index++;
> + WRITE_ONCE(indexes->avail_index, ring->dirty_index);
> + ret = kvm_dirty_ring_used(ring) >= ring->soft_limit;
> + pr_info("%s: slot %u offset %llu used %u\n",
> + __func__, slot, offset, kvm_dirty_ring_used(ring));
> +
> +out:
> + if (lock)
> + spin_unlock(&ring->lock);
> +
> + return ret;
> +}
> +
> +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i)
> +{
> + return vmalloc_to_page((void *)ring->dirty_gfns + i * PAGE_SIZE);
> +}
> +
> +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
> +{
> + if (ring->dirty_gfns) {
> + vfree(ring->dirty_gfns);
> + ring->dirty_gfns = NULL;
> + }
> +}
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 681452d288cd..8642c977629b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -64,6 +64,8 @@
> #define CREATE_TRACE_POINTS
> #include <trace/events/kvm.h>
>
> +#include <linux/kvm_dirty_ring.h>
> +
> /* Worst case buffer size needed for holding an integer. */
> #define ITOA_MAX_LEN 12
>
> @@ -149,6 +151,10 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
> struct kvm_vcpu *vcpu,
> struct kvm_memory_slot *memslot,
> gfn_t gfn);
> +static void mark_page_dirty_in_ring(struct kvm *kvm,
> + struct kvm_vcpu *vcpu,
> + struct kvm_memory_slot *slot,
> + gfn_t gfn);
>
> __visible bool kvm_rebooting;
> EXPORT_SYMBOL_GPL(kvm_rebooting);
> @@ -359,11 +365,22 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
> vcpu->preempted = false;
> vcpu->ready = false;
>
> + if (kvm->dirty_ring_size) {
> + r = kvm_dirty_ring_alloc(vcpu->kvm, &vcpu->dirty_ring);
> + if (r) {
> + kvm->dirty_ring_size = 0;
> + goto fail_free_run;
> + }
> + }
> +
> r = kvm_arch_vcpu_init(vcpu);
> if (r < 0)
> - goto fail_free_run;
> + goto fail_free_ring;
> return 0;
>
> +fail_free_ring:
> + if (kvm->dirty_ring_size)
> + kvm_dirty_ring_free(&vcpu->dirty_ring);
> fail_free_run:
> free_page((unsigned long)vcpu->run);
> fail:
> @@ -381,6 +398,8 @@ void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
> put_pid(rcu_dereference_protected(vcpu->pid, 1));
> kvm_arch_vcpu_uninit(vcpu);
> free_page((unsigned long)vcpu->run);
> + if (vcpu->kvm->dirty_ring_size)
> + kvm_dirty_ring_free(&vcpu->dirty_ring);
> }
> EXPORT_SYMBOL_GPL(kvm_vcpu_uninit);
>
> @@ -690,6 +709,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
> struct kvm *kvm = kvm_arch_alloc_vm();
> int r = -ENOMEM;
> int i;
> + struct page *page;
>
> if (!kvm)
> return ERR_PTR(-ENOMEM);
> @@ -705,6 +725,14 @@ static struct kvm *kvm_create_vm(unsigned long type)
>
> BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
>
> + page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> + if (!page) {
> + r = -ENOMEM;
> + goto out_err_alloc_page;
> + }
> + kvm->vm_run = page_address(page);

So that's 4K with just 8 bytes used. Not as bad as the half megabyte for
the ring, but still. What is wrong with just taking a pointer and calling put_user()?
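
I.e. something like this sketch instead of allocating a whole page
(vm_indexes_uptr would be a made-up __user pointer handed in by
userspace):

	/* Publish the per-vm producer index without any shared kernel page. */
	if (put_user(ring->dirty_index, &kvm->vm_indexes_uptr->avail_index))
		return -EFAULT;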

> + BUILD_BUG_ON(sizeof(struct kvm_vm_run) > PAGE_SIZE);
> +
> if (init_srcu_struct(&kvm->srcu))
> goto out_err_no_srcu;
> if (init_srcu_struct(&kvm->irq_srcu))
> @@ -775,6 +803,9 @@ static struct kvm *kvm_create_vm(unsigned long type)
> out_err_no_irq_srcu:
> cleanup_srcu_struct(&kvm->srcu);
> out_err_no_srcu:
> + free_page((unsigned long)page);
> + kvm->vm_run = NULL;
> +out_err_alloc_page:
> kvm_arch_free_vm(kvm);
> mmdrop(current->mm);
> return ERR_PTR(r);
> @@ -800,6 +831,15 @@ static void kvm_destroy_vm(struct kvm *kvm)
> int i;
> struct mm_struct *mm = kvm->mm;
>
> + if (kvm->dirty_ring_size) {
> + kvm_dirty_ring_free(&kvm->vm_dirty_ring);
> + }
> +
> + if (kvm->vm_run) {
> + free_page((unsigned long)kvm->vm_run);
> + kvm->vm_run = NULL;
> + }
> +
> kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
> kvm_destroy_vm_debugfs(kvm);
> kvm_arch_sync_events(kvm);
> @@ -2301,7 +2341,7 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
> {
> if (memslot && memslot->dirty_bitmap) {
> unsigned long rel_gfn = gfn - memslot->base_gfn;
> -
> + mark_page_dirty_in_ring(kvm, vcpu, memslot, gfn);
> set_bit_le(rel_gfn, memslot->dirty_bitmap);
> }
> }
> @@ -2649,6 +2689,13 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
> }
> EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
>
> +static bool kvm_fault_in_dirty_ring(struct kvm *kvm, struct vm_fault *vmf)
> +{
> + return (vmf->pgoff >= KVM_DIRTY_LOG_PAGE_OFFSET) &&
> + (vmf->pgoff < KVM_DIRTY_LOG_PAGE_OFFSET +
> + kvm->dirty_ring_size / PAGE_SIZE);
> +}
> +
> static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
> {
> struct kvm_vcpu *vcpu = vmf->vma->vm_file->private_data;
> @@ -2664,6 +2711,10 @@ static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
> else if (vmf->pgoff == KVM_COALESCED_MMIO_PAGE_OFFSET)
> page = virt_to_page(vcpu->kvm->coalesced_mmio_ring);
> #endif
> + else if (kvm_fault_in_dirty_ring(vcpu->kvm, vmf))
> + page = kvm_dirty_ring_get_page(
> + &vcpu->dirty_ring,
> + vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
> else
> return kvm_arch_vcpu_fault(vcpu, vmf);
> get_page(page);
> @@ -3259,12 +3310,162 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> #endif
> case KVM_CAP_NR_MEMSLOTS:
> return KVM_USER_MEM_SLOTS;
> + case KVM_CAP_DIRTY_LOG_RING:
> + /* Version will be zero if arch didn't implement it */
> + return KVM_DIRTY_RING_VERSION;
> default:
> break;
> }
> return kvm_vm_ioctl_check_extension(kvm, arg);
> }
>
> +static void mark_page_dirty_in_ring(struct kvm *kvm,
> + struct kvm_vcpu *vcpu,
> + struct kvm_memory_slot *slot,
> + gfn_t gfn)
> +{
> + u32 as_id = 0;
> + u64 offset;
> + int ret;
> + struct kvm_dirty_ring *ring;
> + struct kvm_dirty_ring_indexes *indexes;
> + bool is_vm_ring;
> +
> + if (!kvm->dirty_ring_size)
> + return;
> +
> + offset = gfn - slot->base_gfn;
> +
> + if (vcpu) {
> + as_id = kvm_arch_vcpu_memslots_id(vcpu);
> + } else {
> + as_id = 0;
> + vcpu = kvm_get_running_vcpu();
> + }
> +
> + if (vcpu) {
> + ring = &vcpu->dirty_ring;
> + indexes = &vcpu->run->vcpu_ring_indexes;
> + is_vm_ring = false;
> + } else {
> + /*
> + * Put onto per vm ring because no vcpu context. Kick
> + * vcpu0 if ring is full.

What about tasks on vcpu 0? Do guests realize it's a bad idea to put
critical tasks there, since they will be penalized disproportionately?

> + */
> + vcpu = kvm->vcpus[0];
> + ring = &kvm->vm_dirty_ring;
> + indexes = &kvm->vm_run->vm_ring_indexes;
> + is_vm_ring = true;
> + }
> +
> + ret = kvm_dirty_ring_push(ring, indexes,
> + (as_id << 16)|slot->id, offset,
> + is_vm_ring);
> + if (ret < 0) {
> + if (is_vm_ring)
> + pr_warn_once("vcpu %d dirty log overflow\n",
> + vcpu->vcpu_id);
> + else
> + pr_warn_once("per-vm dirty log overflow\n");
> + return;
> + }
> +
> + if (ret)
> + kvm_make_request(KVM_REQ_DIRTY_RING_FULL, vcpu);
> +}
> +
> +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
> +{
> + struct kvm_memory_slot *memslot;
> + int as_id, id;
> +
> + as_id = slot >> 16;
> + id = (u16)slot;
> + if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
> + return;
> +
> + memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
> + if (offset >= memslot->npages)
> + return;
> +
> + spin_lock(&kvm->mmu_lock);
> + /* FIXME: we should use a single AND operation, but there is no
> + * applicable atomic API.
> + */
> + while (mask) {
> + clear_bit_le(offset + __ffs(mask), memslot->dirty_bitmap);
> + mask &= mask - 1;
> + }
> +
> + kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
> + spin_unlock(&kvm->mmu_lock);
> +}
> +
> +static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
> +{
> + int r;
> +
> + /* the size should be power of 2 */
> + if (!size || (size & (size - 1)))
> + return -EINVAL;
> +
> + /* Should be bigger to keep the reserved entries, or a page */
> + if (size < kvm_dirty_ring_get_rsvd_entries() *
> + sizeof(struct kvm_dirty_gfn) || size < PAGE_SIZE)
> + return -EINVAL;
> +
> + if (size > KVM_DIRTY_RING_MAX_ENTRIES *
> + sizeof(struct kvm_dirty_gfn))
> + return -E2BIG;

KVM_DIRTY_RING_MAX_ENTRIES is not part of the UAPI.
So how does userspace know what's legal?
Do you expect it to just try?
More likely it will just copy the number from the kernel source, and then
we can never ever make it smaller.
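
With the interface as posted, about the only thing userspace can do is
probe by trying, along the lines of this userspace sketch (assumes
<linux/kvm.h>, <sys/ioctl.h>, <errno.h> and <stdint.h>, and a vm_fd from
KVM_CREATE_VM):

	/* Halve the requested ring size until the kernel accepts it. */
	static uint32_t enable_dirty_ring(int vm_fd, uint32_t bytes)
	{
		struct kvm_enable_cap cap = { .cap = KVM_CAP_DIRTY_LOG_RING };

		while (bytes >= 4096) {
			cap.args[0] = bytes;
			if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap) == 0)
				return bytes;	/* accepted */
			if (errno != E2BIG)
				break;		/* some other error */
			bytes /= 2;
		}
		return 0;			/* rejected or unsupported */
	}

Exporting the limit in the UAPI, or returning it from the capability
check, would avoid that guesswork.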

> +
> + /* We only allow it to set once */
> + if (kvm->dirty_ring_size)
> + return -EINVAL;
> +
> + mutex_lock(&kvm->lock);
> +
> + if (kvm->created_vcpus) {
> + /* We don't allow to change this value after vcpu created */
> + r = -EINVAL;
> + } else {
> + kvm->dirty_ring_size = size;
> + r = kvm_dirty_ring_alloc(kvm, &kvm->vm_dirty_ring);
> + if (r) {
> + /* Unset dirty ring */
> + kvm->dirty_ring_size = 0;
> + }
> + }
> +
> + mutex_unlock(&kvm->lock);
> + return r;
> +}
> +
> +static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
> +{
> + int i;
> + struct kvm_vcpu *vcpu;
> + int cleared = 0;
> +
> + if (!kvm->dirty_ring_size)
> + return -EINVAL;
> +
> + mutex_lock(&kvm->slots_lock);
> +
> + cleared += kvm_dirty_ring_reset(kvm, &kvm->vm_dirty_ring,
> + &kvm->vm_run->vm_ring_indexes);
> +
> + kvm_for_each_vcpu(i, vcpu, kvm)
> + cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring,
> + &vcpu->run->vcpu_ring_indexes);
> +
> + mutex_unlock(&kvm->slots_lock);
> +
> + if (cleared)
> + kvm_flush_remote_tlbs(kvm);
> +
> + return cleared;
> +}
> +
> int __attribute__((weak)) kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> struct kvm_enable_cap *cap)
> {
> @@ -3282,6 +3483,8 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
> kvm->manual_dirty_log_protect = cap->args[0];
> return 0;
> #endif
> + case KVM_CAP_DIRTY_LOG_RING:
> + return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
> default:
> return kvm_vm_ioctl_enable_cap(kvm, cap);
> }
> @@ -3469,6 +3672,9 @@ static long kvm_vm_ioctl(struct file *filp,
> case KVM_CHECK_EXTENSION:
> r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
> break;
> + case KVM_RESET_DIRTY_RINGS:
> + r = kvm_vm_ioctl_reset_dirty_pages(kvm);
> + break;
> default:
> r = kvm_arch_vm_ioctl(filp, ioctl, arg);
> }
> @@ -3517,9 +3723,39 @@ static long kvm_vm_compat_ioctl(struct file *filp,
> }
> #endif
>
> +static vm_fault_t kvm_vm_fault(struct vm_fault *vmf)
> +{
> + struct kvm *kvm = vmf->vma->vm_file->private_data;
> + struct page *page = NULL;
> +
> + if (vmf->pgoff == 0)
> + page = virt_to_page(kvm->vm_run);
> + else if (kvm_fault_in_dirty_ring(kvm, vmf))
> + page = kvm_dirty_ring_get_page(
> + &kvm->vm_dirty_ring,
> + vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
> + else
> + return VM_FAULT_SIGBUS;
> +
> + get_page(page);
> + vmf->page = page;
> + return 0;
> +}
> +
> +static const struct vm_operations_struct kvm_vm_vm_ops = {
> + .fault = kvm_vm_fault,
> +};
> +
> +static int kvm_vm_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + vma->vm_ops = &kvm_vm_vm_ops;
> + return 0;
> +}
> +
> static struct file_operations kvm_vm_fops = {
> .release = kvm_vm_release,
> .unlocked_ioctl = kvm_vm_ioctl,
> + .mmap = kvm_vm_mmap,
> .llseek = noop_llseek,
> KVM_COMPAT(kvm_vm_compat_ioctl),
> };
> --
> 2.21.0

2019-12-11 13:06:36

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On Wed, Dec 11, 2019 at 10:05:28AM +0100, Paolo Bonzini wrote:
> On 10/12/19 22:53, Michael S. Tsirkin wrote:
> > On Tue, Dec 10, 2019 at 11:02:11AM -0500, Peter Xu wrote:
> >> On Tue, Dec 10, 2019 at 02:31:54PM +0100, Paolo Bonzini wrote:
> >>> On 10/12/19 14:25, Michael S. Tsirkin wrote:
> >>>>> There is no new infrastructure to track the dirty pages---it's just a
> >>>>> different way to pass them to userspace.
> >>>> Did you guys consider using one of the virtio ring formats?
> >>>> Maybe reusing vhost code?
> >>>
> >>> There are no used/available entries here, it's unidirectional
> >>> (kernel->user).
> >>
> >> Agreed. Vring could be an overkill IMHO (the whole dirty_ring.c is
> >> 100+ LOC only).
> >
> > I guess you don't do polling/ event suppression and other tricks that
> > virtio came up with for speed then?

I looked at the code finally; there's actually an avail index, and fetch is
exactly like virtio's used. I'm not saying the existing code is a great fit
for you, as you have an extra slot parameter to pass and the roles are
reversed compared to vhost, with the kernel being the driver and userspace
the device (vringh might fit, though it would need to be updated to support
packed rings). But sticking to an existing format is a good idea IMHO,
or if not, I think it's not a bad idea to add some justification.

> There are no interrupts either, so no need for event suppression. You
> have vmexits when the ring gets full (and that needs to be synchronous),
> but apart from that the migration thread will poll the rings once when
> it needs to send more pages.
>
> Paolo

OK don't use that then.

--
MST

2019-12-11 13:44:03

by Christophe de Dinechin

[permalink] [raw]
Subject: Re: [PATCH RFC 00/15] KVM: Dirty ring interface


Peter Xu writes:

> Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring
>
> Overview
> ============
>
> This is a continued work from Lei Cao <[email protected]> and Paolo
> on the KVM dirty ring interface. To make it simple, I'll still start
> with version 1 as RFC.
>
> The new dirty ring interface is another way to collect dirty pages for
> the virtual machine, but it is different from the existing dirty
> logging interface in a few ways, majorly:
>
> - Data format: The dirty data was in a ring format rather than a
> bitmap format, so the size of data to sync for dirty logging does
> not depend on the size of guest memory any more, but speed of
> dirtying. Also, the dirty ring is per-vcpu (currently plus
> another per-vm ring, so total ring number is N+1), while the dirty
> bitmap is per-vm.

I like Sean's suggestion to fetch rings when dirtying. That could reduce
the number of dirty rings to examine.

Also, as is, this means that the same gfn may be present in multiple
rings, right?

>
> - Data copy: The sync of dirty pages does not need data copy any more,
> but instead the ring is shared between the userspace and kernel by
> page sharings (mmap() on either the vm fd or vcpu fd)
>
> - Interface: Instead of using the old KVM_GET_DIRTY_LOG,
> KVM_CLEAR_DIRTY_LOG interfaces, the new ring uses a new interface
> called KVM_RESET_DIRTY_RINGS when we want to reset the collected
> dirty pages to protected mode again (works like
> KVM_CLEAR_DIRTY_LOG, but ring based)
>
> And more.
>
> I would appreciate if the reviewers can start with patch "KVM:
> Implement ring-based dirty memory tracking", especially the document
> update part for the big picture. Then I'll avoid copying into most of
> them into cover letter again.
>
> I marked this series as RFC because I'm at least uncertain on this
> change of vcpu_enter_guest():
>
> if (kvm_check_request(KVM_REQ_DIRTY_RING_FULL, vcpu)) {
> vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> /*
> * If this is requested, it means that we've
> * marked the dirty bit in the dirty ring BUT
> * we've not written the date. Do it now.

not written the "data" ?

> */
> r = kvm_emulate_instruction(vcpu, 0);
> r = r >= 0 ? 0 : r;
> goto out;
> }
>
> I did a kvm_emulate_instruction() when dirty ring reaches softlimit
> and want to exit to userspace, however I'm not really sure whether
> there could have any side effect. I'd appreciate any comment of
> above, or anything else.
>
> Tests
> ===========
>
> I wanted to continue work on the QEMU part, but after I noticed that
> the interface might still prone to change, I posted this series first.
> However to make sure it's at least working, I've provided unit tests
> together with the series. The unit tests should be able to test the
> series in at least three major paths:
>
> (1) ./dirty_log_test -M dirty-ring
>
> This tests async ring operations: this should be the major work
> mode for the dirty ring interface, say, when the kernel is
> queuing more data, the userspace is collecting too. Ring can
> hardly reaches full when working like this, because in most
> cases the collection could be fast.
>
> (2) ./dirty_log_test -M dirty-ring -c 1024
>
> This set the ring size to be very small so that ring soft-full
> always triggers (soft-full is a soft limit of the ring state,
> when the dirty ring reaches the soft limit it'll do a userspace
> exit and let the userspace to collect the data).
>
> (3) ./dirty_log_test -M dirty-ring-wait-queue
>
> This sololy test the extreme case where ring is full. When the
> ring is completely full, the thread (no matter vcpu or not) will
> be put onto a per-vm waitqueue, and KVM_RESET_DIRTY_RINGS will
> wake the threads up (assuming until which the ring will not be
> full any more).

Am I correct assuming that guest memory can be dirtied by DMA operations?
Should

Not being that familiar with the current implementation of dirty page
tracking, I wonder who marks the pages dirty in that case, and when?
If the VM ring is used for I/O threads, isn't it possible that a large
DMA could dirty a sufficiently large number of GFNs to overflow the
associated ring? Does this case need a separate way to queue the
dirtying I/O thread?

>
> Thanks,
>
> Cao, Lei (2):
> KVM: Add kvm/vcpu argument to mark_dirty_page_in_slot
> KVM: X86: Implement ring-based dirty memory tracking
>
> Paolo Bonzini (1):
> KVM: Move running VCPU from ARM to common code
>
> Peter Xu (12):
> KVM: Add build-time error check on kvm_run size
> KVM: Implement ring-based dirty memory tracking
> KVM: Make dirty ring exclusive to dirty bitmap log
> KVM: Introduce dirty ring wait queue
> KVM: selftests: Always clear dirty bitmap after iteration
> KVM: selftests: Sync uapi/linux/kvm.h to tools/
> KVM: selftests: Use a single binary for dirty/clear log test
> KVM: selftests: Introduce after_vcpu_run hook for dirty log test
> KVM: selftests: Add dirty ring buffer test
> KVM: selftests: Let dirty_log_test async for dirty ring test
> KVM: selftests: Add "-c" parameter to dirty log test
> KVM: selftests: Test dirty ring waitqueue
>
> Documentation/virt/kvm/api.txt | 116 +++++
> arch/arm/include/asm/kvm_host.h | 2 -
> arch/arm64/include/asm/kvm_host.h | 2 -
> arch/x86/include/asm/kvm_host.h | 5 +
> arch/x86/include/uapi/asm/kvm.h | 1 +
> arch/x86/kvm/Makefile | 3 +-
> arch/x86/kvm/mmu/mmu.c | 6 +
> arch/x86/kvm/vmx/vmx.c | 7 +
> arch/x86/kvm/x86.c | 12 +
> include/linux/kvm_dirty_ring.h | 67 +++
> include/linux/kvm_host.h | 37 ++
> include/linux/kvm_types.h | 1 +
> include/uapi/linux/kvm.h | 36 ++
> tools/include/uapi/linux/kvm.h | 47 ++
> tools/testing/selftests/kvm/Makefile | 2 -
> .../selftests/kvm/clear_dirty_log_test.c | 2 -
> tools/testing/selftests/kvm/dirty_log_test.c | 452 ++++++++++++++++--
> .../testing/selftests/kvm/include/kvm_util.h | 6 +
> tools/testing/selftests/kvm/lib/kvm_util.c | 103 ++++
> .../selftests/kvm/lib/kvm_util_internal.h | 5 +
> virt/kvm/arm/arm.c | 29 --
> virt/kvm/arm/perf.c | 6 +-
> virt/kvm/arm/vgic/vgic-mmio.c | 15 +-
> virt/kvm/dirty_ring.c | 156 ++++++
> virt/kvm/kvm_main.c | 315 +++++++++++-
> 25 files changed, 1329 insertions(+), 104 deletions(-)
> create mode 100644 include/linux/kvm_dirty_ring.h
> delete mode 100644 tools/testing/selftests/kvm/clear_dirty_log_test.c
> create mode 100644 virt/kvm/dirty_ring.c


--
Cheers,
Christophe de Dinechin (IRC c3d)

2019-12-11 14:16:43

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On 11/12/19 13:53, Michael S. Tsirkin wrote:
>> +
>> +struct kvm_dirty_ring_indexes {
>> + __u32 avail_index; /* set by kernel */
>> + __u32 fetch_index; /* set by userspace */
>
> Sticking these next to each other seems to guarantee cache conflicts.

I don't think that's an issue because you'd have a conflict anyway on
the actual entry; userspace anyway has to read the kernel-written index,
which will cause cache traffic.

> Avail/Fetch seems to mimic Virtio's avail/used exactly.

No, avail_index/fetch_index are just the producer and consumer indices,
respectively. There is only one ring buffer, not two as in virtio.

> I am not saying
> you must reuse the code really, but I think you should take a hard look
> at e.g. the virtio packed ring structure. We spent a bunch of time
> optimizing it for cache utilization. It seems kernel is the driver,
> making entries available, and userspace the device, using them.
> Again let's not develop a thread about this, but I think
> this is something to consider and discuss in future versions
> of the patches.

Even in the packed ring you have two cache lines accessed, one for the
index and one for the descriptor. Here you have one, because the data
is embedded in the ring buffer.

>
>> +};
>> +
>> +While for each of the dirty entry it's defined as:
>> +
>> +struct kvm_dirty_gfn {
>
> What does GFN stand for?
>
>> + __u32 pad;
>> + __u32 slot; /* as_id | slot_id */
>> + __u64 offset;
>> +};
>
Offset of what? A 4K page, right? Seems like a waste, e.g. for
hugetlbfs... How about replacing pad with a size instead?

No, it's an offset in the memslot (which will usually be >4GB for any VM
with bigger memory than that).

Paolo

2019-12-11 14:18:06

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH RFC 00/15] KVM: Dirty ring interface

On 11/12/19 14:41, Christophe de Dinechin wrote:
>
> Peter Xu writes:
>
>> Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring
>>
>> Overview
>> ============
>>
>> This is a continued work from Lei Cao <[email protected]> and Paolo
>> on the KVM dirty ring interface. To make it simple, I'll still start
>> with version 1 as RFC.
>>
>> The new dirty ring interface is another way to collect dirty pages for
>> the virtual machine, but it is different from the existing dirty
>> logging interface in a few ways, majorly:
>>
>> - Data format: The dirty data was in a ring format rather than a
>> bitmap format, so the size of data to sync for dirty logging does
>> not depend on the size of guest memory any more, but speed of
>> dirtying. Also, the dirty ring is per-vcpu (currently plus
>> another per-vm ring, so total ring number is N+1), while the dirty
>> bitmap is per-vm.
>
> I like Sean's suggestion to fetch rings when dirtying. That could reduce
> the number of dirty rings to examine.

What do you mean by "fetch rings"?

> Also, as is, this means that the same gfn may be present in multiple
> rings, right?

I think the actual marking of a page as dirty is protected by a spinlock
but I will defer to Peter on this.

Paolo

>>
>> - Data copy: The sync of dirty pages does not need data copy any more,
>> but instead the ring is shared between the userspace and kernel by
>> page sharings (mmap() on either the vm fd or vcpu fd)
>>
>> - Interface: Instead of using the old KVM_GET_DIRTY_LOG,
>> KVM_CLEAR_DIRTY_LOG interfaces, the new ring uses a new interface
>> called KVM_RESET_DIRTY_RINGS when we want to reset the collected
>> dirty pages to protected mode again (works like
>> KVM_CLEAR_DIRTY_LOG, but ring based)
>>
>> And more.
>>
>> I would appreciate if the reviewers can start with patch "KVM:
>> Implement ring-based dirty memory tracking", especially the document
>> update part for the big picture. Then I'll avoid copying into most of
>> them into cover letter again.
>>
>> I marked this series as RFC because I'm at least uncertain on this
>> change of vcpu_enter_guest():
>>
>> if (kvm_check_request(KVM_REQ_DIRTY_RING_FULL, vcpu)) {
>> vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
>> /*
>> * If this is requested, it means that we've
>> * marked the dirty bit in the dirty ring BUT
>> * we've not written the date. Do it now.
>
> not written the "data" ?
>
>> */
>> r = kvm_emulate_instruction(vcpu, 0);
>> r = r >= 0 ? 0 : r;
>> goto out;
>> }
>>
>> I did a kvm_emulate_instruction() when dirty ring reaches softlimit
>> and want to exit to userspace, however I'm not really sure whether
>> there could have any side effect. I'd appreciate any comment of
>> above, or anything else.
>>
>> Tests
>> ===========
>>
>> I wanted to continue work on the QEMU part, but after I noticed that
>> the interface might still prone to change, I posted this series first.
>> However to make sure it's at least working, I've provided unit tests
>> together with the series. The unit tests should be able to test the
>> series in at least three major paths:
>>
>> (1) ./dirty_log_test -M dirty-ring
>>
>> This tests async ring operations: this should be the major work
>> mode for the dirty ring interface, say, when the kernel is
>> queuing more data, the userspace is collecting too. Ring can
>> hardly reaches full when working like this, because in most
>> cases the collection could be fast.
>>
>> (2) ./dirty_log_test -M dirty-ring -c 1024
>>
>> This set the ring size to be very small so that ring soft-full
>> always triggers (soft-full is a soft limit of the ring state,
>> when the dirty ring reaches the soft limit it'll do a userspace
>> exit and let the userspace to collect the data).
>>
>> (3) ./dirty_log_test -M dirty-ring-wait-queue
>>
>> This sololy test the extreme case where ring is full. When the
>> ring is completely full, the thread (no matter vcpu or not) will
>> be put onto a per-vm waitqueue, and KVM_RESET_DIRTY_RINGS will
>> wake the threads up (assuming until which the ring will not be
>> full any more).
>
> Am I correct assuming that guest memory can be dirtied by DMA operations?
> Should
>
> Not being that familiar with the current implementation of dirty page
> tracking, I wonder who marks the pages dirty in that case, and when?
> If the VM ring is used for I/O threads, isn't it possible that a large
> DMA could dirty a sufficiently large number of GFNs to overflow the
> associated ring? Does this case need a separate way to queue the
> dirtying I/O thread?
>
>>
>> Thanks,
>>
>> Cao, Lei (2):
>> KVM: Add kvm/vcpu argument to mark_dirty_page_in_slot
>> KVM: X86: Implement ring-based dirty memory tracking
>>
>> Paolo Bonzini (1):
>> KVM: Move running VCPU from ARM to common code
>>
>> Peter Xu (12):
>> KVM: Add build-time error check on kvm_run size
>> KVM: Implement ring-based dirty memory tracking
>> KVM: Make dirty ring exclusive to dirty bitmap log
>> KVM: Introduce dirty ring wait queue
>> KVM: selftests: Always clear dirty bitmap after iteration
>> KVM: selftests: Sync uapi/linux/kvm.h to tools/
>> KVM: selftests: Use a single binary for dirty/clear log test
>> KVM: selftests: Introduce after_vcpu_run hook for dirty log test
>> KVM: selftests: Add dirty ring buffer test
>> KVM: selftests: Let dirty_log_test async for dirty ring test
>> KVM: selftests: Add "-c" parameter to dirty log test
>> KVM: selftests: Test dirty ring waitqueue
>>
>> Documentation/virt/kvm/api.txt | 116 +++++
>> arch/arm/include/asm/kvm_host.h | 2 -
>> arch/arm64/include/asm/kvm_host.h | 2 -
>> arch/x86/include/asm/kvm_host.h | 5 +
>> arch/x86/include/uapi/asm/kvm.h | 1 +
>> arch/x86/kvm/Makefile | 3 +-
>> arch/x86/kvm/mmu/mmu.c | 6 +
>> arch/x86/kvm/vmx/vmx.c | 7 +
>> arch/x86/kvm/x86.c | 12 +
>> include/linux/kvm_dirty_ring.h | 67 +++
>> include/linux/kvm_host.h | 37 ++
>> include/linux/kvm_types.h | 1 +
>> include/uapi/linux/kvm.h | 36 ++
>> tools/include/uapi/linux/kvm.h | 47 ++
>> tools/testing/selftests/kvm/Makefile | 2 -
>> .../selftests/kvm/clear_dirty_log_test.c | 2 -
>> tools/testing/selftests/kvm/dirty_log_test.c | 452 ++++++++++++++++--
>> .../testing/selftests/kvm/include/kvm_util.h | 6 +
>> tools/testing/selftests/kvm/lib/kvm_util.c | 103 ++++
>> .../selftests/kvm/lib/kvm_util_internal.h | 5 +
>> virt/kvm/arm/arm.c | 29 --
>> virt/kvm/arm/perf.c | 6 +-
>> virt/kvm/arm/vgic/vgic-mmio.c | 15 +-
>> virt/kvm/dirty_ring.c | 156 ++++++
>> virt/kvm/kvm_main.c | 315 +++++++++++-
>> 25 files changed, 1329 insertions(+), 104 deletions(-)
>> create mode 100644 include/linux/kvm_dirty_ring.h
>> delete mode 100644 tools/testing/selftests/kvm/clear_dirty_log_test.c
>> create mode 100644 virt/kvm/dirty_ring.c
>
>
> --
> Cheers,
> Christophe de Dinechin (IRC c3d)
>

2019-12-11 14:55:15

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On Wed, Dec 11, 2019 at 08:04:36AM -0500, Michael S. Tsirkin wrote:
> On Wed, Dec 11, 2019 at 10:05:28AM +0100, Paolo Bonzini wrote:
> > On 10/12/19 22:53, Michael S. Tsirkin wrote:
> > > On Tue, Dec 10, 2019 at 11:02:11AM -0500, Peter Xu wrote:
> > >> On Tue, Dec 10, 2019 at 02:31:54PM +0100, Paolo Bonzini wrote:
> > >>> On 10/12/19 14:25, Michael S. Tsirkin wrote:
> > >>>>> There is no new infrastructure to track the dirty pages---it's just a
> > >>>>> different way to pass them to userspace.
> > >>>> Did you guys consider using one of the virtio ring formats?
> > >>>> Maybe reusing vhost code?
> > >>>
> > >>> There are no used/available entries here, it's unidirectional
> > >>> (kernel->user).
> > >>
> > >> Agreed. Vring could be an overkill IMHO (the whole dirty_ring.c is
> > >> 100+ LOC only).
> > >
> > > I guess you don't do polling/ event suppression and other tricks that
> > > virtio came up with for speed then?
>
> I looked at the code finally, there's actually available, and fetched is
> exactly like used. Not saying existing code is a great fit for you as
> you have an extra slot parameter to pass and it's reversed as compared
> to vhost, with kernel being the driver and userspace the device (even
> though vringh might fit, yet needs to be updated to support packed rings
> though). But sticking to an existing format is a good idea IMHO,
> or if not I think it's not a bad idea to add some justification.

Right, I'll add a small paragraph in the next cover letter to justify.

Thanks,

--
Peter Xu

2019-12-11 17:17:53

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH RFC 00/15] KVM: Dirty ring interface

On Wed, Dec 11, 2019 at 03:16:30PM +0100, Paolo Bonzini wrote:
> On 11/12/19 14:41, Christophe de Dinechin wrote:
> >
> > Peter Xu writes:
> >
> >> Branch is here: https://github.com/xzpeter/linux/tree/kvm-dirty-ring
> >>
> >> Overview
> >> ============
> >>
> >> This is a continued work from Lei Cao <[email protected]> and Paolo
> >> on the KVM dirty ring interface. To make it simple, I'll still start
> >> with version 1 as RFC.
> >>
> >> The new dirty ring interface is another way to collect dirty pages for
> >> the virtual machine, but it is different from the existing dirty
> >> logging interface in a few ways, majorly:
> >>
> >> - Data format: The dirty data was in a ring format rather than a
> >> bitmap format, so the size of data to sync for dirty logging does
> >> not depend on the size of guest memory any more, but speed of
> >> dirtying. Also, the dirty ring is per-vcpu (currently plus
> >> another per-vm ring, so total ring number is N+1), while the dirty
> >> bitmap is per-vm.
> >
> > I like Sean's suggestion to fetch rings when dirtying. That could reduce
> > the number of dirty rings to examine.
>
> What do you mean by "fetch rings"?

I'd wildly guess Christophe means something like we create a ring pool
and we try to find a ring to push the dirty gfn when it comes.

OK, should I count it as another vote for Sean's suggestion? :)

I agree, but IMHO a larger number of rings won't really be a problem as
long as they are still per-vcpu (after all, we have a vcpu number
limitation which is harder to break...). What attracted me most about
Sean's suggestion is that the interface is cleaner, in that we don't need
to expose the ring in two places any more. At the same time, I don't
care too much about the perf issue here because, after all, it's dirty
logging. If perf were critical, then I would certainly choose the
per-vcpu ring without doubt, even if it complicates the interface,
because it would certainly help make some paths conditionally lockless.

>
> > Also, as is, this means that the same gfn may be present in multiple
> > rings, right?
>
> I think the actual marking of a page as dirty is protected by a spinlock
> but I will defer to Peter on this.

In most cases, IMHO, we should be holding the mmu lock, IIUC, because the
general mmu page fault path will take it. However, I think there are
special cases:

- when the spte was popped already and just write protected, then
it's very possible we can go via the quick page fault path
(fast_page_fault()). That is lockless (no mmu lock taken).

- when there's no vcpu context, we'll use the per-vm ring. Though the
per-vm ring is locked (the per-vcpu ring is not!), I don't see how that
would protect two callers from inserting two identical gfns
sequentially. It can also happen between the per-vm and a per-vcpu
ring.

So I think gfn duplication could happen, but it should be rare. Even
if it happens, it won't hurt much because the 2nd/3rd/... dirty bit of
the same gfn will simply be skipped by userspace when harvesting.
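
For what it's worth, the userspace dedup is basically free if the
harvester already tracks pages to send in a bitmap; a rough sketch
(gfns/size/fetch_index/avail_index are the mmap'ed ring state, while
memslot_base_gfn() and test_and_set_bit() stand in for whatever helpers
the VMM has):

	while (fetch_index != avail_index) {
		struct kvm_dirty_gfn *e = &gfns[fetch_index & (size - 1)];
		uint64_t page = memslot_base_gfn(e->slot) + e->offset;

		/* A duplicate just finds its bit already set and is skipped. */
		if (!test_and_set_bit(page, dirty_bitmap))
			nr_dirty_pages++;
		fetch_index++;
	}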

>
> Paolo
>
> >>
> >> - Data copy: The sync of dirty pages does not need data copy any more,
> >> but instead the ring is shared between the userspace and kernel by
> >> page sharings (mmap() on either the vm fd or vcpu fd)
> >>
> >> - Interface: Instead of using the old KVM_GET_DIRTY_LOG,
> >> KVM_CLEAR_DIRTY_LOG interfaces, the new ring uses a new interface
> >> called KVM_RESET_DIRTY_RINGS when we want to reset the collected
> >> dirty pages to protected mode again (works like
> >> KVM_CLEAR_DIRTY_LOG, but ring based)
> >>
> >> And more.
> >>
> >> I would appreciate if the reviewers can start with patch "KVM:
> >> Implement ring-based dirty memory tracking", especially the document
> >> update part for the big picture. Then I'll avoid copying into most of
> >> them into cover letter again.
> >>
> >> I marked this series as RFC because I'm at least uncertain on this
> >> change of vcpu_enter_guest():
> >>
> >> if (kvm_check_request(KVM_REQ_DIRTY_RING_FULL, vcpu)) {
> >> vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> >> /*
> >> * If this is requested, it means that we've
> >> * marked the dirty bit in the dirty ring BUT
> >> * we've not written the date. Do it now.
> >
> > not written the "data" ?

Yep, though I'll drop these lines altogether so we'll be fine.. :)

Thanks,

--
Peter Xu

2019-12-11 17:25:21

by Christophe de Dinechin

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

Peter Xu writes:

> This patch is heavily based on previous work from Lei Cao
> <[email protected]> and Paolo Bonzini <[email protected]>. [1]
>
> KVM currently uses large bitmaps to track dirty memory. These bitmaps
> are copied to userspace when userspace queries KVM for its dirty page
> information. The use of bitmaps is mostly sufficient for live
> migration, as large parts of memory are be dirtied from one log-dirty
> pass to another.

That statement sort of concerns me. If large parts of memory are
dirtied, won't this cause the rings to fill up quickly enough to cause a
lot of churn between user-space and kernel?

See a possible suggestion to address that below.

> However, in a checkpointing system, the number of
> dirty pages is small and in fact it is often bounded---the VM is
> paused when it has dirtied a pre-defined number of pages. Traversing a
> large, sparsely populated bitmap to find set bits is time-consuming,
> as is copying the bitmap to user-space.
>
> A similar issue will be there for live migration when the guest memory
> is huge while the page dirty procedure is trivial. In that case for
> each dirty sync we need to pull the whole dirty bitmap to userspace
> and analyse every bit even if it's mostly zeros.
>
> The preferred data structure for above scenarios is a dense list of
> guest frame numbers (GFN). This patch series stores the dirty list in
> kernel memory that can be memory mapped into userspace to allow speedy
> harvesting.
>
> We defined two new data structures:
>
> struct kvm_dirty_ring;
> struct kvm_dirty_ring_indexes;
>
> Firstly, kvm_dirty_ring is defined to represent a ring of dirty
> pages. When dirty tracking is enabled, we can push dirty gfn onto the
> ring.
>
> Secondly, kvm_dirty_ring_indexes is defined to represent the
> user/kernel interface of each ring. Currently it contains two
> indexes: (1) avail_index represents where we should push our next
> PFN (written by kernel), while (2) fetch_index represents where the
> userspace should fetch the next dirty PFN (written by userspace).
>
> One complete ring is composed by one kvm_dirty_ring plus its
> corresponding kvm_dirty_ring_indexes.
>
> Currently, we have N+1 rings for each VM of N vcpus:
>
> - for each vcpu, we have 1 per-vcpu dirty ring,
> - for each vm, we have 1 per-vm dirty ring
>
> Please refer to the documentation update in this patch for more
> details.
>
> Note that this patch implements the core logic of dirty ring buffer.
> It's still disabled for all archs for now. Also, we'll address some
> of the other issues in follow up patches before it's firstly enabled
> on x86.
>
> [1] https://patchwork.kernel.org/patch/10471409/
>
> Signed-off-by: Lei Cao <[email protected]>
> Signed-off-by: Paolo Bonzini <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> Documentation/virt/kvm/api.txt | 109 +++++++++++++++
> arch/x86/kvm/Makefile | 3 +-
> include/linux/kvm_dirty_ring.h | 67 +++++++++
> include/linux/kvm_host.h | 33 +++++
> include/linux/kvm_types.h | 1 +
> include/uapi/linux/kvm.h | 36 +++++
> virt/kvm/dirty_ring.c | 156 +++++++++++++++++++++
> virt/kvm/kvm_main.c | 240 ++++++++++++++++++++++++++++++++-
> 8 files changed, 642 insertions(+), 3 deletions(-)
> create mode 100644 include/linux/kvm_dirty_ring.h
> create mode 100644 virt/kvm/dirty_ring.c
>
> diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
> index 49183add44e7..fa622c9a2eb8 100644
> --- a/Documentation/virt/kvm/api.txt
> +++ b/Documentation/virt/kvm/api.txt
> @@ -231,6 +231,7 @@ Based on their initialization different VMs may have different capabilities.
> It is thus encouraged to use the vm ioctl to query for capabilities (available
> with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
>
> +
> 4.5 KVM_GET_VCPU_MMAP_SIZE
>
> Capability: basic
> @@ -243,6 +244,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
> memory region. This ioctl returns the size of that region. See the
> KVM_RUN documentation for details.
>
> +Besides the size of the KVM_RUN communication region, other areas of
> +the VCPU file descriptor can be mmap-ed, including:
> +
> +- if KVM_CAP_COALESCED_MMIO is available, a page at
> + KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
> + this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
> + KVM_CAP_COALESCED_MMIO is not documented yet.

Does the above really belong in this patch?

> +
> +- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
> + KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. For more information on
> + KVM_CAP_DIRTY_LOG_RING, see section 8.3.
> +
>
> 4.6 KVM_SET_MEMORY_REGION
>
> @@ -5358,6 +5371,7 @@ CPU when the exception is taken. If this virtual SError is taken to EL1 using
> AArch64, this value will be reported in the ISS field of ESR_ELx.
>
> See KVM_CAP_VCPU_EVENTS for more details.
> +
> 8.20 KVM_CAP_HYPERV_SEND_IPI
>
> Architectures: x86
> @@ -5365,6 +5379,7 @@ Architectures: x86
> This capability indicates that KVM supports paravirtualized Hyper-V IPI send
> hypercalls:
> HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
> +
> 8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
>
> Architecture: x86
> @@ -5378,3 +5393,97 @@ handling by KVM (as some KVM hypercall may be mistakenly treated as TLB
> flush hypercalls by Hyper-V) so userspace should disable KVM identification
> in CPUID and only exposes Hyper-V identification. In this case, guest
> thinks it's running on Hyper-V and only use Hyper-V hypercalls.
> +
> +8.22 KVM_CAP_DIRTY_LOG_RING
> +
> +Architectures: x86
> +Parameters: args[0] - size of the dirty log ring
> +
> +KVM is capable of tracking dirty memory using ring buffers that are
> +mmaped into userspace; there is one dirty ring per vcpu and one global
> +ring per vm.
> +
> +One dirty ring has the following two major structures:
> +
> +struct kvm_dirty_ring {
> + u16 dirty_index;
> + u16 reset_index;

What is the benefit of using u16 for that? That means the ring holds at
most 2^16 = 65536 entries, i.e. with 4K pages you can track at most 256M
of dirty memory each time? That seems low to me, especially since it's
sufficient to touch one byte in a page to dirty the whole page.

Actually, this is not consistent with the definition in the code ;-)
So I'll assume it's actually u32.

> + u32 size;
> + u32 soft_limit;
> + spinlock_t lock;
> + struct kvm_dirty_gfn *dirty_gfns;
> +};
> +
> +struct kvm_dirty_ring_indexes {
> + __u32 avail_index; /* set by kernel */
> + __u32 fetch_index; /* set by userspace */
> +};
> +
> +While for each of the dirty entry it's defined as:
> +
> +struct kvm_dirty_gfn {
> + __u32 pad;
> + __u32 slot; /* as_id | slot_id */
> + __u64 offset;
> +};

Like others have suggested, I think we might use "pad" to store size
information, to be able to dirty large pages more efficiently.

> +
> +The fields in kvm_dirty_ring will be only internal to KVM itself,
> +while the fields in kvm_dirty_ring_indexes will be exposed to
> +userspace to be either read or written.

The sentence above is confusing when contrasted with the "set by kernel"
comment above.

> +
> +The two indices in the ring buffer are free running counters.

Nit: this patch uses both "indices" and "indexes".
Both are correct, but it would be nice to be consistent.

> +
> +In pseudocode, processing the ring buffer looks like this:
> +
> + idx = load-acquire(&ring->fetch_index);
> + while (idx != ring->avail_index) {
> + struct kvm_dirty_gfn *entry;
> + entry = &ring->dirty_gfns[idx & (size - 1)];
> + ...
> +
> + idx++;
> + }
> + ring->fetch_index = idx;
> +
> +Userspace calls KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM ioctl
> +to enable this capability for the new guest and set the size of the
> +rings. It is only allowed before creating any vCPU, and the size of
> +the ring must be a power of two. The larger the ring buffer, the less
> +likely the ring is full and the VM is forced to exit to userspace. The
> +optimal size depends on the workload, but it is recommended that it be
> +at least 64 KiB (4096 entries).

Is there anything in the design that would preclude resizing the ring
buffer at a later time? Presumably, you'd want a large ring while you
are doing things like migrations, but it's mostly useless when you are
not monitoring memory. So it would be nice to be able to call
KVM_ENABLE_CAP at any time to adjust the size.

As I read the current code, one of the issues would be the mapping of the
rings in case of a later extension where we add something beyond the
rings. But I'm not sure that's a big deal at the moment.

> +
> +After the capability is enabled, userspace can mmap the global ring
> +buffer (kvm_dirty_gfn[], offset KVM_DIRTY_LOG_PAGE_OFFSET) and the
> +indexes (kvm_dirty_ring_indexes, offset 0) from the VM file
> +descriptor. The per-vcpu dirty ring instead is mmapped when the vcpu
> +is created, similar to the kvm_run struct (kvm_dirty_ring_indexes
> +locates inside kvm_run, while kvm_dirty_gfn[] at offset
> +KVM_DIRTY_LOG_PAGE_OFFSET).
> +
> +Just like for dirty page bitmaps, the buffer tracks writes to
> +all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
> +set in KVM_SET_USER_MEMORY_REGION. Once a memory region is registered
> +with the flag set, userspace can start harvesting dirty pages from the
> +ring buffer.
> +
> +To harvest the dirty pages, userspace accesses the mmaped ring buffer
> +to read the dirty GFNs up to avail_index, and sets the fetch_index
> +accordingly. This can be done when the guest is running or paused,
> +and dirty pages need not be collected all at once. After processing
> +one or more entries in the ring buffer, userspace calls the VM ioctl
> +KVM_RESET_DIRTY_RINGS to notify the kernel that it has updated
> +fetch_index and to mark those pages clean. Therefore, the ioctl
> +must be called *before* reading the content of the dirty pages.

> +
> +However, there is a major difference comparing to the
> +KVM_GET_DIRTY_LOG interface in that when reading the dirty ring from
> +userspace it's still possible that the kernel has not yet flushed the
> +hardware dirty buffers into the kernel buffer. To achieve that, one
> +needs to kick the vcpu out for a hardware buffer flush (vmexit).

When you refer to "buffers", are you referring to the cache lines that
contain the ring buffers, or to something else?

I'm a bit confused by this sentence. I think that you mean that a VCPU
may still be running while you read its ring buffer, in which case the
values in the ring buffer are not necessarily in memory yet, so not
visible to a different CPU. But I wonder if you can't make this vmexit
requirement unnecessary by carefully ordering the writes, making sure
that the fetch_index is updated only after the corresponding ring
entries have been written to memory.

In other words, as seen by user-space, you would not care that the ring
entries have not been flushed as long as the fetch_index itself is
guaranteed to still be behind the not-flushed-yet entries.

(I would know how to do that on a different architecture, not sure for x86)
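
If the concern were only the ordering of the ring writes, the producer
side could make that pairing explicit with a release store, as in the
sketch below (the posted patch's smp_wmb() + WRITE_ONCE() already gives
an equivalent guarantee on the write side; none of this helps with
hardware dirty buffers such as PML, which are only flushed on vmexit):

	/* Kernel (producer): make the entry visible before the index. */
	entry->slot = slot;
	entry->offset = offset;
	smp_store_release(&indexes->avail_index, ++ring->dirty_index);

	/*
	 * Userspace (consumer) then reads avail_index with an acquire load,
	 * e.g. __atomic_load_n(&indexes->avail_index, __ATOMIC_ACQUIRE),
	 * before touching the entries, as in the pseudocode quoted earlier.
	 */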

> +
> +If one of the ring buffers is full, the guest will exit to userspace
> +with the exit reason set to KVM_EXIT_DIRTY_LOG_FULL, and the
> +KVM_RUN ioctl will return -EINTR. Once that happens, userspace
> +should pause all the vcpus, then harvest all the dirty pages and
> +rearm the dirty traps. It can unpause the guest after that.

Except for the condition above, why is it necessary to pause other VCPUs
than the one being harvested?


> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index b19ef421084d..0acee817adfb 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -5,7 +5,8 @@ ccflags-y += -Iarch/x86/kvm
> KVM := ../../../virt/kvm
>
> kvm-y += $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> - $(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
> + $(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \
> + $(KVM)/dirty_ring.o
> kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
>
> kvm-y += x86.o emulate.o i8259.o irq.o lapic.o \
> diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
> new file mode 100644
> index 000000000000..8335635b7ff7
> --- /dev/null
> +++ b/include/linux/kvm_dirty_ring.h
> @@ -0,0 +1,67 @@
> +#ifndef KVM_DIRTY_RING_H
> +#define KVM_DIRTY_RING_H
> +
> +/*
> + * struct kvm_dirty_ring is defined in include/uapi/linux/kvm.h.
> + *
> + * dirty_ring: shared with userspace via mmap. It is the compact list
> + * that holds the dirty pages.
> + * dirty_index: free running counter that points to the next slot in
> + * dirty_ring->dirty_gfns where a new dirty page should go.
> + * reset_index: free running counter that points to the next dirty page
> + * in dirty_ring->dirty_gfns for which dirty trap needs to
> + * be reenabled
> + * size: size of the compact list, dirty_ring->dirty_gfns
> + * soft_limit: when the number of dirty pages in the list reaches this
> + * limit, vcpu that owns this ring should exit to userspace
> + * to allow userspace to harvest all the dirty pages
> + * lock: protects dirty_ring, only in use if this is the global
> + * ring

If that's not used for vcpu rings, maybe move it out of kvm_dirty_ring?

> + *
> + * The number of dirty pages in the ring is calculated by,
> + * dirty_index - reset_index

Nit: the code calls it "used" (in kvm_dirty_ring_used). Maybe find an
unambiguous term. What about "posted", as in

The number of posted dirty pages, i.e. the number of dirty pages in the
ring, is calculated as dirty_index - reset_index by the function
kvm_dirty_ring_posted.

(Replace "posted" by any term of your liking)

> + *
> + * kernel increments dirty_ring->indices.avail_index after dirty index
> + * is incremented. When userspace harvests the dirty pages, it increments
> + * dirty_ring->indices.fetch_index up to dirty_ring->indices.avail_index.
> + * When kernel reenables dirty traps for the dirty pages, it increments
> + * reset_index up to dirty_ring->indices.fetch_index.

Userspace should not be trusted to do this; see below.


> + *
> + */
> +struct kvm_dirty_ring {
> + u32 dirty_index;
> + u32 reset_index;
> + u32 size;
> + u32 soft_limit;
> + spinlock_t lock;
> + struct kvm_dirty_gfn *dirty_gfns;
> +};
> +
> +u32 kvm_dirty_ring_get_rsvd_entries(void);
> +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring);
> +
> +/*
> + * called with kvm->slots_lock held, returns the number of
> + * processed pages.
> + */
> +int kvm_dirty_ring_reset(struct kvm *kvm,
> + struct kvm_dirty_ring *ring,
> + struct kvm_dirty_ring_indexes *indexes);
> +
> +/*
> + * returns 0: successfully pushed
> + * 1: successfully pushed, soft limit reached,
> + * vcpu should exit to userspace
> + * -EBUSY: unable to push, dirty ring full.
> + */
> +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> + struct kvm_dirty_ring_indexes *indexes,
> + u32 slot, u64 offset, bool lock);
> +
> +/* for use in vm_operations_struct */
> +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i);

It's not very clear what 'i' means; it seems to be a page offset, based on the call sites?

> +
> +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
> +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring);
> +
> +#endif
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 498a39462ac1..7b747bc9ff3e 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -34,6 +34,7 @@
> #include <linux/kvm_types.h>
>
> #include <asm/kvm_host.h>
> +#include <linux/kvm_dirty_ring.h>
>
> #ifndef KVM_MAX_VCPU_ID
> #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
> @@ -146,6 +147,7 @@ static inline bool is_error_page(struct page *page)
> #define KVM_REQ_MMU_RELOAD (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> #define KVM_REQ_PENDING_TIMER 2
> #define KVM_REQ_UNHALT 3
> +#define KVM_REQ_DIRTY_RING_FULL 4
> #define KVM_REQUEST_ARCH_BASE 8
>
> #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
> @@ -321,6 +323,7 @@ struct kvm_vcpu {
> bool ready;
> struct kvm_vcpu_arch arch;
> struct dentry *debugfs_dentry;
> + struct kvm_dirty_ring dirty_ring;
> };
>
> static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
> @@ -501,6 +504,10 @@ struct kvm {
> struct srcu_struct srcu;
> struct srcu_struct irq_srcu;
> pid_t userspace_pid;
> + /* Data structure to be exported by mmap(kvm->fd, 0) */
> + struct kvm_vm_run *vm_run;
> + u32 dirty_ring_size;
> + struct kvm_dirty_ring vm_dirty_ring;

If you remove the lock from struct kvm_dirty_ring, you could just put it there.

> };
>
> #define kvm_err(fmt, ...) \
> @@ -832,6 +839,8 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> gfn_t gfn_offset,
> unsigned long mask);
>
> +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask);
> +
> int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
> struct kvm_dirty_log *log);
> int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> @@ -1411,4 +1420,28 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
> uintptr_t data, const char *name,
> struct task_struct **thread_ptr);
>
> +/*
> + * This defines how many reserved entries we want to keep before we
> + * kick the vcpu to the userspace to avoid dirty ring full. This
> + * value can be tuned to higher if e.g. PML is enabled on the host.
> + */
> +#define KVM_DIRTY_RING_RSVD_ENTRIES 64
> +
> +/* Max number of entries allowed for each kvm dirty ring */
> +#define KVM_DIRTY_RING_MAX_ENTRIES 65536
> +
> +/*
> + * Arch needs to define these macro after implementing the dirty ring
> + * feature. KVM_DIRTY_LOG_PAGE_OFFSET should be defined as the
> + * starting page offset of the dirty ring structures, while
> + * KVM_DIRTY_RING_VERSION should be defined as >=1. By default, this
> + * feature is off on all archs.
> + */
> +#ifndef KVM_DIRTY_LOG_PAGE_OFFSET
> +#define KVM_DIRTY_LOG_PAGE_OFFSET 0
> +#endif
> +#ifndef KVM_DIRTY_RING_VERSION
> +#define KVM_DIRTY_RING_VERSION 0
> +#endif
> +
> #endif
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index 1c88e69db3d9..d9d03eea145a 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -11,6 +11,7 @@ struct kvm_irq_routing_table;
> struct kvm_memory_slot;
> struct kvm_one_reg;
> struct kvm_run;
> +struct kvm_vm_run;
> struct kvm_userspace_memory_region;
> struct kvm_vcpu;
> struct kvm_vcpu_init;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index e6f17c8e2dba..0b88d76d6215 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -236,6 +236,7 @@ struct kvm_hyperv_exit {
> #define KVM_EXIT_IOAPIC_EOI 26
> #define KVM_EXIT_HYPERV 27
> #define KVM_EXIT_ARM_NISV 28
> +#define KVM_EXIT_DIRTY_RING_FULL 29
>
> /* For KVM_EXIT_INTERNAL_ERROR */
> /* Emulate instruction failed. */
> @@ -247,6 +248,11 @@ struct kvm_hyperv_exit {
> /* Encounter unexpected vm-exit reason */
> #define KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON 4
>
> +struct kvm_dirty_ring_indexes {
> + __u32 avail_index; /* set by kernel */
> + __u32 fetch_index; /* set by userspace */
> +};
> +
> /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
> struct kvm_run {
> /* in */
> @@ -421,6 +427,13 @@ struct kvm_run {
> struct kvm_sync_regs regs;
> char padding[SYNC_REGS_SIZE_BYTES];
> } s;
> +
> + struct kvm_dirty_ring_indexes vcpu_ring_indexes;
> +};
> +
> +/* Returned by mmap(kvm->fd, offset=0) */
> +struct kvm_vm_run {
> + struct kvm_dirty_ring_indexes vm_ring_indexes;
> };
>
> /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
> @@ -1009,6 +1022,7 @@ struct kvm_ppc_resize_hpt {
> #define KVM_CAP_PPC_GUEST_DEBUG_SSTEP 176
> #define KVM_CAP_ARM_NISV_TO_USER 177
> #define KVM_CAP_ARM_INJECT_EXT_DABT 178
> +#define KVM_CAP_DIRTY_LOG_RING 179
>
> #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -1472,6 +1486,9 @@ struct kvm_enc_region {
> /* Available with KVM_CAP_ARM_SVE */
> #define KVM_ARM_VCPU_FINALIZE _IOW(KVMIO, 0xc2, int)
>
> +/* Available with KVM_CAP_DIRTY_LOG_RING */
> +#define KVM_RESET_DIRTY_RINGS _IO(KVMIO, 0xc3)
> +
> /* Secure Encrypted Virtualization command */
> enum sev_cmd_id {
> /* Guest initialization commands */
> @@ -1622,4 +1639,23 @@ struct kvm_hyperv_eventfd {
> #define KVM_HYPERV_CONN_ID_MASK 0x00ffffff
> #define KVM_HYPERV_EVENTFD_DEASSIGN (1 << 0)
>
> +/*
> + * The following are the requirements for supporting dirty log ring
> + * (by enabling KVM_DIRTY_LOG_PAGE_OFFSET).
> + *
> + * 1. Memory accesses by KVM should call kvm_vcpu_write_* instead
> + * of kvm_write_* so that the global dirty ring is not filled up
> + * too quickly.
> + * 2. kvm_arch_mmu_enable_log_dirty_pt_masked should be defined for
> + * enabling dirty logging.
> + * 3. There should not be a separate step to synchronize hardware
> + * dirty bitmap with KVM's.
> + */
> +
> +struct kvm_dirty_gfn {
> + __u32 pad;
> + __u32 slot;
> + __u64 offset;
> +};
> +
> #endif /* __LINUX_KVM_H */
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> new file mode 100644
> index 000000000000..9264891f3c32
> --- /dev/null
> +++ b/virt/kvm/dirty_ring.c
> @@ -0,0 +1,156 @@
> +#include <linux/kvm_host.h>
> +#include <linux/kvm.h>
> +#include <linux/vmalloc.h>
> +#include <linux/kvm_dirty_ring.h>
> +
> +u32 kvm_dirty_ring_get_rsvd_entries(void)
> +{
> + return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
> +}
> +
> +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring)
> +{
> + u32 size = kvm->dirty_ring_size;
> +
> + ring->dirty_gfns = vmalloc(size);
> + if (!ring->dirty_gfns)
> + return -ENOMEM;
> + memset(ring->dirty_gfns, 0, size);
> +
> + ring->size = size / sizeof(struct kvm_dirty_gfn);
> + ring->soft_limit =
> + (kvm->dirty_ring_size / sizeof(struct kvm_dirty_gfn)) -
> + kvm_dirty_ring_get_rsvd_entries();

Minor, but what about

ring->soft_limit = ring->size - kvm_dirty_ring_get_rsvd_entries();


> + ring->dirty_index = 0;
> + ring->reset_index = 0;
> + spin_lock_init(&ring->lock);
> +
> + return 0;
> +}
> +
> +int kvm_dirty_ring_reset(struct kvm *kvm,
> + struct kvm_dirty_ring *ring,
> + struct kvm_dirty_ring_indexes *indexes)
> +{
> + u32 cur_slot, next_slot;
> + u64 cur_offset, next_offset;
> + unsigned long mask;
> + u32 fetch;
> + int count = 0;
> + struct kvm_dirty_gfn *entry;
> +
> + fetch = READ_ONCE(indexes->fetch_index);

If I understand correctly, if a malicious user-space writes
ring->reset_index - 1 into fetch_index, the loop below will execute about
4 billion times.


> + if (fetch == ring->reset_index)
> + return 0;

To protect against the scenario above, I would have something like:

        if (fetch - ring->reset_index >= ring->size)
                return -EINVAL;

> +
> + entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> + /*
> + * The ring buffer is shared with userspace, which might mmap
> + * it and concurrently modify slot and offset. Userspace must
> + * not be trusted! READ_ONCE prevents the compiler from changing
> + * the values after they've been range-checked (the checks are
> + * in kvm_reset_dirty_gfn).
> + */
> + smp_read_barrier_depends();
> + cur_slot = READ_ONCE(entry->slot);
> + cur_offset = READ_ONCE(entry->offset);
> + mask = 1;
> + count++;
> + ring->reset_index++;
> + while (ring->reset_index != fetch) {
> + entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> + smp_read_barrier_depends();
> + next_slot = READ_ONCE(entry->slot);
> + next_offset = READ_ONCE(entry->offset);
> + ring->reset_index++;
> + count++;
> + /*
> + * Try to coalesce the reset operations when the guest is
> + * scanning pages in the same slot.
> + */
> + if (next_slot == cur_slot) {
> + int delta = next_offset - cur_offset;

Since you diff two u64 values, shouldn't that be an s64 rather than an int?

> +
> + if (delta >= 0 && delta < BITS_PER_LONG) {
> + mask |= 1ull << delta;
> + continue;
> + }
> +
> + /* Backwards visit, careful about overflows! */
> + if (delta > -BITS_PER_LONG && delta < 0 &&
> + (mask << -delta >> -delta) == mask) {
> + cur_offset = next_offset;
> + mask = (mask << -delta) | 1;
> + continue;
> + }
> + }
> + kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> + cur_slot = next_slot;
> + cur_offset = next_offset;
> + mask = 1;
> + }
> + kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);

So if you did not coalesce the last one, you call kvm_reset_dirty_gfn
twice? Something smells weird about this loop ;-) I have a gut feeling
that it could be done in a single while loop combined with the entry
test, but I may be wrong.
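
For the sake of discussion, a single-loop variant could look like the
sketch below (untested; I dropped the smp_read_barrier_depends() calls
and the fetch_index validation for brevity, and used s64 for delta as
noted above):

        u32 cur_slot = 0, next_slot;
        u64 cur_offset = 0, next_offset;
        unsigned long mask = 0;         /* empty: no batch pending yet */
        int count = 0;

        while (ring->reset_index != fetch) {
                entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
                next_slot = READ_ONCE(entry->slot);
                next_offset = READ_ONCE(entry->offset);
                ring->reset_index++;
                count++;

                if (mask && next_slot == cur_slot) {
                        s64 delta = next_offset - cur_offset;

                        if (delta >= 0 && delta < BITS_PER_LONG) {
                                mask |= 1ull << delta;
                                continue;
                        }
                        /* Backwards visit, careful about overflows! */
                        if (delta > -BITS_PER_LONG && delta < 0 &&
                            (mask << -delta >> -delta) == mask) {
                                cur_offset = next_offset;
                                mask = (mask << -delta) | 1;
                                continue;
                        }
                }
                /* flush the previous batch (if any), start a new one */
                if (mask)
                        kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
                cur_slot = next_slot;
                cur_offset = next_offset;
                mask = 1;
        }
        if (mask)
                kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);

        return count;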


> +
> + return count;
> +}
> +
> +static inline u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
> +{
> + return ring->dirty_index - ring->reset_index;
> +}
> +
> +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring)
> +{
> + return kvm_dirty_ring_used(ring) >= ring->size;
> +}
> +
> +/*
> + * Returns:
> + * >0 if we should kick the vcpu out,
> + * =0 if the gfn pushed successfully, or,
> + * <0 if error (e.g. ring full)
> + */
> +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> + struct kvm_dirty_ring_indexes *indexes,
> + u32 slot, u64 offset, bool lock)

Obviously, if you go with the suggestion to have a "lock" only in struct
kvm, then you'd have to pass a lock ptr instead of a bool.

> +{
> + int ret;
> + struct kvm_dirty_gfn *entry;
> +
> + if (lock)
> + spin_lock(&ring->lock);
> +
> + if (kvm_dirty_ring_full(ring)) {
> + ret = -EBUSY;
> + goto out;
> + }
> +
> + entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> + entry->slot = slot;
> + entry->offset = offset;
> + smp_wmb();
> + ring->dirty_index++;
> + WRITE_ONCE(indexes->avail_index, ring->dirty_index);

Following up on my comment above about having to vmexit the VCPUs:
if you have a write barrier for the entry, and then a WRITE_ONCE for the
index, isn't that sufficient to ensure that another CPU will pick up the
values in the right order?


> + ret = kvm_dirty_ring_used(ring) >= ring->soft_limit;
> + pr_info("%s: slot %u offset %llu used %u\n",
> + __func__, slot, offset, kvm_dirty_ring_used(ring));
> +
> +out:
> + if (lock)
> + spin_unlock(&ring->lock);
> +
> + return ret;
> +}
> +
> +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i)

Still don't like 'i' :-)


(Stopped my review here for lack of time, decided to share what I had so far)

> +{
> + return vmalloc_to_page((void *)ring->dirty_gfns + i * PAGE_SIZE);
> +}
> +
> +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
> +{
> + if (ring->dirty_gfns) {
> + vfree(ring->dirty_gfns);
> + ring->dirty_gfns = NULL;
> + }
> +}
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 681452d288cd..8642c977629b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -64,6 +64,8 @@
> #define CREATE_TRACE_POINTS
> #include <trace/events/kvm.h>
>
> +#include <linux/kvm_dirty_ring.h>
> +
> /* Worst case buffer size needed for holding an integer. */
> #define ITOA_MAX_LEN 12
>
> @@ -149,6 +151,10 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
> struct kvm_vcpu *vcpu,
> struct kvm_memory_slot *memslot,
> gfn_t gfn);
> +static void mark_page_dirty_in_ring(struct kvm *kvm,
> + struct kvm_vcpu *vcpu,
> + struct kvm_memory_slot *slot,
> + gfn_t gfn);
>
> __visible bool kvm_rebooting;
> EXPORT_SYMBOL_GPL(kvm_rebooting);
> @@ -359,11 +365,22 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
> vcpu->preempted = false;
> vcpu->ready = false;
>
> + if (kvm->dirty_ring_size) {
> + r = kvm_dirty_ring_alloc(vcpu->kvm, &vcpu->dirty_ring);
> + if (r) {
> + kvm->dirty_ring_size = 0;
> + goto fail_free_run;
> + }
> + }
> +
> r = kvm_arch_vcpu_init(vcpu);
> if (r < 0)
> - goto fail_free_run;
> + goto fail_free_ring;
> return 0;
>
> +fail_free_ring:
> + if (kvm->dirty_ring_size)
> + kvm_dirty_ring_free(&vcpu->dirty_ring);
> fail_free_run:
> free_page((unsigned long)vcpu->run);
> fail:
> @@ -381,6 +398,8 @@ void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
> put_pid(rcu_dereference_protected(vcpu->pid, 1));
> kvm_arch_vcpu_uninit(vcpu);
> free_page((unsigned long)vcpu->run);
> + if (vcpu->kvm->dirty_ring_size)
> + kvm_dirty_ring_free(&vcpu->dirty_ring);
> }
> EXPORT_SYMBOL_GPL(kvm_vcpu_uninit);
>
> @@ -690,6 +709,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
> struct kvm *kvm = kvm_arch_alloc_vm();
> int r = -ENOMEM;
> int i;
> + struct page *page;
>
> if (!kvm)
> return ERR_PTR(-ENOMEM);
> @@ -705,6 +725,14 @@ static struct kvm *kvm_create_vm(unsigned long type)
>
> BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
>
> + page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> + if (!page) {
> + r = -ENOMEM;
> + goto out_err_alloc_page;
> + }
> + kvm->vm_run = page_address(page);
> + BUILD_BUG_ON(sizeof(struct kvm_vm_run) > PAGE_SIZE);
> +
> if (init_srcu_struct(&kvm->srcu))
> goto out_err_no_srcu;
> if (init_srcu_struct(&kvm->irq_srcu))
> @@ -775,6 +803,9 @@ static struct kvm *kvm_create_vm(unsigned long type)
> out_err_no_irq_srcu:
> cleanup_srcu_struct(&kvm->srcu);
> out_err_no_srcu:
> + free_page((unsigned long)page);
> + kvm->vm_run = NULL;
> +out_err_alloc_page:
> kvm_arch_free_vm(kvm);
> mmdrop(current->mm);
> return ERR_PTR(r);
> @@ -800,6 +831,15 @@ static void kvm_destroy_vm(struct kvm *kvm)
> int i;
> struct mm_struct *mm = kvm->mm;
>
> + if (kvm->dirty_ring_size) {
> + kvm_dirty_ring_free(&kvm->vm_dirty_ring);
> + }
> +
> + if (kvm->vm_run) {
> + free_page((unsigned long)kvm->vm_run);
> + kvm->vm_run = NULL;
> + }
> +
> kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
> kvm_destroy_vm_debugfs(kvm);
> kvm_arch_sync_events(kvm);
> @@ -2301,7 +2341,7 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
> {
> if (memslot && memslot->dirty_bitmap) {
> unsigned long rel_gfn = gfn - memslot->base_gfn;
> -
> + mark_page_dirty_in_ring(kvm, vcpu, memslot, gfn);
> set_bit_le(rel_gfn, memslot->dirty_bitmap);
> }
> }
> @@ -2649,6 +2689,13 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
> }
> EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
>
> +static bool kvm_fault_in_dirty_ring(struct kvm *kvm, struct vm_fault *vmf)
> +{
> + return (vmf->pgoff >= KVM_DIRTY_LOG_PAGE_OFFSET) &&
> + (vmf->pgoff < KVM_DIRTY_LOG_PAGE_OFFSET +
> + kvm->dirty_ring_size / PAGE_SIZE);
> +}
> +
> static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
> {
> struct kvm_vcpu *vcpu = vmf->vma->vm_file->private_data;
> @@ -2664,6 +2711,10 @@ static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
> else if (vmf->pgoff == KVM_COALESCED_MMIO_PAGE_OFFSET)
> page = virt_to_page(vcpu->kvm->coalesced_mmio_ring);
> #endif
> + else if (kvm_fault_in_dirty_ring(vcpu->kvm, vmf))
> + page = kvm_dirty_ring_get_page(
> + &vcpu->dirty_ring,
> + vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
> else
> return kvm_arch_vcpu_fault(vcpu, vmf);
> get_page(page);
> @@ -3259,12 +3310,162 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> #endif
> case KVM_CAP_NR_MEMSLOTS:
> return KVM_USER_MEM_SLOTS;
> + case KVM_CAP_DIRTY_LOG_RING:
> + /* Version will be zero if arch didn't implement it */
> + return KVM_DIRTY_RING_VERSION;
> default:
> break;
> }
> return kvm_vm_ioctl_check_extension(kvm, arg);
> }
>
> +static void mark_page_dirty_in_ring(struct kvm *kvm,
> + struct kvm_vcpu *vcpu,
> + struct kvm_memory_slot *slot,
> + gfn_t gfn)
> +{
> + u32 as_id = 0;
> + u64 offset;
> + int ret;
> + struct kvm_dirty_ring *ring;
> + struct kvm_dirty_ring_indexes *indexes;
> + bool is_vm_ring;
> +
> + if (!kvm->dirty_ring_size)
> + return;
> +
> + offset = gfn - slot->base_gfn;
> +
> + if (vcpu) {
> + as_id = kvm_arch_vcpu_memslots_id(vcpu);
> + } else {
> + as_id = 0;
> + vcpu = kvm_get_running_vcpu();
> + }
> +
> + if (vcpu) {
> + ring = &vcpu->dirty_ring;
> + indexes = &vcpu->run->vcpu_ring_indexes;
> + is_vm_ring = false;
> + } else {
> + /*
> + * Put onto per vm ring because no vcpu context. Kick
> + * vcpu0 if ring is full.
> + */
> + vcpu = kvm->vcpus[0];
> + ring = &kvm->vm_dirty_ring;
> + indexes = &kvm->vm_run->vm_ring_indexes;
> + is_vm_ring = true;
> + }
> +
> + ret = kvm_dirty_ring_push(ring, indexes,
> + (as_id << 16)|slot->id, offset,
> + is_vm_ring);
> + if (ret < 0) {
> + if (is_vm_ring)
> + pr_warn_once("vcpu %d dirty log overflow\n",
> + vcpu->vcpu_id);
> + else
> + pr_warn_once("per-vm dirty log overflow\n");
> + return;
> + }
> +
> + if (ret)
> + kvm_make_request(KVM_REQ_DIRTY_RING_FULL, vcpu);
> +}
> +
> +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
> +{
> + struct kvm_memory_slot *memslot;
> + int as_id, id;
> +
> + as_id = slot >> 16;
> + id = (u16)slot;
> + if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
> + return;
> +
> + memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
> + if (offset >= memslot->npages)
> + return;
> +
> + spin_lock(&kvm->mmu_lock);
> + /* FIXME: we should use a single AND operation, but there is no
> + * applicable atomic API.
> + */
> + while (mask) {
> + clear_bit_le(offset + __ffs(mask), memslot->dirty_bitmap);
> + mask &= mask - 1;
> + }
> +
> + kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
> + spin_unlock(&kvm->mmu_lock);
> +}
> +
> +static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
> +{
> + int r;
> +
> + /* the size should be power of 2 */
> + if (!size || (size & (size - 1)))
> + return -EINVAL;
> +
> + /* Should be bigger to keep the reserved entries, or a page */
> + if (size < kvm_dirty_ring_get_rsvd_entries() *
> + sizeof(struct kvm_dirty_gfn) || size < PAGE_SIZE)
> + return -EINVAL;
> +
> + if (size > KVM_DIRTY_RING_MAX_ENTRIES *
> + sizeof(struct kvm_dirty_gfn))
> + return -E2BIG;
> +
> + /* We only allow it to set once */
> + if (kvm->dirty_ring_size)
> + return -EINVAL;
> +
> + mutex_lock(&kvm->lock);
> +
> + if (kvm->created_vcpus) {
> + /* We don't allow to change this value after vcpu created */
> + r = -EINVAL;
> + } else {
> + kvm->dirty_ring_size = size;
> + r = kvm_dirty_ring_alloc(kvm, &kvm->vm_dirty_ring);
> + if (r) {
> + /* Unset dirty ring */
> + kvm->dirty_ring_size = 0;
> + }
> + }
> +
> + mutex_unlock(&kvm->lock);
> + return r;
> +}
> +
> +static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
> +{
> + int i;
> + struct kvm_vcpu *vcpu;
> + int cleared = 0;
> +
> + if (!kvm->dirty_ring_size)
> + return -EINVAL;
> +
> + mutex_lock(&kvm->slots_lock);
> +
> + cleared += kvm_dirty_ring_reset(kvm, &kvm->vm_dirty_ring,
> + &kvm->vm_run->vm_ring_indexes);
> +
> + kvm_for_each_vcpu(i, vcpu, kvm)
> + cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring,
> + &vcpu->run->vcpu_ring_indexes);
> +
> + mutex_unlock(&kvm->slots_lock);
> +
> + if (cleared)
> + kvm_flush_remote_tlbs(kvm);
> +
> + return cleared;
> +}
> +
> int __attribute__((weak)) kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> struct kvm_enable_cap *cap)
> {
> @@ -3282,6 +3483,8 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
> kvm->manual_dirty_log_protect = cap->args[0];
> return 0;
> #endif
> + case KVM_CAP_DIRTY_LOG_RING:
> + return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
> default:
> return kvm_vm_ioctl_enable_cap(kvm, cap);
> }
> @@ -3469,6 +3672,9 @@ static long kvm_vm_ioctl(struct file *filp,
> case KVM_CHECK_EXTENSION:
> r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
> break;
> + case KVM_RESET_DIRTY_RINGS:
> + r = kvm_vm_ioctl_reset_dirty_pages(kvm);
> + break;
> default:
> r = kvm_arch_vm_ioctl(filp, ioctl, arg);
> }
> @@ -3517,9 +3723,39 @@ static long kvm_vm_compat_ioctl(struct file *filp,
> }
> #endif
>
> +static vm_fault_t kvm_vm_fault(struct vm_fault *vmf)
> +{
> + struct kvm *kvm = vmf->vma->vm_file->private_data;
> + struct page *page = NULL;
> +
> + if (vmf->pgoff == 0)
> + page = virt_to_page(kvm->vm_run);
> + else if (kvm_fault_in_dirty_ring(kvm, vmf))
> + page = kvm_dirty_ring_get_page(
> + &kvm->vm_dirty_ring,
> + vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
> + else
> + return VM_FAULT_SIGBUS;
> +
> + get_page(page);
> + vmf->page = page;
> + return 0;
> +}
> +
> +static const struct vm_operations_struct kvm_vm_vm_ops = {
> + .fault = kvm_vm_fault,
> +};
> +
> +static int kvm_vm_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + vma->vm_ops = &kvm_vm_vm_ops;
> + return 0;
> +}
> +
> static struct file_operations kvm_vm_fops = {
> .release = kvm_vm_release,
> .unlocked_ioctl = kvm_vm_ioctl,
> + .mmap = kvm_vm_mmap,
> .llseek = noop_llseek,
> KVM_COMPAT(kvm_vm_compat_ioctl),
> };


--
Cheers,
Christophe de Dinechin (IRC c3d)

2019-12-11 21:01:38

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On Wed, Dec 11, 2019 at 07:53:48AM -0500, Michael S. Tsirkin wrote:
> On Fri, Nov 29, 2019 at 04:34:54PM -0500, Peter Xu wrote:
> > This patch is heavily based on previous work from Lei Cao
> > <[email protected]> and Paolo Bonzini <[email protected]>. [1]
> >
> > KVM currently uses large bitmaps to track dirty memory. These bitmaps
> > are copied to userspace when userspace queries KVM for its dirty page
> > information. The use of bitmaps is mostly sufficient for live
> > migration, as large parts of memory are dirtied from one log-dirty
> > pass to another. However, in a checkpointing system, the number of
> > dirty pages is small and in fact it is often bounded---the VM is
> > paused when it has dirtied a pre-defined number of pages. Traversing a
> > large, sparsely populated bitmap to find set bits is time-consuming,
> > as is copying the bitmap to user-space.
> >
> > A similar issue will be there for live migration when the guest memory
> > is huge while the page dirty procedure is trivial. In that case for
> > each dirty sync we need to pull the whole dirty bitmap to userspace
> > and analyse every bit even if it's mostly zeros.
> >
> > The preferred data structure for above scenarios is a dense list of
> > guest frame numbers (GFN). This patch series stores the dirty list in
> > kernel memory that can be memory mapped into userspace to allow speedy
> > harvesting.
> >
> > We defined two new data structures:
> >
> > struct kvm_dirty_ring;
> > struct kvm_dirty_ring_indexes;
> >
> > Firstly, kvm_dirty_ring is defined to represent a ring of dirty
> > pages. When dirty tracking is enabled, we can push dirty gfn onto the
> > ring.
> >
> > Secondly, kvm_dirty_ring_indexes is defined to represent the
> > user/kernel interface of each ring. Currently it contains two
> > indexes: (1) avail_index represents where we should push our next
> > PFN (written by kernel), while (2) fetch_index represents where the
> > userspace should fetch the next dirty PFN (written by userspace).
> >
> > One complete ring is composed by one kvm_dirty_ring plus its
> > corresponding kvm_dirty_ring_indexes.
> >
> > Currently, we have N+1 rings for each VM of N vcpus:
> >
> > - for each vcpu, we have 1 per-vcpu dirty ring,
> > - for each vm, we have 1 per-vm dirty ring
> >
> > Please refer to the documentation update in this patch for more
> > details.
> >
> > Note that this patch implements the core logic of dirty ring buffer.
> > It's still disabled for all archs for now. Also, we'll address some
> > of the other issues in follow up patches before it's firstly enabled
> > on x86.
> >
> > [1] https://patchwork.kernel.org/patch/10471409/
> >
> > Signed-off-by: Lei Cao <[email protected]>
> > Signed-off-by: Paolo Bonzini <[email protected]>
> > Signed-off-by: Peter Xu <[email protected]>
>
>
> Thanks, that's interesting.

Hi, Michael,

Thanks for reading the series.

>
> > ---
> > Documentation/virt/kvm/api.txt | 109 +++++++++++++++
> > arch/x86/kvm/Makefile | 3 +-
> > include/linux/kvm_dirty_ring.h | 67 +++++++++
> > include/linux/kvm_host.h | 33 +++++
> > include/linux/kvm_types.h | 1 +
> > include/uapi/linux/kvm.h | 36 +++++
> > virt/kvm/dirty_ring.c | 156 +++++++++++++++++++++
> > virt/kvm/kvm_main.c | 240 ++++++++++++++++++++++++++++++++-
> > 8 files changed, 642 insertions(+), 3 deletions(-)
> > create mode 100644 include/linux/kvm_dirty_ring.h
> > create mode 100644 virt/kvm/dirty_ring.c
> >
> > diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
> > index 49183add44e7..fa622c9a2eb8 100644
> > --- a/Documentation/virt/kvm/api.txt
> > +++ b/Documentation/virt/kvm/api.txt
> > @@ -231,6 +231,7 @@ Based on their initialization different VMs may have different capabilities.
> > It is thus encouraged to use the vm ioctl to query for capabilities (available
> > with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
> >
> > +
> > 4.5 KVM_GET_VCPU_MMAP_SIZE
> >
> > Capability: basic
> > @@ -243,6 +244,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
> > memory region. This ioctl returns the size of that region. See the
> > KVM_RUN documentation for details.
> >
> > +Besides the size of the KVM_RUN communication region, other areas of
> > +the VCPU file descriptor can be mmap-ed, including:
> > +
> > +- if KVM_CAP_COALESCED_MMIO is available, a page at
> > + KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
> > + this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
> > + KVM_CAP_COALESCED_MMIO is not documented yet.
> > +
> > +- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
> > + KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. For more information on
> > + KVM_CAP_DIRTY_LOG_RING, see section 8.3.
> > +
> >
> > 4.6 KVM_SET_MEMORY_REGION
> >
>
> PAGE_SIZE being which value? It's not always trivial for
> userspace to know what's the PAGE_SIZE for the kernel ...

I thought it can easily be fetched from getpagesize() or
sysconf(_SC_PAGESIZE)? Especially considering that this document targets
kvm userspace, I'd say it should be common for a hypervisor process to
need the page size in tons of other places too.. no?
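
E.g. something like this in the hypervisor (just a sketch: ring_bytes
and vcpu_fd come from elsewhere, and I'm assuming here that
KVM_DIRTY_LOG_PAGE_OFFSET is visible to userspace via the headers):

        #include <unistd.h>
        #include <sys/mman.h>
        #include <linux/kvm.h>

        static void *map_vcpu_dirty_ring(int vcpu_fd, size_t ring_bytes)
        {
                long psz = sysconf(_SC_PAGESIZE);    /* or getpagesize() */

                return mmap(NULL, ring_bytes, PROT_READ | PROT_WRITE,
                            MAP_SHARED, vcpu_fd,
                            (off_t)KVM_DIRTY_LOG_PAGE_OFFSET * psz);
        }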

>
>
> > @@ -5358,6 +5371,7 @@ CPU when the exception is taken. If this virtual SError is taken to EL1 using
> > AArch64, this value will be reported in the ISS field of ESR_ELx.
> >
> > See KVM_CAP_VCPU_EVENTS for more details.
> > +
> > 8.20 KVM_CAP_HYPERV_SEND_IPI
> >
> > Architectures: x86
> > @@ -5365,6 +5379,7 @@ Architectures: x86
> > This capability indicates that KVM supports paravirtualized Hyper-V IPI send
> > hypercalls:
> > HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
> > +
> > 8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
> >
> > Architecture: x86
> > @@ -5378,3 +5393,97 @@ handling by KVM (as some KVM hypercall may be mistakenly treated as TLB
> > flush hypercalls by Hyper-V) so userspace should disable KVM identification
> > in CPUID and only exposes Hyper-V identification. In this case, guest
> > thinks it's running on Hyper-V and only use Hyper-V hypercalls.
> > +
> > +8.22 KVM_CAP_DIRTY_LOG_RING
> > +
> > +Architectures: x86
> > +Parameters: args[0] - size of the dirty log ring
> > +
> > +KVM is capable of tracking dirty memory using ring buffers that are
> > +mmaped into userspace; there is one dirty ring per vcpu and one global
> > +ring per vm.
> > +
> > +One dirty ring has the following two major structures:
> > +
> > +struct kvm_dirty_ring {
> > + u16 dirty_index;
> > + u16 reset_index;
> > + u32 size;
> > + u32 soft_limit;
> > + spinlock_t lock;
> > + struct kvm_dirty_gfn *dirty_gfns;
> > +};
> > +
> > +struct kvm_dirty_ring_indexes {
> > + __u32 avail_index; /* set by kernel */
> > + __u32 fetch_index; /* set by userspace */
>
> Sticking these next to each other seems to guarantee cache conflicts.
>
> Avail/Fetch seems to mimic Virtio's avail/used exactly. I am not saying
> you must reuse the code really, but I think you should take a hard look
> at e.g. the virtio packed ring structure. We spent a bunch of time
> optimizing it for cache utilization. It seems kernel is the driver,
> making entries available, and userspace the device, using them.
> Again let's not develop a thread about this, but I think
> this is something to consider and discuss in future versions
> of the patches.

I think I completely understand your concern. We should avoid wasting
time reinventing what is already there. I'm just afraid that it'll take
even more time to use virtio for this use case while in the end we don't
really get much benefit out of it (e.g. most of the virtio features
would not be used).

Yeah, let's not develop a thread for this topic - I will read more on
virtio before my next post to see whether there's any chance we can
share anything with the virtio ring.

>
>
> > +};
> > +
> > +While for each of the dirty entry it's defined as:
> > +
> > +struct kvm_dirty_gfn {
>
> What does GFN stand for?

It's guest frame number, iiuc. I'm not the one who named this, but
that's what I understand..

>
> > + __u32 pad;
> > + __u32 slot; /* as_id | slot_id */
> > + __u64 offset;
> > +};
>
> offset of what? a 4K page right? Seems like a waste e.g. for
> hugetlbfs... How about replacing pad with size instead?

As Paolo explained, it's the page frame number of the guest. IIUC
even for hugetlbfs we track dirty bits at 4k granularity.

>
> > +
> > +The fields in kvm_dirty_ring will be only internal to KVM itself,
> > +while the fields in kvm_dirty_ring_indexes will be exposed to
> > +userspace to be either read or written.
>
> I'm not sure what you are trying to say here. kvm_dirty_gfn
> seems to be part of UAPI.

That sentence was talking about kvm_dirty_ring, which is kvm internal
and not exposed to uapi, while kvm_dirty_gfn is exposed to userspace.

>
> > +
> > +The two indices in the ring buffer are free running counters.
> > +
> > +In pseudocode, processing the ring buffer looks like this:
> > +
> > + idx = load-acquire(&ring->fetch_index);
> > + while (idx != ring->avail_index) {
> > + struct kvm_dirty_gfn *entry;
> > + entry = &ring->dirty_gfns[idx & (size - 1)];
> > + ...
> > +
> > + idx++;
> > + }
> > + ring->fetch_index = idx;
> > +
> > +Userspace calls KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM ioctl
> > +to enable this capability for the new guest and set the size of the
> > +rings. It is only allowed before creating any vCPU, and the size of
> > +the ring must be a power of two.
>
> All these seem like arbitrary limitations to me.

The dependency on vcpu creation is partly because we need to create the
per-vcpu rings, so it's easier if we don't allow the size to change
after that.

>
> Sizing the ring correctly might prove to be a challenge.
>
> Thus I think there's value in resizing the rings
> without destroying VCPU.

Do you have an example of when we could use this feature? My wild
guess is that even if we try hard to allow resizing (assuming that
won't bring more bugs, which I highly doubt...), people may not use it
at all.

The major scenario here is that kvm userspace will be collecting the
dirty bits quickly, so the ring should not really get full easily.
Then the ring size does not really matter much either, as long as it
is bigger than some threshold that avoids vmexits due to the ring
being full.

How about we start with the simple approach where we don't allow it to
change? We can add resizing when the requirement comes.

>
> Also, power of two just saves a branch here and there,
> but wastes lots of memory. Just wrap the index around to
> 0 and then users can select any size?

Same as above - can we postpone this until we need it?

>
>
>
> > The larger the ring buffer, the less
> > +likely the ring is full and the VM is forced to exit to userspace. The
> > +optimal size depends on the workload, but it is recommended that it be
> > +at least 64 KiB (4096 entries).
>
> OTOH larger buffers put lots of pressure on the system cache.
>
> > +
> > +After the capability is enabled, userspace can mmap the global ring
> > +buffer (kvm_dirty_gfn[], offset KVM_DIRTY_LOG_PAGE_OFFSET) and the
> > +indexes (kvm_dirty_ring_indexes, offset 0) from the VM file
> > +descriptor. The per-vcpu dirty ring instead is mmapped when the vcpu
> > +is created, similar to the kvm_run struct (kvm_dirty_ring_indexes
> > +locates inside kvm_run, while kvm_dirty_gfn[] at offset
> > +KVM_DIRTY_LOG_PAGE_OFFSET).
> > +
> > +Just like for dirty page bitmaps, the buffer tracks writes to
> > +all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
> > +set in KVM_SET_USER_MEMORY_REGION. Once a memory region is registered
> > +with the flag set, userspace can start harvesting dirty pages from the
> > +ring buffer.
> > +
> > +To harvest the dirty pages, userspace accesses the mmaped ring buffer
> > +to read the dirty GFNs up to avail_index, and sets the fetch_index
> > +accordingly. This can be done when the guest is running or paused,
> > +and dirty pages need not be collected all at once. After processing
> > +one or more entries in the ring buffer, userspace calls the VM ioctl
> > +KVM_RESET_DIRTY_RINGS to notify the kernel that it has updated
> > +fetch_index and to mark those pages clean. Therefore, the ioctl
> > +must be called *before* reading the content of the dirty pages.
> > +
> > +However, there is a major difference comparing to the
> > +KVM_GET_DIRTY_LOG interface in that when reading the dirty ring from
> > +userspace it's still possible that the kernel has not yet flushed the
> > +hardware dirty buffers into the kernel buffer. To achieve that, one
> > +needs to kick the vcpu out for a hardware buffer flush (vmexit).
> > +
> > +If one of the ring buffers is full, the guest will exit to userspace
> > +with the exit reason set to KVM_EXIT_DIRTY_LOG_FULL, and the
> > +KVM_RUN ioctl will return -EINTR. Once that happens, userspace
> > +should pause all the vcpus, then harvest all the dirty pages and
> > +rearm the dirty traps. It can unpause the guest after that.
>
> This last item means that the performance impact of the feature is
> really hard to predict. Can improve some workloads drastically. Or can
> slow some down.
>
>
> One solution could be to actually allow using this together with the
> existing bitmap. Userspace can then decide whether it wants to block
> VCPU on ring full, or just record ring full condition and recover by
> bitmap scanning.

That's true, but again allowing mixed use of the two might bring
extra complexity as well (especially after adding
KVM_CLEAR_DIRTY_LOG).

My understanding is that normally we only want one of them, depending
on the major workload and the configuration of the guest. It's not
trivial to provide a one-for-all solution. So again I would hope we
can start simple, extend the interfaces when we have better ideas on
how to combine the two, and only then justify whether that extra
complexity is worth it.

>
>
> > diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> > index b19ef421084d..0acee817adfb 100644
> > --- a/arch/x86/kvm/Makefile
> > +++ b/arch/x86/kvm/Makefile
> > @@ -5,7 +5,8 @@ ccflags-y += -Iarch/x86/kvm
> > KVM := ../../../virt/kvm
> >
> > kvm-y += $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> > - $(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
> > + $(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \
> > + $(KVM)/dirty_ring.o
> > kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
> >
> > kvm-y += x86.o emulate.o i8259.o irq.o lapic.o \
> > diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
> > new file mode 100644
> > index 000000000000..8335635b7ff7
> > --- /dev/null
> > +++ b/include/linux/kvm_dirty_ring.h
> > @@ -0,0 +1,67 @@
> > +#ifndef KVM_DIRTY_RING_H
> > +#define KVM_DIRTY_RING_H
> > +
> > +/*
> > + * struct kvm_dirty_ring is defined in include/uapi/linux/kvm.h.
> > + *
> > + * dirty_ring: shared with userspace via mmap. It is the compact list
> > + * that holds the dirty pages.
> > + * dirty_index: free running counter that points to the next slot in
> > + * dirty_ring->dirty_gfns where a new dirty page should go.
> > + * reset_index: free running counter that points to the next dirty page
> > + * in dirty_ring->dirty_gfns for which dirty trap needs to
> > + * be reenabled
> > + * size: size of the compact list, dirty_ring->dirty_gfns
> > + * soft_limit: when the number of dirty pages in the list reaches this
> > + * limit, vcpu that owns this ring should exit to userspace
> > + * to allow userspace to harvest all the dirty pages
> > + * lock: protects dirty_ring, only in use if this is the global
> > + * ring
> > + *
> > + * The number of dirty pages in the ring is calculated by,
> > + * dirty_index - reset_index
> > + *
> > + * kernel increments dirty_ring->indices.avail_index after dirty index
> > + * is incremented. When userspace harvests the dirty pages, it increments
> > + * dirty_ring->indices.fetch_index up to dirty_ring->indices.avail_index.
> > + * When kernel reenables dirty traps for the dirty pages, it increments
> > + * reset_index up to dirty_ring->indices.fetch_index.
> > + *
> > + */
> > +struct kvm_dirty_ring {
> > + u32 dirty_index;
> > + u32 reset_index;
> > + u32 size;
> > + u32 soft_limit;
> > + spinlock_t lock;
> > + struct kvm_dirty_gfn *dirty_gfns;
> > +};
> > +
> > +u32 kvm_dirty_ring_get_rsvd_entries(void);
> > +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring);
> > +
> > +/*
> > + * called with kvm->slots_lock held, returns the number of
> > + * processed pages.
> > + */
> > +int kvm_dirty_ring_reset(struct kvm *kvm,
> > + struct kvm_dirty_ring *ring,
> > + struct kvm_dirty_ring_indexes *indexes);
> > +
> > +/*
> > + * returns 0: successfully pushed
> > + * 1: successfully pushed, soft limit reached,
> > + * vcpu should exit to userspace
> > + * -EBUSY: unable to push, dirty ring full.
> > + */
> > +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> > + struct kvm_dirty_ring_indexes *indexes,
> > + u32 slot, u64 offset, bool lock);
> > +
> > +/* for use in vm_operations_struct */
> > +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i);
> > +
> > +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
> > +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring);
> > +
> > +#endif
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 498a39462ac1..7b747bc9ff3e 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -34,6 +34,7 @@
> > #include <linux/kvm_types.h>
> >
> > #include <asm/kvm_host.h>
> > +#include <linux/kvm_dirty_ring.h>
> >
> > #ifndef KVM_MAX_VCPU_ID
> > #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
> > @@ -146,6 +147,7 @@ static inline bool is_error_page(struct page *page)
> > #define KVM_REQ_MMU_RELOAD (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> > #define KVM_REQ_PENDING_TIMER 2
> > #define KVM_REQ_UNHALT 3
> > +#define KVM_REQ_DIRTY_RING_FULL 4
> > #define KVM_REQUEST_ARCH_BASE 8
> >
> > #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
> > @@ -321,6 +323,7 @@ struct kvm_vcpu {
> > bool ready;
> > struct kvm_vcpu_arch arch;
> > struct dentry *debugfs_dentry;
> > + struct kvm_dirty_ring dirty_ring;
> > };
> >
> > static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
> > @@ -501,6 +504,10 @@ struct kvm {
> > struct srcu_struct srcu;
> > struct srcu_struct irq_srcu;
> > pid_t userspace_pid;
> > + /* Data structure to be exported by mmap(kvm->fd, 0) */
> > + struct kvm_vm_run *vm_run;
> > + u32 dirty_ring_size;
> > + struct kvm_dirty_ring vm_dirty_ring;
> > };
> >
> > #define kvm_err(fmt, ...) \
> > @@ -832,6 +839,8 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> > gfn_t gfn_offset,
> > unsigned long mask);
> >
> > +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask);
> > +
> > int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
> > struct kvm_dirty_log *log);
> > int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> > @@ -1411,4 +1420,28 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
> > uintptr_t data, const char *name,
> > struct task_struct **thread_ptr);
> >
> > +/*
> > + * This defines how many reserved entries we want to keep before we
> > + * kick the vcpu to the userspace to avoid dirty ring full. This
> > + * value can be tuned to higher if e.g. PML is enabled on the host.
> > + */
> > +#define KVM_DIRTY_RING_RSVD_ENTRIES 64
> > +
> > +/* Max number of entries allowed for each kvm dirty ring */
> > +#define KVM_DIRTY_RING_MAX_ENTRIES 65536
> > +
> > +/*
> > + * Arch needs to define these macro after implementing the dirty ring
> > + * feature. KVM_DIRTY_LOG_PAGE_OFFSET should be defined as the
> > + * starting page offset of the dirty ring structures,
>
> Confused. Offset where? You set a default for everyone - where does arch
> want to override it?

If the arch defines KVM_DIRTY_LOG_PAGE_OFFSET then the block below
becomes a no-op; please see the #ifndef at [1].
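
So an arch that implements the feature would provide its own defines in
its kvm headers, e.g. something like (values here are only for
illustration, not what the x86 patch will necessarily use):

        #define KVM_DIRTY_LOG_PAGE_OFFSET 64
        #define KVM_DIRTY_RING_VERSION    1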

>
> > while
> > + * KVM_DIRTY_RING_VERSION should be defined as >=1. By default, this
> > + * feature is off on all archs.
> > + */
> > +#ifndef KVM_DIRTY_LOG_PAGE_OFFSET

[1]

> > +#define KVM_DIRTY_LOG_PAGE_OFFSET 0
> > +#endif
> > +#ifndef KVM_DIRTY_RING_VERSION
> > +#define KVM_DIRTY_RING_VERSION 0
> > +#endif
>
> One way versioning, with no bits and negotiation
> will make it hard to change down the road.
> what's wrong with existing KVM capabilities that
> you feel there's a need for dedicated versioning for this?

Frankly speaking I don't even think it'll change in the near
future.. :)

Yeah, kvm capabilities could work too. Here we could also return zero
just like most of the caps (or KVM_DIRTY_LOG_PAGE_OFFSET as in the
original patchset, but that doesn't help much either because it's
already defined in uapi); I just don't see how it helps... So I
returned a version number in case we'd like to change the layout some
day without bothering to introduce another cap bit for the same
feature (like KVM_CAP_MANUAL_DIRTY_LOG_PROTECT and
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2).

>
> > +
> > #endif
> > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> > index 1c88e69db3d9..d9d03eea145a 100644
> > --- a/include/linux/kvm_types.h
> > +++ b/include/linux/kvm_types.h
> > @@ -11,6 +11,7 @@ struct kvm_irq_routing_table;
> > struct kvm_memory_slot;
> > struct kvm_one_reg;
> > struct kvm_run;
> > +struct kvm_vm_run;
> > struct kvm_userspace_memory_region;
> > struct kvm_vcpu;
> > struct kvm_vcpu_init;
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index e6f17c8e2dba..0b88d76d6215 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -236,6 +236,7 @@ struct kvm_hyperv_exit {
> > #define KVM_EXIT_IOAPIC_EOI 26
> > #define KVM_EXIT_HYPERV 27
> > #define KVM_EXIT_ARM_NISV 28
> > +#define KVM_EXIT_DIRTY_RING_FULL 29
> >
> > /* For KVM_EXIT_INTERNAL_ERROR */
> > /* Emulate instruction failed. */
> > @@ -247,6 +248,11 @@ struct kvm_hyperv_exit {
> > /* Encounter unexpected vm-exit reason */
> > #define KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON 4
> >
> > +struct kvm_dirty_ring_indexes {
> > + __u32 avail_index; /* set by kernel */
> > + __u32 fetch_index; /* set by userspace */
> > +};
> > +
> > /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
> > struct kvm_run {
> > /* in */
> > @@ -421,6 +427,13 @@ struct kvm_run {
> > struct kvm_sync_regs regs;
> > char padding[SYNC_REGS_SIZE_BYTES];
> > } s;
> > +
> > + struct kvm_dirty_ring_indexes vcpu_ring_indexes;
> > +};
> > +
> > +/* Returned by mmap(kvm->fd, offset=0) */
> > +struct kvm_vm_run {
> > + struct kvm_dirty_ring_indexes vm_ring_indexes;
> > };
> >
> > /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
> > @@ -1009,6 +1022,7 @@ struct kvm_ppc_resize_hpt {
> > #define KVM_CAP_PPC_GUEST_DEBUG_SSTEP 176
> > #define KVM_CAP_ARM_NISV_TO_USER 177
> > #define KVM_CAP_ARM_INJECT_EXT_DABT 178
> > +#define KVM_CAP_DIRTY_LOG_RING 179
> >
> > #ifdef KVM_CAP_IRQ_ROUTING
> >
> > @@ -1472,6 +1486,9 @@ struct kvm_enc_region {
> > /* Available with KVM_CAP_ARM_SVE */
> > #define KVM_ARM_VCPU_FINALIZE _IOW(KVMIO, 0xc2, int)
> >
> > +/* Available with KVM_CAP_DIRTY_LOG_RING */
> > +#define KVM_RESET_DIRTY_RINGS _IO(KVMIO, 0xc3)
> > +
> > /* Secure Encrypted Virtualization command */
> > enum sev_cmd_id {
> > /* Guest initialization commands */
> > @@ -1622,4 +1639,23 @@ struct kvm_hyperv_eventfd {
> > #define KVM_HYPERV_CONN_ID_MASK 0x00ffffff
> > #define KVM_HYPERV_EVENTFD_DEASSIGN (1 << 0)
> >
> > +/*
> > + * The following are the requirements for supporting dirty log ring
> > + * (by enabling KVM_DIRTY_LOG_PAGE_OFFSET).
> > + *
> > + * 1. Memory accesses by KVM should call kvm_vcpu_write_* instead
> > + * of kvm_write_* so that the global dirty ring is not filled up
> > + * too quickly.
> > + * 2. kvm_arch_mmu_enable_log_dirty_pt_masked should be defined for
> > + * enabling dirty logging.
> > + * 3. There should not be a separate step to synchronize hardware
> > + * dirty bitmap with KVM's.
> > + */
> > +
> > +struct kvm_dirty_gfn {
> > + __u32 pad;
> > + __u32 slot;
> > + __u64 offset;
> > +};
> > +
> > #endif /* __LINUX_KVM_H */
> > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > new file mode 100644
> > index 000000000000..9264891f3c32
> > --- /dev/null
> > +++ b/virt/kvm/dirty_ring.c
> > @@ -0,0 +1,156 @@
> > +#include <linux/kvm_host.h>
> > +#include <linux/kvm.h>
> > +#include <linux/vmalloc.h>
> > +#include <linux/kvm_dirty_ring.h>
> > +
> > +u32 kvm_dirty_ring_get_rsvd_entries(void)
> > +{
> > + return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
> > +}
> > +
> > +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring)
> > +{
> > + u32 size = kvm->dirty_ring_size;
> > +
> > + ring->dirty_gfns = vmalloc(size);
>
> So 1/2 a megabyte of kernel memory per VM that userspace locks up.
> Do we really have to though? Why not get a userspace pointer,
> write it with copy to user, and sidestep all this?

I'd say it won't be a big issue to lock 1/2M of host mem for a
vm...

Also note that if the dirty ring is enabled, I plan to drop the
dirty_bitmap in the next post. The old kvm->dirty_bitmap takes
$GUEST_MEM/32K*2 bytes. E.g., for a 64G guest it's 64G/32K*2=4M. With
dirty rings for 8 vcpus that could be 64K*8=0.5M instead, which could
even be less memory.
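
To spell out the arithmetic (assuming the 16-byte kvm_dirty_gfn entries
and, say, the recommended 4096-entry ring per vcpu):

        dirty bitmap: 64G / 32K = 2M, x2 = 4M
        dirty rings:  4096 entries x 16 bytes = 64K per vcpu, x8 vcpus = 0.5M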

>
> > + if (!ring->dirty_gfns)
> > + return -ENOMEM;
> > + memset(ring->dirty_gfns, 0, size);
> > +
> > + ring->size = size / sizeof(struct kvm_dirty_gfn);
> > + ring->soft_limit =
> > + (kvm->dirty_ring_size / sizeof(struct kvm_dirty_gfn)) -
> > + kvm_dirty_ring_get_rsvd_entries();
> > + ring->dirty_index = 0;
> > + ring->reset_index = 0;
> > + spin_lock_init(&ring->lock);
> > +
> > + return 0;
> > +}
> > +
> > +int kvm_dirty_ring_reset(struct kvm *kvm,
> > + struct kvm_dirty_ring *ring,
> > + struct kvm_dirty_ring_indexes *indexes)
> > +{
> > + u32 cur_slot, next_slot;
> > + u64 cur_offset, next_offset;
> > + unsigned long mask;
> > + u32 fetch;
> > + int count = 0;
> > + struct kvm_dirty_gfn *entry;
> > +
> > + fetch = READ_ONCE(indexes->fetch_index);
> > + if (fetch == ring->reset_index)
> > + return 0;
> > +
> > + entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> > + /*
> > + * The ring buffer is shared with userspace, which might mmap
> > + * it and concurrently modify slot and offset. Userspace must
> > + * not be trusted! READ_ONCE prevents the compiler from changing
> > + * the values after they've been range-checked (the checks are
> > + * in kvm_reset_dirty_gfn).
>
> What it doesn't is prevent speculative attacks. That's why things like
> copy from user have a speculation barrier. Instead of worrying about
> that, unless it's really critical, I think you'd do well do just use
> copy to/from user.

IMHO I would really hope this data stays resident rather than being
swapped out of memory, just like what we did with kvm->dirty_bitmap...
it's on the hot path of mmu page faults, and we could even be holding
the mmu lock when copy_to_user() page faults. But indeed I have no
experience with avoiding speculative attacks, so suggestions would be
greatly welcome. In our case we do (index & (size - 1)), so is it
still vulnerable to speculative attacks?
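
If it turns out to be needed, I suppose the usual pattern would be to
clamp the userspace-derived values right after the existing range
checks, e.g. in kvm_reset_dirty_gfn() (array_index_nospec() from
<linux/nospec.h>; just a sketch, untested):

        as_id  = array_index_nospec(as_id, KVM_ADDRESS_SPACE_NUM);
        id     = array_index_nospec(id, KVM_USER_MEM_SLOTS);
        offset = array_index_nospec(offset, memslot->npages);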

>
> > + */
> > + smp_read_barrier_depends();
>
> What depends on what here? Looks suspicious ...

Hmm, I think maybe it can be removed because the entry pointer
reference below should be an ordering constraint already?

>
> > + cur_slot = READ_ONCE(entry->slot);
> > + cur_offset = READ_ONCE(entry->offset);
> > + mask = 1;
> > + count++;
> > + ring->reset_index++;
> > + while (ring->reset_index != fetch) {
> > + entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> > + smp_read_barrier_depends();
>
> same concerns here
>
> > + next_slot = READ_ONCE(entry->slot);
> > + next_offset = READ_ONCE(entry->offset);
> > + ring->reset_index++;
> > + count++;
> > + /*
> > + * Try to coalesce the reset operations when the guest is
> > + * scanning pages in the same slot.
>
> what does guest scanning mean?

My wild guess is that it means the guest is accessing pages
contiguously, so the dirty gfns are contiguous too. Anyway I agree
it's not clear, and I can try to rephrase it.

>
> > + */
> > + if (next_slot == cur_slot) {
> > + int delta = next_offset - cur_offset;
> > +
> > + if (delta >= 0 && delta < BITS_PER_LONG) {
> > + mask |= 1ull << delta;
> > + continue;
> > + }
> > +
> > + /* Backwards visit, careful about overflows! */
> > + if (delta > -BITS_PER_LONG && delta < 0 &&
> > + (mask << -delta >> -delta) == mask) {
> > + cur_offset = next_offset;
> > + mask = (mask << -delta) | 1;
> > + continue;
> > + }
> > + }
> > + kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > + cur_slot = next_slot;
> > + cur_offset = next_offset;
> > + mask = 1;
> > + }
> > + kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > +
> > + return count;
> > +}
> > +
> > +static inline u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
> > +{
> > + return ring->dirty_index - ring->reset_index;
> > +}
> > +
> > +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring)
> > +{
> > + return kvm_dirty_ring_used(ring) >= ring->size;
> > +}
> > +
> > +/*
> > + * Returns:
> > + * >0 if we should kick the vcpu out,
> > + * =0 if the gfn pushed successfully, or,
> > + * <0 if error (e.g. ring full)
> > + */
> > +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> > + struct kvm_dirty_ring_indexes *indexes,
> > + u32 slot, u64 offset, bool lock)
> > +{
> > + int ret;
> > + struct kvm_dirty_gfn *entry;
> > +
> > + if (lock)
> > + spin_lock(&ring->lock);
>
> what's the story around locking here? Why is it safe
> not to take the lock sometimes?

kvm_dirty_ring_push() is called with lock==true only when the per-vm
ring is used. For the per-vcpu rings, pushes only happen from the
vcpu's own context, so no lock is needed (i.e. kvm_dirty_ring_push()
is called with lock==false).

>
> > +
> > + if (kvm_dirty_ring_full(ring)) {
> > + ret = -EBUSY;
> > + goto out;
> > + }
> > +
> > + entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> > + entry->slot = slot;
> > + entry->offset = offset;
> > + smp_wmb();
> > + ring->dirty_index++;
> > + WRITE_ONCE(indexes->avail_index, ring->dirty_index);
> > + ret = kvm_dirty_ring_used(ring) >= ring->soft_limit;
> > + pr_info("%s: slot %u offset %llu used %u\n",
> > + __func__, slot, offset, kvm_dirty_ring_used(ring));
> > +
> > +out:
> > + if (lock)
> > + spin_unlock(&ring->lock);
> > +
> > + return ret;
> > +}
> > +
> > +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i)
> > +{
> > + return vmalloc_to_page((void *)ring->dirty_gfns + i * PAGE_SIZE);
> > +}
> > +
> > +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
> > +{
> > + if (ring->dirty_gfns) {
> > + vfree(ring->dirty_gfns);
> > + ring->dirty_gfns = NULL;
> > + }
> > +}
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 681452d288cd..8642c977629b 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -64,6 +64,8 @@
> > #define CREATE_TRACE_POINTS
> > #include <trace/events/kvm.h>
> >
> > +#include <linux/kvm_dirty_ring.h>
> > +
> > /* Worst case buffer size needed for holding an integer. */
> > #define ITOA_MAX_LEN 12
> >
> > @@ -149,6 +151,10 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
> > struct kvm_vcpu *vcpu,
> > struct kvm_memory_slot *memslot,
> > gfn_t gfn);
> > +static void mark_page_dirty_in_ring(struct kvm *kvm,
> > + struct kvm_vcpu *vcpu,
> > + struct kvm_memory_slot *slot,
> > + gfn_t gfn);
> >
> > __visible bool kvm_rebooting;
> > EXPORT_SYMBOL_GPL(kvm_rebooting);
> > @@ -359,11 +365,22 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
> > vcpu->preempted = false;
> > vcpu->ready = false;
> >
> > + if (kvm->dirty_ring_size) {
> > + r = kvm_dirty_ring_alloc(vcpu->kvm, &vcpu->dirty_ring);
> > + if (r) {
> > + kvm->dirty_ring_size = 0;
> > + goto fail_free_run;
> > + }
> > + }
> > +
> > r = kvm_arch_vcpu_init(vcpu);
> > if (r < 0)
> > - goto fail_free_run;
> > + goto fail_free_ring;
> > return 0;
> >
> > +fail_free_ring:
> > + if (kvm->dirty_ring_size)
> > + kvm_dirty_ring_free(&vcpu->dirty_ring);
> > fail_free_run:
> > free_page((unsigned long)vcpu->run);
> > fail:
> > @@ -381,6 +398,8 @@ void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
> > put_pid(rcu_dereference_protected(vcpu->pid, 1));
> > kvm_arch_vcpu_uninit(vcpu);
> > free_page((unsigned long)vcpu->run);
> > + if (vcpu->kvm->dirty_ring_size)
> > + kvm_dirty_ring_free(&vcpu->dirty_ring);
> > }
> > EXPORT_SYMBOL_GPL(kvm_vcpu_uninit);
> >
> > @@ -690,6 +709,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
> > struct kvm *kvm = kvm_arch_alloc_vm();
> > int r = -ENOMEM;
> > int i;
> > + struct page *page;
> >
> > if (!kvm)
> > return ERR_PTR(-ENOMEM);
> > @@ -705,6 +725,14 @@ static struct kvm *kvm_create_vm(unsigned long type)
> >
> > BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
> >
> > + page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> > + if (!page) {
> > + r = -ENOMEM;
> > + goto out_err_alloc_page;
> > + }
> > + kvm->vm_run = page_address(page);
>
> So 4K with just 8 bytes used. Not as bad as 1/2Mbyte for the ring but
> still. What is wrong with just a pointer and calling put_user?

I want to make it the starting point for sharing per-vm fields between
user and kernel, just like kvm_run is for per-vcpu fields.

IMHO it'll be awkward if we always have to introduce a new interface
just to take a pointer to a userspace buffer and cache it... I'd say so
far I like the design of kvm_run and the like because it's efficient,
easy to use, and easy to extend.

>
> > + BUILD_BUG_ON(sizeof(struct kvm_vm_run) > PAGE_SIZE);
> > +
> > if (init_srcu_struct(&kvm->srcu))
> > goto out_err_no_srcu;
> > if (init_srcu_struct(&kvm->irq_srcu))
> > @@ -775,6 +803,9 @@ static struct kvm *kvm_create_vm(unsigned long type)
> > out_err_no_irq_srcu:
> > cleanup_srcu_struct(&kvm->srcu);
> > out_err_no_srcu:
> > + free_page((unsigned long)page);
> > + kvm->vm_run = NULL;
> > +out_err_alloc_page:
> > kvm_arch_free_vm(kvm);
> > mmdrop(current->mm);
> > return ERR_PTR(r);
> > @@ -800,6 +831,15 @@ static void kvm_destroy_vm(struct kvm *kvm)
> > int i;
> > struct mm_struct *mm = kvm->mm;
> >
> > + if (kvm->dirty_ring_size) {
> > + kvm_dirty_ring_free(&kvm->vm_dirty_ring);
> > + }
> > +
> > + if (kvm->vm_run) {
> > + free_page((unsigned long)kvm->vm_run);
> > + kvm->vm_run = NULL;
> > + }
> > +
> > kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
> > kvm_destroy_vm_debugfs(kvm);
> > kvm_arch_sync_events(kvm);
> > @@ -2301,7 +2341,7 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
> > {
> > if (memslot && memslot->dirty_bitmap) {
> > unsigned long rel_gfn = gfn - memslot->base_gfn;
> > -
> > + mark_page_dirty_in_ring(kvm, vcpu, memslot, gfn);
> > set_bit_le(rel_gfn, memslot->dirty_bitmap);
> > }
> > }
> > @@ -2649,6 +2689,13 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
> > }
> > EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
> >
> > +static bool kvm_fault_in_dirty_ring(struct kvm *kvm, struct vm_fault *vmf)
> > +{
> > + return (vmf->pgoff >= KVM_DIRTY_LOG_PAGE_OFFSET) &&
> > + (vmf->pgoff < KVM_DIRTY_LOG_PAGE_OFFSET +
> > + kvm->dirty_ring_size / PAGE_SIZE);
> > +}
> > +
> > static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
> > {
> > struct kvm_vcpu *vcpu = vmf->vma->vm_file->private_data;
> > @@ -2664,6 +2711,10 @@ static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
> > else if (vmf->pgoff == KVM_COALESCED_MMIO_PAGE_OFFSET)
> > page = virt_to_page(vcpu->kvm->coalesced_mmio_ring);
> > #endif
> > + else if (kvm_fault_in_dirty_ring(vcpu->kvm, vmf))
> > + page = kvm_dirty_ring_get_page(
> > + &vcpu->dirty_ring,
> > + vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
> > else
> > return kvm_arch_vcpu_fault(vcpu, vmf);
> > get_page(page);
> > @@ -3259,12 +3310,162 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> > #endif
> > case KVM_CAP_NR_MEMSLOTS:
> > return KVM_USER_MEM_SLOTS;
> > + case KVM_CAP_DIRTY_LOG_RING:
> > + /* Version will be zero if arch didn't implement it */
> > + return KVM_DIRTY_RING_VERSION;
> > default:
> > break;
> > }
> > return kvm_vm_ioctl_check_extension(kvm, arg);
> > }
> >
> > +static void mark_page_dirty_in_ring(struct kvm *kvm,
> > + struct kvm_vcpu *vcpu,
> > + struct kvm_memory_slot *slot,
> > + gfn_t gfn)
> > +{
> > + u32 as_id = 0;
> > + u64 offset;
> > + int ret;
> > + struct kvm_dirty_ring *ring;
> > + struct kvm_dirty_ring_indexes *indexes;
> > + bool is_vm_ring;
> > +
> > + if (!kvm->dirty_ring_size)
> > + return;
> > +
> > + offset = gfn - slot->base_gfn;
> > +
> > + if (vcpu) {
> > + as_id = kvm_arch_vcpu_memslots_id(vcpu);
> > + } else {
> > + as_id = 0;
> > + vcpu = kvm_get_running_vcpu();
> > + }
> > +
> > + if (vcpu) {
> > + ring = &vcpu->dirty_ring;
> > + indexes = &vcpu->run->vcpu_ring_indexes;
> > + is_vm_ring = false;
> > + } else {
> > + /*
> > + * Put onto per vm ring because no vcpu context. Kick
> > + * vcpu0 if ring is full.
>
> What about tasks on vcpu 0? Do guests realize it's a bad idea to put
> critical tasks there, they will be penalized disproportionally?

Reasonable question. So far we can't avoid it because a vcpu exit is
the event mechanism we have to say "hey, please collect dirty bits".
Maybe there is a better way than this, but I'll need to rethink all of
this...

>
> > + */
> > + vcpu = kvm->vcpus[0];
> > + ring = &kvm->vm_dirty_ring;
> > + indexes = &kvm->vm_run->vm_ring_indexes;
> > + is_vm_ring = true;
> > + }
> > +
> > + ret = kvm_dirty_ring_push(ring, indexes,
> > + (as_id << 16)|slot->id, offset,
> > + is_vm_ring);
> > + if (ret < 0) {
> > + if (is_vm_ring)
> > + pr_warn_once("vcpu %d dirty log overflow\n",
> > + vcpu->vcpu_id);
> > + else
> > + pr_warn_once("per-vm dirty log overflow\n");
> > + return;
> > + }
> > +
> > + if (ret)
> > + kvm_make_request(KVM_REQ_DIRTY_RING_FULL, vcpu);
> > +}
> > +
> > +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
> > +{
> > + struct kvm_memory_slot *memslot;
> > + int as_id, id;
> > +
> > + as_id = slot >> 16;
> > + id = (u16)slot;
> > + if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
> > + return;
> > +
> > + memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
> > + if (offset >= memslot->npages)
> > + return;
> > +
> > + spin_lock(&kvm->mmu_lock);
> > + /* FIXME: we should use a single AND operation, but there is no
> > + * applicable atomic API.
> > + */
> > + while (mask) {
> > + clear_bit_le(offset + __ffs(mask), memslot->dirty_bitmap);
> > + mask &= mask - 1;
> > + }
> > +
> > + kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
> > + spin_unlock(&kvm->mmu_lock);
> > +}
> > +
> > +static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
> > +{
> > + int r;
> > +
> > + /* the size should be power of 2 */
> > + if (!size || (size & (size - 1)))
> > + return -EINVAL;
> > +
> > + /* Should be bigger to keep the reserved entries, or a page */
> > + if (size < kvm_dirty_ring_get_rsvd_entries() *
> > + sizeof(struct kvm_dirty_gfn) || size < PAGE_SIZE)
> > + return -EINVAL;
> > +
> > + if (size > KVM_DIRTY_RING_MAX_ENTRIES *
> > + sizeof(struct kvm_dirty_gfn))
> > + return -E2BIG;
>
> KVM_DIRTY_RING_MAX_ENTRIES is not part of UAPI.
> So how does userspace know what's legal?
> Do you expect it to just try?

Yep that's what I thought. :)

Please grep for E2BIG in the QEMU repo (target/i386/kvm.c)... it won't
be hard to do IMHO.

> More likely it will just copy the number from kernel and can
> never ever make it smaller.

Not sure, but I can certainly move KVM_DIRTY_RING_MAX_ENTRIES to uapi
too.
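
Something like this on the userspace side would also do (an untested
sketch, only to illustrate the probing; the helper name and the halving
policy are made up):

    /* Hypothetical probe: shrink the requested ring size until
     * KVM_ENABLE_CAP stops failing with E2BIG. */
    #include <errno.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static int enable_dirty_ring(int vm_fd, __u32 bytes)
    {
        while (bytes >= 4096) {                 /* give up below one page */
            struct kvm_enable_cap cap = {
                .cap  = KVM_CAP_DIRTY_LOG_RING,
                .args = { bytes },
            };

            if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap) == 0)
                return bytes;                   /* accepted ring size */
            if (errno != E2BIG)
                return -errno;                  /* some other failure */
            bytes /= 2;                         /* too big, try smaller */
        }
        return -EINVAL;
    }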

Thanks,

--
Peter Xu

2019-12-11 22:59:33

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On Wed, Dec 11, 2019 at 03:59:52PM -0500, Peter Xu wrote:
> On Wed, Dec 11, 2019 at 07:53:48AM -0500, Michael S. Tsirkin wrote:
> > On Fri, Nov 29, 2019 at 04:34:54PM -0500, Peter Xu wrote:
> > > This patch is heavily based on previous work from Lei Cao
> > > <[email protected]> and Paolo Bonzini <[email protected]>. [1]
> > >
> > > KVM currently uses large bitmaps to track dirty memory. These bitmaps
> > > are copied to userspace when userspace queries KVM for its dirty page
> > > information. The use of bitmaps is mostly sufficient for live
> > > migration, as large parts of memory are be dirtied from one log-dirty
> > > pass to another. However, in a checkpointing system, the number of
> > > dirty pages is small and in fact it is often bounded---the VM is
> > > paused when it has dirtied a pre-defined number of pages. Traversing a
> > > large, sparsely populated bitmap to find set bits is time-consuming,
> > > as is copying the bitmap to user-space.
> > >
> > > A similar issue will be there for live migration when the guest memory
> > > is huge while the page dirty procedure is trivial. In that case for
> > > each dirty sync we need to pull the whole dirty bitmap to userspace
> > > and analyse every bit even if it's mostly zeros.
> > >
> > > The preferred data structure for above scenarios is a dense list of
> > > guest frame numbers (GFN). This patch series stores the dirty list in
> > > kernel memory that can be memory mapped into userspace to allow speedy
> > > harvesting.
> > >
> > > We defined two new data structures:
> > >
> > > struct kvm_dirty_ring;
> > > struct kvm_dirty_ring_indexes;
> > >
> > > Firstly, kvm_dirty_ring is defined to represent a ring of dirty
> > > pages. When dirty tracking is enabled, we can push dirty gfn onto the
> > > ring.
> > >
> > > Secondly, kvm_dirty_ring_indexes is defined to represent the
> > > user/kernel interface of each ring. Currently it contains two
> > > indexes: (1) avail_index represents where we should push our next
> > > PFN (written by kernel), while (2) fetch_index represents where the
> > > userspace should fetch the next dirty PFN (written by userspace).
> > >
> > > One complete ring is composed by one kvm_dirty_ring plus its
> > > corresponding kvm_dirty_ring_indexes.
> > >
> > > Currently, we have N+1 rings for each VM of N vcpus:
> > >
> > > - for each vcpu, we have 1 per-vcpu dirty ring,
> > > - for each vm, we have 1 per-vm dirty ring
> > >
> > > Please refer to the documentation update in this patch for more
> > > details.
> > >
> > > Note that this patch implements the core logic of dirty ring buffer.
> > > It's still disabled for all archs for now. Also, we'll address some
> > > of the other issues in follow up patches before it's firstly enabled
> > > on x86.
> > >
> > > [1] https://patchwork.kernel.org/patch/10471409/
> > >
> > > Signed-off-by: Lei Cao <[email protected]>
> > > Signed-off-by: Paolo Bonzini <[email protected]>
> > > Signed-off-by: Peter Xu <[email protected]>
> >
> >
> > Thanks, that's interesting.
>
> Hi, Michael,
>
> Thanks for reading the series.
>
> >
> > > ---
> > > Documentation/virt/kvm/api.txt | 109 +++++++++++++++
> > > arch/x86/kvm/Makefile | 3 +-
> > > include/linux/kvm_dirty_ring.h | 67 +++++++++
> > > include/linux/kvm_host.h | 33 +++++
> > > include/linux/kvm_types.h | 1 +
> > > include/uapi/linux/kvm.h | 36 +++++
> > > virt/kvm/dirty_ring.c | 156 +++++++++++++++++++++
> > > virt/kvm/kvm_main.c | 240 ++++++++++++++++++++++++++++++++-
> > > 8 files changed, 642 insertions(+), 3 deletions(-)
> > > create mode 100644 include/linux/kvm_dirty_ring.h
> > > create mode 100644 virt/kvm/dirty_ring.c
> > >
> > > diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
> > > index 49183add44e7..fa622c9a2eb8 100644
> > > --- a/Documentation/virt/kvm/api.txt
> > > +++ b/Documentation/virt/kvm/api.txt
> > > @@ -231,6 +231,7 @@ Based on their initialization different VMs may have different capabilities.
> > > It is thus encouraged to use the vm ioctl to query for capabilities (available
> > > with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
> > >
> > > +
> > > 4.5 KVM_GET_VCPU_MMAP_SIZE
> > >
> > > Capability: basic
> > > @@ -243,6 +244,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
> > > memory region. This ioctl returns the size of that region. See the
> > > KVM_RUN documentation for details.
> > >
> > > +Besides the size of the KVM_RUN communication region, other areas of
> > > +the VCPU file descriptor can be mmap-ed, including:
> > > +
> > > +- if KVM_CAP_COALESCED_MMIO is available, a page at
> > > + KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
> > > + this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
> > > + KVM_CAP_COALESCED_MMIO is not documented yet.
> > > +
> > > +- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
> > > + KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. For more information on
> > > + KVM_CAP_DIRTY_LOG_RING, see section 8.3.
> > > +
> > >
> > > 4.6 KVM_SET_MEMORY_REGION
> > >
> >
> > PAGE_SIZE being which value? It's not always trivial for
> > userspace to know what's the PAGE_SIZE for the kernel ...
>
> I thought it can be easily fetched from getpagesize() or
> sysconf(PAGE_SIZE)? Especially considering that the document should
> be for kvm userspace, I'd say it should be common that a hypervisor
> process will need to know this probably in other tons of places.. no?
>
> >
> >
> > > @@ -5358,6 +5371,7 @@ CPU when the exception is taken. If this virtual SError is taken to EL1 using
> > > AArch64, this value will be reported in the ISS field of ESR_ELx.
> > >
> > > See KVM_CAP_VCPU_EVENTS for more details.
> > > +
> > > 8.20 KVM_CAP_HYPERV_SEND_IPI
> > >
> > > Architectures: x86
> > > @@ -5365,6 +5379,7 @@ Architectures: x86
> > > This capability indicates that KVM supports paravirtualized Hyper-V IPI send
> > > hypercalls:
> > > HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
> > > +
> > > 8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
> > >
> > > Architecture: x86
> > > @@ -5378,3 +5393,97 @@ handling by KVM (as some KVM hypercall may be mistakenly treated as TLB
> > > flush hypercalls by Hyper-V) so userspace should disable KVM identification
> > > in CPUID and only exposes Hyper-V identification. In this case, guest
> > > thinks it's running on Hyper-V and only use Hyper-V hypercalls.
> > > +
> > > +8.22 KVM_CAP_DIRTY_LOG_RING
> > > +
> > > +Architectures: x86
> > > +Parameters: args[0] - size of the dirty log ring
> > > +
> > > +KVM is capable of tracking dirty memory using ring buffers that are
> > > +mmaped into userspace; there is one dirty ring per vcpu and one global
> > > +ring per vm.
> > > +
> > > +One dirty ring has the following two major structures:
> > > +
> > > +struct kvm_dirty_ring {
> > > + u16 dirty_index;
> > > + u16 reset_index;
> > > + u32 size;
> > > + u32 soft_limit;
> > > + spinlock_t lock;
> > > + struct kvm_dirty_gfn *dirty_gfns;
> > > +};
> > > +
> > > +struct kvm_dirty_ring_indexes {
> > > + __u32 avail_index; /* set by kernel */
> > > + __u32 fetch_index; /* set by userspace */
> >
> > Sticking these next to each other seems to guarantee cache conflicts.
> >
> > Avail/Fetch seems to mimic Virtio's avail/used exactly. I am not saying
> > you must reuse the code really, but I think you should take a hard look
> > at e.g. the virtio packed ring structure. We spent a bunch of time
> > optimizing it for cache utilization. It seems kernel is the driver,
> > making entries available, and userspace the device, using them.
> > Again let's not develop a thread about this, but I think
> > this is something to consider and discuss in future versions
> > of the patches.
>
> I think I completely understand your concern. We should avoid wasting
> time on those are already there. I'm just afraid that it'll took even
> more time to use virtio for this use case while at last we don't
> really get much benefit out of it (e.g. most of the virtio features
> are not used).
>
> Yeh let's not develop a thread for this topic - I will read more on
> virtio before my next post to see whether there's any chance we can
> share anything with virtio ring.
>
> >
> >
> > > +};
> > > +
> > > +While for each of the dirty entry it's defined as:
> > > +
> > > +struct kvm_dirty_gfn {
> >
> > What does GFN stand for?
>
> It's guest frame number, iiuc. I'm not the one who named this, but
> that's what I understand..
>
> >
> > > + __u32 pad;
> > > + __u32 slot; /* as_id | slot_id */
> > > + __u64 offset;
> > > +};
> >
> > offset of what? a 4K page right? Seems like a waste e.g. for
> > hugetlbfs... How about replacing pad with size instead?
>
> As Paolo explained, it's the page frame number of the guest. IIUC
> even for hugetlbfs we track dirty bits in 4k size.
>
> >
> > > +
> > > +The fields in kvm_dirty_ring will be only internal to KVM itself,
> > > +while the fields in kvm_dirty_ring_indexes will be exposed to
> > > +userspace to be either read or written.
> >
> > I'm not sure what you are trying to say here. kvm_dirty_gfn
> > seems to be part of UAPI.
>
> It was talking about kvm_dirty_ring, which is kvm internal and not
> exposed to uapi. While kvm_dirty_gfn is exposed to the users.
>
> >
> > > +
> > > +The two indices in the ring buffer are free running counters.
> > > +
> > > +In pseudocode, processing the ring buffer looks like this:
> > > +
> > > + idx = load-acquire(&ring->fetch_index);
> > > + while (idx != ring->avail_index) {
> > > + struct kvm_dirty_gfn *entry;
> > > + entry = &ring->dirty_gfns[idx & (size - 1)];
> > > + ...
> > > +
> > > + idx++;
> > > + }
> > > + ring->fetch_index = idx;
> > > +
> > > +Userspace calls KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM ioctl
> > > +to enable this capability for the new guest and set the size of the
> > > +rings. It is only allowed before creating any vCPU, and the size of
> > > +the ring must be a power of two.
> >
> > All these seem like arbitrary limitations to me.
>
> The dependency of vcpu is partly because we need to create per-vcpu
> ring, so it's easier that we don't allow it to change after that.
>
> >
> > Sizing the ring correctly might prove to be a challenge.
> >
> > Thus I think there's value in resizing the rings
> > without destroying VCPU.
>
> Do you have an example on when we could use this feature?

So e.g. start with a small ring, and if you see stalls too often,
increase it? Otherwise I don't see how one decides on the ring size.

> My wild
> guess is that even if we try hard to allow resizing (assuming that
> won't bring more bugs, but I highly doubt...), people may not use it
> at all.
>
> The major scenario here is that kvm userspace will be collecting the
> dirty bits quickly, so the ring should not really get full easily.
> Then the ring size does not really matter much either, as long as it
> is bigger than some specific value to avoid vmexits due to full.

Exactly, but I don't see how you are going to find that value unless
it's auto-tuned dynamically.

> How about we start with the simple that we don't allow it to change?
> We can do that when the requirement comes.
>
> >
> > Also, power of two just saves a branch here and there,
> > but wastes lots of memory. Just wrap the index around to
> > 0 and then users can select any size?
>
> Same as above to postpone until we need it?

It's to save memory, don't we always need to do that?

> >
> >
> >
> > > The larger the ring buffer, the less
> > > +likely the ring is full and the VM is forced to exit to userspace. The
> > > +optimal size depends on the workload, but it is recommended that it be
> > > +at least 64 KiB (4096 entries).
> >
> > OTOH larger buffers put lots of pressure on the system cache.
> >
> > > +
> > > +After the capability is enabled, userspace can mmap the global ring
> > > +buffer (kvm_dirty_gfn[], offset KVM_DIRTY_LOG_PAGE_OFFSET) and the
> > > +indexes (kvm_dirty_ring_indexes, offset 0) from the VM file
> > > +descriptor. The per-vcpu dirty ring instead is mmapped when the vcpu
> > > +is created, similar to the kvm_run struct (kvm_dirty_ring_indexes
> > > +locates inside kvm_run, while kvm_dirty_gfn[] at offset
> > > +KVM_DIRTY_LOG_PAGE_OFFSET).
> > > +
> > > +Just like for dirty page bitmaps, the buffer tracks writes to
> > > +all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
> > > +set in KVM_SET_USER_MEMORY_REGION. Once a memory region is registered
> > > +with the flag set, userspace can start harvesting dirty pages from the
> > > +ring buffer.
> > > +
> > > +To harvest the dirty pages, userspace accesses the mmaped ring buffer
> > > +to read the dirty GFNs up to avail_index, and sets the fetch_index
> > > +accordingly. This can be done when the guest is running or paused,
> > > +and dirty pages need not be collected all at once. After processing
> > > +one or more entries in the ring buffer, userspace calls the VM ioctl
> > > +KVM_RESET_DIRTY_RINGS to notify the kernel that it has updated
> > > +fetch_index and to mark those pages clean. Therefore, the ioctl
> > > +must be called *before* reading the content of the dirty pages.
> > > +
> > > +However, there is a major difference comparing to the
> > > +KVM_GET_DIRTY_LOG interface in that when reading the dirty ring from
> > > +userspace it's still possible that the kernel has not yet flushed the
> > > +hardware dirty buffers into the kernel buffer. To achieve that, one
> > > +needs to kick the vcpu out for a hardware buffer flush (vmexit).
> > > +
> > > +If one of the ring buffers is full, the guest will exit to userspace
> > > +with the exit reason set to KVM_EXIT_DIRTY_LOG_FULL, and the
> > > +KVM_RUN ioctl will return -EINTR. Once that happens, userspace
> > > +should pause all the vcpus, then harvest all the dirty pages and
> > > +rearm the dirty traps. It can unpause the guest after that.
> >
> > This last item means that the performance impact of the feature is
> > really hard to predict. Can improve some workloads drastically. Or can
> > slow some down.
> >
> >
> > One solution could be to actually allow using this together with the
> > existing bitmap. Userspace can then decide whether it wants to block
> > VCPU on ring full, or just record ring full condition and recover by
> > bitmap scanning.
>
> That's true, but again allowing mixture use of the two might bring
> extra complexity as well (especially when after adding
> KVM_CLEAR_DIRTY_LOG).
>
> My understanding of this is that normally we do only want either one
> of them depending on the major workload and the configuration of the
> guest.

And again, how does one know which to enable? No one has the time to
fine-tune a gazillion parameters.

> It's not trivial to try to provide a one-for-all solution. So
> again I would hope we can start from easy, then we extend when we have
> better ideas on how to leverage the two interfaces when the ideas
> really come, and then we can justify whether it's worth it to work on
> that complexity.

It's less *coding* work to build a simple thing, but it needs much more *testing*.

IMHO a huge amount of benchmarking has to happen if you just want to
set this loose on users as the default with these kinds of
limitations. We need to be sure that even though in theory
it can be very bad, in practice it's actually good.
If it's auto-tuning then it's a much easier sell to upstream
even if there's a chance of some regressions.

> >
> >
> > > diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> > > index b19ef421084d..0acee817adfb 100644
> > > --- a/arch/x86/kvm/Makefile
> > > +++ b/arch/x86/kvm/Makefile
> > > @@ -5,7 +5,8 @@ ccflags-y += -Iarch/x86/kvm
> > > KVM := ../../../virt/kvm
> > >
> > > kvm-y += $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> > > - $(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
> > > + $(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \
> > > + $(KVM)/dirty_ring.o
> > > kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
> > >
> > > kvm-y += x86.o emulate.o i8259.o irq.o lapic.o \
> > > diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
> > > new file mode 100644
> > > index 000000000000..8335635b7ff7
> > > --- /dev/null
> > > +++ b/include/linux/kvm_dirty_ring.h
> > > @@ -0,0 +1,67 @@
> > > +#ifndef KVM_DIRTY_RING_H
> > > +#define KVM_DIRTY_RING_H
> > > +
> > > +/*
> > > + * struct kvm_dirty_ring is defined in include/uapi/linux/kvm.h.
> > > + *
> > > + * dirty_ring: shared with userspace via mmap. It is the compact list
> > > + * that holds the dirty pages.
> > > + * dirty_index: free running counter that points to the next slot in
> > > + * dirty_ring->dirty_gfns where a new dirty page should go.
> > > + * reset_index: free running counter that points to the next dirty page
> > > + * in dirty_ring->dirty_gfns for which dirty trap needs to
> > > + * be reenabled
> > > + * size: size of the compact list, dirty_ring->dirty_gfns
> > > + * soft_limit: when the number of dirty pages in the list reaches this
> > > + * limit, vcpu that owns this ring should exit to userspace
> > > + * to allow userspace to harvest all the dirty pages
> > > + * lock: protects dirty_ring, only in use if this is the global
> > > + * ring
> > > + *
> > > + * The number of dirty pages in the ring is calculated by,
> > > + * dirty_index - reset_index
> > > + *
> > > + * kernel increments dirty_ring->indices.avail_index after dirty index
> > > + * is incremented. When userspace harvests the dirty pages, it increments
> > > + * dirty_ring->indices.fetch_index up to dirty_ring->indices.avail_index.
> > > + * When kernel reenables dirty traps for the dirty pages, it increments
> > > + * reset_index up to dirty_ring->indices.fetch_index.
> > > + *
> > > + */
> > > +struct kvm_dirty_ring {
> > > + u32 dirty_index;
> > > + u32 reset_index;
> > > + u32 size;
> > > + u32 soft_limit;
> > > + spinlock_t lock;
> > > + struct kvm_dirty_gfn *dirty_gfns;
> > > +};
> > > +
> > > +u32 kvm_dirty_ring_get_rsvd_entries(void);
> > > +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring);
> > > +
> > > +/*
> > > + * called with kvm->slots_lock held, returns the number of
> > > + * processed pages.
> > > + */
> > > +int kvm_dirty_ring_reset(struct kvm *kvm,
> > > + struct kvm_dirty_ring *ring,
> > > + struct kvm_dirty_ring_indexes *indexes);
> > > +
> > > +/*
> > > + * returns 0: successfully pushed
> > > + * 1: successfully pushed, soft limit reached,
> > > + * vcpu should exit to userspace
> > > + * -EBUSY: unable to push, dirty ring full.
> > > + */
> > > +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> > > + struct kvm_dirty_ring_indexes *indexes,
> > > + u32 slot, u64 offset, bool lock);
> > > +
> > > +/* for use in vm_operations_struct */
> > > +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i);
> > > +
> > > +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
> > > +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring);
> > > +
> > > +#endif
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 498a39462ac1..7b747bc9ff3e 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -34,6 +34,7 @@
> > > #include <linux/kvm_types.h>
> > >
> > > #include <asm/kvm_host.h>
> > > +#include <linux/kvm_dirty_ring.h>
> > >
> > > #ifndef KVM_MAX_VCPU_ID
> > > #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
> > > @@ -146,6 +147,7 @@ static inline bool is_error_page(struct page *page)
> > > #define KVM_REQ_MMU_RELOAD (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> > > #define KVM_REQ_PENDING_TIMER 2
> > > #define KVM_REQ_UNHALT 3
> > > +#define KVM_REQ_DIRTY_RING_FULL 4
> > > #define KVM_REQUEST_ARCH_BASE 8
> > >
> > > #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
> > > @@ -321,6 +323,7 @@ struct kvm_vcpu {
> > > bool ready;
> > > struct kvm_vcpu_arch arch;
> > > struct dentry *debugfs_dentry;
> > > + struct kvm_dirty_ring dirty_ring;
> > > };
> > >
> > > static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
> > > @@ -501,6 +504,10 @@ struct kvm {
> > > struct srcu_struct srcu;
> > > struct srcu_struct irq_srcu;
> > > pid_t userspace_pid;
> > > + /* Data structure to be exported by mmap(kvm->fd, 0) */
> > > + struct kvm_vm_run *vm_run;
> > > + u32 dirty_ring_size;
> > > + struct kvm_dirty_ring vm_dirty_ring;
> > > };
> > >
> > > #define kvm_err(fmt, ...) \
> > > @@ -832,6 +839,8 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> > > gfn_t gfn_offset,
> > > unsigned long mask);
> > >
> > > +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask);
> > > +
> > > int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
> > > struct kvm_dirty_log *log);
> > > int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> > > @@ -1411,4 +1420,28 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
> > > uintptr_t data, const char *name,
> > > struct task_struct **thread_ptr);
> > >
> > > +/*
> > > + * This defines how many reserved entries we want to keep before we
> > > + * kick the vcpu to the userspace to avoid dirty ring full. This
> > > + * value can be tuned to higher if e.g. PML is enabled on the host.
> > > + */
> > > +#define KVM_DIRTY_RING_RSVD_ENTRIES 64
> > > +
> > > +/* Max number of entries allowed for each kvm dirty ring */
> > > +#define KVM_DIRTY_RING_MAX_ENTRIES 65536
> > > +
> > > +/*
> > > + * Arch needs to define these macro after implementing the dirty ring
> > > + * feature. KVM_DIRTY_LOG_PAGE_OFFSET should be defined as the
> > > + * starting page offset of the dirty ring structures,
> >
> > Confused. Offset where? You set a default for everyone - where does arch
> > want to override it?
>
> If arch defines KVM_DIRTY_LOG_PAGE_OFFSET then below will be a no-op,
> please see [1] on #ifndef.

So which arches need to override it? Why do you say they should?

> >
> > > while
> > > + * KVM_DIRTY_RING_VERSION should be defined as >=1. By default, this
> > > + * feature is off on all archs.
> > > + */
> > > +#ifndef KVM_DIRTY_LOG_PAGE_OFFSET
>
> [1]
>
> > > +#define KVM_DIRTY_LOG_PAGE_OFFSET 0
> > > +#endif
> > > +#ifndef KVM_DIRTY_RING_VERSION
> > > +#define KVM_DIRTY_RING_VERSION 0
> > > +#endif
> >
> > One way versioning, with no bits and negotiation
> > will make it hard to change down the road.
> > what's wrong with existing KVM capabilities that
> > you feel there's a need for dedicated versioning for this?
>
> Frankly speaking I don't even think it'll change in the near
> future.. :)
>
> Yeh kvm versioning could work too. Here we can also return a zero
> just like most of the caps (or KVM_DIRTY_LOG_PAGE_OFFSET as in the
> original patchset, but that doesn't really help either because it's
> defined in uapi), but I just don't see how it helps... So I returned
> a version number just in case we'd like to change the layout some day
> and when we don't want to bother introducing another cap bit for the
> same feature (like KVM_CAP_MANUAL_DIRTY_LOG_PROTECT and
> KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2).

I guess it's up to Paolo but really I don't see the point.
You can add a version later when it means something ...

> >
> > > +
> > > #endif
> > > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> > > index 1c88e69db3d9..d9d03eea145a 100644
> > > --- a/include/linux/kvm_types.h
> > > +++ b/include/linux/kvm_types.h
> > > @@ -11,6 +11,7 @@ struct kvm_irq_routing_table;
> > > struct kvm_memory_slot;
> > > struct kvm_one_reg;
> > > struct kvm_run;
> > > +struct kvm_vm_run;
> > > struct kvm_userspace_memory_region;
> > > struct kvm_vcpu;
> > > struct kvm_vcpu_init;
> > > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > > index e6f17c8e2dba..0b88d76d6215 100644
> > > --- a/include/uapi/linux/kvm.h
> > > +++ b/include/uapi/linux/kvm.h
> > > @@ -236,6 +236,7 @@ struct kvm_hyperv_exit {
> > > #define KVM_EXIT_IOAPIC_EOI 26
> > > #define KVM_EXIT_HYPERV 27
> > > #define KVM_EXIT_ARM_NISV 28
> > > +#define KVM_EXIT_DIRTY_RING_FULL 29
> > >
> > > /* For KVM_EXIT_INTERNAL_ERROR */
> > > /* Emulate instruction failed. */
> > > @@ -247,6 +248,11 @@ struct kvm_hyperv_exit {
> > > /* Encounter unexpected vm-exit reason */
> > > #define KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON 4
> > >
> > > +struct kvm_dirty_ring_indexes {
> > > + __u32 avail_index; /* set by kernel */
> > > + __u32 fetch_index; /* set by userspace */
> > > +};
> > > +
> > > /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
> > > struct kvm_run {
> > > /* in */
> > > @@ -421,6 +427,13 @@ struct kvm_run {
> > > struct kvm_sync_regs regs;
> > > char padding[SYNC_REGS_SIZE_BYTES];
> > > } s;
> > > +
> > > + struct kvm_dirty_ring_indexes vcpu_ring_indexes;
> > > +};
> > > +
> > > +/* Returned by mmap(kvm->fd, offset=0) */
> > > +struct kvm_vm_run {
> > > + struct kvm_dirty_ring_indexes vm_ring_indexes;
> > > };
> > >
> > > /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
> > > @@ -1009,6 +1022,7 @@ struct kvm_ppc_resize_hpt {
> > > #define KVM_CAP_PPC_GUEST_DEBUG_SSTEP 176
> > > #define KVM_CAP_ARM_NISV_TO_USER 177
> > > #define KVM_CAP_ARM_INJECT_EXT_DABT 178
> > > +#define KVM_CAP_DIRTY_LOG_RING 179
> > >
> > > #ifdef KVM_CAP_IRQ_ROUTING
> > >
> > > @@ -1472,6 +1486,9 @@ struct kvm_enc_region {
> > > /* Available with KVM_CAP_ARM_SVE */
> > > #define KVM_ARM_VCPU_FINALIZE _IOW(KVMIO, 0xc2, int)
> > >
> > > +/* Available with KVM_CAP_DIRTY_LOG_RING */
> > > +#define KVM_RESET_DIRTY_RINGS _IO(KVMIO, 0xc3)
> > > +
> > > /* Secure Encrypted Virtualization command */
> > > enum sev_cmd_id {
> > > /* Guest initialization commands */
> > > @@ -1622,4 +1639,23 @@ struct kvm_hyperv_eventfd {
> > > #define KVM_HYPERV_CONN_ID_MASK 0x00ffffff
> > > #define KVM_HYPERV_EVENTFD_DEASSIGN (1 << 0)
> > >
> > > +/*
> > > + * The following are the requirements for supporting dirty log ring
> > > + * (by enabling KVM_DIRTY_LOG_PAGE_OFFSET).
> > > + *
> > > + * 1. Memory accesses by KVM should call kvm_vcpu_write_* instead
> > > + * of kvm_write_* so that the global dirty ring is not filled up
> > > + * too quickly.
> > > + * 2. kvm_arch_mmu_enable_log_dirty_pt_masked should be defined for
> > > + * enabling dirty logging.
> > > + * 3. There should not be a separate step to synchronize hardware
> > > + * dirty bitmap with KVM's.
> > > + */
> > > +
> > > +struct kvm_dirty_gfn {
> > > + __u32 pad;
> > > + __u32 slot;
> > > + __u64 offset;
> > > +};
> > > +
> > > #endif /* __LINUX_KVM_H */
> > > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > > new file mode 100644
> > > index 000000000000..9264891f3c32
> > > --- /dev/null
> > > +++ b/virt/kvm/dirty_ring.c
> > > @@ -0,0 +1,156 @@
> > > +#include <linux/kvm_host.h>
> > > +#include <linux/kvm.h>
> > > +#include <linux/vmalloc.h>
> > > +#include <linux/kvm_dirty_ring.h>
> > > +
> > > +u32 kvm_dirty_ring_get_rsvd_entries(void)
> > > +{
> > > + return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
> > > +}
> > > +
> > > +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring)
> > > +{
> > > + u32 size = kvm->dirty_ring_size;
> > > +
> > > + ring->dirty_gfns = vmalloc(size);
> >
> > So 1/2 a megabyte of kernel memory per VM that userspace locks up.
> > Do we really have to though? Why not get a userspace pointer,
> > write it with copy to user, and sidestep all this?
>
> I'd say it won't be a big issue on locking 1/2M of host mem for a
> vm...
> Also note that if dirty ring is enabled, I plan to evaporate the
> dirty_bitmap in the next post. The old kvm->dirty_bitmap takes
> $GUEST_MEM/32K*2 mem. E.g., for 64G guest it's 64G/32K*2=4M. If with
> dirty ring of 8 vcpus, that could be 64K*8=0.5M, which could be even
> less memory used.

Right - I think Avi described the bitmap in kernel memory as one of
the design mistakes. Why repeat that with the new design?

> >
> > > + if (!ring->dirty_gfns)
> > > + return -ENOMEM;
> > > + memset(ring->dirty_gfns, 0, size);
> > > +
> > > + ring->size = size / sizeof(struct kvm_dirty_gfn);
> > > + ring->soft_limit =
> > > + (kvm->dirty_ring_size / sizeof(struct kvm_dirty_gfn)) -
> > > + kvm_dirty_ring_get_rsvd_entries();
> > > + ring->dirty_index = 0;
> > > + ring->reset_index = 0;
> > > + spin_lock_init(&ring->lock);
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +int kvm_dirty_ring_reset(struct kvm *kvm,
> > > + struct kvm_dirty_ring *ring,
> > > + struct kvm_dirty_ring_indexes *indexes)
> > > +{
> > > + u32 cur_slot, next_slot;
> > > + u64 cur_offset, next_offset;
> > > + unsigned long mask;
> > > + u32 fetch;
> > > + int count = 0;
> > > + struct kvm_dirty_gfn *entry;
> > > +
> > > + fetch = READ_ONCE(indexes->fetch_index);
> > > + if (fetch == ring->reset_index)
> > > + return 0;
> > > +
> > > + entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> > > + /*
> > > + * The ring buffer is shared with userspace, which might mmap
> > > + * it and concurrently modify slot and offset. Userspace must
> > > + * not be trusted! READ_ONCE prevents the compiler from changing
> > > + * the values after they've been range-checked (the checks are
> > > + * in kvm_reset_dirty_gfn).
> >
> > What it doesn't is prevent speculative attacks. That's why things like
> > copy from user have a speculation barrier. Instead of worrying about
> > that, unless it's really critical, I think you'd do well do just use
> > copy to/from user.
>
> IMHO I would really hope this data stays resident without being
> swapped out of memory, just like what we did with kvm->dirty_bitmap...
> it's on the hot path of mmu page faults, and we could even be holding
> the mmu lock when copy_to_user() faults. But indeed I've no experience on
> avoiding speculative attacks, suggestions would be greatly welcomed on
> that. In our case we do (index & (size - 1)), so is it still
> suffering from speculative attacks?

I'm not saying I understand everything in depth.
I'm just reacting to this:
READ_ONCE prevents the compiler from changing
the values after they've been range-checked (the checks are
in kvm_reset_dirty_gfn)

so any range checks you do can be attacked.

And the safest way to avoid the attacks is to do what most of the
kernel does and use copy from/to user when you talk to userspace. It
avoids annoying things like bypassing SMAP too.
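
E.g. something along these lines (purely hypothetical, not part of this
series, and it glosses over the mmu_lock point you raise above; 'uring'
would be a userspace pointer registered via some new ioctl):

    /* Kernel-side sketch of a push into a userspace-owned ring via
     * copy_to_user(), only to illustrate the direction. */
    static int push_to_user_ring(struct kvm_dirty_gfn __user *uring,
                                 u32 size, u32 dirty_index,
                                 u32 slot, u64 offset)
    {
            struct kvm_dirty_gfn gfn = { .slot = slot, .offset = offset };

            /* copy_to_user() can fault and sleep, so this could not be
             * called with mmu_lock held the way the series does today. */
            if (copy_to_user(&uring[dirty_index & (size - 1)],
                             &gfn, sizeof(gfn)))
                    return -EFAULT;
            return 0;
    }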


> >
> > > + */
> > > + smp_read_barrier_depends();
> >
> > What depends on what here? Looks suspicious ...
>
> Hmm, I think maybe it can be removed because the entry pointer
> reference below should be an ordering constraint already?
>
> >
> > > + cur_slot = READ_ONCE(entry->slot);
> > > + cur_offset = READ_ONCE(entry->offset);
> > > + mask = 1;
> > > + count++;
> > > + ring->reset_index++;
> > > + while (ring->reset_index != fetch) {
> > > + entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> > > + smp_read_barrier_depends();
> >
> > same concerns here
> >
> > > + next_slot = READ_ONCE(entry->slot);
> > > + next_offset = READ_ONCE(entry->offset);
> > > + ring->reset_index++;
> > > + count++;
> > > + /*
> > > + * Try to coalesce the reset operations when the guest is
> > > + * scanning pages in the same slot.
> >
> > what does guest scanning mean?
>
> My wild guess is that it means the guest is accessing the pages
> continuously, so the dirty gfns are continuous too. Anyway, I agree
> it's not clear; I can try to rephrase it.
>
> >
> > > + */
> > > + if (next_slot == cur_slot) {
> > > + int delta = next_offset - cur_offset;
> > > +
> > > + if (delta >= 0 && delta < BITS_PER_LONG) {
> > > + mask |= 1ull << delta;
> > > + continue;
> > > + }
> > > +
> > > + /* Backwards visit, careful about overflows! */
> > > + if (delta > -BITS_PER_LONG && delta < 0 &&
> > > + (mask << -delta >> -delta) == mask) {
> > > + cur_offset = next_offset;
> > > + mask = (mask << -delta) | 1;
> > > + continue;
> > > + }
> > > + }
> > > + kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > > + cur_slot = next_slot;
> > > + cur_offset = next_offset;
> > > + mask = 1;
> > > + }
> > > + kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > > +
> > > + return count;
> > > +}
> > > +
> > > +static inline u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
> > > +{
> > > + return ring->dirty_index - ring->reset_index;
> > > +}
> > > +
> > > +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring)
> > > +{
> > > + return kvm_dirty_ring_used(ring) >= ring->size;
> > > +}
> > > +
> > > +/*
> > > + * Returns:
> > > + * >0 if we should kick the vcpu out,
> > > + * =0 if the gfn pushed successfully, or,
> > > + * <0 if error (e.g. ring full)
> > > + */
> > > +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> > > + struct kvm_dirty_ring_indexes *indexes,
> > > + u32 slot, u64 offset, bool lock)
> > > +{
> > > + int ret;
> > > + struct kvm_dirty_gfn *entry;
> > > +
> > > + if (lock)
> > > + spin_lock(&ring->lock);
> >
> > what's the story around locking here? Why is it safe
> > not to take the lock sometimes?
>
> kvm_dirty_ring_push() is called with lock==true only when the per-vm
> ring is used. The per-vcpu ring is only ever touched from its own vcpu
> context, so no lock is needed there and kvm_dirty_ring_push() is
> called with lock==false.
>
> >
> > > +
> > > + if (kvm_dirty_ring_full(ring)) {
> > > + ret = -EBUSY;
> > > + goto out;
> > > + }
> > > +
> > > + entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> > > + entry->slot = slot;
> > > + entry->offset = offset;
> > > + smp_wmb();
> > > + ring->dirty_index++;
> > > + WRITE_ONCE(indexes->avail_index, ring->dirty_index);
> > > + ret = kvm_dirty_ring_used(ring) >= ring->soft_limit;
> > > + pr_info("%s: slot %u offset %llu used %u\n",
> > > + __func__, slot, offset, kvm_dirty_ring_used(ring));
> > > +
> > > +out:
> > > + if (lock)
> > > + spin_unlock(&ring->lock);
> > > +
> > > + return ret;
> > > +}
> > > +
> > > +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i)
> > > +{
> > > + return vmalloc_to_page((void *)ring->dirty_gfns + i * PAGE_SIZE);
> > > +}
> > > +
> > > +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
> > > +{
> > > + if (ring->dirty_gfns) {
> > > + vfree(ring->dirty_gfns);
> > > + ring->dirty_gfns = NULL;
> > > + }
> > > +}
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 681452d288cd..8642c977629b 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -64,6 +64,8 @@
> > > #define CREATE_TRACE_POINTS
> > > #include <trace/events/kvm.h>
> > >
> > > +#include <linux/kvm_dirty_ring.h>
> > > +
> > > /* Worst case buffer size needed for holding an integer. */
> > > #define ITOA_MAX_LEN 12
> > >
> > > @@ -149,6 +151,10 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
> > > struct kvm_vcpu *vcpu,
> > > struct kvm_memory_slot *memslot,
> > > gfn_t gfn);
> > > +static void mark_page_dirty_in_ring(struct kvm *kvm,
> > > + struct kvm_vcpu *vcpu,
> > > + struct kvm_memory_slot *slot,
> > > + gfn_t gfn);
> > >
> > > __visible bool kvm_rebooting;
> > > EXPORT_SYMBOL_GPL(kvm_rebooting);
> > > @@ -359,11 +365,22 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
> > > vcpu->preempted = false;
> > > vcpu->ready = false;
> > >
> > > + if (kvm->dirty_ring_size) {
> > > + r = kvm_dirty_ring_alloc(vcpu->kvm, &vcpu->dirty_ring);
> > > + if (r) {
> > > + kvm->dirty_ring_size = 0;
> > > + goto fail_free_run;
> > > + }
> > > + }
> > > +
> > > r = kvm_arch_vcpu_init(vcpu);
> > > if (r < 0)
> > > - goto fail_free_run;
> > > + goto fail_free_ring;
> > > return 0;
> > >
> > > +fail_free_ring:
> > > + if (kvm->dirty_ring_size)
> > > + kvm_dirty_ring_free(&vcpu->dirty_ring);
> > > fail_free_run:
> > > free_page((unsigned long)vcpu->run);
> > > fail:
> > > @@ -381,6 +398,8 @@ void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
> > > put_pid(rcu_dereference_protected(vcpu->pid, 1));
> > > kvm_arch_vcpu_uninit(vcpu);
> > > free_page((unsigned long)vcpu->run);
> > > + if (vcpu->kvm->dirty_ring_size)
> > > + kvm_dirty_ring_free(&vcpu->dirty_ring);
> > > }
> > > EXPORT_SYMBOL_GPL(kvm_vcpu_uninit);
> > >
> > > @@ -690,6 +709,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
> > > struct kvm *kvm = kvm_arch_alloc_vm();
> > > int r = -ENOMEM;
> > > int i;
> > > + struct page *page;
> > >
> > > if (!kvm)
> > > return ERR_PTR(-ENOMEM);
> > > @@ -705,6 +725,14 @@ static struct kvm *kvm_create_vm(unsigned long type)
> > >
> > > BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
> > >
> > > + page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> > > + if (!page) {
> > > + r = -ENOMEM;
> > > + goto out_err_alloc_page;
> > > + }
> > > + kvm->vm_run = page_address(page);
> >
> > So 4K with just 8 bytes used. Not as bad as 1/2Mbyte for the ring but
> > still. What is wrong with just a pointer and calling put_user?
>
> I want to make it the starting point for sharing per-vm fields between
> user and kernel, just like kvm_run is for per-vcpu fields.

And why is doing that without get/put user a good idea?
If nothing else this bypasses SMAP, and exploits can pass
data from userspace to the kernel through that.

> IMHO it'll be awkward if we always have to introduce a new interface
> just to take a pointer to a userspace buffer and cache it... I'd say
> so far I like the design of kvm_run and the like because it's
> efficient, easy to use, and easy to extend.


Well, kvm_run at least isn't accessed while the kernel is processing
it. And the structure there is dead simple, not a tricky lockless ring
with indices and things.

Again, I might be wrong, and eventually it's up to the kvm maintainers.
But really there's a standard thing all drivers do to talk to
userspace, and if there's no special reason to do otherwise I would do
exactly that.

> >
> > > + BUILD_BUG_ON(sizeof(struct kvm_vm_run) > PAGE_SIZE);
> > > +
> > > if (init_srcu_struct(&kvm->srcu))
> > > goto out_err_no_srcu;
> > > if (init_srcu_struct(&kvm->irq_srcu))
> > > @@ -775,6 +803,9 @@ static struct kvm *kvm_create_vm(unsigned long type)
> > > out_err_no_irq_srcu:
> > > cleanup_srcu_struct(&kvm->srcu);
> > > out_err_no_srcu:
> > > + free_page((unsigned long)page);
> > > + kvm->vm_run = NULL;
> > > +out_err_alloc_page:
> > > kvm_arch_free_vm(kvm);
> > > mmdrop(current->mm);
> > > return ERR_PTR(r);
> > > @@ -800,6 +831,15 @@ static void kvm_destroy_vm(struct kvm *kvm)
> > > int i;
> > > struct mm_struct *mm = kvm->mm;
> > >
> > > + if (kvm->dirty_ring_size) {
> > > + kvm_dirty_ring_free(&kvm->vm_dirty_ring);
> > > + }
> > > +
> > > + if (kvm->vm_run) {
> > > + free_page((unsigned long)kvm->vm_run);
> > > + kvm->vm_run = NULL;
> > > + }
> > > +
> > > kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
> > > kvm_destroy_vm_debugfs(kvm);
> > > kvm_arch_sync_events(kvm);
> > > @@ -2301,7 +2341,7 @@ static void mark_page_dirty_in_slot(struct kvm *kvm,
> > > {
> > > if (memslot && memslot->dirty_bitmap) {
> > > unsigned long rel_gfn = gfn - memslot->base_gfn;
> > > -
> > > + mark_page_dirty_in_ring(kvm, vcpu, memslot, gfn);
> > > set_bit_le(rel_gfn, memslot->dirty_bitmap);
> > > }
> > > }
> > > @@ -2649,6 +2689,13 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
> > > }
> > > EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
> > >
> > > +static bool kvm_fault_in_dirty_ring(struct kvm *kvm, struct vm_fault *vmf)
> > > +{
> > > + return (vmf->pgoff >= KVM_DIRTY_LOG_PAGE_OFFSET) &&
> > > + (vmf->pgoff < KVM_DIRTY_LOG_PAGE_OFFSET +
> > > + kvm->dirty_ring_size / PAGE_SIZE);
> > > +}
> > > +
> > > static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
> > > {
> > > struct kvm_vcpu *vcpu = vmf->vma->vm_file->private_data;
> > > @@ -2664,6 +2711,10 @@ static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
> > > else if (vmf->pgoff == KVM_COALESCED_MMIO_PAGE_OFFSET)
> > > page = virt_to_page(vcpu->kvm->coalesced_mmio_ring);
> > > #endif
> > > + else if (kvm_fault_in_dirty_ring(vcpu->kvm, vmf))
> > > + page = kvm_dirty_ring_get_page(
> > > + &vcpu->dirty_ring,
> > > + vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
> > > else
> > > return kvm_arch_vcpu_fault(vcpu, vmf);
> > > get_page(page);
> > > @@ -3259,12 +3310,162 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> > > #endif
> > > case KVM_CAP_NR_MEMSLOTS:
> > > return KVM_USER_MEM_SLOTS;
> > > + case KVM_CAP_DIRTY_LOG_RING:
> > > + /* Version will be zero if arch didn't implement it */
> > > + return KVM_DIRTY_RING_VERSION;
> > > default:
> > > break;
> > > }
> > > return kvm_vm_ioctl_check_extension(kvm, arg);
> > > }
> > >
> > > +static void mark_page_dirty_in_ring(struct kvm *kvm,
> > > + struct kvm_vcpu *vcpu,
> > > + struct kvm_memory_slot *slot,
> > > + gfn_t gfn)
> > > +{
> > > + u32 as_id = 0;
> > > + u64 offset;
> > > + int ret;
> > > + struct kvm_dirty_ring *ring;
> > > + struct kvm_dirty_ring_indexes *indexes;
> > > + bool is_vm_ring;
> > > +
> > > + if (!kvm->dirty_ring_size)
> > > + return;
> > > +
> > > + offset = gfn - slot->base_gfn;
> > > +
> > > + if (vcpu) {
> > > + as_id = kvm_arch_vcpu_memslots_id(vcpu);
> > > + } else {
> > > + as_id = 0;
> > > + vcpu = kvm_get_running_vcpu();
> > > + }
> > > +
> > > + if (vcpu) {
> > > + ring = &vcpu->dirty_ring;
> > > + indexes = &vcpu->run->vcpu_ring_indexes;
> > > + is_vm_ring = false;
> > > + } else {
> > > + /*
> > > + * Put onto per vm ring because no vcpu context. Kick
> > > + * vcpu0 if ring is full.
> >
> > What about tasks on vcpu 0? Do guests realize it's a bad idea to put
> > critical tasks there, they will be penalized disproportionally?
>
> Reasonable question. So far we can't avoid it because a vcpu exit is
> the event mechanism we have to say "hey, please collect dirty bits".
> Maybe there is a better way than this, but I'll need to rethink all of
> this...

Maybe signal an eventfd, and let userspace worry about deciding what to
do.

> >
> > > + */
> > > + vcpu = kvm->vcpus[0];
> > > + ring = &kvm->vm_dirty_ring;
> > > + indexes = &kvm->vm_run->vm_ring_indexes;
> > > + is_vm_ring = true;
> > > + }
> > > +
> > > + ret = kvm_dirty_ring_push(ring, indexes,
> > > + (as_id << 16)|slot->id, offset,
> > > + is_vm_ring);
> > > + if (ret < 0) {
> > > + if (is_vm_ring)
> > > + pr_warn_once("vcpu %d dirty log overflow\n",
> > > + vcpu->vcpu_id);
> > > + else
> > > + pr_warn_once("per-vm dirty log overflow\n");
> > > + return;
> > > + }
> > > +
> > > + if (ret)
> > > + kvm_make_request(KVM_REQ_DIRTY_RING_FULL, vcpu);
> > > +}
> > > +
> > > +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
> > > +{
> > > + struct kvm_memory_slot *memslot;
> > > + int as_id, id;
> > > +
> > > + as_id = slot >> 16;
> > > + id = (u16)slot;
> > > + if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
> > > + return;
> > > +
> > > + memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
> > > + if (offset >= memslot->npages)
> > > + return;
> > > +
> > > + spin_lock(&kvm->mmu_lock);
> > > + /* FIXME: we should use a single AND operation, but there is no
> > > + * applicable atomic API.
> > > + */
> > > + while (mask) {
> > > + clear_bit_le(offset + __ffs(mask), memslot->dirty_bitmap);
> > > + mask &= mask - 1;
> > > + }
> > > +
> > > + kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
> > > + spin_unlock(&kvm->mmu_lock);
> > > +}
> > > +
> > > +static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
> > > +{
> > > + int r;
> > > +
> > > + /* the size should be power of 2 */
> > > + if (!size || (size & (size - 1)))
> > > + return -EINVAL;
> > > +
> > > + /* Should be bigger to keep the reserved entries, or a page */
> > > + if (size < kvm_dirty_ring_get_rsvd_entries() *
> > > + sizeof(struct kvm_dirty_gfn) || size < PAGE_SIZE)
> > > + return -EINVAL;
> > > +
> > > + if (size > KVM_DIRTY_RING_MAX_ENTRIES *
> > > + sizeof(struct kvm_dirty_gfn))
> > > + return -E2BIG;
> >
> > KVM_DIRTY_RING_MAX_ENTRIES is not part of UAPI.
> > So how does userspace know what's legal?
> > Do you expect it to just try?
>
> Yep that's what I thought. :)
>
> Please grep for E2BIG in the QEMU repo (target/i386/kvm.c)... it won't
> be hard to do IMHO.

I don't see anything except just failing. Do we really have something
trying to find a working value? What would even be a reasonable range?
Start from UINT_MAX and work down? In which increments?
This is just a ton of overhead for what could have been a
simple query.

> > More likely it will just copy the number from kernel and can
> > never ever make it smaller.
>
> Not sure, but I can certainly move KVM_DIRTY_RING_MAX_ENTRIES to uapi
> too.
>
> Thanks,

That won't help, as you can then never change it.
You need it to be discoverable at runtime.
Or again, keep it in userspace memory and then you don't
really care what size it is.


> --
> Peter Xu

2019-12-12 00:10:24

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On 11/12/19 23:57, Michael S. Tsirkin wrote:
>>> All these seem like arbitrary limitations to me.
>>>
>>> Sizing the ring correctly might prove to be a challenge.
>>>
>>> Thus I think there's value in resizing the rings
>>> without destroying VCPU.
>>
>> Do you have an example on when we could use this feature?
>
> So e.g. start with a small ring, and if you see stalls too often,
> increase it? Otherwise I don't see how one decides on the ring size.

If you see stalls often, it means the guest is dirtying memory very
fast. Harvesting the ring puts back pressure on the guest, so you may
prefer a smaller ring size to avoid a bufferbloat-like situation.

Note that having a larger ring is better, even though it does incur a
memory cost, because it means the migration thread will be able to reap
the ring buffer asynchronously with no vmexits.

With smaller ring sizes the cost of flushing the TLB when resetting the
rings goes up, but the initial bulk copy phase _will_ have vmexits and
then having to reap more dirty rings becomes more expensive and
introduces some jitter. So it will require some experimentation to find
an optimal value.

Anyway if in the future we go for resizable rings, KVM_ENABLE_CAP can be
passed the largest desired size and then another ioctl can be introduced
to set the mask for indices.

>>> Also, power of two just saves a branch here and there,
>>> but wastes lots of memory. Just wrap the index around to
>>> 0 and then users can select any size?
>>
>> Same as above to postpone until we need it?
>
> It's to save memory, don't we always need to do that?

Does it really save that much memory? Would it really be so beneficial
to choose 12K entries rather than 8K or 16K in the ring? With 16-byte
entries that is 192 KiB versus 128 KiB or 256 KiB per ring.

>> My understanding of this is that normally we do only want either one
>> of them depending on the major workload and the configuration of the
>> guest.
>
> And again, how does one know which to enable? No one has the time to
> fine-tune a gazillion parameters.

Hopefully we can always use just the ring buffer.

> IMHO a huge amount of benchmarking has to happen if you just want to
> set this loose on users as the default with these kinds of
> limitations. We need to be sure that even though in theory
> it can be very bad, in practice it's actually good.
> If it's auto-tuning then it's a much easier sell to upstream
> even if there's a chance of some regressions.

Auto-tuning is not a silver bullet; it requires just as much
benchmarking to make sure that it doesn't oscillate crazily and that it
actually outperforms a simple fixed size.

>> Yeh kvm versioning could work too. Here we can also return a zero
>> just like most of the caps (or KVM_DIRTY_LOG_PAGE_OFFSET as in the
>> original patchset, but that doesn't really help either because it's
>> defined in uapi), but I just don't see how it helps... So I returned
>> a version number just in case we'd like to change the layout some day
>> and when we don't want to bother introducing another cap bit for the
>> same feature (like KVM_CAP_MANUAL_DIRTY_LOG_PROTECT and
>> KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2).
>
> I guess it's up to Paolo but really I don't see the point.
> You can add a version later when it means something ...

Yeah, we can return the maximum size of the ring buffer, too.

>> I'd say it won't be a big issue on locking 1/2M of host mem for a
>> vm...
>> Also note that if dirty ring is enabled, I plan to evaporate the
>> dirty_bitmap in the next post. The old kvm->dirty_bitmap takes
>> $GUEST_MEM/32K*2 mem. E.g., for 64G guest it's 64G/32K*2=4M. If with
>> dirty ring of 8 vcpus, that could be 64K*8=0.5M, which could be even
>> less memory used.
>
> Right - I think Avi described the bitmap in kernel memory as one of
> the design mistakes. Why repeat that with the new design?

Do you have a source for that? At least the dirty bitmap has to be
accessed from atomic context so it seems unlikely that it can be moved
to user memory.

The dirty ring could use user memory indeed, but it would be much harder
to set up (multiple ioctls for each ring? what to do if userspace
forgets one? etc.). The mmap API is easier to use.

>>>> + entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
>>>> + /*
>>>> + * The ring buffer is shared with userspace, which might mmap
>>>> + * it and concurrently modify slot and offset. Userspace must
>>>> + * not be trusted! READ_ONCE prevents the compiler from changing
>>>> + * the values after they've been range-checked (the checks are
>>>> + * in kvm_reset_dirty_gfn).
>>>
>>> What it doesn't is prevent speculative attacks. That's why things like
>>> copy from user have a speculation barrier. Instead of worrying about
>>> that, unless it's really critical, I think you'd do well do just use
>>> copy to/from user.

An unconditional speculation barrier (lfence) is also expensive. We
already have macros to add speculation checks with array_index_nospec at
the right places, for example __kvm_memslots. We should add an
array_index_nospec to id_to_memslot as well. I'll send a patch for that.
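
(For reference, a rough sketch of what such a clamp could look like;
this is only an illustration of using array_index_nospec from
<linux/nospec.h> on the memslot id, not the actual patch:

    static inline struct kvm_memory_slot *
    id_to_memslot(struct kvm_memslots *slots, int id)
    {
            /* clamp the untrusted id under speculation before indexing */
            int index = slots->id_to_index[array_index_nospec(id,
                                                    KVM_MEM_SLOTS_NUM)];

            return &slots->memslots[index];
    }

)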

>>> What depends on what here? Looks suspicious ...
>>
>> Hmm, I think maybe it can be removed because the entry pointer
>> reference below should be an ordering constraint already?

entry->xxx depends on ring->reset_index.

>>> what's the story around locking here? Why is it safe
>>> not to take the lock sometimes?
>>
>> kvm_dirty_ring_push() will be with lock==true only when the per-vm
>> ring is used. For per-vcpu ring, because that will only happen with
>> the vcpu context, then we don't need locks (so kvm_dirty_ring_push()
>> is called with lock==false).

FWIW this will be done much more nicely in v2.

>>>> + page = alloc_page(GFP_KERNEL | __GFP_ZERO);
>>>> + if (!page) {
>>>> + r = -ENOMEM;
>>>> + goto out_err_alloc_page;
>>>> + }
>>>> + kvm->vm_run = page_address(page);
>>>
>>> So 4K with just 8 bytes used. Not as bad as 1/2Mbyte for the ring but
>>> still. What is wrong with just a pointer and calling put_user?
>>
>> I want to make it the start point for sharing fields between
>> user/kernel per-vm. Just like kvm_run for per-vcpu.

This page is actually not needed at all. Userspace can just map at
KVM_DIRTY_LOG_PAGE_OFFSET, the indices reside there. You can drop
kvm_vm_run completely.
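
(On the userspace side that would look roughly like the following,
assuming the ring and the indices stay at KVM_DIRTY_LOG_PAGE_OFFSET of
the vm or vcpu file descriptor as in this series; names and error
handling are simplified and nothing here is final:

    #include <sys/mman.h>
    #include <unistd.h>
    #include <linux/kvm.h>

    /* map the dirty ring of one vm/vcpu fd; entries comes from KVM_ENABLE_CAP */
    static struct kvm_dirty_gfn *map_dirty_ring(int fd, size_t entries)
    {
            long psz = sysconf(_SC_PAGESIZE);
            size_t len = entries * sizeof(struct kvm_dirty_gfn);

            return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                        fd, KVM_DIRTY_LOG_PAGE_OFFSET * psz);
    }

)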

>>>> + } else {
>>>> + /*
>>>> + * Put onto per vm ring because no vcpu context. Kick
>>>> + * vcpu0 if ring is full.
>>>
>>> What about tasks on vcpu 0? Do guests realize it's a bad idea to put
>>> critical tasks there, they will be penalized disproportionally?
>>
>> Reasonable question. So far we can't avoid it because vcpu exit is
>> the event mechanism to say "hey please collect dirty bits". Maybe
>> someway is better than this, but I'll need to rethink all these
>> over...
>
> Maybe signal an eventfd, and let userspace worry about deciding what to
> do.

This has to be done synchronously. But the vm ring should be used very
rarely (it's for things like kvmclock updates that write to guest memory
outside a vCPU), possibly a handful of times in the whole run of the VM.

>>> KVM_DIRTY_RING_MAX_ENTRIES is not part of UAPI.
>>> So how does userspace know what's legal?
>>> Do you expect it to just try?
>>
>> Yep that's what I thought. :)

We should return it for KVM_CHECK_EXTENSION.
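
(i.e. on the userspace side something like the following, assuming the
return value were defined as the maximum number of ring entries; the
exact semantics are not settled here:

    int max = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_DIRTY_LOG_RING);

    if (max > 0 && ring_entries > (unsigned int)max)
            ring_entries = max;     /* clamp before KVM_ENABLE_CAP */

)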

Paolo

2019-12-12 07:37:11

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On Thu, Dec 12, 2019 at 01:08:14AM +0100, Paolo Bonzini wrote:
> >> I'd say it won't be a big issue on locking 1/2M of host mem for a
> >> vm...
> >> Also note that if dirty ring is enabled, I plan to evaporate the
> >> dirty_bitmap in the next post. The old kvm->dirty_bitmap takes
> >> $GUEST_MEM/32K*2 mem. E.g., for 64G guest it's 64G/32K*2=4M. If with
> >> dirty ring of 8 vcpus, that could be 64K*8=0.5M, which could be even
> >> less memory used.
> >
> > Right - I think Avi described the bitmap in kernel memory as one of
> > design mistakes. Why repeat that with the new design?
>
> Do you have a source for that?

Nope, it was a private talk.

> At least the dirty bitmap has to be
> accessed from atomic context so it seems unlikely that it can be moved
> to user memory.

Why is that? We could surely do it from VCPU context?

> The dirty ring could use user memory indeed, but it would be much harder
> to set up (multiple ioctls for each ring? what to do if userspace
> forgets one? etc.).

Why multiple ioctls? If you do like virtio packed ring you just need the
base and the size.

--
MST

2019-12-12 08:12:56

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On 12/12/19 08:36, Michael S. Tsirkin wrote:
> On Thu, Dec 12, 2019 at 01:08:14AM +0100, Paolo Bonzini wrote:
>>>> I'd say it won't be a big issue on locking 1/2M of host mem for a
>>>> vm...
>>>> Also note that if dirty ring is enabled, I plan to evaporate the
>>>> dirty_bitmap in the next post. The old kvm->dirty_bitmap takes
>>>> $GUEST_MEM/32K*2 mem. E.g., for 64G guest it's 64G/32K*2=4M. If with
>>>> dirty ring of 8 vcpus, that could be 64K*8=0.5M, which could be even
>>>> less memory used.
>>>
>>> Right - I think Avi described the bitmap in kernel memory as one of
>>> design mistakes. Why repeat that with the new design?
>>
>> Do you have a source for that?
>
> Nope, it was a private talk.
>
>> At least the dirty bitmap has to be
>> accessed from atomic context so it seems unlikely that it can be moved
>> to user memory.
>
> Why is that? We could surely do it from VCPU context?

Spinlock is taken.

>> The dirty ring could use user memory indeed, but it would be much harder
>> to set up (multiple ioctls for each ring? what to do if userspace
>> forgets one? etc.).
>
> Why multiple ioctls? If you do like virtio packed ring you just need the
> base and the size.

You have multiple rings, so multiple invocations of one ioctl.

Paolo

2019-12-12 10:40:15

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On Thu, Dec 12, 2019 at 09:12:04AM +0100, Paolo Bonzini wrote:
> On 12/12/19 08:36, Michael S. Tsirkin wrote:
> > On Thu, Dec 12, 2019 at 01:08:14AM +0100, Paolo Bonzini wrote:
> >>>> I'd say it won't be a big issue on locking 1/2M of host mem for a
> >>>> vm...
> >>>> Also note that if dirty ring is enabled, I plan to evaporate the
> >>>> dirty_bitmap in the next post. The old kvm->dirty_bitmap takes
> >>>> $GUEST_MEM/32K*2 mem. E.g., for 64G guest it's 64G/32K*2=4M. If with
> >>>> dirty ring of 8 vcpus, that could be 64K*8=0.5M, which could be even
> >>>> less memory used.
> >>>
> >>> Right - I think Avi described the bitmap in kernel memory as one of
> >>> design mistakes. Why repeat that with the new design?
> >>
> >> Do you have a source for that?
> >
> > Nope, it was a private talk.
> >
> >> At least the dirty bitmap has to be
> >> accessed from atomic context so it seems unlikely that it can be moved
> >> to user memory.
> >
> > Why is that? We could surely do it from VCPU context?
>
> Spinlock is taken.

Right, that's an implementation detail though isn't it?

> >> The dirty ring could use user memory indeed, but it would be much harder
> >> to set up (multiple ioctls for each ring? what to do if userspace
> >> forgets one? etc.).
> >
> > Why multiple ioctls? If you do like virtio packed ring you just need the
> > base and the size.
>
> You have multiple rings, so multiple invocations of one ioctl.
>
> Paolo

Oh. So when you said "multiple ioctls for each ring" - I guess you
meant: "multiple ioctls - one for each ring"?

And it's true, but then it allows supporting things like resize in a
clean way without any effort in the kernel. You get a new ring address -
you switch to that one.

--
MST

2019-12-13 20:25:33

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On Wed, Dec 11, 2019 at 06:24:00PM +0100, Christophe de Dinechin wrote:
> Peter Xu writes:
>
> > This patch is heavily based on previous work from Lei Cao
> > <[email protected]> and Paolo Bonzini <[email protected]>. [1]
> >
> > KVM currently uses large bitmaps to track dirty memory. These bitmaps
> > are copied to userspace when userspace queries KVM for its dirty page
> > information. The use of bitmaps is mostly sufficient for live
> > migration, as large parts of memory are be dirtied from one log-dirty
> > pass to another.
>
> That statement sort of concerns me. If large parts of memory are
> dirtied, won't this cause the rings to fill up quickly enough to cause a
> lot of churn between user-space and kernel?

QEMU already has cpu-throttle to explicitly introduce some of that
"churn" just to slow the vcpus down, so if dirtying is heavy during
migration we might actually prefer it. Also, this is not meant to
replace the old dirty_bitmap; it is a new interface only. Even if we
eventually make it the default, we'll definitely keep the old
interface for the scenarios where the user wants it.

>
> See a possible suggestion to address that below.
>
> > However, in a checkpointing system, the number of
> > dirty pages is small and in fact it is often bounded---the VM is
> > paused when it has dirtied a pre-defined number of pages. Traversing a
> > large, sparsely populated bitmap to find set bits is time-consuming,
> > as is copying the bitmap to user-space.
> >
> > A similar issue will be there for live migration when the guest memory
> > is huge while the page dirty procedure is trivial. In that case for
> > each dirty sync we need to pull the whole dirty bitmap to userspace
> > and analyse every bit even if it's mostly zeros.
> >
> > The preferred data structure for above scenarios is a dense list of
> > guest frame numbers (GFN). This patch series stores the dirty list in
> > kernel memory that can be memory mapped into userspace to allow speedy
> > harvesting.
> >
> > We defined two new data structures:
> >
> > struct kvm_dirty_ring;
> > struct kvm_dirty_ring_indexes;
> >
> > Firstly, kvm_dirty_ring is defined to represent a ring of dirty
> > pages. When dirty tracking is enabled, we can push dirty gfn onto the
> > ring.
> >
> > Secondly, kvm_dirty_ring_indexes is defined to represent the
> > user/kernel interface of each ring. Currently it contains two
> > indexes: (1) avail_index represents where we should push our next
> > PFN (written by kernel), while (2) fetch_index represents where the
> > userspace should fetch the next dirty PFN (written by userspace).
> >
> > One complete ring is composed by one kvm_dirty_ring plus its
> > corresponding kvm_dirty_ring_indexes.
> >
> > Currently, we have N+1 rings for each VM of N vcpus:
> >
> > - for each vcpu, we have 1 per-vcpu dirty ring,
> > - for each vm, we have 1 per-vm dirty ring
> >
> > Please refer to the documentation update in this patch for more
> > details.
> >
> > Note that this patch implements the core logic of dirty ring buffer.
> > It's still disabled for all archs for now. Also, we'll address some
> > of the other issues in follow up patches before it's firstly enabled
> > on x86.
> >
> > [1] https://patchwork.kernel.org/patch/10471409/
> >
> > Signed-off-by: Lei Cao <[email protected]>
> > Signed-off-by: Paolo Bonzini <[email protected]>
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > Documentation/virt/kvm/api.txt | 109 +++++++++++++++
> > arch/x86/kvm/Makefile | 3 +-
> > include/linux/kvm_dirty_ring.h | 67 +++++++++
> > include/linux/kvm_host.h | 33 +++++
> > include/linux/kvm_types.h | 1 +
> > include/uapi/linux/kvm.h | 36 +++++
> > virt/kvm/dirty_ring.c | 156 +++++++++++++++++++++
> > virt/kvm/kvm_main.c | 240 ++++++++++++++++++++++++++++++++-
> > 8 files changed, 642 insertions(+), 3 deletions(-)
> > create mode 100644 include/linux/kvm_dirty_ring.h
> > create mode 100644 virt/kvm/dirty_ring.c
> >
> > diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
> > index 49183add44e7..fa622c9a2eb8 100644
> > --- a/Documentation/virt/kvm/api.txt
> > +++ b/Documentation/virt/kvm/api.txt
> > @@ -231,6 +231,7 @@ Based on their initialization different VMs may have different capabilities.
> > It is thus encouraged to use the vm ioctl to query for capabilities (available
> > with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
> >
> > +
> > 4.5 KVM_GET_VCPU_MMAP_SIZE
> >
> > Capability: basic
> > @@ -243,6 +244,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
> > memory region. This ioctl returns the size of that region. See the
> > KVM_RUN documentation for details.
> >
> > +Besides the size of the KVM_RUN communication region, other areas of
> > +the VCPU file descriptor can be mmap-ed, including:
> > +
> > +- if KVM_CAP_COALESCED_MMIO is available, a page at
> > + KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
> > + this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
> > + KVM_CAP_COALESCED_MMIO is not documented yet.
>
> Does the above really belong to this patch?

Probably not.. But sure I can move that out in my next post.

>
> > +
> > +- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
> > + KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. For more information on
> > + KVM_CAP_DIRTY_LOG_RING, see section 8.3.
> > +
> >
> > 4.6 KVM_SET_MEMORY_REGION
> >
> > @@ -5358,6 +5371,7 @@ CPU when the exception is taken. If this virtual SError is taken to EL1 using
> > AArch64, this value will be reported in the ISS field of ESR_ELx.
> >
> > See KVM_CAP_VCPU_EVENTS for more details.
> > +
> > 8.20 KVM_CAP_HYPERV_SEND_IPI
> >
> > Architectures: x86
> > @@ -5365,6 +5379,7 @@ Architectures: x86
> > This capability indicates that KVM supports paravirtualized Hyper-V IPI send
> > hypercalls:
> > HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
> > +
> > 8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
> >
> > Architecture: x86
> > @@ -5378,3 +5393,97 @@ handling by KVM (as some KVM hypercall may be mistakenly treated as TLB
> > flush hypercalls by Hyper-V) so userspace should disable KVM identification
> > in CPUID and only exposes Hyper-V identification. In this case, guest
> > thinks it's running on Hyper-V and only use Hyper-V hypercalls.
> > +
> > +8.22 KVM_CAP_DIRTY_LOG_RING
> > +
> > +Architectures: x86
> > +Parameters: args[0] - size of the dirty log ring
> > +
> > +KVM is capable of tracking dirty memory using ring buffers that are
> > +mmaped into userspace; there is one dirty ring per vcpu and one global
> > +ring per vm.
> > +
> > +One dirty ring has the following two major structures:
> > +
> > +struct kvm_dirty_ring {
> > + u16 dirty_index;
> > + u16 reset_index;
>
> What is the benefit of using u16 for that? That means with 4K pages, you
> can share at most 256M of dirty memory each time? That seems low to me,
> especially since it's sufficient to touch one byte in a page to dirty it.
>
> Actually, this is not consistent with the definition in the code ;-)
> So I'll assume it's actually u32.

Yes, it's u32 now. Actually I believe at least Paolo would prefer
u16. :)

I think even u16 would mostly be enough (note that the maximum allowed
value is currently only 64K entries, which is not that big). Again,
the point is that userspace should be collecting the dirty bits
continuously, so the ring shouldn't fill up easily. Even if it does,
we should probably let the vcpu stall for a while, as explained above.
It only becomes inefficient if we set the ring to a too-small value,
imho.
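
(To put numbers on it, assuming 4K pages and the 16-byte struct
kvm_dirty_gfn of this series: 65536 entries * 4 KiB/page = 256 MiB of
guest memory dirtied between two harvests at most, and 65536 entries *
16 bytes = 1 MiB of ring memory per ring. So the 64K-entry maximum
bounds both the ring's memory footprint and how much dirtying it can
absorb before filling up.)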

>
> > + u32 size;
> > + u32 soft_limit;
> > + spinlock_t lock;
> > + struct kvm_dirty_gfn *dirty_gfns;
> > +};
> > +
> > +struct kvm_dirty_ring_indexes {
> > + __u32 avail_index; /* set by kernel */
> > + __u32 fetch_index; /* set by userspace */
> > +};
> > +
> > +While for each of the dirty entry it's defined as:
> > +
> > +struct kvm_dirty_gfn {
> > + __u32 pad;
> > + __u32 slot; /* as_id | slot_id */
> > + __u64 offset;
> > +};
>
> Like other have suggested, I think we might used "pad" to store size
> information to be able to dirty large pages more efficiently.

As explained in the other thread, KVM should only trap dirty bits at
4K granularity, never at huge page sizes.

>
> > +
> > +The fields in kvm_dirty_ring will be only internal to KVM itself,
> > +while the fields in kvm_dirty_ring_indexes will be exposed to
> > +userspace to be either read or written.
>
> The sentence above is confusing when contrasted with the "set by kernel"
> comment above.

Maybe "kvm_dirty_ring_indexes will be exposed to both KVM and
userspace" to be clearer?

"set by kernel" means kernel will write to it, then the userspace will
still need to read from it.

>
> > +
> > +The two indices in the ring buffer are free running counters.
>
> Nit: this patch uses both "indices" and "indexes".
> Both are correct, but it would be nice to be consistent.

I'll follow the original patch and change everything to "indices".

>
> > +
> > +In pseudocode, processing the ring buffer looks like this:
> > +
> > + idx = load-acquire(&ring->fetch_index);
> > + while (idx != ring->avail_index) {
> > + struct kvm_dirty_gfn *entry;
> > + entry = &ring->dirty_gfns[idx & (size - 1)];
> > + ...
> > +
> > + idx++;
> > + }
> > + ring->fetch_index = idx;
> > +
> > +Userspace calls KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM ioctl
> > +to enable this capability for the new guest and set the size of the
> > +rings. It is only allowed before creating any vCPU, and the size of
> > +the ring must be a power of two. The larger the ring buffer, the less
> > +likely the ring is full and the VM is forced to exit to userspace. The
> > +optimal size depends on the workload, but it is recommended that it be
> > +at least 64 KiB (4096 entries).
>
> Is there anything in the design that would preclude resizing the ring
> buffer at a later time? Presumably, you'd want a large ring while you
> are doing things like migrations, but it's mostly useless when you are
> not monitoring memory. So it would be nice to be able to call
> KVM_ENABLE_CAP at any time to adjust the size.

It sounds scary to me to have it adjusted at any time... Even while
dirty gfns are being pushed onto the ring? We would need to handle
all of those complexities...

IMHO such a feature does not really help that much, so I'd rather we
start from something simple.

>
> As I read the current code, one of the issue would be the mapping of the
> rings in case of a later extension where we added something beyond the
> rings. But I'm not sure that's a big deal at the moment.

I think we must define something that bounds the number of pages the
rings can map, so that we can still extend the layout later. IMHO
that's exactly why I introduced the maximum allowed ring size: it
provides that limit.

>
> > +
> > +After the capability is enabled, userspace can mmap the global ring
> > +buffer (kvm_dirty_gfn[], offset KVM_DIRTY_LOG_PAGE_OFFSET) and the
> > +indexes (kvm_dirty_ring_indexes, offset 0) from the VM file
> > +descriptor. The per-vcpu dirty ring instead is mmapped when the vcpu
> > +is created, similar to the kvm_run struct (kvm_dirty_ring_indexes
> > +locates inside kvm_run, while kvm_dirty_gfn[] at offset
> > +KVM_DIRTY_LOG_PAGE_OFFSET).
> > +
> > +Just like for dirty page bitmaps, the buffer tracks writes to
> > +all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
> > +set in KVM_SET_USER_MEMORY_REGION. Once a memory region is registered
> > +with the flag set, userspace can start harvesting dirty pages from the
> > +ring buffer.
> > +
> > +To harvest the dirty pages, userspace accesses the mmaped ring buffer
> > +to read the dirty GFNs up to avail_index, and sets the fetch_index
> > +accordingly. This can be done when the guest is running or paused,
> > +and dirty pages need not be collected all at once. After processing
> > +one or more entries in the ring buffer, userspace calls the VM ioctl
> > +KVM_RESET_DIRTY_RINGS to notify the kernel that it has updated
> > +fetch_index and to mark those pages clean. Therefore, the ioctl
> > +must be called *before* reading the content of the dirty pages.
>
> > +
> > +However, there is a major difference comparing to the
> > +KVM_GET_DIRTY_LOG interface in that when reading the dirty ring from
> > +userspace it's still possible that the kernel has not yet flushed the
> > +hardware dirty buffers into the kernel buffer. To achieve that, one
> > +needs to kick the vcpu out for a hardware buffer flush (vmexit).
>
> When you refer to "buffers", are you referring to the cache lines that
> contain the ring buffers, or to something else?
>
> I'm a bit confused by this sentence. I think that you mean that a VCPU
> may still be running while you read its ring buffer, in which case the
> values in the ring buffer are not necessarily in memory yet, so not
> visible to a different CPU. But I wonder if you can't make this
> requirement to cause a vmexit unnecessary by carefully ordering the
> writes, to make sure that the fetch_index is updated only after the
> corresponding ring entries have been written to memory,
>
> In other words, as seen by user-space, you would not care that the ring
> entries have not been flushed as long as the fetch_index itself is
> guaranteed to still be behind the not-flushed-yet entries.
>
> (I would know how to do that on a different architecture, not sure for x86)

Sorry for not being clear, but... do you mean the "hardware dirty
buffers"? For Intel that would be PML. Vmexits guarantee that even
the PML buffers are flushed into the dirty rings. It has nothing to
do with cache lines.

I used "hardware dirty buffer" only because this document is for KVM
in general, while PML is just one way to do such buffering. I can add
"(for example, PML)" to make it clearer if you like.

>
> > +
> > +If one of the ring buffers is full, the guest will exit to userspace
> > +with the exit reason set to KVM_EXIT_DIRTY_LOG_FULL, and the
> > +KVM_RUN ioctl will return -EINTR. Once that happens, userspace
> > +should pause all the vcpus, then harvest all the dirty pages and
> > +rearm the dirty traps. It can unpause the guest after that.
>
> Except for the condition above, why is it necessary to pause other VCPUs
> than the one being harvested?

This is a good question. Paolo could correct me if I'm wrong.

Firstly, I think this should rarely happen if userspace is collecting
the dirty bits from time to time. If it does happen, we'll need to
call KVM_RESET_DIRTY_RINGS to reset all the rings. The question then
really becomes: do we want a per-vcpu KVM_RESET_DIRTY_RINGS?

My answer is that it could be overkill. The important point is that,
no matter what, KVM_RESET_DIRTY_RINGS needs to change the page tables
and kick all VCPUs for TLB flushes, so if we must do it, we'd better
do it as rarely as possible. With per-vcpu ring resets we'd do N*N
vcpu kicks in the bad case (N kicks per vcpu ring reset, and we
probably have N vcpus). If we stick to the simple per-vm reset, all
vcpus get kicked for the TLB flush anyway, so it's easier to collect
all the rings together and reset them together.
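
(As a concrete example of the counting above, with N = 8 vcpus and
assuming each reset still has to kick every vcpu for the TLB flush:
8 per-vcpu resets * 8 kicks each = 64 vcpu kicks, versus 1 per-vm
reset * 8 kicks = 8 vcpu kicks.)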

>
>
> > diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> > index b19ef421084d..0acee817adfb 100644
> > --- a/arch/x86/kvm/Makefile
> > +++ b/arch/x86/kvm/Makefile
> > @@ -5,7 +5,8 @@ ccflags-y += -Iarch/x86/kvm
> > KVM := ../../../virt/kvm
> >
> > kvm-y += $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
> > - $(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
> > + $(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \
> > + $(KVM)/dirty_ring.o
> > kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
> >
> > kvm-y += x86.o emulate.o i8259.o irq.o lapic.o \
> > diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
> > new file mode 100644
> > index 000000000000..8335635b7ff7
> > --- /dev/null
> > +++ b/include/linux/kvm_dirty_ring.h
> > @@ -0,0 +1,67 @@
> > +#ifndef KVM_DIRTY_RING_H
> > +#define KVM_DIRTY_RING_H
> > +
> > +/*
> > + * struct kvm_dirty_ring is defined in include/uapi/linux/kvm.h.
> > + *
> > + * dirty_ring: shared with userspace via mmap. It is the compact list
> > + * that holds the dirty pages.
> > + * dirty_index: free running counter that points to the next slot in
> > + * dirty_ring->dirty_gfns where a new dirty page should go.
> > + * reset_index: free running counter that points to the next dirty page
> > + * in dirty_ring->dirty_gfns for which dirty trap needs to
> > + * be reenabled
> > + * size: size of the compact list, dirty_ring->dirty_gfns
> > + * soft_limit: when the number of dirty pages in the list reaches this
> > + * limit, vcpu that owns this ring should exit to userspace
> > + * to allow userspace to harvest all the dirty pages
> > + * lock: protects dirty_ring, only in use if this is the global
> > + * ring
>
> If that's not used for vcpu rings, maybe move it out of kvm_dirty_ring?

Yeah we can.

>
> > + *
> > + * The number of dirty pages in the ring is calculated by,
> > + * dirty_index - reset_index
>
> Nit: the code calls it "used" (in kvm_dirty_ring_used). Maybe find an
> unambiguous terminology. What about "posted", as in
>
> The number of posted dirty pages, i.e. the number of dirty pages in the
> ring, is calculated as dirty_index - reset_index by function
> kvm_dirty_ring_posted
>
> (Replace "posted" by any adjective of your liking)

Sure.

(Or maybe I'll just remove these lines to avoid introducing any
terminology that isn't strictly necessary... after all, similar things
are already described in the documentation and in the code itself.)

>
> > + *
> > + * kernel increments dirty_ring->indices.avail_index after dirty index
> > + * is incremented. When userspace harvests the dirty pages, it increments
> > + * dirty_ring->indices.fetch_index up to dirty_ring->indices.avail_index.
> > + * When kernel reenables dirty traps for the dirty pages, it increments
> > + * reset_index up to dirty_ring->indices.fetch_index.
>
> Userspace should not be trusted to be doing this, see below.
>
>
> > + *
> > + */
> > +struct kvm_dirty_ring {
> > + u32 dirty_index;
> > + u32 reset_index;
> > + u32 size;
> > + u32 soft_limit;
> > + spinlock_t lock;
> > + struct kvm_dirty_gfn *dirty_gfns;
> > +};
> > +
> > +u32 kvm_dirty_ring_get_rsvd_entries(void);
> > +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring);
> > +
> > +/*
> > + * called with kvm->slots_lock held, returns the number of
> > + * processed pages.
> > + */
> > +int kvm_dirty_ring_reset(struct kvm *kvm,
> > + struct kvm_dirty_ring *ring,
> > + struct kvm_dirty_ring_indexes *indexes);
> > +
> > +/*
> > + * returns 0: successfully pushed
> > + * 1: successfully pushed, soft limit reached,
> > + * vcpu should exit to userspace
> > + * -EBUSY: unable to push, dirty ring full.
> > + */
> > +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> > + struct kvm_dirty_ring_indexes *indexes,
> > + u32 slot, u64 offset, bool lock);
> > +
> > +/* for use in vm_operations_struct */
> > +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i);
>
> Not very clear what 'i' means, seems to be a page offset based on call sites?

I'll rename it to "offset".

>
> > +
> > +void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
> > +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring);
> > +
> > +#endif
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 498a39462ac1..7b747bc9ff3e 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -34,6 +34,7 @@
> > #include <linux/kvm_types.h>
> >
> > #include <asm/kvm_host.h>
> > +#include <linux/kvm_dirty_ring.h>
> >
> > #ifndef KVM_MAX_VCPU_ID
> > #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
> > @@ -146,6 +147,7 @@ static inline bool is_error_page(struct page *page)
> > #define KVM_REQ_MMU_RELOAD (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> > #define KVM_REQ_PENDING_TIMER 2
> > #define KVM_REQ_UNHALT 3
> > +#define KVM_REQ_DIRTY_RING_FULL 4
> > #define KVM_REQUEST_ARCH_BASE 8
> >
> > #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
> > @@ -321,6 +323,7 @@ struct kvm_vcpu {
> > bool ready;
> > struct kvm_vcpu_arch arch;
> > struct dentry *debugfs_dentry;
> > + struct kvm_dirty_ring dirty_ring;
> > };
> >
> > static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
> > @@ -501,6 +504,10 @@ struct kvm {
> > struct srcu_struct srcu;
> > struct srcu_struct irq_srcu;
> > pid_t userspace_pid;
> > + /* Data structure to be exported by mmap(kvm->fd, 0) */
> > + struct kvm_vm_run *vm_run;
> > + u32 dirty_ring_size;
> > + struct kvm_dirty_ring vm_dirty_ring;
>
> If you remove the lock from struct kvm_dirty_ring, you could just put it there.

Ok.

>
> > };
> >
> > #define kvm_err(fmt, ...) \
> > @@ -832,6 +839,8 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> > gfn_t gfn_offset,
> > unsigned long mask);
> >
> > +void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask);
> > +
> > int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
> > struct kvm_dirty_log *log);
> > int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> > @@ -1411,4 +1420,28 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
> > uintptr_t data, const char *name,
> > struct task_struct **thread_ptr);
> >
> > +/*
> > + * This defines how many reserved entries we want to keep before we
> > + * kick the vcpu to the userspace to avoid dirty ring full. This
> > + * value can be tuned to higher if e.g. PML is enabled on the host.
> > + */
> > +#define KVM_DIRTY_RING_RSVD_ENTRIES 64
> > +
> > +/* Max number of entries allowed for each kvm dirty ring */
> > +#define KVM_DIRTY_RING_MAX_ENTRIES 65536
> > +
> > +/*
> > + * Arch needs to define these macro after implementing the dirty ring
> > + * feature. KVM_DIRTY_LOG_PAGE_OFFSET should be defined as the
> > + * starting page offset of the dirty ring structures, while
> > + * KVM_DIRTY_RING_VERSION should be defined as >=1. By default, this
> > + * feature is off on all archs.
> > + */
> > +#ifndef KVM_DIRTY_LOG_PAGE_OFFSET
> > +#define KVM_DIRTY_LOG_PAGE_OFFSET 0
> > +#endif
> > +#ifndef KVM_DIRTY_RING_VERSION
> > +#define KVM_DIRTY_RING_VERSION 0
> > +#endif
> > +
> > #endif
> > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> > index 1c88e69db3d9..d9d03eea145a 100644
> > --- a/include/linux/kvm_types.h
> > +++ b/include/linux/kvm_types.h
> > @@ -11,6 +11,7 @@ struct kvm_irq_routing_table;
> > struct kvm_memory_slot;
> > struct kvm_one_reg;
> > struct kvm_run;
> > +struct kvm_vm_run;
> > struct kvm_userspace_memory_region;
> > struct kvm_vcpu;
> > struct kvm_vcpu_init;
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index e6f17c8e2dba..0b88d76d6215 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -236,6 +236,7 @@ struct kvm_hyperv_exit {
> > #define KVM_EXIT_IOAPIC_EOI 26
> > #define KVM_EXIT_HYPERV 27
> > #define KVM_EXIT_ARM_NISV 28
> > +#define KVM_EXIT_DIRTY_RING_FULL 29
> >
> > /* For KVM_EXIT_INTERNAL_ERROR */
> > /* Emulate instruction failed. */
> > @@ -247,6 +248,11 @@ struct kvm_hyperv_exit {
> > /* Encounter unexpected vm-exit reason */
> > #define KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON 4
> >
> > +struct kvm_dirty_ring_indexes {
> > + __u32 avail_index; /* set by kernel */
> > + __u32 fetch_index; /* set by userspace */
> > +};
> > +
> > /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
> > struct kvm_run {
> > /* in */
> > @@ -421,6 +427,13 @@ struct kvm_run {
> > struct kvm_sync_regs regs;
> > char padding[SYNC_REGS_SIZE_BYTES];
> > } s;
> > +
> > + struct kvm_dirty_ring_indexes vcpu_ring_indexes;
> > +};
> > +
> > +/* Returned by mmap(kvm->fd, offset=0) */
> > +struct kvm_vm_run {
> > + struct kvm_dirty_ring_indexes vm_ring_indexes;
> > };
> >
> > /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
> > @@ -1009,6 +1022,7 @@ struct kvm_ppc_resize_hpt {
> > #define KVM_CAP_PPC_GUEST_DEBUG_SSTEP 176
> > #define KVM_CAP_ARM_NISV_TO_USER 177
> > #define KVM_CAP_ARM_INJECT_EXT_DABT 178
> > +#define KVM_CAP_DIRTY_LOG_RING 179
> >
> > #ifdef KVM_CAP_IRQ_ROUTING
> >
> > @@ -1472,6 +1486,9 @@ struct kvm_enc_region {
> > /* Available with KVM_CAP_ARM_SVE */
> > #define KVM_ARM_VCPU_FINALIZE _IOW(KVMIO, 0xc2, int)
> >
> > +/* Available with KVM_CAP_DIRTY_LOG_RING */
> > +#define KVM_RESET_DIRTY_RINGS _IO(KVMIO, 0xc3)
> > +
> > /* Secure Encrypted Virtualization command */
> > enum sev_cmd_id {
> > /* Guest initialization commands */
> > @@ -1622,4 +1639,23 @@ struct kvm_hyperv_eventfd {
> > #define KVM_HYPERV_CONN_ID_MASK 0x00ffffff
> > #define KVM_HYPERV_EVENTFD_DEASSIGN (1 << 0)
> >
> > +/*
> > + * The following are the requirements for supporting dirty log ring
> > + * (by enabling KVM_DIRTY_LOG_PAGE_OFFSET).
> > + *
> > + * 1. Memory accesses by KVM should call kvm_vcpu_write_* instead
> > + * of kvm_write_* so that the global dirty ring is not filled up
> > + * too quickly.
> > + * 2. kvm_arch_mmu_enable_log_dirty_pt_masked should be defined for
> > + * enabling dirty logging.
> > + * 3. There should not be a separate step to synchronize hardware
> > + * dirty bitmap with KVM's.
> > + */
> > +
> > +struct kvm_dirty_gfn {
> > + __u32 pad;
> > + __u32 slot;
> > + __u64 offset;
> > +};
> > +
> > #endif /* __LINUX_KVM_H */
> > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > new file mode 100644
> > index 000000000000..9264891f3c32
> > --- /dev/null
> > +++ b/virt/kvm/dirty_ring.c
> > @@ -0,0 +1,156 @@
> > +#include <linux/kvm_host.h>
> > +#include <linux/kvm.h>
> > +#include <linux/vmalloc.h>
> > +#include <linux/kvm_dirty_ring.h>
> > +
> > +u32 kvm_dirty_ring_get_rsvd_entries(void)
> > +{
> > + return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
> > +}
> > +
> > +int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring)
> > +{
> > + u32 size = kvm->dirty_ring_size;
> > +
> > + ring->dirty_gfns = vmalloc(size);
> > + if (!ring->dirty_gfns)
> > + return -ENOMEM;
> > + memset(ring->dirty_gfns, 0, size);
> > +
> > + ring->size = size / sizeof(struct kvm_dirty_gfn);
> > + ring->soft_limit =
> > + (kvm->dirty_ring_size / sizeof(struct kvm_dirty_gfn)) -
> > + kvm_dirty_ring_get_rsvd_entries();
>
> Minor, but what about
>
> ring->soft_limit = ring->size - kvm_dirty_ring_get_rsvd_entries();

Yeah it's better.

>
>
> > + ring->dirty_index = 0;
> > + ring->reset_index = 0;
> > + spin_lock_init(&ring->lock);
> > +
> > + return 0;
> > +}
> > +
> > +int kvm_dirty_ring_reset(struct kvm *kvm,
> > + struct kvm_dirty_ring *ring,
> > + struct kvm_dirty_ring_indexes *indexes)
> > +{
> > + u32 cur_slot, next_slot;
> > + u64 cur_offset, next_offset;
> > + unsigned long mask;
> > + u32 fetch;
> > + int count = 0;
> > + struct kvm_dirty_gfn *entry;
> > +
> > + fetch = READ_ONCE(indexes->fetch_index);
>
> If I understand correctly, if a malicious user-space writes
> ring->reset_index + 1 into fetch_index, the loop below will execute 4
> billion times.
>
>
> > + if (fetch == ring->reset_index)
> > + return 0;
>
> To protect against scenario above, I would have something like:
>
> if (fetch - ring->reset_index >= ring->size)
> return -EINVAL;

Good point... Actually I've got this in my latest branch already, but
still thanks for noticing this!

>
> > +
> > + entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> > + /*
> > + * The ring buffer is shared with userspace, which might mmap
> > + * it and concurrently modify slot and offset. Userspace must
> > + * not be trusted! READ_ONCE prevents the compiler from changing
> > + * the values after they've been range-checked (the checks are
> > + * in kvm_reset_dirty_gfn).
> > + */
> > + smp_read_barrier_depends();
> > + cur_slot = READ_ONCE(entry->slot);
> > + cur_offset = READ_ONCE(entry->offset);
> > + mask = 1;
> > + count++;
> > + ring->reset_index++;

[1]

> > + while (ring->reset_index != fetch) {
> > + entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> > + smp_read_barrier_depends();
> > + next_slot = READ_ONCE(entry->slot);
> > + next_offset = READ_ONCE(entry->offset);
> > + ring->reset_index++;
> > + count++;
> > + /*
> > + * Try to coalesce the reset operations when the guest is
> > + * scanning pages in the same slot.
> > + */
> > + if (next_slot == cur_slot) {
> > + int delta = next_offset - cur_offset;
>
> Since you diff two u64, shouldn't that be an i64 rather than int?

I found there's no i64, so I'm using "long long".

>
> > +
> > + if (delta >= 0 && delta < BITS_PER_LONG) {
> > + mask |= 1ull << delta;
> > + continue;
> > + }
> > +
> > + /* Backwards visit, careful about overflows! */
> > + if (delta > -BITS_PER_LONG && delta < 0 &&
> > + (mask << -delta >> -delta) == mask) {
> > + cur_offset = next_offset;
> > + mask = (mask << -delta) | 1;
> > + continue;
> > + }
> > + }
> > + kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > + cur_slot = next_slot;
> > + cur_offset = next_offset;
> > + mask = 1;
> > + }
> > + kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
>
> So if you did not coalesce the last one, you call kvm_reset_dirty_gfn
> twice? Something smells weird about this loop ;-) I have a gut feeling
> that it could be done in a single while loop combined with the entry
> test, but I may be wrong.

It should be easy to save a few lines at [1] by introducing a boolean
"first_round". I don't see an easy way to avoid the
kvm_reset_dirty_gfn() call at the end, though...
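
For the record, a rough sketch of the "first_round" variant
(illustration only, not tested):

    u32 cur_slot = 0, next_slot;
    u64 cur_offset = 0, next_offset;
    unsigned long mask = 0;
    bool first_round = true;
    int count = 0;
    struct kvm_dirty_gfn *entry;

    while (ring->reset_index != fetch) {
            entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
            smp_read_barrier_depends();
            next_slot = READ_ONCE(entry->slot);
            next_offset = READ_ONCE(entry->offset);
            ring->reset_index++;
            count++;
            /* try to coalesce with the run being accumulated */
            if (!first_round && next_slot == cur_slot) {
                    long long delta = next_offset - cur_offset;

                    if (delta >= 0 && delta < BITS_PER_LONG) {
                            mask |= 1ull << delta;
                            continue;
                    }
                    /* backwards visit, careful about overflows */
                    if (delta > -BITS_PER_LONG && delta < 0 &&
                        (mask << -delta >> -delta) == mask) {
                            cur_offset = next_offset;
                            mask = (mask << -delta) | 1;
                            continue;
                    }
            }
            /* flush the previous run before starting a new one */
            if (!first_round)
                    kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
            cur_slot = next_slot;
            cur_offset = next_offset;
            mask = 1;
            first_round = false;
    }
    /* the final run still needs the trailing flush */
    if (!first_round)
            kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);

This would also make the early "fetch == ring->reset_index" return
unnecessary, since an empty range simply skips the loop and the
trailing flush.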

>
>
> > +
> > + return count;
> > +}
> > +
> > +static inline u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
> > +{
> > + return ring->dirty_index - ring->reset_index;
> > +}
> > +
> > +bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring)
> > +{
> > + return kvm_dirty_ring_used(ring) >= ring->size;
> > +}
> > +
> > +/*
> > + * Returns:
> > + * >0 if we should kick the vcpu out,
> > + * =0 if the gfn pushed successfully, or,
> > + * <0 if error (e.g. ring full)
> > + */
> > +int kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
> > + struct kvm_dirty_ring_indexes *indexes,
> > + u32 slot, u64 offset, bool lock)
>
> Obviously, if you go with the suggestion to have a "lock" only in struct
> kvm, then you'd have to pass a lock ptr instead of a bool.

Paolo got a better solution on that. That "lock" will be dropped.

>
> > +{
> > + int ret;
> > + struct kvm_dirty_gfn *entry;
> > +
> > + if (lock)
> > + spin_lock(&ring->lock);
> > +
> > + if (kvm_dirty_ring_full(ring)) {
> > + ret = -EBUSY;
> > + goto out;
> > + }
> > +
> > + entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> > + entry->slot = slot;
> > + entry->offset = offset;
> > + smp_wmb();
> > + ring->dirty_index++;
> > + WRITE_ONCE(indexes->avail_index, ring->dirty_index);
>
> Following up on comment about having to vmexit other VCPUs above:
> If you have a write barrier for the entry, and then a write once for the
> index, isn't that sufficient to ensure that another CPU will pick up the
> right values in the right order?

I think so. I've replied above on the RESET issue.

>
>
> > + ret = kvm_dirty_ring_used(ring) >= ring->soft_limit;
> > + pr_info("%s: slot %u offset %llu used %u\n",
> > + __func__, slot, offset, kvm_dirty_ring_used(ring));
> > +
> > +out:
> > + if (lock)
> > + spin_unlock(&ring->lock);
> > +
> > + return ret;
> > +}
> > +
> > +struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 i)
>
> Still don't like 'i' :-)
>
>
> (Stopped my review here for lack of time, decided to share what I had so far)

Thanks for your comments!

--
Peter Xu

2019-12-14 07:58:45

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On 13/12/19 21:23, Peter Xu wrote:
>> What is the benefit of using u16 for that? That means with 4K pages, you
>> can share at most 256M of dirty memory each time? That seems low to me,
>> especially since it's sufficient to touch one byte in a page to dirty it.
>>
>> Actually, this is not consistent with the definition in the code ;-)
>> So I'll assume it's actually u32.
> Yes it's u32 now. Actually I believe at least Paolo would prefer u16
> more. :)

It has to be u16, because it overlaps the padding of the first entry.

Paolo

> I think even u16 would be mostly enough (if you see, the maximum
> allowed value currently is 64K entries only, not a big one). Again,
> the thing is that the userspace should be collecting the dirty bits,
> so the ring shouldn't reach full easily. Even if it does, we should
> probably let it stop for a while as explained above. It'll be
> inefficient only if we set it to a too-small value, imho.
>

2019-12-14 16:27:52

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On Sat, Dec 14, 2019 at 08:57:26AM +0100, Paolo Bonzini wrote:
> On 13/12/19 21:23, Peter Xu wrote:
> >> What is the benefit of using u16 for that? That means with 4K pages, you
> >> can share at most 256M of dirty memory each time? That seems low to me,
> >> especially since it's sufficient to touch one byte in a page to dirty it.
> >>
> >> Actually, this is not consistent with the definition in the code ;-)
> >> So I'll assume it's actually u32.
> > Yes it's u32 now. Actually I believe at least Paolo would prefer u16
> > more. :)
>
> It has to be u16, because it overlaps the padding of the first entry.

Hmm, could you explain?

Note that what Christophe commented on here are dirty_index and
reset_index of "struct kvm_dirty_ring", so imho they could really be
anything we want, as long as they can hold a u32 (the type of the
fields in kvm_dirty_ring_indexes).

If you were instead talking about the previous union definition of
"struct kvm_dirty_gfns" rather than "struct kvm_dirty_ring", iiuc I've
moved those indices out of it and defined kvm_dirty_ring_indexes,
which we expose via kvm_run, so we no longer have that limitation
either?

--
Peter Xu

2019-12-15 17:24:16

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On Tue, Dec 10, 2019 at 06:09:02PM +0100, Paolo Bonzini wrote:
> On 10/12/19 16:52, Peter Xu wrote:
> > On Tue, Dec 10, 2019 at 11:07:31AM +0100, Paolo Bonzini wrote:
> >>> I'm thinking whether I can start
> >>> to use this information in the next post on solving an issue I
> >>> encountered with the waitqueue.
> >>>
> >>> Current waitqueue is still problematic in that it could wait even with
> >>> the mmu lock held when with vcpu context.
> >>
> >> I think the idea of the soft limit is that the waiting just cannot
> >> happen. That is, the number of dirtied pages _outside_ the guest (guest
> >> accesses are taken care of by PML, and are subtracted from the soft
> >> limit) cannot exceed hard_limit - (soft_limit + pml_size).
> >
> > So the question go backs to, whether this is guaranteed somehow? Or
> > do you prefer us to keep the warn_on_once until it triggers then we
> > can analyze (which I doubt..)?
>
> Yes, I would like to keep the WARN_ON_ONCE just because you never know.
>
> Of course it would be much better to audit the calls to kvm_write_guest
> and figure out how many could trigger (e.g. two from the operands of an
> emulated instruction, 5 from a nested EPT walk, 1 from a page walk, etc.).

I would say we'd better either audit all the call sites to prove that
the ring can never overflow, or keep the waitqueue at least. The
problem is that if we release KVM with only the WARN_ON_ONCE and later
find that it can be triggered, i.e. that a full ring cannot be
avoided, then the interface and design are broken, and it could even
be too late to fix them after the interface has been published.

(Actually I'm not certain about the previous clear_dirty interface
either, where we introduced a new capability for it. I'm not sure
whether that could have been avoided, because after all the initial
version was not working at all and we fixed it up without changing
the interface. For this one, however, if we eventually prove the
design wrong, IMHO we'd have to introduce another capability for it,
and the interface would be prone to change too.)

So, in the hope that we can avoid the waitqueue, I checked all the
callers of mark_page_dirty_in_slot(). Since this initial work is
x86-only, I didn't look further into the other archs, assuming that
can be done later when the feature is implemented for them (the list
below does also cover the common code):

mark_page_dirty_in_slot calls, per-vm (x86 only)
__kvm_write_guest_page
kvm_write_guest_page
init_rmode_tss
vmx_set_tss_addr
kvm_vm_ioctl_set_tss_addr [*]
init_rmode_identity_map
vmx_create_vcpu [*]
vmx_write_pml_buffer
kvm_arch_write_log_dirty [&]
kvm_write_guest
kvm_hv_setup_tsc_page
kvm_guest_time_update [&]
nested_flush_cached_shadow_vmcs12 [&]
kvm_write_wall_clock [&]
kvm_pv_clock_pairing [&]
kvmgt_rw_gpa [?]
kvm_write_guest_offset_cached
kvm_steal_time_set_preempted [&]
kvm_write_guest_cached
pv_eoi_put_user [&]
kvm_lapic_sync_to_vapic [&]
kvm_setup_pvclock_page [&]
record_steal_time [&]
apf_put_user [&]
kvm_clear_guest_page
init_rmode_tss [*] (see above)
init_rmode_identity_map [*] (see above)
kvm_clear_guest
synic_set_msr
kvm_hv_set_msr [&]
kvm_write_guest_offset_cached [&] (see above)
mark_page_dirty
kvm_hv_set_msr_pw [&]

We only need to look at the leaves of these traces, because that is
where the dirty requests originate. I'm marking each leaf with the
criteria below to make it easier to focus:

Cases with [*]: should not matter much
[&]: actually with a per-vcpu context in the upper layer
[?]: uncertain...

I'm a bit surprised after taking these notes: besides the calls that
can probably be ignored (marked [*]), most of the remaining per-vm
dirty requests actually do have a vcpu context.

Now that we have kvm_get_running_vcpu(), all the [&] cases should be
fine without changing anything. Still, if no one disagrees, I'm
inclined to add another patch in the next post that explicitly
converts the [&] cases to pass a vcpu pointer instead of a kvm
pointer, and then verify that against kvm_get_running_vcpu().

So the only uncertainty now is kvmgt_rw_gpa() which is marked as [?].
Could this happen frequently? I would guess the answer is we don't
know (which means it can).

>
> > One thing to mention is that for with-vcpu cases, we probably can even
> > stop KVM_RUN immediately as long as either the per-vm or per-vcpu ring
> > reaches the softlimit, then for vcpu case it should be easier to
> > guarantee that. What I want to know is the rest of cases like ioctls
> > or even something not from the userspace (which I think I should read
> > more later..).
>
> Which ioctls? Most ioctls shouldn't dirty memory at all.

init_rmode_tss or init_rmode_identity_map. But I've marked them as
unimportant because they should only happen once at boot.

>
> >>> And if we see if the mark_page_dirty_in_slot() is not with a vcpu
> >>> context (e.g. kvm_mmu_page_fault) but with an ioctl context (those
> >>> cases we'll use per-vm dirty ring) then it's probably fine.
> >>>
> >>> My planned solution:
> >>>
> >>> - When kvm_get_running_vcpu() != NULL, we postpone the waitqueue waits
> >>> until we finished handling this page fault, probably in somewhere
> >>> around vcpu_enter_guest, so that we can do wait_event() after the
> >>> mmu lock released
> >>
> >> I think this can cause a race:
> >>
> >> vCPU 1 vCPU 2 host
> >> ---------------------------------------------------------------
> >> mark page dirty
> >> write to page
> >> treat page as not dirty
> >> add page to ring
> >>
> >> where vCPU 2 skips the clean-page slow path entirely.
> >
> > If we're still with the rule in userspace that we first do RESET then
> > collect and send the pages (just like what we've discussed before),
> > then IMHO it's fine to have vcpu2 to skip the slow path? Because
> > RESET happens at "treat page as not dirty", then if we are sure that
> > we only collect and send pages after that point, then the latest
> > "write to page" data from vcpu2 won't be lost even if vcpu2 is not
> > blocked by vcpu1's ring full?
>
> Good point, the race would become
>
> vCPU 1 vCPU 2 host
> ---------------------------------------------------------------
> mark page dirty
> write to page
> reset rings
> wait for mmu lock
> add page to ring
> release mmu lock
> ...do reset...
> release mmu lock
> page is now dirty

Hmm, the page will be dirty after the reset, but is that an issue?

Or, could you help me to identify what I've missed?

>
> > Maybe we can also consider to let mark_page_dirty_in_slot() return a
> > value, then the upper layer could have a chance to skip the spte
> > update if mark_page_dirty_in_slot() fails to mark the dirty bit, so it
> > can return directly with RET_PF_RETRY.
>
> I don't think that's possible, most writes won't come from a page fault
> path and cannot retry.

Yep, maybe I should put it the other way round: we only wait if
kvm_get_running_vcpu() == NULL. Then, somewhere near
vcpu_enter_guest(), we add a check that waits if the per-vcpu ring is
full. Would that work?
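
For illustration, the shape I have in mind (dirty_ring_waitq is a
hypothetical per-vm waitqueue woken by KVM_RESET_DIRTY_RINGS, and the
exact hook points are still to be decided):

    /* when a push finds the ring full or above the soft limit: */
    if (kvm_dirty_ring_full(ring)) {
            if (!kvm_get_running_vcpu()) {
                    /* no vcpu context (e.g. ioctl path): safe to sleep */
                    wait_event(kvm->dirty_ring_waitq,
                               !kvm_dirty_ring_full(ring));
            } else {
                    /*
                     * vcpu context: never sleep here (we may hold the
                     * mmu lock); request an exit instead and wait near
                     * vcpu_enter_guest() after the lock is released.
                     */
                    kvm_make_request(KVM_REQ_DIRTY_RING_FULL,
                                     kvm_get_running_vcpu());
            }
    }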

Thanks,

--
Peter Xu

2019-12-15 17:34:34

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On Thu, Dec 12, 2019 at 01:08:14AM +0100, Paolo Bonzini wrote:
> >>> What depends on what here? Looks suspicious ...
> >>
> >> Hmm, I think maybe it can be removed because the entry pointer
> >> reference below should be an ordering constraint already?
>
> entry->xxx depends on ring->reset_index.

Yes that's true, but...

entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
/* barrier? */
next_slot = READ_ONCE(entry->slot);
next_offset = READ_ONCE(entry->offset);

... I think entry->xxx depends on entry first, and entry in turn
depends on reset_index. So it seems fine, because everything is
ordered by a data dependency?

>
> >>> what's the story around locking here? Why is it safe
> >>> not to take the lock sometimes?
> >>
> >> kvm_dirty_ring_push() will be with lock==true only when the per-vm
> >> ring is used. For per-vcpu ring, because that will only happen with
> >> the vcpu context, then we don't need locks (so kvm_dirty_ring_push()
> >> is called with lock==false).
>
> FWIW this will be done much more nicely in v2.
>
> >>>> + page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> >>>> + if (!page) {
> >>>> + r = -ENOMEM;
> >>>> + goto out_err_alloc_page;
> >>>> + }
> >>>> + kvm->vm_run = page_address(page);
> >>>
> >>> So 4K with just 8 bytes used. Not as bad as 1/2Mbyte for the ring but
> >>> still. What is wrong with just a pointer and calling put_user?
> >>
> >> I want to make it the start point for sharing fields between
> >> user/kernel per-vm. Just like kvm_run for per-vcpu.
>
> This page is actually not needed at all. Userspace can just map at
> KVM_DIRTY_LOG_PAGE_OFFSET, the indices reside there. You can drop
> kvm_vm_run completely.

I changed it because otherwise we'd be using the padding of the first
entry, and the padding of all the other entries would be wasted
memory: we could never turn the padding into new fields, since only
the first entry's padding overlaps with the indices. IMHO that could
waste even more than 4K.

(For now we only "waste" 4K for the per-vm case; kvm_run is already
mapped so there is no waste on the per-vcpu side. And I still think
we can potentially find more uses for kvm_vm_run in the future.)

>
> >>>> + } else {
> >>>> + /*
> >>>> + * Put onto per vm ring because no vcpu context. Kick
> >>>> + * vcpu0 if ring is full.
> >>>
> >>> What about tasks on vcpu 0? Do guests realize it's a bad idea to put
> >>> critical tasks there, they will be penalized disproportionally?
> >>
> >> Reasonable question. So far we can't avoid it because vcpu exit is
> >> the event mechanism to say "hey please collect dirty bits". Maybe
> >> someway is better than this, but I'll need to rethink all these
> >> over...
> >
> > Maybe signal an eventfd, and let userspace worry about deciding what to
> > do.
>
> This has to be done synchronously. But the vm ring should be used very
> rarely (it's for things like kvmclock updates that write to guest memory
> outside a vCPU), possibly a handful of times in the whole run of the VM.

I've summarized a list of callers that might dirty guest memory in
the other thread; it seems to me that even the kvmclock update is
done in a per-vcpu context.

>
> >>> KVM_DIRTY_RING_MAX_ENTRIES is not part of UAPI.
> >>> So how does userspace know what's legal?
> >>> Do you expect it to just try?
> >>
> >> Yep that's what I thought. :)
>
> We should return it for KVM_CHECK_EXTENSION.

OK. I'll drop the versioning.

--
Peter Xu

2019-12-18 22:00:06

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On Tue, Dec 17, 2019 at 05:28:54PM +0100, Paolo Bonzini wrote:
> On 17/12/19 17:24, Peter Xu wrote:
> >> No, please pass it all the way down to the [&] functions but not to
> >> kvm_write_guest_page. Those should keep using vcpu->kvm.
> > Actually I even wanted to refactor these helpers. I mean, we have two
> > sets of helpers now, kvm_[vcpu]_{read|write}*(), so one set is per-vm,
> > the other set is per-vcpu. IIUC the only difference of these two are
> > whether we should consider ((vcpu)->arch.hflags & HF_SMM_MASK) or we
> > just write to address space zero always.
>
> Right.
>
> > Could we unify them into a
> > single set of helper (I'll just drop the *_vcpu_* helpers because it's
> > longer when write) but we always pass in vcpu* as the first parameter?
> > Then we add another parameter "vcpu_smm" to show whether we want to
> > consider the HF_SMM_MASK flag.
>
> You'd have to check through all KVM implementations whether you always
> have the vCPU. Also non-x86 doesn't have address spaces, and by the
> time you add ", true" or ", false" it's longer than the "_vcpu_" you
> have removed. So, not a good idea in my opinion. :D

Well, now I've changed my mind. :) (considering that we still have
many places that will not have vcpu*...)

I can simply add that "vcpu_smm" parameter to kvm_vcpu_write_*()
without removing the kvm_write_*() helpers. Then I'll be able to
convert most of the kvm_write_*() (or its family) callers to
kvm_vcpu_write*(..., vcpu_smm=false) calls where appropriate.

Would that be good?

--
Peter Xu

2019-12-18 22:25:13

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On Wed, Dec 18, 2019 at 04:58:57PM -0500, Peter Xu wrote:
> On Tue, Dec 17, 2019 at 05:28:54PM +0100, Paolo Bonzini wrote:
> > On 17/12/19 17:24, Peter Xu wrote:
> > >> No, please pass it all the way down to the [&] functions but not to
> > >> kvm_write_guest_page. Those should keep using vcpu->kvm.
> > > Actually I even wanted to refactor these helpers. I mean, we have two
> > > sets of helpers now, kvm_[vcpu]_{read|write}*(), so one set is per-vm,
> > > the other set is per-vcpu. IIUC the only difference of these two are
> > > whether we should consider ((vcpu)->arch.hflags & HF_SMM_MASK) or we
> > > just write to address space zero always.
> >
> > Right.
> >
> > > Could we unify them into a
> > > single set of helper (I'll just drop the *_vcpu_* helpers because it's
> > > longer when write) but we always pass in vcpu* as the first parameter?
> > > Then we add another parameter "vcpu_smm" to show whether we want to
> > > consider the HF_SMM_MASK flag.
> >
> > You'd have to check through all KVM implementations whether you always
> > have the vCPU. Also non-x86 doesn't have address spaces, and by the
> > time you add ", true" or ", false" it's longer than the "_vcpu_" you
> > have removed. So, not a good idea in my opinion. :D
>
> Well, now I've changed my mind. :) (considering that we still have
> many places that will not have vcpu*...)
>
> I can simply add that "vcpu_smm" parameter to kvm_vcpu_write_*()
> without removing the kvm_write_*() helpers. Then I'll be able to
> convert most of the kvm_write_*() (or its family) callers to
> kvm_vcpu_write*(..., vcpu_smm=false) calls where proper.
>
> Would that be good?

I've lost track of the problem you're trying to solve, but if you do
something like "vcpu_smm=false", explicitly pass an address space ID
instead of hardcoding x86 specific SMM crud, e.g.

kvm_vcpu_write*(..., as_id=0);

2019-12-18 22:38:49

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On 18/12/19 23:24, Sean Christopherson wrote:
> I've lost track of the problem you're trying to solve, but if you do
> something like "vcpu_smm=false", explicitly pass an address space ID
> instead of hardcoding x86 specific SMM crud, e.g.
>
> kvm_vcpu_write*(..., as_id=0);

And the point of having kvm_vcpu_* vs. kvm_write_* was exactly to not
having to hardcode the address space ID. If anything you could add a
__kvm_vcpu_write_* API that takes vcpu+as_id, but really I'd prefer to
keep kvm_get_running_vcpu() for now and then it can be refactored later.
There are already way too many memory r/w APIs...

Paolo

2019-12-18 22:50:11

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On Wed, Dec 18, 2019 at 11:37:31PM +0100, Paolo Bonzini wrote:
> On 18/12/19 23:24, Sean Christopherson wrote:
> > I've lost track of the problem you're trying to solve, but if you do
> > something like "vcpu_smm=false", explicitly pass an address space ID
> > instead of hardcoding x86 specific SMM crud, e.g.
> >
> > kvm_vcpu_write*(..., as_id=0);
>
> And the point of having kvm_vcpu_* vs. kvm_write_* was exactly to not
> having to hardcode the address space ID. If anything you could add a
> __kvm_vcpu_write_* API that takes vcpu+as_id, but really I'd prefer to
> keep kvm_get_running_vcpu() for now and then it can be refactored later.
> There are already way too many memory r/w APIs...

Yeah, actually that's why I wanted to start working on that, just in
case it could help unify all of them some day (and since we did take
a few steps forward on that while discussing the dirty ring). But
yeah, kvm_get_running_vcpu() certainly works for us already; let's go
the easy way this time. Thanks,

--
Peter Xu

2019-12-20 18:20:08

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking

On Fri, Dec 13, 2019 at 03:23:24PM -0500, Peter Xu wrote:
> > > +If one of the ring buffers is full, the guest will exit to userspace
> > > +with the exit reason set to KVM_EXIT_DIRTY_LOG_FULL, and the
> > > +KVM_RUN ioctl will return -EINTR. Once that happens, userspace
> > > +should pause all the vcpus, then harvest all the dirty pages and
> > > +rearm the dirty traps. It can unpause the guest after that.
> >
> > Except for the condition above, why is it necessary to pause other VCPUs
> > than the one being harvested?
>
> This is a good question. Paolo could correct me if I'm wrong.
>
> Firstly I think this should rarely happen if the userspace is
> collecting the dirty bits from time to time. If it happens, we'll
> need to call KVM_RESET_DIRTY_RINGS to reset all the rings. Then the
> question actually becomes to: Whether we'd like to have per-vcpu
> KVM_RESET_DIRTY_RINGS?

Hmm, rethinking this, I may have erroneously read something into
Christophe's question. Christophe was asking why we kick the other
vcpus, which does not imply that the RESET needs to be per-vcpu.

So now I tend to agree with Christophe: I can't find a reason why we
need to kick all vcpus out. Even if we need to flush the TLB on all
vcpus at RESET time, we can simply collect all the rings before
sending the RESET, so that's not really a reason to explicitly kick
them from userspace. I plan to remove this sentence in the next
version (which is only a documentation update).

--
Peter Xu