KVM is currently capable of receiving a single memslot update through
the KVM_SET_USER_MEMORY_REGION ioctl.
The problem arises when we want to atomically perform multiple updates,
so that readers of the active memslot list avoid seeing incomplete states.
For example, in RHBZ https://bugzilla.redhat.com/show_bug.cgi?id=1979276
we see how non-atomic updates cause a boot failure, because vCPUs
will see a partial update (old memslot deleted, new one not yet created)
and will crash.
In this series we introduce KVM_SET_USER_MEMORY_REGION_LIST, a new ioctl
that takes a kvm_userspace_memory_region_list, a list of memslot updates,
and performs them atomically.
"atomically" in KVM words just means "apply all modifications to the
inactive memslot list, and then perform a single swap to replace the
active list with the inactive".
It is slightly more complicated that that, since DELETE and MOVE operations
require 2 swaps, but the main idea is the above.
Patches 1-6 are just code movements, in preparation for the following patches.
Patch 7 allows the invalid slot to be in both the inactive and active memslot lists.
Patch 8 makes KVM search for the existing (old) memslot in the inactive list
instead of the active one, allowing multiple memslot updates to be performed
without swapping.
Patch 9 implements the ioctl logic.
QEMU userspace logic in preparation for the IOCTL is here:
https://patchew.org/QEMU/[email protected]/
"[RFC PATCH v2 0/3] accel/kvm: extend kvm memory listener to support"
TODOs and ideas:
- limit the size of the ioctl arguments. Right now it is unbounded
- try to reduce the number of swaps necessary? i.e. currently every DELETE/MOVE
requires an additional swap
- add selftests
- add documentation
Emanuele Giuseppe Esposito (9):
kvm_main.c: move slot check in kvm_set_memory_region
kvm.h: introduce KVM_SET_USER_MEMORY_REGION_LIST ioctl
kvm_main.c: introduce kvm_internal_memory_region_list
kvm_main.c: split logic in kvm_set_memslots
kvm_main.c: split __kvm_set_memory_region logic in kvm_check_mem and
kvm_prepare_batch
kvm_main.c: simplify change-specific callbacks
kvm_main.c: duplicate invalid memslot also in inactive list
kvm_main.c: find memslots from the inactive memslot list
kvm_main.c: handle atomic memslot update
arch/x86/kvm/x86.c | 3 +-
include/linux/kvm_host.h | 21 +-
include/uapi/linux/kvm.h | 21 +-
virt/kvm/kvm_main.c | 420 +++++++++++++++++++++++++++++++--------
4 files changed, 377 insertions(+), 88 deletions(-)
--
2.31.1
This ioctl enables atomic updates of multiple memslots.
The userspace application provides a kvm_userspace_memory_region_list
containing a list of entries, each representing a modification to be
performed on a memslot.
Requests with invalidate_slot == 1 are pre-processed, because they
are either DELETE or MOVE: the memslot must first be replaced
with a copy marked as KVM_MEMSLOT_INVALID, and only then removed or replaced.
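
As a rough usage sketch from the userspace side (assuming kernel headers
with this series applied; VM creation, error handling and the allocation
of the backing memory are omitted, and the helper name is made up):

#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Create one memslot through the new list-based ioctl. */
static int set_one_region(int vm_fd, __u64 gpa, __u64 size, __u64 uaddr)
{
	struct kvm_userspace_memory_region_list *list;
	int r;

	list = calloc(1, sizeof(*list) + sizeof(list->entries[0]));
	if (!list)
		return -1;

	list->nent = 1;
	list->entries[0].slot = 0;
	list->entries[0].guest_phys_addr = gpa;
	list->entries[0].memory_size = size;
	list->entries[0].userspace_addr = uaddr;
	list->entries[0].invalidate_slot = 0;	/* not a DELETE/MOVE */

	r = ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION_LIST, list);
	free(list);
	return r;
}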
Signed-off-by: Emanuele Giuseppe Esposito <[email protected]>
---
include/uapi/linux/kvm.h | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index a36e78710382..673496b91a25 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -103,6 +103,24 @@ struct kvm_userspace_memory_region {
__u64 userspace_addr; /* start of the userspace allocated memory */
};
+/* for KVM_SET_USER_MEMORY_REGION_LIST */
+struct kvm_userspace_memory_region_entry {
+ __u32 slot;
+ __u32 flags;
+ __u64 guest_phys_addr;
+ __u64 memory_size; /* bytes */
+ __u64 userspace_addr; /* start of the userspace allocated memory */
+ __u8 invalidate_slot;
+ __u8 padding[31];
+};
+
+/* for KVM_SET_USER_MEMORY_REGION_LIST */
+struct kvm_userspace_memory_region_list {
+ __u32 nent;
+ __u32 flags;
+ struct kvm_userspace_memory_region_entry entries[0];
+};
+
/*
* The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
* other bits are reserved for kvm internal use which are defined in
@@ -1444,7 +1462,8 @@ struct kvm_vfio_spapr_tce {
struct kvm_userspace_memory_region)
#define KVM_SET_TSS_ADDR _IO(KVMIO, 0x47)
#define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO, 0x48, __u64)
-
+#define KVM_SET_USER_MEMORY_REGION_LIST _IOW(KVMIO, 0x49, \
+ struct kvm_userspace_memory_region_list)
/* enable ucontrol for s390 */
struct kvm_s390_ucas_mapping {
__u64 user_addr;
--
2.31.1
In preparation for atomic memslot updates, make sure the
invalid memslot also replaces the old one in the inactive list.
This implies that once we want to insert the new slot for a MOVE,
or simply delete the existing one for a DELETE,
we need to remove the "invalid" slot, not the "old" one.
Signed-off-by: Emanuele Giuseppe Esposito <[email protected]>
---
virt/kvm/kvm_main.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6b73615891f0..31e46f9a06fa 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1830,6 +1830,7 @@ static int kvm_prepare_memslot(struct kvm *kvm,
}
batch->invalid = invalid_slot;
kvm_invalidate_memslot(kvm, old, invalid_slot);
+ kvm_replace_memslot(kvm, old, invalid_slot);
}
r = kvm_prepare_memory_region(kvm, batch);
@@ -1900,10 +1901,14 @@ static int kvm_set_memslot(struct kvm *kvm,
return r;
/*
- * if change is DELETE or MOVE, invalid is in active memslots
- * and old in inactive, so replace old with new.
+ * if change is DELETE or MOVE, invalid is in both active and inactive
+ * memslot lists. This means that we don't need old anymore, and
+ * we should replace invalid with new.
*/
- kvm_replace_memslot(kvm, batch->old, batch->new);
+ if (batch->change == KVM_MR_DELETE || batch->change == KVM_MR_MOVE)
+ kvm_replace_memslot(kvm, batch->invalid, batch->new);
+ else
+ kvm_replace_memslot(kvm, batch->old, batch->new);
/* either old or invalid is the same, since invalid is old's copy */
as_id = kvm_memslots_get_as_id(batch->old, batch->new);
--
2.31.1
Just a function split. No functional change intended,
except for the fact that kvm_prepare_batch() does not
immediately call kvm_set_memslot() if batch->change is
KVM_MR_DELETE, but delegates that to the caller (__kvm_set_memory_region).
Signed-off-by: Emanuele Giuseppe Esposito <[email protected]>
---
virt/kvm/kvm_main.c | 120 +++++++++++++++++++++++++++++---------------
1 file changed, 79 insertions(+), 41 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 17f07546d591..9d917af30593 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1927,19 +1927,9 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
return false;
}
-/*
- * Allocate some memory and give it an address in the guest physical address
- * space.
- *
- * Discontiguous memory is allowed, mostly for framebuffers.
- * This function takes also care of initializing batch->new/old/invalid/change
- * fields.
- *
- * Must be called holding kvm->slots_lock for write.
- */
-int __kvm_set_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem,
- struct kvm_internal_memory_region_list *batch)
+static int kvm_prepare_batch(struct kvm *kvm,
+ const struct kvm_userspace_memory_region *mem,
+ struct kvm_internal_memory_region_list *batch)
{
struct kvm_memory_slot *old, *new;
struct kvm_memslots *slots;
@@ -1947,34 +1937,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
unsigned long npages;
gfn_t base_gfn;
int as_id, id;
- int r;
-
- r = check_memory_region_flags(mem);
- if (r)
- return r;
as_id = mem->slot >> 16;
id = (u16)mem->slot;
- /* General sanity checks */
- if ((mem->memory_size & (PAGE_SIZE - 1)) ||
- (mem->memory_size != (unsigned long)mem->memory_size))
- return -EINVAL;
- if (mem->guest_phys_addr & (PAGE_SIZE - 1))
- return -EINVAL;
- /* We can read the guest memory with __xxx_user() later on. */
- if ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
- (mem->userspace_addr != untagged_addr(mem->userspace_addr)) ||
- !access_ok((void __user *)(unsigned long)mem->userspace_addr,
- mem->memory_size))
- return -EINVAL;
- if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
- return -EINVAL;
- if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
- return -EINVAL;
- if ((mem->memory_size >> PAGE_SHIFT) > KVM_MEM_MAX_NR_PAGES)
- return -EINVAL;
-
slots = __kvm_memslots(kvm, as_id);
/*
@@ -1993,7 +1959,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
batch->change = KVM_MR_DELETE;
batch->new = NULL;
- return kvm_set_memslot(kvm, batch);
+ return 0;
}
base_gfn = (mem->guest_phys_addr >> PAGE_SHIFT);
@@ -2020,7 +1986,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
else if (mem->flags != old->flags)
change = KVM_MR_FLAGS_ONLY;
else /* Nothing to change. */
- return 0;
+ return 1;
}
if ((change == KVM_MR_CREATE || change == KVM_MR_MOVE) &&
@@ -2041,12 +2007,81 @@ int __kvm_set_memory_region(struct kvm *kvm,
batch->new = new;
batch->change = change;
+ return 0;
+}
+
+static int kvm_check_mem(const struct kvm_userspace_memory_region *mem)
+{
+ int as_id, id;
+
+ as_id = mem->slot >> 16;
+ id = (u16)mem->slot;
+
+ /* General sanity checks */
+ if ((mem->memory_size & (PAGE_SIZE - 1)) ||
+ (mem->memory_size != (unsigned long)mem->memory_size))
+ return -EINVAL;
+ if (mem->guest_phys_addr & (PAGE_SIZE - 1))
+ return -EINVAL;
+ /* We can read the guest memory with __xxx_user() later on. */
+ if ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
+ (mem->userspace_addr != untagged_addr(mem->userspace_addr)) ||
+ !access_ok((void __user *)(unsigned long)mem->userspace_addr,
+ mem->memory_size))
+ return -EINVAL;
+ if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
+ return -EINVAL;
+ if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
+ return -EINVAL;
+ if ((mem->memory_size >> PAGE_SHIFT) > KVM_MEM_MAX_NR_PAGES)
+ return -EINVAL;
+
+ return 0;
+}
+
+static int kvm_check_memory_region(struct kvm *kvm,
+ const struct kvm_userspace_memory_region *mem,
+ struct kvm_internal_memory_region_list *batch)
+{
+ int r;
+
+ r = check_memory_region_flags(mem);
+ if (r)
+ return r;
- r = kvm_set_memslot(kvm, batch);
+ r = kvm_check_mem(mem);
if (r)
- kfree(new);
+ return r;
+
+ r = kvm_prepare_batch(kvm, mem, batch);
+ if (r && batch->new)
+ kfree(batch->new);
+
return r;
}
+
+/*
+ * Allocate some memory and give it an address in the guest physical address
+ * space.
+ *
+ * Discontiguous memory is allowed, mostly for framebuffers.
+ * This function takes also care of initializing batch->new/old/invalid/change
+ * fields.
+ *
+ * Must be called holding kvm->slots_lock for write.
+ */
+int __kvm_set_memory_region(struct kvm *kvm,
+ const struct kvm_userspace_memory_region *mem,
+ struct kvm_internal_memory_region_list *batch)
+{
+ int r;
+
+ r = kvm_check_memory_region(kvm, mem, batch);
+ if (r)
+ return r;
+
+ return kvm_set_memslot(kvm, batch);
+}
EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
static int kvm_set_memory_region(struct kvm *kvm,
@@ -2061,6 +2096,9 @@ static int kvm_set_memory_region(struct kvm *kvm,
mutex_lock(&kvm->slots_lock);
r = __kvm_set_memory_region(kvm, mem, batch);
mutex_unlock(&kvm->slots_lock);
+ /* r == 1 means empty request, nothing to do but no error */
+ if (r == 1)
+ r = 0;
return r;
}
--
2.31.1
Also make kvm_set_memory_region() static, since it is not used outside
kvm_main.c.
Signed-off-by: Emanuele Giuseppe Esposito <[email protected]>
---
include/linux/kvm_host.h | 2 --
virt/kvm/kvm_main.c | 11 +++++------
2 files changed, 5 insertions(+), 8 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3b40f8d68fbb..1c5b7b2e35dd 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1108,8 +1108,6 @@ enum kvm_mr_change {
KVM_MR_FLAGS_ONLY,
};
-int kvm_set_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem);
int __kvm_set_memory_region(struct kvm *kvm,
const struct kvm_userspace_memory_region *mem);
void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index da263c370d00..339de0ed4557 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2007,24 +2007,23 @@ int __kvm_set_memory_region(struct kvm *kvm,
}
EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
-int kvm_set_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem)
+static int kvm_set_memory_region(struct kvm *kvm,
+ const struct kvm_userspace_memory_region *mem)
{
int r;
+ if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
+ return -EINVAL;
+
mutex_lock(&kvm->slots_lock);
r = __kvm_set_memory_region(kvm, mem);
mutex_unlock(&kvm->slots_lock);
return r;
}
-EXPORT_SYMBOL_GPL(kvm_set_memory_region);
static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
struct kvm_userspace_memory_region *mem)
{
- if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
- return -EINVAL;
-
return kvm_set_memory_region(kvm, mem);
}
--
2.31.1
Instead of looking at the active memslot list, look at the inactive one.
This causes no harm to the current code, as the active and inactive
lists are identical at this point.
In addition, it provides flexibility for atomic memslot
updates, because in that case we want to perform multiple
updates on the inactive list first, and then perform only a
single swap. If we were to use the active list, previous updates
that were not yet swapped would not be seen, and the following logic
in kvm_prepare_batch() could for example find an old memslot
that was deleted in the inactive list but not yet in the active one, thus
wrongly treating the incoming request as a MOVE instead of a CREATE.
Note that this also causes no harm to the invalid memslot, since
we already insert it as a replacement in both the active and inactive
lists.
Signed-off-by: Emanuele Giuseppe Esposito <[email protected]>
---
virt/kvm/kvm_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 31e46f9a06fa..ecd43560281c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1948,7 +1948,7 @@ static int kvm_prepare_batch(struct kvm *kvm,
as_id = mem->slot >> 16;
id = (u16)mem->slot;
- slots = __kvm_memslots(kvm, as_id);
+ slots = kvm_get_inactive_memslots(kvm, as_id);
/*
* Note, the old memslot (and the pointer itself!) may be invalidated
--
2.31.1
At this point this is also just a split, but it will later handle atomic
memslot updates (thus avoiding swapping the memslot lists every time).
No functional change intended.
Signed-off-by: Emanuele Giuseppe Esposito <[email protected]>
---
virt/kvm/kvm_main.c | 37 ++++++++++++++++++++++++++++++++-----
1 file changed, 32 insertions(+), 5 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e4fab15d0d4b..17f07546d591 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1790,12 +1790,15 @@ static void kvm_update_flags_memslot(struct kvm *kvm,
kvm_activate_memslot(kvm, old, new);
}
-static int kvm_set_memslot(struct kvm *kvm,
- struct kvm_internal_memory_region_list *batch)
+/*
+ * Takes kvm->slots_arch_lock, and releases it only if
+ * invalid_slot allocation or kvm_prepare_memory_region failed.
+ */
+static int kvm_prepare_memslot(struct kvm *kvm,
+ struct kvm_internal_memory_region_list *batch)
{
struct kvm_memory_slot *invalid_slot;
struct kvm_memory_slot *old = batch->old;
- struct kvm_memory_slot *new = batch->new;
enum kvm_mr_change change = batch->change;
int r;
@@ -1829,7 +1832,8 @@ static int kvm_set_memslot(struct kvm *kvm,
* invalidation needs to be reverted.
*/
if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
- invalid_slot = kzalloc(sizeof(*invalid_slot), GFP_KERNEL_ACCOUNT);
+ invalid_slot = kzalloc(sizeof(*invalid_slot),
+ GFP_KERNEL_ACCOUNT);
if (!invalid_slot) {
mutex_unlock(&kvm->slots_arch_lock);
return -ENOMEM;
@@ -1847,13 +1851,24 @@ static int kvm_set_memslot(struct kvm *kvm,
* release slots_arch_lock.
*/
if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
+ /* kvm_activate_memslot releases kvm->slots_arch_lock */
kvm_activate_memslot(kvm, invalid_slot, old);
kfree(invalid_slot);
} else {
mutex_unlock(&kvm->slots_arch_lock);
}
- return r;
}
+ return r;
+}
+
+/* Must be called with kvm->slots_arch_lock held, but releases it. */
+static void kvm_finish_memslot(struct kvm *kvm,
+ struct kvm_internal_memory_region_list *batch)
+{
+ struct kvm_memory_slot *invalid_slot = batch->invalid;
+ struct kvm_memory_slot *old = batch->old;
+ struct kvm_memory_slot *new = batch->new;
+ enum kvm_mr_change change = batch->change;
/*
* For DELETE and MOVE, the working slot is now active as the INVALID
@@ -1883,6 +1898,18 @@ static int kvm_set_memslot(struct kvm *kvm,
* responsible for knowing that new->arch may be stale.
*/
kvm_commit_memory_region(kvm, batch);
+}
+
+static int kvm_set_memslot(struct kvm *kvm,
+ struct kvm_internal_memory_region_list *batch)
+{
+ int r;
+
+ r = kvm_prepare_memslot(kvm, batch);
+ if (r)
+ return r;
+
+ kvm_finish_memslot(kvm, batch);
return 0;
}
--
2.31.1
When kvm_vm_ioctl_set_memory_region_list() is invoked, we need
to make sure that all memslots are updated in the inactive list
and that the lists are then swapped (preferably only once), so that
all changes become visible at once.
The only complication is that DELETE and MOVE need to perform 2 swaps:
first replace the old memslot with an invalid one, and then remove the
invalid one. Therefore we perform two passes over the memslot list
provided by the ioctl: the first pass handles only the DELETE and MOVE
requests, replacing the old memslot with an invalid one. This implies a
swap for each such memslot (the swaps could also be grouped into a
single one, but for simplicity we do not do that).
Userspace already marks the DELETE and MOVE requests with the flag
invalidate_slot == 1.
The second pass then scans the list again and updates the inactive
memslot list. Once the inactive memslot list is ready, it is swapped
only once per address space, and the memslot handling is concluded.
Regarding kvm->slots_arch_lock: it is always taken in
kvm_prepare_memslot, but depending on batch->change and on which loop
(first, second, conclusion) of kvm_vm_ioctl_set_memory_region_list we
are in, it is released in different places:
- change = DELETE or MOVE. In the first pass, we release after creating
the invalid memslot and swapping.
- Second loop in kvm_vm_ioctl_set_memory_region_list.
We release it in kvm_set_memslot since batch->batch is true.
- Third loop in kvm_vm_ioctl_set_memory_region_list.
We take and release it for each swap.
- Call from kvm_vm_ioctl_set_memory_region: as before this patch,
acquire in kvm_prepare_memslot and release in kvm_swap_active_memslots.
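
From the userspace side, the case that exercises both passes is an atomic
"replace": a DELETE (marked with invalidate_slot = 1) batched together with
the CREATE of the new slot, so that vCPUs never see the window where the old
slot is gone but the new one does not exist yet. A rough sketch, with VM
setup and error handling omitted (the helper name is made up):

#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Atomically delete slot `id` and re-create it at a new userspace address. */
static int replace_region(int vm_fd, __u32 id, __u64 gpa, __u64 size,
			  __u64 new_uaddr)
{
	struct kvm_userspace_memory_region_list *list;
	int r;

	list = calloc(1, sizeof(*list) + 2 * sizeof(list->entries[0]));
	if (!list)
		return -1;
	list->nent = 2;

	/* DELETE: size 0; invalidate_slot tells KVM to pre-process it (first pass). */
	list->entries[0].slot = id;
	list->entries[0].memory_size = 0;
	list->entries[0].invalidate_slot = 1;

	/* CREATE: staged in the second pass, published by the final swap. */
	list->entries[1].slot = id;
	list->entries[1].guest_phys_addr = gpa;
	list->entries[1].memory_size = size;
	list->entries[1].userspace_addr = new_uaddr;

	r = ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION_LIST, list);
	free(list);
	return r;
}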
Signed-off-by: Emanuele Giuseppe Esposito <[email protected]>
---
include/linux/kvm_host.h | 4 +
virt/kvm/kvm_main.c | 185 ++++++++++++++++++++++++++++++++++++---
2 files changed, 179 insertions(+), 10 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 69af94472b39..753650145836 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1118,6 +1118,10 @@ struct kvm_internal_memory_region_list {
struct kvm_memory_slot *new;
struct kvm_memory_slot *invalid;
enum kvm_mr_change change;
+
+ /* Fields that need to be set by the caller */
+ bool batch;
+ bool is_move_delete;
};
int __kvm_set_memory_region(struct kvm *kvm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ecd43560281c..85ba05924f0c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1782,7 +1782,8 @@ static void kvm_update_flags_memslot(struct kvm *kvm,
/*
* Takes kvm->slots_arch_lock, and releases it only if
- * invalid_slot allocation or kvm_prepare_memory_region failed.
+ * invalid_slot allocation, kvm_prepare_memory_region failed
+ * or batch->is_move_delete is true.
*/
static int kvm_prepare_memslot(struct kvm *kvm,
struct kvm_internal_memory_region_list *batch)
@@ -1822,15 +1823,26 @@ static int kvm_prepare_memslot(struct kvm *kvm,
* invalidation needs to be reverted.
*/
if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
- invalid_slot = kzalloc(sizeof(*invalid_slot),
+ if (!batch->invalid) {
+ invalid_slot = kzalloc(sizeof(*invalid_slot),
GFP_KERNEL_ACCOUNT);
- if (!invalid_slot) {
+ if (!invalid_slot) {
+ mutex_unlock(&kvm->slots_arch_lock);
+ return -ENOMEM;
+ }
+ batch->invalid = invalid_slot;
+ kvm_invalidate_memslot(kvm, old, invalid_slot);
+ kvm_replace_memslot(kvm, old, invalid_slot);
+ }
+
+ /*
+ * We are in the first pass of kvm_vm_ioctl_set_memory_region_list();
+ * just take care of making the old slot invalid and visible.
+ */
+ if (batch->is_move_delete) {
mutex_unlock(&kvm->slots_arch_lock);
- return -ENOMEM;
+ return 0;
}
- batch->invalid = invalid_slot;
- kvm_invalidate_memslot(kvm, old, invalid_slot);
- kvm_replace_memslot(kvm, old, invalid_slot);
}
r = kvm_prepare_memory_region(kvm, batch);
@@ -1896,6 +1908,13 @@ static int kvm_set_memslot(struct kvm *kvm,
{
int r, as_id;
+ /*
+ * First, prepare the memslot. If DELETE or MOVE, take care of creating
+ * the invalid slot and use it to replace the old one.
+ * In order to keep things simple, allow each single update
+ * to be immediately visible by swapping the active and inactive
+ * memory slot arrays.
+ */
r = kvm_prepare_memslot(kvm, batch);
if (r)
return r;
@@ -1910,6 +1929,12 @@ static int kvm_set_memslot(struct kvm *kvm,
else
kvm_replace_memslot(kvm, batch->old, batch->new);
+ /* Caller has to manually commit changes afterwards */
+ if (batch->batch) {
+ mutex_unlock(&kvm->slots_arch_lock);
+ return r;
+ }
+
/* either old or invalid is the same, since invalid is old's copy */
as_id = kvm_memslots_get_as_id(batch->old, batch->new);
@@ -2083,9 +2108,11 @@ int __kvm_set_memory_region(struct kvm *kvm,
{
int r;
- r = kvm_check_memory_region(kvm, mem, batch);
- if (r)
- return r;
+ if (!batch->old && !batch->new && !batch->invalid) {
+ r = kvm_check_memory_region(kvm, mem, batch);
+ if (r)
+ return r;
+ }
return kvm_set_memslot(kvm, batch);
}
@@ -2117,6 +2144,133 @@ static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
return kvm_set_memory_region(kvm, mem, &batch);
}
+static int kvm_vm_ioctl_set_memory_region_list(struct kvm *kvm,
+ struct kvm_userspace_memory_region_list *list,
+ struct kvm_userspace_memory_region_entry __user *mem_arg)
+{
+ struct kvm_userspace_memory_region_entry *mem, *m_iter;
+ struct kvm_userspace_memory_region *mem_region;
+ struct kvm_internal_memory_region_list *batch, *b_iter;
+ int i, r = 0;
+ bool *as_to_swap;
+
+ /* TODO: limit the number of mem to a max? */
+
+ if (!list->nent)
+ return r;
+
+ mem = vmemdup_user(mem_arg, array_size(sizeof(*mem), list->nent));
+ if (IS_ERR(mem)) {
+ r = PTR_ERR(mem);
+ goto out;
+ }
+
+ batch = kcalloc(list->nent, sizeof(*batch), GFP_KERNEL_ACCOUNT);
+ if (!batch) {
+ r = -ENOMEM;
+ goto out2;
+ }
+
+ as_to_swap = kcalloc(KVM_ADDRESS_SPACE_NUM, sizeof(bool),
+ GFP_KERNEL_ACCOUNT);
+ if (!as_to_swap) {
+ r = -ENOMEM;
+ goto out3;
+ }
+
+ m_iter = mem;
+ b_iter = batch;
+ /*
+ * First pass: handle all DELETE and MOVE requests
+ * (since they swap active and inactive memslots)
+ * by replacing the old memslot with an invalid one.
+ */
+ mutex_lock(&kvm->slots_lock);
+ for (i = 0; i < list->nent; i++) {
+
+ if ((u16)m_iter->slot >= KVM_USER_MEM_SLOTS) {
+ r = -EINVAL;
+ mutex_unlock(&kvm->slots_lock);
+ goto out4;
+ }
+
+ if (!m_iter->invalidate_slot) {
+ m_iter++;
+ b_iter++;
+ continue;
+ }
+
+ mem_region = (struct kvm_userspace_memory_region *) m_iter;
+ r = kvm_check_memory_region(kvm, mem_region, b_iter);
+ if (r) {
+ mutex_unlock(&kvm->slots_lock);
+ goto out4;
+ }
+
+ WARN_ON(b_iter->change != KVM_MR_DELETE &&
+ b_iter->change != KVM_MR_MOVE);
+
+ b_iter->is_move_delete = true;
+ r = kvm_prepare_memslot(kvm, b_iter);
+ b_iter->is_move_delete = false;
+ if (r < 0) {
+ mutex_unlock(&kvm->slots_lock);
+ goto out4;
+ }
+ m_iter++;
+ b_iter++;
+ }
+ mutex_unlock(&kvm->slots_lock);
+
+
+ m_iter = mem;
+ b_iter = batch;
+ /*
+ * Second pass: handle all other request types
+ * but do not swap the memslots array yet.
+ */
+ for (i = 0; i < list->nent; i++) {
+ b_iter->batch = true;
+ as_to_swap[m_iter->slot >> 16] = true;
+ mem_region = (struct kvm_userspace_memory_region *) m_iter;
+ r = kvm_set_memory_region(kvm, mem_region, b_iter);
+ if (r < 0)
+ goto out4;
+ m_iter++;
+ b_iter++;
+ }
+
+ /*
+ * Conclude by swapping the memslot lists and updating the inactive set too.
+ */
+ b_iter = batch;
+ mutex_lock(&kvm->slots_lock);
+ mutex_lock(&kvm->slots_arch_lock);
+ for (i = 0; i < list->nent; i++) {
+ int as_id;
+
+ as_id = kvm_memslots_get_as_id(b_iter->old, b_iter->new);
+ if (as_to_swap[as_id]) {
+ kvm_swap_active_memslots(kvm, as_id);
+ mutex_lock(&kvm->slots_arch_lock);
+ as_to_swap[as_id] = false;
+ }
+
+ kvm_finish_memslot(kvm, b_iter);
+ b_iter++;
+ }
+ mutex_unlock(&kvm->slots_arch_lock);
+ mutex_unlock(&kvm->slots_lock);
+
+out4:
+ kfree(as_to_swap);
+out3:
+ kfree(batch);
+out2:
+ kvfree(mem);
+out:
+ return r;
+}
+
#ifndef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT
/**
* kvm_get_dirty_log - get a snapshot of dirty pages
@@ -4732,6 +4886,17 @@ static long kvm_vm_ioctl(struct file *filp,
r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
break;
}
+ case KVM_SET_USER_MEMORY_REGION_LIST: {
+ struct kvm_userspace_memory_region_list __user *mem_arg = argp;
+ struct kvm_userspace_memory_region_list mem;
+
+ r = -EFAULT;
+ if (copy_from_user(&mem, mem_arg, sizeof(mem)))
+ goto out;
+
+ r = kvm_vm_ioctl_set_memory_region_list(kvm, &mem, mem_arg->entries);
+ break;
+ }
case KVM_GET_DIRTY_LOG: {
struct kvm_dirty_log log;
--
2.31.1
On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
> KVM is currently capable of receiving a single memslot update through
> the KVM_SET_USER_MEMORY_REGION ioctl.
> The problem arises when we want to atomically perform multiple updates,
> so that readers of memslot active list avoid seeing incomplete states.
>
> For example, in RHBZ https://bugzilla.redhat.com/show_bug.cgi?id=1979276
I don't have access. Can you provide a TL;DR?
> we see how non atomic updates cause boot failure, because vcpus
> will se a partial update (old memslot delete, new one not yet created)
> and will crash.
Why not simply pause vCPUs in this scenario? This is an awful lot of a complexity
to take on for something that appears to be solvable in userspace.
And if the issue is related to KVM disallowing the toggling of read-only (can't see
the bug), we can likely solve that without needing a new ioctl() that allows
userspace to batch an arbitrary number of updates.
On 9/9/2022 6:45 PM, Emanuele Giuseppe Esposito wrote:
> Just a function split. No functional change intended,
> except for the fact that kvm_prepare_batch() does not
> immediately call kvm_set_memslot() if batch->change is
> KVM_MR_DELETE, but delegates the caller (__kvm_set_memory_region).
>
>
> +static int kvm_check_memory_region(struct kvm *kvm,
> + const struct kvm_userspace_memory_region *mem,
> + struct kvm_internal_memory_region_list *batch)
> +{
> + int r;
> +
> + r = check_memory_region_flags(mem);
> + if (r)
> + return r;
>
> - r = kvm_set_memslot(kvm, batch);
> + r = kvm_check_mem(mem);
> if (r)
> - kfree(new);
> + return r;
> +
> + r = kvm_prepare_batch(kvm, mem, batch);
> + if (r && batch->new)
> + kfree(batch->new);
From the patch, r !=0 and batch->new !=NULL are exclusive, so free()
here is not reachable.
> +
> return r;
> }
[...]
>
On 9/9/2022 6:45 PM, Emanuele Giuseppe Esposito wrote:
> When kvm_vm_ioctl_set_memory_region_list() is invoked, we need
> to make sure that all memslots are updated in the inactive list
> and then swap (preferreably only once) the lists, so that all
> changes are visible immediately.
[...]
> +static int kvm_vm_ioctl_set_memory_region_list(struct kvm *kvm,
> + struct kvm_userspace_memory_region_list *list,
> + struct kvm_userspace_memory_region_entry __user *mem_arg)
> +{
> + struct kvm_userspace_memory_region_entry *mem, *m_iter;
> + struct kvm_userspace_memory_region *mem_region;
> + struct kvm_internal_memory_region_list *batch, *b_iter;
> + int i, r = 0;
> + bool *as_to_swap;
> +
> + /* TODO: limit the number of mem to a max? */
> +
> + if (!list->nent)
> + return r;
> +
> + mem = vmemdup_user(mem_arg, array_size(sizeof(*mem), list->nent));
> + if (IS_ERR(mem)) {
> + r = PTR_ERR(mem);
> + goto out;
> + }
IMO, it's more natural to dup the user memory at the first place, i.e.,
kvm_vm_ioctl,
it also makes the outlets shorter.
[...]
On 13/09/2022 at 04:30, Yang, Weijiang wrote:
>
> On 9/9/2022 6:45 PM, Emanuele Giuseppe Esposito wrote:
>> When kvm_vm_ioctl_set_memory_region_list() is invoked, we need
>> to make sure that all memslots are updated in the inactive list
>> and then swap (preferreably only once) the lists, so that all
>> changes are visible immediately.
> [...]
>> +static int kvm_vm_ioctl_set_memory_region_list(struct kvm *kvm,
>> + struct kvm_userspace_memory_region_list *list,
>> + struct kvm_userspace_memory_region_entry __user *mem_arg)
>> +{
>> + struct kvm_userspace_memory_region_entry *mem, *m_iter;
>> + struct kvm_userspace_memory_region *mem_region;
>> + struct kvm_internal_memory_region_list *batch, *b_iter;
>> + int i, r = 0;
>> + bool *as_to_swap;
>> +
>> + /* TODO: limit the number of mem to a max? */
>> +
>> + if (!list->nent)
>> + return r;
>> +
>> + mem = vmemdup_user(mem_arg, array_size(sizeof(*mem), list->nent));
>> + if (IS_ERR(mem)) {
>> + r = PTR_ERR(mem);
>> + goto out;
>> + }
>
> IMO, it's more natural to dup the user memory at the first place, i.e.,
> kvm_vm_ioctl,
>
> it also makes the outlets shorter.
>
I followed the same pattern as kvm_vcpu_ioctl_set_cpuid2, which performs
the user memory dup inside the call :)
I see your point but I guess it's better to keep all ioctl
implementations similar.
Thank you,
Emanuele
On 13/09/2022 at 04:56, Yang, Weijiang wrote:
>
> On 9/9/2022 6:45 PM, Emanuele Giuseppe Esposito wrote:
>> Just a function split. No functional change intended,
>> except for the fact that kvm_prepare_batch() does not
>> immediately call kvm_set_memslot() if batch->change is
>> KVM_MR_DELETE, but delegates the caller (__kvm_set_memory_region).
>>
>>
>> +static int kvm_check_memory_region(struct kvm *kvm,
>> + const struct kvm_userspace_memory_region *mem,
>> + struct kvm_internal_memory_region_list *batch)
>> +{
>> + int r;
>> +
>> + r = check_memory_region_flags(mem);
>> + if (r)
>> + return r;
>> - r = kvm_set_memslot(kvm, batch);
>> + r = kvm_check_mem(mem);
>> if (r)
>> - kfree(new);
>> + return r;
>> +
>> + r = kvm_prepare_batch(kvm, mem, batch);
>> + if (r && batch->new)
>> + kfree(batch->new);
> From the patch, r !=0 and batch->new !=NULL are exclusive, so free()
> here is not reachable.
Good point, I'll get rid of this.
Thank you,
Emanuele
>> +
>> return r;
>> }
> [...]
>>
>
On 09/09/2022 at 16:30, Sean Christopherson wrote:
> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>> KVM is currently capable of receiving a single memslot update through
>> the KVM_SET_USER_MEMORY_REGION ioctl.
>> The problem arises when we want to atomically perform multiple updates,
>> so that readers of memslot active list avoid seeing incomplete states.
>>
>> For example, in RHBZ https://bugzilla.redhat.com/show_bug.cgi?id=1979276
>
> I don't have access. Can you provide a TL;DR?
You should be able to have access to it now.
>
>> we see how non atomic updates cause boot failure, because vcpus
>> will se a partial update (old memslot delete, new one not yet created)
>> and will crash.
>
> Why not simply pause vCPUs in this scenario? This is an awful lot of a complexity
> to take on for something that appears to be solvable in userspace.
>
I think it is not that easy to solve in userspace: see
https://lore.kernel.org/qemu-devel/[email protected]/
"Using pause_all_vcpus()/resume_all_vcpus() is not possible, as it will
temporarily drop the BQL - something most callers can't handle (esp.
when called from vcpu context e.g., in virtio code)."
Probably @Paolo and @Maxim can add more to this.
Thank you,
Emanuele
On 19/9/2022 12:13 am, Emanuele Giuseppe Esposito wrote:
>
> Am 09/09/2022 um 16:30 schrieb Sean Christopherson:
>> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>>> KVM is currently capable of receiving a single memslot update through
>>> the KVM_SET_USER_MEMORY_REGION ioctl.
>>> The problem arises when we want to atomically perform multiple updates,
>>> so that readers of memslot active list avoid seeing incomplete states.
>>>
>>> For example, in RHBZhttps://bugzilla.redhat.com/show_bug.cgi?id=1979276
Oh, thanks for stepping up to try to address it.
As it turns out, this issue was discovered "long" before:
https://bugzilla.kernel.org/show_bug.cgi?id=213781
As a comment, relevant selftests are necessary and required.
On 18.09.22 18:13, Emanuele Giuseppe Esposito wrote:
>
>
> Am 09/09/2022 um 16:30 schrieb Sean Christopherson:
>> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>>> KVM is currently capable of receiving a single memslot update through
>>> the KVM_SET_USER_MEMORY_REGION ioctl.
>>> The problem arises when we want to atomically perform multiple updates,
>>> so that readers of memslot active list avoid seeing incomplete states.
>>>
>>> For example, in RHBZ https://bugzilla.redhat.com/show_bug.cgi?id=1979276
>>
>> I don't have access. Can you provide a TL;DR?
>
> You should be able to have access to it now.
>
>>
>>> we see how non atomic updates cause boot failure, because vcpus
>>> will se a partial update (old memslot delete, new one not yet created)
>>> and will crash.
>>
>> Why not simply pause vCPUs in this scenario? This is an awful lot of a complexity
>> to take on for something that appears to be solvable in userspace.
>>
>
> I think it is not that easy to solve in userspace: see
> https://lore.kernel.org/qemu-devel/[email protected]/
>
>
> "Using pause_all_vcpus()/resume_all_vcpus() is not possible, as it will
> temporarily drop the BQL - something most callers can't handle (esp.
> when called from vcpu context e.g., in virtio code)."
Can you please comment on the bigger picture? The patch from me works
around *exactly that*, and for that reason, contains that comment.
--
Thanks,
David / dhildenb
On 19.09.22 09:53, David Hildenbrand wrote:
> On 18.09.22 18:13, Emanuele Giuseppe Esposito wrote:
>>
>>
>> Am 09/09/2022 um 16:30 schrieb Sean Christopherson:
>>> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>>>> KVM is currently capable of receiving a single memslot update through
>>>> the KVM_SET_USER_MEMORY_REGION ioctl.
>>>> The problem arises when we want to atomically perform multiple updates,
>>>> so that readers of memslot active list avoid seeing incomplete states.
>>>>
>>>> For example, in RHBZ https://bugzilla.redhat.com/show_bug.cgi?id=1979276
>>>
>>> I don't have access. Can you provide a TL;DR?
>>
>> You should be able to have access to it now.
>>
>>>
>>>> we see how non atomic updates cause boot failure, because vcpus
>>>> will se a partial update (old memslot delete, new one not yet created)
>>>> and will crash.
>>>
>>> Why not simply pause vCPUs in this scenario? This is an awful lot of a complexity
>>> to take on for something that appears to be solvable in userspace.
>>>
>>
>> I think it is not that easy to solve in userspace: see
>> https://lore.kernel.org/qemu-devel/[email protected]/
>>
>>
>> "Using pause_all_vcpus()/resume_all_vcpus() is not possible, as it will
>> temporarily drop the BQL - something most callers can't handle (esp.
>> when called from vcpu context e.g., in virtio code)."
>
> Can you please comment on the bigger picture? The patch from me works
> around *exactly that*, and for that reason, contains that comment.
>
FWIW, I hacked up my RFC to perform atomic updates on any memslot
transactions (not just resizes) where ranges to add overlap with ranges
to remove.
https://github.com/davidhildenbrand/qemu/tree/memslot
I only performed simple boot check under x86-64 (where I can see region
resizes) and some make checks -- pretty sure it has some rough edges;
but should indicate what's possible and what the possible price might
be. [one could wire up a new KVM ioctl and call it conditionally on
support if really required]
--
Thanks,
David / dhildenb
On 23.09.22 15:10, Emanuele Giuseppe Esposito wrote:
>
>
> Am 19/09/2022 um 19:30 schrieb David Hildenbrand:
>> On 19.09.22 09:53, David Hildenbrand wrote:
>>> On 18.09.22 18:13, Emanuele Giuseppe Esposito wrote:
>>>>
>>>>
>>>> Am 09/09/2022 um 16:30 schrieb Sean Christopherson:
>>>>> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>>>>>> KVM is currently capable of receiving a single memslot update through
>>>>>> the KVM_SET_USER_MEMORY_REGION ioctl.
>>>>>> The problem arises when we want to atomically perform multiple
>>>>>> updates,
>>>>>> so that readers of memslot active list avoid seeing incomplete states.
>>>>>>
>>>>>> For example, in RHBZ
>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1979276
>>>>>
>>>>> I don't have access. Can you provide a TL;DR?
>>>>
>>>> You should be able to have access to it now.
>>>>
>>>>>
>>>>>> we see how non atomic updates cause boot failure, because vcpus
>>>>>> will se a partial update (old memslot delete, new one not yet created)
>>>>>> and will crash.
>>>>>
>>>>> Why not simply pause vCPUs in this scenario? This is an awful lot
>>>>> of a complexity
>>>>> to take on for something that appears to be solvable in userspace.
>>>>>
>>>>
>>>> I think it is not that easy to solve in userspace: see
>>>> https://lore.kernel.org/qemu-devel/[email protected]/
>>>>
>>>>
>>>>
>>>> "Using pause_all_vcpus()/resume_all_vcpus() is not possible, as it will
>>>> temporarily drop the BQL - something most callers can't handle (esp.
>>>> when called from vcpu context e.g., in virtio code)."
>>>
>>> Can you please comment on the bigger picture? The patch from me works
>>> around *exactly that*, and for that reason, contains that comment.
>>>
>>
>> FWIW, I hacked up my RFC to perform atomic updates on any memslot
>> transactions (not just resizes) where ranges do add overlap with ranges
>> to remove.
>>
>> https://github.com/davidhildenbrand/qemu/tree/memslot
>>
>>
>> I only performed simple boot check under x86-64 (where I can see region
>> resizes) and some make checks -- pretty sure it has some rough edges;
>> but should indicate what's possible and what the possible price might
>> be. [one could wire up a new KVM ioctl and call it conditionally on
>> support if really required]
>>
>
> A benefit of my ioctl implementation is that could be also used by other
> hypervisors, which then do not need to share this kind of "hack".
> However, after also talking with Maxim and Paolo, we all agreed that the
> main disadvantage of your approach is that is not scalable with the
> number of vcpus. It is also inferior to stop *all* vcpus just to allow a
> memslot update (KVM only pauses vCPUs that access the modified memslots
> instead).
>
> So I took some measurements, to see what is the performance difference
> between my implementation and yours. I used a machine where I could
> replicate the bug mentioned in bugzilla, an AMD EPYC 7413 24-Core
> Processor with kernel 5.19.0 (+ my patches).
>
> Then I measured the time it takes that QEMU spends in kvm_commit (ie in
> memslot updates) while booting a VM. In other words, if kvm_commit takes
> 10 ms and QEMU calls it 20 times, "time to boot" is 200ms. kvm_commit is
> not called anymore after boot, so this measurement is easy to compare
> over multiple invocations of QEMU.
>
> I ran the tests with different amount of cores: 1,2,4,8,16,32. QEMU
> command is the same to replicate the bug:
> ./qemu-system-x86_64 --overcommit cpu-pm=on --smp $v --accel kvm
> --display none >> ~/smp_$v;
>
> Each boot is reproduced 100 times, and then from results I measure
> average and stddev (in milliseconds).
>
> ioctl:
> -smp 1: Average: 2.1ms Stdev: 0.8ms
> -smp 2: Average: 2.5ms Stdev: 1.5ms
> -smp 4: Average: 2.2ms Stdev: 1.1ms
> -smp 8: Average: 2.4ms Stdev: 0.7ms
> -smp 16: Average: 3.6ms Stdev: 2.4ms (1000 repetitions)
> -smp 24: Average: 12.5ms Stdev: 0.9ms (1000 repetitions)
>
>
> pause/resume: (https://github.com/davidhildenbrand/qemu/tree/memslot)
> -smp 1: Average: 2.2ms Stdev: 1.2ms
> -smp 2: Average: 3.0ms Stdev: 1.4ms
> -smp 4: Average: 3.1ms Stdev: 1.3m
> -smp 8: Average: 3.4ms Stdev: 1.4ms
> -smp 16: Average: 12ms Stdev: 7.0ms (1000 repetitions)
> -smp 24: Average: 20ms Stdev: 7.3ms (1000 repetitions)
>
>
> Above 24 vCPUs performance gets worse quickly but I think it's already
> quite clear that the results for ioctl scale better as the number of
> vcpus increases, while pausing the vCPUs becomes slower already with 16
> vcpus.
Right, the question is if it happens often enough that we even
care and if there are no ways to mitigate.
It doesn't necessarily have to scale with the #VCPUs I think. What
should dominate the overall time, in theory, is how long it takes for one
VCPU (the slowest one) to leave the kernel.
I wondered if
1) it might be easier to have a single KVM mechanism/call to kick all
VCPUs out of KVM instead of doing it per VCPU. That might speed up
things eventually heavily already.
2) One thing I wondered is whether the biggest overhead is actually
taking the locks in QEMU and not actually waiting for the VCPUs. Maybe
we could optimize that as well. (for now I use one lock per VCPU because
it felt like it would reduce the ioctl overhead; maybe there is a better
alternative to balance between both users)
So treat my patch as a completely unoptimized version.
--
Thanks,
David / dhildenb
On 19/09/2022 at 19:30, David Hildenbrand wrote:
> On 19.09.22 09:53, David Hildenbrand wrote:
>> On 18.09.22 18:13, Emanuele Giuseppe Esposito wrote:
>>>
>>>
>>> Am 09/09/2022 um 16:30 schrieb Sean Christopherson:
>>>> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>>>>> KVM is currently capable of receiving a single memslot update through
>>>>> the KVM_SET_USER_MEMORY_REGION ioctl.
>>>>> The problem arises when we want to atomically perform multiple
>>>>> updates,
>>>>> so that readers of memslot active list avoid seeing incomplete states.
>>>>>
>>>>> For example, in RHBZ
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1979276
>>>>
>>>> I don't have access. Can you provide a TL;DR?
>>>
>>> You should be able to have access to it now.
>>>
>>>>
>>>>> we see how non atomic updates cause boot failure, because vcpus
>>>>> will se a partial update (old memslot delete, new one not yet created)
>>>>> and will crash.
>>>>
>>>> Why not simply pause vCPUs in this scenario? This is an awful lot
>>>> of a complexity
>>>> to take on for something that appears to be solvable in userspace.
>>>>
>>>
>>> I think it is not that easy to solve in userspace: see
>>> https://lore.kernel.org/qemu-devel/[email protected]/
>>>
>>>
>>>
>>> "Using pause_all_vcpus()/resume_all_vcpus() is not possible, as it will
>>> temporarily drop the BQL - something most callers can't handle (esp.
>>> when called from vcpu context e.g., in virtio code)."
>>
>> Can you please comment on the bigger picture? The patch from me works
>> around *exactly that*, and for that reason, contains that comment.
>>
>
> FWIW, I hacked up my RFC to perform atomic updates on any memslot
> transactions (not just resizes) where ranges do add overlap with ranges
> to remove.
>
> https://github.com/davidhildenbrand/qemu/tree/memslot
>
>
> I only performed simple boot check under x86-64 (where I can see region
> resizes) and some make checks -- pretty sure it has some rough edges;
> but should indicate what's possible and what the possible price might
> be. [one could wire up a new KVM ioctl and call it conditionally on
> support if really required]
>
A benefit of my ioctl implementation is that it could also be used by other
hypervisors, which then do not need to share this kind of "hack".
However, after also talking with Maxim and Paolo, we all agreed that the
main disadvantage of your approach is that it is not scalable with the
number of vcpus. It is also inferior to stop *all* vcpus just to allow a
memslot update (KVM only pauses vCPUs that access the modified memslots
instead).
So I took some measurements, to see what is the performance difference
between my implementation and yours. I used a machine where I could
replicate the bug mentioned in bugzilla, an AMD EPYC 7413 24-Core
Processor with kernel 5.19.0 (+ my patches).
Then I measured the time that QEMU spends in kvm_commit (i.e. in
memslot updates) while booting a VM. In other words, if kvm_commit takes
10 ms and QEMU calls it 20 times, "time to boot" is 200ms. kvm_commit is
not called anymore after boot, so this measurement is easy to compare
over multiple invocations of QEMU.
I ran the tests with a different number of cores: 1,2,4,8,16,32. The QEMU
command line is the same one used to replicate the bug:
./qemu-system-x86_64 --overcommit cpu-pm=on --smp $v --accel kvm
--display none >> ~/smp_$v;
Each boot is repeated 100 times, and then from the results I compute the
average and stddev (in milliseconds).
ioctl:
-smp 1: Average: 2.1ms Stdev: 0.8ms
-smp 2: Average: 2.5ms Stdev: 1.5ms
-smp 4: Average: 2.2ms Stdev: 1.1ms
-smp 8: Average: 2.4ms Stdev: 0.7ms
-smp 16: Average: 3.6ms Stdev: 2.4ms (1000 repetitions)
-smp 24: Average: 12.5ms Stdev: 0.9ms (1000 repetitions)
pause/resume: (https://github.com/davidhildenbrand/qemu/tree/memslot)
-smp 1: Average: 2.2ms Stdev: 1.2ms
-smp 2: Average: 3.0ms Stdev: 1.4ms
-smp 4: Average: 3.1ms Stdev: 1.3m
-smp 8: Average: 3.4ms Stdev: 1.4ms
-smp 16: Average: 12ms Stdev: 7.0ms (1000 repetitions)
-smp 24: Average: 20ms Stdev: 7.3ms (1000 repetitions)
Above 24 vCPUs performance gets worse quickly but I think it's already
quite clear that the results for ioctl scale better as the number of
vcpus increases, while pausing the vCPUs becomes slower already with 16
vcpus.
Thank you,
Emanuele
On 23/09/2022 at 15:21, David Hildenbrand wrote:
> On 23.09.22 15:10, Emanuele Giuseppe Esposito wrote:
>>
>>
>> Am 19/09/2022 um 19:30 schrieb David Hildenbrand:
>>> On 19.09.22 09:53, David Hildenbrand wrote:
>>>> On 18.09.22 18:13, Emanuele Giuseppe Esposito wrote:
>>>>>
>>>>>
>>>>> Am 09/09/2022 um 16:30 schrieb Sean Christopherson:
>>>>>> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>>>>>>> KVM is currently capable of receiving a single memslot update
>>>>>>> through
>>>>>>> the KVM_SET_USER_MEMORY_REGION ioctl.
>>>>>>> The problem arises when we want to atomically perform multiple
>>>>>>> updates,
>>>>>>> so that readers of memslot active list avoid seeing incomplete
>>>>>>> states.
>>>>>>>
>>>>>>> For example, in RHBZ
>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1979276
>>>>>>
>>>>>> I don't have access. Can you provide a TL;DR?
>>>>>
>>>>> You should be able to have access to it now.
>>>>>
>>>>>>
>>>>>>> we see how non atomic updates cause boot failure, because vcpus
>>>>>>> will se a partial update (old memslot delete, new one not yet
>>>>>>> created)
>>>>>>> and will crash.
>>>>>>
>>>>>> Why not simply pause vCPUs in this scenario? This is an awful lot
>>>>>> of a complexity
>>>>>> to take on for something that appears to be solvable in userspace.
>>>>>>
>>>>>
>>>>> I think it is not that easy to solve in userspace: see
>>>>> https://lore.kernel.org/qemu-devel/[email protected]/
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> "Using pause_all_vcpus()/resume_all_vcpus() is not possible, as it
>>>>> will
>>>>> temporarily drop the BQL - something most callers can't handle (esp.
>>>>> when called from vcpu context e.g., in virtio code)."
>>>>
>>>> Can you please comment on the bigger picture? The patch from me works
>>>> around *exactly that*, and for that reason, contains that comment.
>>>>
>>>
>>> FWIW, I hacked up my RFC to perform atomic updates on any memslot
>>> transactions (not just resizes) where ranges do add overlap with ranges
>>> to remove.
>>>
>>> https://github.com/davidhildenbrand/qemu/tree/memslot
>>>
>>>
>>> I only performed simple boot check under x86-64 (where I can see region
>>> resizes) and some make checks -- pretty sure it has some rough edges;
>>> but should indicate what's possible and what the possible price might
>>> be. [one could wire up a new KVM ioctl and call it conditionally on
>>> support if really required]
>>>
>>
>> A benefit of my ioctl implementation is that could be also used by other
>> hypervisors, which then do not need to share this kind of "hack".
>> However, after also talking with Maxim and Paolo, we all agreed that the
>> main disadvantage of your approach is that is not scalable with the
>> number of vcpus. It is also inferior to stop *all* vcpus just to allow a
>> memslot update (KVM only pauses vCPUs that access the modified memslots
>> instead).
>>
>> So I took some measurements, to see what is the performance difference
>> between my implementation and yours. I used a machine where I could
>> replicate the bug mentioned in bugzilla, an AMD EPYC 7413 24-Core
>> Processor with kernel 5.19.0 (+ my patches).
>>
>> Then I measured the time it takes that QEMU spends in kvm_commit (ie in
>> memslot updates) while booting a VM. In other words, if kvm_commit takes
>> 10 ms and QEMU calls it 20 times, "time to boot" is 200ms. kvm_commit is
>> not called anymore after boot, so this measurement is easy to compare
>> over multiple invocations of QEMU.
>>
>> I ran the tests with different amount of cores: 1,2,4,8,16,32. QEMU
>> command is the same to replicate the bug:
>> ./qemu-system-x86_64 --overcommit cpu-pm=on --smp $v --accel kvm
>> --display none >> ~/smp_$v;
>>
>> Each boot is reproduced 100 times, and then from results I measure
>> average and stddev (in milliseconds).
>>
>> ioctl:
>> -smp 1: Average: 2.1ms Stdev: 0.8ms
>> -smp 2: Average: 2.5ms Stdev: 1.5ms
>> -smp 4: Average: 2.2ms Stdev: 1.1ms
>> -smp 8: Average: 2.4ms Stdev: 0.7ms
>> -smp 16: Average: 3.6ms Stdev: 2.4ms (1000 repetitions)
>> -smp 24: Average: 12.5ms Stdev: 0.9ms (1000 repetitions)
>>
>>
>> pause/resume: (https://github.com/davidhildenbrand/qemu/tree/memslot)
>> -smp 1: Average: 2.2ms Stdev: 1.2ms
>> -smp 2: Average: 3.0ms Stdev: 1.4ms
>> -smp 4: Average: 3.1ms Stdev: 1.3m
>> -smp 8: Average: 3.4ms Stdev: 1.4ms
>> -smp 16: Average: 12ms Stdev: 7.0ms (1000 repetitions)
>> -smp 24: Average: 20ms Stdev: 7.3ms (1000 repetitions)
>>
>>
>> Above 24 vCPUs performance gets worse quickly but I think it's already
>> quite clear that the results for ioctl scale better as the number of
>> vcpus increases, while pausing the vCPUs becomes slower already with 16
>> vcpus.
>
> Right, the question is if it happens sufficiently enough that we even
> care and if there are not ways to mitigate.
>
> It doesn't necessarily have to scale with the #VCPUs I think. What
> should dominate the overall time in theory how long it takes for one
> VCPU (the slowest one) to leave the kernel.
>
> I wondered if
>
> 1) it might be easier to have a single KVM mechanism/call to kick all
> VCPUs out of KVM instead of doing it per VCPU. That might speed up
> things eventually heavily already.
So if I understand correctly, this implies creating a new ioctl in KVM
anyway? What would then be the difference from what I do? We are
affecting the kernel anyway.
>
> 2) One thing I wondered is whether the biggest overhead is actually
> taking the locks in QEMU and not actually waiting for the VCPUs. Maybe
> we could optimize that as well. (for now I use one lock per VCPU because
> it felt like it would reduce the ioctl overhead; maybe there is a better
> alternative to balance between both users)
>
> So treat my patch as a completely unoptimized version.
>
For what it's worth, my version also performs #invalidate+1 swaps, which
is not optimized.
Honestly, I don't see how the above is easier or simpler than what is
being proposed here.
Thank you,
Emanuele
On 23.09.22 15:38, Emanuele Giuseppe Esposito wrote:
>
>
> Am 23/09/2022 um 15:21 schrieb David Hildenbrand:
>> On 23.09.22 15:10, Emanuele Giuseppe Esposito wrote:
>>>
>>>
>>> Am 19/09/2022 um 19:30 schrieb David Hildenbrand:
>>>> On 19.09.22 09:53, David Hildenbrand wrote:
>>>>> On 18.09.22 18:13, Emanuele Giuseppe Esposito wrote:
>>>>>>
>>>>>>
>>>>>> Am 09/09/2022 um 16:30 schrieb Sean Christopherson:
>>>>>>> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>>>>>>>> KVM is currently capable of receiving a single memslot update
>>>>>>>> through
>>>>>>>> the KVM_SET_USER_MEMORY_REGION ioctl.
>>>>>>>> The problem arises when we want to atomically perform multiple
>>>>>>>> updates,
>>>>>>>> so that readers of memslot active list avoid seeing incomplete
>>>>>>>> states.
>>>>>>>>
>>>>>>>> For example, in RHBZ
>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1979276
>>>>>>>
>>>>>>> I don't have access. Can you provide a TL;DR?
>>>>>>
>>>>>> You should be able to have access to it now.
>>>>>>
>>>>>>>
>>>>>>>> we see how non atomic updates cause boot failure, because vcpus
>>>>>>>> will se a partial update (old memslot delete, new one not yet
>>>>>>>> created)
>>>>>>>> and will crash.
>>>>>>>
>>>>>>> Why not simply pause vCPUs in this scenario? This is an awful lot
>>>>>>> of a complexity
>>>>>>> to take on for something that appears to be solvable in userspace.
>>>>>>>
>>>>>>
>>>>>> I think it is not that easy to solve in userspace: see
>>>>>> https://lore.kernel.org/qemu-devel/[email protected]/
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> "Using pause_all_vcpus()/resume_all_vcpus() is not possible, as it
>>>>>> will
>>>>>> temporarily drop the BQL - something most callers can't handle (esp.
>>>>>> when called from vcpu context e.g., in virtio code)."
>>>>>
>>>>> Can you please comment on the bigger picture? The patch from me works
>>>>> around *exactly that*, and for that reason, contains that comment.
>>>>>
>>>>
>>>> FWIW, I hacked up my RFC to perform atomic updates on any memslot
>>>> transactions (not just resizes) where ranges do add overlap with ranges
>>>> to remove.
>>>>
>>>> https://github.com/davidhildenbrand/qemu/tree/memslot
>>>>
>>>>
>>>> I only performed simple boot check under x86-64 (where I can see region
>>>> resizes) and some make checks -- pretty sure it has some rough edges;
>>>> but should indicate what's possible and what the possible price might
>>>> be. [one could wire up a new KVM ioctl and call it conditionally on
>>>> support if really required]
>>>>
>>>
>>> A benefit of my ioctl implementation is that could be also used by other
>>> hypervisors, which then do not need to share this kind of "hack".
>>> However, after also talking with Maxim and Paolo, we all agreed that the
>>> main disadvantage of your approach is that is not scalable with the
>>> number of vcpus. It is also inferior to stop *all* vcpus just to allow a
>>> memslot update (KVM only pauses vCPUs that access the modified memslots
>>> instead).
>>>
>>> So I took some measurements, to see what is the performance difference
>>> between my implementation and yours. I used a machine where I could
>>> replicate the bug mentioned in bugzilla, an AMD EPYC 7413 24-Core
>>> Processor with kernel 5.19.0 (+ my patches).
>>>
>>> Then I measured the time it takes that QEMU spends in kvm_commit (ie in
>>> memslot updates) while booting a VM. In other words, if kvm_commit takes
>>> 10 ms and QEMU calls it 20 times, "time to boot" is 200ms. kvm_commit is
>>> not called anymore after boot, so this measurement is easy to compare
>>> over multiple invocations of QEMU.
>>>
>>> I ran the tests with different amount of cores: 1,2,4,8,16,32. QEMU
>>> command is the same to replicate the bug:
>>> ./qemu-system-x86_64 --overcommit cpu-pm=on --smp $v --accel kvm
>>> --display none >> ~/smp_$v;
>>>
>>> Each boot is reproduced 100 times, and then from results I measure
>>> average and stddev (in milliseconds).
>>>
>>> ioctl:
>>> -smp 1: Average: 2.1ms Stdev: 0.8ms
>>> -smp 2: Average: 2.5ms Stdev: 1.5ms
>>> -smp 4: Average: 2.2ms Stdev: 1.1ms
>>> -smp 8: Average: 2.4ms Stdev: 0.7ms
>>> -smp 16: Average: 3.6ms Stdev: 2.4ms (1000 repetitions)
>>> -smp 24: Average: 12.5ms Stdev: 0.9ms (1000 repetitions)
>>>
>>>
>>> pause/resume: (https://github.com/davidhildenbrand/qemu/tree/memslot)
>>> -smp 1: Average: 2.2ms Stdev: 1.2ms
>>> -smp 2: Average: 3.0ms Stdev: 1.4ms
>>> -smp 4: Average: 3.1ms Stdev: 1.3m
>>> -smp 8: Average: 3.4ms Stdev: 1.4ms
>>> -smp 16: Average: 12ms Stdev: 7.0ms (1000 repetitions)
>>> -smp 24: Average: 20ms Stdev: 7.3ms (1000 repetitions)
>>>
>>>
>>> Above 24 vCPUs performance gets worse quickly but I think it's already
>>> quite clear that the results for ioctl scale better as the number of
>>> vcpus increases, while pausing the vCPUs becomes slower already with 16
>>> vcpus.
>>
>> Right, the question is if it happens sufficiently enough that we even
>> care and if there are not ways to mitigate.
>>
>> It doesn't necessarily have to scale with the #VCPUs I think. What
>> should dominate the overall time in theory how long it takes for one
>> VCPU (the slowest one) to leave the kernel.
>>
>> I wondered if
>>
>> 1) it might be easier to have a single KVM mechanism/call to kick all
>> VCPUs out of KVM instead of doing it per VCPU. That alone might already
>> speed things up considerably.
>
> So if I understand correctly, this implies creating a new ioctl in KVM
> anyway? What would then be the difference from what I do? We are
> touching the kernel either way.
>
Only if that turns out to be the actual bottleneck; I haven't analyzed
what exactly causes the delay as the number of VCPUs grows, so I'm just
guessing. Right now, we primarily use pthread_kill(SIG_IPI) to send a
signal to each VCPU thread.
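
For readers unfamiliar with the mechanism, a minimal sketch of such a
per-vCPU kick in plain C (not actual QEMU code; vcpu_threads/num_vcpus
and the use of SIGUSR1 in place of QEMU's SIG_IPI are placeholders):

#include <pthread.h>
#include <signal.h>
#include <string.h>

#define VCPU_KICK_SIGNAL SIGUSR1        /* stand-in for QEMU's SIG_IPI */

/*
 * The handler does nothing; delivery alone makes a blocking KVM_RUN
 * return to userspace with -EINTR, provided the signal is unblocked
 * while the vCPU thread sits in KVM_RUN.
 */
static void kick_handler(int sig) { (void)sig; }

static void install_kick_handler(void)
{
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = kick_handler;
        sigaction(VCPU_KICK_SIGNAL, &sa, NULL);
}

/* One pthread_kill() per vCPU thread: the per-vCPU cost discussed above. */
static void kick_all_vcpus(pthread_t *vcpu_threads, int num_vcpus)
{
        for (int i = 0; i < num_vcpus; i++)
                pthread_kill(vcpu_threads[i], VCPU_KICK_SIGNAL);
}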
>>
>> 2) One thing I wondered is whether the biggest overhead is actually
>> taking the locks in QEMU and not actually waiting for the VCPUs. Maybe
>> we could optimize that as well. (for now I use one lock per VCPU because
>> it felt like it would reduce the ioctl overhead; maybe there is a better
>> alternative to balance between both users)
>>
>> So treat my patch as a completely unoptimized version.
>>
> For what it's worth, my version also performs #invalidate+1 swaps, which
> is not optimized either.
>
> Honestly, I don't see how the above is easier or simpler than what is
> being proposed here.
As Sean said "This is an awful lot of a complexity to take on for
something that appears to be solvable in userspace."
I showed that it's possible. So from that point on, there has to be good
justification why that complexity has to exist in kernel space to
optimize this scenario.
Now, it's not my call to make. I'm just pointing out that it can be done
and that there might be room for improvement in my quick prototype.
--
Thanks,
David / dhildenb
On Mon, Sep 26, 2022, David Hildenbrand wrote:
> On 23.09.22 15:38, Emanuele Giuseppe Esposito wrote:
> >
> >
> > Am 23/09/2022 um 15:21 schrieb David Hildenbrand:
> > > On 23.09.22 15:10, Emanuele Giuseppe Esposito wrote:
> > > >
> > > >
> > > > Am 19/09/2022 um 19:30 schrieb David Hildenbrand:
> > > > > On 19.09.22 09:53, David Hildenbrand wrote:
> > > > > > On 18.09.22 18:13, Emanuele Giuseppe Esposito wrote:
> > > > > > >
> > > > > > >
> > > > > > > Am 09/09/2022 um 16:30 schrieb Sean Christopherson:
> > > > > > > > On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
> > > > > > > > > KVM is currently capable of receiving a single memslot update
> > > > > > > > > through
> > > > > > > > > the KVM_SET_USER_MEMORY_REGION ioctl.
> > > > > > > > > The problem arises when we want to atomically perform multiple
> > > > > > > > > updates,
> > > > > > > > > so that readers of memslot active list avoid seeing incomplete
> > > > > > > > > states.
> > > > > > > > >
> > > > > > > > > For example, in RHBZ
> > > > > > > > > https://bugzilla.redhat.com/show_bug.cgi?id=1979276
...
> As Sean said "This is an awful lot of a complexity to take on for something
> that appears to be solvable in userspace."
And if the userspace solution is unpalatable for whatever reason, I'd like to
understand exactly what KVM behavior is problematic for userspace. E.g. the
above RHBZ bug should no longer be an issue as the buggy commit has since been
reverted.
If the issue is KVM doing something nonsensical on a code fetch to MMIO, then I'd
much rather fix _that_ bug and improve KVM's user exit ABI to let userspace handle
the race _if_ userspace chooses not to pause vCPUs.
On 26/09/2022 at 23:28, Sean Christopherson wrote:
> On Mon, Sep 26, 2022, David Hildenbrand wrote:
>> On 23.09.22 15:38, Emanuele Giuseppe Esposito wrote:
>>>
>>>
>>> Am 23/09/2022 um 15:21 schrieb David Hildenbrand:
>>>> On 23.09.22 15:10, Emanuele Giuseppe Esposito wrote:
>>>>>
>>>>>
>>>>> Am 19/09/2022 um 19:30 schrieb David Hildenbrand:
>>>>>> On 19.09.22 09:53, David Hildenbrand wrote:
>>>>>>> On 18.09.22 18:13, Emanuele Giuseppe Esposito wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Am 09/09/2022 um 16:30 schrieb Sean Christopherson:
>>>>>>>>> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>>>>>>>>>> KVM is currently capable of receiving a single memslot update
>>>>>>>>>> through
>>>>>>>>>> the KVM_SET_USER_MEMORY_REGION ioctl.
>>>>>>>>>> The problem arises when we want to atomically perform multiple
>>>>>>>>>> updates,
>>>>>>>>>> so that readers of memslot active list avoid seeing incomplete
>>>>>>>>>> states.
>>>>>>>>>>
>>>>>>>>>> For example, in RHBZ
>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1979276
>
> ...
>
>> As Sean said "This is an awful lot of a complexity to take on for something
>> that appears to be solvable in userspace."
>
> And if the userspace solution is unpalatable for whatever reason, I'd like to
> understand exactly what KVM behavior is problematic for userspace. E.g. the
> above RHBZ bug should no longer be an issue as the buggy commit has since been
> reverted.
It still is because I can reproduce the bug, as also pointed out in
multiple comments below.
>
> If the issue is KVM doing something nonsensical on a code fetch to MMIO, then I'd
> much rather fix _that_ bug and improve KVM's user exit ABI to let userspace handle
> the race _if_ userspace chooses not to pause vCPUs.
>
Also, on the BZ everyone (Paolo included) seems to agree that the issue
is the non-atomic memslot update.
To be more precise, what I did mostly follows what Peter explained in
comment 19 : https://bugzilla.redhat.com/show_bug.cgi?id=1979276#c19
On 09.09.22 12:45, Emanuele Giuseppe Esposito wrote:
> When kvm_vm_ioctl_set_memory_region_list() is invoked, we need
> to make sure that all memslots are updated in the inactive list
> and then swap (preferreably only once) the lists, so that all
> changes are visible immediately.
>
> The only issue is that DELETE and MOVE need to perform 2 swaps:
> firstly replace old memslot with invalid, and then remove invalid.
>
I'm curious, how would a resize (grow/shrink) or a split be handled?
--
Thanks,
David / dhildenb
On 27/09/2022 at 09:46, David Hildenbrand wrote:
> On 09.09.22 12:45, Emanuele Giuseppe Esposito wrote:
>> When kvm_vm_ioctl_set_memory_region_list() is invoked, we need
>> to make sure that all memslots are updated in the inactive list
>> and then swap (preferreably only once) the lists, so that all
>> changes are visible immediately.
>>
>> The only issue is that DELETE and MOVE need to perform 2 swaps:
>> firstly replace old memslot with invalid, and then remove invalid.
>>
>
> I'm curious, how would a resize (grow/shrink) or a split be handled?
>
There are only 4 operations possible in KVM: KVM_MR_{DELETE, MOVE,
CREATE, FLAGS_ONLY}.
A resize should be implemented in QEMU as DELETE+CREATE.
Therefore a resize on memslot X will be implemented as:
First pass on the userspace operations:
invalidate memslot X;
swap_memslot_list(); // NOW it is visible to the guest
What guest sees: memslot X is invalid, so MMU keeps retrying the page fault
Second pass:
create new memslot X
delete old memslot X
What guest sees: memslot X is invalid, so MMU keeps retrying the page fault
Third pass:
swap() // 1 for each address space affected
What guest sees: new memslot replacing the old one
A split is DELETE+CREATE+CREATE, so it's the same:
First pass on the userspace operations:
invalidate memslot X;
swap_memslot_list(); // NOW it is visible to the guest
What guest sees: memslot X is invalid, so MMU keeps retrying the page fault
Second pass:
delete old memslot X
create new memslot X1
create new memslot X2
What guest sees: memslot X is invalid, so MMU keeps retrying the page fault
Third pass:
swap() // 1 for each address space affected
What guest sees: two new memslots replacing the old one
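
To make the userspace side of this concrete, here is a minimal sketch of
how a resize of slot X could be submitted as one batch with the proposed
ioctl. The struct layout, the memory_size == 0 convention for DELETE and
the invalidate_slot flag come from this series; vm_fd, the slot number
and the addresses/sizes are placeholder values.

#include <sys/ioctl.h>
#include <linux/kvm.h>          /* assumes this series' uapi additions */

#define RESIZED_SLOT    3                       /* placeholder slot id */
#define SLOT_GPA        0x100000000ULL          /* placeholder guest base */
#define NEW_SIZE        (512ULL << 20)          /* placeholder new size */

static int resize_slot_atomically(int vm_fd, void *new_host_mem)
{
        struct {
                struct kvm_userspace_memory_region_list list;
                struct kvm_userspace_memory_region_entry entries[2];
        } batch = {
                .list = { .nent = 2, .flags = 0 },
                .entries = {
                        {
                                /* DELETE the old slot X: memory_size == 0,
                                 * invalidate_slot == 1 (DELETE or MOVE) */
                                .slot = RESIZED_SLOT,
                                .guest_phys_addr = SLOT_GPA,
                                .memory_size = 0,
                                .invalidate_slot = 1,
                        },
                        {
                                /* CREATE the new slot X with the new size */
                                .slot = RESIZED_SLOT,
                                .guest_phys_addr = SLOT_GPA,
                                .memory_size = NEW_SIZE,
                                .userspace_addr = (__u64)(unsigned long)new_host_mem,
                                .invalidate_slot = 0,
                        },
                },
        };

        /* One ioctl: both entries are applied and the lists swapped by KVM. */
        return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION_LIST, &batch.list);
}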
On 27/09/2022 at 11:22, David Hildenbrand wrote:
> On 27.09.22 10:35, Emanuele Giuseppe Esposito wrote:
>>
>>
>> Am 27/09/2022 um 09:46 schrieb David Hildenbrand:
>>> On 09.09.22 12:45, Emanuele Giuseppe Esposito wrote:
>>>> When kvm_vm_ioctl_set_memory_region_list() is invoked, we need
>>>> to make sure that all memslots are updated in the inactive list
>>>> and then swap (preferreably only once) the lists, so that all
>>>> changes are visible immediately.
>>>>
>>>> The only issue is that DELETE and MOVE need to perform 2 swaps:
>>>> firstly replace old memslot with invalid, and then remove invalid.
>>>>
>>>
>>> I'm curious, how would a resize (grow/shrink) or a split be handled?
>>>
>>
>> There are only 4 operations possible in KVM: KVM_MR_{DELETE, MOVE,
>> CREATE, FLAGS_ONLY}.
>>
>> A resize should be implemented in QEMU as DELETE+CREATE.
>>
>> Therefore a resize on memslot X will be implemented as:
>> First pass on the userspace operations:
>> invalidate memslot X;
>> swap_memslot_list(); // NOW it is visible to the guest
>>
>> What guest sees: memslot X is invalid, so MMU keeps retrying the page
>> fault
>>
>> Second pass:
>> create new memslot X
>> delete old memslot X
>
> Thanks a lot for the very nice explanation!
Anytime :)
> Does the invalidation already free up memslot metadata (especially the
> rmaps) or will we end up temporarily allocating twice the memslot metadata?
>
Invalidation creates a new, temporary, identical memslot; I am not sure
about the rmaps. It is in any case the same code path as before, and if
I understand correctly a new slot is required to keep the old one
intact, in case something goes wrong and we need to revert.
Thanks,
Emanuele
On 27.09.22 10:35, Emanuele Giuseppe Esposito wrote:
>
>
> Am 27/09/2022 um 09:46 schrieb David Hildenbrand:
>> On 09.09.22 12:45, Emanuele Giuseppe Esposito wrote:
>>> When kvm_vm_ioctl_set_memory_region_list() is invoked, we need
>>> to make sure that all memslots are updated in the inactive list
>>> and then swap (preferreably only once) the lists, so that all
>>> changes are visible immediately.
>>>
>>> The only issue is that DELETE and MOVE need to perform 2 swaps:
>>> firstly replace old memslot with invalid, and then remove invalid.
>>>
>>
>> I'm curious, how would a resize (grow/shrink) or a split be handled?
>>
>
> There are only 4 operations possible in KVM: KVM_MR_{DELETE, MOVE,
> CREATE, FLAGS_ONLY}.
>
> A resize should be implemented in QEMU as DELETE+CREATE.
>
> Therefore a resize on memslot X will be implemented as:
> First pass on the userspace operations:
> invalidate memslot X;
> swap_memslot_list(); // NOW it is visible to the guest
>
> What guest sees: memslot X is invalid, so MMU keeps retrying the page fault
>
> Second pass:
> create new memslot X
> delete old memslot X
Thanks a lot for the very nice explanation!
Does the invalidation already free up memslot metadata (especially the
rmaps) or will we end up temporarily allocating twice the memslot metadata?
--
Thanks,
David / dhildenb
>> Does the invalidation already free up memslot metadata (especially the
>> rmaps) or will we end up temporarily allocating twice the memslot metadata?
>>
>
> Invalidation creates a new temporary identical memslot, I am not sure
> about the rmaps. It is anyways the same code as it was done before and
> if I understand correctly, a new slot is required to keep the old
> intact, in case something goes wrong and we need to revert.
Okay, thanks!
--
Thanks,
David / dhildenb
On Tue, Sep 27, 2022, Emanuele Giuseppe Esposito wrote:
>
> Am 26/09/2022 um 23:28 schrieb Sean Christopherson:
> > On Mon, Sep 26, 2022, David Hildenbrand wrote:
> >> As Sean said "This is an awful lot of a complexity to take on for something
> >> that appears to be solvable in userspace."
> >
> > And if the userspace solution is unpalatable for whatever reason, I'd like to
> > understand exactly what KVM behavior is problematic for userspace. E.g. the
> > above RHBZ bug should no longer be an issue as the buggy commit has since been
> > reverted.
>
> It still is because I can reproduce the bug, as also pointed out in
> multiple comments below.
You can reproduce _a_ bug, but it's obviously not the original bug, because the
last comment says:
Second, indeed the patch was reverted and somehow accepted without generating
too much noise:
...
The underlying issue of course as we both know is still there.
You might have luck reproducing it with this bug
https://bugzilla.redhat.com/show_bug.cgi?id=1855298
But for me it looks like it is 'working' as well, so you might have
to write a unit test to trigger the issue.
> > If the issue is KVM doing something nonsensical on a code fetch to MMIO, then I'd
> > much rather fix _that_ bug and improve KVM's user exit ABI to let userspace handle
> > the race _if_ userspace chooses not to pause vCPUs.
> >
>
> Also on the BZ they all seem (Paolo included) to agree that the issue is
> non-atomic memslots update.
Yes, non-atomic memslot likely results in the guest fetching from a GPA without a
memslot. I'm asking for an explanation of exactly what happens when that occurs,
because it should be possible to adjust KVM and/or QEMU to play nice with the
fetch, e.g. to resume the guest until the new memslot is installed, in which case
an atomic update isn't needed.
I assume the issue is that KVM exits with KVM_EXIT_INTERNAL_ERROR because the
guest is running at CPL=0, and QEMU kills the guest in response. If that's correct,
then that problem can be solved by exiting to userspace with KVM_EXIT_MMIO instead
of KVM_EXIT_INTERNAL_ERROR so that userspace can do something sane in response to
the MMIO code fetch.
I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes
the vCPU if mmio.len==0. It's a bit of a hack, but I don't think it violates KVM's
ABI in any way, and it can even become "official" behavior since KVM x86 doesn't
otherwise exit with mmio.len==0.
Compile tested only...
---
From: Sean Christopherson <[email protected]>
Date: Tue, 27 Sep 2022 08:16:03 -0700
Subject: [PATCH] KVM: x86: Exit to userspace with zero-length MMIO "read" on
MMIO fetch
Exit to userspace with KVM_EXIT_MMIO if emulation fails due to not being
able to fetch instruction bytes, e.g. if the resolved GPA isn't backed by
a memslot. If userspace is manipulating memslots without pausing vCPUs,
e.g. to emulate BIOS relocation, then a vCPU may fetch while there is no
valid memslot installed. Depending on guest context, KVM will either
exit to userspace with KVM_EXIT_INTERNAL_ERROR (L1, CPL=0) or simply
resume the guest (L2 or CPL>0), neither of which is desirable as exiting
with "emulation error" effectively kills the VM, and resuming the guest
doesn't provide userspace an opportunity to react to the fetch.
Use "mmio.len == 0" to indicate "fetch". This is a bit of a hack, but
there is no other way to communicate "fetch" to userspace without
defining an entirely new exit reason, e.g. "mmio.is_write" is a boolean
and not a flag, and there is no known use case for actually supporting
code fetches from MMIO, i.e. there's no need to allow userspace to fill
in the instruction bytes.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/emulate.c | 2 ++
arch/x86/kvm/kvm_emulate.h | 1 +
arch/x86/kvm/x86.c | 9 ++++++++-
3 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index f092c54d1a2f..e141238d93b0 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -5353,6 +5353,8 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len, int
done:
if (rc == X86EMUL_PROPAGATE_FAULT)
ctxt->have_exception = true;
+ if (rc == X86EMUL_IO_NEEDED)
+ return EMULATION_IO_FETCH;
return (rc != X86EMUL_CONTINUE) ? EMULATION_FAILED : EMULATION_OK;
}
diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h
index 89246446d6aa..3cb2e321fcd2 100644
--- a/arch/x86/kvm/kvm_emulate.h
+++ b/arch/x86/kvm/kvm_emulate.h
@@ -516,6 +516,7 @@ bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt);
#define EMULATION_OK 0
#define EMULATION_RESTART 1
#define EMULATION_INTERCEPTED 2
+#define EMULATION_IO_FETCH 3
void init_decode_cache(struct x86_emulate_ctxt *ctxt);
int x86_emulate_insn(struct x86_emulate_ctxt *ctxt);
int emulator_task_switch(struct x86_emulate_ctxt *ctxt,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index aa5ab0c620de..7eb72694c601 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7129,8 +7129,13 @@ static int kvm_fetch_guest_virt(struct x86_emulate_ctxt *ctxt,
bytes = (unsigned)PAGE_SIZE - offset;
ret = kvm_vcpu_read_guest_page(vcpu, gpa >> PAGE_SHIFT, val,
offset, bytes);
- if (unlikely(ret < 0))
+ if (unlikely(ret < 0)) {
+ vcpu->run->mmio.phys_addr = gpa;
+ vcpu->run->mmio.len = 0;
+ vcpu->run->mmio.is_write = 0;
+ vcpu->run->exit_reason = KVM_EXIT_MMIO;
return X86EMUL_IO_NEEDED;
+ }
return X86EMUL_CONTINUE;
}
@@ -8665,6 +8670,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
r = x86_decode_emulated_instruction(vcpu, emulation_type,
insn, insn_len);
if (r != EMULATION_OK) {
+ if (r == EMULATION_IO_FETCH)
+ return 0;
if ((emulation_type & EMULTYPE_TRAP_UD) ||
(emulation_type & EMULTYPE_TRAP_UD_FORCED)) {
kvm_queue_exception(vcpu, UD_VECTOR);
base-commit: 39d9b48cc777bdf6d67d01ed24f1f89b13f5fbb2
--
On 27/09/2022 at 17:58, Sean Christopherson wrote:
> On Tue, Sep 27, 2022, Emanuele Giuseppe Esposito wrote:
>>
>> Am 26/09/2022 um 23:28 schrieb Sean Christopherson:
>>> On Mon, Sep 26, 2022, David Hildenbrand wrote:
>>>> As Sean said "This is an awful lot of a complexity to take on for something
>>>> that appears to be solvable in userspace."
>>>
>>> And if the userspace solution is unpalatable for whatever reason, I'd like to
>>> understand exactly what KVM behavior is problematic for userspace. E.g. the
>>> above RHBZ bug should no longer be an issue as the buggy commit has since been
>>> reverted.
>>
>> It still is because I can reproduce the bug, as also pointed out in
>> multiple comments below.
>
> You can reproduce _a_ bug, but it's obviously not the original bug, because the
> last comment says:
>
> Second, indeed the patch was reverted and somehow accepted without generating
> too much noise:
>
> ...
>
> The underlying issue of course as we both know is still there.
>
> You might have luck reproducing it with this bug
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1855298
>
> But for me it looks like it is 'working' as well, so you might have
> to write a unit test to trigger the issue.
>
>>> If the issue is KVM doing something nonsensical on a code fetch to MMIO, then I'd
>>> much rather fix _that_ bug and improve KVM's user exit ABI to let userspace handle
>>> the race _if_ userspace chooses not to pause vCPUs.
>>>
>>
>> Also on the BZ they all seem (Paolo included) to agree that the issue is
>> non-atomic memslots update.
>
> Yes, non-atomic memslot likely results in the guest fetching from a GPA without a
> memslot. I'm asking for an explanation of exactly what happens when that occurs,
> because it should be possible to adjust KVM and/or QEMU to play nice with the
> fetch, e.g. to resume the guest until the new memslot is installed, in which case
> an atomic update isn't needed.
>
> I assume the issue is that KVM exits with KVM_EXIT_INTERNAL_ERROR because the
> guest is running at CPL=0, and QEMU kills the guest in response. If that's correct,
> then that problem can be solved by exiting to userspace with KVM_EXIT_MMIO instead
> of KVM_EXIT_INTERNAL_ERROR so that userspace can do something sane in response to
> the MMIO code fetch.
>
> I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes
> the vCPU if mmio.len==0. It's a bit of a hack, but I don't think it violates KVM's
> ABI in any way, and it can even become "official" behavior since KVM x86 doesn't
> otherwise exit with mmio.len==0.
>
> Compile tested only...
So basically you are just making KVM catch the failed
kvm_vcpu_read_guest_page() by returning mmio.len = 0 to QEMU, which
basically ends up doing nothing and retrying the instruction?
I wonder if there are some performance implications in this, but it's
definitely simpler than what I did.
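
A minimal sketch, not actual QEMU code, of the "do nothing and retry"
run-loop behaviour being described; vcpu_fd and run are assumed to come
from the usual KVM_CREATE_VCPU and mmap of the kvm_run structure.

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static void vcpu_run_loop(int vcpu_fd, struct kvm_run *run)
{
        for (;;) {
                if (ioctl(vcpu_fd, KVM_RUN, 0) < 0 && errno != EINTR)
                        break;

                if (run->exit_reason == KVM_EXIT_MMIO) {
                        /*
                         * len == 0 means "code fetch from MMIO" with Sean's
                         * patch: nothing to emulate, just re-enter the guest
                         * and let it retry once the new memslot is in place.
                         */
                        if (run->mmio.len == 0)
                                continue;
                        /* non-zero len: normal MMIO, dispatch to device emulation */
                }
                /* other exit reasons elided in this sketch */
        }
}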
Tested on the same failing machine used for the BZ, fixes the bug.
Do you want me to re-send the patch on your behalf (and probably add some
small documentation in Documentation/virt/kvm/api.rst)?
Emanuele
>
> ---
> From: Sean Christopherson <[email protected]>
> Date: Tue, 27 Sep 2022 08:16:03 -0700
> Subject: [PATCH] KVM: x86: Exit to userspace with zero-length MMIO "read" on
> MMIO fetch
>
> Exit to userspace with KVM_EXIT_MMIO if emulation fails due to not being
> able to fetch instruction bytes, e.g. if the resolved GPA isn't backed by
> a memslot. If userspace is manipulating memslots without pausing vCPUs,
> e.g. to emulate BIOS relocation, then a vCPU may fetch while there is no
> valid memslot installed. Depending on guest context, KVM will either
> exit to userspace with KVM_EXIT_INTERNAL_ERROR (L1, CPL=0) or simply
> resume the guest (L2 or CPL>0), neither of which is desirable as exiting
> with "emulation error" effectively kills the VM, and resuming the guest
> doesn't provide userspace an opportunity to react the to fetch.
>
> Use "mmio.len == 0" to indicate "fetch". This is a bit of a hack, but
> there is no other way to communicate "fetch" to userspace without
> defining an entirely new exit reason, e.g. "mmio.is_write" is a boolean
> and not a flag, and there is no known use case for actually supporting
> code fetches from MMIO, i.e. there's no need to allow userspace to fill
> in the instruction bytes.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/emulate.c | 2 ++
> arch/x86/kvm/kvm_emulate.h | 1 +
> arch/x86/kvm/x86.c | 9 ++++++++-
> 3 files changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index f092c54d1a2f..e141238d93b0 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -5353,6 +5353,8 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len, int
> done:
> if (rc == X86EMUL_PROPAGATE_FAULT)
> ctxt->have_exception = true;
> + if (rc == X86EMUL_IO_NEEDED)
> + return EMULATION_IO_FETCH;
> return (rc != X86EMUL_CONTINUE) ? EMULATION_FAILED : EMULATION_OK;
> }
>
> diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h
> index 89246446d6aa..3cb2e321fcd2 100644
> --- a/arch/x86/kvm/kvm_emulate.h
> +++ b/arch/x86/kvm/kvm_emulate.h
> @@ -516,6 +516,7 @@ bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt);
> #define EMULATION_OK 0
> #define EMULATION_RESTART 1
> #define EMULATION_INTERCEPTED 2
> +#define EMULATION_IO_FETCH 3
> void init_decode_cache(struct x86_emulate_ctxt *ctxt);
> int x86_emulate_insn(struct x86_emulate_ctxt *ctxt);
> int emulator_task_switch(struct x86_emulate_ctxt *ctxt,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index aa5ab0c620de..7eb72694c601 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7129,8 +7129,13 @@ static int kvm_fetch_guest_virt(struct x86_emulate_ctxt *ctxt,
> bytes = (unsigned)PAGE_SIZE - offset;
> ret = kvm_vcpu_read_guest_page(vcpu, gpa >> PAGE_SHIFT, val,
> offset, bytes);
> - if (unlikely(ret < 0))
> + if (unlikely(ret < 0)) {
> + vcpu->run->mmio.phys_addr = gpa;
> + vcpu->run->mmio.len = 0;
> + vcpu->run->mmio.is_write = 0;
> + vcpu->run->exit_reason = KVM_EXIT_MMIO;
> return X86EMUL_IO_NEEDED;
> + }
>
> return X86EMUL_CONTINUE;
> }
> @@ -8665,6 +8670,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> r = x86_decode_emulated_instruction(vcpu, emulation_type,
> insn, insn_len);
> if (r != EMULATION_OK) {
> + if (r == EMULATION_IO_FETCH)
> + return 0;
> if ((emulation_type & EMULTYPE_TRAP_UD) ||
> (emulation_type & EMULTYPE_TRAP_UD_FORCED)) {
> kvm_queue_exception(vcpu, UD_VECTOR);
>
> base-commit: 39d9b48cc777bdf6d67d01ed24f1f89b13f5fbb2
>
On Wed, 2022-09-28 at 11:11 +0200, Emanuele Giuseppe Esposito wrote:
>
> Am 27/09/2022 um 17:58 schrieb Sean Christopherson:
> > On Tue, Sep 27, 2022, Emanuele Giuseppe Esposito wrote:
> > > Am 26/09/2022 um 23:28 schrieb Sean Christopherson:
> > > > On Mon, Sep 26, 2022, David Hildenbrand wrote:
> > > > > As Sean said "This is an awful lot of a complexity to take on for something
> > > > > that appears to be solvable in userspace."
> > > >
> > > > And if the userspace solution is unpalatable for whatever reason, I'd like to
> > > > understand exactly what KVM behavior is problematic for userspace. E.g. the
> > > > above RHBZ bug should no longer be an issue as the buggy commit has since been
> > > > reverted.
> > >
> > > It still is because I can reproduce the bug, as also pointed out in
> > > multiple comments below.
> >
> > You can reproduce _a_ bug, but it's obviously not the original bug, because the
> > last comment says:
> >
> > Second, indeed the patch was reverted and somehow accepted without generating
> > too much noise:
> >
> > ...
> >
> > The underlying issue of course as we both know is still there.
> >
> > You might have luck reproducing it with this bug
> >
> > https://bugzilla.redhat.com/show_bug.cgi?id=1855298
> >
> > But for me it looks like it is 'working' as well, so you might have
> > to write a unit test to trigger the issue.
> >
> > > > If the issue is KVM doing something nonsensical on a code fetch to MMIO, then I'd
> > > > much rather fix _that_ bug and improve KVM's user exit ABI to let userspace handle
> > > > the race _if_ userspace chooses not to pause vCPUs.
> > > >
> > >
> > > Also on the BZ they all seem (Paolo included) to agree that the issue is
> > > non-atomic memslots update.
> >
> > Yes, non-atomic memslot likely results in the guest fetching from a GPA without a
> > memslot. I'm asking for an explanation of exactly what happens when that occurs,
> > because it should be possible to adjust KVM and/or QEMU to play nice with the
> > fetch, e.g. to resume the guest until the new memslot is installed, in which case
> > an atomic update isn't needed.
> >
> > I assume the issue is that KVM exits with KVM_EXIT_INTERNAL_ERROR because the
> > guest is running at CPL=0, and QEMU kills the guest in response. If that's correct,
> > then that problem can be solved by exiting to userspace with KVM_EXIT_MMIO instead
> > of KVM_EXIT_INTERNAL_ERROR so that userspace can do something sane in response to
> > the MMIO code fetch.
> >
> > I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes
> > the vCPU if mmio.len==0. It's a bit of a hack, but I don't think it violates KVM's
> > ABI in any way, and it can even become "official" behavior since KVM x86 doesn't
> > otherwise exit with mmio.len==0.
> >
> > Compile tested only...
>
> So basically you are just making KVM catch the failed
> kvm_vcpu_read_guest_page() by retuning mmio.len = 0 to QEMU which
> basically ends up in doing nothing and retry again executing the
> instruction?
>
> I wonder if there are some performance implications in this, but it's
> definitely simpler than what I did.
>
> Tested on the same failing machine used for the BZ, fixes the bug.
>
> Do you want me to re-send the patch on your behalf (and add probably a
> small documentation on Documentation/virt/kvm/api.rst)?
>
> Emanuele
> > ---
> > From: Sean Christopherson <[email protected]>
> > Date: Tue, 27 Sep 2022 08:16:03 -0700
> > Subject: [PATCH] KVM: x86: Exit to userspace with zero-length MMIO "read" on
> > MMIO fetch
> >
> > Exit to userspace with KVM_EXIT_MMIO if emulation fails due to not being
> > able to fetch instruction bytes, e.g. if the resolved GPA isn't backed by
> > a memslot. If userspace is manipulating memslots without pausing vCPUs,
> > e.g. to emulate BIOS relocation, then a vCPU may fetch while there is no
> > valid memslot installed. Depending on guest context, KVM will either
> > exit to userspace with KVM_EXIT_INTERNAL_ERROR (L1, CPL=0) or simply
> > resume the guest (L2 or CPL>0), neither of which is desirable as exiting
> > with "emulation error" effectively kills the VM, and resuming the guest
> > doesn't provide userspace an opportunity to react the to fetch.
> >
> > Use "mmio.len == 0" to indicate "fetch". This is a bit of a hack, but
> > there is no other way to communicate "fetch" to userspace without
> > defining an entirely new exit reason, e.g. "mmio.is_write" is a boolean
> > and not a flag, and there is no known use case for actually supporting
> > code fetches from MMIO, i.e. there's no need to allow userspace to fill
> > in the instruction bytes.
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
> > arch/x86/kvm/emulate.c | 2 ++
> > arch/x86/kvm/kvm_emulate.h | 1 +
> > arch/x86/kvm/x86.c | 9 ++++++++-
> > 3 files changed, 11 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> > index f092c54d1a2f..e141238d93b0 100644
> > --- a/arch/x86/kvm/emulate.c
> > +++ b/arch/x86/kvm/emulate.c
> > @@ -5353,6 +5353,8 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len, int
> > done:
> > if (rc == X86EMUL_PROPAGATE_FAULT)
> > ctxt->have_exception = true;
> > + if (rc == X86EMUL_IO_NEEDED)
> > + return EMULATION_IO_FETCH;
> > return (rc != X86EMUL_CONTINUE) ? EMULATION_FAILED : EMULATION_OK;
> > }
> >
> > diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h
> > index 89246446d6aa..3cb2e321fcd2 100644
> > --- a/arch/x86/kvm/kvm_emulate.h
> > +++ b/arch/x86/kvm/kvm_emulate.h
> > @@ -516,6 +516,7 @@ bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt);
> > #define EMULATION_OK 0
> > #define EMULATION_RESTART 1
> > #define EMULATION_INTERCEPTED 2
> > +#define EMULATION_IO_FETCH 3
> > void init_decode_cache(struct x86_emulate_ctxt *ctxt);
> > int x86_emulate_insn(struct x86_emulate_ctxt *ctxt);
> > int emulator_task_switch(struct x86_emulate_ctxt *ctxt,
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index aa5ab0c620de..7eb72694c601 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -7129,8 +7129,13 @@ static int kvm_fetch_guest_virt(struct x86_emulate_ctxt *ctxt,
> > bytes = (unsigned)PAGE_SIZE - offset;
> > ret = kvm_vcpu_read_guest_page(vcpu, gpa >> PAGE_SHIFT, val,
> > offset, bytes);
> > - if (unlikely(ret < 0))
> > + if (unlikely(ret < 0)) {
> > + vcpu->run->mmio.phys_addr = gpa;
> > + vcpu->run->mmio.len = 0;
> > + vcpu->run->mmio.is_write = 0;
> > + vcpu->run->exit_reason = KVM_EXIT_MMIO;
> > return X86EMUL_IO_NEEDED;
> > + }
> >
> > return X86EMUL_CONTINUE;
> > }
> > @@ -8665,6 +8670,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > r = x86_decode_emulated_instruction(vcpu, emulation_type,
> > insn, insn_len);
> > if (r != EMULATION_OK) {
> > + if (r == EMULATION_IO_FETCH)
> > + return 0;
> > if ((emulation_type & EMULTYPE_TRAP_UD) ||
> > (emulation_type & EMULTYPE_TRAP_UD_FORCED)) {
> > kvm_queue_exception(vcpu, UD_VECTOR);
> >
> > base-commit: 39d9b48cc777bdf6d67d01ed24f1f89b13f5fbb2
> >
Note that AFAIK there is another case (and probably more): if TDP is disabled
and the MMU root is in MMIO, we kill the guest.
mmu_alloc_shadow_roots -> mmu_check_root
I used to have a few hacks in KVM to cope with this, but AFAIK
I gave up on it, because the issue would show up again and again.
Best regards,
Maxim Levitsky
On 28.09.22 13:14, Maxim Levitsky wrote:
> On Wed, 2022-09-28 at 11:11 +0200, Emanuele Giuseppe Esposito wrote:
>>
>> Am 27/09/2022 um 17:58 schrieb Sean Christopherson:
>>> On Tue, Sep 27, 2022, Emanuele Giuseppe Esposito wrote:
>>>> Am 26/09/2022 um 23:28 schrieb Sean Christopherson:
>>>>> On Mon, Sep 26, 2022, David Hildenbrand wrote:
>>>>>> As Sean said "This is an awful lot of a complexity to take on for something
>>>>>> that appears to be solvable in userspace."
>>>>>
>>>>> And if the userspace solution is unpalatable for whatever reason, I'd like to
>>>>> understand exactly what KVM behavior is problematic for userspace. E.g. the
>>>>> above RHBZ bug should no longer be an issue as the buggy commit has since been
>>>>> reverted.
>>>>
>>>> It still is because I can reproduce the bug, as also pointed out in
>>>> multiple comments below.
>>>
>>> You can reproduce _a_ bug, but it's obviously not the original bug, because the
>>> last comment says:
>>>
>>> Second, indeed the patch was reverted and somehow accepted without generating
>>> too much noise:
>>>
>>> ...
>>>
>>> The underlying issue of course as we both know is still there.
>>>
>>> You might have luck reproducing it with this bug
>>>
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1855298
>>>
>>> But for me it looks like it is 'working' as well, so you might have
>>> to write a unit test to trigger the issue.
>>>
>>>>> If the issue is KVM doing something nonsensical on a code fetch to MMIO, then I'd
>>>>> much rather fix _that_ bug and improve KVM's user exit ABI to let userspace handle
>>>>> the race _if_ userspace chooses not to pause vCPUs.
>>>>>
>>>>
>>>> Also on the BZ they all seem (Paolo included) to agree that the issue is
>>>> non-atomic memslots update.
>>>
>>> Yes, non-atomic memslot likely results in the guest fetching from a GPA without a
>>> memslot. I'm asking for an explanation of exactly what happens when that occurs,
>>> because it should be possible to adjust KVM and/or QEMU to play nice with the
>>> fetch, e.g. to resume the guest until the new memslot is installed, in which case
>>> an atomic update isn't needed.
>>>
>>> I assume the issue is that KVM exits with KVM_EXIT_INTERNAL_ERROR because the
>>> guest is running at CPL=0, and QEMU kills the guest in response. If that's correct,
>>> then that problem can be solved by exiting to userspace with KVM_EXIT_MMIO instead
>>> of KVM_EXIT_INTERNAL_ERROR so that userspace can do something sane in response to
>>> the MMIO code fetch.
>>>
>>> I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes
>>> the vCPU if mmio.len==0. It's a bit of a hack, but I don't think it violates KVM's
>>> ABI in any way, and it can even become "official" behavior since KVM x86 doesn't
>>> otherwise exit with mmio.len==0.
>>>
>>> Compile tested only...
>>
>> So basically you are just making KVM catch the failed
>> kvm_vcpu_read_guest_page() by retuning mmio.len = 0 to QEMU which
>> basically ends up in doing nothing and retry again executing the
>> instruction?
>>
>> I wonder if there are some performance implications in this, but it's
>> definitely simpler than what I did.
>>
>> Tested on the same failing machine used for the BZ, fixes the bug.
>>
>> Do you want me to re-send the patch on your behalf (and add probably a
>> small documentation on Documentation/virt/kvm/api.rst)?
>>
>> Emanuele
>>> ---
>>> From: Sean Christopherson <[email protected]>
>>> Date: Tue, 27 Sep 2022 08:16:03 -0700
>>> Subject: [PATCH] KVM: x86: Exit to userspace with zero-length MMIO "read" on
>>> MMIO fetch
>>>
>>> Exit to userspace with KVM_EXIT_MMIO if emulation fails due to not being
>>> able to fetch instruction bytes, e.g. if the resolved GPA isn't backed by
>>> a memslot. If userspace is manipulating memslots without pausing vCPUs,
>>> e.g. to emulate BIOS relocation, then a vCPU may fetch while there is no
>>> valid memslot installed. Depending on guest context, KVM will either
>>> exit to userspace with KVM_EXIT_INTERNAL_ERROR (L1, CPL=0) or simply
>>> resume the guest (L2 or CPL>0), neither of which is desirable as exiting
>>> with "emulation error" effectively kills the VM, and resuming the guest
>>> doesn't provide userspace an opportunity to react the to fetch.
>>>
>>> Use "mmio.len == 0" to indicate "fetch". This is a bit of a hack, but
>>> there is no other way to communicate "fetch" to userspace without
>>> defining an entirely new exit reason, e.g. "mmio.is_write" is a boolean
>>> and not a flag, and there is no known use case for actually supporting
>>> code fetches from MMIO, i.e. there's no need to allow userspace to fill
>>> in the instruction bytes.
>>>
>>> Signed-off-by: Sean Christopherson <[email protected]>
>>> ---
>>> arch/x86/kvm/emulate.c | 2 ++
>>> arch/x86/kvm/kvm_emulate.h | 1 +
>>> arch/x86/kvm/x86.c | 9 ++++++++-
>>> 3 files changed, 11 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
>>> index f092c54d1a2f..e141238d93b0 100644
>>> --- a/arch/x86/kvm/emulate.c
>>> +++ b/arch/x86/kvm/emulate.c
>>> @@ -5353,6 +5353,8 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len, int
>>> done:
>>> if (rc == X86EMUL_PROPAGATE_FAULT)
>>> ctxt->have_exception = true;
>>> + if (rc == X86EMUL_IO_NEEDED)
>>> + return EMULATION_IO_FETCH;
>>> return (rc != X86EMUL_CONTINUE) ? EMULATION_FAILED : EMULATION_OK;
>>> }
>>>
>>> diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h
>>> index 89246446d6aa..3cb2e321fcd2 100644
>>> --- a/arch/x86/kvm/kvm_emulate.h
>>> +++ b/arch/x86/kvm/kvm_emulate.h
>>> @@ -516,6 +516,7 @@ bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt);
>>> #define EMULATION_OK 0
>>> #define EMULATION_RESTART 1
>>> #define EMULATION_INTERCEPTED 2
>>> +#define EMULATION_IO_FETCH 3
>>> void init_decode_cache(struct x86_emulate_ctxt *ctxt);
>>> int x86_emulate_insn(struct x86_emulate_ctxt *ctxt);
>>> int emulator_task_switch(struct x86_emulate_ctxt *ctxt,
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index aa5ab0c620de..7eb72694c601 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -7129,8 +7129,13 @@ static int kvm_fetch_guest_virt(struct x86_emulate_ctxt *ctxt,
>>> bytes = (unsigned)PAGE_SIZE - offset;
>>> ret = kvm_vcpu_read_guest_page(vcpu, gpa >> PAGE_SHIFT, val,
>>> offset, bytes);
>>> - if (unlikely(ret < 0))
>>> + if (unlikely(ret < 0)) {
>>> + vcpu->run->mmio.phys_addr = gpa;
>>> + vcpu->run->mmio.len = 0;
>>> + vcpu->run->mmio.is_write = 0;
>>> + vcpu->run->exit_reason = KVM_EXIT_MMIO;
>>> return X86EMUL_IO_NEEDED;
>>> + }
>>>
>>> return X86EMUL_CONTINUE;
>>> }
>>> @@ -8665,6 +8670,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>>> r = x86_decode_emulated_instruction(vcpu, emulation_type,
>>> insn, insn_len);
>>> if (r != EMULATION_OK) {
>>> + if (r == EMULATION_IO_FETCH)
>>> + return 0;
>>> if ((emulation_type & EMULTYPE_TRAP_UD) ||
>>> (emulation_type & EMULTYPE_TRAP_UD_FORCED)) {
>>> kvm_queue_exception(vcpu, UD_VECTOR);
>>>
>>> base-commit: 39d9b48cc777bdf6d67d01ed24f1f89b13f5fbb2
>>>
>
> Note that AFAIK, there is another case (and probably more), if TDP is disabled,
> and MMU root is in mmio, we kill the guest.
>
>
> mmu_alloc_shadow_roots -> mmu_check_root
>
>
> I used to have few hacks in KVM to cope with this, but AFAIK,
> I gave up on it, because the issue would show up again and again.
IIRC, s390x can have real problems if we temporarily remove a memslot,
in case the emulation/interpretation code tries to access guest memory
and fails because there is no memslot describing that portion of guest RAM.
Note that resizing/merging/splitting currently shouldn't happen on
s390x, though. But resizing of memslots might happen in the near future
with virtio-mem in QEMU.
--
Thanks,
David / dhildenb
On 9/27/22 17:58, Sean Christopherson wrote:
> On Tue, Sep 27, 2022, Emanuele Giuseppe Esposito wrote:
>>
>> Am 26/09/2022 um 23:28 schrieb Sean Christopherson:
>>> On Mon, Sep 26, 2022, David Hildenbrand wrote:
>>>> As Sean said "This is an awful lot of a complexity to take on for something
>>>> that appears to be solvable in userspace."
>>>
>>> And if the userspace solution is unpalatable for whatever reason, I'd like to
>>> understand exactly what KVM behavior is problematic for userspace. E.g. the
>>> above RHBZ bug should no longer be an issue as the buggy commit has since been
>>> reverted.
>>
>> It still is because I can reproduce the bug, as also pointed out in
>> multiple comments below.
>
> You can reproduce _a_ bug, but it's obviously not the original bug, because the
> last comment says:
>
> Second, indeed the patch was reverted and somehow accepted without generating
> too much noise:
>
> ...
>
> The underlying issue of course as we both know is still there.
>
> You might have luck reproducing it with this bug
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1855298
>
> But for me it looks like it is 'working' as well, so you might have
> to write a unit test to trigger the issue.
>
>>> If the issue is KVM doing something nonsensical on a code fetch to MMIO, then I'd
>>> much rather fix _that_ bug and improve KVM's user exit ABI to let userspace handle
>>> the race _if_ userspace chooses not to pause vCPUs.
>>>
>>
>> Also on the BZ they all seem (Paolo included) to agree that the issue is
>> non-atomic memslots update.
>
> Yes, non-atomic memslot likely results in the guest fetching from a GPA without a
> memslot. I'm asking for an explanation of exactly what happens when that occurs,
> because it should be possible to adjust KVM and/or QEMU to play nice with the
> fetch, e.g. to resume the guest until the new memslot is installed, in which case
> an atomic update isn't needed.
>
> I assume the issue is that KVM exits with KVM_EXIT_INTERNAL_ERROR because the
> guest is running at CPL=0, and QEMU kills the guest in response. If that's correct,
> then that problem can be solved by exiting to userspace with KVM_EXIT_MMIO instead
> of KVM_EXIT_INTERNAL_ERROR so that userspace can do something sane in response to
> the MMIO code fetch.
>
> I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes
> the vCPU if mmio.len==0. It's a bit of a hack, but I don't think it violates KVM's
> ABI in any way, and it can even become "official" behavior since KVM x86 doesn't
> otherwise exit with mmio.len==0.
I think this patch is not a good idea for two reasons:
1) we don't know how userspace behaves if mmio.len is zero. It is of
course reasonable to do nothing, but an assertion failure is also a
valid behavior
2) more important, there is no way to distinguish a failure due to the
guest going in the weeds (and then KVM_EXIT_INTERNAL_ERROR is fine) from
one due to the KVM_SET_USER_MEMORY_REGION race condition. So this will
cause a guest that correctly caused an internal error to loop forever.
While the former could be handled in a "wait and see" manner, the latter
in particular is part of the KVM_RUN contract. Of course it is possible
for a guest to just loop forever, but in general all of KVM, QEMU and
upper userspace layers want a crashed guest to be detected and stopped
forever.
Yes, QEMU could loop only if memslot updates are in progress, but
honestly all the alternatives I have seen to atomic memslot updates are
really *awful*. David's patches even invent a new kind of mutex for
which I have absolutely no idea what kind of deadlocks one should worry
about and why they should not exist; QEMU's locking is already pretty
crappy, it's certainly not on my wishlist to make it worse!
This is clearly a deficiency in the KVM kernel API, and (thanks to SRCU)
the kernel is the only place where you can have a *good* fix. It should
have been fixed years ago.
Paolo
On 28.09.22 17:07, Paolo Bonzini wrote:
> On 9/27/22 17:58, Sean Christopherson wrote:
>> On Tue, Sep 27, 2022, Emanuele Giuseppe Esposito wrote:
>>>
>>> Am 26/09/2022 um 23:28 schrieb Sean Christopherson:
>>>> On Mon, Sep 26, 2022, David Hildenbrand wrote:
>>>>> As Sean said "This is an awful lot of a complexity to take on for something
>>>>> that appears to be solvable in userspace."
>>>>
>>>> And if the userspace solution is unpalatable for whatever reason, I'd like to
>>>> understand exactly what KVM behavior is problematic for userspace. E.g. the
>>>> above RHBZ bug should no longer be an issue as the buggy commit has since been
>>>> reverted.
>>>
>>> It still is because I can reproduce the bug, as also pointed out in
>>> multiple comments below.
>>
>> You can reproduce _a_ bug, but it's obviously not the original bug, because the
>> last comment says:
>>
>> Second, indeed the patch was reverted and somehow accepted without generating
>> too much noise:
>>
>> ...
>>
>> The underlying issue of course as we both know is still there.
>>
>> You might have luck reproducing it with this bug
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=1855298
>>
>> But for me it looks like it is 'working' as well, so you might have
>> to write a unit test to trigger the issue.
>>
>>>> If the issue is KVM doing something nonsensical on a code fetch to MMIO, then I'd
>>>> much rather fix _that_ bug and improve KVM's user exit ABI to let userspace handle
>>>> the race _if_ userspace chooses not to pause vCPUs.
>>>>
>>>
>>> Also on the BZ they all seem (Paolo included) to agree that the issue is
>>> non-atomic memslots update.
>>
>> Yes, non-atomic memslot likely results in the guest fetching from a GPA without a
>> memslot. I'm asking for an explanation of exactly what happens when that occurs,
>> because it should be possible to adjust KVM and/or QEMU to play nice with the
>> fetch, e.g. to resume the guest until the new memslot is installed, in which case
>> an atomic update isn't needed.
>>
>> I assume the issue is that KVM exits with KVM_EXIT_INTERNAL_ERROR because the
>> guest is running at CPL=0, and QEMU kills the guest in response. If that's correct,
>> then that problem can be solved by exiting to userspace with KVM_EXIT_MMIO instead
>> of KVM_EXIT_INTERNAL_ERROR so that userspace can do something sane in response to
>> the MMIO code fetch.
>>
>> I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes
>> the vCPU if mmio.len==0. It's a bit of a hack, but I don't think it violates KVM's
>> ABI in any way, and it can even become "official" behavior since KVM x86 doesn't
>> otherwise exit with mmio.len==0.
>
> I think this patch is not a good idea for two reasons:
>
> 1) we don't know how userspace behaves if mmio.len is zero. It is of
> course reasonable to do nothing, but an assertion failure is also a
> valid behavior
>
> 2) more important, there is no way to distinguish a failure due to the
> guest going in the weeds (and then KVM_EXIT_INTERNAL_ERROR is fine) from
> one due to the KVM_SET_USER_MEMORY_REGION race condition. So this will
> cause a guest that correctly caused an internal error to loop forever.
>
> While the former could be handled in a "wait and see" manner, the latter
> in particular is part of the KVM_RUN contract. Of course it is possible
> for a guest to just loop forever, but in general all of KVM, QEMU and
> upper userspace layers want a crashed guest to be detected and stopped
> forever.
>
> Yes, QEMU could loop only if memslot updates are in progress, but
> honestly all the alternatives I have seen to atomic memslot updates are
> really *awful*. David's patches even invent a new kind of mutex for
> which I have absolutely no idea what kind of deadlocks one should worry
> about and why they should not exist; QEMU's locking is already pretty
> crappy, it's certainly not on my wishlist to make it worse!
Just to comment on that (I'm happy as long as this gets fixed), a simple
mutex with trylock should get the thing done as well -- kicking the VCPU
if the trylock fails. But I did not look further into locking alternatives.
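
Roughly, the trylock idea could look like the sketch below. This is a
guess at the shape of it rather than David's actual patches; struct vcpu,
the mutex name and kick_sig are made-up names, and a real implementation
also needs an "update pending" flag so a kicked vCPU does not immediately
re-enter KVM_RUN and re-take its lock before the updater gets it.

#include <pthread.h>
#include <signal.h>

struct vcpu {
        pthread_t thread;
        pthread_mutex_t in_kvm_run;     /* held while the vCPU is in KVM_RUN */
};

static void lock_all_vcpus_for_update(struct vcpu *vcpus, int n, int kick_sig)
{
        for (int i = 0; i < n; i++) {
                if (pthread_mutex_trylock(&vcpus[i].in_kvm_run) == 0)
                        continue;       /* vCPU was not in KVM_RUN, lock taken */
                /* vCPU is in KVM_RUN: kick it out, then wait for its lock. */
                pthread_kill(vcpus[i].thread, kick_sig);
                pthread_mutex_lock(&vcpus[i].in_kvm_run);
        }
        /* ... apply the memslot updates, then unlock all the mutexes ... */
}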
--
Thanks,
David / dhildenb
On Wed, Sep 28, 2022, Paolo Bonzini wrote:
> On 9/27/22 17:58, Sean Christopherson wrote:
> > I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes
> > the vCPU if mmio.len==0. It's a bit of a hack, but I don't think it violates KVM's
> > ABI in any way, and it can even become "official" behavior since KVM x86 doesn't
> > otherwise exit with mmio.len==0.
>
> I think this patch is not a good idea for two reasons:
>
> 1) we don't know how userspace behaves if mmio.len is zero. It is of course
> reasonable to do nothing, but an assertion failure is also a valid behavior
Except that KVM currently does neither. If the fetch happens at CPL>0 and/or in
L2, KVM injects #UD. That's flat out architecturally invalid. If it's a sticking
point, the mmio.len==0 hack can be avoided by defining a new exit reason.
> 2) more important, there is no way to distinguish a failure due to the guest
> going in the weeds (and then KVM_EXIT_INTERNAL_ERROR is fine) from one due
> to the KVM_SET_USER_MEMORY_REGION race condition. So this will cause a
> guest that correctly caused an internal error to loop forever.
Userspace has the GPA and absolutely should be able to detect if the MMIO may have
been due to its memslot manipulation versus the guest jumping into the weeds.
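
As a sketch of what such a userspace check could look like (not QEMU
code): the VMM records the GPA range of the in-flight memslot update and,
on a zero-length MMIO exit, resumes only if the faulting GPA falls inside
that range; the inflight_* variables and the fatal-error handling below
are placeholders.

#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <linux/kvm.h>

static uint64_t inflight_update_gpa;    /* set before deleting the old slot */
static uint64_t inflight_update_size;   /* 0 when no update is in flight */

static bool mmio_fetch_is_transient(uint64_t gpa)
{
        return inflight_update_size != 0 &&
               gpa >= inflight_update_gpa &&
               gpa - inflight_update_gpa < inflight_update_size;
}

/* Called from the run loop on KVM_EXIT_MMIO with run->mmio.len == 0. */
static void handle_zero_len_mmio(struct kvm_run *run)
{
        if (mmio_fetch_is_transient(run->mmio.phys_addr))
                return;         /* raced with our update: just re-enter the guest */

        /* Not ours: the guest really jumped into unmapped memory. */
        fprintf(stderr, "fatal MMIO fetch at GPA 0x%" PRIx64 "\n",
                (uint64_t)run->mmio.phys_addr);
        abort();
}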
> While the former could be handled in a "wait and see" manner, the latter in
> particular is part of the KVM_RUN contract. Of course it is possible for a
> guest to just loop forever, but in general all of KVM, QEMU and upper
> userspace layers want a crashed guest to be detected and stopped forever.
>
> Yes, QEMU could loop only if memslot updates are in progress, but honestly
> all the alternatives I have seen to atomic memslot updates are really
> *awful*. David's patches even invent a new kind of mutex for which I have
> absolutely no idea what kind of deadlocks one should worry about and why
> they should not exist; QEMU's locking is already pretty crappy, it's
> certainly not on my wishlist to make it worse!
>
> This is clearly a deficiency in the KVM kernel API, and (thanks to SRCU) the
> kernel is the only place where you can have a *good* fix. It should have
> been fixed years ago.
I don't disagree that the memslots API is lacking, but IMO that is somewhat
orthogonal to fixing KVM x86's "code fetch to MMIO" mess. Such a massive new API
should be viewed and prioritized as a new feature, not as a bug fix, e.g. I'd
like to have the luxury of being able to explore ideas beyond "let userspace
batch memslot updates", and I really don't want to feel pressured to get this
code reviewed and merged.
E.g. why do a batch update and not provide KVM_SET_ALL_USER_MEMORY_REGIONS to
do wholesale replacement? That seems like it would be vastly simpler to handle
on KVM's end. Or maybe there's a solution in the opposite direction, e.g. an
API that allows 1->N or N->1 conversions but not arbitrary batching.
And just because QEMU's locking is "already pretty crappy", that's not a good
reason to drag KVM down into the mud. E.g. taking a lock and conditionally
releasing it... I get that this is an RFC, but IMO anything that requires such
shenanigans simply isn't acceptable.
/*
* Takes kvm->slots_arch_lock, and releases it only if
* invalid_slot allocation, kvm_prepare_memory_region failed
* or batch->is_move_delete is true.
*/
static int kvm_prepare_memslot(struct kvm *kvm,
struct kvm_internal_memory_region_list *batch)
On 9/28/22 17:58, Sean Christopherson wrote:
> On Wed, Sep 28, 2022, Paolo Bonzini wrote:
>> On 9/27/22 17:58, Sean Christopherson wrote:
>>> I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes
>>> the vCPU if mmio.len==0. It's a bit of a hack, but I don't think it violates KVM's
>>> ABI in any way, and it can even become "official" behavior since KVM x86 doesn't
>>> otherwise exit with mmio.len==0.
>>
>> I think this patch is not a good idea for two reasons:
>>
>> 1) we don't know how userspace behaves if mmio.len is zero. It is of course
>> reasonable to do nothing, but an assertion failure is also a valid behavior
>
> Except that KVM currently does neither. If the fetch happens at CPL>0 and/or in
> L2, KVM injects #UD. That's flat out architecturally invalid. If it's a sticking
> point, the mmio.len==0 hack can be avoided by defining a new exit reason.
I agree that doing this at CPL>0 or in L2 is invalid and makes little
sense (because either way the invalid address cannot be reached without
help from the supervisor or L1's page tables).
>> 2) more important, there is no way to distinguish a failure due to the guest
>> going in the weeds (and then KVM_EXIT_INTERNAL_ERROR is fine) from one due
>> to the KVM_SET_USER_MEMORY_REGION race condition. So this will cause a
>> guest that correctly caused an internal error to loop forever.
>
> Userspace has the GPA and absolutely should be able to detect if the MMIO may have
> been due to its memslot manipulation versus the guest jumping into the weeds.
>
>> While the former could be handled in a "wait and see" manner, the latter in
>> particular is part of the KVM_RUN contract. Of course it is possible for a
>> guest to just loop forever, but in general all of KVM, QEMU and upper
>> userspace layers want a crashed guest to be detected and stopped forever.
>>
>> Yes, QEMU could loop only if memslot updates are in progress, but honestly
>> all the alternatives I have seen to atomic memslot updates are really
>> *awful*. David's patches even invent a new kind of mutex for which I have
>> absolutely no idea what kind of deadlocks one should worry about and why
>> they should not exist; QEMU's locking is already pretty crappy, it's
>> certainly not on my wishlist to make it worse!
>>
>> This is clearly a deficiency in the KVM kernel API, and (thanks to SRCU) the
>> kernel is the only place where you can have a *good* fix. It should have
>> been fixed years ago.
>
> I don't disagree that the memslots API is lacking, but IMO that is somewhat
> orthogonal to fixing KVM x86's "code fetch to MMIO" mess. Such a massive new API
> should be viewed and prioritized as a new feature, not as a bug fix, e.g. I'd
> like to have the luxury of being able to explore ideas beyond "let userspace
> batch memslot updates", and I really don't want to feel pressured to get this
> code reviewed and merge.
I absolutely agree that this is not a bugfix. Most new features for KVM
can be seen as bug fixes if you squint hard enough, but they're still
features.
> E.g. why do a batch update and not provide KVM_SET_ALL_USER_MEMORY_REGIONS to
> do wholesale replacement? That seems like it would be vastly simpler to handle
> on KVM's end. Or maybe there's a solution in the opposite direction, e.g. an
> API that allows 1->N or N->1 conversions but not arbitrary batching.
Wholesale replacement was my first idea when I looked at the issue, I
think at the end of 2020. I never got to a full implementation, but my
impression was that allocating/deallocating dirty bitmaps, rmaps, etc.
would not make it any easier than arbitrary batch updates.
> And just because QEMU's locking is "already pretty crappy", that's not a good
> reason to drag KVM down into the mud. E.g. taking a lock and conditionally
> releasing it... I get that this is an RFC, but IMO anything that requires such
> shenanigans simply isn't acceptable.
>
> /*
> * Takes kvm->slots_arch_lock, and releases it only if
> * invalid_slot allocation, kvm_prepare_memory_region failed
> * or batch->is_move_delete is true.
> */
> static int kvm_prepare_memslot(struct kvm *kvm,
> struct kvm_internal_memory_region_list *batch)
>
No objection about that. :)
Paolo
On 9/9/22 12:44, Emanuele Giuseppe Esposito wrote:
> This IOCTL enables atomic update of multiple memslots.
> The userspace application provides a kvm_userspace_memory_region_list
> containing a list of entries, each representing a modification to be
> performed to a memslot.
>
> Requests with invalidate_slot == 1 are pre-processed, because they
> are ther DELETE or MOVE, and therefore the memslot must be first
> replaced with a copy marked as KVM_MEMSLOT_INVALID, and then replaced.
>
> Signed-off-by: Emanuele Giuseppe Esposito <[email protected]>
> ---
> include/uapi/linux/kvm.h | 21 ++++++++++++++++++++-
> 1 file changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index a36e78710382..673496b91a25 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -103,6 +103,24 @@ struct kvm_userspace_memory_region {
> __u64 userspace_addr; /* start of the userspace allocated memory */
> };
>
> +/* for KVM_SET_USER_MEMORY_REGION_LIST */
> +struct kvm_userspace_memory_region_entry {
> + __u32 slot;
> + __u32 flags;
> + __u64 guest_phys_addr;
> + __u64 memory_size; /* bytes */
> + __u64 userspace_addr; /* start of the userspace allocated memory */
> + __u8 invalidate_slot;
> + __u8 padding[31];
> +};
> +
> +/* for KVM_SET_USER_MEMORY_REGION_LIST */
> +struct kvm_userspace_memory_region_list {
> + __u32 nent;
> + __u32 flags;
> + struct kvm_userspace_memory_region_entry entries[0];
> +};
> +
> /*
> * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
> * other bits are reserved for kvm internal use which are defined in
> @@ -1444,7 +1462,8 @@ struct kvm_vfio_spapr_tce {
> struct kvm_userspace_memory_region)
> #define KVM_SET_TSS_ADDR _IO(KVMIO, 0x47)
> #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO, 0x48, __u64)
> -
> +#define KVM_SET_USER_MEMORY_REGION_LIST _IOW(KVMIO, 0x49, \
> + struct kvm_userspace_memory_region_list)
> /* enable ucontrol for s390 */
> struct kvm_s390_ucas_mapping {
> __u64 user_addr;
Looks good; however, in the non-RFC I suggest adding documentation in
this patch already (so no Reviewed-by yet).
Paolo
On 9/9/22 12:45, Emanuele Giuseppe Esposito wrote:
> Just a function split. No functional change intended,
> except for the fact that kvm_prepare_batch() does not
> immediately call kvm_set_memslot() if batch->change is
> KVM_MR_DELETE, but delegates that to the caller (__kvm_set_memory_region).
>
> Signed-off-by: Emanuele Giuseppe Esposito <[email protected]>
> ---
> virt/kvm/kvm_main.c | 120 +++++++++++++++++++++++++++++---------------
> 1 file changed, 79 insertions(+), 41 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 17f07546d591..9d917af30593 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1927,19 +1927,9 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
> return false;
> }
>
> -/*
> - * Allocate some memory and give it an address in the guest physical address
> - * space.
> - *
> - * Discontiguous memory is allowed, mostly for framebuffers.
> - * This function takes also care of initializing batch->new/old/invalid/change
> - * fields.
> - *
> - * Must be called holding kvm->slots_lock for write.
> - */
> -int __kvm_set_memory_region(struct kvm *kvm,
> - const struct kvm_userspace_memory_region *mem,
> - struct kvm_internal_memory_region_list *batch)
> +static int kvm_prepare_batch(struct kvm *kvm,
> + const struct kvm_userspace_memory_region *mem,
> + struct kvm_internal_memory_region_list *batch)
> {
> struct kvm_memory_slot *old, *new;
> struct kvm_memslots *slots;
> @@ -1947,34 +1937,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
> unsigned long npages;
> gfn_t base_gfn;
> int as_id, id;
> - int r;
> -
> - r = check_memory_region_flags(mem);
> - if (r)
> - return r;
>
> as_id = mem->slot >> 16;
> id = (u16)mem->slot;
>
> - /* General sanity checks */
> - if ((mem->memory_size & (PAGE_SIZE - 1)) ||
> - (mem->memory_size != (unsigned long)mem->memory_size))
> - return -EINVAL;
> - if (mem->guest_phys_addr & (PAGE_SIZE - 1))
> - return -EINVAL;
> - /* We can read the guest memory with __xxx_user() later on. */
> - if ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
> - (mem->userspace_addr != untagged_addr(mem->userspace_addr)) ||
> - !access_ok((void __user *)(unsigned long)mem->userspace_addr,
> - mem->memory_size))
> - return -EINVAL;
> - if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
> - return -EINVAL;
> - if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
> - return -EINVAL;
> - if ((mem->memory_size >> PAGE_SHIFT) > KVM_MEM_MAX_NR_PAGES)
> - return -EINVAL;
> -
> slots = __kvm_memslots(kvm, as_id);
>
> /*
> @@ -1993,7 +1959,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
>
> batch->change = KVM_MR_DELETE;
> batch->new = NULL;
> - return kvm_set_memslot(kvm, batch);
> + return 0;
> }
>
> base_gfn = (mem->guest_phys_addr >> PAGE_SHIFT);
> @@ -2020,7 +1986,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> else if (mem->flags != old->flags)
> change = KVM_MR_FLAGS_ONLY;
> else /* Nothing to change. */
> - return 0;
> + return 1;
> }
>
> if ((change == KVM_MR_CREATE || change == KVM_MR_MOVE) &&
> @@ -2041,12 +2007,81 @@ int __kvm_set_memory_region(struct kvm *kvm,
>
> batch->new = new;
> batch->change = change;
> + return 0;
> +}
> +
> +static int kvm_check_mem(const struct kvm_userspace_memory_region *mem)
> +{
> + int as_id, id;
> +
> + as_id = mem->slot >> 16;
> + id = (u16)mem->slot;
> +
> + /* General sanity checks */
> + if ((mem->memory_size & (PAGE_SIZE - 1)) ||
> + (mem->memory_size != (unsigned long)mem->memory_size))
> + return -EINVAL;
> + if (mem->guest_phys_addr & (PAGE_SIZE - 1))
> + return -EINVAL;
> + /* We can read the guest memory with __xxx_user() later on. */
> + if ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
> + (mem->userspace_addr != untagged_addr(mem->userspace_addr)) ||
> + !access_ok((void __user *)(unsigned long)mem->userspace_addr,
> + mem->memory_size))
> + return -EINVAL;
> + if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
> + return -EINVAL;
> + if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
> + return -EINVAL;
> + if ((mem->memory_size >> PAGE_SHIFT) > KVM_MEM_MAX_NR_PAGES)
> + return -EINVAL;
> +
> + return 0;
> +}
> +
> +static int kvm_check_memory_region(struct kvm *kvm,
> + const struct kvm_userspace_memory_region *mem,
> + struct kvm_internal_memory_region_list *batch)
> +{
> + int r;
> +
> + r = check_memory_region_flags(mem);
> + if (r)
> + return r;
>
> - r = kvm_set_memslot(kvm, batch);
> + r = kvm_check_mem(mem);
> if (r)
> - kfree(new);
> + return r;
> +
> + r = kvm_prepare_batch(kvm, mem, batch);
> + if (r && batch->new)
> + kfree(batch->new);
> +
> return r;
> }
> +
> +/*
> + * Allocate some memory and give it an address in the guest physical address
> + * space.
> + *
> + * Discontiguous memory is allowed, mostly for framebuffers.
> + * This function takes also care of initializing batch->new/old/invalid/change
> + * fields.
> + *
> + * Must be called holding kvm->slots_lock for write.
> + */
> +int __kvm_set_memory_region(struct kvm *kvm,
> + const struct kvm_userspace_memory_region *mem,
> + struct kvm_internal_memory_region_list *batch)
> +{
> + int r;
> +
> + r = kvm_check_memory_region(kvm, mem, batch);
> + if (r)
> + return r;
> +
> + return kvm_set_memslot(kvm, batch);
> +}
> EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
>
> static int kvm_set_memory_region(struct kvm *kvm,
> @@ -2061,6 +2096,9 @@ static int kvm_set_memory_region(struct kvm *kvm,
> mutex_lock(&kvm->slots_lock);
> r = __kvm_set_memory_region(kvm, mem, batch);
> mutex_unlock(&kvm->slots_lock);
> + /* r == 1 means empty request, nothing to do but no error */
> + if (r == 1)
> + r = 0;
This is weird... I think you have the order of the split slightly
messed up. Doing this check earlier, roughly at the same time as the
introduction of the new struct, is probably clearer.
After having reviewed more of the code, I do think you should
disaggregate __kvm_set_memory_region() into separate functions (check,
build entry, prepare, commit) completely: kvm_set_memory_region() calls
them and __kvm_set_memory_region() disappears entirely.
Paolo
> return r;
> }
>
On 9/9/22 12:45, Emanuele Giuseppe Esposito wrote:
> /*
> - * if change is DELETE or MOVE, invalid is in active memslots
> - * and old in inactive, so replace old with new.
> + * if change is DELETE or MOVE, invalid is in both active and inactive
> + * memslot list. This means that we don't need old anymore, and
> + * we should replace invalid with new.
> */
> - kvm_replace_memslot(kvm, batch->old, batch->new);
> + if (batch->change == KVM_MR_DELETE || batch->change == KVM_MR_MOVE)
> + kvm_replace_memslot(kvm, batch->invalid, batch->new);
> + else
> + kvm_replace_memslot(kvm, batch->old, batch->new);
This can also be written as
    kvm_replace_memslot(kvm, batch->invalid ?: batch->old,
                        batch->new);
with no need to look at batch->change.
Paolo
On 9/9/22 12:44, Emanuele Giuseppe Esposito wrote:
> And make kvm_set_memory_region static, since it is not used outside
> kvm_main.c
>
> Signed-off-by: Emanuele Giuseppe Esposito <[email protected]>
> ---
> include/linux/kvm_host.h | 2 --
> virt/kvm/kvm_main.c | 11 +++++------
> 2 files changed, 5 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3b40f8d68fbb..1c5b7b2e35dd 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1108,8 +1108,6 @@ enum kvm_mr_change {
> KVM_MR_FLAGS_ONLY,
> };
>
> -int kvm_set_memory_region(struct kvm *kvm,
> - const struct kvm_userspace_memory_region *mem);
> int __kvm_set_memory_region(struct kvm *kvm,
> const struct kvm_userspace_memory_region *mem);
> void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index da263c370d00..339de0ed4557 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2007,24 +2007,23 @@ int __kvm_set_memory_region(struct kvm *kvm,
> }
> EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
>
> -int kvm_set_memory_region(struct kvm *kvm,
> - const struct kvm_userspace_memory_region *mem)
> +static int kvm_set_memory_region(struct kvm *kvm,
> + const struct kvm_userspace_memory_region *mem)
> {
> int r;
>
> + if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
> + return -EINVAL;
> +
> mutex_lock(&kvm->slots_lock);
> r = __kvm_set_memory_region(kvm, mem);
> mutex_unlock(&kvm->slots_lock);
> return r;
> }
> -EXPORT_SYMBOL_GPL(kvm_set_memory_region);
>
> static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
> struct kvm_userspace_memory_region *mem)
> {
> - if ((u16)mem->slot >= KVM_USER_MEM_SLOTS)
> - return -EINVAL;
> -
> return kvm_set_memory_region(kvm, mem);
> }
>
The idea here was that kvm_set_memory_region could be used to set
private memory slots while not taking kvm->slots_lock.
So, I would instead:
1) rename __kvm_set_memory_region to kvm_set_memory_region;
2) inline the old kvm_set_memory_region into kvm_vm_ioctl_set_memory_region.
3) replace the comment "Must be called holding kvm->slots_lock for
write." with a proper lockdep_assert_held() now that the function
doesn't have the __ warning sign in front of it.
Paolo
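Concretely, suggestion (3) above would amount to something like the following at the top of the renamed function (a sketch, not the actual patch):

    int kvm_set_memory_region(struct kvm *kvm,
                              const struct kvm_userspace_memory_region *mem)
    {
            /* Replaces the "Must be called holding kvm->slots_lock" comment. */
            lockdep_assert_held(&kvm->slots_lock);

            /* ... existing __kvm_set_memory_region() body continues here ... */
    }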
On 9/9/22 12:45, Emanuele Giuseppe Esposito wrote:
> @@ -1782,7 +1782,8 @@ static void kvm_update_flags_memslot(struct kvm *kvm,
>
> /*
> * Takes kvm->slots_arch_lock, and releases it only if
> - * invalid_slot allocation or kvm_prepare_memory_region failed.
> + * invalid_slot allocation, kvm_prepare_memory_region failed
> + * or batch->is_move_delete is true.
> */
This _is_ getting out of hand, though. :) It is not clear to me why you
need to do multiple swaps. If you did a single swap, you could do all
the allocations of invalid slots in a separate loop, called with
slots_arch_lock taken and not released until the final swap. In other
words, something like
    mutex_lock(&kvm->slots_arch_lock);
    r = create_invalid_slots();
    if (r < 0)
        return r;
    replace_old_slots();
    r = prepare_memory_regions();   // also handles abort on failure
    if (r < 0)
        return r;
    swap();
    finish_memslots();
where each function is little more than a loop over the corresponding
function called by KVM_SET_MEMORY_REGION.
Paolo
On 9/9/22 12:45, Emanuele Giuseppe Esposito wrote:
> +/*
> + * Takes kvm->slots_arch_lock, and releases it only if
> + * invalid_slot allocation or kvm_prepare_memory_region failed.
> +*/
Much simpler: "kvm->slots_arch_lock is taken on a successful return."
This is a small change in phrasing, but it makes a lot more sense: on
success you are preparing for the final commit operation, otherwise you
just want the caller to return your errno value.
[...]
> +/* Must be called with kvm->slots_arch_lock held, but releases it. */
s/but/and/. Even better, "and releases it before returning". "But"
tells the reader that something strange is going on, "and" tells it that
something simple is going on.
I would also rename the functions along the lines of my review to patch
3, to highlight that these functions prepare/commit a *change* to a memslot.
> +static void kvm_finish_memslot(struct kvm *kvm,
> + struct kvm_internal_memory_region_list *batch)
> +{
> + struct kvm_memory_slot *invalid_slot = batch->invalid;
> + struct kvm_memory_slot *old = batch->old;
> + struct kvm_memory_slot *new = batch->new;
> + enum kvm_mr_change change = batch->change;
lockdep_assert_held(&kvm->slots_arch_lock);
>
> /*
> * For DELETE and MOVE, the working slot is now active as the INVALID
> @@ -1883,6 +1898,18 @@ static int kvm_set_memslot(struct kvm *kvm,
> * responsible for knowing that new->arch may be stale.
> */
> kvm_commit_memory_region(kvm, batch);
> +}
> +
> +static int kvm_set_memslot(struct kvm *kvm,
> + struct kvm_internal_memory_region_list *batch)
> +{
> + int r;
> +
> + r = kvm_prepare_memslot(kvm, batch);
> + if (r)
> + return r;
> +
> + kvm_finish_memslot(kvm, batch);
>
> return 0;
Ok, these are the functions I hinted at in the review of the previous
patch, so we're not far away. You should be able to move the
kvm_set_memslot call to kvm_set_memory_region in patch 3, and then
replace it with the two calls here directly in kvm_set_memory_region.
Paolo
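Putting these suggestions together, the end state might look roughly like the sketch below; the helper names are the ones introduced in this series, the exact split is not from any posted patch, and the r == 1 convention is the "nothing to change" case discussed earlier:

    static int kvm_set_memory_region(struct kvm *kvm,
                                     const struct kvm_userspace_memory_region *mem,
                                     struct kvm_internal_memory_region_list *batch)
    {
            int r;

            mutex_lock(&kvm->slots_lock);

            /* Sanity checks plus filling in batch->old/new/invalid/change. */
            r = kvm_check_memory_region(kvm, mem, batch);
            if (r)
                    goto out;

            /* Takes kvm->slots_arch_lock on success. */
            r = kvm_prepare_memslot(kvm, batch);
            if (r)
                    goto out;

            /* Swaps the memslot lists and releases kvm->slots_arch_lock. */
            kvm_finish_memslot(kvm, batch);
    out:
            mutex_unlock(&kvm->slots_lock);
            /* r == 1 means "nothing to change", which is not an error. */
            return r == 1 ? 0 : r;
    }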
On Wed, Sep 28, 2022, Paolo Bonzini wrote:
> On 9/28/22 17:58, Sean Christopherson wrote:
> > I don't disagree that the memslots API is lacking, but IMO that is somewhat
> > orthogonal to fixing KVM x86's "code fetch to MMIO" mess. Such a massive new API
> > should be viewed and prioritized as a new feature, not as a bug fix, e.g. I'd
> > like to have the luxury of being able to explore ideas beyond "let userspace
> > batch memslot updates", and I really don't want to feel pressured to get this
> > code reviewed and merge.
>
> I absolutely agree that this is not a bugfix. Most new features for KVM can
> be seen as bug fixes if you squint hard enough, but they're still features.
I guess I'm complaining that there isn't sufficient justification for this new
feature. The cover letter provides a bug that would be fixed by having batched
updates, but as above, that's really due to deficiencies in a different KVM ABI.
Beyond that, there's no explanation of why this exact API is necessary, i.e. there
are no requirements given.
- Why can't this be solved in userspace?
- Is performance a concern? I.e. are updates that need to be batched going to
be high frequency operations?
- What operations does userspace truly need? E.g. if the only use case is to
split/truncate/hole punch an existing memslot, can KVM instead provide a
memslot flag and exit reason that allows kicking vCPUs to userspace if the
memslot is accessed? E.g. KVM_MEM_DISABLED that acts like an invalid memslot,
but KVM exits with a dedicated exit reason instead of generating MMIO semantics.
- What precisely needs to be atomic? If updates only need to be "atomic" for
an address space, does the API allow mixing non-SMM and SMM memslots?
- When is KVM required to invalidate and flush? E.g. if a memslot is deleted
and recreated with a different HVA, how does KVM ensure that there are no
outstanding references to the old HVA without introducing non-deterministic
behavior. The current API works by forcing userspace to fully delete the
memslot, i.e. KVM can ensure all references are gone in all TLBs before
allowing userspace to create new, conflicting entries. I don't see how this
can work with batched updates. The update needs to be "atomic", i.e. vCPUs
must never see an invalid/deleted memslot, but if the memslot is writable,
how does KVM prevent some writes from hitting the old HVA and some from hitting
the new HVA without a quiescent period?
- If a memslot is truncated while dirty logging is enabled, what happens to
the bitmap? Is it preserved? Dropped?
Again, I completely agree that the current memslots API is far from perfect, but
I'm not convinced that simply extending the existing API to batch updates is the
best solution from a KVM perspective.
> > E.g. why do a batch update and not provide KVM_SET_ALL_USER_MEMORY_REGIONS to
> > do wholesale replacement? That seems like it would be vastly simpler to handle
> > on KVM's end. Or maybe there's a solution in the opposite direction, e.g. an
> > API that allows 1->N or N->1 conversions but not arbitrary batching.
>
> Wholesale replacement was my first idea when I looked at the issue, I think
> at the end of 2020. I never got to a full implementation, but my impression
> was that allocating/deallocating dirty bitmaps, rmaps etc. would make it any
> easier than arbitrary batch updates.
It's not obvious to me that the memslot metadata is going to be easy to handle
regardless of what we do. E.g. I'm pretty sure that batching updates will "corrupt"
the dirty bitmap if a hole is punched in a memslot that's being dirty logged.
On 29.09.22 10:05, Emanuele Giuseppe Esposito wrote:
>
>
> Am 28/09/2022 um 22:41 schrieb Sean Christopherson:
>> On Wed, Sep 28, 2022, Paolo Bonzini wrote:
>>> On 9/28/22 17:58, Sean Christopherson wrote:
>>>> I don't disagree that the memslots API is lacking, but IMO that is somewhat
>>>> orthogonal to fixing KVM x86's "code fetch to MMIO" mess. Such a massive new API
>>>> should be viewed and prioritized as a new feature, not as a bug fix, e.g. I'd
>>>> like to have the luxury of being able to explore ideas beyond "let userspace
>>>> batch memslot updates", and I really don't want to feel pressured to get this
>>>> code reviewed and merge.
>>>
>>> I absolutely agree that this is not a bugfix. Most new features for KVM can
>>> be seen as bug fixes if you squint hard enough, but they're still features.
>>
>> I guess I'm complaining that there isn't sufficient justification for this new
>> feature. The cover letter provides a bug that would be fixed by having batched
>> updates, but as above, that's really due to deficiencies in a different KVM ABI.
>>
>> Beyond that, there's no explanation of why this exact API is necessary, i.e. there
>> are no requirements given.
>>
>> - Why can't this be solved in userspace?
>
> Because this would provide the "feature" only to QEMU, leaving each
> other hypervisor to implement its own.
>
> In addition (maybe you already answered this question but I couldn't
> find an answer in the email thread), does it make sense to stop all
> vcpus for a couple of memslot update? What if we have 100 cpus?
>
>>
>> - Is performance a concern? I.e. are updates that need to be batched going to
>> be high frequency operations?
>
> Currently they are limited to run only at boot. In an unmodified
> KVM/QEMU build, however, I count 86 memslot updates done at boot with
>
> ./qemu-system-x86_64 --overcommit cpu-pm=on --smp $v --accel kvm
> --display none
I *think* there are only ~3 problematic ones (split/resize), where we
temporarily delete something we will re-add. At least that's what I
remember from working on my prototype.
--
Thanks,
David / dhildenb
Am 28/09/2022 um 22:41 schrieb Sean Christopherson:
> On Wed, Sep 28, 2022, Paolo Bonzini wrote:
>> On 9/28/22 17:58, Sean Christopherson wrote:
>>> I don't disagree that the memslots API is lacking, but IMO that is somewhat
>>> orthogonal to fixing KVM x86's "code fetch to MMIO" mess. Such a massive new API
>>> should be viewed and prioritized as a new feature, not as a bug fix, e.g. I'd
>>> like to have the luxury of being able to explore ideas beyond "let userspace
>>> batch memslot updates", and I really don't want to feel pressured to get this
>>> code reviewed and merge.
>>
>> I absolutely agree that this is not a bugfix. Most new features for KVM can
>> be seen as bug fixes if you squint hard enough, but they're still features.
>
> I guess I'm complaining that there isn't sufficient justification for this new
> feature. The cover letter provides a bug that would be fixed by having batched
> updates, but as above, that's really due to deficiencies in a different KVM ABI.
>
> Beyond that, there's no explanation of why this exact API is necessary, i.e. there
> are no requirements given.
>
> - Why can't this be solved in userspace?
Because this would provide the "feature" only to QEMU, leaving every
other hypervisor to implement its own.
In addition (maybe you already answered this question but I couldn't
find an answer in the email thread), does it make sense to stop all
vcpus for a couple of memslot updates? What if we have 100 vcpus?
>
> - Is performance a concern? I.e. are updates that need to be batched going to
> be high frequency operations?
Currently they are limited to run only at boot. In an unmodified
KVM/QEMU build, however, I count 86 memslot updates done at boot with
./qemu-system-x86_64 --overcommit cpu-pm=on --smp $v --accel kvm
--display none
>
> - What operations does userspace truly need? E.g. if the only use case is to
> split/truncate/hole punch an existing memslot, can KVM instead provide a
> memslot flag and exit reason that allows kicking vCPUs to userspace if the
> memslot is accessed? E.g. KVM_MEM_DISABLED that acts like an invalid memslot,
> but KVM exists with a dedicated exit reason instead of generating MMIO semantics.
Could be an alternative solution I guess?
>
> - What precisely needs to be atomic? If updates only need to be "atomic" for
> an address space, does the API allowing mixing non-SMM and SMM memslots?
Does QEMU pass a list of mixed non-SMM and SMM memslots?
Btw, is there a separate list for SMM memslots in KVM?
If not, and they are all thrown together, then for simplicity we make all
updates atomic.
>
> - When is KVM required to invalidate and flush? E.g. if a memslot is deleted
> and recreated with a different HVA, how does KVM ensure that there are no
> outstanding references to the old HVA without introducing non-determinstic
> behavior. The current API works by forcing userspace to fully delete the
> memslot, i.e. KVM can ensure all references are gone in all TLBs before
> allowing userspace to create new, conflicting entries. I don't see how this
> can work with batched updates. The update needs to be "atomic", i.e. vCPUs
> must never see an invalid/deleted memslot, but if the memslot is writable,
> how does KVM prevent some writes from hitting the old HVA and some from hitting
> the new HVA without a quiescent period?
Sorry, this might be my fault in providing definitions, even though I
think I gave plenty of examples. Delete/move operations are not really
"atomic" in the sense that the slot disappears immediately.
The slot(s) is/are first "atomically" invalidated, allowing the guest to
retry the page fault and preparing for the cleanup you mention above,
and then are "atomically" deleted.
The behavior is the same, just instead of doing
invalidate memslot X
swap()
delete memslot X
swap()
invalidate memslot Y
swap()
delete memslot Y
swap()
...
I would do:
invalidate
invalidate
invalidate
swap()
delete
delete
delete
swap()
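As a sketch of that two-pass flow, assuming the kvm_main.c helpers keep roughly their current signatures (b, nent, i and as_id are placeholders for the batch array and its bookkeeping; this illustrates the idea, it is not the patch code):

    /* Pass 1: replace every DELETE/MOVE target with an INVALID copy. */
    for (i = 0; i < nent; i++)
            if (b[i].change == KVM_MR_DELETE || b[i].change == KVM_MR_MOVE)
                    kvm_replace_memslot(kvm, b[i].old, b[i].invalid);

    /* One swap: vCPUs now see the INVALID copies and retry their faults. */
    kvm_swap_active_memslots(kvm, as_id);

    /* Pass 2: install the final state of every entry in the batch. */
    for (i = 0; i < nent; i++)
            kvm_replace_memslot(kvm, b[i].invalid ?: b[i].old, b[i].new);

    /* Second swap publishes the fully updated memslot list. */
    kvm_swap_active_memslots(kvm, as_id);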
>
> - If a memslot is truncated while dirty logging is enabled, what happens to
> the bitmap? Is it preserved? Dropped?
Can you explain what you mean by "truncated memslot"?
Regarding the bitmap, currently QEMU should (probably wrongly) update it
before even committing the changes to KVM. But I remember that upstream
someone pointed out that this could be solved later.
>
> Again, I completely agree that the current memslots API is far from perfect, but
> I'm not convinced that simply extending the existing API to batch updates is the
> best solution from a KVM perspective.
>
>>> E.g. why do a batch update and not provide KVM_SET_ALL_USER_MEMORY_REGIONS to
>>> do wholesale replacement? That seems like it would be vastly simpler to handle
>>> on KVM's end. Or maybe there's a solution in the opposite direction, e.g. an
>>> API that allows 1->N or N->1 conversions but not arbitrary batching.
>>
>> Wholesale replacement was my first idea when I looked at the issue, I think
>> at the end of 2020. I never got to a full implementation, but my impression
>> was that allocating/deallocating dirty bitmaps, rmaps etc. would make it any
>> easier than arbitrary batch updates.
>
> It's not obvious to me that the memslot metadata is going to be easy to handle
> regardless of what we do. E.g. I'm pretty sure that batching updates will "corrupt"
> the dirty bitmap if a hole is punched in a memslot that's being dirty logged.
>
Could you provide an example?
I am not an expert, but to me it looks like I preserved the same
old functionality both here and in QEMU. I just batched and delayed
some operations, which in this case should cause no harm.
Thank you,
Emanuele
On Thu, Sep 29, 2022, Emanuele Giuseppe Esposito wrote:
>
> Am 28/09/2022 um 22:41 schrieb Sean Christopherson:
> > Beyond that, there's no explanation of why this exact API is necessary, i.e. there
> > are no requirements given.
> >
> > - Why can't this be solved in userspace?
>
> Because this would provide the "feature" only to QEMU, leaving each
> other hypervisor to implement its own.
But there's no evidence that any other VMM actually needs this feature.
> > - When is KVM required to invalidate and flush? E.g. if a memslot is deleted
> > and recreated with a different HVA, how does KVM ensure that there are no
> > outstanding references to the old HVA without introducing non-determinstic
> > behavior. The current API works by forcing userspace to fully delete the
> > memslot, i.e. KVM can ensure all references are gone in all TLBs before
> > allowing userspace to create new, conflicting entries. I don't see how this
> > can work with batched updates. The update needs to be "atomic", i.e. vCPUs
> > must never see an invalid/deleted memslot, but if the memslot is writable,
> > how does KVM prevent some writes from hitting the old HVA and some from hitting
> > the new HVA without a quiescent period?
>
> Sorry this might be my fault in providing definitions, even though I
> think I made plenty of examples. Delete/move operations are not really
> "atomic" in the sense that the slot disappears immediately.
>
> The slot(s) is/are first "atomically" invalidated, allowing the guest to
> retry the page fault
I completely forgot that KVM x86 now retries on KVM_MEMSLOT_INVALID[*]. That is
somewhat arbitrary behavior, e.g. if x86 didn't do MMIO SPTE caching it wouldn't
be "necessary" and x86 would still treat INVALID as MMIO. It's also x86 specific,
no other architecture retries on invalid memslots, i.e. relying on that behavior
to provide "atomicity" won't work without updating all other architectures.
That's obviously doable, but since these updates aren't actually atomic, I don't
see any meaningful benefit over adding e.g. KVM_MEM_DISABLED.
[*] e0c378684b65 ("KVM: x86/mmu: Retry page faults that hit an invalid memslot")
> > - If a memslot is truncated while dirty logging is enabled, what happens to
> > the bitmap? Is it preserved? Dropped?
>
> Can you explain what you mean with "truncated memslot"?
Shrink the size of the memslot.
> Regarding the bitmap, currently QEMU should (probably wrongly) update it
> before even committing the changes to KVM. But I remember upstream
> someone pointed that this could be solved later.
These details can't be punted, as they affect the ABI. My interpretation of
"atomic" memslot updates was that KVM would truly honor the promise of atomic
updates, e.g. that a DELETE=>CREATE sequence would effectively allow truncating
a memslot without losing any data. Those details aren't really something that
can be punted down the road to be fixed at a later point because they very much
matter from an ABI perspective.
> > Again, I completely agree that the current memslots API is far from perfect, but
> > I'm not convinced that simply extending the existing API to batch updates is the
> > best solution from a KVM perspective.
> >
> >>> E.g. why do a batch update and not provide KVM_SET_ALL_USER_MEMORY_REGIONS to
> >>> do wholesale replacement? That seems like it would be vastly simpler to handle
> >>> on KVM's end. Or maybe there's a solution in the opposite direction, e.g. an
> >>> API that allows 1->N or N->1 conversions but not arbitrary batching.
> >>
> >> Wholesale replacement was my first idea when I looked at the issue, I think
> >> at the end of 2020. I never got to a full implementation, but my impression
> >> was that allocating/deallocating dirty bitmaps, rmaps etc. would make it any
> >> easier than arbitrary batch updates.
> >
> > It's not obvious to me that the memslot metadata is going to be easy to handle
> > regardless of what we do. E.g. I'm pretty sure that batching updates will "corrupt"
> > the dirty bitmap if a hole is punched in a memslot that's being dirty logged.
> >
>
> Could you provide an example?
> I mean I am not an expert but to me it looks like I preserved the same
> old functionalities both here and in QEMU. I just batched and delayed
> some operations, which in this case should cause no harm.
As above, my confusion is due to calling these updates "atomic". The new API is
really just "allow batching memslot updates and subtly rely on restarting page
faults on KVM_MEMSLOT_INVALID".
IMO, KVM_MEM_DISABLED or similar is the way to go. I.e. formalize the "restart
page faults" semantics without taking on the complexity of batched updates.
On 9/28/22 22:41, Sean Christopherson wrote:
> On Wed, Sep 28, 2022, Paolo Bonzini wrote:
>> On 9/28/22 17:58, Sean Christopherson wrote:
>>> I don't disagree that the memslots API is lacking, but IMO that is somewhat
>>> orthogonal to fixing KVM x86's "code fetch to MMIO" mess. Such a massive new API
>>> should be viewed and prioritized as a new feature, not as a bug fix, e.g. I'd
>>> like to have the luxury of being able to explore ideas beyond "let userspace
>>> batch memslot updates", and I really don't want to feel pressured to get this
>>> code reviewed and merge.
>>
>> I absolutely agree that this is not a bugfix. Most new features for KVM can
>> be seen as bug fixes if you squint hard enough, but they're still features.
>
> I guess I'm complaining that there isn't sufficient justification for this new
> feature. The cover letter provides a bug that would be fixed by having batched
> updates, but as above, that's really due to deficiencies in a different KVM ABI.
I disagree. Failure to fetch should be fixed but is otherwise a red
herring. It's the whole memslot API (including dirty logging) that is a
mess.
If you think we should overhaul it even more than just providing atomic
batched updates, that's fine. But still, the impossibility to perform
atomic updates in batches *is* a suboptimal part of the KVM API.
> - Why can't this be solved in userspace?
I don't think *can't* is the right word. If the metric of choice was
"what can be solved in userspace", we'd all be using microkernels. The
question is why userspace would be a better place to solve it.
The only reason to do it in userspace would be if failure to fetch is
something that is interesting to userspace, other than between two
KVM_SET_USER_MEMORY_REGION. Unless you provide an API to pass failures
to fetch down to userspace, the locking in userspace is going to be
inferior, because it would have to be unconditional. This means worse
performance and more complication, not to mention having to do it N
times instead of 1 for N implementations.
Not forcing userspace to do "tricks" is in my opinion a strong part of
deciding whether an API belongs in KVM.
> - What operations does userspace truly need? E.g. if the only use case is to
> split/truncate/hole punch an existing memslot, can KVM instead provide a
> memslot flag and exit reason that allows kicking vCPUs to userspace if the
> memslot is accessed? E.g. KVM_MEM_DISABLED that acts like an invalid memslot,
> but KVM exists with a dedicated exit reason instead of generating MMIO semantics.
The main cases are:
- for the boot case, splitting and merging existing memslots. QEMU
likes to merge adjacent memory regions into a single memslot, so if
something goes from read-write to read-only it has to be split and vice
versa. I guess a "don't merge this memory region" flag would be the
least hideous way to solve it in userspace.
- however, there is also the case of resizing an existing memslot which
is what David would like to have for virtio-mem. This is not really
fixable because part of the appeal of virtio-mem is to have a single
huge memslot instead of many smaller ones, in order to reduce the
granularity of add/remove (David, correct me if I'm wrong).
(The latter _would_ be needed by other VMMs).
> If updates only need to be "atomic" for an address space, does the API allowing
> mixing non-SMM and SMM memslots?
I agree that the address space should be moved out of the single entries
and into the header if we follow through with this approach.
> The update needs to be "atomic", i.e. vCPUs
> must never see an invalid/deleted memslot, but if the memslot is writable,
> how does KVM prevent some writes from hitting the old HVA and some from hitting
> the new HVA without a quiescent period?
(Heh, and I forgot likewise that non-x86 does not retry on
KVM_MEMSLOT_INVALID. Yes, that would be treated as a bug on other
architectures).
>> Wholesale replacement was my first idea when I looked at the issue, I think
>> at the end of 2020. I never got to a full implementation, but my impression
>> was that allocating/deallocating dirty bitmaps, rmaps etc. would make it any
>> easier than arbitrary batch updates.
>
> It's not obvious to me that the memslot metadata is going to be easy to handle
> regardless of what we do. E.g. I'm pretty sure that batching updates will "corrupt"
> the dirty bitmap if a hole is punched in a memslot that's being dirty logged.
Indeed; I would have thought that it is clear with the batch updates API
(which requires the update to be split into delete and insert), but
apparently it's not and it's by no means optimal.
Paolo
On Thu, 2022-09-29 at 17:28 +0200, Paolo Bonzini wrote:
> On 9/28/22 22:41, Sean Christopherson wrote:
> > On Wed, Sep 28, 2022, Paolo Bonzini wrote:
> > > On 9/28/22 17:58, Sean Christopherson wrote:
> > > > I don't disagree that the memslots API is lacking, but IMO that is somewhat
> > > > orthogonal to fixing KVM x86's "code fetch to MMIO" mess. Such a massive new API
> > > > should be viewed and prioritized as a new feature, not as a bug fix, e.g. I'd
> > > > like to have the luxury of being able to explore ideas beyond "let userspace
> > > > batch memslot updates", and I really don't want to feel pressured to get this
> > > > code reviewed and merge.
> > >
> > > I absolutely agree that this is not a bugfix. Most new features for KVM can
> > > be seen as bug fixes if you squint hard enough, but they're still features.
> >
> > I guess I'm complaining that there isn't sufficient justification for this new
> > feature. The cover letter provides a bug that would be fixed by having batched
> > updates, but as above, that's really due to deficiencies in a different KVM ABI.
>
> I disagree. Failure to fetch should be fixed but is otherwise a red
> herring. It's the whole memslot API (including dirty logging) that is a
> mess.
>
> If you think we should overhaul it even more than just providing atomic
> batched updates, that's fine. But still, the impossibility to perform
> atomic updates in batches *is* a suboptimal part of the KVM API.
>
> > - Why can't this be solved in userspace?
>
> I don't think *can't* is the right word. If the metric of choice was
> "what can be solved in userspace", we'd all be using microkernels. The
> question is why userspace would be a better place to solve it.
>
> The only reason to do it in userspace would be if failure to fetch is
> something that is interesting to userspace, other than between two
> KVM_SET_USER_MEMORY_REGION. Unless you provide an API to pass failures
> to fetch down to userspace, the locking in userspace is going to be
> inferior, because it would have to be unconditional. This means worse
> performance and more complication, not to mention having to do it N
> times instead of 1 for N implementations.
>
> Not forcing userspace to do "tricks" is in my opinion a strong part of
> deciding whether an API belongs in KVM.
>
> > - What operations does userspace truly need? E.g. if the only use case is to
> > split/truncate/hole punch an existing memslot, can KVM instead provide a
> > memslot flag and exit reason that allows kicking vCPUs to userspace if the
> > memslot is accessed? E.g. KVM_MEM_DISABLED that acts like an invalid memslot,
> > but KVM exists with a dedicated exit reason instead of generating MMIO semantics.
>
> The main cases are:
>
> - for the boot case, splitting and merging existing memslots. QEMU
> likes to merge adjacent memory regions into a single memslot, so if
> something goes from read-write to read-only it has to be split and vice
> versa. I guess a "don't merge this memory region" flag would be the
> less hideous way to solve it in userspace.
>
> - however, there is also the case of resizing an existing memslot which
> is what David would like to have for virtio-mem. This is not really
> fixable because part of the appeal of virtio-mem is to have a single
> huge memslot instead of many smaller ones, in order to reduce the
> granularity of add/remove (David, correct me if I'm wrong).
>
> (The latter _would_ be needed by other VMMs).
>
> > If updates only need to be "atomic" for an address space, does the API allowing
> > mixing non-SMM and SMM memslots?
>
> I agree that the address space should be moved out of the single entries
> and into the header if we follow through with this approach.
>
> > The update needs to be "atomic", i.e. vCPUs
> > must never see an invalid/deleted memslot, but if the memslot is writable,
> > how does KVM prevent some writes from hitting the old HVA and some from hitting
> > the new HVA without a quiescent period?
>
> (Heh, and I forgot likewise that non-x86 does not retry on
> KVM_MEMSLOT_INVALID. Yes, that would be treated as a bug on other
> architectures).
>
> > > Wholesale replacement was my first idea when I looked at the issue, I think
> > > at the end of 2020. I never got to a full implementation, but my impression
> > > was that allocating/deallocating dirty bitmaps, rmaps etc. would make it any
> > > easier than arbitrary batch updates.
> >
> > It's not obvious to me that the memslot metadata is going to be easy to handle
> > regardless of what we do. E.g. I'm pretty sure that batching updates will "corrupt"
> > the dirty bitmap if a hole is punched in a memslot that's being dirty logged.
>
> Indeed; I would have thought that it is clear with the batch updates API
> (which requires the update to be split into delete and insert), but
> apparently it's not and it's by no means optimal.
I 100% agree with everything Paolo said.
Best regards,
Maxim Levitsky
>
> Paolo
>
On 9/29/22 17:18, Sean Christopherson wrote:
> IMO, KVM_MEM_DISABLED or similar is the way to go. I.e. formalize the "restart
> page faults" semantics without taking on the complexity of batched updates.
If userspace has to collaborate, KVM_MEM_DISABLED (or KVM_MEM_USER_EXIT
would be a better name) is not needed either except as an optimization;
you can just kick all CPUs unconditionally.
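(In QEMU terms, that userspace-only fallback could be as blunt as the following sketch; pause_all_vcpus()/resume_all_vcpus() are existing helpers, while the function in the middle is a made-up name for the usual per-slot ioctls:)

    pause_all_vcpus();
    /* No vCPU can run while the individual KVM_SET_USER_MEMORY_REGION
     * calls are issued, so the intermediate states are never observed. */
    kvm_apply_pending_memslot_updates();   /* hypothetical helper */
    resume_all_vcpus();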
And in fact KVM_MEM_DISABLED is not particularly easy to implement
either; in order to allow split/merge, it should be possible for a new
memslot to replace multiple disabled memslots (to allow merging), and
to only partially overlap the first/last disabled memslot it replaces.
None of this is allowed for other memslots, so exactly the same metadata
complications exist as for other options such as wholesale replacement
or batched updates. The only semantics with a sane implementation would
be to destroy the dirty bitmap of disabled memslots when they are
replaced. At least it would be possible for userspace to set
KVM_MEM_DISABLED, issue KVM_GET_DIRTY_LOG and then finally create the
new memslot. That would be _correct_, but still not very appealing.
I don't exclude suffering from tunnel vision, but batched updates to me
still seem to be the least bad option.
Paolo
> The main cases are:
>
> - for the boot case, splitting and merging existing memslots. QEMU
> likes to merge adjacent memory regions into a single memslot, so if
> something goes from read-write to read-only it has to be split and vice
> versa. I guess a "don't merge this memory region" flag would be the
> less hideous way to solve it in userspace.
>
> - however, there is also the case of resizing an existing memslot which
> is what David would like to have for virtio-mem. This is not really
> fixable because part of the appeal of virtio-mem is to have a single
> huge memslot instead of many smaller ones, in order to reduce the
> granularity of add/remove (David, correct me if I'm wrong).
Yes, the most important case I am working on in that regard is reducing
the memslot size/overhead when only exposing comparatively little memory
towards a VM using virtio-mem (say, a virtio-mem device that could grow
to 1 TiB, but we initially only expose 1 GiB to the VM).
One approach I prototyped in the past (where my RFC for atomic updates
came into play because I ran into this issue) to achieve that was
dynamically growing (and eventually shrinking) a single memslot on demand.
An alternative [1] uses multiple individual memslots, and exposes them
on demand. There, I have to make sure that QEMU won't merge adjacent
memslots into a bigger one -- because otherwise, we'd again need atomic
memslot updates, which KVM, vhost-user, ... don't support. But in
the future, I think we want to have that: if there is no fragmentation,
we should only have a single large memslot and all memslot consumers
should be able to cope with atomic updates.
So in any case, I will have good use for atomic memslot updates. Just
like other hypervisors that (will) implement virtio-mem or something
comparable :)
[1] https://lore.kernel.org/all/[email protected]/T/#u
--
Thanks,
David / dhildenb
On Thu, Sep 29, 2022, Paolo Bonzini wrote:
>
> On 9/28/22 22:41, Sean Christopherson wrote:
> If you think we should overhaul it even more than just providing atomic
> batched updates, that's fine.
Or maybe solve the problem(s) with more precision? What I don't like about the
batching is that it's not "atomic", and AFAICT doesn't have line of sight to truly
becoming atomic. When I hear "atomic" updates, I think "all transactions are
commited at once". That is not what this proposed API does, and due to the TLB
flushing requirements, I don't see how it can be implemented.
> But still, the impossibility to perform
> atomic updates in batches *is* a suboptimal part of the KVM API.
I don't disagree, but I honestly don't see the difference between "batching" and
providing a KVM_MEM_USER_EXIT flag. The batch API will provide better performance,
but I would be surprised if it's significant enough to matter. I'm not saying
that KVM_MEM_USER_EXIT provides novel functionality, e.g. I think we're in
agreement that handling memslot metadata will suck regardless. My point is that
it's simpler to implement in KVM, much easier to document, and for all intents and
purposes provides the same desired functionality: vCPUs that access in-flux memslots
are stalled and so userspace doesn't have to pause and rendezvous all vCPUs to
provide "atomic" updates.
> The main cases are:
>
> - for the boot case, splitting and merging existing memslots. QEMU likes to
> merge adjacent memory regions into a single memslot, so if something goes
> from read-write to read-only it has to be split and vice versa. I guess a
> "don't merge this memory region" flag would be the less hideous way to solve
> it in userspace.
>
> - however, there is also the case of resizing an existing memslot which is
> what David would like to have for virtio-mem. This is not really fixable
> because part of the appeal of virtio-mem is to have a single huge memslot
> instead of many smaller ones, in order to reduce the granularity of
> add/remove (David, correct me if I'm wrong).
>
> (The latter _would_ be needed by other VMMs).
If we really want to provide a better experience for userspace, why not provide
more primitives to address those specific use cases? E.g. fix KVM's historic wart
of disallowing toggling of KVM_MEM_READONLY, and then add one or more ioctls to:
- Merge 2+ memory regions that are contiguous in GPA and HVA
- Split a memory region into 2+ memory regions
- Truncate a memory region
- Grow a memory region
That would probably require more KVM code overall, but each operation would be more
tightly bounded and thus simpler to define. And I think more precise APIs would
provide other benefits, e.g. growing a region wouldn't require first deleting the
current region, and so could avoid zapping SPTEs and destroying metadata. Merge,
split, and truncate would have similar benefits, though they might be more
difficult to realize in practice.
What I really don't like about the "batch" API, aside from the "atomic" name, is
that it's too open ended and creates conflicts. E.g. to allow "atomic" merging
after converting from RO=>RW, KVM needs to allow DELETE=>DELETE=>CREATE, and that
implies that each operation in the batch operates on the working state created by
the previous operations.
But if userspace provides a batch that does DELETE=>CREATE=>DELETE, naively hoisting
all "invalidations" to the front half of the "atomic" operations breaks the
ordering. And there's a ridiculous amount of potential weirdness with MOVE=>MOVE
operations.
KVM could solve those issues by disallowing MOVE and disallowing DELETE after
CREATE or FLAGS_ONLY, but then userspace is effectively restricted to operating
on a single contiguous chunk of memslots. At that point, it seems that providing
primitives to do each macro operation is an even better experience for userspace.
In other words, make memslots a bit more CISC ;-)
Am 29/09/2022 um 23:39 schrieb Sean Christopherson:
> If we really want to provide a better experience for userspace, why not provide
> more primitives to address those specific use cases? E.g. fix KVM's historic wart
> of disallowing toggling of KVM_MEM_READONLY, and then add one or more ioctls to:
>
> - Merge 2+ memory regions that are contiguous in GPA and HVA
> - Split a memory region into 2+ memory regions
> - Truncate a memory region
> - Grow a memory region
I looked again at the code, and specifically at the use case that triggers
the crash in the bugzilla. I *think* (correct me if I am wrong) that the
only operation QEMU performs right now is "grow/shrink".
So *if* we want to go this way, we could start with a simple grow/shrink
API.
We do need to consider, though, that this could bring additional
complexity to QEMU. Currently, DELETE+CREATE (grow/shrink) is not
performed one after the other (in my innocent fantasy I was expecting to
find 2 subsequent ioctls in the code), but in QEMU's
address_space_set_flatview(), which seems to first remove all regions
and then add them when changing flatviews.
address_space_update_topology_pass(as, old_view2, new_view, adding=false);
address_space_update_topology_pass(as, old_view2, new_view, adding=true);
I don't think we can change this, as other listeners also rely on such
ordering, but we can still batch all callback requests like I currently
do and process them in kvm_commit(), figuring there which operation is
which.
In other words, we would have something like:
region_del() --> DELETE memslot X -> add it to the queue of operations
region_del() --> DELETE memslot Y -> add it to the queue of operations
region_add() --> CREATE memslot X (size doubled) -> add it to the queue
of operations
region_add() --> CREATE memslot Y (size halved) -> add it to the queue
of operations
...
commit() --> scan QUEUE and figure what to do -> GROW X (+size), SHRINK
Y (-size) -> issue 2 ioctls, GROW and SHRINK.
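A rough sketch of that queueing scheme in the kvm memory listener follows; it is illustrative only: PendingOp, kvm_queue_op() and kvm_commit_sketch() are made-up names, while the MemoryListener callbacks, the QTAILQ macros and g_new0()/g_free() are existing QEMU/glib API. region_del()/region_add() would call kvm_queue_op() with is_add false/true respectively.

    typedef struct PendingOp {
        bool is_add;                     /* region_add vs region_del */
        MemoryRegionSection section;
        QTAILQ_ENTRY(PendingOp) next;
    } PendingOp;

    static QTAILQ_HEAD(, PendingOp) pending_ops =
        QTAILQ_HEAD_INITIALIZER(pending_ops);

    static void kvm_queue_op(MemoryRegionSection *section, bool is_add)
    {
        PendingOp *op = g_new0(PendingOp, 1);

        op->is_add = is_add;
        op->section = *section;          /* remember what del/add asked for */
        QTAILQ_INSERT_TAIL(&pending_ops, op, next);
    }

    static void kvm_commit_sketch(MemoryListener *listener)
    {
        PendingOp *op, *tmp;

        QTAILQ_FOREACH_SAFE(op, &pending_ops, next, tmp) {
            /*
             * Matching logic elided: pair each del with the add covering
             * the same slot and, if only the size changed, issue a single
             * GROW/SHRINK request instead of DELETE+CREATE; everything
             * else stays a regular KVM_SET_USER_MEMORY_REGION call.
             */
            QTAILQ_REMOVE(&pending_ops, op, next);
            g_free(op);
        }
    }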
> That would probably require more KVM code overall, but each operation would be more
> tightly bounded and thus simpler to define. And I think more precise APIs would
> provide other benefits, e.g. growing a region wouldn't require first deleting the
> current region, and so could avoid zapping SPTEs and destroying metadata. Merge,
> split, and truncate would have similar benefits, though they might be more
> difficult to realize in practice.
So essentially grow would not require INVALIDATE. Makes sense, but would
it also work with shrink? I guess so, as the memslot is still present
(but shrunk), right?
Paolo, would you be ok with this smaller API? Probably just starting
with grow and shrink first.
I am not against either of the two approaches:
- my approach has the disadvantage that the list could be arbitrarily
long, and it is difficult to roll back the intermediate changes if
something breaks during the request processing (but this could be simplified
by making kvm exit or crash).
- Sean's approach could potentially put more burden on userspace,
as we need to figure out which operation is which. Also, from my
understanding, split and merge won't be really straightforward to
implement, especially in userspace.
David, any concerns from a userspace perspective on this "CISC" approach?
Thank you,
Emanuele
On 13.10.22 09:43, Emanuele Giuseppe Esposito wrote:
>
>
> Am 29/09/2022 um 23:39 schrieb Sean Christopherson:
>> If we really want to provide a better experience for userspace, why not provide
>> more primitives to address those specific use cases? E.g. fix KVM's historic wart
>> of disallowing toggling of KVM_MEM_READONLY, and then add one or more ioctls to:
>>
>> - Merge 2+ memory regions that are contiguous in GPA and HVA
>> - Split a memory region into 2+ memory regions
>> - Truncate a memory region
>> - Grow a memory region
>
> I looked again at the code and specifically the use case that triggers
> the crash in bugzilla. I *think* (correct me if I am wrong), that the
> only operation that QEMU performs right now is "grow/shrink".
I remember that there were BUG reports where we'd actually split and run
into that problem. Just don't have them at hand. I think they happened
during early boot when the OS re-configured some PCI thingies.
>
> So *if* we want to go this way, we could start with a simple grow/shrink
> API.
>
> Even though we need to consider that this could bring additional
> complexity in QEMU. Currently, DELETE+CREATE (grow/shrink) is not
> performed one after the other (in my innocent fantasy I was expecting to
> find 2 subsequent ioctls in the code), but in QEMU's
> address_space_set_flatview(), which seems to first remove all regions
> and then add them when changing flatviews.
>
> address_space_update_topology_pass(as, old_view2, new_view, adding=false);
> address_space_update_topology_pass(as, old_view2, new_view, adding=true);
>
> I don't think we can change this, as other listeners also rely on such
> ordering, but we can still batch all callback requests like I currently
> do and process them in kvm_commit(), figuring there which operation is
> which.
>
> In other words, we would have something like:
>
> region_del() --> DELETE memslot X -> add it to the queue of operations
> region_del() --> DELETE memslot Y -> add it to the queue of operations
> region_add() --> CREATE memslot X (size doubled) -> add it to the queue
> of operations
> region_add() --> CREATE memslot Y (size halved) -> add it to the queue
> of operations
> ...
> commit() --> scan QUEUE and figure what to do -> GROW X (+size), SHRINK
> Y (-size) -> issue 2 ioctls, GROW and SHRINK.
I communicated resizes (region_resize()) to the notifier in patch #3 of
https://lore.kernel.org/qemu-devel/[email protected]/
You could use that independently of the remainder. It should be less
controversial ;)
But I think it only tackles part of the more generic problem (split/merge).
>
>> That would probably require more KVM code overall, but each operation would be more
>> tightly bounded and thus simpler to define. And I think more precise APIs would
>> provide other benefits, e.g. growing a region wouldn't require first deleting the
>> current region, and so could avoid zapping SPTEs and destroying metadata. Merge,
>> split, and truncate would have similar benefits, though they might be more
>> difficult to realize in practice.
>
> So essentially grow would not require INVALIDATE. Makes sense, but would
> it work also with shrink? I guess so, as the memslot is still present
> (but shrinked) right?
>
> Paolo, would you be ok with this smaller API? Probably just starting
> with grow and shrink first.
>
> I am not against any of the two approaches:
> - my approach has the disadvantage that the list could be arbitrarily
> long, and it is difficult to rollback the intermediate changes if
> something breaks during the request processing (but could be simplified
> by making kvm exit or crash).
>
> - Sean approach could potentially provide more burden to the userspace,
> as we need to figure which operation is which. Also from my
> understanding split and merge won't be really straightforward to
> implement, especially in userspace.
>
> David, any concern from userspace prospective on this "CISC" approach?
In contrast to resizes in QEMU that only affect a single memory
region/slot, splitting/merging is harder to factor out and communicate
to a notifier. As an alternative, we could handle it in the commit stage
in the notifier itself, similar to what my prototype does, and figure
out what needs to be done in there and how to call the proper KVM
interface (and which interface to call).
With virtio-mem (in the future) we might see merges of 2 slots into a
single one, by closing a gap in-between them. In "theory" we could
combine some updates into a single transaction. But it won't be 100s ...
I think I'd prefer an API that doesn't require excessive ad-hoc
extensions later once something in QEMU changes.
I think in essence, all we care about is performing atomic changes that
*have to be atomic*, because something we add during a transaction
overlaps with something we remove during a transaction. Not necessarily
all updates in a transaction!
My prototype essentially detects that scenario, but doesn't call new KVM
interfaces to deal with these.
I assume once we take that into consideration, we can mostly assume that
any such atomic updates (only of the regions that really have to be part
of an atomic update) won't involve more than a handful of memory
regions. We could add a sane KVM API limit.
And if we run into that API limit in QEMU, we can print a warning and do
it non-atomically.
--
Thanks,
David / dhildenb
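Such a limit could be as simple as a bound on nent in the ioctl handler; the constant name and value below are made up, only the nent field comes from the proposed UAPI:

    #define KVM_SET_MEMORY_REGION_LIST_MAX_ENTRIES   64   /* hypothetical cap */

            if (list->nent > KVM_SET_MEMORY_REGION_LIST_MAX_ENTRIES)
                    return -E2BIG;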
Am 13/10/2022 um 10:44 schrieb David Hildenbrand:
> On 13.10.22 09:43, Emanuele Giuseppe Esposito wrote:
>>
>>
>> Am 29/09/2022 um 23:39 schrieb Sean Christopherson:
>>> If we really want to provide a better experience for userspace, why
>>> not provide
>>> more primitives to address those specific use cases? E.g. fix KVM's
>>> historic wart
>>> of disallowing toggling of KVM_MEM_READONLY, and then add one or more
>>> ioctls to:
>>>
>>> - Merge 2+ memory regions that are contiguous in GPA and HVA
>>> - Split a memory region into 2+ memory regions
>>> - Truncate a memory region
>>> - Grow a memory region
>>
>> I looked again at the code and specifically the use case that triggers
>> the crash in bugzilla. I *think* (correct me if I am wrong), that the
>> only operation that QEMU performs right now is "grow/shrink".
>
> I remember that there were BUG reports where we'd actually split and run
> into that problem. Just don't have them at hand. I think they happened
> during early boot when the OS re-configured some PCI thingies.
If you could point me to where this is happening, it would be nice. So far
I could not find any split/merge operation.
>
>>
>> So *if* we want to go this way, we could start with a simple grow/shrink
>> API.
>>
>> Even though we need to consider that this could bring additional
>> complexity in QEMU. Currently, DELETE+CREATE (grow/shrink) is not
>> performed one after the other (in my innocent fantasy I was expecting to
>> find 2 subsequent ioctls in the code), but in QEMU's
>> address_space_set_flatview(), which seems to first remove all regions
>> and then add them when changing flatviews.
>>
>> address_space_update_topology_pass(as, old_view2, new_view,
>> adding=false);
>> address_space_update_topology_pass(as, old_view2, new_view, adding=true);
>>
>> I don't think we can change this, as other listeners also rely on such
>> ordering, but we can still batch all callback requests like I currently
>> do and process them in kvm_commit(), figuring there which operation is
>> which.
>>
>> In other words, we would have something like:
>>
>> region_del() --> DELETE memslot X -> add it to the queue of operations
>> region_del() --> DELETE memslot Y -> add it to the queue of operations
>> region_add() --> CREATE memslot X (size doubled) -> add it to the queue
>> of operations
>> region_add() --> CREATE memslot Y (size halved) -> add it to the queue
>> of operations
>> ...
>> commit() --> scan QUEUE and figure what to do -> GROW X (+size), SHRINK
>> Y (-size) -> issue 2 ioctls, GROW and SHRINK.
>
> I communicated resizes (region_resize()) to the notifier in patch #3 of
> https://lore.kernel.org/qemu-devel/[email protected]/
>
> You could use that independently of the remainder. It should be less
> controversial ;)
>
> But I think it only tackles part of the more generic problem (split/merge).
Yes, very useful! Using that patch we would have a single place from which to
issue grow/shrink ioctls. I don't think we need to inhibit ioctls, since
the operation will be atomic (this time in the true sense, since we
don't need INVALIDATE).
>
>>
>>> That would probably require more KVM code overall, but each operation
>>> would be more
>>> tightly bounded and thus simpler to define. And I think more precise
>>> APIs would
>>> provide other benefits, e.g. growing a region wouldn't require first
>>> deleting the
>>> current region, and so could avoid zapping SPTEs and destroying
>>> metadata. Merge,
>>> split, and truncate would have similar benefits, though they might be
>>> more
>>> difficult to realize in practice.
>>
>> So essentially grow would not require INVALIDATE. Makes sense, but would
>> it work also with shrink? I guess so, as the memslot is still present
>> (but shrinked) right?
>>
>> Paolo, would you be ok with this smaller API? Probably just starting
>> with grow and shrink first.
>>
>> I am not against either of the two approaches:
>> - my approach has the disadvantage that the list could be arbitrarily
>> long, and it is difficult to roll back the intermediate changes if
>> something breaks during the request processing (but this could be
>> simplified by making kvm exit or crash).
>>
>> - Sean's approach could put more burden on userspace, as we need to
>> figure out which operation is which. Also, from my understanding, split
>> and merge won't be really straightforward to implement, especially in
>> userspace.
>>
>> David, any concern from a userspace perspective on this "CISC" approach?
>
> In contrast to resizes in QEMU that only affect a single memory
> region/slot, splitting/merging is harder to factor out and communicate
> to a notifier. As an alternative, we could handle it in the commit stage
> in the notifier itself, similar to what my prototype does, and figure
> out what needs to be done in there and how to call the proper KVM
> interface (and which interface to call).
>
> With virtio-mem (in the future) we might see merges of 2 slots into a
> single one, by closing a gap in-between them. In "theory" we could
> combine some updates into a single transaction. But it won't be 100s ...
>
> I think I'd prefer an API that doesn't require excessive ad-hoc
> extensions later once something in QEMU changes.
>
>
> I think in essence, all we care about is performing atomic changes that
> *have to be atomic*, because something we add during a transaction
> overlaps with something we remove during a transaction. Not necessarily
> all updates in a transaction!
>
> My prototype essentially detects that scenario, but doesn't call new KVM
> interfaces to deal with these.
With "prototype" I assume you mean the patch linked above
(region_resize), not the userspace-only proposal you sent initially, right?
>
> I assume once we take that into consideration, we can mostly assume that
> any such atomic updates (only of the regions that really have to be part
> of an atomic update) won't involve more than a handful of memory
> regions. We could add a sane KVM API limit.
>
> And if we run into that API limit in QEMU, we can print a warning and do
> it non-atomically.
>
If we implement single operations (split/merge/grow/shrink), we don't
even need that limit. Except for merge, maybe.
Ok, if it's ok for you all, I can try to use David's patch and implement some
simple grow/shrink support. Then we need to figure out where and when exactly
QEMU performs split and merge operations, and maybe implement something
similar to what you did in your proposal?
Thank you,
Emanuele
>> I remember that there were BUG reports where we'd actually split and run
>> into that problem. Just don't have them at hand. I think they happened
>> during early boot when the OS re-configured some PCI thingies.
>
> If you could point me to where this is happening, that would be nice. So far
> I could not find or see any split/merge operation.
I remember some bugzilla, but it might be hard to find. Essentially, it
happened so early during boot that it wasn't really a problem so far
(only a single CPU was running when it triggered, IIRC).
But I might be messing up some details.
>>
>>>
>>>> That would probably require more KVM code overall, but each operation
>>>> would be more tightly bounded and thus simpler to define. And I think
>>>> more precise APIs would provide other benefits, e.g. growing a region
>>>> wouldn't require first deleting the current region, and so could avoid
>>>> zapping SPTEs and destroying metadata. Merge, split, and truncate would
>>>> have similar benefits, though they might be more difficult to realize
>>>> in practice.
>>>
>>> So essentially grow would not require INVALIDATE. Makes sense, but would
>>> it also work for shrink? I guess so, as the memslot is still present
>>> (just shrunk), right?
>>>
>>> Paolo, would you be ok with this smaller API? Probably just starting
>>> with grow and shrink first.
>>>
>>> I am not against either of the two approaches:
>>> - my approach has the disadvantage that the list could be arbitrarily
>>> long, and it is difficult to roll back the intermediate changes if
>>> something breaks during the request processing (but this could be
>>> simplified by making kvm exit or crash).
>>>
>>> - Sean's approach could put more burden on userspace, as we need to
>>> figure out which operation is which. Also, from my understanding, split
>>> and merge won't be really straightforward to implement, especially in
>>> userspace.
>>>
>>> David, any concern from a userspace perspective on this "CISC" approach?
>>
>> In contrast to resizes in QEMU that only affect a single memory
>> region/slot, splitting/merging is harder to factor out and communicate
>> to a notifier. As an alternative, we could handle it in the commit stage
>> in the notifier itself, similar to what my prototype does, and figure
>> out what needs to be done in there and how to call the proper KVM
>> interface (and which interface to call).
>>
>> With virtio-mem (in the future) we might see merges of 2 slots into a
>> single one, by closing a gap in-between them. In "theory" we could
>> combine some updates into a single transaction. But it won't be 100s ...
>>
>> I think I'd prefer an API that doesn't require excessive ad-hoc
>> extensions later once something in QEMU changes.
>>
>>
>> I think in essence, all we care about is performing atomic changes that
>> *have to be atomic*, because something we add during a transaction
>> overlaps with something we remove during a transaction. Not necessarily
>> all updates in a transaction!
>>
>> My prototype essentially detects that scenario, but doesn't call new KVM
>> interfaces to deal with these.
>
> With "prototype" I assume you mean the patch linked above
> (region_resize), not the userspace-only proposal you sent initially, right?
I meant the part of the userspace-only proposal where we collect all
add/remove callbacks in a list, and in the commit stage try to detect whether
there is an overlap. That way we could isolate the memblocks that really
need an atomic operation. Anything that doesn't overlap can just go
via the old interface.
Not saying that it wins a beauty prize, but we wouldn't have to adjust the
QEMU core even more to teach it these special cases (though maybe we should,
and other notifiers might benefit from that in the long term).
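
Roughly, the overlap check I am describing could look something like the
sketch below. The types and names are made up just to illustrate the idea;
this is not code from the prototype.

#include <stdbool.h>
#include <stdint.h>

struct gpa_range {
    uint64_t start;
    uint64_t size;
};

/* Two GPA ranges overlap if each one starts before the other one ends. */
static bool ranges_overlap(const struct gpa_range *a, const struct gpa_range *b)
{
    return a->start < b->start + b->size && b->start < a->start + a->size;
}

/*
 * Mark the regions added in this transaction that collide with a region
 * removed in the same transaction; only those need the atomic path, the
 * rest can keep using the old one-ioctl-per-slot interface.
 */
static void flag_atomic_candidates(const struct gpa_range *adds, int n_add,
                                   const struct gpa_range *removes, int n_rem,
                                   bool *needs_atomic)
{
    for (int i = 0; i < n_add; i++) {
        needs_atomic[i] = false;
        for (int j = 0; j < n_rem; j++) {
            if (ranges_overlap(&adds[i], &removes[j])) {
                needs_atomic[i] = true;
                break;
            }
        }
    }
}
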
>
>>
>> I assume once we take that into consideration, we can mostly assume that
>> any such atomic updates (only of the regions that really have to be part
>> of an atomic update) won't involve more than a handful of memory
>> regions. We could add a sane KVM API limit.
>>
>> And if we run into that API limit in QEMU, we can print a warning and do
>> it non-atomically.
>>
> If we implement single operations (split/merge/grow/shrink), we don't
> even need that limit. Except for merge, maybe.
Right, in theory you could have multiple splits/merges in a single
commit that all interact.
Like cutting 8 pieces out of an existing larger memblock.
>
> Ok, if it's ok for you all, I can try to use David's patch and implement some
> simple grow/shrink support. Then we need to figure out where and when exactly
> QEMU performs split and merge operations, and maybe implement something
> similar to what you did in your proposal?
Looking at grow/shrink first is certainly not a waste of time.
--
Thanks,
David / dhildenb