We run into guest hang in edk2 firmware when KSM is kept as running on
the host. The edk2 firmware is waiting for status 0x80 from QEMU's pflash
device (TYPE_PFLASH_CFI01) during the operation of sector erasing or
buffered write. The status is returned by reading the memory region of
the pflash device and the read request should have been forwarded to QEMU
and emulated by it. Unfortunately, the read request is covered by an
illegal stage2 mapping when the guest hang issue occurs. The read request
is completed with QEMU bypassed and wrong status is fetched. The edk2
firmware runs into an infinite loop with the wrong status.
The illegal stage2 mapping is populated due to same page sharing by KSM
at (C) even the associated memory slot has been marked as invalid at (B)
when the memory slot is requested to be deleted. It's notable that the
active and inactive memory slots can't be swapped when we're in the middle
of kvm_mmu_notifier_change_pte() because kvm->mn_active_invalidate_count
is elevated, and kvm_swap_active_memslots() will busy loop until it reaches
to zero again. Besides, the swapping from the active to the inactive memory
slots is also avoided by holding &kvm->srcu in __kvm_handle_hva_range(),
corresponding to synchronize_srcu_expedited() in kvm_swap_active_memslots().
CPU-A CPU-B
----- -----
ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
kvm_vm_ioctl_set_memory_region
kvm_set_memory_region
__kvm_set_memory_region
kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
kvm_invalidate_memslot
kvm_copy_memslot
kvm_replace_memslot
kvm_swap_active_memslots (A)
kvm_arch_flush_shadow_memslot (B)
same page sharing by KSM
kvm_mmu_notifier_invalidate_range_start
:
kvm_mmu_notifier_change_pte
kvm_handle_hva_range
__kvm_handle_hva_range (C)
:
kvm_mmu_notifier_invalidate_range_end
Fix the issue by skipping the invalid memory slot at (C) to avoid the
illegal stage2 mapping so that the read request for the pflash's status
is forwarded to QEMU and emulated by it. In this way, the correct pflash's
status can be returned from QEMU to break the infinite loop in the edk2
firmware.
Cc: [email protected] # v5.13+
Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code")
Reported-by: Shuai Hu <[email protected]>
Reported-by: Zhenyu Zhang <[email protected]>
Signed-off-by: Gavin Shan <[email protected]>
---
v2: Improved changelog suggested by Marc
---
virt/kvm/kvm_main.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 479802a892d4..7f81a3a209b6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -598,6 +598,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
unsigned long hva_start, hva_end;
slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
+ if (slot->flags & KVM_MEMSLOT_INVALID)
+ continue;
+
hva_start = max(range->start, slot->userspace_addr);
hva_end = min(range->end, slot->userspace_addr +
(slot->npages << PAGE_SHIFT));
--
2.23.0
On Fri, 09 Jun 2023 11:04:20 +0100,
Gavin Shan <[email protected]> wrote:
>
> We run into guest hang in edk2 firmware when KSM is kept as running on
> the host. The edk2 firmware is waiting for status 0x80 from QEMU's pflash
> device (TYPE_PFLASH_CFI01) during the operation of sector erasing or
> buffered write. The status is returned by reading the memory region of
> the pflash device and the read request should have been forwarded to QEMU
> and emulated by it. Unfortunately, the read request is covered by an
> illegal stage2 mapping when the guest hang issue occurs. The read request
> is completed with QEMU bypassed and wrong status is fetched. The edk2
> firmware runs into an infinite loop with the wrong status.
>
> The illegal stage2 mapping is populated due to same page sharing by KSM
> at (C) even the associated memory slot has been marked as invalid at (B)
> when the memory slot is requested to be deleted. It's notable that the
> active and inactive memory slots can't be swapped when we're in the middle
> of kvm_mmu_notifier_change_pte() because kvm->mn_active_invalidate_count
> is elevated, and kvm_swap_active_memslots() will busy loop until it reaches
> to zero again. Besides, the swapping from the active to the inactive memory
> slots is also avoided by holding &kvm->srcu in __kvm_handle_hva_range(),
> corresponding to synchronize_srcu_expedited() in kvm_swap_active_memslots().
>
> CPU-A CPU-B
> ----- -----
> ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
> kvm_vm_ioctl_set_memory_region
> kvm_set_memory_region
> __kvm_set_memory_region
> kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
> kvm_invalidate_memslot
> kvm_copy_memslot
> kvm_replace_memslot
> kvm_swap_active_memslots (A)
> kvm_arch_flush_shadow_memslot (B)
> same page sharing by KSM
> kvm_mmu_notifier_invalidate_range_start
> :
> kvm_mmu_notifier_change_pte
> kvm_handle_hva_range
> __kvm_handle_hva_range (C)
> :
> kvm_mmu_notifier_invalidate_range_end
>
> Fix the issue by skipping the invalid memory slot at (C) to avoid the
> illegal stage2 mapping so that the read request for the pflash's status
> is forwarded to QEMU and emulated by it. In this way, the correct pflash's
> status can be returned from QEMU to break the infinite loop in the edk2
> firmware.
>
> Cc: [email protected] # v5.13+
> Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code")
> Reported-by: Shuai Hu <[email protected]>
> Reported-by: Zhenyu Zhang <[email protected]>
> Signed-off-by: Gavin Shan <[email protected]>
> ---
> v2: Improved changelog suggested by Marc
> ---
> virt/kvm/kvm_main.c | 3 +++
> 1 file changed, 3 insertions(+)
Reviewed-by: Marc Zyngier <[email protected]>
M.
--
Without deviation from the norm, progress is not possible.
On Fri, Jun 09, 2023 at 08:04:20PM +1000, Gavin Shan wrote:
> We run into guest hang in edk2 firmware when KSM is kept as running on
> the host. The edk2 firmware is waiting for status 0x80 from QEMU's pflash
> device (TYPE_PFLASH_CFI01) during the operation of sector erasing or
> buffered write. The status is returned by reading the memory region of
> the pflash device and the read request should have been forwarded to QEMU
> and emulated by it. Unfortunately, the read request is covered by an
> illegal stage2 mapping when the guest hang issue occurs. The read request
> is completed with QEMU bypassed and wrong status is fetched. The edk2
> firmware runs into an infinite loop with the wrong status.
>
> The illegal stage2 mapping is populated due to same page sharing by KSM
> at (C) even the associated memory slot has been marked as invalid at (B)
> when the memory slot is requested to be deleted. It's notable that the
> active and inactive memory slots can't be swapped when we're in the middle
> of kvm_mmu_notifier_change_pte() because kvm->mn_active_invalidate_count
> is elevated, and kvm_swap_active_memslots() will busy loop until it reaches
> to zero again. Besides, the swapping from the active to the inactive memory
> slots is also avoided by holding &kvm->srcu in __kvm_handle_hva_range(),
> corresponding to synchronize_srcu_expedited() in kvm_swap_active_memslots().
>
> CPU-A CPU-B
> ----- -----
> ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
> kvm_vm_ioctl_set_memory_region
> kvm_set_memory_region
> __kvm_set_memory_region
> kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
> kvm_invalidate_memslot
> kvm_copy_memslot
> kvm_replace_memslot
> kvm_swap_active_memslots (A)
> kvm_arch_flush_shadow_memslot (B)
> same page sharing by KSM
> kvm_mmu_notifier_invalidate_range_start
> :
> kvm_mmu_notifier_change_pte
> kvm_handle_hva_range
> __kvm_handle_hva_range (C)
> :
> kvm_mmu_notifier_invalidate_range_end
>
> Fix the issue by skipping the invalid memory slot at (C) to avoid the
> illegal stage2 mapping so that the read request for the pflash's status
> is forwarded to QEMU and emulated by it. In this way, the correct pflash's
> status can be returned from QEMU to break the infinite loop in the edk2
> firmware.
>
> Cc: [email protected] # v5.13+
> Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code")
> Reported-by: Shuai Hu <[email protected]>
> Reported-by: Zhenyu Zhang <[email protected]>
> Signed-off-by: Gavin Shan <[email protected]>
Reviewed-by: Peter Xu <[email protected]>
--
Peter Xu
On 6/9/23 18:04, Gavin Shan wrote:
> We run into guest hang in edk2 firmware when KSM is kept as running on
> the host. The edk2 firmware is waiting for status 0x80 from QEMU's pflash
> device (TYPE_PFLASH_CFI01) during the operation of sector erasing or
> buffered write. The status is returned by reading the memory region of
> the pflash device and the read request should have been forwarded to QEMU
> and emulated by it. Unfortunately, the read request is covered by an
> illegal stage2 mapping when the guest hang issue occurs. The read request
> is completed with QEMU bypassed and wrong status is fetched. The edk2
> firmware runs into an infinite loop with the wrong status.
>
> The illegal stage2 mapping is populated due to same page sharing by KSM
> at (C) even the associated memory slot has been marked as invalid at (B)
> when the memory slot is requested to be deleted. It's notable that the
> active and inactive memory slots can't be swapped when we're in the middle
> of kvm_mmu_notifier_change_pte() because kvm->mn_active_invalidate_count
> is elevated, and kvm_swap_active_memslots() will busy loop until it reaches
> to zero again. Besides, the swapping from the active to the inactive memory
> slots is also avoided by holding &kvm->srcu in __kvm_handle_hva_range(),
> corresponding to synchronize_srcu_expedited() in kvm_swap_active_memslots().
>
> CPU-A CPU-B
> ----- -----
> ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
> kvm_vm_ioctl_set_memory_region
> kvm_set_memory_region
> __kvm_set_memory_region
> kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
> kvm_invalidate_memslot
> kvm_copy_memslot
> kvm_replace_memslot
> kvm_swap_active_memslots (A)
> kvm_arch_flush_shadow_memslot (B)
> same page sharing by KSM
> kvm_mmu_notifier_invalidate_range_start
> :
> kvm_mmu_notifier_change_pte
> kvm_handle_hva_range
> __kvm_handle_hva_range (C)
> :
> kvm_mmu_notifier_invalidate_range_end
>
> Fix the issue by skipping the invalid memory slot at (C) to avoid the
> illegal stage2 mapping so that the read request for the pflash's status
> is forwarded to QEMU and emulated by it. In this way, the correct pflash's
> status can be returned from QEMU to break the infinite loop in the edk2
> firmware.
>
> Cc: [email protected] # v5.13+
> Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code")
> Reported-by: Shuai Hu <[email protected]>
> Reported-by: Zhenyu Zhang <[email protected]>
> Signed-off-by: Gavin Shan <[email protected]>
Reviewed-by: Shaoqin Huang <[email protected]>
> ---
> v2: Improved changelog suggested by Marc
> ---
> virt/kvm/kvm_main.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 479802a892d4..7f81a3a209b6 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -598,6 +598,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> unsigned long hva_start, hva_end;
>
> slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
> + if (slot->flags & KVM_MEMSLOT_INVALID)
> + continue;
> +
> hva_start = max(range->start, slot->userspace_addr);
> hva_end = min(range->end, slot->userspace_addr +
> (slot->npages << PAGE_SHIFT));
--
Shaoqin
On 09.06.23 12:04, Gavin Shan wrote:
> We run into guest hang in edk2 firmware when KSM is kept as running on
> the host. The edk2 firmware is waiting for status 0x80 from QEMU's pflash
> device (TYPE_PFLASH_CFI01) during the operation of sector erasing or
> buffered write. The status is returned by reading the memory region of
> the pflash device and the read request should have been forwarded to QEMU
> and emulated by it. Unfortunately, the read request is covered by an
> illegal stage2 mapping when the guest hang issue occurs. The read request
> is completed with QEMU bypassed and wrong status is fetched. The edk2
> firmware runs into an infinite loop with the wrong status.
>
> The illegal stage2 mapping is populated due to same page sharing by KSM
> at (C) even the associated memory slot has been marked as invalid at (B)
> when the memory slot is requested to be deleted. It's notable that the
> active and inactive memory slots can't be swapped when we're in the middle
> of kvm_mmu_notifier_change_pte() because kvm->mn_active_invalidate_count
> is elevated, and kvm_swap_active_memslots() will busy loop until it reaches
> to zero again. Besides, the swapping from the active to the inactive memory
> slots is also avoided by holding &kvm->srcu in __kvm_handle_hva_range(),
> corresponding to synchronize_srcu_expedited() in kvm_swap_active_memslots().
>
> CPU-A CPU-B
> ----- -----
> ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
> kvm_vm_ioctl_set_memory_region
> kvm_set_memory_region
> __kvm_set_memory_region
> kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
> kvm_invalidate_memslot
> kvm_copy_memslot
> kvm_replace_memslot
> kvm_swap_active_memslots (A)
> kvm_arch_flush_shadow_memslot (B)
> same page sharing by KSM
> kvm_mmu_notifier_invalidate_range_start
> :
> kvm_mmu_notifier_change_pte
> kvm_handle_hva_range
> __kvm_handle_hva_range (C)
> :
> kvm_mmu_notifier_invalidate_range_end
>
> Fix the issue by skipping the invalid memory slot at (C) to avoid the
> illegal stage2 mapping so that the read request for the pflash's status
> is forwarded to QEMU and emulated by it. In this way, the correct pflash's
> status can be returned from QEMU to break the infinite loop in the edk2
> firmware.
>
> Cc: [email protected] # v5.13+
> Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code")
> Reported-by: Shuai Hu <[email protected]>
> Reported-by: Zhenyu Zhang <[email protected]>
> Signed-off-by: Gavin Shan <[email protected]>
> ---
> v2: Improved changelog suggested by Marc
> ---
> virt/kvm/kvm_main.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 479802a892d4..7f81a3a209b6 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -598,6 +598,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> unsigned long hva_start, hva_end;
>
> slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
> + if (slot->flags & KVM_MEMSLOT_INVALID)
> + continue;
> +
> hva_start = max(range->start, slot->userspace_addr);
> hva_end = min(range->end, slot->userspace_addr +
> (slot->npages << PAGE_SHIFT));
Nice debugging!
LGTM
Reviewed-by: David Hildenbrand <[email protected]>
--
Cheers,
David / dhildenb