v2:
- Added r-b
- Rewrite the comment in faultin_page() for FOLL_INTERRUPTIBLE [John]
- Dropped the controversial patch to introduce a flag for
__gfn_to_pfn_memslot(), instead used a boolean for now [Sean]
- Rename s/is_sigpending_pfn/KVM_PFN_ERR_SIGPENDING/ [Sean]
- Change comment in kvm_faultin_pfn() mentioning fatal signals [Sean]
rfc: https://lore.kernel.org/kvm/[email protected]
v1: https://lore.kernel.org/kvm/[email protected]
One issue was reported where libvirt is unable to stop a virtual
machine with the QMP command "stop" during a paused postcopy migration [1].
It doesn't work because the "stop the VM" operation requires the hypervisor
to kick all the vcpu threads out using SIG_IPI in QEMU (which is translated
into a SIGUSR1). However, during a paused postcopy the vcpu threads hang
in handle_userfault(), so they simply do not respond to the kicks. On top
of that, the "stop" command then hangs the QMP channel as well.
The mm already has a facility to handle generic signals
(FAULT_FLAG_INTERRUPTIBLE), but it is only used in the page fault handlers,
not in GUP. Unluckily, KVM is a heavy GUP user for guest page faults, which
means that with what we have right now there is no way to interrupt a long
page fault while KVM fetches guest pages.
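For context, FAULT_FLAG_INTERRUPTIBLE is what lets a blocked page fault
pick an interruptible sleep; for instance, handle_userfault() chooses its
blocking state roughly like this (paraphrasing fs/userfaultfd.c):

	static inline unsigned int userfaultfd_get_blocking_state(unsigned int flags)
	{
		if (flags & FAULT_FLAG_INTERRUPTIBLE)
			return TASK_INTERRUPTIBLE;

		if (flags & FAULT_FLAG_KILLABLE)
			return TASK_KILLABLE;

		return TASK_UNINTERRUPTIBLE;
	}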
I think it's reasonable for GUP to listen to fatal signals only by
default, as most GUP users are not really ready to handle such a case.
But KVM is not such a user: it has rich infrastructure to handle even
generic signals and to properly deliver them to userspace, after which the
page fault can be retried in the next KVM_RUN.
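To illustrate the retry flow, here is a minimal sketch of the userspace
side, assuming a simplified QEMU-like vcpu loop (process_vcpu_signal() is
a made-up placeholder for whatever the VMM does on a kick):

	for (;;) {
		int ret = ioctl(vcpu_fd, KVM_RUN, 0);

		if (ret < 0 && errno == EINTR) {
			/*
			 * KVM bailed out because a signal (e.g. the SIGUSR1
			 * kick) is pending; service it, then retry KVM_RUN
			 * so the guest page fault is attempted again.
			 */
			process_vcpu_signal();
			continue;
		}
		/* ... otherwise dispatch on run->exit_reason ... */
	}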
This patchset adds FOLL_INTERRUPTIBLE to enable FAULT_FLAG_INTERRUPTIBLE,
and lets KVM be the first user. KVM and mm/gup have always been able to
respond to fatal signals, but not to non-fatal ones until this patchset.
One thing to mention is that this does not allow all KVM paths to respond
to non-fatal signals, only x86 slow page faults. In the future, when more
code is ready for handling signal interruptions, we can explore the
possibility of having more gup callers use FOLL_INTERRUPTIBLE.
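As a sketch of what such an opt-in gup call site could look like (the call
site itself is hypothetical; only the FOLL_INTERRUPTIBLE flag and the
-EINTR return come from this series):

	long npages;
	struct page *page;

	/* Hypothetical caller that can cope with an early exit */
	npages = get_user_pages_unlocked(addr, 1, &page,
					 FOLL_WRITE | FOLL_INTERRUPTIBLE);
	if (npages == -EINTR) {
		/*
		 * A signal (possibly non-fatal) is pending; unwind and let
		 * the caller retry after the signal has been serviced.
		 */
		return -EINTR;
	}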
Tests
=====
I created a postcopy environment, paused the migration by shutting down
the network to emulate a network failure (so handle_userfault() will be
stuck for a long time), then tried three things:
(1) Sending QMP command "stop" to QEMU monitor,
(2) Hitting Ctrl-C from QEMU cmdline,
(3) GDB attach to the dest QEMU process.
Before this patchset, all three use cases hang. After the patchset, all
of them work just as if there were no network failure at all.
Please have a look, thanks.
[1] https://gitlab.com/qemu-project/qemu/-/issues/1052
Peter Xu (3):
mm/gup: Add FOLL_INTERRUPTIBLE
kvm: Add new pfn error KVM_PFN_ERR_SIGPENDING
kvm/x86: Allow to respond to generic signals during slow page faults
arch/arm64/kvm/mmu.c | 2 +-
arch/powerpc/kvm/book3s_64_mmu_hv.c | 2 +-
arch/powerpc/kvm/book3s_64_mmu_radix.c | 2 +-
arch/x86/kvm/mmu/mmu.c | 16 +++++++++++--
include/linux/kvm_host.h | 15 ++++++++++--
include/linux/mm.h | 1 +
mm/gup.c | 33 ++++++++++++++++++++++----
virt/kvm/kvm_main.c | 30 ++++++++++++++---------
virt/kvm/kvm_mm.h | 4 ++--
virt/kvm/pfncache.c | 2 +-
10 files changed, 82 insertions(+), 25 deletions(-)
--
2.32.0
We have had FAULT_FLAG_INTERRUPTIBLE for a while, but it was never applied
to GUP. One issue is that not all GUP paths are able to handle signal
delivery besides SIGKILL.
That's not ideal for GUP users that are actually able to handle these
cases, like KVM.
KVM uses GUP extensively when faulting in guest pages, and it already has
the infrastructure to retry a page fault at a later time. Allowing GUP to
be interrupted by generic signals can make KVM-related threads more
responsive. For example:
(1) SIGUSR1: which QEMU/KVM uses to deliver an inter-process IPI,
e.g. when the admin issues a vm_stop QMP command, SIGUSR1 can be
generated to kick the vcpus out of kernel context immediately,
(2) SIGINT: which lets interactive hypervisor users stop a virtual
machine with Ctrl-C without any delays/hangs,
(3) SIGTRAP: which allows GDB to attach even during page faults that are
stuck for a long time.
Normally the hypervisor will be able to receive these signals properly,
but not if we're stuck in a GUP for a long time for whatever reason. That
happens easily with a stuck postcopy migration when e.g. a temporary
network failure occurs; some vcpu threads can then hang waiting for the
pages. With the new FOLL_INTERRUPTIBLE, GUP users like KVM can selectively
enable the ability to trap these signals.
Reviewed-by: John Hubbard <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/mm.h | 1 +
mm/gup.c | 33 +++++++++++++++++++++++++++++----
2 files changed, 30 insertions(+), 4 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index cf3d0d673f6b..c09eccd5d553 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2941,6 +2941,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
#define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
#define FOLL_PIN 0x40000 /* pages must be released via unpin_user_page */
#define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */
+#define FOLL_INTERRUPTIBLE 0x100000 /* allow interrupts from generic signals */
/*
* FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
diff --git a/mm/gup.c b/mm/gup.c
index 551264407624..f39cbe011cf1 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -933,8 +933,17 @@ static int faultin_page(struct vm_area_struct *vma,
fault_flags |= FAULT_FLAG_WRITE;
if (*flags & FOLL_REMOTE)
fault_flags |= FAULT_FLAG_REMOTE;
- if (locked)
+ if (locked) {
fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+ /*
+ * FAULT_FLAG_INTERRUPTIBLE is opt-in. GUP callers must set
+ * FOLL_INTERRUPTIBLE to enable FAULT_FLAG_INTERRUPTIBLE.
+ * That's because some callers may not be prepared to
+ * handle early exits caused by non-fatal signals.
+ */
+ if (*flags & FOLL_INTERRUPTIBLE)
+ fault_flags |= FAULT_FLAG_INTERRUPTIBLE;
+ }
if (*flags & FOLL_NOWAIT)
fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
if (*flags & FOLL_TRIED) {
@@ -1322,6 +1331,22 @@ int fixup_user_fault(struct mm_struct *mm,
}
EXPORT_SYMBOL_GPL(fixup_user_fault);
+/*
+ * GUP always responds to fatal signals. When FOLL_INTERRUPTIBLE is
+ * specified, it'll also respond to generic signals. The caller of GUP
+ * that has FOLL_INTERRUPTIBLE should take care of the GUP interruption.
+ */
+static bool gup_signal_pending(unsigned int flags)
+{
+ if (fatal_signal_pending(current))
+ return true;
+
+ if (!(flags & FOLL_INTERRUPTIBLE))
+ return false;
+
+ return signal_pending(current);
+}
+
/*
* Please note that this function, unlike __get_user_pages will not
* return 0 for nr_pages > 0 without FOLL_NOWAIT
@@ -1403,11 +1428,11 @@ static __always_inline long __get_user_pages_locked(struct mm_struct *mm,
* Repeat on the address that fired VM_FAULT_RETRY
* with both FAULT_FLAG_ALLOW_RETRY and
* FAULT_FLAG_TRIED. Note that GUP can be interrupted
- * by fatal signals, so we need to check it before we
+ * by fatal signals or even common signals, depending on
+ * the caller's request. So we need to check it before we
* start trying again otherwise it can loop forever.
*/
-
- if (fatal_signal_pending(current)) {
+ if (gup_signal_pending(flags)) {
if (!pages_done)
pages_done = -EINTR;
break;
--
2.32.0
Add one new PFN error type to show when we got interrupted while fetching
the PFN due to a pending signal.
This prepares KVM to be able to respond to SIGUSR1 (for QEMU that's the
SIG_IPI) even during e.g. handling a userfaultfd page fault.
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/kvm_host.h | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 83cf7fd842e0..06a5b17d3679 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -96,6 +96,7 @@
#define KVM_PFN_ERR_FAULT (KVM_PFN_ERR_MASK)
#define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1)
#define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
+#define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3)
/*
* error pfns indicate that the gfn is in slot but faild to
@@ -106,6 +107,16 @@ static inline bool is_error_pfn(kvm_pfn_t pfn)
return !!(pfn & KVM_PFN_ERR_MASK);
}
+/*
+ * When KVM_PFN_ERR_SIGPENDING returned, it means we're interrupted during
+ * fetching the PFN (a signal might have arrived), we may want to retry at
+ * some later point and kick the userspace to handle the signal.
+ */
+static inline bool is_sigpending_pfn(kvm_pfn_t pfn)
+{
+ return pfn == KVM_PFN_ERR_SIGPENDING;
+}
+
/*
* error_noslot pfns indicate that the gfn can not be
* translated to pfn - it is not in slot or failed to
--
2.32.0
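To illustrate how a fault path is expected to consume the new error code,
a hedged sketch follows (the surrounding handler is hypothetical; the real
x86 wiring comes in the next patch):

	/* 'pfn' as returned by a gfn->pfn translation */
	if (is_sigpending_pfn(pfn)) {
		/*
		 * Interrupted while fetching the PFN: exit to userspace
		 * with KVM_EXIT_INTR so the signal can be delivered;
		 * userspace may then retry KVM_RUN.
		 */
		vcpu->run->exit_reason = KVM_EXIT_INTR;
		++vcpu->stat.signal_exits;
		return -EINTR;
	}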
All the facilities should be ready for this; what we need to do is add a
new "interruptible" flag showing that we're willing to be interrupted by
common signals during the __gfn_to_pfn_memslot() request, and wire it up
with the FOLL_INTERRUPTIBLE flag that we've just introduced.
Note that only the x86 slow page fault routine will set this to true. The
new flag defaults to false on non-x86 architectures and on the other gup
paths even for x86. It could be used elsewhere too, but that is not
covered yet.
When we see that the PFN fetching was interrupted, exit early to userspace
with a KVM_EXIT_INTR exit reason.
Signed-off-by: Peter Xu <[email protected]>
---
arch/arm64/kvm/mmu.c | 2 +-
arch/powerpc/kvm/book3s_64_mmu_hv.c | 2 +-
arch/powerpc/kvm/book3s_64_mmu_radix.c | 2 +-
arch/x86/kvm/mmu/mmu.c | 16 ++++++++++++--
include/linux/kvm_host.h | 4 ++--
virt/kvm/kvm_main.c | 30 ++++++++++++++++----------
virt/kvm/kvm_mm.h | 4 ++--
virt/kvm/pfncache.c | 2 +-
8 files changed, 41 insertions(+), 21 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index f5651a05b6a8..93f6b9bf1af1 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1204,7 +1204,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
*/
smp_rmb();
- pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
+ pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
write_fault, &writable, NULL);
if (pfn == KVM_PFN_ERR_HWPOISON) {
kvm_send_hwpoison_signal(hva, vma_shift);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 514fd45c1994..7aed5ef6588e 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -598,7 +598,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_vcpu *vcpu,
write_ok = true;
} else {
/* Call KVM generic code to do the slow-path check */
- pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
+ pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
writing, &write_ok, NULL);
if (is_error_noslot_pfn(pfn))
return -EFAULT;
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 42851c32ff3b..9991f9d9ee59 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -845,7 +845,7 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
unsigned long pfn;
/* Call KVM generic code to do the slow-path check */
- pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
+ pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
writing, upgrade_p, NULL);
if (is_error_noslot_pfn(pfn))
return -EFAULT;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 17252f39bd7c..aeafe0e9cfbf 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3012,6 +3012,13 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
static int handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
unsigned int access)
{
+ /* NOTE: not all error pfn is fatal; handle sigpending pfn first */
+ if (unlikely(is_sigpending_pfn(fault->pfn))) {
+ vcpu->run->exit_reason = KVM_EXIT_INTR;
+ ++vcpu->stat.signal_exits;
+ return -EINTR;
+ }
+
/* The pfn is invalid, report the error! */
if (unlikely(is_error_pfn(fault->pfn)))
return kvm_handle_bad_page(vcpu, fault->gfn, fault->pfn);
@@ -3999,7 +4006,7 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
}
async = false;
- fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async,
+ fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
fault->write, &fault->map_writable,
&fault->hva);
if (!async)
@@ -4016,7 +4023,12 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
}
}
- fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, NULL,
+ /*
+ * Allow gup to bail on pending non-fatal signals when it's also allowed
+ * to wait for IO. Note, gup always bails if it is unable to quickly
+ * get a page and a fatal signal, i.e. SIGKILL, is pending.
+ */
+ fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, true, NULL,
fault->write, &fault->map_writable,
&fault->hva);
return RET_PF_CONTINUE;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 06a5b17d3679..5bae753ebe48 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1158,8 +1158,8 @@ kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn);
kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn);
kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
- bool atomic, bool *async, bool write_fault,
- bool *writable, hva_t *hva);
+ bool atomic, bool interruptible, bool *async,
+ bool write_fault, bool *writable, hva_t *hva);
void kvm_release_pfn_clean(kvm_pfn_t pfn);
void kvm_release_pfn_dirty(kvm_pfn_t pfn);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a49df8988cd6..25deacc705b8 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2445,7 +2445,7 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
* 1 indicates success, -errno is returned if error is detected.
*/
static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
- bool *writable, kvm_pfn_t *pfn)
+ bool interruptible, bool *writable, kvm_pfn_t *pfn)
{
unsigned int flags = FOLL_HWPOISON;
struct page *page;
@@ -2460,6 +2460,8 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
flags |= FOLL_WRITE;
if (async)
flags |= FOLL_NOWAIT;
+ if (interruptible)
+ flags |= FOLL_INTERRUPTIBLE;
npages = get_user_pages_unlocked(addr, 1, &page, flags);
if (npages != 1)
@@ -2566,6 +2568,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
* Pin guest page in memory and return its pfn.
* @addr: host virtual address which maps memory to the guest
* @atomic: whether this function can sleep
+ * @interruptible: whether the process can be interrupted by non-fatal signals
* @async: whether this function need to wait IO complete if the
* host page is not in the memory
* @write_fault: whether we should get a writable host page
@@ -2576,8 +2579,8 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
* 2): @write_fault = false && @writable, @writable will tell the caller
* whether the mapping is writable.
*/
-kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
- bool write_fault, bool *writable)
+kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
+ bool *async, bool write_fault, bool *writable)
{
struct vm_area_struct *vma;
kvm_pfn_t pfn = 0;
@@ -2592,9 +2595,12 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
if (atomic)
return KVM_PFN_ERR_FAULT;
- npages = hva_to_pfn_slow(addr, async, write_fault, writable, &pfn);
+ npages = hva_to_pfn_slow(addr, async, write_fault, interruptible,
+ writable, &pfn);
if (npages == 1)
return pfn;
+ if (npages == -EINTR)
+ return KVM_PFN_ERR_SIGPENDING;
mmap_read_lock(current->mm);
if (npages == -EHWPOISON ||
@@ -2625,8 +2631,8 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
}
kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
- bool atomic, bool *async, bool write_fault,
- bool *writable, hva_t *hva)
+ bool atomic, bool interruptible, bool *async,
+ bool write_fault, bool *writable, hva_t *hva)
{
unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
@@ -2651,7 +2657,7 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
writable = NULL;
}
- return hva_to_pfn(addr, atomic, async, write_fault,
+ return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
writable);
}
EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
@@ -2659,20 +2665,22 @@ EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
bool *writable)
{
- return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, NULL,
- write_fault, writable, NULL);
+ return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false,
+ false, NULL, write_fault, writable, NULL);
}
EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);
kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
{
- return __gfn_to_pfn_memslot(slot, gfn, false, NULL, true, NULL, NULL);
+ return __gfn_to_pfn_memslot(slot, gfn, false, false, NULL, true,
+ NULL, NULL);
}
EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);
kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn)
{
- return __gfn_to_pfn_memslot(slot, gfn, true, NULL, true, NULL, NULL);
+ return __gfn_to_pfn_memslot(slot, gfn, true, false, NULL, true,
+ NULL, NULL);
}
EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);
diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
index 41da467d99c9..a1ab15006af3 100644
--- a/virt/kvm/kvm_mm.h
+++ b/virt/kvm/kvm_mm.h
@@ -24,8 +24,8 @@
#define KVM_MMU_READ_UNLOCK(kvm) spin_unlock(&(kvm)->mmu_lock)
#endif /* KVM_HAVE_MMU_RWLOCK */
-kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
- bool write_fault, bool *writable);
+kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
+ bool *async, bool write_fault, bool *writable);
#ifdef CONFIG_HAVE_KVM_PFNCACHE
void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index dd84676615f1..294808e77f44 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -123,7 +123,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, unsigned long uhva)
smp_rmb();
/* We always request a writeable mapping */
- new_pfn = hva_to_pfn(uhva, false, NULL, true, NULL);
+ new_pfn = hva_to_pfn(uhva, false, false, NULL, true, NULL);
if (is_error_noslot_pfn(new_pfn))
break;
--
2.32.0
On 21.07.22 02:03, Peter Xu wrote:
> We have had FAULT_FLAG_INTERRUPTIBLE for a while, but it was never applied
> to GUP. One issue is that not all GUP paths are able to handle signal
> delivery besides SIGKILL.
>
> That's not ideal for GUP users that are actually able to handle these
> cases, like KVM.
>
> KVM uses GUP extensively when faulting in guest pages, and it already has
> the infrastructure to retry a page fault at a later time. Allowing GUP to
> be interrupted by generic signals can make KVM-related threads more
> responsive. For example:
>
> (1) SIGUSR1: which QEMU/KVM uses to deliver an inter-process IPI,
> e.g. when the admin issues a vm_stop QMP command, SIGUSR1 can be
> generated to kick the vcpus out of kernel context immediately,
>
> (2) SIGINT: which lets interactive hypervisor users stop a virtual
> machine with Ctrl-C without any delays/hangs,
>
> (3) SIGTRAP: which allows GDB to attach even during page faults that are
> stuck for a long time.
>
> Normally the hypervisor will be able to receive these signals properly,
> but not if we're stuck in a GUP for a long time for whatever reason. That
> happens easily with a stuck postcopy migration when e.g. a temporary
> network failure occurs; some vcpu threads can then hang waiting for the
> pages. With the new FOLL_INTERRUPTIBLE, GUP users like KVM can selectively
> enable the ability to trap these signals.
>
> Reviewed-by: John Hubbard <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
--
Thanks,
David / dhildenb
Any further comments? Thanks,
On Wed, Jul 20, 2022 at 08:03:15PM -0400, Peter Xu wrote:
> v2:
> - Added r-b
> - Rewrite the comment in faultin_page() for FOLL_INTERRUPTIBLE [John]
> - Dropped the controversial patch to introduce a flag for
> __gfn_to_pfn_memslot(), instead used a boolean for now [Sean]
> - Rename s/is_sigpending_pfn/KVM_PFN_ERR_SIGPENDING/ [Sean]
> - Change comment in kvm_faultin_pfn() mentioning fatal signals [Sean]
>
> rfc: https://lore.kernel.org/kvm/[email protected]
> v1: https://lore.kernel.org/kvm/[email protected]
>
> One issue was reported where libvirt is unable to stop a virtual
> machine with the QMP command "stop" during a paused postcopy migration [1].
>
> It doesn't work because the "stop the VM" operation requires the hypervisor
> to kick all the vcpu threads out using SIG_IPI in QEMU (which is translated
> into a SIGUSR1). However, during a paused postcopy the vcpu threads hang
> in handle_userfault(), so they simply do not respond to the kicks. On top
> of that, the "stop" command then hangs the QMP channel as well.
>
> The mm already has a facility to handle generic signals
> (FAULT_FLAG_INTERRUPTIBLE), but it is only used in the page fault
> handlers, not in GUP. Unluckily, KVM is a heavy GUP user for guest page
> faults, which means that with what we have right now there is no way to
> interrupt a long page fault while KVM fetches guest pages.
>
> I think it's reasonable for GUP to listen to fatal signals only by
> default, as most GUP users are not really ready to handle such a case.
> But KVM is not such a user: it has rich infrastructure to handle even
> generic signals and to properly deliver them to userspace, after which
> the page fault can be retried in the next KVM_RUN.
>
> This patchset adds FOLL_INTERRUPTIBLE to enable FAULT_FLAG_INTERRUPTIBLE,
> and lets KVM be the first user. KVM and mm/gup have always been able to
> respond to fatal signals, but not to non-fatal ones until this patchset.
>
> One thing to mention is that this does not allow all KVM paths to respond
> to non-fatal signals, only x86 slow page faults. In the future, when more
> code is ready for handling signal interruptions, we can explore the
> possibility of having more gup callers use FOLL_INTERRUPTIBLE.
>
> Tests
> =====
>
> I created a postcopy environment, paused the migration by shutting down
> the network to emulate a network failure (so handle_userfault() will be
> stuck for a long time), then tried three things:
>
> (1) Sending QMP command "stop" to QEMU monitor,
> (2) Hitting Ctrl-C from QEMU cmdline,
> (3) GDB attach to the dest QEMU process.
>
> Before this patchset, all three use cases hang. After the patchset, all
> of them work just as if there were no network failure at all.
>
> Please have a look, thanks.
>
> [1] https://gitlab.com/qemu-project/qemu/-/issues/1052
--
Peter Xu
On Wed, Jul 20, 2022, Peter Xu wrote:
> Add one new PFN error type to show when we got interrupted while fetching
s/we/KVM
> the PFN due to a pending signal.
>
> This prepares KVM to be able to respond to SIGUSR1 (for QEMU that's the
> SIG_IPI) even during e.g. handling a userfaultfd page fault.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> include/linux/kvm_host.h | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 83cf7fd842e0..06a5b17d3679 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -96,6 +96,7 @@
> #define KVM_PFN_ERR_FAULT (KVM_PFN_ERR_MASK)
> #define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1)
> #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
> +#define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3)
>
> /*
> * error pfns indicate that the gfn is in slot but faild to
> @@ -106,6 +107,16 @@ static inline bool is_error_pfn(kvm_pfn_t pfn)
> return !!(pfn & KVM_PFN_ERR_MASK);
> }
>
> +/*
> + * When KVM_PFN_ERR_SIGPENDING returned, it means we're interrupted during
> + * fetching the PFN (a signal might have arrived), we may want to retry at
Please avoid "we". Tthe first "we're" can refer to KVM and/or the kernel,
whereas the second is a weird mix of KVM and userspace (KVM exits to userspace,
but it's userspace's decision whether or not to retry).
Easiest thing is to avoid the "we" entirely and not speculate on what may happen.
E.g.
/*
* KVM_PFN_ERR_SIGPENDING indicates that fetching the PFN was interrupted by a
* pending signal. Note, the signal may or may not be fatal.
*/
> + * some later point and kick the userspace to handle the signal.
> + */
> +static inline bool is_sigpending_pfn(kvm_pfn_t pfn)
> +{
> + return pfn == KVM_PFN_ERR_SIGPENDING;
> +}
> +
> /*
> * error_noslot pfns indicate that the gfn can not be
> * translated to pfn - it is not in slot or failed to
> --
> 2.32.0
>
On Wed, Jul 20, 2022, Peter Xu wrote:
> All the facilities should be ready for this; what we need to do is add a
> new "interruptible" flag showing that we're willing to be interrupted by
> common signals during the __gfn_to_pfn_memslot() request, and wire it up
> with the FOLL_INTERRUPTIBLE flag that we've just introduced.
>
> Note that only the x86 slow page fault routine will set this to true. The
> new flag defaults to false on non-x86 architectures and on the other gup
> paths even for x86. It could be used elsewhere too, but that is not
> covered yet.
>
> When we see that the PFN fetching was interrupted, exit early to userspace
> with a KVM_EXIT_INTR exit reason.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> arch/arm64/kvm/mmu.c | 2 +-
> arch/powerpc/kvm/book3s_64_mmu_hv.c | 2 +-
> arch/powerpc/kvm/book3s_64_mmu_radix.c | 2 +-
> arch/x86/kvm/mmu/mmu.c | 16 ++++++++++++--
> include/linux/kvm_host.h | 4 ++--
> virt/kvm/kvm_main.c | 30 ++++++++++++++++----------
> virt/kvm/kvm_mm.h | 4 ++--
> virt/kvm/pfncache.c | 2 +-
> 8 files changed, 41 insertions(+), 21 deletions(-)
I don't usually like adding code without a user, but in this case I think I'd
prefer to add the @interruptible param and then activate x86's kvm_faultin_pfn()
in a separate patch. It's rather difficult to tease out the functional x86
change, and that would also allow other architectures to use the interruptible
support without needing to depend on the functional x86 change.
And maybe squash the addition of @interruptible with the previous patch? I.e.
add all of the infrastructure for KVM_PFN_ERR_SIGPENDING in patch 2, then use it
in x86 in patch 3.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 17252f39bd7c..aeafe0e9cfbf 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3012,6 +3012,13 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> static int handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> unsigned int access)
> {
> + /* NOTE: not all error pfn is fatal; handle sigpending pfn first */
> + if (unlikely(is_sigpending_pfn(fault->pfn))) {
Move this into kvm_handle_bad_page(), then there's no need for a comment to call
out that this needs to come before the is_error_pfn() check. This _is_ a "bad"
PFN, it just so happens that userspace might be able to resolve the "bad" PFN.
> + vcpu->run->exit_reason = KVM_EXIT_INTR;
> + ++vcpu->stat.signal_exits;
> + return -EINTR;
For better or worse, kvm_handle_signal_exit() exists and can be used here. I
don't love that KVM details bleed into xfer_to_guest_mode_work(), but that's a
future problem.
I do think that the "return -EINTR" should be moved into kvm_handle_signal_exit(),
partly for code reuse and partly because returning -EINTR is very much KVM ABI.
Oof, but there are a _lot_ of paths that can use kvm_handle_signal_exit(), and
some of them don't select KVM_XFER_TO_GUEST_WORK, i.e. kvm_handle_signal_exit()
should be defined unconditionally. I'll work on a series to handle that separately,
no reason to take a dependency on that cleanup.
So for now,
static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
{
if (pfn == KVM_PFN_ERR_SIGPENDING) {
kvm_handle_signal_exit(vcpu);
return -EINTR;
}
...
}
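For reference, kvm_handle_signal_exit() as currently defined (under
CONFIG_KVM_XFER_TO_GUEST_WORK in kvm_host.h) is roughly:

static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
{
	vcpu->run->exit_reason = KVM_EXIT_INTR;
	vcpu->stat.signal_exits++;
}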
On Thu, Aug 11, 2022 at 08:12:38PM +0000, Sean Christopherson wrote:
> On Wed, Jul 20, 2022, Peter Xu wrote:
> > All the facilities should be ready for this; what we need to do is add a
> > new "interruptible" flag showing that we're willing to be interrupted by
> > common signals during the __gfn_to_pfn_memslot() request, and wire it up
> > with the FOLL_INTERRUPTIBLE flag that we've just introduced.
> >
> > Note that only the x86 slow page fault routine will set this to true. The
> > new flag defaults to false on non-x86 architectures and on the other gup
> > paths even for x86. It could be used elsewhere too, but that is not
> > covered yet.
> >
> > When we see that the PFN fetching was interrupted, exit early to userspace
> > with a KVM_EXIT_INTR exit reason.
> >
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > arch/arm64/kvm/mmu.c | 2 +-
> > arch/powerpc/kvm/book3s_64_mmu_hv.c | 2 +-
> > arch/powerpc/kvm/book3s_64_mmu_radix.c | 2 +-
> > arch/x86/kvm/mmu/mmu.c | 16 ++++++++++++--
> > include/linux/kvm_host.h | 4 ++--
> > virt/kvm/kvm_main.c | 30 ++++++++++++++++----------
> > virt/kvm/kvm_mm.h | 4 ++--
> > virt/kvm/pfncache.c | 2 +-
> > 8 files changed, 41 insertions(+), 21 deletions(-)
>
> I don't usually like adding code without a user, but in this case I think I'd
> prefer to add the @interruptible param and then activate x86's kvm_faultin_pfn()
> in a separate patch. It's rather difficult to tease out the functional x86
> change, and that would also allow other architectures to use the interruptible
> support without needing to depend on the functional x86 change.
>
> And maybe squash the addition of @interruptible with the previous patch? I.e.
> add all of the infrastructure for KVM_PFN_ERR_SIGPENDING in patch 2, then use it
> in x86 in patch 3.
Sounds good.
>
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 17252f39bd7c..aeafe0e9cfbf 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3012,6 +3012,13 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> > static int handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > unsigned int access)
> > {
> > + /* NOTE: not all error pfn is fatal; handle sigpending pfn first */
> > + if (unlikely(is_sigpending_pfn(fault->pfn))) {
>
> Move this into kvm_handle_bad_page(), then there's no need for a comment to call
> out that this needs to come before the is_error_pfn() check. This _is_ a "bad"
> PFN, it just so happens that userspace might be able to resolve the "bad" PFN.
It's a pity it needs to be in "bad pfn" category since that's the only
thing we can easily use, but true it is now.
>
> > + vcpu->run->exit_reason = KVM_EXIT_INTR;
> > + ++vcpu->stat.signal_exits;
> > + return -EINTR;
>
> For better or worse, kvm_handle_signal_exit() exists and can be used here. I
> don't love that KVM details bleed into xfer_to_guest_mode_work(), but that's a
> future problem.
>
> I do think that the "return -EINTR" should be moved into kvm_handle_signal_exit(),
> partly for code reuse and partly because returning -EINTR is very much KVM ABI.
> Oof, but there are a _lot_ of paths that can use kvm_handle_signal_exit(), and
> some of them don't select KVM_XFER_TO_GUEST_WORK, i.e. kvm_handle_signal_exit()
> should be defined unconditionally. I'll work on a series to handle that separately,
> no reason to take a dependency on that cleanup.
>
> So for now,
>
> static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> {
> if (pfn == KVM_PFN_ERR_SIGPENDING) {
> kvm_handle_signal_exit(vcpu);
> return -EINTR;
> }
>
> ...
> }
Sounds good too here. Also all points taken in the wording of patch 2.
Will respin shortly, thanks Sean.
--
Peter Xu
On Thu, Aug 11, 2022, Peter Xu wrote:
> On Thu, Aug 11, 2022 at 08:12:38PM +0000, Sean Christopherson wrote:
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 17252f39bd7c..aeafe0e9cfbf 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -3012,6 +3012,13 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> > > static int handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > > unsigned int access)
> > > {
> > > + /* NOTE: not all error pfn is fatal; handle sigpending pfn first */
> > > + if (unlikely(is_sigpending_pfn(fault->pfn))) {
> >
> > Move this into kvm_handle_bad_page(), then there's no need for a comment to call
> > out that this needs to come before the is_error_pfn() check. This _is_ a "bad"
> > PFN, it just so happens that userspace might be able to resolve the "bad" PFN.
>
> It's a pity it needs to be in "bad pfn" category since that's the only
> thing we can easily use, but true it is now.
Would renaming that to kvm_handle_error_pfn() help? I agree that "bad" is poor
terminology now that it handles a variety of errors, hence the quotes.
On Mon, Aug 15, 2022 at 09:26:37PM +0000, Sean Christopherson wrote:
> On Thu, Aug 11, 2022, Peter Xu wrote:
> > On Thu, Aug 11, 2022 at 08:12:38PM +0000, Sean Christopherson wrote:
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index 17252f39bd7c..aeafe0e9cfbf 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -3012,6 +3012,13 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> > > > static int handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > > > unsigned int access)
> > > > {
> > > > + /* NOTE: not all error pfn is fatal; handle sigpending pfn first */
> > > > + if (unlikely(is_sigpending_pfn(fault->pfn))) {
> > >
> > > Move this into kvm_handle_bad_page(), then there's no need for a comment to call
> > > out that this needs to come before the is_error_pfn() check. This _is_ a "bad"
> > > PFN, it just so happens that userspace might be able to resolve the "bad" PFN.
> >
> > It's a pity it needs to be in "bad pfn" category since that's the only
> > thing we can easily use, but true it is now.
>
> Would renaming that to kvm_handle_error_pfn() help? I agree that "bad" is poor
> terminology now that it handles a variety of errors, hence the quotes.
It could be slightly helpful I think, at least it starts to match with how
we name KVM_PFN_ERR_*. Will squash the renaming into the same patch.
Thanks,
--
Peter Xu
On Tue, Aug 16, 2022 at 1:48 PM Peter Xu <[email protected]> wrote:
>
> On Mon, Aug 15, 2022 at 09:26:37PM +0000, Sean Christopherson wrote:
> > On Thu, Aug 11, 2022, Peter Xu wrote:
> > > On Thu, Aug 11, 2022 at 08:12:38PM +0000, Sean Christopherson wrote:
> > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > > index 17252f39bd7c..aeafe0e9cfbf 100644
> > > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > > @@ -3012,6 +3012,13 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> > > > > static int handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > > > > unsigned int access)
> > > > > {
> > > > > + /* NOTE: not all error pfn is fatal; handle sigpending pfn first */
> > > > > + if (unlikely(is_sigpending_pfn(fault->pfn))) {
> > > >
> > > > Move this into kvm_handle_bad_page(), then there's no need for a comment to call
> > > > out that this needs to come before the is_error_pfn() check. This _is_ a "bad"
> > > > PFN, it just so happens that userspace might be able to resolve the "bad" PFN.
> > >
> > > It's a pity it needs to be in "bad pfn" category since that's the only
> > > thing we can easily use, but true it is now.
> >
> > Would renaming that to kvm_handle_error_pfn() help? I agree that "bad" is poor
> > terminology now that it handles a variety of errors, hence the quotes.
>
> It could be slightly helpful I think, at least it starts to match with how
> we name KVM_PFN_ERR_*. Will squash the renaming into the same patch.
+1 to kvm_handle_error_pfn(). Weirdly I proposed the same as part of
another series yesterday [1]. That being said I'm probably going to
drop my cleanup patch (specifically patches 7-9) since it conflicts
with your changes and there is a bug in the last patch.
[1] https://lore.kernel.org/kvm/[email protected]/
>
> Thanks,
>
> --
> Peter Xu
>
On Wed, Jul 20, 2022 at 08:03:16PM -0400, Peter Xu wrote:
> We have had FAULT_FLAG_INTERRUPTIBLE for a while, but it was never applied
> to GUP. One issue is that not all GUP paths are able to handle signal
> delivery besides SIGKILL.
>
> That's not ideal for GUP users that are actually able to handle these
> cases, like KVM.
>
> KVM uses GUP extensively when faulting in guest pages, and it already has
> the infrastructure to retry a page fault at a later time. Allowing GUP to
> be interrupted by generic signals can make KVM-related threads more
> responsive. For example:
>
> (1) SIGUSR1: which QEMU/KVM uses to deliver an inter-process IPI,
> e.g. when the admin issues a vm_stop QMP command, SIGUSR1 can be
> generated to kick the vcpus out of kernel context immediately,
>
> (2) SIGINT: which lets interactive hypervisor users stop a virtual
> machine with Ctrl-C without any delays/hangs,
>
> (3) SIGTRAP: which allows GDB to attach even during page faults that are
> stuck for a long time.
>
> Normally the hypervisor will be able to receive these signals properly,
> but not if we're stuck in a GUP for a long time for whatever reason. That
> happens easily with a stuck postcopy migration when e.g. a temporary
> network failure occurs; some vcpu threads can then hang waiting for the
> pages. With the new FOLL_INTERRUPTIBLE, GUP users like KVM can selectively
> enable the ability to trap these signals.
>
> Reviewed-by: John Hubbard <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
Will squash in the hugetlb support too, which is a one-liner anyway:
---8<---
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a57e1be41401..4025a305d573 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6176,9 +6176,12 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
fault_flags |= FAULT_FLAG_WRITE;
else if (unshare)
fault_flags |= FAULT_FLAG_UNSHARE;
- if (locked)
+ if (locked) {
fault_flags |= FAULT_FLAG_ALLOW_RETRY |
FAULT_FLAG_KILLABLE;
+ if (flags & FOLL_INTERRUPTIBLE)
+ fault_flags |= FAULT_FLAG_INTERRUPTIBLE;
+ }
if (flags & FOLL_NOWAIT)
fault_flags |= FAULT_FLAG_ALLOW_RETRY |
FAULT_FLAG_RETRY_NOWAIT;
---8<---
I'll still keep the R-bs from John and DavidH.
Thanks,
--
Peter Xu
On Tue, Aug 16, 2022 at 03:51:16PM -0700, David Matlack wrote:
> On Tue, Aug 16, 2022 at 1:48 PM Peter Xu <[email protected]> wrote:
> >
> > On Mon, Aug 15, 2022 at 09:26:37PM +0000, Sean Christopherson wrote:
> > > On Thu, Aug 11, 2022, Peter Xu wrote:
> > > > On Thu, Aug 11, 2022 at 08:12:38PM +0000, Sean Christopherson wrote:
> > > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > > > index 17252f39bd7c..aeafe0e9cfbf 100644
> > > > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > > > @@ -3012,6 +3012,13 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> > > > > > static int handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > > > > > unsigned int access)
> > > > > > {
> > > > > > + /* NOTE: not all error pfn is fatal; handle sigpending pfn first */
> > > > > > + if (unlikely(is_sigpending_pfn(fault->pfn))) {
> > > > >
> > > > > Move this into kvm_handle_bad_page(), then there's no need for a comment to call
> > > > > out that this needs to come before the is_error_pfn() check. This _is_ a "bad"
> > > > > PFN, it just so happens that userspace might be able to resolve the "bad" PFN.
> > > >
> > > > It's a pity it needs to be in "bad pfn" category since that's the only
> > > > thing we can easily use, but true it is now.
> > >
> > > Would renaming that to kvm_handle_error_pfn() help? I agree that "bad" is poor
> > > terminology now that it handles a variety of errors, hence the quotes.
> >
> > It could be slightly helpful I think, at least it starts to match with how
> > we name KVM_PFN_ERR_*. Will squash the renaming into the same patch.
>
> +1 to kvm_handle_error_pfn(). Weirdly I proposed the same as part of
> another series yesterday [1]. That being said I'm probably going to
> drop my cleanup patch (specifically patches 7-9) since it conflicts
> with your changes and there is a bug in the last patch.
>
> [1] https://lore.kernel.org/kvm/[email protected]/
Thanks for the heads-up.
Please still feel free to keep working on new versions since I'm still not
sure which one will land earlier. I'll repost this one very soon (I just
added the hugetlb support which I had overlooked; it's a touch-up in patch
1 only, though). I can always rebase on top too.
--
Peter Xu