Register to .numa_protect() callback in mmu notifier so that KVM can get
acurate information about when a page is PROT_NONE protected in primary
MMU and unmap it in secondary MMU accordingly.
In KVM's .invalidate_range_start() handler, if the event is to notify that
the range may be protected to PROT_NONE for NUMA migration purpose,
don't do the unmapping in secondary MMU. Hold on until.numa_protect()
comes.
Signed-off-by: Yan Zhao <[email protected]>
---
virt/kvm/kvm_main.c | 25 ++++++++++++++++++++++---
1 file changed, 22 insertions(+), 3 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dfbaafbe3a00..907444a1761b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -711,6 +711,20 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
kvm_handle_hva_range(mn, address, address + 1, pte, kvm_change_spte_gfn);
}
+static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ struct kvm *kvm = mmu_notifier_to_kvm(mn);
+
+ WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
+ if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
+ return;
+
+ kvm_handle_hva_range(mn, start, end, __pte(0), kvm_unmap_gfn_range);
+}
+
void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
unsigned long end)
{
@@ -744,14 +758,18 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
const struct mmu_notifier_range *range)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
+ bool is_numa = (range->event == MMU_NOTIFY_PROTECTION_VMA) &&
+ (range->flags & MMU_NOTIFIER_RANGE_NUMA);
const struct kvm_hva_range hva_range = {
.start = range->start,
.end = range->end,
.pte = __pte(0),
- .handler = kvm_unmap_gfn_range,
+ .handler = !is_numa ? kvm_unmap_gfn_range :
+ (void *)kvm_null_fn,
.on_lock = kvm_mmu_invalidate_begin,
- .on_unlock = kvm_arch_guest_memory_reclaimed,
- .flush_on_ret = true,
+ .on_unlock = !is_numa ? kvm_arch_guest_memory_reclaimed :
+ (void *)kvm_null_fn,
+ .flush_on_ret = !is_numa ? true : false,
.may_block = mmu_notifier_range_blockable(range),
};
@@ -899,6 +917,7 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
.clear_young = kvm_mmu_notifier_clear_young,
.test_young = kvm_mmu_notifier_test_young,
.change_pte = kvm_mmu_notifier_change_pte,
+ .numa_protect = kvm_mmu_notifier_numa_protect,
.release = kvm_mmu_notifier_release,
};
--
2.17.1
在 2023/8/10 17:02, Yan Zhao 写道:
> Register to .numa_protect() callback in mmu notifier so that KVM can get
> acurate information about when a page is PROT_NONE protected in primary
> MMU and unmap it in secondary MMU accordingly.
>
> In KVM's .invalidate_range_start() handler, if the event is to notify that
> the range may be protected to PROT_NONE for NUMA migration purpose,
> don't do the unmapping in secondary MMU. Hold on until.numa_protect()
> comes.
>
> Signed-off-by: Yan Zhao <[email protected]>
> ---
> virt/kvm/kvm_main.c | 25 ++++++++++++++++++++++---
> 1 file changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index dfbaafbe3a00..907444a1761b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -711,6 +711,20 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> kvm_handle_hva_range(mn, address, address + 1, pte, kvm_change_spte_gfn);
> }
>
> +static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
> + struct mm_struct *mm,
> + unsigned long start,
> + unsigned long end)
> +{
> + struct kvm *kvm = mmu_notifier_to_kvm(mn);
> +
> + WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
> + if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
> + return;
> +
> + kvm_handle_hva_range(mn, start, end, __pte(0), kvm_unmap_gfn_range);
> +}
numa balance will scan wide memory range, and there will be one time
ipi notification with kvm_flush_remote_tlbs. With page level notification,
it may bring out lots of flush remote tlb ipi notification.
however numa balance notification, pmd table of vm maybe needs not be freed
in kvm_unmap_gfn_range.
Regards
Bibo Mao
> +
> void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> unsigned long end)
> {
> @@ -744,14 +758,18 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> const struct mmu_notifier_range *range)
> {
> struct kvm *kvm = mmu_notifier_to_kvm(mn);
> + bool is_numa = (range->event == MMU_NOTIFY_PROTECTION_VMA) &&
> + (range->flags & MMU_NOTIFIER_RANGE_NUMA);
> const struct kvm_hva_range hva_range = {
> .start = range->start,
> .end = range->end,
> .pte = __pte(0),
> - .handler = kvm_unmap_gfn_range,
> + .handler = !is_numa ? kvm_unmap_gfn_range :
> + (void *)kvm_null_fn,
> .on_lock = kvm_mmu_invalidate_begin,
> - .on_unlock = kvm_arch_guest_memory_reclaimed,
> - .flush_on_ret = true,
> + .on_unlock = !is_numa ? kvm_arch_guest_memory_reclaimed :
> + (void *)kvm_null_fn,
> + .flush_on_ret = !is_numa ? true : false,
> .may_block = mmu_notifier_range_blockable(range),
> };
>
> @@ -899,6 +917,7 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
> .clear_young = kvm_mmu_notifier_clear_young,
> .test_young = kvm_mmu_notifier_test_young,
> .change_pte = kvm_mmu_notifier_change_pte,
> + .numa_protect = kvm_mmu_notifier_numa_protect,
> .release = kvm_mmu_notifier_release,
> };
>
> > +static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
> > + struct mm_struct *mm,
> > + unsigned long start,
> > + unsigned long end)
> > +{
> > + struct kvm *kvm = mmu_notifier_to_kvm(mn);
> > +
> > + WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
> > + if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
> > + return;
> > +
> > + kvm_handle_hva_range(mn, start, end, __pte(0), kvm_unmap_gfn_range);
> > +}
> numa balance will scan wide memory range, and there will be one time
Though scanning memory range is wide, .invalidate_range_start() is sent
for each 2M range.
> ipi notification with kvm_flush_remote_tlbs. With page level notification,
> it may bring out lots of flush remote tlb ipi notification.
Hmm, for VMs with assigned devices, apparently, the flush remote tlb IPIs
will be reduced to 0 with this series.
For VMs without assigned devices or mdev devices, I was previously also
worried about that there might be more IPIs.
But with current test data, there's no more remote tlb IPIs on average.
The reason is below:
Before this series, kvm_unmap_gfn_range() is called for once for a 2M
range.
After this series, kvm_unmap_gfn_range() is called for once if the 2M is
mapped to a huge page in primary MMU, and called for at most 512 times
if mapped to 4K pages in primary MMU.
Though kvm_unmap_gfn_range() is only called once before this series,
as the range is blockable, when there're contentions, remote tlb IPIs
can be sent page by page in 4K granularity (in tdp_mmu_iter_cond_resched())
if the pages are mapped in 4K in secondary MMU.
With this series, on the other hand, .numa_protect() sets range to be
unblockable. So there could be less remote tlb IPIs when a 2M range is
mapped into small PTEs in secondary MMU.
Besides, .numa_protect() is not sent for all pages in a given 2M range.
Below is my testing data on a VM without assigned devices:
The data is an average of 10 times guest boot-up.
data | numa balancing caused | numa balancing caused
on average | #kvm_unmap_gfn_range() | #kvm_flush_remote_tlbs()
-------------------|------------------------|--------------------------
before this series | 35 | 8625
after this series | 10037 | 4610
For a single guest bootup,
| numa balancing caused | numa balancing caused
best data | #kvm_unmap_gfn_range() | #kvm_flush_remote_tlbs()
-------------------|------------------------|--------------------------
before this series | 28 | 13
after this series | 406 | 195
| numa balancing caused | numa balancing caused
worst data | #kvm_unmap_gfn_range() | #kvm_flush_remote_tlbs()
-------------------|------------------------|--------------------------
before this series | 44 | 43920
after this series | 17352 | 8668
>
> however numa balance notification, pmd table of vm maybe needs not be freed
> in kvm_unmap_gfn_range.
>
在 2023/8/11 11:45, Yan Zhao 写道:
>>> +static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
>>> + struct mm_struct *mm,
>>> + unsigned long start,
>>> + unsigned long end)
>>> +{
>>> + struct kvm *kvm = mmu_notifier_to_kvm(mn);
>>> +
>>> + WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
>>> + if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
>>> + return;
>>> +
>>> + kvm_handle_hva_range(mn, start, end, __pte(0), kvm_unmap_gfn_range);
>>> +}
>> numa balance will scan wide memory range, and there will be one time
> Though scanning memory range is wide, .invalidate_range_start() is sent
> for each 2M range.
yes, range is huge page size when changing numa protection during numa scanning.
>
>> ipi notification with kvm_flush_remote_tlbs. With page level notification,
>> it may bring out lots of flush remote tlb ipi notification.
>
> Hmm, for VMs with assigned devices, apparently, the flush remote tlb IPIs
> will be reduced to 0 with this series.
>
> For VMs without assigned devices or mdev devices, I was previously also
> worried about that there might be more IPIs.
> But with current test data, there's no more remote tlb IPIs on average.
>
> The reason is below:
>
> Before this series, kvm_unmap_gfn_range() is called for once for a 2M
> range.
> After this series, kvm_unmap_gfn_range() is called for once if the 2M is
> mapped to a huge page in primary MMU, and called for at most 512 times
> if mapped to 4K pages in primary MMU.
>
>
> Though kvm_unmap_gfn_range() is only called once before this series,
> as the range is blockable, when there're contentions, remote tlb IPIs
> can be sent page by page in 4K granularity (in tdp_mmu_iter_cond_resched())
I do not know much about x86, does this happen always or only need reschedule
from code? so that there will be many times of tlb IPIs in only once function
call about kvm_unmap_gfn_range.
> if the pages are mapped in 4K in secondary MMU.
>
> With this series, on the other hand, .numa_protect() sets range to be
> unblockable. So there could be less remote tlb IPIs when a 2M range is
> mapped into small PTEs in secondary MMU.
> Besides, .numa_protect() is not sent for all pages in a given 2M range.
No, .numa_protect() is not sent for all pages. It depends on the workload,
whether the page is accessed for different cpu threads cross-nodes.
>
> Below is my testing data on a VM without assigned devices:
> The data is an average of 10 times guest boot-up.
>
> data | numa balancing caused | numa balancing caused
> on average | #kvm_unmap_gfn_range() | #kvm_flush_remote_tlbs()
> -------------------|------------------------|--------------------------
> before this series | 35 | 8625
> after this series | 10037 | 4610
just be cautious, before the series there are 8625/35 = 246 IPI tlb flush ops
during one time kvm_unmap_gfn_range, is that x86 specific or generic?
By the way are primary mmu and secondary mmu both 4K small page size "on average"?
Regards
Bibo Mao
>
> For a single guest bootup,
> | numa balancing caused | numa balancing caused
> best data | #kvm_unmap_gfn_range() | #kvm_flush_remote_tlbs()
> -------------------|------------------------|--------------------------
> before this series | 28 | 13
> after this series | 406 | 195
>
> | numa balancing caused | numa balancing caused
> worst data | #kvm_unmap_gfn_range() | #kvm_flush_remote_tlbs()
> -------------------|------------------------|--------------------------
> before this series | 44 | 43920
> after this series | 17352 | 8668
>
>
>>
>> however numa balance notification, pmd table of vm maybe needs not be freed
>> in kvm_unmap_gfn_range.
>>
>
On Fri, Aug 11, 2023 at 03:40:44PM +0800, bibo mao wrote:
>
>
> 在 2023/8/11 11:45, Yan Zhao 写道:
> >>> +static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
> >>> + struct mm_struct *mm,
> >>> + unsigned long start,
> >>> + unsigned long end)
> >>> +{
> >>> + struct kvm *kvm = mmu_notifier_to_kvm(mn);
> >>> +
> >>> + WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
> >>> + if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
> >>> + return;
> >>> +
> >>> + kvm_handle_hva_range(mn, start, end, __pte(0), kvm_unmap_gfn_range);
> >>> +}
> >> numa balance will scan wide memory range, and there will be one time
> > Though scanning memory range is wide, .invalidate_range_start() is sent
> > for each 2M range.
> yes, range is huge page size when changing numa protection during numa scanning.
>
> >
> >> ipi notification with kvm_flush_remote_tlbs. With page level notification,
> >> it may bring out lots of flush remote tlb ipi notification.
> >
> > Hmm, for VMs with assigned devices, apparently, the flush remote tlb IPIs
> > will be reduced to 0 with this series.
> >
> > For VMs without assigned devices or mdev devices, I was previously also
> > worried about that there might be more IPIs.
> > But with current test data, there's no more remote tlb IPIs on average.
> >
> > The reason is below:
> >
> > Before this series, kvm_unmap_gfn_range() is called for once for a 2M
> > range.
> > After this series, kvm_unmap_gfn_range() is called for once if the 2M is
> > mapped to a huge page in primary MMU, and called for at most 512 times
> > if mapped to 4K pages in primary MMU.
> >
> >
> > Though kvm_unmap_gfn_range() is only called once before this series,
> > as the range is blockable, when there're contentions, remote tlb IPIs
> > can be sent page by page in 4K granularity (in tdp_mmu_iter_cond_resched())
> I do not know much about x86, does this happen always or only need reschedule
Ah, sorry, I missed platforms other than x86.
Maybe there will be a big difference in other platforms.
> from code? so that there will be many times of tlb IPIs in only once function
Only when MMU lock is contended. But it's not seldom because of the contention in
TDP page fault.
> call about kvm_unmap_gfn_range.
>
> > if the pages are mapped in 4K in secondary MMU.
> >
> > With this series, on the other hand, .numa_protect() sets range to be
> > unblockable. So there could be less remote tlb IPIs when a 2M range is
> > mapped into small PTEs in secondary MMU.
> > Besides, .numa_protect() is not sent for all pages in a given 2M range.
> No, .numa_protect() is not sent for all pages. It depends on the workload,
> whether the page is accessed for different cpu threads cross-nodes.
The .numa_protect() is called in patch 4 only when PROT_NONE is set to
the page.
>
> >
> > Below is my testing data on a VM without assigned devices:
> > The data is an average of 10 times guest boot-up.
> >
> > data | numa balancing caused | numa balancing caused
> > on average | #kvm_unmap_gfn_range() | #kvm_flush_remote_tlbs()
> > -------------------|------------------------|--------------------------
> > before this series | 35 | 8625
> > after this series | 10037 | 4610
> just be cautious, before the series there are 8625/35 = 246 IPI tlb flush ops
> during one time kvm_unmap_gfn_range, is that x86 specific or generic?
Only on x86. Didn't test on other platforms.
>
> By the way are primary mmu and secondary mmu both 4K small page size "on average"?
No. 4K and 2M combined in both.
On Fri, Aug 11, 2023, Yan Zhao wrote:
> On Fri, Aug 11, 2023 at 03:40:44PM +0800, bibo mao wrote:
> >
> > 在 2023/8/11 11:45, Yan Zhao 写道:
> > >>> +static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
> > >>> + struct mm_struct *mm,
> > >>> + unsigned long start,
> > >>> + unsigned long end)
> > >>> +{
> > >>> + struct kvm *kvm = mmu_notifier_to_kvm(mn);
> > >>> +
> > >>> + WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
> > >>> + if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
> > >>> + return;
> > >>> +
> > >>> + kvm_handle_hva_range(mn, start, end, __pte(0), kvm_unmap_gfn_range);
> > >>> +}
> > >> numa balance will scan wide memory range, and there will be one time
> > > Though scanning memory range is wide, .invalidate_range_start() is sent
> > > for each 2M range.
> > yes, range is huge page size when changing numa protection during numa scanning.
> >
> > >
> > >> ipi notification with kvm_flush_remote_tlbs. With page level notification,
> > >> it may bring out lots of flush remote tlb ipi notification.
> > >
> > > Hmm, for VMs with assigned devices, apparently, the flush remote tlb IPIs
> > > will be reduced to 0 with this series.
> > >
> > > For VMs without assigned devices or mdev devices, I was previously also
> > > worried about that there might be more IPIs.
> > > But with current test data, there's no more remote tlb IPIs on average.
> > >
> > > The reason is below:
> > >
> > > Before this series, kvm_unmap_gfn_range() is called for once for a 2M
> > > range.
No, it's potentially called once per 1GiB range. change_pmd_range() invokes the
mmu_notifier with addr+end, where "end" is the end of the range covered by the
PUD, not the the end of the current PMD. So the worst case scenario would be a
256k increase. Of course, if you have to migrate an entire 1GiB chunk of memory
then you likely have bigger problems, but still.
> > > After this series, kvm_unmap_gfn_range() is called for once if the 2M is
> > > mapped to a huge page in primary MMU, and called for at most 512 times
> > > if mapped to 4K pages in primary MMU.
> > >
> > >
> > > Though kvm_unmap_gfn_range() is only called once before this series,
> > > as the range is blockable, when there're contentions, remote tlb IPIs
> > > can be sent page by page in 4K granularity (in tdp_mmu_iter_cond_resched())
> > I do not know much about x86, does this happen always or only need reschedule
> Ah, sorry, I missed platforms other than x86.
> Maybe there will be a big difference in other platforms.
>
> > from code? so that there will be many times of tlb IPIs in only once function
> Only when MMU lock is contended. But it's not seldom because of the contention in
> TDP page fault.
No? I don't see how mmu_lock contention would affect the number of calls to
kvm_flush_remote_tlbs(). If vCPUs are running and not generating faults, i.e.
not trying to access the range in question, then ever zap will generate a remote
TLB flush and thus send IPIs to all running vCPUs.
> > call about kvm_unmap_gfn_range.
> >
> > > if the pages are mapped in 4K in secondary MMU.
> > >
> > > With this series, on the other hand, .numa_protect() sets range to be
> > > unblockable. So there could be less remote tlb IPIs when a 2M range is
> > > mapped into small PTEs in secondary MMU.
> > > Besides, .numa_protect() is not sent for all pages in a given 2M range.
> > No, .numa_protect() is not sent for all pages. It depends on the workload,
> > whether the page is accessed for different cpu threads cross-nodes.
> The .numa_protect() is called in patch 4 only when PROT_NONE is set to
> the page.
I'm strongly opposed to adding MMU_NOTIFIER_RANGE_NUMA. It's too much of a one-off,
and losing the batching of invalidations makes me nervous. As Bibo points out,
the behavior will vary based on the workload, VM configuration, etc.
There's also a *very* subtle change, in that the notification will be sent while
holding the PMD/PTE lock. Taking KVM's mmu_lock under that is *probably* ok, but
I'm not exactly 100% confident on that. And the only reason there isn't a more
obvious bug is because kvm_handle_hva_range() sets may_block to false, e.g. KVM
won't yield if there's mmu_lock contention.
However, *if* it's ok to invoke MMU notifiers while holding PMD/PTE locks, then
I think we can achieve what you want without losing batching, and without changing
secondary MMUs.
Rather than muck with the notification types and add a one-off for NUMA, just
defer the notification until a present PMD/PTE is actually going to be modified.
It's not the prettiest, but other than the locking, I don't think has any chance
of regressing other workloads/configurations.
Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
Compile tested only.
From: Sean Christopherson <[email protected]>
Date: Fri, 11 Aug 2023 10:03:36 -0700
Subject: [PATCH] tmp
Signed-off-by: Sean Christopherson <[email protected]>
---
include/linux/huge_mm.h | 4 +++-
include/linux/mm.h | 3 +++
mm/huge_memory.c | 5 ++++-
mm/mprotect.c | 47 +++++++++++++++++++++++++++++------------
4 files changed, 44 insertions(+), 15 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 20284387b841..b88316adaaad 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -7,6 +7,8 @@
#include <linux/fs.h> /* only for vma_is_dax() */
+struct mmu_notifier_range;
+
vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
@@ -38,7 +40,7 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd);
int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
pmd_t *pmd, unsigned long addr, pgprot_t newprot,
- unsigned long cp_flags);
+ unsigned long cp_flags, struct mmu_notifier_range *range);
vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2dd73e4f3d8e..284f61cf9c37 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2478,6 +2478,9 @@ static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma
return !!(vma->vm_flags & VM_WRITE);
}
+
+void change_pmd_range_notify_secondary_mmus(unsigned long addr,
+ struct mmu_notifier_range *range);
bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
pte_t pte);
extern long change_protection(struct mmu_gather *tlb,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a71cf686e3b2..47c7712b163e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1808,7 +1808,7 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
*/
int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
pmd_t *pmd, unsigned long addr, pgprot_t newprot,
- unsigned long cp_flags)
+ unsigned long cp_flags, struct mmu_notifier_range *range)
{
struct mm_struct *mm = vma->vm_mm;
spinlock_t *ptl;
@@ -1893,6 +1893,9 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
!toptier)
xchg_page_access_time(page, jiffies_to_msecs(jiffies));
}
+
+ change_pmd_range_notify_secondary_mmus(addr, range);
+
/*
* In case prot_numa, we are under mmap_read_lock(mm). It's critical
* to not clear pmd intermittently to avoid race with MADV_DONTNEED
diff --git a/mm/mprotect.c b/mm/mprotect.c
index d1a809167f49..f5844adbe0cb 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -82,7 +82,8 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
static long change_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
- unsigned long end, pgprot_t newprot, unsigned long cp_flags)
+ unsigned long end, pgprot_t newprot, unsigned long cp_flags,
+ struct mmu_notifier_range *range)
{
pte_t *pte, oldpte;
spinlock_t *ptl;
@@ -164,8 +165,12 @@ static long change_pte_range(struct mmu_gather *tlb,
!toptier)
xchg_page_access_time(page,
jiffies_to_msecs(jiffies));
+
+
}
+ change_pmd_range_notify_secondary_mmus(addr, range);
+
oldpte = ptep_modify_prot_start(vma, addr, pte);
ptent = pte_modify(oldpte, newprot);
@@ -355,6 +360,17 @@ pgtable_populate_needed(struct vm_area_struct *vma, unsigned long cp_flags)
err; \
})
+void change_pmd_range_notify_secondary_mmus(unsigned long addr,
+ struct mmu_notifier_range *range)
+{
+ if (range->start)
+ return;
+
+ VM_WARN_ON(addr >= range->end);
+ range->start = addr;
+ mmu_notifier_invalidate_range_start_nonblock(range);
+}
+
static inline long change_pmd_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pud_t *pud, unsigned long addr,
unsigned long end, pgprot_t newprot, unsigned long cp_flags)
@@ -365,7 +381,14 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
unsigned long nr_huge_updates = 0;
struct mmu_notifier_range range;
- range.start = 0;
+ /*
+ * Defer invalidation of secondary MMUs until a PMD/PTE change is
+ * imminent, e.g. NUMA balancing in particular can "fail" for certain
+ * types of mappings. Initialize range.start to '0' and use it to
+ * track whether or not the invalidation notification has been set.
+ */
+ mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
+ vma->vm_mm, 0, end);
pmd = pmd_offset(pud, addr);
do {
@@ -383,18 +406,16 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
if (pmd_none(*pmd))
goto next;
- /* invoke the mmu notifier if the pmd is populated */
- if (!range.start) {
- mmu_notifier_range_init(&range,
- MMU_NOTIFY_PROTECTION_VMA, 0,
- vma->vm_mm, addr, end);
- mmu_notifier_invalidate_range_start(&range);
- }
-
_pmd = pmdp_get_lockless(pmd);
if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) {
if ((next - addr != HPAGE_PMD_SIZE) ||
pgtable_split_needed(vma, cp_flags)) {
+ /*
+ * FIXME: __split_huge_pmd() performs its own
+ * mmu_notifier invalidation prior to clearing
+ * the PMD, ideally all invalidations for the
+ * range would be batched.
+ */
__split_huge_pmd(vma, pmd, addr, false, NULL);
/*
* For file-backed, the pmd could have been
@@ -407,8 +428,8 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
break;
}
} else {
- ret = change_huge_pmd(tlb, vma, pmd,
- addr, newprot, cp_flags);
+ ret = change_huge_pmd(tlb, vma, pmd, addr,
+ newprot, cp_flags, &range);
if (ret) {
if (ret == HPAGE_PMD_NR) {
pages += HPAGE_PMD_NR;
@@ -423,7 +444,7 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
}
ret = change_pte_range(tlb, vma, pmd, addr, next, newprot,
- cp_flags);
+ cp_flags, &range);
if (ret < 0)
goto again;
pages += ret;
base-commit: 1f40f634009556c119974cafce4c7b2f9b8c58ad
--
> However, *if* it's ok to invoke MMU notifiers while holding PMD/PTE locks, then
> I think we can achieve what you want without losing batching, and without changing
> secondary MMUs.
No, this is not legal.
> +void change_pmd_range_notify_secondary_mmus(unsigned long addr,
> + struct mmu_notifier_range *range)
> +{
> + if (range->start)
> + return;
> +
> + VM_WARN_ON(addr >= range->end);
> + range->start = addr;
> + mmu_notifier_invalidate_range_start_nonblock(range);
> +}
The 'nonblock' entry point is advisory, if it fails the caller must
not change the mm.
Jason
On Fri, Aug 11, 2023 at 10:14:45AM -0700, Sean Christopherson wrote:
> On Fri, Aug 11, 2023, Yan Zhao wrote:
> > On Fri, Aug 11, 2023 at 03:40:44PM +0800, bibo mao wrote:
> > >
> > > 在 2023/8/11 11:45, Yan Zhao 写道:
> > > >>> +static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
> > > >>> + struct mm_struct *mm,
> > > >>> + unsigned long start,
> > > >>> + unsigned long end)
> > > >>> +{
> > > >>> + struct kvm *kvm = mmu_notifier_to_kvm(mn);
> > > >>> +
> > > >>> + WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
> > > >>> + if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
> > > >>> + return;
> > > >>> +
> > > >>> + kvm_handle_hva_range(mn, start, end, __pte(0), kvm_unmap_gfn_range);
> > > >>> +}
> > > >> numa balance will scan wide memory range, and there will be one time
> > > > Though scanning memory range is wide, .invalidate_range_start() is sent
> > > > for each 2M range.
> > > yes, range is huge page size when changing numa protection during numa scanning.
> > >
> > > >
> > > >> ipi notification with kvm_flush_remote_tlbs. With page level notification,
> > > >> it may bring out lots of flush remote tlb ipi notification.
> > > >
> > > > Hmm, for VMs with assigned devices, apparently, the flush remote tlb IPIs
> > > > will be reduced to 0 with this series.
> > > >
> > > > For VMs without assigned devices or mdev devices, I was previously also
> > > > worried about that there might be more IPIs.
> > > > But with current test data, there's no more remote tlb IPIs on average.
> > > >
> > > > The reason is below:
> > > >
> > > > Before this series, kvm_unmap_gfn_range() is called for once for a 2M
> > > > range.
>
> No, it's potentially called once per 1GiB range. change_pmd_range() invokes the
> mmu_notifier with addr+end, where "end" is the end of the range covered by the
> PUD, not the the end of the current PMD. So the worst case scenario would be a
> 256k increase. Of course, if you have to migrate an entire 1GiB chunk of memory
> then you likely have bigger problems, but still.
Yes, thanks for pointing it out.
I realized it after re-reading the code and re-checking the log message.
This wider range also explained the collected worst data:
44 kvm_unmap_gfn_range() caused 43920 kvm_flush_remote_tlbs()requests. i.e.
998 remote tlb flush requests per kvm_unmap_gfn_range().
>
> > > > After this series, kvm_unmap_gfn_range() is called for once if the 2M is
> > > > mapped to a huge page in primary MMU, and called for at most 512 times
> > > > if mapped to 4K pages in primary MMU.
> > > >
> > > >
> > > > Though kvm_unmap_gfn_range() is only called once before this series,
> > > > as the range is blockable, when there're contentions, remote tlb IPIs
> > > > can be sent page by page in 4K granularity (in tdp_mmu_iter_cond_resched())
> > > I do not know much about x86, does this happen always or only need reschedule
> > Ah, sorry, I missed platforms other than x86.
> > Maybe there will be a big difference in other platforms.
> >
> > > from code? so that there will be many times of tlb IPIs in only once function
> > Only when MMU lock is contended. But it's not seldom because of the contention in
> > TDP page fault.
>
> No? I don't see how mmu_lock contention would affect the number of calls to
> kvm_flush_remote_tlbs(). If vCPUs are running and not generating faults, i.e.
> not trying to access the range in question, then ever zap will generate a remote
> TLB flush and thus send IPIs to all running vCPUs.
In tdp_mmu_zap_leafs() for kvm_unmap_gfn_range(), the flow is like this:
1. Check necessity of mmu_lock reschedule
1.a -- if yes,
1.a.1 do kvm_flush_remote_tlbs() if flush is true.
1.a.2 reschedule of mmu_lock
1.a.3 goto 2.
1.b -- if no, goto 2
2. Zap present leaf entry, and set flush to be true
3. Get next gfn and go to to 1
With a wide range to .invalidate_range_start()/end(), it's easy to find
rwlock_needbreak(&kvm->mmu_lock) to be true (goes to 1.a and 1.a.1)
And in tdp_mmu_zap_leafs(), before rescheduling of mmu_lock, tlb flush
request is performed. (1.a.1)
Take a real case for example,
NUMA balancing requests KVM to zap GFN range from 0xb4000 to 0xb9800.
Then when KVM zaps GFN 0xb807b, it will finds mmu_lock needs break
because vCPU is faulting GFN 0xb804c.
And vCPUs fill constantly retry faulting 0xb804c for 3298 times until
.invalidate_range_end().
Then for this zap of GFN range from 0xb4000 - 0xb9800,
vCPUs retry fault of
GFN 0xb804c for 3298 times,
GFN 0xb8074 for 3161 times,
GFN 0xb81ce for 15190 times,
and the accesses of the 3 GFNs cause 3209 times of kvm_flush_remote_tlbs()
for one kvm_unmap_gfn_range() because flush requests are not batched.
(this range is mapped in both 2M and 4K in secondary MMU).
Without the contending of mmu_lock, tdp_mmu_zap_leafs() just combines
all flush requests and leaves 1 kvm_flush_remote_tlbs() to be called
after it returns.
In my test scenario, which is VM boot-up, this kind of contention is
frequent.
Here's the 10 times data for a VM boot-up collected previously.
| count of | count of |
| kvm_unmap_gfn_range | kvm_flush_remote_tlbs() |
-------|---------------------|-------------------------|
1 | 38 | 14 |
2 | 28 | 5191 |
3 | 36 | 13398 |
4 | 44 | 43920 |
5 | 28 | 14 |
6 | 36 | 10803 |
7 | 38 | 892 |
8 | 32 | 5905 |
9 | 32 | 13 |
10 | 38 | 6096 |
-------|---------------------|-------------------------|
average| 35 | 8625 |
I wonder if we could loose the frequency to check for rescheduling in
tdp_mmu_iter_cond_resched() if the zap range is wide, e.g.
if (iter->next_last_level_gfn ==
iter->yielded_gfn + KVM_PAGES_PER_HPAGE(PG_LEVEL_2M))
return false;
>
> > > call about kvm_unmap_gfn_range.
> > >
> > > > if the pages are mapped in 4K in secondary MMU.
> > > >
> > > > With this series, on the other hand, .numa_protect() sets range to be
> > > > unblockable. So there could be less remote tlb IPIs when a 2M range is
> > > > mapped into small PTEs in secondary MMU.
> > > > Besides, .numa_protect() is not sent for all pages in a given 2M range.
> > > No, .numa_protect() is not sent for all pages. It depends on the workload,
> > > whether the page is accessed for different cpu threads cross-nodes.
> > The .numa_protect() is called in patch 4 only when PROT_NONE is set to
> > the page.
>
> I'm strongly opposed to adding MMU_NOTIFIER_RANGE_NUMA. It's too much of a one-off,
> and losing the batching of invalidations makes me nervous. As Bibo points out,
> the behavior will vary based on the workload, VM configuration, etc.
>
> There's also a *very* subtle change, in that the notification will be sent while
> holding the PMD/PTE lock. Taking KVM's mmu_lock under that is *probably* ok, but
> I'm not exactly 100% confident on that. And the only reason there isn't a more
MMU lock is a rwlock, which is a variant of spinlock.
So, I think taking it within PMD/PTE lock is ok.
Actually we have already done that during the .change_pte() notification, where
kvm_mmu_notifier_change_pte() takes KVM mmu_lock for write,
while PTE lock is held while sending set_pte_at_notify() --> .change_pte() handlers
> obvious bug is because kvm_handle_hva_range() sets may_block to false, e.g. KVM
> won't yield if there's mmu_lock contention.
Yes, KVM won't yield and reschedule of KVM mmu_lock, so it's fine.
> However, *if* it's ok to invoke MMU notifiers while holding PMD/PTE locks, then
> I think we can achieve what you want without losing batching, and without changing
> secondary MMUs.
>
> Rather than muck with the notification types and add a one-off for NUMA, just
> defer the notification until a present PMD/PTE is actually going to be modified.
> It's not the prettiest, but other than the locking, I don't think has any chance
> of regressing other workloads/configurations.
>
> Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
>
> Compile tested only.
I don't find a matching end to each
mmu_notifier_invalidate_range_start_nonblock().
>
> From: Sean Christopherson <[email protected]>
> Date: Fri, 11 Aug 2023 10:03:36 -0700
> Subject: [PATCH] tmp
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> include/linux/huge_mm.h | 4 +++-
> include/linux/mm.h | 3 +++
> mm/huge_memory.c | 5 ++++-
> mm/mprotect.c | 47 +++++++++++++++++++++++++++++------------
> 4 files changed, 44 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 20284387b841..b88316adaaad 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -7,6 +7,8 @@
>
> #include <linux/fs.h> /* only for vma_is_dax() */
>
> +struct mmu_notifier_range;
> +
> vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
> int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> @@ -38,7 +40,7 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd);
> int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> pmd_t *pmd, unsigned long addr, pgprot_t newprot,
> - unsigned long cp_flags);
> + unsigned long cp_flags, struct mmu_notifier_range *range);
>
> vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
> vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2dd73e4f3d8e..284f61cf9c37 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2478,6 +2478,9 @@ static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma
> return !!(vma->vm_flags & VM_WRITE);
>
> }
> +
> +void change_pmd_range_notify_secondary_mmus(unsigned long addr,
> + struct mmu_notifier_range *range);
> bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> pte_t pte);
> extern long change_protection(struct mmu_gather *tlb,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a71cf686e3b2..47c7712b163e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1808,7 +1808,7 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> */
> int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> pmd_t *pmd, unsigned long addr, pgprot_t newprot,
> - unsigned long cp_flags)
> + unsigned long cp_flags, struct mmu_notifier_range *range)
> {
> struct mm_struct *mm = vma->vm_mm;
> spinlock_t *ptl;
> @@ -1893,6 +1893,9 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> !toptier)
> xchg_page_access_time(page, jiffies_to_msecs(jiffies));
> }
> +
> + change_pmd_range_notify_secondary_mmus(addr, range);
> +
> /*
> * In case prot_numa, we are under mmap_read_lock(mm). It's critical
> * to not clear pmd intermittently to avoid race with MADV_DONTNEED
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index d1a809167f49..f5844adbe0cb 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -82,7 +82,8 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>
> static long change_pte_range(struct mmu_gather *tlb,
> struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
> - unsigned long end, pgprot_t newprot, unsigned long cp_flags)
> + unsigned long end, pgprot_t newprot, unsigned long cp_flags,
> + struct mmu_notifier_range *range)
> {
> pte_t *pte, oldpte;
> spinlock_t *ptl;
> @@ -164,8 +165,12 @@ static long change_pte_range(struct mmu_gather *tlb,
> !toptier)
> xchg_page_access_time(page,
> jiffies_to_msecs(jiffies));
> +
> +
> }
>
> + change_pmd_range_notify_secondary_mmus(addr, range);
> +
> oldpte = ptep_modify_prot_start(vma, addr, pte);
> ptent = pte_modify(oldpte, newprot);
>
> @@ -355,6 +360,17 @@ pgtable_populate_needed(struct vm_area_struct *vma, unsigned long cp_flags)
> err; \
> })
>
> +void change_pmd_range_notify_secondary_mmus(unsigned long addr,
> + struct mmu_notifier_range *range)
> +{
> + if (range->start)
> + return;
> +
> + VM_WARN_ON(addr >= range->end);
> + range->start = addr;
> + mmu_notifier_invalidate_range_start_nonblock(range);
This will cause range from addr to end to be invalidated, which may
include pinned ranges.
> +}
> +
> static inline long change_pmd_range(struct mmu_gather *tlb,
> struct vm_area_struct *vma, pud_t *pud, unsigned long addr,
> unsigned long end, pgprot_t newprot, unsigned long cp_flags)
> @@ -365,7 +381,14 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
> unsigned long nr_huge_updates = 0;
> struct mmu_notifier_range range;
>
> - range.start = 0;
> + /*
> + * Defer invalidation of secondary MMUs until a PMD/PTE change is
> + * imminent, e.g. NUMA balancing in particular can "fail" for certain
> + * types of mappings. Initialize range.start to '0' and use it to
> + * track whether or not the invalidation notification has been set.
> + */
> + mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
> + vma->vm_mm, 0, end);
>
> pmd = pmd_offset(pud, addr);
> do {
> @@ -383,18 +406,16 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
> if (pmd_none(*pmd))
> goto next;
>
> - /* invoke the mmu notifier if the pmd is populated */
> - if (!range.start) {
> - mmu_notifier_range_init(&range,
> - MMU_NOTIFY_PROTECTION_VMA, 0,
> - vma->vm_mm, addr, end);
> - mmu_notifier_invalidate_range_start(&range);
> - }
> -
> _pmd = pmdp_get_lockless(pmd);
> if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) {
> if ((next - addr != HPAGE_PMD_SIZE) ||
> pgtable_split_needed(vma, cp_flags)) {
> + /*
> + * FIXME: __split_huge_pmd() performs its own
> + * mmu_notifier invalidation prior to clearing
> + * the PMD, ideally all invalidations for the
> + * range would be batched.
> + */
> __split_huge_pmd(vma, pmd, addr, false, NULL);
> /*
> * For file-backed, the pmd could have been
> @@ -407,8 +428,8 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
> break;
> }
> } else {
> - ret = change_huge_pmd(tlb, vma, pmd,
> - addr, newprot, cp_flags);
> + ret = change_huge_pmd(tlb, vma, pmd, addr,
> + newprot, cp_flags, &range);
> if (ret) {
> if (ret == HPAGE_PMD_NR) {
> pages += HPAGE_PMD_NR;
> @@ -423,7 +444,7 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
> }
>
> ret = change_pte_range(tlb, vma, pmd, addr, next, newprot,
> - cp_flags);
> + cp_flags, &range);
> if (ret < 0)
> goto again;
> pages += ret;
>
> base-commit: 1f40f634009556c119974cafce4c7b2f9b8c58ad
> --
>
>
>
On Mon, Aug 14, 2023 at 02:52:07PM +0800, Yan Zhao wrote:
> I wonder if we could loose the frequency to check for rescheduling in
> tdp_mmu_iter_cond_resched() if the zap range is wide, e.g.
>
> if (iter->next_last_level_gfn ==
> iter->yielded_gfn + KVM_PAGES_PER_HPAGE(PG_LEVEL_2M))
> return false;
Correct:
@@ -712,7 +713,8 @@ static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
WARN_ON(iter->yielded);
/* Ensure forward progress has been made before yielding. */
- if (iter->next_last_level_gfn == iter->yielded_gfn)
+ if (iter->next_last_level_gfn >= iter->yielded_gfn &&
+ iter->next_last_level_gfn < iter->yielded_gfn + KVM_PAGES_PER_HPAGE(PG_LEVEL_2M))
return false;
if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
This can reduce kvm_flush_remote_tlbs() a lot in one kvm_unmap_gfn_range() in KVM x86 TDP MMU.
在 2023/8/16 15:18, Yan Zhao 写道:
> On Wed, Aug 16, 2023 at 03:29:22PM +0800, bibo mao wrote:
>>> Flush must be done before kvm->mmu_lock is unlocked, otherwise,
>>> confusion will be caused when multiple threads trying to update the
>>> secondary MMU.
>> Since tlb flush is delayed after all pte entries are cleared, and currently
>> there is no tlb flush range supported for secondary mmu. I do know why there
>> is confusion before or after kvm->mmu_lock.
>
> Oh, do you mean only do kvm_unmap_gfn_range() in .invalidate_range_end()?
yes, it is just sketchy thought for numa balance scenery,
do kvm_unmap_gfn_range() in invalidate_range_end rather than
invalidate_range_start.
> Then check if PROT_NONE is set in primary MMU before unmap?
> Looks like a good idea, I need to check if it's feasible.
> Thanks!
>
>
在 2023/8/16 13:14, Yan Zhao 写道:
> On Wed, Aug 16, 2023 at 11:44:29AM +0800, bibo mao wrote:
>>
>>
>> 在 2023/8/16 10:43, bibo mao 写道:
>>>
>>>
>>> 在 2023/8/15 22:50, Sean Christopherson 写道:
>>>> On Tue, Aug 15, 2023, Yan Zhao wrote:
>>>>> On Mon, Aug 14, 2023 at 09:40:44AM -0700, Sean Christopherson wrote:
>>>>>>>> Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
>>>>>>>>
>>>>>>>> Compile tested only.
>>>>>>>
>>>>>>> I don't find a matching end to each
>>>>>>> mmu_notifier_invalidate_range_start_nonblock().
>>>>>>
>>>>>> It pairs with existing call to mmu_notifier_invalidate_range_end() in change_pmd_range():
>>>>>>
>>>>>> if (range.start)
>>>>>> mmu_notifier_invalidate_range_end(&range);
>>>>> No, It doesn't work for mmu_notifier_invalidate_range_start() sent in change_pte_range(),
>>>>> if we only want the range to include pages successfully set to PROT_NONE.
>>>>
>>>> Precise invalidation was a non-goal for my hack-a-patch. The intent was purely
>>>> to defer invalidation until it was actually needed, but still perform only a
>>>> single notification so as to batch the TLB flushes, e.g. the start() call still
>>>> used the original @end.
>>>>
>>>> The idea was to play nice with the scenario where nothing in a VMA could be migrated.
>>>> It was complete untested though, so it may not have actually done anything to reduce
>>>> the number of pointless invalidations.
>>> For numa-balance scenery, can original page still be used by application even if pte
>>> is changed with PROT_NONE? If it can be used, maybe we can zap shadow mmu and flush tlb
> For GUPs that does not honor FOLL_HONOR_NUMA_FAULT, yes,
>
> See https://lore.kernel.org/all/[email protected]/
>
>> Since there is kvm_mmu_notifier_change_pte notification when numa page is replaced with
>> new page, my meaning that can original page still be used by application even if pte
>> is changed with PROT_NONE and before replaced with new page?
> It's not .change_pte() notification, which is sent when COW.
> The do_numa_page()/do_huge_pmd_numa_page() will try to unmap old page
> protected with PROT_NONE, and if every check passes, a separate
> .invalidate_range_start()/end() with event type MMU_NOTIFY_CLEAR will be
> sent.
yes, you are right. change_pte() notification will be will called
when migrate_vma_pages, I messed it with numa page migration. However
invalidate_range_start()/end() with event type MMU_NOTIFY_CLEAR will be
sent also when new page is replaced.
>
> So, I think KVM (though it honors FOLL_HONOR_NUMA_FAULT), can safely
> keep mapping maybe-dma pages until MMU_NOTIFY_CLEAR is sent.
> (this approach is implemented in RFC v1
> https://lore.kernel.org/all/[email protected]/)
>
>>
>> And for primary mmu, tlb is flushed after pte is changed with PROT_NONE and
>> after mmu_notifier_invalidate_range_end notification for secondary mmu.
>> Regards
>> Bibo Mao
>
>>>> in notification mmu_notifier_invalidate_range_end with precised range, the range can
> But I don't think flush tlb only in the .invalidate_range_end() in
> secondary MMU is a good idea.
I have no good idea, and it beyond my ability to modify kvm framework now :(
> Flush must be done before kvm->mmu_lock is unlocked, otherwise,
> confusion will be caused when multiple threads trying to update the
> secondary MMU.
Since tlb flush is delayed after all pte entries are cleared, and currently
there is no tlb flush range supported for secondary mmu. I do know why there
is confusion before or after kvm->mmu_lock.
Regards
Bibo Mao
>
>>>> be cross-range between range mmu_gather and mmu_notifier_range.
>
>
在 2023/8/16 10:43, bibo mao 写道:
>
>
> 在 2023/8/15 22:50, Sean Christopherson 写道:
>> On Tue, Aug 15, 2023, Yan Zhao wrote:
>>> On Mon, Aug 14, 2023 at 09:40:44AM -0700, Sean Christopherson wrote:
>>>>>> Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
>>>>>>
>>>>>> Compile tested only.
>>>>>
>>>>> I don't find a matching end to each
>>>>> mmu_notifier_invalidate_range_start_nonblock().
>>>>
>>>> It pairs with existing call to mmu_notifier_invalidate_range_end() in change_pmd_range():
>>>>
>>>> if (range.start)
>>>> mmu_notifier_invalidate_range_end(&range);
>>> No, It doesn't work for mmu_notifier_invalidate_range_start() sent in change_pte_range(),
>>> if we only want the range to include pages successfully set to PROT_NONE.
>>
>> Precise invalidation was a non-goal for my hack-a-patch. The intent was purely
>> to defer invalidation until it was actually needed, but still perform only a
>> single notification so as to batch the TLB flushes, e.g. the start() call still
>> used the original @end.
>>
>> The idea was to play nice with the scenario where nothing in a VMA could be migrated.
>> It was complete untested though, so it may not have actually done anything to reduce
>> the number of pointless invalidations.
> For numa-balance scenery, can original page still be used by application even if pte
> is changed with PROT_NONE? If it can be used, maybe we can zap shadow mmu and flush tlb
Since there is kvm_mmu_notifier_change_pte notification when numa page is replaced with
new page, my meaning that can original page still be used by application even if pte
is changed with PROT_NONE and before replaced with new page?
And for primary mmu, tlb is flushed after pte is changed with PROT_NONE and
after mmu_notifier_invalidate_range_end notification for secondary mmu.
Regards
Bibo Mao
> in notification mmu_notifier_invalidate_range_end with precised range, the range can
> be cross-range between range mmu_gather and mmu_notifier_range.
>
> Regards
> Bibo Mao
>>
>>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>>> index 9e4cd8b4a202..f29718a16211 100644
>>>> --- a/arch/x86/kvm/mmu/mmu.c
>>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>>> @@ -4345,6 +4345,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>>>> if (unlikely(!fault->slot))
>>>> return kvm_handle_noslot_fault(vcpu, fault, access);
>>>>
>>>> + if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva))
>>>> + return RET_PF_RETRY;
>>>> +
>>> This can effectively reduce the remote flush IPIs a lot!
>>> One Nit is that, maybe rmb() or READ_ONCE() is required for kvm->mmu_invalidate_range_start
>>> and kvm->mmu_invalidate_range_end.
>>> Otherwise, I'm somewhat worried about constant false positive and retry.
>>
>> If anything, this needs a READ_ONCE() on mmu_invalidate_in_progress. The ranges
>> aren't touched when when mmu_invalidate_in_progress goes to zero, so ensuring they
>> are reloaded wouldn't do anything. The key to making forward progress is seeing
>> that there is no in-progress invalidation.
>>
>> I did consider adding said READ_ONCE(), but practically speaking, constant false
>> positives are impossible. KVM will re-enter the guest when retrying, and there
>> is zero chance of the compiler avoiding reloads across VM-Enter+VM-Exit.
>>
>> I suppose in theory we might someday differentiate between "retry because a different
>> vCPU may have fixed the fault" and "retry because there's an in-progress invalidation",
>> and not bother re-entering the guest for the latter, e.g. have it try to yield
>> instead.
>>
>> All that said, READ_ONCE() on mmu_invalidate_in_progress should effectively be a
>> nop, so it wouldn't hurt to be paranoid in this case.
>>
>> Hmm, at that point, it probably makes sense to add a READ_ONCE() for mmu_invalidate_seq
>> too, e.g. so that a sufficiently clever compiler doesn't completely optimize away
>> the check. Losing the check wouldn't be problematic (false negatives are fine,
>> especially on that particular check), but the generated code would *look* buggy.
On Wed, Aug 16, 2023 at 11:44:29AM +0800, bibo mao wrote:
>
>
> 在 2023/8/16 10:43, bibo mao 写道:
> >
> >
> > 在 2023/8/15 22:50, Sean Christopherson 写道:
> >> On Tue, Aug 15, 2023, Yan Zhao wrote:
> >>> On Mon, Aug 14, 2023 at 09:40:44AM -0700, Sean Christopherson wrote:
> >>>>>> Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
> >>>>>>
> >>>>>> Compile tested only.
> >>>>>
> >>>>> I don't find a matching end to each
> >>>>> mmu_notifier_invalidate_range_start_nonblock().
> >>>>
> >>>> It pairs with existing call to mmu_notifier_invalidate_range_end() in change_pmd_range():
> >>>>
> >>>> if (range.start)
> >>>> mmu_notifier_invalidate_range_end(&range);
> >>> No, It doesn't work for mmu_notifier_invalidate_range_start() sent in change_pte_range(),
> >>> if we only want the range to include pages successfully set to PROT_NONE.
> >>
> >> Precise invalidation was a non-goal for my hack-a-patch. The intent was purely
> >> to defer invalidation until it was actually needed, but still perform only a
> >> single notification so as to batch the TLB flushes, e.g. the start() call still
> >> used the original @end.
> >>
> >> The idea was to play nice with the scenario where nothing in a VMA could be migrated.
> >> It was complete untested though, so it may not have actually done anything to reduce
> >> the number of pointless invalidations.
> > For numa-balance scenery, can original page still be used by application even if pte
> > is changed with PROT_NONE? If it can be used, maybe we can zap shadow mmu and flush tlb
For GUPs that does not honor FOLL_HONOR_NUMA_FAULT, yes,
See https://lore.kernel.org/all/[email protected]/
> Since there is kvm_mmu_notifier_change_pte notification when numa page is replaced with
> new page, my meaning that can original page still be used by application even if pte
> is changed with PROT_NONE and before replaced with new page?
It's not .change_pte() notification, which is sent when COW.
The do_numa_page()/do_huge_pmd_numa_page() will try to unmap old page
protected with PROT_NONE, and if every check passes, a separate
.invalidate_range_start()/end() with event type MMU_NOTIFY_CLEAR will be
sent.
So, I think KVM (though it honors FOLL_HONOR_NUMA_FAULT), can safely
keep mapping maybe-dma pages until MMU_NOTIFY_CLEAR is sent.
(this approach is implemented in RFC v1
https://lore.kernel.org/all/[email protected]/)
>
> And for primary mmu, tlb is flushed after pte is changed with PROT_NONE and
> after mmu_notifier_invalidate_range_end notification for secondary mmu.
> Regards
> Bibo Mao
> >> in notification mmu_notifier_invalidate_range_end with precised range, the range can
But I don't think flush tlb only in the .invalidate_range_end() in
secondary MMU is a good idea.
Flush must be done before kvm->mmu_lock is unlocked, otherwise,
confusion will be caused when multiple threads trying to update the
secondary MMU.
> >> be cross-range between range mmu_gather and mmu_notifier_range.