If a process (qemu) with a lot of CPUs (128) tries to munmap() a large
chunk of memory (496GB) mapped with THP, it takes an average of 275
seconds, which can cause a lot of problems for the workload (in qemu's
case, the guest locks up for that long).
While tracking down the source of this problem, I found that most of this
time is spent in serialize_against_pte_lookup(). This function spends a
long time in smp_call_function_many() when more than a couple of CPUs are
running the user process, and since it has to run for every THP being
unmapped, the total adds up to a very long time for large amounts of
memory.
According to the documentation, serialize_against_pte_lookup() is needed
to prevent the pmd_t to pte_t casting inside find_current_mm_pte(), or any
other lockless pagetable walk, from happening concurrently with THP
splitting/collapsing. It does so by running do_nothing() on each CPU in
mm->cpu_bitmap[]; the IPI for that call is only serviced on a given CPU
once interrupts are re-enabled there.
Since interrupts are (usually) disabled during a lockless pagetable walk,
and serialize_against_pte_lookup() only returns after interrupts have been
re-enabled on every CPU in the mask, the walk is protected.
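To illustrate the pattern being protected, here is a rough, hypothetical
sketch of such a walk (the function and call site are made up; only
find_current_mm_pte() and the irq helpers are existing interfaces):

/*
 * Hypothetical example of a lockless pagetable walk protected by
 * disabled interrupts: serialize_against_pte_lookup() cannot return
 * while a CPU sits between local_irq_save() and local_irq_restore(),
 * because its do_nothing() IPI is only serviced once irqs are on again.
 */
static bool example_lockless_walk(struct mm_struct *mm, unsigned long ea)
{
	unsigned long flags;
	pte_t *ptep;
	bool present = false;

	local_irq_save(flags);		/* holds off the do_nothing() IPI */
	ptep = find_current_mm_pte(mm->pgd, ea, NULL, NULL);
	if (ptep)
		present = pte_present(*ptep);	/* last use of ptep */
	local_irq_restore(flags);	/* IPI can now be delivered */

	return present;
}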
So, from what I could understand, if there is no lockless pagetable walk
running, there is no need for the expensive part of
serialize_against_pte_lookup().
To avoid that cost, I propose a counter that keeps track of how many
lockless pagetable walks (such as find_current_mm_pte()) are currently
running, and, if there are none, simply skips smp_call_function_many().
The related functions are:
start_lockless_pgtbl_walk(mm)
	Call before starting any lockless pgtable walk.
end_lockless_pgtbl_walk(mm)
	Call after the end of any lockless pgtable walk
	(usually right after the last use of the ptep).
running_lockless_pgtbl_walk(mm)
	Returns the number of lockless pgtable walks currently running.
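Patches 1 and 2 introduce these helpers. As a minimal sketch of the idea
(assuming an atomic_t counter in mm->context; the field name
lockless_pgtbl_walk_count below is illustrative, not necessarily the one
used in the series):

void start_lockless_pgtbl_walk(struct mm_struct *mm)
{
	atomic_inc(&mm->context.lockless_pgtbl_walk_count);
	/* Counter increment must be visible before the walk starts */
	smp_mb();
}

void end_lockless_pgtbl_walk(struct mm_struct *mm)
{
	/* The last ptep access must happen before the counter drops */
	smp_mb();
	atomic_dec(&mm->context.lockless_pgtbl_walk_count);
}

int running_lockless_pgtbl_walk(struct mm_struct *mm)
{
	return atomic_read(&mm->context.lockless_pgtbl_walk_count);
}

serialize_against_pte_lookup() then only needs to issue the expensive IPIs
when running_lockless_pgtbl_walk(mm) is non-zero, as done in the last
patch of the series.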
On my workload (qemu), this reduces munmap() time from 275 seconds to
418ms.
I also documented (in comments) some lockless pagetable walks for which it
is not necessary to keep track, given that they work on init_mm or on a
guest pgd.
Changes since v3:
Adds memory barrier to {start,end}_lockless_pgtbl_walk()
Explains (in comments) why some lockless pgtbl walks don't need
local_irq_disable() (real mode + MSR_EE=0)
Explains (in comments) places where the counting method is not needed
(guest pgd, which is not touched by THP)
Fixes some misplaced local_irq_restore() calls
Link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=132417
Changes since v2:
Rebased to v5.3
Adds support to __get_user_pages_fast()
Adds usage description to *_lockless_pgtbl_walk()
Improves style of the dummy functions
Link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=131839
Changes since v1:
Isolated atomic operations in the *_lockless_pgtbl_walk() functions
Fixed the counter being decremented before the last use of the ptep
Link: http://patchwork.ozlabs.org/patch/1163093/
Leonardo Bras (11):
powerpc/mm: Adds counting method to monitor lockless pgtable walks
asm-generic/pgtable: Adds dummy functions to monitor lockless pgtable
walks
mm/gup: Applies counting method to monitor gup_pgd_range
powerpc/mce_power: Applies counting method to monitor lockless pgtbl
walks
powerpc/perf: Applies counting method to monitor lockless pgtbl walks
powerpc/mm/book3s64/hash: Applies counting method to monitor lockless
pgtbl walks
powerpc/kvm/e500: Applies counting method to monitor lockless pgtbl
walks
powerpc/kvm/book3s_hv: Applies counting method to monitor lockless
pgtbl walks
powerpc/kvm/book3s_64: Applies counting method to monitor lockless
pgtbl walks
powerpc/book3s_64: Enables counting method to monitor lockless pgtbl
walk
powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing
arch/powerpc/include/asm/book3s/64/mmu.h | 3 ++
arch/powerpc/include/asm/book3s/64/pgtable.h | 5 ++
arch/powerpc/kernel/mce_power.c | 13 ++++--
arch/powerpc/kvm/book3s_64_mmu_hv.c | 2 +
arch/powerpc/kvm/book3s_64_mmu_radix.c | 30 ++++++++++++
arch/powerpc/kvm/book3s_64_vio_hv.c | 3 ++
arch/powerpc/kvm/book3s_hv_nested.c | 22 ++++++++-
arch/powerpc/kvm/book3s_hv_rm_mmu.c | 18 ++++++--
arch/powerpc/kvm/e500_mmu_host.c | 6 ++-
arch/powerpc/mm/book3s64/hash_tlb.c | 2 +
arch/powerpc/mm/book3s64/hash_utils.c | 12 ++++-
arch/powerpc/mm/book3s64/mmu_context.c | 1 +
arch/powerpc/mm/book3s64/pgtable.c | 48 +++++++++++++++++++-
arch/powerpc/perf/callchain.c | 5 +-
include/asm-generic/pgtable.h | 15 ++++++
mm/gup.c | 8 ++++
16 files changed, 180 insertions(+), 13 deletions(-)
--
2.20.1
Skips the slow part of serialize_against_pte_lookup() if there is no
lockless pagetable walk currently running.
Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/mm/book3s64/pgtable.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 6ba6195bff1b..27966481f2a6 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -95,7 +95,8 @@ static void do_nothing(void *unused)
void serialize_against_pte_lookup(struct mm_struct *mm)
{
smp_mb();
- smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
+ if (running_lockless_pgtbl_walk(mm))
+ smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
}
/*
--
2.20.1
Applies the counting-based method for monitoring lockless pagetable walks
to all book3s_hv related functions that perform them.
Adds comments explaining that some lockless pagetable walks don't need
protection, either because the guest pgd is not a target of THP
collapse/split, or because they run in real mode with MSR_EE = 0.
kvmppc_do_h_enter: Fixes the placement of local_irq_restore() (it must
come after the last use of the ptep).
Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/kvm/book3s_hv_nested.c | 22 ++++++++++++++++++++--
arch/powerpc/kvm/book3s_hv_rm_mmu.c | 18 ++++++++++++++----
2 files changed, 34 insertions(+), 6 deletions(-)
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
index 735e0ac6f5b2..5a641b559de7 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -803,7 +803,11 @@ static void kvmhv_update_nest_rmap_rc(struct kvm *kvm, u64 n_rmap,
if (!gp)
return;
- /* Find the pte */
+ /* Find the pte:
+ * We are walking the nested guest (partition-scoped) page table here.
+ * We can do this without disabling irq because the Linux MM
+ * subsystem doesn't do THP splits and collapses on this tree.
+ */
ptep = __find_linux_pte(gp->shadow_pgtable, gpa, NULL, &shift);
/*
* If the pte is present and the pfn is still the same, update the pte.
@@ -853,7 +857,11 @@ static void kvmhv_remove_nest_rmap(struct kvm *kvm, u64 n_rmap,
if (!gp)
return;
- /* Find and invalidate the pte */
+ /* Find and invalidate the pte:
+ * We are walking the nested guest (partition-scoped) page table here.
+ * We can do this without disabling irq because the Linux MM
+ * subsystem doesn't do THP splits and collapses on this tree.
+ */
ptep = __find_linux_pte(gp->shadow_pgtable, gpa, NULL, &shift);
/* Don't spuriously invalidate ptes if the pfn has changed */
if (ptep && pte_present(*ptep) && ((pte_val(*ptep) & mask) == hpa))
@@ -921,6 +929,11 @@ static bool kvmhv_invalidate_shadow_pte(struct kvm_vcpu *vcpu,
int shift;
spin_lock(&kvm->mmu_lock);
+ /*
+ * We are walking the nested guest (partition-scoped) page table here.
+ * We can do this without disabling irq because the Linux MM
+ * subsystem doesn't do THP splits and collapses on this tree.
+ */
ptep = __find_linux_pte(gp->shadow_pgtable, gpa, NULL, &shift);
if (!shift)
shift = PAGE_SHIFT;
@@ -1362,6 +1375,11 @@ static long int __kvmhv_nested_page_fault(struct kvm_run *run,
/* See if can find translation in our partition scoped tables for L1 */
pte = __pte(0);
spin_lock(&kvm->mmu_lock);
+ /*
+ * We are walking the secondary (partition-scoped) page table here.
+ * We can do this without disabling irq because the Linux MM
+ * subsystem doesn't do THP splits and collapses on this tree.
+ */
pte_p = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (!shift)
shift = PAGE_SHIFT;
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 63e0ce91e29d..2076a7ac230a 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -252,6 +252,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
* If we had a page table table change after lookup, we would
* retry via mmu_notifier_retry.
*/
+ start_lockless_pgtbl_walk(kvm->mm);
if (!realmode)
local_irq_save(irq_flags);
/*
@@ -287,8 +288,6 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
pa |= gpa & ~PAGE_MASK;
}
}
- if (!realmode)
- local_irq_restore(irq_flags);
ptel &= HPTE_R_KEY | HPTE_R_PP0 | (psize-1);
ptel |= pa;
@@ -311,6 +310,9 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
ptel &= ~(HPTE_R_W|HPTE_R_I|HPTE_R_G);
ptel |= HPTE_R_M;
}
+ if (!realmode)
+ local_irq_restore(irq_flags);
+ end_lockless_pgtbl_walk(kvm->mm);
/* Find and lock the HPTEG slot to use */
do_insert:
@@ -885,11 +887,19 @@ static int kvmppc_get_hpa(struct kvm_vcpu *vcpu, unsigned long gpa,
/* Translate to host virtual address */
hva = __gfn_to_hva_memslot(memslot, gfn);
- /* Try to find the host pte for that virtual address */
+ /* Try to find the host pte for that virtual address :
+ * Called by hcall_real_table (real mode + MSR_EE=0)
+ * Interrupts are disabled here.
+ */
+ start_lockless_pgtbl_walk(kvm->mm);
ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift);
- if (!ptep)
+ if (!ptep) {
+ end_lockless_pgtbl_walk(kvm->mm);
return H_TOO_HARD;
+ }
pte = kvmppc_read_update_linux_pte(ptep, writing);
+ end_lockless_pgtbl_walk(kvm->mm);
+
if (!pte_present(pte))
return H_TOO_HARD;
--
2.20.1
Applies the counting-based method for monitoring lockless pagetable walks
to all hash-related functions that perform them.
hash_page_mm: Adds a comment explaining that there is no need for
local_irq_disable/save, given that it is only called from the DataAccess
interrupt, so interrupts are already disabled.
Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/mm/book3s64/hash_tlb.c | 2 ++
arch/powerpc/mm/book3s64/hash_utils.c | 12 +++++++++++-
2 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c b/arch/powerpc/mm/book3s64/hash_tlb.c
index 4a70d8dd39cd..5e5213c3f7c4 100644
--- a/arch/powerpc/mm/book3s64/hash_tlb.c
+++ b/arch/powerpc/mm/book3s64/hash_tlb.c
@@ -209,6 +209,7 @@ void __flush_hash_table_range(struct mm_struct *mm, unsigned long start,
* to being hashed). This is not the most performance oriented
* way to do things but is fine for our needs here.
*/
+ start_lockless_pgtbl_walk(mm);
local_irq_save(flags);
arch_enter_lazy_mmu_mode();
for (; start < end; start += PAGE_SIZE) {
@@ -230,6 +231,7 @@ void __flush_hash_table_range(struct mm_struct *mm, unsigned long start,
}
arch_leave_lazy_mmu_mode();
local_irq_restore(flags);
+ end_lockless_pgtbl_walk(mm);
}
void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index b8ad14bb1170..8615fab87c43 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -1321,7 +1321,11 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea,
ea &= ~((1ul << mmu_psize_defs[psize].shift) - 1);
#endif /* CONFIG_PPC_64K_PAGES */
- /* Get PTE and page size from page tables */
+ /* Get PTE and page size from page tables :
+ * Called in from DataAccess interrupt (data_access_common: 0x300),
+ * interrupts are disabled here.
+ */
+ start_lockless_pgtbl_walk(mm);
ptep = find_linux_pte(pgdir, ea, &is_thp, &hugeshift);
if (ptep == NULL || !pte_present(*ptep)) {
DBG_LOW(" no PTE !\n");
@@ -1438,6 +1442,7 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea,
DBG_LOW(" -> rc=%d\n", rc);
bail:
+ end_lockless_pgtbl_walk(mm);
exception_exit(prev_state);
return rc;
}
@@ -1547,10 +1552,12 @@ void hash_preload(struct mm_struct *mm, unsigned long ea,
vsid = get_user_vsid(&mm->context, ea, ssize);
if (!vsid)
return;
+
/*
* Hash doesn't like irqs. Walking linux page table with irq disabled
* saves us from holding multiple locks.
*/
+ start_lockless_pgtbl_walk(mm);
local_irq_save(flags);
/*
@@ -1597,6 +1604,7 @@ void hash_preload(struct mm_struct *mm, unsigned long ea,
pte_val(*ptep));
out_exit:
local_irq_restore(flags);
+ end_lockless_pgtbl_walk(mm);
}
#ifdef CONFIG_PPC_MEM_KEYS
@@ -1613,11 +1621,13 @@ u16 get_mm_addr_key(struct mm_struct *mm, unsigned long address)
if (!mm || !mm->pgd)
return 0;
+ start_lockless_pgtbl_walk(mm);
local_irq_save(flags);
ptep = find_linux_pte(mm->pgd, address, NULL, NULL);
if (ptep)
pkey = pte_to_pkey_bits(pte_val(READ_ONCE(*ptep)));
local_irq_restore(flags);
+ end_lockless_pgtbl_walk(mm);
return pkey;
}
--
2.20.1
Applies the counting-based method for monitoring lockless pagetable walks
to all book3s_64-related functions that perform them.
Adds comments explaining that some lockless pagetable walks don't need
protection, either because the guest pgd is not a target of THP
collapse/split, or because they run in real mode with MSR_EE = 0.
Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/kvm/book3s_64_mmu_hv.c | 2 ++
arch/powerpc/kvm/book3s_64_mmu_radix.c | 30 ++++++++++++++++++++++++++
arch/powerpc/kvm/book3s_64_vio_hv.c | 3 +++
3 files changed, 35 insertions(+)
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 9a75f0e1933b..fcd3dad1297f 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -620,6 +620,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
* We need to protect against page table destruction
* hugepage split and collapse.
*/
+ start_lockless_pgtbl_walk(kvm->mm);
local_irq_save(flags);
ptep = find_current_mm_pte(current->mm->pgd,
hva, NULL, NULL);
@@ -629,6 +630,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
write_ok = 1;
}
local_irq_restore(flags);
+ end_lockless_pgtbl_walk(kvm->mm);
}
}
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 2d415c36a61d..9b374b9838fa 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -813,6 +813,7 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
* Read the PTE from the process' radix tree and use that
* so we get the shift and attribute bits.
*/
+ start_lockless_pgtbl_walk(kvm->mm);
local_irq_disable();
ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift);
/*
@@ -821,12 +822,14 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
*/
if (!ptep) {
local_irq_enable();
+ end_lockless_pgtbl_walk(kvm->mm);
if (page)
put_page(page);
return RESUME_GUEST;
}
pte = *ptep;
local_irq_enable();
+ end_lockless_pgtbl_walk(kvm->mm);
/* If we're logging dirty pages, always map single pages */
large_enable = !(memslot->flags & KVM_MEM_LOG_DIRTY_PAGES);
@@ -972,10 +975,16 @@ int kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
unsigned long gpa = gfn << PAGE_SHIFT;
unsigned int shift;
+ /*
+ * We are walking the secondary (partition-scoped) page table here.
+ * We can do this without disabling irq because the Linux MM
+ * subsystem doesn't do THP splits and collapses on this tree.
+ */
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep))
kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot,
kvm->arch.lpid);
+
return 0;
}
@@ -989,6 +998,11 @@ int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
int ref = 0;
unsigned long old, *rmapp;
+ /*
+ * We are walking the secondary (partition-scoped) page table here.
+ * We can do this without disabling irq because the Linux MM
+ * subsystem doesn't do THP splits and collapses on this tree.
+ */
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep) && pte_young(*ptep)) {
old = kvmppc_radix_update_pte(kvm, ptep, _PAGE_ACCESSED, 0,
@@ -1013,6 +1027,11 @@ int kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
unsigned int shift;
int ref = 0;
+ /*
+ * We are walking the secondary (partition-scoped) page table here.
+ * We can do this without disabling irq because the Linux MM
+ * subsystem doesn't do THP splits and collapses on this tree.
+ */
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep) && pte_young(*ptep))
ref = 1;
@@ -1030,6 +1049,11 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
int ret = 0;
unsigned long old, *rmapp;
+ /*
+ * We are walking the secondary (partition-scoped) page table here.
+ * We can do this without disabling irq because the Linux MM
+ * subsystem doesn't do THP splits and collapses on this tree.
+ */
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep) && pte_dirty(*ptep)) {
ret = 1;
@@ -1046,6 +1070,7 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
1UL << shift);
spin_unlock(&kvm->mmu_lock);
}
+
return ret;
}
@@ -1085,6 +1110,11 @@ void kvmppc_radix_flush_memslot(struct kvm *kvm,
gpa = memslot->base_gfn << PAGE_SHIFT;
spin_lock(&kvm->mmu_lock);
for (n = memslot->npages; n; --n) {
+ /*
+ * We are walking the secondary (partition-scoped) page table here.
+ * We can do this without disabling irq because the Linux MM
+ * subsystem doesn't do THP splits and collapses on this tree.
+ */
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep))
kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot,
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index b4f20f13b860..376d069a92dd 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -431,6 +431,7 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
static long kvmppc_rm_ua_to_hpa(struct kvm_vcpu *vcpu,
unsigned long ua, unsigned long *phpa)
{
+ struct kvm *kvm = vcpu->kvm;
pte_t *ptep, pte;
unsigned shift = 0;
@@ -443,10 +444,12 @@ static long kvmppc_rm_ua_to_hpa(struct kvm_vcpu *vcpu,
* to exit which will agains result in the below page table walk
* to finish.
*/
+ start_lockless_pgtbl_walk(kvm->mm);
ptep = __find_linux_pte(vcpu->arch.pgdir, ua, NULL, &shift);
if (!ptep || !pte_present(*ptep))
return -ENXIO;
pte = *ptep;
+ end_lockless_pgtbl_walk(kvm->mm);
if (!shift)
shift = PAGE_SHIFT;
--
2.20.1
Enables the count-based monitoring method for lockless pagetable walks on
PowerPC book3s_64.
Other architectures/platforms fall back to the generic dummy functions.
Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/include/asm/book3s/64/pgtable.h | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 8308f32e9782..eb9b26a4a483 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1370,5 +1370,10 @@ static inline bool pgd_is_leaf(pgd_t pgd)
return !!(pgd_raw(pgd) & cpu_to_be64(_PAGE_PTE));
}
+#define __HAVE_ARCH_LOCKLESS_PGTBL_WALK_COUNTER
+void start_lockless_pgtbl_walk(struct mm_struct *mm);
+void end_lockless_pgtbl_walk(struct mm_struct *mm);
+int running_lockless_pgtbl_walk(struct mm_struct *mm);
+
#endif /* __ASSEMBLY__ */
#endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */
--
2.20.1