If a process (qemu) with a lot of CPUs (128) tries to munmap() a large
chunk of memory (496GB) mapped with THP, it takes an average of 275
seconds, which can cause a lot of problems for the workload (in the
qemu case, the guest will hang for that long).
While tracking down the source of this bug, I found that most of this
time is spent in serialize_against_pte_lookup(). This function takes a
lot of time in smp_call_function_many() if there is more than a couple
of CPUs running the user process. Since it has to run for every mapped
THP, it takes a very long time for large amounts of memory.
According to the docs, serialize_against_pte_lookup() is needed in
order to prevent the pmd_t to pte_t casting inside
find_current_mm_pte(), or any other lockless pagetable walk, from
happening concurrently with THP splitting/collapsing.
It does so by calling do_nothing() on each CPU in mm->cpu_bitmap[],
which can only complete once each of those CPUs has re-enabled
interrupts.
Since interrupts are (usually) disabled during a lockless pagetable
walk, and serialize_against_pte_lookup() only returns after interrupts
are re-enabled, the walk is protected.
So, from what I could understand, if there is no lockless pagetable
walk running, there is no need to call serialize_against_pte_lookup().
To avoid this cost, I propose a counter that keeps track of how many
lockless pagetable walks (such as find_current_mm_pte()) are currently
running, and if there are none, smp_call_function_many() is simply
skipped.
The related functions are (a short usage sketch follows the list):
start_lockless_pgtbl_walk(mm)
Insert before starting any lockless pgtable walk
end_lockless_pgtbl_walk(mm)
Insert after the end of any lockless pgtable walk
(Mostly after the ptep is last used)
running_lockless_pgtbl_walk(mm)
Returns the number of lockless pgtable walks running
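As a short usage sketch (a hypothetical caller, only to illustrate the
pattern the patches below apply around __find_linux_pte() and friends):

static unsigned long example_walk(struct mm_struct *mm, unsigned long addr)
{
        unsigned long flags, pfn = ULONG_MAX;
        pte_t *ptep;

        start_lockless_pgtbl_walk(mm);  /* announce the walk */
        local_irq_save(flags);          /* existing irq-based protection stays */

        ptep = __find_linux_pte(mm->pgd, addr, NULL, NULL);
        if (ptep && pte_present(*ptep))
                pfn = pte_pfn(*ptep);   /* last use of the ptep */

        local_irq_restore(flags);
        end_lockless_pgtbl_walk(mm);    /* after the ptep is last used */

        return pfn;
}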
On my workload (qemu), I could see munmap's time drop from 275 seconds
to 418ms.
Changes since v2:
Rebased to v5.3
Adds support for __get_user_pages_fast
Adds usage description to *_lockless_pgtbl_walk()
Better style to dummy functions
Link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=131839
Changes since v1:
Isolated atomic operations in functions *_lockless_pgtbl_walk()
Fixed behavior of decrementing before last ptep was used
Link: http://patchwork.ozlabs.org/patch/1163093/
Leonardo Bras (11):
powerpc/mm: Adds counting method to monitor lockless pgtable walks
asm-generic/pgtable: Adds dummy functions to monitor lockless pgtable
walks
mm/gup: Applies counting method to monitor gup_pgd_range
powerpc/mce_power: Applies counting method to monitor lockless pgtbl
walks
powerpc/perf: Applies counting method to monitor lockless pgtbl walks
powerpc/mm/book3s64/hash: Applies counting method to monitor lockless
pgtbl walks
powerpc/kvm/e500: Applies counting method to monitor lockless pgtbl
walks
powerpc/kvm/book3s_hv: Applies counting method to monitor lockless
pgtbl walks
powerpc/kvm/book3s_64: Applies counting method to monitor lockless
pgtbl walks
powerpc/book3s_64: Enables counting method to monitor lockless pgtbl
walk
powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing
arch/powerpc/include/asm/book3s/64/mmu.h | 3 ++
arch/powerpc/include/asm/book3s/64/pgtable.h | 5 +++
arch/powerpc/kernel/mce_power.c | 13 +++++--
arch/powerpc/kvm/book3s_64_mmu_hv.c | 2 +
arch/powerpc/kvm/book3s_64_mmu_radix.c | 20 +++++++++-
arch/powerpc/kvm/book3s_64_vio_hv.c | 4 ++
arch/powerpc/kvm/book3s_hv_nested.c | 8 ++++
arch/powerpc/kvm/book3s_hv_rm_mmu.c | 9 ++++-
arch/powerpc/kvm/e500_mmu_host.c | 4 ++
arch/powerpc/mm/book3s64/hash_tlb.c | 2 +
arch/powerpc/mm/book3s64/hash_utils.c | 7 ++++
arch/powerpc/mm/book3s64/mmu_context.c | 1 +
arch/powerpc/mm/book3s64/pgtable.c | 40 +++++++++++++++++++-
arch/powerpc/perf/callchain.c | 5 ++-
include/asm-generic/pgtable.h | 15 ++++++++
mm/gup.c | 8 ++++
16 files changed, 138 insertions(+), 8 deletions(-)
--
2.20.1
It's necessary to monitor lockless pagetable walks, in order to avoid
doing THP splitting/collapsing during them.
Some methods rely on local_irq_{save,restore}, but that can be slow in
cases where many CPUs are used by the process.
In order to speed up these cases, I propose a refcount-based approach
that counts the number of lockless pagetable walks happening in the
process.
This method does not replace the current irq-oriented method. It works
as a complement, to skip unnecessary waiting (see the sketch after the
list below).
start_lockless_pgtbl_walk(mm)
Insert before starting any lockless pgtable walk
end_lockless_pgtbl_walk(mm)
Insert after the end of any lockless pgtable walk
(Mostly after the ptep is last used)
running_lockless_pgtbl_walk(mm)
Returns the number of lockless pgtable walks running
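For reference, the serializing side (changed in the last patch of this
series) ends up consuming the counter roughly like this:

void serialize_against_pte_lookup(struct mm_struct *mm)
{
        smp_mb();
        /* No lockless walker running: skip the expensive IPI broadcast */
        if (running_lockless_pgtbl_walk(mm))
                smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
}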
Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/include/asm/book3s/64/mmu.h | 3 ++
arch/powerpc/mm/book3s64/mmu_context.c | 1 +
arch/powerpc/mm/book3s64/pgtable.c | 37 ++++++++++++++++++++++++
3 files changed, 41 insertions(+)
diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
index 23b83d3593e2..13b006e7dde4 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -116,6 +116,9 @@ typedef struct {
/* Number of users of the external (Nest) MMU */
atomic_t copros;
+ /* Number of running instances of lockless pagetable walk*/
+ atomic_t lockless_pgtbl_walk_count;
+
struct hash_mm_context *hash_context;
unsigned long vdso_base;
diff --git a/arch/powerpc/mm/book3s64/mmu_context.c b/arch/powerpc/mm/book3s64/mmu_context.c
index 2d0cb5ba9a47..3dd01c0ca5be 100644
--- a/arch/powerpc/mm/book3s64/mmu_context.c
+++ b/arch/powerpc/mm/book3s64/mmu_context.c
@@ -200,6 +200,7 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
#endif
atomic_set(&mm->context.active_cpus, 0);
atomic_set(&mm->context.copros, 0);
+ atomic_set(&mm->context.lockless_pgtbl_walk_count, 0);
return 0;
}
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 7d0e0d0d22c4..0b86884a8097 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -98,6 +98,43 @@ void serialize_against_pte_lookup(struct mm_struct *mm)
smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
}
+/*
+ * Counting method to monitor lockless pagetable walks:
+ * Uses start_lockless_pgtbl_walk and end_lockless_pgtbl_walk to track the
+ * number of lockless pgtable walks happening, and
+ * running_lockless_pgtbl_walk to return this value.
+ */
+
+/* start_lockless_pgtbl_walk: Must be inserted before a function call that does
+ * lockless pagetable walks, such as __find_linux_pte()
+ */
+void start_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+ atomic_inc(&mm->context.lockless_pgtbl_walk_count);
+}
+EXPORT_SYMBOL(start_lockless_pgtbl_walk);
+
+/*
+ * end_lockless_pgtbl_walk: Must be inserted after the last use of a pointer
+ * returned by a lockless pagetable walk, such as __find_linux_pte()
+*/
+void end_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+ atomic_dec(&mm->context.lockless_pgtbl_walk_count);
+}
+EXPORT_SYMBOL(end_lockless_pgtbl_walk);
+
+/*
+ * running_lockless_pgtbl_walk: Returns the number of lockless pagetable walks
+ * currently running. If it returns 0, there is no running pagetable walk, and
+ * THP split/collapse can be safely done. This can be used to avoid more
+ * expensive approaches like serialize_against_pte_lookup()
+ */
+int running_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+ return atomic_read(&mm->context.lockless_pgtbl_walk_count);
+}
+
/*
* We use this to invalidate a pmdp entry before switching from a
* hugepte to regular pmd entry.
--
2.20.1
There is a need to monitor lockless pagetable walks, in order to avoid
doing THP splitting/collapsing during them.
Some methods rely on local_irq_{save,restore}, but that can be slow in
cases where many CPUs are used by the process.
In order to speed up these cases, I propose a refcount-based approach
that counts the number of lockless pagetable walks happening in the
process.
Given that there are lockless pagetable walks in generic code, it's
necessary to create dummy functions for archs that won't use this
approach.
Signed-off-by: Leonardo Bras <[email protected]>
---
include/asm-generic/pgtable.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 75d9d68a6de7..0831475e72d3 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -1172,6 +1172,21 @@ static inline bool arch_has_pfn_modify_check(void)
#endif
#endif
+#ifndef __HAVE_ARCH_LOCKLESS_PGTBL_WALK_COUNTER
+static inline void start_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+}
+
+static inline void end_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+}
+
+static inline int running_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+ return 0;
+}
+#endif
+
/*
* On some architectures it depends on the mm if the p4d/pud or pmd
* layer of the page table hierarchy is folded or not.
--
2.20.1
As described, gup_pgd_range() is a lockless pagetable walk, so in order
to monitor it against THP split/collapse with the counting method, it
needs to be bounded by {start,end}_lockless_pgtbl_walk.
Dummy functions are provided, so this does not add any overhead on
archs that don't use this method.
Signed-off-by: Leonardo Bras <[email protected]>
---
mm/gup.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/mm/gup.c b/mm/gup.c
index 98f13ab37bac..eabd6fd15cf8 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2325,6 +2325,7 @@ static bool gup_fast_permitted(unsigned long start, unsigned long end)
int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages)
{
+ struct mm_struct *mm;
unsigned long len, end;
unsigned long flags;
int nr = 0;
@@ -2352,9 +2353,12 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
gup_fast_permitted(start, end)) {
+ mm = current->mm;
+ start_lockless_pgtbl_walk(mm);
local_irq_save(flags);
gup_pgd_range(start, end, write ? FOLL_WRITE : 0, pages, &nr);
local_irq_restore(flags);
+ end_lockless_pgtbl_walk(mm);
}
return nr;
@@ -2404,6 +2408,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
unsigned int gup_flags, struct page **pages)
{
unsigned long addr, len, end;
+ struct mm_struct *mm;
int nr = 0, ret = 0;
if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
@@ -2421,9 +2426,12 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
gup_fast_permitted(start, end)) {
+ mm = current->mm;
+ start_lockless_pgtbl_walk(mm);
local_irq_disable();
gup_pgd_range(addr, end, gup_flags, pages, &nr);
local_irq_enable();
+ end_lockless_pgtbl_walk(mm);
ret = nr;
}
--
2.20.1
Applies the counting-based method for monitoring all book3s_hv related
functions that do lockless pagetable walks.
Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/kvm/book3s_hv_nested.c | 8 ++++++++
arch/powerpc/kvm/book3s_hv_rm_mmu.c | 9 ++++++++-
2 files changed, 16 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
index 735e0ac6f5b2..ed68e57af3a3 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -804,6 +804,7 @@ static void kvmhv_update_nest_rmap_rc(struct kvm *kvm, u64 n_rmap,
return;
/* Find the pte */
+ start_lockless_pgtbl_walk(kvm->mm);
ptep = __find_linux_pte(gp->shadow_pgtable, gpa, NULL, &shift);
/*
* If the pte is present and the pfn is still the same, update the pte.
@@ -815,6 +816,7 @@ static void kvmhv_update_nest_rmap_rc(struct kvm *kvm, u64 n_rmap,
__radix_pte_update(ptep, clr, set);
kvmppc_radix_tlbie_page(kvm, gpa, shift, lpid);
}
+ end_lockless_pgtbl_walk(kvm->mm);
}
/*
@@ -854,10 +856,12 @@ static void kvmhv_remove_nest_rmap(struct kvm *kvm, u64 n_rmap,
return;
/* Find and invalidate the pte */
+ start_lockless_pgtbl_walk(kvm->mm);
ptep = __find_linux_pte(gp->shadow_pgtable, gpa, NULL, &shift);
/* Don't spuriously invalidate ptes if the pfn has changed */
if (ptep && pte_present(*ptep) && ((pte_val(*ptep) & mask) == hpa))
kvmppc_unmap_pte(kvm, ptep, gpa, shift, NULL, gp->shadow_lpid);
+ end_lockless_pgtbl_walk(kvm->mm);
}
static void kvmhv_remove_nest_rmap_list(struct kvm *kvm, unsigned long *rmapp,
@@ -921,6 +925,7 @@ static bool kvmhv_invalidate_shadow_pte(struct kvm_vcpu *vcpu,
int shift;
spin_lock(&kvm->mmu_lock);
+ start_lockless_pgtbl_walk(kvm->mm);
ptep = __find_linux_pte(gp->shadow_pgtable, gpa, NULL, &shift);
if (!shift)
shift = PAGE_SHIFT;
@@ -928,6 +933,7 @@ static bool kvmhv_invalidate_shadow_pte(struct kvm_vcpu *vcpu,
kvmppc_unmap_pte(kvm, ptep, gpa, shift, NULL, gp->shadow_lpid);
ret = true;
}
+ end_lockless_pgtbl_walk(kvm->mm);
spin_unlock(&kvm->mmu_lock);
if (shift_ret)
@@ -1362,11 +1368,13 @@ static long int __kvmhv_nested_page_fault(struct kvm_run *run,
/* See if can find translation in our partition scoped tables for L1 */
pte = __pte(0);
spin_lock(&kvm->mmu_lock);
+ start_lockless_pgtbl_walk(kvm->mm);
pte_p = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (!shift)
shift = PAGE_SHIFT;
if (pte_p)
pte = *pte_p;
+ end_lockless_pgtbl_walk(kvm->mm);
spin_unlock(&kvm->mmu_lock);
if (!pte_present(pte) || (writing && !(pte_val(pte) & _PAGE_WRITE))) {
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 63e0ce91e29d..53ca67492211 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -258,6 +258,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
* If called in real mode we have MSR_EE = 0. Otherwise
* we disable irq above.
*/
+ start_lockless_pgtbl_walk(kvm->mm);
ptep = __find_linux_pte(pgdir, hva, NULL, &hpage_shift);
if (ptep) {
pte_t pte;
@@ -311,6 +312,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
ptel &= ~(HPTE_R_W|HPTE_R_I|HPTE_R_G);
ptel |= HPTE_R_M;
}
+ end_lockless_pgtbl_walk(kvm->mm);
/* Find and lock the HPTEG slot to use */
do_insert:
@@ -886,10 +888,15 @@ static int kvmppc_get_hpa(struct kvm_vcpu *vcpu, unsigned long gpa,
hva = __gfn_to_hva_memslot(memslot, gfn);
/* Try to find the host pte for that virtual address */
+ start_lockless_pgtbl_walk(kvm->mm);
ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift);
- if (!ptep)
+ if (!ptep) {
+ end_lockless_pgtbl_walk(kvm->mm);
return H_TOO_HARD;
+ }
pte = kvmppc_read_update_linux_pte(ptep, writing);
+ end_lockless_pgtbl_walk(kvm->mm);
+
if (!pte_present(pte))
return H_TOO_HARD;
--
2.20.1
Applies the counting-based method for monitoring all book3s_64-related
functions that do lockless pagetable walks.
Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/kvm/book3s_64_mmu_hv.c | 2 ++
arch/powerpc/kvm/book3s_64_mmu_radix.c | 20 ++++++++++++++++++--
arch/powerpc/kvm/book3s_64_vio_hv.c | 4 ++++
3 files changed, 24 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 9a75f0e1933b..fcd3dad1297f 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -620,6 +620,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
* We need to protect against page table destruction
* hugepage split and collapse.
*/
+ start_lockless_pgtbl_walk(kvm->mm);
local_irq_save(flags);
ptep = find_current_mm_pte(current->mm->pgd,
hva, NULL, NULL);
@@ -629,6 +630,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
write_ok = 1;
}
local_irq_restore(flags);
+ end_lockless_pgtbl_walk(kvm->mm);
}
}
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 2d415c36a61d..d46f8258d8d6 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -741,6 +741,7 @@ bool kvmppc_hv_handle_set_rc(struct kvm *kvm, pgd_t *pgtable, bool writing,
unsigned long pgflags;
unsigned int shift;
pte_t *ptep;
+ bool ret = false;
/*
* Need to set an R or C bit in the 2nd-level tables;
@@ -755,12 +756,14 @@ bool kvmppc_hv_handle_set_rc(struct kvm *kvm, pgd_t *pgtable, bool writing,
* We can do this without disabling irq because the Linux MM
* subsystem doesn't do THP splits and collapses on this tree.
*/
+ start_lockless_pgtbl_walk(kvm->mm);
ptep = __find_linux_pte(pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep) && (!writing || pte_write(*ptep))) {
kvmppc_radix_update_pte(kvm, ptep, 0, pgflags, gpa, shift);
- return true;
+ ret = true;
}
- return false;
+ end_lockless_pgtbl_walk(kvm->mm);
+ return ret;
}
int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
@@ -813,6 +816,7 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
* Read the PTE from the process' radix tree and use that
* so we get the shift and attribute bits.
*/
+ start_lockless_pgtbl_walk(kvm->mm);
local_irq_disable();
ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift);
/*
@@ -821,12 +825,14 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
*/
if (!ptep) {
local_irq_enable();
+ end_lockless_pgtbl_walk(kvm->mm);
if (page)
put_page(page);
return RESUME_GUEST;
}
pte = *ptep;
local_irq_enable();
+ end_lockless_pgtbl_walk(kvm->mm);
/* If we're logging dirty pages, always map single pages */
large_enable = !(memslot->flags & KVM_MEM_LOG_DIRTY_PAGES);
@@ -972,10 +978,12 @@ int kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
unsigned long gpa = gfn << PAGE_SHIFT;
unsigned int shift;
+ start_lockless_pgtbl_walk(kvm->mm);
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep))
kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot,
kvm->arch.lpid);
+ end_lockless_pgtbl_walk(kvm->mm);
return 0;
}
@@ -989,6 +997,7 @@ int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
int ref = 0;
unsigned long old, *rmapp;
+ start_lockless_pgtbl_walk(kvm->mm);
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep) && pte_young(*ptep)) {
old = kvmppc_radix_update_pte(kvm, ptep, _PAGE_ACCESSED, 0,
@@ -1001,6 +1010,7 @@ int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
1UL << shift);
ref = 1;
}
+ end_lockless_pgtbl_walk(kvm->mm);
return ref;
}
@@ -1013,9 +1023,11 @@ int kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
unsigned int shift;
int ref = 0;
+ start_lockless_pgtbl_walk(kvm->mm);
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep) && pte_young(*ptep))
ref = 1;
+ end_lockless_pgtbl_walk(kvm->mm);
return ref;
}
@@ -1030,6 +1042,7 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
int ret = 0;
unsigned long old, *rmapp;
+ start_lockless_pgtbl_walk(kvm->mm);
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep) && pte_dirty(*ptep)) {
ret = 1;
@@ -1046,6 +1059,7 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
1UL << shift);
spin_unlock(&kvm->mmu_lock);
}
+ end_lockless_pgtbl_walk(kvm->mm);
return ret;
}
@@ -1084,6 +1098,7 @@ void kvmppc_radix_flush_memslot(struct kvm *kvm,
gpa = memslot->base_gfn << PAGE_SHIFT;
spin_lock(&kvm->mmu_lock);
+ start_lockless_pgtbl_walk(kvm->mm);
for (n = memslot->npages; n; --n) {
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep))
@@ -1091,6 +1106,7 @@ void kvmppc_radix_flush_memslot(struct kvm *kvm,
kvm->arch.lpid);
gpa += PAGE_SIZE;
}
+ end_lockless_pgtbl_walk(kvm->mm);
spin_unlock(&kvm->mmu_lock);
}
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index b4f20f13b860..d7ea44f28993 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -431,6 +431,7 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
static long kvmppc_rm_ua_to_hpa(struct kvm_vcpu *vcpu,
unsigned long ua, unsigned long *phpa)
{
+ struct kvm *kvm = vcpu->kvm;
pte_t *ptep, pte;
unsigned shift = 0;
@@ -443,10 +444,13 @@ static long kvmppc_rm_ua_to_hpa(struct kvm_vcpu *vcpu,
* to exit which will agains result in the below page table walk
* to finish.
*/
+
+ start_lockless_pgtbl_walk(kvm->mm);
ptep = __find_linux_pte(vcpu->arch.pgdir, ua, NULL, &shift);
if (!ptep || !pte_present(*ptep))
return -ENXIO;
pte = *ptep;
+ end_lockless_pgtbl_walk(kvm->mm);
if (!shift)
shift = PAGE_SHIFT;
--
2.20.1
Applies the counting-based method for monitoring lockless pgtable walks on
addr_to_pfn().
Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/kernel/mce_power.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index a814d2dfb5b0..0f2f87da4cd1 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -27,6 +27,7 @@ unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
{
pte_t *ptep;
unsigned long flags;
+ unsigned long pfn;
struct mm_struct *mm;
if (user_mode(regs))
@@ -34,15 +35,21 @@ unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
else
mm = &init_mm;
+ start_lockless_pgtbl_walk(mm);
local_irq_save(flags);
if (mm == current->mm)
ptep = find_current_mm_pte(mm->pgd, addr, NULL, NULL);
else
ptep = find_init_mm_pte(addr, NULL);
- local_irq_restore(flags);
+
if (!ptep || pte_special(*ptep))
- return ULONG_MAX;
- return pte_pfn(*ptep);
+ pfn = ULONG_MAX;
+ else
+ pfn = pte_pfn(*ptep);
+
+ local_irq_restore(flags);
+ end_lockless_pgtbl_walk(mm);
+ return pfn;
}
/* flush SLBs and reload */
--
2.20.1
Applies the counting-based method for monitoring all hash-related functions
that do lockless pagetable walks.
Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/mm/book3s64/hash_tlb.c | 2 ++
arch/powerpc/mm/book3s64/hash_utils.c | 7 +++++++
2 files changed, 9 insertions(+)
diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c b/arch/powerpc/mm/book3s64/hash_tlb.c
index 4a70d8dd39cd..5e5213c3f7c4 100644
--- a/arch/powerpc/mm/book3s64/hash_tlb.c
+++ b/arch/powerpc/mm/book3s64/hash_tlb.c
@@ -209,6 +209,7 @@ void __flush_hash_table_range(struct mm_struct *mm, unsigned long start,
* to being hashed). This is not the most performance oriented
* way to do things but is fine for our needs here.
*/
+ start_lockless_pgtbl_walk(mm);
local_irq_save(flags);
arch_enter_lazy_mmu_mode();
for (; start < end; start += PAGE_SIZE) {
@@ -230,6 +231,7 @@ void __flush_hash_table_range(struct mm_struct *mm, unsigned long start,
}
arch_leave_lazy_mmu_mode();
local_irq_restore(flags);
+ end_lockless_pgtbl_walk(mm);
}
void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index b8ad14bb1170..299946cedc3a 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -1322,6 +1322,7 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea,
#endif /* CONFIG_PPC_64K_PAGES */
/* Get PTE and page size from page tables */
+ start_lockless_pgtbl_walk(mm);
ptep = find_linux_pte(pgdir, ea, &is_thp, &hugeshift);
if (ptep == NULL || !pte_present(*ptep)) {
DBG_LOW(" no PTE !\n");
@@ -1438,6 +1439,7 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea,
DBG_LOW(" -> rc=%d\n", rc);
bail:
+ end_lockless_pgtbl_walk(mm);
exception_exit(prev_state);
return rc;
}
@@ -1547,10 +1549,12 @@ void hash_preload(struct mm_struct *mm, unsigned long ea,
vsid = get_user_vsid(&mm->context, ea, ssize);
if (!vsid)
return;
+
/*
* Hash doesn't like irqs. Walking linux page table with irq disabled
* saves us from holding multiple locks.
*/
+ start_lockless_pgtbl_walk(mm);
local_irq_save(flags);
/*
@@ -1597,6 +1601,7 @@ void hash_preload(struct mm_struct *mm, unsigned long ea,
pte_val(*ptep));
out_exit:
local_irq_restore(flags);
+ end_lockless_pgtbl_walk(mm);
}
#ifdef CONFIG_PPC_MEM_KEYS
@@ -1613,11 +1618,13 @@ u16 get_mm_addr_key(struct mm_struct *mm, unsigned long address)
if (!mm || !mm->pgd)
return 0;
+ start_lockless_pgtbl_walk(mm);
local_irq_save(flags);
ptep = find_linux_pte(mm->pgd, address, NULL, NULL);
if (ptep)
pkey = pte_to_pkey_bits(pte_val(READ_ONCE(*ptep)));
local_irq_restore(flags);
+ end_lockless_pgtbl_walk(mm);
return pkey;
}
--
2.20.1
Enables count-based monitoring method for lockless pagetable walks on
PowerPC book3s_64.
Other architectures/platforms fall back to using the generic dummy
functions.
Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/include/asm/book3s/64/pgtable.h | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 8308f32e9782..eb9b26a4a483 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1370,5 +1370,10 @@ static inline bool pgd_is_leaf(pgd_t pgd)
return !!(pgd_raw(pgd) & cpu_to_be64(_PAGE_PTE));
}
+#define __HAVE_ARCH_LOCKLESS_PGTBL_WALK_COUNTER
+void start_lockless_pgtbl_walk(struct mm_struct *mm);
+void end_lockless_pgtbl_walk(struct mm_struct *mm);
+int running_lockless_pgtbl_walk(struct mm_struct *mm);
+
#endif /* __ASSEMBLY__ */
#endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */
--
2.20.1
Skips the slow part of serialize_against_pte_lookup() if there is no
lockless pagetable walk currently running.
Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/mm/book3s64/pgtable.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 0b86884a8097..e2aa6572f03f 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -95,7 +95,8 @@ static void do_nothing(void *unused)
void serialize_against_pte_lookup(struct mm_struct *mm)
{
smp_mb();
- smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
+ if (running_lockless_pgtbl_walk(mm))
+ smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
}
/*
--
2.20.1
Applies the counting-based method for monitoring lockless pgtable walks on
read_user_stack_slow.
Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/perf/callchain.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/perf/callchain.c b/arch/powerpc/perf/callchain.c
index c84bbd4298a0..9d76194a2a8f 100644
--- a/arch/powerpc/perf/callchain.c
+++ b/arch/powerpc/perf/callchain.c
@@ -113,16 +113,18 @@ static int read_user_stack_slow(void __user *ptr, void *buf, int nb)
int ret = -EFAULT;
pgd_t *pgdir;
pte_t *ptep, pte;
+ struct mm_struct *mm = current->mm;
unsigned shift;
unsigned long addr = (unsigned long) ptr;
unsigned long offset;
unsigned long pfn, flags;
void *kaddr;
- pgdir = current->mm->pgd;
+ pgdir = mm->pgd;
if (!pgdir)
return -EFAULT;
+ start_lockless_pgtbl_walk(mm);
local_irq_save(flags);
ptep = find_current_mm_pte(pgdir, addr, NULL, &shift);
if (!ptep)
@@ -146,6 +148,7 @@ static int read_user_stack_slow(void __user *ptr, void *buf, int nb)
ret = 0;
err_out:
local_irq_restore(flags);
+ end_lockless_pgtbl_walk(mm);
return ret;
}
--
2.20.1
Applies the counting-based method for monitoring lockless pgtable walks on
kvmppc_e500_shadow_map().
Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/kvm/e500_mmu_host.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index 321db0fdb9db..a1b8bfe20bc8 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -473,6 +473,7 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
* We are holding kvm->mmu_lock so a notifier invalidate
* can't run hence pfn won't change.
*/
+ start_lockless_pgtbl_walk(kvm->mm);
local_irq_save(flags);
ptep = find_linux_pte(pgdir, hva, NULL, NULL);
if (ptep) {
@@ -484,12 +485,15 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
local_irq_restore(flags);
} else {
local_irq_restore(flags);
+ end_lockless_pgtbl_walk(kvm->mm);
pr_err_ratelimited("%s: pte not present: gfn %lx,pfn %lx\n",
__func__, (long)gfn, pfn);
ret = -EINVAL;
goto out;
}
}
+ end_lockless_pgtbl_walk(kvm->mm);
+
kvmppc_e500_ref_setup(ref, gtlbe, pfn, wimg);
kvmppc_e500_setup_stlbe(&vcpu_e500->vcpu, gtlbe, tsize,
--
2.20.1
John Hubbard <[email protected]> writes:
> Hi Leonardo,
>
> Thanks for adding linux-mm to CC for this next round of reviews. For the benefit
> of any new reviewers, I'd like to add that there are some issues that were discovered
> while reviewing the v2 patchset, that are not (yet) addressed in this v3 series.
> Since those issues are not listed in the cover letter above, I'll list them here
Thanks for bringing that up.
The cover letter is a great place to put this info; I will keep that in
mind for future patchsets.
>
> 1. The locking model requires a combination of disabling interrupts and
> atomic counting and memory barriers, but
>
> a) some memory barriers are missing
> (start/end_lockless_pgtbl_walk), and
It seems to work fine today because of the number of instructions
executed between irq_disable / start_lockless_pgtbl_walk and the point
where the THP collapse/split can happen. (It's very unlikely that it
reorders that much.)
But I don't think it would be so bad to put a memory barrier after
irq_disable, just in case.
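Something along these lines, for example (untested sketch, the exact
barrier placement still needs to be worked out):

void start_lockless_pgtbl_walk(struct mm_struct *mm)
{
        atomic_inc(&mm->context.lockless_pgtbl_walk_count);
        /* Make the increment visible before the walk starts */
        smp_mb__after_atomic();
}

void end_lockless_pgtbl_walk(struct mm_struct *mm)
{
        /* Make the last ptep access visible before the decrement */
        smp_mb__before_atomic();
        atomic_dec(&mm->context.lockless_pgtbl_walk_count);
}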
> b) some cases (patch #8) fail to disable interrupts
I have looked into that, and it seems that some uses of
{start,end}_lockless_pgtbl_walk are unneeded, because they operate on a
(nested) guest pgd, and I was told that is safe against THP
split/collapse.
In the other uses, there is no interrupt disable because the function
is called in real mode, with MSR_EE=0, where interrupts are already
disabled, so there is no need to disable them again.
>
> ...so the synchronization appears to be inadequate. (And if it *is* adequate, then
> definitely we need the next item, to explain it.)
>
> 2. Documentation of the synchronization/locking model needs to exist, once we
> figure out the exact details of (1).
I will add the missing doc in the code, so it may be easier to
understand in the future.
>
> 3. Related to (1), I've asked to change things so that interrupt controls and
> atomic inc/dec are in the same start/end calls--assuming, of course, that the
> caller can tolerate that.
I am not sure if it would be ok to use irq_{save,restore} in real mode;
I will do some more reading of the docs before addressing this.
>
> 4. Please see the v2 series for any other details I've missed.
>
> thanks,
> --
> John Hubbard
> NVIDIA
>
Thank you for helping, John!
Best regards,
Leonardo Bras
On Fri, 2019-09-27 at 11:46 -0300, Leonardo Bras wrote:
> I am not sure if it would be ok to use irq_{save,restore} in real mode;
> I will do some more reading of the docs before addressing this.
It looks like it's unsafe to merge irq_{save,restore} into
{start,end}_lockless_pgtbl_walk(), because it could end up accessing
code that is not available in real mode.
I am sending a v4 with the changes so far.
I look forward to your feedback.