2020-02-06 03:22:29

by Leonardo Bras

[permalink] [raw]
Subject: [PATCH v6 00/11] Introduces new functions for tracking lockless pagetable walks

Patches 1-2: Introduce new arch-generic functions to use before
and after lockless pagetable walks, instead of local_irq_*, and
apply them to generic code. This makes lockless pagetable walks
more explicit and improves their documentation.

Patches 3-9: Introduce a powerpc-specific version of the above
functions, with the option to not touch the irq config, and apply
them to all powerpc code that does lockless pagetable walks.

Patches 10-11: Introduce a percpu counting method to keep track of
lockless pagetable walks, then use this info to reduce the
waiting time in serialize_against_pte_lookup().

Use case:

If a process (qemu) with a lot of CPUs (128) tries to munmap() a large
chunk of memory (496GB) mapped with THP, it takes an average of 275
seconds, which can cause a lot of problems for the workload (in the
qemu case, the guest will be locked up for this time).

Trying to find the source of this bug, I found out most of this time is
spent in serialize_against_pte_lookup(). This function will take a lot
of time in smp_call_function_many() if there is more than a couple of
CPUs running the user process. Since it has to happen for every mapped
THP, it will take a very long time for large amounts of memory.

According to the docs, serialize_against_pte_lookup() is needed in order
to keep the pmd_t to pte_t casting inside find_current_mm_pte(), or any
lockless pagetable walk, from happening concurrently with THP
splitting/collapsing.

It does so by calling do_nothing() on each CPU in mm->cpu_bitmap[];
the IPI can only be serviced after that CPU re-enables interrupts.
Since interrupts are (usually) disabled during a lockless pagetable
walk, and serialize_against_pte_lookup() only returns after every such
CPU has handled the IPI, the walk is protected.
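
For reference, the current implementation is roughly the following (its
context also shows up in the patch 03 diff below):

	static void do_nothing(void *unused)
	{
	}

	void serialize_against_pte_lookup(struct mm_struct *mm)
	{
		smp_mb();
		smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
	}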

Percpu count-based method:

So, as far as I can understand, if there is no lockless pagetable walk
running on a given cpu, there is no need to call
serialize_against_pte_lookup() against that cpu.

To reduce the cost of running serialize_against_pte_lookup(), I
propose a percpu counter that keeps track of how many
lockless pagetable walks are currently running on each cpu; if there
are none on a given cpu, smp_call_function_many() can simply skip it.

- Every percpu counter can be changed only by its own CPU
- It makes use of the memory barriers already present in the functions
- Any counter can be read by any CPU

Since it takes no locks and uses no atomic variables, the impact on the
lockless pagetable walk is intended to be minimal.
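
A minimal sketch of the skip logic (the real version lands in patch 11;
iterating with single IPIs here is just for illustration, a cpumask
could be built instead):

	for_each_cpu(cpu, mm_cpumask(mm)) {
		/* Each counter is only ever written by its owning cpu */
		if (per_cpu(lockless_pgtbl_walk_counter, cpu) > 0)
			smp_call_function_single(cpu, do_nothing, NULL, 1);
	}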

The related functions are:
begin_lockless_pgtbl_walk()
Insert before starting any lockless pgtable walk
end_lockless_pgtbl_walk()
Insert after the end of any lockless pgtable walk
(Mostly after the ptep is last used)
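
A typical conversion, following the pattern used across the patches
below, looks like:

	unsigned long irq_mask;

	irq_mask = begin_lockless_pgtbl_walk();
	ptep = __find_linux_pte(mm->pgd, addr, NULL, &shift);
	/* use ptep while the walk is protected */
	end_lockless_pgtbl_walk(irq_mask);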

Results:

On my workload (qemu), I could see munmap's time reduction from 275
seconds to 430ms.

Bonus:

I documented some lockless pagetable walks where it's not
necessary to keep track, given they work on init_mm or a guest pgd.

Also fixed some misplaced local_irq_{restore, enable}.

Changes since v5:
Changed counting approach from atomic variables to percpu variables
Counting method only affects powerpc; the arch-generic version only toggles irqs
Changed commit order, so the counting method is introduced at the end
Removed config option, always enabled in powerpc
Rebased on top of v5.5
Link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=133907

Changes since v4:
Rebased on top of v5.4-rc1
Declared real generic functions instead of dummies
start_lockless_pgtbl_walk renamed to begin_lockless_pgtbl_walk
Interrupt {dis,en}able is now inside of {begin,end}_lockless_pgtbl_walk
Power implementation has option to not {dis,en}able interrupt
More documentation inside the functions.
Some irq masks variables renamed
Removed some proxy mm_structs
Few typos fixed
Link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=133015

Changes since v3:
Explain (comments) why some lockless pgtbl walks don't need
local_irq_disable (real mode + MSR_EE=0)
Explain (comments) places where counting method is not needed (guest pgd,
which is not touched by THP)
Fixes some misplaced local_irq_restore()
Link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=132417

Changes since v2:
Rebased to v5.3
Adds support for __get_user_pages_fast
Adds usage description to *_lockless_pgtbl_walk()
Better style to dummy functions
Link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=131839

Changes since v1:
Isolated atomic operations in functions *_lockless_pgtbl_walk()
Fixed behavior of decrementing before last ptep was used
Link: http://patchwork.ozlabs.org/patch/1163093/

Special thanks to:
Aneesh Kumar, Nick Piggin, Paul Mackerras, Michael Ellerman, Fabiano Rosas,
Dipankar Sarma and Oliver O'Halloran.


Leonardo Bras (11):
asm-generic/pgtable: Adds generic functions to track lockless pgtable
walks
mm/gup: Use functions to track lockless pgtbl walks on gup_pgd_range
powerpc/mm: Adds arch-specific functions to track lockless pgtable
walks
powerpc/mce_power: Use functions to track lockless pgtbl walks
powerpc/perf: Use functions to track lockless pgtbl walks
powerpc/mm/book3s64/hash: Use functions to track lockless pgtbl walks
powerpc/kvm/e500: Use functions to track lockless pgtbl walks
powerpc/kvm/book3s_hv: Use functions to track lockless pgtbl walks
powerpc/kvm/book3s_64: Use functions to track lockless pgtbl walks
powerpc/mm: Adds counting method to track lockless pagetable walks
powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing

arch/powerpc/include/asm/book3s/64/pgtable.h | 6 +
arch/powerpc/kernel/mce_power.c | 6 +-
arch/powerpc/kvm/book3s_64_mmu_hv.c | 6 +-
arch/powerpc/kvm/book3s_64_mmu_radix.c | 34 +++++-
arch/powerpc/kvm/book3s_64_vio_hv.c | 6 +-
arch/powerpc/kvm/book3s_hv_nested.c | 22 +++-
arch/powerpc/kvm/book3s_hv_rm_mmu.c | 28 +++--
arch/powerpc/kvm/e500_mmu_host.c | 9 +-
arch/powerpc/mm/book3s64/hash_tlb.c | 6 +-
arch/powerpc/mm/book3s64/hash_utils.c | 27 +++--
arch/powerpc/mm/book3s64/pgtable.c | 120 ++++++++++++++++++-
arch/powerpc/perf/callchain.c | 6 +-
include/asm-generic/pgtable.h | 51 ++++++++
mm/gup.c | 10 +-
14 files changed, 288 insertions(+), 49 deletions(-)

--
2.24.1


2020-02-06 03:23:16

by Leonardo Bras

[permalink] [raw]
Subject: [PATCH v6 03/11] powerpc/mm: Adds arch-specific functions to track lockless pgtable walks

On powerpc, we need to do some lockless pagetable walks from functions
that already have interrupts disabled, especially from real mode with
MSR[EE=0].

In these contexts, disabling/enabling interrupts can be very troublesome.

So, this arch-specific implementation features functions with an extra
argument that allows interrupt enable/disable to be skipped:
__begin_lockless_pgtbl_walk() and __end_lockless_pgtbl_walk().

Functions similar to the generic ones are also exported, by calling
the above functions with parameter {en,dis}able_irq = true.
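
The pattern used by the realmode callers later in this series looks
like this (taken from the book3s_hv changes; error handling trimmed):

	/* Real mode, MSR[EE]=0: irqs already disabled, don't touch them */
	__begin_lockless_pgtbl_walk(false);
	ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift);
	if (ptep)
		pte = kvmppc_read_update_linux_pte(ptep, writing);
	__end_lockless_pgtbl_walk(0, false);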

Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/include/asm/book3s/64/pgtable.h | 6 ++
arch/powerpc/mm/book3s64/pgtable.c | 86 +++++++++++++++++++-
2 files changed, 91 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 201a69e6a355..78f6ffb1bb3e 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1375,5 +1375,11 @@ static inline bool pgd_is_leaf(pgd_t pgd)
return !!(pgd_raw(pgd) & cpu_to_be64(_PAGE_PTE));
}

+#define __HAVE_ARCH_LOCKLESS_PGTBL_WALK_CONTROL
+unsigned long begin_lockless_pgtbl_walk(void);
+unsigned long __begin_lockless_pgtbl_walk(bool disable_irq);
+void end_lockless_pgtbl_walk(unsigned long irq_mask);
+void __end_lockless_pgtbl_walk(unsigned long irq_mask, bool enable_irq);
+
#endif /* __ASSEMBLY__ */
#endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 2bf7e1b4fd82..535613030363 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -82,6 +82,7 @@ static void do_nothing(void *unused)
{

}
+
/*
* Serialize against find_current_mm_pte which does lock-less
* lookup in page tables with local interrupts disabled. For huge pages
@@ -98,6 +99,89 @@ void serialize_against_pte_lookup(struct mm_struct *mm)
smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
}

+/* __begin_lockless_pgtbl_walk: Must be inserted before a function call that does
+ * lockless pagetable walks, such as __find_linux_pte().
+ * This version allows setting disable_irq=false, so irqs are not touched, which
+ * is quite useful for running when ints are already disabled (like real-mode)
+ */
+inline
+unsigned long __begin_lockless_pgtbl_walk(bool disable_irq)
+{
+ unsigned long irq_mask = 0;
+
+ /*
+ * Interrupts must be disabled during the lockless page table walk.
+ * That's because the deleting or splitting involves flushing TLBs,
+ * which in turn issues interrupts, that will block when disabled.
+ *
+ * When this function is called from realmode with MSR[EE=0],
+ * it's not needed to touch irq, since it's already disabled.
+ */
+ if (disable_irq)
+ local_irq_save(irq_mask);
+
+ /*
+ * This memory barrier pairs with any code that is either trying to
+ * delete page tables, or split huge pages. Without this barrier,
+ * the page tables could be read speculatively outside of interrupt
+ * disabling or reference counting.
+ */
+ smp_mb();
+
+ return irq_mask;
+}
+EXPORT_SYMBOL(__begin_lockless_pgtbl_walk);
+
+/* begin_lockless_pgtbl_walk: Must be inserted before a function call that does
+ * lockless pagetable walks, such as __find_linux_pte().
+ * This version is used by generic code, and always assume irqs will be disabled
+ */
+unsigned long begin_lockless_pgtbl_walk(void)
+{
+ return __begin_lockless_pgtbl_walk(true);
+}
+EXPORT_SYMBOL(begin_lockless_pgtbl_walk);
+
+/*
+ * __end_lockless_pgtbl_walk: Must be inserted after the last use of a pointer
+ * returned by a lockless pagetable walk, such as __find_linux_pte()
+ * This version allows setting enable_irq=false, so irqs are not touched, which
+ * is quite useful for running when ints are already disabled (like real-mode)
+ */
+inline void __end_lockless_pgtbl_walk(unsigned long irq_mask, bool enable_irq)
+{
+ /*
+ * This memory barrier pairs with any code that is either trying to
+ * delete page tables, or split huge pages. Without this barrier,
+ * the page tables could be read speculatively outside of interrupt
+ * disabling or reference counting.
+ */
+ smp_mb();
+
+ /*
+ * Interrupts must be disabled during the lockless page table walk.
+ * That's because the deleting or splitting involves flushing TLBs,
+ * which in turn issues interrupts, that will block when disabled.
+ *
+ * When this function is called from realmode with MSR[EE=0],
+ * it's not needed to touch irq, since it's already disabled.
+ */
+ if (enable_irq)
+ local_irq_restore(irq_mask);
+}
+EXPORT_SYMBOL(__end_lockless_pgtbl_walk);
+
+/*
+ * end_lockless_pgtbl_walk: Must be inserted after the last use of a pointer
+ * returned by a lockless pagetable walk, such as __find_linux_pte()
+ * This version is used by generic code, and always assume irqs will be enabled
+ */
+void end_lockless_pgtbl_walk(unsigned long irq_mask)
+{
+ __end_lockless_pgtbl_walk(irq_mask, true);
+}
+EXPORT_SYMBOL(end_lockless_pgtbl_walk);
+
/*
* We use this to invalidate a pmdp entry before switching from a
* hugepte to regular pmd entry.
@@ -487,7 +571,7 @@ static int __init setup_disable_tlbie(char *str)
tlbie_capable = false;
tlbie_enabled = false;

- return 1;
+ return 1;
}
__setup("disable_tlbie", setup_disable_tlbie);

--
2.24.1

2020-02-06 03:23:29

by Leonardo Bras

[permalink] [raw]
Subject: [PATCH v6 01/11] asm-generic/pgtable: Adds generic functions to track lockless pgtable walks

It's necessary to track lockless pagetable walks, in order to avoid doing
THP splitting/collapsing during them.

The default solution is to disable irq before lockless pagetable walks and
enable it after it's finished.

In code, this means you can find local_irq_disable() and local_irq_enable()
around some pieces of code, usually without comments explaining why they
are needed.

This patch proposes a set of generic functions to be called before starting
and after finishing a lockless pagetable walk. It is supposed to make clear
that a lockless pagetable walk happens there, and also carries information
on why the irq disable/enable is needed.

begin_lockless_pgtbl_walk()
Insert before starting any lockless pgtable walk
end_lockless_pgtbl_walk()
Insert after the end of any lockless pgtable walk
(Mostly after the ptep is last used)

A memory barrier was also added just to make sure there is no speculative
read outside the interrupt disabled area. Other than that, it is not
supposed to change the behavior of the current code.

It is planned to allow arch-specific versions, so that additional steps can
be added while keeping the code clean.
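
The override follows the usual asm-generic pattern: an arch defines
__HAVE_ARCH_LOCKLESS_PGTBL_WALK_CONTROL and provides the symbols
itself, as the powerpc version does in patch 03:

	#define __HAVE_ARCH_LOCKLESS_PGTBL_WALK_CONTROL
	unsigned long begin_lockless_pgtbl_walk(void);
	void end_lockless_pgtbl_walk(unsigned long irq_mask);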

Signed-off-by: Leonardo Bras <[email protected]>
---
include/asm-generic/pgtable.h | 51 +++++++++++++++++++++++++++++++++++
1 file changed, 51 insertions(+)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index e2e2bef07dd2..8d368d3c0974 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -1222,6 +1222,57 @@ static inline bool arch_has_pfn_modify_check(void)
#endif
#endif

+#ifndef __HAVE_ARCH_LOCKLESS_PGTBL_WALK_CONTROL
+/*
+ * begin_lockless_pgtbl_walk: Must be inserted before a function call that does
+ * lockless pagetable walks, such as __find_linux_pte()
+ */
+static inline
+unsigned long begin_lockless_pgtbl_walk(void)
+{
+ unsigned long irq_mask;
+
+ /*
+ * Interrupts must be disabled during the lockless page table walk.
+ * That's because the deleting or splitting involves flushing TLBs,
+ * which in turn issues interrupts, that will block when disabled.
+ */
+ local_irq_save(irq_mask);
+
+ /*
+ * This memory barrier pairs with any code that is either trying to
+ * delete page tables, or split huge pages. Without this barrier,
+ * the page tables could be read speculatively outside of interrupt
+ * disabling.
+ */
+ smp_mb();
+
+ return irq_mask;
+}
+
+/*
+ * end_lockless_pgtbl_walk: Must be inserted after the last use of a pointer
+ * returned by a lockless pagetable walk, such as __find_linux_pte()
+ */
+static inline void end_lockless_pgtbl_walk(unsigned long irq_mask)
+{
+ /*
+ * This memory barrier pairs with any code that is either trying to
+ * delete page tables, or split huge pages. Without this barrier,
+ * the page tables could be read speculatively outside of interrupt
+ * disabling.
+ */
+ smp_mb();
+
+ /*
+ * Interrupts must be disabled during the lockless page table walk.
+ * That's because the deleting or splitting involves flushing TLBs,
+ * which in turn issues interrupts, that will block when disabled.
+ */
+ local_irq_restore(irq_mask);
+}
+#endif
+
/*
* On some architectures it depends on the mm if the p4d/pud or pmd
* layer of the page table hierarchy is folded or not.
--
2.24.1

2020-02-06 03:23:39

by Leonardo Bras

[permalink] [raw]
Subject: [PATCH v6 06/11] powerpc/mm/book3s64/hash: Use functions to track lockless pgtbl walks

Applies the new tracking functions to all hash-related functions that do
lockless pagetable walks.

hash_page_mm: Adds a comment explaining that there is no need for
local_irq_disable/save given that it is only called from the DataAccess
interrupt, so interrupts are already disabled.

local_irq_{save,restore} is already inside {begin,end}_lockless_pgtbl_walk,
so there is no need to repeat it here.

The variable that saves the irq mask was renamed from flags to irq_mask so it
doesn't lose meaning now that it's not directly passed to local_irq_* functions.

Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/mm/book3s64/hash_tlb.c | 6 +++---
arch/powerpc/mm/book3s64/hash_utils.c | 27 +++++++++++++++++----------
2 files changed, 20 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c b/arch/powerpc/mm/book3s64/hash_tlb.c
index 4a70d8dd39cd..86547c4151f6 100644
--- a/arch/powerpc/mm/book3s64/hash_tlb.c
+++ b/arch/powerpc/mm/book3s64/hash_tlb.c
@@ -194,7 +194,7 @@ void __flush_hash_table_range(struct mm_struct *mm, unsigned long start,
{
bool is_thp;
int hugepage_shift;
- unsigned long flags;
+ unsigned long irq_mask;

start = _ALIGN_DOWN(start, PAGE_SIZE);
end = _ALIGN_UP(end, PAGE_SIZE);
@@ -209,7 +209,7 @@ void __flush_hash_table_range(struct mm_struct *mm, unsigned long start,
* to being hashed). This is not the most performance oriented
* way to do things but is fine for our needs here.
*/
- local_irq_save(flags);
+ irq_mask = begin_lockless_pgtbl_walk();
arch_enter_lazy_mmu_mode();
for (; start < end; start += PAGE_SIZE) {
pte_t *ptep = find_current_mm_pte(mm->pgd, start, &is_thp,
@@ -229,7 +229,7 @@ void __flush_hash_table_range(struct mm_struct *mm, unsigned long start,
hpte_need_flush(mm, start, ptep, pte, hugepage_shift);
}
arch_leave_lazy_mmu_mode();
- local_irq_restore(flags);
+ end_lockless_pgtbl_walk(irq_mask);
}

void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index 523d4d39d11e..e6d4ab42173b 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -1341,12 +1341,16 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea,
ea &= ~((1ul << mmu_psize_defs[psize].shift) - 1);
#endif /* CONFIG_PPC_64K_PAGES */

- /* Get PTE and page size from page tables */
+ /* Get PTE and page size from page tables :
+ * Called in from DataAccess interrupt (data_access_common: 0x300),
+ * interrupts are disabled here.
+ */
+ __begin_lockless_pgtbl_walk(false);
ptep = find_linux_pte(pgdir, ea, &is_thp, &hugeshift);
if (ptep == NULL || !pte_present(*ptep)) {
DBG_LOW(" no PTE !\n");
rc = 1;
- goto bail;
+ goto bail_pgtbl_walk;
}

/* Add _PAGE_PRESENT to the required access perm */
@@ -1359,7 +1363,7 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea,
if (!check_pte_access(access, pte_val(*ptep))) {
DBG_LOW(" no access !\n");
rc = 1;
- goto bail;
+ goto bail_pgtbl_walk;
}

if (hugeshift) {
@@ -1383,7 +1387,7 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea,
if (current->mm == mm)
check_paca_psize(ea, mm, psize, user_region);

- goto bail;
+ goto bail_pgtbl_walk;
}

#ifndef CONFIG_PPC_64K_PAGES
@@ -1457,6 +1461,8 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea,
#endif
DBG_LOW(" -> rc=%d\n", rc);

+bail_pgtbl_walk:
+ __end_lockless_pgtbl_walk(0, false);
bail:
exception_exit(prev_state);
return rc;
@@ -1545,7 +1551,7 @@ static void hash_preload(struct mm_struct *mm, unsigned long ea,
unsigned long vsid;
pgd_t *pgdir;
pte_t *ptep;
- unsigned long flags;
+ unsigned long irq_mask;
int rc, ssize, update_flags = 0;
unsigned long access = _PAGE_PRESENT | _PAGE_READ | (is_exec ? _PAGE_EXEC : 0);

@@ -1567,11 +1573,12 @@ static void hash_preload(struct mm_struct *mm, unsigned long ea,
vsid = get_user_vsid(&mm->context, ea, ssize);
if (!vsid)
return;
+
/*
* Hash doesn't like irqs. Walking linux page table with irq disabled
* saves us from holding multiple locks.
*/
- local_irq_save(flags);
+ irq_mask = begin_lockless_pgtbl_walk();

/*
* THP pages use update_mmu_cache_pmd. We don't do
@@ -1616,7 +1623,7 @@ static void hash_preload(struct mm_struct *mm, unsigned long ea,
mm_ctx_user_psize(&mm->context),
pte_val(*ptep));
out_exit:
- local_irq_restore(flags);
+ end_lockless_pgtbl_walk(irq_mask);
}

/*
@@ -1679,16 +1686,16 @@ u16 get_mm_addr_key(struct mm_struct *mm, unsigned long address)
{
pte_t *ptep;
u16 pkey = 0;
- unsigned long flags;
+ unsigned long irq_mask;

if (!mm || !mm->pgd)
return 0;

- local_irq_save(flags);
+ irq_mask = begin_lockless_pgtbl_walk();
ptep = find_linux_pte(mm->pgd, address, NULL, NULL);
if (ptep)
pkey = pte_to_pkey_bits(pte_val(READ_ONCE(*ptep)));
- local_irq_restore(flags);
+ end_lockless_pgtbl_walk(irq_mask);

return pkey;
}
--
2.24.1

2020-02-06 03:23:53

by Leonardo Bras

[permalink] [raw]
Subject: [PATCH v6 02/11] mm/gup: Use functions to track lockless pgtbl walks on gup_pgd_range

As described, gup_pgd_range is a lockless pagetable walk, so in order
to guard against concurrent THP split/collapse, it disables/enables
irqs around it.

To make use of the new tracking functions, it replaces irq disable/enable
by {begin,end}_lockless_pgtbl_walk().

As local_irq_{save,restore} is present inside
{begin,end}_lockless_pgtbl_walk, there should be no change in
behavior.

Signed-off-by: Leonardo Bras <[email protected]>
---
mm/gup.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 1b521e0ac1de..04e6f46993b6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2369,7 +2369,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages)
{
unsigned long len, end;
- unsigned long flags;
+ unsigned long irq_mask;
int nr = 0;

start = untagged_addr(start) & PAGE_MASK;
@@ -2395,9 +2395,9 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,

if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
gup_fast_permitted(start, end)) {
- local_irq_save(flags);
+ irq_mask = begin_lockless_pgtbl_walk();
gup_pgd_range(start, end, write ? FOLL_WRITE : 0, pages, &nr);
- local_irq_restore(flags);
+ end_lockless_pgtbl_walk(irq_mask);
}

return nr;
@@ -2450,9 +2450,9 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,

if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
gup_fast_permitted(start, end)) {
- local_irq_disable();
+ begin_lockless_pgtbl_walk();
gup_pgd_range(addr, end, gup_flags, pages, &nr);
- local_irq_enable();
+ end_lockless_pgtbl_walk(IRQS_ENABLED);
ret = nr;
}

--
2.24.1

2020-02-06 03:24:00

by Leonardo Bras

[permalink] [raw]
Subject: [PATCH v6 04/11] powerpc/mce_power: Use functions to track lockless pgtbl walks

Applies the new functions used for tracking lockless pgtable walks on
addr_to_pfn().

local_irq_{save,restore} is already inside {begin,end}_lockless_pgtbl_walk,
so there is no need to repeat it here.

Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/kernel/mce_power.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index 1cbf7f1a4e3d..a9e38ef4e437 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -29,7 +29,7 @@ unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
{
pte_t *ptep;
unsigned int shift;
- unsigned long pfn, flags;
+ unsigned long pfn, irq_mask;
struct mm_struct *mm;

if (user_mode(regs))
@@ -37,7 +37,7 @@ unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
else
mm = &init_mm;

- local_irq_save(flags);
+ irq_mask = begin_lockless_pgtbl_walk();
ptep = __find_linux_pte(mm->pgd, addr, NULL, &shift);

if (!ptep || pte_special(*ptep)) {
@@ -53,7 +53,7 @@ unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
}

out:
- local_irq_restore(flags);
+ end_lockless_pgtbl_walk(irq_mask);
return pfn;
}

--
2.24.1

2020-02-06 03:24:36

by Leonardo Bras

[permalink] [raw]
Subject: [PATCH v6 08/11] powerpc/kvm/book3s_hv: Use functions to track lockless pgtbl walks

Applies the new tracking functions to all book3s_hv related
functions that do lockless pagetable walks.

Adds comments explaining that some lockless pagetable walks don't need
protection due to guest pgd not being a target of THP collapse/split, or
due to being called from Realmode + MSR_EE = 0.

kvmppc_do_h_enter: Fixes where local_irq_restore() must be placed (after
the last usage of ptep).

Given that some of these functions can be called in real mode, and others
always are, we use __{begin,end}_lockless_pgtbl_walk so we can decide when
to disable interrupts.

Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/kvm/book3s_hv_nested.c | 22 ++++++++++++++++++++--
arch/powerpc/kvm/book3s_hv_rm_mmu.c | 28 ++++++++++++++++++----------
2 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
index dc97e5be76f6..a398061d5778 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -803,7 +803,11 @@ static void kvmhv_update_nest_rmap_rc(struct kvm *kvm, u64 n_rmap,
if (!gp)
return;

- /* Find the pte */
+ /* Find the pte:
+ * We are walking the nested guest (partition-scoped) page table here.
+ * We can do this without disabling irq because the Linux MM
+ * subsystem doesn't do THP splits and collapses on this tree.
+ */
ptep = __find_linux_pte(gp->shadow_pgtable, gpa, NULL, &shift);
/*
* If the pte is present and the pfn is still the same, update the pte.
@@ -853,7 +857,11 @@ static void kvmhv_remove_nest_rmap(struct kvm *kvm, u64 n_rmap,
if (!gp)
return;

- /* Find and invalidate the pte */
+ /* Find and invalidate the pte:
+ * We are walking the nested guest (partition-scoped) page table here.
+ * We can do this without disabling irq because the Linux MM
+ * subsystem doesn't do THP splits and collapses on this tree.
+ */
ptep = __find_linux_pte(gp->shadow_pgtable, gpa, NULL, &shift);
/* Don't spuriously invalidate ptes if the pfn has changed */
if (ptep && pte_present(*ptep) && ((pte_val(*ptep) & mask) == hpa))
@@ -921,6 +929,11 @@ static bool kvmhv_invalidate_shadow_pte(struct kvm_vcpu *vcpu,
int shift;

spin_lock(&kvm->mmu_lock);
+ /*
+ * We are walking the nested guest (partition-scoped) page table here.
+ * We can do this without disabling irq because the Linux MM
+ * subsystem doesn't do THP splits and collapses on this tree.
+ */
ptep = __find_linux_pte(gp->shadow_pgtable, gpa, NULL, &shift);
if (!shift)
shift = PAGE_SHIFT;
@@ -1362,6 +1375,11 @@ static long int __kvmhv_nested_page_fault(struct kvm_run *run,
/* See if can find translation in our partition scoped tables for L1 */
pte = __pte(0);
spin_lock(&kvm->mmu_lock);
+ /*
+ * We are walking the secondary (partition-scoped) page table here.
+ * We can do this without disabling irq because the Linux MM
+ * subsystem doesn't do THP splits and collapses on this tree.
+ */
pte_p = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (!shift)
shift = PAGE_SHIFT;
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 220305454c23..fd4d8f174f09 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -210,7 +210,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
pte_t *ptep;
unsigned int writing;
unsigned long mmu_seq;
- unsigned long rcbits, irq_flags = 0;
+ unsigned long rcbits, irq_mask = 0;

if (kvm_is_radix(kvm))
return H_FUNCTION;
@@ -252,8 +252,8 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
* If we had a page table table change after lookup, we would
* retry via mmu_notifier_retry.
*/
- if (!realmode)
- local_irq_save(irq_flags);
+ irq_mask = __begin_lockless_pgtbl_walk(!realmode);
+
/*
* If called in real mode we have MSR_EE = 0. Otherwise
* we disable irq above.
@@ -272,8 +272,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
* to <= host page size, if host is using hugepage
*/
if (host_pte_size < psize) {
- if (!realmode)
- local_irq_restore(flags);
+ __end_lockless_pgtbl_walk(irq_mask, !realmode);
return H_PARAMETER;
}
pte = kvmppc_read_update_linux_pte(ptep, writing);
@@ -287,8 +286,6 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
pa |= gpa & ~PAGE_MASK;
}
}
- if (!realmode)
- local_irq_restore(irq_flags);

ptel &= HPTE_R_KEY | HPTE_R_PP0 | (psize-1);
ptel |= pa;
@@ -302,8 +299,10 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,

/*If we had host pte mapping then Check WIMG */
if (ptep && !hpte_cache_flags_ok(ptel, is_ci)) {
- if (is_ci)
+ if (is_ci) {
+ __end_lockless_pgtbl_walk(irq_mask, !realmode);
return H_PARAMETER;
+ }
/*
* Allow guest to map emulated device memory as
* uncacheable, but actually make it cacheable.
@@ -311,6 +310,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
ptel &= ~(HPTE_R_W|HPTE_R_I|HPTE_R_G);
ptel |= HPTE_R_M;
}
+ __end_lockless_pgtbl_walk(irq_mask, !realmode);

/* Find and lock the HPTEG slot to use */
do_insert:
@@ -907,11 +907,19 @@ static int kvmppc_get_hpa(struct kvm_vcpu *vcpu, unsigned long gpa,
/* Translate to host virtual address */
hva = __gfn_to_hva_memslot(memslot, gfn);

- /* Try to find the host pte for that virtual address */
+ /* Try to find the host pte for that virtual address :
+ * Called by hcall_real_table (real mode + MSR_EE=0)
+ * Interrupts are disabled here.
+ */
+ __begin_lockless_pgtbl_walk(false);
ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift);
- if (!ptep)
+ if (!ptep) {
+ __end_lockless_pgtbl_walk(0, false);
return H_TOO_HARD;
+ }
pte = kvmppc_read_update_linux_pte(ptep, writing);
+ __end_lockless_pgtbl_walk(0, false);
+
if (!pte_present(pte))
return H_TOO_HARD;

--
2.24.1

2020-02-06 03:25:12

by Leonardo Bras

[permalink] [raw]
Subject: [PATCH v6 10/11] powerpc/mm: Adds counting method to track lockless pagetable walks

Implements an additional feature to track lockless pagetable walks,
using a per-cpu counter: lockless_pgtbl_walk_counter.

Before a lockless pagetable walk, preemption is disabled and the
current cpu's counter is increased.
When the lockless pagetable walk finishes, the current cpu's counter
is decreased and preemption is enabled.

With that, it's possible to know on which cpus lockless pagetable walks
are currently happening, and to optimize serialize_against_pte_lookup().

Implementation notes:
- Every counter can be changed only by its own CPU
- It makes use of the original memory barrier in the functions
- Any counter can be read by any CPU

Since it takes no locks and uses no atomic variables, the impact on the
lockless pagetable walk is intended to be minimal.
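
Any cpu can then take a racy but safe peek at another cpu's counter; a
minimal sketch of the consumer side (the real check comes in patch 11):

	static bool lockless_pgtbl_walk_active(int cpu)
	{
		return per_cpu(lockless_pgtbl_walk_counter, cpu) != 0;
	}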

Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/mm/book3s64/pgtable.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 535613030363..bb138b628f86 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -83,6 +83,7 @@ static void do_nothing(void *unused)

}

+static DEFINE_PER_CPU(int, lockless_pgtbl_walk_counter);
/*
* Serialize against find_current_mm_pte which does lock-less
* lookup in page tables with local interrupts disabled. For huge pages
@@ -120,6 +121,15 @@ unsigned long __begin_lockless_pgtbl_walk(bool disable_irq)
if (disable_irq)
local_irq_save(irq_mask);

+ /*
+ * Counts this instance of lockless pagetable walk for this cpu.
+ * Disables preempt to make sure there is no cpu change between
+ * begin/end lockless pagetable walk, so that percpu counting
+ * works fine.
+ */
+ preempt_disable();
+ (*this_cpu_ptr(&lockless_pgtbl_walk_counter))++;
+
/*
* This memory barrier pairs with any code that is either trying to
* delete page tables, or split huge pages. Without this barrier,
@@ -158,6 +168,14 @@ inline void __end_lockless_pgtbl_walk(unsigned long irq_mask, bool enable_irq)
*/
smp_mb();

+ /*
+ * Removes this instance of lockless pagetable walk for this cpu.
+ * Enables preempt only after end lockless pagetable walk,
+ * so that percpu counting works fine.
+ */
+ (*this_cpu_ptr(&lockless_pgtbl_walk_counter))--;
+ preempt_enable();
+
/*
* Interrupts must be disabled during the lockless page table walk.
* That's because the deleting or splitting involves flushing TLBs,
--
2.24.1

2020-02-06 03:28:39

by Leonardo Bras

[permalink] [raw]
Subject: Re: [PATCH v6 02/11] mm/gup: Use functions to track lockless pgtbl walks on gup_pgd_range

On Thu, 2020-02-06 at 00:08 -0300, Leonardo Bras wrote:
> gup_pgd_range(addr, end, gup_flags, pages, &nr);
> - local_irq_enable();
> + end_lockless_pgtbl_walk(IRQS_ENABLED);
> ret = nr;
> }
>

Just noticed IRQS_ENABLED is not available on archs other than ppc64.
I will fix this for v7.
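
One portable way to fix it (not necessarily what v7 will do) is to
simply keep the returned mask, even though irqs are known to be enabled
at this point:

	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
	    gup_fast_permitted(start, end)) {
		unsigned long irq_mask = begin_lockless_pgtbl_walk();

		gup_pgd_range(addr, end, gup_flags, pages, &nr);
		end_lockless_pgtbl_walk(irq_mask);
		ret = nr;
	}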



2020-02-06 03:39:19

by Leonardo Bras

[permalink] [raw]
Subject: [PATCH v6 05/11] powerpc/perf: Use functions to track lockless pgtbl walks

Applies the new functions used for tracking lockless pgtable walks on
read_user_stack_slow.

local_irq_{save,restore} is already inside {begin,end}_lockless_pgtbl_walk,
so there is no need to repeat it here.

The variable that saves the irq mask was renamed from flags to irq_mask so it
doesn't lose meaning now that it's not directly passed to local_irq_* functions.

Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/perf/callchain.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/perf/callchain.c b/arch/powerpc/perf/callchain.c
index cbc251981209..fd9979e69f36 100644
--- a/arch/powerpc/perf/callchain.c
+++ b/arch/powerpc/perf/callchain.c
@@ -116,14 +116,14 @@ static int read_user_stack_slow(void __user *ptr, void *buf, int nb)
unsigned shift;
unsigned long addr = (unsigned long) ptr;
unsigned long offset;
- unsigned long pfn, flags;
+ unsigned long pfn, irq_mask;
void *kaddr;

pgdir = current->mm->pgd;
if (!pgdir)
return -EFAULT;

- local_irq_save(flags);
+ irq_mask = begin_lockless_pgtbl_walk();
ptep = find_current_mm_pte(pgdir, addr, NULL, &shift);
if (!ptep)
goto err_out;
@@ -145,7 +145,7 @@ static int read_user_stack_slow(void __user *ptr, void *buf, int nb)
memcpy(buf, kaddr + offset, nb);
ret = 0;
err_out:
- local_irq_restore(flags);
+ end_lockless_pgtbl_walk(irq_mask);
return ret;
}

--
2.24.1

2020-02-06 03:39:32

by Leonardo Bras

[permalink] [raw]
Subject: [PATCH v6 07/11] powerpc/kvm/e500: Use functions to track lockless pgtbl walks

Applies the new functions used for tracking lockless pgtable walks on
kvmppc_e500_shadow_map().

Fixes the place where local_irq_restore() is called: previously, if ptep
was NULL, local_irq_restore() would never be called.

local_irq_{save,restore} is already inside {begin,end}_lockless_pgtbl_walk,
so there is no need to repeat it here.

The variable that saves the irq mask was renamed from flags to irq_mask so it
doesn't lose meaning now that it's not directly passed to local_irq_* functions.

Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/kvm/e500_mmu_host.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index 425d13806645..3dcf11f77256 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -336,7 +336,7 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
pte_t *ptep;
unsigned int wimg = 0;
pgd_t *pgdir;
- unsigned long flags;
+ unsigned long irq_mask;

/* used to check for invalidations in progress */
mmu_seq = kvm->mmu_notifier_seq;
@@ -473,7 +473,7 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
* We are holding kvm->mmu_lock so a notifier invalidate
* can't run hence pfn won't change.
*/
- local_irq_save(flags);
+ irq_mask = begin_lockless_pgtbl_walk();
ptep = find_linux_pte(pgdir, hva, NULL, NULL);
if (ptep) {
pte_t pte = READ_ONCE(*ptep);
@@ -481,15 +481,16 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
if (pte_present(pte)) {
wimg = (pte_val(pte) >> PTE_WIMGE_SHIFT) &
MAS2_WIMGE_MASK;
- local_irq_restore(flags);
} else {
- local_irq_restore(flags);
+ end_lockless_pgtbl_walk(irq_mask);
pr_err_ratelimited("%s: pte not present: gfn %lx,pfn %lx\n",
__func__, (long)gfn, pfn);
ret = -EINVAL;
goto out;
}
}
+ end_lockless_pgtbl_walk(irq_mask);
+
kvmppc_e500_ref_setup(ref, gtlbe, pfn, wimg);

kvmppc_e500_setup_stlbe(&vcpu_e500->vcpu, gtlbe, tsize,
--
2.24.1

2020-02-06 03:40:40

by Leonardo Bras

[permalink] [raw]
Subject: [PATCH v6 09/11] powerpc/kvm/book3s_64: Use functions to track lockless pgtbl walks

Applies the new tracking functions to all book3s_64 related
functions that do lockless pagetable walks.

Adds comments explaining that some lockless pagetable walks don't need
protection due to guest pgd not being a target of THP collapse/split, or
due to being called from Realmode + MSR_EE = 0.

Given that some of these functions are always called in realmode, we use
__{begin,end}_lockless_pgtbl_walk so we can decide when to disable
interrupts.

local_irq_{save,restore} is already inside {begin,end}_lockless_pgtbl_walk,
so there is no need to repeat it here.

The variable that saves the irq mask was renamed from flags to irq_mask so it
doesn't lose meaning now that it's not directly passed to local_irq_* functions.

There is also a function that uses local_irq_{en,dis}able, so the return
value of begin_lockless_pgtbl_walk() is ignored and we pass IRQS_ENABLED to
end_lockless_pgtbl_walk() to mimic the effect of local_irq_enable().

Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/kvm/book3s_64_mmu_hv.c | 6 ++---
arch/powerpc/kvm/book3s_64_mmu_radix.c | 34 +++++++++++++++++++++++---
arch/powerpc/kvm/book3s_64_vio_hv.c | 6 ++++-
3 files changed, 39 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 6c372f5c61b6..e7ce29a5df60 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -605,19 +605,19 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
/* if the guest wants write access, see if that is OK */
if (!writing && hpte_is_writable(r)) {
pte_t *ptep, pte;
- unsigned long flags;
+ unsigned long irq_mask;
/*
* We need to protect against page table destruction
* hugepage split and collapse.
*/
- local_irq_save(flags);
+ irq_mask = begin_lockless_pgtbl_walk();
ptep = find_current_mm_pte(mm->pgd, hva, NULL, NULL);
if (ptep) {
pte = kvmppc_read_update_linux_pte(ptep, 1);
if (__pte_write(pte))
write_ok = 1;
}
- local_irq_restore(flags);
+ end_lockless_pgtbl_walk(irq_mask);
}
}

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 803940d79b73..cda2e455baf2 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -813,20 +813,20 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
* Read the PTE from the process' radix tree and use that
* so we get the shift and attribute bits.
*/
- local_irq_disable();
+ begin_lockless_pgtbl_walk();
ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift);
/*
* If the PTE disappeared temporarily due to a THP
* collapse, just return and let the guest try again.
*/
if (!ptep) {
- local_irq_enable();
+ end_lockless_pgtbl_walk(IRQS_ENABLED);
if (page)
put_page(page);
return RESUME_GUEST;
}
pte = *ptep;
- local_irq_enable();
+ end_lockless_pgtbl_walk(IRQS_ENABLED);

/* If we're logging dirty pages, always map single pages */
large_enable = !(memslot->flags & KVM_MEM_LOG_DIRTY_PAGES);
@@ -980,10 +980,16 @@ int kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
return 0;
}

+ /*
+ * We are walking the secondary (partition-scoped) page table here.
+ * We can do this without disabling irq because the Linux MM
+ * subsystem doesn't do THP splits and collapses on this tree.
+ */
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep))
kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot,
kvm->arch.lpid);
+
return 0;
}

@@ -1000,6 +1006,11 @@ int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)
return ref;

+ /*
+ * We are walking the secondary (partition-scoped) page table here.
+ * We can do this without disabling irq because the Linux MM
+ * subsystem doesn't do THP splits and collapses on this tree.
+ */
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep) && pte_young(*ptep)) {
old = kvmppc_radix_update_pte(kvm, ptep, _PAGE_ACCESSED, 0,
@@ -1027,6 +1038,11 @@ int kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)
return ref;

+ /*
+ * We are walking the secondary (partition-scoped) page table here.
+ * We can do this without disabling irq because the Linux MM
+ * subsystem doesn't do THP splits and collapses on this tree.
+ */
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep) && pte_young(*ptep))
ref = 1;
@@ -1047,6 +1063,11 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)
return ret;

+ /*
+ * We are walking the secondary (partition-scoped) page table here.
+ * We can do this without disabling irq because the Linux MM
+ * subsystem doesn't do THP splits and collapses on this tree.
+ */
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep) && pte_dirty(*ptep)) {
ret = 1;
@@ -1063,6 +1084,7 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
1UL << shift);
spin_unlock(&kvm->mmu_lock);
}
+
return ret;
}

@@ -1108,6 +1130,12 @@ void kvmppc_radix_flush_memslot(struct kvm *kvm,
gpa = memslot->base_gfn << PAGE_SHIFT;
spin_lock(&kvm->mmu_lock);
for (n = memslot->npages; n; --n) {
+ /*
+ * We are walking the secondary (partition-scoped) page table
+ * here.
+ * We can do this without disabling irq because the Linux MM
+ * subsystem doesn't do THP splits and collapses on this tree.
+ */
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep))
kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot,
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index ab6eeb8e753e..83c70c1557e4 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -453,10 +453,14 @@ static long kvmppc_rm_ua_to_hpa(struct kvm_vcpu *vcpu,
* to exit which will agains result in the below page table walk
* to finish.
*/
+ __begin_lockless_pgtbl_walk(false);
ptep = __find_linux_pte(vcpu->arch.pgdir, ua, NULL, &shift);
- if (!ptep || !pte_present(*ptep))
+ if (!ptep || !pte_present(*ptep)) {
+ __end_lockless_pgtbl_walk(0, false);
return -ENXIO;
+ }
pte = *ptep;
+ __end_lockless_pgtbl_walk(0, false);

if (!shift)
shift = PAGE_SHIFT;
--
2.24.1

2020-02-06 05:49:42

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH v6 04/11] powerpc/mce_power: Use functions to track lockless pgtbl walks



On 06/02/2020 at 04:08, Leonardo Bras wrote:
> Applies the new functions used for tracking lockless pgtable walks on
> addr_to_pfn().
>
> local_irq_{save,restore} is already inside {begin,end}_lockless_pgtbl_walk,
> so there is no need to repeat it here.
>
> Signed-off-by: Leonardo Bras <[email protected]>
> ---
> arch/powerpc/kernel/mce_power.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
> index 1cbf7f1a4e3d..a9e38ef4e437 100644
> --- a/arch/powerpc/kernel/mce_power.c
> +++ b/arch/powerpc/kernel/mce_power.c
> @@ -29,7 +29,7 @@ unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
> {
> pte_t *ptep;
> unsigned int shift;
> - unsigned long pfn, flags;
> + unsigned long pfn, irq_mask;

Why change the name? flags is a well-known historical name.

> struct mm_struct *mm;
>
> if (user_mode(regs))
> @@ -37,7 +37,7 @@ unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
> else
> mm = &init_mm;
>
> - local_irq_save(flags);
> + irq_mask = begin_lockless_pgtbl_walk();
> ptep = __find_linux_pte(mm->pgd, addr, NULL, &shift);
>
> if (!ptep || pte_special(*ptep)) {
> @@ -53,7 +53,7 @@ unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
> }
>
> out:
> - local_irq_restore(flags);
> + end_lockless_pgtbl_walk(irq_mask);
> return pfn;
> }
>
>

Christophe

2020-02-06 05:52:35

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH v6 03/11] powerpc/mm: Adds arch-specific functions to track lockless pgtable walks



On 06/02/2020 at 04:08, Leonardo Bras wrote:
> On powerpc, we need to do some lockless pagetable walks from functions
> that already have interrupts disabled, especially from real mode with
> MSR[EE=0].
>
> In these contexts, disabling/enabling interrupts can be very troublesome.

When interrupts are already disabled, the flags value returned when
disabling them will be such that when we restore it later, interrupts
remain disabled, so what's the problem?

>
> So, this arch-specific implementation features functions with an extra
> argument that allows interrupt enable/disable to be skipped:
> __begin_lockless_pgtbl_walk() and __end_lockless_pgtbl_walk().
>
> Functions similar to the generic ones are also exported, by calling
> the above functions with parameter {en,dis}able_irq = true.
>
> Signed-off-by: Leonardo Bras <[email protected]>
> ---
> arch/powerpc/include/asm/book3s/64/pgtable.h | 6 ++
> arch/powerpc/mm/book3s64/pgtable.c | 86 +++++++++++++++++++-
> 2 files changed, 91 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index 201a69e6a355..78f6ffb1bb3e 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -1375,5 +1375,11 @@ static inline bool pgd_is_leaf(pgd_t pgd)
> return !!(pgd_raw(pgd) & cpu_to_be64(_PAGE_PTE));
> }
>
> +#define __HAVE_ARCH_LOCKLESS_PGTBL_WALK_CONTROL
> +unsigned long begin_lockless_pgtbl_walk(void);
> +unsigned long __begin_lockless_pgtbl_walk(bool disable_irq);
> +void end_lockless_pgtbl_walk(unsigned long irq_mask);
> +void __end_lockless_pgtbl_walk(unsigned long irq_mask, bool enable_irq);
> +

Why not make them static inline just like the generic ones?

> #endif /* __ASSEMBLY__ */
> #endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */
> diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
> index 2bf7e1b4fd82..535613030363 100644
> --- a/arch/powerpc/mm/book3s64/pgtable.c
> +++ b/arch/powerpc/mm/book3s64/pgtable.c
> @@ -82,6 +82,7 @@ static void do_nothing(void *unused)
> {
>
> }
> +

Is this blank line related to the patch?

> /*
> * Serialize against find_current_mm_pte which does lock-less
> * lookup in page tables with local interrupts disabled. For huge pages
> @@ -98,6 +99,89 @@ void serialize_against_pte_lookup(struct mm_struct *mm)
> smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
> }
>
> +/* __begin_lockless_pgtbl_walk: Must be inserted before a function call that does
> + * lockless pagetable walks, such as __find_linux_pte().
> + * This version allows setting disable_irq=false, so irqs are not touched, which
> + * is quite useful for running when ints are already disabled (like real-mode)
> + */
> +inline
> +unsigned long __begin_lockless_pgtbl_walk(bool disable_irq)
> +{
> + unsigned long irq_mask = 0;
> +
> + /*
> + * Interrupts must be disabled during the lockless page table walk.
> + * That's because the deleting or splitting involves flushing TLBs,
> + * which in turn issues interrupts, that will block when disabled.
> + *
> + * When this function is called from realmode with MSR[EE=0],
> + * it's not needed to touch irq, since it's already disabled.
> + */
> + if (disable_irq)
> + local_irq_save(irq_mask);
> +
> + /*
> + * This memory barrier pairs with any code that is either trying to
> + * delete page tables, or split huge pages. Without this barrier,
> + * the page tables could be read speculatively outside of interrupt
> + * disabling or reference counting.
> + */
> + smp_mb();
> +
> + return irq_mask;
> +}
> +EXPORT_SYMBOL(__begin_lockless_pgtbl_walk);
> +
> +/* begin_lockless_pgtbl_walk: Must be inserted before a function call that does
> + * lockless pagetable walks, such as __find_linux_pte().
> + * This version is used by generic code, and always assume irqs will be disabled
> + */
> +unsigned long begin_lockless_pgtbl_walk(void)
> +{
> + return __begin_lockless_pgtbl_walk(true);
> +}
> +EXPORT_SYMBOL(begin_lockless_pgtbl_walk);

Even more than begin_lockless_pgtbl_walk(), this one is worth being
static inline in the H file.

> +
> +/*
> + * __end_lockless_pgtbl_walk: Must be inserted after the last use of a pointer
> + * returned by a lockless pagetable walk, such as __find_linux_pte()
> + * This version allows setting enable_irq=false, so irqs are not touched, which
> + * is quite useful for running when ints are already disabled (like real-mode)
> + */
> +inline void __end_lockless_pgtbl_walk(unsigned long irq_mask, bool enable_irq)
> +{
> + /*
> + * This memory barrier pairs with any code that is either trying to
> + * delete page tables, or split huge pages. Without this barrier,
> + * the page tables could be read speculatively outside of interrupt
> + * disabling or reference counting.
> + */
> + smp_mb();
> +
> + /*
> + * Interrupts must be disabled during the lockless page table walk.
> + * That's because the deleting or splitting involves flushing TLBs,
> + * which in turn issues interrupts, that will block when disabled.
> + *
> + * When this function is called from realmode with MSR[EE=0],
> + * it's not needed to touch irq, since it's already disabled.
> + */
> + if (enable_irq)
> + local_irq_restore(irq_mask);
> +}
> +EXPORT_SYMBOL(__end_lockless_pgtbl_walk);
> +
> +/*
> + * end_lockless_pgtbl_walk: Must be inserted after the last use of a pointer
> + * returned by a lockless pagetable walk, such as __find_linux_pte()
> + * This version is used by generic code, and always assume irqs will be enabled
> + */
> +void end_lockless_pgtbl_walk(unsigned long irq_mask)
> +{
> + __end_lockless_pgtbl_walk(irq_mask, true);
> +}
> +EXPORT_SYMBOL(end_lockless_pgtbl_walk);
> +
> /*
> * We use this to invalidate a pmdp entry before switching from a
> * hugepte to regular pmd entry.
> @@ -487,7 +571,7 @@ static int __init setup_disable_tlbie(char *str)
> tlbie_capable = false;
> tlbie_enabled = false;
>
> - return 1;
> + return 1;

Is that related to this patch at all?

> }
> __setup("disable_tlbie", setup_disable_tlbie);
>
>

Christophe

2020-02-06 06:09:00

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH v6 01/11] asm-generic/pgtable: Adds generic functions to track lockless pgtable walks



On 06/02/2020 at 04:08, Leonardo Bras wrote:
> It's necessary to track lockless pagetable walks, in order to avoid doing
> THP splitting/collapsing during them.
>
> The default solution is to disable irq before lockless pagetable walks and
> enable it after it's finished.
>
> In code, this means you can find local_irq_disable() and local_irq_enable()
> around some pieces of code, usually without comments explaining why they
> are needed.
>
> This patch proposes a set of generic functions to be called before starting
> and after finishing a lockless pagetable walk. It is supposed to make clear
> that a lockless pagetable walk happens there, and also carries information
> on why the irq disable/enable is needed.
>
> begin_lockless_pgtbl_walk()
> Insert before starting any lockless pgtable walk
> end_lockless_pgtbl_walk()
> Insert after the end of any lockless pgtable walk
> (Mostly after the ptep is last used)
>
> A memory barrier was also added just to make sure there is no speculative
> read outside the interrupt disabled area. Other than that, it is not
> supposed to change the behavior of the current code.

Is that speculative barrier necessary for all architectures? Does it
impact performance? Shouldn't this be another patch?

>
> It is planned to allow arch-specific versions, so that additional steps can
> be added while keeping the code clean.
>
> Signed-off-by: Leonardo Bras <[email protected]>
> ---
> include/asm-generic/pgtable.h | 51 +++++++++++++++++++++++++++++++++++
> 1 file changed, 51 insertions(+)
>
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index e2e2bef07dd2..8d368d3c0974 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -1222,6 +1222,57 @@ static inline bool arch_has_pfn_modify_check(void)
> #endif
> #endif
>
> +#ifndef __HAVE_ARCH_LOCKLESS_PGTBL_WALK_CONTROL
> +/*
> + * begin_lockless_pgtbl_walk: Must be inserted before a function call that does
> + * lockless pagetable walks, such as __find_linux_pte()
> + */
> +static inline
> +unsigned long begin_lockless_pgtbl_walk(void)

What about keeping the same syntax as local_irq_save(), something like:

#define begin_lockless_pgtbl_walk(flags) \
	do { \
		local_irq_save(flags); \
		smp_mb(); \
	} while (0)

> +{
> + unsigned long irq_mask;
> +
> + /*
> + * Interrupts must be disabled during the lockless page table walk.
> + * That's because the deleting or splitting involves flushing TLBs,
> + * which in turn issues interrupts, that will block when disabled.
> + */
> + local_irq_save(irq_mask);
> +
> + /*
> + * This memory barrier pairs with any code that is either trying to
> + * delete page tables, or split huge pages. Without this barrier,
> + * the page tables could be read speculatively outside of interrupt
> + * disabling.
> + */
> + smp_mb();
> +
> + return irq_mask;
> +}
> +
> +/*
> + * end_lockless_pgtbl_walk: Must be inserted after the last use of a pointer
> + * returned by a lockless pagetable walk, such as __find_linux_pte()
> + */
> +static inline void end_lockless_pgtbl_walk(unsigned long irq_mask)

Same

#define end_lockless_pgtbl_walk(flags) \
	do { \
		smp_mb(); \
		local_irq_restore(flags); \
	} while (0)

> +{
> + /*
> + * This memory barrier pairs with any code that is either trying to
> + * delete page tables, or split huge pages. Without this barrier,
> + * the page tables could be read speculatively outside of interrupt
> + * disabling.
> + */
> + smp_mb();
> +
> + /*
> + * Interrupts must be disabled during the lockless page table walk.
> + * That's because the deleting or splitting involves flushing TLBs,
> + * which in turn issues interrupts, that will block when disabled.
> + */
> + local_irq_restore(irq_mask);
> +}
> +#endif
> +
> /*
> * On some architectures it depends on the mm if the p4d/pud or pmd
> * layer of the page table hierarchy is folded or not.
>

Christophe

2020-02-06 06:09:45

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH v6 06/11] powerpc/mm/book3s64/hash: Use functions to track lockless pgtbl walks



On 06/02/2020 at 04:08, Leonardo Bras wrote:
> Applies the new tracking functions to all hash-related functions that do
> lockless pagetable walks.
>
> hash_page_mm: Adds a comment explaining that there is no need for
> local_irq_disable/save given that it is only called from the DataAccess
> interrupt, so interrupts are already disabled.
>
> local_irq_{save,restore} is already inside {begin,end}_lockless_pgtbl_walk,
> so there is no need to repeat it here.
>
> The variable that saves the irq mask was renamed from flags to irq_mask so it
> doesn't lose meaning now that it's not directly passed to local_irq_* functions.
>
> Signed-off-by: Leonardo Bras <[email protected]>
> ---
> arch/powerpc/mm/book3s64/hash_tlb.c | 6 +++---
> arch/powerpc/mm/book3s64/hash_utils.c | 27 +++++++++++++++++----------
> 2 files changed, 20 insertions(+), 13 deletions(-)
>
> diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c b/arch/powerpc/mm/book3s64/hash_tlb.c
> index 4a70d8dd39cd..86547c4151f6 100644
> --- a/arch/powerpc/mm/book3s64/hash_tlb.c
> +++ b/arch/powerpc/mm/book3s64/hash_tlb.c
> @@ -194,7 +194,7 @@ void __flush_hash_table_range(struct mm_struct *mm, unsigned long start,
> {
> bool is_thp;
> int hugepage_shift;
> - unsigned long flags;
> + unsigned long irq_mask;
>
> start = _ALIGN_DOWN(start, PAGE_SIZE);
> end = _ALIGN_UP(end, PAGE_SIZE);
> @@ -209,7 +209,7 @@ void __flush_hash_table_range(struct mm_struct *mm, unsigned long start,
> * to being hashed). This is not the most performance oriented
> * way to do things but is fine for our needs here.
> */
> - local_irq_save(flags);
> + irq_mask = begin_lockless_pgtbl_walk();
> arch_enter_lazy_mmu_mode();
> for (; start < end; start += PAGE_SIZE) {
> pte_t *ptep = find_current_mm_pte(mm->pgd, start, &is_thp,
> @@ -229,7 +229,7 @@ void __flush_hash_table_range(struct mm_struct *mm, unsigned long start,
> hpte_need_flush(mm, start, ptep, pte, hugepage_shift);
> }
> arch_leave_lazy_mmu_mode();
> - local_irq_restore(flags);
> + end_lockless_pgtbl_walk(irq_mask);
> }
>
> void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
> diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
> index 523d4d39d11e..e6d4ab42173b 100644
> --- a/arch/powerpc/mm/book3s64/hash_utils.c
> +++ b/arch/powerpc/mm/book3s64/hash_utils.c
> @@ -1341,12 +1341,16 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea,
> ea &= ~((1ul << mmu_psize_defs[psize].shift) - 1);
> #endif /* CONFIG_PPC_64K_PAGES */
>
> - /* Get PTE and page size from page tables */
> + /* Get PTE and page size from page tables :
> + * Called in from DataAccess interrupt (data_access_common: 0x300),
> + * interrupts are disabled here.
> + */

Comment formatting is not in line with Linux kernel rules: there should
be no text on the first /* line.
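
For reference, the same comment in kernel style:

	/*
	 * Get PTE and page size from page tables.
	 * Called from the DataAccess interrupt (data_access_common: 0x300),
	 * so interrupts are already disabled here.
	 */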

> + __begin_lockless_pgtbl_walk(false);

I think it would be better not to use __begin_lockless_pgtbl_walk()
directly, but to keep it in a single place and define something like
begin_lockless_pgtbl_walk_noirq(), similar to begin_lockless_pgtbl_walk().

> ptep = find_linux_pte(pgdir, ea, &is_thp, &hugeshift);
> if (ptep == NULL || !pte_present(*ptep)) {
> DBG_LOW(" no PTE !\n");
> rc = 1;
> - goto bail;
> + goto bail_pgtbl_walk;

What's the point in changing the name of this label? There is only one
label, why pollute the function with such a huge name?

For me, everyone understands what 'bail' means. Unnecessary changes
should be avoided. If you really really want to do it, it should be
another patch.

See the kernel coding style, chapter 'Naming':
"LOCAL variable names should be short". This also applies to labels.

"C is a Spartan language, and so should your naming be. Unlike Modula-2
and Pascal programmers, C programmers do not use cute names like
ThisVariableIsATemporaryCounter. A C programmer would call that variable
tmp, which is much easier to write, and not the least more difficult to
understand."

> }
>
> /* Add _PAGE_PRESENT to the required access perm */
> @@ -1359,7 +1363,7 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea,
> if (!check_pte_access(access, pte_val(*ptep))) {
> DBG_LOW(" no access !\n");
> rc = 1;
> - goto bail;
> + goto bail_pgtbl_walk;
> }
>
> if (hugeshift) {
> @@ -1383,7 +1387,7 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea,
> if (current->mm == mm)
> check_paca_psize(ea, mm, psize, user_region);
>
> - goto bail;
> + goto bail_pgtbl_walk;
> }
>
> #ifndef CONFIG_PPC_64K_PAGES
> @@ -1457,6 +1461,8 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea,
> #endif
> DBG_LOW(" -> rc=%d\n", rc);
>
> +bail_pgtbl_walk:
> + __end_lockless_pgtbl_walk(0, false);
> bail:
> exception_exit(prev_state);
> return rc;
> @@ -1545,7 +1551,7 @@ static void hash_preload(struct mm_struct *mm, unsigned long ea,
> unsigned long vsid;
> pgd_t *pgdir;
> pte_t *ptep;
> - unsigned long flags;
> + unsigned long irq_mask;
> int rc, ssize, update_flags = 0;
> unsigned long access = _PAGE_PRESENT | _PAGE_READ | (is_exec ? _PAGE_EXEC : 0);
>
> @@ -1567,11 +1573,12 @@ static void hash_preload(struct mm_struct *mm, unsigned long ea,
> vsid = get_user_vsid(&mm->context, ea, ssize);
> if (!vsid)
> return;
> +

Is this new line related to the patch?

> /*
> * Hash doesn't like irqs. Walking linux page table with irq disabled
> * saves us from holding multiple locks.
> */
> - local_irq_save(flags);
> + irq_mask = begin_lockless_pgtbl_walk();
>
> /*
> * THP pages use update_mmu_cache_pmd. We don't do
> @@ -1616,7 +1623,7 @@ static void hash_preload(struct mm_struct *mm, unsigned long ea,
> mm_ctx_user_psize(&mm->context),
> pte_val(*ptep));
> out_exit:
> - local_irq_restore(flags);
> + end_lockless_pgtbl_walk(irq_mask);
> }
>
> /*
> @@ -1679,16 +1686,16 @@ u16 get_mm_addr_key(struct mm_struct *mm, unsigned long address)
> {
> pte_t *ptep;
> u16 pkey = 0;
> - unsigned long flags;
> + unsigned long irq_mask;
>
> if (!mm || !mm->pgd)
> return 0;
>
> - local_irq_save(flags);
> + irq_mask = begin_lockless_pgtbl_walk();
> ptep = find_linux_pte(mm->pgd, address, NULL, NULL);
> if (ptep)
> pkey = pte_to_pkey_bits(pte_val(READ_ONCE(*ptep)));
> - local_irq_restore(flags);
> + end_lockless_pgtbl_walk(irq_mask);
>
> return pkey;
> }
>

Christophe

2020-02-06 06:19:14

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH v6 07/11] powerpc/kvm/e500: Use functions to track lockless pgtbl walks



On 06/02/2020 at 04:08, Leonardo Bras wrote:
> Applies the new functions used for tracking lockless pgtable walks on
> kvmppc_e500_shadow_map().
>
> Fixes the place where local_irq_restore() is called: previously, if ptep
> was NULL, local_irq_restore() would never be called.
>
> local_irq_{save,restore} is already inside {begin,end}_lockless_pgtbl_walk,
> so there is no need to repeat it here.
>
> The variable that saves the irq mask was renamed from flags to irq_mask so
> it doesn't lose meaning now that it's not directly passed to local_irq_*
> functions.
>
> Signed-off-by: Leonardo Bras <[email protected]>
> ---
> arch/powerpc/kvm/e500_mmu_host.c | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
> index 425d13806645..3dcf11f77256 100644
> --- a/arch/powerpc/kvm/e500_mmu_host.c
> +++ b/arch/powerpc/kvm/e500_mmu_host.c
> @@ -336,7 +336,7 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
> pte_t *ptep;
> unsigned int wimg = 0;
> pgd_t *pgdir;
> - unsigned long flags;
> + unsigned long irq_mask;
>
> /* used to check for invalidations in progress */
> mmu_seq = kvm->mmu_notifier_seq;
> @@ -473,7 +473,7 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
> * We are holding kvm->mmu_lock so a notifier invalidate
> * can't run hence pfn won't change.
> */
> - local_irq_save(flags);
> + irq_mask = begin_lockless_pgtbl_walk();
> ptep = find_linux_pte(pgdir, hva, NULL, NULL);
> if (ptep) {
> pte_t pte = READ_ONCE(*ptep);
> @@ -481,15 +481,16 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
> if (pte_present(pte)) {
> wimg = (pte_val(pte) >> PTE_WIMGE_SHIFT) &
> MAS2_WIMGE_MASK;
> - local_irq_restore(flags);
> } else {
> - local_irq_restore(flags);
> + end_lockless_pgtbl_walk(irq_mask);
> pr_err_ratelimited("%s: pte not present: gfn %lx,pfn %lx\n",
> __func__, (long)gfn, pfn);
> ret = -EINVAL;
> goto out;
> }
> }
> + end_lockless_pgtbl_walk(irq_mask);
> +

I don't really like unbalanced begin/end.

Something like the following would be cleaner:


	begin_lockless_pgtbl_walk();
	ptep = find();
	if (ptep) {
		pte = READ_ONCE();
		if (pte_present(pte))
			wimg = ...;
		else
			ret = -EINVAL;
	}
	end_lockless_pgtbl_walk();

	if (ret) {
		pr_err_rate...();
		goto out;
	}
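
Fleshed out with the identifiers from the quoted patch (a sketch;
assumes ret is still 0 when this block is reached):

	irq_mask = begin_lockless_pgtbl_walk();
	ptep = find_linux_pte(pgdir, hva, NULL, NULL);
	if (ptep) {
		pte_t pte = READ_ONCE(*ptep);

		if (pte_present(pte))
			wimg = (pte_val(pte) >> PTE_WIMGE_SHIFT) &
			       MAS2_WIMGE_MASK;
		else
			ret = -EINVAL;
	}
	end_lockless_pgtbl_walk(irq_mask);

	if (ret) {
		pr_err_ratelimited("%s: pte not present: gfn %lx,pfn %lx\n",
				   __func__, (long)gfn, pfn);
		goto out;
	}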



> kvmppc_e500_ref_setup(ref, gtlbe, pfn, wimg);
>
> kvmppc_e500_setup_stlbe(&vcpu_e500->vcpu, gtlbe, tsize,
>

Christophe

2020-02-06 06:25:32

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH v6 10/11] powerpc/mm: Adds counting method to track lockless pagetable walks



On 06/02/2020 at 04:08, Leonardo Bras wrote:
> Implements an additional feature to track lockless pagetable walks,
> using a per-cpu counter: lockless_pgtbl_walk_counter.
>
> Before a lockless pagetable walk, preemption is disabled and the
> current cpu's counter is increased.
> When the lockless pagetable walk finishes, the current cpu counter
> is decreased and the preemption is enabled.
>
> With that, it's possible to know on which cpus lockless pagetable walks
> are happening, and to optimize serialize_against_pte_lookup().
>
> Implementation notes:
> - Every counter can be changed only by its own CPU
> - It makes use of the original memory barrier in the functions
> - Any counter can be read by any CPU
>
> Due to not locking nor using atomic variables, the impact on the
> lockless pagetable walk is intended to be minimum.

atomic variables have a lot less impact than preempt_enable/disable.

preempt_disable forces a re-scheduling, it really has an impact. Why not
use atomic variables instead?

Christophe

>
> Signed-off-by: Leonardo Bras <[email protected]>
> ---
> arch/powerpc/mm/book3s64/pgtable.c | 18 ++++++++++++++++++
> 1 file changed, 18 insertions(+)
>
> diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
> index 535613030363..bb138b628f86 100644
> --- a/arch/powerpc/mm/book3s64/pgtable.c
> +++ b/arch/powerpc/mm/book3s64/pgtable.c
> @@ -83,6 +83,7 @@ static void do_nothing(void *unused)
>
> }
>
> +static DEFINE_PER_CPU(int, lockless_pgtbl_walk_counter);
> /*
> * Serialize against find_current_mm_pte which does lock-less
> * lookup in page tables with local interrupts disabled. For huge pages
> @@ -120,6 +121,15 @@ unsigned long __begin_lockless_pgtbl_walk(bool disable_irq)
> if (disable_irq)
> local_irq_save(irq_mask);
>
> + /*
> + * Counts this instance of lockless pagetable walk for this cpu.
> + * Disables preempt to make sure there is no cpu change between
> + * begin/end lockless pagetable walk, so that percpu counting
> + * works fine.
> + */
> + preempt_disable();
> + (*this_cpu_ptr(&lockless_pgtbl_walk_counter))++;
> +
> /*
> * This memory barrier pairs with any code that is either trying to
> * delete page tables, or split huge pages. Without this barrier,
> @@ -158,6 +168,14 @@ inline void __end_lockless_pgtbl_walk(unsigned long irq_mask, bool enable_irq)
> */
> smp_mb();
>
> + /*
> + * Removes this instance of lockless pagetable walk for this cpu.
> + * Enables preempt only after end lockless pagetable walk,
> + * so that percpu counting works fine.
> + */
> + (*this_cpu_ptr(&lockless_pgtbl_walk_counter))--;
> + preempt_enable();
> +
> /*
> * Interrupts must be disabled during the lockless page table walk.
> * That's because the deleting or splitting involves flushing TLBs,
>

2020-02-07 01:22:45

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v6 02/11] mm/gup: Use functions to track lockless pgtbl walks on gup_pgd_range

Hi Leonardo,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on powerpc/next]
[also build test ERROR on paulus-powerpc/kvm-ppc-next linus/master v5.5 next-20200206]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url: https://github.com/0day-ci/linux/commits/Leonardo-Bras/Introduces-new-functions-for-tracking-lockless-pagetable-walks/20200207-071035
base: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: i386-tinyconfig (attached as .config)
compiler: gcc-7 (Debian 7.5.0-3) 7.5.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <[email protected]>

All errors (new ones prefixed by >>):

mm/gup.c: In function 'get_user_pages_fast':
>> mm/gup.c:2435:27: error: 'IRQS_ENABLED' undeclared (first use in this function); did you mean 'IS_ENABLED'?
end_lockless_pgtbl_walk(IRQS_ENABLED);
^~~~~~~~~~~~
IS_ENABLED
mm/gup.c:2435:27: note: each undeclared identifier is reported only once for each function it appears in

vim +2435 mm/gup.c

2395
2396 /**
2397 * get_user_pages_fast() - pin user pages in memory
2398 * @start: starting user address
2399 * @nr_pages: number of pages from start to pin
2400 * @gup_flags: flags modifying pin behaviour
2401 * @pages: array that receives pointers to the pages pinned.
2402 * Should be at least nr_pages long.
2403 *
2404 * Attempt to pin user pages in memory without taking mm->mmap_sem.
2405 * If not successful, it will fall back to taking the lock and
2406 * calling get_user_pages().
2407 *
2408 * Returns number of pages pinned. This may be fewer than the number
2409 * requested. If nr_pages is 0 or negative, returns 0. If no pages
2410 * were pinned, returns -errno.
2411 */
2412 int get_user_pages_fast(unsigned long start, int nr_pages,
2413 unsigned int gup_flags, struct page **pages)
2414 {
2415 unsigned long addr, len, end;
2416 int nr = 0, ret = 0;
2417
2418 if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
2419 return -EINVAL;
2420
2421 start = untagged_addr(start) & PAGE_MASK;
2422 addr = start;
2423 len = (unsigned long) nr_pages << PAGE_SHIFT;
2424 end = start + len;
2425
2426 if (end <= start)
2427 return 0;
2428 if (unlikely(!access_ok((void __user *)start, len)))
2429 return -EFAULT;
2430
2431 if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
2432 gup_fast_permitted(start, end)) {
2433 begin_lockless_pgtbl_walk();
2434 gup_pgd_range(addr, end, gup_flags, pages, &nr);
> 2435 end_lockless_pgtbl_walk(IRQS_ENABLED);
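
A likely shape for the fix (an assumption, not from this thread): save
the mask returned by begin_lockless_pgtbl_walk() instead of relying on
the ppc64-only IRQS_ENABLED macro:

	unsigned long addr, len, end, irq_mask;
	...
	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
	    gup_fast_permitted(start, end)) {
		irq_mask = begin_lockless_pgtbl_walk();
		gup_pgd_range(addr, end, gup_flags, pages, &nr);
		end_lockless_pgtbl_walk(irq_mask);
		ret = nr;
	}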

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]


Attachments:
(No filename) (3.08 kB)
.config.gz (7.04 kB)
Download all attachments

2020-02-07 01:59:12

by Leonardo Bras

[permalink] [raw]
Subject: Re: [PATCH v6 10/11] powerpc/mm: Adds counting method to track lockless pagetable walks

Hello Christophe, thanks for the feedback!

On Thu, 2020-02-06 at 07:23 +0100, Christophe Leroy wrote:
> > Due to not locking nor using atomic variables, the impact on the
> > lockless pagetable walk is intended to be minimum.
>
> atomic variables have a lot less impact than preempt_enable/disable.
>
> preempt_disable forces a re-scheduling, it really has an impact. Why not
> use atomic variables instead?

In fact, v5 of this patch used atomic variables, but they seemed to cause
contention on a single exclusive cacheline, which performed no better
than locking.
(discussion here: http://patchwork.ozlabs.org/patch/1171012/)
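
To illustrate the difference (a minimal sketch, not the actual v5/v6
code):

	#include <linux/atomic.h>
	#include <linux/percpu.h>
	#include <linux/preempt.h>

	/* v5-style: one shared counter; every walker bounces the same
	 * exclusive cacheline across all CPUs. */
	static atomic_t walk_count = ATOMIC_INIT(0);

	static inline void begin_walk_atomic(void)
	{
		atomic_inc(&walk_count);
	}

	/* v6-style: one counter per CPU; each CPU writes only its own
	 * cacheline, and readers scan all of them when serializing. */
	static DEFINE_PER_CPU(int, walk_counter);

	static inline void begin_walk_percpu(void)
	{
		preempt_disable();	/* stay on this CPU until the walk ends */
		this_cpu_inc(walk_counter);
	}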

When I try to understand the effect of preempt_disable(), all I can
see is a barrier() and possibly a preempt_count_inc(), which updates a
member of the current thread struct if CONFIG_PREEMPT_COUNT is enabled.

If CONFIG_PREEMPTION is also enabled, preempt_enable() can run a
__preempt_schedule() on unlikely(__preempt_count_dec_and_test()).

On most configs available, CONFIG_PREEMPTION is not set, being replaced
either by CONFIG_PREEMPT_NONE (kernel defconfigs) or
CONFIG_PREEMPT_VOLUNTARY (most supported distros). With that, most
probably CONFIG_PREEMPT_COUNT will also not be set, and
preempt_{en,dis}able() each reduce to a barrier().
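
For reference, a simplified sketch of what those helpers reduce to
(paraphrased from include/linux/preempt.h, not the exact source):

	#ifdef CONFIG_PREEMPT_COUNT
	#define preempt_disable() \
	do { \
		preempt_count_inc(); \
		barrier(); \
	} while (0)

	#ifdef CONFIG_PREEMPTION
	#define preempt_enable() \
	do { \
		barrier(); \
		if (unlikely(preempt_count_dec_and_test())) \
			__preempt_schedule(); \
	} while (0)
	#else
	#define preempt_enable() \
	do { \
		barrier(); \
		preempt_count_dec(); \
	} while (0)
	#endif

	#else /* !CONFIG_PREEMPT_COUNT */
	#define preempt_disable()	barrier()
	#define preempt_enable()	barrier()
	#endif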

Using the preempt_disable approach, I intend to get better performance
for the most common cases.

What do you think of it?

I am still new on this subject, and I am still trying to better
understand how it works. If you notice something I am missing, please
let me know.

Best regards,
Leonardo Bras



Attachments:
signature.asc (849.00 B)
This is a digitally signed message part

2020-02-07 02:24:13

by Leonardo Bras

[permalink] [raw]
Subject: Re: [PATCH v6 01/11] asm-generic/pgtable: Adds generic functions to track lockless pgtable walks

Hello Christophe, thanks for the feedback!

On Thu, 2020-02-06 at 06:54 +0100, Christophe Leroy wrote:
> > A memory barrier was also added just to make sure there is no speculative
> > read outside the interrupt disabled area. Other than that, it is not
> > supposed to have any change of behavior from current code.
>
> Is that speculative barrier necessary for all architectures? Does it
> impact performance? Shouldn't this be another patch?

It makes sense; better to keep the code as close as possible to what it
was. If any arch finds this barrier necessary, it can implement its own
version of this function (or send another patch adding it to the generic
one, if it proves to be needed on every arch).

> > +#ifndef __HAVE_ARCH_LOCKLESS_PGTBL_WALK_CONTROL
> > +/*
> > + * begin_lockless_pgtbl_walk: Must be inserted before a function call that does
> > + * lockless pagetable walks, such as __find_linux_pte()
> > + */
> > +static inline
> > +unsigned long begin_lockless_pgtbl_walk(void)
>
> What about keeping the same syntax as local_irq_save(), something like:
>
> #define begin_lockless_pgtbl_walk(flags) \
> 	do { \
> 		local_irq_save(flags); \
> 		smp_mb(); \
> 	} while (0)
>

Makes sense. But wouldn't inlining have the same code output?

Best regards,
Leonardo Bras


Attachments:
signature.asc (849.00 B)
This is a digitally signed message part

2020-02-07 03:13:03

by Leonardo Bras

[permalink] [raw]
Subject: Re: [PATCH v6 07/11] powerpc/kvm/e500: Use functions to track lockless pgtbl walks

Hello Christophe,

On Thu, 2020-02-06 at 07:18 +0100, Christophe Leroy wrote:
>
> I don't really like unbalanced begin/end.
>
> Something like the following would be cleaner:
>
>
> begin_lockless_pgtbl_walk()
> ptep = find()
> if (ptep) {
> pte = READ_ONCE()
> if (pte_present(pte))
> wing=
> else
> ret = -EINVAL;
> }
> end_lockless_pgtbl_walk()
>
> if (ret) {
> pr_err_rate...()
> goto out;
> }
>
>

Sure, looks better that way. I will change that for v7.

Thanks for the feedback,

Leonardo Bras


Attachments:
signature.asc (849.00 B)
This is a digitally signed message part

2020-02-07 03:53:34

by Leonardo Bras

[permalink] [raw]
Subject: Re: [PATCH v6 06/11] powerpc/mm/book3s64/hash: Use functions to track lockless pgtbl walks

On Thu, 2020-02-06 at 07:06 +0100, Christophe Leroy wrote:
> > - /* Get PTE and page size from page tables */
> > + /* Get PTE and page size from page tables :
> > + * Called in from DataAccess interrupt (data_access_common: 0x300),
> > + * interrupts are disabled here.
> > + */
>
> Comment formatting is not in line with Linux kernel rules: there should
> be no text on the first /* line.

My mistake. Will be corrected in v7.

> > + __begin_lockless_pgtbl_walk(false);
>
> I think it would be better to not use __begin_lockless_pgtbl_walk()
> directly but keep it in a single place, and define something like
> begin_lockless_pgtbl_walk_noirq() similar to begin_lockless_pgtbl_walk()

There are places in the code where touching irqs is decided by a boolean,
like in patch 8: http://patchwork.ozlabs.org/patch/1234130/

If I were to change this, I would have to place ifs and decide to call
either the normal or the *_noirq() version in every function.

What do you suggest?
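
For illustration, the wrappers could look like this (a sketch; only the
begin_lockless_pgtbl_walk_noirq() name comes from your review, the rest
is an assumption):

	static inline void begin_lockless_pgtbl_walk_noirq(void)
	{
		/* For callers already running with MSR[EE]=0 (real mode). */
		__begin_lockless_pgtbl_walk(false);
	}

	static inline void end_lockless_pgtbl_walk_noirq(void)
	{
		__end_lockless_pgtbl_walk(0, false);
	}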

> > ptep = find_linux_pte(pgdir, ea, &is_thp, &hugeshift);
> > if (ptep == NULL || !pte_present(*ptep)) {
> > DBG_LOW(" no PTE !\n");
> > rc = 1;
> > - goto bail;
> > + goto bail_pgtbl_walk;
>
> What's the point in changing the name of this label? There is only one
> label, why pollute the function with such a huge name?
>
> For me, everyone understands what 'bail' means. Unnecessary changes
> should be avoided. If you really really want to do it, it should be
> another patch.
>
> See the kernel coding style, chapter 'Naming':
> "LOCAL variable names should be short". This also applies to labels.
>
> "C is a Spartan language, and so should your naming be. Unlike Modula-2
> and Pascal programmers, C programmers do not use cute names like
> ThisVariableIsATemporaryCounter. A C programmer would call that variable
> tmp, which is much easier to write, and not the least more difficult to
> understand."
>

It's not a label name change. There are two possible bails in
hash_page_mm(): one for bailing before __begin_lockless_pgtbl_walk() and
another for after it. The new one also runs __end_lockless_pgtbl_walk()
before doing what the previous one did:

> > +bail_pgtbl_walk:
> > + __end_lockless_pgtbl_walk(0, false);
> > bail:
> > exception_exit(prev_state);
> > return rc;

As for the label name length, I see no problem changing it to something
like bail_ptw.


> > @@ -1545,7 +1551,7 @@ static void hash_preload(struct mm_struct *mm, unsigned long ea,
> > unsigned long vsid;
> > pgd_t *pgdir;
> > pte_t *ptep;
> > - unsigned long flags;
> > + unsigned long irq_mask;
> > int rc, ssize, update_flags = 0;
> > unsigned long access = _PAGE_PRESENT | _PAGE_READ | (is_exec ? _PAGE_EXEC : 0);
> >
> > @@ -1567,11 +1573,12 @@ static void hash_preload(struct mm_struct *mm, unsigned long ea,
> > vsid = get_user_vsid(&mm->context, ea, ssize);
> > if (!vsid)
> > return;
> > +
>
> Is this new line related to the patch?

Nope. I added it while reading the code and it just slipped through my
pre-send revision. I can remove it if it bothers you.

>
> > /*
> > * Hash doesn't like irqs. Walking linux page table with irq disabled
> > * saves us from holding multiple locks.
> > */
> > - local_irq_save(flags);
> > + irq_mask = begin_lockless_pgtbl_walk();
> >
> > /*
> > * THP pages use update_mmu_cache_pmd. We don't do
> > @@ -1616,7 +1623,7 @@ static void hash_preload(struct mm_struct *mm, unsigned long ea,
> > mm_ctx_user_psize(&mm->context),
> > pte_val(*ptep));
> > out_exit:
> > - local_irq_restore(flags);
> > + end_lockless_pgtbl_walk(irq_mask);
> > }
> >
> > /*
> > @@ -1679,16 +1686,16 @@ u16 get_mm_addr_key(struct mm_struct *mm, unsigned long address)
> > {
> > pte_t *ptep;
> > u16 pkey = 0;
> > - unsigned long flags;
> > + unsigned long irq_mask;
> >
> > if (!mm || !mm->pgd)
> > return 0;
> >
> > - local_irq_save(flags);
> > + irq_mask = begin_lockless_pgtbl_walk();
> > ptep = find_linux_pte(mm->pgd, address, NULL, NULL);
> > if (ptep)
> > pkey = pte_to_pkey_bits(pte_val(READ_ONCE(*ptep)));
> > - local_irq_restore(flags);
> > + end_lockless_pgtbl_walk(irq_mask);
> >
> > return pkey;
> > }
> >
>
> Christophe

Thanks for giving feedback,

Leonardo Bras


Attachments:
signature.asc (849.00 B)
This is a digitally signed message part

2020-02-07 04:03:19

by Leonardo Bras

[permalink] [raw]
Subject: Re: [PATCH v6 04/11] powerpc/mce_power: Use functions to track lockless pgtbl walks

On Thu, 2020-02-06 at 06:48 +0100, Christophe Leroy wrote:
> > --- a/arch/powerpc/kernel/mce_power.c
> > +++ b/arch/powerpc/kernel/mce_power.c
> > @@ -29,7 +29,7 @@ unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
> > {
> > pte_t *ptep;
> > unsigned int shift;
> > - unsigned long pfn, flags;
> > + unsigned long pfn, irq_mask;
>
> Why change the name? flags is a well-known historical name.

Oh, this commit missed the reasoning for the flags name change.

For local_irq_{save,restore}, a parameter named flags makes sense to the
reader, as it can only mean flags regarding irqs.

I thought passing flags to {begin,end}_lockless_pgtbl_walk would lose
that meaning, given that it would only mean "flags for these functions".

So, changing it to irq_mask makes the reader more aware of what it
actually means.

For other commits, I added:

"Variable that saves the irq mask was renamed from flags to irq_mask so
it doesn't lose meaning now it's not directly passed to local_irq_*
functions."

I can add it to this commit message.

Thanks for the feedback,

Leonardo Bras




Attachments:
signature.asc (849.00 B)
This is a digitally signed message part

2020-02-07 04:42:33

by Leonardo Bras

[permalink] [raw]
Subject: Re: [PATCH v6 03/11] powerpc/mm: Adds arch-specificic functions to track lockless pgtable walks

On Thu, 2020-02-06 at 06:46 +0100, Christophe Leroy wrote:
>
> On 06/02/2020 at 04:08, Leonardo Bras wrote:
> > On powerpc, we need to do some lockless pagetable walks from functions
> > that already have disabled interrupts, specially from real mode with
> > MSR[EE=0].
> >
> > In these contexts, disabling/enabling interrupts can be very troubling.
>
> When interrupts are already disabled, the flag returned when disabling
> it will be such that when we restore it later, interrupts remain
> disabled, so what's the problem?

There are places in the code, like patch 8, where irq save/restore is
explicitly avoided by using a function parameter (realmode):
http://patchwork.ozlabs.org/patch/1234130/

I am not sure why it's that way, but I decided to keep it as is.
It was introduced by Aneesh Kumar in 2015
(691e95fd7396905a38d98919e9c150dbc3ea21a3).

> > So, this arch-specific implementation features functions with an extra
> > argument that allows interrupt enable/disable to be skipped:
> > __begin_lockless_pgtbl_walk() and __end_lockless_pgtbl_walk().
> >
> > Functions similar to the generic ones are also exported, by calling
> > the above functions with parameter {en,dis}able_irq = true.
> >
> > Signed-off-by: Leonardo Bras <[email protected]>
> > ---
> > arch/powerpc/include/asm/book3s/64/pgtable.h | 6 ++
> > arch/powerpc/mm/book3s64/pgtable.c | 86 +++++++++++++++++++-
> > 2 files changed, 91 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
> > index 201a69e6a355..78f6ffb1bb3e 100644
> > --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> > +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> > @@ -1375,5 +1375,11 @@ static inline bool pgd_is_leaf(pgd_t pgd)
> > return !!(pgd_raw(pgd) & cpu_to_be64(_PAGE_PTE));
> > }
> >
> > +#define __HAVE_ARCH_LOCKLESS_PGTBL_WALK_CONTROL
> > +unsigned long begin_lockless_pgtbl_walk(void);
> > +unsigned long __begin_lockless_pgtbl_walk(bool disable_irq);
> > +void end_lockless_pgtbl_walk(unsigned long irq_mask);
> > +void __end_lockless_pgtbl_walk(unsigned long irq_mask, bool enable_irq);
> > +
>
> Why not make them static inline just like the generic ones?
>

Sure, that can be done. It would save some function calls.
For that, I will define the per-cpu variable in the .c file and declare
it in the .h. All the new functions can be moved to the .h, adding the
inline modifier.
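
A sketch of that layout (the DEFINE/DECLARE split is the standard percpu
pattern; exact placement is an assumption):

	/* arch/powerpc/mm/book3s64/pgtable.c: storage in one unit */
	DEFINE_PER_CPU(int, lockless_pgtbl_walk_counter);
	EXPORT_PER_CPU_SYMBOL(lockless_pgtbl_walk_counter);

	/* arch/powerpc/include/asm/book3s/64/pgtable.h: visible to the
	 * now-inline helpers */
	DECLARE_PER_CPU(int, lockless_pgtbl_walk_counter);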

> > #endif /* __ASSEMBLY__ */
> > #endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */
> > diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
> > index 2bf7e1b4fd82..535613030363 100644
> > --- a/arch/powerpc/mm/book3s64/pgtable.c
> > +++ b/arch/powerpc/mm/book3s64/pgtable.c
> > @@ -82,6 +82,7 @@ static void do_nothing(void *unused)
> > {
> >
> > }
> > +
>
> Is this blank line related to the patch?
>

Nope, just something I 'fixed' while reading that got past my pre-send
patch review. If it bothers you, I can remove it.

> > /*
> > * Serialize against find_current_mm_pte which does lock-less
> > * lookup in page tables with local interrupts disabled. For huge pages
> > @@ -98,6 +99,89 @@ void serialize_against_pte_lookup(struct mm_struct *mm)
> > smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
> > }
> >
> > +/* begin_lockless_pgtbl_walk: Must be inserted before a function call that does
> > + * lockless pagetable walks, such as __find_linux_pte().
> > + * This version allows setting disable_irq=false, so irqs are not touched, which
> > + * is quite useful for running when ints are already disabled (like real-mode)
> > + */
> > +inline
> > +unsigned long __begin_lockless_pgtbl_walk(bool disable_irq)
> > +{
> > + unsigned long irq_mask = 0;
> > +
> > + /*
> > + * Interrupts must be disabled during the lockless page table walk.
> > + * That's because the deleting or splitting involves flushing TLBs,
> > + * which in turn issues interrupts, that will block when disabled.
> > + *
> > + * When this function is called from realmode with MSR[EE=0],
> > + * it's not needed to touch irq, since it's already disabled.
> > + */
> > + if (disable_irq)
> > + local_irq_save(irq_mask);
> > +
> > + /*
> > + * This memory barrier pairs with any code that is either trying to
> > + * delete page tables, or split huge pages. Without this barrier,
> > + * the page tables could be read speculatively outside of interrupt
> > + * disabling or reference counting.
> > + */
> > + smp_mb();
> > +
> > + return irq_mask;
> > +}
> > +EXPORT_SYMBOL(__begin_lockless_pgtbl_walk);
> > +
> > +/* begin_lockless_pgtbl_walk: Must be inserted before a function call that does
> > + * lockless pagetable walks, such as __find_linux_pte().
> > + * This version is used by generic code, and always assume irqs will be disabled
> > + */
> > +unsigned long begin_lockless_pgtbl_walk(void)
> > +{
> > + return __begin_lockless_pgtbl_walk(true);
> > +}
> > +EXPORT_SYMBOL(begin_lockless_pgtbl_walk);
>
> Even more than begin_lockless_pgtbl_walk(), this one is worth being
> static inline in the H file.

Same as above, can be done.

>
> > +
> > +/*
> > + * __end_lockless_pgtbl_walk: Must be inserted after the last use of a pointer
> > + * returned by a lockless pagetable walk, such as __find_linux_pte()
> > + * This version allows setting enable_irq=false, so irqs are not touched, which
> > + * is quite useful for running when ints are already disabled (like real-mode)
> > + */
> > +inline void __end_lockless_pgtbl_walk(unsigned long irq_mask, bool enable_irq)
> > +{
> > + /*
> > + * This memory barrier pairs with any code that is either trying to
> > + * delete page tables, or split huge pages. Without this barrier,
> > + * the page tables could be read speculatively outside of interrupt
> > + * disabling or reference counting.
> > + */
> > + smp_mb();
> > +
> > + /*
> > + * Interrupts must be disabled during the lockless page table walk.
> > + * That's because the deleting or splitting involves flushing TLBs,
> > + * which in turn issues interrupts, that will block when disabled.
> > + *
> > + * When this function is called from realmode with MSR[EE=0],
> > + * it's not needed to touch irq, since it's already disabled.
> > + */
> > + if (enable_irq)
> > + local_irq_restore(irq_mask);
> > +}
> > +EXPORT_SYMBOL(__end_lockless_pgtbl_walk);
> > +
> > +/*
> > + * end_lockless_pgtbl_walk: Must be inserted after the last use of a pointer
> > + * returned by a lockless pagetable walk, such as __find_linux_pte()
> > + * This version is used by generic code, and always assume irqs will be enabled
> > + */
> > +void end_lockless_pgtbl_walk(unsigned long irq_mask)
> > +{
> > + __end_lockless_pgtbl_walk(irq_mask, true);
> > +}
> > +EXPORT_SYMBOL(end_lockless_pgtbl_walk);
> > +
> > /*
> > * We use this to invalidate a pmdp entry before switching from a
> > * hugepte to regular pmd entry.
> > @@ -487,7 +571,7 @@ static int __init setup_disable_tlbie(char *str)
> > tlbie_capable = false;
> > tlbie_enabled = false;
> >
> > - return 1;
> > + return 1;
>
> Is that related to this patch at all?

Nope, just another thing I 'fixed' while reading that got past my
pre-send patch review. If it bothers you, I can remove it.

>
> > }
> > __setup("disable_tlbie", setup_disable_tlbie);
> >
> >
>
> Christophe

Found another unbalanced begin/end in kvmppc_do_h_enter(). I will change
that too.

Thanks for the feedback,

Leonardo Bras


Attachments:
signature.asc (849.00 B)
This is a digitally signed message part

2020-02-07 05:42:45

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v6 01/11] asm-generic/pgtable: Adds generic functions to track lockless pgtable walks

Hi Leonardo,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on powerpc/next]
[also build test ERROR on asm-generic/master paulus-powerpc/kvm-ppc-next linus/master v5.5 next-20200207]
[cannot apply to kvm-ppc/kvm-ppc-next]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url: https://github.com/0day-ci/linux/commits/Leonardo-Bras/Introduces-new-functions-for-tracking-lockless-pagetable-walks/20200207-071035
base: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: xtensa-common_defconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 7.5.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.5.0 make.cross ARCH=xtensa

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <[email protected]>

All errors (new ones prefixed by >>):

include/asm-generic/pgtable.h: Assembler messages:
include/asm-generic/pgtable.h:1230: Error: unknown opcode or format name 'static'
include/asm-generic/pgtable.h:1231: Error: unknown opcode or format name 'unsigned'
include/asm-generic/pgtable.h:1233: Error: unknown opcode or format name 'unsigned'
include/asm-generic/pgtable.h:1240: Error: unknown opcode or format name 'local_irq_save'
include/asm-generic/pgtable.h:1248: Error: unknown opcode or format name 'smp_mb'
include/asm-generic/pgtable.h:1250: Error: unknown opcode or format name 'return'
>> include/asm-generic/pgtable.h:1251: Error: couldn't find a valid instruction format
ops were:
include/asm-generic/pgtable.h:1257: Error: unknown opcode or format name 'static'
include/asm-generic/pgtable.h:1265: Error: unknown opcode or format name 'smp_mb'
include/asm-generic/pgtable.h:1272: Error: unknown opcode or format name 'local_irq_restore'
include/asm-generic/pgtable.h:1273: Error: couldn't find a valid instruction format
ops were:

vim +1251 include/asm-generic/pgtable.h

1224
1225 #ifndef __HAVE_ARCH_LOCKLESS_PGTBL_WALK_CONTROL
1226 /*
1227 * begin_lockless_pgtbl_walk: Must be inserted before a function call that does
1228 * lockless pagetable walks, such as __find_linux_pte()
1229 */
1230 static inline
1231 unsigned long begin_lockless_pgtbl_walk(void)
1232 {
1233 unsigned long irq_mask;
1234
1235 /*
1236 * Interrupts must be disabled during the lockless page table walk.
1237 * That's because the deleting or splitting involves flushing TLBs,
1238 * which in turn issues interrupts, that will block when disabled.
1239 */
1240 local_irq_save(irq_mask);
1241
1242 /*
1243 * This memory barrier pairs with any code that is either trying to
1244 * delete page tables, or split huge pages. Without this barrier,
1245 * the page tables could be read speculatively outside of interrupt
1246 * disabling.
1247 */
1248 smp_mb();
1249
1250 return irq_mask;
> 1251 }
1252
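
The errors suggest the new helpers are visible to assembly files that
include this header; a plausible fix (an assumption, not from this
thread) is to keep the C code under the header's __ASSEMBLY__ guard:

	#ifndef __ASSEMBLY__
	static inline unsigned long begin_lockless_pgtbl_walk(void)
	{
		unsigned long irq_mask;

		local_irq_save(irq_mask);
		smp_mb();
		return irq_mask;
	}

	static inline void end_lockless_pgtbl_walk(unsigned long irq_mask)
	{
		smp_mb();
		local_irq_restore(irq_mask);
	}
	#endif /* !__ASSEMBLY__ */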

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation


Attachments:
(No filename) (3.44 kB)
.config.gz (10.31 kB)
Download all attachments

2020-02-07 08:03:35

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v6 02/11] mm/gup: Use functions to track lockless pgtbl walks on gup_pgd_range

Hi Leonardo,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on powerpc/next]
[also build test ERROR on asm-generic/master paulus-powerpc/kvm-ppc-next linus/master v5.5 next-20200207]
[cannot apply to kvm-ppc/kvm-ppc-next]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url: https://github.com/0day-ci/linux/commits/Leonardo-Bras/Introduces-new-functions-for-tracking-lockless-pagetable-walks/20200207-071035
base: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: mips-randconfig-a001-20200207 (attached as .config)
compiler: mipsel-linux-gcc (GCC) 5.5.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=5.5.0 make.cross ARCH=mips

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <[email protected]>

All errors (new ones prefixed by >>):

mm/gup.c: In function 'get_user_pages_fast':
>> mm/gup.c:2435:27: error: 'IRQS_ENABLED' undeclared (first use in this function)
end_lockless_pgtbl_walk(IRQS_ENABLED);
^
mm/gup.c:2435:27: note: each undeclared identifier is reported only once for each function it appears in

vim +/IRQS_ENABLED +2435 mm/gup.c

2395
2396 /**
2397 * get_user_pages_fast() - pin user pages in memory
2398 * @start: starting user address
2399 * @nr_pages: number of pages from start to pin
2400 * @gup_flags: flags modifying pin behaviour
2401 * @pages: array that receives pointers to the pages pinned.
2402 * Should be at least nr_pages long.
2403 *
2404 * Attempt to pin user pages in memory without taking mm->mmap_sem.
2405 * If not successful, it will fall back to taking the lock and
2406 * calling get_user_pages().
2407 *
2408 * Returns number of pages pinned. This may be fewer than the number
2409 * requested. If nr_pages is 0 or negative, returns 0. If no pages
2410 * were pinned, returns -errno.
2411 */
2412 int get_user_pages_fast(unsigned long start, int nr_pages,
2413 unsigned int gup_flags, struct page **pages)
2414 {
2415 unsigned long addr, len, end;
2416 int nr = 0, ret = 0;
2417
2418 if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
2419 return -EINVAL;
2420
2421 start = untagged_addr(start) & PAGE_MASK;
2422 addr = start;
2423 len = (unsigned long) nr_pages << PAGE_SHIFT;
2424 end = start + len;
2425
2426 if (end <= start)
2427 return 0;
2428 if (unlikely(!access_ok((void __user *)start, len)))
2429 return -EFAULT;
2430
2431 if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
2432 gup_fast_permitted(start, end)) {
2433 begin_lockless_pgtbl_walk();
2434 gup_pgd_range(addr, end, gup_flags, pages, &nr);
> 2435 end_lockless_pgtbl_walk(IRQS_ENABLED);

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation


Attachments:
(No filename) (3.30 kB)
.config.gz (23.78 kB)
Download all attachments

2020-02-17 20:38:10

by Leonardo Bras

[permalink] [raw]
Subject: Re: [PATCH v6 03/11] powerpc/mm: Adds arch-specificic functions to track lockless pgtable walks

On Fri, 2020-02-07 at 01:38 -0300, Leonardo Bras wrote:
> > Why not make them static inline just like the generic ones?
> >
>
> Sure, can be done. It would save some function calls.
> For that I will define the per-cpu variable in .c and declare it in .h
> All new function can be moved to .h, while changing adding the inline
> modifier.

Just tried doing that, but percpu stuff relies on paca definitions, and
this ends up creating some cyclic dependencies.

I could try to change that, but the amount of change is big and should
probably be dealt with in another patchset.


Attachments:
signature.asc (849.00 B)
This is a digitally signed message part

2020-02-17 20:58:42

by Leonardo Bras

[permalink] [raw]
Subject: Re: [PATCH v6 02/11] mm/gup: Use functions to track lockless pgtbl walks on gup_pgd_range

Hello John, comments inline;

On Fri, 2020-02-07 at 14:54 -0800, John Hubbard wrote:
> On 2/5/20 7:25 PM, Leonardo Bras wrote:
> > On Thu, 2020-02-06 at 00:08 -0300, Leonardo Bras wrote:
> > > gup_pgd_range(addr, end, gup_flags, pages, &nr);
> > > - local_irq_enable();
> > > + end_lockless_pgtbl_walk(IRQS_ENABLED);
> > > ret = nr;
> > > }
> > >
> >
> > Just noticed IRQS_ENABLED is not available on other archs than ppc64.
> > I will fix this for v7.
> >
>
> What's the fix going to look like, approximately?

I am not sure what the best approach is yet.

1. On irq_mask == 0, always enable irqs in end_lockless_pgtbl_walk().
Not sure how badly that would affect other archs.

2. Add another function like end_lockless_pgtbl_walk_irqen() that
always enables IRQs (sketched below).

3. Add another parameter to end_lockless_pgtbl_walk(), so that the
caller can choose if IRQs must be enabled.

Also, I am not sure if internal_get_user_pages_fast() can possibly be
called with IRQs disabled and then return with them enabled. Maybe just
saving/restoring would be fine.

Other suggestions are welcome.
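
For illustration, option 2 might look like this (hypothetical, not
settled in the thread):

	static inline void end_lockless_pgtbl_walk_irqen(void)
	{
		smp_mb();		/* pairs with THP split/collapse */
		local_irq_enable();	/* caller used local_irq_disable() */
	}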

>
>
> thanks,

Best regards,

Leonardo Bras


Attachments:
signature.asc (849.00 B)
This is a digitally signed message part

2020-10-15 14:51:00

by Michal Suchánek

[permalink] [raw]
Subject: Re: [PATCH v6 02/11] mm/gup: Use functions to track lockless pgtbl walks on gup_pgd_range

Hello,

On Thu, Feb 06, 2020 at 12:25:18AM -0300, Leonardo Bras wrote:
> On Thu, 2020-02-06 at 00:08 -0300, Leonardo Bras wrote:
> > gup_pgd_range(addr, end, gup_flags, pages, &nr);
> > - local_irq_enable();
> > + end_lockless_pgtbl_walk(IRQS_ENABLED);
> > ret = nr;
> > }
> >
>
> Just noticed IRQS_ENABLED is not available on other archs than ppc64.
> I will fix this for v7.

Has there been a v7?

I cannot find it.

Thanks

Michal

2020-10-16 04:37:34

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH v6 02/11] mm/gup: Use functions to track lockless pgtbl walks on gup_pgd_range

Hi Michal,

On 10/15/20 8:16 PM, Michal Suchánek wrote:
> Hello,
>
> On Thu, Feb 06, 2020 at 12:25:18AM -0300, Leonardo Bras wrote:
>> On Thu, 2020-02-06 at 00:08 -0300, Leonardo Bras wrote:
>>> gup_pgd_range(addr, end, gup_flags, pages, &nr);
>>> - local_irq_enable();
>>> + end_lockless_pgtbl_walk(IRQS_ENABLED);
>>> ret = nr;
>>> }
>>>
>>
>> Just noticed IRQS_ENABLED is not available on other archs than ppc64.
>> I will fix this for v7.
>
> Has threre been v7?
>
> I cannot find it.
>
> Thanks
>
> Michal
>

https://lore.kernel.org/linuxppc-dev/[email protected]

This series should help here.

-aneesh