2018-08-24 15:55:23

by Will Deacon

Subject: [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64

Hi all,

I hacked up this RFC on the back of the recent changes to the mmu_gather
stuff in mainline. It's had a bit of testing and it looks pretty good so
far.

The main changes in the series are:

- Avoid emitting a DSB barrier after clearing each page-table entry.
Instead, we can have a single DSB prior to the actual TLB invalidation.

- Batch last-level TLB invalidation until the end of the VMA, and use
last-level-only invalidation instructions

- Batch intermediate TLB invalidation until the end of the gather, and
use all-level invalidation instructions

- Adjust the stride of TLB invalidation based upon the smallest unflushed
granule in the gather

As a really stupid benchmark, unmapping a populated mapping of
0x4_3333_3000 bytes using munmap() takes around 20% of the time it took
before.
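
For reference, a minimal sketch of that sort of benchmark; the size matches the figure quoted above, while MAP_POPULATE and CLOCK_MONOTONIC are only assumptions about how such a test might be written:

#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

int main(void)
{
	size_t sz = 0x433333000UL;	/* 0x4_3333_3000 bytes, as above */
	struct timespec t0, t1;
	void *p;

	/* Map and populate a large anonymous region up front */
	p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Time a single munmap() of the whole mapping */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	munmap(p, sz);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("munmap: %.3f ms\n",
	       (t1.tv_sec - t0.tv_sec) * 1e3 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e6);
	return 0;
}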

The core changes now track the levels of page-table that have been visited
by the mmu_gather since the last flush. It may be possible to use the
page_size field instead if we decide to resurrect that from its current
"debug" status, but I think I'd prefer to track the levels explicitly.

Anyway, I wanted to post this before disappearing for the long weekend
(Monday is a holiday in the UK) in the hope that it helps some of the
ongoing discussions.

Cheers,

Will

--->8

Peter Zijlstra (1):
asm-generic/tlb: Track freeing of page-table directories in struct mmu_gather

Will Deacon (10):
arm64: tlb: Use last-level invalidation in flush_tlb_kernel_range()
arm64: tlb: Add DSB ISHST prior to TLBI in __flush_tlb_[kernel_]pgtable()
arm64: pgtable: Implement p[mu]d_valid() and check in set_p[mu]d()
arm64: tlb: Justify non-leaf invalidation in flush_tlb_range()
arm64: tlbflush: Allow stride to be specified for __flush_tlb_range()
arm64: tlb: Remove redundant !CONFIG_HAVE_RCU_TABLE_FREE code
asm-generic/tlb: Guard with #ifdef CONFIG_MMU
asm-generic/tlb: Track which levels of the page tables have been cleared
arm64: tlb: Adjust stride and type of TLBI according to mmu_gather
arm64: tlb: Avoid synchronous TLBIs when freeing page tables

arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/pgtable.h | 10 ++++-
arch/arm64/include/asm/tlb.h | 34 +++++++----------
arch/arm64/include/asm/tlbflush.h | 28 +++++++-------
include/asm-generic/tlb.h | 79 +++++++++++++++++++++++++++++++++------
mm/memory.c | 4 +-
6 files changed, 105 insertions(+), 51 deletions(-)

--
2.1.4



2018-08-24 15:54:10

by Will Deacon

Subject: [RFC PATCH 08/11] asm-generic/tlb: Track freeing of page-table directories in struct mmu_gather

From: Peter Zijlstra <[email protected]>

Some architectures require different TLB invalidation instructions
depending on whether it is only the last-level of page table being
changed, or whether there are also changes to the intermediate
(directory) entries higher up the tree.

Add a new bit to the flags bitfield in struct mmu_gather so that the
architecture code can operate accordingly if it's the intermediate
levels being invalidated.

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
---
include/asm-generic/tlb.h | 31 +++++++++++++++++++++++--------
1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 859f897d3d53..a5caf90264e6 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -99,12 +99,22 @@ struct mmu_gather {
#endif
unsigned long start;
unsigned long end;
- /* we are in the middle of an operation to clear
- * a full mm and can make some optimizations */
- unsigned int fullmm : 1,
- /* we have performed an operation which
- * requires a complete flush of the tlb */
- need_flush_all : 1;
+ /*
+ * we are in the middle of an operation to clear
+ * a full mm and can make some optimizations
+ */
+ unsigned int fullmm : 1;
+
+ /*
+ * we have performed an operation which
+ * requires a complete flush of the tlb
+ */
+ unsigned int need_flush_all : 1;
+
+ /*
+ * we have removed page directories
+ */
+ unsigned int freed_tables : 1;

struct mmu_gather_batch *active;
struct mmu_gather_batch local;
@@ -139,6 +149,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
tlb->start = TASK_SIZE;
tlb->end = 0;
}
+ tlb->freed_tables = 0;
}

static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
@@ -280,6 +291,7 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
#define pte_free_tlb(tlb, ptep, address) \
do { \
__tlb_adjust_range(tlb, address, PAGE_SIZE); \
+ tlb->freed_tables = 1; \
__pte_free_tlb(tlb, ptep, address); \
} while (0)
#endif
@@ -287,7 +299,8 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
#ifndef pmd_free_tlb
#define pmd_free_tlb(tlb, pmdp, address) \
do { \
- __tlb_adjust_range(tlb, address, PAGE_SIZE); \
+ __tlb_adjust_range(tlb, address, PAGE_SIZE); \
+ tlb->freed_tables = 1; \
__pmd_free_tlb(tlb, pmdp, address); \
} while (0)
#endif
@@ -297,6 +310,7 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
#define pud_free_tlb(tlb, pudp, address) \
do { \
__tlb_adjust_range(tlb, address, PAGE_SIZE); \
+ tlb->freed_tables = 1; \
__pud_free_tlb(tlb, pudp, address); \
} while (0)
#endif
@@ -306,7 +320,8 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
#ifndef p4d_free_tlb
#define p4d_free_tlb(tlb, pudp, address) \
do { \
- __tlb_adjust_range(tlb, address, PAGE_SIZE); \
+ __tlb_adjust_range(tlb, address, PAGE_SIZE); \
+ tlb->freed_tables = 1; \
__p4d_free_tlb(tlb, pudp, address); \
} while (0)
#endif
--
2.1.4


2018-08-24 15:54:35

by Will Deacon

Subject: [RFC PATCH 09/11] asm-generic/tlb: Track which levels of the page tables have been cleared

It is common for architectures with hugepage support to require only a
single TLB invalidation operation per hugepage during unmap(), rather than
iterating through the mapping at a PAGE_SIZE increment. Currently,
however, the level in the page table where the unmap() operation occurs
is not stored in the mmu_gather structure, therefore forcing
architectures to issue additional TLB invalidation operations or to give
up and over-invalidate by e.g. invalidating the entire TLB.

Ideally, we could add an interval rbtree to the mmu_gather structure,
which would allow us to associate the correct mapping granule with the
various sub-mappings within the range being invalidated. However, this
is costly in terms of book-keeping and memory management, so instead we
approximate by keeping track of the page table levels that are cleared
and provide a means to query the smallest granule required for invalidation.

Signed-off-by: Will Deacon <[email protected]>
---
include/asm-generic/tlb.h | 53 ++++++++++++++++++++++++++++++++++++++++-------
mm/memory.c | 4 +++-
2 files changed, 48 insertions(+), 9 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index a5caf90264e6..081b105d7215 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -116,6 +116,14 @@ struct mmu_gather {
*/
unsigned int freed_tables : 1;

+ /*
+ * at which levels have we cleared entries?
+ */
+ unsigned int cleared_ptes : 1;
+ unsigned int cleared_pmds : 1;
+ unsigned int cleared_puds : 1;
+ unsigned int cleared_p4ds : 1;
+
struct mmu_gather_batch *active;
struct mmu_gather_batch local;
struct page *__pages[MMU_GATHER_BUNDLE];
@@ -150,6 +158,10 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
tlb->end = 0;
}
tlb->freed_tables = 0;
+ tlb->cleared_ptes = 0;
+ tlb->cleared_pmds = 0;
+ tlb->cleared_puds = 0;
+ tlb->cleared_p4ds = 0;
}

static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
@@ -199,6 +211,20 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
}
#endif

+static inline unsigned long tlb_get_unmap_granule(struct mmu_gather *tlb)
+{
+ if (tlb->cleared_ptes)
+ return PAGE_SIZE;
+ if (tlb->cleared_pmds)
+ return PMD_SIZE;
+ if (tlb->cleared_puds)
+ return PUD_SIZE;
+ if (tlb->cleared_p4ds)
+ return P4D_SIZE;
+
+ return PAGE_SIZE;
+}
+
/*
* In the case of tlb vma handling, we can optimise these away in the
* case where we're doing a full MM flush. When we're doing a munmap,
@@ -232,13 +258,19 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
#define tlb_remove_tlb_entry(tlb, ptep, address) \
do { \
__tlb_adjust_range(tlb, address, PAGE_SIZE); \
+ tlb->cleared_ptes = 1; \
__tlb_remove_tlb_entry(tlb, ptep, address); \
} while (0)

-#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address) \
- do { \
- __tlb_adjust_range(tlb, address, huge_page_size(h)); \
- __tlb_remove_tlb_entry(tlb, ptep, address); \
+#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address) \
+ do { \
+ unsigned long _sz = huge_page_size(h); \
+ __tlb_adjust_range(tlb, address, _sz); \
+ if (_sz == PMD_SIZE) \
+ tlb->cleared_pmds = 1; \
+ else if (_sz == PUD_SIZE) \
+ tlb->cleared_puds = 1; \
+ __tlb_remove_tlb_entry(tlb, ptep, address); \
} while (0)

/**
@@ -252,6 +284,7 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
#define tlb_remove_pmd_tlb_entry(tlb, pmdp, address) \
do { \
__tlb_adjust_range(tlb, address, HPAGE_PMD_SIZE); \
+ tlb->cleared_pmds = 1; \
__tlb_remove_pmd_tlb_entry(tlb, pmdp, address); \
} while (0)

@@ -266,6 +299,7 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
#define tlb_remove_pud_tlb_entry(tlb, pudp, address) \
do { \
__tlb_adjust_range(tlb, address, HPAGE_PUD_SIZE); \
+ tlb->cleared_puds = 1; \
__tlb_remove_pud_tlb_entry(tlb, pudp, address); \
} while (0)

@@ -291,7 +325,8 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
#define pte_free_tlb(tlb, ptep, address) \
do { \
__tlb_adjust_range(tlb, address, PAGE_SIZE); \
- tlb->freed_tables = 1; \
+ tlb->freed_tables = 1; \
+ tlb->cleared_pmds = 1; \
__pte_free_tlb(tlb, ptep, address); \
} while (0)
#endif
@@ -300,7 +335,8 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
#define pmd_free_tlb(tlb, pmdp, address) \
do { \
__tlb_adjust_range(tlb, address, PAGE_SIZE); \
- tlb->freed_tables = 1; \
+ tlb->freed_tables = 1; \
+ tlb->cleared_puds = 1; \
__pmd_free_tlb(tlb, pmdp, address); \
} while (0)
#endif
@@ -310,7 +346,8 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
#define pud_free_tlb(tlb, pudp, address) \
do { \
__tlb_adjust_range(tlb, address, PAGE_SIZE); \
- tlb->freed_tables = 1; \
+ tlb->freed_tables = 1; \
+ tlb->cleared_p4ds = 1; \
__pud_free_tlb(tlb, pudp, address); \
} while (0)
#endif
@@ -321,7 +358,7 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
#define p4d_free_tlb(tlb, pudp, address) \
do { \
__tlb_adjust_range(tlb, address, PAGE_SIZE); \
- tlb->freed_tables = 1; \
+ tlb->freed_tables = 1; \
__p4d_free_tlb(tlb, pudp, address); \
} while (0)
#endif
diff --git a/mm/memory.c b/mm/memory.c
index 83aef222f11b..69982d1d9b7a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -267,8 +267,10 @@ void arch_tlb_finish_mmu(struct mmu_gather *tlb,
{
struct mmu_gather_batch *batch, *next;

- if (force)
+ if (force) {
+ __tlb_reset_range(tlb);
__tlb_adjust_range(tlb, start, end - start);
+ }

tlb_flush_mmu(tlb);

--
2.1.4


2018-08-24 15:54:47

by Will Deacon

Subject: [RFC PATCH 05/11] arm64: tlbflush: Allow stride to be specified for __flush_tlb_range()

When we are unmapping intermediate page-table entries or huge pages, we
don't need to issue a TLBI instruction for every PAGE_SIZE chunk in the
VA range being unmapped.

Allow the invalidation stride to be passed to __flush_tlb_range(), and
adjust our "just nuke the ASID" heuristic to take this into account.

Signed-off-by: Will Deacon <[email protected]>
---
arch/arm64/include/asm/tlb.h | 2 +-
arch/arm64/include/asm/tlbflush.h | 11 +++++++----
2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
index a3233167be60..1e1f68ce28f4 100644
--- a/arch/arm64/include/asm/tlb.h
+++ b/arch/arm64/include/asm/tlb.h
@@ -53,7 +53,7 @@ static inline void tlb_flush(struct mmu_gather *tlb)
* the __(pte|pmd|pud)_free_tlb() functions, so last level
* TLBI is sufficient here.
*/
- __flush_tlb_range(&vma, tlb->start, tlb->end, true);
+ __flush_tlb_range(&vma, tlb->start, tlb->end, PAGE_SIZE, true);
}

static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index ddbf1718669d..1f77d08e638b 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -153,12 +153,15 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,

static inline void __flush_tlb_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end,
- bool last_level)
+ unsigned long stride, bool last_level)
{
unsigned long asid = ASID(vma->vm_mm);
unsigned long addr;

- if ((end - start) > MAX_TLB_RANGE) {
+ /* Convert the stride into units of 4k */
+ stride >>= 12;
+
+ if ((end - start) > (MAX_TLB_RANGE * stride)) {
flush_tlb_mm(vma->vm_mm);
return;
}
@@ -167,7 +170,7 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
end = __TLBI_VADDR(end, asid);

dsb(ishst);
- for (addr = start; addr < end; addr += 1 << (PAGE_SHIFT - 12)) {
+ for (addr = start; addr < end; addr += stride) {
if (last_level) {
__tlbi(vale1is, addr);
__tlbi_user(vale1is, addr);
@@ -186,7 +189,7 @@ static inline void flush_tlb_range(struct vm_area_struct *vma,
* We cannot use leaf-only invalidation here, since we may be invalidating
* table entries as part of collapsing hugepages or moving page tables.
*/
- __flush_tlb_range(vma, start, end, false);
+ __flush_tlb_range(vma, start, end, PAGE_SIZE, false);
}

static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end)
--
2.1.4


2018-08-24 15:54:57

by Will Deacon

Subject: [RFC PATCH 07/11] asm-generic/tlb: Guard with #ifdef CONFIG_MMU

The inner workings of the mmu_gather-based TLB invalidation mechanism
are not relevant to nommu configurations, so guard them with an ifdef.
This allows us to implement future functions using static inlines
without breaking the build.

Signed-off-by: Will Deacon <[email protected]>
---
include/asm-generic/tlb.h | 3 +++
1 file changed, 3 insertions(+)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index b3353e21f3b3..859f897d3d53 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -20,6 +20,8 @@
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>

+#ifdef CONFIG_MMU
+
#ifdef CONFIG_HAVE_RCU_TABLE_FREE
/*
* Semi RCU freeing of the page directories.
@@ -312,4 +314,5 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,

#define tlb_migrate_finish(mm) do {} while (0)

+#endif /* CONFIG_MMU */
#endif /* _ASM_GENERIC__TLB_H */
--
2.1.4


2018-08-24 15:55:05

by Will Deacon

Subject: [RFC PATCH 06/11] arm64: tlb: Remove redundant !CONFIG_HAVE_RCU_TABLE_FREE code

If there's one thing the RCU-based table freeing doesn't need, it's more
ifdeffery.

Remove the redundant !CONFIG_HAVE_RCU_TABLE_FREE code, since this option
is unconditionally selected in our Kconfig.

Signed-off-by: Will Deacon <[email protected]>
---
arch/arm64/include/asm/tlb.h | 12 +++---------
1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
index 1e1f68ce28f4..bd00017d529a 100644
--- a/arch/arm64/include/asm/tlb.h
+++ b/arch/arm64/include/asm/tlb.h
@@ -22,16 +22,10 @@
#include <linux/pagemap.h>
#include <linux/swap.h>

-#ifdef CONFIG_HAVE_RCU_TABLE_FREE
-
-#define tlb_remove_entry(tlb, entry) tlb_remove_table(tlb, entry)
static inline void __tlb_remove_table(void *_table)
{
free_page_and_swap_cache((struct page *)_table);
}
-#else
-#define tlb_remove_entry(tlb, entry) tlb_remove_page(tlb, entry)
-#endif /* CONFIG_HAVE_RCU_TABLE_FREE */

static void tlb_flush(struct mmu_gather *tlb);

@@ -61,7 +55,7 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
{
__flush_tlb_pgtable(tlb->mm, addr);
pgtable_page_dtor(pte);
- tlb_remove_entry(tlb, pte);
+ tlb_remove_table(tlb, pte);
}

#if CONFIG_PGTABLE_LEVELS > 2
@@ -69,7 +63,7 @@ static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
unsigned long addr)
{
__flush_tlb_pgtable(tlb->mm, addr);
- tlb_remove_entry(tlb, virt_to_page(pmdp));
+ tlb_remove_table(tlb, virt_to_page(pmdp));
}
#endif

@@ -78,7 +72,7 @@ static inline void __pud_free_tlb(struct mmu_gather *tlb, pud_t *pudp,
unsigned long addr)
{
__flush_tlb_pgtable(tlb->mm, addr);
- tlb_remove_entry(tlb, virt_to_page(pudp));
+ tlb_remove_table(tlb, virt_to_page(pudp));
}
#endif

--
2.1.4


2018-08-24 15:55:12

by Will Deacon

Subject: [RFC PATCH 02/11] arm64: tlb: Add DSB ISHST prior to TLBI in __flush_tlb_[kernel_]pgtable()

__flush_tlb_[kernel_]pgtable() rely on set_pXd() having a DSB after
writing the new table entry and therefore avoid the barrier prior to the
TLBI instruction.

In preparation for delaying our walk-cache invalidation on the unmap()
path, move the DSB into the TLB invalidation routines.

Signed-off-by: Will Deacon <[email protected]>
---
arch/arm64/include/asm/tlbflush.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 7e2a35424ca4..e257f8655b84 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -213,6 +213,7 @@ static inline void __flush_tlb_pgtable(struct mm_struct *mm,
{
unsigned long addr = __TLBI_VADDR(uaddr, ASID(mm));

+ dsb(ishst);
__tlbi(vae1is, addr);
__tlbi_user(vae1is, addr);
dsb(ish);
@@ -222,6 +223,7 @@ static inline void __flush_tlb_kernel_pgtable(unsigned long kaddr)
{
unsigned long addr = __TLBI_VADDR(kaddr, 0);

+ dsb(ishst);
__tlbi(vaae1is, addr);
dsb(ish);
}
--
2.1.4


2018-08-24 15:55:18

by Will Deacon

Subject: [RFC PATCH 03/11] arm64: pgtable: Implement p[mu]d_valid() and check in set_p[mu]d()

Now that our walk-cache invalidation routines imply a DSB before the
invalidation, we no longer need one when we are clearing an entry during
unmap.

Signed-off-by: Will Deacon <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 1bdeca8918a6..2ab2031b778c 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -360,6 +360,7 @@ static inline int pmd_protnone(pmd_t pmd)
#define pmd_present(pmd) pte_present(pmd_pte(pmd))
#define pmd_dirty(pmd) pte_dirty(pmd_pte(pmd))
#define pmd_young(pmd) pte_young(pmd_pte(pmd))
+#define pmd_valid(pmd) pte_valid(pmd_pte(pmd))
#define pmd_wrprotect(pmd) pte_pmd(pte_wrprotect(pmd_pte(pmd)))
#define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd)))
#define pmd_mkwrite(pmd) pte_pmd(pte_mkwrite(pmd_pte(pmd)))
@@ -431,7 +432,9 @@ extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
{
WRITE_ONCE(*pmdp, pmd);
- dsb(ishst);
+
+ if (pmd_valid(pmd))
+ dsb(ishst);
}

static inline void pmd_clear(pmd_t *pmdp)
@@ -477,11 +480,14 @@ static inline phys_addr_t pmd_page_paddr(pmd_t pmd)
#define pud_none(pud) (!pud_val(pud))
#define pud_bad(pud) (!(pud_val(pud) & PUD_TABLE_BIT))
#define pud_present(pud) pte_present(pud_pte(pud))
+#define pud_valid(pud) pte_valid(pud_pte(pud))

static inline void set_pud(pud_t *pudp, pud_t pud)
{
WRITE_ONCE(*pudp, pud);
- dsb(ishst);
+
+ if (pud_valid(pud))
+ dsb(ishst);
}

static inline void pud_clear(pud_t *pudp)
--
2.1.4


2018-08-24 15:55:29

by Will Deacon

Subject: [RFC PATCH 04/11] arm64: tlb: Justify non-leaf invalidation in flush_tlb_range()

Add a comment to explain why we can't get away with last-level
invalidation in flush_tlb_range().

Signed-off-by: Will Deacon <[email protected]>
---
arch/arm64/include/asm/tlbflush.h | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index e257f8655b84..ddbf1718669d 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -182,6 +182,10 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
static inline void flush_tlb_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
+ /*
+ * We cannot use leaf-only invalidation here, since we may be invalidating
+ * table entries as part of collapsing hugepages or moving page tables.
+ */
__flush_tlb_range(vma, start, end, false);
}

--
2.1.4


2018-08-24 15:55:52

by Will Deacon

Subject: [RFC PATCH 10/11] arm64: tlb: Adjust stride and type of TLBI according to mmu_gather

Now that the core mmu_gather code keeps track of both the levels of page
table cleared and also whether or not these entries correspond to
intermediate entries, we can use this in our tlb_flush() callback to
reduce the number of invalidations we issue as well as their scope.

Signed-off-by: Will Deacon <[email protected]>
---
arch/arm64/include/asm/tlb.h | 19 ++++++++++---------
1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
index bd00017d529a..baca8dff6884 100644
--- a/arch/arm64/include/asm/tlb.h
+++ b/arch/arm64/include/asm/tlb.h
@@ -34,20 +34,21 @@ static void tlb_flush(struct mmu_gather *tlb);
static inline void tlb_flush(struct mmu_gather *tlb)
{
struct vm_area_struct vma = TLB_FLUSH_VMA(tlb->mm, 0);
+ bool last_level = !tlb->freed_tables;
+ unsigned long stride = tlb_get_unmap_granule(tlb);

/*
- * The ASID allocator will either invalidate the ASID or mark
- * it as used.
+ * If we're tearing down the address space then we only care about
+ * invalidating the walk-cache, since the ASID allocator won't
+ * reallocate our ASID without invalidating the entire TLB.
*/
- if (tlb->fullmm)
+ if (tlb->fullmm) {
+ if (!last_level)
+ flush_tlb_mm(tlb->mm);
return;
+ }

- /*
- * The intermediate page table levels are already handled by
- * the __(pte|pmd|pud)_free_tlb() functions, so last level
- * TLBI is sufficient here.
- */
- __flush_tlb_range(&vma, tlb->start, tlb->end, PAGE_SIZE, true);
+ __flush_tlb_range(&vma, tlb->start, tlb->end, stride, last_level);
}

static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
--
2.1.4


2018-08-24 15:55:52

by Will Deacon

Subject: [RFC PATCH 01/11] arm64: tlb: Use last-level invalidation in flush_tlb_kernel_range()

flush_tlb_kernel_range() is only ever used to invalidate last-level
entries, so we can restrict the scope of the TLB invalidation
instruction.

Signed-off-by: Will Deacon <[email protected]>
---
arch/arm64/include/asm/tlbflush.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index a4a1901140ee..7e2a35424ca4 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -199,7 +199,7 @@ static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end

dsb(ishst);
for (addr = start; addr < end; addr += 1 << (PAGE_SHIFT - 12))
- __tlbi(vaae1is, addr);
+ __tlbi(vaale1is, addr);
dsb(ish);
isb();
}
--
2.1.4


2018-08-24 15:56:15

by Will Deacon

Subject: [RFC PATCH 11/11] arm64: tlb: Avoid synchronous TLBIs when freeing page tables

By selecting HAVE_RCU_TABLE_INVALIDATE, we can rely on tlb_flush() being
called if we fail to batch table pages for freeing. This in turn allows
us to postpone walk-cache invalidation until tlb_finish_mmu(), which
avoids lots of unnecessary DSBs and means we can shoot down the ASID if
the range is large enough.

Signed-off-by: Will Deacon <[email protected]>
---
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/tlb.h | 3 ---
arch/arm64/include/asm/tlbflush.h | 11 -----------
3 files changed, 1 insertion(+), 14 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 29e75b47becd..89059ee1eccc 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -142,6 +142,7 @@ config ARM64
select HAVE_PERF_USER_STACK_DUMP
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RCU_TABLE_FREE
+ select HAVE_RCU_TABLE_INVALIDATE
select HAVE_RSEQ
select HAVE_STACKPROTECTOR
select HAVE_SYSCALL_TRACEPOINTS
diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
index baca8dff6884..4f65614b3141 100644
--- a/arch/arm64/include/asm/tlb.h
+++ b/arch/arm64/include/asm/tlb.h
@@ -54,7 +54,6 @@ static inline void tlb_flush(struct mmu_gather *tlb)
static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
unsigned long addr)
{
- __flush_tlb_pgtable(tlb->mm, addr);
pgtable_page_dtor(pte);
tlb_remove_table(tlb, pte);
}
@@ -63,7 +62,6 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
unsigned long addr)
{
- __flush_tlb_pgtable(tlb->mm, addr);
tlb_remove_table(tlb, virt_to_page(pmdp));
}
#endif
@@ -72,7 +70,6 @@ static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
static inline void __pud_free_tlb(struct mmu_gather *tlb, pud_t *pudp,
unsigned long addr)
{
- __flush_tlb_pgtable(tlb->mm, addr);
tlb_remove_table(tlb, virt_to_page(pudp));
}
#endif
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 1f77d08e638b..2064ba97845f 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -215,17 +215,6 @@ static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end
* Used to invalidate the TLB (walk caches) corresponding to intermediate page
* table levels (pgd/pud/pmd).
*/
-static inline void __flush_tlb_pgtable(struct mm_struct *mm,
- unsigned long uaddr)
-{
- unsigned long addr = __TLBI_VADDR(uaddr, ASID(mm));
-
- dsb(ishst);
- __tlbi(vae1is, addr);
- __tlbi_user(vae1is, addr);
- dsb(ish);
-}
-
static inline void __flush_tlb_kernel_pgtable(unsigned long kaddr)
{
unsigned long addr = __TLBI_VADDR(kaddr, 0);
--
2.1.4


2018-08-24 16:17:21

by Linus Torvalds

Subject: Re: [RFC PATCH 03/11] arm64: pgtable: Implement p[mu]d_valid() and check in set_p[mu]d()

On Fri, Aug 24, 2018 at 8:52 AM Will Deacon <[email protected]> wrote:
>
> Now that our walk-cache invalidation routines imply a DSB before the
> invalidation, we no longer need one when we are clearing an entry during
> unmap.

Do you really still need it when *setting* it?

I'm wondering if you could just remove the thing unconditionally.

Why would you need a barrier for another CPU for a mapping that is
just being created? It's ok if they see the old lack of mapping until
they are told about it, and that eventual "being told about it" must
involve a data transfer already.

And I'm assuming arm doesn't cache negative page table entries, so
there's no issue with any stale tlb.

And any other kernel thread looking at the page tables will have to
honor the page table locking, so you don't need it for some direct
page table lookup either.

Hmm? It seems like you shouldn't need to order the "set page directory
entry" with anything.

But maybe there's some magic arm64 rule I'm not aware of. Maybe even
the local TLB hardware walker isn't coherent with local stores?

Linus

2018-08-24 16:21:51

by Linus Torvalds

Subject: Re: [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64

On Fri, Aug 24, 2018 at 8:52 AM Will Deacon <[email protected]> wrote:
>
> I hacked up this RFC on the back of the recent changes to the mmu_gather
> stuff in mainline. It's had a bit of testing and it looks pretty good so
> far.

Looks good to me.

Apart from the arm64-specific question I had, I wonder whether we need
to have that single "freed_tables" bit at all, since you wanted to
have the four individual bits for the different levels.

Even if somebody doesn't care about the individual bits, it's
generally exactly as cheap to test four bits as it is to test one, so
it seems unnecessary to have that summary bit.

Linus

2018-08-24 17:58:53

by Peter Zijlstra

Subject: Re: [RFC PATCH 02/11] arm64: tlb: Add DSB ISHST prior to TLBI in __flush_tlb_[kernel_]pgtable()

On Fri, Aug 24, 2018 at 04:52:37PM +0100, Will Deacon wrote:
> __flush_tlb_[kernel_]pgtable() rely on set_pXd() having a DSB after
> writing the new table entry and therefore avoid the barrier prior to the
> TLBI instruction.
>
> In preparation for delaying our walk-cache invalidation on the unmap()
> path, move the DSB into the TLB invalidation routines.
>
> Signed-off-by: Will Deacon <[email protected]>
> ---
> arch/arm64/include/asm/tlbflush.h | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 7e2a35424ca4..e257f8655b84 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -213,6 +213,7 @@ static inline void __flush_tlb_pgtable(struct mm_struct *mm,
> {
> unsigned long addr = __TLBI_VADDR(uaddr, ASID(mm));
>
> + dsb(ishst);
> __tlbi(vae1is, addr);
> __tlbi_user(vae1is, addr);
> dsb(ish);
> @@ -222,6 +223,7 @@ static inline void __flush_tlb_kernel_pgtable(unsigned long kaddr)
> {
> unsigned long addr = __TLBI_VADDR(kaddr, 0);
>
> + dsb(ishst);
> __tlbi(vaae1is, addr);
> dsb(ish);
> }

I would suggest these barriers -- like any other barriers -- carry a
comment that explains the required ordering.

I think this here reads like:

STORE: unhook page

DSB-ishst: wait for all stores to complete
TLBI: invalidate broadcast
DSB-ish: wait for TLBI to complete

And the 'newly' placed DSB-ishst ensures the page is observed unlinked
before we issue the invalidate.
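
For illustration, here is how the kernel-pgtable helper from the hunk above might read with those comments added (a sketch only):

static inline void __flush_tlb_kernel_pgtable(unsigned long kaddr)
{
	unsigned long addr = __TLBI_VADDR(kaddr, 0);

	/* Ensure the page-table update is observed before the invalidation */
	dsb(ishst);
	/* Broadcast invalidation of the cached walk for this address */
	__tlbi(vaae1is, addr);
	/* Wait for the invalidation to complete before returning */
	dsb(ish);
}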

2018-08-26 10:59:22

by Peter Zijlstra

Subject: Re: [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64

On Fri, Aug 24, 2018 at 09:20:00AM -0700, Linus Torvalds wrote:
> On Fri, Aug 24, 2018 at 8:52 AM Will Deacon <[email protected]> wrote:
> >
> > I hacked up this RFC on the back of the recent changes to the mmu_gather
> > stuff in mainline. It's had a bit of testing and it looks pretty good so
> > far.
>
> Looks good to me.
>
> Apart from the arm64-specific question I had, I wonder whether we need
> to have that single "freed_tables" bit at all, since you wanted to
> have the four individual bits for the different levels.

I think so, because he also sets those size bits for things like hugetlb
and thp user page frees, not only table page frees. So they're not
exactly the same.

And I think x86 could use this too; if we know we only freed 2M
pages, we can use that in flush_tlb_mm_range() to range flush in 2M
increments instead of 4K.

Something a little like so..

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index cb0a1f470980..cb0898fe9d37 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -8,10 +8,15 @@

#define tlb_flush(tlb) \
{ \
- if (!tlb->fullmm && !tlb->need_flush_all) \
- flush_tlb_mm_range(tlb->mm, tlb->start, tlb->end, 0UL); \
- else \
- flush_tlb_mm_range(tlb->mm, 0UL, TLB_FLUSH_ALL, 0UL); \
+ unsigned long start = 0UL, end = TLB_FLUSH_ALL; \
+ unsigned int invl_shift = tlb_get_unmap_shift(tlb); \
+ \
+ if (!tlb->fullmm && !tlb->need_flush_all) { \
+ start = tlb->start; \
+ end = tlb->end; \
+ } \
+ \
+ flush_tlb_mm_range(tlb->mm, start, end, invl_shift); \
}

#include <asm-generic/tlb.h>
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 511bf5fae8b8..8ac1cac34f63 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -491,23 +491,25 @@ struct flush_tlb_info {
unsigned long start;
unsigned long end;
u64 new_tlb_gen;
+ unsigned int invl_shift;
};

#define local_flush_tlb() __flush_tlb()

#define flush_tlb_mm(mm) flush_tlb_mm_range(mm, 0UL, TLB_FLUSH_ALL, 0UL)

-#define flush_tlb_range(vma, start, end) \
- flush_tlb_mm_range(vma->vm_mm, start, end, vma->vm_flags)
+#define flush_tlb_range(vma, start, end) \
+ flush_tlb_mm_range(vma->vm_mm, start, end, \
+ vma->vm_flags & VM_HUGETLB ? PMD_SHIFT : PAGE_SHIFT)

extern void flush_tlb_all(void);
extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
- unsigned long end, unsigned long vmflag);
+ unsigned long end, unsigned int invl_shift);
extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);

static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
{
- flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, VM_NONE);
+ flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT);
}

void native_flush_tlb_others(const struct cpumask *cpumask,
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 752dbf4e0e50..806aa74a8fb4 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -537,12 +537,12 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
f->new_tlb_gen == mm_tlb_gen) {
/* Partial flush */
unsigned long addr;
- unsigned long nr_pages = (f->end - f->start) >> PAGE_SHIFT;
+ unsigned long nr_pages = (f->end - f->start) >> f->invl_shift;

addr = f->start;
while (addr < f->end) {
__flush_tlb_one_user(addr);
- addr += PAGE_SIZE;
+ addr += 1UL << f->invl_shift;
}
if (local)
count_vm_tlb_events(NR_TLB_LOCAL_FLUSH_ONE, nr_pages);
@@ -653,12 +653,13 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;

void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
- unsigned long end, unsigned long vmflag)
+ unsigned long end, unsigned int invl_shift)
{
int cpu;

struct flush_tlb_info info __aligned(SMP_CACHE_BYTES) = {
.mm = mm,
+ .invl_shift = invl_shift,
};

cpu = get_cpu();
@@ -668,8 +669,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,

/* Should we flush just the requested range? */
if ((end != TLB_FLUSH_ALL) &&
- !(vmflag & VM_HUGETLB) &&
- ((end - start) >> PAGE_SHIFT) <= tlb_single_page_flush_ceiling) {
+ ((end - start) >> invl_shift) <= tlb_single_page_flush_ceiling) {
info.start = start;
info.end = end;
} else {
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index e811ef7b8350..cdde0cdb23e7 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -175,6 +200,25 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
}
#endif

+static inline unsigned long tlb_get_unmap_shift(struct mmu_gather *tlb)
+{
+ if (tlb->cleared_ptes)
+ return PAGE_SHIFT;
+ if (tlb->cleared_pmds)
+ return PMD_SHIFT;
+ if (tlb->cleared_puds)
+ return PUD_SHIFT;
+ if (tlb->cleared_p4ds)
+ return P4D_SHIFT;
+
+ return PAGE_SHIFT;
+}
+
+static inline unsigned long tlb_get_unmap_size(struct mmu_gather *tlb)
+{
+ return 1ULL << tlb_get_unmap_shift(tlb);
+}
+
/*
* In the case of tlb vma handling, we can optimise these away in the
* case where we're doing a full MM flush. When we're doing a munmap,

2018-08-27 04:46:27

by Nicholas Piggin

Subject: Re: [RFC PATCH 08/11] asm-generic/tlb: Track freeing of page-table directories in struct mmu_gather

On Fri, 24 Aug 2018 16:52:43 +0100
Will Deacon <[email protected]> wrote:

> From: Peter Zijlstra <[email protected]>
>
> Some architectures require different TLB invalidation instructions
> depending on whether it is only the last-level of page table being
> changed, or whether there are also changes to the intermediate
> (directory) entries higher up the tree.
>
> Add a new bit to the flags bitfield in struct mmu_gather so that the
> architecture code can operate accordingly if it's the intermediate
> levels being invalidated.
>
> Signed-off-by: Peter Zijlstra <[email protected]>
> Signed-off-by: Will Deacon <[email protected]>

powerpc should be able to move right over to using this rather
than keeping the bit in need_flush_all.

powerpc may be able to use the unmap granule thing to improve
its page size dependent flushes, but it might prefer to go
a different way and track start-end for different page sizes.
I wonder how much of that stuff should go into generic code,
and whether it should instead go into a struct arch_mmu_gather.

Thanks,
Nick

2018-08-27 07:55:57

by Peter Zijlstra

Subject: Re: [RFC PATCH 09/11] asm-generic/tlb: Track which levels of the page tables have been cleared

On Fri, Aug 24, 2018 at 04:52:44PM +0100, Will Deacon wrote:
> +static inline unsigned long tlb_get_unmap_granule(struct mmu_gather *tlb)
> +{
> + if (tlb->cleared_ptes)
> + return PAGE_SIZE;
> + if (tlb->cleared_pmds)
> + return PMD_SIZE;
> + if (tlb->cleared_puds)
> + return PUD_SIZE;
> + if (tlb->cleared_p4ds)
> + return P4D_SIZE;
> +
> + return PAGE_SIZE;
> +}

When doing the x86 patch; I found _SHIFT more useful than _SIZE.

2018-08-28 12:50:54

by Will Deacon

Subject: Re: [RFC PATCH 03/11] arm64: pgtable: Implement p[mu]d_valid() and check in set_p[mu]d()

Hi Linus,

On Fri, Aug 24, 2018 at 09:15:17AM -0700, Linus Torvalds wrote:
> On Fri, Aug 24, 2018 at 8:52 AM Will Deacon <[email protected]> wrote:
> >
> > Now that our walk-cache invalidation routines imply a DSB before the
> > invalidation, we no longer need one when we are clearing an entry during
> > unmap.
>
> Do you really still need it when *setting* it?
>
> I'm wondering if you could just remove the thing unconditionally.
>
> Why would you need a barrier for another CPU for a mapping that is
> just being created? It's ok if they see the old lack of mapping until
> they are told about it, and that eventual "being told about it" must
> involve a data transfer already.
>
> And I'm assuming arm doesn't cache negative page table entries, so
> there's no issue with any stale tlb.
>
> And any other kernel thread looking at the page tables will have to
> honor the page table locking, so you don't need it for some direct
> page table lookup either.
>
> Hmm? It seems like you shouldn't need to order the "set page directory
> entry" with anything.
>
> But maybe there's some magic arm64 rule I'm not aware of. Maybe even
> the local TLB hardware walker isn't coherent with local stores?

Yup, you got it: it's not related to ordering of accesses by other CPUs, but
actually because the page-table walker is treated as a separate observer by
the architecture and therefore we need the DSB to push out the store to the
page-table so that the walker can see it (practically speaking, the walker
isn't guaranteed to snoop the store buffer).

For PTEs mapping user addresses, we actually don't bother with the DSB
when writing a valid entry because it's extremely unlikely that we'd get
back to userspace with the entry sitting in the store buffer. If that
*did* happen, we'd just take the fault a second time. However, if we played
that same trick for pXds, I think that:

(a) We'd need to distinguish between user and kernel mappings
in set_pXd(), since we can't tolerate spurious faults on
kernel addresses.
(b) We'd need to be careful about allocating page-table pages,
so that e.g. the walker sees zeroes for a new pgtable

We could probably achieve (a) with a software bit and (b) is a non-issue
because mm/memory.c uses smp_wmb(), which is always a DMB for us (which
will enforce the eventual ordering but doesn't necessarily publish the
stores immediately).
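
For comparison, the user-PTE trick mentioned above looks roughly like this in our set_pte() (paraphrased, so details may differ; pte_valid_not_user() is true only for valid kernel mappings):

static inline void set_pte(pte_t *ptep, pte_t pte)
{
	WRITE_ONCE(*ptep, pte);

	/*
	 * Only publish the store with a DSB for valid kernel mappings;
	 * user mappings can tolerate taking the fault a second time.
	 */
	if (pte_valid_not_user(pte))
		dsb(ishst);
}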

Will

2018-08-28 13:05:23

by Will Deacon

Subject: Re: [RFC PATCH 02/11] arm64: tlb: Add DSB ISHST prior to TLBI in __flush_tlb_[kernel_]pgtable()

On Fri, Aug 24, 2018 at 07:56:09PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 24, 2018 at 04:52:37PM +0100, Will Deacon wrote:
> > __flush_tlb_[kernel_]pgtable() rely on set_pXd() having a DSB after
> > writing the new table entry and therefore avoid the barrier prior to the
> > TLBI instruction.
> >
> > In preparation for delaying our walk-cache invalidation on the unmap()
> > path, move the DSB into the TLB invalidation routines.
> >
> > Signed-off-by: Will Deacon <[email protected]>
> > ---
> > arch/arm64/include/asm/tlbflush.h | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> > index 7e2a35424ca4..e257f8655b84 100644
> > --- a/arch/arm64/include/asm/tlbflush.h
> > +++ b/arch/arm64/include/asm/tlbflush.h
> > @@ -213,6 +213,7 @@ static inline void __flush_tlb_pgtable(struct mm_struct *mm,
> > {
> > unsigned long addr = __TLBI_VADDR(uaddr, ASID(mm));
> >
> > + dsb(ishst);
> > __tlbi(vae1is, addr);
> > __tlbi_user(vae1is, addr);
> > dsb(ish);
> > @@ -222,6 +223,7 @@ static inline void __flush_tlb_kernel_pgtable(unsigned long kaddr)
> > {
> > unsigned long addr = __TLBI_VADDR(kaddr, 0);
> >
> > + dsb(ishst);
> > __tlbi(vaae1is, addr);
> > dsb(ish);
> > }
>
> I would suggest these barriers -- like any other barriers -- carry a
> comment that explains the required ordering.
>
> I think this here reads like:
>
> STORE: unhook page
>
> DSB-ishst: wait for all stores to complete
> TLBI: invalidate broadcast
> DSB-ish: wait for TLBI to complete
>
> And the 'newly' placed DSB-ishst ensures the page is observed unlinked
> before we issue the invalidate.

Yeah, not a bad idea. We already have a big block comment in this file but
it's desperately out of date, so lemme rewrite that and justify the barriers
at the same time.

Will

2018-08-28 13:14:05

by Will Deacon

Subject: Re: [RFC PATCH 09/11] asm-generic/tlb: Track which levels of the page tables have been cleared

On Mon, Aug 27, 2018 at 09:53:41AM +0200, Peter Zijlstra wrote:
> On Fri, Aug 24, 2018 at 04:52:44PM +0100, Will Deacon wrote:
> > +static inline unsigned long tlb_get_unmap_granule(struct mmu_gather *tlb)
> > +{
> > + if (tlb->cleared_ptes)
> > + return PAGE_SIZE;
> > + if (tlb->cleared_pmds)
> > + return PMD_SIZE;
> > + if (tlb->cleared_puds)
> > + return PUD_SIZE;
> > + if (tlb->cleared_p4ds)
> > + return P4D_SIZE;
> > +
> > + return PAGE_SIZE;
> > +}
>
> When doing the x86 patch; I found _SHIFT more useful than _SIZE.

I'll make that change for the next version.

Cheers,

Will

2018-08-28 13:48:16

by Peter Zijlstra

Subject: Re: [RFC PATCH 08/11] asm-generic/tlb: Track freeing of page-table directories in struct mmu_gather

On Mon, Aug 27, 2018 at 02:44:57PM +1000, Nicholas Piggin wrote:

> powerpc may be able to use the unmap granule thing to improve
> its page size dependent flushes, but it might prefer to go
> a different way and track start-end for different page sizes.

I don't really see how tracking multiple ranges would help much with
THP. The ranges would end up being almost the same if there is a good
mix of page sizes.

But something like:

void tlb_flush_one(struct mmu_gather *tlb, unsigned long addr)
{
if (tlb->cleared_ptes && (addr << BITS_PER_LONG - PAGE_SHIFT))
tblie_pte(addr);
if (tlb->cleared_pmds && (addr << BITS_PER_LONG - PMD_SHIFT))
tlbie_pmd(addr);
if (tlb->cleared_puds && (addr << BITS_PER_LONG - PUD_SHIFT))
tlbie_pud(addr);
}

void tlb_flush_range(struct mmu_gather *tlb)
{
unsigned long stride = 1UL << tlb_get_unmap_shift(tlb);
unsigned long addr;

for (addr = tlb->start; addr < tlb->end; addr += stride)
tlb_flush_one(tlb, addr);

ptesync();
}

Should work, I think. You'll only issue multiple TLBIEs on the
boundaries, not every stride.

And for hugetlb the above should be optimal, since stride and
tlb->cleared_* match up 1:1.


2018-08-28 13:51:04

by Peter Zijlstra

Subject: Re: [RFC PATCH 08/11] asm-generic/tlb: Track freeing of page-table directories in struct mmu_gather

On Tue, Aug 28, 2018 at 03:46:38PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 27, 2018 at 02:44:57PM +1000, Nicholas Piggin wrote:
>
> > powerpc may be able to use the unmap granule thing to improve
> > its page size dependent flushes, but it might prefer to go
> > a different way and track start-end for different page sizes.
>
> I don't really see how tracking multiple ranges would help much with
> THP. The ranges would end up being almost the same if there is a good
> mix of page sizes.
>
> But something like:
>
> void tlb_flush_one(struct mmu_gather *tlb, unsigned long addr)
> {
> if (tlb->cleared_ptes && (addr << BITS_PER_LONG - PAGE_SHIFT))
> tblie_pte(addr);
> if (tlb->cleared_pmds && (addr << BITS_PER_LONG - PMD_SHIFT))
> tlbie_pmd(addr);
> if (tlb->cleared_puds && (addr << BITS_PER_LONG - PUD_SHIFT))
> tlbie_pud(addr);
> }

Sorry, those all should (of course) be !(addr << ...).

> void tlb_flush_range(struct mmu_gather *tlb)
> {
> unsigned long stride = 1UL << tlb_get_unmap_shift(tlb);
> unsigned long addr;
>
> for (addr = tlb->start; addr < tlb->end; addr += stride)
> tlb_flush_one(tlb, addr);
>
> ptesync();
> }

And one could; like x86 has; add a threshold above which you just kill
the complete mm.

2018-08-28 14:14:31

by Nicholas Piggin

Subject: Re: [RFC PATCH 08/11] asm-generic/tlb: Track freeing of page-table directories in struct mmu_gather

On Tue, 28 Aug 2018 15:46:38 +0200
Peter Zijlstra <[email protected]> wrote:

> On Mon, Aug 27, 2018 at 02:44:57PM +1000, Nicholas Piggin wrote:
>
> > powerpc may be able to use the unmap granule thing to improve
> > its page size dependent flushes, but it might prefer to go
> > a different way and track start-end for different page sizes.
>
> I don't really see how tracking multiple ranges would help much with
> THP. The ranges would end up being almost the same if there is a good
> mix of page sizes.

That's assuming quite large unmaps. But a lot of the time they are
going to go to a full PID flush.

>
> But something like:
>
> void tlb_flush_one(struct mmu_gather *tlb, unsigned long addr)
> {
> if (tlb->cleared_ptes && (addr << BITS_PER_LONG - PAGE_SHIFT))
> tblie_pte(addr);
> if (tlb->cleared_pmds && (addr << BITS_PER_LONG - PMD_SHIFT))
> tlbie_pmd(addr);
> if (tlb->cleared_puds && (addr << BITS_PER_LONG - PUD_SHIFT))
> tlbie_pud(addr);
> }
>
> void tlb_flush_range(struct mmu_gather *tlb)
> {
> unsigned long stride = 1UL << tlb_get_unmap_shift(tlb);
> unsigned long addr;
>
> for (addr = tlb->start; addr < tlb->end; addr += stride)
> tlb_flush_one(tlb, addr);
>
> ptesync();
> }
>
> Should work, I think. You'll only issue multiple TLBIEs on the
> boundaries, not every stride.

Yeah, we already do basically that today in the flush_tlb_range path,
just without the precise test for which page sizes:

if (hflush) {
hstart = (start + PMD_SIZE - 1) & PMD_MASK;
hend = end & PMD_MASK;
if (hstart == hend)
hflush = false;
}

if (gflush) {
gstart = (start + PUD_SIZE - 1) & PUD_MASK;
gend = end & PUD_MASK;
if (gstart == gend)
gflush = false;
}

asm volatile("ptesync": : :"memory");
if (local) {
__tlbiel_va_range(start, end, pid, page_size, mmu_virtual_psize);
if (hflush)
__tlbiel_va_range(hstart, hend, pid,
PMD_SIZE, MMU_PAGE_2M);
if (gflush)
__tlbiel_va_range(gstart, gend, pid,
PUD_SIZE, MMU_PAGE_1G);
asm volatile("ptesync": : :"memory");

Thing is, I think it's the smallish range cases you want to optimize
for. And for those we'll probably do something even smarter (like keep
a bitmap of pages to flush) because we really want to keep tlbies off
the bus whereas that's less important for x86.

Still not really seeing a reason not to implement a struct
arch_mmu_gather. A little bit of data contained to the arch is nothing
compared with the multitude of hooks and divergence of code.

Thanks,
Nick

2018-09-04 18:39:28

by Jon Masters

Subject: Re: [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64

On 08/24/2018 11:52 AM, Will Deacon wrote:

> I hacked up this RFC on the back of the recent changes to the mmu_gather
> stuff in mainline. It's had a bit of testing and it looks pretty good so
> far.

I will request the server folks go and test this. You'll probably
remember a couple of parts we've seen where aggressive walker caches
ended up (correctly) seeing stale page table entries and we had all
manner of horrifically hard to debug problems. We have some fairly nice
reproducers that were able to find this last time that we can test.

Jon.

--
Computer Architect | Sent from my Fedora powered laptop

2018-09-05 12:30:12

by Will Deacon

Subject: Re: [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64

On Tue, Sep 04, 2018 at 02:38:02PM -0400, Jon Masters wrote:
> On 08/24/2018 11:52 AM, Will Deacon wrote:
>
> > I hacked up this RFC on the back of the recent changes to the mmu_gather
> > stuff in mainline. It's had a bit of testing and it looks pretty good so
> > far.
>
> I will request the server folks go and test this. You'll probably
> remember a couple of parts we've seen where aggressive walker caches
> ended up (correctly) seeing stale page table entries and we had all
> manner of horrifically hard to debug problems. We have some fairly nice
> reproducers that were able to find this last time that we can test.

Cheers, Jon, that would be very helpful. You're probably best off using
my (rebasing) tlb branch rather than picking the RFC:

git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git tlb

Let me know if you'd prefer something stable (I can tag it with a date).

Will

2018-09-07 06:37:35

by Jon Masters

Subject: Re: [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64

On 09/05/2018 08:28 AM, Will Deacon wrote:
> On Tue, Sep 04, 2018 at 02:38:02PM -0400, Jon Masters wrote:
>> On 08/24/2018 11:52 AM, Will Deacon wrote:
>>
>>> I hacked up this RFC on the back of the recent changes to the mmu_gather
>>> stuff in mainline. It's had a bit of testing and it looks pretty good so
>>> far.
>>
>> I will request the server folks go and test this. You'll probably
>> remember a couple of parts we've seen where aggressive walker caches
>> ended up (correctly) seeing stale page table entries and we had all
>> manner of horrifically hard to debug problems. We have some fairly nice
>> reproducers that were able to find this last time that we can test.
>
> Cheers, Jon, that would be very helpful. You're probably best off using
> my (rebasing) tlb branch rather than picking the RFC:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git tlb
>
> Let me know if you'd prefer something stable (I can tag it with a date).

That would be useful. I've prodded each of the Arm server SoC vendors I
work with via our weekly call to have them each specifically check this.
A tag would be helpful to that effort I expect. They all claim to be
watching this thread now, so we'll see if they see cabbages here.

Jon.

--
Computer Architect | Sent from my Fedora powered laptop

2018-09-13 15:53:45

by Will Deacon

Subject: Re: [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64

On Fri, Sep 07, 2018 at 02:36:08AM -0400, Jon Masters wrote:
> On 09/05/2018 08:28 AM, Will Deacon wrote:
> > On Tue, Sep 04, 2018 at 02:38:02PM -0400, Jon Masters wrote:
> >> On 08/24/2018 11:52 AM, Will Deacon wrote:
> >>
> >>> I hacked up this RFC on the back of the recent changes to the mmu_gather
> >>> stuff in mainline. It's had a bit of testing and it looks pretty good so
> >>> far.
> >>
> >> I will request the server folks go and test this. You'll probably
> >> remember a couple of parts we've seen where aggressive walker caches
> >> ended up (correctly) seeing stale page table entries and we had all
> >> manner of horrifically hard to debug problems. We have some fairly nice
> >> reproducers that were able to find this last time that we can test.
> >
> > Cheers, Jon, that would be very helpful. You're probably best off using
> > my (rebasing) tlb branch rather than picking the RFC:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git tlb
> >
> > Let me know if you'd prefer something stable (I can tag it with a date).
>
> That would be useful. I've prodded each of the Arm server SoC vendors I
> work with via our weekly call to have them each specifically check this.
> A tag would be helpful to that effort I expect. They all claim to be
> watching this thread now, so we'll see if they see cabbages here.

This is now all queued up in the (stable) arm64 for-next/core branch:

https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=for-next/core

so that's the best place to grab the patches.

Will

2018-09-13 16:54:00

by Jon Masters

Subject: Re: [RFC PATCH 00/11] Avoid synchronous TLB invalidation for intermediate page-table entries on arm64

Tnx

--
Computer Architect


> On Sep 13, 2018, at 11:52, Will Deacon <[email protected]> wrote:
>
>> On Fri, Sep 07, 2018 at 02:36:08AM -0400, Jon Masters wrote:
>>> On 09/05/2018 08:28 AM, Will Deacon wrote:
>>>> On Tue, Sep 04, 2018 at 02:38:02PM -0400, Jon Masters wrote:
>>>>> On 08/24/2018 11:52 AM, Will Deacon wrote:
>>>>>
>>>>> I hacked up this RFC on the back of the recent changes to the mmu_gather
>>>>> stuff in mainline. It's had a bit of testing and it looks pretty good so
>>>>> far.
>>>>
>>>> I will request the server folks go and test this. You'll probably
>>>> remember a couple of parts we've seen where aggressive walker caches
>>>> ended up (correctly) seeing stale page table entries and we had all
>>>> manner of horrifically hard to debug problems. We have some fairly nice
>>>> reproducers that were able to find this last time that we can test.
>>>
>>> Cheers, Jon, that would be very helpful. You're probably best off using
>>> my (rebasing) tlb branch rather than picking the RFC:
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git tlb
>>>
>>> Let me know if you'd prefer something stable (I can tag it with a date).
>>
>> That would be useful. I've prodded each of the Arm server SoC vendors I
>> work with via our weekly call to have them each specifically check this.
>> A tag would be helpful to that effort I expect. They all claim to be
>> watching this thread now, so we'll see if they see cabbages here.
>
> This is now all queued up in the (stable) arm64 for-next/core branch:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=for-next/core
>
> so that's the best place to grab the patches.
>
> Will