2022-07-07 12:54:53

by Barry Song

Subject: [PATCH 0/4] mm: arm64: bring up BATCHED_UNMAP_TLB_FLUSH

Though ARM64 has the hardware to do tlb shootdown, it is not free.
A simple micro benchmark shows that even on a snapdragon 888 with
only 8 cores, the overhead of ptep_clear_flush is huge, even for
paging out one page mapped by only one process:
5.36% a.out [kernel.kallsyms] [k] ptep_clear_flush

When pages are mapped by multiple processes or the HW has more CPUs,
the cost becomes even higher due to the poor scalability of tlb
shootdown.

This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by
1. only sending tlbi instructions in the first stage -
   arch_tlbbatch_add_mm()
2. waiting for the completion of the tlbi by dsb while doing the
   tlbbatch sync in arch_tlbbatch_flush()
My testing on snapdragon shows the patchset removes the overhead of
ptep_clear_flush. The micro benchmark becomes 5% faster even for one
page mapped by a single process on snapdragon 888.
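
For reference, the two stages hook into the existing framework in
mm/rmap.c roughly as below (a heavily abridged sketch just to show
where the arch hooks land; the pending-batch accounting and the
writable tracking are left out):

/* stage 1: called for each pte that reclaim unmaps */
static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable,
				      struct vm_area_struct *vma,
				      unsigned long uaddr)
{
	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;

	/* on arm64 with this series: issue a per-page tlbi, no dsb yet */
	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm, vma, uaddr);
	tlb_ubc->flush_required = true;
}

/* stage 2: called once before the reclaimed pages are freed */
void try_to_unmap_flush(void)
{
	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;

	if (!tlb_ubc->flush_required)
		return;

	/* on arm64 with this series: a single dsb waits for all queued tlbi */
	arch_tlbbatch_flush(&tlb_ubc->arch);
	tlb_ubc->flush_required = false;
}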

While I believe the micro benchmark in 4/4 will perform even better
on arm64 servers, I don't have the hardware to test it. Thus,
Hi Yicong,
Would you like to run the same test in 4/4 on Kunpeng920?
Hi Darren,
Would you like to run the same test in 4/4 on Ampere's ARM64 server?
Remember to enable zRAM swap device so that pageout can actually
work for the micro benchmark.
thanks!

Barry Song (4):
Revert "Documentation/features: mark BATCHED_UNMAP_TLB_FLUSH doesn't
apply to ARM64"
mm: rmap: Allow platforms without mm_cpumask to defer TLB flush
mm: rmap: Extend tlbbatch APIs to fit new platforms
arm64: support batched/deferred tlb shootdown during page reclamation

Documentation/features/arch-support.txt | 1 -
.../features/vm/TLB/arch-support.txt | 2 +-
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/tlbbatch.h | 12 +++++++++++
arch/arm64/include/asm/tlbflush.h | 13 ++++++++++++
arch/x86/include/asm/tlbflush.h | 4 +++-
mm/rmap.c | 21 +++++++++++++------
7 files changed, 45 insertions(+), 9 deletions(-)
create mode 100644 arch/arm64/include/asm/tlbbatch.h

--
2.25.1


2022-07-07 12:55:09

by Barry Song

Subject: [PATCH 1/4] Revert "Documentation/features: mark BATCHED_UNMAP_TLB_FLUSH doesn't apply to ARM64"

From: Barry Song <[email protected]>

This reverts commit 6bfef171d0d74cb050112e0e49feb20bfddf7f42.

I was wrong to say batched tlb flush didn't apply to ARM64. The fact
is that though ARM64 has hardware TLB flush, it is not free and can
be quite expensive.

We still have a good chance to enable batched and deferred TLB flush
on ARM64 for memory reclamation. A possible way is to only queue the
tlbi instructions in the hardware's queue and, when we finally have
to synchronize, wait for their completion with a dsb. We just need to
adapt to the existing BATCHED_UNMAP_TLB_FLUSH framework.

Cc: Will Deacon <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
Documentation/features/arch-support.txt | 1 -
Documentation/features/vm/TLB/arch-support.txt | 2 +-
2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/Documentation/features/arch-support.txt b/Documentation/features/arch-support.txt
index 118ae031840b..d22a1095e661 100644
--- a/Documentation/features/arch-support.txt
+++ b/Documentation/features/arch-support.txt
@@ -8,5 +8,4 @@ The meaning of entries in the tables is:
| ok | # feature supported by the architecture
|TODO| # feature not yet supported by the architecture
| .. | # feature cannot be supported by the hardware
- | N/A| # feature doesn't apply to the architecture

diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
index 039e4e91ada3..1c009312b9c1 100644
--- a/Documentation/features/vm/TLB/arch-support.txt
+++ b/Documentation/features/vm/TLB/arch-support.txt
@@ -9,7 +9,7 @@
| alpha: | TODO |
| arc: | TODO |
| arm: | TODO |
- | arm64: | N/A |
+ | arm64: | TODO |
| csky: | TODO |
| hexagon: | TODO |
| ia64: | TODO |
--
2.25.1

2022-07-07 12:55:30

by Barry Song

Subject: [PATCH 3/4] mm: rmap: Extend tlbbatch APIs to fit new platforms

From: Barry Song <[email protected]>

Add vma and uaddr to the tlbbatch APIs so that platforms like ARM64
are able to use these two parameters for their specific hardware
features. For ARM64, this could be sending the tlbi into the hardware
queues without waiting for completion.

Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
arch/x86/include/asm/tlbflush.h | 4 +++-
mm/rmap.c | 12 ++++++++----
2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 4af5579c7ef7..9fc48c103b31 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -251,7 +251,9 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
}

static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
- struct mm_struct *mm)
+ struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long uaddr)
{
inc_mm_tlb_gen(mm);
cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
diff --git a/mm/rmap.c b/mm/rmap.c
index d320c29a4ad8..2b5b740d0001 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -642,12 +642,14 @@ void try_to_unmap_flush_dirty(void)
#define TLB_FLUSH_BATCH_PENDING_LARGE \
(TLB_FLUSH_BATCH_PENDING_MASK / 2)

-static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable,
+ struct vm_area_struct *vma,
+ unsigned long uaddr)
{
struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
int batch, nbatch;

- arch_tlbbatch_add_mm(&tlb_ubc->arch, mm);
+ arch_tlbbatch_add_mm(&tlb_ubc->arch, mm, vma, uaddr);
tlb_ubc->flush_required = true;

/*
@@ -737,7 +739,9 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
}
}
#else
-static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable,
+ struct vm_area_struct *vma,
+ unsigned long uaddr)
{
}

@@ -1600,7 +1604,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
*/
pteval = ptep_get_and_clear(mm, address, pvmw.pte);

- set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
+ set_tlb_ubc_flush_pending(mm, pte_dirty(pteval), vma, address);
} else {
pteval = ptep_clear_flush(vma, address, pvmw.pte);
}
--
2.25.1

2022-07-07 13:41:33

by Barry Song

Subject: [PATCH 4/4] arm64: support batched/deferred tlb shootdown during page reclamation

From: Barry Song <[email protected]>

On x86, batched and deferred tlb shootdown has led to a 90%
performance increase on tlb shootdown. On arm64, HW can do
tlb shootdown without a software IPI, but the synchronous
tlbi is still quite expensive.

Even running the simplest program which requires swapout can
prove this is true:
#include <sys/types.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>

int main()
{
#define SIZE (1 * 1024 * 1024)
	volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
					 MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	memset((void *)p, 0x88, SIZE);

	for (int k = 0; k < 10000; k++) {
		/* swap in */
		for (int i = 0; i < SIZE; i += 4096) {
			(void)p[i];
		}

		/* swap out */
		madvise((void *)p, SIZE, MADV_PAGEOUT);
	}
}

Perf result on snapdragon 888 with 8 cores by using zRAM
as the swap block device.

~ # perf record taskset -c 4 ./a.out
[ perf record: Woken up 10 times to write data ]
[ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
~ # perf report
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 60K of event 'cycles'
# Event count (approx.): 35706225414
#
# Overhead Command Shared Object Symbol
# ........ ....... ................. .............................................................................
#
21.07% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irq
8.23% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
6.67% a.out [kernel.kallsyms] [k] filemap_map_pages
6.16% a.out [kernel.kallsyms] [k] __zram_bvec_write
5.36% a.out [kernel.kallsyms] [k] ptep_clear_flush
3.71% a.out [kernel.kallsyms] [k] _raw_spin_lock
3.49% a.out [kernel.kallsyms] [k] memset64
1.63% a.out [kernel.kallsyms] [k] clear_page
1.42% a.out [kernel.kallsyms] [k] _raw_spin_unlock
1.26% a.out [kernel.kallsyms] [k] mod_zone_state.llvm.8525150236079521930
1.23% a.out [kernel.kallsyms] [k] xas_load
1.15% a.out [kernel.kallsyms] [k] zram_slot_lock

ptep_clear_flush() takes 5.36% of the CPU time in this micro-benchmark
swapping a page mapped by only one process in and out. If the
page is mapped by multiple processes, typically more
than 100 on a phone, the overhead would be much higher, as
we have to run the tlb flush 100 times for one single page.
Plus, the tlb flush overhead increases with the number
of CPU cores due to the poor scalability of tlb shootdown
in HW, so ARM64 servers should expect much higher
overhead.

Further perf annotate shows that 95% of the cpu time of ptep_clear_flush
is actually spent in the final dsb() waiting for the completion
of the tlb flush. This gives us a very good chance to leverage
the existing batched tlb support in the kernel. The minimal
modification is that we only send an asynchronous tlbi in the first
stage and only issue the dsb when we have to sync in the second stage.
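
For reference, this is roughly what every ptep_clear_flush() boils
down to on arm64 today (simplified from the existing flush_tlb_page();
the patch below keeps the per-page tlbi but moves the dsb into
arch_tlbbatch_flush(), issued once per batch):

static inline void flush_tlb_page(struct vm_area_struct *vma,
				  unsigned long uaddr)
{
	flush_tlb_page_nosync(vma, uaddr);	/* tlbi vale1is, asynchronous */
	dsb(ish);				/* the per-page wait, ~95% of the cost */
}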

With the above simple micro benchmark, the elapsed time to
finish the program decreases by around 5%.

Typical elapsed time w/o patch:
~ # time taskset -c 4 ./a.out
0.21user 14.34system 0:14.69elapsed
w/ patch:
~ # time taskset -c 4 ./a.out
0.22user 13.45system 0:13.80elapsed

Cc: Jonathan Corbet <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
Documentation/features/vm/TLB/arch-support.txt | 2 +-
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/tlbbatch.h | 12 ++++++++++++
arch/arm64/include/asm/tlbflush.h | 13 +++++++++++++
4 files changed, 27 insertions(+), 1 deletion(-)
create mode 100644 arch/arm64/include/asm/tlbbatch.h

diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
index 1c009312b9c1..2caf815d7c6c 100644
--- a/Documentation/features/vm/TLB/arch-support.txt
+++ b/Documentation/features/vm/TLB/arch-support.txt
@@ -9,7 +9,7 @@
| alpha: | TODO |
| arc: | TODO |
| arm: | TODO |
- | arm64: | TODO |
+ | arm64: | ok |
| csky: | TODO |
| hexagon: | TODO |
| ia64: | TODO |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1652a9800ebe..e94913a0b040 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -93,6 +93,7 @@ config ARM64
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
select ARCH_SUPPORTS_NUMA_BALANCING
select ARCH_SUPPORTS_PAGE_TABLE_CHECK
+ select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
select ARCH_WANT_DEFAULT_BPF_JIT
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
new file mode 100644
index 000000000000..fedb0b87b8db
--- /dev/null
+++ b/arch/arm64/include/asm/tlbbatch.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ARCH_ARM64_TLBBATCH_H
+#define _ARCH_ARM64_TLBBATCH_H
+
+struct arch_tlbflush_unmap_batch {
+ /*
+ * For arm64, HW can do tlb shootdown, so we don't
+ * need to record cpumask for sending IPI
+ */
+};
+
+#endif /* _ARCH_ARM64_TLBBATCH_H */
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 412a3b9a3c25..b3ed163267ca 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -272,6 +272,19 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
dsb(ish);
}

+static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
+ struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long uaddr)
+{
+ flush_tlb_page_nosync(vma, uaddr);
+}
+
+static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
+{
+ dsb(ish);
+}
+
/*
* This is meant to avoid soft lock-ups on large TLB flushing ranges and not
* necessarily a performance improvement.
--
2.25.1

2022-07-07 13:44:19

by Barry Song

Subject: [PATCH 2/4] mm: rmap: Allow platforms without mm_cpumask to defer TLB flush

From: Barry Song <[email protected]>

Platforms like ARM64 have hardware TLB shootdown broadcast. They
don't maintain mm_cpumask and just issue the tlbi and related
sync instructions for a TLB flush.
So if mm_cpumask is empty, we also allow a deferred TLB flush.

Cc: Nadav Amit <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
mm/rmap.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 5bcb334cd6f2..d320c29a4ad8 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -692,8 +692,13 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
if (!(flags & TTU_BATCH_FLUSH))
return false;

- /* If remote CPUs need to be flushed then defer batch the flush */
- if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
+ /*
+ * If remote CPUs need to be flushed then defer batch the flush;
+ * If ARCHs like ARM64 have hardware TLB flush broadcast and thus
+ * don't maintain mm_cpumask() at all, defer the batch as well.
+ */
+ if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids ||
+ cpumask_empty(mm_cpumask(mm)))
should_defer = true;
put_cpu();

--
2.25.1

2022-07-07 13:53:34

by Peter Zijlstra

Subject: Re: [PATCH 4/4] arm64: support batched/deferred tlb shootdown during page reclamation

On Fri, Jul 08, 2022 at 12:52:42AM +1200, Barry Song wrote:

> diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
> new file mode 100644
> index 000000000000..fedb0b87b8db
> --- /dev/null
> +++ b/arch/arm64/include/asm/tlbbatch.h
> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ARCH_ARM64_TLBBATCH_H
> +#define _ARCH_ARM64_TLBBATCH_H
> +
> +struct arch_tlbflush_unmap_batch {
> + /*
> + * For arm64, HW can do tlb shootdown, so we don't
> + * need to record cpumask for sending IPI
> + */
> +};
> +
> +#endif /* _ARCH_ARM64_TLBBATCH_H */
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 412a3b9a3c25..b3ed163267ca 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -272,6 +272,19 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
> dsb(ish);
> }
>
> +static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> + struct mm_struct *mm,
> + struct vm_area_struct *vma,
> + unsigned long uaddr)
> +{
> + flush_tlb_page_nosync(vma, uaddr);
> +}

You're passing that vma along just to get the mm, that's quite silly and
trivially fixed.


diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 412a3b9a3c25..87505ecce1f0 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -254,17 +254,23 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
dsb(ish);
}

-static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
- unsigned long uaddr)
+static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
+ unsigned long uaddr)
{
unsigned long addr;

dsb(ishst);
- addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
+ addr = __TLBI_VADDR(uaddr, ASID(mm));
__tlbi(vale1is, addr);
__tlbi_user(vale1is, addr);
}

+static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
+ unsigned long uaddr)
+{
+ return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
+}
+
static inline void flush_tlb_page(struct vm_area_struct *vma,
unsigned long uaddr)
{

2022-07-07 16:58:59

by Dave Hansen

Subject: Re: [PATCH 3/4] mm: rmap: Extend tlbbatch APIs to fit new platforms

On 7/7/22 05:52, Barry Song wrote:
> static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> - struct mm_struct *mm)
> + struct mm_struct *mm,
> + struct vm_area_struct *vma,
> + unsigned long uaddr)
> {

It's not a huge deal, but could we pass 'vma' _instead_ of 'mm'? The
implementations could then just use vma->vm_mm instead of the passed-in mm.

2022-07-07 21:22:04

by Barry Song

Subject: Re: [PATCH 4/4] arm64: support batched/deferred tlb shootdown during page reclamation

On Fri, Jul 8, 2022 at 1:50 AM Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Jul 08, 2022 at 12:52:42AM +1200, Barry Song wrote:
>
> > diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
> > new file mode 100644
> > index 000000000000..fedb0b87b8db
> > --- /dev/null
> > +++ b/arch/arm64/include/asm/tlbbatch.h
> > @@ -0,0 +1,12 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _ARCH_ARM64_TLBBATCH_H
> > +#define _ARCH_ARM64_TLBBATCH_H
> > +
> > +struct arch_tlbflush_unmap_batch {
> > + /*
> > + * For arm64, HW can do tlb shootdown, so we don't
> > + * need to record cpumask for sending IPI
> > + */
> > +};
> > +
> > +#endif /* _ARCH_ARM64_TLBBATCH_H */
> > diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> > index 412a3b9a3c25..b3ed163267ca 100644
> > --- a/arch/arm64/include/asm/tlbflush.h
> > +++ b/arch/arm64/include/asm/tlbflush.h
> > @@ -272,6 +272,19 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
> > dsb(ish);
> > }
> >
> > +static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> > + struct mm_struct *mm,
> > + struct vm_area_struct *vma,
> > + unsigned long uaddr)
> > +{
> > + flush_tlb_page_nosync(vma, uaddr);
> > +}
>
> You're passing that vma along just to get the mm, that's quite silly and
> trivially fixed.

Yes, this was silly. Will include your fix in v2.

>
>
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 412a3b9a3c25..87505ecce1f0 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -254,17 +254,23 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
> dsb(ish);
> }
>
> -static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> - unsigned long uaddr)
> +static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
> + unsigned long uaddr)
> {
> unsigned long addr;
>
> dsb(ishst);
> - addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
> + addr = __TLBI_VADDR(uaddr, ASID(mm));
> __tlbi(vale1is, addr);
> __tlbi_user(vale1is, addr);
> }
>
> +static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> + unsigned long uaddr)
> +{
> + return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
> +}
> +
> static inline void flush_tlb_page(struct vm_area_struct *vma,
> unsigned long uaddr)
> {

Thanks
Barry

2022-07-07 21:28:27

by Barry Song

Subject: Re: [PATCH 3/4] mm: rmap: Extend tlbbatch APIs to fit new platforms

On Fri, Jul 8, 2022 at 4:43 AM Dave Hansen <[email protected]> wrote:
>
> On 7/7/22 05:52, Barry Song wrote:
> > static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> > - struct mm_struct *mm)
> > + struct mm_struct *mm,
> > + struct vm_area_struct *vma,
> > + unsigned long uaddr)
> > {
>
> It's not a huge deal, but could we pass 'vma' _instead_ of 'mm'? The
> implementations could then just use vma->vm_mm instead of the passed-in mm.

Yes, Dave. Peter made the same suggestion in 4/4.
Will get this fixed in v2.
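
For the x86 side I imagine something like the below (just a sketch of
a possible v2 shape, untested):

static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
					struct vm_area_struct *vma,
					unsigned long uaddr)
{
	struct mm_struct *mm = vma->vm_mm;

	inc_mm_tlb_gen(mm);
	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
}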

Thanks
Barry

2022-07-08 07:13:35

by Barry Song

Subject: Re: [PATCH 2/4] mm: rmap: Allow platforms without mm_cpumask to defer TLB flush

> The cpumask_empty() is indeed just another memory access, which is most
> likely ok. But wouldn’t adding something like CONFIG_ARCH_HAS_MM_CPUMASK
> make the code simpler and (slightly, certainly slightly) more performant?

Yep, good suggestion, Nadav. So the code would be as below, right?

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index be0b95e51df6..a91d73866238 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -81,6 +81,7 @@ config X86
select ARCH_HAS_KCOV if X86_64
select ARCH_HAS_MEM_ENCRYPT
select ARCH_HAS_MEMBARRIER_SYNC_CORE
+ select ARCH_HAS_MM_CPUMASK
select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
select ARCH_HAS_PMEM_API if X86_64
select ARCH_HAS_PTE_DEVMAP if X86_64
diff --git a/mm/Kconfig b/mm/Kconfig
index 169e64192e48..7bf54f57ca01 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -951,6 +951,9 @@ config ARCH_HAS_CURRENT_STACK_POINTER
register alias named "current_stack_pointer", this config can be
selected.

+config ARCH_HAS_MM_CPUMASK
+ bool
+
config ARCH_HAS_VM_GET_PAGE_PROT
bool

diff --git a/mm/rmap.c b/mm/rmap.c
index 5bcb334cd6f2..13d4f9a1d4f1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -692,6 +692,10 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
if (!(flags & TTU_BATCH_FLUSH))
return false;

+#ifndef CONFIG_ARCH_HAS_MM_CPUMASK
+ return true;
+#endif
+
/* If remote CPUs need to be flushed then defer batch the flush */
if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
should_defer = true;

Thanks
Barry

2022-07-08 07:20:37

by Nadav Amit

Subject: Re: [PATCH 2/4] mm: rmap: Allow platforms without mm_cpumask to defer TLB flush

On Jul 7, 2022, at 5:52 AM, Barry Song <[email protected]> wrote:

> From: Barry Song <[email protected]>
>
> Platforms like ARM64 have hareware TLB shootdown broadcast. They
> don't maintain mm_cpumask and they just send tlbi and related
> sync instructions for TLB flush.
> So if mm_cpumask is empty, we also allow deferred TLB flush
>
> Cc: Nadav Amit <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Signed-off-by: Barry Song <[email protected]>>
> ---
> mm/rmap.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 5bcb334cd6f2..d320c29a4ad8 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -692,8 +692,13 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
> if (!(flags & TTU_BATCH_FLUSH))
> return false;
>
> - /* If remote CPUs need to be flushed then defer batch the flush */
> - if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
> + /*
> + * If remote CPUs need to be flushed then defer batch the flush;
> + * If ARCHs like ARM64 have hardware TLB flush broadcast and thus
> + * don't maintain mm_cpumask() at all, defer the batch as well.
> + */
> + if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids ||
> + cpumask_empty(mm_cpumask(mm)))

The cpumask_empty() is indeed just another memory access, which is most
likely ok. But wouldn’t adding something like CONFIG_ARCH_HAS_MM_CPUMASK
make the code simpler and (slightly, certainly slightly) more performant?

2022-07-08 08:23:26

by Nadav Amit

Subject: Re: [PATCH 2/4] mm: rmap: Allow platforms without mm_cpumask to defer TLB flush

On Jul 7, 2022, at 11:59 PM, Barry Song <[email protected]> wrote:

>> The cpumask_empty() is indeed just another memory access, which is most
>> likely ok. But wouldn’t adding something like CONFIG_ARCH_HAS_MM_CPUMASK
>> make the code simpler and (slightly, certainly slightly) more performant?
>
> Yep. good suggestion, Nadav. So the code will be as below, right?

Hmmm… Although it is likely to work (because only x86 and arm would use this
batch flushing), I think that for consistency ARCH_HAS_MM_CPUMASK should be
correct for all architectures.

Is it really only x86 that has mm_cpumask()?

2022-07-08 08:52:27

by Barry Song

Subject: Re: [PATCH 2/4] mm: rmap: Allow platforms without mm_cpumask to defer TLB flush

On Fri, Jul 8, 2022 at 8:08 PM Nadav Amit <[email protected]> wrote:
>
> On Jul 7, 2022, at 11:59 PM, Barry Song <[email protected]> wrote:
>
> >> The cpumask_empty() is indeed just another memory access, which is most
> >> likely ok. But wouldn’t adding something like CONFIG_ARCH_HAS_MM_CPUMASK
> >> make the code simpler and (slightly, certainly slightly) more performant?
> >
> > Yep. good suggestion, Nadav. So the code will be as below, right?
>
> Hmmm… Although it is likely to work (because only x86 and arm would use this
> batch flushing), I think that for consistency ARCH_HAS_MM_CPUMASK should be
> correct for all architectures.
>
> Is it really only x86 that has mm_cpumask()?

I am quite sure there are some other platforms having mm_cpumask(),
for example, arm (not arm64).
But I am not exactly sure of the situation of each individual arch,
so I don't want to risk changing their source code.
arm64 is only the second platform looking for tlbbatch, and
ARCH_HAS_MM_CPUMASK only affects tlbbatch, so I would
expect those platforms to select ARCH_HAS_MM_CPUMASK
when they start to bring up their own tlbbatch. For the moment,
we only need to make certain we don't break x86.
Does that make sense?

Thanks
Barry



2022-07-08 08:52:48

by Peter Zijlstra

Subject: Re: [PATCH 2/4] mm: rmap: Allow platforms without mm_cpumask to defer TLB flush

On Fri, Jul 08, 2022 at 08:08:45AM +0000, Nadav Amit wrote:

> Is it really only x86 that has mm_cpumask()?

Unlikely, everybody who needs to IPI (e.g. doesn't have broadcast
invalidate) benefits from tracking this mask more accurately.

The below greps for clearing CPUs in the mask and ought to be a fair
indicator:

$ git grep -l "cpumask_clear_cpu.*mm_cpumask" arch/
arch/arm/include/asm/mmu_context.h
arch/loongarch/include/asm/mmu_context.h
arch/loongarch/mm/tlb.c
arch/mips/include/asm/mmu_context.h
arch/openrisc/mm/tlb.c
arch/powerpc/include/asm/book3s/64/mmu.h
arch/powerpc/mm/book3s64/radix_tlb.c
arch/riscv/mm/context.c
arch/s390/kernel/smp.c
arch/um/include/asm/mmu_context.h
arch/x86/mm/tlb.c

2022-07-11 00:37:58

by Barry Song

Subject: Re: [PATCH 2/4] mm: rmap: Allow platforms without mm_cpumask to defer TLB flush

On Fri, Jul 8, 2022 at 8:28 PM Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Jul 08, 2022 at 08:08:45AM +0000, Nadav Amit wrote:
>
> > Is it really only x86 that has mm_cpumask()?
>
> Unlikely, everybody who needs to IPI (e.g. doesn't have broadcast
> invalidate) benefits from tracking this mask more accurately.
>
> The below greps for clearing CPUs in the mask and ought to be a fair
> indicator:
>
> $ git grep -l "cpumask_clear_cpu.*mm_cpumask" arch/
> arch/arm/include/asm/mmu_context.h
> arch/loongarch/include/asm/mmu_context.h
> arch/loongarch/mm/tlb.c
> arch/mips/include/asm/mmu_context.h
> arch/openrisc/mm/tlb.c
> arch/powerpc/include/asm/book3s/64/mmu.h
> arch/powerpc/mm/book3s64/radix_tlb.c
> arch/riscv/mm/context.c
> arch/s390/kernel/smp.c
> arch/um/include/asm/mmu_context.h
> arch/x86/mm/tlb.c

So I suppose we need the below at this moment. I am not able to
test all of them, but since only x86 has already got tlbbatch
and arm64 is the second one to have tlbbatch now, I suppose the
below changes won't break those archs without tlbbatch. I would
expect people bringing up tlbbatch on those platforms to test
them later:

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 7630ba9cb6cc..25c42747f488 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -13,6 +13,7 @@ config ARM
select ARCH_HAS_KEEPINITRD
select ARCH_HAS_KCOV
select ARCH_HAS_MEMBARRIER_SYNC_CORE
+ select ARCH_HAS_MM_CPUMASK
select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
select ARCH_HAS_PTE_SPECIAL if ARM_LPAE
select ARCH_HAS_PHYS_TO_DMA
diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
index 1920d52653b4..4b737c0d17a2 100644
--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -7,6 +7,7 @@ config LOONGARCH
select ARCH_ENABLE_MEMORY_HOTPLUG
select ARCH_ENABLE_MEMORY_HOTREMOVE
select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI
+ select ARCH_HAS_MM_CPUMASK
select ARCH_HAS_PHYS_TO_DMA
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index db09d45d59ec..1b196acdeca3 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -9,6 +9,7 @@ config MIPS
select ARCH_HAS_FORTIFY_SOURCE
select ARCH_HAS_KCOV
select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE if !EVA
+ select ARCH_HAS_MM_CPUMASK
select ARCH_HAS_PTE_SPECIAL if !(32BIT && CPU_HAS_RIXI)
select ARCH_HAS_STRNCPY_FROM_USER
select ARCH_HAS_STRNLEN_USER
diff --git a/arch/openrisc/Kconfig b/arch/openrisc/Kconfig
index e814df4c483c..82483b192f4a 100644
--- a/arch/openrisc/Kconfig
+++ b/arch/openrisc/Kconfig
@@ -9,6 +9,7 @@ config OPENRISC
select ARCH_32BIT_OFF_T
select ARCH_HAS_DMA_SET_UNCACHED
select ARCH_HAS_DMA_CLEAR_UNCACHED
+ select ARCH_HAS_MM_CPUMASK
select ARCH_HAS_SYNC_DMA_FOR_DEVICE
select COMMON_CLK
select OF
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c2ce2e60c8f0..19061ffe73a0 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -127,6 +127,7 @@ config PPC
select ARCH_HAS_MEMBARRIER_SYNC_CORE
select ARCH_HAS_MEMREMAP_COMPAT_ALIGN if PPC_64S_HASH_MMU
select ARCH_HAS_MMIOWB if PPC64
+ select ARCH_HAS_MM_CPUMASK
select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
select ARCH_HAS_PHYS_TO_DMA
select ARCH_HAS_PMEM_API
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index c22f58155948..7570c95a9cc8 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -25,6 +25,7 @@ config RISCV
select ARCH_HAS_GIGANTIC_PAGE
select ARCH_HAS_KCOV
select ARCH_HAS_MMIOWB
+ select ARCH_HAS_MM_CPUMASK
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_SET_DIRECT_MAP if MMU
select ARCH_HAS_SET_MEMORY if MMU
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 91c0b80a8bf0..48d91fa05bab 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -73,6 +73,7 @@ config S390
select ARCH_HAS_GIGANTIC_PAGE
select ARCH_HAS_KCOV
select ARCH_HAS_MEM_ENCRYPT
+ select ARCH_HAS_MM_CPUMASK
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_SCALED_CPUTIME
select ARCH_HAS_SET_MEMORY
diff --git a/arch/um/Kconfig b/arch/um/Kconfig
index 4ec22e156a2e..df29c729267b 100644
--- a/arch/um/Kconfig
+++ b/arch/um/Kconfig
@@ -8,6 +8,7 @@ config UML
select ARCH_EPHEMERAL_INODES
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_KCOV
+ select ARCH_HAS_MM_CPUMASK
select ARCH_HAS_STRNCPY_FROM_USER
select ARCH_HAS_STRNLEN_USER
select ARCH_NO_PREEMPT
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index be0b95e51df6..a91d73866238 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -81,6 +81,7 @@ config X86
select ARCH_HAS_KCOV if X86_64
select ARCH_HAS_MEM_ENCRYPT
select ARCH_HAS_MEMBARRIER_SYNC_CORE
+ select ARCH_HAS_MM_CPUMASK
select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
select ARCH_HAS_PMEM_API if X86_64
select ARCH_HAS_PTE_DEVMAP if X86_64
diff --git a/mm/Kconfig b/mm/Kconfig
index 169e64192e48..7bf54f57ca01 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -951,6 +951,9 @@ config ARCH_HAS_CURRENT_STACK_POINTER
register alias named "current_stack_pointer", this config can be
selected.

+config ARCH_HAS_MM_CPUMASK
+ bool
+
config ARCH_HAS_VM_GET_PAGE_PROT
bool

diff --git a/mm/rmap.c b/mm/rmap.c
index 5bcb334cd6f2..13d4f9a1d4f1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -692,6 +692,10 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
if (!(flags & TTU_BATCH_FLUSH))
return false;

+#ifndef CONFIG_ARCH_HAS_MM_CPUMASK
+ return true;
+#endif
+
/* If remote CPUs need to be flushed then defer batch the flush */
if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
should_defer = true;

Thanks
Barry