2022-09-09 05:26:55

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation



On 8/22/22 13:51, Yicong Yang wrote:
> From: Barry Song <[email protected]>
>
> on x86, batched and deferred tlb shootdown has lead to 90%
> performance increase on tlb shootdown. on arm64, HW can do
> tlb shootdown without software IPI. But sync tlbi is still
> quite expensive.
>
> Even running a simplest program which requires swapout can
> prove this is true,
> #include <sys/types.h>
> #include <unistd.h>
> #include <sys/mman.h>
> #include <string.h>
>
> int main()
> {
> #define SIZE (1 * 1024 * 1024)
> volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
> MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>
> memset(p, 0x88, SIZE);
>
> for (int k = 0; k < 10000; k++) {
> /* swap in */
> for (int i = 0; i < SIZE; i += 4096) {
> (void)p[i];
> }
>
> /* swap out */
> madvise(p, SIZE, MADV_PAGEOUT);
> }
> }
>
> Perf result on snapdragon 888 with 8 cores by using zRAM
> as the swap block device.
>
> ~ # perf record taskset -c 4 ./a.out
> [ perf record: Woken up 10 times to write data ]
> [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
> ~ # perf report
> # To display the perf.data header info, please use --header/--header-only options.
> # To display the perf.data header info, please use --header/--header-only options.
> #
> #
> # Total Lost Samples: 0
> #
> # Samples: 60K of event 'cycles'
> # Event count (approx.): 35706225414
> #
> # Overhead Command Shared Object Symbol
> # ........ ....... ................. .............................................................................
> #
> 21.07% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irq
> 8.23% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
> 6.67% a.out [kernel.kallsyms] [k] filemap_map_pages
> 6.16% a.out [kernel.kallsyms] [k] __zram_bvec_write
> 5.36% a.out [kernel.kallsyms] [k] ptep_clear_flush
> 3.71% a.out [kernel.kallsyms] [k] _raw_spin_lock
> 3.49% a.out [kernel.kallsyms] [k] memset64
> 1.63% a.out [kernel.kallsyms] [k] clear_page
> 1.42% a.out [kernel.kallsyms] [k] _raw_spin_unlock
> 1.26% a.out [kernel.kallsyms] [k] mod_zone_state.llvm.8525150236079521930
> 1.23% a.out [kernel.kallsyms] [k] xas_load
> 1.15% a.out [kernel.kallsyms] [k] zram_slot_lock
>
> ptep_clear_flush() takes 5.36% CPU in the micro-benchmark
> swapping in/out a page mapped by only one process. If the
> page is mapped by multiple processes, typically, like more
> than 100 on a phone, the overhead would be much higher as
> we have to run tlb flush 100 times for one single page.
> Plus, tlb flush overhead will increase with the number
> of CPU cores due to the bad scalability of tlb shootdown
> in HW, so those ARM64 servers should expect much higher
> overhead.
>
> Further perf annonate shows 95% cpu time of ptep_clear_flush
> is actually used by the final dsb() to wait for the completion
> of tlb flush. This provides us a very good chance to leverage
> the existing batched tlb in kernel. The minimum modification
> is that we only send async tlbi in the first stage and we send
> dsb while we have to sync in the second stage.
>
> With the above simplest micro benchmark, collapsed time to
> finish the program decreases around 5%.
>
> Typical collapsed time w/o patch:
> ~ # time taskset -c 4 ./a.out
> 0.21user 14.34system 0:14.69elapsed
> w/ patch:
> ~ # time taskset -c 4 ./a.out
> 0.22user 13.45system 0:13.80elapsed
>
> Also, Yicong Yang added the following observation.
> Tested with benchmark in the commit on Kunpeng920 arm64 server,
> observed an improvement around 12.5% with command
> `time ./swap_bench`.
> w/o w/
> real 0m13.460s 0m11.771s
> user 0m0.248s 0m0.279s
> sys 0m12.039s 0m11.458s
>
> Originally it's noticed a 16.99% overhead of ptep_clear_flush()
> which has been eliminated by this patch:
>
> [root@localhost yang]# perf record -- ./swap_bench && perf report
> [...]
> 16.99% swap_bench [kernel.kallsyms] [k] ptep_clear_flush
>
> Cc: Jonathan Corbet <[email protected]>
> Cc: Nadav Amit <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Tested-by: Yicong Yang <[email protected]>
> Tested-by: Xin Hao <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> Signed-off-by: Yicong Yang <[email protected]>
> ---
> .../features/vm/TLB/arch-support.txt | 2 +-
> arch/arm64/Kconfig | 1 +
> arch/arm64/include/asm/tlbbatch.h | 12 ++++++++
> arch/arm64/include/asm/tlbflush.h | 28 +++++++++++++++++--
> 4 files changed, 40 insertions(+), 3 deletions(-)
> create mode 100644 arch/arm64/include/asm/tlbbatch.h
>
> diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
> index 1c009312b9c1..2caf815d7c6c 100644
> --- a/Documentation/features/vm/TLB/arch-support.txt
> +++ b/Documentation/features/vm/TLB/arch-support.txt
> @@ -9,7 +9,7 @@
> | alpha: | TODO |
> | arc: | TODO |
> | arm: | TODO |
> - | arm64: | TODO |
> + | arm64: | ok |
> | csky: | TODO |
> | hexagon: | TODO |
> | ia64: | TODO |
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 571cc234d0b3..09d45cd6d665 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -93,6 +93,7 @@ config ARM64
> select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
> select ARCH_SUPPORTS_NUMA_BALANCING
> select ARCH_SUPPORTS_PAGE_TABLE_CHECK
> + select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
> select ARCH_WANT_DEFAULT_BPF_JIT
> select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
> diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
> new file mode 100644
> index 000000000000..fedb0b87b8db
> --- /dev/null
> +++ b/arch/arm64/include/asm/tlbbatch.h
> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ARCH_ARM64_TLBBATCH_H
> +#define _ARCH_ARM64_TLBBATCH_H
> +
> +struct arch_tlbflush_unmap_batch {
> + /*
> + * For arm64, HW can do tlb shootdown, so we don't
> + * need to record cpumask for sending IPI
> + */
> +};
> +
> +#endif /* _ARCH_ARM64_TLBBATCH_H */
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 412a3b9a3c25..23cbc987321a 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -254,17 +254,24 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
> dsb(ish);
> }
>
> -static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> +
> +static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
> unsigned long uaddr)
> {
> unsigned long addr;
>
> dsb(ishst);
> - addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
> + addr = __TLBI_VADDR(uaddr, ASID(mm));
> __tlbi(vale1is, addr);
> __tlbi_user(vale1is, addr);
> }
>
> +static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> + unsigned long uaddr)
> +{
> + return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
> +}
> +
> static inline void flush_tlb_page(struct vm_area_struct *vma,
> unsigned long uaddr)
> {
> @@ -272,6 +279,23 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
> dsb(ish);
> }
>
> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> +{
> + return true;
> +}

Always defer and batch up TLB flush, unconditionally ?

> +
> +static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> + struct mm_struct *mm,
> + unsigned long uaddr)
> +{
> + __flush_tlb_page_nosync(mm, uaddr);
> +}
> +
> +static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> +{
> + dsb(ish);
> +}

Adding up __flush_tlb_page_nosync() without a corresponding dsb(ish) and
then doing once via arch_tlbbatch_flush() will have the same effect from
an architecture perspective ?

> +
> /*
> * This is meant to avoid soft lock-ups on large TLB flushing ranges and not
> * necessarily a performance improvement.


2022-09-09 05:54:58

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation

On Fri, Sep 9, 2022 at 5:24 PM Anshuman Khandual
<[email protected]> wrote:
>
>
>
> On 8/22/22 13:51, Yicong Yang wrote:
> > From: Barry Song <[email protected]>
> >
> > on x86, batched and deferred tlb shootdown has lead to 90%
> > performance increase on tlb shootdown. on arm64, HW can do
> > tlb shootdown without software IPI. But sync tlbi is still
> > quite expensive.
> >
> > Even running a simplest program which requires swapout can
> > prove this is true,
> > #include <sys/types.h>
> > #include <unistd.h>
> > #include <sys/mman.h>
> > #include <string.h>
> >
> > int main()
> > {
> > #define SIZE (1 * 1024 * 1024)
> > volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
> > MAP_SHARED | MAP_ANONYMOUS, -1, 0);
> >
> > memset(p, 0x88, SIZE);
> >
> > for (int k = 0; k < 10000; k++) {
> > /* swap in */
> > for (int i = 0; i < SIZE; i += 4096) {
> > (void)p[i];
> > }
> >
> > /* swap out */
> > madvise(p, SIZE, MADV_PAGEOUT);
> > }
> > }
> >
> > Perf result on snapdragon 888 with 8 cores by using zRAM
> > as the swap block device.
> >
> > ~ # perf record taskset -c 4 ./a.out
> > [ perf record: Woken up 10 times to write data ]
> > [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
> > ~ # perf report
> > # To display the perf.data header info, please use --header/--header-only options.
> > # To display the perf.data header info, please use --header/--header-only options.
> > #
> > #
> > # Total Lost Samples: 0
> > #
> > # Samples: 60K of event 'cycles'
> > # Event count (approx.): 35706225414
> > #
> > # Overhead Command Shared Object Symbol
> > # ........ ....... ................. .............................................................................
> > #
> > 21.07% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irq
> > 8.23% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
> > 6.67% a.out [kernel.kallsyms] [k] filemap_map_pages
> > 6.16% a.out [kernel.kallsyms] [k] __zram_bvec_write
> > 5.36% a.out [kernel.kallsyms] [k] ptep_clear_flush
> > 3.71% a.out [kernel.kallsyms] [k] _raw_spin_lock
> > 3.49% a.out [kernel.kallsyms] [k] memset64
> > 1.63% a.out [kernel.kallsyms] [k] clear_page
> > 1.42% a.out [kernel.kallsyms] [k] _raw_spin_unlock
> > 1.26% a.out [kernel.kallsyms] [k] mod_zone_state.llvm.8525150236079521930
> > 1.23% a.out [kernel.kallsyms] [k] xas_load
> > 1.15% a.out [kernel.kallsyms] [k] zram_slot_lock
> >
> > ptep_clear_flush() takes 5.36% CPU in the micro-benchmark
> > swapping in/out a page mapped by only one process. If the
> > page is mapped by multiple processes, typically, like more
> > than 100 on a phone, the overhead would be much higher as
> > we have to run tlb flush 100 times for one single page.
> > Plus, tlb flush overhead will increase with the number
> > of CPU cores due to the bad scalability of tlb shootdown
> > in HW, so those ARM64 servers should expect much higher
> > overhead.
> >
> > Further perf annonate shows 95% cpu time of ptep_clear_flush
> > is actually used by the final dsb() to wait for the completion
> > of tlb flush. This provides us a very good chance to leverage
> > the existing batched tlb in kernel. The minimum modification
> > is that we only send async tlbi in the first stage and we send
> > dsb while we have to sync in the second stage.
> >
> > With the above simplest micro benchmark, collapsed time to
> > finish the program decreases around 5%.
> >
> > Typical collapsed time w/o patch:
> > ~ # time taskset -c 4 ./a.out
> > 0.21user 14.34system 0:14.69elapsed
> > w/ patch:
> > ~ # time taskset -c 4 ./a.out
> > 0.22user 13.45system 0:13.80elapsed
> >
> > Also, Yicong Yang added the following observation.
> > Tested with benchmark in the commit on Kunpeng920 arm64 server,
> > observed an improvement around 12.5% with command
> > `time ./swap_bench`.
> > w/o w/
> > real 0m13.460s 0m11.771s
> > user 0m0.248s 0m0.279s
> > sys 0m12.039s 0m11.458s
> >
> > Originally it's noticed a 16.99% overhead of ptep_clear_flush()
> > which has been eliminated by this patch:
> >
> > [root@localhost yang]# perf record -- ./swap_bench && perf report
> > [...]
> > 16.99% swap_bench [kernel.kallsyms] [k] ptep_clear_flush
> >
> > Cc: Jonathan Corbet <[email protected]>
> > Cc: Nadav Amit <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > Tested-by: Yicong Yang <[email protected]>
> > Tested-by: Xin Hao <[email protected]>
> > Signed-off-by: Barry Song <[email protected]>
> > Signed-off-by: Yicong Yang <[email protected]>
> > ---
> > .../features/vm/TLB/arch-support.txt | 2 +-
> > arch/arm64/Kconfig | 1 +
> > arch/arm64/include/asm/tlbbatch.h | 12 ++++++++
> > arch/arm64/include/asm/tlbflush.h | 28 +++++++++++++++++--
> > 4 files changed, 40 insertions(+), 3 deletions(-)
> > create mode 100644 arch/arm64/include/asm/tlbbatch.h
> >
> > diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
> > index 1c009312b9c1..2caf815d7c6c 100644
> > --- a/Documentation/features/vm/TLB/arch-support.txt
> > +++ b/Documentation/features/vm/TLB/arch-support.txt
> > @@ -9,7 +9,7 @@
> > | alpha: | TODO |
> > | arc: | TODO |
> > | arm: | TODO |
> > - | arm64: | TODO |
> > + | arm64: | ok |
> > | csky: | TODO |
> > | hexagon: | TODO |
> > | ia64: | TODO |
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index 571cc234d0b3..09d45cd6d665 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -93,6 +93,7 @@ config ARM64
> > select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
> > select ARCH_SUPPORTS_NUMA_BALANCING
> > select ARCH_SUPPORTS_PAGE_TABLE_CHECK
> > + select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> > select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
> > select ARCH_WANT_DEFAULT_BPF_JIT
> > select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
> > diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
> > new file mode 100644
> > index 000000000000..fedb0b87b8db
> > --- /dev/null
> > +++ b/arch/arm64/include/asm/tlbbatch.h
> > @@ -0,0 +1,12 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _ARCH_ARM64_TLBBATCH_H
> > +#define _ARCH_ARM64_TLBBATCH_H
> > +
> > +struct arch_tlbflush_unmap_batch {
> > + /*
> > + * For arm64, HW can do tlb shootdown, so we don't
> > + * need to record cpumask for sending IPI
> > + */
> > +};
> > +
> > +#endif /* _ARCH_ARM64_TLBBATCH_H */
> > diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> > index 412a3b9a3c25..23cbc987321a 100644
> > --- a/arch/arm64/include/asm/tlbflush.h
> > +++ b/arch/arm64/include/asm/tlbflush.h
> > @@ -254,17 +254,24 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
> > dsb(ish);
> > }
> >
> > -static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> > +
> > +static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
> > unsigned long uaddr)
> > {
> > unsigned long addr;
> >
> > dsb(ishst);
> > - addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
> > + addr = __TLBI_VADDR(uaddr, ASID(mm));
> > __tlbi(vale1is, addr);
> > __tlbi_user(vale1is, addr);
> > }
> >
> > +static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> > + unsigned long uaddr)
> > +{
> > + return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
> > +}
> > +
> > static inline void flush_tlb_page(struct vm_area_struct *vma,
> > unsigned long uaddr)
> > {
> > @@ -272,6 +279,23 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
> > dsb(ish);
> > }
> >
> > +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> > +{
> > + return true;
> > +}
>
> Always defer and batch up TLB flush, unconditionally ?

My understanding is we actually don't need tlbbatch for a machine with one
or two cores as the tlb flush is not expensive. even for a system with four
cortex-a55 cores, i didn't see obvious cost. it was less than 1%.
when we have 8 cores, we see the obvious cost of tlb flush. for a server with
100 crores, the cost is incredibly huge.

But, we can hardly write source code to differentiate machines according to
how many cores a machine has, especially when cores can be hot-plugged.

>
> > +
> > +static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> > + struct mm_struct *mm,
> > + unsigned long uaddr)
> > +{
> > + __flush_tlb_page_nosync(mm, uaddr);
> > +}
> > +
> > +static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> > +{
> > + dsb(ish);
> > +}
>
> Adding up __flush_tlb_page_nosync() without a corresponding dsb(ish) and
> then doing once via arch_tlbbatch_flush() will have the same effect from
> an architecture perspective ?

The difference is we drop the cost of lots of single tlb flush. we
only need to sync
when we have to sync. dsb(ish) guarantees the completion of previous
multiple tlb
flush instructions.

>
> > +
> > /*
> > * This is meant to avoid soft lock-ups on large TLB flushing ranges and not
> > * necessarily a performance improvement.

Thanks
Barry

2022-09-15 06:10:01

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation



On 9/9/22 11:05, Barry Song wrote:
> On Fri, Sep 9, 2022 at 5:24 PM Anshuman Khandual
> <[email protected]> wrote:
>>
>>
>>
>> On 8/22/22 13:51, Yicong Yang wrote:
>>> From: Barry Song <[email protected]>
>>>
>>> on x86, batched and deferred tlb shootdown has lead to 90%
>>> performance increase on tlb shootdown. on arm64, HW can do
>>> tlb shootdown without software IPI. But sync tlbi is still
>>> quite expensive.
>>>
>>> Even running a simplest program which requires swapout can
>>> prove this is true,
>>> #include <sys/types.h>
>>> #include <unistd.h>
>>> #include <sys/mman.h>
>>> #include <string.h>
>>>
>>> int main()
>>> {
>>> #define SIZE (1 * 1024 * 1024)
>>> volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
>>> MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>>>
>>> memset(p, 0x88, SIZE);
>>>
>>> for (int k = 0; k < 10000; k++) {
>>> /* swap in */
>>> for (int i = 0; i < SIZE; i += 4096) {
>>> (void)p[i];
>>> }
>>>
>>> /* swap out */
>>> madvise(p, SIZE, MADV_PAGEOUT);
>>> }
>>> }
>>>
>>> Perf result on snapdragon 888 with 8 cores by using zRAM
>>> as the swap block device.
>>>
>>> ~ # perf record taskset -c 4 ./a.out
>>> [ perf record: Woken up 10 times to write data ]
>>> [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
>>> ~ # perf report
>>> # To display the perf.data header info, please use --header/--header-only options.
>>> # To display the perf.data header info, please use --header/--header-only options.
>>> #
>>> #
>>> # Total Lost Samples: 0
>>> #
>>> # Samples: 60K of event 'cycles'
>>> # Event count (approx.): 35706225414
>>> #
>>> # Overhead Command Shared Object Symbol
>>> # ........ ....... ................. .............................................................................
>>> #
>>> 21.07% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irq
>>> 8.23% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
>>> 6.67% a.out [kernel.kallsyms] [k] filemap_map_pages
>>> 6.16% a.out [kernel.kallsyms] [k] __zram_bvec_write
>>> 5.36% a.out [kernel.kallsyms] [k] ptep_clear_flush
>>> 3.71% a.out [kernel.kallsyms] [k] _raw_spin_lock
>>> 3.49% a.out [kernel.kallsyms] [k] memset64
>>> 1.63% a.out [kernel.kallsyms] [k] clear_page
>>> 1.42% a.out [kernel.kallsyms] [k] _raw_spin_unlock
>>> 1.26% a.out [kernel.kallsyms] [k] mod_zone_state.llvm.8525150236079521930
>>> 1.23% a.out [kernel.kallsyms] [k] xas_load
>>> 1.15% a.out [kernel.kallsyms] [k] zram_slot_lock
>>>
>>> ptep_clear_flush() takes 5.36% CPU in the micro-benchmark
>>> swapping in/out a page mapped by only one process. If the
>>> page is mapped by multiple processes, typically, like more
>>> than 100 on a phone, the overhead would be much higher as
>>> we have to run tlb flush 100 times for one single page.
>>> Plus, tlb flush overhead will increase with the number
>>> of CPU cores due to the bad scalability of tlb shootdown
>>> in HW, so those ARM64 servers should expect much higher
>>> overhead.
>>>
>>> Further perf annonate shows 95% cpu time of ptep_clear_flush
>>> is actually used by the final dsb() to wait for the completion
>>> of tlb flush. This provides us a very good chance to leverage
>>> the existing batched tlb in kernel. The minimum modification
>>> is that we only send async tlbi in the first stage and we send
>>> dsb while we have to sync in the second stage.
>>>
>>> With the above simplest micro benchmark, collapsed time to
>>> finish the program decreases around 5%.
>>>
>>> Typical collapsed time w/o patch:
>>> ~ # time taskset -c 4 ./a.out
>>> 0.21user 14.34system 0:14.69elapsed
>>> w/ patch:
>>> ~ # time taskset -c 4 ./a.out
>>> 0.22user 13.45system 0:13.80elapsed
>>>
>>> Also, Yicong Yang added the following observation.
>>> Tested with benchmark in the commit on Kunpeng920 arm64 server,
>>> observed an improvement around 12.5% with command
>>> `time ./swap_bench`.
>>> w/o w/
>>> real 0m13.460s 0m11.771s
>>> user 0m0.248s 0m0.279s
>>> sys 0m12.039s 0m11.458s
>>>
>>> Originally it's noticed a 16.99% overhead of ptep_clear_flush()
>>> which has been eliminated by this patch:
>>>
>>> [root@localhost yang]# perf record -- ./swap_bench && perf report
>>> [...]
>>> 16.99% swap_bench [kernel.kallsyms] [k] ptep_clear_flush
>>>
>>> Cc: Jonathan Corbet <[email protected]>
>>> Cc: Nadav Amit <[email protected]>
>>> Cc: Mel Gorman <[email protected]>
>>> Tested-by: Yicong Yang <[email protected]>
>>> Tested-by: Xin Hao <[email protected]>
>>> Signed-off-by: Barry Song <[email protected]>
>>> Signed-off-by: Yicong Yang <[email protected]>
>>> ---
>>> .../features/vm/TLB/arch-support.txt | 2 +-
>>> arch/arm64/Kconfig | 1 +
>>> arch/arm64/include/asm/tlbbatch.h | 12 ++++++++
>>> arch/arm64/include/asm/tlbflush.h | 28 +++++++++++++++++--
>>> 4 files changed, 40 insertions(+), 3 deletions(-)
>>> create mode 100644 arch/arm64/include/asm/tlbbatch.h
>>>
>>> diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
>>> index 1c009312b9c1..2caf815d7c6c 100644
>>> --- a/Documentation/features/vm/TLB/arch-support.txt
>>> +++ b/Documentation/features/vm/TLB/arch-support.txt
>>> @@ -9,7 +9,7 @@
>>> | alpha: | TODO |
>>> | arc: | TODO |
>>> | arm: | TODO |
>>> - | arm64: | TODO |
>>> + | arm64: | ok |
>>> | csky: | TODO |
>>> | hexagon: | TODO |
>>> | ia64: | TODO |
>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>> index 571cc234d0b3..09d45cd6d665 100644
>>> --- a/arch/arm64/Kconfig
>>> +++ b/arch/arm64/Kconfig
>>> @@ -93,6 +93,7 @@ config ARM64
>>> select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
>>> select ARCH_SUPPORTS_NUMA_BALANCING
>>> select ARCH_SUPPORTS_PAGE_TABLE_CHECK
>>> + select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
>>> select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
>>> select ARCH_WANT_DEFAULT_BPF_JIT
>>> select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
>>> diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
>>> new file mode 100644
>>> index 000000000000..fedb0b87b8db
>>> --- /dev/null
>>> +++ b/arch/arm64/include/asm/tlbbatch.h
>>> @@ -0,0 +1,12 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>> +#ifndef _ARCH_ARM64_TLBBATCH_H
>>> +#define _ARCH_ARM64_TLBBATCH_H
>>> +
>>> +struct arch_tlbflush_unmap_batch {
>>> + /*
>>> + * For arm64, HW can do tlb shootdown, so we don't
>>> + * need to record cpumask for sending IPI
>>> + */
>>> +};
>>> +
>>> +#endif /* _ARCH_ARM64_TLBBATCH_H */
>>> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
>>> index 412a3b9a3c25..23cbc987321a 100644
>>> --- a/arch/arm64/include/asm/tlbflush.h
>>> +++ b/arch/arm64/include/asm/tlbflush.h
>>> @@ -254,17 +254,24 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
>>> dsb(ish);
>>> }
>>>
>>> -static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
>>> +
>>> +static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
>>> unsigned long uaddr)
>>> {
>>> unsigned long addr;
>>>
>>> dsb(ishst);
>>> - addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
>>> + addr = __TLBI_VADDR(uaddr, ASID(mm));
>>> __tlbi(vale1is, addr);
>>> __tlbi_user(vale1is, addr);
>>> }
>>>
>>> +static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
>>> + unsigned long uaddr)
>>> +{
>>> + return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
>>> +}
>>> +
>>> static inline void flush_tlb_page(struct vm_area_struct *vma,
>>> unsigned long uaddr)
>>> {
>>> @@ -272,6 +279,23 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
>>> dsb(ish);
>>> }
>>>
>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>>> +{
>>> + return true;
>>> +}
>>
>> Always defer and batch up TLB flush, unconditionally ?
>
> My understanding is we actually don't need tlbbatch for a machine with one
> or two cores as the tlb flush is not expensive. even for a system with four
> cortex-a55 cores, i didn't see obvious cost. it was less than 1%.
> when we have 8 cores, we see the obvious cost of tlb flush. for a server with
> 100 crores, the cost is incredibly huge.

Although dsb(ish) is deferred via arch_tlbbatch_flush(), there is still
one dsb(isht) instruction left in __flush_tlb_page_nosync(). Is not that
expensive as well, while queuing up individual TLB flushes ?

The very idea behind TLB deferral is the opportunity it (might) provide
to accumulate address ranges and cpu masks so that individual TLB flush
can be replaced with a more cost effective range based TLB flush. Hence
I guess unless address range or cpumask based cost effective TLB flush
is available, deferral does not improve the unmap performance as much.

2022-09-15 06:45:21

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation

On Thu, Sep 15, 2022 at 6:07 PM Anshuman Khandual
<[email protected]> wrote:
>
>
>
> On 9/9/22 11:05, Barry Song wrote:
> > On Fri, Sep 9, 2022 at 5:24 PM Anshuman Khandual
> > <[email protected]> wrote:
> >>
> >>
> >>
> >> On 8/22/22 13:51, Yicong Yang wrote:
> >>> From: Barry Song <[email protected]>
> >>>
> >>> on x86, batched and deferred tlb shootdown has lead to 90%
> >>> performance increase on tlb shootdown. on arm64, HW can do
> >>> tlb shootdown without software IPI. But sync tlbi is still
> >>> quite expensive.
> >>>
> >>> Even running a simplest program which requires swapout can
> >>> prove this is true,
> >>> #include <sys/types.h>
> >>> #include <unistd.h>
> >>> #include <sys/mman.h>
> >>> #include <string.h>
> >>>
> >>> int main()
> >>> {
> >>> #define SIZE (1 * 1024 * 1024)
> >>> volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
> >>> MAP_SHARED | MAP_ANONYMOUS, -1, 0);
> >>>
> >>> memset(p, 0x88, SIZE);
> >>>
> >>> for (int k = 0; k < 10000; k++) {
> >>> /* swap in */
> >>> for (int i = 0; i < SIZE; i += 4096) {
> >>> (void)p[i];
> >>> }
> >>>
> >>> /* swap out */
> >>> madvise(p, SIZE, MADV_PAGEOUT);
> >>> }
> >>> }
> >>>
> >>> Perf result on snapdragon 888 with 8 cores by using zRAM
> >>> as the swap block device.
> >>>
> >>> ~ # perf record taskset -c 4 ./a.out
> >>> [ perf record: Woken up 10 times to write data ]
> >>> [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
> >>> ~ # perf report
> >>> # To display the perf.data header info, please use --header/--header-only options.
> >>> # To display the perf.data header info, please use --header/--header-only options.
> >>> #
> >>> #
> >>> # Total Lost Samples: 0
> >>> #
> >>> # Samples: 60K of event 'cycles'
> >>> # Event count (approx.): 35706225414
> >>> #
> >>> # Overhead Command Shared Object Symbol
> >>> # ........ ....... ................. .............................................................................
> >>> #
> >>> 21.07% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irq
> >>> 8.23% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
> >>> 6.67% a.out [kernel.kallsyms] [k] filemap_map_pages
> >>> 6.16% a.out [kernel.kallsyms] [k] __zram_bvec_write
> >>> 5.36% a.out [kernel.kallsyms] [k] ptep_clear_flush
> >>> 3.71% a.out [kernel.kallsyms] [k] _raw_spin_lock
> >>> 3.49% a.out [kernel.kallsyms] [k] memset64
> >>> 1.63% a.out [kernel.kallsyms] [k] clear_page
> >>> 1.42% a.out [kernel.kallsyms] [k] _raw_spin_unlock
> >>> 1.26% a.out [kernel.kallsyms] [k] mod_zone_state.llvm.8525150236079521930
> >>> 1.23% a.out [kernel.kallsyms] [k] xas_load
> >>> 1.15% a.out [kernel.kallsyms] [k] zram_slot_lock
> >>>
> >>> ptep_clear_flush() takes 5.36% CPU in the micro-benchmark
> >>> swapping in/out a page mapped by only one process. If the
> >>> page is mapped by multiple processes, typically, like more
> >>> than 100 on a phone, the overhead would be much higher as
> >>> we have to run tlb flush 100 times for one single page.
> >>> Plus, tlb flush overhead will increase with the number
> >>> of CPU cores due to the bad scalability of tlb shootdown
> >>> in HW, so those ARM64 servers should expect much higher
> >>> overhead.
> >>>
> >>> Further perf annonate shows 95% cpu time of ptep_clear_flush
> >>> is actually used by the final dsb() to wait for the completion
> >>> of tlb flush. This provides us a very good chance to leverage
> >>> the existing batched tlb in kernel. The minimum modification
> >>> is that we only send async tlbi in the first stage and we send
> >>> dsb while we have to sync in the second stage.
> >>>
> >>> With the above simplest micro benchmark, collapsed time to
> >>> finish the program decreases around 5%.
> >>>
> >>> Typical collapsed time w/o patch:
> >>> ~ # time taskset -c 4 ./a.out
> >>> 0.21user 14.34system 0:14.69elapsed
> >>> w/ patch:
> >>> ~ # time taskset -c 4 ./a.out
> >>> 0.22user 13.45system 0:13.80elapsed
> >>>
> >>> Also, Yicong Yang added the following observation.
> >>> Tested with benchmark in the commit on Kunpeng920 arm64 server,
> >>> observed an improvement around 12.5% with command
> >>> `time ./swap_bench`.
> >>> w/o w/
> >>> real 0m13.460s 0m11.771s
> >>> user 0m0.248s 0m0.279s
> >>> sys 0m12.039s 0m11.458s
> >>>
> >>> Originally it's noticed a 16.99% overhead of ptep_clear_flush()
> >>> which has been eliminated by this patch:
> >>>
> >>> [root@localhost yang]# perf record -- ./swap_bench && perf report
> >>> [...]
> >>> 16.99% swap_bench [kernel.kallsyms] [k] ptep_clear_flush
> >>>
> >>> Cc: Jonathan Corbet <[email protected]>
> >>> Cc: Nadav Amit <[email protected]>
> >>> Cc: Mel Gorman <[email protected]>
> >>> Tested-by: Yicong Yang <[email protected]>
> >>> Tested-by: Xin Hao <[email protected]>
> >>> Signed-off-by: Barry Song <[email protected]>
> >>> Signed-off-by: Yicong Yang <[email protected]>
> >>> ---
> >>> .../features/vm/TLB/arch-support.txt | 2 +-
> >>> arch/arm64/Kconfig | 1 +
> >>> arch/arm64/include/asm/tlbbatch.h | 12 ++++++++
> >>> arch/arm64/include/asm/tlbflush.h | 28 +++++++++++++++++--
> >>> 4 files changed, 40 insertions(+), 3 deletions(-)
> >>> create mode 100644 arch/arm64/include/asm/tlbbatch.h
> >>>
> >>> diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
> >>> index 1c009312b9c1..2caf815d7c6c 100644
> >>> --- a/Documentation/features/vm/TLB/arch-support.txt
> >>> +++ b/Documentation/features/vm/TLB/arch-support.txt
> >>> @@ -9,7 +9,7 @@
> >>> | alpha: | TODO |
> >>> | arc: | TODO |
> >>> | arm: | TODO |
> >>> - | arm64: | TODO |
> >>> + | arm64: | ok |
> >>> | csky: | TODO |
> >>> | hexagon: | TODO |
> >>> | ia64: | TODO |
> >>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> >>> index 571cc234d0b3..09d45cd6d665 100644
> >>> --- a/arch/arm64/Kconfig
> >>> +++ b/arch/arm64/Kconfig
> >>> @@ -93,6 +93,7 @@ config ARM64
> >>> select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
> >>> select ARCH_SUPPORTS_NUMA_BALANCING
> >>> select ARCH_SUPPORTS_PAGE_TABLE_CHECK
> >>> + select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> >>> select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
> >>> select ARCH_WANT_DEFAULT_BPF_JIT
> >>> select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
> >>> diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
> >>> new file mode 100644
> >>> index 000000000000..fedb0b87b8db
> >>> --- /dev/null
> >>> +++ b/arch/arm64/include/asm/tlbbatch.h
> >>> @@ -0,0 +1,12 @@
> >>> +/* SPDX-License-Identifier: GPL-2.0 */
> >>> +#ifndef _ARCH_ARM64_TLBBATCH_H
> >>> +#define _ARCH_ARM64_TLBBATCH_H
> >>> +
> >>> +struct arch_tlbflush_unmap_batch {
> >>> + /*
> >>> + * For arm64, HW can do tlb shootdown, so we don't
> >>> + * need to record cpumask for sending IPI
> >>> + */
> >>> +};
> >>> +
> >>> +#endif /* _ARCH_ARM64_TLBBATCH_H */
> >>> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> >>> index 412a3b9a3c25..23cbc987321a 100644
> >>> --- a/arch/arm64/include/asm/tlbflush.h
> >>> +++ b/arch/arm64/include/asm/tlbflush.h
> >>> @@ -254,17 +254,24 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
> >>> dsb(ish);
> >>> }
> >>>
> >>> -static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> >>> +
> >>> +static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
> >>> unsigned long uaddr)
> >>> {
> >>> unsigned long addr;
> >>>
> >>> dsb(ishst);
> >>> - addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
> >>> + addr = __TLBI_VADDR(uaddr, ASID(mm));
> >>> __tlbi(vale1is, addr);
> >>> __tlbi_user(vale1is, addr);
> >>> }
> >>>
> >>> +static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> >>> + unsigned long uaddr)
> >>> +{
> >>> + return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
> >>> +}
> >>> +
> >>> static inline void flush_tlb_page(struct vm_area_struct *vma,
> >>> unsigned long uaddr)
> >>> {
> >>> @@ -272,6 +279,23 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
> >>> dsb(ish);
> >>> }
> >>>
> >>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> >>> +{
> >>> + return true;
> >>> +}
> >>
> >> Always defer and batch up TLB flush, unconditionally ?
> >
> > My understanding is we actually don't need tlbbatch for a machine with one
> > or two cores as the tlb flush is not expensive. even for a system with four
> > cortex-a55 cores, i didn't see obvious cost. it was less than 1%.
> > when we have 8 cores, we see the obvious cost of tlb flush. for a server with
> > 100 crores, the cost is incredibly huge.
>
> Although dsb(ish) is deferred via arch_tlbbatch_flush(), there is still
> one dsb(isht) instruction left in __flush_tlb_page_nosync(). Is not that
> expensive as well, while queuing up individual TLB flushes ?

This one is much much cheaper as it is not waiting for the
completion of tlbi. waiting for the completion of tlbi is a big
deal in arm64, thus, similar optimization can be seen here

3403e56b41c1("arm64: mm: Don't wait for completion of TLB invalidation
when page aging").


>
> The very idea behind TLB deferral is the opportunity it (might) provide
> to accumulate address ranges and cpu masks so that individual TLB flush
> can be replaced with a more cost effective range based TLB flush. Hence
> I guess unless address range or cpumask based cost effective TLB flush
> is available, deferral does not improve the unmap performance as much.


After sending tlbi, if we wait for the completion of tlbi, we have to get Ack
from all cpus in the system, tlbi is not scalable. The point here is that we
avoid waiting for each individual TLBi. Alternatively, they are batched. If
you read the benchmark in the commit log, you can find the great decline
in the cost to swap out a page.

Thanks
Barry

2022-09-15 15:22:16

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation



> On Sep 14, 2022, at 11:42 PM, Barry Song <[email protected]> wrote:
>
>>
>> The very idea behind TLB deferral is the opportunity it (might) provide
>> to accumulate address ranges and cpu masks so that individual TLB flush
>> can be replaced with a more cost effective range based TLB flush. Hence
>> I guess unless address range or cpumask based cost effective TLB flush
>> is available, deferral does not improve the unmap performance as much.
>
>
> After sending tlbi, if we wait for the completion of tlbi, we have to get Ack
> from all cpus in the system, tlbi is not scalable. The point here is that we
> avoid waiting for each individual TLBi. Alternatively, they are batched. If
> you read the benchmark in the commit log, you can find the great decline
> in the cost to swap out a page.

Just a minor correction: arch_tlbbatch_flush() does not collect ranges.
On x86 it only accumulate CPU mask.

2022-09-19 03:21:06

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation



On 9/15/22 20:01, Nadav Amit wrote:
>
>
>> On Sep 14, 2022, at 11:42 PM, Barry Song <[email protected]> wrote:
>>
>>>
>>> The very idea behind TLB deferral is the opportunity it (might) provide
>>> to accumulate address ranges and cpu masks so that individual TLB flush
>>> can be replaced with a more cost effective range based TLB flush. Hence
>>> I guess unless address range or cpumask based cost effective TLB flush
>>> is available, deferral does not improve the unmap performance as much.
>>
>>
>> After sending tlbi, if we wait for the completion of tlbi, we have to get Ack
>> from all cpus in the system, tlbi is not scalable. The point here is that we
>> avoid waiting for each individual TLBi. Alternatively, they are batched. If
>> you read the benchmark in the commit log, you can find the great decline
>> in the cost to swap out a page.
>
> Just a minor correction: arch_tlbbatch_flush() does not collect ranges.
> On x86 it only accumulate CPU mask.

Thanks Nadav for the clarification.

2022-09-19 05:37:47

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation



On 9/15/22 12:12, Barry Song wrote:
> On Thu, Sep 15, 2022 at 6:07 PM Anshuman Khandual
> <[email protected]> wrote:
>>
>>
>>
>> On 9/9/22 11:05, Barry Song wrote:
>>> On Fri, Sep 9, 2022 at 5:24 PM Anshuman Khandual
>>> <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>> On 8/22/22 13:51, Yicong Yang wrote:
>>>>> From: Barry Song <[email protected]>
>>>>>
>>>>> on x86, batched and deferred tlb shootdown has lead to 90%
>>>>> performance increase on tlb shootdown. on arm64, HW can do
>>>>> tlb shootdown without software IPI. But sync tlbi is still
>>>>> quite expensive.
>>>>>
>>>>> Even running a simplest program which requires swapout can
>>>>> prove this is true,
>>>>> #include <sys/types.h>
>>>>> #include <unistd.h>
>>>>> #include <sys/mman.h>
>>>>> #include <string.h>
>>>>>
>>>>> int main()
>>>>> {
>>>>> #define SIZE (1 * 1024 * 1024)
>>>>> volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
>>>>> MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>>>>>
>>>>> memset(p, 0x88, SIZE);
>>>>>
>>>>> for (int k = 0; k < 10000; k++) {
>>>>> /* swap in */
>>>>> for (int i = 0; i < SIZE; i += 4096) {
>>>>> (void)p[i];
>>>>> }
>>>>>
>>>>> /* swap out */
>>>>> madvise(p, SIZE, MADV_PAGEOUT);
>>>>> }
>>>>> }
>>>>>
>>>>> Perf result on snapdragon 888 with 8 cores by using zRAM
>>>>> as the swap block device.
>>>>>
>>>>> ~ # perf record taskset -c 4 ./a.out
>>>>> [ perf record: Woken up 10 times to write data ]
>>>>> [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
>>>>> ~ # perf report
>>>>> # To display the perf.data header info, please use --header/--header-only options.
>>>>> # To display the perf.data header info, please use --header/--header-only options.
>>>>> #
>>>>> #
>>>>> # Total Lost Samples: 0
>>>>> #
>>>>> # Samples: 60K of event 'cycles'
>>>>> # Event count (approx.): 35706225414
>>>>> #
>>>>> # Overhead Command Shared Object Symbol
>>>>> # ........ ....... ................. .............................................................................
>>>>> #
>>>>> 21.07% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irq
>>>>> 8.23% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
>>>>> 6.67% a.out [kernel.kallsyms] [k] filemap_map_pages
>>>>> 6.16% a.out [kernel.kallsyms] [k] __zram_bvec_write
>>>>> 5.36% a.out [kernel.kallsyms] [k] ptep_clear_flush
>>>>> 3.71% a.out [kernel.kallsyms] [k] _raw_spin_lock
>>>>> 3.49% a.out [kernel.kallsyms] [k] memset64
>>>>> 1.63% a.out [kernel.kallsyms] [k] clear_page
>>>>> 1.42% a.out [kernel.kallsyms] [k] _raw_spin_unlock
>>>>> 1.26% a.out [kernel.kallsyms] [k] mod_zone_state.llvm.8525150236079521930
>>>>> 1.23% a.out [kernel.kallsyms] [k] xas_load
>>>>> 1.15% a.out [kernel.kallsyms] [k] zram_slot_lock
>>>>>
>>>>> ptep_clear_flush() takes 5.36% CPU in the micro-benchmark
>>>>> swapping in/out a page mapped by only one process. If the
>>>>> page is mapped by multiple processes, typically, like more
>>>>> than 100 on a phone, the overhead would be much higher as
>>>>> we have to run tlb flush 100 times for one single page.
>>>>> Plus, tlb flush overhead will increase with the number
>>>>> of CPU cores due to the bad scalability of tlb shootdown
>>>>> in HW, so those ARM64 servers should expect much higher
>>>>> overhead.
>>>>>
>>>>> Further perf annonate shows 95% cpu time of ptep_clear_flush
>>>>> is actually used by the final dsb() to wait for the completion
>>>>> of tlb flush. This provides us a very good chance to leverage
>>>>> the existing batched tlb in kernel. The minimum modification
>>>>> is that we only send async tlbi in the first stage and we send
>>>>> dsb while we have to sync in the second stage.
>>>>>
>>>>> With the above simplest micro benchmark, collapsed time to
>>>>> finish the program decreases around 5%.
>>>>>
>>>>> Typical collapsed time w/o patch:
>>>>> ~ # time taskset -c 4 ./a.out
>>>>> 0.21user 14.34system 0:14.69elapsed
>>>>> w/ patch:
>>>>> ~ # time taskset -c 4 ./a.out
>>>>> 0.22user 13.45system 0:13.80elapsed
>>>>>
>>>>> Also, Yicong Yang added the following observation.
>>>>> Tested with benchmark in the commit on Kunpeng920 arm64 server,
>>>>> observed an improvement around 12.5% with command
>>>>> `time ./swap_bench`.
>>>>> w/o w/
>>>>> real 0m13.460s 0m11.771s
>>>>> user 0m0.248s 0m0.279s
>>>>> sys 0m12.039s 0m11.458s
>>>>>
>>>>> Originally it's noticed a 16.99% overhead of ptep_clear_flush()
>>>>> which has been eliminated by this patch:
>>>>>
>>>>> [root@localhost yang]# perf record -- ./swap_bench && perf report
>>>>> [...]
>>>>> 16.99% swap_bench [kernel.kallsyms] [k] ptep_clear_flush
>>>>>
>>>>> Cc: Jonathan Corbet <[email protected]>
>>>>> Cc: Nadav Amit <[email protected]>
>>>>> Cc: Mel Gorman <[email protected]>
>>>>> Tested-by: Yicong Yang <[email protected]>
>>>>> Tested-by: Xin Hao <[email protected]>
>>>>> Signed-off-by: Barry Song <[email protected]>
>>>>> Signed-off-by: Yicong Yang <[email protected]>
>>>>> ---
>>>>> .../features/vm/TLB/arch-support.txt | 2 +-
>>>>> arch/arm64/Kconfig | 1 +
>>>>> arch/arm64/include/asm/tlbbatch.h | 12 ++++++++
>>>>> arch/arm64/include/asm/tlbflush.h | 28 +++++++++++++++++--
>>>>> 4 files changed, 40 insertions(+), 3 deletions(-)
>>>>> create mode 100644 arch/arm64/include/asm/tlbbatch.h
>>>>>
>>>>> diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
>>>>> index 1c009312b9c1..2caf815d7c6c 100644
>>>>> --- a/Documentation/features/vm/TLB/arch-support.txt
>>>>> +++ b/Documentation/features/vm/TLB/arch-support.txt
>>>>> @@ -9,7 +9,7 @@
>>>>> | alpha: | TODO |
>>>>> | arc: | TODO |
>>>>> | arm: | TODO |
>>>>> - | arm64: | TODO |
>>>>> + | arm64: | ok |
>>>>> | csky: | TODO |
>>>>> | hexagon: | TODO |
>>>>> | ia64: | TODO |
>>>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>>>> index 571cc234d0b3..09d45cd6d665 100644
>>>>> --- a/arch/arm64/Kconfig
>>>>> +++ b/arch/arm64/Kconfig
>>>>> @@ -93,6 +93,7 @@ config ARM64
>>>>> select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
>>>>> select ARCH_SUPPORTS_NUMA_BALANCING
>>>>> select ARCH_SUPPORTS_PAGE_TABLE_CHECK
>>>>> + select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
>>>>> select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
>>>>> select ARCH_WANT_DEFAULT_BPF_JIT
>>>>> select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
>>>>> diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
>>>>> new file mode 100644
>>>>> index 000000000000..fedb0b87b8db
>>>>> --- /dev/null
>>>>> +++ b/arch/arm64/include/asm/tlbbatch.h
>>>>> @@ -0,0 +1,12 @@
>>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>>> +#ifndef _ARCH_ARM64_TLBBATCH_H
>>>>> +#define _ARCH_ARM64_TLBBATCH_H
>>>>> +
>>>>> +struct arch_tlbflush_unmap_batch {
>>>>> + /*
>>>>> + * For arm64, HW can do tlb shootdown, so we don't
>>>>> + * need to record cpumask for sending IPI
>>>>> + */
>>>>> +};
>>>>> +
>>>>> +#endif /* _ARCH_ARM64_TLBBATCH_H */
>>>>> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
>>>>> index 412a3b9a3c25..23cbc987321a 100644
>>>>> --- a/arch/arm64/include/asm/tlbflush.h
>>>>> +++ b/arch/arm64/include/asm/tlbflush.h
>>>>> @@ -254,17 +254,24 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
>>>>> dsb(ish);
>>>>> }
>>>>>
>>>>> -static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
>>>>> +
>>>>> +static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
>>>>> unsigned long uaddr)
>>>>> {
>>>>> unsigned long addr;
>>>>>
>>>>> dsb(ishst);
>>>>> - addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
>>>>> + addr = __TLBI_VADDR(uaddr, ASID(mm));
>>>>> __tlbi(vale1is, addr);
>>>>> __tlbi_user(vale1is, addr);
>>>>> }
>>>>>
>>>>> +static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
>>>>> + unsigned long uaddr)
>>>>> +{
>>>>> + return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
>>>>> +}
>>>>> +
>>>>> static inline void flush_tlb_page(struct vm_area_struct *vma,
>>>>> unsigned long uaddr)
>>>>> {
>>>>> @@ -272,6 +279,23 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
>>>>> dsb(ish);
>>>>> }
>>>>>
>>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>>>>> +{
>>>>> + return true;
>>>>> +}
>>>>
>>>> Always defer and batch up TLB flush, unconditionally ?
>>>
>>> My understanding is we actually don't need tlbbatch for a machine with one
>>> or two cores as the tlb flush is not expensive. even for a system with four
>>> cortex-a55 cores, i didn't see obvious cost. it was less than 1%.
>>> when we have 8 cores, we see the obvious cost of tlb flush. for a server with
>>> 100 crores, the cost is incredibly huge.
>>
>> Although dsb(ish) is deferred via arch_tlbbatch_flush(), there is still
>> one dsb(isht) instruction left in __flush_tlb_page_nosync(). Is not that
>> expensive as well, while queuing up individual TLB flushes ?
>
> This one is much much cheaper as it is not waiting for the
> completion of tlbi. waiting for the completion of tlbi is a big
> deal in arm64, thus, similar optimization can be seen here
>
> 3403e56b41c1("arm64: mm: Don't wait for completion of TLB invalidation
> when page aging").
>
>
>>
>> The very idea behind TLB deferral is the opportunity it (might) provide
>> to accumulate address ranges and cpu masks so that individual TLB flush
>> can be replaced with a more cost effective range based TLB flush. Hence
>> I guess unless address range or cpumask based cost effective TLB flush
>> is available, deferral does not improve the unmap performance as much.
>
>
> After sending tlbi, if we wait for the completion of tlbi, we have to get Ack
> from all cpus in the system, tlbi is not scalable. The point here is that we
> avoid waiting for each individual TLBi. Alternatively, they are batched. If
> you read the benchmark in the commit log, you can find the great decline
> in the cost to swap out a page.

Alright, although collecting and deferring 'dsb(ish)' to the very end, does
not feel like a direct fit case for ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH but I
guess it can be used to improve unmap performance on arm64.

But is this 'dsb(ish)' deferral architecturally valid ?

Let's examine single page unmap path via try_to_unmap_one().

should_defer_flush() {
ptep_get_and_clear()
set_tlb_ubc_flush_pending()
arch_tlbbatch_add_mm()
__flush_tlb_page_nosync()
} else {
ptep_clear_flush()
ptep_get_and_clear()
flush_tlb_page()
flush_tlb_page_nosync()
__flush_tlb_page_nosync()
dsb(ish)
}

__flush_tlb_page_nosync()
{
dsb(ishst);
addr = __TLBI_VADDR(uaddr, ASID(mm));
__tlbi(vale1is, addr);
__tlbi_user(vale1is, addr);
}

Currently without TLB deferral, 'dsb(ish)' gets executed just after __tlbi()
and __tlbi_user(), because __flush_tlb_page_nosync() is an inline function.

#define __TLBI_0(op, arg) asm (ARM64_ASM_PREAMBLE \
"tlbi " #op "\n" \
ALTERNATIVE("nop\n nop", \
"dsb ish\n tlbi " #op, \
ARM64_WORKAROUND_REPEAT_TLBI, \
CONFIG_ARM64_WORKAROUND_REPEAT_TLBI) \
: : )

#define __TLBI_1(op, arg) asm (ARM64_ASM_PREAMBLE \
"tlbi " #op ", %0\n" \
ALTERNATIVE("nop\n nop", \
"dsb ish\n tlbi " #op ", %0", \
ARM64_WORKAROUND_REPEAT_TLBI, \
CONFIG_ARM64_WORKAROUND_REPEAT_TLBI) \
: : "r" (arg))

#define __TLBI_N(op, arg, n, ...) __TLBI_##n(op, arg)

#define __tlbi(op, ...) __TLBI_N(op, ##__VA_ARGS__, 1, 0)

#define __tlbi_user(op, arg) do { \
if (arm64_kernel_unmapped_at_el0()) \
__tlbi(op, (arg) | USER_ASID_FLAG); \
} while (0)

There is already a 'dsb(ish)' in between two subsequent TLB operations in
case ARM64_WORKAROUND_REPEAT_TLBI is detected on the system. Hence I guess
deferral should not enabled on such systems ?

But with deferral enabled, 'dsb(ish)' will be executed in arch_tlbbatch_flush()
via try_to_unmap_flush[_dirty](). There might be random number of instructions
in between __tlbi()/__tlbi_user() i.e 'tlbi' instructions and final 'dsb(ish)'.
Just wondering, if such 'detached in time with other instructions in between'
'tlbi' and 'dsb(ish)', is architecturally valid ?

There is a comment in 'struct tlbflush_unmap_batch'.

/*
* The arch code makes the following promise: generic code can modify a
* PTE, then call arch_tlbbatch_add_mm() (which internally provides all
* needed barriers), then call arch_tlbbatch_flush(), and the entries
* will be flushed on all CPUs by the time that arch_tlbbatch_flush()
* returns.
*/


It expects arch_tlbbatch_add_mm() to provide all barriers, hence wondering if
that would include just the first 'dsb(isht)' not the subsequent 'dsb(ish)' ?

2022-09-19 06:14:37

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation

On Mon, Sep 19, 2022 at 4:24 PM Anshuman Khandual
<[email protected]> wrote:
>
>
>
> On 9/15/22 12:12, Barry Song wrote:
> > On Thu, Sep 15, 2022 at 6:07 PM Anshuman Khandual
> > <[email protected]> wrote:
> >>
> >>
> >>
> >> On 9/9/22 11:05, Barry Song wrote:
> >>> On Fri, Sep 9, 2022 at 5:24 PM Anshuman Khandual
> >>> <[email protected]> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 8/22/22 13:51, Yicong Yang wrote:
> >>>>> From: Barry Song <[email protected]>
> >>>>>
> >>>>> on x86, batched and deferred tlb shootdown has lead to 90%
> >>>>> performance increase on tlb shootdown. on arm64, HW can do
> >>>>> tlb shootdown without software IPI. But sync tlbi is still
> >>>>> quite expensive.
> >>>>>
> >>>>> Even running a simplest program which requires swapout can
> >>>>> prove this is true,
> >>>>> #include <sys/types.h>
> >>>>> #include <unistd.h>
> >>>>> #include <sys/mman.h>
> >>>>> #include <string.h>
> >>>>>
> >>>>> int main()
> >>>>> {
> >>>>> #define SIZE (1 * 1024 * 1024)
> >>>>> volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
> >>>>> MAP_SHARED | MAP_ANONYMOUS, -1, 0);
> >>>>>
> >>>>> memset(p, 0x88, SIZE);
> >>>>>
> >>>>> for (int k = 0; k < 10000; k++) {
> >>>>> /* swap in */
> >>>>> for (int i = 0; i < SIZE; i += 4096) {
> >>>>> (void)p[i];
> >>>>> }
> >>>>>
> >>>>> /* swap out */
> >>>>> madvise(p, SIZE, MADV_PAGEOUT);
> >>>>> }
> >>>>> }
> >>>>>
> >>>>> Perf result on snapdragon 888 with 8 cores by using zRAM
> >>>>> as the swap block device.
> >>>>>
> >>>>> ~ # perf record taskset -c 4 ./a.out
> >>>>> [ perf record: Woken up 10 times to write data ]
> >>>>> [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
> >>>>> ~ # perf report
> >>>>> # To display the perf.data header info, please use --header/--header-only options.
> >>>>> # To display the perf.data header info, please use --header/--header-only options.
> >>>>> #
> >>>>> #
> >>>>> # Total Lost Samples: 0
> >>>>> #
> >>>>> # Samples: 60K of event 'cycles'
> >>>>> # Event count (approx.): 35706225414
> >>>>> #
> >>>>> # Overhead Command Shared Object Symbol
> >>>>> # ........ ....... ................. .............................................................................
> >>>>> #
> >>>>> 21.07% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irq
> >>>>> 8.23% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
> >>>>> 6.67% a.out [kernel.kallsyms] [k] filemap_map_pages
> >>>>> 6.16% a.out [kernel.kallsyms] [k] __zram_bvec_write
> >>>>> 5.36% a.out [kernel.kallsyms] [k] ptep_clear_flush
> >>>>> 3.71% a.out [kernel.kallsyms] [k] _raw_spin_lock
> >>>>> 3.49% a.out [kernel.kallsyms] [k] memset64
> >>>>> 1.63% a.out [kernel.kallsyms] [k] clear_page
> >>>>> 1.42% a.out [kernel.kallsyms] [k] _raw_spin_unlock
> >>>>> 1.26% a.out [kernel.kallsyms] [k] mod_zone_state.llvm.8525150236079521930
> >>>>> 1.23% a.out [kernel.kallsyms] [k] xas_load
> >>>>> 1.15% a.out [kernel.kallsyms] [k] zram_slot_lock
> >>>>>
> >>>>> ptep_clear_flush() takes 5.36% CPU in the micro-benchmark
> >>>>> swapping in/out a page mapped by only one process. If the
> >>>>> page is mapped by multiple processes, typically, like more
> >>>>> than 100 on a phone, the overhead would be much higher as
> >>>>> we have to run tlb flush 100 times for one single page.
> >>>>> Plus, tlb flush overhead will increase with the number
> >>>>> of CPU cores due to the bad scalability of tlb shootdown
> >>>>> in HW, so those ARM64 servers should expect much higher
> >>>>> overhead.
> >>>>>
> >>>>> Further perf annonate shows 95% cpu time of ptep_clear_flush
> >>>>> is actually used by the final dsb() to wait for the completion
> >>>>> of tlb flush. This provides us a very good chance to leverage
> >>>>> the existing batched tlb in kernel. The minimum modification
> >>>>> is that we only send async tlbi in the first stage and we send
> >>>>> dsb while we have to sync in the second stage.
> >>>>>
> >>>>> With the above simplest micro benchmark, collapsed time to
> >>>>> finish the program decreases around 5%.
> >>>>>
> >>>>> Typical collapsed time w/o patch:
> >>>>> ~ # time taskset -c 4 ./a.out
> >>>>> 0.21user 14.34system 0:14.69elapsed
> >>>>> w/ patch:
> >>>>> ~ # time taskset -c 4 ./a.out
> >>>>> 0.22user 13.45system 0:13.80elapsed
> >>>>>
> >>>>> Also, Yicong Yang added the following observation.
> >>>>> Tested with benchmark in the commit on Kunpeng920 arm64 server,
> >>>>> observed an improvement around 12.5% with command
> >>>>> `time ./swap_bench`.
> >>>>> w/o w/
> >>>>> real 0m13.460s 0m11.771s
> >>>>> user 0m0.248s 0m0.279s
> >>>>> sys 0m12.039s 0m11.458s
> >>>>>
> >>>>> Originally it's noticed a 16.99% overhead of ptep_clear_flush()
> >>>>> which has been eliminated by this patch:
> >>>>>
> >>>>> [root@localhost yang]# perf record -- ./swap_bench && perf report
> >>>>> [...]
> >>>>> 16.99% swap_bench [kernel.kallsyms] [k] ptep_clear_flush
> >>>>>
> >>>>> Cc: Jonathan Corbet <[email protected]>
> >>>>> Cc: Nadav Amit <[email protected]>
> >>>>> Cc: Mel Gorman <[email protected]>
> >>>>> Tested-by: Yicong Yang <[email protected]>
> >>>>> Tested-by: Xin Hao <[email protected]>
> >>>>> Signed-off-by: Barry Song <[email protected]>
> >>>>> Signed-off-by: Yicong Yang <[email protected]>
> >>>>> ---
> >>>>> .../features/vm/TLB/arch-support.txt | 2 +-
> >>>>> arch/arm64/Kconfig | 1 +
> >>>>> arch/arm64/include/asm/tlbbatch.h | 12 ++++++++
> >>>>> arch/arm64/include/asm/tlbflush.h | 28 +++++++++++++++++--
> >>>>> 4 files changed, 40 insertions(+), 3 deletions(-)
> >>>>> create mode 100644 arch/arm64/include/asm/tlbbatch.h
> >>>>>
> >>>>> diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
> >>>>> index 1c009312b9c1..2caf815d7c6c 100644
> >>>>> --- a/Documentation/features/vm/TLB/arch-support.txt
> >>>>> +++ b/Documentation/features/vm/TLB/arch-support.txt
> >>>>> @@ -9,7 +9,7 @@
> >>>>> | alpha: | TODO |
> >>>>> | arc: | TODO |
> >>>>> | arm: | TODO |
> >>>>> - | arm64: | TODO |
> >>>>> + | arm64: | ok |
> >>>>> | csky: | TODO |
> >>>>> | hexagon: | TODO |
> >>>>> | ia64: | TODO |
> >>>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> >>>>> index 571cc234d0b3..09d45cd6d665 100644
> >>>>> --- a/arch/arm64/Kconfig
> >>>>> +++ b/arch/arm64/Kconfig
> >>>>> @@ -93,6 +93,7 @@ config ARM64
> >>>>> select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
> >>>>> select ARCH_SUPPORTS_NUMA_BALANCING
> >>>>> select ARCH_SUPPORTS_PAGE_TABLE_CHECK
> >>>>> + select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> >>>>> select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
> >>>>> select ARCH_WANT_DEFAULT_BPF_JIT
> >>>>> select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
> >>>>> diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
> >>>>> new file mode 100644
> >>>>> index 000000000000..fedb0b87b8db
> >>>>> --- /dev/null
> >>>>> +++ b/arch/arm64/include/asm/tlbbatch.h
> >>>>> @@ -0,0 +1,12 @@
> >>>>> +/* SPDX-License-Identifier: GPL-2.0 */
> >>>>> +#ifndef _ARCH_ARM64_TLBBATCH_H
> >>>>> +#define _ARCH_ARM64_TLBBATCH_H
> >>>>> +
> >>>>> +struct arch_tlbflush_unmap_batch {
> >>>>> + /*
> >>>>> + * For arm64, HW can do tlb shootdown, so we don't
> >>>>> + * need to record cpumask for sending IPI
> >>>>> + */
> >>>>> +};
> >>>>> +
> >>>>> +#endif /* _ARCH_ARM64_TLBBATCH_H */
> >>>>> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> >>>>> index 412a3b9a3c25..23cbc987321a 100644
> >>>>> --- a/arch/arm64/include/asm/tlbflush.h
> >>>>> +++ b/arch/arm64/include/asm/tlbflush.h
> >>>>> @@ -254,17 +254,24 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
> >>>>> dsb(ish);
> >>>>> }
> >>>>>
> >>>>> -static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> >>>>> +
> >>>>> +static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
> >>>>> unsigned long uaddr)
> >>>>> {
> >>>>> unsigned long addr;
> >>>>>
> >>>>> dsb(ishst);
> >>>>> - addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
> >>>>> + addr = __TLBI_VADDR(uaddr, ASID(mm));
> >>>>> __tlbi(vale1is, addr);
> >>>>> __tlbi_user(vale1is, addr);
> >>>>> }
> >>>>>
> >>>>> +static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> >>>>> + unsigned long uaddr)
> >>>>> +{
> >>>>> + return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
> >>>>> +}
> >>>>> +
> >>>>> static inline void flush_tlb_page(struct vm_area_struct *vma,
> >>>>> unsigned long uaddr)
> >>>>> {
> >>>>> @@ -272,6 +279,23 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
> >>>>> dsb(ish);
> >>>>> }
> >>>>>
> >>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> >>>>> +{
> >>>>> + return true;
> >>>>> +}
> >>>>
> >>>> Always defer and batch up TLB flush, unconditionally ?
> >>>
> >>> My understanding is we actually don't need tlbbatch for a machine with one
> >>> or two cores as the tlb flush is not expensive. even for a system with four
> >>> cortex-a55 cores, i didn't see obvious cost. it was less than 1%.
> >>> when we have 8 cores, we see the obvious cost of tlb flush. for a server with
> >>> 100 crores, the cost is incredibly huge.
> >>
> >> Although dsb(ish) is deferred via arch_tlbbatch_flush(), there is still
> >> one dsb(isht) instruction left in __flush_tlb_page_nosync(). Is not that
> >> expensive as well, while queuing up individual TLB flushes ?
> >
> > This one is much much cheaper as it is not waiting for the
> > completion of tlbi. waiting for the completion of tlbi is a big
> > deal in arm64, thus, similar optimization can be seen here
> >
> > 3403e56b41c1("arm64: mm: Don't wait for completion of TLB invalidation
> > when page aging").
> >
> >
> >>
> >> The very idea behind TLB deferral is the opportunity it (might) provide
> >> to accumulate address ranges and cpu masks so that individual TLB flush
> >> can be replaced with a more cost effective range based TLB flush. Hence
> >> I guess unless address range or cpumask based cost effective TLB flush
> >> is available, deferral does not improve the unmap performance as much.
> >
> >
> > After sending tlbi, if we wait for the completion of tlbi, we have to get Ack
> > from all cpus in the system, tlbi is not scalable. The point here is that we
> > avoid waiting for each individual TLBi. Alternatively, they are batched. If
> > you read the benchmark in the commit log, you can find the great decline
> > in the cost to swap out a page.
>
> Alright, although collecting and deferring 'dsb(ish)' to the very end, does
> not feel like a direct fit case for ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH but I
> guess it can be used to improve unmap performance on arm64.
>
> But is this 'dsb(ish)' deferral architecturally valid ?

yes as dsb(ish) ensures the completion of tlbi.
https://developer.arm.com/documentation/den0024/a/Memory-Ordering/Barriers

We are even depending on the dsb(ish) during context switch in commit:
3403e56b41c1("arm64: mm: Don't wait for completion of TLB invalidation
when page aging").

Before the context switch, lots of tlbi could have been sent.

>
> Let's examine single page unmap path via try_to_unmap_one().
>
> should_defer_flush() {
> ptep_get_and_clear()
> set_tlb_ubc_flush_pending()
> arch_tlbbatch_add_mm()
> __flush_tlb_page_nosync()
> } else {
> ptep_clear_flush()
> ptep_get_and_clear()
> flush_tlb_page()
> flush_tlb_page_nosync()
> __flush_tlb_page_nosync()
> dsb(ish)
> }
>
> __flush_tlb_page_nosync()
> {
> dsb(ishst);
> addr = __TLBI_VADDR(uaddr, ASID(mm));
> __tlbi(vale1is, addr);
> __tlbi_user(vale1is, addr);
> }
>
> Currently without TLB deferral, 'dsb(ish)' gets executed just after __tlbi()
> and __tlbi_user(), because __flush_tlb_page_nosync() is an inline function.
>
> #define __TLBI_0(op, arg) asm (ARM64_ASM_PREAMBLE \
> "tlbi " #op "\n" \
> ALTERNATIVE("nop\n nop", \
> "dsb ish\n tlbi " #op, \
> ARM64_WORKAROUND_REPEAT_TLBI, \
> CONFIG_ARM64_WORKAROUND_REPEAT_TLBI) \
> : : )
>
> #define __TLBI_1(op, arg) asm (ARM64_ASM_PREAMBLE \
> "tlbi " #op ", %0\n" \
> ALTERNATIVE("nop\n nop", \
> "dsb ish\n tlbi " #op ", %0", \
> ARM64_WORKAROUND_REPEAT_TLBI, \
> CONFIG_ARM64_WORKAROUND_REPEAT_TLBI) \
> : : "r" (arg))
>
> #define __TLBI_N(op, arg, n, ...) __TLBI_##n(op, arg)
>
> #define __tlbi(op, ...) __TLBI_N(op, ##__VA_ARGS__, 1, 0)
>
> #define __tlbi_user(op, arg) do { \
> if (arm64_kernel_unmapped_at_el0()) \
> __tlbi(op, (arg) | USER_ASID_FLAG); \
> } while (0)
>
> There is already a 'dsb(ish)' in between two subsequent TLB operations in
> case ARM64_WORKAROUND_REPEAT_TLBI is detected on the system. Hence I guess
> deferral should not enabled on such systems ?
>
> But with deferral enabled, 'dsb(ish)' will be executed in arch_tlbbatch_flush()
> via try_to_unmap_flush[_dirty](). There might be random number of instructions
> in between __tlbi()/__tlbi_user() i.e 'tlbi' instructions and final 'dsb(ish)'.
> Just wondering, if such 'detached in time with other instructions in between'
> 'tlbi' and 'dsb(ish)', is architecturally valid ?

yes. I think so, arm64 even depends on the implicit dsb in context switch.

>
> There is a comment in 'struct tlbflush_unmap_batch'.
>
> /*
> * The arch code makes the following promise: generic code can modify a
> * PTE, then call arch_tlbbatch_add_mm() (which internally provides all
> * needed barriers), then call arch_tlbbatch_flush(), and the entries
> * will be flushed on all CPUs by the time that arch_tlbbatch_flush()
> * returns.
> */
>
>
> It expects arch_tlbbatch_add_mm() to provide all barriers, hence wondering if
> that would include just the first 'dsb(isht)' not the subsequent 'dsb(ish)' ?

yes. include/asm/tlbflush.h, we see the below comments:

* TLB Invalidation
* ================
*
* This header file implements the low-level TLB invalidation routines
* (sometimes referred to as "flushing" in the kernel) for arm64.
*
* Every invalidation operation uses the following template:
*
* DSB ISHST // Ensure prior page-table updates have completed
* TLBI ... // Invalidate the TLB
* DSB ISH // Ensure the TLB invalidation has completed
* if (invalidated kernel mappings)
* ISB // Discard any instructions fetched from the old mapping
*

Clearly dsb(ishst) has ensured page-table updates are visible to all CPUs.

Thanks
Barry