2022-07-18 09:37:43

by Barry Song

Subject: [RESEND PATCH v3] arm64: enable THP_SWAP for arm64

From: Barry Song <[email protected]>

THP_SWAP has been proven to improve the swap throughput significantly
on x86_64 according to commit bd4c82c22c367e ("mm, THP, swap: delay
splitting THP after swapped out").
When arm64 uses a 4K page size, it is quite similar to x86_64 in
having 2MB PMD-mapped THPs, and THP_SWAP is architecture-independent,
so enabling it benefits arm64 as well.
One corner case is that MTE assumes only base pages can be swapped.
We won't enable THP_SWAP on arm64 hardware with MTE support until MTE
is reworked to coexist with THP_SWAP.
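For context, a sketch of the relevant MTE swap hook (paraphrased from
arch/arm64/include/asm/pgtable.h as of this writing; simplified) shows
why the assumption exists: tags are saved one base page at a time.

#ifdef CONFIG_ARM64_MTE
#define __HAVE_ARCH_PREPARE_TO_SWAP
static inline int arch_prepare_to_swap(struct page *page)
{
	/* tags are saved per base page, hence the THP restriction */
	if (system_supports_mte())
		return mte_save_tags(page);
	return 0;
}
#endif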

The micro-benchmark below was written to measure THP swapout
throughput:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/time.h>

#define SIZE (400UL * 1024 * 1024)

unsigned long long tv_to_ms(struct timeval tv)
{
	return tv.tv_sec * 1000 + tv.tv_usec / 1000;
}

int main(void)
{
	struct timeval tv_b, tv_e;
	void *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("fail to get memory");
		exit(-1);
	}

	madvise(p, SIZE, MADV_HUGEPAGE);
	memset(p, 0x11, SIZE); /* write to fault the memory in */

	gettimeofday(&tv_b, NULL);
	madvise(p, SIZE, MADV_PAGEOUT);
	gettimeofday(&tv_e, NULL);

	printf("swp out bandwidth: %llu bytes/ms\n",
	       SIZE / (tv_to_ms(tv_e) - tv_to_ms(tv_b)));
	return 0;
}

Testing was done on a ROCK 3A board (rk3568, 64-bit quad-core
Cortex-A55):
THP swapout throughput w/o patch: 2734 bytes/ms (mean of 10 runs)
THP swapout throughput w/ patch: 3331 bytes/ms (mean of 10 runs)
i.e. roughly a 22% improvement.
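
(To reproduce, assuming a swap device such as zram is active and THP
is set to "always" or "madvise"; the file name is illustrative:)

  gcc -O2 thp_swapout.c -o thp_swapout
  ./thp_swapout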

Cc: "Huang, Ying" <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Steven Price <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
-v3:
 * refine the commit log;
 * add a benchmark result;
 * refine the arch_thp_swp_supported() macro.
 Thanks to Anshuman, Andrew and Steven for their comments.

 arch/arm64/Kconfig               |  1 +
 arch/arm64/include/asm/pgtable.h |  6 ++++++
 include/linux/huge_mm.h          | 12 ++++++++++++
 mm/swap_slots.c                  |  2 +-
 4 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1652a9800ebe..e1c540e80eec 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -101,6 +101,7 @@ config ARM64
 	select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
 	select ARCH_WANT_LD_ORPHAN_WARN
 	select ARCH_WANTS_NO_INSTR
+	select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES
 	select ARCH_HAS_UBSAN_SANITIZE_ALL
 	select ARM_AMBA
 	select ARM_ARCH_TIMER
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 0b6632f18364..78d6f6014bfb 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -45,6 +45,12 @@
 	__flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+static inline bool arch_thp_swp_supported(void)
+{
+	return !system_supports_mte();
+}
+#define arch_thp_swp_supported arch_thp_swp_supported
+
 /*
  * Outside of a few very special situations (e.g. hibernation), we always
  * use broadcast TLB invalidation instructions, therefore a spurious page
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index de29821231c9..4ddaf6ad73ef 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -461,4 +461,16 @@ static inline int split_folio_to_list(struct folio *folio,
 	return split_huge_page_to_list(&folio->page, list);
 }
 
+/*
+ * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
+ * limitations in the implementation like arm64 MTE can override this to
+ * false
+ */
+#ifndef arch_thp_swp_supported
+static inline bool arch_thp_swp_supported(void)
+{
+	return true;
+}
+#endif
+
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 2a65a89b5b4d..10b94d64cc25 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -307,7 +307,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
 	entry.val = 0;
 
 	if (folio_test_large(folio)) {
-		if (IS_ENABLED(CONFIG_THP_SWAP))
+		if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
 			get_swap_pages(1, &entry, folio_nr_pages(folio));
 		goto out;
 	}
--
2.25.1
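
For readers unfamiliar with the #ifndef override idiom the patch uses,
here is a minimal userspace sketch (names are illustrative, not kernel
code): defining a macro with the same name as the function tells the
generic header not to emit its fallback.

#include <stdbool.h>
#include <stdio.h>

/* "arch" header: the override, plus the same-named marker macro */
static inline bool arch_feature_supported(void)
{
	return false; /* e.g. !system_supports_mte() on arm64 */
}
#define arch_feature_supported arch_feature_supported

/* "generic" header: fallback compiled only when no override exists */
#ifndef arch_feature_supported
static inline bool arch_feature_supported(void)
{
	return true;
}
#endif

int main(void)
{
	printf("supported: %d\n", arch_feature_supported());
	return 0;
}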


2022-07-18 16:36:51

by Matthew Wilcox

Subject: Re: [RESEND PATCH v3] arm64: enable THP_SWAP for arm64

On Mon, Jul 18, 2022 at 09:00:50PM +1200, Barry Song wrote:
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 1652a9800ebe..e1c540e80eec 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -101,6 +101,7 @@ config ARM64
>  	select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>  	select ARCH_WANT_LD_ORPHAN_WARN
>  	select ARCH_WANTS_NO_INSTR
> +	select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES

Can't you avoid all the other changes by simply doing:

select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES && !ARM64_MTE

> [...]

2022-07-18 20:28:56

by Barry Song

Subject: Re: [RESEND PATCH v3] arm64: enable THP_SWAP for arm64

On Tue, Jul 19, 2022 at 4:06 AM Matthew Wilcox <[email protected]> wrote:
>
> On Mon, Jul 18, 2022 at 09:00:50PM +1200, Barry Song wrote:
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index 1652a9800ebe..e1c540e80eec 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -101,6 +101,7 @@ config ARM64
> >  	select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> >  	select ARCH_WANT_LD_ORPHAN_WARN
> >  	select ARCH_WANTS_NO_INSTR
> > +	select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES
>
> Can't you avoid all the other changes by simply doing:
>
> select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES && !ARM64_MTE
>
> > [...]

No: we also enable ARM64_MTE on platforms without MTE hardware.
ARMv8.5-based processors introduce the Memory Tagging Extension
(MTE) feature, but the Kconfig option defaults to y even for
platforms earlier than v8.5.

arm64 usually detects the CPU feature at runtime rather than
depending on static build-time configuration.
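
(For reference, the runtime gate looks roughly like this, paraphrased
from arch/arm64/include/asm/cpufeature.h; simplified. Even with
CONFIG_ARM64_MTE=y it returns false on CPUs that lack MTE.)

static inline bool system_supports_mte(void)
{
	return IS_ENABLED(CONFIG_ARM64_MTE) &&
	       cpus_have_const_cap(ARM64_MTE);
}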

> > [...]

Thanks
Barry

2022-07-19 01:16:49

by Huang, Ying

Subject: Re: [RESEND PATCH v3] arm64: enable THP_SWAP for arm64

Barry Song <[email protected]> writes:

> [...]
> +/*
> + * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> + * limitations in the implementation like arm64 MTE can override this to
> + * false
> + */
> +#ifndef arch_thp_swp_supported
> +static inline bool arch_thp_swp_supported(void)
> +{
> +	return true;
> +}

How about the following?

static inline bool arch_wants_thp_swap(void)
{
	return IS_ENABLED(CONFIG_ARCH_WANTS_THP_SWAP);
}

Best Regards,
Huang, Ying


2022-07-19 01:26:22

by Barry Song

Subject: Re: [RESEND PATCH v3] arm64: enable THP_SWAP for arm64

On Tue, Jul 19, 2022 at 12:44 PM Huang, Ying <[email protected]> wrote:
>
> Barry Song <[email protected]> writes:
>
> > [...]
> > +/*
> > + * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> > + * limitations in the implementation like arm64 MTE can override this to
> > + * false
> > + */
> > +#ifndef arch_thp_swp_supported
> > +static inline bool arch_thp_swp_supported(void)
> > +{
> > +	return true;
> > +}
>
> How about the following?
>
> static inline bool arch_wants_thp_swap(void)
> {
> 	return IS_ENABLED(CONFIG_ARCH_WANTS_THP_SWAP);
> }

This looks good. Then I'll need to change the arm64 version to

+static inline bool arch_thp_swp_supported(void)
+{
+	return IS_ENABLED(CONFIG_ARCH_WANTS_THP_SWAP) && !system_supports_mte();
+}


Thanks
Barry

2022-07-19 03:22:20

by Anshuman Khandual

Subject: Re: [RESEND PATCH v3] arm64: enable THP_SWAP for arm64



On 7/19/22 06:53, Barry Song wrote:
> On Tue, Jul 19, 2022 at 12:44 PM Huang, Ying <[email protected]> wrote:
>>
>> Barry Song <[email protected]> writes:
>>
>>> [...]
>>> +/*
>>> + * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
>>> + * limitations in the implementation like arm64 MTE can override this to
>>> + * false
>>> + */
>>> +#ifndef arch_thp_swp_supported
>>> +static inline bool arch_thp_swp_supported(void)
>>> +{
>>> +	return true;
>>> +}
>>
>> How about the following?
>>
>> static inline bool arch_wants_thp_swap(void)
>> {
>> 	return IS_ENABLED(CONFIG_ARCH_WANTS_THP_SWAP);
>> }
>
> This looks good. Then I'll need to change the arm64 version to
>
> +static inline bool arch_thp_swp_supported(void)
> +{
> +	return IS_ENABLED(CONFIG_ARCH_WANTS_THP_SWAP) && !system_supports_mte();
> +}

Why? CONFIG_THP_SWAP depends on ARCH_WANTS_THP_SWAP, so in
folio_alloc_swap(), IS_ENABLED(CONFIG_THP_SWAP) being true already
implies that ARCH_WANTS_THP_SWAP is enabled. Hence checking
ARCH_WANTS_THP_SWAP again makes no sense, either in the generic
fallback stub or in the arm64 platform override, because without
ARCH_WANTS_THP_SWAP enabled, arch_thp_swp_supported() should never
be called in the first place.
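
(For reference, the dependency in question, paraphrased from
mm/Kconfig:)

config THP_SWAP
	def_bool y
	depends on TRANSPARENT_HUGEPAGE && ARCH_WANTS_THP_SWAP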


2022-07-19 03:59:08

by Huang, Ying

Subject: Re: [RESEND PATCH v3] arm64: enable THP_SWAP for arm64

Anshuman Khandual <[email protected]> writes:

> On 7/19/22 06:53, Barry Song wrote:
>> On Tue, Jul 19, 2022 at 12:44 PM Huang, Ying <[email protected]> wrote:
>>>
>>> Barry Song <[email protected]> writes:
>>>
>>>> [...]
>>>> +/*
>>>> + * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
>>>> + * limitations in the implementation like arm64 MTE can override this to
>>>> + * false
>>>> + */
>>>> +#ifndef arch_thp_swp_supported
>>>> +static inline bool arch_thp_swp_supported(void)
>>>> +{
>>>> +	return true;
>>>> +}
>>>
>>> How about the following?
>>>
>>> static inline bool arch_wants_thp_swap(void)
>>> {
>>> 	return IS_ENABLED(CONFIG_ARCH_WANTS_THP_SWAP);
>>> }
>>
>> This looks good. Then I'll need to change the arm64 version to
>>
>> +static inline bool arch_thp_swp_supported(void)
>> +{
>> +	return IS_ENABLED(CONFIG_ARCH_WANTS_THP_SWAP) && !system_supports_mte();
>> +}
>
> Why? CONFIG_THP_SWAP depends on ARCH_WANTS_THP_SWAP, so in
> folio_alloc_swap(), IS_ENABLED(CONFIG_THP_SWAP) being true already
> implies that ARCH_WANTS_THP_SWAP is enabled. Hence checking
> ARCH_WANTS_THP_SWAP again makes no sense, either in the generic
> fallback stub or in the arm64 platform override, because without
> ARCH_WANTS_THP_SWAP enabled, arch_thp_swp_supported() should never
> be called in the first place.

For the only caller now, the checking looks redundant. But the
originally proposed implementation, as follows,

static inline bool arch_thp_swp_supported(void)
{
	return true;
}

will return true even on architectures that don't support/want THP swap.
That will confuse people too.

And the "redundant" checking has no run time overhead, because the
compiler will do the trick.
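
(A minimal userspace sketch of the same idea; MY_IS_ENABLED stands in
for IS_ENABLED(CONFIG_...), which likewise expands to a compile-time
0 or 1:)

#include <stdio.h>

#define MY_IS_ENABLED 0 /* flip to 1 to "enable the option" */

static int arch_check(void)
{
	puts("arch check ran");
	return 1;
}

int main(void)
{
	/* constant-false condition: the compiler drops the whole
	 * branch, including the arch_check() call */
	if (MY_IS_ENABLED && arch_check())
		puts("feature path taken");
	return 0;
}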

Best Regards,
Huang, Ying


2022-07-19 04:06:36

by Barry Song

Subject: Re: [RESEND PATCH v3] arm64: enable THP_SWAP for arm64

On Tue, Jul 19, 2022 at 3:59 PM Barry Song <[email protected]> wrote:
>
> On Tue, Jul 19, 2022 at 3:35 PM Anshuman Khandual
> <[email protected]> wrote:
> > [...]
>
> Hi Anshuman, Hi Ying,
> Thanks for the comments from both of you. Does the below look OK?
>
> generic,
>
> static inline bool arch_wants_thp_swap(void)
> {
> 	return IS_ENABLED(CONFIG_THP_SWAP);
> }
>

Sorry, I actually meant arch_thp_swp_supported(), not
arch_wants_thp_swap(), in the generic code:

static inline bool arch_thp_swp_supported(void)
{
	return IS_ENABLED(CONFIG_THP_SWAP);
}

> arm64,
>
> static inline bool arch_thp_swp_supported(void)
> {
> 	return IS_ENABLED(CONFIG_THP_SWAP) && !system_supports_mte();
> }
>
> caller,
>
> folio_alloc_swap(struct folio *folio)
> {
>
> 	if (folio_test_large(folio)) {
> -		if (IS_ENABLED(CONFIG_THP_SWAP))
> +		if (arch_thp_swp_supported())
> 			get_swap_pages(1, &entry, folio_nr_pages(folio));
> 		goto out;
> 	}
>
> Thanks
> Barry

2022-07-19 04:14:44

by Anshuman Khandual

Subject: Re: [RESEND PATCH v3] arm64: enable THP_SWAP for arm64



On 7/19/22 09:29, Barry Song wrote:
> On Tue, Jul 19, 2022 at 3:35 PM Anshuman Khandual
> <[email protected]> wrote:
>> [...]
>
> Hi Anshuman, Hi Ying,
> Thanks for the comments from both of you. Does the below look OK?
>
> generic,
>
> static inline bool arch_wants_thp_swap(void)
> {
> 	return IS_ENABLED(CONFIG_THP_SWAP);
> }
>
> arm64,
>
> static inline bool arch_thp_swp_supported(void)
> {
> 	return IS_ENABLED(CONFIG_THP_SWAP) && !system_supports_mte();
> }
>
> caller,
>
> folio_alloc_swap(struct folio *folio)
> {
>
> 	if (folio_test_large(folio)) {
> -		if (IS_ENABLED(CONFIG_THP_SWAP))
> +		if (arch_thp_swp_supported())
> 			get_swap_pages(1, &entry, folio_nr_pages(folio));
> 		goto out;
> 	}

The current proposal in this patch LGTM; I don't see any reason for these changes.

2022-07-19 04:14:44

by Huang, Ying

Subject: Re: [RESEND PATCH v3] arm64: enable THP_SWAP for arm64

Anshuman Khandual <[email protected]> writes:

> On 7/19/22 08:58, Huang, Ying wrote:
>> Anshuman Khandual <[email protected]> writes:
>>> [...]
>>>
>>> Why? CONFIG_THP_SWAP depends on ARCH_WANTS_THP_SWAP, so in
>>> folio_alloc_swap(), IS_ENABLED(CONFIG_THP_SWAP) being true already
>>> implies that ARCH_WANTS_THP_SWAP is enabled. Hence checking
>>> ARCH_WANTS_THP_SWAP again makes no sense, either in the generic
>>> fallback stub or in the arm64 platform override, because without
>>> ARCH_WANTS_THP_SWAP enabled, arch_thp_swp_supported() should never
>>> be called in the first place.
>>
>> For the only caller now, the checking looks redundant. But the
>> originally proposed implementation, as follows,
>>
>> static inline bool arch_thp_swp_supported(void)
>> {
>> 	return true;
>> }
>>
>> will return true even on architectures that don't support/want THP swap.
>
> But the function will never be called on those platforms.

But the function is defined on these platforms. We need to make the
function itself correct instead of just working for the only caller now.

>> That will confuse people too.
>
> I don't see how.

From the definition of the function, people will think THP swap is
supported/wanted on the platform. But in fact, it may not be.

>>
>> And the "redundant" checking has no run time overhead, because the
>> compiler will do the trick.
> I understand that, but I don't think this indirection is necessary.

Best Regards,
Huang, Ying

2022-07-19 04:15:45

by Anshuman Khandual

Subject: Re: [RESEND PATCH v3] arm64: enable THP_SWAP for arm64



On 7/19/22 08:58, Huang, Ying wrote:
> Anshuman Khandual <[email protected]> writes:
>
>> [...]
>>
>> Why ? CONFIG_THP_SWAP depends on ARCH_WANTS_THP_SWAP. In folio_alloc_swap(),
>> IS_ENABLED(CONFIG_THP_SWAP) enabled, will also imply ARCH_WANTS_THP_SWAP too
>> is enabled. Hence checking for ARCH_WANTS_THP_SWAP again does not make sense
>> either in the generic fallback stub, or in arm64 platform override. Because
>> without ARCH_WANTS_THP_SWAP enabled, arch_thp_swp_supported() should never
>> be called in the first place.
>
> For the only caller now, the checking looks redundant. But the original
> proposed implementation as follows,
>
> static inline bool arch_thp_swp_supported(void)
> {
> return true;
> }
>
> will return true even on architectures that don't support/want THP swap.

But the function will never be called for those platforms.

> That will confuse people too.

I don't see how.

>
> And the "redundant" checking has no run time overhead, because compiler
> will do the trick.
I understand that, but I don't think this indirection is necessary.
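
To make the "compiler will do the trick" point concrete, here is a
minimal user-space sketch; CONFIG_THP_SWAP_ENABLED is a made-up stand-in
for the kernel's IS_ENABLED(CONFIG_THP_SWAP), and the fallback mirrors
the generic stub quoted above.

/*
 * When the option is compile-time 0, the whole branch is dead code and
 * the compiler drops it, so the fallback's return value never matters
 * on such configurations.
 */
#include <stdbool.h>
#include <stdio.h>

#define CONFIG_THP_SWAP_ENABLED 0       /* pretend CONFIG_THP_SWAP=n */

static inline bool arch_thp_swp_supported(void)
{
        return true;                    /* generic fallback under debate */
}

int main(void)
{
        if (CONFIG_THP_SWAP_ENABLED && arch_thp_swp_supported())
                puts("would reserve swap slots for the whole THP");
        else
                puts("THP swap path compiled out");
        return 0;
}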

2022-07-19 04:28:20

by Barry Song

[permalink] [raw]
Subject: Re: [RESEND PATCH v3] arm64: enable THP_SWAP for arm64

On Tue, Jul 19, 2022 at 3:35 PM Anshuman Khandual
<[email protected]> wrote:
>
>
>
> On 7/19/22 08:58, Huang, Ying wrote:
> > Anshuman Khandual <[email protected]> writes:
> >
> >> On 7/19/22 06:53, Barry Song wrote:
> >>> On Tue, Jul 19, 2022 at 12:44 PM Huang, Ying <[email protected]> wrote:
> >>>>
> >>>> Barry Song <[email protected]> writes:
> >>>>
> >>>>> [...]
> >>>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> >>>>> index 1652a9800ebe..e1c540e80eec 100644
> >>>>> --- a/arch/arm64/Kconfig
> >>>>> +++ b/arch/arm64/Kconfig
> >>>>> @@ -101,6 +101,7 @@ config ARM64
> >>>>> select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> >>>>> select ARCH_WANT_LD_ORPHAN_WARN
> >>>>> select ARCH_WANTS_NO_INSTR
> >>>>> + select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES
> >>>>> select ARCH_HAS_UBSAN_SANITIZE_ALL
> >>>>> select ARM_AMBA
> >>>>> select ARM_ARCH_TIMER
> >>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> >>>>> index 0b6632f18364..78d6f6014bfb 100644
> >>>>> --- a/arch/arm64/include/asm/pgtable.h
> >>>>> +++ b/arch/arm64/include/asm/pgtable.h
> >>>>> @@ -45,6 +45,12 @@
> >>>>> __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
> >>>>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >>>>>
> >>>>> +static inline bool arch_thp_swp_supported(void)
> >>>>> +{
> >>>>> + return !system_supports_mte();
> >>>>> +}
> >>>>> +#define arch_thp_swp_supported arch_thp_swp_supported
> >>>>> +
> >>>>> /*
> >>>>> * Outside of a few very special situations (e.g. hibernation), we always
> >>>>> * use broadcast TLB invalidation instructions, therefore a spurious page
> >>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >>>>> index de29821231c9..4ddaf6ad73ef 100644
> >>>>> --- a/include/linux/huge_mm.h
> >>>>> +++ b/include/linux/huge_mm.h
> >>>>> @@ -461,4 +461,16 @@ static inline int split_folio_to_list(struct folio *folio,
> >>>>> return split_huge_page_to_list(&folio->page, list);
> >>>>> }
> >>>>>
> >>>>> +/*
> >>>>> + * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> >>>>> + * limitations in the implementation like arm64 MTE can override this to
> >>>>> + * false
> >>>>> + */
> >>>>> +#ifndef arch_thp_swp_supported
> >>>>> +static inline bool arch_thp_swp_supported(void)
> >>>>> +{
> >>>>> + return true;
> >>>>> +}
> >>>>
> >>>> How about the following?
> >>>>
> >>>> static inline bool arch_wants_thp_swap(void)
> >>>> {
> >>>> return IS_ENABLED(ARCH_WANTS_THP_SWAP);
> >>>> }
> >>>
> >>> This looks good. then i'll need to change arm64 to
> >>>
> >>> +static inline bool arch_thp_swp_supported(void)
> >>> +{
> >>> + return IS_ENABLED(ARCH_WANTS_THP_SWAP) && !system_supports_mte();
> >>> +}
> >>
> >> Why ? CONFIG_THP_SWAP depends on ARCH_WANTS_THP_SWAP. In folio_alloc_swap(),
> >> IS_ENABLED(CONFIG_THP_SWAP) enabled, will also imply ARCH_WANTS_THP_SWAP too
> >> is enabled. Hence checking for ARCH_WANTS_THP_SWAP again does not make sense
> >> either in the generic fallback stub, or in arm64 platform override. Because
> >> without ARCH_WANTS_THP_SWAP enabled, arch_thp_swp_supported() should never
> >> be called in the first place.
> >
> > For the only caller now, the checking looks redundant. But the original
> > proposed implementation as follows,
> >
> > static inline bool arch_thp_swp_supported(void)
> > {
> > return true;
> > }
> >
> > will return true even on architectures that don't support/want THP swap.
>
> But the function will never be called on for those platforms.
>
> > That will confuse people too.
>
> I dont see how.
>
> >
> > And the "redundant" checking has no run time overhead, because compiler
> > will do the trick.
> I understand that, but dont think this indirection is necessary.

Hi Anshuman, Hi Ying,
Thanks to both of you for your comments. Does the below look OK?

generic,

static inline bool arch_wants_thp_swap(void)
{
        return IS_ENABLED(CONFIG_THP_SWAP);
}

arm64,

static inline bool arch_thp_swp_supported(void)
{
        return IS_ENABLED(CONFIG_THP_SWAP) && !system_supports_mte();
}

caller,

folio_alloc_swap(struct folio *folio)
{

        if (folio_test_large(folio)) {
-               if (IS_ENABLED(CONFIG_THP_SWAP))
+               if (arch_thp_swp_supported())
                        get_swap_pages(1, &entry, folio_nr_pages(folio));
                goto out;
        }
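
As a rough stand-alone sketch of the arm64 override and its caller
(system_supports_mte() is stubbed, since it is an arm64 kernel internal,
and CONFIG_THP_SWAP is modelled as a plain macro rather than the real
Kconfig/IS_ENABLED() machinery):

/*
 * The override gates whole-THP swapout on both the Kconfig option and
 * the absence of MTE; everything here is a stub, not the kernel code.
 */
#include <stdbool.h>
#include <stdio.h>

#define CONFIG_THP_SWAP 1               /* pretend CONFIG_THP_SWAP=y */

static bool system_supports_mte(void)  /* stub for the arm64 helper */
{
        return false;
}

static inline bool arch_thp_swp_supported(void)
{
        return CONFIG_THP_SWAP && !system_supports_mte();
}

int main(void)
{
        if (arch_thp_swp_supported())
                puts("large folio: get_swap_pages() for the whole THP");
        else
                puts("large folio: split before swapping out");
        return 0;
}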

Thanks
Barry

2022-07-19 06:11:22

by Huang, Ying

[permalink] [raw]
Subject: Re: [RESEND PATCH v3] arm64: enable THP_SWAP for arm64

Barry Song <[email protected]> writes:

> On Tue, Jul 19, 2022 at 3:59 PM Barry Song <[email protected]> wrote:
>>
>> On Tue, Jul 19, 2022 at 3:35 PM Anshuman Khandual
>> <[email protected]> wrote:
>> >
>> >
>> >
>> > On 7/19/22 08:58, Huang, Ying wrote:
>> > > Anshuman Khandual <[email protected]> writes:
>> > >
>> > >> On 7/19/22 06:53, Barry Song wrote:
>> > >>> On Tue, Jul 19, 2022 at 12:44 PM Huang, Ying <[email protected]> wrote:
>> > >>>>
>> > >>>> Barry Song <[email protected]> writes:
>> > >>>>
>> > >>>>> [...]
>> > >>>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> > >>>>> index 1652a9800ebe..e1c540e80eec 100644
>> > >>>>> --- a/arch/arm64/Kconfig
>> > >>>>> +++ b/arch/arm64/Kconfig
>> > >>>>> @@ -101,6 +101,7 @@ config ARM64
>> > >>>>> select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>> > >>>>> select ARCH_WANT_LD_ORPHAN_WARN
>> > >>>>> select ARCH_WANTS_NO_INSTR
>> > >>>>> + select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES
>> > >>>>> select ARCH_HAS_UBSAN_SANITIZE_ALL
>> > >>>>> select ARM_AMBA
>> > >>>>> select ARM_ARCH_TIMER
>> > >>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> > >>>>> index 0b6632f18364..78d6f6014bfb 100644
>> > >>>>> --- a/arch/arm64/include/asm/pgtable.h
>> > >>>>> +++ b/arch/arm64/include/asm/pgtable.h
>> > >>>>> @@ -45,6 +45,12 @@
>> > >>>>> __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
>> > >>>>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>> > >>>>>
>> > >>>>> +static inline bool arch_thp_swp_supported(void)
>> > >>>>> +{
>> > >>>>> + return !system_supports_mte();
>> > >>>>> +}
>> > >>>>> +#define arch_thp_swp_supported arch_thp_swp_supported
>> > >>>>> +
>> > >>>>> /*
>> > >>>>> * Outside of a few very special situations (e.g. hibernation), we always
>> > >>>>> * use broadcast TLB invalidation instructions, therefore a spurious page
>> > >>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> > >>>>> index de29821231c9..4ddaf6ad73ef 100644
>> > >>>>> --- a/include/linux/huge_mm.h
>> > >>>>> +++ b/include/linux/huge_mm.h
>> > >>>>> @@ -461,4 +461,16 @@ static inline int split_folio_to_list(struct folio *folio,
>> > >>>>> return split_huge_page_to_list(&folio->page, list);
>> > >>>>> }
>> > >>>>>
>> > >>>>> +/*
>> > >>>>> + * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
>> > >>>>> + * limitations in the implementation like arm64 MTE can override this to
>> > >>>>> + * false
>> > >>>>> + */
>> > >>>>> +#ifndef arch_thp_swp_supported
>> > >>>>> +static inline bool arch_thp_swp_supported(void)
>> > >>>>> +{
>> > >>>>> + return true;
>> > >>>>> +}
>> > >>>>
>> > >>>> How about the following?
>> > >>>>
>> > >>>> static inline bool arch_wants_thp_swap(void)
>> > >>>> {
>> > >>>> return IS_ENABLED(ARCH_WANTS_THP_SWAP);
>> > >>>> }
>> > >>>
>> > >>> This looks good. then i'll need to change arm64 to
>> > >>>
>> > >>> +static inline bool arch_thp_swp_supported(void)
>> > >>> +{
>> > >>> + return IS_ENABLED(ARCH_WANTS_THP_SWAP) && !system_supports_mte();
>> > >>> +}
>> > >>
>> > >> Why ? CONFIG_THP_SWAP depends on ARCH_WANTS_THP_SWAP. In folio_alloc_swap(),
>> > >> IS_ENABLED(CONFIG_THP_SWAP) enabled, will also imply ARCH_WANTS_THP_SWAP too
>> > >> is enabled. Hence checking for ARCH_WANTS_THP_SWAP again does not make sense
>> > >> either in the generic fallback stub, or in arm64 platform override. Because
>> > >> without ARCH_WANTS_THP_SWAP enabled, arch_thp_swp_supported() should never
>> > >> be called in the first place.
>> > >
>> > > For the only caller now, the checking looks redundant. But the original
>> > > proposed implementation as follows,
>> > >
>> > > static inline bool arch_thp_swp_supported(void)
>> > > {
>> > > return true;
>> > > }
>> > >
>> > > will return true even on architectures that don't support/want THP swap.
>> >
>> > But the function will never be called on for those platforms.
>> >
>> > > That will confuse people too.
>> >
>> > I dont see how.
>> >
>> > >
>> > > And the "redundant" checking has no run time overhead, because compiler
>> > > will do the trick.
>> > I understand that, but dont think this indirection is necessary.
>>
>> Hi Anshuman, Hi Ying,
>> Thanks for the comments of both of you. Does the below look ok?
>>
>> generic,
>>
>> static inline bool arch_wants_thp_swap(void)
>> {
>> return IS_ENABLED(CONFIG_THP_SWAP);
>> }
>>
>
> sorry, i actually meant arch_thp_swp_supported() but not
> arch_wants_thp_swap() in generic code,
>
> static inline bool arch_thp_swp_supported(void)
> {
> return IS_ENABLED(CONFIG_THP_SWAP);
> }

IS_ENABLED(CONFIG_THP_SWAP) doesn't match the name either. It's an option
selected by users, while arch_thp_swp_supported() is meant to report the
architecture's capability.

Best Regards,
Huang, Ying

>> arm64,
>>
>> static inline bool arch_thp_swp_supported(void)
>> {
>> return IS_ENABLED(CONFIG_THP_SWAP) && !system_supports_mte();
>> }
>>
>> caller,
>>
>> folio_alloc_swap(struct folio *folio)
>> {
>>
>> if (folio_test_large(folio)) {
>> - if (IS_ENABLED(CONFIG_THP_SWAP))
>> + if (arch_thp_swp_supported())
>> get_swap_pages(1, &entry, folio_nr_pages(folio));
>> goto out;
>> }
>>
>> Thanks
>> Barry

2022-07-19 06:35:42

by Barry Song

[permalink] [raw]
Subject: Re: [RESEND PATCH v3] arm64: enable THP_SWAP for arm64

On Tue, Jul 19, 2022 at 5:47 PM Huang, Ying <[email protected]> wrote:
>
> Barry Song <[email protected]> writes:
>
> > On Tue, Jul 19, 2022 at 3:59 PM Barry Song <[email protected]> wrote:
> >>
> >> On Tue, Jul 19, 2022 at 3:35 PM Anshuman Khandual
> >> <[email protected]> wrote:
> >> >
> >> >
> >> >
> >> > On 7/19/22 08:58, Huang, Ying wrote:
> >> > > Anshuman Khandual <[email protected]> writes:
> >> > >
> >> > >> On 7/19/22 06:53, Barry Song wrote:
> >> > >>> On Tue, Jul 19, 2022 at 12:44 PM Huang, Ying <[email protected]> wrote:
> >> > >>>>
> >> > >>>> Barry Song <[email protected]> writes:
> >> > >>>>
> >> > >>>>> [...]
> >> > >>>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> >> > >>>>> index 1652a9800ebe..e1c540e80eec 100644
> >> > >>>>> --- a/arch/arm64/Kconfig
> >> > >>>>> +++ b/arch/arm64/Kconfig
> >> > >>>>> @@ -101,6 +101,7 @@ config ARM64
> >> > >>>>> select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> >> > >>>>> select ARCH_WANT_LD_ORPHAN_WARN
> >> > >>>>> select ARCH_WANTS_NO_INSTR
> >> > >>>>> + select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES
> >> > >>>>> select ARCH_HAS_UBSAN_SANITIZE_ALL
> >> > >>>>> select ARM_AMBA
> >> > >>>>> select ARM_ARCH_TIMER
> >> > >>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> >> > >>>>> index 0b6632f18364..78d6f6014bfb 100644
> >> > >>>>> --- a/arch/arm64/include/asm/pgtable.h
> >> > >>>>> +++ b/arch/arm64/include/asm/pgtable.h
> >> > >>>>> @@ -45,6 +45,12 @@
> >> > >>>>> __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
> >> > >>>>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >> > >>>>>
> >> > >>>>> +static inline bool arch_thp_swp_supported(void)
> >> > >>>>> +{
> >> > >>>>> + return !system_supports_mte();
> >> > >>>>> +}
> >> > >>>>> +#define arch_thp_swp_supported arch_thp_swp_supported
> >> > >>>>> +
> >> > >>>>> /*
> >> > >>>>> * Outside of a few very special situations (e.g. hibernation), we always
> >> > >>>>> * use broadcast TLB invalidation instructions, therefore a spurious page
> >> > >>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >> > >>>>> index de29821231c9..4ddaf6ad73ef 100644
> >> > >>>>> --- a/include/linux/huge_mm.h
> >> > >>>>> +++ b/include/linux/huge_mm.h
> >> > >>>>> @@ -461,4 +461,16 @@ static inline int split_folio_to_list(struct folio *folio,
> >> > >>>>> return split_huge_page_to_list(&folio->page, list);
> >> > >>>>> }
> >> > >>>>>
> >> > >>>>> +/*
> >> > >>>>> + * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> >> > >>>>> + * limitations in the implementation like arm64 MTE can override this to
> >> > >>>>> + * false
> >> > >>>>> + */
> >> > >>>>> +#ifndef arch_thp_swp_supported
> >> > >>>>> +static inline bool arch_thp_swp_supported(void)
> >> > >>>>> +{
> >> > >>>>> + return true;
> >> > >>>>> +}
> >> > >>>>
> >> > >>>> How about the following?
> >> > >>>>
> >> > >>>> static inline bool arch_wants_thp_swap(void)
> >> > >>>> {
> >> > >>>> return IS_ENABLED(ARCH_WANTS_THP_SWAP);
> >> > >>>> }
> >> > >>>
> >> > >>> This looks good. then i'll need to change arm64 to
> >> > >>>
> >> > >>> +static inline bool arch_thp_swp_supported(void)
> >> > >>> +{
> >> > >>> + return IS_ENABLED(ARCH_WANTS_THP_SWAP) && !system_supports_mte();
> >> > >>> +}
> >> > >>
> >> > >> Why ? CONFIG_THP_SWAP depends on ARCH_WANTS_THP_SWAP. In folio_alloc_swap(),
> >> > >> IS_ENABLED(CONFIG_THP_SWAP) enabled, will also imply ARCH_WANTS_THP_SWAP too
> >> > >> is enabled. Hence checking for ARCH_WANTS_THP_SWAP again does not make sense
> >> > >> either in the generic fallback stub, or in arm64 platform override. Because
> >> > >> without ARCH_WANTS_THP_SWAP enabled, arch_thp_swp_supported() should never
> >> > >> be called in the first place.
> >> > >
> >> > > For the only caller now, the checking looks redundant. But the original
> >> > > proposed implementation as follows,
> >> > >
> >> > > static inline bool arch_thp_swp_supported(void)
> >> > > {
> >> > > return true;
> >> > > }
> >> > >
> >> > > will return true even on architectures that don't support/want THP swap.
> >> >
> >> > But the function will never be called on for those platforms.
> >> >
> >> > > That will confuse people too.
> >> >
> >> > I dont see how.
> >> >
> >> > >
> >> > > And the "redundant" checking has no run time overhead, because compiler
> >> > > will do the trick.
> >> > I understand that, but dont think this indirection is necessary.
> >>
> >> Hi Anshuman, Hi Ying,
> >> Thanks for the comments of both of you. Does the below look ok?
> >>
> >> generic,
> >>
> >> static inline bool arch_wants_thp_swap(void)
> >> {
> >> return IS_ENABLED(CONFIG_THP_SWAP);
> >> }
> >>
> >
> > sorry, i actually meant arch_thp_swp_supported() but not
> > arch_wants_thp_swap() in generic code,
> >
> > static inline bool arch_thp_swp_supported(void)
> > {
> > return IS_ENABLED(CONFIG_THP_SWAP);
> > }
>
> IS_ENABLED(CONFIG_THP_SWAP) doesn't match the name too. It's an option
> selected by users. arch_thp_swp_supported() is to report the
> capability.

Hi Ying,
CONFIG_THP_SWAP implicitly includes ARCH_WANTS_THP_SWAP, so it seems
a bit odd to still have another arch_wants_thp_swap().
If the name arch_thp_swp_supported doesn't seem sensible to you, would
thp_swp_supported() without the arch_ prefix make more sense? A similar
example is,

static inline bool gigantic_page_runtime_supported(void)
{
        return IS_ENABLED(CONFIG_ARCH_HAS_GIGANTIC_PAGE);
}

Otherwise, can we just keep the code as is according to Anshuman's suggestion?

Thanks
Barry




>
> Best Regards,
> Huang, Ying
>
> >> arm64,
> >>
> >> static inline bool arch_thp_swp_supported(void)
> >> {
> >> return IS_ENABLED(CONFIG_THP_SWAP) && !system_supports_mte();
> >> }
> >>
> >> caller,
> >>
> >> folio_alloc_swap(struct folio *folio)
> >> {
> >>
> >> if (folio_test_large(folio)) {
> >> - if (IS_ENABLED(CONFIG_THP_SWAP))
> >> + if (arch_thp_swp_supported())
> >> get_swap_pages(1, &entry, folio_nr_pages(folio));
> >> goto out;
> >> }
> >>
> >> Thanks
> >> Barry

2022-07-19 06:54:54

by Huang, Ying

[permalink] [raw]
Subject: Re: [RESEND PATCH v3] arm64: enable THP_SWAP for arm64

Barry Song <[email protected]> writes:

> On Tue, Jul 19, 2022 at 5:47 PM Huang, Ying <[email protected]> wrote:
>>
>> Barry Song <[email protected]> writes:
>>
>> > On Tue, Jul 19, 2022 at 3:59 PM Barry Song <[email protected]> wrote:
>> >>
>> >> On Tue, Jul 19, 2022 at 3:35 PM Anshuman Khandual
>> >> <[email protected]> wrote:
>> >> >
>> >> >
>> >> >
>> >> > On 7/19/22 08:58, Huang, Ying wrote:
>> >> > > Anshuman Khandual <[email protected]> writes:
>> >> > >
>> >> > >> On 7/19/22 06:53, Barry Song wrote:
>> >> > >>> On Tue, Jul 19, 2022 at 12:44 PM Huang, Ying <[email protected]> wrote:
>> >> > >>>>
>> >> > >>>> Barry Song <[email protected]> writes:
>> >> > >>>>
>> >> > >>>>> [...]
>> >> > >>>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> >> > >>>>> index 1652a9800ebe..e1c540e80eec 100644
>> >> > >>>>> --- a/arch/arm64/Kconfig
>> >> > >>>>> +++ b/arch/arm64/Kconfig
>> >> > >>>>> @@ -101,6 +101,7 @@ config ARM64
>> >> > >>>>> select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>> >> > >>>>> select ARCH_WANT_LD_ORPHAN_WARN
>> >> > >>>>> select ARCH_WANTS_NO_INSTR
>> >> > >>>>> + select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES
>> >> > >>>>> select ARCH_HAS_UBSAN_SANITIZE_ALL
>> >> > >>>>> select ARM_AMBA
>> >> > >>>>> select ARM_ARCH_TIMER
>> >> > >>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> >> > >>>>> index 0b6632f18364..78d6f6014bfb 100644
>> >> > >>>>> --- a/arch/arm64/include/asm/pgtable.h
>> >> > >>>>> +++ b/arch/arm64/include/asm/pgtable.h
>> >> > >>>>> @@ -45,6 +45,12 @@
>> >> > >>>>> __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
>> >> > >>>>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>> >> > >>>>>
>> >> > >>>>> +static inline bool arch_thp_swp_supported(void)
>> >> > >>>>> +{
>> >> > >>>>> + return !system_supports_mte();
>> >> > >>>>> +}
>> >> > >>>>> +#define arch_thp_swp_supported arch_thp_swp_supported
>> >> > >>>>> +
>> >> > >>>>> /*
>> >> > >>>>> * Outside of a few very special situations (e.g. hibernation), we always
>> >> > >>>>> * use broadcast TLB invalidation instructions, therefore a spurious page
>> >> > >>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> >> > >>>>> index de29821231c9..4ddaf6ad73ef 100644
>> >> > >>>>> --- a/include/linux/huge_mm.h
>> >> > >>>>> +++ b/include/linux/huge_mm.h
>> >> > >>>>> @@ -461,4 +461,16 @@ static inline int split_folio_to_list(struct folio *folio,
>> >> > >>>>> return split_huge_page_to_list(&folio->page, list);
>> >> > >>>>> }
>> >> > >>>>>
>> >> > >>>>> +/*
>> >> > >>>>> + * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
>> >> > >>>>> + * limitations in the implementation like arm64 MTE can override this to
>> >> > >>>>> + * false
>> >> > >>>>> + */
>> >> > >>>>> +#ifndef arch_thp_swp_supported
>> >> > >>>>> +static inline bool arch_thp_swp_supported(void)
>> >> > >>>>> +{
>> >> > >>>>> + return true;
>> >> > >>>>> +}
>> >> > >>>>
>> >> > >>>> How about the following?
>> >> > >>>>
>> >> > >>>> static inline bool arch_wants_thp_swap(void)
>> >> > >>>> {
>> >> > >>>> return IS_ENABLED(ARCH_WANTS_THP_SWAP);
>> >> > >>>> }
>> >> > >>>
>> >> > >>> This looks good. then i'll need to change arm64 to
>> >> > >>>
>> >> > >>> +static inline bool arch_thp_swp_supported(void)
>> >> > >>> +{
>> >> > >>> + return IS_ENABLED(ARCH_WANTS_THP_SWAP) && !system_supports_mte();
>> >> > >>> +}
>> >> > >>
>> >> > >> Why ? CONFIG_THP_SWAP depends on ARCH_WANTS_THP_SWAP. In folio_alloc_swap(),
>> >> > >> IS_ENABLED(CONFIG_THP_SWAP) enabled, will also imply ARCH_WANTS_THP_SWAP too
>> >> > >> is enabled. Hence checking for ARCH_WANTS_THP_SWAP again does not make sense
>> >> > >> either in the generic fallback stub, or in arm64 platform override. Because
>> >> > >> without ARCH_WANTS_THP_SWAP enabled, arch_thp_swp_supported() should never
>> >> > >> be called in the first place.
>> >> > >
>> >> > > For the only caller now, the checking looks redundant. But the original
>> >> > > proposed implementation as follows,
>> >> > >
>> >> > > static inline bool arch_thp_swp_supported(void)
>> >> > > {
>> >> > > return true;
>> >> > > }
>> >> > >
>> >> > > will return true even on architectures that don't support/want THP swap.
>> >> >
>> >> > But the function will never be called on for those platforms.
>> >> >
>> >> > > That will confuse people too.
>> >> >
>> >> > I dont see how.
>> >> >
>> >> > >
>> >> > > And the "redundant" checking has no run time overhead, because compiler
>> >> > > will do the trick.
>> >> > I understand that, but dont think this indirection is necessary.
>> >>
>> >> Hi Anshuman, Hi Ying,
>> >> Thanks for the comments of both of you. Does the below look ok?
>> >>
>> >> generic,
>> >>
>> >> static inline bool arch_wants_thp_swap(void)
>> >> {
>> >> return IS_ENABLED(CONFIG_THP_SWAP);
>> >> }
>> >>
>> >
>> > sorry, i actually meant arch_thp_swp_supported() but not
>> > arch_wants_thp_swap() in generic code,
>> >
>> > static inline bool arch_thp_swp_supported(void)
>> > {
>> > return IS_ENABLED(CONFIG_THP_SWAP);
>> > }
>>
>> IS_ENABLED(CONFIG_THP_SWAP) doesn't match the name too. It's an option
>> selected by users. arch_thp_swp_supported() is to report the
>> capability.
>
> Hi Ying,
> CONFIG_THP_SWAP implicitly includes ARCH_WANTS_THP_SWAP. So it seems
> a bit odd to have still another arch_wants_thp_swap().
> if the name of arch_thp_swp_supported is not sensible to you, will
> thp_swp_supported()
> without arch_ make more sense? a similar example is,
>
> static inline bool gigantic_page_runtime_supported(void)
> {
> return IS_ENABLED(CONFIG_ARCH_HAS_GIGANTIC_PAGE);
> }

Here, the capability of the architecture is reported. But
CONFIG_THP_SWAP is a user option.

I'm OK with the function name "arch_thp_swp_supported()". I just think
that we should implement the function in a way that is consistent with
the function name as much as possible. That is, don't return true on
architectures where THP swap isn't in fact supported.

> Otherwise, can we just keep the code as is according to Anshuman's suggestion?

Although I still think my way is better, I will not force you to do
that. If you don't think that is better, you can use your original
implementation.

Best Regards,
Huang, Ying

> Thanks
> Barry
>
> }
>
>
>
>>
>> Best Regards,
>> Huang, Ying
>>
>> >> arm64,
>> >>
>> >> static inline bool arch_thp_swp_supported(void)
>> >> {
>> >> return IS_ENABLED(CONFIG_THP_SWAP) && !system_supports_mte();
>> >> }
>> >>
>> >> caller,
>> >>
>> >> folio_alloc_swap(struct folio *folio)
>> >> {
>> >>
>> >> if (folio_test_large(folio)) {
>> >> - if (IS_ENABLED(CONFIG_THP_SWAP))
>> >> + if (arch_thp_swp_supported())
>> >> get_swap_pages(1, &entry, folio_nr_pages(folio));
>> >> goto out;
>> >> }
>> >>
>> >> Thanks
>> >> Barry

2022-07-19 07:25:34

by Huang, Ying

[permalink] [raw]
Subject: Re: [RESEND PATCH v3] arm64: enable THP_SWAP for arm64

Barry Song <[email protected]> writes:

> On Tue, Jul 19, 2022 at 6:33 PM Huang, Ying <[email protected]> wrote:
>>
>> Barry Song <[email protected]> writes:
>>
>> > On Tue, Jul 19, 2022 at 5:47 PM Huang, Ying <[email protected]> wrote:
>> >>
>> >> Barry Song <[email protected]> writes:
>> >>
>> >> > On Tue, Jul 19, 2022 at 3:59 PM Barry Song <[email protected]> wrote:
>> >> >>
>> >> >> On Tue, Jul 19, 2022 at 3:35 PM Anshuman Khandual
>> >> >> <[email protected]> wrote:
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > On 7/19/22 08:58, Huang, Ying wrote:
>> >> >> > > Anshuman Khandual <[email protected]> writes:
>> >> >> > >
>> >> >> > >> On 7/19/22 06:53, Barry Song wrote:
>> >> >> > >>> On Tue, Jul 19, 2022 at 12:44 PM Huang, Ying <[email protected]> wrote:
>> >> >> > >>>>
>> >> >> > >>>> Barry Song <[email protected]> writes:
>> >> >> > >>>>
>> >> >> > >>>>> [...]
>> >> >> > >>>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> >> >> > >>>>> index 1652a9800ebe..e1c540e80eec 100644
>> >> >> > >>>>> --- a/arch/arm64/Kconfig
>> >> >> > >>>>> +++ b/arch/arm64/Kconfig
>> >> >> > >>>>> @@ -101,6 +101,7 @@ config ARM64
>> >> >> > >>>>> select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>> >> >> > >>>>> select ARCH_WANT_LD_ORPHAN_WARN
>> >> >> > >>>>> select ARCH_WANTS_NO_INSTR
>> >> >> > >>>>> + select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES
>> >> >> > >>>>> select ARCH_HAS_UBSAN_SANITIZE_ALL
>> >> >> > >>>>> select ARM_AMBA
>> >> >> > >>>>> select ARM_ARCH_TIMER
>> >> >> > >>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> >> >> > >>>>> index 0b6632f18364..78d6f6014bfb 100644
>> >> >> > >>>>> --- a/arch/arm64/include/asm/pgtable.h
>> >> >> > >>>>> +++ b/arch/arm64/include/asm/pgtable.h
>> >> >> > >>>>> @@ -45,6 +45,12 @@
>> >> >> > >>>>> __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
>> >> >> > >>>>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>> >> >> > >>>>>
>> >> >> > >>>>> +static inline bool arch_thp_swp_supported(void)
>> >> >> > >>>>> +{
>> >> >> > >>>>> + return !system_supports_mte();
>> >> >> > >>>>> +}
>> >> >> > >>>>> +#define arch_thp_swp_supported arch_thp_swp_supported
>> >> >> > >>>>> +
>> >> >> > >>>>> /*
>> >> >> > >>>>> * Outside of a few very special situations (e.g. hibernation), we always
>> >> >> > >>>>> * use broadcast TLB invalidation instructions, therefore a spurious page
>> >> >> > >>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> >> >> > >>>>> index de29821231c9..4ddaf6ad73ef 100644
>> >> >> > >>>>> --- a/include/linux/huge_mm.h
>> >> >> > >>>>> +++ b/include/linux/huge_mm.h
>> >> >> > >>>>> @@ -461,4 +461,16 @@ static inline int split_folio_to_list(struct folio *folio,
>> >> >> > >>>>> return split_huge_page_to_list(&folio->page, list);
>> >> >> > >>>>> }
>> >> >> > >>>>>
>> >> >> > >>>>> +/*
>> >> >> > >>>>> + * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
>> >> >> > >>>>> + * limitations in the implementation like arm64 MTE can override this to
>> >> >> > >>>>> + * false
>> >> >> > >>>>> + */
>> >> >> > >>>>> +#ifndef arch_thp_swp_supported
>> >> >> > >>>>> +static inline bool arch_thp_swp_supported(void)
>> >> >> > >>>>> +{
>> >> >> > >>>>> + return true;
>> >> >> > >>>>> +}
>> >> >> > >>>>
>> >> >> > >>>> How about the following?
>> >> >> > >>>>
>> >> >> > >>>> static inline bool arch_wants_thp_swap(void)
>> >> >> > >>>> {
>> >> >> > >>>> return IS_ENABLED(ARCH_WANTS_THP_SWAP);
>> >> >> > >>>> }
>> >> >> > >>>
>> >> >> > >>> This looks good. then i'll need to change arm64 to
>> >> >> > >>>
>> >> >> > >>> +static inline bool arch_thp_swp_supported(void)
>> >> >> > >>> +{
>> >> >> > >>> + return IS_ENABLED(ARCH_WANTS_THP_SWAP) && !system_supports_mte();
>> >> >> > >>> +}
>> >> >> > >>
>> >> >> > >> Why ? CONFIG_THP_SWAP depends on ARCH_WANTS_THP_SWAP. In folio_alloc_swap(),
>> >> >> > >> IS_ENABLED(CONFIG_THP_SWAP) enabled, will also imply ARCH_WANTS_THP_SWAP too
>> >> >> > >> is enabled. Hence checking for ARCH_WANTS_THP_SWAP again does not make sense
>> >> >> > >> either in the generic fallback stub, or in arm64 platform override. Because
>> >> >> > >> without ARCH_WANTS_THP_SWAP enabled, arch_thp_swp_supported() should never
>> >> >> > >> be called in the first place.
>> >> >> > >
>> >> >> > > For the only caller now, the checking looks redundant. But the original
>> >> >> > > proposed implementation as follows,
>> >> >> > >
>> >> >> > > static inline bool arch_thp_swp_supported(void)
>> >> >> > > {
>> >> >> > > return true;
>> >> >> > > }
>> >> >> > >
>> >> >> > > will return true even on architectures that don't support/want THP swap.
>> >> >> >
>> >> >> > But the function will never be called on for those platforms.
>> >> >> >
>> >> >> > > That will confuse people too.
>> >> >> >
>> >> >> > I dont see how.
>> >> >> >
>> >> >> > >
>> >> >> > > And the "redundant" checking has no run time overhead, because compiler
>> >> >> > > will do the trick.
>> >> >> > I understand that, but dont think this indirection is necessary.
>> >> >>
>> >> >> Hi Anshuman, Hi Ying,
>> >> >> Thanks for the comments of both of you. Does the below look ok?
>> >> >>
>> >> >> generic,
>> >> >>
>> >> >> static inline bool arch_wants_thp_swap(void)
>> >> >> {
>> >> >> return IS_ENABLED(CONFIG_THP_SWAP);
>> >> >> }
>> >> >>
>> >> >
>> >> > sorry, i actually meant arch_thp_swp_supported() but not
>> >> > arch_wants_thp_swap() in generic code,
>> >> >
>> >> > static inline bool arch_thp_swp_supported(void)
>> >> > {
>> >> > return IS_ENABLED(CONFIG_THP_SWAP);
>> >> > }
>> >>
>> >> IS_ENABLED(CONFIG_THP_SWAP) doesn't match the name too. It's an option
>> >> selected by users. arch_thp_swp_supported() is to report the
>> >> capability.
>> >
>> > Hi Ying,
>> > CONFIG_THP_SWAP implicitly includes ARCH_WANTS_THP_SWAP. So it seems
>> > a bit odd to have still another arch_wants_thp_swap().
>> > if the name of arch_thp_swp_supported is not sensible to you, will
>> > thp_swp_supported()
>> > without arch_ make more sense? a similar example is,
>> >
>> > static inline bool gigantic_page_runtime_supported(void)
>> > {
>> > return IS_ENABLED(CONFIG_ARCH_HAS_GIGANTIC_PAGE);
>> > }
>>
>> Here, the capability of the architecture is reported. But
>> CONFIG_THP_SWAP is a user option.
>>
>> I'm OK with the function name "arch_thp_swp_supported()". I just think
>> that we should implement the function in a way that is consistent with
>> the function name as much as possible. That is, don't return true on
>> architectures that THP swap isn't supported in fact.
>
> My point is that having a generic thp_swp_supported() which can combine
> both IS_ENABLED(CONFIG_THP_SWP) && arch_thp_swp_thing().
> then we can always only call this rather than checking two conditions.
> Although there is only one caller for this moment, we might get more
> later. So always calling this single function might make our life easier.
>
> we can treat
>
> static inline bool arch_thp_swp_supported(void)
> {
> return IS_ENABLED(CONFIG_THP_SWAP);
> }

The issue is that IS_ENABLED(CONFIG_THP_SWAP) reports whether THP
swap has been selected by the user, not just whether THP swap is
supported by the architecture.

Best Regards,
Huang, Ying

> as a virtual function in base class. thp_swp is generically true if platforms
> are able to enable CONFIG_THP_SWAP.
>
> Derived classes like arm64 can overwrite it to false in some particular cases
> based on their unique characteristics.
>
>>
>> > Otherwise, can we just keep the code as is according to Anshuman's suggestion?
>>
>> Although I still think my way is better, I will not force you to do
>> that. If you don't think that is better, you can use your original
>> implementation.
>
> Ok, fair enough. And thanks for your reviewing :-)
>
> Thanks
> Barry

2022-07-19 07:35:39

by Barry Song

[permalink] [raw]
Subject: Re: [RESEND PATCH v3] arm64: enable THP_SWAP for arm64

On Tue, Jul 19, 2022 at 6:33 PM Huang, Ying <[email protected]> wrote:
>
> Barry Song <[email protected]> writes:
>
> > On Tue, Jul 19, 2022 at 5:47 PM Huang, Ying <[email protected]> wrote:
> >>
> >> Barry Song <[email protected]> writes:
> >>
> >> > On Tue, Jul 19, 2022 at 3:59 PM Barry Song <[email protected]> wrote:
> >> >>
> >> >> On Tue, Jul 19, 2022 at 3:35 PM Anshuman Khandual
> >> >> <[email protected]> wrote:
> >> >> >
> >> >> >
> >> >> >
> >> >> > On 7/19/22 08:58, Huang, Ying wrote:
> >> >> > > Anshuman Khandual <[email protected]> writes:
> >> >> > >
> >> >> > >> On 7/19/22 06:53, Barry Song wrote:
> >> >> > >>> On Tue, Jul 19, 2022 at 12:44 PM Huang, Ying <[email protected]> wrote:
> >> >> > >>>>
> >> >> > >>>> Barry Song <[email protected]> writes:
> >> >> > >>>>
> >> >> > >>>>> [...]
> >> >> > >>>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> >> >> > >>>>> index 1652a9800ebe..e1c540e80eec 100644
> >> >> > >>>>> --- a/arch/arm64/Kconfig
> >> >> > >>>>> +++ b/arch/arm64/Kconfig
> >> >> > >>>>> @@ -101,6 +101,7 @@ config ARM64
> >> >> > >>>>> select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> >> >> > >>>>> select ARCH_WANT_LD_ORPHAN_WARN
> >> >> > >>>>> select ARCH_WANTS_NO_INSTR
> >> >> > >>>>> + select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES
> >> >> > >>>>> select ARCH_HAS_UBSAN_SANITIZE_ALL
> >> >> > >>>>> select ARM_AMBA
> >> >> > >>>>> select ARM_ARCH_TIMER
> >> >> > >>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> >> >> > >>>>> index 0b6632f18364..78d6f6014bfb 100644
> >> >> > >>>>> --- a/arch/arm64/include/asm/pgtable.h
> >> >> > >>>>> +++ b/arch/arm64/include/asm/pgtable.h
> >> >> > >>>>> @@ -45,6 +45,12 @@
> >> >> > >>>>> __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
> >> >> > >>>>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >> >> > >>>>>
> >> >> > >>>>> +static inline bool arch_thp_swp_supported(void)
> >> >> > >>>>> +{
> >> >> > >>>>> + return !system_supports_mte();
> >> >> > >>>>> +}
> >> >> > >>>>> +#define arch_thp_swp_supported arch_thp_swp_supported
> >> >> > >>>>> +
> >> >> > >>>>> /*
> >> >> > >>>>> * Outside of a few very special situations (e.g. hibernation), we always
> >> >> > >>>>> * use broadcast TLB invalidation instructions, therefore a spurious page
> >> >> > >>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >> >> > >>>>> index de29821231c9..4ddaf6ad73ef 100644
> >> >> > >>>>> --- a/include/linux/huge_mm.h
> >> >> > >>>>> +++ b/include/linux/huge_mm.h
> >> >> > >>>>> @@ -461,4 +461,16 @@ static inline int split_folio_to_list(struct folio *folio,
> >> >> > >>>>> return split_huge_page_to_list(&folio->page, list);
> >> >> > >>>>> }
> >> >> > >>>>>
> >> >> > >>>>> +/*
> >> >> > >>>>> + * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> >> >> > >>>>> + * limitations in the implementation like arm64 MTE can override this to
> >> >> > >>>>> + * false
> >> >> > >>>>> + */
> >> >> > >>>>> +#ifndef arch_thp_swp_supported
> >> >> > >>>>> +static inline bool arch_thp_swp_supported(void)
> >> >> > >>>>> +{
> >> >> > >>>>> + return true;
> >> >> > >>>>> +}
> >> >> > >>>>
> >> >> > >>>> How about the following?
> >> >> > >>>>
> >> >> > >>>> static inline bool arch_wants_thp_swap(void)
> >> >> > >>>> {
> >> >> > >>>> return IS_ENABLED(ARCH_WANTS_THP_SWAP);
> >> >> > >>>> }
> >> >> > >>>
> >> >> > >>> This looks good. then i'll need to change arm64 to
> >> >> > >>>
> >> >> > >>> +static inline bool arch_thp_swp_supported(void)
> >> >> > >>> +{
> >> >> > >>> + return IS_ENABLED(ARCH_WANTS_THP_SWAP) && !system_supports_mte();
> >> >> > >>> +}
> >> >> > >>
> >> >> > >> Why ? CONFIG_THP_SWAP depends on ARCH_WANTS_THP_SWAP. In folio_alloc_swap(),
> >> >> > >> IS_ENABLED(CONFIG_THP_SWAP) enabled, will also imply ARCH_WANTS_THP_SWAP too
> >> >> > >> is enabled. Hence checking for ARCH_WANTS_THP_SWAP again does not make sense
> >> >> > >> either in the generic fallback stub, or in arm64 platform override. Because
> >> >> > >> without ARCH_WANTS_THP_SWAP enabled, arch_thp_swp_supported() should never
> >> >> > >> be called in the first place.
> >> >> > >
> >> >> > > For the only caller now, the checking looks redundant. But the original
> >> >> > > proposed implementation as follows,
> >> >> > >
> >> >> > > static inline bool arch_thp_swp_supported(void)
> >> >> > > {
> >> >> > > return true;
> >> >> > > }
> >> >> > >
> >> >> > > will return true even on architectures that don't support/want THP swap.
> >> >> >
> >> >> > But the function will never be called on for those platforms.
> >> >> >
> >> >> > > That will confuse people too.
> >> >> >
> >> >> > I dont see how.
> >> >> >
> >> >> > >
> >> >> > > And the "redundant" checking has no run time overhead, because compiler
> >> >> > > will do the trick.
> >> >> > I understand that, but dont think this indirection is necessary.
> >> >>
> >> >> Hi Anshuman, Hi Ying,
> >> >> Thanks for the comments of both of you. Does the below look ok?
> >> >>
> >> >> generic,
> >> >>
> >> >> static inline bool arch_wants_thp_swap(void)
> >> >> {
> >> >> return IS_ENABLED(CONFIG_THP_SWAP);
> >> >> }
> >> >>
> >> >
> >> > sorry, i actually meant arch_thp_swp_supported() but not
> >> > arch_wants_thp_swap() in generic code,
> >> >
> >> > static inline bool arch_thp_swp_supported(void)
> >> > {
> >> > return IS_ENABLED(CONFIG_THP_SWAP);
> >> > }
> >>
> >> IS_ENABLED(CONFIG_THP_SWAP) doesn't match the name too. It's an option
> >> selected by users. arch_thp_swp_supported() is to report the
> >> capability.
> >
> > Hi Ying,
> > CONFIG_THP_SWAP implicitly includes ARCH_WANTS_THP_SWAP, so it seems
> > a bit odd to have yet another arch_wants_thp_swap().
> > If the name arch_thp_swp_supported() doesn't seem sensible to you, would
> > thp_swp_supported() without the arch_ prefix make more sense? A similar
> > example is:
> >
> > static inline bool gigantic_page_runtime_supported(void)
> > {
> > return IS_ENABLED(CONFIG_ARCH_HAS_GIGANTIC_PAGE);
> > }
>
> Here, the capability of the architecture is reported. But
> CONFIG_THP_SWAP is a user option.
>
> I'm OK with the function name "arch_thp_swp_supported()". I just think
> that we should implement the function in a way that is consistent with
> the function name as much as possible. That is, don't return true on
> architectures where THP swap isn't in fact supported.

My point is that we could have a generic thp_swp_supported() which combines
IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported(), so that callers
only ever need this one check rather than two conditions.
Although there is only one caller at the moment, we might get more later,
so always calling this single function would make our life easier.

We can treat

static inline bool arch_thp_swp_supported(void)
{
return IS_ENABLED(CONFIG_THP_SWAP);
}

as a virtual function in a base class: THP swap is generically supported
whenever a platform is able to enable CONFIG_THP_SWAP.

Derived classes like arm64 can then override it to return false in particular
cases, based on their unique characteristics.
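
A rough sketch of the layering I mean (illustration only; the v3 patch itself
keeps "return true" in the generic stub):

/* generic "base class" default, e.g. in include/linux/huge_mm.h */
#ifndef arch_thp_swp_supported
static inline bool arch_thp_swp_supported(void)
{
        /* generically supported whenever the user can enable CONFIG_THP_SWAP */
        return IS_ENABLED(CONFIG_THP_SWAP);
}
#endif

/* arm64 "derived class" override, e.g. in arch/arm64/include/asm/pgtable.h */
static inline bool arch_thp_swp_supported(void)
{
        /* MTE assumes only base pages are swapped, so opt out when MTE is live */
        return IS_ENABLED(CONFIG_THP_SWAP) && !system_supports_mte();
}
#define arch_thp_swp_supported arch_thp_swp_supported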

>
> > Otherwise, can we just keep the code as is, per Anshuman's suggestion?
>
> Although I still think my way is better, I will not force you to do
> that. If you don't think that is better, you can use your original
> implementation.

OK, fair enough. And thanks for your review :-)

Thanks
Barry

2022-07-19 08:49:34

by Barry Song

[permalink] [raw]
Subject: Re: [RESEND PATCH v3] arm64: enable THP_SWAP for arm64

On Tue, Jul 19, 2022 at 7:20 PM Huang, Ying <[email protected]> wrote:
>
> Barry Song <[email protected]> writes:
>
> > On Tue, Jul 19, 2022 at 6:33 PM Huang, Ying <[email protected]> wrote:
> >>
> >> Barry Song <[email protected]> writes:
> >>
> >> > [...]
> >>
> >> Here, the capability of the architecture is reported. But
> >> CONFIG_THP_SWAP is a user option.
> >>
> >> I'm OK with the function name "arch_thp_swp_supported()". I just think
> >> that we should implement the function in a way that is consistent with
> >> the function name as much as possible. That is, don't return true on
> >> architectures where THP swap isn't in fact supported.
> >
> > My point is that we could have a generic thp_swp_supported() which combines
> > IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported(), so that callers
> > only ever need this one check rather than two conditions.
> > Although there is only one caller at the moment, we might get more later,
> > so always calling this single function would make our life easier.
> >
> > We can treat
> >
> > static inline bool arch_thp_swp_supported(void)
> > {
> > return IS_ENABLED(CONFIG_THP_SWAP);
> > }
>
> The issue is that IS_ENABLED(CONFIG_THP_SWAP) reports that whether THP
> swap is selected by the user, not just whether THP swap is supported by
> the architecture.

Your point is absolutely correct. But does it really matter, given that even
huge_mm.h, the very file being changed, already does this?

static inline bool thp_migration_supported(void)
{
return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
}

Personally I would prefer that we enjoy the convenience of calling a single
thp_swp_supported() in the existing caller and any future callers,
trading off some accuracy in the name.
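
For illustration, such a combined helper might look like the below
(hypothetical, since we are not going this way):

static inline bool thp_swp_supported(void)
{
        /* combine the user's choice with the architecture's capability */
        return IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported();
}

folio_alloc_swap() could then make this single call instead of checking
IS_ENABLED(CONFIG_THP_SWAP) and arch_thp_swp_supported() separately.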

But since it is hard for all of us to be on the same page, I'll have to keep
the code as is :-)

Thanks
Barry

2022-07-20 02:05:12

by Barry Song

[permalink] [raw]
Subject: Re: [RESEND PATCH v3] arm64: enable THP_SWAP for arm64

On Tue, Jul 19, 2022 at 4:04 PM Anshuman Khandual
<[email protected]> wrote:
>
>
>
> On 7/19/22 09:29, Barry Song wrote:
> > On Tue, Jul 19, 2022 at 3:35 PM Anshuman Khandual
> > <[email protected]> wrote:
> >>
> >>
> >>
> >> [...]
> >
> > Hi Anshuman, Hi Ying,
> > Thanks to both of you for your comments. Does the below look OK?
> >
> > generic,
> >
> > static inline bool arch_wants_thp_swap(void)
> > {
> > return IS_ENABLED(CONFIG_THP_SWAP);
> > }
> >
> > arm64,
> >
> > static inline bool arch_thp_swp_supported(void)
> > {
> > return IS_ENABLED(CONFIG_THP_SWAP) && !system_supports_mte();
> > }
> >
> > caller,
> >
> > folio_alloc_swap(struct folio *folio)
> > {
> >
> > if (folio_test_large(folio)) {
> > - if (IS_ENABLED(CONFIG_THP_SWAP))
> > + if (arch_thp_swp_supported())
> > get_swap_pages(1, &entry, folio_nr_pages(folio));
> > goto out;
> > }
>
> Current proposal in this patch LGTM, I don't see any reason for these changes.

OK, thanks, Anshuman. Can I collect this as a Reviewed-by?

Barry

2022-07-20 03:04:50

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [RESEND PATCH v3] arm64: enable THP_SWAP for arm64



On 7/20/22 07:16, Barry Song wrote:
> On Tue, Jul 19, 2022 at 4:04 PM Anshuman Khandual
> <[email protected]> wrote:
>>
>>
>>
>> [...]
>>
>> Current proposal in this patch LGTM, I don't see any reason for these changes.
>
> OK, thanks, Anshuman. Can I collect this as a Reviewed-by?

Yes please.

Reviewed-by: Anshuman Khandual <[email protected]>

2022-07-20 09:52:15

by Will Deacon

[permalink] [raw]
Subject: Re: [RESEND PATCH v3] arm64: enable THP_SWAP for arm64

On Wed, Jul 20, 2022 at 08:06:42AM +0530, Anshuman Khandual wrote:
> On 7/20/22 07:16, Barry Song wrote:
> > [...]
> >>
> >> Current proposal in this patch LGTM, I don't see any reason for these changes.
> >
> > OK, thanks, Anshuman. Can I collect this as a Reviewed-by?
>
> Yes please.
>
> Reviewed-by: Anshuman Khandual <[email protected]>

I've lost track of exactly what the outcome here is, so Barry, please can
you send a final version of the agreed-upon patch?

Thanks,

Will