2023-04-03 05:41:02

by Ankur Arora

[permalink] [raw]
Subject: [PATCH 0/9] x86/clear_huge_page: multi-page clearing

This series introduces multi-page clearing for hugepages.

This is a follow up of some of the ideas discussed at:
https://lore.kernel.org/lkml/CAHk-=wj9En-BC4t7J9xFZOws5ShwaR9yor7FxHZr8CTVyEP_+Q@mail.gmail.com/

On x86 page clearing is typically done via string instructions. These,
unlike a MOV loop, allow us to explicitly advertise the region size to
the processor, which can serve as a hint to current (and/or
future) uarchs to elide cacheline allocation.

In current generation processors, Milan (and presumably other Zen
variants) use the hint to elide cacheline allocation (for
region-size > LLC-size.)

An additional reason for doing this is that string instructions are
typically microcoded; clearing in larger chunks than the current
page-at-a-time logic amortizes some of that cost.

All uarchs tested (Milan, Icelakex, Skylakex) showed improved performance.

There are, however, some problems:

1. extended zeroing periods mean increased scheduling latency due to
the now missing preemption points.

That's handled in patches 7, 8, 9:
"sched: define TIF_ALLOW_RESCHED"
"irqentry: define irqentry_exit_allow_resched()"
"x86/clear_huge_page: make clear_contig_region() preemptible"
by the context marking itself reschedulable, and rescheduling in
irqexit context if needed (for PREEMPTION_NONE/_VOLUNTARY.)

2. the current page-at-a-time clearing logic does left-right narrowing
towards the faulting page, which maintains cache locality for
workloads that have a sequential access pattern. Clearing in large
chunks loses that.

Some (but not all) of that could be ameliorated by something like
this patch:
https://lore.kernel.org/lkml/[email protected]/

But, before doing that, I'd like comments on whether it is
worth doing for this specific use case.

Rest of the series:
Patches 1, 2, 3:
"huge_pages: get rid of process_huge_page()"
"huge_page: get rid of {clear,copy}_subpage()"
"huge_page: allow arch override for clear/copy_huge_page()"
are mechanical and they simplify some of the current clear_huge_page()
logic.

Patches 4, 5:
"x86/clear_page: parameterize clear_page*() to specify length"
"x86/clear_pages: add clear_pages()"

add clear_pages() and helpers.

Patch 6: "mm/clear_huge_page: use multi-page clearing" adds the
chunked x86 clear_huge_page() implementation.


Performance
==

Demand fault performance gets a decent boost:

*Icelakex*     mm/clear_huge_page    x86/clear_huge_page    change
                    (GB/s)                 (GB/s)

pg-sz=2MB            8.76                  11.82            +34.93%
pg-sz=1GB            8.99                  12.18            +35.48%


*Milan*        mm/clear_huge_page    x86/clear_huge_page    change
                    (GB/s)                 (GB/s)

pg-sz=2MB           12.24                  17.54            +43.30%
pg-sz=1GB           17.98                  37.24           +107.11%


vm-scalability/case-anon-w-seq-hugetlb gains in stime, but performs
worse when user space touches those pages:

*Icelakex*        mm/clear_huge_page    x86/clear_huge_page    change
(mem=4GB/task, tasks=128)

  stime             293.02 +-  .49%       239.39 +-  .83%     -18.30%
  utime             440.11 +-  .28%       508.74 +-  .60%     +15.59%
  wall-clock          5.96 +-  .33%         6.27 +- 2.23%     + 5.20%


*Milan*           mm/clear_huge_page    x86/clear_huge_page    change
(mem=1GB/task, tasks=512)

  stime             490.95 +- 3.55%       466.90 +- 4.79%     - 4.89%
  utime             276.43 +- 2.85%       311.97 +- 5.15%     +12.85%
  wall-clock          3.74 +- 6.41%         3.58 +- 7.82%     - 4.27%

Also at:
github.com/terminus/linux clear-pages.v1

Comments appreciated!

Ankur Arora (9):
huge_pages: get rid of process_huge_page()
huge_page: get rid of {clear,copy}_subpage()
huge_page: allow arch override for clear/copy_huge_page()
x86/clear_page: parameterize clear_page*() to specify length
x86/clear_pages: add clear_pages()
mm/clear_huge_page: use multi-page clearing
sched: define TIF_ALLOW_RESCHED
irqentry: define irqentry_exit_allow_resched()
x86/clear_huge_page: make clear_contig_region() preemptible

arch/x86/include/asm/page.h | 6 +
arch/x86/include/asm/page_32.h | 6 +
arch/x86/include/asm/page_64.h | 25 +++--
arch/x86/include/asm/thread_info.h | 2 +
arch/x86/lib/clear_page_64.S | 45 ++++++--
arch/x86/mm/hugetlbpage.c | 59 ++++++++++
include/linux/sched.h | 29 +++++
kernel/entry/common.c | 8 ++
kernel/sched/core.c | 36 +++---
mm/memory.c | 174 +++++++++++++++--------------
10 files changed, 270 insertions(+), 120 deletions(-)

--
2.31.1


2023-04-03 05:50:25

by Ankur Arora

[permalink] [raw]
Subject: [PATCH 7/9] sched: define TIF_ALLOW_RESCHED

Define TIF_ALLOW_RESCHED to allow threads to mark themselves as allowing
rescheduling in the irqexit path with PREEMPTION_NONE/_VOLUNTARY.

This is meant to be used for long running tasks where it is
not convenient to periodically call cond_resched().

Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/include/asm/thread_info.h | 2 ++
include/linux/sched.h | 29 +++++++++++++++++++++++++++++
2 files changed, 31 insertions(+)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index f1cccba52eb9..8c18b9eaeec4 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -100,6 +100,7 @@ struct thread_info {
#define TIF_BLOCKSTEP 25 /* set when we want DEBUGCTLMSR_BTF */
#define TIF_LAZY_MMU_UPDATES 27 /* task is updating the mmu lazily */
#define TIF_ADDR32 29 /* 32-bit address space on 64 bits */
+#define TIF_RESCHED_ALLOW 30 /* can reschedule if needed */

#define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
#define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
@@ -122,6 +123,7 @@ struct thread_info {
#define _TIF_BLOCKSTEP (1 << TIF_BLOCKSTEP)
#define _TIF_LAZY_MMU_UPDATES (1 << TIF_LAZY_MMU_UPDATES)
#define _TIF_ADDR32 (1 << TIF_ADDR32)
+#define _TIF_RESCHED_ALLOW (1 << TIF_RESCHED_ALLOW)

/* flags to check in __switch_to() */
#define _TIF_WORK_CTXSW_BASE \
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 63d242164b1a..1e7536e6d9ce 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2229,6 +2229,35 @@ static __always_inline bool need_resched(void)
return unlikely(tif_need_resched());
}

+/*
+ * Define this in common code to avoid include hell.
+ */
+static __always_inline bool resched_allowed(void)
+{
+#ifndef TIF_RESCHED_ALLOW
+ return false;
+#else
+ return unlikely(test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
+#endif
+}
+
+static inline void allow_resched(void)
+{
+ /*
+ * allow_resched() allows preemption via the irqexit context.
+ * To ensure that we stick around on the current runqueue,
+ * disallow migration.
+ */
+ migrate_disable();
+ set_tsk_thread_flag(current, TIF_RESCHED_ALLOW);
+}
+
+static inline void disallow_resched(void)
+{
+ clear_tsk_thread_flag(current, TIF_RESCHED_ALLOW);
+ migrate_enable();
+}
+
/*
* Wrappers for p->thread_info->cpu access. No-op on UP.
*/
--
2.31.1

2023-04-05 19:52:27

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH 0/9] x86/clear_huge_page: multi-page clearing

On 4/3/2023 10:52 AM, Ankur Arora wrote:
> This series introduces multi-page clearing for hugepages.
>
> [snip]

Hello Ankur,

I was able to test your patches. To summarize, I am seeing a 2x-3x perf
improvement for the 2M and 1GB base hugepage sizes.

SUT: Genoa AMD EPYC
Thread(s) per core: 2
Core(s) per socket: 128
Socket(s): 2

NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-127,256-383
NUMA node1 CPU(s): 128-255,384-511

Test: Use mmap(MAP_HUGETLB) to demand-fault a 64GB region (NUMA
node0), for both base-hugepage-size=2M and 1GB

perf stat -r 10 -d -d numactl -m 0 -N 0 <test>

time in seconds elapsed (average of 10 runs) (lower = better)

Result:
page-size    mm/clear_huge_page    x86/clear_huge_page    change %
2M                5.4567                2.6774              -50.93
1G                2.64452               1.011281            -61.76

Full perfstat info

page size = 2M mm/clear_huge_page

Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_2M' (10 runs):

          5,434.71 msec task-clock              #    0.996 CPUs utilized             ( +-  0.55% )
                 8      context-switches        #    1.466 /sec                      ( +-  4.66% )
                 0      cpu-migrations          #    0.000 /sec
            32,918      page-faults             #    6.034 K/sec                     ( +-  0.00% )
    16,977,242,482      cycles                  #    3.112 GHz                       ( +-  0.04% )  (35.70%)
         1,961,724      stalled-cycles-frontend #    0.01% frontend cycles idle      ( +-  1.09% )  (35.72%)
        35,685,674      stalled-cycles-backend  #    0.21% backend cycles idle       ( +-  3.48% )  (35.74%)
     1,038,327,182      instructions            #    0.06  insn per cycle
                                                #    0.04  stalled cycles per insn   ( +-  0.38% )  (35.75%)
       221,409,216      branches                #   40.584 M/sec                     ( +-  0.36% )  (35.75%)
           350,730      branch-misses           #    0.16% of all branches           ( +-  1.18% )  (35.75%)
     2,520,888,779      L1-dcache-loads         #  462.077 M/sec                     ( +-  0.03% )  (35.73%)
     1,094,178,209      L1-dcache-load-misses   #   43.46% of all L1-dcache accesses ( +-  0.02% )  (35.71%)
        67,751,730      L1-icache-loads         #   12.419 M/sec                     ( +-  0.11% )  (35.70%)
           271,118      L1-icache-load-misses   #    0.40% of all L1-icache accesses ( +-  2.55% )  (35.70%)
           506,635      dTLB-loads              #   92.866 K/sec                     ( +-  3.31% )  (35.70%)
           237,385      dTLB-load-misses        #   43.64% of all dTLB cache accesses ( +-  7.00% ) (35.69%)
               268      iTLB-load-misses        # 6700.00% of all iTLB cache accesses ( +- 13.86% ) (35.70%)

            5.4567 +- 0.0300 seconds time elapsed  ( +-  0.55% )

page size = 2M x86/clear_huge_page
Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_2M' (10 runs):

          2,780.69 msec task-clock              #    1.039 CPUs utilized             ( +-  1.03% )
                 3      context-switches        #    1.121 /sec                      ( +- 21.34% )
                 0      cpu-migrations          #    0.000 /sec
            32,918      page-faults             #   12.301 K/sec                     ( +-  0.00% )
     8,143,619,771      cycles                  #    3.043 GHz                       ( +-  0.25% )  (35.62%)
         2,024,872      stalled-cycles-frontend #    0.02% frontend cycles idle      ( +-320.93% )  (35.66%)
       717,198,728      stalled-cycles-backend  #    8.82% backend cycles idle       ( +-  8.26% )  (35.69%)
       606,549,334      instructions            #    0.07  insn per cycle
                                                #    1.39  stalled cycles per insn   ( +-  0.23% )  (35.73%)
       108,856,550      branches                #   40.677 M/sec                     ( +-  0.24% )  (35.76%)
           202,490      branch-misses           #    0.18% of all branches           ( +-  3.58% )  (35.78%)
     2,348,818,806      L1-dcache-loads         #  877.701 M/sec                     ( +-  0.03% )  (35.78%)
     1,081,562,988      L1-dcache-load-misses   #   46.04% of all L1-dcache accesses ( +-  0.01% )  (35.78%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
        43,411,167      L1-icache-loads         #   16.222 M/sec                     ( +-  0.19% )  (35.77%)
           273,042      L1-icache-load-misses   #    0.64% of all L1-icache accesses ( +-  4.94% )  (35.76%)
           834,482      dTLB-loads              #  311.827 K/sec                     ( +-  9.73% )  (35.72%)
           437,343      dTLB-load-misses        #   65.86% of all dTLB cache accesses ( +-  8.56% ) (35.68%)
                 0      iTLB-loads              #    0.000 /sec                                     (35.65%)
               160      iTLB-load-misses        # 1777.78% of all iTLB cache accesses ( +- 15.82% ) (35.62%)

            2.6774 +- 0.0287 seconds time elapsed  ( +-  1.07% )

page size = 1G mm/clear_huge_page
Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):

          2,625.24 msec task-clock              #    0.993 CPUs utilized             ( +-  0.23% )
                 4      context-switches        #    1.513 /sec                      ( +-  4.49% )
                 1      cpu-migrations          #    0.378 /sec
               214      page-faults             #   80.965 /sec                      ( +-  0.13% )
     8,178,624,349      cycles                  #    3.094 GHz                       ( +-  0.23% )  (35.65%)
         2,942,576      stalled-cycles-frontend #    0.04% frontend cycles idle      ( +- 75.22% )  (35.69%)
         7,117,425      stalled-cycles-backend  #    0.09% backend cycles idle       ( +-  3.79% )  (35.73%)
       454,521,647      instructions            #    0.06  insn per cycle
                                                #    0.02  stalled cycles per insn   ( +-  0.10% )  (35.77%)
       113,223,853      branches                #   42.837 M/sec                     ( +-  0.08% )  (35.80%)
            84,766      branch-misses           #    0.07% of all branches           ( +-  5.37% )  (35.80%)
     2,294,528,890      L1-dcache-loads         #  868.111 M/sec                     ( +-  0.02% )  (35.81%)
     1,075,907,551      L1-dcache-load-misses   #   46.88% of all L1-dcache accesses ( +-  0.02% )  (35.78%)
        26,167,323      L1-icache-loads         #    9.900 M/sec                     ( +-  0.24% )  (35.74%)
           139,675      L1-icache-load-misses   #    0.54% of all L1-icache accesses ( +-  0.37% )  (35.70%)
             3,459      dTLB-loads              #    1.309 K/sec                     ( +- 12.75% )  (35.67%)
               732      dTLB-load-misses        #   19.71% of all dTLB cache accesses ( +- 26.61% ) (35.62%)
                11      iTLB-load-misses        #  192.98% of all iTLB cache accesses ( +-238.28% ) (35.62%)

            2.64452 +- 0.00600 seconds time elapsed  ( +-  0.23% )


page size = 1G x86/clear_huge_page
Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):

          1,009.09 msec task-clock              #    0.998 CPUs utilized             ( +-  0.06% )
                 2      context-switches        #    1.980 /sec                      ( +- 23.63% )
                 1      cpu-migrations          #    0.990 /sec
               214      page-faults             #  211.887 /sec                      ( +-  0.16% )
     3,154,980,463      cycles                  #    3.124 GHz                       ( +-  0.06% )  (35.77%)
           145,051      stalled-cycles-frontend #    0.00% frontend cycles idle      ( +-  6.26% )  (35.78%)
       730,087,143      stalled-cycles-backend  #   23.12% backend cycles idle       ( +-  9.75% )  (35.78%)
        45,813,391      instructions            #    0.01  insn per cycle
                                                #   18.51  stalled cycles per insn   ( +-  1.00% )  (35.78%)
         8,498,282      branches                #    8.414 M/sec                     ( +-  1.54% )  (35.78%)
            63,351      branch-misses           #    0.74% of all branches           ( +-  6.70% )  (35.69%)
        29,135,863      L1-dcache-loads         #   28.848 M/sec                     ( +-  5.67% )  (35.68%)
         8,537,280      L1-dcache-load-misses   #   28.66% of all L1-dcache accesses ( +- 10.15% )  (35.68%)
         1,040,087      L1-icache-loads         #    1.030 M/sec                     ( +-  1.60% )  (35.68%)
             9,147      L1-icache-load-misses   #    0.85% of all L1-icache accesses ( +-  6.50% )  (35.67%)
             1,084      dTLB-loads              #    1.073 K/sec                     ( +- 12.05% )  (35.68%)
               431      dTLB-load-misses        #   40.28% of all dTLB cache accesses ( +- 43.46% ) (35.68%)
                16      iTLB-load-misses        #    0.00% of all iTLB cache accesses ( +- 40.54% ) (35.68%)

            1.011281 +- 0.000624 seconds time elapsed  ( +-  0.06% )

Please feel free to add

Tested-by: Raghavendra K T <[email protected]>

Will come back with further observations on patch/performance if any

Thanks and Regards

2023-04-05 20:12:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 7/9] sched: define TIF_ALLOW_RESCHED

On Sun, Apr 02, 2023 at 10:22:31PM -0700, Ankur Arora wrote:
> Define TIF_ALLOW_RESCHED to allow threads to mark themselves as allowing
> rescheduling in the irqexit path with PREEMPTION_NONE/_VOLUNTARY.
>
> This is meant to be used for long running tasks where it is
> not convenient to periodically call cond_resched().
>
> Suggested-by: Linus Torvalds <[email protected]>
> Signed-off-by: Ankur Arora <[email protected]>
> ---
> arch/x86/include/asm/thread_info.h | 2 ++
> include/linux/sched.h | 29 +++++++++++++++++++++++++++++
> 2 files changed, 31 insertions(+)
>
> diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
> index f1cccba52eb9..8c18b9eaeec4 100644
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -100,6 +100,7 @@ struct thread_info {
> #define TIF_BLOCKSTEP 25 /* set when we want DEBUGCTLMSR_BTF */
> #define TIF_LAZY_MMU_UPDATES 27 /* task is updating the mmu lazily */
> #define TIF_ADDR32 29 /* 32-bit address space on 64 bits */
> +#define TIF_RESCHED_ALLOW 30 /* can reschedule if needed */
>
> #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
> #define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
> @@ -122,6 +123,7 @@ struct thread_info {
> #define _TIF_BLOCKSTEP (1 << TIF_BLOCKSTEP)
> #define _TIF_LAZY_MMU_UPDATES (1 << TIF_LAZY_MMU_UPDATES)
> #define _TIF_ADDR32 (1 << TIF_ADDR32)
> +#define _TIF_RESCHED_ALLOW (1 << TIF_RESCHED_ALLOW)
>
> /* flags to check in __switch_to() */
> #define _TIF_WORK_CTXSW_BASE \
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 63d242164b1a..1e7536e6d9ce 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2229,6 +2229,35 @@ static __always_inline bool need_resched(void)
> return unlikely(tif_need_resched());
> }
>
> +/*
> + * Define this in common code to avoid include hell.
> + */
> +static __always_inline bool resched_allowed(void)
> +{
> +#ifndef TIF_RESCHED_ALLOW
> + return false;
> +#else
> + return unlikely(test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
> +#endif
> +}
> +
> +static inline void allow_resched(void)
> +{
> + /*
> + * allow_resched() allows preemption via the irqexit context.
> + * To ensure that we stick around on the current runqueue,
> + * disallow migration.
> + */
> + migrate_disable();
> + set_tsk_thread_flag(current, TIF_RESCHED_ALLOW);
> +}
> +
> +static inline void disallow_resched(void)
> +{
> + clear_tsk_thread_flag(current, TIF_RESCHED_ALLOW);
> + migrate_enable();
> +}

Why the migrate_disable(), that comment doesn't help much.

2023-04-08 23:14:36

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH 0/9] x86/clear_huge_page: multi-page clearing


Raghavendra K T <[email protected]> writes:

> On 4/3/2023 10:52 AM, Ankur Arora wrote:
>> This series introduces multi-page clearing for hugepages.

> *Milan* mm/clear_huge_page x86/clear_huge_page change
> (GB/s) (GB/s)
> pg-sz=2MB 12.24 17.54 +43.30%
> pg-sz=1GB 17.98 37.24 +107.11%
>
>
> Hello Ankur,
>
> Was able to test your patches. To summarize, am seeing 2x-3x perf
> improvement for 2M, 1GB base hugepage sizes.

Great. Thanks Raghavendra.

> SUT: Genoa AMD EPYC
> Thread(s) per core: 2
> Core(s) per socket: 128
> Socket(s): 2
>
> NUMA:
> NUMA node(s): 2
> NUMA node0 CPU(s): 0-127,256-383
> NUMA node1 CPU(s): 128-255,384-511
>
> Test: Use mmap(MAP_HUGETLB) to demand a fault on 64GB region (NUMA node0), for
> both base-hugepage-size=2M and 1GB
>
> perf stat -r 10 -d -d numactl -m 0 -N 0 <test>
>
> time in seconds elapsed (average of 10 runs) (lower = better)
>
> Result:
> page-size mm/clear_huge_page x86/clear_huge_page
> 2M 5.4567 2.6774
> 1G 2.64452 1.011281

So translating into BW, for Genoa we have:

page-size    mm/clear_huge_page    x86/clear_huge_page
                  (GB/s)                 (GB/s)
2M                11.74                  23.97
1G                24.24                  63.36

That's a pretty good bump over Milan:

> *Milan* mm/clear_huge_page x86/clear_huge_page
> (GB/s) (GB/s)
> pg-sz=2MB 12.24 17.54
> pg-sz=1GB 17.98 37.24

Btw, are these numbers with boost=1?

> [snip perf stats]
>
> Please feel free to add
>
> Tested-by: Raghavendra K T <[email protected]>

Thanks

Ankur

> Will come back with further observations on patch/performance if any

2023-04-10 06:44:20

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH 0/9] x86/clear_huge_page: multi-page clearing

On 4/9/2023 4:16 AM, Ankur Arora wrote:
>
> Raghavendra K T <[email protected]> writes:
>
>> On 4/3/2023 10:52 AM, Ankur Arora wrote:
>>> This series introduces multi-page clearing for hugepages.
>
>> *Milan* mm/clear_huge_page x86/clear_huge_page change
>> (GB/s) (GB/s)
>> pg-sz=2MB 12.24 17.54 +43.30%
>> pg-sz=1GB 17.98 37.24 +107.11%
>>
>>
>> Hello Ankur,
>>
>> Was able to test your patches. To summarize, am seeing 2x-3x perf
>> improvement for 2M, 1GB base hugepage sizes.
>
> Great. Thanks Raghavendra.
>
>> SUT: Genoa AMD EPYC
>> Thread(s) per core: 2
>> Core(s) per socket: 128
>> Socket(s): 2
>>
>> NUMA:
>> NUMA node(s): 2
>> NUMA node0 CPU(s): 0-127,256-383
>> NUMA node1 CPU(s): 128-255,384-511
>>
>> Test: Use mmap(MAP_HUGETLB) to demand a fault on 64GB region (NUMA node0), for
>> both base-hugepage-size=2M and 1GB
>>
>> perf stat -r 10 -d -d numactl -m 0 -N 0 <test>
>>
>> time in seconds elapsed (average of 10 runs) (lower = better)
>>
>> Result:
>> page-size mm/clear_huge_page x86/clear_huge_page
>> 2M 5.4567 2.6774
>> 1G 2.64452 1.011281
>
> So translating into BW, for Genoa we have:
>
> page-size mm/clear_huge_page x86/clear_huge_page
> 2M 11.74 23.97
> 1G 24.24 63.36
>
> That's a pretty good bump over Milan:
>
>> *Milan* mm/clear_huge_page x86/clear_huge_page
>> (GB/s) (GB/s)
>> pg-sz=2MB 12.24 17.54
>> pg-sz=1GB 17.98 37.24
>
> Btw, are these numbers with boost=1?
>

Yes, it is. Also, a note about the config: I had not enabled
GCOV/LOCKSTAT related config options because I faced some issues.