2020-10-14 14:25:04

by Ankur Arora

Subject: [PATCH 0/8] Use uncached writes while clearing gigantic pages

This series adds clear_page_nt(), a non-temporal MOV (MOVNTI) based
clear_page().

The immediate use case is to speedup creation of large (~2TB) guests
VMs. Memory for these guests is allocated via huge/gigantic pages which
are faulted in early.

The intent behind using non-temporal writes is to minimize allocation of
unnecessary cachelines. This helps in minimizing cache pollution, and
potentially also speeds up zeroing of large extents.
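As a rough illustration of the mechanism (not the implementation added by
this series -- that is an assembly routine in arch/x86/lib/clear_page_64.S),
a MOVNTI-based clear amounts to something like the sketch below;
clear_page_nt_sketch() is a made-up name:

static inline void clear_page_nt_sketch(void *page)
{
	unsigned long offset;

	/* 8-byte non-temporal stores: no cacheline allocation on the way. */
	for (offset = 0; offset < PAGE_SIZE; offset += 8)
		asm volatile("movnti %1, %0"
			     : "=m" (*(unsigned long *)((char *)page + offset))
			     : "r" (0UL));

	/* Order the non-temporal stores before the page is handed out. */
	asm volatile("sfence" ::: "memory");
}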

That said, uncached writes are not always a win, as can be seen in these
'perf bench mem memset' numbers comparing clear_page_erms() and
clear_page_nt():

Intel Broadwellx:
           x86-64-stosb (5 runs)       x86-64-movnt (5 runs)      speedup
           -----------------------     -----------------------    -------
   size        BW   ( pstdev)              BW   ( pstdev)
   16MB    17.35 GB/s ( +- 9.27%)      11.83 GB/s ( +- 0.19%)     -31.81%
  128MB     5.31 GB/s ( +- 0.13%)      11.72 GB/s ( +- 0.44%)    +121.84%

AMD Rome:
           x86-64-stosq (5 runs)       x86-64-movnt (5 runs)      speedup
           -----------------------     -----------------------    -------
   size        BW   ( pstdev)              BW   ( pstdev)
   16MB    15.39 GB/s ( +- 9.14%)      14.56 GB/s ( +-19.43%)      -5.39%
  128MB    11.04 GB/s ( +- 4.87%)      14.49 GB/s ( +-13.22%)     +31.25%

Intel Skylakex:
           x86-64-stosb (5 runs)       x86-64-movnt (5 runs)      speedup
           -----------------------     -----------------------    -------
   size        BW   ( pstdev)              BW   ( pstdev)
   16MB    20.38 GB/s ( +- 2.58%)       6.25 GB/s ( +- 0.41%)     -69.28%
  128MB     6.52 GB/s ( +- 0.14%)       6.31 GB/s ( +- 0.47%)      -3.22%

(All of the machines in these tests had a minimum of 25MB L3 cache per
socket.)

There are two performance issues:
- uncached writes typically perform better only for region sizes around
  or larger than ~LLC-size.
- MOVNTI does not perform well on all microarchitectures.

We handle the first issue by using clear_page_nt() only for GB pages.
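A minimal sketch of that policy as it could sit in the huge-page clearing
path (the actual change is patch 6 in mm/memory.c;
clear_user_highpage_uncached() is my stand-in name for the highmem.h
wrapper around clear_page_uncached()):

static void clear_gigantic_page_sketch(struct page *page,
				       unsigned long addr,
				       unsigned int pages_per_huge_page)
{
	unsigned int i;

	/* Gigantic extents dwarf the LLC, so clear each subpage uncached. */
	for (i = 0; i < pages_per_huge_page; i++) {
		cond_resched();
		clear_user_highpage_uncached(page + i,
					     addr + i * PAGE_SIZE);
	}
}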

That leaves out zeroing of 2MB pages: a size large enough that avoiding
cache allocation might have meaningful benefits, but small enough that
uncached writes would end up being slower.

We can handle a subset of the 2MB case -- mmaps with MAP_POPULATE -- by
means of an uncached-or-cached hint chosen based on a threshold size.
This would apply to maps backed by any page-size.
That case is not handled in this series -- I wanted to sanity check the
high-level approach before attempting it.
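Purely for illustration (again, not part of this series), such a hint
could be as simple as comparing the populated extent against an
LLC-sized cutoff; the name and value below are made up:

#define CLEAR_UNCACHED_THRESHOLD	(32UL << 20)	/* roughly LLC-sized */

/* Hint for a MAP_POPULATE-style path: prefer uncached clearing for
 * extents that would blow away the LLC anyway. */
static inline bool clear_wants_uncached(unsigned long populate_bytes)
{
	return populate_bytes >= CLEAR_UNCACHED_THRESHOLD;
}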

We handle the second issue by adding a synthetic CPU feature,
X86_FEATURE_NT_GOOD, which is enabled only on microarchitectures where
MOVNTI performs well.
(Relatedly, I thought I had independently decided to use ALTERNATIVES
to deal with this, but more likely I had just internalized it from this
discussion:
https://lore.kernel.org/linux-mm/[email protected]/#t)
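In effect the gate boils down to the following; the series expresses it
via ALTERNATIVE so the branch is patched at boot, but this
static_cpu_has() form is equivalent in effect and easier to read
(clear_page_uncached_sketch() is an illustrative name):

static inline void clear_page_uncached_sketch(void *page)
{
	if (static_cpu_has(X86_FEATURE_NT_GOOD))
		clear_page_nt(page);	/* MOVNTI-based, added in patch 4 */
	else
		clear_page(page);	/* regular cached clear */
}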

Accordingly, this series enables X86_FEATURE_NT_GOOD on Intel Broadwellx
and AMD Rome. (In my testing, performance was also good on some
pre-production models, but this series leaves them out.)

Please review.

Thanks
Ankur

Ankur Arora (8):
x86/cpuid: add X86_FEATURE_NT_GOOD
x86/asm: add memset_movnti()
perf bench: add memset_movnti()
x86/asm: add clear_page_nt()
x86/clear_page: add clear_page_uncached()
mm, clear_huge_page: use clear_page_uncached() for gigantic pages
x86/cpu/intel: enable X86_FEATURE_NT_GOOD on Intel Broadwellx
x86/cpu/amd: enable X86_FEATURE_NT_GOOD on AMD Zen

 arch/x86/include/asm/cpufeatures.h           |  1 +
 arch/x86/include/asm/page.h                  |  6 +++
 arch/x86/include/asm/page_32.h               |  9 ++++
 arch/x86/include/asm/page_64.h               | 15 ++++++
 arch/x86/kernel/cpu/amd.c                    |  3 ++
 arch/x86/kernel/cpu/intel.c                  |  2 +
 arch/x86/lib/clear_page_64.S                 | 26 +++++++++++
 arch/x86/lib/memset_64.S                     | 68 ++++++++++++++++------------
 include/asm-generic/page.h                   |  3 ++
 include/linux/highmem.h                      | 10 ++++
 mm/memory.c                                  |  3 +-
 tools/arch/x86/lib/memset_64.S               | 68 ++++++++++++++++------------
 tools/perf/bench/mem-memset-x86-64-asm-def.h |  6 ++-
 13 files changed, 158 insertions(+), 62 deletions(-)

--
2.9.3


2020-10-14 14:25:15

by Ankur Arora

Subject: [PATCH 3/8] perf bench: add memset_movnti()

Clone memset_movnti() from arch/x86/lib/memset_64.S.

perf bench mem memset -f x86-64-movnt numbers on Intel Broadwellx,
Skylakex and AMD Rome:

Intel Broadwellx:
$ for i in 2 8 32 128 512; do
perf bench mem memset -f x86-64-movnt -s ${i}MB
done

# Output pruned.
# Running 'mem/memset' benchmark:
# function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
# Copying 2MB bytes ...
11.837121 GB/sec
# Copying 8MB bytes ...
11.783560 GB/sec
# Copying 32MB bytes ...
11.868591 GB/sec
# Copying 128MB bytes ...
11.865211 GB/sec
# Copying 512MB bytes ...
11.864085 GB/sec

Intel Skylakex:
$ for i in 2 8 32 128 512; do
perf bench mem memset -f x86-64-movnt -s ${i}MB
done
# Running 'mem/memset' benchmark:
# function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
# Copying 2MB bytes ...
6.361971 GB/sec
# Copying 8MB bytes ...
6.300403 GB/sec
# Copying 32MB bytes ...
6.288992 GB/sec
# Copying 128MB bytes ...
6.328793 GB/sec
# Copying 512MB bytes ...
6.324471 GB/sec

AMD Rome:
$ for i in 2 8 32 128 512; do
perf bench mem memset -f x86-64-movnt -s ${i}MB
done
# Running 'mem/memset' benchmark:
# function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
# Copying 2MB bytes ...
10.993199 GB/sec
# Copying 8MB bytes ...
14.221784 GB/sec
# Copying 32MB bytes ...
14.293337 GB/sec
# Copying 128MB bytes ...
15.238947 GB/sec
# Copying 512MB bytes ...
16.476093 GB/sec

Signed-off-by: Ankur Arora <[email protected]>
---
 tools/arch/x86/lib/memset_64.S               | 68 ++++++++++++++++------------
 tools/perf/bench/mem-memset-x86-64-asm-def.h |  6 ++-
 2 files changed, 43 insertions(+), 31 deletions(-)

diff --git a/tools/arch/x86/lib/memset_64.S b/tools/arch/x86/lib/memset_64.S
index fd5d25a474b7..bfbf6d06f81e 100644
--- a/tools/arch/x86/lib/memset_64.S
+++ b/tools/arch/x86/lib/memset_64.S
@@ -26,7 +26,7 @@ SYM_FUNC_START(__memset)
*
* Otherwise, use original memset function.
*/
- ALTERNATIVE_2 "jmp memset_orig", "", X86_FEATURE_REP_GOOD, \
+ ALTERNATIVE_2 "jmp memset_movq", "", X86_FEATURE_REP_GOOD, \
"jmp memset_erms", X86_FEATURE_ERMS

movq %rdi,%r9
@@ -65,7 +65,8 @@ SYM_FUNC_START(memset_erms)
ret
SYM_FUNC_END(memset_erms)

-SYM_FUNC_START(memset_orig)
+.macro MEMSET_MOV OP fence
+SYM_FUNC_START(memset_\OP)
movq %rdi,%r10

/* expand byte value */
@@ -76,64 +77,71 @@ SYM_FUNC_START(memset_orig)
/* align dst */
movl %edi,%r9d
andl $7,%r9d
- jnz .Lbad_alignment
-.Lafter_bad_alignment:
+ jnz .Lbad_alignment_\@
+.Lafter_bad_alignment_\@:

movq %rdx,%rcx
shrq $6,%rcx
- jz .Lhandle_tail
+ jz .Lhandle_tail_\@

.p2align 4
-.Lloop_64:
+.Lloop_64_\@:
decq %rcx
- movq %rax,(%rdi)
- movq %rax,8(%rdi)
- movq %rax,16(%rdi)
- movq %rax,24(%rdi)
- movq %rax,32(%rdi)
- movq %rax,40(%rdi)
- movq %rax,48(%rdi)
- movq %rax,56(%rdi)
+ \OP %rax,(%rdi)
+ \OP %rax,8(%rdi)
+ \OP %rax,16(%rdi)
+ \OP %rax,24(%rdi)
+ \OP %rax,32(%rdi)
+ \OP %rax,40(%rdi)
+ \OP %rax,48(%rdi)
+ \OP %rax,56(%rdi)
leaq 64(%rdi),%rdi
- jnz .Lloop_64
+ jnz .Lloop_64_\@

/* Handle tail in loops. The loops should be faster than hard
to predict jump tables. */
.p2align 4
-.Lhandle_tail:
+.Lhandle_tail_\@:
movl %edx,%ecx
andl $63&(~7),%ecx
- jz .Lhandle_7
+ jz .Lhandle_7_\@
shrl $3,%ecx
.p2align 4
-.Lloop_8:
+.Lloop_8_\@:
decl %ecx
- movq %rax,(%rdi)
+ \OP %rax,(%rdi)
leaq 8(%rdi),%rdi
- jnz .Lloop_8
+ jnz .Lloop_8_\@

-.Lhandle_7:
+.Lhandle_7_\@:
andl $7,%edx
- jz .Lende
+ jz .Lende_\@
.p2align 4
-.Lloop_1:
+.Lloop_1_\@:
decl %edx
movb %al,(%rdi)
leaq 1(%rdi),%rdi
- jnz .Lloop_1
+ jnz .Lloop_1_\@

-.Lende:
+.Lende_\@:
+ .if \fence
+ sfence
+ .endif
movq %r10,%rax
ret

-.Lbad_alignment:
+.Lbad_alignment_\@:
cmpq $7,%rdx
- jbe .Lhandle_7
+ jbe .Lhandle_7_\@
movq %rax,(%rdi) /* unaligned store */
movq $8,%r8
subq %r9,%r8
addq %r8,%rdi
subq %r8,%rdx
- jmp .Lafter_bad_alignment
-.Lfinal:
-SYM_FUNC_END(memset_orig)
+ jmp .Lafter_bad_alignment_\@
+.Lfinal_\@:
+SYM_FUNC_END(memset_\OP)
+.endm
+
+MEMSET_MOV OP=movq fence=0
+MEMSET_MOV OP=movnti fence=1
diff --git a/tools/perf/bench/mem-memset-x86-64-asm-def.h b/tools/perf/bench/mem-memset-x86-64-asm-def.h
index dac6d2b7c39b..53ead7f91313 100644
--- a/tools/perf/bench/mem-memset-x86-64-asm-def.h
+++ b/tools/perf/bench/mem-memset-x86-64-asm-def.h
@@ -1,6 +1,6 @@
/* SPDX-License-Identifier: GPL-2.0 */

-MEMSET_FN(memset_orig,
+MEMSET_FN(memset_movq,
"x86-64-unrolled",
"unrolled memset() in arch/x86/lib/memset_64.S")

@@ -11,3 +11,7 @@ MEMSET_FN(__memset,
MEMSET_FN(memset_erms,
"x86-64-stosb",
"movsb-based memset() in arch/x86/lib/memset_64.S")
+
+MEMSET_FN(memset_movnti,
+ "x86-64-movnt",
+ "movnt-based memset() in arch/x86/lib/memset_64.S")
--
2.9.3

2020-10-14 14:26:56

by Ankur Arora

Subject: [PATCH 2/8] x86/asm: add memset_movnti()

Add a MOVNTI based implementation of memset().

memset_movnti() differs from memset_orig() only in the opcode used in
the inner loop, so move the memset_orig() logic into a MEMSET_MOV macro,
which is expanded into memset_movq() (the renamed memset_orig()) and
memset_movnti().

Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/lib/memset_64.S | 68 +++++++++++++++++++++++++++---------------------
1 file changed, 38 insertions(+), 30 deletions(-)

diff --git a/arch/x86/lib/memset_64.S b/arch/x86/lib/memset_64.S
index 9ff15ee404a4..79703cc04b6a 100644
--- a/arch/x86/lib/memset_64.S
+++ b/arch/x86/lib/memset_64.S
@@ -27,7 +27,7 @@ SYM_FUNC_START(__memset)
*
* Otherwise, use original memset function.
*/
- ALTERNATIVE_2 "jmp memset_orig", "", X86_FEATURE_REP_GOOD, \
+ ALTERNATIVE_2 "jmp memset_movq", "", X86_FEATURE_REP_GOOD, \
"jmp memset_erms", X86_FEATURE_ERMS

movq %rdi,%r9
@@ -68,7 +68,8 @@ SYM_FUNC_START_LOCAL(memset_erms)
ret
SYM_FUNC_END(memset_erms)

-SYM_FUNC_START_LOCAL(memset_orig)
+.macro MEMSET_MOV OP fence
+SYM_FUNC_START_LOCAL(memset_\OP)
movq %rdi,%r10

/* expand byte value */
@@ -79,64 +80,71 @@ SYM_FUNC_START_LOCAL(memset_orig)
/* align dst */
movl %edi,%r9d
andl $7,%r9d
- jnz .Lbad_alignment
-.Lafter_bad_alignment:
+ jnz .Lbad_alignment_\@
+.Lafter_bad_alignment_\@:

movq %rdx,%rcx
shrq $6,%rcx
- jz .Lhandle_tail
+ jz .Lhandle_tail_\@

.p2align 4
-.Lloop_64:
+.Lloop_64_\@:
decq %rcx
- movq %rax,(%rdi)
- movq %rax,8(%rdi)
- movq %rax,16(%rdi)
- movq %rax,24(%rdi)
- movq %rax,32(%rdi)
- movq %rax,40(%rdi)
- movq %rax,48(%rdi)
- movq %rax,56(%rdi)
+ \OP %rax,(%rdi)
+ \OP %rax,8(%rdi)
+ \OP %rax,16(%rdi)
+ \OP %rax,24(%rdi)
+ \OP %rax,32(%rdi)
+ \OP %rax,40(%rdi)
+ \OP %rax,48(%rdi)
+ \OP %rax,56(%rdi)
leaq 64(%rdi),%rdi
- jnz .Lloop_64
+ jnz .Lloop_64_\@

/* Handle tail in loops. The loops should be faster than hard
to predict jump tables. */
.p2align 4
-.Lhandle_tail:
+.Lhandle_tail_\@:
movl %edx,%ecx
andl $63&(~7),%ecx
- jz .Lhandle_7
+ jz .Lhandle_7_\@
shrl $3,%ecx
.p2align 4
-.Lloop_8:
+.Lloop_8_\@:
decl %ecx
- movq %rax,(%rdi)
+ \OP %rax,(%rdi)
leaq 8(%rdi),%rdi
- jnz .Lloop_8
+ jnz .Lloop_8_\@

-.Lhandle_7:
+.Lhandle_7_\@:
andl $7,%edx
- jz .Lende
+ jz .Lende_\@
.p2align 4
-.Lloop_1:
+.Lloop_1_\@:
decl %edx
movb %al,(%rdi)
leaq 1(%rdi),%rdi
- jnz .Lloop_1
+ jnz .Lloop_1_\@

-.Lende:
+.Lende_\@:
+ .if \fence
+ sfence
+ .endif
movq %r10,%rax
ret

-.Lbad_alignment:
+.Lbad_alignment_\@:
cmpq $7,%rdx
- jbe .Lhandle_7
+ jbe .Lhandle_7_\@
movq %rax,(%rdi) /* unaligned store */
movq $8,%r8
subq %r9,%r8
addq %r8,%rdi
subq %r8,%rdx
- jmp .Lafter_bad_alignment
-.Lfinal:
-SYM_FUNC_END(memset_orig)
+ jmp .Lafter_bad_alignment_\@
+.Lfinal_\@:
+SYM_FUNC_END(memset_\OP)
+.endm
+
+MEMSET_MOV OP=movq fence=0
+MEMSET_MOV OP=movnti fence=1
--
2.9.3

2020-10-14 14:29:32

by Ankur Arora

Subject: [PATCH 1/8] x86/cpuid: add X86_FEATURE_NT_GOOD

Add a synthetic CPU feature, X86_FEATURE_NT_GOOD, to be enabled on
microarchitectures where the non-temporal MOV (MOVNTI) instruction
performs well.
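For context, a sketch of how such a synthetic bit gets set from the
vendor init code (patches 7 and 8 do this for Broadwellx and Rome; the
function name and model check below are illustrative, not the actual
hunks):

static void init_nt_good_sketch(struct cpuinfo_x86 *c)
{
	/* Example: Broadwell-X is one microarchitecture where MOVNTI
	 * performs well; the real patches key off the same kind of
	 * family/model check. */
	if (c->x86 == 6 && c->x86_model == INTEL_FAM6_BROADWELL_X)
		set_cpu_cap(c, X86_FEATURE_NT_GOOD);
}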

Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/include/asm/cpufeatures.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 7b0afd5e6c57..8bae38240346 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -289,6 +289,7 @@
#define X86_FEATURE_FENCE_SWAPGS_KERNEL (11*32+ 5) /* "" LFENCE in kernel entry SWAPGS path */
#define X86_FEATURE_SPLIT_LOCK_DETECT (11*32+ 6) /* #AC for split lock */
#define X86_FEATURE_PER_THREAD_MBA (11*32+ 7) /* "" Per-thread Memory Bandwidth Allocation */
+#define X86_FEATURE_NT_GOOD (11*32+ 8) /* Non-temporal instructions perform well */

/* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
#define X86_FEATURE_AVX512_BF16 (12*32+ 5) /* AVX512 BFLOAT16 instructions */
--
2.9.3