Add clear_page_nt() which is essentially an unrolled MOVNTI loop. The
unrolling keeps the inner loop similar to memset_movnti() which can be
exercised via perf bench mem memset.
The caller needs to execute an SFENCE when done.
MOVNTI, from the Intel SDM, Volume 2B, 4-101:
"The non-temporal hint is implemented by using a write combining (WC)
memory type protocol when writing the data to memory. Using this
protocol, the processor does not write the data into the cache hierarchy,
nor does it fetch the corresponding cache line from memory into the
cache hierarchy."
The AMD Arch Manual has something similar to say as well.
This can potentially improve page-clearing bandwidth (see below for
performance numbers for two microarchitectures where it helps and one
where it doesn't) and can help indirectly by consuming less cache
resources.
Any performance benefits are expected for extents larger than LLC-sized
or more -- when we are DRAM-BW constrained rather than cache-BW
constrained.
# Intel Broadwellx
# Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
# (X86_FEATURE_ERMS) and x86-64-movnt:
System: Oracle X6-2
CPU: 2 nodes * 10 cores/node * 2 threads/core
Intel Xeon E5-2630 v4 (Broadwellx, 6:79:1)
Memory: 256G evenly split between nodes
Microcode: 0xb00002e
scaling_governor: performance
L3 size: 25MB
intel_pstate/no_turbo: 1
x86-64-stosb (5 runs) x86-64-movnt (5 runs) speedup
----------------------- ----------------------- -------
size BW ( pstdev) BW ( pstdev)
16MB 17.35 GB/s ( +- 9.27%) 11.83 GB/s ( +- 0.19%) -31.81%
128MB 5.31 GB/s ( +- 0.13%) 11.72 GB/s ( +- 0.44%) +121.84%
1024MB 5.42 GB/s ( +- 0.13%) 11.78 GB/s ( +- 0.03%) +117.34%
4096MB 5.41 GB/s ( +- 0.41%) 11.76 GB/s ( +- 0.07%) +117.37%
Comparing perf stats for size=4096MB:
$ perf stat -r 5 --all-user -e ... perf bench mem memset -l 1 -s 4096MB -f x86-64-stosb
# Running 'mem/memset' benchmark:
# function 'x86-64-stosb' (movsb-based memset() in arch/x86/lib/memset_64.S)
# Copying 4096MB bytes ...
5.405362 GB/sec
5.444229 GB/sec
5.397943 GB/sec
5.401012 GB/sec
5.439320 GB/sec
Performance counter stats for 'perf bench mem memset -l 1 -s 4096MB -f x86-64-stosb' (5 runs):
2,064,476,092 cpu-cycles # 1.087 GHz ( +- 0.17% ) (22.19%)
8,578,591 instructions # 0.00 insn per cycle ( +- 12.01% ) (27.79%)
132,481,645 cache-references # 69.730 M/sec ( +- 0.20% ) (27.83%)
157,710 cache-misses # 0.119 % of all cache refs ( +- 5.80% ) (27.84%)
2,879,628 branch-instructions # 1.516 M/sec ( +- 0.21% ) (27.86%)
80,581 branch-misses # 2.80% of all branches ( +- 13.15% ) (27.84%)
94,401,869 bus-cycles # 49.687 M/sec ( +- 0.25% ) (22.21%)
133,947,283 L1-dcache-load-misses # 139717.91% of all L1-dcache accesses ( +- 0.26% ) (22.21%)
95,870 L1-dcache-loads # 0.050 M/sec ( +- 9.95% ) (22.21%)
1,700 LLC-loads # 0.895 K/sec ( +- 6.50% ) (22.21%)
1,410 LLC-load-misses # 82.95% of all LL-cache accesses ( +- 19.42% ) (22.21%)
132,526,771 LLC-stores # 69.754 M/sec ( +- 0.65% ) (11.10%)
101,145 LLC-store-misses # 0.053 M/sec ( +- 11.19% ) (11.10%)
1.90238 +- 0.00358 seconds time elapsed ( +- 0.19% )
$ perf stat -r 5 --all-user -e ... perf bench mem memset -l 1 -s 4096MB -f x86-64-movnt
# Running 'mem/memset' benchmark:
# function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
# Copying 4096MB bytes ...
11.774264 GB/sec
11.758826 GB/sec
11.774368 GB/sec
11.758239 GB/sec
11.760348 GB/sec
Performance counter stats for 'perf bench mem memset -l 1 -s 4096MB -f x86-64-movnt' (5 runs):
1,619,807,936 cpu-cycles # 0.971 GHz ( +- 0.24% ) (22.14%)
1,481,306,856 instructions # 0.91 insn per cycle ( +- 0.33% ) (27.75%)
163,086 cache-references # 0.098 M/sec ( +- 11.68% ) (27.79%)
39,913 cache-misses # 24.474 % of all cache refs ( +- 26.45% ) (27.84%)
135,741,931 branch-instructions # 81.353 M/sec ( +- 0.33% ) (27.89%)
82,647 branch-misses # 0.06% of all branches ( +- 6.29% ) (27.90%)
73,575,446 bus-cycles # 44.095 M/sec ( +- 0.28% ) (22.28%)
27,834 L1-dcache-load-misses # 68.42% of all L1-dcache accesses ( +- 65.93% ) (22.28%)
40,683 L1-dcache-loads # 0.024 M/sec ( +- 42.62% ) (22.27%)
2,598 LLC-loads # 0.002 M/sec ( +- 22.66% ) (22.25%)
1,523 LLC-load-misses # 58.60% of all LL-cache accesses ( +- 39.64% ) (22.22%)
2 LLC-stores # 0.001 K/sec ( +-100.00% ) (11.08%)
0 LLC-store-misses # 0.000 K/sec (11.07%)
1.67003 +- 0.00169 seconds time elapsed ( +- 0.10% )
The L1-dcache-load-miss (L1D.REPLACEMENT) counts are significantly down,
which does confirm that unlike "REP; STOSB", MOVNTI does not result in a
write-allocate.
# AMD Rome
# Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosq
# (X86_FEATURE_REP_GOOD) and x86-64-movnt:
System: Oracle E2-2c
CPU: 2 nodes * 64 cores/node * 2 threads/core
AMD EPYC 7742 (Rome, 23:49:0)
Memory: 2048 GB evenly split between nodes
Microcode: 0x8301038
scaling_governor: performance
L3 size: 16 * 16MB
cpufreq/boost: 0
x86-64-stosq (5 runs) x86-64-movnt (5 runs) speedup
----------------------- ----------------------- -------
size BW ( pstdev) BW ( pstdev)
16MB 15.39 GB/s ( +- 9.14%) 14.56 GB/s ( +-19.43%) -5.39%
128MB 11.04 GB/s ( +- 4.87%) 14.49 GB/s ( +-13.22%) +31.25%
1024MB 11.86 GB/s ( +- 0.83%) 16.54 GB/s ( +- 0.04%) +39.46%
4096MB 11.89 GB/s ( +- 0.61%) 16.49 GB/s ( +- 0.28%) +38.68%
Comparing perf stats for size=4096MB:
$ perf stat -r 5 --all-user -e ... perf bench mem memset -l 1 -s 4096MB -f x86-64-stosq
# Running 'mem/memset' benchmark:
# function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
# Copying 4096MB bytes ...
11.785122 GB/sec
11.970851 GB/sec
11.916821 GB/sec
11.861517 GB/sec
11.941867 GB/sec
Performance counter stats for 'perf bench mem memset -l 1 -s 4096MB -f x86-64-stosq' (5 runs):
1,014,645,096 cpu-cycles # 1.264 GHz ( +- 0.18% ) (45.28%)
4,620,983 instructions # 0.00 insn per cycle ( +- 1.86% ) (45.37%)
262,988,622 cache-references # 327.723 M/sec ( +- 0.21% ) (45.51%)
6,312,740 cache-misses # 2.400 % of all cache refs ( +- 1.12% ) (45.56%)
1,792,517 branch-instructions # 2.234 M/sec ( +- 0.20% ) (45.60%)
54,095 branch-misses # 3.02% of all branches ( +- 2.99% ) (45.64%)
133,710,131 L1-dcache-load-misses # 363.51% of all L1-dcache accesses ( +- 0.12% ) (45.55%)
36,783,396 L1-dcache-loads # 45.838 M/sec ( +- 0.79% ) (45.46%)
53,411,709 L1-dcache-prefetches # 66.559 M/sec ( +- 0.28% ) (45.39%)
0.80303 +- 0.00117 seconds time elapsed ( +- 0.15% )
$ perf stat -r 5 --all-user -e ... perf bench mem memset -l 1 -s 4096MB -f x86-64-movnt
# Running 'mem/memset' benchmark:
# function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
# Copying 4096MB bytes ...
16.533230 GB/sec
16.496138 GB/sec
16.480302 GB/sec
16.478333 GB/sec
16.474600 GB/sec
Performance counter stats for 'perf bench mem memset -l 1 -s 4096MB -f x86-64-movnt' (5 runs):
1,091,352,779 cpu-cycles # 1.292 GHz ( +- 0.32% ) (45.25%)
1,483,248,390 instructions # 1.36 insn per cycle ( +- 0.14% ) (45.38%)
134,114,985 cache-references # 158.723 M/sec ( +- 0.17% ) (45.51%)
117,682 cache-misses # 0.088 % of all cache refs ( +- 0.99% ) (45.59%)
135,009,275 branch-instructions # 159.781 M/sec ( +- 0.18% ) (45.68%)
50,659 branch-misses # 0.04% of all branches ( +- 7.50% ) (45.66%)
58,569 L1-dcache-load-misses # 5.84% of all L1-dcache accesses ( +- 6.04% ) (45.57%)
1,002,657 L1-dcache-loads # 1.187 M/sec ( +- 15.40% ) (45.45%)
3,111 L1-dcache-prefetches # 0.004 M/sec ( +- 31.21% ) (45.38%)
0.84554 +- 0.00289 seconds time elapsed ( +- 0.34% )
Similar to Intel Broadwellx, the L1-dcache-load-misses (L2$ access from
DC Miss) counts are significantly lower. The L1 prefetcher is also
fairly quiet.
# Intel Skylakex
# Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
# (X86_FEATURE_ERMS) and x86-64-movnt:
System: Oracle X8-2
CPU: 2 nodes * 26 cores/node * 2 threads/core
Intel Xeon Platinum 8270CL (Skylakex, 6:85:7)
Memory: 3TB evenly split between nodes
Microcode: 0x5002f01
scaling_governor: performance
L3 size: 36MB
intel_pstate/no_turbo: 1
x86-64-stosb (5 runs) x86-64-movnt (5 runs) speedup
----------------------- ----------------------- -------
size BW ( pstdev) BW ( pstdev)
16MB 20.38 GB/s ( +- 2.58%) 6.25 GB/s ( +- 0.41%) -69.28%
128MB 6.52 GB/s ( +- 0.14%) 6.31 GB/s ( +- 0.47%) -3.22%
1024MB 6.48 GB/s ( +- 0.31%) 6.24 GB/s ( +- 0.00%) -3.70%
4096MB 6.51 GB/s ( +- 0.01%) 6.27 GB/s ( +- 0.42%) -3.68%
Comparing perf stats for size=4096MB:
$ perf stat -r 5 --all-user -e ... perf bench mem memset -l 1 -s 4096MB -f x86-64-stosb
# Running 'mem/memset' benchmark:
# function 'x86-64-stosb' (movsb-based memset() in arch/x86/lib/memset_64.S)
# Copying 4096MB bytes ...
6.516972 GB/sec
6.518756 GB/sec
6.517620 GB/sec
6.517598 GB/sec
6.518799 GB/sec
Performance counter stats for 'perf bench mem memset -l 1 -s 4096MB -f x86-64-stosb' (5 runs):
3,357,373,317 cpu-cycles # 1.133 GHz ( +- 0.01% ) (29.38%)
165,063,710 instructions # 0.05 insn per cycle ( +- 1.54% ) (35.29%)
358,997 cache-references # 0.121 M/sec ( +- 0.89% ) (35.32%)
205,420 cache-misses # 57.221 % of all cache refs ( +- 3.61% ) (35.36%)
6,117,673 branch-instructions # 2.065 M/sec ( +- 1.48% ) (35.38%)
58,309 branch-misses # 0.95% of all branches ( +- 1.30% ) (35.39%)
31,329,466 bus-cycles # 10.575 M/sec ( +- 0.03% ) (23.56%)
68,543,766 L1-dcache-load-misses # 157.03% of all L1-dcache accesses ( +- 0.02% ) (23.53%)
43,648,909 L1-dcache-loads # 14.734 M/sec ( +- 0.50% ) (23.50%)
137,498 LLC-loads # 0.046 M/sec ( +- 0.21% ) (23.49%)
12,308 LLC-load-misses # 8.95% of all LL-cache accesses ( +- 2.52% ) (23.49%)
26,335 LLC-stores # 0.009 M/sec ( +- 5.65% ) (11.75%)
25,008 LLC-store-misses # 0.008 M/sec ( +- 3.42% ) (11.75%)
2.962842 +- 0.000162 seconds time elapsed ( +- 0.01% )
$ perf stat -r 5 --all-user -e ... perf bench mem memset -l 1 -s 4096MB -f x86-64-movnt
# Running 'mem/memset' benchmark:
# function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
# Copying 4096MB bytes ...
6.283420 GB/sec
6.222843 GB/sec
6.282976 GB/sec
6.282828 GB/sec
6.283173 GB/sec
Performance counter stats for 'perf bench mem memset -l 1 -s 4096MB -f x86-64-movnt' (5 runs):
4,462,272,094 cpu-cycles # 1.322 GHz ( +- 0.30% ) (29.38%)
1,633,675,881 instructions # 0.37 insn per cycle ( +- 0.21% ) (35.28%)
283,627 cache-references # 0.084 M/sec ( +- 0.58% ) (35.31%)
28,824 cache-misses # 10.163 % of all cache refs ( +- 20.67% ) (35.34%)
139,719,697 branch-instructions # 41.407 M/sec ( +- 0.16% ) (35.35%)
58,062 branch-misses # 0.04% of all branches ( +- 1.49% ) (35.36%)
41,760,350 bus-cycles # 12.376 M/sec ( +- 0.05% ) (23.55%)
303,300 L1-dcache-load-misses # 0.69% of all L1-dcache accesses ( +- 2.08% ) (23.53%)
43,769,498 L1-dcache-loads # 12.972 M/sec ( +- 0.54% ) (23.52%)
99,570 LLC-loads # 0.030 M/sec ( +- 1.06% ) (23.52%)
1,966 LLC-load-misses # 1.97% of all LL-cache accesses ( +- 6.17% ) (23.52%)
129 LLC-stores # 0.038 K/sec ( +- 27.85% ) (11.75%)
7 LLC-store-misses # 0.002 K/sec ( +- 47.82% ) (11.75%)
3.37465 +- 0.00474 seconds time elapsed ( +- 0.14% )
The L1-dcache-load-misses (L1D.REPLACEMENT) count is much lower just
like the previous two cases. No performance improvement for Skylakex
though.
Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/include/asm/page_64.h | 1 +
arch/x86/lib/clear_page_64.S | 26 ++++++++++++++++++++++++++
2 files changed, 27 insertions(+)
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 939b1cff4a7b..bde3c2785ec4 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -43,6 +43,7 @@ extern unsigned long __phys_addr_symbol(unsigned long);
void clear_page_orig(void *page);
void clear_page_rep(void *page);
void clear_page_erms(void *page);
+void clear_page_nt(void *page);
static inline void clear_page(void *page)
{
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index c4c7dd115953..f16bb753b236 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -50,3 +50,29 @@ SYM_FUNC_START(clear_page_erms)
ret
SYM_FUNC_END(clear_page_erms)
EXPORT_SYMBOL_GPL(clear_page_erms)
+
+/*
+ * Zero a page.
+ * %rdi - page
+ *
+ * Caller needs to issue a fence at the end.
+ */
+SYM_FUNC_START(clear_page_nt)
+ xorl %eax,%eax
+ movl $4096,%ecx
+
+ .p2align 4
+.Lstart:
+ movnti %rax, 0x00(%rdi)
+ movnti %rax, 0x08(%rdi)
+ movnti %rax, 0x10(%rdi)
+ movnti %rax, 0x18(%rdi)
+ movnti %rax, 0x20(%rdi)
+ movnti %rax, 0x28(%rdi)
+ movnti %rax, 0x30(%rdi)
+ movnti %rax, 0x38(%rdi)
+ addq $0x40, %rdi
+ subl $0x40, %ecx
+ ja .Lstart
+ ret
+SYM_FUNC_END(clear_page_nt)
--
2.9.3
On Wed, Oct 14, 2020 at 01:32:55AM -0700, Ankur Arora wrote:
> This can potentially improve page-clearing bandwidth (see below for
> performance numbers for two microarchitectures where it helps and one
> where it doesn't) and can help indirectly by consuming less cache
> resources.
>
> Any performance benefits are expected for extents larger than LLC-sized
> or more -- when we are DRAM-BW constrained rather than cache-BW
> constrained.
"potentially", "expected", I don't like those formulations. Do you have
some actual benchmark data where this shows any improvement and not
microbenchmarks only, to warrant the additional complexity?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 2020-10-14 12:56 p.m., Borislav Petkov wrote:
> On Wed, Oct 14, 2020 at 01:32:55AM -0700, Ankur Arora wrote:
>> This can potentially improve page-clearing bandwidth (see below for
>> performance numbers for two microarchitectures where it helps and one
>> where it doesn't) and can help indirectly by consuming less cache
>> resources.
>>
>> Any performance benefits are expected for extents larger than LLC-sized
>> or more -- when we are DRAM-BW constrained rather than cache-BW
>> constrained.
>
> "potentially", "expected", I don't like those formulations.
That's fair. The reason for those weasel words is mostly because it
is microarchitecture specific.
For example on Intel where I did compare across generations: I see good
performance on Broadwellx, not good on Skylakex and then good again on
some pre-production CPUs.
> Do you have
> some actual benchmark data where this shows any improvement and not
> microbenchmarks only, to warrant the additional complexity?
Yes, guest creation under QEMU (pinned guests) shows similar improvements.
I've posted performance numbers in patches 7, 8 with a simple page-fault
test derived from that.
I can add numbers from QEMU as well.
Thanks,
Ankur
>