System: Oracle E2-2C
CPU: 2 nodes * 64 cores/node * 2 threads/core
AMD EPYC 7742 (Rome, 23:49:0)
Memory: 2048 GB evenly split between nodes
Microcode: 0x8301038
scaling_governor: performance
L3 size: 16 * 16MB
cpufreq/boost: 0
Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosq
(X86_FEATURE_REP_GOOD) and x86-64-movnt (X86_FEATURE_NT_GOOD):
          x86-64-stosq (5 runs)     x86-64-movnt (5 runs)     speedup
          -----------------------   -----------------------   -------
    size  BW         (   pstdev)    BW         (   pstdev)
    16MB  15.39 GB/s ( +- 9.14%)    14.56 GB/s ( +-19.43%)     -5.39%
   128MB  11.04 GB/s ( +- 4.87%)    14.49 GB/s ( +-13.22%)    +31.25%
  1024MB  11.86 GB/s ( +- 0.83%)    16.54 GB/s ( +- 0.04%)    +39.46%
  4096MB  11.89 GB/s ( +- 0.61%)    16.49 GB/s ( +- 0.28%)    +38.68%
The next workload exercises the page-clearing path directly by faulting over
an anonymous mmap region backed by 1GB pages. This workload is similar to the
creation phase of pinned guests in QEMU.
$ cat pf-test.c
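#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <linux/mman.h>

#define HPAGE_BITS	30

int main(int argc, char **argv) {
	unsigned long i;
	unsigned long len;	/* In GB */
	unsigned long offset = 0;
	unsigned long numpages;
	char *base;

	if (argc < 2)
		return 1;

	len = strtoul(argv[1], NULL, 0);
	len *= 1UL << 30;
	numpages = len >> HPAGE_BITS;

	/* fd is ignored for MAP_ANONYMOUS on Linux; -1 is the portable value. */
	base = mmap(NULL, len, PROT_READ|PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS |
		    MAP_HUGETLB | MAP_HUGE_1GB, -1, 0);
	if (base == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Touch one byte per 1GB page so that each page is faulted in. */
	for (i = 0; i < numpages; i++) {
		*((volatile char *)base + offset) = *(base + offset);
		offset += 1UL << HPAGE_BITS;
	}

	return 0;
}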
The specific test is for a 128GB region but this is a single-threaded
O(n) workload so the exact region size is not material.
Page-clearing throughput for clear_page_rep(): 11.33 GB/s
$ perf stat -r 5 --all-kernel -e ... bin/pf-test 128
Performance counter stats for 'bin/pf-test 128' (5 runs):
25,130,082,910 cpu-cycles # 2.226 GHz ( +- 0.44% ) (54.54%)
1,368,762,311 instructions # 0.05 insn per cycle ( +- 0.02% ) (54.54%)
4,265,726,534 cache-references # 377.794 M/sec ( +- 0.02% ) (54.54%)
119,021,793 cache-misses # 2.790 % of all cache refs ( +- 3.90% ) (54.55%)
413,825,787 branch-instructions # 36.650 M/sec ( +- 0.01% ) (54.55%)
236,847 branch-misses # 0.06% of all branches ( +- 18.80% ) (54.56%)
2,152,320,887 L1-dcache-load-misses # 40.40% of all L1-dcache accesses ( +- 0.01% ) (54.55%)
5,326,873,560 L1-dcache-loads # 471.775 M/sec ( +- 0.20% ) (54.55%)
828,943,234 L1-dcache-prefetches # 73.415 M/sec ( +- 0.55% ) (54.54%)
18,914 dTLB-loads # 0.002 M/sec ( +- 47.23% ) (54.54%)
4,423 dTLB-load-misses # 23.38% of all dTLB cache accesses ( +- 27.75% ) (54.54%)
11.2917 +- 0.0499 seconds time elapsed ( +- 0.44% )
Page-clearing throughput for clear_page_nt(): 16.29 GB/s
$ perf stat -r 5 --all-kernel -e ... bin/pf-test 128
Performance counter stats for 'bin/pf-test 128' (5 runs):
17,523,166,924 cpu-cycles # 2.230 GHz ( +- 0.03% ) (45.43%)
24,801,270,826 instructions # 1.42 insn per cycle ( +- 0.01% ) (45.45%)
2,151,391,033 cache-references # 273.845 M/sec ( +- 0.01% ) (45.46%)
168,555 cache-misses # 0.008 % of all cache refs ( +- 4.87% ) (45.47%)
2,490,226,446 branch-instructions # 316.974 M/sec ( +- 0.01% ) (45.48%)
117,604 branch-misses # 0.00% of all branches ( +- 1.56% ) (45.48%)
273,492 L1-dcache-load-misses # 0.06% of all L1-dcache accesses ( +- 2.14% ) (45.47%)
490,340,458 L1-dcache-loads # 62.414 M/sec ( +- 0.02% ) (45.45%)
20,517 L1-dcache-prefetches # 0.003 M/sec ( +- 9.61% ) (45.44%)
7,413 dTLB-loads # 0.944 K/sec ( +- 8.37% ) (45.44%)
2,031 dTLB-load-misses # 27.40% of all dTLB cache accesses ( +- 8.30% ) (45.43%)
7.85674 +- 0.00270 seconds time elapsed ( +- 0.03% )
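The two throughput figures follow from the region size and the elapsed
time reported by perf stat (128 GB / 11.2917 s and 128 GB / 7.85674 s,
with GB meaning 2^30 bytes as in pf-test.c). A minimal sketch of that
arithmetic, where clear_throughput_gbps() is a hypothetical helper used
only for illustration and is not part of this patch:

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of the patch): clearing throughput in
 * GB/s for a region of region_gb GB that took elapsed_sec seconds to
 * fault in. GB here means 2^30 bytes, matching pf-test.c.
 */
static double clear_throughput_gbps(unsigned long region_gb, double elapsed_sec)
{
	assert(elapsed_sec > 0.0);
	return (double)region_gb / elapsed_sec;
}
```

Calling it with (128, 11.2917) and (128, 7.85674) gives roughly 11.3 and
16.3, in line with the figures quoted for clear_page_rep() and
clear_page_nt().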
With clear_page_nt(), the L1-dcache-load-misses (L2$ access from DC Miss)
count is substantially lower (273,492 vs 2,152,320,887), which suggests
that the non-temporal stores avoid write-allocate and RFO. The
L1-dcache-prefetches are also substantially lower.
Note that the IPC, instruction counts, etc. are quite different, but
that's just an artifact of switching from a single 'REP; STOSQ' per
PAGE_SIZE region to a MOVNTI loop.
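To illustrate why the instruction counts diverge, here is a user-space
sketch of the two clearing strategies. It is not the kernel's
clear_page_rep()/clear_page_nt() code: the sketch_* names and the 4K
SKETCH_PAGE_SIZE are assumptions made for this example, and non-x86-64
builds fall back to memset(). The first variant issues one string
instruction per page; the second issues one non-temporal store per
8 bytes, followed by a fence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__)
#include <emmintrin.h>	/* _mm_stream_si64() (MOVNTI), _mm_sfence() */
#endif

#define SKETCH_PAGE_SIZE 4096UL	/* assumed 4K base page */

/* One 'REP; STOSQ' per page: RCX = PAGE_SIZE / 8, RAX = 0, RDI = page. */
static void sketch_clear_page_rep(void *page)
{
#if defined(__x86_64__)
	unsigned long cnt = SKETCH_PAGE_SIZE / 8;
	void *dst = page;

	__asm__ volatile("rep stosq"
			 : "+D" (dst), "+c" (cnt)
			 : "a" (0UL)
			 : "memory");
#else
	memset(page, 0, SKETCH_PAGE_SIZE);	/* portable fallback */
#endif
}

/* A loop of 8-byte non-temporal stores, one per quadword of the page. */
static void sketch_clear_page_nt(void *page)
{
#if defined(__x86_64__)
	long long *p = page;
	size_t i;

	/* The long long stores below assume an 8-byte aligned buffer. */
	assert(((uintptr_t)page & 7) == 0);

	for (i = 0; i < SKETCH_PAGE_SIZE / 8; i++)
		_mm_stream_si64(&p[i], 0);

	/* NT stores are weakly ordered; fence before the page is used. */
	_mm_sfence();
#else
	memset(page, 0, SKETCH_PAGE_SIZE);	/* portable fallback */
#endif
}
```

Both functions leave the page zeroed; only the second asks the CPU to
bypass the cache hierarchy on the way to memory, which is the behaviour
the counters above point to.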
The page-clearing BW shows a ~40% improvement. Additionally, a quick
'perf bench mem memset' comparison on AMD Naples (AMD EPYC 7551) shows
similar performance gains. So, enable X86_FEATURE_NT_GOOD for
AMD Zen.
Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/kernel/cpu/amd.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index dcc3d943c68f..c57eb6c28aa1 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -918,6 +918,9 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
 {
 	set_cpu_cap(c, X86_FEATURE_ZEN);
+	if (c->x86 == 0x17)
+		set_cpu_cap(c, X86_FEATURE_NT_GOOD);
+
 #ifdef CONFIG_NUMA
 	node_reclaim_distance = 32;
 #endif
--
2.9.3