2021-09-09 08:50:36

by Jinhua Wu

Subject: [PATCH] perf: optimize clear page in Intel specified model with movq instruction

Clearing a page is the most time-consuming step in page fault handling.
The kernel uses fast-string instructions to clear pages. We found that on
specific Intel models such as CPX and ICX, the movq instruction performs much
better than the fast-string instructions when the corresponding page is not
in cache, whereas fast-string performs better when the page is in cache.
The test results are shown below:

machine: Intel CPX

Allocated memory size    Page fault latency per 4K bytes
                         rep stosb           movq
---------------------    ----------------    ------------------
8MB                      2057.13ns           1338.38ns
64MB                     1850.71ns           1200.20ns
512MB                    1918.40ns           1196.91ns
4096MB                   1931.24ns           1189.41ns

This is an improvement of close to 40%, so we add a blacklist of specific
Intel models on which the movq instruction is used to clear pages.
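
As a quick check of the figures in the table above, the relative latency
reduction can be computed per row. This is only an illustrative sketch: the
struct and function names are made up for this example, and the numbers are
copied from the CPX table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One row of the CPX table: page fault latency per 4K page, in ns. */
struct latency_sample {
	const char *alloc_size;
	double stosb_ns;
	double movq_ns;
};

const struct latency_sample cpx_samples[] = {
	{ "8MB",    2057.13, 1338.38 },
	{ "64MB",   1850.71, 1200.20 },
	{ "512MB",  1918.40, 1196.91 },
	{ "4096MB", 1931.24, 1189.41 },
};

#define N_CPX_SAMPLES (sizeof(cpx_samples) / sizeof(cpx_samples[0]))

/* Percentage by which new_ns is lower than old_ns. */
double latency_reduction_pct(double old_ns, double new_ns)
{
	return (old_ns - new_ns) / old_ns * 100.0;
}

/* Print the reduction for every row of the table. */
void print_reductions(void)
{
	for (size_t i = 0; i < N_CPX_SAMPLES; i++)
		printf("%-8s %.1f%% lower latency with movq\n",
		       cpx_samples[i].alloc_size,
		       latency_reduction_pct(cpx_samples[i].stosb_ns,
					     cpx_samples[i].movq_ns));
}
```

Calling print_reductions() from a small main() lists the per-row reduction,
which makes it easy to compare the measured rows against the headline figure.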

Signed-off-by: Jinhua Wu <[email protected]>
Signed-off-by: Jiayu Ni <[email protected]>
Signed-off-by: Artie Ding <[email protected]>
---
 arch/x86/include/asm/page_64.h | 18 ++++++++++++------
 arch/x86/kernel/cpu/intel.c    | 22 ++++++++++++++++++++++
 arch/x86/mm/init.c             |  9 +++++++++
 3 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 4bde0dc..1fedfbe 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -7,6 +7,8 @@
 #ifndef __ASSEMBLY__
 #include <asm/alternative.h>

+#include <linux/jump_label.h>
+
 /* duplicated to the one in bootmem.h */
 extern unsigned long max_pfn;
 extern unsigned long phys_base;
@@ -43,15 +45,19 @@ static inline unsigned long __phys_addr_nodebug(unsigned long x)
 void clear_page_orig(void *page);
 void clear_page_rep(void *page);
 void clear_page_erms(void *page);
+extern struct static_key_false clear_page_movq_key;

 static inline void clear_page(void *page)
 {
-	alternative_call_2(clear_page_orig,
-			   clear_page_rep, X86_FEATURE_REP_GOOD,
-			   clear_page_erms, X86_FEATURE_ERMS,
-			   "=D" (page),
-			   "0" (page)
-			   : "cc", "memory", "rax", "rcx");
+	if (static_branch_unlikely(&clear_page_movq_key))
+		clear_page_orig(page);
+	else
+		alternative_call_2(clear_page_orig,
+				   clear_page_rep, X86_FEATURE_REP_GOOD,
+				   clear_page_erms, X86_FEATURE_ERMS,
+				   "=D" (page),
+				   "0" (page)
+				   : "cc", "memory", "rax", "rcx");
 }

 void copy_page(void *to, void *from);
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 8321c43..3366da0 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -38,6 +38,28 @@
 #include <asm/apic.h>
 #endif

+/* Optimize clear page with movq in specific Intel CPU */
+#include <asm/cpu_device_id.h>
+#include <linux/jump_label.h>
+
+DEFINE_STATIC_KEY_FALSE(clear_page_movq_key);
+EXPORT_SYMBOL_GPL(clear_page_movq_key);
+
+extern const struct x86_cpu_id *x86_match_cpu(const struct x86_cpu_id *match);
+
+const struct x86_cpu_id faststring_blacklist_match[] __initconst = {
+	X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
+	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_XEON_D, 0)
+};
+
+void enable_clear_page_movq(void)
+{
+	if (x86_match_cpu(faststring_blacklist_match))
+		static_branch_enable(&clear_page_movq_key);
+}
+
+
 enum split_lock_detect_state {
 	sld_off = 0,
 	sld_warn,
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 23a14d8..480c189 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -28,6 +28,12 @@
 #include <asm/memtype.h>

 /*
+ * Optimize clear page with movq in specific Intel CPU
+ * Definition in intel.c
+ */
+extern void enable_clear_page_movq(void);
+
+/*
  * We need to define the tracepoints somewhere, and tlb.c
  * is only compiled when SMP=y.
  */
@@ -775,6 +781,9 @@ void __init init_mem_mapping(void)

 	x86_init.hyper.init_mem_mapping();

+	/* Optimize clear page with mov in specific Intel CPU */
+	enable_clear_page_movq();
+
 	early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
 }

--
1.8.3.1
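
For readers who want to see the two clearing strategies side by side, the
following is a minimal user-space sketch, not the kernel implementation. The
function names are hypothetical, the 4096-byte page size is assumed, and the
"rep stosb" path is only used on x86-64 with GCC or Clang (other targets fall
back to memset). It shows only what each strategy does; it does not reproduce
the cold-cache page fault measurements from the patch description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u	/* assumed page size for this sketch */

/*
 * Clear one page with plain 64-bit stores (the "movq" style).
 * The pointer must be 8-byte aligned; volatile keeps the compiler
 * from collapsing the loop into a memset call.
 */
void clear_page_movq_model(void *page)
{
	volatile uint64_t *p = page;

	for (size_t i = 0; i < MODEL_PAGE_SIZE / sizeof(uint64_t); i++)
		p[i] = 0;
}

/* Clear one page with the fast-string instruction where available. */
void clear_page_stosb_model(void *page)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
	size_t count = MODEL_PAGE_SIZE;

	/* rep stosb stores AL to [RDI], RCX times. */
	__asm__ volatile("rep stosb"
			 : "+D" (page), "+c" (count)
			 : "a" (0)
			 : "memory");
#else
	memset(page, 0, MODEL_PAGE_SIZE);
#endif
}

/* Return 1 if every byte of the page is zero, 0 otherwise. */
int page_is_zero(const void *page)
{
	const unsigned char *p = page;

	for (size_t i = 0; i < MODEL_PAGE_SIZE; i++)
		if (p[i] != 0)
			return 0;
	return 1;
}
```

Both functions can be called on a malloc()ed 4096-byte buffer and checked with
page_is_zero(). Any timing comparison would additionally need buffers that are
not already in cache, since the patch description reports that the result
depends on whether the page is cached.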


2021-09-09 09:41:50

by Borislav Petkov

Subject: Re: [PATCH] perf: optimize clear page in Intel specified model with movq instruction

On Thu, Sep 09, 2021 at 04:45:51PM +0800, Jinhua Wu wrote:
> Clearing a page is the most time-consuming step in page fault handling.
> The kernel uses fast-string instructions to clear pages. We found that on
> specific Intel models such as CPX and ICX, the movq instruction performs much
> better than the fast-string instructions when the corresponding page is not
> in cache, whereas fast-string performs better when the page is in cache.
> The test results are shown below:

What you should do is show the extensive tests you've run with
real-world benchmarks where you really can show 40% performance
improvement.

Also, the static branch "approach" you're using ain't gonna happen. If
anything, another X86_FEATURE_* bit.

Good luck.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-09-09 10:36:47

by Luming Yu

Subject: Re: [PATCH] perf: optimize clear page in Intel specified model with movq instruction

On Thu, Sep 9, 2021 at 5:41 PM Borislav Petkov <[email protected]> wrote:
>
> On Thu, Sep 09, 2021 at 04:45:51PM +0800, Jinhua Wu wrote:
> > Clearing a page is the most time-consuming step in page fault handling.
> > The kernel uses fast-string instructions to clear pages. We found that on
> > specific Intel models such as CPX and ICX, the movq instruction performs much
> > better than the fast-string instructions when the corresponding page is not
> > in cache, whereas fast-string performs better when the page is in cache.
> > The test results are shown below:
>
> What you should do is show the extensive tests you've run with
> real-world benchmarks where you really can show 40% performance
> improvement.
>
> Also, the static branch "approach" you're using ain't gonna happen. If
> anything, another X86_FEATURE_* bit.

Do you mean the jump label would not be replaced with a nop when its key is
enabled, so we could not use it in certain functions?
I don't understand exactly what "ain't gonna happen" means.
>
> Good luck.
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

2021-09-09 10:46:34

by Borislav Petkov

Subject: Re: [PATCH] perf: optimize clear page in Intel specified model with movq instruction

On Thu, Sep 09, 2021 at 06:34:40PM +0800, Luming Yu wrote:
> Do you mean the jump label would not be replaced with a nop when its key is
> enabled, so we could not use it in certain functions?
> I don't understand exactly what "ain't gonna happen" means.

It means, you need to use an X86_FEATURE_ bit because I won't accept a
static key.

But do the benchmarks first.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-09-09 11:22:39

by Peter Zijlstra

Subject: Re: [PATCH] perf: optimize clear page in Intel specified model with movq instruction

On Thu, Sep 09, 2021 at 06:34:40PM +0800, Luming Yu wrote:

> Do you mean the jump label would not be replaced with a nop when its key is
> enabled, so we could not use it in certain functions?

But why add a jump label when you can make that alternative DTRT ?
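
One way to picture the review feedback is as a single selection step driven by
feature bits, rather than a separate runtime branch in front of the existing
alternative. The sketch below is a user-space model only, not kernel code.
FEAT_REP_GOOD and FEAT_ERMS mirror the real X86_FEATURE_REP_GOOD and
X86_FEATURE_ERMS flags, while FEAT_CLEAR_PAGE_MOVQ is a hypothetical bit
invented for this example; the selector models the "later feature wins"
priority of alternative_call_2 and assumes the new bit would take precedence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_BYTES 4096u	/* assumed page size for this sketch */

enum {
	FEAT_REP_GOOD        = 1u << 0,	/* models X86_FEATURE_REP_GOOD */
	FEAT_ERMS            = 1u << 1,	/* models X86_FEATURE_ERMS */
	FEAT_CLEAR_PAGE_MOVQ = 1u << 2,	/* hypothetical, not an existing flag */
};

typedef void (*clear_page_fn)(void *page);

/* Stand-in for clear_page_orig: 64-bit stores, page must be 8-byte aligned. */
void clear_page_orig_model(void *page)
{
	volatile uint64_t *p = page;

	for (size_t i = 0; i < MODEL_PAGE_BYTES / sizeof(uint64_t); i++)
		p[i] = 0;
}

/* Stand-in for clear_page_rep. */
void clear_page_rep_model(void *page)
{
	memset(page, 0, MODEL_PAGE_BYTES);
}

/* Stand-in for clear_page_erms. */
void clear_page_erms_model(void *page)
{
	volatile unsigned char *p = page;

	for (size_t i = 0; i < MODEL_PAGE_BYTES; i++)
		p[i] = 0;
}

/*
 * Pick the implementation once from the feature bits. The highest
 * priority feature that is set decides; with no bits set, the
 * original implementation is used.
 */
clear_page_fn select_clear_page(unsigned int features)
{
	if (features & FEAT_CLEAR_PAGE_MOVQ)
		return clear_page_orig_model;
	if (features & FEAT_ERMS)
		return clear_page_erms_model;
	if (features & FEAT_REP_GOOD)
		return clear_page_rep_model;
	return clear_page_orig_model;
}
```

In this model the affected CPU models would simply have the extra bit set at
boot, and callers would see one dispatch point instead of a branch plus an
alternative. How that would map onto the real alternatives machinery is left
open here, as it is in the thread.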