2021-10-20 17:07:05

by Ankur Arora

Subject: [PATCH v2 00/14] Use uncached stores while clearing huge pages

This series adds support for uncached page clearing for huge and
gigantic pages. The motivation is to speed up creation of large
prealloc'd VMs backed by huge/gigantic pages.

Uncached page clearing helps in two ways:
- Faster than the cached path for sizes > O(LLC-size)
- Avoids replacing potentially useful cache-lines with useless
zeroes

Performance: with this series, creation of VMs with prealloc'd 2MB
backing pages sees significant runtime improvements:


AMD Milan, sz=1550 GB, runs=3          BW        stdev      diff
                                    ----------   ------    --------
baseline   (clear_page_erms)         8.05 GBps    0.08
CLZERO     (clear_page_clzero)      29.94 GBps    0.31     +271.92%

(Creation time for this 1550 GB VM goes from 192.6s to 51.7s.)


Intel Icelake, sz=200 GB, runs=3       BW        stdev      diff
                                    ----------   ------    ---------
baseline   (clear_page_erms)         8.25 GBps    0.05
MOVNT      (clear_page_movnt)       21.55 GBps    0.31     +161.21%

(Creation time for this 200 GB VM goes from 25.2s to 9.3s.)

Additionally, on the AMD Milan system, a kernel-build test with a
background page-clearing job sees a ~5% improvement in runtime with
uncached clearing vs cached.
A similar test on the Intel Icelake system shows an improvement in
cache-miss rates but no overall improvement in runtime.


With the motivation out of the way, the following notes describe how
v2 addresses some review comments from v1 (and other sticking points
that series of this nature have run into over the years):

1. Uncached stores (via MOVNT, CLZERO on x86) are weakly ordered with
respect to the cache hierarchy and, unless combined with an
appropriate fence, are unsafe to use.

Patch 6, "sparse: add address_space __incoherent" adds a new sparse
address_space: __incoherent.

Patch 7, "x86/clear_page: add clear_page_uncached()" defines:
void clear_page_uncached(__incoherent void *)
and the corresponding flush is exposed as:
void clear_page_uncached_make_coherent(void)

This ensures that an incorrect or missing address_space results in
a warning from sparse (and KTP).
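
As a minimal sketch of the resulting contract (mirroring
clear_subpage_uncached() in patch 10; names as defined in patches 6
and 7):

    __incoherent void *kaddr = (__incoherent void *)kmap_atomic(page);

    clear_page_uncached(kaddr);   /* passing a plain void * here trips sparse */
    kunmap_atomic((__force void *)kaddr);

The coherence (fence) side of the contract is point 2 below.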

2. Page clearing needs to be ordered before any PTE writes related to
the cleared extent (i.e. before SetPageUptodate()). For the uncached
path, this means that we need a store fence before the PTE write.

The cost of the fence is microarchitecture dependent but, from my
measurements, it remains noticeable even when amortized down to one
fence per 32KB cleared. This limits us to huge/gigantic pages on x86.

The logic handling this is in patch 10, "clear_huge_page: use
uncached path".

3. Uncached stores are generally slower than cached ones for extents
smaller than LLC-size, and faster for larger ones.

This means that choosing the uncached path for too small an extent
causes performance regressions, while keeping the threshold too high
gives up some of the possible speedup.

Patches 8 and 9, "mm/clear_page: add clear_page_uncached_threshold()",
"x86/clear_page: add arch_clear_page_uncached_threshold()", set up an
arch-specific threshold. For architectures that don't specify one, a
default value of 8MB is used.
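
To put the default in perspective (assuming 4KB base pages):

    threshold:              8MB  ->   2048 base pages
    2MB huge page:          2MB  ->    512 base pages  (below: cached clear)
    1GB gigantic page:      1GB  -> 262144 base pages  (above: uncached clear)

So a single 2MB page on its own never crosses the threshold, which is
where the hint described next comes in.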

However, a single call to clear_huge_page() or get_/pin_user_pages()
only sees a small portion of the extent being cleared in each
iteration. To make sure we choose uncached stores when working with
large extents, patch 11, "gup: add FOLL_HINT_BULK,
FAULT_FLAG_UNCACHED", adds a new flag that gup users can use for
this purpose. Patch 13, "vfio_iommu_type1: specify FOLL_HINT_BULK
to pin_user_pages()", uses it when pinning process memory for
passthrough PCIe devices.

The get_user_pages() logic to handle these flags is in patch 12,
"gup: use uncached path when clearing large regions".

4. Point (3) above (uncached stores are faster for extents larger than
LLC-size) is generally true, with a side of Brownian motion thrown
in. For instance, MOVNTI (for > LLC-size) performs well on Broadwell
and Ice Lake, but not on Skylake -- which sits in between the two.

To deal with this, we use Ingo's "trust but verify" suggestion
(https://lore.kernel.org/lkml/[email protected]/),
where we enable MOVNT by default and only disable it on
microarchitectures where it is known to be slow.
If the uncached path ends up being part of the kernel, hopefully
these regressions would show up early enough in chip testing.

Patch 5, "x86/cpuid: add X86_FEATURE_MOVNT_SLOW" adds this logic
and patch 14, "set X86_FEATURE_MOVNT_SLOW for Skylake" disables
the uncached path for Skylake.

Performance numbers are in patch 12, "gup: use uncached path when
clearing large regions."

Also at:
github.com/terminus/linux clear-page-uncached.upstream-v2

Please review.

Changelog:

v1: (https://lore.kernel.org/lkml/[email protected]/)
- Make the unsafe nature of clear_page_uncached() more obvious.
- Invert X86_FEATURE_NT_GOOD to X86_FEATURE_MOVNT_SLOW, so we don't have
to explicitly enable it for every new model.
- Add GUP path (and appropriate threshold) to allow the uncached path
to be used for huge pages
- Made the code more generic so it's tied to fewer x86 specific assumptions

Thanks
Ankur

Ankur Arora (14):
x86/asm: add memset_movnti()
perf bench: add memset_movnti()
x86/asm: add uncached page clearing
x86/asm: add clzero based page clearing
x86/cpuid: add X86_FEATURE_MOVNT_SLOW
sparse: add address_space __incoherent
x86/clear_page: add clear_page_uncached()
mm/clear_page: add clear_page_uncached_threshold()
x86/clear_page: add arch_clear_page_uncached_threshold()
clear_huge_page: use uncached path
gup: add FOLL_HINT_BULK, FAULT_FLAG_UNCACHED
gup: use uncached path when clearing large regions
vfio_iommu_type1: specify FOLL_HINT_BULK to pin_user_pages()
x86/cpu/intel: set X86_FEATURE_MOVNT_SLOW for Skylake

arch/x86/include/asm/cacheinfo.h | 1 +
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/page.h | 10 +++
arch/x86/include/asm/page_32.h | 9 +++
arch/x86/include/asm/page_64.h | 34 +++++++++
arch/x86/kernel/cpu/amd.c | 2 +
arch/x86/kernel/cpu/bugs.c | 30 ++++++++
arch/x86/kernel/cpu/cacheinfo.c | 13 ++++
arch/x86/kernel/cpu/cpu.h | 2 +
arch/x86/kernel/cpu/intel.c | 1 +
arch/x86/kernel/setup.c | 6 ++
arch/x86/lib/clear_page_64.S | 45 ++++++++++++
arch/x86/lib/memset_64.S | 68 ++++++++++--------
drivers/vfio/vfio_iommu_type1.c | 3 +
fs/hugetlbfs/inode.c | 7 +-
include/linux/compiler_types.h | 2 +
include/linux/mm.h | 38 +++++++++-
mm/gup.c | 20 ++++++
mm/huge_memory.c | 3 +-
mm/hugetlb.c | 10 ++-
mm/memory.c | 76 ++++++++++++++++++--
tools/arch/x86/lib/memset_64.S | 68 ++++++++++--------
tools/perf/bench/mem-memset-x86-64-asm-def.h | 6 +-
23 files changed, 386 insertions(+), 69 deletions(-)

--
2.29.2


2021-10-20 17:07:17

by Ankur Arora

Subject: [PATCH v2 02/14] perf bench: add memset_movnti()

Clone memset_movnti() from arch/x86/lib/memset_64.S.

perf bench mem memset -f x86-64-movnt on Intel Icelake-X, AMD Milan:

# Intel Icelake-X

$ for i in 8 32 128 512; do
perf bench mem memset -f x86-64-movnt -s ${i}MB -l 5
done

# Output pruned.
# Running 'mem/memset' benchmark:
# function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
# Copying 8MB bytes ...
12.896170 GB/sec
# Copying 32MB bytes ...
15.879065 GB/sec
# Copying 128MB bytes ...
20.813214 GB/sec
# Copying 512MB bytes ...
24.190817 GB/sec

# AMD Milan

$ for i in 8 32 128 512; do
perf bench mem memset -f x86-64-movnt -s ${i}MB -l 5
done

# Output pruned.
# Running 'mem/memset' benchmark:
# function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
# Copying 8MB bytes ...
22.372566 GB/sec
# Copying 32MB bytes ...
22.507923 GB/sec
# Copying 128MB bytes ...
22.492532 GB/sec
# Copying 512MB bytes ...
22.434603 GB/sec

Signed-off-by: Ankur Arora <[email protected]>
---
tools/arch/x86/lib/memset_64.S | 68 +++++++++++---------
tools/perf/bench/mem-memset-x86-64-asm-def.h | 6 +-
2 files changed, 43 insertions(+), 31 deletions(-)

diff --git a/tools/arch/x86/lib/memset_64.S b/tools/arch/x86/lib/memset_64.S
index 9827ae267f96..ef2a091563d9 100644
--- a/tools/arch/x86/lib/memset_64.S
+++ b/tools/arch/x86/lib/memset_64.S
@@ -25,7 +25,7 @@ SYM_FUNC_START(__memset)
*
* Otherwise, use original memset function.
*/
- ALTERNATIVE_2 "jmp memset_orig", "", X86_FEATURE_REP_GOOD, \
+ ALTERNATIVE_2 "jmp memset_movq", "", X86_FEATURE_REP_GOOD, \
"jmp memset_erms", X86_FEATURE_ERMS

movq %rdi,%r9
@@ -66,7 +66,8 @@ SYM_FUNC_START_LOCAL(memset_erms)
ret
SYM_FUNC_END(memset_erms)

-SYM_FUNC_START_LOCAL(memset_orig)
+.macro MEMSET_MOV OP fence
+SYM_FUNC_START_LOCAL(memset_\OP)
movq %rdi,%r10

/* expand byte value */
@@ -77,64 +78,71 @@ SYM_FUNC_START_LOCAL(memset_orig)
/* align dst */
movl %edi,%r9d
andl $7,%r9d
- jnz .Lbad_alignment
-.Lafter_bad_alignment:
+ jnz .Lbad_alignment_\@
+.Lafter_bad_alignment_\@:

movq %rdx,%rcx
shrq $6,%rcx
- jz .Lhandle_tail
+ jz .Lhandle_tail_\@

.p2align 4
-.Lloop_64:
+.Lloop_64_\@:
decq %rcx
- movq %rax,(%rdi)
- movq %rax,8(%rdi)
- movq %rax,16(%rdi)
- movq %rax,24(%rdi)
- movq %rax,32(%rdi)
- movq %rax,40(%rdi)
- movq %rax,48(%rdi)
- movq %rax,56(%rdi)
+ \OP %rax,(%rdi)
+ \OP %rax,8(%rdi)
+ \OP %rax,16(%rdi)
+ \OP %rax,24(%rdi)
+ \OP %rax,32(%rdi)
+ \OP %rax,40(%rdi)
+ \OP %rax,48(%rdi)
+ \OP %rax,56(%rdi)
leaq 64(%rdi),%rdi
- jnz .Lloop_64
+ jnz .Lloop_64_\@

/* Handle tail in loops. The loops should be faster than hard
to predict jump tables. */
.p2align 4
-.Lhandle_tail:
+.Lhandle_tail_\@:
movl %edx,%ecx
andl $63&(~7),%ecx
- jz .Lhandle_7
+ jz .Lhandle_7_\@
shrl $3,%ecx
.p2align 4
-.Lloop_8:
+.Lloop_8_\@:
decl %ecx
- movq %rax,(%rdi)
+ \OP %rax,(%rdi)
leaq 8(%rdi),%rdi
- jnz .Lloop_8
+ jnz .Lloop_8_\@

-.Lhandle_7:
+.Lhandle_7_\@:
andl $7,%edx
- jz .Lende
+ jz .Lende_\@
.p2align 4
-.Lloop_1:
+.Lloop_1_\@:
decl %edx
movb %al,(%rdi)
leaq 1(%rdi),%rdi
- jnz .Lloop_1
+ jnz .Lloop_1_\@

-.Lende:
+.Lende_\@:
+ .if \fence
+ sfence
+ .endif
movq %r10,%rax
ret

-.Lbad_alignment:
+.Lbad_alignment_\@:
cmpq $7,%rdx
- jbe .Lhandle_7
+ jbe .Lhandle_7_\@
movq %rax,(%rdi) /* unaligned store */
movq $8,%r8
subq %r9,%r8
addq %r8,%rdi
subq %r8,%rdx
- jmp .Lafter_bad_alignment
-.Lfinal:
-SYM_FUNC_END(memset_orig)
+ jmp .Lafter_bad_alignment_\@
+.Lfinal_\@:
+SYM_FUNC_END(memset_\OP)
+.endm
+
+MEMSET_MOV OP=movq fence=0
+MEMSET_MOV OP=movnti fence=1
diff --git a/tools/perf/bench/mem-memset-x86-64-asm-def.h b/tools/perf/bench/mem-memset-x86-64-asm-def.h
index dac6d2b7c39b..53ead7f91313 100644
--- a/tools/perf/bench/mem-memset-x86-64-asm-def.h
+++ b/tools/perf/bench/mem-memset-x86-64-asm-def.h
@@ -1,6 +1,6 @@
/* SPDX-License-Identifier: GPL-2.0 */

-MEMSET_FN(memset_orig,
+MEMSET_FN(memset_movq,
"x86-64-unrolled",
"unrolled memset() in arch/x86/lib/memset_64.S")

@@ -11,3 +11,7 @@ MEMSET_FN(__memset,
MEMSET_FN(memset_erms,
"x86-64-stosb",
"movsb-based memset() in arch/x86/lib/memset_64.S")
+
+MEMSET_FN(memset_movnti,
+ "x86-64-movnt",
+ "movnt-based memset() in arch/x86/lib/memset_64.S")
--
2.29.2

2021-10-20 17:07:23

by Ankur Arora

Subject: [PATCH v2 04/14] x86/asm: add clzero based page clearing

Add clear_page_clzero(), which uses CLZERO as the underlying primitive.
CLZERO bypasses the cache hierarchy, so this provides a non-polluting
implementation of clear_page(). Available if X86_FEATURE_CLZERO is set.

CLZERO, from the AMD architecture guide (Vol 3, Rev 3.30):
"Clears the cache line specified by the logical address in rAX by
writing a zero to every byte in the line. The instruction uses an
implied non temporal memory type, similar to a streaming store, and
uses the write combining protocol to minimize cache pollution.

CLZERO is weakly-ordered with respect to other instructions that
operate on memory. Software should use an SFENCE or stronger to
enforce memory ordering of CLZERO with respect to other store
instructions.

The CLZERO instruction executes at any privilege level. CLZERO
performs all the segmentation and paging checks that a store of
the specified cache line would perform."

The use-case is similar to clear_page_movnt(), except that
clear_page_clzero() is expected to be more performant.
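
For reference, the expected usage pattern is roughly the following
sketch (a later patch wraps the fence in
clear_page_uncached_make_coherent()):

    for (i = 0; i < nr_pages; i++)
        clear_page_clzero(page_address(page + i));

    /* CLZERO stores are weakly ordered: fence before the page is used. */
    wmb();    /* SFENCE on x86 */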

Cc: [email protected]
Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/include/asm/page_64.h | 1 +
arch/x86/lib/clear_page_64.S | 19 +++++++++++++++++++
2 files changed, 20 insertions(+)

diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index cfb95069cf9e..3c53f8ef8818 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -44,6 +44,7 @@ void clear_page_orig(void *page);
void clear_page_rep(void *page);
void clear_page_erms(void *page);
void clear_page_movnt(void *page);
+void clear_page_clzero(void *page);

static inline void clear_page(void *page)
{
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index 578f40db0716..1cb29a4454e1 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -76,3 +76,22 @@ SYM_FUNC_START(clear_page_movnt)
ja .Lstart
ret
SYM_FUNC_END(clear_page_movnt)
+
+/*
+ * Zero a page using clzero (on AMD.)
+ * %rdi - page
+ *
+ * Caller needs to issue a sfence at the end.
+ */
+SYM_FUNC_START(clear_page_clzero)
+ movl $4096,%ecx
+ movq %rdi,%rax
+
+ .p2align 4
+.Liter:
+ clzero
+ addq $0x40, %rax
+ sub $0x40, %ecx
+ ja .Liter
+ ret
+SYM_FUNC_END(clear_page_clzero)
--
2.29.2

2021-10-20 17:07:24

by Ankur Arora

Subject: [PATCH v2 05/14] x86/cpuid: add X86_FEATURE_MOVNT_SLOW

Add X86_FEATURE_MOVNT_SLOW, to be set on microarchitectures where
MOVNT is slower for bulk page clearing than the standard cached
clear_page() idiom.

Also add check_movnt_quirks(), where this bit will be set for
affected models.

Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/kernel/cpu/amd.c | 2 ++
arch/x86/kernel/cpu/bugs.c | 15 +++++++++++++++
arch/x86/kernel/cpu/cpu.h | 2 ++
arch/x86/kernel/cpu/intel.c | 1 +
5 files changed, 21 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index d0ce5cfd3ac1..69191f175c2c 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -294,6 +294,7 @@
#define X86_FEATURE_PER_THREAD_MBA (11*32+ 7) /* "" Per-thread Memory Bandwidth Allocation */
#define X86_FEATURE_SGX1 (11*32+ 8) /* "" Basic SGX */
#define X86_FEATURE_SGX2 (11*32+ 9) /* "" SGX Enclave Dynamic Memory Management (EDMM) */
+#define X86_FEATURE_MOVNT_SLOW (11*32+10) /* MOVNT is slow. (see check_movnt_quirks()) */

/* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
#define X86_FEATURE_AVX_VNNI (12*32+ 4) /* AVX VNNI instructions */
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 2131af9f2fa2..5de83c6fe526 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -915,6 +915,8 @@ static void init_amd(struct cpuinfo_x86 *c)
if (c->x86 >= 0x10)
set_cpu_cap(c, X86_FEATURE_REP_GOOD);

+ check_movnt_quirks(c);
+
/* get apicid instead of initial apic id from cpuid */
c->apicid = hard_smp_processor_id();

diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index ecfca3bbcd96..4e1558d22a5f 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -84,6 +84,21 @@ EXPORT_SYMBOL_GPL(mds_idle_clear);
*/
DEFINE_STATIC_KEY_FALSE(switch_mm_cond_l1d_flush);

+void check_movnt_quirks(struct cpuinfo_x86 *c)
+{
+#ifdef CONFIG_X86_64
+ /*
+ * Check if MOVNT is slower than the model specific clear_page()
+ * idiom (movq/rep-stosb/rep-stosq etc) for bulk page clearing.
+ * (Bulk is defined here as LLC-sized or larger.)
+ *
+ * Condition this check on CONFIG_X86_64 so we don't have
+ * to worry about any CONFIG_X86_32 families that don't
+ * support SSE2/MOVNT.
+ */
+#endif /* CONFIG_X86_64*/
+}
+
void __init check_bugs(void)
{
identify_boot_cpu();
diff --git a/arch/x86/kernel/cpu/cpu.h b/arch/x86/kernel/cpu/cpu.h
index 95521302630d..72e3715d63ea 100644
--- a/arch/x86/kernel/cpu/cpu.h
+++ b/arch/x86/kernel/cpu/cpu.h
@@ -83,4 +83,6 @@ extern void update_srbds_msr(void);

extern u64 x86_read_arch_cap_msr(void);

+void check_movnt_quirks(struct cpuinfo_x86 *c);
+
#endif /* ARCH_X86_CPU_H */
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 8321c43554a1..36a2f8e88b74 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -666,6 +666,7 @@ static void init_intel(struct cpuinfo_x86 *c)
c->x86_cache_alignment = c->x86_clflush_size * 2;
if (c->x86 == 6)
set_cpu_cap(c, X86_FEATURE_REP_GOOD);
+ check_movnt_quirks(c);
#else
/*
* Names for the Pentium II/Celeron processors
--
2.29.2

2021-10-20 17:07:47

by Ankur Arora

Subject: [PATCH v2 07/14] x86/clear_page: add clear_page_uncached()

Expose the low-level uncached primitives (clear_page_movnt(),
clear_page_clzero()) as alternatives via clear_page_uncached().
Also fall back to clear_page() if X86_FEATURE_MOVNT_SLOW is set
and the CPU does not have X86_FEATURE_CLZERO.

Both uncached primitives use stores which are weakly ordered
with respect to other instructions accessing the memory hierarchy.
To ensure that callers don't mix accesses to different types of
address_spaces, annotate clear_user_page_uncached() and
clear_page_uncached() as taking __incoherent pointers as arguments.

Also add clear_page_uncached_make_coherent(), which provides the
necessary store fence to flush out the uncached regions.
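
For clarity, the alternatives below resolve roughly to this C-level
dispatch (a sketch; the actual selection is patched in at boot via
alternative_call_2()/alternative_2()):

    if (boot_cpu_has(X86_FEATURE_CLZERO))
        clear_page_clzero(page);    /* make_coherent() issues an sfence */
    else if (boot_cpu_has(X86_FEATURE_MOVNT_SLOW))
        clear_page(page);           /* cached; make_coherent() is a no-op */
    else
        clear_page_movnt(page);     /* make_coherent() issues an sfence */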

Signed-off-by: Ankur Arora <[email protected]>
---

Notes:
This patch adds the fallback definitions of clear_user_page_uncached()
etc in include/linux/mm.h which is likely not the right place for it.

I'm guessing these should be moved to include/asm-generic/page.h
(or maybe a new include/asm-generic/page_uncached.h) and for
architectures that do have arch/$arch/include/asm/page.h (which
seems like all of them), also replicate there?

Anyway, wanted to first check if that's the way to do it, before
doing that.

arch/x86/include/asm/page.h | 10 ++++++++++
arch/x86/include/asm/page_32.h | 9 +++++++++
arch/x86/include/asm/page_64.h | 32 ++++++++++++++++++++++++++++++++
include/linux/mm.h | 14 ++++++++++++++
4 files changed, 65 insertions(+)

diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
index 94dbd51df58f..163be03ac422 100644
--- a/arch/x86/include/asm/page_32.h
+++ b/arch/x86/include/asm/page_32.h
@@ -39,6 +39,15 @@ static inline void clear_page(void *page)
memset(page, 0, PAGE_SIZE);
}

+static inline void clear_page_uncached(__incoherent void *page)
+{
+ clear_page((__force void *) page);
+}
+
+static inline void clear_page_uncached_make_coherent(void)
+{
+}
+
static inline void copy_page(void *to, void *from)
{
memcpy(to, from, PAGE_SIZE);
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 3c53f8ef8818..d7946047c70f 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -56,6 +56,38 @@ static inline void clear_page(void *page)
: "cc", "memory", "rax", "rcx");
}

+/*
+ * clear_page_uncached: only allowed on __incoherent memory regions.
+ */
+static inline void clear_page_uncached(__incoherent void *page)
+{
+ alternative_call_2(clear_page_movnt,
+ clear_page, X86_FEATURE_MOVNT_SLOW,
+ clear_page_clzero, X86_FEATURE_CLZERO,
+ "=D" (page),
+ "0" (page)
+ : "cc", "memory", "rax", "rcx");
+}
+
+/*
+ * clear_page_uncached_make_coherent: executes the necessary store
+ * fence after which __incoherent regions can be safely accessed.
+ */
+static inline void clear_page_uncached_make_coherent(void)
+{
+ /*
+ * Keep the sfence for oldinstr and clzero separate to guard against
+ * the possibility that a cpu-model both has X86_FEATURE_MOVNT_SLOW
+ * and X86_FEATURE_CLZERO.
+ *
+ * The alternatives need to be in the same order as the ones
+ * in clear_page_uncached().
+ */
+ alternative_2("sfence",
+ "", X86_FEATURE_MOVNT_SLOW,
+ "sfence", X86_FEATURE_CLZERO);
+}
+
void copy_page(void *to, void *from);

#ifdef CONFIG_X86_5LEVEL
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 73a52aba448f..b88069d1116c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3192,6 +3192,20 @@ static inline bool vma_is_special_huge(const struct vm_area_struct *vma)

#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */

+#ifndef clear_user_page_uncached
+/*
+ * clear_user_page_uncached: fallback to the standard clear_user_page().
+ */
+static inline void clear_user_page_uncached(__incoherent void *page,
+ unsigned long vaddr, struct page *pg)
+{
+ clear_user_page((__force void *)page, vaddr, pg);
+}
+
+static inline void clear_page_uncached_make_coherent(void) { }
+#endif
+
+
#ifdef CONFIG_DEBUG_PAGEALLOC
extern unsigned int _debug_guardpage_minorder;
DECLARE_STATIC_KEY_FALSE(_debug_guardpage_enabled);
--
2.29.2

2021-10-20 17:07:58

by Ankur Arora

Subject: [PATCH v2 08/14] mm/clear_page: add clear_page_uncached_threshold()

Introduce clear_page_uncached_threshold, which provides the threshold
above which clear_page_uncached() is used.

The ideal threshold value depends on the CPU architecture and on where
the performance curves for cached and uncached stores intersect.
Typically this would depend on microarchitectural details and the LLC
size. Here, we choose a default of 8MB (CLEAR_PAGE_UNCACHED_THRESHOLD),
which corresponds to a reasonably sized LLC.

Also define clear_page_prefer_uncached() which provides the user
interface to query this.

Signed-off-by: Ankur Arora <[email protected]>
---
include/linux/mm.h | 18 ++++++++++++++++++
mm/memory.c | 30 ++++++++++++++++++++++++++++++
2 files changed, 48 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b88069d1116c..49a97f817eb2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3190,6 +3190,24 @@ static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)));
}

+/*
+ * Default size beyond which huge page clearing uses the uncached
+ * path. We size it for a reasonably sized LLC.
+ */
+#define CLEAR_PAGE_UNCACHED_THRESHOLD (8 << 20)
+
+/*
+ * Arch specific code can define arch_clear_page_uncached_threshold()
+ * to override CLEAR_PAGE_UNCACHED_THRESHOLD with a machine specific value.
+ */
+extern unsigned long __init arch_clear_page_uncached_threshold(void);
+
+extern bool clear_page_prefer_uncached(unsigned long extent);
+#else
+static inline bool clear_page_prefer_uncached(unsigned long extent)
+{
+ return false;
+}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */

#ifndef clear_user_page_uncached
diff --git a/mm/memory.c b/mm/memory.c
index adf9b9ef8277..9f6059520985 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5266,6 +5266,36 @@ EXPORT_SYMBOL(__might_fault);
#endif

#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+
+static unsigned long __read_mostly clear_page_uncached_threshold =
+ CLEAR_PAGE_UNCACHED_THRESHOLD;
+
+/* Arch code can override for a machine specific value. */
+unsigned long __weak __init arch_clear_page_uncached_threshold(void)
+{
+ return CLEAR_PAGE_UNCACHED_THRESHOLD;
+}
+
+static int __init setup_clear_page_uncached_threshold(void)
+{
+ clear_page_uncached_threshold =
+ arch_clear_page_uncached_threshold() / PAGE_SIZE;
+ return 0;
+}
+
+/*
+ * cacheinfo is setup via device_initcall and we want to get set after
+ * that. Use the default value until then.
+ */
+late_initcall(setup_clear_page_uncached_threshold);
+
+bool clear_page_prefer_uncached(unsigned long extent)
+{
+ unsigned long pages = extent / PAGE_SIZE;
+
+ return pages >= clear_page_uncached_threshold;
+}
+
/*
* Process all subpages of the specified huge page with the specified
* operation. The target subpage will be processed last to keep its
--
2.29.2

2021-10-20 17:08:12

by Ankur Arora

Subject: [PATCH v2 09/14] x86/clear_page: add arch_clear_page_uncached_threshold()

Add arch_clear_page_uncached_threshold() to provide a machine-specific
value above which clear_page_uncached() would be used.

The ideal threshold value depends on the CPU model and where the
performance curves for cached and uncached stores intersect.
A safe value is LLC-size, so we use that of the boot_cpu.

Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/include/asm/cacheinfo.h | 1 +
arch/x86/kernel/cpu/cacheinfo.c | 13 +++++++++++++
arch/x86/kernel/setup.c | 6 ++++++
3 files changed, 20 insertions(+)

diff --git a/arch/x86/include/asm/cacheinfo.h b/arch/x86/include/asm/cacheinfo.h
index 86b2e0dcc4bf..5c6045699e94 100644
--- a/arch/x86/include/asm/cacheinfo.h
+++ b/arch/x86/include/asm/cacheinfo.h
@@ -4,5 +4,6 @@

void cacheinfo_amd_init_llc_id(struct cpuinfo_x86 *c, int cpu);
void cacheinfo_hygon_init_llc_id(struct cpuinfo_x86 *c, int cpu);
+int cacheinfo_lookup_max_size(int cpu);

#endif /* _ASM_X86_CACHEINFO_H */
diff --git a/arch/x86/kernel/cpu/cacheinfo.c b/arch/x86/kernel/cpu/cacheinfo.c
index b5e36bd0425b..6c34fc22d9ae 100644
--- a/arch/x86/kernel/cpu/cacheinfo.c
+++ b/arch/x86/kernel/cpu/cacheinfo.c
@@ -1033,3 +1033,16 @@ int populate_cache_leaves(unsigned int cpu)

return 0;
}
+
+int cacheinfo_lookup_max_size(int cpu)
+{
+ struct cpu_cacheinfo *this_cpu_ci = get_cpu_cacheinfo(cpu);
+ struct cacheinfo *this_leaf = this_cpu_ci->info_list;
+ struct cacheinfo *max_leaf;
+
+ /*
+ * Assume that cache sizes always increase with level.
+ */
+ max_leaf = this_leaf + this_cpu_ci->num_leaves - 1;
+ return max_leaf->size;
+}
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 40ed44ead063..1b3e2c40f832 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -49,6 +49,7 @@
#include <asm/thermal.h>
#include <asm/unwind.h>
#include <asm/vsyscall.h>
+#include <asm/cacheinfo.h>
#include <linux/vmalloc.h>

/*
@@ -1250,3 +1251,8 @@ static int __init register_kernel_offset_dumper(void)
return 0;
}
__initcall(register_kernel_offset_dumper);
+
+unsigned long __init arch_clear_page_uncached_threshold(void)
+{
+ return cacheinfo_lookup_max_size(0);
+}
--
2.29.2

2021-10-20 17:08:18

by Ankur Arora

Subject: [PATCH v2 12/14] gup: use uncached path when clearing large regions

When clearing a large region, or when the user explicitly specifies
via FOLL_HINT_BULK that a call to get_user_pages() is part of a larger
region, take the uncached path.

One notable limitation is that this is only done when the underlying
pages are huge or gigantic, even if a large region composed of PAGE_SIZE
pages is being cleared. This is because uncached stores are generally
weakly ordered and need some kind of store fence -- which would need
to be done at PTE write granularity to avoid data leakage. This would be
expensive enough that it would negate any performance advantage.

Performance
====

System: Oracle E4-2C (2 nodes * 64 cores * 2 threads) (Milan)
Processor: AMD EPYC 7J13 64-Core
Memory: 2048 GB evenly split between nodes
LLC-size: 32MB for each CCX (8-core * 2-threads)
boost: 0, Microcode: 0xa001137, scaling-governor: performance

System: Oracle X9-2 (2 nodes * 32 cores * 2 threads) (Icelake)
Processor: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
Memory: 512 GB evenly split between nodes
LLC-size: 48MB for each node (32-cores * 2-threads)
no_turbo: 1, Microcode: 0xd0001e0, scaling-governor: performance

Workload: qemu-VM-create
==

Create a large VM, backed by preallocated 2MB pages.
(This test needs a minor change in qemu so it mmap's with
MAP_POPULATE instead of demand faulting each page.)

Milan, sz=1550 GB, runs=3              BW        stdev      diff
                                    ----------   ------    --------
baseline   (clear_page_erms)         8.05 GBps    0.08
CLZERO     (clear_page_clzero)      29.94 GBps    0.31     +271.92%

(VM creation time decreases from 192.6s to 51.7s.)

Icelake, sz=200 GB, runs=3             BW        stdev      diff
                                    ----------   ------    ---------
baseline   (clear_page_erms)         8.25 GBps    0.05
MOVNT      (clear_page_movnt)       21.55 GBps    0.31     +161.21%

(VM creation time decreases from 25.2s to 9.3s.)

As the diff shows, for both these micro-architectures there's a
significant speedup with the CLZERO and MOVNT based interfaces.

Workload: Kernel build with background clear_huge_page()
==

Probe the cache-pollution aspect of this commit with a kernel
build (make -j 15 bzImage) alongside a background clear_huge_page()
load which does mmap(length=64GB, flags=MAP_POPULATE|MAP_HUGE_2MB)
in a loop.
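
The background load is essentially the following (a sketch; the actual
harness differs in details and assumes enough 2MB hugetlb pages have
been preallocated via vm.nr_hugepages):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <linux/mman.h>    /* MAP_HUGE_2MB */

    int main(void)
    {
        const size_t len = 64UL << 30;    /* 64GB */

        for (;;) {
            /* MAP_POPULATE prefaults, i.e. clear_huge_page() per 2MB page. */
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
                           MAP_HUGE_2MB | MAP_POPULATE, -1, 0);
            if (p == MAP_FAILED)
                return 1;
            munmap(p, len);
        }
    }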

The expectation -- assuming the kernel build is partly cache
limited -- is that the build should slow down more with a background
clear_page_erms() load than with clear_page_movnt() or
clear_page_clzero().
The build itself does not use THP or similar, so any performance changes
are due to the background load.

# Milan, compile.sh internally tasksets to a CCX
# perf stat -r 5 -e task-clock -e cycles -e stalled-cycles-frontend \
-e stalled-cycles-backend -e instructions -e branches \
-e branch-misses -e L1-dcache-loads -e L1-dcache-load-misses \
-e cache-references -e cache-misses -e all_data_cache_accesses \
-e l1_data_cache_fills_all -e l1_data_cache_fills_from_memory \
./compile.sh

Milan                 kernel-build[1]        kernel-build[2]           kernel-build[3]
                      (bg: nothing)          (bg:clear_page_erms())    (bg:clear_page_clzero())
-----------------     ---------------------  ----------------------    ------------------------
run time              280.12s (+- 0.59%)     322.21s (+- 0.26%)        307.02s (+- 1.35%)
IPC                   1.16                   1.05                      1.14
backend-idle          3.78% (+- 0.06%)       4.62% (+- 0.11%)          3.87% (+- 0.10%)
cache-misses          20.08% (+- 0.14%)      20.88% (+- 0.13%)         20.09% (+- 0.11%)
 (% of cache-refs)
l1_data_cache_fills-  2.77M/sec (+- 0.20%)   3.11M/sec (+- 0.32%)      2.73M/sec (+- 0.12%)
 _from_memory

From the backend-idle stats in [1], the kernel build is only mildly
memory subsystem bound. However, there's a small but clear effect where
the background load of clear_page_clzero() does not leave much of an
imprint on the kernel-build in [3] -- both [1] and [3] have largely
similar IPC, memory and cache behaviour. OTOH, the clear_page_erms()
workload in [2] constrains the kernel-build more.

(Fuller perf stat output, at [1], [2], [3].)

# Icelake, compile.sh internally tasksets to a socket
# perf stat -r 5 -e task-clock -e cycles -e stalled-cycles-frontend \
-e stalled-cycles-backend -e instructions -e branches \
-e branch-misses -e L1-dcache-loads -e L1-dcache-load-misses \
-e cache-references -e cache-misses -e LLC-loads \
-e LLC-load-misses ./compile.sh

Icelake               kernel-build[4]        kernel-build[5]           kernel-build[6]
                      (bg: nothing)          (bg:clear_page_erms())    (bg:clear_page_movnt())
-----------------     ---------------------  ----------------------    -----------------------
run time              135.47s (+- 0.25%)     136.75s (+- 0.23%)        135.65s (+- 0.15%)
IPC                   1.81                   1.80                      1.80
cache-misses          21.68% (+- 0.42%)      22.88% (+- 0.87%)         21.19% (+- 0.51%)
 (% of cache-refs)
LLC-load-misses       35.56% (+- 0.83%)      37.44% (+- 0.99%)         33.54% (+- 1.17%)

From the LLC-load-miss and the cache-miss numbers, clear_page_erms()
seems to cause some additional cache contention in the kernel-build in
[5], compared to [4] and [6]. However, from the IPC and the run time
numbers, looks like the CPU pipeline compensates for the extra misses
quite well.
(Increasing the number of make jobs to 60 did not change the overall
picture appreciably.)

(Fuller perf stat output, at [4], [5], [6].)

[1] Milan, kernel-build

Performance counter stats for './compile.sh' (5 runs):

2,525,721.45 msec task-clock # 9.016 CPUs utilized ( +- 0.06% )
4,642,144,895,632 cycles # 1.838 GHz ( +- 0.01% ) (47.38%)
54,430,239,074 stalled-cycles-frontend # 1.17% frontend cycles idle ( +- 0.16% ) (47.35%)
175,620,521,760 stalled-cycles-backend # 3.78% backend cycles idle ( +- 0.06% ) (47.34%)
5,392,053,273,328 instructions # 1.16 insn per cycle
# 0.03 stalled cycles per insn ( +- 0.02% ) (47.34%)
1,181,224,298,651 branches # 467.572 M/sec ( +- 0.01% ) (47.33%)
27,668,103,863 branch-misses # 2.34% of all branches ( +- 0.04% ) (47.33%)
2,141,384,087,286 L1-dcache-loads # 847.639 M/sec ( +- 0.01% ) (47.32%)
86,216,717,118 L1-dcache-load-misses # 4.03% of all L1-dcache accesses ( +- 0.08% ) (47.35%)
264,844,001,975 cache-references # 104.835 M/sec ( +- 0.03% ) (47.36%)
53,225,109,745 cache-misses # 20.086 % of all cache refs ( +- 0.14% ) (47.37%)
2,610,041,169,859 all_data_cache_accesses # 1.033 G/sec ( +- 0.01% ) (47.37%)
96,419,361,379 l1_data_cache_fills_all # 38.166 M/sec ( +- 0.06% ) (47.37%)
7,005,118,698 l1_data_cache_fills_from_memory # 2.773 M/sec ( +- 0.20% ) (47.38%)

280.12 +- 1.65 seconds time elapsed ( +- 0.59% )

[2] Milan, kernel-build (bg: clear_page_erms() workload)

Performance counter stats for './compile.sh' (5 runs):

2,852,168.93 msec task-clock # 8.852 CPUs utilized ( +- 0.14% )
5,166,249,772,084 cycles # 1.821 GHz ( +- 0.05% ) (47.27%)
62,039,291,151 stalled-cycles-frontend # 1.20% frontend cycles idle ( +- 0.04% ) (47.29%)
238,472,446,709 stalled-cycles-backend # 4.62% backend cycles idle ( +- 0.11% ) (47.30%)
5,419,530,293,688 instructions # 1.05 insn per cycle
# 0.04 stalled cycles per insn ( +- 0.01% ) (47.31%)
1,186,958,893,481 branches # 418.404 M/sec ( +- 0.01% ) (47.31%)
28,106,023,654 branch-misses # 2.37% of all branches ( +- 0.03% ) (47.29%)
2,160,377,315,024 L1-dcache-loads # 761.534 M/sec ( +- 0.03% ) (47.26%)
89,101,836,173 L1-dcache-load-misses # 4.13% of all L1-dcache accesses ( +- 0.06% ) (47.25%)
276,859,144,248 cache-references # 97.593 M/sec ( +- 0.04% ) (47.22%)
57,774,174,239 cache-misses # 20.889 % of all cache refs ( +- 0.13% ) (47.24%)
2,641,613,011,234 all_data_cache_accesses # 931.170 M/sec ( +- 0.01% ) (47.22%)
99,595,968,133 l1_data_cache_fills_all # 35.108 M/sec ( +- 0.06% ) (47.24%)
8,831,873,628 l1_data_cache_fills_from_memory # 3.113 M/sec ( +- 0.32% ) (47.23%)

322.211 +- 0.837 seconds time elapsed ( +- 0.26% )

[3] Milan, kernel-build + (bg: clear_page_clzero() workload)

Performance counter stats for './compile.sh' (5 runs):

2,607,387.17 msec task-clock # 8.493 CPUs utilized ( +- 0.14% )
4,749,807,054,468 cycles # 1.824 GHz ( +- 0.09% ) (47.28%)
56,579,908,946 stalled-cycles-frontend # 1.19% frontend cycles idle ( +- 0.19% ) (47.28%)
183,367,955,020 stalled-cycles-backend # 3.87% backend cycles idle ( +- 0.10% ) (47.28%)
5,395,577,260,957 instructions # 1.14 insn per cycle
# 0.03 stalled cycles per insn ( +- 0.02% ) (47.29%)
1,181,904,525,139 branches # 453.753 M/sec ( +- 0.01% ) (47.30%)
27,702,316,890 branch-misses # 2.34% of all branches ( +- 0.02% ) (47.31%)
2,137,616,885,978 L1-dcache-loads # 820.667 M/sec ( +- 0.01% ) (47.32%)
85,841,996,509 L1-dcache-load-misses # 4.02% of all L1-dcache accesses ( +- 0.03% ) (47.32%)
262,784,890,310 cache-references # 100.888 M/sec ( +- 0.04% ) (47.32%)
52,812,245,646 cache-misses # 20.094 % of all cache refs ( +- 0.11% ) (47.32%)
2,605,653,350,299 all_data_cache_accesses # 1.000 G/sec ( +- 0.01% ) (47.32%)
95,770,076,665 l1_data_cache_fills_all # 36.768 M/sec ( +- 0.03% ) (47.30%)
7,134,690,513 l1_data_cache_fills_from_memory # 2.739 M/sec ( +- 0.12% ) (47.29%)

307.02 +- 4.15 seconds time elapsed ( +- 1.35% )

[4] Icelake, kernel-build

Performance counter stats for './compile.sh' (5 runs):

421,633 cs # 358.780 /sec ( +- 0.04% )
1,173,522.36 msec task-clock # 8.662 CPUs utilized ( +- 0.14% )
2,991,427,421,282 cycles # 2.545 GHz ( +- 0.15% ) (82.42%)
5,410,090,251,681 instructions # 1.81 insn per cycle ( +- 0.02% ) (91.13%)
1,189,406,048,438 branches # 1.012 G/sec ( +- 0.02% ) (91.05%)
21,291,454,717 branch-misses # 1.79% of all branches ( +- 0.02% ) (91.06%)
1,462,419,736,675 L1-dcache-loads # 1.244 G/sec ( +- 0.02% ) (91.06%)
47,084,269,809 L1-dcache-load-misses # 3.22% of all L1-dcache accesses ( +- 0.01% ) (91.05%)
23,527,140,332 cache-references # 20.020 M/sec ( +- 0.13% ) (91.04%)
5,093,132,060 cache-misses # 21.682 % of all cache refs ( +- 0.42% ) (91.03%)
4,220,672,439 LLC-loads # 3.591 M/sec ( +- 0.14% ) (91.04%)
1,501,704,609 LLC-load-misses # 35.56% of all LL-cache accesses ( +- 0.83% ) (73.10%)

135.478 +- 0.335 seconds time elapsed ( +- 0.25% )

[5] Icelake, kernel-build + (bg: clear_page_erms() workload)

Performance counter stats for './compile.sh' (5 runs):

410,611 cs # 347.771 /sec ( +- 0.02% )
1,184,382.84 msec task-clock # 8.661 CPUs utilized ( +- 0.08% )
3,018,535,155,772 cycles # 2.557 GHz ( +- 0.08% ) (82.42%)
5,408,788,104,113 instructions # 1.80 insn per cycle ( +- 0.00% ) (91.13%)
1,189,173,209,515 branches # 1.007 G/sec ( +- 0.00% ) (91.05%)
21,279,087,578 branch-misses # 1.79% of all branches ( +- 0.01% ) (91.06%)
1,462,243,374,967 L1-dcache-loads # 1.238 G/sec ( +- 0.00% ) (91.05%)
47,210,704,159 L1-dcache-load-misses # 3.23% of all L1-dcache accesses ( +- 0.02% ) (91.04%)
23,378,470,958 cache-references # 19.801 M/sec ( +- 0.03% ) (91.05%)
5,339,921,426 cache-misses # 22.814 % of all cache refs ( +- 0.87% ) (91.03%)
4,241,388,134 LLC-loads # 3.592 M/sec ( +- 0.02% ) (91.05%)
1,588,055,137 LLC-load-misses # 37.44% of all LL-cache accesses ( +- 0.99% ) (73.09%)

136.750 +- 0.315 seconds time elapsed ( +- 0.23% )

[6] Icelake, kernel-build + (bg: clear_page_movnt() workload)

Performance counter stats for './compile.sh' (5 runs):

409,978 cs # 347.850 /sec ( +- 0.06% )
1,174,090.99 msec task-clock # 8.655 CPUs utilized ( +- 0.10% )
2,992,914,428,930 cycles # 2.539 GHz ( +- 0.10% ) (82.40%)
5,408,632,560,457 instructions # 1.80 insn per cycle ( +- 0.00% ) (91.12%)
1,189,083,425,674 branches # 1.009 G/sec ( +- 0.00% ) (91.05%)
21,273,992,588 branch-misses # 1.79% of all branches ( +- 0.02% ) (91.05%)
1,462,081,591,012 L1-dcache-loads # 1.241 G/sec ( +- 0.00% ) (91.05%)
47,071,136,770 L1-dcache-load-misses # 3.22% of all L1-dcache accesses ( +- 0.03% ) (91.04%)
23,331,268,072 cache-references # 19.796 M/sec ( +- 0.05% ) (91.04%)
4,953,198,057 cache-misses # 21.190 % of all cache refs ( +- 0.51% ) (91.04%)
4,194,721,070 LLC-loads # 3.559 M/sec ( +- 0.10% ) (91.06%)
1,412,414,538 LLC-load-misses # 33.54% of all LL-cache accesses ( +- 1.17% ) (73.09%)

135.654 +- 0.203 seconds time elapsed ( +- 0.15% )

Signed-off-by: Ankur Arora <[email protected]>
---
fs/hugetlbfs/inode.c | 7 ++++++-
mm/gup.c | 20 ++++++++++++++++++++
mm/huge_memory.c | 2 +-
mm/hugetlb.c | 9 ++++++++-
4 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index cdfb1ae78a3f..44cee9d30035 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -636,6 +636,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
loff_t hpage_size = huge_page_size(h);
unsigned long hpage_shift = huge_page_shift(h);
pgoff_t start, index, end;
+ bool hint_uncached;
int error;
u32 hash;

@@ -653,6 +654,9 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
start = offset >> hpage_shift;
end = (offset + len + hpage_size - 1) >> hpage_shift;

+ /* Don't pollute the cache if we are fallocate'ing a large region. */
+ hint_uncached = clear_page_prefer_uncached((end - start) << hpage_shift);
+
inode_lock(inode);

/* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
@@ -731,7 +735,8 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
error = PTR_ERR(page);
goto out;
}
- clear_huge_page(page, addr, pages_per_huge_page(h));
+ clear_huge_page(page, addr, pages_per_huge_page(h),
+ hint_uncached);
__SetPageUptodate(page);
error = huge_add_to_page_cache(page, mapping, index);
if (unlikely(error)) {
diff --git a/mm/gup.c b/mm/gup.c
index 886d6148d3d0..930944e0c6eb 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -933,6 +933,13 @@ static int faultin_page(struct vm_area_struct *vma,
*/
fault_flags |= FAULT_FLAG_TRIED;
}
+ if (*flags & FOLL_HINT_BULK) {
+ /*
+ * From the user hint, we might be faulting-in a large region
+ * so minimize cache-pollution.
+ */
+ fault_flags |= FAULT_FLAG_UNCACHED;
+ }

ret = handle_mm_fault(vma, address, fault_flags, NULL);
if (ret & VM_FAULT_ERROR) {
@@ -1100,6 +1107,19 @@ static long __get_user_pages(struct mm_struct *mm,
if (!(gup_flags & FOLL_FORCE))
gup_flags |= FOLL_NUMA;

+ /*
+ * Uncached page clearing is generally faster when clearing regions
+ * sized ~LLC/2 or thereabouts. So hint the uncached path based
+ * on clear_page_prefer_uncached().
+ *
+ * Note, however that this get_user_pages() call might end up
+ * needing to clear an extent smaller than nr_pages when we have
+ * taken the (potentially slower) uncached path based on the
+ * expectation of a larger nr_pages value.
+ */
+ if (clear_page_prefer_uncached(nr_pages * PAGE_SIZE))
+ gup_flags |= FOLL_HINT_BULK;
+
do {
struct page *page;
unsigned int foll_flags = gup_flags;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ffd4b07285ba..2d239967a8a1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -600,7 +600,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
pgtable_t pgtable;
unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
vm_fault_t ret = 0;
- bool uncached = false;
+ bool uncached = vmf->flags & FAULT_FLAG_UNCACHED;

VM_BUG_ON_PAGE(!PageCompound(page), page);

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a920b1133cdb..35b643df5854 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4874,7 +4874,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
spinlock_t *ptl;
unsigned long haddr = address & huge_page_mask(h);
bool new_page, new_pagecache_page = false;
- bool uncached = false;
+ bool uncached = flags & FAULT_FLAG_UNCACHED;

/*
* Currently, we are forced to kill the process in the event the
@@ -5503,6 +5503,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
*/
fault_flags |= FAULT_FLAG_TRIED;
}
+ if (flags & FOLL_HINT_BULK) {
+ /*
+ * From the user hint, we might be faulting-in a large
+ * region so minimize cache-pollution.
+ */
+ fault_flags |= FAULT_FLAG_UNCACHED;
+ }
ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
if (ret & VM_FAULT_ERROR) {
err = vm_fault_to_errno(ret, flags);
--
2.29.2

2021-10-20 17:08:37

by Ankur Arora

Subject: [PATCH v2 11/14] gup: add FOLL_HINT_BULK, FAULT_FLAG_UNCACHED

Add FOLL_HINT_BULK, which users of get_user_pages() and
pin_user_pages() can use to signal that this call is one of many,
allowing get_user_pages() to optimize accordingly.

Also add FAULT_FLAG_UNCACHED, which translates to the
same idea for the fault path.

As the second flag suggests, the intent for both is to hint the
use of an uncached path if one is available.

Signed-off-by: Ankur Arora <[email protected]>
---
include/linux/mm.h | 3 +++
1 file changed, 3 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3e8ddec2aba2..cf1c711fe5ba 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -446,6 +446,7 @@ extern pgprot_t protection_map[16];
* @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
* @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
* @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ * @FAULT_FLAG_UNCACHED: Fault handling to choose the uncached path.
*
* About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
* whether we would allow page faults to retry by specifying these two
@@ -477,6 +478,7 @@ enum fault_flag {
FAULT_FLAG_REMOTE = 1 << 7,
FAULT_FLAG_INSTRUCTION = 1 << 8,
FAULT_FLAG_INTERRUPTIBLE = 1 << 9,
+ FAULT_FLAG_UNCACHED = 1 << 10,
};

/*
@@ -2864,6 +2866,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
#define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
#define FOLL_PIN 0x40000 /* pages must be released via unpin_user_page */
#define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */
+#define FOLL_HINT_BULK 0x100000 /* part of a larger extent being gup'd */

/*
* FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
--
2.29.2

2021-10-20 17:09:04

by Ankur Arora

Subject: [PATCH v2 10/14] clear_huge_page: use uncached path

Uncached stores are suitable for circumstances where the region written
to is not expected to be read again soon, or the region written to is
large enough that there's no expectation that we will find the data in
the cache.

Add a new helper clear_subpage_uncached(), which handles the uncached
clearing path for huge and gigantic pages.

This path is always used for gigantic pages; for huge pages it is used
only if pages_per_huge_page is larger than the architectural threshold,
or if the user gives an explicit hint (say, for a bulk transfer).

Signed-off-by: Ankur Arora <[email protected]>
---
include/linux/mm.h | 3 ++-
mm/huge_memory.c | 3 ++-
mm/hugetlb.c | 3 ++-
mm/memory.c | 46 ++++++++++++++++++++++++++++++++++++++++++----
4 files changed, 48 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 49a97f817eb2..3e8ddec2aba2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3164,7 +3164,8 @@ enum mf_action_page_type {
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
extern void clear_huge_page(struct page *page,
unsigned long addr_hint,
- unsigned int pages_per_huge_page);
+ unsigned int pages_per_huge_page,
+ bool hint_uncached);
extern void copy_user_huge_page(struct page *dst, struct page *src,
unsigned long addr_hint,
struct vm_area_struct *vma,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5e9ef0fc261e..ffd4b07285ba 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -600,6 +600,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
pgtable_t pgtable;
unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
vm_fault_t ret = 0;
+ bool uncached = false;

VM_BUG_ON_PAGE(!PageCompound(page), page);

@@ -617,7 +618,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
goto release;
}

- clear_huge_page(page, vmf->address, HPAGE_PMD_NR);
+ clear_huge_page(page, vmf->address, HPAGE_PMD_NR, uncached);
/*
* The memory barrier inside __SetPageUptodate makes sure that
* clear_huge_page writes become visible before the set_pmd_at()
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 95dc7b83381f..a920b1133cdb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4874,6 +4874,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
spinlock_t *ptl;
unsigned long haddr = address & huge_page_mask(h);
bool new_page, new_pagecache_page = false;
+ bool uncached = false;

/*
* Currently, we are forced to kill the process in the event the
@@ -4928,7 +4929,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
spin_unlock(ptl);
goto out;
}
- clear_huge_page(page, address, pages_per_huge_page(h));
+ clear_huge_page(page, address, pages_per_huge_page(h), uncached);
__SetPageUptodate(page);
new_page = true;

diff --git a/mm/memory.c b/mm/memory.c
index 9f6059520985..ef365948f595 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5347,6 +5347,27 @@ static inline void process_huge_page(
}
}

+/*
+ * clear_subpage_uncached: clear the page in an uncached fashion.
+ */
+static void clear_subpage_uncached(unsigned long addr, int idx, void *arg)
+{
+ __incoherent void *kaddr;
+ struct page *page = arg;
+
+ page = page + idx;
+
+ /*
+ * Do the kmap explicitly here since clear_user_page_uncached()
+ * only handles __incoherent addresses.
+ *
+ * Caller is responsible for making the region coherent again.
+ */
+ kaddr = (__incoherent void *)kmap_atomic(page);
+ clear_user_page_uncached(kaddr, addr + idx * PAGE_SIZE, page);
+ kunmap_atomic((__force void *)kaddr);
+}
+
static void clear_gigantic_page(struct page *page,
unsigned long addr,
unsigned int pages_per_huge_page)
@@ -5358,7 +5379,8 @@ static void clear_gigantic_page(struct page *page,
for (i = 0; i < pages_per_huge_page;
i++, p = mem_map_next(p, page, i)) {
cond_resched();
- clear_user_highpage(p, addr + i * PAGE_SIZE);
+
+ clear_subpage_uncached(addr + i * PAGE_SIZE, 0, p);
}
}

@@ -5369,18 +5391,34 @@ static void clear_subpage(unsigned long addr, int idx, void *arg)
clear_user_highpage(page + idx, addr);
}

-void clear_huge_page(struct page *page,
- unsigned long addr_hint, unsigned int pages_per_huge_page)
+void clear_huge_page(struct page *page, unsigned long addr_hint,
+ unsigned int pages_per_huge_page, bool uncached)
{
unsigned long addr = addr_hint &
~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);

if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
clear_gigantic_page(page, addr, pages_per_huge_page);
+
+ /* Gigantic page clearing always uses __incoherent. */
+ clear_page_uncached_make_coherent();
return;
}

- process_huge_page(addr_hint, pages_per_huge_page, clear_subpage, page);
+ /*
+ * The uncached path is typically slower for small extents so take it
+ * only if the user provides an explicit hint or if the extent is large
+ * enough that there are no cache expectations.
+ */
+ if (uncached ||
+ clear_page_prefer_uncached(pages_per_huge_page * PAGE_SIZE)) {
+ process_huge_page(addr_hint, pages_per_huge_page,
+ clear_subpage_uncached, page);
+
+ clear_page_uncached_make_coherent();
+ } else {
+ process_huge_page(addr_hint, pages_per_huge_page, clear_subpage, page);
+ }
}

static void copy_user_gigantic_page(struct page *dst, struct page *src,
--
2.29.2

2021-10-20 17:09:15

by Ankur Arora

Subject: [PATCH v2 06/14] sparse: add address_space __incoherent

Some CPU architectures provide store instructions that are weakly
ordered with respect to other instructions that operate on the memory
hierarchy.
Add sparse address_space __incoherent to denote pointers used to
operate over these regions.

Signed-off-by: Ankur Arora <[email protected]>
---
include/linux/compiler_types.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
index b6ff83a714ca..f7f68d7bc494 100644
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -11,6 +11,7 @@
# define __iomem __attribute__((noderef, address_space(__iomem)))
# define __percpu __attribute__((noderef, address_space(__percpu)))
# define __rcu __attribute__((noderef, address_space(__rcu)))
+# define __incoherent __attribute__((noderef, address_space(__incoherent)))
static inline void __chk_user_ptr(const volatile void __user *ptr) { }
static inline void __chk_io_ptr(const volatile void __iomem *ptr) { }
/* context/locking */
@@ -37,6 +38,7 @@ static inline void __chk_io_ptr(const volatile void __iomem *ptr) { }
# define __iomem
# define __percpu
# define __rcu
+# define __incoherent
# define __chk_user_ptr(x) (void)0
# define __chk_io_ptr(x) (void)0
/* context/locking */
--
2.29.2

2021-10-20 18:55:03

by Ankur Arora

Subject: [PATCH v2 13/14] vfio_iommu_type1: specify FOLL_HINT_BULK to pin_user_pages()

Specify FOLL_HINT_BULK to pin_user_pages() so it is aware that this
pin is part of a larger region being pinned, and can optimize based
on that expectation.

Cc: [email protected]
Signed-off-by: Ankur Arora <[email protected]>
---
drivers/vfio/vfio_iommu_type1.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 0e9217687f5c..0d45b0c6464d 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -557,6 +557,9 @@ static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
if (prot & IOMMU_WRITE)
flags |= FOLL_WRITE;

+ /* Tell gup that this iteration is part of a larger set of pins. */
+ flags |= FOLL_HINT_BULK;
+
mmap_read_lock(mm);
ret = pin_user_pages_remote(mm, vaddr, npages, flags | FOLL_LONGTERM,
pages, NULL, NULL);
--
2.29.2

2021-10-20 18:55:16

by Ankur Arora

Subject: [PATCH v2 14/14] x86/cpu/intel: set X86_FEATURE_MOVNT_SLOW for Skylake

System: Oracle X8-2
CPU: 2 nodes * 26 cores/node * 2 threads/core
Intel Xeon Platinum 8270CL (Skylakex, 6:85:7)
Memory: 3TB evenly split between nodes
Microcode: 0x5002f01
scaling_governor: performance
L3 size: 36MB
intel_pstate/no_turbo: 1

$ for i in 2 8 32 128 512; do
perf bench mem memset -f x86-64-movnt -s ${i}MB
done
# Running 'mem/memset' benchmark:
# function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
# Copying 2MB bytes ...
6.361971 GB/sec
# Copying 8MB bytes ...
6.300403 GB/sec
# Copying 32MB bytes ...
6.288992 GB/sec
# Copying 128MB bytes ...
6.328793 GB/sec
# Copying 512MB bytes ...
6.324471 GB/sec

# Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
# (X86_FEATURE_ERMS) and x86-64-movnt:

             x86-64-stosb (5 runs)       x86-64-movnt (5 runs)      speedup
             -----------------------     -----------------------    -------
 size        BW         ( pstdev)        BW         ( pstdev)

 16MB        20.38 GB/s ( +- 2.58%)       6.25 GB/s ( +- 0.41%)     -69.28%
 128MB        6.52 GB/s ( +- 0.14%)       6.31 GB/s ( +- 0.47%)      -3.22%
 1024MB       6.48 GB/s ( +- 0.31%)       6.24 GB/s ( +- 0.00%)      -3.70%
 4096MB       6.51 GB/s ( +- 0.01%)       6.27 GB/s ( +- 0.42%)      -3.68%

Comparing perf stats for size=4096MB:

$ perf stat -r 5 --all-user -e ... perf bench mem memset -l 1 -s 4096MB -f x86-64-stosb
# Running 'mem/memset' benchmark:
# function 'x86-64-stosb' (movsb-based memset() in arch/x86/lib/memset_64.S)
# Copying 4096MB bytes ...
6.516972 GB/sec (+- 0.01%)

Performance counter stats for 'perf bench mem memset -l 1 -s 4096MB -f x86-64-stosb' (5 runs):

3,357,373,317 cpu-cycles # 1.133 GHz ( +- 0.01% ) (29.38%)
165,063,710 instructions # 0.05 insn per cycle ( +- 1.54% ) (35.29%)
358,997 cache-references # 0.121 M/sec ( +- 0.89% ) (35.32%)
205,420 cache-misses # 57.221 % of all cache refs ( +- 3.61% ) (35.36%)
6,117,673 branch-instructions # 2.065 M/sec ( +- 1.48% ) (35.38%)
58,309 branch-misses # 0.95% of all branches ( +- 1.30% ) (35.39%)
31,329,466 bus-cycles # 10.575 M/sec ( +- 0.03% ) (23.56%)
68,543,766 L1-dcache-load-misses # 157.03% of all L1-dcache accesses ( +- 0.02% ) (23.53%)
43,648,909 L1-dcache-loads # 14.734 M/sec ( +- 0.50% ) (23.50%)
137,498 LLC-loads # 0.046 M/sec ( +- 0.21% ) (23.49%)
12,308 LLC-load-misses # 8.95% of all LL-cache accesses ( +- 2.52% ) (23.49%)
26,335 LLC-stores # 0.009 M/sec ( +- 5.65% ) (11.75%)
25,008 LLC-store-misses # 0.008 M/sec ( +- 3.42% ) (11.75%)

2.962842 +- 0.000162 seconds time elapsed ( +- 0.01% )

$ perf stat -r 5 --all-user -e ... perf bench mem memset -l 1 -s 4096MB -f x86-64-movnt
# Running 'mem/memset' benchmark:
# function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
# Copying 4096MB bytes ...
6.283420 GB/sec (+- 0.01%)

Performance counter stats for 'perf bench mem memset -l 1 -s 4096MB -f x86-64-movnt' (5 runs):

4,462,272,094 cpu-cycles # 1.322 GHz ( +- 0.30% ) (29.38%)
1,633,675,881 instructions # 0.37 insn per cycle ( +- 0.21% ) (35.28%)
283,627 cache-references # 0.084 M/sec ( +- 0.58% ) (35.31%)
28,824 cache-misses # 10.163 % of all cache refs ( +- 20.67% ) (35.34%)
139,719,697 branch-instructions # 41.407 M/sec ( +- 0.16% ) (35.35%)
58,062 branch-misses # 0.04% of all branches ( +- 1.49% ) (35.36%)
41,760,350 bus-cycles # 12.376 M/sec ( +- 0.05% ) (23.55%)
303,300 L1-dcache-load-misses # 0.69% of all L1-dcache accesses ( +- 2.08% ) (23.53%)
43,769,498 L1-dcache-loads # 12.972 M/sec ( +- 0.54% ) (23.52%)
99,570 LLC-loads # 0.030 M/sec ( +- 1.06% ) (23.52%)
1,966 LLC-load-misses # 1.97% of all LL-cache accesses ( +- 6.17% ) (23.52%)
129 LLC-stores # 0.038 K/sec ( +- 27.85% ) (11.75%)
7 LLC-store-misses # 0.002 K/sec ( +- 47.82% ) (11.75%)

3.37465 +- 0.00474 seconds time elapsed ( +- 0.14% )

It's unclear if using MOVNT is a net negative on Skylake. For bulk stores
MOVNT is slightly slower than REP;STOSB, but from the L1-dcache-load-misses
stats (L1D.REPLACEMENT), it does elide the write-allocate and thus helps
with cache efficiency.

However, we err on the side of caution and mark Skylake as
X86_FEATURE_MOVNT_SLOW.

Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/kernel/cpu/bugs.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 4e1558d22a5f..222d6f095da1 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -96,6 +96,21 @@ void check_movnt_quirks(struct cpuinfo_x86 *c)
* to worry about any CONFIG_X86_32 families that don't
* support SSE2/MOVNT.
*/
+ if (c->x86_vendor == X86_VENDOR_INTEL) {
+ if (c->x86 == 6) {
+ switch (c->x86_model) {
+ case INTEL_FAM6_SKYLAKE_L:
+ fallthrough;
+ case INTEL_FAM6_SKYLAKE:
+ fallthrough;
+ case INTEL_FAM6_SKYLAKE_X:
+ set_cpu_cap(c, X86_FEATURE_MOVNT_SLOW);
+ break;
+ default:
+ break;
+ }
+ }
+ }
#endif /* CONFIG_X86_64*/
}

--
2.29.2

2021-12-08 08:58:37

by Yu Xu

Subject: Re: [PATCH v2 07/14] x86/clear_page: add clear_page_uncached()

On 10/21/21 1:02 AM, Ankur Arora wrote:
> Expose the low-level uncached primitives (clear_page_movnt(),
> clear_page_clzero()) as alternatives via clear_page_uncached().
> Also fallback to clear_page(), if X86_FEATURE_MOVNT_SLOW is set
> and the CPU does not have X86_FEATURE_CLZERO.
>
> Both the uncached primitives use stores which are weakly ordered
> with respect to other instructions accessing the memory hierarchy.
> To ensure that callers don't mix accesses to different types of
> address_spaces, annotate clear_user_page_uncached(), and
> clear_page_uncached() as taking __incoherent pointers as arguments.
>
> Also add clear_page_uncached_make_coherent() which provides the
> necessary store fence to flush out the uncached regions.
>
> Signed-off-by: Ankur Arora <[email protected]>
> ---
>
> Notes:
> This patch adds the fallback definitions of clear_user_page_uncached()
> etc in include/linux/mm.h which is likely not the right place for it.
>
> I'm guessing these should be moved to include/asm-generic/page.h
> (or maybe a new include/asm-generic/page_uncached.h) and for
> architectures that do have arch/$arch/include/asm/page.h (which
> seems like all of them), also replicate there?
>
> Anyway, wanted to first check if that's the way to do it, before
> doing that.
>
> arch/x86/include/asm/page.h | 10 ++++++++++
> arch/x86/include/asm/page_32.h | 9 +++++++++
> arch/x86/include/asm/page_64.h | 32 ++++++++++++++++++++++++++++++++
> include/linux/mm.h | 14 ++++++++++++++
> 4 files changed, 65 insertions(+)
>
> diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
> index 94dbd51df58f..163be03ac422 100644
> --- a/arch/x86/include/asm/page_32.h
> +++ b/arch/x86/include/asm/page_32.h
> @@ -39,6 +39,15 @@ static inline void clear_page(void *page)
> memset(page, 0, PAGE_SIZE);
> }
>
> +static inline void clear_page_uncached(__incoherent void *page)
> +{
> + clear_page((__force void *) page);
> +}
> +
> +static inline void clear_page_uncached_make_coherent(void)
> +{
> +}
> +
> static inline void copy_page(void *to, void *from)
> {
> memcpy(to, from, PAGE_SIZE);
> diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
> index 3c53f8ef8818..d7946047c70f 100644
> --- a/arch/x86/include/asm/page_64.h
> +++ b/arch/x86/include/asm/page_64.h
> @@ -56,6 +56,38 @@ static inline void clear_page(void *page)
> : "cc", "memory", "rax", "rcx");
> }
>
> +/*
> + * clear_page_uncached: only allowed on __incoherent memory regions.
> + */
> +static inline void clear_page_uncached(__incoherent void *page)
> +{
> + alternative_call_2(clear_page_movnt,
> + clear_page, X86_FEATURE_MOVNT_SLOW,
> + clear_page_clzero, X86_FEATURE_CLZERO,
> + "=D" (page),
> + "0" (page)
> + : "cc", "memory", "rax", "rcx");
> +}
> +
> +/*
> + * clear_page_uncached_make_coherent: executes the necessary store
> + * fence after which __incoherent regions can be safely accessed.
> + */
> +static inline void clear_page_uncached_make_coherent(void)
> +{
> + /*
> + * Keep the sfence for oldinstr and clzero separate to guard against
> + * the possibility that a cpu-model both has X86_FEATURE_MOVNT_SLOW
> + * and X86_FEATURE_CLZERO.
> + *
> + * The alternatives need to be in the same order as the ones
> + * in clear_page_uncached().
> + */
> + alternative_2("sfence",
> + "", X86_FEATURE_MOVNT_SLOW,
> + "sfence", X86_FEATURE_CLZERO);
> +}
> +
> void copy_page(void *to, void *from);
>
> #ifdef CONFIG_X86_5LEVEL
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 73a52aba448f..b88069d1116c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3192,6 +3192,20 @@ static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
>
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
>
> +#ifndef clear_user_page_uncached

Hi Ankur Arora,

I've been looking for where clear_user_page_uncached is defined in this
patchset, but failed.

There should be something like follows in arch/x86, right?

static inline void clear_user_page_uncached(__incoherent void *page,
unsigned long vaddr, struct page *pg)
{
clear_page_uncached(page);
}


Did I miss something?

> +/*
> + * clear_user_page_uncached: fallback to the standard clear_user_page().
> + */
> +static inline void clear_user_page_uncached(__incoherent void *page,
> + unsigned long vaddr, struct page *pg)
> +{
> + clear_user_page((__force void *)page, vaddr, pg);
> +}
> +
> +static inline void clear_page_uncached_make_coherent(void) { }
> +#endif
> +
> +
> #ifdef CONFIG_DEBUG_PAGEALLOC
> extern unsigned int _debug_guardpage_minorder;
> DECLARE_STATIC_KEY_FALSE(_debug_guardpage_enabled);
>

--
Thanks,
Yu

2021-12-10 04:37:55

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH v2 07/14] x86/clear_page: add clear_page_uncached()


Yu Xu <[email protected]> writes:

> On 10/21/21 1:02 AM, Ankur Arora wrote:
>> Expose the low-level uncached primitives (clear_page_movnt(),
>> clear_page_clzero()) as alternatives via clear_page_uncached().
>> Also fallback to clear_page(), if X86_FEATURE_MOVNT_SLOW is set
>> and the CPU does not have X86_FEATURE_CLZERO.
>> Both the uncached primitives use stores which are weakly ordered
>> with respect to other instructions accessing the memory hierarchy.
>> To ensure that callers don't mix accesses to different types of
>> address_spaces, annotate clear_user_page_uncached(), and
>> clear_page_uncached() as taking __incoherent pointers as arguments.
>> Also add clear_page_uncached_make_coherent() which provides the
>> necessary store fence to flush out the uncached regions.
>> Signed-off-by: Ankur Arora <[email protected]>
>> ---
>> Notes:
>> This patch adds the fallback definitions of clear_user_page_uncached()
>> etc in include/linux/mm.h which is likely not the right place for it.
>> I'm guessing these should be moved to include/asm-generic/page.h
>> (or maybe a new include/asm-generic/page_uncached.h) and for
>> architectures that do have arch/$arch/include/asm/page.h (which
>> seems like all of them), also replicate there?
>> Anyway, wanted to first check if that's the way to do it, before
>> doing that.
>> arch/x86/include/asm/page.h | 10 ++++++++++
>> arch/x86/include/asm/page_32.h | 9 +++++++++
>> arch/x86/include/asm/page_64.h | 32 ++++++++++++++++++++++++++++++++
>> include/linux/mm.h | 14 ++++++++++++++
>> 4 files changed, 65 insertions(+)
>> diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
>> index 94dbd51df58f..163be03ac422 100644
>> --- a/arch/x86/include/asm/page_32.h
>> +++ b/arch/x86/include/asm/page_32.h
>> @@ -39,6 +39,15 @@ static inline void clear_page(void *page)
>> memset(page, 0, PAGE_SIZE);
>> }
>> +static inline void clear_page_uncached(__incoherent void *page)
>> +{
>> + clear_page((__force void *) page);
>> +}
>> +
>> +static inline void clear_page_uncached_make_coherent(void)
>> +{
>> +}
>> +
>> static inline void copy_page(void *to, void *from)
>> {
>> memcpy(to, from, PAGE_SIZE);
>> diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
>> index 3c53f8ef8818..d7946047c70f 100644
>> --- a/arch/x86/include/asm/page_64.h
>> +++ b/arch/x86/include/asm/page_64.h
>> @@ -56,6 +56,38 @@ static inline void clear_page(void *page)
>> : "cc", "memory", "rax", "rcx");
>> }
>> +/*
>> + * clear_page_uncached: only allowed on __incoherent memory regions.
>> + */
>> +static inline void clear_page_uncached(__incoherent void *page)
>> +{
>> + alternative_call_2(clear_page_movnt,
>> + clear_page, X86_FEATURE_MOVNT_SLOW,
>> + clear_page_clzero, X86_FEATURE_CLZERO,
>> + "=D" (page),
>> + "0" (page)
>> + : "cc", "memory", "rax", "rcx");
>> +}
>> +
>> +/*
>> + * clear_page_uncached_make_coherent: executes the necessary store
>> + * fence after which __incoherent regions can be safely accessed.
>> + */
>> +static inline void clear_page_uncached_make_coherent(void)
>> +{
>> + /*
>> + * Keep the sfence for oldinstr and clzero separate to guard against
>> + * the possibility that a cpu-model both has X86_FEATURE_MOVNT_SLOW
>> + * and X86_FEATURE_CLZERO.
>> + *
>> + * The alternatives need to be in the same order as the ones
>> + * in clear_page_uncached().
>> + */
>> + alternative_2("sfence",
>> + "", X86_FEATURE_MOVNT_SLOW,
>> + "sfence", X86_FEATURE_CLZERO);
>> +}
>> +
>> void copy_page(void *to, void *from);
>> #ifdef CONFIG_X86_5LEVEL
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 73a52aba448f..b88069d1116c 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3192,6 +3192,20 @@ static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
>> +#ifndef clear_user_page_uncached
>
> Hi Ankur Arora,
>
> I've been looking for where clear_user_page_uncached is defined in this
> patchset, but failed.
>
> There should be something like follows in arch/x86, right?
>
> static inline void clear_user_page_uncached(__incoherent void *page,
> unsigned long vaddr, struct page *pg)
> {
> clear_page_uncached(page);
> }
>
>
> Did I miss something?
>
Hi Yu Xu,

Defined in include/linux/mm.h. Just below :).

>> +/*
>> + * clear_user_page_uncached: fallback to the standard clear_user_page().
>> + */
>> +static inline void clear_user_page_uncached(__incoherent void *page,
>> + unsigned long vaddr, struct page *pg)
>> +{
>> + clear_user_page((__force void *)page, vaddr, pg);
>> +}

That said, as this note in the patch mentions, this isn't really a great
place for this definition. As you also mention, the right place for this
would be somewhere in the arch/.../include and include/asm-generic hierarchy.

>> This patch adds the fallback definitions of clear_user_page_uncached()
>> etc in include/linux/mm.h which is likely not the right place for it.
>> I'm guessing these should be moved to include/asm-generic/page.h
>> (or maybe a new include/asm-generic/page_uncached.h) and for
>> architectures that do have arch/$arch/include/asm/page.h (which
>> seems like all of them), also replicate there?
>> Anyway, wanted to first check if that's the way to do it, before
>> doing that.
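
For concreteness, a minimal sketch of that arrangement. The header name
and exact layering here are assumptions for illustration, not part of the
series as posted:

/* include/asm-generic/page_uncached.h (hypothetical) */
#ifndef clear_user_page_uncached
/*
 * Fallback for architectures without an uncached primitive: use the
 * standard cached clear_user_page(), and make the coherence step a
 * no-op since no store fence is needed.
 */
static inline void clear_user_page_uncached(__incoherent void *page,
					    unsigned long vaddr,
					    struct page *pg)
{
	clear_user_page((__force void *)page, vaddr, pg);
}

static inline void clear_page_uncached_make_coherent(void) { }
#endif

An architecture that does provide an uncached primitive would define its
own clear_user_page_uncached() (and #define clear_user_page_uncached
clear_user_page_uncached) in its asm/page.h before the generic header is
pulled in.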

Recommendations on how to handle this, welcome.

Thanks

--
ankur

2021-12-10 04:48:35

by Yu Xu

[permalink] [raw]
Subject: Re: [PATCH v2 07/14] x86/clear_page: add clear_page_uncached()

On 12/10/21 12:37 PM, Ankur Arora wrote:
>
> Yu Xu <[email protected]> writes:
>
>> On 10/21/21 1:02 AM, Ankur Arora wrote:
>>> Expose the low-level uncached primitives (clear_page_movnt(),
>>> clear_page_clzero()) as alternatives via clear_page_uncached().
>>> Also fallback to clear_page(), if X86_FEATURE_MOVNT_SLOW is set
>>> and the CPU does not have X86_FEATURE_CLZERO.
>>> Both the uncached primitives use stores which are weakly ordered
>>> with respect to other instructions accessing the memory hierarchy.
>>> To ensure that callers don't mix accesses to different types of
>>> address_spaces, annotate clear_user_page_uncached(), and
>>> clear_page_uncached() as taking __incoherent pointers as arguments.
>>> Also add clear_page_uncached_make_coherent() which provides the
>>> necessary store fence to flush out the uncached regions.
>>> Signed-off-by: Ankur Arora <[email protected]>
>>> ---
>>> Notes:
>>> This patch adds the fallback definitions of clear_user_page_uncached()
>>> etc in include/linux/mm.h which is likely not the right place for it.
>>> I'm guessing these should be moved to include/asm-generic/page.h
>>> (or maybe a new include/asm-generic/page_uncached.h) and for
>>> architectures that do have arch/$arch/include/asm/page.h (which
>>> seems like all of them), also replicate there?
>>> Anyway, wanted to first check if that's the way to do it, before
>>> doing that.
>>> arch/x86/include/asm/page.h | 10 ++++++++++
>>> arch/x86/include/asm/page_32.h | 9 +++++++++
>>> arch/x86/include/asm/page_64.h | 32 ++++++++++++++++++++++++++++++++
>>> include/linux/mm.h | 14 ++++++++++++++
>>> 4 files changed, 65 insertions(+)
>>> diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
>>> index 94dbd51df58f..163be03ac422 100644
>>> --- a/arch/x86/include/asm/page_32.h
>>> +++ b/arch/x86/include/asm/page_32.h
>>> @@ -39,6 +39,15 @@ static inline void clear_page(void *page)
>>> memset(page, 0, PAGE_SIZE);
>>> }
>>> +static inline void clear_page_uncached(__incoherent void *page)
>>> +{
>>> + clear_page((__force void *) page);
>>> +}
>>> +
>>> +static inline void clear_page_uncached_make_coherent(void)
>>> +{
>>> +}
>>> +
>>> static inline void copy_page(void *to, void *from)
>>> {
>>> memcpy(to, from, PAGE_SIZE);
>>> diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
>>> index 3c53f8ef8818..d7946047c70f 100644
>>> --- a/arch/x86/include/asm/page_64.h
>>> +++ b/arch/x86/include/asm/page_64.h
>>> @@ -56,6 +56,38 @@ static inline void clear_page(void *page)
>>> : "cc", "memory", "rax", "rcx");
>>> }
>>> +/*
>>> + * clear_page_uncached: only allowed on __incoherent memory regions.
>>> + */
>>> +static inline void clear_page_uncached(__incoherent void *page)
>>> +{
>>> + alternative_call_2(clear_page_movnt,
>>> + clear_page, X86_FEATURE_MOVNT_SLOW,
>>> + clear_page_clzero, X86_FEATURE_CLZERO,
>>> + "=D" (page),
>>> + "0" (page)
>>> + : "cc", "memory", "rax", "rcx");
>>> +}
>>> +
>>> +/*
>>> + * clear_page_uncached_make_coherent: executes the necessary store
>>> + * fence after which __incoherent regions can be safely accessed.
>>> + */
>>> +static inline void clear_page_uncached_make_coherent(void)
>>> +{
>>> + /*
>>> + * Keep the sfence for oldinstr and clzero separate to guard against
>>> + * the possibility that a cpu-model both has X86_FEATURE_MOVNT_SLOW
>>> + * and X86_FEATURE_CLZERO.
>>> + *
>>> + * The alternatives need to be in the same order as the ones
>>> + * in clear_page_uncached().
>>> + */
>>> + alternative_2("sfence",
>>> + "", X86_FEATURE_MOVNT_SLOW,
>>> + "sfence", X86_FEATURE_CLZERO);
>>> +}
>>> +
>>> void copy_page(void *to, void *from);
>>> #ifdef CONFIG_X86_5LEVEL
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 73a52aba448f..b88069d1116c 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -3192,6 +3192,20 @@ static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
>>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
>>> +#ifndef clear_user_page_uncached
>>
>> Hi Ankur Arora,
>>
>> I've been looking for where clear_user_page_uncached is defined in this
>> patchset, but failed.
>>
>> There should be something like follows in arch/x86, right?
>>
>> static inline void clear_user_page_uncached(__incoherent void *page,
>> unsigned long vaddr, struct page *pg)
>> {
>> clear_page_uncached(page);
>> }
>>
>>
>> Did I miss something?
>>
> Hi Yu Xu,
>
> Defined in include/linux/mm.h. Just below :).

Thanks for your reply :)

This is the fallback version under #ifndef clear_user_page_uncached,
i.e., it falls back to the standard clear_user_page().

But where is the uncached version of clear_user_page()? That is what I
am looking for.

>
>>> +/*
>>> + * clear_user_page_uncached: fallback to the standard clear_user_page().
>>> + */
>>> +static inline void clear_user_page_uncached(__incoherent void *page,
>>> + unsigned long vaddr, struct page *pg)
>>> +{
>>> + clear_user_page((__force void *)page, vaddr, pg);
>>> +}
>
> That said, as this note in the patch mentions, this isn't really a great
> place for this definition. As you also mention, the right place for this
> would be somewhere in the arch/.../include and include/asm-generic hierarchy.
>
>>> This patch adds the fallback definitions of clear_user_page_uncached()
>>> etc in include/linux/mm.h which is likely not the right place for it.
>>> I'm guessing these should be moved to include/asm-generic/page.h
>>> (or maybe a new include/asm-generic/page_uncached.h) and for
>>> architectures that do have arch/$arch/include/asm/page.h (which
>>> seems like all of them), also replicate there?
>>> Anyway, wanted to first check if that's the way to do it, before
>>> doing that.
>
> Recommendations on how to handle this, welcome.
>
> Thanks
>
> --
> ankur
>

--
Thanks,
Yu

2021-12-10 14:27:58

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH v2 07/14] x86/clear_page: add clear_page_uncached()


Yu Xu <[email protected]> writes:

> On 12/10/21 12:37 PM, Ankur Arora wrote:
>> Yu Xu <[email protected]> writes:
>>
>>> On 10/21/21 1:02 AM, Ankur Arora wrote:
>>>> Expose the low-level uncached primitives (clear_page_movnt(),
>>>> clear_page_clzero()) as alternatives via clear_page_uncached().
>>>> Also fallback to clear_page(), if X86_FEATURE_MOVNT_SLOW is set
>>>> and the CPU does not have X86_FEATURE_CLZERO.
>>>> Both the uncached primitives use stores which are weakly ordered
>>>> with respect to other instructions accessing the memory hierarchy.
>>>> To ensure that callers don't mix accesses to different types of
>>>> address_spaces, annotate clear_user_page_uncached(), and
>>>> clear_page_uncached() as taking __incoherent pointers as arguments.
>>>> Also add clear_page_uncached_make_coherent() which provides the
>>>> necessary store fence to flush out the uncached regions.
>>>> Signed-off-by: Ankur Arora <[email protected]>
>>>> ---
>>>> Notes:
>>>> This patch adds the fallback definitions of clear_user_page_uncached()
>>>> etc in include/linux/mm.h which is likely not the right place for it.
>>>> I'm guessing these should be moved to include/asm-generic/page.h
>>>> (or maybe a new include/asm-generic/page_uncached.h) and for
>>>> architectures that do have arch/$arch/include/asm/page.h (which
>>>> seems like all of them), also replicate there?
>>>> Anyway, wanted to first check if that's the way to do it, before
>>>> doing that.
>>>> arch/x86/include/asm/page.h | 10 ++++++++++
>>>> arch/x86/include/asm/page_32.h | 9 +++++++++
>>>> arch/x86/include/asm/page_64.h | 32 ++++++++++++++++++++++++++++++++
>>>> include/linux/mm.h | 14 ++++++++++++++
>>>> 4 files changed, 65 insertions(+)
>>>> diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
>>>> index 94dbd51df58f..163be03ac422 100644
>>>> --- a/arch/x86/include/asm/page_32.h
>>>> +++ b/arch/x86/include/asm/page_32.h
>>>> @@ -39,6 +39,15 @@ static inline void clear_page(void *page)
>>>> memset(page, 0, PAGE_SIZE);
>>>> }
>>>> +static inline void clear_page_uncached(__incoherent void *page)
>>>> +{
>>>> + clear_page((__force void *) page);
>>>> +}
>>>> +
>>>> +static inline void clear_page_uncached_make_coherent(void)
>>>> +{
>>>> +}
>>>> +
>>>> static inline void copy_page(void *to, void *from)
>>>> {
>>>> memcpy(to, from, PAGE_SIZE);
>>>> diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
>>>> index 3c53f8ef8818..d7946047c70f 100644
>>>> --- a/arch/x86/include/asm/page_64.h
>>>> +++ b/arch/x86/include/asm/page_64.h
>>>> @@ -56,6 +56,38 @@ static inline void clear_page(void *page)
>>>> : "cc", "memory", "rax", "rcx");
>>>> }
>>>> +/*
>>>> + * clear_page_uncached: only allowed on __incoherent memory regions.
>>>> + */
>>>> +static inline void clear_page_uncached(__incoherent void *page)
>>>> +{
>>>> + alternative_call_2(clear_page_movnt,
>>>> + clear_page, X86_FEATURE_MOVNT_SLOW,
>>>> + clear_page_clzero, X86_FEATURE_CLZERO,
>>>> + "=D" (page),
>>>> + "0" (page)
>>>> + : "cc", "memory", "rax", "rcx");
>>>> +}
>>>> +
>>>> +/*
>>>> + * clear_page_uncached_make_coherent: executes the necessary store
>>>> + * fence after which __incoherent regions can be safely accessed.
>>>> + */
>>>> +static inline void clear_page_uncached_make_coherent(void)
>>>> +{
>>>> + /*
>>>> + * Keep the sfence for oldinstr and clzero separate to guard against
>>>> + * the possibility that a cpu-model both has X86_FEATURE_MOVNT_SLOW
>>>> + * and X86_FEATURE_CLZERO.
>>>> + *
>>>> + * The alternatives need to be in the same order as the ones
>>>> + * in clear_page_uncached().
>>>> + */
>>>> + alternative_2("sfence",
>>>> + "", X86_FEATURE_MOVNT_SLOW,
>>>> + "sfence", X86_FEATURE_CLZERO);
>>>> +}
>>>> +
>>>> void copy_page(void *to, void *from);
>>>> #ifdef CONFIG_X86_5LEVEL
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index 73a52aba448f..b88069d1116c 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -3192,6 +3192,20 @@ static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
>>>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
>>>> +#ifndef clear_user_page_uncached
>>>
>>> Hi Ankur Arora,
>>>
>>> I've been looking for where clear_user_page_uncached is defined in this
>>> patchset, but failed.
>>>
>>> There should be something like follows in arch/x86, right?
>>>
>>> static inline void clear_user_page_uncached(__incoherent void *page,
>>> unsigned long vaddr, struct page *pg)
>>> {
>>> clear_page_uncached(page);
>>> }
>>>
>>>
>>> Did I miss something?
>>>
>> Hi Yu Xu,
>> Defined in include/linux/mm.h. Just below :).
>
> Thanks for your reply :)
>
> This is the version when #ifndef clear_user_page_uncached, i.e., fall
> back to standard clear_user_page.
>
> But where is the uncached version of clear_user_page? I am looking for
> this.

Sorry, my bad. There is a hunk missing. Not sure how, but this hunk
(from patch 7) got dropped from the version that I sent out:

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 4d5810c8fab7..89229f2db34a 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -28,6 +28,16 @@ static inline void clear_user_page(void *page, unsigned long vaddr,
clear_page(page);
}

+#define clear_user_page_uncached clear_user_page_uncached
+/*
+ * clear_page_uncached: allowed on only __incoherent memory regions.
+ */
+static inline void clear_user_page_uncached(__incoherent void *page,
+ unsigned long vaddr, struct page *pg)
+{
+ clear_page_uncached(page);
+}
+
static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
struct page *topage)
{

It is identical to the version that you surmised was missing. Thanks for
the close reading.

Ankur

>>
>>>> +/*
>>>> + * clear_user_page_uncached: fallback to the standard clear_user_page().
>>>> + */
>>>> +static inline void clear_user_page_uncached(__incoherent void *page,
>>>> + unsigned long vaddr, struct page *pg)
>>>> +{
>>>> + clear_user_page((__force void *)page, vaddr, pg);
>>>> +}
>> That said, as this note in the patch mentions, this isn't really a great
>> place for this definition. As you also mention, the right place for this
>> would be somewhere in the arch/.../include and include/asm-generic hierarchy.
>>
>>>> This patch adds the fallback definitions of clear_user_page_uncached()
>>>> etc in include/linux/mm.h which is likely not the right place for it.
>>>> I'm guessing these should be moved to include/asm-generic/page.h
>>>> (or maybe a new include/asm-generic/page_uncached.h) and for
>>>> architectures that do have arch/$arch/include/asm/page.h (which
>>>> seems like all of them), also replicate there?
>>>> Anyway, wanted to first check if that's the way to do it, before
>>>> doing that.
>> Recommendations on how to handle this, welcome.
>> Thanks
>> --
>> ankur
>>