2022-06-06 21:34:07

by Ankur Arora

Subject: [PATCH v3 00/21] huge page clearing optimizations

This series introduces two optimizations in the huge page clearing path:

1. extend the clear_page() machinery to also handle extents larger
than a single page.
2. support non-cached page clearing for huge and gigantic pages.

The first optimization is useful for hugepage fault handling, the
second for prefaulting, or for gigantic pages.

The immediate motivation is to speed up the creation of large VMs
backed by huge pages.

Performance
==

VM creation (192GB VM with prealloc'd 2MB backing pages) sees significant
run-time improvements:

Icelakex:
                         Time (s)              Delta (%)
 clear_page_erms()       22.37 ( +- 0.14s )               #  9.21 bytes/ns
 clear_pages_erms()      16.49 ( +- 0.06s )    -26.28%    # 12.50 bytes/ns
 clear_pages_movnt()      9.42 ( +- 0.20s )    -42.87%    # 21.88 bytes/ns

Milan:
                         Time (s)              Delta (%)
 clear_page_erms()       16.49 ( +- 0.06s )               # 12.50 bytes/ns
 clear_pages_erms()      11.82 ( +- 0.06s )    -28.32%    # 17.44 bytes/ns
 clear_pages_clzero()     4.91 ( +- 0.27s )    -58.49%    # 41.98 bytes/ns

As a side-effect, non-polluting clearing by eliding zero filling of
caches also shows better LLC miss rates. For a kbuild+background
page-clearing job, this shows up as a small improvement (~2%) in
runtime.

Discussion
==


With the motivation out of the way, the following note describes
v3's handling of past review comments (and other sticking points for
series of this nature -- especially the non-cached part -- over the
years):

1. Non-cached clearing is unnecessary on x86: x86 already uses 'REP;STOS'
which unlike a MOVNT loop, has semantically richer information available
which can be used by current (and/or future) processors to make the
same cache-elision optimization.

All true, except that a) current-gen uarchs often don't, and b) even
when they do, the kernel, by clearing at 4K granularity, doesn't
expose the extent information in a way that processors could easily
optimize for.

For a), I tested a bunch of REP-STOSB/MOVNTI/CLZERO loops with different
chunk sizes (in user-space over a VA extent of 4GB, page-size=4K.)
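
For reference, a minimal user-space sketch of the kind of loops
measured (a sketch only -- the actual harness, the 4GB extent and the
chunk sizes are test parameters; REP-STOSB via inline asm, MOVNTI as
a 64-bit store loop followed by an SFENCE):

  static void zero_rep_stosb(void *dst, unsigned long len)
  {
          /* Byte-granular 'rep stosb' fill of len bytes with zero. */
          asm volatile("rep stosb"
                       : "+D" (dst), "+c" (len)
                       : "a" (0)
                       : "memory");
  }

  static void zero_movnti(void *dst, unsigned long len)
  {
          unsigned long *p = dst;

          /* Non-temporal 8-byte stores; fence before the data is used. */
          for (; len >= sizeof(*p); len -= sizeof(*p), p++)
                  asm volatile("movnti %1, %0" : "=m" (*p) : "r" (0UL));
          asm volatile("sfence" ::: "memory");
  }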

Intel Icelake (LLC=48MB, no_turbo=1):

 chunk-size     REP-STOSB      MOVNTI
                   MBps          MBps

     4K             9444         24510
    64K            11931         24508
     2M            12355         24524
     8M            12369         24525
    32M            12368         24523
   128M            12374         24522
    1GB            12372         24561

Which is pretty flat across chunk-sizes.


AMD Milan (LLC=32MB, boost=0):

 chunk-size     REP-STOSB      MOVNTI       CLZERO
                   MBps          MBps         MBps

     4K            13034         17815        45579
    64K            15196         18549        46038
     2M            14821         18581        39064
     8M            13964         18557        46045
    32M            22525         18560        45969
   128M            29311         18581        38924
    1GB            35807         18574        45981

The REP-STOSB scaling on Milan starts right around chunk-size=LLC-size.
It does seem to asymptotically get close to CLZERO performance, but
the scaling is linear and not a step function.

For b), as mentioned above, the kernel, by zeroing at 4K granularity,
doesn't send the right signal to the uarch (though the largest
extent we can use for huge pages is 2MB -- and lower for preemptible
kernels -- which from these numbers is not large enough.)
Still, using clear_page_extent() with larger extents would send the
uarch a hint that it could capitalize on in the future.

This is addressed in patches 1-6:
"mm, huge-page: reorder arguments to process_huge_page()"
"mm, huge-page: refactor process_subpage()"
"clear_page: add generic clear_user_pages()"
"mm, clear_huge_page: support clear_user_pages()"
"mm/huge_page: generalize process_huge_page()"
"x86/clear_page: add clear_pages()"

with patch 5, "mm/huge_page: generalize process_huge_page()"
containing the core logic.
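
For illustration, a simplified sketch of what patch 6's clear_pages()
amounts to (hedged: the real version presumably selects between
clear_pages_orig()/clear_pages_rep()/clear_pages_erms() the way
clear_page() does today; only the ERMS path is shown):

  static inline void clear_pages(void *page, unsigned int npages)
  {
          /* One 'REP; STOSB' over the whole npages extent. */
          clear_pages_erms(page, npages);
  }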

2. Non-caching stores (via MOVNTI, CLZERO on x86) are weakly ordered with
respect to the cache hierarchy and unless they are combined with an
appropriate fence, are unsafe to use.

This is true and is a problem. Patch 12, "sparse: add address_space
__incoherent" adds a new sparse address_space which is used in
the architectural interfaces to make sure that any user is cognizant
of its use:

void clear_user_pages_incoherent(__incoherent void *page, ...)
void clear_pages_incoherent(__incoherent void *page, ...)

One other place it is needed (and is missing) is in highmem:
void clear_user_highpages_incoherent(struct page *page, ...).

Given the natural highmem interface, I couldn't think of a good
way to add the annotation here.
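
For reference, a hedged sketch of what the patch 12 annotation amounts
to (modeled on the existing __user/__iomem definitions in
compiler_types.h; the exact form in the patch may differ):

  #ifdef __CHECKER__
  # define __incoherent  __attribute__((noderef, address_space(__incoherent)))
  #else
  # define __incoherent
  #endif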

3. Non-caching stores are generally slower than cached for extents
smaller than LLC-size, and faster for larger ones.

This means that if you choose the non-caching path for too small an
extent, you would see performance regressions. There is of course
benefit in not filling the cache with zeroes, but that is a somewhat
nebulous advantage and AFAICT there are no representative tests that
probe for it.
(Note that this slowness isn't a consequence of the extra fence --
that is expensive but stops being noticeable for chunk-size >=
~32K-128K depending on uarch.)

This is handled by adding an arch specific threshold (with a
default CLEAR_PAGE_NON_CACHING_THRESHOLD=8MB) in patches 15 and 16,
"mm/clear_page: add clear_page_non_caching_threshold()",
"x86/clear_page: add arch_clear_page_non_caching_threshold()".

Further, a single call to clear_huge_pages() or get_/pin_user_pages()
might only see a small portion of an extent being cleared in each
iteration. To make sure we choose non-caching stores when working with
large extents, patch 18, "gup: add FOLL_HINT_BULK,
FAULT_FLAG_NON_CACHING", adds a new flag that gup users can use for
this purpose. This is used in patch 20, "vfio_iommu_type1: specify
FOLL_HINT_BULK to pin_user_pages()", which pins process memory
when attaching passthrough PCIe devices.

The get_user_pages() logic to handle these flags is in patch 19,
"gup: hint non-caching if clearing large regions".

4. Subpoint of 3) above (non-caching stores are faster for extents
larger than LLC-size) is generally true, with a side of Brownian
motion thrown in. For instance, MOVNTI (for > LLC-size) performs well
on Broadwell and Ice Lake, but on Skylake/Cascade Lake -- sandwiched
in between the two -- it does not.

To deal with this, use Ingo's suggestion of "trust but verify",
(https://lore.kernel.org/lkml/[email protected]/)
where we enable MOVNT by default and only disable it on slow
uarchs.
If the non-caching path ends up being a part of the kernel, uarchs
that regress would hopefully show up early enough in chip testing.

Patch 11, "x86/cpuid: add X86_FEATURE_MOVNT_SLOW" adds this logic
and patch 21, "x86/cpu/intel: set X86_FEATURE_MOVNT_SLOW for
Skylake" disables the non-caching path for Skylake.

Performance numbers are in patches 6 and 19, "x86/clear_page: add
clear_pages()", "gup: hint non-caching if clearing large regions".

Also at:
github.com/terminus/linux clear-page-non-caching.upstream-v3

Comments appreciated!

Changelog
==

v2: https://lore.kernel.org/lkml/[email protected]/
- Add multi-page clearing: this addresses comments from Ingo
(from v1), and from an offlist discussion with Linus.
- Rename clear_pages_uncached() to make the lack of safety
more obvious: this addresses comments from Andy Lutomirski.
- Simplify the clear_huge_page() changes.
- Usual cleanups etc.
- Rebased to v5.18.


v1: https://lore.kernel.org/lkml/[email protected]/
- Make the unsafe nature of clear_page_uncached() more obvious.
- Invert X86_FEATURE_NT_GOOD to X86_FEATURE_MOVNT_SLOW, so we don't
have to explicitly enable it for every new model: suggestion from
Ingo Molnar.
- Add GUP path (and appropriate threshold) to allow the uncached path
to be used for huge pages.
- Make the code more generic so it's tied to fewer x86 specific assumptions.

Thanks
Ankur

Ankur Arora (21):
mm, huge-page: reorder arguments to process_huge_page()
mm, huge-page: refactor process_subpage()
clear_page: add generic clear_user_pages()
mm, clear_huge_page: support clear_user_pages()
mm/huge_page: generalize process_huge_page()
x86/clear_page: add clear_pages()
x86/asm: add memset_movnti()
perf bench: add memset_movnti()
x86/asm: add clear_pages_movnt()
x86/asm: add clear_pages_clzero()
x86/cpuid: add X86_FEATURE_MOVNT_SLOW
sparse: add address_space __incoherent
clear_page: add generic clear_user_pages_incoherent()
x86/clear_page: add clear_pages_incoherent()
mm/clear_page: add clear_page_non_caching_threshold()
x86/clear_page: add arch_clear_page_non_caching_threshold()
clear_huge_page: use non-cached clearing
gup: add FOLL_HINT_BULK, FAULT_FLAG_NON_CACHING
gup: hint non-caching if clearing large regions
vfio_iommu_type1: specify FOLL_HINT_BULK to pin_user_pages()
x86/cpu/intel: set X86_FEATURE_MOVNT_SLOW for Skylake

arch/alpha/include/asm/page.h | 1 +
arch/arc/include/asm/page.h | 1 +
arch/arm/include/asm/page.h | 1 +
arch/arm64/include/asm/page.h | 1 +
arch/csky/include/asm/page.h | 1 +
arch/hexagon/include/asm/page.h | 1 +
arch/ia64/include/asm/page.h | 1 +
arch/m68k/include/asm/page.h | 1 +
arch/microblaze/include/asm/page.h | 1 +
arch/mips/include/asm/page.h | 1 +
arch/nios2/include/asm/page.h | 2 +
arch/openrisc/include/asm/page.h | 1 +
arch/parisc/include/asm/page.h | 1 +
arch/powerpc/include/asm/page.h | 1 +
arch/riscv/include/asm/page.h | 1 +
arch/s390/include/asm/page.h | 1 +
arch/sh/include/asm/page.h | 1 +
arch/sparc/include/asm/page_32.h | 1 +
arch/sparc/include/asm/page_64.h | 1 +
arch/um/include/asm/page.h | 1 +
arch/x86/include/asm/cacheinfo.h | 1 +
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/page.h | 26 ++
arch/x86/include/asm/page_64.h | 64 ++++-
arch/x86/kernel/cpu/amd.c | 2 +
arch/x86/kernel/cpu/bugs.c | 30 +++
arch/x86/kernel/cpu/cacheinfo.c | 13 +
arch/x86/kernel/cpu/cpu.h | 2 +
arch/x86/kernel/cpu/intel.c | 2 +
arch/x86/kernel/setup.c | 6 +
arch/x86/lib/clear_page_64.S | 78 ++++--
arch/x86/lib/memset_64.S | 68 ++---
arch/xtensa/include/asm/page.h | 1 +
drivers/vfio/vfio_iommu_type1.c | 3 +
fs/hugetlbfs/inode.c | 7 +-
include/asm-generic/clear_page.h | 69 +++++
include/asm-generic/page.h | 1 +
include/linux/compiler_types.h | 2 +
include/linux/highmem.h | 46 ++++
include/linux/mm.h | 10 +-
include/linux/mm_types.h | 2 +
mm/gup.c | 18 ++
mm/huge_memory.c | 3 +-
mm/hugetlb.c | 10 +-
mm/memory.c | 264 +++++++++++++++----
tools/arch/x86/lib/memset_64.S | 68 ++---
tools/perf/bench/mem-memset-x86-64-asm-def.h | 6 +-
47 files changed, 680 insertions(+), 144 deletions(-)
create mode 100644 include/asm-generic/clear_page.h

--
2.31.1


2022-06-06 21:59:32

by Ankur Arora

Subject: [PATCH v3 04/21] mm, clear_huge_page: support clear_user_pages()

process_huge_page() now handles page extents, with process_subpages()
handling the individual page-level operations.

The process_subpages() workers, clear_subpages() and copy_subpages(),
chunk the clearing in units of clear_page_unit, or continue to copy
using a single-page operation.

Relatedly, define clear_user_extent(), which uses clear_user_highpages()
to either funnel through to clear_user_pages() or fall back to
page-at-a-time clearing via clear_user_highpage().

clear_page_unit, the clearing unit size, is defined to be:
1 << min(MAX_ORDER - 1, ARCH_MAX_CLEAR_PAGES_ORDER).
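
For instance (illustrative values only): with MAX_ORDER = 11 and an
arch exposing ARCH_MAX_CLEAR_PAGES_ORDER = 8, clear_page_unit =
1 << min(10, 8) = 256 pages, i.e. 1MB cleared per clear_user_extent()
call; without an arch opt-in (ARCH_MAX_CLEAR_PAGES_ORDER == 0) it
stays at a single page.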

Signed-off-by: Ankur Arora <[email protected]>
---
mm/memory.c | 95 ++++++++++++++++++++++++++++++++++++++---------------
1 file changed, 69 insertions(+), 26 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2c86d79c9d98..fbc7bc70dc3d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5563,6 +5563,31 @@ EXPORT_SYMBOL(__might_fault);

#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)

+static unsigned int __ro_after_init clear_page_unit = 1;
+static int __init setup_clear_page_params(void)
+{
+ clear_page_unit = 1 << min(MAX_ORDER - 1, ARCH_MAX_CLEAR_PAGES_ORDER);
+ return 0;
+}
+
+/*
+ * cacheinfo is setup via device_initcall and we want to get set after
+ * that. Use the default value until then.
+ */
+late_initcall(setup_clear_page_params);
+
+/*
+ * Clear a page extent.
+ *
+ * With ARCH_MAX_CLEAR_PAGES == 1, clear_user_highpages() drops down
+ * to page-at-a-time mode. Or, funnels through to clear_user_pages().
+ */
+static void clear_user_extent(struct page *start_page, unsigned long vaddr,
+ unsigned int npages)
+{
+ clear_user_highpages(start_page, vaddr, npages);
+}
+
struct subpage_arg {
struct page *dst;
struct page *src;
@@ -5576,34 +5601,29 @@ struct subpage_arg {
*/
static inline void process_huge_page(struct subpage_arg *sa,
unsigned long addr_hint, unsigned int pages_per_huge_page,
- void (*process_subpage)(struct subpage_arg *sa,
- unsigned long base_addr, int idx))
+ void (*process_subpages)(struct subpage_arg *sa,
+ unsigned long base_addr, int lidx, int ridx))
{
int i, n, base, l;
unsigned long addr = addr_hint &
~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);

/* Process target subpage last to keep its cache lines hot */
- might_sleep();
n = (addr_hint - addr) / PAGE_SIZE;
+
if (2 * n <= pages_per_huge_page) {
/* If target subpage in first half of huge page */
base = 0;
l = n;
/* Process subpages at the end of huge page */
- for (i = pages_per_huge_page - 1; i >= 2 * n; i--) {
- cond_resched();
- process_subpage(sa, addr, i);
- }
+ process_subpages(sa, addr, 2*n, pages_per_huge_page-1);
} else {
/* If target subpage in second half of huge page */
base = pages_per_huge_page - 2 * (pages_per_huge_page - n);
l = pages_per_huge_page - n;
+
/* Process subpages at the begin of huge page */
- for (i = 0; i < base; i++) {
- cond_resched();
- process_subpage(sa, addr, i);
- }
+ process_subpages(sa, addr, 0, base);
}
/*
* Process remaining subpages in left-right-left-right pattern
@@ -5613,15 +5633,13 @@ static inline void process_huge_page(struct subpage_arg *sa,
int left_idx = base + i;
int right_idx = base + 2 * l - 1 - i;

- cond_resched();
- process_subpage(sa, addr, left_idx);
- cond_resched();
- process_subpage(sa, addr, right_idx);
+ process_subpages(sa, addr, left_idx, left_idx);
+ process_subpages(sa, addr, right_idx, right_idx);
}
}

static void clear_gigantic_page(struct page *page,
- unsigned long addr,
+ unsigned long base_addr,
unsigned int pages_per_huge_page)
{
int i;
@@ -5629,18 +5647,35 @@ static void clear_gigantic_page(struct page *page,

might_sleep();
for (i = 0; i < pages_per_huge_page;
- i++, p = mem_map_next(p, page, i)) {
+ i += clear_page_unit, p = mem_map_offset(page, i)) {
+ /*
+ * clear_page_unit is a factor of 1<<MAX_ORDER which
+ * guarantees that p[0] and p[clear_page_unit-1]
+ * never straddle a mem_map discontiguity.
+ */
+ clear_user_extent(p, base_addr + i * PAGE_SIZE, clear_page_unit);
cond_resched();
- clear_user_highpage(p, addr + i * PAGE_SIZE);
}
}

-static void clear_subpage(struct subpage_arg *sa,
- unsigned long base_addr, int idx)
+static void clear_subpages(struct subpage_arg *sa,
+ unsigned long base_addr, int lidx, int ridx)
{
struct page *page = sa->dst;
+ int i, n;

- clear_user_highpage(page + idx, base_addr + idx * PAGE_SIZE);
+ might_sleep();
+
+ for (i = lidx; i <= ridx; ) {
+ unsigned int remaining = (unsigned int) ridx - i + 1;
+
+ n = min(clear_page_unit, remaining);
+
+ clear_user_extent(page + i, base_addr + i * PAGE_SIZE, n);
+ i += n;
+
+ cond_resched();
+ }
}

void clear_huge_page(struct page *page,
@@ -5659,7 +5694,7 @@ void clear_huge_page(struct page *page,
return;
}

- process_huge_page(&sa, addr_hint, pages_per_huge_page, clear_subpage);
+ process_huge_page(&sa, addr_hint, pages_per_huge_page, clear_subpages);
}

static void copy_user_gigantic_page(struct page *dst, struct page *src,
@@ -5681,11 +5716,19 @@ static void copy_user_gigantic_page(struct page *dst, struct page *src,
}
}

-static void copy_subpage(struct subpage_arg *copy_arg,
- unsigned long base_addr, int idx)
+static void copy_subpages(struct subpage_arg *copy_arg,
+ unsigned long base_addr, int lidx, int ridx)
{
- copy_user_highpage(copy_arg->dst + idx, copy_arg->src + idx,
+ int idx;
+
+ might_sleep();
+
+ for (idx = lidx; idx <= ridx; idx++) {
+ copy_user_highpage(copy_arg->dst + idx, copy_arg->src + idx,
base_addr + idx * PAGE_SIZE, copy_arg->vma);
+
+ cond_resched();
+ }
}

void copy_user_huge_page(struct page *dst, struct page *src,
@@ -5706,7 +5749,7 @@ void copy_user_huge_page(struct page *dst, struct page *src,
return;
}

- process_huge_page(&sa, addr_hint, pages_per_huge_page, copy_subpage);
+ process_huge_page(&sa, addr_hint, pages_per_huge_page, copy_subpages);
}

long copy_huge_page_from_user(struct page *dst_page,
--
2.31.1

2022-06-06 22:22:44

by Ankur Arora

Subject: [PATCH v3 19/21] gup: hint non-caching if clearing large regions

When clearing a large region, or when the user explicitly hints
via FOLL_HINT_BULK that a call to get_user_pages() is part of a larger
region being gup'd, take the non-caching path.

One notable limitation is that this is only done when the underlying
pages are huge or gigantic, even if a large region composed of PAGE_SIZE
pages is being cleared. This is because non-caching stores are generally
weakly ordered and need some kind of store fence -- at PTE write
granularity -- to avoid data leakage. This is expensive enough to
negate any performance advantage.

Performance
==

System: Oracle X9-2c (2 nodes * 32 cores * 2 threads)
Processor: Intel Xeon(R) Platinum 8358 CPU @ 2.60GHz (Icelakex, 6:106:6)
Memory: 1024 GB evenly split between nodes
LLC-size: 48MB for each node (32-cores * 2-threads)
no_turbo: 1, Microcode: 0xd0002c1, scaling-governor: performance

System: Oracle E4-2c (2 nodes * 8 CCXes * 8 cores * 2 threads)
Processor: AMD EPYC 7J13 64-Core Processor (Milan, 25:1:1)
Memory: 512 GB evenly split between nodes
LLC-size: 32MB for each CCX (8-cores * 2-threads)
boost: 1, Microcode: 0xa00115d, scaling-governor: performance

Two workloads: qemu VM creation, where that is the exclusive load,
and, to probe how these changes interfere with the caches of
unrelated processes, a kbuild with a background page-clearing
workload.

Workload: create a 192GB qemu-VM (backed by preallocated 2MB
pages on the local node)
==

Icelakex
--
                         Time (s)              Delta (%)
 clear_pages_erms()      16.49 ( +- 0.06s )               # 12.50 bytes/ns
 clear_pages_movnt()      9.42 ( +- 0.20s )    -42.87%    # 21.88 bytes/ns

It is easy enough to see where the improvement is coming from -- given
the non-caching stores, the CPU does not need to do any RFOs, ending up
with way fewer L1-dcache-load-misses:

- 407,619,058 L1-dcache-loads # 24.746 M/sec ( +- 0.17% ) (69.20%)
- 3,245,399,461 L1-dcache-load-misses # 801.49% of all L1-dcache accesses ( +- 0.01% ) (69.22%)
+ 393,160,148 L1-dcache-loads # 41.786 M/sec ( +- 0.80% ) (69.22%)
+ 5,790,543 L1-dcache-load-misses # 1.50% of all L1-dcache accesses ( +- 1.55% ) (69.26%)

(Fuller perf stat output, at [1], [2].)

Milan
--
                         Time (s)              Delta (%)
 clear_pages_erms()      11.83 ( +- 0.08s )               # 17.42 bytes/ns
 clear_pages_clzero()     4.91 ( +- 0.27s )    -58.49%    # 41.98 bytes/ns

Milan does significantly fewer RFOs as well.

- 6,882,968,897 L1-dcache-loads # 582.960 M/sec ( +- 0.03% ) (33.38%)
- 3,267,546,914 L1-dcache-load-misses # 47.45% of all L1-dcache accesses ( +- 0.02% ) (33.37%)
+ 418,489,450 L1-dcache-loads # 85.611 M/sec ( +- 1.19% ) (33.46%)
+ 5,406,557 L1-dcache-load-misses # 1.35% of all L1-dcache accesses ( +- 1.07% ) (33.45%)

(Fuller perf stat output, at [3], [4].)

Workload: Kbuild with background clear_huge_page()
==

Probe the cache-pollution aspect of this commit with a kbuild
(make -j 32 bzImage) alongside a background process doing
clear_huge_page() via mmap(length=64GB, flags=MAP_POPULATE|MAP_HUGE_2MB)
in a loop.

The expectation -- assuming kbuild performance is partly cache
limited -- is that the kbuild would see a greater slowdown with a
clear_huge_page() -> clear_pages_erms() background load than with
clear_huge_page() -> clear_pages_movnt(). The kbuild itself does not
use THP or similar, so any performance changes are due to the
background load.

Icelakex
--

# kbuild: 16 cores, 32 threads
# clear_huge_page() load: single thread bound to the same CPUset
# taskset -c 16-31,80-95 perf stat -r 5 -ddd \
make -C .. -j 32 O=b2 clean bzImage

- 8,226,884,900,694 instructions # 1.09 insn per cycle ( +- 0.02% ) (47.27%)
+ 8,223,413,950,371 instructions # 1.12 insn per cycle ( +- 0.03% ) (47.31%)

- 20,016,410,480,886 slots # 6.565 G/sec ( +- 0.01% ) (69.84%)
- 1,310,070,777,023 topdown-be-bound # 6.1% backend bound ( +- 0.28% ) (69.84%)
+ 19,328,950,611,944 slots # 6.494 G/sec ( +- 0.02% ) (69.87%)
+ 1,043,408,291,623 topdown-be-bound # 5.0% backend bound ( +- 0.33% ) (69.87%)

- 10,747,834,729 LLC-loads # 3.525 M/sec ( +- 0.05% ) (69.68%)
- 4,841,355,743 LLC-load-misses # 45.02% of all LL-cache accesses ( +- 0.06% ) (69.70%)
+ 10,466,865,056 LLC-loads # 3.517 M/sec ( +- 0.08% ) (69.68%)
+ 4,206,944,783 LLC-load-misses # 40.21% of all LL-cache accesses ( +- 0.06% ) (69.71%)

The LLC-load-misses show a significant improvement (-13.11%) which is
borne out in the (-20.35%) reduction in topdown-be-bound and a (2.7%)
improvement in IPC.

- 7,521,157,276,899 cycles # 2.467 GHz ( +- 0.02% ) (39.65%)
+ 7,348,971,235,549 cycles # 2.469 GHz ( +- 0.04% ) (39.68%)

This ends up as an overall improvement in cycles of (-2.28%).

(Fuller perf stat output, at [5], [6].)

Milan
--

# kbuild: 2 CCxes, 16 cores, 32 threads
# clear_huge_page() load: single thread bound to the same CPUset
# taskset -c 16-31,144-159 perf stat -r 5 -ddd \
make -C .. -j 32 O=b2 clean bzImage

- 302,739,130,717 stalled-cycles-backend # 3.82% backend cycles idle ( +- 0.10% ) (41.11%)
+ 287,703,667,307 stalled-cycles-backend # 3.74% backend cycles idle ( +- 0.04% ) (41.11%)

- 8,981,403,534,446 instructions # 1.13 insn per cycle
+ 8,969,062,192,998 instructions # 1.16 insn per cycle

Milan sees a (-4.96%) improvement in stalled-cycles-backend and
a (+2.65%) improvement in IPC.

- 7,930,842,057,103 cycles # 2.338 GHz ( +- 0.04% ) (41.09%)
+ 7,705,812,395,365 cycles # 2.339 GHz ( +- 0.01% ) (41.11%)

This ends up as an overall improvement in cycles of (-2.83%).

(Fuller perf stat output, at [7], [8].)

[1] Icelakex, clear_pages_erms()
# perf stat -r 5 --all-kernel -ddd ./qemu.sh

Performance counter stats for './qemu.sh' (5 runs):

16,329.41 msec task-clock # 0.990 CPUs utilized ( +- 0.42% )
143 context-switches # 8.681 /sec ( +- 0.93% )
1 cpu-migrations # 0.061 /sec ( +- 63.25% )
118 page-faults # 7.164 /sec ( +- 0.27% )
41,735,523,673 cycles # 2.534 GHz ( +- 0.42% ) (38.46%)
1,454,116,543 instructions # 0.03 insn per cycle ( +- 0.49% ) (46.16%)
266,749,920 branches # 16.194 M/sec ( +- 0.41% ) (53.86%)
928,726 branch-misses # 0.35% of all branches ( +- 0.38% ) (61.54%)
208,805,754,709 slots # 12.676 G/sec ( +- 0.41% ) (69.23%)
5,355,889,366 topdown-retiring # 2.5% retiring ( +- 0.50% ) (69.23%)
12,720,749,784 topdown-bad-spec # 6.1% bad speculation ( +- 1.38% ) (69.23%)
998,710,552 topdown-fe-bound # 0.5% frontend bound ( +- 0.85% ) (69.23%)
192,653,197,875 topdown-be-bound # 90.9% backend bound ( +- 0.38% ) (69.23%)
407,619,058 L1-dcache-loads # 24.746 M/sec ( +- 0.17% ) (69.20%)
3,245,399,461 L1-dcache-load-misses # 801.49% of all L1-dcache accesses ( +- 0.01% ) (69.22%)
10,805,747 LLC-loads # 656.009 K/sec ( +- 0.37% ) (69.25%)
804,475 LLC-load-misses # 7.44% of all LL-cache accesses ( +- 2.73% ) (69.26%)
<not supported> L1-icache-loads
18,134,527 L1-icache-load-misses ( +- 1.24% ) (30.80%)
435,474,462 dTLB-loads # 26.437 M/sec ( +- 0.28% ) (30.80%)
41,187 dTLB-load-misses # 0.01% of all dTLB cache accesses ( +- 4.06% ) (30.79%)
<not supported> iTLB-loads
440,135 iTLB-load-misses ( +- 1.07% ) (30.78%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses

16.4906 +- 0.0676 seconds time elapsed ( +- 0.41% )

[2] Icelakex, clear_pages_movnt()
# perf stat -r 5 --all-kernel -ddd ./qemu.sh

Performance counter stats for './qemu.sh' (5 runs):

9,896.77 msec task-clock # 1.050 CPUs utilized ( +- 2.08% )
135 context-switches # 14.348 /sec ( +- 0.74% )
0 cpu-migrations # 0.000 /sec
116 page-faults # 12.329 /sec ( +- 0.50% )
25,239,642,558 cycles # 2.683 GHz ( +- 2.11% ) (38.43%)
36,791,658,500 instructions # 1.54 insn per cycle ( +- 0.06% ) (46.12%)
3,475,279,229 branches # 369.361 M/sec ( +- 0.09% ) (53.82%)
1,987,098 branch-misses # 0.06% of all branches ( +- 0.71% ) (61.51%)
126,256,220,768 slots # 13.419 G/sec ( +- 2.10% ) (69.21%)
57,705,186,453 topdown-retiring # 47.8% retiring ( +- 0.28% ) (69.21%)
5,934,729,245 topdown-bad-spec # 4.3% bad speculation ( +- 5.91% ) (69.21%)
4,089,990,217 topdown-fe-bound # 3.1% frontend bound ( +- 2.11% ) (69.21%)
60,298,426,167 topdown-be-bound # 44.8% backend bound ( +- 4.21% ) (69.21%)
393,160,148 L1-dcache-loads # 41.786 M/sec ( +- 0.80% ) (69.22%)
5,790,543 L1-dcache-load-misses # 1.50% of all L1-dcache accesses ( +- 1.55% ) (69.26%)
1,069,049 LLC-loads # 113.621 K/sec ( +- 1.25% ) (69.27%)
728,260 LLC-load-misses # 70.65% of all LL-cache accesses ( +- 2.63% ) (69.30%)
<not supported> L1-icache-loads
14,620,549 L1-icache-load-misses ( +- 1.27% ) (30.80%)
404,962,421 dTLB-loads # 43.040 M/sec ( +- 1.13% ) (30.80%)
31,916 dTLB-load-misses # 0.01% of all dTLB cache accesses ( +- 4.61% ) (30.77%)
<not supported> iTLB-loads
396,984 iTLB-load-misses ( +- 2.23% ) (30.74%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses

9.428 +- 0.206 seconds time elapsed ( +- 2.18% )

[3] Milan, clear_pages_erms()
# perf stat -r 5 --all-kernel -ddd ./qemu.sh

Performance counter stats for './qemu.sh' (5 runs):

11,676.79 msec task-clock # 0.987 CPUs utilized ( +- 0.68% )
96 context-switches # 8.131 /sec ( +- 0.78% )
2 cpu-migrations # 0.169 /sec ( +- 18.71% )
106 page-faults # 8.978 /sec ( +- 0.23% )
28,161,726,414 cycles # 2.385 GHz ( +- 0.69% ) (33.33%)
141,032,827 stalled-cycles-frontend # 0.50% frontend cycles idle ( +- 52.44% ) (33.35%)
796,792,139 stalled-cycles-backend # 2.80% backend cycles idle ( +- 23.73% ) (33.35%)
1,140,172,646 instructions # 0.04 insn per cycle
# 0.50 stalled cycles per insn ( +- 0.89% ) (33.35%)
219,864,061 branches # 18.622 M/sec ( +- 1.06% ) (33.36%)
1,407,446 branch-misses # 0.63% of all branches ( +- 10.66% ) (33.40%)
6,882,968,897 L1-dcache-loads # 582.960 M/sec ( +- 0.03% ) (33.38%)
3,267,546,914 L1-dcache-load-misses # 47.45% of all L1-dcache accesses ( +- 0.02% ) (33.37%)
<not supported> LLC-loads
<not supported> LLC-load-misses
146,901,513 L1-icache-loads # 12.442 M/sec ( +- 0.78% ) (33.36%)
1,462,155 L1-icache-load-misses # 0.99% of all L1-icache accesses ( +- 0.83% ) (33.34%)
2,055,805 dTLB-loads # 174.118 K/sec ( +- 22.56% ) (33.33%)
136,260 dTLB-load-misses # 4.69% of all dTLB cache accesses ( +- 23.13% ) (33.35%)
941 iTLB-loads # 79.699 /sec ( +- 5.54% ) (33.35%)
115,444 iTLB-load-misses # 14051.12% of all iTLB cache accesses ( +- 21.17% ) (33.34%)
95,438,373 L1-dcache-prefetches # 8.083 M/sec ( +- 19.99% ) (33.34%)
<not supported> L1-dcache-prefetch-misses

11.8296 +- 0.0805 seconds time elapsed ( +- 0.68% )

[4] Milan, clear_pages_clzero()
# perf stat -r 5 --all-kernel -ddd ./qemu.sh

Performance counter stats for './qemu.sh' (5 runs):

4,599.00 msec task-clock # 0.937 CPUs utilized ( +- 5.93% )
91 context-switches # 18.616 /sec ( +- 0.92% )
0 cpu-migrations # 0.000 /sec
107 page-faults # 21.889 /sec ( +- 0.19% )
10,975,453,059 cycles # 2.245 GHz ( +- 6.02% ) (33.28%)
14,193,355 stalled-cycles-frontend # 0.12% frontend cycles idle ( +- 1.90% ) (33.35%)
38,969,144 stalled-cycles-backend # 0.33% backend cycles idle ( +- 23.92% ) (33.34%)
13,951,880,530 instructions # 1.20 insn per cycle
# 0.00 stalled cycles per insn ( +- 0.11% ) (33.33%)
3,426,708,418 branches # 701.003 M/sec ( +- 0.06% ) (33.36%)
2,350,619 branch-misses # 0.07% of all branches ( +- 0.61% ) (33.45%)
418,489,450 L1-dcache-loads # 85.611 M/sec ( +- 1.19% ) (33.46%)
5,406,557 L1-dcache-load-misses # 1.35% of all L1-dcache accesses ( +- 1.07% ) (33.45%)
<not supported> LLC-loads
<not supported> LLC-load-misses
90,088,059 L1-icache-loads # 18.429 M/sec ( +- 0.36% ) (33.44%)
1,081,035 L1-icache-load-misses # 1.20% of all L1-icache accesses ( +- 3.67% ) (33.42%)
4,017,464 dTLB-loads # 821.854 K/sec ( +- 1.02% ) (33.40%)
204,096 dTLB-load-misses # 5.22% of all dTLB cache accesses ( +- 9.77% ) (33.36%)
770 iTLB-loads # 157.519 /sec ( +- 5.12% ) (33.36%)
209,834 iTLB-load-misses # 29479.35% of all iTLB cache accesses ( +- 0.17% ) (33.34%)
1,596,265 L1-dcache-prefetches # 326.548 K/sec ( +- 1.55% ) (33.31%)
<not supported> L1-dcache-prefetch-misses

4.908 +- 0.272 seconds time elapsed ( +- 5.54% )

[5] Icelakex, kbuild + bg:clear_pages_erms() load.
# taskset -c 16-31,80-95 perf stat -r 5 -ddd \
make -C .. -j 32 O=b2 clean bzImage

Performance counter stats for 'make -C .. -j 32 O=b2 clean bzImage' (5 runs):

3,047,329.07 msec task-clock # 19.520 CPUs utilized ( +- 0.02% )
1,675,061 context-switches # 549.415 /sec ( +- 0.43% )
89,232 cpu-migrations # 29.268 /sec ( +- 2.34% )
85,752,972 page-faults # 28.127 K/sec ( +- 0.00% )
7,521,157,276,899 cycles # 2.467 GHz ( +- 0.02% ) (39.65%)
8,226,884,900,694 instructions # 1.09 insn per cycle ( +- 0.02% ) (47.27%)
1,744,557,848,503 branches # 572.209 M/sec ( +- 0.02% ) (54.83%)
36,252,079,075 branch-misses # 2.08% of all branches ( +- 0.02% ) (62.35%)
20,016,410,480,886 slots # 6.565 G/sec ( +- 0.01% ) (69.84%)
6,518,990,385,998 topdown-retiring # 30.5% retiring ( +- 0.02% ) (69.84%)
7,821,817,193,732 topdown-bad-spec # 36.7% bad speculation ( +- 0.29% ) (69.84%)
5,714,082,318,274 topdown-fe-bound # 26.7% frontend bound ( +- 0.10% ) (69.84%)
1,310,070,777,023 topdown-be-bound # 6.1% backend bound ( +- 0.28% ) (69.84%)
2,270,017,283,501 L1-dcache-loads # 744.558 M/sec ( +- 0.02% ) (69.60%)
103,295,556,544 L1-dcache-load-misses # 4.55% of all L1-dcache accesses ( +- 0.02% ) (69.64%)
10,747,834,729 LLC-loads # 3.525 M/sec ( +- 0.05% ) (69.68%)
4,841,355,743 LLC-load-misses # 45.02% of all LL-cache accesses ( +- 0.06% ) (69.70%)
<not supported> L1-icache-loads
180,672,238,145 L1-icache-load-misses ( +- 0.03% ) (31.18%)
2,216,149,664,522 dTLB-loads # 726.890 M/sec ( +- 0.03% ) (31.83%)
2,000,781,326 dTLB-load-misses # 0.09% of all dTLB cache accesses ( +- 0.08% ) (31.79%)
<not supported> iTLB-loads
1,938,124,234 iTLB-load-misses ( +- 0.04% ) (31.76%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses

156.1136 +- 0.0785 seconds time elapsed ( +- 0.05% )

[6] Icelakex, kbuild + bg:clear_pages_movnt() load.
# taskset -c 16-31,80-95 perf stat -r 5 -ddd \
make -C .. -j 32 O=b2 clean bzImage

Performance counter stats for 'make -C .. -j 32 O=b2 clean bzImage' (5 runs):

2,978,535.47 msec task-clock # 19.471 CPUs utilized ( +- 0.05% )
1,637,295 context-switches # 550.105 /sec ( +- 0.89% )
91,635 cpu-migrations # 30.788 /sec ( +- 1.88% )
85,754,138 page-faults # 28.812 K/sec ( +- 0.00% )
7,348,971,235,549 cycles # 2.469 GHz ( +- 0.04% ) (39.68%)
8,223,413,950,371 instructions # 1.12 insn per cycle ( +- 0.03% ) (47.31%)
1,743,914,970,674 branches # 585.928 M/sec ( +- 0.01% ) (54.87%)
36,188,623,655 branch-misses # 2.07% of all branches ( +- 0.05% ) (62.39%)
19,328,950,611,944 slots # 6.494 G/sec ( +- 0.02% ) (69.87%)
6,508,801,041,075 topdown-retiring # 31.7% retiring ( +- 0.35% ) (69.87%)
7,581,383,615,462 topdown-bad-spec # 36.4% bad speculation ( +- 0.43% ) (69.87%)
5,521,686,808,149 topdown-fe-bound # 26.8% frontend bound ( +- 0.14% ) (69.87%)
1,043,408,291,623 topdown-be-bound # 5.0% backend bound ( +- 0.33% ) (69.87%)
2,269,475,492,575 L1-dcache-loads # 762.507 M/sec ( +- 0.03% ) (69.63%)
101,544,979,642 L1-dcache-load-misses # 4.47% of all L1-dcache accesses ( +- 0.05% ) (69.66%)
10,466,865,056 LLC-loads # 3.517 M/sec ( +- 0.08% ) (69.68%)
4,206,944,783 LLC-load-misses # 40.21% of all LL-cache accesses ( +- 0.06% ) (69.71%)
<not supported> L1-icache-loads
180,267,126,923 L1-icache-load-misses ( +- 0.07% ) (31.17%)
2,216,010,317,050 dTLB-loads # 744.544 M/sec ( +- 0.03% ) (31.82%)
1,979,801,744 dTLB-load-misses # 0.09% of all dTLB cache accesses ( +- 0.10% ) (31.79%)
<not supported> iTLB-loads
1,925,390,304 iTLB-load-misses ( +- 0.08% ) (31.77%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses

152.972 +- 0.309 seconds time elapsed ( +- 0.20% )

[7] Milan, clear_pages_erms()
# taskset -c 16-31,144-159 perf stat -r 5 -ddd \
make -C .. -j 32 O=b2 clean bzImage

Performance counter stats for 'make -C .. -j 32 O=b2 clean bzImage' (5 runs):

3,390,130.53 msec task-clock # 18.241 CPUs utilized ( +- 0.04% )
1,720,283 context-switches # 507.160 /sec ( +- 0.27% )
96,694 cpu-migrations # 28.507 /sec ( +- 1.41% )
75,872,994 page-faults # 22.368 K/sec ( +- 0.00% )
7,930,842,057,103 cycles # 2.338 GHz ( +- 0.04% ) (41.09%)
39,974,518,172 stalled-cycles-frontend # 0.50% frontend cycles idle ( +- 0.05% ) (41.10%)
302,739,130,717 stalled-cycles-backend # 3.82% backend cycles idle ( +- 0.10% ) (41.11%)
8,981,403,534,446 instructions # 1.13 insn per cycle
# 0.03 stalled cycles per insn ( +- 0.03% ) (41.10%)
1,909,303,327,220 branches # 562.886 M/sec ( +- 0.02% ) (41.10%)
50,324,935,298 branch-misses # 2.64% of all branches ( +- 0.02% ) (41.09%)
3,563,297,595,796 L1-dcache-loads # 1.051 G/sec ( +- 0.03% ) (41.08%)
129,901,339,258 L1-dcache-load-misses # 3.65% of all L1-dcache accesses ( +- 0.10% ) (41.07%)
<not supported> LLC-loads
<not supported> LLC-load-misses
809,770,606,566 L1-icache-loads # 238.730 M/sec ( +- 0.03% ) (41.07%)
12,403,758,671 L1-icache-load-misses # 1.53% of all L1-icache accesses ( +- 0.08% ) (41.07%)
60,010,026,089 dTLB-loads # 17.692 M/sec ( +- 0.04% ) (41.07%)
3,254,066,681 dTLB-load-misses # 5.42% of all dTLB cache accesses ( +- 0.09% ) (41.07%)
5,195,070,952 iTLB-loads # 1.532 M/sec ( +- 0.03% ) (41.08%)
489,196,395 iTLB-load-misses # 9.42% of all iTLB cache accesses ( +- 0.10% ) (41.09%)
39,920,161,716 L1-dcache-prefetches # 11.769 M/sec ( +- 0.03% ) (41.09%)
<not supported> L1-dcache-prefetch-misses

185.852 +- 0.501 seconds time elapsed ( +- 0.27% )

[8] Milan, clear_pages_clzero()
# taskset -c 16-31,144-159 perf stat -r 5 -ddd \
make -C .. -j 32 O=b2 clean bzImage

Performance counter stats for 'make -C .. -j 32 O=b2 clean bzImage' (5 runs):

3,296,677.12 msec task-clock # 18.051 CPUs utilized ( +- 0.02% )
1,713,645 context-switches # 520.062 /sec ( +- 0.26% )
91,883 cpu-migrations # 27.885 /sec ( +- 0.83% )
75,877,740 page-faults # 23.028 K/sec ( +- 0.00% )
7,705,812,395,365 cycles # 2.339 GHz ( +- 0.01% ) (41.11%)
38,866,265,031 stalled-cycles-frontend # 0.50% frontend cycles idle ( +- 0.09% ) (41.10%)
287,703,667,307 stalled-cycles-backend # 3.74% backend cycles idle ( +- 0.04% ) (41.11%)
8,969,062,192,998 instructions # 1.16 insn per cycle
# 0.03 stalled cycles per insn ( +- 0.01% ) (41.11%)
1,906,857,866,689 branches # 578.699 M/sec ( +- 0.01% ) (41.10%)
50,155,411,444 branch-misses # 2.63% of all branches ( +- 0.03% ) (41.11%)
3,552,652,190,906 L1-dcache-loads # 1.078 G/sec ( +- 0.01% ) (41.13%)
127,238,478,917 L1-dcache-load-misses # 3.58% of all L1-dcache accesses ( +- 0.04% ) (41.13%)
<not supported> LLC-loads
<not supported> LLC-load-misses
808,024,730,682 L1-icache-loads # 245.222 M/sec ( +- 0.03% ) (41.13%)
7,773,178,107 L1-icache-load-misses # 0.96% of all L1-icache accesses ( +- 0.11% ) (41.13%)
59,684,355,294 dTLB-loads # 18.113 M/sec ( +- 0.04% ) (41.12%)
3,247,521,154 dTLB-load-misses # 5.44% of all dTLB cache accesses ( +- 0.04% ) (41.12%)
5,064,547,530 iTLB-loads # 1.537 M/sec ( +- 0.09% ) (41.12%)
462,977,175 iTLB-load-misses # 9.13% of all iTLB cache accesses ( +- 0.07% ) (41.12%)
39,307,810,241 L1-dcache-prefetches # 11.929 M/sec ( +- 0.06% ) (41.11%)
<not supported> L1-dcache-prefetch-misses

182.630 +- 0.365 seconds time elapsed ( +- 0.20% )

Signed-off-by: Ankur Arora <[email protected]>
---

Notes:
Not sure if this wall of perf-stats (or indeed the whole kbuild test) is
warranted here.

To my eyes, there's no non-obvious information in the performance results
(reducing cache usage should and does lead to other processes getting a small
bump in performance), so is there any value in keeping this in the commit
message?

fs/hugetlbfs/inode.c | 7 ++++++-
mm/gup.c | 18 ++++++++++++++++++
mm/huge_memory.c | 2 +-
mm/hugetlb.c | 9 ++++++++-
4 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 62408047e8d7..993bb7227a2f 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -650,6 +650,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
loff_t hpage_size = huge_page_size(h);
unsigned long hpage_shift = huge_page_shift(h);
pgoff_t start, index, end;
+ bool hint_non_caching;
int error;
u32 hash;

@@ -667,6 +668,9 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
start = offset >> hpage_shift;
end = (offset + len + hpage_size - 1) >> hpage_shift;

+ /* Don't pollute the cache if we are fallocate'ing a large region. */
+ hint_non_caching = clear_page_prefer_non_caching((end - start) << hpage_shift);
+
inode_lock(inode);

/* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
@@ -745,7 +749,8 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
error = PTR_ERR(page);
goto out;
}
- clear_huge_page(page, addr, pages_per_huge_page(h));
+ clear_huge_page(page, addr, pages_per_huge_page(h),
+ hint_non_caching);
__SetPageUptodate(page);
error = huge_add_to_page_cache(page, mapping, index);
if (unlikely(error)) {
diff --git a/mm/gup.c b/mm/gup.c
index 551264407624..bceb6ff64687 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -944,6 +944,13 @@ static int faultin_page(struct vm_area_struct *vma,
*/
fault_flags |= FAULT_FLAG_TRIED;
}
+ if (*flags & FOLL_HINT_BULK) {
+ /*
+ * This page is part of a large region being faulted-in
+ * so attempt to minimize cache-pollution.
+ */
+ fault_flags |= FAULT_FLAG_NON_CACHING;
+ }
if (unshare) {
fault_flags |= FAULT_FLAG_UNSHARE;
/* FAULT_FLAG_WRITE and FAULT_FLAG_UNSHARE are incompatible */
@@ -1116,6 +1123,17 @@ static long __get_user_pages(struct mm_struct *mm,
if (!(gup_flags & FOLL_FORCE))
gup_flags |= FOLL_NUMA;

+ /*
+ * Non-cached page clearing is generally faster when clearing regions
+ * larger than O(LLC-size). So hint the non-caching path based on
+ * clear_page_prefer_non_caching().
+ *
+ * Note, however this check is optimistic -- nr_pages is the upper
+ * limit and we might be clearing less than that.
+ */
+ if (clear_page_prefer_non_caching(nr_pages * PAGE_SIZE))
+ gup_flags |= FOLL_HINT_BULK;
+
do {
struct page *page;
unsigned int foll_flags = gup_flags;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 73654db77a1c..c7294cffc384 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -594,7 +594,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
pgtable_t pgtable;
unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
vm_fault_t ret = 0;
- bool non_cached = false;
+ bool non_cached = vmf->flags & FAULT_FLAG_NON_CACHING;

VM_BUG_ON_PAGE(!PageCompound(page), page);

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0c4a31b5c1e9..d906c6558b15 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5481,7 +5481,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
spinlock_t *ptl;
unsigned long haddr = address & huge_page_mask(h);
bool new_page, new_pagecache_page = false;
- bool non_cached = false;
+ bool non_cached = flags & FAULT_FLAG_NON_CACHING;

/*
* Currently, we are forced to kill the process in the event the
@@ -6182,6 +6182,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
*/
fault_flags |= FAULT_FLAG_TRIED;
}
+ if (flags & FOLL_HINT_BULK) {
+ /*
+ * From the user hint, we might be faulting-in
+ * a large region so minimize cache-pollution.
+ */
+ fault_flags |= FAULT_FLAG_NON_CACHING;
+ }
ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
if (ret & VM_FAULT_ERROR) {
err = vm_fault_to_errno(ret, flags);
--
2.31.1

2022-06-06 23:07:37

by Ankur Arora

Subject: [PATCH v3 10/21] x86/asm: add clear_pages_clzero()

Add clear_pages_clzero(), which uses CLZERO as the clearing primitive.
CLZERO skips the memory hierarchy, so this provides a non-polluting
implementation of clear_page(). Available if X86_FEATURE_CLZERO is set.

CLZERO, from the AMD architecture guide (Vol 3, Rev 3.30):
"Clears the cache line specified by the logical address in rAX by
writing a zero to every byte in the line. The instruction uses an
implied non temporal memory type, similar to a streaming store, and
uses the write combining protocol to minimize cache pollution.

CLZERO is weakly-ordered with respect to other instructions that
operate on memory. Software should use an SFENCE or stronger to
enforce memory ordering of CLZERO with respect to other store
instructions.

The CLZERO instruction executes at any privilege level. CLZERO
performs all the segmentation and paging checks that a store of
the specified cache line would perform."

The use-case is similar to clear_pages_movnt(), except that
clear_pages_clzero() is expected to be more performant.
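
A hedged usage sketch (the SFENCE requirement comes from the comment
in the diff below; a later patch wraps this detail inside
clear_pages_incoherent()):

  if (static_cpu_has(X86_FEATURE_CLZERO)) {
          clear_pages_clzero(page_address(page), npages);
          /* CLZERO stores are weakly ordered; fence before publishing. */
          wmb();  /* SFENCE */
  }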

Cc: [email protected]
Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/include/asm/page_64.h | 1 +
arch/x86/lib/clear_page_64.S | 19 +++++++++++++++++++
2 files changed, 20 insertions(+)

diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 3affc4ecb8da..e8d4698fda65 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -56,6 +56,7 @@ void clear_pages_orig(void *page, unsigned long npages);
void clear_pages_rep(void *page, unsigned long npages);
void clear_pages_erms(void *page, unsigned long npages);
void clear_pages_movnt(void *page, unsigned long npages);
+void clear_pages_clzero(void *page, unsigned long npages);

#define __HAVE_ARCH_CLEAR_USER_PAGES
static inline void clear_pages(void *page, unsigned int npages)
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index 83d14f1c9f57..00203103cf77 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -79,3 +79,22 @@ SYM_FUNC_START(clear_pages_movnt)
ja .Lstart
RET
SYM_FUNC_END(clear_pages_movnt)
+
+/*
+ * Zero a page using clzero (On AMD, with CPU_FEATURE_CLZERO.)
+ *
+ * Caller needs to issue a sfence at the end.
+ */
+SYM_FUNC_START(clear_pages_clzero)
+ movq %rdi,%rax
+ movq %rsi,%rcx
+ shlq $PAGE_SHIFT, %rcx
+
+ .p2align 4
+.Liter:
+ clzero
+ addq $0x40, %rax
+ subl $0x40, %ecx
+ ja .Liter
+ RET
+SYM_FUNC_END(clear_pages_clzero)
--
2.31.1

2022-06-06 23:41:52

by Ankur Arora

Subject: [PATCH v3 03/21] clear_page: add generic clear_user_pages()

Add generic clear_user_pages() which operates on contiguous
PAGE_SIZE'd chunks via an arch defined primitive.

The generic version defines:
#define ARCH_MAX_CLEAR_PAGES_ORDER 0
so clear_user_pages() would fall back to clear_user_page().

An arch can expose this by defining __HAVE_ARCH_CLEAR_USER_PAGES.

Also add clear_user_highpages() which either funnels through
to clear_user_pages() or does the clearing page-at-a-time.
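
For context, a hedged sketch of what an arch opt-in looks like
(modeled on the x86 patch later in this series; the
ARCH_MAX_CLEAR_PAGES_ORDER value is illustrative):

  #define ARCH_MAX_CLEAR_PAGES_ORDER     3

  #define __HAVE_ARCH_CLEAR_USER_PAGES
  static inline void clear_user_pages(void *addr, unsigned long vaddr,
                                      struct page *start_page,
                                      unsigned int npages)
  {
          clear_pages(addr, npages);     /* arch extent primitive */
  }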

Signed-off-by: Ankur Arora <[email protected]>
---

Notes:
1. I'm not sure that a new header asm-generic/clear_page.h is ideal.

The logical place for this is asm-generic/page.h itself. However, only
H8300 includes that and so this (and the next few patches) would need
a stub everywhere else.
(Just rechecked and looks like arch/h8300 is no more.)

If adding a new header looks reasonable to the community, I'm happy
to move clear_user_page(), copy_user_page() stubs out to this file.
(Note that patches further on add non-caching clear_user_pages()
as well.)

Or, if asm-generic/page.h is the best place, then add stubs
everywhere else.

2. Shoehorning a multiple page operation into CONFIG_HIGHMEM seems
ugly but seemed like the best choice from a bad set of options.
Is there a better way of doing this?

arch/alpha/include/asm/page.h | 1 +
arch/arc/include/asm/page.h | 1 +
arch/arm/include/asm/page.h | 1 +
arch/arm64/include/asm/page.h | 1 +
arch/csky/include/asm/page.h | 1 +
arch/hexagon/include/asm/page.h | 1 +
arch/ia64/include/asm/page.h | 1 +
arch/m68k/include/asm/page.h | 1 +
arch/microblaze/include/asm/page.h | 1 +
arch/mips/include/asm/page.h | 1 +
arch/nios2/include/asm/page.h | 2 ++
arch/openrisc/include/asm/page.h | 1 +
arch/parisc/include/asm/page.h | 1 +
arch/powerpc/include/asm/page.h | 1 +
arch/riscv/include/asm/page.h | 1 +
arch/s390/include/asm/page.h | 1 +
arch/sh/include/asm/page.h | 1 +
arch/sparc/include/asm/page_32.h | 1 +
arch/sparc/include/asm/page_64.h | 1 +
arch/um/include/asm/page.h | 1 +
arch/x86/include/asm/page.h | 1 +
arch/xtensa/include/asm/page.h | 1 +
include/asm-generic/clear_page.h | 44 ++++++++++++++++++++++++++++++
include/asm-generic/page.h | 1 +
include/linux/highmem.h | 23 ++++++++++++++++
25 files changed, 91 insertions(+)
create mode 100644 include/asm-generic/clear_page.h

diff --git a/arch/alpha/include/asm/page.h b/arch/alpha/include/asm/page.h
index 8f3f5eecba28..2d3b099e165c 100644
--- a/arch/alpha/include/asm/page.h
+++ b/arch/alpha/include/asm/page.h
@@ -93,5 +93,6 @@ typedef struct page *pgtable_t;

#include <asm-generic/memory_model.h>
#include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>

#endif /* _ALPHA_PAGE_H */
diff --git a/arch/arc/include/asm/page.h b/arch/arc/include/asm/page.h
index 9a62e1d87967..abdbef6897bf 100644
--- a/arch/arc/include/asm/page.h
+++ b/arch/arc/include/asm/page.h
@@ -133,6 +133,7 @@ extern int pfn_valid(unsigned long pfn);

#include <asm-generic/memory_model.h> /* page_to_pfn, pfn_to_page */
#include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>

#endif /* !__ASSEMBLY__ */

diff --git a/arch/arm/include/asm/page.h b/arch/arm/include/asm/page.h
index 5fcc8a600e36..ba244baca1fa 100644
--- a/arch/arm/include/asm/page.h
+++ b/arch/arm/include/asm/page.h
@@ -167,5 +167,6 @@ extern int pfn_valid(unsigned long);
#define VM_DATA_DEFAULT_FLAGS VM_DATA_FLAGS_TSK_EXEC

#include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>

#endif
diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 993a27ea6f54..8407ac2b5d68 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -50,5 +50,6 @@ int pfn_is_map_memory(unsigned long pfn);
#define VM_DATA_DEFAULT_FLAGS (VM_DATA_FLAGS_TSK_EXEC | VM_MTE_ALLOWED)

#include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>

#endif
diff --git a/arch/csky/include/asm/page.h b/arch/csky/include/asm/page.h
index ed7451478b1b..47cc27d4ede1 100644
--- a/arch/csky/include/asm/page.h
+++ b/arch/csky/include/asm/page.h
@@ -89,6 +89,7 @@ extern unsigned long va_pa_offset;

#include <asm-generic/memory_model.h>
#include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>

#endif /* !__ASSEMBLY__ */
#endif /* __ASM_CSKY_PAGE_H */
diff --git a/arch/hexagon/include/asm/page.h b/arch/hexagon/include/asm/page.h
index 7cbf719c578e..e7a8edd6903a 100644
--- a/arch/hexagon/include/asm/page.h
+++ b/arch/hexagon/include/asm/page.h
@@ -142,6 +142,7 @@ static inline void clear_page(void *page)
#include <asm-generic/memory_model.h>
/* XXX Todo: implement assembly-optimized version of getorder. */
#include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>

#endif /* ifdef __ASSEMBLY__ */
#endif /* ifdef __KERNEL__ */
diff --git a/arch/ia64/include/asm/page.h b/arch/ia64/include/asm/page.h
index 1b990466d540..1feae333e250 100644
--- a/arch/ia64/include/asm/page.h
+++ b/arch/ia64/include/asm/page.h
@@ -96,6 +96,7 @@ do { \
#define virt_addr_valid(kaddr) pfn_valid(__pa(kaddr) >> PAGE_SHIFT)

#include <asm-generic/memory_model.h>
+#include <asm-generic/clear_page.h>

#ifdef CONFIG_FLATMEM
# define pfn_valid(pfn) ((pfn) < max_mapnr)
diff --git a/arch/m68k/include/asm/page.h b/arch/m68k/include/asm/page.h
index 2f1c54e4725d..1aeaae820670 100644
--- a/arch/m68k/include/asm/page.h
+++ b/arch/m68k/include/asm/page.h
@@ -68,5 +68,6 @@ extern unsigned long _ramend;
#endif

#include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>

#endif /* _M68K_PAGE_H */
diff --git a/arch/microblaze/include/asm/page.h b/arch/microblaze/include/asm/page.h
index 4b8b2fa78fc5..baa03569477a 100644
--- a/arch/microblaze/include/asm/page.h
+++ b/arch/microblaze/include/asm/page.h
@@ -137,5 +137,6 @@ extern int page_is_ram(unsigned long pfn);

#include <asm-generic/memory_model.h>
#include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>

#endif /* _ASM_MICROBLAZE_PAGE_H */
diff --git a/arch/mips/include/asm/page.h b/arch/mips/include/asm/page.h
index 96bc798c1ec1..3dde03bf99f3 100644
--- a/arch/mips/include/asm/page.h
+++ b/arch/mips/include/asm/page.h
@@ -269,5 +269,6 @@ static inline unsigned long kaslr_offset(void)

#include <asm-generic/memory_model.h>
#include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>

#endif /* _ASM_PAGE_H */
diff --git a/arch/nios2/include/asm/page.h b/arch/nios2/include/asm/page.h
index 6a989819a7c1..9763048bd3ed 100644
--- a/arch/nios2/include/asm/page.h
+++ b/arch/nios2/include/asm/page.h
@@ -104,6 +104,8 @@ static inline bool pfn_valid(unsigned long pfn)

#include <asm-generic/getorder.h>

+#include <asm-generic/clear_page.h>
+
#endif /* !__ASSEMBLY__ */

#endif /* _ASM_NIOS2_PAGE_H */
diff --git a/arch/openrisc/include/asm/page.h b/arch/openrisc/include/asm/page.h
index aab6e64d6db4..879419c00cd4 100644
--- a/arch/openrisc/include/asm/page.h
+++ b/arch/openrisc/include/asm/page.h
@@ -88,5 +88,6 @@ typedef struct page *pgtable_t;

#include <asm-generic/memory_model.h>
#include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>

#endif /* __ASM_OPENRISC_PAGE_H */
diff --git a/arch/parisc/include/asm/page.h b/arch/parisc/include/asm/page.h
index 6faaaa3ebe9b..961f88d6ff63 100644
--- a/arch/parisc/include/asm/page.h
+++ b/arch/parisc/include/asm/page.h
@@ -184,6 +184,7 @@ extern int npmem_ranges;

#include <asm-generic/memory_model.h>
#include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>
#include <asm/pdc.h>

#define PAGE0 ((struct zeropage *)absolute_pointer(__PAGE_OFFSET))
diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index e5f75c70eda8..4742b1f99a3e 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -335,6 +335,7 @@ static inline unsigned long kaslr_offset(void)
}

#include <asm-generic/memory_model.h>
+#include <asm-generic/clear_page.h>
#endif /* __ASSEMBLY__ */

#endif /* _ASM_POWERPC_PAGE_H */
diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index 1526e410e802..ce9005ffccb0 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -188,5 +188,6 @@ extern phys_addr_t __phys_addr_symbol(unsigned long x);

#include <asm-generic/memory_model.h>
#include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>

#endif /* _ASM_RISCV_PAGE_H */
diff --git a/arch/s390/include/asm/page.h b/arch/s390/include/asm/page.h
index 61dea67bb9c7..7a598f86ae39 100644
--- a/arch/s390/include/asm/page.h
+++ b/arch/s390/include/asm/page.h
@@ -207,5 +207,6 @@ int arch_make_page_accessible(struct page *page);

#include <asm-generic/memory_model.h>
#include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>

#endif /* _S390_PAGE_H */
diff --git a/arch/sh/include/asm/page.h b/arch/sh/include/asm/page.h
index eca5daa43b93..5e49bb342c2c 100644
--- a/arch/sh/include/asm/page.h
+++ b/arch/sh/include/asm/page.h
@@ -176,6 +176,7 @@ typedef struct page *pgtable_t;

#include <asm-generic/memory_model.h>
#include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>

/*
* Some drivers need to perform DMA into kmalloc'ed buffers
diff --git a/arch/sparc/include/asm/page_32.h b/arch/sparc/include/asm/page_32.h
index fff8861df107..2f061d9a5a30 100644
--- a/arch/sparc/include/asm/page_32.h
+++ b/arch/sparc/include/asm/page_32.h
@@ -135,5 +135,6 @@ extern unsigned long pfn_base;

#include <asm-generic/memory_model.h>
#include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>

#endif /* _SPARC_PAGE_H */
diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index 254dffd85fb1..2026bf92e3e7 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -159,5 +159,6 @@ extern unsigned long PAGE_OFFSET;
#endif /* !(__ASSEMBLY__) */

#include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>

#endif /* _SPARC64_PAGE_H */
diff --git a/arch/um/include/asm/page.h b/arch/um/include/asm/page.h
index 95af12e82a32..79768ad6069c 100644
--- a/arch/um/include/asm/page.h
+++ b/arch/um/include/asm/page.h
@@ -113,6 +113,7 @@ extern unsigned long uml_physmem;

#include <asm-generic/memory_model.h>
#include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>

#endif /* __ASSEMBLY__ */

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 9cc82f305f4b..5a246a2a66aa 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -85,6 +85,7 @@ static __always_inline u64 __is_canonical_address(u64 vaddr, u8 vaddr_bits)

#include <asm-generic/memory_model.h>
#include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>

#define HAVE_ARCH_HUGETLB_UNMAPPED_AREA

diff --git a/arch/xtensa/include/asm/page.h b/arch/xtensa/include/asm/page.h
index 493eb7083b1a..2812f2bea844 100644
--- a/arch/xtensa/include/asm/page.h
+++ b/arch/xtensa/include/asm/page.h
@@ -200,4 +200,5 @@ static inline unsigned long ___pa(unsigned long va)
#endif /* __ASSEMBLY__ */

#include <asm-generic/memory_model.h>
+#include <asm-generic/clear_page.h>
#endif /* _XTENSA_PAGE_H */
diff --git a/include/asm-generic/clear_page.h b/include/asm-generic/clear_page.h
new file mode 100644
index 000000000000..f827d661519c
--- /dev/null
+++ b/include/asm-generic/clear_page.h
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_GENERIC_CLEAR_PAGE_H
+#define __ASM_GENERIC_CLEAR_PAGE_H
+
+/*
+ * clear_user_pages() operates on contiguous pages and does the clearing
+ * operation in a single arch defined primitive.
+ *
+ * To do this, arch code defines clear_user_pages() and the max granularity
+ * it can handle via ARCH_MAX_CLEAR_PAGES_ORDER.
+ *
+ * Note that given the need for contiguity, __HAVE_ARCH_CLEAR_USER_PAGES
+ * and CONFIG_HIGHMEM are mutually exclusive.
+ */
+
+#if defined(CONFIG_HIGHMEM) && defined(__HAVE_ARCH_CLEAR_USER_PAGES)
+#error CONFIG_HIGHMEM is incompatible with __HAVE_ARCH_CLEAR_USER_PAGES
+#endif
+
+#ifndef __HAVE_ARCH_CLEAR_USER_PAGES
+
+/*
+ * For architectures that do not expose __HAVE_ARCH_CLEAR_USER_PAGES, set
+ * the granularity to be identical to clear_user_page().
+ */
+#define ARCH_MAX_CLEAR_PAGES_ORDER 0
+
+#ifndef __ASSEMBLY__
+
+/*
+ * With ARCH_MAX_CLEAR_PAGES_ORDER == 0, all callers should be specifying
+ * npages == 1 and so we just fallback to clear_user_page().
+ */
+static inline void clear_user_pages(void *page, unsigned long vaddr,
+ struct page *start_page, unsigned int npages)
+{
+ clear_user_page(page, vaddr, start_page);
+}
+#endif /* __ASSEMBLY__ */
+#endif /* __HAVE_ARCH_CLEAR_USER_PAGES */
+
+#define ARCH_MAX_CLEAR_PAGES (1 << ARCH_MAX_CLEAR_PAGES_ORDER)
+
+#endif /* __ASM_GENERIC_CLEAR_PAGE_H */
diff --git a/include/asm-generic/page.h b/include/asm-generic/page.h
index 6fc47561814c..060094e7f964 100644
--- a/include/asm-generic/page.h
+++ b/include/asm-generic/page.h
@@ -93,5 +93,6 @@ extern unsigned long memory_end;

#include <asm-generic/memory_model.h>
#include <asm-generic/getorder.h>
+#include <asm-generic/clear_page.h>

#endif /* __ASM_GENERIC_PAGE_H */
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 3af34de54330..08781d7693e7 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -208,6 +208,29 @@ static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
}
#endif

+#ifdef __HAVE_ARCH_CLEAR_USER_PAGES
+static inline void clear_user_highpages(struct page *page, unsigned long vaddr,
+ unsigned int npages)
+{
+ void *addr = page_address(page);
+
+ clear_user_pages(addr, vaddr, page, npages);
+}
+#else
+static inline void clear_user_highpages(struct page *page, unsigned long vaddr,
+ unsigned int npages)
+{
+ void *addr;
+ unsigned int i;
+
+ for (i = 0; i < npages; i++, page++, vaddr += PAGE_SIZE) {
+ addr = kmap_local_page(page);
+ clear_user_page(addr, vaddr, page);
+ kunmap_local(addr);
+ }
+}
+#endif /* __HAVE_ARCH_CLEAR_USER_PAGES */
+
#ifndef __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE_MOVABLE
/**
* alloc_zeroed_user_highpage_movable - Allocate a zeroed HIGHMEM page for a VMA that the caller knows can move
--
2.31.1

2022-06-07 01:34:22

by Ankur Arora

Subject: [PATCH v3 18/21] gup: add FOLL_HINT_BULK, FAULT_FLAG_NON_CACHING

Add FOLL_HINT_BULK, which callers of get_user_pages(), pin_user_pages()
can use to signal that this call is one of many, allowing
get_user_pages() to optimize accordingly.

Additionally, add FAULT_FLAG_NON_CACHING, which in the fault handling
path signals that the underlying logic can use non-caching primitives.
This is a possible optimization for FOLL_HINT_BULK calls.
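
As an illustration only (not part of this patch; the vfio conversion
later in this series does essentially this), a caller pinning a large
region in chunks might pass the hint as below -- the function name is
hypothetical:

  static long pin_chunk_of_bulk(struct mm_struct *mm, unsigned long vaddr,
                                unsigned long npages, struct page **pages)
  {
          unsigned int flags = FOLL_WRITE | FOLL_LONGTERM;

          /* One of many calls over a larger region: let gup optimize. */
          flags |= FOLL_HINT_BULK;

          return pin_user_pages_remote(mm, vaddr, npages, flags,
                                       pages, NULL, NULL);
  }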

Signed-off-by: Ankur Arora <[email protected]>
---
include/linux/mm.h | 1 +
include/linux/mm_types.h | 2 ++
2 files changed, 3 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a9b0c1889348..dbd8b7344dfc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2941,6 +2941,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
#define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
#define FOLL_PIN 0x40000 /* pages must be released via unpin_user_page */
#define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */
+#define FOLL_HINT_BULK 0x100000 /* part of a larger extent being gup'd */

/*
* FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b34ff2cdbc4f..287b3018c14d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -824,6 +824,7 @@ typedef struct {
* mapped R/O.
* @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
* We should only access orig_pte if this flag set.
+ * @FAULT_FLAG_NON_CACHING: Avoid polluting the cache if possible.
*
* About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
* whether we would allow page faults to retry by specifying these two
@@ -861,6 +862,7 @@ enum fault_flag {
FAULT_FLAG_INTERRUPTIBLE = 1 << 9,
FAULT_FLAG_UNSHARE = 1 << 10,
FAULT_FLAG_ORIG_PTE_VALID = 1 << 11,
+ FAULT_FLAG_NON_CACHING = 1 << 12,
};

typedef unsigned int __bitwise zap_flags_t;
--
2.31.1

2022-06-07 02:06:23

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v3 14/21] x86/clear_page: add clear_pages_incoherent()

Expose incoherent clearing primitives (clear_pages_movnt(),
clear_pages_clzero()) as alternatives via clear_pages_incoherent().

Fall back to clear_pages() if X86_FEATURE_MOVNT_SLOW is set and
the CPU does not have X86_FEATURE_CLZERO.

Both these primitives use weakly-ordered stores. To ensure that
callers don't mix accesses to different types of address_spaces,
annotate clear_user_pages_incoherent() and clear_pages_incoherent()
as taking __incoherent pointers as arguments.

Also add clear_page_make_coherent() which provides the necessary
store fence to make access to these __incoherent regions safe.
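
A rough usage sketch (assuming x86_64 and a directly mapped, non-highmem
page; the real callers are wired up later in the series):

  __incoherent void *kaddr = (__incoherent void *)page_address(page);

  /* Weakly-ordered stores; the cleared region is not yet safe to expose. */
  clear_pages_incoherent(kaddr, npages);

  /* Order the stores before the page is mapped or marked uptodate. */
  clear_page_make_coherent();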

Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/include/asm/page.h | 13 +++++++++++++
arch/x86/include/asm/page_64.h | 34 ++++++++++++++++++++++++++++++++++
2 files changed, 47 insertions(+)

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 045eaab08f43..8fc6cc6759b9 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -40,6 +40,19 @@ static inline void clear_user_page(void *page, unsigned long vaddr,
clear_page(page);
}

+#ifdef __HAVE_ARCH_CLEAR_USER_PAGES_INCOHERENT /* x86_64 */
+/*
+ * clear_user_pages_incoherent: only valid on __incoherent memory regions.
+ */
+static inline void clear_user_pages_incoherent(__incoherent void *page,
+ unsigned long vaddr,
+ struct page *pg,
+ unsigned int npages)
+{
+ clear_pages_incoherent(page, npages);
+}
+#endif /* __HAVE_ARCH_CLEAR_USER_PAGES_INCOHERENT */
+
static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
struct page *topage)
{
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index e8d4698fda65..78417f63f522 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -69,6 +69,40 @@ static inline void clear_pages(void *page, unsigned int npages)
: "cc", "memory", "rax", "rcx");
}

+#define __HAVE_ARCH_CLEAR_USER_PAGES_INCOHERENT
+/*
+ * clear_pages_incoherent: only allowed on __incoherent memory regions.
+ */
+static inline void clear_pages_incoherent(__incoherent void *page,
+ unsigned int npages)
+{
+ alternative_call_2(clear_pages_movnt,
+ clear_pages, X86_FEATURE_MOVNT_SLOW,
+ clear_pages_clzero, X86_FEATURE_CLZERO,
+ "=D" (page), "S" ((unsigned long) npages),
+ "0" (page)
+ : "cc", "memory", "rax", "rcx");
+}
+
+/*
+ * clear_page_make_coherent: execute the necessary store fence
+ * after which __incoherent regions can be safely accessed.
+ */
+static inline void clear_page_make_coherent(void)
+{
+ /*
+ * Keep the sfence for oldinstr and clzero separate to guard against
+ * the possibility that a CPU has both X86_FEATURE_MOVNT_SLOW and
+ * X86_FEATURE_CLZERO.
+ *
+ * The alternatives need to be in the same order as the ones
+ * in clear_pages_incoherent().
+ */
+ alternative_2("sfence",
+ "", X86_FEATURE_MOVNT_SLOW,
+ "sfence", X86_FEATURE_CLZERO);
+}
+
void copy_page(void *to, void *from);

#ifdef CONFIG_X86_5LEVEL
--
2.31.1

2022-06-07 04:57:51

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v3 08/21] perf bench: add memset_movnti()

Clone memset_movnti() from arch/x86/lib/memset_64.S.

perf bench mem memset -f x86-64-movnt on Intel Icelakex, AMD Milan:

# Intel Icelakex

$ for i in 8 32 128 512; do
perf bench mem memset -f x86-64-movnt -s ${i}MB -l 5
done

# Output pruned.
# Running 'mem/memset' benchmark:
# function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
# Copying 8MB bytes ...
12.896170 GB/sec
# Copying 32MB bytes ...
15.879065 GB/sec
# Copying 128MB bytes ...
20.813214 GB/sec
# Copying 512MB bytes ...
24.190817 GB/sec

# AMD Milan

$ for i in 8 32 128 512; do
perf bench mem memset -f x86-64-movnt -s ${i}MB -l 5
done

# Output pruned.
# Running 'mem/memset' benchmark:
# function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
# Copying 8MB bytes ...
22.372566 GB/sec
# Copying 32MB bytes ...
22.507923 GB/sec
# Copying 128MB bytes ...
22.492532 GB/sec
# Copying 512MB bytes ...
22.434603 GB/sec

Signed-off-by: Ankur Arora <[email protected]>
---
tools/arch/x86/lib/memset_64.S | 68 +++++++++++---------
tools/perf/bench/mem-memset-x86-64-asm-def.h | 6 +-
2 files changed, 43 insertions(+), 31 deletions(-)

diff --git a/tools/arch/x86/lib/memset_64.S b/tools/arch/x86/lib/memset_64.S
index fc9ffd3ff3b2..307b753ca03a 100644
--- a/tools/arch/x86/lib/memset_64.S
+++ b/tools/arch/x86/lib/memset_64.S
@@ -24,7 +24,7 @@ SYM_FUNC_START(__memset)
*
* Otherwise, use original memset function.
*/
- ALTERNATIVE_2 "jmp memset_orig", "", X86_FEATURE_REP_GOOD, \
+ ALTERNATIVE_2 "jmp memset_movq", "", X86_FEATURE_REP_GOOD, \
"jmp memset_erms", X86_FEATURE_ERMS

movq %rdi,%r9
@@ -66,7 +66,8 @@ SYM_FUNC_START_LOCAL(memset_erms)
RET
SYM_FUNC_END(memset_erms)

-SYM_FUNC_START_LOCAL(memset_orig)
+.macro MEMSET_MOV OP fence
+SYM_FUNC_START_LOCAL(memset_\OP)
movq %rdi,%r10

/* expand byte value */
@@ -77,64 +78,71 @@ SYM_FUNC_START_LOCAL(memset_orig)
/* align dst */
movl %edi,%r9d
andl $7,%r9d
- jnz .Lbad_alignment
-.Lafter_bad_alignment:
+ jnz .Lbad_alignment_\@
+.Lafter_bad_alignment_\@:

movq %rdx,%rcx
shrq $6,%rcx
- jz .Lhandle_tail
+ jz .Lhandle_tail_\@

.p2align 4
-.Lloop_64:
+.Lloop_64_\@:
decq %rcx
- movq %rax,(%rdi)
- movq %rax,8(%rdi)
- movq %rax,16(%rdi)
- movq %rax,24(%rdi)
- movq %rax,32(%rdi)
- movq %rax,40(%rdi)
- movq %rax,48(%rdi)
- movq %rax,56(%rdi)
+ \OP %rax,(%rdi)
+ \OP %rax,8(%rdi)
+ \OP %rax,16(%rdi)
+ \OP %rax,24(%rdi)
+ \OP %rax,32(%rdi)
+ \OP %rax,40(%rdi)
+ \OP %rax,48(%rdi)
+ \OP %rax,56(%rdi)
leaq 64(%rdi),%rdi
- jnz .Lloop_64
+ jnz .Lloop_64_\@

/* Handle tail in loops. The loops should be faster than hard
to predict jump tables. */
.p2align 4
-.Lhandle_tail:
+.Lhandle_tail_\@:
movl %edx,%ecx
andl $63&(~7),%ecx
- jz .Lhandle_7
+ jz .Lhandle_7_\@
shrl $3,%ecx
.p2align 4
-.Lloop_8:
+.Lloop_8_\@:
decl %ecx
- movq %rax,(%rdi)
+ \OP %rax,(%rdi)
leaq 8(%rdi),%rdi
- jnz .Lloop_8
+ jnz .Lloop_8_\@

-.Lhandle_7:
+.Lhandle_7_\@:
andl $7,%edx
- jz .Lende
+ jz .Lende_\@
.p2align 4
-.Lloop_1:
+.Lloop_1_\@:
decl %edx
movb %al,(%rdi)
leaq 1(%rdi),%rdi
- jnz .Lloop_1
+ jnz .Lloop_1_\@

-.Lende:
+.Lende_\@:
+ .if \fence
+ sfence
+ .endif
movq %r10,%rax
RET

-.Lbad_alignment:
+.Lbad_alignment_\@:
cmpq $7,%rdx
- jbe .Lhandle_7
+ jbe .Lhandle_7_\@
movq %rax,(%rdi) /* unaligned store */
movq $8,%r8
subq %r9,%r8
addq %r8,%rdi
subq %r8,%rdx
- jmp .Lafter_bad_alignment
-.Lfinal:
-SYM_FUNC_END(memset_orig)
+ jmp .Lafter_bad_alignment_\@
+.Lfinal_\@:
+SYM_FUNC_END(memset_\OP)
+.endm
+
+MEMSET_MOV OP=movq fence=0
+MEMSET_MOV OP=movnti fence=1
diff --git a/tools/perf/bench/mem-memset-x86-64-asm-def.h b/tools/perf/bench/mem-memset-x86-64-asm-def.h
index dac6d2b7c39b..53ead7f91313 100644
--- a/tools/perf/bench/mem-memset-x86-64-asm-def.h
+++ b/tools/perf/bench/mem-memset-x86-64-asm-def.h
@@ -1,6 +1,6 @@
/* SPDX-License-Identifier: GPL-2.0 */

-MEMSET_FN(memset_orig,
+MEMSET_FN(memset_movq,
"x86-64-unrolled",
"unrolled memset() in arch/x86/lib/memset_64.S")

@@ -11,3 +11,7 @@ MEMSET_FN(__memset,
MEMSET_FN(memset_erms,
"x86-64-stosb",
"movsb-based memset() in arch/x86/lib/memset_64.S")
+
+MEMSET_FN(memset_movnti,
+ "x86-64-movnt",
+ "movnt-based memset() in arch/x86/lib/memset_64.S")
--
2.31.1

2022-06-07 05:02:11

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v3 15/21] mm/clear_page: add clear_page_non_caching_threshold()

Introduce clear_page_non_caching_threshold_pages, which specifies the
threshold above which the non-caching path (clear_pages_incoherent())
is used.

The ideal threshold value depends on the CPU uarch and on where the
performance curves for cached and non-cached stores intersect --
typically this tracks the LLC size. Here, we arbitrarily choose a
default value of 8MB (CLEAR_PAGE_NON_CACHING_THRESHOLD), corresponding
to a reasonably large LLC.

Also define clear_page_prefer_non_caching() which provides the
interface for querying this.
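
As a worked example (illustrative only): with the default threshold and
4K pages the cut-off is 8MB/4KB == 2048 pages, so:

  clear_page_prefer_non_caching(2UL << 20);  /* 2MB huge page: false */
  clear_page_prefer_non_caching(1UL << 30);  /* 1GB gigantic page: true */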

Signed-off-by: Ankur Arora <[email protected]>
---
include/asm-generic/clear_page.h | 4 ++++
include/linux/mm.h | 6 ++++++
mm/memory.c | 25 +++++++++++++++++++++++++
3 files changed, 35 insertions(+)

diff --git a/include/asm-generic/clear_page.h b/include/asm-generic/clear_page.h
index 0ebff70a60a9..b790000661ce 100644
--- a/include/asm-generic/clear_page.h
+++ b/include/asm-generic/clear_page.h
@@ -62,4 +62,8 @@ static inline void clear_page_make_coherent(void) { }
#endif /* __ASSEMBLY__ */
#endif /* __HAVE_ARCH_CLEAR_USER_PAGES_INCOHERENT */

+#ifndef __ASSEMBLY__
+extern unsigned long __init arch_clear_page_non_caching_threshold(void);
+#endif
+
#endif /* __ASM_GENERIC_CLEAR_PAGE_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bc8f326be0ce..5084571b2fb6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3328,6 +3328,12 @@ static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)));
}

+extern bool clear_page_prefer_non_caching(unsigned long extent);
+#else /* !(CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS) */
+static inline bool clear_page_prefer_non_caching(unsigned long extent)
+{
+ return false;
+}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */

#ifdef CONFIG_DEBUG_PAGEALLOC
diff --git a/mm/memory.c b/mm/memory.c
index 04c6bb5d75f6..b78b32a3e915 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5563,10 +5563,28 @@ EXPORT_SYMBOL(__might_fault);

#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)

+/*
+ * Default size beyond which huge page clearing uses the non-caching
+ * path. Size it for a reasonably sized LLC.
+ */
+#define CLEAR_PAGE_NON_CACHING_THRESHOLD (8 << 20)
static unsigned int __ro_after_init clear_page_unit = 1;
+
+static unsigned long __read_mostly clear_page_non_caching_threshold_pages =
+ CLEAR_PAGE_NON_CACHING_THRESHOLD / PAGE_SIZE;
+
+/* Arch code can override for a machine specific value. */
+unsigned long __weak __init arch_clear_page_non_caching_threshold(void)
+{
+ return CLEAR_PAGE_NON_CACHING_THRESHOLD;
+}
+
static int __init setup_clear_page_params(void)
{
clear_page_unit = 1 << min(MAX_ORDER - 1, ARCH_MAX_CLEAR_PAGES_ORDER);
+
+ clear_page_non_caching_threshold_pages =
+ arch_clear_page_non_caching_threshold() / PAGE_SIZE;
return 0;
}

@@ -5576,6 +5594,13 @@ static int __init setup_clear_page_params(void)
*/
late_initcall(setup_clear_page_params);

+bool clear_page_prefer_non_caching(unsigned long extent)
+{
+ unsigned long pages = extent / PAGE_SIZE;
+
+ return pages >= clear_page_non_caching_threshold_pages;
+}
+
/*
* Clear a page extent.
*
--
2.31.1

2022-06-07 06:39:33

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v3 09/21] x86/asm: add clear_pages_movnt()

Add clear_pages_movnt(), which uses MOVNTI as the underlying primitive.
With this, page clearing can bypass the cache hierarchy, providing
a non-cache-polluting implementation of clear_pages().

MOVNTI, from the Intel SDM, Volume 2B, 4-101:
"The non-temporal hint is implemented by using a write combining (WC)
memory type protocol when writing the data to memory. Using this
protocol, the processor does not write the data into the cache
hierarchy, nor does it fetch the corresponding cache line from memory
into the cache hierarchy."

The AMD Architecture Programmer's Manual makes a similar statement.

One use-case is zeroing large extents without bringing in never-to-be-
accessed cachelines. Also, clear_pages_movnt() based clearing is often
faster once extent sizes are on the order of the LLC size.

As the excerpt notes, MOVNTI is weakly ordered with respect to other
instructions operating on the memory hierarchy. This needs to be
handled by the caller by executing an SFENCE when done.
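
For instance, a minimal, illustrative-only sequence on x86_64, where
wmb() expands to an SFENCE:

  clear_pages_movnt(page_address(page), npages);
  wmb();  /* order the MOVNTI stores before exposing the pages */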

The implementation is straightforward: unroll the inner loop to keep
the code similar to memset_movnti(), so that we can gauge
clear_pages_movnt() performance via perf bench mem memset.

# Intel Icelakex
# Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
# (X86_FEATURE_ERMS) and x86-64-movnt:

System: Oracle X9-2 (2 nodes * 32 cores * 2 threads)
Processor: Intel Xeon(R) Platinum 8358 CPU @ 2.60GHz (Icelakex, 6:106:6)
Memory: 512 GB evenly split between nodes
LLC-size: 48MB for each node (32-cores * 2-threads)
no_turbo: 1, Microcode: 0xd0001e0, scaling-governor: performance

             x86-64-stosb (5 runs)      x86-64-movnt (5 runs)     Delta(%)
             ----------------------     ---------------------     --------
   size      BW         (  stdev)       BW         (  stdev)

    2MB      14.37 GB/s ( +- 1.55)      12.59 GB/s ( +- 1.20)     -12.38%
   16MB      16.93 GB/s ( +- 2.61)      15.91 GB/s ( +- 2.74)      -6.02%
  128MB      12.12 GB/s ( +- 1.06)      22.33 GB/s ( +- 1.84)     +84.24%
 1024MB      12.12 GB/s ( +- 0.02)      23.92 GB/s ( +- 0.14)     +97.35%
 4096MB      12.08 GB/s ( +- 0.02)      23.98 GB/s ( +- 0.18)     +98.50%

Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/include/asm/page_64.h | 1 +
arch/x86/lib/clear_page_64.S | 21 +++++++++++++++++++++
2 files changed, 22 insertions(+)

diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index a88a3508888a..3affc4ecb8da 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -55,6 +55,7 @@ extern unsigned long __phys_addr_symbol(unsigned long);
void clear_pages_orig(void *page, unsigned long npages);
void clear_pages_rep(void *page, unsigned long npages);
void clear_pages_erms(void *page, unsigned long npages);
+void clear_pages_movnt(void *page, unsigned long npages);

#define __HAVE_ARCH_CLEAR_USER_PAGES
static inline void clear_pages(void *page, unsigned int npages)
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index 2cc3b681734a..83d14f1c9f57 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -58,3 +58,24 @@ SYM_FUNC_START(clear_pages_erms)
RET
SYM_FUNC_END(clear_pages_erms)
EXPORT_SYMBOL_GPL(clear_pages_erms)
+
+SYM_FUNC_START(clear_pages_movnt)
+ xorl %eax,%eax
+ movq %rsi,%rcx
+ shlq $PAGE_SHIFT, %rcx
+
+ .p2align 4
+.Lstart:
+ movnti %rax, 0x00(%rdi)
+ movnti %rax, 0x08(%rdi)
+ movnti %rax, 0x10(%rdi)
+ movnti %rax, 0x18(%rdi)
+ movnti %rax, 0x20(%rdi)
+ movnti %rax, 0x28(%rdi)
+ movnti %rax, 0x30(%rdi)
+ movnti %rax, 0x38(%rdi)
+ addq $0x40, %rdi
+ subl $0x40, %ecx
+ ja .Lstart
+ RET
+SYM_FUNC_END(clear_pages_movnt)
--
2.31.1

2022-06-07 08:29:58

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v3 20/21] vfio_iommu_type1: specify FOLL_HINT_BULK to pin_user_pages()

Specify FOLL_HINT_BULK to pin_user_pages_remote() so it is aware
that this pin is part of a larger region being pinned, and can
optimize based on that expectation.

Cc: [email protected]
Signed-off-by: Ankur Arora <[email protected]>
---
drivers/vfio/vfio_iommu_type1.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 9394aa9444c1..138b23769793 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -553,6 +553,9 @@ static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
if (prot & IOMMU_WRITE)
flags |= FOLL_WRITE;

+ /* Tell gup that this pin iteration is part of a larger set of pins. */
+ flags |= FOLL_HINT_BULK;
+
mmap_read_lock(mm);
ret = pin_user_pages_remote(mm, vaddr, npages, flags | FOLL_LONGTERM,
pages, NULL, NULL);
--
2.31.1

2022-06-07 08:46:57

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v3 05/21] mm/huge_page: generalize process_huge_page()

process_huge_page() processes subpages left-right, narrowing towards
the faulting subpage to keep spatially close cachelines hot.

This is, however, done a page at a time. Retain the left-right
narrowing logic while using larger chunks for page regions
farther away from the target, and smaller chunks approaching
the target.

Clearing in large chunks allows for uarch-specific optimizations.
Do this, however, only for far away subpages because we don't
care about keeping those cachelines hot.

In addition, while narrowing towards the target, access both the
left and right chunks in the forward direction instead of the
reverse -- x86 string instructions perform better that way.

Signed-off-by: Ankur Arora <[email protected]>
---
mm/memory.c | 86 +++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 64 insertions(+), 22 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index fbc7bc70dc3d..04c6bb5d75f6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5592,8 +5592,10 @@ struct subpage_arg {
struct page *dst;
struct page *src;
struct vm_area_struct *vma;
+ int page_unit;
};

+#define NWIDTH 4
/*
* Process all subpages of the specified huge page with the specified
* operation. The target subpage will be processed last to keep its
@@ -5604,37 +5606,75 @@ static inline void process_huge_page(struct subpage_arg *sa,
void (*process_subpages)(struct subpage_arg *sa,
unsigned long base_addr, int lidx, int ridx))
{
- int i, n, base, l;
+ int n, lbound, rbound;
+ int remaining, unit = sa->page_unit;
unsigned long addr = addr_hint &
~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);

+ lbound = 0;
+ rbound = pages_per_huge_page - 1;
+ remaining = pages_per_huge_page;
+
/* Process target subpage last to keep its cache lines hot */
n = (addr_hint - addr) / PAGE_SIZE;

- if (2 * n <= pages_per_huge_page) {
- /* If target subpage in first half of huge page */
- base = 0;
- l = n;
- /* Process subpages at the end of huge page */
- process_subpages(sa, addr, 2*n, pages_per_huge_page-1);
- } else {
- /* If target subpage in second half of huge page */
- base = pages_per_huge_page - 2 * (pages_per_huge_page - n);
- l = pages_per_huge_page - n;
-
- /* Process subpages at the begin of huge page */
- process_subpages(sa, addr, 0, base);
- }
/*
- * Process remaining subpages in left-right-left-right pattern
- * towards the target subpage
+ * Process subpages in a left-right-left-right pattern towards the
+ * faulting subpage to keep spatially close cachelines hot.
+ *
+ * If the architecture advertises multi-page clearing/copying, use
+ * the largest extent available, process it in the forward direction,
+ * while iteratively narrowing as the target gets closer.
+ *
+ * Clearing in large chunks allows for uarch specific optimizations.
+ * Do this, however, only for far away subpages because we don't
+ * care about keeping those cachelines hot.
+ *
+ * In addition, while narrowing towards the target, access both the
+ * left and right chunks in the forward direction instead of the
+ * reverse -- x86 string instructions perform better that way.
*/
- for (i = 0; i < l; i++) {
- int left_idx = base + i;
- int right_idx = base + 2 * l - 1 - i;
+ while (remaining) {
+ int left_gap = n - lbound;
+ int right_gap = rbound - n;
+ int neighbourhood;

- process_subpages(sa, addr, left_idx, left_idx);
- process_subpages(sa, addr, right_idx, right_idx);
+ /*
+ * We want to defer processing of the immediate neighbourhood of
+ * the target until the rest of the huge page is exhausted.
+ */
+ neighbourhood = NWIDTH * (left_gap > NWIDTH ||
+ right_gap > NWIDTH);
+
+ /*
+ * Width of the remaining region on the left: n - lbound + 1.
+ * In addition, hold back a neighbourhood region, which is
+ * non-zero until the left and right gaps have been cleared.
+ *
+ * [ddddd....xxxxN
+ * ^ | `---- target
+ * `---|-- lbound
+ * `------------ left neighbourhood edge
+ */
+ if ((n - lbound + 1) >= unit + neighbourhood) {
+ process_subpages(sa, addr, lbound, lbound + unit - 1);
+ lbound += unit;
+ remaining -= unit;
+ }
+
+ /*
+ * Similarly the right:
+ * Nxxxx....ddd]
+ */
+ if ((rbound - n) >= (unit + neighbourhood)) {
+ process_subpages(sa, addr, rbound - unit + 1, rbound);
+ rbound -= unit;
+ remaining -= unit;
+ }
+
+ unit = min(sa->page_unit, unit >> 1);
+ if (unit == 0)
+ unit = 1;
}
}

@@ -5687,6 +5727,7 @@ void clear_huge_page(struct page *page,
.dst = page,
.src = NULL,
.vma = NULL,
+ .page_unit = clear_page_unit,
};

if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
@@ -5741,6 +5782,7 @@ void copy_user_huge_page(struct page *dst, struct page *src,
.dst = dst,
.src = src,
.vma = vma,
+ .page_unit = 1,
};

if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
--
2.31.1

2022-06-07 08:53:20

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v3 11/21] x86/cpuid: add X86_FEATURE_MOVNT_SLOW

X86_FEATURE_MOVNT_SLOW denotes that clear_pages_movnt() is slower for
bulk page clearing (defined as LLC-sized or larger) than the standard
cached clear_page() idiom.

Microarchs where this is true would set this via check_movnt_quirks().
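
Purely as a sketch (the actual model list is added later in this
series), marking a microarchitecture would look roughly like:

  #include <asm/intel-family.h>

  void check_movnt_quirks(struct cpuinfo_x86 *c)
  {
  #ifdef CONFIG_X86_64
          if (c->x86_vendor == X86_VENDOR_INTEL && c->x86 == 6 &&
              c->x86_model == INTEL_FAM6_SKYLAKE_X)
                  set_cpu_cap(c, X86_FEATURE_MOVNT_SLOW);
  #endif
  }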

Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/kernel/cpu/amd.c | 2 ++
arch/x86/kernel/cpu/bugs.c | 16 ++++++++++++++++
arch/x86/kernel/cpu/cpu.h | 2 ++
arch/x86/kernel/cpu/intel.c | 2 ++
5 files changed, 23 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 393f2bbb5e3a..824bdb1d0da1 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -296,6 +296,7 @@
#define X86_FEATURE_PER_THREAD_MBA (11*32+ 7) /* "" Per-thread Memory Bandwidth Allocation */
#define X86_FEATURE_SGX1 (11*32+ 8) /* "" Basic SGX */
#define X86_FEATURE_SGX2 (11*32+ 9) /* "" SGX Enclave Dynamic Memory Management (EDMM) */
+#define X86_FEATURE_MOVNT_SLOW (11*32+10) /* MOVNT is slow. (see check_movnt_quirks()) */

/* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
#define X86_FEATURE_AVX_VNNI (12*32+ 4) /* AVX VNNI instructions */
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 0c0b09796ced..a5fe1420388d 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -891,6 +891,8 @@ static void init_amd(struct cpuinfo_x86 *c)
if (c->x86 >= 0x10)
set_cpu_cap(c, X86_FEATURE_REP_GOOD);

+ check_movnt_quirks(c);
+
/* get apicid instead of initial apic id from cpuid */
c->apicid = hard_smp_processor_id();

diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index d879a6c93609..16e293654d34 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -85,6 +85,22 @@ EXPORT_SYMBOL_GPL(mds_idle_clear);
*/
DEFINE_STATIC_KEY_FALSE(switch_mm_cond_l1d_flush);

+/*
+ * check_movnt_quirks() sets X86_FEATURE_MOVNT_SLOW for uarchs where
+ * clear_pages_movnt() is slower for bulk page clearing than the standard
+ * cached clear_page() idiom (typically rep-stosb/rep-stosq.)
+ *
+ * (Bulk clearing defined as LLC-sized or larger.)
+ *
+ * x86_64 only since clear_pages_movnt() is only defined there.
+ */
+void check_movnt_quirks(struct cpuinfo_x86 *c)
+{
+#ifdef CONFIG_X86_64
+
+#endif
+}
+
void __init check_bugs(void)
{
identify_boot_cpu();
diff --git a/arch/x86/kernel/cpu/cpu.h b/arch/x86/kernel/cpu/cpu.h
index 2a8e584fc991..f53f07bf706f 100644
--- a/arch/x86/kernel/cpu/cpu.h
+++ b/arch/x86/kernel/cpu/cpu.h
@@ -83,4 +83,6 @@ extern void update_srbds_msr(void);

extern u64 x86_read_arch_cap_msr(void);

+void check_movnt_quirks(struct cpuinfo_x86 *c);
+
#endif /* ARCH_X86_CPU_H */
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index fd5dead8371c..f0dc9b97dc8f 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -701,6 +701,8 @@ static void init_intel(struct cpuinfo_x86 *c)
c->x86_cache_alignment = c->x86_clflush_size * 2;
if (c->x86 == 6)
set_cpu_cap(c, X86_FEATURE_REP_GOOD);
+
+ check_movnt_quirks(c);
#else
/*
* Names for the Pentium II/Celeron processors
--
2.31.1

2022-06-07 09:10:31

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v3 12/21] sparse: add address_space __incoherent

Some CPU architectures provide store instructions that are weakly
ordered with respect to the rest of the local instruction stream.

Add the sparse address_space __incoherent to denote regions written
with such stores.
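
A minimal sketch of what the annotation catches (hypothetical snippet,
as seen by a sparse build):

  __incoherent void *dst = (__incoherent void *)page_address(page);

  void *leaked = dst;                 /* sparse: incorrect address space */
  void *forced = (__force void *)dst; /* explicit, intentional conversion */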

Signed-off-by: Ankur Arora <[email protected]>
---
include/linux/compiler_types.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
index d08dfcb0ac68..8e3e736fc82f 100644
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -19,6 +19,7 @@
# define __iomem __attribute__((noderef, address_space(__iomem)))
# define __percpu __attribute__((noderef, address_space(__percpu)))
# define __rcu __attribute__((noderef, address_space(__rcu)))
+# define __incoherent __attribute__((noderef, address_space(__incoherent)))
static inline void __chk_user_ptr(const volatile void __user *ptr) { }
static inline void __chk_io_ptr(const volatile void __iomem *ptr) { }
/* context/locking */
@@ -45,6 +46,7 @@ static inline void __chk_io_ptr(const volatile void __iomem *ptr) { }
# define __iomem
# define __percpu BTF_TYPE_TAG(percpu)
# define __rcu
+# define __incoherent
# define __chk_user_ptr(x) (void)0
# define __chk_io_ptr(x) (void)0
/* context/locking */
--
2.31.1

2022-06-07 10:17:55

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v3 13/21] clear_page: add generic clear_user_pages_incoherent()

Add generic primitives for clear_user_pages_incoherent() and
clear_page_make_coherent().

To ensure that callers don't mix accesses to different types
of address_spaces, annotate clear_user_pages_incoherent()
as taking an __incoherent pointer as argument.

Also add clear_user_highpages_incoherent() which either calls
clear_user_pages_incoherent() or falls back to clear_user_highpages().

Signed-off-by: Ankur Arora <[email protected]>
---

Notes:
clear_user_highpages_incoherent() operates on an __incoherent region
and expects the caller to call clear_page_make_coherent().

It should, however, take an __incoherent * as its argument -- it does
not because I couldn't see a clean way of doing that with highmem.
Suggestions?

include/asm-generic/clear_page.h | 21 +++++++++++++++++++++
include/linux/highmem.h | 23 +++++++++++++++++++++++
2 files changed, 44 insertions(+)

diff --git a/include/asm-generic/clear_page.h b/include/asm-generic/clear_page.h
index f827d661519c..0ebff70a60a9 100644
--- a/include/asm-generic/clear_page.h
+++ b/include/asm-generic/clear_page.h
@@ -16,6 +16,9 @@
#if defined(CONFIG_HIGHMEM) && defined(__HAVE_ARCH_CLEAR_USER_PAGES)
#error CONFIG_HIGHMEM is incompatible with __HAVE_ARCH_CLEAR_USER_PAGES
#endif
+#if defined(CONFIG_HIGHMEM) && defined(__HAVE_ARCH_CLEAR_USER_PAGES_INCOHERENT)
+#error CONFIG_HIGHMEM is incompatible with __HAVE_ARCH_CLEAR_USER_PAGES_INCOHERENT
+#endif

#ifndef __HAVE_ARCH_CLEAR_USER_PAGES

@@ -41,4 +44,22 @@ static inline void clear_user_pages(void *page, unsigned long vaddr,

#define ARCH_MAX_CLEAR_PAGES (1 << ARCH_MAX_CLEAR_PAGES_ORDER)

+#ifndef __HAVE_ARCH_CLEAR_USER_PAGES_INCOHERENT
+#ifndef __ASSEMBLY__
+/*
+ * Fallback path (via clear_user_pages()) if the architecture does not
+ * support incoherent clearing.
+ */
+static inline void clear_user_pages_incoherent(__incoherent void *page,
+ unsigned long vaddr,
+ struct page *pg,
+ unsigned int npages)
+{
+ clear_user_pages((__force void *)page, vaddr, pg, npages);
+}
+
+static inline void clear_page_make_coherent(void) { }
+#endif /* __ASSEMBLY__ */
+#endif /* __HAVE_ARCH_CLEAR_USER_PAGES_INCOHERENT */
+
#endif /* __ASM_GENERIC_CLEAR_PAGE_H */
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 08781d7693e7..90179f623c3b 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -231,6 +231,29 @@ static inline void clear_user_highpages(struct page *page, unsigned long vaddr,
}
#endif /* __HAVE_ARCH_CLEAR_USER_PAGES */

+#ifdef __HAVE_ARCH_CLEAR_USER_PAGES_INCOHERENT
+static inline void clear_user_highpages_incoherent(struct page *page,
+ unsigned long vaddr,
+ unsigned int npages)
+{
+ __incoherent void *addr = (__incoherent void *) page_address(page);
+
+ clear_user_pages_incoherent(addr, vaddr, page, npages);
+}
+#else
+static inline void clear_user_highpages_incoherent(struct page *page,
+ unsigned long vaddr,
+ unsigned int npages)
+{
+ /*
+ * We fallback to clear_user_highpages() for the CONFIG_HIGHMEM
+ * configs.
+ * For !CONFIG_HIGHMEM, this will get translated to clear_user_pages().
+ */
+ clear_user_highpages(page, vaddr, npages);
+}
+#endif /* __HAVE_ARCH_CLEAR_USER_PAGES_INCOHERENT */
+
#ifndef __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE_MOVABLE
/**
* alloc_zeroed_user_highpage_movable - Allocate a zeroed HIGHMEM page for a VMA that the caller knows can move
--
2.31.1

2022-06-07 10:34:29

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v3 17/21] clear_huge_page: use non-cached clearing

Non-caching stores are suitable for circumstances where the destination
region is unlikely to be read again soon, or is large enough that
there's no expectation that we will find the data in the cache.

Add a new parameter to clear_user_extent(), which selects the
non-caching clearing path for huge and gigantic pages. That path needs a
final clear_page_make_coherent() operation since non-cached clearing
typically involves weakly ordered stores that are incoherent with
respect to other operations in the memory hierarchy.

This path is always invoked for gigantic pages; for huge pages it is
used only if pages_per_huge_page is greater than an architectural
threshold, or if the caller gives an explicit hint (for instance, when
this call is part of a larger clearing operation.)
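
As a hedged sketch of the hint case (the fault-path wiring is a separate
patch in this series; names mirror __do_huge_pmd_anonymous_page()):

  bool non_cached = !!(vmf->flags & FAULT_FLAG_NON_CACHING);

  clear_huge_page(page, vmf->address, HPAGE_PMD_NR, non_cached);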

Signed-off-by: Ankur Arora <[email protected]>
---
include/linux/mm.h | 3 ++-
mm/huge_memory.c | 3 ++-
mm/hugetlb.c | 3 ++-
mm/memory.c | 50 +++++++++++++++++++++++++++++++++++++++-------
4 files changed, 49 insertions(+), 10 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5084571b2fb6..a9b0c1889348 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3302,7 +3302,8 @@ enum mf_action_page_type {
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
extern void clear_huge_page(struct page *page,
unsigned long addr_hint,
- unsigned int pages_per_huge_page);
+ unsigned int pages_per_huge_page,
+ bool non_cached);
extern void copy_user_huge_page(struct page *dst, struct page *src,
unsigned long addr_hint,
struct vm_area_struct *vma,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a77c78a2b6b5..73654db77a1c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -594,6 +594,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
pgtable_t pgtable;
unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
vm_fault_t ret = 0;
+ bool non_cached = false;

VM_BUG_ON_PAGE(!PageCompound(page), page);

@@ -611,7 +612,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
goto release;
}

- clear_huge_page(page, vmf->address, HPAGE_PMD_NR);
+ clear_huge_page(page, vmf->address, HPAGE_PMD_NR, non_cached);
/*
* The memory barrier inside __SetPageUptodate makes sure that
* clear_huge_page writes become visible before the set_pmd_at()
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7c468ac1d069..0c4a31b5c1e9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5481,6 +5481,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
spinlock_t *ptl;
unsigned long haddr = address & huge_page_mask(h);
bool new_page, new_pagecache_page = false;
+ bool non_cached = false;

/*
* Currently, we are forced to kill the process in the event the
@@ -5536,7 +5537,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
spin_unlock(ptl);
goto out;
}
- clear_huge_page(page, address, pages_per_huge_page(h));
+ clear_huge_page(page, address, pages_per_huge_page(h), non_cached);
__SetPageUptodate(page);
new_page = true;

diff --git a/mm/memory.c b/mm/memory.c
index b78b32a3e915..0638dc56828f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5606,11 +5606,18 @@ bool clear_page_prefer_non_caching(unsigned long extent)
*
* With ARCH_MAX_CLEAR_PAGES == 1, clear_user_highpages() drops down
* to page-at-a-time mode. Or, funnels through to clear_user_pages().
+ *
+ * With coherent == false, we use incoherent stores and the caller is
+ * responsible for making the region coherent again by calling
+ * clear_page_make_coherent().
*/
static void clear_user_extent(struct page *start_page, unsigned long vaddr,
- unsigned int npages)
+ unsigned int npages, bool coherent)
{
- clear_user_highpages(start_page, vaddr, npages);
+ if (coherent)
+ clear_user_highpages(start_page, vaddr, npages);
+ else
+ clear_user_highpages_incoherent(start_page, vaddr, npages);
}

struct subpage_arg {
@@ -5709,6 +5716,13 @@ static void clear_gigantic_page(struct page *page,
{
int i;
struct page *p = page;
+ bool coherent;
+
+ /*
+ * Gigantic pages are large enough that there are no cache
+ * expectations. Use the incoherent path.
+ */
+ coherent = false;

might_sleep();
for (i = 0; i < pages_per_huge_page;
@@ -5718,9 +5732,16 @@ static void clear_gigantic_page(struct page *page,
* guarantees that p[0] and p[clear_page_unit-1]
* never straddle a mem_map discontiguity.
*/
- clear_user_extent(p, base_addr + i * PAGE_SIZE, clear_page_unit);
+ clear_user_extent(p, base_addr + i * PAGE_SIZE,
+ clear_page_unit, coherent);
cond_resched();
}
+
+ /*
+ * We need to make sure that writes above are ordered before
+ * updating the PTE and marking SetPageUptodate().
+ */
+ clear_page_make_coherent();
}

static void clear_subpages(struct subpage_arg *sa,
@@ -5736,15 +5757,16 @@ static void clear_subpages(struct subpage_arg *sa,

n = min(clear_page_unit, remaining);

- clear_user_extent(page + i, base_addr + i * PAGE_SIZE, n);
+ clear_user_extent(page + i, base_addr + i * PAGE_SIZE,
+ n, true);
i += n;

cond_resched();
}
}

-void clear_huge_page(struct page *page,
- unsigned long addr_hint, unsigned int pages_per_huge_page)
+void clear_huge_page(struct page *page, unsigned long addr_hint,
+ unsigned int pages_per_huge_page, bool non_cached)
{
unsigned long addr = addr_hint &
~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
@@ -5755,7 +5777,21 @@ void clear_huge_page(struct page *page,
.page_unit = clear_page_unit,
};

- if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
+ /*
+ * The non-caching path is typically slower for small extents so use
+ * it only if the caller explicitly hints it or if the extent is
+ * large enough that there are no cache expectations.
+ *
+ * We let the gigantic page path handle the details.
+ */
+ non_cached |=
+ clear_page_prefer_non_caching(pages_per_huge_page * PAGE_SIZE);
+
+ if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES || non_cached)) {
+ /*
+ * Gigantic page clearing always uses incoherent clearing
+ * internally.
+ */
clear_gigantic_page(page, addr, pages_per_huge_page);
return;
}
--
2.31.1

2022-06-07 12:08:31

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v3 01/21] mm, huge-page: reorder arguments to process_huge_page()

Mechanical change to process_huge_page() to pass subpage clear/copy
args via struct subpage_arg * instead of passing an opaque pointer
around.

No change in generated code.

Signed-off-by: Ankur Arora <[email protected]>
---
mm/memory.c | 47 ++++++++++++++++++++++++++---------------------
1 file changed, 26 insertions(+), 21 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 21dadf03f089..c33aacdaaf11 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5562,15 +5562,22 @@ EXPORT_SYMBOL(__might_fault);
#endif

#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+
+struct subpage_arg {
+ struct page *dst;
+ struct page *src;
+ struct vm_area_struct *vma;
+};
+
/*
* Process all subpages of the specified huge page with the specified
* operation. The target subpage will be processed last to keep its
* cache lines hot.
*/
-static inline void process_huge_page(
+static inline void process_huge_page(struct subpage_arg *sa,
unsigned long addr_hint, unsigned int pages_per_huge_page,
- void (*process_subpage)(unsigned long addr, int idx, void *arg),
- void *arg)
+ void (*process_subpage)(struct subpage_arg *sa,
+ unsigned long addr, int idx))
{
int i, n, base, l;
unsigned long addr = addr_hint &
@@ -5586,7 +5593,7 @@ static inline void process_huge_page(
/* Process subpages at the end of huge page */
for (i = pages_per_huge_page - 1; i >= 2 * n; i--) {
cond_resched();
- process_subpage(addr + i * PAGE_SIZE, i, arg);
+ process_subpage(sa, addr + i * PAGE_SIZE, i);
}
} else {
/* If target subpage in second half of huge page */
@@ -5595,7 +5602,7 @@ static inline void process_huge_page(
/* Process subpages at the begin of huge page */
for (i = 0; i < base; i++) {
cond_resched();
- process_subpage(addr + i * PAGE_SIZE, i, arg);
+ process_subpage(sa, addr + i * PAGE_SIZE, i);
}
}
/*
@@ -5607,9 +5614,9 @@ static inline void process_huge_page(
int right_idx = base + 2 * l - 1 - i;

cond_resched();
- process_subpage(addr + left_idx * PAGE_SIZE, left_idx, arg);
+ process_subpage(sa, addr + left_idx * PAGE_SIZE, left_idx);
cond_resched();
- process_subpage(addr + right_idx * PAGE_SIZE, right_idx, arg);
+ process_subpage(sa, addr + right_idx * PAGE_SIZE, right_idx);
}
}

@@ -5628,9 +5635,9 @@ static void clear_gigantic_page(struct page *page,
}
}

-static void clear_subpage(unsigned long addr, int idx, void *arg)
+static void clear_subpage(struct subpage_arg *sa, unsigned long addr, int idx)
{
- struct page *page = arg;
+ struct page *page = sa->dst;

clear_user_highpage(page + idx, addr);
}
@@ -5640,13 +5647,18 @@ void clear_huge_page(struct page *page,
{
unsigned long addr = addr_hint &
~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
+ struct subpage_arg sa = {
+ .dst = page,
+ .src = NULL,
+ .vma = NULL,
+ };

if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
clear_gigantic_page(page, addr, pages_per_huge_page);
return;
}

- process_huge_page(addr_hint, pages_per_huge_page, clear_subpage, page);
+ process_huge_page(&sa, addr_hint, pages_per_huge_page, clear_subpage);
}

static void copy_user_gigantic_page(struct page *dst, struct page *src,
@@ -5668,16 +5680,9 @@ static void copy_user_gigantic_page(struct page *dst, struct page *src,
}
}

-struct copy_subpage_arg {
- struct page *dst;
- struct page *src;
- struct vm_area_struct *vma;
-};
-
-static void copy_subpage(unsigned long addr, int idx, void *arg)
+static void copy_subpage(struct subpage_arg *copy_arg,
+ unsigned long addr, int idx)
{
- struct copy_subpage_arg *copy_arg = arg;
-
copy_user_highpage(copy_arg->dst + idx, copy_arg->src + idx,
addr, copy_arg->vma);
}
@@ -5688,7 +5693,7 @@ void copy_user_huge_page(struct page *dst, struct page *src,
{
unsigned long addr = addr_hint &
~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
- struct copy_subpage_arg arg = {
+ struct subpage_arg sa = {
.dst = dst,
.src = src,
.vma = vma,
@@ -5700,7 +5705,7 @@ void copy_user_huge_page(struct page *dst, struct page *src,
return;
}

- process_huge_page(addr_hint, pages_per_huge_page, copy_subpage, &arg);
+ process_huge_page(&sa, addr_hint, pages_per_huge_page, copy_subpage);
}

long copy_huge_page_from_user(struct page *dst_page,
--
2.31.1

2022-06-07 14:13:41

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v3 06/21] x86/clear_page: add clear_pages()

Add clear_pages(), with ARCH_MAX_CLEAR_PAGES_ORDER=8, so we can clear
in chunks of up to 256 pages (1MB).

The case for doing this is to expose huge or gigantic page clearing
as a few long strings of zeroes instead of many PAGE_SIZE'd operations.
Processors could take advantage of this hint by foregoing cacheline
allocation.
Unfortunately, current generation CPUs generally do not do this
optimization: among the CPUs tested, Intel Skylake and Icelakex don't
do it at all; AMD Milan does, but only for extents > ~LLC-size.
(Note, however, that the numbers below do show a ~25% increase in
clearing BW -- just not due to forgoing cacheline allocation.)
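
Concretely, with the new interface a 2MB extent can be exposed to the
CPU as two maximal-length clears instead of 512 clear_page() calls
(sketch only, assuming x86_64 with 4K pages and a directly mapped page):

  void *kaddr = page_address(page);

  clear_pages(kaddr, ARCH_MAX_CLEAR_PAGES);
  clear_pages(kaddr + ARCH_MAX_CLEAR_PAGES * PAGE_SIZE,
              ARCH_MAX_CLEAR_PAGES);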

One hope for this change is that it might provide enough of a
hint that future uarchs could optimize for.

A minor negative with this change is that calls to clear_page()
(which now calls clear_pages()) clobber an additional register.

Performance
===

System: Oracle X9-2c (2 nodes * 32 cores * 2 threads)
Processor: Intel Xeon(R) Platinum 8358 CPU @ 2.60GHz (Icelakex, 6:106:6)
Memory: 1024 GB evenly split between nodes
LLC-size: 48MB for each node (32-cores * 2-threads)
no_turbo: 1, Microcode: 0xd0002c1, scaling-governor: performance

System: Oracle E4-2c (2 nodes * 8 CCXs * 8 cores * 2 threads)
Processor: AMD EPYC 7J13 64-Core Processor (Milan, 25:1:1)
Memory: 512 GB evenly split between nodes
LLC-size: 32MB for each CCX (8-cores * 2-threads)
boost: 1, Microcode: 0xa00115d, scaling-governor: performance

Workload: create a 192GB qemu-VM (backed by preallocated 2MB
pages on the local node)
==

Icelakex
--
                        Time (s)              Delta (%)
clear_page_erms()       22.37 ( +- 0.14s )               #  9.21 bytes/ns
clear_pages_erms()      16.49 ( +- 0.06s )    -26.28%    # 12.50 bytes/ns

Looking at the perf stats [1] [2], it's not obvious where the
improvement is coming from. For clear_pages_erms(), we do execute
fewer instructions and branches (multiple pages per call to
clear_pages_erms(), and fewer cond_resched() calls) but since this
code isn't frontend bound (though there is a marginal improvement in
topdown-fe-bound), it's not clear that this accounts for the ~25%
improvement.
The topdown-be-bound numbers are significantly better but they are
in a similar proportion to the total slots in both cases.

Milan
--
                        Time (s)              Delta (%)
clear_page_erms()       16.49 ( +- 0.06s )               # 12.50 bytes/ns
clear_pages_erms()      11.82 ( +- 0.06s )    -28.32%    # 17.44 bytes/ns

Similar to the Icelakex case above, from the perf stats [3], [4] it's
unclear where the improvement is coming from. We do somewhat better
for L1-dcache-loads and marginally better for stalled-cycles-backend
but nothing obvious stands out.

Workload: vm-scalability hugetlb tests (on Icelakex)
==

For case-anon-w-seq-hugetlb, there is a ~19.49% improvement in
cpu-cycles expended. As above, from perf stats there isn't a clear
reason why. No significant differences in user/kernel cache misses.

case-anon-w-seq-hugetlb:
- 2,632,688,342,385 cpu-cycles # 2.301 GHz ( +- 6.76% ) (33.29%)
+ 2,119,058,504,338 cpu-cycles # 1.654 GHz ( +- 4.63% ) (33.37%)

Other hugetlb tests are flat.

case-anon-w-rand-hugetlb:
- 14,423,774,217,911 cpu-cycles # 2.452 GHz ( +- 0.55% ) (33.30%)
+ 14,009,785,056,082 cpu-cycles # 2.428 GHz ( +- 3.11% ) (33.32%)

case-anon-cow-seq-hugetlb:
- 2,689,994,027,601 cpu-cycles # 2.220 GHz ( +- 1.91% ) (33.27%)
+ 2,735,414,889,894 cpu-cycles # 2.262 GHz ( +- 1.82% ) (27.73%)

case-anon-cow-rand-hugetlb:
- 16,130,147,328,192 cpu-cycles # 2.482 GHz ( +- 1.07% ) (33.30%)
+ 15,815,163,909,204 cpu-cycles # 2.432 GHz ( +- 0.64% ) (33.32%)

cache-references, cache-misses are within margin of error across all
the tests.

[1] Icelakex, create 192GB qemu-VM, clear_page_erms()
# perf stat -r 5 --all-kernel -ddd ./qemu.sh

Performance counter stats for './qemu.sh' (5 runs):

22,378.31 msec task-clock # 1.000 CPUs utilized ( +- 0.67% )
153 context-switches # 6.844 /sec ( +- 0.57% )
8 cpu-migrations # 0.358 /sec ( +- 16.49% )
116 page-faults # 5.189 /sec ( +- 0.17% )
57,290,131,280 cycles # 2.563 GHz ( +- 0.66% ) (38.46%)
3,077,416,348 instructions # 0.05 insn per cycle ( +- 0.30% ) (46.14%)
631,473,780 branches # 28.246 M/sec ( +- 0.18% ) (53.83%)
1,167,792 branch-misses # 0.19% of all branches ( +- 0.79% ) (61.52%)
286,600,215,705 slots # 12.820 G/sec ( +- 0.66% ) (69.20%)
11,435,999,662 topdown-retiring # 3.9% retiring ( +- 1.56% ) (69.20%)
19,428,489,213 topdown-bad-spec # 6.2% bad speculation ( +- 3.23% ) (69.20%)
3,504,763,769 topdown-fe-bound # 1.2% frontend bound ( +- 0.67% ) (69.20%)
258,517,960,428 topdown-be-bound # 88.7% backend bound ( +- 0.58% ) (69.20%)
749,211,322 L1-dcache-loads # 33.513 M/sec ( +- 0.13% ) (69.18%)
3,244,380,956 L1-dcache-load-misses # 433.32% of all L1-dcache accesses ( +- 0.00% ) (69.20%)
11,441,841 LLC-loads # 511.805 K/sec ( +- 0.30% ) (69.23%)
839,878 LLC-load-misses # 7.32% of all LL-cache accesses ( +- 1.28% ) (69.24%)
<not supported> L1-icache-loads
23,091,397 L1-icache-load-misses ( +- 0.72% ) (30.82%)
772,619,434 dTLB-loads # 34.560 M/sec ( +- 0.31% ) (30.82%)
49,750 dTLB-load-misses # 0.01% of all dTLB cache accesses ( +- 3.21% ) (30.80%)
<not supported> iTLB-loads
503,570 iTLB-load-misses ( +- 0.44% ) (30.78%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses

22.374 +- 0.149 seconds time elapsed ( +- 0.66% )

[2] Icelakex, create 192GB qemu-VM, clear_pages_erms()
# perf stat -r 5 --all-kernel -ddd ./qemu.sh

Performance counter stats for './qemu.sh' (5 runs):

16,329.41 msec task-clock # 0.990 CPUs utilized ( +- 0.42% )
143 context-switches # 8.681 /sec ( +- 0.93% )
1 cpu-migrations # 0.061 /sec ( +- 63.25% )
118 page-faults # 7.164 /sec ( +- 0.27% )
41,735,523,673 cycles # 2.534 GHz ( +- 0.42% ) (38.46%)
1,454,116,543 instructions # 0.03 insn per cycle ( +- 0.49% ) (46.16%)
266,749,920 branches # 16.194 M/sec ( +- 0.41% ) (53.86%)
928,726 branch-misses # 0.35% of all branches ( +- 0.38% ) (61.54%)
208,805,754,709 slots # 12.676 G/sec ( +- 0.41% ) (69.23%)
5,355,889,366 topdown-retiring # 2.5% retiring ( +- 0.50% ) (69.23%)
12,720,749,784 topdown-bad-spec # 6.1% bad speculation ( +- 1.38% ) (69.23%)
998,710,552 topdown-fe-bound # 0.5% frontend bound ( +- 0.85% ) (69.23%)
192,653,197,875 topdown-be-bound # 90.9% backend bound ( +- 0.38% ) (69.23%)
407,619,058 L1-dcache-loads # 24.746 M/sec ( +- 0.17% ) (69.20%)
3,245,399,461 L1-dcache-load-misses # 801.49% of all L1-dcache accesses ( +- 0.01% ) (69.22%)
10,805,747 LLC-loads # 656.009 K/sec ( +- 0.37% ) (69.25%)
804,475 LLC-load-misses # 7.44% of all LL-cache accesses ( +- 2.73% ) (69.26%)
<not supported> L1-icache-loads
18,134,527 L1-icache-load-misses ( +- 1.24% ) (30.80%)
435,474,462 dTLB-loads # 26.437 M/sec ( +- 0.28% ) (30.80%)
41,187 dTLB-load-misses # 0.01% of all dTLB cache accesses ( +- 4.06% ) (30.79%)
<not supported> iTLB-loads
440,135 iTLB-load-misses ( +- 1.07% ) (30.78%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses

16.4906 +- 0.0676 seconds time elapsed ( +- 0.41% )

[3] Milan, create 192GB qemu-VM, clear_page_erms()
# perf stat -r 5 --all-kernel -ddd ./qemu.sh

Performance counter stats for './qemu.sh' (5 runs):

16,321.98 msec task-clock # 0.989 CPUs utilized ( +- 0.42% )
104 context-switches # 6.312 /sec ( +- 0.47% )
0 cpu-migrations # 0.000 /sec
109 page-faults # 6.616 /sec ( +- 0.41% )
39,430,057,963 cycles # 2.393 GHz ( +- 0.42% ) (33.33%)
252,874,009 stalled-cycles-frontend # 0.64% frontend cycles idle ( +- 17.81% ) (33.34%)
7,240,041 stalled-cycles-backend # 0.02% backend cycles idle ( +-245.73% ) (33.34%)
3,031,754,124 instructions # 0.08 insn per cycle
# 0.11 stalled cycles per insn ( +- 0.41% ) (33.35%)
711,675,976 branches # 43.197 M/sec ( +- 0.15% ) (33.34%)
52,470,018 branch-misses # 7.38% of all branches ( +- 0.21% ) (33.36%)
7,744,057,748 L1-dcache-loads # 470.041 M/sec ( +- 0.05% ) (33.36%)
3,241,880,079 L1-dcache-load-misses # 41.92% of all L1-dcache accesses ( +- 0.01% ) (33.35%)
<not supported> LLC-loads
<not supported> LLC-load-misses
155,312,115 L1-icache-loads # 9.427 M/sec ( +- 0.23% ) (33.34%)
1,573,793 L1-icache-load-misses # 1.01% of all L1-icache accesses ( +- 3.74% ) (33.36%)
3,521,392 dTLB-loads # 213.738 K/sec ( +- 4.97% ) (33.35%)
346,337 dTLB-load-misses # 9.31% of all dTLB cache accesses ( +- 5.54% ) (33.35%)
725 iTLB-loads # 44.005 /sec ( +- 8.75% ) (33.34%)
115,723 iTLB-load-misses # 19261.48% of all iTLB cache accesses ( +- 1.20% ) (33.34%)
139,229,403 L1-dcache-prefetches # 8.451 M/sec ( +- 10.97% ) (33.34%)
<not supported> L1-dcache-prefetch-misses

16.4962 +- 0.0665 seconds time elapsed ( +- 0.40% )

[4] Milan, create 192GB qemu-VM, clear_pages_erms()
# perf stat -r 5 --all-kernel -ddd ./qemu.sh

Performance counter stats for './qemu.sh' (5 runs):

11,676.79 msec task-clock # 0.987 CPUs utilized ( +- 0.68% )
96 context-switches # 8.131 /sec ( +- 0.78% )
2 cpu-migrations # 0.169 /sec ( +- 18.71% )
106 page-faults # 8.978 /sec ( +- 0.23% )
28,161,726,414 cycles # 2.385 GHz ( +- 0.69% ) (33.33%)
141,032,827 stalled-cycles-frontend # 0.50% frontend cycles idle ( +- 52.44% ) (33.35%)
796,792,139 stalled-cycles-backend # 2.80% backend cycles idle ( +- 23.73% ) (33.35%)
1,140,172,646 instructions # 0.04 insn per cycle
# 0.50 stalled cycles per insn ( +- 0.89% ) (33.35%)
219,864,061 branches # 18.622 M/sec ( +- 1.06% ) (33.36%)
1,407,446 branch-misses # 0.63% of all branches ( +- 10.66% ) (33.40%)
6,882,968,897 L1-dcache-loads # 582.960 M/sec ( +- 0.03% ) (33.38%)
3,267,546,914 L1-dcache-load-misses # 47.45% of all L1-dcache accesses ( +- 0.02% ) (33.37%)
<not supported> LLC-loads
<not supported> LLC-load-misses
146,901,513 L1-icache-loads # 12.442 M/sec ( +- 0.78% ) (33.36%)
1,462,155 L1-icache-load-misses # 0.99% of all L1-icache accesses ( +- 0.83% ) (33.34%)
2,055,805 dTLB-loads # 174.118 K/sec ( +- 22.56% ) (33.33%)
136,260 dTLB-load-misses # 4.69% of all dTLB cache accesses ( +- 23.13% ) (33.35%)
941 iTLB-loads # 79.699 /sec ( +- 5.54% ) (33.35%)
115,444 iTLB-load-misses # 14051.12% of all iTLB cache accesses ( +- 21.17% ) (33.34%)
95,438,373 L1-dcache-prefetches # 8.083 M/sec ( +- 19.99% ) (33.34%)
<not supported> L1-dcache-prefetch-misses

11.8296 +- 0.0805 seconds time elapsed ( +- 0.68% )

Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/include/asm/page.h | 12 +++++++++++
arch/x86/include/asm/page_64.h | 28 ++++++++++++++++++-------
arch/x86/lib/clear_page_64.S | 38 ++++++++++++++++++++--------------
3 files changed, 55 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 5a246a2a66aa..045eaab08f43 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -22,6 +22,18 @@ struct page;
extern struct range pfn_mapped[];
extern int nr_pfn_mapped;

+#ifdef __HAVE_ARCH_CLEAR_USER_PAGES /* x86_64 */
+
+#define clear_page(page) clear_pages(page, 1)
+
+static inline void clear_user_pages(void *page, unsigned long vaddr,
+ struct page *pg, unsigned int npages)
+{
+ clear_pages(page, npages);
+}
+
+#endif /* __HAVE_ARCH_CLEAR_USER_PAGES */
+
static inline void clear_user_page(void *page, unsigned long vaddr,
struct page *pg)
{
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index baa70451b8df..a88a3508888a 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -41,16 +41,28 @@ extern unsigned long __phys_addr_symbol(unsigned long);
#define pfn_valid(pfn) ((pfn) < max_pfn)
#endif

-void clear_page_orig(void *page);
-void clear_page_rep(void *page);
-void clear_page_erms(void *page);
+/*
+ * Clear in chunks of 256 pages/1024KB.
+ *
+ * Assuming a clearing BW of 3b/cyc (recent generation processors have
+ * more), this amounts to around 400K cycles for each chunk.
+ *
+ * With a cpufreq of ~2.5GHz, this amounts to ~160us for each chunk
+ * (which would also be the interval between calls to cond_resched().)
+ */
+#define ARCH_MAX_CLEAR_PAGES_ORDER 8

-static inline void clear_page(void *page)
+void clear_pages_orig(void *page, unsigned long npages);
+void clear_pages_rep(void *page, unsigned long npages);
+void clear_pages_erms(void *page, unsigned long npages);
+
+#define __HAVE_ARCH_CLEAR_USER_PAGES
+static inline void clear_pages(void *page, unsigned int npages)
{
- alternative_call_2(clear_page_orig,
- clear_page_rep, X86_FEATURE_REP_GOOD,
- clear_page_erms, X86_FEATURE_ERMS,
- "=D" (page),
+ alternative_call_2(clear_pages_orig,
+ clear_pages_rep, X86_FEATURE_REP_GOOD,
+ clear_pages_erms, X86_FEATURE_ERMS,
+ "=D" (page), "S" ((unsigned long) npages),
"0" (page)
: "cc", "memory", "rax", "rcx");
}
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index fe59b8ac4fcc..2cc3b681734a 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -1,6 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0-only */
#include <linux/linkage.h>
#include <asm/export.h>
+#include <asm/page_types.h>

/*
* Most CPUs support enhanced REP MOVSB/STOSB instructions. It is
@@ -10,23 +11,29 @@
*/

/*
- * Zero a page.
- * %rdi - page
+ * Zero pages.
+ * %rdi - base page
+ * %rsi - number of pages
+ *
+ * Note: clear_pages_*() have differing alignments restrictions
+ * but callers are always expected to page align.
*/
-SYM_FUNC_START(clear_page_rep)
- movl $4096/8,%ecx
+SYM_FUNC_START(clear_pages_rep)
+ movq %rsi,%rcx
+ shlq $(PAGE_SHIFT - 3),%rcx
xorl %eax,%eax
rep stosq
RET
-SYM_FUNC_END(clear_page_rep)
-EXPORT_SYMBOL_GPL(clear_page_rep)
+SYM_FUNC_END(clear_pages_rep)
+EXPORT_SYMBOL_GPL(clear_pages_rep)

-SYM_FUNC_START(clear_page_orig)
+SYM_FUNC_START(clear_pages_orig)
xorl %eax,%eax
- movl $4096/64,%ecx
+ movq %rsi,%rcx
+ shlq $(PAGE_SHIFT - 6),%rcx
.p2align 4
.Lloop:
- decl %ecx
+ decq %rcx
#define PUT(x) movq %rax,x*8(%rdi)
movq %rax,(%rdi)
PUT(1)
@@ -40,13 +47,14 @@ SYM_FUNC_START(clear_page_orig)
jnz .Lloop
nop
RET
-SYM_FUNC_END(clear_page_orig)
-EXPORT_SYMBOL_GPL(clear_page_orig)
+SYM_FUNC_END(clear_pages_orig)
+EXPORT_SYMBOL_GPL(clear_pages_orig)

-SYM_FUNC_START(clear_page_erms)
- movl $4096,%ecx
+SYM_FUNC_START(clear_pages_erms)
+ movq %rsi,%rcx
+ shlq $PAGE_SHIFT, %rcx
xorl %eax,%eax
rep stosb
RET
-SYM_FUNC_END(clear_page_erms)
-EXPORT_SYMBOL_GPL(clear_page_erms)
+SYM_FUNC_END(clear_pages_erms)
+EXPORT_SYMBOL_GPL(clear_pages_erms)
--
2.31.1

2022-06-07 17:37:15

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v3 16/21] x86/clear_page: add arch_clear_page_non_caching_threshold()

Add arch_clear_page_non_caching_threshold() to provide a machine-specific
value above which the non-caching clearing path is used.

The ideal threshold value depends on the CPU model and where the
performance curves for caching and non-caching stores intersect.
A safe value is the LLC size, so we use that of the boot CPU.
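
As a worked example: on the Icelakex system used elsewhere in this
series the boot CPU reports a 48MB LLC, so the threshold becomes 48MB,
i.e. 12288 4K pages; 2MB huge pages then stay on the cached path while
gigantic pages clear through the non-caching one.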

Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/include/asm/cacheinfo.h | 1 +
arch/x86/kernel/cpu/cacheinfo.c | 13 +++++++++++++
arch/x86/kernel/setup.c | 6 ++++++
3 files changed, 20 insertions(+)

diff --git a/arch/x86/include/asm/cacheinfo.h b/arch/x86/include/asm/cacheinfo.h
index 86b2e0dcc4bf..5c6045699e94 100644
--- a/arch/x86/include/asm/cacheinfo.h
+++ b/arch/x86/include/asm/cacheinfo.h
@@ -4,5 +4,6 @@

void cacheinfo_amd_init_llc_id(struct cpuinfo_x86 *c, int cpu);
void cacheinfo_hygon_init_llc_id(struct cpuinfo_x86 *c, int cpu);
+int cacheinfo_lookup_max_size(int cpu);

#endif /* _ASM_X86_CACHEINFO_H */
diff --git a/arch/x86/kernel/cpu/cacheinfo.c b/arch/x86/kernel/cpu/cacheinfo.c
index fe98a1465be6..6fb0cb868099 100644
--- a/arch/x86/kernel/cpu/cacheinfo.c
+++ b/arch/x86/kernel/cpu/cacheinfo.c
@@ -1034,3 +1034,16 @@ int populate_cache_leaves(unsigned int cpu)

return 0;
}
+
+int cacheinfo_lookup_max_size(int cpu)
+{
+ struct cpu_cacheinfo *this_cpu_ci = get_cpu_cacheinfo(cpu);
+ struct cacheinfo *this_leaf = this_cpu_ci->info_list;
+ struct cacheinfo *max_leaf;
+
+ /*
+ * Assume that cache sizes always increase with level.
+ */
+ max_leaf = this_leaf + this_cpu_ci->num_leaves - 1;
+ return max_leaf->size;
+}
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 249981bf3d8a..701825a22863 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -50,6 +50,7 @@
#include <asm/thermal.h>
#include <asm/unwind.h>
#include <asm/vsyscall.h>
+#include <asm/cacheinfo.h>
#include <linux/vmalloc.h>

/*
@@ -1293,3 +1294,8 @@ static int __init register_kernel_offset_dumper(void)
return 0;
}
__initcall(register_kernel_offset_dumper);
+
+unsigned long __init arch_clear_page_non_caching_threshold(void)
+{
+ return cacheinfo_lookup_max_size(0);
+}
--
2.31.1

2022-06-07 17:49:17

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v3 21/21] x86/cpu/intel: set X86_FEATURE_MOVNT_SLOW for Skylake

System: Oracle X8-2 (2 nodes * 26 cores/node * 2 threads/core)
Processor: Intel Xeon Platinum 8270CL (Skylakex, 6:85:7)
Memory: 3TB evenly split between nodes
Microcode: 0x5002f01
scaling_governor: performance
LLC size: 36MB for each node
intel_pstate/no_turbo: 1

$ for i in 2 8 32 128 512; do
perf bench mem memset -f x86-64-movnt -s ${i}MB
done
# Running 'mem/memset' benchmark:
# function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
# Copying 2MB bytes ...
6.361971 GB/sec
# Copying 8MB bytes ...
6.300403 GB/sec
# Copying 32MB bytes ...
6.288992 GB/sec
# Copying 128MB bytes ...
6.328793 GB/sec
# Copying 512MB bytes ...
6.324471 GB/sec

# Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
# (X86_FEATURE_ERMS) and x86-64-movnt:

x86-64-stosb (5 runs) x86-64-movnt (5 runs) speedup
----------------------- ----------------------- -------
size BW ( pstdev) BW ( pstdev)

16MB 20.38 GB/s ( +- 2.58%) 6.25 GB/s ( +- 0.41%) -69.28%
128MB 6.52 GB/s ( +- 0.14%) 6.31 GB/s ( +- 0.47%) -3.22%
1024MB 6.48 GB/s ( +- 0.31%) 6.24 GB/s ( +- 0.00%) -3.70%
4096MB 6.51 GB/s ( +- 0.01%) 6.27 GB/s ( +- 0.42%) -3.68%

Comparing perf stats for size=4096MB:

$ perf stat -r 5 --all-user -e ... perf bench mem memset -l 1 -s 4096MB -f x86-64-stosb
# Running 'mem/memset' benchmark:
# function 'x86-64-stosb' (movsb-based memset() in arch/x86/lib/memset_64.S)
# Copying 4096MB bytes ...
6.516972 GB/sec (+- 0.01%)

Performance counter stats for 'perf bench mem memset -l 1 -s 4096MB -f x86-64-stosb' (5 runs):

3,357,373,317 cpu-cycles # 1.133 GHz ( +- 0.01% ) (29.38%)
165,063,710 instructions # 0.05 insn per cycle ( +- 1.54% ) (35.29%)
358,997 cache-references # 0.121 M/sec ( +- 0.89% ) (35.32%)
205,420 cache-misses # 57.221 % of all cache refs ( +- 3.61% ) (35.36%)
6,117,673 branch-instructions # 2.065 M/sec ( +- 1.48% ) (35.38%)
58,309 branch-misses # 0.95% of all branches ( +- 1.30% ) (35.39%)
31,329,466 bus-cycles # 10.575 M/sec ( +- 0.03% ) (23.56%)
68,543,766 L1-dcache-load-misses # 157.03% of all L1-dcache accesses ( +- 0.02% ) (23.53%)
43,648,909 L1-dcache-loads # 14.734 M/sec ( +- 0.50% ) (23.50%)
137,498 LLC-loads # 0.046 M/sec ( +- 0.21% ) (23.49%)
12,308 LLC-load-misses # 8.95% of all LL-cache accesses ( +- 2.52% ) (23.49%)
26,335 LLC-stores # 0.009 M/sec ( +- 5.65% ) (11.75%)
25,008 LLC-store-misses # 0.008 M/sec ( +- 3.42% ) (11.75%)

2.962842 +- 0.000162 seconds time elapsed ( +- 0.01% )

$ perf stat -r 5 --all-user -e ... perf bench mem memset -l 1 -s 4096MB -f x86-64-movnt
# Running 'mem/memset' benchmark:
# function 'x86-64-movnt' (movnt-based memset() in arch/x86/lib/memset_64.S)
# Copying 4096MB bytes ...
6.283420 GB/sec (+- 0.01%)

Performance counter stats for 'perf bench mem memset -l 1 -s 4096MB -f x86-64-movnt' (5 runs):

4,462,272,094 cpu-cycles # 1.322 GHz ( +- 0.30% ) (29.38%)
1,633,675,881 instructions # 0.37 insn per cycle ( +- 0.21% ) (35.28%)
283,627 cache-references # 0.084 M/sec ( +- 0.58% ) (35.31%)
28,824 cache-misses # 10.163 % of all cache refs ( +- 20.67% ) (35.34%)
139,719,697 branch-instructions # 41.407 M/sec ( +- 0.16% ) (35.35%)
58,062 branch-misses # 0.04% of all branches ( +- 1.49% ) (35.36%)
41,760,350 bus-cycles # 12.376 M/sec ( +- 0.05% ) (23.55%)
303,300 L1-dcache-load-misses # 0.69% of all L1-dcache accesses ( +- 2.08% ) (23.53%)
43,769,498 L1-dcache-loads # 12.972 M/sec ( +- 0.54% ) (23.52%)
99,570 LLC-loads # 0.030 M/sec ( +- 1.06% ) (23.52%)
1,966 LLC-load-misses # 1.97% of all LL-cache accesses ( +- 6.17% ) (23.52%)
129 LLC-stores # 0.038 K/sec ( +- 27.85% ) (11.75%)
7 LLC-store-misses # 0.002 K/sec ( +- 47.82% ) (11.75%)

3.37465 +- 0.00474 seconds time elapsed ( +- 0.14% )

It's unclear if using MOVNT is a net negative on Skylake. For bulk stores
MOVNT is slightly slower than REP;STOSB, but from the L1-dcache-load-misses
stats (L1D.REPLACEMENT), it does elide the write-allocate and thus helps
with cache efficiency.

However, we err on the side of caution and set X86_FEATURE_MOVNT_SLOW
on Skylake.

Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/kernel/cpu/bugs.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 16e293654d34..ee7206f03d15 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -97,7 +97,21 @@ DEFINE_STATIC_KEY_FALSE(switch_mm_cond_l1d_flush);
void check_movnt_quirks(struct cpuinfo_x86 *c)
{
#ifdef CONFIG_X86_64
-
+ if (c->x86_vendor == X86_VENDOR_INTEL) {
+ if (c->x86 == 6) {
+ switch (c->x86_model) {
+ case INTEL_FAM6_SKYLAKE_L:
+ fallthrough;
+ case INTEL_FAM6_SKYLAKE:
+ fallthrough;
+ case INTEL_FAM6_SKYLAKE_X:
+ set_cpu_cap(c, X86_FEATURE_MOVNT_SLOW);
+ break;
+ default:
+ break;
+ }
+ }
+ }
#endif
}

--
2.31.1

2022-06-07 18:38:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] huge page clearing optimizations

On Mon, Jun 6, 2022 at 1:22 PM Ankur Arora <[email protected]> wrote:
>
> This series introduces two optimizations in the huge page clearing path:
>
> 1. extends the clear_page() machinery to also handle extents larger
> than a single page.
> 2. support non-cached page clearing for huge and gigantic pages.
>
> The first optimization is useful for hugepage fault handling, the
> second for prefaulting, or for gigantic pages.

Please just split these two issues up into entirely different patch series.

That said, I have a few complaints about the individual patches even
in this form, to the point where I think the whole series is nasty:

- get rid of 3/21 entirely. It's wrong in every possible way:

(a) That shouldn't be an inline function in a header file at all.
If you're clearing several pages of data, that just shouldn't be an
inline function.

(b) Get rid of __HAVE_ARCH_CLEAR_USER_PAGES. I hate how people
make up those idiotic pointless names.

If you have to use a #ifdef, just use the name of the
function that the architecture overrides, not some other new name.

But you don't need it at all, because

(c) Just make a __weak function called clear_user_highpages() in
mm/highmem.c, and allow architectures to just create their own
non-weak ones.
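
Something like this, maybe (entirely untested, and the argument list
here is just a guess):

/* mm/highmem.c */
__weak void clear_user_highpages(struct page *page, unsigned long vaddr,
                                 unsigned int npages)
{
        unsigned int i;

        for (i = 0; i < npages; i++)
                clear_user_highpage(page + i, vaddr + i * PAGE_SIZE);
}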

- patches 4/21 and 5/21: can we instead just get rid of that silly
"process_huge_page()" thing entirely. It's disgusting, and it's a big
part of why 'rep movs/stos' cannot work efficiently. It also makes NO
SENSE if you then use non-temporal accesses.

So instead of doubling down on the craziness of that function, just
get rid of it entirely.

There are two users, and they want to clear a hugepage and copy it
respectively. Don't make it harder than it is.

*Maybe* the code wants to do a "prefetch" afterwards. Who knows.
But I really think you should do the crapectomy first, make the code
simpler and more straightforward, and just allow architectures to
override the *simple* "copy or clear a large page" rather than keep
feeding this butt-ugly monstrosity.

- 13/21: see 3/21.

- 14-17/21: see 4/21 and 5/21. Once you do the crapectomy and get rid
of the crazy process_huge_page() abstraction, and just let
architectures do their own clear/copy huge pages, *all* this craziness
goes away. Those "when to use which type of clear/copy" becomes a
*local* question, no silly arch_clear_page_non_caching_threshold()
garbage.

So I really don't like this series. A *lot* of it comes from that
horrible process_huge_page() model, and the whole model is just wrong
and pointless. You're literally trying to fix the mess that that
function is, but you're keeping the fundamental problem around.

The whole *point* of your patch-set is to use non-temporal stores,
which makes all the process_huge_page() things entirely pointless, and
only complicates things.

And even if we don't use non-temporal stores, that process_huge_page()
thing makes for trouble for any "rep stos/movs" implementation that
might actually do a better job if it was just chunked up in bigger
chunks.

Yes, yes, you probably still want to chunk that up somewhat due to
latency reasons, but even then architectures might as well just make
their own decisions, rather than have the core mm code make one
clearly bad decision for them. Maybe chunking it up in bigger chunks
than one page.

Maybe an architecture could do even more radical things like "let's
just 'rep stos' for the whole area, but set a special thread flag that
causes the interrupt return to break it up on return to kernel space".
IOW, the "latency fix" might not even be about chunking it up, it
might look more like our exception handling thing.

So I really think that crapectomy should be the first thing you do,
and that should be that first part of "extends the clear_page()
machinery to also handle extents larger than a single page"

Linus

2022-06-08 03:45:15

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v3 07/21] x86/asm: add memset_movnti()

Add a MOVNTI based non-caching implementation of memset().

memset_movnti() only needs to differ from memset_orig() in the opcode
used in the inner loop, so move the memset_orig() logic into a macro,
and use that to generate memset_orig() (now memset_movq()) and
memset_movnti().

Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/lib/memset_64.S | 68 ++++++++++++++++++++++------------------
1 file changed, 38 insertions(+), 30 deletions(-)

diff --git a/arch/x86/lib/memset_64.S b/arch/x86/lib/memset_64.S
index fc9ffd3ff3b2..307b753ca03a 100644
--- a/arch/x86/lib/memset_64.S
+++ b/arch/x86/lib/memset_64.S
@@ -24,7 +24,7 @@ SYM_FUNC_START(__memset)
*
* Otherwise, use original memset function.
*/
- ALTERNATIVE_2 "jmp memset_orig", "", X86_FEATURE_REP_GOOD, \
+ ALTERNATIVE_2 "jmp memset_movq", "", X86_FEATURE_REP_GOOD, \
"jmp memset_erms", X86_FEATURE_ERMS

movq %rdi,%r9
@@ -66,7 +66,8 @@ SYM_FUNC_START_LOCAL(memset_erms)
RET
SYM_FUNC_END(memset_erms)

-SYM_FUNC_START_LOCAL(memset_orig)
+.macro MEMSET_MOV OP fence
+SYM_FUNC_START_LOCAL(memset_\OP)
movq %rdi,%r10

/* expand byte value */
@@ -77,64 +78,71 @@ SYM_FUNC_START_LOCAL(memset_orig)
/* align dst */
movl %edi,%r9d
andl $7,%r9d
- jnz .Lbad_alignment
-.Lafter_bad_alignment:
+ jnz .Lbad_alignment_\@
+.Lafter_bad_alignment_\@:

movq %rdx,%rcx
shrq $6,%rcx
- jz .Lhandle_tail
+ jz .Lhandle_tail_\@

.p2align 4
-.Lloop_64:
+.Lloop_64_\@:
decq %rcx
- movq %rax,(%rdi)
- movq %rax,8(%rdi)
- movq %rax,16(%rdi)
- movq %rax,24(%rdi)
- movq %rax,32(%rdi)
- movq %rax,40(%rdi)
- movq %rax,48(%rdi)
- movq %rax,56(%rdi)
+ \OP %rax,(%rdi)
+ \OP %rax,8(%rdi)
+ \OP %rax,16(%rdi)
+ \OP %rax,24(%rdi)
+ \OP %rax,32(%rdi)
+ \OP %rax,40(%rdi)
+ \OP %rax,48(%rdi)
+ \OP %rax,56(%rdi)
leaq 64(%rdi),%rdi
- jnz .Lloop_64
+ jnz .Lloop_64_\@

/* Handle tail in loops. The loops should be faster than hard
to predict jump tables. */
.p2align 4
-.Lhandle_tail:
+.Lhandle_tail_\@:
movl %edx,%ecx
andl $63&(~7),%ecx
- jz .Lhandle_7
+ jz .Lhandle_7_\@
shrl $3,%ecx
.p2align 4
-.Lloop_8:
+.Lloop_8_\@:
decl %ecx
- movq %rax,(%rdi)
+ \OP %rax,(%rdi)
leaq 8(%rdi),%rdi
- jnz .Lloop_8
+ jnz .Lloop_8_\@

-.Lhandle_7:
+.Lhandle_7_\@:
andl $7,%edx
- jz .Lende
+ jz .Lende_\@
.p2align 4
-.Lloop_1:
+.Lloop_1_\@:
decl %edx
movb %al,(%rdi)
leaq 1(%rdi),%rdi
- jnz .Lloop_1
+ jnz .Lloop_1_\@

-.Lende:
+.Lende_\@:
+ .if \fence
+ sfence
+ .endif
movq %r10,%rax
RET

-.Lbad_alignment:
+.Lbad_alignment_\@:
cmpq $7,%rdx
- jbe .Lhandle_7
+ jbe .Lhandle_7_\@
movq %rax,(%rdi) /* unaligned store */
movq $8,%r8
subq %r9,%r8
addq %r8,%rdi
subq %r8,%rdx
- jmp .Lafter_bad_alignment
-.Lfinal:
-SYM_FUNC_END(memset_orig)
+ jmp .Lafter_bad_alignment_\@
+.Lfinal_\@:
+SYM_FUNC_END(memset_\OP)
+.endm
+
+MEMSET_MOV OP=movq fence=0
+MEMSET_MOV OP=movnti fence=1
--
2.31.1

2022-06-08 04:41:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] huge page clearing optimizations

On Tue, Jun 7, 2022 at 8:10 AM Ankur Arora <[email protected]> wrote:
>
> For highmem and page-at-a-time archs we would need to keep some
> of the same optimizations (via the common clear/copy_user_highpages().)

Yeah, I guess that we could keep the code for legacy use, just make
the existing code be marked __weak so that it can be ignored for any
further work.

IOW, the first patch might be to just add that __weak to
'clear_huge_page()' and 'copy_user_huge_page()'.

At that point, any architecture can just say "I will implement my own
versions of these two".

In fact, you can start with just one or the other, which is probably
nicer to keep the patch series smaller (ie do the simpler
"clear_huge_page()" first).

I worry a bit about the insanity of the "gigantic" pages, and the
mem_map_next() games it plays, but that code is from 2008 and I really
doubt it makes any sense to keep around at least for x86. The source
of that abomination is powerpc, and I do not think that whole issue
with MAX_ORDER_NR_PAGES makes any difference on x86, at least.

It most definitely makes no sense when there is no highmem issues, and
all those 'struct page' games should just be deleted (or at least
relegated entirely to that "legacy __weak function" case so that sane
situations don't need to care).

For that same HIGHMEM reason it's probably a good idea to limit the
new case just to x86-64, and leave 32-bit x86 behind.

> Right. Or doing the whole contiguous area in one or a few chunks,
> and then touching the faulting cachelines towards the end.

Yeah, just add a prefetch for the 'addr_hint' part at the end.

> > Maybe an architecture could do even more radical things like "let's
> > just 'rep stos' for the whole area, but set a special thread flag that
> > causes the interrupt return to break it up on return to kernel space".
> > IOW, the "latency fix" might not even be about chunking it up, it
> > might look more like our exception handling thing.
>
> When I was thinking about this earlier, I had a vague inkling of
> setting a thread flag and defer writes to the last few cachelines
> for just before returning to user-space.
> Can you elaborate a little about what you are describing above?

So 'process_huge_page()' (and the gigantic page case) does three very
different things:

(a) that page chunking for highmem accesses

(b) the page access _ordering_ for the cache hinting reasons

(c) the chunking for _latency_ reasons

and I think all of them are basically "bad legacy" reasons, in that

(a) HIGHMEM doesn't exist on sane architectures that we care about these days

(b) the cache hinting ordering makes no sense if you do non-temporal
accesses (and might then be replaced by a possible "prefetch" at the
end)

(c) the latency reasons still *do* exist, but only with PREEMPT_NONE

So what I was alluding to with those "more radical approaches" was
that PREEMPT_NONE case: we would probably still want to chunk things
up for latency reasons and do that "cond_resched()" in between
chunks.

Now, there are alternatives here:

(a) only override that existing disgusting (but tested) function when
both CONFIG_HIGHMEM and CONFIG_PREEMPT_NONE are false

(b) do something like this:

void clear_huge_page(struct page *page,
                     unsigned long addr_hint,
                     unsigned int pages_per_huge_page)
{
        void *addr = page_address(page);

#ifdef CONFIG_PREEMPT_NONE
        for (int i = 0; i < pages_per_huge_page; i++) {
                clear_page(addr + i * PAGE_SIZE);
                cond_preempt();
        }
#else
        nontemporal_clear_big_area(addr, PAGE_SIZE*pages_per_huge_page);
        prefetch(addr_hint);
#endif
}

or (c), do that "more radical approach", where you do something like this:

void clear_huge_page(struct page *page,
                     unsigned long addr_hint,
                     unsigned int pages_per_huge_page)
{
        void *addr = page_address(page);

        set_thread_flag(TIF_PREEMPT_ME);
        nontemporal_clear_big_area(addr, PAGE_SIZE*pages_per_huge_page);
        clear_thread_flag(TIF_PREEMPT_ME);
        prefetch(addr_hint);
}

and then you make the "return to kernel mode" check the TIF_PREEMPT_ME
case and actually force preemption even on a non-preempt kernel.
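
Handwaving some more, the irq-return-to-kernel path would then grow
a check along these lines (again, not real code):

        if (test_thread_flag(TIF_PREEMPT_ME) && need_resched())
                preempt_schedule_irq();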

It's _probably_ the case that CONFIG_PREEMPT_NONE is so rare that it's
not even worth doing. I dunno.

And all of the above pseudo-code may _look_ like real code, but is
entirely untested and entirely handwavy "something like this".

Hmm?

Linus

2022-06-08 05:13:05

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] huge page clearing optimizations

[ Fixed email for Joao Martins. ]

Linus Torvalds <[email protected]> writes:

> On Mon, Jun 6, 2022 at 1:22 PM Ankur Arora <[email protected]> wrote:
[snip]

> So I really don't like this series. A *lot* of it comes from that
> horrible process_huge_page() model, and the whole model is just wrong
> and pointless. You're literally trying to fix the mess that that
> function is, but you're keeping the fundamental problem around.
>
> The whole *point* of your patch-set is to use non-temporal stores,
> which makes all the process_huge_page() things entirely pointless, and
> only complicates things.
>
> And even if we don't use non-temporal stores, that process_huge_page()
> thing makes for trouble for any "rep stos/movs" implementation that
> might actualyl do a better job if it was just chunked up in bigger
> chunks.

This makes sense to me. There is a lot of unnecessary machinery
around process_huge_page() and this series adds more of it.

For highmem and page-at-a-time archs we would need to keep some
of the same optimizations (via the common clear/copy_user_highpages().)

Still, that rids the arch code of pointless constraints, as you
say below.

> Yes, yes, you probably still want to chunk that up somewhat due to
> latency reasons, but even then architectures might as well just make
> their own decisions, rather than have the core mm code make one
> clearly bad decision for them. Maybe chunking it up in bigger chunks
> than one page.

Right. Or doing the whole contiguous area in one or a few chunks,
and then touching the faulting cachelines towards the end.

> Maybe an architecture could do even more radical things like "let's
> just 'rep stos' for the whole area, but set a special thread flag that
> causes the interrupt return to break it up on return to kernel space".
> IOW, the "latency fix" might not even be about chunking it up, it
> might look more like our exception handling thing.

When I was thinking about this earlier, I had a vague inkling of
setting a thread flag and deferring writes to the last few cachelines
until just before returning to user-space.
Can you elaborate a little about what you are describing above?

> So I really think that crapectomy should be the first thing you do,
> and that should be that first part of "extends the clear_page()
> machinery to also handle extents larger than a single page"

Ack that. And, thanks for the detailed comments.

--
ankur

2022-06-08 08:28:31

by Luc Van Oostenryck

[permalink] [raw]
Subject: Re: [PATCH v3 13/21] clear_page: add generic clear_user_pages_incoherent()

On Mon, Jun 06, 2022 at 08:37:17PM +0000, Ankur Arora wrote:
> +static inline void clear_user_pages_incoherent(__incoherent void *page,
> + unsigned long vaddr,
> + struct page *pg,
> + unsigned int npages)
> +{
> + clear_user_pages((__force void *)page, vaddr, pg, npages);
> +}

Hi,

Please use 'void __incoherent *' and 'void __force *', as it's done
elsewhere for __force and address spaces.

-- Luc

2022-06-08 20:06:44

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] huge page clearing optimizations

On Wed, Jun 08, 2022 at 08:49:57PM +0100, Matthew Wilcox wrote:
> On Tue, Jun 07, 2022 at 10:56:01AM -0700, Linus Torvalds wrote:
> > I worry a bit about the insanity of the "gigantic" pages, and the
> > mem_map_next() games it plays, but that code is from 2008 and I really
> > doubt it makes any sense to keep around at least for x86. The source
> > of that abomination is powerpc, and I do not think that whole issue
> > with MAX_ORDER_NR_PAGES makes any difference on x86, at least.
>
> Oh, argh, I meant to delete mem_map_next(), and forgot.
>
> If you need to use struct page (a later message hints you don't), just
> use nth_page() directly. I optimised it so it's not painful except on
> SPARSEMEM && !SPARSEMEM_VMEMMAP back in December in commit 659508f9c936.
> And nobody cares about performance on SPARSEMEM && !SPARSEMEM_VMEMMAP
> systems.

Oops, wrong commit. I meant 1cfcee728391 from June 2021.

2022-06-08 20:08:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] huge page clearing optimizations

On Wed, Jun 8, 2022 at 12:25 PM Ankur Arora <[email protected]> wrote:
>
> But, even on x86, AFAICT gigantic pages could straddle MAX_SECTION_BITS?
> An arch specific clear_huge_page() code could, however handle 1GB pages
> via some kind of static loop around (30 - MAX_SECTION_BITS).

Even if gigantic pages straddle that area, it simply shouldn't matter.

The only reason that MAX_SECTION_BITS matters is for the 'struct page *' lookup.

And the only reason for *that* is because of HIGHMEM.

So it's all entirely silly and pointless on any sane architecture, I think.

> We'll need a preemption point there for CONFIG_PREEMPT_VOLUNTARY
> as well, right?

Ahh, yes. I should have looked at the code, and not just gone by my
"PREEMPT_NONE vs PREEMPT" thing that entirely forgot about how we
split that up.

> Just one minor point -- seems to me that the choice of nontemporal or
> temporal might have to be based on a hint to clear_huge_page().

Quite possibly. But I'd prefer that as a separate "look, this
improves numbers by X%" thing from the whole "let's make the
clear_huge_page() interface at least sane".

Linus

2022-06-08 20:41:10

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] huge page clearing optimizations

On Tue, Jun 07, 2022 at 10:56:01AM -0700, Linus Torvalds wrote:
> I worry a bit about the insanity of the "gigantic" pages, and the
> mem_map_next() games it plays, but that code is from 2008 and I really
> doubt it makes any sense to keep around at least for x86. The source
> of that abomination is powerpc, and I do not think that whole issue
> with MAX_ORDER_NR_PAGES makes any difference on x86, at least.

Oh, argh, I meant to delete mem_map_next(), and forgot.

If you need to use struct page (a later message hints you don't), just
use nth_page() directly. I optimised it so it's not painful except on
SPARSEMEM && !SPARSEMEM_VMEMMAP back in December in commit 659508f9c936.
And nobody cares about performance on SPARSEMEM && !SPARSEMEM_VMEMMAP
systems.

2022-06-08 20:46:50

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] huge page clearing optimizations


Linus Torvalds <[email protected]> writes:

> On Tue, Jun 7, 2022 at 8:10 AM Ankur Arora <[email protected]> wrote:
>>
>> For highmem and page-at-a-time archs we would need to keep some
>> of the same optimizations (via the common clear/copy_user_highpages().)
>
> Yeah, I guess that we could keep the code for legacy use, just make
> the existing code be marked __weak so that it can be ignored for any
> further work.
>
> IOW, the first patch might be to just add that __weak to
> 'clear_huge_page()' and 'copy_user_huge_page()'.
>
> At that point, any architecture can just say "I will implement my own
> versions of these two".
>
> In fact, you can start with just one or the other, which is probably
> nicer to keep the patch series smaller (ie do the simpler
> "clear_huge_page()" first).

Agreed. Best way to iron out all the kinks too.

> I worry a bit about the insanity of the "gigantic" pages, and the
> mem_map_next() games it plays, but that code is from 2008 and I really
> doubt it makes any sense to keep around at least for x86. The source
> of that abomination is powerpc, and I do not think that whole issue
> with MAX_ORDER_NR_PAGES makes any difference on x86, at least.

Looking at it now, it seems to be caused by the wide range of
MAX_ZONEORDER values on powerpc? It made my head hurt so I didn't try
to figure it out in detail.

But, even on x86, AFAICT gigantic pages could straddle MAX_SECTION_BITS?
An arch specific clear_huge_page() code could, however handle 1GB pages
via some kind of static loop around (30 - MAX_SECTION_BITS).

I'm a little fuzzy on CONFIG_SPARSEMEM_EXTREME, and !SPARSEMEM_VMEMMAP
configs. But, I think we should be able to not look up pfn_to_page(),
page_to_pfn() at all or at least not in the inner loop.

> It most definitely makes no sense when there is no highmem issues, and
> all those 'struct page' games should just be deleted (or at least
> relegated entirely to that "legacy __weak function" case so that sane
> situations don't need to care).

Yeah, I'm hoping to do exactly that.

> For that same HIGHMEM reason it's probably a good idea to limit the
> new case just to x86-64, and leave 32-bit x86 behind.

Ack that.

>> Right. Or doing the whole contiguous area in one or a few chunks,
>> and then touching the faulting cachelines towards the end.
>
> Yeah, just add a prefetch for the 'addr_hint' part at the end.
>
>> > Maybe an architecture could do even more radical things like "let's
>> > just 'rep stos' for the whole area, but set a special thread flag that
>> > causes the interrupt return to break it up on return to kernel space".
>> > IOW, the "latency fix" might not even be about chunking it up, it
>> > might look more like our exception handling thing.
>>
>> When I was thinking about this earlier, I had a vague inkling of
>> setting a thread flag and defer writes to the last few cachelines
>> for just before returning to user-space.
>> Can you elaborate a little about what you are describing above?
>
> So 'process_huge_page()' (and the gigantic page case) does three very
> different things:
>
> (a) that page chunking for highmem accesses
>
> (b) the page access _ordering_ for the cache hinting reasons
>
> (c) the chunking for _latency_ reasons
>
> and I think all of them are basically "bad legacy" reasons, in that
>
> (a) HIGHMEM doesn't exist on sane architectures that we care about these days
>
> (b) the cache hinting ordering makes no sense if you do non-temporal
> accesses (and might then be replaced by a possible "prefetch" at the
> end)
>
> (c) the latency reasons still *do* exist, but only with PREEMPT_NONE
>
> So what I was alluding to with those "more radical approaches" was
> that PREEMPT_NONE case: we would probably still want to chunk things
> up for latency reasons and do that "cond_resched()" in between
> chunks.

Thanks for the detail. That helps.

> Now, there are alternatives here:
>
> (a) only override that existing disgusting (but tested) function when
> both CONFIG_HIGHMEM and CONFIG_PREEMPT_NONE are false
>
> (b) do something like this:
>
> void clear_huge_page(struct page *page,
>                      unsigned long addr_hint,
>                      unsigned int pages_per_huge_page)
> {
>         void *addr = page_address(page);
>
> #ifdef CONFIG_PREEMPT_NONE
>         for (int i = 0; i < pages_per_huge_page; i++) {
>                 clear_page(addr + i * PAGE_SIZE);
>                 cond_preempt();
>         }
> #else
>         nontemporal_clear_big_area(addr, PAGE_SIZE*pages_per_huge_page);
>         prefetch(addr_hint);
> #endif
> }

We'll need a preemption point there for CONFIG_PREEMPT_VOLUNTARY
as well, right? Either way, as you said earlier, we could chunk
things up in bigger units than a single page.
(In the numbers I had posted earlier, chunking in units of up to 1MB
gave ~25% higher clearing BW. I don't think the microcode setup costs
are that high, but I don't have a good explanation why.)

> or (c), do that "more radical approach", where you do something like this:
>
> void clear_huge_page(struct page *page,
>                      unsigned long addr_hint,
>                      unsigned int pages_per_huge_page)
> {
>         void *addr = page_address(page);
>
>         set_thread_flag(TIF_PREEMPT_ME);
>         nontemporal_clear_big_area(addr, PAGE_SIZE*pages_per_huge_page);
>         clear_thread_flag(TIF_PREEMPT_ME);
>         prefetch(addr_hint);
> }
>
> and then you make the "return to kernel mode" check the TIF_PREEMPT_ME
> case and actually force preemption even on a non-preempt kernel.

I like this one. I'll try out (b) and (c) and see how the code shakes
out.

Just one minor point -- seems to me that the choice of nontemporal or
temporal might have to be based on a hint to clear_huge_page().

Basically the nontemporal path is only faster for
(pages_per_huge_page * PAGE_SIZE > LLC-size).

So in the page-fault path it might make sense to use the temporal
path (except for gigantic pages.) In the prefault path, nontemporal
might be better.
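
Maybe something as simple as this (strictly illustrative, not a
proposal for the final interface):

enum clear_page_hint { CLEAR_PAGE_TEMPORAL, CLEAR_PAGE_NON_TEMPORAL };

void clear_huge_page(struct page *page, unsigned long addr_hint,
                     unsigned int pages_per_huge_page,
                     enum clear_page_hint hint);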

Thanks

--
ankur

2022-06-08 20:51:36

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH v3 00/21] huge page clearing optimizations


Linus Torvalds <[email protected]> writes:

> On Wed, Jun 8, 2022 at 12:25 PM Ankur Arora <[email protected]> wrote:
>>
>> But, even on x86, AFAICT gigantic pages could straddle MAX_SECTION_BITS?
>> An arch specific clear_huge_page() code could, however handle 1GB pages
>> via some kind of static loop around (30 - MAX_SECTION_BITS).
>
> Even if gigantic pages straddle that area, it simply shouldn't matter.
>
> The only reason that MAX_SECTION_BITS matters is for the 'struct page *' lookup.
>
> And the only reason for *that* is because of HIGHMEM.
>
> So it's all entirely silly and pointless on any sane architecture, I think.
>
>> We'll need a preemption point there for CONFIG_PREEMPT_VOLUNTARY
>> as well, right?
>
> Ahh, yes. I should have looked at the code, and not just gone by my
> "PREEMPT_NONE vs PREEMPT" thing that entirely forgot about how we
> split that up.
>
>> Just one minor point -- seems to me that the choice of nontemporal or
>> temporal might have to be based on a hint to clear_huge_page().
>
> Quite possibly. But I'd prefer that as a separate "look, this
> improves numbers by X%" thing from the whole "let's make the
> clear_huge_page() interface at least sane".

Makes sense to me.

--
ankur

2022-06-10 22:38:39

by Noah Goldstein

[permalink] [raw]
Subject: Re: [PATCH v3 09/21] x86/asm: add clear_pages_movnt()

On Mon, Jun 6, 2022 at 11:39 PM Ankur Arora <[email protected]> wrote:
>
> Add clear_pages_movnt(), which uses MOVNTI as the underlying primitive.
> With this, page-clearing can skip the memory hierarchy, thus providing
> a non cache-polluting implementation of clear_pages().
>
> MOVNTI, from the Intel SDM, Volume 2B, 4-101:
> "The non-temporal hint is implemented by using a write combining (WC)
> memory type protocol when writing the data to memory. Using this
> protocol, the processor does not write the data into the cache
> hierarchy, nor does it fetch the corresponding cache line from memory
> into the cache hierarchy."
>
> The AMD Arch Manual has something similar to say as well.
>
> One use-case is to zero large extents without bringing in never-to-be-
> accessed cachelines. Also, often clear_pages_movnt() based clearing is
> faster once extent sizes are O(LLC-size).
>
> As the excerpt notes, MOVNTI is weakly ordered with respect to other
> instructions operating on the memory hierarchy. This needs to be
> handled by the caller by executing an SFENCE when done.
>
> The implementation is straight-forward: unroll the inner loop to keep
> the code similar to memset_movnti(), so that we can gauge
> clear_pages_movnt() performance via perf bench mem memset.
>
> # Intel Icelakex
> # Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
> # (X86_FEATURE_ERMS) and x86-64-movnt:
>
> System: Oracle X9-2 (2 nodes * 32 cores * 2 threads)
> Processor: Intel Xeon(R) Platinum 8358 CPU @ 2.60GHz (Icelakex, 6:106:6)
> Memory: 512 GB evenly split between nodes
> LLC-size: 48MB for each node (32-cores * 2-threads)
> no_turbo: 1, Microcode: 0xd0001e0, scaling-governor: performance
>
> x86-64-stosb (5 runs) x86-64-movnt (5 runs) Delta(%)
> ---------------------- --------------------- --------
> size BW ( stdev) BW ( stdev)
>
> 2MB 14.37 GB/s ( +- 1.55) 12.59 GB/s ( +- 1.20) -12.38%
> 16MB 16.93 GB/s ( +- 2.61) 15.91 GB/s ( +- 2.74) -6.02%
> 128MB 12.12 GB/s ( +- 1.06) 22.33 GB/s ( +- 1.84) +84.24%
> 1024MB 12.12 GB/s ( +- 0.02) 23.92 GB/s ( +- 0.14) +97.35%
> 4096MB 12.08 GB/s ( +- 0.02) 23.98 GB/s ( +- 0.18) +98.50%

For these sizes it may be worth it to save/restore an xmm register to
do the memset (a rough sketch of the kind of loop I mean follows the
numbers below):

Just on my Tigerlake laptop:
model name : 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz

             movntdq xmm (5 runs)      movnti GPR (5 runs)      Delta(%)
             -----------------------   -----------------------
size         BW GB/s ( +- stdev)       BW GB/s ( +- stdev)      %

2 MB         35.71 GB/s ( +- 1.02)     34.62 GB/s ( +- 0.77)    -3.15%
16 MB        36.43 GB/s ( +- 0.35)     31.3 GB/s ( +- 0.1)      -16.39%
128 MB       35.6 GB/s ( +- 0.83)      30.82 GB/s ( +- 0.08)    -15.5%
1024 MB      36.85 GB/s ( +- 0.26)     30.71 GB/s ( +- 0.2)     -20.0%
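
For reference, the sort of inner loop I mean (untested sketch; assumes
%rdi is 16-byte aligned, %rcx is a byte count that's a multiple of 64,
and in the kernel it would also need kernel_fpu_begin()/kernel_fpu_end()
around it):

        pxor    %xmm0, %xmm0
.Lxmm_loop:
        movntdq %xmm0, 0x00(%rdi)
        movntdq %xmm0, 0x10(%rdi)
        movntdq %xmm0, 0x20(%rdi)
        movntdq %xmm0, 0x30(%rdi)
        addq    $0x40, %rdi
        subq    $0x40, %rcx
        ja      .Lxmm_loop
        sfence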
>
> Signed-off-by: Ankur Arora <[email protected]>
> ---
> arch/x86/include/asm/page_64.h | 1 +
> arch/x86/lib/clear_page_64.S | 21 +++++++++++++++++++++
> 2 files changed, 22 insertions(+)
>
> diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
> index a88a3508888a..3affc4ecb8da 100644
> --- a/arch/x86/include/asm/page_64.h
> +++ b/arch/x86/include/asm/page_64.h
> @@ -55,6 +55,7 @@ extern unsigned long __phys_addr_symbol(unsigned long);
> void clear_pages_orig(void *page, unsigned long npages);
> void clear_pages_rep(void *page, unsigned long npages);
> void clear_pages_erms(void *page, unsigned long npages);
> +void clear_pages_movnt(void *page, unsigned long npages);
>
> #define __HAVE_ARCH_CLEAR_USER_PAGES
> static inline void clear_pages(void *page, unsigned int npages)
> diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
> index 2cc3b681734a..83d14f1c9f57 100644
> --- a/arch/x86/lib/clear_page_64.S
> +++ b/arch/x86/lib/clear_page_64.S
> @@ -58,3 +58,24 @@ SYM_FUNC_START(clear_pages_erms)
> RET
> SYM_FUNC_END(clear_pages_erms)
> EXPORT_SYMBOL_GPL(clear_pages_erms)
> +
> +SYM_FUNC_START(clear_pages_movnt)
> + xorl %eax,%eax
> + movq %rsi,%rcx
> + shlq $PAGE_SHIFT, %rcx
> +
> + .p2align 4
> +.Lstart:
> + movnti %rax, 0x00(%rdi)
> + movnti %rax, 0x08(%rdi)
> + movnti %rax, 0x10(%rdi)
> + movnti %rax, 0x18(%rdi)
> + movnti %rax, 0x20(%rdi)
> + movnti %rax, 0x28(%rdi)
> + movnti %rax, 0x30(%rdi)
> + movnti %rax, 0x38(%rdi)
> + addq $0x40, %rdi
> + subl $0x40, %ecx
> + ja .Lstart
> + RET
> +SYM_FUNC_END(clear_pages_movnt)
> --
> 2.31.1
>

2022-06-10 23:03:07

by Noah Goldstein

[permalink] [raw]
Subject: Re: [PATCH v3 09/21] x86/asm: add clear_pages_movnt()

On Fri, Jun 10, 2022 at 3:11 PM Noah Goldstein <[email protected]> wrote:
>
> On Mon, Jun 6, 2022 at 11:39 PM Ankur Arora <[email protected]> wrote:
> >
> > Add clear_pages_movnt(), which uses MOVNTI as the underlying primitive.
> > With this, page-clearing can skip the memory hierarchy, thus providing
> > a non cache-polluting implementation of clear_pages().
> >
> > MOVNTI, from the Intel SDM, Volume 2B, 4-101:
> > "The non-temporal hint is implemented by using a write combining (WC)
> > memory type protocol when writing the data to memory. Using this
> > protocol, the processor does not write the data into the cache
> > hierarchy, nor does it fetch the corresponding cache line from memory
> > into the cache hierarchy."
> >
> > The AMD Arch Manual has something similar to say as well.
> >
> > One use-case is to zero large extents without bringing in never-to-be-
> > accessed cachelines. Also, often clear_pages_movnt() based clearing is
> > faster once extent sizes are O(LLC-size).
> >
> > As the excerpt notes, MOVNTI is weakly ordered with respect to other
> > instructions operating on the memory hierarchy. This needs to be
> > handled by the caller by executing an SFENCE when done.
> >
> > The implementation is straight-forward: unroll the inner loop to keep
> > the code similar to memset_movnti(), so that we can gauge
> > clear_pages_movnt() performance via perf bench mem memset.
> >
> > # Intel Icelakex
> > # Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
> > # (X86_FEATURE_ERMS) and x86-64-movnt:
> >
> > System: Oracle X9-2 (2 nodes * 32 cores * 2 threads)
> > Processor: Intel Xeon(R) Platinum 8358 CPU @ 2.60GHz (Icelakex, 6:106:6)
> > Memory: 512 GB evenly split between nodes
> > LLC-size: 48MB for each node (32-cores * 2-threads)
> > no_turbo: 1, Microcode: 0xd0001e0, scaling-governor: performance
> >
> > x86-64-stosb (5 runs) x86-64-movnt (5 runs) Delta(%)
> > ---------------------- --------------------- --------
> > size BW ( stdev) BW ( stdev)
> >
> > 2MB 14.37 GB/s ( +- 1.55) 12.59 GB/s ( +- 1.20) -12.38%
> > 16MB 16.93 GB/s ( +- 2.61) 15.91 GB/s ( +- 2.74) -6.02%
> > 128MB 12.12 GB/s ( +- 1.06) 22.33 GB/s ( +- 1.84) +84.24%
> > 1024MB 12.12 GB/s ( +- 0.02) 23.92 GB/s ( +- 0.14) +97.35%
> > 4096MB 12.08 GB/s ( +- 0.02) 23.98 GB/s ( +- 0.18) +98.50%
>
> For these sizes it may be worth it to save/restore an xmm register to do
> the memset:
>
> Just on my Tigerlake laptop:
> model name : 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
>
>              movntdq xmm (5 runs)      movnti GPR (5 runs)      Delta(%)
>              -----------------------   -----------------------
> size         BW GB/s ( +- stdev)       BW GB/s ( +- stdev)      %
>
> 2 MB         35.71 GB/s ( +- 1.02)     34.62 GB/s ( +- 0.77)    -3.15%
> 16 MB        36.43 GB/s ( +- 0.35)     31.3 GB/s ( +- 0.1)      -16.39%
> 128 MB       35.6 GB/s ( +- 0.83)      30.82 GB/s ( +- 0.08)    -15.5%
> 1024 MB      36.85 GB/s ( +- 0.26)     30.71 GB/s ( +- 0.2)     -20.0%


Also (again just from Tigerlake laptop) I found the trend favor
`rep stosb` more (as opposed to non-cacheable writes) when
there are multiple threads competing for BW:

https://docs.google.com/spreadsheets/d/1f6N9EVqHg71cDIR-RALLR76F_ovW5gzwIWr26yLCmS0/edit?usp=sharing
> >
> > Signed-off-by: Ankur Arora <[email protected]>
> > ---
> > arch/x86/include/asm/page_64.h | 1 +
> > arch/x86/lib/clear_page_64.S | 21 +++++++++++++++++++++
> > 2 files changed, 22 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
> > index a88a3508888a..3affc4ecb8da 100644
> > --- a/arch/x86/include/asm/page_64.h
> > +++ b/arch/x86/include/asm/page_64.h
> > @@ -55,6 +55,7 @@ extern unsigned long __phys_addr_symbol(unsigned long);
> > void clear_pages_orig(void *page, unsigned long npages);
> > void clear_pages_rep(void *page, unsigned long npages);
> > void clear_pages_erms(void *page, unsigned long npages);
> > +void clear_pages_movnt(void *page, unsigned long npages);
> >
> > #define __HAVE_ARCH_CLEAR_USER_PAGES
> > static inline void clear_pages(void *page, unsigned int npages)
> > diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
> > index 2cc3b681734a..83d14f1c9f57 100644
> > --- a/arch/x86/lib/clear_page_64.S
> > +++ b/arch/x86/lib/clear_page_64.S
> > @@ -58,3 +58,24 @@ SYM_FUNC_START(clear_pages_erms)
> > RET
> > SYM_FUNC_END(clear_pages_erms)
> > EXPORT_SYMBOL_GPL(clear_pages_erms)
> > +
> > +SYM_FUNC_START(clear_pages_movnt)
> > + xorl %eax,%eax
> > + movq %rsi,%rcx
> > + shlq $PAGE_SHIFT, %rcx
> > +
> > + .p2align 4
> > +.Lstart:
> > + movnti %rax, 0x00(%rdi)
> > + movnti %rax, 0x08(%rdi)
> > + movnti %rax, 0x10(%rdi)
> > + movnti %rax, 0x18(%rdi)
> > + movnti %rax, 0x20(%rdi)
> > + movnti %rax, 0x28(%rdi)
> > + movnti %rax, 0x30(%rdi)
> > + movnti %rax, 0x38(%rdi)
> > + addq $0x40, %rdi
> > + subl $0x40, %ecx
> > + ja .Lstart
> > + RET
> > +SYM_FUNC_END(clear_pages_movnt)
> > --
> > 2.31.1
> >

2022-06-12 16:02:02

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH v3 13/21] clear_page: add generic clear_user_pages_incoherent()


Luc Van Oostenryck <[email protected]> writes:

> On Mon, Jun 06, 2022 at 08:37:17PM +0000, Ankur Arora wrote:
>> +static inline void clear_user_pages_incoherent(__incoherent void *page,
>> + unsigned long vaddr,
>> + struct page *pg,
>> + unsigned int npages)
>> +{
>> + clear_user_pages((__force void *)page, vaddr, pg, npages);
>> +}
>
> Hi,
>
> Please use 'void __incoherent *' and 'void __force *', as it's done
> elsewhere for __force and address spaces.

Thanks Luc. Will fix.

--
ankur

2022-06-12 16:06:57

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH v3 09/21] x86/asm: add clear_pages_movnt()


Noah Goldstein <[email protected]> writes:

> On Fri, Jun 10, 2022 at 3:11 PM Noah Goldstein <[email protected]> wrote:
>>
>> On Mon, Jun 6, 2022 at 11:39 PM Ankur Arora <[email protected]> wrote:
>> >
>> > Add clear_pages_movnt(), which uses MOVNTI as the underlying primitive.
>> > With this, page-clearing can skip the memory hierarchy, thus providing
>> > a non cache-polluting implementation of clear_pages().
>> >
>> > MOVNTI, from the Intel SDM, Volume 2B, 4-101:
>> > "The non-temporal hint is implemented by using a write combining (WC)
>> > memory type protocol when writing the data to memory. Using this
>> > protocol, the processor does not write the data into the cache
>> > hierarchy, nor does it fetch the corresponding cache line from memory
>> > into the cache hierarchy."
>> >
>> > The AMD Arch Manual has something similar to say as well.
>> >
>> > One use-case is to zero large extents without bringing in never-to-be-
>> > accessed cachelines. Also, often clear_pages_movnt() based clearing is
>> > faster once extent sizes are O(LLC-size).
>> >
>> > As the excerpt notes, MOVNTI is weakly ordered with respect to other
>> > instructions operating on the memory hierarchy. This needs to be
>> > handled by the caller by executing an SFENCE when done.
>> >
>> > The implementation is straight-forward: unroll the inner loop to keep
>> > the code similar to memset_movnti(), so that we can gauge
>> > clear_pages_movnt() performance via perf bench mem memset.
>> >
>> > # Intel Icelakex
>> > # Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
>> > # (X86_FEATURE_ERMS) and x86-64-movnt:
>> >
>> > System: Oracle X9-2 (2 nodes * 32 cores * 2 threads)
>> > Processor: Intel Xeon(R) Platinum 8358 CPU @ 2.60GHz (Icelakex, 6:106:6)
>> > Memory: 512 GB evenly split between nodes
>> > LLC-size: 48MB for each node (32-cores * 2-threads)
>> > no_turbo: 1, Microcode: 0xd0001e0, scaling-governor: performance
>> >
>> > x86-64-stosb (5 runs) x86-64-movnt (5 runs) Delta(%)
>> > ---------------------- --------------------- --------
>> > size BW ( stdev) BW ( stdev)
>> >
>> > 2MB 14.37 GB/s ( +- 1.55) 12.59 GB/s ( +- 1.20) -12.38%
>> > 16MB 16.93 GB/s ( +- 2.61) 15.91 GB/s ( +- 2.74) -6.02%
>> > 128MB 12.12 GB/s ( +- 1.06) 22.33 GB/s ( +- 1.84) +84.24%
>> > 1024MB 12.12 GB/s ( +- 0.02) 23.92 GB/s ( +- 0.14) +97.35%
>> > 4096MB 12.08 GB/s ( +- 0.02) 23.98 GB/s ( +- 0.18) +98.50%
>>
>> For these sizes it may be worth it to save/restore an xmm register to do
>> the memset:
>>
>> Just on my Tigerlake laptop:
>> model name : 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
>>
>>              movntdq xmm (5 runs)      movnti GPR (5 runs)      Delta(%)
>>              -----------------------   -----------------------
>> size         BW GB/s ( +- stdev)       BW GB/s ( +- stdev)      %
>>
>> 2 MB         35.71 GB/s ( +- 1.02)     34.62 GB/s ( +- 0.77)    -3.15%
>> 16 MB        36.43 GB/s ( +- 0.35)     31.3 GB/s ( +- 0.1)      -16.39%
>> 128 MB       35.6 GB/s ( +- 0.83)      30.82 GB/s ( +- 0.08)    -15.5%
>> 1024 MB      36.85 GB/s ( +- 0.26)     30.71 GB/s ( +- 0.2)     -20.0%

Thanks, this looks interesting. Any thoughts on what causes the drop-off
for the movnti loop as the region size increases?

I can see the usual two problems with using the XMM registers:

- the kernel_fpu_begin()/_end() overhead
- kernel_fpu regions need preemption disabled, which limits the
extent that can be cleared in a single operation
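
IOW, an XMM based clear_pages() would need to be chunked up roughly
like this (just a sketch; __clear_pages_movntdq() is a made-up helper
for the movntdq inner loop):

static void clear_pages_xmm(void *dst, unsigned long npages)
{
        unsigned long left = npages << PAGE_SHIFT;

        while (left) {
                /* bound the preemption-disabled section */
                unsigned long chunk = min(left, 2UL << 20);

                kernel_fpu_begin();
                __clear_pages_movntdq(dst, chunk);
                kernel_fpu_end();

                dst += chunk;
                left -= chunk;
        }
}

with the chunk size picked to keep the preempt-disabled latency
reasonable.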

And given how close movntdq and movnti are for size=2MB, I'm not
sure movntdq would even come out ahead if we include the XMM
save/restore overhead?

> Also (again just from Tigerlake laptop) I found the trend favor
> `rep stosb` more (as opposed to non-cacheable writes) when
> there are multiple threads competing for BW:

I notice in your spreadsheet that you ran the tests only up to
~32MB. How does the performance on Tigerlake change as you
go up to, say, 512MB? Also, it's a little unexpected that the cacheable
SIMD variant always performs pretty much the worst.

In general, I wouldn't expect NT writes to perform better until extent
sizes are O(LLC-size) or larger. That's why this series avoids using NT
writes for sizes smaller than that (see patch 19.)

The argument is: the larger the region being cleared, the less the
caller cares about the contents and thus we can avoid using the cache.
The other part, of course, is that NT doesn't perform as well for small
sizes, and so using it would regress performance for some users.


Ankur

> https://docs.google.com/spreadsheets/d/1f6N9EVqHg71cDIR-RALLR76F_ovW5gzwIWr26yLCmS0/edit?usp=sharing

>> >
>> > Signed-off-by: Ankur Arora <[email protected]>
>> > ---
>> > arch/x86/include/asm/page_64.h | 1 +
>> > arch/x86/lib/clear_page_64.S | 21 +++++++++++++++++++++
>> > 2 files changed, 22 insertions(+)
>> >
>> > diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
>> > index a88a3508888a..3affc4ecb8da 100644
>> > --- a/arch/x86/include/asm/page_64.h
>> > +++ b/arch/x86/include/asm/page_64.h
>> > @@ -55,6 +55,7 @@ extern unsigned long __phys_addr_symbol(unsigned long);
>> > void clear_pages_orig(void *page, unsigned long npages);
>> > void clear_pages_rep(void *page, unsigned long npages);
>> > void clear_pages_erms(void *page, unsigned long npages);
>> > +void clear_pages_movnt(void *page, unsigned long npages);
>> >
>> > #define __HAVE_ARCH_CLEAR_USER_PAGES
>> > static inline void clear_pages(void *page, unsigned int npages)
>> > diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
>> > index 2cc3b681734a..83d14f1c9f57 100644
>> > --- a/arch/x86/lib/clear_page_64.S
>> > +++ b/arch/x86/lib/clear_page_64.S
>> > @@ -58,3 +58,24 @@ SYM_FUNC_START(clear_pages_erms)
>> > RET
>> > SYM_FUNC_END(clear_pages_erms)
>> > EXPORT_SYMBOL_GPL(clear_pages_erms)
>> > +
>> > +SYM_FUNC_START(clear_pages_movnt)
>> > + xorl %eax,%eax
>> > + movq %rsi,%rcx
>> > + shlq $PAGE_SHIFT, %rcx
>> > +
>> > + .p2align 4
>> > +.Lstart:
>> > + movnti %rax, 0x00(%rdi)
>> > + movnti %rax, 0x08(%rdi)
>> > + movnti %rax, 0x10(%rdi)
>> > + movnti %rax, 0x18(%rdi)
>> > + movnti %rax, 0x20(%rdi)
>> > + movnti %rax, 0x28(%rdi)
>> > + movnti %rax, 0x30(%rdi)
>> > + movnti %rax, 0x38(%rdi)
>> > + addq $0x40, %rdi
>> > + subl $0x40, %ecx
>> > + ja .Lstart
>> > + RET
>> > +SYM_FUNC_END(clear_pages_movnt)
>> > --
>> > 2.31.1
>> >


--
ankur