2020-10-01 00:06:24

by Kalesh Singh

Subject: [PATCH 0/5] Speed up mremap on large regions

mremap time can be optimized by moving entries at the PMD/PUD level if
the source and destination addresses are PMD/PUD-aligned and
PMD/PUD-sized. Enable moving at the PMD and PUD levels on arm64 and
x86. Other architectures where this type of move is supported and known to
be safe can also opt-in to these optimizations by enabling HAVE_MOVE_PMD
and HAVE_MOVE_PUD.
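
For reference, no new userspace API is involved: an ordinary mremap(2)
call whose source and destination addresses and length happen to be
PMD/PUD-aligned takes the optimized path. A minimal sketch (the fixed
addresses are purely illustrative and assumed unused; a robust caller
would probe for free, aligned VA space the way the selftest in patch 1
does):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define PUD_SIZE  (1UL << 30)                  /* 1GB */
#define SRC_ADDR  ((void *)(4UL  * PUD_SIZE))  /* PUD-aligned, illustrative */
#define DST_ADDR  ((void *)(16UL * PUD_SIZE))  /* PUD-aligned, illustrative */

int main(void)
{
	void *src, *dst;

	/* PUD-aligned anonymous source region. */
	src = mmap(SRC_ADDR, PUD_SIZE, PROT_READ | PROT_WRITE,
		   MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (src == MAP_FAILED)
		return 1;

	/* Fault the region in so there are page table entries to move. */
	memset(src, 0xaa, PUD_SIZE);

	/*
	 * Source, destination and length are all PUD-aligned/PUD-sized,
	 * so with HAVE_MOVE_PUD the kernel rewrites one PUD entry per
	 * 1GB instead of copying individual PTEs.
	 */
	dst = mremap(src, PUD_SIZE, PUD_SIZE,
		     MREMAP_MAYMOVE | MREMAP_FIXED, DST_ADDR);
	if (dst == MAP_FAILED)
		return 1;

	printf("moved %p -> %p\n", src, dst);
	return 0;
}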

Observed Performance Improvements for remapping a PUD-aligned 1GB-sized
region on x86 and arm64:

- HAVE_MOVE_PMD is already enabled on x86 : N/A
- Enabling HAVE_MOVE_PUD on x86 : ~13x speed up

- Enabling HAVE_MOVE_PMD on arm64 : ~ 8x speed up
- Enabling HAVE_MOVE_PUD on arm64 : ~19x speed up

Altogether, HAVE_MOVE_PMD and HAVE_MOVE_PUD give a total of ~150x speed
up on arm64: the two optimizations compound (~8x * ~19x) relative to the
unoptimized PTE-level baseline.


Kalesh Singh (5):
kselftests: vm: Add mremap tests
arm64: mremap speedup - Enable HAVE_MOVE_PMD
mm: Speedup mremap on 1GB or larger regions
arm64: mremap speedup - Enable HAVE_MOVE_PUD
x86: mremap speedup - Enable HAVE_MOVE_PUD

arch/Kconfig | 7 +
arch/arm64/Kconfig | 2 +
arch/arm64/include/asm/pgtable.h | 1 +
arch/x86/Kconfig | 1 +
mm/mremap.c | 211 +++++++++++++++++---
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 1 +
tools/testing/selftests/vm/mremap_test.c | 243 +++++++++++++++++++++++
tools/testing/selftests/vm/run_vmtests | 11 +
9 files changed, 448 insertions(+), 30 deletions(-)
create mode 100644 tools/testing/selftests/vm/mremap_test.c

--
2.28.0.709.gb0816b6eb0-goog


2020-10-01 00:06:24

by Kalesh Singh

Subject: [PATCH 2/5] arm64: mremap speedup - Enable HAVE_MOVE_PMD

HAVE_MOVE_PMD enables remapping pages at the PMD level if both the
source and destination addresses are PMD-aligned.

HAVE_MOVE_PMD is already enabled on x86. The original patch [1] that
introduced this config did not enable it on arm64 at the time because
of performance issues with flushing the TLB on every PMD move. These
issues have since been addressed in more recent releases with
improvements to the arm64 TLB invalidation and core mmu_gather code as
Will Deacon mentioned in [2].

The data below shows an approximately 8x performance improvement when
HAVE_MOVE_PMD is enabled on arm64.

--------- Test Results ----------

The following results were obtained on an arm64 device running a 5.4
kernel, by remapping a PMD-aligned, 1GB sized region to a PMD-aligned
destination. The results from 10 iterations of the test are given below.
All times are in nanoseconds.

Control HAVE_MOVE_PMD

9220833 1247761
9002552 1219896
9254115 1094792
8725885 1227760
9308646 1043698
9001667 1101771
8793385 1159896
8774636 1143594
9553125 1025833
9374010 1078125

9100885.4 1134312.6 <-- Mean Time in nanoseconds

Total mremap time for a 1GB sized PMD-aligned region drops from
~9.1 milliseconds to ~1.1 milliseconds. (~8x speedup).

[1] https://lore.kernel.org/r/[email protected]
[2] https://www.mail-archive.com/[email protected]/msg140837.html

Signed-off-by: Kalesh Singh <[email protected]>
---
arch/arm64/Kconfig | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 6d232837cbee..844d089668e3 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -121,6 +121,7 @@ config ARM64
select GENERIC_VDSO_TIME_NS
select HANDLE_DOMAIN_IRQ
select HARDIRQS_SW_RESEND
+ select HAVE_MOVE_PMD
select HAVE_PCI
select HAVE_ACPI_APEI if (ACPI && EFI)
select HAVE_ALIGNED_STRUCT_PAGE if SLUB
--
2.28.0.709.gb0816b6eb0-goog

2020-10-01 00:06:24

by Kalesh Singh

Subject: [PATCH 5/5] x86: mremap speedup - Enable HAVE_MOVE_PUD

HAVE_MOVE_PUD enables remapping pages at the PUD level if both the
source and destination addresses are PUD-aligned.

Enabling HAVE_MOVE_PUD yields an approximately 13x performance
improvement on x86 (see data below).

------- Test Results ---------

The following results were obtained using a 5.4 kernel, by remapping
a PUD-aligned, 1GB sized region to a PUD-aligned destination.
The results from 10 iterations of the test are given below:

Total mremap times for 1GB data on x86. All times are in nanoseconds.

Control HAVE_MOVE_PUD

180394 15089
235728 14056
238931 25741
187330 13838
241742 14187
177925 14778
182758 14728
160872 14418
205813 15107
245722 13998

205721.5 15594 <-- Mean time in nanoseconds

Total mremap time for a 1GB region drops from ~205 microseconds to
~15 microseconds on x86 (~13x speedup).

Signed-off-by: Kalesh Singh <[email protected]>
---
arch/x86/Kconfig | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7101ac64bb20..ff6e2755cab8 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -198,6 +198,7 @@ config X86
select HAVE_MIXED_BREAKPOINTS_REGS
select HAVE_MOD_ARCH_SPECIFIC
select HAVE_MOVE_PMD
+ select HAVE_MOVE_PUD
select HAVE_NMI
select HAVE_OPROFILE
select HAVE_OPTPROBES
--
2.28.0.709.gb0816b6eb0-goog

2020-10-01 00:06:24

by Kalesh Singh

Subject: [PATCH 4/5] arm64: mremap speedup - Enable HAVE_MOVE_PUD

HAVE_MOVE_PUD enables remapping pages at the PUD level if both the
source and destination addresses are PUD-aligned.

Enabling HAVE_MOVE_PUD yields an approximately 19x performance
improvement on arm64 (see data below).

------- Test Results ---------

The following results were obtained using a 5.4 kernel, by remapping
a PUD-aligned, 1GB sized region to a PUD-aligned destination.
The results from 10 iterations of the test are given below:

Total mremap times for 1GB data on arm64. All times are in nanoseconds.

Control HAVE_MOVE_PUD

1247761 74271
1219896 46771
1094792 59687
1227760 48385
1043698 76666
1101771 50365
1159896 52500
1143594 75261
1025833 61354
1078125 48697

1134312.6 59395.7 <-- Mean time in nanoseconds

Total mremap time for a 1GB region drops from ~1.1 milliseconds to
~59 microseconds on arm64 (~19x speedup).

Signed-off-by: Kalesh Singh <[email protected]>
---
arch/arm64/Kconfig | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 844d089668e3..4d521f0a5863 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -122,6 +122,7 @@ config ARM64
select HANDLE_DOMAIN_IRQ
select HARDIRQS_SW_RESEND
select HAVE_MOVE_PMD
+ select HAVE_MOVE_PUD
select HAVE_PCI
select HAVE_ACPI_APEI if (ACPI && EFI)
select HAVE_ALIGNED_STRUCT_PAGE if SLUB
--
2.28.0.709.gb0816b6eb0-goog

2020-10-01 00:06:24

by Kalesh Singh

Subject: [PATCH 1/5] kselftests: vm: Add mremap tests

Test mremap on regions of various sizes and alignments and validate
data after remapping. Also provide total time for remapping
the region which is useful for performance comparison of the mremap
optimizations that move pages at the PMD/PUD levels if HAVE_MOVE_PMD
and/or HAVE_MOVE_PUD are enabled.

Signed-off-by: Kalesh Singh <[email protected]>
---
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 1 +
tools/testing/selftests/vm/mremap_test.c | 243 +++++++++++++++++++++++
tools/testing/selftests/vm/run_vmtests | 11 +
4 files changed, 256 insertions(+)
create mode 100644 tools/testing/selftests/vm/mremap_test.c

diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore
index 849e8226395a..b3a183c36cb5 100644
--- a/tools/testing/selftests/vm/.gitignore
+++ b/tools/testing/selftests/vm/.gitignore
@@ -8,6 +8,7 @@ thuge-gen
compaction_test
mlock2-tests
mremap_dontunmap
+mremap_test
on-fault-limit
transhuge-stress
protection_keys
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index a9026706d597..f044808b45fa 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -16,6 +16,7 @@ TEST_GEN_FILES += map_populate
TEST_GEN_FILES += mlock-random-test
TEST_GEN_FILES += mlock2-tests
TEST_GEN_FILES += mremap_dontunmap
+TEST_GEN_FILES += mremap_test
TEST_GEN_FILES += on-fault-limit
TEST_GEN_FILES += thuge-gen
TEST_GEN_FILES += transhuge-stress
diff --git a/tools/testing/selftests/vm/mremap_test.c b/tools/testing/selftests/vm/mremap_test.c
new file mode 100644
index 000000000000..09dc9a1ef81f
--- /dev/null
+++ b/tools/testing/selftests/vm/mremap_test.c
@@ -0,0 +1,243 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright 2020 Google LLC
+ */
+#define _GNU_SOURCE
+
+#include <errno.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <time.h>
+
+#include "../kselftest.h"
+
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
+
+struct config {
+ unsigned long long src_alignment;
+ unsigned long long dest_alignment;
+ unsigned long long region_size;
+ int overlapping;
+};
+
+struct test {
+ const char *name;
+ struct config config;
+ int expect_failure;
+};
+
+enum {
+ _1KB = 1ULL << 10, /* 1KB -> not page aligned */
+ _4KB = 4ULL << 10,
+ _8KB = 8ULL << 10,
+ _1MB = 1ULL << 20,
+ _2MB = 2ULL << 20,
+ _4MB = 4ULL << 20,
+ _1GB = 1ULL << 30,
+ _2GB = 2ULL << 30,
+ PTE = _4KB,
+ PMD = _2MB,
+ PUD = _1GB,
+};
+
+#define MAKE_TEST(source_align, destination_align, size, \
+ overlaps, should_fail, test_name) \
+{ \
+ .name = test_name, \
+ .config = { \
+ .src_alignment = source_align, \
+ .dest_alignment = destination_align, \
+ .region_size = size, \
+ .overlapping = overlaps, \
+ }, \
+ .expect_failure = should_fail \
+}
+
+#define MAKE_SIMPLE_TEST(source_align, destination_align, size) \
+ MAKE_TEST(source_align, destination_align, size, 0, 0, \
+ #size " mremap - Source " #source_align \
+ " aligned, Destination " #destination_align \
+ " aligned")
+
+/*
+ * Returns the start address of the mapping on success, else returns
+ * NULL on failure.
+ */
+static void *get_source_mapping(struct config c)
+{
+ unsigned long long addr = 0ULL;
+ void *src_addr = NULL;
+retry:
+ addr += c.src_alignment;
+ src_addr = mmap((void *) addr, c.region_size, PROT_READ | PROT_WRITE,
+ MAP_FIXED | MAP_ANONYMOUS | MAP_SHARED, -1, 0);
+ if (src_addr == MAP_FAILED) {
+ if (errno == EPERM)
+ goto retry;
+ goto error;
+ }
+ /*
+ * Check that the address is aligned to the specified alignment. Addresses
+ * which have alignments that are multiples of that specified are not considered
+ * valid. For instance, 1GB address is 2MB-aligned, however it will not be
+ * considered valid for a requested alignment of 2MB. This is done to
+ * reduce coincidental alignment in the tests.
+ */
+ if (((unsigned long long) src_addr & (c.src_alignment - 1)) ||
+ !((unsigned long long) src_addr & c.src_alignment))
+ goto retry;
+
+ if (!src_addr)
+ goto error;
+
+ return src_addr;
+error:
+ ksft_print_msg("Failed to map source region: %s\n",
+ strerror(errno));
+ return NULL;
+}
+
+/* Returns the time taken for the remap on success else returns -1. */
+static long long remap_region(struct config c)
+{
+ void *addr, *src_addr, *dest_addr;
+ int i, j;
+ struct timespec t_start = {0, 0}, t_end = {0, 0};
+ long long start_ns, end_ns, align_mask, ret, offset;
+ char pattern[] = {0xa8, 0xcd, 0xfe};
+ int pattern_size = ARRAY_SIZE(pattern);
+
+ src_addr = get_source_mapping(c);
+ if (!src_addr) {
+ ret = -1;
+ goto out;
+ }
+
+ /* Set byte pattern */
+ for (i = 0; i < c.region_size; i++) {
+ for (j = 0; i+j < c.region_size && j < pattern_size; j++)
+ memset((char *) src_addr + i+j, pattern[j], 1);
+ i += pattern_size-1;
+ }
+
+ align_mask = ~(c.dest_alignment - 1);
+ offset = (c.overlapping) ? -c.dest_alignment : c.dest_alignment;
+ addr = (void *) (((unsigned long long) src_addr + c.region_size + offset)
+ & align_mask);
+
+ /* See comment in get_source_mapping() */
+ if (!((unsigned long long) addr & c.dest_alignment))
+ addr = (void *) ((unsigned long long) addr | c.dest_alignment);
+
+ clock_gettime(CLOCK_MONOTONIC, &t_start);
+ dest_addr = mremap(src_addr, c.region_size, c.region_size,
+ MREMAP_MAYMOVE|MREMAP_FIXED, (char *) addr);
+ clock_gettime(CLOCK_MONOTONIC, &t_end);
+
+ if (dest_addr == MAP_FAILED) {
+ ksft_print_msg("mremap failed: %s\n", strerror(errno));
+ ret = -1;
+ goto clean_up_src;
+ }
+
+ /* Verify byte pattern after remapping */
+ for (i = 0; i < c.region_size; i++) {
+ for (j = 0; i+j < c.region_size && j < pattern_size; j++) {
+ if (((char *) dest_addr)[i+j] != (char) pattern[j]) {
+ ksft_print_msg("Data after remap doesn't match at offset %d\n",
+ i+j);
+ ksft_print_msg("Expected: %#x\t Got: %#x\n", pattern[j] & 0xff,
+ ((char *) dest_addr)[i+j] & 0xff);
+ ret = -1;
+ goto clean_up_dest;
+ }
+ }
+ i += pattern_size-1;
+ }
+
+ start_ns = t_start.tv_sec * 1000000000ULL + t_start.tv_nsec;
+ end_ns = t_end.tv_sec * 1000000000ULL + t_end.tv_nsec;
+ ret = end_ns - start_ns;
+
+/*
+ * Since the destination address is specified using MREMAP_FIXED, subsequent mremap will unmap any
+ * previous mapping at the address range specified by dest_addr and region_size. This significantly
+ * affects the remap time of subsequent tests. So we clean up mappings after each test.
+ */
+clean_up_dest:
+ munmap(dest_addr, c.region_size);
+clean_up_src:
+ munmap(src_addr, c.region_size);
+out:
+ return ret;
+}
+
+static void run_mremap_test_case(struct test test_case, int *failures)
+{
+ long long remap_time = remap_region(test_case.config);
+
+ if (remap_time < 0) {
+ if (test_case.expect_failure)
+ ksft_test_result_pass("%s\n\tExpected mremap failure\n", test_case.name);
+ else {
+ ksft_test_result_fail("%s\n", test_case.name);
+ *failures += 1;
+ }
+ } else
+ ksft_test_result_pass("%s\n\tmremap time: %12lldns\n", test_case.name, remap_time);
+}
+
+int main(int argc, char *argv[])
+{
+ int failures = 0;
+ int i;
+
+ struct test test_cases[] = {
+ /* Expected mremap failures */
+ MAKE_TEST(_4KB, _4KB, _4KB, 1 /* overlaps */, 1 /* fails */,
+ "mremap - Source and Destination Regions Overlapping"),
+ MAKE_TEST(_4KB, _1KB, _4KB, 0 /* overlaps */, 1 /* fails */,
+ "mremap - Destination Address Misaligned (1KB-aligned)"),
+ MAKE_TEST(_1KB, _4KB, _4KB, 0 /* overlaps */, 1 /* fails */,
+ "mremap - Source Address Misaligned (1KB-aligned)"),
+
+ /* Src addr PTE aligned */
+ MAKE_SIMPLE_TEST(PTE, PTE, _8KB),
+
+ /* Src addr 1MB aligned */
+ MAKE_SIMPLE_TEST(_1MB, PTE, _2MB),
+ MAKE_SIMPLE_TEST(_1MB, _1MB, _2MB),
+
+ /* Src addr PMD aligned */
+ MAKE_SIMPLE_TEST(PMD, PTE, _4MB),
+ MAKE_SIMPLE_TEST(PMD, _1MB, _4MB),
+ MAKE_SIMPLE_TEST(PMD, PMD, _4MB),
+
+ /* Src addr PUD aligned */
+ MAKE_SIMPLE_TEST(PUD, PTE, _2GB),
+ MAKE_SIMPLE_TEST(PUD, _1MB, _2GB),
+ MAKE_SIMPLE_TEST(PUD, PMD, _2GB),
+ MAKE_SIMPLE_TEST(PUD, PUD, _2GB),
+ };
+
+ struct test perf_test_cases[] = {
+ /* mremap 1GB region - Page table level aligned time comparison */
+ MAKE_SIMPLE_TEST(PTE, PTE, _1GB),
+ MAKE_SIMPLE_TEST(PMD, PMD, _1GB),
+ MAKE_SIMPLE_TEST(PUD, PUD, _1GB),
+ };
+
+ ksft_set_plan(ARRAY_SIZE(test_cases) + ARRAY_SIZE(perf_test_cases));
+
+ for (i = 0; i < ARRAY_SIZE(test_cases); i++)
+ run_mremap_test_case(test_cases[i], &failures);
+
+ ksft_print_msg("\nmremap HAVE_MOVE_PMD/PUD optimization time comparison for 1GB region:\n");
+ for (i = 0; i < ARRAY_SIZE(perf_test_cases); i++)
+ run_mremap_test_case(perf_test_cases[i], &failures);
+
+ if (failures > 0)
+ ksft_exit_fail();
+ else
+ ksft_exit_pass();
+}
diff --git a/tools/testing/selftests/vm/run_vmtests b/tools/testing/selftests/vm/run_vmtests
index a3f4f30f0a2e..d578ad831813 100755
--- a/tools/testing/selftests/vm/run_vmtests
+++ b/tools/testing/selftests/vm/run_vmtests
@@ -241,6 +241,17 @@ else
echo "[PASS]"
fi

+echo "-------------------"
+echo "running mremap_test"
+echo "-------------------"
+./mremap_test
+if [ $? -ne 0 ]; then
+ echo "[FAIL]"
+ exitcode=1
+else
+ echo "[PASS]"
+fi
+
echo "-----------------"
echo "running thuge-gen"
echo "-----------------"
--
2.28.0.709.gb0816b6eb0-goog

2020-10-01 00:06:24

by Kalesh Singh

Subject: [PATCH 3/5] mm: Speedup mremap on 1GB or larger regions

Android needs to move large memory regions for garbage collection.
Optimize mremap for >= 1GB-sized regions by moving at the PUD/PGD
level if the source and destination addresses are PUD-aligned.
For CONFIG_PGTABLE_LEVELS == 3, moving at the PUD level in effect moves
PGD entries, since the PUD entry is “folded back” onto the PGD entry.
Add HAVE_MOVE_PUD so that architectures where moving at the PUD level
isn't supported/tested can turn this off by not selecting the config.

Signed-off-by: Kalesh Singh <[email protected]>
---
arch/Kconfig | 7 +
arch/arm64/include/asm/pgtable.h | 1 +
mm/mremap.c | 211 ++++++++++++++++++++++++++-----
3 files changed, 189 insertions(+), 30 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index af14a567b493..5eabaa00bf9b 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -602,6 +602,13 @@ config HAVE_IRQ_TIME_ACCOUNTING
Archs need to ensure they use a high enough resolution clock to
support irq time accounting and then call enable_sched_clock_irqtime().

+config HAVE_MOVE_PUD
+ bool
+ help
+ Architectures that select this are able to move page tables at the
+ PUD level. If there are only 3 page table levels, the move effectively
+ happens at the PGD level.
+
config HAVE_MOVE_PMD
bool
help
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index d5d3fbe73953..8848125e3024 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -415,6 +415,7 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd)
#define pfn_pud(pfn,prot) __pud(__phys_to_pud_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))

#define set_pmd_at(mm, addr, pmdp, pmd) set_pte_at(mm, addr, (pte_t *)pmdp, pmd_pte(pmd))
+#define set_pud_at(mm, addr, pudp, pud) set_pte_at(mm, addr, (pte_t *)pudp, pud_pte(pud))

#define __p4d_to_phys(p4d) __pte_to_phys(p4d_pte(p4d))
#define __phys_to_p4d_val(phys) __phys_to_pte_val(phys)
diff --git a/mm/mremap.c b/mm/mremap.c
index 138abbae4f75..a5a1440bd366 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -249,14 +249,167 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,

return true;
}
+#else
+static inline bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
+ unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd)
+{
+ return false;
+}
#endif

+#ifdef CONFIG_HAVE_MOVE_PUD
+static pud_t *get_old_pud(struct mm_struct *mm, unsigned long addr)
+{
+ pgd_t *pgd;
+ p4d_t *p4d;
+ pud_t *pud;
+
+ pgd = pgd_offset(mm, addr);
+ if (pgd_none_or_clear_bad(pgd))
+ return NULL;
+
+ p4d = p4d_offset(pgd, addr);
+ if (p4d_none_or_clear_bad(p4d))
+ return NULL;
+
+ pud = pud_offset(p4d, addr);
+ if (pud_none_or_clear_bad(pud))
+ return NULL;
+
+ return pud;
+}
+
+static pud_t *alloc_new_pud(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ pgd_t *pgd;
+ p4d_t *p4d;
+ pud_t *pud;
+
+ pgd = pgd_offset(mm, addr);
+ p4d = p4d_alloc(mm, pgd, addr);
+ if (!p4d)
+ return NULL;
+ pud = pud_alloc(mm, p4d, addr);
+ if (!pud)
+ return NULL;
+
+ return pud;
+}
+
+static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
+ unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
+{
+ spinlock_t *old_ptl, *new_ptl;
+ struct mm_struct *mm = vma->vm_mm;
+ pud_t pud;
+
+ /*
+ * The destination pud shouldn't be established, free_pgtables()
+ * should have released it.
+ */
+ if (WARN_ON_ONCE(!pud_none(*new_pud)))
+ return false;
+
+ /*
+ * We don't have to worry about the ordering of src and dst
+ * ptlocks because exclusive mmap_lock prevents deadlock.
+ */
+ old_ptl = pud_lock(vma->vm_mm, old_pud);
+ new_ptl = pud_lockptr(mm, new_pud);
+ if (new_ptl != old_ptl)
+ spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
+
+ /* Clear the pud */
+ pud = *old_pud;
+ pud_clear(old_pud);
+
+ VM_BUG_ON(!pud_none(*new_pud));
+
+ /* Set the new pud */
+ set_pud_at(mm, new_addr, new_pud, pud);
+ flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE);
+ if (new_ptl != old_ptl)
+ spin_unlock(new_ptl);
+ spin_unlock(old_ptl);
+
+ return true;
+}
+#else
+static inline bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
+ unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
+{
+ return false;
+}
+#endif
+
+enum pgt_entry {
+ NORMAL_PMD,
+ HPAGE_PMD,
+ NORMAL_PUD,
+};
+
+/*
+ * Returns an extent of the corresponding size for the pgt_entry specified if valid.
+ * Else returns a smaller extent bounded by the end of the source and destination
+ * pgt_entry. Returns 0 if an invalid pgt_entry is specified.
+ */
+static unsigned long get_extent(enum pgt_entry entry, unsigned long old_addr,
+ unsigned long old_end, unsigned long new_addr)
+{
+ unsigned long next, extent, mask, size;
+
+ if (entry == NORMAL_PMD || entry == HPAGE_PMD) {
+ mask = PMD_MASK;
+ size = PMD_SIZE;
+ } else if (entry == NORMAL_PUD) {
+ mask = PUD_MASK;
+ size = PUD_SIZE;
+ } else
+ return 0;
+
+ next = (old_addr + size) & mask;
+ /* even if next overflowed, extent below will be ok */
+ extent = (next > old_end) ? old_end - old_addr : next - old_addr;
+ next = (new_addr + size) & mask;
+ if (extent > next - new_addr)
+ extent = next - new_addr;
+ return extent;
+}
+
+/*
+ * Attempts to speedup the move by moving entry at the level corresponding to
+ * pgt_entry. Returns true if the move was successful, else false.
+ */
+static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma,
+ unsigned long old_addr, unsigned long new_addr, void *old_entry,
+ void *new_entry, bool need_rmap_locks)
+{
+ bool moved = false;
+
+ /* See comment in move_ptes() */
+ if (need_rmap_locks)
+ take_rmap_locks(vma);
+ if (entry == NORMAL_PMD)
+ moved = move_normal_pmd(vma, old_addr, new_addr, old_entry, new_entry);
+ else if (entry == NORMAL_PUD)
+ moved = move_normal_pud(vma, old_addr, new_addr, old_entry, new_entry);
+ else if (entry == HPAGE_PMD)
+ moved = move_huge_pmd(vma, old_addr, new_addr, old_entry, new_entry);
+ else
+ WARN_ON_ONCE(1);
+ if (need_rmap_locks)
+ drop_rmap_locks(vma);
+
+ return moved;
+}
+
unsigned long move_page_tables(struct vm_area_struct *vma,
unsigned long old_addr, struct vm_area_struct *new_vma,
unsigned long new_addr, unsigned long len,
bool need_rmap_locks)
{
- unsigned long extent, next, old_end;
+ unsigned long extent, old_end;
struct mmu_notifier_range range;
pmd_t *old_pmd, *new_pmd;

@@ -269,14 +422,27 @@ unsigned long move_page_tables(struct vm_area_struct *vma,

for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
cond_resched();
- next = (old_addr + PMD_SIZE) & PMD_MASK;
- /* even if next overflowed, extent below will be ok */
- extent = next - old_addr;
- if (extent > old_end - old_addr)
- extent = old_end - old_addr;
- next = (new_addr + PMD_SIZE) & PMD_MASK;
- if (extent > next - new_addr)
- extent = next - new_addr;
+#ifdef CONFIG_HAVE_MOVE_PUD
+ /*
+ * If extent is PUD-sized try to speed up the move by moving at the
+ * PUD level if possible.
+ */
+ extent = get_extent(NORMAL_PUD, old_addr, old_end, new_addr);
+ if (extent == PUD_SIZE) {
+ pud_t *old_pud, *new_pud;
+
+ old_pud = get_old_pud(vma->vm_mm, old_addr);
+ if (!old_pud)
+ continue;
+ new_pud = alloc_new_pud(vma->vm_mm, vma, new_addr);
+ if (!new_pud)
+ break;
+ if (move_pgt_entry(NORMAL_PUD, vma, old_addr, new_addr,
+ old_pud, new_pud, need_rmap_locks))
+ continue;
+ }
+#endif
+ extent = get_extent(NORMAL_PMD, old_addr, old_end, new_addr);
old_pmd = get_old_pmd(vma->vm_mm, old_addr);
if (!old_pmd)
continue;
@@ -284,18 +450,10 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
if (!new_pmd)
break;
if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd) || pmd_devmap(*old_pmd)) {
- if (extent == HPAGE_PMD_SIZE) {
- bool moved;
- /* See comment in move_ptes() */
- if (need_rmap_locks)
- take_rmap_locks(vma);
- moved = move_huge_pmd(vma, old_addr, new_addr,
- old_pmd, new_pmd);
- if (need_rmap_locks)
- drop_rmap_locks(vma);
- if (moved)
- continue;
- }
+ if (extent == HPAGE_PMD_SIZE &&
+ move_pgt_entry(HPAGE_PMD, vma, old_addr, new_addr, old_pmd,
+ new_pmd, need_rmap_locks))
+ continue;
split_huge_pmd(vma, old_pmd, old_addr);
if (pmd_trans_unstable(old_pmd))
continue;
@@ -305,15 +463,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
* If the extent is PMD-sized, try to speed the move by
* moving at the PMD level if possible.
*/
- bool moved;
-
- if (need_rmap_locks)
- take_rmap_locks(vma);
- moved = move_normal_pmd(vma, old_addr, new_addr,
- old_pmd, new_pmd);
- if (need_rmap_locks)
- drop_rmap_locks(vma);
- if (moved)
+ if (move_pgt_entry(NORMAL_PMD, vma, old_addr, new_addr, old_pmd,
+ new_pmd, need_rmap_locks))
continue;
#endif
}
--
2.28.0.709.gb0816b6eb0-goog

2020-10-01 00:25:55

by Kirill A. Shutemov

Subject: Re: [PATCH 0/5] Speed up mremap on large regions

On Wed, Sep 30, 2020 at 10:21:17PM +0000, Kalesh Singh wrote:
> mremap time can be optimized by moving entries at the PMD/PUD level if
> the source and destination addresses are PMD/PUD-aligned and
> PMD/PUD-sized. Enable moving at the PMD and PUD levels on arm64 and
> x86. Other architectures where this type of move is supported and known to
> be safe can also opt-in to these optimizations by enabling HAVE_MOVE_PMD
> and HAVE_MOVE_PUD.
>
> Observed Performance Improvements for remapping a PUD-aligned 1GB-sized
> region on x86 and arm64:
>
> - HAVE_MOVE_PMD is already enabled on x86 : N/A
> - Enabling HAVE_MOVE_PUD on x86 : ~13x speed up
>
> - Enabling HAVE_MOVE_PMD on arm64 : ~ 8x speed up
> - Enabling HAVE_MOVE_PUD on arm64 : ~19x speed up
>
> Altogether, HAVE_MOVE_PMD and HAVE_MOVE_PUD
> give a total of ~150x speed up on arm64.

Is there a *real* workload that benefit from HAVE_MOVE_PUD?

--
Kirill A. Shutemov

2020-10-01 00:27:31

by Lokesh Gidra

Subject: Re: [PATCH 0/5] Speed up mremap on large regions

On Wed, Sep 30, 2020 at 3:32 PM Kirill A. Shutemov
<[email protected]> wrote:
>
> On Wed, Sep 30, 2020 at 10:21:17PM +0000, Kalesh Singh wrote:
> > mremap time can be optimized by moving entries at the PMD/PUD level if
> > the source and destination addresses are PMD/PUD-aligned and
> > PMD/PUD-sized. Enable moving at the PMD and PUD levels on arm64 and
> > x86. Other architectures where this type of move is supported and known to
> > be safe can also opt-in to these optimizations by enabling HAVE_MOVE_PMD
> > and HAVE_MOVE_PUD.
> >
> > Observed Performance Improvements for remapping a PUD-aligned 1GB-sized
> > region on x86 and arm64:
> >
> > - HAVE_MOVE_PMD is already enabled on x86 : N/A
> > - Enabling HAVE_MOVE_PUD on x86 : ~13x speed up
> >
> > - Enabling HAVE_MOVE_PMD on arm64 : ~ 8x speed up
> > - Enabling HAVE_MOVE_PUD on arm64 : ~19x speed up
> >
> > Altogether, HAVE_MOVE_PMD and HAVE_MOVE_PUD
> > give a total of ~150x speed up on arm64.
>
> Is there a *real* workload that benefit from HAVE_MOVE_PUD?
>
We have a Java garbage collector under development which requires
moving physical pages of multi-gigabyte heap using mremap. During this
move, the application threads have to be paused for correctness. It is
critical to keep this pause as short as possible to avoid jitters
during user interaction. This is where HAVE_MOVE_PUD will greatly
help.
> --
> Kirill A. Shutemov

2020-10-01 00:27:52

by Joel Fernandes

Subject: Re: [PATCH 0/5] Speed up mremap on large regions

On Wed, Sep 30, 2020 at 6:42 PM Lokesh Gidra <[email protected]> wrote:
>
> On Wed, Sep 30, 2020 at 3:32 PM Kirill A. Shutemov
> <[email protected]> wrote:
> >
> > On Wed, Sep 30, 2020 at 10:21:17PM +0000, Kalesh Singh wrote:
> > > mremap time can be optimized by moving entries at the PMD/PUD level if
> > > the source and destination addresses are PMD/PUD-aligned and
> > > PMD/PUD-sized. Enable moving at the PMD and PUD levels on arm64 and
> > > x86. Other architectures where this type of move is supported and known to
> > > be safe can also opt-in to these optimizations by enabling HAVE_MOVE_PMD
> > > and HAVE_MOVE_PUD.
> > >
> > > Observed Performance Improvements for remapping a PUD-aligned 1GB-sized
> > > region on x86 and arm64:
> > >
> > > - HAVE_MOVE_PMD is already enabled on x86 : N/A
> > > - Enabling HAVE_MOVE_PUD on x86 : ~13x speed up
> > >
> > > - Enabling HAVE_MOVE_PMD on arm64 : ~ 8x speed up
> > > - Enabling HAVE_MOVE_PUD on arm64 : ~19x speed up
> > >
> > > Altogether, HAVE_MOVE_PMD and HAVE_MOVE_PUD
> > > give a total of ~150x speed up on arm64.
> >
> > Is there a *real* workload that benefit from HAVE_MOVE_PUD?
> >
> We have a Java garbage collector under development which requires
> moving physical pages of multi-gigabyte heap using mremap. During this
> move, the application threads have to be paused for correctness. It is
> critical to keep this pause as short as possible to avoid jitters
> during user interaction. This is where HAVE_MOVE_PUD will greatly
> help.

And that detail should totally have gone into the commit message :-/

Thanks,

- Joel

2020-10-01 00:29:14

by Kalesh Singh

Subject: Re: [PATCH 0/5] Speed up mremap on large regions

On Wed, Sep 30, 2020 at 6:47 PM Joel Fernandes <[email protected]> wrote:
>
> On Wed, Sep 30, 2020 at 6:42 PM Lokesh Gidra <[email protected]> wrote:
> >
> > On Wed, Sep 30, 2020 at 3:32 PM Kirill A. Shutemov
> > <[email protected]> wrote:
> > >
> > > On Wed, Sep 30, 2020 at 10:21:17PM +0000, Kalesh Singh wrote:
> > > > mremap time can be optimized by moving entries at the PMD/PUD level if
> > > > the source and destination addresses are PMD/PUD-aligned and
> > > > PMD/PUD-sized. Enable moving at the PMD and PUD levels on arm64 and
> > > > x86. Other architectures where this type of move is supported and known to
> > > > be safe can also opt-in to these optimizations by enabling HAVE_MOVE_PMD
> > > > and HAVE_MOVE_PUD.
> > > >
> > > > Observed Performance Improvements for remapping a PUD-aligned 1GB-sized
> > > > region on x86 and arm64:
> > > >
> > > > - HAVE_MOVE_PMD is already enabled on x86 : N/A
> > > > - Enabling HAVE_MOVE_PUD on x86 : ~13x speed up
> > > >
> > > > - Enabling HAVE_MOVE_PMD on arm64 : ~ 8x speed up
> > > > - Enabling HAVE_MOVE_PUD on arm64 : ~19x speed up
> > > >
> > > > Altogether, HAVE_MOVE_PMD and HAVE_MOVE_PUD
> > > > give a total of ~150x speed up on arm64.
> > >
> > > Is there a *real* workload that benefit from HAVE_MOVE_PUD?
> > >
> > We have a Java garbage collector under development which requires
> > moving physical pages of multi-gigabyte heap using mremap. During this
> > move, the application threads have to be paused for correctness. It is
> > critical to keep this pause as short as possible to avoid jitters
> > during user interaction. This is where HAVE_MOVE_PUD will greatly
> > help.
>
> And that detail should totally have gone into the commit message :-/
Hi Joel,
The patch that introduces HAVE_MOVE_PUD in the series mentions the
Android garbage collection use case. I can add these details there in
the next version.
Thanks,
Kalesh
>
> Thanks,
>
> - Joel

2020-10-01 07:26:29

by John Hubbard

Subject: Re: [PATCH 1/5] kselftests: vm: Add mremap tests

On 9/30/20 3:21 PM, Kalesh Singh wrote:
> Test mremap on regions of various sizes and alignments and validate
> data after remapping. Also provide total time for remapping
> the region which is useful for performance comparison of the mremap
> optimizations that move pages at the PMD/PUD levels if HAVE_MOVE_PMD
> and/or HAVE_MOVE_PUD are enabled.
>
> Signed-off-by: Kalesh Singh <[email protected]>
> ---
> tools/testing/selftests/vm/.gitignore | 1 +
> tools/testing/selftests/vm/Makefile | 1 +
> tools/testing/selftests/vm/mremap_test.c | 243 +++++++++++++++++++++++
> tools/testing/selftests/vm/run_vmtests | 11 +
> 4 files changed, 256 insertions(+)
> create mode 100644 tools/testing/selftests/vm/mremap_test.c
>

Hi,

This takes 100x longer to run than it should: 1:46 min of running time on
my x86_64 test machine. The entire selftests/vm test suite takes 45 sec on a
bad day, where a bad day is defined as up until about tomorrow, when I will
post a compaction_test.c patch that will cut that time down to about half, or
24 sec total run time...for 22 tests!

In other words, most tests here should take about 1 or 2 seconds, unless they
are exceptionally special snowflakes.

At the very least, the invocation within run_vmtests could pass in a parameter
to tell it to run a shorter test. But there's also opportunities to speed it
up, too.

...
> +
> +#define MAKE_TEST(source_align, destination_align, size, \
> + overlaps, should_fail, test_name) \
> +{ \
> + .name = test_name, \
> + .config = { \
> + .src_alignment = source_align, \
> + .dest_alignment = destination_align, \
> + .region_size = size, \
> + .overlapping = overlaps, \
> + }, \
> + .expect_failure = should_fail \
> +}
> +

OK...

> +#define MAKE_SIMPLE_TEST(source_align, destination_align, size) \
> + MAKE_TEST(source_align, destination_align, size, 0, 0, \
> + #size " mremap - Source " #source_align \
> + " aligned, Destination " #destination_align \
> + " aligned")
> +

...and not OK. :) Because this is just obscuring things. Both the
code and the output are harder to read. For these tiny test programs,
clarity is what we want, not necessarily compactness on the screen.
Because people want to get in, understand what they see in the code
and match it up with what is printed to stdout--without spending much
time. (And that includes run time, as hinted at above.)
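
For example, spelling an entry out directly (untested sketch, reusing
the struct definitions from this patch) keeps the table self-describing:

	struct test test_cases[] = {
		{
			.name = "8KB mremap - Source PTE aligned, "
				"Destination PTE aligned",
			.config = {
				.src_alignment  = PTE,
				.dest_alignment = PTE,
				.region_size    = _8KB,
				.overlapping    = 0,
			},
			.expect_failure = 0,
		},
		/* ... */
	};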

...
> +
> +/* Returns the time taken for the remap on success else returns -1. */
> +static long long remap_region(struct config c)
> +{
> + void *addr, *src_addr, *dest_addr;
> + int i, j;
> + struct timespec t_start = {0, 0}, t_end = {0, 0};
> + long long start_ns, end_ns, align_mask, ret, offset;
> + char pattern[] = {0xa8, 0xcd, 0xfe};

I'd recommend using rand() to help choose the pattern, and using different
patterns for different runs. When testing memory, it's a pitfall to have
the same test pattern.

Normally, you'd also want to report the random seed or the test pattern(s)
or both to stdout, and provide a way to run with the same pattern, but
here I don't *think* you care: all patterns should have the same performance.
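
Something along these lines, maybe (untested sketch; ksft_print_msg()
is already available via kselftest.h):

	#include <stdlib.h>
	#include <time.h>

	/* Fill 'pattern' with pseudo-random bytes, reporting the seed so
	 * a particular run can be reproduced if it ever matters. */
	static void init_pattern(char *pattern, int len)
	{
		unsigned int seed = (unsigned int) time(NULL);

		ksft_print_msg("random seed: %u\n", seed);
		srand(seed);
		while (len--)
			*pattern++ = (char) rand();
	}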

> + int pattern_size = ARRAY_SIZE(pattern);
> +
> + src_addr = get_source_mapping(c);
> + if (!src_addr) {
> + ret = -1;
> + goto out;
> + }
> +
> + /* Set byte pattern */
> + for (i = 0; i < c.region_size; i++) {
> + for (j = 0; i+j < c.region_size && j < pattern_size; j++)
> + memset((char *) src_addr + i+j, pattern[j], 1);
> + i += pattern_size-1;
> + }
> +
> + align_mask = ~(c.dest_alignment - 1);
> + offset = (c.overlapping) ? -c.dest_alignment : c.dest_alignment;

A comment for what the above two lines are doing would be a nice touch.

...
> + start_ns = t_start.tv_sec * 1000000000ULL + t_start.tv_nsec;
> + end_ns = t_end.tv_sec * 1000000000ULL + t_end.tv_nsec;

A const or #defined for all those 0000's would help.
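
e.g. something like (the name is only a suggestion):

	#define NSEC_PER_SEC	1000000000ULL

	start_ns = t_start.tv_sec * NSEC_PER_SEC + t_start.tv_nsec;
	end_ns   = t_end.tv_sec   * NSEC_PER_SEC + t_end.tv_nsec;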

...
> +int main(int argc, char *argv[])
> +{
> + int failures = 0;
> + int i;
> +
> + struct test test_cases[] = {
> + /* Expected mremap failures */
> + MAKE_TEST(_4KB, _4KB, _4KB, 1 /* overlaps */, 1 /* fails */,

Named flags instead of 1's and 0's would avoid the need for messy comments.
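
For instance (sketch; names are only a suggestion):

	#define NON_OVERLAPPING	0
	#define OVERLAPPING	1
	#define EXPECT_SUCCESS	0
	#define EXPECT_FAILURE	1

	MAKE_TEST(_4KB, _4KB, _4KB, OVERLAPPING, EXPECT_FAILURE,
		  "mremap - Source and Destination Regions Overlapping"),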

> + "mremap - Source and Destination Regions Overlapping"),
> + MAKE_TEST(_4KB, _1KB, _4KB, 0 /* overlaps */, 1 /* fails */,
> + "mremap - Destination Address Misaligned (1KB-aligned)"),
> + MAKE_TEST(_1KB, _4KB, _4KB, 0 /* overlaps */, 1 /* fails */,
> + "mremap - Source Address Misaligned (1KB-aligned)"),
> +
> + /* Src addr PTE aligned */
> + MAKE_SIMPLE_TEST(PTE, PTE, _8KB),
> +
> + /* Src addr 1MB aligned */
> + MAKE_SIMPLE_TEST(_1MB, PTE, _2MB),
> + MAKE_SIMPLE_TEST(_1MB, _1MB, _2MB),
> +
> + /* Src addr PMD aligned */
> + MAKE_SIMPLE_TEST(PMD, PTE, _4MB),
> + MAKE_SIMPLE_TEST(PMD, _1MB, _4MB),
> + MAKE_SIMPLE_TEST(PMD, PMD, _4MB),
> +
> + /* Src addr PUD aligned */
> + MAKE_SIMPLE_TEST(PUD, PTE, _2GB),
> + MAKE_SIMPLE_TEST(PUD, _1MB, _2GB),
> + MAKE_SIMPLE_TEST(PUD, PMD, _2GB),
> + MAKE_SIMPLE_TEST(PUD, PUD, _2GB),


Too concise. Not fun lining these up with the stdout report.


thanks,
--
John Hubbard
NVIDIA

2020-10-01 12:28:54

by Kirill A. Shutemov

Subject: Re: [PATCH 0/5] Speed up mremap on large regions

On Wed, Sep 30, 2020 at 03:42:17PM -0700, Lokesh Gidra wrote:
> On Wed, Sep 30, 2020 at 3:32 PM Kirill A. Shutemov
> <[email protected]> wrote:
> >
> > On Wed, Sep 30, 2020 at 10:21:17PM +0000, Kalesh Singh wrote:
> > > mremap time can be optimized by moving entries at the PMD/PUD level if
> > > the source and destination addresses are PMD/PUD-aligned and
> > > PMD/PUD-sized. Enable moving at the PMD and PUD levels on arm64 and
> > > x86. Other architectures where this type of move is supported and known to
> > > be safe can also opt-in to these optimizations by enabling HAVE_MOVE_PMD
> > > and HAVE_MOVE_PUD.
> > >
> > > Observed Performance Improvements for remapping a PUD-aligned 1GB-sized
> > > region on x86 and arm64:
> > >
> > > - HAVE_MOVE_PMD is already enabled on x86 : N/A
> > > - Enabling HAVE_MOVE_PUD on x86 : ~13x speed up
> > >
> > > - Enabling HAVE_MOVE_PMD on arm64 : ~ 8x speed up
> > > - Enabling HAVE_MOVE_PUD on arm64 : ~19x speed up
> > >
> > > Altogether, HAVE_MOVE_PMD and HAVE_MOVE_PUD
> > > give a total of ~150x speed up on arm64.
> >
> > Is there a *real* workload that benefit from HAVE_MOVE_PUD?
> >
> We have a Java garbage collector under development which requires
> moving physical pages of multi-gigabyte heap using mremap. During this
> move, the application threads have to be paused for correctness. It is
> critical to keep this pause as short as possible to avoid jitters
> during user interaction. This is where HAVE_MOVE_PUD will greatly
> help.

Any chance to quantify the effect of mremap() with and without
HAVE_MOVE_PUD?

I doubt it's a major contributor to the GC pause. I expect you need to
move tens of gigs to get a sizable effect. And if your GC routinely moves
tens of gigs, maybe the problem is somewhere else?

I'm asking for numbers, because an increase in complexity comes with a
cost. If it doesn't provide a substantial benefit to a real workload,
maintaining the code forever doesn't make sense.

--
Kirill A. Shutemov

2020-10-01 12:40:53

by Kirill A. Shutemov

Subject: Re: [PATCH 3/5] mm: Speedup mremap on 1GB or larger regions

On Wed, Sep 30, 2020 at 10:21:20PM +0000, Kalesh Singh wrote:
> Android needs to move large memory regions for garbage collection.
> Optimize mremap for >= 1GB-sized regions by moving at the PUD/PGD
> level if the source and destination addresses are PUD-aligned.
> For CONFIG_PGTABLE_LEVELS == 3, moving at the PUD level in effect moves
> PGD entries, since the PUD entry is “folded back” onto the PGD entry.
> Add HAVE_MOVE_PUD so that architectures where moving at the PUD level
> isn't supported/tested can turn this off by not selecting the config.
>
> Signed-off-by: Kalesh Singh <[email protected]>
> ---
> arch/Kconfig | 7 +
> arch/arm64/include/asm/pgtable.h | 1 +
> mm/mremap.c | 211 ++++++++++++++++++++++++++-----
> 3 files changed, 189 insertions(+), 30 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index af14a567b493..5eabaa00bf9b 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -602,6 +602,13 @@ config HAVE_IRQ_TIME_ACCOUNTING
> Archs need to ensure they use a high enough resolution clock to
> support irq time accounting and then call enable_sched_clock_irqtime().
>
> +config HAVE_MOVE_PUD
> + bool
> + help
> + Architectures that select this are able to move page tables at the
> + PUD level. If there are only 3 page table levels, the move effectively
> + happens at the PGD level.
> +
> config HAVE_MOVE_PMD
> bool
> help
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index d5d3fbe73953..8848125e3024 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -415,6 +415,7 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd)
> #define pfn_pud(pfn,prot) __pud(__phys_to_pud_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))
>
> #define set_pmd_at(mm, addr, pmdp, pmd) set_pte_at(mm, addr, (pte_t *)pmdp, pmd_pte(pmd))
> +#define set_pud_at(mm, addr, pudp, pud) set_pte_at(mm, addr, (pte_t *)pudp, pud_pte(pud))
>
> #define __p4d_to_phys(p4d) __pte_to_phys(p4d_pte(p4d))
> #define __phys_to_p4d_val(phys) __phys_to_pte_val(phys)

This doesn't belong to the patch.

> diff --git a/mm/mremap.c b/mm/mremap.c
> index 138abbae4f75..a5a1440bd366 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -249,14 +249,167 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>
> return true;
> }
> +#else
> +static inline bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> + unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd)
> +{
> + return false;
> +}
> #endif
>
> +#ifdef CONFIG_HAVE_MOVE_PUD
> +static pud_t *get_old_pud(struct mm_struct *mm, unsigned long addr)
> +{
> + pgd_t *pgd;
> + p4d_t *p4d;
> + pud_t *pud;
> +
> + pgd = pgd_offset(mm, addr);
> + if (pgd_none_or_clear_bad(pgd))
> + return NULL;
> +
> + p4d = p4d_offset(pgd, addr);
> + if (p4d_none_or_clear_bad(p4d))
> + return NULL;
> +
> + pud = pud_offset(p4d, addr);
> + if (pud_none_or_clear_bad(pud))
> + return NULL;
> +
> + return pud;
> +}
> +
> +static pud_t *alloc_new_pud(struct mm_struct *mm, struct vm_area_struct *vma,
> + unsigned long addr)
> +{
> + pgd_t *pgd;
> + p4d_t *p4d;
> + pud_t *pud;
> +
> + pgd = pgd_offset(mm, addr);
> + p4d = p4d_alloc(mm, pgd, addr);
> + if (!p4d)
> + return NULL;
> + pud = pud_alloc(mm, p4d, addr);
> + if (!pud)
> + return NULL;
> +
> + return pud;
> +}
> +
> +static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
> + unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
> +{
> + spinlock_t *old_ptl, *new_ptl;
> + struct mm_struct *mm = vma->vm_mm;
> + pud_t pud;
> +
> + /*
> + * The destination pud shouldn't be established, free_pgtables()
> + * should have released it.
> + */
> + if (WARN_ON_ONCE(!pud_none(*new_pud)))
> + return false;
> +
> + /*
> + * We don't have to worry about the ordering of src and dst
> + * ptlocks because exclusive mmap_lock prevents deadlock.
> + */
> + old_ptl = pud_lock(vma->vm_mm, old_pud);
> + new_ptl = pud_lockptr(mm, new_pud);
> + if (new_ptl != old_ptl)
> + spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> +
> + /* Clear the pud */
> + pud = *old_pud;
> + pud_clear(old_pud);
> +
> + VM_BUG_ON(!pud_none(*new_pud));
> +
> + /* Set the new pud */
> + set_pud_at(mm, new_addr, new_pud, pud);
> + flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE);
> + if (new_ptl != old_ptl)
> + spin_unlock(new_ptl);
> + spin_unlock(old_ptl);
> +
> + return true;
> +}
> +#else
> +static inline bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
> + unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
> +{
> + return false;
> +}
> +#endif
> +
> +enum pgt_entry {
> + NORMAL_PMD,
> + HPAGE_PMD,
> + NORMAL_PUD,
> +};
> +
> +/*
> + * Returns an extent of the corresponding size for the pgt_entry specified if valid.
> + * Else returns a smaller extent bounded by the end of the source and destination
> + * pgt_entry. Returns 0 if an invalid pgt_entry is specified.
> + */
> +static unsigned long get_extent(enum pgt_entry entry, unsigned long old_addr,
> + unsigned long old_end, unsigned long new_addr)
> +{
> + unsigned long next, extent, mask, size;
> +
> + if (entry == NORMAL_PMD || entry == HPAGE_PMD) {
> + mask = PMD_MASK;
> + size = PMD_SIZE;
> + } else if (entry == NORMAL_PUD) {
> + mask = PUD_MASK;
> + size = PUD_SIZE;
> + } else
> + return 0;
> +
> + next = (old_addr + size) & mask;
> + /* even if next overflowed, extent below will be ok */
> + extent = (next > old_end) ? old_end - old_addr : next - old_addr;
> + next = (new_addr + size) & mask;
> + if (extent > next - new_addr)
> + extent = next - new_addr;
> + return extent;
> +}
> +
> +/*
> + * Attempts to speedup the move by moving entry at the level corresponding to
> + * pgt_entry. Returns true if the move was successful, else false.
> + */
> +static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma,
> + unsigned long old_addr, unsigned long new_addr, void *old_entry,
> + void *new_entry, bool need_rmap_locks)
> +{
> + bool moved = false;
> +
> + /* See comment in move_ptes() */
> + if (need_rmap_locks)
> + take_rmap_locks(vma);
> + if (entry == NORMAL_PMD)
> + moved = move_normal_pmd(vma, old_addr, new_addr, old_entry, new_entry);
> + else if (entry == NORMAL_PUD)
> + moved = move_normal_pud(vma, old_addr, new_addr, old_entry, new_entry);
> + else if (entry == HPAGE_PMD)
> + moved = move_huge_pmd(vma, old_addr, new_addr, old_entry, new_entry);
> + else
> + WARN_ON_ONCE(1);

BUILD_BUG() should work.

And why not use switch() instead of ifs.
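
i.e. something along these lines (untested sketch of the suggestion):

	switch (entry) {
	case NORMAL_PMD:
		moved = move_normal_pmd(vma, old_addr, new_addr,
					old_entry, new_entry);
		break;
	case NORMAL_PUD:
		moved = move_normal_pud(vma, old_addr, new_addr,
					old_entry, new_entry);
		break;
	case HPAGE_PMD:
		moved = move_huge_pmd(vma, old_addr, new_addr,
				      old_entry, new_entry);
		break;
	default:
		/* Relies on every caller passing a compile-time constant
		 * so the default arm is provably dead code. */
		BUILD_BUG();
		break;
	}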

> + if (need_rmap_locks)
> + drop_rmap_locks(vma);
> +
> + return moved;
> +}
> +
> unsigned long move_page_tables(struct vm_area_struct *vma,
> unsigned long old_addr, struct vm_area_struct *new_vma,
> unsigned long new_addr, unsigned long len,
> bool need_rmap_locks)
> {
> - unsigned long extent, next, old_end;
> + unsigned long extent, old_end;
> struct mmu_notifier_range range;
> pmd_t *old_pmd, *new_pmd;
>
> @@ -269,14 +422,27 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>
> for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
> cond_resched();
> - next = (old_addr + PMD_SIZE) & PMD_MASK;
> - /* even if next overflowed, extent below will be ok */
> - extent = next - old_addr;
> - if (extent > old_end - old_addr)
> - extent = old_end - old_addr;
> - next = (new_addr + PMD_SIZE) & PMD_MASK;
> - if (extent > next - new_addr)
> - extent = next - new_addr;
> +#ifdef CONFIG_HAVE_MOVE_PUD
> + /*
> + * If extent is PUD-sized try to speed up the move by moving at the
> + * PUD level if possible.
> + */
> + extent = get_extent(NORMAL_PUD, old_addr, old_end, new_addr);
> + if (extent == PUD_SIZE) {
> + pud_t *old_pud, *new_pud;
> +
> + old_pud = get_old_pud(vma->vm_mm, old_addr);
> + if (!old_pud)
> + continue;
> + new_pud = alloc_new_pud(vma->vm_mm, vma, new_addr);
> + if (!new_pud)
> + break;
> + if (move_pgt_entry(NORMAL_PUD, vma, old_addr, new_addr,
> + old_pud, new_pud, need_rmap_locks))
> + continue;
> + }
> +#endif
> + extent = get_extent(NORMAL_PMD, old_addr, old_end, new_addr);
> old_pmd = get_old_pmd(vma->vm_mm, old_addr);
> if (!old_pmd)
> continue;
> @@ -284,18 +450,10 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> if (!new_pmd)
> break;
> if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd) || pmd_devmap(*old_pmd)) {
> - if (extent == HPAGE_PMD_SIZE) {
> - bool moved;
> - /* See comment in move_ptes() */
> - if (need_rmap_locks)
> - take_rmap_locks(vma);
> - moved = move_huge_pmd(vma, old_addr, new_addr,
> - old_pmd, new_pmd);
> - if (need_rmap_locks)
> - drop_rmap_locks(vma);
> - if (moved)
> - continue;
> - }
> + if (extent == HPAGE_PMD_SIZE &&
> + move_pgt_entry(HPAGE_PMD, vma, old_addr, new_addr, old_pmd,
> + new_pmd, need_rmap_locks))
> + continue;
> split_huge_pmd(vma, old_pmd, old_addr);
> if (pmd_trans_unstable(old_pmd))
> continue;
> @@ -305,15 +463,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> * If the extent is PMD-sized, try to speed the move by
> * moving at the PMD level if possible.
> */
> - bool moved;
> -
> - if (need_rmap_locks)
> - take_rmap_locks(vma);
> - moved = move_normal_pmd(vma, old_addr, new_addr,
> - old_pmd, new_pmd);
> - if (need_rmap_locks)
> - drop_rmap_locks(vma);
> - if (moved)
> + if (move_pgt_entry(NORMAL_PMD, vma, old_addr, new_addr, old_pmd,
> + new_pmd, need_rmap_locks))
> continue;
> #endif
> }
> --
> 2.28.0.709.gb0816b6eb0-goog
>

--
Kirill A. Shutemov

2020-10-01 15:48:24

by Kalesh Singh

Subject: Re: [PATCH 1/5] kselftests: vm: Add mremap tests

On Thu, Oct 1, 2020 at 3:24 AM John Hubbard <[email protected]> wrote:
>
> On 9/30/20 3:21 PM, Kalesh Singh wrote:
> > Test mremap on regions of various sizes and alignments and validate
> > data after remapping. Also provide total time for remapping
> > the region which is useful for performance comparison of the mremap
> > optimizations that move pages at the PMD/PUD levels if HAVE_MOVE_PMD
> > and/or HAVE_MOVE_PUD are enabled.
> >
> > Signed-off-by: Kalesh Singh <[email protected]>
> > ---
> > tools/testing/selftests/vm/.gitignore | 1 +
> > tools/testing/selftests/vm/Makefile | 1 +
> > tools/testing/selftests/vm/mremap_test.c | 243 +++++++++++++++++++++++
> > tools/testing/selftests/vm/run_vmtests | 11 +
> > 4 files changed, 256 insertions(+)
> > create mode 100644 tools/testing/selftests/vm/mremap_test.c
> >
>
> Hi,
>
> This takes 100x longer to run than it should: 1:46 min of running time on
> my x86_64 test machine. The entire selftests/vm test suite takes 45 sec on a
> bad day, where a bad day is defined as up until about tomorrow, when I will
> post a compaction_test.c patch that will cut that time down to about half, or
> 24 sec total run time...for 22 tests!
>
> In other words, most tests here should take about 1 or 2 seconds, unless they
> are exceptionally special snowflakes.
>
> At the very least, the invocation within run_vmtests could pass in a parameter
> to tell it to run a shorter test. But there's also opportunities to speed it
> up, too.


Hi John. Thanks for the comments.

The bulk of the test time comes from setting and verifying the byte
pattern in the 1GB or larger regions used to exercise the HAVE_MOVE_PUD
functionality. Without the 1GB or larger regions, the test takes 0.18
seconds on my x86_64 system.

One option would be to only validate up to a certain threshold of the
remapped region, with a flag to select either a threshold or a full-size
validation. In some initial testing with a 4MB threshold, the total time
dropped from 1:12 minutes (verifying the entire remapped region) to 0.38
seconds. A 4MB threshold would still cover the full region of every test
except those for the 1GB and 2GB sized regions. Let me know what you
think.
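
Roughly, the idea is something like this (sketch only; full_validation
would come from a new command-line flag or config field, and _4MB
already exists in the test's size enum):

	unsigned long long threshold = c.region_size;

	/* Cap how much of the region gets its byte pattern written and
	 * verified, unless a full validation was explicitly requested. */
	if (!full_validation && threshold > _4MB)
		threshold = _4MB;

	/* ... and use 'threshold' instead of c.region_size in the
	 * pattern set/verify loops; the mremap itself still covers the
	 * whole region. */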

Your other comments below sound good to me. I’ll make those changes in
the next version.

Thanks,
Kalesh

>
> ...
> > +
> > +#define MAKE_TEST(source_align, destination_align, size, \
> > + overlaps, should_fail, test_name) \
> > +{ \
> > + .name = test_name, \
> > + .config = { \
> > + .src_alignment = source_align, \
> > + .dest_alignment = destination_align, \
> > + .region_size = size, \
> > + .overlapping = overlaps, \
> > + }, \
> > + .expect_failure = should_fail \
> > +}
> > +
>
> OK...
>
> > +#define MAKE_SIMPLE_TEST(source_align, destination_align, size) \
> > + MAKE_TEST(source_align, destination_align, size, 0, 0, \
> > + #size " mremap - Source " #source_align \
> > + " aligned, Destination " #destination_align \
> > + " aligned")
> > +
>
> ...and not OK. :) Because this is just obscuring things. Both the
> code and the output are harder to read. For these tiny test programs,
> clarity is what we want, not necessarily compactness on the screen.
> Because people want to get in, understand what they see in the code
> and match it up with what is printed to stdout--without spending much
> time. (And that includes run time, as hinted at above.)
>
> ...
> > +
> > +/* Returns the time taken for the remap on success else returns -1. */
> > +static long long remap_region(struct config c)
> > +{
> > + void *addr, *src_addr, *dest_addr;
> > + int i, j;
> > + struct timespec t_start = {0, 0}, t_end = {0, 0};
> > + long long start_ns, end_ns, align_mask, ret, offset;
> > + char pattern[] = {0xa8, 0xcd, 0xfe};
>
> I'd recommend using rand() to help choose the pattern, and using different
> patterns for different runs. When testing memory, it's a pitfall to have
> the same test pattern.
>
> Normally, you'd also want to report the random seed or the test pattern(s)
> or both to stdout, and provide a way to run with the same pattern, but
> here I don't *think* you care: all patterns should have the same performance.
>
> > + int pattern_size = ARRAY_SIZE(pattern);
> > +
> > + src_addr = get_source_mapping(c);
> > + if (!src_addr) {
> > + ret = -1;
> > + goto out;
> > + }
> > +
> > + /* Set byte pattern */
> > + for (i = 0; i < c.region_size; i++) {
> > + for (j = 0; i+j < c.region_size && j < pattern_size; j++)
> > + memset((char *) src_addr + i+j, pattern[j], 1);
> > + i += pattern_size-1;
> > + }
> > +
> > + align_mask = ~(c.dest_alignment - 1);
> > + offset = (c.overlapping) ? -c.dest_alignment : c.dest_alignment;
>
> A comment for what the above two lines are doing would be a nice touch.
>
> ...
> > + start_ns = t_start.tv_sec * 1000000000ULL + t_start.tv_nsec;
> > + end_ns = t_end.tv_sec * 1000000000ULL + t_end.tv_nsec;
>
> A const or #defined for all those 0000's would help.
>
> ...
> > +int main(int argc, char *argv[])
> > +{
> > + int failures = 0;
> > + int i;
> > +
> > + struct test test_cases[] = {
> > + /* Expected mremap failures */
> > + MAKE_TEST(_4KB, _4KB, _4KB, 1 /* overlaps */, 1 /* fails */,
>
> Named flags instead of 1's and 0's would avoid the need for messy comments.
>
> > + "mremap - Source and Destination Regions Overlapping"),
> > + MAKE_TEST(_4KB, _1KB, _4KB, 0 /* overlaps */, 1 /* fails */,
> > + "mremap - Destination Address Misaligned (1KB-aligned)"),
> > + MAKE_TEST(_1KB, _4KB, _4KB, 0 /* overlaps */, 1 /* fails */,
> > + "mremap - Source Address Misaligned (1KB-aligned)"),
> > +
> > + /* Src addr PTE aligned */
> > + MAKE_SIMPLE_TEST(PTE, PTE, _8KB),
> > +
> > + /* Src addr 1MB aligned */
> > + MAKE_SIMPLE_TEST(_1MB, PTE, _2MB),
> > + MAKE_SIMPLE_TEST(_1MB, _1MB, _2MB),
> > +
> > + /* Src addr PMD aligned */
> > + MAKE_SIMPLE_TEST(PMD, PTE, _4MB),
> > + MAKE_SIMPLE_TEST(PMD, _1MB, _4MB),
> > + MAKE_SIMPLE_TEST(PMD, PMD, _4MB),
> > +
> > + /* Src addr PUD aligned */
> > + MAKE_SIMPLE_TEST(PUD, PTE, _2GB),
> > + MAKE_SIMPLE_TEST(PUD, _1MB, _2GB),
> > + MAKE_SIMPLE_TEST(PUD, PMD, _2GB),
> > + MAKE_SIMPLE_TEST(PUD, PUD, _2GB),
>
>
> Too concise. Not fun lining these up with the stdout report.
>
>
> thanks,
> --
> John Hubbard
> NVIDIA
>

2020-10-01 16:04:34

by Kalesh Singh

Subject: Re: [PATCH 0/5] Speed up mremap on large regions

On Thu, Oct 1, 2020 at 8:27 AM Kirill A. Shutemov
<[email protected]> wrote:
>
> On Wed, Sep 30, 2020 at 03:42:17PM -0700, Lokesh Gidra wrote:
> > On Wed, Sep 30, 2020 at 3:32 PM Kirill A. Shutemov
> > <[email protected]> wrote:
> > >
> > > On Wed, Sep 30, 2020 at 10:21:17PM +0000, Kalesh Singh wrote:
> > > > mremap time can be optimized by moving entries at the PMD/PUD level if
> > > > the source and destination addresses are PMD/PUD-aligned and
> > > > PMD/PUD-sized. Enable moving at the PMD and PUD levels on arm64 and
> > > > x86. Other architectures where this type of move is supported and known to
> > > > be safe can also opt-in to these optimizations by enabling HAVE_MOVE_PMD
> > > > and HAVE_MOVE_PUD.
> > > >
> > > > Observed Performance Improvements for remapping a PUD-aligned 1GB-sized
> > > > region on x86 and arm64:
> > > >
> > > > - HAVE_MOVE_PMD is already enabled on x86 : N/A
> > > > - Enabling HAVE_MOVE_PUD on x86 : ~13x speed up
> > > >
> > > > - Enabling HAVE_MOVE_PMD on arm64 : ~ 8x speed up
> > > > - Enabling HAVE_MOVE_PUD on arm64 : ~19x speed up
> > > >
> > > > Altogether, HAVE_MOVE_PMD and HAVE_MOVE_PUD
> > > > give a total of ~150x speed up on arm64.
> > >
> > > Is there a *real* workload that benefit from HAVE_MOVE_PUD?
> > >
> > We have a Java garbage collector under development which requires
> > moving physical pages of multi-gigabyte heap using mremap. During this
> > move, the application threads have to be paused for correctness. It is
> > critical to keep this pause as short as possible to avoid jitters
> > during user interaction. This is where HAVE_MOVE_PUD will greatly
> > help.
>
> Any chance to quantify the effect of mremap() with and without
> HAVE_MOVE_PUD?
>
> I doubt it's a major contributor to the GC pause. I expect you need to
> move tens of gigs to get a sizable effect. And if your GC routinely moves
> tens of gigs, maybe the problem is somewhere else?
>
> I'm asking for numbers, because an increase in complexity comes with a
> cost. If it doesn't provide a substantial benefit to a real workload,
> maintaining the code forever doesn't make sense.

Lokesh on this thread would be better able to answer this. I'll let
him weigh in here.
Thanks, Kalesh
>
> --
> Kirill A. Shutemov
>

2020-10-01 16:42:53

by Kalesh Singh

Subject: Re: [PATCH 3/5] mm: Speedup mremap on 1GB or larger regions

On Thu, Oct 1, 2020 at 8:37 AM Kirill A. Shutemov
<[email protected]> wrote:
>
> On Wed, Sep 30, 2020 at 10:21:20PM +0000, Kalesh Singh wrote:
> > Android needs to move large memory regions for garbage collection.
> > Optimize mremap for >= 1GB-sized regions by moving at the PUD/PGD
> > level if the source and destination addresses are PUD-aligned.
> > For CONFIG_PGTABLE_LEVELS == 3, moving at the PUD level in effect moves
> > PGD entries, since the PUD entry is “folded back” onto the PGD entry.
> > Add HAVE_MOVE_PUD so that architectures where moving at the PUD level
> > isn't supported/tested can turn this off by not selecting the config.
> >
> > Signed-off-by: Kalesh Singh <[email protected]>
> > ---
> > arch/Kconfig | 7 +
> > arch/arm64/include/asm/pgtable.h | 1 +
> > mm/mremap.c | 211 ++++++++++++++++++++++++++-----
> > 3 files changed, 189 insertions(+), 30 deletions(-)
> >
> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index af14a567b493..5eabaa00bf9b 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -602,6 +602,13 @@ config HAVE_IRQ_TIME_ACCOUNTING
> > Archs need to ensure they use a high enough resolution clock to
> > support irq time accounting and then call enable_sched_clock_irqtime().
> >
> > +config HAVE_MOVE_PUD
> > + bool
> > + help
> > + Architectures that select this are able to move page tables at the
> > + PUD level. If there are only 3 page table levels, the move effectively
> > + happens at the PGD level.
> > +
> > config HAVE_MOVE_PMD
> > bool
> > help
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index d5d3fbe73953..8848125e3024 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -415,6 +415,7 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd)
> > #define pfn_pud(pfn,prot) __pud(__phys_to_pud_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))
> >
> > #define set_pmd_at(mm, addr, pmdp, pmd) set_pte_at(mm, addr, (pte_t *)pmdp, pmd_pte(pmd))
> > +#define set_pud_at(mm, addr, pudp, pud) set_pte_at(mm, addr, (pte_t *)pudp, pud_pte(pud))
> >
> > #define __p4d_to_phys(p4d) __pte_to_phys(p4d_pte(p4d))
> > #define __phys_to_p4d_val(phys) __phys_to_pte_val(phys)
>
> This doesn't belong to the patch.
Good catch. I'll move this into a separate patch.
>
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index 138abbae4f75..a5a1440bd366 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -249,14 +249,167 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> >
> > return true;
> > }
> > +#else
> > +static inline bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> > + unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd)
> > +{
> > + return false;
> > +}
> > #endif
> >
> > +#ifdef CONFIG_HAVE_MOVE_PUD
> > +static pud_t *get_old_pud(struct mm_struct *mm, unsigned long addr)
> > +{
> > + pgd_t *pgd;
> > + p4d_t *p4d;
> > + pud_t *pud;
> > +
> > + pgd = pgd_offset(mm, addr);
> > + if (pgd_none_or_clear_bad(pgd))
> > + return NULL;
> > +
> > + p4d = p4d_offset(pgd, addr);
> > + if (p4d_none_or_clear_bad(p4d))
> > + return NULL;
> > +
> > + pud = pud_offset(p4d, addr);
> > + if (pud_none_or_clear_bad(pud))
> > + return NULL;
> > +
> > + return pud;
> > +}
> > +
> > +static pud_t *alloc_new_pud(struct mm_struct *mm, struct vm_area_struct *vma,
> > + unsigned long addr)
> > +{
> > + pgd_t *pgd;
> > + p4d_t *p4d;
> > + pud_t *pud;
> > +
> > + pgd = pgd_offset(mm, addr);
> > + p4d = p4d_alloc(mm, pgd, addr);
> > + if (!p4d)
> > + return NULL;
> > + pud = pud_alloc(mm, p4d, addr);
> > + if (!pud)
> > + return NULL;
> > +
> > + return pud;
> > +}
> > +
> > +static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
> > + unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
> > +{
> > + spinlock_t *old_ptl, *new_ptl;
> > + struct mm_struct *mm = vma->vm_mm;
> > + pud_t pud;
> > +
> > + /*
> > + * The destination pud shouldn't be established, free_pgtables()
> > + * should have released it.
> > + */
> > + if (WARN_ON_ONCE(!pud_none(*new_pud)))
> > + return false;
> > +
> > + /*
> > + * We don't have to worry about the ordering of src and dst
> > + * ptlocks because exclusive mmap_lock prevents deadlock.
> > + */
> > + old_ptl = pud_lock(vma->vm_mm, old_pud);
> > + new_ptl = pud_lockptr(mm, new_pud);
> > + if (new_ptl != old_ptl)
> > + spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> > +
> > + /* Clear the pud */
> > + pud = *old_pud;
> > + pud_clear(old_pud);
> > +
> > + VM_BUG_ON(!pud_none(*new_pud));
> > +
> > + /* Set the new pud */
> > + set_pud_at(mm, new_addr, new_pud, pud);
> > + flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE);
> > + if (new_ptl != old_ptl)
> > + spin_unlock(new_ptl);
> > + spin_unlock(old_ptl);
> > +
> > + return true;
> > +}
> > +#else
> > +static inline bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
> > + unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
> > +{
> > + return false;
> > +}
> > +#endif
> > +
> > +enum pgt_entry {
> > + NORMAL_PMD,
> > + HPAGE_PMD,
> > + NORMAL_PUD,
> > +};
> > +
> > +/*
> > + * Returns an extent of the corresponding size for the pgt_entry specified if valid.
> > + * Else returns a smaller extent bounded by the end of the source and destination
> > + * pgt_entry. Returns 0 if an invalid pgt_entry is specified.
> > + */
> > +static unsigned long get_extent(enum pgt_entry entry, unsigned long old_addr,
> > + unsigned long old_end, unsigned long new_addr)
> > +{
> > + unsigned long next, extent, mask, size;
> > +
> > + if (entry == NORMAL_PMD || entry == HPAGE_PMD) {
> > + mask = PMD_MASK;
> > + size = PMD_SIZE;
> > + } else if (entry == NORMAL_PUD) {
> > + mask = PUD_MASK;
> > + size = PUD_SIZE;
> > + } else
> > + return 0;
> > +
> > + next = (old_addr + size) & mask;
> > + /* even if next overflowed, extent below will be ok */
> > + extent = (next > old_end) ? old_end - old_addr : next - old_addr;
> > + next = (new_addr + size) & mask;
> > + if (extent > next - new_addr)
> > + extent = next - new_addr;
> > + return extent;
> > +}
> > +
> > +/*
> > + * Attempts to speedup the move by moving entry at the level corresponding to
> > + * pgt_entry. Returns true if the move was successful, else false.
> > + */
> > +static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma,
> > + unsigned long old_addr, unsigned long new_addr, void *old_entry,
> > + void *new_entry, bool need_rmap_locks)
> > +{
> > + bool moved = false;
> > +
> > + /* See comment in move_ptes() */
> > + if (need_rmap_locks)
> > + take_rmap_locks(vma);
> > + if (entry == NORMAL_PMD)
> > + moved = move_normal_pmd(vma, old_addr, new_addr, old_entry, new_entry);
> > + else if (entry == NORMAL_PUD)
> > + moved = move_normal_pud(vma, old_addr, new_addr, old_entry, new_entry);
> > + else if (entry == HPAGE_PMD)
> > + moved = move_huge_pmd(vma, old_addr, new_addr, old_entry, new_entry);
> > + else
> > + WARN_ON_ONCE(1);
>
> BUILD_BUG() should work.
This doesn't get caught at compile time since entry isn't a constant.
>
> And why not use switch() instead of ifs.
I'll move to switch() in the next version.
Thanks, Kalesh
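
A minimal sketch of what a switch()-based move_pgt_entry() could look
like, reusing the helpers from the patch above (illustrative only, not
the actual next revision; the default arm keeps WARN_ON_ONCE() since
'entry' is not known to be a compile-time constant at every call site):

static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma,
		unsigned long old_addr, unsigned long new_addr,
		void *old_entry, void *new_entry, bool need_rmap_locks)
{
	bool moved = false;

	/* See comment in move_ptes() */
	if (need_rmap_locks)
		take_rmap_locks(vma);

	switch (entry) {
	case NORMAL_PMD:
		moved = move_normal_pmd(vma, old_addr, new_addr,
					old_entry, new_entry);
		break;
	case NORMAL_PUD:
		moved = move_normal_pud(vma, old_addr, new_addr,
					old_entry, new_entry);
		break;
	case HPAGE_PMD:
		moved = move_huge_pmd(vma, old_addr, new_addr,
				      old_entry, new_entry);
		break;
	default:
		WARN_ON_ONCE(1);
		break;
	}

	if (need_rmap_locks)
		drop_rmap_locks(vma);

	return moved;
}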
>
> > + if (need_rmap_locks)
> > + drop_rmap_locks(vma);
> > +
> > + return moved;
> > +}
> > +
> > unsigned long move_page_tables(struct vm_area_struct *vma,
> > unsigned long old_addr, struct vm_area_struct *new_vma,
> > unsigned long new_addr, unsigned long len,
> > bool need_rmap_locks)
> > {
> > - unsigned long extent, next, old_end;
> > + unsigned long extent, old_end;
> > struct mmu_notifier_range range;
> > pmd_t *old_pmd, *new_pmd;
> >
> > @@ -269,14 +422,27 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> >
> > for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
> > cond_resched();
> > - next = (old_addr + PMD_SIZE) & PMD_MASK;
> > - /* even if next overflowed, extent below will be ok */
> > - extent = next - old_addr;
> > - if (extent > old_end - old_addr)
> > - extent = old_end - old_addr;
> > - next = (new_addr + PMD_SIZE) & PMD_MASK;
> > - if (extent > next - new_addr)
> > - extent = next - new_addr;
> > +#ifdef CONFIG_HAVE_MOVE_PUD
> > + /*
> > + * If extent is PUD-sized try to speed up the move by moving at the
> > + * PUD level if possible.
> > + */
> > + extent = get_extent(NORMAL_PUD, old_addr, old_end, new_addr);
> > + if (extent == PUD_SIZE) {
> > + pud_t *old_pud, *new_pud;
> > +
> > + old_pud = get_old_pud(vma->vm_mm, old_addr);
> > + if (!old_pud)
> > + continue;
> > + new_pud = alloc_new_pud(vma->vm_mm, vma, new_addr);
> > + if (!new_pud)
> > + break;
> > + if (move_pgt_entry(NORMAL_PUD, vma, old_addr, new_addr,
> > + old_pud, new_pud, need_rmap_locks))
> > + continue;
> > + }
> > +#endif
> > + extent = get_extent(NORMAL_PMD, old_addr, old_end, new_addr);
> > old_pmd = get_old_pmd(vma->vm_mm, old_addr);
> > if (!old_pmd)
> > continue;
> > @@ -284,18 +450,10 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> > if (!new_pmd)
> > break;
> > if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd) || pmd_devmap(*old_pmd)) {
> > - if (extent == HPAGE_PMD_SIZE) {
> > - bool moved;
> > - /* See comment in move_ptes() */
> > - if (need_rmap_locks)
> > - take_rmap_locks(vma);
> > - moved = move_huge_pmd(vma, old_addr, new_addr,
> > - old_pmd, new_pmd);
> > - if (need_rmap_locks)
> > - drop_rmap_locks(vma);
> > - if (moved)
> > - continue;
> > - }
> > + if (extent == HPAGE_PMD_SIZE &&
> > + move_pgt_entry(HPAGE_PMD, vma, old_addr, new_addr, old_pmd,
> > + new_pmd, need_rmap_locks))
> > + continue;
> > split_huge_pmd(vma, old_pmd, old_addr);
> > if (pmd_trans_unstable(old_pmd))
> > continue;
> > @@ -305,15 +463,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> > * If the extent is PMD-sized, try to speed the move by
> > * moving at the PMD level if possible.
> > */
> > - bool moved;
> > -
> > - if (need_rmap_locks)
> > - take_rmap_locks(vma);
> > - moved = move_normal_pmd(vma, old_addr, new_addr,
> > - old_pmd, new_pmd);
> > - if (need_rmap_locks)
> > - drop_rmap_locks(vma);
> > - if (moved)
> > + if (move_pgt_entry(NORMAL_PMD, vma, old_addr, new_addr, old_pmd,
> > + new_pmd, need_rmap_locks))
> > continue;
> > #endif
> > }
> > --
> > 2.28.0.709.gb0816b6eb0-goog
> >
>
> --
> Kirill A. Shutemov
>

2020-10-01 18:12:16

by Kalesh Singh

[permalink] [raw]
Subject: Re: [PATCH 3/5] mm: Speedup mremap on 1GB or larger regions

On Thu, Oct 1, 2020 at 12:40 PM Kalesh Singh <[email protected]> wrote:
>
> On Thu, Oct 1, 2020 at 8:37 AM Kirill A. Shutemov
> <[email protected]> wrote:
> >
> > On Wed, Sep 30, 2020 at 10:21:20PM +0000, Kalesh Singh wrote:
> > > Android needs to move large memory regions for garbage collection.
> > > Optimize mremap for >= 1GB-sized regions by moving at the PUD/PGD
> > > level if the source and destination addresses are PUD-aligned.
> > > For CONFIG_PGTABLE_LEVELS == 3, moving at the PUD level in effect moves
> > > PGD entries, since the PUD entry is “folded back” onto the PGD entry.
> > > Add HAVE_MOVE_PUD so that architectures where moving at the PUD level
> > > isn't supported/tested can turn this off by not selecting the config.
> > >
> > > Signed-off-by: Kalesh Singh <[email protected]>
> > > ---
> > > arch/Kconfig | 7 +
> > > arch/arm64/include/asm/pgtable.h | 1 +
> > > mm/mremap.c | 211 ++++++++++++++++++++++++++-----
> > > 3 files changed, 189 insertions(+), 30 deletions(-)
> > >
> > > diff --git a/arch/Kconfig b/arch/Kconfig
> > > index af14a567b493..5eabaa00bf9b 100644
> > > --- a/arch/Kconfig
> > > +++ b/arch/Kconfig
> > > @@ -602,6 +602,13 @@ config HAVE_IRQ_TIME_ACCOUNTING
> > > Archs need to ensure they use a high enough resolution clock to
> > > support irq time accounting and then call enable_sched_clock_irqtime().
> > >
> > > +config HAVE_MOVE_PUD
> > > + bool
> > > + help
> > > + Architectures that select this are able to move page tables at the
> > > + PUD level. If there are only 3 page table levels, the move effectively
> > > + happens at the PGD level.
> > > +
> > > config HAVE_MOVE_PMD
> > > bool
> > > help
> > > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > > index d5d3fbe73953..8848125e3024 100644
> > > --- a/arch/arm64/include/asm/pgtable.h
> > > +++ b/arch/arm64/include/asm/pgtable.h
> > > @@ -415,6 +415,7 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd)
> > > #define pfn_pud(pfn,prot) __pud(__phys_to_pud_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))
> > >
> > > #define set_pmd_at(mm, addr, pmdp, pmd) set_pte_at(mm, addr, (pte_t *)pmdp, pmd_pte(pmd))
> > > +#define set_pud_at(mm, addr, pudp, pud) set_pte_at(mm, addr, (pte_t *)pudp, pud_pte(pud))
> > >
> > > #define __p4d_to_phys(p4d) __pte_to_phys(p4d_pte(p4d))
> > > #define __phys_to_p4d_val(phys) __phys_to_pte_val(phys)
> >
> > This doesn't belong to the patch.
> Good catch. I'll move this into a separate patch.
> >
> > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > index 138abbae4f75..a5a1440bd366 100644
> > > --- a/mm/mremap.c
> > > +++ b/mm/mremap.c
> > > @@ -249,14 +249,167 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> > >
> > > return true;
> > > }
> > > +#else
> > > +static inline bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> > > + unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd)
> > > +{
> > > + return false;
> > > +}
> > > #endif
> > >
> > > +#ifdef CONFIG_HAVE_MOVE_PUD
> > > +static pud_t *get_old_pud(struct mm_struct *mm, unsigned long addr)
> > > +{
> > > + pgd_t *pgd;
> > > + p4d_t *p4d;
> > > + pud_t *pud;
> > > +
> > > + pgd = pgd_offset(mm, addr);
> > > + if (pgd_none_or_clear_bad(pgd))
> > > + return NULL;
> > > +
> > > + p4d = p4d_offset(pgd, addr);
> > > + if (p4d_none_or_clear_bad(p4d))
> > > + return NULL;
> > > +
> > > + pud = pud_offset(p4d, addr);
> > > + if (pud_none_or_clear_bad(pud))
> > > + return NULL;
> > > +
> > > + return pud;
> > > +}
> > > +
> > > +static pud_t *alloc_new_pud(struct mm_struct *mm, struct vm_area_struct *vma,
> > > + unsigned long addr)
> > > +{
> > > + pgd_t *pgd;
> > > + p4d_t *p4d;
> > > + pud_t *pud;
> > > +
> > > + pgd = pgd_offset(mm, addr);
> > > + p4d = p4d_alloc(mm, pgd, addr);
> > > + if (!p4d)
> > > + return NULL;
> > > + pud = pud_alloc(mm, p4d, addr);
> > > + if (!pud)
> > > + return NULL;
> > > +
> > > + return pud;
> > > +}
> > > +
> > > +static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
> > > + unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
> > > +{
> > > + spinlock_t *old_ptl, *new_ptl;
> > > + struct mm_struct *mm = vma->vm_mm;
> > > + pud_t pud;
> > > +
> > > + /*
> > > + * The destination pud shouldn't be established, free_pgtables()
> > > + * should have released it.
> > > + */
> > > + if (WARN_ON_ONCE(!pud_none(*new_pud)))
> > > + return false;
> > > +
> > > + /*
> > > + * We don't have to worry about the ordering of src and dst
> > > + * ptlocks because exclusive mmap_lock prevents deadlock.
> > > + */
> > > + old_ptl = pud_lock(vma->vm_mm, old_pud);
> > > + new_ptl = pud_lockptr(mm, new_pud);
> > > + if (new_ptl != old_ptl)
> > > + spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> > > +
> > > + /* Clear the pud */
> > > + pud = *old_pud;
> > > + pud_clear(old_pud);
> > > +
> > > + VM_BUG_ON(!pud_none(*new_pud));
> > > +
> > > + /* Set the new pud */
> > > + set_pud_at(mm, new_addr, new_pud, pud);
> > > + flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE);
> > > + if (new_ptl != old_ptl)
> > > + spin_unlock(new_ptl);
> > > + spin_unlock(old_ptl);
> > > +
> > > + return true;
> > > +}
> > > +#else
> > > +static inline bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
> > > + unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
> > > +{
> > > + return false;
> > > +}
> > > +#endif
> > > +
> > > +enum pgt_entry {
> > > + NORMAL_PMD,
> > > + HPAGE_PMD,
> > > + NORMAL_PUD,
> > > +};
> > > +
> > > +/*
> > > + * Returns an extent of the corresponding size for the pgt_entry specified if valid.
> > > + * Else returns a smaller extent bounded by the end of the source and destination
> > > + * pgt_entry. Returns 0 if an invalid pgt_entry is specified.
> > > + */
> > > +static unsigned long get_extent(enum pgt_entry entry, unsigned long old_addr,
> > > + unsigned long old_end, unsigned long new_addr)
> > > +{
> > > + unsigned long next, extent, mask, size;
> > > +
> > > + if (entry == NORMAL_PMD || entry == HPAGE_PMD) {
> > > + mask = PMD_MASK;
> > > + size = PMD_SIZE;
> > > + } else if (entry == NORMAL_PUD) {
> > > + mask = PUD_MASK;
> > > + size = PUD_SIZE;
> > > + } else
> > > + return 0;
> > > +
> > > + next = (old_addr + size) & mask;
> > > + /* even if next overflowed, extent below will be ok */
> > > + extent = (next > old_end) ? old_end - old_addr : next - old_addr;
> > > + next = (new_addr + size) & mask;
> > > + if (extent > next - new_addr)
> > > + extent = next - new_addr;
> > > + return extent;
> > > +}
> > > +
> > > +/*
> > > + * Attempts to speedup the move by moving entry at the level corresponding to
> > > + * pgt_entry. Returns true if the move was successful, else false.
> > > + */
> > > +static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma,
> > > + unsigned long old_addr, unsigned long new_addr, void *old_entry,
> > > + void *new_entry, bool need_rmap_locks)
> > > +{
> > > + bool moved = false;
> > > +
> > > + /* See comment in move_ptes() */
> > > + if (need_rmap_locks)
> > > + take_rmap_locks(vma);
> > > + if (entry == NORMAL_PMD)
> > > + moved = move_normal_pmd(vma, old_addr, new_addr, old_entry, new_entry);
> > > + else if (entry == NORMAL_PUD)
> > > + moved = move_normal_pud(vma, old_addr, new_addr, old_entry, new_entry);
> > > + else if (entry == HPAGE_PMD)
> > > + moved = move_huge_pmd(vma, old_addr, new_addr, old_entry, new_entry);
> > > + else
> > > + WARN_ON_ONCE(1);
> >
> > BUILD_BUG() should work.
Please ignore the previous comment. You are right, BUILD_BUG() would work.
> This doesn't get caught at compile time since entry isn't a constant.
> >
> > And why not use switch() instead of ifs.
> I'll move to switch() in the next version.
> Thanks, Kalesh
> >
> > > + if (need_rmap_locks)
> > > + drop_rmap_locks(vma);
> > > +
> > > + return moved;
> > > +}
> > > +
> > > unsigned long move_page_tables(struct vm_area_struct *vma,
> > > unsigned long old_addr, struct vm_area_struct *new_vma,
> > > unsigned long new_addr, unsigned long len,
> > > bool need_rmap_locks)
> > > {
> > > - unsigned long extent, next, old_end;
> > > + unsigned long extent, old_end;
> > > struct mmu_notifier_range range;
> > > pmd_t *old_pmd, *new_pmd;
> > >
> > > @@ -269,14 +422,27 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> > >
> > > for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
> > > cond_resched();
> > > - next = (old_addr + PMD_SIZE) & PMD_MASK;
> > > - /* even if next overflowed, extent below will be ok */
> > > - extent = next - old_addr;
> > > - if (extent > old_end - old_addr)
> > > - extent = old_end - old_addr;
> > > - next = (new_addr + PMD_SIZE) & PMD_MASK;
> > > - if (extent > next - new_addr)
> > > - extent = next - new_addr;
> > > +#ifdef CONFIG_HAVE_MOVE_PUD
> > > + /*
> > > + * If extent is PUD-sized try to speed up the move by moving at the
> > > + * PUD level if possible.
> > > + */
> > > + extent = get_extent(NORMAL_PUD, old_addr, old_end, new_addr);
> > > + if (extent == PUD_SIZE) {
> > > + pud_t *old_pud, *new_pud;
> > > +
> > > + old_pud = get_old_pud(vma->vm_mm, old_addr);
> > > + if (!old_pud)
> > > + continue;
> > > + new_pud = alloc_new_pud(vma->vm_mm, vma, new_addr);
> > > + if (!new_pud)
> > > + break;
> > > + if (move_pgt_entry(NORMAL_PUD, vma, old_addr, new_addr,
> > > + old_pud, new_pud, need_rmap_locks))
> > > + continue;
> > > + }
> > > +#endif
> > > + extent = get_extent(NORMAL_PMD, old_addr, old_end, new_addr);
> > > old_pmd = get_old_pmd(vma->vm_mm, old_addr);
> > > if (!old_pmd)
> > > continue;
> > > @@ -284,18 +450,10 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> > > if (!new_pmd)
> > > break;
> > > if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd) || pmd_devmap(*old_pmd)) {
> > > - if (extent == HPAGE_PMD_SIZE) {
> > > - bool moved;
> > > - /* See comment in move_ptes() */
> > > - if (need_rmap_locks)
> > > - take_rmap_locks(vma);
> > > - moved = move_huge_pmd(vma, old_addr, new_addr,
> > > - old_pmd, new_pmd);
> > > - if (need_rmap_locks)
> > > - drop_rmap_locks(vma);
> > > - if (moved)
> > > - continue;
> > > - }
> > > + if (extent == HPAGE_PMD_SIZE &&
> > > + move_pgt_entry(HPAGE_PMD, vma, old_addr, new_addr, old_pmd,
> > > + new_pmd, need_rmap_locks))
> > > + continue;
> > > split_huge_pmd(vma, old_pmd, old_addr);
> > > if (pmd_trans_unstable(old_pmd))
> > > continue;
> > > @@ -305,15 +463,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> > > * If the extent is PMD-sized, try to speed the move by
> > > * moving at the PMD level if possible.
> > > */
> > > - bool moved;
> > > -
> > > - if (need_rmap_locks)
> > > - take_rmap_locks(vma);
> > > - moved = move_normal_pmd(vma, old_addr, new_addr,
> > > - old_pmd, new_pmd);
> > > - if (need_rmap_locks)
> > > - drop_rmap_locks(vma);
> > > - if (moved)
> > > + if (move_pgt_entry(NORMAL_PMD, vma, old_addr, new_addr, old_pmd,
> > > + new_pmd, need_rmap_locks))
> > > continue;
> > > #endif
> > > }
> > > --
> > > 2.28.0.709.gb0816b6eb0-goog
> > >
> >
> > --
> > Kirill A. Shutemov
> >

2020-10-02 00:13:14

by Lokesh Gidra

[permalink] [raw]
Subject: Re: [PATCH 0/5] Speed up mremap on large regions

On Thu, Oct 1, 2020 at 9:00 AM Kalesh Singh <[email protected]> wrote:
>
> On Thu, Oct 1, 2020 at 8:27 AM Kirill A. Shutemov
> <[email protected]> wrote:
> >
> > On Wed, Sep 30, 2020 at 03:42:17PM -0700, Lokesh Gidra wrote:
> > > On Wed, Sep 30, 2020 at 3:32 PM Kirill A. Shutemov
> > > <[email protected]> wrote:
> > > >
> > > > On Wed, Sep 30, 2020 at 10:21:17PM +0000, Kalesh Singh wrote:
> > > > > mremap time can be optimized by moving entries at the PMD/PUD level if
> > > > > the source and destination addresses are PMD/PUD-aligned and
> > > > > PMD/PUD-sized. Enable moving at the PMD and PUD levels on arm64 and
> > > > > x86. Other architectures where this type of move is supported and known to
> > > > > be safe can also opt-in to these optimizations by enabling HAVE_MOVE_PMD
> > > > > and HAVE_MOVE_PUD.
> > > > >
> > > > > Observed Performance Improvements for remapping a PUD-aligned 1GB-sized
> > > > > region on x86 and arm64:
> > > > >
> > > > > - HAVE_MOVE_PMD is already enabled on x86 : N/A
> > > > > - Enabling HAVE_MOVE_PUD on x86 : ~13x speed up
> > > > >
> > > > > - Enabling HAVE_MOVE_PMD on arm64 : ~ 8x speed up
> > > > > - Enabling HAVE_MOVE_PUD on arm64 : ~19x speed up
> > > > >
> > > > > Altogether, HAVE_MOVE_PMD and HAVE_MOVE_PUD
> > > > > give a total of ~150x speed up on arm64.
> > > >
> > > > Is there a *real* workload that benefits from HAVE_MOVE_PUD?
> > > >
> > > We have a Java garbage collector under development which requires
> > > moving physical pages of a multi-gigabyte heap using mremap. During this
> > > move, the application threads have to be paused for correctness. It is
> > > critical to keep this pause as short as possible to avoid jitters
> > > during user interaction. This is where HAVE_MOVE_PUD will greatly
> > > help.
> >
> > Any chance to quantify the effect of mremap() with and without
> > HAVE_MOVE_PUD?
> >
> > I doubt it's a major contributor to the GC pause. I expect you need to
> > move tens of gigs to get a sizable effect. And if your GC routinely moves
> > tens of gigs, maybe the problem is somewhere else?
> >
> > I'm asking for numbers, because an increase in complexity comes with a cost.
> > If it doesn't provide a substantial benefit to a real workload,
> > maintaining the code forever doesn't make sense.
>
mremap is indeed the biggest contributor to the GC pause. It has to
take place in what is typically known as a 'stop-the-world' pause,
wherein all application threads are paused. During this pause the GC
thread flips the GC roots (threads' stacks, globals etc.), and then
resumes threads along with concurrent compaction of the heap. This
GC-root flip differs depending on which compaction algorithm is being
used.

In our case it involves updating object references in threads' stacks
and remapping the Java heap to a different location. The threads' stacks
can be handled in parallel with the mremap. Therefore, the dominant
factor is indeed the cost of mremap. From patches 2 and 4, it is clear
that remapping 1GB without this optimization will take ~9ms on arm64.

Although this mremap has to happen only once every GC cycle, and the
typical size is also not going to be more than a GB or 2, pausing
application threads for ~9ms is guaranteed to cause jitters. OTOH,
with this optimization, mremap is reduced to ~60us, which is a totally
acceptable pause time.
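
For concreteness, a minimal userspace sketch of the kind of PUD-aligned
mremap() call this series speeds up; this is not the GC's actual code,
and the reserve_aligned() helper and the 1GB size are purely
illustrative:

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

#define _1GB (1UL << 30)

/* Reserve an address range of 'size' bytes aligned to 'align'. */
static void *reserve_aligned(size_t size, size_t align)
{
	size_t len = size + align;
	char *p = mmap(NULL, len, PROT_NONE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	uintptr_t a;

	if (p == MAP_FAILED)
		return MAP_FAILED;
	a = ((uintptr_t)p + align - 1) & ~(align - 1);
	/* Trim the unaligned head and tail of the reservation. */
	if (a > (uintptr_t)p)
		munmap(p, a - (uintptr_t)p);
	munmap((void *)(a + size), (uintptr_t)p + len - (a + size));
	return (void *)a;
}

int main(void)
{
	void *src = reserve_aligned(_1GB, _1GB);
	void *dst = reserve_aligned(_1GB, _1GB);
	void *moved;

	if (src == MAP_FAILED || dst == MAP_FAILED)
		return 1;

	/* Make the source writable and fault in a page so there are page
	 * tables to move. */
	mprotect(src, _1GB, PROT_READ | PROT_WRITE);
	*(char *)src = 1;

	/* Source and destination are both 1GB (PUD) aligned, so the kernel
	 * can move entries at the PUD level when HAVE_MOVE_PUD is enabled. */
	moved = mremap(src, _1GB, _1GB, MREMAP_MAYMOVE | MREMAP_FIXED, dst);
	printf("mremap %s\n", moved == dst ? "succeeded" : "failed");
	return moved == dst ? 0 : 1;
}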

Unfortunately, implementation of the new GC algorithm hasn't yet
reached the point where I can quantify the effect of this
optimization. But I can confirm that without this optimization the new
GC will not be approved.


> Lokesh on this thread would be better able to answer this. I'll let
> him weigh in here.
> Thanks, Kalesh
> >
> > --
> > Kirill A. Shutemov
> >

2020-10-02 05:40:20

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 0/5] Speed up mremap on large regions

On Thu, Oct 01, 2020 at 05:09:02PM -0700, Lokesh Gidra wrote:
> On Thu, Oct 1, 2020 at 9:00 AM Kalesh Singh <[email protected]> wrote:
> >
> > On Thu, Oct 1, 2020 at 8:27 AM Kirill A. Shutemov
> > <[email protected]> wrote:
> > >
> > > On Wed, Sep 30, 2020 at 03:42:17PM -0700, Lokesh Gidra wrote:
> > > > On Wed, Sep 30, 2020 at 3:32 PM Kirill A. Shutemov
> > > > <[email protected]> wrote:
> > > > >
> > > > > On Wed, Sep 30, 2020 at 10:21:17PM +0000, Kalesh Singh wrote:
> > > > > > mremap time can be optimized by moving entries at the PMD/PUD level if
> > > > > > the source and destination addresses are PMD/PUD-aligned and
> > > > > > PMD/PUD-sized. Enable moving at the PMD and PUD levels on arm64 and
> > > > > > x86. Other architectures where this type of move is supported and known to
> > > > > > be safe can also opt-in to these optimizations by enabling HAVE_MOVE_PMD
> > > > > > and HAVE_MOVE_PUD.
> > > > > >
> > > > > > Observed Performance Improvements for remapping a PUD-aligned 1GB-sized
> > > > > > region on x86 and arm64:
> > > > > >
> > > > > > - HAVE_MOVE_PMD is already enabled on x86 : N/A
> > > > > > - Enabling HAVE_MOVE_PUD on x86 : ~13x speed up
> > > > > >
> > > > > > - Enabling HAVE_MOVE_PMD on arm64 : ~ 8x speed up
> > > > > > - Enabling HAVE_MOVE_PUD on arm64 : ~19x speed up
> > > > > >
> > > > > > Altogether, HAVE_MOVE_PMD and HAVE_MOVE_PUD
> > > > > > give a total of ~150x speed up on arm64.
> > > > >
> > > > > Is there a *real* workload that benefits from HAVE_MOVE_PUD?
> > > > >
> > > > We have a Java garbage collector under development which requires
> > > > moving physical pages of a multi-gigabyte heap using mremap. During this
> > > > move, the application threads have to be paused for correctness. It is
> > > > critical to keep this pause as short as possible to avoid jitters
> > > > during user interaction. This is where HAVE_MOVE_PUD will greatly
> > > > help.
> > >
> > > Any chance to quantify the effect of mremap() with and without
> > > HAVE_MOVE_PUD?
> > >
> > > I doubt it's a major contributor to the GC pause. I expect you need to
> > > move tens of gigs to get a sizable effect. And if your GC routinely moves
> > > tens of gigs, maybe the problem is somewhere else?
> > >
> > > I'm asking for numbers, because an increase in complexity comes with a cost.
> > > If it doesn't provide a substantial benefit to a real workload,
> > > maintaining the code forever doesn't make sense.
> >
> mremap is indeed the biggest contributor to the GC pause. It has to
> take place in what is typically known as a 'stop-the-world' pause,
> wherein all application threads are paused. During this pause the GC
> thread flips the GC roots (threads' stacks, globals etc.), and then
> resumes threads along with concurrent compaction of the heap. This
> GC-root flip differs depending on which compaction algorithm is being
> used.
>
> In our case it involves updating object references in threads' stacks
> and remapping the Java heap to a different location. The threads' stacks
> can be handled in parallel with the mremap. Therefore, the dominant
> factor is indeed the cost of mremap. From patches 2 and 4, it is clear
> that remapping 1GB without this optimization will take ~9ms on arm64.
>
> Although this mremap has to happen only once every GC cycle, and the
> typical size is also not going to be more than a GB or 2, pausing
> application threads for ~9ms is guaranteed to cause jitters. OTOH,
> with this optimization, mremap is reduced to ~60us, which is a totally
> acceptable pause time.
>
> Unfortunately, implementation of the new GC algorithm hasn't yet
> reached the point where I can quantify the effect of this
> optimization. But I can confirm that without this optimization the new
> GC will not be approved.

IIUC, the 9ms -> 90us improvement is attributed to the combination of
HAVE_MOVE_PMD and HAVE_MOVE_PUD, right? I expect HAVE_MOVE_PMD to be
reasonable for some workloads, but the marginal benefit of HAVE_MOVE_PUD
is in doubt. Do you see it being useful for your workload?

--
Kirill A. Shutemov

2020-10-02 06:43:53

by Lokesh Gidra

[permalink] [raw]
Subject: Re: [PATCH 0/5] Speed up mremap on large regions

On Thu, Oct 1, 2020 at 10:36 PM Kirill A. Shutemov
<[email protected]> wrote:
>
> On Thu, Oct 01, 2020 at 05:09:02PM -0700, Lokesh Gidra wrote:
> > On Thu, Oct 1, 2020 at 9:00 AM Kalesh Singh <[email protected]> wrote:
> > >
> > > On Thu, Oct 1, 2020 at 8:27 AM Kirill A. Shutemov
> > > <[email protected]> wrote:
> > > >
> > > > On Wed, Sep 30, 2020 at 03:42:17PM -0700, Lokesh Gidra wrote:
> > > > > On Wed, Sep 30, 2020 at 3:32 PM Kirill A. Shutemov
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > On Wed, Sep 30, 2020 at 10:21:17PM +0000, Kalesh Singh wrote:
> > > > > > > mremap time can be optimized by moving entries at the PMD/PUD level if
> > > > > > > the source and destination addresses are PMD/PUD-aligned and
> > > > > > > PMD/PUD-sized. Enable moving at the PMD and PUD levels on arm64 and
> > > > > > > x86. Other architectures where this type of move is supported and known to
> > > > > > > be safe can also opt-in to these optimizations by enabling HAVE_MOVE_PMD
> > > > > > > and HAVE_MOVE_PUD.
> > > > > > >
> > > > > > > Observed Performance Improvements for remapping a PUD-aligned 1GB-sized
> > > > > > > region on x86 and arm64:
> > > > > > >
> > > > > > > - HAVE_MOVE_PMD is already enabled on x86 : N/A
> > > > > > > - Enabling HAVE_MOVE_PUD on x86 : ~13x speed up
> > > > > > >
> > > > > > > - Enabling HAVE_MOVE_PMD on arm64 : ~ 8x speed up
> > > > > > > - Enabling HAVE_MOVE_PUD on arm64 : ~19x speed up
> > > > > > >
> > > > > > > Altogether, HAVE_MOVE_PMD and HAVE_MOVE_PUD
> > > > > > > give a total of ~150x speed up on arm64.
> > > > > >
> > > > > > Is there a *real* workload that benefits from HAVE_MOVE_PUD?
> > > > > >
> > > > > We have a Java garbage collector under development which requires
> > > > > moving physical pages of a multi-gigabyte heap using mremap. During this
> > > > > move, the application threads have to be paused for correctness. It is
> > > > > critical to keep this pause as short as possible to avoid jitters
> > > > > during user interaction. This is where HAVE_MOVE_PUD will greatly
> > > > > help.
> > > >
> > > > Any chance to quantify the effect of mremap() with and without
> > > > HAVE_MOVE_PUD?
> > > >
> > > > I doubt it's a major contributor to the GC pause. I expect you need to
> > > > move tens of gigs to get a sizable effect. And if your GC routinely moves
> > > > tens of gigs, maybe the problem is somewhere else?
> > > >
> > > > I'm asking for numbers, because an increase in complexity comes with a cost.
> > > > If it doesn't provide a substantial benefit to a real workload,
> > > > maintaining the code forever doesn't make sense.
> > >
> > mremap is indeed the biggest contributor to the GC pause. It has to
> > take place in what is typically known as a 'stop-the-world' pause,
> > wherein all application threads are paused. During this pause the GC
> > thread flips the GC roots (threads' stacks, globals etc.), and then
> > resumes threads along with concurrent compaction of the heap. This
> > GC-root flip differs depending on which compaction algorithm is being
> > used.
> >
> > In our case it involves updating object references in threads' stacks
> > and remapping the Java heap to a different location. The threads' stacks
> > can be handled in parallel with the mremap. Therefore, the dominant
> > factor is indeed the cost of mremap. From patches 2 and 4, it is clear
> > that remapping 1GB without this optimization will take ~9ms on arm64.
> >
> > Although this mremap has to happen only once every GC cycle, and the
> > typical size is also not going to be more than a GB or 2, pausing
> > application threads for ~9ms is guaranteed to cause jitters. OTOH,
> > with this optimization, mremap is reduced to ~60us, which is a totally
> > acceptable pause time.
> >
> > Unfortunately, implementation of the new GC algorithm hasn't yet
> > reached the point where I can quantify the effect of this
> > optimization. But I can confirm that without this optimization the new
> > GC will not be approved.
>
> IIUC, the 9ms -> 90us improvement is attributed to the combination of
> HAVE_MOVE_PMD and HAVE_MOVE_PUD, right? I expect HAVE_MOVE_PMD to be
> reasonable for some workloads, but the marginal benefit of HAVE_MOVE_PUD
> is in doubt. Do you see it being useful for your workload?
>
Yes, 9ms -> 90us is when both are combined. Past experience has been
that even a ~1ms-long stop-the-world pause is prone to cause jitters.
HAVE_MOVE_PMD only takes us that far. So HAVE_MOVE_PUD is required to
bring the mremap cost down to an acceptable level.

Ideally, I was hoping that the functionality of HAVE_MOVE_PMD could be
extended to all levels of the hierarchical page table, simplifying the
implementation in the process. But unfortunately, from patch 3 that
doesn't seem to be possible.

> --
> Kirill A. Shutemov