This is version 2 of the mremap speed up patches previously posted at:
https://lore.kernel.org/r/[email protected]
mremap time can be optimized by moving entries at the PMD/PUD level if
the source and destination addresses are PMD/PUD-aligned and
PMD/PUD-sized. Enable moving at the PMD and PUD levels on arm64 and
x86. Other architectures where this type of move is supported and known to
be safe can also opt-in to these optimizations by enabling HAVE_MOVE_PMD
and HAVE_MOVE_PUD.
Observed Performance Improvements for remapping a PUD-aligned 1GB-sized
region on x86 and arm64:
- HAVE_MOVE_PMD is already enabled on x86 : N/A
- Enabling HAVE_MOVE_PUD on x86 : ~13x speed up
- Enabling HAVE_MOVE_PMD on arm64 : ~ 8x speed up
- Enabling HAVE_MOVE_PUD on arm64 : ~19x speed up
Altogether, HAVE_MOVE_PMD and HAVE_MOVE_PUD give a total of ~150x speed
up on arm64 (the ~8x PMD-level and ~19x PUD-level improvements compound).
Changes in v2:
- Reduce mremap_test time by only validating a configurable
threshold of the remapped region, as per John.
- Use a random pattern for mremap validation. Provide pattern
seed in test output, as per John.
- Moved set_pud_at() to separate patch, per Kirill.
- Use switch() instead of ifs in move_pgt_entry(), per Kirill.
- Update commit message with description of Android
garbage collector use case for HAVE_MOVE_PUD, as per Joel.
- Fix build test error reported by kernel test robot in [1].
[1] https://lists.01.org/hyperkitty/list/[email protected]/thread/CKPGL4FH4NG7TGH2CVYX2UX76L25BTA3/
Kalesh Singh (6):
kselftests: vm: Add mremap tests
arm64: mremap speedup - Enable HAVE_MOVE_PMD
mm: Speedup mremap on 1GB or larger regions
arm64: Add set_pud_at() functions
arm64: mremap speedup - Enable HAVE_MOVE_PUD
x86: mremap speedup - Enable HAVE_MOVE_PUD
arch/Kconfig | 7 +
arch/arm64/Kconfig | 2 +
arch/arm64/include/asm/pgtable.h | 1 +
arch/x86/Kconfig | 1 +
mm/mremap.c | 220 +++++++++++++--
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 1 +
tools/testing/selftests/vm/mremap_test.c | 333 +++++++++++++++++++++++
tools/testing/selftests/vm/run_vmtests | 11 +
9 files changed, 547 insertions(+), 30 deletions(-)
create mode 100644 tools/testing/selftests/vm/mremap_test.c
base-commit: 472e5b056f000a778abb41f1e443de58eb259783
--
2.28.0.806.g8561365e88-goog
HAVE_MOVE_PMD enables remapping pages at the PMD level if both the
source and destination addresses are PMD-aligned.
HAVE_MOVE_PMD is already enabled on x86. The original patch [1] that
introduced this config did not enable it on arm64 at the time because
of performance issues with flushing the TLB on every PMD move. These
issues have since been addressed in more recent releases with
improvements to the arm64 TLB invalidation and core mmu_gather code as
Will Deacon mentioned in [2].
From the data below, it can be inferred that there is approximately
8x improvement in performance when HAVE_MOVE_PMD is enabled on arm64.
--------- Test Results ----------
The following results were obtained on an arm64 device running a 5.4
kernel, by remapping a PMD-aligned, 1GB sized region to a PMD-aligned
destination. The results from 10 iterations of the test are given below.
All times are in nanoseconds.
Control HAVE_MOVE_PMD
9220833 1247761
9002552 1219896
9254115 1094792
8725885 1227760
9308646 1043698
9001667 1101771
8793385 1159896
8774636 1143594
9553125 1025833
9374010 1078125
9100885.4 1134312.6 <-- Mean Time in nanoseconds
Total mremap time for a 1GB sized PMD-aligned region drops from
~9.1 milliseconds to ~1.1 milliseconds. (~8x speedup).
[1] https://lore.kernel.org/r/[email protected]
[2] https://www.mail-archive.com/[email protected]/msg140837.html
Signed-off-by: Kalesh Singh <[email protected]>
---
arch/arm64/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 6d232837cbee..844d089668e3 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -121,6 +121,7 @@ config ARM64
select GENERIC_VDSO_TIME_NS
select HANDLE_DOMAIN_IRQ
select HARDIRQS_SW_RESEND
+ select HAVE_MOVE_PMD
select HAVE_PCI
select HAVE_ACPI_APEI if (ACPI && EFI)
select HAVE_ALIGNED_STRUCT_PAGE if SLUB
--
2.28.0.806.g8561365e88-goog
Android needs to move large memory regions for garbage collection.
The GC requires moving physical pages of multi-gigabyte heap
using mremap. During this move, the application threads have to
be paused for correctness. It is critical to keep this pause as
short as possible to avoid jitters during user interaction.
Optimize mremap for >= 1GB-sized regions by moving at the PUD/PGD
level if the source and destination addresses are PUD-aligned.
For CONFIG_PGTABLE_LEVELS == 3, moving at the PUD level in effect moves
PGD entries, since the PUD entry is “folded back” onto the PGD entry.
Add HAVE_MOVE_PUD so that architectures where moving at the PUD level
isn't supported/tested can turn this off by not selecting the config.
Fix build test error from v1 of this series reported by
kernel test robot in [1].
[1] https://lists.01.org/hyperkitty/list/[email protected]/thread/CKPGL4FH4NG7TGH2CVYX2UX76L25BTA3/
Signed-off-by: Kalesh Singh <[email protected]>
Reported-by: kernel test robot <[email protected]>
---
Changes in v2:
- Update commit message with description of Android GC's use case.
- Move set_pud_at() to a separate patch.
- Use switch() instead of ifs in move_pgt_entry()
- Fix build test error reported by kernel test robot on x86_64 in [1].
Guard move_huge_pmd() with IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE),
since this section doesn't get optimized out in the kernel test
robot's build test when HAVE_MOVE_PUD is enabled.
- Keep WARN_ON_ONCE(1) instead of BUILD_BUG() for the aforementioned
reason.
arch/Kconfig | 7 ++
mm/mremap.c | 220 ++++++++++++++++++++++++++++++++++++++++++++-------
2 files changed, 197 insertions(+), 30 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index af14a567b493..5eabaa00bf9b 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -602,6 +602,13 @@ config HAVE_IRQ_TIME_ACCOUNTING
Archs need to ensure they use a high enough resolution clock to
support irq time accounting and then call enable_sched_clock_irqtime().
+config HAVE_MOVE_PUD
+ bool
+ help
+ Architectures that select this are able to move page tables at the
+ PUD level. If there are only 3 page table levels, the move effectively
+ happens at the PGD level.
+
config HAVE_MOVE_PMD
bool
help
diff --git a/mm/mremap.c b/mm/mremap.c
index 138abbae4f75..c1d6ab667d70 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -249,14 +249,176 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
return true;
}
+#else
+static inline bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
+ unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd)
+{
+ return false;
+}
#endif
+#ifdef CONFIG_HAVE_MOVE_PUD
+static pud_t *get_old_pud(struct mm_struct *mm, unsigned long addr)
+{
+ pgd_t *pgd;
+ p4d_t *p4d;
+ pud_t *pud;
+
+ pgd = pgd_offset(mm, addr);
+ if (pgd_none_or_clear_bad(pgd))
+ return NULL;
+
+ p4d = p4d_offset(pgd, addr);
+ if (p4d_none_or_clear_bad(p4d))
+ return NULL;
+
+ pud = pud_offset(p4d, addr);
+ if (pud_none_or_clear_bad(pud))
+ return NULL;
+
+ return pud;
+}
+
+static pud_t *alloc_new_pud(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ pgd_t *pgd;
+ p4d_t *p4d;
+ pud_t *pud;
+
+ pgd = pgd_offset(mm, addr);
+ p4d = p4d_alloc(mm, pgd, addr);
+ if (!p4d)
+ return NULL;
+ pud = pud_alloc(mm, p4d, addr);
+ if (!pud)
+ return NULL;
+
+ return pud;
+}
+
+static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
+ unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
+{
+ spinlock_t *old_ptl, *new_ptl;
+ struct mm_struct *mm = vma->vm_mm;
+ pud_t pud;
+
+ /*
+ * The destination pud shouldn't be established, free_pgtables()
+ * should have released it.
+ */
+ if (WARN_ON_ONCE(!pud_none(*new_pud)))
+ return false;
+
+ /*
+ * We don't have to worry about the ordering of src and dst
+ * ptlocks because exclusive mmap_lock prevents deadlock.
+ */
+ old_ptl = pud_lock(vma->vm_mm, old_pud);
+ new_ptl = pud_lockptr(mm, new_pud);
+ if (new_ptl != old_ptl)
+ spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
+
+ /* Clear the pud */
+ pud = *old_pud;
+ pud_clear(old_pud);
+
+ VM_BUG_ON(!pud_none(*new_pud));
+
+ /* Set the new pud */
+ set_pud_at(mm, new_addr, new_pud, pud);
+ flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE);
+ if (new_ptl != old_ptl)
+ spin_unlock(new_ptl);
+ spin_unlock(old_ptl);
+
+ return true;
+}
+#else
+static inline bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
+ unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
+{
+ return false;
+}
+#endif
+
+enum pgt_entry {
+ NORMAL_PMD,
+ HPAGE_PMD,
+ NORMAL_PUD,
+};
+
+/*
+ * Returns an extent of the corresponding size for the pgt_entry specified if valid.
+ * Else returns a smaller extent bounded by the end of the source and destination
+ * pgt_entry. Returns 0 if an invalid pgt_entry is specified.
+ */
+static unsigned long get_extent(enum pgt_entry entry, unsigned long old_addr,
+ unsigned long old_end, unsigned long new_addr)
+{
+ unsigned long next, extent, mask, size;
+
+ if (entry == NORMAL_PMD || entry == HPAGE_PMD) {
+ mask = PMD_MASK;
+ size = PMD_SIZE;
+ } else if (entry == NORMAL_PUD) {
+ mask = PUD_MASK;
+ size = PUD_SIZE;
+ } else
+ return 0;
+
+ next = (old_addr + size) & mask;
+ /* even if next overflowed, extent below will be ok */
+ extent = (next > old_end) ? old_end - old_addr : next - old_addr;
+ next = (new_addr + size) & mask;
+ if (extent > next - new_addr)
+ extent = next - new_addr;
+ return extent;
+}
+
+/*
+ * Attempts to speedup the move by moving entry at the level corresponding to
+ * pgt_entry. Returns true if the move was successful, else false.
+ */
+static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma,
+ unsigned long old_addr, unsigned long new_addr, void *old_entry,
+ void *new_entry, bool need_rmap_locks)
+{
+ bool moved = false;
+
+ /* See comment in move_ptes() */
+ if (need_rmap_locks)
+ take_rmap_locks(vma);
+
+ switch (entry) {
+ case NORMAL_PMD:
+ moved = move_normal_pmd(vma, old_addr, new_addr, old_entry, new_entry);
+ break;
+ case NORMAL_PUD:
+ moved = move_normal_pud(vma, old_addr, new_addr, old_entry, new_entry);
+ break;
+ case HPAGE_PMD:
+ moved = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+ move_huge_pmd(vma, old_addr, new_addr, old_entry, new_entry);
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ break;
+ }
+
+ if (need_rmap_locks)
+ drop_rmap_locks(vma);
+
+ return moved;
+}
+
unsigned long move_page_tables(struct vm_area_struct *vma,
unsigned long old_addr, struct vm_area_struct *new_vma,
unsigned long new_addr, unsigned long len,
bool need_rmap_locks)
{
- unsigned long extent, next, old_end;
+ unsigned long extent, old_end;
struct mmu_notifier_range range;
pmd_t *old_pmd, *new_pmd;
@@ -269,14 +431,27 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
cond_resched();
- next = (old_addr + PMD_SIZE) & PMD_MASK;
- /* even if next overflowed, extent below will be ok */
- extent = next - old_addr;
- if (extent > old_end - old_addr)
- extent = old_end - old_addr;
- next = (new_addr + PMD_SIZE) & PMD_MASK;
- if (extent > next - new_addr)
- extent = next - new_addr;
+#ifdef CONFIG_HAVE_MOVE_PUD
+ /*
+ * If extent is PUD-sized try to speed up the move by moving at the
+ * PUD level if possible.
+ */
+ extent = get_extent(NORMAL_PUD, old_addr, old_end, new_addr);
+ if (extent == PUD_SIZE) {
+ pud_t *old_pud, *new_pud;
+
+ old_pud = get_old_pud(vma->vm_mm, old_addr);
+ if (!old_pud)
+ continue;
+ new_pud = alloc_new_pud(vma->vm_mm, vma, new_addr);
+ if (!new_pud)
+ break;
+ if (move_pgt_entry(NORMAL_PUD, vma, old_addr, new_addr,
+ old_pud, new_pud, need_rmap_locks))
+ continue;
+ }
+#endif
+ extent = get_extent(NORMAL_PMD, old_addr, old_end, new_addr);
old_pmd = get_old_pmd(vma->vm_mm, old_addr);
if (!old_pmd)
continue;
@@ -284,18 +459,10 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
if (!new_pmd)
break;
if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd) || pmd_devmap(*old_pmd)) {
- if (extent == HPAGE_PMD_SIZE) {
- bool moved;
- /* See comment in move_ptes() */
- if (need_rmap_locks)
- take_rmap_locks(vma);
- moved = move_huge_pmd(vma, old_addr, new_addr,
- old_pmd, new_pmd);
- if (need_rmap_locks)
- drop_rmap_locks(vma);
- if (moved)
- continue;
- }
+ if (extent == HPAGE_PMD_SIZE &&
+ move_pgt_entry(HPAGE_PMD, vma, old_addr, new_addr, old_pmd,
+ new_pmd, need_rmap_locks))
+ continue;
split_huge_pmd(vma, old_pmd, old_addr);
if (pmd_trans_unstable(old_pmd))
continue;
@@ -305,15 +472,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
* If the extent is PMD-sized, try to speed the move by
* moving at the PMD level if possible.
*/
- bool moved;
-
- if (need_rmap_locks)
- take_rmap_locks(vma);
- moved = move_normal_pmd(vma, old_addr, new_addr,
- old_pmd, new_pmd);
- if (need_rmap_locks)
- drop_rmap_locks(vma);
- if (moved)
+ if (move_pgt_entry(NORMAL_PMD, vma, old_addr, new_addr, old_pmd,
+ new_pmd, need_rmap_locks))
continue;
#endif
}
--
2.28.0.806.g8561365e88-goog
set_pud_at() is used in move_normal_pud() for remapping
pages at the PUD level.
Signed-off-by: Kalesh Singh <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index d5d3fbe73953..8848125e3024 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -415,6 +415,7 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd)
#define pfn_pud(pfn,prot) __pud(__phys_to_pud_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))
#define set_pmd_at(mm, addr, pmdp, pmd) set_pte_at(mm, addr, (pte_t *)pmdp, pmd_pte(pmd))
+#define set_pud_at(mm, addr, pudp, pud) set_pte_at(mm, addr, (pte_t *)pudp, pud_pte(pud))
#define __p4d_to_phys(p4d) __pte_to_phys(p4d_pte(p4d))
#define __phys_to_p4d_val(phys) __phys_to_pte_val(phys)
--
2.28.0.806.g8561365e88-goog
HAVE_MOVE_PUD enables remapping pages at the PUD level if both the
source and destination addresses are PUD-aligned.
From the data below, it can be inferred that enabling HAVE_MOVE_PUD
gives approximately a 19x improvement in performance on arm64.
------- Test Results ---------
The following results were obtained using a 5.4 kernel, by remapping
a PUD-aligned, 1GB sized region to a PUD-aligned destination.
The results from 10 iterations of the test are given below:
Total mremap times for 1GB data on arm64. All times are in nanoseconds.
Control HAVE_MOVE_PUD
1247761 74271
1219896 46771
1094792 59687
1227760 48385
1043698 76666
1101771 50365
1159896 52500
1143594 75261
1025833 61354
1078125 48697
1134312.6 59395.7 <-- Mean time in nanoseconds
A 1GB mremap completion time drops from ~1.1 milliseconds
to ~59 microseconds on arm64. (~19x speed up).
Signed-off-by: Kalesh Singh <[email protected]>
---
arch/arm64/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 844d089668e3..4d521f0a5863 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -122,6 +122,7 @@ config ARM64
select HANDLE_DOMAIN_IRQ
select HARDIRQS_SW_RESEND
select HAVE_MOVE_PMD
+ select HAVE_MOVE_PUD
select HAVE_PCI
select HAVE_ACPI_APEI if (ACPI && EFI)
select HAVE_ALIGNED_STRUCT_PAGE if SLUB
--
2.28.0.806.g8561365e88-goog
HAVE_MOVE_PUD enables remapping pages at the PUD level if both the
source and destination addresses are PUD-aligned.
From the data below, it can be inferred that enabling HAVE_MOVE_PUD
gives approximately a 13x improvement in performance on x86.
------- Test Results ---------
The following results were obtained using a 5.4 kernel, by remapping
a PUD-aligned, 1GB sized region to a PUD-aligned destination.
The results from 10 iterations of the test are given below:
Total mremap times for 1GB data on x86. All times are in nanoseconds.
Control HAVE_MOVE_PUD
180394 15089
235728 14056
238931 25741
187330 13838
241742 14187
177925 14778
182758 14728
160872 14418
205813 15107
245722 13998
205721.5 15594 <-- Mean time in nanoseconds
A 1GB mremap completion time drops from ~205 microseconds
to ~15 microseconds on x86. (~13x speed up).
Signed-off-by: Kalesh Singh <[email protected]>
---
arch/x86/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7101ac64bb20..ff6e2755cab8 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -198,6 +198,7 @@ config X86
select HAVE_MIXED_BREAKPOINTS_REGS
select HAVE_MOD_ARCH_SPECIFIC
select HAVE_MOVE_PMD
+ select HAVE_MOVE_PUD
select HAVE_NMI
select HAVE_OPROFILE
select HAVE_OPTPROBES
--
2.28.0.806.g8561365e88-goog
On Fri, Oct 02, 2020 at 04:20:49PM +0000, Kalesh Singh wrote:
> set_pud_at() is used in move_normal_pud() for remapping
> pages at the PUD level.
>
> Signed-off-by: Kalesh Singh <[email protected]>
> ---
> arch/arm64/include/asm/pgtable.h | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index d5d3fbe73953..8848125e3024 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -415,6 +415,7 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd)
> #define pfn_pud(pfn,prot) __pud(__phys_to_pud_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))
>
> #define set_pmd_at(mm, addr, pmdp, pmd) set_pte_at(mm, addr, (pte_t *)pmdp, pmd_pte(pmd))
> +#define set_pud_at(mm, addr, pudp, pud) set_pte_at(mm, addr, (pte_t *)pudp, pud_pte(pud))
>
> #define __p4d_to_phys(p4d) __pte_to_phys(p4d_pte(p4d))
> #define __phys_to_p4d_val(phys) __phys_to_pte_val(phys)
Just fold it into the next patch.
--
Kirill A. Shutemov
On Fri, Oct 02, 2020 at 04:20:48PM +0000, Kalesh Singh wrote:
> Android needs to move large memory regions for garbage collection.
> The GC requires moving physical pages of multi-gigabyte heap
> using mremap. During this move, the application threads have to
> be paused for correctness. It is critical to keep this pause as
> short as possible to avoid jitters during user interaction.
>
> Optimize mremap for >= 1GB-sized regions by moving at the PUD/PGD
> level if the source and destination addresses are PUD-aligned.
> For CONFIG_PGTABLE_LEVELS == 3, moving at the PUD level in effect moves
> PGD entries, since the PUD entry is “folded back” onto the PGD entry.
> Add HAVE_MOVE_PUD so that architectures where moving at the PUD level
> isn't supported/tested can turn this off by not selecting the config.
>
> Fix build test error from v1 of this series reported by
> kernel test robot in [1].
>
> [1] https://lists.01.org/hyperkitty/list/[email protected]/thread/CKPGL4FH4NG7TGH2CVYX2UX76L25BTA3/
>
> Signed-off-by: Kalesh Singh <[email protected]>
> Reported-by: kernel test robot <[email protected]>
> ---
> Changes in v2:
> - Update commit message with description of Android GC's use case.
> - Move set_pud_at() to a separate patch.
> - Use switch() instead of ifs in move_pgt_entry()
> - Fix build test error reported by kernel test robot on x86_64 in [1].
> Guard move_huge_pmd() with IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE),
> since this section doesn't get optimized out in the kernel test
> robot's build test when HAVE_MOVE_PUD is enabled.
> - Keep WARN_ON_ONCE(1) instead of BUILD_BUG() for the aforementioned
> reason.
Okay, but IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) on the caller side would
do the trick, I believe.
>
> arch/Kconfig | 7 ++
> mm/mremap.c | 220 ++++++++++++++++++++++++++++++++++++++++++++-------
> 2 files changed, 197 insertions(+), 30 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index af14a567b493..5eabaa00bf9b 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -602,6 +602,13 @@ config HAVE_IRQ_TIME_ACCOUNTING
> Archs need to ensure they use a high enough resolution clock to
> support irq time accounting and then call enable_sched_clock_irqtime().
>
> +config HAVE_MOVE_PUD
> + bool
> + help
> + Architectures that select this are able to move page tables at the
> + PUD level. If there are only 3 page table levels, the move effectively
> + happens at the PGD level.
> +
> config HAVE_MOVE_PMD
> bool
> help
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 138abbae4f75..c1d6ab667d70 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -249,14 +249,176 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>
> return true;
> }
> +#else
> +static inline bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> + unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd)
> +{
> + return false;
> +}
> #endif
>
> +#ifdef CONFIG_HAVE_MOVE_PUD
> +static pud_t *get_old_pud(struct mm_struct *mm, unsigned long addr)
> +{
> + pgd_t *pgd;
> + p4d_t *p4d;
> + pud_t *pud;
> +
> + pgd = pgd_offset(mm, addr);
> + if (pgd_none_or_clear_bad(pgd))
> + return NULL;
> +
> + p4d = p4d_offset(pgd, addr);
> + if (p4d_none_or_clear_bad(p4d))
> + return NULL;
> +
> + pud = pud_offset(p4d, addr);
> + if (pud_none_or_clear_bad(pud))
> + return NULL;
> +
> + return pud;
> +}
> +
> +static pud_t *alloc_new_pud(struct mm_struct *mm, struct vm_area_struct *vma,
> + unsigned long addr)
> +{
> + pgd_t *pgd;
> + p4d_t *p4d;
> + pud_t *pud;
> +
> + pgd = pgd_offset(mm, addr);
> + p4d = p4d_alloc(mm, pgd, addr);
> + if (!p4d)
> + return NULL;
> + pud = pud_alloc(mm, p4d, addr);
> + if (!pud)
> + return NULL;
> +
> + return pud;
> +}
Looks like a code duplication.
Could you move these two helpers out of #ifdef CONFIG_HAVE_MOVE_PUD and
make get_old_pmd() and alloc_new_pmd() use them?
> +
> +static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
> + unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
> +{
> + spinlock_t *old_ptl, *new_ptl;
> + struct mm_struct *mm = vma->vm_mm;
> + pud_t pud;
> +
> + /*
> + * The destination pud shouldn't be established, free_pgtables()
> + * should have released it.
> + */
> + if (WARN_ON_ONCE(!pud_none(*new_pud)))
> + return false;
> +
> + /*
> + * We don't have to worry about the ordering of src and dst
> + * ptlocks because exclusive mmap_lock prevents deadlock.
> + */
> + old_ptl = pud_lock(vma->vm_mm, old_pud);
> + new_ptl = pud_lockptr(mm, new_pud);
> + if (new_ptl != old_ptl)
> + spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> +
> + /* Clear the pud */
> + pud = *old_pud;
> + pud_clear(old_pud);
> +
> + VM_BUG_ON(!pud_none(*new_pud));
> +
> + /* Set the new pud */
> + set_pud_at(mm, new_addr, new_pud, pud);
> + flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE);
> + if (new_ptl != old_ptl)
> + spin_unlock(new_ptl);
> + spin_unlock(old_ptl);
> +
> + return true;
> +}
> +#else
> +static inline bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
> + unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
> +{
> + return false;
> +}
> +#endif
> +
> +enum pgt_entry {
> + NORMAL_PMD,
> + HPAGE_PMD,
> + NORMAL_PUD,
> +};
> +
> +/*
> + * Returns an extent of the corresponding size for the pgt_entry specified if valid.
> + * Else returns a smaller extent bounded by the end of the source and destination
> + * pgt_entry. Returns 0 if an invalid pgt_entry is specified.
> + */
> +static unsigned long get_extent(enum pgt_entry entry, unsigned long old_addr,
> + unsigned long old_end, unsigned long new_addr)
> +{
> + unsigned long next, extent, mask, size;
> +
> + if (entry == NORMAL_PMD || entry == HPAGE_PMD) {
> + mask = PMD_MASK;
> + size = PMD_SIZE;
> + } else if (entry == NORMAL_PUD) {
> + mask = PUD_MASK;
> + size = PUD_SIZE;
> + } else
> + return 0;
Em. Who would ever specify invalid pgt_entry? It's bug.
Again, switch()?
> +
> + next = (old_addr + size) & mask;
> + /* even if next overflowed, extent below will be ok */
> + extent = (next > old_end) ? old_end - old_addr : next - old_addr;
> + next = (new_addr + size) & mask;
> + if (extent > next - new_addr)
> + extent = next - new_addr;
> + return extent;
> +}
> +
> +/*
> + * Attempts to speedup the move by moving entry at the level corresponding to
> + * pgt_entry. Returns true if the move was successful, else false.
> + */
> +static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma,
> + unsigned long old_addr, unsigned long new_addr, void *old_entry,
> + void *new_entry, bool need_rmap_locks)
> +{
> + bool moved = false;
> +
> + /* See comment in move_ptes() */
> + if (need_rmap_locks)
> + take_rmap_locks(vma);
> +
> + switch (entry) {
> + case NORMAL_PMD:
> + moved = move_normal_pmd(vma, old_addr, new_addr, old_entry, new_entry);
Nit: here and below, double space after '='. Why?
> + break;
> + case NORMAL_PUD:
> + moved = move_normal_pud(vma, old_addr, new_addr, old_entry, new_entry);
> + break;
> + case HPAGE_PMD:
> + moved = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
> + move_huge_pmd(vma, old_addr, new_addr, old_entry, new_entry);
> + break;
> + default:
> + WARN_ON_ONCE(1);
> + break;
> + }
> +
> + if (need_rmap_locks)
> + drop_rmap_locks(vma);
> +
> + return moved;
> +}
> +
> unsigned long move_page_tables(struct vm_area_struct *vma,
> unsigned long old_addr, struct vm_area_struct *new_vma,
> unsigned long new_addr, unsigned long len,
> bool need_rmap_locks)
> {
> - unsigned long extent, next, old_end;
> + unsigned long extent, old_end;
> struct mmu_notifier_range range;
> pmd_t *old_pmd, *new_pmd;
>
> @@ -269,14 +431,27 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>
> for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
> cond_resched();
> - next = (old_addr + PMD_SIZE) & PMD_MASK;
> - /* even if next overflowed, extent below will be ok */
> - extent = next - old_addr;
> - if (extent > old_end - old_addr)
> - extent = old_end - old_addr;
> - next = (new_addr + PMD_SIZE) & PMD_MASK;
> - if (extent > next - new_addr)
> - extent = next - new_addr;
> +#ifdef CONFIG_HAVE_MOVE_PUD
Any chance if (IS_ENABLED(CONFIG_HAVE_MOVE_PUD)) would work here?
> + /*
> + * If extent is PUD-sized try to speed up the move by moving at the
> + * PUD level if possible.
> + */
> + extent = get_extent(NORMAL_PUD, old_addr, old_end, new_addr);
> + if (extent == PUD_SIZE) {
> + pud_t *old_pud, *new_pud;
> +
> + old_pud = get_old_pud(vma->vm_mm, old_addr);
> + if (!old_pud)
> + continue;
> + new_pud = alloc_new_pud(vma->vm_mm, vma, new_addr);
> + if (!new_pud)
> + break;
> + if (move_pgt_entry(NORMAL_PUD, vma, old_addr, new_addr,
> + old_pud, new_pud, need_rmap_locks))
> + continue;
> + }
> +#endif
> + extent = get_extent(NORMAL_PMD, old_addr, old_end, new_addr);
> old_pmd = get_old_pmd(vma->vm_mm, old_addr);
> if (!old_pmd)
> continue;
> @@ -284,18 +459,10 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> if (!new_pmd)
> break;
> if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd) || pmd_devmap(*old_pmd)) {
> - if (extent == HPAGE_PMD_SIZE) {
> - bool moved;
> - /* See comment in move_ptes() */
> - if (need_rmap_locks)
> - take_rmap_locks(vma);
> - moved = move_huge_pmd(vma, old_addr, new_addr,
> - old_pmd, new_pmd);
> - if (need_rmap_locks)
> - drop_rmap_locks(vma);
> - if (moved)
> - continue;
> - }
> + if (extent == HPAGE_PMD_SIZE &&
> + move_pgt_entry(HPAGE_PMD, vma, old_addr, new_addr, old_pmd,
> + new_pmd, need_rmap_locks))
> + continue;
> split_huge_pmd(vma, old_pmd, old_addr);
> if (pmd_trans_unstable(old_pmd))
> continue;
> @@ -305,15 +472,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> * If the extent is PMD-sized, try to speed the move by
> * moving at the PMD level if possible.
> */
> - bool moved;
> -
> - if (need_rmap_locks)
> - take_rmap_locks(vma);
> - moved = move_normal_pmd(vma, old_addr, new_addr,
> - old_pmd, new_pmd);
> - if (need_rmap_locks)
> - drop_rmap_locks(vma);
> - if (moved)
> + if (move_pgt_entry(NORMAL_PMD, vma, old_addr, new_addr, old_pmd,
> + new_pmd, need_rmap_locks))
> continue;
> #endif
> }
> --
> 2.28.0.806.g8561365e88-goog
>
--
Kirill A. Shutemov
Hi Kirill, thank you for the feedback.
On Fri, Oct 2, 2020 at 12:51 PM Kirill A. Shutemov
<[email protected]> wrote:
>
> On Fri, Oct 02, 2020 at 04:20:48PM +0000, Kalesh Singh wrote:
> > Android needs to move large memory regions for garbage collection.
> > The GC requires moving physical pages of multi-gigabyte heap
> > using mremap. During this move, the application threads have to
> > be paused for correctness. It is critical to keep this pause as
> > short as possible to avoid jitters during user interaction.
> >
> > Optimize mremap for >= 1GB-sized regions by moving at the PUD/PGD
> > level if the source and destination addresses are PUD-aligned.
> > For CONFIG_PGTABLE_LEVELS == 3, moving at the PUD level in effect moves
> > PGD entries, since the PUD entry is “folded back” onto the PGD entry.
> > Add HAVE_MOVE_PUD so that architectures where moving at the PUD level
> > isn't supported/tested can turn this off by not selecting the config.
> >
> > Fix build test error from v1 of this series reported by
> > kernel test robot in [1].
> >
> > [1] https://lists.01.org/hyperkitty/list/[email protected]/thread/CKPGL4FH4NG7TGH2CVYX2UX76L25BTA3/
> >
> > Signed-off-by: Kalesh Singh <[email protected]>
> > Reported-by: kernel test robot <[email protected]>
> > ---
> > Changes in v2:
> > - Update commit message with description of Android GC's use case.
> > - Move set_pud_at() to a separate patch.
> > - Use switch() instead of ifs in move_pgt_entry()
> > - Fix build test error reported by kernel test robot on x86_64 in [1].
> > Guard move_huge_pmd() with IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE),
> > since this section doesn't get optimized out in the kernel test
> > robot's build test when HAVE_MOVE_PUD is enabled.
> > - Keep WARN_ON_ONCE(1) instead of BUILD_BUG() for the aforementioned
> > reason.
>
> Okay, but IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) on the caller side would
> do the trick, I believe.
I tried moving this to the caller side in move_page_tables(),
- if (extent == HPAGE_PMD_SIZE &&
+ if (extent == HPAGE_PMD_SIZE &&
IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
but it produces the same error as reported by kernel test robot:
ld.lld: error: undefined symbol: move_huge_pmd
I'm not sure why these are different but the kernel test robot
compiler complains.
>
> >
> > arch/Kconfig | 7 ++
> > mm/mremap.c | 220 ++++++++++++++++++++++++++++++++++++++++++++-------
> > 2 files changed, 197 insertions(+), 30 deletions(-)
> >
> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index af14a567b493..5eabaa00bf9b 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -602,6 +602,13 @@ config HAVE_IRQ_TIME_ACCOUNTING
> > Archs need to ensure they use a high enough resolution clock to
> > support irq time accounting and then call enable_sched_clock_irqtime().
> >
> > +config HAVE_MOVE_PUD
> > + bool
> > + help
> > + Architectures that select this are able to move page tables at the
> > + PUD level. If there are only 3 page table levels, the move effectively
> > + happens at the PGD level.
> > +
> > config HAVE_MOVE_PMD
> > bool
> > help
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index 138abbae4f75..c1d6ab667d70 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -249,14 +249,176 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> >
> > return true;
> > }
> > +#else
> > +static inline bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> > + unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd)
> > +{
> > + return false;
> > +}
> > #endif
> >
> > +#ifdef CONFIG_HAVE_MOVE_PUD
> > +static pud_t *get_old_pud(struct mm_struct *mm, unsigned long addr)
> > +{
> > + pgd_t *pgd;
> > + p4d_t *p4d;
> > + pud_t *pud;
> > +
> > + pgd = pgd_offset(mm, addr);
> > + if (pgd_none_or_clear_bad(pgd))
> > + return NULL;
> > +
> > + p4d = p4d_offset(pgd, addr);
> > + if (p4d_none_or_clear_bad(p4d))
> > + return NULL;
> > +
> > + pud = pud_offset(p4d, addr);
> > + if (pud_none_or_clear_bad(pud))
> > + return NULL;
> > +
> > + return pud;
> > +}
> > +
> > +static pud_t *alloc_new_pud(struct mm_struct *mm, struct vm_area_struct *vma,
> > + unsigned long addr)
> > +{
> > + pgd_t *pgd;
> > + p4d_t *p4d;
> > + pud_t *pud;
> > +
> > + pgd = pgd_offset(mm, addr);
> > + p4d = p4d_alloc(mm, pgd, addr);
> > + if (!p4d)
> > + return NULL;
> > + pud = pud_alloc(mm, p4d, addr);
> > + if (!pud)
> > + return NULL;
> > +
> > + return pud;
> > +}
>
> Looks like a code duplication.
>
> Could you move these two helpers out of #ifdef CONFIG_HAVE_MOVE_PUD and
> make get_old_pmd() and alloc_new_pmd() use them?
Yes, that will be cleaner. I'll update it in the next version.
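For the record, once get_old_pud() and alloc_new_pud() are outside the
#ifdef, the PMD helpers could be layered on top of them roughly as
follows (an untested sketch against this series; kernel-internal code
that only builds in-tree):

```c
static pmd_t *get_old_pmd(struct mm_struct *mm, unsigned long addr)
{
	pud_t *pud;
	pmd_t *pmd;

	/* Reuse the PUD walk instead of repeating the pgd/p4d/pud lookups. */
	pud = get_old_pud(mm, addr);
	if (!pud)
		return NULL;

	pmd = pmd_offset(pud, addr);
	if (pmd_none(*pmd))
		return NULL;

	return pmd;
}

static pmd_t *alloc_new_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
			    unsigned long addr)
{
	pud_t *pud;

	/* Reuse the PUD allocation path for the destination address. */
	pud = alloc_new_pud(mm, vma, addr);
	if (!pud)
		return NULL;

	return pmd_alloc(mm, pud, addr);
}
```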
>
> > +
> > +static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
> > + unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
> > +{
> > + spinlock_t *old_ptl, *new_ptl;
> > + struct mm_struct *mm = vma->vm_mm;
> > + pud_t pud;
> > +
> > + /*
> > + * The destination pud shouldn't be established, free_pgtables()
> > + * should have released it.
> > + */
> > + if (WARN_ON_ONCE(!pud_none(*new_pud)))
> > + return false;
> > +
> > + /*
> > + * We don't have to worry about the ordering of src and dst
> > + * ptlocks because exclusive mmap_lock prevents deadlock.
> > + */
> > + old_ptl = pud_lock(vma->vm_mm, old_pud);
> > + new_ptl = pud_lockptr(mm, new_pud);
> > + if (new_ptl != old_ptl)
> > + spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> > +
> > + /* Clear the pud */
> > + pud = *old_pud;
> > + pud_clear(old_pud);
> > +
> > + VM_BUG_ON(!pud_none(*new_pud));
> > +
> > + /* Set the new pud */
> > + set_pud_at(mm, new_addr, new_pud, pud);
> > + flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE);
> > + if (new_ptl != old_ptl)
> > + spin_unlock(new_ptl);
> > + spin_unlock(old_ptl);
> > +
> > + return true;
> > +}
> > +#else
> > +static inline bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
> > + unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
> > +{
> > + return false;
> > +}
> > +#endif
> > +
> > +enum pgt_entry {
> > + NORMAL_PMD,
> > + HPAGE_PMD,
> > + NORMAL_PUD,
> > +};
> > +
> > +/*
> > + * Returns an extent of the corresponding size for the pgt_entry specified if valid.
> > + * Else returns a smaller extent bounded by the end of the source and destination
> > + * pgt_entry. Returns 0 if an invalid pgt_entry is specified.
> > + */
> > +static unsigned long get_extent(enum pgt_entry entry, unsigned long old_addr,
> > + unsigned long old_end, unsigned long new_addr)
> > +{
> > + unsigned long next, extent, mask, size;
> > +
> > + if (entry == NORMAL_PMD || entry == HPAGE_PMD) {
> > + mask = PMD_MASK;
> > + size = PMD_SIZE;
> > + } else if (entry == NORMAL_PUD) {
> > + mask = PUD_MASK;
> > + size = PUD_SIZE;
> > + } else
> > + return 0;
>
> Em. Who would ever specify invalid pgt_entry? It's bug.
> Again, switch()?
Sounds good. I'll use BUG() and switch() instead.
>
> > +
> > + next = (old_addr + size) & mask;
> > + /* even if next overflowed, extent below will be ok */
> > + extent = (next > old_end) ? old_end - old_addr : next - old_addr;
> > + next = (new_addr + size) & mask;
> > + if (extent > next - new_addr)
> > + extent = next - new_addr;
> > + return extent;
> > +}
> > +
> > +/*
> > + * Attempts to speedup the move by moving entry at the level corresponding to
> > + * pgt_entry. Returns true if the move was successful, else false.
> > + */
> > +static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma,
> > + unsigned long old_addr, unsigned long new_addr, void *old_entry,
> > + void *new_entry, bool need_rmap_locks)
> > +{
> > + bool moved = false;
> > +
> > + /* See comment in move_ptes() */
> > + if (need_rmap_locks)
> > + take_rmap_locks(vma);
> > +
> > + switch (entry) {
> > + case NORMAL_PMD:
> > + moved = move_normal_pmd(vma, old_addr, new_addr, old_entry, new_entry);
>
> Nit: here and below, double space after '='. Why?
Sorry, editing mistake on my end. I'll clean this up before resending.
>
> > + break;
> > + case NORMAL_PUD:
> > + moved = move_normal_pud(vma, old_addr, new_addr, old_entry, new_entry);
> > + break;
> > + case HPAGE_PMD:
> > + moved = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
> > + move_huge_pmd(vma, old_addr, new_addr, old_entry, new_entry);
> > + break;
> > + default:
> > + WARN_ON_ONCE(1);
> > + break;
> > + }
> > +
> > + if (need_rmap_locks)
> > + drop_rmap_locks(vma);
> > +
> > + return moved;
> > +}
> > +
> > unsigned long move_page_tables(struct vm_area_struct *vma,
> > unsigned long old_addr, struct vm_area_struct *new_vma,
> > unsigned long new_addr, unsigned long len,
> > bool need_rmap_locks)
> > {
> > - unsigned long extent, next, old_end;
> > + unsigned long extent, old_end;
> > struct mmu_notifier_range range;
> > pmd_t *old_pmd, *new_pmd;
> >
> > @@ -269,14 +431,27 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> >
> > for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
> > cond_resched();
> > - next = (old_addr + PMD_SIZE) & PMD_MASK;
> > - /* even if next overflowed, extent below will be ok */
> > - extent = next - old_addr;
> > - if (extent > old_end - old_addr)
> > - extent = old_end - old_addr;
> > - next = (new_addr + PMD_SIZE) & PMD_MASK;
> > - if (extent > next - new_addr)
> > - extent = next - new_addr;
> > +#ifdef CONFIG_HAVE_MOVE_PUD
>
> Any chance if (IS_ENABLED(CONFIG_HAVE_MOVE_PUD)) would work here?
Once we move get_old_pud() and alloc_new_pud() out of the #ifdefs as
you suggested above, it should work. It would then also be possible to
replace the #ifdef CONFIG_HAVE_MOVE_PMD in move_page_tables() with
IS_ENABLED(CONFIG_HAVE_MOVE_PMD).
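With both helper pairs outside the #ifdefs, the PUD fast path in
move_page_tables() could then read roughly as below (untested sketch,
kernel-internal code that only builds in-tree):

```c
		/*
		 * If extent is PUD-sized, try to speed up the move by
		 * moving at the PUD level if possible.
		 */
		extent = get_extent(NORMAL_PUD, old_addr, old_end, new_addr);
		if (IS_ENABLED(CONFIG_HAVE_MOVE_PUD) && extent == PUD_SIZE) {
			pud_t *old_pud, *new_pud;

			old_pud = get_old_pud(vma->vm_mm, old_addr);
			if (!old_pud)
				continue;
			new_pud = alloc_new_pud(vma->vm_mm, vma, new_addr);
			if (!new_pud)
				break;
			if (move_pgt_entry(NORMAL_PUD, vma, old_addr, new_addr,
					   old_pud, new_pud, need_rmap_locks))
				continue;
		}
```

When CONFIG_HAVE_MOVE_PUD is off the condition is compile-time false,
so the block is still type-checked but eliminated as dead code.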
Thanks,
Kalesh
>
> > + /*
> > + * If extent is PUD-sized try to speed up the move by moving at the
> > + * PUD level if possible.
> > + */
> > + extent = get_extent(NORMAL_PUD, old_addr, old_end, new_addr);
> > + if (extent == PUD_SIZE) {
> > + pud_t *old_pud, *new_pud;
> > +
> > + old_pud = get_old_pud(vma->vm_mm, old_addr);
> > + if (!old_pud)
> > + continue;
> > + new_pud = alloc_new_pud(vma->vm_mm, vma, new_addr);
> > + if (!new_pud)
> > + break;
> > + if (move_pgt_entry(NORMAL_PUD, vma, old_addr, new_addr,
> > + old_pud, new_pud, need_rmap_locks))
> > + continue;
> > + }
> > +#endif
> > + extent = get_extent(NORMAL_PMD, old_addr, old_end, new_addr);
> > old_pmd = get_old_pmd(vma->vm_mm, old_addr);
> > if (!old_pmd)
> > continue;
> > @@ -284,18 +459,10 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> > if (!new_pmd)
> > break;
> > if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd) || pmd_devmap(*old_pmd)) {
> > - if (extent == HPAGE_PMD_SIZE) {
> > - bool moved;
> > - /* See comment in move_ptes() */
> > - if (need_rmap_locks)
> > - take_rmap_locks(vma);
> > - moved = move_huge_pmd(vma, old_addr, new_addr,
> > - old_pmd, new_pmd);
> > - if (need_rmap_locks)
> > - drop_rmap_locks(vma);
> > - if (moved)
> > - continue;
> > - }
> > + if (extent == HPAGE_PMD_SIZE &&
> > + move_pgt_entry(HPAGE_PMD, vma, old_addr, new_addr, old_pmd,
> > + new_pmd, need_rmap_locks))
> > + continue;
> > split_huge_pmd(vma, old_pmd, old_addr);
> > if (pmd_trans_unstable(old_pmd))
> > continue;
> > @@ -305,15 +472,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> > * If the extent is PMD-sized, try to speed the move by
> > * moving at the PMD level if possible.
> > */
> > - bool moved;
> > -
> > - if (need_rmap_locks)
> > - take_rmap_locks(vma);
> > - moved = move_normal_pmd(vma, old_addr, new_addr,
> > - old_pmd, new_pmd);
> > - if (need_rmap_locks)
> > - drop_rmap_locks(vma);
> > - if (moved)
> > + if (move_pgt_entry(NORMAL_PMD, vma, old_addr, new_addr, old_pmd,
> > + new_pmd, need_rmap_locks))
> > continue;
> > #endif
> > }
> > --
> > 2.28.0.806.g8561365e88-goog
> >
>
> --
> Kirill A. Shutemov
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>
On Fri, Oct 2, 2020 at 12:52 PM Kirill A. Shutemov
<[email protected]> wrote:
>
> On Fri, Oct 02, 2020 at 04:20:49PM +0000, Kalesh Singh wrote:
> > set_pud_at() is used in move_normal_pud() for remapping
> > pages at the PUD level.
> >
> > Signed-off-by: Kalesh Singh <[email protected]>
> > ---
> > arch/arm64/include/asm/pgtable.h | 1 +
> > 1 file changed, 1 insertion(+)
> >
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index d5d3fbe73953..8848125e3024 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -415,6 +415,7 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd)
> > #define pfn_pud(pfn,prot) __pud(__phys_to_pud_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))
> >
> > #define set_pmd_at(mm, addr, pmdp, pmd) set_pte_at(mm, addr, (pte_t *)pmdp, pmd_pte(pmd))
> > +#define set_pud_at(mm, addr, pudp, pud) set_pte_at(mm, addr, (pte_t *)pudp, pud_pte(pud))
> >
> > #define __p4d_to_phys(p4d) __pte_to_phys(p4d_pte(p4d))
> > #define __phys_to_p4d_val(phys) __phys_to_pte_val(phys)
>
> Just fold it into the next patch.
Sounds good. I'll update it in the next version. Thanks.
>
> --
> Kirill A. Shutemov
>
On Fri, Oct 2, 2020 at 6:08 PM Kalesh Singh <[email protected]> wrote:
>
> Hi Kirill, thank you for the feedback.
>
> On Fri, Oct 2, 2020 at 12:51 PM Kirill A. Shutemov
> <[email protected]> wrote:
> >
> > On Fri, Oct 02, 2020 at 04:20:48PM +0000, Kalesh Singh wrote:
> > > Android needs to move large memory regions for garbage collection.
> > > The GC requires moving physical pages of multi-gigabyte heap
> > > using mremap. During this move, the application threads have to
> > > be paused for correctness. It is critical to keep this pause as
> > > short as possible to avoid jitters during user interaction.
> > >
> > > Optimize mremap for >= 1GB-sized regions by moving at the PUD/PGD
> > > level if the source and destination addresses are PUD-aligned.
> > > For CONFIG_PGTABLE_LEVELS == 3, moving at the PUD level in effect moves
> > > PGD entries, since the PUD entry is “folded back” onto the PGD entry.
> > > Add HAVE_MOVE_PUD so that architectures where moving at the PUD level
> > > isn't supported/tested can turn this off by not selecting the config.
> > >
> > > Fix build test error from v1 of this series reported by
> > > kernel test robot in [1].
> > >
> > > [1] https://lists.01.org/hyperkitty/list/[email protected]/thread/CKPGL4FH4NG7TGH2CVYX2UX76L25BTA3/
> > >
> > > Signed-off-by: Kalesh Singh <[email protected]>
> > > Reported-by: kernel test robot <[email protected]>
> > > ---
> > > Changes in v2:
> > > - Update commit message with description of Android GC's use case.
> > > - Move set_pud_at() to a separate patch.
> > > - Use switch() instead of ifs in move_pgt_entry()
> > > - Fix build test error reported by kernel test robot on x86_64 in [1].
> > > Guard move_huge_pmd() with IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE),
> > > since this section doesn't get optimized out in the kernel test
> > > robot's build test when HAVE_MOVE_PUD is enabled.
> > > - Keep WARN_ON_ONCE(1) instead of BUILD_BUG() for the aforementioned
> > > reason.
> >
> > Okay, but IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) on the caller side would
> > do the trick, I believe.
> I tried moving this to the caller side in move_page_tables(),
> - if (extent == HPAGE_PMD_SIZE &&
> + if (extent == HPAGE_PMD_SIZE &&
> IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
> but it produces the same error as reported by the kernel test robot:
> ld.lld: error: undefined symbol: move_huge_pmd
> I'm not sure why the toolchains behave differently, but the kernel
> test robot's compiler complains.
I should also mention that the patch series compiles without needing
IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) when built with the test config
from the kernel test robot's report and this clang version:
Android (6443078 based on r383902) clang version 11.0.1
(https://android.googlesource.com/toolchain/llvm-project
b397f81060ce6d701042b782172ed13bee898b79)
> >
> > >
> > > arch/Kconfig | 7 ++
> > > mm/mremap.c | 220 ++++++++++++++++++++++++++++++++++++++++++++-------
> > > 2 files changed, 197 insertions(+), 30 deletions(-)
> > >
> > > diff --git a/arch/Kconfig b/arch/Kconfig
> > > index af14a567b493..5eabaa00bf9b 100644
> > > --- a/arch/Kconfig
> > > +++ b/arch/Kconfig
> > > @@ -602,6 +602,13 @@ config HAVE_IRQ_TIME_ACCOUNTING
> > > Archs need to ensure they use a high enough resolution clock to
> > > support irq time accounting and then call enable_sched_clock_irqtime().
> > >
> > > +config HAVE_MOVE_PUD
> > > + bool
> > > + help
> > > + Architectures that select this are able to move page tables at the
> > > + PUD level. If there are only 3 page table levels, the move effectively
> > > + happens at the PGD level.
> > > +
> > > config HAVE_MOVE_PMD
> > > bool
> > > help
> > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > index 138abbae4f75..c1d6ab667d70 100644
> > > --- a/mm/mremap.c
> > > +++ b/mm/mremap.c
> > > @@ -249,14 +249,176 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> > >
> > > return true;
> > > }
> > > +#else
> > > +static inline bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
> > > + unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd)
> > > +{
> > > + return false;
> > > +}
> > > #endif
> > >
> > > +#ifdef CONFIG_HAVE_MOVE_PUD
> > > +static pud_t *get_old_pud(struct mm_struct *mm, unsigned long addr)
> > > +{
> > > + pgd_t *pgd;
> > > + p4d_t *p4d;
> > > + pud_t *pud;
> > > +
> > > + pgd = pgd_offset(mm, addr);
> > > + if (pgd_none_or_clear_bad(pgd))
> > > + return NULL;
> > > +
> > > + p4d = p4d_offset(pgd, addr);
> > > + if (p4d_none_or_clear_bad(p4d))
> > > + return NULL;
> > > +
> > > + pud = pud_offset(p4d, addr);
> > > + if (pud_none_or_clear_bad(pud))
> > > + return NULL;
> > > +
> > > + return pud;
> > > +}
> > > +
> > > +static pud_t *alloc_new_pud(struct mm_struct *mm, struct vm_area_struct *vma,
> > > + unsigned long addr)
> > > +{
> > > + pgd_t *pgd;
> > > + p4d_t *p4d;
> > > + pud_t *pud;
> > > +
> > > + pgd = pgd_offset(mm, addr);
> > > + p4d = p4d_alloc(mm, pgd, addr);
> > > + if (!p4d)
> > > + return NULL;
> > > + pud = pud_alloc(mm, p4d, addr);
> > > + if (!pud)
> > > + return NULL;
> > > +
> > > + return pud;
> > > +}
> >
> > Looks like a code duplication.
> >
> > Could you move these two helpers out of #ifdef CONFIG_HAVE_MOVE_PUD and
> > make get_old_pmd() and alloc_new_pmd() use them?
> Yes, that will be cleaner. I'll update it in the next version.
> >
> > > +
> > > +static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
> > > + unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
> > > +{
> > > + spinlock_t *old_ptl, *new_ptl;
> > > + struct mm_struct *mm = vma->vm_mm;
> > > + pud_t pud;
> > > +
> > > + /*
> > > + * The destination pud shouldn't be established, free_pgtables()
> > > + * should have released it.
> > > + */
> > > + if (WARN_ON_ONCE(!pud_none(*new_pud)))
> > > + return false;
> > > +
> > > + /*
> > > + * We don't have to worry about the ordering of src and dst
> > > + * ptlocks because exclusive mmap_lock prevents deadlock.
> > > + */
> > > + old_ptl = pud_lock(vma->vm_mm, old_pud);
> > > + new_ptl = pud_lockptr(mm, new_pud);
> > > + if (new_ptl != old_ptl)
> > > + spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> > > +
> > > + /* Clear the pud */
> > > + pud = *old_pud;
> > > + pud_clear(old_pud);
> > > +
> > > + VM_BUG_ON(!pud_none(*new_pud));
> > > +
> > > + /* Set the new pud */
> > > + set_pud_at(mm, new_addr, new_pud, pud);
> > > + flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE);
> > > + if (new_ptl != old_ptl)
> > > + spin_unlock(new_ptl);
> > > + spin_unlock(old_ptl);
> > > +
> > > + return true;
> > > +}
> > > +#else
> > > +static inline bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
> > > + unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
> > > +{
> > > + return false;
> > > +}
> > > +#endif
> > > +
> > > +enum pgt_entry {
> > > + NORMAL_PMD,
> > > + HPAGE_PMD,
> > > + NORMAL_PUD,
> > > +};
> > > +
> > > +/*
> > > + * Returns an extent of the corresponding size for the pgt_entry specified if valid.
> > > + * Else returns a smaller extent bounded by the end of the source and destination
> > > + * pgt_entry. Returns 0 if an invalid pgt_entry is specified.
> > > + */
> > > +static unsigned long get_extent(enum pgt_entry entry, unsigned long old_addr,
> > > + unsigned long old_end, unsigned long new_addr)
> > > +{
> > > + unsigned long next, extent, mask, size;
> > > +
> > > + if (entry == NORMAL_PMD || entry == HPAGE_PMD) {
> > > + mask = PMD_MASK;
> > > + size = PMD_SIZE;
> > > + } else if (entry == NORMAL_PUD) {
> > > + mask = PUD_MASK;
> > > + size = PUD_SIZE;
> > > + } else
> > > + return 0;
> >
> > Em. Who would ever specify invalid pgt_entry? It's bug.
> > Again, switch()?
> Sounds good. I'll use BUG() and switch() instead.
> >
> > > +
> > > + next = (old_addr + size) & mask;
> > > + /* even if next overflowed, extent below will be ok */
> > > + extent = (next > old_end) ? old_end - old_addr : next - old_addr;
> > > + next = (new_addr + size) & mask;
> > > + if (extent > next - new_addr)
> > > + extent = next - new_addr;
> > > + return extent;
> > > +}
> > > +
> > > +/*
> > > + * Attempts to speedup the move by moving entry at the level corresponding to
> > > + * pgt_entry. Returns true if the move was successful, else false.
> > > + */
> > > +static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma,
> > > + unsigned long old_addr, unsigned long new_addr, void *old_entry,
> > > + void *new_entry, bool need_rmap_locks)
> > > +{
> > > + bool moved = false;
> > > +
> > > + /* See comment in move_ptes() */
> > > + if (need_rmap_locks)
> > > + take_rmap_locks(vma);
> > > +
> > > + switch (entry) {
> > > + case NORMAL_PMD:
> > > + moved = move_normal_pmd(vma, old_addr, new_addr, old_entry, new_entry);
> >
> > Nit: here and below, double space after '='. Why?
> Sorry, editing mistake on my end. I'll clean this up before resending.
> >
> > > + break;
> > > + case NORMAL_PUD:
> > > + moved = move_normal_pud(vma, old_addr, new_addr, old_entry, new_entry);
> > > + break;
> > > + case HPAGE_PMD:
> > > + moved = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
> > > + move_huge_pmd(vma, old_addr, new_addr, old_entry, new_entry);
> > > + break;
> > > + default:
> > > + WARN_ON_ONCE(1);
> > > + break;
> > > + }
> > > +
> > > + if (need_rmap_locks)
> > > + drop_rmap_locks(vma);
> > > +
> > > + return moved;
> > > +}
> > > +
> > > unsigned long move_page_tables(struct vm_area_struct *vma,
> > > unsigned long old_addr, struct vm_area_struct *new_vma,
> > > unsigned long new_addr, unsigned long len,
> > > bool need_rmap_locks)
> > > {
> > > - unsigned long extent, next, old_end;
> > > + unsigned long extent, old_end;
> > > struct mmu_notifier_range range;
> > > pmd_t *old_pmd, *new_pmd;
> > >
> > > @@ -269,14 +431,27 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> > >
> > > for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
> > > cond_resched();
> > > - next = (old_addr + PMD_SIZE) & PMD_MASK;
> > > - /* even if next overflowed, extent below will be ok */
> > > - extent = next - old_addr;
> > > - if (extent > old_end - old_addr)
> > > - extent = old_end - old_addr;
> > > - next = (new_addr + PMD_SIZE) & PMD_MASK;
> > > - if (extent > next - new_addr)
> > > - extent = next - new_addr;
> > > +#ifdef CONFIG_HAVE_MOVE_PUD
> >
> > Any chance if (IS_ENABLED(CONFIG_HAVE_MOVE_PUD)) would work here?
> Once we move get_old_pud() and alloc_new_pud() out of the #ifdefs as
> you suggested above, it should work. It would then also be possible to
> replace the #ifdef CONFIG_HAVE_MOVE_PMD in move_page_tables() with
> IS_ENABLED(CONFIG_HAVE_MOVE_PMD).
>
> Thanks,
> Kalesh
> >
> > > + /*
> > > + * If extent is PUD-sized try to speed up the move by moving at the
> > > + * PUD level if possible.
> > > + */
> > > + extent = get_extent(NORMAL_PUD, old_addr, old_end, new_addr);
> > > + if (extent == PUD_SIZE) {
> > > + pud_t *old_pud, *new_pud;
> > > +
> > > + old_pud = get_old_pud(vma->vm_mm, old_addr);
> > > + if (!old_pud)
> > > + continue;
> > > + new_pud = alloc_new_pud(vma->vm_mm, vma, new_addr);
> > > + if (!new_pud)
> > > + break;
> > > + if (move_pgt_entry(NORMAL_PUD, vma, old_addr, new_addr,
> > > + old_pud, new_pud, need_rmap_locks))
> > > + continue;
> > > + }
> > > +#endif
> > > + extent = get_extent(NORMAL_PMD, old_addr, old_end, new_addr);
> > > old_pmd = get_old_pmd(vma->vm_mm, old_addr);
> > > if (!old_pmd)
> > > continue;
> > > @@ -284,18 +459,10 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> > > if (!new_pmd)
> > > break;
> > > if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd) || pmd_devmap(*old_pmd)) {
> > > - if (extent == HPAGE_PMD_SIZE) {
> > > - bool moved;
> > > - /* See comment in move_ptes() */
> > > - if (need_rmap_locks)
> > > - take_rmap_locks(vma);
> > > - moved = move_huge_pmd(vma, old_addr, new_addr,
> > > - old_pmd, new_pmd);
> > > - if (need_rmap_locks)
> > > - drop_rmap_locks(vma);
> > > - if (moved)
> > > - continue;
> > > - }
> > > + if (extent == HPAGE_PMD_SIZE &&
> > > + move_pgt_entry(HPAGE_PMD, vma, old_addr, new_addr, old_pmd,
> > > + new_pmd, need_rmap_locks))
> > > + continue;
> > > split_huge_pmd(vma, old_pmd, old_addr);
> > > if (pmd_trans_unstable(old_pmd))
> > > continue;
> > > @@ -305,15 +472,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> > > * If the extent is PMD-sized, try to speed the move by
> > > * moving at the PMD level if possible.
> > > */
> > > - bool moved;
> > > -
> > > - if (need_rmap_locks)
> > > - take_rmap_locks(vma);
> > > - moved = move_normal_pmd(vma, old_addr, new_addr,
> > > - old_pmd, new_pmd);
> > > - if (need_rmap_locks)
> > > - drop_rmap_locks(vma);
> > > - if (moved)
> > > + if (move_pgt_entry(NORMAL_PMD, vma, old_addr, new_addr, old_pmd,
> > > + new_pmd, need_rmap_locks))
> > > continue;
> > > #endif
> > > }
> > > --
> > > 2.28.0.806.g8561365e88-goog
> > >
> >
> > --
> > Kirill A. Shutemov
> >