Hi everyone,
I've updated thp migration patches for v4.9-rc2-mmotm-2016-10-27-18-27
with feedbacks for ver.1.
General description (no change since ver.1)
===========================================
This patchset enhances page migration functionality to handle thp migration
for various page migration's callers:
- mbind(2)
- move_pages(2)
- migrate_pages(2)
- cgroup/cpuset migration
- memory hotremove
- soft offline
The main benefit is that we can avoid unnecessary thp splits, which helps us
avoid performance decrease when your applications handles NUMA optimization on
their own.
The implementation is similar to that of normal page migration, the key point
is that we modify a pmd to a pmd migration entry in swap-entry like format.
Changes / Notes
===============
- pmd_present() in x86 checks _PAGE_PRESENT, _PAGE_PROTNONE and _PAGE_PSE
bits together, which makes implementing thp migration a bit hard because
_PAGE_PSE bit is currently used by soft-dirty in swap-entry format.
I was advised to dropping _PAGE_PSE in pmd_present(), but I don't think
of the justification, so I keep it in this version. Instead, my approach
is to move _PAGE_SWP_SOFT_DIRTY to bit 6 (unused) and reserve bit 7 for
pmd non-present cases.
- this patchset still covers only x86_64. Zi Yan posted a patch for ppc64
and I think it's favorably received so that's fine. But there's unsolved
minor suggestion by Aneesh, so I don't include it in this set, expecting
that it will be updated/reposted.
- pte-mapped thp and doubly-mapped thp were not supported in ver.1, but
this version should work for such kinds of thp.
- thp page cache is not tested yet, and it's at the head of my todo list
for future version.
Any comments or advices are welcomed.
Thanks,
Naoya Horiguchi
pmd_present() checks _PAGE_PSE along with _PAGE_PRESENT to avoid false negative
return when it races with thp spilt (during which _PAGE_PRESENT is temporary
cleared.) I don't think that dropping _PAGE_PSE check in pmd_present() works
well because it can hurt optimization of tlb handling in thp split.
In the current kernel, bit 6 is not used in non-present format because nonlinear
file mapping is obsolete, so let's move _PAGE_SWP_SOFT_DIRTY to that bit.
Bit 7 is used as reserved (always clear), so please don't use it for other
purpose.
Signed-off-by: Naoya Horiguchi <[email protected]>
---
arch/x86/include/asm/pgtable_types.h | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable_types.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable_types.h
index 8b4de22..89e88f1 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable_types.h
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable_types.h
@@ -98,14 +98,14 @@
* Tracking soft dirty bit when a page goes to a swap is tricky.
* We need a bit which can be stored in pte _and_ not conflict
* with swap entry format. On x86 bits 6 and 7 are *not* involved
- * into swap entry computation, but bit 6 is used for nonlinear
- * file mapping, so we borrow bit 7 for soft dirty tracking.
+ * into swap entry computation, but bit 7 is used for thp migration,
+ * so we borrow bit 6 for soft dirty tracking.
*
* Please note that this bit must be treated as swap dirty page
- * mark if and only if the PTE has present bit clear!
+ * mark if and only if the PTE/PMD has present bit clear!
*/
#ifdef CONFIG_MEM_SOFT_DIRTY
-#define _PAGE_SWP_SOFT_DIRTY _PAGE_PSE
+#define _PAGE_SWP_SOFT_DIRTY _PAGE_DIRTY
#else
#define _PAGE_SWP_SOFT_DIRTY (_AT(pteval_t, 0))
#endif
--
2.7.0
Introduces CONFIG_ARCH_ENABLE_THP_MIGRATION to limit thp migration
functionality to x86_64, which should be safer at the first step.
Signed-off-by: Naoya Horiguchi <[email protected]>
---
v1 -> v2:
- fixed config name in subject and patch description
---
arch/x86/Kconfig | 4 ++++
include/linux/huge_mm.h | 14 ++++++++++++++
mm/Kconfig | 3 +++
3 files changed, 21 insertions(+)
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/Kconfig v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/Kconfig
index 19d237b..63b2f24 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/Kconfig
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/Kconfig
@@ -2246,6 +2246,10 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
def_bool y
depends on X86_64 && HUGETLB_PAGE && MIGRATION
+config ARCH_ENABLE_THP_MIGRATION
+ def_bool y
+ depends on X86_64 && TRANSPARENT_HUGEPAGE && MIGRATION
+
menu "Power management and ACPI options"
config ARCH_HIBERNATION_HEADER
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/huge_mm.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/huge_mm.h
index 31f2c32..fcbca51 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/huge_mm.h
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/huge_mm.h
@@ -161,6 +161,15 @@ void mm_put_huge_zero_page(struct mm_struct *mm);
#define mk_huge_pmd(page, prot) pmd_mkhuge(mk_pmd(page, prot))
+static inline bool thp_migration_supported(void)
+{
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+ return true;
+#else
+ return false;
+#endif
+}
+
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
@@ -232,6 +241,11 @@ static inline struct page *follow_devmap_pmd(struct vm_area_struct *vma,
{
return NULL;
}
+
+static inline bool thp_migration_supported(void)
+{
+ return false;
+}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif /* _LINUX_HUGE_MM_H */
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/Kconfig v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/Kconfig
index be0ee11..1965310 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/Kconfig
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/Kconfig
@@ -289,6 +289,9 @@ config MIGRATION
config ARCH_ENABLE_HUGEPAGE_MIGRATION
bool
+config ARCH_ENABLE_THP_MIGRATION
+ bool
+
config PHYS_ADDR_T_64BIT
def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
--
2.7.0
This patch prepares thp migration's core code. These code will be open when
unmap_and_move() stops unconditionally splitting thp and get_new_page() starts
to allocate destination thps.
Signed-off-by: Naoya Horiguchi <[email protected]>
---
ChangeLog v1 -> v2:
- support pte-mapped thp, doubly-mapped thp
---
arch/x86/include/asm/pgtable_64.h | 2 +
include/linux/swapops.h | 61 +++++++++++++++
mm/huge_memory.c | 154 ++++++++++++++++++++++++++++++++++++++
mm/migrate.c | 44 ++++++++++-
mm/pgtable-generic.c | 3 +-
5 files changed, 262 insertions(+), 2 deletions(-)
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable_64.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable_64.h
index 1cc82ec..3a1b48e 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable_64.h
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable_64.h
@@ -167,7 +167,9 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
((type) << (SWP_TYPE_FIRST_BIT)) \
| ((offset) << SWP_OFFSET_FIRST_BIT) })
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val((pte)) })
+#define __pmd_to_swp_entry(pte) ((swp_entry_t) { pmd_val((pmd)) })
#define __swp_entry_to_pte(x) ((pte_t) { .pte = (x).val })
+#define __swp_entry_to_pmd(x) ((pmd_t) { .pmd = (x).val })
extern int kern_addr_valid(unsigned long addr);
extern void cleanup_highmap(void);
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/swapops.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/swapops.h
index 5c3a5f3..b6b22a2 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/swapops.h
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/swapops.h
@@ -163,6 +163,67 @@ static inline int is_write_migration_entry(swp_entry_t entry)
#endif
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+extern void set_pmd_migration_entry(struct page *page,
+ struct vm_area_struct *vma, unsigned long address);
+
+extern int remove_migration_pmd(struct page *new, pmd_t *pmd,
+ struct vm_area_struct *vma, unsigned long addr, void *old);
+
+extern void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd);
+
+static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
+{
+ swp_entry_t arch_entry;
+
+ arch_entry = __pmd_to_swp_entry(pmd);
+ return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
+}
+
+static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
+{
+ swp_entry_t arch_entry;
+
+ arch_entry = __swp_entry(swp_type(entry), swp_offset(entry));
+ return __swp_entry_to_pmd(arch_entry);
+}
+
+static inline int is_pmd_migration_entry(pmd_t pmd)
+{
+ return !pmd_present(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
+}
+#else
+static inline void set_pmd_migration_entry(struct page *page,
+ struct vm_area_struct *vma, unsigned long address)
+{
+}
+
+static inline int remove_migration_pmd(struct page *new, pmd_t *pmd,
+ struct vm_area_struct *vma, unsigned long addr, void *old)
+{
+ return 0;
+}
+
+static inline void pmd_migration_entry_wait(struct mm_struct *m, pmd_t *p) { }
+
+static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
+{
+ return swp_entry(0, 0);
+}
+
+static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
+{
+ pmd_t pmd = {};
+
+ return pmd;
+}
+
+static inline int is_pmd_migration_entry(pmd_t pmd)
+{
+ return 0;
+}
+#endif
+
#ifdef CONFIG_MEMORY_FAILURE
extern atomic_long_t num_poisoned_pages __read_mostly;
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
index 0509d17..b3022b3 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
@@ -2310,3 +2310,157 @@ static int __init split_huge_pages_debugfs(void)
}
late_initcall(split_huge_pages_debugfs);
#endif
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+void set_pmd_migration_entry(struct page *page, struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pmd_t pmdval;
+ swp_entry_t entry;
+ spinlock_t *ptl;
+
+ pgd = pgd_offset(mm, addr);
+ if (!pgd_present(*pgd))
+ return;
+ pud = pud_offset(pgd, addr);
+ if (!pud_present(*pud))
+ return;
+ pmd = pmd_offset(pud, addr);
+ pmdval = *pmd;
+ barrier();
+ if (!pmd_present(pmdval))
+ return;
+
+ mmu_notifier_invalidate_range_start(mm, addr, addr + HPAGE_PMD_SIZE);
+ if (pmd_trans_huge(pmdval)) {
+ pmd_t pmdswp;
+
+ ptl = pmd_lock(mm, pmd);
+ if (!pmd_present(*pmd))
+ goto unlock_pmd;
+ if (unlikely(!pmd_trans_huge(*pmd)))
+ goto unlock_pmd;
+ if (pmd_page(*pmd) != page)
+ goto unlock_pmd;
+
+ pmdval = pmdp_huge_get_and_clear(mm, addr, pmd);
+ if (pmd_dirty(pmdval))
+ set_page_dirty(page);
+ entry = make_migration_entry(page, pmd_write(pmdval));
+ pmdswp = swp_entry_to_pmd(entry);
+ pmdswp = pmd_mkhuge(pmdswp);
+ set_pmd_at(mm, addr, pmd, pmdswp);
+ page_remove_rmap(page, true);
+ put_page(page);
+unlock_pmd:
+ spin_unlock(ptl);
+ } else { /* pte-mapped thp */
+ pte_t *pte;
+ pte_t pteval;
+ struct page *tmp = compound_head(page);
+ unsigned long address = addr & HPAGE_PMD_MASK;
+ pte_t swp_pte;
+ int i;
+
+ pte = pte_offset_map(pmd, address);
+ ptl = pte_lockptr(mm, pmd);
+ spin_lock(ptl);
+ for (i = 0; i < HPAGE_PMD_NR; i++, pte++, tmp++) {
+ if (!(pte_present(*pte) &&
+ page_to_pfn(tmp) == pte_pfn(*pte)))
+ continue;
+ pteval = ptep_clear_flush(vma, address, pte);
+ if (pte_dirty(pteval))
+ set_page_dirty(tmp);
+ entry = make_migration_entry(tmp, pte_write(pteval));
+ swp_pte = swp_entry_to_pte(entry);
+ set_pte_at(mm, address, pte, swp_pte);
+ page_remove_rmap(tmp, false);
+ put_page(tmp);
+ }
+ pte_unmap_unlock(pte, ptl);
+ }
+ mmu_notifier_invalidate_range_end(mm, addr, addr + HPAGE_PMD_SIZE);
+ return;
+}
+
+int remove_migration_pmd(struct page *new, pmd_t *pmd,
+ struct vm_area_struct *vma, unsigned long addr, void *old)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ spinlock_t *ptl;
+ pmd_t pmde;
+ swp_entry_t entry;
+
+ pmde = *pmd;
+ barrier();
+
+ if (!pmd_present(pmde)) {
+ if (is_migration_entry(pmd_to_swp_entry(pmde))) {
+ unsigned long mmun_start = addr & HPAGE_PMD_MASK;
+ unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
+
+ ptl = pmd_lock(mm, pmd);
+ entry = pmd_to_swp_entry(*pmd);
+ if (migration_entry_to_page(entry) != old)
+ goto unlock_ptl;
+ get_page(new);
+ pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));
+ if (is_write_migration_entry(entry))
+ pmde = maybe_pmd_mkwrite(pmde, vma);
+ flush_cache_range(vma, mmun_start, mmun_end);
+ page_add_anon_rmap(new, vma, mmun_start, true);
+ pmdp_huge_clear_flush_notify(vma, mmun_start, pmd);
+ set_pmd_at(mm, mmun_start, pmd, pmde);
+ flush_tlb_range(vma, mmun_start, mmun_end);
+ if (vma->vm_flags & VM_LOCKED)
+ mlock_vma_page(new);
+ update_mmu_cache_pmd(vma, addr, pmd);
+unlock_ptl:
+ spin_unlock(ptl);
+ }
+ } else { /* pte-mapped thp */
+ pte_t *ptep;
+ pte_t pte;
+ int i;
+ struct page *tmpnew = compound_head(new);
+ struct page *tmpold = compound_head((struct page *)old);
+ unsigned long address = addr & HPAGE_PMD_MASK;
+
+ ptep = pte_offset_map(pmd, addr);
+ ptl = pte_lockptr(mm, pmd);
+ spin_lock(ptl);
+
+ for (i = 0; i < HPAGE_PMD_NR;
+ i++, ptep++, tmpnew++, tmpold++, address += PAGE_SIZE) {
+ pte = *ptep;
+ if (!is_swap_pte(pte))
+ continue;
+ entry = pte_to_swp_entry(pte);
+ if (!is_migration_entry(entry) ||
+ migration_entry_to_page(entry) != tmpold)
+ continue;
+ get_page(tmpnew);
+ pte = pte_mkold(mk_pte(tmpnew,
+ READ_ONCE(vma->vm_page_prot)));
+ if (pte_swp_soft_dirty(*ptep))
+ pte = pte_mksoft_dirty(pte);
+ if (is_write_migration_entry(entry))
+ pte = maybe_mkwrite(pte, vma);
+ flush_dcache_page(tmpnew);
+ set_pte_at(mm, address, ptep, pte);
+ if (PageAnon(new))
+ page_add_anon_rmap(tmpnew, vma, address, false);
+ else
+ page_add_file_rmap(tmpnew, false);
+ update_mmu_cache(vma, address, ptep);
+ }
+ pte_unmap_unlock(ptep, ptl);
+ }
+ return SWAP_AGAIN;
+}
+#endif
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/migrate.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/migrate.c
index 66ce6b4..54f2eb6 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/migrate.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/migrate.c
@@ -198,6 +198,8 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
{
struct mm_struct *mm = vma->vm_mm;
swp_entry_t entry;
+ pgd_t *pgd;
+ pud_t *pud;
pmd_t *pmd;
pte_t *ptep, pte;
spinlock_t *ptl;
@@ -208,10 +210,29 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
goto out;
ptl = huge_pte_lockptr(hstate_vma(vma), mm, ptep);
} else {
- pmd = mm_find_pmd(mm, addr);
+ pmd_t pmde;
+
+ pgd = pgd_offset(mm, addr);
+ if (!pgd_present(*pgd))
+ goto out;
+ pud = pud_offset(pgd, addr);
+ if (!pud_present(*pud))
+ goto out;
+ pmd = pmd_offset(pud, addr);
if (!pmd)
goto out;
+ if (PageTransCompound(new)) {
+ remove_migration_pmd(new, pmd, vma, addr, old);
+ goto out;
+ }
+
+ pmde = *pmd;
+ barrier();
+
+ if (!pmd_present(pmde) || pmd_trans_huge(pmde))
+ goto out;
+
ptep = pte_offset_map(pmd, addr);
/*
@@ -344,6 +365,27 @@ void migration_entry_wait_huge(struct vm_area_struct *vma,
__migration_entry_wait(mm, pte, ptl);
}
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
+{
+ spinlock_t *ptl;
+ struct page *page;
+
+ ptl = pmd_lock(mm, pmd);
+ if (!is_pmd_migration_entry(*pmd))
+ goto unlock;
+ page = migration_entry_to_page(pmd_to_swp_entry(*pmd));
+ if (!get_page_unless_zero(page))
+ goto unlock;
+ spin_unlock(ptl);
+ wait_on_page_locked(page);
+ put_page(page);
+ return;
+unlock:
+ spin_unlock(ptl);
+}
+#endif
+
#ifdef CONFIG_BLOCK
/* Returns true if all buffers are successfully locked */
static bool buffer_migrate_lock_buffers(struct buffer_head *head,
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/pgtable-generic.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/pgtable-generic.c
index 71c5f91..6012343 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/pgtable-generic.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/pgtable-generic.c
@@ -118,7 +118,8 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
{
pmd_t pmd;
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
- VM_BUG_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
+ VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
+ !pmd_devmap(*pmdp));
pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
return pmd;
--
2.7.0
TTU_MIGRATION is used to convert pte into migration entry until thp split
completes. This behavior conflicts with thp migration added later patches,
so let's introduce a new TTU flag specifically for freezing.
Signed-off-by: Naoya Horiguchi <[email protected]>
---
include/linux/rmap.h | 1 +
mm/huge_memory.c | 2 +-
mm/rmap.c | 8 +++++---
3 files changed, 7 insertions(+), 4 deletions(-)
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/rmap.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/rmap.h
index b46bb56..a2fa425 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/rmap.h
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/rmap.h
@@ -87,6 +87,7 @@ enum ttu_flags {
TTU_MUNLOCK = 4, /* munlock mode */
TTU_LZFREE = 8, /* lazy free mode */
TTU_SPLIT_HUGE_PMD = 16, /* split huge PMD if any */
+ TTU_SPLIT_FREEZE = 32, /* freeze pte under splitting thp */
TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
index 2d1d6bb..0509d17 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
@@ -1794,7 +1794,7 @@ static void freeze_page(struct page *page)
VM_BUG_ON_PAGE(!PageHead(page), page);
if (PageAnon(page))
- ttu_flags |= TTU_MIGRATION;
+ ttu_flags |= TTU_SPLIT_FREEZE;
/* We only need TTU_SPLIT_HUGE_PMD once */
ret = try_to_unmap(page, ttu_flags | TTU_SPLIT_HUGE_PMD);
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/rmap.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/rmap.c
index 1ef3640..a4be307 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/rmap.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/rmap.c
@@ -1449,7 +1449,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
if (flags & TTU_SPLIT_HUGE_PMD) {
split_huge_pmd_address(vma, address,
- flags & TTU_MIGRATION, page);
+ flags & TTU_SPLIT_FREEZE, page);
/* check if we have anything to do after split */
if (page_mapcount(page) == 0)
goto out;
@@ -1527,7 +1527,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
* will take care of the rest.
*/
dec_mm_counter(mm, mm_counter(page));
- } else if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION)) {
+ } else if (IS_ENABLED(CONFIG_MIGRATION) &&
+ (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
swp_entry_t entry;
pte_t swp_pte;
/*
@@ -1649,7 +1650,8 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
* locking requirements of exec(), migration skips
* temporary VMAs until after exec() completes.
*/
- if ((flags & TTU_MIGRATION) && !PageKsm(page) && PageAnon(page))
+ if ((flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))
+ && !PageKsm(page) && PageAnon(page))
rwc.invalid_vma = invalid_migration_vma;
if (flags & TTU_RMAP_LOCKED)
--
2.7.0
This patch enables thp migration for mbind(2) and migrate_pages(2).
Signed-off-by: Naoya Horiguchi <[email protected]>
---
ChangeLog v1 -> v2:
- support pte-mapped and doubly-mapped thp
---
mm/mempolicy.c | 108 +++++++++++++++++++++++++++++++++++++++++----------------
1 file changed, 79 insertions(+), 29 deletions(-)
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/mempolicy.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mempolicy.c
index 77d0668..96507ee 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/mempolicy.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mempolicy.c
@@ -94,6 +94,7 @@
#include <linux/mm_inline.h>
#include <linux/mmu_notifier.h>
#include <linux/printk.h>
+#include <linux/swapops.h>
#include <asm/tlbflush.h>
#include <asm/uaccess.h>
@@ -486,6 +487,49 @@ static inline bool queue_pages_node_check(struct page *page,
return node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT);
}
+static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+{
+ int ret = 0;
+ struct page *page;
+ struct queue_pages *qp = walk->private;
+ unsigned long flags;
+
+ if (unlikely(is_pmd_migration_entry(*pmd))) {
+ ret = 1;
+ goto unlock;
+ }
+ page = pmd_page(*pmd);
+ if (is_huge_zero_page(page)) {
+ spin_unlock(ptl);
+ __split_huge_pmd(walk->vma, pmd, addr, false, NULL);
+ goto out;
+ }
+ if (!thp_migration_supported()) {
+ get_page(page);
+ spin_unlock(ptl);
+ lock_page(page);
+ ret = split_huge_page(page);
+ unlock_page(page);
+ put_page(page);
+ goto out;
+ }
+ if (queue_pages_node_check(page, qp)) {
+ ret = 1;
+ goto unlock;
+ }
+
+ ret = 1;
+ flags = qp->flags;
+ /* go to thp migration */
+ if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+ migrate_page_add(page, qp->pagelist, flags);
+unlock:
+ spin_unlock(ptl);
+out:
+ return ret;
+}
+
/*
* Scan through pages checking if pages follow certain conditions,
* and move them to the pagelist if they do.
@@ -497,30 +541,15 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
struct page *page;
struct queue_pages *qp = walk->private;
unsigned long flags = qp->flags;
- int nid, ret;
+ int ret;
pte_t *pte;
spinlock_t *ptl;
- if (pmd_trans_huge(*pmd)) {
- ptl = pmd_lock(walk->mm, pmd);
- if (pmd_trans_huge(*pmd)) {
- page = pmd_page(*pmd);
- if (is_huge_zero_page(page)) {
- spin_unlock(ptl);
- __split_huge_pmd(vma, pmd, addr, false, NULL);
- } else {
- get_page(page);
- spin_unlock(ptl);
- lock_page(page);
- ret = split_huge_page(page);
- unlock_page(page);
- put_page(page);
- if (ret)
- return 0;
- }
- } else {
- spin_unlock(ptl);
- }
+ ptl = pmd_trans_huge_lock(pmd, vma);
+ if (ptl) {
+ ret = queue_pages_pmd(pmd, ptl, addr, end, walk);
+ if (ret)
+ return 0;
}
if (pmd_trans_unstable(pmd))
@@ -541,7 +570,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
continue;
if (queue_pages_node_check(page, qp))
continue;
- if (PageTransCompound(page)) {
+ if (PageTransCompound(page) && !thp_migration_supported()) {
get_page(page);
pte_unmap_unlock(pte, ptl);
lock_page(page);
@@ -959,19 +988,21 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
#ifdef CONFIG_MIGRATION
/*
- * page migration
+ * page migration, thp tail pages can be passed.
*/
static void migrate_page_add(struct page *page, struct list_head *pagelist,
unsigned long flags)
{
+ struct page *head = compound_head(page);
/*
* Avoid migrating a page that is shared with others.
*/
- if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1) {
- if (!isolate_lru_page(page)) {
- list_add_tail(&page->lru, pagelist);
- inc_node_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(head) == 1) {
+ if (!isolate_lru_page(head)) {
+ list_add_tail(&head->lru, pagelist);
+ mod_node_page_state(page_pgdat(head),
+ NR_ISOLATED_ANON + page_is_file_cache(head),
+ hpage_nr_pages(head));
}
}
}
@@ -981,7 +1012,17 @@ static struct page *new_node_page(struct page *page, unsigned long node, int **x
if (PageHuge(page))
return alloc_huge_page_node(page_hstate(compound_head(page)),
node);
- else
+ else if (thp_migration_supported() && PageTransHuge(page)) {
+ struct page *thp;
+
+ thp = alloc_pages_node(node,
+ (GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
+ HPAGE_PMD_ORDER);
+ if (!thp)
+ return NULL;
+ prep_transhuge_page(thp);
+ return thp;
+ } else
return __alloc_pages_node(node, GFP_HIGHUSER_MOVABLE |
__GFP_THISNODE, 0);
}
@@ -1147,6 +1188,15 @@ static struct page *new_page(struct page *page, unsigned long start, int **x)
if (PageHuge(page)) {
BUG_ON(!vma);
return alloc_huge_page_noerr(vma, address, 1);
+ } else if (thp_migration_supported() && PageTransHuge(page)) {
+ struct page *thp;
+
+ thp = alloc_hugepage_vma(GFP_TRANSHUGE, vma, address,
+ HPAGE_PMD_ORDER);
+ if (!thp)
+ return NULL;
+ prep_transhuge_page(thp);
+ return thp;
}
/*
* if !vma, alloc_page_vma() will use task or system default policy
--
2.7.0
This patch enables thp migration for move_pages(2).
Signed-off-by: Naoya Horiguchi <[email protected]>
---
mm/migrate.c | 37 ++++++++++++++++++++++++++++---------
1 file changed, 28 insertions(+), 9 deletions(-)
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/migrate.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/migrate.c
index 97ab8d9..6a589b9 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/migrate.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/migrate.c
@@ -1443,7 +1443,17 @@ static struct page *new_page_node(struct page *p, unsigned long private,
if (PageHuge(p))
return alloc_huge_page_node(page_hstate(compound_head(p)),
pm->node);
- else
+ else if (thp_migration_supported() && PageTransHuge(p)) {
+ struct page *thp;
+
+ thp = alloc_pages_node(pm->node,
+ (GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
+ HPAGE_PMD_ORDER);
+ if (!thp)
+ return NULL;
+ prep_transhuge_page(thp);
+ return thp;
+ } else
return __alloc_pages_node(pm->node,
GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
}
@@ -1470,6 +1480,8 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
for (pp = pm; pp->node != MAX_NUMNODES; pp++) {
struct vm_area_struct *vma;
struct page *page;
+ struct page *head;
+ unsigned int follflags;
err = -EFAULT;
vma = find_vma(mm, pp->addr);
@@ -1477,8 +1489,10 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
goto set_status;
/* FOLL_DUMP to ignore special (like zero) pages */
- page = follow_page(vma, pp->addr,
- FOLL_GET | FOLL_SPLIT | FOLL_DUMP);
+ follflags = FOLL_GET | FOLL_SPLIT | FOLL_DUMP;
+ if (thp_migration_supported())
+ follflags &= ~FOLL_SPLIT;
+ page = follow_page(vma, pp->addr, follflags);
err = PTR_ERR(page);
if (IS_ERR(page))
@@ -1488,7 +1502,6 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
if (!page)
goto set_status;
- pp->page = page;
err = page_to_nid(page);
if (err == pp->node)
@@ -1503,16 +1516,22 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
goto put_and_set;
if (PageHuge(page)) {
- if (PageHead(page))
+ if (PageHead(page)) {
isolate_huge_page(page, &pagelist);
+ err = 0;
+ pp->page = page;
+ }
goto put_and_set;
}
- err = isolate_lru_page(page);
+ pp->page = compound_head(page);
+ head = compound_head(page);
+ err = isolate_lru_page(head);
if (!err) {
- list_add_tail(&page->lru, &pagelist);
- inc_node_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ list_add_tail(&head->lru, &pagelist);
+ mod_node_page_state(page_pgdat(head),
+ NR_ISOLATED_ANON + page_is_file_cache(head),
+ hpage_nr_pages(head));
}
put_and_set:
/*
--
2.7.0
This patch enables thp migration for memory hotremove.
Signed-off-by: Naoya Horiguchi <[email protected]>
---
ChangeLog v1->v2:
- base code switched from alloc_migrate_target to new_node_page()
---
mm/memory_hotplug.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory_hotplug.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory_hotplug.c
index b18dab40..a9c3fe1 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory_hotplug.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory_hotplug.c
@@ -1543,6 +1543,7 @@ static struct page *new_node_page(struct page *page, unsigned long private,
int nid = page_to_nid(page);
nodemask_t nmask = node_states[N_MEMORY];
struct page *new_page = NULL;
+ unsigned int order = 0;
/*
* TODO: allocate a destination hugepage from a nearest neighbor node,
@@ -1553,6 +1554,11 @@ static struct page *new_node_page(struct page *page, unsigned long private,
return alloc_huge_page_node(page_hstate(compound_head(page)),
next_node_in(nid, nmask));
+ if (thp_migration_supported() && PageTransHuge(page)) {
+ order = HPAGE_PMD_ORDER;
+ gfp_mask |= GFP_TRANSHUGE;
+ }
+
node_clear(nid, nmask);
if (PageHighMem(page)
@@ -1560,12 +1566,15 @@ static struct page *new_node_page(struct page *page, unsigned long private,
gfp_mask |= __GFP_HIGHMEM;
if (!nodes_empty(nmask))
- new_page = __alloc_pages_nodemask(gfp_mask, 0,
+ new_page = __alloc_pages_nodemask(gfp_mask, order,
node_zonelist(nid, gfp_mask), &nmask);
if (!new_page)
- new_page = __alloc_pages(gfp_mask, 0,
+ new_page = __alloc_pages(gfp_mask, order,
node_zonelist(nid, gfp_mask));
+ if (new_page && order == HPAGE_PMD_ORDER)
+ prep_transhuge_page(new_page);
+
return new_page;
}
@@ -1595,7 +1604,9 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
if (isolate_huge_page(page, &source))
move_pages -= 1 << compound_order(head);
continue;
- }
+ } else if (thp_migration_supported() && PageTransHuge(page))
+ pfn = page_to_pfn(compound_head(page))
+ + HPAGE_PMD_NR - 1;
if (!get_page_unless_zero(page))
continue;
--
2.7.0
Soft dirty bit is designed to keep tracked over page migration. This patch
makes it work in the same manner for thp migration too.
Signed-off-by: Naoya Horiguchi <[email protected]>
---
ChangeLog v1 -> v2:
- separate diff moving _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 6
- clear_soft_dirty_pmd can handle migration entry
---
arch/x86/include/asm/pgtable.h | 17 +++++++++++++++++
fs/proc/task_mmu.c | 17 +++++++++++------
include/asm-generic/pgtable.h | 34 +++++++++++++++++++++++++++++++++-
include/linux/swapops.h | 2 ++
mm/huge_memory.c | 25 +++++++++++++++++++++++--
5 files changed, 86 insertions(+), 9 deletions(-)
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable.h
index 437feb4..ceec210 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable.h
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable.h
@@ -948,6 +948,23 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
{
return pte_clear_flags(pte, _PAGE_SWP_SOFT_DIRTY);
}
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
+{
+ return pmd_set_flags(pmd, _PAGE_SWP_SOFT_DIRTY);
+}
+
+static inline int pmd_swp_soft_dirty(pmd_t pmd)
+{
+ return pmd_flags(pmd) & _PAGE_SWP_SOFT_DIRTY;
+}
+
+static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
+{
+ return pmd_clear_flags(pmd, _PAGE_SWP_SOFT_DIRTY);
+}
+#endif
#endif
#define PKRU_AD_BIT 0x1
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/fs/proc/task_mmu.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/fs/proc/task_mmu.c
index c1f9cf4..85745b9 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/fs/proc/task_mmu.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/fs/proc/task_mmu.c
@@ -900,12 +900,17 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmdp)
{
- pmd_t pmd = pmdp_huge_get_and_clear(vma->vm_mm, addr, pmdp);
-
- pmd = pmd_wrprotect(pmd);
- pmd = pmd_clear_soft_dirty(pmd);
-
- set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+ pmd_t pmd = *pmdp;
+
+ if (pmd_present(pmd)) {
+ pmd = pmdp_huge_get_and_clear(vma->vm_mm, addr, pmdp);
+ pmd = pmd_wrprotect(pmd);
+ pmd = pmd_clear_soft_dirty(pmd);
+ set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+ } else if (is_migration_entry(pmd_to_swp_entry(pmd))) {
+ pmd = pmd_swp_clear_soft_dirty(pmd);
+ set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+ }
}
#else
static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/include/asm-generic/pgtable.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/asm-generic/pgtable.h
index c4f8fd2..f50e8b4 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/include/asm-generic/pgtable.h
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/asm-generic/pgtable.h
@@ -489,7 +489,24 @@ static inline void ptep_modify_prot_commit(struct mm_struct *mm,
#define arch_start_context_switch(prev) do {} while (0)
#endif
-#ifndef CONFIG_HAVE_ARCH_SOFT_DIRTY
+#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
+#ifndef CONFIG_ARCH_ENABLE_THP_MIGRATION
+static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
+{
+ return pmd;
+}
+
+static inline int pmd_swp_soft_dirty(pmd_t pmd)
+{
+ return 0;
+}
+
+static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
+{
+ return pmd;
+}
+#endif
+#else /* !CONFIG_HAVE_ARCH_SOFT_DIRTY */
static inline int pte_soft_dirty(pte_t pte)
{
return 0;
@@ -534,6 +551,21 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
{
return pte;
}
+
+static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
+{
+ return pmd;
+}
+
+static inline int pmd_swp_soft_dirty(pmd_t pmd)
+{
+ return 0;
+}
+
+static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
+{
+ return pmd;
+}
#endif
#ifndef __HAVE_PFNMAP_TRACKING
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/swapops.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/swapops.h
index b6b22a2..db8a858 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/swapops.h
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/swapops.h
@@ -176,6 +176,8 @@ static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
{
swp_entry_t arch_entry;
+ if (pmd_swp_soft_dirty(pmd))
+ pmd = pmd_swp_clear_soft_dirty(pmd);
arch_entry = __pmd_to_swp_entry(pmd);
return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
}
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
index 4e9090c..bd9c23e 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
@@ -832,6 +832,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
if (is_write_migration_entry(entry)) {
make_migration_entry_read(&entry);
pmd = swp_entry_to_pmd(entry);
+ if (pmd_swp_soft_dirty(pmd))
+ pmd = pmd_swp_mksoft_dirty(pmd);
set_pmd_at(src_mm, addr, src_pmd, pmd);
}
set_pmd_at(dst_mm, addr, dst_pmd, pmd);
@@ -1470,6 +1472,17 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
return 1;
}
+static pmd_t move_soft_dirty_pmd(pmd_t pmd)
+{
+#ifdef CONFIG_MEM_SOFT_DIRTY
+ if (unlikely(is_pmd_migration_entry(pmd)))
+ pmd = pmd_mksoft_dirty(pmd);
+ else if (pmd_present(pmd))
+ pmd = pmd_swp_mksoft_dirty(pmd);
+#endif
+ return pmd;
+}
+
bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
unsigned long new_addr, unsigned long old_end,
pmd_t *old_pmd, pmd_t *new_pmd)
@@ -1510,7 +1523,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
}
- set_pmd_at(mm, new_addr, new_pmd, pmd_mksoft_dirty(pmd));
+ pmd = move_soft_dirty_pmd(pmd);
+ set_pmd_at(mm, new_addr, new_pmd, pmd);
if (new_ptl != old_ptl)
spin_unlock(new_ptl);
spin_unlock(old_ptl);
@@ -1556,6 +1570,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
make_migration_entry_read(&entry);
newpmd = swp_entry_to_pmd(entry);
+ if (pmd_swp_soft_dirty(newpmd))
+ newpmd = pmd_swp_mksoft_dirty(newpmd);
set_pmd_at(mm, addr, pmd, newpmd);
}
goto unlock;
@@ -2408,7 +2424,8 @@ void set_pmd_migration_entry(struct page *page, struct vm_area_struct *vma,
set_page_dirty(page);
entry = make_migration_entry(page, pmd_write(pmdval));
pmdswp = swp_entry_to_pmd(entry);
- pmdswp = pmd_mkhuge(pmdswp);
+ if (pmd_soft_dirty(pmdval))
+ pmdswp = pmd_swp_mksoft_dirty(pmdswp);
set_pmd_at(mm, addr, pmd, pmdswp);
page_remove_rmap(page, true);
put_page(page);
@@ -2434,6 +2451,8 @@ void set_pmd_migration_entry(struct page *page, struct vm_area_struct *vma,
set_page_dirty(tmp);
entry = make_migration_entry(tmp, pte_write(pteval));
swp_pte = swp_entry_to_pte(entry);
+ if (pte_soft_dirty(pteval))
+ swp_pte = pte_swp_mksoft_dirty(swp_pte);
set_pte_at(mm, address, pte, swp_pte);
page_remove_rmap(tmp, false);
put_page(tmp);
@@ -2466,6 +2485,8 @@ int remove_migration_pmd(struct page *new, pmd_t *pmd,
goto unlock_ptl;
get_page(new);
pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));
+ if (pmd_swp_soft_dirty(*pmd))
+ pmde = pmd_mksoft_dirty(pmde);
if (is_write_migration_entry(entry))
pmde = maybe_pmd_mkwrite(pmde, vma);
flush_cache_range(vma, mmun_start, mmun_end);
--
2.7.0
This patch enables thp migration for soft offline.
Signed-off-by: Naoya Horiguchi <[email protected]>
---
mm/memory-failure.c | 31 ++++++++++++-------------------
1 file changed, 12 insertions(+), 19 deletions(-)
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory-failure.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory-failure.c
index 19e796d..6cc8157 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory-failure.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory-failure.c
@@ -1485,7 +1485,17 @@ static struct page *new_page(struct page *p, unsigned long private, int **x)
if (PageHuge(p))
return alloc_huge_page_node(page_hstate(compound_head(p)),
nid);
- else
+ else if (thp_migration_supported() && PageTransHuge(p)) {
+ struct page *thp;
+
+ thp = alloc_pages_node(nid,
+ (GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
+ HPAGE_PMD_ORDER);
+ if (!thp)
+ return NULL;
+ prep_transhuge_page(thp);
+ return thp;
+ } else
return __alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE, 0);
}
@@ -1687,28 +1697,11 @@ static int __soft_offline_page(struct page *page, int flags)
static int soft_offline_in_use_page(struct page *page, int flags)
{
int ret;
- struct page *hpage = compound_head(page);
-
- if (!PageHuge(page) && PageTransHuge(hpage)) {
- lock_page(hpage);
- if (!PageAnon(hpage) || unlikely(split_huge_page(hpage))) {
- unlock_page(hpage);
- if (!PageAnon(hpage))
- pr_info("soft offline: %#lx: non anonymous thp\n", page_to_pfn(page));
- else
- pr_info("soft offline: %#lx: thp split failed\n", page_to_pfn(page));
- put_hwpoison_page(hpage);
- return -EBUSY;
- }
- unlock_page(hpage);
- get_hwpoison_page(page);
- put_hwpoison_page(hpage);
- }
if (PageHuge(page))
ret = soft_offline_huge_page(page, flags);
else
- ret = __soft_offline_page(page, flags);
+ ret = __soft_offline_page(compound_head(page), flags);
return ret;
}
--
2.7.0
Introduce a separate check routine related to MPOL_MF_INVERT flag. This patch
just does cleanup, no behavioral change.
Signed-off-by: Naoya Horiguchi <[email protected]>
---
mm/mempolicy.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/mempolicy.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mempolicy.c
index 6d3639e..77d0668 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/mempolicy.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mempolicy.c
@@ -477,6 +477,15 @@ struct queue_pages {
struct vm_area_struct *prev;
};
+static inline bool queue_pages_node_check(struct page *page,
+ struct queue_pages *qp)
+{
+ int nid = page_to_nid(page);
+ unsigned long flags = qp->flags;
+
+ return node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT);
+}
+
/*
* Scan through pages checking if pages follow certain conditions,
* and move them to the pagelist if they do.
@@ -530,8 +539,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
*/
if (PageReserved(page))
continue;
- nid = page_to_nid(page);
- if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
+ if (queue_pages_node_check(page, qp))
continue;
if (PageTransCompound(page)) {
get_page(page);
@@ -563,7 +571,6 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
#ifdef CONFIG_HUGETLB_PAGE
struct queue_pages *qp = walk->private;
unsigned long flags = qp->flags;
- int nid;
struct page *page;
spinlock_t *ptl;
pte_t entry;
@@ -573,8 +580,7 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
if (!pte_present(entry))
goto unlock;
page = pte_page(entry);
- nid = page_to_nid(page);
- if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
+ if (queue_pages_node_check(page, qp))
goto unlock;
/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
if (flags & (MPOL_MF_MOVE_ALL) ||
--
2.7.0
If one of callers of page migration starts to handle thp, memory management code
start to see pmd migration entry, so we need to prepare for it before enabling.
This patch changes various code point which checks the status of given pmds in
order to prevent race between thp migration and the pmd-related works.
Signed-off-by: Naoya Horiguchi <[email protected]>
---
ChangeLog v1 -> v2:
- introduce pmd_related() (I know the naming is not good, but can't think up
no better name. Any suggesntion is welcomed.)
---
arch/x86/mm/gup.c | 4 +--
fs/proc/task_mmu.c | 23 +++++++------
include/linux/huge_mm.h | 9 ++++-
mm/gup.c | 10 ++++--
mm/huge_memory.c | 88 ++++++++++++++++++++++++++++++++++++++++---------
mm/madvise.c | 2 +-
mm/memcontrol.c | 2 ++
mm/memory.c | 6 +++-
mm/mprotect.c | 2 ++
mm/mremap.c | 2 +-
10 files changed, 114 insertions(+), 34 deletions(-)
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/mm/gup.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/mm/gup.c
index 0d4fb3e..78a153d 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/mm/gup.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/mm/gup.c
@@ -222,9 +222,9 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
pmd_t pmd = *pmdp;
next = pmd_addr_end(addr, end);
- if (pmd_none(pmd))
+ if (!pmd_present(pmd))
return 0;
- if (unlikely(pmd_large(pmd) || !pmd_present(pmd))) {
+ if (unlikely(pmd_large(pmd))) {
/*
* NUMA hinting faults need to be handled in the GUP
* slowpath for accounting purposes and so that they
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/fs/proc/task_mmu.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/fs/proc/task_mmu.c
index 35b92d8..c1f9cf4 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/fs/proc/task_mmu.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/fs/proc/task_mmu.c
@@ -596,7 +596,8 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
ptl = pmd_trans_huge_lock(pmd, vma);
if (ptl) {
- smaps_pmd_entry(pmd, addr, walk);
+ if (pmd_present(*pmd))
+ smaps_pmd_entry(pmd, addr, walk);
spin_unlock(ptl);
return 0;
}
@@ -929,6 +930,9 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
goto out;
}
+ if (!pmd_present(*pmd))
+ goto out;
+
page = pmd_page(*pmd);
/* Clear accessed and referenced bits. */
@@ -1208,19 +1212,18 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
if (ptl) {
u64 flags = 0, frame = 0;
pmd_t pmd = *pmdp;
+ struct page *page;
if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(pmd))
flags |= PM_SOFT_DIRTY;
- /*
- * Currently pmd for thp is always present because thp
- * can not be swapped-out, migrated, or HWPOISONed
- * (split in such cases instead.)
- * This if-check is just to prepare for future implementation.
- */
- if (pmd_present(pmd)) {
- struct page *page = pmd_page(pmd);
-
+ if (is_pmd_migration_entry(pmd)) {
+ swp_entry_t entry = pmd_to_swp_entry(pmd);
+ frame = swp_type(entry) |
+ (swp_offset(entry) << MAX_SWAPFILES_SHIFT);
+ page = migration_entry_to_page(entry);
+ } else if (pmd_present(pmd)) {
+ page = pmd_page(pmd);
if (page_mapcount(page) == 1)
flags |= PM_MMAP_EXCLUSIVE;
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/huge_mm.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/huge_mm.h
index fcbca51..3c252cd 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/huge_mm.h
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/huge_mm.h
@@ -125,12 +125,19 @@ extern void vma_adjust_trans_huge(struct vm_area_struct *vma,
long adjust_next);
extern spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd,
struct vm_area_struct *vma);
+
+static inline int pmd_related(pmd_t pmd)
+{
+ return !pmd_none(pmd) &&
+ (!pmd_present(pmd) || pmd_trans_huge(pmd) || pmd_devmap(pmd));
+}
+
/* mmap_sem must be held on entry */
static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
struct vm_area_struct *vma)
{
VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma);
- if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
+ if (pmd_related(*pmd))
return __pmd_trans_huge_lock(pmd, vma);
else
return NULL;
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/gup.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/gup.c
index e50178c..2dc4978 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/gup.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/gup.c
@@ -267,6 +267,8 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
}
if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
return no_page_table(vma, flags);
+ if (!pmd_present(*pmd))
+ return no_page_table(vma, flags);
if (pmd_devmap(*pmd)) {
ptl = pmd_lock(mm, pmd);
page = follow_devmap_pmd(vma, address, pmd, flags);
@@ -278,6 +280,10 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
return follow_page_pte(vma, address, pmd, flags);
ptl = pmd_lock(mm, pmd);
+ if (unlikely(!pmd_present(*pmd))) {
+ spin_unlock(ptl);
+ return no_page_table(vma, flags);
+ }
if (unlikely(!pmd_trans_huge(*pmd))) {
spin_unlock(ptl);
return follow_page_pte(vma, address, pmd, flags);
@@ -333,7 +339,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
pud = pud_offset(pgd, address);
BUG_ON(pud_none(*pud));
pmd = pmd_offset(pud, address);
- if (pmd_none(*pmd))
+ if (!pmd_present(*pmd))
return -EFAULT;
VM_BUG_ON(pmd_trans_huge(*pmd));
pte = pte_offset_map(pmd, address);
@@ -1357,7 +1363,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
pmd_t pmd = READ_ONCE(*pmdp);
next = pmd_addr_end(addr, end);
- if (pmd_none(pmd))
+ if (!pmd_present(pmd))
return 0;
if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
index b3022b3..4e9090c 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
@@ -825,6 +825,20 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
ret = -EAGAIN;
pmd = *src_pmd;
+
+ if (unlikely(is_pmd_migration_entry(pmd))) {
+ swp_entry_t entry = pmd_to_swp_entry(pmd);
+
+ if (is_write_migration_entry(entry)) {
+ make_migration_entry_read(&entry);
+ pmd = swp_entry_to_pmd(entry);
+ set_pmd_at(src_mm, addr, src_pmd, pmd);
+ }
+ set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+ ret = 0;
+ goto out_unlock;
+ }
+
if (unlikely(!pmd_trans_huge(pmd))) {
pte_free(dst_mm, pgtable);
goto out_unlock;
@@ -1013,6 +1027,9 @@ int do_huge_pmd_wp_page(struct fault_env *fe, pmd_t orig_pmd)
if (unlikely(!pmd_same(*fe->pmd, orig_pmd)))
goto out_unlock;
+ if (unlikely(!pmd_present(orig_pmd)))
+ goto out_unlock;
+
page = pmd_page(orig_pmd);
VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
/*
@@ -1137,7 +1154,14 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
goto out;
- page = pmd_page(*pmd);
+ if (is_pmd_migration_entry(*pmd)) {
+ swp_entry_t entry;
+ entry = pmd_to_swp_entry(*pmd);
+ page = pfn_to_page(swp_offset(entry));
+ if (!is_migration_entry(entry))
+ goto out;
+ } else
+ page = pmd_page(*pmd);
VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
if (flags & FOLL_TOUCH)
touch_pmd(vma, addr, pmd);
@@ -1332,6 +1356,9 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
if (is_huge_zero_pmd(orig_pmd))
goto out;
+ if (unlikely(!pmd_present(orig_pmd)))
+ goto out;
+
page = pmd_page(orig_pmd);
/*
* If other processes are mapping this page, we couldn't discard
@@ -1410,20 +1437,35 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
} else {
struct page *page = pmd_page(orig_pmd);
- page_remove_rmap(page, true);
- VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
- VM_BUG_ON_PAGE(!PageHead(page), page);
- if (PageAnon(page)) {
- pgtable_t pgtable;
- pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
- pte_free(tlb->mm, pgtable);
- atomic_long_dec(&tlb->mm->nr_ptes);
- add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+ int migration = 0;
+
+ if (!is_pmd_migration_entry(orig_pmd)) {
+ page_remove_rmap(page, true);
+ VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
+ VM_BUG_ON_PAGE(!PageHead(page), page);
+ if (PageAnon(page)) {
+ pgtable_t pgtable;
+ pgtable = pgtable_trans_huge_withdraw(tlb->mm,
+ pmd);
+ pte_free(tlb->mm, pgtable);
+ atomic_long_dec(&tlb->mm->nr_ptes);
+ add_mm_counter(tlb->mm, MM_ANONPAGES,
+ -HPAGE_PMD_NR);
+ } else {
+ add_mm_counter(tlb->mm, MM_FILEPAGES,
+ -HPAGE_PMD_NR);
+ }
} else {
- add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
+ swp_entry_t entry;
+
+ entry = pmd_to_swp_entry(orig_pmd);
+ free_swap_and_cache(entry); /* waring in failure? */
+ add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+ migration = 1;
}
spin_unlock(ptl);
- tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
+ if (!migration)
+ tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
}
return 1;
}
@@ -1496,14 +1538,27 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
bool preserve_write = prot_numa && pmd_write(*pmd);
ret = 1;
+ if (!pmd_present(*pmd))
+ goto unlock;
/*
* Avoid trapping faults against the zero page. The read-only
* data is likely to be read-cached on the local CPU and
* local/remote hits to the zero page are not interesting.
*/
- if (prot_numa && is_huge_zero_pmd(*pmd)) {
- spin_unlock(ptl);
- return ret;
+ if (prot_numa && is_huge_zero_pmd(*pmd))
+ goto unlock;
+
+ if (is_pmd_migration_entry(*pmd)) {
+ swp_entry_t entry = pmd_to_swp_entry(*pmd);
+
+ if (is_write_migration_entry(entry)) {
+ pmd_t newpmd;
+
+ make_migration_entry_read(&entry);
+ newpmd = swp_entry_to_pmd(entry);
+ set_pmd_at(mm, addr, pmd, newpmd);
+ }
+ goto unlock;
}
if (!prot_numa || !pmd_protnone(*pmd)) {
@@ -1516,6 +1571,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
BUG_ON(vma_is_anonymous(vma) && !preserve_write &&
pmd_write(entry));
}
+unlock:
spin_unlock(ptl);
}
@@ -1532,7 +1588,7 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
{
spinlock_t *ptl;
ptl = pmd_lock(vma->vm_mm, pmd);
- if (likely(pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
+ if (likely(pmd_related(*pmd)))
return ptl;
spin_unlock(ptl);
return NULL;
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/madvise.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/madvise.c
index 0e3828e..eaa2b02 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/madvise.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/madvise.c
@@ -274,7 +274,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
unsigned long next;
next = pmd_addr_end(addr, end);
- if (pmd_trans_huge(*pmd))
+ if (pmd_related(*pmd))
if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
goto next;
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/memcontrol.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memcontrol.c
index 91dfc7c..ebc2c42 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/memcontrol.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memcontrol.c
@@ -4635,6 +4635,8 @@ static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
struct page *page = NULL;
enum mc_target_type ret = MC_TARGET_NONE;
+ if (unlikely(!pmd_present(pmd)))
+ return ret;
page = pmd_page(pmd);
VM_BUG_ON_PAGE(!page || !PageHead(page), page);
if (!(mc.flags & MOVE_ANON))
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory.c
index 94b5e2c..33fa439 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory.c
@@ -999,7 +999,7 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
src_pmd = pmd_offset(src_pud, addr);
do {
next = pmd_addr_end(addr, end);
- if (pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)) {
+ if (pmd_related(*src_pmd)) {
int err;
VM_BUG_ON(next-addr != HPAGE_PMD_SIZE);
err = copy_huge_pmd(dst_mm, src_mm,
@@ -3591,6 +3591,10 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
int ret;
barrier();
+ if (unlikely(is_pmd_migration_entry(orig_pmd))) {
+ pmd_migration_entry_wait(mm, fe.pmd);
+ return 0;
+ }
if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
return do_huge_pmd_numa_page(&fe, orig_pmd);
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/mprotect.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mprotect.c
index c5ba2aa..81186e3 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/mprotect.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mprotect.c
@@ -164,6 +164,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
unsigned long this_pages;
next = pmd_addr_end(addr, end);
+ if (!pmd_present(*pmd))
+ continue;
if (!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
&& pmd_none_or_clear_bad(pmd))
continue;
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/mremap.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mremap.c
index da22ad2..a94a698 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/mremap.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mremap.c
@@ -194,7 +194,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
if (!new_pmd)
break;
- if (pmd_trans_huge(*old_pmd)) {
+ if (pmd_related(*old_pmd)) {
if (extent == HPAGE_PMD_SIZE) {
bool moved;
/* See comment in move_ptes() */
--
2.7.0
Hi Naoya,
[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.9-rc4 next-20161028]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Naoya-Horiguchi/mm-x86-move-_PAGE_SWP_SOFT_DIRTY-from-bit-7-to-bit-6/20161108-080615
base: git://git.cmpxchg.org/linux-mmotm.git master
config: i386-randconfig-x007-201645 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=i386
All error/warnings (new ones prefixed by >>):
mm/memory.c: In function 'copy_pmd_range':
>> mm/memory.c:1002:7: error: implicit declaration of function 'pmd_related' [-Werror=implicit-function-declaration]
if (pmd_related(*src_pmd)) {
^~~~~~~~~~~
cc1: some warnings being treated as errors
--
mm/mremap.c: In function 'move_page_tables':
>> mm/mremap.c:197:7: error: implicit declaration of function 'pmd_related' [-Werror=implicit-function-declaration]
if (pmd_related(*old_pmd)) {
^~~~~~~~~~~
In file included from include/asm-generic/bug.h:4:0,
from arch/x86/include/asm/bug.h:35,
from include/linux/bug.h:4,
from include/linux/mmdebug.h:4,
from include/linux/mm.h:8,
from mm/mremap.c:10:
>> include/linux/compiler.h:518:38: error: call to '__compiletime_assert_198' declared with attribute error: BUILD_BUG failed
_compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
^
include/linux/compiler.h:501:4: note: in definition of macro '__compiletime_assert'
prefix ## suffix(); \
^~~~~~
include/linux/compiler.h:518:2: note: in expansion of macro '_compiletime_assert'
_compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
^~~~~~~~~~~~~~~~~~~
include/linux/bug.h:54:37: note: in expansion of macro 'compiletime_assert'
#define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
^~~~~~~~~~~~~~~~~~
include/linux/bug.h:88:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
#define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
^~~~~~~~~~~~~~~~
>> include/linux/huge_mm.h:183:27: note: in expansion of macro 'BUILD_BUG'
#define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
^~~~~~~~~
>> mm/mremap.c:198:18: note: in expansion of macro 'HPAGE_PMD_SIZE'
if (extent == HPAGE_PMD_SIZE) {
^~~~~~~~~~~~~~
cc1: some warnings being treated as errors
--
mm/madvise.c: In function 'madvise_free_pte_range':
>> mm/madvise.c:277:6: error: implicit declaration of function 'pmd_related' [-Werror=implicit-function-declaration]
if (pmd_related(*pmd))
^~~~~~~~~~~
cc1: some warnings being treated as errors
vim +/pmd_related +1002 mm/memory.c
996 dst_pmd = pmd_alloc(dst_mm, dst_pud, addr);
997 if (!dst_pmd)
998 return -ENOMEM;
999 src_pmd = pmd_offset(src_pud, addr);
1000 do {
1001 next = pmd_addr_end(addr, end);
> 1002 if (pmd_related(*src_pmd)) {
1003 int err;
1004 VM_BUG_ON(next-addr != HPAGE_PMD_SIZE);
1005 err = copy_huge_pmd(dst_mm, src_mm,
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
Hi Naoya,
[auto build test WARNING on mmotm/master]
[also build test WARNING on next-20161028]
[cannot apply to v4.9-rc4]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Naoya-Horiguchi/mm-x86-move-_PAGE_SWP_SOFT_DIRTY-from-bit-7-to-bit-6/20161108-080615
base: git://git.cmpxchg.org/linux-mmotm.git master
config: i386-randconfig-x000-201645 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=i386
All warnings (new ones prefixed by >>):
Cyclomatic Complexity 3 mm/memory_hotplug.c:setup_memhp_default_state
Cyclomatic Complexity 6 mm/memory_hotplug.c:find_smallest_section_pfn
Cyclomatic Complexity 6 mm/memory_hotplug.c:find_biggest_section_pfn
Cyclomatic Complexity 9 mm/memory_hotplug.c:shrink_pgdat_span
Cyclomatic Complexity 1 include/linux/err.h:IS_ERR
Cyclomatic Complexity 2 include/linux/page-flags.h:compound_head
Cyclomatic Complexity 1 include/linux/page_ref.h:page_count
Cyclomatic Complexity 2 include/linux/thread_info.h:test_ti_thread_flag
Cyclomatic Complexity 1 include/linux/sched.h:test_tsk_thread_flag
Cyclomatic Complexity 1 include/linux/sched.h:signal_pending
Cyclomatic Complexity 0 include/linux/memory_hotplug.h:generic_alloc_nodedata
Cyclomatic Complexity 3 mm/memory_hotplug.c:next_active_pageblock
Cyclomatic Complexity 1 mm/memory_hotplug.c:__online_page_increment_counters
Cyclomatic Complexity 1 include/linux/mm.h:__free_reserved_page
Cyclomatic Complexity 1 include/linux/mm.h:free_reserved_page
Cyclomatic Complexity 1 mm/memory_hotplug.c:__online_page_free
Cyclomatic Complexity 1 mm/memory_hotplug.c:generic_online_page
Cyclomatic Complexity 1 include/linux/mm.h:put_page_testzero
Cyclomatic Complexity 4 mm/memory_hotplug.c:check_hotplug_memory_range
Cyclomatic Complexity 1 mm/memory_hotplug.c:cmdline_parse_movable_node
Cyclomatic Complexity 2 mm/memory_hotplug.c:ensure_zone_is_initialized
Cyclomatic Complexity 1 include/linux/seqlock.h:write_seqcount_begin_nested
Cyclomatic Complexity 1 include/linux/seqlock.h:write_seqcount_begin
Cyclomatic Complexity 1 include/linux/seqlock.h:write_seqlock
Cyclomatic Complexity 1 include/linux/memory_hotplug.h:zone_span_writelock
Cyclomatic Complexity 1 include/linux/seqlock.h:write_seqcount_end
Cyclomatic Complexity 1 include/linux/seqlock.h:write_sequnlock
Cyclomatic Complexity 1 include/linux/memory_hotplug.h:zone_span_writeunlock
Cyclomatic Complexity 3 mm/memory_hotplug.c:grow_zone_span
Cyclomatic Complexity 4 mm/memory_hotplug.c:__add_zone
Cyclomatic Complexity 4 mm/memory_hotplug.c:__add_section
Cyclomatic Complexity 7 mm/memory_hotplug.c:__add_pages
Cyclomatic Complexity 9 mm/memory_hotplug.c:shrink_zone_span
Cyclomatic Complexity 1 mm/memory_hotplug.c:__remove_zone
Cyclomatic Complexity 2 mm/memory_hotplug.c:resize_zone
Cyclomatic Complexity 6 mm/memory_hotplug.c:move_pfn_range_left
Cyclomatic Complexity 6 mm/memory_hotplug.c:move_pfn_range_right
Cyclomatic Complexity 4 mm/memory_hotplug.c:move_pfn_range
Cyclomatic Complexity 3 mm/memory_hotplug.c:__remove_section
Cyclomatic Complexity 6 mm/memory_hotplug.c:__remove_pages
Cyclomatic Complexity 2 mm/memory_hotplug.c:check_pages_isolated
Cyclomatic Complexity 1 mm/memory_hotplug.c:offline_isolated_pages
Cyclomatic Complexity 3 mm/memory_hotplug.c:hotadd_new_pgdat
Cyclomatic Complexity 1 mm/memory_hotplug.c:online_memory_block
Cyclomatic Complexity 1 mm/memory_hotplug.c:rollback_node_hotadd
Cyclomatic Complexity 4 mm/memory_hotplug.c:register_memory_resource
Cyclomatic Complexity 2 mm/memory_hotplug.c:release_memory_resource
Cyclomatic Complexity 6 mm/memory_hotplug.c:scan_movable_pages
Cyclomatic Complexity 1 include/linux/gfp.h:__alloc_pages
Cyclomatic Complexity 1 include/linux/hugetlb.h:page_hstate
Cyclomatic Complexity 3 include/linux/bitmap.h:bitmap_empty
Cyclomatic Complexity 1 include/linux/nodemask.h:__nodes_empty
Cyclomatic Complexity 12 mm/memory_hotplug.c:new_node_page
Cyclomatic Complexity 3 include/linux/mm.h:put_page
Cyclomatic Complexity 15 mm/memory_hotplug.c:do_migrate_range
Cyclomatic Complexity 2 mm/memory_hotplug.c:check_pages_isolated_cb
Cyclomatic Complexity 1 mm/memory_hotplug.c:offline_isolated_pages_cb
Cyclomatic Complexity 2 mm/memory_hotplug.c:check_memblock_offlined_cb
Cyclomatic Complexity 2 mm/memory_hotplug.c:get_online_mems
Cyclomatic Complexity 6 mm/memory_hotplug.c:put_online_mems
Cyclomatic Complexity 2 mm/memory_hotplug.c:set_online_page_callback
Cyclomatic Complexity 2 mm/memory_hotplug.c:restore_online_page_callback
Cyclomatic Complexity 2 mm/memory_hotplug.c:mem_hotplug_begin
Cyclomatic Complexity 1 mm/memory_hotplug.c:mem_hotplug_done
Cyclomatic Complexity 1 mm/memory_hotplug.c:get_page_bootmem
Cyclomatic Complexity 2 mm/memory_hotplug.c:put_page_bootmem
Cyclomatic Complexity 9 mm/memory_hotplug.c:zone_can_shift
Cyclomatic Complexity 16 mm/memory_hotplug.c:online_pages
Cyclomatic Complexity 4 mm/memory_hotplug.c:try_online_node
Cyclomatic Complexity 2 mm/memory_hotplug.c:zone_for_memory
Cyclomatic Complexity 3 mm/memory_hotplug.c:is_mem_section_removable
Cyclomatic Complexity 7 mm/memory_hotplug.c:test_pages_in_a_zone
Cyclomatic Complexity 22 mm/memory_hotplug.c:__offline_pages
Cyclomatic Complexity 1 mm/memory_hotplug.c:offline_pages
Cyclomatic Complexity 9 mm/memory_hotplug.c:walk_memory_range
Cyclomatic Complexity 8 mm/memory_hotplug.c:add_memory_resource
Cyclomatic Complexity 3 mm/memory_hotplug.c:add_memory
Cyclomatic Complexity 1 mm/memory_hotplug.c:remove_memory
In file included from include/uapi/linux/stddef.h:1:0,
from include/linux/stddef.h:4,
from mm/memory_hotplug.c:7:
mm/memory_hotplug.c: In function 'new_node_page':
include/linux/compiler.h:518:38: error: call to '__compiletime_assert_1575' declared with attribute error: BUILD_BUG failed
_compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
^
include/linux/compiler.h:501:4: note: in definition of macro '__compiletime_assert'
prefix ## suffix(); \
^~~~~~
include/linux/compiler.h:518:2: note: in expansion of macro '_compiletime_assert'
_compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
^~~~~~~~~~~~~~~~~~~
include/linux/bug.h:54:37: note: in expansion of macro 'compiletime_assert'
#define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
^~~~~~~~~~~~~~~~~~
include/linux/bug.h:88:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
#define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
^~~~~~~~~~~~~~~~
include/linux/huge_mm.h:181:28: note: in expansion of macro 'BUILD_BUG'
#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
^~~~~~~~~
>> include/linux/huge_mm.h:56:26: note: in expansion of macro 'HPAGE_PMD_SHIFT'
#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
^~~~~~~~~~~~~~~
>> mm/memory_hotplug.c:1575:27: note: in expansion of macro 'HPAGE_PMD_ORDER'
if (new_page && order == HPAGE_PMD_ORDER)
^~~~~~~~~~~~~~~
vim +/HPAGE_PMD_ORDER +1575 mm/memory_hotplug.c
1559 gfp_mask |= GFP_TRANSHUGE;
1560 }
1561
1562 node_clear(nid, nmask);
1563
1564 if (PageHighMem(page)
1565 || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
1566 gfp_mask |= __GFP_HIGHMEM;
1567
1568 if (!nodes_empty(nmask))
1569 new_page = __alloc_pages_nodemask(gfp_mask, order,
1570 node_zonelist(nid, gfp_mask), &nmask);
1571 if (!new_page)
1572 new_page = __alloc_pages(gfp_mask, order,
1573 node_zonelist(nid, gfp_mask));
1574
> 1575 if (new_page && order == HPAGE_PMD_ORDER)
1576 prep_transhuge_page(new_page);
1577
1578 return new_page;
1579 }
1580
1581 #define NR_OFFLINE_AT_ONCE_PAGES (256)
1582 static int
1583 do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
Hi Naoya,
[auto build test WARNING on mmotm/master]
[also build test WARNING on next-20161028]
[cannot apply to v4.9-rc4]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Naoya-Horiguchi/mm-x86-move-_PAGE_SWP_SOFT_DIRTY-from-bit-7-to-bit-6/20161108-080615
base: git://git.cmpxchg.org/linux-mmotm.git master
config: x86_64-randconfig-x003-201645 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64
All warnings (new ones prefixed by >>):
mm/memory_hotplug.c: In function 'try_offline_node':
mm/memory_hotplug.c:2131:6: warning: unused variable 'i' [-Wunused-variable]
int i;
^
In file included from include/uapi/linux/stddef.h:1:0,
from include/linux/stddef.h:4,
from mm/memory_hotplug.c:7:
mm/memory_hotplug.c: In function 'new_node_page':
include/linux/compiler.h:518:38: error: call to '__compiletime_assert_1575' declared with attribute error: BUILD_BUG failed
_compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
^
include/linux/compiler.h:160:16: note: in definition of macro '__trace_if'
______r = !!(cond); \
^~~~
>> mm/memory_hotplug.c:1575:2: note: in expansion of macro 'if'
if (new_page && order == HPAGE_PMD_ORDER)
^~
include/linux/compiler.h:506:2: note: in expansion of macro '__compiletime_assert'
__compiletime_assert(condition, msg, prefix, suffix)
^~~~~~~~~~~~~~~~~~~~
include/linux/compiler.h:518:2: note: in expansion of macro '_compiletime_assert'
_compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
^~~~~~~~~~~~~~~~~~~
include/linux/bug.h:54:37: note: in expansion of macro 'compiletime_assert'
#define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
^~~~~~~~~~~~~~~~~~
include/linux/bug.h:88:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
#define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
^~~~~~~~~~~~~~~~
include/linux/huge_mm.h:181:28: note: in expansion of macro 'BUILD_BUG'
#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
^~~~~~~~~
include/linux/huge_mm.h:56:26: note: in expansion of macro 'HPAGE_PMD_SHIFT'
#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
^~~~~~~~~~~~~~~
mm/memory_hotplug.c:1575:27: note: in expansion of macro 'HPAGE_PMD_ORDER'
if (new_page && order == HPAGE_PMD_ORDER)
^~~~~~~~~~~~~~~
vim +/if +1575 mm/memory_hotplug.c
1559 gfp_mask |= GFP_TRANSHUGE;
1560 }
1561
1562 node_clear(nid, nmask);
1563
1564 if (PageHighMem(page)
1565 || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
1566 gfp_mask |= __GFP_HIGHMEM;
1567
1568 if (!nodes_empty(nmask))
1569 new_page = __alloc_pages_nodemask(gfp_mask, order,
1570 node_zonelist(nid, gfp_mask), &nmask);
1571 if (!new_page)
1572 new_page = __alloc_pages(gfp_mask, order,
1573 node_zonelist(nid, gfp_mask));
1574
> 1575 if (new_page && order == HPAGE_PMD_ORDER)
1576 prep_transhuge_page(new_page);
1577
1578 return new_page;
1579 }
1580
1581 #define NR_OFFLINE_AT_ONCE_PAGES (256)
1582 static int
1583 do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
This patch makes it possible to support thp migration gradually. If you fail
to allocate a destination page as a thp, you just split the source thp as we
do now, and then enter the normal page migration. If you succeed to allocate
destination thp, you enter thp migration. Subsequent patches actually enable
thp migration for each caller of page migration by allowing its get_new_page()
callback to allocate thps.
Signed-off-by: Naoya Horiguchi <[email protected]>
---
mm/migrate.c | 2 +-
mm/rmap.c | 7 +++++++
2 files changed, 8 insertions(+), 1 deletion(-)
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/migrate.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/migrate.c
index 54f2eb6..97ab8d9 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/migrate.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/migrate.c
@@ -1142,7 +1142,7 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
goto out;
}
- if (unlikely(PageTransHuge(page))) {
+ if (unlikely(PageTransHuge(page) && !PageTransHuge(newpage))) {
lock_page(page);
rc = split_huge_page(page);
unlock_page(page);
diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/rmap.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/rmap.c
index a4be307..a0b665c 100644
--- v4.9-rc2-mmotm-2016-10-27-18-27/mm/rmap.c
+++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/rmap.c
@@ -1443,6 +1443,13 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
struct rmap_private *rp = arg;
enum ttu_flags flags = rp->flags;
+ if (flags & TTU_MIGRATION) {
+ if (!PageHuge(page) && PageTransCompound(page)) {
+ set_pmd_migration_entry(page, vma, address);
+ goto out;
+ }
+ }
+
/* munlock has nothing to gain from examining un-locked vmas */
if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
goto out;
--
2.7.0
On Tue, Nov 08, 2016 at 08:23:50AM +0800, kbuild test robot wrote:
> Hi Naoya,
>
> [auto build test ERROR on mmotm/master]
> [also build test ERROR on v4.9-rc4 next-20161028]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url: https://github.com/0day-ci/linux/commits/Naoya-Horiguchi/mm-x86-move-_PAGE_SWP_SOFT_DIRTY-from-bit-7-to-bit-6/20161108-080615
> base: git://git.cmpxchg.org/linux-mmotm.git master
> config: i386-randconfig-x007-201645 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=i386
>
> All error/warnings (new ones prefixed by >>):
>
> mm/memory.c: In function 'copy_pmd_range':
> >> mm/memory.c:1002:7: error: implicit declaration of function 'pmd_related' [-Werror=implicit-function-declaration]
> if (pmd_related(*src_pmd)) {
> ^~~~~~~~~~~
> cc1: some warnings being treated as errors
> --
I forgot to declare a noop routine for CONFIG_TRANSPARENT_HUGEPAGE=n.
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -222,6 +229,10 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
long adjust_next)
{
}
+static inline int pmd_related(pmd_t pmd)
+{
+ return 0;
+}
static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
struct vm_area_struct *vma)
{
> mm/mremap.c: In function 'move_page_tables':
> >> mm/mremap.c:197:7: error: implicit declaration of function 'pmd_related' [-Werror=implicit-function-declaration]
> if (pmd_related(*old_pmd)) {
> ^~~~~~~~~~~
> In file included from include/asm-generic/bug.h:4:0,
> from arch/x86/include/asm/bug.h:35,
> from include/linux/bug.h:4,
> from include/linux/mmdebug.h:4,
> from include/linux/mm.h:8,
> from mm/mremap.c:10:
> >> include/linux/compiler.h:518:38: error: call to '__compiletime_assert_198' declared with attribute error: BUILD_BUG failed
> _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
> ^
> include/linux/compiler.h:501:4: note: in definition of macro '__compiletime_assert'
> prefix ## suffix(); \
> ^~~~~~
> include/linux/compiler.h:518:2: note: in expansion of macro '_compiletime_assert'
> _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
> ^~~~~~~~~~~~~~~~~~~
> include/linux/bug.h:54:37: note: in expansion of macro 'compiletime_assert'
> #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
> ^~~~~~~~~~~~~~~~~~
> include/linux/bug.h:88:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
> #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
> ^~~~~~~~~~~~~~~~
> >> include/linux/huge_mm.h:183:27: note: in expansion of macro 'BUILD_BUG'
> #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
> ^~~~~~~~~
> >> mm/mremap.c:198:18: note: in expansion of macro 'HPAGE_PMD_SIZE'
> if (extent == HPAGE_PMD_SIZE) {
> ^~~~~~~~~~~~~~
> cc1: some warnings being treated as errors
HPAGE_PMD_SIZE is available only in CONFIG_TRANSPARENT_HUGEPAGE=y, and
this code looks to violate the rule, but it is in if (pmd_related()) block
which is compiled out in CONFIG_TRANSPARENT_HUGEPAGE=n, so this is OK
only with the above change.
Thanks,
Naoya Horiguchi
> --
> mm/madvise.c: In function 'madvise_free_pte_range':
> >> mm/madvise.c:277:6: error: implicit declaration of function 'pmd_related' [-Werror=implicit-function-declaration]
> if (pmd_related(*pmd))
> ^~~~~~~~~~~
> cc1: some warnings being treated as errors
>
> vim +/pmd_related +1002 mm/memory.c
>
> 996 dst_pmd = pmd_alloc(dst_mm, dst_pud, addr);
> 997 if (!dst_pmd)
> 998 return -ENOMEM;
> 999 src_pmd = pmd_offset(src_pud, addr);
> 1000 do {
> 1001 next = pmd_addr_end(addr, end);
> > 1002 if (pmd_related(*src_pmd)) {
> 1003 int err;
> 1004 VM_BUG_ON(next-addr != HPAGE_PMD_SIZE);
> 1005 err = copy_huge_pmd(dst_mm, src_mm,
>
> ---
> 0-DAY kernel test infrastructure Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all Intel Corporation
Fix the following warning:
mm/memory_hotplug.c: In function 'try_offline_node':
mm/memory_hotplug.c:2131:6: warning: unused variable 'i' [-Wunused-variable]
int i;
^
Reported-by: kbuild test robot <[email protected]>
Signed-off-by: Naoya Horiguchi <[email protected]>
---
mm/memory_hotplug.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 534348ddd285..d612a75ceec4 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -2128,7 +2128,6 @@ void try_offline_node(int nid)
unsigned long start_pfn = pgdat->node_start_pfn;
unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
unsigned long pfn;
- int i;
for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
unsigned long section_nr = pfn_to_section_nr(pfn);
--
2.7.4
On Tue, Nov 08, 2016 at 08:30:10AM +0800, kbuild test robot wrote:
> Hi Naoya,
>
> [auto build test WARNING on mmotm/master]
> [also build test WARNING on next-20161028]
> [cannot apply to v4.9-rc4]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url: https://github.com/0day-ci/linux/commits/Naoya-Horiguchi/mm-x86-move-_PAGE_SWP_SOFT_DIRTY-from-bit-7-to-bit-6/20161108-080615
> base: git://git.cmpxchg.org/linux-mmotm.git master
> config: x86_64-randconfig-x003-201645 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=x86_64
>
> All warnings (new ones prefixed by >>):
>
> mm/memory_hotplug.c: In function 'try_offline_node':
> mm/memory_hotplug.c:2131:6: warning: unused variable 'i' [-Wunused-variable]
> int i;
> ^
> In file included from include/uapi/linux/stddef.h:1:0,
> from include/linux/stddef.h:4,
> from mm/memory_hotplug.c:7:
This seems unrelated to my patchset, but the fix is easy.
I'll post a separate patch later.
> mm/memory_hotplug.c: In function 'new_node_page':
> include/linux/compiler.h:518:38: error: call to '__compiletime_assert_1575' declared with attribute error: BUILD_BUG failed
> _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
> ^
> include/linux/compiler.h:160:16: note: in definition of macro '__trace_if'
> ______r = !!(cond); \
> ^~~~
> >> mm/memory_hotplug.c:1575:2: note: in expansion of macro 'if'
> if (new_page && order == HPAGE_PMD_ORDER)
> ^~
> include/linux/compiler.h:506:2: note: in expansion of macro '__compiletime_assert'
> __compiletime_assert(condition, msg, prefix, suffix)
> ^~~~~~~~~~~~~~~~~~~~
> include/linux/compiler.h:518:2: note: in expansion of macro '_compiletime_assert'
> _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
> ^~~~~~~~~~~~~~~~~~~
> include/linux/bug.h:54:37: note: in expansion of macro 'compiletime_assert'
> #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
> ^~~~~~~~~~~~~~~~~~
> include/linux/bug.h:88:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
> #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
> ^~~~~~~~~~~~~~~~
> include/linux/huge_mm.h:181:28: note: in expansion of macro 'BUILD_BUG'
> #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
> ^~~~~~~~~
> include/linux/huge_mm.h:56:26: note: in expansion of macro 'HPAGE_PMD_SHIFT'
> #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
> ^~~~~~~~~~~~~~~
> mm/memory_hotplug.c:1575:27: note: in expansion of macro 'HPAGE_PMD_ORDER'
> if (new_page && order == HPAGE_PMD_ORDER)
> ^~~~~~~~~~~~~~~
HPAGE_PMD_ORDER is not available in non-thp code now, so let's add
a simple wrapper to access it in generic code.
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3c252cdef587..b75a9a1bbf3e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -148,6 +148,12 @@ static inline int hpage_nr_pages(struct page *page)
return HPAGE_PMD_NR;
return 1;
}
+static inline int hpage_order(struct page *page)
+{
+ if (unlikely(PageTransHuge(page)))
+ return HPAGE_PMD_ORDER;
+ return 0;
+}
extern int do_huge_pmd_numa_page(struct fault_env *fe, pmd_t orig_pmd);
@@ -183,6 +189,7 @@ static inline bool thp_migration_supported(void)
#define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
#define hpage_nr_pages(x) 1
+#define hpage_order(x) 0
#define transparent_hugepage_enabled(__vma) 0
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index a9c3fe1b55ea..d612a75ceec4 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1555,7 +1555,7 @@ static struct page *new_node_page(struct page *page, unsigned long private,
next_node_in(nid, nmask));
if (thp_migration_supported() && PageTransHuge(page)) {
- order = HPAGE_PMD_ORDER;
+ order = hpage_order(page);
gfp_mask |= GFP_TRANSHUGE;
}
@@ -1572,7 +1572,7 @@ static struct page *new_node_page(struct page *page, unsigned long private,
new_page = __alloc_pages(gfp_mask, order,
node_zonelist(nid, gfp_mask));
- if (new_page && order == HPAGE_PMD_ORDER)
+ if (new_page && order == hpage_order(page))
prep_transhuge_page(new_page);
return new_page;
@@ -1606,7 +1606,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
continue;
} else if (thp_migration_supported() && PageTransHuge(page))
pfn = page_to_pfn(compound_head(page))
- + HPAGE_PMD_NR - 1;
+ + hpage_nr_pages(page) - 1;
if (!get_page_unless_zero(page))
continue;
These changes are applied in the next version.
Thanks,
Naoya Horiguchi
>
> vim +/if +1575 mm/memory_hotplug.c
>
> 1559 gfp_mask |= GFP_TRANSHUGE;
> 1560 }
> 1561
> 1562 node_clear(nid, nmask);
> 1563
> 1564 if (PageHighMem(page)
> 1565 || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
> 1566 gfp_mask |= __GFP_HIGHMEM;
> 1567
> 1568 if (!nodes_empty(nmask))
> 1569 new_page = __alloc_pages_nodemask(gfp_mask, order,
> 1570 node_zonelist(nid, gfp_mask), &nmask);
> 1571 if (!new_page)
> 1572 new_page = __alloc_pages(gfp_mask, order,
> 1573 node_zonelist(nid, gfp_mask));
> 1574
> > 1575 if (new_page && order == HPAGE_PMD_ORDER)
> 1576 prep_transhuge_page(new_page);
> 1577
> 1578 return new_page;
> 1579 }
> 1580
> 1581 #define NR_OFFLINE_AT_ONCE_PAGES (256)
> 1582 static int
> 1583 do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>
> ---
> 0-DAY kernel test infrastructure Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all Intel Corporation
Hi Naoya,
[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.9-rc4 next-20161028]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Naoya-Horiguchi/mm-x86-move-_PAGE_SWP_SOFT_DIRTY-from-bit-7-to-bit-6/20161108-080615
base: git://git.cmpxchg.org/linux-mmotm.git master
config: arm-at91_dt_defconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=arm
All errors (new ones prefixed by >>):
In file included from fs/proc/task_mmu.c:14:0:
include/linux/swapops.h: In function 'swp_entry_to_pmd':
>> include/linux/swapops.h:216:14: error: empty scalar initializer
pmd_t pmd = {};
^
include/linux/swapops.h:216:14: note: (near initialization for 'pmd')
vim +216 include/linux/swapops.h
210 {
211 return swp_entry(0, 0);
212 }
213
214 static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
215 {
> 216 pmd_t pmd = {};
217
218 return pmd;
219 }
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
On Tue, Nov 08, 2016 at 11:05:52AM +0800, kbuild test robot wrote:
> Hi Naoya,
>
> [auto build test ERROR on mmotm/master]
> [also build test ERROR on v4.9-rc4 next-20161028]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url: https://github.com/0day-ci/linux/commits/Naoya-Horiguchi/mm-x86-move-_PAGE_SWP_SOFT_DIRTY-from-bit-7-to-bit-6/20161108-080615
> base: git://git.cmpxchg.org/linux-mmotm.git master
> config: arm-at91_dt_defconfig (attached as .config)
> compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
> reproduce:
> wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> make.cross ARCH=arm
>
> All errors (new ones prefixed by >>):
>
> In file included from fs/proc/task_mmu.c:14:0:
> include/linux/swapops.h: In function 'swp_entry_to_pmd':
> >> include/linux/swapops.h:216:14: error: empty scalar initializer
> pmd_t pmd = {};
> ^
> include/linux/swapops.h:216:14: note: (near initialization for 'pmd')
>
> vim +216 include/linux/swapops.h
>
> 210 {
> 211 return swp_entry(0, 0);
> 212 }
> 213
> 214 static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
> 215 {
> > 216 pmd_t pmd = {};
> 217
> 218 return pmd;
> 219 }
Here is an alternative:
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index db8a858cc6ff..748c9233b3a5 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -215,9 +215,7 @@ static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
{
- pmd_t pmd = {};
-
- return pmd;
+ return (pmd_t) { 0 };
}
static inline int is_pmd_migration_entry(pmd_t pmd)
Thanks,
Naoya Horiguchi
>
> ---
> 0-DAY kernel test infrastructure Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all Intel Corporation
On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
> This patch prepares thp migration's core code. These code will be open when
> unmap_and_move() stops unconditionally splitting thp and get_new_page() starts
> to allocate destination thps.
>
> Signed-off-by: Naoya Horiguchi <[email protected]>
> ---
> ChangeLog v1 -> v2:
> - support pte-mapped thp, doubly-mapped thp
> ---
> arch/x86/include/asm/pgtable_64.h | 2 +
> include/linux/swapops.h | 61 +++++++++++++++
> mm/huge_memory.c | 154 ++++++++++++++++++++++++++++++++++++++
> mm/migrate.c | 44 ++++++++++-
> mm/pgtable-generic.c | 3 +-
> 5 files changed, 262 insertions(+), 2 deletions(-)
>
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable_64.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable_64.h
> index 1cc82ec..3a1b48e 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable_64.h
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable_64.h
> @@ -167,7 +167,9 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
> ((type) << (SWP_TYPE_FIRST_BIT)) \
> | ((offset) << SWP_OFFSET_FIRST_BIT) })
> #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val((pte)) })
> +#define __pmd_to_swp_entry(pte) ((swp_entry_t) { pmd_val((pmd)) })
The above macro takes *pte* but evaluates on *pmd*, guess its should
be fixed either way.
On Tue, Nov 08, 2016 at 01:43:54PM +0530, Anshuman Khandual wrote:
> On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
> > This patch prepares thp migration's core code. These code will be open when
> > unmap_and_move() stops unconditionally splitting thp and get_new_page() starts
> > to allocate destination thps.
> >
> > Signed-off-by: Naoya Horiguchi <[email protected]>
> > ---
> > ChangeLog v1 -> v2:
> > - support pte-mapped thp, doubly-mapped thp
> > ---
> > arch/x86/include/asm/pgtable_64.h | 2 +
> > include/linux/swapops.h | 61 +++++++++++++++
> > mm/huge_memory.c | 154 ++++++++++++++++++++++++++++++++++++++
> > mm/migrate.c | 44 ++++++++++-
> > mm/pgtable-generic.c | 3 +-
> > 5 files changed, 262 insertions(+), 2 deletions(-)
> >
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable_64.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable_64.h
> > index 1cc82ec..3a1b48e 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable_64.h
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable_64.h
> > @@ -167,7 +167,9 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
> > ((type) << (SWP_TYPE_FIRST_BIT)) \
> > | ((offset) << SWP_OFFSET_FIRST_BIT) })
> > #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val((pte)) })
> > +#define __pmd_to_swp_entry(pte) ((swp_entry_t) { pmd_val((pmd)) })
>
> The above macro takes *pte* but evaluates on *pmd*, guess its should
> be fixed either way.
Right, I fix it.
Thank you very much.
- Naoya
On 08/11/16 10:31, Naoya Horiguchi wrote:
> Hi everyone,
>
> I've updated thp migration patches for v4.9-rc2-mmotm-2016-10-27-18-27
> with feedbacks for ver.1.
>
> General description (no change since ver.1)
> ===========================================
>
> This patchset enhances page migration functionality to handle thp migration
> for various page migration's callers:
> - mbind(2)
> - move_pages(2)
> - migrate_pages(2)
> - cgroup/cpuset migration
> - memory hotremove
> - soft offline
>
> The main benefit is that we can avoid unnecessary thp splits, which helps us
> avoid performance decrease when your applications handles NUMA optimization on
> their own.
>
> The implementation is similar to that of normal page migration, the key point
> is that we modify a pmd to a pmd migration entry in swap-entry like format.
>
> Changes / Notes
> ===============
>
> - pmd_present() in x86 checks _PAGE_PRESENT, _PAGE_PROTNONE and _PAGE_PSE
> bits together, which makes implementing thp migration a bit hard because
> _PAGE_PSE bit is currently used by soft-dirty in swap-entry format.
> I was advised to dropping _PAGE_PSE in pmd_present(), but I don't think
> of the justification, so I keep it in this version. Instead, my approach
> is to move _PAGE_SWP_SOFT_DIRTY to bit 6 (unused) and reserve bit 7 for
> pmd non-present cases.
Thanks, IIRC
pmd_present = _PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE
AutoNUMA balancing would change it to
pmd_present = _PAGE_PROTNONE | _PAGE_PSE
and PMD_SWP_SOFT_DIRTY would make it
pmd_present = _PAGE_PSE
What you seem to be suggesting in your comment is that
pmd_present should be _PAGE_PRESENT | _PAGE_PROTNONE
Isn't that good enough?
For THP migration I guess we use
_PAGE_PRESENT | _PAGE_PROTNONE | is_migration_entry(pmd)
>
> - this patchset still covers only x86_64. Zi Yan posted a patch for ppc64
> and I think it's favorably received so that's fine. But there's unsolved
> minor suggestion by Aneesh, so I don't include it in this set, expecting
> that it will be updated/reposted.
>
> - pte-mapped thp and doubly-mapped thp were not supported in ver.1, but
> this version should work for such kinds of thp.
>
> - thp page cache is not tested yet, and it's at the head of my todo list
> for future version.
>
> Any comments or advices are welcomed.
Balbir Singh
On Wed, Nov 09, 2016 at 01:32:04PM +1100, Balbir Singh wrote:
> On 08/11/16 10:31, Naoya Horiguchi wrote:
> > Hi everyone,
> >
> > I've updated thp migration patches for v4.9-rc2-mmotm-2016-10-27-18-27
> > with feedbacks for ver.1.
> >
> > General description (no change since ver.1)
> > ===========================================
> >
> > This patchset enhances page migration functionality to handle thp migration
> > for various page migration's callers:
> > - mbind(2)
> > - move_pages(2)
> > - migrate_pages(2)
> > - cgroup/cpuset migration
> > - memory hotremove
> > - soft offline
> >
> > The main benefit is that we can avoid unnecessary thp splits, which helps us
> > avoid performance decrease when your applications handles NUMA optimization on
> > their own.
> >
> > The implementation is similar to that of normal page migration, the key point
> > is that we modify a pmd to a pmd migration entry in swap-entry like format.
> >
> > Changes / Notes
> > ===============
> >
> > - pmd_present() in x86 checks _PAGE_PRESENT, _PAGE_PROTNONE and _PAGE_PSE
> > bits together, which makes implementing thp migration a bit hard because
> > _PAGE_PSE bit is currently used by soft-dirty in swap-entry format.
> > I was advised to dropping _PAGE_PSE in pmd_present(), but I don't think
> > of the justification, so I keep it in this version. Instead, my approach
> > is to move _PAGE_SWP_SOFT_DIRTY to bit 6 (unused) and reserve bit 7 for
> > pmd non-present cases.
>
> Thanks, IIRC
>
> pmd_present = _PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE
>
> AutoNUMA balancing would change it to
>
> pmd_present = _PAGE_PROTNONE | _PAGE_PSE
>
> and PMD_SWP_SOFT_DIRTY would make it
>
> pmd_present = _PAGE_PSE
>
> What you seem to be suggesting in your comment is that
>
> pmd_present should be _PAGE_PRESENT | _PAGE_PROTNONE
This (no _PAGE_PSE) was a possibile solution, and as I described I gave up
this solution, because I noticed that what I actually wanted was that
pmd_present() certainly returns false during thp migration and that's done
by moving _PAGE_SWP_SOFT_DIRTY. So
pmd_present = _PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE
is still correct in this patchset.
>
> Isn't that good enough?
>
> For THP migration I guess we use
>
> _PAGE_PRESENT | _PAGE_PROTNONE | is_migration_entry(pmd)
Though I might misread your notations, I hope that the following code
seems describe itself well.
static inline int is_pmd_migration_entry(pmd_t pmd)
{
return !pmd_present(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
}
Thanks,
Naoya Horiguchi
>
>
> >
> > - this patchset still covers only x86_64. Zi Yan posted a patch for ppc64
> > and I think it's favorably received so that's fine. But there's unsolved
> > minor suggestion by Aneesh, so I don't include it in this set, expecting
> > that it will be updated/reposted.
> >
> > - pte-mapped thp and doubly-mapped thp were not supported in ver.1, but
> > this version should work for such kinds of thp.
> >
> > - thp page cache is not tested yet, and it's at the head of my todo list
> > for future version.
> >
> > Any comments or advices are welcomed.
>
> Balbir Singh
>
On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
> Hi everyone,
>
> I've updated thp migration patches for v4.9-rc2-mmotm-2016-10-27-18-27
> with feedbacks for ver.1.
>
> General description (no change since ver.1)
> ===========================================
>
> This patchset enhances page migration functionality to handle thp migration
> for various page migration's callers:
> - mbind(2)
> - move_pages(2)
> - migrate_pages(2)
> - cgroup/cpuset migration
> - memory hotremove
> - soft offline
>
> The main benefit is that we can avoid unnecessary thp splits, which helps us
> avoid performance decrease when your applications handles NUMA optimization on
> their own.
>
> The implementation is similar to that of normal page migration, the key point
> is that we modify a pmd to a pmd migration entry in swap-entry like format.
Will it be better to have new THP_MIGRATE_SUCCESS and THP_MIGRATE_FAIL
VM events to capture how many times the migration worked without first
splitting the huge page and how many time it did not work ? Also do you
have a test case which demonstrates this THP migration and kind of shows
its better than the present split and move method ?
On 09/11/16 15:59, Naoya Horiguchi wrote:
> On Wed, Nov 09, 2016 at 01:32:04PM +1100, Balbir Singh wrote:
>> On 08/11/16 10:31, Naoya Horiguchi wrote:
>>> Hi everyone,
>>>
>>> I've updated thp migration patches for v4.9-rc2-mmotm-2016-10-27-18-27
>>> with feedbacks for ver.1.
>>>
>>> General description (no change since ver.1)
>>> ===========================================
>>>
>>> This patchset enhances page migration functionality to handle thp migration
>>> for various page migration's callers:
>>> - mbind(2)
>>> - move_pages(2)
>>> - migrate_pages(2)
>>> - cgroup/cpuset migration
>>> - memory hotremove
>>> - soft offline
>>>
>>> The main benefit is that we can avoid unnecessary thp splits, which helps us
>>> avoid performance decrease when your applications handles NUMA optimization on
>>> their own.
>>>
>>> The implementation is similar to that of normal page migration, the key point
>>> is that we modify a pmd to a pmd migration entry in swap-entry like format.
>>>
>>> Changes / Notes
>>> ===============
>>>
>>> - pmd_present() in x86 checks _PAGE_PRESENT, _PAGE_PROTNONE and _PAGE_PSE
>>> bits together, which makes implementing thp migration a bit hard because
>>> _PAGE_PSE bit is currently used by soft-dirty in swap-entry format.
>>> I was advised to dropping _PAGE_PSE in pmd_present(), but I don't think
>>> of the justification, so I keep it in this version. Instead, my approach
>>> is to move _PAGE_SWP_SOFT_DIRTY to bit 6 (unused) and reserve bit 7 for
>>> pmd non-present cases.
>>
>> Thanks, IIRC
>>
>> pmd_present = _PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE
>>
>> AutoNUMA balancing would change it to
>>
>> pmd_present = _PAGE_PROTNONE | _PAGE_PSE
>>
>> and PMD_SWP_SOFT_DIRTY would make it
>>
>> pmd_present = _PAGE_PSE
>>
>> What you seem to be suggesting in your comment is that
>>
>> pmd_present should be _PAGE_PRESENT | _PAGE_PROTNONE
>
> This (no _PAGE_PSE) was a possibile solution, and as I described I gave up
> this solution, because I noticed that what I actually wanted was that
> pmd_present() certainly returns false during thp migration and that's done
> by moving _PAGE_SWP_SOFT_DIRTY. So
>
> pmd_present = _PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE
>
> is still correct in this patchset.
>
Thanks, I was wondering if there is any advantage or you felt its
safer not to change pmd_present().
>>
>> Isn't that good enough?
>>
>> For THP migration I guess we use
>>
>> _PAGE_PRESENT | _PAGE_PROTNONE | is_migration_entry(pmd)
>
> Though I might misread your notations, I hope that the following code
> seems describe itself well.
>
> static inline int is_pmd_migration_entry(pmd_t pmd)
> {
> return !pmd_present(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
> }
>
Thanks, yes my notation is not the best.
Balbir Singh.
Hi Anshuman,
On Wed, Nov 09, 2016 at 04:03:04PM +0530, Anshuman Khandual wrote:
> On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
> > Hi everyone,
> >
> > I've updated thp migration patches for v4.9-rc2-mmotm-2016-10-27-18-27
> > with feedbacks for ver.1.
> >
> > General description (no change since ver.1)
> > ===========================================
> >
> > This patchset enhances page migration functionality to handle thp migration
> > for various page migration's callers:
> > - mbind(2)
> > - move_pages(2)
> > - migrate_pages(2)
> > - cgroup/cpuset migration
> > - memory hotremove
> > - soft offline
> >
> > The main benefit is that we can avoid unnecessary thp splits, which helps us
> > avoid performance decrease when your applications handles NUMA optimization on
> > their own.
> >
> > The implementation is similar to that of normal page migration, the key point
> > is that we modify a pmd to a pmd migration entry in swap-entry like format.
>
> Will it be better to have new THP_MIGRATE_SUCCESS and THP_MIGRATE_FAIL
> VM events to capture how many times the migration worked without first
> splitting the huge page and how many time it did not work ?
Thank you for the suggestion.
I think that's helpful, so will try it in next version.
> Also do you
> have a test case which demonstrates this THP migration and kind of shows
> its better than the present split and move method ?
I don't have test cases which compare thp migration and split-then-migration
with some numbers. Maybe measuring/comparing the overhead of migration is
a good start point, although I think the real benefit of thp migration comes
from workload "after migration" by avoiding thp split.
Thanks,
Naoya Horiguchi
On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
> This patch prepares thp migration's core code. These code will be open when
> unmap_and_move() stops unconditionally splitting thp and get_new_page() starts
> to allocate destination thps.
>
Snip
> Signed-off-by: Naoya Horiguchi <[email protected]>
> ---
> ChangeLog v1 -> v2:
> - support pte-mapped thp, doubly-mapped thp
> ---
> arch/x86/include/asm/pgtable_64.h | 2 +
> include/linux/swapops.h | 61 +++++++++++++++
> mm/huge_memory.c | 154 ++++++++++++++++++++++++++++++++++++++
> mm/migrate.c | 44 ++++++++++-
> mm/pgtable-generic.c | 3 +-
> 5 files changed, 262 insertions(+), 2 deletions(-)
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/pgtable-generic.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/pgtable-generic.c
> index 71c5f91..6012343 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/pgtable-generic.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/pgtable-generic.c
> @@ -118,7 +118,8 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
> {
> pmd_t pmd;
> VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> - VM_BUG_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
> + VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
> + !pmd_devmap(*pmdp))
Its a valid VM_BUG_ON check but is it related to THP migration or
just a regular fix up ?
On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
> TTU_MIGRATION is used to convert pte into migration entry until thp split
> completes. This behavior conflicts with thp migration added later patches,
Hmm, could you please explain why it conflicts with the PMD based
migration without split ? Why TTU_MIGRATION cannot be used to
freeze/hold on the PMD while it's being migrated ?
On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
> If one of callers of page migration starts to handle thp, memory management code
> start to see pmd migration entry, so we need to prepare for it before enabling.
> This patch changes various code point which checks the status of given pmds in
> order to prevent race between thp migration and the pmd-related works.
There are lot of changes in this one patch. Should not we split
this up into multiple patches and explain them in a bit detail
through their commit messages ?
On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
> Soft dirty bit is designed to keep tracked over page migration. This patch
Small nit here. s/tracked/track/
On Thu, Nov 10, 2016 at 02:01:56PM +0530, Anshuman Khandual wrote:
> On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
> > TTU_MIGRATION is used to convert pte into migration entry until thp split
> > completes. This behavior conflicts with thp migration added later patches,
>
> Hmm, could you please explain why it conflicts with the PMD based
> migration without split ? Why TTU_MIGRATION cannot be used to
> freeze/hold on the PMD while it's being migrated ?
try_to_unmap() is used both for thp split (via freeze_page()) and page
migration (via __unmap_and_move()). In freeze_page(), ttu_flag given for
head page is like below (assuming anonymous thp):
(TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
TTU_MIGRATION | TTU_SPLIT_HUGE_PMD)
and ttu_flag given for tail pages is:
(TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
TTU_MIGRATION)
__unmap_and_move() calls try_to_unmap() with ttu_flag:
(TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS)
Now I'm trying to insert a branch for thp migration at the top of
try_to_unmap_one() like below
static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
unsigned long address, void *arg)
{
...
if (flags & TTU_MIGRATION) {
if (!PageHuge(page) && PageTransCompound(page)) {
set_pmd_migration_entry(page, vma, address);
goto out;
}
}
, so try_to_unmap() for tail pages called by thp split can go into thp
migration code path (which converts *pmd* into migration entry), while
the expectation is to freeze thp (which converts *pte* into migration entry.)
I detected this failure as a "bad page state" error in a testcase where
split_huge_page() is called from queue_pages_pte_range().
Anyway, I'll add this explanation into the patch description in the next post.
Thanks,
Naoya Horiguchi
On Thu, Nov 10, 2016 at 02:06:14PM +0530, Anshuman Khandual wrote:
> On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
> > If one of callers of page migration starts to handle thp, memory management code
> > start to see pmd migration entry, so we need to prepare for it before enabling.
> > This patch changes various code point which checks the status of given pmds in
> > order to prevent race between thp migration and the pmd-related works.
>
> There are lot of changes in this one patch. Should not we split
> this up into multiple patches and explain them in a bit detail
> through their commit messages ?
Yes, and I admit that I might change more than necessary, if the context
never encounters migration entry for any reason, no change is needed.
I'll dig more detail.
- Naoya Horiguchi
On Thu, Nov 10, 2016 at 02:08:38PM +0530, Anshuman Khandual wrote:
> On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
> > Soft dirty bit is designed to keep tracked over page migration. This patch
>
> Small nit here. s/tracked/track/
>
Fixed. thank you :)
- Naoya
Hi Hillf,
On Thu, Nov 10, 2016 at 05:08:07PM +0800, Hillf Danton wrote:
> On Tuesday, November 08, 2016 7:32 AM Naoya Horiguchi wrote:
> >
> > @@ -1013,6 +1027,9 @@ int do_huge_pmd_wp_page(struct fault_env *fe, pmd_t orig_pmd)
> > if (unlikely(!pmd_same(*fe->pmd, orig_pmd)))
> > goto out_unlock;
> >
> > + if (unlikely(!pmd_present(orig_pmd)))
> > + goto out_unlock;
> > +
>
> Can we encounter a migration entry after acquiring ptl ?
I think we can. thp migration code releases ptl after converting pmd into
migration entry, so other code can see it even within ptl.
Thanks,
Naoya Horiguchi
On Tuesday, November 08, 2016 7:32 AM Naoya Horiguchi wrote:
>
> @@ -1013,6 +1027,9 @@ int do_huge_pmd_wp_page(struct fault_env *fe, pmd_t orig_pmd)
> if (unlikely(!pmd_same(*fe->pmd, orig_pmd)))
> goto out_unlock;
>
> + if (unlikely(!pmd_present(orig_pmd)))
> + goto out_unlock;
> +
Can we encounter a migration entry after acquiring ptl ?
> page = pmd_page(orig_pmd);
> VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
> /*
[...]
> @@ -3591,6 +3591,10 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> int ret;
>
> barrier();
> + if (unlikely(is_pmd_migration_entry(orig_pmd))) {
> + pmd_migration_entry_wait(mm, fe.pmd);
> + return 0;
> + }
> if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
> if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
> return do_huge_pmd_numa_page(&fe, orig_pmd);
On Thursday, November 10, 2016 5:22 PM Naoya Horiguchi wrote:
> On Thu, Nov 10, 2016 at 05:08:07PM +0800, Hillf Danton wrote:
> > On Tuesday, November 08, 2016 7:32 AM Naoya Horiguchi wrote:
> > >
> > > @@ -1013,6 +1027,9 @@ int do_huge_pmd_wp_page(struct fault_env *fe, pmd_t orig_pmd)
> > > if (unlikely(!pmd_same(*fe->pmd, orig_pmd)))
> > > goto out_unlock;
> > >
> > > + if (unlikely(!pmd_present(orig_pmd)))
> > > + goto out_unlock;
> > > +
> >
> > Can we encounter a migration entry after acquiring ptl ?
>
> I think we can. thp migration code releases ptl after converting pmd into
> migration entry, so other code can see it even within ptl.
>
But we have a pmd_same check there, you see.
On Thu, Nov 10, 2016 at 05:28:20PM +0800, Hillf Danton wrote:
> On Thursday, November 10, 2016 5:22 PM Naoya Horiguchi wrote:
> > On Thu, Nov 10, 2016 at 05:08:07PM +0800, Hillf Danton wrote:
> > > On Tuesday, November 08, 2016 7:32 AM Naoya Horiguchi wrote:
> > > >
> > > > @@ -1013,6 +1027,9 @@ int do_huge_pmd_wp_page(struct fault_env *fe, pmd_t orig_pmd)
> > > > if (unlikely(!pmd_same(*fe->pmd, orig_pmd)))
> > > > goto out_unlock;
> > > >
> > > > + if (unlikely(!pmd_present(orig_pmd)))
> > > > + goto out_unlock;
> > > > +
> > >
> > > Can we encounter a migration entry after acquiring ptl ?
> >
> > I think we can. thp migration code releases ptl after converting pmd into
> > migration entry, so other code can see it even within ptl.
> >
> But we have a pmd_same check there, you see.
You're right. So we can omit this pmd_present check.
Thanks,
Naoya Horiguchi
On 08/11/16 10:31, Naoya Horiguchi wrote:
> This patch enables thp migration for soft offline.
>
> Signed-off-by: Naoya Horiguchi <[email protected]>
> ---
> mm/memory-failure.c | 31 ++++++++++++-------------------
> 1 file changed, 12 insertions(+), 19 deletions(-)
>
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory-failure.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory-failure.c
> index 19e796d..6cc8157 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory-failure.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory-failure.c
> @@ -1485,7 +1485,17 @@ static struct page *new_page(struct page *p, unsigned long private, int **x)
> if (PageHuge(p))
> return alloc_huge_page_node(page_hstate(compound_head(p)),
> nid);
> - else
> + else if (thp_migration_supported() && PageTransHuge(p)) {
> + struct page *thp;
> +
> + thp = alloc_pages_node(nid,
> + (GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
> + HPAGE_PMD_ORDER);
> + if (!thp)
> + return NULL;
Just wondering if new_page() fails, migration of that entry fails. Do we then
split and migrate? I guess this applies to THP migration in general.
Balbir Singh
On 9 Nov 2016, at 18:52, Naoya Horiguchi wrote:
> Hi Anshuman,
>
> On Wed, Nov 09, 2016 at 04:03:04PM +0530, Anshuman Khandual wrote:
>> On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
>>> Hi everyone,
>>>
>>> I've updated thp migration patches for v4.9-rc2-mmotm-2016-10-27-18-27
>>> with feedbacks for ver.1.
>>>
>>> General description (no change since ver.1)
>>> ===========================================
>>>
>>> This patchset enhances page migration functionality to handle thp migration
>>> for various page migration's callers:
>>> - mbind(2)
>>> - move_pages(2)
>>> - migrate_pages(2)
>>> - cgroup/cpuset migration
>>> - memory hotremove
>>> - soft offline
>>>
>>> The main benefit is that we can avoid unnecessary thp splits, which helps us
>>> avoid performance decrease when your applications handles NUMA optimization on
>>> their own.
>>>
>>> The implementation is similar to that of normal page migration, the key point
>>> is that we modify a pmd to a pmd migration entry in swap-entry like format.
>>
>> Will it be better to have new THP_MIGRATE_SUCCESS and THP_MIGRATE_FAIL
>> VM events to capture how many times the migration worked without first
>> splitting the huge page and how many time it did not work ?
>
> Thank you for the suggestion.
> I think that's helpful, so will try it in next version.
>
>> Also do you
>> have a test case which demonstrates this THP migration and kind of shows
>> its better than the present split and move method ?
>
> I don't have test cases which compare thp migration and split-then-migration
> with some numbers. Maybe measuring/comparing the overhead of migration is
> a good start point, although I think the real benefit of thp migration comes
> from workload "after migration" by avoiding thp split.
Migrating 4KB pages has much lower (~1/3) throughput than 2MB pages.
What I get is that on average it takes 1987.38 us to migrate 512 4KB pages and
658.54 us to migrate 1 2MB page.
I did the test in a two-socket Intel Xeon E5-2640v4 box. I used migrate_pages()
system call to migrate pages. MADV_NOHUGEPAGE and MADV_HUGEPAGE are used to
make 4KB and 2MB pages and each page’s flags are checked to make sure the page
size is 4KB or 2MB THP.
There is no split page. But the page migration time already tells the story.
—
Best Regards,
Yan Zi
On Thu, Nov 10, 2016 at 01:59:03PM +0530, Anshuman Khandual wrote:
> On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
> > This patch prepares thp migration's core code. These code will be open when
> > unmap_and_move() stops unconditionally splitting thp and get_new_page() starts
> > to allocate destination thps.
> >
>
> Snip
>
> > Signed-off-by: Naoya Horiguchi <[email protected]>
> > ---
> > ChangeLog v1 -> v2:
> > - support pte-mapped thp, doubly-mapped thp
> > ---
> > arch/x86/include/asm/pgtable_64.h | 2 +
> > include/linux/swapops.h | 61 +++++++++++++++
> > mm/huge_memory.c | 154 ++++++++++++++++++++++++++++++++++++++
> > mm/migrate.c | 44 ++++++++++-
> > mm/pgtable-generic.c | 3 +-
> > 5 files changed, 262 insertions(+), 2 deletions(-)
>
>
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/pgtable-generic.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/pgtable-generic.c
> > index 71c5f91..6012343 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/pgtable-generic.c
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/pgtable-generic.c
> > @@ -118,7 +118,8 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
> > {
> > pmd_t pmd;
> > VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > - VM_BUG_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
> > + VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
> > + !pmd_devmap(*pmdp))
>
> Its a valid VM_BUG_ON check but is it related to THP migration or
> just a regular fix up ?
Without this change, this VM_BUG_ON always triggers when migration happens
on normal thp and it succeeds, so I included it here.
- Naoya Horiguchi
On 11/07/2016 03:31 PM, Naoya Horiguchi wrote:
> pmd_present() checks _PAGE_PSE along with _PAGE_PRESENT to avoid false negative
> return when it races with thp spilt (during which _PAGE_PRESENT is temporary
> cleared.) I don't think that dropping _PAGE_PSE check in pmd_present() works
> well because it can hurt optimization of tlb handling in thp split.
> In the current kernel, bit 6 is not used in non-present format because nonlinear
> file mapping is obsolete, so let's move _PAGE_SWP_SOFT_DIRTY to that bit.
> Bit 7 is used as reserved (always clear), so please don't use it for other
> purpose.
...
> #ifdef CONFIG_MEM_SOFT_DIRTY
> -#define _PAGE_SWP_SOFT_DIRTY _PAGE_PSE
> +#define _PAGE_SWP_SOFT_DIRTY _PAGE_DIRTY
> #else
> #define _PAGE_SWP_SOFT_DIRTY (_AT(pteval_t, 0))
> #endif
I'm not sure this works. Take a look at commit 00839ee3b29 and the
erratum it works around. I _think_ this means that a system affected by
the erratum might see an erroneous _PAGE_SWP_SOFT_DIRTY/_PAGE_DIRTY get
set in swap ptes.
There are much worse things that can happen, but I don't think bits 5
(Accessed) and 6 (Dirty) are good choices since they're affected by the
erratum.
On Thu, Nov 10, 2016 at 09:31:10PM +1100, Balbir Singh wrote:
>
>
> On 08/11/16 10:31, Naoya Horiguchi wrote:
> > This patch enables thp migration for soft offline.
> >
> > Signed-off-by: Naoya Horiguchi <[email protected]>
> > ---
> > mm/memory-failure.c | 31 ++++++++++++-------------------
> > 1 file changed, 12 insertions(+), 19 deletions(-)
> >
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory-failure.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory-failure.c
> > index 19e796d..6cc8157 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory-failure.c
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory-failure.c
> > @@ -1485,7 +1485,17 @@ static struct page *new_page(struct page *p, unsigned long private, int **x)
> > if (PageHuge(p))
> > return alloc_huge_page_node(page_hstate(compound_head(p)),
> > nid);
> > - else
> > + else if (thp_migration_supported() && PageTransHuge(p)) {
> > + struct page *thp;
> > +
> > + thp = alloc_pages_node(nid,
> > + (GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
> > + HPAGE_PMD_ORDER);
> > + if (!thp)
> > + return NULL;
>
> Just wondering if new_page() fails, migration of that entry fails. Do we then
> split and migrate? I guess this applies to THP migration in general.
Yes, that's not implemented yet, but can be helpful.
I think that there are 2 types of callers of page migration,
one is a caller that specifies the target pages individually (like move_pages
and soft offline), and another is a caller that specifies the target pages
by (physical/virtual) address range basis.
Maybe the former ones want to fall back immediately to split and retry if
thp migration fails, and the latter ones want to retry thp migration more.
If this makes sense, we can make some more changes on retry logic to fit
the situation.
Thanks,
Naoya Horiguchi
On Thu, Nov 10, 2016 at 03:29:51PM -0800, Dave Hansen wrote:
> On 11/07/2016 03:31 PM, Naoya Horiguchi wrote:
> > pmd_present() checks _PAGE_PSE along with _PAGE_PRESENT to avoid false negative
> > return when it races with thp spilt (during which _PAGE_PRESENT is temporary
> > cleared.) I don't think that dropping _PAGE_PSE check in pmd_present() works
> > well because it can hurt optimization of tlb handling in thp split.
> > In the current kernel, bit 6 is not used in non-present format because nonlinear
> > file mapping is obsolete, so let's move _PAGE_SWP_SOFT_DIRTY to that bit.
> > Bit 7 is used as reserved (always clear), so please don't use it for other
> > purpose.
> ...
> > #ifdef CONFIG_MEM_SOFT_DIRTY
> > -#define _PAGE_SWP_SOFT_DIRTY _PAGE_PSE
> > +#define _PAGE_SWP_SOFT_DIRTY _PAGE_DIRTY
> > #else
> > #define _PAGE_SWP_SOFT_DIRTY (_AT(pteval_t, 0))
> > #endif
>
> I'm not sure this works. Take a look at commit 00839ee3b29 and the
> erratum it works around. I _think_ this means that a system affected by
> the erratum might see an erroneous _PAGE_SWP_SOFT_DIRTY/_PAGE_DIRTY get
> set in swap ptes.
>
> There are much worse things that can happen, but I don't think bits 5
> (Accessed) and 6 (Dirty) are good choices since they're affected by the
> erratum.
Thank you for the information. According to 00839ee3b29, some bits which
are safe from the errata are reclaimed, so assigning one of such bits for
_PAGE_SWP_SOFT_DIRTY seems to work. And I'll update the description.
Thanks,
Naoya Horiguchi
On 11/10/2016 02:39 PM, Naoya Horiguchi wrote:
> On Thu, Nov 10, 2016 at 02:01:56PM +0530, Anshuman Khandual wrote:
>> > On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
>>> > > TTU_MIGRATION is used to convert pte into migration entry until thp split
>>> > > completes. This behavior conflicts with thp migration added later patches,
>> >
>> > Hmm, could you please explain why it conflicts with the PMD based
>> > migration without split ? Why TTU_MIGRATION cannot be used to
>> > freeze/hold on the PMD while it's being migrated ?
> try_to_unmap() is used both for thp split (via freeze_page()) and page
> migration (via __unmap_and_move()). In freeze_page(), ttu_flag given for
> head page is like below (assuming anonymous thp):
>
> (TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
> TTU_MIGRATION | TTU_SPLIT_HUGE_PMD)
Right.
>
> and ttu_flag given for tail pages is:
>
> (TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
> TTU_MIGRATION)
Right.
>
> __unmap_and_move() calls try_to_unmap() with ttu_flag:
>
> (TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS)
>
> Now I'm trying to insert a branch for thp migration at the top of
> try_to_unmap_one() like below
>
>
> static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> unsigned long address, void *arg)
> {
> ...
> if (flags & TTU_MIGRATION) {
> if (!PageHuge(page) && PageTransCompound(page)) {
> set_pmd_migration_entry(page, vma, address);
> goto out;
> }
> }
>
> , so try_to_unmap() for tail pages called by thp split can go into thp
> migration code path (which converts *pmd* into migration entry), while
> the expectation is to freeze thp (which converts *pte* into migration entry.)
Got it.
>
> I detected this failure as a "bad page state" error in a testcase where
> split_huge_page() is called from queue_pages_pte_range().
>
> Anyway, I'll add this explanation into the patch description in the next post.
Sure, thanks for the explanation.
On 11/10/2016 07:31 PM, Zi Yan wrote:
> On 9 Nov 2016, at 18:52, Naoya Horiguchi wrote:
>
>> Hi Anshuman,
>>
>> On Wed, Nov 09, 2016 at 04:03:04PM +0530, Anshuman Khandual wrote:
>>> On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
>>>> Hi everyone,
>>>>
>>>> I've updated thp migration patches for v4.9-rc2-mmotm-2016-10-27-18-27
>>>> with feedbacks for ver.1.
>>>>
>>>> General description (no change since ver.1)
>>>> ===========================================
>>>>
>>>> This patchset enhances page migration functionality to handle thp migration
>>>> for various page migration's callers:
>>>> - mbind(2)
>>>> - move_pages(2)
>>>> - migrate_pages(2)
>>>> - cgroup/cpuset migration
>>>> - memory hotremove
>>>> - soft offline
>>>>
>>>> The main benefit is that we can avoid unnecessary thp splits, which helps us
>>>> avoid performance decrease when your applications handles NUMA optimization on
>>>> their own.
>>>>
>>>> The implementation is similar to that of normal page migration, the key point
>>>> is that we modify a pmd to a pmd migration entry in swap-entry like format.
>>>
>>> Will it be better to have new THP_MIGRATE_SUCCESS and THP_MIGRATE_FAIL
>>> VM events to capture how many times the migration worked without first
>>> splitting the huge page and how many time it did not work ?
>>
>> Thank you for the suggestion.
>> I think that's helpful, so will try it in next version.
>>
>>> Also do you
>>> have a test case which demonstrates this THP migration and kind of shows
>>> its better than the present split and move method ?
>>
>> I don't have test cases which compare thp migration and split-then-migration
>> with some numbers. Maybe measuring/comparing the overhead of migration is
>> a good start point, although I think the real benefit of thp migration comes
>> from workload "after migration" by avoiding thp split.
>
> Migrating 4KB pages has much lower (~1/3) throughput than 2MB pages.
I assume the 2MB throughput you mentioned is with this THP migration
feature enabled.
>
> What I get is that on average it takes 1987.38 us to migrate 512 4KB pages and
> 658.54 us to migrate 1 2MB page.
>
> I did the test in a two-socket Intel Xeon E5-2640v4 box. I used migrate_pages()
> system call to migrate pages. MADV_NOHUGEPAGE and MADV_HUGEPAGE are used to
> make 4KB and 2MB pages and each page’s flags are checked to make sure the page
> size is 4KB or 2MB THP.
>
> There is no split page. But the page migration time already tells the story.
Right. Just wondering if we can add a test case which measures just
this migration time improvement by avoiding the split not the TLB
based improvement which the workload will receive as an addition.
On Thu, Nov 10, 2016 at 03:29:51PM -0800, Dave Hansen wrote:
> On 11/07/2016 03:31 PM, Naoya Horiguchi wrote:
> > pmd_present() checks _PAGE_PSE along with _PAGE_PRESENT to avoid false negative
> > return when it races with thp spilt (during which _PAGE_PRESENT is temporary
> > cleared.) I don't think that dropping _PAGE_PSE check in pmd_present() works
> > well because it can hurt optimization of tlb handling in thp split.
> > In the current kernel, bit 6 is not used in non-present format because nonlinear
> > file mapping is obsolete, so let's move _PAGE_SWP_SOFT_DIRTY to that bit.
> > Bit 7 is used as reserved (always clear), so please don't use it for other
> > purpose.
> ...
> > #ifdef CONFIG_MEM_SOFT_DIRTY
> > -#define _PAGE_SWP_SOFT_DIRTY _PAGE_PSE
> > +#define _PAGE_SWP_SOFT_DIRTY _PAGE_DIRTY
> > #else
> > #define _PAGE_SWP_SOFT_DIRTY (_AT(pteval_t, 0))
> > #endif
>
> I'm not sure this works. Take a look at commit 00839ee3b29 and the
> erratum it works around. I _think_ this means that a system affected by
> the erratum might see an erroneous _PAGE_SWP_SOFT_DIRTY/_PAGE_DIRTY get
> set in swap ptes.
But, is it destructive in any way? What is the harm if we mark swap entry
dirty by mistake?
Pavel?
--
Kirill A. Shutemov
On Tue, Nov 08, 2016 at 08:31:48AM +0900, Naoya Horiguchi wrote:
> TTU_MIGRATION is used to convert pte into migration entry until thp split
> completes. This behavior conflicts with thp migration added later patches,
> so let's introduce a new TTU flag specifically for freezing.
>
> Signed-off-by: Naoya Horiguchi <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
--
Kirill A. Shutemov
On Tue, Nov 08, 2016 at 08:31:49AM +0900, Naoya Horiguchi wrote:
> +static inline bool thp_migration_supported(void)
> +{
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> + return true;
> +#else
> + return false;
> +#endif
Em..
return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
?
Otherwise looks good to me.
--
Kirill A. Shutemov
On Tue, Nov 08, 2016 at 08:31:50AM +0900, Naoya Horiguchi wrote:
> This patch prepares thp migration's core code. These code will be open when
> unmap_and_move() stops unconditionally splitting thp and get_new_page() starts
> to allocate destination thps.
>
> Signed-off-by: Naoya Horiguchi <[email protected]>
> ---
> ChangeLog v1 -> v2:
> - support pte-mapped thp, doubly-mapped thp
> ---
> arch/x86/include/asm/pgtable_64.h | 2 +
> include/linux/swapops.h | 61 +++++++++++++++
> mm/huge_memory.c | 154 ++++++++++++++++++++++++++++++++++++++
> mm/migrate.c | 44 ++++++++++-
> mm/pgtable-generic.c | 3 +-
> 5 files changed, 262 insertions(+), 2 deletions(-)
>
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable_64.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable_64.h
> index 1cc82ec..3a1b48e 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable_64.h
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable_64.h
> @@ -167,7 +167,9 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
> ((type) << (SWP_TYPE_FIRST_BIT)) \
> | ((offset) << SWP_OFFSET_FIRST_BIT) })
> #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val((pte)) })
> +#define __pmd_to_swp_entry(pte) ((swp_entry_t) { pmd_val((pmd)) })
> #define __swp_entry_to_pte(x) ((pte_t) { .pte = (x).val })
> +#define __swp_entry_to_pmd(x) ((pmd_t) { .pmd = (x).val })
>
> extern int kern_addr_valid(unsigned long addr);
> extern void cleanup_highmap(void);
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/swapops.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/swapops.h
> index 5c3a5f3..b6b22a2 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/swapops.h
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/swapops.h
> @@ -163,6 +163,67 @@ static inline int is_write_migration_entry(swp_entry_t entry)
>
> #endif
>
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> +extern void set_pmd_migration_entry(struct page *page,
> + struct vm_area_struct *vma, unsigned long address);
> +
> +extern int remove_migration_pmd(struct page *new, pmd_t *pmd,
> + struct vm_area_struct *vma, unsigned long addr, void *old);
> +
> +extern void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd);
> +
> +static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
> +{
> + swp_entry_t arch_entry;
> +
> + arch_entry = __pmd_to_swp_entry(pmd);
> + return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
> +}
> +
> +static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
> +{
> + swp_entry_t arch_entry;
> +
> + arch_entry = __swp_entry(swp_type(entry), swp_offset(entry));
> + return __swp_entry_to_pmd(arch_entry);
> +}
> +
> +static inline int is_pmd_migration_entry(pmd_t pmd)
> +{
> + return !pmd_present(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
> +}
> +#else
> +static inline void set_pmd_migration_entry(struct page *page,
> + struct vm_area_struct *vma, unsigned long address)
> +{
VM_BUG()? Or BUILD_BUG()?
> +}
> +
> +static inline int remove_migration_pmd(struct page *new, pmd_t *pmd,
> + struct vm_area_struct *vma, unsigned long addr, void *old)
> +{
> + return 0;
Ditto.
> +}
> +
> +static inline void pmd_migration_entry_wait(struct mm_struct *m, pmd_t *p) { }
> +
> +static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
> +{
> + return swp_entry(0, 0);
Ditto.
> +}
> +
> +static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
> +{
> + pmd_t pmd = {};
Ditto.
> + return pmd;
> +}
> +
> +static inline int is_pmd_migration_entry(pmd_t pmd)
> +{
> + return 0;
> +}
> +#endif
> +
> #ifdef CONFIG_MEMORY_FAILURE
>
> extern atomic_long_t num_poisoned_pages __read_mostly;
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
> index 0509d17..b3022b3 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
> @@ -2310,3 +2310,157 @@ static int __init split_huge_pages_debugfs(void)
> }
> late_initcall(split_huge_pages_debugfs);
> #endif
> +
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> +void set_pmd_migration_entry(struct page *page, struct vm_area_struct *vma,
> + unsigned long addr)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd;
> + pmd_t pmdval;
> + swp_entry_t entry;
> + spinlock_t *ptl;
> +
> + pgd = pgd_offset(mm, addr);
> + if (!pgd_present(*pgd))
> + return;
> + pud = pud_offset(pgd, addr);
> + if (!pud_present(*pud))
> + return;
> + pmd = pmd_offset(pud, addr);
> + pmdval = *pmd;
> + barrier();
> + if (!pmd_present(pmdval))
> + return;
> +
> + mmu_notifier_invalidate_range_start(mm, addr, addr + HPAGE_PMD_SIZE);
> + if (pmd_trans_huge(pmdval)) {
> + pmd_t pmdswp;
> +
> + ptl = pmd_lock(mm, pmd);
> + if (!pmd_present(*pmd))
> + goto unlock_pmd;
> + if (unlikely(!pmd_trans_huge(*pmd)))
> + goto unlock_pmd;
Just check *pmd == pmdval?
> + if (pmd_page(*pmd) != page)
> + goto unlock_pmd;
> +
> + pmdval = pmdp_huge_get_and_clear(mm, addr, pmd);
> + if (pmd_dirty(pmdval))
> + set_page_dirty(page);
> + entry = make_migration_entry(page, pmd_write(pmdval));
> + pmdswp = swp_entry_to_pmd(entry);
> + pmdswp = pmd_mkhuge(pmdswp);
> + set_pmd_at(mm, addr, pmd, pmdswp);
> + page_remove_rmap(page, true);
> + put_page(page);
> +unlock_pmd:
> + spin_unlock(ptl);
> + } else { /* pte-mapped thp */
> + pte_t *pte;
> + pte_t pteval;
> + struct page *tmp = compound_head(page);
> + unsigned long address = addr & HPAGE_PMD_MASK;
> + pte_t swp_pte;
> + int i;
> +
> + pte = pte_offset_map(pmd, address);
> + ptl = pte_lockptr(mm, pmd);
> + spin_lock(ptl);
pte_offset_map_lock() ?
> + for (i = 0; i < HPAGE_PMD_NR; i++, pte++, tmp++) {
> + if (!(pte_present(*pte) &&
> + page_to_pfn(tmp) == pte_pfn(*pte)))
if (!pte_present(*pte) || pte_page(*pte) != tmp) ?
> + continue;
> + pteval = ptep_clear_flush(vma, address, pte);
> + if (pte_dirty(pteval))
> + set_page_dirty(tmp);
> + entry = make_migration_entry(tmp, pte_write(pteval));
> + swp_pte = swp_entry_to_pte(entry);
> + set_pte_at(mm, address, pte, swp_pte);
> + page_remove_rmap(tmp, false);
> + put_page(tmp);
> + }
> + pte_unmap_unlock(pte, ptl);
> + }
> + mmu_notifier_invalidate_range_end(mm, addr, addr + HPAGE_PMD_SIZE);
> + return;
> +}
> +
> +int remove_migration_pmd(struct page *new, pmd_t *pmd,
> + struct vm_area_struct *vma, unsigned long addr, void *old)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + spinlock_t *ptl;
> + pmd_t pmde;
> + swp_entry_t entry;
> +
> + pmde = *pmd;
> + barrier();
> +
> + if (!pmd_present(pmde)) {
> + if (is_migration_entry(pmd_to_swp_entry(pmde))) {
if (!is_migration_entry(pmd_to_swp_entry(pmde)))
return SWAP_AGAIN;
And one level less indentation below.
> + unsigned long mmun_start = addr & HPAGE_PMD_MASK;
> + unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
> +
> + ptl = pmd_lock(mm, pmd);
> + entry = pmd_to_swp_entry(*pmd);
> + if (migration_entry_to_page(entry) != old)
> + goto unlock_ptl;
> + get_page(new);
> + pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));
> + if (is_write_migration_entry(entry))
> + pmde = maybe_pmd_mkwrite(pmde, vma);
> + flush_cache_range(vma, mmun_start, mmun_end);
> + page_add_anon_rmap(new, vma, mmun_start, true);
> + pmdp_huge_clear_flush_notify(vma, mmun_start, pmd);
> + set_pmd_at(mm, mmun_start, pmd, pmde);
> + flush_tlb_range(vma, mmun_start, mmun_end);
> + if (vma->vm_flags & VM_LOCKED)
> + mlock_vma_page(new);
> + update_mmu_cache_pmd(vma, addr, pmd);
> +unlock_ptl:
> + spin_unlock(ptl);
return SWAP_AGAIN;
And one level less indentation below.
> + }
> + } else { /* pte-mapped thp */
> + pte_t *ptep;
> + pte_t pte;
> + int i;
> + struct page *tmpnew = compound_head(new);
> + struct page *tmpold = compound_head((struct page *)old);
> + unsigned long address = addr & HPAGE_PMD_MASK;
> +
> + ptep = pte_offset_map(pmd, addr);
> + ptl = pte_lockptr(mm, pmd);
> + spin_lock(ptl);
pte_offset_map_lock() ?
> +
> + for (i = 0; i < HPAGE_PMD_NR;
> + i++, ptep++, tmpnew++, tmpold++, address += PAGE_SIZE) {
> + pte = *ptep;
> + if (!is_swap_pte(pte))
> + continue;
> + entry = pte_to_swp_entry(pte);
> + if (!is_migration_entry(entry) ||
> + migration_entry_to_page(entry) != tmpold)
> + continue;
> + get_page(tmpnew);
> + pte = pte_mkold(mk_pte(tmpnew,
> + READ_ONCE(vma->vm_page_prot)));
READ_ONCE()? Do we get here under mmap_sem, right?
> + if (pte_swp_soft_dirty(*ptep))
> + pte = pte_mksoft_dirty(pte);
> + if (is_write_migration_entry(entry))
> + pte = maybe_mkwrite(pte, vma);
> + flush_dcache_page(tmpnew);
> + set_pte_at(mm, address, ptep, pte);
> + if (PageAnon(new))
> + page_add_anon_rmap(tmpnew, vma, address, false);
> + else
> + page_add_file_rmap(tmpnew, false);
> + update_mmu_cache(vma, address, ptep);
> + }
> + pte_unmap_unlock(ptep, ptl);
> + }
> + return SWAP_AGAIN;
> +}
> +#endif
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/migrate.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/migrate.c
> index 66ce6b4..54f2eb6 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/migrate.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/migrate.c
> @@ -198,6 +198,8 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
> {
> struct mm_struct *mm = vma->vm_mm;
> swp_entry_t entry;
> + pgd_t *pgd;
> + pud_t *pud;
> pmd_t *pmd;
> pte_t *ptep, pte;
> spinlock_t *ptl;
> @@ -208,10 +210,29 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
> goto out;
> ptl = huge_pte_lockptr(hstate_vma(vma), mm, ptep);
> } else {
> - pmd = mm_find_pmd(mm, addr);
> + pmd_t pmde;
> +
> + pgd = pgd_offset(mm, addr);
> + if (!pgd_present(*pgd))
> + goto out;
> + pud = pud_offset(pgd, addr);
> + if (!pud_present(*pud))
> + goto out;
> + pmd = pmd_offset(pud, addr);
> if (!pmd)
> goto out;
>
> + if (PageTransCompound(new)) {
> + remove_migration_pmd(new, pmd, vma, addr, old);
> + goto out;
> + }
> +
> + pmde = *pmd;
> + barrier();
> +
> + if (!pmd_present(pmde) || pmd_trans_huge(pmde))
> + goto out;
> +
> ptep = pte_offset_map(pmd, addr);
>
> /*
> @@ -344,6 +365,27 @@ void migration_entry_wait_huge(struct vm_area_struct *vma,
> __migration_entry_wait(mm, pte, ptl);
> }
>
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> +void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
> +{
> + spinlock_t *ptl;
> + struct page *page;
> +
> + ptl = pmd_lock(mm, pmd);
> + if (!is_pmd_migration_entry(*pmd))
> + goto unlock;
> + page = migration_entry_to_page(pmd_to_swp_entry(*pmd));
> + if (!get_page_unless_zero(page))
> + goto unlock;
> + spin_unlock(ptl);
> + wait_on_page_locked(page);
> + put_page(page);
> + return;
> +unlock:
> + spin_unlock(ptl);
> +}
> +#endif
> +
> #ifdef CONFIG_BLOCK
> /* Returns true if all buffers are successfully locked */
> static bool buffer_migrate_lock_buffers(struct buffer_head *head,
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/pgtable-generic.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/pgtable-generic.c
> index 71c5f91..6012343 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/pgtable-generic.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/pgtable-generic.c
> @@ -118,7 +118,8 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
> {
> pmd_t pmd;
> VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> - VM_BUG_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
> + VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
> + !pmd_devmap(*pmdp));
> pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
> flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> return pmd;
> --
> 2.7.0
>
--
Kirill A. Shutemov
On 11/11/16 10:58, Naoya Horiguchi wrote:
> On Thu, Nov 10, 2016 at 09:31:10PM +1100, Balbir Singh wrote:
>>
>>
>> On 08/11/16 10:31, Naoya Horiguchi wrote:
>>> This patch enables thp migration for soft offline.
>>>
>>> Signed-off-by: Naoya Horiguchi <[email protected]>
>>> ---
>>> mm/memory-failure.c | 31 ++++++++++++-------------------
>>> 1 file changed, 12 insertions(+), 19 deletions(-)
>>>
>>> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory-failure.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory-failure.c
>>> index 19e796d..6cc8157 100644
>>> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory-failure.c
>>> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory-failure.c
>>> @@ -1485,7 +1485,17 @@ static struct page *new_page(struct page *p, unsigned long private, int **x)
>>> if (PageHuge(p))
>>> return alloc_huge_page_node(page_hstate(compound_head(p)),
>>> nid);
>>> - else
>>> + else if (thp_migration_supported() && PageTransHuge(p)) {
>>> + struct page *thp;
>>> +
>>> + thp = alloc_pages_node(nid,
>>> + (GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
>>> + HPAGE_PMD_ORDER);
>>> + if (!thp)
>>> + return NULL;
>>
>> Just wondering if new_page() fails, migration of that entry fails. Do we then
>> split and migrate? I guess this applies to THP migration in general.
>
> Yes, that's not implemented yet, but can be helpful.
>
> I think that there are 2 types of callers of page migration,
> one is a caller that specifies the target pages individually (like move_pages
> and soft offline), and another is a caller that specifies the target pages
> by (physical/virtual) address range basis.
> Maybe the former ones want to fall back immediately to split and retry if
> thp migration fails, and the latter ones want to retry thp migration more.
> If this makes sense, we can make some more changes on retry logic to fit
> the situation.
>
I think we definitely need the retry with split option, but may be we can
build it on top of this series.
Balbir Singh.
On Fri, Nov 11, 2016 at 02:18:40PM +0300, Kirill A. Shutemov wrote:
> On Tue, Nov 08, 2016 at 08:31:49AM +0900, Naoya Horiguchi wrote:
> > +static inline bool thp_migration_supported(void)
> > +{
> > +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> > + return true;
> > +#else
> > + return false;
> > +#endif
>
> Em..
>
> return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
Looks better, thank you.
- Naoya
On Mon, Nov 14, 2016 at 02:45:03PM +0300, Kirill A. Shutemov wrote:
> On Tue, Nov 08, 2016 at 08:31:50AM +0900, Naoya Horiguchi wrote:
> > This patch prepares thp migration's core code. These code will be open when
> > unmap_and_move() stops unconditionally splitting thp and get_new_page() starts
> > to allocate destination thps.
> >
> > Signed-off-by: Naoya Horiguchi <[email protected]>
> > ---
> > ChangeLog v1 -> v2:
> > - support pte-mapped thp, doubly-mapped thp
> > ---
> > arch/x86/include/asm/pgtable_64.h | 2 +
> > include/linux/swapops.h | 61 +++++++++++++++
> > mm/huge_memory.c | 154 ++++++++++++++++++++++++++++++++++++++
> > mm/migrate.c | 44 ++++++++++-
> > mm/pgtable-generic.c | 3 +-
> > 5 files changed, 262 insertions(+), 2 deletions(-)
> >
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable_64.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable_64.h
> > index 1cc82ec..3a1b48e 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable_64.h
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable_64.h
> > @@ -167,7 +167,9 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
> > ((type) << (SWP_TYPE_FIRST_BIT)) \
> > | ((offset) << SWP_OFFSET_FIRST_BIT) })
> > #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val((pte)) })
> > +#define __pmd_to_swp_entry(pte) ((swp_entry_t) { pmd_val((pmd)) })
> > #define __swp_entry_to_pte(x) ((pte_t) { .pte = (x).val })
> > +#define __swp_entry_to_pmd(x) ((pmd_t) { .pmd = (x).val })
> >
> > extern int kern_addr_valid(unsigned long addr);
> > extern void cleanup_highmap(void);
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/swapops.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/swapops.h
> > index 5c3a5f3..b6b22a2 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/swapops.h
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/swapops.h
> > @@ -163,6 +163,67 @@ static inline int is_write_migration_entry(swp_entry_t entry)
> >
> > #endif
> >
> > +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> > +extern void set_pmd_migration_entry(struct page *page,
> > + struct vm_area_struct *vma, unsigned long address);
> > +
> > +extern int remove_migration_pmd(struct page *new, pmd_t *pmd,
> > + struct vm_area_struct *vma, unsigned long addr, void *old);
> > +
> > +extern void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd);
> > +
> > +static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
> > +{
> > + swp_entry_t arch_entry;
> > +
> > + arch_entry = __pmd_to_swp_entry(pmd);
> > + return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
> > +}
> > +
> > +static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
> > +{
> > + swp_entry_t arch_entry;
> > +
> > + arch_entry = __swp_entry(swp_type(entry), swp_offset(entry));
> > + return __swp_entry_to_pmd(arch_entry);
> > +}
> > +
> > +static inline int is_pmd_migration_entry(pmd_t pmd)
> > +{
> > + return !pmd_present(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
> > +}
> > +#else
> > +static inline void set_pmd_migration_entry(struct page *page,
> > + struct vm_area_struct *vma, unsigned long address)
> > +{
>
> VM_BUG()? Or BUILD_BUG()?
These should be compiled out, so BUILD_BUG() seems better to me.
3 routines below will be done in the same manner.
> > +}
> > +
> > +static inline int remove_migration_pmd(struct page *new, pmd_t *pmd,
> > + struct vm_area_struct *vma, unsigned long addr, void *old)
> > +{
> > + return 0;
>
> Ditto.
>
> > +}
> > +
> > +static inline void pmd_migration_entry_wait(struct mm_struct *m, pmd_t *p) { }
> > +
> > +static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
> > +{
> > + return swp_entry(0, 0);
>
> Ditto.
>
> > +}
> > +
> > +static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
> > +{
> > + pmd_t pmd = {};
>
> Ditto.
>
> > + return pmd;
> > +}
> > +
> > +static inline int is_pmd_migration_entry(pmd_t pmd)
> > +{
> > + return 0;
> > +}
> > +#endif
> > +
> > #ifdef CONFIG_MEMORY_FAILURE
> >
> > extern atomic_long_t num_poisoned_pages __read_mostly;
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
> > index 0509d17..b3022b3 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
> > @@ -2310,3 +2310,157 @@ static int __init split_huge_pages_debugfs(void)
> > }
> > late_initcall(split_huge_pages_debugfs);
> > #endif
> > +
> > +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> > +void set_pmd_migration_entry(struct page *page, struct vm_area_struct *vma,
> > + unsigned long addr)
> > +{
> > + struct mm_struct *mm = vma->vm_mm;
> > + pgd_t *pgd;
> > + pud_t *pud;
> > + pmd_t *pmd;
> > + pmd_t pmdval;
> > + swp_entry_t entry;
> > + spinlock_t *ptl;
> > +
> > + pgd = pgd_offset(mm, addr);
> > + if (!pgd_present(*pgd))
> > + return;
> > + pud = pud_offset(pgd, addr);
> > + if (!pud_present(*pud))
> > + return;
> > + pmd = pmd_offset(pud, addr);
> > + pmdval = *pmd;
> > + barrier();
> > + if (!pmd_present(pmdval))
> > + return;
> > +
> > + mmu_notifier_invalidate_range_start(mm, addr, addr + HPAGE_PMD_SIZE);
> > + if (pmd_trans_huge(pmdval)) {
> > + pmd_t pmdswp;
> > +
> > + ptl = pmd_lock(mm, pmd);
> > + if (!pmd_present(*pmd))
> > + goto unlock_pmd;
> > + if (unlikely(!pmd_trans_huge(*pmd)))
> > + goto unlock_pmd;
>
> Just check *pmd == pmdval?
OK.
>
> > + if (pmd_page(*pmd) != page)
> > + goto unlock_pmd;
> > +
> > + pmdval = pmdp_huge_get_and_clear(mm, addr, pmd);
> > + if (pmd_dirty(pmdval))
> > + set_page_dirty(page);
> > + entry = make_migration_entry(page, pmd_write(pmdval));
> > + pmdswp = swp_entry_to_pmd(entry);
> > + pmdswp = pmd_mkhuge(pmdswp);
> > + set_pmd_at(mm, addr, pmd, pmdswp);
> > + page_remove_rmap(page, true);
> > + put_page(page);
> > +unlock_pmd:
> > + spin_unlock(ptl);
> > + } else { /* pte-mapped thp */
> > + pte_t *pte;
> > + pte_t pteval;
> > + struct page *tmp = compound_head(page);
> > + unsigned long address = addr & HPAGE_PMD_MASK;
> > + pte_t swp_pte;
> > + int i;
> > +
> > + pte = pte_offset_map(pmd, address);
> > + ptl = pte_lockptr(mm, pmd);
> > + spin_lock(ptl);
>
> pte_offset_map_lock() ?
Right.
> > + for (i = 0; i < HPAGE_PMD_NR; i++, pte++, tmp++) {
> > + if (!(pte_present(*pte) &&
> > + page_to_pfn(tmp) == pte_pfn(*pte)))
>
> if (!pte_present(*pte) || pte_page(*pte) != tmp) ?
Yes, this is shorter/simpler.
>
> > + continue;
> > + pteval = ptep_clear_flush(vma, address, pte);
> > + if (pte_dirty(pteval))
> > + set_page_dirty(tmp);
> > + entry = make_migration_entry(tmp, pte_write(pteval));
> > + swp_pte = swp_entry_to_pte(entry);
> > + set_pte_at(mm, address, pte, swp_pte);
> > + page_remove_rmap(tmp, false);
> > + put_page(tmp);
> > + }
> > + pte_unmap_unlock(pte, ptl);
> > + }
> > + mmu_notifier_invalidate_range_end(mm, addr, addr + HPAGE_PMD_SIZE);
> > + return;
> > +}
> > +
> > +int remove_migration_pmd(struct page *new, pmd_t *pmd,
> > + struct vm_area_struct *vma, unsigned long addr, void *old)
> > +{
> > + struct mm_struct *mm = vma->vm_mm;
> > + spinlock_t *ptl;
> > + pmd_t pmde;
> > + swp_entry_t entry;
> > +
> > + pmde = *pmd;
> > + barrier();
> > +
> > + if (!pmd_present(pmde)) {
> > + if (is_migration_entry(pmd_to_swp_entry(pmde))) {
>
> if (!is_migration_entry(pmd_to_swp_entry(pmde)))
> return SWAP_AGAIN;
>
> And one level less indentation below.
OK.
> > + unsigned long mmun_start = addr & HPAGE_PMD_MASK;
> > + unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
> > +
> > + ptl = pmd_lock(mm, pmd);
> > + entry = pmd_to_swp_entry(*pmd);
> > + if (migration_entry_to_page(entry) != old)
> > + goto unlock_ptl;
> > + get_page(new);
> > + pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));
> > + if (is_write_migration_entry(entry))
> > + pmde = maybe_pmd_mkwrite(pmde, vma);
> > + flush_cache_range(vma, mmun_start, mmun_end);
> > + page_add_anon_rmap(new, vma, mmun_start, true);
> > + pmdp_huge_clear_flush_notify(vma, mmun_start, pmd);
> > + set_pmd_at(mm, mmun_start, pmd, pmde);
> > + flush_tlb_range(vma, mmun_start, mmun_end);
> > + if (vma->vm_flags & VM_LOCKED)
> > + mlock_vma_page(new);
> > + update_mmu_cache_pmd(vma, addr, pmd);
> > +unlock_ptl:
> > + spin_unlock(ptl);
>
> return SWAP_AGAIN;
>
> And one level less indentation below.
>
> > + }
> > + } else { /* pte-mapped thp */
> > + pte_t *ptep;
> > + pte_t pte;
> > + int i;
> > + struct page *tmpnew = compound_head(new);
> > + struct page *tmpold = compound_head((struct page *)old);
> > + unsigned long address = addr & HPAGE_PMD_MASK;
> > +
> > + ptep = pte_offset_map(pmd, addr);
> > + ptl = pte_lockptr(mm, pmd);
> > + spin_lock(ptl);
>
> pte_offset_map_lock() ?
>
> > +
> > + for (i = 0; i < HPAGE_PMD_NR;
> > + i++, ptep++, tmpnew++, tmpold++, address += PAGE_SIZE) {
> > + pte = *ptep;
> > + if (!is_swap_pte(pte))
> > + continue;
> > + entry = pte_to_swp_entry(pte);
> > + if (!is_migration_entry(entry) ||
> > + migration_entry_to_page(entry) != tmpold)
> > + continue;
> > + get_page(tmpnew);
> > + pte = pte_mkold(mk_pte(tmpnew,
> > + READ_ONCE(vma->vm_page_prot)));
>
> READ_ONCE()? Do we get here under mmap_sem, right?
Some callers of page migration (mbind, move_pages, migrate_pages, cpuset)
do get mmap_sem, but others (memory hotremove, soft offline) don't.
For this part, I borrowed some code from remove_migration_pte() which was
updated at the following commit:
commit 6d2329f8872f23e46a19d240930571510ce525eb
Author: Andrea Arcangeli <[email protected]>
Date: Fri Oct 7 17:01:22 2016 -0700
mm: vm_page_prot: update with WRITE_ONCE/READ_ONCE
Thank you for reviewing in detail!
Naoya Horiguchi
> > + if (pte_swp_soft_dirty(*ptep))
> > + pte = pte_mksoft_dirty(pte);
> > + if (is_write_migration_entry(entry))
> > + pte = maybe_mkwrite(pte, vma);
> > + flush_dcache_page(tmpnew);
> > + set_pte_at(mm, address, ptep, pte);
> > + if (PageAnon(new))
> > + page_add_anon_rmap(tmpnew, vma, address, false);
> > + else
> > + page_add_file_rmap(tmpnew, false);
> > + update_mmu_cache(vma, address, ptep);
> > + }
> > + pte_unmap_unlock(ptep, ptl);
> > + }
> > + return SWAP_AGAIN;
> > +}
> > +#endif
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/migrate.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/migrate.c
> > index 66ce6b4..54f2eb6 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/migrate.c
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/migrate.c
> > @@ -198,6 +198,8 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
> > {
> > struct mm_struct *mm = vma->vm_mm;
> > swp_entry_t entry;
> > + pgd_t *pgd;
> > + pud_t *pud;
> > pmd_t *pmd;
> > pte_t *ptep, pte;
> > spinlock_t *ptl;
> > @@ -208,10 +210,29 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
> > goto out;
> > ptl = huge_pte_lockptr(hstate_vma(vma), mm, ptep);
> > } else {
> > - pmd = mm_find_pmd(mm, addr);
> > + pmd_t pmde;
> > +
> > + pgd = pgd_offset(mm, addr);
> > + if (!pgd_present(*pgd))
> > + goto out;
> > + pud = pud_offset(pgd, addr);
> > + if (!pud_present(*pud))
> > + goto out;
> > + pmd = pmd_offset(pud, addr);
> > if (!pmd)
> > goto out;
> >
> > + if (PageTransCompound(new)) {
> > + remove_migration_pmd(new, pmd, vma, addr, old);
> > + goto out;
> > + }
> > +
> > + pmde = *pmd;
> > + barrier();
> > +
> > + if (!pmd_present(pmde) || pmd_trans_huge(pmde))
> > + goto out;
> > +
> > ptep = pte_offset_map(pmd, addr);
> >
> > /*
> > @@ -344,6 +365,27 @@ void migration_entry_wait_huge(struct vm_area_struct *vma,
> > __migration_entry_wait(mm, pte, ptl);
> > }
> >
> > +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> > +void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
> > +{
> > + spinlock_t *ptl;
> > + struct page *page;
> > +
> > + ptl = pmd_lock(mm, pmd);
> > + if (!is_pmd_migration_entry(*pmd))
> > + goto unlock;
> > + page = migration_entry_to_page(pmd_to_swp_entry(*pmd));
> > + if (!get_page_unless_zero(page))
> > + goto unlock;
> > + spin_unlock(ptl);
> > + wait_on_page_locked(page);
> > + put_page(page);
> > + return;
> > +unlock:
> > + spin_unlock(ptl);
> > +}
> > +#endif
> > +
> > #ifdef CONFIG_BLOCK
> > /* Returns true if all buffers are successfully locked */
> > static bool buffer_migrate_lock_buffers(struct buffer_head *head,
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/pgtable-generic.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/pgtable-generic.c
> > index 71c5f91..6012343 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/pgtable-generic.c
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/pgtable-generic.c
> > @@ -118,7 +118,8 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
> > {
> > pmd_t pmd;
> > VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > - VM_BUG_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
> > + VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
> > + !pmd_devmap(*pmdp));
> > pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
> > flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> > return pmd;
> > --
> > 2.7.0
> >
>
> --
> Kirill A. Shutemov
>
On Tue, Nov 08, 2016 at 08:31:52AM +0900, Naoya Horiguchi wrote:
> If one of callers of page migration starts to handle thp, memory management code
> start to see pmd migration entry, so we need to prepare for it before enabling.
> This patch changes various code point which checks the status of given pmds in
> order to prevent race between thp migration and the pmd-related works.
>
> Signed-off-by: Naoya Horiguchi <[email protected]>
> ---
> ChangeLog v1 -> v2:
> - introduce pmd_related() (I know the naming is not good, but can't think up
> no better name. Any suggesntion is welcomed.)
> ---
> arch/x86/mm/gup.c | 4 +--
> fs/proc/task_mmu.c | 23 +++++++------
> include/linux/huge_mm.h | 9 ++++-
> mm/gup.c | 10 ++++--
> mm/huge_memory.c | 88 ++++++++++++++++++++++++++++++++++++++++---------
> mm/madvise.c | 2 +-
> mm/memcontrol.c | 2 ++
> mm/memory.c | 6 +++-
> mm/mprotect.c | 2 ++
> mm/mremap.c | 2 +-
> 10 files changed, 114 insertions(+), 34 deletions(-)
>
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/mm/gup.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/mm/gup.c
> index 0d4fb3e..78a153d 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/mm/gup.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/mm/gup.c
> @@ -222,9 +222,9 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> pmd_t pmd = *pmdp;
>
> next = pmd_addr_end(addr, end);
> - if (pmd_none(pmd))
> + if (!pmd_present(pmd))
> return 0;
> - if (unlikely(pmd_large(pmd) || !pmd_present(pmd))) {
> + if (unlikely(pmd_large(pmd))) {
> /*
> * NUMA hinting faults need to be handled in the GUP
> * slowpath for accounting purposes and so that they
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/fs/proc/task_mmu.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/fs/proc/task_mmu.c
> index 35b92d8..c1f9cf4 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/fs/proc/task_mmu.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/fs/proc/task_mmu.c
> @@ -596,7 +596,8 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>
> ptl = pmd_trans_huge_lock(pmd, vma);
> if (ptl) {
> - smaps_pmd_entry(pmd, addr, walk);
> + if (pmd_present(*pmd))
> + smaps_pmd_entry(pmd, addr, walk);
> spin_unlock(ptl);
> return 0;
> }
> @@ -929,6 +930,9 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
> goto out;
> }
>
> + if (!pmd_present(*pmd))
> + goto out;
> +
Hm. Looks like clear_soft_dirty_pmd() should handle !present. It doesn't.
Ah.. Found it in 08/12.
> page = pmd_page(*pmd);
>
> /* Clear accessed and referenced bits. */
> @@ -1208,19 +1212,18 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
> if (ptl) {
> u64 flags = 0, frame = 0;
> pmd_t pmd = *pmdp;
> + struct page *page;
>
> if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(pmd))
> flags |= PM_SOFT_DIRTY;
>
> - /*
> - * Currently pmd for thp is always present because thp
> - * can not be swapped-out, migrated, or HWPOISONed
> - * (split in such cases instead.)
> - * This if-check is just to prepare for future implementation.
> - */
> - if (pmd_present(pmd)) {
> - struct page *page = pmd_page(pmd);
> -
> + if (is_pmd_migration_entry(pmd)) {
> + swp_entry_t entry = pmd_to_swp_entry(pmd);
> + frame = swp_type(entry) |
> + (swp_offset(entry) << MAX_SWAPFILES_SHIFT);
> + page = migration_entry_to_page(entry);
> + } else if (pmd_present(pmd)) {
> + page = pmd_page(pmd);
> if (page_mapcount(page) == 1)
> flags |= PM_MMAP_EXCLUSIVE;
>
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/huge_mm.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/huge_mm.h
> index fcbca51..3c252cd 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/huge_mm.h
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/huge_mm.h
> @@ -125,12 +125,19 @@ extern void vma_adjust_trans_huge(struct vm_area_struct *vma,
> long adjust_next);
> extern spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd,
> struct vm_area_struct *vma);
> +
> +static inline int pmd_related(pmd_t pmd)
> +{
> + return !pmd_none(pmd) &&
> + (!pmd_present(pmd) || pmd_trans_huge(pmd) || pmd_devmap(pmd));
> +}
> +
I would rather create is_swap_pmd() -- (!none && !present) and leave the
reset open-codded.
> /* mmap_sem must be held on entry */
> static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
> struct vm_area_struct *vma)
> {
> VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma);
> - if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
> + if (pmd_related(*pmd))
> return __pmd_trans_huge_lock(pmd, vma);
> else
> return NULL;
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/gup.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/gup.c
> index e50178c..2dc4978 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/gup.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/gup.c
> @@ -267,6 +267,8 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
> }
> if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
> return no_page_table(vma, flags);
> + if (!pmd_present(*pmd))
> + return no_page_table(vma, flags);
Don't we need FOLL_MIGRATION, like on pte side?
> if (pmd_devmap(*pmd)) {
> ptl = pmd_lock(mm, pmd);
> page = follow_devmap_pmd(vma, address, pmd, flags);
> @@ -278,6 +280,10 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
> return follow_page_pte(vma, address, pmd, flags);
>
> ptl = pmd_lock(mm, pmd);
> + if (unlikely(!pmd_present(*pmd))) {
> + spin_unlock(ptl);
> + return no_page_table(vma, flags);
> + }
Ditto.
> if (unlikely(!pmd_trans_huge(*pmd))) {
> spin_unlock(ptl);
> return follow_page_pte(vma, address, pmd, flags);
> @@ -333,7 +339,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
> pud = pud_offset(pgd, address);
> BUG_ON(pud_none(*pud));
> pmd = pmd_offset(pud, address);
> - if (pmd_none(*pmd))
> + if (!pmd_present(*pmd))
> return -EFAULT;
> VM_BUG_ON(pmd_trans_huge(*pmd));
> pte = pte_offset_map(pmd, address);
> @@ -1357,7 +1363,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> pmd_t pmd = READ_ONCE(*pmdp);
>
> next = pmd_addr_end(addr, end);
> - if (pmd_none(pmd))
> + if (!pmd_present(pmd))
> return 0;
>
> if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
> index b3022b3..4e9090c 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
> @@ -825,6 +825,20 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>
> ret = -EAGAIN;
> pmd = *src_pmd;
> +
> + if (unlikely(is_pmd_migration_entry(pmd))) {
> + swp_entry_t entry = pmd_to_swp_entry(pmd);
> +
> + if (is_write_migration_entry(entry)) {
> + make_migration_entry_read(&entry);
> + pmd = swp_entry_to_pmd(entry);
> + set_pmd_at(src_mm, addr, src_pmd, pmd);
> + }
I think we should put at least WARN_ONCE() in 'else' here. We don't want
to miss such places when swap will be supported (or other swap entry type).
> + set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> + ret = 0;
> + goto out_unlock;
> + }
> +
> if (unlikely(!pmd_trans_huge(pmd))) {
> pte_free(dst_mm, pgtable);
> goto out_unlock;
> @@ -1013,6 +1027,9 @@ int do_huge_pmd_wp_page(struct fault_env *fe, pmd_t orig_pmd)
> if (unlikely(!pmd_same(*fe->pmd, orig_pmd)))
> goto out_unlock;
>
> + if (unlikely(!pmd_present(orig_pmd)))
> + goto out_unlock;
> +
> page = pmd_page(orig_pmd);
> VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
> /*
> @@ -1137,7 +1154,14 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
> if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
> goto out;
>
> - page = pmd_page(*pmd);
> + if (is_pmd_migration_entry(*pmd)) {
> + swp_entry_t entry;
> + entry = pmd_to_swp_entry(*pmd);
> + page = pfn_to_page(swp_offset(entry));
> + if (!is_migration_entry(entry))
> + goto out;
follow_page_pte() does different thing: wait for page to be migrated and
retry. Any reason you don't do the same?
> + } else
> + page = pmd_page(*pmd);
> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
> if (flags & FOLL_TOUCH)
> touch_pmd(vma, addr, pmd);
> @@ -1332,6 +1356,9 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> if (is_huge_zero_pmd(orig_pmd))
> goto out;
>
> + if (unlikely(!pmd_present(orig_pmd)))
> + goto out;
> +
> page = pmd_page(orig_pmd);
> /*
> * If other processes are mapping this page, we couldn't discard
> @@ -1410,20 +1437,35 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
> } else {
> struct page *page = pmd_page(orig_pmd);
> - page_remove_rmap(page, true);
> - VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> - VM_BUG_ON_PAGE(!PageHead(page), page);
> - if (PageAnon(page)) {
> - pgtable_t pgtable;
> - pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
> - pte_free(tlb->mm, pgtable);
> - atomic_long_dec(&tlb->mm->nr_ptes);
> - add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> + int migration = 0;
> +
> + if (!is_pmd_migration_entry(orig_pmd)) {
> + page_remove_rmap(page, true);
> + VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> + VM_BUG_ON_PAGE(!PageHead(page), page);
> + if (PageAnon(page)) {
> + pgtable_t pgtable;
> + pgtable = pgtable_trans_huge_withdraw(tlb->mm,
> + pmd);
> + pte_free(tlb->mm, pgtable);
> + atomic_long_dec(&tlb->mm->nr_ptes);
> + add_mm_counter(tlb->mm, MM_ANONPAGES,
> + -HPAGE_PMD_NR);
> + } else {
> + add_mm_counter(tlb->mm, MM_FILEPAGES,
> + -HPAGE_PMD_NR);
> + }
> } else {
> - add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
> + swp_entry_t entry;
> +
> + entry = pmd_to_swp_entry(orig_pmd);
> + free_swap_and_cache(entry); /* waring in failure? */
> + add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> + migration = 1;
> }
> spin_unlock(ptl);
> - tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
> + if (!migration)
> + tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
> }
> return 1;
> }
> @@ -1496,14 +1538,27 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> bool preserve_write = prot_numa && pmd_write(*pmd);
> ret = 1;
>
> + if (!pmd_present(*pmd))
> + goto unlock;
> /*
> * Avoid trapping faults against the zero page. The read-only
> * data is likely to be read-cached on the local CPU and
> * local/remote hits to the zero page are not interesting.
> */
> - if (prot_numa && is_huge_zero_pmd(*pmd)) {
> - spin_unlock(ptl);
> - return ret;
> + if (prot_numa && is_huge_zero_pmd(*pmd))
> + goto unlock;
> +
> + if (is_pmd_migration_entry(*pmd)) {
Hm? But we filtered out !present above?
> + swp_entry_t entry = pmd_to_swp_entry(*pmd);
> +
> + if (is_write_migration_entry(entry)) {
> + pmd_t newpmd;
> +
> + make_migration_entry_read(&entry);
> + newpmd = swp_entry_to_pmd(entry);
> + set_pmd_at(mm, addr, pmd, newpmd);
> + }
> + goto unlock;
> }
>
> if (!prot_numa || !pmd_protnone(*pmd)) {
> @@ -1516,6 +1571,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> BUG_ON(vma_is_anonymous(vma) && !preserve_write &&
> pmd_write(entry));
> }
> +unlock:
> spin_unlock(ptl);
> }
>
> @@ -1532,7 +1588,7 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
> {
> spinlock_t *ptl;
> ptl = pmd_lock(vma->vm_mm, pmd);
> - if (likely(pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
> + if (likely(pmd_related(*pmd)))
> return ptl;
> spin_unlock(ptl);
> return NULL;
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/madvise.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/madvise.c
> index 0e3828e..eaa2b02 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/madvise.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/madvise.c
> @@ -274,7 +274,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> unsigned long next;
>
> next = pmd_addr_end(addr, end);
> - if (pmd_trans_huge(*pmd))
> + if (pmd_related(*pmd))
I don't see a point going for madvise_free_huge_pmd(), just to fall off on
!present inside.
And is it safe for devmap?
> if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
> goto next;
>
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/memcontrol.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memcontrol.c
> index 91dfc7c..ebc2c42 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/memcontrol.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memcontrol.c
> @@ -4635,6 +4635,8 @@ static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
> struct page *page = NULL;
> enum mc_target_type ret = MC_TARGET_NONE;
>
> + if (unlikely(!pmd_present(pmd)))
> + return ret;
> page = pmd_page(pmd);
> VM_BUG_ON_PAGE(!page || !PageHead(page), page);
> if (!(mc.flags & MOVE_ANON))
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory.c
> index 94b5e2c..33fa439 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory.c
> @@ -999,7 +999,7 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
> src_pmd = pmd_offset(src_pud, addr);
> do {
> next = pmd_addr_end(addr, end);
> - if (pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)) {
> + if (pmd_related(*src_pmd)) {
> int err;
> VM_BUG_ON(next-addr != HPAGE_PMD_SIZE);
> err = copy_huge_pmd(dst_mm, src_mm,
> @@ -3591,6 +3591,10 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> int ret;
>
> barrier();
> + if (unlikely(is_pmd_migration_entry(orig_pmd))) {
> + pmd_migration_entry_wait(mm, fe.pmd);
> + return 0;
> + }
> if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
> if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
> return do_huge_pmd_numa_page(&fe, orig_pmd);
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/mprotect.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mprotect.c
> index c5ba2aa..81186e3 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/mprotect.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mprotect.c
> @@ -164,6 +164,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> unsigned long this_pages;
>
> next = pmd_addr_end(addr, end);
> + if (!pmd_present(*pmd))
> + continue;
> if (!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
> && pmd_none_or_clear_bad(pmd))
> continue;
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/mremap.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mremap.c
> index da22ad2..a94a698 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/mremap.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mremap.c
> @@ -194,7 +194,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
> if (!new_pmd)
> break;
> - if (pmd_trans_huge(*old_pmd)) {
> + if (pmd_related(*old_pmd)) {
> if (extent == HPAGE_PMD_SIZE) {
> bool moved;
> /* See comment in move_ptes() */
> --
> 2.7.0
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
--
Kirill A. Shutemov
On Tue, Nov 08, 2016 at 08:31:56AM +0900, Naoya Horiguchi wrote:
> This patch enables thp migration for move_pages(2).
>
> Signed-off-by: Naoya Horiguchi <[email protected]>
> ---
> mm/migrate.c | 37 ++++++++++++++++++++++++++++---------
> 1 file changed, 28 insertions(+), 9 deletions(-)
>
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/migrate.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/migrate.c
> index 97ab8d9..6a589b9 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/migrate.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/migrate.c
> @@ -1443,7 +1443,17 @@ static struct page *new_page_node(struct page *p, unsigned long private,
> if (PageHuge(p))
> return alloc_huge_page_node(page_hstate(compound_head(p)),
> pm->node);
> - else
> + else if (thp_migration_supported() && PageTransHuge(p)) {
> + struct page *thp;
> +
> + thp = alloc_pages_node(pm->node,
> + (GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
> + HPAGE_PMD_ORDER);
> + if (!thp)
> + return NULL;
> + prep_transhuge_page(thp);
> + return thp;
> + } else
> return __alloc_pages_node(pm->node,
> GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
> }
> @@ -1470,6 +1480,8 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
> for (pp = pm; pp->node != MAX_NUMNODES; pp++) {
> struct vm_area_struct *vma;
> struct page *page;
> + struct page *head;
> + unsigned int follflags;
>
> err = -EFAULT;
> vma = find_vma(mm, pp->addr);
> @@ -1477,8 +1489,10 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
> goto set_status;
>
> /* FOLL_DUMP to ignore special (like zero) pages */
> - page = follow_page(vma, pp->addr,
> - FOLL_GET | FOLL_SPLIT | FOLL_DUMP);
> + follflags = FOLL_GET | FOLL_SPLIT | FOLL_DUMP;
> + if (thp_migration_supported())
> + follflags &= ~FOLL_SPLIT;
Nit: I would rather filp the condition -- adding flag is easier to read
than clearing.
> + page = follow_page(vma, pp->addr, follflags);
>
> err = PTR_ERR(page);
> if (IS_ERR(page))
> @@ -1488,7 +1502,6 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
> if (!page)
> goto set_status;
>
> - pp->page = page;
> err = page_to_nid(page);
>
> if (err == pp->node)
> @@ -1503,16 +1516,22 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
> goto put_and_set;
>
> if (PageHuge(page)) {
> - if (PageHead(page))
> + if (PageHead(page)) {
> isolate_huge_page(page, &pagelist);
> + err = 0;
> + pp->page = page;
> + }
> goto put_and_set;
> }
>
> - err = isolate_lru_page(page);
> + pp->page = compound_head(page);
> + head = compound_head(page);
> + err = isolate_lru_page(head);
> if (!err) {
> - list_add_tail(&page->lru, &pagelist);
> - inc_node_page_state(page, NR_ISOLATED_ANON +
> - page_is_file_cache(page));
> + list_add_tail(&head->lru, &pagelist);
> + mod_node_page_state(page_pgdat(head),
> + NR_ISOLATED_ANON + page_is_file_cache(head),
> + hpage_nr_pages(head));
> }
> put_and_set:
> /*
> --
> 2.7.0
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
--
Kirill A. Shutemov
On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
> This patch enables thp migration for mbind(2) and migrate_pages(2).
>
> Signed-off-by: Naoya Horiguchi <[email protected]>
> ---
> ChangeLog v1 -> v2:
> - support pte-mapped and doubly-mapped thp
> ---
> mm/mempolicy.c | 108 +++++++++++++++++++++++++++++++++++++++++----------------
> 1 file changed, 79 insertions(+), 29 deletions(-)
>
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/mempolicy.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mempolicy.c
> index 77d0668..96507ee 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/mempolicy.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mempolicy.c
> @@ -94,6 +94,7 @@
> #include <linux/mm_inline.h>
> #include <linux/mmu_notifier.h>
> #include <linux/printk.h>
> +#include <linux/swapops.h>
>
> #include <asm/tlbflush.h>
> #include <asm/uaccess.h>
> @@ -486,6 +487,49 @@ static inline bool queue_pages_node_check(struct page *page,
> return node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT);
> }
>
> +static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
> + unsigned long end, struct mm_walk *walk)
> +{
> + int ret = 0;
> + struct page *page;
> + struct queue_pages *qp = walk->private;
> + unsigned long flags;
> +
> + if (unlikely(is_pmd_migration_entry(*pmd))) {
> + ret = 1;
> + goto unlock;
> + }
> + page = pmd_page(*pmd);
> + if (is_huge_zero_page(page)) {
> + spin_unlock(ptl);
> + __split_huge_pmd(walk->vma, pmd, addr, false, NULL);
> + goto out;
> + }
> + if (!thp_migration_supported()) {
> + get_page(page);
> + spin_unlock(ptl);
> + lock_page(page);
> + ret = split_huge_page(page);
> + unlock_page(page);
> + put_page(page);
> + goto out;
> + }
> + if (queue_pages_node_check(page, qp)) {
> + ret = 1;
> + goto unlock;
> + }
> +
> + ret = 1;
> + flags = qp->flags;
> + /* go to thp migration */
> + if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
> + migrate_page_add(page, qp->pagelist, flags);
> +unlock:
> + spin_unlock(ptl);
> +out:
> + return ret;
> +}
> +
> /*
> * Scan through pages checking if pages follow certain conditions,
> * and move them to the pagelist if they do.
> @@ -497,30 +541,15 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
> struct page *page;
> struct queue_pages *qp = walk->private;
> unsigned long flags = qp->flags;
> - int nid, ret;
> + int ret;
> pte_t *pte;
> spinlock_t *ptl;
>
> - if (pmd_trans_huge(*pmd)) {
> - ptl = pmd_lock(walk->mm, pmd);
> - if (pmd_trans_huge(*pmd)) {
> - page = pmd_page(*pmd);
> - if (is_huge_zero_page(page)) {
> - spin_unlock(ptl);
> - __split_huge_pmd(vma, pmd, addr, false, NULL);
> - } else {
> - get_page(page);
> - spin_unlock(ptl);
> - lock_page(page);
> - ret = split_huge_page(page);
> - unlock_page(page);
> - put_page(page);
> - if (ret)
> - return 0;
> - }
> - } else {
> - spin_unlock(ptl);
> - }
> + ptl = pmd_trans_huge_lock(pmd, vma);
> + if (ptl) {
> + ret = queue_pages_pmd(pmd, ptl, addr, end, walk);
> + if (ret)
> + return 0;
> }
I wonder if we should introduce pte_entry function along with pmd_entry
function as we are first looking for trans huge PMDs either for direct
addition into the migration list or splitting it before looking for PTEs.
On Tue 08-11-16 08:31:49, Naoya Horiguchi wrote:
> Introduces CONFIG_ARCH_ENABLE_THP_MIGRATION to limit thp migration
> functionality to x86_64, which should be safer at the first step.
Please make sure to describe why this has to be arch specific and what
are arches supposed to provide in order to enable this option.
--
Michal Hocko
SUSE Labs
On Tue 08-11-16 08:31:50, Naoya Horiguchi wrote:
> This patch prepares thp migration's core code. These code will be open when
> unmap_and_move() stops unconditionally splitting thp and get_new_page() starts
> to allocate destination thps.
this description is underdocumented to say the least. Could you
provide a high level documentation here please?
> Signed-off-by: Naoya Horiguchi <[email protected]>
> ---
> ChangeLog v1 -> v2:
> - support pte-mapped thp, doubly-mapped thp
> ---
> arch/x86/include/asm/pgtable_64.h | 2 +
> include/linux/swapops.h | 61 +++++++++++++++
> mm/huge_memory.c | 154 ++++++++++++++++++++++++++++++++++++++
> mm/migrate.c | 44 ++++++++++-
> mm/pgtable-generic.c | 3 +-
> 5 files changed, 262 insertions(+), 2 deletions(-)
>
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable_64.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable_64.h
> index 1cc82ec..3a1b48e 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/include/asm/pgtable_64.h
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/include/asm/pgtable_64.h
> @@ -167,7 +167,9 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
> ((type) << (SWP_TYPE_FIRST_BIT)) \
> | ((offset) << SWP_OFFSET_FIRST_BIT) })
> #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val((pte)) })
> +#define __pmd_to_swp_entry(pte) ((swp_entry_t) { pmd_val((pmd)) })
> #define __swp_entry_to_pte(x) ((pte_t) { .pte = (x).val })
> +#define __swp_entry_to_pmd(x) ((pmd_t) { .pmd = (x).val })
>
> extern int kern_addr_valid(unsigned long addr);
> extern void cleanup_highmap(void);
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/swapops.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/swapops.h
> index 5c3a5f3..b6b22a2 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/swapops.h
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/swapops.h
> @@ -163,6 +163,67 @@ static inline int is_write_migration_entry(swp_entry_t entry)
>
> #endif
>
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> +extern void set_pmd_migration_entry(struct page *page,
> + struct vm_area_struct *vma, unsigned long address);
> +
> +extern int remove_migration_pmd(struct page *new, pmd_t *pmd,
> + struct vm_area_struct *vma, unsigned long addr, void *old);
> +
> +extern void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd);
> +
> +static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
> +{
> + swp_entry_t arch_entry;
> +
> + arch_entry = __pmd_to_swp_entry(pmd);
> + return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
> +}
> +
> +static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
> +{
> + swp_entry_t arch_entry;
> +
> + arch_entry = __swp_entry(swp_type(entry), swp_offset(entry));
> + return __swp_entry_to_pmd(arch_entry);
> +}
> +
> +static inline int is_pmd_migration_entry(pmd_t pmd)
> +{
> + return !pmd_present(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
> +}
> +#else
> +static inline void set_pmd_migration_entry(struct page *page,
> + struct vm_area_struct *vma, unsigned long address)
> +{
> +}
> +
> +static inline int remove_migration_pmd(struct page *new, pmd_t *pmd,
> + struct vm_area_struct *vma, unsigned long addr, void *old)
> +{
> + return 0;
> +}
> +
> +static inline void pmd_migration_entry_wait(struct mm_struct *m, pmd_t *p) { }
> +
> +static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
> +{
> + return swp_entry(0, 0);
> +}
> +
> +static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
> +{
> + pmd_t pmd = {};
> +
> + return pmd;
> +}
> +
> +static inline int is_pmd_migration_entry(pmd_t pmd)
> +{
> + return 0;
> +}
> +#endif
> +
> #ifdef CONFIG_MEMORY_FAILURE
>
> extern atomic_long_t num_poisoned_pages __read_mostly;
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
> index 0509d17..b3022b3 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
> @@ -2310,3 +2310,157 @@ static int __init split_huge_pages_debugfs(void)
> }
> late_initcall(split_huge_pages_debugfs);
> #endif
> +
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> +void set_pmd_migration_entry(struct page *page, struct vm_area_struct *vma,
> + unsigned long addr)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd;
> + pmd_t pmdval;
> + swp_entry_t entry;
> + spinlock_t *ptl;
> +
> + pgd = pgd_offset(mm, addr);
> + if (!pgd_present(*pgd))
> + return;
> + pud = pud_offset(pgd, addr);
> + if (!pud_present(*pud))
> + return;
> + pmd = pmd_offset(pud, addr);
> + pmdval = *pmd;
> + barrier();
> + if (!pmd_present(pmdval))
> + return;
> +
> + mmu_notifier_invalidate_range_start(mm, addr, addr + HPAGE_PMD_SIZE);
> + if (pmd_trans_huge(pmdval)) {
> + pmd_t pmdswp;
> +
> + ptl = pmd_lock(mm, pmd);
> + if (!pmd_present(*pmd))
> + goto unlock_pmd;
> + if (unlikely(!pmd_trans_huge(*pmd)))
> + goto unlock_pmd;
> + if (pmd_page(*pmd) != page)
> + goto unlock_pmd;
> +
> + pmdval = pmdp_huge_get_and_clear(mm, addr, pmd);
> + if (pmd_dirty(pmdval))
> + set_page_dirty(page);
> + entry = make_migration_entry(page, pmd_write(pmdval));
> + pmdswp = swp_entry_to_pmd(entry);
> + pmdswp = pmd_mkhuge(pmdswp);
> + set_pmd_at(mm, addr, pmd, pmdswp);
> + page_remove_rmap(page, true);
> + put_page(page);
> +unlock_pmd:
> + spin_unlock(ptl);
> + } else { /* pte-mapped thp */
> + pte_t *pte;
> + pte_t pteval;
> + struct page *tmp = compound_head(page);
> + unsigned long address = addr & HPAGE_PMD_MASK;
> + pte_t swp_pte;
> + int i;
> +
> + pte = pte_offset_map(pmd, address);
> + ptl = pte_lockptr(mm, pmd);
> + spin_lock(ptl);
> + for (i = 0; i < HPAGE_PMD_NR; i++, pte++, tmp++) {
> + if (!(pte_present(*pte) &&
> + page_to_pfn(tmp) == pte_pfn(*pte)))
> + continue;
> + pteval = ptep_clear_flush(vma, address, pte);
> + if (pte_dirty(pteval))
> + set_page_dirty(tmp);
> + entry = make_migration_entry(tmp, pte_write(pteval));
> + swp_pte = swp_entry_to_pte(entry);
> + set_pte_at(mm, address, pte, swp_pte);
> + page_remove_rmap(tmp, false);
> + put_page(tmp);
> + }
> + pte_unmap_unlock(pte, ptl);
> + }
> + mmu_notifier_invalidate_range_end(mm, addr, addr + HPAGE_PMD_SIZE);
> + return;
> +}
> +
> +int remove_migration_pmd(struct page *new, pmd_t *pmd,
> + struct vm_area_struct *vma, unsigned long addr, void *old)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + spinlock_t *ptl;
> + pmd_t pmde;
> + swp_entry_t entry;
> +
> + pmde = *pmd;
> + barrier();
> +
> + if (!pmd_present(pmde)) {
> + if (is_migration_entry(pmd_to_swp_entry(pmde))) {
> + unsigned long mmun_start = addr & HPAGE_PMD_MASK;
> + unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
> +
> + ptl = pmd_lock(mm, pmd);
> + entry = pmd_to_swp_entry(*pmd);
> + if (migration_entry_to_page(entry) != old)
> + goto unlock_ptl;
> + get_page(new);
> + pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));
> + if (is_write_migration_entry(entry))
> + pmde = maybe_pmd_mkwrite(pmde, vma);
> + flush_cache_range(vma, mmun_start, mmun_end);
> + page_add_anon_rmap(new, vma, mmun_start, true);
> + pmdp_huge_clear_flush_notify(vma, mmun_start, pmd);
> + set_pmd_at(mm, mmun_start, pmd, pmde);
> + flush_tlb_range(vma, mmun_start, mmun_end);
> + if (vma->vm_flags & VM_LOCKED)
> + mlock_vma_page(new);
> + update_mmu_cache_pmd(vma, addr, pmd);
> +unlock_ptl:
> + spin_unlock(ptl);
> + }
> + } else { /* pte-mapped thp */
> + pte_t *ptep;
> + pte_t pte;
> + int i;
> + struct page *tmpnew = compound_head(new);
> + struct page *tmpold = compound_head((struct page *)old);
> + unsigned long address = addr & HPAGE_PMD_MASK;
> +
> + ptep = pte_offset_map(pmd, addr);
> + ptl = pte_lockptr(mm, pmd);
> + spin_lock(ptl);
> +
> + for (i = 0; i < HPAGE_PMD_NR;
> + i++, ptep++, tmpnew++, tmpold++, address += PAGE_SIZE) {
> + pte = *ptep;
> + if (!is_swap_pte(pte))
> + continue;
> + entry = pte_to_swp_entry(pte);
> + if (!is_migration_entry(entry) ||
> + migration_entry_to_page(entry) != tmpold)
> + continue;
> + get_page(tmpnew);
> + pte = pte_mkold(mk_pte(tmpnew,
> + READ_ONCE(vma->vm_page_prot)));
> + if (pte_swp_soft_dirty(*ptep))
> + pte = pte_mksoft_dirty(pte);
> + if (is_write_migration_entry(entry))
> + pte = maybe_mkwrite(pte, vma);
> + flush_dcache_page(tmpnew);
> + set_pte_at(mm, address, ptep, pte);
> + if (PageAnon(new))
> + page_add_anon_rmap(tmpnew, vma, address, false);
> + else
> + page_add_file_rmap(tmpnew, false);
> + update_mmu_cache(vma, address, ptep);
> + }
> + pte_unmap_unlock(ptep, ptl);
> + }
> + return SWAP_AGAIN;
> +}
> +#endif
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/migrate.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/migrate.c
> index 66ce6b4..54f2eb6 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/migrate.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/migrate.c
> @@ -198,6 +198,8 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
> {
> struct mm_struct *mm = vma->vm_mm;
> swp_entry_t entry;
> + pgd_t *pgd;
> + pud_t *pud;
> pmd_t *pmd;
> pte_t *ptep, pte;
> spinlock_t *ptl;
> @@ -208,10 +210,29 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
> goto out;
> ptl = huge_pte_lockptr(hstate_vma(vma), mm, ptep);
> } else {
> - pmd = mm_find_pmd(mm, addr);
> + pmd_t pmde;
> +
> + pgd = pgd_offset(mm, addr);
> + if (!pgd_present(*pgd))
> + goto out;
> + pud = pud_offset(pgd, addr);
> + if (!pud_present(*pud))
> + goto out;
> + pmd = pmd_offset(pud, addr);
> if (!pmd)
> goto out;
>
> + if (PageTransCompound(new)) {
> + remove_migration_pmd(new, pmd, vma, addr, old);
> + goto out;
> + }
> +
> + pmde = *pmd;
> + barrier();
> +
> + if (!pmd_present(pmde) || pmd_trans_huge(pmde))
> + goto out;
> +
> ptep = pte_offset_map(pmd, addr);
>
> /*
> @@ -344,6 +365,27 @@ void migration_entry_wait_huge(struct vm_area_struct *vma,
> __migration_entry_wait(mm, pte, ptl);
> }
>
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> +void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
> +{
> + spinlock_t *ptl;
> + struct page *page;
> +
> + ptl = pmd_lock(mm, pmd);
> + if (!is_pmd_migration_entry(*pmd))
> + goto unlock;
> + page = migration_entry_to_page(pmd_to_swp_entry(*pmd));
> + if (!get_page_unless_zero(page))
> + goto unlock;
> + spin_unlock(ptl);
> + wait_on_page_locked(page);
> + put_page(page);
> + return;
> +unlock:
> + spin_unlock(ptl);
> +}
> +#endif
> +
> #ifdef CONFIG_BLOCK
> /* Returns true if all buffers are successfully locked */
> static bool buffer_migrate_lock_buffers(struct buffer_head *head,
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/pgtable-generic.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/pgtable-generic.c
> index 71c5f91..6012343 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/pgtable-generic.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/pgtable-generic.c
> @@ -118,7 +118,8 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
> {
> pmd_t pmd;
> VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> - VM_BUG_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
> + VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
> + !pmd_devmap(*pmdp));
> pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
> flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> return pmd;
> --
> 2.7.0
--
Michal Hocko
SUSE Labs
On Tue 08-11-16 08:31:51, Naoya Horiguchi wrote:
> This patch makes it possible to support thp migration gradually. If you fail
> to allocate a destination page as a thp, you just split the source thp as we
> do now, and then enter the normal page migration. If you succeed to allocate
> destination thp, you enter thp migration. Subsequent patches actually enable
> thp migration for each caller of page migration by allowing its get_new_page()
> callback to allocate thps.
Does this need to be in a separate patch? Wouldn't it make more sense to
have the full THP migration code in a single one?
--
Michal Hocko
SUSE Labs
On Tue 08-11-16 08:31:52, Naoya Horiguchi wrote:
> If one of callers of page migration starts to handle thp, memory management code
> start to see pmd migration entry, so we need to prepare for it before enabling.
> This patch changes various code point which checks the status of given pmds in
> order to prevent race between thp migration and the pmd-related works.
Please be much more verbose about which code paths those are and why do
we care. The above wording didn't really help me to understand both the
problem and the solution until I started to stare into the code. And
even then I am not sure how to know that all places have been handled
properly.
> Signed-off-by: Naoya Horiguchi <[email protected]>
> ---
> ChangeLog v1 -> v2:
> - introduce pmd_related() (I know the naming is not good, but can't think up
> no better name. Any suggesntion is welcomed.)
> ---
> arch/x86/mm/gup.c | 4 +--
> fs/proc/task_mmu.c | 23 +++++++------
> include/linux/huge_mm.h | 9 ++++-
> mm/gup.c | 10 ++++--
> mm/huge_memory.c | 88 ++++++++++++++++++++++++++++++++++++++++---------
> mm/madvise.c | 2 +-
> mm/memcontrol.c | 2 ++
> mm/memory.c | 6 +++-
> mm/mprotect.c | 2 ++
> mm/mremap.c | 2 +-
> 10 files changed, 114 insertions(+), 34 deletions(-)
>
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/mm/gup.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/mm/gup.c
> index 0d4fb3e..78a153d 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/mm/gup.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/mm/gup.c
> @@ -222,9 +222,9 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> pmd_t pmd = *pmdp;
>
> next = pmd_addr_end(addr, end);
> - if (pmd_none(pmd))
> + if (!pmd_present(pmd))
> return 0;
> - if (unlikely(pmd_large(pmd) || !pmd_present(pmd))) {
> + if (unlikely(pmd_large(pmd))) {
> /*
> * NUMA hinting faults need to be handled in the GUP
> * slowpath for accounting purposes and so that they
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/fs/proc/task_mmu.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/fs/proc/task_mmu.c
> index 35b92d8..c1f9cf4 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/fs/proc/task_mmu.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/fs/proc/task_mmu.c
> @@ -596,7 +596,8 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>
> ptl = pmd_trans_huge_lock(pmd, vma);
> if (ptl) {
> - smaps_pmd_entry(pmd, addr, walk);
> + if (pmd_present(*pmd))
> + smaps_pmd_entry(pmd, addr, walk);
> spin_unlock(ptl);
> return 0;
> }
> @@ -929,6 +930,9 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
> goto out;
> }
>
> + if (!pmd_present(*pmd))
> + goto out;
> +
> page = pmd_page(*pmd);
>
> /* Clear accessed and referenced bits. */
> @@ -1208,19 +1212,18 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
> if (ptl) {
> u64 flags = 0, frame = 0;
> pmd_t pmd = *pmdp;
> + struct page *page;
>
> if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(pmd))
> flags |= PM_SOFT_DIRTY;
>
> - /*
> - * Currently pmd for thp is always present because thp
> - * can not be swapped-out, migrated, or HWPOISONed
> - * (split in such cases instead.)
> - * This if-check is just to prepare for future implementation.
> - */
> - if (pmd_present(pmd)) {
> - struct page *page = pmd_page(pmd);
> -
> + if (is_pmd_migration_entry(pmd)) {
> + swp_entry_t entry = pmd_to_swp_entry(pmd);
> + frame = swp_type(entry) |
> + (swp_offset(entry) << MAX_SWAPFILES_SHIFT);
> + page = migration_entry_to_page(entry);
> + } else if (pmd_present(pmd)) {
> + page = pmd_page(pmd);
> if (page_mapcount(page) == 1)
> flags |= PM_MMAP_EXCLUSIVE;
>
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/huge_mm.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/huge_mm.h
> index fcbca51..3c252cd 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/huge_mm.h
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/huge_mm.h
> @@ -125,12 +125,19 @@ extern void vma_adjust_trans_huge(struct vm_area_struct *vma,
> long adjust_next);
> extern spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd,
> struct vm_area_struct *vma);
> +
> +static inline int pmd_related(pmd_t pmd)
> +{
> + return !pmd_none(pmd) &&
> + (!pmd_present(pmd) || pmd_trans_huge(pmd) || pmd_devmap(pmd));
> +}
> +
> /* mmap_sem must be held on entry */
> static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
> struct vm_area_struct *vma)
> {
> VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma);
> - if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
> + if (pmd_related(*pmd))
> return __pmd_trans_huge_lock(pmd, vma);
> else
> return NULL;
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/gup.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/gup.c
> index e50178c..2dc4978 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/gup.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/gup.c
> @@ -267,6 +267,8 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
> }
> if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
> return no_page_table(vma, flags);
> + if (!pmd_present(*pmd))
> + return no_page_table(vma, flags);
> if (pmd_devmap(*pmd)) {
> ptl = pmd_lock(mm, pmd);
> page = follow_devmap_pmd(vma, address, pmd, flags);
> @@ -278,6 +280,10 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
> return follow_page_pte(vma, address, pmd, flags);
>
> ptl = pmd_lock(mm, pmd);
> + if (unlikely(!pmd_present(*pmd))) {
> + spin_unlock(ptl);
> + return no_page_table(vma, flags);
> + }
> if (unlikely(!pmd_trans_huge(*pmd))) {
> spin_unlock(ptl);
> return follow_page_pte(vma, address, pmd, flags);
> @@ -333,7 +339,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
> pud = pud_offset(pgd, address);
> BUG_ON(pud_none(*pud));
> pmd = pmd_offset(pud, address);
> - if (pmd_none(*pmd))
> + if (!pmd_present(*pmd))
> return -EFAULT;
> VM_BUG_ON(pmd_trans_huge(*pmd));
> pte = pte_offset_map(pmd, address);
> @@ -1357,7 +1363,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> pmd_t pmd = READ_ONCE(*pmdp);
>
> next = pmd_addr_end(addr, end);
> - if (pmd_none(pmd))
> + if (!pmd_present(pmd))
> return 0;
>
> if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
> index b3022b3..4e9090c 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
> @@ -825,6 +825,20 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>
> ret = -EAGAIN;
> pmd = *src_pmd;
> +
> + if (unlikely(is_pmd_migration_entry(pmd))) {
> + swp_entry_t entry = pmd_to_swp_entry(pmd);
> +
> + if (is_write_migration_entry(entry)) {
> + make_migration_entry_read(&entry);
> + pmd = swp_entry_to_pmd(entry);
> + set_pmd_at(src_mm, addr, src_pmd, pmd);
> + }
> + set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> + ret = 0;
> + goto out_unlock;
> + }
> +
> if (unlikely(!pmd_trans_huge(pmd))) {
> pte_free(dst_mm, pgtable);
> goto out_unlock;
> @@ -1013,6 +1027,9 @@ int do_huge_pmd_wp_page(struct fault_env *fe, pmd_t orig_pmd)
> if (unlikely(!pmd_same(*fe->pmd, orig_pmd)))
> goto out_unlock;
>
> + if (unlikely(!pmd_present(orig_pmd)))
> + goto out_unlock;
> +
> page = pmd_page(orig_pmd);
> VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
> /*
> @@ -1137,7 +1154,14 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
> if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
> goto out;
>
> - page = pmd_page(*pmd);
> + if (is_pmd_migration_entry(*pmd)) {
> + swp_entry_t entry;
> + entry = pmd_to_swp_entry(*pmd);
> + page = pfn_to_page(swp_offset(entry));
> + if (!is_migration_entry(entry))
> + goto out;
> + } else
> + page = pmd_page(*pmd);
> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
> if (flags & FOLL_TOUCH)
> touch_pmd(vma, addr, pmd);
> @@ -1332,6 +1356,9 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> if (is_huge_zero_pmd(orig_pmd))
> goto out;
>
> + if (unlikely(!pmd_present(orig_pmd)))
> + goto out;
> +
> page = pmd_page(orig_pmd);
> /*
> * If other processes are mapping this page, we couldn't discard
> @@ -1410,20 +1437,35 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
> } else {
> struct page *page = pmd_page(orig_pmd);
> - page_remove_rmap(page, true);
> - VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> - VM_BUG_ON_PAGE(!PageHead(page), page);
> - if (PageAnon(page)) {
> - pgtable_t pgtable;
> - pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
> - pte_free(tlb->mm, pgtable);
> - atomic_long_dec(&tlb->mm->nr_ptes);
> - add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> + int migration = 0;
> +
> + if (!is_pmd_migration_entry(orig_pmd)) {
> + page_remove_rmap(page, true);
> + VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> + VM_BUG_ON_PAGE(!PageHead(page), page);
> + if (PageAnon(page)) {
> + pgtable_t pgtable;
> + pgtable = pgtable_trans_huge_withdraw(tlb->mm,
> + pmd);
> + pte_free(tlb->mm, pgtable);
> + atomic_long_dec(&tlb->mm->nr_ptes);
> + add_mm_counter(tlb->mm, MM_ANONPAGES,
> + -HPAGE_PMD_NR);
> + } else {
> + add_mm_counter(tlb->mm, MM_FILEPAGES,
> + -HPAGE_PMD_NR);
> + }
> } else {
> - add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
> + swp_entry_t entry;
> +
> + entry = pmd_to_swp_entry(orig_pmd);
> + free_swap_and_cache(entry); /* waring in failure? */
> + add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> + migration = 1;
> }
> spin_unlock(ptl);
> - tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
> + if (!migration)
> + tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
> }
> return 1;
> }
> @@ -1496,14 +1538,27 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> bool preserve_write = prot_numa && pmd_write(*pmd);
> ret = 1;
>
> + if (!pmd_present(*pmd))
> + goto unlock;
> /*
> * Avoid trapping faults against the zero page. The read-only
> * data is likely to be read-cached on the local CPU and
> * local/remote hits to the zero page are not interesting.
> */
> - if (prot_numa && is_huge_zero_pmd(*pmd)) {
> - spin_unlock(ptl);
> - return ret;
> + if (prot_numa && is_huge_zero_pmd(*pmd))
> + goto unlock;
> +
> + if (is_pmd_migration_entry(*pmd)) {
> + swp_entry_t entry = pmd_to_swp_entry(*pmd);
> +
> + if (is_write_migration_entry(entry)) {
> + pmd_t newpmd;
> +
> + make_migration_entry_read(&entry);
> + newpmd = swp_entry_to_pmd(entry);
> + set_pmd_at(mm, addr, pmd, newpmd);
> + }
> + goto unlock;
> }
>
> if (!prot_numa || !pmd_protnone(*pmd)) {
> @@ -1516,6 +1571,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> BUG_ON(vma_is_anonymous(vma) && !preserve_write &&
> pmd_write(entry));
> }
> +unlock:
> spin_unlock(ptl);
> }
>
> @@ -1532,7 +1588,7 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
> {
> spinlock_t *ptl;
> ptl = pmd_lock(vma->vm_mm, pmd);
> - if (likely(pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
> + if (likely(pmd_related(*pmd)))
> return ptl;
> spin_unlock(ptl);
> return NULL;
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/madvise.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/madvise.c
> index 0e3828e..eaa2b02 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/madvise.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/madvise.c
> @@ -274,7 +274,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> unsigned long next;
>
> next = pmd_addr_end(addr, end);
> - if (pmd_trans_huge(*pmd))
> + if (pmd_related(*pmd))
> if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
> goto next;
>
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/memcontrol.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memcontrol.c
> index 91dfc7c..ebc2c42 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/memcontrol.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memcontrol.c
> @@ -4635,6 +4635,8 @@ static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
> struct page *page = NULL;
> enum mc_target_type ret = MC_TARGET_NONE;
>
> + if (unlikely(!pmd_present(pmd)))
> + return ret;
> page = pmd_page(pmd);
> VM_BUG_ON_PAGE(!page || !PageHead(page), page);
> if (!(mc.flags & MOVE_ANON))
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory.c
> index 94b5e2c..33fa439 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory.c
> @@ -999,7 +999,7 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
> src_pmd = pmd_offset(src_pud, addr);
> do {
> next = pmd_addr_end(addr, end);
> - if (pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)) {
> + if (pmd_related(*src_pmd)) {
> int err;
> VM_BUG_ON(next-addr != HPAGE_PMD_SIZE);
> err = copy_huge_pmd(dst_mm, src_mm,
> @@ -3591,6 +3591,10 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> int ret;
>
> barrier();
> + if (unlikely(is_pmd_migration_entry(orig_pmd))) {
> + pmd_migration_entry_wait(mm, fe.pmd);
> + return 0;
> + }
> if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
> if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
> return do_huge_pmd_numa_page(&fe, orig_pmd);
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/mprotect.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mprotect.c
> index c5ba2aa..81186e3 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/mprotect.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mprotect.c
> @@ -164,6 +164,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> unsigned long this_pages;
>
> next = pmd_addr_end(addr, end);
> + if (!pmd_present(*pmd))
> + continue;
> if (!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
> && pmd_none_or_clear_bad(pmd))
> continue;
> diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/mremap.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mremap.c
> index da22ad2..a94a698 100644
> --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/mremap.c
> +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mremap.c
> @@ -194,7 +194,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
> if (!new_pmd)
> break;
> - if (pmd_trans_huge(*old_pmd)) {
> + if (pmd_related(*old_pmd)) {
> if (extent == HPAGE_PMD_SIZE) {
> bool moved;
> /* See comment in move_ptes() */
> --
> 2.7.0
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
--
Michal Hocko
SUSE Labs
# sorry for late reply ...
On Fri, Nov 18, 2016 at 02:56:24AM +0300, Kirill A. Shutemov wrote:
> On Tue, Nov 08, 2016 at 08:31:52AM +0900, Naoya Horiguchi wrote:
> > If one of callers of page migration starts to handle thp, memory management code
> > start to see pmd migration entry, so we need to prepare for it before enabling.
> > This patch changes various code point which checks the status of given pmds in
> > order to prevent race between thp migration and the pmd-related works.
> >
> > Signed-off-by: Naoya Horiguchi <[email protected]>
> > ---
> > ChangeLog v1 -> v2:
> > - introduce pmd_related() (I know the naming is not good, but can't think up
> > no better name. Any suggesntion is welcomed.)
> > ---
> > arch/x86/mm/gup.c | 4 +--
> > fs/proc/task_mmu.c | 23 +++++++------
> > include/linux/huge_mm.h | 9 ++++-
> > mm/gup.c | 10 ++++--
> > mm/huge_memory.c | 88 ++++++++++++++++++++++++++++++++++++++++---------
> > mm/madvise.c | 2 +-
> > mm/memcontrol.c | 2 ++
> > mm/memory.c | 6 +++-
> > mm/mprotect.c | 2 ++
> > mm/mremap.c | 2 +-
> > 10 files changed, 114 insertions(+), 34 deletions(-)
> >
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/mm/gup.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/mm/gup.c
> > index 0d4fb3e..78a153d 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/arch/x86/mm/gup.c
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/arch/x86/mm/gup.c
> > @@ -222,9 +222,9 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> > pmd_t pmd = *pmdp;
> >
> > next = pmd_addr_end(addr, end);
> > - if (pmd_none(pmd))
> > + if (!pmd_present(pmd))
> > return 0;
> > - if (unlikely(pmd_large(pmd) || !pmd_present(pmd))) {
> > + if (unlikely(pmd_large(pmd))) {
> > /*
> > * NUMA hinting faults need to be handled in the GUP
> > * slowpath for accounting purposes and so that they
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/fs/proc/task_mmu.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/fs/proc/task_mmu.c
> > index 35b92d8..c1f9cf4 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/fs/proc/task_mmu.c
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/fs/proc/task_mmu.c
> > @@ -596,7 +596,8 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> >
> > ptl = pmd_trans_huge_lock(pmd, vma);
> > if (ptl) {
> > - smaps_pmd_entry(pmd, addr, walk);
> > + if (pmd_present(*pmd))
> > + smaps_pmd_entry(pmd, addr, walk);
> > spin_unlock(ptl);
> > return 0;
> > }
> > @@ -929,6 +930,9 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
> > goto out;
> > }
> >
> > + if (!pmd_present(*pmd))
> > + goto out;
> > +
>
> Hm. Looks like clear_soft_dirty_pmd() should handle !present. It doesn't.
>
> Ah.. Found it in 08/12.
>
> > page = pmd_page(*pmd);
> >
> > /* Clear accessed and referenced bits. */
> > @@ -1208,19 +1212,18 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
> > if (ptl) {
> > u64 flags = 0, frame = 0;
> > pmd_t pmd = *pmdp;
> > + struct page *page;
> >
> > if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(pmd))
> > flags |= PM_SOFT_DIRTY;
> >
> > - /*
> > - * Currently pmd for thp is always present because thp
> > - * can not be swapped-out, migrated, or HWPOISONed
> > - * (split in such cases instead.)
> > - * This if-check is just to prepare for future implementation.
> > - */
> > - if (pmd_present(pmd)) {
> > - struct page *page = pmd_page(pmd);
> > -
> > + if (is_pmd_migration_entry(pmd)) {
> > + swp_entry_t entry = pmd_to_swp_entry(pmd);
> > + frame = swp_type(entry) |
> > + (swp_offset(entry) << MAX_SWAPFILES_SHIFT);
> > + page = migration_entry_to_page(entry);
> > + } else if (pmd_present(pmd)) {
> > + page = pmd_page(pmd);
> > if (page_mapcount(page) == 1)
> > flags |= PM_MMAP_EXCLUSIVE;
> >
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/huge_mm.h v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/huge_mm.h
> > index fcbca51..3c252cd 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/include/linux/huge_mm.h
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/include/linux/huge_mm.h
> > @@ -125,12 +125,19 @@ extern void vma_adjust_trans_huge(struct vm_area_struct *vma,
> > long adjust_next);
> > extern spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd,
> > struct vm_area_struct *vma);
> > +
> > +static inline int pmd_related(pmd_t pmd)
> > +{
> > + return !pmd_none(pmd) &&
> > + (!pmd_present(pmd) || pmd_trans_huge(pmd) || pmd_devmap(pmd));
> > +}
> > +
>
> I would rather create is_swap_pmd() -- (!none && !present) and leave the
> reset open-codded.
OK, I do this.
>
> > /* mmap_sem must be held on entry */
> > static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
> > struct vm_area_struct *vma)
> > {
> > VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma);
> > - if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
> > + if (pmd_related(*pmd))
> > return __pmd_trans_huge_lock(pmd, vma);
> > else
> > return NULL;
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/gup.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/gup.c
> > index e50178c..2dc4978 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/gup.c
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/gup.c
> > @@ -267,6 +267,8 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
> > }
> > if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
> > return no_page_table(vma, flags);
> > + if (!pmd_present(*pmd))
> > + return no_page_table(vma, flags);
>
> Don't we need FOLL_MIGRATION, like on pte side?
That's better, applied.
> > if (pmd_devmap(*pmd)) {
> > ptl = pmd_lock(mm, pmd);
> > page = follow_devmap_pmd(vma, address, pmd, flags);
> > @@ -278,6 +280,10 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
> > return follow_page_pte(vma, address, pmd, flags);
> >
> > ptl = pmd_lock(mm, pmd);
> > + if (unlikely(!pmd_present(*pmd))) {
> > + spin_unlock(ptl);
> > + return no_page_table(vma, flags);
> > + }
>
> Ditto.
>
> > if (unlikely(!pmd_trans_huge(*pmd))) {
> > spin_unlock(ptl);
> > return follow_page_pte(vma, address, pmd, flags);
> > @@ -333,7 +339,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
> > pud = pud_offset(pgd, address);
> > BUG_ON(pud_none(*pud));
> > pmd = pmd_offset(pud, address);
> > - if (pmd_none(*pmd))
> > + if (!pmd_present(*pmd))
> > return -EFAULT;
> > VM_BUG_ON(pmd_trans_huge(*pmd));
> > pte = pte_offset_map(pmd, address);
> > @@ -1357,7 +1363,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> > pmd_t pmd = READ_ONCE(*pmdp);
> >
> > next = pmd_addr_end(addr, end);
> > - if (pmd_none(pmd))
> > + if (!pmd_present(pmd))
> > return 0;
> >
> > if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
> > index b3022b3..4e9090c 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/huge_memory.c
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/huge_memory.c
> > @@ -825,6 +825,20 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >
> > ret = -EAGAIN;
> > pmd = *src_pmd;
> > +
> > + if (unlikely(is_pmd_migration_entry(pmd))) {
> > + swp_entry_t entry = pmd_to_swp_entry(pmd);
> > +
> > + if (is_write_migration_entry(entry)) {
> > + make_migration_entry_read(&entry);
> > + pmd = swp_entry_to_pmd(entry);
> > + set_pmd_at(src_mm, addr, src_pmd, pmd);
> > + }
>
> I think we should put at least WARN_ONCE() in 'else' here. We don't want
> to miss such places when swap will be supported (or other swap entry type).
I guess that you mean 'else' branch of outer if-block, right?
> > + set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> > + ret = 0;
> > + goto out_unlock;
> > + }
Maybe inserting the below here seems OK.
WARN_ONCE(!pmd_present(pmd), "Unknown non-present format on pmd.\n");
> > +
> > if (unlikely(!pmd_trans_huge(pmd))) {
> > pte_free(dst_mm, pgtable);
> > goto out_unlock;
> > @@ -1013,6 +1027,9 @@ int do_huge_pmd_wp_page(struct fault_env *fe, pmd_t orig_pmd)
> > if (unlikely(!pmd_same(*fe->pmd, orig_pmd)))
> > goto out_unlock;
> >
> > + if (unlikely(!pmd_present(orig_pmd)))
> > + goto out_unlock;
> > +
> > page = pmd_page(orig_pmd);
> > VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
> > /*
> > @@ -1137,7 +1154,14 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
> > if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
> > goto out;
> >
> > - page = pmd_page(*pmd);
> > + if (is_pmd_migration_entry(*pmd)) {
> > + swp_entry_t entry;
> > + entry = pmd_to_swp_entry(*pmd);
> > + page = pfn_to_page(swp_offset(entry));
> > + if (!is_migration_entry(entry))
> > + goto out;
>
> follow_page_pte() does different thing: wait for page to be migrated and
> retry. Any reason you don't do the same?
No, just my self-check wasn't enough.
> > + } else
> > + page = pmd_page(*pmd);
> > VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
> > if (flags & FOLL_TOUCH)
> > touch_pmd(vma, addr, pmd);
> > @@ -1332,6 +1356,9 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> > if (is_huge_zero_pmd(orig_pmd))
> > goto out;
> >
> > + if (unlikely(!pmd_present(orig_pmd)))
> > + goto out;
> > +
> > page = pmd_page(orig_pmd);
> > /*
> > * If other processes are mapping this page, we couldn't discard
> > @@ -1410,20 +1437,35 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> > tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
> > } else {
> > struct page *page = pmd_page(orig_pmd);
> > - page_remove_rmap(page, true);
> > - VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> > - VM_BUG_ON_PAGE(!PageHead(page), page);
> > - if (PageAnon(page)) {
> > - pgtable_t pgtable;
> > - pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
> > - pte_free(tlb->mm, pgtable);
> > - atomic_long_dec(&tlb->mm->nr_ptes);
> > - add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> > + int migration = 0;
> > +
> > + if (!is_pmd_migration_entry(orig_pmd)) {
> > + page_remove_rmap(page, true);
> > + VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> > + VM_BUG_ON_PAGE(!PageHead(page), page);
> > + if (PageAnon(page)) {
> > + pgtable_t pgtable;
> > + pgtable = pgtable_trans_huge_withdraw(tlb->mm,
> > + pmd);
> > + pte_free(tlb->mm, pgtable);
> > + atomic_long_dec(&tlb->mm->nr_ptes);
> > + add_mm_counter(tlb->mm, MM_ANONPAGES,
> > + -HPAGE_PMD_NR);
> > + } else {
> > + add_mm_counter(tlb->mm, MM_FILEPAGES,
> > + -HPAGE_PMD_NR);
> > + }
> > } else {
> > - add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
> > + swp_entry_t entry;
> > +
> > + entry = pmd_to_swp_entry(orig_pmd);
> > + free_swap_and_cache(entry); /* waring in failure? */
> > + add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> > + migration = 1;
> > }
> > spin_unlock(ptl);
> > - tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
> > + if (!migration)
> > + tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
> > }
> > return 1;
> > }
> > @@ -1496,14 +1538,27 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> > bool preserve_write = prot_numa && pmd_write(*pmd);
> > ret = 1;
> >
> > + if (!pmd_present(*pmd))
> > + goto unlock;
> > /*
> > * Avoid trapping faults against the zero page. The read-only
> > * data is likely to be read-cached on the local CPU and
> > * local/remote hits to the zero page are not interesting.
> > */
> > - if (prot_numa && is_huge_zero_pmd(*pmd)) {
> > - spin_unlock(ptl);
> > - return ret;
> > + if (prot_numa && is_huge_zero_pmd(*pmd))
> > + goto unlock;
> > +
> > + if (is_pmd_migration_entry(*pmd)) {
>
> Hm? But we filtered out !present above?
the !present check must come after this if-block with WARN_ONCE().
> > + swp_entry_t entry = pmd_to_swp_entry(*pmd);
> > +
> > + if (is_write_migration_entry(entry)) {
> > + pmd_t newpmd;
> > +
> > + make_migration_entry_read(&entry);
> > + newpmd = swp_entry_to_pmd(entry);
> > + set_pmd_at(mm, addr, pmd, newpmd);
> > + }
> > + goto unlock;
> > }
> >
> > if (!prot_numa || !pmd_protnone(*pmd)) {
> > @@ -1516,6 +1571,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> > BUG_ON(vma_is_anonymous(vma) && !preserve_write &&
> > pmd_write(entry));
> > }
> > +unlock:
> > spin_unlock(ptl);
> > }
> >
> > @@ -1532,7 +1588,7 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
> > {
> > spinlock_t *ptl;
> > ptl = pmd_lock(vma->vm_mm, pmd);
> > - if (likely(pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
> > + if (likely(pmd_related(*pmd)))
> > return ptl;
> > spin_unlock(ptl);
> > return NULL;
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/madvise.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/madvise.c
> > index 0e3828e..eaa2b02 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/madvise.c
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/madvise.c
> > @@ -274,7 +274,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> > unsigned long next;
> >
> > next = pmd_addr_end(addr, end);
> > - if (pmd_trans_huge(*pmd))
> > + if (pmd_related(*pmd))
>
> I don't see a point going for madvise_free_huge_pmd(), just to fall off on
> !present inside.
Sorry, this code was wrong. I should've done like below simply:
+ if (!pmd_present(*pmd))
+ return 0;
if (pmd_trans_huge(*pmd))
...
> And is it safe for devmap?
I'm not sure, so let's keep it as-is.
Thanks,
Naoya Horiguchi
>
> > if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
> > goto next;
> >
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/memcontrol.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memcontrol.c
> > index 91dfc7c..ebc2c42 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/memcontrol.c
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memcontrol.c
> > @@ -4635,6 +4635,8 @@ static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
> > struct page *page = NULL;
> > enum mc_target_type ret = MC_TARGET_NONE;
> >
> > + if (unlikely(!pmd_present(pmd)))
> > + return ret;
> > page = pmd_page(pmd);
> > VM_BUG_ON_PAGE(!page || !PageHead(page), page);
> > if (!(mc.flags & MOVE_ANON))
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory.c
> > index 94b5e2c..33fa439 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/memory.c
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/memory.c
> > @@ -999,7 +999,7 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
> > src_pmd = pmd_offset(src_pud, addr);
> > do {
> > next = pmd_addr_end(addr, end);
> > - if (pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)) {
> > + if (pmd_related(*src_pmd)) {
> > int err;
> > VM_BUG_ON(next-addr != HPAGE_PMD_SIZE);
> > err = copy_huge_pmd(dst_mm, src_mm,
> > @@ -3591,6 +3591,10 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> > int ret;
> >
> > barrier();
> > + if (unlikely(is_pmd_migration_entry(orig_pmd))) {
> > + pmd_migration_entry_wait(mm, fe.pmd);
> > + return 0;
> > + }
> > if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
> > if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
> > return do_huge_pmd_numa_page(&fe, orig_pmd);
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/mprotect.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mprotect.c
> > index c5ba2aa..81186e3 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/mprotect.c
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mprotect.c
> > @@ -164,6 +164,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> > unsigned long this_pages;
> >
> > next = pmd_addr_end(addr, end);
> > + if (!pmd_present(*pmd))
> > + continue;
> > if (!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
> > && pmd_none_or_clear_bad(pmd))
> > continue;
> > diff --git v4.9-rc2-mmotm-2016-10-27-18-27/mm/mremap.c v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mremap.c
> > index da22ad2..a94a698 100644
> > --- v4.9-rc2-mmotm-2016-10-27-18-27/mm/mremap.c
> > +++ v4.9-rc2-mmotm-2016-10-27-18-27_patched/mm/mremap.c
> > @@ -194,7 +194,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> > new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
> > if (!new_pmd)
> > break;
> > - if (pmd_trans_huge(*old_pmd)) {
> > + if (pmd_related(*old_pmd)) {
> > if (extent == HPAGE_PMD_SIZE) {
> > bool moved;
> > /* See comment in move_ptes() */
> > --
> > 2.7.0
> >
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to [email protected]. For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>
> --
> Kirill A. Shutemov
>
On Fri, Nov 25, 2016 at 05:57:20PM +0530, Anshuman Khandual wrote:
> On 11/08/2016 05:01 AM, Naoya Horiguchi wrote:
...
> > @@ -497,30 +541,15 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
> > struct page *page;
> > struct queue_pages *qp = walk->private;
> > unsigned long flags = qp->flags;
> > - int nid, ret;
> > + int ret;
> > pte_t *pte;
> > spinlock_t *ptl;
> >
> > - if (pmd_trans_huge(*pmd)) {
> > - ptl = pmd_lock(walk->mm, pmd);
> > - if (pmd_trans_huge(*pmd)) {
> > - page = pmd_page(*pmd);
> > - if (is_huge_zero_page(page)) {
> > - spin_unlock(ptl);
> > - __split_huge_pmd(vma, pmd, addr, false, NULL);
> > - } else {
> > - get_page(page);
> > - spin_unlock(ptl);
> > - lock_page(page);
> > - ret = split_huge_page(page);
> > - unlock_page(page);
> > - put_page(page);
> > - if (ret)
> > - return 0;
> > - }
> > - } else {
> > - spin_unlock(ptl);
> > - }
> > + ptl = pmd_trans_huge_lock(pmd, vma);
> > + if (ptl) {
> > + ret = queue_pages_pmd(pmd, ptl, addr, end, walk);
> > + if (ret)
> > + return 0;
> > }
>
> I wonder if we should introduce pte_entry function along with pmd_entry
> function as we are first looking for trans huge PMDs either for direct
> addition into the migration list or splitting it before looking for PTEs.
Most of pagewalk users don't define pte_entry because of performance reason
(to avoid the overhead of PTRS_PER_PMD function calls).
But that could be a nice cleanup if we have a workaround.
Thanks,
Naoya Horiguchi
On Mon, Nov 28, 2016 at 03:21:54PM +0100, Michal Hocko wrote:
> On Tue 08-11-16 08:31:49, Naoya Horiguchi wrote:
> > Introduces CONFIG_ARCH_ENABLE_THP_MIGRATION to limit thp migration
> > functionality to x86_64, which should be safer at the first step.
>
> Please make sure to describe why this has to be arch specific and what
> are arches supposed to provide in order to enable this option.
OK, the below will be added in the future version:
Thp migration is an arch-specific feature because it depends on the
arch-dependent behavior of non-present format of page table entry.
What you need to enable this option in other archs are:
- to define arch-specific transformation functions like __pmd_to_swp_entry()
and __swp_entry_to_pmd(),
- to make sure that arch-specific page table walking code can properly handle
!pmd_present case (gup_pmd_range() is a good example),
- (if your archs enables CONFIG_HAVE_ARCH_SOFT_DIRTY,) to define soft dirty
routines like pmd_swp_mksoft_dirty.
Thanks,
Naoya Horiguchi
On Mon, Nov 28, 2016 at 03:31:32PM +0100, Michal Hocko wrote:
> On Tue 08-11-16 08:31:50, Naoya Horiguchi wrote:
> > This patch prepares thp migration's core code. These code will be open when
> > unmap_and_move() stops unconditionally splitting thp and get_new_page() starts
> > to allocate destination thps.
>
> this description is underdocumented to say the least. Could you
> provide a high level documentation here please?
Yes, I'll do.
And maybe some update on Documentation/vm/page_migration will be wanted,
so will do it too.
Thanks,
Naoya Horiguchi
On Mon, Nov 28, 2016 at 03:33:32PM +0100, Michal Hocko wrote:
> On Tue 08-11-16 08:31:51, Naoya Horiguchi wrote:
> > This patch makes it possible to support thp migration gradually. If you fail
> > to allocate a destination page as a thp, you just split the source thp as we
> > do now, and then enter the normal page migration. If you succeed to allocate
> > destination thp, you enter thp migration. Subsequent patches actually enable
> > thp migration for each caller of page migration by allowing its get_new_page()
> > callback to allocate thps.
>
> Does this need to be in a separate patch? Wouldn't it make more sense to
> have the full THP migration code in a single one?
Actually, no big reason for separate patch. So I'm OK to merge this into
patch 5/12.
Thanks,
Naoya Horiguchi
On Tue 29-11-16 07:50:06, Naoya Horiguchi wrote:
> On Mon, Nov 28, 2016 at 03:21:54PM +0100, Michal Hocko wrote:
> > On Tue 08-11-16 08:31:49, Naoya Horiguchi wrote:
> > > Introduces CONFIG_ARCH_ENABLE_THP_MIGRATION to limit thp migration
> > > functionality to x86_64, which should be safer at the first step.
> >
> > Please make sure to describe why this has to be arch specific and what
> > are arches supposed to provide in order to enable this option.
>
> OK, the below will be added in the future version:
>
> Thp migration is an arch-specific feature because it depends on the
> arch-dependent behavior of non-present format of page table entry.
> What you need to enable this option in other archs are:
> - to define arch-specific transformation functions like __pmd_to_swp_entry()
> and __swp_entry_to_pmd(),
> - to make sure that arch-specific page table walking code can properly handle
> !pmd_present case (gup_pmd_range() is a good example),
> - (if your archs enables CONFIG_HAVE_ARCH_SOFT_DIRTY,) to define soft dirty
> routines like pmd_swp_mksoft_dirty.
Thanks this is _much_ better!
--
Michal Hocko
SUSE Labs