From: Zi Yan <[email protected]>
Hi all,
The patches are rebased on mmotm-2017-04-13-14-50 and incorporate the feedback
on the v4 patches.
Hi Kirill, I have cleaned up Patch 5 and Patch 6, so PTE-mapped THP migration is
handled fully by existing code. Can I have your Ack, at least on these
two patches?
Hi Naoya, as I significantly modified Patch 5 and Patch 6, I changed the author
to myself. Let me know if you are OK with it. If not, I will change them back.
I did a thorough check on all patches, and they should work well. Please consider
merging them. Otherwise, please give more suggestions.
Motivations
===========================================
1. THP migration becomes important in the upcoming heterogeneous memory systems.
As David Nellans from NVIDIA pointed out in another thread
(http://www.mail-archive.com/[email protected]/msg1349227.html),
future GPUs or other accelerators will have their memory managed by operating
systems. Moving data into and out of these memory nodes efficiently is critical
to applications that use GPUs or other accelerators. Existing page migration
only supports base pages, which yields very low memory bandwidth utilization.
My experiments (see below) show THP migration can migrate pages more efficiently.
2. Base page migration vs THP migration throughput.
Here are cross-socket page migration results from calling the
move_pages() syscall:
On x86_64, an Intel two-socket E5-2640v3 box,
single 4KB base page migration takes 62.47 us, using 0.06 GB/s BW,
single 2MB THP migration takes 658.54 us, using 2.97 GB/s BW,
512 4KB base page migrations take 1987.38 us, using 0.98 GB/s BW.
On ppc64, a two-socket Power8 box,
single 64KB base page migration takes 49.3 us, using 1.24 GB/s BW,
single 16MB THP migration takes 2202.17 us, using 7.10 GB/s BW,
256 64KB base page migrations take 2543.65 us, using 6.14 GB/s BW.
THP migration can give us 3x and 1.15x throughput over base page migration
on x86_64 and ppc64, respectively.
You can test it out by using the code here:
https://github.com/x-y-z/thp-migration-bench
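The bandwidth and speedup figures in item 2 can be rechecked from the reported
latencies. The helper below is only a sketch, not part of the benchmark: the
function names are made up for illustration, and it assumes that "GB/s" means
2^30 bytes per second, which is the unit under which the numbers above reproduce.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: the reported "GB/s" figures use binary units (2^30 bytes). */
#define BYTES_PER_GIB (1024.0 * 1024.0 * 1024.0)

/* Bandwidth, in GiB/s, achieved by moving `bytes` in `usec` microseconds. */
static double migration_bw_gib_per_sec(size_t bytes, double usec)
{
	assert(usec > 0.0);
	return ((double)bytes / BYTES_PER_GIB) / (usec / 1e6);
}

/*
 * Speedup of one THP migration over migrating the same amount of data
 * as base pages, given the two measured latencies in microseconds.
 */
static double thp_speedup(double thp_usec, double base_pages_usec)
{
	assert(thp_usec > 0.0);
	return base_pages_usec / thp_usec;
}
```

For example, migration_bw_gib_per_sec(2u << 20, 658.54) gives the x86_64 THP
bandwidth, and thp_speedup(658.54, 1987.38) gives the x86_64 speedup.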
3. Existing page migration splits THP before migration and cannot guarantee
the migrated pages are still contiguous. Contiguity is always what GPUs and
accelerators look for. Without THP migration, khugepaged needs to do extra work
to reassemble the migrated pages back to THPs.
ChangeLog
===========================================
Changes since v4:
* In Patch 5, I dropped PTE-mapped THP migration handling code, since it is
already well handled by existing code.
* In Patch 6, I did a thorough check on PMD handling places and corrected all
errors I discovered.
* In Patch 6, I use is_swap_pmd() to check PMD migration entries and add
VM_BUG_ON to make sure only migration entries are present. It should be useful
later when someone wants to add PMD swap entries, since VM_BUG_ON will
catch the missing code path.
* In Patch 6, I keep pmd_none() in pmd_none_or_trans_huge_or_clear_bad() to
avoid confusion on the function name. I also add a comment to explain it.
* In Patches 7-11, I added some missing soft-dirty bit preserving code and
corrected the page statistics accounting.
Changes since v3:
* I dropped my fix on zap_pmd_range() since THP migration will not trigger
it and Kirill has posted patches to fix the bug triggered by MADV_DONTNEED.
* In Patch 6, I used !pmd_present() instead of is_pmd_migration_entry()
in pmd_none_or_trans_huge_or_clear_bad() to avoid moving the function to
linux/swapops.h. Currently, !pmd_present() is equivalent to
is_pmd_migration_entry(). Any suggestions on this change are welcome.
Changes since v2:
* I fix a bug in zap_pmd_range() and include the fixes in Patches 1-3.
The racy check in zap_pmd_range() can miss pmd_protnone and pmd_migration_entry,
which leads to PTE page table not freed.
* In Patch 4, I move _PAGE_SWP_SOFT_DIRTY to bit 1, because bit 6 (used in v2)
can be set by some CPUs by mistake and the new swap entry format does not use
bits 1-4.
* I also adjust two core migration functions, set_pmd_migration_entry() and
remove_migration_pmd(), to use Kirill A. Shutemov's page_vma_mapped_walk()
function. Patch 8 needs Kirill's comments, since I also add changes
to his page_vma_mapped_walk() function with pmd_migration_entry handling.
* In Patch 8, I replace pmdp_huge_get_and_clear() with pmdp_huge_clear_flush()
in set_pmd_migration_entry() to avoid data corruption after page migration.
* In Patch 9, I include is_pmd_migration_entry() in pmd_none_or_trans_huge_or_clear_bad().
Otherwise, a pmd_migration_entry is treated as pmd_bad and cleared, which
leads to deposited PTE page table not freed.
* I personally use this patchset with my customized kernel to test frequent
page migrations by replacing page reclaim with page migration.
The bugs fixed in Patches 1-3 and 8 were discovered while I was testing my kernel.
I did a 16-hour stress test that has ~7 billion total page migrations.
No error or data corruption was found.
General description
===========================================
This patchset enhances page migration functionality to handle thp migration
for the various callers of page migration:
- mbind(2)
- move_pages(2)
- migrate_pages(2)
- cgroup/cpuset migration
- memory hotremove
- soft offline
The main benefit is that we can avoid unnecessary thp splits, which helps us
avoid a performance decrease when applications handle NUMA optimization on
their own.
The implementation is similar to that of normal page migration; the key point
is that we convert a pmd into a pmd migration entry in a swap-entry-like format.
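To make the swap-entry-like format concrete, here is a small userspace model of
such an entry. It is only a sketch: the demo_* names and the bit layout (bit 0
as "present", five type bits, then the pfn) are assumptions made up for
illustration and are not the x86_64 layout used by the patches.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout, for illustration only. */
#define DEMO_PRESENT      (1ull << 0)
#define DEMO_TYPE_SHIFT   1
#define DEMO_TYPE_BITS    5
#define DEMO_TYPE_MASK    ((1ull << DEMO_TYPE_BITS) - 1)
#define DEMO_OFFSET_SHIFT (DEMO_TYPE_SHIFT + DEMO_TYPE_BITS)

/* Nonzero types, so that an encoded entry is never the all-zero "none" value. */
enum demo_type { DEMO_MIGRATION_READ = 1, DEMO_MIGRATION_WRITE = 2 };

typedef struct { uint64_t val; } demo_entry_t;

/* Encode a migration entry: the type records writability, the offset is the pfn. */
static demo_entry_t demo_make_migration_entry(uint64_t pfn, int write)
{
	uint64_t type = write ? DEMO_MIGRATION_WRITE : DEMO_MIGRATION_READ;
	demo_entry_t e;

	assert(pfn < (1ull << (64 - DEMO_OFFSET_SHIFT)));
	e.val = (type << DEMO_TYPE_SHIFT) | (pfn << DEMO_OFFSET_SHIFT);
	return e;
}

static uint64_t demo_entry_type(demo_entry_t e)
{
	return (e.val >> DEMO_TYPE_SHIFT) & DEMO_TYPE_MASK;
}

static uint64_t demo_entry_pfn(demo_entry_t e)
{
	return e.val >> DEMO_OFFSET_SHIFT;
}

/* A migration entry is non-empty, non-present, and carries a migration type. */
static int demo_is_migration_entry(uint64_t pmdval)
{
	demo_entry_t e = { pmdval };
	uint64_t type = demo_entry_type(e);

	return pmdval != 0 && !(pmdval & DEMO_PRESENT) &&
	       (type == DEMO_MIGRATION_READ || type == DEMO_MIGRATION_WRITE);
}

static int demo_is_write_migration_entry(demo_entry_t e)
{
	return demo_entry_type(e) == DEMO_MIGRATION_WRITE;
}

/* Downgrade a write entry to a read entry while keeping the pfn. */
static void demo_make_entry_read(demo_entry_t *e)
{
	*e = demo_make_migration_entry(demo_entry_pfn(*e), 0);
}
```

The point of the model is that a single non-present word still carries enough
information (target page and write permission) to rebuild the mapping once
migration finishes.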
Any comments or advice are welcome.
Best Regards,
Yan Zi
Naoya Horiguchi (9):
mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1
mm: mempolicy: add queue_pages_node_check()
mm: thp: introduce separate TTU flag for thp freezing
mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATION
mm: soft-dirty: keep soft-dirty bits over thp migration
mm: hwpoison: soft offline supports thp migration
mm: mempolicy: mbind and migrate_pages support thp migration
mm: migrate: move_pages() supports thp migration
mm: memory_hotplug: memory hotremove supports thp migration
Zi Yan (2):
mm: thp: enable thp migration in generic path
mm: thp: check pmd migration entry in common path
arch/x86/Kconfig | 4 +
arch/x86/include/asm/pgtable.h | 17 ++++
arch/x86/include/asm/pgtable_64.h | 14 ++-
arch/x86/include/asm/pgtable_types.h | 10 +--
arch/x86/mm/gup.c | 7 +-
fs/proc/task_mmu.c | 57 +++++++-----
include/asm-generic/pgtable.h | 51 ++++++++++-
include/linux/huge_mm.h | 32 ++++++-
include/linux/rmap.h | 3 +-
include/linux/swapops.h | 71 ++++++++++++++-
mm/Kconfig | 3 +
mm/gup.c | 22 ++++-
mm/huge_memory.c | 170 ++++++++++++++++++++++++++++++++---
mm/memcontrol.c | 5 ++
mm/memory-failure.c | 35 +++-----
mm/memory.c | 12 ++-
mm/memory_hotplug.c | 17 +++-
mm/mempolicy.c | 124 ++++++++++++++++++-------
mm/migrate.c | 77 ++++++++++++----
mm/mprotect.c | 4 +-
mm/mremap.c | 2 +-
mm/page_vma_mapped.c | 13 ++-
mm/pgtable-generic.c | 3 +-
mm/rmap.c | 18 +++-
24 files changed, 632 insertions(+), 139 deletions(-)
--
2.11.0
From: Zi Yan <[email protected]>
This patch adds thp migration's core code, including conversions
between a PMD entry and a swap entry, setting PMD migration entry,
removing PMD migration entry, and waiting on PMD migration entries.
This patch makes it possible to support thp migration.
If you fail to allocate a destination page as a thp, you just split
the source thp as we do now, and then enter the normal page migration.
If you succeed in allocating a destination thp, you enter thp migration.
Subsequent patches actually enable thp migration for each caller of
page migration by allowing its get_new_page() callback to
allocate thps.
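The fallback rule described above can be summarized as a small decision
function. This is a userspace sketch with made-up names (demo_path,
demo_choose_path), not kernel code; it only models the choice between the
three paths based on whether the source and the newly allocated destination
page are thps.

```c
#include <stdbool.h>

/* Hypothetical labels for the three paths a page can take. */
enum demo_path {
	DEMO_MIGRATE_BASE,       /* normal base page migration */
	DEMO_MIGRATE_THP,        /* migrate the thp as a whole */
	DEMO_SPLIT_THEN_MIGRATE, /* split the source thp, then migrate base pages */
};

/*
 * Choose the path from the source page type and from what the
 * get_new_page() callback managed to allocate as the destination.
 */
static enum demo_path demo_choose_path(bool src_is_thp, bool dst_is_thp)
{
	if (src_is_thp && !dst_is_thp)
		return DEMO_SPLIT_THEN_MIGRATE;
	if (src_is_thp)
		return DEMO_MIGRATE_THP;
	return DEMO_MIGRATE_BASE;
}
```

Callers that do not yet allocate thps in get_new_page() therefore keep today's
split-then-migrate behavior.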
ChangeLog v1 -> v2:
- support pte-mapped thp, doubly-mapped thp
Signed-off-by: Naoya Horiguchi <[email protected]>
ChangeLog v2 -> v3:
- use page_vma_mapped_walk()
- use pmdp_huge_clear_flush() instead of pmdp_huge_get_and_clear() in
set_pmd_migration_entry()
ChangeLog v3 -> v4:
- factor out the code of removing pte pgtable page in zap_huge_pmd()
ChangeLog v4 -> v5:
- remove unnecessary PTE-mapped THP code in remove_migration_pmd()
and set_pmd_migration_entry()
- restructure the code in zap_huge_pmd() to avoid factoring out
the pte pgtable page code
- in zap_huge_pmd(), check that PMD swap entries are migration entries
- change author information
Signed-off-by: Zi Yan <[email protected]>
---
arch/x86/include/asm/pgtable_64.h | 2 +
include/linux/swapops.h | 69 +++++++++++++++++++++++++++++++-
mm/huge_memory.c | 84 ++++++++++++++++++++++++++++++++++++---
mm/migrate.c | 30 +++++++++++++-
mm/page_vma_mapped.c | 13 ++++--
mm/pgtable-generic.c | 3 +-
mm/rmap.c | 11 +++++
7 files changed, 200 insertions(+), 12 deletions(-)
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 770b5ae271ed..bd0252630bb3 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -187,7 +187,9 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
((type) << (SWP_TYPE_FIRST_BIT)) \
| ((offset) << SWP_OFFSET_FIRST_BIT) })
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val((pte)) })
+#define __pmd_to_swp_entry(pmd) ((swp_entry_t) { pmd_val((pmd)) })
#define __swp_entry_to_pte(x) ((pte_t) { .pte = (x).val })
+#define __swp_entry_to_pmd(x) ((pmd_t) { .pmd = (x).val })
extern int kern_addr_valid(unsigned long addr);
extern void cleanup_highmap(void);
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 5c3a5f3e7eec..c543c6f25e8f 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -103,7 +103,8 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
#ifdef CONFIG_MIGRATION
static inline swp_entry_t make_migration_entry(struct page *page, int write)
{
- BUG_ON(!PageLocked(page));
+ BUG_ON(!PageLocked(compound_head(page)));
+
return swp_entry(write ? SWP_MIGRATION_WRITE : SWP_MIGRATION_READ,
page_to_pfn(page));
}
@@ -126,7 +127,7 @@ static inline struct page *migration_entry_to_page(swp_entry_t entry)
* Any use of migration entries may only occur while the
* corresponding page is locked
*/
- BUG_ON(!PageLocked(p));
+ BUG_ON(!PageLocked(compound_head(p)));
return p;
}
@@ -163,6 +164,70 @@ static inline int is_write_migration_entry(swp_entry_t entry)
#endif
+struct page_vma_mapped_walk;
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+extern void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
+ struct page *page);
+
+extern void remove_migration_pmd(struct page_vma_mapped_walk *pvmw,
+ struct page *new);
+
+extern void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd);
+
+static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
+{
+ swp_entry_t arch_entry;
+
+ arch_entry = __pmd_to_swp_entry(pmd);
+ return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
+}
+
+static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
+{
+ swp_entry_t arch_entry;
+
+ arch_entry = __swp_entry(swp_type(entry), swp_offset(entry));
+ return __swp_entry_to_pmd(arch_entry);
+}
+
+static inline int is_pmd_migration_entry(pmd_t pmd)
+{
+ return !pmd_present(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
+}
+#else
+static inline void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
+ struct page *page)
+{
+ BUILD_BUG();
+}
+
+static inline void remove_migration_pmd(struct page_vma_mapped_walk *pvmw,
+ struct page *new)
+{
+ BUILD_BUG();
+}
+
+static inline void pmd_migration_entry_wait(struct mm_struct *m, pmd_t *p) { }
+
+static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
+{
+ BUILD_BUG();
+ return swp_entry(0, 0);
+}
+
+static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
+{
+ BUILD_BUG();
+ return (pmd_t){{ 0 }};
+}
+
+static inline int is_pmd_migration_entry(pmd_t pmd)
+{
+ return 0;
+}
+#endif
+
#ifdef CONFIG_MEMORY_FAILURE
extern atomic_long_t num_poisoned_pages __read_mostly;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0db1f1c90aad..7406d88445bf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1633,10 +1633,23 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
spin_unlock(ptl);
tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
} else {
- struct page *page = pmd_page(orig_pmd);
- page_remove_rmap(page, true);
- VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
- VM_BUG_ON_PAGE(!PageHead(page), page);
+ struct page *page = NULL;
+ int migration = 0;
+
+ if (pmd_present(orig_pmd)) {
+ page = pmd_page(orig_pmd);
+ page_remove_rmap(page, true);
+ VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
+ VM_BUG_ON_PAGE(!PageHead(page), page);
+ } else {
+ swp_entry_t entry;
+
+ VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
+ entry = pmd_to_swp_entry(orig_pmd);
+ page = pfn_to_page(swp_offset(entry));
+ migration = 1;
+ }
+
if (PageAnon(page)) {
zap_deposited_table(tlb->mm, pmd);
add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
@@ -1645,8 +1658,10 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
zap_deposited_table(tlb->mm, pmd);
add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
}
+
spin_unlock(ptl);
- tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
+ if (!migration)
+ tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
}
return 1;
}
@@ -2669,3 +2684,62 @@ static int __init split_huge_pages_debugfs(void)
}
late_initcall(split_huge_pages_debugfs);
#endif
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
+ struct page *page)
+{
+ struct vm_area_struct *vma = pvmw->vma;
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long address = pvmw->address;
+ pmd_t pmdval;
+ swp_entry_t entry;
+
+ if (!(pvmw->pmd && !pvmw->pte))
+ return;
+
+ mmu_notifier_invalidate_range_start(mm, address,
+ address + HPAGE_PMD_SIZE);
+
+ flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
+ pmdval = pmdp_huge_clear_flush(vma, address, pvmw->pmd);
+ if (pmd_dirty(pmdval))
+ set_page_dirty(page);
+ entry = make_migration_entry(page, pmd_write(pmdval));
+ pmdval = swp_entry_to_pmd(entry);
+ set_pmd_at(mm, address, pvmw->pmd, pmdval);
+ page_remove_rmap(page, true);
+ put_page(page);
+
+ mmu_notifier_invalidate_range_end(mm, address,
+ address + HPAGE_PMD_SIZE);
+}
+
+void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
+{
+ struct vm_area_struct *vma = pvmw->vma;
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long address = pvmw->address;
+ unsigned long mmun_start = address & HPAGE_PMD_MASK;
+ unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
+ pmd_t pmde;
+ swp_entry_t entry;
+
+ if (!(pvmw->pmd && !pvmw->pte))
+ return;
+
+ entry = pmd_to_swp_entry(*pvmw->pmd);
+ get_page(new);
+ pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));
+ if (is_write_migration_entry(entry))
+ pmde = maybe_pmd_mkwrite(pmde, vma);
+
+ flush_cache_range(vma, mmun_start, mmun_end);
+ page_add_anon_rmap(new, vma, mmun_start, true);
+ set_pmd_at(mm, mmun_start, pvmw->pmd, pmde);
+ flush_tlb_range(vma, mmun_start, mmun_end);
+ if (vma->vm_flags & VM_LOCKED)
+ mlock_vma_page(new);
+ update_mmu_cache_pmd(vma, address, pvmw->pmd);
+}
+#endif
diff --git a/mm/migrate.c b/mm/migrate.c
index 5cfe3c27bcbe..bbc856264b69 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -214,6 +214,13 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
new = page - pvmw.page->index +
linear_page_index(vma, pvmw.address);
+ /* PMD-mapped THP migration entry */
+ if (!pvmw.pte && pvmw.page) {
+ VM_BUG_ON_PAGE(PageHuge(page) || !PageTransCompound(page), page);
+ remove_migration_pmd(&pvmw, new);
+ continue;
+ }
+
get_page(new);
pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot)));
if (pte_swp_soft_dirty(*pvmw.pte))
@@ -327,6 +334,27 @@ void migration_entry_wait_huge(struct vm_area_struct *vma,
__migration_entry_wait(mm, pte, ptl);
}
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
+{
+ spinlock_t *ptl;
+ struct page *page;
+
+ ptl = pmd_lock(mm, pmd);
+ if (!is_pmd_migration_entry(*pmd))
+ goto unlock;
+ page = migration_entry_to_page(pmd_to_swp_entry(*pmd));
+ if (!get_page_unless_zero(page))
+ goto unlock;
+ spin_unlock(ptl);
+ wait_on_page_locked(page);
+ put_page(page);
+ return;
+unlock:
+ spin_unlock(ptl);
+}
+#endif
+
#ifdef CONFIG_BLOCK
/* Returns true if all buffers are successfully locked */
static bool buffer_migrate_lock_buffers(struct buffer_head *head,
@@ -1085,7 +1113,7 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
goto out;
}
- if (unlikely(PageTransHuge(page))) {
+ if (unlikely(PageTransHuge(page) && !PageTransHuge(newpage))) {
lock_page(page);
rc = split_huge_page(page);
unlock_page(page);
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index de9c40d7304a..e209a12d8722 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -137,16 +137,23 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
if (!pud_present(*pud))
return false;
pvmw->pmd = pmd_offset(pud, pvmw->address);
- if (pmd_trans_huge(*pvmw->pmd)) {
+ if (pmd_trans_huge(*pvmw->pmd) || is_pmd_migration_entry(*pvmw->pmd)) {
pvmw->ptl = pmd_lock(mm, pvmw->pmd);
- if (!pmd_present(*pvmw->pmd))
- return not_found(pvmw);
if (likely(pmd_trans_huge(*pvmw->pmd))) {
if (pvmw->flags & PVMW_MIGRATION)
return not_found(pvmw);
if (pmd_page(*pvmw->pmd) != page)
return not_found(pvmw);
return true;
+ } else if (!pmd_present(*pvmw->pmd)) {
+ if (unlikely(is_migration_entry(pmd_to_swp_entry(*pvmw->pmd)))) {
+ swp_entry_t entry = pmd_to_swp_entry(*pvmw->pmd);
+
+ if (migration_entry_to_page(entry) != page)
+ return not_found(pvmw);
+ return true;
+ }
+ return not_found(pvmw);
} else {
/* THP pmd was split under us: handle on pte level */
spin_unlock(pvmw->ptl);
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index c99d9512a45b..1175f6a24fdb 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -124,7 +124,8 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
{
pmd_t pmd;
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
- VM_BUG_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
+ VM_BUG_ON((pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
+ !pmd_devmap(*pmdp)) || !pmd_present(*pmdp));
pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
return pmd;
diff --git a/mm/rmap.c b/mm/rmap.c
index b0c6b20dca74..b9505f15c099 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1302,6 +1302,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
bool ret = true;
enum ttu_flags flags = (enum ttu_flags)arg;
+
/* munlock has nothing to gain from examining un-locked vmas */
if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
return true;
@@ -1312,6 +1313,16 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
}
while (page_vma_mapped_walk(&pvmw)) {
+ /* PMD-mapped THP migration entry */
+ if (flags & TTU_MIGRATION) {
+ if (!pvmw.pte && page) {
+ VM_BUG_ON_PAGE(PageHuge(page) || !PageTransCompound(page),
+ page);
+ set_pmd_migration_entry(&pvmw, page);
+ continue;
+ }
+ }
+
/*
* If the page is mlock()d, we cannot swap it out.
* If it's recently referenced (perhaps page_referenced
--
2.11.0
From: Naoya Horiguchi <[email protected]>
Introduce CONFIG_ARCH_ENABLE_THP_MIGRATION to limit thp migration
functionality to x86_64, which should be safer as a first step.
Signed-off-by: Naoya Horiguchi <[email protected]>
---
v1 -> v2:
- fixed config name in subject and patch description
---
arch/x86/Kconfig | 4 ++++
include/linux/huge_mm.h | 10 ++++++++++
mm/Kconfig | 3 +++
3 files changed, 17 insertions(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 69188841717a..97d094c67110 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2276,6 +2276,10 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
def_bool y
depends on X86_64 && HUGETLB_PAGE && MIGRATION
+config ARCH_ENABLE_THP_MIGRATION
+ def_bool y
+ depends on X86_64 && TRANSPARENT_HUGEPAGE
+
menu "Power management and ACPI options"
config ARCH_HIBERNATION_HEADER
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a3762d49ba39..1b81cb57ff0f 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -212,6 +212,11 @@ void mm_put_huge_zero_page(struct mm_struct *mm);
#define mk_huge_pmd(page, prot) pmd_mkhuge(mk_pmd(page, prot))
+static inline bool thp_migration_supported(void)
+{
+ return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
+}
+
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
@@ -306,6 +311,11 @@ static inline struct page *follow_devmap_pud(struct vm_area_struct *vma,
{
return NULL;
}
+
+static inline bool thp_migration_supported(void)
+{
+ return false;
+}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 9b8fccb969dc..317a2f973720 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -289,6 +289,9 @@ config MIGRATION
config ARCH_ENABLE_HUGEPAGE_MIGRATION
bool
+config ARCH_ENABLE_THP_MIGRATION
+ bool
+
config PHYS_ADDR_T_64BIT
def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
--
2.11.0
From: Naoya Horiguchi <[email protected]>
This patch enables thp migration for memory hotremove.
Signed-off-by: Naoya Horiguchi <[email protected]>
---
ChangeLog v1->v2:
- base code switched from alloc_migrate_target to new_node_page()
---
include/linux/huge_mm.h | 8 ++++++++
mm/memory_hotplug.c | 17 ++++++++++++++---
2 files changed, 22 insertions(+), 3 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 6f44a2352597..92c2161704c3 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -189,6 +189,13 @@ static inline int hpage_nr_pages(struct page *page)
return 1;
}
+static inline int hpage_order(struct page *page)
+{
+ if (unlikely(PageTransHuge(page)))
+ return HPAGE_PMD_ORDER;
+ return 0;
+}
+
struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, int flags);
struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
@@ -233,6 +240,7 @@ static inline bool thp_migration_supported(void)
#define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })
#define hpage_nr_pages(x) 1
+#define hpage_order(x) 0
#define transparent_hugepage_enabled(__vma) 0
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 257166ebdff0..ecae0852994f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1574,6 +1574,7 @@ static struct page *new_node_page(struct page *page, unsigned long private,
int nid = page_to_nid(page);
nodemask_t nmask = node_states[N_MEMORY];
struct page *new_page = NULL;
+ unsigned int order = 0;
/*
* TODO: allocate a destination hugepage from a nearest neighbor node,
@@ -1584,6 +1585,11 @@ static struct page *new_node_page(struct page *page, unsigned long private,
return alloc_huge_page_node(page_hstate(compound_head(page)),
next_node_in(nid, nmask));
+ if (thp_migration_supported() && PageTransHuge(page)) {
+ order = hpage_order(page);
+ gfp_mask |= GFP_TRANSHUGE;
+ }
+
node_clear(nid, nmask);
if (PageHighMem(page)
@@ -1591,12 +1597,15 @@ static struct page *new_node_page(struct page *page, unsigned long private,
gfp_mask |= __GFP_HIGHMEM;
if (!nodes_empty(nmask))
- new_page = __alloc_pages_nodemask(gfp_mask, 0,
+ new_page = __alloc_pages_nodemask(gfp_mask, order,
node_zonelist(nid, gfp_mask), &nmask);
if (!new_page)
- new_page = __alloc_pages(gfp_mask, 0,
+ new_page = __alloc_pages(gfp_mask, order,
node_zonelist(nid, gfp_mask));
+ if (new_page && order == hpage_order(page))
+ prep_transhuge_page(new_page);
+
return new_page;
}
@@ -1626,7 +1635,9 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
if (isolate_huge_page(page, &source))
move_pages -= 1 << compound_order(head);
continue;
- }
+ } else if (thp_migration_supported() && PageTransHuge(page))
+ pfn = page_to_pfn(compound_head(page))
+ + hpage_nr_pages(page) - 1;
if (!get_page_unless_zero(page))
continue;
--
2.11.0
From: Zi Yan <[email protected]>
Once one of the callers of page migration starts to handle thp,
memory management code starts to see pmd migration entries, so we need
to prepare for them before enabling thp migration. This patch changes the
various code points that check the status of a given pmd, in order to
prevent races between thp migration and other pmd-related work.
ChangeLog v1 -> v2:
- introduce pmd_related() (I know the naming is not good, but I can't
think of a better name. Any suggestion is welcome.)
Signed-off-by: Naoya Horiguchi <[email protected]>
ChangeLog v2 -> v3:
- add is_swap_pmd()
- a pmd entry should be pmd pointing to pte pages, is_swap_pmd(),
pmd_trans_huge(), pmd_devmap(), or pmd_none()
- pmd_none_or_trans_huge_or_clear_bad() and pmd_trans_unstable() return
true on pmd_migration_entry, so that migration entries are not
treated as pmd page table entries.
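The pmd states listed in the v2 -> v3 changelog can be modeled as a simple
classifier. This is a userspace sketch only: the DEMO_* flag bits and the
helper names are hypothetical and do not match any real architecture's pmd
layout. It shows why a non-present, non-empty pmd has to be told apart from a
pmd that points to a page of ptes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for a toy pmd value; NOT a real pmd layout. */
#define DEMO_PMD_PRESENT (1ull << 0)
#define DEMO_PMD_HUGE    (1ull << 1)
#define DEMO_PMD_DEVMAP  (1ull << 2)

enum demo_pmd_kind {
	DEMO_KIND_NONE,       /* empty slot, like pmd_none() */
	DEMO_KIND_SWAP,       /* non-present and non-empty, like is_swap_pmd() */
	DEMO_KIND_DEVMAP,     /* present device mapping, like pmd_devmap() */
	DEMO_KIND_TRANS_HUGE, /* present huge mapping, like pmd_trans_huge() */
	DEMO_KIND_TABLE,      /* points to a page of ptes */
};

/* Classify a toy pmd value into exactly one of the five kinds. */
static enum demo_pmd_kind demo_classify_pmd(uint64_t pmdval)
{
	if (pmdval == 0)
		return DEMO_KIND_NONE;
	if (!(pmdval & DEMO_PMD_PRESENT))
		return DEMO_KIND_SWAP;
	if (pmdval & DEMO_PMD_DEVMAP)
		return DEMO_KIND_DEVMAP;
	if (pmdval & DEMO_PMD_HUGE)
		return DEMO_KIND_TRANS_HUGE;
	return DEMO_KIND_TABLE;
}

/* Only a pmd that points to a pte page may be descended into. */
static bool demo_may_walk_ptes(uint64_t pmdval)
{
	return demo_classify_pmd(pmdval) == DEMO_KIND_TABLE;
}
```

In this model a migration entry falls into DEMO_KIND_SWAP, so
demo_may_walk_ptes() rejects it instead of treating it as a page table.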
ChangeLog v4 -> v5:
- add explanation in pmd_none_or_trans_huge_or_clear_bad() to state
the equivalence of !pmd_present() and is_pmd_migration_entry()
- fix migration entry wait deadlock code (from v1) in follow_page_mask()
- remove unnecessary code (from v1) in follow_trans_huge_pmd()
- use is_swap_pmd() instead of !pmd_present() for pmd migration entry,
so it will not be confused with pmd_none()
- change author information
Signed-off-by: Zi Yan <[email protected]>
---
arch/x86/mm/gup.c | 7 +++--
fs/proc/task_mmu.c | 30 +++++++++++++--------
include/asm-generic/pgtable.h | 17 +++++++++++-
include/linux/huge_mm.h | 14 ++++++++--
mm/gup.c | 22 ++++++++++++++--
mm/huge_memory.c | 61 ++++++++++++++++++++++++++++++++++++++-----
mm/memcontrol.c | 5 ++++
mm/memory.c | 12 +++++++--
mm/mprotect.c | 4 +--
mm/mremap.c | 2 +-
10 files changed, 145 insertions(+), 29 deletions(-)
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 456dfdfd2249..096bbcc801e6 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -9,6 +9,7 @@
#include <linux/vmstat.h>
#include <linux/highmem.h>
#include <linux/swap.h>
+#include <linux/swapops.h>
#include <linux/memremap.h>
#include <asm/mmu_context.h>
@@ -243,9 +244,11 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
pmd_t pmd = *pmdp;
next = pmd_addr_end(addr, end);
- if (pmd_none(pmd))
+ if (!pmd_present(pmd)) {
+ VM_BUG_ON(is_swap_pmd(pmd) && IS_ENABLED(CONFIG_MIGRATION) &&
+ !is_pmd_migration_entry(pmd));
return 0;
- if (unlikely(pmd_large(pmd) || !pmd_present(pmd))) {
+ } else if (unlikely(pmd_large(pmd))) {
/*
* NUMA hinting faults need to be handled in the GUP
* slowpath for accounting purposes and so that they
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 5c8359704601..57489dcd71c4 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -600,7 +600,8 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
ptl = pmd_trans_huge_lock(pmd, vma);
if (ptl) {
- smaps_pmd_entry(pmd, addr, walk);
+ if (pmd_present(*pmd))
+ smaps_pmd_entry(pmd, addr, walk);
spin_unlock(ptl);
return 0;
}
@@ -942,6 +943,9 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
goto out;
}
+ if (!pmd_present(*pmd))
+ goto out;
+
page = pmd_page(*pmd);
/* Clear accessed and referenced bits. */
@@ -1221,28 +1225,32 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
if (ptl) {
u64 flags = 0, frame = 0;
pmd_t pmd = *pmdp;
+ struct page *page = NULL;
if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(pmd))
flags |= PM_SOFT_DIRTY;
- /*
- * Currently pmd for thp is always present because thp
- * can not be swapped-out, migrated, or HWPOISONed
- * (split in such cases instead.)
- * This if-check is just to prepare for future implementation.
- */
if (pmd_present(pmd)) {
- struct page *page = pmd_page(pmd);
-
- if (page_mapcount(page) == 1)
- flags |= PM_MMAP_EXCLUSIVE;
+ page = pmd_page(pmd);
flags |= PM_PRESENT;
if (pm->show_pfn)
frame = pmd_pfn(pmd) +
((addr & ~PMD_MASK) >> PAGE_SHIFT);
+ } else if (is_swap_pmd(pmd)) {
+ swp_entry_t entry = pmd_to_swp_entry(pmd);
+
+ frame = swp_type(entry) |
+ (swp_offset(entry) << MAX_SWAPFILES_SHIFT);
+ flags |= PM_SWAP;
+ VM_BUG_ON(IS_ENABLED(CONFIG_MIGRATION) &&
+ !is_pmd_migration_entry(pmd));
+ page = migration_entry_to_page(entry);
}
+ if (page && page_mapcount(page) == 1)
+ flags |= PM_MMAP_EXCLUSIVE;
+
for (; addr != end; addr += PAGE_SIZE) {
pagemap_entry_t pme = make_pme(frame, flags);
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 1fad160f35de..23bf18116df4 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -809,7 +809,22 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
barrier();
#endif
- if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
+ /*
+ * !pmd_present() checks for pmd migration entries
+ *
+ * The complete check uses is_pmd_migration_entry() in linux/swapops.h
+ * But using that requires moving current function and pmd_trans_unstable()
+ * to linux/swapops.h to resolve dependency, which is too much code move.
+ *
+ * !pmd_present() is equivalent to is_pmd_migration_entry() currently,
+ * because !pmd_present() pages can only be under migration not swapped
+ * out.
+ *
+ * pmd_none() is preserved for future condition checks on pmd migration
+ * entries and not confusing with this function name, although it is
+ * redundant with !pmd_present().
+ */
+ if (pmd_none(pmdval) || pmd_trans_huge(pmdval) || !pmd_present(pmdval))
return 1;
if (unlikely(pmd_bad(pmdval))) {
pmd_clear_bad(pmd);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 1b81cb57ff0f..6f44a2352597 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -126,7 +126,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
#define split_huge_pmd(__vma, __pmd, __address) \
do { \
pmd_t *____pmd = (__pmd); \
- if (pmd_trans_huge(*____pmd) \
+ if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd) \
|| pmd_devmap(*____pmd)) \
__split_huge_pmd(__vma, __pmd, __address, \
false, NULL); \
@@ -157,12 +157,18 @@ extern spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd,
struct vm_area_struct *vma);
extern spinlock_t *__pud_trans_huge_lock(pud_t *pud,
struct vm_area_struct *vma);
+
+static inline int is_swap_pmd(pmd_t pmd)
+{
+ return !pmd_none(pmd) && !pmd_present(pmd);
+}
+
/* mmap_sem must be held on entry */
static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
struct vm_area_struct *vma)
{
VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma);
- if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
+ if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
return __pmd_trans_huge_lock(pmd, vma);
else
return NULL;
@@ -269,6 +275,10 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
long adjust_next)
{
}
+static inline int is_swap_pmd(pmd_t pmd)
+{
+ return 0;
+}
static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
struct vm_area_struct *vma)
{
diff --git a/mm/gup.c b/mm/gup.c
index 4039ec2993d3..b24c7d10aced 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -278,6 +278,16 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
return page;
return no_page_table(vma, flags);
}
+retry:
+ if (!pmd_present(*pmd)) {
+ if (likely(!(flags & FOLL_MIGRATION)))
+ return no_page_table(vma, flags);
+ VM_BUG_ON(IS_ENABLED(CONFIG_MIGRATION) &&
+ !is_pmd_migration_entry(*pmd));
+ if (is_pmd_migration_entry(*pmd))
+ pmd_migration_entry_wait(mm, pmd);
+ goto retry;
+ }
if (pmd_devmap(*pmd)) {
ptl = pmd_lock(mm, pmd);
page = follow_devmap_pmd(vma, address, pmd, flags);
@@ -291,7 +301,15 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
return no_page_table(vma, flags);
+retry_locked:
ptl = pmd_lock(mm, pmd);
+ if (unlikely(!pmd_present(*pmd))) {
+ spin_unlock(ptl);
+ if (likely(!(flags & FOLL_MIGRATION)))
+ return no_page_table(vma, flags);
+ pmd_migration_entry_wait(mm, pmd);
+ goto retry_locked;
+ }
if (unlikely(!pmd_trans_huge(*pmd))) {
spin_unlock(ptl);
return follow_page_pte(vma, address, pmd, flags);
@@ -350,7 +368,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
pud = pud_offset(p4d, address);
BUG_ON(pud_none(*pud));
pmd = pmd_offset(pud, address);
- if (pmd_none(*pmd))
+ if (!pmd_present(*pmd))
return -EFAULT;
VM_BUG_ON(pmd_trans_huge(*pmd));
pte = pte_offset_map(pmd, address);
@@ -1378,7 +1396,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
pmd_t pmd = READ_ONCE(*pmdp);
next = pmd_addr_end(addr, end);
- if (pmd_none(pmd))
+ if (!pmd_present(pmd))
return 0;
if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7406d88445bf..3479e9caf2fa 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -912,6 +912,22 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
ret = -EAGAIN;
pmd = *src_pmd;
+
+ if (unlikely(is_swap_pmd(pmd))) {
+ swp_entry_t entry = pmd_to_swp_entry(pmd);
+
+ VM_BUG_ON(IS_ENABLED(CONFIG_MIGRATION) &&
+ !is_pmd_migration_entry(pmd));
+ if (is_write_migration_entry(entry)) {
+ make_migration_entry_read(&entry);
+ pmd = swp_entry_to_pmd(entry);
+ set_pmd_at(src_mm, addr, src_pmd, pmd);
+ }
+ set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+ ret = 0;
+ goto out_unlock;
+ }
+
if (unlikely(!pmd_trans_huge(pmd))) {
pte_free(dst_mm, pgtable);
goto out_unlock;
@@ -1218,6 +1234,9 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
if (unlikely(!pmd_same(*vmf->pmd, orig_pmd)))
goto out_unlock;
+ if (unlikely(!pmd_present(orig_pmd)))
+ goto out_unlock;
+
page = pmd_page(orig_pmd);
VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
/*
@@ -1548,6 +1567,12 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
if (is_huge_zero_pmd(orig_pmd))
goto out;
+ if (unlikely(!pmd_present(orig_pmd))) {
+ VM_BUG_ON(IS_ENABLED(CONFIG_MIGRATION) &&
+ !is_pmd_migration_entry(orig_pmd));
+ goto out;
+ }
+
page = pmd_page(orig_pmd);
/*
* If other processes are mapping this page, we couldn't discard
@@ -1758,6 +1783,21 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
preserve_write = prot_numa && pmd_write(*pmd);
ret = 1;
+ if (is_swap_pmd(*pmd)) {
+ swp_entry_t entry = pmd_to_swp_entry(*pmd);
+
+ VM_BUG_ON(IS_ENABLED(CONFIG_MIGRATION) &&
+ !is_pmd_migration_entry(*pmd));
+ if (is_write_migration_entry(entry)) {
+ pmd_t newpmd;
+
+ make_migration_entry_read(&entry);
+ newpmd = swp_entry_to_pmd(entry);
+ set_pmd_at(mm, addr, pmd, newpmd);
+ }
+ goto unlock;
+ }
+
/*
* Avoid trapping faults against the zero page. The read-only
* data is likely to be read-cached on the local CPU and
@@ -1823,7 +1863,8 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
{
spinlock_t *ptl;
ptl = pmd_lock(vma->vm_mm, pmd);
- if (likely(pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
+ if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
+ pmd_devmap(*pmd)))
return ptl;
spin_unlock(ptl);
return NULL;
@@ -1941,14 +1982,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
struct page *page;
pgtable_t pgtable;
pmd_t _pmd;
- bool young, write, dirty, soft_dirty;
+ bool young, write, dirty, soft_dirty, pmd_migration;
unsigned long addr;
int i;
VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
- VM_BUG_ON(!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd));
+ VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
+ && !pmd_devmap(*pmd));
count_vm_event(THP_SPLIT_PMD);
@@ -1973,7 +2015,14 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
return __split_huge_zero_page_pmd(vma, haddr, pmd);
}
- page = pmd_page(*pmd);
+ pmd_migration = is_pmd_migration_entry(*pmd);
+ if (pmd_migration) {
+ swp_entry_t entry;
+
+ entry = pmd_to_swp_entry(*pmd);
+ page = pfn_to_page(swp_offset(entry));
+ } else
+ page = pmd_page(*pmd);
VM_BUG_ON_PAGE(!page_count(page), page);
page_ref_add(page, HPAGE_PMD_NR - 1);
write = pmd_write(*pmd);
@@ -1992,7 +2041,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
* transferred to avoid any possibility of altering
* permissions across VMAs.
*/
- if (freeze) {
+ if (freeze || pmd_migration) {
swp_entry_t swp_entry;
swp_entry = make_migration_entry(page + i, write);
entry = swp_entry_to_pte(swp_entry);
@@ -2091,7 +2140,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
page = pmd_page(*pmd);
if (PageMlocked(page))
clear_page_mlock(page);
- } else if (!pmd_devmap(*pmd))
+ } else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd)))
goto out;
__split_huge_pmd_locked(vma, pmd, haddr, freeze);
out:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 16c556ac103d..ca4016198076 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4628,6 +4628,11 @@ static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
struct page *page = NULL;
enum mc_target_type ret = MC_TARGET_NONE;
+ if (unlikely(is_swap_pmd(pmd))) {
+ VM_BUG_ON(IS_ENABLED(CONFIG_MIGRATION) &&
+ !is_pmd_migration_entry(pmd));
+ return ret;
+ }
page = pmd_page(pmd);
VM_BUG_ON_PAGE(!page || !PageHead(page), page);
if (!(mc.flags & MOVE_ANON))
diff --git a/mm/memory.c b/mm/memory.c
index 9c82e25141ba..b2de091d046f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1032,7 +1032,8 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
src_pmd = pmd_offset(src_pud, addr);
do {
next = pmd_addr_end(addr, end);
- if (pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)) {
+ if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
+ || pmd_devmap(*src_pmd)) {
int err;
VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, vma);
err = copy_huge_pmd(dst_mm, src_mm,
@@ -1292,7 +1293,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
- if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+ if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE) {
VM_BUG_ON_VMA(vma_is_anonymous(vma) &&
!rwsem_is_locked(&tlb->mm->mmap_sem), vma);
@@ -3818,6 +3819,13 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
pmd_t orig_pmd = *vmf.pmd;
barrier();
+ if (unlikely(is_swap_pmd(orig_pmd))) {
+ VM_BUG_ON(IS_ENABLED(CONFIG_MIGRATION) &&
+ !is_pmd_migration_entry(orig_pmd));
+ if (is_pmd_migration_entry(orig_pmd))
+ pmd_migration_entry_wait(mm, vmf.pmd);
+ return 0;
+ }
if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
return do_huge_pmd_numa_page(&vmf, orig_pmd);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 3e1a9015c500..59999ac6b1e9 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -150,7 +150,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
unsigned long this_pages;
next = pmd_addr_end(addr, end);
- if (!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
+ if (!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
&& pmd_none_or_clear_bad(pmd))
continue;
@@ -160,7 +160,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
mmu_notifier_invalidate_range_start(mm, mni_start, end);
}
- if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+ if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE) {
__split_huge_pmd(vma, pmd, addr, false, NULL);
} else {
diff --git a/mm/mremap.c b/mm/mremap.c
index cd8a1b199ef9..1c49b9fb994a 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -222,7 +222,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
if (!new_pmd)
break;
- if (pmd_trans_huge(*old_pmd)) {
+ if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd)) {
if (extent == HPAGE_PMD_SIZE) {
bool moved;
/* See comment in move_ptes() */
--
2.11.0
From: Naoya Horiguchi <[email protected]>
This patch enables thp migration for soft offline.
Signed-off-by: Naoya Horiguchi <[email protected]>
ChangeLog: v1 -> v5:
- fix page isolation counting error
Signed-off-by: Zi Yan <[email protected]>
---
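Note on the counting fix (not part of the commit): a THP covers many base
pages, so the NR_ISOLATED_* counter has to move by hpage_nr_pages() both when
the page is isolated and when it is put back, otherwise the counter drifts.
Below is a minimal userspace sketch of that balance. It assumes 512 base pages
per THP (2MB / 4KB on x86_64); struct model_page and the model_* helpers are
hypothetical stand-ins, not kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative constant: 2MB THP / 4KB base page on x86_64. */
#define MODEL_HPAGE_PMD_NR 512

/* Hypothetical stand-in for struct page; only records whether it is a THP head. */
struct model_page {
	bool trans_huge;
};

/* Models hpage_nr_pages(): number of base pages covered by this page. */
static long model_hpage_nr_pages(const struct model_page *page)
{
	return page->trans_huge ? MODEL_HPAGE_PMD_NR : 1;
}

/* Models mod_node_page_state() on a single NR_ISOLATED_* counter. */
static void model_mod_isolated(long *nr_isolated, long delta)
{
	*nr_isolated += delta;
	assert(*nr_isolated >= 0);	/* underflow means unbalanced accounting */
}

/* Isolate one page and put it back; returns the final counter value. */
static long model_isolate_and_putback(const struct model_page *page)
{
	long nr_isolated = 0;

	model_mod_isolated(&nr_isolated, model_hpage_nr_pages(page));
	model_mod_isolated(&nr_isolated, -model_hpage_nr_pages(page));
	return nr_isolated;
}
```

If the isolation side added only 1 for a THP while the putback side subtracted
hpage_nr_pages(), the assert in model_mod_isolated() would fire.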
mm/memory-failure.c | 35 ++++++++++++++---------------------
1 file changed, 14 insertions(+), 21 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 9b77476ef31f..23ff02eb3ed4 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1481,7 +1481,17 @@ static struct page *new_page(struct page *p, unsigned long private, int **x)
if (PageHuge(p))
return alloc_huge_page_node(page_hstate(compound_head(p)),
nid);
- else
+ else if (thp_migration_supported() && PageTransHuge(p)) {
+ struct page *thp;
+
+ thp = alloc_pages_node(nid,
+ (GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
+ HPAGE_PMD_ORDER);
+ if (!thp)
+ return NULL;
+ prep_transhuge_page(thp);
+ return thp;
+ } else
return __alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE, 0);
}
@@ -1665,8 +1675,8 @@ static int __soft_offline_page(struct page *page, int flags)
* cannot have PAGE_MAPPING_MOVABLE.
*/
if (!__PageMovable(page))
- inc_node_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
+ page_is_file_cache(page), hpage_nr_pages(page));
list_add(&page->lru, &pagelist);
ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
MIGRATE_SYNC, MR_MEMORY_FAILURE);
@@ -1689,28 +1699,11 @@ static int __soft_offline_page(struct page *page, int flags)
static int soft_offline_in_use_page(struct page *page, int flags)
{
int ret;
- struct page *hpage = compound_head(page);
-
- if (!PageHuge(page) && PageTransHuge(hpage)) {
- lock_page(hpage);
- if (!PageAnon(hpage) || unlikely(split_huge_page(hpage))) {
- unlock_page(hpage);
- if (!PageAnon(hpage))
- pr_info("soft offline: %#lx: non anonymous thp\n", page_to_pfn(page));
- else
- pr_info("soft offline: %#lx: thp split failed\n", page_to_pfn(page));
- put_hwpoison_page(hpage);
- return -EBUSY;
- }
- unlock_page(hpage);
- get_hwpoison_page(page);
- put_hwpoison_page(hpage);
- }
if (PageHuge(page))
ret = soft_offline_huge_page(page, flags);
else
- ret = __soft_offline_page(page, flags);
+ ret = __soft_offline_page(compound_head(page), flags);
return ret;
}
--
2.11.0
From: Naoya Horiguchi <[email protected]>
The soft dirty bit is designed to be preserved across page migration. This
patch makes it work in the same manner for thp migration too.
---
ChangeLog v1 -> v2:
- separate diff moving _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1
- clear_soft_dirty_pmd can handle migration entry
Signed-off-by: Naoya Horiguchi <[email protected]>
ChangeLog v1 -> v5:
- read soft dirty bit from correct place (*src_pmd) in copy_huge_pmd()
- add missing soft dirty bit transfer in change_huge_pmd()
Signed-off-by: Zi Yan <[email protected]>
---
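Note (not part of the commit): the intended round trip is present pmd ->
migration entry -> new present pmd, with soft dirty carried across each step
and stripped before the swap entry is decoded. Below is a minimal userspace
sketch of that flow. The bit positions (bit 1 for the swap-format flag, bit 11
for the present-format flag) and all model_* names are assumptions made for
illustration; the real helpers are the pmd_swp_*soft_dirty() functions added
by this patch.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t model_pmd_t;

/* Illustrative bit positions, assumed for this sketch. */
#define MODEL_PAGE_PRESENT        (UINT64_C(1) << 0)
#define MODEL_PAGE_SWP_SOFT_DIRTY (UINT64_C(1) << 1)
#define MODEL_PAGE_SOFT_DIRTY     (UINT64_C(1) << 11)

/* Models set_pmd_migration_entry(): carry soft dirty into the swap format. */
static model_pmd_t model_set_migration_entry(model_pmd_t pmdval,
					     model_pmd_t swp_bits)
{
	model_pmd_t pmdswp = swp_bits;

	assert(pmdval & MODEL_PAGE_PRESENT);
	assert(!(swp_bits & (MODEL_PAGE_PRESENT | MODEL_PAGE_SWP_SOFT_DIRTY)));
	if (pmdval & MODEL_PAGE_SOFT_DIRTY)
		pmdswp |= MODEL_PAGE_SWP_SOFT_DIRTY;
	return pmdswp;
}

/* Models pmd_to_swp_entry(): drop the flag before decoding type/offset. */
static model_pmd_t model_swp_bits(model_pmd_t pmdswp)
{
	return pmdswp & ~MODEL_PAGE_SWP_SOFT_DIRTY;
}

/* Models remove_migration_pmd(): restore soft dirty on the new present pmd. */
static model_pmd_t model_remove_migration_entry(model_pmd_t pmdswp,
						model_pmd_t new_pmd)
{
	assert(new_pmd & MODEL_PAGE_PRESENT);
	if (pmdswp & MODEL_PAGE_SWP_SOFT_DIRTY)
		new_pmd |= MODEL_PAGE_SOFT_DIRTY;
	return new_pmd;
}
```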
arch/x86/include/asm/pgtable.h | 17 +++++++++++++++++
fs/proc/task_mmu.c | 27 ++++++++++++++++-----------
include/asm-generic/pgtable.h | 34 +++++++++++++++++++++++++++++++++-
include/linux/swapops.h | 2 ++
mm/huge_memory.c | 27 ++++++++++++++++++++++++---
5 files changed, 92 insertions(+), 15 deletions(-)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 998081ae84f2..3447fd7ddb10 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1104,6 +1104,23 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
{
return pte_clear_flags(pte, _PAGE_SWP_SOFT_DIRTY);
}
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
+{
+ return pmd_set_flags(pmd, _PAGE_SWP_SOFT_DIRTY);
+}
+
+static inline int pmd_swp_soft_dirty(pmd_t pmd)
+{
+ return pmd_flags(pmd) & _PAGE_SWP_SOFT_DIRTY;
+}
+
+static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
+{
+ return pmd_clear_flags(pmd, _PAGE_SWP_SOFT_DIRTY);
+}
+#endif
#endif
#define PKRU_AD_BIT 0x1
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 57489dcd71c4..34913dede836 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -908,17 +908,22 @@ static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
{
pmd_t pmd = *pmdp;
- /* See comment in change_huge_pmd() */
- pmdp_invalidate(vma, addr, pmdp);
- if (pmd_dirty(*pmdp))
- pmd = pmd_mkdirty(pmd);
- if (pmd_young(*pmdp))
- pmd = pmd_mkyoung(pmd);
-
- pmd = pmd_wrprotect(pmd);
- pmd = pmd_clear_soft_dirty(pmd);
-
- set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+ if (pmd_present(pmd)) {
+ /* See comment in change_huge_pmd() */
+ pmdp_invalidate(vma, addr, pmdp);
+ if (pmd_dirty(*pmdp))
+ pmd = pmd_mkdirty(pmd);
+ if (pmd_young(*pmdp))
+ pmd = pmd_mkyoung(pmd);
+
+ pmd = pmd_wrprotect(pmd);
+ pmd = pmd_clear_soft_dirty(pmd);
+
+ set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+ } else if (is_migration_entry(pmd_to_swp_entry(pmd))) {
+ pmd = pmd_swp_clear_soft_dirty(pmd);
+ set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+ }
}
#else
static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 23bf18116df4..5cd865ceb2cd 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -593,7 +593,24 @@ static inline void ptep_modify_prot_commit(struct mm_struct *mm,
#define arch_start_context_switch(prev) do {} while (0)
#endif
-#ifndef CONFIG_HAVE_ARCH_SOFT_DIRTY
+#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
+#ifndef CONFIG_ARCH_ENABLE_THP_MIGRATION
+static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
+{
+ return pmd;
+}
+
+static inline int pmd_swp_soft_dirty(pmd_t pmd)
+{
+ return 0;
+}
+
+static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
+{
+ return pmd;
+}
+#endif
+#else /* !CONFIG_HAVE_ARCH_SOFT_DIRTY */
static inline int pte_soft_dirty(pte_t pte)
{
return 0;
@@ -638,6 +655,21 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
{
return pte;
}
+
+static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
+{
+ return pmd;
+}
+
+static inline int pmd_swp_soft_dirty(pmd_t pmd)
+{
+ return 0;
+}
+
+static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
+{
+ return pmd;
+}
#endif
#ifndef __HAVE_PFNMAP_TRACKING
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index c543c6f25e8f..c2f2efa45f3a 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -179,6 +179,8 @@ static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
{
swp_entry_t arch_entry;
+ if (pmd_swp_soft_dirty(pmd))
+ pmd = pmd_swp_clear_soft_dirty(pmd);
arch_entry = __pmd_to_swp_entry(pmd);
return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3479e9caf2fa..cba64d0dc523 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -921,6 +921,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
if (is_write_migration_entry(entry)) {
make_migration_entry_read(&entry);
pmd = swp_entry_to_pmd(entry);
+ if (pmd_swp_soft_dirty(*src_pmd))
+ pmd = pmd_swp_mksoft_dirty(pmd);
set_pmd_at(src_mm, addr, src_pmd, pmd);
}
set_pmd_at(dst_mm, addr, dst_pmd, pmd);
@@ -1706,6 +1708,17 @@ static inline int pmd_move_must_withdraw(spinlock_t *new_pmd_ptl,
}
#endif
+static pmd_t move_soft_dirty_pmd(pmd_t pmd)
+{
+#ifdef CONFIG_MEM_SOFT_DIRTY
+ if (unlikely(is_pmd_migration_entry(pmd)))
+ pmd = pmd_swp_mksoft_dirty(pmd);
+ else if (pmd_present(pmd))
+ pmd = pmd_mksoft_dirty(pmd);
+#endif
+ return pmd;
+}
+
bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
unsigned long new_addr, unsigned long old_end,
pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
@@ -1748,7 +1761,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
}
- set_pmd_at(mm, new_addr, new_pmd, pmd_mksoft_dirty(pmd));
+ pmd = move_soft_dirty_pmd(pmd);
+ set_pmd_at(mm, new_addr, new_pmd, pmd);
if (new_ptl != old_ptl)
spin_unlock(new_ptl);
if (force_flush)
@@ -1793,6 +1807,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
make_migration_entry_read(&entry);
newpmd = swp_entry_to_pmd(entry);
+ if (pmd_swp_soft_dirty(*pmd))
+ newpmd = pmd_swp_mksoft_dirty(newpmd);
set_pmd_at(mm, addr, pmd, newpmd);
}
goto unlock;
@@ -2743,6 +2759,7 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
unsigned long address = pvmw->address;
pmd_t pmdval;
swp_entry_t entry;
+ pmd_t pmdswp;
if (!(pvmw->pmd && !pvmw->pte))
return;
@@ -2755,8 +2772,10 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
if (pmd_dirty(pmdval))
set_page_dirty(page);
entry = make_migration_entry(page, pmd_write(pmdval));
- pmdval = swp_entry_to_pmd(entry);
- set_pmd_at(mm, address, pvmw->pmd, pmdval);
+ pmdswp = swp_entry_to_pmd(entry);
+ if (pmd_soft_dirty(pmdval))
+ pmdswp = pmd_swp_mksoft_dirty(pmdswp);
+ set_pmd_at(mm, address, pvmw->pmd, pmdswp);
page_remove_rmap(page, true);
put_page(page);
@@ -2780,6 +2799,8 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
entry = pmd_to_swp_entry(*pvmw->pmd);
get_page(new);
pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));
+ if (pmd_swp_soft_dirty(*pvmw->pmd))
+ pmde = pmd_mksoft_dirty(pmde);
if (is_write_migration_entry(entry))
pmde = maybe_pmd_mkwrite(pmde, vma);
--
2.11.0
From: Naoya Horiguchi <[email protected]>
pmd_present() checks _PAGE_PSE along with _PAGE_PRESENT to avoid a
false negative return when it races with thp split
(during which _PAGE_PRESENT is temporarily cleared). I don't think that
dropping the _PAGE_PSE check in pmd_present() works well because it can
hurt the optimization of tlb handling in thp split.
In the current kernel, bits 1-4 are not used in non-present format
since commit 00839ee3b299 ("x86/mm: Move swap offset/type up in PTE to
work around erratum"). So let's move _PAGE_SWP_SOFT_DIRTY to bit 1.
Bit 7 is now reserved (always clear) in the non-present format, so please
don't use it for other purposes.
Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Zi Yan <[email protected]>
---
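Note (not part of the commit): the reason bit 7 cannot carry the swap soft
dirty flag for pmds is that a non-present pmd with bit 7 set passes the
pmd_present() test. Below is a minimal userspace sketch of the problem. The
model_* names are hypothetical; the bit numbers follow the layout comment in
pgtable_64.h (P=0, W=1, L/PSE=7, G/PROTNONE=8, type in 9-13, offset from 14).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_PRESENT  (UINT64_C(1) << 0)
#define MODEL_PAGE_RW       (UINT64_C(1) << 1)	/* new _PAGE_SWP_SOFT_DIRTY */
#define MODEL_PAGE_PSE      (UINT64_C(1) << 7)	/* old _PAGE_SWP_SOFT_DIRTY */
#define MODEL_PAGE_PROTNONE (UINT64_C(1) << 8)

/* Models x86 pmd_present(): P, PROTNONE or PSE set means "present". */
static bool model_pmd_present(uint64_t pmd)
{
	return (pmd & (MODEL_PAGE_PRESENT | MODEL_PAGE_PROTNONE |
		       MODEL_PAGE_PSE)) != 0;
}

/* Models is_swap_pmd(): neither empty nor present. */
static bool model_is_swap_pmd(uint64_t pmd)
{
	return pmd != 0 && !model_pmd_present(pmd);
}

/* Builds a swap-format value: type in bits 9-13, offset from bit 14. */
static uint64_t model_swp_pmd(unsigned int type, uint64_t offset)
{
	assert(type < 32);	/* SWP_TYPE_BITS is 5 */
	return ((uint64_t)type << 9) | (offset << 14);
}
```

With the flag on bit 7, model_is_swap_pmd() reports false for a soft dirty
migration entry; with the flag on bit 1 it reports true.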
arch/x86/include/asm/pgtable_64.h | 12 +++++++++---
arch/x86/include/asm/pgtable_types.h | 10 +++++-----
2 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 73c7ccc38912..770b5ae271ed 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -157,15 +157,21 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
/*
* Encode and de-code a swap entry
*
- * | ... | 11| 10| 9|8|7|6|5| 4| 3|2|1|0| <- bit number
- * | ... |SW3|SW2|SW1|G|L|D|A|CD|WT|U|W|P| <- bit names
- * | OFFSET (14->63) | TYPE (9-13) |0|X|X|X| X| X|X|X|0| <- swp entry
+ * | ... | 11| 10| 9|8|7|6|5| 4| 3|2| 1|0| <- bit number
+ * | ... |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names
+ * | OFFSET (14->63) | TYPE (9-13) |0|0|X|X| X| X|X|SD|0| <- swp entry
*
* G (8) is aliased and used as a PROT_NONE indicator for
* !present ptes. We need to start storing swap entries above
* there. We also need to avoid using A and D because of an
* erratum where they can be incorrectly set by hardware on
* non-present PTEs.
+ *
+ * SD (1) in swp entry is used to store soft dirty bit, which helps us
+ * remember soft dirty over page migration
+ *
+ * Bit 7 in swp entry should be 0 because pmd_present checks not only P,
+ * but also L and G.
*/
#define SWP_TYPE_FIRST_BIT (_PAGE_BIT_PROTNONE + 1)
#define SWP_TYPE_BITS 5
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index df08535f774a..9a4ac934659e 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -97,15 +97,15 @@
/*
* Tracking soft dirty bit when a page goes to a swap is tricky.
* We need a bit which can be stored in pte _and_ not conflict
- * with swap entry format. On x86 bits 6 and 7 are *not* involved
- * into swap entry computation, but bit 6 is used for nonlinear
- * file mapping, so we borrow bit 7 for soft dirty tracking.
+ * with swap entry format. On x86 bits 1-4 are *not* involved
+ * into swap entry computation, but bit 7 is used for thp migration,
+ * so we borrow bit 1 for soft dirty tracking.
*
* Please note that this bit must be treated as swap dirty page
- * mark if and only if the PTE has present bit clear!
+ * mark if and only if the PTE/PMD has present bit clear!
*/
#ifdef CONFIG_MEM_SOFT_DIRTY
-#define _PAGE_SWP_SOFT_DIRTY _PAGE_PSE
+#define _PAGE_SWP_SOFT_DIRTY _PAGE_RW
#else
#define _PAGE_SWP_SOFT_DIRTY (_AT(pteval_t, 0))
#endif
--
2.11.0
From: Naoya Horiguchi <[email protected]>
Introduce a separate check routine for the MPOL_MF_INVERT flag.
This patch is a cleanup only; there is no behavioral change.
Signed-off-by: Naoya Horiguchi <[email protected]>
---
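Note (not part of the commit): the helper returns true when the page should
be skipped. Without MPOL_MF_INVERT, pages on nodes outside nmask are skipped;
with it, pages on nodes inside nmask are skipped. Below is a minimal userspace
sketch of that truth table. The nodemask is modelled as a plain bitmask, and
the flag value and model_* names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag value, assumed for this sketch. */
#define MODEL_MPOL_MF_INVERT (1UL << 5)

/* Models queue_pages_node_check(): true means "skip this page". */
static bool model_queue_pages_node_check(int nid, unsigned long nmask,
					 unsigned long flags)
{
	bool in_mask, invert;

	assert(nid >= 0 && nid < (int)(8 * sizeof(nmask)));
	in_mask = (nmask >> nid) & 1UL;
	invert = !!(flags & MODEL_MPOL_MF_INVERT);
	return in_mask == invert;
}
```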
mm/mempolicy.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index dc8a2672c407..fb18ce891586 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -477,6 +477,15 @@ struct queue_pages {
struct vm_area_struct *prev;
};
+static inline bool queue_pages_node_check(struct page *page,
+ struct queue_pages *qp)
+{
+ int nid = page_to_nid(page);
+ unsigned long flags = qp->flags;
+
+ return node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT);
+}
+
/*
* Scan through pages checking if pages follow certain conditions,
* and move them to the pagelist if they do.
@@ -530,8 +539,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
*/
if (PageReserved(page))
continue;
- nid = page_to_nid(page);
- if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
+ if (queue_pages_node_check(page, qp))
continue;
if (PageTransCompound(page)) {
get_page(page);
@@ -563,7 +571,6 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
#ifdef CONFIG_HUGETLB_PAGE
struct queue_pages *qp = walk->private;
unsigned long flags = qp->flags;
- int nid;
struct page *page;
spinlock_t *ptl;
pte_t entry;
@@ -573,8 +580,7 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
if (!pte_present(entry))
goto unlock;
page = pte_page(entry);
- nid = page_to_nid(page);
- if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
+ if (queue_pages_node_check(page, qp))
goto unlock;
/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
if (flags & (MPOL_MF_MOVE_ALL) ||
--
2.11.0
From: Naoya Horiguchi <[email protected]>
This patch enables thp migration for mbind(2) and migrate_pages(2).
Signed-off-by: Naoya Horiguchi <[email protected]>
---
ChangeLog v1 -> v2:
- support pte-mapped and doubly-mapped thp
---
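Note (not part of the commit): queue_pages_pmd() has three possible outcomes
for the caller: nothing more to do for this pmd, fall through to the pte loop
because the pmd was split, or queue the whole thp. Below is a minimal
userspace sketch of that decision order as I read it from the diff. The enum,
struct and function names are hypothetical, and locking and refcounting are
left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

enum model_pmd_outcome {
	MODEL_SKIP,		/* caller returns without walking ptes */
	MODEL_WALK_PTES,	/* pmd was split; continue with the pte loop */
	MODEL_QUEUE_THP,	/* whole thp added to the migration list */
};

/* Hypothetical snapshot of what queue_pages_pmd() would observe. */
struct model_pmd_state {
	bool migration_entry;		/* is_pmd_migration_entry() */
	bool huge_zero_page;		/* is_huge_zero_page() */
	bool thp_migration_supported;	/* thp_migration_supported() */
	bool split_fails;		/* split_huge_page() returned nonzero */
	bool node_check_skips;		/* queue_pages_node_check() */
	bool move_requested;		/* MPOL_MF_MOVE or MPOL_MF_MOVE_ALL */
};

static enum model_pmd_outcome
model_queue_pages_pmd(const struct model_pmd_state *s)
{
	assert(s);
	if (s->migration_entry)
		return MODEL_SKIP;
	if (s->huge_zero_page)
		return MODEL_WALK_PTES;
	if (!s->thp_migration_supported)
		return s->split_fails ? MODEL_SKIP : MODEL_WALK_PTES;
	if (s->node_check_skips)
		return MODEL_SKIP;
	return s->move_requested ? MODEL_QUEUE_THP : MODEL_SKIP;
}
```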
mm/mempolicy.c | 108 +++++++++++++++++++++++++++++++++++++++++----------------
1 file changed, 79 insertions(+), 29 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index fb18ce891586..c2550e7307bb 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -94,6 +94,7 @@
#include <linux/mm_inline.h>
#include <linux/mmu_notifier.h>
#include <linux/printk.h>
+#include <linux/swapops.h>
#include <asm/tlbflush.h>
#include <linux/uaccess.h>
@@ -486,6 +487,49 @@ static inline bool queue_pages_node_check(struct page *page,
return node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT);
}
+static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+{
+ int ret = 0;
+ struct page *page;
+ struct queue_pages *qp = walk->private;
+ unsigned long flags;
+
+ if (unlikely(is_pmd_migration_entry(*pmd))) {
+ ret = 1;
+ goto unlock;
+ }
+ page = pmd_page(*pmd);
+ if (is_huge_zero_page(page)) {
+ spin_unlock(ptl);
+ __split_huge_pmd(walk->vma, pmd, addr, false, NULL);
+ goto out;
+ }
+ if (!thp_migration_supported()) {
+ get_page(page);
+ spin_unlock(ptl);
+ lock_page(page);
+ ret = split_huge_page(page);
+ unlock_page(page);
+ put_page(page);
+ goto out;
+ }
+ if (queue_pages_node_check(page, qp)) {
+ ret = 1;
+ goto unlock;
+ }
+
+ ret = 1;
+ flags = qp->flags;
+ /* go to thp migration */
+ if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+ migrate_page_add(page, qp->pagelist, flags);
+unlock:
+ spin_unlock(ptl);
+out:
+ return ret;
+}
+
/*
* Scan through pages checking if pages follow certain conditions,
* and move them to the pagelist if they do.
@@ -497,30 +541,15 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
struct page *page;
struct queue_pages *qp = walk->private;
unsigned long flags = qp->flags;
- int nid, ret;
+ int ret;
pte_t *pte;
spinlock_t *ptl;
- if (pmd_trans_huge(*pmd)) {
- ptl = pmd_lock(walk->mm, pmd);
- if (pmd_trans_huge(*pmd)) {
- page = pmd_page(*pmd);
- if (is_huge_zero_page(page)) {
- spin_unlock(ptl);
- __split_huge_pmd(vma, pmd, addr, false, NULL);
- } else {
- get_page(page);
- spin_unlock(ptl);
- lock_page(page);
- ret = split_huge_page(page);
- unlock_page(page);
- put_page(page);
- if (ret)
- return 0;
- }
- } else {
- spin_unlock(ptl);
- }
+ ptl = pmd_trans_huge_lock(pmd, vma);
+ if (ptl) {
+ ret = queue_pages_pmd(pmd, ptl, addr, end, walk);
+ if (ret)
+ return 0;
}
if (pmd_trans_unstable(pmd))
@@ -541,7 +570,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
continue;
if (queue_pages_node_check(page, qp))
continue;
- if (PageTransCompound(page)) {
+ if (PageTransCompound(page) && !thp_migration_supported()) {
get_page(page);
pte_unmap_unlock(pte, ptl);
lock_page(page);
@@ -959,19 +988,21 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
#ifdef CONFIG_MIGRATION
/*
- * page migration
+ * page migration, thp tail pages can be passed.
*/
static void migrate_page_add(struct page *page, struct list_head *pagelist,
unsigned long flags)
{
+ struct page *head = compound_head(page);
/*
* Avoid migrating a page that is shared with others.
*/
- if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1) {
- if (!isolate_lru_page(page)) {
- list_add_tail(&page->lru, pagelist);
- inc_node_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(head) == 1) {
+ if (!isolate_lru_page(head)) {
+ list_add_tail(&head->lru, pagelist);
+ mod_node_page_state(page_pgdat(head),
+ NR_ISOLATED_ANON + page_is_file_cache(head),
+ hpage_nr_pages(head));
}
}
}
@@ -981,7 +1012,17 @@ static struct page *new_node_page(struct page *page, unsigned long node, int **x
if (PageHuge(page))
return alloc_huge_page_node(page_hstate(compound_head(page)),
node);
- else
+ else if (thp_migration_supported() && PageTransHuge(page)) {
+ struct page *thp;
+
+ thp = alloc_pages_node(node,
+ (GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
+ HPAGE_PMD_ORDER);
+ if (!thp)
+ return NULL;
+ prep_transhuge_page(thp);
+ return thp;
+ } else
return __alloc_pages_node(node, GFP_HIGHUSER_MOVABLE |
__GFP_THISNODE, 0);
}
@@ -1147,6 +1188,15 @@ static struct page *new_page(struct page *page, unsigned long start, int **x)
if (PageHuge(page)) {
BUG_ON(!vma);
return alloc_huge_page_noerr(vma, address, 1);
+ } else if (thp_migration_supported() && PageTransHuge(page)) {
+ struct page *thp;
+
+ thp = alloc_hugepage_vma(GFP_TRANSHUGE, vma, address,
+ HPAGE_PMD_ORDER);
+ if (!thp)
+ return NULL;
+ prep_transhuge_page(thp);
+ return thp;
}
/*
* if !vma, alloc_page_vma() will use task or system default policy
--
2.11.0
From: Naoya Horiguchi <[email protected]>
TTU_MIGRATION is used to convert ptes into migration entries until thp split
completes. This behavior conflicts with the thp migration added in later
patches, so let's introduce a new TTU flag specifically for freezing.
try_to_unmap() is used both for thp split (via freeze_page()) and page
migration (via __unmap_and_move()). In freeze_page(), ttu_flag given for
head page is like below (assuming anonymous thp):
(TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
TTU_MIGRATION | TTU_SPLIT_HUGE_PMD)
and ttu_flag given for tail pages is:
(TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
TTU_MIGRATION)
__unmap_and_move() calls try_to_unmap() with ttu_flag:
(TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS)
Now I'm trying to insert a branch for thp migration at the top of
try_to_unmap_one(), like below:

  static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
                              unsigned long address, void *arg)
  {
          ...
          if (flags & TTU_MIGRATION) {
                  if (!pvmw.pte && page) {
                          set_pmd_migration_entry(&pvmw, page);
                          continue;
                  }
          }

With that branch, try_to_unmap() for tail pages called by thp split can go
into the thp migration code path (which converts the *pmd* into a migration
entry), while the expectation is to freeze the thp (which converts each *pte*
into a migration entry).
I detected this failure as a "bad page state" error in a testcase where
split_huge_page() is called from queue_pages_pte_range().
Signed-off-by: Naoya Horiguchi <[email protected]>
---
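Note (not part of the commit): the conflict described above can be modelled
as a small dispatch on the ttu flags. This is a simplified userspace sketch,
not the real try_to_unmap_one(). TTU_SPLIT_FREEZE = 0x100 is taken from this
patch; the other flag values and every model_* name are assumptions made for
illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Only MODEL_TTU_SPLIT_FREEZE's value comes from the patch. */
#define MODEL_TTU_MIGRATION      0x1u
#define MODEL_TTU_SPLIT_HUGE_PMD 0x4u
#define MODEL_TTU_SPLIT_FREEZE   0x100u

enum model_unmap_action {
	MODEL_PTE_MIGRATION_ENTRY,	/* freeze or migrate pte by pte */
	MODEL_PMD_MIGRATION_ENTRY,	/* thp migration: convert the pmd */
	MODEL_OTHER,
};

static enum model_unmap_action model_unmap_one(unsigned int flags,
					       bool pmd_mapped)
{
	/* In this model a caller asks for migration or freezing, not both. */
	assert(!((flags & MODEL_TTU_MIGRATION) &&
		 (flags & MODEL_TTU_SPLIT_FREEZE)));

	/* TTU_SPLIT_HUGE_PMD splits the pmd before the walk starts. */
	if (flags & MODEL_TTU_SPLIT_HUGE_PMD)
		pmd_mapped = false;

	/* The new thp migration branch keys off TTU_MIGRATION only. */
	if ((flags & MODEL_TTU_MIGRATION) && pmd_mapped)
		return MODEL_PMD_MIGRATION_ENTRY;

	if (flags & (MODEL_TTU_MIGRATION | MODEL_TTU_SPLIT_FREEZE))
		return MODEL_PTE_MIGRATION_ENTRY;

	return MODEL_OTHER;
}
```

In this model, a freeze request that still used TTU_MIGRATION without
TTU_SPLIT_HUGE_PMD would land in the pmd branch, which is the failure
described above; with TTU_SPLIT_FREEZE it cannot.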
include/linux/rmap.h | 3 ++-
mm/huge_memory.c | 2 +-
mm/rmap.c | 7 ++++---
3 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 43ef2c30cb0f..f8ca2e74b819 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -93,8 +93,9 @@ enum ttu_flags {
TTU_BATCH_FLUSH = 0x40, /* Batch TLB flushes where possible
* and caller guarantees they will
* do a final flush if necessary */
- TTU_RMAP_LOCKED = 0x80 /* do not grab rmap lock:
+ TTU_RMAP_LOCKED = 0x80, /* do not grab rmap lock:
* caller holds it */
+ TTU_SPLIT_FREEZE = 0x100, /* freeze pte under splitting thp */
};
#ifdef CONFIG_MMU
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1c19331a2db9..0db1f1c90aad 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2159,7 +2159,7 @@ static void freeze_page(struct page *page)
VM_BUG_ON_PAGE(!PageHead(page), page);
if (PageAnon(page))
- ttu_flags |= TTU_MIGRATION;
+ ttu_flags |= TTU_SPLIT_FREEZE;
unmap_success = try_to_unmap(page, ttu_flags);
VM_BUG_ON_PAGE(!unmap_success, page);
diff --git a/mm/rmap.c b/mm/rmap.c
index 5c97ce4f5b2d..b0c6b20dca74 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1308,7 +1308,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
if (flags & TTU_SPLIT_HUGE_PMD) {
split_huge_pmd_address(vma, address,
- flags & TTU_MIGRATION, page);
+ flags & TTU_SPLIT_FREEZE, page);
}
while (page_vma_mapped_walk(&pvmw)) {
@@ -1394,7 +1394,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
*/
dec_mm_counter(mm, mm_counter(page));
} else if (IS_ENABLED(CONFIG_MIGRATION) &&
- (flags & TTU_MIGRATION)) {
+ (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
swp_entry_t entry;
pte_t swp_pte;
/*
@@ -1519,7 +1519,8 @@ bool try_to_unmap(struct page *page, enum ttu_flags flags)
* locking requirements of exec(), migration skips
* temporary VMAs until after exec() completes.
*/
- if ((flags & TTU_MIGRATION) && !PageKsm(page) && PageAnon(page))
+ if ((flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))
+ && !PageKsm(page) && PageAnon(page))
rwc.invalid_vma = invalid_migration_vma;
if (flags & TTU_RMAP_LOCKED)
--
2.11.0
From: Naoya Horiguchi <[email protected]>
This patch enables thp migration for move_pages(2).
Signed-off-by: Naoya Horiguchi <[email protected]>
ChangeLog: v1 -> v5:
- fix page counting
Signed-off-by: Zi Yan <[email protected]>
---
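Note (not part of the commit): two things change in
do_move_page_to_node_array(). FOLL_SPLIT is only requested when thp migration
is unsupported, and the page that gets isolated is always the compound head,
even when the user address points into a tail page. Below is a minimal
userspace sketch of both. The FOLL_* values and the model_* names are
assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values, assumed for this sketch. */
#define MODEL_FOLL_GET   0x04u
#define MODEL_FOLL_DUMP  0x08u
#define MODEL_FOLL_SPLIT 0x80u

/* Hypothetical stand-in for struct page. */
struct model_page {
	struct model_page *head;	/* null for a head or base page */
};

/* Models compound_head(): a tail page resolves to its head. */
static struct model_page *model_compound_head(struct model_page *page)
{
	assert(page);
	return page->head ? page->head : page;
}

/* Models the follflags selection added by this patch. */
static unsigned int model_follflags(bool thp_migration_supported)
{
	unsigned int follflags = MODEL_FOLL_GET | MODEL_FOLL_DUMP;

	if (!thp_migration_supported)
		follflags |= MODEL_FOLL_SPLIT;
	return follflags;
}
```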
mm/migrate.c | 47 +++++++++++++++++++++++++++++++++--------------
1 file changed, 33 insertions(+), 14 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index bbc856264b69..f7c1a1999c8e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -184,8 +184,8 @@ void putback_movable_pages(struct list_head *l)
put_page(page);
} else {
putback_lru_page(page);
- dec_node_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
+ page_is_file_cache(page), -hpage_nr_pages(page));
}
}
}
@@ -1141,8 +1141,8 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
* as __PageMovable
*/
if (likely(!__PageMovable(page)))
- dec_node_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
+ page_is_file_cache(page), -hpage_nr_pages(page));
}
/*
@@ -1159,7 +1159,7 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
* it's how HWPoison flag works at the moment.
*/
if (!test_set_page_hwpoison(page))
- num_poisoned_pages_inc();
+ num_poisoned_pages_add(hpage_nr_pages(page));
}
} else {
if (rc != -EAGAIN) {
@@ -1414,7 +1414,17 @@ static struct page *new_page_node(struct page *p, unsigned long private,
if (PageHuge(p))
return alloc_huge_page_node(page_hstate(compound_head(p)),
pm->node);
- else
+ else if (thp_migration_supported() && PageTransHuge(p)) {
+ struct page *thp;
+
+ thp = alloc_pages_node(pm->node,
+ (GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
+ HPAGE_PMD_ORDER);
+ if (!thp)
+ return NULL;
+ prep_transhuge_page(thp);
+ return thp;
+ } else
return __alloc_pages_node(pm->node,
GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
}
@@ -1441,6 +1451,8 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
for (pp = pm; pp->node != MAX_NUMNODES; pp++) {
struct vm_area_struct *vma;
struct page *page;
+ struct page *head;
+ unsigned int follflags;
err = -EFAULT;
vma = find_vma(mm, pp->addr);
@@ -1448,8 +1460,10 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
goto set_status;
/* FOLL_DUMP to ignore special (like zero) pages */
- page = follow_page(vma, pp->addr,
- FOLL_GET | FOLL_SPLIT | FOLL_DUMP);
+ follflags = FOLL_GET | FOLL_DUMP;
+ if (!thp_migration_supported())
+ follflags |= FOLL_SPLIT;
+ page = follow_page(vma, pp->addr, follflags);
err = PTR_ERR(page);
if (IS_ERR(page))
@@ -1459,7 +1473,6 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
if (!page)
goto set_status;
- pp->page = page;
err = page_to_nid(page);
if (err == pp->node)
@@ -1474,16 +1487,22 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
goto put_and_set;
if (PageHuge(page)) {
- if (PageHead(page))
+ if (PageHead(page)) {
isolate_huge_page(page, &pagelist);
+ err = 0;
+ pp->page = page;
+ }
goto put_and_set;
}
- err = isolate_lru_page(page);
+ pp->page = compound_head(page);
+ head = compound_head(page);
+ err = isolate_lru_page(head);
if (!err) {
- list_add_tail(&page->lru, &pagelist);
- inc_node_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ list_add_tail(&head->lru, &pagelist);
+ mod_node_page_state(page_pgdat(head),
+ NR_ISOLATED_ANON + page_is_file_cache(head),
+ hpage_nr_pages(head));
}
put_and_set:
/*
--
2.11.0
On 04/21/2017 02:17 AM, Zi Yan wrote:
> From: Naoya Horiguchi <[email protected]>
>
> Introduce a separate check routine related to MPOL_MF_INVERT flag.
> This patch just does cleanup, no behavioral change.
Can you please send this separately first? It should be debated and
merged quickly, and should not be held up by the series if we have to
respin again.
Reviewed-by: Anshuman Khandual <[email protected]>
On 04/21/2017 02:17 AM, Zi Yan wrote:
> From: Naoya Horiguchi <[email protected]>
>
> TTU_MIGRATION is used to convert pte into migration entry until thp split
> completes. This behavior conflicts with thp migration added later patches,
> so let's introduce a new TTU flag specifically for freezing.
>
> try_to_unmap() is used both for thp split (via freeze_page()) and page
> migration (via __unmap_and_move()). In freeze_page(), ttu_flag given for
> head page is like below (assuming anonymous thp):
>
> (TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
> TTU_MIGRATION | TTU_SPLIT_HUGE_PMD)
>
> and ttu_flag given for tail pages is:
>
> (TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
> TTU_MIGRATION)
>
> __unmap_and_move() calls try_to_unmap() with ttu_flag:
>
> (TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS)
>
> Now I'm trying to insert a branch for thp migration at the top of
> try_to_unmap_one() like below
>
> static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> unsigned long address, void *arg)
> {
> ...
> if (flags & TTU_MIGRATION) {
> if (!pvmw.pte && page) {
> set_pmd_migration_entry(&pvmw, page);
> continue;
> }
> }
>
> , so try_to_unmap() for tail pages called by thp split can go into thp
> migration code path (which converts *pmd* into migration entry), while
> the expectation is to freeze thp (which converts *pte* into migration entry.)
>
> I detected this failure as a "bad page state" error in a testcase where
> split_huge_page() is called from queue_pages_pte_range().
>
> Signed-off-by: Naoya Horiguchi <[email protected]>
It had Kirill's Acked-by (https://patchwork.kernel.org/patch/9416221/)
last time around. Please include it again.
On 04/21/2017 02:17 AM, Zi Yan wrote:
> From: Naoya Horiguchi <[email protected]>
>
> Introduces CONFIG_ARCH_ENABLE_THP_MIGRATION to limit thp migration
> functionality to x86_64, which should be safer at the first step.
>
> Signed-off-by: Naoya Horiguchi <[email protected]>
Aneesh's latest HugeTLB migration enablement on powerpc should
make this work on the powerpc platform as well. I will test it out
in some time, but for now enabling it only on x86, where you have
tested the series, makes sense.
Reviewed-by: Anshuman Khandual <[email protected]>
On 04/21/2017 02:17 AM, Zi Yan wrote:
> From: Zi Yan <[email protected]>
>
> If one of callers of page migration starts to handle thp,
> memory management code start to see pmd migration entry, so we need
> to prepare for it before enabling. This patch changes various code
> point which checks the status of given pmds in order to prevent race
> between thp migration and the pmd-related works.
>
> ChangeLog v1 -> v2:
> - introduce pmd_related() (I know the naming is not good, but can't
> think up no better name. Any suggesntion is welcomed.)
>
> Signed-off-by: Naoya Horiguchi <[email protected]>
>
> ChangeLog v2 -> v3:
> - add is_swap_pmd()
> - a pmd entry should be pmd pointing to pte pages, is_swap_pmd(),
> pmd_trans_huge(), pmd_devmap(), or pmd_none()
> - pmd_none_or_trans_huge_or_clear_bad() and pmd_trans_unstable() return
> true on pmd_migration_entry, so that migration entries are not
> treated as pmd page table entries.
>
> ChangeLog v4 -> v5:
> - add explanation in pmd_none_or_trans_huge_or_clear_bad() to state
> the equivalence of !pmd_present() and is_pmd_migration_entry()
> - fix migration entry wait deadlock code (from v1) in follow_page_mask()
> - remove unnecessary code (from v1) in follow_trans_huge_pmd()
> - use is_swap_pmd() instead of !pmd_present() for pmd migration entry,
> so it will not be confused with pmd_none()
> - change author information
>
> Signed-off-by: Zi Yan <[email protected]>
> ---
> arch/x86/mm/gup.c | 7 +++--
> fs/proc/task_mmu.c | 30 +++++++++++++--------
> include/asm-generic/pgtable.h | 17 +++++++++++-
> include/linux/huge_mm.h | 14 ++++++++--
> mm/gup.c | 22 ++++++++++++++--
> mm/huge_memory.c | 61 ++++++++++++++++++++++++++++++++++++++-----
> mm/memcontrol.c | 5 ++++
> mm/memory.c | 12 +++++++--
> mm/mprotect.c | 4 +--
> mm/mremap.c | 2 +-
> 10 files changed, 145 insertions(+), 29 deletions(-)
>
> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> index 456dfdfd2249..096bbcc801e6 100644
> --- a/arch/x86/mm/gup.c
> +++ b/arch/x86/mm/gup.c
> @@ -9,6 +9,7 @@
> #include <linux/vmstat.h>
> #include <linux/highmem.h>
> #include <linux/swap.h>
> +#include <linux/swapops.h>
> #include <linux/memremap.h>
>
> #include <asm/mmu_context.h>
> @@ -243,9 +244,11 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> pmd_t pmd = *pmdp;
>
> next = pmd_addr_end(addr, end);
> - if (pmd_none(pmd))
> + if (!pmd_present(pmd)) {
> + VM_BUG_ON(is_swap_pmd(pmd) && IS_ENABLED(CONFIG_MIGRATION) &&
> + !is_pmd_migration_entry(pmd));
> return 0;
> - if (unlikely(pmd_large(pmd) || !pmd_present(pmd))) {
> + } else if (unlikely(pmd_large(pmd))) {
> /*
> * NUMA hinting faults need to be handled in the GUP
> * slowpath for accounting purposes and so that they
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 5c8359704601..57489dcd71c4 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -600,7 +600,8 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>
> ptl = pmd_trans_huge_lock(pmd, vma);
> if (ptl) {
> - smaps_pmd_entry(pmd, addr, walk);
> + if (pmd_present(*pmd))
> + smaps_pmd_entry(pmd, addr, walk);
> spin_unlock(ptl);
> return 0;
> }
> @@ -942,6 +943,9 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
> goto out;
> }
>
> + if (!pmd_present(*pmd))
> + goto out;
> +
These pmd_present() checks should have been done irrespective of the
presence of new PMD migration entries. Please separate them out in a
different clean up patch.
> page = pmd_page(*pmd);
>
> /* Clear accessed and referenced bits. */
> @@ -1221,28 +1225,32 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
> if (ptl) {
> u64 flags = 0, frame = 0;
> pmd_t pmd = *pmdp;
> + struct page *page = NULL;
>
> if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(pmd))
> flags |= PM_SOFT_DIRTY;
>
> - /*
> - * Currently pmd for thp is always present because thp
> - * can not be swapped-out, migrated, or HWPOISONed
> - * (split in such cases instead.)
> - * This if-check is just to prepare for future implementation.
> - */
> if (pmd_present(pmd)) {
> - struct page *page = pmd_page(pmd);
> -
> - if (page_mapcount(page) == 1)
> - flags |= PM_MMAP_EXCLUSIVE;
> + page = pmd_page(pmd);
>
> flags |= PM_PRESENT;
> if (pm->show_pfn)
> frame = pmd_pfn(pmd) +
> ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> + } else if (is_swap_pmd(pmd)) {
> + swp_entry_t entry = pmd_to_swp_entry(pmd);
> +
> + frame = swp_type(entry) |
> + (swp_offset(entry) << MAX_SWAPFILES_SHIFT);
> + flags |= PM_SWAP;
> + VM_BUG_ON(IS_ENABLED(CONFIG_MIGRATION) &&
> + !is_pmd_migration_entry(pmd));
> + page = migration_entry_to_page(entry);
> }
>
> + if (page && page_mapcount(page) == 1)
> + flags |= PM_MMAP_EXCLUSIVE;
> +
Yeah, this makes sense. It discovers the page details in case it's
a migration PMD entry, which never existed before.
> for (; addr != end; addr += PAGE_SIZE) {
> pagemap_entry_t pme = make_pme(frame, flags);
>
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 1fad160f35de..23bf18116df4 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -809,7 +809,22 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> barrier();
> #endif
> - if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
> + /*
> + * !pmd_present() checks for pmd migration entries
> + *
> + * The complete check uses is_pmd_migration_entry() in linux/swapops.h
> + * But using that requires moving current function and pmd_trans_unstable()
> + * to linux/swapops.h to resovle dependency, which is too much code move.
> + *
> + * !pmd_present() is equivalent to is_pmd_migration_entry() currently,
> + * because !pmd_present() pages can only be under migration not swapped
> + * out.
> + *
> + * pmd_none() is preseved for future condition checks on pmd migration
> + * entries and not confusing with this function name, although it is
> + * redundant with !pmd_present().
> + */
> + if (pmd_none(pmdval) || pmd_trans_huge(pmdval) || !pmd_present(pmdval))
> return 1;
> if (unlikely(pmd_bad(pmdval))) {
> pmd_clear_bad(pmd);
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 1b81cb57ff0f..6f44a2352597 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -126,7 +126,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> #define split_huge_pmd(__vma, __pmd, __address) \
> do { \
> pmd_t *____pmd = (__pmd); \
> - if (pmd_trans_huge(*____pmd) \
> + if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd) \
> || pmd_devmap(*____pmd)) \
> __split_huge_pmd(__vma, __pmd, __address, \
> false, NULL); \
> @@ -157,12 +157,18 @@ extern spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd,
> struct vm_area_struct *vma);
> extern spinlock_t *__pud_trans_huge_lock(pud_t *pud,
> struct vm_area_struct *vma);
> +
> +static inline int is_swap_pmd(pmd_t pmd)
> +{
> + return !pmd_none(pmd) && !pmd_present(pmd);
> +}
> +
> /* mmap_sem must be held on entry */
> static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
> struct vm_area_struct *vma)
> {
> VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma);
> - if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
> + if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
> return __pmd_trans_huge_lock(pmd, vma);
> else
> return NULL;
> @@ -269,6 +275,10 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
> long adjust_next)
> {
> }
> +static inline int is_swap_pmd(pmd_t pmd)
> +{
> + return 0;
> +}
> static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
> struct vm_area_struct *vma)
> {
> diff --git a/mm/gup.c b/mm/gup.c
> index 4039ec2993d3..b24c7d10aced 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -278,6 +278,16 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
> return page;
> return no_page_table(vma, flags);
> }
> +retry:
> + if (!pmd_present(*pmd)) {
> + if (likely(!(flags & FOLL_MIGRATION)))
> + return no_page_table(vma, flags);
> + VM_BUG_ON(IS_ENABLED(CONFIG_MIGRATION) &&
> + !is_pmd_migration_entry(*pmd));
> + if (is_pmd_migration_entry(*pmd))
> + pmd_migration_entry_wait(mm, pmd);
> + goto retry;
> + }
> if (pmd_devmap(*pmd)) {
> ptl = pmd_lock(mm, pmd);
> page = follow_devmap_pmd(vma, address, pmd, flags);
> @@ -291,7 +301,15 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
> if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
> return no_page_table(vma, flags);
>
> +retry_locked:
> ptl = pmd_lock(mm, pmd);
> + if (unlikely(!pmd_present(*pmd))) {
> + spin_unlock(ptl);
> + if (likely(!(flags & FOLL_MIGRATION)))
> + return no_page_table(vma, flags);
> + pmd_migration_entry_wait(mm, pmd);
> + goto retry_locked;
> + }
> if (unlikely(!pmd_trans_huge(*pmd))) {
> spin_unlock(ptl);
> return follow_page_pte(vma, address, pmd, flags);
> @@ -350,7 +368,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
> pud = pud_offset(p4d, address);
> BUG_ON(pud_none(*pud));
> pmd = pmd_offset(pud, address);
> - if (pmd_none(*pmd))
> + if (!pmd_present(*pmd))
> return -EFAULT;
> VM_BUG_ON(pmd_trans_huge(*pmd));
> pte = pte_offset_map(pmd, address);
> @@ -1378,7 +1396,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> pmd_t pmd = READ_ONCE(*pmdp);
>
> next = pmd_addr_end(addr, end);
> - if (pmd_none(pmd))
> + if (!pmd_present(pmd))
> return 0;
>
> if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 7406d88445bf..3479e9caf2fa 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -912,6 +912,22 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>
> ret = -EAGAIN;
> pmd = *src_pmd;
> +
> + if (unlikely(is_swap_pmd(pmd))) {
> + swp_entry_t entry = pmd_to_swp_entry(pmd);
> +
> + VM_BUG_ON(IS_ENABLED(CONFIG_MIGRATION) &&
> + !is_pmd_migration_entry(pmd));
> + if (is_write_migration_entry(entry)) {
> + make_migration_entry_read(&entry);
We create a read migration entry after detecting a write ?
> + pmd = swp_entry_to_pmd(entry);
> + set_pmd_at(src_mm, addr, src_pmd, pmd);
> + }
> + set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> + ret = 0;
> + goto out_unlock;
> + }
> +
> if (unlikely(!pmd_trans_huge(pmd))) {
> pte_free(dst_mm, pgtable);
> goto out_unlock;
> @@ -1218,6 +1234,9 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
> if (unlikely(!pmd_same(*vmf->pmd, orig_pmd)))
> goto out_unlock;
>
> + if (unlikely(!pmd_present(orig_pmd)))
> + goto out_unlock;
> +
> page = pmd_page(orig_pmd);
> VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
> /*
> @@ -1548,6 +1567,12 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> if (is_huge_zero_pmd(orig_pmd))
> goto out;
>
> + if (unlikely(!pmd_present(orig_pmd))) {
> + VM_BUG_ON(IS_ENABLED(CONFIG_MIGRATION) &&
> + !is_pmd_migration_entry(orig_pmd));
> + goto out;
> + }
> +
> page = pmd_page(orig_pmd);
> /*
> * If other processes are mapping this page, we couldn't discard
> @@ -1758,6 +1783,21 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> preserve_write = prot_numa && pmd_write(*pmd);
> ret = 1;
>
> + if (is_swap_pmd(*pmd)) {
> + swp_entry_t entry = pmd_to_swp_entry(*pmd);
> +
> + VM_BUG_ON(IS_ENABLED(CONFIG_MIGRATION) &&
> + !is_pmd_migration_entry(*pmd));
> + if (is_write_migration_entry(entry)) {
> + pmd_t newpmd;
> +
> + make_migration_entry_read(&entry);
Same here or maybe I am missing something.
On 04/21/2017 02:17 AM, Zi Yan wrote:
> From: Naoya Horiguchi <[email protected]>
>
> This patch enables thp migration for soft offline.
>
> Signed-off-by: Naoya Horiguchi <[email protected]>
>
> ChangeLog: v1 -> v5:
> - fix page isolation counting error
>
> Signed-off-by: Zi Yan <[email protected]>
> ---
> mm/memory-failure.c | 35 ++++++++++++++---------------------
> 1 file changed, 14 insertions(+), 21 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 9b77476ef31f..23ff02eb3ed4 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1481,7 +1481,17 @@ static struct page *new_page(struct page *p, unsigned long private, int **x)
> if (PageHuge(p))
> return alloc_huge_page_node(page_hstate(compound_head(p)),
> nid);
> - else
> + else if (thp_migration_supported() && PageTransHuge(p)) {
> + struct page *thp;
> +
> + thp = alloc_pages_node(nid,
> + (GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
Why not __GFP_RECLAIM? In the soft offline path we can wait a bit
before declaring that a THP cannot be allocated, and hence should
invoke reclaim as well.
> + HPAGE_PMD_ORDER);
> + if (!thp)
> + return NULL;
> + prep_transhuge_page(thp);
> + return thp;
> + } else
> return __alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE, 0);
> }
>
> @@ -1665,8 +1675,8 @@ static int __soft_offline_page(struct page *page, int flags)
> * cannot have PAGE_MAPPING_MOVABLE.
> */
> if (!__PageMovable(page))
> - inc_node_page_state(page, NR_ISOLATED_ANON +
> - page_is_file_cache(page));
> + mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
> + page_is_file_cache(page), hpage_nr_pages(page));
> list_add(&page->lru, &pagelist);
> ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
> MIGRATE_SYNC, MR_MEMORY_FAILURE);
> @@ -1689,28 +1699,11 @@ static int __soft_offline_page(struct page *page, int flags)
> static int soft_offline_in_use_page(struct page *page, int flags)
> {
> int ret;
> - struct page *hpage = compound_head(page);
> -
> - if (!PageHuge(page) && PageTransHuge(hpage)) {
> - lock_page(hpage);
> - if (!PageAnon(hpage) || unlikely(split_huge_page(hpage))) {
> - unlock_page(hpage);
> - if (!PageAnon(hpage))
> - pr_info("soft offline: %#lx: non anonymous thp\n", page_to_pfn(page));
> - else
> - pr_info("soft offline: %#lx: thp split failed\n", page_to_pfn(page));
> - put_hwpoison_page(hpage);
> - return -EBUSY;
> - }
> - unlock_page(hpage);
> - get_hwpoison_page(page);
> - put_hwpoison_page(hpage);
> - }
>
> if (PageHuge(page))
> ret = soft_offline_huge_page(page, flags);
> else
> - ret = __soft_offline_page(page, flags);
> + ret = __soft_offline_page(compound_head(page), flags);
Hmm, what if the THP allocation fails in the new_page() path and
we fall back to base page allocation? In that case, will we still
always be calling with the head page, because we don't split the
huge page any more?
On 04/21/2017 02:17 AM, Zi Yan wrote:
> From: Naoya Horiguchi <[email protected]>
>
> This patch enables thp migration for mbind(2) and migrate_pages(2).
>
> Signed-off-by: Naoya Horiguchi <[email protected]>
> ---
> ChangeLog v1 -> v2:
> - support pte-mapped and doubly-mapped thp
> ---
> mm/mempolicy.c | 108 +++++++++++++++++++++++++++++++++++++++++----------------
> 1 file changed, 79 insertions(+), 29 deletions(-)
Snip
> @@ -981,7 +1012,17 @@ static struct page *new_node_page(struct page *page, unsigned long node, int **x
> if (PageHuge(page))
> return alloc_huge_page_node(page_hstate(compound_head(page)),
> node);
> - else
> + else if (thp_migration_supported() && PageTransHuge(page)) {
> + struct page *thp;
> +
> + thp = alloc_pages_node(node,
> + (GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
> + HPAGE_PMD_ORDER);
> + if (!thp)
> + return NULL;
> + prep_transhuge_page(thp);
> + return thp;
> + } else
> return __alloc_pages_node(node, GFP_HIGHUSER_MOVABLE |
> __GFP_THISNODE, 0);
> }
> @@ -1147,6 +1188,15 @@ static struct page *new_page(struct page *page, unsigned long start, int **x)
> if (PageHuge(page)) {
> BUG_ON(!vma);
> return alloc_huge_page_noerr(vma, address, 1);
> + } else if (thp_migration_supported() && PageTransHuge(page)) {
> + struct page *thp;
> +
> + thp = alloc_hugepage_vma(GFP_TRANSHUGE, vma, address,
> + HPAGE_PMD_ORDER);
> + if (!thp)
> + return NULL;
> + prep_transhuge_page(thp);
> + return thp;
GFP flags in both these new page allocation functions should be the same.
Will alloc_hugepage_vma() eventually call the page allocator with the
following flags?
(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM
Anshuman Khandual wrote:
> On 04/21/2017 02:17 AM, Zi Yan wrote:
>> From: Zi Yan <[email protected]>
>>
>> If one of callers of page migration starts to handle thp,
>> memory management code start to see pmd migration entry, so we need
>> to prepare for it before enabling. This patch changes various code
>> point which checks the status of given pmds in order to prevent race
>> between thp migration and the pmd-related works.
>>
>> ChangeLog v1 -> v2:
>> - introduce pmd_related() (I know the naming is not good, but can't
>> think up no better name. Any suggesntion is welcomed.)
>>
>> Signed-off-by: Naoya Horiguchi <[email protected]>
>>
>> ChangeLog v2 -> v3:
>> - add is_swap_pmd()
>> - a pmd entry should be pmd pointing to pte pages, is_swap_pmd(),
>> pmd_trans_huge(), pmd_devmap(), or pmd_none()
>> - pmd_none_or_trans_huge_or_clear_bad() and pmd_trans_unstable() return
>> true on pmd_migration_entry, so that migration entries are not
>> treated as pmd page table entries.
>>
>> ChangeLog v4 -> v5:
>> - add explanation in pmd_none_or_trans_huge_or_clear_bad() to state
>> the equivalence of !pmd_present() and is_pmd_migration_entry()
>> - fix migration entry wait deadlock code (from v1) in follow_page_mask()
>> - remove unnecessary code (from v1) in follow_trans_huge_pmd()
>> - use is_swap_pmd() instead of !pmd_present() for pmd migration entry,
>> so it will not be confused with pmd_none()
>> - change author information
>>
>> Signed-off-by: Zi Yan <[email protected]>
>> ---
>> arch/x86/mm/gup.c | 7 +++--
>> fs/proc/task_mmu.c | 30 +++++++++++++--------
>> include/asm-generic/pgtable.h | 17 +++++++++++-
>> include/linux/huge_mm.h | 14 ++++++++--
>> mm/gup.c | 22 ++++++++++++++--
>> mm/huge_memory.c | 61 ++++++++++++++++++++++++++++++++++++++-----
>> mm/memcontrol.c | 5 ++++
>> mm/memory.c | 12 +++++++--
>> mm/mprotect.c | 4 +--
>> mm/mremap.c | 2 +-
>> 10 files changed, 145 insertions(+), 29 deletions(-)
>>
>> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
>> index 456dfdfd2249..096bbcc801e6 100644
>> --- a/arch/x86/mm/gup.c
>> +++ b/arch/x86/mm/gup.c
>> @@ -9,6 +9,7 @@
>> #include <linux/vmstat.h>
>> #include <linux/highmem.h>
>> #include <linux/swap.h>
>> +#include <linux/swapops.h>
>> #include <linux/memremap.h>
>>
>> #include <asm/mmu_context.h>
>> @@ -243,9 +244,11 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>> pmd_t pmd = *pmdp;
>>
>> next = pmd_addr_end(addr, end);
>> - if (pmd_none(pmd))
>> + if (!pmd_present(pmd)) {
>> + VM_BUG_ON(is_swap_pmd(pmd) && IS_ENABLED(CONFIG_MIGRATION) &&
>> + !is_pmd_migration_entry(pmd));
>> return 0;
>> - if (unlikely(pmd_large(pmd) || !pmd_present(pmd))) {
>> + } else if (unlikely(pmd_large(pmd))) {
>> /*
>> * NUMA hinting faults need to be handled in the GUP
>> * slowpath for accounting purposes and so that they
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index 5c8359704601..57489dcd71c4 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -600,7 +600,8 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>>
>> ptl = pmd_trans_huge_lock(pmd, vma);
>> if (ptl) {
>> - smaps_pmd_entry(pmd, addr, walk);
>> + if (pmd_present(*pmd))
>> + smaps_pmd_entry(pmd, addr, walk);
>> spin_unlock(ptl);
>> return 0;
>> }
>> @@ -942,6 +943,9 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
>> goto out;
>> }
>>
>> + if (!pmd_present(*pmd))
>> + goto out;
>> +
>
> These pmd_present() checks should have been done irrespective of the
> presence of new PMD migration entries. Please separate them out in a
> different clean up patch.
Not really. The introduction of PMD migration entries makes
pmd_trans_huge_lock() return a lock when the PMD is a swap entry (see
the changes to pmd_trans_huge_lock() in this patch). This was not the
case before: pmd_trans_huge_lock() returned NULL if the PMD entry was
pmd_none(), so neither of the two chunks was reachable.
Maybe I should use is_swap_pmd() to avoid the confusion.
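To make the three PMD states concrete, here is a minimal userspace
sketch. The toy_pmd_t type, the TOY_PMD_PRESENT bit and the
toy_takes_pmd_lock() helper are made up for illustration and do not
match any architecture's real layout; only the shape of is_swap_pmd() is
taken from the patch, and devmap PMDs are ignored.

```c
#include <assert.h>	/* only needed for the checks in the note below */
#include <stdbool.h>
#include <stdint.h>

/* Toy PMD: NOT a real architecture layout, illustration only. */
typedef struct { uint64_t val; } toy_pmd_t;

#define TOY_PMD_PRESENT 0x1ULL	/* assumed position of the present bit */

static bool toy_pmd_none(toy_pmd_t pmd)
{
	return pmd.val == 0;
}

static bool toy_pmd_present(toy_pmd_t pmd)
{
	return (pmd.val & TOY_PMD_PRESENT) != 0;
}

/* Same shape as is_swap_pmd() in the patch: neither empty nor present. */
static bool toy_is_swap_pmd(toy_pmd_t pmd)
{
	return !toy_pmd_none(pmd) && !toy_pmd_present(pmd);
}

/*
 * Model of the new pmd_trans_huge_lock() condition (devmap omitted):
 * the lock is taken for present huge PMDs and, after this patch, also
 * for swap PMDs. An empty PMD still takes no lock.
 */
static bool toy_takes_pmd_lock(toy_pmd_t pmd, bool is_trans_huge)
{
	return toy_is_swap_pmd(pmd) ||
	       (toy_pmd_present(pmd) && is_trans_huge);
}
```

With this model, assert(toy_takes_pmd_lock((toy_pmd_t){ 0x200 }, false))
holds for a non-empty, non-present value, while an all-zero PMD takes no
lock, which is why callers now need their own present check.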
<snip>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 7406d88445bf..3479e9caf2fa 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -912,6 +912,22 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>
>> ret = -EAGAIN;
>> pmd = *src_pmd;
>> +
>> + if (unlikely(is_swap_pmd(pmd))) {
>> + swp_entry_t entry = pmd_to_swp_entry(pmd);
>> +
>> + VM_BUG_ON(IS_ENABLED(CONFIG_MIGRATION) &&
>> + !is_pmd_migration_entry(pmd));
>> + if (is_write_migration_entry(entry)) {
>> + make_migration_entry_read(&entry);
>
> We create a read migration entry after detecting a write ?
When copying page tables, COW mappings require the pages in both parent
and child to be mapped read-only. In copy_huge_pmd(), only anonymous
VMAs are copied; the other VMAs will be refilled on fault. Writable
anonymous VMAs have VM_MAYWRITE set but not VM_SHARED, and this matches
is_cow_mapping(). So all mappings copied in this function are COW mappings.
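As a rough illustration of the flag test described above, the following
userspace sketch restates the check. The TOY_VM_* values are assumptions
restated here so the sketch stands alone, and toy_is_cow_mapping() is a
hypothetical stand-in for the kernel helper, not a copy of it.

```c
#include <assert.h>	/* only needed for the checks in the note below */
#include <stdbool.h>

/* Assumed flag values, restated locally for a self-contained sketch. */
#define TOY_VM_SHARED   0x00000008UL
#define TOY_VM_MAYWRITE 0x00000020UL

/* A COW mapping may become writable but is not shared. */
static bool toy_is_cow_mapping(unsigned long vm_flags)
{
	return (vm_flags & (TOY_VM_SHARED | TOY_VM_MAYWRITE)) ==
	       TOY_VM_MAYWRITE;
}
```

For example, assert(toy_is_cow_mapping(TOY_VM_MAYWRITE)) holds, while
adding TOY_VM_SHARED makes the test fail.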
>
>> + pmd = swp_entry_to_pmd(entry);
>> + set_pmd_at(src_mm, addr, src_pmd, pmd);
>> + }
>> + set_pmd_at(dst_mm, addr, dst_pmd, pmd);
>> + ret = 0;
>> + goto out_unlock;
>> + }
>> +
>> if (unlikely(!pmd_trans_huge(pmd))) {
>> pte_free(dst_mm, pgtable);
>> goto out_unlock;
>> @@ -1218,6 +1234,9 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
>> if (unlikely(!pmd_same(*vmf->pmd, orig_pmd)))
>> goto out_unlock;
>>
>> + if (unlikely(!pmd_present(orig_pmd)))
>> + goto out_unlock;
>> +
>> page = pmd_page(orig_pmd);
>> VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
>> /*
>> @@ -1548,6 +1567,12 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>> if (is_huge_zero_pmd(orig_pmd))
>> goto out;
>>
>> + if (unlikely(!pmd_present(orig_pmd))) {
>> + VM_BUG_ON(IS_ENABLED(CONFIG_MIGRATION) &&
>> + !is_pmd_migration_entry(orig_pmd));
>> + goto out;
>> + }
>> +
>> page = pmd_page(orig_pmd);
>> /*
>> * If other processes are mapping this page, we couldn't discard
>> @@ -1758,6 +1783,21 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>> preserve_write = prot_numa && pmd_write(*pmd);
>> ret = 1;
>>
>> + if (is_swap_pmd(*pmd)) {
>> + swp_entry_t entry = pmd_to_swp_entry(*pmd);
>> +
>> + VM_BUG_ON(IS_ENABLED(CONFIG_MIGRATION) &&
>> + !is_pmd_migration_entry(*pmd));
>> + if (is_write_migration_entry(entry)) {
>> + pmd_t newpmd;
>> +
>> + make_migration_entry_read(&entry);
>
> Same here or maybe I am missing something.
I follow the same pattern as change_pte_range() (mm/mprotect.c). The
comment there says "A protection check is difficult so just be safe and
disable write".
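A minimal sketch of that pattern, assuming a toy entry type: struct
toy_mig_entry and the toy_* helpers are hypothetical and only mimic the
shape of is_write_migration_entry() and make_migration_entry_read(); the
kernel packs the same information into a swp_entry_t.

```c
#include <assert.h>	/* only needed for the checks in the note below */
#include <stdbool.h>

/* Toy migration entry: a kind plus a page frame number. */
enum toy_mig_kind { TOY_MIG_READ, TOY_MIG_WRITE };

struct toy_mig_entry {
	enum toy_mig_kind kind;
	unsigned long pfn;
};

static bool toy_is_write_entry(struct toy_mig_entry e)
{
	return e.kind == TOY_MIG_WRITE;
}

/* Drop write permission but keep pointing at the same page. */
static void toy_make_entry_read(struct toy_mig_entry *e)
{
	e->kind = TOY_MIG_READ;
}

/*
 * Shape of the branch in change_huge_pmd(): no protection check is
 * attempted, a write entry is simply downgraded. Returns true if the
 * entry had to be rewritten.
 */
static bool toy_downgrade(struct toy_mig_entry *e)
{
	if (!toy_is_write_entry(*e))
		return false;
	toy_make_entry_read(e);
	return true;
}
```

Calling toy_downgrade() on a write entry rewrites it once; a second
call is a no-op and the pfn is never changed.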
--
Best Regards,
Yan Zi
Anshuman Khandual wrote:
> On 04/21/2017 02:17 AM, Zi Yan wrote:
>> From: Naoya Horiguchi <[email protected]>
>>
>> This patch enables thp migration for soft offline.
>>
>> Signed-off-by: Naoya Horiguchi <[email protected]>
>>
>> ChangeLog: v1 -> v5:
>> - fix page isolation counting error
>>
>> Signed-off-by: Zi Yan <[email protected]>
>> ---
>> mm/memory-failure.c | 35 ++++++++++++++---------------------
>> 1 file changed, 14 insertions(+), 21 deletions(-)
>>
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 9b77476ef31f..23ff02eb3ed4 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -1481,7 +1481,17 @@ static struct page *new_page(struct page *p, unsigned long private, int **x)
>> if (PageHuge(p))
>> return alloc_huge_page_node(page_hstate(compound_head(p)),
>> nid);
>> - else
>> + else if (thp_migration_supported() && PageTransHuge(p)) {
>> + struct page *thp;
>> +
>> + thp = alloc_pages_node(nid,
>> + (GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
>
> Why not __GFP_RECLAIM ? Its soft offline path we wait a bit before
> declaring that THP page cannot be allocated and hence should invoke
> reclaim methods as well.
I am not sure how much effort the kernel wants to put here to soft
offline a THP. Naoya knows more here.
>
>> + HPAGE_PMD_ORDER);
>> + if (!thp)
>> + return NULL;
>> + prep_transhuge_page(thp);
>> + return thp;
>> + } else
>> return __alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE, 0);
>> }
>>
>> @@ -1665,8 +1675,8 @@ static int __soft_offline_page(struct page *page, int flags)
>> * cannot have PAGE_MAPPING_MOVABLE.
>> */
>> if (!__PageMovable(page))
>> - inc_node_page_state(page, NR_ISOLATED_ANON +
>> - page_is_file_cache(page));
>> + mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
>> + page_is_file_cache(page), hpage_nr_pages(page));
>> list_add(&page->lru, &pagelist);
>> ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
>> MIGRATE_SYNC, MR_MEMORY_FAILURE);
>> @@ -1689,28 +1699,11 @@ static int __soft_offline_page(struct page *page, int flags)
>> static int soft_offline_in_use_page(struct page *page, int flags)
>> {
>> int ret;
>> - struct page *hpage = compound_head(page);
>> -
>> - if (!PageHuge(page) && PageTransHuge(hpage)) {
>> - lock_page(hpage);
>> - if (!PageAnon(hpage) || unlikely(split_huge_page(hpage))) {
>> - unlock_page(hpage);
>> - if (!PageAnon(hpage))
>> - pr_info("soft offline: %#lx: non anonymous thp\n", page_to_pfn(page));
>> - else
>> - pr_info("soft offline: %#lx: thp split failed\n", page_to_pfn(page));
>> - put_hwpoison_page(hpage);
>> - return -EBUSY;
>> - }
>> - unlock_page(hpage);
>> - get_hwpoison_page(page);
>> - put_hwpoison_page(hpage);
>> - }
>>
>> if (PageHuge(page))
>> ret = soft_offline_huge_page(page, flags);
>> else
>> - ret = __soft_offline_page(page, flags);
>> + ret = __soft_offline_page(compound_head(page), flags);
>
> Hmm, what if the THP allocation fails in the new_page() path and
> we fallback for general page allocation. In that case we will
> always be still calling with the head page ? Because we dont
> split the huge page any more.
This could be a problem if the user wants to offline a tail page but,
due to THP allocation failure, the head page is offlined instead.
It may be better to soft offline a THP as a whole only if page ==
compound_head(page). If page != compound_head(page), we still split the
THP like before.
The reason is that migrate_pages() cannot guarantee that any tail page
of the THP is migrated: (1) a THP allocation failure causes the THP to
be split, and then only the head page is migrated; (2) even if we change
the existing migrate_pages() implementation to add all tail pages to the
migration list instead of the LRU list, we still cannot guarantee that
the tail page we want to migrate is actually migrated.
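The proposed rule can be sketched as a small decision function. This is
only an illustration of the idea, not code from the series: the enum,
the function name and the use of plain indices in place of struct page
pointers are all assumptions.

```c
#include <assert.h>	/* only needed for the checks in the note below */
#include <stddef.h>

enum toy_offline_action {
	TOY_MIGRATE_WHOLE_THP,	/* target is the head page: migrate the THP */
	TOY_SPLIT_FIRST,	/* target is a tail page: split as before */
};

/*
 * 'page' and 'head' are indices standing in for struct page pointers;
 * 'head' plays the role of compound_head(page).
 */
static enum toy_offline_action
toy_soft_offline_choice(size_t page, size_t head)
{
	return page == head ? TOY_MIGRATE_WHOLE_THP : TOY_SPLIT_FIRST;
}
```

With head == 512, a request for page 512 would migrate the whole THP,
while a request for page 515 would take the split path.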
Naoya, what do you think?
--
Best Regards,
Yan Zi
Anshuman Khandual wrote:
> On 04/21/2017 02:17 AM, Zi Yan wrote:
>> From: Naoya Horiguchi <[email protected]>
>>
>> This patch enables thp migration for mbind(2) and migrate_pages(2).
>>
>> Signed-off-by: Naoya Horiguchi <[email protected]>
>> ---
>> ChangeLog v1 -> v2:
>> - support pte-mapped and doubly-mapped thp
>> ---
>> mm/mempolicy.c | 108 +++++++++++++++++++++++++++++++++++++++++----------------
>> 1 file changed, 79 insertions(+), 29 deletions(-)
>
> Snip
>
>> @@ -981,7 +1012,17 @@ static struct page *new_node_page(struct page *page, unsigned long node, int **x
>> if (PageHuge(page))
>> return alloc_huge_page_node(page_hstate(compound_head(page)),
>> node);
>> - else
>> + else if (thp_migration_supported() && PageTransHuge(page)) {
>> + struct page *thp;
>> +
>> + thp = alloc_pages_node(node,
>> + (GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
>> + HPAGE_PMD_ORDER);
>> + if (!thp)
>> + return NULL;
>> + prep_transhuge_page(thp);
>> + return thp;
>> + } else
>> return __alloc_pages_node(node, GFP_HIGHUSER_MOVABLE |
>> __GFP_THISNODE, 0);
>> }
>> @@ -1147,6 +1188,15 @@ static struct page *new_page(struct page *page, unsigned long start, int **x)
>> if (PageHuge(page)) {
>> BUG_ON(!vma);
>> return alloc_huge_page_noerr(vma, address, 1);
>> + } else if (thp_migration_supported() && PageTransHuge(page)) {
>> + struct page *thp;
>> +
>> + thp = alloc_hugepage_vma(GFP_TRANSHUGE, vma, address,
>> + HPAGE_PMD_ORDER);
>> + if (!thp)
>> + return NULL;
>> + prep_transhuge_page(thp);
>> + return thp;
>
> GFP flags in both these new page allocation functions should be the same.
> Does alloc_hugepage_vma() will eventually call page allocation with the
> following flags.
>
> (GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM
Sure. This is equivalent to (GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
which I am going to use.
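The equivalence can be checked with a small userspace sketch. The bit
values below are made up for illustration (the real ones are in
include/linux/gfp.h), and the macro relationships are written as I
understand them for kernels of this period, so treat them as assumptions.
__GFP_OTHER is a hypothetical stand-in for the remaining
GFP_HIGHUSER_MOVABLE bits.

```c
#include <assert.h>

/* Illustrative bit values only; not the kernel's actual encoding. */
#define __GFP_DIRECT_RECLAIM	0x01u
#define __GFP_KSWAPD_RECLAIM	0x02u
#define __GFP_THISNODE		0x04u
#define __GFP_COMP		0x08u
#define __GFP_NOMEMALLOC	0x10u
#define __GFP_NOWARN		0x20u
#define __GFP_OTHER		0x40u	/* hypothetical placeholder bit */

/* Assumed relationships between the composite flags. */
#define __GFP_RECLAIM	(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)
#define GFP_HIGHUSER_MOVABLE	(__GFP_RECLAIM | __GFP_OTHER)
#define GFP_TRANSHUGE_LIGHT	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
				  __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
#define GFP_TRANSHUGE	(GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)

/* The mask as written in the patch. */
unsigned int model_patch_flags(void)
{
	return (GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM;
}

/* The shorter spelling proposed in the reply. */
unsigned int model_light_flags(void)
{
	return GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;
}

/* Both spellings should yield the same mask, with no reclaim bits set. */
void model_check_equivalence(void)
{
	assert(model_patch_flags() == model_light_flags());
	assert((model_patch_flags() & __GFP_RECLAIM) == 0);
}
```

GFP_TRANSHUGE only adds __GFP_DIRECT_RECLAIM on top of
GFP_TRANSHUGE_LIGHT, so masking out __GFP_RECLAIM removes exactly that
bit again.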
--
Best Regards,
Yan Zi
On Fri, Apr 21, 2017 at 10:55:49AM -0500, Zi Yan wrote:
>
>
> Anshuman Khandual wrote:
> > On 04/21/2017 02:17 AM, Zi Yan wrote:
> >> From: Naoya Horiguchi <[email protected]>
> >>
> >> This patch enables thp migration for soft offline.
> >>
> >> Signed-off-by: Naoya Horiguchi <[email protected]>
> >>
> >> ChangeLog: v1 -> v5:
> >> - fix page isolation counting error
> >>
> >> Signed-off-by: Zi Yan <[email protected]>
> >> ---
> >> mm/memory-failure.c | 35 ++++++++++++++---------------------
> >> 1 file changed, 14 insertions(+), 21 deletions(-)
> >>
> >> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> >> index 9b77476ef31f..23ff02eb3ed4 100644
> >> --- a/mm/memory-failure.c
> >> +++ b/mm/memory-failure.c
> >> @@ -1481,7 +1481,17 @@ static struct page *new_page(struct page *p, unsigned long private, int **x)
> >> if (PageHuge(p))
> >> return alloc_huge_page_node(page_hstate(compound_head(p)),
> >> nid);
> >> - else
> >> + else if (thp_migration_supported() && PageTransHuge(p)) {
> >> + struct page *thp;
> >> +
> >> + thp = alloc_pages_node(nid,
> >> + (GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
> >
> > Why not __GFP_RECLAIM ? Its soft offline path we wait a bit before
> > declaring that THP page cannot be allocated and hence should invoke
> > reclaim methods as well.
>
> I am not sure how much effort the kernel wants to put here to soft
> offline a THP. Naoya knows more here.
My initial thinking was that soft offline is not an urgent user, so there
is no need to reclaim (i.e. it should have little impact on other threads).
But that's not a strong opinion, so if you prefer __GFP_RECLAIM here,
I'm fine with that.
>
>
> >
> >> + HPAGE_PMD_ORDER);
> >> + if (!thp)
> >> + return NULL;
> >> + prep_transhuge_page(thp);
> >> + return thp;
> >> + } else
> >> return __alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE, 0);
> >> }
> >>
> >> @@ -1665,8 +1675,8 @@ static int __soft_offline_page(struct page *page, int flags)
> >> * cannot have PAGE_MAPPING_MOVABLE.
> >> */
> >> if (!__PageMovable(page))
> >> - inc_node_page_state(page, NR_ISOLATED_ANON +
> >> - page_is_file_cache(page));
> >> + mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
> >> + page_is_file_cache(page), hpage_nr_pages(page));
> >> list_add(&page->lru, &pagelist);
> >> ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
> >> MIGRATE_SYNC, MR_MEMORY_FAILURE);
> >> @@ -1689,28 +1699,11 @@ static int __soft_offline_page(struct page *page, int flags)
> >> static int soft_offline_in_use_page(struct page *page, int flags)
> >> {
> >> int ret;
> >> - struct page *hpage = compound_head(page);
> >> -
> >> - if (!PageHuge(page) && PageTransHuge(hpage)) {
> >> - lock_page(hpage);
> >> - if (!PageAnon(hpage) || unlikely(split_huge_page(hpage))) {
> >> - unlock_page(hpage);
> >> - if (!PageAnon(hpage))
> >> - pr_info("soft offline: %#lx: non anonymous thp\n", page_to_pfn(page));
> >> - else
> >> - pr_info("soft offline: %#lx: thp split failed\n", page_to_pfn(page));
> >> - put_hwpoison_page(hpage);
> >> - return -EBUSY;
> >> - }
> >> - unlock_page(hpage);
> >> - get_hwpoison_page(page);
> >> - put_hwpoison_page(hpage);
> >> - }
> >>
> >> if (PageHuge(page))
> >> ret = soft_offline_huge_page(page, flags);
> >> else
> >> - ret = __soft_offline_page(page, flags);
> >> + ret = __soft_offline_page(compound_head(page), flags);
> >
> > Hmm, what if the THP allocation fails in the new_page() path and
> > we fallback for general page allocation. In that case we will
> > always be still calling with the head page ? Because we dont
> > split the huge page any more.
>
> This could be a problem if the user wants to offline a TailPage but due
> to THP allocation failure, the HeadPage is offlined.
Right, the "retry with split" part is unfinished, so we need some improvement.
>
> It may be better to only soft offline THPs if page ==
> compound_head(page). If page != compound_head(page), we still split THPs
> like before.
>
> Because in migrate_pages(), we cannot guarantee any TailPages in that
> THP are migrated (1. THP allocation failure causes THP splitting, then
> only HeadPage is going to be migrated; 2. even if we change existing
> migrate_pages() implementation to add all TailPages to migration list
> instead of LRU list, we still cannot guarantee the TailPage we want to
> migrate is migrated.).
>
> Naoya, what do you think?
Soft offline may be a special caller of page migration because it
basically wants to migrate only one page, but thp migration still has
a benefit because we can avoid a thp split.
So I would like us to try thp migration first and, if it fails, fall
back to splitting and migrating only the raw error page. This should be
done on the caller side for soft offline, because the caller knows where
the error page is.
As for the generic case (other migration callers, which mainly want to
migrate multiple pages for their own purposes), thp split and retry can
be done in the common migration code. After the thp split, all subpages
are linked to the migration list, and we retry without returning to the
caller. So I think that split_huge_page() can be moved into (for example)
the for-loop in migrate_pages().
I tried to write a patch for it last year, but considering vm event
accounting, the patch might be large (~100 lines).
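As a rough illustration of the generic "split and retry inside the loop"
idea, here is a userspace model. It is only a sketch under assumptions:
struct model_item, MODEL_QUEUE_MAX and model_migrate_list() are
hypothetical names, an array stands in for the migration list, and vm
event accounting is reduced to a single split counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_QUEUE_MAX 64

/* One entry on the modelled migration list: nr_pages > 1 means a THP. */
struct model_item {
	int nr_pages;
};

/*
 * Walk the list once. When a THP cannot get a huge target page, split it,
 * append its subpages to the same list and keep going, so the caller never
 * sees the failure. Returns the number of base pages migrated and stores
 * the number of splits in *nr_splits.
 */
int model_migrate_list(const struct model_item *in, size_t n,
		       bool huge_alloc_ok, int *nr_splits)
{
	struct model_item queue[MODEL_QUEUE_MAX];
	size_t tail = 0, i;
	int migrated = 0;

	assert(nr_splits != NULL);
	*nr_splits = 0;
	for (i = 0; i < n && tail < MODEL_QUEUE_MAX; i++)
		queue[tail++] = in[i];

	for (i = 0; i < tail; i++) {
		if (queue[i].nr_pages > 1 && !huge_alloc_ok) {
			int s;

			(*nr_splits)++;
			/* Link every subpage to the list for a later pass. */
			for (s = 0; s < queue[i].nr_pages &&
				    tail < MODEL_QUEUE_MAX; s++)
				queue[tail++].nr_pages = 1;
			continue;
		}
		migrated += queue[i].nr_pages;
	}
	return migrated;
}
```

In this model the same number of base pages is migrated whether or not
the huge allocation succeeds; only the split counter differs.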
Thanks,
Naoya Horiguchi
Naoya Horiguchi wrote:
> On Fri, Apr 21, 2017 at 10:55:49AM -0500, Zi Yan wrote:
>>
>> Anshuman Khandual wrote:
>>> On 04/21/2017 02:17 AM, Zi Yan wrote:
>>>> From: Naoya Horiguchi <[email protected]>
>>>>
>>>> This patch enables thp migration for soft offline.
>>>>
>>>> Signed-off-by: Naoya Horiguchi <[email protected]>
>>>>
>>>> ChangeLog: v1 -> v5:
>>>> - fix page isolation counting error
>>>>
>>>> Signed-off-by: Zi Yan <[email protected]>
>>>> ---
>>>> mm/memory-failure.c | 35 ++++++++++++++---------------------
>>>> 1 file changed, 14 insertions(+), 21 deletions(-)
>>>>
>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>> index 9b77476ef31f..23ff02eb3ed4 100644
>>>> --- a/mm/memory-failure.c
>>>> +++ b/mm/memory-failure.c
>>>> @@ -1481,7 +1481,17 @@ static struct page *new_page(struct page *p, unsigned long private, int **x)
>>>> if (PageHuge(p))
>>>> return alloc_huge_page_node(page_hstate(compound_head(p)),
>>>> nid);
>>>> - else
>>>> + else if (thp_migration_supported() && PageTransHuge(p)) {
>>>> + struct page *thp;
>>>> +
>>>> + thp = alloc_pages_node(nid,
>>>> + (GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
>>> Why not __GFP_RECLAIM ? Its soft offline path we wait a bit before
>>> declaring that THP page cannot be allocated and hence should invoke
>>> reclaim methods as well.
>> I am not sure how much effort the kernel wants to put here to soft
>> offline a THP. Naoya knows more here.
>
> What I thought at first was that soft offline is not an urgent user
> and no need to reclaim (i.e. give a little impact on other thread.)
> But that's not a strong opinion, so if you like __GFP_RECLAIM here,
> I'm fine about that.
OK, I will add __GFP_RECLAIM.
>
>>
>>>> + HPAGE_PMD_ORDER);
>>>> + if (!thp)
>>>> + return NULL;
>>>> + prep_transhuge_page(thp);
>>>> + return thp;
>>>> + } else
>>>> return __alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE, 0);
>>>> }
>>>>
>>>> @@ -1665,8 +1675,8 @@ static int __soft_offline_page(struct page *page, int flags)
>>>> * cannot have PAGE_MAPPING_MOVABLE.
>>>> */
>>>> if (!__PageMovable(page))
>>>> - inc_node_page_state(page, NR_ISOLATED_ANON +
>>>> - page_is_file_cache(page));
>>>> + mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
>>>> + page_is_file_cache(page), hpage_nr_pages(page));
>>>> list_add(&page->lru, &pagelist);
>>>> ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
>>>> MIGRATE_SYNC, MR_MEMORY_FAILURE);
>>>> @@ -1689,28 +1699,11 @@ static int __soft_offline_page(struct page *page, int flags)
>>>> static int soft_offline_in_use_page(struct page *page, int flags)
>>>> {
>>>> int ret;
>>>> - struct page *hpage = compound_head(page);
>>>> -
>>>> - if (!PageHuge(page) && PageTransHuge(hpage)) {
>>>> - lock_page(hpage);
>>>> - if (!PageAnon(hpage) || unlikely(split_huge_page(hpage))) {
>>>> - unlock_page(hpage);
>>>> - if (!PageAnon(hpage))
>>>> - pr_info("soft offline: %#lx: non anonymous thp\n", page_to_pfn(page));
>>>> - else
>>>> - pr_info("soft offline: %#lx: thp split failed\n", page_to_pfn(page));
>>>> - put_hwpoison_page(hpage);
>>>> - return -EBUSY;
>>>> - }
>>>> - unlock_page(hpage);
>>>> - get_hwpoison_page(page);
>>>> - put_hwpoison_page(hpage);
>>>> - }
>>>>
>>>> if (PageHuge(page))
>>>> ret = soft_offline_huge_page(page, flags);
>>>> else
>>>> - ret = __soft_offline_page(page, flags);
>>>> + ret = __soft_offline_page(compound_head(page), flags);
>>> Hmm, what if the THP allocation fails in the new_page() path and
>>> we fallback for general page allocation. In that case we will
>>> always be still calling with the head page ? Because we dont
>>> split the huge page any more.
>> This could be a problem if the user wants to offline a TailPage but due
>> to THP allocation failure, the HeadPage is offlined.
>
> Right, "retry with split" part is unfinished, so we need some improvement.
>
>> It may be better to only soft offline THPs if page ==
>> compound_head(page). If page != compound_head(page), we still split THPs
>> like before.
>>
>> Because in migrate_pages(), we cannot guarantee any TailPages in that
>> THP are migrated (1. THP allocation failure causes THP splitting, then
>> only HeadPage is going to be migrated; 2. even if we change existing
>> migrate_pages() implementation to add all TailPages to migration list
>> instead of LRU list, we still cannot guarantee the TailPage we want to
>> migrate is migrated.).
>>
>> Naoya, what do you think?
>
> Maybe soft offline is a special caller of page migration because it
> basically wants to migrate only one page, but thp migration still has
> a benefit because we can avoid thp split.
> So I like that we try thp migration at first, and if it fails we fall
> back to split and migrate (only) a raw error page. This should be done
> in caller side for soft offline, because it knows where the error page is.
Makes sense. So when migrate_pages() sees that the migration reason is
MR_MEMORY_FAILURE, it will not split the THP when new page allocation
fails. Then the soft offline caller will split the failed THP and retry
migrating the error subpage. I can do that.
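The control flow described here might look roughly like the following
userspace sketch. Everything in it is a hypothetical model rather than the
planned kernel change: model_migrate_whole_thp() stands in for a
migrate_pages() call with MR_MEMORY_FAILURE that leaves the THP unsplit on
allocation failure, and model_soft_offline() stands in for the soft
offline caller.

```c
#include <assert.h>
#include <stdbool.h>

/* Outcome of one modelled soft offline attempt. */
struct model_result {
	bool thp_migrated;	/* the THP moved as a whole */
	bool was_split;		/* the caller had to split the THP */
	int migrated_subpage;	/* index of the subpage moved alone, or -1 */
};

/*
 * Stand-in for migrate_pages() with MR_MEMORY_FAILURE: returns 0 on
 * success, -1 when the huge target page cannot be allocated. In the
 * failure case the THP is left intact for the caller to handle.
 */
static int model_migrate_whole_thp(bool thp_alloc_ok)
{
	return thp_alloc_ok ? 0 : -1;
}

/* Caller side: try the whole THP first, then split and retry. */
struct model_result model_soft_offline(int error_idx, bool thp_alloc_ok)
{
	struct model_result r = { false, false, -1 };

	assert(error_idx >= 0);
	if (model_migrate_whole_thp(thp_alloc_ok) == 0) {
		r.thp_migrated = true;
		return r;
	}
	/* Fallback: split, then migrate only the subpage with the error. */
	r.was_split = true;
	r.migrated_subpage = error_idx;
	return r;
}
```

Keeping the fallback in the caller means the error subpage index stays
available, which the common migration code does not know.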
>
> As for generic case (for other migration callers which mainly want to
> migrate multiple pages for their purpose,) thp split and retry can be
> done in common migration code. After thp split, all subpages are linked
> to migration list, then we retry without returning to the caller.
> So I think that split_huge_page() can be moved to (for example) for-loop
> in migrate_pages().
>
> I tried to write a patch for it last year, but considering vm event
> accounting, the patch might be large (~100 lines).
Yes. I saw your code on GitHub. I can pick it up and send it for
review after this patchset is merged, if you are OK with that.
--
Best Regards,
Yan Zi