From: Zi Yan <[email protected]>
Hi Andrew,
The patches are rebased on mmotm-2017-07-12-15-11 and address the feedback
on the v8 patches. Is there anything else I need to do to get them merged?
Patch 1 factors out common code. It could be picked up easily.
Patch 2 moves _PAGE_SWP_SOFT_DIRTY bit to prepare for THP migration.
Patch 3 adds a new TTU flag to avoid the conflict between TTU_MIGRATION and THP migration.
Patches 4-6 are the core part of THP migration.
Patch 7 adds soft dirty bit support to THP migration.
Patches 8-10 enable THP migration in various locations in the kernel.
Thanks.
Motivations
===========================================
1. THP migration becomes important in the upcoming heterogeneous memory systems.
As David Nellans from NVIDIA pointed out in another thread
(http://www.mail-archive.com/[email protected]/msg1349227.html),
future GPUs or other accelerators will have their memory managed by operating
systems. Moving data into and out of these memory nodes efficiently is critical
to applications that use GPUs or other accelerators. Existing page migration
only supports base pages, which has a very low memory bandwidth utilization.
My experiments (see below) show THP migration can migrate pages more efficiently.
2. Base page migration vs THP migration throughput.
Here are cross-socket page migration results from calling
move_pages() syscall:
In x86_64, an Intel two-socket E5-2640v3 box,
single 4KB base page migration takes 62.47 us, using 0.06 GB/s BW,
single 2MB THP migration takes 658.54 us, using 2.97 GB/s BW,
512 4KB base page migration takes 1987.38 us, using 0.98 GB/s BW.
In ppc64, a two-socket Power8 box,
single 64KB base page migration takes 49.3 us, using 1.24 GB/s BW,
single 16MB THP migration takes 2202.17 us, using 7.10 GB/s BW,
256 64KB base page migration takes 2543.65 us, using 6.14 GB/s BW.
THP migration gives roughly 3x and 1.15x the throughput of base page migration
in x86_64 and ppc64, respectively.
You can test it out by using the code here:
https://github.com/x-y-z/thp-migration-bench
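As a cross-check on the numbers in item 2, the following sketch recomputes
bandwidth from transfer size and latency. It assumes the reported figures use
binary gigabytes (2^30 bytes), which is what reproduces the values above. The
helper names are made up for illustration and are not part of the benchmark
repository.

```c
#include <assert.h>

/* Bytes per GiB; assumed unit, since it reproduces the figures in the text. */
#define BYTES_PER_GIB (1024.0 * 1024.0 * 1024.0)

/* Bandwidth in GiB/s for moving `bytes` bytes in `usec` microseconds. */
static double migration_bw_gib(double bytes, double usec)
{
	assert(usec > 0.0);
	return bytes / (usec * 1e-6) / BYTES_PER_GIB;
}

/*
 * Throughput ratio of one THP migration over migrating the same amount
 * of memory as base pages: equal bytes, so the ratio is base time / THP time.
 */
static double thp_speedup(double thp_usec, double base_usec)
{
	assert(thp_usec > 0.0);
	return base_usec / thp_usec;
}
```

For example, migration_bw_gib(2.0 * 1024 * 1024, 658.54) gives about 2.97 and
thp_speedup(658.54, 1987.38) gives about 3.02, matching the x86_64 results.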
3. Existing page migration splits THP before migration and cannot guarantee
the migrated pages are still contiguous. Contiguity is always what GPUs and
accelerators look for. Without THP migration, khugepaged needs to do extra work
to reassemble the migrated pages back to THPs.
ChangeLog
===========================================
Changes since v8:
* Avoid shmem THP migrations, which are not supported yet.
* Simplify PMD-mapped THP checks in try_to_unmap_one() and remove_migration_pte().
* Fix VM_BUG_ON()s that trigger false alarms.
Changes since v7:
* Remove BUILD_BUG() in pmd_to_swp_entry() and swp_entry_to_pmd() to allow
replacing the macro with IS_ENABLED() in several code chunks. This makes them
easier to read.
* Rename variable 'migration' to 'flush_needed' for better understanding.
* Use pmdp_invalidate() to avoid race with MADV_DONTNEED.
* Remove unnecessary tlb flush in remove_migration_pmd().
* Add the missing migration flag check in page_vma_mapped_walk().
* Remove unused code in do_huge_pmd_wp_page().
* Add migration entry permission change comment to change_huge_pmd()
to avoid confusion.
Changes since v6:
* Fix the kbuild bot warning in swp_entry_to_pmd().
* Add macro to disable the code when thp migration is not enabled. This fixes
the kbuild bot errors while building kernels without THP migration enabled.
* In memory hotremove, move THP allocation code from new_node_page() to
new_page_nodemask(). This follows the patch ("mm: unify new_node_page and
alloc_migrate_target") in latest mmotm.
Changes since v5:
* THP migration support for soft-offline patch is dropped, because it needs
more discussion. I will send it separately.
* Better commit message in Patch 2 (on moving _PAGE_SWP_SOFT_DIRTY bit),
thanks to Dave Hansen for his help.
Changes since v4:
* In Patch 5, I dropped PTE-mapped THP migration handling code, since it is
already well handled by existing code.
* In Patch 6, I did a thorough check on PMD handling places and corrected all
errors I discovered.
* In Patch 6, I use is_swap_pmd() to check PMD migration entries and add
VM_BUG_ON to make sure only migration entries are present. It should be useful
later when someone wants to add PMD swap entries, since VM_BUG_ON will
catch the missing code path.
* In Patch 6, I keep pmd_none() in pmd_none_or_trans_huge_or_clear_bad() to
avoid confusion on the function name. I also add a comment to explain it.
* In Patches 7-11, I added some missing soft dirty bit preserving code and
corrected page stats counting.
Changes since v3:
* I dropped my fix on zap_pmd_range() since THP migration will not trigger
it and Kirill has posted patches to fix the bug triggered by MADV_DONTNEED.
* In Patch 6, I used !pmd_present() instead of is_pmd_migration_entry()
in pmd_none_or_trans_huge_or_clear_bad() to avoid moving the function to
linux/swapops.h. Currently, !pmd_present() is equivalent to
is_pmd_migration_entry(). Any suggestions on this change are welcome.
Changes since v2:
* I fix a bug in zap_pmd_range() and include the fixes in Patches 1-3.
The racy check in zap_pmd_range() can miss pmd_protnone and pmd_migration_entry,
which leads to PTE page table not freed.
* In Patch 4, I move _PAGE_SWP_SOFT_DIRTY to bit 1. Because bit 6 (used in v2)
can be set by some CPUs by mistake and the new swap entry format does not use
bit 1-4.
* I also adjust two core migration functions, set_pmd_migration_entry() and
remove_migration_pmd(), to use Kirill A. Shutemov's page_vma_mapped_walk()
function. Patch 8 needs Kirill's comments, since I also add changes
to his page_vma_mapped_walk() function with pmd_migration_entry handling.
* In Patch 8, I replace pmdp_huge_get_and_clear() with pmdp_huge_clear_flush()
in set_pmd_migration_entry() to avoid data corruption after page migration.
* In Patch 9, I include is_pmd_migration_entry() in pmd_none_or_trans_huge_or_clear_bad().
Otherwise, a pmd_migration_entry is treated as pmd_bad and cleared, which
leads to deposited PTE page table not freed.
* I personally use this patchset with my customized kernel to test frequent
page migrations by replacing page reclaim with page migration.
The bugs fixed in Patches 1-3 and 8 were discovered while I was testing my kernel.
I did a 16-hour stress test that has ~7 billion total page migrations.
No error or data corruption was found.
General description
===========================================
This patchset enhances page migration functionality to handle thp migration
for the various callers of page migration:
- mbind(2)
- move_pages(2)
- migrate_pages(2)
- cgroup/cpuset migration
- memory hotremove
The main benefit is that we can avoid unnecessary thp splits, which helps us
avoid a performance decrease when applications handle NUMA optimization on
their own.
The implementation is similar to that of normal page migration. The key point
is that we convert a pmd into a pmd migration entry in a swap-entry-like format.
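To make the "swap-entry-like format" concrete, here is a minimal userspace
sketch. The bit layout, the type values and every toy_* name are assumptions
made up for this illustration; the real encoding is architecture-specific and
lives in pmd_to_swp_entry()/swp_entry_to_pmd(). The only property taken from
the series is that a migration pmd is neither none nor present.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t toy_pmd_t;

/* Hypothetical layout: bit 0 = present, bits 2-6 = type, bits 7+ = offset. */
#define TOY_PRESENT      (1ULL << 0)
#define TOY_TYPE_SHIFT   2
#define TOY_TYPE_BITS    5
#define TOY_TYPE_MASK    ((1ULL << TOY_TYPE_BITS) - 1)
#define TOY_OFFSET_SHIFT (TOY_TYPE_SHIFT + TOY_TYPE_BITS)

/* Hypothetical type values for read-only and writable migration entries. */
enum toy_swp_type { TOY_MIGRATION_READ = 30, TOY_MIGRATION_WRITE = 31 };

/* Encode a migration entry: present bit clear, pfn stored as the offset. */
static toy_pmd_t toy_make_migration_pmd(uint64_t pfn, bool write)
{
	uint64_t type = write ? TOY_MIGRATION_WRITE : TOY_MIGRATION_READ;

	assert(pfn < (1ULL << (64 - TOY_OFFSET_SHIFT)));
	return (type << TOY_TYPE_SHIFT) | (pfn << TOY_OFFSET_SHIFT);
}

static bool toy_pmd_none(toy_pmd_t pmd)
{
	return pmd == 0;
}

static bool toy_pmd_present(toy_pmd_t pmd)
{
	return (pmd & TOY_PRESENT) != 0;
}

/* Same shape as is_swap_pmd() in this series: not none and not present. */
static bool toy_is_swap_pmd(toy_pmd_t pmd)
{
	return !toy_pmd_none(pmd) && !toy_pmd_present(pmd);
}

/* Recover the pfn of the page under migration. */
static uint64_t toy_migration_pfn(toy_pmd_t pmd)
{
	return pmd >> TOY_OFFSET_SHIFT;
}

static bool toy_is_write_migration(toy_pmd_t pmd)
{
	return ((pmd >> TOY_TYPE_SHIFT) & TOY_TYPE_MASK) == TOY_MIGRATION_WRITE;
}
```

Because the present bit is clear, hardware never walks through such an entry,
and software can still recover the target page from the offset field.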
Naoya Horiguchi (8):
mm: mempolicy: add queue_pages_required()
mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1
mm: thp: introduce separate TTU flag for thp freezing
mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATION
mm: soft-dirty: keep soft-dirty bits over thp migration
mm: mempolicy: mbind and migrate_pages support thp migration
mm: migrate: move_pages() supports thp migration
mm: memory_hotplug: memory hotremove supports thp migration
Zi Yan (2):
mm: thp: enable thp migration in generic path
mm: thp: check pmd migration entry in common path
arch/x86/Kconfig | 4 +
arch/x86/include/asm/pgtable.h | 17 ++++
arch/x86/include/asm/pgtable_64.h | 14 ++-
arch/x86/include/asm/pgtable_types.h | 10 +-
fs/proc/task_mmu.c | 59 +++++++-----
include/asm-generic/pgtable.h | 52 ++++++++++-
include/linux/huge_mm.h | 24 ++++-
include/linux/migrate.h | 15 ++-
include/linux/rmap.h | 3 +-
include/linux/swapops.h | 69 +++++++++++++-
mm/Kconfig | 3 +
mm/gup.c | 22 ++++-
mm/huge_memory.c | 174 ++++++++++++++++++++++++++++++++---
mm/memcontrol.c | 5 +
mm/memory.c | 12 ++-
mm/memory_hotplug.c | 4 +-
mm/mempolicy.c | 130 +++++++++++++++++++-------
mm/migrate.c | 77 +++++++++++++---
mm/mprotect.c | 4 +-
mm/mremap.c | 2 +-
mm/page_vma_mapped.c | 18 +++-
mm/pgtable-generic.c | 3 +-
mm/rmap.c | 20 +++-
23 files changed, 627 insertions(+), 114 deletions(-)
--
2.11.0
From: Naoya Horiguchi <[email protected]>
Introduce a separate check routine related to MPOL_MF_INVERT flag.
This patch just does cleanup, no behavioral change.
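The "no behavioral change" claim can be checked with a small userspace model.
This is only a sketch: toy_node_isset() stands in for node_isset(), and
TOY_MF_INVERT is a placeholder value, not the kernel's MPOL_MF_INVERT. The old
code skipped a page when isset == !!invert; the new code skips it when the
helper returns false, and the two conditions agree for all inputs.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MF_INVERT 0x10u	/* placeholder flag value for this sketch */

/* Stand-in for node_isset(): is bit `nid` set in `mask`? */
static bool toy_node_isset(int nid, unsigned long mask)
{
	assert(nid >= 0 && nid < (int)(8 * sizeof(mask)));
	return (mask >> nid) & 1UL;
}

/* New form: the predicate introduced by this patch. */
static bool toy_queue_pages_required(int nid, unsigned long mask,
				     unsigned int flags)
{
	return toy_node_isset(nid, mask) == !(flags & TOY_MF_INVERT);
}

/* Old form: the open-coded "skip this page" condition being replaced. */
static bool toy_old_skip(int nid, unsigned long mask, unsigned int flags)
{
	return toy_node_isset(nid, mask) == !!(flags & TOY_MF_INVERT);
}
```

For every combination of "node in mask" and "invert set",
toy_old_skip() equals !toy_queue_pages_required().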
Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Zi Yan <[email protected]>
---
mm/mempolicy.c | 22 +++++++++++++++++-----
1 file changed, 17 insertions(+), 5 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d911fa5cb2a7..58166bf1d1fd 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -412,6 +412,21 @@ struct queue_pages {
};
/*
+ * Check if the page's nid is in qp->nmask.
+ *
+ * If MPOL_MF_INVERT is set in qp->flags, check if the nid is
+ * in the invert of qp->nmask.
+ */
+static inline bool queue_pages_required(struct page *page,
+ struct queue_pages *qp)
+{
+ int nid = page_to_nid(page);
+ unsigned long flags = qp->flags;
+
+ return node_isset(nid, *qp->nmask) == !(flags & MPOL_MF_INVERT);
+}
+
+/*
* Scan through pages checking if pages follow certain conditions,
* and move them to the pagelist if they do.
*/
@@ -464,8 +479,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
*/
if (PageReserved(page))
continue;
- nid = page_to_nid(page);
- if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
+ if (!queue_pages_required(page, qp))
continue;
if (PageTransCompound(page)) {
get_page(page);
@@ -497,7 +511,6 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
#ifdef CONFIG_HUGETLB_PAGE
struct queue_pages *qp = walk->private;
unsigned long flags = qp->flags;
- int nid;
struct page *page;
spinlock_t *ptl;
pte_t entry;
@@ -507,8 +520,7 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
if (!pte_present(entry))
goto unlock;
page = pte_page(entry);
- nid = page_to_nid(page);
- if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
+ if (!queue_pages_required(page, qp))
goto unlock;
/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
if (flags & (MPOL_MF_MOVE_ALL) ||
--
2.11.0
From: Naoya Horiguchi <[email protected]>
This patch enables thp migration for memory hotremove.
---
ChangeLog v1->v2:
- base code switched from alloc_migrate_target to new_node_page()
Signed-off-by: Naoya Horiguchi <[email protected]>
ChangeLog v2->v7:
- base code switched from new_node_page() to new_page_nodemask()
Signed-off-by: Zi Yan <[email protected]>
---
include/linux/migrate.h | 15 ++++++++++++++-
mm/memory_hotplug.c | 4 +++-
2 files changed, 17 insertions(+), 2 deletions(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 3e0d405dc842..ce15989521a1 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -35,15 +35,28 @@ static inline struct page *new_page_nodemask(struct page *page,
int preferred_nid, nodemask_t *nodemask)
{
gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL;
+ unsigned int order = 0;
+ struct page *new_page = NULL;
if (PageHuge(page))
return alloc_huge_page_nodemask(page_hstate(compound_head(page)),
preferred_nid, nodemask);
+ if (thp_migration_supported() && PageTransHuge(page)) {
+ order = HPAGE_PMD_ORDER;
+ gfp_mask |= GFP_TRANSHUGE;
+ }
+
if (PageHighMem(page) || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
gfp_mask |= __GFP_HIGHMEM;
- return __alloc_pages_nodemask(gfp_mask, 0, preferred_nid, nodemask);
+ new_page = __alloc_pages_nodemask(gfp_mask, order,
+ preferred_nid, nodemask);
+
+ if (new_page && PageTransHuge(page))
+ prep_transhuge_page(new_page);
+
+ return new_page;
}
#ifdef CONFIG_MIGRATION
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d620d0427b6b..30e980069351 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1416,7 +1416,9 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
if (isolate_huge_page(page, &source))
move_pages -= 1 << compound_order(head);
continue;
- }
+ } else if (thp_migration_supported() && PageTransHuge(page))
+ pfn = page_to_pfn(compound_head(page))
+ + hpage_nr_pages(page) - 1;
if (!get_page_unless_zero(page))
continue;
--
2.11.0
From: Naoya Horiguchi <[email protected]>
This patch enables thp migration for move_pages(2).
Signed-off-by: Naoya Horiguchi <[email protected]>
ChangeLog: v1 -> v5:
- fix page counting
ChangeLog: v5 -> v6:
- drop changes on soft-offline in unmap_and_move()
Signed-off-by: Zi Yan <[email protected]>
---
mm/migrate.c | 45 ++++++++++++++++++++++++++++++++-------------
1 file changed, 32 insertions(+), 13 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 0e2916f0c201..fd33aa444f47 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -184,8 +184,8 @@ void putback_movable_pages(struct list_head *l)
unlock_page(page);
put_page(page);
} else {
- dec_node_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
+ page_is_file_cache(page), -hpage_nr_pages(page));
putback_lru_page(page);
}
}
@@ -1145,8 +1145,8 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
* as __PageMovable
*/
if (likely(!__PageMovable(page)))
- dec_node_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
+ page_is_file_cache(page), -hpage_nr_pages(page));
}
/*
@@ -1420,7 +1420,17 @@ static struct page *new_page_node(struct page *p, unsigned long private,
if (PageHuge(p))
return alloc_huge_page_node(page_hstate(compound_head(p)),
pm->node);
- else
+ else if (thp_migration_supported() && PageTransHuge(p)) {
+ struct page *thp;
+
+ thp = alloc_pages_node(pm->node,
+ (GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
+ HPAGE_PMD_ORDER);
+ if (!thp)
+ return NULL;
+ prep_transhuge_page(thp);
+ return thp;
+ } else
return __alloc_pages_node(pm->node,
GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
}
@@ -1447,6 +1457,8 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
for (pp = pm; pp->node != MAX_NUMNODES; pp++) {
struct vm_area_struct *vma;
struct page *page;
+ struct page *head;
+ unsigned int follflags;
err = -EFAULT;
vma = find_vma(mm, pp->addr);
@@ -1454,8 +1466,10 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
goto set_status;
/* FOLL_DUMP to ignore special (like zero) pages */
- page = follow_page(vma, pp->addr,
- FOLL_GET | FOLL_SPLIT | FOLL_DUMP);
+ follflags = FOLL_GET | FOLL_DUMP;
+ if (!thp_migration_supported())
+ follflags |= FOLL_SPLIT;
+ page = follow_page(vma, pp->addr, follflags);
err = PTR_ERR(page);
if (IS_ERR(page))
@@ -1465,7 +1479,6 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
if (!page)
goto set_status;
- pp->page = page;
err = page_to_nid(page);
if (err == pp->node)
@@ -1480,16 +1493,22 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
goto put_and_set;
if (PageHuge(page)) {
- if (PageHead(page))
+ if (PageHead(page)) {
isolate_huge_page(page, &pagelist);
+ err = 0;
+ pp->page = page;
+ }
goto put_and_set;
}
- err = isolate_lru_page(page);
+ pp->page = compound_head(page);
+ head = compound_head(page);
+ err = isolate_lru_page(head);
if (!err) {
- list_add_tail(&page->lru, &pagelist);
- inc_node_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ list_add_tail(&head->lru, &pagelist);
+ mod_node_page_state(page_pgdat(head),
+ NR_ISOLATED_ANON + page_is_file_cache(head),
+ hpage_nr_pages(head));
}
put_and_set:
/*
--
2.11.0
From: Naoya Horiguchi <[email protected]>
This patch enables thp migration for mbind(2) and migrate_pages(2).
ChangeLog v1 -> v2:
- support pte-mapped and doubly-mapped thp
Signed-off-by: Naoya Horiguchi <[email protected]>
ChangeLog v2 -> v6:
- use the same gfp flag (GFP_TRANSHUGE) in mbind() and migrate_pages()
for thp allocations.
Signed-off-by: Zi Yan <[email protected]>
---
mm/mempolicy.c | 108 +++++++++++++++++++++++++++++++++++++++++----------------
1 file changed, 79 insertions(+), 29 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 58166bf1d1fd..088e6562f6f4 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -97,6 +97,7 @@
#include <linux/mm_inline.h>
#include <linux/mmu_notifier.h>
#include <linux/printk.h>
+#include <linux/swapops.h>
#include <asm/tlbflush.h>
#include <linux/uaccess.h>
@@ -426,6 +427,49 @@ static inline bool queue_pages_required(struct page *page,
return node_isset(nid, *qp->nmask) == !(flags & MPOL_MF_INVERT);
}
+static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+{
+ int ret = 0;
+ struct page *page;
+ struct queue_pages *qp = walk->private;
+ unsigned long flags;
+
+ if (unlikely(is_pmd_migration_entry(*pmd))) {
+ ret = 1;
+ goto unlock;
+ }
+ page = pmd_page(*pmd);
+ if (is_huge_zero_page(page)) {
+ spin_unlock(ptl);
+ __split_huge_pmd(walk->vma, pmd, addr, false, NULL);
+ goto out;
+ }
+ if (!thp_migration_supported()) {
+ get_page(page);
+ spin_unlock(ptl);
+ lock_page(page);
+ ret = split_huge_page(page);
+ unlock_page(page);
+ put_page(page);
+ goto out;
+ }
+ if (!queue_pages_required(page, qp)) {
+ ret = 1;
+ goto unlock;
+ }
+
+ ret = 1;
+ flags = qp->flags;
+ /* go to thp migration */
+ if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+ migrate_page_add(page, qp->pagelist, flags);
+unlock:
+ spin_unlock(ptl);
+out:
+ return ret;
+}
+
/*
* Scan through pages checking if pages follow certain conditions,
* and move them to the pagelist if they do.
@@ -437,30 +481,15 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
struct page *page;
struct queue_pages *qp = walk->private;
unsigned long flags = qp->flags;
- int nid, ret;
+ int ret;
pte_t *pte;
spinlock_t *ptl;
- if (pmd_trans_huge(*pmd)) {
- ptl = pmd_lock(walk->mm, pmd);
- if (pmd_trans_huge(*pmd)) {
- page = pmd_page(*pmd);
- if (is_huge_zero_page(page)) {
- spin_unlock(ptl);
- __split_huge_pmd(vma, pmd, addr, false, NULL);
- } else {
- get_page(page);
- spin_unlock(ptl);
- lock_page(page);
- ret = split_huge_page(page);
- unlock_page(page);
- put_page(page);
- if (ret)
- return 0;
- }
- } else {
- spin_unlock(ptl);
- }
+ ptl = pmd_trans_huge_lock(pmd, vma);
+ if (ptl) {
+ ret = queue_pages_pmd(pmd, ptl, addr, end, walk);
+ if (ret)
+ return 0;
}
if (pmd_trans_unstable(pmd))
@@ -481,7 +510,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
continue;
if (!queue_pages_required(page, qp))
continue;
- if (PageTransCompound(page)) {
+ if (PageTransCompound(page) && !thp_migration_supported()) {
get_page(page);
pte_unmap_unlock(pte, ptl);
lock_page(page);
@@ -898,19 +927,21 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
#ifdef CONFIG_MIGRATION
/*
- * page migration
+ * page migration, thp tail pages can be passed.
*/
static void migrate_page_add(struct page *page, struct list_head *pagelist,
unsigned long flags)
{
+ struct page *head = compound_head(page);
/*
* Avoid migrating a page that is shared with others.
*/
- if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1) {
- if (!isolate_lru_page(page)) {
- list_add_tail(&page->lru, pagelist);
- inc_node_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(head) == 1) {
+ if (!isolate_lru_page(head)) {
+ list_add_tail(&head->lru, pagelist);
+ mod_node_page_state(page_pgdat(head),
+ NR_ISOLATED_ANON + page_is_file_cache(head),
+ hpage_nr_pages(head));
}
}
}
@@ -920,7 +951,17 @@ static struct page *new_node_page(struct page *page, unsigned long node, int **x
if (PageHuge(page))
return alloc_huge_page_node(page_hstate(compound_head(page)),
node);
- else
+ else if (thp_migration_supported() && PageTransHuge(page)) {
+ struct page *thp;
+
+ thp = alloc_pages_node(node,
+ (GFP_TRANSHUGE | __GFP_THISNODE),
+ HPAGE_PMD_ORDER);
+ if (!thp)
+ return NULL;
+ prep_transhuge_page(thp);
+ return thp;
+ } else
return __alloc_pages_node(node, GFP_HIGHUSER_MOVABLE |
__GFP_THISNODE, 0);
}
@@ -1086,6 +1127,15 @@ static struct page *new_page(struct page *page, unsigned long start, int **x)
if (PageHuge(page)) {
BUG_ON(!vma);
return alloc_huge_page_noerr(vma, address, 1);
+ } else if (thp_migration_supported() && PageTransHuge(page)) {
+ struct page *thp;
+
+ thp = alloc_hugepage_vma(GFP_TRANSHUGE, vma, address,
+ HPAGE_PMD_ORDER);
+ if (!thp)
+ return NULL;
+ prep_transhuge_page(thp);
+ return thp;
}
/*
* if !vma, alloc_page_vma() will use task or system default policy
--
2.11.0
From: Zi Yan <[email protected]>
Once one of the callers of page migration starts to handle thp, the
memory management code starts to see pmd migration entries, so we need
to prepare for them before enabling thp migration. This patch changes the
various code paths that check the status of a given pmd, in order to prevent
races between thp migration and other pmd-related work.
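The kind of check these code paths need can be sketched in userspace as a
classifier over the possible pmd states. The flag bits and all toy_* names
below are hypothetical and chosen only for this illustration; the point is the
order of the tests: a non-present, non-empty pmd must be recognized as a
migration entry before anything treats it as a page table pointer.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t toy_pmd_t;

/* Hypothetical flag bits, chosen only for this sketch. */
#define TOY_PMD_PRESENT (1ULL << 0)
#define TOY_PMD_HUGE    (1ULL << 1)
#define TOY_PMD_DEVMAP  (1ULL << 2)

enum toy_pmd_kind {
	TOY_KIND_NONE,		/* empty slot */
	TOY_KIND_SWAP,		/* not present: a migration entry */
	TOY_KIND_TRANS_HUGE,	/* present huge mapping */
	TOY_KIND_DEVMAP,	/* present device mapping */
	TOY_KIND_TABLE,		/* points to a PTE page table */
};

static enum toy_pmd_kind toy_classify_pmd(toy_pmd_t pmd)
{
	if (pmd == 0)
		return TOY_KIND_NONE;
	/* Must come before the flag tests: other bits are payload here. */
	if (!(pmd & TOY_PMD_PRESENT))
		return TOY_KIND_SWAP;
	if (pmd & TOY_PMD_DEVMAP)
		return TOY_KIND_DEVMAP;
	if (pmd & TOY_PMD_HUGE)
		return TOY_KIND_TRANS_HUGE;
	return TOY_KIND_TABLE;
}

/* A walker may descend to the PTE level only for a real page table. */
static int toy_can_walk_ptes(toy_pmd_t pmd)
{
	return toy_classify_pmd(pmd) == TOY_KIND_TABLE;
}

/* Loose analogue of a VM_BUG_ON(): refuse to decode a non-table pmd. */
static uint64_t toy_pte_table_base(toy_pmd_t pmd)
{
	assert(toy_can_walk_ptes(pmd));
	return pmd & ~0xfffULL;
}
```

A walker that only tested for "none" or "huge" would fall through to the
table case for a migration entry, which is the class of bug this patch guards
against.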
ChangeLog v1 -> v2:
- introduce pmd_related() (I know the naming is not good, but I can't
think of a better name. Any suggestion is welcome.)
Signed-off-by: Naoya Horiguchi <[email protected]>
ChangeLog v2 -> v3:
- add is_swap_pmd()
- a pmd entry should be pmd pointing to pte pages, is_swap_pmd(),
pmd_trans_huge(), pmd_devmap(), or pmd_none()
- pmd_none_or_trans_huge_or_clear_bad() and pmd_trans_unstable() return
true on pmd_migration_entry, so that migration entries are not
treated as pmd page table entries.
ChangeLog v4 -> v5:
- add explanation in pmd_none_or_trans_huge_or_clear_bad() to state
the equivalence of !pmd_present() and is_pmd_migration_entry()
- fix migration entry wait deadlock code (from v1) in follow_page_mask()
- remove unnecessary code (from v1) in follow_trans_huge_pmd()
- use is_swap_pmd() instead of !pmd_present() for pmd migration entry,
so it will not be confused with pmd_none()
- change author information
ChangeLog v5 -> v7
- use macro to disable the code when thp migration is not enabled
ChangeLog v7 -> v8
- remove unused code in do_huge_pmd_wp_page()
- copy the comment from change_pte_range() on downgrading
write migration entry to read to change_huge_pmd()
ChangeLog v8 -> v9
- fix VM_BUG_ON()s that trigger false alarms.
Signed-off-by: Zi Yan <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
---
fs/proc/task_mmu.c | 32 +++++++++++++--------
include/asm-generic/pgtable.h | 18 +++++++++++-
include/linux/huge_mm.h | 14 ++++++++--
mm/gup.c | 22 +++++++++++++--
mm/huge_memory.c | 65 +++++++++++++++++++++++++++++++++++++++----
mm/memcontrol.c | 5 ++++
mm/memory.c | 12 ++++++--
mm/mprotect.c | 4 +--
mm/mremap.c | 2 +-
9 files changed, 147 insertions(+), 27 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index b836fd61ed87..0f17a7cccb41 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -596,7 +596,8 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
ptl = pmd_trans_huge_lock(pmd, vma);
if (ptl) {
- smaps_pmd_entry(pmd, addr, walk);
+ if (pmd_present(*pmd))
+ smaps_pmd_entry(pmd, addr, walk);
spin_unlock(ptl);
return 0;
}
@@ -938,6 +939,9 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
goto out;
}
+ if (!pmd_present(*pmd))
+ goto out;
+
page = pmd_page(*pmd);
/* Clear accessed and referenced bits. */
@@ -1217,27 +1221,33 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
if (ptl) {
u64 flags = 0, frame = 0;
pmd_t pmd = *pmdp;
+ struct page *page = NULL;
if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(pmd))
flags |= PM_SOFT_DIRTY;
- /*
- * Currently pmd for thp is always present because thp
- * can not be swapped-out, migrated, or HWPOISONed
- * (split in such cases instead.)
- * This if-check is just to prepare for future implementation.
- */
if (pmd_present(pmd)) {
- struct page *page = pmd_page(pmd);
-
- if (page_mapcount(page) == 1)
- flags |= PM_MMAP_EXCLUSIVE;
+ page = pmd_page(pmd);
flags |= PM_PRESENT;
if (pm->show_pfn)
frame = pmd_pfn(pmd) +
((addr & ~PMD_MASK) >> PAGE_SHIFT);
}
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+ else if (is_swap_pmd(pmd)) {
+ swp_entry_t entry = pmd_to_swp_entry(pmd);
+
+ frame = swp_type(entry) |
+ (swp_offset(entry) << MAX_SWAPFILES_SHIFT);
+ flags |= PM_SWAP;
+ VM_BUG_ON(!is_pmd_migration_entry(pmd));
+ page = migration_entry_to_page(entry);
+ }
+#endif
+
+ if (page && page_mapcount(page) == 1)
+ flags |= PM_MMAP_EXCLUSIVE;
for (; addr != end; addr += PAGE_SIZE) {
pagemap_entry_t pme = make_pme(frame, flags);
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 7dfa767dc680..8937d51c2834 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -834,7 +834,23 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
barrier();
#endif
- if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
+ /*
+ * !pmd_present() checks for pmd migration entries
+ *
+ * The complete check uses is_pmd_migration_entry() in linux/swapops.h
+ * But using that requires moving current function and pmd_trans_unstable()
+ * to linux/swapops.h to resolve dependency, which is too much code move.
+ *
+ * !pmd_present() is equivalent to is_pmd_migration_entry() currently,
+ * because !pmd_present() pages can only be under migration not swapped
+ * out.
+ *
+ * pmd_none() is preserved for future condition checks on pmd migration
+ * entries and not confusing with this function name, although it is
+ * redundant with !pmd_present().
+ */
+ if (pmd_none(pmdval) || pmd_trans_huge(pmdval) ||
+ (IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION) && !pmd_present(pmdval)))
return 1;
if (unlikely(pmd_bad(pmdval))) {
pmd_clear_bad(pmd);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index d8f35a0865dc..14bc21c2ee7f 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -147,7 +147,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
#define split_huge_pmd(__vma, __pmd, __address) \
do { \
pmd_t *____pmd = (__pmd); \
- if (pmd_trans_huge(*____pmd) \
+ if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd) \
|| pmd_devmap(*____pmd)) \
__split_huge_pmd(__vma, __pmd, __address, \
false, NULL); \
@@ -178,12 +178,18 @@ extern spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd,
struct vm_area_struct *vma);
extern spinlock_t *__pud_trans_huge_lock(pud_t *pud,
struct vm_area_struct *vma);
+
+static inline int is_swap_pmd(pmd_t pmd)
+{
+ return !pmd_none(pmd) && !pmd_present(pmd);
+}
+
/* mmap_sem must be held on entry */
static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
struct vm_area_struct *vma)
{
VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma);
- if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
+ if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
return __pmd_trans_huge_lock(pmd, vma);
else
return NULL;
@@ -299,6 +305,10 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
long adjust_next)
{
}
+static inline int is_swap_pmd(pmd_t pmd)
+{
+ return 0;
+}
static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
struct vm_area_struct *vma)
{
diff --git a/mm/gup.c b/mm/gup.c
index 23f01c40c88f..d81dd886d5d9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -234,6 +234,16 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
return page;
return no_page_table(vma, flags);
}
+retry:
+ if (!pmd_present(*pmd)) {
+ if (likely(!(flags & FOLL_MIGRATION)))
+ return no_page_table(vma, flags);
+ VM_BUG_ON(thp_migration_supported() &&
+ !is_pmd_migration_entry(*pmd));
+ if (is_pmd_migration_entry(*pmd))
+ pmd_migration_entry_wait(mm, pmd);
+ goto retry;
+ }
if (pmd_devmap(*pmd)) {
ptl = pmd_lock(mm, pmd);
page = follow_devmap_pmd(vma, address, pmd, flags);
@@ -247,7 +257,15 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
return no_page_table(vma, flags);
+retry_locked:
ptl = pmd_lock(mm, pmd);
+ if (unlikely(!pmd_present(*pmd))) {
+ spin_unlock(ptl);
+ if (likely(!(flags & FOLL_MIGRATION)))
+ return no_page_table(vma, flags);
+ pmd_migration_entry_wait(mm, pmd);
+ goto retry_locked;
+ }
if (unlikely(!pmd_trans_huge(*pmd))) {
spin_unlock(ptl);
return follow_page_pte(vma, address, pmd, flags);
@@ -424,7 +442,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
pud = pud_offset(p4d, address);
BUG_ON(pud_none(*pud));
pmd = pmd_offset(pud, address);
- if (pmd_none(*pmd))
+ if (!pmd_present(*pmd))
return -EFAULT;
VM_BUG_ON(pmd_trans_huge(*pmd));
pte = pte_offset_map(pmd, address);
@@ -1534,7 +1552,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
pmd_t pmd = READ_ONCE(*pmdp);
next = pmd_addr_end(addr, end);
- if (pmd_none(pmd))
+ if (!pmd_present(pmd))
return 0;
if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9668f8cb8317..dc7830e4993f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -914,6 +914,23 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
ret = -EAGAIN;
pmd = *src_pmd;
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+ if (unlikely(is_swap_pmd(pmd))) {
+ swp_entry_t entry = pmd_to_swp_entry(pmd);
+
+ VM_BUG_ON(!is_pmd_migration_entry(pmd));
+ if (is_write_migration_entry(entry)) {
+ make_migration_entry_read(&entry);
+ pmd = swp_entry_to_pmd(entry);
+ set_pmd_at(src_mm, addr, src_pmd, pmd);
+ }
+ set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+ ret = 0;
+ goto out_unlock;
+ }
+#endif
+
if (unlikely(!pmd_trans_huge(pmd))) {
pte_free(dst_mm, pgtable);
goto out_unlock;
@@ -1556,6 +1573,12 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
if (is_huge_zero_pmd(orig_pmd))
goto out;
+ if (unlikely(!pmd_present(orig_pmd))) {
+ VM_BUG_ON(thp_migration_supported() &&
+ !is_pmd_migration_entry(orig_pmd));
+ goto out;
+ }
+
page = pmd_page(orig_pmd);
/*
* If other processes are mapping this page, we couldn't discard
@@ -1767,6 +1790,25 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
preserve_write = prot_numa && pmd_write(*pmd);
ret = 1;
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+ if (is_swap_pmd(*pmd)) {
+ swp_entry_t entry = pmd_to_swp_entry(*pmd);
+
+ VM_BUG_ON(!is_pmd_migration_entry(*pmd));
+ if (is_write_migration_entry(entry)) {
+ pmd_t newpmd;
+ /*
+ * A protection check is difficult so
+ * just be safe and disable write
+ */
+ make_migration_entry_read(&entry);
+ newpmd = swp_entry_to_pmd(entry);
+ set_pmd_at(mm, addr, pmd, newpmd);
+ }
+ goto unlock;
+ }
+#endif
+
/*
* Avoid trapping faults against the zero page. The read-only
* data is likely to be read-cached on the local CPU and
@@ -1832,7 +1874,8 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
{
spinlock_t *ptl;
ptl = pmd_lock(vma->vm_mm, pmd);
- if (likely(pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
+ if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
+ pmd_devmap(*pmd)))
return ptl;
spin_unlock(ptl);
return NULL;
@@ -1950,14 +1993,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
struct page *page;
pgtable_t pgtable;
pmd_t _pmd;
- bool young, write, dirty, soft_dirty;
+ bool young, write, dirty, soft_dirty, pmd_migration = false;
unsigned long addr;
int i;
VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
- VM_BUG_ON(!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd));
+ VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
+ && !pmd_devmap(*pmd));
count_vm_event(THP_SPLIT_PMD);
@@ -1982,7 +2026,16 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
return __split_huge_zero_page_pmd(vma, haddr, pmd);
}
- page = pmd_page(*pmd);
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+ pmd_migration = is_pmd_migration_entry(*pmd);
+ if (pmd_migration) {
+ swp_entry_t entry;
+
+ entry = pmd_to_swp_entry(*pmd);
+ page = pfn_to_page(swp_offset(entry));
+ } else
+#endif
+ page = pmd_page(*pmd);
VM_BUG_ON_PAGE(!page_count(page), page);
page_ref_add(page, HPAGE_PMD_NR - 1);
write = pmd_write(*pmd);
@@ -2001,7 +2054,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
* transferred to avoid any possibility of altering
* permissions across VMAs.
*/
- if (freeze) {
+ if (freeze || pmd_migration) {
swp_entry_t swp_entry;
swp_entry = make_migration_entry(page + i, write);
entry = swp_entry_to_pte(swp_entry);
@@ -2100,7 +2153,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
page = pmd_page(*pmd);
if (PageMlocked(page))
clear_page_mlock(page);
- } else if (!pmd_devmap(*pmd))
+ } else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd)))
goto out;
__split_huge_pmd_locked(vma, pmd, haddr, freeze);
out:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 544d47e5cbbd..8d9e9c13fe4f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4638,6 +4638,11 @@ static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
struct page *page = NULL;
enum mc_target_type ret = MC_TARGET_NONE;
+ if (unlikely(is_swap_pmd(pmd))) {
+ VM_BUG_ON(thp_migration_supported() &&
+ !is_pmd_migration_entry(pmd));
+ return ret;
+ }
page = pmd_page(pmd);
VM_BUG_ON_PAGE(!page || !PageHead(page), page);
if (!(mc.flags & MOVE_ANON))
diff --git a/mm/memory.c b/mm/memory.c
index 0e517be91a89..000d54dc1c68 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1036,7 +1036,8 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
src_pmd = pmd_offset(src_pud, addr);
do {
next = pmd_addr_end(addr, end);
- if (pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)) {
+ if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
+ || pmd_devmap(*src_pmd)) {
int err;
VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, vma);
err = copy_huge_pmd(dst_mm, src_mm,
@@ -1296,7 +1297,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
- if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+ if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE) {
VM_BUG_ON_VMA(vma_is_anonymous(vma) &&
!rwsem_is_locked(&tlb->mm->mmap_sem), vma);
@@ -3804,6 +3805,13 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
pmd_t orig_pmd = *vmf.pmd;
barrier();
+ if (unlikely(is_swap_pmd(orig_pmd))) {
+ VM_BUG_ON(thp_migration_supported() &&
+ !is_pmd_migration_entry(orig_pmd));
+ if (is_pmd_migration_entry(orig_pmd))
+ pmd_migration_entry_wait(mm, vmf.pmd);
+ return 0;
+ }
if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
return do_huge_pmd_numa_page(&vmf, orig_pmd);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 1a8c9ca83e48..d60a1eedcc54 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -148,7 +148,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
unsigned long this_pages;
next = pmd_addr_end(addr, end);
- if (!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
+ if (!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
&& pmd_none_or_clear_bad(pmd))
continue;
@@ -158,7 +158,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
mmu_notifier_invalidate_range_start(mm, mni_start, end);
}
- if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+ if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE) {
__split_huge_pmd(vma, pmd, addr, false, NULL);
} else {
diff --git a/mm/mremap.c b/mm/mremap.c
index cd8a1b199ef9..1c49b9fb994a 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -222,7 +222,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
if (!new_pmd)
break;
- if (pmd_trans_huge(*old_pmd)) {
+ if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd)) {
if (extent == HPAGE_PMD_SIZE) {
bool moved;
/* See comment in move_ptes() */
--
2.11.0
From: Naoya Horiguchi <[email protected]>
The soft dirty bit is designed to be preserved across page migration. This
patch makes it work the same way for thp migration.
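
The hand-off described above can be modelled in user space. This is only a
sketch of the idea, not kernel code: struct sketch_pmd and the sketch_*
helpers are hypothetical names, and the two functions loosely mirror what
set_pmd_migration_entry() and remove_migration_pmd() do with the bit in
this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a PMD value. */
struct sketch_pmd {
	bool present;        /* maps a page (true) or holds a swap entry */
	bool soft_dirty;     /* soft dirty bit of a present PMD */
	bool swp_soft_dirty; /* soft dirty bit stored in a non-present PMD */
};

/* Present PMD -> migration entry: carry soft dirty into the swap format. */
static struct sketch_pmd sketch_set_migration_entry(struct sketch_pmd old)
{
	struct sketch_pmd swp = { .present = false };

	assert(old.present);
	swp.swp_soft_dirty = old.soft_dirty;
	return swp;
}

/* Migration entry -> present PMD: restore soft dirty from the swap format. */
static struct sketch_pmd sketch_remove_migration_entry(struct sketch_pmd swp)
{
	struct sketch_pmd new_pmd = { .present = true };

	assert(!swp.present);
	new_pmd.soft_dirty = swp.swp_soft_dirty;
	return new_pmd;
}
```

A PMD that was soft dirty before migration is soft dirty again afterwards,
and a clean one stays clean.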
---
ChangeLog v1 -> v2:
- separate diff moving _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1
- clear_soft_dirty_pmd can handle migration entry
Signed-off-by: Naoya Horiguchi <[email protected]>
ChangeLog v1 -> v5:
- read soft dirty bit from correct place (*src_pmd) in copy_huge_pmd()
- add missing soft dirty bit transfer in change_huge_pmd()
Signed-off-by: Zi Yan <[email protected]>
---
arch/x86/include/asm/pgtable.h | 17 +++++++++++++++++
fs/proc/task_mmu.c | 27 ++++++++++++++++-----------
include/asm-generic/pgtable.h | 34 +++++++++++++++++++++++++++++++++-
include/linux/swapops.h | 2 ++
mm/huge_memory.c | 27 ++++++++++++++++++++++++---
5 files changed, 92 insertions(+), 15 deletions(-)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 77037b6f1caa..8020cfb9dd5a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1158,6 +1158,23 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
{
return pte_clear_flags(pte, _PAGE_SWP_SOFT_DIRTY);
}
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
+{
+ return pmd_set_flags(pmd, _PAGE_SWP_SOFT_DIRTY);
+}
+
+static inline int pmd_swp_soft_dirty(pmd_t pmd)
+{
+ return pmd_flags(pmd) & _PAGE_SWP_SOFT_DIRTY;
+}
+
+static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
+{
+ return pmd_clear_flags(pmd, _PAGE_SWP_SOFT_DIRTY);
+}
+#endif
#endif
#define PKRU_AD_BIT 0x1
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 0f17a7cccb41..35be35e05153 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -904,17 +904,22 @@ static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
{
pmd_t pmd = *pmdp;
- /* See comment in change_huge_pmd() */
- pmdp_invalidate(vma, addr, pmdp);
- if (pmd_dirty(*pmdp))
- pmd = pmd_mkdirty(pmd);
- if (pmd_young(*pmdp))
- pmd = pmd_mkyoung(pmd);
-
- pmd = pmd_wrprotect(pmd);
- pmd = pmd_clear_soft_dirty(pmd);
-
- set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+ if (pmd_present(pmd)) {
+ /* See comment in change_huge_pmd() */
+ pmdp_invalidate(vma, addr, pmdp);
+ if (pmd_dirty(*pmdp))
+ pmd = pmd_mkdirty(pmd);
+ if (pmd_young(*pmdp))
+ pmd = pmd_mkyoung(pmd);
+
+ pmd = pmd_wrprotect(pmd);
+ pmd = pmd_clear_soft_dirty(pmd);
+
+ set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+ } else if (is_migration_entry(pmd_to_swp_entry(pmd))) {
+ pmd = pmd_swp_clear_soft_dirty(pmd);
+ set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+ }
}
#else
static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 8937d51c2834..745533f2937a 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -618,7 +618,24 @@ static inline void ptep_modify_prot_commit(struct mm_struct *mm,
#define arch_start_context_switch(prev) do {} while (0)
#endif
-#ifndef CONFIG_HAVE_ARCH_SOFT_DIRTY
+#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
+#ifndef CONFIG_ARCH_ENABLE_THP_MIGRATION
+static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
+{
+ return pmd;
+}
+
+static inline int pmd_swp_soft_dirty(pmd_t pmd)
+{
+ return 0;
+}
+
+static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
+{
+ return pmd;
+}
+#endif
+#else /* !CONFIG_HAVE_ARCH_SOFT_DIRTY */
static inline int pte_soft_dirty(pte_t pte)
{
return 0;
@@ -663,6 +680,21 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
{
return pte;
}
+
+static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
+{
+ return pmd;
+}
+
+static inline int pmd_swp_soft_dirty(pmd_t pmd)
+{
+ return 0;
+}
+
+static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
+{
+ return pmd;
+}
#endif
#ifndef __HAVE_PFNMAP_TRACKING
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index c8c6511750f1..acf37fb9136a 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -179,6 +179,8 @@ static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
{
swp_entry_t arch_entry;
+ if (pmd_swp_soft_dirty(pmd))
+ pmd = pmd_swp_clear_soft_dirty(pmd);
arch_entry = __pmd_to_swp_entry(pmd);
return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index dc7830e4993f..a298431d32c8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -923,6 +923,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
if (is_write_migration_entry(entry)) {
make_migration_entry_read(&entry);
pmd = swp_entry_to_pmd(entry);
+ if (pmd_swp_soft_dirty(*src_pmd))
+ pmd = pmd_swp_mksoft_dirty(pmd);
set_pmd_at(src_mm, addr, src_pmd, pmd);
}
set_pmd_at(dst_mm, addr, dst_pmd, pmd);
@@ -1713,6 +1715,17 @@ static inline int pmd_move_must_withdraw(spinlock_t *new_pmd_ptl,
}
#endif
+static pmd_t move_soft_dirty_pmd(pmd_t pmd)
+{
+#ifdef CONFIG_MEM_SOFT_DIRTY
+ if (unlikely(is_pmd_migration_entry(pmd)))
+ pmd = pmd_swp_mksoft_dirty(pmd);
+ else if (pmd_present(pmd))
+ pmd = pmd_mksoft_dirty(pmd);
+#endif
+ return pmd;
+}
+
bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
unsigned long new_addr, unsigned long old_end,
pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
@@ -1755,7 +1768,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
}
- set_pmd_at(mm, new_addr, new_pmd, pmd_mksoft_dirty(pmd));
+ pmd = move_soft_dirty_pmd(pmd);
+ set_pmd_at(mm, new_addr, new_pmd, pmd);
if (new_ptl != old_ptl)
spin_unlock(new_ptl);
if (force_flush)
@@ -1803,6 +1817,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
*/
make_migration_entry_read(&entry);
newpmd = swp_entry_to_pmd(entry);
+ if (pmd_swp_soft_dirty(*pmd))
+ newpmd = pmd_swp_mksoft_dirty(newpmd);
set_pmd_at(mm, addr, pmd, newpmd);
}
goto unlock;
@@ -2773,6 +2789,7 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
unsigned long address = pvmw->address;
pmd_t pmdval;
swp_entry_t entry;
+ pmd_t pmdswp;
if (!(pvmw->pmd && !pvmw->pte))
return;
@@ -2786,8 +2803,10 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
if (pmd_dirty(pmdval))
set_page_dirty(page);
entry = make_migration_entry(page, pmd_write(pmdval));
- pmdval = swp_entry_to_pmd(entry);
- set_pmd_at(mm, address, pvmw->pmd, pmdval);
+ pmdswp = swp_entry_to_pmd(entry);
+ if (pmd_soft_dirty(pmdval))
+ pmdswp = pmd_swp_mksoft_dirty(pmdswp);
+ set_pmd_at(mm, address, pvmw->pmd, pmdswp);
page_remove_rmap(page, true);
put_page(page);
@@ -2810,6 +2829,8 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
entry = pmd_to_swp_entry(*pvmw->pmd);
get_page(new);
pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));
+ if (pmd_swp_soft_dirty(*pvmw->pmd))
+ pmde = pmd_mksoft_dirty(pmde);
if (is_write_migration_entry(entry))
pmde = maybe_pmd_mkwrite(pmde, vma);
--
2.11.0
From: Naoya Horiguchi <[email protected]>
_PAGE_PSE is used to distinguish between a truly non-present
(_PAGE_PRESENT=0) PMD, and a PMD which is undergoing a THP
split and should be treated as present.
But _PAGE_SWP_SOFT_DIRTY currently uses the _PAGE_PSE bit,
which would cause confusion between one of those PMDs
undergoing a THP split, and a soft-dirty PMD.
Dropping the _PAGE_PSE check in pmd_present() does not work well,
because it can hurt the optimization of TLB handling in thp split.
Thus, we need to move the bit.
In the current kernel, bits 1-4 are not used in the non-present format
since commit 00839ee3b299 ("x86/mm: Move swap offset/type up in PTE to
work around erratum"). So let's move _PAGE_SWP_SOFT_DIRTY to bit 1.
Bit 7 is treated as reserved (always clear), so please don't use it for
other purposes.
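
The resulting layout can be checked with a small user-space sketch. The bit
positions are taken from the layout comment updated by this patch; the
SKETCH_* macros and sketch_* helpers are hypothetical names, and
sketch_pmd_present() is a simplified model that assumes pmd_present() tests
P (bit 0), L (bit 7) and G (bit 8).

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BIT_PRESENT      0  /* P: clear in a swap entry */
#define SKETCH_BIT_SWP_SD       1  /* W position, reused for soft dirty */
#define SKETCH_BIT_PSE          7  /* L: must stay clear in a swap entry */
#define SKETCH_BIT_PROTNONE     8  /* G: PROT_NONE marker for !present */
#define SKETCH_TYPE_FIRST_BIT   9
#define SKETCH_TYPE_BITS        5
#define SKETCH_OFFSET_FIRST_BIT 14

#define SKETCH_BIT(n) (UINT64_C(1) << (n))

/* Pack type (bits 9-13) and offset (bits 14-63) into a swap entry value. */
static uint64_t sketch_swp_entry(unsigned type, uint64_t offset)
{
	assert(type < (1u << SKETCH_TYPE_BITS));
	assert(offset < SKETCH_BIT(64 - SKETCH_OFFSET_FIRST_BIT));
	return ((uint64_t)type << SKETCH_TYPE_FIRST_BIT) |
	       (offset << SKETCH_OFFSET_FIRST_BIT);
}

static unsigned sketch_swp_type(uint64_t e)
{
	return (e >> SKETCH_TYPE_FIRST_BIT) & ((1u << SKETCH_TYPE_BITS) - 1);
}

static uint64_t sketch_swp_offset(uint64_t e)
{
	return e >> SKETCH_OFFSET_FIRST_BIT;
}

static uint64_t sketch_swp_mksoft_dirty(uint64_t e)
{
	return e | SKETCH_BIT(SKETCH_BIT_SWP_SD);
}

static uint64_t sketch_swp_clear_soft_dirty(uint64_t e)
{
	return e & ~SKETCH_BIT(SKETCH_BIT_SWP_SD);
}

static int sketch_swp_soft_dirty(uint64_t e)
{
	return (e & SKETCH_BIT(SKETCH_BIT_SWP_SD)) != 0;
}

/* Simplified model of pmd_present(): P, L or G set means "present". */
static int sketch_pmd_present(uint64_t v)
{
	return (v & (SKETCH_BIT(SKETCH_BIT_PRESENT) |
		     SKETCH_BIT(SKETCH_BIT_PSE) |
		     SKETCH_BIT(SKETCH_BIT_PROTNONE))) != 0;
}
```

With soft dirty in bit 1, a soft-dirty swap entry still reads as non-present
and its type and offset are unaffected, whereas setting bit 7 would make the
same value look present in this model.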
Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Zi Yan <[email protected]>
Acked-by: Dave Hansen <[email protected]>
---
arch/x86/include/asm/pgtable_64.h | 12 +++++++++---
arch/x86/include/asm/pgtable_types.h | 10 +++++-----
2 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 2160c1fee920..46bf71c77a25 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -180,15 +180,21 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
/*
* Encode and de-code a swap entry
*
- * | ... | 11| 10| 9|8|7|6|5| 4| 3|2|1|0| <- bit number
- * | ... |SW3|SW2|SW1|G|L|D|A|CD|WT|U|W|P| <- bit names
- * | OFFSET (14->63) | TYPE (9-13) |0|X|X|X| X| X|X|X|0| <- swp entry
+ * | ... | 11| 10| 9|8|7|6|5| 4| 3|2| 1|0| <- bit number
+ * | ... |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names
+ * | OFFSET (14->63) | TYPE (9-13) |0|0|X|X| X| X|X|SD|0| <- swp entry
*
* G (8) is aliased and used as a PROT_NONE indicator for
* !present ptes. We need to start storing swap entries above
* there. We also need to avoid using A and D because of an
* erratum where they can be incorrectly set by hardware on
* non-present PTEs.
+ *
+ * SD (1) in swp entry is used to store soft dirty bit, which helps us
+ * remember soft dirty over page migration
+ *
+ * Bit 7 in swp entry should be 0 because pmd_present checks not only P,
+ * but also L and G.
*/
#define SWP_TYPE_FIRST_BIT (_PAGE_BIT_PROTNONE + 1)
#define SWP_TYPE_BITS 5
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index bf9638e1ee42..c612a8f08422 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -97,15 +97,15 @@
/*
* Tracking soft dirty bit when a page goes to a swap is tricky.
* We need a bit which can be stored in pte _and_ not conflict
- * with swap entry format. On x86 bits 6 and 7 are *not* involved
- * into swap entry computation, but bit 6 is used for nonlinear
- * file mapping, so we borrow bit 7 for soft dirty tracking.
+ * with swap entry format. On x86 bits 1-4 are *not* involved
+ * into swap entry computation, but bit 7 is used for thp migration,
+ * so we borrow bit 1 for soft dirty tracking.
*
* Please note that this bit must be treated as swap dirty page
- * mark if and only if the PTE has present bit clear!
+ * mark if and only if the PTE/PMD has present bit clear!
*/
#ifdef CONFIG_MEM_SOFT_DIRTY
-#define _PAGE_SWP_SOFT_DIRTY _PAGE_PSE
+#define _PAGE_SWP_SOFT_DIRTY _PAGE_RW
#else
#define _PAGE_SWP_SOFT_DIRTY (_AT(pteval_t, 0))
#endif
--
2.11.0
From: Naoya Horiguchi <[email protected]>
TTU_MIGRATION is used to convert ptes into migration entries until thp
split completes. This behavior conflicts with the thp migration added by
later patches, so let's introduce a new TTU flag specifically for freezing.
try_to_unmap() is used both for thp split (via freeze_page()) and page
migration (via __unmap_and_move()). In freeze_page(), ttu_flag given for
head page is like below (assuming anonymous thp):
(TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
TTU_MIGRATION | TTU_SPLIT_HUGE_PMD)
and ttu_flag given for tail pages is:
(TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
TTU_MIGRATION)
__unmap_and_move() calls try_to_unmap() with ttu_flag:
(TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS)
A later patch inserts a branch for thp migration at the top of
try_to_unmap_one(), like this:
static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
unsigned long address, void *arg)
{
...
/* PMD-mapped THP migration entry */
if (!pvmw.pte && (flags & TTU_MIGRATION)) {
if (!PageAnon(page))
continue;
set_pmd_migration_entry(&pvmw, page);
continue;
}
...
}
With that branch, try_to_unmap() for tail pages, called from thp split, can
enter the thp migration code path (which converts a *pmd* into a migration
entry), while the expectation is to freeze the thp (which converts each
*pte* into a migration entry).
I detected this failure as a "bad page state" error in a testcase where
split_huge_page() is called from queue_pages_pte_range().
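
The conflict can be illustrated with a user-space sketch of the flag tests.
Only the 0x80 and 0x100 values appear in this patch; the other flag values
below are assumptions chosen for illustration, and the sketch_* helpers are
hypothetical simplifications of the checks in try_to_unmap_one().

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_ttu_flags {
	SKETCH_TTU_MIGRATION     = 0x1,   /* assumed value */
	SKETCH_TTU_IGNORE_MLOCK  = 0x8,   /* assumed value */
	SKETCH_TTU_IGNORE_ACCESS = 0x10,  /* assumed value */
	SKETCH_TTU_RMAP_LOCKED   = 0x80,  /* value from this patch */
	SKETCH_TTU_SPLIT_FREEZE  = 0x100, /* value from this patch */
};

/* Flags freeze_page() passes for tail pages before this patch. */
static unsigned sketch_tail_flags_old(void)
{
	return SKETCH_TTU_IGNORE_MLOCK | SKETCH_TTU_IGNORE_ACCESS |
	       SKETCH_TTU_RMAP_LOCKED | SKETCH_TTU_MIGRATION;
}

/* Flags freeze_page() passes for tail pages after this patch. */
static unsigned sketch_tail_flags_new(void)
{
	return SKETCH_TTU_IGNORE_MLOCK | SKETCH_TTU_IGNORE_ACCESS |
	       SKETCH_TTU_RMAP_LOCKED | SKETCH_TTU_SPLIT_FREEZE;
}

/* Model of the thp migration branch: PMD-mapped and TTU_MIGRATION set. */
static bool sketch_takes_pmd_migration_path(bool pte_mapped, unsigned flags)
{
	return !pte_mapped && (flags & SKETCH_TTU_MIGRATION) != 0;
}

/* Model of the pte-level branch that installs migration entries. */
static bool sketch_installs_pte_migration_entry(unsigned flags)
{
	return (flags & (SKETCH_TTU_MIGRATION | SKETCH_TTU_SPLIT_FREEZE)) != 0;
}
```

With the old flags a PMD-mapped page would take the thp migration branch
during a split; with TTU_SPLIT_FREEZE it does not, yet the pte-level
conversion still happens.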
Signed-off-by: Naoya Horiguchi <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
---
include/linux/rmap.h | 3 ++-
mm/huge_memory.c | 2 +-
mm/rmap.c | 7 ++++---
3 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 43ef2c30cb0f..f8ca2e74b819 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -93,8 +93,9 @@ enum ttu_flags {
TTU_BATCH_FLUSH = 0x40, /* Batch TLB flushes where possible
* and caller guarantees they will
* do a final flush if necessary */
- TTU_RMAP_LOCKED = 0x80 /* do not grab rmap lock:
+ TTU_RMAP_LOCKED = 0x80, /* do not grab rmap lock:
* caller holds it */
+ TTU_SPLIT_FREEZE = 0x100, /* freeze pte under splitting thp */
};
#ifdef CONFIG_MMU
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 86975dec0ba1..35711b35b067 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2167,7 +2167,7 @@ static void freeze_page(struct page *page)
VM_BUG_ON_PAGE(!PageHead(page), page);
if (PageAnon(page))
- ttu_flags |= TTU_MIGRATION;
+ ttu_flags |= TTU_SPLIT_FREEZE;
unmap_success = try_to_unmap(page, ttu_flags);
VM_BUG_ON_PAGE(!unmap_success, page);
diff --git a/mm/rmap.c b/mm/rmap.c
index ced14f1af6dc..353cb61bb013 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1296,7 +1296,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
if (flags & TTU_SPLIT_HUGE_PMD) {
split_huge_pmd_address(vma, address,
- flags & TTU_MIGRATION, page);
+ flags & TTU_SPLIT_FREEZE, page);
}
while (page_vma_mapped_walk(&pvmw)) {
@@ -1385,7 +1385,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
*/
dec_mm_counter(mm, mm_counter(page));
} else if (IS_ENABLED(CONFIG_MIGRATION) &&
- (flags & TTU_MIGRATION)) {
+ (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
swp_entry_t entry;
pte_t swp_pte;
/*
@@ -1510,7 +1510,8 @@ bool try_to_unmap(struct page *page, enum ttu_flags flags)
* locking requirements of exec(), migration skips
* temporary VMAs until after exec() completes.
*/
- if ((flags & TTU_MIGRATION) && !PageKsm(page) && PageAnon(page))
+ if ((flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))
+ && !PageKsm(page) && PageAnon(page))
rwc.invalid_vma = invalid_migration_vma;
if (flags & TTU_RMAP_LOCKED)
--
2.11.0
From: Naoya Horiguchi <[email protected]>
Introduce CONFIG_ARCH_ENABLE_THP_MIGRATION to limit thp migration
functionality to x86_64, which should be safer as a first step.
ChangeLog v1 -> v2:
- fixed config name in subject and patch description
Signed-off-by: Naoya Horiguchi <[email protected]>
Reviewed-by: Anshuman Khandual <[email protected]>
---
arch/x86/Kconfig | 4 ++++
include/linux/huge_mm.h | 10 ++++++++++
mm/Kconfig | 3 +++
3 files changed, 17 insertions(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 6b8d49ec54be..0bc7a1b84fdc 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2285,6 +2285,10 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
def_bool y
depends on X86_64 && HUGETLB_PAGE && MIGRATION
+config ARCH_ENABLE_THP_MIGRATION
+ def_bool y
+ depends on X86_64 && TRANSPARENT_HUGEPAGE
+
menu "Power management and ACPI options"
config ARCH_HIBERNATION_HEADER
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ee696347f928..d8f35a0865dc 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -233,6 +233,11 @@ void mm_put_huge_zero_page(struct mm_struct *mm);
#define mk_huge_pmd(page, prot) pmd_mkhuge(mk_pmd(page, prot))
+static inline bool thp_migration_supported(void)
+{
+ return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
+}
+
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
@@ -336,6 +341,11 @@ static inline struct page *follow_devmap_pud(struct vm_area_struct *vma,
{
return NULL;
}
+
+static inline bool thp_migration_supported(void)
+{
+ return false;
+}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index e1e1f65e5b91..fd79351c5b44 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -262,6 +262,9 @@ config MIGRATION
config ARCH_ENABLE_HUGEPAGE_MIGRATION
bool
+config ARCH_ENABLE_THP_MIGRATION
+ bool
+
config PHYS_ADDR_T_64BIT
def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
--
2.11.0
From: Zi Yan <[email protected]>
This patch adds the core code for thp migration: conversions between a
PMD entry and a swap entry, setting a PMD migration entry, removing a
PMD migration entry, and waiting on PMD migration entries.
This patch makes it possible to support thp migration.
If allocating the destination page as a thp fails, the source thp is
split as it is now, and migration proceeds through the normal base page
path. If allocating the destination thp succeeds, thp migration is used.
Subsequent patches actually enable thp migration for each caller of
page migration by allowing its get_new_page() callback to
allocate thps.
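
The fallback rule above boils down to a three-way choice. The sketch below
is a user-space model with hypothetical names (enum sketch_path,
sketch_choose_path()); the two booleans stand in for PageTransHuge(page)
and PageTransHuge(newpage) as tested in unmap_and_move().

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_path {
	SKETCH_BASE_PAGE,       /* ordinary base page migration */
	SKETCH_SPLIT_THEN_BASE, /* split the source thp, then base pages */
	SKETCH_THP,             /* migrate the thp as a whole */
};

/* Pick the migration path from the source and destination page kinds. */
static enum sketch_path sketch_choose_path(bool page_is_thp,
					   bool newpage_is_thp)
{
	if (!page_is_thp)
		return SKETCH_BASE_PAGE;
	if (!newpage_is_thp)
		return SKETCH_SPLIT_THEN_BASE; /* destination thp not allocated */
	return SKETCH_THP;
}
```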
ChangeLog v1 -> v2:
- support pte-mapped thp, doubly-mapped thp
Signed-off-by: Naoya Horiguchi <[email protected]>
ChangeLog v2 -> v3:
- use page_vma_mapped_walk()
- use pmdp_huge_clear_flush() instead of pmdp_huge_get_and_clear() in
set_pmd_migration_entry()
ChangeLog v3 -> v4:
- factor out the code of removing pte pgtable page in zap_huge_pmd()
ChangeLog v4 -> v5:
- remove unnecessary PTE-mapped THP code in remove_migration_pmd()
and set_pmd_migration_entry()
- restructure the code in zap_huge_pmd() to avoid factoring out
the pte pgtable page code
- in zap_huge_pmd(), check that PMD swap entries are migration entries
- change author information
ChangeLog v5 -> v7:
- use macro to disable the code when thp migration is not enabled
ChangeLog v7 -> v8:
- use IS_ENABLED instead of macro to make code look clean in
zap_huge_pmd() and page_vma_mapped_walk()
- remove BUILD_BUG() in pmd_to_swp_entry() and swp_entry_to_pmd() to
avoid compilation error
- rename variable 'migration' to 'flush_needed' and invert the logic in
zap_huge_pmd() to make code more descriptive
- use pmdp_invalidate() in set_pmd_migration_entry() to avoid race
with MADV_DONTNEED
- remove unnecessary tlb flush in remove_migration_pmd()
- add the missing migration flag check in page_vma_mapped_walk()
ChangeLog v8 -> v9:
- avoid migrating shmem THPs. It is not supported yet.
- simplify PMD-mapped THP checks in try_to_unmap_one() and
remove_migration_pte().
Signed-off-by: Zi Yan <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/pgtable_64.h | 2 +
include/linux/swapops.h | 67 ++++++++++++++++++++++++++++++-
mm/huge_memory.c | 84 ++++++++++++++++++++++++++++++++++++---
mm/migrate.c | 32 ++++++++++++++-
mm/page_vma_mapped.c | 18 +++++++--
mm/pgtable-generic.c | 3 +-
mm/rmap.c | 13 ++++++
7 files changed, 207 insertions(+), 12 deletions(-)
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 46bf71c77a25..972a4698c530 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -210,7 +210,9 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
((type) << (SWP_TYPE_FIRST_BIT)) \
| ((offset) << SWP_OFFSET_FIRST_BIT) })
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val((pte)) })
+#define __pmd_to_swp_entry(pmd) ((swp_entry_t) { pmd_val((pmd)) })
#define __swp_entry_to_pte(x) ((pte_t) { .pte = (x).val })
+#define __swp_entry_to_pmd(x) ((pmd_t) { .pmd = (x).val })
extern int kern_addr_valid(unsigned long addr);
extern void cleanup_highmap(void);
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index c5ff7b217ee6..c8c6511750f1 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -103,7 +103,8 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
#ifdef CONFIG_MIGRATION
static inline swp_entry_t make_migration_entry(struct page *page, int write)
{
- BUG_ON(!PageLocked(page));
+ BUG_ON(!PageLocked(compound_head(page)));
+
return swp_entry(write ? SWP_MIGRATION_WRITE : SWP_MIGRATION_READ,
page_to_pfn(page));
}
@@ -126,7 +127,7 @@ static inline struct page *migration_entry_to_page(swp_entry_t entry)
* Any use of migration entries may only occur while the
* corresponding page is locked
*/
- BUG_ON(!PageLocked(p));
+ BUG_ON(!PageLocked(compound_head(p)));
return p;
}
@@ -163,6 +164,68 @@ static inline int is_write_migration_entry(swp_entry_t entry)
#endif
+struct page_vma_mapped_walk;
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+extern void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
+ struct page *page);
+
+extern void remove_migration_pmd(struct page_vma_mapped_walk *pvmw,
+ struct page *new);
+
+extern void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd);
+
+static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
+{
+ swp_entry_t arch_entry;
+
+ arch_entry = __pmd_to_swp_entry(pmd);
+ return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
+}
+
+static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
+{
+ swp_entry_t arch_entry;
+
+ arch_entry = __swp_entry(swp_type(entry), swp_offset(entry));
+ return __swp_entry_to_pmd(arch_entry);
+}
+
+static inline int is_pmd_migration_entry(pmd_t pmd)
+{
+ return !pmd_present(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
+}
+#else
+static inline void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
+ struct page *page)
+{
+ BUILD_BUG();
+}
+
+static inline void remove_migration_pmd(struct page_vma_mapped_walk *pvmw,
+ struct page *new)
+{
+ BUILD_BUG();
+}
+
+static inline void pmd_migration_entry_wait(struct mm_struct *m, pmd_t *p) { }
+
+static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
+{
+ return swp_entry(0, 0);
+}
+
+static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
+{
+ return (pmd_t){ 0 };
+}
+
+static inline int is_pmd_migration_entry(pmd_t pmd)
+{
+ return 0;
+}
+#endif
+
#ifdef CONFIG_MEMORY_FAILURE
extern atomic_long_t num_poisoned_pages __read_mostly;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 35711b35b067..9668f8cb8317 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1641,10 +1641,24 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
spin_unlock(ptl);
tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
} else {
- struct page *page = pmd_page(orig_pmd);
- page_remove_rmap(page, true);
- VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
- VM_BUG_ON_PAGE(!PageHead(page), page);
+ struct page *page = NULL;
+ int flush_needed = 1;
+
+ if (pmd_present(orig_pmd)) {
+ page = pmd_page(orig_pmd);
+ page_remove_rmap(page, true);
+ VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
+ VM_BUG_ON_PAGE(!PageHead(page), page);
+ } else if (thp_migration_supported()) {
+ swp_entry_t entry;
+
+ VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
+ entry = pmd_to_swp_entry(orig_pmd);
+ page = pfn_to_page(swp_offset(entry));
+ flush_needed = 0;
+ } else
+ WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
+
if (PageAnon(page)) {
zap_deposited_table(tlb->mm, pmd);
add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
@@ -1653,8 +1667,10 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
zap_deposited_table(tlb->mm, pmd);
add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
}
+
spin_unlock(ptl);
- tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
+ if (flush_needed)
+ tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
}
return 1;
}
@@ -2694,3 +2710,61 @@ static int __init split_huge_pages_debugfs(void)
}
late_initcall(split_huge_pages_debugfs);
#endif
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
+ struct page *page)
+{
+ struct vm_area_struct *vma = pvmw->vma;
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long address = pvmw->address;
+ pmd_t pmdval;
+ swp_entry_t entry;
+
+ if (!(pvmw->pmd && !pvmw->pte))
+ return;
+
+ mmu_notifier_invalidate_range_start(mm, address,
+ address + HPAGE_PMD_SIZE);
+
+ flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
+ pmdval = *pvmw->pmd;
+ pmdp_invalidate(vma, address, pvmw->pmd);
+ if (pmd_dirty(pmdval))
+ set_page_dirty(page);
+ entry = make_migration_entry(page, pmd_write(pmdval));
+ pmdval = swp_entry_to_pmd(entry);
+ set_pmd_at(mm, address, pvmw->pmd, pmdval);
+ page_remove_rmap(page, true);
+ put_page(page);
+
+ mmu_notifier_invalidate_range_end(mm, address,
+ address + HPAGE_PMD_SIZE);
+}
+
+void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
+{
+ struct vm_area_struct *vma = pvmw->vma;
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long address = pvmw->address;
+ unsigned long mmun_start = address & HPAGE_PMD_MASK;
+ pmd_t pmde;
+ swp_entry_t entry;
+
+ if (!(pvmw->pmd && !pvmw->pte))
+ return;
+
+ entry = pmd_to_swp_entry(*pvmw->pmd);
+ get_page(new);
+ pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));
+ if (is_write_migration_entry(entry))
+ pmde = maybe_pmd_mkwrite(pmde, vma);
+
+ flush_cache_range(vma, mmun_start, mmun_start + HPAGE_PMD_SIZE);
+ page_add_anon_rmap(new, vma, mmun_start, true);
+ set_pmd_at(mm, mmun_start, pvmw->pmd, pmde);
+ if (vma->vm_flags & VM_LOCKED)
+ mlock_vma_page(new);
+ update_mmu_cache_pmd(vma, address, pvmw->pmd);
+}
+#endif
diff --git a/mm/migrate.c b/mm/migrate.c
index 627671551873..0e2916f0c201 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -215,6 +215,15 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
new = page - pvmw.page->index +
linear_page_index(vma, pvmw.address);
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+ /* PMD-mapped THP migration entry */
+ if (!pvmw.pte) {
+ VM_BUG_ON_PAGE(PageHuge(page) || !PageTransCompound(page), page);
+ remove_migration_pmd(&pvmw, new);
+ continue;
+ }
+#endif
+
get_page(new);
pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot)));
if (pte_swp_soft_dirty(*pvmw.pte))
@@ -329,6 +338,27 @@ void migration_entry_wait_huge(struct vm_area_struct *vma,
__migration_entry_wait(mm, pte, ptl);
}
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
+{
+ spinlock_t *ptl;
+ struct page *page;
+
+ ptl = pmd_lock(mm, pmd);
+ if (!is_pmd_migration_entry(*pmd))
+ goto unlock;
+ page = migration_entry_to_page(pmd_to_swp_entry(*pmd));
+ if (!get_page_unless_zero(page))
+ goto unlock;
+ spin_unlock(ptl);
+ wait_on_page_locked(page);
+ put_page(page);
+ return;
+unlock:
+ spin_unlock(ptl);
+}
+#endif
+
#ifdef CONFIG_BLOCK
/* Returns true if all buffers are successfully locked */
static bool buffer_migrate_lock_buffers(struct buffer_head *head,
@@ -1087,7 +1117,7 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
goto out;
}
- if (unlikely(PageTransHuge(page))) {
+ if (unlikely(PageTransHuge(page) && !PageTransHuge(newpage))) {
lock_page(page);
rc = split_huge_page(page);
unlock_page(page);
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 8ec6ba230bb9..3bd3008db4cb 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -138,16 +138,28 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
if (!pud_present(*pud))
return false;
pvmw->pmd = pmd_offset(pud, pvmw->address);
- if (pmd_trans_huge(*pvmw->pmd)) {
+ if (pmd_trans_huge(*pvmw->pmd) || is_pmd_migration_entry(*pvmw->pmd)) {
pvmw->ptl = pmd_lock(mm, pvmw->pmd);
- if (!pmd_present(*pvmw->pmd))
- return not_found(pvmw);
if (likely(pmd_trans_huge(*pvmw->pmd))) {
if (pvmw->flags & PVMW_MIGRATION)
return not_found(pvmw);
if (pmd_page(*pvmw->pmd) != page)
return not_found(pvmw);
return true;
+ } else if (!pmd_present(*pvmw->pmd)) {
+ if (thp_migration_supported()) {
+ if (!(pvmw->flags & PVMW_MIGRATION))
+ return not_found(pvmw);
+ if (is_migration_entry(pmd_to_swp_entry(*pvmw->pmd))) {
+ swp_entry_t entry = pmd_to_swp_entry(*pvmw->pmd);
+
+ if (migration_entry_to_page(entry) != page)
+ return not_found(pvmw);
+ return true;
+ }
+ } else
+ WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
+ return not_found(pvmw);
} else {
/* THP pmd was split under us: handle on pte level */
spin_unlock(pvmw->ptl);
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index c99d9512a45b..1175f6a24fdb 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -124,7 +124,8 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
{
pmd_t pmd;
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
- VM_BUG_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
+ VM_BUG_ON((pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
+ !pmd_devmap(*pmdp)) || !pmd_present(*pmdp));
pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
return pmd;
diff --git a/mm/rmap.c b/mm/rmap.c
index 353cb61bb013..f310a45b571c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1300,6 +1300,19 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
}
while (page_vma_mapped_walk(&pvmw)) {
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+ /* PMD-mapped THP migration entry */
+ if (!pvmw.pte && (flags & TTU_MIGRATION)) {
+ VM_BUG_ON_PAGE(PageHuge(page) || !PageTransCompound(page), page);
+
+ if (!PageAnon(page))
+ continue;
+
+ set_pmd_migration_entry(&pvmw, page);
+ continue;
+ }
+#endif
+
/*
* If the page is mlock()d, we cannot swap it out.
* If it's recently referenced (perhaps page_referenced
--
2.11.0
On Mon 17-07-17 15:39:51, Zi Yan wrote:
> From: Zi Yan <[email protected]>
>
> If one of the callers of page migration starts to handle thp,
> memory management code starts to see pmd migration entries, so we need
> to prepare for it before enabling. This patch changes various code
> point which checks the status of given pmds in order to prevent race
> between thp migration and the pmd-related works.
I am sorry to nitpick on the changelog, but the patch is scarily large and
it deserves a much better description. What are those "various code
point" and how do you "prevent race"? How can we double check that none
of them were missed?
> ChangeLog v1 -> v2:
> - introduce pmd_related() (I know the naming is not good, but I can't
> think of a better name. Any suggestion is welcome.)
>
> Signed-off-by: Naoya Horiguchi <[email protected]>
>
> ChangeLog v2 -> v3:
> - add is_swap_pmd()
> - a pmd entry should be pmd pointing to pte pages, is_swap_pmd(),
> pmd_trans_huge(), pmd_devmap(), or pmd_none()
> - pmd_none_or_trans_huge_or_clear_bad() and pmd_trans_unstable() return
> true on pmd_migration_entry, so that migration entries are not
> treated as pmd page table entries.
>
> ChangeLog v4 -> v5:
> - add explanation in pmd_none_or_trans_huge_or_clear_bad() to state
> the equivalence of !pmd_present() and is_pmd_migration_entry()
> - fix migration entry wait deadlock code (from v1) in follow_page_mask()
> - remove unnecessary code (from v1) in follow_trans_huge_pmd()
> - use is_swap_pmd() instead of !pmd_present() for pmd migration entry,
> so it will not be confused with pmd_none()
> - change author information
>
> ChangeLog v5 -> v7
> - use macro to disable the code when thp migration is not enabled
>
> ChangeLog v7 -> v8
> - remove not used code in do_huge_pmd_wp_page()
> - copy the comment from change_pte_range() on downgrading
> write migration entry to read to change_huge_pmd()
>
> ChangeLog v8 -> v9
> - fix VM_BUG_ON()s that trigger false alarms.
>
> Signed-off-by: Zi Yan <[email protected]>
> Cc: Kirill A. Shutemov <[email protected]>
> ---
> fs/proc/task_mmu.c | 32 +++++++++++++--------
> include/asm-generic/pgtable.h | 18 +++++++++++-
> include/linux/huge_mm.h | 14 ++++++++--
> mm/gup.c | 22 +++++++++++++--
> mm/huge_memory.c | 65 +++++++++++++++++++++++++++++++++++++++----
> mm/memcontrol.c | 5 ++++
> mm/memory.c | 12 ++++++--
> mm/mprotect.c | 4 +--
> mm/mremap.c | 2 +-
> 9 files changed, 147 insertions(+), 27 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index b836fd61ed87..0f17a7cccb41 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -596,7 +596,8 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>
> ptl = pmd_trans_huge_lock(pmd, vma);
> if (ptl) {
> - smaps_pmd_entry(pmd, addr, walk);
> + if (pmd_present(*pmd))
> + smaps_pmd_entry(pmd, addr, walk);
> spin_unlock(ptl);
> return 0;
> }
> @@ -938,6 +939,9 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
> goto out;
> }
>
> + if (!pmd_present(*pmd))
> + goto out;
> +
> page = pmd_page(*pmd);
>
> /* Clear accessed and referenced bits. */
> @@ -1217,27 +1221,33 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
> if (ptl) {
> u64 flags = 0, frame = 0;
> pmd_t pmd = *pmdp;
> + struct page *page = NULL;
>
> if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(pmd))
> flags |= PM_SOFT_DIRTY;
>
> - /*
> - * Currently pmd for thp is always present because thp
> - * can not be swapped-out, migrated, or HWPOISONed
> - * (split in such cases instead.)
> - * This if-check is just to prepare for future implementation.
> - */
> if (pmd_present(pmd)) {
> - struct page *page = pmd_page(pmd);
> -
> - if (page_mapcount(page) == 1)
> - flags |= PM_MMAP_EXCLUSIVE;
> + page = pmd_page(pmd);
>
> flags |= PM_PRESENT;
> if (pm->show_pfn)
> frame = pmd_pfn(pmd) +
> ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> }
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> + else if (is_swap_pmd(pmd)) {
> + swp_entry_t entry = pmd_to_swp_entry(pmd);
> +
> + frame = swp_type(entry) |
> + (swp_offset(entry) << MAX_SWAPFILES_SHIFT);
> + flags |= PM_SWAP;
> + VM_BUG_ON(!is_pmd_migration_entry(pmd));
> + page = migration_entry_to_page(entry);
> + }
> +#endif
> +
> + if (page && page_mapcount(page) == 1)
> + flags |= PM_MMAP_EXCLUSIVE;
>
> for (; addr != end; addr += PAGE_SIZE) {
> pagemap_entry_t pme = make_pme(frame, flags);
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 7dfa767dc680..8937d51c2834 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -834,7 +834,23 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> barrier();
> #endif
> - if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
> + /*
> + * !pmd_present() checks for pmd migration entries
> + *
> + * The complete check uses is_pmd_migration_entry() in linux/swapops.h
> + * But using that requires moving current function and pmd_trans_unstable()
> + * to linux/swapops.h to resolve dependency, which is too much code move.
> + *
> + * !pmd_present() is equivalent to is_pmd_migration_entry() currently,
> + * because !pmd_present() pages can only be under migration not swapped
> + * out.
> + *
> + * pmd_none() is preserved for future condition checks on pmd migration
> + * entries and not confusing with this function name, although it is
> + * redundant with !pmd_present().
> + */
> + if (pmd_none(pmdval) || pmd_trans_huge(pmdval) ||
> + (IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION) && !pmd_present(pmdval)))
> return 1;
> if (unlikely(pmd_bad(pmdval))) {
> pmd_clear_bad(pmd);
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index d8f35a0865dc..14bc21c2ee7f 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -147,7 +147,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> #define split_huge_pmd(__vma, __pmd, __address) \
> do { \
> pmd_t *____pmd = (__pmd); \
> - if (pmd_trans_huge(*____pmd) \
> + if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd) \
> || pmd_devmap(*____pmd)) \
> __split_huge_pmd(__vma, __pmd, __address, \
> false, NULL); \
> @@ -178,12 +178,18 @@ extern spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd,
> struct vm_area_struct *vma);
> extern spinlock_t *__pud_trans_huge_lock(pud_t *pud,
> struct vm_area_struct *vma);
> +
> +static inline int is_swap_pmd(pmd_t pmd)
> +{
> + return !pmd_none(pmd) && !pmd_present(pmd);
> +}
> +
> /* mmap_sem must be held on entry */
> static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
> struct vm_area_struct *vma)
> {
> VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma);
> - if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
> + if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
> return __pmd_trans_huge_lock(pmd, vma);
> else
> return NULL;
> @@ -299,6 +305,10 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
> long adjust_next)
> {
> }
> +static inline int is_swap_pmd(pmd_t pmd)
> +{
> + return 0;
> +}
> static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
> struct vm_area_struct *vma)
> {
> diff --git a/mm/gup.c b/mm/gup.c
> index 23f01c40c88f..d81dd886d5d9 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -234,6 +234,16 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
> return page;
> return no_page_table(vma, flags);
> }
> +retry:
> + if (!pmd_present(*pmd)) {
> + if (likely(!(flags & FOLL_MIGRATION)))
> + return no_page_table(vma, flags);
> + VM_BUG_ON(thp_migration_supported() &&
> + !is_pmd_migration_entry(*pmd));
> + if (is_pmd_migration_entry(*pmd))
> + pmd_migration_entry_wait(mm, pmd);
> + goto retry;
> + }
> if (pmd_devmap(*pmd)) {
> ptl = pmd_lock(mm, pmd);
> page = follow_devmap_pmd(vma, address, pmd, flags);
> @@ -247,7 +257,15 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
> if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
> return no_page_table(vma, flags);
>
> +retry_locked:
> ptl = pmd_lock(mm, pmd);
> + if (unlikely(!pmd_present(*pmd))) {
> + spin_unlock(ptl);
> + if (likely(!(flags & FOLL_MIGRATION)))
> + return no_page_table(vma, flags);
> + pmd_migration_entry_wait(mm, pmd);
> + goto retry_locked;
> + }
> if (unlikely(!pmd_trans_huge(*pmd))) {
> spin_unlock(ptl);
> return follow_page_pte(vma, address, pmd, flags);
> @@ -424,7 +442,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
> pud = pud_offset(p4d, address);
> BUG_ON(pud_none(*pud));
> pmd = pmd_offset(pud, address);
> - if (pmd_none(*pmd))
> + if (!pmd_present(*pmd))
> return -EFAULT;
> VM_BUG_ON(pmd_trans_huge(*pmd));
> pte = pte_offset_map(pmd, address);
> @@ -1534,7 +1552,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> pmd_t pmd = READ_ONCE(*pmdp);
>
> next = pmd_addr_end(addr, end);
> - if (pmd_none(pmd))
> + if (!pmd_present(pmd))
> return 0;
>
> if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9668f8cb8317..dc7830e4993f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -914,6 +914,23 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>
> ret = -EAGAIN;
> pmd = *src_pmd;
> +
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> + if (unlikely(is_swap_pmd(pmd))) {
> + swp_entry_t entry = pmd_to_swp_entry(pmd);
> +
> + VM_BUG_ON(!is_pmd_migration_entry(pmd));
> + if (is_write_migration_entry(entry)) {
> + make_migration_entry_read(&entry);
> + pmd = swp_entry_to_pmd(entry);
> + set_pmd_at(src_mm, addr, src_pmd, pmd);
> + }
> + set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> + ret = 0;
> + goto out_unlock;
> + }
> +#endif
> +
> if (unlikely(!pmd_trans_huge(pmd))) {
> pte_free(dst_mm, pgtable);
> goto out_unlock;
> @@ -1556,6 +1573,12 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> if (is_huge_zero_pmd(orig_pmd))
> goto out;
>
> + if (unlikely(!pmd_present(orig_pmd))) {
> + VM_BUG_ON(thp_migration_supported() &&
> + !is_pmd_migration_entry(orig_pmd));
> + goto out;
> + }
> +
> page = pmd_page(orig_pmd);
> /*
> * If other processes are mapping this page, we couldn't discard
> @@ -1767,6 +1790,25 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> preserve_write = prot_numa && pmd_write(*pmd);
> ret = 1;
>
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> + if (is_swap_pmd(*pmd)) {
> + swp_entry_t entry = pmd_to_swp_entry(*pmd);
> +
> + VM_BUG_ON(!is_pmd_migration_entry(*pmd));
> + if (is_write_migration_entry(entry)) {
> + pmd_t newpmd;
> + /*
> + * A protection check is difficult so
> + * just be safe and disable write
> + */
> + make_migration_entry_read(&entry);
> + newpmd = swp_entry_to_pmd(entry);
> + set_pmd_at(mm, addr, pmd, newpmd);
> + }
> + goto unlock;
> + }
> +#endif
> +
> /*
> * Avoid trapping faults against the zero page. The read-only
> * data is likely to be read-cached on the local CPU and
> @@ -1832,7 +1874,8 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
> {
> spinlock_t *ptl;
> ptl = pmd_lock(vma->vm_mm, pmd);
> - if (likely(pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
> + if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
> + pmd_devmap(*pmd)))
> return ptl;
> spin_unlock(ptl);
> return NULL;
> @@ -1950,14 +1993,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> struct page *page;
> pgtable_t pgtable;
> pmd_t _pmd;
> - bool young, write, dirty, soft_dirty;
> + bool young, write, dirty, soft_dirty, pmd_migration = false;
> unsigned long addr;
> int i;
>
> VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
> VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
> VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
> - VM_BUG_ON(!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd));
> + VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
> + && !pmd_devmap(*pmd));
>
> count_vm_event(THP_SPLIT_PMD);
>
> @@ -1982,7 +2026,16 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> return __split_huge_zero_page_pmd(vma, haddr, pmd);
> }
>
> - page = pmd_page(*pmd);
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> + pmd_migration = is_pmd_migration_entry(*pmd);
> + if (pmd_migration) {
> + swp_entry_t entry;
> +
> + entry = pmd_to_swp_entry(*pmd);
> + page = pfn_to_page(swp_offset(entry));
> + } else
> +#endif
> + page = pmd_page(*pmd);
> VM_BUG_ON_PAGE(!page_count(page), page);
> page_ref_add(page, HPAGE_PMD_NR - 1);
> write = pmd_write(*pmd);
> @@ -2001,7 +2054,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> * transferred to avoid any possibility of altering
> * permissions across VMAs.
> */
> - if (freeze) {
> + if (freeze || pmd_migration) {
> swp_entry_t swp_entry;
> swp_entry = make_migration_entry(page + i, write);
> entry = swp_entry_to_pte(swp_entry);
> @@ -2100,7 +2153,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> page = pmd_page(*pmd);
> if (PageMlocked(page))
> clear_page_mlock(page);
> - } else if (!pmd_devmap(*pmd))
> + } else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd)))
> goto out;
> __split_huge_pmd_locked(vma, pmd, haddr, freeze);
> out:
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 544d47e5cbbd..8d9e9c13fe4f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4638,6 +4638,11 @@ static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
> struct page *page = NULL;
> enum mc_target_type ret = MC_TARGET_NONE;
>
> + if (unlikely(is_swap_pmd(pmd))) {
> + VM_BUG_ON(thp_migration_supported() &&
> + !is_pmd_migration_entry(pmd));
> + return ret;
> + }
> page = pmd_page(pmd);
> VM_BUG_ON_PAGE(!page || !PageHead(page), page);
> if (!(mc.flags & MOVE_ANON))
> diff --git a/mm/memory.c b/mm/memory.c
> index 0e517be91a89..000d54dc1c68 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1036,7 +1036,8 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
> src_pmd = pmd_offset(src_pud, addr);
> do {
> next = pmd_addr_end(addr, end);
> - if (pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)) {
> + if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
> + || pmd_devmap(*src_pmd)) {
> int err;
> VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, vma);
> err = copy_huge_pmd(dst_mm, src_mm,
> @@ -1296,7 +1297,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
> pmd = pmd_offset(pud, addr);
> do {
> next = pmd_addr_end(addr, end);
> - if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
> + if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
> if (next - addr != HPAGE_PMD_SIZE) {
> VM_BUG_ON_VMA(vma_is_anonymous(vma) &&
> !rwsem_is_locked(&tlb->mm->mmap_sem), vma);
> @@ -3804,6 +3805,13 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> pmd_t orig_pmd = *vmf.pmd;
>
> barrier();
> + if (unlikely(is_swap_pmd(orig_pmd))) {
> + VM_BUG_ON(thp_migration_supported() &&
> + !is_pmd_migration_entry(orig_pmd));
> + if (is_pmd_migration_entry(orig_pmd))
> + pmd_migration_entry_wait(mm, vmf.pmd);
> + return 0;
> + }
> if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
> if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
> return do_huge_pmd_numa_page(&vmf, orig_pmd);
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 1a8c9ca83e48..d60a1eedcc54 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -148,7 +148,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> unsigned long this_pages;
>
> next = pmd_addr_end(addr, end);
> - if (!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
> + if (!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
> && pmd_none_or_clear_bad(pmd))
> continue;
>
> @@ -158,7 +158,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> mmu_notifier_invalidate_range_start(mm, mni_start, end);
> }
>
> - if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
> + if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
> if (next - addr != HPAGE_PMD_SIZE) {
> __split_huge_pmd(vma, pmd, addr, false, NULL);
> } else {
> diff --git a/mm/mremap.c b/mm/mremap.c
> index cd8a1b199ef9..1c49b9fb994a 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -222,7 +222,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
> if (!new_pmd)
> break;
> - if (pmd_trans_huge(*old_pmd)) {
> + if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd)) {
> if (extent == HPAGE_PMD_SIZE) {
> bool moved;
> /* See comment in move_ptes() */
> --
> 2.11.0
--
Michal Hocko
SUSE Labs
Hi Zi,
[auto build test WARNING on mmotm/master]
[also build test WARNING on v4.13-rc1 next-20170718]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Zi-Yan/mm-page-migration-enhancement-for-thp/20170718-095519
base: git://git.cmpxchg.org/linux-mmotm.git master
config: xtensa-common_defconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=xtensa
All warnings (new ones prefixed by >>):
In file included from mm/vmscan.c:55:0:
include/linux/swapops.h: In function 'swp_entry_to_pmd':
>> include/linux/swapops.h:220:2: warning: missing braces around initializer [-Wmissing-braces]
return (pmd_t){ 0 };
^
include/linux/swapops.h:220:2: warning: (near initialization for '(anonymous).pud') [-Wmissing-braces]
vim +220 include/linux/swapops.h
217
218 static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
219 {
> 220 return (pmd_t){ 0 };
221 }
222
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
On 19 Jul 2017, at 4:02, Michal Hocko wrote:
> On Mon 17-07-17 15:39:51, Zi Yan wrote:
>> From: Zi Yan <[email protected]>
>>
>> If one of the callers of page migration starts to handle thp,
>> memory management code starts to see pmd migration entries, so we need
>> to prepare for it before enabling. This patch changes various code
>> point which checks the status of given pmds in order to prevent race
>> between thp migration and the pmd-related works.
>
> I am sorry to nitpick on the changelog, but the patch is scarily large and
> it deserves a much better description. What are those "various code
> point" and how do you "prevent race"? How can we double check that none
> of them were missed?
Thanks for pointing this out.
Let me know if the following new description looks good to you:
When THP migration is in use, memory management code needs to handle
pmd migration entries properly. This patch uses !pmd_present() or is_swap_pmd()
(depending on whether pmd_none() needs separate handling) to
check for pmd migration entries at the places where a pmd entry is examined.
Since pmd-related code uses split_huge_page(), split_huge_pmd(), pmd_trans_huge(),
pmd_trans_unstable(), or pmd_none_or_trans_huge_or_clear_bad(),
this patch:
1. adds pmd migration entry split code to split_huge_pmd(),
2. handles pmd migration entries wherever pmd_trans_huge() is checked,
3. makes pmd_none_or_trans_huge_or_clear_bad() pmd migration entry aware.
Since split_huge_page() uses split_huge_pmd() and pmd_trans_unstable() is equivalent
to pmd_none_or_trans_huge_or_clear_bad(), they need no changes.
As of this commit, a pmd entry should be one of:
1. pointing to a pte page,
2. is_swap_pmd(),
3. pmd_trans_huge(),
4. pmd_devmap(), or
5. pmd_none().
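To make the five cases above easier to check, here is a minimal userspace sketch of the classification. Only the is_swap_pmd() predicate (!pmd_none() && !pmd_present()) is taken from the patch. The toy_* names and the bit layout are made up for illustration and do not match any architecture's real pmd format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy pmd: NOT the kernel's layout, the bits below are invented. */
typedef struct { uint64_t val; } toy_pmd_t;

#define TOY_PRESENT (1ull << 0) /* entry currently maps something */
#define TOY_HUGE    (1ull << 1) /* present entry maps a huge page */
#define TOY_DEVMAP  (1ull << 2) /* present entry maps device memory */

enum toy_pmd_kind {
	TOY_PMD_PTE_TABLE,  /* 1. pointing to a pte page */
	TOY_PMD_SWAP,       /* 2. is_swap_pmd(), i.e. a migration entry */
	TOY_PMD_TRANS_HUGE, /* 3. pmd_trans_huge() */
	TOY_PMD_DEVMAP,     /* 4. pmd_devmap() */
	TOY_PMD_NONE,       /* 5. pmd_none() */
};

static inline bool toy_pmd_none(toy_pmd_t pmd)
{
	return pmd.val == 0;
}

static inline bool toy_pmd_present(toy_pmd_t pmd)
{
	return (pmd.val & TOY_PRESENT) != 0;
}

/* Same logic as is_swap_pmd() in the patch: non-empty but not present. */
static inline bool toy_is_swap_pmd(toy_pmd_t pmd)
{
	return !toy_pmd_none(pmd) && !toy_pmd_present(pmd);
}

static inline enum toy_pmd_kind toy_classify(toy_pmd_t pmd)
{
	if (toy_pmd_none(pmd))
		return TOY_PMD_NONE;
	if (toy_is_swap_pmd(pmd))
		return TOY_PMD_SWAP;
	if (pmd.val & TOY_DEVMAP)
		return TOY_PMD_DEVMAP;
	if (pmd.val & TOY_HUGE)
		return TOY_PMD_TRANS_HUGE;
	return TOY_PMD_PTE_TABLE;
}

/* Every non-empty, non-present value must land in the swap bucket. */
static inline void toy_self_check(void)
{
	toy_pmd_t migration_like = { 0x100 }; /* non-zero, PRESENT clear */

	assert(toy_is_swap_pmd(migration_like));
	assert(toy_classify(migration_like) == TOY_PMD_SWAP);
}
```

The point of the sketch is that pmd_none() has to be tested before !pmd_present(); otherwise an empty entry would be mistaken for a migration entry, which is why is_swap_pmd() exists as a separate helper.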
--
Best Regards
Yan Zi
On 19 Jul 2017, at 4:04, kbuild test robot wrote:
> Hi Zi,
>
> [auto build test WARNING on mmotm/master]
> [also build test WARNING on v4.13-rc1 next-20170718]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url: https://github.com/0day-ci/linux/commits/Zi-Yan/mm-page-migration-enhancement-for-thp/20170718-095519
> base: git://git.cmpxchg.org/linux-mmotm.git master
> config: xtensa-common_defconfig (attached as .config)
> compiler: xtensa-linux-gcc (GCC) 4.9.0
> reproduce:
> wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> make.cross ARCH=xtensa
>
> All warnings (new ones prefixed by >>):
>
> In file included from mm/vmscan.c:55:0:
> include/linux/swapops.h: In function 'swp_entry_to_pmd':
>>> include/linux/swapops.h:220:2: warning: missing braces around initializer [-Wmissing-braces]
> return (pmd_t){ 0 };
> ^
> include/linux/swapops.h:220:2: warning: (near initialization for '(anonymous).pud') [-Wmissing-braces]
>
> vim +220 include/linux/swapops.h
>
> 217
> 218 static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
> 219 {
>> 220 return (pmd_t){ 0 };
> 221 }
> 222
It is a GCC 4.9.0 bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53119
Upgrading GCC can get rid of this warning.
--
Best Regards
Yan Zi
On Wed, 19 Jul 2017 14:39:43 -0400 "Zi Yan" <[email protected]> wrote:
> On 19 Jul 2017, at 4:04, kbuild test robot wrote:
>
> > Hi Zi,
> >
> > [auto build test WARNING on mmotm/master]
> > [also build test WARNING on v4.13-rc1 next-20170718]
> > [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> >
> > url: https://github.com/0day-ci/linux/commits/Zi-Yan/mm-page-migration-enhancement-for-thp/20170718-095519
> > base: git://git.cmpxchg.org/linux-mmotm.git master
> > config: xtensa-common_defconfig (attached as .config)
> > compiler: xtensa-linux-gcc (GCC) 4.9.0
> > reproduce:
> > wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
> > chmod +x ~/bin/make.cross
> > # save the attached .config to linux build tree
> > make.cross ARCH=xtensa
> >
> > All warnings (new ones prefixed by >>):
> >
> > In file included from mm/vmscan.c:55:0:
> > include/linux/swapops.h: In function 'swp_entry_to_pmd':
> >>> include/linux/swapops.h:220:2: warning: missing braces around initializer [-Wmissing-braces]
> > return (pmd_t){ 0 };
> > ^
> > include/linux/swapops.h:220:2: warning: (near initialization for '(anonymous).pud') [-Wmissing-braces]
> >
> > vim +220 include/linux/swapops.h
> >
> > 217
> > 218 static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
> > 219 {
> >> 220 return (pmd_t){ 0 };
> > 221 }
> > 222
>
> It is a GCC 4.9.0 bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53119
>
> Upgrading GCC can get rid of this warning.
I think there was a workaround for this, but I don't recall what it
was.
This suppressed the warning:
--- a/include/linux/swapops.h~a
+++ a/include/linux/swapops.h
@@ -217,7 +217,7 @@ static inline swp_entry_t pmd_to_swp_ent
static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
{
- return (pmd_t){ 0 };
+ return (pmd_t){};
}
static inline int is_pmd_migration_entry(pmd_t pmd)
But I don't know if this is the approved workaround and I don't know
what it will do at runtime!
But we should fix this. Expecting zillions of people to update their
compiler version isn't nice.
On 19 Jul 2017, at 16:59, Andrew Morton wrote:
> On Wed, 19 Jul 2017 14:39:43 -0400 "Zi Yan" <[email protected]> wrote:
>
>> On 19 Jul 2017, at 4:04, kbuild test robot wrote:
>>
>>> Hi Zi,
>>>
>>> [auto build test WARNING on mmotm/master]
>>> [also build test WARNING on v4.13-rc1 next-20170718]
>>> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>>>
>>> url: https://github.com/0day-ci/linux/commits/Zi-Yan/mm-page-migration-enhancement-for-thp/20170718-095519
>>> base: git://git.cmpxchg.org/linux-mmotm.git master
>>> config: xtensa-common_defconfig (attached as .config)
>>> compiler: xtensa-linux-gcc (GCC) 4.9.0
>>> reproduce:
>>> wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>>> chmod +x ~/bin/make.cross
>>> # save the attached .config to linux build tree
>>> make.cross ARCH=xtensa
>>>
>>> All warnings (new ones prefixed by >>):
>>>
>>> In file included from mm/vmscan.c:55:0:
>>> include/linux/swapops.h: In function 'swp_entry_to_pmd':
>>>>> include/linux/swapops.h:220:2: warning: missing braces around initializer [-Wmissing-braces]
>>> return (pmd_t){ 0 };
>>> ^
>>> include/linux/swapops.h:220:2: warning: (near initialization for '(anonymous).pud') [-Wmissing-braces]
>>>
>>> vim +220 include/linux/swapops.h
>>>
>>> 217
>>> 218 static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
>>> 219 {
>>>> 220 return (pmd_t){ 0 };
>>> 221 }
>>> 222
>>
>> It is a GCC 4.9.0 bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53119
>>
>> Upgrading GCC can get rid of this warning.
>
> I think there was a workaround for this, but I don't recall what it
> was.
>
> This suppressed the warning:
>
> --- a/include/linux/swapops.h~a
> +++ a/include/linux/swapops.h
> @@ -217,7 +217,7 @@ static inline swp_entry_t pmd_to_swp_ent
>
> static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
> {
> - return (pmd_t){ 0 };
> + return (pmd_t){};
> }
>
> static inline int is_pmd_migration_entry(pmd_t pmd)
>
> But I don't know if this is the approved workaround and I don't know
> what it will do at runtime!
>
> But we should fix this. Expecting zillions of people to update their
> compiler version isn't nice.
How about this one?
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -219,7 +219,7 @@ static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
{
- return (pmd_t){ 0 };
+ return __pmd(0);
}
static inline int is_pmd_migration_entry(pmd_t pmd)
No warning or error was present during i386 kernel compilations
with gcc-4.9.3 or gcc-6.4.0. i386 gcc should share the same front-end
as xtensa-linux-gcc.
__pmd() should be the standard way of making pmd entries, right?
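For anyone who wants to see the difference outside the kernel tree, here is a minimal userspace sketch. It assumes pmd_t is a struct nested inside other structs, as the "(anonymous).pud" in the warning suggests for this xtensa config. The toy_* types and macros are hypothetical stand-ins, not the real definitions.

```c
#include <assert.h>

/* Toy stand-ins for folded page-table levels; names are hypothetical. */
typedef struct { unsigned long pgd; } toy_pgd_t;
typedef struct { toy_pgd_t pgd; } toy_pud_t;
typedef struct { toy_pud_t pud; } toy_pmd_t;

/*
 * Constructor macros in the spirit of __pmd(): each level is built from
 * a value of the level below, so no brace level is left implicit.
 */
#define toy_pgd(x) ((toy_pgd_t){ (x) })
#define toy_pud(x) ((toy_pud_t){ toy_pgd(x) })
#define toy_pmd(x) ((toy_pmd_t){ toy_pud(x) })

static inline unsigned long toy_pmd_val(toy_pmd_t pmd)
{
	return pmd.pud.pgd.pgd;
}

/*
 * Stub shaped like the swp_entry_to_pmd() fallback quoted above.
 * Writing (toy_pmd_t){ 0 } here would initialize a nested struct from a
 * bare scalar, which is the pattern -Wmissing-braces complains about on
 * the older compiler; toy_pmd(0) avoids that.
 */
static inline toy_pmd_t toy_swp_entry_to_pmd(unsigned long entry)
{
	(void)entry; /* unused in the stub, as in the original */
	return toy_pmd(0);
}

/* The stub must still produce an all-zero entry. */
static inline void toy_stub_check(void)
{
	assert(toy_pmd_val(toy_swp_entry_to_pmd(42)) == 0);
}
```

The constructor form also keeps the stub independent of how many levels are folded on a given architecture, since the nesting is handled in one place by the macros.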
--
Best Regards
Yan Zi