2017-07-01 13:40:56

by Zi Yan

Subject: [PATCH v8 00/10] mm: page migration enhancement for thp

From: Zi Yan <[email protected]>

Hi all,

The patches are rebased on mmotm-2017-06-29-16-41 and incorporate the feedback
on the v7 patches.

Hi Kirill, I have added the changes you suggested to Patch 5 and Patch 6,
please let me know if you are OK with them.

Patch 1 factors out common code. It could be picked up easily.
Patch 2 moves _PAGE_SWP_SOFT_DIRTY bit to prepare for THP migration.
Patch 3 adds a new TTU flag to avoid the conflict between TTU_MIGRATION and THP migration.
Patches 4-6 are the core part of THP migration.
Patch 7 adds soft dirty bit support to THP migration.
Patches 8-10 enable THP migration in various locations in the kernel.

Please review, give comments, and consider applying the patches. Thanks.


Motivations
===========================================
1. THP migration becomes important for upcoming heterogeneous memory systems.

As David Nellans from NVIDIA pointed out in other threads
(http://www.mail-archive.com/[email protected]/msg1349227.html),
future GPUs or other accelerators will have their memory managed by operating
systems. Moving data into and out of these memory nodes efficiently is critical
to applications that use GPUs or other accelerators. Existing page migration
only supports base pages, which results in very low memory bandwidth utilization.
My experiments (see below) show that THP migration can migrate pages more efficiently.

2. Base page migration vs THP migration throughput.

Here are cross-socket page migration results from calling the
move_pages() syscall:

On x86_64, an Intel two-socket E5-2640v3 box,
single 4KB base page migration takes 62.47 us, using 0.06 GB/s BW,
single 2MB THP migration takes 658.54 us, using 2.97 GB/s BW,
512 4KB base page migration takes 1987.38 us, using 0.98 GB/s BW.

On ppc64, a two-socket Power8 box,
single 64KB base page migration takes 49.3 us, using 1.24 GB/s BW,
single 16MB THP migration takes 2202.17 us, using 7.10 GB/s BW,
256 64KB base page migration takes 2543.65 us, using 6.14 GB/s BW.

THP migration gives 3x and 1.15x higher throughput than base page migration
on x86_64 and ppc64, respectively.

You can test it out by using the code here:
https://github.com/x-y-z/thp-migration-bench
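
For reference, below is a minimal userspace sketch of the kind of cross-node
migration the benchmark performs. It is only an illustration, not the benchmark
code itself; the destination node number and the 2MB size are assumptions, and
error handling is mostly omitted. It allocates a 2MB-aligned MADV_HUGEPAGE
region, faults it in, then asks move_pages(2) to move it to another node; with
this patchset the whole 2MB unit can be migrated without splitting.

	#include <numaif.h>	/* move_pages(), MPOL_MF_MOVE; link with -lnuma */
	#include <sys/mman.h>	/* madvise(), MADV_HUGEPAGE */
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	#define THP_SIZE (2UL << 20)	/* x86_64 PMD size */

	int main(void)
	{
		void *buf;
		int node = 1;		/* assumed destination NUMA node */
		int status = -1;

		if (posix_memalign(&buf, THP_SIZE, THP_SIZE))
			return 1;
		madvise(buf, THP_SIZE, MADV_HUGEPAGE);
		memset(buf, 0, THP_SIZE);	/* fault the (hopefully huge) page in */

		/* one entry per 2MB THP; status reports the resulting node */
		if (move_pages(0, 1, &buf, &node, &status, MPOL_MF_MOVE))
			perror("move_pages");
		printf("page is now on node %d\n", status);
		return 0;
	}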

3. Existing page migration splits a THP before migration and cannot guarantee
that the migrated pages are still contiguous. Contiguity is exactly what GPUs and
accelerators look for. Without THP migration, khugepaged needs to do extra work
to reassemble the migrated pages back into THPs.

ChangeLog
===========================================

Changes since v7:
* Remove BUILD_BUG() in pmd_to_swp_entry() and swp_entry_to_pmd() to allow
replacing macro with IS_ENABLED at several code chunks. This makes them
easy to read.
* Rename variable 'migration' to 'flush_needed' for better understanding.
* Use pmdp_invalidate() to avoid race with MADV_DONTNEED.
* Remove unnecessary tlb flush in remove_migration_pmd().
* Add the missing migration flag check in page_vma_mapped_walk().
* Remove not used code in do_huge_pmd_wp_page().
* Add migration entry permission change comment to change_huge_pmd()
to avoid confusion.

Changes since v6:
* Fix the kbuild bot warning in swp_entry_to_pmd().
* Add macro to disable the code when thp migration is not enabled. This fixes
the kbuild bot errors while building kernels without THP migration enabled.
* In memory hotremove, move THP allocation code from new_node_page() to
new_page_nodemask(). This follows the patch ("mm: unify new_node_page and
alloc_migrate_target") in latest mmotm.

Changes since v5:
* THP migration support for soft-offline patch is dropped, because it needs
more discussion. I will send it separately.
* Better commit message in Patch 2 (on moving _PAGE_SWP_SOFT_DIRTY bit),
thanks to Dave Hansen's help.

Changes since v4:
* In Patch 5, I dropped PTE-mapped THP migration handling code, since it is
already well handled by existing code.

* In Patch 6, I did a thorough check on PMD handling places and corrected all
errors I discovered.

* In Patch 6, I use is_swap_pmd() to check PMD migration entries and add
VM_BUG_ON to make sure only migration entries are present. It should be useful
later when someone wants to add PMD swap entries, since the VM_BUG_ON will
catch the missing code path.

* In Patch 6, I keep pmd_none() in pmd_none_or_trans_huge_or_clear_bad() to
avoid confusion on the function name. I also add a comment to explain it.

* In Patches 7-11, I added some missing soft dirty bit preserving code and
corrected page statistics counting.

Changes since v3:

* I dropped my fix on zap_pmd_range() since THP migration will not trigger
it and Kirill has posted patches to fix the bug triggered by MADV_DONTNEED.

* In Patch 6, I used !pmd_present() instead of is_pmd_migration_entry()
in pmd_none_or_trans_huge_or_clear_bad() to avoid moving the function to
linux/swapops.h. Currently, !pmd_present() is equivalent to
is_pmd_migration_entry(). Any suggestion is welcome to this change.

Changes since v2:

* I fix a bug in zap_pmd_range() and include the fixes in Patches 1-3.
The racy check in zap_pmd_range() can miss pmd_protnone and pmd_migration_entry,
which leads to the PTE page table not being freed.

* In Patch 4, I move _PAGE_SWP_SOFT_DIRTY to bit 1, because bit 6 (used in v2)
can be set by some CPUs by mistake and the new swap entry format does not use
bits 1-4.

* I also adjust two core migration functions, set_pmd_migration_entry() and
remove_migration_pmd(), to use Kirill A. Shutemov's page_vma_mapped_walk()
function. Patch 8 needs Kirill's comments, since I also add pmd_migration_entry
handling to his page_vma_mapped_walk() function.

* In Patch 8, I replace pmdp_huge_get_and_clear() with pmdp_huge_clear_flush()
in set_pmd_migration_entry() to avoid data corruption after page migration.

* In Patch 9, I include is_pmd_migration_entry() in pmd_none_or_trans_huge_or_clear_bad().
Otherwise, a pmd_migration_entry is treated as pmd_bad and cleared, which
leads to the deposited PTE page table not being freed.

* I personally use this patchset with my customized kernel to test frequent
page migrations by replacing page reclaim with page migration.
The bugs fixed in Patches 1-3 and 8 were discovered while I was testing my kernel.
I did a 16-hour stress test that performed ~7 billion page migrations in total.
No error or data corruption was found.

General description
===========================================

This patchset enhances the page migration functionality to handle thp migration
for various callers of page migration:
- mbind(2)
- move_pages(2)
- migrate_pages(2)
- cgroup/cpuset migration
- memory hotremove

The main benefit is that we can avoid unnecessary thp splits, which helps us
avoid performance degradation when applications handle NUMA optimization on
their own.

The implementation is similar to that of normal page migration; the key point
is that we convert a pmd into a pmd migration entry in a swap-entry-like format.
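
In short, the conversion in both directions looks roughly like the simplified
excerpt below of what set_pmd_migration_entry() and remove_migration_pmd() in
Patch 5 do; locking, cache/TLB maintenance, and rmap/refcount updates are
omitted here:

	/* unmap: replace the huge pmd with a non-present migration entry */
	entry = make_migration_entry(page, pmd_write(pmdval));
	set_pmd_at(mm, address, pvmw->pmd, swp_entry_to_pmd(entry));

	/* remap: decode the entry and install a huge pmd for the new page */
	entry = pmd_to_swp_entry(*pvmw->pmd);
	pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));
	if (is_write_migration_entry(entry))
		pmde = maybe_pmd_mkwrite(pmde, vma);
	set_pmd_at(mm, address, pvmw->pmd, pmde);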

Naoya Horiguchi (8):
mm: mempolicy: add queue_pages_required()
mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1
mm: thp: introduce separate TTU flag for thp freezing
mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATION
mm: soft-dirty: keep soft-dirty bits over thp migration
mm: mempolicy: mbind and migrate_pages support thp migration
mm: migrate: move_pages() supports thp migration
mm: memory_hotplug: memory hotremove supports thp migration

Zi Yan (2):
mm: thp: enable thp migration in generic path
mm: thp: check pmd migration entry in common path

arch/x86/Kconfig | 4 +
arch/x86/include/asm/pgtable.h | 17 ++++
arch/x86/include/asm/pgtable_64.h | 14 ++-
arch/x86/include/asm/pgtable_types.h | 10 +-
arch/x86/mm/gup.c | 7 +-
fs/proc/task_mmu.c | 60 +++++++-----
include/asm-generic/pgtable.h | 51 +++++++++-
include/linux/huge_mm.h | 24 ++++-
include/linux/migrate.h | 15 ++-
include/linux/rmap.h | 3 +-
include/linux/swapops.h | 69 +++++++++++++-
mm/Kconfig | 3 +
mm/gup.c | 22 ++++-
mm/huge_memory.c | 176 ++++++++++++++++++++++++++++++++---
mm/memcontrol.c | 5 +
mm/memory.c | 12 ++-
mm/memory_hotplug.c | 4 +-
mm/mempolicy.c | 130 +++++++++++++++++++-------
mm/migrate.c | 77 ++++++++++++---
mm/mprotect.c | 4 +-
mm/mremap.c | 2 +-
mm/page_vma_mapped.c | 18 +++-
mm/pgtable-generic.c | 3 +-
mm/rmap.c | 20 +++-
24 files changed, 634 insertions(+), 116 deletions(-)

--
2.11.0


2017-07-01 13:41:00

by Zi Yan

Subject: [PATCH v8 02/10] mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1

From: Naoya Horiguchi <[email protected]>

_PAGE_PSE is used to distinguish between a truly non-present
(_PAGE_PRESENT=0) PMD, and a PMD which is undergoing a THP
split and should be treated as present.

But _PAGE_SWP_SOFT_DIRTY currently uses the _PAGE_PSE bit,
which would cause confusion between one of those PMDs
undergoing a THP split, and a soft-dirty PMD.
Dropping the _PAGE_PSE check in pmd_present() does not work well,
because it can hurt the TLB handling optimization in thp split.

Thus, we need to move the bit.

In the current kernel, bits 1-4 are not used in non-present format
since commit 00839ee3b299 ("x86/mm: Move swap offset/type up in PTE to
work around erratum"). So let's move _PAGE_SWP_SOFT_DIRTY to bit 1.
Bit 7 is used as reserved (always clear), so please don't use it for
other purposes.

Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Zi Yan <[email protected]>
Acked-by: Dave Hansen <[email protected]>
---
arch/x86/include/asm/pgtable_64.h | 12 +++++++++---
arch/x86/include/asm/pgtable_types.h | 10 +++++-----
2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 9991224f6238..45b7a4094de0 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -178,15 +178,21 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
/*
* Encode and de-code a swap entry
*
- * | ... | 11| 10| 9|8|7|6|5| 4| 3|2|1|0| <- bit number
- * | ... |SW3|SW2|SW1|G|L|D|A|CD|WT|U|W|P| <- bit names
- * | OFFSET (14->63) | TYPE (9-13) |0|X|X|X| X| X|X|X|0| <- swp entry
+ * | ... | 11| 10| 9|8|7|6|5| 4| 3|2| 1|0| <- bit number
+ * | ... |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names
+ * | OFFSET (14->63) | TYPE (9-13) |0|0|X|X| X| X|X|SD|0| <- swp entry
*
* G (8) is aliased and used as a PROT_NONE indicator for
* !present ptes. We need to start storing swap entries above
* there. We also need to avoid using A and D because of an
* erratum where they can be incorrectly set by hardware on
* non-present PTEs.
+ *
+ * SD (1) in swp entry is used to store soft dirty bit, which helps us
+ * remember soft dirty over page migration
+ *
+ * Bit 7 in swp entry should be 0 because pmd_present checks not only P,
+ * but also L and G.
*/
#define SWP_TYPE_FIRST_BIT (_PAGE_BIT_PROTNONE + 1)
#define SWP_TYPE_BITS 5
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index bf9638e1ee42..c612a8f08422 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -97,15 +97,15 @@
/*
* Tracking soft dirty bit when a page goes to a swap is tricky.
* We need a bit which can be stored in pte _and_ not conflict
- * with swap entry format. On x86 bits 6 and 7 are *not* involved
- * into swap entry computation, but bit 6 is used for nonlinear
- * file mapping, so we borrow bit 7 for soft dirty tracking.
+ * with swap entry format. On x86 bits 1-4 are *not* involved
+ * into swap entry computation, but bit 7 is used for thp migration,
+ * so we borrow bit 1 for soft dirty tracking.
*
* Please note that this bit must be treated as swap dirty page
- * mark if and only if the PTE has present bit clear!
+ * mark if and only if the PTE/PMD has present bit clear!
*/
#ifdef CONFIG_MEM_SOFT_DIRTY
-#define _PAGE_SWP_SOFT_DIRTY _PAGE_PSE
+#define _PAGE_SWP_SOFT_DIRTY _PAGE_RW
#else
#define _PAGE_SWP_SOFT_DIRTY (_AT(pteval_t, 0))
#endif
--
2.11.0

2017-07-01 13:40:55

by Zi Yan

Subject: [PATCH v8 01/10] mm: mempolicy: add queue_pages_required()

From: Naoya Horiguchi <[email protected]>

Introduce a separate check routine related to the MPOL_MF_INVERT flag.
This patch just does cleanup, no behavioral change.

Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Zi Yan <[email protected]>
---
mm/mempolicy.c | 22 +++++++++++++++++-----
1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d911fa5cb2a7..58166bf1d1fd 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -412,6 +412,21 @@ struct queue_pages {
};

/*
+ * Check if the page's nid is in qp->nmask.
+ *
+ * If MPOL_MF_INVERT is set in qp->flags, check if the nid is
+ * in the invert of qp->nmask.
+ */
+static inline bool queue_pages_required(struct page *page,
+ struct queue_pages *qp)
+{
+ int nid = page_to_nid(page);
+ unsigned long flags = qp->flags;
+
+ return node_isset(nid, *qp->nmask) == !(flags & MPOL_MF_INVERT);
+}
+
+/*
* Scan through pages checking if pages follow certain conditions,
* and move them to the pagelist if they do.
*/
@@ -464,8 +479,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
*/
if (PageReserved(page))
continue;
- nid = page_to_nid(page);
- if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
+ if (!queue_pages_required(page, qp))
continue;
if (PageTransCompound(page)) {
get_page(page);
@@ -497,7 +511,6 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
#ifdef CONFIG_HUGETLB_PAGE
struct queue_pages *qp = walk->private;
unsigned long flags = qp->flags;
- int nid;
struct page *page;
spinlock_t *ptl;
pte_t entry;
@@ -507,8 +520,7 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
if (!pte_present(entry))
goto unlock;
page = pte_page(entry);
- nid = page_to_nid(page);
- if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
+ if (!queue_pages_required(page, qp))
goto unlock;
/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
if (flags & (MPOL_MF_MOVE_ALL) ||
--
2.11.0

2017-07-01 13:41:34

by Zi Yan

Subject: [PATCH v8 09/10] mm: migrate: move_pages() supports thp migration

From: Naoya Horiguchi <[email protected]>

This patch enables thp migration for move_pages(2).

Signed-off-by: Naoya Horiguchi <[email protected]>

ChangeLog: v1 -> v5:
- fix page counting

ChangeLog: v5 -> v6:
- drop changes on soft-offline in unmap_and_move()

Signed-off-by: Zi Yan <[email protected]>
---
mm/migrate.c | 45 ++++++++++++++++++++++++++++++++-------------
1 file changed, 32 insertions(+), 13 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index cae5c3b3b491..ff3ca4b90b92 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -184,8 +184,8 @@ void putback_movable_pages(struct list_head *l)
unlock_page(page);
put_page(page);
} else {
- dec_node_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
+ page_is_file_cache(page), -hpage_nr_pages(page));
putback_lru_page(page);
}
}
@@ -1145,8 +1145,8 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
* as __PageMovable
*/
if (likely(!__PageMovable(page)))
- dec_node_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
+ page_is_file_cache(page), -hpage_nr_pages(page));
}

/*
@@ -1420,7 +1420,17 @@ static struct page *new_page_node(struct page *p, unsigned long private,
if (PageHuge(p))
return alloc_huge_page_node(page_hstate(compound_head(p)),
pm->node);
- else
+ else if (thp_migration_supported() && PageTransHuge(p)) {
+ struct page *thp;
+
+ thp = alloc_pages_node(pm->node,
+ (GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
+ HPAGE_PMD_ORDER);
+ if (!thp)
+ return NULL;
+ prep_transhuge_page(thp);
+ return thp;
+ } else
return __alloc_pages_node(pm->node,
GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
}
@@ -1447,6 +1457,8 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
for (pp = pm; pp->node != MAX_NUMNODES; pp++) {
struct vm_area_struct *vma;
struct page *page;
+ struct page *head;
+ unsigned int follflags;

err = -EFAULT;
vma = find_vma(mm, pp->addr);
@@ -1454,8 +1466,10 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
goto set_status;

/* FOLL_DUMP to ignore special (like zero) pages */
- page = follow_page(vma, pp->addr,
- FOLL_GET | FOLL_SPLIT | FOLL_DUMP);
+ follflags = FOLL_GET | FOLL_DUMP;
+ if (!thp_migration_supported())
+ follflags |= FOLL_SPLIT;
+ page = follow_page(vma, pp->addr, follflags);

err = PTR_ERR(page);
if (IS_ERR(page))
@@ -1465,7 +1479,6 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
if (!page)
goto set_status;

- pp->page = page;
err = page_to_nid(page);

if (err == pp->node)
@@ -1480,16 +1493,22 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
goto put_and_set;

if (PageHuge(page)) {
- if (PageHead(page))
+ if (PageHead(page)) {
isolate_huge_page(page, &pagelist);
+ err = 0;
+ pp->page = page;
+ }
goto put_and_set;
}

- err = isolate_lru_page(page);
+ pp->page = compound_head(page);
+ head = compound_head(page);
+ err = isolate_lru_page(head);
if (!err) {
- list_add_tail(&page->lru, &pagelist);
- inc_node_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ list_add_tail(&head->lru, &pagelist);
+ mod_node_page_state(page_pgdat(head),
+ NR_ISOLATED_ANON + page_is_file_cache(head),
+ hpage_nr_pages(head));
}
put_and_set:
/*
--
2.11.0

2017-07-01 13:41:35

by Zi Yan

Subject: [PATCH v8 10/10] mm: memory_hotplug: memory hotremove supports thp migration

From: Naoya Horiguchi <[email protected]>

This patch enables thp migration for memory hotremove.

---
ChangeLog v1->v2:
- base code switched from alloc_migrate_target to new_node_page()

Signed-off-by: Naoya Horiguchi <[email protected]>

ChangeLog v2->v7:
- base code switched from new_node_page() to new_page_nodemask()

Signed-off-by: Zi Yan <[email protected]>
---
include/linux/migrate.h | 15 ++++++++++++++-
mm/memory_hotplug.c | 4 +++-
2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 67ca33665e83..ff1f76683ee6 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -35,16 +35,29 @@ static inline struct page *new_page_nodemask(struct page *page, int preferred_ni
nodemask_t *nodemask)
{
gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL;
+ unsigned int order = 0;
+ struct page *new_page = NULL;

if (PageHuge(page))
return alloc_huge_page_nodemask(page_hstate(compound_head(page)),
preferred_nid, nodemask);

+ if (thp_migration_supported() && PageTransHuge(page)) {
+ order = HPAGE_PMD_ORDER;
+ gfp_mask |= GFP_TRANSHUGE;
+ }
+
if (PageHighMem(page)
|| (zone_idx(page_zone(page)) == ZONE_MOVABLE))
gfp_mask |= __GFP_HIGHMEM;

- return __alloc_pages_nodemask(gfp_mask, 0, preferred_nid, nodemask);
+ new_page = __alloc_pages_nodemask(gfp_mask, order,
+ preferred_nid, nodemask);
+
+ if (new_page && PageTransHuge(page))
+ prep_transhuge_page(new_page);
+
+ return new_page;
}

#ifdef CONFIG_MIGRATION
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 57b03be3a8da..72110ea2ee1b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1416,7 +1416,9 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
if (isolate_huge_page(page, &source))
move_pages -= 1 << compound_order(head);
continue;
- }
+ } else if (thp_migration_supported() && PageTransHuge(page))
+ pfn = page_to_pfn(compound_head(page))
+ + hpage_nr_pages(page) - 1;

if (!get_page_unless_zero(page))
continue;
--
2.11.0

2017-07-01 13:42:13

by Zi Yan

Subject: [PATCH v8 03/10] mm: thp: introduce separate TTU flag for thp freezing

From: Naoya Horiguchi <[email protected]>

TTU_MIGRATION is used to convert pte into migration entry until thp split
completes. This behavior conflicts with thp migration added in later patches,
so let's introduce a new TTU flag specifically for freezing.

try_to_unmap() is used both for thp split (via freeze_page()) and page
migration (via __unmap_and_move()). In freeze_page(), ttu_flag given for
head page is like below (assuming anonymous thp):

(TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
TTU_MIGRATION | TTU_SPLIT_HUGE_PMD)

and ttu_flag given for tail pages is:

(TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
TTU_MIGRATION)

__unmap_and_move() calls try_to_unmap() with ttu_flag:

(TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS)

Now I'm trying to insert a branch for thp migration at the top of
try_to_unmap_one() like below

static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
		unsigned long address, void *arg)
{
	...
	if (flags & TTU_MIGRATION) {
		if (!pvmw.pte && page) {
			set_pmd_migration_entry(&pvmw, page);
			continue;
		}
	}

, so try_to_unmap() for tail pages called by thp split can go into thp
migration code path (which converts *pmd* into migration entry), while
the expectation is to freeze thp (which converts *pte* into migration entry.)

I detected this failure as a "bad page state" error in a testcase where
split_huge_page() is called from queue_pages_pte_range().

Signed-off-by: Naoya Horiguchi <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
---
include/linux/rmap.h | 3 ++-
mm/huge_memory.c | 2 +-
mm/rmap.c | 7 ++++---
3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 43ef2c30cb0f..f8ca2e74b819 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -93,8 +93,9 @@ enum ttu_flags {
TTU_BATCH_FLUSH = 0x40, /* Batch TLB flushes where possible
* and caller guarantees they will
* do a final flush if necessary */
- TTU_RMAP_LOCKED = 0x80 /* do not grab rmap lock:
+ TTU_RMAP_LOCKED = 0x80, /* do not grab rmap lock:
* caller holds it */
+ TTU_SPLIT_FREEZE = 0x100, /* freeze pte under splitting thp */
};

#ifdef CONFIG_MMU
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 86975dec0ba1..35711b35b067 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2167,7 +2167,7 @@ static void freeze_page(struct page *page)
VM_BUG_ON_PAGE(!PageHead(page), page);

if (PageAnon(page))
- ttu_flags |= TTU_MIGRATION;
+ ttu_flags |= TTU_SPLIT_FREEZE;

unmap_success = try_to_unmap(page, ttu_flags);
VM_BUG_ON_PAGE(!unmap_success, page);
diff --git a/mm/rmap.c b/mm/rmap.c
index 2324c923c813..91948fbbb0bb 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1308,7 +1308,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,

if (flags & TTU_SPLIT_HUGE_PMD) {
split_huge_pmd_address(vma, address,
- flags & TTU_MIGRATION, page);
+ flags & TTU_SPLIT_FREEZE, page);
}

while (page_vma_mapped_walk(&pvmw)) {
@@ -1397,7 +1397,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
*/
dec_mm_counter(mm, mm_counter(page));
} else if (IS_ENABLED(CONFIG_MIGRATION) &&
- (flags & TTU_MIGRATION)) {
+ (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
swp_entry_t entry;
pte_t swp_pte;
/*
@@ -1522,7 +1522,8 @@ bool try_to_unmap(struct page *page, enum ttu_flags flags)
* locking requirements of exec(), migration skips
* temporary VMAs until after exec() completes.
*/
- if ((flags & TTU_MIGRATION) && !PageKsm(page) && PageAnon(page))
+ if ((flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))
+ && !PageKsm(page) && PageAnon(page))
rwc.invalid_vma = invalid_migration_vma;

if (flags & TTU_RMAP_LOCKED)
--
2.11.0

2017-07-01 13:42:11

by Zi Yan

Subject: [PATCH v8 04/10] mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATION

From: Naoya Horiguchi <[email protected]>

This patch introduces CONFIG_ARCH_ENABLE_THP_MIGRATION to limit thp migration
functionality to x86_64, which should be safer as a first step.

ChangeLog v1 -> v2:
- fixed config name in subject and patch description

Signed-off-by: Naoya Horiguchi <[email protected]>
Reviewed-by: Anshuman Khandual <[email protected]>
---
arch/x86/Kconfig | 4 ++++
include/linux/huge_mm.h | 10 ++++++++++
mm/Kconfig | 3 +++
3 files changed, 17 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b6373817e6f4..631af221ce63 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2276,6 +2276,10 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
def_bool y
depends on X86_64 && HUGETLB_PAGE && MIGRATION

+config ARCH_ENABLE_THP_MIGRATION
+ def_bool y
+ depends on X86_64 && TRANSPARENT_HUGEPAGE
+
menu "Power management and ACPI options"

config ARCH_HIBERNATION_HEADER
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ee696347f928..d8f35a0865dc 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -233,6 +233,11 @@ void mm_put_huge_zero_page(struct mm_struct *mm);

#define mk_huge_pmd(page, prot) pmd_mkhuge(mk_pmd(page, prot))

+static inline bool thp_migration_supported(void)
+{
+ return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
+}
+
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
@@ -336,6 +341,11 @@ static inline struct page *follow_devmap_pud(struct vm_area_struct *vma,
{
return NULL;
}
+
+static inline bool thp_migration_supported(void)
+{
+ return false;
+}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

#endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 9bf2055ed061..6634e0ed5c1b 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -262,6 +262,9 @@ config MIGRATION
config ARCH_ENABLE_HUGEPAGE_MIGRATION
bool

+config ARCH_ENABLE_THP_MIGRATION
+ bool
+
config PHYS_ADDR_T_64BIT
def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT

--
2.11.0

2017-07-01 13:42:10

by Zi Yan

Subject: [PATCH v8 05/10] mm: thp: enable thp migration in generic path

From: Zi Yan <[email protected]>

This patch adds thp migration's core code, including conversions
between a PMD entry and a swap entry, setting PMD migration entry,
removing PMD migration entry, and waiting on PMD migration entries.

This patch makes it possible to support thp migration.
If allocating a destination page as a thp fails, we just split
the source thp as we do now, and then enter the normal page migration path.
If a destination thp is allocated successfully, we enter thp migration.
Subsequent patches actually enable thp migration for each caller of
page migration by allowing its get_new_page() callback to
allocate thps.
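
For illustration, such a thp-aware get_new_page() callback looks roughly like
the sketch below, modelled on the allocation code added in Patches 9 and 10.
The function name is hypothetical and the signature is simplified relative to
the real new_page_t callbacks:

	static struct page *new_thp_aware_page(struct page *page, int nid)
	{
		if (thp_migration_supported() && PageTransHuge(page)) {
			struct page *thp;

			thp = alloc_pages_node(nid,
				(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
				HPAGE_PMD_ORDER);
			if (!thp)
				return NULL;	/* caller falls back to splitting */
			prep_transhuge_page(thp);
			return thp;
		}
		return __alloc_pages_node(nid,
				GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
	}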

ChangeLog v1 -> v2:
- support pte-mapped thp, doubly-mapped thp

Signed-off-by: Naoya Horiguchi <[email protected]>

ChangeLog v2 -> v3:
- use page_vma_mapped_walk()
- use pmdp_huge_clear_flush() instead of pmdp_huge_get_and_clear() in
set_pmd_migration_entry()

ChangeLog v3 -> v4:
- factor out the code of removing pte pgtable page in zap_huge_pmd()

ChangeLog v4 -> v5:
- remove unnecessary PTE-mapped THP code in remove_migration_pmd()
and set_pmd_migration_entry()
- restructure the code in zap_huge_pmd() to avoid factoring out
the pte pgtable page code
- in zap_huge_pmd(), check that PMD swap entries are migration entries
- change author information

ChangeLog v5 -> v7
- use macro to disable the code when thp migration is not enabled

ChangeLog v7 -> v8
- use IS_ENABLED instead of macro to make code look clean in
zap_huge_pmd() and page_vma_mapped_walk()
- remove BUILD_BUG() in pmd_to_swp_entry() and swp_entry_to_pmd() to
avoid compilation error
- rename variable 'migration' to 'flush_needed' and invert the logic in
zap_huge_pmd() to make code more descriptive
- use pmdp_invalidate() in set_pmd_migration_entry() to avoid race
with MADV_DONTNEED
- remove unnecessary tlb flush in remove_migration_pmd()
- add the missing migration flag check in page_vma_mapped_walk()

Signed-off-by: Zi Yan <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/pgtable_64.h | 2 +
include/linux/swapops.h | 67 ++++++++++++++++++++++++++++++-
mm/huge_memory.c | 84 ++++++++++++++++++++++++++++++++++++---
mm/migrate.c | 32 ++++++++++++++-
mm/page_vma_mapped.c | 18 +++++++--
mm/pgtable-generic.c | 3 +-
mm/rmap.c | 13 ++++++
7 files changed, 207 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 45b7a4094de0..eac7f8cf4ae0 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -208,7 +208,9 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
((type) << (SWP_TYPE_FIRST_BIT)) \
| ((offset) << SWP_OFFSET_FIRST_BIT) })
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val((pte)) })
+#define __pmd_to_swp_entry(pmd) ((swp_entry_t) { pmd_val((pmd)) })
#define __swp_entry_to_pte(x) ((pte_t) { .pte = (x).val })
+#define __swp_entry_to_pmd(x) ((pmd_t) { .pmd = (x).val })

extern int kern_addr_valid(unsigned long addr);
extern void cleanup_highmap(void);
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index c5ff7b217ee6..c8c6511750f1 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -103,7 +103,8 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
#ifdef CONFIG_MIGRATION
static inline swp_entry_t make_migration_entry(struct page *page, int write)
{
- BUG_ON(!PageLocked(page));
+ BUG_ON(!PageLocked(compound_head(page)));
+
return swp_entry(write ? SWP_MIGRATION_WRITE : SWP_MIGRATION_READ,
page_to_pfn(page));
}
@@ -126,7 +127,7 @@ static inline struct page *migration_entry_to_page(swp_entry_t entry)
* Any use of migration entries may only occur while the
* corresponding page is locked
*/
- BUG_ON(!PageLocked(p));
+ BUG_ON(!PageLocked(compound_head(p)));
return p;
}

@@ -163,6 +164,68 @@ static inline int is_write_migration_entry(swp_entry_t entry)

#endif

+struct page_vma_mapped_walk;
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+extern void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
+ struct page *page);
+
+extern void remove_migration_pmd(struct page_vma_mapped_walk *pvmw,
+ struct page *new);
+
+extern void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd);
+
+static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
+{
+ swp_entry_t arch_entry;
+
+ arch_entry = __pmd_to_swp_entry(pmd);
+ return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
+}
+
+static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
+{
+ swp_entry_t arch_entry;
+
+ arch_entry = __swp_entry(swp_type(entry), swp_offset(entry));
+ return __swp_entry_to_pmd(arch_entry);
+}
+
+static inline int is_pmd_migration_entry(pmd_t pmd)
+{
+ return !pmd_present(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
+}
+#else
+static inline void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
+ struct page *page)
+{
+ BUILD_BUG();
+}
+
+static inline void remove_migration_pmd(struct page_vma_mapped_walk *pvmw,
+ struct page *new)
+{
+ BUILD_BUG();
+}
+
+static inline void pmd_migration_entry_wait(struct mm_struct *m, pmd_t *p) { }
+
+static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
+{
+ return swp_entry(0, 0);
+}
+
+static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
+{
+ return (pmd_t){ 0 };
+}
+
+static inline int is_pmd_migration_entry(pmd_t pmd)
+{
+ return 0;
+}
+#endif
+
#ifdef CONFIG_MEMORY_FAILURE

extern atomic_long_t num_poisoned_pages __read_mostly;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 35711b35b067..9668f8cb8317 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1641,10 +1641,24 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
spin_unlock(ptl);
tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
} else {
- struct page *page = pmd_page(orig_pmd);
- page_remove_rmap(page, true);
- VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
- VM_BUG_ON_PAGE(!PageHead(page), page);
+ struct page *page = NULL;
+ int flush_needed = 1;
+
+ if (pmd_present(orig_pmd)) {
+ page = pmd_page(orig_pmd);
+ page_remove_rmap(page, true);
+ VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
+ VM_BUG_ON_PAGE(!PageHead(page), page);
+ } else if (thp_migration_supported()) {
+ swp_entry_t entry;
+
+ VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
+ entry = pmd_to_swp_entry(orig_pmd);
+ page = pfn_to_page(swp_offset(entry));
+ flush_needed = 0;
+ } else
+ WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
+
if (PageAnon(page)) {
zap_deposited_table(tlb->mm, pmd);
add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
@@ -1653,8 +1667,10 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
zap_deposited_table(tlb->mm, pmd);
add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
}
+
spin_unlock(ptl);
- tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
+ if (flush_needed)
+ tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
}
return 1;
}
@@ -2694,3 +2710,61 @@ static int __init split_huge_pages_debugfs(void)
}
late_initcall(split_huge_pages_debugfs);
#endif
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
+ struct page *page)
+{
+ struct vm_area_struct *vma = pvmw->vma;
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long address = pvmw->address;
+ pmd_t pmdval;
+ swp_entry_t entry;
+
+ if (!(pvmw->pmd && !pvmw->pte))
+ return;
+
+ mmu_notifier_invalidate_range_start(mm, address,
+ address + HPAGE_PMD_SIZE);
+
+ flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
+ pmdval = *pvmw->pmd;
+ pmdp_invalidate(vma, address, pvmw->pmd);
+ if (pmd_dirty(pmdval))
+ set_page_dirty(page);
+ entry = make_migration_entry(page, pmd_write(pmdval));
+ pmdval = swp_entry_to_pmd(entry);
+ set_pmd_at(mm, address, pvmw->pmd, pmdval);
+ page_remove_rmap(page, true);
+ put_page(page);
+
+ mmu_notifier_invalidate_range_end(mm, address,
+ address + HPAGE_PMD_SIZE);
+}
+
+void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
+{
+ struct vm_area_struct *vma = pvmw->vma;
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long address = pvmw->address;
+ unsigned long mmun_start = address & HPAGE_PMD_MASK;
+ pmd_t pmde;
+ swp_entry_t entry;
+
+ if (!(pvmw->pmd && !pvmw->pte))
+ return;
+
+ entry = pmd_to_swp_entry(*pvmw->pmd);
+ get_page(new);
+ pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));
+ if (is_write_migration_entry(entry))
+ pmde = maybe_pmd_mkwrite(pmde, vma);
+
+ flush_cache_range(vma, mmun_start, mmun_start + HPAGE_PMD_SIZE);
+ page_add_anon_rmap(new, vma, mmun_start, true);
+ set_pmd_at(mm, mmun_start, pvmw->pmd, pmde);
+ if (vma->vm_flags & VM_LOCKED)
+ mlock_vma_page(new);
+ update_mmu_cache_pmd(vma, address, pvmw->pmd);
+}
+#endif
diff --git a/mm/migrate.c b/mm/migrate.c
index 627671551873..cae5c3b3b491 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -215,6 +215,15 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
new = page - pvmw.page->index +
linear_page_index(vma, pvmw.address);

+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+ /* PMD-mapped THP migration entry */
+ if (!pvmw.pte && pvmw.page) {
+ VM_BUG_ON_PAGE(PageHuge(page) || !PageTransCompound(page), page);
+ remove_migration_pmd(&pvmw, new);
+ continue;
+ }
+#endif
+
get_page(new);
pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot)));
if (pte_swp_soft_dirty(*pvmw.pte))
@@ -329,6 +338,27 @@ void migration_entry_wait_huge(struct vm_area_struct *vma,
__migration_entry_wait(mm, pte, ptl);
}

+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
+{
+ spinlock_t *ptl;
+ struct page *page;
+
+ ptl = pmd_lock(mm, pmd);
+ if (!is_pmd_migration_entry(*pmd))
+ goto unlock;
+ page = migration_entry_to_page(pmd_to_swp_entry(*pmd));
+ if (!get_page_unless_zero(page))
+ goto unlock;
+ spin_unlock(ptl);
+ wait_on_page_locked(page);
+ put_page(page);
+ return;
+unlock:
+ spin_unlock(ptl);
+}
+#endif
+
#ifdef CONFIG_BLOCK
/* Returns true if all buffers are successfully locked */
static bool buffer_migrate_lock_buffers(struct buffer_head *head,
@@ -1087,7 +1117,7 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
goto out;
}

- if (unlikely(PageTransHuge(page))) {
+ if (unlikely(PageTransHuge(page) && !PageTransHuge(newpage))) {
lock_page(page);
rc = split_huge_page(page);
unlock_page(page);
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 8ec6ba230bb9..3bd3008db4cb 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -138,16 +138,28 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
if (!pud_present(*pud))
return false;
pvmw->pmd = pmd_offset(pud, pvmw->address);
- if (pmd_trans_huge(*pvmw->pmd)) {
+ if (pmd_trans_huge(*pvmw->pmd) || is_pmd_migration_entry(*pvmw->pmd)) {
pvmw->ptl = pmd_lock(mm, pvmw->pmd);
- if (!pmd_present(*pvmw->pmd))
- return not_found(pvmw);
if (likely(pmd_trans_huge(*pvmw->pmd))) {
if (pvmw->flags & PVMW_MIGRATION)
return not_found(pvmw);
if (pmd_page(*pvmw->pmd) != page)
return not_found(pvmw);
return true;
+ } else if (!pmd_present(*pvmw->pmd)) {
+ if (thp_migration_supported()) {
+ if (!(pvmw->flags & PVMW_MIGRATION))
+ return not_found(pvmw);
+ if (is_migration_entry(pmd_to_swp_entry(*pvmw->pmd))) {
+ swp_entry_t entry = pmd_to_swp_entry(*pvmw->pmd);
+
+ if (migration_entry_to_page(entry) != page)
+ return not_found(pvmw);
+ return true;
+ }
+ } else
+ WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
+ return not_found(pvmw);
} else {
/* THP pmd was split under us: handle on pte level */
spin_unlock(pvmw->ptl);
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index c99d9512a45b..1175f6a24fdb 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -124,7 +124,8 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
{
pmd_t pmd;
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
- VM_BUG_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
+ VM_BUG_ON((pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
+ !pmd_devmap(*pmdp)) || !pmd_present(*pmdp));
pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
return pmd;
diff --git a/mm/rmap.c b/mm/rmap.c
index 91948fbbb0bb..b28f633cd569 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1302,6 +1302,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
bool ret = true;
enum ttu_flags flags = (enum ttu_flags)arg;

+
/* munlock has nothing to gain from examining un-locked vmas */
if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
return true;
@@ -1312,6 +1313,18 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
}

while (page_vma_mapped_walk(&pvmw)) {
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+ /* PMD-mapped THP migration entry */
+ if (flags & TTU_MIGRATION) {
+ if (!pvmw.pte && page) {
+ VM_BUG_ON_PAGE(PageHuge(page) || !PageTransCompound(page),
+ page);
+ set_pmd_migration_entry(&pvmw, page);
+ continue;
+ }
+ }
+#endif
+
/*
* If the page is mlock()d, we cannot swap it out.
* If it's recently referenced (perhaps page_referenced
--
2.11.0

2017-07-01 13:42:09

by Zi Yan

Subject: [PATCH v8 06/10] mm: thp: check pmd migration entry in common path

From: Zi Yan <[email protected]>

If one of the callers of page migration starts to handle thp,
memory management code will start to see pmd migration entries, so we need
to prepare for that before enabling it. This patch changes the various code
points that check the status of given pmds, in order to prevent races
between thp migration and pmd-related work.

ChangeLog v1 -> v2:
- introduce pmd_related() (I know the naming is not good, but I can't
think of a better name. Any suggestion is welcome.)

Signed-off-by: Naoya Horiguchi <[email protected]>

ChangeLog v2 -> v3:
- add is_swap_pmd()
- a pmd entry should be pmd pointing to pte pages, is_swap_pmd(),
pmd_trans_huge(), pmd_devmap(), or pmd_none()
- pmd_none_or_trans_huge_or_clear_bad() and pmd_trans_unstable() return
true on pmd_migration_entry, so that migration entries are not
treated as pmd page table entries.

ChangeLog v4 -> v5:
- add explanation in pmd_none_or_trans_huge_or_clear_bad() to state
the equivalence of !pmd_present() and is_pmd_migration_entry()
- fix migration entry wait deadlock code (from v1) in follow_page_mask()
- remove unnecessary code (from v1) in follow_trans_huge_pmd()
- use is_swap_pmd() instead of !pmd_present() for pmd migration entry,
so it will not be confused with pmd_none()
- change author information

ChangeLog v5 -> v7
- use macro to disable the code when thp migration is not enabled

ChangeLog v7 -> v8
- remove not used code in do_huge_pmd_wp_page()
- copy the comment from change_pte_range() on downgrading
write migration entry to read to change_huge_pmd()

Signed-off-by: Zi Yan <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
---
arch/x86/mm/gup.c | 7 +++--
fs/proc/task_mmu.c | 33 ++++++++++++++-------
include/asm-generic/pgtable.h | 17 ++++++++++-
include/linux/huge_mm.h | 14 +++++++--
mm/gup.c | 22 ++++++++++++--
mm/huge_memory.c | 67 +++++++++++++++++++++++++++++++++++++++----
mm/memcontrol.c | 5 ++++
mm/memory.c | 12 ++++++--
mm/mprotect.c | 4 +--
mm/mremap.c | 2 +-
10 files changed, 154 insertions(+), 29 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 456dfdfd2249..096bbcc801e6 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -9,6 +9,7 @@
#include <linux/vmstat.h>
#include <linux/highmem.h>
#include <linux/swap.h>
+#include <linux/swapops.h>
#include <linux/memremap.h>

#include <asm/mmu_context.h>
@@ -243,9 +244,11 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
pmd_t pmd = *pmdp;

next = pmd_addr_end(addr, end);
- if (pmd_none(pmd))
+ if (!pmd_present(pmd)) {
+ VM_BUG_ON(is_swap_pmd(pmd) && IS_ENABLED(CONFIG_MIGRATION) &&
+ !is_pmd_migration_entry(pmd));
return 0;
- if (unlikely(pmd_large(pmd) || !pmd_present(pmd))) {
+ } else if (unlikely(pmd_large(pmd))) {
/*
* NUMA hinting faults need to be handled in the GUP
* slowpath for accounting purposes and so that they
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index b836fd61ed87..01ad4101ef7b 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -596,7 +596,8 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,

ptl = pmd_trans_huge_lock(pmd, vma);
if (ptl) {
- smaps_pmd_entry(pmd, addr, walk);
+ if (pmd_present(*pmd))
+ smaps_pmd_entry(pmd, addr, walk);
spin_unlock(ptl);
return 0;
}
@@ -938,6 +939,9 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
goto out;
}

+ if (!pmd_present(*pmd))
+ goto out;
+
page = pmd_page(*pmd);

/* Clear accessed and referenced bits. */
@@ -1217,27 +1221,34 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
if (ptl) {
u64 flags = 0, frame = 0;
pmd_t pmd = *pmdp;
+ struct page *page = NULL;

if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(pmd))
flags |= PM_SOFT_DIRTY;

- /*
- * Currently pmd for thp is always present because thp
- * can not be swapped-out, migrated, or HWPOISONed
- * (split in such cases instead.)
- * This if-check is just to prepare for future implementation.
- */
if (pmd_present(pmd)) {
- struct page *page = pmd_page(pmd);
-
- if (page_mapcount(page) == 1)
- flags |= PM_MMAP_EXCLUSIVE;
+ page = pmd_page(pmd);

flags |= PM_PRESENT;
if (pm->show_pfn)
frame = pmd_pfn(pmd) +
((addr & ~PMD_MASK) >> PAGE_SHIFT);
}
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+ else if (is_swap_pmd(pmd)) {
+ swp_entry_t entry = pmd_to_swp_entry(pmd);
+
+ frame = swp_type(entry) |
+ (swp_offset(entry) << MAX_SWAPFILES_SHIFT);
+ flags |= PM_SWAP;
+ VM_BUG_ON(IS_ENABLED(CONFIG_MIGRATION) &&
+ !is_pmd_migration_entry(pmd));
+ page = migration_entry_to_page(entry);
+ }
+#endif
+
+ if (page && page_mapcount(page) == 1)
+ flags |= PM_MMAP_EXCLUSIVE;

for (; addr != end; addr += PAGE_SIZE) {
pagemap_entry_t pme = make_pme(frame, flags);
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 7dfa767dc680..88119351fecc 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -834,7 +834,22 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
barrier();
#endif
- if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
+ /*
+ * !pmd_present() checks for pmd migration entries
+ *
+ * The complete check uses is_pmd_migration_entry() in linux/swapops.h
+ * But using that requires moving current function and pmd_trans_unstable()
+ * to linux/swapops.h to resolve dependency, which is too much code move.
+ *
+ * !pmd_present() is equivalent to is_pmd_migration_entry() currently,
+ * because !pmd_present() pages can only be under migration not swapped
+ * out.
+ *
+ * pmd_none() is preserved for future condition checks on pmd migration
+ * entries and not confusing with this function name, although it is
+ * redundant with !pmd_present().
+ */
+ if (pmd_none(pmdval) || pmd_trans_huge(pmdval) || !pmd_present(pmdval))
return 1;
if (unlikely(pmd_bad(pmdval))) {
pmd_clear_bad(pmd);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index d8f35a0865dc..14bc21c2ee7f 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -147,7 +147,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
#define split_huge_pmd(__vma, __pmd, __address) \
do { \
pmd_t *____pmd = (__pmd); \
- if (pmd_trans_huge(*____pmd) \
+ if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd) \
|| pmd_devmap(*____pmd)) \
__split_huge_pmd(__vma, __pmd, __address, \
false, NULL); \
@@ -178,12 +178,18 @@ extern spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd,
struct vm_area_struct *vma);
extern spinlock_t *__pud_trans_huge_lock(pud_t *pud,
struct vm_area_struct *vma);
+
+static inline int is_swap_pmd(pmd_t pmd)
+{
+ return !pmd_none(pmd) && !pmd_present(pmd);
+}
+
/* mmap_sem must be held on entry */
static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
struct vm_area_struct *vma)
{
VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma);
- if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
+ if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
return __pmd_trans_huge_lock(pmd, vma);
else
return NULL;
@@ -299,6 +305,10 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
long adjust_next)
{
}
+static inline int is_swap_pmd(pmd_t pmd)
+{
+ return 0;
+}
static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
struct vm_area_struct *vma)
{
diff --git a/mm/gup.c b/mm/gup.c
index 96bf802c3533..d3458aff7178 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -234,6 +234,16 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
return page;
return no_page_table(vma, flags);
}
+retry:
+ if (!pmd_present(*pmd)) {
+ if (likely(!(flags & FOLL_MIGRATION)))
+ return no_page_table(vma, flags);
+ VM_BUG_ON(IS_ENABLED(CONFIG_MIGRATION) &&
+ !is_pmd_migration_entry(*pmd));
+ if (is_pmd_migration_entry(*pmd))
+ pmd_migration_entry_wait(mm, pmd);
+ goto retry;
+ }
if (pmd_devmap(*pmd)) {
ptl = pmd_lock(mm, pmd);
page = follow_devmap_pmd(vma, address, pmd, flags);
@@ -247,7 +257,15 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
return no_page_table(vma, flags);

+retry_locked:
ptl = pmd_lock(mm, pmd);
+ if (unlikely(!pmd_present(*pmd))) {
+ spin_unlock(ptl);
+ if (likely(!(flags & FOLL_MIGRATION)))
+ return no_page_table(vma, flags);
+ pmd_migration_entry_wait(mm, pmd);
+ goto retry_locked;
+ }
if (unlikely(!pmd_trans_huge(*pmd))) {
spin_unlock(ptl);
return follow_page_pte(vma, address, pmd, flags);
@@ -424,7 +442,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
pud = pud_offset(p4d, address);
BUG_ON(pud_none(*pud));
pmd = pmd_offset(pud, address);
- if (pmd_none(*pmd))
+ if (!pmd_present(*pmd))
return -EFAULT;
VM_BUG_ON(pmd_trans_huge(*pmd));
pte = pte_offset_map(pmd, address);
@@ -1534,7 +1552,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
pmd_t pmd = READ_ONCE(*pmdp);

next = pmd_addr_end(addr, end);
- if (pmd_none(pmd))
+ if (!pmd_present(pmd))
return 0;

if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9668f8cb8317..acf2d7a5953f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -914,6 +914,24 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,

ret = -EAGAIN;
pmd = *src_pmd;
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+ if (unlikely(is_swap_pmd(pmd))) {
+ swp_entry_t entry = pmd_to_swp_entry(pmd);
+
+ VM_BUG_ON(IS_ENABLED(CONFIG_MIGRATION) &&
+ !is_pmd_migration_entry(pmd));
+ if (is_write_migration_entry(entry)) {
+ make_migration_entry_read(&entry);
+ pmd = swp_entry_to_pmd(entry);
+ set_pmd_at(src_mm, addr, src_pmd, pmd);
+ }
+ set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+ ret = 0;
+ goto out_unlock;
+ }
+#endif
+
if (unlikely(!pmd_trans_huge(pmd))) {
pte_free(dst_mm, pgtable);
goto out_unlock;
@@ -1556,6 +1574,12 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
if (is_huge_zero_pmd(orig_pmd))
goto out;

+ if (unlikely(!pmd_present(orig_pmd))) {
+ VM_BUG_ON(IS_ENABLED(CONFIG_MIGRATION) &&
+ !is_pmd_migration_entry(orig_pmd));
+ goto out;
+ }
+
page = pmd_page(orig_pmd);
/*
* If other processes are mapping this page, we couldn't discard
@@ -1767,6 +1791,26 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
preserve_write = prot_numa && pmd_write(*pmd);
ret = 1;

+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+ if (is_swap_pmd(*pmd)) {
+ swp_entry_t entry = pmd_to_swp_entry(*pmd);
+
+ VM_BUG_ON(IS_ENABLED(CONFIG_MIGRATION) &&
+ !is_pmd_migration_entry(*pmd));
+ if (is_write_migration_entry(entry)) {
+ pmd_t newpmd;
+ /*
+ * A protection check is difficult so
+ * just be safe and disable write
+ */
+ make_migration_entry_read(&entry);
+ newpmd = swp_entry_to_pmd(entry);
+ set_pmd_at(mm, addr, pmd, newpmd);
+ }
+ goto unlock;
+ }
+#endif
+
/*
* Avoid trapping faults against the zero page. The read-only
* data is likely to be read-cached on the local CPU and
@@ -1832,7 +1876,8 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
{
spinlock_t *ptl;
ptl = pmd_lock(vma->vm_mm, pmd);
- if (likely(pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
+ if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
+ pmd_devmap(*pmd)))
return ptl;
spin_unlock(ptl);
return NULL;
@@ -1950,14 +1995,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
struct page *page;
pgtable_t pgtable;
pmd_t _pmd;
- bool young, write, dirty, soft_dirty;
+ bool young, write, dirty, soft_dirty, pmd_migration = false;
unsigned long addr;
int i;

VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
- VM_BUG_ON(!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd));
+ VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
+ && !pmd_devmap(*pmd));

count_vm_event(THP_SPLIT_PMD);

@@ -1982,7 +2028,16 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
return __split_huge_zero_page_pmd(vma, haddr, pmd);
}

- page = pmd_page(*pmd);
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+ pmd_migration = is_pmd_migration_entry(*pmd);
+ if (pmd_migration) {
+ swp_entry_t entry;
+
+ entry = pmd_to_swp_entry(*pmd);
+ page = pfn_to_page(swp_offset(entry));
+ } else
+#endif
+ page = pmd_page(*pmd);
VM_BUG_ON_PAGE(!page_count(page), page);
page_ref_add(page, HPAGE_PMD_NR - 1);
write = pmd_write(*pmd);
@@ -2001,7 +2056,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
* transferred to avoid any possibility of altering
* permissions across VMAs.
*/
- if (freeze) {
+ if (freeze || pmd_migration) {
swp_entry_t swp_entry;
swp_entry = make_migration_entry(page + i, write);
entry = swp_entry_to_pte(swp_entry);
@@ -2100,7 +2155,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
page = pmd_page(*pmd);
if (PageMlocked(page))
clear_page_mlock(page);
- } else if (!pmd_devmap(*pmd))
+ } else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd)))
goto out;
__split_huge_pmd_locked(vma, pmd, haddr, freeze);
out:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 544d47e5cbbd..72caeb5d3c9f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4638,6 +4638,11 @@ static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
struct page *page = NULL;
enum mc_target_type ret = MC_TARGET_NONE;

+ if (unlikely(is_swap_pmd(pmd))) {
+ VM_BUG_ON(IS_ENABLED(CONFIG_MIGRATION) &&
+ !is_pmd_migration_entry(pmd));
+ return ret;
+ }
page = pmd_page(pmd);
VM_BUG_ON_PAGE(!page || !PageHead(page), page);
if (!(mc.flags & MOVE_ANON))
diff --git a/mm/memory.c b/mm/memory.c
index d6a9d40d0548..1260d140ccf8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1036,7 +1036,8 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
src_pmd = pmd_offset(src_pud, addr);
do {
next = pmd_addr_end(addr, end);
- if (pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)) {
+ if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
+ || pmd_devmap(*src_pmd)) {
int err;
VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, vma);
err = copy_huge_pmd(dst_mm, src_mm,
@@ -1296,7 +1297,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
- if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+ if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE) {
VM_BUG_ON_VMA(vma_is_anonymous(vma) &&
!rwsem_is_locked(&tlb->mm->mmap_sem), vma);
@@ -3804,6 +3805,13 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
pmd_t orig_pmd = *vmf.pmd;

barrier();
+ if (unlikely(is_swap_pmd(orig_pmd))) {
+ VM_BUG_ON(IS_ENABLED(CONFIG_MIGRATION) &&
+ !is_pmd_migration_entry(orig_pmd));
+ if (is_pmd_migration_entry(orig_pmd))
+ pmd_migration_entry_wait(mm, vmf.pmd);
+ return 0;
+ }
if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
return do_huge_pmd_numa_page(&vmf, orig_pmd);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 1a8c9ca83e48..d60a1eedcc54 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -148,7 +148,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
unsigned long this_pages;

next = pmd_addr_end(addr, end);
- if (!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
+ if (!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
&& pmd_none_or_clear_bad(pmd))
continue;

@@ -158,7 +158,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
mmu_notifier_invalidate_range_start(mm, mni_start, end);
}

- if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+ if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE) {
__split_huge_pmd(vma, pmd, addr, false, NULL);
} else {
diff --git a/mm/mremap.c b/mm/mremap.c
index cd8a1b199ef9..1c49b9fb994a 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -222,7 +222,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
if (!new_pmd)
break;
- if (pmd_trans_huge(*old_pmd)) {
+ if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd)) {
if (extent == HPAGE_PMD_SIZE) {
bool moved;
/* See comment in move_ptes() */
--
2.11.0

2017-07-01 13:42:06

by Zi Yan

Subject: [PATCH v8 07/10] mm: soft-dirty: keep soft-dirty bits over thp migration

From: Naoya Horiguchi <[email protected]>

The soft dirty bit is designed to be preserved across page migration. This patch
makes it work in the same manner for thp migration too.

---
ChangeLog v1 -> v2:
- separate diff moving _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1
- clear_soft_dirty_pmd can handle migration entry

Signed-off-by: Naoya Horiguchi <[email protected]>

ChangeLog v1 -> v5:
- read soft dirty bit from correct place (*src_pmd) in copy_huge_pmd()
- add missing soft dirty bit transfer in change_huge_pmd()

Signed-off-by: Zi Yan <[email protected]>
---
arch/x86/include/asm/pgtable.h | 17 +++++++++++++++++
fs/proc/task_mmu.c | 27 ++++++++++++++++-----------
include/asm-generic/pgtable.h | 34 +++++++++++++++++++++++++++++++++-
include/linux/swapops.h | 2 ++
mm/huge_memory.c | 27 ++++++++++++++++++++++++---
5 files changed, 92 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index f5af95a0c6b8..54fc6da8bdf0 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1153,6 +1153,23 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
{
return pte_clear_flags(pte, _PAGE_SWP_SOFT_DIRTY);
}
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
+{
+ return pmd_set_flags(pmd, _PAGE_SWP_SOFT_DIRTY);
+}
+
+static inline int pmd_swp_soft_dirty(pmd_t pmd)
+{
+ return pmd_flags(pmd) & _PAGE_SWP_SOFT_DIRTY;
+}
+
+static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
+{
+ return pmd_clear_flags(pmd, _PAGE_SWP_SOFT_DIRTY);
+}
+#endif
#endif

#define PKRU_AD_BIT 0x1
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 01ad4101ef7b..1deab9a31802 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -904,17 +904,22 @@ static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
{
pmd_t pmd = *pmdp;

- /* See comment in change_huge_pmd() */
- pmdp_invalidate(vma, addr, pmdp);
- if (pmd_dirty(*pmdp))
- pmd = pmd_mkdirty(pmd);
- if (pmd_young(*pmdp))
- pmd = pmd_mkyoung(pmd);
-
- pmd = pmd_wrprotect(pmd);
- pmd = pmd_clear_soft_dirty(pmd);
-
- set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+ if (pmd_present(pmd)) {
+ /* See comment in change_huge_pmd() */
+ pmdp_invalidate(vma, addr, pmdp);
+ if (pmd_dirty(*pmdp))
+ pmd = pmd_mkdirty(pmd);
+ if (pmd_young(*pmdp))
+ pmd = pmd_mkyoung(pmd);
+
+ pmd = pmd_wrprotect(pmd);
+ pmd = pmd_clear_soft_dirty(pmd);
+
+ set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+ } else if (is_migration_entry(pmd_to_swp_entry(pmd))) {
+ pmd = pmd_swp_clear_soft_dirty(pmd);
+ set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+ }
}
#else
static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 88119351fecc..34ef631c5964 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -618,7 +618,24 @@ static inline void ptep_modify_prot_commit(struct mm_struct *mm,
#define arch_start_context_switch(prev) do {} while (0)
#endif

-#ifndef CONFIG_HAVE_ARCH_SOFT_DIRTY
+#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
+#ifndef CONFIG_ARCH_ENABLE_THP_MIGRATION
+static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
+{
+ return pmd;
+}
+
+static inline int pmd_swp_soft_dirty(pmd_t pmd)
+{
+ return 0;
+}
+
+static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
+{
+ return pmd;
+}
+#endif
+#else /* !CONFIG_HAVE_ARCH_SOFT_DIRTY */
static inline int pte_soft_dirty(pte_t pte)
{
return 0;
@@ -663,6 +680,21 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
{
return pte;
}
+
+static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
+{
+ return pmd;
+}
+
+static inline int pmd_swp_soft_dirty(pmd_t pmd)
+{
+ return 0;
+}
+
+static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
+{
+ return pmd;
+}
#endif

#ifndef __HAVE_PFNMAP_TRACKING
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index c8c6511750f1..acf37fb9136a 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -179,6 +179,8 @@ static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
{
swp_entry_t arch_entry;

+ if (pmd_swp_soft_dirty(pmd))
+ pmd = pmd_swp_clear_soft_dirty(pmd);
arch_entry = __pmd_to_swp_entry(pmd);
return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index acf2d7a5953f..1a1cc6d4df5f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -924,6 +924,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
if (is_write_migration_entry(entry)) {
make_migration_entry_read(&entry);
pmd = swp_entry_to_pmd(entry);
+ if (pmd_swp_soft_dirty(*src_pmd))
+ pmd = pmd_swp_mksoft_dirty(pmd);
set_pmd_at(src_mm, addr, src_pmd, pmd);
}
set_pmd_at(dst_mm, addr, dst_pmd, pmd);
@@ -1714,6 +1716,17 @@ static inline int pmd_move_must_withdraw(spinlock_t *new_pmd_ptl,
}
#endif

+static pmd_t move_soft_dirty_pmd(pmd_t pmd)
+{
+#ifdef CONFIG_MEM_SOFT_DIRTY
+ if (unlikely(is_pmd_migration_entry(pmd)))
+ pmd = pmd_swp_mksoft_dirty(pmd);
+ else if (pmd_present(pmd))
+ pmd = pmd_mksoft_dirty(pmd);
+#endif
+ return pmd;
+}
+
bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
unsigned long new_addr, unsigned long old_end,
pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
@@ -1756,7 +1769,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
}
- set_pmd_at(mm, new_addr, new_pmd, pmd_mksoft_dirty(pmd));
+ pmd = move_soft_dirty_pmd(pmd);
+ set_pmd_at(mm, new_addr, new_pmd, pmd);
if (new_ptl != old_ptl)
spin_unlock(new_ptl);
if (force_flush)
@@ -1805,6 +1819,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
*/
make_migration_entry_read(&entry);
newpmd = swp_entry_to_pmd(entry);
+ if (pmd_swp_soft_dirty(*pmd))
+ newpmd = pmd_swp_mksoft_dirty(newpmd);
set_pmd_at(mm, addr, pmd, newpmd);
}
goto unlock;
@@ -2775,6 +2791,7 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
unsigned long address = pvmw->address;
pmd_t pmdval;
swp_entry_t entry;
+ pmd_t pmdswp;

if (!(pvmw->pmd && !pvmw->pte))
return;
@@ -2788,8 +2805,10 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
if (pmd_dirty(pmdval))
set_page_dirty(page);
entry = make_migration_entry(page, pmd_write(pmdval));
- pmdval = swp_entry_to_pmd(entry);
- set_pmd_at(mm, address, pvmw->pmd, pmdval);
+ pmdswp = swp_entry_to_pmd(entry);
+ if (pmd_soft_dirty(pmdval))
+ pmdswp = pmd_swp_mksoft_dirty(pmdswp);
+ set_pmd_at(mm, address, pvmw->pmd, pmdswp);
page_remove_rmap(page, true);
put_page(page);

@@ -2812,6 +2831,8 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
entry = pmd_to_swp_entry(*pvmw->pmd);
get_page(new);
pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));
+ if (pmd_swp_soft_dirty(*pvmw->pmd))
+ pmde = pmd_mksoft_dirty(pmde);
if (is_write_migration_entry(entry))
pmde = maybe_pmd_mkwrite(pmde, vma);

--
2.11.0

2017-07-01 13:42:07

by Zi Yan

[permalink] [raw]
Subject: [PATCH v8 08/10] mm: mempolicy: mbind and migrate_pages support thp migration

From: Naoya Horiguchi <[email protected]>

This patch enables thp migration for mbind(2) and migrate_pages(2).
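
As a userspace illustration (my sketch, not part of the patch): with this
change, an mbind() call with MPOL_MF_MOVE on a THP-backed anonymous region
migrates the 2MB pages whole instead of splitting them first. Roughly
(build with -lnuma):

	#include <numaif.h>
	#include <sys/mman.h>
	#include <stdio.h>
	#include <string.h>

	#define SIZE (4UL << 20)	/* 4MB, i.e. two 2MB THPs */

	int main(void)
	{
		unsigned long nodemask = 1UL << 1;	/* target node 1 */
		void *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;
		madvise(p, SIZE, MADV_HUGEPAGE);	/* encourage THP */
		memset(p, 0, SIZE);			/* fault the pages in */

		/* bind to node 1 and move the already-faulted pages there */
		if (mbind(p, SIZE, MPOL_BIND, &nodemask,
			  8 * sizeof(nodemask), MPOL_MF_MOVE))
			perror("mbind");
		return 0;
	}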

ChangeLog v1 -> v2:
- support pte-mapped and doubly-mapped thp

Signed-off-by: Naoya Horiguchi <[email protected]>

ChangeLog v2 -> v6:
- use the same gfp flag (GFP_TRANSHUGE) in mbind() and migrate_pages()
for thp allocations.

Signed-off-by: Zi Yan <[email protected]>
---
mm/mempolicy.c | 108 +++++++++++++++++++++++++++++++++++++++++----------------
1 file changed, 79 insertions(+), 29 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 58166bf1d1fd..088e6562f6f4 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -97,6 +97,7 @@
#include <linux/mm_inline.h>
#include <linux/mmu_notifier.h>
#include <linux/printk.h>
+#include <linux/swapops.h>

#include <asm/tlbflush.h>
#include <linux/uaccess.h>
@@ -426,6 +427,49 @@ static inline bool queue_pages_required(struct page *page,
return node_isset(nid, *qp->nmask) == !(flags & MPOL_MF_INVERT);
}

+static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+{
+ int ret = 0;
+ struct page *page;
+ struct queue_pages *qp = walk->private;
+ unsigned long flags;
+
+ if (unlikely(is_pmd_migration_entry(*pmd))) {
+ ret = 1;
+ goto unlock;
+ }
+ page = pmd_page(*pmd);
+ if (is_huge_zero_page(page)) {
+ spin_unlock(ptl);
+ __split_huge_pmd(walk->vma, pmd, addr, false, NULL);
+ goto out;
+ }
+ if (!thp_migration_supported()) {
+ get_page(page);
+ spin_unlock(ptl);
+ lock_page(page);
+ ret = split_huge_page(page);
+ unlock_page(page);
+ put_page(page);
+ goto out;
+ }
+ if (!queue_pages_required(page, qp)) {
+ ret = 1;
+ goto unlock;
+ }
+
+ ret = 1;
+ flags = qp->flags;
+ /* go to thp migration */
+ if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+ migrate_page_add(page, qp->pagelist, flags);
+unlock:
+ spin_unlock(ptl);
+out:
+ return ret;
+}
+
/*
* Scan through pages checking if pages follow certain conditions,
* and move them to the pagelist if they do.
@@ -437,30 +481,15 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
struct page *page;
struct queue_pages *qp = walk->private;
unsigned long flags = qp->flags;
- int nid, ret;
+ int ret;
pte_t *pte;
spinlock_t *ptl;

- if (pmd_trans_huge(*pmd)) {
- ptl = pmd_lock(walk->mm, pmd);
- if (pmd_trans_huge(*pmd)) {
- page = pmd_page(*pmd);
- if (is_huge_zero_page(page)) {
- spin_unlock(ptl);
- __split_huge_pmd(vma, pmd, addr, false, NULL);
- } else {
- get_page(page);
- spin_unlock(ptl);
- lock_page(page);
- ret = split_huge_page(page);
- unlock_page(page);
- put_page(page);
- if (ret)
- return 0;
- }
- } else {
- spin_unlock(ptl);
- }
+ ptl = pmd_trans_huge_lock(pmd, vma);
+ if (ptl) {
+ ret = queue_pages_pmd(pmd, ptl, addr, end, walk);
+ if (ret)
+ return 0;
}

if (pmd_trans_unstable(pmd))
@@ -481,7 +510,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
continue;
if (!queue_pages_required(page, qp))
continue;
- if (PageTransCompound(page)) {
+ if (PageTransCompound(page) && !thp_migration_supported()) {
get_page(page);
pte_unmap_unlock(pte, ptl);
lock_page(page);
@@ -898,19 +927,21 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,

#ifdef CONFIG_MIGRATION
/*
- * page migration
+ * page migration, thp tail pages can be passed.
*/
static void migrate_page_add(struct page *page, struct list_head *pagelist,
unsigned long flags)
{
+ struct page *head = compound_head(page);
/*
* Avoid migrating a page that is shared with others.
*/
- if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1) {
- if (!isolate_lru_page(page)) {
- list_add_tail(&page->lru, pagelist);
- inc_node_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(head) == 1) {
+ if (!isolate_lru_page(head)) {
+ list_add_tail(&head->lru, pagelist);
+ mod_node_page_state(page_pgdat(head),
+ NR_ISOLATED_ANON + page_is_file_cache(head),
+ hpage_nr_pages(head));
}
}
}
@@ -920,7 +951,17 @@ static struct page *new_node_page(struct page *page, unsigned long node, int **x
if (PageHuge(page))
return alloc_huge_page_node(page_hstate(compound_head(page)),
node);
- else
+ else if (thp_migration_supported() && PageTransHuge(page)) {
+ struct page *thp;
+
+ thp = alloc_pages_node(node,
+ (GFP_TRANSHUGE | __GFP_THISNODE),
+ HPAGE_PMD_ORDER);
+ if (!thp)
+ return NULL;
+ prep_transhuge_page(thp);
+ return thp;
+ } else
return __alloc_pages_node(node, GFP_HIGHUSER_MOVABLE |
__GFP_THISNODE, 0);
}
@@ -1086,6 +1127,15 @@ static struct page *new_page(struct page *page, unsigned long start, int **x)
if (PageHuge(page)) {
BUG_ON(!vma);
return alloc_huge_page_noerr(vma, address, 1);
+ } else if (thp_migration_supported() && PageTransHuge(page)) {
+ struct page *thp;
+
+ thp = alloc_hugepage_vma(GFP_TRANSHUGE, vma, address,
+ HPAGE_PMD_ORDER);
+ if (!thp)
+ return NULL;
+ prep_transhuge_page(thp);
+ return thp;
}
/*
* if !vma, alloc_page_vma() will use task or system default policy
--
2.11.0

2017-07-02 17:57:37

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v8 05/10] mm: thp: enable thp migration in generic path

On Sat, Jul 01, 2017 at 09:40:03AM -0400, Zi Yan wrote:
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1302,6 +1302,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> bool ret = true;
> enum ttu_flags flags = (enum ttu_flags)arg;
>
> +
> /* munlock has nothing to gain from examining un-locked vmas */
> if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
> return true;

With the exception of this useless hunk, it looks good to me.

Acked-by: Kirill A. Shutemov <[email protected]>

--
Kirill A. Shutemov

2017-07-03 01:26:54

by Zi Yan

[permalink] [raw]
Subject: Re: [PATCH v8 05/10] mm: thp: enable thp migration in generic path

On 2 Jul 2017, at 13:57, Kirill A. Shutemov wrote:

> On Sat, Jul 01, 2017 at 09:40:03AM -0400, Zi Yan wrote:
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1302,6 +1302,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> bool ret = true;
>> enum ttu_flags flags = (enum ttu_flags)arg;
>>
>> +
>> /* munlock has nothing to gain from examining un-locked vmas */
>> if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
>> return true;
>
> With the exception of this useless hunk, it looks good to me.
>
> Acked-by: Kirill A. Shutemov <[email protected]>
>

Thanks.

BTW, is this Acked-by for Patch 5 or for both Patch 5 and 6?

--
Best Regards
Yan Zi


Attachments:
signature.asc (496.00 B)
OpenPGP digital signature

2017-07-11 06:50:40

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v8 05/10] mm: thp: enable thp migration in generic path

On Sat, Jul 01, 2017 at 09:40:03AM -0400, Zi Yan wrote:
> From: Zi Yan <[email protected]>
>
> This patch adds thp migration's core code, including conversions
> between a PMD entry and a swap entry, setting PMD migration entry,
> removing PMD migration entry, and waiting on PMD migration entries.
>
> This patch makes it possible to support thp migration.
> If you fail to allocate a destination page as a thp, you just split
> the source thp as we do now, and then enter the normal page migration.
> If you succeed to allocate destination thp, you enter thp migration.
> Subsequent patches actually enable thp migration for each caller of
> page migration by allowing its get_new_page() callback to
> allocate thps.
>
> ChangeLog v1 -> v2:
> - support pte-mapped thp, doubly-mapped thp
>
> Signed-off-by: Naoya Horiguchi <[email protected]>
>
> ChangeLog v2 -> v3:
> - use page_vma_mapped_walk()
> - use pmdp_huge_clear_flush() instead of pmdp_huge_get_and_clear() in
> set_pmd_migration_entry()
>
> ChangeLog v3 -> v4:
> - factor out the code of removing pte pgtable page in zap_huge_pmd()
>
> ChangeLog v4 -> v5:
> - remove unnecessary PTE-mapped THP code in remove_migration_pmd()
> and set_pmd_migration_entry()
> - restructure the code in zap_huge_pmd() to avoid factoring out
> the pte pgtable page code
> - in zap_huge_pmd(), check that PMD swap entries are migration entries
> - change author information
>
> ChangeLog v5 -> v7
> - use macro to disable the code when thp migration is not enabled
>
> ChangeLog v7 -> v8
> - use IS_ENABLED instead of macro to make code look clean in
> zap_huge_pmd() and page_vma_mapped_walk()
> - remove BUILD_BUG() in pmd_to_swp_entry() and swp_entry_to_pmd() to
> avoid compilation error
> - rename variable 'migration' to 'flush_needed' and invert the logic in
> zap_huge_pmd() to make code more descriptive
> - use pmdp_invalidate() in set_pmd_migration_entry() to avoid race
> with MADV_DONTNEED
> - remove unnecessary tlb flush in remove_migration_pmd()
> - add the missing migration flag check in page_vma_mapped_walk()
>
> Signed-off-by: Zi Yan <[email protected]>
> Cc: Kirill A. Shutemov <[email protected]>
> ---
> arch/x86/include/asm/pgtable_64.h | 2 +
> include/linux/swapops.h | 67 ++++++++++++++++++++++++++++++-
> mm/huge_memory.c | 84 ++++++++++++++++++++++++++++++++++++---
> mm/migrate.c | 32 ++++++++++++++-
> mm/page_vma_mapped.c | 18 +++++++--
> mm/pgtable-generic.c | 3 +-
> mm/rmap.c | 13 ++++++
> 7 files changed, 207 insertions(+), 12 deletions(-)
>
...

> diff --git a/mm/rmap.c b/mm/rmap.c
> index 91948fbbb0bb..b28f633cd569 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1302,6 +1302,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> bool ret = true;
> enum ttu_flags flags = (enum ttu_flags)arg;
>
> +
> /* munlock has nothing to gain from examining un-locked vmas */
> if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
> return true;
> @@ -1312,6 +1313,18 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> }
>
> while (page_vma_mapped_walk(&pvmw)) {
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> + /* PMD-mapped THP migration entry */
> + if (flags & TTU_MIGRATION) {

My testing based on mmotm-2017-07-06-16-18 showed that migrating shmem thp
caused a kernel crash. I don't think this is critical because that case is
simply not prepared for yet. So in order to avoid the crash, please add a
PageAnon(page) check here. This makes shmem thp migration just fail.

+ if (!PageAnon(page))
+ continue;

> + if (!pvmw.pte && page) {

Just out of curiosity, do we really need this page check?
try_to_unmap() always passes the 'page' parameter down to try_to_unmap_one()
via the rmap_walk_* family, so I think we can assume page is always non-NULL.

Thanks,
Naoya Horiguchi

> + VM_BUG_ON_PAGE(PageHuge(page) || !PageTransCompound(page),
> + page);
> + set_pmd_migration_entry(&pvmw, page);
> + continue;
> + }
> + }
> +#endif
> +
> /*
> * If the page is mlock()d, we cannot swap it out.
> * If it's recently referenced (perhaps page_referenced
> --
> 2.11.0
>
>

2017-07-11 14:00:39

by Zi Yan

[permalink] [raw]
Subject: Re: [PATCH v8 05/10] mm: thp: enable thp migration in generic path

On 11 Jul 2017, at 2:47, Naoya Horiguchi wrote:

> On Sat, Jul 01, 2017 at 09:40:03AM -0400, Zi Yan wrote:
>> From: Zi Yan <[email protected]>
>>
>> This patch adds thp migration's core code, including conversions
>> between a PMD entry and a swap entry, setting PMD migration entry,
>> removing PMD migration entry, and waiting on PMD migration entries.
>>
>> This patch makes it possible to support thp migration.
>> If you fail to allocate a destination page as a thp, you just split
>> the source thp as we do now, and then enter the normal page migration.
>> If you succeed to allocate destination thp, you enter thp migration.
>> Subsequent patches actually enable thp migration for each caller of
>> page migration by allowing its get_new_page() callback to
>> allocate thps.
>>
>> ChangeLog v1 -> v2:
>> - support pte-mapped thp, doubly-mapped thp
>>
>> Signed-off-by: Naoya Horiguchi <[email protected]>
>>
>> ChangeLog v2 -> v3:
>> - use page_vma_mapped_walk()
>> - use pmdp_huge_clear_flush() instead of pmdp_huge_get_and_clear() in
>> set_pmd_migration_entry()
>>
>> ChangeLog v3 -> v4:
>> - factor out the code of removing pte pgtable page in zap_huge_pmd()
>>
>> ChangeLog v4 -> v5:
>> - remove unnecessary PTE-mapped THP code in remove_migration_pmd()
>> and set_pmd_migration_entry()
>> - restructure the code in zap_huge_pmd() to avoid factoring out
>> the pte pgtable page code
>> - in zap_huge_pmd(), check that PMD swap entries are migration entries
>> - change author information
>>
>> ChangeLog v5 -> v7
>> - use macro to disable the code when thp migration is not enabled
>>
>> ChangeLog v7 -> v8
>> - use IS_ENABLED instead of macro to make code look clean in
>> zap_huge_pmd() and page_vma_mapped_walk()
>> - remove BUILD_BUG() in pmd_to_swp_entry() and swp_entry_to_pmd() to
>> avoid compilation error
>> - rename variable 'migration' to 'flush_needed' and invert the logic in
>> zap_huge_pmd() to make code more descriptive
>> - use pmdp_invalidate() in set_pmd_migration_entry() to avoid race
>> with MADV_DONTNEED
>> - remove unnecessary tlb flush in remove_migration_pmd()
>> - add the missing migration flag check in page_vma_mapped_walk()
>>
>> Signed-off-by: Zi Yan <[email protected]>
>> Cc: Kirill A. Shutemov <[email protected]>
>> ---
>> arch/x86/include/asm/pgtable_64.h | 2 +
>> include/linux/swapops.h | 67 ++++++++++++++++++++++++++++++-
>> mm/huge_memory.c | 84 ++++++++++++++++++++++++++++++++++++---
>> mm/migrate.c | 32 ++++++++++++++-
>> mm/page_vma_mapped.c | 18 +++++++--
>> mm/pgtable-generic.c | 3 +-
>> mm/rmap.c | 13 ++++++
>> 7 files changed, 207 insertions(+), 12 deletions(-)
>>
> ...
>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 91948fbbb0bb..b28f633cd569 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1302,6 +1302,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> bool ret = true;
>> enum ttu_flags flags = (enum ttu_flags)arg;
>>
>> +
>> /* munlock has nothing to gain from examining un-locked vmas */
>> if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
>> return true;
>> @@ -1312,6 +1313,18 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> }
>>
>> while (page_vma_mapped_walk(&pvmw)) {
>> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
>> + /* PMD-mapped THP migration entry */
>> + if (flags & TTU_MIGRATION) {
>
> My testing based on mmotm-2017-07-06-16-18 showed that migrating shmem thp
> caused kernel crash. I don't think this is critical because that case is
> just not-prepared yet. So in order to avoid the crash, please add
> PageAnon(page) check here. This makes shmem thp migration just fail.
>
> + if (!PageAnon(page))
> + continue;
>

Thanks for your testing. I will add this check in my next version.


>> + if (!pvmw.pte && page) {
>
> Just from curiosity, do we really need this page check?
> try_to_unmap() always passes down the parameter 'page' to try_to_unmap_one()
> via rmap_walk_* family, so I think we can assume page is always non-NULL.

You are right. Checking page is not necessary here. I will remove it in my
next version.



--
Best Regards
Yan Zi


Attachments:
signature.asc (496.00 B)
OpenPGP digital signature

2017-07-13 09:34:17

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v8 05/10] mm: thp: enable thp migration in generic path

On Tue, Jul 11, 2017 at 10:00:30AM -0400, Zi Yan wrote:
> On 11 Jul 2017, at 2:47, Naoya Horiguchi wrote:
>
> > On Sat, Jul 01, 2017 at 09:40:03AM -0400, Zi Yan wrote:
> >> From: Zi Yan <[email protected]>
> >>
> >> This patch adds thp migration's core code, including conversions
> >> between a PMD entry and a swap entry, setting PMD migration entry,
> >> removing PMD migration entry, and waiting on PMD migration entries.
> >>
> >> This patch makes it possible to support thp migration.
> >> If you fail to allocate a destination page as a thp, you just split
> >> the source thp as we do now, and then enter the normal page migration.
> >> If you succeed to allocate destination thp, you enter thp migration.
> >> Subsequent patches actually enable thp migration for each caller of
> >> page migration by allowing its get_new_page() callback to
> >> allocate thps.
> >>
> >> ChangeLog v1 -> v2:
> >> - support pte-mapped thp, doubly-mapped thp
> >>
> >> Signed-off-by: Naoya Horiguchi <[email protected]>
> >>
> >> ChangeLog v2 -> v3:
> >> - use page_vma_mapped_walk()
> >> - use pmdp_huge_clear_flush() instead of pmdp_huge_get_and_clear() in
> >> set_pmd_migration_entry()
> >>
> >> ChangeLog v3 -> v4:
> >> - factor out the code of removing pte pgtable page in zap_huge_pmd()
> >>
> >> ChangeLog v4 -> v5:
> >> - remove unnecessary PTE-mapped THP code in remove_migration_pmd()
> >> and set_pmd_migration_entry()
> >> - restructure the code in zap_huge_pmd() to avoid factoring out
> >> the pte pgtable page code
> >> - in zap_huge_pmd(), check that PMD swap entries are migration entries
> >> - change author information
> >>
> >> ChangeLog v5 -> v7
> >> - use macro to disable the code when thp migration is not enabled
> >>
> >> ChangeLog v7 -> v8
> >> - use IS_ENABLED instead of macro to make code look clean in
> >> zap_huge_pmd() and page_vma_mapped_walk()
> >> - remove BUILD_BUG() in pmd_to_swp_entry() and swp_entry_to_pmd() to
> >> avoid compilation error
> >> - rename variable 'migration' to 'flush_needed' and invert the logic in
> >> zap_huge_pmd() to make code more descriptive
> >> - use pmdp_invalidate() in set_pmd_migration_entry() to avoid race
> >> with MADV_DONTNEED
> >> - remove unnecessary tlb flush in remove_migration_pmd()
> >> - add the missing migration flag check in page_vma_mapped_walk()
> >>
> >> Signed-off-by: Zi Yan <[email protected]>
> >> Cc: Kirill A. Shutemov <[email protected]>
> >> ---
> >> arch/x86/include/asm/pgtable_64.h | 2 +
> >> include/linux/swapops.h | 67 ++++++++++++++++++++++++++++++-
> >> mm/huge_memory.c | 84 ++++++++++++++++++++++++++++++++++++---
> >> mm/migrate.c | 32 ++++++++++++++-
> >> mm/page_vma_mapped.c | 18 +++++++--
> >> mm/pgtable-generic.c | 3 +-
> >> mm/rmap.c | 13 ++++++
> >> 7 files changed, 207 insertions(+), 12 deletions(-)
> >>
> > ...
> >
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index 91948fbbb0bb..b28f633cd569 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -1302,6 +1302,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >> bool ret = true;
> >> enum ttu_flags flags = (enum ttu_flags)arg;
> >>
> >> +
> >> /* munlock has nothing to gain from examining un-locked vmas */
> >> if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
> >> return true;
> >> @@ -1312,6 +1313,18 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >> }
> >>
> >> while (page_vma_mapped_walk(&pvmw)) {
> >> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> >> + /* PMD-mapped THP migration entry */
> >> + if (flags & TTU_MIGRATION) {
> >
> > My testing based on mmotm-2017-07-06-16-18 showed that migrating shmem thp
> > caused kernel crash. I don't think this is critical because that case is
> > just not-prepared yet. So in order to avoid the crash, please add
> > PageAnon(page) check here. This makes shmem thp migration just fail.
> >
> > + if (!PageAnon(page))
> > + continue;
> >
>
> Thanks for your testing. I will add this check in my next version.

Sorry, the code I'm suggesting above doesn't work because it makes normal
pagecache migration fail. This check should come after making sure that
pvmw.pte is NULL.

Thanks,
Naoya Horiguchi

2017-07-13 11:28:34

by Zi Yan

[permalink] [raw]
Subject: Re: [PATCH v8 05/10] mm: thp: enable thp migration in generic path

On 13 Jul 2017, at 5:30, Naoya Horiguchi wrote:

> On Tue, Jul 11, 2017 at 10:00:30AM -0400, Zi Yan wrote:
>> On 11 Jul 2017, at 2:47, Naoya Horiguchi wrote:
>>
>>> On Sat, Jul 01, 2017 at 09:40:03AM -0400, Zi Yan wrote:
>>>> From: Zi Yan <[email protected]>
>>>>
>>>> This patch adds thp migration's core code, including conversions
>>>> between a PMD entry and a swap entry, setting PMD migration entry,
>>>> removing PMD migration entry, and waiting on PMD migration entries.
>>>>
>>>> This patch makes it possible to support thp migration.
>>>> If you fail to allocate a destination page as a thp, you just split
>>>> the source thp as we do now, and then enter the normal page migration.
>>>> If you succeed to allocate destination thp, you enter thp migration.
>>>> Subsequent patches actually enable thp migration for each caller of
>>>> page migration by allowing its get_new_page() callback to
>>>> allocate thps.
>>>>
>>>> ChangeLog v1 -> v2:
>>>> - support pte-mapped thp, doubly-mapped thp
>>>>
>>>> Signed-off-by: Naoya Horiguchi <[email protected]>
>>>>
>>>> ChangeLog v2 -> v3:
>>>> - use page_vma_mapped_walk()
>>>> - use pmdp_huge_clear_flush() instead of pmdp_huge_get_and_clear() in
>>>> set_pmd_migration_entry()
>>>>
>>>> ChangeLog v3 -> v4:
>>>> - factor out the code of removing pte pgtable page in zap_huge_pmd()
>>>>
>>>> ChangeLog v4 -> v5:
>>>> - remove unnecessary PTE-mapped THP code in remove_migration_pmd()
>>>> and set_pmd_migration_entry()
>>>> - restructure the code in zap_huge_pmd() to avoid factoring out
>>>> the pte pgtable page code
>>>> - in zap_huge_pmd(), check that PMD swap entries are migration entries
>>>> - change author information
>>>>
>>>> ChangeLog v5 -> v7
>>>> - use macro to disable the code when thp migration is not enabled
>>>>
>>>> ChangeLog v7 -> v8
>>>> - use IS_ENABLED instead of macro to make code look clean in
>>>> zap_huge_pmd() and page_vma_mapped_walk()
>>>> - remove BUILD_BUG() in pmd_to_swp_entry() and swp_entry_to_pmd() to
>>>> avoid compilation error
>>>> - rename variable 'migration' to 'flush_needed' and invert the logic in
>>>> zap_huge_pmd() to make code more descriptive
>>>> - use pmdp_invalidate() in set_pmd_migration_entry() to avoid race
>>>> with MADV_DONTNEED
>>>> - remove unnecessary tlb flush in remove_migration_pmd()
>>>> - add the missing migration flag check in page_vma_mapped_walk()
>>>>
>>>> Signed-off-by: Zi Yan <[email protected]>
>>>> Cc: Kirill A. Shutemov <[email protected]>
>>>> ---
>>>> arch/x86/include/asm/pgtable_64.h | 2 +
>>>> include/linux/swapops.h | 67 ++++++++++++++++++++++++++++++-
>>>> mm/huge_memory.c | 84 ++++++++++++++++++++++++++++++++++++---
>>>> mm/migrate.c | 32 ++++++++++++++-
>>>> mm/page_vma_mapped.c | 18 +++++++--
>>>> mm/pgtable-generic.c | 3 +-
>>>> mm/rmap.c | 13 ++++++
>>>> 7 files changed, 207 insertions(+), 12 deletions(-)
>>>>
>>> ...
>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 91948fbbb0bb..b28f633cd569 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -1302,6 +1302,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>> bool ret = true;
>>>> enum ttu_flags flags = (enum ttu_flags)arg;
>>>>
>>>> +
>>>> /* munlock has nothing to gain from examining un-locked vmas */
>>>> if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
>>>> return true;
>>>> @@ -1312,6 +1313,18 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>> }
>>>>
>>>> while (page_vma_mapped_walk(&pvmw)) {
>>>> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
>>>> + /* PMD-mapped THP migration entry */
>>>> + if (flags & TTU_MIGRATION) {
>>>
>>> My testing based on mmotm-2017-07-06-16-18 showed that migrating shmem thp
>>> caused kernel crash. I don't think this is critical because that case is
>>> just not-prepared yet. So in order to avoid the crash, please add
>>> PageAnon(page) check here. This makes shmem thp migration just fail.
>>>
>>> + if (!PageAnon(page))
>>> + continue;
>>>
>>
>> Thanks for your testing. I will add this check in my next version.
>
> Sorry, the code I'm suggesting above doesn't work because it makes normal
> pagecache migration fail. This check should come after making sure that
> pvmw.pte is NULL.

Right. I think the two ifs are confusing. Replacing the chunk with:

if (!pvmw.pte && (flags & TTU_MIGRATION)) {
	VM_BUG_ON_PAGE(PageHuge(page) || !PageTransCompound(page),
		       page);

	if (!PageAnon(page))
		continue;

	set_pmd_migration_entry(&pvmw, page);
	continue;
}

would be better.

BTW, is your page migration test suite available online? If so, I could use
it to test my code.

Thanks.




Best Regards,
Yan Zi


Attachments:
signature.asc (496.00 B)
OpenPGP digital signature

2017-07-14 00:07:58

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v8 05/10] mm: thp: enable thp migration in generic path

On Thu, Jul 13, 2017 at 07:28:24AM -0400, Zi Yan wrote:
> On 13 Jul 2017, at 5:30, Naoya Horiguchi wrote:
>
> > On Tue, Jul 11, 2017 at 10:00:30AM -0400, Zi Yan wrote:
> >> On 11 Jul 2017, at 2:47, Naoya Horiguchi wrote:
> >>
> >>> On Sat, Jul 01, 2017 at 09:40:03AM -0400, Zi Yan wrote:
> >>>> From: Zi Yan <[email protected]>
> >>>>
> >>>> This patch adds thp migration's core code, including conversions
> >>>> between a PMD entry and a swap entry, setting PMD migration entry,
> >>>> removing PMD migration entry, and waiting on PMD migration entries.
> >>>>
> >>>> This patch makes it possible to support thp migration.
> >>>> If you fail to allocate a destination page as a thp, you just split
> >>>> the source thp as we do now, and then enter the normal page migration.
> >>>> If you succeed to allocate destination thp, you enter thp migration.
> >>>> Subsequent patches actually enable thp migration for each caller of
> >>>> page migration by allowing its get_new_page() callback to
> >>>> allocate thps.
> >>>>
> >>>> ChangeLog v1 -> v2:
> >>>> - support pte-mapped thp, doubly-mapped thp
> >>>>
> >>>> Signed-off-by: Naoya Horiguchi <[email protected]>
> >>>>
> >>>> ChangeLog v2 -> v3:
> >>>> - use page_vma_mapped_walk()
> >>>> - use pmdp_huge_clear_flush() instead of pmdp_huge_get_and_clear() in
> >>>> set_pmd_migration_entry()
> >>>>
> >>>> ChangeLog v3 -> v4:
> >>>> - factor out the code of removing pte pgtable page in zap_huge_pmd()
> >>>>
> >>>> ChangeLog v4 -> v5:
> >>>> - remove unnecessary PTE-mapped THP code in remove_migration_pmd()
> >>>> and set_pmd_migration_entry()
> >>>> - restructure the code in zap_huge_pmd() to avoid factoring out
> >>>> the pte pgtable page code
> >>>> - in zap_huge_pmd(), check that PMD swap entries are migration entries
> >>>> - change author information
> >>>>
> >>>> ChangeLog v5 -> v7
> >>>> - use macro to disable the code when thp migration is not enabled
> >>>>
> >>>> ChangeLog v7 -> v8
> >>>> - use IS_ENABLED instead of macro to make code look clean in
> >>>> zap_huge_pmd() and page_vma_mapped_walk()
> >>>> - remove BUILD_BUG() in pmd_to_swp_entry() and swp_entry_to_pmd() to
> >>>> avoid compilation error
> >>>> - rename variable 'migration' to 'flush_needed' and invert the logic in
> >>>> zap_huge_pmd() to make code more descriptive
> >>>> - use pmdp_invalidate() in set_pmd_migration_entry() to avoid race
> >>>> with MADV_DONTNEED
> >>>> - remove unnecessary tlb flush in remove_migration_pmd()
> >>>> - add the missing migration flag check in page_vma_mapped_walk()
> >>>>
> >>>> Signed-off-by: Zi Yan <[email protected]>
> >>>> Cc: Kirill A. Shutemov <[email protected]>
> >>>> ---
> >>>> arch/x86/include/asm/pgtable_64.h | 2 +
> >>>> include/linux/swapops.h | 67 ++++++++++++++++++++++++++++++-
> >>>> mm/huge_memory.c | 84 ++++++++++++++++++++++++++++++++++++---
> >>>> mm/migrate.c | 32 ++++++++++++++-
> >>>> mm/page_vma_mapped.c | 18 +++++++--
> >>>> mm/pgtable-generic.c | 3 +-
> >>>> mm/rmap.c | 13 ++++++
> >>>> 7 files changed, 207 insertions(+), 12 deletions(-)
> >>>>
> >>> ...
> >>>
> >>>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>>> index 91948fbbb0bb..b28f633cd569 100644
> >>>> --- a/mm/rmap.c
> >>>> +++ b/mm/rmap.c
> >>>> @@ -1302,6 +1302,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >>>> bool ret = true;
> >>>> enum ttu_flags flags = (enum ttu_flags)arg;
> >>>>
> >>>> +
> >>>> /* munlock has nothing to gain from examining un-locked vmas */
> >>>> if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
> >>>> return true;
> >>>> @@ -1312,6 +1313,18 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >>>> }
> >>>>
> >>>> while (page_vma_mapped_walk(&pvmw)) {
> >>>> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> >>>> + /* PMD-mapped THP migration entry */
> >>>> + if (flags & TTU_MIGRATION) {
> >>>
> >>> My testing based on mmotm-2017-07-06-16-18 showed that migrating shmem thp
> >>> caused kernel crash. I don't think this is critical because that case is
> >>> just not-prepared yet. So in order to avoid the crash, please add
> >>> PageAnon(page) check here. This makes shmem thp migration just fail.
> >>>
> >>> + if (!PageAnon(page))
> >>> + continue;
> >>>
> >>
> >> Thanks for your testing. I will add this check in my next version.
> >
> > Sorry, the code I'm suggesting above doesn't work because it makes normal
> > pagecache migration fail. This check should come after making sure that
> > pvmw.pte is NULL.
>
> Right. I think the two ifs are confusing. Replacing the chunk with:
>
> if (!pvmw.pte && (flags & TTU_MIGRATION)) {
> VM_BUG_ON_PAGE(PageHuge(page) || !PageTransCompound(page),
> page);
>
> if (!PageAnon(page))
> continue;
>
> set_pmd_migration_entry(&pvmw, page);
> continue;
> }
>
> would be better.

Yes, it looks good.

>
> BTW, is your page migration test suite available online? If so, I could use
> it to test my code.

Please refer to https://github.com/Naoya-Horiguchi/mm_regression.

Thanks,
Naoya Horiguchi

2017-07-14 09:32:53

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v8 06/10] mm: thp: check pmd migration entry in common path

On Sat, Jul 01, 2017 at 09:40:04AM -0400, Zi Yan wrote:
> From: Zi Yan <[email protected]>
>
> If one of callers of page migration starts to handle thp,
> memory management code start to see pmd migration entry, so we need
> to prepare for it before enabling. This patch changes various code
> point which checks the status of given pmds in order to prevent race
> between thp migration and the pmd-related works.
>
> ChangeLog v1 -> v2:
> - introduce pmd_related() (I know the naming is not good, but can't
> think up no better name. Any suggesntion is welcomed.)
>
> Signed-off-by: Naoya Horiguchi <[email protected]>
>
> ChangeLog v2 -> v3:
> - add is_swap_pmd()
> - a pmd entry should be pmd pointing to pte pages, is_swap_pmd(),
> pmd_trans_huge(), pmd_devmap(), or pmd_none()
> - pmd_none_or_trans_huge_or_clear_bad() and pmd_trans_unstable() return
> true on pmd_migration_entry, so that migration entries are not
> treated as pmd page table entries.
>
> ChangeLog v4 -> v5:
> - add explanation in pmd_none_or_trans_huge_or_clear_bad() to state
> the equivalence of !pmd_present() and is_pmd_migration_entry()
> - fix migration entry wait deadlock code (from v1) in follow_page_mask()
> - remove unnecessary code (from v1) in follow_trans_huge_pmd()
> - use is_swap_pmd() instead of !pmd_present() for pmd migration entry,
> so it will not be confused with pmd_none()
> - change author information
>
> ChangeLog v5 -> v7
> - use macro to disable the code when thp migration is not enabled
>
> ChangeLog v7 -> v8
> - remove not used code in do_huge_pmd_wp_page()
> - copy the comment from change_pte_range() on downgrading
> write migration entry to read to change_huge_pmd()
>
> Signed-off-by: Zi Yan <[email protected]>
> Cc: Kirill A. Shutemov <[email protected]>
> ---
> arch/x86/mm/gup.c | 7 +++--
> fs/proc/task_mmu.c | 33 ++++++++++++++-------
> include/asm-generic/pgtable.h | 17 ++++++++++-
> include/linux/huge_mm.h | 14 +++++++--
> mm/gup.c | 22 ++++++++++++--
> mm/huge_memory.c | 67 +++++++++++++++++++++++++++++++++++++++----
> mm/memcontrol.c | 5 ++++
> mm/memory.c | 12 ++++++--
> mm/mprotect.c | 4 +--
> mm/mremap.c | 2 +-
> 10 files changed, 154 insertions(+), 29 deletions(-)
>
> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> index 456dfdfd2249..096bbcc801e6 100644
> --- a/arch/x86/mm/gup.c
> +++ b/arch/x86/mm/gup.c
> @@ -9,6 +9,7 @@
> #include <linux/vmstat.h>
> #include <linux/highmem.h>
> #include <linux/swap.h>
> +#include <linux/swapops.h>
> #include <linux/memremap.h>
>
> #include <asm/mmu_context.h>
> @@ -243,9 +244,11 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> pmd_t pmd = *pmdp;
>
> next = pmd_addr_end(addr, end);
> - if (pmd_none(pmd))
> + if (!pmd_present(pmd)) {
> + VM_BUG_ON(is_swap_pmd(pmd) && IS_ENABLED(CONFIG_MIGRATION) &&
> + !is_pmd_migration_entry(pmd));

This VM_BUG_ON() triggers when gup is called on a hugetlb hwpoison entry.
I think that in such a case the kernel falls into the gup slow path, and
a page fault in follow_hugetlb_page() can properly report the error to the
affected processes, so there is no need to raise an alarm with a BUG_ON.

Could you make this VM_BUG_ON more specific, or just remove it?

Thanks,
Naoya Horiguchi

> return 0;
> - if (unlikely(pmd_large(pmd) || !pmd_present(pmd))) {
> + } else if (unlikely(pmd_large(pmd))) {
> /*
> * NUMA hinting faults need to be handled in the GUP
> * slowpath for accounting purposes and so that they

2017-07-14 18:29:12

by Zi Yan

[permalink] [raw]
Subject: Re: [PATCH v8 06/10] mm: thp: check pmd migration entry in common path

On 14 Jul 2017, at 5:29, Naoya Horiguchi wrote:

> On Sat, Jul 01, 2017 at 09:40:04AM -0400, Zi Yan wrote:
>> From: Zi Yan <[email protected]>
>>
>> If one of callers of page migration starts to handle thp,
>> memory management code start to see pmd migration entry, so we need
>> to prepare for it before enabling. This patch changes various code
>> point which checks the status of given pmds in order to prevent race
>> between thp migration and the pmd-related works.
>>
>> ChangeLog v1 -> v2:
>> - introduce pmd_related() (I know the naming is not good, but can't
>> think up no better name. Any suggesntion is welcomed.)
>>
>> Signed-off-by: Naoya Horiguchi <[email protected]>
>>
>> ChangeLog v2 -> v3:
>> - add is_swap_pmd()
>> - a pmd entry should be pmd pointing to pte pages, is_swap_pmd(),
>> pmd_trans_huge(), pmd_devmap(), or pmd_none()
>> - pmd_none_or_trans_huge_or_clear_bad() and pmd_trans_unstable() return
>> true on pmd_migration_entry, so that migration entries are not
>> treated as pmd page table entries.
>>
>> ChangeLog v4 -> v5:
>> - add explanation in pmd_none_or_trans_huge_or_clear_bad() to state
>> the equivalence of !pmd_present() and is_pmd_migration_entry()
>> - fix migration entry wait deadlock code (from v1) in follow_page_mask()
>> - remove unnecessary code (from v1) in follow_trans_huge_pmd()
>> - use is_swap_pmd() instead of !pmd_present() for pmd migration entry,
>> so it will not be confused with pmd_none()
>> - change author information
>>
>> ChangeLog v5 -> v7
>> - use macro to disable the code when thp migration is not enabled
>>
>> ChangeLog v7 -> v8
>> - remove not used code in do_huge_pmd_wp_page()
>> - copy the comment from change_pte_range() on downgrading
>> write migration entry to read to change_huge_pmd()
>>
>> Signed-off-by: Zi Yan <[email protected]>
>> Cc: Kirill A. Shutemov <[email protected]>
>> ---
>> arch/x86/mm/gup.c | 7 +++--
>> fs/proc/task_mmu.c | 33 ++++++++++++++-------
>> include/asm-generic/pgtable.h | 17 ++++++++++-
>> include/linux/huge_mm.h | 14 +++++++--
>> mm/gup.c | 22 ++++++++++++--
>> mm/huge_memory.c | 67 +++++++++++++++++++++++++++++++++++++++----
>> mm/memcontrol.c | 5 ++++
>> mm/memory.c | 12 ++++++--
>> mm/mprotect.c | 4 +--
>> mm/mremap.c | 2 +-
>> 10 files changed, 154 insertions(+), 29 deletions(-)
>>
>> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
>> index 456dfdfd2249..096bbcc801e6 100644
>> --- a/arch/x86/mm/gup.c
>> +++ b/arch/x86/mm/gup.c
>> @@ -9,6 +9,7 @@
>> #include <linux/vmstat.h>
>> #include <linux/highmem.h>
>> #include <linux/swap.h>
>> +#include <linux/swapops.h>
>> #include <linux/memremap.h>
>>
>> #include <asm/mmu_context.h>
>> @@ -243,9 +244,11 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>> pmd_t pmd = *pmdp;
>>
>> next = pmd_addr_end(addr, end);
>> - if (pmd_none(pmd))
>> + if (!pmd_present(pmd)) {
>> + VM_BUG_ON(is_swap_pmd(pmd) && IS_ENABLED(CONFIG_MIGRATION) &&
>> + !is_pmd_migration_entry(pmd));
>
> This VM_BUG_ON() triggers when gup is called on hugetlb hwpoison entry.
> I think that in such case kernel falls into the gup slow path, and
> a page fault in follow_hugetlb_page() can properly report the error to
> affected processes, so no need to alarm with BUG_ON.
>
> Could you make this VM_BUG_ON more specific, or just remove it?

I will remove it, since adding code to detect the hugetlb hwpoison entry
to the existing VM_BUG_ON() would be quite messy.

Thanks for pointing this out.
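
For reference, the resulting gup_pmd_range() fragment would then look roughly
like this once the VM_BUG_ON() is dropped (a sketch of the intended follow-up,
not a posted patch):

	next = pmd_addr_end(addr, end);
	/* migration (and hugetlb hwpoison) entries take the gup slow path */
	if (!pmd_present(pmd))
		return 0;
	else if (unlikely(pmd_large(pmd))) {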

--
Best Regards
Yan Zi


Attachments:
signature.asc (496.00 B)
OpenPGP digital signature