2018-01-03 08:26:06

by Michal Hocko

Subject: [PATCH 0/3] unclutter thp migration

Hi,
I have posted this work as an RFC [1] and there were no fundamental
objections to the approach, so I am resending it for inclusion. It is
quite late in the release cycle and I definitely do not want to rush
this into the next release, but having it in linux-next for longer
should only help to flush out potential fallout.

Motivation:
THP migration is hacked into the generic migration code with rather
surprising semantics. The migration allocation callback is supposed to
check whether the THP can be migrated at once and, if that is not the
case, it allocates a simple order-0 page to migrate instead.
unmap_and_move then fixes that up by splitting the THP into small pages
while moving the head page to the newly allocated order-0 page. The
remaining tail pages are moved to the LRU list by split_huge_page. The
same happens if the THP allocation fails. This is really ugly and error
prone [2].

I also believe that splitting into the LRU lists via split_huge_page
is inherently wrong because the tail pages are not migrated at all.
Some callers just work around that by retrying (e.g. memory hotplug).
Other pfn walkers are simply broken, though. E.g. madvise_inject_error
will migrate the head page and then advance the next pfn by the huge
page size. do_move_page_to_node_array and queue_pages_range
(migrate_pages, mbind) will split the THP before migration if THP
migration is not supported and fall back to single page migration, but
they do not handle the tail pages if the THP migration path fails to
allocate a fresh THP, so we end up with ENOMEM and fail the whole
migration, which is questionable behavior. Page compaction doesn't try
to migrate large pages so it should be immune.

The first patch reworks do_pages_move, which relies on a very ugly
calling convention where the return status is pushed to the migration
path via a private pointer. It uses pre-allocated fixed size batching
to achieve that. We simply cannot do the same if a THP is to be split
during the migration path, which is done in patch 3. Patch 2 is a
follow-up cleanup which removes the above mentioned return status
calling convention ugliness.

On a side note:
There are some semantic issues I have encountered on the way while
working on patch 1 but I am not addressing them here. E.g. trying
to move THP tail pages will result in either success or EBUSY (the
latter is more likely once we isolate the head from the LRU list).
Hugetlb reports EACCES on tail pages. Some errors are reported via the
status parameter but migration failures are not, even though the
original `reason' argument suggests there was an intention to do so.
From a quick look into the git history this has never worked. I have
tried to keep the semantics unchanged.

Then there is a relatively minor issue that the page isolation might
fail because a page is not on the LRU - e.g. because it is sitting on
the per-cpu LRU caches. That is easily fixable.
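
For illustration only, a minimal sketch of such a fix (hypothetical
helper, not part of this series): drain the per-cpu LRU caches, as
migrate_prep() already does via lru_add_drain_all(), and retry the
isolation:

/*
 * Hypothetical sketch, not part of this series: a page sitting on a
 * per-cpu LRU cache is not on the LRU yet, so isolate_lru_page()
 * fails on it. Draining the caches and retrying once fixes that.
 */
static int isolate_lru_page_drained(struct page *page)
{
	int err = isolate_lru_page(page);

	if (err) {
		/* flush the per-cpu LRU add caches to the LRU lists */
		lru_add_drain_all();
		err = isolate_lru_page(page);
	}
	return err;
}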

Shortlog
Michal Hocko (3):
mm, numa: rework do_pages_move
mm, migrate: remove reason argument from new_page_t
mm: unclutter THP migration

Diffstat
include/linux/migrate.h | 7 +-
include/linux/page-isolation.h | 3 +-
mm/compaction.c | 3 +-
mm/huge_memory.c | 6 +
mm/internal.h | 1 +
mm/memory_hotplug.c | 5 +-
mm/mempolicy.c | 40 +----
mm/migrate.c | 354 ++++++++++++++++++-----------------------
mm/page_isolation.c | 3 +-
9 files changed, 181 insertions(+), 241 deletions(-)

[1] http://lkml.kernel.org/r/[email protected]
[2] http://lkml.kernel.org/r/[email protected]



2018-01-03 08:26:13

by Michal Hocko

Subject: [PATCH 1/3] mm, numa: rework do_pages_move

From: Michal Hocko <[email protected]>

do_pages_move is supposed to move user defined memory (an array of
addresses) to the user defined numa nodes (an array of nodes one for
each address). The user provided status array then contains resulting
numa node for each address or an error. The semantics of this function
are a little bit confusing because only some errors are reported back.
Notably a migrate_pages error is only reported via the return value.
This patch doesn't try to address these semantic nuances but rather
changes the underlying implementation.
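
For reference, a minimal userspace sketch of the interface being
described (using the move_pages(2) wrapper from libnuma; the page count
and the target node are arbitrary examples, not part of this patch):

#include <numaif.h>		/* move_pages(), link with -lnuma */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	long psize = sysconf(_SC_PAGESIZE);
	enum { NR_PAGES = 4 };			/* arbitrary example */
	void *pages[NR_PAGES];
	int nodes[NR_PAGES], status[NR_PAGES];
	char *buf = aligned_alloc(psize, NR_PAGES * psize);

	if (!buf)
		return 1;

	for (int i = 0; i < NR_PAGES; i++) {
		buf[i * psize] = 1;		/* fault the page in */
		pages[i] = buf + i * psize;
		nodes[i] = 0;			/* requested target node */
	}

	/*
	 * pid 0 == current process; status[i] ends up holding the
	 * resulting node id or a negative errno for that address.
	 */
	if (move_pages(0, NR_PAGES, pages, nodes, status, MPOL_MF_MOVE) < 0)
		perror("move_pages");

	for (int i = 0; i < NR_PAGES; i++)
		printf("%p -> %d\n", pages[i], status[i]);
	return 0;
}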

Currently we are processing the user input (which can be really large)
in batches which are stored in a temporarily allocated page. Each
address is resolved to its struct page and stored into a page_to_node
structure along with the requested target numa node. The array of these
structures is then conveyed down the page migration path via the
private argument. new_page_node then finds the corresponding structure
and allocates the proper target page.

What is the problem with the current implementation and why change it?
Apart from being quite ugly it also doesn't cope with unexpected pages
showing up on the migration list inside the migrate_pages path. That
doesn't happen currently, but the follow-up patch would like to make
the THP migration code clearer and that will need to split a THP into
the list in some cases.

How does the new implementation work? Well, instead of batching into a
fixed size array we simply batch all pages that should be migrated to
the same node and isolate all of them into a linked list, which doesn't
require any additional storage. This should work reasonably well
because page migration usually migrates larger ranges of memory to a
specific node. So the common case should work as well as the current
implementation. Even if somebody constructs an input where the target
numa nodes are interleaved we shouldn't see a large performance impact,
because page migration alone doesn't really benefit from batching.
mmap_sem batching for the lookup is quite questionable and
isolate_lru_page, which would benefit from batching, is not using it
even in the current implementation.
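
Schematically, the reworked loop looks as follows (a simplified sketch
of the patch below, with error handling trimmed):

	for (i = start = 0; i < nr_pages; i++) {
		/* get_user() pages[i] and nodes[i], validate the node */

		if (current_node == NUMA_NO_NODE) {
			current_node = node;	/* open a new batch */
			start = i;
		} else if (node != current_node) {
			/* target changed: migrate the collected batch */
			err = do_move_pages_to_node(mm, &pagelist, current_node);
			err = store_status(status, start, current_node, i - start);
			start = i;
			current_node = node;
		}

		/*
		 * Isolate the page onto pagelist; a lookup/isolation
		 * failure is only recorded in status[i] and flushes
		 * the batch collected so far.
		 */
		err = add_page_for_migration(mm, addr, current_node,
				&pagelist, flags & MPOL_MF_MOVE_ALL);
	}
	/* finally flush whatever is still queued for current_node */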

Acked-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
mm/internal.h | 1 +
mm/mempolicy.c | 5 +-
mm/migrate.c | 306 +++++++++++++++++++++++++--------------------------------
3 files changed, 138 insertions(+), 174 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 3e5dc95dc259..745e247aca9c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -540,4 +540,5 @@ static inline bool is_migrate_highatomic_page(struct page *page)
}

void setup_zone_pageset(struct zone *zone);
+extern struct page *alloc_new_node_page(struct page *page, unsigned long node, int **x);
#endif /* __MM_INTERNAL_H */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f604b22ebb65..66c9c79b21be 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -942,7 +942,8 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
}
}

-static struct page *new_node_page(struct page *page, unsigned long node, int **x)
+/* page allocation callback for NUMA node migration */
+struct page *alloc_new_node_page(struct page *page, unsigned long node, int **x)
{
if (PageHuge(page))
return alloc_huge_page_node(page_hstate(compound_head(page)),
@@ -986,7 +987,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
flags | MPOL_MF_DISCONTIG_OK, &pagelist);

if (!list_empty(&pagelist)) {
- err = migrate_pages(&pagelist, new_node_page, NULL, dest,
+ err = migrate_pages(&pagelist, alloc_new_node_page, NULL, dest,
MIGRATE_SYNC, MR_SYSCALL);
if (err)
putback_movable_pages(&pagelist);
diff --git a/mm/migrate.c b/mm/migrate.c
index 4d0be47a322a..8fb90bcd44a7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1444,141 +1444,103 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
}

#ifdef CONFIG_NUMA
-/*
- * Move a list of individual pages
- */
-struct page_to_node {
- unsigned long addr;
- struct page *page;
- int node;
- int status;
-};

-static struct page *new_page_node(struct page *p, unsigned long private,
- int **result)
+static int store_status(int __user *status, int start, int value, int nr)
{
- struct page_to_node *pm = (struct page_to_node *)private;
-
- while (pm->node != MAX_NUMNODES && pm->page != p)
- pm++;
+ while (nr-- > 0) {
+ if (put_user(value, status + start))
+ return -EFAULT;
+ start++;
+ }

- if (pm->node == MAX_NUMNODES)
- return NULL;
+ return 0;
+}

- *result = &pm->status;
+static int do_move_pages_to_node(struct mm_struct *mm,
+ struct list_head *pagelist, int node)
+{
+ int err;

- if (PageHuge(p))
- return alloc_huge_page_node(page_hstate(compound_head(p)),
- pm->node);
- else if (thp_migration_supported() && PageTransHuge(p)) {
- struct page *thp;
+ if (list_empty(pagelist))
+ return 0;

- thp = alloc_pages_node(pm->node,
- (GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
- HPAGE_PMD_ORDER);
- if (!thp)
- return NULL;
- prep_transhuge_page(thp);
- return thp;
- } else
- return __alloc_pages_node(pm->node,
- GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
+ err = migrate_pages(pagelist, alloc_new_node_page, NULL, node,
+ MIGRATE_SYNC, MR_SYSCALL);
+ if (err)
+ putback_movable_pages(pagelist);
+ return err;
}

/*
- * Move a set of pages as indicated in the pm array. The addr
- * field must be set to the virtual address of the page to be moved
- * and the node number must contain a valid target node.
- * The pm array ends with node = MAX_NUMNODES.
+ * Resolves the given address to a struct page, isolates it from the LRU and
+ * puts it to the given pagelist.
+ * Returns -errno if the page cannot be found/isolated or 0 when it has been
+ * queued or the page doesn't need to be migrated because it is already on
+ * the target node
*/
-static int do_move_page_to_node_array(struct mm_struct *mm,
- struct page_to_node *pm,
- int migrate_all)
+static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
+ int node, struct list_head *pagelist, bool migrate_all)
{
+ struct vm_area_struct *vma;
+ struct page *page;
+ unsigned int follflags;
int err;
- struct page_to_node *pp;
- LIST_HEAD(pagelist);

down_read(&mm->mmap_sem);
+ err = -EFAULT;
+ vma = find_vma(mm, addr);
+ if (!vma || addr < vma->vm_start || !vma_migratable(vma))
+ goto out;

- /*
- * Build a list of pages to migrate
- */
- for (pp = pm; pp->node != MAX_NUMNODES; pp++) {
- struct vm_area_struct *vma;
- struct page *page;
- struct page *head;
- unsigned int follflags;
-
- err = -EFAULT;
- vma = find_vma(mm, pp->addr);
- if (!vma || pp->addr < vma->vm_start || !vma_migratable(vma))
- goto set_status;
-
- /* FOLL_DUMP to ignore special (like zero) pages */
- follflags = FOLL_GET | FOLL_DUMP;
- if (!thp_migration_supported())
- follflags |= FOLL_SPLIT;
- page = follow_page(vma, pp->addr, follflags);
+ /* FOLL_DUMP to ignore special (like zero) pages */
+ follflags = FOLL_GET | FOLL_DUMP;
+ if (!thp_migration_supported())
+ follflags |= FOLL_SPLIT;
+ page = follow_page(vma, addr, follflags);

- err = PTR_ERR(page);
- if (IS_ERR(page))
- goto set_status;
+ err = PTR_ERR(page);
+ if (IS_ERR(page))
+ goto out;

- err = -ENOENT;
- if (!page)
- goto set_status;
+ err = -ENOENT;
+ if (!page)
+ goto out;

- err = page_to_nid(page);
+ err = 0;
+ if (page_to_nid(page) == node)
+ goto out_putpage;

- if (err == pp->node)
- /*
- * Node already in the right place
- */
- goto put_and_set;
+ err = -EACCES;
+ if (page_mapcount(page) > 1 && !migrate_all)
+ goto out_putpage;

- err = -EACCES;
- if (page_mapcount(page) > 1 &&
- !migrate_all)
- goto put_and_set;
-
- if (PageHuge(page)) {
- if (PageHead(page)) {
- isolate_huge_page(page, &pagelist);
- err = 0;
- pp->page = page;
- }
- goto put_and_set;
+ if (PageHuge(page)) {
+ if (PageHead(page)) {
+ isolate_huge_page(page, pagelist);
+ err = 0;
}
+ } else {
+ struct page *head;

- pp->page = compound_head(page);
head = compound_head(page);
err = isolate_lru_page(head);
- if (!err) {
- list_add_tail(&head->lru, &pagelist);
- mod_node_page_state(page_pgdat(head),
- NR_ISOLATED_ANON + page_is_file_cache(head),
- hpage_nr_pages(head));
- }
-put_and_set:
- /*
- * Either remove the duplicate refcount from
- * isolate_lru_page() or drop the page ref if it was
- * not isolated.
- */
- put_page(page);
-set_status:
- pp->status = err;
- }
-
- err = 0;
- if (!list_empty(&pagelist)) {
- err = migrate_pages(&pagelist, new_page_node, NULL,
- (unsigned long)pm, MIGRATE_SYNC, MR_SYSCALL);
if (err)
- putback_movable_pages(&pagelist);
- }
+ goto out_putpage;

+ err = 0;
+ list_add_tail(&head->lru, pagelist);
+ mod_node_page_state(page_pgdat(head),
+ NR_ISOLATED_ANON + page_is_file_cache(head),
+ hpage_nr_pages(head));
+ }
+out_putpage:
+ /*
+ * Either remove the duplicate refcount from
+ * isolate_lru_page() or drop the page ref if it was
+ * not isolated.
+ */
+ put_page(page);
+out:
up_read(&mm->mmap_sem);
return err;
}
@@ -1593,79 +1555,79 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
const int __user *nodes,
int __user *status, int flags)
{
- struct page_to_node *pm;
- unsigned long chunk_nr_pages;
- unsigned long chunk_start;
- int err;
-
- err = -ENOMEM;
- pm = (struct page_to_node *)__get_free_page(GFP_KERNEL);
- if (!pm)
- goto out;
+ int current_node = NUMA_NO_NODE;
+ LIST_HEAD(pagelist);
+ int start, i;
+ int err = 0, err1;

migrate_prep();

- /*
- * Store a chunk of page_to_node array in a page,
- * but keep the last one as a marker
- */
- chunk_nr_pages = (PAGE_SIZE / sizeof(struct page_to_node)) - 1;
-
- for (chunk_start = 0;
- chunk_start < nr_pages;
- chunk_start += chunk_nr_pages) {
- int j;
+ for (i = start = 0; i < nr_pages; i++) {
+ const void __user *p;
+ unsigned long addr;
+ int node;

- if (chunk_start + chunk_nr_pages > nr_pages)
- chunk_nr_pages = nr_pages - chunk_start;
-
- /* fill the chunk pm with addrs and nodes from user-space */
- for (j = 0; j < chunk_nr_pages; j++) {
- const void __user *p;
- int node;
-
- err = -EFAULT;
- if (get_user(p, pages + j + chunk_start))
- goto out_pm;
- pm[j].addr = (unsigned long) p;
-
- if (get_user(node, nodes + j + chunk_start))
- goto out_pm;
-
- err = -ENODEV;
- if (node < 0 || node >= MAX_NUMNODES)
- goto out_pm;
-
- if (!node_state(node, N_MEMORY))
- goto out_pm;
-
- err = -EACCES;
- if (!node_isset(node, task_nodes))
- goto out_pm;
+ err = -EFAULT;
+ if (get_user(p, pages + i))
+ goto out_flush;
+ if (get_user(node, nodes + i))
+ goto out_flush;
+ addr = (unsigned long)p;
+
+ err = -ENODEV;
+ if (node < 0 || node >= MAX_NUMNODES)
+ goto out_flush;
+ if (!node_state(node, N_MEMORY))
+ goto out_flush;

- pm[j].node = node;
+ err = -EACCES;
+ if (!node_isset(node, task_nodes))
+ goto out_flush;
+
+ if (current_node == NUMA_NO_NODE) {
+ current_node = node;
+ start = i;
+ } else if (node != current_node) {
+ err = do_move_pages_to_node(mm, &pagelist, current_node);
+ if (err)
+ goto out;
+ err = store_status(status, start, current_node, i - start);
+ if (err)
+ goto out;
+ start = i;
+ current_node = node;
}

- /* End marker for this chunk */
- pm[chunk_nr_pages].node = MAX_NUMNODES;
+ /*
+ * Errors in the page lookup or isolation are not fatal and we simply
+ * report them via status
+ */
+ err = add_page_for_migration(mm, addr, current_node,
+ &pagelist, flags & MPOL_MF_MOVE_ALL);
+ if (!err)
+ continue;

- /* Migrate this chunk */
- err = do_move_page_to_node_array(mm, pm,
- flags & MPOL_MF_MOVE_ALL);
- if (err < 0)
- goto out_pm;
+ err = store_status(status, i, err, 1);
+ if (err)
+ goto out_flush;

- /* Return status information */
- for (j = 0; j < chunk_nr_pages; j++)
- if (put_user(pm[j].status, status + j + chunk_start)) {
- err = -EFAULT;
- goto out_pm;
- }
+ err = do_move_pages_to_node(mm, &pagelist, current_node);
+ if (err)
+ goto out;
+ if (i > start) {
+ err = store_status(status, start, current_node, i - start);
+ if (err)
+ goto out;
+ }
+ current_node = NUMA_NO_NODE;
}
- err = 0;
-
-out_pm:
- free_page((unsigned long)pm);
+out_flush:
+ /* Make sure we do not overwrite the existing error */
+ err1 = do_move_pages_to_node(mm, &pagelist, current_node);
+ if (!err1)
+ err1 = store_status(status, start, current_node, i - start);
+ if (!err)
+ err = err1;
out:
return err;
}
--
2.15.1

2018-01-03 08:26:19

by Michal Hocko

Subject: [PATCH 3/3] mm: unclutter THP migration

From: Michal Hocko <[email protected]>

THP migration is hacked into the generic migration code with rather
surprising semantics. The migration allocation callback is supposed to
check whether the THP can be migrated at once and, if that is not the
case, it allocates a simple order-0 page to migrate instead.
unmap_and_move then fixes that up by splitting the THP into small pages
while moving the head page to the newly allocated order-0 page. The
remaining tail pages are moved to the LRU list by split_huge_page. The
same happens if the THP allocation fails. This is really ugly and error
prone [1].

I also believe that splitting into the LRU lists via split_huge_page
is inherently wrong because the tail pages are not migrated at all.
Some callers just work around that by retrying (e.g. memory hotplug).
Other pfn walkers are simply broken, though. E.g. madvise_inject_error
will migrate the head page and then advance the next pfn by the huge
page size. do_move_page_to_node_array and queue_pages_range
(migrate_pages, mbind) will split the THP before migration if THP
migration is not supported and fall back to single page migration, but
they do not handle the tail pages if the THP migration path fails to
allocate a fresh THP, so we end up with ENOMEM and fail the whole
migration, which is questionable behavior. Page compaction doesn't try
to migrate large pages so it should be immune.

This patch tries to unclutter the situation by moving the special THP
handling up to the migrate_pages layer where it actually belongs. We
simply split the THP into the existing list if unmap_and_move fails
with ENOMEM and retry. So we will _always_ migrate all THP subpages and
the specific migrate_pages users do not have to deal with this case in
a special way.
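
Schematically, the retry then looks like this (a simplified sketch of
the mm/migrate.c hunk below; the real code keeps this inside the
existing switch on rc):

		rc = unmap_and_move(get_new_page, put_new_page, private,
				page, pass > 2, mode, reason);
		if (rc == -ENOMEM && PageTransHuge(page)) {
			/*
			 * THP migration is unsupported or the huge page
			 * allocation failed: split the THP into the same
			 * list and retry the (now order-0) head page;
			 * the tail pages were appended to the list and
			 * are migrated later in the same pass.
			 */
			lock_page(page);
			rc = split_huge_page_to_list(page, from);
			unlock_page(page);
			if (!rc) {
				list_safe_reset_next(page, page2, lru);
				goto retry;
			}
		}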

[1] http://lkml.kernel.org/r/[email protected]

- document changed ordering of split THP page in migrate_pages as per
Zi Yan

Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/migrate.h | 4 ++--
mm/huge_memory.c | 6 ++++++
mm/memory_hotplug.c | 2 +-
mm/mempolicy.c | 31 +++----------------------------
mm/migrate.c | 34 ++++++++++++++++++++++++----------
5 files changed, 36 insertions(+), 41 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index e5d99ade2319..0c6fe904bc97 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -42,9 +42,9 @@ static inline struct page *new_page_nodemask(struct page *page,
return alloc_huge_page_nodemask(page_hstate(compound_head(page)),
preferred_nid, nodemask);

- if (thp_migration_supported() && PageTransHuge(page)) {
- order = HPAGE_PMD_ORDER;
+ if (PageTransHuge(page)) {
gfp_mask |= GFP_TRANSHUGE;
+ order = HPAGE_PMD_ORDER;
}

if (PageHighMem(page) || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 93d729fc94a4..8c296f19ff6e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2417,6 +2417,12 @@ static void __split_huge_page_tail(struct page *head, int tail,

page_tail->index = head->index + tail;
page_cpupid_xchg_last(page_tail, page_cpupid_last(head));
+
+ /*
+ * always add to the tail because some iterators expect new
+ * pages to show after the currently processed elements - e.g.
+ * migrate_pages
+ */
lru_add_page_tail(head, page_tail, lruvec, list);
}

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9a2381469172..25060b0184e9 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1387,7 +1387,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
if (isolate_huge_page(page, &source))
move_pages -= 1 << compound_order(head);
continue;
- } else if (thp_migration_supported() && PageTransHuge(page))
+ } else if (PageTransHuge(page))
pfn = page_to_pfn(compound_head(page))
+ hpage_nr_pages(page) - 1;

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4d849d3098e5..b6f4fcf9df64 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -446,15 +446,6 @@ static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
__split_huge_pmd(walk->vma, pmd, addr, false, NULL);
goto out;
}
- if (!thp_migration_supported()) {
- get_page(page);
- spin_unlock(ptl);
- lock_page(page);
- ret = split_huge_page(page);
- unlock_page(page);
- put_page(page);
- goto out;
- }
if (!queue_pages_required(page, qp)) {
ret = 1;
goto unlock;
@@ -495,7 +486,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,

if (pmd_trans_unstable(pmd))
return 0;
-retry:
+
pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
for (; addr != end; pte++, addr += PAGE_SIZE) {
if (!pte_present(*pte))
@@ -511,22 +502,6 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
continue;
if (!queue_pages_required(page, qp))
continue;
- if (PageTransCompound(page) && !thp_migration_supported()) {
- get_page(page);
- pte_unmap_unlock(pte, ptl);
- lock_page(page);
- ret = split_huge_page(page);
- unlock_page(page);
- put_page(page);
- /* Failed to split -- skip. */
- if (ret) {
- pte = pte_offset_map_lock(walk->mm, pmd,
- addr, &ptl);
- continue;
- }
- goto retry;
- }
-
migrate_page_add(page, qp->pagelist, flags);
}
pte_unmap_unlock(pte - 1, ptl);
@@ -948,7 +923,7 @@ struct page *alloc_new_node_page(struct page *page, unsigned long node)
if (PageHuge(page))
return alloc_huge_page_node(page_hstate(compound_head(page)),
node);
- else if (thp_migration_supported() && PageTransHuge(page)) {
+ else if (PageTransHuge(page)) {
struct page *thp;

thp = alloc_pages_node(node,
@@ -1124,7 +1099,7 @@ static struct page *new_page(struct page *page, unsigned long start)
if (PageHuge(page)) {
BUG_ON(!vma);
return alloc_huge_page_noerr(vma, address, 1);
- } else if (thp_migration_supported() && PageTransHuge(page)) {
+ } else if (PageTransHuge(page)) {
struct page *thp;

thp = alloc_hugepage_vma(GFP_TRANSHUGE, vma, address,
diff --git a/mm/migrate.c b/mm/migrate.c
index aba3759a2e27..feba2e63e165 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1138,6 +1138,9 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
int rc = MIGRATEPAGE_SUCCESS;
struct page *newpage;

+ if (!thp_migration_supported() && PageTransHuge(page))
+ return -ENOMEM;
+
newpage = get_new_page(page, private);
if (!newpage)
return -ENOMEM;
@@ -1159,14 +1162,6 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
goto out;
}

- if (unlikely(PageTransHuge(page) && !PageTransHuge(newpage))) {
- lock_page(page);
- rc = split_huge_page(page);
- unlock_page(page);
- if (rc)
- goto out;
- }
-
rc = __unmap_and_move(page, newpage, force, mode);
if (rc == MIGRATEPAGE_SUCCESS)
set_page_owner_migrate_reason(newpage, reason);
@@ -1381,6 +1376,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
retry = 0;

list_for_each_entry_safe(page, page2, from, lru) {
+retry:
cond_resched();

if (PageHuge(page))
@@ -1394,6 +1390,26 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,

switch(rc) {
case -ENOMEM:
+ /*
+ * THP migration might be unsupported or the
+ * allocation could've failed so we should
+ * retry on the same page with the THP split
+ * to base pages.
+ *
+ * Head page is retried immediately and tail
+ * pages are added to the tail of the list so
+ * we encounter them after the rest of the list
+ * is processed.
+ */
+ if (PageTransHuge(page)) {
+ lock_page(page);
+ rc = split_huge_page_to_list(page, from);
+ unlock_page(page);
+ if (!rc) {
+ list_safe_reset_next(page, page2, lru);
+ goto retry;
+ }
+ }
nr_failed++;
goto out;
case -EAGAIN:
@@ -1480,8 +1496,6 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,

/* FOLL_DUMP to ignore special (like zero) pages */
follflags = FOLL_GET | FOLL_DUMP;
- if (!thp_migration_supported())
- follflags |= FOLL_SPLIT;
page = follow_page(vma, addr, follflags);

err = PTR_ERR(page);
--
2.15.1

2018-01-03 08:26:39

by Michal Hocko

Subject: [PATCH 2/3] mm, migrate: remove reason argument from new_page_t

From: Michal Hocko <[email protected]>

No allocation callback is using this argument anymore. new_page_node
used to use this parameter to convey the node_id, respectively the
migration error, up to the move_pages code (do_move_page_to_node_array).
The error status never made it into the final status field and we have
a better way to communicate the node id to the status field now. All
other allocation callbacks simply ignored the argument, so we can
finally drop it.
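
For illustration, the signature change and a typical adapted callback
(a sketch modeled on the include/linux/migrate.h and mm/page_isolation.c
hunks below, shown side by side rather than as new code):

/* before: every allocation callback had to accept an unused int ** */
typedef struct page *new_page_t(struct page *page, unsigned long private,
				int **reason);

/* after: only the page and the private cookie are passed */
typedef struct page *new_page_t(struct page *page, unsigned long private);

/* a typical callback then simply drops the dead argument, e.g. */
struct page *alloc_migrate_target(struct page *page, unsigned long private)
{
	return new_page_nodemask(page, numa_node_id(), &node_states[N_MEMORY]);
}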

Reviewed-by: Zi Yan <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/migrate.h | 3 +--
include/linux/page-isolation.h | 3 +--
mm/compaction.c | 3 +--
mm/internal.h | 2 +-
mm/memory_hotplug.c | 3 +--
mm/mempolicy.c | 6 +++---
mm/migrate.c | 18 ++----------------
mm/page_isolation.c | 3 +--
8 files changed, 11 insertions(+), 30 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index a2246cf670ba..e5d99ade2319 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -7,8 +7,7 @@
#include <linux/migrate_mode.h>
#include <linux/hugetlb.h>

-typedef struct page *new_page_t(struct page *page, unsigned long private,
- int **reason);
+typedef struct page *new_page_t(struct page *page, unsigned long private);
typedef void free_page_t(struct page *page, unsigned long private);

/*
diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index cdad58bbfd8b..4ae347cbc36d 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -63,7 +63,6 @@ undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
bool skip_hwpoisoned_pages);

-struct page *alloc_migrate_target(struct page *page, unsigned long private,
- int **resultp);
+struct page *alloc_migrate_target(struct page *page, unsigned long private);

#endif
diff --git a/mm/compaction.c b/mm/compaction.c
index b8c23882c8ae..4589830b5959 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1165,8 +1165,7 @@ static void isolate_freepages(struct compact_control *cc)
* from the isolated freelists in the block we are migrating to.
*/
static struct page *compaction_alloc(struct page *migratepage,
- unsigned long data,
- int **result)
+ unsigned long data)
{
struct compact_control *cc = (struct compact_control *)data;
struct page *freepage;
diff --git a/mm/internal.h b/mm/internal.h
index 745e247aca9c..62d8c34e63d5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -540,5 +540,5 @@ static inline bool is_migrate_highatomic_page(struct page *page)
}

void setup_zone_pageset(struct zone *zone);
-extern struct page *alloc_new_node_page(struct page *page, unsigned long node, int **x);
+extern struct page *alloc_new_node_page(struct page *page, unsigned long node);
#endif /* __MM_INTERNAL_H */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 262bfd26baf9..9a2381469172 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1344,8 +1344,7 @@ static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
return 0;
}

-static struct page *new_node_page(struct page *page, unsigned long private,
- int **result)
+static struct page *new_node_page(struct page *page, unsigned long private)
{
int nid = page_to_nid(page);
nodemask_t nmask = node_states[N_MEMORY];
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 66c9c79b21be..4d849d3098e5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -943,7 +943,7 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
}

/* page allocation callback for NUMA node migration */
-struct page *alloc_new_node_page(struct page *page, unsigned long node, int **x)
+struct page *alloc_new_node_page(struct page *page, unsigned long node)
{
if (PageHuge(page))
return alloc_huge_page_node(page_hstate(compound_head(page)),
@@ -1108,7 +1108,7 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
* list of pages handed to migrate_pages()--which is how we get here--
* is in virtual address order.
*/
-static struct page *new_page(struct page *page, unsigned long start, int **x)
+static struct page *new_page(struct page *page, unsigned long start)
{
struct vm_area_struct *vma;
unsigned long uninitialized_var(address);
@@ -1153,7 +1153,7 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
return -ENOSYS;
}

-static struct page *new_page(struct page *page, unsigned long start, int **x)
+static struct page *new_page(struct page *page, unsigned long start)
{
return NULL;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index 8fb90bcd44a7..aba3759a2e27 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1136,10 +1136,9 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
enum migrate_reason reason)
{
int rc = MIGRATEPAGE_SUCCESS;
- int *result = NULL;
struct page *newpage;

- newpage = get_new_page(page, private, &result);
+ newpage = get_new_page(page, private);
if (!newpage)
return -ENOMEM;

@@ -1230,12 +1229,6 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
put_page(newpage);
}

- if (result) {
- if (rc)
- *result = rc;
- else
- *result = page_to_nid(newpage);
- }
return rc;
}

@@ -1263,7 +1256,6 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
enum migrate_mode mode, int reason)
{
int rc = -EAGAIN;
- int *result = NULL;
int page_was_mapped = 0;
struct page *new_hpage;
struct anon_vma *anon_vma = NULL;
@@ -1280,7 +1272,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
return -ENOSYS;
}

- new_hpage = get_new_page(hpage, private, &result);
+ new_hpage = get_new_page(hpage, private);
if (!new_hpage)
return -ENOMEM;

@@ -1345,12 +1337,6 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
else
putback_active_hugepage(new_hpage);

- if (result) {
- if (rc)
- *result = rc;
- else
- *result = page_to_nid(new_hpage);
- }
return rc;
}

diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 165ed8117bd1..53d801235e22 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -293,8 +293,7 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
return pfn < end_pfn ? -EBUSY : 0;
}

-struct page *alloc_migrate_target(struct page *page, unsigned long private,
- int **resultp)
+struct page *alloc_migrate_target(struct page *page, unsigned long private)
{
return new_page_nodemask(page, numa_node_id(), &node_states[N_MEMORY]);
}
--
2.15.1

2018-01-03 09:11:39

by Michal Hocko

Subject: Re: [PATCH 2/3] mm, migrate: remove reason argument from new_page_t

Oops, this one is missing so it should be folded in.
---
From a6f412a700a20ffee3ff3839eae8a0f891332f8a Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Wed, 3 Jan 2018 10:00:16 +0100
Subject: [PATCH] fold me "mm, migrate: remove reason argument from new_page_t"

- fix new_iommu_non_cma_page leftover
---
arch/powerpc/mm/mmu_context_iommu.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index e0a2d8e806ed..91ee2231c527 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -75,8 +75,7 @@ EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
/*
* Taken from alloc_migrate_target with changes to remove CMA allocations
*/
-struct page *new_iommu_non_cma_page(struct page *page, unsigned long private,
- int **resultp)
+struct page *new_iommu_non_cma_page(struct page *page, unsigned long private)
{
gfp_t gfp_mask = GFP_USER;
struct page *new_page;
--
2.15.1

--
Michal Hocko
SUSE Labs

2018-01-03 14:00:54

by Anshuman Khandual

Subject: Re: [PATCH 2/3] mm, migrate: remove reason argument from new_page_t

On 01/03/2018 01:55 PM, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> No allocation callback is using this argument anymore. new_page_node
> used to use this parameter to convey node_id resp. migration error
> up to move_pages code (do_move_page_to_node_array). The error status
> never made it into the final status field and we have a better way
> to communicate node id to the status field now. All other allocation
> callbacks simply ignored the argument so we can drop it finally.

There is a migrate_pages() call in powerpc which needs to be changed
as well. It was failing the build on powerpc.

diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index e0a2d8e..91ee223 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -75,8 +75,7 @@ bool mm_iommu_preregistered(struct mm_struct *mm)
/*
* Taken from alloc_migrate_target with changes to remove CMA allocations
*/
-struct page *new_iommu_non_cma_page(struct page *page, unsigned long private,
- int **resultp)
+struct page *new_iommu_non_cma_page(struct page *page, unsigned long private)
{
gfp_t gfp_mask = GFP_USER;
struct page *new_page;

2018-01-03 14:09:27

by Michal Hocko

Subject: Re: [PATCH 2/3] mm, migrate: remove reason argument from new_page_t

On Wed 03-01-18 19:30:38, Anshuman Khandual wrote:
> On 01/03/2018 01:55 PM, Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > No allocation callback is using this argument anymore. new_page_node
> > used to use this parameter to convey node_id resp. migration error
> > up to move_pages code (do_move_page_to_node_array). The error status
> > never made it into the final status field and we have a better way
> > to communicate node id to the status field now. All other allocation
> > callbacks simply ignored the argument so we can drop it finally.
>
> There is a migrate_pages() call in powerpc which needs to be changed
> as well. It was failing the build on powerpc.

Yes, see http://lkml.kernel.org/r/[email protected]

--
Michal Hocko
SUSE Labs

2018-01-03 14:19:58

by Anshuman Khandual

Subject: Re: [PATCH 2/3] mm, migrate: remove reason argument from new_page_t

On 01/03/2018 07:39 PM, Michal Hocko wrote:
> On Wed 03-01-18 19:30:38, Anshuman Khandual wrote:
>> On 01/03/2018 01:55 PM, Michal Hocko wrote:
>>> From: Michal Hocko <[email protected]>
>>>
>>> No allocation callback is using this argument anymore. new_page_node
>>> used to use this parameter to convey node_id resp. migration error
>>> up to move_pages code (do_move_page_to_node_array). The error status
>>> never made it into the final status field and we have a better way
>>> to communicate node id to the status field now. All other allocation
>>> callbacks simply ignored the argument so we can drop it finally.
>>
>> There is a migrate_pages() call in powerpc which needs to be changed
>> as well. It was failing the build on powerpc.
>
> Yes, see http://lkml.kernel.org/r/[email protected]

Oops, my bad. I am sorry, missed this.

2018-01-05 03:52:36

by Anshuman Khandual

Subject: Re: [PATCH 1/3] mm, numa: rework do_pages_move

On 01/03/2018 01:55 PM, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> do_pages_move is supposed to move user defined memory (an array of
> addresses) to the user defined numa nodes (an array of nodes one for
> each address). The user provided status array then contains resulting
> numa node for each address or an error. The semantic of this function is
> little bit confusing because only some errors are reported back. Notably
> migrate_pages error is only reported via the return value. This patch
> doesn't try to address these semantic nuances but rather change the
> underlying implementation.
>
> Currently we are processing user input (which can be really large)
> in batches which are stored to a temporarily allocated page. Each
> address is resolved to its struct page and stored to page_to_node
> structure along with the requested target numa node. The array of these
> structures is then conveyed down the page migration path via private
> argument. new_page_node then finds the corresponding structure and
> allocates the proper target page.
>
> What is the problem with the current implementation and why to change
> it? Apart from being quite ugly it also doesn't cope with unexpected
> pages showing up on the migration list inside migrate_pages path.
> That doesn't happen currently but the follow up patch would like to
> make the thp migration code more clear and that would need to split a
> THP into the list for some cases.
>
> How does the new implementation work? Well, instead of batching into a
> fixed size array we simply batch all pages that should be migrated to
> the same node and isolate all of them into a linked list which doesn't
> require any additional storage. This should work reasonably well because
> page migration usually migrates larger ranges of memory to a specific
> node. So the common case should work equally well as the current
> implementation. Even if somebody constructs an input where the target
> numa nodes would be interleaved we shouldn't see a large performance
> impact because page migration alone doesn't really benefit from
> batching. mmap_sem batching for the lookup is quite questionable and
> isolate_lru_page which would benefit from batching is not using it even
> in the current implementation.

Hi Michal,

After slightly modifying your test case (like fixing the page size for
powerpc and just doing a simple migration from node 0 to node 8 instead
of the interleaving), I tried to measure the migration speed with and
without the patches on mainline. It's interesting....

                                        10000 pages | 100000 pages
                                        ---------------------------
Mainline                                     165 ms |      1674 ms
Mainline + first patch (move_pages)          191 ms |      1952 ms
Mainline + all three patches                 146 ms |      1469 ms

Though overall it gives a performance improvement, somehow it slows
down the migration after the first patch. Will look into this further.

2018-01-05 09:14:48

by Michal Hocko

Subject: Re: [PATCH 1/3] mm, numa: rework do_pages_move

On Fri 05-01-18 09:22:22, Anshuman Khandual wrote:
[...]
> Hi Michal,
>
> After slightly modifying your test case (like fixing the page size for
> powerpc and just doing simple migration from node 0 to 8 instead of the
> interleaving), I tried to measure the migration speed with and without
> the patches on mainline. Its interesting....
>
> 10000 pages | 100000 pages
> --------------------------
> Mainline 165 ms 1674 ms
> Mainline + first patch (move_pages) 191 ms 1952 ms
> Mainline + all three patches 146 ms 1469 ms
>
> Though overall it gives performance improvement, some how it slows
> down migration after the first patch. Will look into this further.

What are you measuring actually? All pages migrated to the same node?
Do you have any profiles? How stable are the results?

--
Michal Hocko
SUSE Labs

2018-01-05 09:20:21

by Anshuman Khandual

Subject: Re: [PATCH 1/3] mm, numa: rework do_pages_move

On 01/05/2018 02:44 PM, Michal Hocko wrote:
> On Fri 05-01-18 09:22:22, Anshuman Khandual wrote:
> [...]
>> Hi Michal,
>>
>> After slightly modifying your test case (like fixing the page size for
>> powerpc and just doing simple migration from node 0 to 8 instead of the
>> interleaving), I tried to measure the migration speed with and without
>> the patches on mainline. Its interesting....
>>
>> 10000 pages | 100000 pages
>> --------------------------
>> Mainline 165 ms 1674 ms
>> Mainline + first patch (move_pages) 191 ms 1952 ms
>> Mainline + all three patches 146 ms 1469 ms
>>
>> Though overall it gives performance improvement, some how it slows
>> down migration after the first patch. Will look into this further.
>
> What are you measuring actually? All pages migrated to the same node?

The amount of time the move_pages() system call took to move that many
pages from node 0 to node 8. Yeah, they migrated to the same node.

> Do you have any profiles? How stable are the results?

No, are you referring to a perf record kind of profile? The results
were repeatable.



2018-01-05 09:33:06

by Michal Hocko

Subject: Re: [PATCH 1/3] mm, numa: rework do_pages_move

On Fri 05-01-18 14:50:04, Anshuman Khandual wrote:
> On 01/05/2018 02:44 PM, Michal Hocko wrote:
> > On Fri 05-01-18 09:22:22, Anshuman Khandual wrote:
> > [...]
> >> Hi Michal,
> >>
> >> After slightly modifying your test case (like fixing the page size for
> >> powerpc and just doing simple migration from node 0 to 8 instead of the
> >> interleaving), I tried to measure the migration speed with and without
> >> the patches on mainline. Its interesting....
> >>
> >> 10000 pages | 100000 pages
> >> --------------------------
> >> Mainline 165 ms 1674 ms
> >> Mainline + first patch (move_pages) 191 ms 1952 ms
> >> Mainline + all three patches 146 ms 1469 ms
> >>
> >> Though overall it gives performance improvement, some how it slows
> >> down migration after the first patch. Will look into this further.
> >
> > What are you measuring actually? All pages migrated to the same node?
>
> The mount of time move_pages() system call took to move these many
> pages from node 0 to node 8. Yeah they migrated to the same node.
>
> > Do you have any profiles? How stable are the results?
>
> No, are you referring to perf record kind profile ? Results were
> repeating.

Yes. I am really wondering because there shouldn't be anything specific
to improve the situation in patches 2 and 3. Likewise, the only
overhead from patch 1 I can see is the reduced batching of the
mmap_sem. But then I am wondering what would compensate for that
later...

--
Michal Hocko
SUSE Labs

Subject: Re: [PATCH 1/3] mm, numa: rework do_pages_move

On Fri, 5 Jan 2018, Michal Hocko wrote:

> Yes. I am really wondering because there souldn't anything specific to
> improve the situation with patch 2 and 3. Likewise the only overhead
> from the patch 1 I can see is the reduced batching of the mmap_sem. But
> then I am wondering what would compensate that later...

Could you reduce the frequency of taking mmap_sem? Maybe take it when
picking a new node and drop it when done with that node before migrating
the list of pages?

There is the potential of large numbers of pages being migrated, and
having to take a semaphore for every one of them would create a nice
amount of overhead.
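
A rough sketch of that idea (hypothetical and untested, based on the
do_pages_move() loop from patch 1): hold mmap_sem while collecting one
node's batch and drop it across the migration itself:

	down_read(&mm->mmap_sem);
	for (i = start = 0; i < nr_pages; i++) {
		/* ... get_user()/validation as before ... */

		if (current_node != NUMA_NO_NODE && node != current_node) {
			/* drop the lock only while the batch is migrated */
			up_read(&mm->mmap_sem);
			err = do_move_pages_to_node(mm, &pagelist, current_node);
			err = store_status(status, start, current_node, i - start);
			down_read(&mm->mmap_sem);
			start = i;
		}
		current_node = node;

		/*
		 * add_page_for_migration() would no longer take mmap_sem
		 * itself but rely on the caller holding it.
		 */
		err = add_page_for_migration(mm, addr, current_node,
				&pagelist, flags & MPOL_MF_MOVE_ALL);
	}
	up_read(&mm->mmap_sem);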

2018-01-05 18:09:09

by Michal Hocko

Subject: Re: [PATCH 1/3] mm, numa: rework do_pages_move

On Fri 05-01-18 11:15:18, Christoph Lameter wrote:
> On Fri, 5 Jan 2018, Michal Hocko wrote:
>
> > Yes. I am really wondering because there souldn't anything specific to
> > improve the situation with patch 2 and 3. Likewise the only overhead
> > from the patch 1 I can see is the reduced batching of the mmap_sem. But
> > then I am wondering what would compensate that later...
>
> Could you reduce the frequency of taking mmap_sem? Maybe take it when
> picking a new node and drop it when done with that node before migrating
> the list of pages?

I believe there should be some cap on the number of pages. We shouldn't
keep it held for millions of pages if all of them are moved to the same
node. I would really like to postpone that to later unless it causes
some noticeable regressions, because this would complicate the code
further and I am not sure it is all worth it.

--
Michal Hocko
SUSE Labs

Subject: Re: [PATCH 1/3] mm, numa: rework do_pages_move

On Fri, 5 Jan 2018, Michal Hocko wrote:

> I believe there should be some cap on the number of pages. We shouldn't
> keep it held for million of pages if all of them are moved to the same
> node. I would really like to postpone that to later unless it causes
> some noticeable regressions because this would complicate the code
> further and I am not sure this is all worth it.

Attached a patch to make the code more readable.

Also, why are you migrating the pages on the pagelist if
add_page_for_migration() fails? One could simply update
the status in user space and continue.


Index: linux/mm/migrate.c
===================================================================
--- linux.orig/mm/migrate.c
+++ linux/mm/migrate.c
@@ -1584,16 +1584,20 @@ static int do_pages_move(struct mm_struc
if (!node_isset(node, task_nodes))
goto out_flush;

- if (current_node == NUMA_NO_NODE) {
- current_node = node;
- start = i;
- } else if (node != current_node) {
- err = do_move_pages_to_node(mm, &pagelist, current_node);
- if (err)
- goto out;
- err = store_status(status, start, current_node, i - start);
- if (err)
- goto out;
+ if (node != current_node) {
+
+ if (current_node != NUMA_NO_NODE) {
+
+ /* Move the pages to current_node */
+ err = do_move_pages_to_node(mm, &pagelist, current_node);
+ if (err)
+ goto out;
+
+ err = store_status(status, start, current_node, i - start);
+ if (err)
+ goto out;
+ }
+
start = i;
current_node = node;
}
@@ -1607,6 +1611,10 @@ static int do_pages_move(struct mm_struc
if (!err)
continue;

+ /*
+ * Failure to isolate a page so flush the pages on
+ * pagelist after storing status and continue.
+ */
err = store_status(status, i, err, 1);
if (err)
goto out_flush;
@@ -1614,6 +1622,7 @@ static int do_pages_move(struct mm_struc
err = do_move_pages_to_node(mm, &pagelist, current_node);
if (err)
goto out;
+
if (i > start) {
err = store_status(status, start, current_node, i - start);
if (err)

2018-01-05 18:48:12

by Michal Hocko

Subject: Re: [PATCH 1/3] mm, numa: rework do_pages_move

On Fri 05-01-18 12:41:22, Christoph Lameter wrote:
> On Fri, 5 Jan 2018, Michal Hocko wrote:
>
> > I believe there should be some cap on the number of pages. We shouldn't
> > keep it held for million of pages if all of them are moved to the same
> > node. I would really like to postpone that to later unless it causes
> > some noticeable regressions because this would complicate the code
> > further and I am not sure this is all worth it.
>
> Attached a patch to make the code more readable.
>
> Also why are you migrating the pages on pagelist if a
> add_page_for_migration() fails? One could simply update
> the status in user space and continue.

I am open to further cleanups. Care to send a full patch with a
changelog? I would rather not fold more changes into the already tested
one.

Thanks!
--
Michal Hocko
SUSE Labs

Subject: Re: [PATCH 1/3] mm, numa: rework do_pages_move

On Fri, 5 Jan 2018, Michal Hocko wrote:

> > Also why are you migrating the pages on pagelist if a
> > add_page_for_migration() fails? One could simply update
> > the status in user space and continue.
>
> I am open to further cleanups. Care to send a full patch with the
> changelog? I would rather not fold more changes to the already tested
> one.

While doing that I saw that one could pull the rwsem locking out of
add_page_for_migration() as well in order to avoid taking it for each 4k
page. Include that?

2018-01-05 19:44:17

by Michal Hocko

Subject: Re: [PATCH 1/3] mm, numa: rework do_pages_move

On Fri 05-01-18 13:27:48, Christoph Lameter wrote:
> On Fri, 5 Jan 2018, Michal Hocko wrote:
>
> > > Also why are you migrating the pages on pagelist if a
> > > add_page_for_migration() fails? One could simply update
> > > the status in user space and continue.
> >
> > I am open to further cleanups. Care to send a full patch with the
> > changelog? I would rather not fold more changes to the already tested
> > one.
>
> While doing that I saw that one could pull the rwsem locking out of
> add_page_for_migration() as well in order to avoid taking it for each 4k
> page. Include that?

Yeah, why not, if the end result turns out to be simpler and easier to
maintain. Please note that I was mostly after simplicity. There are
other things to sort out though. Please read the cover letter, which
mentions several API oddities that would be good to either sort out or
at least document.

Please also note that I am too busy with the most "popular" bug these
days, unfortunately, so my review bandwidth will be very limited.

--
Michal Hocko
SUSE Labs

2018-01-29 22:08:16

by Zi Yan

Subject: Re: [PATCH 1/3] mm, numa: rework do_pages_move

Hi Michal,

I discovered that this patch does not hold mmap_sem while migrating
pages in do_move_pages_to_node().

A simple fix below moves mmap_sem from add_page_for_migration()
to the outermost do_pages_move():


diff --git a/mm/migrate.c b/mm/migrate.c
index 5d0dc7b85f90..28b9e126cb38 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1487,7 +1487,6 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
unsigned int follflags;
int err;

- down_read(&mm->mmap_sem);
err = -EFAULT;
vma = find_vma(mm, addr);
if (!vma || addr < vma->vm_start || !vma_migratable(vma))
@@ -1540,7 +1539,6 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
*/
put_page(page);
out:
- up_read(&mm->mmap_sem);
return err;
}

@@ -1561,6 +1559,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,

migrate_prep();

+ down_read(&mm->mmap_sem);
for (i = start = 0; i < nr_pages; i++) {
const void __user *p;
unsigned long addr;
@@ -1628,6 +1627,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
if (!err)
err = err1;
out:
+ up_read(&mm->mmap_sem);
return err;
}


--
Best Regards
Yan Zi

> + goto out_flush;
> + addr = (unsigned long)p;
> +
> + err = -ENODEV;
> + if (node < 0 || node >= MAX_NUMNODES)
> + goto out_flush;
> + if (!node_state(node, N_MEMORY))
> + goto out_flush;
>
> - pm[j].node = node;
> + err = -EACCES;
> + if (!node_isset(node, task_nodes))
> + goto out_flush;
> +
> + if (current_node == NUMA_NO_NODE) {
> + current_node = node;
> + start = i;
> + } else if (node != current_node) {
> + err = do_move_pages_to_node(mm, &pagelist, current_node);
> + if (err)
> + goto out;
> + err = store_status(status, start, current_node, i - start);
> + if (err)
> + goto out;
> + start = i;
> + current_node = node;
> }
>
> - /* End marker for this chunk */
> - pm[chunk_nr_pages].node = MAX_NUMNODES;
> + /*
> + * Errors in the page lookup or isolation are not fatal and we simply
> + * report them via status
> + */
> + err = add_page_for_migration(mm, addr, current_node,
> + &pagelist, flags & MPOL_MF_MOVE_ALL);
> + if (!err)
> + continue;
>
> - /* Migrate this chunk */
> - err = do_move_page_to_node_array(mm, pm,
> - flags & MPOL_MF_MOVE_ALL);
> - if (err < 0)
> - goto out_pm;
> + err = store_status(status, i, err, 1);
> + if (err)
> + goto out_flush;
>
> - /* Return status information */
> - for (j = 0; j < chunk_nr_pages; j++)
> - if (put_user(pm[j].status, status + j + chunk_start)) {
> - err = -EFAULT;
> - goto out_pm;
> - }
> + err = do_move_pages_to_node(mm, &pagelist, current_node);
> + if (err)
> + goto out;
> + if (i > start) {
> + err = store_status(status, start, current_node, i - start);
> + if (err)
> + goto out;
> + }
> + current_node = NUMA_NO_NODE;
> }
> - err = 0;
> -
> -out_pm:
> - free_page((unsigned long)pm);
> +out_flush:
> + /* Make sure we do not overwrite the existing error */
> + err1 = do_move_pages_to_node(mm, &pagelist, current_node);
> + if (!err1)
> + err1 = store_status(status, start, current_node, i - start);
> + if (!err)
> + err = err1;
> out:
> return err;
> }
> --
> 2.15.1
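
For reference, the rework above only changes the kernel side of the move_pages(2) syscall: requests are now batched per target node, with do_move_pages_to_node() flushing a batch whenever the requested node changes, and per-page results are reported through the status array. Below is a minimal userspace sketch of the interface being serviced; it is not part of the patch, and it assumes libnuma's <numaif.h> is available and that the machine actually has memory nodes 0 and 1.

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	void *pages[2];
	int nodes[2] = { 0, 1 };	/* alternating targets; adjust to nodes that exist */
	int status[2] = { -1, -1 };

	/* Populate two anonymous pages so there is something to migrate. */
	for (int i = 0; i < 2; i++) {
		if (posix_memalign(&pages[i], page_size, page_size))
			return 1;
		*(volatile char *)pages[i] = 1;
	}

	/* pid 0 == calling process; MPOL_MF_MOVE moves pages mapped only by us. */
	if (move_pages(0, 2, pages, nodes, status, MPOL_MF_MOVE) < 0)
		perror("move_pages");

	/* On success status[i] is the node the page now sits on, else a negative errno. */
	for (int i = 0; i < 2; i++)
		printf("page %d: status %d\n", i, status[i]);

	return 0;
}

Compiled with -lnuma, each status[i] comes back as either the node the page now resides on or a negative errno, which is the contract store_status() implements in the rework.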


2018-01-29 22:36:06

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/3] mm, numa: rework do_pages_move

On Mon, 29 Jan 2018 17:06:14 -0500 "Zi Yan" <[email protected]> wrote:

> I discovered that this patch does not hold mmap_sem while migrating pages in
> do_move_pages_to_node().
>
> A simple fix below moves the mmap_sem locking from add_page_for_migration()
> to the outermost do_pages_move():

I'm not surprised. Why does do_move_pages_to_node() need mmap_sem
and how is a reader to discover that fact???

2018-01-29 23:41:01

by Zi Yan

[permalink] [raw]
Subject: Re: [PATCH 1/3] mm, numa: rework do_pages_move

On 29 Jan 2018, at 17:35, Andrew Morton wrote:

> On Mon, 29 Jan 2018 17:06:14 -0500 "Zi Yan" <[email protected]> wrote:
>
>> I discovered that this patch does not hold mmap_sem while migrating pages in
>> do_move_pages_to_node().
>>
>> A simple fix below moves the mmap_sem locking from add_page_for_migration()
>> to the outermost do_pages_move():
>
> I'm not surprised. Why does do_move_pages_to_node() need mmap_sem
> and how is a reader to discover that fact???

do_move_pages_to_node() calls migrate_pages(), which requires down_read(&mmap_sem).

In the outermost do_pages_move(), both add_page_for_migration() and
do_move_pages_to_node() need to hold the read lock on mmap_sem.

Do we need to add comments for both functions?

--
Best Regards
Yan Zi


2018-01-29 23:49:15

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/3] mm, numa: rework do_pages_move

On Mon, 29 Jan 2018 18:39:01 -0500 "Zi Yan" <[email protected]> wrote:

> On 29 Jan 2018, at 17:35, Andrew Morton wrote:
>
> > On Mon, 29 Jan 2018 17:06:14 -0500 "Zi Yan" <[email protected]> wrote:
> >
> >> I discovered that this patch does not hold mmap_sem while migrating pages in
> >> do_move_pages_to_node().
> >>
> >> A simple fix below moves the mmap_sem locking from add_page_for_migration()
> >> to the outermost do_pages_move():
> >
> > I'm not surprised. Why does do_move_pages_to_node() need mmap_sem
> > and how is a reader to discover that fact???
>
> do_move_pages_to_node() calls migrate_pages(), which requires down_read(&mmap_sem).
>
> In the outermost do_pages_move(), both add_page_for_migration() and
> do_move_pages_to_node() need to hold the read lock on mmap_sem.
>
> Do we need to add comments for both functions?

Just for migrate_pages(), I guess. Let's include a description of
*why* mmap_sem is needed. What it is protecting.
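
For illustration only, the kind of annotation being asked for could take the shape below. It reuses do_move_pages_to_node() as introduced by the patch; the lockdep assertion and the comment wording are a sketch, not proposed upstream text, and the bracketed rationale is deliberately left as a placeholder because what exactly mmap_sem protects across migrate_pages() is the open question in this thread.

/*
 * Requires down_read(&mm->mmap_sem): <fill in what the lock protects
 * across the migrate_pages() call, per the discussion above>.
 */
static int do_move_pages_to_node(struct mm_struct *mm,
		struct list_head *pagelist, int node)
{
	int err;

	/* Turn the documented requirement into a rule lockdep can check. */
	lockdep_assert_held(&mm->mmap_sem);

	if (list_empty(pagelist))
		return 0;

	err = migrate_pages(pagelist, alloc_new_node_page, NULL, node,
			MIGRATE_SYNC, MR_SYSCALL);
	if (err)
		putback_movable_pages(pagelist);
	return err;
}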