From: Michal Hocko <[email protected]>
THP migration is hacked into the generic migration path with rather
surprising semantics. The migration allocation callback is supposed to
check whether the THP can be migrated at once and, if that is not the
case, it allocates a simple page to migrate. unmap_and_move then
fixes that up by splitting the THP into small pages while moving the
head page to the newly allocated order-0 page. Remaining pages are moved
to the LRU list by split_huge_page. The same happens if the THP
allocation fails. This is really ugly and error prone [1].
I also believe that split_huge_page to the LRU lists is inherently
wrong because not all tail pages are migrated. Some callers just
work around that by retrying (e.g. memory hotplug). Other pfn
walkers are simply broken, though. E.g. madvise_inject_error will
migrate the head page and then advance the next pfn by the huge page
size. do_move_page_to_node_array and queue_pages_range (migrate_pages,
mbind) will simply split the THP before migration if THP migration is
not supported and then fall back to single page migration, but they do
not handle tail pages if the THP migration path is not able to allocate
a fresh THP, so we end up with ENOMEM and fail the whole migration,
which is questionable behavior. Page compaction doesn't try to migrate
large pages so it should be immune.
This patch tries to unclutter the situation by moving the special THP
handling up to the migrate_pages layer where it actually belongs. We
simply split the THP into the existing migration list if unmap_and_move
fails with ENOMEM and retry. This way we will _always_ migrate all THP
subpages and specific migrate_pages users do not have to deal with this
case in a special way.
[1] http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
---
Hi,
this is a follow up for [2]. I find this approach much less hackish and
easier to maintain as well. It also fixes a few bugs. I didn't really go
deeply into each migration path to evaluate the user visible bugs, but
at least the explicit migration is suboptimal, to say the least:
# A simple 100M mmap with MADV_HUGEPAGE and explicit migrate from node 0
# to node 1 with results displayed "After pause"
root@test1:~# numactl -m 0 ./map_thp
7f749d0aa000 bind:0 anon=25600 dirty=25600 N0=25600 kernelpagesize_kB=4
7f749d0aa000-7f74a34aa000 rw-p 00000000 00:00 0
Size: 102400 kB
Rss: 102400 kB
AnonHugePages: 100352 kB
After pause
7f749d0aa000 bind:0 anon=25600 dirty=25600 N0=18602 N1=6998 kernelpagesize_kB=4
7f749d0aa000-7f74a34aa000 rw-p 00000000 00:00 0
Size: 102400 kB
Rss: 102400 kB
AnonHugePages: 100352 kB
root@test1:~# migratepages $(pgrep map_thp) 0 1
migrate_pages: Cannot allocate memory
While the migration succeeds with the patch applied even though some THP
had to be split and migrated page by page.
I believe that thp_migration_supported shouldn't be spread outside
of the migration code, but I've left a few assertions in place. Maybe
they should go as well; I haven't spent too much time on those. My
testing was quite limited and this might still blow up, so I would
really appreciate a careful review.
Thanks!
[2] http://lkml.kernel.org/r/[email protected]
include/linux/migrate.h | 6 ++++--
mm/huge_memory.c | 6 ++++++
mm/memory_hotplug.c | 2 +-
mm/mempolicy.c | 29 ++---------------------------
mm/migrate.c | 31 ++++++++++++++++++++-----------
5 files changed, 33 insertions(+), 41 deletions(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index a2246cf670ba..ec9503e5f2c2 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -43,9 +43,11 @@ static inline struct page *new_page_nodemask(struct page *page,
return alloc_huge_page_nodemask(page_hstate(compound_head(page)),
preferred_nid, nodemask);
- if (thp_migration_supported() && PageTransHuge(page)) {
- order = HPAGE_PMD_ORDER;
+ if (PageTransHuge(page)) {
+ if (!thp_migration_supported())
+ return NULL;
gfp_mask |= GFP_TRANSHUGE;
+ order = HPAGE_PMD_ORDER;
}
if (PageHighMem(page) || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7544ce4ef4dc..304f39b9aa5c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2425,6 +2425,12 @@ static void __split_huge_page_tail(struct page *head, int tail,
page_tail->index = head->index + tail;
page_cpupid_xchg_last(page_tail, page_cpupid_last(head));
+
+ /*
+ * always add to the tail because some iterators expect new
+ * pages to show after the currently processed elements - e.g.
+ * migrate_pages
+ */
lru_add_page_tail(head, page_tail, lruvec, list);
}
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d0856ab2f28d..ad0a84aa7b53 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1391,7 +1391,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
if (isolate_huge_page(page, &source))
move_pages -= 1 << compound_order(head);
continue;
- } else if (thp_migration_supported() && PageTransHuge(page))
+ } else if (PageTransHuge(page))
pfn = page_to_pfn(compound_head(page))
+ hpage_nr_pages(page) - 1;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f604b22ebb65..49ecbb50b5f0 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -446,15 +446,6 @@ static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
__split_huge_pmd(walk->vma, pmd, addr, false, NULL);
goto out;
}
- if (!thp_migration_supported()) {
- get_page(page);
- spin_unlock(ptl);
- lock_page(page);
- ret = split_huge_page(page);
- unlock_page(page);
- put_page(page);
- goto out;
- }
if (!queue_pages_required(page, qp)) {
ret = 1;
goto unlock;
@@ -511,22 +502,6 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
continue;
if (!queue_pages_required(page, qp))
continue;
- if (PageTransCompound(page) && !thp_migration_supported()) {
- get_page(page);
- pte_unmap_unlock(pte, ptl);
- lock_page(page);
- ret = split_huge_page(page);
- unlock_page(page);
- put_page(page);
- /* Failed to split -- skip. */
- if (ret) {
- pte = pte_offset_map_lock(walk->mm, pmd,
- addr, &ptl);
- continue;
- }
- goto retry;
- }
-
migrate_page_add(page, qp->pagelist, flags);
}
pte_unmap_unlock(pte - 1, ptl);
@@ -947,7 +922,7 @@ static struct page *new_node_page(struct page *page, unsigned long node, int **x
if (PageHuge(page))
return alloc_huge_page_node(page_hstate(compound_head(page)),
node);
- else if (thp_migration_supported() && PageTransHuge(page)) {
+ else if (PageTransHuge(page)) {
struct page *thp;
thp = alloc_pages_node(node,
@@ -1123,7 +1098,7 @@ static struct page *new_page(struct page *page, unsigned long start, int **x)
if (PageHuge(page)) {
BUG_ON(!vma);
return alloc_huge_page_noerr(vma, address, 1);
- } else if (thp_migration_supported() && PageTransHuge(page)) {
+ } else if (PageTransHuge(page)) {
struct page *thp;
thp = alloc_hugepage_vma(GFP_TRANSHUGE, vma, address,
diff --git a/mm/migrate.c b/mm/migrate.c
index 4d0be47a322a..ed21642a5c1d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1139,6 +1139,9 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
int *result = NULL;
struct page *newpage;
+ if (!thp_migration_supported() && PageTransHuge(page))
+ return -ENOMEM;
+
newpage = get_new_page(page, private, &result);
if (!newpage)
return -ENOMEM;
@@ -1160,14 +1163,6 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
goto out;
}
- if (unlikely(PageTransHuge(page) && !PageTransHuge(newpage))) {
- lock_page(page);
- rc = split_huge_page(page);
- unlock_page(page);
- if (rc)
- goto out;
- }
-
rc = __unmap_and_move(page, newpage, force, mode);
if (rc == MIGRATEPAGE_SUCCESS)
set_page_owner_migrate_reason(newpage, reason);
@@ -1395,6 +1390,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
retry = 0;
list_for_each_entry_safe(page, page2, from, lru) {
+retry:
cond_resched();
if (PageHuge(page))
@@ -1408,6 +1404,21 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
switch(rc) {
case -ENOMEM:
+ /*
+ * THP migration might be unsupported or the
+ * allocation could've failed so we should
+ * retry on the same page with the THP split
+ * to base pages.
+ */
+ if (PageTransHuge(page)) {
+ lock_page(page);
+ rc = split_huge_page_to_list(page, from);
+ unlock_page(page);
+ if (!rc) {
+ list_safe_reset_next(page, page2, lru);
+ goto retry;
+ }
+ }
nr_failed++;
goto out;
case -EAGAIN:
@@ -1470,7 +1481,7 @@ static struct page *new_page_node(struct page *p, unsigned long private,
if (PageHuge(p))
return alloc_huge_page_node(page_hstate(compound_head(p)),
pm->node);
- else if (thp_migration_supported() && PageTransHuge(p)) {
+ else if (PageTransHuge(p)) {
struct page *thp;
thp = alloc_pages_node(pm->node,
@@ -1517,8 +1528,6 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
/* FOLL_DUMP to ignore special (like zero) pages */
follflags = FOLL_GET | FOLL_DUMP;
- if (!thp_migration_supported())
- follflags |= FOLL_SPLIT;
page = follow_page(vma, pp->addr, follflags);
err = PTR_ERR(page);
--
2.15.0
Hi Michal,
Thanks for sending this out.
Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> THP migration is hacked into the generic migration with rather
> surprising semantic. The migration allocation callback is supposed to
> check whether the THP can be migrated at once and if that is not the
> case then it allocates a simple page to migrate. unmap_and_move then
> fixes that up by spliting the THP into small pages while moving the
> head page to the newly allocated order-0 page. Remaning pages are moved
> to the LRU list by split_huge_page. The same happens if the THP
> allocation fails. This is really ugly and error prone [1].
>
> I also believe that split_huge_page to the LRU lists is inherently
> wrong because all tail pages are not migrated. Some callers will just
I agree with you that we should try to migrate all tail pages if the THP
needs to be split. But this might not be compatible with "getting
migration results" in unmap_and_move(), since a caller of
migrate_pages() may want to know the status of each page in the
migration list via the int **result argument of get_new_page() (e.g.
new_page_node()). The caller has no idea whether a THP in its migration
list will be split or not; thus, storing migration results might be
quite tricky if tail pages are added into the migration list.
We need to consider this when we clean up migrate_pages().
> work around that by retrying (e.g. memory hotplug). There are other
> pfn walkers which are simply broken though. e.g. madvise_inject_error
> will migrate head and then advances next pfn by the huge page size.
> do_move_page_to_node_array, queue_pages_range (migrate_pages, mbind),
> will simply split the THP before migration if the THP migration is not
> supported then falls back to single page migration but it doesn't handle
> tail pages if the THP migration path is not able to allocate a fresh
> THP so we end up with ENOMEM and fail the whole migration which is
> a questionable behavior. Page compaction doesn't try to migrate large
> pages so it should be immune.
>
> This patch tries to unclutter the situation by moving the special THP
> handling up to the migrate_pages layer where it actually belongs. We
> simply split the THP page into the existing list if unmap_and_move fails
> with ENOMEM and retry. So we will _always_ migrate all THP subpages and
> specific migrate_pages users do not have to deal with this case in a
> special way.
>
> [1] http://lkml.kernel.org/r/[email protected]
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> Hi,
> this is a follow up for [2]. I find this approach much less hackish and
> easier to maintain as well. It also fixes few bugs. I didn't really go
> deeply into each migration path and evaluate the user visible bugs but
> at least the explicit migration is suboptimal to say the least
>
> # A simple 100M mmap with MADV_HUGEPAGE and explicit migrate from node 0
> # to node 1 with results displayed "After pause"
> root@test1:~# numactl -m 0 ./map_thp
> 7f749d0aa000 bind:0 anon=25600 dirty=25600 N0=25600 kernelpagesize_kB=4
> 7f749d0aa000-7f74a34aa000 rw-p 00000000 00:00 0
> Size: 102400 kB
> Rss: 102400 kB
> AnonHugePages: 100352 kB
>
> After pause
> 7f749d0aa000 bind:0 anon=25600 dirty=25600 N0=18602 N1=6998 kernelpagesize_kB=4
> 7f749d0aa000-7f74a34aa000 rw-p 00000000 00:00 0
> Size: 102400 kB
> Rss: 102400 kB
> AnonHugePages: 100352 kB
>
> root@test1:~# migratepages $(pgrep map_thp) 0 1
> migrate_pages: Cannot allocate memory
>
> While the migration succeeds with the patch applied even though some THP
> had to be split and migrated page by page.
>
> I believe that thp_migration_supported shouldn't be spread outside
> of the migration code but I've left few assertion in place. Maybe
> they should go as well. I haven't spent too much time on those. My
> testing was quite limited and this might still blow up so I would really
> appreciate a careful review.
>
> Thanks!
>
> [2] http://lkml.kernel.org/r/[email protected]
>
> include/linux/migrate.h | 6 ++++--
> mm/huge_memory.c | 6 ++++++
> mm/memory_hotplug.c | 2 +-
> mm/mempolicy.c | 29 ++---------------------------
> mm/migrate.c | 31 ++++++++++++++++++++-----------
> 5 files changed, 33 insertions(+), 41 deletions(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index a2246cf670ba..ec9503e5f2c2 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -43,9 +43,11 @@ static inline struct page *new_page_nodemask(struct page *page,
> return alloc_huge_page_nodemask(page_hstate(compound_head(page)),
> preferred_nid, nodemask);
>
> - if (thp_migration_supported() && PageTransHuge(page)) {
> - order = HPAGE_PMD_ORDER;
> + if (PageTransHuge(page)) {
> + if (!thp_migration_supported())
> + return NULL;
We may not need these two lines, since if thp_migration_supported() is
false, unmap_and_move() returns -ENOMEM in your code below, which has
the same effect as returning NULL here.
> gfp_mask |= GFP_TRANSHUGE;
> + order = HPAGE_PMD_ORDER;
> }
>
> if (PageHighMem(page) || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 7544ce4ef4dc..304f39b9aa5c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2425,6 +2425,12 @@ static void __split_huge_page_tail(struct page *head, int tail,
>
> page_tail->index = head->index + tail;
> page_cpupid_xchg_last(page_tail, page_cpupid_last(head));
> +
> + /*
> + * always add to the tail because some iterators expect new
> + * pages to show after the currently processed elements - e.g.
> + * migrate_pages
> + */
> lru_add_page_tail(head, page_tail, lruvec, list);
> }
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index d0856ab2f28d..ad0a84aa7b53 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1391,7 +1391,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
> if (isolate_huge_page(page, &source))
> move_pages -= 1 << compound_order(head);
> continue;
> - } else if (thp_migration_supported() && PageTransHuge(page))
> + } else if (PageTransHuge(page))
> pfn = page_to_pfn(compound_head(page))
> + hpage_nr_pages(page) - 1;
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index f604b22ebb65..49ecbb50b5f0 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -446,15 +446,6 @@ static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
> __split_huge_pmd(walk->vma, pmd, addr, false, NULL);
> goto out;
> }
> - if (!thp_migration_supported()) {
> - get_page(page);
> - spin_unlock(ptl);
> - lock_page(page);
> - ret = split_huge_page(page);
> - unlock_page(page);
> - put_page(page);
> - goto out;
> - }
> if (!queue_pages_required(page, qp)) {
> ret = 1;
> goto unlock;
> @@ -511,22 +502,6 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
> continue;
> if (!queue_pages_required(page, qp))
> continue;
> - if (PageTransCompound(page) && !thp_migration_supported()) {
> - get_page(page);
> - pte_unmap_unlock(pte, ptl);
> - lock_page(page);
> - ret = split_huge_page(page);
> - unlock_page(page);
> - put_page(page);
> - /* Failed to split -- skip. */
> - if (ret) {
> - pte = pte_offset_map_lock(walk->mm, pmd,
> - addr, &ptl);
> - continue;
> - }
> - goto retry;
> - }
> -
> migrate_page_add(page, qp->pagelist, flags);
> }
> pte_unmap_unlock(pte - 1, ptl);
> @@ -947,7 +922,7 @@ static struct page *new_node_page(struct page *page, unsigned long node, int **x
> if (PageHuge(page))
> return alloc_huge_page_node(page_hstate(compound_head(page)),
> node);
> - else if (thp_migration_supported() && PageTransHuge(page)) {
> + else if (PageTransHuge(page)) {
> struct page *thp;
>
> thp = alloc_pages_node(node,
> @@ -1123,7 +1098,7 @@ static struct page *new_page(struct page *page, unsigned long start, int **x)
> if (PageHuge(page)) {
> BUG_ON(!vma);
> return alloc_huge_page_noerr(vma, address, 1);
> - } else if (thp_migration_supported() && PageTransHuge(page)) {
> + } else if (PageTransHuge(page)) {
> struct page *thp;
>
> thp = alloc_hugepage_vma(GFP_TRANSHUGE, vma, address,
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 4d0be47a322a..ed21642a5c1d 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1139,6 +1139,9 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
> int *result = NULL;
> struct page *newpage;
>
> + if (!thp_migration_supported() && PageTransHuge(page))
> + return -ENOMEM;
> +
> newpage = get_new_page(page, private, &result);
> if (!newpage)
> return -ENOMEM;
> @@ -1160,14 +1163,6 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
> goto out;
> }
>
> - if (unlikely(PageTransHuge(page) && !PageTransHuge(newpage))) {
> - lock_page(page);
> - rc = split_huge_page(page);
> - unlock_page(page);
> - if (rc)
> - goto out;
> - }
> -
> rc = __unmap_and_move(page, newpage, force, mode);
> if (rc == MIGRATEPAGE_SUCCESS)
> set_page_owner_migrate_reason(newpage, reason);
> @@ -1395,6 +1390,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
> retry = 0;
>
> list_for_each_entry_safe(page, page2, from, lru) {
> +retry:
> cond_resched();
>
> if (PageHuge(page))
> @@ -1408,6 +1404,21 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>
> switch(rc) {
> case -ENOMEM:
> + /*
> + * THP migration might be unsupported or the
> + * allocation could've failed so we should
> + * retry on the same page with the THP split
> + * to base pages.
> + */
> + if (PageTransHuge(page)) {
> + lock_page(page);
> + rc = split_huge_page_to_list(page, from);
> + unlock_page(page);
> + if (!rc) {
> + list_safe_reset_next(page, page2, lru);
> + goto retry;
> + }
> + }
> nr_failed++;
> goto out;
> case -EAGAIN:
> @@ -1470,7 +1481,7 @@ static struct page *new_page_node(struct page *p, unsigned long private,
> if (PageHuge(p))
> return alloc_huge_page_node(page_hstate(compound_head(p)),
> pm->node);
> - else if (thp_migration_supported() && PageTransHuge(p)) {
> + else if (PageTransHuge(p)) {
> struct page *thp;
>
> thp = alloc_pages_node(pm->node,
> @@ -1517,8 +1528,6 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
>
> /* FOLL_DUMP to ignore special (like zero) pages */
> follflags = FOLL_GET | FOLL_DUMP;
> - if (!thp_migration_supported())
> - follflags |= FOLL_SPLIT;
> page = follow_page(vma, pp->addr, follflags);
>
> err = PTR_ERR(page);
Other than the concern about "getting migration results" and the two
lines of code I mentioned above, the rest of the code looks good to me.
--
Best Regards,
Yan Zi
On Thu 07-12-17 22:10:47, Zi Yan wrote:
> Hi Michal,
>
> Thanks for sending this out.
>
> Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > THP migration is hacked into the generic migration with rather
> > surprising semantic. The migration allocation callback is supposed to
> > check whether the THP can be migrated at once and if that is not the
> > case then it allocates a simple page to migrate. unmap_and_move then
> > fixes that up by spliting the THP into small pages while moving the
> > head page to the newly allocated order-0 page. Remaning pages are moved
> > to the LRU list by split_huge_page. The same happens if the THP
> > allocation fails. This is really ugly and error prone [1].
> >
> > I also believe that split_huge_page to the LRU lists is inherently
> > wrong because all tail pages are not migrated. Some callers will just
>
> I agree with you that we should try to migrate all tail pages if the THP
> needs to be split. But this might not be compatible with "getting
> migration results" in unmap_and_move(), since a caller of
> migrate_pages() may want to know the status of each page in the
> migration list via int **result in get_new_page() (e.g.
> new_page_node()). The caller has no idea whether a THP in its migration
> list will be split or not, thus, storing migration results might be
> quite tricky if tail pages are added into the migration list.
Ouch. I wasn't aware of this "beauty". I will try to wrap my head around
this code and think about what to do about it. Thanks for pointing me to
it.
> We need to consider this when we clean up migrate_pages().
>
[...]
> > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > index a2246cf670ba..ec9503e5f2c2 100644
> > --- a/include/linux/migrate.h
> > +++ b/include/linux/migrate.h
> > @@ -43,9 +43,11 @@ static inline struct page *new_page_nodemask(struct page *page,
> > return alloc_huge_page_nodemask(page_hstate(compound_head(page)),
> > preferred_nid, nodemask);
> >
> > - if (thp_migration_supported() && PageTransHuge(page)) {
> > - order = HPAGE_PMD_ORDER;
> > + if (PageTransHuge(page)) {
> > + if (!thp_migration_supported())
> > + return NULL;
> We may not need these two lines, since if thp_migration_supported() is
> false, unmap_and_move() returns -ENOMEM in your code below, which has
> the same result of returning NULL here.
yes, this is a leftover after the rebase. Originally I had
thp_migration_supported in the allocation callbacks but then moved it to
unmap_and_move to reduce code duplication; it also makes much more sense
to have this up in the migration layer. I've fixed this up in my local
copy now.
--
Michal Hocko
SUSE Labs
On Thu 07-12-17 15:34:01, Michal Hocko wrote:
> On Thu 07-12-17 22:10:47, Zi Yan wrote:
[...]
> > I agree with you that we should try to migrate all tail pages if the THP
> > needs to be split. But this might not be compatible with "getting
> > migration results" in unmap_and_move(), since a caller of
> > migrate_pages() may want to know the status of each page in the
> > migration list via int **result in get_new_page() (e.g.
> > new_page_node()). The caller has no idea whether a THP in its migration
> > list will be split or not, thus, storing migration results might be
> > quite tricky if tail pages are added into the migration list.
>
> Ouch. I wasn't aware of this "beauty". I will try to wrap my head around
> this code and think about what to do about it. Thanks for point me to
> it.
OK, so I was staring at this yesterday and concluded that the current
implementation of do_move_page_to_node_array is unfixable to work with
split_huge_page_to_list in migrate_pages. So I've reimplemented it to
avoid the quite ugly fixed size batching. Instead I am using dynamic
batching based on same-node requests. See patch 1 for more details
about the implementation. This will also allow us to remove the quite
ugly 'int **result' from the allocation callback. That is patch 2, and
patch 3 is finally the thp migration code.
The diffstat is quite supportive of this cleanup.
include/linux/migrate.h | 7 +-
include/linux/page-isolation.h | 3 +-
mm/compaction.c | 3 +-
mm/huge_memory.c | 6 +
mm/internal.h | 1 +
mm/memory_hotplug.c | 5 +-
mm/mempolicy.c | 40 +----
mm/migrate.c | 350 ++++++++++++++++++-----------------------
mm/page_isolation.c | 3 +-
9 files changed, 177 insertions(+), 241 deletions(-)
Does anybody see any issues with this approach?
On a side note:
There are some semantic issues I have encountered on the way but I am
not addressing them here. E.g. trying to move THP tail pages will result
in either success or EBUSY (the latter more likely once we isolate the
head from the LRU list). Hugetlb reports EACCES on tail pages.
Some errors are reported via the status parameter but migration failures
are not, even though the original `reason' argument suggests there was
an intention to do so. From a quick look into git history this never
worked. I have tried to keep the semantics unchanged.
Then there is a relatively minor thing that the page isolation might
fail because of pages not being on the LRU - e.g. because they are
sitting on the per-cpu LRU caches. Easily fixable.
From: Michal Hocko <[email protected]>
do_pages_move is supposed to move user defined memory (an array of
addresses) to user defined numa nodes (an array of nodes, one for
each address). The user provided status array then contains the
resulting numa node for each address, or an error. The semantics of this
function are a little bit confusing because only some errors are
reported back. Notably a migrate_pages error is only reported via the
return value. This patch doesn't try to address these semantic nuances
but rather changes the underlying implementation.
Currently we are processing the user input (which can be really large)
in batches which are stored in a temporarily allocated page. Each
address is resolved to its struct page and stored in a page_to_node
structure along with the requested target numa node. The array of these
structures is then conveyed down the page migration path via the private
argument. new_page_node then finds the corresponding structure and
allocates the proper target page.
What is the problem with the current implementation and why change
it? Apart from being quite ugly, it also doesn't cope with unexpected
pages showing up on the migration list inside the migrate_pages path.
That doesn't happen currently, but the follow up patch would like to
make the thp migration code clearer, and that would need to split a
THP into the list in some cases.
How does the new implementation work? Well, instead of batching into a
fixed size array we simply batch all pages that should be migrated to
the same node and isolate all of them into a linked list, which doesn't
require any additional storage. This should work reasonably well because
page migration usually migrates larger ranges of memory to a specific
node, so the common case should work equally well as the current
implementation. Even if somebody constructs an input where the target
numa nodes are interleaved, we shouldn't see a large performance
impact because page migration alone doesn't really benefit from
batching. mmap_sem batching for the lookup is quite questionable, and
isolate_lru_page, which would benefit from batching, is not using it
even in the current implementation.
Signed-off-by: Michal Hocko <[email protected]>
---
mm/internal.h | 1 +
mm/mempolicy.c | 5 +-
mm/migrate.c | 306 +++++++++++++++++++++++++--------------------------------
3 files changed, 139 insertions(+), 173 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index e6bd35182dae..1a1bb5d59c15 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -538,4 +538,5 @@ static inline bool is_migrate_highatomic_page(struct page *page)
}
void setup_zone_pageset(struct zone *zone);
+extern struct page *alloc_new_node_page(struct page *page, unsigned long node, int **x);
#endif /* __MM_INTERNAL_H */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f604b22ebb65..66c9c79b21be 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -942,7 +942,8 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
}
}
-static struct page *new_node_page(struct page *page, unsigned long node, int **x)
+/* page allocation callback for NUMA node migration */
+struct page *alloc_new_node_page(struct page *page, unsigned long node, int **x)
{
if (PageHuge(page))
return alloc_huge_page_node(page_hstate(compound_head(page)),
@@ -986,7 +987,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
flags | MPOL_MF_DISCONTIG_OK, &pagelist);
if (!list_empty(&pagelist)) {
- err = migrate_pages(&pagelist, new_node_page, NULL, dest,
+ err = migrate_pages(&pagelist, alloc_new_node_page, NULL, dest,
MIGRATE_SYNC, MR_SYSCALL);
if (err)
putback_movable_pages(&pagelist);
diff --git a/mm/migrate.c b/mm/migrate.c
index 4d0be47a322a..9d7252ea2acd 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1444,141 +1444,104 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
}
#ifdef CONFIG_NUMA
-/*
- * Move a list of individual pages
- */
-struct page_to_node {
- unsigned long addr;
- struct page *page;
- int node;
- int status;
-};
-static struct page *new_page_node(struct page *p, unsigned long private,
- int **result)
+static int store_status(int __user *status, int start, int value, int nr)
{
- struct page_to_node *pm = (struct page_to_node *)private;
-
- while (pm->node != MAX_NUMNODES && pm->page != p)
- pm++;
+ while (nr-- > 0) {
+ if (put_user(value, status + start))
+ return -EFAULT;
+ start++;
+ }
- if (pm->node == MAX_NUMNODES)
- return NULL;
+ return 0;
+}
- *result = &pm->status;
+static int do_move_pages_to_node(struct mm_struct *mm,
+ struct list_head *pagelist, int node)
+{
+ int err;
- if (PageHuge(p))
- return alloc_huge_page_node(page_hstate(compound_head(p)),
- pm->node);
- else if (thp_migration_supported() && PageTransHuge(p)) {
- struct page *thp;
+ if (list_empty(pagelist))
+ return 0;
- thp = alloc_pages_node(pm->node,
- (GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
- HPAGE_PMD_ORDER);
- if (!thp)
- return NULL;
- prep_transhuge_page(thp);
- return thp;
- } else
- return __alloc_pages_node(pm->node,
- GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
+ err = migrate_pages(pagelist, alloc_new_node_page, NULL, node,
+ MIGRATE_SYNC, MR_SYSCALL);
+ if (err)
+ putback_movable_pages(pagelist);
+ return err;
}
/*
- * Move a set of pages as indicated in the pm array. The addr
- * field must be set to the virtual address of the page to be moved
- * and the node number must contain a valid target node.
- * The pm array ends with node = MAX_NUMNODES.
+ * Resolves the given address to a struct page, isolates it from the LRU and
+ * puts it to the given pagelist.
+ * Returns -errno if the page cannot be found/isolated or 0 when it has been
+ * queued or the page doesn't need to be migrated because it is already on
+ * the target node
*/
-static int do_move_page_to_node_array(struct mm_struct *mm,
- struct page_to_node *pm,
- int migrate_all)
+static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
+ int node, struct list_head *pagelist, bool migrate_all)
{
+ struct vm_area_struct *vma;
+ struct page *page;
+ unsigned int follflags;
int err;
- struct page_to_node *pp;
- LIST_HEAD(pagelist);
down_read(&mm->mmap_sem);
+ err = -EFAULT;
+ vma = find_vma(mm, addr);
+ if (!vma || addr < vma->vm_start || !vma_migratable(vma))
+ goto out;
- /*
- * Build a list of pages to migrate
- */
- for (pp = pm; pp->node != MAX_NUMNODES; pp++) {
- struct vm_area_struct *vma;
- struct page *page;
- struct page *head;
- unsigned int follflags;
-
- err = -EFAULT;
- vma = find_vma(mm, pp->addr);
- if (!vma || pp->addr < vma->vm_start || !vma_migratable(vma))
- goto set_status;
-
- /* FOLL_DUMP to ignore special (like zero) pages */
- follflags = FOLL_GET | FOLL_DUMP;
- if (!thp_migration_supported())
- follflags |= FOLL_SPLIT;
- page = follow_page(vma, pp->addr, follflags);
+ /* FOLL_DUMP to ignore special (like zero) pages */
+ follflags = FOLL_GET | FOLL_DUMP;
+ if (!thp_migration_supported())
+ follflags |= FOLL_SPLIT;
+ page = follow_page(vma, addr, follflags);
- err = PTR_ERR(page);
- if (IS_ERR(page))
- goto set_status;
+ err = PTR_ERR(page);
+ if (IS_ERR(page))
+ goto out;
- err = -ENOENT;
- if (!page)
- goto set_status;
+ err = -ENOENT;
+ if (!page)
+ goto out;
- err = page_to_nid(page);
+ err = 0;
+ if (page_to_nid(page) == node)
+ goto out_putpage;
- if (err == pp->node)
- /*
- * Node already in the right place
- */
- goto put_and_set;
+ err = -EACCES;
+ if (page_mapcount(page) > 1 &&
+ !migrate_all)
+ goto out_putpage;
- err = -EACCES;
- if (page_mapcount(page) > 1 &&
- !migrate_all)
- goto put_and_set;
-
- if (PageHuge(page)) {
- if (PageHead(page)) {
- isolate_huge_page(page, &pagelist);
- err = 0;
- pp->page = page;
- }
- goto put_and_set;
+ if (PageHuge(page)) {
+ if (PageHead(page)) {
+ isolate_huge_page(page, pagelist);
+ err = 0;
}
+ } else {
+ struct page *head;
- pp->page = compound_head(page);
head = compound_head(page);
err = isolate_lru_page(head);
- if (!err) {
- list_add_tail(&head->lru, &pagelist);
- mod_node_page_state(page_pgdat(head),
- NR_ISOLATED_ANON + page_is_file_cache(head),
- hpage_nr_pages(head));
- }
-put_and_set:
- /*
- * Either remove the duplicate refcount from
- * isolate_lru_page() or drop the page ref if it was
- * not isolated.
- */
- put_page(page);
-set_status:
- pp->status = err;
- }
-
- err = 0;
- if (!list_empty(&pagelist)) {
- err = migrate_pages(&pagelist, new_page_node, NULL,
- (unsigned long)pm, MIGRATE_SYNC, MR_SYSCALL);
if (err)
- putback_movable_pages(&pagelist);
- }
+ goto out_putpage;
+ err = 0;
+ list_add_tail(&head->lru, pagelist);
+ mod_node_page_state(page_pgdat(head),
+ NR_ISOLATED_ANON + page_is_file_cache(head),
+ hpage_nr_pages(head));
+ }
+out_putpage:
+ /*
+ * Either remove the duplicate refcount from
+ * isolate_lru_page() or drop the page ref if it was
+ * not isolated.
+ */
+ put_page(page);
+out:
up_read(&mm->mmap_sem);
return err;
}
@@ -1593,79 +1556,80 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
const int __user *nodes,
int __user *status, int flags)
{
- struct page_to_node *pm;
- unsigned long chunk_nr_pages;
- unsigned long chunk_start;
- int err;
-
- err = -ENOMEM;
- pm = (struct page_to_node *)__get_free_page(GFP_KERNEL);
- if (!pm)
- goto out;
+ int chunk_node = NUMA_NO_NODE;
+ LIST_HEAD(pagelist);
+ int chunk_start, i;
+ int err = 0, err1;
migrate_prep();
- /*
- * Store a chunk of page_to_node array in a page,
- * but keep the last one as a marker
- */
- chunk_nr_pages = (PAGE_SIZE / sizeof(struct page_to_node)) - 1;
-
- for (chunk_start = 0;
- chunk_start < nr_pages;
- chunk_start += chunk_nr_pages) {
- int j;
-
- if (chunk_start + chunk_nr_pages > nr_pages)
- chunk_nr_pages = nr_pages - chunk_start;
-
- /* fill the chunk pm with addrs and nodes from user-space */
- for (j = 0; j < chunk_nr_pages; j++) {
- const void __user *p;
- int node;
-
- err = -EFAULT;
- if (get_user(p, pages + j + chunk_start))
- goto out_pm;
- pm[j].addr = (unsigned long) p;
-
- if (get_user(node, nodes + j + chunk_start))
- goto out_pm;
-
- err = -ENODEV;
- if (node < 0 || node >= MAX_NUMNODES)
- goto out_pm;
+ for (i = chunk_start = 0; i < nr_pages; i++) {
+ const void __user *p;
+ unsigned long addr;
+ int node;
- if (!node_state(node, N_MEMORY))
- goto out_pm;
-
- err = -EACCES;
- if (!node_isset(node, task_nodes))
- goto out_pm;
+ err = -EFAULT;
+ if (get_user(p, pages + i))
+ goto out_flush;
+ if (get_user(node, nodes + i))
+ goto out_flush;
+ addr = (unsigned long)p;
+
+ err = -ENODEV;
+ if (node < 0 || node >= MAX_NUMNODES)
+ goto out_flush;
+ if (!node_state(node, N_MEMORY))
+ goto out_flush;
- pm[j].node = node;
+ err = -EACCES;
+ if (!node_isset(node, task_nodes))
+ goto out_flush;
+
+ if (chunk_node == NUMA_NO_NODE) {
+ chunk_node = node;
+ chunk_start = i;
+ } else if (node != chunk_node) {
+ err = do_move_pages_to_node(mm, &pagelist, chunk_node);
+ if (err)
+ goto out;
+ err = store_status(status, chunk_start, chunk_node, i - chunk_start);
+ if (err)
+ goto out;
+ chunk_start = i;
+ chunk_node = node;
}
- /* End marker for this chunk */
- pm[chunk_nr_pages].node = MAX_NUMNODES;
+ /*
+ * Errors in the page lookup or isolation are not fatal and we simply
+ * report them via status
+ */
+ err = add_page_for_migration(mm, addr, chunk_node,
+ &pagelist, flags & MPOL_MF_MOVE_ALL);
+ if (!err)
+ continue;
- /* Migrate this chunk */
- err = do_move_page_to_node_array(mm, pm,
- flags & MPOL_MF_MOVE_ALL);
- if (err < 0)
- goto out_pm;
+ err = store_status(status, i, err, 1);
+ if (err)
+ goto out_flush;
- /* Return status information */
- for (j = 0; j < chunk_nr_pages; j++)
- if (put_user(pm[j].status, status + j + chunk_start)) {
- err = -EFAULT;
- goto out_pm;
- }
+ err = do_move_pages_to_node(mm, &pagelist, chunk_node);
+ if (err)
+ goto out;
+ if (i > chunk_start) {
+ err = store_status(status, chunk_start, chunk_node, i - chunk_start);
+ if (err)
+ goto out;
+ }
+ chunk_node = NUMA_NO_NODE;
}
err = 0;
-
-out_pm:
- free_page((unsigned long)pm);
+out_flush:
+ /* Make sure we do not overwrite the existing error */
+ err1 = do_move_pages_to_node(mm, &pagelist, chunk_node);
+ if (!err1)
+ err1 = store_status(status, chunk_start, chunk_node, i - chunk_start);
+ if (!err)
+ err = err1;
out:
return err;
}
--
2.15.0
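The per-node batching that do_pages_move switches to in the patch above is easier to see in isolation. Below is a hedged user-space sketch of the same control flow: requests are accumulated while the target node stays the same, and the batch is flushed when the node changes or the input ends. Only the happy path (no per-page lookup or isolation errors) is modeled, and the names pages_move_model and migrate_batch are illustrative stand-ins for do_move_pages_to_node() plus store_status(), not kernel API.

```c
#include <assert.h>

#define NODE_NONE (-1)

/*
 * Stand-in for do_move_pages_to_node() + store_status(): "migrate"
 * the half-open batch [start, end) and record the resulting node in
 * each status slot, mimicking the status-array contract.
 */
static void migrate_batch(int *status, int start, int end, int node)
{
	int i;

	for (i = start; i < end; i++)
		status[i] = node;	/* success: report the target node */
}

/* Process (index -> target node) requests like the reworked loop. */
static void pages_move_model(const int *nodes, int nr, int *status)
{
	int current_node = NODE_NONE;
	int start = 0;
	int i;

	for (i = 0; i < nr; i++) {
		if (current_node == NODE_NONE) {
			/* first request: open a new batch */
			current_node = nodes[i];
			start = i;
		} else if (nodes[i] != current_node) {
			/* target changed: flush what we collected so far */
			migrate_batch(status, start, i, current_node);
			start = i;
			current_node = nodes[i];
		}
	}
	/* final flush, the counterpart of the out_flush: label */
	if (current_node != NODE_NONE)
		migrate_batch(status, start, nr, current_node);
}
```

With input nodes {0, 0, 1, 1, 0} the model flushes three batches ([0,2) to node 0, [2,4) to node 1, [4,5) to node 0), so no fixed-size page_to_node array is ever needed.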
From: Michal Hocko <[email protected]>
No allocation callback is using this argument anymore. new_page_node
used to use this parameter to convey the node_id and, respectively, the
migration error up to the move_pages code (do_move_page_to_node_array).
The error status never made it into the final status field, and we now
have a better way to communicate the node id to the status field. All
other allocation callbacks simply ignored the argument, so we can
finally drop it.
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/migrate.h | 3 +--
include/linux/page-isolation.h | 3 +--
mm/compaction.c | 3 +--
mm/internal.h | 2 +-
mm/memory_hotplug.c | 3 +--
mm/mempolicy.c | 6 +++---
mm/migrate.c | 19 ++-----------------
mm/page_isolation.c | 3 +--
8 files changed, 11 insertions(+), 31 deletions(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index a2246cf670ba..e5d99ade2319 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -7,8 +7,7 @@
#include <linux/migrate_mode.h>
#include <linux/hugetlb.h>
-typedef struct page *new_page_t(struct page *page, unsigned long private,
- int **reason);
+typedef struct page *new_page_t(struct page *page, unsigned long private);
typedef void free_page_t(struct page *page, unsigned long private);
/*
diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index cdad58bbfd8b..4ae347cbc36d 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -63,7 +63,6 @@ undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
bool skip_hwpoisoned_pages);
-struct page *alloc_migrate_target(struct page *page, unsigned long private,
- int **resultp);
+struct page *alloc_migrate_target(struct page *page, unsigned long private);
#endif
diff --git a/mm/compaction.c b/mm/compaction.c
index 10cd757f1006..692d21d63391 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1165,8 +1165,7 @@ static void isolate_freepages(struct compact_control *cc)
* from the isolated freelists in the block we are migrating to.
*/
static struct page *compaction_alloc(struct page *migratepage,
- unsigned long data,
- int **result)
+ unsigned long data)
{
struct compact_control *cc = (struct compact_control *)data;
struct page *freepage;
diff --git a/mm/internal.h b/mm/internal.h
index 1a1bb5d59c15..502d14189794 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -538,5 +538,5 @@ static inline bool is_migrate_highatomic_page(struct page *page)
}
void setup_zone_pageset(struct zone *zone);
-extern struct page *alloc_new_node_page(struct page *page, unsigned long node, int **x);
+extern struct page *alloc_new_node_page(struct page *page, unsigned long node);
#endif /* __MM_INTERNAL_H */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d0856ab2f28d..d865623edee7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1347,8 +1347,7 @@ static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
return 0;
}
-static struct page *new_node_page(struct page *page, unsigned long private,
- int **result)
+static struct page *new_node_page(struct page *page, unsigned long private)
{
int nid = page_to_nid(page);
nodemask_t nmask = node_states[N_MEMORY];
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 66c9c79b21be..4d849d3098e5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -943,7 +943,7 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
}
/* page allocation callback for NUMA node migration */
-struct page *alloc_new_node_page(struct page *page, unsigned long node, int **x)
+struct page *alloc_new_node_page(struct page *page, unsigned long node)
{
if (PageHuge(page))
return alloc_huge_page_node(page_hstate(compound_head(page)),
@@ -1108,7 +1108,7 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
* list of pages handed to migrate_pages()--which is how we get here--
* is in virtual address order.
*/
-static struct page *new_page(struct page *page, unsigned long start, int **x)
+static struct page *new_page(struct page *page, unsigned long start)
{
struct vm_area_struct *vma;
unsigned long uninitialized_var(address);
@@ -1153,7 +1153,7 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
return -ENOSYS;
}
-static struct page *new_page(struct page *page, unsigned long start, int **x)
+static struct page *new_page(struct page *page, unsigned long start)
{
return NULL;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index 9d7252ea2acd..f9235f0155a4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1136,10 +1136,9 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
enum migrate_reason reason)
{
int rc = MIGRATEPAGE_SUCCESS;
- int *result = NULL;
struct page *newpage;
- newpage = get_new_page(page, private, &result);
+ newpage = get_new_page(page, private);
if (!newpage)
return -ENOMEM;
@@ -1230,12 +1229,6 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
put_page(newpage);
}
- if (result) {
- if (rc)
- *result = rc;
- else
- *result = page_to_nid(newpage);
- }
return rc;
}
@@ -1263,7 +1256,6 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
enum migrate_mode mode, int reason)
{
int rc = -EAGAIN;
- int *result = NULL;
int page_was_mapped = 0;
struct page *new_hpage;
struct anon_vma *anon_vma = NULL;
@@ -1280,7 +1272,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
return -ENOSYS;
}
- new_hpage = get_new_page(hpage, private, &result);
+ new_hpage = get_new_page(hpage, private);
if (!new_hpage)
return -ENOMEM;
@@ -1345,12 +1337,6 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
else
putback_active_hugepage(new_hpage);
- if (result) {
- if (rc)
- *result = rc;
- else
- *result = page_to_nid(new_hpage);
- }
return rc;
}
@@ -1622,7 +1608,6 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
}
chunk_node = NUMA_NO_NODE;
}
- err = 0;
out_flush:
/* Make sure we do not overwrite the existing error */
err1 = do_move_pages_to_node(mm, &pagelist, chunk_node);
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 165ed8117bd1..53d801235e22 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -293,8 +293,7 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
return pfn < end_pfn ? -EBUSY : 0;
}
-struct page *alloc_migrate_target(struct page *page, unsigned long private,
- int **resultp)
+struct page *alloc_migrate_target(struct page *page, unsigned long private)
{
return new_page_nodemask(page, numa_node_id(), &node_states[N_MEMORY]);
}
--
2.15.0
From: Michal Hocko <[email protected]>
THP migration is hacked into the generic migration path with rather
surprising semantics. The migration allocation callback is supposed to
check whether the THP can be migrated at once and, if that is not the
case, it allocates a simple page to migrate instead. unmap_and_move then
fixes that up by splitting the THP into small pages while moving the
head page to the newly allocated order-0 page. The remaining pages are
moved to the LRU list by split_huge_page. The same happens if the THP
allocation fails. This is really ugly and error prone [1].
I also believe that splitting to the LRU lists via split_huge_page is
inherently wrong because the tail pages are not migrated. Some callers
work around that by retrying (e.g. memory hotplug), but other pfn
walkers are simply broken. E.g. madvise_inject_error migrates the head
and then advances to the next pfn by the huge page size.
do_move_page_to_node_array and queue_pages_range (migrate_pages, mbind)
simply split the THP before migration if THP migration is not supported
and then fall back to single page migration, but they do not handle the
tail pages if the THP migration path is unable to allocate a fresh
THP, so we end up with ENOMEM and fail the whole migration, which is
questionable behavior. Page compaction doesn't try to migrate large
pages so it should be immune.
This patch tries to unclutter the situation by moving the special THP
handling up to the migrate_pages layer, where it actually belongs. We
simply split the THP into the existing list if unmap_and_move fails
with ENOMEM and retry. This way we _always_ migrate all THP subpages
and specific migrate_pages users do not have to deal with this case in
a special way.
[1] http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/migrate.h | 4 ++--
mm/huge_memory.c | 6 ++++++
mm/memory_hotplug.c | 2 +-
mm/mempolicy.c | 31 +++----------------------------
mm/migrate.c | 29 +++++++++++++++++++----------
5 files changed, 31 insertions(+), 41 deletions(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index e5d99ade2319..0c6fe904bc97 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -42,9 +42,9 @@ static inline struct page *new_page_nodemask(struct page *page,
return alloc_huge_page_nodemask(page_hstate(compound_head(page)),
preferred_nid, nodemask);
- if (thp_migration_supported() && PageTransHuge(page)) {
- order = HPAGE_PMD_ORDER;
+ if (PageTransHuge(page)) {
gfp_mask |= GFP_TRANSHUGE;
+ order = HPAGE_PMD_ORDER;
}
if (PageHighMem(page) || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7544ce4ef4dc..8865906c248c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2425,6 +2425,12 @@ static void __split_huge_page_tail(struct page *head, int tail,
page_tail->index = head->index + tail;
page_cpupid_xchg_last(page_tail, page_cpupid_last(head));
+
+ /*
+ * always add to the tail because some iterators expect new
+ * pages to show after the currently processed elements - e.g.
+ * migrate_pages
+ */
lru_add_page_tail(head, page_tail, lruvec, list);
}
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d865623edee7..442e63a2cf72 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1390,7 +1390,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
if (isolate_huge_page(page, &source))
move_pages -= 1 << compound_order(head);
continue;
- } else if (thp_migration_supported() && PageTransHuge(page))
+ } else if (PageTransHuge(page))
pfn = page_to_pfn(compound_head(page))
+ hpage_nr_pages(page) - 1;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4d849d3098e5..b6f4fcf9df64 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -446,15 +446,6 @@ static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
__split_huge_pmd(walk->vma, pmd, addr, false, NULL);
goto out;
}
- if (!thp_migration_supported()) {
- get_page(page);
- spin_unlock(ptl);
- lock_page(page);
- ret = split_huge_page(page);
- unlock_page(page);
- put_page(page);
- goto out;
- }
if (!queue_pages_required(page, qp)) {
ret = 1;
goto unlock;
@@ -495,7 +486,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
if (pmd_trans_unstable(pmd))
return 0;
-retry:
+
pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
for (; addr != end; pte++, addr += PAGE_SIZE) {
if (!pte_present(*pte))
@@ -511,22 +502,6 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
continue;
if (!queue_pages_required(page, qp))
continue;
- if (PageTransCompound(page) && !thp_migration_supported()) {
- get_page(page);
- pte_unmap_unlock(pte, ptl);
- lock_page(page);
- ret = split_huge_page(page);
- unlock_page(page);
- put_page(page);
- /* Failed to split -- skip. */
- if (ret) {
- pte = pte_offset_map_lock(walk->mm, pmd,
- addr, &ptl);
- continue;
- }
- goto retry;
- }
-
migrate_page_add(page, qp->pagelist, flags);
}
pte_unmap_unlock(pte - 1, ptl);
@@ -948,7 +923,7 @@ struct page *alloc_new_node_page(struct page *page, unsigned long node)
if (PageHuge(page))
return alloc_huge_page_node(page_hstate(compound_head(page)),
node);
- else if (thp_migration_supported() && PageTransHuge(page)) {
+ else if (PageTransHuge(page)) {
struct page *thp;
thp = alloc_pages_node(node,
@@ -1124,7 +1099,7 @@ static struct page *new_page(struct page *page, unsigned long start)
if (PageHuge(page)) {
BUG_ON(!vma);
return alloc_huge_page_noerr(vma, address, 1);
- } else if (thp_migration_supported() && PageTransHuge(page)) {
+ } else if (PageTransHuge(page)) {
struct page *thp;
thp = alloc_hugepage_vma(GFP_TRANSHUGE, vma, address,
diff --git a/mm/migrate.c b/mm/migrate.c
index f9235f0155a4..dc5df5fe5c82 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1138,6 +1138,9 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
int rc = MIGRATEPAGE_SUCCESS;
struct page *newpage;
+ if (!thp_migration_supported() && PageTransHuge(page))
+ return -ENOMEM;
+
newpage = get_new_page(page, private);
if (!newpage)
return -ENOMEM;
@@ -1159,14 +1162,6 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
goto out;
}
- if (unlikely(PageTransHuge(page) && !PageTransHuge(newpage))) {
- lock_page(page);
- rc = split_huge_page(page);
- unlock_page(page);
- if (rc)
- goto out;
- }
-
rc = __unmap_and_move(page, newpage, force, mode);
if (rc == MIGRATEPAGE_SUCCESS)
set_page_owner_migrate_reason(newpage, reason);
@@ -1381,6 +1376,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
retry = 0;
list_for_each_entry_safe(page, page2, from, lru) {
+retry:
cond_resched();
if (PageHuge(page))
@@ -1394,6 +1390,21 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
switch(rc) {
case -ENOMEM:
+ /*
+ * THP migration might be unsupported or the
+ * allocation could've failed so we should
+ * retry on the same page with the THP split
+ * to base pages.
+ */
+ if (PageTransHuge(page)) {
+ lock_page(page);
+ rc = split_huge_page_to_list(page, from);
+ unlock_page(page);
+ if (!rc) {
+ list_safe_reset_next(page, page2, lru);
+ goto retry;
+ }
+ }
nr_failed++;
goto out;
case -EAGAIN:
@@ -1480,8 +1491,6 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
/* FOLL_DUMP to ignore special (like zero) pages */
follflags = FOLL_GET | FOLL_DUMP;
- if (!thp_migration_supported())
- follflags |= FOLL_SPLIT;
page = follow_page(vma, addr, follflags);
err = PTR_ERR(page);
--
2.15.0
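The split-and-retry behavior this patch adds to migrate_pages() can be modeled in user space with an ordinary linked list. This is a hedged sketch, not kernel code: try_migrate() stands in for unmap_and_move() on a kernel without THP migration support, and split() mimics split_huge_page_to_list() with a single tail page for simplicity, inserting it right after the current element, which is why the list walker picks the subpages up on the same pass. In the real code the split itself can fail, in which case the page simply counts as failed.

```c
#include <assert.h>
#include <stdlib.h>

#define MODEL_ENOMEM 12	/* stand-in for the kernel's ENOMEM */

struct fake_page {
	int huge;		/* models PageTransHuge() */
	struct fake_page *next;
};

/*
 * Models unmap_and_move() when THP migration is unsupported:
 * base pages migrate fine, huge pages fail with -ENOMEM.
 */
static int try_migrate(struct fake_page *p, int *migrated)
{
	if (p->huge)
		return -MODEL_ENOMEM;
	(*migrated)++;
	return 0;
}

/*
 * Models split_huge_page_to_list(): the head becomes a base page and
 * its (single, for simplicity) tail is inserted right after it, just
 * as __split_huge_page_tail() adds tails after the element currently
 * being processed so the walker sees them.
 */
static void split(struct fake_page *p)
{
	struct fake_page *tail = malloc(sizeof(*tail));

	if (!tail)
		abort();	/* sketch only: no error handling */
	tail->huge = 0;
	tail->next = p->next;
	p->huge = 0;
	p->next = tail;
}

/* Models the migrate_pages() loop with the new split-and-retry. */
static int migrate_list(struct fake_page *head)
{
	struct fake_page *p;
	int migrated = 0;

	for (p = head; p; p = p->next) {
retry:
		if (try_migrate(p, &migrated) == -MODEL_ENOMEM) {
			split(p);	/* subpages stay on the same list */
			goto retry;	/* retry the now-split head page */
		}
	}
	return migrated;
}
```

For a three-entry list with a huge page in the middle, four base pages end up migrated: the two original base pages plus the split head and tail, so no subpage is ever silently left behind.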
On Fri, Dec 08, 2017 at 05:15:57PM +0100, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> do_pages_move is supposed to move user defined memory (an array of
> addresses) to the user defined numa nodes (an array of nodes one for
> each address). The user provided status array then contains resulting
> numa node for each address or an error. The semantic of this function is
> little bit confusing because only some errors are reported back. Notably
> migrate_pages error is only reported via the return value. This patch
> doesn't try to address these semantic nuances but rather change the
> underlying implementation.
>
> Currently we are processing user input (which can be really large)
> in batches which are stored to a temporarily allocated page. Each
> address is resolved to its struct page and stored to page_to_node
> structure along with the requested target numa node. The array of these
> structures is then conveyed down the page migration path via private
> argument. new_page_node then finds the corresponding structure and
> allocates the proper target page.
>
> What is the problem with the current implementation and why to change
> it? Apart from being quite ugly it also doesn't cope with unexpected
> pages showing up on the migration list inside migrate_pages path.
> That doesn't happen currently but the follow up patch would like to
> make the thp migration code more clear and that would need to split a
> THP into the list for some cases.
>
> How does the new implementation work? Well, instead of batching into a
> fixed size array we simply batch all pages that should be migrated to
> the same node and isolate all of them into a linked list which doesn't
> require any additional storage. This should work reasonably well because
> page migration usually migrates larger ranges of memory to a specific
> node. So the common case should work equally well as the current
> implementation. Even if somebody constructs an input where the target
> numa nodes would be interleaved we shouldn't see a large performance
> impact because page migration alone doesn't really benefit from
> batching. mmap_sem batching for the lookup is quite questionable and
> isolate_lru_page which would benefit from batching is not using it even
> in the current implementation.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> mm/internal.h | 1 +
> mm/mempolicy.c | 5 +-
> mm/migrate.c | 306 +++++++++++++++++++++++++--------------------------------
> 3 files changed, 139 insertions(+), 173 deletions(-)
The approach looks fine to me.
But the patch is rather large and hard to review. And the way git mixed the
added/removed lines doesn't help either. Any chance to split it up further?
One nitpick: I don't think 'chunk' terminology should go away with the
patch.
--
Kirill A. Shutemov
On Wed 13-12-17 15:07:33, Kirill A. Shutemov wrote:
[...]
> The approach looks fine to me.
>
> But patch is rather large and hard to review. And how git mixed add/remove
> lines doesn't help too. Any chance to split it up further?
I was trying to do that, but this is a drop-in replacement so it is quite
hard to do in smaller pieces. I've already put the allocation callback
cleanup into a separate patch, but that is about all I figured out how to
split. If you have any suggestions I am willing to try them out.
> One nitpick: I don't think 'chunk' terminology should go away with the
> patch.
Not sure what you mean here. I have kept chunk_start and chunk_node, so I
am not really changing that terminology.
Thanks!
--
Michal Hocko
SUSE Labs
On Fri, Dec 08, 2017 at 05:15:59PM +0100, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> THP migration is hacked into the generic migration with rather
> surprising semantic. The migration allocation callback is supposed to
> check whether the THP can be migrated at once and if that is not the
> case then it allocates a simple page to migrate. unmap_and_move then
> fixes that up by spliting the THP into small pages while moving the
> head page to the newly allocated order-0 page. Remaning pages are moved
> to the LRU list by split_huge_page. The same happens if the THP
> allocation fails. This is really ugly and error prone [1].
>
> I also believe that split_huge_page to the LRU lists is inherently
> wrong because all tail pages are not migrated. Some callers will just
> work around that by retrying (e.g. memory hotplug). There are other
> pfn walkers which are simply broken though. e.g. madvise_inject_error
> will migrate head and then advances next pfn by the huge page size.
> do_move_page_to_node_array, queue_pages_range (migrate_pages, mbind),
> will simply split the THP before migration if the THP migration is not
> supported then falls back to single page migration but it doesn't handle
> tail pages if the THP migration path is not able to allocate a fresh
> THP so we end up with ENOMEM and fail the whole migration which is
> a questionable behavior. Page compaction doesn't try to migrate large
> pages so it should be immune.
>
> This patch tries to unclutter the situation by moving the special THP
> handling up to the migrate_pages layer where it actually belongs. We
> simply split the THP page into the existing list if unmap_and_move fails
> with ENOMEM and retry. So we will _always_ migrate all THP subpages and
> specific migrate_pages users do not have to deal with this case in a
> special way.
>
> [1] http://lkml.kernel.org/r/[email protected]
>
> Signed-off-by: Michal Hocko <[email protected]>
Looks good to me.
Acked-by: Kirill A. Shutemov <[email protected]>
--
Kirill A. Shutemov
On Wed, Dec 13, 2017 at 01:17:03PM +0100, Michal Hocko wrote:
> On Wed 13-12-17 15:07:33, Kirill A. Shutemov wrote:
> [...]
> > The approach looks fine to me.
> >
> > But patch is rather large and hard to review. And how git mixed add/remove
> > lines doesn't help too. Any chance to split it up further?
>
> I was trying to do that but this is a drop in replacement so it is quite
> hard to do in smaller pieces. I've already put the allocation callback
> cleanup into a separate one but this is about all that I figured how to
> split. If you have any suggestions I am willing to try them out.
"git diff --patience" seems generate more readable output for the patch.
> > One nitpick: I don't think 'chunk' terminology should go away with the
> > patch.
>
> Not sure what you mean here. I have kept chunk_start, chunk_node, so I
> am not really changing that terminology
We don't really have chunks anymore, right? We still *may* have per-node
batching, but..
Maybe just 'start' and 'current_node'?
--
Kirill A. Shutemov
On Wed 13-12-17 15:47:31, Kirill A. Shutemov wrote:
> On Wed, Dec 13, 2017 at 01:17:03PM +0100, Michal Hocko wrote:
> > On Wed 13-12-17 15:07:33, Kirill A. Shutemov wrote:
> > [...]
> > > The approach looks fine to me.
> > >
> > > But patch is rather large and hard to review. And how git mixed add/remove
> > > lines doesn't help too. Any chance to split it up further?
> >
> > I was trying to do that but this is a drop in replacement so it is quite
> > hard to do in smaller pieces. I've already put the allocation callback
> > cleanup into a separate one but this is about all that I figured how to
> > split. If you have any suggestions I am willing to try them out.
>
> "git diff --patience" seems generate more readable output for the patch.
Hmm, I wasn't aware of this option. Are you suggesting I should use it
to generate the patch I send?
> > > One nitpick: I don't think 'chunk' terminology should go away with the
> > > patch.
> >
> > Not sure what you mean here. I have kept chunk_start, chunk_node, so I
> > am not really changing that terminology
>
> We don't really have chunks anymore, right? We still *may* have per-node
> batching, but..
>
> Maybe just 'start' and 'current_node'?
Ohh, I had read your response as saying you wanted to preserve the naming.
I can certainly do the rename.
---
diff --git a/mm/migrate.c b/mm/migrate.c
index 9d7252ea2acd..5491045b60f9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1556,14 +1556,14 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
const int __user *nodes,
int __user *status, int flags)
{
- int chunk_node = NUMA_NO_NODE;
+ int current_node = NUMA_NO_NODE;
LIST_HEAD(pagelist);
- int chunk_start, i;
+ int start, i;
int err = 0, err1;
migrate_prep();
- for (i = chunk_start = 0; i < nr_pages; i++) {
+ for (i = start = 0; i < nr_pages; i++) {
const void __user *p;
unsigned long addr;
int node;
@@ -1585,25 +1585,25 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
if (!node_isset(node, task_nodes))
goto out_flush;
- if (chunk_node == NUMA_NO_NODE) {
- chunk_node = node;
- chunk_start = i;
- } else if (node != chunk_node) {
- err = do_move_pages_to_node(mm, &pagelist, chunk_node);
+ if (current_node == NUMA_NO_NODE) {
+ current_node = node;
+ start = i;
+ } else if (node != current_node) {
+ err = do_move_pages_to_node(mm, &pagelist, current_node);
if (err)
goto out;
- err = store_status(status, chunk_start, chunk_node, i - chunk_start);
+ err = store_status(status, start, current_node, i - start);
if (err)
goto out;
- chunk_start = i;
- chunk_node = node;
+ start = i;
+ current_node = node;
}
/*
* Errors in the page lookup or isolation are not fatal and we simply
* report them via status
*/
- err = add_page_for_migration(mm, addr, chunk_node,
+ err = add_page_for_migration(mm, addr, current_node,
&pagelist, flags & MPOL_MF_MOVE_ALL);
if (!err)
continue;
@@ -1612,22 +1612,22 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
if (err)
goto out_flush;
- err = do_move_pages_to_node(mm, &pagelist, chunk_node);
+ err = do_move_pages_to_node(mm, &pagelist, current_node);
if (err)
goto out;
- if (i > chunk_start) {
- err = store_status(status, chunk_start, chunk_node, i - chunk_start);
+ if (i > start) {
+ err = store_status(status, start, current_node, i - start);
if (err)
goto out;
}
- chunk_node = NUMA_NO_NODE;
+ current_node = NUMA_NO_NODE;
}
err = 0;
out_flush:
/* Make sure we do not overwrite the existing error */
- err1 = do_move_pages_to_node(mm, &pagelist, chunk_node);
+ err1 = do_move_pages_to_node(mm, &pagelist, current_node);
if (!err1)
- err1 = store_status(status, chunk_start, chunk_node, i - chunk_start);
+ err1 = store_status(status, start, current_node, i - start);
if (!err)
err = err1;
out:
--
Michal Hocko
SUSE Labs
On Wed, Dec 13, 2017 at 03:10:39PM +0100, Michal Hocko wrote:
> On Wed 13-12-17 15:47:31, Kirill A. Shutemov wrote:
> > On Wed, Dec 13, 2017 at 01:17:03PM +0100, Michal Hocko wrote:
> > > On Wed 13-12-17 15:07:33, Kirill A. Shutemov wrote:
> > > [...]
> > > > The approach looks fine to me.
> > > >
> > > > But patch is rather large and hard to review. And how git mixed add/remove
> > > > lines doesn't help too. Any chance to split it up further?
> > >
> > > I was trying to do that but this is a drop in replacement so it is quite
> > > hard to do in smaller pieces. I've already put the allocation callback
> > > cleanup into a separate one but this is about all that I figured how to
> > > split. If you have any suggestions I am willing to try them out.
> >
> > "git diff --patience" seems generate more readable output for the patch.
>
> Hmm, I wasn't aware of this option. Are you suggesting I should use it
> to general the patch to send?
I don't know if it's better in general (it's not the default, after all),
but it seems to help for this particular case.
>
> > > > One nitpick: I don't think 'chunk' terminology should go away with the
> > > > patch.
> > >
> > > Not sure what you mean here. I have kept chunk_start, chunk_node, so I
> > > am not really changing that terminology
> >
> > We don't really have chunks anymore, right? We still *may* have per-node
> > batching, but..
> >
> > Maybe just 'start' and 'current_node'?
>
> Ohh, I've read your response that you want to preserve the naming. I can
> certainly do the rename.
Yep, that's better.
--
Kirill A. Shutemov
This patch has been generated with the --patience parameter, as suggested by
Kirill, and it really seems to produce a more compact diff.
---
From 1f529769d099ca605888b29059014e7c8f0bfd50 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Fri, 8 Dec 2017 12:28:34 +0100
Subject: [PATCH] mm, numa: rework do_pages_move
do_pages_move is supposed to move user defined memory (an array of
addresses) to user defined numa nodes (an array of nodes, one for each
address). The user provided status array then contains the resulting
numa node for each address or an error. The semantic of this function is
a little bit confusing because only some errors are reported back.
Notably a migrate_pages error is only reported via the return value. This
patch doesn't try to address these semantic nuances but rather changes
the underlying implementation.
Currently we are processing user input (which can be really large)
in batches which are stored in a temporarily allocated page. Each
address is resolved to its struct page and stored in a page_to_node
structure along with the requested target numa node. The array of these
structures is then conveyed down the page migration path via the private
argument. new_page_node then finds the corresponding structure and
allocates the proper target page.
What is the problem with the current implementation and why change
it? Apart from being quite ugly it also doesn't cope with unexpected
pages showing up on the migration list inside the migrate_pages path.
That doesn't happen currently, but the follow-up patch would like to
make the thp migration code clearer and that would need to split a
THP into the list in some cases.
How does the new implementation work? Well, instead of batching into a
fixed-size array we simply batch all pages that should be migrated to
the same node and isolate all of them into a linked list, which doesn't
require any additional storage. This should work reasonably well because
page migration usually migrates larger ranges of memory to a specific
node. So the common case should work as well as the current
implementation. Even if somebody constructs an input where the target
numa nodes are interleaved we shouldn't see a large performance
impact because page migration alone doesn't really benefit from
batching. mmap_sem batching for the lookup is quite questionable and
isolate_lru_page, which would benefit from batching, is not using it even
in the current implementation.
Signed-off-by: Michal Hocko <[email protected]>
---
mm/internal.h | 1 +
mm/mempolicy.c | 5 +-
mm/migrate.c | 340 ++++++++++++++++++++++++++-------------------------------
3 files changed, 156 insertions(+), 190 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index e6bd35182dae..1a1bb5d59c15 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -538,4 +538,5 @@ static inline bool is_migrate_highatomic_page(struct page *page)
}
void setup_zone_pageset(struct zone *zone);
+extern struct page *alloc_new_node_page(struct page *page, unsigned long node, int **x);
#endif /* __MM_INTERNAL_H */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f604b22ebb65..66c9c79b21be 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -942,7 +942,8 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
}
}
-static struct page *new_node_page(struct page *page, unsigned long node, int **x)
+/* page allocation callback for NUMA node migration */
+struct page *alloc_new_node_page(struct page *page, unsigned long node, int **x)
{
if (PageHuge(page))
return alloc_huge_page_node(page_hstate(compound_head(page)),
@@ -986,7 +987,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
flags | MPOL_MF_DISCONTIG_OK, &pagelist);
if (!list_empty(&pagelist)) {
- err = migrate_pages(&pagelist, new_node_page, NULL, dest,
+ err = migrate_pages(&pagelist, alloc_new_node_page, NULL, dest,
MIGRATE_SYNC, MR_SYSCALL);
if (err)
putback_movable_pages(&pagelist);
diff --git a/mm/migrate.c b/mm/migrate.c
index 4d0be47a322a..9d7252ea2acd 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1444,141 +1444,104 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
}
#ifdef CONFIG_NUMA
+
+static int store_status(int __user *status, int start, int value, int nr)
+{
+ while (nr-- > 0) {
+ if (put_user(value, status + start))
+ return -EFAULT;
+ start++;
+ }
+
+ return 0;
+}
+
+static int do_move_pages_to_node(struct mm_struct *mm,
+ struct list_head *pagelist, int node)
+{
+ int err;
+
+ if (list_empty(pagelist))
+ return 0;
+
+ err = migrate_pages(pagelist, alloc_new_node_page, NULL, node,
+ MIGRATE_SYNC, MR_SYSCALL);
+ if (err)
+ putback_movable_pages(pagelist);
+ return err;
+}
+
/*
- * Move a list of individual pages
+ * Resolves the given address to a struct page, isolates it from the LRU and
+ * puts it to the given pagelist.
+ * Returns -errno if the page cannot be found/isolated or 0 when it has been
+ * queued or the page doesn't need to be migrated because it is already on
+ * the target node
*/
-struct page_to_node {
- unsigned long addr;
+static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
+ int node, struct list_head *pagelist, bool migrate_all)
+{
+ struct vm_area_struct *vma;
struct page *page;
- int node;
- int status;
-};
-
-static struct page *new_page_node(struct page *p, unsigned long private,
- int **result)
-{
- struct page_to_node *pm = (struct page_to_node *)private;
-
- while (pm->node != MAX_NUMNODES && pm->page != p)
- pm++;
-
- if (pm->node == MAX_NUMNODES)
- return NULL;
-
- *result = &pm->status;
-
- if (PageHuge(p))
- return alloc_huge_page_node(page_hstate(compound_head(p)),
- pm->node);
- else if (thp_migration_supported() && PageTransHuge(p)) {
- struct page *thp;
-
- thp = alloc_pages_node(pm->node,
- (GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
- HPAGE_PMD_ORDER);
- if (!thp)
- return NULL;
- prep_transhuge_page(thp);
- return thp;
- } else
- return __alloc_pages_node(pm->node,
- GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
-}
-
-/*
- * Move a set of pages as indicated in the pm array. The addr
- * field must be set to the virtual address of the page to be moved
- * and the node number must contain a valid target node.
- * The pm array ends with node = MAX_NUMNODES.
- */
-static int do_move_page_to_node_array(struct mm_struct *mm,
- struct page_to_node *pm,
- int migrate_all)
-{
+ unsigned int follflags;
int err;
- struct page_to_node *pp;
- LIST_HEAD(pagelist);
down_read(&mm->mmap_sem);
+ err = -EFAULT;
+ vma = find_vma(mm, addr);
+ if (!vma || addr < vma->vm_start || !vma_migratable(vma))
+ goto out;
- /*
- * Build a list of pages to migrate
- */
- for (pp = pm; pp->node != MAX_NUMNODES; pp++) {
- struct vm_area_struct *vma;
- struct page *page;
+ /* FOLL_DUMP to ignore special (like zero) pages */
+ follflags = FOLL_GET | FOLL_DUMP;
+ if (!thp_migration_supported())
+ follflags |= FOLL_SPLIT;
+ page = follow_page(vma, addr, follflags);
+
+ err = PTR_ERR(page);
+ if (IS_ERR(page))
+ goto out;
+
+ err = -ENOENT;
+ if (!page)
+ goto out;
+
+ err = 0;
+ if (page_to_nid(page) == node)
+ goto out_putpage;
+
+ err = -EACCES;
+ if (page_mapcount(page) > 1 &&
+ !migrate_all)
+ goto out_putpage;
+
+ if (PageHuge(page)) {
+ if (PageHead(page)) {
+ isolate_huge_page(page, pagelist);
+ err = 0;
+ }
+ } else {
struct page *head;
- unsigned int follflags;
- err = -EFAULT;
- vma = find_vma(mm, pp->addr);
- if (!vma || pp->addr < vma->vm_start || !vma_migratable(vma))
- goto set_status;
-
- /* FOLL_DUMP to ignore special (like zero) pages */
- follflags = FOLL_GET | FOLL_DUMP;
- if (!thp_migration_supported())
- follflags |= FOLL_SPLIT;
- page = follow_page(vma, pp->addr, follflags);
-
- err = PTR_ERR(page);
- if (IS_ERR(page))
- goto set_status;
-
- err = -ENOENT;
- if (!page)
- goto set_status;
-
- err = page_to_nid(page);
-
- if (err == pp->node)
- /*
- * Node already in the right place
- */
- goto put_and_set;
-
- err = -EACCES;
- if (page_mapcount(page) > 1 &&
- !migrate_all)
- goto put_and_set;
-
- if (PageHuge(page)) {
- if (PageHead(page)) {
- isolate_huge_page(page, &pagelist);
- err = 0;
- pp->page = page;
- }
- goto put_and_set;
- }
-
- pp->page = compound_head(page);
head = compound_head(page);
err = isolate_lru_page(head);
- if (!err) {
- list_add_tail(&head->lru, &pagelist);
- mod_node_page_state(page_pgdat(head),
- NR_ISOLATED_ANON + page_is_file_cache(head),
- hpage_nr_pages(head));
- }
-put_and_set:
- /*
- * Either remove the duplicate refcount from
- * isolate_lru_page() or drop the page ref if it was
- * not isolated.
- */
- put_page(page);
-set_status:
- pp->status = err;
- }
-
- err = 0;
- if (!list_empty(&pagelist)) {
- err = migrate_pages(&pagelist, new_page_node, NULL,
- (unsigned long)pm, MIGRATE_SYNC, MR_SYSCALL);
if (err)
- putback_movable_pages(&pagelist);
- }
+ goto out_putpage;
+ err = 0;
+ list_add_tail(&head->lru, pagelist);
+ mod_node_page_state(page_pgdat(head),
+ NR_ISOLATED_ANON + page_is_file_cache(head),
+ hpage_nr_pages(head));
+ }
+out_putpage:
+ /*
+ * Either remove the duplicate refcount from
+ * isolate_lru_page() or drop the page ref if it was
+ * not isolated.
+ */
+ put_page(page);
+out:
up_read(&mm->mmap_sem);
return err;
}
@@ -1593,79 +1556,80 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
const int __user *nodes,
int __user *status, int flags)
{
- struct page_to_node *pm;
- unsigned long chunk_nr_pages;
- unsigned long chunk_start;
- int err;
-
- err = -ENOMEM;
- pm = (struct page_to_node *)__get_free_page(GFP_KERNEL);
- if (!pm)
- goto out;
+ int chunk_node = NUMA_NO_NODE;
+ LIST_HEAD(pagelist);
+ int chunk_start, i;
+ int err = 0, err1;
migrate_prep();
- /*
- * Store a chunk of page_to_node array in a page,
- * but keep the last one as a marker
- */
- chunk_nr_pages = (PAGE_SIZE / sizeof(struct page_to_node)) - 1;
-
- for (chunk_start = 0;
- chunk_start < nr_pages;
- chunk_start += chunk_nr_pages) {
- int j;
-
- if (chunk_start + chunk_nr_pages > nr_pages)
- chunk_nr_pages = nr_pages - chunk_start;
-
- /* fill the chunk pm with addrs and nodes from user-space */
- for (j = 0; j < chunk_nr_pages; j++) {
- const void __user *p;
- int node;
-
- err = -EFAULT;
- if (get_user(p, pages + j + chunk_start))
- goto out_pm;
- pm[j].addr = (unsigned long) p;
-
- if (get_user(node, nodes + j + chunk_start))
- goto out_pm;
-
- err = -ENODEV;
- if (node < 0 || node >= MAX_NUMNODES)
- goto out_pm;
-
- if (!node_state(node, N_MEMORY))
- goto out_pm;
-
- err = -EACCES;
- if (!node_isset(node, task_nodes))
- goto out_pm;
-
- pm[j].node = node;
+ for (i = chunk_start = 0; i < nr_pages; i++) {
+ const void __user *p;
+ unsigned long addr;
+ int node;
+
+ err = -EFAULT;
+ if (get_user(p, pages + i))
+ goto out_flush;
+ if (get_user(node, nodes + i))
+ goto out_flush;
+ addr = (unsigned long)p;
+
+ err = -ENODEV;
+ if (node < 0 || node >= MAX_NUMNODES)
+ goto out_flush;
+ if (!node_state(node, N_MEMORY))
+ goto out_flush;
+
+ err = -EACCES;
+ if (!node_isset(node, task_nodes))
+ goto out_flush;
+
+ if (chunk_node == NUMA_NO_NODE) {
+ chunk_node = node;
+ chunk_start = i;
+ } else if (node != chunk_node) {
+ err = do_move_pages_to_node(mm, &pagelist, chunk_node);
+ if (err)
+ goto out;
+ err = store_status(status, chunk_start, chunk_node, i - chunk_start);
+ if (err)
+ goto out;
+ chunk_start = i;
+ chunk_node = node;
}
- /* End marker for this chunk */
- pm[chunk_nr_pages].node = MAX_NUMNODES;
-
- /* Migrate this chunk */
- err = do_move_page_to_node_array(mm, pm,
- flags & MPOL_MF_MOVE_ALL);
- if (err < 0)
- goto out_pm;
-
- /* Return status information */
- for (j = 0; j < chunk_nr_pages; j++)
- if (put_user(pm[j].status, status + j + chunk_start)) {
- err = -EFAULT;
- goto out_pm;
- }
+ /*
+ * Errors in the page lookup or isolation are not fatal and we simply
+ * report them via status
+ */
+ err = add_page_for_migration(mm, addr, chunk_node,
+ &pagelist, flags & MPOL_MF_MOVE_ALL);
+ if (!err)
+ continue;
+
+ err = store_status(status, i, err, 1);
+ if (err)
+ goto out_flush;
+
+ err = do_move_pages_to_node(mm, &pagelist, chunk_node);
+ if (err)
+ goto out;
+ if (i > chunk_start) {
+ err = store_status(status, chunk_start, chunk_node, i - chunk_start);
+ if (err)
+ goto out;
+ }
+ chunk_node = NUMA_NO_NODE;
}
err = 0;
-
-out_pm:
- free_page((unsigned long)pm);
+out_flush:
+ /* Make sure we do not overwrite the existing error */
+ err1 = do_move_pages_to_node(mm, &pagelist, chunk_node);
+ if (!err1)
+ err1 = store_status(status, chunk_start, chunk_node, i - chunk_start);
+ if (!err)
+ err = err1;
out:
return err;
}
--
2.15.0
--
Michal Hocko
SUSE Labs
On Wed, Dec 13, 2017 at 03:39:48PM +0100, Michal Hocko wrote:
> This patch has been generated with --patience parameter as suggested by
> Kirill and it really seems to provide a more compact diff.
> ---
> From 1f529769d099ca605888b29059014e7c8f0bfd50 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Fri, 8 Dec 2017 12:28:34 +0100
> Subject: [PATCH] mm, numa: rework do_pages_move
>
> do_pages_move is supposed to move user defined memory (an array of
> addresses) to the user defined numa nodes (an array of nodes one for
> each address). The user provided status array then contains resulting
> numa node for each address or an error. The semantic of this function is
> little bit confusing because only some errors are reported back. Notably
> migrate_pages error is only reported via the return value. This patch
> doesn't try to address these semantic nuances but rather change the
> underlying implementation.
>
> Currently we are processing user input (which can be really large)
> in batches which are stored to a temporarily allocated page. Each
> address is resolved to its struct page and stored to page_to_node
> structure along with the requested target numa node. The array of these
> structures is then conveyed down the page migration path via private
> argument. new_page_node then finds the corresponding structure and
> allocates the proper target page.
>
> What is the problem with the current implementation and why to change
> it? Apart from being quite ugly it also doesn't cope with unexpected
> pages showing up on the migration list inside migrate_pages path.
> That doesn't happen currently but the follow up patch would like to
> make the thp migration code more clear and that would need to split a
> THP into the list for some cases.
>
> How does the new implementation work? Well, instead of batching into a
> fixed size array we simply batch all pages that should be migrated to
> the same node and isolate all of them into a linked list which doesn't
> require any additional storage. This should work reasonably well because
> page migration usually migrates larger ranges of memory to a specific
> node. So the common case should work equally well as the current
> implementation. Even if somebody constructs an input where the target
> numa nodes would be interleaved we shouldn't see a large performance
> impact because page migration alone doesn't really benefit from
> batching. mmap_sem batching for the lookup is quite questionable and
> isolate_lru_page which would benefit from batching is not using it even
> in the current implementation.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> mm/internal.h | 1 +
> mm/mempolicy.c | 5 +-
> mm/migrate.c | 340 ++++++++++++++++++++++++++-------------------------------
> 3 files changed, 156 insertions(+), 190 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index e6bd35182dae..1a1bb5d59c15 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -538,4 +538,5 @@ static inline bool is_migrate_highatomic_page(struct page *page)
> }
>
> void setup_zone_pageset(struct zone *zone);
> +extern struct page *alloc_new_node_page(struct page *page, unsigned long node, int **x);
> #endif /* __MM_INTERNAL_H */
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index f604b22ebb65..66c9c79b21be 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -942,7 +942,8 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
> }
> }
>
> -static struct page *new_node_page(struct page *page, unsigned long node, int **x)
> +/* page allocation callback for NUMA node migration */
> +struct page *alloc_new_node_page(struct page *page, unsigned long node, int **x)
> {
> if (PageHuge(page))
> return alloc_huge_page_node(page_hstate(compound_head(page)),
> @@ -986,7 +987,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
> flags | MPOL_MF_DISCONTIG_OK, &pagelist);
>
> if (!list_empty(&pagelist)) {
> - err = migrate_pages(&pagelist, new_node_page, NULL, dest,
> + err = migrate_pages(&pagelist, alloc_new_node_page, NULL, dest,
> MIGRATE_SYNC, MR_SYSCALL);
> if (err)
> putback_movable_pages(&pagelist);
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 4d0be47a322a..9d7252ea2acd 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1444,141 +1444,104 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
> }
>
> #ifdef CONFIG_NUMA
> +
> +static int store_status(int __user *status, int start, int value, int nr)
> +{
> + while (nr-- > 0) {
> + if (put_user(value, status + start))
> + return -EFAULT;
> + start++;
> + }
> +
> + return 0;
> +}
> +
> +static int do_move_pages_to_node(struct mm_struct *mm,
> + struct list_head *pagelist, int node)
> +{
> + int err;
> +
> + if (list_empty(pagelist))
> + return 0;
> +
> + err = migrate_pages(pagelist, alloc_new_node_page, NULL, node,
> + MIGRATE_SYNC, MR_SYSCALL);
> + if (err)
> + putback_movable_pages(pagelist);
> + return err;
> +}
> +
> /*
> - * Move a list of individual pages
> + * Resolves the given address to a struct page, isolates it from the LRU and
> + * puts it to the given pagelist.
> + * Returns -errno if the page cannot be found/isolated or 0 when it has been
> + * queued or the page doesn't need to be migrated because it is already on
> + * the target node
> */
> -struct page_to_node {
> - unsigned long addr;
> +static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
> + int node, struct list_head *pagelist, bool migrate_all)
> +{
> + struct vm_area_struct *vma;
> struct page *page;
> - int node;
> - int status;
> -};
> -
> -static struct page *new_page_node(struct page *p, unsigned long private,
> - int **result)
> -{
> - struct page_to_node *pm = (struct page_to_node *)private;
> -
> - while (pm->node != MAX_NUMNODES && pm->page != p)
> - pm++;
> -
> - if (pm->node == MAX_NUMNODES)
> - return NULL;
> -
> - *result = &pm->status;
> -
> - if (PageHuge(p))
> - return alloc_huge_page_node(page_hstate(compound_head(p)),
> - pm->node);
> - else if (thp_migration_supported() && PageTransHuge(p)) {
> - struct page *thp;
> -
> - thp = alloc_pages_node(pm->node,
> - (GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
> - HPAGE_PMD_ORDER);
> - if (!thp)
> - return NULL;
> - prep_transhuge_page(thp);
> - return thp;
> - } else
> - return __alloc_pages_node(pm->node,
> - GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
> -}
> -
> -/*
> - * Move a set of pages as indicated in the pm array. The addr
> - * field must be set to the virtual address of the page to be moved
> - * and the node number must contain a valid target node.
> - * The pm array ends with node = MAX_NUMNODES.
> - */
> -static int do_move_page_to_node_array(struct mm_struct *mm,
> - struct page_to_node *pm,
> - int migrate_all)
> -{
> + unsigned int follflags;
> int err;
> - struct page_to_node *pp;
> - LIST_HEAD(pagelist);
>
> down_read(&mm->mmap_sem);
> + err = -EFAULT;
> + vma = find_vma(mm, addr);
> + if (!vma || addr < vma->vm_start || !vma_migratable(vma))
> + goto out;
>
> - /*
> - * Build a list of pages to migrate
> - */
> - for (pp = pm; pp->node != MAX_NUMNODES; pp++) {
> - struct vm_area_struct *vma;
> - struct page *page;
> + /* FOLL_DUMP to ignore special (like zero) pages */
> + follflags = FOLL_GET | FOLL_DUMP;
> + if (!thp_migration_supported())
> + follflags |= FOLL_SPLIT;
> + page = follow_page(vma, addr, follflags);
> +
> + err = PTR_ERR(page);
> + if (IS_ERR(page))
> + goto out;
> +
> + err = -ENOENT;
> + if (!page)
> + goto out;
> +
> + err = 0;
> + if (page_to_nid(page) == node)
> + goto out_putpage;
> +
> + err = -EACCES;
> + if (page_mapcount(page) > 1 &&
> + !migrate_all)
Non-sensible line break.
> + goto out_putpage;
> +
> + if (PageHuge(page)) {
> + if (PageHead(page)) {
> + isolate_huge_page(page, pagelist);
> + err = 0;
> + }
> + } else {
Hm. I think if the page is PageTail() we have to split the huge page.
If a user asks to migrate part of a THP, we shouldn't migrate the whole page,
otherwise it's not transparent anymore.
Otherwise, the patch looks good to me.
--
Kirill A. Shutemov
On Thu 14-12-17 18:35:58, Kirill A. Shutemov wrote:
> On Wed, Dec 13, 2017 at 03:39:48PM +0100, Michal Hocko wrote:
[...]
> > + err = 0;
> > + if (page_to_nid(page) == node)
> > + goto out_putpage;
> > +
> > + err = -EACCES;
> > + if (page_mapcount(page) > 1 &&
> > + !migrate_all)
>
> Non-sensible line break.
fixed
> > + goto out_putpage;
> > +
> > + if (PageHuge(page)) {
> > + if (PageHead(page)) {
> > + isolate_huge_page(page, pagelist);
> > + err = 0;
> > + }
> > + } else {
>
> Hm. I think if the page is PageTail() we have to split the huge page.
> If an user asks to migrate part of THP, we shouldn't migrate the whole page,
> otherwise it's not transparent anymore.
Well, as I've said in the cover letter, there are more things which are
worth considering, but I've tried to keep the original semantic, so
further changes should be done in separate patches. I will work on those
but I would prefer this to stay smaller if you do not mind.
> Otherwise, the patch looks good to me.
Thanks for the review
--
Michal Hocko
SUSE Labs
On Fri, Dec 15, 2017 at 10:28:59AM +0100, Michal Hocko wrote:
> On Thu 14-12-17 18:35:58, Kirill A. Shutemov wrote:
> > On Wed, Dec 13, 2017 at 03:39:48PM +0100, Michal Hocko wrote:
> [...]
> > > + err = 0;
> > > + if (page_to_nid(page) == node)
> > > + goto out_putpage;
> > > +
> > > + err = -EACCES;
> > > + if (page_mapcount(page) > 1 &&
> > > + !migrate_all)
> >
> > Non-sensible line break.
>
> fixed
>
> > > + goto out_putpage;
> > > +
> > > + if (PageHuge(page)) {
> > > + if (PageHead(page)) {
> > > + isolate_huge_page(page, pagelist);
> > > + err = 0;
> > > + }
> > > + } else {
> >
> > Hm. I think if the page is PageTail() we have to split the huge page.
> > If an user asks to migrate part of THP, we shouldn't migrate the whole page,
> > otherwise it's not transparent anymore.
>
> Well, as I've said in the cover letter. There are more things which are
> worth considering but I've tried to keep the original semantic so
> further changes should be done in separete patches. I will work on those
> but I would prefer this to stay smaller if you do not mind.
Sure.
Feel free to use my ack.
--
Kirill A. Shutemov
On Fri 15-12-17 12:51:25, Kirill A. Shutemov wrote:
> On Fri, Dec 15, 2017 at 10:28:59AM +0100, Michal Hocko wrote:
> > On Thu 14-12-17 18:35:58, Kirill A. Shutemov wrote:
> > > On Wed, Dec 13, 2017 at 03:39:48PM +0100, Michal Hocko wrote:
> > [...]
> > > > + err = 0;
> > > > + if (page_to_nid(page) == node)
> > > > + goto out_putpage;
> > > > +
> > > > + err = -EACCES;
> > > > + if (page_mapcount(page) > 1 &&
> > > > + !migrate_all)
> > >
> > > Non-sensible line break.
> >
> > fixed
> >
> > > > + goto out_putpage;
> > > > +
> > > > + if (PageHuge(page)) {
> > > > + if (PageHead(page)) {
> > > > + isolate_huge_page(page, pagelist);
> > > > + err = 0;
> > > > + }
> > > > + } else {
> > >
> > > Hm. I think if the page is PageTail() we have to split the huge page.
> > > If an user asks to migrate part of THP, we shouldn't migrate the whole page,
> > > otherwise it's not transparent anymore.
> >
> > Well, as I've said in the cover letter. There are more things which are
> > worth considering but I've tried to keep the original semantic so
> > further changes should be done in separete patches. I will work on those
> > but I would prefer this to stay smaller if you do not mind.
>
> Sure.
>
> Fill free to use my ack.
Thanks a lot Kirill! I will wait for some more feedback and then
resubmit later next week.
--
Michal Hocko
SUSE Labs
Are there any more comments here? It seems the initial reaction wasn't
all that bad and so I would like to post this with RFC dropped.
On Fri 08-12-17 17:15:56, Michal Hocko wrote:
> On Thu 07-12-17 15:34:01, Michal Hocko wrote:
> > On Thu 07-12-17 22:10:47, Zi Yan wrote:
> [...]
> > > I agree with you that we should try to migrate all tail pages if the THP
> > > needs to be split. But this might not be compatible with "getting
> > > migration results" in unmap_and_move(), since a caller of
> > > migrate_pages() may want to know the status of each page in the
> > > migration list via int **result in get_new_page() (e.g.
> > > new_page_node()). The caller has no idea whether a THP in its migration
> > > list will be split or not, thus, storing migration results might be
> > > quite tricky if tail pages are added into the migration list.
> >
> > Ouch. I wasn't aware of this "beauty". I will try to wrap my head around
> > this code and think about what to do about it. Thanks for pointing me to
> > it.
>
> OK, so I was staring at this yesterday and concluded that the current
> implementation of do_move_page_to_node_array is unfixable to work with
> split_thp_page_list in migrate_pages. So I've reimplemented it to not
> use the quite ugly fixed-size batching. Instead I am using dynamic
> batching based on the same node request. See the patch 1 for more
> details about implementation. This will allow us to remove the quite
> ugly 'int **reason' from the allocation callback as well. This is patch
> 2 and patch 3 is finally the thp migration code.
>
> Diffstat is quite supportive for this cleanup.
> include/linux/migrate.h | 7 +-
> include/linux/page-isolation.h | 3 +-
> mm/compaction.c | 3 +-
> mm/huge_memory.c | 6 +
> mm/internal.h | 1 +
> mm/memory_hotplug.c | 5 +-
> mm/mempolicy.c | 40 +----
> mm/migrate.c | 350 ++++++++++++++++++-----------------------
> mm/page_isolation.c | 3 +-
> 9 files changed, 177 insertions(+), 241 deletions(-)
>
> Does anybody see any issues with this approach?
>
> On a side note:
> There are some semantic issues I have encountered on the way but I am
> not addressing them here. E.g. trying to move THP tail pages will result
> in either success or EBUSY (the latter one more likely once we isolate
> head from the LRU list). Hugetlb reports EACCESS on tail pages.
> Some errors are reported via status parameter but migration failures are
> not even though the original `reason' argument suggests there was an
> intention to do so. From a quick look into git history this never
> worked. I have tried to keep the semantic unchanged.
>
> Then there is a relatively minor thing that the page isolation might
> fail because of pages not being on the LRU - e.g. because they are
> sitting on the per-cpu LRU caches. Easily fixable.
--
Michal Hocko
SUSE Labs
On 8 Dec 2017, at 11:15, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> No allocation callback is using this argument anymore. new_page_node
> used to use this parameter to convey node_id resp. migration error
> up to move_pages code (do_move_page_to_node_array). The error status
> never made it into the final status field and we have a better way
> to communicate node id to the status field now. All other allocation
> callbacks simply ignored the argument so we can drop it finally.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> include/linux/migrate.h | 3 +--
> include/linux/page-isolation.h | 3 +--
> mm/compaction.c | 3 +--
> mm/internal.h | 2 +-
> mm/memory_hotplug.c | 3 +--
> mm/mempolicy.c | 6 +++---
> mm/migrate.c | 19 ++-----------------
> mm/page_isolation.c | 3 +--
> 8 files changed, 11 insertions(+), 31 deletions(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index a2246cf670ba..e5d99ade2319 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -7,8 +7,7 @@
> #include <linux/migrate_mode.h>
> #include <linux/hugetlb.h>
>
> -typedef struct page *new_page_t(struct page *page, unsigned long private,
> - int **reason);
> +typedef struct page *new_page_t(struct page *page, unsigned long private);
> typedef void free_page_t(struct page *page, unsigned long private);
>
> /*
> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
> index cdad58bbfd8b..4ae347cbc36d 100644
> --- a/include/linux/page-isolation.h
> +++ b/include/linux/page-isolation.h
> @@ -63,7 +63,6 @@ undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
> int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
> bool skip_hwpoisoned_pages);
>
> -struct page *alloc_migrate_target(struct page *page, unsigned long private,
> - int **resultp);
> +struct page *alloc_migrate_target(struct page *page, unsigned long private);
>
> #endif
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 10cd757f1006..692d21d63391 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1165,8 +1165,7 @@ static void isolate_freepages(struct compact_control *cc)
> * from the isolated freelists in the block we are migrating to.
> */
> static struct page *compaction_alloc(struct page *migratepage,
> - unsigned long data,
> - int **result)
> + unsigned long data)
> {
> struct compact_control *cc = (struct compact_control *)data;
> struct page *freepage;
> diff --git a/mm/internal.h b/mm/internal.h
> index 1a1bb5d59c15..502d14189794 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -538,5 +538,5 @@ static inline bool is_migrate_highatomic_page(struct page *page)
> }
>
> void setup_zone_pageset(struct zone *zone);
> -extern struct page *alloc_new_node_page(struct page *page, unsigned long node, int **x);
> +extern struct page *alloc_new_node_page(struct page *page, unsigned long node);
> #endif /* __MM_INTERNAL_H */
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index d0856ab2f28d..d865623edee7 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1347,8 +1347,7 @@ static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
> return 0;
> }
>
> -static struct page *new_node_page(struct page *page, unsigned long private,
> - int **result)
> +static struct page *new_node_page(struct page *page, unsigned long private)
> {
> int nid = page_to_nid(page);
> nodemask_t nmask = node_states[N_MEMORY];
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 66c9c79b21be..4d849d3098e5 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -943,7 +943,7 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
> }
>
> /* page allocation callback for NUMA node migration */
> -struct page *alloc_new_node_page(struct page *page, unsigned long node, int **x)
> +struct page *alloc_new_node_page(struct page *page, unsigned long node)
> {
> if (PageHuge(page))
> return alloc_huge_page_node(page_hstate(compound_head(page)),
> @@ -1108,7 +1108,7 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
> * list of pages handed to migrate_pages()--which is how we get here--
> * is in virtual address order.
> */
> -static struct page *new_page(struct page *page, unsigned long start, int **x)
> +static struct page *new_page(struct page *page, unsigned long start)
> {
> struct vm_area_struct *vma;
> unsigned long uninitialized_var(address);
> @@ -1153,7 +1153,7 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
> return -ENOSYS;
> }
>
> -static struct page *new_page(struct page *page, unsigned long start, int **x)
> +static struct page *new_page(struct page *page, unsigned long start)
> {
> return NULL;
> }
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 9d7252ea2acd..f9235f0155a4 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1136,10 +1136,9 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
> enum migrate_reason reason)
> {
> int rc = MIGRATEPAGE_SUCCESS;
> - int *result = NULL;
> struct page *newpage;
>
> - newpage = get_new_page(page, private, &result);
> + newpage = get_new_page(page, private);
> if (!newpage)
> return -ENOMEM;
>
> @@ -1230,12 +1229,6 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
> put_page(newpage);
> }
>
> - if (result) {
> - if (rc)
> - *result = rc;
> - else
> - *result = page_to_nid(newpage);
> - }
> return rc;
> }
>
> @@ -1263,7 +1256,6 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
> enum migrate_mode mode, int reason)
> {
> int rc = -EAGAIN;
> - int *result = NULL;
> int page_was_mapped = 0;
> struct page *new_hpage;
> struct anon_vma *anon_vma = NULL;
> @@ -1280,7 +1272,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
> return -ENOSYS;
> }
>
> - new_hpage = get_new_page(hpage, private, &result);
> + new_hpage = get_new_page(hpage, private);
> if (!new_hpage)
> return -ENOMEM;
>
> @@ -1345,12 +1337,6 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
> else
> putback_active_hugepage(new_hpage);
>
> - if (result) {
> - if (rc)
> - *result = rc;
> - else
> - *result = page_to_nid(new_hpage);
> - }
> return rc;
> }
>
> @@ -1622,7 +1608,6 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
> }
> chunk_node = NUMA_NO_NODE;
> }
> - err = 0;
This line can be merged into Patch 1. Or did I miss anything?
> out_flush:
> /* Make sure we do not overwrite the existing error */
> err1 = do_move_pages_to_node(mm, &pagelist, chunk_node);
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index 165ed8117bd1..53d801235e22 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -293,8 +293,7 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
> return pfn < end_pfn ? -EBUSY : 0;
> }
>
> -struct page *alloc_migrate_target(struct page *page, unsigned long private,
> - int **resultp)
> +struct page *alloc_migrate_target(struct page *page, unsigned long private)
> {
> return new_page_nodemask(page, numa_node_id(), &node_states[N_MEMORY]);
> }
> --
> 2.15.0
Everything else looks good to me.
Reviewed-by: Zi Yan <[email protected]>
—
Best Regards,
Yan Zi
On 8 Dec 2017, at 11:15, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> THP migration is hacked into the generic migration with rather
> surprising semantic. The migration allocation callback is supposed to
> check whether the THP can be migrated at once and if that is not the
> case then it allocates a simple page to migrate. unmap_and_move then
> fixes that up by spliting the THP into small pages while moving the
> head page to the newly allocated order-0 page. Remaning pages are moved
> to the LRU list by split_huge_page. The same happens if the THP
> allocation fails. This is really ugly and error prone [1].
>
> I also believe that split_huge_page to the LRU lists is inherently
> wrong because all tail pages are not migrated. Some callers will just
> work around that by retrying (e.g. memory hotplug). There are other
> pfn walkers which are simply broken though. e.g. madvise_inject_error
> will migrate head and then advances next pfn by the huge page size.
> do_move_page_to_node_array, queue_pages_range (migrate_pages, mbind),
> will simply split the THP before migration if the THP migration is not
> supported then falls back to single page migration but it doesn't handle
> tail pages if the THP migration path is not able to allocate a fresh
> THP so we end up with ENOMEM and fail the whole migration which is
> a questionable behavior. Page compaction doesn't try to migrate large
> pages so it should be immune.
>
> This patch tries to unclutter the situation by moving the special THP
> handling up to the migrate_pages layer where it actually belongs. We
> simply split the THP page into the existing list if unmap_and_move fails
> with ENOMEM and retry. So we will _always_ migrate all THP subpages and
> specific migrate_pages users do not have to deal with this case in a
> special way.
>
> [1] http://lkml.kernel.org/r/[email protected]
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> include/linux/migrate.h | 4 ++--
> mm/huge_memory.c | 6 ++++++
> mm/memory_hotplug.c | 2 +-
> mm/mempolicy.c | 31 +++----------------------------
> mm/migrate.c | 29 +++++++++++++++++++----------
> 5 files changed, 31 insertions(+), 41 deletions(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index e5d99ade2319..0c6fe904bc97 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -42,9 +42,9 @@ static inline struct page *new_page_nodemask(struct page *page,
> return alloc_huge_page_nodemask(page_hstate(compound_head(page)),
> preferred_nid, nodemask);
>
> - if (thp_migration_supported() && PageTransHuge(page)) {
> - order = HPAGE_PMD_ORDER;
> + if (PageTransHuge(page)) {
> gfp_mask |= GFP_TRANSHUGE;
> + order = HPAGE_PMD_ORDER;
> }
>
> if (PageHighMem(page) || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 7544ce4ef4dc..8865906c248c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2425,6 +2425,12 @@ static void __split_huge_page_tail(struct page *head, int tail,
>
> page_tail->index = head->index + tail;
> page_cpupid_xchg_last(page_tail, page_cpupid_last(head));
> +
> + /*
> + * always add to the tail because some iterators expect new
> + * pages to show after the currently processed elements - e.g.
> + * migrate_pages
> + */
> lru_add_page_tail(head, page_tail, lruvec, list);
> }
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index d865623edee7..442e63a2cf72 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1390,7 +1390,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
> if (isolate_huge_page(page, &source))
> move_pages -= 1 << compound_order(head);
> continue;
> - } else if (thp_migration_supported() && PageTransHuge(page))
> + } else if (PageTransHuge(page))
> pfn = page_to_pfn(compound_head(page))
> + hpage_nr_pages(page) - 1;
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 4d849d3098e5..b6f4fcf9df64 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -446,15 +446,6 @@ static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
> __split_huge_pmd(walk->vma, pmd, addr, false, NULL);
> goto out;
> }
> - if (!thp_migration_supported()) {
> - get_page(page);
> - spin_unlock(ptl);
> - lock_page(page);
> - ret = split_huge_page(page);
> - unlock_page(page);
> - put_page(page);
> - goto out;
> - }
> if (!queue_pages_required(page, qp)) {
> ret = 1;
> goto unlock;
> @@ -495,7 +486,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
>
> if (pmd_trans_unstable(pmd))
> return 0;
> -retry:
> +
> pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
> for (; addr != end; pte++, addr += PAGE_SIZE) {
> if (!pte_present(*pte))
> @@ -511,22 +502,6 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
> continue;
> if (!queue_pages_required(page, qp))
> continue;
> - if (PageTransCompound(page) && !thp_migration_supported()) {
> - get_page(page);
> - pte_unmap_unlock(pte, ptl);
> - lock_page(page);
> - ret = split_huge_page(page);
> - unlock_page(page);
> - put_page(page);
> - /* Failed to split -- skip. */
> - if (ret) {
> - pte = pte_offset_map_lock(walk->mm, pmd,
> - addr, &ptl);
> - continue;
> - }
> - goto retry;
> - }
> -
> migrate_page_add(page, qp->pagelist, flags);
> }
> pte_unmap_unlock(pte - 1, ptl);
> @@ -948,7 +923,7 @@ struct page *alloc_new_node_page(struct page *page, unsigned long node)
> if (PageHuge(page))
> return alloc_huge_page_node(page_hstate(compound_head(page)),
> node);
> - else if (thp_migration_supported() && PageTransHuge(page)) {
> + else if (PageTransHuge(page)) {
> struct page *thp;
>
> thp = alloc_pages_node(node,
> @@ -1124,7 +1099,7 @@ static struct page *new_page(struct page *page, unsigned long start)
> if (PageHuge(page)) {
> BUG_ON(!vma);
> return alloc_huge_page_noerr(vma, address, 1);
> - } else if (thp_migration_supported() && PageTransHuge(page)) {
> + } else if (PageTransHuge(page)) {
> struct page *thp;
>
> thp = alloc_hugepage_vma(GFP_TRANSHUGE, vma, address,
> diff --git a/mm/migrate.c b/mm/migrate.c
> index f9235f0155a4..dc5df5fe5c82 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1138,6 +1138,9 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
> int rc = MIGRATEPAGE_SUCCESS;
> struct page *newpage;
>
> + if (!thp_migration_supported() && PageTransHuge(page))
> + return -ENOMEM;
> +
> newpage = get_new_page(page, private);
> if (!newpage)
> return -ENOMEM;
> @@ -1159,14 +1162,6 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
> goto out;
> }
>
> - if (unlikely(PageTransHuge(page) && !PageTransHuge(newpage))) {
> - lock_page(page);
> - rc = split_huge_page(page);
> - unlock_page(page);
> - if (rc)
> - goto out;
> - }
> -
> rc = __unmap_and_move(page, newpage, force, mode);
> if (rc == MIGRATEPAGE_SUCCESS)
> set_page_owner_migrate_reason(newpage, reason);
> @@ -1381,6 +1376,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
> retry = 0;
>
> list_for_each_entry_safe(page, page2, from, lru) {
> +retry:
> cond_resched();
>
> if (PageHuge(page))
> @@ -1394,6 +1390,21 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>
> switch(rc) {
> case -ENOMEM:
> + /*
> + * THP migration might be unsupported or the
> + * allocation could've failed so we should
> + * retry on the same page with the THP split
> + * to base pages.
> + */
> + if (PageTransHuge(page)) {
> + lock_page(page);
> + rc = split_huge_page_to_list(page, from);
> + unlock_page(page);
> + if (!rc) {
> + list_safe_reset_next(page, page2, lru);
> + goto retry;
> + }
> + }
The hunk splits the THP and adds all tail pages at the end of the list “from”.
Why do we need “list_safe_reset_next(page, page2, lru);” here, when page2 is not changed here?
And it seems a little bit strange to only re-migrate the head page, then come back to all tail
pages after migrating the rest of pages in the list “from”. Is it better to split the THP into
a list other than “from” and insert the list after “page”, then retry from the split “page”?
Thus, we attempt to migrate all sub pages of the THP after it is split.
> nr_failed++;
> goto out;
> case -EAGAIN:
> @@ -1480,8 +1491,6 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
>
> /* FOLL_DUMP to ignore special (like zero) pages */
> follflags = FOLL_GET | FOLL_DUMP;
> - if (!thp_migration_supported())
> - follflags |= FOLL_SPLIT;
> page = follow_page(vma, addr, follflags);
>
> err = PTR_ERR(page);
> --
> 2.15.0
—
Best Regards,
Yan Zi
On Tue 26-12-17 21:12:38, Zi Yan wrote:
> On 8 Dec 2017, at 11:15, Michal Hocko wrote:
[...]
> > @@ -1622,7 +1608,6 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
> > }
> > chunk_node = NUMA_NO_NODE;
> > }
> > - err = 0;
>
> This line can be merged into Patch 1. Or did I miss anything?
Yes, I will move it there.
> > out_flush:
> > /* Make sure we do not overwrite the existing error */
> > err1 = do_move_pages_to_node(mm, &pagelist, chunk_node);
> > diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> > index 165ed8117bd1..53d801235e22 100644
> > --- a/mm/page_isolation.c
> > +++ b/mm/page_isolation.c
> > @@ -293,8 +293,7 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
> > return pfn < end_pfn ? -EBUSY : 0;
> > }
> >
> > -struct page *alloc_migrate_target(struct page *page, unsigned long private,
> > - int **resultp)
> > +struct page *alloc_migrate_target(struct page *page, unsigned long private)
> > {
> > return new_page_nodemask(page, numa_node_id(), &node_states[N_MEMORY]);
> > }
> > --
> > 2.15.0
>
> Everything else looks good to me.
>
> Reviewed-by: Zi Yan <[email protected]>
Thanks!
--
Michal Hocko
SUSE Labs
On Tue 26-12-17 21:19:35, Zi Yan wrote:
> On 8 Dec 2017, at 11:15, Michal Hocko wrote:
[...]
> > @@ -1394,6 +1390,21 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
> >
> > switch(rc) {
> > case -ENOMEM:
> > + /*
> > + * THP migration might be unsupported or the
> > + * allocation could've failed so we should
> > + * retry on the same page with the THP split
> > + * to base pages.
> > + */
> > + if (PageTransHuge(page)) {
> > + lock_page(page);
> > + rc = split_huge_page_to_list(page, from);
> > + unlock_page(page);
> > + if (!rc) {
> > + list_safe_reset_next(page, page2, lru);
> > + goto retry;
> > + }
> > + }
>
> The hunk splits the THP and adds all tail pages at the end of the list “from”.
> Why do we need “list_safe_reset_next(page, page2, lru);” here, when page2 is not changed here?
Because we need to handle the case when the page2 was the last on the
list.
> And it seems a little bit strange to only re-migrate the head page, then come back to all tail
> pages after migrating the rest of pages in the list “from”. Is it better to split the THP into
> a list other than “from” and insert the list after “page”, then retry from the split “page”?
> Thus, we attempt to migrate all sub pages of the THP after it is split.
Why does this matter?
--
Michal Hocko
SUSE Labs
On 29 Dec 2017, at 6:36, Michal Hocko wrote:
> On Tue 26-12-17 21:19:35, Zi Yan wrote:
>> On 8 Dec 2017, at 11:15, Michal Hocko wrote:
> [...]
>>> @@ -1394,6 +1390,21 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>>>
>>> switch(rc) {
>>> case -ENOMEM:
>>> + /*
>>> + * THP migration might be unsupported or the
>>> + * allocation could've failed so we should
>>> + * retry on the same page with the THP split
>>> + * to base pages.
>>> + */
>>> + if (PageTransHuge(page)) {
>>> + lock_page(page);
>>> + rc = split_huge_page_to_list(page, from);
>>> + unlock_page(page);
>>> + if (!rc) {
>>> + list_safe_reset_next(page, page2, lru);
>>> + goto retry;
>>> + }
>>> + }
>>
>> The hunk splits the THP and adds all tail pages at the end of the list “from”.
>> Why do we need “list_safe_reset_next(page, page2, lru);” here, when page2 is not changed here?
>
> Because we need to handle the case when the page2 was the last on the
> list.
Got it. Thanks for the explanation.
>
>> And it seems a little bit strange to only re-migrate the head page, then come back to all tail
>> pages after migrating the rest of pages in the list “from”. Is it better to split the THP into
>> a list other than “from” and insert the list after “page”, then retry from the split “page”?
>> Thus, we attempt to migrate all sub pages of the THP after it is split.
>
> Why does this matter?
Functionally, it does not matter.
This behavior is just less intuitive and a little different from the current one,
which implicitly preserves the original order of the not-migrated pages
in the “from” list, although no one relies on this implicit behavior now.
Adding a one-line comment about this difference would be good for code maintenance. :)
Reviewed-by: Zi Yan <[email protected]>
—
Best Regards,
Yan Zi
On Fri 29-12-17 10:45:46, Zi Yan wrote:
> On 29 Dec 2017, at 6:36, Michal Hocko wrote:
>
> > On Tue 26-12-17 21:19:35, Zi Yan wrote:
[...]
> >> And it seems a little bit strange to only re-migrate the head page, then come back to all tail
> >> pages after migrating the rest of pages in the list “from”. Is it better to split the THP into
> >> a list other than “from” and insert the list after “page”, then retry from the split “page”?
> >> Thus, we attempt to migrate all sub pages of the THP after it is split.
> >
> > Why does this matter?
>
> Functionally, it does not matter.
>
> This behavior is just less intuitive and a little different from the current one,
> which implicitly preserves the original order of the not-migrated pages
> in the “from” list, although no one relies on this implicit behavior now.
>
>
> Adding a one-line comment about this difference would be good for code maintenance. :)
OK, I will not argue. I still do not see _why_ we need it but I've added
the following.
diff --git a/mm/migrate.c b/mm/migrate.c
index 21b3381a2871..0ac5185d3949 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1395,6 +1395,11 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
* allocation could've failed so we should
* retry on the same page with the THP split
* to base pages.
+ *
+ * Head page is retried immediatelly and tail
+ * pages are added to the tail of the list so
+ * we encounter them after the rest of the list
+ * is processed.
*/
if (PageTransHuge(page)) {
lock_page(page);
Does this reflect what you mean?
> Reviewed-by: Zi Yan <[email protected]>
Thx!
--
Michal Hocko
SUSE Labs
On 31 Dec 2017, at 4:07, Michal Hocko wrote:
> On Fri 29-12-17 10:45:46, Zi Yan wrote:
>> On 29 Dec 2017, at 6:36, Michal Hocko wrote:
>>
>>> On Tue 26-12-17 21:19:35, Zi Yan wrote:
> [...]
>>>> And it seems a little bit strange to only re-migrate the head page, then come back to all tail
>>>> pages after migrating the rest of pages in the list “from”. Is it better to split the THP into
>>>> a list other than “from” and insert the list after “page”, then retry from the split “page”?
>>>> Thus, we attempt to migrate all sub pages of the THP after it is split.
>>>
>>> Why does this matter?
>>
>> Functionally, it does not matter.
>>
>> This behavior is just less intuitive and a little different from the current one,
>> which implicitly preserves the original order of the not-migrated pages
>> in the “from” list, although no one relies on this implicit behavior now.
>>
>>
>> Adding a one-line comment about this difference would be good for code maintenance. :)
>
> OK, I will not argue. I still do not see _why_ we need it but I've added
> the following.
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 21b3381a2871..0ac5185d3949 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1395,6 +1395,11 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
> * allocation could've failed so we should
> * retry on the same page with the THP split
> * to base pages.
> + *
> + * Head page is retried immediatelly and tail
> + * pages are added to the tail of the list so
> + * we encounter them after the rest of the list
> + * is processed.
> */
> if (PageTransHuge(page)) {
> lock_page(page);
>
> Does this reflect what you mean?
s/immediatelly/immediately
Yes. Thanks. :)
—
Best Regards,
Yan Zi
On 12/08/2017 09:45 PM, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> do_pages_move is supposed to move user defined memory (an array of
> addresses) to the user defined numa nodes (an array of nodes one for
> each address). The user provided status array then contains resulting
> numa node for each address or an error. The semantic of this function is
> little bit confusing because only some errors are reported back. Notably
> migrate_pages error is only reported via the return value. This patch
It does report back the migration failures as well. In new_page_node
there is '*result = &pm->status' which, further down in unmap_and_move,
will hold the migration error or the node ID of the new page.
newpage = get_new_page(page, private, &result);
............
if (result) {
if (rc)
*result = rc;
else
*result = page_to_nid(newpage);
}
On Tue 02-01-18 16:55:46, Anshuman Khandual wrote:
> On 12/08/2017 09:45 PM, Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > do_pages_move is supposed to move user defined memory (an array of
> > addresses) to the user defined numa nodes (an array of nodes one for
> > each address). The user provided status array then contains resulting
> > numa node for each address or an error. The semantic of this function is
> > little bit confusing because only some errors are reported back. Notably
> > migrate_pages error is only reported via the return value. This patch
>
> It does report back the migration failures as well. In new_page_node
> there is '*result = &pm->status' which, further down in unmap_and_move,
> will hold the migration error or the node ID of the new page.
>
> newpage = get_new_page(page, private, &result);
> ............
> if (result) {
> if (rc)
> *result = rc;
> else
> *result = page_to_nid(newpage);
> }
>
This is true, except the user will not get this information. Have a look
how we do not copy status on error up in the do_pages_move layer.
--
Michal Hocko
SUSE Labs
On 01/02/2018 05:42 PM, Michal Hocko wrote:
> On Tue 02-01-18 16:55:46, Anshuman Khandual wrote:
>> On 12/08/2017 09:45 PM, Michal Hocko wrote:
>>> From: Michal Hocko <[email protected]>
>>>
>>> do_pages_move is supposed to move user defined memory (an array of
>>> addresses) to the user defined numa nodes (an array of nodes one for
>>> each address). The user provided status array then contains resulting
>>> numa node for each address or an error. The semantic of this function is
>>> little bit confusing because only some errors are reported back. Notably
>>> migrate_pages error is only reported via the return value. This patch
>>
>> It does report back the migration failures as well. In new_page_node
>> there is '*result = &pm->status' which, further down in unmap_and_move,
>> will hold the migration error or the node ID of the new page.
>>
>> newpage = get_new_page(page, private, &result);
>> ............
>> if (result) {
>> if (rc)
>> *result = rc;
>> else
>> *result = page_to_nid(newpage);
>> }
>>
>
> This is true, except the user will not get this information. Have a look
> how we do not copy status on error up in the do_pages_move layer.
Ahh, right, we don't. But as you have mentioned, this patch does not
intend to change the semantics of the status return, though that seems
like the right thing to do. We can just pass the status on to the user
here before bailing out.
On 12/08/2017 09:45 PM, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> do_pages_move is supposed to move user defined memory (an array of
> addresses) to the user defined numa nodes (an array of nodes one for
> each address). The user provided status array then contains resulting
> numa node for each address or an error. The semantic of this function is
> little bit confusing because only some errors are reported back. Notably
> migrate_pages error is only reported via the return value. This patch
> doesn't try to address these semantic nuances but rather change the
> underlying implementation.
>
> Currently we are processing user input (which can be really large)
> in batches which are stored to a temporarily allocated page. Each
> address is resolved to its struct page and stored to page_to_node
> structure along with the requested target numa node. The array of these
> structures is then conveyed down the page migration path via private
> argument. new_page_node then finds the corresponding structure and
> allocates the proper target page.
>
> What is the problem with the current implementation and why to change
> it? Apart from being quite ugly it also doesn't cope with unexpected
> pages showing up on the migration list inside migrate_pages path.
> That doesn't happen currently but the follow up patch would like to
> make the thp migration code more clear and that would need to split a
> THP into the list for some cases.
>
> How does the new implementation work? Well, instead of batching into a
> fixed size array we simply batch all pages that should be migrated to
> the same node and isolate all of them into a linked list which doesn't
> require any additional storage. This should work reasonably well because
> page migration usually migrates larger ranges of memory to a specific
> node. So the common case should work equally well as the current
> implementation. Even if somebody constructs an input where the target
> numa nodes would be interleaved we shouldn't see a large performance
> impact because page migration alone doesn't really benefit from
> batching. mmap_sem batching for the lookup is quite questionable and
> isolate_lru_page which would benefit from batching is not using it even
> in the current implementation.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> mm/internal.h | 1 +
> mm/mempolicy.c | 5 +-
> mm/migrate.c | 306 +++++++++++++++++++++++++--------------------------------
> 3 files changed, 139 insertions(+), 173 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index e6bd35182dae..1a1bb5d59c15 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -538,4 +538,5 @@ static inline bool is_migrate_highatomic_page(struct page *page)
> }
>
> void setup_zone_pageset(struct zone *zone);
> +extern struct page *alloc_new_node_page(struct page *page, unsigned long node, int **x);
> #endif /* __MM_INTERNAL_H */
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index f604b22ebb65..66c9c79b21be 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -942,7 +942,8 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
> }
> }
>
> -static struct page *new_node_page(struct page *page, unsigned long node, int **x)
> +/* page allocation callback for NUMA node migration */
> +struct page *alloc_new_node_page(struct page *page, unsigned long node, int **x)
> {
> if (PageHuge(page))
> return alloc_huge_page_node(page_hstate(compound_head(page)),
> @@ -986,7 +987,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
> flags | MPOL_MF_DISCONTIG_OK, &pagelist);
>
> if (!list_empty(&pagelist)) {
> - err = migrate_pages(&pagelist, new_node_page, NULL, dest,
> + err = migrate_pages(&pagelist, alloc_new_node_page, NULL, dest,
> MIGRATE_SYNC, MR_SYSCALL);
> if (err)
> putback_movable_pages(&pagelist);
This reuses the existing page allocation helper from the migrate_pages()
system call. But all these allocator helper names for the migrate_pages()
function are really confusing. Even in this case alloc_new_node_page()
and the original new_node_page(), which is still getting used in
do_migrate_range(), sound similar even though their implementations are
quite different. IMHO either all of them should be moved to the header
file with proper differentiating names, or they can stay in their
respective files with these generic names and be cleaned up later.
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 4d0be47a322a..9d7252ea2acd 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1444,141 +1444,104 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
> }
>
> #ifdef CONFIG_NUMA
> -/*
> - * Move a list of individual pages
> - */
> -struct page_to_node {
> - unsigned long addr;
> - struct page *page;
> - int node;
> - int status;
> -};
>
> -static struct page *new_page_node(struct page *p, unsigned long private,
> - int **result)
> +static int store_status(int __user *status, int start, int value, int nr)
> {
> - struct page_to_node *pm = (struct page_to_node *)private;
> -
> - while (pm->node != MAX_NUMNODES && pm->page != p)
> - pm++;
> + while (nr-- > 0) {
> + if (put_user(value, status + start))
> + return -EFAULT;
> + start++;
> + }
>
> - if (pm->node == MAX_NUMNODES)
> - return NULL;
> + return 0;
> +}
Just a nit. new_page_node() and store_status() seem like different
functions, so why does the git diff look so clumsy?
>
> - *result = &pm->status;
> +static int do_move_pages_to_node(struct mm_struct *mm,
> + struct list_head *pagelist, int node)
> +{
> + int err;
>
> - if (PageHuge(p))
> - return alloc_huge_page_node(page_hstate(compound_head(p)),
> - pm->node);
> - else if (thp_migration_supported() && PageTransHuge(p)) {
> - struct page *thp;
> + if (list_empty(pagelist))
> + return 0;
>
> - thp = alloc_pages_node(pm->node,
> - (GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
> - HPAGE_PMD_ORDER);
> - if (!thp)
> - return NULL;
> - prep_transhuge_page(thp);
> - return thp;
> - } else
> - return __alloc_pages_node(pm->node,
> - GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
> + err = migrate_pages(pagelist, alloc_new_node_page, NULL, node,
> + MIGRATE_SYNC, MR_SYSCALL);
> + if (err)
> + putback_movable_pages(pagelist);
> + return err;
> }
Even this one. IIUC, do_move_pages_to_node() migrates a chunk of pages
at a time, all of which belong to the same target node. Perhaps the name
should suggest that. All these page migration helper functions sound so
similar.
>
> /*
> - * Move a set of pages as indicated in the pm array. The addr
> - * field must be set to the virtual address of the page to be moved
> - * and the node number must contain a valid target node.
> - * The pm array ends with node = MAX_NUMNODES.
> + * Resolves the given address to a struct page, isolates it from the LRU and
> + * puts it to the given pagelist.
> + * Returns -errno if the page cannot be found/isolated or 0 when it has been
> + * queued or the page doesn't need to be migrated because it is already on
> + * the target node
> */
> -static int do_move_page_to_node_array(struct mm_struct *mm,
> - struct page_to_node *pm,
> - int migrate_all)
> +static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
> + int node, struct list_head *pagelist, bool migrate_all)
> {
> + struct vm_area_struct *vma;
> + struct page *page;
> + unsigned int follflags;
> int err;
> - struct page_to_node *pp;
> - LIST_HEAD(pagelist);
>
> down_read(&mm->mmap_sem);
Holding mmap_sem for individual pages makes sense. The current
implementation holds it for an entire batch.
> + err = -EFAULT;
> + vma = find_vma(mm, addr);
> + if (!vma || addr < vma->vm_start || !vma_migratable(vma))
While here, shouldn't we add 'addr > vma->vm_end' to this condition?
> + goto out;
>
> - /*
> - * Build a list of pages to migrate
> - */
> - for (pp = pm; pp->node != MAX_NUMNODES; pp++) {
> - struct vm_area_struct *vma;
> - struct page *page;
> - struct page *head;
> - unsigned int follflags;
> -
> - err = -EFAULT;
> - vma = find_vma(mm, pp->addr);
> - if (!vma || pp->addr < vma->vm_start || !vma_migratable(vma))
> - goto set_status;
> -
> - /* FOLL_DUMP to ignore special (like zero) pages */
> - follflags = FOLL_GET | FOLL_DUMP;
> - if (!thp_migration_supported())
> - follflags |= FOLL_SPLIT;
> - page = follow_page(vma, pp->addr, follflags);
> + /* FOLL_DUMP to ignore special (like zero) pages */
> + follflags = FOLL_GET | FOLL_DUMP;
> + if (!thp_migration_supported())
> + follflags |= FOLL_SPLIT;
> + page = follow_page(vma, addr, follflags);
>
> - err = PTR_ERR(page);
> - if (IS_ERR(page))
> - goto set_status;
> + err = PTR_ERR(page);
> + if (IS_ERR(page))
> + goto out;
>
> - err = -ENOENT;
> - if (!page)
> - goto set_status;
> + err = -ENOENT;
> + if (!page)
> + goto out;
>
> - err = page_to_nid(page);
> + err = 0;
> + if (page_to_nid(page) == node)
> + goto out_putpage;
>
> - if (err == pp->node)
> - /*
> - * Node already in the right place
> - */
> - goto put_and_set;
> + err = -EACCES;
> + if (page_mapcount(page) > 1 &&
> + !migrate_all)
> + goto out_putpage;
>
> - err = -EACCES;
> - if (page_mapcount(page) > 1 &&
> - !migrate_all)
> - goto put_and_set;
> -
> - if (PageHuge(page)) {
> - if (PageHead(page)) {
> - isolate_huge_page(page, &pagelist);
> - err = 0;
> - pp->page = page;
> - }
> - goto put_and_set;
> + if (PageHuge(page)) {
> + if (PageHead(page)) {
> + isolate_huge_page(page, pagelist);
> + err = 0;
> }
> + } else {
> + struct page *head;
>
> - pp->page = compound_head(page);
> head = compound_head(page);
> err = isolate_lru_page(head);
> - if (!err) {
> - list_add_tail(&head->lru, &pagelist);
> - mod_node_page_state(page_pgdat(head),
> - NR_ISOLATED_ANON + page_is_file_cache(head),
> - hpage_nr_pages(head));
> - }
> -put_and_set:
> - /*
> - * Either remove the duplicate refcount from
> - * isolate_lru_page() or drop the page ref if it was
> - * not isolated.
> - */
> - put_page(page);
> -set_status:
> - pp->status = err;
> - }
> -
> - err = 0;
> - if (!list_empty(&pagelist)) {
> - err = migrate_pages(&pagelist, new_page_node, NULL,
> - (unsigned long)pm, MIGRATE_SYNC, MR_SYSCALL);
> if (err)
> - putback_movable_pages(&pagelist);
> - }
> + goto out_putpage;
>
> + err = 0;
> + list_add_tail(&head->lru, pagelist);
> + mod_node_page_state(page_pgdat(head),
> + NR_ISOLATED_ANON + page_is_file_cache(head),
> + hpage_nr_pages(head));
> + }
> +out_putpage:
> + /*
> + * Either remove the duplicate refcount from
> + * isolate_lru_page() or drop the page ref if it was
> + * not isolated.
> + */
> + put_page(page);
> +out:
> up_read(&mm->mmap_sem);
> return err;
> }
> @@ -1593,79 +1556,80 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
> const int __user *nodes,
> int __user *status, int flags)
> {
> - struct page_to_node *pm;
> - unsigned long chunk_nr_pages;
> - unsigned long chunk_start;
> - int err;
> -
> - err = -ENOMEM;
> - pm = (struct page_to_node *)__get_free_page(GFP_KERNEL);
> - if (!pm)
> - goto out;
> + int chunk_node = NUMA_NO_NODE;
> + LIST_HEAD(pagelist);
> + int chunk_start, i;
> + int err = 0, err1;
err init might not be required, it's getting assigned -EFAULT right away.
>
> migrate_prep();
>
> - /*
> - * Store a chunk of page_to_node array in a page,
> - * but keep the last one as a marker
> - */
> - chunk_nr_pages = (PAGE_SIZE / sizeof(struct page_to_node)) - 1;
> -
> - for (chunk_start = 0;
> - chunk_start < nr_pages;
> - chunk_start += chunk_nr_pages) {
> - int j;
> -
> - if (chunk_start + chunk_nr_pages > nr_pages)
> - chunk_nr_pages = nr_pages - chunk_start;
> -
> - /* fill the chunk pm with addrs and nodes from user-space */
> - for (j = 0; j < chunk_nr_pages; j++) {
> - const void __user *p;
> - int node;
> -
> - err = -EFAULT;
> - if (get_user(p, pages + j + chunk_start))
> - goto out_pm;
> - pm[j].addr = (unsigned long) p;
> -
> - if (get_user(node, nodes + j + chunk_start))
> - goto out_pm;
> -
> - err = -ENODEV;
> - if (node < 0 || node >= MAX_NUMNODES)
> - goto out_pm;
> + for (i = chunk_start = 0; i < nr_pages; i++) {
> + const void __user *p;
> + unsigned long addr;
> + int node;
>
> - if (!node_state(node, N_MEMORY))
> - goto out_pm;
> -
> - err = -EACCES;
> - if (!node_isset(node, task_nodes))
> - goto out_pm;
> + err = -EFAULT;
> + if (get_user(p, pages + i))
> + goto out_flush;
> + if (get_user(node, nodes + i))
> + goto out_flush;
> + addr = (unsigned long)p;
> +
> + err = -ENODEV;
> + if (node < 0 || node >= MAX_NUMNODES)
> + goto out_flush;
> + if (!node_state(node, N_MEMORY))
> + goto out_flush;
>
> - pm[j].node = node;
> + err = -EACCES;
> + if (!node_isset(node, task_nodes))
> + goto out_flush;
> +
> + if (chunk_node == NUMA_NO_NODE) {
> + chunk_node = node;
> + chunk_start = i;
> + } else if (node != chunk_node) {
> + err = do_move_pages_to_node(mm, &pagelist, chunk_node);
> + if (err)
> + goto out;
> + err = store_status(status, chunk_start, chunk_node, i - chunk_start);
> + if (err)
> + goto out;
> + chunk_start = i;
> + chunk_node = node;
> }
>
> - /* End marker for this chunk */
> - pm[chunk_nr_pages].node = MAX_NUMNODES;
> + /*
> + * Errors in the page lookup or isolation are not fatal and we simply
> + * report them via status
> + */
> + err = add_page_for_migration(mm, addr, chunk_node,
> + &pagelist, flags & MPOL_MF_MOVE_ALL);
> + if (!err)
> + continue;
>
> - /* Migrate this chunk */
> - err = do_move_page_to_node_array(mm, pm,
> - flags & MPOL_MF_MOVE_ALL);
> - if (err < 0)
> - goto out_pm;
> + err = store_status(status, i, err, 1);
> + if (err)
> + goto out_flush;
>
> - /* Return status information */
> - for (j = 0; j < chunk_nr_pages; j++)
> - if (put_user(pm[j].status, status + j + chunk_start)) {
> - err = -EFAULT;
> - goto out_pm;
> - }
> + err = do_move_pages_to_node(mm, &pagelist, chunk_node);
> + if (err)
> + goto out;
> + if (i > chunk_start) {
> + err = store_status(status, chunk_start, chunk_node, i - chunk_start);
> + if (err)
> + goto out;
> + }
> + chunk_node = NUMA_NO_NODE;
This block of code is a bit confusing.
1) Why attempt to migrate when just one page could not be isolated?
2) 'i' is always greater than chunk_start except the starting page
3) Why reset chunk_node as NUMA_NO_NODE ?
On Wed 03-01-18 14:12:17, Anshuman Khandual wrote:
> On 12/08/2017 09:45 PM, Michal Hocko wrote:
[...]
> > @@ -986,7 +987,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
> > flags | MPOL_MF_DISCONTIG_OK, &pagelist);
> >
> > if (!list_empty(&pagelist)) {
> > - err = migrate_pages(&pagelist, new_node_page, NULL, dest,
> > + err = migrate_pages(&pagelist, alloc_new_node_page, NULL, dest,
> > MIGRATE_SYNC, MR_SYSCALL);
> > if (err)
> > putback_movable_pages(&pagelist);
>
> This reuses the existing page allocation helper from migrate_pages() system
> call. But all these allocator helper names for migrate_pages() function are
> really confusing. Even in this case alloc_new_node_page and the original
> new_node_page() which is still getting used in do_migrate_range() sounds
> similar even if their implementation is quite different. IMHO either all of
> them should be moved to the header file with proper differentiating names
> or let them be there in their respective files with these generic names and
> clean them up later.
I believe I've tried that but I couldn't make them into a single header
file easily because of header file dependencies. I agree that their
names are quite confusing. Feel free to send a patch to clean this up.
>
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 4d0be47a322a..9d7252ea2acd 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1444,141 +1444,104 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
> > }
> >
> > #ifdef CONFIG_NUMA
> > -/*
> > - * Move a list of individual pages
> > - */
> > -struct page_to_node {
> > - unsigned long addr;
> > - struct page *page;
> > - int node;
> > - int status;
> > -};
> >
> > -static struct page *new_page_node(struct page *p, unsigned long private,
> > - int **result)
> > +static int store_status(int __user *status, int start, int value, int nr)
> > {
> > - struct page_to_node *pm = (struct page_to_node *)private;
> > -
> > - while (pm->node != MAX_NUMNODES && pm->page != p)
> > - pm++;
> > + while (nr-- > 0) {
> > + if (put_user(value, status + start))
> > + return -EFAULT;
> > + start++;
> > + }
> >
> > - if (pm->node == MAX_NUMNODES)
> > - return NULL;
> > + return 0;
> > +}
>
>
> Just a nit. new_page_node() and store_status() seem different. Then why
> does the git diff look so clumsy?
Kirill was suggesting to use --patience to generate the diff, which leads
to a slightly better output. It has been posted as a separate email [1].
Maybe you will find that one easier to review.
[1] http://lkml.kernel.org/r/[email protected]
> >
> > - *result = &pm->status;
> > +static int do_move_pages_to_node(struct mm_struct *mm,
> > + struct list_head *pagelist, int node)
> > +{
> > + int err;
> >
> > - if (PageHuge(p))
> > - return alloc_huge_page_node(page_hstate(compound_head(p)),
> > - pm->node);
> > - else if (thp_migration_supported() && PageTransHuge(p)) {
> > - struct page *thp;
> > + if (list_empty(pagelist))
> > + return 0;
> >
> > - thp = alloc_pages_node(pm->node,
> > - (GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
> > - HPAGE_PMD_ORDER);
> > - if (!thp)
> > - return NULL;
> > - prep_transhuge_page(thp);
> > - return thp;
> > - } else
> > - return __alloc_pages_node(pm->node,
> > - GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
> > + err = migrate_pages(pagelist, alloc_new_node_page, NULL, node,
> > + MIGRATE_SYNC, MR_SYSCALL);
> > + if (err)
> > + putback_movable_pages(pagelist);
> > + return err;
> > }
>
> Even this one. IIUC, do_move_pages_to_node() migrates a chunk of pages
> at a time which belong to the same target node. Perhaps the name should
> suggest so. All these page migration helper functions sound so
> similar.
What do you suggest? I find do_move_pages_to_node quite explicit about
its purpose.
> > /*
> > - * Move a set of pages as indicated in the pm array. The addr
> > - * field must be set to the virtual address of the page to be moved
> > - * and the node number must contain a valid target node.
> > - * The pm array ends with node = MAX_NUMNODES.
> > + * Resolves the given address to a struct page, isolates it from the LRU and
> > + * puts it to the given pagelist.
> > + * Returns -errno if the page cannot be found/isolated or 0 when it has been
> > + * queued or the page doesn't need to be migrated because it is already on
> > + * the target node
> > */
> > -static int do_move_page_to_node_array(struct mm_struct *mm,
> > - struct page_to_node *pm,
> > - int migrate_all)
> > +static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
> > + int node, struct list_head *pagelist, bool migrate_all)
> > {
> > + struct vm_area_struct *vma;
> > + struct page *page;
> > + unsigned int follflags;
> > int err;
> > - struct page_to_node *pp;
> > - LIST_HEAD(pagelist);
> >
> > down_read(&mm->mmap_sem);
>
> Holding mmap_sem for individual pages makes sense. Current
> implementation is holding it for an entire batch.
I didn't bother to optimize this path to be honest. It is true that lock
batching can lead to improvements but that would complicate the code
(how many pages to batch?) so I've left that for later if somebody
actually sees any problem.
> > + err = -EFAULT;
> > + vma = find_vma(mm, addr);
> > + if (!vma || addr < vma->vm_start || !vma_migratable(vma))
>
> While here, should not we add 'addr > vma->vm_end' into this condition ?
No. See what find_vma does.
[...]
Please cut out the quoted reply to a minimum
> > @@ -1593,79 +1556,80 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
> > const int __user *nodes,
> > int __user *status, int flags)
> > {
> > - struct page_to_node *pm;
> > - unsigned long chunk_nr_pages;
> > - unsigned long chunk_start;
> > - int err;
> > -
> > - err = -ENOMEM;
> > - pm = (struct page_to_node *)__get_free_page(GFP_KERNEL);
> > - if (!pm)
> > - goto out;
> > + int chunk_node = NUMA_NO_NODE;
> > + LIST_HEAD(pagelist);
> > + int chunk_start, i;
> > + int err = 0, err1;
>
> err init might not be required, it's getting assigned -EFAULT right away.
No, nr_pages might be 0 AFAICS.
[...]
> > + if (chunk_node == NUMA_NO_NODE) {
> > + chunk_node = node;
> > + chunk_start = i;
> > + } else if (node != chunk_node) {
> > + err = do_move_pages_to_node(mm, &pagelist, chunk_node);
> > + if (err)
> > + goto out;
> > + err = store_status(status, chunk_start, chunk_node, i - chunk_start);
> > + if (err)
> > + goto out;
> > + chunk_start = i;
> > + chunk_node = node;
> > }
> >
> > - /* End marker for this chunk */
> > - pm[chunk_nr_pages].node = MAX_NUMNODES;
> > + /*
> > + * Errors in the page lookup or isolation are not fatal and we simply
> > + * report them via status
> > + */
> > + err = add_page_for_migration(mm, addr, chunk_node,
> > + &pagelist, flags & MPOL_MF_MOVE_ALL);
> > + if (!err)
> > + continue;
> >
> > - /* Migrate this chunk */
> > - err = do_move_page_to_node_array(mm, pm,
> > - flags & MPOL_MF_MOVE_ALL);
> > - if (err < 0)
> > - goto out_pm;
> > + err = store_status(status, i, err, 1);
> > + if (err)
> > + goto out_flush;
> >
> > - /* Return status information */
> > - for (j = 0; j < chunk_nr_pages; j++)
> > - if (put_user(pm[j].status, status + j + chunk_start)) {
> > - err = -EFAULT;
> > - goto out_pm;
> > - }
> > + err = do_move_pages_to_node(mm, &pagelist, chunk_node);
> > + if (err)
> > + goto out;
> > + if (i > chunk_start) {
> > + err = store_status(status, chunk_start, chunk_node, i - chunk_start);
> > + if (err)
> > + goto out;
> > + }
> > + chunk_node = NUMA_NO_NODE;
>
> This block of code is a bit confusing.
I believe this is easier to grasp when looking at the resulting code.
>
> 1) Why attempt to migrate when just one page could not be isolated ?
> 2) 'i' is always greater than chunk_start except the starting page
> 3) Why reset chunk_node as NUMA_NO_NODE ?
This is all about flushing the pending state on an error and
distinguishing a fresh batch.
--
Michal Hocko
SUSE Labs
On 01/03/2018 02:28 PM, Michal Hocko wrote:
> On Wed 03-01-18 14:12:17, Anshuman Khandual wrote:
>> On 12/08/2017 09:45 PM, Michal Hocko wrote:
> [...]
[...]
>>
>> This reuses the existing page allocation helper from migrate_pages() system
>> call. But all these allocator helper names for migrate_pages() function are
>> really confusing. Even in this case alloc_new_node_page and the original
>> new_node_page() which is still getting used in do_migrate_range() sounds
>> similar even if their implementation is quite different. IMHO either all of
>> them should be moved to the header file with proper differentiating names
>> or let them be there in their respective files with these generic names and
>> clean them up later.
>
> I believe I've tried that but I couldn't make them into a single header
> file easily because of header file dependencies. I agree that their
> names are quite confusing. Feel free to send a patch to clean this up.
Sure. Will try once this one gets into mmotm.
[...]
>>
>>
>> Just a nit. new_page_node() and store_status() seem different. Then why
>> does the git diff look so clumsy?
>
> Kirill was suggesting to use --patience to generate the diff, which leads
> to a slightly better output. It has been posted as a separate email [1].
> Maybe you will find that one easier to review.
>
> [1] http://lkml.kernel.org/r/[email protected]
Yeah it does look better.
[...]
>>> - return thp;
>>> - } else
>>> - return __alloc_pages_node(pm->node,
>>> - GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
>>> + err = migrate_pages(pagelist, alloc_new_node_page, NULL, node,
>>> + MIGRATE_SYNC, MR_SYSCALL);
>>> + if (err)
>>> + putback_movable_pages(pagelist);
>>> + return err;
>>> }
>>
>> Even this one. IIUC, do_move_pages_to_node() migrates a chunk of pages
>> at a time which belong to the same target node. Perhaps the name should
>> suggest so. All these page migration helper functions sound so
>> similar.
>
> What do you suggest? I find do_move_pages_to_node quite explicit on its
> purpose.
Sure. Not a big deal.
[...]
>>> {
>>> + struct vm_area_struct *vma;
>>> + struct page *page;
>>> + unsigned int follflags;
>>> int err;
>>> - struct page_to_node *pp;
>>> - LIST_HEAD(pagelist);
>>>
>>> down_read(&mm->mmap_sem);
>>
>> Holding mmap_sem for individual pages makes sense. Current
>> implementation is holding it for an entire batch.
>
> I didn't bother to optimize this path to be honest. It is true that lock
> batching can lead to improvements but that would complicate the code
> (how many pages to batch?) so I've left that for later if somebody
> actually sees any problem.
>
>>> + err = -EFAULT;
>>> + vma = find_vma(mm, addr);
>>> + if (!vma || addr < vma->vm_start || !vma_migratable(vma))
>>
>> While here, should not we add 'addr > vma->vm_end' into this condition ?
>
> No. See what find_vma does.
>
Right.
> [...]
>
> Please cut out the quoted reply to a minimum
Sure will do.
>
>>> @@ -1593,79 +1556,80 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
>>> const int __user *nodes,
>>> int __user *status, int flags)
>>> {
>>> - struct page_to_node *pm;
>>> - unsigned long chunk_nr_pages;
>>> - unsigned long chunk_start;
>>> - int err;
>>> -
>>> - err = -ENOMEM;
>>> - pm = (struct page_to_node *)__get_free_page(GFP_KERNEL);
>>> - if (!pm)
>>> - goto out;
>>> + int chunk_node = NUMA_NO_NODE;
>>> + LIST_HEAD(pagelist);
>>> + int chunk_start, i;
>>> + int err = 0, err1;
>>
>> err init might not be required, it's getting assigned -EFAULT right away.
>
> No, nr_pages might be 0 AFAICS.
Right but there is another err = 0 after the for loop.
>
> [...]
>>> + if (chunk_node == NUMA_NO_NODE) {
>>> + chunk_node = node;
>>> + chunk_start = i;
>>> + } else if (node != chunk_node) {
>>> + err = do_move_pages_to_node(mm, &pagelist, chunk_node);
>>> + if (err)
>>> + goto out;
>>> + err = store_status(status, chunk_start, chunk_node, i - chunk_start);
>>> + if (err)
>>> + goto out;
>>> + chunk_start = i;
>>> + chunk_node = node;
>>> }
[...]
>>> + err = do_move_pages_to_node(mm, &pagelist, chunk_node);
>>> + if (err)
>>> + goto out;
>>> + if (i > chunk_start) {
>>> + err = store_status(status, chunk_start, chunk_node, i - chunk_start);
>>> + if (err)
>>> + goto out;
>>> + }
>>> + chunk_node = NUMA_NO_NODE;
>>
>> This block of code is a bit confusing.
>
> I believe this is easier to grasp when looking at the resulting code.
>>
>> 1) Why attempt to migrate when just one page could not be isolated ?
>> 2) 'i' is always greater than chunk_start except the starting page
>> 3) Why reset chunk_node as NUMA_NO_NODE ?
>
> This is all about flushing the pending state on an error and
> distinguishing a fresh batch.
Okay. Will test it out on a multi node system once I get hold of one.
On Wed 03-01-18 15:06:49, Anshuman Khandual wrote:
> On 01/03/2018 02:28 PM, Michal Hocko wrote:
> > On Wed 03-01-18 14:12:17, Anshuman Khandual wrote:
> >> On 12/08/2017 09:45 PM, Michal Hocko wrote:
[...]
> >>> @@ -1593,79 +1556,80 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
> >>> const int __user *nodes,
> >>> int __user *status, int flags)
> >>> {
> >>> - struct page_to_node *pm;
> >>> - unsigned long chunk_nr_pages;
> >>> - unsigned long chunk_start;
> >>> - int err;
> >>> -
> >>> - err = -ENOMEM;
> >>> - pm = (struct page_to_node *)__get_free_page(GFP_KERNEL);
> >>> - if (!pm)
> >>> - goto out;
> >>> + int chunk_node = NUMA_NO_NODE;
> >>> + LIST_HEAD(pagelist);
> >>> + int chunk_start, i;
> >>> + int err = 0, err1;
> >>
> >> err init might not be required, it's getting assigned -EFAULT right away.
> >
> > No, nr_pages might be 0 AFAICS.
>
> Right but there is another err = 0 after the for loop.
No we have
out_flush:
/* Make sure we do not overwrite the existing error */
err1 = do_move_pages_to_node(mm, &pagelist, current_node);
if (!err1)
err1 = store_status(status, start, current_node, i - start);
if (!err)
err = err1;
This is obviously not an act of beauty and probably a subject for a
cleanup, but I just wanted this thing to work first. Further
cleanups can go on top.
> > [...]
> >>> + if (chunk_node == NUMA_NO_NODE) {
> >>> + chunk_node = node;
> >>> + chunk_start = i;
> >>> + } else if (node != chunk_node) {
> >>> + err = do_move_pages_to_node(mm, &pagelist, chunk_node);
> >>> + if (err)
> >>> + goto out;
> >>> + err = store_status(status, chunk_start, chunk_node, i - chunk_start);
> >>> + if (err)
> >>> + goto out;
> >>> + chunk_start = i;
> >>> + chunk_node = node;
> >>> }
>
> [...]
>
> >>> + err = do_move_pages_to_node(mm, &pagelist, chunk_node);
> >>> + if (err)
> >>> + goto out;
> >>> + if (i > chunk_start) {
> >>> + err = store_status(status, chunk_start, chunk_node, i - chunk_start);
> >>> + if (err)
> >>> + goto out;
> >>> + }
> >>> + chunk_node = NUMA_NO_NODE;
> >>
> >> This block of code is a bit confusing.
> >
> > I believe this is easier to grasp when looking at the resulting code.
> >>
> >> 1) Why attempt to migrate when just one page could not be isolated ?
> >> 2) 'i' is always greater than chunk_start except the starting page
> >> 3) Why reset chunk_node as NUMA_NO_NODE ?
> >
> > This is all about flushing the pending state on an error and
> > distinguishing a fresh batch.
>
> Okay. Will test it out on a multi node system once I get hold of one.
Thanks. I have been testing this specific code path with the following
simple test program and numactl -m0. The code is rather crude, so I've
always modified it manually to test different scenarios (this one keeps
every 1024th page on node 0 to test the batching).
---
#include <sys/mman.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>
#include <numaif.h>
int main(void)
{
	unsigned long nr_pages = 10000;
	size_t length = nr_pages << 12, i;
	unsigned char *addr = mmap(NULL, length, PROT_READ | PROT_WRITE,
				   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	void *addrs[nr_pages];
	int nodes[nr_pages];
	int status[nr_pages];
	char cmd[128];
	char ch;

	if (addr == MAP_FAILED)
		return 1;

	madvise(addr, length, MADV_NOHUGEPAGE);
	for (i = 0; i < length; i += 4096)
		addr[i] = 1;

	for (i = 0; i < nr_pages; i++) {
		addrs[i] = &addr[i * 4096];
		if (i % 1024)
			nodes[i] = 1;
		else
			nodes[i] = 0;
		status[i] = 0;
	}

	snprintf(cmd, sizeof(cmd) - 1, "grep %lx /proc/%d/numa_maps",
		 (unsigned long)addr, getpid());
	system(cmd);
	snprintf(cmd, sizeof(cmd) - 1, "grep %lx -A20 /proc/%d/smaps",
		 (unsigned long)addr, getpid());
	system(cmd);
	read(0, &ch, 1);

	if (move_pages(0, nr_pages, addrs, nodes, status, MPOL_MF_MOVE))
		printf("move_pages: err:%d\n", errno);

	snprintf(cmd, sizeof(cmd) - 1, "grep %lx /proc/%d/numa_maps",
		 (unsigned long)addr, getpid());
	system(cmd);
	snprintf(cmd, sizeof(cmd) - 1, "grep %lx -A20 /proc/%d/smaps",
		 (unsigned long)addr, getpid());
	system(cmd);
	return 0;
}
---
--
Michal Hocko
SUSE Labs