From: Joonsoo Kim <[email protected]>
This patchset cleans up the migration target allocation functions.
* Changes on v3
- do not introduce alloc_control for hugetlb functions
- do not change the signature of migrate_pages()
- rename alloc_control to migration_target_control
* Changes on v2
- add acked-by tags
- fix missing compound_head() call for patch #3
- remove thisnode field on alloc_control and use __GFP_THISNODE directly
- fix missing __gfp_mask setup for patch
"mm/hugetlb: do not modify user provided gfp_mask"
* Cover-letter
Contributions of this patchset are:
1. unify two hugetlb alloc functions. As a result, only one remains.
2. make one external hugetlb alloc function internal.
3. unify three functions for migration target allocation (a rough usage
sketch is shown below).
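After this series, the migration target allocation sites funnel into a single
callback, alloc_migration_target(); roughly, a caller passes a
migration_target_control describing the preferred node, nodemask and gfp
flags, e.g. (illustrative sketch only, see the later patches for the real
call sites):

	struct migration_target_control mtc = {
		.nid = page_to_nid(page),
		.gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
	};

	/* alloc_migration_target() reads the control structure via 'private' */
	migrate_pages(&pagelist, alloc_migration_target, NULL,
		      (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL);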
The patchset is based on next-20200621.
The patchset is available on:
https://github.com/JoonsooKim/linux/tree/cleanup-migration-target-allocation-v3.00-next-20200621
Thanks.
Joonsoo Kim (8):
mm/page_isolation: prefer the node of the source page
mm/migrate: move migration helper from .h to .c
mm/hugetlb: unify migration callbacks
mm/hugetlb: make hugetlb migration callback CMA aware
mm/migrate: make a standard migration target allocation function
mm/gup: use a standard migration target allocation callback
mm/mempolicy: use a standard migration target allocation callback
mm/page_alloc: remove a wrapper for alloc_migration_target()
include/linux/hugetlb.h | 13 +++---------
include/linux/migrate.h | 34 ++++++------------------------
mm/gup.c | 56 +++++++------------------------------------------
mm/hugetlb.c | 53 ++++++++++++++++++++--------------------------
mm/internal.h | 9 +++++++-
mm/memory-failure.c | 8 +++++--
mm/memory_hotplug.c | 14 ++++++++-----
mm/mempolicy.c | 29 ++++++-------------------
mm/migrate.c | 45 +++++++++++++++++++++++++++++++++++++--
mm/page_alloc.c | 9 ++++++--
mm/page_isolation.c | 5 -----
11 files changed, 119 insertions(+), 156 deletions(-)
--
2.7.4
From: Joonsoo Kim <[email protected]>
For locality, it's better to migrate the page to the same node as the
source page rather than to the node of the current caller's CPU.
Acked-by: Roman Gushchin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/page_isolation.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index f6d07c5..aec26d9 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -309,5 +309,7 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
struct page *alloc_migrate_target(struct page *page, unsigned long private)
{
- return new_page_nodemask(page, numa_node_id(), &node_states[N_MEMORY]);
+ int nid = page_to_nid(page);
+
+ return new_page_nodemask(page, nid, &node_states[N_MEMORY]);
}
--
2.7.4
From: Joonsoo Kim <[email protected]>
new_page_nodemask() is not a performance-sensitive function. Move it to .c.
This is a preparation step for a future change.
Acked-by: Mike Kravetz <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/migrate.h | 33 +++++----------------------------
mm/migrate.c | 29 +++++++++++++++++++++++++++++
2 files changed, 34 insertions(+), 28 deletions(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 3e546cb..1d70b4a 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -31,34 +31,6 @@ enum migrate_reason {
/* In mm/debug.c; also keep sync with include/trace/events/migrate.h */
extern const char *migrate_reason_names[MR_TYPES];
-static inline struct page *new_page_nodemask(struct page *page,
- int preferred_nid, nodemask_t *nodemask)
-{
- gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL;
- unsigned int order = 0;
- struct page *new_page = NULL;
-
- if (PageHuge(page))
- return alloc_huge_page_nodemask(page_hstate(compound_head(page)),
- preferred_nid, nodemask);
-
- if (PageTransHuge(page)) {
- gfp_mask |= GFP_TRANSHUGE;
- order = HPAGE_PMD_ORDER;
- }
-
- if (PageHighMem(page) || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
- gfp_mask |= __GFP_HIGHMEM;
-
- new_page = __alloc_pages_nodemask(gfp_mask, order,
- preferred_nid, nodemask);
-
- if (new_page && PageTransHuge(new_page))
- prep_transhuge_page(new_page);
-
- return new_page;
-}
-
#ifdef CONFIG_MIGRATION
extern void putback_movable_pages(struct list_head *l);
@@ -67,6 +39,8 @@ extern int migrate_page(struct address_space *mapping,
enum migrate_mode mode);
extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
unsigned long private, enum migrate_mode mode, int reason);
+extern struct page *new_page_nodemask(struct page *page,
+ int preferred_nid, nodemask_t *nodemask);
extern int isolate_movable_page(struct page *page, isolate_mode_t mode);
extern void putback_movable_page(struct page *page);
@@ -85,6 +59,9 @@ static inline int migrate_pages(struct list_head *l, new_page_t new,
free_page_t free, unsigned long private, enum migrate_mode mode,
int reason)
{ return -ENOSYS; }
+static inline struct page *new_page_nodemask(struct page *page,
+ int preferred_nid, nodemask_t *nodemask)
+ { return NULL; }
static inline int isolate_movable_page(struct page *page, isolate_mode_t mode)
{ return -EBUSY; }
diff --git a/mm/migrate.c b/mm/migrate.c
index c95912f..6b5c75b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1536,6 +1536,35 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
return rc;
}
+struct page *new_page_nodemask(struct page *page,
+ int preferred_nid, nodemask_t *nodemask)
+{
+ gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL;
+ unsigned int order = 0;
+ struct page *new_page = NULL;
+
+ if (PageHuge(page))
+ return alloc_huge_page_nodemask(
+ page_hstate(compound_head(page)),
+ preferred_nid, nodemask);
+
+ if (PageTransHuge(page)) {
+ gfp_mask |= GFP_TRANSHUGE;
+ order = HPAGE_PMD_ORDER;
+ }
+
+ if (PageHighMem(page) || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
+ gfp_mask |= __GFP_HIGHMEM;
+
+ new_page = __alloc_pages_nodemask(gfp_mask, order,
+ preferred_nid, nodemask);
+
+ if (new_page && PageTransHuge(new_page))
+ prep_transhuge_page(new_page);
+
+ return new_page;
+}
+
#ifdef CONFIG_NUMA
static int store_status(int __user *status, int start, int value, int nr)
--
2.7.4
From: Joonsoo Kim <[email protected]>
There is no difference between the two migration callback functions,
alloc_huge_page_node() and alloc_huge_page_nodemask(), except for
__GFP_THISNODE handling. This patch adds a gfp_mask argument to
alloc_huge_page_nodemask() and replaces the callsite of
alloc_huge_page_node() with a call to
alloc_huge_page_nodemask(..., __GFP_THISNODE).
It's safe to remove the node id check in alloc_huge_page_node() since
no caller passes NUMA_NO_NODE as a node id.
Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/hugetlb.h | 11 +++--------
mm/hugetlb.c | 26 +++-----------------------
mm/mempolicy.c | 9 +++++----
mm/migrate.c | 5 +++--
4 files changed, 14 insertions(+), 37 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 50650d0..8a8b755 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -504,9 +504,8 @@ struct huge_bootmem_page {
struct page *alloc_huge_page(struct vm_area_struct *vma,
unsigned long addr, int avoid_reserve);
-struct page *alloc_huge_page_node(struct hstate *h, int nid);
struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
- nodemask_t *nmask);
+ nodemask_t *nmask, gfp_t gfp_mask);
struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
unsigned long address);
struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
@@ -759,13 +758,9 @@ static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
return NULL;
}
-static inline struct page *alloc_huge_page_node(struct hstate *h, int nid)
-{
- return NULL;
-}
-
static inline struct page *
-alloc_huge_page_nodemask(struct hstate *h, int preferred_nid, nodemask_t *nmask)
+alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
+ nodemask_t *nmask, gfp_t gfp_mask)
{
return NULL;
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d54bb7e..bd408f2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1979,30 +1979,10 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
}
/* page migration callback function */
-struct page *alloc_huge_page_node(struct hstate *h, int nid)
-{
- gfp_t gfp_mask = htlb_alloc_mask(h);
- struct page *page = NULL;
-
- if (nid != NUMA_NO_NODE)
- gfp_mask |= __GFP_THISNODE;
-
- spin_lock(&hugetlb_lock);
- if (h->free_huge_pages - h->resv_huge_pages > 0)
- page = dequeue_huge_page_nodemask(h, gfp_mask, nid, NULL);
- spin_unlock(&hugetlb_lock);
-
- if (!page)
- page = alloc_migrate_huge_page(h, gfp_mask, nid, NULL);
-
- return page;
-}
-
-/* page migration callback function */
struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
- nodemask_t *nmask)
+ nodemask_t *nmask, gfp_t gfp_mask)
{
- gfp_t gfp_mask = htlb_alloc_mask(h);
+ gfp_mask |= htlb_alloc_mask(h);
spin_lock(&hugetlb_lock);
if (h->free_huge_pages - h->resv_huge_pages > 0) {
@@ -2031,7 +2011,7 @@ struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
gfp_mask = htlb_alloc_mask(h);
node = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
- page = alloc_huge_page_nodemask(h, node, nodemask);
+ page = alloc_huge_page_nodemask(h, node, nodemask, 0);
mpol_cond_put(mpol);
return page;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b9e85d4..f21cff5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1068,10 +1068,11 @@ static int migrate_page_add(struct page *page, struct list_head *pagelist,
/* page allocation callback for NUMA node migration */
struct page *alloc_new_node_page(struct page *page, unsigned long node)
{
- if (PageHuge(page))
- return alloc_huge_page_node(page_hstate(compound_head(page)),
- node);
- else if (PageTransHuge(page)) {
+ if (PageHuge(page)) {
+ return alloc_huge_page_nodemask(
+ page_hstate(compound_head(page)), node,
+ NULL, __GFP_THISNODE);
+ } else if (PageTransHuge(page)) {
struct page *thp;
thp = alloc_pages_node(node,
diff --git a/mm/migrate.c b/mm/migrate.c
index 6b5c75b..6ca9f0c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1543,10 +1543,11 @@ struct page *new_page_nodemask(struct page *page,
unsigned int order = 0;
struct page *new_page = NULL;
- if (PageHuge(page))
+ if (PageHuge(page)) {
return alloc_huge_page_nodemask(
page_hstate(compound_head(page)),
- preferred_nid, nodemask);
+ preferred_nid, nodemask, 0);
+ }
if (PageTransHuge(page)) {
gfp_mask |= GFP_TRANSHUGE;
--
2.7.4
From: Joonsoo Kim <[email protected]>
new_non_cma_page() in gup.c, which tries to allocate a migration target page,
needs to allocate a new page that is not on the CMA area.
new_non_cma_page() implements this by removing the __GFP_MOVABLE flag. This
works well for THP and normal pages but not for hugetlb pages.
The hugetlb page allocation process consists of two steps. The first is
dequeuing from the pool. The second, if there is no available page in the
pool, is allocating from the page allocator.
new_non_cma_page() can control allocation from the page allocator by
specifying the correct gfp flags. However, dequeuing could not be controlled
until now, so new_non_cma_page() skips dequeuing completely. This is
suboptimal since new_non_cma_page() cannot utilize hugetlb pages already in
the pool, so this patch tries to fix this situation.
This patch makes the hugetlb dequeue function CMA aware so that it skips CMA
pages if the newly added skip_cma argument is passed as true.
Acked-by: Mike Kravetz <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/hugetlb.h | 6 ++----
mm/gup.c | 3 ++-
mm/hugetlb.c | 31 ++++++++++++++++++++++---------
mm/mempolicy.c | 2 +-
mm/migrate.c | 2 +-
5 files changed, 28 insertions(+), 16 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8a8b755..858522e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -505,11 +505,9 @@ struct huge_bootmem_page {
struct page *alloc_huge_page(struct vm_area_struct *vma,
unsigned long addr, int avoid_reserve);
struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
- nodemask_t *nmask, gfp_t gfp_mask);
+ nodemask_t *nmask, gfp_t gfp_mask, bool skip_cma);
struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
unsigned long address);
-struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
- int nid, nodemask_t *nmask);
int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
pgoff_t idx);
@@ -760,7 +758,7 @@ static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
static inline struct page *
alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
- nodemask_t *nmask, gfp_t gfp_mask)
+ nodemask_t *nmask, gfp_t gfp_mask, bool skip_cma)
{
return NULL;
}
diff --git a/mm/gup.c b/mm/gup.c
index 6f47697..15be281 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1630,11 +1630,12 @@ static struct page *new_non_cma_page(struct page *page, unsigned long private)
#ifdef CONFIG_HUGETLB_PAGE
if (PageHuge(page)) {
struct hstate *h = page_hstate(page);
+
/*
* We don't want to dequeue from the pool because pool pages will
* mostly be from the CMA region.
*/
- return alloc_migrate_huge_page(h, gfp_mask, nid, NULL);
+ return alloc_huge_page_nodemask(h, nid, NULL, gfp_mask, true);
}
#endif
if (PageTransHuge(page)) {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bd408f2..1410e62 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1033,13 +1033,18 @@ static void enqueue_huge_page(struct hstate *h, struct page *page)
h->free_huge_pages_node[nid]++;
}
-static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
+static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid, bool skip_cma)
{
struct page *page;
- list_for_each_entry(page, &h->hugepage_freelists[nid], lru)
+ list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
+ if (skip_cma && is_migrate_cma_page(page))
+ continue;
+
if (!PageHWPoison(page))
break;
+ }
+
/*
* if 'non-isolated free hugepage' not found on the list,
* the allocation fails.
@@ -1054,7 +1059,7 @@ static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
}
static struct page *dequeue_huge_page_nodemask(struct hstate *h, gfp_t gfp_mask, int nid,
- nodemask_t *nmask)
+ nodemask_t *nmask, bool skip_cma)
{
unsigned int cpuset_mems_cookie;
struct zonelist *zonelist;
@@ -1079,7 +1084,7 @@ static struct page *dequeue_huge_page_nodemask(struct hstate *h, gfp_t gfp_mask,
continue;
node = zone_to_nid(zone);
- page = dequeue_huge_page_node_exact(h, node);
+ page = dequeue_huge_page_node_exact(h, node, skip_cma);
if (page)
return page;
}
@@ -1124,7 +1129,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
gfp_mask = htlb_alloc_mask(h);
nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
- page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
+ page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask, false);
if (page && !avoid_reserve && vma_has_reserves(vma, chg)) {
SetPagePrivate(page);
h->resv_huge_pages--;
@@ -1937,7 +1942,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
return page;
}
-struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
+static struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
int nid, nodemask_t *nmask)
{
struct page *page;
@@ -1980,7 +1985,7 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
/* page migration callback function */
struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
- nodemask_t *nmask, gfp_t gfp_mask)
+ nodemask_t *nmask, gfp_t gfp_mask, bool skip_cma)
{
gfp_mask |= htlb_alloc_mask(h);
@@ -1988,7 +1993,8 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
if (h->free_huge_pages - h->resv_huge_pages > 0) {
struct page *page;
- page = dequeue_huge_page_nodemask(h, gfp_mask, preferred_nid, nmask);
+ page = dequeue_huge_page_nodemask(h, gfp_mask,
+ preferred_nid, nmask, skip_cma);
if (page) {
spin_unlock(&hugetlb_lock);
return page;
@@ -1996,6 +2002,13 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
}
spin_unlock(&hugetlb_lock);
+ /*
+ * To skip the memory on CMA area, we need to clear __GFP_MOVABLE.
+ * Clearing __GFP_MOVABLE at the top of this function would also skip
+ * the proper allocation candidates for dequeue so clearing it here.
+ */
+ if (skip_cma)
+ gfp_mask &= ~__GFP_MOVABLE;
return alloc_migrate_huge_page(h, gfp_mask, preferred_nid, nmask);
}
@@ -2011,7 +2024,7 @@ struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
gfp_mask = htlb_alloc_mask(h);
node = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
- page = alloc_huge_page_nodemask(h, node, nodemask, 0);
+ page = alloc_huge_page_nodemask(h, node, nodemask, 0, false);
mpol_cond_put(mpol);
return page;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f21cff5..a3abf64 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1071,7 +1071,7 @@ struct page *alloc_new_node_page(struct page *page, unsigned long node)
if (PageHuge(page)) {
return alloc_huge_page_nodemask(
page_hstate(compound_head(page)), node,
- NULL, __GFP_THISNODE);
+ NULL, __GFP_THISNODE, false);
} else if (PageTransHuge(page)) {
struct page *thp;
diff --git a/mm/migrate.c b/mm/migrate.c
index 6ca9f0c..634f1ea 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1546,7 +1546,7 @@ struct page *new_page_nodemask(struct page *page,
if (PageHuge(page)) {
return alloc_huge_page_nodemask(
page_hstate(compound_head(page)),
- preferred_nid, nodemask, 0);
+ preferred_nid, nodemask, 0, false);
}
if (PageTransHuge(page)) {
--
2.7.4
From: Joonsoo Kim <[email protected]>
There is a well-defined standard migration target callback.
Use it directly.
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/page_alloc.c | 9 +++++++--
mm/page_isolation.c | 11 -----------
2 files changed, 7 insertions(+), 13 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9808339..884dfb5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8359,6 +8359,11 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
unsigned long pfn = start;
unsigned int tries = 0;
int ret = 0;
+ struct migration_target_control mtc = {
+ .nid = zone_to_nid(cc->zone),
+ .nmask = &node_states[N_MEMORY],
+ .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
+ };
migrate_prep();
@@ -8385,8 +8390,8 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
&cc->migratepages);
cc->nr_migratepages -= nr_reclaimed;
- ret = migrate_pages(&cc->migratepages, alloc_migrate_target,
- NULL, 0, cc->mode, MR_CONTIG_RANGE);
+ ret = migrate_pages(&cc->migratepages, alloc_migration_target,
+ NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE);
}
if (ret < 0) {
putback_movable_pages(&cc->migratepages);
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index adba031..242c031 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -306,14 +306,3 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
return pfn < end_pfn ? -EBUSY : 0;
}
-
-struct page *alloc_migrate_target(struct page *page, unsigned long private)
-{
- struct migration_target_control mtc = {
- .nid = page_to_nid(page),
- .nmask = &node_states[N_MEMORY],
- .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
- };
-
- return alloc_migration_target(page, (unsigned long)&mtc);
-}
--
2.7.4
From: Joonsoo Kim <[email protected]>
There are several similar functions for migration target allocation. Since
there is no fundamental difference between them, it's better to keep just one
rather than keeping all the variants. This patch implements a base migration
target allocation function. In the following patches, the variants will be
converted to use this function.
Note that the PageHighMem() check in the previous function is changed to an
open-coded is_highmem_idx() check since it provides more readability.
Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/migrate.h | 5 +++--
mm/internal.h | 7 +++++++
mm/memory-failure.c | 8 ++++++--
mm/memory_hotplug.c | 14 +++++++++-----
mm/migrate.c | 21 +++++++++++++--------
mm/page_isolation.c | 8 ++++++--
6 files changed, 44 insertions(+), 19 deletions(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 1d70b4a..5e9c866 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -10,6 +10,8 @@
typedef struct page *new_page_t(struct page *page, unsigned long private);
typedef void free_page_t(struct page *page, unsigned long private);
+struct migration_target_control;
+
/*
* Return values from addresss_space_operations.migratepage():
* - negative errno on page migration failure;
@@ -39,8 +41,7 @@ extern int migrate_page(struct address_space *mapping,
enum migrate_mode mode);
extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
unsigned long private, enum migrate_mode mode, int reason);
-extern struct page *new_page_nodemask(struct page *page,
- int preferred_nid, nodemask_t *nodemask);
+extern struct page *alloc_migration_target(struct page *page, unsigned long private);
extern int isolate_movable_page(struct page *page, isolate_mode_t mode);
extern void putback_movable_page(struct page *page);
diff --git a/mm/internal.h b/mm/internal.h
index 42cf0b6..f725aa8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -614,4 +614,11 @@ static inline bool is_migrate_highatomic_page(struct page *page)
void setup_zone_pageset(struct zone *zone);
extern struct page *alloc_new_node_page(struct page *page, unsigned long node);
+
+struct migration_target_control {
+ int nid; /* preferred node id */
+ nodemask_t *nmask;
+ gfp_t gfp_mask;
+};
+
#endif /* __MM_INTERNAL_H */
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 47b8ccb..820ea5e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1648,9 +1648,13 @@ EXPORT_SYMBOL(unpoison_memory);
static struct page *new_page(struct page *p, unsigned long private)
{
- int nid = page_to_nid(p);
+ struct migration_target_control mtc = {
+ .nid = page_to_nid(p),
+ .nmask = &node_states[N_MEMORY],
+ .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
+ };
- return new_page_nodemask(p, nid, &node_states[N_MEMORY]);
+ return alloc_migration_target(p, (unsigned long)&mtc);
}
/*
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index be3c62e3..d2b65a5 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1259,19 +1259,23 @@ static int scan_movable_pages(unsigned long start, unsigned long end,
static struct page *new_node_page(struct page *page, unsigned long private)
{
- int nid = page_to_nid(page);
nodemask_t nmask = node_states[N_MEMORY];
+ struct migration_target_control mtc = {
+ .nid = page_to_nid(page),
+ .nmask = &nmask,
+ .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
+ };
/*
* try to allocate from a different node but reuse this node if there
* are no other online nodes to be used (e.g. we are offlining a part
* of the only existing node)
*/
- node_clear(nid, nmask);
- if (nodes_empty(nmask))
- node_set(nid, nmask);
+ node_clear(mtc.nid, *mtc.nmask);
+ if (nodes_empty(*mtc.nmask))
+ node_set(mtc.nid, *mtc.nmask);
- return new_page_nodemask(page, nid, &nmask);
+ return alloc_migration_target(page, (unsigned long)&mtc);
}
static int
diff --git a/mm/migrate.c b/mm/migrate.c
index 634f1ea..3afff59 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1536,29 +1536,34 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
return rc;
}
-struct page *new_page_nodemask(struct page *page,
- int preferred_nid, nodemask_t *nodemask)
+struct page *alloc_migration_target(struct page *page, unsigned long private)
{
- gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL;
+ struct migration_target_control *mtc;
+ gfp_t gfp_mask;
unsigned int order = 0;
struct page *new_page = NULL;
+ int zidx;
+
+ mtc = (struct migration_target_control *)private;
+ gfp_mask = mtc->gfp_mask;
if (PageHuge(page)) {
return alloc_huge_page_nodemask(
- page_hstate(compound_head(page)),
- preferred_nid, nodemask, 0, false);
+ page_hstate(compound_head(page)), mtc->nid,
+ mtc->nmask, gfp_mask, false);
}
if (PageTransHuge(page)) {
+ gfp_mask &= ~__GFP_RECLAIM;
gfp_mask |= GFP_TRANSHUGE;
order = HPAGE_PMD_ORDER;
}
-
- if (PageHighMem(page) || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
+ zidx = zone_idx(page_zone(page));
+ if (is_highmem_idx(zidx) || zidx == ZONE_MOVABLE)
gfp_mask |= __GFP_HIGHMEM;
new_page = __alloc_pages_nodemask(gfp_mask, order,
- preferred_nid, nodemask);
+ mtc->nid, mtc->nmask);
if (new_page && PageTransHuge(new_page))
prep_transhuge_page(new_page);
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index aec26d9..adba031 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -309,7 +309,11 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
struct page *alloc_migrate_target(struct page *page, unsigned long private)
{
- int nid = page_to_nid(page);
+ struct migration_target_control mtc = {
+ .nid = page_to_nid(page),
+ .nmask = &node_states[N_MEMORY],
+ .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
+ };
- return new_page_nodemask(page, nid, &node_states[N_MEMORY]);
+ return alloc_migration_target(page, (unsigned long)&mtc);
}
--
2.7.4
From: Joonsoo Kim <[email protected]>
There is a well-defined migration target allocation callback.
It's mostly similar to new_non_cma_page() except for the handling of CMA
pages. This patch adds CMA awareness to the standard migration target
allocation callback and uses it in gup.c.
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/gup.c | 57 ++++++++-------------------------------------------------
mm/internal.h | 1 +
mm/migrate.c | 4 +++-
3 files changed, 12 insertions(+), 50 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index 15be281..f6124e3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1608,56 +1608,15 @@ static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
}
#ifdef CONFIG_CMA
-static struct page *new_non_cma_page(struct page *page, unsigned long private)
+static struct page *alloc_migration_target_non_cma(struct page *page, unsigned long private)
{
- /*
- * We want to make sure we allocate the new page from the same node
- * as the source page.
- */
- int nid = page_to_nid(page);
- /*
- * Trying to allocate a page for migration. Ignore allocation
- * failure warnings. We don't force __GFP_THISNODE here because
- * this node here is the node where we have CMA reservation and
- * in some case these nodes will have really less non movable
- * allocation memory.
- */
- gfp_t gfp_mask = GFP_USER | __GFP_NOWARN;
-
- if (PageHighMem(page))
- gfp_mask |= __GFP_HIGHMEM;
-
-#ifdef CONFIG_HUGETLB_PAGE
- if (PageHuge(page)) {
- struct hstate *h = page_hstate(page);
-
- /*
- * We don't want to dequeue from the pool because pool pages will
- * mostly be from the CMA region.
- */
- return alloc_huge_page_nodemask(h, nid, NULL, gfp_mask, true);
- }
-#endif
- if (PageTransHuge(page)) {
- struct page *thp;
- /*
- * ignore allocation failure warnings
- */
- gfp_t thp_gfpmask = GFP_TRANSHUGE | __GFP_NOWARN;
-
- /*
- * Remove the movable mask so that we don't allocate from
- * CMA area again.
- */
- thp_gfpmask &= ~__GFP_MOVABLE;
- thp = __alloc_pages_node(nid, thp_gfpmask, HPAGE_PMD_ORDER);
- if (!thp)
- return NULL;
- prep_transhuge_page(thp);
- return thp;
- }
+ struct migration_target_control mtc = {
+ .nid = page_to_nid(page),
+ .gfp_mask = GFP_USER | __GFP_NOWARN,
+ .skip_cma = true,
+ };
- return __alloc_pages_node(nid, gfp_mask, 0);
+ return alloc_migration_target(page, (unsigned long)&mtc);
}
static long check_and_migrate_cma_pages(struct task_struct *tsk,
@@ -1719,7 +1678,7 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk,
for (i = 0; i < nr_pages; i++)
put_page(pages[i]);
- if (migrate_pages(&cma_page_list, new_non_cma_page,
+ if (migrate_pages(&cma_page_list, alloc_migration_target_non_cma,
NULL, 0, MIGRATE_SYNC, MR_CONTIG_RANGE)) {
/*
* some of the pages failed migration. Do get_user_pages
diff --git a/mm/internal.h b/mm/internal.h
index f725aa8..fb7f7fe 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -619,6 +619,7 @@ struct migration_target_control {
int nid; /* preferred node id */
nodemask_t *nmask;
gfp_t gfp_mask;
+ bool skip_cma;
};
#endif /* __MM_INTERNAL_H */
diff --git a/mm/migrate.c b/mm/migrate.c
index 3afff59..7c4cd74 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1550,7 +1550,7 @@ struct page *alloc_migration_target(struct page *page, unsigned long private)
if (PageHuge(page)) {
return alloc_huge_page_nodemask(
page_hstate(compound_head(page)), mtc->nid,
- mtc->nmask, gfp_mask, false);
+ mtc->nmask, gfp_mask, mtc->skip_cma);
}
if (PageTransHuge(page)) {
@@ -1561,6 +1561,8 @@ struct page *alloc_migration_target(struct page *page, unsigned long private)
zidx = zone_idx(page_zone(page));
if (is_highmem_idx(zidx) || zidx == ZONE_MOVABLE)
gfp_mask |= __GFP_HIGHMEM;
+ if (mtc->skip_cma)
+ gfp_mask &= ~__GFP_MOVABLE;
new_page = __alloc_pages_nodemask(gfp_mask, order,
mtc->nid, mtc->nmask);
--
2.7.4
From: Joonsoo Kim <[email protected]>
There is a well-defined migration target allocation callback.
Use it.
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/internal.h | 1 -
mm/mempolicy.c | 30 ++++++------------------------
mm/migrate.c | 8 ++++++--
3 files changed, 12 insertions(+), 27 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index fb7f7fe..4f9f6b6 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -613,7 +613,6 @@ static inline bool is_migrate_highatomic_page(struct page *page)
}
void setup_zone_pageset(struct zone *zone);
-extern struct page *alloc_new_node_page(struct page *page, unsigned long node);
struct migration_target_control {
int nid; /* preferred node id */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index a3abf64..85a3f21 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1065,28 +1065,6 @@ static int migrate_page_add(struct page *page, struct list_head *pagelist,
return 0;
}
-/* page allocation callback for NUMA node migration */
-struct page *alloc_new_node_page(struct page *page, unsigned long node)
-{
- if (PageHuge(page)) {
- return alloc_huge_page_nodemask(
- page_hstate(compound_head(page)), node,
- NULL, __GFP_THISNODE, false);
- } else if (PageTransHuge(page)) {
- struct page *thp;
-
- thp = alloc_pages_node(node,
- (GFP_TRANSHUGE | __GFP_THISNODE),
- HPAGE_PMD_ORDER);
- if (!thp)
- return NULL;
- prep_transhuge_page(thp);
- return thp;
- } else
- return __alloc_pages_node(node, GFP_HIGHUSER_MOVABLE |
- __GFP_THISNODE, 0);
-}
-
/*
* Migrate pages from one node to a target node.
* Returns error or the number of pages not migrated.
@@ -1097,6 +1075,10 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
nodemask_t nmask;
LIST_HEAD(pagelist);
int err = 0;
+ struct migration_target_control mtc = {
+ .nid = dest,
+ .gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_THISNODE,
+ };
nodes_clear(nmask);
node_set(source, nmask);
@@ -1111,8 +1093,8 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
flags | MPOL_MF_DISCONTIG_OK, &pagelist);
if (!list_empty(&pagelist)) {
- err = migrate_pages(&pagelist, alloc_new_node_page, NULL, dest,
- MIGRATE_SYNC, MR_SYSCALL);
+ err = migrate_pages(&pagelist, alloc_migration_target, NULL,
+ (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL);
if (err)
putback_movable_pages(&pagelist);
}
diff --git a/mm/migrate.c b/mm/migrate.c
index 7c4cd74..1c943b0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1590,9 +1590,13 @@ static int do_move_pages_to_node(struct mm_struct *mm,
struct list_head *pagelist, int node)
{
int err;
+ struct migration_target_control mtc = {
+ .nid = node,
+ .gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_THISNODE,
+ };
- err = migrate_pages(pagelist, alloc_new_node_page, NULL, node,
- MIGRATE_SYNC, MR_SYSCALL);
+ err = migrate_pages(pagelist, alloc_migration_target, NULL,
+ (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL);
if (err)
putback_movable_pages(pagelist);
return err;
--
2.7.4
On 6/22/20 11:13 PM, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> There is no difference between two migration callback functions,
> alloc_huge_page_node() and alloc_huge_page_nodemask(), except
> __GFP_THISNODE handling. This patch adds an argument, gfp_mask, on
> alloc_huge_page_nodemask() and replace the callsite for
> alloc_huge_page_node() with the call to
> alloc_huge_page_nodemask(..., __GFP_THISNODE).
>
> It's safe to remove a node id check in alloc_huge_page_node() since
> there is no caller passing NUMA_NO_NODE as a node id.
>
> Signed-off-by: Joonsoo Kim <[email protected]>
Thanks for consolidating these.
Reviewed-by: Mike Kravetz <[email protected]>
--
Mike Kravetz
On Tue 23-06-20 15:13:43, Joonsoo Kim wrote:
> From: Joonsoo Kim <[email protected]>
>
> There is no difference between two migration callback functions,
> alloc_huge_page_node() and alloc_huge_page_nodemask(), except
> __GFP_THISNODE handling. This patch adds an argument, gfp_mask, on
> alloc_huge_page_nodemask() and replace the callsite for
> alloc_huge_page_node() with the call to
> alloc_huge_page_nodemask(..., __GFP_THISNODE).
>
> It's safe to remove a node id check in alloc_huge_page_node() since
> there is no caller passing NUMA_NO_NODE as a node id.
Yes, this is indeed safe. alloc_huge_page_node() used to be called from
another internal hugetlb allocation layer that allowed NUMA_NO_NODE as
well. Now it is called only from the mempolicy migration callback, and
that always specifies a node and wants to stick with that node.
But I have to say I really dislike the gfp semantics because they are
different from any other allocation function I can think of: the mask
specifies what is to be added rather than what should be used.
Removing the function is ok, but please use the full gfp mask instead,
or, if that is impractical for some reason (which shouldn't be the case
as htlb_alloc_mask() should be trivial to make static inline), make it
explicit that this is not a gfp_mask but a gfp modifier and explicitly
state which modifiers are allowed.
> Signed-off-by: Joonsoo Kim <[email protected]>
> ---
> include/linux/hugetlb.h | 11 +++--------
> mm/hugetlb.c | 26 +++-----------------------
> mm/mempolicy.c | 9 +++++----
> mm/migrate.c | 5 +++--
> 4 files changed, 14 insertions(+), 37 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 50650d0..8a8b755 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -504,9 +504,8 @@ struct huge_bootmem_page {
>
> struct page *alloc_huge_page(struct vm_area_struct *vma,
> unsigned long addr, int avoid_reserve);
> -struct page *alloc_huge_page_node(struct hstate *h, int nid);
> struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
> - nodemask_t *nmask);
> + nodemask_t *nmask, gfp_t gfp_mask);
> struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
> unsigned long address);
> struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
> @@ -759,13 +758,9 @@ static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
> return NULL;
> }
>
> -static inline struct page *alloc_huge_page_node(struct hstate *h, int nid)
> -{
> - return NULL;
> -}
> -
> static inline struct page *
> -alloc_huge_page_nodemask(struct hstate *h, int preferred_nid, nodemask_t *nmask)
> +alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
> + nodemask_t *nmask, gfp_t gfp_mask)
> {
> return NULL;
> }
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index d54bb7e..bd408f2 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1979,30 +1979,10 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
> }
>
> /* page migration callback function */
> -struct page *alloc_huge_page_node(struct hstate *h, int nid)
> -{
> - gfp_t gfp_mask = htlb_alloc_mask(h);
> - struct page *page = NULL;
> -
> - if (nid != NUMA_NO_NODE)
> - gfp_mask |= __GFP_THISNODE;
> -
> - spin_lock(&hugetlb_lock);
> - if (h->free_huge_pages - h->resv_huge_pages > 0)
> - page = dequeue_huge_page_nodemask(h, gfp_mask, nid, NULL);
> - spin_unlock(&hugetlb_lock);
> -
> - if (!page)
> - page = alloc_migrate_huge_page(h, gfp_mask, nid, NULL);
> -
> - return page;
> -}
> -
> -/* page migration callback function */
> struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
> - nodemask_t *nmask)
> + nodemask_t *nmask, gfp_t gfp_mask)
> {
> - gfp_t gfp_mask = htlb_alloc_mask(h);
> + gfp_mask |= htlb_alloc_mask(h);
>
> spin_lock(&hugetlb_lock);
> if (h->free_huge_pages - h->resv_huge_pages > 0) {
> @@ -2031,7 +2011,7 @@ struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
>
> gfp_mask = htlb_alloc_mask(h);
> node = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
> - page = alloc_huge_page_nodemask(h, node, nodemask);
> + page = alloc_huge_page_nodemask(h, node, nodemask, 0);
> mpol_cond_put(mpol);
>
> return page;
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index b9e85d4..f21cff5 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1068,10 +1068,11 @@ static int migrate_page_add(struct page *page, struct list_head *pagelist,
> /* page allocation callback for NUMA node migration */
> struct page *alloc_new_node_page(struct page *page, unsigned long node)
> {
> - if (PageHuge(page))
> - return alloc_huge_page_node(page_hstate(compound_head(page)),
> - node);
> - else if (PageTransHuge(page)) {
> + if (PageHuge(page)) {
> + return alloc_huge_page_nodemask(
> + page_hstate(compound_head(page)), node,
> + NULL, __GFP_THISNODE);
> + } else if (PageTransHuge(page)) {
> struct page *thp;
>
> thp = alloc_pages_node(node,
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 6b5c75b..6ca9f0c 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1543,10 +1543,11 @@ struct page *new_page_nodemask(struct page *page,
> unsigned int order = 0;
> struct page *new_page = NULL;
>
> - if (PageHuge(page))
> + if (PageHuge(page)) {
> return alloc_huge_page_nodemask(
> page_hstate(compound_head(page)),
> - preferred_nid, nodemask);
> + preferred_nid, nodemask, 0);
> + }
>
> if (PageTransHuge(page)) {
> gfp_mask |= GFP_TRANSHUGE;
> --
> 2.7.4
>
--
Michal Hocko
SUSE Labs
On Tue 23-06-20 15:13:44, Joonsoo Kim wrote:
> From: Joonsoo Kim <[email protected]>
>
> new_non_cma_page() in gup.c which try to allocate migration target page
> requires to allocate the new page that is not on the CMA area.
> new_non_cma_page() implements it by removing __GFP_MOVABLE flag. This way
> works well for THP page or normal page but not for hugetlb page.
Could you explain why? I mean, why can't you simply remove the __GFP_MOVABLE
flag when calling alloc_huge_page_nodemask() and check for it in the dequeue
path?
Your current calling convention doesn't allow that, but as I've said in
the reply to the previous patch this should be changed, and then it would
make this one easier as well, unless I am missing something.
--
Michal Hocko
SUSE Labs
On Tue 23-06-20 15:13:45, Joonsoo Kim wrote:
> From: Joonsoo Kim <[email protected]>
>
> There are some similar functions for migration target allocation. Since
> there is no fundamental difference, it's better to keep just one rather
> than keeping all variants. This patch implements base migration target
> allocation function. In the following patches, variants will be converted
> to use this function.
>
> Note that PageHighmem() call in previous function is changed to open-code
> "is_highmem_idx()" since it provides more readability.
I was a little bit surprised that alloc_migration_target() still uses the
private argument while it only accepts a migration_target_control structure
pointer, but then I noticed that you are using it as a real callback in a
later patch.
> Signed-off-by: Joonsoo Kim <[email protected]>
Few questions inline
[...]
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 47b8ccb..820ea5e 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1648,9 +1648,13 @@ EXPORT_SYMBOL(unpoison_memory);
>
> static struct page *new_page(struct page *p, unsigned long private)
> {
> - int nid = page_to_nid(p);
> + struct migration_target_control mtc = {
> + .nid = page_to_nid(p),
> + .nmask = &node_states[N_MEMORY],
This could be .nmask = NULL, right? alloc_migration_target doesn't
modify the node mask and NULL nodemask is always interpreted as all
available nodes.
> + .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
> + };
>
> - return new_page_nodemask(p, nid, &node_states[N_MEMORY]);
> + return alloc_migration_target(p, (unsigned long)&mtc);
> }
>
[...]
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 634f1ea..3afff59 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1536,29 +1536,34 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
> return rc;
> }
>
> -struct page *new_page_nodemask(struct page *page,
> - int preferred_nid, nodemask_t *nodemask)
> +struct page *alloc_migration_target(struct page *page, unsigned long private)
> {
> - gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL;
> + struct migration_target_control *mtc;
> + gfp_t gfp_mask;
> unsigned int order = 0;
> struct page *new_page = NULL;
> + int zidx;
> +
> + mtc = (struct migration_target_control *)private;
> + gfp_mask = mtc->gfp_mask;
>
> if (PageHuge(page)) {
> return alloc_huge_page_nodemask(
> - page_hstate(compound_head(page)),
> - preferred_nid, nodemask, 0, false);
> + page_hstate(compound_head(page)), mtc->nid,
> + mtc->nmask, gfp_mask, false);
> }
>
> if (PageTransHuge(page)) {
> + gfp_mask &= ~__GFP_RECLAIM;
What's up with this gfp_mask modification?
> gfp_mask |= GFP_TRANSHUGE;
> order = HPAGE_PMD_ORDER;
> }
> -
> - if (PageHighMem(page) || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
> + zidx = zone_idx(page_zone(page));
> + if (is_highmem_idx(zidx) || zidx == ZONE_MOVABLE)
> gfp_mask |= __GFP_HIGHMEM;
>
> new_page = __alloc_pages_nodemask(gfp_mask, order,
> - preferred_nid, nodemask);
> + mtc->nid, mtc->nmask);
>
> if (new_page && PageTransHuge(new_page))
> prep_transhuge_page(new_page);
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index aec26d9..adba031 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -309,7 +309,11 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
>
> struct page *alloc_migrate_target(struct page *page, unsigned long private)
> {
> - int nid = page_to_nid(page);
> + struct migration_target_control mtc = {
> + .nid = page_to_nid(page),
> + .nmask = &node_states[N_MEMORY],
nmask = NULL again
> + .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
> + };
>
> - return new_page_nodemask(page, nid, &node_states[N_MEMORY]);
> + return alloc_migration_target(page, (unsigned long)&mtc);
> }
> --
> 2.7.4
--
Michal Hocko
SUSE Labs
On Tue 23-06-20 15:13:46, Joonsoo Kim wrote:
> From: Joonsoo Kim <[email protected]>
>
> There is a well-defined migration target allocation callback.
> It's mostly similar with new_non_cma_page() except considering CMA pages.
>
> This patch adds a CMA consideration to the standard migration target
> allocation callback and use it on gup.c.
We can already express that with a missing __GFP_MOVABLE, so I would rather
not introduce a duplication in the form of another flag.
--
Michal Hocko
SUSE Labs
On Tue 23-06-20 15:13:47, Joonsoo Kim wrote:
> From: Joonsoo Kim <[email protected]>
>
> There is a well-defined migration target allocation callback.
> Use it.
>
> Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Michal Hocko <[email protected]>
> ---
> mm/internal.h | 1 -
> mm/mempolicy.c | 30 ++++++------------------------
> mm/migrate.c | 8 ++++++--
> 3 files changed, 12 insertions(+), 27 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index fb7f7fe..4f9f6b6 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -613,7 +613,6 @@ static inline bool is_migrate_highatomic_page(struct page *page)
> }
>
> void setup_zone_pageset(struct zone *zone);
> -extern struct page *alloc_new_node_page(struct page *page, unsigned long node);
>
> struct migration_target_control {
> int nid; /* preferred node id */
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index a3abf64..85a3f21 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1065,28 +1065,6 @@ static int migrate_page_add(struct page *page, struct list_head *pagelist,
> return 0;
> }
>
> -/* page allocation callback for NUMA node migration */
> -struct page *alloc_new_node_page(struct page *page, unsigned long node)
> -{
> - if (PageHuge(page)) {
> - return alloc_huge_page_nodemask(
> - page_hstate(compound_head(page)), node,
> - NULL, __GFP_THISNODE, false);
> - } else if (PageTransHuge(page)) {
> - struct page *thp;
> -
> - thp = alloc_pages_node(node,
> - (GFP_TRANSHUGE | __GFP_THISNODE),
> - HPAGE_PMD_ORDER);
> - if (!thp)
> - return NULL;
> - prep_transhuge_page(thp);
> - return thp;
> - } else
> - return __alloc_pages_node(node, GFP_HIGHUSER_MOVABLE |
> - __GFP_THISNODE, 0);
> -}
> -
> /*
> * Migrate pages from one node to a target node.
> * Returns error or the number of pages not migrated.
> @@ -1097,6 +1075,10 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
> nodemask_t nmask;
> LIST_HEAD(pagelist);
> int err = 0;
> + struct migration_target_control mtc = {
> + .nid = dest,
> + .gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_THISNODE,
> + };
>
> nodes_clear(nmask);
> node_set(source, nmask);
> @@ -1111,8 +1093,8 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
> flags | MPOL_MF_DISCONTIG_OK, &pagelist);
>
> if (!list_empty(&pagelist)) {
> - err = migrate_pages(&pagelist, alloc_new_node_page, NULL, dest,
> - MIGRATE_SYNC, MR_SYSCALL);
> + err = migrate_pages(&pagelist, alloc_migration_target, NULL,
> + (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL);
> if (err)
> putback_movable_pages(&pagelist);
> }
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 7c4cd74..1c943b0 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1590,9 +1590,13 @@ static int do_move_pages_to_node(struct mm_struct *mm,
> struct list_head *pagelist, int node)
> {
> int err;
> + struct migration_target_control mtc = {
> + .nid = node,
> + .gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_THISNODE,
> + };
>
> - err = migrate_pages(pagelist, alloc_new_node_page, NULL, node,
> - MIGRATE_SYNC, MR_SYSCALL);
> + err = migrate_pages(pagelist, alloc_migration_target, NULL,
> + (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL);
> if (err)
> putback_movable_pages(pagelist);
> return err;
> --
> 2.7.4
--
Michal Hocko
SUSE Labs
On Tue 23-06-20 15:13:48, Joonsoo Kim wrote:
> From: Joonsoo Kim <[email protected]>
>
> There is a well-defined standard migration target callback.
> Use it directly.
>
> Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Michal Hocko <[email protected]>
> ---
> mm/page_alloc.c | 9 +++++++--
> mm/page_isolation.c | 11 -----------
> 2 files changed, 7 insertions(+), 13 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9808339..884dfb5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8359,6 +8359,11 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
> unsigned long pfn = start;
> unsigned int tries = 0;
> int ret = 0;
> + struct migration_target_control mtc = {
> + .nid = zone_to_nid(cc->zone),
> + .nmask = &node_states[N_MEMORY],
> + .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
> + };
>
> migrate_prep();
>
> @@ -8385,8 +8390,8 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
> &cc->migratepages);
> cc->nr_migratepages -= nr_reclaimed;
>
> - ret = migrate_pages(&cc->migratepages, alloc_migrate_target,
> - NULL, 0, cc->mode, MR_CONTIG_RANGE);
> + ret = migrate_pages(&cc->migratepages, alloc_migration_target,
> + NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE);
> }
> if (ret < 0) {
> putback_movable_pages(&cc->migratepages);
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index adba031..242c031 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -306,14 +306,3 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
>
> return pfn < end_pfn ? -EBUSY : 0;
> }
> -
> -struct page *alloc_migrate_target(struct page *page, unsigned long private)
> -{
> - struct migration_target_control mtc = {
> - .nid = page_to_nid(page),
> - .nmask = &node_states[N_MEMORY],
> - .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
> - };
> -
> - return alloc_migration_target(page, (unsigned long)&mtc);
> -}
> --
> 2.7.4
--
Michal Hocko
SUSE Labs
On Thu, Jun 25, 2020 at 8:26 PM, Michal Hocko <[email protected]> wrote:
>
> On Tue 23-06-20 15:13:43, Joonsoo Kim wrote:
> > From: Joonsoo Kim <[email protected]>
> >
> > There is no difference between two migration callback functions,
> > alloc_huge_page_node() and alloc_huge_page_nodemask(), except
> > __GFP_THISNODE handling. This patch adds an argument, gfp_mask, on
> > alloc_huge_page_nodemask() and replace the callsite for
> > alloc_huge_page_node() with the call to
> > alloc_huge_page_nodemask(..., __GFP_THISNODE).
> >
> > It's safe to remove a node id check in alloc_huge_page_node() since
> > there is no caller passing NUMA_NO_NODE as a node id.
>
> Yes this is indeed safe. alloc_huge_page_node used to be called from
> other internal hugetlb allocation layer and that allowed NUMA_NO_NODE as
> well. Now it is called only from the mempolicy migration callback and
> that always specifies a node and want to stick with that node.
>
> But I have to say I really dislike the gfp semantic because it is
> different from any other allocation function I can think of. It
> specifies what to be added rather than what should be used.
>
> Removing the function is ok but please use the full gfp mask instead
> or if that is impractical for some reason (wich shouldn't be the case
> as htlb_alloc_mask should be trivial to make static inline) make it
> explicit that this is not a gfp_mask but a gfp modifier and explicitly
> state which modifiers are allowed.
Okay. I will try to address your concern. A concrete solution is not yet
prepared, but perhaps I will use the full gfp_mask by calling
htlb_alloc_mask() at the caller sites.
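For example, something like the following at the mempolicy call site
(untested sketch; it assumes htlb_alloc_mask() is made visible to callers,
e.g. as a static inline in hugetlb.h):

	if (PageHuge(page)) {
		struct hstate *h = page_hstate(compound_head(page));
		/* build the complete mask here instead of inside hugetlb */
		gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;

		return alloc_huge_page_nodemask(h, node, NULL, gfp_mask);
	}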
Thanks.
On Thu, Jun 25, 2020 at 8:54 PM, Michal Hocko <[email protected]> wrote:
>
> On Tue 23-06-20 15:13:44, Joonsoo Kim wrote:
> > From: Joonsoo Kim <[email protected]>
> >
> > new_non_cma_page() in gup.c which try to allocate migration target page
> > requires to allocate the new page that is not on the CMA area.
> > new_non_cma_page() implements it by removing __GFP_MOVABLE flag. This way
> > works well for THP page or normal page but not for hugetlb page.
>
> Could you explain why? I mean why cannot you simply remove __GFP_MOVABLE
> flag when calling alloc_huge_page_nodemask and check for it in dequeue
> path?
If we remove __GFP_MOVABLE when calling alloc_huge_page_nodemask(), we cannot
use pages in ZONE_MOVABLE when dequeuing.
__GFP_MOVABLE is not only the CMA selector but also a zone selector. If we
clear it, we cannot use pages in ZONE_MOVABLE even if they are not CMA pages.
For THP or normal page allocation, there is no way to avoid this weakness
without introducing another flag or argument. To me, introducing another
flag or argument for these functions looks like over-engineering, so I don't
change them and leave them as they are (removing __GFP_MOVABLE).
But, for alloc_huge_page_nodemask(), introducing a new argument doesn't seem
to be a problem since it is not a general function but just a migration
target allocation function.
If you agree with this argument, I will add more description to the patch.
Thanks.
On Thu, Jun 25, 2020 at 9:05 PM, Michal Hocko <[email protected]> wrote:
>
> On Tue 23-06-20 15:13:45, Joonsoo Kim wrote:
> > From: Joonsoo Kim <[email protected]>
> >
> > There are some similar functions for migration target allocation. Since
> > there is no fundamental difference, it's better to keep just one rather
> > than keeping all variants. This patch implements base migration target
> > allocation function. In the following patches, variants will be converted
> > to use this function.
> >
> > Note that PageHighmem() call in previous function is changed to open-code
> > "is_highmem_idx()" since it provides more readability.
>
> I was little bit surprised that alloc_migration_target still uses
> private argument while it only accepts migration_target_control
> structure pointer but then I have noticed that you are using it as a
> real callback in a later patch.
>
> > Signed-off-by: Joonsoo Kim <[email protected]>
>
> Few questions inline
> [...]
>
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 47b8ccb..820ea5e 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -1648,9 +1648,13 @@ EXPORT_SYMBOL(unpoison_memory);
> >
> > static struct page *new_page(struct page *p, unsigned long private)
> > {
> > - int nid = page_to_nid(p);
> > + struct migration_target_control mtc = {
> > + .nid = page_to_nid(p),
> > + .nmask = &node_states[N_MEMORY],
>
> This could be .namsk = NULL, right? alloc_migration_target doesn't
> modify the node mask and NULL nodemask is always interpreted as all
> available nodes.
Will do.
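So the memory-failure callback would become something like this (untested):

	static struct page *new_page(struct page *p, unsigned long private)
	{
		struct migration_target_control mtc = {
			.nid = page_to_nid(p),
			/* .nmask left NULL: all available nodes */
			.gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
		};

		return alloc_migration_target(p, (unsigned long)&mtc);
	}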
> > + .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
> > + };
> >
> > - return new_page_nodemask(p, nid, &node_states[N_MEMORY]);
> > + return alloc_migration_target(p, (unsigned long)&mtc);
> > }
> >
> [...]
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 634f1ea..3afff59 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1536,29 +1536,34 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
> > return rc;
> > }
> >
> > -struct page *new_page_nodemask(struct page *page,
> > - int preferred_nid, nodemask_t *nodemask)
> > +struct page *alloc_migration_target(struct page *page, unsigned long private)
> > {
> > - gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL;
> > + struct migration_target_control *mtc;
> > + gfp_t gfp_mask;
> > unsigned int order = 0;
> > struct page *new_page = NULL;
> > + int zidx;
> > +
> > + mtc = (struct migration_target_control *)private;
> > + gfp_mask = mtc->gfp_mask;
> >
> > if (PageHuge(page)) {
> > return alloc_huge_page_nodemask(
> > - page_hstate(compound_head(page)),
> > - preferred_nid, nodemask, 0, false);
> > + page_hstate(compound_head(page)), mtc->nid,
> > + mtc->nmask, gfp_mask, false);
> > }
> >
> > if (PageTransHuge(page)) {
> > + gfp_mask &= ~__GFP_RECLAIM;
>
> What's up with this gfp_mask modification?
THP page allocation uses the standard gfp masks GFP_TRANSHUGE_LIGHT and
GFP_TRANSHUGE. The __GFP_RECLAIM flags are a big part of this standard mask
design, so I clear them here so as not to disrupt the THP gfp mask.
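For reference, the intended effect, based on the definitions in
include/linux/gfp.h (paraphrased from memory):

	/*
	 * __GFP_RECLAIM       == __GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM
	 * GFP_TRANSHUGE_LIGHT == (GFP_HIGHUSER_MOVABLE | __GFP_COMP |
	 *                         __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM
	 * GFP_TRANSHUGE       == GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM
	 */
	gfp_mask &= ~__GFP_RECLAIM;	/* drop the caller's reclaim bits */
	gfp_mask |= GFP_TRANSHUGE;	/* keep exactly what GFP_TRANSHUGE defines */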
> > gfp_mask |= GFP_TRANSHUGE;
> > order = HPAGE_PMD_ORDER;
> > }
> > -
> > - if (PageHighMem(page) || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
> > + zidx = zone_idx(page_zone(page));
> > + if (is_highmem_idx(zidx) || zidx == ZONE_MOVABLE)
> > gfp_mask |= __GFP_HIGHMEM;
> >
> > new_page = __alloc_pages_nodemask(gfp_mask, order,
> > - preferred_nid, nodemask);
> > + mtc->nid, mtc->nmask);
> >
> > if (new_page && PageTransHuge(new_page))
> > prep_transhuge_page(new_page);
> > diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> > index aec26d9..adba031 100644
> > --- a/mm/page_isolation.c
> > +++ b/mm/page_isolation.c
> > @@ -309,7 +309,11 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
> >
> > struct page *alloc_migrate_target(struct page *page, unsigned long private)
> > {
> > - int nid = page_to_nid(page);
> > + struct migration_target_control mtc = {
> > + .nid = page_to_nid(page),
> > + .nmask = &node_states[N_MEMORY],
>
> nmask = NULL again
Okay.
Thanks.
On Thu, Jun 25, 2020 at 9:08 PM, Michal Hocko <[email protected]> wrote:
>
> On Tue 23-06-20 15:13:46, Joonsoo Kim wrote:
> > From: Joonsoo Kim <[email protected]>
> >
> > There is a well-defined migration target allocation callback.
> > It's mostly similar with new_non_cma_page() except considering CMA pages.
> >
> > This patch adds a CMA consideration to the standard migration target
> > allocation callback and use it on gup.c.
>
> We already can express that by a missing __GFP_MOVABLE so I would rather
> not introduce a duplication in form of another flag.
I replied to this question in a previous email.
Thanks.
On Fri 26-06-20 13:49:15, Joonsoo Kim wrote:
> On Thu, Jun 25, 2020 at 8:54 PM, Michal Hocko <[email protected]> wrote:
> >
> > On Tue 23-06-20 15:13:44, Joonsoo Kim wrote:
> > > From: Joonsoo Kim <[email protected]>
> > >
> > > new_non_cma_page() in gup.c which try to allocate migration target page
> > > requires to allocate the new page that is not on the CMA area.
> > > new_non_cma_page() implements it by removing __GFP_MOVABLE flag. This way
> > > works well for THP page or normal page but not for hugetlb page.
> >
> > Could you explain why? I mean why cannot you simply remove __GFP_MOVABLE
> > flag when calling alloc_huge_page_nodemask and check for it in dequeue
> > path?
>
> If we remove __GFP_MOVABLE when calling alloc_huge_page_nodemask, we cannot
> use the page in ZONE_MOVABLE on dequeing.
>
> __GFP_MOVABLE is not only used for CMA selector but also used for zone
> selector. If we clear it, we cannot use the page in the ZONE_MOVABLE
> even if it's not CMA pages. For THP page or normal page allocation,
> there is no way to avoid this weakness without introducing another
> flag or argument. For me, introducing another flag or argument for
> these functions looks over-engineering so I don't change them and
> leave them as they are (removing __GFP_MOVABLE).
>
> But, for alloc_huge_page_nodemask(), introducing a new argument
> doesn't seem to be a problem since it is not a general function but
> just a migration target allocation function.
I really do not see why hugetlb, and only the dequeuing part, should be
special. This just leads to confusion. From the code point of view it
makes perfect sense to opt out of CMA regions for !__GFP_MOVABLE when
dequeuing. So I would rather see consistent behavior than a special
case deep in the hugetlb allocator layer.
--
Michal Hocko
SUSE Labs
On Fri 26-06-20 14:02:49, Joonsoo Kim wrote:
> On Thu, Jun 25, 2020 at 9:05 PM Michal Hocko <[email protected]> wrote:
> >
> > On Tue 23-06-20 15:13:45, Joonsoo Kim wrote:
[...]
> > > -struct page *new_page_nodemask(struct page *page,
> > > - int preferred_nid, nodemask_t *nodemask)
> > > +struct page *alloc_migration_target(struct page *page, unsigned long private)
> > > {
> > > - gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL;
> > > + struct migration_target_control *mtc;
> > > + gfp_t gfp_mask;
> > > unsigned int order = 0;
> > > struct page *new_page = NULL;
> > > + int zidx;
> > > +
> > > + mtc = (struct migration_target_control *)private;
> > > + gfp_mask = mtc->gfp_mask;
> > >
> > > if (PageHuge(page)) {
> > > return alloc_huge_page_nodemask(
> > > - page_hstate(compound_head(page)),
> > > - preferred_nid, nodemask, 0, false);
> > > + page_hstate(compound_head(page)), mtc->nid,
> > > + mtc->nmask, gfp_mask, false);
> > > }
> > >
> > > if (PageTransHuge(page)) {
> > > + gfp_mask &= ~__GFP_RECLAIM;
> >
> > What's up with this gfp_mask modification?
>
> THP page allocation uses a standard gfp masks, GFP_TRANSHUGE_LIGHT and
> GFP_TRANHUGE. __GFP_RECLAIM flags is a big part of this standard mask design.
> So, I clear it here so as not to disrupt the THP gfp mask.
Why wasn't this really needed before? I guess I must be missing
something here. This patch should be a mostly mechanical convergence of
existing migration callbacks but this change adds a new behavior AFAICS.
It would effectively drop __GFP_RETRY_MAYFAIL and __GFP_KSWAPD_RECLAIM
from the mask so the allocation would be "lighter". If that is your
intention then this should be a separate patch with an explanation
rather than hiding it in this patch.
--
Michal Hocko
SUSE Labs
On Fri, Jun 26, 2020 at 4:23 PM Michal Hocko <[email protected]> wrote:
>
> On Fri 26-06-20 13:49:15, Joonsoo Kim wrote:
> > On Thu, Jun 25, 2020 at 8:54 PM Michal Hocko <[email protected]> wrote:
> > >
> > > On Tue 23-06-20 15:13:44, Joonsoo Kim wrote:
> > > > From: Joonsoo Kim <[email protected]>
> > > >
> > > > new_non_cma_page() in gup.c which try to allocate migration target page
> > > > requires to allocate the new page that is not on the CMA area.
> > > > new_non_cma_page() implements it by removing __GFP_MOVABLE flag. This way
> > > > works well for THP page or normal page but not for hugetlb page.
> > >
> > > Could you explain why? I mean why cannot you simply remove __GFP_MOVABLE
> > > flag when calling alloc_huge_page_nodemask and check for it in dequeue
> > > path?
> >
> > If we remove __GFP_MOVABLE when calling alloc_huge_page_nodemask, we cannot
> > use the page in ZONE_MOVABLE on dequeing.
> >
> > __GFP_MOVABLE is not only used for CMA selector but also used for zone
> > selector. If we clear it, we cannot use the page in the ZONE_MOVABLE
> > even if it's not CMA pages. For THP page or normal page allocation,
> > there is no way to avoid this weakness without introducing another
> > flag or argument. For me, introducing another flag or argument for
> > these functions looks over-engineering so I don't change them and
> > leave them as they are (removing __GFP_MOVABLE).
> >
> > But, for alloc_huge_page_nodemask(), introducing a new argument
> > doesn't seem to be a problem since it is not a general function but
> > just a migration target allocation function.
>
> I really do not see why hugetlb and only the dequeing part should be
> special. This just leads to a confusion. From the code point of view it
> makes perfect sense to opt out CMA regions for !__GFP_MOVABLE when
> dequeing. So I would rather see a consistent behavior than a special
> case deep in the hugetlb allocator layer.
It seems that there is a misunderstanding. It's possible to opt out of CMA
regions for !__GFP_MOVABLE when dequeuing, and that is reasonable. But, for
!__GFP_MOVABLE, we don't search for a hugetlb page in ZONE_MOVABLE when
dequeuing, since the dequeuing zone is limited by gfp_zone(gfp_mask). A
solution that introduces a new argument doesn't cause this problem while
still avoiding CMA regions.
Thanks.
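(For reference, a pared-down sketch of the dequeue walk being discussed,
based on dequeue_huge_page_nodemask() as it exists around this series;
illustrative only, with the per-zone details trimmed:)

static struct page *dequeue_huge_page_nodemask(struct hstate *h, gfp_t gfp_mask,
					       int nid, nodemask_t *nmask)
{
	struct zonelist *zonelist = node_zonelist(nid, gfp_mask);
	struct zoneref *z;
	struct zone *zone;

	/*
	 * gfp_zone(gfp_mask) caps the highest zone that is walked.  Without
	 * __GFP_MOVABLE it resolves to a zone below ZONE_MOVABLE, so free
	 * hugetlb pages sitting in ZONE_MOVABLE are never considered, even
	 * when they are not CMA pages at all.
	 */
	for_each_zone_zonelist_nodemask(zone, z, zonelist,
					gfp_zone(gfp_mask), nmask) {
		/* ... pick a free huge page from this zone's free list ... */
	}
	return NULL;
}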
On Mon 29-06-20 15:27:25, Joonsoo Kim wrote:
[...]
> Solution that Introduces a new
> argument doesn't cause this problem while avoiding CMA regions.
My primary argument is that there is no real reason to treat hugetlb
dequeuing somehow differently. So if we simply exclude __GFP_MOVABLE for
_any_ other allocation then this certainly has some drawbacks on the
usable memory for the migration target and it can lead to allocation
failures (especially on movable_node setups where the amount of movable
memory might be really high) and therefore longterm gup failures. And
yes those failures might be premature. But my point is that the behavior
would be _consistent_. So a user wouldn't see random failures for some
types of pages while success for others.
Let's have a look at this patch. It simply works around the restriction
for a very limited type of pages - only hugetlb pages which have
reserves in non-cma movable pools. I would claim that many setups will
simply not have many (if any) spare hugetlb pages in the pool except for
temporary time periods when a workload is (re)starting, because this
would effectively be wasted memory.
The patch is adding a special case flag to claim what the code already
does via the memalloc_nocma_{save,restore} API, so the information is
already there. Sorry I didn't bring this up earlier but I had completely
forgotten about its existence. With that one in place I do agree that
dequeuing needs a fixup, but that should be something like the following
instead.
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 57ece74e3aae..c1595b1d36f3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1092,10 +1092,14 @@ static struct page *dequeue_huge_page_nodemask(struct hstate *h, gfp_t gfp_mask,
/* Movability of hugepages depends on migration support. */
static inline gfp_t htlb_alloc_mask(struct hstate *h)
{
+ gfp_t gfp;
+
if (hugepage_movable_supported(h))
- return GFP_HIGHUSER_MOVABLE;
+ gfp = GFP_HIGHUSER_MOVABLE;
else
- return GFP_HIGHUSER;
+ gfp = GFP_HIGHUSER;
+
+ return current_gfp_context(gfp);
}
static struct page *dequeue_huge_page_vma(struct hstate *h,
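(For reference, a minimal sketch of how the scoped API mentioned above is
already used on the gup long-term pinning path; the call and its signature
are paraphrased from memory of the 5.8-era code, so treat the details as an
assumption rather than an exact quote:)

	unsigned int flags = 0;
	long rc;

	/* mark the task so allocations in this scope should avoid CMA pageblocks */
	if (gup_flags & FOLL_LONGTERM)
		flags = memalloc_nocma_save();

	rc = __get_user_pages_locked(tsk, mm, start, nr_pages, pages,
				     vmas, NULL, gup_flags);

	if (gup_flags & FOLL_LONGTERM)
		memalloc_nocma_restore(flags);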
If we even fix this general issue for other allocations and allow a
better CMA exclusion then it would be implemented consistently for
everybody.
Does this make more sense to you, or are we still not on the same page
wrt the actual problem?
--
Michal Hocko
SUSE Labs
On Mon 29-06-20 15:41:37, Joonsoo Kim wrote:
> On Fri, Jun 26, 2020 at 4:33 PM Michal Hocko <[email protected]> wrote:
> >
> > On Fri 26-06-20 14:02:49, Joonsoo Kim wrote:
> > > On Thu, Jun 25, 2020 at 9:05 PM Michal Hocko <[email protected]> wrote:
> > > >
> > > > On Tue 23-06-20 15:13:45, Joonsoo Kim wrote:
> > [...]
> > > > > -struct page *new_page_nodemask(struct page *page,
> > > > > - int preferred_nid, nodemask_t *nodemask)
> > > > > +struct page *alloc_migration_target(struct page *page, unsigned long private)
> > > > > {
> > > > > - gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL;
> > > > > + struct migration_target_control *mtc;
> > > > > + gfp_t gfp_mask;
> > > > > unsigned int order = 0;
> > > > > struct page *new_page = NULL;
> > > > > + int zidx;
> > > > > +
> > > > > + mtc = (struct migration_target_control *)private;
> > > > > + gfp_mask = mtc->gfp_mask;
> > > > >
> > > > > if (PageHuge(page)) {
> > > > > return alloc_huge_page_nodemask(
> > > > > - page_hstate(compound_head(page)),
> > > > > - preferred_nid, nodemask, 0, false);
> > > > > + page_hstate(compound_head(page)), mtc->nid,
> > > > > + mtc->nmask, gfp_mask, false);
> > > > > }
> > > > >
> > > > > if (PageTransHuge(page)) {
> > > > > + gfp_mask &= ~__GFP_RECLAIM;
> > > >
> > > > What's up with this gfp_mask modification?
> > >
> > > THP page allocation uses a standard gfp masks, GFP_TRANSHUGE_LIGHT and
> > > GFP_TRANHUGE. __GFP_RECLAIM flags is a big part of this standard mask design.
> > > So, I clear it here so as not to disrupt the THP gfp mask.
> >
> > Why this wasn't really needed before? I guess I must be missing
> > something here. This patch should be mostly mechanical convergence of
> > existing migration callbacks but this change adds a new behavior AFAICS.
>
> Before this patch, a user cannot specify a gfp_mask and THP allocation
> uses GFP_TRANSHUGE
> statically.
Unless I am misreading, there are code paths (e.g. new_page_nodemask) which
simply add GFP_TRANSHUGE to GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL.
And this goes all the way back to the introduction of thp migration.
> After this patch, a user can specify a gfp_mask and it
> could conflict with GFP_TRANSHUGE.
> This code tries to avoid this conflict.
>
> > It would effectively drop __GFP_RETRY_MAYFAIL and __GFP_KSWAPD_RECLAIM
>
> __GFP_RETRY_MAYFAIL isn't dropped. __GFP_RECLAIM is
> "___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM".
> So, __GFP_KSWAPD_RECLAIM would be dropped for THP allocation.
> IIUC, THP allocation doesn't use __GFP_KSWAPD_RECLAIM since it's
> overhead is too large and this overhead should be given to the caller
> rather than system thread (kswapd) and so on.
Yes, there is a reason why KSWAPD is excluded from THP allocations in
the page fault path. Maybe we want to extend that behavior to the
migration as well. I do not have a strong opinion on that because I
haven't seen excessive kswapd reclaim due to THP migrations. They are
likely too rare.
But as I've said in my previous email, make this a separate patch with
an explanation of why we want this.
--
Michal Hocko
SUSE Labs
On Fri, Jun 26, 2020 at 4:33 PM Michal Hocko <[email protected]> wrote:
>
> On Fri 26-06-20 14:02:49, Joonsoo Kim wrote:
> > On Thu, Jun 25, 2020 at 9:05 PM Michal Hocko <[email protected]> wrote:
> > >
> > > On Tue 23-06-20 15:13:45, Joonsoo Kim wrote:
> [...]
> > > > -struct page *new_page_nodemask(struct page *page,
> > > > - int preferred_nid, nodemask_t *nodemask)
> > > > +struct page *alloc_migration_target(struct page *page, unsigned long private)
> > > > {
> > > > - gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL;
> > > > + struct migration_target_control *mtc;
> > > > + gfp_t gfp_mask;
> > > > unsigned int order = 0;
> > > > struct page *new_page = NULL;
> > > > + int zidx;
> > > > +
> > > > + mtc = (struct migration_target_control *)private;
> > > > + gfp_mask = mtc->gfp_mask;
> > > >
> > > > if (PageHuge(page)) {
> > > > return alloc_huge_page_nodemask(
> > > > - page_hstate(compound_head(page)),
> > > > - preferred_nid, nodemask, 0, false);
> > > > + page_hstate(compound_head(page)), mtc->nid,
> > > > + mtc->nmask, gfp_mask, false);
> > > > }
> > > >
> > > > if (PageTransHuge(page)) {
> > > > + gfp_mask &= ~__GFP_RECLAIM;
> > >
> > > What's up with this gfp_mask modification?
> >
> > THP page allocation uses a standard gfp masks, GFP_TRANSHUGE_LIGHT and
> > GFP_TRANHUGE. __GFP_RECLAIM flags is a big part of this standard mask design.
> > So, I clear it here so as not to disrupt the THP gfp mask.
>
> Why this wasn't really needed before? I guess I must be missing
> something here. This patch should be mostly mechanical convergence of
> existing migration callbacks but this change adds a new behavior AFAICS.
Before this patch, a user cannot specify a gfp_mask and THP allocation uses
GFP_TRANSHUGE statically. After this patch, a user can specify a gfp_mask and
it could conflict with GFP_TRANSHUGE. This code tries to avoid this conflict.
> It would effectively drop __GFP_RETRY_MAYFAIL and __GFP_KSWAPD_RECLAIM
__GFP_RETRY_MAYFAIL isn't dropped. __GFP_RECLAIM is
"___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM", so __GFP_KSWAPD_RECLAIM would
be dropped for THP allocation. IIUC, THP allocation doesn't use
__GFP_KSWAPD_RECLAIM since its overhead is too large and that overhead should
be given to the caller rather than to a system thread (kswapd) and so on.
Thanks.
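(To make the mask arithmetic above concrete, a short annotated sketch; the
flag compositions are paraphrased from include/linux/gfp.h of that era, so
treat the exact definitions as assumptions:)

	/*
	 * Callers in this series start from:
	 *     gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL;
	 * GFP_USER contains __GFP_RECLAIM, i.e. both __GFP_DIRECT_RECLAIM and
	 * __GFP_KSWAPD_RECLAIM.  GFP_TRANSHUGE is GFP_TRANSHUGE_LIGHT plus
	 * __GFP_DIRECT_RECLAIM and deliberately leaves __GFP_KSWAPD_RECLAIM out.
	 */
	gfp_mask &= ~__GFP_RECLAIM;	/* drop both reclaim bits ...           */
	gfp_mask |= GFP_TRANSHUGE;	/* ... and add back only direct reclaim */
	/* net effect: only __GFP_KSWAPD_RECLAIM is removed for the THP case */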
On Mon, Jun 29, 2020 at 4:55 PM Michal Hocko <[email protected]> wrote:
>
> On Mon 29-06-20 15:27:25, Joonsoo Kim wrote:
> [...]
> > Solution that Introduces a new
> > argument doesn't cause this problem while avoiding CMA regions.
>
> My primary argument is that there is no real reason to treat hugetlb
> dequeing somehow differently. So if we simply exclude __GFP_MOVABLE for
> _any_ other allocation then this certainly has some drawbacks on the
> usable memory for the migration target and it can lead to allocation
> failures (especially on movable_node setups where the amount of movable
> memory might be really high) and therefore longterm gup failures. And
> yes those failures might be premature. But my point is that the behavior
> would be _consistent_. So a user wouldn't see random failures for some
> types of pages while a success for others.
Hmm... I don't agree with your argument. Excluding __GFP_MOVABLE is a
*work-around* to exclude CMA regions; the dequeuing implementation in this
patch is the right way to exclude them. Why use a work-around for this case?
Being consistent is important, but it's only meaningful if it is correct,
and it should not get in the way of better code. Also, dequeuing is already
a special process that is only available for hugetlb, so I think using a
different (correct) implementation there doesn't break any consistency.
> Let's have a look at this patch. It is simply working that around the
> restriction for a very limited types of pages - only hugetlb pages
> which have reserves in non-cma movable pools. I would claim that many
> setups will simply not have many (if any) spare hugetlb pages in the
> pool except for temporary time periods when a workload is (re)starting
> because this would be effectively a wasted memory.
This cannot be a blocker for making the code correct.
> The patch is adding a special case flag to claim what the code already
> does by memalloc_nocma_{save,restore} API so the information is already
> there. Sorry I didn't bring this up earlier but I have completely forgot
> about its existence. With that one in place I do agree that dequeing
> needs a fixup but that should be something like the following instead.
Thanks for letting me know. I didn't know about it until now. It looks like
it's better to use this API rather than introducing a new argument.
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 57ece74e3aae..c1595b1d36f3 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1092,10 +1092,14 @@ static struct page *dequeue_huge_page_nodemask(struct hstate *h, gfp_t gfp_mask,
> /* Movability of hugepages depends on migration support. */
> static inline gfp_t htlb_alloc_mask(struct hstate *h)
> {
> + gfp_t gfp;
> +
> if (hugepage_movable_supported(h))
> - return GFP_HIGHUSER_MOVABLE;
> + gfp = GFP_HIGHUSER_MOVABLE;
> else
> - return GFP_HIGHUSER;
> + gfp = GFP_HIGHUSER;
> +
> + return current_gfp_context(gfp);
> }
>
> static struct page *dequeue_huge_page_vma(struct hstate *h,
>
> If we even fix this general issue for other allocations and allow a
> better CMA exclusion then it would be implemented consistently for
> everybody.
Yes, I have reviewed the memalloc_nocma_{} APIs and found them to be the
better way for CMA exclusion. I will do it after this patch is finished.
> Does this make more sense to you are we still not on the same page wrt
> to the actual problem?
Yes, but we have different opinions about it. As said above, I will make
a patch for better CMA exclusion after this patchset; it will make the
code consistent.
I'd really appreciate it if you could wait until then.
Thanks.
On Tue 30-06-20 15:30:04, Joonsoo Kim wrote:
> On Mon, Jun 29, 2020 at 4:55 PM Michal Hocko <[email protected]> wrote:
[...]
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 57ece74e3aae..c1595b1d36f3 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1092,10 +1092,14 @@ static struct page *dequeue_huge_page_nodemask(struct hstate *h, gfp_t gfp_mask,
> > /* Movability of hugepages depends on migration support. */
> > static inline gfp_t htlb_alloc_mask(struct hstate *h)
> > {
> > + gfp_t gfp;
> > +
> > if (hugepage_movable_supported(h))
> > - return GFP_HIGHUSER_MOVABLE;
> > + gfp = GFP_HIGHUSER_MOVABLE;
> > else
> > - return GFP_HIGHUSER;
> > + gfp = GFP_HIGHUSER;
> > +
> > + return current_gfp_context(gfp);
> > }
> >
> > static struct page *dequeue_huge_page_vma(struct hstate *h,
> >
> > If we even fix this general issue for other allocations and allow a
> > better CMA exclusion then it would be implemented consistently for
> > everybody.
>
> Yes, I have reviewed the memalloc_nocma_{} APIs and found the better way
> for CMA exclusion. I will do it after this patch is finished.
>
> > Does this make more sense to you are we still not on the same page wrt
> > to the actual problem?
>
> Yes, but we have different opinions about it. As said above, I will make
> a patch for better CMA exclusion after this patchset. It will make
> code consistent.
> I'd really appreciate it if you wait until then.
As I've said I would _prefer_ simplicity over "correctness" if it is only
partial and hard to reason about from the userspace experience but this
is not something I would _insist_ on. If Mike as a maintainer of the
code is ok with that then I will not stand in the way.
But please note that a missing current_gfp_context inside
htlb_alloc_mask is a subtle bug. I do not think it matters right now but
with a growing use of scoped apis this might actually hit some day so I
believe we want to have it in place.
Thanks!
--
Michal Hocko
SUSE Labs
On Mon, Jun 29, 2020 at 5:03 PM Michal Hocko <[email protected]> wrote:
>
> On Mon 29-06-20 15:41:37, Joonsoo Kim wrote:
> > On Fri, Jun 26, 2020 at 4:33 PM Michal Hocko <[email protected]> wrote:
> > >
> > > On Fri 26-06-20 14:02:49, Joonsoo Kim wrote:
> > > > On Thu, Jun 25, 2020 at 9:05 PM Michal Hocko <[email protected]> wrote:
> > > > >
> > > > > On Tue 23-06-20 15:13:45, Joonsoo Kim wrote:
> > > [...]
> > > > > > -struct page *new_page_nodemask(struct page *page,
> > > > > > - int preferred_nid, nodemask_t *nodemask)
> > > > > > +struct page *alloc_migration_target(struct page *page, unsigned long private)
> > > > > > {
> > > > > > - gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL;
> > > > > > + struct migration_target_control *mtc;
> > > > > > + gfp_t gfp_mask;
> > > > > > unsigned int order = 0;
> > > > > > struct page *new_page = NULL;
> > > > > > + int zidx;
> > > > > > +
> > > > > > + mtc = (struct migration_target_control *)private;
> > > > > > + gfp_mask = mtc->gfp_mask;
> > > > > >
> > > > > > if (PageHuge(page)) {
> > > > > > return alloc_huge_page_nodemask(
> > > > > > - page_hstate(compound_head(page)),
> > > > > > - preferred_nid, nodemask, 0, false);
> > > > > > + page_hstate(compound_head(page)), mtc->nid,
> > > > > > + mtc->nmask, gfp_mask, false);
> > > > > > }
> > > > > >
> > > > > > if (PageTransHuge(page)) {
> > > > > > + gfp_mask &= ~__GFP_RECLAIM;
> > > > >
> > > > > What's up with this gfp_mask modification?
> > > >
> > > > THP page allocation uses a standard gfp masks, GFP_TRANSHUGE_LIGHT and
> > > > GFP_TRANHUGE. __GFP_RECLAIM flags is a big part of this standard mask design.
> > > > So, I clear it here so as not to disrupt the THP gfp mask.
> > >
> > > Why this wasn't really needed before? I guess I must be missing
> > > something here. This patch should be mostly mechanical convergence of
> > > existing migration callbacks but this change adds a new behavior AFAICS.
> >
> > Before this patch, a user cannot specify a gfp_mask and THP allocation
> > uses GFP_TRANSHUGE
> > statically.
>
> Unless I am misreading there are code paths (e.g.new_page_nodemask) which simply use
> add GFP_TRANSHUGE to GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL. And
> this goes all the way to thp migration introduction.
Ahh... indeed, I missed that. There are multiple THP migration target
allocation functions; some of them use GFP_TRANSHUGE plus an extra mask and
so don't include __GFP_KSWAPD_RECLAIM, but the others include
__GFP_KSWAPD_RECLAIM due to the original GFP_USER.
Thanks for clarifying.
> > After this patch, a user can specify a gfp_mask and it
> > could conflict with GFP_TRANSHUGE.
> > This code tries to avoid this conflict.
> >
> > > It would effectively drop __GFP_RETRY_MAYFAIL and __GFP_KSWAPD_RECLAIM
> >
> > __GFP_RETRY_MAYFAIL isn't dropped. __GFP_RECLAIM is
> > "___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM".
> > So, __GFP_KSWAPD_RECLAIM would be dropped for THP allocation.
> > IIUC, THP allocation doesn't use __GFP_KSWAPD_RECLAIM since it's
> > overhead is too large and this overhead should be given to the caller
> > rather than system thread (kswapd) and so on.
>
> Yes, there is a reason why KSWAPD is excluded from THP allocations in
> the page fault path. Maybe we want to extend that behavior to the
> migration as well. I do not have a strong opinion on that because I
> haven't seen excessive kswapd reclaim due to THP migrations. They are
> likely too rare.
>
> But as I've said in my previous email. Make this a separate patch with
> an explanation why we want this.
Okay. I will make a separate patch that clears __GFP_RECLAIM from the passed
gfp_mask to extend that behavior. It will make THP migration target allocation
consistent. :)
Thanks.
On Tue, Jun 30, 2020 at 3:42 PM Michal Hocko <[email protected]> wrote:
>
> On Tue 30-06-20 15:30:04, Joonsoo Kim wrote:
> > On Mon, Jun 29, 2020 at 4:55 PM Michal Hocko <[email protected]> wrote:
> [...]
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index 57ece74e3aae..c1595b1d36f3 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -1092,10 +1092,14 @@ static struct page *dequeue_huge_page_nodemask(struct hstate *h, gfp_t gfp_mask,
> > > /* Movability of hugepages depends on migration support. */
> > > static inline gfp_t htlb_alloc_mask(struct hstate *h)
> > > {
> > > + gfp_t gfp;
> > > +
> > > if (hugepage_movable_supported(h))
> > > - return GFP_HIGHUSER_MOVABLE;
> > > + gfp = GFP_HIGHUSER_MOVABLE;
> > > else
> > > - return GFP_HIGHUSER;
> > > + gfp = GFP_HIGHUSER;
> > > +
> > > + return current_gfp_context(gfp);
> > > }
> > >
> > > static struct page *dequeue_huge_page_vma(struct hstate *h,
> > >
> > > If we even fix this general issue for other allocations and allow a
> > > better CMA exclusion then it would be implemented consistently for
> > > everybody.
> >
> > Yes, I have reviewed the memalloc_nocma_{} APIs and found the better way
> > for CMA exclusion. I will do it after this patch is finished.
> >
> > > Does this make more sense to you are we still not on the same page wrt
> > > to the actual problem?
> >
> > Yes, but we have different opinions about it. As said above, I will make
> > a patch for better CMA exclusion after this patchset. It will make
> > code consistent.
> > I'd really appreciate it if you wait until then.
>
> As I've said I would _prefer_ simplicity over "correctness" if it is only
> partial and hard to reason about from the userspace experience but this
> is not something I would _insist_ on. If Mike as a maintainer of the
> code is ok with that then I will not stand in the way.
Okay.
> But please note that a missing current_gfp_context inside
> htlb_alloc_mask is a subtle bug. I do not think it matters right now but
> with a growing use of scoped apis this might actually hit some day so I
> believe we want to have it in place.
Okay. I will keep it in mind and consider it when fixing CMA exclusion in the
other patchset.
Thanks.
On 6/30/20 12:22 AM, Joonsoo Kim wrote:
> On Tue, Jun 30, 2020 at 3:42 PM Michal Hocko <[email protected]> wrote:
>>
>> On Tue 30-06-20 15:30:04, Joonsoo Kim wrote:
>>> On Mon, Jun 29, 2020 at 4:55 PM Michal Hocko <[email protected]> wrote:
>> [...]
>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>> index 57ece74e3aae..c1595b1d36f3 100644
>>>> --- a/mm/hugetlb.c
>>>> +++ b/mm/hugetlb.c
>>>> @@ -1092,10 +1092,14 @@ static struct page *dequeue_huge_page_nodemask(struct hstate *h, gfp_t gfp_mask,
>>>> /* Movability of hugepages depends on migration support. */
>>>> static inline gfp_t htlb_alloc_mask(struct hstate *h)
>>>> {
>>>> + gfp_t gfp;
>>>> +
>>>> if (hugepage_movable_supported(h))
>>>> - return GFP_HIGHUSER_MOVABLE;
>>>> + gfp = GFP_HIGHUSER_MOVABLE;
>>>> else
>>>> - return GFP_HIGHUSER;
>>>> + gfp = GFP_HIGHUSER;
>>>> +
>>>> + return current_gfp_context(gfp);
>>>> }
>>>>
>>>> static struct page *dequeue_huge_page_vma(struct hstate *h,
>>>>
>>>> If we even fix this general issue for other allocations and allow a
>>>> better CMA exclusion then it would be implemented consistently for
>>>> everybody.
>>>
>>> Yes, I have reviewed the memalloc_nocma_{} APIs and found the better way
>>> for CMA exclusion. I will do it after this patch is finished.
>>>
>>>> Does this make more sense to you are we still not on the same page wrt
>>>> to the actual problem?
>>>
>>> Yes, but we have different opinions about it. As said above, I will make
>>> a patch for better CMA exclusion after this patchset. It will make
>>> code consistent.
>>> I'd really appreciate it if you wait until then.
>>
>> As I've said I would _prefer_ simplicity over "correctness" if it is only
>> partial and hard to reason about from the userspace experience but this
>> is not something I would _insist_ on. If Mike as a maintainer of the
>> code is ok with that then I will not stand in the way.
>
> Okay.
I was OK with Joonsoo's original patch which is why I Ack'ed it. However,
my sense of simplicity and style may not be the norm as I have spent too
much time with the hugetlbfs code. :) That is why I did not chime in and
let Michal and Joonsoo discuss. I can see both sides of the issue. For
now, I am OK to go with Joonsoo's patch as long as the issue below is
considered in the next patchset.
--
Mike Kravetz
>> But please note that a missing current_gfp_context inside
>> htlb_alloc_mask is a subtle bug. I do not think it matters right now but
>> with a growing use of scoped apis this might actually hit some day so I
>> believe we want to have it in place.
>
> Okay. I will keep in mind and consider it when fixing CMA exclusion on the
> other patchset.
>
> Thanks.
On 6/26/20 6:02 AM, Joonsoo Kim wrote:
> On Thu, Jun 25, 2020 at 8:26 PM Michal Hocko <[email protected]> wrote:
>>
>> On Tue 23-06-20 15:13:43, Joonsoo Kim wrote:
>> > From: Joonsoo Kim <[email protected]>
>> >
>> > There is no difference between two migration callback functions,
>> > alloc_huge_page_node() and alloc_huge_page_nodemask(), except
>> > __GFP_THISNODE handling. This patch adds an argument, gfp_mask, on
>> > alloc_huge_page_nodemask() and replace the callsite for
>> > alloc_huge_page_node() with the call to
>> > alloc_huge_page_nodemask(..., __GFP_THISNODE).
>> >
>> > It's safe to remove a node id check in alloc_huge_page_node() since
>> > there is no caller passing NUMA_NO_NODE as a node id.
>>
>> Yes this is indeed safe. alloc_huge_page_node used to be called from
>> other internal hugetlb allocation layer and that allowed NUMA_NO_NODE as
>> well. Now it is called only from the mempolicy migration callback and
>> that always specifies a node and want to stick with that node.
>>
>> But I have to say I really dislike the gfp semantic because it is
>> different from any other allocation function I can think of. It
>> specifies what to be added rather than what should be used.
>>
>> Removing the function is ok but please use the full gfp mask instead
>> or if that is impractical for some reason (wich shouldn't be the case
>> as htlb_alloc_mask should be trivial to make static inline) make it
>> explicit that this is not a gfp_mask but a gfp modifier and explicitly
>> state which modifiers are allowed.
>
> Okay. I will try to solve your concern. Concrete solution is not yet prepared
> but perhaps I will use full gfp_mask by using htlb_alloc_mask() in caller sites.
Yeah, that should be feasible. alloc_huge_page_vma() already does
htlb_alloc_mask(h). In alloc_new_node_page() and new_page_nodemask() it would be
consistent with the other cases handled there (THP and base).
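(As an illustration of that direction, a minimal sketch of a converted
caller, using the five-argument alloc_huge_page_nodemask() form quoted
elsewhere in this thread; the exact shape of the final patch is an
assumption:)

	if (PageHuge(page)) {
		struct hstate *h = page_hstate(compound_head(page));
		/* the caller builds the full mask instead of passing a modifier */
		gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;

		return alloc_huge_page_nodemask(h, node, NULL, gfp_mask, false);
	}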
> Thanks.
>
On Fri, Jul 3, 2020 at 1:13 AM Vlastimil Babka <[email protected]> wrote:
>
> On 6/26/20 6:02 AM, Joonsoo Kim wrote:
> > On Thu, Jun 25, 2020 at 8:26 PM Michal Hocko <[email protected]> wrote:
> >>
> >> On Tue 23-06-20 15:13:43, Joonsoo Kim wrote:
> >> > From: Joonsoo Kim <[email protected]>
> >> >
> >> > There is no difference between two migration callback functions,
> >> > alloc_huge_page_node() and alloc_huge_page_nodemask(), except
> >> > __GFP_THISNODE handling. This patch adds an argument, gfp_mask, on
> >> > alloc_huge_page_nodemask() and replace the callsite for
> >> > alloc_huge_page_node() with the call to
> >> > alloc_huge_page_nodemask(..., __GFP_THISNODE).
> >> >
> >> > It's safe to remove a node id check in alloc_huge_page_node() since
> >> > there is no caller passing NUMA_NO_NODE as a node id.
> >>
> >> Yes this is indeed safe. alloc_huge_page_node used to be called from
> >> other internal hugetlb allocation layer and that allowed NUMA_NO_NODE as
> >> well. Now it is called only from the mempolicy migration callback and
> >> that always specifies a node and want to stick with that node.
> >>
> >> But I have to say I really dislike the gfp semantic because it is
> >> different from any other allocation function I can think of. It
> >> specifies what to be added rather than what should be used.
> >>
> >> Removing the function is ok but please use the full gfp mask instead
> >> or if that is impractical for some reason (wich shouldn't be the case
> >> as htlb_alloc_mask should be trivial to make static inline) make it
> >> explicit that this is not a gfp_mask but a gfp modifier and explicitly
> >> state which modifiers are allowed.
> >
> > Okay. I will try to solve your concern. Concrete solution is not yet prepared
> > but perhaps I will use full gfp_mask by using htlb_alloc_mask() in caller sites.
>
> Yeah, that should be feasible. alloc_huge_page_vma() already does
> htlb_alloc_mask(h). In alloc_new_node_page() and new_page_nodemask() it would be
> consistent with the other cases handled there (THP and base).
Okay. Will check it.
Thanks.
On 6/23/20 8:13 AM, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> There are some similar functions for migration target allocation. Since
> there is no fundamental difference, it's better to keep just one rather
> than keeping all variants. This patch implements base migration target
> allocation function. In the following patches, variants will be converted
> to use this function.
>
> Note that PageHighmem() call in previous function is changed to open-code
> "is_highmem_idx()" since it provides more readability.
>
> Signed-off-by: Joonsoo Kim <[email protected]>
Provided that the "&= ~__GFP_RECLAIM" line is split into a separate patch as you discussed,
Acked-by: Vlastimil Babka <[email protected]>
On 6/23/20 8:13 AM, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> There is a well-defined migration target allocation callback.
> It's mostly similar with new_non_cma_page() except considering CMA pages.
>
> This patch adds a CMA consideration to the standard migration target
> allocation callback and use it on gup.c.
>
> Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
But a suggestion below.
> ---
> mm/gup.c | 57 ++++++++-------------------------------------------------
> mm/internal.h | 1 +
> mm/migrate.c | 4 +++-
> 3 files changed, 12 insertions(+), 50 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 15be281..f6124e3 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1608,56 +1608,15 @@ static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
> }
>
> #ifdef CONFIG_CMA
> -static struct page *new_non_cma_page(struct page *page, unsigned long private)
> +static struct page *alloc_migration_target_non_cma(struct page *page, unsigned long private)
> {
...
> + struct migration_target_control mtc = {
> + .nid = page_to_nid(page),
> + .gfp_mask = GFP_USER | __GFP_NOWARN,
> + .skip_cma = true,
> + };
>
> - return __alloc_pages_node(nid, gfp_mask, 0);
> + return alloc_migration_target(page, (unsigned long)&mtc);
Do we really need this wrapper? The only user is check_and_migrate_cma_pages so
just opencode it?
On 6/23/20 8:13 AM, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> There is a well-defined migration target allocation callback.
> Use it.
>
> Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
I like that this removes the wrapper completely.
On 6/23/20 8:13 AM, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> There is a well-defined standard migration target callback.
> Use it directly.
>
> Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
But you could move this to patch 5/8 to reduce churn. And do the same with
mm/memory-failure.c new_page() there really, to drop the simple wrappers. Only
new_node_page() is complex enough.
Hm wait, new_node_page() is only called by do_migrate_range() which is only
called by __offline_pages() with explicit test that all pages are from a single
zone, so the nmask could also be setup just once and not per each page, making
it possible to remove the wrapper.
But for new_page() you would have to define that mtc->nid == NUMA_NO_NODE means
alloc_migrate_target() does page_to_nid(page) by itself.
> ---
> mm/page_alloc.c | 9 +++++++--
> mm/page_isolation.c | 11 -----------
> 2 files changed, 7 insertions(+), 13 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9808339..884dfb5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8359,6 +8359,11 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
> unsigned long pfn = start;
> unsigned int tries = 0;
> int ret = 0;
> + struct migration_target_control mtc = {
> + .nid = zone_to_nid(cc->zone),
> + .nmask = &node_states[N_MEMORY],
> + .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
> + };
>
> migrate_prep();
>
> @@ -8385,8 +8390,8 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
> &cc->migratepages);
> cc->nr_migratepages -= nr_reclaimed;
>
> - ret = migrate_pages(&cc->migratepages, alloc_migrate_target,
> - NULL, 0, cc->mode, MR_CONTIG_RANGE);
> + ret = migrate_pages(&cc->migratepages, alloc_migration_target,
> + NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE);
> }
> if (ret < 0) {
> putback_movable_pages(&cc->migratepages);
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index adba031..242c031 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -306,14 +306,3 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
>
> return pfn < end_pfn ? -EBUSY : 0;
> }
> -
> -struct page *alloc_migrate_target(struct page *page, unsigned long private)
> -{
> - struct migration_target_control mtc = {
> - .nid = page_to_nid(page),
> - .nmask = &node_states[N_MEMORY],
> - .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
> - };
> -
> - return alloc_migration_target(page, (unsigned long)&mtc);
> -}
>
On Sat, Jul 4, 2020 at 12:56 AM Vlastimil Babka <[email protected]> wrote:
>
> On 6/23/20 8:13 AM, [email protected] wrote:
> > From: Joonsoo Kim <[email protected]>
> >
> > There is a well-defined migration target allocation callback.
> > It's mostly similar with new_non_cma_page() except considering CMA pages.
> >
> > This patch adds a CMA consideration to the standard migration target
> > allocation callback and use it on gup.c.
> >
> > Signed-off-by: Joonsoo Kim <[email protected]>
>
> Acked-by: Vlastimil Babka <[email protected]>
>
> But a suggestion below.
>
> > ---
> > mm/gup.c | 57 ++++++++-------------------------------------------------
> > mm/internal.h | 1 +
> > mm/migrate.c | 4 +++-
> > 3 files changed, 12 insertions(+), 50 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 15be281..f6124e3 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1608,56 +1608,15 @@ static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
> > }
> >
> > #ifdef CONFIG_CMA
> > -static struct page *new_non_cma_page(struct page *page, unsigned long private)
> > +static struct page *alloc_migration_target_non_cma(struct page *page, unsigned long private)
> > {
>
> ...
>
> > + struct migration_target_control mtc = {
> > + .nid = page_to_nid(page),
> > + .gfp_mask = GFP_USER | __GFP_NOWARN,
> > + .skip_cma = true,
> > + };
> >
> > - return __alloc_pages_node(nid, gfp_mask, 0);
> > + return alloc_migration_target(page, (unsigned long)&mtc);
>
> Do we really need this wrapper? The only user is check_and_migrate_cma_pages so
> just opencode it?
This wrapper exists for setting up a different nid for each page.
However, as you suggested in the next reply, we can remove this wrapper if
NUMA_NO_NODE handling is added to the standard function. I will add NUMA_NO_NODE
handling to the standard function and remove this wrapper.
Thanks.
On Sat, Jul 4, 2020 at 1:18 AM Vlastimil Babka <[email protected]> wrote:
>
> On 6/23/20 8:13 AM, [email protected] wrote:
> > From: Joonsoo Kim <[email protected]>
> >
> > There is a well-defined standard migration target callback.
> > Use it directly.
> >
> > Signed-off-by: Joonsoo Kim <[email protected]>
>
> Acked-by: Vlastimil Babka <[email protected]>
>
> But you could move this to patch 5/8 to reduce churn. And do the same with
Yes, I now realize that it is possible to make this change earlier. However,
reordering the patches would cause additional changes, so I will not change
the order in the next version. The result would be the same. :)
> mm/memory-failure.c new_page() there really, to drop the simple wrappers. Only
Okay. As you suggested below, with NUMA_NO_NODE handling we can remove
more of the wrappers. I will do it.
> new_node_page() is complex enough.
> Hm wait, new_node_page() is only called by do_migrate_range() which is only
> called by __offline_pages() with explicit test that all pages are from a single
> zone, so the nmask could also be setup just once and not per each page, making
> it possible to remove the wrapper.
I have tried this suggestion and found that the code is not simpler than
before. However, there would be a minor performance benefit, so I will
include this change, too.
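(A rough sketch of that per-range setup in do_migrate_range(); the field
names follow the migration_target_control used elsewhere in this series, and
the details are an assumption rather than the eventual patch:)

	int nid = page_to_nid(pfn_to_page(start_pfn));	/* one zone, one node */
	nodemask_t nmask = node_states[N_MEMORY];
	struct migration_target_control mtc = {
		.nid = nid,
		.nmask = &nmask,
		.gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
	};

	/* prefer other nodes, but fall back to the source node if it is alone */
	node_clear(nid, nmask);
	if (nodes_empty(nmask))
		node_set(nid, nmask);

	ret = migrate_pages(&source, alloc_migration_target, NULL,
			    (unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_HOTPLUG);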
> But for new_page() you would have to define that mtc->nid == NUMA_NO_NODE means
> alloc_migrate_target() does page_to_nid(page) by itself.
Yes, I will use this suggestion.
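(A minimal sketch of that NUMA_NO_NODE handling inside
alloc_migration_target(); illustrative only:)

	struct migration_target_control *mtc;
	int nid;

	mtc = (struct migration_target_control *)private;
	nid = mtc->nid;
	/* let a single mtc serve pages coming from different nodes */
	if (nid == NUMA_NO_NODE)
		nid = page_to_nid(page);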
Thanks.
On Tue, Jun 23, 2020 at 03:13:47PM +0900, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> There is a well-defined migration target allocation callback.
> Use it.
>
> Signed-off-by: Joonsoo Kim <[email protected]>
> ---
> mm/internal.h | 1 -
> mm/mempolicy.c | 30 ++++++------------------------
> mm/migrate.c | 8 ++++++--
> 3 files changed, 12 insertions(+), 27 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index fb7f7fe..4f9f6b6 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -613,7 +613,6 @@ static inline bool is_migrate_highatomic_page(struct page *page)
> }
>
> void setup_zone_pageset(struct zone *zone);
> -extern struct page *alloc_new_node_page(struct page *page, unsigned long node);
>
> struct migration_target_control {
> int nid; /* preferred node id */
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index a3abf64..85a3f21 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1065,28 +1065,6 @@ static int migrate_page_add(struct page *page, struct list_head *pagelist,
> return 0;
> }
>
> -/* page allocation callback for NUMA node migration */
> -struct page *alloc_new_node_page(struct page *page, unsigned long node)
> -{
> - if (PageHuge(page)) {
> - return alloc_huge_page_nodemask(
> - page_hstate(compound_head(page)), node,
> - NULL, __GFP_THISNODE, false);
> - } else if (PageTransHuge(page)) {
> - struct page *thp;
> -
> - thp = alloc_pages_node(node,
> - (GFP_TRANSHUGE | __GFP_THISNODE),
> - HPAGE_PMD_ORDER);
> - if (!thp)
> - return NULL;
> - prep_transhuge_page(thp);
> - return thp;
> - } else
> - return __alloc_pages_node(node, GFP_HIGHUSER_MOVABLE |
> - __GFP_THISNODE, 0);
> -}
> -
> /*
> * Migrate pages from one node to a target node.
> * Returns error or the number of pages not migrated.
> @@ -1097,6 +1075,10 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
> nodemask_t nmask;
> LIST_HEAD(pagelist);
> int err = 0;
> + struct migration_target_control mtc = {
> + .nid = dest,
> + .gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_THISNODE,
> + };
>
> nodes_clear(nmask);
> node_set(source, nmask);
> @@ -1111,8 +1093,8 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
> flags | MPOL_MF_DISCONTIG_OK, &pagelist);
>
> if (!list_empty(&pagelist)) {
> - err = migrate_pages(&pagelist, alloc_new_node_page, NULL, dest,
> - MIGRATE_SYNC, MR_SYSCALL);
> + err = migrate_pages(&pagelist, alloc_migration_target, NULL,
> + (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL);
> if (err)
> putback_movable_pages(&pagelist);
> }
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 7c4cd74..1c943b0 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1590,9 +1590,13 @@ static int do_move_pages_to_node(struct mm_struct *mm,
> struct list_head *pagelist, int node)
> {
> int err;
> + struct migration_target_control mtc = {
> + .nid = node,
> + .gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_THISNODE,
> + };
>
> - err = migrate_pages(pagelist, alloc_new_node_page, NULL, node,
> - MIGRATE_SYNC, MR_SYSCALL);
> + err = migrate_pages(pagelist, alloc_migration_target, NULL,
> + (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL);
migrate_pages() starts failing like this apparently using the new
callback on NUMA systems,
[ 6147.019063][T45242] LTP: starting move_pages12
[ 6147.475680][T64921] BUG: unable to handle page fault for address: ffffffffffffffe0
[ 6147.483301][T64921] #PF: supervisor read access in kernel mode
[ 6147.489170][T64921] #PF: error_code(0x0000) - not-present page
[ 6147.495040][T64921] PGD 5df817067 P4D 5df817067 PUD 5df819067 PMD 0
[ 6147.501438][T64921] Oops: 0000 [#1] SMP KASAN NOPTI
[ 6147.506348][T64921] CPU: 35 PID: 64921 Comm: move_pages12 Tainted: G O 5.8.0-rc4-next-20200707 #1
[ 6147.516586][T64921] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
[ 6147.525866][T64921] RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170
avc_start_pgoff at mm/interval_tree.c:63
(inlined by) __anon_vma_interval_tree_iter_first at mm/interval_tree.c:71
(inlined by) anon_vma_interval_tree_iter_first at mm/interval_tree.c:95
[ 6147.532787][T64921] Code: 00 00 00 00 fc ff df 48 8b 6d 08 48 8d 7d e0 48 89 f9 48 c1 e9 03 80 3c 01 00 0f 85 b3 00 00 00 48 b8 00 00 00 00 00 fc ff df <48> 8b 6d e0 48 8d bd 98 00 00 00 48 89 f9 48 c1 e9 03 80 3c 01 00
[ 6147.552370][T64921] RSP: 0018:ffffc9000bfdfa98 EFLAGS: 00010246
[ 6147.558327][T64921] RAX: dffffc0000000000 RBX: ffff888524019b28 RCX: 1ffffffffffffffc
[ 6147.566205][T64921] RDX: 00000000000003ff RSI: 0000000000000200 RDI: ffffffffffffffe0
[ 6147.574084][T64921] RBP: 0000000000000000 R08: fffff94002b1c001 R09: fffff94002b1c001
[ 6147.581962][T64921] R10: ffffea00158e0007 R11: fffff94002b1c000 R12: 0000000000000009
[ 6147.589839][T64921] R13: ffffea00158e0008 R14: ffffea00158e0000 R15: ffffea00158e0000
[ 6147.597717][T64921] FS: 00007f329cf18b80(0000) GS:ffff88881f7c0000(0000) knlGS:0000000000000000
[ 6147.606557][T64921] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6147.613037][T64921] CR2: ffffffffffffffe0 CR3: 000000081122a000 CR4: 00000000003506e0
[ 6147.620914][T64921] Call Trace:
[ 6147.624078][T64921] rmap_walk_anon+0x141/0xa30
rmap_walk_anon at mm/rmap.c:1864
[ 6147.628639][T64921] try_to_unmap+0x209/0x2d0
try_to_unmap at mm/rmap.c:1763
[ 6147.633026][T64921] ? rmap_walk_locked+0x140/0x140
[ 6147.637936][T64921] ? page_remove_rmap+0x1190/0x1190
[ 6147.643020][T64921] ? page_not_mapped+0x10/0x10
[ 6147.647668][T64921] ? page_get_anon_vma+0x290/0x290
[ 6147.652664][T64921] ? page_mapcount_is_zero+0x10/0x10
[ 6147.657838][T64921] ? hugetlb_page_mapping_lock_write+0x97/0x180
[ 6147.663972][T64921] migrate_pages+0x1005/0x1fb0
unmap_and_move_huge_page at mm/migrate.c:1383
(inlined by) migrate_pages at mm/migrate.c:1468
[ 6147.668617][T64921] ? remove_migration_pte+0xac0/0xac0
[ 6147.673875][T64921] move_pages_and_store_status.isra.47+0xd7/0x1a0
do_move_pages_to_node at mm/migrate.c:1595
(inlined by) move_pages_and_store_status at mm/migrate.c:1683
[ 6147.680181][T64921] ? migrate_pages+0x1fb0/0x1fb0
[ 6147.685002][T64921] __x64_sys_move_pages+0xa5c/0x1100
[ 6147.690176][T64921] ? trace_hardirqs_on+0x20/0x1b5
[ 6147.695084][T64921] ? move_pages_and_store_status.isra.47+0x1a0/0x1a0
[ 6147.701653][T64921] ? rcu_read_lock_sched_held+0xaa/0xd0
[ 6147.707088][T64921] ? switch_fpu_return+0x196/0x400
[ 6147.712083][T64921] ? lockdep_hardirqs_on_prepare+0x38c/0x550
[ 6147.717954][T64921] ? do_syscall_64+0x24/0x310
[ 6147.722513][T64921] do_syscall_64+0x5f/0x310
[ 6147.726897][T64921] ? trace_hardirqs_off+0x12/0x1a0
[ 6147.731894][T64921] ? asm_exc_page_fault+0x8/0x30
[ 6147.736714][T64921] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 6147.742495][T64921] RIP: 0033:0x7f329c3fe6ed
[ 6147.746791][T64921] Code: Bad RIP value.
[ 6147.750738][T64921] RSP: 002b:00007fff5b6b5f88 EFLAGS: 00000246 ORIG_RAX: 0000000000000117
[ 6147.759055][T64921] RAX: ffffffffffffffda RBX: 00007f329cf18af8 RCX: 00007f329c3fe6ed
[ 6147.766933][T64921] RDX: 00000000019b0ee0 RSI: 0000000000000400 RDI: 000000000000fd98
[ 6147.774809][T64921] RBP: 0000000000000400 R08: 00000000019b3f00 R09: 0000000000000004
[ 6147.782686][T64921] R10: 00000000019b2ef0 R11: 0000000000000246 R12: 0000000000000400
[ 6147.790563][T64921] R13: 00000000019b0ee0 R14: 00000000019b2ef0 R15: 00000000019b3f00
[ 6147.798440][T64921] Modules linked in: vfio_pci vfio_virqfd vfio_iommu_type1 vfio loop kvm_amd ses enclosure kvm irqbypass efivars acpi_cpufreq nls_ascii nls_cp437 vfat fat efivarfs ip_tables x_tables sd_mod smartpqi scsi_transport_sas tg3 mlx5_core libphy firmware_class dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dummy_del_mod]
[ 6147.828701][T64921] CR2: ffffffffffffffe0
[ 6147.832736][T64921] ---[ end trace 40323b256f1c74a8 ]---
[ 6147.838083][T64921] RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170
[ 6147.845001][T64921] Code: 00 00 00 00 fc ff df 48 8b 6d 08 48 8d 7d e0 48 89 f9 48 c1 e9 03 80 3c 01 00 0f 85 b3 00 00 00 48 b8 00 00 00 00 00 fc ff df <48> 8b 6d e0 48 8d bd 98 00 00 00 48 89 f9 48 c1 e9 03 80 3c 01 00
[ 6147.864583][T64921] RSP: 0018:ffffc9000bfdfa98 EFLAGS: 00010246
[ 6147.870539][T64921] RAX: dffffc0000000000 RBX: ffff888524019b28 RCX: 1ffffffffffffffc
[ 6147.878417][T64921] RDX: 00000000000003ff RSI: 0000000000000200 RDI: ffffffffffffffe0
[ 6147.886294][T64921] RBP: 0000000000000000 R08: fffff94002b1c001 R09: fffff94002b1c001
[ 6147.894172][T64921] R10: ffffea00158e0007 R11: fffff94002b1c000 R12: 0000000000000009
[ 6147.902049][T64921] R13: ffffea00158e0008 R14: ffffea00158e0000 R15: ffffea00158e0000
[ 6147.909932][T64921] FS: 00007f329cf18b80(0000) GS:ffff88881f7c0000(0000) knlGS:0000000000000000
[ 6147.918771][T64921] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6147.925251][T64921] CR2: ffffffffffffffe0 CR3: 000000081122a000 CR4: 00000000003506e0
[ 6147.933130][T64921] Kernel panic - not syncing: Fatal exception
[ 6147.939493][T64921] Kernel Offset: 0x28c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 6147.951090][T64921] ---[ end Kernel panic - not syncing: Fatal exception ]---
> if (err)
> putback_movable_pages(pagelist);
> return err;
> --
> 2.7.4
>
>
On Tue 07-07-20 21:20:44, Qian Cai wrote:
[...]
> migrate_pages() starts failing like this apparently using the new
> callback on NUMA systems,
>
> [ 6147.019063][T45242] LTP: starting move_pages12
> [ 6147.475680][T64921] BUG: unable to handle page fault for address: ffffffffffffffe0
Hmm, this looks like -EPIPE (-32) which is unexpected to say the least.
Does the test pass without this patch applied? Also there has been v4
posted just yesterday. Does it suffer from the same problem?
--
Michal Hocko
SUSE Labs
On Tue, 7 Jul 2020, Qian Cai wrote:
> On Tue, Jun 23, 2020 at 03:13:47PM +0900, [email protected] wrote:
> > From: Joonsoo Kim <[email protected]>
> >
> > There is a well-defined migration target allocation callback.
> > Use it.
> >
> > Signed-off-by: Joonsoo Kim <[email protected]>
> > ---
...
>
> migrate_pages() starts failing like this apparently using the new
> callback on NUMA systems,
>
> [ 6147.019063][T45242] LTP: starting move_pages12
> [ 6147.475680][T64921] BUG: unable to handle page fault for address: ffffffffffffffe0
> [ 6147.483301][T64921] #PF: supervisor read access in kernel mode
> [ 6147.489170][T64921] #PF: error_code(0x0000) - not-present page
> [ 6147.495040][T64921] PGD 5df817067 P4D 5df817067 PUD 5df819067 PMD 0
> [ 6147.501438][T64921] Oops: 0000 [#1] SMP KASAN NOPTI
> [ 6147.506348][T64921] CPU: 35 PID: 64921 Comm: move_pages12 Tainted: G O 5.8.0-rc4-next-20200707 #1
> [ 6147.516586][T64921] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
> [ 6147.525866][T64921] RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170
> avc_start_pgoff at mm/interval_tree.c:63
> (inlined by) __anon_vma_interval_tree_iter_first at mm/interval_tree.c:71
> (inlined by) anon_vma_interval_tree_iter_first at mm/interval_tree.c:95
> [ 6147.532787][T64921] Code: 00 00 00 00 fc ff df 48 8b 6d 08 48 8d 7d e0 48 89 f9 48 c1 e9 03 80 3c 01 00 0f 85 b3 00 00 00 48 b8 00 00 00 00 00 fc ff df <48> 8b 6d e0 48 8d bd 98 00 00 00 48 89 f9 48 c1 e9 03 80 3c 01 00
> [ 6147.552370][T64921] RSP: 0018:ffffc9000bfdfa98 EFLAGS: 00010246
> [ 6147.558327][T64921] RAX: dffffc0000000000 RBX: ffff888524019b28 RCX: 1ffffffffffffffc
> [ 6147.566205][T64921] RDX: 00000000000003ff RSI: 0000000000000200 RDI: ffffffffffffffe0
> [ 6147.574084][T64921] RBP: 0000000000000000 R08: fffff94002b1c001 R09: fffff94002b1c001
> [ 6147.581962][T64921] R10: ffffea00158e0007 R11: fffff94002b1c000 R12: 0000000000000009
> [ 6147.589839][T64921] R13: ffffea00158e0008 R14: ffffea00158e0000 R15: ffffea00158e0000
> [ 6147.597717][T64921] FS: 00007f329cf18b80(0000) GS:ffff88881f7c0000(0000) knlGS:0000000000000000
> [ 6147.606557][T64921] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 6147.613037][T64921] CR2: ffffffffffffffe0 CR3: 000000081122a000 CR4: 00000000003506e0
> [ 6147.620914][T64921] Call Trace:
> [ 6147.624078][T64921] rmap_walk_anon+0x141/0xa30
> rmap_walk_anon at mm/rmap.c:1864
> [ 6147.628639][T64921] try_to_unmap+0x209/0x2d0
> try_to_unmap at mm/rmap.c:1763
> [ 6147.633026][T64921] ? rmap_walk_locked+0x140/0x140
> [ 6147.637936][T64921] ? page_remove_rmap+0x1190/0x1190
> [ 6147.643020][T64921] ? page_not_mapped+0x10/0x10
> [ 6147.647668][T64921] ? page_get_anon_vma+0x290/0x290
> [ 6147.652664][T64921] ? page_mapcount_is_zero+0x10/0x10
> [ 6147.657838][T64921] ? hugetlb_page_mapping_lock_write+0x97/0x180
> [ 6147.663972][T64921] migrate_pages+0x1005/0x1fb0
> unmap_and_move_huge_page at mm/migrate.c:1383
> (inlined by) migrate_pages at mm/migrate.c:1468
> [ 6147.668617][T64921] ? remove_migration_pte+0xac0/0xac0
> [ 6147.673875][T64921] move_pages_and_store_status.isra.47+0xd7/0x1a0
> do_move_pages_to_node at mm/migrate.c:1595
> (inlined by) move_pages_and_store_status at mm/migrate.c:1683
> [ 6147.680181][T64921] ? migrate_pages+0x1fb0/0x1fb0
> [ 6147.685002][T64921] __x64_sys_move_pages+0xa5c/0x1100
> [ 6147.690176][T64921] ? trace_hardirqs_on+0x20/0x1b5
> [ 6147.695084][T64921] ? move_pages_and_store_status.isra.47+0x1a0/0x1a0
> [ 6147.701653][T64921] ? rcu_read_lock_sched_held+0xaa/0xd0
> [ 6147.707088][T64921] ? switch_fpu_return+0x196/0x400
> [ 6147.712083][T64921] ? lockdep_hardirqs_on_prepare+0x38c/0x550
> [ 6147.717954][T64921] ? do_syscall_64+0x24/0x310
> [ 6147.722513][T64921] do_syscall_64+0x5f/0x310
> [ 6147.726897][T64921] ? trace_hardirqs_off+0x12/0x1a0
> [ 6147.731894][T64921] ? asm_exc_page_fault+0x8/0x30
> [ 6147.736714][T64921] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 6147.742495][T64921] RIP: 0033:0x7f329c3fe6ed
> [ 6147.746791][T64921] Code: Bad RIP value.
> [ 6147.750738][T64921] RSP: 002b:00007fff5b6b5f88 EFLAGS: 00000246 ORIG_RAX: 0000000000000117
> [ 6147.759055][T64921] RAX: ffffffffffffffda RBX: 00007f329cf18af8 RCX: 00007f329c3fe6ed
> [ 6147.766933][T64921] RDX: 00000000019b0ee0 RSI: 0000000000000400 RDI: 000000000000fd98
> [ 6147.774809][T64921] RBP: 0000000000000400 R08: 00000000019b3f00 R09: 0000000000000004
> [ 6147.782686][T64921] R10: 00000000019b2ef0 R11: 0000000000000246 R12: 0000000000000400
> [ 6147.790563][T64921] R13: 00000000019b0ee0 R14: 00000000019b2ef0 R15: 00000000019b3f00
> [ 6147.798440][T64921] Modules linked in: vfio_pci vfio_virqfd vfio_iommu_type1 vfio loop kvm_amd ses enclosure kvm irqbypass efivars acpi_cpufreq nls_ascii nls_cp437 vfat fat efivarfs ip_tables x_tables sd_mod smartpqi scsi_transport_sas tg3 mlx5_core libphy firmware_class dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dummy_del_mod]
> [ 6147.828701][T64921] CR2: ffffffffffffffe0
> [ 6147.832736][T64921] ---[ end trace 40323b256f1c74a8 ]---
> [ 6147.838083][T64921] RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170
> [ 6147.845001][T64921] Code: 00 00 00 00 fc ff df 48 8b 6d 08 48 8d 7d e0 48 89 f9 48 c1 e9 03 80 3c 01 00 0f 85 b3 00 00 00 48 b8 00 00 00 00 00 fc ff df <48> 8b 6d e0 48 8d bd 98 00 00 00 48 89 f9 48 c1 e9 03 80 3c 01 00
> [ 6147.864583][T64921] RSP: 0018:ffffc9000bfdfa98 EFLAGS: 00010246
> [ 6147.870539][T64921] RAX: dffffc0000000000 RBX: ffff888524019b28 RCX: 1ffffffffffffffc
> [ 6147.878417][T64921] RDX: 00000000000003ff RSI: 0000000000000200 RDI: ffffffffffffffe0
> [ 6147.886294][T64921] RBP: 0000000000000000 R08: fffff94002b1c001 R09: fffff94002b1c001
> [ 6147.894172][T64921] R10: ffffea00158e0007 R11: fffff94002b1c000 R12: 0000000000000009
> [ 6147.902049][T64921] R13: ffffea00158e0008 R14: ffffea00158e0000 R15: ffffea00158e0000
> [ 6147.909932][T64921] FS: 00007f329cf18b80(0000) GS:ffff88881f7c0000(0000) knlGS:0000000000000000
> [ 6147.918771][T64921] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 6147.925251][T64921] CR2: ffffffffffffffe0 CR3: 000000081122a000 CR4: 00000000003506e0
> [ 6147.933130][T64921] Kernel panic - not syncing: Fatal exception
> [ 6147.939493][T64921] Kernel Offset: 0x28c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [ 6147.951090][T64921] ---[ end Kernel panic - not syncing: Fatal exception ]---
I hit this too, trying LTP (yes, move_pages12) on 5.9-rc8. I could not
then reproduce it, and suppose that Qian Cai was also unable to do so.
But it's fairly easy to explain, and not at all related to Joonsoo's
patch or patchset accused in this thread.
The faulting address ffffffffffffffe0 is not an -ESPIPE here,
it comes from a mov -0x20(%rcx),%rcx with NULL %rcx in my case: I've
been too impatient to unravel the interval tree defining to work out
exactly what that corresponds to, but I assume from an
anon_vma_interval_tree rb_entry container_of.
Mike, j'accuse... your 5.7 commit c0d0381ade79 ("hugetlbfs:
use i_mmap_rwsem for more pmd sharing synchronization"), in which
unmap_and_move_huge_page() now passes the TTU_RMAP_LOCKED flag to
try_to_unmap(), because it's already holding mapping->i_mmap_rwsem:
but that is not the right lock to secure an anon_vma lookup.
I intended to send a patch, passing TTU_RMAP_LOCKED only in the
!PageAnon case (and, see vma_adjust(), anon_vma lock conveniently
nests inside i_mmap_rwsem); but then wondered if i_mmap_rwsem was
needed in that case or not, so looked deeper into c0d0381ade79.
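Roughly, the sketch I had in mind (on top of the existing try_to_unmap()
call in unmap_and_move_huge_page(), flags as I recall them; illustrative
only, not a tested patch):

    if (page_mapped(hpage)) {
        /*
         * i_mmap_rwsem secures the file rmap walk, but not an anon_vma
         * lookup: so only tell try_to_unmap() that rmap is already
         * locked in the !PageAnon case.
         */
        try_to_unmap(hpage,
            TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS |
            (PageAnon(hpage) ? 0 : TTU_RMAP_LOCKED));
        page_was_mapped = 1;
    }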
Hmm, not even you liked it! But the worst of it looks simply
unnecessary to me, and I hope can be deleted - but better by you
than by me (in particular, you were trying to kill 1) and 2) birds
with one stone, and I've always given up on understanding hugetlb's
reservations: I suspect that side of it is irrelevant here,
but I wouldn't pretend to be sure).
How could you ever find a PageAnon page in a vma_shareable() area?
It is all rather confusing (vma_shareable() depending on VM_MAYSHARE,
whereas is_cow_mapping() on VM_SHARED and VM_MAYWRITE: they have to
be studied together with do_mmap()'s
    vm_flags |= VM_SHARED | VM_MAYSHARE;
    if (!(file->f_mode & FMODE_WRITE))
        vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
(And let me add to the confusion by admitting that, prior to 3.15's
cda540ace6a1 "mm: get_user_pages(write,force) refuse to COW in
shared areas", maybe it was possible to find a PageAnon there.)
But my belief (best confirmed by you running your tests with a
suitably placed BUG_ON or WARN_ON) is that you'll never find a
PageAnon in a vma_shareable() area, so will never need try_to_unmap()
to unshare a pagetable in the PageAnon case, so won't need i_mmap_rwsem
for PageAnon there, and _get_hugetlb_page_mapping() (your function that
deduces an address_space from an anon_vma) can just be deleted.
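For instance, something like this in try_to_unmap_one(), using VM_MAYSHARE
as a stand-in for the full vma_shareable() test (placement and exact form
just a suggestion):

    /* belief to test: an anon hugetlb page never sits in a shareable vma */
    if (PageHuge(page) && PageAnon(page))
        VM_WARN_ON_ONCE(vma->vm_flags & VM_MAYSHARE);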
(And in passing, may I ask what hugetlb_page_mapping_lock_write()'s
hpage->_mapcount inc and dec are for? You comment it as a hack,
but don't explain what needs that hack, and I don't see it.)
It does seem sad to be unsharing the page table in any case of
try_to_unmap(): the only reason for that I can see (if more care
were taken with TLB flushing) is the mmu_notifier situation -
I don't think we have any suitable mmu_notifier to alert all
the mms to be affected.
Hugh
On 10/7/20 8:21 PM, Hugh Dickins wrote:
> On Tue, 7 Jul 2020, Qian Cai wrote:
>> On Tue, Jun 23, 2020 at 03:13:47PM +0900, [email protected] wrote:
>>> From: Joonsoo Kim <[email protected]>
>>>
>>> There is a well-defined migration target allocation callback.
>>> Use it.
>>>
>>> Signed-off-by: Joonsoo Kim <[email protected]>
>>> ---
> ...
>>
>> migrate_pages() starts failing like this apparently using the new
>> callback on NUMA systems,
>>
>> [ 6147.019063][T45242] LTP: starting move_pages12
>> [ 6147.475680][T64921] BUG: unable to handle page fault for address: ffffffffffffffe0
>> [ 6147.483301][T64921] #PF: supervisor read access in kernel mode
>> [ 6147.489170][T64921] #PF: error_code(0x0000) - not-present page
>> [ 6147.495040][T64921] PGD 5df817067 P4D 5df817067 PUD 5df819067 PMD 0
>> [ 6147.501438][T64921] Oops: 0000 [#1] SMP KASAN NOPTI
>> [ 6147.506348][T64921] CPU: 35 PID: 64921 Comm: move_pages12 Tainted: G O 5.8.0-rc4-next-20200707 #1
>> [ 6147.516586][T64921] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
>> [ 6147.525866][T64921] RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170
>> avc_start_pgoff at mm/interval_tree.c:63
>> (inlined by) __anon_vma_interval_tree_iter_first at mm/interval_tree.c:71
>> (inlined by) anon_vma_interval_tree_iter_first at mm/interval_tree.c:95
>> [ 6147.532787][T64921] Code: 00 00 00 00 fc ff df 48 8b 6d 08 48 8d 7d e0 48 89 f9 48 c1 e9 03 80 3c 01 00 0f 85 b3 00 00 00 48 b8 00 00 00 00 00 fc ff df <48> 8b 6d e0 48 8d bd 98 00 00 00 48 89 f9 48 c1 e9 03 80 3c 01 00
>> [ 6147.552370][T64921] RSP: 0018:ffffc9000bfdfa98 EFLAGS: 00010246
>> [ 6147.558327][T64921] RAX: dffffc0000000000 RBX: ffff888524019b28 RCX: 1ffffffffffffffc
>> [ 6147.566205][T64921] RDX: 00000000000003ff RSI: 0000000000000200 RDI: ffffffffffffffe0
>> [ 6147.574084][T64921] RBP: 0000000000000000 R08: fffff94002b1c001 R09: fffff94002b1c001
>> [ 6147.581962][T64921] R10: ffffea00158e0007 R11: fffff94002b1c000 R12: 0000000000000009
>> [ 6147.589839][T64921] R13: ffffea00158e0008 R14: ffffea00158e0000 R15: ffffea00158e0000
>> [ 6147.597717][T64921] FS: 00007f329cf18b80(0000) GS:ffff88881f7c0000(0000) knlGS:0000000000000000
>> [ 6147.606557][T64921] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 6147.613037][T64921] CR2: ffffffffffffffe0 CR3: 000000081122a000 CR4: 00000000003506e0
>> [ 6147.620914][T64921] Call Trace:
>> [ 6147.624078][T64921] rmap_walk_anon+0x141/0xa30
>> rmap_walk_anon at mm/rmap.c:1864
>> [ 6147.628639][T64921] try_to_unmap+0x209/0x2d0
>> try_to_unmap at mm/rmap.c:1763
>> [ 6147.633026][T64921] ? rmap_walk_locked+0x140/0x140
>> [ 6147.637936][T64921] ? page_remove_rmap+0x1190/0x1190
>> [ 6147.643020][T64921] ? page_not_mapped+0x10/0x10
>> [ 6147.647668][T64921] ? page_get_anon_vma+0x290/0x290
>> [ 6147.652664][T64921] ? page_mapcount_is_zero+0x10/0x10
>> [ 6147.657838][T64921] ? hugetlb_page_mapping_lock_write+0x97/0x180
>> [ 6147.663972][T64921] migrate_pages+0x1005/0x1fb0
>> unmap_and_move_huge_page at mm/migrate.c:1383
>> (inlined by) migrate_pages at mm/migrate.c:1468
>> [ 6147.668617][T64921] ? remove_migration_pte+0xac0/0xac0
>> [ 6147.673875][T64921] move_pages_and_store_status.isra.47+0xd7/0x1a0
>> do_move_pages_to_node at mm/migrate.c:1595
>> (inlined by) move_pages_and_store_status at mm/migrate.c:1683
>> [ 6147.680181][T64921] ? migrate_pages+0x1fb0/0x1fb0
>> [ 6147.685002][T64921] __x64_sys_move_pages+0xa5c/0x1100
>> [ 6147.690176][T64921] ? trace_hardirqs_on+0x20/0x1b5
>> [ 6147.695084][T64921] ? move_pages_and_store_status.isra.47+0x1a0/0x1a0
>> [ 6147.701653][T64921] ? rcu_read_lock_sched_held+0xaa/0xd0
>> [ 6147.707088][T64921] ? switch_fpu_return+0x196/0x400
>> [ 6147.712083][T64921] ? lockdep_hardirqs_on_prepare+0x38c/0x550
>> [ 6147.717954][T64921] ? do_syscall_64+0x24/0x310
>> [ 6147.722513][T64921] do_syscall_64+0x5f/0x310
>> [ 6147.726897][T64921] ? trace_hardirqs_off+0x12/0x1a0
>> [ 6147.731894][T64921] ? asm_exc_page_fault+0x8/0x30
>> [ 6147.736714][T64921] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [ 6147.742495][T64921] RIP: 0033:0x7f329c3fe6ed
>> [ 6147.746791][T64921] Code: Bad RIP value.
>> [ 6147.750738][T64921] RSP: 002b:00007fff5b6b5f88 EFLAGS: 00000246 ORIG_RAX: 0000000000000117
>> [ 6147.759055][T64921] RAX: ffffffffffffffda RBX: 00007f329cf18af8 RCX: 00007f329c3fe6ed
>> [ 6147.766933][T64921] RDX: 00000000019b0ee0 RSI: 0000000000000400 RDI: 000000000000fd98
>> [ 6147.774809][T64921] RBP: 0000000000000400 R08: 00000000019b3f00 R09: 0000000000000004
>> [ 6147.782686][T64921] R10: 00000000019b2ef0 R11: 0000000000000246 R12: 0000000000000400
>> [ 6147.790563][T64921] R13: 00000000019b0ee0 R14: 00000000019b2ef0 R15: 00000000019b3f00
>> [ 6147.798440][T64921] Modules linked in: vfio_pci vfio_virqfd vfio_iommu_type1 vfio loop kvm_amd ses enclosure kvm irqbypass efivars acpi_cpufreq nls_ascii nls_cp437 vfat fat efivarfs ip_tables x_tables sd_mod smartpqi scsi_transport_sas tg3 mlx5_core libphy firmware_class dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dummy_del_mod]
>> [ 6147.828701][T64921] CR2: ffffffffffffffe0
>> [ 6147.832736][T64921] ---[ end trace 40323b256f1c74a8 ]---
>> [ 6147.838083][T64921] RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170
>> [ 6147.845001][T64921] Code: 00 00 00 00 fc ff df 48 8b 6d 08 48 8d 7d e0 48 89 f9 48 c1 e9 03 80 3c 01 00 0f 85 b3 00 00 00 48 b8 00 00 00 00 00 fc ff df <48> 8b 6d e0 48 8d bd 98 00 00 00 48 89 f9 48 c1 e9 03 80 3c 01 00
>> [ 6147.864583][T64921] RSP: 0018:ffffc9000bfdfa98 EFLAGS: 00010246
>> [ 6147.870539][T64921] RAX: dffffc0000000000 RBX: ffff888524019b28 RCX: 1ffffffffffffffc
>> [ 6147.878417][T64921] RDX: 00000000000003ff RSI: 0000000000000200 RDI: ffffffffffffffe0
>> [ 6147.886294][T64921] RBP: 0000000000000000 R08: fffff94002b1c001 R09: fffff94002b1c001
>> [ 6147.894172][T64921] R10: ffffea00158e0007 R11: fffff94002b1c000 R12: 0000000000000009
>> [ 6147.902049][T64921] R13: ffffea00158e0008 R14: ffffea00158e0000 R15: ffffea00158e0000
>> [ 6147.909932][T64921] FS: 00007f329cf18b80(0000) GS:ffff88881f7c0000(0000) knlGS:0000000000000000
>> [ 6147.918771][T64921] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 6147.925251][T64921] CR2: ffffffffffffffe0 CR3: 000000081122a000 CR4: 00000000003506e0
>> [ 6147.933130][T64921] Kernel panic - not syncing: Fatal exception
>> [ 6147.939493][T64921] Kernel Offset: 0x28c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>> [ 6147.951090][T64921] ---[ end Kernel panic - not syncing: Fatal exception ]---
>
> I hit this too, trying LTP (yes, move_pages12) on 5.9-rc8. I could not
> then reproduce it, and suppose that Qian Cai was also unable to do so.
>
> But it's fairly easy to explain, and not at all related to Joonsoo's
> patch or patchset accused in this thread.
>
> The faulting address ffffffffffffffe0 is not an -ESPIPE here: it comes
> from a mov -0x20(%rcx),%rcx with NULL %rcx in my case. I've been too
> impatient to unravel the interval tree definitions to work out exactly
> what that corresponds to, but I assume it comes from an
> anon_vma_interval_tree rb_entry container_of.
>
> Mike, j'accuse... your 5.7 commit c0d0381ade79 ("hugetlbfs:
> use i_mmap_rwsem for more pmd sharing synchronization"), in which
> unmap_and_move_huge_page() now passes the TTU_RMAP_LOCKED flag to
> try_to_unmap(), because it's already holding mapping->i_mmap_rwsem:
> but that is not the right lock to secure an anon_vma lookup.
Thanks Hugh! Your analysis is correct and the code in that commit is
not correct. I was so focused on the file mapping case, I overlooked
(actually introduced) this issue for anon mappings.
Let me verify that this indeed is the root cause. However, since
move_pages12 migrated anon hugetlb pages it certainly does look to be
the case.
> I intended to send a patch, passing TTU_RMAP_LOCKED only in the
> !PageAnon case (and, see vma_adjust(), anon_vma lock conveniently
> nests inside i_mmap_rwsem); but then wondered if i_mmap_rwsem was
> needed in that case or not, so looked deeper into c0d0381ade79.
>
> Hmm, not even you liked it! But the worst of it looks simply
> unnecessary to me, and I hope can be deleted - but better by you
> than by me (in particular, you were trying to kill 1) and 2) birds
> with one stone, and I've always given up on understanding hugetlb's
> reservations: I suspect that side of it is irrelevant here,
> but I wouldn't pretend to be sure).
>
> How could you ever find a PageAnon page in a vma_shareable() area?
>
> It is all rather confusing (vma_shareable() depending on VM_MAYSHARE,
> whereas is_cow_mapping() on VM_SHARED and VM_MAYWRITE: they have to
> be studied together with do_mmap()'s
> vm_flags |= VM_SHARED | VM_MAYSHARE;
> if (!(file->f_mode & FMODE_WRITE))
> vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
>
> (And let me add to the confusion by admitting that, prior to 3.15's
> cda540ace6a1 "mm: get_user_pages(write,force) refuse to COW in
> shared areas", maybe it was possible to find a PageAnon there.)
>
> But my belief (best confirmed by you running your tests with a
> suitably placed BUG_ON or WARN_ON) is that you'll never find a
> PageAnon in a vma_shareable() area, so will never need try_to_unmap()
> to unshare a pagetable in the PageAnon case, so won't need i_mmap_rwsem
> for PageAnon there, and _get_hugetlb_page_mapping() (your function that
> deduces an address_space from an anon_vma) can just be deleted.
Yes, it is confusing. Let me look into this. I would be really happy
to delete that ugly function.
> (And in passing, may I ask what hugetlb_page_mapping_lock_write()'s
> hpage->_mapcount inc and dec are for? You comment it as a hack,
> but don't explain what needs that hack, and I don't see it.)
We are trying to lock the mapping (mapping->i_mmap_rwsem). We know
mapping is valid, because we obtained it from page_mapping() and it
will remain valid because we have the page locked. Page needs to be
unlocked to unmap. However, we have to drop page lock in order to
acquire i_mmap_rwsem. Once we drop page lock, mapping could become
invalid. So, the code artificially incs the mapcount so that the mapping
will remain valid when unmapping the page.
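Roughly, the sequence is (a sketch from memory, not the exact code in
hugetlb_page_mapping_lock_write()):

    mapping = page_mapping(hpage);  /* valid: hpage is locked here */
    /* "hack": inc mapcount so the mapping is not torn down meanwhile */
    atomic_inc(&hpage->_mapcount);
    unlock_page(hpage);             /* lock order: i_mmap_rwsem before page lock */
    i_mmap_lock_write(mapping);
    lock_page(hpage);
    atomic_dec(&hpage->_mapcount);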
As mentioned above, I hope all this can be removed.
--
Mike Kravetz
On Thu, 8 Oct 2020, Mike Kravetz wrote:
> On 10/7/20 8:21 PM, Hugh Dickins wrote:
> >
> > Mike, j'accuse... your 5.7 commit c0d0381ade79 ("hugetlbfs:
> > use i_mmap_rwsem for more pmd sharing synchronization"), in which
> > unmap_and_move_huge_page() now passes the TTU_RMAP_LOCKED flag to
> > try_to_unmap(), because it's already holding mapping->i_mmap_rwsem:
> > but that is not the right lock to secure an anon_vma lookup.
>
> Thanks Hugh! Your analysis is correct and the code in that commit is
> not correct. I was so focused on the file mapping case, I overlooked
> (actually introduced) this issue for anon mappings.
>
> Let me verify that this indeed is the root cause. However, since
> move_pages12 migrated anon hugetlb pages it certainly does look to be
> the case.
>
> > I intended to send a patch, passing TTU_RMAP_LOCKED only in the
> > !PageAnon case (and, see vma_adjust(), anon_vma lock conveniently
> > nests inside i_mmap_rwsem); but then wondered if i_mmap_rwsem was
> > needed in that case or not, so looked deeper into c0d0381ade79.
> >
> > Hmm, not even you liked it! But the worst of it looks simply
> > unnecessary to me, and I hope can be deleted - but better by you
> > than by me (in particular, you were trying to kill 1) and 2) birds
> > with one stone, and I've always given up on understanding hugetlb's
> > reservations: I suspect that side of it is irrelevant here,
> > but I wouldn't pretend to be sure).
> >
> > How could you ever find a PageAnon page in a vma_shareable() area?
> >
> > It is all rather confusing (vma_shareable() depending on VM_MAYSHARE,
> > whereas is_cow_mapping() on VM_SHARED and VM_MAYWRITE: they have to
> > be studied together with do_mmap()'s
> > vm_flags |= VM_SHARED | VM_MAYSHARE;
> > if (!(file->f_mode & FMODE_WRITE))
> > vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
> >
> > (And let me add to the confusion by admitting that, prior to 3.15's
> > cda540ace6a1 "mm: get_user_pages(write,force) refuse to COW in
> > shared areas", maybe it was possible to find a PageAnon there.)
> >
> > But my belief (best confirmed by you running your tests with a
> > suitably placed BUG_ON or WARN_ON) is that you'll never find a
> > PageAnon in a vma_shareable() area, so will never need try_to_unmap()
> > to unshare a pagetable in the PageAnon case, so won't need i_mmap_rwsem
> > for PageAnon there, and _get_hugetlb_page_mapping() (your function that
> > deduces an address_space from an anon_vma) can just be deleted.
>
> Yes, it is confusing. Let me look into this. I would be really happy
> to delete that ugly function.
>
> > (And in passing, may I ask what hugetlb_page_mapping_lock_write()'s
> > hpage->_mapcount inc and dec are for? You comment it as a hack,
> > but don't explain what needs that hack, and I don't see it.)
>
> We are trying to lock the mapping (mapping->i_mmap_rwsem). We know
> mapping is valid, because we obtained it from page_mapping() and it
> will remain valid because we have the page locked. Page needs to be
> unlocked to unmap. However, we have to drop page lock in order to
> acquire i_mmap_rwsem. Once we drop page lock, mapping could become
> invalid. So, the code artificially incs the mapcount so that the mapping
> will remain valid when unmapping the page.
No, unless you can point me to some other hugetlbfs-does-it-differently
(and I didn't see it there in that commit), raising _mapcount does not
provide any such protection; but does add the possibility of a
"BUG: Bad page cache" and leak from unaccount_page_cache_page().
Earlier in the day I was trying to work out what to recommend instead,
but had to turn aside to something else: I'll try again tomorrow.
It's a problem I've faced before in tmpfs, keeping a hold on the
mapping while page lock is dropped. Quite awkward: igrab() looks as
if it's the right thing to use, but turns out to give no protection
against umount. Last time around, I ended up with a stop_eviction
count in the shmem inode, which shmem_evict_inode() waits on if
necessary. Something like that could be done for hugetlbfs too,
but I'd prefer to do it without adding extra, if there is a way.
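If it did come to that, the shape of it for hugetlbfs would be something
like this (names invented here, just to show the pattern):

    /* hypothetical field in struct hugetlbfs_inode_info */
    atomic_t stop_eviction;    /* hold off evict_inode while raised */

    /* before dropping the page lock: */
    atomic_inc(&HUGETLBFS_I(inode)->stop_eviction);

    /* when finished with the mapping: */
    if (atomic_dec_and_test(&HUGETLBFS_I(inode)->stop_eviction))
        wake_up_var(&HUGETLBFS_I(inode)->stop_eviction);

    /* and in hugetlbfs_evict_inode(): */
    wait_var_event(&info->stop_eviction,
                   !atomic_read(&info->stop_eviction));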
>
> As mentioned above, I hope all this can be removed.
If you continue to nest page lock inside i_mmap_rwsem for hugetlbfs,
then I think that part of hugetlb_page_mapping_lock_write() has to
remain. I'd much prefer that hugetlbfs did not reverse the usual
nesting, but accept that you had reasons for doing it that way.
Hugh
On 10/8/20 10:50 PM, Hugh Dickins wrote:
> On Thu, 8 Oct 2020, Mike Kravetz wrote:
>> On 10/7/20 8:21 PM, Hugh Dickins wrote:
>>>
>>> Mike, j'accuse... your 5.7 commit c0d0381ade79 ("hugetlbfs:
>>> use i_mmap_rwsem for more pmd sharing synchronization"), in which
>>> unmap_and_move_huge_page() now passes the TTU_RMAP_LOCKED flag to
>>> try_to_unmap(), because it's already holding mapping->i_mmap_rwsem:
>>> but that is not the right lock to secure an anon_vma lookup.
>>
>> Thanks Hugh! Your analysis is correct and the code in that commit is
>> not correct. I was so focused on the file mapping case, I overlooked
>> (actually introduced) this issue for anon mappings.
>>
>> Let me verify that this indeed is the root cause. However, since
>> move_pages12 migrated anon hugetlb pages it certainly does look to be
>> the case.
>>
>>> I intended to send a patch, passing TTU_RMAP_LOCKED only in the
>>> !PageAnon case (and, see vma_adjust(), anon_vma lock conveniently
>>> nests inside i_mmap_rwsem); but then wondered if i_mmap_rwsem was
>>> needed in that case or not, so looked deeper into c0d0381ade79.
>>>
>>> Hmm, not even you liked it! But the worst of it looks simply
>>> unnecessary to me, and I hope can be deleted - but better by you
>>> than by me (in particular, you were trying to kill 1) and 2) birds
>>> with one stone, and I've always given up on understanding hugetlb's
>>> reservations: I suspect that side of it is irrelevant here,
>>> but I wouldn't pretend to be sure).
>>>
>>> How could you ever find a PageAnon page in a vma_shareable() area?
>>>
>>> It is all rather confusing (vma_shareable() depending on VM_MAYSHARE,
>>> whereas is_cow_mapping() on VM_SHARED and VM_MAYWRITE: they have to
>>> be studied together with do_mmap()'s
>>> vm_flags |= VM_SHARED | VM_MAYSHARE;
>>> if (!(file->f_mode & FMODE_WRITE))
>>> vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
>>>
>>> (And let me add to the confusion by admitting that, prior to 3.15's
>>> cda540ace6a1 "mm: get_user_pages(write,force) refuse to COW in
>>> shared areas", maybe it was possible to find a PageAnon there.)
>>>
>>> But my belief (best confirmed by you running your tests with a
>>> suitably placed BUG_ON or WARN_ON) is that you'll never find a
>>> PageAnon in a vma_shareable() area, so will never need try_to_unmap()
>>> to unshare a pagetable in the PageAnon case, so won't need i_mmap_rwsem
>>> for PageAnon there, and _get_hugetlb_page_mapping() (your function that
>>> deduces an address_space from an anon_vma) can just be deleted.
>>
>> Yes, it is confusing. Let me look into this. I would be really happy
>> to delete that ugly function.
>>
>>> (And in passing, may I ask what hugetlb_page_mapping_lock_write()'s
>>> hpage->_mapcount inc and dec are for? You comment it as a hack,
>>> but don't explain what needs that hack, and I don't see it.)
>>
>> We are trying to lock the mapping (mapping->i_mmap_rwsem). We know
>> mapping is valid, because we obtained it from page_mapping() and it
>> will remain valid because we have the page locked. Page needs to be
>> unlocked to unmap. However, we have to drop page lock in order to
>> acquire i_mmap_rwsem. Once we drop page lock, mapping could become
>> invalid. So, the code artificially incs the mapcount so that the mapping
>> will remain valid when unmapping the page.
>
> No, unless you can point me to some other hugetlbfs-does-it-differently
> (and I didn't see it there in that commit), raising _mapcount does not
> provide any such protection; but does add the possibility of a
> "BUG: Bad page cache" and leak from unaccount_page_cache_page().
>
> Earlier in the day I was trying to work out what to recommend instead,
> but had to turn aside to something else: I'll try again tomorrow.
>
> It's a problem I've faced before in tmpfs, keeping a hold on the
> mapping while page lock is dropped. Quite awkward: igrab() looks as
> if it's the right thing to use, but turns out to give no protection
> against umount. Last time around, I ended up with a stop_eviction
> count in the shmem inode, which shmem_evict_inode() waits on if
> necessary. Something like that could be done for hugetlbfs too,
> but I'd prefer to do it without adding extra, if there is a way.
Thanks.
>>
>> As mentioned above, I hope all this can be removed.
>
> If you continue to nest page lock inside i_mmap_rwsem for hugetlbfs,
> then I think that part of hugetlb_page_mapping_lock_write() has to
> remain. I'd much prefer that hugetlbfs did not reverse the usual
> nesting, but accept that you had reasons for doing it that way.
Yes, that is necessary with the change to lock order.
Yesterday I found another issue/case to handle in the hugetlb COW code
caused by the difference in lock nesting. As a result, I am rethinking
the decision to change the locking order.
The primary reason for changing the lock order was to make the hugetlb
page fault code not have to worry about pte pointers changing. The issue
is that the pte pointer you get (or create) while walking the table
without the page table lock could go away due to unsharing the PMD. We
can walk the table again after acquiring the lock and validate and possibly
retry. However, the backout code in this area, which previously had to
deal with truncation as well, was quite fragile and did not work in all
corner cases. This was mostly due to lovely huge page reservations.
I thought that adding more complexity to the backout code was going to
cause more issues. Changing the locking order eliminated the pte pointer
race as well as truncation. However, it created new locking issues. :(
In parallel to working the locking issue, I am also exploring enhanced
backout code to handle all the needed cases.
--
Mike Kravetz
On 10/9/20 3:23 PM, Hugh Dickins wrote:
> On Fri, 9 Oct 2020, Mike Kravetz wrote:
>> On 10/8/20 10:50 PM, Hugh Dickins wrote:
>>>
>>> It's a problem I've faced before in tmpfs, keeping a hold on the
>>> mapping while page lock is dropped. Quite awkward: igrab() looks as
>>> if it's the right thing to use, but turns out to give no protection
>>> against umount. Last time around, I ended up with a stop_eviction
>>> count in the shmem inode, which shmem_evict_inode() waits on if
>>> necessary. Something like that could be done for hugetlbfs too,
>>> but I'd prefer to do it without adding extra, if there is a way.
>>
>> Thanks.
>
> I failed to come up with anything neater than a stop_eviction count
> in the hugetlbfs inode.
>
> But that's not unlike a very special purpose rwsem added into the
> hugetlbfs inode: and now that you're reconsidering how i_mmap_rwsem
> got repurposed, perhaps you will end up with an rwsem of your own in
> the hugetlbfs inode.
We have this in the Oracle UEK kernel.
https://github.com/oracle/linux-uek/commit/89260f55df008bb01c841714fb2ad26c300c8c88
Usage has been expanded to cover more cases.
When I proposed the same upstream, the suggestion was to try and use
i_mmap_rwsem. That is what I have been trying to do. Certainly, a
hugetlbfs specific rwsem is easier to manage and less disruptive.
> So I won't distract you with a stop_eviction patch unless you ask:
> that's easy, what you're looking into is hard - good luck!
Thanks Hugh!
>> In parallel to working the locking issue, I am also exploring enhanced
>> backout code to handle all the needed cases.
I'm now confident that we need this or some other locking in place. Why?
I went back to a code base before the locking changes. My idea was to check
for races and back out changes as necessary. Page fault handling code will
do something like this:
    ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
    ... do some stuff, possibly allocate a page ...
    ptl = huge_pte_lock(h, mm, ptep);
Because of pmd sharing, we cannot be sure the ptep is valid until
after holding the ptl. So, I was going to add something like this
after obtaining the ptl.
    while (ptep != huge_pte_offset(mm, haddr, huge_page_size(h))) {
        spin_unlock(ptl);
        ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
        if (!ptep) {
            ret = VM_FAULT_OOM;
            goto backout_unlocked;
        }
        ptl = huge_pte_lock(h, mm, ptep);
    }
Unfortunately, the problem is worse than I thought. I knew the PMD
pointed to by the ptep could be unshared before attempting to acquire
the ptl. However, it is possible that the page which was the PMD may
be repurposed before we even try to acquire the ptl. Since the ptl is
a spinlock within the struct page of what was the PMD, it may no longer
be valid. I actually experienced addressing exceptions in the spinlock
code within huge_pte_lock. :(
So, it seems that we need some way to prevent PMD unsharing between the
huge_pte_alloc() and huge_pte_lock(). It is going to have to be
i_mmap_rwsem or some other rwsem.
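Roughly, the fault path would then do something like this (a sketch only;
lock mode and exact placement still to be worked out):

    mapping = vma->vm_file->f_mapping;
    i_mmap_lock_read(mapping);      /* keep huge_pmd_unshare() away */
    ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
    if (!ptep) {
        i_mmap_unlock_read(mapping);
        return VM_FAULT_OOM;
    }
    /* ... existing allocation/lookup work ... */
    ptl = huge_pte_lock(h, mm, ptep);
    /* the ptep and its PMD page stay valid while i_mmap_rwsem is held */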
--
Mike Kravetz
On Fri, 9 Oct 2020, Mike Kravetz wrote:
> On 10/8/20 10:50 PM, Hugh Dickins wrote:
> >
> > It's a problem I've faced before in tmpfs, keeping a hold on the
> > mapping while page lock is dropped. Quite awkward: igrab() looks as
> > if it's the right thing to use, but turns out to give no protection
> > against umount. Last time around, I ended up with a stop_eviction
> > count in the shmem inode, which shmem_evict_inode() waits on if
> > necessary. Something like that could be done for hugetlbfs too,
> > but I'd prefer to do it without adding extra, if there is a way.
>
> Thanks.
I failed to come up with anything neater than a stop_eviction count
in the hugetlbfs inode.
But that's not unlike a very special purpose rwsem added into the
hugetlbfs inode: and now that you're reconsidering how i_mmap_rwsem
got repurposed, perhaps you will end up with an rwsem of your own in
the hugetlbfs inode.
So I won't distract you with a stop_eviction patch unless you ask:
that's easy, what you're looking into is hard - good luck!
Hugh
> >>
> >> As mentioned above, I hope all this can be removed.
> >
> > If you continue to nest page lock inside i_mmap_rwsem for hugetlbfs,
> > then I think that part of hugetlb_page_mapping_lock_write() has to
> > remain. I'd much prefer that hugetlbfs did not reverse the usual
> > nesting, but accept that you had reasons for doing it that way.
>
> Yes, that is necessary with the change to lock order.
>
> Yesterday I found another issue/case to handle in the hugetlb COW code
> caused by the difference in lock nesting. As a result, I am rethinking
> the decision to change the locking order.
>
> The primary reason for changing the lock order was to make the hugetlb
> page fault code not have to worry about pte pointers changing. The issue
> is that the pte pointer you get (or create) while walking the table
> without the page table lock could go away due to unsharing the PMD. We
> can walk the table again after acquiring the lock and validate and possibly
> retry. However, the backout code in this area, which previously had to
> deal with truncation as well, was quite fragile and did not work in all
> corner cases. This was mostly due to lovely huge page reservations.
> I thought that adding more complexity to the backout code was going to
> cause more issues. Changing the locking order eliminated the pte pointer
> race as well as truncation. However, it created new locking issues. :(
>
> In parallel to working the locking issue, I am also exploring enhanced
> backout code to handle all the needed cases.
> --
> Mike Kravetz