Changelog
---------
v3
- Merged with linux-next, which contains a clean-up patch from Jason;
  therefore this series is reduced by two patches which did the same
  thing.
v2
- Addressed all review comments
- Added Reviewed-by's.
- Renamed PF_MEMALLOC_NOMOVABLE to PF_MEMALLOC_PIN
- Added is_pinnable_page() to check if page can be longterm pinned
- Fixed the gup fast path by checking is_pinnable_page()
- Renamed cma_page_list to movable_page_list
- Added an admin-guide note about handling pinned pages in ZONE_MOVABLE,
  and updated the caveat about pinned pages in linux/mmzone.h
- Moved current_gfp_context() to the fast path
---------
When a page is pinned, it cannot be moved and its physical address stays
the same until the page is unpinned.
This functionality is useful because it allows userland to implement DMA
access. For example, it is used by vfio in vfio_pin_pages().
However, this functionality breaks the memory hotplug/hotremove assumption
that pages in ZONE_MOVABLE can always be migrated.
This patch series fixes the issue by forcing new allocations during
page pinning to omit ZONE_MOVABLE, and by migrating any existing
pages out of ZONE_MOVABLE during pinning.
It uses the same logic that is currently used by CMA, and extends
the functionality to all allocations.
For more information, read the discussion [1] about this problem.
[1] https://lore.kernel.org/lkml/CA+CK2bBffHBxjmb9jmSKacm0fJMinyt3Nhk8Nx6iudcQSj80_w@mail.gmail.com
Previous versions:
v1
https://lore.kernel.org/lkml/[email protected]
v2
https://lore.kernel.org/lkml/[email protected]
Pavel Tatashin (6):
mm/gup: don't pin migrated cma pages in movable zone
mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN
mm: apply per-task gfp constraints in fast path
mm: honor PF_MEMALLOC_PIN for all movable pages
mm/gup: migrate pinned pages out of movable zone
memory-hotplug.rst: add a note about ZONE_MOVABLE and page pinning
.../admin-guide/mm/memory-hotplug.rst | 9 +++
include/linux/migrate.h | 1 +
include/linux/mm.h | 11 +++
include/linux/mmzone.h | 11 ++-
include/linux/sched.h | 2 +-
include/linux/sched/mm.h | 27 +++----
include/trace/events/migrate.h | 3 +-
mm/gup.c | 72 ++++++++-----------
mm/hugetlb.c | 4 +-
mm/page_alloc.c | 32 ++++-----
mm/vmscan.c | 10 +--
11 files changed, 95 insertions(+), 87 deletions(-)
--
2.25.1
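For illustration, the long-term pin flow that this series establishes can be
sketched as follows. This is a condensed view of the FOLL_LONGTERM case of
__gup_longterm_locked() after the patches below, not the verbatim code; the
function name here is illustrative only:
static long longterm_pin_flow(struct mm_struct *mm, unsigned long start,
			      unsigned long nr_pages, struct page **pages,
			      struct vm_area_struct **vmas,
			      unsigned int gup_flags)
{
	unsigned int flags;
	long rc;

	/* New allocations in this scope skip ZONE_MOVABLE and CMA pageblocks. */
	flags = memalloc_pin_save();

	rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas, NULL,
				     gup_flags);

	/* Pages that were already faulted into ZONE_MOVABLE/CMA get migrated. */
	if (rc > 0)
		rc = check_and_migrate_movable_pages(mm, start, rc, pages,
						     vmas, gup_flags);

	memalloc_pin_restore(flags);
	return rc;
}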
Function current_gfp_context() is called after the fast path. However, soon we
will add more constraints which will also limit zones based on context.
Move this call into the fast path, and apply the correct constraints for all
allocations.
Also update .reclaim_idx based on the value returned by current_gfp_context(),
because it will soon modify the allowed zones.
Note:
With this patch we will do one extra current->flags load during the fast path.
However, we already load current->flags in the fast path today:
__alloc_pages_nodemask()
prepare_alloc_pages()
current_alloc_flags(gfp_mask, *alloc_flags);
Later, when we add the zone constraint logic to current_gfp_context(), we
will be able to remove the current->flags load from current_alloc_flags(),
and therefore return the fast path to the current performance level.
Suggested-by: Michal Hocko <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
---
mm/page_alloc.c | 15 ++++++++-------
mm/vmscan.c | 10 ++++++----
2 files changed, 14 insertions(+), 11 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ec05396a597b..c2dea9ad0e98 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4976,6 +4976,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
}
gfp_mask &= gfp_allowed_mask;
+ /*
+ * Apply scoped allocation constraints. This is mainly about GFP_NOFS
+ * resp. GFP_NOIO which has to be inherited for all allocation requests
+ * from a particular context which has been marked by
+ * memalloc_no{fs,io}_{save,restore}.
+ */
+ gfp_mask = current_gfp_context(gfp_mask);
alloc_mask = gfp_mask;
if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
return NULL;
@@ -4991,13 +4998,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
if (likely(page))
goto out;
- /*
- * Apply scoped allocation constraints. This is mainly about GFP_NOFS
- * resp. GFP_NOIO which has to be inherited for all allocation requests
- * from a particular context which has been marked by
- * memalloc_no{fs,io}_{save,restore}.
- */
- alloc_mask = current_gfp_context(gfp_mask);
+ alloc_mask = gfp_mask;
ac.spread_dirty_pages = false;
/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 469016222cdb..d9546f5897f4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3234,11 +3234,12 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *nodemask)
{
+ gfp_t current_gfp_mask = current_gfp_context(gfp_mask);
unsigned long nr_reclaimed;
struct scan_control sc = {
.nr_to_reclaim = SWAP_CLUSTER_MAX,
- .gfp_mask = current_gfp_context(gfp_mask),
- .reclaim_idx = gfp_zone(gfp_mask),
+ .gfp_mask = current_gfp_mask,
+ .reclaim_idx = gfp_zone(current_gfp_mask),
.order = order,
.nodemask = nodemask,
.priority = DEF_PRIORITY,
@@ -4158,17 +4159,18 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
{
/* Minimum pages needed in order to stay on node */
const unsigned long nr_pages = 1 << order;
+ gfp_t current_gfp_mask = current_gfp_context(gfp_mask);
struct task_struct *p = current;
unsigned int noreclaim_flag;
struct scan_control sc = {
.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
- .gfp_mask = current_gfp_context(gfp_mask),
+ .gfp_mask = current_gfp_mask,
.order = order,
.priority = NODE_RECLAIM_PRIORITY,
.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
.may_swap = 1,
- .reclaim_idx = gfp_zone(gfp_mask),
+ .reclaim_idx = gfp_zone(current_gfp_mask),
};
trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, order,
--
2.25.1
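The resulting ordering in __alloc_pages_nodemask() can be summarized as
follows (a condensed sketch of the function after this patch; declarations
and unrelated steps are omitted):
	gfp_mask &= gfp_allowed_mask;
	/*
	 * Scoped constraints (GFP_NOFS/GFP_NOIO today, and the upcoming
	 * PF_MEMALLOC_PIN zone restriction) are applied before any attempt.
	 */
	gfp_mask = current_gfp_context(gfp_mask);
	alloc_mask = gfp_mask;
	if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask,
				 &ac, &alloc_mask, &alloc_flags))
		return NULL;

	/* Fast path: the first attempt already sees the constrained mask. */
	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
	if (likely(page))
		goto out;

	/* Slow path: same constrained mask, no second translation needed. */
	alloc_mask = gfp_mask;
	ac.spread_dirty_pages = false;
	page = __alloc_pages_slowpath(alloc_mask, order, &ac);
out:
	/* (memcg accounting and tracepoint omitted) */
	return page;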
In order not to fragment CMA, the pinned pages are migrated. However,
they are migrated to ZONE_MOVABLE, which also should not contain pinned pages.
Remove __GFP_MOVABLE so that pages can be migrated to zones where pinning
is allowed.
Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
mm/gup.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/gup.c b/mm/gup.c
index 0c866af5d96f..87452fcad048 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1565,7 +1565,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
long ret = nr_pages;
struct migration_target_control mtc = {
.nid = NUMA_NO_NODE,
- .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_NOWARN,
+ .gfp_mask = GFP_USER | __GFP_NOWARN,
};
check_again:
--
2.25.1
PF_MEMALLOC_NOCMA is used to guarantee that the allocator will not return
pages that might belong to the CMA region. This is currently used for
long-term gup to make sure that such pins are not going to be done on any
CMA pages.
When PF_MEMALLOC_NOCMA was introduced, we did not realize that it focuses
on CMA pages too much and that there is a larger class of pages that needs
the same treatment. The MOVABLE zone cannot contain any long-term pins
either, so it makes sense to reuse and redefine this flag for that use case
as well. Rename the flag to PF_MEMALLOC_PIN, which defines an allocation
context that can only get pages suitable for long-term pins.
Also rename:
memalloc_nocma_save()/memalloc_nocma_restore()
to
memalloc_pin_save()/memalloc_pin_restore()
and make the new functions common.
Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
---
include/linux/sched.h | 2 +-
include/linux/sched/mm.h | 21 +++++----------------
mm/gup.c | 4 ++--
mm/hugetlb.c | 4 ++--
mm/page_alloc.c | 4 ++--
5 files changed, 12 insertions(+), 23 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e5ad6d354b7b..f3226ef7134f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1576,7 +1576,7 @@ extern struct pid *cad_pid;
#define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
-#define PF_MEMALLOC_NOCMA 0x10000000 /* All allocation request will have _GFP_MOVABLE cleared */
+#define PF_MEMALLOC_PIN 0x10000000 /* All allocation request will have _GFP_MOVABLE cleared */
#define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */
#define PF_SUSPEND_TASK 0x80000000 /* This thread called freeze_processes() and should not be frozen */
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 1ae08b8462a4..5f4dd3274734 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -270,29 +270,18 @@ static inline void memalloc_noreclaim_restore(unsigned int flags)
current->flags = (current->flags & ~PF_MEMALLOC) | flags;
}
-#ifdef CONFIG_CMA
-static inline unsigned int memalloc_nocma_save(void)
+static inline unsigned int memalloc_pin_save(void)
{
- unsigned int flags = current->flags & PF_MEMALLOC_NOCMA;
+ unsigned int flags = current->flags & PF_MEMALLOC_PIN;
- current->flags |= PF_MEMALLOC_NOCMA;
+ current->flags |= PF_MEMALLOC_PIN;
return flags;
}
-static inline void memalloc_nocma_restore(unsigned int flags)
+static inline void memalloc_pin_restore(unsigned int flags)
{
- current->flags = (current->flags & ~PF_MEMALLOC_NOCMA) | flags;
+ current->flags = (current->flags & ~PF_MEMALLOC_PIN) | flags;
}
-#else
-static inline unsigned int memalloc_nocma_save(void)
-{
- return 0;
-}
-
-static inline void memalloc_nocma_restore(unsigned int flags)
-{
-}
-#endif
#ifdef CONFIG_MEMCG
DECLARE_PER_CPU(struct mem_cgroup *, int_active_memcg);
diff --git a/mm/gup.c b/mm/gup.c
index 87452fcad048..007060e66a48 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1671,7 +1671,7 @@ static long __gup_longterm_locked(struct mm_struct *mm,
long rc;
if (gup_flags & FOLL_LONGTERM)
- flags = memalloc_nocma_save();
+ flags = memalloc_pin_save();
rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas, NULL,
gup_flags);
@@ -1680,7 +1680,7 @@ static long __gup_longterm_locked(struct mm_struct *mm,
if (rc > 0)
rc = check_and_migrate_cma_pages(mm, start, rc, pages,
vmas, gup_flags);
- memalloc_nocma_restore(flags);
+ memalloc_pin_restore(flags);
}
return rc;
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3bcc0bc7e02a..012246234eb5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1033,10 +1033,10 @@ static void enqueue_huge_page(struct hstate *h, struct page *page)
static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
{
struct page *page;
- bool nocma = !!(current->flags & PF_MEMALLOC_NOCMA);
+ bool pin = !!(current->flags & PF_MEMALLOC_PIN);
list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
- if (nocma && is_migrate_cma_page(page))
+ if (pin && is_migrate_cma_page(page))
continue;
if (PageHWPoison(page))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 774542e1483e..ec05396a597b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3808,8 +3808,8 @@ static inline unsigned int current_alloc_flags(gfp_t gfp_mask,
#ifdef CONFIG_CMA
unsigned int pflags = current->flags;
- if (!(pflags & PF_MEMALLOC_NOCMA) &&
- gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
+ if (!(pflags & PF_MEMALLOC_PIN) &&
+ gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
alloc_flags |= ALLOC_CMA;
#endif
--
2.25.1
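One property of the renamed helpers worth noting (a usage sketch, not part
of the patch): because the previous state of PF_MEMALLOC_PIN is returned by
the save call and written back by the restore call, the scopes nest
correctly:
static void pin_scope_nesting_demo(void)
{
	unsigned int outer, inner;

	outer = memalloc_pin_save();	/* PF_MEMALLOC_PIN now set           */
	inner = memalloc_pin_save();	/* already set; old state remembered */
	memalloc_pin_restore(inner);	/* flag stays set (outer scope)      */
	memalloc_pin_restore(outer);	/* flag cleared again                */
}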
We should not pin pages in ZONE_MOVABLE. Currently, we avoid pinning only
movable CMA pages. Generalize the function that migrates CMA pages to
migrate all movable pages. Use is_pinnable_page() to check which
pages need to be migrated.
Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
---
include/linux/migrate.h | 1 +
include/linux/mmzone.h | 11 ++++--
include/trace/events/migrate.h | 3 +-
mm/gup.c | 66 ++++++++++++++--------------------
4 files changed, 38 insertions(+), 43 deletions(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 4594838a0f7c..aae5ef0b3ba1 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -27,6 +27,7 @@ enum migrate_reason {
MR_MEMPOLICY_MBIND,
MR_NUMA_MISPLACED,
MR_CONTIG_RANGE,
+ MR_LONGTERM_PIN,
MR_TYPES
};
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b593316bff3d..25c0c13ba4b1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -386,9 +386,14 @@ enum zone_type {
* likely to succeed, and to locally limit unmovable allocations - e.g.,
* to increase the number of THP/huge pages. Notable special cases are:
*
- * 1. Pinned pages: (long-term) pinning of movable pages might
- * essentially turn such pages unmovable. Memory offlining might
- * retry a long time.
+ * 1. Pinned pages: (long-term) pinning of movable pages is avoided
+ * when pages are pinned and faulted, but it is still possible that
+ * address space already has pages in ZONE_MOVABLE at the time when
+ * pages are pinned (i.e. user has touched that memory before
+ * pinning). In such case we try to migrate them to a different zone,
+ * but if migration fails the pages can still end-up pinned in
+ * ZONE_MOVABLE. In such case, memory offlining might retry a long
+ * time and will only succeed once user application unpins pages.
* 2. memblock allocations: kernelcore/movablecore setups might create
* situations where ZONE_MOVABLE contains unmovable allocations
* after boot. Memory offlining and allocations fail early.
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index 4d434398d64d..363b54ce104c 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -20,7 +20,8 @@
EM( MR_SYSCALL, "syscall_or_cpuset") \
EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind") \
EM( MR_NUMA_MISPLACED, "numa_misplaced") \
- EMe(MR_CONTIG_RANGE, "contig_range")
+ EM( MR_CONTIG_RANGE, "contig_range") \
+ EMe(MR_LONGTERM_PIN, "longterm_pin")
/*
* First define the enums in the above macros to be exported to userspace
diff --git a/mm/gup.c b/mm/gup.c
index 007060e66a48..d5e9c459952e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -89,11 +89,12 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page,
int orig_refs = refs;
/*
- * Can't do FOLL_LONGTERM + FOLL_PIN with CMA in the gup fast
- * path, so fail and let the caller fall back to the slow path.
+ * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
+ * right zone, so fail and let the caller fall back to the slow
+ * path.
*/
- if (unlikely(flags & FOLL_LONGTERM) &&
- is_migrate_cma_page(page))
+ if (unlikely((flags & FOLL_LONGTERM) &&
+ !is_pinnable_page(page)))
return NULL;
/*
@@ -1549,19 +1550,18 @@ struct page *get_dump_page(unsigned long addr)
}
#endif /* CONFIG_ELF_CORE */
-#ifdef CONFIG_CMA
-static long check_and_migrate_cma_pages(struct mm_struct *mm,
- unsigned long start,
- unsigned long nr_pages,
- struct page **pages,
- struct vm_area_struct **vmas,
- unsigned int gup_flags)
+static long check_and_migrate_movable_pages(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long nr_pages,
+ struct page **pages,
+ struct vm_area_struct **vmas,
+ unsigned int gup_flags)
{
unsigned long i;
unsigned long step;
bool drain_allow = true;
bool migrate_allow = true;
- LIST_HEAD(cma_page_list);
+ LIST_HEAD(movable_page_list);
long ret = nr_pages;
struct migration_target_control mtc = {
.nid = NUMA_NO_NODE,
@@ -1579,13 +1579,12 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
*/
step = compound_nr(head) - (pages[i] - head);
/*
- * If we get a page from the CMA zone, since we are going to
- * be pinning these entries, we might as well move them out
- * of the CMA zone if possible.
+ * If we get a movable page, since we are going to be pinning
+ * these entries, try to move them out if possible.
*/
- if (is_migrate_cma_page(head)) {
+ if (!is_pinnable_page(head)) {
if (PageHuge(head))
- isolate_huge_page(head, &cma_page_list);
+ isolate_huge_page(head, &movable_page_list);
else {
if (!PageLRU(head) && drain_allow) {
lru_add_drain_all();
@@ -1593,7 +1592,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
}
if (!isolate_lru_page(head)) {
- list_add_tail(&head->lru, &cma_page_list);
+ list_add_tail(&head->lru, &movable_page_list);
mod_node_page_state(page_pgdat(head),
NR_ISOLATED_ANON +
page_is_file_lru(head),
@@ -1605,7 +1604,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
i += step;
}
- if (!list_empty(&cma_page_list)) {
+ if (!list_empty(&movable_page_list)) {
/*
* drop the above get_user_pages reference.
*/
@@ -1615,7 +1614,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
for (i = 0; i < nr_pages; i++)
put_page(pages[i]);
- if (migrate_pages(&cma_page_list, alloc_migration_target, NULL,
+ if (migrate_pages(&movable_page_list, alloc_migration_target, NULL,
(unsigned long)&mtc, MIGRATE_SYNC, MR_CONTIG_RANGE)) {
/*
* some of the pages failed migration. Do get_user_pages
@@ -1623,17 +1622,16 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
*/
migrate_allow = false;
- if (!list_empty(&cma_page_list))
- putback_movable_pages(&cma_page_list);
+ if (!list_empty(&movable_page_list))
+ putback_movable_pages(&movable_page_list);
}
/*
* We did migrate all the pages, Try to get the page references
- * again migrating any new CMA pages which we failed to isolate
- * earlier.
+ * again migrating any pages which we failed to isolate earlier.
*/
ret = __get_user_pages_locked(mm, start, nr_pages,
- pages, vmas, NULL,
- gup_flags);
+ pages, vmas, NULL,
+ gup_flags);
if ((ret > 0) && migrate_allow) {
nr_pages = ret;
@@ -1644,17 +1642,6 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
return ret;
}
-#else
-static long check_and_migrate_cma_pages(struct mm_struct *mm,
- unsigned long start,
- unsigned long nr_pages,
- struct page **pages,
- struct vm_area_struct **vmas,
- unsigned int gup_flags)
-{
- return nr_pages;
-}
-#endif /* CONFIG_CMA */
/*
* __gup_longterm_locked() is a wrapper for __get_user_pages_locked which
@@ -1678,8 +1665,9 @@ static long __gup_longterm_locked(struct mm_struct *mm,
if (gup_flags & FOLL_LONGTERM) {
if (rc > 0)
- rc = check_and_migrate_cma_pages(mm, start, rc, pages,
- vmas, gup_flags);
+ rc = check_and_migrate_movable_pages(mm, start, rc,
+ pages, vmas,
+ gup_flags);
memalloc_pin_restore(flags);
}
return rc;
--
2.25.1
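For context, a driver reaches this code through the regular pin_user_pages*()
API; a minimal caller might look like the following sketch (hypothetical
driver-side code, not part of this series):
static int pin_one_user_page(unsigned long user_addr)
{
	struct page *page;
	int pinned;

	/*
	 * A FOLL_LONGTERM pin ends up in __gup_longterm_locked(), which now
	 * calls check_and_migrate_movable_pages() to move any page failing
	 * is_pinnable_page() (ZONE_MOVABLE or CMA) before the pin is kept.
	 */
	pinned = pin_user_pages_fast(user_addr, 1, FOLL_WRITE | FOLL_LONGTERM,
				     &page);
	if (pinned != 1)
		return pinned < 0 ? pinned : -EFAULT;

	/* ... program the device to DMA to/from this page ... */

	unpin_user_page(page);
	return 0;
}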
PF_MEMALLOC_PIN is only honored for CMA pages. Extend this flag to work
for any allocation from ZONE_MOVABLE by removing __GFP_MOVABLE from the
gfp_mask when this flag is set in the current context.
Add is_pinnable_page() to return true if a page is pinnable.
A pinnable page is not in ZONE_MOVABLE and is not of MIGRATE_CMA type.
Signed-off-by: Pavel Tatashin <[email protected]>
---
include/linux/mm.h | 11 +++++++++++
include/linux/sched/mm.h | 6 +++++-
mm/hugetlb.c | 2 +-
mm/page_alloc.c | 19 ++++++++-----------
4 files changed, 25 insertions(+), 13 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5299b90a6c40..51b3090dd072 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1109,6 +1109,17 @@ static inline bool is_zone_device_page(const struct page *page)
}
#endif
+static inline bool is_zone_movable_page(const struct page *page)
+{
+ return page_zonenum(page) == ZONE_MOVABLE;
+}
+
+/* MIGRATE_CMA and ZONE_MOVABLE do not allow pin pages */
+static inline bool is_pinnable_page(struct page *page)
+{
+ return !is_zone_movable_page(page) && !is_migrate_cma_page(page);
+}
+
#ifdef CONFIG_DEV_PAGEMAP_OPS
void free_devmap_managed_page(struct page *page);
DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 5f4dd3274734..a55277b0d475 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -150,12 +150,13 @@ static inline bool in_vfork(struct task_struct *tsk)
* Applies per-task gfp context to the given allocation flags.
* PF_MEMALLOC_NOIO implies GFP_NOIO
* PF_MEMALLOC_NOFS implies GFP_NOFS
+ * PF_MEMALLOC_PIN implies !GFP_MOVABLE
*/
static inline gfp_t current_gfp_context(gfp_t flags)
{
unsigned int pflags = READ_ONCE(current->flags);
- if (unlikely(pflags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS))) {
+ if (unlikely(pflags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | PF_MEMALLOC_PIN))) {
/*
* NOIO implies both NOIO and NOFS and it is a weaker context
* so always make sure it makes precedence
@@ -164,6 +165,9 @@ static inline gfp_t current_gfp_context(gfp_t flags)
flags &= ~(__GFP_IO | __GFP_FS);
else if (pflags & PF_MEMALLOC_NOFS)
flags &= ~__GFP_FS;
+
+ if (pflags & PF_MEMALLOC_PIN)
+ flags &= ~__GFP_MOVABLE;
}
return flags;
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 012246234eb5..b170ef2e04f5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1036,7 +1036,7 @@ static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
bool pin = !!(current->flags & PF_MEMALLOC_PIN);
list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
- if (pin && is_migrate_cma_page(page))
+ if (pin && !is_pinnable_page(page))
continue;
if (PageHWPoison(page))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c2dea9ad0e98..4d8e7f801c66 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3802,16 +3802,12 @@ alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
return alloc_flags;
}
-static inline unsigned int current_alloc_flags(gfp_t gfp_mask,
- unsigned int alloc_flags)
+static inline unsigned int cma_alloc_flags(gfp_t gfp_mask,
+ unsigned int alloc_flags)
{
#ifdef CONFIG_CMA
- unsigned int pflags = current->flags;
-
- if (!(pflags & PF_MEMALLOC_PIN) &&
- gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
+ if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
alloc_flags |= ALLOC_CMA;
-
#endif
return alloc_flags;
}
@@ -4467,7 +4463,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
} else if (unlikely(rt_task(current)) && !in_interrupt())
alloc_flags |= ALLOC_HARDER;
- alloc_flags = current_alloc_flags(gfp_mask, alloc_flags);
+ alloc_flags = cma_alloc_flags(gfp_mask, alloc_flags);
return alloc_flags;
}
@@ -4769,7 +4765,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
if (reserve_flags)
- alloc_flags = current_alloc_flags(gfp_mask, reserve_flags);
+ alloc_flags = cma_alloc_flags(gfp_mask, reserve_flags);
/*
* Reset the nodemask and zonelist iterators if memory policies can be
@@ -4938,7 +4934,7 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
if (should_fail_alloc_page(gfp_mask, order))
return false;
- *alloc_flags = current_alloc_flags(gfp_mask, *alloc_flags);
+ *alloc_flags = cma_alloc_flags(gfp_mask, *alloc_flags);
/* Dirty zone balancing only done in the fast path */
ac->spread_dirty_pages = (gfp_mask & __GFP_WRITE);
@@ -4980,7 +4976,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
* Apply scoped allocation constraints. This is mainly about GFP_NOFS
* resp. GFP_NOIO which has to be inherited for all allocation requests
* from a particular context which has been marked by
- * memalloc_no{fs,io}_{save,restore}.
+ * memalloc_no{fs,io}_{save,restore}. And PF_MEMALLOC_PIN which ensures
+ * movable zones are not used during allocation.
*/
gfp_mask = current_gfp_context(gfp_mask);
alloc_mask = gfp_mask;
--
2.25.1
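The combined effect on the allocator can be illustrated with a small sketch
(an illustration of the intended behavior, not code from the patch):
static void pin_scope_gfp_demo(void)
{
	unsigned int flags = memalloc_pin_save();
	gfp_t effective = current_gfp_context(GFP_HIGHUSER_MOVABLE);

	/* Inside the scope, __GFP_MOVABLE is stripped from the request ... */
	WARN_ON(effective != (GFP_HIGHUSER_MOVABLE & ~__GFP_MOVABLE));
	/*
	 * ... so such an allocation cannot be placed in ZONE_MOVABLE, and
	 * cma_alloc_flags() will not add ALLOC_CMA because the migratetype
	 * is no longer MIGRATE_MOVABLE.
	 */
	memalloc_pin_restore(flags);
}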
Document the special handling of page pinning when ZONE_MOVABLE is present.
Signed-off-by: Pavel Tatashin <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
---
Documentation/admin-guide/mm/memory-hotplug.rst | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst
index 5c4432c96c4b..c6618f99f765 100644
--- a/Documentation/admin-guide/mm/memory-hotplug.rst
+++ b/Documentation/admin-guide/mm/memory-hotplug.rst
@@ -357,6 +357,15 @@ creates ZONE_MOVABLE as following.
Unfortunately, there is no information to show which memory block belongs
to ZONE_MOVABLE. This is TBD.
+.. note::
+ Techniques that rely on long-term pinnings of memory (especially, RDMA and
+ vfio) are fundamentally problematic with ZONE_MOVABLE and, therefore, memory
+ hot remove. Pinned pages cannot reside on ZONE_MOVABLE, to guarantee that
+ memory can still get hot removed - be aware that pinning can fail even if
+ there is plenty of free memory in ZONE_MOVABLE. In addition, using
+ ZONE_MOVABLE might make page pinning more expensive, because pages have to be
+ migrated off that zone first.
+
.. _memory_hotplug_how_to_offline_memory:
How to offline memory
--
2.25.1
On Fri, Dec 11, 2020 at 03:21:39PM -0500, Pavel Tatashin wrote:
> @@ -1593,7 +1592,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
> }
>
> if (!isolate_lru_page(head)) {
> - list_add_tail(&head->lru, &cma_page_list);
> + list_add_tail(&head->lru, &movable_page_list);
> mod_node_page_state(page_pgdat(head),
> NR_ISOLATED_ANON +
> page_is_file_lru(head),
> @@ -1605,7 +1604,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
> i += step;
> }
>
> - if (!list_empty(&cma_page_list)) {
> + if (!list_empty(&movable_page_list)) {
You didn't answer my earlier question, is it OK that ZONE_MOVABLE
pages leak out here if isolate_lru_page() fails but the
movable_page_list is empty?
I think the answer is no, right?
Jason
On Fri, Dec 11, 2020 at 3:23 PM Jason Gunthorpe <[email protected]> wrote:
>
> On Fri, Dec 11, 2020 at 03:21:39PM -0500, Pavel Tatashin wrote:
> > @@ -1593,7 +1592,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
> > }
> >
> > if (!isolate_lru_page(head)) {
> > - list_add_tail(&head->lru, &cma_page_list);
> > + list_add_tail(&head->lru, &movable_page_list);
> > mod_node_page_state(page_pgdat(head),
> > NR_ISOLATED_ANON +
> > page_is_file_lru(head),
> > @@ -1605,7 +1604,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
> > i += step;
> > }
> >
> > - if (!list_empty(&cma_page_list)) {
> > + if (!list_empty(&movable_page_list)) {
>
> You didn't answer my earlier question, is it OK that ZONE_MOVABLE
> pages leak out here if ioslate_lru_page() fails but the
> moval_page_list is empty?
>
> I think the answer is no, right?
In my opinion it is OK. We are doing our best to not pin movable
pages, but if isolate_lru_page() fails because pages are currently
locked by someone else, we will end up long-term pinning them.
See comment in this patch:
+ * 1. Pinned pages: (long-term) pinning of movable pages is avoided
+ * when pages are pinned and faulted, but it is still possible that
+ * address space already has pages in ZONE_MOVABLE at the time when
+ * pages are pinned (i.e. user has touches that memory before
+ * pinning). In such case we try to migrate them to a different zone,
+ * but if migration fails the pages can still end-up pinned in
+ * ZONE_MOVABLE. In such case, memory offlining might retry a long
+ * time and will only succeed once user application unpins pages.
>
> Jason
On Fri, Dec 11, 2020 at 03:40:57PM -0500, Pavel Tatashin wrote:
> On Fri, Dec 11, 2020 at 3:23 PM Jason Gunthorpe <[email protected]> wrote:
> >
> > On Fri, Dec 11, 2020 at 03:21:39PM -0500, Pavel Tatashin wrote:
> > > @@ -1593,7 +1592,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
> > > }
> > >
> > > if (!isolate_lru_page(head)) {
> > > - list_add_tail(&head->lru, &cma_page_list);
> > > + list_add_tail(&head->lru, &movable_page_list);
> > > mod_node_page_state(page_pgdat(head),
> > > NR_ISOLATED_ANON +
> > > page_is_file_lru(head),
> > > @@ -1605,7 +1604,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
> > > i += step;
> > > }
> > >
> > > - if (!list_empty(&cma_page_list)) {
> > > + if (!list_empty(&movable_page_list)) {
> >
> > You didn't answer my earlier question, is it OK that ZONE_MOVABLE
> > pages leak out here if ioslate_lru_page() fails but the
> > moval_page_list is empty?
> >
> > I think the answer is no, right?
> In my opinion it is OK. We are doing our best to not pin movable
> pages, but if isolate_lru_page() fails because pages are currently
> locked by someone else, we will end up long-term pinning them.
> See comment in this patch:
> + * 1. Pinned pages: (long-term) pinning of movable pages is avoided
> + * when pages are pinned and faulted, but it is still possible that
> + * address space already has pages in ZONE_MOVABLE at the time when
> + * pages are pinned (i.e. user has touches that memory before
> + * pinning). In such case we try to migrate them to a different zone,
> + * but if migration fails the pages can still end-up pinned in
> + * ZONE_MOVABLE. In such case, memory offlining might retry a long
> + * time and will only succeed once user application unpins pages.
It is not "retry a long time" it is "might never complete" because
userspace will hold the DMA pin indefinitely.
Confused what the point of all this is then ??
I thought the goal here is to make memory unplug reliable; if you leave
a hole like this then any hostile userspace can block it forever.
Jason
> Am 11.12.2020 um 22:09 schrieb Pavel Tatashin <[email protected]>:
>
> On Fri, Dec 11, 2020 at 3:46 PM Jason Gunthorpe <[email protected]> wrote:
>>
>>> On Fri, Dec 11, 2020 at 03:40:57PM -0500, Pavel Tatashin wrote:
>>> On Fri, Dec 11, 2020 at 3:23 PM Jason Gunthorpe <[email protected]> wrote:
>>>>
>>>> On Fri, Dec 11, 2020 at 03:21:39PM -0500, Pavel Tatashin wrote:
>>>>> @@ -1593,7 +1592,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
>>>>> }
>>>>>
>>>>> if (!isolate_lru_page(head)) {
>>>>> - list_add_tail(&head->lru, &cma_page_list);
>>>>> + list_add_tail(&head->lru, &movable_page_list);
>>>>> mod_node_page_state(page_pgdat(head),
>>>>> NR_ISOLATED_ANON +
>>>>> page_is_file_lru(head),
>>>>> @@ -1605,7 +1604,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
>>>>> i += step;
>>>>> }
>>>>>
>>>>> - if (!list_empty(&cma_page_list)) {
>>>>> + if (!list_empty(&movable_page_list)) {
>>>>
>>>> You didn't answer my earlier question, is it OK that ZONE_MOVABLE
>>>> pages leak out here if ioslate_lru_page() fails but the
>>>> moval_page_list is empty?
>>>>
>>>> I think the answer is no, right?
>>> In my opinion it is OK. We are doing our best to not pin movable
>>> pages, but if isolate_lru_page() fails because pages are currently
>>> locked by someone else, we will end up long-term pinning them.
>>> See comment in this patch:
>>> + * 1. Pinned pages: (long-term) pinning of movable pages is avoided
>>> + * when pages are pinned and faulted, but it is still possible that
>>> + * address space already has pages in ZONE_MOVABLE at the time when
>>> + * pages are pinned (i.e. user has touches that memory before
>>> + * pinning). In such case we try to migrate them to a different zone,
>>> + * but if migration fails the pages can still end-up pinned in
>>> + * ZONE_MOVABLE. In such case, memory offlining might retry a long
>>> + * time and will only succeed once user application unpins pages.
>>
>> It is not "retry a long time" it is "might never complete" because
>> userspace will hold the DMA pin indefinitely.
>>
>> Confused what the point of all this is then ??
>>
>> I thought to goal here is to make memory unplug reliable, if you leave
>> a hole like this then any hostile userspace can block it forever.
>
> You are right, I used a wording from the previous comment, and it
> should be made clear that pin may be forever. Without these patches it
> is guaranteed that hot-remove will fail if there are pinned pages as
> ZONE_MOVABLE is actually the first to be searched. Now, it will fail
> only due to exceptions listed in ZONE_MOVABLE comment:
>
> 1. pin + migration/isolation failure
Not sure what that really means. We have short-term pinnings (although we might have a better term for „pinning“ here) for example, when a process dies (IIRC). There is a period where pages cannot get migrated and offlining code has to retry (which might take a while). This still applies after your change - are you referring to that?
> 2. memblock allocation due to limited amount of space for kernelcore
> 3. memory holes
> 4. hwpoison
> 5. Unmovable PG_offline pages (? need to study why this is a scenario).
Virtio-mem is the primary user in this context.
> Do you think we should unconditionally unpin pages, and return error
> when isolation/migration fails?
I‘m not sure what you mean here. Who’s supposed to unpin which pages?
>
> Pasha
>
>>
>> Jason
>
On Fri, Dec 11, 2020 at 4:29 PM David Hildenbrand <[email protected]> wrote:
>
>
> > Am 11.12.2020 um 22:09 schrieb Pavel Tatashin <[email protected]>:
> >
> > On Fri, Dec 11, 2020 at 3:46 PM Jason Gunthorpe <[email protected]> wrote:
> >>
> >>> On Fri, Dec 11, 2020 at 03:40:57PM -0500, Pavel Tatashin wrote:
> >>> On Fri, Dec 11, 2020 at 3:23 PM Jason Gunthorpe <[email protected]> wrote:
> >>>>
> >>>> On Fri, Dec 11, 2020 at 03:21:39PM -0500, Pavel Tatashin wrote:
> >>>>> @@ -1593,7 +1592,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
> >>>>> }
> >>>>>
> >>>>> if (!isolate_lru_page(head)) {
> >>>>> - list_add_tail(&head->lru, &cma_page_list);
> >>>>> + list_add_tail(&head->lru, &movable_page_list);
> >>>>> mod_node_page_state(page_pgdat(head),
> >>>>> NR_ISOLATED_ANON +
> >>>>> page_is_file_lru(head),
> >>>>> @@ -1605,7 +1604,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
> >>>>> i += step;
> >>>>> }
> >>>>>
> >>>>> - if (!list_empty(&cma_page_list)) {
> >>>>> + if (!list_empty(&movable_page_list)) {
> >>>>
> >>>> You didn't answer my earlier question, is it OK that ZONE_MOVABLE
> >>>> pages leak out here if ioslate_lru_page() fails but the
> >>>> moval_page_list is empty?
> >>>>
> >>>> I think the answer is no, right?
> >>> In my opinion it is OK. We are doing our best to not pin movable
> >>> pages, but if isolate_lru_page() fails because pages are currently
> >>> locked by someone else, we will end up long-term pinning them.
> >>> See comment in this patch:
> >>> + * 1. Pinned pages: (long-term) pinning of movable pages is avoided
> >>> + * when pages are pinned and faulted, but it is still possible that
> >>> + * address space already has pages in ZONE_MOVABLE at the time when
> >>> + * pages are pinned (i.e. user has touches that memory before
> >>> + * pinning). In such case we try to migrate them to a different zone,
> >>> + * but if migration fails the pages can still end-up pinned in
> >>> + * ZONE_MOVABLE. In such case, memory offlining might retry a long
> >>> + * time and will only succeed once user application unpins pages.
> >>
> >> It is not "retry a long time" it is "might never complete" because
> >> userspace will hold the DMA pin indefinitely.
> >>
> >> Confused what the point of all this is then ??
> >>
> >> I thought to goal here is to make memory unplug reliable, if you leave
> >> a hole like this then any hostile userspace can block it forever.
> >
> > You are right, I used a wording from the previous comment, and it
> > should be made clear that pin may be forever. Without these patches it
> > is guaranteed that hot-remove will fail if there are pinned pages as
> > ZONE_MOVABLE is actually the first to be searched. Now, it will fail
> > only due to exceptions listed in ZONE_MOVABLE comment:
> >
> > 1. pin + migration/isolation failure
>
> Not sure what that really means. We have short-term pinnings (although we might have a better term for „pinning“ here) for example, when a process dies (IIRC). There is a period where pages cannot get migrated and offlining code has to retry (which might take a while). This still applies after your change - are you referring to that?
>
> > 2. memblock allocation due to limited amount of space for kernelcore
> > 3. memory holes
> > 4. hwpoison
> > 5. Unmovable PG_offline pages (? need to study why this is a scenario).
>
> Virtio-mem is the primary user in this context.
>
> > Do you think we should unconditionally unpin pages, and return error
> > when isolation/migration fails?
>
> I‘m not sure what you mean here. Who’s supposed to unpin which pages?
Hi David,
When check_and_migrate_movable_pages() is called, the pages are
already pinned. If some of those pages are in the movable zone and we
fail to migrate or isolate them, what should we do: proceed and keep
this as an exception where the movable zone can actually have pinned
pages, unpin all pages in the array and return an error, or unpin only
the pages in the movable zone and return an error?
Pasha
>
> >
> > Pasha
> >
> >>
> >> Jason
> >
>
> Am 11.12.2020 um 22:36 schrieb Pavel Tatashin <[email protected]>:
>
> On Fri, Dec 11, 2020 at 4:29 PM David Hildenbrand <[email protected]> wrote:
>>
>>
>>>> Am 11.12.2020 um 22:09 schrieb Pavel Tatashin <[email protected]>:
>>>
>>> On Fri, Dec 11, 2020 at 3:46 PM Jason Gunthorpe <[email protected]> wrote:
>>>>
>>>>> On Fri, Dec 11, 2020 at 03:40:57PM -0500, Pavel Tatashin wrote:
>>>>> On Fri, Dec 11, 2020 at 3:23 PM Jason Gunthorpe <[email protected]> wrote:
>>>>>>
>>>>>> On Fri, Dec 11, 2020 at 03:21:39PM -0500, Pavel Tatashin wrote:
>>>>>>> @@ -1593,7 +1592,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
>>>>>>> }
>>>>>>>
>>>>>>> if (!isolate_lru_page(head)) {
>>>>>>> - list_add_tail(&head->lru, &cma_page_list);
>>>>>>> + list_add_tail(&head->lru, &movable_page_list);
>>>>>>> mod_node_page_state(page_pgdat(head),
>>>>>>> NR_ISOLATED_ANON +
>>>>>>> page_is_file_lru(head),
>>>>>>> @@ -1605,7 +1604,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
>>>>>>> i += step;
>>>>>>> }
>>>>>>>
>>>>>>> - if (!list_empty(&cma_page_list)) {
>>>>>>> + if (!list_empty(&movable_page_list)) {
>>>>>>
>>>>>> You didn't answer my earlier question, is it OK that ZONE_MOVABLE
>>>>>> pages leak out here if ioslate_lru_page() fails but the
>>>>>> moval_page_list is empty?
>>>>>>
>>>>>> I think the answer is no, right?
>>>>> In my opinion it is OK. We are doing our best to not pin movable
>>>>> pages, but if isolate_lru_page() fails because pages are currently
>>>>> locked by someone else, we will end up long-term pinning them.
>>>>> See comment in this patch:
>>>>> + * 1. Pinned pages: (long-term) pinning of movable pages is avoided
>>>>> + * when pages are pinned and faulted, but it is still possible that
>>>>> + * address space already has pages in ZONE_MOVABLE at the time when
>>>>> + * pages are pinned (i.e. user has touches that memory before
>>>>> + * pinning). In such case we try to migrate them to a different zone,
>>>>> + * but if migration fails the pages can still end-up pinned in
>>>>> + * ZONE_MOVABLE. In such case, memory offlining might retry a long
>>>>> + * time and will only succeed once user application unpins pages.
>>>>
>>>> It is not "retry a long time" it is "might never complete" because
>>>> userspace will hold the DMA pin indefinitely.
>>>>
>>>> Confused what the point of all this is then ??
>>>>
>>>> I thought to goal here is to make memory unplug reliable, if you leave
>>>> a hole like this then any hostile userspace can block it forever.
>>>
>>> You are right, I used a wording from the previous comment, and it
>>> should be made clear that pin may be forever. Without these patches it
>>> is guaranteed that hot-remove will fail if there are pinned pages as
>>> ZONE_MOVABLE is actually the first to be searched. Now, it will fail
>>> only due to exceptions listed in ZONE_MOVABLE comment:
>>>
>>> 1. pin + migration/isolation failure
>>
>> Not sure what that really means. We have short-term pinnings (although we might have a better term for „pinning“ here) for example, when a process dies (IIRC). There is a period where pages cannot get migrated and offlining code has to retry (which might take a while). This still applies after your change - are you referring to that?
>>
>>> 2. memblock allocation due to limited amount of space for kernelcore
>>> 3. memory holes
>>> 4. hwpoison
>>> 5. Unmovable PG_offline pages (? need to study why this is a scenario).
>>
>> Virtio-mem is the primary user in this context.
>>
>>> Do you think we should unconditionally unpin pages, and return error
>>> when isolation/migration fails?
>>
>> I‘m not sure what you mean here. Who’s supposed to unpin which pages?
>
> Hi David,
>
> When check_and_migrate_movable_pages() is called, the pages are
> already pinned. If some of those pages are in movable zone, and we
> fail to migrate or isolate them what should we do: proceed, and keep
> it as exception of when movable zone can actually have pinned pages or
> unpin all pages in the array, and return an error, or unpin only pages
> in movable zone, and return an error?
>
I guess revert what we did (unpin) and return an error. The interesting question is what can make migration/isolation fail
a) out of memory: smells like a zone setup issue. Failures are acceptable I guess.
b) short term pinnings: process dying - not relevant I guess. Other cases? (Fork?)
c) ?
Once we clarified that, we actually know how likely it will be to return an error (and making vfio pinnings fail etc).
> Pasha
> I guess revert what we did (unpin) and return an error. The interesting question is what can make migration/isolation fail
OK. I will make the necessary changes. Let's handle errors properly.
Whatever the cause of the error, we will know it when it happens and an
error is returned. I think I will add a 10-time retry instead of the
infinite retry that we currently have; a 10-time retry is what we
already use in the hot-remove path.
>
> a) out of memory: smells like a zone setup issue. Failures are acceptable I guess.
>
> b) short term pinnings: process dying - not relevant I guess. Other cases? (Fork?)
>
> c) ?
>
> Once we clarified that, we actually know how likely it will be to return an error (and making vfio pinnings fail etc).
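A rough sketch of the bounded-retry idea mentioned above (the helper name,
retry-count placement, and return-value convention are placeholders, not
code from the series): assume a hypothetical migrate_unpinnable_pages()
returns 0 once every page in the array passes is_pinnable_page(), -EAGAIN
when isolation/migration should be retried, and another negative errno on a
hard failure.
#define PIN_MIGRATE_RETRIES	10	/* placeholder for the "10-time retry" */

static int make_pages_pinnable(struct page **pages, unsigned long nr_pages)
{
	int tries, ret = -EAGAIN;

	for (tries = 0; tries < PIN_MIGRATE_RETRIES && ret == -EAGAIN; tries++)
		ret = migrate_unpinnable_pages(pages, nr_pages); /* hypothetical */

	if (ret) {
		/* Give up: drop the pins and report the failure to the caller. */
		unpin_user_pages(pages, nr_pages);
	}
	return ret;
}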
On Fri, Dec 11, 2020 at 10:53:00PM +0100, David Hildenbrand wrote:
> > When check_and_migrate_movable_pages() is called, the pages are
> > already pinned. If some of those pages are in movable zone, and we
> > fail to migrate or isolate them what should we do: proceed, and
> > keep it as exception of when movable zone can actually have pinned
> > pages or unpin all pages in the array, and return an error, or
> > unpin only pages in movable zone, and return an error?
>
> I guess revert what we did (unpin) and return an error. The
> interesting question is what can make migration/isolation fail
>
> a) out of memory: smells like a zone setup issue. Failures are acceptable I guess.
Out of memory is reasonable..
> b) short term pinnings: process dying - not relevant I guess. Other cases? (Fork?)
Concurrent non-longterm GUP users are less reasonable, fork is
not reasonable, etc.
Racing with another GUP in another thread is also not reasonable, so
failing to isolate can't be a failure
Jason
On Fri, Dec 11, 2020 at 3:46 PM Jason Gunthorpe <[email protected]> wrote:
>
> On Fri, Dec 11, 2020 at 03:40:57PM -0500, Pavel Tatashin wrote:
> > On Fri, Dec 11, 2020 at 3:23 PM Jason Gunthorpe <[email protected]> wrote:
> > >
> > > On Fri, Dec 11, 2020 at 03:21:39PM -0500, Pavel Tatashin wrote:
> > > > @@ -1593,7 +1592,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
> > > > }
> > > >
> > > > if (!isolate_lru_page(head)) {
> > > > - list_add_tail(&head->lru, &cma_page_list);
> > > > + list_add_tail(&head->lru, &movable_page_list);
> > > > mod_node_page_state(page_pgdat(head),
> > > > NR_ISOLATED_ANON +
> > > > page_is_file_lru(head),
> > > > @@ -1605,7 +1604,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
> > > > i += step;
> > > > }
> > > >
> > > > - if (!list_empty(&cma_page_list)) {
> > > > + if (!list_empty(&movable_page_list)) {
> > >
> > > You didn't answer my earlier question, is it OK that ZONE_MOVABLE
> > > pages leak out here if ioslate_lru_page() fails but the
> > > moval_page_list is empty?
> > >
> > > I think the answer is no, right?
> > In my opinion it is OK. We are doing our best to not pin movable
> > pages, but if isolate_lru_page() fails because pages are currently
> > locked by someone else, we will end up long-term pinning them.
> > See comment in this patch:
> > + * 1. Pinned pages: (long-term) pinning of movable pages is avoided
> > + * when pages are pinned and faulted, but it is still possible that
> > + * address space already has pages in ZONE_MOVABLE at the time when
> > + * pages are pinned (i.e. user has touches that memory before
> > + * pinning). In such case we try to migrate them to a different zone,
> > + * but if migration fails the pages can still end-up pinned in
> > + * ZONE_MOVABLE. In such case, memory offlining might retry a long
> > + * time and will only succeed once user application unpins pages.
>
> It is not "retry a long time" it is "might never complete" because
> userspace will hold the DMA pin indefinitely.
>
> Confused what the point of all this is then ??
>
> I thought to goal here is to make memory unplug reliable, if you leave
> a hole like this then any hostile userspace can block it forever.
You are right, I used wording from the previous comment, and it should
be made clear that the pin may last forever. Without these patches it is
guaranteed that hot-remove will fail if there are pinned pages, as
ZONE_MOVABLE is actually the first zone to be searched. Now, it will
fail only due to the exceptions listed in the ZONE_MOVABLE comment:
1. pin + migration/isolation failure
2. memblock allocation due to limited amount of space for kernelcore
3. memory holes
4. hwpoison
5. Unmovable PG_offline pages (? need to study why this is a scenario).
Do you think we should unconditionally unpin pages, and return error
when isolation/migration fails?
Pasha
>
> Jason
On 12/11/20 3:00 PM, Pavel Tatashin wrote:
>> I guess revert what we did (unpin) and return an error. The interesting question is what can make migration/isolation fail
>
> OK. I will make the necessary changes. Let's handle errors properly.
> Whatever the cause for the error, we will know it when it happens, and
> when error is returned. I think I will add a 10-time retry instead of
> the infinite retry that we currently have. The 10-times retry we
> currently have during the hot-remove path.
It occurs to me that maybe the pre-existing infinite loop shouldn't be
there at all? Would it be better to let the callers retry? Obviously that
would be a separate patch and I'm not sure it's safe to make that change,
but the current loop seems buried maybe too far down.
Thoughts, anyone?
thanks,
--
John Hubbard
NVIDIA
> Am 12.12.2020 um 00:50 schrieb Jason Gunthorpe <[email protected]>:
>
> On Fri, Dec 11, 2020 at 10:53:00PM +0100, David Hildenbrand wrote:
>
>>> When check_and_migrate_movable_pages() is called, the pages are
>>> already pinned. If some of those pages are in movable zone, and we
>>> fail to migrate or isolate them what should we do: proceed, and
>>> keep it as exception of when movable zone can actually have pinned
>>> pages or unpin all pages in the array, and return an error, or
>>> unpin only pages in movable zone, and return an error?
>>
>> I guess revert what we did (unpin) and return an error. The
>> interesting question is what can make migration/isolation fail
>>
>> a) out of memory: smells like a zone setup issue. Failures are acceptable I guess.
>
> Out of memory is reasonable..
>
>> b) short term pinnings: process dying - not relevant I guess. Other cases? (Fork?)
>
> Concurrent with non-longterm GUP users are less reasonable, fork is
> not reasonable, etc..
Concurrent alloc_contig_range(), memory offlining, compaction .. where we migrate pages? Any experts on racing page migration in these scenarios?
(Also wondering what would happen if we are just about to swap)
>
> Racing with another GUP in another thread is also not reasonable, so
> failing to isolate can't be a failure
Having VMs with multiple vfio containers is certainly realistic, and optimizing in user space to do vfio mappings concurrently doesn‘t sound too crazy to me. But I haven‘t checked if vfio common code already handles such concurrency.
>
> Jasnon
>
On Sat, Dec 12, 2020 at 08:29:11AM +0100, David Hildenbrand wrote:
> > Racing with another GUP in another thread is also not reasonable, so
> > failing to isolate can't be a failure
>
> Having VMs with multiple vfio containers is certainly realistic, and
> optimizing in user space to do vfio mappings concurrently doesn‘t
> sound too crazy to me. But I haven‘t checked if vfio common code
> already handles such concurrency.
There is a lot more out there than vfio.. RDMA already does concurrent
pin_user_pages in real apps
Jason
On Fri 11-12-20 15:21:36, Pavel Tatashin wrote:
> PF_MEMALLOC_NOCMA is used ot guarantee that the allocator will not return
> pages that might belong to CMA region. This is currently used for long
> term gup to make sure that such pins are not going to be done on any CMA
> pages.
>
> When PF_MEMALLOC_NOCMA has been introduced we haven't realized that it is
> focusing on CMA pages too much and that there is larger class of pages that
> need the same treatment. MOVABLE zone cannot contain any long term pins as
> well so it makes sense to reuse and redefine this flag for that usecase as
> well. Rename the flag to PF_MEMALLOC_PIN which defines an allocation
> context which can only get pages suitable for long-term pins.
>
> Also re-name:
> memalloc_nocma_save()/memalloc_nocma_restore
> to
> memalloc_pin_save()/memalloc_pin_restore()
> and make the new functions common.
>
> Signed-off-by: Pavel Tatashin <[email protected]>
> Reviewed-by: John Hubbard <[email protected]>
Acked-by: Michal Hocko <[email protected]>
with one comment below
[...]
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1576,7 +1576,7 @@ extern struct pid *cad_pid;
> #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
> #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
> #define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
> -#define PF_MEMALLOC_NOCMA 0x10000000 /* All allocation request will have _GFP_MOVABLE cleared */
> +#define PF_MEMALLOC_PIN 0x10000000 /* All allocation request will have _GFP_MOVABLE cleared */
This comment is not really helpful. I would go with
/* Allocation context constrained to zones which allow long term
* pinning.
*/
Something similar would be useful for memalloc_pin* functions as well.
--
Michal Hocko
SUSE Labs
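For clarity, the flag definition with the suggested comment wording would
look roughly like this (a sketch of the suggestion above, not a posted
patch):
#define PF_MEMALLOC_PIN		0x10000000	/* Allocation context constrained to
						   zones which allow long term pinning */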
On Fri 11-12-20 15:21:37, Pavel Tatashin wrote:
> Function current_gfp_context() is called after fast path. However, soon we
> will add more constraints which will also limit zones based on context.
> Move this call into fast path, and apply the correct constraints for all
> allocations.
>
> Also update .reclaim_idx based on value returned by current_gfp_context()
> because it soon will modify the allowed zones.
>
> Note:
> With this patch we will do one extra current->flags load during fast path,
> but we already load current->flags in fast-path:
>
> __alloc_pages_nodemask()
> prepare_alloc_pages()
> current_alloc_flags(gfp_mask, *alloc_flags);
>
> Later, when we add the zone constrain logic to current_gfp_context() we
> will be able to remove current->flags load from current_alloc_flags, and
> therefore return fast-path to the current performance level.
>
> Suggested-by: Michal Hocko <[email protected]>
> Signed-off-by: Pavel Tatashin <[email protected]>
> ---
> mm/page_alloc.c | 15 ++++++++-------
> mm/vmscan.c | 10 ++++++----
> 2 files changed, 14 insertions(+), 11 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ec05396a597b..c2dea9ad0e98 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4976,6 +4976,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
> }
>
> gfp_mask &= gfp_allowed_mask;
> + /*
> + * Apply scoped allocation constraints. This is mainly about GFP_NOFS
> + * resp. GFP_NOIO which has to be inherited for all allocation requests
> + * from a particular context which has been marked by
> + * memalloc_no{fs,io}_{save,restore}.
> + */
> + gfp_mask = current_gfp_context(gfp_mask);
> alloc_mask = gfp_mask;
> if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
> return NULL;
> @@ -4991,13 +4998,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
> if (likely(page))
> goto out;
>
> - /*
> - * Apply scoped allocation constraints. This is mainly about GFP_NOFS
> - * resp. GFP_NOIO which has to be inherited for all allocation requests
> - * from a particular context which has been marked by
> - * memalloc_no{fs,io}_{save,restore}.
> - */
> - alloc_mask = current_gfp_context(gfp_mask);
> + alloc_mask = gfp_mask;
> ac.spread_dirty_pages = false;
>
> /*
Ack to this.
But I do not really understand this. All allocation contexts should have
a proper gfp mask, so why do we have to call current_gfp_context here?
In fact, moving current_gfp_context into the allocator path should have
made all these games unnecessary. The memcg reclaim path might need some
careful checking because the gfp mask is used more creatively there, but
the general reclaim paths should be ok.
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 469016222cdb..d9546f5897f4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3234,11 +3234,12 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> gfp_t gfp_mask, nodemask_t *nodemask)
> {
> + gfp_t current_gfp_mask = current_gfp_context(gfp_mask);
> unsigned long nr_reclaimed;
> struct scan_control sc = {
> .nr_to_reclaim = SWAP_CLUSTER_MAX,
> - .gfp_mask = current_gfp_context(gfp_mask),
> - .reclaim_idx = gfp_zone(gfp_mask),
> + .gfp_mask = current_gfp_mask,
> + .reclaim_idx = gfp_zone(current_gfp_mask),
> .order = order,
> .nodemask = nodemask,
> .priority = DEF_PRIORITY,
> @@ -4158,17 +4159,18 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
> {
> /* Minimum pages needed in order to stay on node */
> const unsigned long nr_pages = 1 << order;
> + gfp_t current_gfp_mask = current_gfp_context(gfp_mask);
> struct task_struct *p = current;
> unsigned int noreclaim_flag;
> struct scan_control sc = {
> .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> - .gfp_mask = current_gfp_context(gfp_mask),
> + .gfp_mask = current_gfp_mask,
> .order = order,
> .priority = NODE_RECLAIM_PRIORITY,
> .may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
> .may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
> .may_swap = 1,
> - .reclaim_idx = gfp_zone(gfp_mask),
> + .reclaim_idx = gfp_zone(current_gfp_mask),
Again, why do we need this when the gfp_mask
> };
>
> trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, order,
> --
> 2.25.1
--
Michal Hocko
SUSE Labs
On Fri 11-12-20 15:21:38, Pavel Tatashin wrote:
[...]
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c2dea9ad0e98..4d8e7f801c66 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3802,16 +3802,12 @@ alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
> return alloc_flags;
> }
>
> -static inline unsigned int current_alloc_flags(gfp_t gfp_mask,
> - unsigned int alloc_flags)
> +static inline unsigned int cma_alloc_flags(gfp_t gfp_mask,
> + unsigned int alloc_flags)
Do you have any strong reason to rename? Even though the current
implementation only does something for cma I do not think this is all
that important. The naming nicely fits with current_gfp_context so I
would stick with it.
Other than that the patch looks reasonable. I would just add a comment
explaining that current_alloc_flags should be called _after_
current_gfp_context because that one might change the gfp_mask.
With that addressed, feel free to add
Acked-by: Michal Hocko <[email protected]>
--
Michal Hocko
SUSE Labs
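Keeping the existing name with the suggested ordering comment would look
roughly like this (a sketch based on the review comment above, not a posted
patch):
/*
 * current_alloc_flags() must be called after current_gfp_context(),
 * because the latter may clear __GFP_MOVABLE and therefore change the
 * result of gfp_migratetype() below.
 */
static inline unsigned int current_alloc_flags(gfp_t gfp_mask,
					       unsigned int alloc_flags)
{
#ifdef CONFIG_CMA
	if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
		alloc_flags |= ALLOC_CMA;
#endif
	return alloc_flags;
}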
On 14.12.20 14:36, Jason Gunthorpe wrote:
> On Sat, Dec 12, 2020 at 08:29:11AM +0100, David Hildenbrand wrote:
>
>>> Racing with another GUP in another thread is also not reasonable, so
>>> failing to isolate can't be a failure
>>
>> Having VMs with multiple vfio containers is certainly realistic, and
>> optimizing in user space to do vfio mappings concurrently doesn‘t
>> sound too crazy to me. But I haven‘t checked if vfio common code
>> already handles such concurrency.
>
> There is a lot more out there than vfio.. RDMA already does concurrent
> pin_user_pages in real apps
I actually misread your comment. I think we both agree that temporary
isolation failures must not lead to a failure.
--
Thanks,
David / dhildenb
On Fri 11-12-20 15:21:39, Pavel Tatashin wrote:
[...]
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index b593316bff3d..25c0c13ba4b1 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -386,9 +386,14 @@ enum zone_type {
> * likely to succeed, and to locally limit unmovable allocations - e.g.,
> * to increase the number of THP/huge pages. Notable special cases are:
> *
> - * 1. Pinned pages: (long-term) pinning of movable pages might
> - * essentially turn such pages unmovable. Memory offlining might
> - * retry a long time.
> + * 1. Pinned pages: (long-term) pinning of movable pages is avoided
> + * when pages are pinned and faulted, but it is still possible that
> + * the address space already has pages in ZONE_MOVABLE at the time
> + * the pages are pinned (i.e. the user has touched that memory before
> + * pinning). In such a case we try to migrate them to a different zone,
> + * but if migration fails the pages can still end up pinned in
> + * ZONE_MOVABLE. In that case, memory offlining might retry a long
> + * time and will only succeed once the user application unpins the pages.
I do agree with others in the thread. This is not really helping out the
current situation much. You should simply fail the pin rather than
pretend all is just fine.
--
Michal Hocko
SUSE Labs
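What "fail the pin" could look like in practice, as a purely hypothetical sketch
(pin_or_fail(), collect_movable_pages() and migrate_movable_pages() are
placeholder names, not functions from this series): pages that ended up pinned in
ZONE_MOVABLE or CMA are isolated and migrated, and if that cannot be done the
pins are dropped and an error is returned instead of leaving long-term pins in a
movable zone.

/* Hypothetical sketch only; helper names are placeholders. */
static long pin_or_fail(struct page **pages, long nr_pinned)
{
	LIST_HEAD(movable_page_list);
	long rc;

	rc = collect_movable_pages(&movable_page_list, pages, nr_pinned);
	if (!rc && list_empty(&movable_page_list))
		return nr_pinned;		/* every page is already pinnable */

	/* Drop the pins before migrating or giving up. */
	unpin_user_pages(pages, nr_pinned);
	if (rc || migrate_movable_pages(&movable_page_list))
		return rc ? rc : -ENOMEM;	/* fail rather than keep movable pins */
	return -EAGAIN;				/* let the caller fault and pin again */
}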
On Mon, Dec 14, 2020 at 9:03 AM Michal Hocko <[email protected]> wrote:
>
> On Fri 11-12-20 15:21:36, Pavel Tatashin wrote:
> > PF_MEMALLOC_NOCMA is used to guarantee that the allocator will not return
> > pages that might belong to a CMA region. This is currently used for long
> > term gup to make sure that such pins are not going to be done on any CMA
> > pages.
> >
> > When PF_MEMALLOC_NOCMA was introduced we had not realized that it focuses
> > on CMA pages too much and that there is a larger class of pages that
> > needs the same treatment. The MOVABLE zone cannot contain any long term
> > pins either, so it makes sense to reuse and redefine this flag for that
> > use case as well. Rename the flag to PF_MEMALLOC_PIN, which defines an
> > allocation context which can only get pages suitable for long-term pins.
> >
> > Also rename:
> > memalloc_nocma_save()/memalloc_nocma_restore()
> > to
> > memalloc_pin_save()/memalloc_pin_restore()
> > and make the new functions common.
> >
> > Signed-off-by: Pavel Tatashin <[email protected]>
> > Reviewed-by: John Hubbard <[email protected]>
>
> Acked-by: Michal Hocko <[email protected]>
Thank you.
>
> with one comment below
> [...]
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1576,7 +1576,7 @@ extern struct pid *cad_pid;
> > #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
> > #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
> > #define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
> > -#define PF_MEMALLOC_NOCMA 0x10000000 /* All allocation request will have _GFP_MOVABLE cleared */
> > +#define PF_MEMALLOC_PIN 0x10000000 /* All allocation request will have _GFP_MOVABLE cleared */
>
> This comment is not really helpful. I would go with
> /* Allocation context constrained to zones which allow long term
> * pinning.
> */
>
> Something similar would be useful for memalloc_pin* functions as well.
I will add it.
> --
> Michal Hocko
> SUSE Labs
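For reference, the renamed helpers would follow the existing memalloc_nocma_*()
pattern; a sketch with the kind of comment suggested above (an approximation, not
the final patch):

/*
 * Enter an allocation scope constrained to zones which allow long term
 * pinning; pairs with memalloc_pin_restore().
 */
static inline unsigned int memalloc_pin_save(void)
{
	unsigned int flags = current->flags & PF_MEMALLOC_PIN;

	current->flags |= PF_MEMALLOC_PIN;
	return flags;
}

static inline void memalloc_pin_restore(unsigned int flags)
{
	current->flags = (current->flags & ~PF_MEMALLOC_PIN) | flags;
}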
> Ack to this.
Thank you.
>
> But I do not really understand this. All allocation contexts should have
> a proper gfp mask so why do we have to call current_gfp_context here?
> In fact moving the current_gfp_context in the allocator path should have
> made all these games unnecessary. The memcg reclaim path might need some
> careful checking because the gfp mask is used more creatively there, but the
> general reclaim paths should be ok.
>
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>
> Again, why do we need this when the gfp_mask given to this path should already be properly constrained?
> > };
> >
--
Hi Michal,
Besides __alloc_pages_nodemask(), current_gfp_context() is
called from the following six functions:
try_to_free_pages()
try_to_free_mem_cgroup_pages()
__node_reclaim()
__need_fs_reclaim()
alloc_contig_range()
pcpu_alloc()
As I understand it, the idea is that because the allocator now honors
the per-task gfp context on all paths, the call can be removed from some
of the above functions. I think you are correct. But, at least from a
quick glance, this is not obvious, and it is not the case for all of the
above functions.
For example:
alloc_contig_range()
__alloc_contig_migrate_range
isolate_migratepages_range
isolate_migratepages_block
/*
* Only allow to migrate anonymous pages in GFP_NOFS context
* because those do not depend on fs locks.
*/
if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
goto isolate_fail;
If we remove current_gfp_context() from alloc_contig_range(), the
cc->gfp_mask will not have __GFP_FS cleared in scoped NOFS/NOIO contexts.
I have studied some other paths, and they are also convoluted.
Therefore, I am worried about performing this optimization in this
series.
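For reference, the part of alloc_contig_range() this argument relies on is the
compact_control setup, roughly (a simplified excerpt, not a verbatim copy of
mm/page_alloc.c):

	/* Fold the scoped gfp constraints into the compaction control. */
	struct compact_control cc = {
		.nr_migratepages = 0,
		.order = -1,
		.zone = page_zone(pfn_to_page(start)),
		.mode = MIGRATE_SYNC,
		.ignore_skip_hint = true,
		.no_set_skip_hint = true,
		.gfp_mask = current_gfp_context(gfp_mask),
		.alloc_contig = true,
	};

It is this cc.gfp_mask that isolate_migratepages_block() later tests for
__GFP_FS, which is why dropping the call here would lose the scoped NOFS
constraint.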
On Mon, Dec 14, 2020 at 9:17 AM Michal Hocko <[email protected]> wrote:
>
> On Fri 11-12-20 15:21:38, Pavel Tatashin wrote:
> [...]
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c2dea9ad0e98..4d8e7f801c66 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3802,16 +3802,12 @@ alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
> > return alloc_flags;
> > }
> >
> > -static inline unsigned int current_alloc_flags(gfp_t gfp_mask,
> > - unsigned int alloc_flags)
> > +static inline unsigned int cma_alloc_flags(gfp_t gfp_mask,
> > + unsigned int alloc_flags)
>
> Do you have any strong reason to rename? Even though the current
Yes :)
> implementation only does something for cma I do not think this is all
> that important. The naming nicely fits with current_gfp_context so I
> would stick with it.
I am renaming it because current->flags is removed from this function,
so keeping the old name becomes misleading. The function now only
handles the CMA flag check, without looking at the thread-local state.
>
> Other than that the patch looks reasonable. I would just add a comment
> explaining that current_alloc_flags should be called _after_
> current_gfp_context because that one might change the gfp_mask.
Thanks, I will add it.
>
> With that addressed, feel free to add
> Acked-by: Michal Hocko <[email protected]>
Thank you,
Pasha
On Tue 15-12-20 00:24:30, Pavel Tatashin wrote:
[...]
> I am renaming it because current->flags is removed from this function,
> so keeping the old name becomes misleading. The function now only
> handles the CMA flag check, without looking at the thread-local state.
Fair enough. I still dislike cma being called out explicitly because
that is slightly misleading as well. gpf_to_alloc_flags would be more
explicit I believe. But I do not want to bikeshed this to death.
--
Michal Hocko
SUSE Labs
On Tue 15-12-20 00:20:39, Pavel Tatashin wrote:
[...]
> If we remove current_gfp_context() from alloc_contig_range(), the
> cc->gfp_mask will not have __GFP_FS cleared in scoped NOFS/NOIO contexts.
I do not think I was proposing to drop current_gfp_context from
alloc_contig_range. ACR needs some work to be properly scoped gfp mask
aware. This should be addressed, but I do not think the code
works properly now, so I wouldn't lose sleep over it in this series. At
least __alloc_contig_migrate_range should follow the gfp mask given to
alloc_contig_range.
> I have studied some other paths, and they are also convoluted.
> Therefore, I am worried about performing this optimization in this
> series.
Dropping current_gfp_context from the reclaim context should be done in
a separate patch. I didn't mean to push for this here. All I meant was
to simply not touch gfp/zone_idx in the reclaim path. The changelog
should call out that the page allocator always provides proper gfp mask.
--
Michal Hocko
SUSE Labs
> Fair enough. I still dislike cma being called out explicitly because
> that is slightly misleading as well. gpf_to_alloc_flags would be more
> explicit I believe. But I do not want to bikeshed this to death.
Sounds good, I renamed it to gpf_to_alloc_flags.
Thank you,
Pasha
On Tue, Dec 15, 2020 at 3:25 AM Michal Hocko <[email protected]> wrote:
[...]
> Dropping current_gfp_context from the reclaim context should be done in
> a separate patch. I didn't mean to push for this here. All I meant was
> to simply not touch gfp/zone_idx in the reclaim path. The changelog
> should call out that the page allocator always provides proper gfp mask.
I see what you mean. I will remove the reclaim changes and add a
note to the changelog.
Thank you,
Pasha