2020-12-17 18:55:02

by Pasha Tatashin

Subject: [PATCH v4 00/10] prohibit pinning pages in ZONE_MOVABLE

Changelog
---------
v4
- Address page migration comments. New patch:
mm/gup: limit number of gup migration failures, honor failures
Implements a limit on the number of retries for migration failures, and
also checks for isolation failures.
Added a test case to gup_test to verify that pages are never long-term
pinned in a movable zone, and also added tests to fault both in kernel
and in userland.
v3
- Merged with linux-next, which contains clean-up patch from Jason,
therefore this series is reduced by two patches which did the same
thing.
v2
- Addressed all review comments
- Added Reviewed-by's.
- Renamed PF_MEMALLOC_NOMOVABLE to PF_MEMALLOC_PIN
- Added is_pinnable_page() to check if page can be longterm pinned
- Fixed gup fast path by checking is_in_pinnable_zone()
- rename cma_page_list to movable_page_list
- add an admin-guide note about handling pinned pages in ZONE_MOVABLE,
updated the caveat about pinned pages in linux/mmzone.h
- Move current_gfp_context() to fast-path

---------
When a page is pinned it cannot be moved, and its physical address stays
the same until the page is unpinned.

This functionality is useful for allowing userland to implement DMA
access. For example, it is used by vfio in vfio_pin_pages().

However, this functionality breaks memory hotplug/hotremove assumptions
that pages in ZONE_MOVABLE can always be migrated.

This patch series fixes this issue by forcing new allocations during
page pinning to omit ZONE_MOVABLE, and by migrating any existing
pages out of ZONE_MOVABLE during pinning.

It uses the same logic that is currently used for CMA pages, and extends
it to all allocations.

For more information, see the discussion of this problem [1].
[1] https://lore.kernel.org/lkml/CA+CK2bBffHBxjmb9jmSKacm0fJMinyt3Nhk8Nx6iudcQSj80_w@mail.gmail.com
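
The core of the scheme can be sketched in userspace C (the flag values
below are illustrative stand-ins, not the kernel's real __GFP_MOVABLE
and PF_MEMALLOC_PIN bits): a pinning context simply masks the movable
hint out of the allocation mask, so new allocations cannot land in
ZONE_MOVABLE.

```c
#include <assert.h>

/* Illustrative stand-ins; the real bits live in gfp.h and sched.h. */
#define GFP_MOVABLE_BIT  0x08u
#define PF_PIN_BIT       0x10u

/*
 * In a pinning context, new allocations must not come from
 * ZONE_MOVABLE, so the movable hint is stripped from the gfp mask.
 */
static unsigned int apply_pin_context(unsigned int task_flags,
				      unsigned int gfp)
{
	if (task_flags & PF_PIN_BIT)
		gfp &= ~GFP_MOVABLE_BIT;
	return gfp;
}
```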

Previous versions:
v1
https://lore.kernel.org/lkml/[email protected]
v2
https://lore.kernel.org/lkml/[email protected]
v3
https://lore.kernel.org/lkml/[email protected]

Pavel Tatashin (10):
mm/gup: don't pin migrated cma pages in movable zone
mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN
mm: apply per-task gfp constraints in fast path
mm: honor PF_MEMALLOC_PIN for all movable pages
mm/gup: migrate pinned pages out of movable zone
memory-hotplug.rst: add a note about ZONE_MOVABLE and page pinning
mm/gup: change index type to long as it counts pages
mm/gup: limit number of gup migration failures, honor failures
selftests/vm: test flag is broken
selftests/vm: test faulting in kernel, and verify pinnable pages

.../admin-guide/mm/memory-hotplug.rst | 9 +
include/linux/migrate.h | 1 +
include/linux/mm.h | 11 +
include/linux/mmzone.h | 11 +-
include/linux/sched.h | 2 +-
include/linux/sched/mm.h | 27 +--
include/trace/events/migrate.h | 3 +-
mm/gup.c | 217 ++++++++++--------
mm/gup_test.c | 15 +-
mm/gup_test.h | 1 +
mm/hugetlb.c | 4 +-
mm/page_alloc.c | 33 ++-
mm/vmscan.c | 10 +-
tools/testing/selftests/vm/gup_test.c | 26 ++-
14 files changed, 221 insertions(+), 149 deletions(-)

--
2.25.1


2020-12-17 18:56:04

by Pasha Tatashin

Subject: [PATCH v4 04/10] mm: honor PF_MEMALLOC_PIN for all movable pages

PF_MEMALLOC_PIN is currently only honored for CMA pages; extend this
flag to work for any allocation from ZONE_MOVABLE by removing
__GFP_MOVABLE from the gfp_mask when this flag is set in the current
context.

Add is_pinnable_page(), which returns true if a page is pinnable.
A pinnable page is not in ZONE_MOVABLE and not of MIGRATE_CMA type.
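
As a userspace sketch (struct page reduced to just the two properties
the predicate inspects; the model names are invented), the new check
behaves like:

```c
#include <assert.h>
#include <stdbool.h>

enum zone_type_model { ZONE_NORMAL_M, ZONE_MOVABLE_M };

/* Simplified stand-in for struct page. */
struct page_model {
	enum zone_type_model zone;
	bool migrate_cma;	/* page sits in a MIGRATE_CMA pageblock */
};

/*
 * Mirrors the patch: a page is pinnable iff it is neither in
 * ZONE_MOVABLE nor a CMA page.
 */
static bool is_pinnable_page_model(const struct page_model *p)
{
	return p->zone != ZONE_MOVABLE_M && !p->migrate_cma;
}
```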

Signed-off-by: Pavel Tatashin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
include/linux/mm.h | 11 +++++++++++
include/linux/sched/mm.h | 6 +++++-
mm/hugetlb.c | 2 +-
mm/page_alloc.c | 20 +++++++++-----------
4 files changed, 26 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5299b90a6c40..51b3090dd072 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1109,6 +1109,17 @@ static inline bool is_zone_device_page(const struct page *page)
}
#endif

+static inline bool is_zone_movable_page(const struct page *page)
+{
+ return page_zonenum(page) == ZONE_MOVABLE;
+}
+
+/* MIGRATE_CMA and ZONE_MOVABLE do not allow pin pages */
+static inline bool is_pinnable_page(struct page *page)
+{
+ return !is_zone_movable_page(page) && !is_migrate_cma_page(page);
+}
+
#ifdef CONFIG_DEV_PAGEMAP_OPS
void free_devmap_managed_page(struct page *page);
DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 5f4dd3274734..a55277b0d475 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -150,12 +150,13 @@ static inline bool in_vfork(struct task_struct *tsk)
* Applies per-task gfp context to the given allocation flags.
* PF_MEMALLOC_NOIO implies GFP_NOIO
* PF_MEMALLOC_NOFS implies GFP_NOFS
+ * PF_MEMALLOC_PIN implies !GFP_MOVABLE
*/
static inline gfp_t current_gfp_context(gfp_t flags)
{
unsigned int pflags = READ_ONCE(current->flags);

- if (unlikely(pflags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS))) {
+ if (unlikely(pflags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | PF_MEMALLOC_PIN))) {
/*
* NOIO implies both NOIO and NOFS and it is a weaker context
* so always make sure it makes precedence
@@ -164,6 +165,9 @@ static inline gfp_t current_gfp_context(gfp_t flags)
flags &= ~(__GFP_IO | __GFP_FS);
else if (pflags & PF_MEMALLOC_NOFS)
flags &= ~__GFP_FS;
+
+ if (pflags & PF_MEMALLOC_PIN)
+ flags &= ~__GFP_MOVABLE;
}
return flags;
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3f5ddac5de8a..958ef274db82 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1036,7 +1036,7 @@ static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
bool pin = !!(current->flags & PF_MEMALLOC_PIN);

list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
- if (pin && is_migrate_cma_page(page))
+ if (pin && !is_pinnable_page(page))
continue;

if (PageHWPoison(page))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c2dea9ad0e98..f83fa8cf3bb8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3802,16 +3802,13 @@ alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
return alloc_flags;
}

-static inline unsigned int current_alloc_flags(gfp_t gfp_mask,
- unsigned int alloc_flags)
+/* Must be called after current_gfp_context() which can change gfp_mask */
+static inline unsigned int gpf_to_alloc_flags(gfp_t gfp_mask,
+ unsigned int alloc_flags)
{
#ifdef CONFIG_CMA
- unsigned int pflags = current->flags;
-
- if (!(pflags & PF_MEMALLOC_PIN) &&
- gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
+ if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
alloc_flags |= ALLOC_CMA;
-
#endif
return alloc_flags;
}
@@ -4467,7 +4464,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
} else if (unlikely(rt_task(current)) && !in_interrupt())
alloc_flags |= ALLOC_HARDER;

- alloc_flags = current_alloc_flags(gfp_mask, alloc_flags);
+ alloc_flags = gpf_to_alloc_flags(gfp_mask, alloc_flags);

return alloc_flags;
}
@@ -4769,7 +4766,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,

reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
if (reserve_flags)
- alloc_flags = current_alloc_flags(gfp_mask, reserve_flags);
+ alloc_flags = gpf_to_alloc_flags(gfp_mask, reserve_flags);

/*
* Reset the nodemask and zonelist iterators if memory policies can be
@@ -4938,7 +4935,7 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
if (should_fail_alloc_page(gfp_mask, order))
return false;

- *alloc_flags = current_alloc_flags(gfp_mask, *alloc_flags);
+ *alloc_flags = gpf_to_alloc_flags(gfp_mask, *alloc_flags);

/* Dirty zone balancing only done in the fast path */
ac->spread_dirty_pages = (gfp_mask & __GFP_WRITE);
@@ -4980,7 +4977,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
* Apply scoped allocation constraints. This is mainly about GFP_NOFS
* resp. GFP_NOIO which has to be inherited for all allocation requests
* from a particular context which has been marked by
- * memalloc_no{fs,io}_{save,restore}.
+ * memalloc_no{fs,io}_{save,restore}. And PF_MEMALLOC_PIN which ensures
+ * movable zones are not used during allocation.
*/
gfp_mask = current_gfp_context(gfp_mask);
alloc_mask = gfp_mask;
--
2.25.1

2020-12-17 18:56:13

by Pasha Tatashin

Subject: [PATCH v4 03/10] mm: apply per-task gfp constraints in fast path

Function current_gfp_context() is called after the fast path. However,
soon we will add more constraints which will also limit zones based on
context. Move this call into the fast path, and apply the correct
constraints for all allocations.

Also update .reclaim_idx based on the value returned by
current_gfp_context(), because it will soon modify the allowed zones.

Note:
With this patch we will do one extra current->flags load in the fast
path, but current->flags is already loaded in the fast path:

__alloc_pages_nodemask()
prepare_alloc_pages()
current_alloc_flags(gfp_mask, *alloc_flags);

Later, when we add the zone-constraint logic to current_gfp_context(),
we will be able to remove the current->flags load from
current_alloc_flags(), and therefore return the fast path to its current
performance level.
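
The masking that current_gfp_context() performs — including the
PF_MEMALLOC_PIN handling added later in this series — can be modeled in
plain C (bit values below are illustrative, not the kernel's). Note the
precedence: NOIO is the weaker context, so it wins over NOFS, while PIN
strips the movable hint independently.

```c
#include <assert.h>

/* Illustrative bit values; the real ones come from gfp.h/sched.h. */
#define GFP_IO_M	0x01u
#define GFP_FS_M	0x02u
#define GFP_MOVABLE_M	0x04u
#define PF_NOIO_M	0x10u
#define PF_NOFS_M	0x20u
#define PF_PIN_M	0x40u

/* Model of current_gfp_context() as it looks at the end of the series. */
static unsigned int gfp_context_model(unsigned int pflags,
				      unsigned int flags)
{
	if (pflags & (PF_NOIO_M | PF_NOFS_M | PF_PIN_M)) {
		/* NOIO implies NOFS, so it takes precedence. */
		if (pflags & PF_NOIO_M)
			flags &= ~(GFP_IO_M | GFP_FS_M);
		else if (pflags & PF_NOFS_M)
			flags &= ~GFP_FS_M;
		/* Pinning contexts must avoid movable zones. */
		if (pflags & PF_PIN_M)
			flags &= ~GFP_MOVABLE_M;
	}
	return flags;
}
```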

Suggested-by: Michal Hocko <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
mm/page_alloc.c | 15 ++++++++-------
mm/vmscan.c | 10 ++++++----
2 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ec05396a597b..c2dea9ad0e98 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4976,6 +4976,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
}

gfp_mask &= gfp_allowed_mask;
+ /*
+ * Apply scoped allocation constraints. This is mainly about GFP_NOFS
+ * resp. GFP_NOIO which has to be inherited for all allocation requests
+ * from a particular context which has been marked by
+ * memalloc_no{fs,io}_{save,restore}.
+ */
+ gfp_mask = current_gfp_context(gfp_mask);
alloc_mask = gfp_mask;
if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
return NULL;
@@ -4991,13 +4998,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
if (likely(page))
goto out;

- /*
- * Apply scoped allocation constraints. This is mainly about GFP_NOFS
- * resp. GFP_NOIO which has to be inherited for all allocation requests
- * from a particular context which has been marked by
- * memalloc_no{fs,io}_{save,restore}.
- */
- alloc_mask = current_gfp_context(gfp_mask);
+ alloc_mask = gfp_mask;
ac.spread_dirty_pages = false;

/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 469016222cdb..d9546f5897f4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3234,11 +3234,12 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *nodemask)
{
+ gfp_t current_gfp_mask = current_gfp_context(gfp_mask);
unsigned long nr_reclaimed;
struct scan_control sc = {
.nr_to_reclaim = SWAP_CLUSTER_MAX,
- .gfp_mask = current_gfp_context(gfp_mask),
- .reclaim_idx = gfp_zone(gfp_mask),
+ .gfp_mask = current_gfp_mask,
+ .reclaim_idx = gfp_zone(current_gfp_mask),
.order = order,
.nodemask = nodemask,
.priority = DEF_PRIORITY,
@@ -4158,17 +4159,18 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
{
/* Minimum pages needed in order to stay on node */
const unsigned long nr_pages = 1 << order;
+ gfp_t current_gfp_mask = current_gfp_context(gfp_mask);
struct task_struct *p = current;
unsigned int noreclaim_flag;
struct scan_control sc = {
.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
- .gfp_mask = current_gfp_context(gfp_mask),
+ .gfp_mask = current_gfp_mask,
.order = order,
.priority = NODE_RECLAIM_PRIORITY,
.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
.may_swap = 1,
- .reclaim_idx = gfp_zone(gfp_mask),
+ .reclaim_idx = gfp_zone(current_gfp_mask),
};

trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, order,
--
2.25.1

2020-12-17 18:56:14

by Pasha Tatashin

Subject: [PATCH v4 02/10] mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN

PF_MEMALLOC_NOCMA is used to guarantee that the allocator will not
return pages that might belong to the CMA region. This is currently used
for long-term gup to make sure that such pins are not going to be done
on any CMA pages.

When PF_MEMALLOC_NOCMA was introduced, we did not realize that it
focuses on CMA pages too much and that there is a larger class of pages
that needs the same treatment. The MOVABLE zone cannot contain any
long-term pins either, so it makes sense to reuse and redefine this flag
for that use case as well. Rename the flag to PF_MEMALLOC_PIN, which
defines an allocation context that can only get pages suitable for
long-term pins.

Also rename:
memalloc_nocma_save()/memalloc_nocma_restore()
to
memalloc_pin_save()/memalloc_pin_restore()
and make the new functions common.
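
The save/restore pair follows the usual memalloc scope pattern: only the
prior state of the one flag bit is saved, so scopes nest correctly. A
single-threaded userspace sketch (names invented; current->flags is
modeled as a plain global):

```c
#include <assert.h>

#define PF_PIN_FLAG 0x10u

/* Stand-in for current->flags in this single-threaded sketch. */
static unsigned int task_flags_model;

/* Mirrors memalloc_pin_save(): remember whether the flag was already
 * set, then set it. */
static unsigned int pin_save_model(void)
{
	unsigned int old = task_flags_model & PF_PIN_FLAG;

	task_flags_model |= PF_PIN_FLAG;
	return old;
}

/* Mirrors memalloc_pin_restore(): put back only the saved state. */
static void pin_restore_model(unsigned int old)
{
	task_flags_model = (task_flags_model & ~PF_PIN_FLAG) | old;
}
```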

Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
include/linux/sched.h | 2 +-
include/linux/sched/mm.h | 21 +++++----------------
mm/gup.c | 4 ++--
mm/hugetlb.c | 4 ++--
mm/page_alloc.c | 4 ++--
5 files changed, 12 insertions(+), 23 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e5ad6d354b7b..e30d5511ccc0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1576,7 +1576,7 @@ extern struct pid *cad_pid;
#define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
-#define PF_MEMALLOC_NOCMA 0x10000000 /* All allocation request will have _GFP_MOVABLE cleared */
+#define PF_MEMALLOC_PIN 0x10000000 /* Allocation context constrained to zones which allow long term pinning. */
#define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */
#define PF_SUSPEND_TASK 0x80000000 /* This thread called freeze_processes() and should not be frozen */

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 1ae08b8462a4..5f4dd3274734 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -270,29 +270,18 @@ static inline void memalloc_noreclaim_restore(unsigned int flags)
current->flags = (current->flags & ~PF_MEMALLOC) | flags;
}

-#ifdef CONFIG_CMA
-static inline unsigned int memalloc_nocma_save(void)
+static inline unsigned int memalloc_pin_save(void)
{
- unsigned int flags = current->flags & PF_MEMALLOC_NOCMA;
+ unsigned int flags = current->flags & PF_MEMALLOC_PIN;

- current->flags |= PF_MEMALLOC_NOCMA;
+ current->flags |= PF_MEMALLOC_PIN;
return flags;
}

-static inline void memalloc_nocma_restore(unsigned int flags)
+static inline void memalloc_pin_restore(unsigned int flags)
{
- current->flags = (current->flags & ~PF_MEMALLOC_NOCMA) | flags;
+ current->flags = (current->flags & ~PF_MEMALLOC_PIN) | flags;
}
-#else
-static inline unsigned int memalloc_nocma_save(void)
-{
- return 0;
-}
-
-static inline void memalloc_nocma_restore(unsigned int flags)
-{
-}
-#endif

#ifdef CONFIG_MEMCG
DECLARE_PER_CPU(struct mem_cgroup *, int_active_memcg);
diff --git a/mm/gup.c b/mm/gup.c
index f2e50cdd7d67..04602e94856b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1671,7 +1671,7 @@ static long __gup_longterm_locked(struct mm_struct *mm,
long rc;

if (gup_flags & FOLL_LONGTERM)
- flags = memalloc_nocma_save();
+ flags = memalloc_pin_save();

rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas, NULL,
gup_flags);
@@ -1680,7 +1680,7 @@ static long __gup_longterm_locked(struct mm_struct *mm,
if (rc > 0)
rc = check_and_migrate_cma_pages(mm, start, rc, pages,
vmas, gup_flags);
- memalloc_nocma_restore(flags);
+ memalloc_pin_restore(flags);
}
return rc;
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index cbf32d2824fd..3f5ddac5de8a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1033,10 +1033,10 @@ static void enqueue_huge_page(struct hstate *h, struct page *page)
static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
{
struct page *page;
- bool nocma = !!(current->flags & PF_MEMALLOC_NOCMA);
+ bool pin = !!(current->flags & PF_MEMALLOC_PIN);

list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
- if (nocma && is_migrate_cma_page(page))
+ if (pin && is_migrate_cma_page(page))
continue;

if (PageHWPoison(page))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 774542e1483e..ec05396a597b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3808,8 +3808,8 @@ static inline unsigned int current_alloc_flags(gfp_t gfp_mask,
#ifdef CONFIG_CMA
unsigned int pflags = current->flags;

- if (!(pflags & PF_MEMALLOC_NOCMA) &&
- gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
+ if (!(pflags & PF_MEMALLOC_PIN) &&
+ gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
alloc_flags |= ALLOC_CMA;

#endif
--
2.25.1

2020-12-17 18:56:18

by Pasha Tatashin

Subject: [PATCH v4 05/10] mm/gup: migrate pinned pages out of movable zone

We should not pin pages in ZONE_MOVABLE. Currently, we only avoid
pinning movable CMA pages. Generalize the function that migrates CMA
pages to migrate all movable pages. Use is_pinnable_page() to check
which pages need to be migrated.
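
The scan in this patch walks the pinned range and gathers every page
that fails is_pinnable_page() for migration. A userspace sketch of that
filter (struct page reduced to the two properties the check needs;
compound-page stepping omitted; all names are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum zone_m { ZN_NORMAL, ZN_MOVABLE };

struct pg_m {
	enum zone_m zone;
	bool cma;
};

static bool pinnable_m(const struct pg_m *p)
{
	return p->zone != ZN_MOVABLE && !p->cma;
}

/*
 * Mirrors the scan in check_and_migrate_movable_pages(): count the
 * pages that would have to be isolated and migrated out of movable
 * placement before the long-term pin may stand.
 */
static size_t count_pages_to_migrate(const struct pg_m *pages, size_t n)
{
	size_t i, need = 0;

	for (i = 0; i < n; i++)
		if (!pinnable_m(&pages[i]))
			need++;
	return need;
}
```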

Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
---
include/linux/migrate.h | 1 +
include/linux/mmzone.h | 11 ++++--
include/trace/events/migrate.h | 3 +-
mm/gup.c | 68 ++++++++++++++--------------------
4 files changed, 39 insertions(+), 44 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 4594838a0f7c..aae5ef0b3ba1 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -27,6 +27,7 @@ enum migrate_reason {
MR_MEMPOLICY_MBIND,
MR_NUMA_MISPLACED,
MR_CONTIG_RANGE,
+ MR_LONGTERM_PIN,
MR_TYPES
};

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b593316bff3d..25c0c13ba4b1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -386,9 +386,14 @@ enum zone_type {
* likely to succeed, and to locally limit unmovable allocations - e.g.,
* to increase the number of THP/huge pages. Notable special cases are:
*
- * 1. Pinned pages: (long-term) pinning of movable pages might
- * essentially turn such pages unmovable. Memory offlining might
- * retry a long time.
+ * 1. Pinned pages: (long-term) pinning of movable pages is avoided
+ * when pages are pinned and faulted, but it is still possible that
+ * address space already has pages in ZONE_MOVABLE at the time when
+ * pages are pinned (i.e. user has touched that memory before
+ * pinning). In such case we try to migrate them to a different zone,
+ * but if migration fails the pages can still end-up pinned in
+ * ZONE_MOVABLE. In such case, memory offlining might retry a long
+ * time and will only succeed once user application unpins pages.
* 2. memblock allocations: kernelcore/movablecore setups might create
* situations where ZONE_MOVABLE contains unmovable allocations
* after boot. Memory offlining and allocations fail early.
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index 4d434398d64d..363b54ce104c 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -20,7 +20,8 @@
EM( MR_SYSCALL, "syscall_or_cpuset") \
EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind") \
EM( MR_NUMA_MISPLACED, "numa_misplaced") \
- EMe(MR_CONTIG_RANGE, "contig_range")
+ EM( MR_CONTIG_RANGE, "contig_range") \
+ EMe(MR_LONGTERM_PIN, "longterm_pin")

/*
* First define the enums in the above macros to be exported to userspace
diff --git a/mm/gup.c b/mm/gup.c
index 04602e94856b..591d8e2dfc70 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -89,11 +89,12 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page,
int orig_refs = refs;

/*
- * Can't do FOLL_LONGTERM + FOLL_PIN with CMA in the gup fast
- * path, so fail and let the caller fall back to the slow path.
+ * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
+ * right zone, so fail and let the caller fall back to the slow
+ * path.
*/
- if (unlikely(flags & FOLL_LONGTERM) &&
- is_migrate_cma_page(page))
+ if (unlikely((flags & FOLL_LONGTERM) &&
+ !is_pinnable_page(page)))
return NULL;

/*
@@ -1549,19 +1550,18 @@ struct page *get_dump_page(unsigned long addr)
}
#endif /* CONFIG_ELF_CORE */

-#ifdef CONFIG_CMA
-static long check_and_migrate_cma_pages(struct mm_struct *mm,
- unsigned long start,
- unsigned long nr_pages,
- struct page **pages,
- struct vm_area_struct **vmas,
- unsigned int gup_flags)
+static long check_and_migrate_movable_pages(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long nr_pages,
+ struct page **pages,
+ struct vm_area_struct **vmas,
+ unsigned int gup_flags)
{
unsigned long i;
unsigned long step;
bool drain_allow = true;
bool migrate_allow = true;
- LIST_HEAD(cma_page_list);
+ LIST_HEAD(movable_page_list);
long ret = nr_pages;
struct migration_target_control mtc = {
.nid = NUMA_NO_NODE,
@@ -1579,13 +1579,12 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
*/
step = compound_nr(head) - (pages[i] - head);
/*
- * If we get a page from the CMA zone, since we are going to
- * be pinning these entries, we might as well move them out
- * of the CMA zone if possible.
+ * If we get a movable page, since we are going to be pinning
+ * these entries, try to move them out if possible.
*/
- if (is_migrate_cma_page(head)) {
+ if (!is_pinnable_page(head)) {
if (PageHuge(head))
- isolate_huge_page(head, &cma_page_list);
+ isolate_huge_page(head, &movable_page_list);
else {
if (!PageLRU(head) && drain_allow) {
lru_add_drain_all();
@@ -1593,7 +1592,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
}

if (!isolate_lru_page(head)) {
- list_add_tail(&head->lru, &cma_page_list);
+ list_add_tail(&head->lru, &movable_page_list);
mod_node_page_state(page_pgdat(head),
NR_ISOLATED_ANON +
page_is_file_lru(head),
@@ -1605,7 +1604,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
i += step;
}

- if (!list_empty(&cma_page_list)) {
+ if (!list_empty(&movable_page_list)) {
/*
* drop the above get_user_pages reference.
*/
@@ -1615,25 +1614,24 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
for (i = 0; i < nr_pages; i++)
put_page(pages[i]);

- if (migrate_pages(&cma_page_list, alloc_migration_target, NULL,
- (unsigned long)&mtc, MIGRATE_SYNC, MR_CONTIG_RANGE)) {
+ if (migrate_pages(&movable_page_list, alloc_migration_target, NULL,
+ (unsigned long)&mtc, MIGRATE_SYNC, MR_LONGTERM_PIN)) {
/*
* some of the pages failed migration. Do get_user_pages
* without migration.
*/
migrate_allow = false;

- if (!list_empty(&cma_page_list))
- putback_movable_pages(&cma_page_list);
+ if (!list_empty(&movable_page_list))
+ putback_movable_pages(&movable_page_list);
}
/*
* We did migrate all the pages, Try to get the page references
- * again migrating any new CMA pages which we failed to isolate
- * earlier.
+ * again migrating any pages which we failed to isolate earlier.
*/
ret = __get_user_pages_locked(mm, start, nr_pages,
- pages, vmas, NULL,
- gup_flags);
+ pages, vmas, NULL,
+ gup_flags);

if ((ret > 0) && migrate_allow) {
nr_pages = ret;
@@ -1644,17 +1642,6 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,

return ret;
}
-#else
-static long check_and_migrate_cma_pages(struct mm_struct *mm,
- unsigned long start,
- unsigned long nr_pages,
- struct page **pages,
- struct vm_area_struct **vmas,
- unsigned int gup_flags)
-{
- return nr_pages;
-}
-#endif /* CONFIG_CMA */

/*
* __gup_longterm_locked() is a wrapper for __get_user_pages_locked which
@@ -1678,8 +1665,9 @@ static long __gup_longterm_locked(struct mm_struct *mm,

if (gup_flags & FOLL_LONGTERM) {
if (rc > 0)
- rc = check_and_migrate_cma_pages(mm, start, rc, pages,
- vmas, gup_flags);
+ rc = check_and_migrate_movable_pages(mm, start, rc,
+ pages, vmas,
+ gup_flags);
memalloc_pin_restore(flags);
}
return rc;
--
2.25.1

2020-12-17 18:56:30

by Pasha Tatashin

Subject: [PATCH v4 06/10] memory-hotplug.rst: add a note about ZONE_MOVABLE and page pinning

Document the special handling of page pinning when ZONE_MOVABLE is
present.

Signed-off-by: Pavel Tatashin <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
---
Documentation/admin-guide/mm/memory-hotplug.rst | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst
index 5c4432c96c4b..c6618f99f765 100644
--- a/Documentation/admin-guide/mm/memory-hotplug.rst
+++ b/Documentation/admin-guide/mm/memory-hotplug.rst
@@ -357,6 +357,15 @@ creates ZONE_MOVABLE as following.
Unfortunately, there is no information to show which memory block belongs
to ZONE_MOVABLE. This is TBD.

+.. note::
+ Techniques that rely on long-term pinnings of memory (especially, RDMA and
+ vfio) are fundamentally problematic with ZONE_MOVABLE and, therefore, memory
+ hot remove. Pinned pages cannot reside on ZONE_MOVABLE, to guarantee that
+ memory can still get hot removed - be aware that pinning can fail even if
+ there is plenty of free memory in ZONE_MOVABLE. In addition, using
+ ZONE_MOVABLE might make page pinning more expensive, because pages have to be
+ migrated off that zone first.
+
.. _memory_hotplug_how_to_offline_memory:

How to offline memory
--
2.25.1

2020-12-17 18:56:33

by Pasha Tatashin

Subject: [PATCH v4 08/10] mm/gup: limit number of gup migration failures, honor failures

check_and_migrate_movable_pages() does not honor isolation errors, and
also retries migration failures indefinitely.

Fix both of the above issues: add a new function,
check_and_unpin_pages(), which checks a range of pages and unpins them
all if any page is not pinnable.

Move the retry loop from check_and_migrate_movable_pages() to
__gup_longterm_locked().

Rename check_and_migrate_movable_pages() to migrate_movable_pages() and
make this function accept already unpinned pages. Also, track the errors
during isolation, so they can be retried with a different maximum limit;
the isolation errors should be ephemeral.
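
The resulting retry policy can be sketched in userspace C. The attempt
callback stands in for one unpin-repin-migrate cycle and its name is
hypothetical; the limits and the rc convention mirror the patch
(negative rc: migration failure; positive rc: isolation failures, which
get the larger, separate budget because they should be ephemeral).

```c
#include <assert.h>

#define PINNABLE_MIGRATE_MAX	10
#define PINNABLE_ISOLATE_MAX	100
#define EBUSY_M			16	/* stand-in for -EBUSY magnitude */

/* Model of the retry loop moved into __gup_longterm_locked(). */
static int pin_retry_model(int (*attempt)(void *arg), void *arg)
{
	int migrate_retry = 0, isolate_retry = 0;

	for (;;) {
		int rc = attempt(arg);

		if (rc == 0)		/* every page ended up pinnable */
			return 0;
		if (rc < 0 && migrate_retry++ >= PINNABLE_MIGRATE_MAX)
			return rc;	/* give up: migration kept failing */
		if (rc > 0 && isolate_retry++ >= PINNABLE_ISOLATE_MAX)
			return -EBUSY_M;
	}
}

/* Test helper (hypothetical): reports isolation failures a few times,
 * then succeeds. */
static int flaky_attempt(void *arg)
{
	int *fails_left = arg;

	return (*fails_left)-- > 0 ? 3 /* pages failed isolation */ : 0;
}
```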

Signed-off-by: Pavel Tatashin <[email protected]>
---
mm/gup.c | 179 ++++++++++++++++++++++++++++++++++---------------------
1 file changed, 111 insertions(+), 68 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 1ebb7cc2fbe4..70cc8b8f67c4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1550,27 +1550,57 @@ struct page *get_dump_page(unsigned long addr)
}
#endif /* CONFIG_ELF_CORE */

-static long check_and_migrate_movable_pages(struct mm_struct *mm,
- unsigned long start,
- unsigned long nr_pages,
- struct page **pages,
- struct vm_area_struct **vmas,
- unsigned int gup_flags)
-{
- unsigned long i;
- unsigned long step;
- bool drain_allow = true;
- bool migrate_allow = true;
+/*
+ * Verify that there are no unpinnable (movable) pages; if so return true.
+ * Otherwise, if an unpinnable page is found, unpin all pages and return false.
+ */
+static bool check_and_unpin_pages(unsigned long nr_pages, struct page **pages,
+ unsigned int gup_flags)
+{
+ unsigned long i, step;
+
+ for (i = 0; i < nr_pages; i += step) {
+ struct page *head = compound_head(pages[i]);
+
+ step = compound_nr(head) - (pages[i] - head);
+ if (!is_pinnable_page(head))
+ break;
+ }
+
+ if (i >= nr_pages)
+ return true;
+
+ if (gup_flags & FOLL_PIN) {
+ unpin_user_pages(pages, nr_pages);
+ } else {
+ for (i = 0; i < nr_pages; i++)
+ put_page(pages[i]);
+ }
+
+ return false;
+}
+
+#define PINNABLE_MIGRATE_MAX 10
+#define PINNABLE_ISOLATE_MAX 100
+
+/*
+ * Migrate pages that cannot be pinned. Return zero on success and error code
+ * on migration failure. If migration was successful but page isolation had
+ * failures return number of pages that failed to be isolated.
+ */
+static long migrate_movable_pages(unsigned long nr_pages, struct page **pages)
+{
+ unsigned long i, step;
LIST_HEAD(movable_page_list);
- long ret = nr_pages;
+ long ret = 0;
+ long error_count = 0;
struct migration_target_control mtc = {
.nid = NUMA_NO_NODE,
.gfp_mask = GFP_USER | __GFP_NOWARN,
};

-check_again:
- for (i = 0; i < nr_pages;) {
-
+ lru_add_drain_all();
+ for (i = 0; i < nr_pages; i += step) {
struct page *head = compound_head(pages[i]);

/*
@@ -1583,62 +1613,42 @@ static long check_and_migrate_movable_pages(struct mm_struct *mm,
* these entries, try to move them out if possible.
*/
if (!is_pinnable_page(head)) {
- if (PageHuge(head))
- isolate_huge_page(head, &movable_page_list);
- else {
- if (!PageLRU(head) && drain_allow) {
- lru_add_drain_all();
- drain_allow = false;
- }
-
+ if (PageHuge(head)) {
+ if (!isolate_huge_page(head, &movable_page_list))
+ error_count += step;
+ } else {
if (!isolate_lru_page(head)) {
list_add_tail(&head->lru, &movable_page_list);
mod_node_page_state(page_pgdat(head),
NR_ISOLATED_ANON +
page_is_file_lru(head),
thp_nr_pages(head));
+ } else {
+ error_count += step;
}
}
}
-
- i += step;
}

if (!list_empty(&movable_page_list)) {
- /*
- * drop the above get_user_pages reference.
- */
- if (gup_flags & FOLL_PIN)
- unpin_user_pages(pages, nr_pages);
- else
- for (i = 0; i < nr_pages; i++)
- put_page(pages[i]);
+ ret = migrate_pages(&movable_page_list, alloc_migration_target,
+ NULL, (unsigned long)&mtc, MIGRATE_SYNC,
+ MR_LONGTERM_PIN);
+ /* Assume -EBUSY failure if some pages were not migrated */
+ if (ret > 0)
+ ret = -EBUSY;
+ }

- if (migrate_pages(&movable_page_list, alloc_migration_target, NULL,
- (unsigned long)&mtc, MIGRATE_SYNC, MR_LONGTERM_PIN)) {
- /*
- * some of the pages failed migration. Do get_user_pages
- * without migration.
- */
- migrate_allow = false;
+ if (ret && !list_empty(&movable_page_list))
+ putback_movable_pages(&movable_page_list);

- if (!list_empty(&movable_page_list))
- putback_movable_pages(&movable_page_list);
- }
- /*
- * We did migrate all the pages, Try to get the page references
- * again migrating any pages which we failed to isolate earlier.
- */
- ret = __get_user_pages_locked(mm, start, nr_pages,
- pages, vmas, NULL,
- gup_flags);
-
- if ((ret > 0) && migrate_allow) {
- nr_pages = ret;
- drain_allow = true;
- goto check_again;
- }
- }
+ /*
+ * Check if there were isolation errors, if so they should not be
+ * counted toward PINNABLE_MIGRATE_MAX, so separate them, by
+ * returning number of pages failed to isolate.
+ */
+ if (!ret && error_count)
+ ret = error_count;

return ret;
}
@@ -1654,22 +1664,55 @@ static long __gup_longterm_locked(struct mm_struct *mm,
struct vm_area_struct **vmas,
unsigned int gup_flags)
{
- unsigned long flags = 0;
+ int migrate_retry = 0;
+ int isolate_retry = 0;
+ unsigned int flags;
long rc;

- if (gup_flags & FOLL_LONGTERM)
- flags = memalloc_pin_save();
+ if (!(gup_flags & FOLL_LONGTERM))
+ return __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
+ NULL, gup_flags);

- rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas, NULL,
- gup_flags);
+ /*
+ * Without FOLL_WRITE fault handler may return zero page, which can
+ * be in a movable zone, and also will fail to isolate during migration,
+ * thus the longterm pin will fail.
+ */
gup_flags |= FOLL_WRITE;

- if (gup_flags & FOLL_LONGTERM) {
- if (rc > 0)
- rc = check_and_migrate_movable_pages(mm, start, rc,
- pages, vmas,
- gup_flags);
- memalloc_pin_restore(flags);
+ flags = memalloc_pin_save();
+ /*
+ * Migration may fail, we retry before giving up. Also, because after
+ * migration pages[] becomes outdated, we unpin and repin all pages
+ * in the range, so pages array is repopulated with new values.
+ * Also, because of this we cannot retry migration failures in a loop
+ * without pinning/unpinning pages.
+ */
+ for (; ; ) {
+ rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
+ NULL, gup_flags);
+
+ /* Return if error or if all pages are pinnable */
+ if (rc <= 0 || check_and_unpin_pages(rc, pages, gup_flags))
+ break;
+
+ /* Some pages are not pinnable, migrate them */
+ rc = migrate_movable_pages(rc, pages);
+
+ /*
+ * If there is an error and we have tried the maximum number of
+ * times, bail out. Notice: we return an error code, and all pages
+ * are unpinned.
+ */
+ if (rc < 0 && migrate_retry++ >= PINNABLE_MIGRATE_MAX) {
+ break;
+ } else if (rc > 0 && isolate_retry++ >= PINNABLE_ISOLATE_MAX) {
+ rc = -EBUSY;
+ break;
+ }
}
+ memalloc_pin_restore(flags);
+
return rc;
}

--
2.25.1

2020-12-17 18:56:33

by Pasha Tatashin

Subject: [PATCH v4 10/10] selftests/vm: test faulting in kernel, and verify pinnable pages

When pages are pinned, they can be faulted in from userland and then
migrated, or they can be faulted in directly in the kernel without
migration.

In either case, the pinned pages must end up being pinnable (not movable).

Add a new test that does not touch the pages in userland and uses
FOLL_TOUCH instead. Also, verify that pinned pages are pinnable.

Signed-off-by: Pavel Tatashin <[email protected]>
---
mm/gup_test.c | 6 ++++++
tools/testing/selftests/vm/gup_test.c | 17 +++++++++++++----
2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/mm/gup_test.c b/mm/gup_test.c
index 24c70c5814ba..24fd542091ee 100644
--- a/mm/gup_test.c
+++ b/mm/gup_test.c
@@ -52,6 +52,12 @@ static void verify_dma_pinned(unsigned int cmd, struct page **pages,

dump_page(page, "gup_test failure");
break;
+ } else if (cmd == PIN_LONGTERM_BENCHMARK &&
+ WARN(!is_pinnable_page(page),
+ "pages[%lu] is NOT pinnable but pinned\n",
+ i)) {
+ dump_page(page, "gup_test failure");
+ break;
}
}
break;
diff --git a/tools/testing/selftests/vm/gup_test.c b/tools/testing/selftests/vm/gup_test.c
index 42c71483729f..f08cc97d424d 100644
--- a/tools/testing/selftests/vm/gup_test.c
+++ b/tools/testing/selftests/vm/gup_test.c
@@ -13,6 +13,7 @@

/* Just the flags we need, copied from mm.h: */
#define FOLL_WRITE 0x01 /* check pte is writable */
+#define FOLL_TOUCH 0x02 /* mark page accessed */

static char *cmd_to_str(unsigned long cmd)
{
@@ -39,11 +40,11 @@ int main(int argc, char **argv)
unsigned long size = 128 * MB;
int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 1;
unsigned long cmd = GUP_FAST_BENCHMARK;
- int flags = MAP_PRIVATE;
+ int flags = MAP_PRIVATE, touch = 0;
char *file = "/dev/zero";
char *p;

- while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHp")) != -1) {
+ while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHpz")) != -1) {
switch (opt) {
case 'a':
cmd = PIN_FAST_BENCHMARK;
@@ -110,6 +111,10 @@ int main(int argc, char **argv)
case 'H':
flags |= (MAP_HUGETLB | MAP_ANONYMOUS);
break;
+ case 'z':
+ /* fault pages in gup, do not fault in userland */
+ touch = 1;
+ break;
default:
return -1;
}
@@ -167,8 +172,12 @@ int main(int argc, char **argv)
else if (thp == 0)
madvise(p, size, MADV_NOHUGEPAGE);

- for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE)
- p[0] = 0;
+ if (touch) {
+ gup.flags |= FOLL_TOUCH;
+ } else {
+ for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE)
+ p[0] = 0;
+ }

/* Only report timing information on the *_BENCHMARK commands: */
if ((cmd == PIN_FAST_BENCHMARK) || (cmd == GUP_FAST_BENCHMARK) ||
--
2.25.1

2020-12-17 18:57:00

by Pasha Tatashin

Subject: [PATCH v4 07/10] mm/gup: change index type to long as it counts pages

In __get_user_pages_locked(), i counts a number of pages, so it should
be a long.

Signed-off-by: Pavel Tatashin <[email protected]>
---
mm/gup.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index 591d8e2dfc70..1ebb7cc2fbe4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1481,7 +1481,7 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
{
struct vm_area_struct *vma;
unsigned long vm_flags;
- int i;
+ long i;

/* calculate required read or write permissions.
* If FOLL_FORCE is set, we only require the "MAY" flags.
--
2.25.1

2020-12-17 18:57:34

by Pasha Tatashin

Subject: [PATCH v4 09/10] selftests/vm: test flag is broken

In gup_test both gup_flags and test_flags use the same flags field.
This is broken, because gup_flags can be passed as a raw value (via -F hex),
which can overwrite all the test flags.

Further, in the actual gup_test.c all the passed gup_flags are erased and
unconditionally replaced with FOLL_WRITE.

Which means that test_flags are ignored, and code like this always
performs pin dump test:

155 if (gup->flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN)
156 nr = pin_user_pages(addr, nr, gup->flags,
157 pages + i, NULL);
158 else
159 nr = get_user_pages(addr, nr, gup->flags,
160 pages + i, NULL);
161 break;

Add a new test_flags field, to allow raw gup_flags to work.
Add a new subcommand for DUMP_USER_PAGES_TEST to specify that pin test
should be performed.
Remove unconditional overwriting of gup_flags via FOLL_WRITE. But,
preserve the previous behaviour where FOLL_WRITE was the default flag,
and add a new option "-W" to unset FOLL_WRITE.

With the fix, dump works like this:

root@virtme:/# gup_test -c
---- page #0, starting from user virt addr: 0x7f8acb9e4000
page:00000000d3d2ee27 refcount:2 mapcount:1 mapping:0000000000000000
index:0x0 pfn:0x100bcf
anon flags: 0x300000000080016(referenced|uptodate|lru|swapbacked)
raw: 0300000000080016 ffffd0e204021608 ffffd0e208df2e88 ffff8ea04243ec61
raw: 0000000000000000 0000000000000000 0000000200000000 0000000000000000
page dumped because: gup_test: dump_pages() test
DUMP_USER_PAGES_TEST: done

root@virtme:/# gup_test -c -p
---- page #0, starting from user virt addr: 0x7fd19701b000
page:00000000baed3c7d refcount:1025 mapcount:1 mapping:0000000000000000
index:0x0 pfn:0x108008
anon flags: 0x300000000080014(uptodate|lru|swapbacked)
raw: 0300000000080014 ffffd0e204200188 ffffd0e205e09088 ffff8ea04243ee71
raw: 0000000000000000 0000000000000000 0000040100000000 0000000000000000
page dumped because: gup_test: dump_pages() test
DUMP_USER_PAGES_TEST: done

The refcount shows the difference between the pin and no-pin cases.
Also change the type of nr from int to long, as it counts a number of pages.

Signed-off-by: Pavel Tatashin <[email protected]>
---
mm/gup_test.c | 9 +++------
mm/gup_test.h | 1 +
tools/testing/selftests/vm/gup_test.c | 11 +++++++++--
3 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/mm/gup_test.c b/mm/gup_test.c
index e3cf78e5873e..24c70c5814ba 100644
--- a/mm/gup_test.c
+++ b/mm/gup_test.c
@@ -94,7 +94,7 @@ static int __gup_test_ioctl(unsigned int cmd,
{
ktime_t start_time, end_time;
unsigned long i, nr_pages, addr, next;
- int nr;
+ long nr;
struct page **pages;
int ret = 0;
bool needs_mmap_lock =
@@ -126,9 +126,6 @@ static int __gup_test_ioctl(unsigned int cmd,
nr = (next - addr) / PAGE_SIZE;
}

- /* Filter out most gup flags: only allow a tiny subset here: */
- gup->flags &= FOLL_WRITE;
-
switch (cmd) {
case GUP_FAST_BENCHMARK:
nr = get_user_pages_fast(addr, nr, gup->flags,
@@ -152,7 +149,7 @@ static int __gup_test_ioctl(unsigned int cmd,
pages + i, NULL);
break;
case DUMP_USER_PAGES_TEST:
- if (gup->flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN)
+ if (gup->test_flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN)
nr = pin_user_pages(addr, nr, gup->flags,
pages + i, NULL);
else
@@ -187,7 +184,7 @@ static int __gup_test_ioctl(unsigned int cmd,

start_time = ktime_get();

- put_back_pages(cmd, pages, nr_pages, gup->flags);
+ put_back_pages(cmd, pages, nr_pages, gup->test_flags);

end_time = ktime_get();
gup->put_delta_usec = ktime_us_delta(end_time, start_time);
diff --git a/mm/gup_test.h b/mm/gup_test.h
index 90a6713d50eb..b1b376b5d3b2 100644
--- a/mm/gup_test.h
+++ b/mm/gup_test.h
@@ -22,6 +22,7 @@ struct gup_test {
__u64 size;
__u32 nr_pages_per_call;
__u32 flags;
+ __u32 test_flags;
/*
* Each non-zero entry is the number of the page (1-based: first page is
* page 1, so that zero entries mean "do nothing") from the .addr base.
diff --git a/tools/testing/selftests/vm/gup_test.c b/tools/testing/selftests/vm/gup_test.c
index 6c6336dd3b7f..42c71483729f 100644
--- a/tools/testing/selftests/vm/gup_test.c
+++ b/tools/testing/selftests/vm/gup_test.c
@@ -37,13 +37,13 @@ int main(int argc, char **argv)
{
struct gup_test gup = { 0 };
unsigned long size = 128 * MB;
- int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 0;
+ int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 1;
unsigned long cmd = GUP_FAST_BENCHMARK;
int flags = MAP_PRIVATE;
char *file = "/dev/zero";
char *p;

- while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwSH")) != -1) {
+ while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHp")) != -1) {
switch (opt) {
case 'a':
cmd = PIN_FAST_BENCHMARK;
@@ -65,6 +65,10 @@ int main(int argc, char **argv)
*/
gup.which_pages[0] = 1;
break;
+ case 'p':
+ /* works only with DUMP_USER_PAGES_TEST */
+ gup.test_flags |= GUP_TEST_FLAG_DUMP_PAGES_USE_PIN;
+ break;
case 'F':
/* strtol, so you can pass flags in hex form */
gup.flags = strtol(optarg, 0, 0);
@@ -93,6 +97,9 @@ int main(int argc, char **argv)
case 'w':
write = 1;
break;
+ case 'W':
+ write = 0;
+ break;
case 'f':
file = optarg;
break;
--
2.25.1

2020-12-17 20:52:27

by Jason Gunthorpe

Subject: Re: [PATCH v4 08/10] mm/gup: limit number of gup migration failures, honor failures

On Thu, Dec 17, 2020 at 01:52:41PM -0500, Pavel Tatashin wrote:
> +/*
> + * Verify that there are no unpinnable (movable) pages; if so, return true.
> + * Otherwise an unpinnable page was found: unpin all pages and return false.
> + */
> +static bool check_and_unpin_pages(unsigned long nr_pages, struct page **pages,
> + unsigned int gup_flags)
> +{
> + unsigned long i, step;
> +
> + for (i = 0; i < nr_pages; i += step) {
> + struct page *head = compound_head(pages[i]);
> +
> + step = compound_nr(head) - (pages[i] - head);

You can't assume that all of a compound head is in the pages array,
this assumption would only work inside the page walkers if the page
was found in a PMD or something.

> + if (gup_flags & FOLL_PIN) {
> + unpin_user_pages(pages, nr_pages);

So we throw everything away? Why? That isn't how the old algorithm worked

> @@ -1654,22 +1664,55 @@ static long __gup_longterm_locked(struct mm_struct *mm,
> struct vm_area_struct **vmas,
> unsigned int gup_flags)
> {
> - unsigned long flags = 0;
> + int migrate_retry = 0;
> + int isolate_retry = 0;
> + unsigned int flags;
> long rc;
>
> - if (gup_flags & FOLL_LONGTERM)
> - flags = memalloc_pin_save();
> + if (!(gup_flags & FOLL_LONGTERM))
> + return __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
> + NULL, gup_flags);
>
> - rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas, NULL,
> - gup_flags);
> + /*
> + * Without FOLL_WRITE fault handler may return zero page, which can
> + * be in a movable zone, and also will fail to isolate during migration,
> + * thus the longterm pin will fail.
> + */
> + gup_flags &= FOLL_WRITE;

Is &= what you mean here? |= right?

Seems like we've ended up in a weird place if FOLL_LONGTERM always
includes FOLL_WRITE. Putting the zero page in ZONE_MOVABLE seems like
a bad idea, no?

> + /*
> + * Migration may fail, we retry before giving up. Also, because after
> + * migration pages[] becomes outdated, we unpin and repin all pages
> + * in the range, so pages array is repopulated with new values.
> + * Also, because of this we cannot retry migration failures in a loop
> + * without pinning/unpinning pages.
> + */

The old algorithm made continuous forward progress and only went back
to the first migration point.

> + for (; ; ) {

while (true)?

> + rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
> + NULL, gup_flags);

> + /* Return if error or if all pages are pinnable */
> + if (rc <= 0 || check_and_unpin_pages(rc, pages, gup_flags))
> + break;

So we sweep the pages list twice now?

> + /* Some pages are not pinnable, migrate them */
> + rc = migrate_movable_pages(rc, pages);
> +
> + /*
> + * If there is an error, and we tried maximum number of times
> + * bail out. Notice: we return an error code, and all pages are
> + * unpinned
> + */
> + if (rc < 0 && migrate_retry++ >= PINNABLE_MIGRATE_MAX) {
> + break;
> + } else if (rc > 0 && isolate_retry++ >= PINNABLE_ISOLATE_MAX) {
> + rc = -EBUSY;

I don't like this at all. It shouldn't be so flakey

Can you do migration without the LRU?

Jason

2020-12-17 22:05:41

by Pasha Tatashin

Subject: Re: [PATCH v4 08/10] mm/gup: limit number of gup migration failures, honor failures

Hi Jason,

Thank you for your comments. My replies below.

On Thu, Dec 17, 2020 at 3:50 PM Jason Gunthorpe <[email protected]> wrote:
>
> On Thu, Dec 17, 2020 at 01:52:41PM -0500, Pavel Tatashin wrote:
> > +/*
> > + * Verify that there are no unpinnable (movable) pages; if so, return true.
> > + * Otherwise an unpinnable page was found: unpin all pages and return false.
> > + */
> > +static bool check_and_unpin_pages(unsigned long nr_pages, struct page **pages,
> > + unsigned int gup_flags)
> > +{
> > + unsigned long i, step;
> > +
> > + for (i = 0; i < nr_pages; i += step) {
> > + struct page *head = compound_head(pages[i]);
> > +
> > + step = compound_nr(head) - (pages[i] - head);
>
> You can't assume that all of a compound head is in the pages array,
> this assumption would only work inside the page walkers if the page
> was found in a PMD or something.

I am not sure I understand your comment. The compound head is not
taken from the pages array, and is not assumed to be in it. It is exactly
the same logic that we currently have:
https://soleen.com/source/xref/linux/mm/gup.c?r=a00cda3f#1565

>
> > + if (gup_flags & FOLL_PIN) {
> > + unpin_user_pages(pages, nr_pages);
>
> So we throw everything away? Why? That isn't how the old algorithm worked

That is exactly how the old algorithm worked: if there are pages to be
migrated (i.e. not pinnable pages), we unpin everything.
See here:
https://soleen.com/source/xref/linux/mm/gup.c?r=a00cda3f#1603

If cma_page_list is not empty, unpin everything. The list is not empty
if we isolated some pages, and we isolated some pages if some of them
are not pinnable. Now we do exactly the same thing, but cleaner, and
handle errors. We must unpin everything because if we fail, no pages
should stay pinned; also, if we migrated some pages, the pages array
must be updated, so we need to call __get_user_pages_locked() again to
pin and repopulate the pages array.

>
> > @@ -1654,22 +1664,55 @@ static long __gup_longterm_locked(struct mm_struct *mm,
> > struct vm_area_struct **vmas,
> > unsigned int gup_flags)
> > {
> > - unsigned long flags = 0;
> > + int migrate_retry = 0;
> > + int isolate_retry = 0;
> > + unsigned int flags;
> > long rc;
> >
> > - if (gup_flags & FOLL_LONGTERM)
> > - flags = memalloc_pin_save();
> > + if (!(gup_flags & FOLL_LONGTERM))
> > + return __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
> > + NULL, gup_flags);
> >
> > - rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas, NULL,
> > - gup_flags);
> > + /*
> > + * Without FOLL_WRITE fault handler may return zero page, which can
> > + * be in a movable zone, and also will fail to isolate during migration,
> > + * thus the longterm pin will fail.
> > + */
> > + gup_flags &= FOLL_WRITE;
>
> Is &= what you mean here? |= right?

Right, I meant |=.

>
> Seems like we've ended up in a weird place if FOLL_LONGTERM always
> includes FOLL_WRITE. Putting the zero page in ZONE_MOVABLE seems like
> a bad idea, no?

I am not sure, I just found this problem during testing, and this is
the solution I am proposing. I am worried about limiting the zero page
to a non-movable zone, but let's see what others think about this.

>
> > + /*
> > + * Migration may fail, we retry before giving up. Also, because after
> > + * migration pages[] becomes outdated, we unpin and repin all pages
> > + * in the range, so pages array is repopulated with new values.
> > + * Also, because of this we cannot retry migration failures in a loop
> > + * without pinning/unpinning pages.
> > + */
>
> The old algorithm made continuous forward progress and only went back
> to the first migration point.

That is not right: the old code went back to the beginning. Making
continuous progress is possible, but we won't see any performance
benefit from it, because migration failures are already exceptional
scenarios where the machine is under memory stress. The truth is that
if we fail to migrate, retrying right away is unlikely to succeed, so
leaving some time between retries may even be beneficial. Also, with
continuous progress we would need to take care of some corner cases
where already-pinned pages must be unpinned when forward progress is
not possible, and we would have to adjust the pages array, start
address, etc.

>
> > + for (; ; ) {
>
> while (true)?

Hm, the same thing? :)

>
> > + rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
> > + NULL, gup_flags);
>
> > + /* Return if error or if all pages are pinnable */
> > + if (rc <= 0 || check_and_unpin_pages(rc, pages, gup_flags))
> > + break;
>
> So we sweep the pages list twice now?

Yes, but O(N) is the same. No new operation is added. Before we had
something like this:

while (npages)
check if pinnable
isolate
while (npages)
unpin
migrate
while (npages)
pin

Now:
while(npages)
check if pinnable
while(npages)
unpin
while(npages)
isolate
migrate
pin

>
> > + /* Some pages are not pinnable, migrate them */
> > + rc = migrate_movable_pages(rc, pages);
> > +
> > + /*
> > + * If there is an error, and we tried maximum number of times
> > + * bail out. Notice: we return an error code, and all pages are
> > + * unpinned
> > + */
> > + if (rc < 0 && migrate_retry++ >= PINNABLE_MIGRATE_MAX) {
> > + break;
> > + } else if (rc > 0 && isolate_retry++ >= PINNABLE_ISOLATE_MAX) {
> > + rc = -EBUSY;
>
> I don't like this at all. It shouldn't be so flakey
>
> Can you do migration without the LRU?

I do not think that is possible; we must isolate pages before migration.

Pasha

2020-12-18 09:45:09

by John Hubbard

Subject: Re: [PATCH v4 09/10] selftests/vm: test flag is broken

On 12/17/20 10:52 AM, Pavel Tatashin wrote:
> In gup_test both gup_flags and test_flags use the same flags field.
> This is broken, because gup_flags can be passed as raw value (via -F hex),
> which can overwrite all the test flags.

Thanks for finding and fixing the "stuck at 0x1" bug!

The test is not quite as broken as you think, though, because you're not
looking at the history and intent of the FOLL_WRITE, and how that is used.
And that is leading you to make wider "fixes" than I'd recommend.

The new -F command line switch is the latest, newest (and therefore
flakiest) addition, and *that* part is worth fixing up, because in
addition to being stuck at 0x1 as you noticed, it also misleads everyone
into thinking that it's intended for general gup flags. It's not.

It really was meant to be a subtest control flag (controlling the behavior
of the dump pages sub-test), as you can see in the commit log and documentation:

For example:

./gup_test -ct -F 1 0 19 0x1000

Meaning:
-c: dump pages sub-test
-t: use THP pages
-F 1: use pin_user_pages() instead of get_user_pages()
0 19 0x1000: dump pages 0, 19, and 4096


>
> Further, in the actual gup_test.c all the passed gup_flags are erased and
> unconditionally replaced with FOLL_WRITE.

This is still desirable; Kirill's original intention of only allowing
FOLL_WRITE or nothing is still what we want. There is no desire to pass
in general gup_flags.

>
> Which means that test_flags are ignored, and code like this always
> performs pin dump test:
>
> 155 if (gup->flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN)
> 156 nr = pin_user_pages(addr, nr, gup->flags,
> 157 pages + i, NULL);
> 158 else
> 159 nr = get_user_pages(addr, nr, gup->flags,
> 160 pages + i, NULL);
> 161 break;
>
> Add a new test_flags field, to allow raw gup_flags to work.

I think .test_control_flags field would be a good name, to make it very
clear that it's not destined for gup_flags. Just .test_flags is not quite
as clear a distinction from .gup_flags, as .test_control_flags is, IMHO.

> Add a new subcommand for DUMP_USER_PAGES_TEST to specify that pin test
> should be performed.

It's arguably better to just keep using the -c option instead, plus the new
.test_control_flags field. Otherwise you get a multiplication of possibilities
(and therefore, of command line options).

> Remove unconditional overwriting of gup_flags via FOLL_WRITE. But,

Nope, let's not do that.

> preserve the previous behaviour where FOLL_WRITE was the default flag,
> and add a new option "-W" to unset FOLL_WRITE.

Or that either.

>
> With the fix, dump works like this:
>
> root@virtme:/# gup_test -c
> ---- page #0, starting from user virt addr: 0x7f8acb9e4000
> page:00000000d3d2ee27 refcount:2 mapcount:1 mapping:0000000000000000
> index:0x0 pfn:0x100bcf
> anon flags: 0x300000000080016(referenced|uptodate|lru|swapbacked)
> raw: 0300000000080016 ffffd0e204021608 ffffd0e208df2e88 ffff8ea04243ec61
> raw: 0000000000000000 0000000000000000 0000000200000000 0000000000000000
> page dumped because: gup_test: dump_pages() test
> DUMP_USER_PAGES_TEST: done
>
> root@virtme:/# gup_test -c -p
> ---- page #0, starting from user virt addr: 0x7fd19701b000
> page:00000000baed3c7d refcount:1025 mapcount:1 mapping:0000000000000000
> index:0x0 pfn:0x108008
> anon flags: 0x300000000080014(uptodate|lru|swapbacked)
> raw: 0300000000080014 ffffd0e204200188 ffffd0e205e09088 ffff8ea04243ee71
> raw: 0000000000000000 0000000000000000 0000040100000000 0000000000000000
> page dumped because: gup_test: dump_pages() test
> DUMP_USER_PAGES_TEST: done
>
> Refcount shows the difference between pin vs no-pin case.
> Also change type of nr from int to long, as it counts number of pages.

Why? Is there a bug? gup only handles int-sized numbers of pages, so this
is just adding to the churn, yes?

>
> Signed-off-by: Pavel Tatashin <[email protected]>
> ---
> mm/gup_test.c | 9 +++------
> mm/gup_test.h | 1 +
> tools/testing/selftests/vm/gup_test.c | 11 +++++++++--
> 3 files changed, 13 insertions(+), 8 deletions(-)
>
> diff --git a/mm/gup_test.c b/mm/gup_test.c
> index e3cf78e5873e..24c70c5814ba 100644
> --- a/mm/gup_test.c
> +++ b/mm/gup_test.c
> @@ -94,7 +94,7 @@ static int __gup_test_ioctl(unsigned int cmd,
> {
> ktime_t start_time, end_time;
> unsigned long i, nr_pages, addr, next;
> - int nr;
> + long nr;
> struct page **pages;
> int ret = 0;
> bool needs_mmap_lock =
> @@ -126,9 +126,6 @@ static int __gup_test_ioctl(unsigned int cmd,
> nr = (next - addr) / PAGE_SIZE;
> }
>
> - /* Filter out most gup flags: only allow a tiny subset here: */
> - gup->flags &= FOLL_WRITE;
> -
> switch (cmd) {
> case GUP_FAST_BENCHMARK:
> nr = get_user_pages_fast(addr, nr, gup->flags,
> @@ -152,7 +149,7 @@ static int __gup_test_ioctl(unsigned int cmd,
> pages + i, NULL);
> break;
> case DUMP_USER_PAGES_TEST:
> - if (gup->flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN)
> + if (gup->test_flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN)
> nr = pin_user_pages(addr, nr, gup->flags,
> pages + i, NULL);
> else
> @@ -187,7 +184,7 @@ static int __gup_test_ioctl(unsigned int cmd,
>
> start_time = ktime_get();
>
> - put_back_pages(cmd, pages, nr_pages, gup->flags);
> + put_back_pages(cmd, pages, nr_pages, gup->test_flags);
>
> end_time = ktime_get();
> gup->put_delta_usec = ktime_us_delta(end_time, start_time);
> diff --git a/mm/gup_test.h b/mm/gup_test.h
> index 90a6713d50eb..b1b376b5d3b2 100644
> --- a/mm/gup_test.h
> +++ b/mm/gup_test.h
> @@ -22,6 +22,7 @@ struct gup_test {
> __u64 size;
> __u32 nr_pages_per_call;
> __u32 flags;
> + __u32 test_flags;
> /*
> * Each non-zero entry is the number of the page (1-based: first page is
> * page 1, so that zero entries mean "do nothing") from the .addr base.
> diff --git a/tools/testing/selftests/vm/gup_test.c b/tools/testing/selftests/vm/gup_test.c
> index 6c6336dd3b7f..42c71483729f 100644
> --- a/tools/testing/selftests/vm/gup_test.c
> +++ b/tools/testing/selftests/vm/gup_test.c
> @@ -37,13 +37,13 @@ int main(int argc, char **argv)
> {
> struct gup_test gup = { 0 };
> unsigned long size = 128 * MB;
> - int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 0;
> + int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 1;

Let's leave that alone, as noted above.

> unsigned long cmd = GUP_FAST_BENCHMARK;
> int flags = MAP_PRIVATE;
> char *file = "/dev/zero";
> char *p;
>
> - while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwSH")) != -1) {
> + while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHp")) != -1) {

And also as noted above I don't think we need this or any of the remaining
hunks below.

> switch (opt) {
> case 'a':
> cmd = PIN_FAST_BENCHMARK;
> @@ -65,6 +65,10 @@ int main(int argc, char **argv)
> */
> gup.which_pages[0] = 1;
> break;
> + case 'p':
> + /* works only with DUMP_USER_PAGES_TEST */
> + gup.test_flags |= GUP_TEST_FLAG_DUMP_PAGES_USE_PIN;
> + break;
> case 'F':
> /* strtol, so you can pass flags in hex form */
> gup.flags = strtol(optarg, 0, 0);
> @@ -93,6 +97,9 @@ int main(int argc, char **argv)
> case 'w':
> write = 1;
> break;
> + case 'W':
> + write = 0;
> + break;
> case 'f':
> file = optarg;
> break;
>

thanks,
--
John Hubbard
NVIDIA

2020-12-18 09:45:56

by John Hubbard

Subject: Re: [PATCH v4 09/10] selftests/vm: test flag is broken

On 12/18/20 1:06 AM, John Hubbard wrote:
>> Add a new test_flags field, to allow raw gup_flags to work.
>
> I think .test_control_flags field would be a good name, to make it very
> clear that it's not destined for gup_flags. Just .test_flags is not quite
> as clear a distinction from .gup_flags, as .test_control_flags is, IMHO.
>

And maybe renaming .flags to .gup_flags, just to make it really clear.


thanks,
--
John Hubbard
NVIDIA

2020-12-18 09:49:59

by Michal Hocko

Subject: Re: [PATCH v4 03/10] mm: apply per-task gfp constraints in fast path

On Thu 17-12-20 13:52:36, Pavel Tatashin wrote:
[..]
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 469016222cdb..d9546f5897f4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3234,11 +3234,12 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> gfp_t gfp_mask, nodemask_t *nodemask)
> {
> + gfp_t current_gfp_mask = current_gfp_context(gfp_mask);
> unsigned long nr_reclaimed;
> struct scan_control sc = {
> .nr_to_reclaim = SWAP_CLUSTER_MAX,
> - .gfp_mask = current_gfp_context(gfp_mask),
> - .reclaim_idx = gfp_zone(gfp_mask),
> + .gfp_mask = current_gfp_mask,
> + .reclaim_idx = gfp_zone(current_gfp_mask),
> .order = order,
> .nodemask = nodemask,
> .priority = DEF_PRIORITY,
> @@ -4158,17 +4159,18 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
> {
> /* Minimum pages needed in order to stay on node */
> const unsigned long nr_pages = 1 << order;
> + gfp_t current_gfp_mask = current_gfp_context(gfp_mask);
> struct task_struct *p = current;
> unsigned int noreclaim_flag;
> struct scan_control sc = {
> .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> - .gfp_mask = current_gfp_context(gfp_mask),
> + .gfp_mask = current_gfp_mask,
> .order = order,
> .priority = NODE_RECLAIM_PRIORITY,
> .may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
> .may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
> .may_swap = 1,
> - .reclaim_idx = gfp_zone(gfp_mask),
> + .reclaim_idx = gfp_zone(current_gfp_mask),
> };
>
> trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, order,

I was hoping we had agreed these are not necessary and they shouldn't be
touched in the patch.
--
Michal Hocko
SUSE Labs

2020-12-18 09:53:19

by Michal Hocko

Subject: Re: [PATCH v4 06/10] memory-hotplug.rst: add a note about ZONE_MOVABLE and page pinning

On Thu 17-12-20 13:52:39, Pavel Tatashin wrote:
> Document the special handling of page pinning when ZONE_MOVABLE present.
>
> Signed-off-by: Pavel Tatashin <[email protected]>
> Suggested-by: David Hildenbrand <[email protected]>

Acked-by: Michal Hocko <[email protected]>

> ---
> Documentation/admin-guide/mm/memory-hotplug.rst | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst
> index 5c4432c96c4b..c6618f99f765 100644
> --- a/Documentation/admin-guide/mm/memory-hotplug.rst
> +++ b/Documentation/admin-guide/mm/memory-hotplug.rst
> @@ -357,6 +357,15 @@ creates ZONE_MOVABLE as following.
> Unfortunately, there is no information to show which memory block belongs
> to ZONE_MOVABLE. This is TBD.
>
> +.. note::
> + Techniques that rely on long-term pinnings of memory (especially, RDMA and
> + vfio) are fundamentally problematic with ZONE_MOVABLE and, therefore, memory
> + hot remove. Pinned pages cannot reside on ZONE_MOVABLE, to guarantee that
> + memory can still get hot removed - be aware that pinning can fail even if
> + there is plenty of free memory in ZONE_MOVABLE. In addition, using
> + ZONE_MOVABLE might make page pinning more expensive, because pages have to be
> + migrated off that zone first.
> +
> .. _memory_hotplug_how_to_offline_memory:
>
> How to offline memory
> --
> 2.25.1
>

--
Michal Hocko
SUSE Labs

2020-12-18 09:54:17

by Michal Hocko

Subject: Re: [PATCH v4 05/10] mm/gup: migrate pinned pages out of movable zone

On Thu 17-12-20 13:52:38, Pavel Tatashin wrote:
> + * 1. Pinned pages: (long-term) pinning of movable pages is avoided
> + * when pages are pinned and faulted, but it is still possible that
> + * address space already has pages in ZONE_MOVABLE at the time when
> + * pages are pinned (i.e. user has touched that memory before
> + * pinning). In such case we try to migrate them to a different zone,
> + * but if migration fails the pages can still end up pinned in
> + * ZONE_MOVABLE. In such case, memory offlining might retry a long
> + * time and will only succeed once user application unpins pages.

I still dislike this. Pinning can fail, so there shouldn't be any reason
to break the MOVABLE constraint for something that can be handled. If
anything, there should be very good reasoning behind this decision,
documented.
--
Michal Hocko
SUSE Labs

2020-12-18 09:56:22

by Michal Hocko

Subject: Re: [PATCH v4 07/10] mm/gup: change index type to long as it counts pages

On Thu 17-12-20 13:52:40, Pavel Tatashin wrote:
> In __get_user_pages_locked() i counts number of pages which should be
> long.

Do we know of any caller who would like to pin so many pages that the
count wouldn't fit into an int? I suspect this is more about syncing the
types of nr_pages and the iterator, right? It would be better to be
explicit about this in the changelog.

> Signed-off-by: Pavel Tatashin <[email protected]>

Acked-by: Michal Hocko <[email protected]>

> ---
> mm/gup.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 591d8e2dfc70..1ebb7cc2fbe4 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1481,7 +1481,7 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
> {
> struct vm_area_struct *vma;
> unsigned long vm_flags;
> - int i;
> + long i;
>
> /* calculate required read or write permissions.
> * If FOLL_FORCE is set, we only require the "MAY" flags.
> --
> 2.25.1

--
Michal Hocko
SUSE Labs

2020-12-18 10:49:18

by Michal Hocko

Subject: Re: [PATCH v4 08/10] mm/gup: limit number of gup migration failures, honor failures

On Thu 17-12-20 13:52:41, Pavel Tatashin wrote:
[...]
> +#define PINNABLE_MIGRATE_MAX 10
> +#define PINNABLE_ISOLATE_MAX 100

Why would we need to limit the isolation retries? Those should always be
temporary failures unless I am missing something. I am not sure about
PINNABLE_MIGRATE_MAX either. Why do we want to limit that? migrate_pages
already implements its own retry logic, so why do you want to count
retries on top of that? I do agree that the existing logic is suboptimal,
because a migration failure might be ephemeral or permanent, but that
should IMHO be addressed at migrate_pages (resp. unmap_and_move), which
should simply report failures that are permanent - e.g. any potential
pre-existing long-term pin - if that is possible at all. If not, what
would cause a permanent migration failure? OOM?
--
Michal Hocko
SUSE Labs

2020-12-18 12:27:24

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH v4 03/10] mm: apply per-task gfp constraints in fast path

On Fri, Dec 18, 2020 at 4:36 AM Michal Hocko <[email protected]> wrote:
>
> On Thu 17-12-20 13:52:36, Pavel Tatashin wrote:
> [..]
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 469016222cdb..d9546f5897f4 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -3234,11 +3234,12 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> > unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> > gfp_t gfp_mask, nodemask_t *nodemask)
> > {
> > + gfp_t current_gfp_mask = current_gfp_context(gfp_mask);
> > unsigned long nr_reclaimed;
> > struct scan_control sc = {
> > .nr_to_reclaim = SWAP_CLUSTER_MAX,
> > - .gfp_mask = current_gfp_context(gfp_mask),
> > - .reclaim_idx = gfp_zone(gfp_mask),
> > + .gfp_mask = current_gfp_mask,
> > + .reclaim_idx = gfp_zone(current_gfp_mask),
> > .order = order,
> > .nodemask = nodemask,
> > .priority = DEF_PRIORITY,
> > @@ -4158,17 +4159,18 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
> > {
> > /* Minimum pages needed in order to stay on node */
> > const unsigned long nr_pages = 1 << order;
> > + gfp_t current_gfp_mask = current_gfp_context(gfp_mask);
> > struct task_struct *p = current;
> > unsigned int noreclaim_flag;
> > struct scan_control sc = {
> > .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> > - .gfp_mask = current_gfp_context(gfp_mask),
> > + .gfp_mask = current_gfp_mask,
> > .order = order,
> > .priority = NODE_RECLAIM_PRIORITY,
> > .may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
> > .may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
> > .may_swap = 1,
> > - .reclaim_idx = gfp_zone(gfp_mask),
> > + .reclaim_idx = gfp_zone(current_gfp_mask),
> > };
> >
> > trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, order,
>
> I was hoping we had agreed these are not necessary and they shouldn't be
> touched in the patch.

Thank you for noticing. I was sure I had removed these changes; not sure
what happened :(
They will be gone in the next version.

Thank you,
Pasha

> --
> Michal Hocko
> SUSE Labs

2020-12-18 12:31:32

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH v4 05/10] mm/gup: migrate pinned pages out of movable zone

On Fri, Dec 18, 2020 at 4:43 AM Michal Hocko <[email protected]> wrote:
>
> On Thu 17-12-20 13:52:38, Pavel Tatashin wrote:
> > + * 1. Pinned pages: (long-term) pinning of movable pages is avoided
> > + * when pages are pinned and faulted, but it is still possible that
> > + * address space already has pages in ZONE_MOVABLE at the time when
> > + * pages are pinned (i.e. the user has touched that memory before
> > + * pinning). In that case we try to migrate them to a different zone,
> > + * but if migration fails the pages can still end up pinned in
> > + * ZONE_MOVABLE. In that case, memory offlining might retry a long
> > + * time and will only succeed once the user application unpins the pages.
>
> I still dislike this. Pinning can fail, so there shouldn't be any reason
> to break the MOVABLE constraint for something that can be handled. If
> anything, there should be very good reasoning behind this decision,
> documented.

This is basically the current behaviour; after patch 8 we can never pin
pages in the movable zone, so I will update this comment in that patch.

Thank you,
Pasha

> --
> Michal Hocko
> SUSE Labs

2020-12-18 12:35:33

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH v4 07/10] mm/gup: change index type to long as it counts pages

On Fri, Dec 18, 2020 at 4:51 AM Michal Hocko <[email protected]> wrote:
>
> On Thu 17-12-20 13:52:40, Pavel Tatashin wrote:
> > In __get_user_pages_locked(), i counts the number of pages and should
> > therefore be a long.
>
> Do we know of any caller who would like to pin so many pages that they
> wouldn't fit into an int? I suspect this is more about syncing the types
> of nr_pages and the iterator, right? It would be better to be explicit
> about this in the changelog.

It is to sync the types; I will add that to the changelog.

But in general, 32 bits is increasingly too small for values
proportional to page counts: an int's worth of 4 KiB pages is 8 TiB.
That may be a bit large for pinning today, but I can imagine RDMA at
this scale in the future.

>
> > Signed-off-by: Pavel Tatashin <[email protected]>
>
> Acked-by: Michal Hocko <[email protected]>
>
> > ---
> > mm/gup.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 591d8e2dfc70..1ebb7cc2fbe4 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1481,7 +1481,7 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
> > {
> > struct vm_area_struct *vma;
> > unsigned long vm_flags;
> > - int i;
> > + long i;
> >
> > /* calculate required read or write permissions.
> > * If FOLL_FORCE is set, we only require the "MAY" flags.
> > --
> > 2.25.1
>
> --
> Michal Hocko
> SUSE Labs

2020-12-18 12:45:30

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH v4 08/10] mm/gup: limit number of gup migration failures, honor failures

On Fri, Dec 18, 2020 at 5:46 AM Michal Hocko <[email protected]> wrote:
>
> On Thu 17-12-20 13:52:41, Pavel Tatashin wrote:
> [...]
> > +#define PINNABLE_MIGRATE_MAX 10
> > +#define PINNABLE_ISOLATE_MAX 100
>
> Why would we need to limit the isolation retries? Those should always be
> temporary failures unless I am missing something.

Actually, during development I was retrying isolation errors
infinitely, but during testing I found a hang: when FOLL_TOUCH without
FOLL_WRITE is passed (a fault in the kernel without the write flag), the
zero page is faulted in. Isolation of the zero page failed every time,
so the process hung.

Since then, I have fixed this problem by adding FOLL_WRITE
unconditionally to FOLL_LONGTERM, but I was worried about other possible
bugs that would cause hangs, so I decided to limit isolation errors. If
you think it is not necessary, I can remove the limit on isolation
retries.

> I am not sure about the
> PINNABLE_MIGRATE_MAX either. Why do we want to limit that? migrate_pages
> already implements its retry logic why do you want to count retries on
> top of that? I do agree that the existing logic is suboptimal because

True, but again, just recently I worked on a race bug where pages can
end up on a per-cpu list after lru_add_drain_all() but before isolation,
so I think a retry is necessary.

> the migration failure might be ephemeral or permanent but that should be
> IMHO addressed at migrate_pages (resp. unmap_and_move) and simply report
> failures that are permanent - e.g. any potential pre-existing long term
> pin - if that is possible at all. If not what would cause permanent
> migration failure? OOM?

Yes, OOM is the main cause of migration failures. There are also a few
cases described in the movable zone comment, where it is possible that
during boot some pages were allocated by memblock in the movable zone
due to a lack of memory resources (even if those resources were added
later). Hardware page poisoning is another rare example.

> --
> Michal Hocko
> SUSE Labs

2020-12-18 13:49:28

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 08/10] mm/gup: limit number of gup migration failures, honor failures


> Am 18.12.2020 um 13:43 schrieb Pavel Tatashin <[email protected]>:
>
> On Fri, Dec 18, 2020 at 5:46 AM Michal Hocko <[email protected]> wrote:
>>
>> On Thu 17-12-20 13:52:41, Pavel Tatashin wrote:
>> [...]
>>> +#define PINNABLE_MIGRATE_MAX 10
>>> +#define PINNABLE_ISOLATE_MAX 100
>>
>> Why would we need to limit the isolation retries? Those should always be
>> temporary failures unless I am missing something.
>
> Actually, during development I was retrying isolation errors
> infinitely, but during testing I found a hang: when FOLL_TOUCH without
> FOLL_WRITE is passed (a fault in the kernel without the write flag), the
> zero page is faulted in. Isolation of the zero page failed every time,
> so the process hung.
>
> Since then, I have fixed this problem by adding FOLL_WRITE
> unconditionally to FOLL_LONGTERM, but I was worried about other possible
> bugs that would cause hangs, so I decided to limit isolation errors. If
> you think it is not necessary, I can remove the limit on isolation
> retries.
>
>> I am not sure about the
>> PINNABLE_MIGRATE_MAX either. Why do we want to limit that? migrate_pages
>> already implements its retry logic why do you want to count retries on
>> top of that? I do agree that the existing logic is suboptimal because
>
> True, but again, just recently I worked on a race bug where pages can
> end up on a per-cpu list after lru_add_drain_all() but before isolation,
> so I think a retry is necessary.
>
>> the migration failure might be ephemeral or permanent but that should be
>> IMHO addressed at migrate_pages (resp. unmap_and_move) and simply report
>> failures that are permanent - e.g. any potential pre-existing long term
>> pin - if that is possible at all. If not what would cause permanent
>> migration failure? OOM?
>
> Yes, OOM is the main cause of migration failures. There are also a few
> cases described in the movable zone comment, where it is possible that
> during boot some pages were allocated by memblock in the movable zone
> due to a lack of memory resources (even if those resources were added
> later). Hardware page poisoning is another rare example.
>

How is concurrent migration handled? Like memory offlining, compaction, alloc_contig_range() while trying to pin?


>> --
>> Michal Hocko
>> SUSE Labs
>

2020-12-18 13:49:47

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v4 05/10] mm/gup: migrate pinned pages out of movable zone

On Fri 18-12-20 07:24:53, Pavel Tatashin wrote:
> On Fri, Dec 18, 2020 at 4:43 AM Michal Hocko <[email protected]> wrote:
> >
> > On Thu 17-12-20 13:52:38, Pavel Tatashin wrote:
> > > + * 1. Pinned pages: (long-term) pinning of movable pages is avoided
> > > + * when pages are pinned and faulted, but it is still possible that
> > > + * address space already has pages in ZONE_MOVABLE at the time when
> > > + * pages are pinned (i.e. the user has touched that memory before
> > > + * pinning). In that case we try to migrate them to a different zone,
> > > + * but if migration fails the pages can still end up pinned in
> > > + * ZONE_MOVABLE. In that case, memory offlining might retry a long
> > > + * time and will only succeed once the user application unpins the pages.
> >
> > I still dislike this. Pinning can fail, so there shouldn't be any reason
> > to break the MOVABLE constraint for something that can be handled. If
> > anything, there should be very good reasoning behind this decision,
> > documented.
>
> This is basically the current behaviour; after patch 8 we can never pin
> pages in the movable zone, so I will update this comment in that patch.

Then it would be much easier for review to state that the existing
behavior is unchanged, and not to update this comment just to remove it
in a later patch. This patch should be a straightforward change of the
condition for which pages to migrate (plus some renaming, which should
be reasonably easy to follow). Maybe it would be even better to do the
renaming separately, without any functional changes, and make only the
change in the condition here.

--
Michal Hocko
SUSE Labs

2020-12-18 13:50:30

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v4 08/10] mm/gup: limit number of gup migration failures, honor failures

On Fri 18-12-20 07:43:15, Pavel Tatashin wrote:
> On Fri, Dec 18, 2020 at 5:46 AM Michal Hocko <[email protected]> wrote:
> >
> > On Thu 17-12-20 13:52:41, Pavel Tatashin wrote:
> > [...]
> > > +#define PINNABLE_MIGRATE_MAX 10
> > > +#define PINNABLE_ISOLATE_MAX 100
> >
> > Why would we need to limit the isolation retries? Those should always be
> > temporary failures unless I am missing something.
>
> Actually, during development I was retrying isolation errors
> infinitely, but during testing I found a hang: when FOLL_TOUCH without
> FOLL_WRITE is passed (a fault in the kernel without the write flag), the
> zero page is faulted in. Isolation of the zero page failed every time,
> so the process hung.

Why would you migrate the zero page in the first place? Simply
instantiate it.

> Since then, I have fixed this problem by adding FOLL_WRITE
> unconditionally to FOLL_LONGTERM, but I was worried about other possible
> bugs that would cause hangs, so I decided to limit isolation errors. If
> you think it is not necessary, I can remove the limit on isolation
> retries.

It should have a really good reason to exist. Worries about some corner
cases are definitely not a reason to put in an awkward retry mechanism.
My historical experience is that these things are extremely hard to get
rid of later.

> > I am not sure about the
> > PINNABLE_MIGRATE_MAX either. Why do we want to limit that? migrate_pages
> > already implements its retry logic why do you want to count retries on
> > top of that? I do agree that the existing logic is suboptimal because
>
> True, but again, just recently I worked on a race bug where pages can
> end up on a per-cpu list after lru_add_drain_all() but before isolation,
> so I think a retry is necessary.

There are ways to make sure pages do not end up on the pcp list. Have a
look at how hotplug does that.

> > the migration failure might be ephemeral or permanent but that should be
> > IMHO addressed at migrate_pages (resp. unmap_and_move) and simply report
> > failures that are permanent - e.g. any potential pre-existing long term
> > pin - if that is possible at all. If not what would cause permanent
> > migration failure? OOM?
>
> Yes, OOM is the main cause of migration failures.

Then you can treat ENOMEM as a permanent failure.

> There are also a few
> cases described in the movable zone comment, where it is possible that
> during boot some pages were allocated by memblock in the movable zone
> due to a lack of memory resources (even if those resources were added later),

Do you have any examples? I find it hard to imagine that somebody would
be pinning early boot allocations.

> hardware page poisoning is another rare example.

Could you elaborate please?
--
Michal Hocko
SUSE Labs

2020-12-18 14:22:01

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v4 08/10] mm/gup: limit number of gup migration failures, honor failures

On Thu, Dec 17, 2020 at 05:02:03PM -0500, Pavel Tatashin wrote:
> Hi Jason,
>
> Thank you for your comments. My replies below.
>
> On Thu, Dec 17, 2020 at 3:50 PM Jason Gunthorpe <[email protected]> wrote:
> >
> > On Thu, Dec 17, 2020 at 01:52:41PM -0500, Pavel Tatashin wrote:
> > > +/*
> > > + * Verify that there are no unpinnable (movable) pages; if so, return true.
> > > + * Otherwise, if an unpinnable page is found, return false and unpin all pages.
> > > + */
> > > +static bool check_and_unpin_pages(unsigned long nr_pages, struct page **pages,
> > > + unsigned int gup_flags)
> > > +{
> > > + unsigned long i, step;
> > > +
> > > + for (i = 0; i < nr_pages; i += step) {
> > > + struct page *head = compound_head(pages[i]);
> > > +
> > > + step = compound_nr(head) - (pages[i] - head);
> >
> > You can't assume that all of a compound head is in the pages array,
> > this assumption would only work inside the page walkers if the page
> > was found in a PMD or something.
>
> I am not sure I understand your comment. The compound head is not
> taken from the pages array, and not assumed to be in it. It is exactly
> the same logic as that we currently have:
> https://soleen.com/source/xref/linux/mm/gup.c?r=a00cda3f#1565

Oh, that existing logic is wrong too :( Another bug.

You can't skip pages in the pages[] array under the assumption that they
are contiguous, i.e. the i += step is wrong.

> >
> > > + if (gup_flags & FOLL_PIN) {
> > > + unpin_user_pages(pages, nr_pages);
> >
> > So we throw everything away? Why? That isn't how the old algorithm worked
>
> It is exactly how the old algorithm worked: if there are pages to be
> migrated (non-pinnable pages), we unpin everything.
> See here:
> https://soleen.com/source/xref/linux/mm/gup.c?r=a00cda3f#1603

Hmm, OK, but I'm not sure that is great either

> cleaner, and handle errors. We must unpin everything because if we
> fail, no pages should stay pinned, and also if we migrated some pages,
> the pages array must be updated, so we need to call
> __get_user_pages_locked() again to pin and repopulate the pages array.

However the page can't be unpinned until it is put on the LRU (and I'm
hoping that the LRU is enough of a 'lock' to make that safe, no idea)

> > I don't like this at all. It shouldn't be so flakey
> >
> > Can you do migration without the LRU?
>
> I do not think it is possible, we must isolate pages before migration.

I don't like this at all :( Lots of stuff relies on GUP, and introducing
random flakiness like this is not good.

Jason

2020-12-19 06:03:27

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] selftests/vm: test faulting in kernel, and verify pinnable pages

On 12/17/20 10:52 AM, Pavel Tatashin wrote:
>

Hi Pavel,

This all looks pretty good to me, with just a couple of minor
doubts interleaved with the documentation tweaks:

a) I'm not yet sure if the is_pinnable_page() concept is a keeper. If it's
not for some reason, then we should revisit this patch.

b) I don't yet understand why FOLL_TOUCH from gup/pup is a critical part
of the test.


> When pages are pinned they can be faulted in userland and migrated, and
> they can be faulted right in kernel without migration.
>
> In either case, the pinned pages must end-up being pinnable (not movable).

Let's delete the above two sentences, which are confusing as currently
worded, and just keep approximately the last sentence below.

>
> Add a new test without touching pages in userland, and use FOLL_TOUCH
> instead. Also, verify that pinned pages are pinnable.

Maybe this instead:

Add a new test to gup_test, to verify that only "pinnable" pages are
pinned. Also, use gup/pup + FOLL_TOUCH to fault in the pages, rather
than faulting them in from user space.


? But I don't know why that second point is important. Is it actually
important in order to have a valid test? If so, why?


>
> Signed-off-by: Pavel Tatashin <[email protected]>
> ---
> mm/gup_test.c | 6 ++++++
> tools/testing/selftests/vm/gup_test.c | 17 +++++++++++++----
> 2 files changed, 19 insertions(+), 4 deletions(-)
>
> diff --git a/mm/gup_test.c b/mm/gup_test.c
> index 24c70c5814ba..24fd542091ee 100644
> --- a/mm/gup_test.c
> +++ b/mm/gup_test.c
> @@ -52,6 +52,12 @@ static void verify_dma_pinned(unsigned int cmd, struct page **pages,
>
> dump_page(page, "gup_test failure");
> break;
> + } else if (cmd == PIN_LONGTERM_BENCHMARK &&
> + WARN(!is_pinnable_page(page),
> + "pages[%lu] is NOT pinnable but pinned\n",
> + i)) {
> + dump_page(page, "gup_test failure");
> + break;
> }
> }
> break;
> diff --git a/tools/testing/selftests/vm/gup_test.c b/tools/testing/selftests/vm/gup_test.c
> index 42c71483729f..f08cc97d424d 100644
> --- a/tools/testing/selftests/vm/gup_test.c
> +++ b/tools/testing/selftests/vm/gup_test.c
> @@ -13,6 +13,7 @@
>
> /* Just the flags we need, copied from mm.h: */
> #define FOLL_WRITE 0x01 /* check pte is writable */
> +#define FOLL_TOUCH 0x02 /* mark page accessed */


Aha, now I see why you wanted to pass other GUP flags, in the previous
patch. I think it's OK to pass this set of possible flags (as
.gup_flags) through ioctl, yes.

However (this is about the previous patch), I *think* we're better off
leaving the gup_test behavior as: "default is read-only pages, but you
can pass in -w to specify FOLL_WRITE". As opposed to passing in raw
flags from the command line. And yes, I realize that my -F option seemed
to recommend the latter... I'm regretting that -F approach now.

The other direction to go might be to stop doing that, and shift over to
just let the user specify FOLL_* flags directly on the command line, but
IMHO there's no need for that (yet), and it's a little less error-prone
to constrain it.

This leads to: change the "-F 1", to some other better-named option,
perhaps. Open to suggestion there.


>
> static char *cmd_to_str(unsigned long cmd)
> {
> @@ -39,11 +40,11 @@ int main(int argc, char **argv)
> unsigned long size = 128 * MB;
> int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 1;
> unsigned long cmd = GUP_FAST_BENCHMARK;
> - int flags = MAP_PRIVATE;
> + int flags = MAP_PRIVATE, touch = 0;


Silly nit, can we put it on its own line? This pre-existing mess of
declarations makes it hard to read everything. One item per line is
easier on the reader, who is often just looking for a single item at a
time. Actually why not rename it slightly while we're here (see below),
maybe to this:

int use_foll_touch = 0;


> char *file = "/dev/zero";
> char *p;
>
> - while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHp")) != -1) {
> + while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHpz")) != -1) {

Yes, this seems worth its own command line option.

> switch (opt) {
> case 'a':
> cmd = PIN_FAST_BENCHMARK;
> @@ -110,6 +111,10 @@ int main(int argc, char **argv)
> case 'H':
> flags |= (MAP_HUGETLB | MAP_ANONYMOUS);
> break;
> + case 'z':
> + /* fault pages in gup, do not fault in userland */

How about:
/*
* Use gup/pup(FOLL_TOUCH), *instead* of faulting
* pages in from user space.
*/
use_foll_touch = 1;

> + touch = 1;
> + break;
> default:
> return -1;
> }
> @@ -167,8 +172,12 @@ int main(int argc, char **argv)
> else if (thp == 0)
> madvise(p, size, MADV_NOHUGEPAGE);
>
> - for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE)
> - p[0] = 0;
> + if (touch) {
> + gup.flags |= FOLL_TOUCH;
> + } else {
> + for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE)
> + p[0] = 0;
> + }

OK.

>
> /* Only report timing information on the *_BENCHMARK commands: */
> if ((cmd == PIN_FAST_BENCHMARK) || (cmd == GUP_FAST_BENCHMARK) ||
>

thanks,
--
John Hubbard
NVIDIA

2020-12-19 15:24:41

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] selftests/vm: test faulting in kernel, and verify pinnable pages

On Sat, Dec 19, 2020 at 12:57 AM John Hubbard <[email protected]> wrote:
>
> On 12/17/20 10:52 AM, Pavel Tatashin wrote:
> >
>
> Hi Pavel,
>
> This all looks pretty good to me, with just a couple of minor
> doubts interleaved with the documentation tweaks:
>
> a) I'm not yet sure if the is_pinnable_page() concept is a keeper. If it's
> not for some reason, then we should revisit this patch.
>
> b) I don't yet understand why FOLL_TOUCH from gup/pup is a critical part
> of the test.
>
>
> > When pages are pinned they can be faulted in userland and migrated, and
> > they can be faulted right in kernel without migration.
> >
> > In either case, the pinned pages must end-up being pinnable (not movable).
>
> Let's delete the above two sentences, which are confusing as currently
> worded, and just keep approximately the last sentence below.

Sure.

>
> >
> > Add a new test without touching pages in userland, and use FOLL_TOUCH
> > instead. Also, verify that pinned pages are pinnable.
>
> Maybe this instead:
>
> Add a new test to gup_test, to verify that only "pinnable" pages are
> pinned. Also, use gup/pup + FOLL_TOUCH to fault in the pages, rather
> than faulting them in from user space.

OK

>
>
> ? But I don't know why that second point is important. Is it actually
> important in order to have a valid test? If so, why?

It is crucial because when pages are faulted in from userland they may
end up in a movable zone, and we then spend time migrating them during
pinning, so performance suffers. When pages are faulted in from the
kernel, they do not need to be migrated, as they should end up in the
right zone from the beginning, and performance should be fast.

Here is a sample run pinning 512M, 4096 pages at a time:

Before my changes:
Fault in userland: get:4787 put:1093
Fault in kernel: get:80577 put:1187
In both cases pages in the movable zone were pinned. The fault in the
kernel is slower compared to userland, because the pages need to be
faulted in and zeroed before returning.

With my changes:
Fault in userland: get:150793 put:1111
Fault in kernel: get:81066 put:1091
In both cases pages were pinned in the correct zone. The fault in the
kernel is faster than userland, because the userland pages needed to be
migrated from the movable zone.


>
>
> >
> > Signed-off-by: Pavel Tatashin <[email protected]>
> > ---
> > mm/gup_test.c | 6 ++++++
> > tools/testing/selftests/vm/gup_test.c | 17 +++++++++++++----
> > 2 files changed, 19 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/gup_test.c b/mm/gup_test.c
> > index 24c70c5814ba..24fd542091ee 100644
> > --- a/mm/gup_test.c
> > +++ b/mm/gup_test.c
> > @@ -52,6 +52,12 @@ static void verify_dma_pinned(unsigned int cmd, struct page **pages,
> >
> > dump_page(page, "gup_test failure");
> > break;
> > + } else if (cmd == PIN_LONGTERM_BENCHMARK &&
> > + WARN(!is_pinnable_page(page),
> > + "pages[%lu] is NOT pinnable but pinned\n",
> > + i)) {
> > + dump_page(page, "gup_test failure");
> > + break;
> > }
> > }
> > break;
> > diff --git a/tools/testing/selftests/vm/gup_test.c b/tools/testing/selftests/vm/gup_test.c
> > index 42c71483729f..f08cc97d424d 100644
> > --- a/tools/testing/selftests/vm/gup_test.c
> > +++ b/tools/testing/selftests/vm/gup_test.c
> > @@ -13,6 +13,7 @@
> >
> > /* Just the flags we need, copied from mm.h: */
> > #define FOLL_WRITE 0x01 /* check pte is writable */
> > +#define FOLL_TOUCH 0x02 /* mark page accessed */
>
>
> Aha, now I see why you wanted to pass other GUP flags, in the previous
> patch. I think it's OK to pass this set of possible flags (as
> .gup_flags) through ioctl, yes.

Sure, and I will rename .flags to .gup_flags as you previously suggested.

>
> However (this is about the previous patch), I *think* we're better off
> leaving the gup_test behavior as: "default is read-only pages, but you
> can pass in -w to specify FOLL_WRITE". As opposed to passing in raw
> flags from the command line. And yes, I realize that my -F option seemed
> to recommend the latter... I'm regretting that -F approach now.

But that would reverse the current default behaviour, where FOLL_WRITE
is always set. I introduced "-W" so as not to break backward
compatibility: the "-w" option was already available, but since "-w" is
the default it basically does nothing, whereas "-W" removes FOLL_WRITE.
Do you want to reverse the current behaviour?

>
> The other direction to go might be to stop doing that, and shift over to
> just let the user specify FOLL_* flags directly on the command line, but
> IMHO there's no need for that (yet), and it's a little less error-prone
> to constrain it.
>
> This leads to: change the "-F 1", to some other better-named option,
> perhaps. Open to suggestion there.

Perhaps disable "-F n" and print a warning about the ignored flag
until it is found to be useful?

>
>
> >
> > static char *cmd_to_str(unsigned long cmd)
> > {
> > @@ -39,11 +40,11 @@ int main(int argc, char **argv)
> > unsigned long size = 128 * MB;
> > int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 1;
> > unsigned long cmd = GUP_FAST_BENCHMARK;
> > - int flags = MAP_PRIVATE;
> > + int flags = MAP_PRIVATE, touch = 0;
>
>
> Silly nit, can we put it on its own line? This pre-existing mess of
> declarations makes it hard to read everything. One item per line is
> easier on the reader, who is often just looking for a single item at a
> time. Actually why not rename it slightly while we're here (see below),
> maybe to this:
>
> int use_foll_touch = 0;

Sure, I will move it. I also prefer one initialization per line, but
tried to keep the current style. I can also correct the other
initializations in this function.

>
>
> > char *file = "/dev/zero";
> > char *p;
> >
> > - while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHp")) != -1) {
> > + while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHpz")) != -1) {
>
> Yes, this seems worth its own command line option.
>
> > switch (opt) {
> > case 'a':
> > cmd = PIN_FAST_BENCHMARK;
> > @@ -110,6 +111,10 @@ int main(int argc, char **argv)
> > case 'H':
> > flags |= (MAP_HUGETLB | MAP_ANONYMOUS);
> > break;
> > + case 'z':
> > + /* fault pages in gup, do not fault in userland */
>
> How about:
> /*
> * Use gup/pup(FOLL_TOUCH), *instead* of faulting
> * pages in from user space.
> */
> use_foll_touch = 1;

Sure.

>
> > + touch = 1;
> > + break;
> > default:
> > return -1;
> > }
> > @@ -167,8 +172,12 @@ int main(int argc, char **argv)
> > else if (thp == 0)
> > madvise(p, size, MADV_NOHUGEPAGE);
> >
> > - for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE)
> > - p[0] = 0;
> > + if (touch) {
> > + gup.flags |= FOLL_TOUCH;
> > + } else {
> > + for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE)
> > + p[0] = 0;
> > + }
>
> OK.

Thank you for your review.

Pasha

2020-12-19 23:54:53

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] selftests/vm: test faulting in kernel, and verify pinnable pages

On 12/19/20 7:22 AM, Pavel Tatashin wrote:
...
>> Add a new test to gup_test, to verify that only "pinnable" pages are
>> pinned. Also, use gup/pup + FOLL_TOUCH to fault in the pages, rather
>> than faulting them in from user space.
>
> OK
>
>>
>>
>> ? But I don't know why that second point is important. Is it actually
>> important in order to have a valid test? If so, why?
>
> It is crucial because when pages are faulted in from userland they may
> end up in a movable zone, and we then spend time migrating them during
> pinning, so performance suffers. When pages are faulted in from the
> kernel, they do not need to be migrated, as they should end up in the
> right zone from the beginning, and performance should be fast.
>
> Here is a sample run pinning 512M, 4096 pages at a time:
>
> Before my changes:
> Fault in userland: get:4787 put:1093
> Fault in kernel: get:80577 put:1187
> In both cases pages in the movable zone were pinned. The fault in the
> kernel is slower compared to userland, because the pages need to be
> faulted in and zeroed before returning.
>
> With my changes:
> Fault in userland: get:150793 put:1111
> Fault in kernel: get:81066 put:1091
> In both cases pages were pinned in the correct zone. The fault in the
> kernel is faster than userland, because the userland pages needed to be
> migrated from the movable zone.
>
>

OK, I see. So then that second paragraph should read more like this,
instead:

"
Add a new test to gup_test, to help verify that the gup/pup
(get_user_pages() / pin_user_pages()) behavior with respect to pinnable
and movable pages is reasonable and correct. Specifically, provide a way
to:

1) Verify that only "pinnable" pages are pinned. This is checked
automatically for you.

2) Verify that gup/pup performance is reasonable. This requires
comparing benchmarks between doing gup/pup on pages that have been
pre-faulted in from user space, vs. doing gup/pup on pages that are not
faulted in until gup/pup time (via FOLL_TOUCH).
"

Now, I was going to write the following sentence to finish off the last
paragraph above:

"This decision is controlled with the new -z command line option."

...but see below, before we go with that.


>>
>>
>>>
>>> Signed-off-by: Pavel Tatashin <[email protected]>
>>> ---
>>> mm/gup_test.c | 6 ++++++
>>> tools/testing/selftests/vm/gup_test.c | 17 +++++++++++++----
>>> 2 files changed, 19 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/mm/gup_test.c b/mm/gup_test.c
>>> index 24c70c5814ba..24fd542091ee 100644
>>> --- a/mm/gup_test.c
>>> +++ b/mm/gup_test.c
>>> @@ -52,6 +52,12 @@ static void verify_dma_pinned(unsigned int cmd, struct page **pages,
>>>
>>> dump_page(page, "gup_test failure");
>>> break;
>>> + } else if (cmd == PIN_LONGTERM_BENCHMARK &&
>>> + WARN(!is_pinnable_page(page),
>>> + "pages[%lu] is NOT pinnable but pinned\n",
>>> + i)) {
>>> + dump_page(page, "gup_test failure");
>>> + break;
>>> }
>>> }
>>> break;
>>> diff --git a/tools/testing/selftests/vm/gup_test.c b/tools/testing/selftests/vm/gup_test.c
>>> index 42c71483729f..f08cc97d424d 100644
>>> --- a/tools/testing/selftests/vm/gup_test.c
>>> +++ b/tools/testing/selftests/vm/gup_test.c
>>> @@ -13,6 +13,7 @@
>>>
>>> /* Just the flags we need, copied from mm.h: */
>>> #define FOLL_WRITE 0x01 /* check pte is writable */
>>> +#define FOLL_TOUCH 0x02 /* mark page accessed */
>>
>>
>> Aha, now I see why you wanted to pass other GUP flags, in the previous
>> patch. I think it's OK to pass this set of possible flags (as
>> .gup_flags) through ioctl, yes.
>
> Sure, and I will rename .flags to .gup_flags as you previously suggested.
>
>>
>> However (this is about the previous patch), I *think* we're better off
>> leaving the gup_test behavior as: "default is read-only pages, but you
>> can pass in -w to specify FOLL_WRITE". As opposed to passing in raw
>> flags from the command line. And yes, I realize that my -F option seemed
>> to recommend the latter... I'm regretting that -F approach now.
>
> But, that would reverse the current default behaviour, where FOLL_WRITE
> is always set. I introduced "-W" so as not to break backward
> compatibility: the "-w" option was already available, but since write is
> the default, "-w" basically does nothing, whereas "-W" removes
> FOLL_WRITE. Do you want to reverse the current behaviour?
>
>>
>> The other direction to go might be to stop doing that, and shift over to
>> just let the user specify FOLL_* flags directly on the command line, but
>> IMHO there's no need for that (yet), and it's a little less error-prone
>> to constrain it.
>>
>> This leads to: change the "-F 1", to some other better-named option,
>> perhaps. Open to suggestion there.
>
> Perhaps disable "-F n" and print a warning about the ignored flag
> until it is found to be useful?
>


OK, now it's clear what to do. My (other) mistake with -F was that I
mis-predicted what the next future feature would need: I thought we'd be
adding sub-flags for the dump pages test, but instead we are (at least
so far) adding flags for the gup/pup calls. So let's adapt to that.

Some of this contradicts what I proposed earlier, but now it forms a
consistent API as a whole. How's this sound:

* Change the meaning of the -F option, slightly: it now means "raw flags
to pass into .gup_flags and thus to gup/pup, and these will override
other options".

* Use -F 0x2 (the value of FOLL_TOUCH) instead of the -z option.

* We can remain backward compatible with -w at the command line level,
with a caveat: passing in "-F 0" or "-F 2" (anything that doesn't set
FOLL_WRITE) will override -w. The rule would be that raw -F flags
override other settings, just for a consistent way to think about it.

* No need, obviously, for any option to undo -w. -F can do that.

* Do *not* provide direct access to .test_control_flags. Just set those
according to a new command line option. Because these aren't (yet)
exploding.

If this is too much fooling around, let me know and I'll send some diffs
or even a patch. I don't want to take away too much time from the main
patch, and I feel some obligation to get gup_test straightened out. :)
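As a rough userspace sketch of the rule proposed above (illustrative only; `parse_gup_flags` is an invented helper, not actual gup_test code), raw -F flags could override the other flag options like this, regardless of option order:

```c
#include <stdlib.h>
#include <string.h>

#define FOLL_WRITE 0x01 /* copied from mm.h */
#define FOLL_TOUCH 0x02

/*
 * Sketch of the proposed rule: -w sets FOLL_WRITE, but raw flags
 * given with -F override every other flag-affecting option, no
 * matter where they appear on the command line.
 */
unsigned int parse_gup_flags(int argc, char **argv)
{
	unsigned int gup_flags = 0;
	unsigned int raw_flags = 0;
	int have_raw = 0;
	int i;

	for (i = 1; i < argc; i++) {
		if (strcmp(argv[i], "-w") == 0) {
			gup_flags |= FOLL_WRITE;
		} else if (strcmp(argv[i], "-F") == 0 && i + 1 < argc) {
			raw_flags = (unsigned int)strtoul(argv[++i], NULL, 0);
			have_raw = 1;
		}
	}
	/* Raw -F flags, when present, win over everything else. */
	return have_raw ? raw_flags : gup_flags;
}
```

With this rule, "-F 0x2 -w" still yields only FOLL_TOUCH, which is the "raw flags override other settings" behaviour described above.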




>>
>>
>>>
>>> static char *cmd_to_str(unsigned long cmd)
>>> {
>>> @@ -39,11 +40,11 @@ int main(int argc, char **argv)
>>> unsigned long size = 128 * MB;
>>> int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 1;
>>> unsigned long cmd = GUP_FAST_BENCHMARK;
>>> - int flags = MAP_PRIVATE;
>>> + int flags = MAP_PRIVATE, touch = 0;
>>
>>
>> Silly nit, can we put it on its own line? This pre-existing mess of
>> declarations makes it hard to read everything. One item per line is
>> easier on the reader, who is often just looking for a single item at a
>> time. Actually why not rename it slightly while we're here (see below),
>> maybe to this:
>>
>> int use_foll_touch = 0;
>
> Sure, I will move it. I also prefer one initialization per line, but I
> tried to keep the current style. I can also correct the other
> initializations in this function.
>
>>
>>
>>> char *file = "/dev/zero";
>>> char *p;
>>>
>>> - while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHp")) != -1) {
>>> + while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHpz")) != -1) {
>>
>> Yes, this seems worth its own command line option.
>>
>>> switch (opt) {
>>> case 'a':
>>> cmd = PIN_FAST_BENCHMARK;
>>> @@ -110,6 +111,10 @@ int main(int argc, char **argv)
>>> case 'H':
>>> flags |= (MAP_HUGETLB | MAP_ANONYMOUS);
>>> break;
>>> + case 'z':
>>> + /* fault pages in gup, do not fault in userland */
>>
>> How about:
>> /*
>> * Use gup/pup(FOLL_TOUCH), *instead* of faulting
>> * pages in from user space.
>> */
>> use_foll_touch = 1;
>
> Sure.
>
>>
>>> + touch = 1;
>>> + break;
>>> default:
>>> return -1;
>>> }
>>> @@ -167,8 +172,12 @@ int main(int argc, char **argv)
>>> else if (thp == 0)
>>> madvise(p, size, MADV_NOHUGEPAGE);
>>>
>>> - for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE)
>>> - p[0] = 0;
>>> + if (touch) {
>>> + gup.flags |= FOLL_TOUCH;
>>> + } else {
>>> + for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE)
>>> + p[0] = 0;
>>> + }

Here, the test would change to "if (gup.gup_flags & FOLL_TOUCH)".

And we need a comment somewhere (maybe right above that) that says something
like:

/*
* FOLL_TOUCH, in gup_test, is used as an either/or case: either
* fault pages in from the kernel via FOLL_TOUCH, or fault them
* in here, from user space. This allows comparison of performance
* between those two cases.
*/

...OR, you know what? We can do even better. Because the above is too quirky
and weird. Instead, let's leave FOLL_TOUCH "pure": it's just a gup flag.

And then, add an option (maybe -z, after all) that says, "skip faulting pages
in from user space". That's a lot clearer! And you can tell it's better,
because we don't have to write a chunk of documentation explaining the odd
quirks. Ha, it feels better!
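A minimal sketch of that cleaner design (names invented for illustration; this is not the actual gup_test code): the new option only decides whether the user-space pre-fault loop runs, while FOLL_TOUCH stays a plain gup flag passed separately:

```c
#include <stddef.h>

#define PAGE_SIZE 4096UL

/*
 * Sketch: a "-z"-style option ("skip faulting pages in from user
 * space") simply suppresses the pre-fault loop. It says nothing
 * about FOLL_TOUCH, which would be passed independently as a raw
 * gup flag; the two concerns stay decoupled.
 */
void prefault_from_userspace(char *p, unsigned long size, int skip_prefault)
{
	unsigned long off;

	if (skip_prefault)
		return;	/* pages will be faulted in by gup/pup instead */

	/* Touch one byte per page so every page is faulted in. */
	for (off = 0; off < size; off += PAGE_SIZE)
		p[off] = 0;
}
```

The benchmark then compares the two paths directly: pre-faulted from user space versus faulted at gup/pup time.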

What do you think? (Again, if you want me to send over some diffs that put this
all together, let me know. I'm kind of embarrassed at all the typing required here.)

thanks,
--
John Hubbard
NVIDIA

2021-01-13 19:17:29

by Pasha Tatashin

Subject: Re: [PATCH v4 05/10] mm/gup: migrate pinned pages out of movable zone

On Fri, Dec 18, 2020 at 8:08 AM Michal Hocko <[email protected]> wrote:
>
> On Fri 18-12-20 07:24:53, Pavel Tatashin wrote:
> > On Fri, Dec 18, 2020 at 4:43 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Thu 17-12-20 13:52:38, Pavel Tatashin wrote:
> > > > + * 1. Pinned pages: (long-term) pinning of movable pages is avoided
> > > > + * when pages are pinned and faulted, but it is still possible that
> > > > + * address space already has pages in ZONE_MOVABLE at the time when
> > > > + * pages are pinned (i.e. user has touches that memory before
> > > > + * pinning). In such case we try to migrate them to a different zone,
> > > > + * but if migration fails the pages can still end-up pinned in
> > > > + * ZONE_MOVABLE. In such case, memory offlining might retry a long
> > > > + * time and will only succeed once user application unpins pages.
> > >
> > > I still dislike this. Pinning can fail so there shouldn't be any reasons
> > > to break MOVABLE constrain for something that can be handled. If
> > > anything there should be a very good reasoning behind this decision
> > > documented.
> >
> > This is basically current behaviour, after patch 8, we can never pin
> > pages in the movable zone, so I will update this comment in that
> > patch.
>
> Then it would be much easier for review to state that the existing
> behavior is unchanged and do not update this comment just to remove it
> in a later patch. Because this patch should be straightforward change of
> the condition which pages to migrate (+some renaming which should be
> reasonably easy to follow).

Makes sense, I will update this comment correctly right away when the
behaviour changes.

2021-01-13 19:48:52

by Pasha Tatashin

Subject: Re: [PATCH v4 08/10] mm/gup: limit number of gup migration failures, honor failures

On Fri, Dec 18, 2020 at 9:19 AM Jason Gunthorpe <[email protected]> wrote:
>
> On Thu, Dec 17, 2020 at 05:02:03PM -0500, Pavel Tatashin wrote:
> > Hi Jason,
> >
> > Thank you for your comments. My replies below.
> >
> > On Thu, Dec 17, 2020 at 3:50 PM Jason Gunthorpe <[email protected]> wrote:
> > >
> > > On Thu, Dec 17, 2020 at 01:52:41PM -0500, Pavel Tatashin wrote:
> > > > +/*
> > > > + * Verify that there are no unpinnable (movable) pages, if so return true.
> > > > + * Otherwise an unpinnable pages is found return false, and unpin all pages.
> > > > + */
> > > > +static bool check_and_unpin_pages(unsigned long nr_pages, struct page **pages,
> > > > + unsigned int gup_flags)
> > > > +{
> > > > + unsigned long i, step;
> > > > +
> > > > + for (i = 0; i < nr_pages; i += step) {
> > > > + struct page *head = compound_head(pages[i]);
> > > > +
> > > > + step = compound_nr(head) - (pages[i] - head);
> > >
> > > You can't assume that all of a compound head is in the pages array,
> > > this assumption would only work inside the page walkers if the page
> > > was found in a PMD or something.
> >
> > I am not sure I understand your comment. The compound head is not
> > taken from the pages array, and not assumed to be in it. It is exactly
> > the same logic as that we currently have:
> > https://soleen.com/source/xref/linux/mm/gup.c?r=a00cda3f#1565
>
> Oh, that existing logic is wrong too :( Another bug.

I do not think there is a bug.

> You can't skip pages in the pages[] array under the assumption they
> are contiguous. ie the i+=step is wrong.

If pages[i] is part of a compound page, the other parts of that compound
page must be sequential in this array (the run might start in the middle
of the compound page, though). If they were not sequential, the
translation would be broken, since these pages also correspond to the
virtual addresses [start, start + nr_pages) in __gup_longterm_locked().

For example, when __gup_longterm_locked() returns, the following must
hold:
PHYSICAL VIRTUAL
page_to_phys(pages[0]) -> start + 0 * PAGE_SIZE
page_to_phys(pages[1]) -> start + 1 * PAGE_SIZE
page_to_phys(pages[2]) -> start + 2 * PAGE_SIZE
page_to_phys(pages[3]) -> start + 3 * PAGE_SIZE
...
page_to_phys(pages[nr_pages - 1]) -> start + (nr_pages - 1) * PAGE_SIZE

If any pages[i] is part of a compound page (i.e. a huge page), no other
page can sit in the middle of that compound page's run in the array.
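The contiguity assumption can be illustrated with a tiny userspace model of the step computation (a toy sketch; `toy_page`, `step_for` and the field names are invented for illustration, not kernel code):

```c
/*
 * Toy userspace model of the step computation in the patch:
 *   step = compound_nr(head) - (pages[i] - head);
 * i.e. how many entries remain in this compound page, assuming the
 * tail pages follow contiguously in pages[].
 */
struct toy_page {
	int idx;	/* this entry's index within its compound page */
	int nr;		/* compound_nr(): pages in the compound page */
};

int step_for(const struct toy_page *pg)
{
	return pg->nr - pg->idx;
}
```

For a base (order-0) page the step is 1; for the second page of a 4-page compound page it is 3, so the walk would skip the remaining tails in one jump.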

>
> > >
> > > > + if (gup_flags & FOLL_PIN) {
> > > > + unpin_user_pages(pages, nr_pages);
> > >
> > > So we throw everything away? Why? That isn't how the old algorithm worked
> >
> > It is exactly like the old algorithm worked: if there are pages to be
> > migrated (not pinnable pages) we unpinned everything.
> > See here:
> > https://soleen.com/source/xref/linux/mm/gup.c?r=a00cda3f#1603
>
> Hmm, OK, but I'm not sure that is great either

I will send out a new series. We can discuss it there if you have
suggestions for improvement here.

>
> > cleaner, and handle errors. We must unpin everything because if we
> > fail, no pages should stay pinned, and also if we migrated some pages,
> > the pages array must be updated, so we need to call
> > __get_user_pages_locked() pin and repopulated pages array.
>
> However the page can't be unpinned until it is put on the LRU (and I'm
> hoping that the LRU is enough of a 'lock' to make that safe, no idea)
>
> > > I don't like this at all. It shouldn't be so flakey
> > >
> > > Can you do migration without the LRU?
> >
> > I do not think it is possible, we must isolate pages before migration.
>
> I don't like this at all :( Lots of stuff relies on GUP, introducing a
> random flakiness like this not good.

This is actually the standard migration procedure; elsewhere in the
kernel we migrate pages in exactly the same fashion: isolate first, then
migrate. Isolation works only for LRU pages.

>
> Jason

2021-01-13 19:52:48

by Pasha Tatashin

Subject: Re: [PATCH v4 08/10] mm/gup: limit number of gup migration failures, honor failures

On Fri, Dec 18, 2020 at 8:14 AM Michal Hocko <[email protected]> wrote:
>
> On Fri 18-12-20 07:43:15, Pavel Tatashin wrote:
> > On Fri, Dec 18, 2020 at 5:46 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Thu 17-12-20 13:52:41, Pavel Tatashin wrote:
> > > [...]
> > > > +#define PINNABLE_MIGRATE_MAX 10
> > > > +#define PINNABLE_ISOLATE_MAX 100
> > >
> > > Why would we need to limit the isolation retries. Those should always be
> > > temporary failure unless I am missing something.
> >
> > Actually, during development, I was retrying isolation errors
> > infinitely, but during testing I found a hang: when FOLL_TOUCH
> > without FOLL_WRITE is passed (fault in kernel without the write
> > flag), the zero page is faulted in. Isolation of the zero page
> > failed every time, so the process hung.
>
> Why would you migrate zero page in the first place? Simply instantiate
> it.

This is exactly the idea behind adding FOLL_WRITE: it causes real,
zeroed pages to be allocated in the right zone right away, so no
migration is necessary.

>
> > Since then, I fixed this problem by adding FOLL_WRITE unconditionally
> > to FOLL_LONGTERM, but I was worried about other possible bugs that
> > would cause hangs, so I decided to limit isolation errors. If you
> > think it is not necessary, I can remove the limit on isolation
> > retries.
>
> It should have a really good reason to exist. Worries about some corner
> cases is definitely not a reason to put some awkward retry mechanism.
> My historical experience is that these things are extremely hard to get
> rid of later.
>
> > > I am not sure about the
> > > PINNABLE_MIGRATE_MAX either. Why do we want to limit that? migrate_pages
> > > already implements its retry logic why do you want to count retries on
> > > top of that? I do agree that the existing logic is suboptimal because
> >
> > True, but again, just recently I worked on a race bug where pages can
> > end up on a per-cpu list after lru_add_drain_all() but before
> > isolation, so I think retry is necessary.
>
> There are ways to make sure pages are not ending on pcp list. Have a
> look at how hotplug does that.

Sounds good to me, I will remove PINNABLE_MIGRATE_MAX, and leave only
PINNABLE_ISOLATE_MAX for transient isolation errors.

>
> > > the migration failure might be ephemeral or permanent but that should be
> > > IMHO addressed at migrate_pages (resp. unmap_and_move) and simply report
> > > failures that are permanent - e.g. any potential pre-existing long term
> > > pin - if that is possible at all. If not what would cause permanent
> > > migration failure? OOM?
> >
> > Yes, OOM is the main cause for migration failures.
>
> Then you can treat ENOMEM as a permanent failure.

Sounds good.
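A hedged sketch of the retry policy that emerges from this exchange (a userspace model; `migrate_until_done`, `demo_step` and `enomem_step` are invented names, not kernel functions): retry transient failures indefinitely, but treat -ENOMEM as a permanent failure:

```c
#include <errno.h>

/*
 * Model of a migrate step: returns 0 on success, -EAGAIN for a
 * transient failure (e.g. isolation race), -ENOMEM for a
 * permanent one.
 */
typedef int (*migrate_fn)(void *ctx);

/* Retry transient failures; only success or a permanent error
 * (-ENOMEM) ends the loop. */
int migrate_until_done(migrate_fn step, void *ctx)
{
	int ret;

	do {
		ret = step(ctx);
	} while (ret == -EAGAIN);
	return ret; /* 0 or -ENOMEM */
}

/* Demo step: transiently fails twice, then succeeds. */
int demo_step(void *ctx)
{
	int *attempts = ctx;

	return ++*attempts < 3 ? -EAGAIN : 0;
}

/* Demo step: permanently out of memory. */
int enomem_step(void *ctx)
{
	(void)ctx;
	return -ENOMEM;
}
```

This mirrors the agreed direction: no arbitrary PINNABLE_MIGRATE_MAX counter on top of migrate_pages(), just a distinction between transient and permanent errors.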

2021-01-13 19:58:53

by Jason Gunthorpe

Subject: Re: [PATCH v4 08/10] mm/gup: limit number of gup migration failures, honor failures

On Wed, Jan 13, 2021 at 02:43:50PM -0500, Pavel Tatashin wrote:
> On Fri, Dec 18, 2020 at 9:19 AM Jason Gunthorpe <[email protected]> wrote:
> >
> > On Thu, Dec 17, 2020 at 05:02:03PM -0500, Pavel Tatashin wrote:
> > > Hi Jason,
> > >
> > > Thank you for your comments. My replies below.
> > >
> > > On Thu, Dec 17, 2020 at 3:50 PM Jason Gunthorpe <[email protected]> wrote:
> > > >
> > > > On Thu, Dec 17, 2020 at 01:52:41PM -0500, Pavel Tatashin wrote:
> > > > > +/*
> > > > > + * Verify that there are no unpinnable (movable) pages, if so return true.
> > > > > + * Otherwise an unpinnable pages is found return false, and unpin all pages.
> > > > > + */
> > > > > +static bool check_and_unpin_pages(unsigned long nr_pages, struct page **pages,
> > > > > + unsigned int gup_flags)
> > > > > +{
> > > > > + unsigned long i, step;
> > > > > +
> > > > > + for (i = 0; i < nr_pages; i += step) {
> > > > > + struct page *head = compound_head(pages[i]);
> > > > > +
> > > > > + step = compound_nr(head) - (pages[i] - head);
> > > >
> > > > You can't assume that all of a compound head is in the pages array,
> > > > this assumption would only work inside the page walkers if the page
> > > > was found in a PMD or something.
> > >
> > > I am not sure I understand your comment. The compound head is not
> > > taken from the pages array, and not assumed to be in it. It is exactly
> > > the same logic as that we currently have:
> > > https://soleen.com/source/xref/linux/mm/gup.c?r=a00cda3f#1565
> >
> > Oh, that existing logic is wrong too :( Another bug.
>
> I do not think there is a bug.
>
> > You can't skip pages in the pages[] array under the assumption they
> > are contiguous. ie the i+=step is wrong.
>
> If pages[i] is part of a compound page, the other parts of this page
> must be sequential in this array for this compound page

That is true only if the PMD points to the page. If a PTE points to a
tail page then there is no requirement that the other PTEs are
contiguous with the compound page.

At this point we have no idea if the GUP logic got this compound page
as a head page in a PMD or as a tail page from a PTE, so we can't
assume a contiguous run of addresses.

Look at split_huge_pmd() - it doesn't break up the compound page it
just converts the PMD to a PTE array and scatters the tail pages to
the PTE.
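The objection can be illustrated with a toy userspace model (all names here are invented; this is not kernel code): when a PTE maps a tail page, the next entry in pages[] need not belong to the same compound page, so a walk that advances by `step` can jump past entries it never inspected:

```c
/*
 * Toy model: each pages[] entry records its index within its
 * compound page and the compound page's size.
 */
struct toy_page {
	int idx;	/* index within its compound page */
	int nr;		/* pages in the compound page */
};

/* How many entries does the step-skipping walk actually inspect? */
int entries_inspected(const struct toy_page *pages, int n)
{
	int i, inspected = 0;

	for (i = 0; i < n; i += pages[i].nr - pages[i].idx)
		inspected++;
	return inspected;
}
```

If pages[0] is the third page (idx 2) of a 4-page compound page mapped via a PTE, the step is 2 and the walk never looks at pages[1] at all, even though pages[1] may belong to an entirely different page.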

I understand Matt is pushing on this idea more by having compound
pages in the page cache, but still mapping tail pages when required.

> This is actually standard migration procedure, elsewhere in the kernel
> we migrate pages in exactly the same fashion: isolate and later
> migrate. The isolation works for LRU only pages.

But do other places cause a userspace visible random failure when LRU
isolation fails?

I don't like it at all, what is the user supposed to do?

Jason

2021-01-13 20:08:04

by Pasha Tatashin

Subject: Re: [PATCH v4 08/10] mm/gup: limit number of gup migration failures, honor failures

> > > Oh, that existing logic is wrong too :( Another bug.
> >
> > I do not think there is a bug.
> >
> > > You can't skip pages in the pages[] array under the assumption they
> > > are contiguous. ie the i+=step is wrong.
> >
> > If pages[i] is part of a compound page, the other parts of this page
> > must be sequential in this array for this compound page
>
> That is true only if the PMD points to the page. If the PTE points to
> a tail page then there is no requirement that other PTEs are
> contiguous with the compount page.
>
> At this point we have no idea if the GUP logic got this compound page
> as a head page in a PMD or as a tail page from a PTE, so we can't
> assume a contiguous run of addresses.

I see, I will fix this bug upstream as a separate patch in my series,
and keep the fix when my changes are applied.

>
> Look at split_huge_pmd() - it doesn't break up the compound page it
> just converts the PMD to a PTE array and scatters the tail pages to
> the PTE.

Got it. Unfortunately the fix will deoptimize the code, since it will
have to check, for every page, whether it is part of the previous
compound page or not.

>
> I understand Matt is pushing on this idea more by having compound
> pages in the page cache, but still mapping tail pages when required.
>
> > This is actually standard migration procedure, elsewhere in the kernel
> > we migrate pages in exactly the same fashion: isolate and later
> > migrate. The isolation works for LRU only pages.
>
> But do other places cause a userspace visible random failure when LRU
> isolation fails?

Makes sense, I will remove the maximum retries for isolation and retry
indefinitely, the same as is done during memory hot-remove. So we will
fail only when migration fails.

>
> I don't like it at all, what is the user supposed to do?
>
> Jason

2021-01-13 23:45:43

by Jason Gunthorpe

Subject: Re: [PATCH v4 08/10] mm/gup: limit number of gup migration failures, honor failures

On Wed, Jan 13, 2021 at 03:05:41PM -0500, Pavel Tatashin wrote:

> Makes sense, I will remove maximum retries for isolation, and retry
> indefinitely, the same as it is done during memory hot-remove. So, we
> will fail only when migration fails.

It would be good to make this also as a stand alone pre-fix as well..

Thanks,
Jason

2021-01-15 18:13:49

by Pasha Tatashin

Subject: Re: [PATCH v4 08/10] mm/gup: limit number of gup migration failures, honor failures

On Wed, Jan 13, 2021 at 3:05 PM Pavel Tatashin
<[email protected]> wrote:
>
> > > > Oh, that existing logic is wrong too :( Another bug.
> > >
> > > I do not think there is a bug.
> > >
> > > > You can't skip pages in the pages[] array under the assumption they
> > > > are contiguous. ie the i+=step is wrong.
> > >
> > > If pages[i] is part of a compound page, the other parts of this page
> > > must be sequential in this array for this compound page
> >
> > That is true only if the PMD points to the page. If the PTE points to
> > a tail page then there is no requirement that other PTEs are
> > contiguous with the compount page.
> >
> > At this point we have no idea if the GUP logic got this compound page
> > as a head page in a PMD or as a tail page from a PTE, so we can't
> > assume a contiguous run of addresses.
>
> I see, I will fix this bug in an upstream as a separate patch in my
> series, and keep the fix when my fixes are applied.
>
> >
> > Look at split_huge_pmd() - it doesn't break up the compound page it
> > just converts the PMD to a PTE array and scatters the tail pages to
> > the PTE.

Hi Jason,

I've been thinking about this some more. Again, I am not sure this is
a bug. I understand split_huge_pmd() may split a PMD-sized page into
PTEs and leave the compound page intact. However, in order for pages[]
to have non-sequential addresses within a compound page, those PTEs must
also be migrated after split_huge_pmd(); and when we migrate them, we
will either migrate the whole compound page or do
split_huge_page_to_list(), which in turn does ClearPageCompound().
Please let me know if I am missing something.

Thank you,
Pasha

>
> Got it, unfortunately the fix will deoptimize the code by having to
> check every page if it is part of a previous compound page or not.
>
> >
> > I understand Matt is pushing on this idea more by having compound
> > pages in the page cache, but still mapping tail pages when required.
> >
> > > This is actually standard migration procedure, elsewhere in the kernel
> > > we migrate pages in exactly the same fashion: isolate and later
> > > migrate. The isolation works for LRU only pages.
> >
> > But do other places cause a userspace visible random failure when LRU
> > isolation fails?
>
> Makes sense, I will remove maximum retries for isolation, and retry
> indefinitely, the same as it is done during memory hot-remove. So, we
> will fail only when migration fails.
>
> >
> > I don't like it at all, what is the user supposed to do?
> >
> > Jason

2021-01-15 18:42:48

by Jason Gunthorpe

Subject: Re: [PATCH v4 08/10] mm/gup: limit number of gup migration failures, honor failures

On Fri, Jan 15, 2021 at 01:10:27PM -0500, Pavel Tatashin wrote:

> I've been thinking about this some more. Again, I am not sure this is
> a bug. I understand split_huge_pmd() may split the PMD size page into
> PTEs and leave the compound page intact. However, in order for pages[]
> to have non sequential addresses in compound page, those PTEs must
> also be migrated after split_huge_pmd(),

Why focus on migrated? Anything could happen to those PTEs: they could
be COW'd, they could be mmap/munmap'd, etc.

Jason