2020-12-02 05:26:59

by Pasha Tatashin

[permalink] [raw]
Subject: [PATCH 0/6] prohibit pinning pages in ZONE_MOVABLE

When a page is pinned it cannot be moved, and its physical address stays
the same until the page is unpinned.

This functionality is useful because it allows userland to implement
DMA access. For example, it is used by vfio in vfio_pin_pages().

However, this functionality breaks the memory hotplug/hotremove
assumption that pages in ZONE_MOVABLE can always be migrated.

This patch series fixes the issue by forcing new allocations made during
page pinning to skip ZONE_MOVABLE, and by migrating any existing pages
out of ZONE_MOVABLE during pinning.

It uses the same logic that is currently used by CMA, and extends the
functionality to all allocations.
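
The pinning path then ends up looking roughly like this (a simplified
sketch of __gup_longterm_locked() after the series; helper names are
taken from patches 4 and 6, error handling omitted):

    flags = memalloc_nomovable_save();
    /* new allocations in this scope have __GFP_MOVABLE cleared */
    rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
                                 NULL, gup_flags);
    /* migrate anything that still sits in ZONE_MOVABLE (or CMA) */
    rc = check_and_migrate_movable_pages(mm, start, rc, pages, vmas,
                                         gup_flags);
    memalloc_nomovable_restore(flags);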

For more information read the discussion [1] about this problem.

[1] https://lore.kernel.org/lkml/CA+CK2bBffHBxjmb9jmSKacm0fJMinyt3Nhk8Nx6iudcQSj80_w@mail.gmail.com/

Pavel Tatashin (6):
mm/gup: perform check_dax_vmas only when FS_DAX is enabled
mm/gup: don't pin migrated cma pages in movable zone
mm/gup: make __gup_longterm_locked common
mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_NOMOVABLE
mm: honor PF_MEMALLOC_NOMOVABLE for all allocations
mm/gup: migrate pinned pages out of movable zone

include/linux/migrate.h | 1 +
include/linux/sched.h | 2 +-
include/linux/sched/mm.h | 21 +++------
include/trace/events/migrate.h | 3 +-
mm/gup.c | 82 +++++++++++++---------------------
mm/hugetlb.c | 4 +-
mm/page_alloc.c | 26 ++++++-----
7 files changed, 58 insertions(+), 81 deletions(-)

--
2.25.1


2020-12-02 05:27:00

by Pasha Tatashin

[permalink] [raw]
Subject: [PATCH 4/6] mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_NOMOVABLE

PF_MEMALLOC_NOCMA is used for longterm pinning and has the effect of
clearing __GFP_MOVABLE or prohibiting allocations from ZONE_MOVABLE.

We will prohibit allocating from ZONE_MOVABLE any pages that are getting
longterm pinned, and we want to unify and re-use this flag for that. So,
rename it to the generic PF_MEMALLOC_NOMOVABLE.
Also rename:
memalloc_nocma_save()/memalloc_nocma_restore()
to
memalloc_nomovable_save()/memalloc_nomovable_restore()
and make the new functions common (no longer dependent on CONFIG_CMA).

Signed-off-by: Pavel Tatashin <[email protected]>
---
include/linux/sched.h | 2 +-
include/linux/sched/mm.h | 21 +++++----------------
mm/gup.c | 4 ++--
mm/hugetlb.c | 4 ++--
mm/page_alloc.c | 4 ++--
5 files changed, 12 insertions(+), 23 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 76cd21fa5501..f1bf05f5f5fa 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1548,7 +1548,7 @@ extern struct pid *cad_pid;
#define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
-#define PF_MEMALLOC_NOCMA 0x10000000 /* All allocation request will have _GFP_MOVABLE cleared */
+#define PF_MEMALLOC_NOMOVABLE 0x10000000 /* All allocation request will have _GFP_MOVABLE cleared */
#define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */
#define PF_SUSPEND_TASK 0x80000000 /* This thread called freeze_processes() and should not be frozen */

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index d5ece7a9a403..5bb9a6b69479 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -254,29 +254,18 @@ static inline void memalloc_noreclaim_restore(unsigned int flags)
current->flags = (current->flags & ~PF_MEMALLOC) | flags;
}

-#ifdef CONFIG_CMA
-static inline unsigned int memalloc_nocma_save(void)
+static inline unsigned int memalloc_nomovable_save(void)
{
- unsigned int flags = current->flags & PF_MEMALLOC_NOCMA;
+ unsigned int flags = current->flags & PF_MEMALLOC_NOMOVABLE;

- current->flags |= PF_MEMALLOC_NOCMA;
+ current->flags |= PF_MEMALLOC_NOMOVABLE;
return flags;
}

-static inline void memalloc_nocma_restore(unsigned int flags)
+static inline void memalloc_nomovable_restore(unsigned int flags)
{
- current->flags = (current->flags & ~PF_MEMALLOC_NOCMA) | flags;
+ current->flags = (current->flags & ~PF_MEMALLOC_NOMOVABLE) | flags;
}
-#else
-static inline unsigned int memalloc_nocma_save(void)
-{
- return 0;
-}
-
-static inline void memalloc_nocma_restore(unsigned int flags)
-{
-}
-#endif

#ifdef CONFIG_MEMCG
DECLARE_PER_CPU(struct mem_cgroup *, int_active_memcg);
diff --git a/mm/gup.c b/mm/gup.c
index 0e2de888a8b0..724d8a65e1df 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1726,7 +1726,7 @@ static long __gup_longterm_locked(struct mm_struct *mm,
if (!vmas_tmp)
return -ENOMEM;
}
- flags = memalloc_nocma_save();
+ flags = memalloc_nomovable_save();
}

rc = __get_user_pages_locked(mm, start, nr_pages, pages,
@@ -1749,7 +1749,7 @@ static long __gup_longterm_locked(struct mm_struct *mm,
rc = check_and_migrate_cma_pages(mm, start, rc, pages,
vmas_tmp, gup_flags);
out:
- memalloc_nocma_restore(flags);
+ memalloc_nomovable_restore(flags);
}

if (vmas_tmp != vmas)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 37f15c3c24dc..02213c74ed6b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1033,10 +1033,10 @@ static void enqueue_huge_page(struct hstate *h, struct page *page)
static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
{
struct page *page;
- bool nocma = !!(current->flags & PF_MEMALLOC_NOCMA);
+ bool nomovable = !!(current->flags & PF_MEMALLOC_NOMOVABLE);

list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
- if (nocma && is_migrate_cma_page(page))
+ if (nomovable && is_migrate_cma_page(page))
continue;

if (PageHWPoison(page))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eaa227a479e4..611799c72da5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3772,8 +3772,8 @@ static inline unsigned int current_alloc_flags(gfp_t gfp_mask,
#ifdef CONFIG_CMA
unsigned int pflags = current->flags;

- if (!(pflags & PF_MEMALLOC_NOCMA) &&
- gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
+ if (!(pflags & PF_MEMALLOC_NOMOVABLE) &&
+ gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
alloc_flags |= ALLOC_CMA;

#endif
--
2.25.1

2020-12-02 05:27:05

by Pasha Tatashin

[permalink] [raw]
Subject: [PATCH 2/6] mm/gup: don't pin migrated cma pages in movable zone

In order not to fragment CMA, the pinned pages are migrated. However,
they are migrated to ZONE_MOVABLE, which also should not have pinned pages.

Remove __GFP_MOVABLE, so pages can be migrated to zones where pinning
is allowed.

Signed-off-by: Pavel Tatashin <[email protected]>
---
mm/gup.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index cdb8b9eeb016..3a76c005a3e2 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1610,7 +1610,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
long ret = nr_pages;
struct migration_target_control mtc = {
.nid = NUMA_NO_NODE,
- .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_NOWARN,
+ .gfp_mask = GFP_USER | __GFP_NOWARN,
};

check_again:
--
2.25.1

2020-12-02 05:27:15

by Pasha Tatashin

[permalink] [raw]
Subject: [PATCH 3/6] mm/gup: make __gup_longterm_locked common

__gup_longterm_locked() has a CMA || FS_DAX version and a common stub
version. In preparation for prohibiting longterm pinning of pages from
the movable zone, make the CMA || FS_DAX version common and delete the
stub version.

Signed-off-by: Pavel Tatashin <[email protected]>
---
mm/gup.c | 13 -------------
1 file changed, 13 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 3a76c005a3e2..0e2de888a8b0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1567,7 +1567,6 @@ struct page *get_dump_page(unsigned long addr)
}
#endif /* CONFIG_ELF_CORE */

-#if defined(CONFIG_FS_DAX) || defined (CONFIG_CMA)
#ifdef CONFIG_FS_DAX
static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
{
@@ -1757,18 +1756,6 @@ static long __gup_longterm_locked(struct mm_struct *mm,
kfree(vmas_tmp);
return rc;
}
-#else /* !CONFIG_FS_DAX && !CONFIG_CMA */
-static __always_inline long __gup_longterm_locked(struct mm_struct *mm,
- unsigned long start,
- unsigned long nr_pages,
- struct page **pages,
- struct vm_area_struct **vmas,
- unsigned int flags)
-{
- return __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
- NULL, flags);
-}
-#endif /* CONFIG_FS_DAX || CONFIG_CMA */

static bool is_valid_gup_flags(unsigned int gup_flags)
{
--
2.25.1

2020-12-02 05:27:40

by Pasha Tatashin

[permalink] [raw]
Subject: [PATCH 5/6] mm: honor PF_MEMALLOC_NOMOVABLE for all allocations

PF_MEMALLOC_NOMOVABLE is currently only honored for CMA allocations.
Extend this flag to work for all allocations by removing __GFP_MOVABLE
from gfp_mask when the flag is set in the current context, thus
prohibiting allocations from ZONE_MOVABLE.

Signed-off-by: Pavel Tatashin <[email protected]>
---
mm/hugetlb.c | 2 +-
mm/page_alloc.c | 26 ++++++++++++++++----------
2 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 02213c74ed6b..00e786201d8b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1036,7 +1036,7 @@ static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
bool nomovable = !!(current->flags & PF_MEMALLOC_NOMOVABLE);

list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
- if (nomovable && is_migrate_cma_page(page))
+ if (nomovable && is_migrate_movable(get_pageblock_migratetype(page)))
continue;

if (PageHWPoison(page))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 611799c72da5..7a6d86d0bc5f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3766,20 +3766,25 @@ alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
return alloc_flags;
}

-static inline unsigned int current_alloc_flags(gfp_t gfp_mask,
- unsigned int alloc_flags)
+static inline unsigned int cma_alloc_flags(gfp_t gfp_mask,
+ unsigned int alloc_flags)
{
#ifdef CONFIG_CMA
- unsigned int pflags = current->flags;
-
- if (!(pflags & PF_MEMALLOC_NOMOVABLE) &&
- gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
+ if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
alloc_flags |= ALLOC_CMA;
-
#endif
return alloc_flags;
}

+static inline gfp_t current_gfp_checkmovable(gfp_t gfp_mask)
+{
+ unsigned int pflags = current->flags;
+
+ if ((pflags & PF_MEMALLOC_NOMOVABLE))
+ return gfp_mask & ~__GFP_MOVABLE;
+ return gfp_mask;
+}
+
/*
* get_page_from_freelist goes through the zonelist trying to allocate
* a page.
@@ -4423,7 +4428,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
} else if (unlikely(rt_task(current)) && !in_interrupt())
alloc_flags |= ALLOC_HARDER;

- alloc_flags = current_alloc_flags(gfp_mask, alloc_flags);
+ alloc_flags = cma_alloc_flags(gfp_mask, alloc_flags);

return alloc_flags;
}
@@ -4725,7 +4730,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,

reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
if (reserve_flags)
- alloc_flags = current_alloc_flags(gfp_mask, reserve_flags);
+ alloc_flags = cma_alloc_flags(gfp_mask, reserve_flags);

/*
* Reset the nodemask and zonelist iterators if memory policies can be
@@ -4894,7 +4899,7 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
if (should_fail_alloc_page(gfp_mask, order))
return false;

- *alloc_flags = current_alloc_flags(gfp_mask, *alloc_flags);
+ *alloc_flags = cma_alloc_flags(gfp_mask, *alloc_flags);

/* Dirty zone balancing only done in the fast path */
ac->spread_dirty_pages = (gfp_mask & __GFP_WRITE);
@@ -4932,6 +4937,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
}

gfp_mask &= gfp_allowed_mask;
+ gfp_mask = current_gfp_checkmovable(gfp_mask);
alloc_mask = gfp_mask;
if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
return NULL;
--
2.25.1

2020-12-02 05:29:53

by Pasha Tatashin

[permalink] [raw]
Subject: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone

We do not allocate pinned pages in ZONE_MOVABLE, but if pages were
already allocated there before pinning, they need to be migrated to a
different zone. Currently, we migrate movable CMA pages only. Generalize
the function that migrates CMA pages to migrate all movable pages.

Signed-off-by: Pavel Tatashin <[email protected]>
---
include/linux/migrate.h | 1 +
include/trace/events/migrate.h | 3 +-
mm/gup.c | 56 +++++++++++++---------------------
3 files changed, 24 insertions(+), 36 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 0f8d1583fa8e..00bab23d1ee5 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -27,6 +27,7 @@ enum migrate_reason {
MR_MEMPOLICY_MBIND,
MR_NUMA_MISPLACED,
MR_CONTIG_RANGE,
+ MR_LONGTERM_PIN,
MR_TYPES
};

diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index 4d434398d64d..363b54ce104c 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -20,7 +20,8 @@
EM( MR_SYSCALL, "syscall_or_cpuset") \
EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind") \
EM( MR_NUMA_MISPLACED, "numa_misplaced") \
- EMe(MR_CONTIG_RANGE, "contig_range")
+ EM( MR_CONTIG_RANGE, "contig_range") \
+ EMe(MR_LONGTERM_PIN, "longterm_pin")

/*
* First define the enums in the above macros to be exported to userspace
diff --git a/mm/gup.c b/mm/gup.c
index 724d8a65e1df..1d511f65f8a7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1593,19 +1593,18 @@ static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
}
#endif

-#ifdef CONFIG_CMA
-static long check_and_migrate_cma_pages(struct mm_struct *mm,
- unsigned long start,
- unsigned long nr_pages,
- struct page **pages,
- struct vm_area_struct **vmas,
- unsigned int gup_flags)
+static long check_and_migrate_movable_pages(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long nr_pages,
+ struct page **pages,
+ struct vm_area_struct **vmas,
+ unsigned int gup_flags)
{
unsigned long i;
unsigned long step;
bool drain_allow = true;
bool migrate_allow = true;
- LIST_HEAD(cma_page_list);
+ LIST_HEAD(page_list);
long ret = nr_pages;
struct migration_target_control mtc = {
.nid = NUMA_NO_NODE,
@@ -1623,13 +1622,12 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
*/
step = compound_nr(head) - (pages[i] - head);
/*
- * If we get a page from the CMA zone, since we are going to
- * be pinning these entries, we might as well move them out
- * of the CMA zone if possible.
+ * If we get a movable page, since we are going to be pinning
+ * these entries, try to move them out if possible.
*/
- if (is_migrate_cma_page(head)) {
+ if (is_migrate_movable(get_pageblock_migratetype(head))) {
if (PageHuge(head))
- isolate_huge_page(head, &cma_page_list);
+ isolate_huge_page(head, &page_list);
else {
if (!PageLRU(head) && drain_allow) {
lru_add_drain_all();
@@ -1637,7 +1635,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
}

if (!isolate_lru_page(head)) {
- list_add_tail(&head->lru, &cma_page_list);
+ list_add_tail(&head->lru, &page_list);
mod_node_page_state(page_pgdat(head),
NR_ISOLATED_ANON +
page_is_file_lru(head),
@@ -1649,7 +1647,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
i += step;
}

- if (!list_empty(&cma_page_list)) {
+ if (!list_empty(&page_list)) {
/*
* drop the above get_user_pages reference.
*/
@@ -1659,7 +1657,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
for (i = 0; i < nr_pages; i++)
put_page(pages[i]);

- if (migrate_pages(&cma_page_list, alloc_migration_target, NULL,
+ if (migrate_pages(&page_list, alloc_migration_target, NULL,
(unsigned long)&mtc, MIGRATE_SYNC, MR_CONTIG_RANGE)) {
/*
* some of the pages failed migration. Do get_user_pages
@@ -1667,17 +1665,16 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
*/
migrate_allow = false;

- if (!list_empty(&cma_page_list))
- putback_movable_pages(&cma_page_list);
+ if (!list_empty(&page_list))
+ putback_movable_pages(&page_list);
}
/*
* We did migrate all the pages, Try to get the page references
- * again migrating any new CMA pages which we failed to isolate
- * earlier.
+ * again migrating any pages which we failed to isolate earlier.
*/
ret = __get_user_pages_locked(mm, start, nr_pages,
- pages, vmas, NULL,
- gup_flags);
+ pages, vmas, NULL,
+ gup_flags);

if ((ret > 0) && migrate_allow) {
nr_pages = ret;
@@ -1688,17 +1685,6 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,

return ret;
}
-#else
-static long check_and_migrate_cma_pages(struct mm_struct *mm,
- unsigned long start,
- unsigned long nr_pages,
- struct page **pages,
- struct vm_area_struct **vmas,
- unsigned int gup_flags)
-{
- return nr_pages;
-}
-#endif /* CONFIG_CMA */

/*
* __gup_longterm_locked() is a wrapper for __get_user_pages_locked which
@@ -1746,8 +1732,8 @@ static long __gup_longterm_locked(struct mm_struct *mm,
goto out;
}

- rc = check_and_migrate_cma_pages(mm, start, rc, pages,
- vmas_tmp, gup_flags);
+ rc = check_and_migrate_movable_pages(mm, start, rc, pages,
+ vmas_tmp, gup_flags);
out:
memalloc_nomovable_restore(flags);
}
--
2.25.1

2020-12-02 16:35:54

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 2/6] mm/gup: don't pin migrated cma pages in movable zone

On 02.12.20 06:23, Pavel Tatashin wrote:
> In order not to fragment CMA the pinned pages are migrated. However,
> they are migrated to ZONE_MOVABLE, which also should not have pinned pages.
>
> Remove __GFP_MOVABLE, so pages can be migrated to zones where pinning
> is allowed.
>
> Signed-off-by: Pavel Tatashin <[email protected]>
> ---
> mm/gup.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index cdb8b9eeb016..3a76c005a3e2 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1610,7 +1610,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
> long ret = nr_pages;
> struct migration_target_control mtc = {
> .nid = NUMA_NO_NODE,
> - .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_NOWARN,
> + .gfp_mask = GFP_USER | __GFP_NOWARN,
> };
>
> check_again:
>


Looks like the right thing to me, thanks!

Reviewed-by: David Hildenbrand <[email protected]>

--
Thanks,

David / dhildenb

2020-12-02 16:36:57

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 3/6] mm/gup: make __gup_longterm_locked common

On Wed, Dec 02, 2020 at 12:23:27AM -0500, Pavel Tatashin wrote:
> __gup_longterm_locked() has CMA || FS_DAX version and a common stub
> version. In the preparation of prohibiting longterm pinning of pages from
> movable zone make the CMA || FS_DAX version common, and delete the stub
> version.
>
> Signed-off-by: Pavel Tatashin <[email protected]>
> ---
> mm/gup.c | 13 -------------
> 1 file changed, 13 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 3a76c005a3e2..0e2de888a8b0 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1567,7 +1567,6 @@ struct page *get_dump_page(unsigned long addr)
> }
> #endif /* CONFIG_ELF_CORE */
>
> -#if defined(CONFIG_FS_DAX) || defined (CONFIG_CMA)
> #ifdef CONFIG_FS_DAX
> static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
> {
> @@ -1757,18 +1756,6 @@ static long __gup_longterm_locked(struct mm_struct *mm,
> kfree(vmas_tmp);
> return rc;
> }

Isn't this going to potentially allocate vmas_tmp only to not need it when
!FS_DAX and !CMA?

Ira

> -#else /* !CONFIG_FS_DAX && !CONFIG_CMA */
> -static __always_inline long __gup_longterm_locked(struct mm_struct *mm,
> - unsigned long start,
> - unsigned long nr_pages,
> - struct page **pages,
> - struct vm_area_struct **vmas,
> - unsigned int flags)
> -{
> - return __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
> - NULL, flags);
> -}
> -#endif /* CONFIG_FS_DAX || CONFIG_CMA */
>
> static bool is_valid_gup_flags(unsigned int gup_flags)
> {
> --
> 2.25.1
>
>

2020-12-02 16:38:12

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 3/6] mm/gup: make __gup_longterm_locked common

On Wed, Dec 02, 2020 at 08:31:45AM -0800, 'Ira Weiny' wrote:
> On Wed, Dec 02, 2020 at 12:23:27AM -0500, Pavel Tatashin wrote:
> > __gup_longterm_locked() has CMA || FS_DAX version and a common stub
> > version. In the preparation of prohibiting longterm pinning of pages from
> > movable zone make the CMA || FS_DAX version common, and delete the stub
> > version.
> >
> > Signed-off-by: Pavel Tatashin <[email protected]>
> > ---
> > mm/gup.c | 13 -------------
> > 1 file changed, 13 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 3a76c005a3e2..0e2de888a8b0 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1567,7 +1567,6 @@ struct page *get_dump_page(unsigned long addr)
> > }
> > #endif /* CONFIG_ELF_CORE */
> >
> > -#if defined(CONFIG_FS_DAX) || defined (CONFIG_CMA)
> > #ifdef CONFIG_FS_DAX
> > static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
> > {
> > @@ -1757,18 +1756,6 @@ static long __gup_longterm_locked(struct mm_struct *mm,
> > kfree(vmas_tmp);
> > return rc;
> > }
>
> Isn't this going to potentially allocate vmas_tmp only to not need it when
> !FS_DAX and !CMA?

To clarify, when FOLL_LONGTERM is set...

Ira

>
> Ira
>
> > -#else /* !CONFIG_FS_DAX && !CONFIG_CMA */
> > -static __always_inline long __gup_longterm_locked(struct mm_struct *mm,
> > - unsigned long start,
> > - unsigned long nr_pages,
> > - struct page **pages,
> > - struct vm_area_struct **vmas,
> > - unsigned int flags)
> > -{
> > - return __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
> > - NULL, flags);
> > -}
> > -#endif /* CONFIG_FS_DAX || CONFIG_CMA */
> >
> > static bool is_valid_gup_flags(unsigned int gup_flags)
> > {
> > --
> > 2.25.1
> >
> >
>

2020-12-02 16:40:25

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone

On Wed, Dec 02, 2020 at 12:23:30AM -0500, Pavel Tatashin wrote:
> /*
> * First define the enums in the above macros to be exported to userspace
> diff --git a/mm/gup.c b/mm/gup.c
> index 724d8a65e1df..1d511f65f8a7 100644
> +++ b/mm/gup.c
> @@ -1593,19 +1593,18 @@ static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
> }
> #endif
>
> -#ifdef CONFIG_CMA
> -static long check_and_migrate_cma_pages(struct mm_struct *mm,
> - unsigned long start,
> - unsigned long nr_pages,
> - struct page **pages,
> - struct vm_area_struct **vmas,
> - unsigned int gup_flags)
> +static long check_and_migrate_movable_pages(struct mm_struct *mm,
> + unsigned long start,
> + unsigned long nr_pages,
> + struct page **pages,
> + struct vm_area_struct **vmas,
> + unsigned int gup_flags)
> {
> unsigned long i;
> unsigned long step;
> bool drain_allow = true;
> bool migrate_allow = true;
> - LIST_HEAD(cma_page_list);
> + LIST_HEAD(page_list);
> long ret = nr_pages;
> struct migration_target_control mtc = {
> .nid = NUMA_NO_NODE,
> @@ -1623,13 +1622,12 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
> */
> step = compound_nr(head) - (pages[i] - head);
> /*
> - * If we get a page from the CMA zone, since we are going to
> - * be pinning these entries, we might as well move them out
> - * of the CMA zone if possible.
> + * If we get a movable page, since we are going to be pinning
> + * these entries, try to move them out if possible.
> */
> - if (is_migrate_cma_page(head)) {
> + if (is_migrate_movable(get_pageblock_migratetype(head))) {
> if (PageHuge(head))

It is a good moment to say, I really dislike how this was implemented
in the first place.

Scanning the output of gup just to do the is_migrate_movable() test is
kind of nonsense and slow. It would be better/faster to handle this
directly while gup is scanning the page tables and adding pages to the
list.

Now that this becoming more general, can you take a moment to see if a
better implementation could be possible?

Also, something takes care of the gup fast path too?

Jason

2020-12-02 18:22:55

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 3/6] mm/gup: make __gup_longterm_locked common

On Wed, Dec 2, 2020 at 11:33 AM Ira Weiny <[email protected]> wrote:
>
> On Wed, Dec 02, 2020 at 08:31:45AM -0800, 'Ira Weiny' wrote:
> > On Wed, Dec 02, 2020 at 12:23:27AM -0500, Pavel Tatashin wrote:
> > > __gup_longterm_locked() has CMA || FS_DAX version and a common stub
> > > version. In the preparation of prohibiting longterm pinning of pages from
> > > movable zone make the CMA || FS_DAX version common, and delete the stub
> > > version.
> > >
> > > Signed-off-by: Pavel Tatashin <[email protected]>
> > > ---
> > > mm/gup.c | 13 -------------
> > > 1 file changed, 13 deletions(-)
> > >
> > > diff --git a/mm/gup.c b/mm/gup.c
> > > index 3a76c005a3e2..0e2de888a8b0 100644
> > > --- a/mm/gup.c
> > > +++ b/mm/gup.c
> > > @@ -1567,7 +1567,6 @@ struct page *get_dump_page(unsigned long addr)
> > > }
> > > #endif /* CONFIG_ELF_CORE */
> > >
> > > -#if defined(CONFIG_FS_DAX) || defined (CONFIG_CMA)
> > > #ifdef CONFIG_FS_DAX
> > > static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
> > > {
> > > @@ -1757,18 +1756,6 @@ static long __gup_longterm_locked(struct mm_struct *mm,
> > > kfree(vmas_tmp);
> > > return rc;
> > > }
> >
> > Isn't this going to potentially allocate vmas_tmp only to not need it when
> > !FS_DAX and !CMA?
>
> To clarify, when FOLL_LONGTERM is set...

Yes, this is the case. We need that because once migration is checked
for all allocations, not only CMA, we need vmas_tmp for all cases.

Pasha

>
> IRa
>
> >
> > Ira
> >
> > > -#else /* !CONFIG_FS_DAX && !CONFIG_CMA */
> > > -static __always_inline long __gup_longterm_locked(struct mm_struct *mm,
> > > - unsigned long start,
> > > - unsigned long nr_pages,
> > > - struct page **pages,
> > > - struct vm_area_struct **vmas,
> > > - unsigned int flags)
> > > -{
> > > - return __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
> > > - NULL, flags);
> > > -}
> > > -#endif /* CONFIG_FS_DAX || CONFIG_CMA */
> > >
> > > static bool is_valid_gup_flags(unsigned int gup_flags)
> > > {
> > > --
> > > 2.25.1
> > >
> > >
> >

2020-12-02 18:23:09

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 2/6] mm/gup: don't pin migrated cma pages in movable zone

> Looks like the right thing to me, thanks!
>
> Reviewed-by: David Hildenbrand <[email protected]>

Thank you,
Pasha

>
> --
> Thanks,
>
> David / dhildenb
>

2020-12-03 00:09:39

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 3/6] mm/gup: make __gup_longterm_locked common

On Wed, Dec 2, 2020 at 1:19 PM Pavel Tatashin <[email protected]> wrote:
>
> On Wed, Dec 2, 2020 at 11:33 AM Ira Weiny <[email protected]> wrote:
> >
> > On Wed, Dec 02, 2020 at 08:31:45AM -0800, 'Ira Weiny' wrote:
> > > On Wed, Dec 02, 2020 at 12:23:27AM -0500, Pavel Tatashin wrote:
> > > > __gup_longterm_locked() has CMA || FS_DAX version and a common stub
> > > > version. In the preparation of prohibiting longterm pinning of pages from
> > > > movable zone make the CMA || FS_DAX version common, and delete the stub
> > > > version.
> > > >
> > > > Signed-off-by: Pavel Tatashin <[email protected]>
> > > > ---
> > > > mm/gup.c | 13 -------------
> > > > 1 file changed, 13 deletions(-)
> > > >
> > > > diff --git a/mm/gup.c b/mm/gup.c
> > > > index 3a76c005a3e2..0e2de888a8b0 100644
> > > > --- a/mm/gup.c
> > > > +++ b/mm/gup.c
> > > > @@ -1567,7 +1567,6 @@ struct page *get_dump_page(unsigned long addr)
> > > > }
> > > > #endif /* CONFIG_ELF_CORE */
> > > >
> > > > -#if defined(CONFIG_FS_DAX) || defined (CONFIG_CMA)
> > > > #ifdef CONFIG_FS_DAX
> > > > static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
> > > > {
> > > > @@ -1757,18 +1756,6 @@ static long __gup_longterm_locked(struct mm_struct *mm,
> > > > kfree(vmas_tmp);
> > > > return rc;
> > > > }
> > >
> > > Isn't this going to potentially allocate vmas_tmp only to not need it when
> > > !FS_DAX and !CMA?
> >
> > To clarify, when FOLL_LONGTERM is set...
>
> Yes, this is the case. We need that because once migration is checked
> for all allocations, not only CMA, we need vmas_tmp for all cases.

A slight correction: we only need vmas_tmp for check_dax_vmas().
Potentially, we could wrap the vmas_tmp allocation in an #ifdef for
FS_DAX, but I do not think this is really needed.

Pasha

>
> Pasha
>
> >
> > IRa
> >
> > >
> > > Ira
> > >
> > > > -#else /* !CONFIG_FS_DAX && !CONFIG_CMA */
> > > > -static __always_inline long __gup_longterm_locked(struct mm_struct *mm,
> > > > - unsigned long start,
> > > > - unsigned long nr_pages,
> > > > - struct page **pages,
> > > > - struct vm_area_struct **vmas,
> > > > - unsigned int flags)
> > > > -{
> > > > - return __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
> > > > - NULL, flags);
> > > > -}
> > > > -#endif /* CONFIG_FS_DAX || CONFIG_CMA */
> > > >
> > > > static bool is_valid_gup_flags(unsigned int gup_flags)
> > > > {
> > > > --
> > > > 2.25.1
> > > >
> > > >
> > >

2020-12-03 00:24:56

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone

> It is a good moment to say, I really dislike how this was implemented
> in the first place.
>
> Scanning the output of gup just to do the is_migrate_movable() test is
> kind of nonsense and slow. It would be better/faster to handle this
> directly while gup is scanning the page tables and adding pages to the
> list.

Hi Jason,

I assume you mean to migrate pages as soon as they are followed and
skip those that are faulted, as we already know that faulted pages are
allocated from nomovable zone.

The place would be:

__get_user_pages()
        while (more pages)
                get_gate_page()
                follow_hugetlb_page()
                follow_page_mask()

                if (!page)
                        faultin_page()

                if (page && !faulted && (gup_flags & FOLL_LONGTERM))
                        check_and_migrate this page

I looked at that function, and I do not think the code would be cleaner
there, as that function already has a complicated loop. The only
drawback of the current approach that I see is that
check_and_migrate_movable_pages() has to check the faulted pages once
more. This, while not optimal, is not horrible. FOLL_LONGTERM should
not happen too frequently, so one extra loop over nr_pages should not
hurt performance. Also, I checked, and most users of FOLL_LONGTERM pin
only one page at a time, which means the extra loop only checks a
single page.

We could discuss improving this code further. For example, I still
think it would be a good idea to pass an appropriate gfp_mask via
fault_flags from gup_flags instead of using the
PF_MEMALLOC_NOMOVABLE (previously PF_MEMALLOC_NOCMA) per-context flag.
However, those changes can come after this series. The current series
fixes a bug where hot-remove is not working, while making a minimal
amount of changes, so it is easy to backport it to stable kernels.

Thank you,
Pasha



>
> Now that this becoming more general, can you take a moment to see if a
> better implementation could be possible?
>
> Also, something takes care of the gup fast path too?
>
> Jason

2020-12-03 01:12:02

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone

On Wed, Dec 02, 2020 at 07:19:45PM -0500, Pavel Tatashin wrote:
> > It is a good moment to say, I really dislike how this was implemented
> > in the first place.
> >
> > Scanning the output of gup just to do the is_migrate_movable() test is
> > kind of nonsense and slow. It would be better/faster to handle this
> > directly while gup is scanning the page tables and adding pages to the
> > list.
>
> Hi Jason,
>
> I assume you mean to migrate pages as soon as they are followed and
> skip those that are faulted, as we already know that faulted pages are
> allocated from nomovable zone.
>
> The place would be:
>
> __get_user_pages()
> while(more pages)
> get_gate_page()
> follow_hugetlb_page()
> follow_page_mask()
>
> if (!page)
> faultin_page()
>
> if (page && !faulted && (gup_flags & FOLL_LONGTERM) )
> check_and_migrate this page

Either here or perhaps even lower down the call chain when the page is
captured, similar to how GUP fast would detect it. (how is that done
anyhow?)

> I looked at that function, and I do not think the code will be cleaner
> there, as that function already has a complicated loop.

That function is complicated for its own reasons.. But complicated is
not the point here..

> The only drawback with the current approach that I see is that
> check_and_migrate_movable_pages() has to check once the faulted
> pages.

Yes

> This is while not optimal is not horrible.

It is.

> The FOLL_LONGTERM should not happen too frequently, so having one
> extra nr_pages loop should not hurt the performance.

FOLL_LONGTERM is typically used with very large regions, for instance
we are benchmarking around the 300G level. It takes 10s of seconds for
get_user_pages to operate. There are many inefficiencies in this
path. This extra work of re-scanning the list is part of the cost.

Further, having these special wrappers just for FOLL_LONGTERM has a
spill over complexity on the entire rest of the callchain up to here,
we now have endless wrappers and varieties of function calls that
generally are happening because the longterm path needs to end up in a
different place than other paths.

IMHO this is due to the lack of integration with the primary loop
above

> Also, I checked and most of the users of FOLL_LONGTERM pin only one
> page at a time. Which means the extra loop is only to check a single
> page.

Er, I don't know what you checked but those are not the cases I
see. Two big users are vfio and rdma. Both are pinning huge ranges of
memory in very typical use cases.

> However, those changes can come after this series. The current series
> fixes a bug where hot-remove is not working with making minimal amount
> of changes, so it is easy to backport it to stable kernels.

This is a good point, good enough that you should probably continue as
is

Jason

2020-12-03 01:40:03

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone

On Wed, Dec 2, 2020 at 8:08 PM Jason Gunthorpe <[email protected]> wrote:
>
> On Wed, Dec 02, 2020 at 07:19:45PM -0500, Pavel Tatashin wrote:
> > > It is a good moment to say, I really dislike how this was implemented
> > > in the first place.
> > >
> > > Scanning the output of gup just to do the is_migrate_movable() test is
> > > kind of nonsense and slow. It would be better/faster to handle this
> > > directly while gup is scanning the page tables and adding pages to the
> > > list.
> >
> > Hi Jason,
> >
> > I assume you mean to migrate pages as soon as they are followed and
> > skip those that are faulted, as we already know that faulted pages are
> > allocated from nomovable zone.
> >
> > The place would be:
> >
> > __get_user_pages()
> > while(more pages)
> > get_gate_page()
> > follow_hugetlb_page()
> > follow_page_mask()
> >
> > if (!page)
> > faultin_page()
> >
> > if (page && !faulted && (gup_flags & FOLL_LONGTERM) )
> > check_and_migrate this page
>
> Either here or perhaps even lower down the call chain when the page is
> captured, similar to how GUP fast would detect it. (how is that done
> anyhow?)

Ah, thank you for pointing this out. I think I need to address it here:

https://soleen.com/source/xref/linux/mm/gup.c?r=96e1fac1#94

static __maybe_unused struct page *try_grab_compound_head()
        if (unlikely(flags & FOLL_LONGTERM) && is_migrate_cma_page(page))
                return NULL;

I need to change the is_migrate_cma_page() check to cover all migratable
pages. I will study this and send an update with the fix.
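
Roughly, the check would become something like the following (just a
sketch mirroring the test used in check_and_migrate_movable_pages(),
not a tested patch):

    if (unlikely(flags & FOLL_LONGTERM) &&
        is_migrate_movable(get_pageblock_migratetype(page)))
            return NULL;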

>
> > I looked at that function, and I do not think the code will be cleaner
> > there, as that function already has a complicated loop.
>
> That function is complicated for its own reasons.. But complicated is
> not the point here..
>
> > The only drawback with the current approach that I see is that
> > check_and_migrate_movable_pages() has to check once the faulted
> > pages.
>
> Yes
>
> > This is while not optimal is not horrible.
>
> It is.
>
> > The FOLL_LONGTERM should not happen too frequently, so having one
> > extra nr_pages loop should not hurt the performance.
>
> FOLL_LONGTERM is typically used with very large regions, for instance
> we are benchmarking around the 300G level. It takes 10s of seconds for
> get_user_pages to operate. There are many inefficiencies in this
> path. This extra work of re-scanning the list is part of the cost.

OK, I did not realize that pinning was used for such large regions; the
path must be optimized.

>
> Further, having these special wrappers just for FOLL_LONGTERM has a
> spill over complexity on the entire rest of the callchain up to here,
> we now have endless wrappers and varieties of function calls that
> generally are happening because the longterm path needs to end up in a
> different place than other paths.
>
> IMHO this is due to the lack of integration with the primary loop
> above
>
> > Also, I checked and most of the users of FOLL_LONGTERM pin only one
> > page at a time. Which means the extra loop is only to check a single
> > page.
>
> Er, I don't know what you checked but those are not the cases I
> see. Two big users are vfio and rdma. Both are pinning huge ranges of
> memory in very typical use cases.

What I meant is that the users of the interface do it incrementally,
not in large chunks. For example:

vfio_pin_pages_remote()
  vaddr_get_pfn()
    ret = pin_user_pages_remote(mm, vaddr, 1, flags | FOLL_LONGTERM,
                                page, NULL, NULL);

The "1" means that only one page is pinned at a time.

RDMA indeed can do it in one chunk though. Regardless, VFIO should
probably be optimized to pin in larger chunks, and the code path should
be optimized for the reasons you gave above.
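
On the VFIO side, batching would just mean asking for more pages per
call, something like this (a sketch only, keeping the signature of the
call quoted above; "npages" and "pages" are hypothetical locals):

    ret = pin_user_pages_remote(mm, vaddr, npages, flags | FOLL_LONGTERM,
                                pages, NULL, NULL);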

>
> > However, those changes can come after this series. The current series
> > fixes a bug where hot-remove is not working with making minimal amount
> > of changes, so it is easy to backport it to stable kernels.
>
> This is a good point, good enough that you should probably continue as
> is

I will continue looking into this code and see if I can address your
concerns in follow-up fixes.


Thanks,
Pasha

>
> Jason

2020-12-03 08:04:16

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 2/6] mm/gup: don't pin migrated cma pages in movable zone

On 12/1/20 9:23 PM, Pavel Tatashin wrote:
> In order not to fragment CMA the pinned pages are migrated. However,
> they are migrated to ZONE_MOVABLE, which also should not have pinned pages.
>
> Remove __GFP_MOVABLE, so pages can be migrated to zones where pinning
> is allowed.
>
> Signed-off-by: Pavel Tatashin <[email protected]>
> ---
> mm/gup.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index cdb8b9eeb016..3a76c005a3e2 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1610,7 +1610,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
> long ret = nr_pages;
> struct migration_target_control mtc = {
> .nid = NUMA_NO_NODE,
> - .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_NOWARN,
> + .gfp_mask = GFP_USER | __GFP_NOWARN,
> };
>
> check_again:
>

Reviewed-by: John Hubbard <[email protected]>

...while I was here, I noticed this, which is outside the scope of your patchset, but
I thought I'd just mention it anyway in case anyone cares:

static inline bool is_migrate_movable(int mt)
{
        return is_migrate_cma(mt) || mt == MIGRATE_MOVABLE;
}

...that really should take an "enum migratetype mt" instead of an int, I think.

thanks,
--
John Hubbard
NVIDIA

2020-12-03 08:07:23

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 3/6] mm/gup: make __gup_longterm_locked common

On 12/1/20 9:23 PM, Pavel Tatashin wrote:
> __gup_longterm_locked() has CMA || FS_DAX version and a common stub
> version. In the preparation of prohibiting longterm pinning of pages from
> movable zone make the CMA || FS_DAX version common, and delete the stub
> version.
>
> Signed-off-by: Pavel Tatashin <[email protected]>
> ---
> mm/gup.c | 13 -------------
> 1 file changed, 13 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 3a76c005a3e2..0e2de888a8b0 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1567,7 +1567,6 @@ struct page *get_dump_page(unsigned long addr)
> }
> #endif /* CONFIG_ELF_CORE */
>
> -#if defined(CONFIG_FS_DAX) || defined (CONFIG_CMA)
> #ifdef CONFIG_FS_DAX
> static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
> {
> @@ -1757,18 +1756,6 @@ static long __gup_longterm_locked(struct mm_struct *mm,
> kfree(vmas_tmp);
> return rc;
> }
> -#else /* !CONFIG_FS_DAX && !CONFIG_CMA */
> -static __always_inline long __gup_longterm_locked(struct mm_struct *mm,
> - unsigned long start,
> - unsigned long nr_pages,
> - struct page **pages,
> - struct vm_area_struct **vmas,
> - unsigned int flags)
> -{
> - return __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
> - NULL, flags);
> -}
> -#endif /* CONFIG_FS_DAX || CONFIG_CMA */
>
> static bool is_valid_gup_flags(unsigned int gup_flags)
> {
>

At last some simplification here, yea!

Reviewed-by: John Hubbard <[email protected]>

thanks,
--
John Hubbard
NVIDIA

2020-12-03 08:07:38

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 4/6] mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_NOMOVABLE

On 12/1/20 9:23 PM, Pavel Tatashin wrote:
> PF_MEMALLOC_NOCMA is used for longterm pinning and has an effect of
> clearing _GFP_MOVABLE or prohibiting allocations from ZONE_MOVABLE.
>
> We will prohibit allocating any pages that are getting
> longterm pinned from ZONE_MOVABLE, and we would want to unify and re-use
> this flag. So, rename it to generic PF_MEMALLOC_NOMOVABLE.
> Also re-name:
> memalloc_nocma_save()/memalloc_nocma_restore
> to
> memalloc_nomovable_save()/memalloc_nomovable_restore()
> and make the new functions common.
>
> Signed-off-by: Pavel Tatashin <[email protected]>

Looks accurate, and grep didn't find any lingering rename candidates
after this series is applied. And it's a good rename.


Reviewed-by: John Hubbard <[email protected]>


thanks,
--
John Hubbard
NVIDIA

> ---
> include/linux/sched.h | 2 +-
> include/linux/sched/mm.h | 21 +++++----------------
> mm/gup.c | 4 ++--
> mm/hugetlb.c | 4 ++--
> mm/page_alloc.c | 4 ++--
> 5 files changed, 12 insertions(+), 23 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 76cd21fa5501..f1bf05f5f5fa 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1548,7 +1548,7 @@ extern struct pid *cad_pid;
> #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
> #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
> #define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
> -#define PF_MEMALLOC_NOCMA 0x10000000 /* All allocation request will have _GFP_MOVABLE cleared */
> +#define PF_MEMALLOC_NOMOVABLE 0x10000000 /* All allocation request will have _GFP_MOVABLE cleared */
> #define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */
> #define PF_SUSPEND_TASK 0x80000000 /* This thread called freeze_processes() and should not be frozen */
>
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index d5ece7a9a403..5bb9a6b69479 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -254,29 +254,18 @@ static inline void memalloc_noreclaim_restore(unsigned int flags)
> current->flags = (current->flags & ~PF_MEMALLOC) | flags;
> }
>
> -#ifdef CONFIG_CMA
> -static inline unsigned int memalloc_nocma_save(void)
> +static inline unsigned int memalloc_nomovable_save(void)
> {
> - unsigned int flags = current->flags & PF_MEMALLOC_NOCMA;
> + unsigned int flags = current->flags & PF_MEMALLOC_NOMOVABLE;
>
> - current->flags |= PF_MEMALLOC_NOCMA;
> + current->flags |= PF_MEMALLOC_NOMOVABLE;
> return flags;
> }
>
> -static inline void memalloc_nocma_restore(unsigned int flags)
> +static inline void memalloc_nomovable_restore(unsigned int flags)
> {
> - current->flags = (current->flags & ~PF_MEMALLOC_NOCMA) | flags;
> + current->flags = (current->flags & ~PF_MEMALLOC_NOMOVABLE) | flags;
> }
> -#else
> -static inline unsigned int memalloc_nocma_save(void)
> -{
> - return 0;
> -}
> -
> -static inline void memalloc_nocma_restore(unsigned int flags)
> -{
> -}
> -#endif
>
> #ifdef CONFIG_MEMCG
> DECLARE_PER_CPU(struct mem_cgroup *, int_active_memcg);
> diff --git a/mm/gup.c b/mm/gup.c
> index 0e2de888a8b0..724d8a65e1df 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1726,7 +1726,7 @@ static long __gup_longterm_locked(struct mm_struct *mm,
> if (!vmas_tmp)
> return -ENOMEM;
> }
> - flags = memalloc_nocma_save();
> + flags = memalloc_nomovable_save();
> }
>
> rc = __get_user_pages_locked(mm, start, nr_pages, pages,
> @@ -1749,7 +1749,7 @@ static long __gup_longterm_locked(struct mm_struct *mm,
> rc = check_and_migrate_cma_pages(mm, start, rc, pages,
> vmas_tmp, gup_flags);
> out:
> - memalloc_nocma_restore(flags);
> + memalloc_nomovable_restore(flags);
> }
>
> if (vmas_tmp != vmas)
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 37f15c3c24dc..02213c74ed6b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1033,10 +1033,10 @@ static void enqueue_huge_page(struct hstate *h, struct page *page)
> static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
> {
> struct page *page;
> - bool nocma = !!(current->flags & PF_MEMALLOC_NOCMA);
> + bool nomovable = !!(current->flags & PF_MEMALLOC_NOMOVABLE);
>
> list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
> - if (nocma && is_migrate_cma_page(page))
> + if (nomovable && is_migrate_cma_page(page))
> continue;
>
> if (PageHWPoison(page))
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index eaa227a479e4..611799c72da5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3772,8 +3772,8 @@ static inline unsigned int current_alloc_flags(gfp_t gfp_mask,
> #ifdef CONFIG_CMA
> unsigned int pflags = current->flags;
>
> - if (!(pflags & PF_MEMALLOC_NOCMA) &&
> - gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> + if (!(pflags & PF_MEMALLOC_NOMOVABLE) &&
> + gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> alloc_flags |= ALLOC_CMA;
>
> #endif
>

2020-12-03 08:22:44

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 5/6] mm: honor PF_MEMALLOC_NOMOVABLE for all allocations

On 12/1/20 9:23 PM, Pavel Tatashin wrote:
> PF_MEMALLOC_NOMOVABLE is only honored for CMA allocations, extend
> this flag to work for any allocations by removing __GFP_MOVABLE from
> gfp_mask when this flag is passed in the current context, thus
> prohibiting allocations from ZONE_MOVABLE.
>
> Signed-off-by: Pavel Tatashin <[email protected]>
> ---
> mm/hugetlb.c | 2 +-
> mm/page_alloc.c | 26 ++++++++++++++++----------
> 2 files changed, 17 insertions(+), 11 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 02213c74ed6b..00e786201d8b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1036,7 +1036,7 @@ static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
> bool nomovable = !!(current->flags & PF_MEMALLOC_NOMOVABLE);
>
> list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
> - if (nomovable && is_migrate_cma_page(page))
> + if (nomovable && is_migrate_movable(get_pageblock_migratetype(page)))


I wonder if we should add a helper, like is_migrate_cma_page(), that avoids having
to call get_pageblock_migratetype() at all of the callsites?
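
Something like this would do it, I think (the name is just a
placeholder, following the is_migrate_cma_page() pattern):

    #define is_migrate_movable_page(_page) \
        is_migrate_movable(get_pageblock_migratetype(_page))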


> continue;
>
> if (PageHWPoison(page))
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 611799c72da5..7a6d86d0bc5f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3766,20 +3766,25 @@ alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
> return alloc_flags;
> }
>
> -static inline unsigned int current_alloc_flags(gfp_t gfp_mask,
> - unsigned int alloc_flags)
> +static inline unsigned int cma_alloc_flags(gfp_t gfp_mask,
> + unsigned int alloc_flags)

Actually, maybe the original name should be left intact. This handles current alloc
flags, which right now happen to only cover CMA flags, so the original name seems
accurate, right?


thanks,

John Hubbard
NVIDIA

> {
> #ifdef CONFIG_CMA
> - unsigned int pflags = current->flags;
> -
> - if (!(pflags & PF_MEMALLOC_NOMOVABLE) &&
> - gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> + if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> alloc_flags |= ALLOC_CMA;
> -
> #endif
> return alloc_flags;
> }
>
> +static inline gfp_t current_gfp_checkmovable(gfp_t gfp_mask)
> +{
> + unsigned int pflags = current->flags;
> +
> + if ((pflags & PF_MEMALLOC_NOMOVABLE))
> + return gfp_mask & ~__GFP_MOVABLE;
> + return gfp_mask;
> +}
> +
> /*
> * get_page_from_freelist goes through the zonelist trying to allocate
> * a page.
> @@ -4423,7 +4428,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> } else if (unlikely(rt_task(current)) && !in_interrupt())
> alloc_flags |= ALLOC_HARDER;
>
> - alloc_flags = current_alloc_flags(gfp_mask, alloc_flags);
> + alloc_flags = cma_alloc_flags(gfp_mask, alloc_flags);
>
> return alloc_flags;
> }
> @@ -4725,7 +4730,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>
> reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
> if (reserve_flags)
> - alloc_flags = current_alloc_flags(gfp_mask, reserve_flags);
> + alloc_flags = cma_alloc_flags(gfp_mask, reserve_flags);
>
> /*
> * Reset the nodemask and zonelist iterators if memory policies can be
> @@ -4894,7 +4899,7 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
> if (should_fail_alloc_page(gfp_mask, order))
> return false;
>
> - *alloc_flags = current_alloc_flags(gfp_mask, *alloc_flags);
> + *alloc_flags = cma_alloc_flags(gfp_mask, *alloc_flags);
>
> /* Dirty zone balancing only done in the fast path */
> ac->spread_dirty_pages = (gfp_mask & __GFP_WRITE);
> @@ -4932,6 +4937,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
> }
>
> gfp_mask &= gfp_allowed_mask;
> + gfp_mask = current_gfp_checkmovable(gfp_mask);
> alloc_mask = gfp_mask;
> if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
> return NULL;
>

2020-12-03 08:25:23

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone

On 12/1/20 9:23 PM, Pavel Tatashin wrote:
> We do not allocate pin pages in ZONE_MOVABLE, but if pages were already
> allocated before pinning they need to migrated to a different zone.
> Currently, we migrate movable CMA pages only. Generalize the function
> that migrates CMA pages to migrate all movable pages.
>
> Signed-off-by: Pavel Tatashin <[email protected]>
> ---
> include/linux/migrate.h | 1 +
> include/trace/events/migrate.h | 3 +-
> mm/gup.c | 56 +++++++++++++---------------------
> 3 files changed, 24 insertions(+), 36 deletions(-)
>

I like the cleanup so far, even at this point it's a relief to finally
see the nested ifdefs get removed.

One naming nit/idea below, but this looks fine as is, so:

Reviewed-by: John Hubbard <[email protected]>

> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 0f8d1583fa8e..00bab23d1ee5 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -27,6 +27,7 @@ enum migrate_reason {
> MR_MEMPOLICY_MBIND,
> MR_NUMA_MISPLACED,
> MR_CONTIG_RANGE,
> + MR_LONGTERM_PIN,
> MR_TYPES
> };
>
> diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
> index 4d434398d64d..363b54ce104c 100644
> --- a/include/trace/events/migrate.h
> +++ b/include/trace/events/migrate.h
> @@ -20,7 +20,8 @@
> EM( MR_SYSCALL, "syscall_or_cpuset") \
> EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind") \
> EM( MR_NUMA_MISPLACED, "numa_misplaced") \
> - EMe(MR_CONTIG_RANGE, "contig_range")
> + EM( MR_CONTIG_RANGE, "contig_range") \
> + EMe(MR_LONGTERM_PIN, "longterm_pin")
>
> /*
> * First define the enums in the above macros to be exported to userspace
> diff --git a/mm/gup.c b/mm/gup.c
> index 724d8a65e1df..1d511f65f8a7 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1593,19 +1593,18 @@ static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
> }
> #endif
>
> -#ifdef CONFIG_CMA
> -static long check_and_migrate_cma_pages(struct mm_struct *mm,
> - unsigned long start,
> - unsigned long nr_pages,
> - struct page **pages,
> - struct vm_area_struct **vmas,
> - unsigned int gup_flags)
> +static long check_and_migrate_movable_pages(struct mm_struct *mm,
> + unsigned long start,
> + unsigned long nr_pages,
> + struct page **pages,
> + struct vm_area_struct **vmas,
> + unsigned int gup_flags)
> {
> unsigned long i;
> unsigned long step;
> bool drain_allow = true;
> bool migrate_allow = true;
> - LIST_HEAD(cma_page_list);
> + LIST_HEAD(page_list);


Maybe naming it "movable_page_list" would be a nice touch.


thanks,
--
John Hubbard
NVIDIA

> long ret = nr_pages;
> struct migration_target_control mtc = {
> .nid = NUMA_NO_NODE,
> @@ -1623,13 +1622,12 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
> */
> step = compound_nr(head) - (pages[i] - head);
> /*
> - * If we get a page from the CMA zone, since we are going to
> - * be pinning these entries, we might as well move them out
> - * of the CMA zone if possible.
> + * If we get a movable page, since we are going to be pinning
> + * these entries, try to move them out if possible.
> */
> - if (is_migrate_cma_page(head)) {
> + if (is_migrate_movable(get_pageblock_migratetype(head))) {
> if (PageHuge(head))
> - isolate_huge_page(head, &cma_page_list);
> + isolate_huge_page(head, &page_list);
> else {
> if (!PageLRU(head) && drain_allow) {
> lru_add_drain_all();
> @@ -1637,7 +1635,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
> }
>
> if (!isolate_lru_page(head)) {
> - list_add_tail(&head->lru, &cma_page_list);
> + list_add_tail(&head->lru, &page_list);
> mod_node_page_state(page_pgdat(head),
> NR_ISOLATED_ANON +
> page_is_file_lru(head),
> @@ -1649,7 +1647,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
> i += step;
> }
>
> - if (!list_empty(&cma_page_list)) {
> + if (!list_empty(&page_list)) {
> /*
> * drop the above get_user_pages reference.
> */
> @@ -1659,7 +1657,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
> for (i = 0; i < nr_pages; i++)
> put_page(pages[i]);
>
> - if (migrate_pages(&cma_page_list, alloc_migration_target, NULL,
> + if (migrate_pages(&page_list, alloc_migration_target, NULL,
> (unsigned long)&mtc, MIGRATE_SYNC, MR_CONTIG_RANGE)) {
> /*
> * some of the pages failed migration. Do get_user_pages
> @@ -1667,17 +1665,16 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
> */
> migrate_allow = false;
>
> - if (!list_empty(&cma_page_list))
> - putback_movable_pages(&cma_page_list);
> + if (!list_empty(&page_list))
> + putback_movable_pages(&page_list);
> }
> /*
> * We did migrate all the pages, Try to get the page references
> - * again migrating any new CMA pages which we failed to isolate
> - * earlier.
> + * again migrating any pages which we failed to isolate earlier.
> */
> ret = __get_user_pages_locked(mm, start, nr_pages,
> - pages, vmas, NULL,
> - gup_flags);
> + pages, vmas, NULL,
> + gup_flags);
>
> if ((ret > 0) && migrate_allow) {
> nr_pages = ret;
> @@ -1688,17 +1685,6 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
>
> return ret;
> }
> -#else
> -static long check_and_migrate_cma_pages(struct mm_struct *mm,
> - unsigned long start,
> - unsigned long nr_pages,
> - struct page **pages,
> - struct vm_area_struct **vmas,
> - unsigned int gup_flags)
> -{
> - return nr_pages;
> -}
> -#endif /* CONFIG_CMA */
>
> /*
> * __gup_longterm_locked() is a wrapper for __get_user_pages_locked which
> @@ -1746,8 +1732,8 @@ static long __gup_longterm_locked(struct mm_struct *mm,
> goto out;
> }
>
> - rc = check_and_migrate_cma_pages(mm, start, rc, pages,
> - vmas_tmp, gup_flags);
> + rc = check_and_migrate_movable_pages(mm, start, rc, pages,
> + vmas_tmp, gup_flags);
> out:
> memalloc_nomovable_restore(flags);
> }
>

2020-12-03 08:48:50

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 2/6] mm/gup: don't pin migrated cma pages in movable zone

On Wed 02-12-20 00:23:26, Pavel Tatashin wrote:
> In order not to fragment CMA the pinned pages are migrated. However,
> they are migrated to ZONE_MOVABLE, which also should not have pinned pages.
>
> Remove __GFP_MOVABLE, so pages can be migrated to zones where pinning
> is allowed.

I was wondering why we do have __GFP_MOVABLE at all. Took a shovel
and... 41b4dc14ee807 says:
: We have well defined scope API to exclude CMA region. Use it rather than
: manipulating gfp_mask manually. With this change, we can now restore
: __GFP_MOVABLE for gfp_mask like as usual migration target allocation. It
: would result in that the ZONE_MOVABLE is also searched by page allocator.
: For hugetlb, gfp_mask is redefined since it has a regular allocation mask
: filter for migration target. __GPF_NOWARN is added to hugetlb gfp_mask
: filter since a new user for gfp_mask filter, gup, want to be silent when
: allocation fails.

This clearly states that the priority was to increase the migration
target success rate rather than bother with the pinning aspect of the
target page. So I believe we have simply ignored/missed the point of the
movable zone guarantees back then and that was a mistake.

> Signed-off-by: Pavel Tatashin <[email protected]>

I have to admit I am not really sure about the failure path. The code is
just too convoluted to follow. I presume the pin will fail in that case.
Anyway this wouldn't be anything new in this path. Movable zone
exclusion can make the failure slightly more likely in some setups, but
fundamentally there is nothing new there.

Acked-by: Michal Hocko <[email protected]>

> ---
> mm/gup.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index cdb8b9eeb016..3a76c005a3e2 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1610,7 +1610,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
> long ret = nr_pages;
> struct migration_target_control mtc = {
> .nid = NUMA_NO_NODE,
> - .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_NOWARN,
> + .gfp_mask = GFP_USER | __GFP_NOWARN,
> };
>
> check_again:
> --
> 2.25.1
>

--
Michal Hocko
SUSE Labs

2020-12-03 09:00:17

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 4/6] mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_NOMOVABLE

On Wed 02-12-20 00:23:28, Pavel Tatashin wrote:
> PF_MEMALLOC_NOCMA is used for longterm pinning and has an effect of
> clearing _GFP_MOVABLE or prohibiting allocations from ZONE_MOVABLE.

This is not precise. You are mixing the implementation and the intention
here. I would say "PF_MEMALLOC_NOCMA is used to guarantee that the
allocator will not return pages that might belong to the CMA region. This is
currently used for long-term gup to make sure that such pins are not
going to be done on any CMA pages."

> We will prohibit allocating any pages that are getting
> longterm pinned from ZONE_MOVABLE, and we would want to unify and re-use
> this flag. So, rename it to generic PF_MEMALLOC_NOMOVABLE.
> Also re-name:
> memalloc_nocma_save()/memalloc_nocma_restore
> to
> memalloc_nomovable_save()/memalloc_nomovable_restore()
> and make the new functions common.

This is hard to parse for me. I would go with something like:
"
When PF_MEMALLOC_NOCMA was introduced we didn't realize that it
focuses on CMA pages too much and that there is a larger class of
pages that need the same treatment. The MOVABLE zone cannot contain
any long-term pins either, so it makes sense to reuse and redefine this
flag for that use case as well. Rename the flag to PF_MEMALLOC_NOMOVABLE,
which defines an allocation context that can only get pages suitable
for long-term pins.
"

Btw. the naming is hard but PF_MEMALLOC_NOMOVABLE is a bit of a misnomer. CMA
pages are not implicitly movable. So in fact we care more about
pinning than movability. PF_MEMALLOC_PIN or something similar would be
a better fit for the overall intention.

Other than that looks good to me. Thanks!

> Signed-off-by: Pavel Tatashin <[email protected]>
> ---
> include/linux/sched.h | 2 +-
> include/linux/sched/mm.h | 21 +++++----------------
> mm/gup.c | 4 ++--
> mm/hugetlb.c | 4 ++--
> mm/page_alloc.c | 4 ++--
> 5 files changed, 12 insertions(+), 23 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 76cd21fa5501..f1bf05f5f5fa 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1548,7 +1548,7 @@ extern struct pid *cad_pid;
> #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
> #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
> #define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
> -#define PF_MEMALLOC_NOCMA 0x10000000 /* All allocation request will have _GFP_MOVABLE cleared */
> +#define PF_MEMALLOC_NOMOVABLE 0x10000000 /* All allocation request will have _GFP_MOVABLE cleared */
> #define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */
> #define PF_SUSPEND_TASK 0x80000000 /* This thread called freeze_processes() and should not be frozen */
>
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index d5ece7a9a403..5bb9a6b69479 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -254,29 +254,18 @@ static inline void memalloc_noreclaim_restore(unsigned int flags)
> current->flags = (current->flags & ~PF_MEMALLOC) | flags;
> }
>
> -#ifdef CONFIG_CMA
> -static inline unsigned int memalloc_nocma_save(void)
> +static inline unsigned int memalloc_nomovable_save(void)
> {
> - unsigned int flags = current->flags & PF_MEMALLOC_NOCMA;
> + unsigned int flags = current->flags & PF_MEMALLOC_NOMOVABLE;
>
> - current->flags |= PF_MEMALLOC_NOCMA;
> + current->flags |= PF_MEMALLOC_NOMOVABLE;
> return flags;
> }
>
> -static inline void memalloc_nocma_restore(unsigned int flags)
> +static inline void memalloc_nomovable_restore(unsigned int flags)
> {
> - current->flags = (current->flags & ~PF_MEMALLOC_NOCMA) | flags;
> + current->flags = (current->flags & ~PF_MEMALLOC_NOMOVABLE) | flags;
> }
> -#else
> -static inline unsigned int memalloc_nocma_save(void)
> -{
> - return 0;
> -}
> -
> -static inline void memalloc_nocma_restore(unsigned int flags)
> -{
> -}
> -#endif
>
> #ifdef CONFIG_MEMCG
> DECLARE_PER_CPU(struct mem_cgroup *, int_active_memcg);
> diff --git a/mm/gup.c b/mm/gup.c
> index 0e2de888a8b0..724d8a65e1df 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1726,7 +1726,7 @@ static long __gup_longterm_locked(struct mm_struct *mm,
> if (!vmas_tmp)
> return -ENOMEM;
> }
> - flags = memalloc_nocma_save();
> + flags = memalloc_nomovable_save();
> }
>
> rc = __get_user_pages_locked(mm, start, nr_pages, pages,
> @@ -1749,7 +1749,7 @@ static long __gup_longterm_locked(struct mm_struct *mm,
> rc = check_and_migrate_cma_pages(mm, start, rc, pages,
> vmas_tmp, gup_flags);
> out:
> - memalloc_nocma_restore(flags);
> + memalloc_nomovable_restore(flags);
> }
>
> if (vmas_tmp != vmas)
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 37f15c3c24dc..02213c74ed6b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1033,10 +1033,10 @@ static void enqueue_huge_page(struct hstate *h, struct page *page)
> static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
> {
> struct page *page;
> - bool nocma = !!(current->flags & PF_MEMALLOC_NOCMA);
> + bool nomovable = !!(current->flags & PF_MEMALLOC_NOMOVABLE);
>
> list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
> - if (nocma && is_migrate_cma_page(page))
> + if (nomovable && is_migrate_cma_page(page))
> continue;
>
> if (PageHWPoison(page))
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index eaa227a479e4..611799c72da5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3772,8 +3772,8 @@ static inline unsigned int current_alloc_flags(gfp_t gfp_mask,
> #ifdef CONFIG_CMA
> unsigned int pflags = current->flags;
>
> - if (!(pflags & PF_MEMALLOC_NOCMA) &&
> - gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> + if (!(pflags & PF_MEMALLOC_NOMOVABLE) &&
> + gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> alloc_flags |= ALLOC_CMA;
>
> #endif
> --
> 2.25.1
>

--
Michal Hocko
SUSE Labs

2020-12-03 09:19:38

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 5/6] mm: honor PF_MEMALLOC_NOMOVABLE for all allocations

On Wed 02-12-20 00:23:29, Pavel Tatashin wrote:
[...]
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 611799c72da5..7a6d86d0bc5f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3766,20 +3766,25 @@ alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
> return alloc_flags;
> }
>
> -static inline unsigned int current_alloc_flags(gfp_t gfp_mask,
> - unsigned int alloc_flags)
> +static inline unsigned int cma_alloc_flags(gfp_t gfp_mask,
> + unsigned int alloc_flags)
> {
> #ifdef CONFIG_CMA
> - unsigned int pflags = current->flags;
> -
> - if (!(pflags & PF_MEMALLOC_NOMOVABLE) &&
> - gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> + if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> alloc_flags |= ALLOC_CMA;
> -
> #endif
> return alloc_flags;
> }
>
> +static inline gfp_t current_gfp_checkmovable(gfp_t gfp_mask)
> +{
> + unsigned int pflags = current->flags;
> +
> + if ((pflags & PF_MEMALLOC_NOMOVABLE))
> + return gfp_mask & ~__GFP_MOVABLE;
> + return gfp_mask;
> +}
> +

It sucks that we have to control both ALLOC and gfp flags. But wouldn't
it be simpler and more straightforward to keep current_alloc_flags as is
(modulo the PF rename) and hook the gfp mask evaluation into current_gfp_context
and move it up before the first allocation attempt? All scope flags
should be applicable to the hot path as well. It would add a few cycles
there, but the question is whether that would be noticeable over just
handling PF_MEMALLOC_NOMOVABLE on its own. The cache line would be
pulled in anyway.
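
A minimal sketch of what that could look like, based on the ~v5.10 shape of
current_gfp_context(); the flag name (also being bikeshedded above as
PF_MEMALLOC_PIN) and the exact placement are assumptions, not a final
implementation:

static inline gfp_t current_gfp_context(gfp_t flags)
{
        unsigned int pflags = READ_ONCE(current->flags);

        if (unlikely(pflags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS |
                               PF_MEMALLOC_NOMOVABLE))) {
                /* NOIO implies both no IO and no FS; NOFS implies no FS only. */
                if (pflags & PF_MEMALLOC_NOIO)
                        flags &= ~(__GFP_IO | __GFP_FS);
                else if (pflags & PF_MEMALLOC_NOFS)
                        flags &= ~__GFP_FS;
                /* Long-term pinning context: never allocate from ZONE_MOVABLE. */
                if (pflags & PF_MEMALLOC_NOMOVABLE)
                        flags &= ~__GFP_MOVABLE;
        }
        return flags;
}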

--
Michal Hocko
SUSE Labs

2020-12-03 14:19:59

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone

On Wed, Dec 02, 2020 at 08:34:32PM -0500, Pavel Tatashin wrote:

> > Either here or perhaps even lower down the call chain when the page is
> > captured, similar to how GUP fast would detect it. (how is that done
> > anyhow?)
>
> Ah, thank you for pointing this out. I think I need to address it here:
>
> https://soleen.com/source/xref/linux/mm/gup.c?r=96e1fac1#94
>
> static __maybe_unused struct page *try_grab_compound_head()
> if (unlikely(flags & FOLL_LONGTERM) && is_migrate_cma_page(page))
> return NULL;
>
> I need to change is_migrate_cma_page() to cover all migratable pages. Will
> study this, and send an update with the fix.

Yes, missing the two flows is a common error :(

Looking at this code some more.. How is it even correct?

1633 if (!isolate_lru_page(head)) {
1634 list_add_tail(&head->lru, &cma_page_list);

Here we are only running under the read side of the mmap sem so multiple
GUPs can be calling that sequence in parallel. I don't see any
obvious exclusion that will prevent corruption of head->lru. The first
GUP thread to do isolate_lru_page() will ClearPageLRU() and the second
GUP thread will be a NOP for isolate_lru_page().

They will both race list_add_tail and other list ops. That is not OK.

> What I meant is that the users of the interface do it incrementally, not in
> large chunks. For example:
>
> vfio_pin_pages_remote
> vaddr_get_pfn
> ret = pin_user_pages_remote(mm, vaddr, 1, flags |
> FOLL_LONGTERM, page, NULL, NULL);
> 1 -> pin only one page at a time

I don't know why vfio does this; it is why it is so ridiculously slow, at
least.

Jason

2020-12-03 15:04:28

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 2/6] mm/gup: don't pin migrated cma pages in movable zone

On Thu, Dec 3, 2020 at 3:46 AM Michal Hocko <[email protected]> wrote:
>
> On Wed 02-12-20 00:23:26, Pavel Tatashin wrote:
> > In order not to fragment CMA the pinned pages are migrated. However,
> > they are migrated to ZONE_MOVABLE, which also should not have pinned pages.
> >
> > Remove __GFP_MOVABLE, so pages can be migrated to zones where pinning
> > is allowed.
>
> I was wondering why we do have __GFP_MOVABLE at all. Took a shovel
> and... 41b4dc14ee807 says:
> : We have well defined scope API to exclude CMA region. Use it rather than
> : manipulating gfp_mask manually. With this change, we can now restore
> : __GFP_MOVABLE for gfp_mask like as usual migration target allocation. It
> : would result in that the ZONE_MOVABLE is also searched by page allocator.
> : For hugetlb, gfp_mask is redefined since it has a regular allocation mask
> : filter for migration target. __GPF_NOWARN is added to hugetlb gfp_mask
> : filter since a new user for gfp_mask filter, gup, want to be silent when
> : allocation fails.
>
> This clearly states that the priority was to increase the migration
> target success rate rather than bother with the pinning aspect of the
> target page. So I believe we have simply ignored/missed the point of the
> movable zone guarantees back then and that was a mistake.
>
> > Signed-off-by: Pavel Tatashin <[email protected]>
>
> I have to admit I am not really sure about the failure path. The code is
> just too convoluted to follow. I presume the pin will fail in that case.
> Anyway this wouldn't be anything new in this path. Movable zone
> exclusion can make the failure slightly more possible in some setups but
> fundamentally nothing new there.

I've been trying to keep this series simple for easier backport, and
not to introduce new changes besides increasing the scope of pages
which are not allowed to be pinned. This area, however, requires some
inspection and fixes, something that Jason also mentioned in another
patch.

> Acked-by: Michal Hocko <[email protected]>

Thank you,
Pasha

2020-12-03 15:06:36

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 3/6] mm/gup: make __gup_longterm_locked common

> Reviewed-by: John Hubbard <[email protected]>

Thank you,
Pasha

2020-12-03 15:06:36

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 4/6] mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_NOMOVABLE

On Thu, Dec 3, 2020 at 3:57 AM Michal Hocko <[email protected]> wrote:
>
> On Wed 02-12-20 00:23:28, Pavel Tatashin wrote:
> > PF_MEMALLOC_NOCMA is used for longterm pinning and has an effect of
> > clearing _GFP_MOVABLE or prohibiting allocations from ZONE_MOVABLE.
>
> This is not precise. You are mixing the implementation and the intention
> here. I would say "PF_MEMALLOC_NOCMA is used ot guarantee that the
> allocator will not return pages that might belong to CMA region. This is
> currently used for long term gup to make sure that such pins are not
> going to be done on any CMA pages."
>
> > We will prohibit allocating any pages that are getting
> > longterm pinned from ZONE_MOVABLE, and we would want to unify and re-use
> > this flag. So, rename it to generic PF_MEMALLOC_NOMOVABLE.
> > Also re-name:
> > memalloc_nocma_save()/memalloc_nocma_restore
> > to
> > memalloc_nomovable_save()/memalloc_nomovable_restore()
> > and make the new functions common.
>
> This is hard to parse for me. I would go with something like:
> "
> When PF_MEMALLOC_NOCMA has been introduced we haven't realized that it
> is focusing on CMA pages too much and that there is larger class of
> pages that need the same treatment. MOVABLE zone cannot contain
> any long term pins as well so it makes sense to reuse and redefine this
> flag for that usecase as well. Rename the flag to PF_MEMALLOC_NOMOVABLE
> which defines an allocation context which can only get pages suitable
> for long-term pins.
> "

I will address the above with your suggested wording.

>
> Btw. the naming is hard but PF_MEMALLOC_NOMOVABLE is a bit misnomer. CMA
> pages are not implicitly movable. So in fact we do care more about
> pinning than movability. PF_MEMALLOC_PIN or something similar would be
> better fit for the overal intention.

Sounds good, I will rename it to PF_MEMALLOC_PIN

>
> Other than that looks good to me. Thanks!

Thank you,
Pasha

2020-12-03 15:08:41

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 4/6] mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_NOMOVABLE

>
> Reviewed-by: John Hubbard <[email protected]>

Thank you for your review.
Pasha

2020-12-03 15:12:36

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 5/6] mm: honor PF_MEMALLOC_NOMOVABLE for all allocations

On Thu, Dec 3, 2020 at 3:17 AM John Hubbard <[email protected]> wrote:
>
> On 12/1/20 9:23 PM, Pavel Tatashin wrote:
> > PF_MEMALLOC_NOMOVABLE is only honored for CMA allocations, extend
> > this flag to work for any allocations by removing __GFP_MOVABLE from
> > gfp_mask when this flag is passed in the current context, thus
> > prohibiting allocations from ZONE_MOVABLE.
> >
> > Signed-off-by: Pavel Tatashin <[email protected]>
> > ---
> > mm/hugetlb.c | 2 +-
> > mm/page_alloc.c | 26 ++++++++++++++++----------
> > 2 files changed, 17 insertions(+), 11 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 02213c74ed6b..00e786201d8b 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1036,7 +1036,7 @@ static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
> > bool nomovable = !!(current->flags & PF_MEMALLOC_NOMOVABLE);
> >
> > list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
> > - if (nomovable && is_migrate_cma_page(page))
> > + if (nomovable && is_migrate_movable(get_pageblock_migratetype(page)))
>
>
> I wonder if we should add a helper, like is_migrate_cma_page(), that avoids having
> to call get_pageblock_migratetype() at all of the callsites?

Good idea, I will add it.
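
One possible shape for such a helper, as a sketch (the name here is made up;
it just hides the migratetype lookup from the callers):

/* Sketch only: wrap the open-coded migratetype test used in this patch. */
static inline bool is_migrate_movable_page(struct page *page)
{
        return is_migrate_movable(get_pageblock_migratetype(page));
}

The hugetlb check above would then read
"if (nomovable && is_migrate_movable_page(page)) continue;", and the gup
loop in patch 6 could use the same helper.
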

>
>
> > continue;
> >
> > if (PageHWPoison(page))
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 611799c72da5..7a6d86d0bc5f 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3766,20 +3766,25 @@ alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
> > return alloc_flags;
> > }
> >
> > -static inline unsigned int current_alloc_flags(gfp_t gfp_mask,
> > - unsigned int alloc_flags)
> > +static inline unsigned int cma_alloc_flags(gfp_t gfp_mask,
> > + unsigned int alloc_flags)
>
> Actually, maybe the original name should be left intact. This handles current alloc
> flags, which right now happen to only cover CMA flags, so the original name seems
> accurate, right?

The reason I renamed it is that we do not access the current context
anymore, and only use gfp_mask to get the CMA flag.
>> - unsigned int pflags = current->flags;

So, keeping "current" in the function name makes its intent misleading.

Thank you,
Pasha

2020-12-03 15:22:06

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 5/6] mm: honor PF_MEMALLOC_NOMOVABLE for all allocations

On Thu, Dec 3, 2020 at 4:17 AM Michal Hocko <[email protected]> wrote:
>
> On Wed 02-12-20 00:23:29, Pavel Tatashin wrote:
> [...]
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 611799c72da5..7a6d86d0bc5f 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3766,20 +3766,25 @@ alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
> > return alloc_flags;
> > }
> >
> > -static inline unsigned int current_alloc_flags(gfp_t gfp_mask,
> > - unsigned int alloc_flags)
> > +static inline unsigned int cma_alloc_flags(gfp_t gfp_mask,
> > + unsigned int alloc_flags)
> > {
> > #ifdef CONFIG_CMA
> > - unsigned int pflags = current->flags;
> > -
> > - if (!(pflags & PF_MEMALLOC_NOMOVABLE) &&
> > - gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> > + if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> > alloc_flags |= ALLOC_CMA;
> > -
> > #endif
> > return alloc_flags;
> > }
> >
> > +static inline gfp_t current_gfp_checkmovable(gfp_t gfp_mask)
> > +{
> > + unsigned int pflags = current->flags;
> > +
> > + if ((pflags & PF_MEMALLOC_NOMOVABLE))
> > + return gfp_mask & ~__GFP_MOVABLE;
> > + return gfp_mask;
> > +}
> > +
>
> It sucks that we have to control both ALLOC and gfp flags. But wouldn't
> it be simpler and more straightforward to keep current_alloc_flags as is
> (module PF rename) and hook the gfp mask evaluation into current_gfp_context
> and move it up before the first allocation attempt?

We could do that, but perhaps as a separate patch? I am worried about
hidden implications of adding the extra scope handling (GFP_NOIO|GFP_NOFS) to the
fast path. Also, current_gfp_context() is used elsewhere, and in some
places removing __GFP_MOVABLE from gfp_mask means that we will need to
also change other things. For example [1], in try_to_free_pages() we
call current_gfp_context(gfp_mask) which can reduce the maximum zone
idx, yet we simply set it to: reclaim_idx = gfp_zone(gfp_mask), not to
the newly determined gfp_mask.

[1] https://soleen.com/source/xref/linux/mm/vmscan.c?r=2da9f630#3239


> All scope flags
> should be applicable to the hot path as well. It would add few cycles to
> there but the question is whether that would be noticeable over just
> handling PF_MEMALLOC_NOMOVABLE on its own. The cache line would be
> pulled in anyway.

Let's try it in a separate patch? I will add it in the next version of
this series.

Thank you,
Pasha

2020-12-03 15:59:14

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone

> I like the cleanup so far, even at this point it's a relief to finally
> see the nested ifdefs get removed.
>
> One naming nit/idea below, but this looks fine as is, so:
>
> Reviewed-by: John Hubbard <[email protected]>

Thank you for reviewing this series.

> Maybe naming it "movable_page_list", would be a nice touch.

Sure, I will change it.

Thank you,
Pasha

2020-12-03 16:45:55

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone

> Looking at this code some more.. How is it even correct?
>
> 1633 if (!isolate_lru_page(head)) {
> 1634 list_add_tail(&head->lru, &cma_page_list);
>
> Here we are only running under the read side of the mmap sem so multiple
> GUPs can be calling that sequence in parallel. I don't see any
> obvious exclusion that will prevent corruption of head->lru. The first
> GUP thread to do isolate_lru_page() will ClearPageLRU() and the second
> GUP thread will be a NOP for isolate_lru_page().
>
> They will both race list_add_tail and other list ops. That is not OK.

Good question. I studied it, and I do not see how this is OK. Worse,
this race can also be triggered via a syscall instead of via a driver: two
move_pages() calls run simultaneously. Perhaps in other places too?

move_pages()
kernel_move_pages()
mmget()
do_pages_move()
add_page_for_migration()
mmap_read_lock(mm);
list_add_tail(&head->lru, pagelist); <- Not protected

>
> > What I meant is the users of the interface do it incrementally not in
> > large chunks. For example:
> >
> > vfio_pin_pages_remote
> > vaddr_get_pfn
> > ret = pin_user_pages_remote(mm, vaddr, 1, flags |
> > FOLL_LONGTERM, page, NULL, NULL);
> > 1 -> pin only one pages at a time
>
> I don't know why vfio does this, it is why it so ridiculously slow at
> least.

Agreed.

>
> Jason

2020-12-03 16:55:52

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 5/6] mm: honor PF_MEMALLOC_NOMOVABLE for all allocations

On 12/3/20 7:06 AM, Pavel Tatashin wrote:
...
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 611799c72da5..7a6d86d0bc5f 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -3766,20 +3766,25 @@ alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
>>> return alloc_flags;
>>> }
>>>
>>> -static inline unsigned int current_alloc_flags(gfp_t gfp_mask,
>>> - unsigned int alloc_flags)
>>> +static inline unsigned int cma_alloc_flags(gfp_t gfp_mask,
>>> + unsigned int alloc_flags)
>>
>> Actually, maybe the original name should be left intact. This handles current alloc
>> flags, which right now happen to only cover CMA flags, so the original name seems
>> accurate, right?
>
> The reason I re-named it is because we do not access current context
> anymore, only use gfp_mask to get cma flag.
>>> - unsigned int pflags = current->flags;
>
> So, keeping "current" in the function name makes its intent misleading.
>

OK, I see. That sounds OK then.


thanks,
--
John Hubbard
NVIDIA

2020-12-03 17:05:01

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone

On Thu, Dec 03, 2020 at 11:40:15AM -0500, Pavel Tatashin wrote:
> > Looking at this code some more.. How is it even correct?
> >
> > 1633 if (!isolate_lru_page(head)) {
> > 1634 list_add_tail(&head->lru, &cma_page_list);
> >
> > Here we are only running under the read side of the mmap sem so multiple
> > GUPs can be calling that sequence in parallel. I don't see any
> > obvious exclusion that will prevent corruption of head->lru. The first
> > GUP thread to do isolate_lru_page() will ClearPageLRU() and the second
> > GUP thread will be a NOP for isolate_lru_page().
> >
> > They will both race list_add_tail and other list ops. That is not OK.
>
> Good question. I studied it, and I do not see how this is OK. Worse,
> this race is also exposable as a syscall instead of via driver: two
> move_pages() run simultaneously. Perhaps in other places?
>
> move_pages()
> kernel_move_pages()
> mmget()
> do_pages_move()
> add_page_for_migratio()
> mmap_read_lock(mm);
> list_add_tail(&head->lru, pagelist); <- Not protected

When this was CMA only it might have been rarer to trigger, but this
move stuff sounds like it makes it much more broadly reachable, e.g. on typical
servers with RDMA exposed, etc.

Seems like it needs fixing as part of this too :\

Page at a time inside the gup loop could address both concerns, unsure
about batching performance here though..

Jason

2020-12-03 17:18:25

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone

On Thu, Dec 3, 2020 at 11:59 AM Jason Gunthorpe <[email protected]> wrote:
>
> On Thu, Dec 03, 2020 at 11:40:15AM -0500, Pavel Tatashin wrote:
> > > Looking at this code some more.. How is it even correct?
> > >
> > > 1633 if (!isolate_lru_page(head)) {
> > > 1634 list_add_tail(&head->lru, &cma_page_list);
> > >
> > > Here we are only running under the read side of the mmap sem so multiple
> > > GUPs can be calling that sequence in parallel. I don't see any
> > > obvious exclusion that will prevent corruption of head->lru. The first
> > > GUP thread to do isolate_lru_page() will ClearPageLRU() and the second
> > > GUP thread will be a NOP for isolate_lru_page().
> > >
> > > They will both race list_add_tail and other list ops. That is not OK.
> >
> > Good question. I studied it, and I do not see how this is OK. Worse,
> > this race is also exposable as a syscall instead of via driver: two
> > move_pages() run simultaneously. Perhaps in other places?
> >
> > move_pages()
> > kernel_move_pages()
> > mmget()
> > do_pages_move()
> > add_page_for_migratio()
> > mmap_read_lock(mm);
> > list_add_tail(&head->lru, pagelist); <- Not protected
>
> When this was CMA only it might have been rarer to trigger, but this
> move stuff sounds like it makes it much more broadly, eg on typical
> servers with RDMA exposed/etc
>
> Seems like it needs fixing as part of this too :\

Just to clarify: the stack that I showed above is outside of gup; it is
the same issue that you pointed out, just happening elsewhere. I suspect
there might be more. All of them should be addressed together.

Pasha

>
> Page at a time inside the gup loop could address both concerns, unsure
> about batching performance here though..
>
> Jason

2020-12-03 19:22:42

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone

> > > > Looking at this code some more.. How is it even correct?
> > > >
> > > > 1633 if (!isolate_lru_page(head)) {
> > > > 1634 list_add_tail(&head->lru, &cma_page_list);
> > > >
> > > > Here we are only running under the read side of the mmap sem so multiple
> > > > GUPs can be calling that sequence in parallel. I don't see any
> > > > obvious exclusion that will prevent corruption of head->lru. The first
> > > > GUP thread to do isolate_lru_page() will ClearPageLRU() and the second
> > > > GUP thread will be a NOP for isolate_lru_page().
> > > >
> > > > They will both race list_add_tail and other list ops. That is not OK.
> > >
> > > Good question. I studied it, and I do not see how this is OK. Worse,
> > > this race is also exposable as a syscall instead of via driver: two
> > > move_pages() run simultaneously. Perhaps in other places?
> > >
> > > move_pages()
> > > kernel_move_pages()
> > > mmget()
> > > do_pages_move()
> > > add_page_for_migratio()
> > > mmap_read_lock(mm);
> > > list_add_tail(&head->lru, pagelist); <- Not protected
> >
> > When this was CMA only it might have been rarer to trigger, but this
> > move stuff sounds like it makes it much more broadly, eg on typical
> > servers with RDMA exposed/etc
> >
> > Seems like it needs fixing as part of this too :\
>
> Just to clarify the stack that I showed above is outside of gup, it is
> the same issue that you pointed out that happens elsewhere. I suspect
> there might be more. All of them should be addressed together.

Hi Jason,

I studied some more, and I think this is not a race:
list_add_tail(&head->lru, &cma_page_list) is called only when
isolate_lru_page(head) succeeds.
isolate_lru_page(head) succeeds only when PageLRU(head) is true.
However, in this function we also clear the LRU flag before returning
success.
This means that if we race with another thread, the other thread
won't get to the unprotected list_add_tail(&head->lru, &cma_page_list)
until head is back on the LRU list.
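
Schematically, the exclusion looks like this (a condensed sketch of the loop
in check_and_migrate_cma_pages(); isolate_lru_page() returns 0 only for the
caller that actually clears PG_lru):

        if (isolate_lru_page(head)) {
                /*
                 * PG_lru was already clear: either another thread has
                 * isolated the page, or it is not on the LRU at all.
                 * Either way we must not touch head->lru here.
                 */
                continue;
        }
        /* We cleared PG_lru, so head->lru is ours until putback. */
        list_add_tail(&head->lru, &cma_page_list);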

Please let me know if I am missing anything.

Thank you,
Pasha

2020-12-03 19:39:37

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone

On Thu, Dec 03, 2020 at 02:15:36PM -0500, Pavel Tatashin wrote:

> I studied some more, and I think this is not a race:
> list_add_tail(&head->lru, &cma_page_list) is called only when
> isolate_lru_page(head) succeeds.
> isolate_lru_page(head) succeeds only when PageLRU(head) is true.
> However, in this function we also clear LRU flag before returning
> success.
> This means, that if we race with another thread, the other thread
> won't get to unprotected list_add_tail(&head->lru, &cma_page_list)
> until head is is back on LRU list.

Oh interesting, I totally didn't see how that LRU stuff is
working. So.. this creates a ridiculously expensive spin lock? Not
broken, but yikes :|

Jason

2020-12-04 04:06:10

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 0/6] prohibit pinning pages in ZONE_MOVABLE

Hello,

On Wed, Dec 02, 2020 at 12:23:24AM -0500, Pavel Tatashin wrote:
> When page is pinned it cannot be moved and its physical address stays
> the same until pages is unpinned.
>
> This is useful functionality to allows userland to implementation DMA
> access. For example, it is used by vfio in vfio_pin_pages().
>
> However, this functionality breaks memory hotplug/hotremove assumptions
> that pages in ZONE_MOVABLE can always be migrated.
>
> This patch series fixes this issue by forcing new allocations during
> page pinning to omit ZONE_MOVABLE, and also to migrate any existing
> pages from ZONE_MOVABLE during pinning.

I love what this patchset does, but, at least, it's better to consider
the side-effects of this patchset and document them somewhere. IIUC,
ZONE_MOVABLE exists for two purposes.

1) increasing availability of THP
2) memory hot-unplug

A potential issue would come from case 1). They use ZONE_MOVABLE
for THP availability, and a hard guarantee of migration hasn't been required
until now. So, there could be a system with the following configuration.

- memory layout: ZONE_NORMAL-512MB, ZONE_MOVABLE-512MB
- memory usage: unmovable-256MB, movable pinned-256MB, movable
unpinned-512MB

With this patchset, movable pinned pages would have to be placed in
ZONE_NORMAL, so 512MB is required for ZONE_NORMAL. ZONE_NORMAL would be
exhausted and system performance would be highly affected, depending on the
memory usage pattern.

I'm not sure whether such a configuration exists or not, but, at least,
it's better to write down this risk in the commit message or somewhere
else.

Thanks.

2020-12-04 04:17:55

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone

On Wed, Dec 02, 2020 at 12:23:30AM -0500, Pavel Tatashin wrote:
> We do not allocate pinned pages in ZONE_MOVABLE, but if pages were already
> allocated before pinning, they need to be migrated to a different zone.
> Currently, we migrate movable CMA pages only. Generalize the function
> that migrates CMA pages to migrate all movable pages.
>
> Signed-off-by: Pavel Tatashin <[email protected]>
> ---
> include/linux/migrate.h | 1 +
> include/trace/events/migrate.h | 3 +-
> mm/gup.c | 56 +++++++++++++---------------------
> 3 files changed, 24 insertions(+), 36 deletions(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 0f8d1583fa8e..00bab23d1ee5 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -27,6 +27,7 @@ enum migrate_reason {
> MR_MEMPOLICY_MBIND,
> MR_NUMA_MISPLACED,
> MR_CONTIG_RANGE,
> + MR_LONGTERM_PIN,
> MR_TYPES
> };
>
> diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
> index 4d434398d64d..363b54ce104c 100644
> --- a/include/trace/events/migrate.h
> +++ b/include/trace/events/migrate.h
> @@ -20,7 +20,8 @@
> EM( MR_SYSCALL, "syscall_or_cpuset") \
> EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind") \
> EM( MR_NUMA_MISPLACED, "numa_misplaced") \
> - EMe(MR_CONTIG_RANGE, "contig_range")
> + EM( MR_CONTIG_RANGE, "contig_range") \
> + EMe(MR_LONGTERM_PIN, "longterm_pin")
>
> /*
> * First define the enums in the above macros to be exported to userspace
> diff --git a/mm/gup.c b/mm/gup.c
> index 724d8a65e1df..1d511f65f8a7 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1593,19 +1593,18 @@ static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
> }
> #endif
>
> -#ifdef CONFIG_CMA
> -static long check_and_migrate_cma_pages(struct mm_struct *mm,
> - unsigned long start,
> - unsigned long nr_pages,
> - struct page **pages,
> - struct vm_area_struct **vmas,
> - unsigned int gup_flags)
> +static long check_and_migrate_movable_pages(struct mm_struct *mm,
> + unsigned long start,
> + unsigned long nr_pages,
> + struct page **pages,
> + struct vm_area_struct **vmas,
> + unsigned int gup_flags)
> {
> unsigned long i;
> unsigned long step;
> bool drain_allow = true;
> bool migrate_allow = true;
> - LIST_HEAD(cma_page_list);
> + LIST_HEAD(page_list);
> long ret = nr_pages;
> struct migration_target_control mtc = {
> .nid = NUMA_NO_NODE,
> @@ -1623,13 +1622,12 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
> */
> step = compound_nr(head) - (pages[i] - head);
> /*
> - * If we get a page from the CMA zone, since we are going to
> - * be pinning these entries, we might as well move them out
> - * of the CMA zone if possible.
> + * If we get a movable page, since we are going to be pinning
> + * these entries, try to move them out if possible.
> */
> - if (is_migrate_cma_page(head)) {
> + if (is_migrate_movable(get_pageblock_migratetype(head))) {

is_migrate_movable() isn't a check for the ZONE. It's a check for the
MIGRATE_TYPE. The MIGRATE_TYPE doesn't require a hard guarantee of
migration, and most of memory, including ZONE_NORMAL, is
MIGRATE_MOVABLE. With this code, long-term gup would always fail due
to not enough memory. I think the correct change would be
"is_migrate_cma_page(head) && zone == ZONE_MOVABLE".

Patch #5 also has this problem. Please fix it too.

Thanks.

2020-12-04 08:46:20

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 5/6] mm: honor PF_MEMALLOC_NOMOVABLE for all allocations

On Thu 03-12-20 10:15:41, Pavel Tatashin wrote:
> On Thu, Dec 3, 2020 at 4:17 AM Michal Hocko <[email protected]> wrote:
> >
> > On Wed 02-12-20 00:23:29, Pavel Tatashin wrote:
> > [...]
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 611799c72da5..7a6d86d0bc5f 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -3766,20 +3766,25 @@ alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
> > > return alloc_flags;
> > > }
> > >
> > > -static inline unsigned int current_alloc_flags(gfp_t gfp_mask,
> > > - unsigned int alloc_flags)
> > > +static inline unsigned int cma_alloc_flags(gfp_t gfp_mask,
> > > + unsigned int alloc_flags)
> > > {
> > > #ifdef CONFIG_CMA
> > > - unsigned int pflags = current->flags;
> > > -
> > > - if (!(pflags & PF_MEMALLOC_NOMOVABLE) &&
> > > - gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> > > + if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> > > alloc_flags |= ALLOC_CMA;
> > > -
> > > #endif
> > > return alloc_flags;
> > > }
> > >
> > > +static inline gfp_t current_gfp_checkmovable(gfp_t gfp_mask)
> > > +{
> > > + unsigned int pflags = current->flags;
> > > +
> > > + if ((pflags & PF_MEMALLOC_NOMOVABLE))
> > > + return gfp_mask & ~__GFP_MOVABLE;
> > > + return gfp_mask;
> > > +}
> > > +
> >
> > It sucks that we have to control both ALLOC and gfp flags. But wouldn't
> > it be simpler and more straightforward to keep current_alloc_flags as is
> > (module PF rename) and hook the gfp mask evaluation into current_gfp_context
> > and move it up before the first allocation attempt?
>
> We could do that, but perhaps as a separate patch? I am worried about
> hidden implication of adding extra scope (GFP_NOIO|GFP_NOFS) to the
> fast path.

Why?

> Also, current_gfp_context() is used elsewhere, and in some
> places removing __GFP_MOVABLE from gfp_mask means that we will need to
> also change other things. For example [1], in try_to_free_pages() we
> call current_gfp_context(gfp_mask) which can reduce the maximum zone
> idx, yet we simply set it to: reclaim_idx = gfp_zone(gfp_mask), not to
> the newly determined gfp_mask.

Yes and the direct reclaim should honor the movable zone restriction.
Why should we reclaim ZONE_MOVABLE when the allocation cannot really
allocate from it? Or have I misunderstood your concern?

>
> [1] https://soleen.com/source/xref/linux/mm/vmscan.c?r=2da9f630#3239
>
>
> All scope flags
> > should be applicable to the hot path as well. It would add few cycles to
> > there but the question is whether that would be noticeable over just
> > handling PF_MEMALLOC_NOMOVABLE on its own. The cache line would be
> > pulled in anyway.
>
> Let's try it in a separate patch? I will add it in the next version of
> this series.

Separate patch or not is up to you. But I do not see a strong reason why
this cannot be addressed in a single one.

--
Michal Hocko
SUSE Labs

2020-12-04 08:57:35

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 5/6] mm: honor PF_MEMALLOC_NOMOVABLE for all allocations

On Fri 04-12-20 09:43:13, Michal Hocko wrote:
> On Thu 03-12-20 10:15:41, Pavel Tatashin wrote:
[...]
> > Also, current_gfp_context() is used elsewhere, and in some
> > places removing __GFP_MOVABLE from gfp_mask means that we will need to
> > also change other things. For example [1], in try_to_free_pages() we
> > call current_gfp_context(gfp_mask) which can reduce the maximum zone
> > idx, yet we simply set it to: reclaim_idx = gfp_zone(gfp_mask), not to
> > the newly determined gfp_mask.
>
> Yes and the direct reclaim should honor the movable zone restriction.
> Why should we reclaim ZONE_MOVABLE when the allocation cannot really
> allocate from it? Or have I misunderstood your concern?

Btw. if we have gfp mask properly filtered for the fast path then we can
remove the additional call to current_gfp_context from the direct
reclaim path. Something for a separate patch.
--
Michal Hocko
SUSE Labs

2020-12-04 15:59:08

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 0/6] prohibit pinning pages in ZONE_MOVABLE

On Thu, Dec 3, 2020 at 11:03 PM Joonsoo Kim <[email protected]> wrote:
>
> Hello,
>
> On Wed, Dec 02, 2020 at 12:23:24AM -0500, Pavel Tatashin wrote:
> > When page is pinned it cannot be moved and its physical address stays
> > the same until pages is unpinned.
> >
> > This is useful functionality to allows userland to implementation DMA
> > access. For example, it is used by vfio in vfio_pin_pages().
> >
> > However, this functionality breaks memory hotplug/hotremove assumptions
> > that pages in ZONE_MOVABLE can always be migrated.
> >
> > This patch series fixes this issue by forcing new allocations during
> > page pinning to omit ZONE_MOVABLE, and also to migrate any existing
> > pages from ZONE_MOVABLE during pinning.
>
> I love what this patchset does, but, at least, it's better to consider
> the side-effect of this patchset and inform it in somewhere. IIUC,
> ZONE_MOVABLE exists for two purposes.
>
> 1) increasing availability of THP
> 2) memory hot-unplug
>
> Potential issue would come from the case 1). They uses ZONE_MOVABLE
> for THP availability and hard guarantee for migration isn't required
> until now. So, there would be a system with following congifuration.
>
> - memory layout: ZONE_NORMAL-512MB, ZONE_MOVABLE-512MB
> - memory usage: unmovable-256MB, movable pinned-256MB, movable
> unpinned-512MB
>
> With this patchset, movable pinned should be placed in ZONE_NORMAL so
> 512MB is required for ZONE_NORMAL. ZONE_NORMAL would be exhausted and
> system performance would be highly afftect according to memory usage
> pattern.
>
> I'm not sure whether such configuration exists or not, but, at least,
> it's better to write down this risk on commit message or something
> else.

Yes, this indeed could be a problem for some configurations. I will
add your comment to the commit log of one of the patches.

Thank you,
Pasha

2020-12-04 16:13:18

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 0/6] prohibit pinning pages in ZONE_MOVABLE

On Fri, Dec 04, 2020 at 10:55:30AM -0500, Pavel Tatashin wrote:
> On Thu, Dec 3, 2020 at 11:03 PM Joonsoo Kim <[email protected]> wrote:
> >
> > Hello,
> >
> > On Wed, Dec 02, 2020 at 12:23:24AM -0500, Pavel Tatashin wrote:
> > > When page is pinned it cannot be moved and its physical address stays
> > > the same until pages is unpinned.
> > >
> > > This is useful functionality to allows userland to implementation DMA
> > > access. For example, it is used by vfio in vfio_pin_pages().
> > >
> > > However, this functionality breaks memory hotplug/hotremove assumptions
> > > that pages in ZONE_MOVABLE can always be migrated.
> > >
> > > This patch series fixes this issue by forcing new allocations during
> > > page pinning to omit ZONE_MOVABLE, and also to migrate any existing
> > > pages from ZONE_MOVABLE during pinning.
> >
> > I love what this patchset does, but, at least, it's better to consider
> > the side-effect of this patchset and inform it in somewhere. IIUC,
> > ZONE_MOVABLE exists for two purposes.
> >
> > 1) increasing availability of THP
> > 2) memory hot-unplug
> >
> > Potential issue would come from the case 1). They uses ZONE_MOVABLE
> > for THP availability and hard guarantee for migration isn't required
> > until now. So, there would be a system with following congifuration.
> >
> > - memory layout: ZONE_NORMAL-512MB, ZONE_MOVABLE-512MB
> > - memory usage: unmovable-256MB, movable pinned-256MB, movable
> > unpinned-512MB
> >
> > With this patchset, movable pinned should be placed in ZONE_NORMAL so
> > 512MB is required for ZONE_NORMAL. ZONE_NORMAL would be exhausted and
> > system performance would be highly afftect according to memory usage
> > pattern.
> >
> > I'm not sure whether such configuration exists or not, but, at least,
> > it's better to write down this risk on commit message or something
> > else.
>
> Yes, this indeed could be a problem for some configurations. I will
> add your comment to the commit log of one of the patches.

It sounds like there is some inherent tension here: breaking THPs
when doing pin_user_pages() is a really nasty thing to do. DMA
benefits greatly from THP.

I know nothing about ZONE_MOVABLE; is this auto-setup or an admin
option? If the result of this patch is that standard systems can no longer
pin > 80% of their memory, I have some regression concerns.

Jason

2020-12-04 16:13:29

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 5/6] mm: honor PF_MEMALLOC_NOMOVABLE for all allocations

On Fri, Dec 4, 2020 at 3:54 AM Michal Hocko <[email protected]> wrote:
>
> On Fri 04-12-20 09:43:13, Michal Hocko wrote:
> > On Thu 03-12-20 10:15:41, Pavel Tatashin wrote:
> [...]
> > > Also, current_gfp_context() is used elsewhere, and in some
> > > places removing __GFP_MOVABLE from gfp_mask means that we will need to
> > > also change other things. For example [1], in try_to_free_pages() we
> > > call current_gfp_context(gfp_mask) which can reduce the maximum zone
> > > idx, yet we simply set it to: reclaim_idx = gfp_zone(gfp_mask), not to
> > > the newly determined gfp_mask.
> >
> > Yes and the direct reclaim should honor the movable zone restriction.
> > Why should we reclaim ZONE_MOVABLE when the allocation cannot really
> > allocate from it? Or have I misunderstood your concern?
>
> Btw. if we have gfp mask properly filtered for the fast path then we can
> remove the additional call to current_gfp_context from the direct
> reclaim path. Something for a separate patch.

Good point. I am thinking of making a preparation patch at the beginning
of the series where we move current_gfp_context() to the fast path,
and also addressing all other cases where this call is not going to be
needed anymore, or where the gfp_mask will need to be set according
to what current_gfp_context() returned.

Thanks,
Pasha

2020-12-04 16:28:22

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone

On Thu, Dec 3, 2020 at 2:36 PM Jason Gunthorpe <[email protected]> wrote:
>
> On Thu, Dec 03, 2020 at 02:15:36PM -0500, Pavel Tatashin wrote:
>
> > I studied some more, and I think this is not a race:
> > list_add_tail(&head->lru, &cma_page_list) is called only when
> > isolate_lru_page(head) succeeds.
> > isolate_lru_page(head) succeeds only when PageLRU(head) is true.
> > However, in this function we also clear LRU flag before returning
> > success.
> > This means, that if we race with another thread, the other thread
> > won't get to unprotected list_add_tail(&head->lru, &cma_page_list)
> > until head is is back on LRU list.
>
> Oh interesting, I totally didn't see how that LRU stuff is
> working. So.. this creates a ridiculously expensive spin lock? Not
> broken, but yikes :|

Not really a spin lock: the second thread won't be able to isolate
this page, and will skip migrating it.

>
> Jason

2020-12-04 17:09:59

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone

On Fri, Dec 04, 2020 at 11:24:56AM -0500, Pavel Tatashin wrote:
> On Thu, Dec 3, 2020 at 2:36 PM Jason Gunthorpe <[email protected]> wrote:
> >
> > On Thu, Dec 03, 2020 at 02:15:36PM -0500, Pavel Tatashin wrote:
> >
> > > I studied some more, and I think this is not a race:
> > > list_add_tail(&head->lru, &cma_page_list) is called only when
> > > isolate_lru_page(head) succeeds.
> > > isolate_lru_page(head) succeeds only when PageLRU(head) is true.
> > > However, in this function we also clear LRU flag before returning
> > > success.
> > > This means, that if we race with another thread, the other thread
> > > won't get to unprotected list_add_tail(&head->lru, &cma_page_list)
> > > until head is is back on LRU list.
> >
> > Oh interesting, I totally didn't see how that LRU stuff is
> > working. So.. this creates a ridiculously expensive spin lock? Not
> > broken, but yikes :|
>
> Not really a spin lock, the second thread won't be able to isolate
> this page, and will skip migration of this page.

It looks like the intent is that it will call gup again, then goto
check_again, and once again try to isolate the LRU. ie it loops.

If it gets to a point where all the CMA pages fail to isolate then it
simply exits with success as the cma_page_list will be empty.

Is this a bug? It seems like a bug: the invariant here is to not
return with a CMA page, so why do we have a path that does return with
a CMA page?

Jason

2020-12-04 17:49:09

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone

On Thu, Dec 3, 2020 at 11:14 PM Joonsoo Kim <[email protected]> wrote:
>
> On Wed, Dec 02, 2020 at 12:23:30AM -0500, Pavel Tatashin wrote:
> > We do not allocate pin pages in ZONE_MOVABLE, but if pages were already
> > allocated before pinning they need to migrated to a different zone.
> > Currently, we migrate movable CMA pages only. Generalize the function
> > that migrates CMA pages to migrate all movable pages.
> >
> > Signed-off-by: Pavel Tatashin <[email protected]>
> > ---
> > include/linux/migrate.h | 1 +
> > include/trace/events/migrate.h | 3 +-
> > mm/gup.c | 56 +++++++++++++---------------------
> > 3 files changed, 24 insertions(+), 36 deletions(-)
> >
> > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > index 0f8d1583fa8e..00bab23d1ee5 100644
> > --- a/include/linux/migrate.h
> > +++ b/include/linux/migrate.h
> > @@ -27,6 +27,7 @@ enum migrate_reason {
> > MR_MEMPOLICY_MBIND,
> > MR_NUMA_MISPLACED,
> > MR_CONTIG_RANGE,
> > + MR_LONGTERM_PIN,
> > MR_TYPES
> > };
> >
> > diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
> > index 4d434398d64d..363b54ce104c 100644
> > --- a/include/trace/events/migrate.h
> > +++ b/include/trace/events/migrate.h
> > @@ -20,7 +20,8 @@
> > EM( MR_SYSCALL, "syscall_or_cpuset") \
> > EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind") \
> > EM( MR_NUMA_MISPLACED, "numa_misplaced") \
> > - EMe(MR_CONTIG_RANGE, "contig_range")
> > + EM( MR_CONTIG_RANGE, "contig_range") \
> > + EMe(MR_LONGTERM_PIN, "longterm_pin")
> >
> > /*
> > * First define the enums in the above macros to be exported to userspace
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 724d8a65e1df..1d511f65f8a7 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1593,19 +1593,18 @@ static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
> > }
> > #endif
> >
> > -#ifdef CONFIG_CMA
> > -static long check_and_migrate_cma_pages(struct mm_struct *mm,
> > - unsigned long start,
> > - unsigned long nr_pages,
> > - struct page **pages,
> > - struct vm_area_struct **vmas,
> > - unsigned int gup_flags)
> > +static long check_and_migrate_movable_pages(struct mm_struct *mm,
> > + unsigned long start,
> > + unsigned long nr_pages,
> > + struct page **pages,
> > + struct vm_area_struct **vmas,
> > + unsigned int gup_flags)
> > {
> > unsigned long i;
> > unsigned long step;
> > bool drain_allow = true;
> > bool migrate_allow = true;
> > - LIST_HEAD(cma_page_list);
> > + LIST_HEAD(page_list);
> > long ret = nr_pages;
> > struct migration_target_control mtc = {
> > .nid = NUMA_NO_NODE,
> > @@ -1623,13 +1622,12 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
> > */
> > step = compound_nr(head) - (pages[i] - head);
> > /*
> > - * If we get a page from the CMA zone, since we are going to
> > - * be pinning these entries, we might as well move them out
> > - * of the CMA zone if possible.
> > + * If we get a movable page, since we are going to be pinning
> > + * these entries, try to move them out if possible.
> > */
> > - if (is_migrate_cma_page(head)) {
> > + if (is_migrate_movable(get_pageblock_migratetype(head))) {
>
> is_migrate_movable() isn't a check for the ZONE. It's a check for the
> MIGRATE_TYPE. MIGRATE_TYPE doesn't require hard guarantee for
> migration, and, most of memory, including ZONE_NORMAL, is
> MIGRATE_MOVABLE. With this code, long term gup would always fails due
> to not enough memory. I think that correct change would be
> "is_migrate_cma_page(hear) && zone == ZONE_MOVABLE".

Good point. The check above should be OR, not AND:

zone_idx(page_zone(head)) == ZONE_MOVABLE || is_migrate_cma_page(head)
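
In the gup loop of check_and_migrate_movable_pages() the corrected check
would then read roughly like this (a sketch, not the final patch; the list
name follows John's earlier suggestion):

                /*
                 * Queue the page for migration if a long-term pin must not
                 * keep it where it is: anything in ZONE_MOVABLE, plus CMA
                 * pageblocks in any zone.
                 */
                if (zone_idx(page_zone(head)) == ZONE_MOVABLE ||
                    is_migrate_cma_page(head)) {
                        if (!isolate_lru_page(head))
                                list_add_tail(&head->lru, &movable_page_list);
                }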

Pasha

2020-12-04 17:54:25

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 0/6] prohibit pinning pages in ZONE_MOVABLE

> > Yes, this indeed could be a problem for some configurations. I will
> > add your comment to the commit log of one of the patches.
>
> It sounds like there is some inherent tension here, breaking THP's
> when doing pin_user_pages() is a really nasty thing to do. DMA
> benefits greatly from THP.
>
> I know nothing about ZONE_MOVABLE, is this auto-setup or an admin
> option? If the result of this patch is standard systems can no longer
> pin > 80% of their memory I have some regression concerns..

ZONE_MOVABLE can be configured via a kernel parameter, or when memory
nodes are onlined after hot-add; so this is something that admins
configure. ZONE_MOVABLE is designed to guarantee memory hot-plug
functionality, not availability of THP. However, I did not know
about the use case where some admins might configure ZONE_MOVABLE to
increase availability of THP, because pages in it are always migratable.
The thing is, if we fragment ZONE_MOVABLE by pinning pages in
it, the availability of THP also suffers. We can migrate pages in
ZONE_NORMAL, just without a guarantee, so we can create THP in ZONE_NORMAL
as well, which is the usual case.

2020-12-04 18:07:57

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 0/6] prohibit pinning pages in ZONE_MOVABLE

On 04.12.20 18:50, Pavel Tatashin wrote:
>>> Yes, this indeed could be a problem for some configurations. I will
>>> add your comment to the commit log of one of the patches.
>>
>> It sounds like there is some inherent tension here, breaking THP's
>> when doing pin_user_pages() is a really nasty thing to do. DMA
>> benefits greatly from THP.
>>
>> I know nothing about ZONE_MOVABLE, is this auto-setup or an admin
>> option? If the result of this patch is standard systems can no longer
>> pin > 80% of their memory I have some regression concerns..
>
> ZONE_MOVABLE can be configured via kernel parameter, or when memory
> nodes are onlined after hot-add; so this is something that admins
> configure. ZONE_MOVABLE is designed to gurantee memory hot-plug
> functionality, and not availability of THP, however, I did not know
> about the use case where some admins might configure ZONE_MOVABLE to
> increase availability of THP because pages are always migratable in
> them. The thing is, if we fragment ZONE_MOVABLE by pinning pages in
> it, the availability of THP also suffers. We can migrate pages in
> ZONE_NORMAL, just not guaranteed, so we can create THP in ZONE_NORMAL
> as well, which is the usual case.

Right, we should document this somewhere to make admins aware of
it. Something like

"Techniques that rely on long-term pinnings of memory (especially, RDMA
and vfio) are fundamentally problematic with ZONE_MOVABLE and,
therefore, memory hotunplug. Pinned pages cannot reside on ZONE_MOVABLE,
to guarantee that memory can still get hotunplugged - be aware that
pinning can fail even if there is plenty of free memory in ZONE_MOVABLE.
In addition, using ZONE_MOVABLE might make page pinning more expensive,
because pages have to be migrated off that zone first."

BTW, you might also want to update the comment for ZONE_MOVABLE in
include/linux/mmzone.h at the end of this series, removing the special
case of pinned pages (1.) and maybe adding what happens when trying to
pin pages on ZONE_MOVABLE.

--
Thanks,

David / dhildenb

2020-12-04 18:15:38

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 0/6] prohibit pinning pages in ZONE_MOVABLE

> > ZONE_MOVABLE can be configured via kernel parameter, or when memory
> > nodes are onlined after hot-add; so this is something that admins
> > configure. ZONE_MOVABLE is designed to gurantee memory hot-plug
> > functionality, and not availability of THP, however, I did not know
> > about the use case where some admins might configure ZONE_MOVABLE to
> > increase availability of THP because pages are always migratable in
> > them. The thing is, if we fragment ZONE_MOVABLE by pinning pages in
> > it, the availability of THP also suffers. We can migrate pages in
> > ZONE_NORMAL, just not guaranteed, so we can create THP in ZONE_NORMAL
> > as well, which is the usual case.
>
> Right, we should document this at some place to make admins aware of
> this. Something like
>
> "Techniques that rely on long-term pinnings of memory (especially, RDMA
> and vfio) are fundamentally problematic with ZONE_MOVABLE and,
> therefore, memory hotunplug. Pinned pages cannot reside on ZONE_MOVABLE,
> to guarantee that memory can still get hotunplugged - be aware that
> pinning can fail even if there is plenty of free memory in ZONE_MOVABLE.
> In addition, using ZONE_MOVABLE might make page pinning more expensive,
> because pages have to be migrated off that zone first."

Thanks, I will add this.

>
> BTW, you might also want to update the comment for ZONE_MOVABLE in
> include/linux/mmzone.h at the end of this series, removing the special
> case of pinned pages (1.) and maybe adding what happens when trying to
> pin pages on ZONE_MOVABLE.

Will do it.

Thank you,
Pasha

2020-12-04 20:11:51

by Daniel Jordan

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone

Jason Gunthorpe <[email protected]> writes:

> On Wed, Dec 02, 2020 at 08:34:32PM -0500, Pavel Tatashin wrote:
>> What I meant is the users of the interface do it incrementally not in
>> large chunks. For example:
>>
>> vfio_pin_pages_remote
>> vaddr_get_pfn
>> ret = pin_user_pages_remote(mm, vaddr, 1, flags |
>> FOLL_LONGTERM, page, NULL, NULL);
>> 1 -> pin only one pages at a time
>
> I don't know why vfio does this, it is why it so ridiculously slow at
> least.

Well Alex can correct me, but I went digging and a comment from the
first type1 vfio commit says the iommu API didn't promise to unmap
subpages of previous mappings, so doing a page at a time gave flexibility
at the cost of inefficiency.

Then 166fd7d94afd allowed the iommu to use larger pages in vfio, but
vfio kept pinning pages at a time. I couldn't find an explanation for
why that stayed the same.

Yesterday I tried optimizing vfio to skip gup calls for tail pages after
Matthew pointed out this same issue to me by coincidence last week.
Currently debugging, but if there's a fundamental reason this won't work
on the vfio side, it'd be nice to know.

2020-12-04 20:20:49

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone

On Fri, Dec 4, 2020 at 3:06 PM Daniel Jordan <[email protected]> wrote:
>
> Jason Gunthorpe <[email protected]> writes:
>
> > On Wed, Dec 02, 2020 at 08:34:32PM -0500, Pavel Tatashin wrote:
> >> What I meant is the users of the interface do it incrementally not in
> >> large chunks. For example:
> >>
> >> vfio_pin_pages_remote
> >> vaddr_get_pfn
> >> ret = pin_user_pages_remote(mm, vaddr, 1, flags |
> >> FOLL_LONGTERM, page, NULL, NULL);
> >> 1 -> pin only one pages at a time
> >
> > I don't know why vfio does this, it is why it so ridiculously slow at
> > least.
>
> Well Alex can correct me, but I went digging and a comment from the
> first type1 vfio commit says the iommu API didn't promise to unmap
> subpages of previous mappings, so doing page at a time gave flexibility
> at the cost of inefficiency.
>
> Then 166fd7d94afd allowed the iommu to use larger pages in vfio, but
> vfio kept pinning pages at a time. I couldn't find an explanation for
> why that stayed the same.
>
> Yesterday I tried optimizing vfio to skip gup calls for tail pages after
> Matthew pointed out this same issue to me by coincidence last week.
> Currently debugging, but if there's a fundamental reason this won't work
> on the vfio side, it'd be nice to know.

Hi Daniel,

I do not think there are any fundamental reasons why it won't work. I
have also been thinking about increasing VFIO chunking for a different
reason:

If a client touches pages before doing a VFIO DMA map, those pages
might be huge, and pinning a small page at a time and migrating a
small page at a time can break up the huge pages. So, it is not only
inefficient to pin this way, but it can also inadvertently slow down
the runtime.
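
A minimal sketch of what chunked pinning might look like, assuming the
pin_user_pages_remote() prototype from this kernel version; the helper
name and batch size are made up for illustration:

    #define PIN_BATCH 512	/* hypothetical chunk size */

    static long pin_vaddr_range(struct mm_struct *mm, unsigned long vaddr,
                                unsigned long npages, struct page **pages)
    {
            long pinned = 0;

            while (pinned < npages) {
                    long chunk = min_t(long, npages - pinned, PIN_BATCH);
                    /* pin a whole chunk per call instead of one page */
                    long ret = pin_user_pages_remote(mm,
                                    vaddr + pinned * PAGE_SIZE, chunk,
                                    FOLL_WRITE | FOLL_LONGTERM,
                                    pages + pinned, NULL, NULL);

                    if (ret <= 0)
                            return pinned ? pinned : ret;
                    pinned += ret;
            }
            return pinned;
    }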

Thank you,
Pasha

2020-12-04 20:55:21

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone

On Fri, Dec 04, 2020 at 03:05:46PM -0500, Daniel Jordan wrote:
> Jason Gunthorpe <[email protected]> writes:
>
> > On Wed, Dec 02, 2020 at 08:34:32PM -0500, Pavel Tatashin wrote:
> >> What I meant is the users of the interface do it incrementally not in
> >> large chunks. For example:
> >>
> >> vfio_pin_pages_remote
> >> vaddr_get_pfn
> >> ret = pin_user_pages_remote(mm, vaddr, 1, flags |
> >> FOLL_LONGTERM, page, NULL, NULL);
> >> 1 -> pin only one pages at a time
> >
> > I don't know why vfio does this, it is why it so ridiculously slow at
> > least.
>
> Well Alex can correct me, but I went digging and a comment from the
> first type1 vfio commit says the iommu API didn't promise to unmap
> subpages of previous mappings, so doing page at a time gave flexibility
> at the cost of inefficiency.

iommu restrictions are not related to gup. vfio needs to get the
page list from the page tables as efficiently as possible, then you
break it up into whatever you want to feed into the IOMMU, however the
iommu wants it.

vfio must maintain a page list to call unpin_user_pages() anyhow, so
it makes a lot of sense to assemble the page list up front, then do the
iommu work, instead of trying to do both things a page at a time.

It would be smart to rebuild vfio to use scatter lists to store the
page list and then break the sgl into pages for iommu
configuration. SGLs will consume a lot less memory for the usual case
of THPs backing the VFIO registrations.
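
A rough sketch of that direction, assuming the caller already holds an
array of pinned pages; only sg_alloc_table_from_pages() is a real API
here, the wrapper name is invented:

    /*
     * Collapse a pinned page array into a scatterlist so physically
     * contiguous ranges (e.g. THP subpages) become single entries
     * instead of one entry per 4k page.
     */
    static int pages_to_sgt(struct sg_table *sgt, struct page **pages,
                            unsigned int npages)
    {
            return sg_alloc_table_from_pages(sgt, pages, npages, 0,
                            (unsigned long)npages << PAGE_SHIFT,
                            GFP_KERNEL);
    }

Each sg entry can then be chopped into IOMMU-sized chunks when mapping,
and the same table walked again on teardown to unpin.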

ib_umem_get() has an example of how to code this; I've been thinking
we could make this a common API, and it could be further optimized.

> Yesterday I tried optimizing vfio to skip gup calls for tail pages after
> Matthew pointed out this same issue to me by coincidence last week.

Please don't just hack up vfio like this. Everyone needs faster gup;
we really need to solve this in the core code. Plus this is tricky:
vfio is already using follow_pfn wrongly, and drivers should not be
open coding MM stuff.

> Currently debugging, but if there's a fundamental reason this won't work
> on the vfio side, it'd be nice to know.

AFAIK there is no guarantee that, just because you see a compound head,
the remaining pages in the page tables are actually the tail
pages. This is only true sometimes, for instance if an entire huge
page is placed in a page table level.

I believe Ralph pointed to a case where we might break a huge page
from PMD to PTEs and then later COW one of the PTEs. In this case the
compound head will be visible but the page map will be non-contiguous
and the page flags on each 4k entry will be different.

Only GUP's page walkers know that the compound page is actually at a
PMD level and can safely apply the 'everything is the same'
optimization.

The solution here is to make core gup faster, especially for the cases
where it is returning huge pages. We can approach this by:
- Batching the compound & tail page acquisition for higher page
  levels, e.g. gup fast does this already, look at record_subpages()
  (rough sketch after this list); gup slow needs it too
- Batching unpin for compound & tail pages, the opposite of the 'refs'
  arg for try_grab_compound_head()
- Devising some API where get_user_pages can directly return
  contiguous groups of pages to avoid memory traffic
- Reducing the cost of a FOLL_LONGTERM pin, e.g. here is a start:
  https://lore.kernel.org/linux-mm/[email protected]
  And CMA should get some similar treatment. Scanning the output page
  list multiple times is slow.
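
For reference, the gup-fast helper mentioned in the first point does
roughly the following (paraphrased from mm/gup.c of this era, shown
here only to illustrate the batching idea):

    /*
     * Record every subpage covered by a higher-level entry in one
     * pass, instead of walking and refcounting each 4k page
     * separately.
     */
    static int record_subpages(struct page *page, unsigned long addr,
                               unsigned long end, struct page **pages)
    {
            int nr;

            for (nr = 0; addr != end; addr += PAGE_SIZE)
                    pages[nr++] = page++;

            return nr;
    }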

I would like to get to a point where the main GUP walker functions can
output in more formats than just a page array. For instance, directly
constructing and chaining a biovec or sgl would dramatically improve
performance and decrease memory consumption. Being able to write in
hmm_range_fault's pfn&flags output format would delete a whole bunch
of duplicated code.

Jason

2020-12-07 07:15:31

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 0/6] prohibit pinning pages in ZONE_MOVABLE

On Fri, Dec 04, 2020 at 12:50:56PM -0500, Pavel Tatashin wrote:
> > > Yes, this indeed could be a problem for some configurations. I will
> > > add your comment to the commit log of one of the patches.
> >
> > It sounds like there is some inherent tension here, breaking THP's
> > when doing pin_user_pages() is a really nasty thing to do. DMA
> > benefits greatly from THP.
> >
> > I know nothing about ZONE_MOVABLE, is this auto-setup or an admin
> > option? If the result of this patch is standard systems can no longer
> > pin > 80% of their memory I have some regression concerns..
>
> ZONE_MOVABLE can be configured via kernel parameter, or when memory
> nodes are onlined after hot-add; so this is something that admins
> configure. ZONE_MOVABLE is designed to gurantee memory hot-plug

Just a note: the origin of ZONE_MOVABLE is to provide availability of
huge pages, especially hugetlb pages, and AFAIK not to guarantee memory
hot-plug. See the following commit that introduced ZONE_MOVABLE.

2a1e274 Create the ZONE_MOVABLE zone

> functionality, and not availability of THP, however, I did not know
> about the use case where some admins might configure ZONE_MOVABLE to

The use case is briefly mentioned in the previous discussion.

http://lkml.kernel.org/r/alpine.DEB.2.23.453.2011221300100.2830030@chino.kir.corp.google.com

Anyway, I agree with your other arguments and this patchset.

Thanks.

2020-12-07 07:17:50

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone

On Fri, Dec 04, 2020 at 12:43:29PM -0500, Pavel Tatashin wrote:
> On Thu, Dec 3, 2020 at 11:14 PM Joonsoo Kim <[email protected]> wrote:
> >
> > On Wed, Dec 02, 2020 at 12:23:30AM -0500, Pavel Tatashin wrote:
> > > We do not allocate pin pages in ZONE_MOVABLE, but if pages were already
> > > allocated before pinning they need to migrated to a different zone.
> > > Currently, we migrate movable CMA pages only. Generalize the function
> > > that migrates CMA pages to migrate all movable pages.
> > >
> > > Signed-off-by: Pavel Tatashin <[email protected]>
> > > ---
> > > include/linux/migrate.h | 1 +
> > > include/trace/events/migrate.h | 3 +-
> > > mm/gup.c | 56 +++++++++++++---------------------
> > > 3 files changed, 24 insertions(+), 36 deletions(-)
> > >
> > > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > > index 0f8d1583fa8e..00bab23d1ee5 100644
> > > --- a/include/linux/migrate.h
> > > +++ b/include/linux/migrate.h
> > > @@ -27,6 +27,7 @@ enum migrate_reason {
> > > MR_MEMPOLICY_MBIND,
> > > MR_NUMA_MISPLACED,
> > > MR_CONTIG_RANGE,
> > > + MR_LONGTERM_PIN,
> > > MR_TYPES
> > > };
> > >
> > > diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
> > > index 4d434398d64d..363b54ce104c 100644
> > > --- a/include/trace/events/migrate.h
> > > +++ b/include/trace/events/migrate.h
> > > @@ -20,7 +20,8 @@
> > > EM( MR_SYSCALL, "syscall_or_cpuset") \
> > > EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind") \
> > > EM( MR_NUMA_MISPLACED, "numa_misplaced") \
> > > - EMe(MR_CONTIG_RANGE, "contig_range")
> > > + EM( MR_CONTIG_RANGE, "contig_range") \
> > > + EMe(MR_LONGTERM_PIN, "longterm_pin")
> > >
> > > /*
> > > * First define the enums in the above macros to be exported to userspace
> > > diff --git a/mm/gup.c b/mm/gup.c
> > > index 724d8a65e1df..1d511f65f8a7 100644
> > > --- a/mm/gup.c
> > > +++ b/mm/gup.c
> > > @@ -1593,19 +1593,18 @@ static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
> > > }
> > > #endif
> > >
> > > -#ifdef CONFIG_CMA
> > > -static long check_and_migrate_cma_pages(struct mm_struct *mm,
> > > - unsigned long start,
> > > - unsigned long nr_pages,
> > > - struct page **pages,
> > > - struct vm_area_struct **vmas,
> > > - unsigned int gup_flags)
> > > +static long check_and_migrate_movable_pages(struct mm_struct *mm,
> > > + unsigned long start,
> > > + unsigned long nr_pages,
> > > + struct page **pages,
> > > + struct vm_area_struct **vmas,
> > > + unsigned int gup_flags)
> > > {
> > > unsigned long i;
> > > unsigned long step;
> > > bool drain_allow = true;
> > > bool migrate_allow = true;
> > > - LIST_HEAD(cma_page_list);
> > > + LIST_HEAD(page_list);
> > > long ret = nr_pages;
> > > struct migration_target_control mtc = {
> > > .nid = NUMA_NO_NODE,
> > > @@ -1623,13 +1622,12 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
> > > */
> > > step = compound_nr(head) - (pages[i] - head);
> > > /*
> > > - * If we get a page from the CMA zone, since we are going to
> > > - * be pinning these entries, we might as well move them out
> > > - * of the CMA zone if possible.
> > > + * If we get a movable page, since we are going to be pinning
> > > + * these entries, try to move them out if possible.
> > > */
> > > - if (is_migrate_cma_page(head)) {
> > > + if (is_migrate_movable(get_pageblock_migratetype(head))) {
> >
> > is_migrate_movable() isn't a check for the ZONE. It's a check for the
> > MIGRATE_TYPE. MIGRATE_TYPE doesn't require hard guarantee for
> > migration, and, most of memory, including ZONE_NORMAL, is
> > MIGRATE_MOVABLE. With this code, long term gup would always fails due
> > to not enough memory. I think that correct change would be
> > "is_migrate_cma_page(hear) && zone == ZONE_MOVABLE".
>
> Good point. The above should be OR not AND.
>
> zone_idx(page_zone(head)) == ZONE_MOVABLE || is_migrate_cma_page(hear)

Yep!

Thanks.
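
For clarity, the corrected check discussed above would look roughly
like this in check_and_migrate_movable_pages() (a sketch of the agreed
condition, not the final patch):

    /*
     * Treat both CMA pageblocks and anything residing in
     * ZONE_MOVABLE as "must be migrated before a longterm pin".
     */
    if (is_migrate_cma_page(head) ||
        zone_idx(page_zone(head)) == ZONE_MOVABLE) {
            /* isolate head and queue it on page_list for migration */
    }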

2020-12-07 12:17:35

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/6] prohibit pinning pages in ZONE_MOVABLE

On Mon 07-12-20 16:12:50, Joonsoo Kim wrote:
> On Fri, Dec 04, 2020 at 12:50:56PM -0500, Pavel Tatashin wrote:
> > > > Yes, this indeed could be a problem for some configurations. I will
> > > > add your comment to the commit log of one of the patches.
> > >
> > > It sounds like there is some inherent tension here, breaking THP's
> > > when doing pin_user_pages() is a really nasty thing to do. DMA
> > > benefits greatly from THP.
> > >
> > > I know nothing about ZONE_MOVABLE, is this auto-setup or an admin
> > > option? If the result of this patch is standard systems can no longer
> > > pin > 80% of their memory I have some regression concerns..
> >
> > ZONE_MOVABLE can be configured via kernel parameter, or when memory
> > nodes are onlined after hot-add; so this is something that admins
> > configure. ZONE_MOVABLE is designed to gurantee memory hot-plug
>
> Just note, the origin of ZONE_MOVABLE is to provide availability of
> huge page, especially, hugetlb page. AFAIK, not guarantee memory
> hot-plug. See following commit that introduces the ZONE_MOVABLE.
>
> 2a1e274 Create the ZONE_MOVABLE zone
>
> > functionality, and not availability of THP, however, I did not know
> > about the use case where some admins might configure ZONE_MOVABLE to
>
> The usecase is lightly mentioned in previous discussion.
>
> http://lkml.kernel.org/r/alpine.DEB.2.23.453.2011221300100.2830030@chino.kir.corp.google.com
>
> Anyway, I agree with your other arguments and this patchset.

Yes, historically the original motivation for the movable zone was to
help create large pages via compaction. I also remember Mel not being
particularly happy about that.

The thing is that the movability constraint is just too strict for
these use cases, because the movable zone, especially a lot of it, can
cause problems similar to the lowmem/highmem problems well known from
the 32-bit world. So an admin always had to be very careful when
configuring it, to avoid zone pressure problems.

Later on, with higher demand for memory hotplug - especially the
hotremove use cases - it became clear that the only reliable way to
offline memory is to keep any unmovable memory out of the way, and
that is why a rather strong property of the movable zone was relied on.

In the end we have two rather different requirements here: one for
optimization and one for correctness. In this case I would much rather
focus on the correctness aspect.
--
Michal Hocko
SUSE Labs

2020-12-08 02:58:39

by Daniel Jordan

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone

Jason Gunthorpe <[email protected]> writes:
> On Fri, Dec 04, 2020 at 03:05:46PM -0500, Daniel Jordan wrote:
>> Well Alex can correct me, but I went digging and a comment from the
>> first type1 vfio commit says the iommu API didn't promise to unmap
>> subpages of previous mappings, so doing page at a time gave flexibility
>> at the cost of inefficiency.
>
> iommu restrictions are not related to with gup. vfio needs to get the
> page list from the page tables as efficiently as possible, then you
> break it up into what you want to feed into the IOMMU how the iommu
> wants.
>
> vfio must maintain a page list to call unpin_user_pages() anyhow, so

It does in some cases but not others, namely the expensive
VFIO_IOMMU_MAP_DMA/UNMAP_DMA path where the iommu page tables are used
to find the pfns when unpinning.

I don't see why vfio couldn't do as you say, though, and the worst-case
memory overhead of using a scatterlist to remember the pfns of a 300g
VM backed by huge but physically discontiguous pages is only a few meg,
not bad at all.
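
As a rough sanity check of that figure, assuming 2 MB huge pages and a
~32-byte struct scatterlist entry per physically contiguous chunk on
64-bit:

    300 GB / 2 MB       = 153,600 entries
    153,600 * 32 bytes ~= 4.7 MB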

> it makes alot of sense to assemble the page list up front, then do the
> iommu, instead of trying to do both things page at a time.
>
> It would be smart to rebuild vfio to use scatter lists to store the
> page list and then break the sgl into pages for iommu
> configuration. SGLs will consume alot less memory for the usual case
> of THPs backing the VFIO registrations.
>
> ib_umem_get() has some example of how to code this, I've been thinking
> we could make this some common API, and it could be further optimized.

Agreed, great suggestions, both above and in the rest of your response.

>> Yesterday I tried optimizing vfio to skip gup calls for tail pages after
>> Matthew pointed out this same issue to me by coincidence last week.
>
> Please don't just hack up vfio like this.

Yeah, you've cured me of that idea. I'll see where I get experimenting
with some of this stuff.

2020-12-08 03:35:19

by Daniel Jordan

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone

Pavel Tatashin <[email protected]> writes:

> On Fri, Dec 4, 2020 at 3:06 PM Daniel Jordan <[email protected]> wrote:
>>
>> Jason Gunthorpe <[email protected]> writes:
>>
>> > On Wed, Dec 02, 2020 at 08:34:32PM -0500, Pavel Tatashin wrote:
>> >> What I meant is the users of the interface do it incrementally not in
>> >> large chunks. For example:
>> >>
>> >> vfio_pin_pages_remote
>> >> vaddr_get_pfn
>> >> ret = pin_user_pages_remote(mm, vaddr, 1, flags |
>> >> FOLL_LONGTERM, page, NULL, NULL);
>> >> 1 -> pin only one pages at a time
>> >
>> > I don't know why vfio does this, it is why it so ridiculously slow at
>> > least.
>>
>> Well Alex can correct me, but I went digging and a comment from the
>> first type1 vfio commit says the iommu API didn't promise to unmap
>> subpages of previous mappings, so doing page at a time gave flexibility
>> at the cost of inefficiency.
>>
>> Then 166fd7d94afd allowed the iommu to use larger pages in vfio, but
>> vfio kept pinning pages at a time. I couldn't find an explanation for
>> why that stayed the same.
>>
>> Yesterday I tried optimizing vfio to skip gup calls for tail pages after
>> Matthew pointed out this same issue to me by coincidence last week.
>> Currently debugging, but if there's a fundamental reason this won't work
>> on the vfio side, it'd be nice to know.
>
> Hi Daniel,
>
> I do not think there are any fundamental reasons why it won't work. I
> have also thinking increasing VFIO chunking for a different reason:
>
> If a client touches pages before doing a VFIO DMA map, those pages
> might be huge, and pinning a small page at a time and migrating a
> small page at a time can break-up the huge pages. So, it is not only
> inefficient to pin, but it can also inadvertently slow down the
> runtime.

Hi Pasha,

I see. I'm curious: do you actually hit this case where a user has
touched the pages before doing a VFIO DMA map, and if so, where?

The usual situation on my side is that the pages are faulted in during
pinning.

Daniel

2020-12-08 13:29:15

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone

On Mon, Dec 07, 2020 at 09:48:48PM -0500, Daniel Jordan wrote:
> Jason Gunthorpe <[email protected]> writes:
> > On Fri, Dec 04, 2020 at 03:05:46PM -0500, Daniel Jordan wrote:
> >> Well Alex can correct me, but I went digging and a comment from the
> >> first type1 vfio commit says the iommu API didn't promise to unmap
> >> subpages of previous mappings, so doing page at a time gave flexibility
> >> at the cost of inefficiency.
> >
> > iommu restrictions are not related to with gup. vfio needs to get the
> > page list from the page tables as efficiently as possible, then you
> > break it up into what you want to feed into the IOMMU how the iommu
> > wants.
> >
> > vfio must maintain a page list to call unpin_user_pages() anyhow, so
>
> It does in some cases but not others, namely the expensive
> VFIO_IOMMU_MAP_DMA/UNMAP_DMA path where the iommu page tables are used
> to find the pfns when unpinning.

Oh, I see. Well, that is still possible, but vfio really needs to
batch operations, e.g. call pin_user_pages() with some larger buffer,
store those into the iommu, and then reverse this to build up
contiguous runs of pages to unpin.
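
Something like the following is what that reversal might look like;
purely illustrative, the helper name is invented and only
unpin_user_pages() is a real API here:

    /*
     * Walk the pfns recovered from the iommu page tables and unpin
     * them in batches rather than one page at a time.
     */
    static void unpin_pfn_list(const unsigned long *pfns,
                               unsigned long npfns)
    {
            struct page *pages[32];
            unsigned long i, batch = 0;

            for (i = 0; i < npfns; i++) {
                    pages[batch++] = pfn_to_page(pfns[i]);
                    if (batch == ARRAY_SIZE(pages)) {
                            unpin_user_pages(pages, batch);
                            batch = 0;
                    }
            }
            if (batch)
                    unpin_user_pages(pages, batch);
    }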

> I don't see why vfio couldn't do as you say, though, and the worst case
> memory overhead of using scatterlist to remember the pfns of a 300g VM
> backed by huge but physically discontiguous pages is only a few meg, not
> bad at all.

Yes, but 0 is still better. I would start by focusing on batching
pin_user_pages().

Jason