2021-02-05 01:48:03

by Joao Martins

[permalink] [raw]
Subject: [PATCH v2 0/4] mm/gup: page unpining improvements

Hey,

This series improves page unpinning, with an eye on improving MR
deregistration for big swaths of memory (which is bound by the page
unpining), particularly:

1) Decrement the head page by @ntails and thus reducing a lot the number of
atomic operations per compound page. This is done by comparing individual
tail pages heads, and counting number of consecutive tails on which they
match heads and based on that update head page refcount. Should have a
visible improvement in all page (un)pinners which use compound pages.

2) Introducing a new API for unpinning page ranges (to avoid the trick in the
previous item and be based on math), and use that in RDMA ib_mem_release
(used for mr deregistration).

Performance improvements: unpin_user_pages() for hugetlbfs and THP improves ~3x
(through gup_test) and RDMA MR dereg improves ~4.5x with the new API.
See patches 2 and 4 for those.

These patches used to be in this RFC:

https://lore.kernel.org/linux-mm/[email protected]/,
"[PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps"

But were moved separately at the suggestion of Jason, given it's applicable
to page unpinning in general. Thanks for all the comments in the RFC above.

These patches apply on top of linux-next tag next-20210202.

Suggestions, comments, welcomed as usual.

Joao

Changelog since,

v1 -> v2:
* Prefix macro arguments with __ to avoid collisions with other defines (John)
* Remove count_tails() and have the logic for the two iterators split into
range_next() and compound_next() (John)
* Remove the @range boolean from the iterator helpers (John)
* Add docs on unpin_user_page_range_dirty_lock() on patch 3 (John)
* Use unsigned for @i on patch 4 (John)
* Fix subject line of patch 4 (John)
* Add John's Reviewed-by on the second patch
* Fix incorrect use of @nmap and use @sg_nents instead (Jason)

RFC -> v1:
* Introduce a head/ntails iterator and change unpin_*_pages() to use that,
inspired by folio iterators (Jason)
* Introduce an alternative unpin_user_page_range_dirty_lock() to unpin based
on a consecutive page range without having to walk page arrays (Jason)
* Use unsigned for number of tails (Jason)

Joao Martins (4):
mm/gup: add compound page list iterator
mm/gup: decrement head page once for group of subpages
mm/gup: add a range variant of unpin_user_pages_dirty_lock()
RDMA/umem: batch page unpin in __ib_umem_release()

drivers/infiniband/core/umem.c | 12 ++--
include/linux/mm.h | 2 +
mm/gup.c | 122 ++++++++++++++++++++++++++++-----
3 files changed, 112 insertions(+), 24 deletions(-)

--
2.17.1


2021-02-05 01:48:09

by Joao Martins

[permalink] [raw]
Subject: [PATCH v2 1/4] mm/gup: add compound page list iterator

Add an helper that iterates over head pages in a list of pages. It
essentially counts the tails until the next page to process has a
different head that the current. This is going to be used by
unpin_user_pages() family of functions, to batch the head page refcount
updates once for all passed consecutive tail pages.

Suggested-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Joao Martins <[email protected]>
---
mm/gup.c | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index d68bcb482b11..d1549c61c2f6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -215,6 +215,35 @@ void unpin_user_page(struct page *page)
}
EXPORT_SYMBOL(unpin_user_page);

+static inline void compound_next(unsigned long i, unsigned long npages,
+ struct page **list, struct page **head,
+ unsigned int *ntails)
+{
+ struct page *page;
+ unsigned int nr;
+
+ if (i >= npages)
+ return;
+
+ list += i;
+ npages -= i;
+ page = compound_head(*list);
+
+ for (nr = 1; nr < npages; nr++) {
+ if (compound_head(list[nr]) != page)
+ break;
+ }
+
+ *head = page;
+ *ntails = nr;
+}
+
+#define for_each_compound_head(__i, __list, __npages, __head, __ntails) \
+ for (__i = 0, \
+ compound_next(__i, __npages, __list, &(__head), &(__ntails)); \
+ __i < __npages; __i += __ntails, \
+ compound_next(__i, __npages, __list, &(__head), &(__ntails)))
+
/**
* unpin_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
* @pages: array of pages to be maybe marked dirty, and definitely released.
--
2.17.1

2021-02-05 01:48:16

by Joao Martins

[permalink] [raw]
Subject: [PATCH v2 2/4] mm/gup: decrement head page once for group of subpages

Rather than decrementing the head page refcount one by one, we
walk the page array and checking which belong to the same
compound_head. Later on we decrement the calculated amount
of references in a single write to the head page. To that
end switch to for_each_compound_head() does most of the work.

set_page_dirty() needs no adjustment as it's a nop for
non-dirty head pages and it doesn't operate on tail pages.

This considerably improves unpinning of pages with THP and
hugetlbfs:

- THP
gup_test -t -m 16384 -r 10 [-L|-a] -S -n 512 -w
PIN_LONGTERM_BENCHMARK (put values): ~87.6k us -> ~23.2k us

- 16G with 1G huge page size
gup_test -f /mnt/huge/file -m 16384 -r 10 [-L|-a] -S -n 512 -w
PIN_LONGTERM_BENCHMARK: (put values): ~87.6k us -> ~27.5k us

Signed-off-by: Joao Martins <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
---
mm/gup.c | 29 +++++++++++------------------
1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index d1549c61c2f6..5a3dd235017a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -270,20 +270,15 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
bool make_dirty)
{
unsigned long index;
-
- /*
- * TODO: this can be optimized for huge pages: if a series of pages is
- * physically contiguous and part of the same compound page, then a
- * single operation to the head page should suffice.
- */
+ struct page *head;
+ unsigned int ntails;

if (!make_dirty) {
unpin_user_pages(pages, npages);
return;
}

- for (index = 0; index < npages; index++) {
- struct page *page = compound_head(pages[index]);
+ for_each_compound_head(index, pages, npages, head, ntails) {
/*
* Checking PageDirty at this point may race with
* clear_page_dirty_for_io(), but that's OK. Two key
@@ -304,9 +299,9 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
* written back, so it gets written back again in the
* next writeback cycle. This is harmless.
*/
- if (!PageDirty(page))
- set_page_dirty_lock(page);
- unpin_user_page(page);
+ if (!PageDirty(head))
+ set_page_dirty_lock(head);
+ put_compound_head(head, ntails, FOLL_PIN);
}
}
EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
@@ -323,6 +318,8 @@ EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
void unpin_user_pages(struct page **pages, unsigned long npages)
{
unsigned long index;
+ struct page *head;
+ unsigned int ntails;

/*
* If this WARN_ON() fires, then the system *might* be leaking pages (by
@@ -331,13 +328,9 @@ void unpin_user_pages(struct page **pages, unsigned long npages)
*/
if (WARN_ON(IS_ERR_VALUE(npages)))
return;
- /*
- * TODO: this can be optimized for huge pages: if a series of pages is
- * physically contiguous and part of the same compound page, then a
- * single operation to the head page should suffice.
- */
- for (index = 0; index < npages; index++)
- unpin_user_page(pages[index]);
+
+ for_each_compound_head(index, pages, npages, head, ntails)
+ put_compound_head(head, ntails, FOLL_PIN);
}
EXPORT_SYMBOL(unpin_user_pages);

--
2.17.1

2021-02-05 01:49:15

by Joao Martins

[permalink] [raw]
Subject: [PATCH v2 3/4] mm/gup: add a range variant of unpin_user_pages_dirty_lock()

Add a unpin_user_page_range_dirty_lock() API which takes a starting page
and how many consecutive pages we want to unpin and optionally dirty.

Given that we won't be iterating on a list of changes, change
compound_next() to receive a bool, whether to calculate from the starting
page, or walk the page array. Finally add a separate iterator,
for_each_compound_range() that just operate in page ranges as opposed
to page array.

For users (like RDMA mr_dereg) where each sg represents a
contiguous set of pages, we're able to more efficiently unpin
pages without having to supply an array of pages much of what
happens today with unpin_user_pages().

Suggested-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Joao Martins <[email protected]>
---
include/linux/mm.h | 2 ++
mm/gup.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 66 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a608feb0d42e..b76063f7f18a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1265,6 +1265,8 @@ static inline void put_page(struct page *page)
void unpin_user_page(struct page *page);
void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
bool make_dirty);
+void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
+ bool make_dirty);
void unpin_user_pages(struct page **pages, unsigned long npages);

/**
diff --git a/mm/gup.c b/mm/gup.c
index 5a3dd235017a..3426736a01b2 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -215,6 +215,34 @@ void unpin_user_page(struct page *page)
}
EXPORT_SYMBOL(unpin_user_page);

+static inline void range_next(unsigned long i, unsigned long npages,
+ struct page **list, struct page **head,
+ unsigned int *ntails)
+{
+ struct page *next, *page;
+ unsigned int nr = 1;
+
+ if (i >= npages)
+ return;
+
+ npages -= i;
+ next = *list + i;
+
+ page = compound_head(next);
+ if (PageCompound(page) && compound_order(page) > 1)
+ nr = min_t(unsigned int,
+ page + compound_nr(page) - next, npages);
+
+ *head = page;
+ *ntails = nr;
+}
+
+#define for_each_compound_range(__i, __list, __npages, __head, __ntails) \
+ for (__i = 0, \
+ range_next(__i, __npages, __list, &(__head), &(__ntails)); \
+ __i < __npages; __i += __ntails, \
+ range_next(__i, __npages, __list, &(__head), &(__ntails)))
+
static inline void compound_next(unsigned long i, unsigned long npages,
struct page **list, struct page **head,
unsigned int *ntails)
@@ -306,6 +334,42 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
}
EXPORT_SYMBOL(unpin_user_pages_dirty_lock);

+/**
+ * unpin_user_page_range_dirty_lock() - release and optionally dirty
+ * gup-pinned page range
+ *
+ * @page: the starting page of a range maybe marked dirty, and definitely released.
+ * @npages: number of consecutive pages to release.
+ * @make_dirty: whether to mark the pages dirty
+ *
+ * "gup-pinned page range" refers to a range of pages that has had one of the
+ * get_user_pages() variants called on that page.
+ *
+ * For the page ranges defined by [page .. page+npages], make that range (or
+ * its head pages, if a compound page) dirty, if @make_dirty is true, and if the
+ * page range was previously listed as clean.
+ *
+ * set_page_dirty_lock() is used internally. If instead, set_page_dirty() is
+ * required, then the caller should a) verify that this is really correct,
+ * because _lock() is usually required, and b) hand code it:
+ * set_page_dirty_lock(), unpin_user_page().
+ *
+ */
+void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
+ bool make_dirty)
+{
+ unsigned long index;
+ struct page *head;
+ unsigned int ntails;
+
+ for_each_compound_range(index, &page, npages, head, ntails) {
+ if (make_dirty && !PageDirty(head))
+ set_page_dirty_lock(head);
+ put_compound_head(head, ntails, FOLL_PIN);
+ }
+}
+EXPORT_SYMBOL(unpin_user_page_range_dirty_lock);
+
/**
* unpin_user_pages() - release an array of gup-pinned pages.
* @pages: array of pages to be marked dirty and released.
--
2.17.1

2021-02-05 01:49:17

by Joao Martins

[permalink] [raw]
Subject: [PATCH v2 4/4] RDMA/umem: batch page unpin in __ib_umem_release()

Use the newly added unpin_user_page_range_dirty_lock()
for more quickly unpinning a consecutive range of pages
represented as compound pages. This will also calculate
number of pages to unpin (for the tail pages which matching
head page) and thus batch the refcount update.

Running a test program which calls mr reg/unreg on a 1G in size
and measures cost of both operations together (in a guest using rxe)
with THP and hugetlbfs:

Before:
590 rounds in 5.003 sec: 8480.335 usec / round
6898 rounds in 60.001 sec: 8698.367 usec / round

After:
2688 rounds in 5.002 sec: 1860.786 usec / round
32517 rounds in 60.001 sec: 1845.225 usec / round

Signed-off-by: Joao Martins <[email protected]>
---
drivers/infiniband/core/umem.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 2dde99a9ba07..9b607013e2a2 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -47,17 +47,17 @@

static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
{
- struct sg_page_iter sg_iter;
- struct page *page;
+ bool make_dirty = umem->writable && dirty;
+ struct scatterlist *sg;
+ unsigned int i;

if (umem->nmap > 0)
ib_dma_unmap_sg(dev, umem->sg_head.sgl, umem->sg_nents,
DMA_BIDIRECTIONAL);

- for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) {
- page = sg_page_iter_page(&sg_iter);
- unpin_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
- }
+ for_each_sg(umem->sg_head.sgl, sg, umem->sg_nents, i)
+ unpin_user_page_range_dirty_lock(sg_page(sg),
+ DIV_ROUND_UP(sg->length, PAGE_SIZE), make_dirty);

sg_free_table(&umem->sg_head);
}
--
2.17.1

2021-02-05 04:16:54

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v2 1/4] mm/gup: add compound page list iterator

On 2/4/21 12:24 PM, Joao Martins wrote:
> Add an helper that iterates over head pages in a list of pages. It
> essentially counts the tails until the next page to process has a
> different head that the current. This is going to be used by
> unpin_user_pages() family of functions, to batch the head page refcount
> updates once for all passed consecutive tail pages.
>
> Suggested-by: Jason Gunthorpe <[email protected]>
> Signed-off-by: Joao Martins <[email protected]>
> ---
> mm/gup.c | 29 +++++++++++++++++++++++++++++
> 1 file changed, 29 insertions(+)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index d68bcb482b11..d1549c61c2f6 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -215,6 +215,35 @@ void unpin_user_page(struct page *page)
> }
> EXPORT_SYMBOL(unpin_user_page);
>
> +static inline void compound_next(unsigned long i, unsigned long npages,
> + struct page **list, struct page **head,
> + unsigned int *ntails)
> +{
> + struct page *page;
> + unsigned int nr;
> +
> + if (i >= npages)
> + return;
> +
> + list += i;
> + npages -= i;

It is worth noting that this is slightly more complex to read than it needs to be.
You are changing both endpoints of a loop at once. That's hard to read for a human.
And you're only doing it in order to gain the small benefit of being able to
use nr directly at the end of the routine.

If instead you keep npages constant like it naturally wants to be, you could
just do a "(*ntails)++" in the loop, to take care of *ntails.

However, given that the patch is correct and works as-is, the above is really just
an optional idea, so please feel free to add:

Reviewed-by: John Hubbard <[email protected]>


thanks,
--
John Hubbard
NVIDIA

> + page = compound_head(*list);
> +
> + for (nr = 1; nr < npages; nr++) {
> + if (compound_head(list[nr]) != page)
> + break;
> + }
> +
> + *head = page;
> + *ntails = nr;
> +}
> +
> +#define for_each_compound_head(__i, __list, __npages, __head, __ntails) \
> + for (__i = 0, \
> + compound_next(__i, __npages, __list, &(__head), &(__ntails)); \
> + __i < __npages; __i += __ntails, \
> + compound_next(__i, __npages, __list, &(__head), &(__ntails)))
> +
> /**
> * unpin_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
> * @pages: array of pages to be maybe marked dirty, and definitely released.
>

2021-02-05 04:52:10

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v2 3/4] mm/gup: add a range variant of unpin_user_pages_dirty_lock()

On 2/4/21 12:24 PM, Joao Martins wrote:
> Add a unpin_user_page_range_dirty_lock() API which takes a starting page
> and how many consecutive pages we want to unpin and optionally dirty.
>
> Given that we won't be iterating on a list of changes, change
> compound_next() to receive a bool, whether to calculate from the starting

Thankfully, that claim is stale and can now be removed from this commit
description.

> page, or walk the page array. Finally add a separate iterator,
> for_each_compound_range() that just operate in page ranges as opposed
> to page array.
>
> For users (like RDMA mr_dereg) where each sg represents a
> contiguous set of pages, we're able to more efficiently unpin
> pages without having to supply an array of pages much of what
> happens today with unpin_user_pages().
>
> Suggested-by: Jason Gunthorpe <[email protected]>
> Signed-off-by: Joao Martins <[email protected]>
> ---
> include/linux/mm.h | 2 ++
> mm/gup.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 66 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a608feb0d42e..b76063f7f18a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1265,6 +1265,8 @@ static inline void put_page(struct page *page)
> void unpin_user_page(struct page *page);
> void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
> bool make_dirty);
> +void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
> + bool make_dirty);
> void unpin_user_pages(struct page **pages, unsigned long npages);
>
> /**
> diff --git a/mm/gup.c b/mm/gup.c
> index 5a3dd235017a..3426736a01b2 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -215,6 +215,34 @@ void unpin_user_page(struct page *page)
> }
> EXPORT_SYMBOL(unpin_user_page);
>
> +static inline void range_next(unsigned long i, unsigned long npages,
> + struct page **list, struct page **head,
> + unsigned int *ntails)

Would compound_range_next() be a better name?

> +{
> + struct page *next, *page;
> + unsigned int nr = 1;
> +
> + if (i >= npages)
> + return;
> +
> + npages -= i;
> + next = *list + i;
> +
> + page = compound_head(next);
> + if (PageCompound(page) && compound_order(page) > 1)
> + nr = min_t(unsigned int,
> + page + compound_nr(page) - next, npages);

This pointer arithmetic will involve division. Which may be unnecessarily
expensive, if there is a way to calculate this with indices instead of
pointer arithmetic. I'm not sure if there is, off hand, but thought it
worth mentioning because the point is sometimes overlooked.

> +
> + *head = page;
> + *ntails = nr;
> +}
> +
> +#define for_each_compound_range(__i, __list, __npages, __head, __ntails) \
> + for (__i = 0, \
> + range_next(__i, __npages, __list, &(__head), &(__ntails)); \
> + __i < __npages; __i += __ntails, \
> + range_next(__i, __npages, __list, &(__head), &(__ntails)))
> +
> static inline void compound_next(unsigned long i, unsigned long npages,
> struct page **list, struct page **head,
> unsigned int *ntails)
> @@ -306,6 +334,42 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
> }
> EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
>
> +/**
> + * unpin_user_page_range_dirty_lock() - release and optionally dirty
> + * gup-pinned page range
> + *
> + * @page: the starting page of a range maybe marked dirty, and definitely released.
> + * @npages: number of consecutive pages to release.
> + * @make_dirty: whether to mark the pages dirty
> + *
> + * "gup-pinned page range" refers to a range of pages that has had one of the
> + * get_user_pages() variants called on that page.
> + *
> + * For the page ranges defined by [page .. page+npages], make that range (or
> + * its head pages, if a compound page) dirty, if @make_dirty is true, and if the
> + * page range was previously listed as clean.
> + *
> + * set_page_dirty_lock() is used internally. If instead, set_page_dirty() is
> + * required, then the caller should a) verify that this is really correct,
> + * because _lock() is usually required, and b) hand code it:
> + * set_page_dirty_lock(), unpin_user_page().
> + *
> + */
> +void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
> + bool make_dirty)
> +{
> + unsigned long index;
> + struct page *head;
> + unsigned int ntails;
> +
> + for_each_compound_range(index, &page, npages, head, ntails) {
> + if (make_dirty && !PageDirty(head))
> + set_page_dirty_lock(head);
> + put_compound_head(head, ntails, FOLL_PIN);
> + }
> +}
> +EXPORT_SYMBOL(unpin_user_page_range_dirty_lock);
> +
> /**
> * unpin_user_pages() - release an array of gup-pinned pages.
> * @pages: array of pages to be marked dirty and released.
>

Didn't spot any actual problems with how this works.

thanks,
--
John Hubbard
NVIDIA

2021-02-05 10:52:53

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH v2 1/4] mm/gup: add compound page list iterator



On 2/5/21 4:11 AM, John Hubbard wrote:
> On 2/4/21 12:24 PM, Joao Martins wrote:
>> Add an helper that iterates over head pages in a list of pages. It
>> essentially counts the tails until the next page to process has a
>> different head that the current. This is going to be used by
>> unpin_user_pages() family of functions, to batch the head page refcount
>> updates once for all passed consecutive tail pages.
>>
>> Suggested-by: Jason Gunthorpe <[email protected]>
>> Signed-off-by: Joao Martins <[email protected]>
>> ---
>> mm/gup.c | 29 +++++++++++++++++++++++++++++
>> 1 file changed, 29 insertions(+)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index d68bcb482b11..d1549c61c2f6 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -215,6 +215,35 @@ void unpin_user_page(struct page *page)
>> }
>> EXPORT_SYMBOL(unpin_user_page);
>>
>> +static inline void compound_next(unsigned long i, unsigned long npages,
>> + struct page **list, struct page **head,
>> + unsigned int *ntails)
>> +{
>> + struct page *page;
>> + unsigned int nr;
>> +
>> + if (i >= npages)
>> + return;
>> +
>> + list += i;
>> + npages -= i;
>
> It is worth noting that this is slightly more complex to read than it needs to be.
> You are changing both endpoints of a loop at once. That's hard to read for a human.
> And you're only doing it in order to gain the small benefit of being able to
> use nr directly at the end of the routine.
>
> If instead you keep npages constant like it naturally wants to be, you could
> just do a "(*ntails)++" in the loop, to take care of *ntails.
>
I didn't do it as such as I would need to deref @ntails per iteration, so
it felt more efficient to do as above. On a second thought, I could alternatively do the
following below, thoughts?

diff --git a/mm/gup.c b/mm/gup.c
index d68bcb482b11..8defe4f670d5 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -215,6 +215,32 @@ void unpin_user_page(struct page *page)
}
EXPORT_SYMBOL(unpin_user_page);

+static inline void compound_next(unsigned long i, unsigned long npages,
+ struct page **list, struct page **head,
+ unsigned int *ntails)
+{
+ struct page *page;
+ unsigned int nr;
+
+ if (i >= npages)
+ return;
+
+ page = compound_head(list[i]);
+ for (nr = i + 1; nr < npages; nr++) {
+ if (compound_head(list[nr]) != page)
+ break;
+ }
+
+ *head = page;
+ *ntails = nr - i;
+}
+

> However, given that the patch is correct and works as-is, the above is really just
> an optional idea, so please feel free to add:
>
> Reviewed-by: John Hubbard <[email protected]>
>
>
Thanks!

Hopefully I can retain that if the snippet above is preferred?

Joao

2021-02-05 12:04:36

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH v2 3/4] mm/gup: add a range variant of unpin_user_pages_dirty_lock()

On 2/5/21 4:49 AM, John Hubbard wrote:
> On 2/4/21 12:24 PM, Joao Martins wrote:
>> Add a unpin_user_page_range_dirty_lock() API which takes a starting page
>> and how many consecutive pages we want to unpin and optionally dirty.
>>
>> Given that we won't be iterating on a list of changes, change
>> compound_next() to receive a bool, whether to calculate from the starting
>
> Thankfully, that claim is stale and can now be removed from this commit
> description.
>
Yes, I'll delete it.

>> page, or walk the page array. Finally add a separate iterator,
>> for_each_compound_range() that just operate in page ranges as opposed
>> to page array.
>>
>> For users (like RDMA mr_dereg) where each sg represents a
>> contiguous set of pages, we're able to more efficiently unpin
>> pages without having to supply an array of pages much of what
>> happens today with unpin_user_pages().
>>
>> Suggested-by: Jason Gunthorpe <[email protected]>
>> Signed-off-by: Joao Martins <[email protected]>
>> ---
>> include/linux/mm.h | 2 ++
>> mm/gup.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 66 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index a608feb0d42e..b76063f7f18a 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1265,6 +1265,8 @@ static inline void put_page(struct page *page)
>> void unpin_user_page(struct page *page);
>> void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>> bool make_dirty);
>> +void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
>> + bool make_dirty);
>> void unpin_user_pages(struct page **pages, unsigned long npages);
>>
>> /**
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 5a3dd235017a..3426736a01b2 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -215,6 +215,34 @@ void unpin_user_page(struct page *page)
>> }
>> EXPORT_SYMBOL(unpin_user_page);
>>
>> +static inline void range_next(unsigned long i, unsigned long npages,
>> + struct page **list, struct page **head,
>> + unsigned int *ntails)
>
> Would compound_range_next() be a better name?
>
Yeah, will change to that instead. range_next() might actually get confused for operations
done on struct range *.

One other thing about my naming is that unpin_user_page_range_dirty_lock() is *huge*. But
it seems to adhere to the rest of unpin_* family of functions naming. Couldn't find a
better alternative :/

>> +{
>> + struct page *next, *page;
>> + unsigned int nr = 1;
>> +
>> + if (i >= npages)
>> + return;
>> +
>> + npages -= i;

I will remove this @npages subtraction into the min_t() calculation as it's the only
placed that's used.

>> + next = *list + i;
>> +
>> + page = compound_head(next);
>> + if (PageCompound(page) && compound_order(page) > 1)

I am not handling compound_order == 1 so will change to >= in the condition above.
@compound_nr is placed on the second page.

>> + nr = min_t(unsigned int,
>> + page + compound_nr(page) - next, npages);
>
> This pointer arithmetic will involve division. Which may be unnecessarily
> expensive, if there is a way to calculate this with indices instead of
> pointer arithmetic. I'm not sure if there is, off hand, but thought it
> worth mentioning because the point is sometimes overlooked.
>
Sadly, can't think of :( hence had to adhere to what seems to be the pattern today.

Any conversion to PFNs (page_to_pfn) will do same said arithmetic, and
I don't think we can reliably use page_index (and even that is only available on the
head page).

>> +
>> + *head = page;
>> + *ntails = nr;
>> +}
>> +
>> +#define for_each_compound_range(__i, __list, __npages, __head, __ntails) \
>> + for (__i = 0, \
>> + range_next(__i, __npages, __list, &(__head), &(__ntails)); \
>> + __i < __npages; __i += __ntails, \
>> + range_next(__i, __npages, __list, &(__head), &(__ntails)))
>> +
>> static inline void compound_next(unsigned long i, unsigned long npages,
>> struct page **list, struct page **head,
>> unsigned int *ntails)
>> @@ -306,6 +334,42 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>> }
>> EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
>>
>> +/**
>> + * unpin_user_page_range_dirty_lock() - release and optionally dirty
>> + * gup-pinned page range
>> + *
>> + * @page: the starting page of a range maybe marked dirty, and definitely released.
>> + * @npages: number of consecutive pages to release.
>> + * @make_dirty: whether to mark the pages dirty
>> + *
>> + * "gup-pinned page range" refers to a range of pages that has had one of the
>> + * get_user_pages() variants called on that page.
>> + *
>> + * For the page ranges defined by [page .. page+npages], make that range (or
>> + * its head pages, if a compound page) dirty, if @make_dirty is true, and if the
>> + * page range was previously listed as clean.
>> + *
>> + * set_page_dirty_lock() is used internally. If instead, set_page_dirty() is
>> + * required, then the caller should a) verify that this is really correct,
>> + * because _lock() is usually required, and b) hand code it:
>> + * set_page_dirty_lock(), unpin_user_page().
>> + *
>> + */
>> +void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
>> + bool make_dirty)
>> +{
>> + unsigned long index;
>> + struct page *head;
>> + unsigned int ntails;
>> +
>> + for_each_compound_range(index, &page, npages, head, ntails) {
>> + if (make_dirty && !PageDirty(head))
>> + set_page_dirty_lock(head);
>> + put_compound_head(head, ntails, FOLL_PIN);
>> + }
>> +}
>> +EXPORT_SYMBOL(unpin_user_page_range_dirty_lock);
>> +
>> /**
>> * unpin_user_pages() - release an array of gup-pinned pages.
>> * @pages: array of pages to be marked dirty and released.
>>
>
> Didn't spot any actual problems with how this works.

/me nods

2021-02-05 20:00:37

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v2 1/4] mm/gup: add compound page list iterator

On 2/5/21 2:46 AM, Joao Martins wrote:
...>> If instead you keep npages constant like it naturally wants to be, you could
>> just do a "(*ntails)++" in the loop, to take care of *ntails.
>>
> I didn't do it as such as I would need to deref @ntails per iteration, so
> it felt more efficient to do as above. On a second thought, I could alternatively do the
> following below, thoughts?
>
> diff --git a/mm/gup.c b/mm/gup.c
> index d68bcb482b11..8defe4f670d5 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -215,6 +215,32 @@ void unpin_user_page(struct page *page)
> }
> EXPORT_SYMBOL(unpin_user_page);
>
> +static inline void compound_next(unsigned long i, unsigned long npages,
> + struct page **list, struct page **head,
> + unsigned int *ntails)
> +{
> + struct page *page;
> + unsigned int nr;
> +
> + if (i >= npages)
> + return;
> +
> + page = compound_head(list[i]);
> + for (nr = i + 1; nr < npages; nr++) {
> + if (compound_head(list[nr]) != page)
> + break;
> + }
> +
> + *head = page;
> + *ntails = nr - i;
> +}
> +

Yes, this is cleaner and quite a bit easier to verify that it is correct.

>
>> However, given that the patch is correct and works as-is, the above is really just
>> an optional idea, so please feel free to add:
>>
>> Reviewed-by: John Hubbard <[email protected]>
>>
>>
> Thanks!
>
> Hopefully I can retain that if the snippet above is preferred?
>
> Joao
>

Yes. Still looks good.

thanks,
--
John Hubbard
NVIDIA