2020-02-01 03:44:14

by John Hubbard

Subject: [PATCH v3 00/12] mm/gup: track FOLL_PIN pages

Matthew,

I've merged in your dump_page() ideas, and also factored things out
into a new __dump_tail_page() routine, in order to save a few
indentation levels, mainly.

Kirill, thanks for your review comments. I've applied them, and I think
splitting this up as you recommended really makes it a lot better, and
easier to spot problems.

============================================================
Changes since v2:

* Rebased onto linux.git, because the akpm tree for 5.6 has been merged.

* Split the tracking patch into even more patches, as requested.

* Merged Matthew Wilcox's dump_page() changes into mine, as part of the
first patch.

* Renamed: page_dma_pinned() --> page_maybe_dma_pinned(), in response to
Kirill Shutemov's review.

* Moved a WARN to the top of a routine, and fixed a typo in the commit
description of patch #7, also as suggested by Kirill.

============================================================
Changes since v1:

* Split the tracking patch into 6 smaller patches

* Rebased onto today's linux-next/akpm (there weren't any conflicts).

* Fixed an "unsigned int" vs. "int" problem in gup_benchmark, reported
by Nathan Chancellor. (I don't see it in my local builds, probably
because they use gcc, but an LLVM test found the mismatch.)

* Fixed a huge page pincount problem (add/subtract vs.
increment/decrement), spotted by Jan Kara.
============================================================

There is a reasonable case to be made for merging two of the patches
(patches 4 and 5), given that patch 4 provides tracking that has upper
limits on the number of pins that can be done with huge pages. Let me
know if anyone wants those merged, but unless there is some weird chance
of someone grabbing patch 4 and not patch 5, I don't really see the
need. Meanwhile, it's easier to review in this form.

Also, patch 3 has been revived. Earlier reviewers asked for it to be
merged into the tracking patch (one cannot please everyone, heh), but
now it's back out on its own.

This activates tracking of FOLL_PIN pages. This is in support of fixing
the get_user_pages()+DMA problem described in [1]-[4].

It is based on today's (Jan 28) linux-next (branch: akpm),
commit 280e9cb00b41 ("drivers/media/platform/sti/delta/delta-ipc.c: fix
read buffer overflow")

There is a git repo and branch, for convenience in reviewing:

[email protected]:johnhubbard/linux.git
track_user_pages_v2_linux-next_akpm_28Jan2020

FOLL_PIN support is (so far) in mmotm and linux-next. However, the
patch to use FOLL_PIN to track pages was *not* submitted, because Leon
saw an RDMA test suite failure that involved (I think) page refcount
overflows when huge pages were used.

This patch definitively solves that kind of overflow problem, by adding
an exact pincount, for compound pages (of order > 1), in the 3rd struct
page of a compound page. If available, that form of pincounting is used,
instead of the GUP_PIN_COUNTING_BIAS approach. Thanks again to Jan Kara
for that idea.

Here's the last reviewed version of the tracking patch (v11):

https://lore.kernel.org/r/[email protected]

Jan Kara had provided a reviewed-by tag for that, but I've had to remove
it (again) here, due to having changed the patch "a little bit", in
order to add the feature described above.

Other interesting changes:

* dump_page(): added one or two new things to report for compound
pages: head refcount (for all compound pages), and map_pincount (for
compound pages of order > 1).

* Documentation/core-api/pin_user_pages.rst: removed the "TODO" for the
huge page refcount upper limit problems, and added notes about how it
works now. Also added a note about the dump_page() enhancements.

* Added some comments in gup.c and mm.h, to explain that there are two
ways to count pinned pages: exact (for compound pages of order > 1)
and fuzzy (GUP_PIN_COUNTING_BIAS: for all other pages).
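
To make the difference between the two counting schemes concrete, here is a
minimal userspace sketch. It is not kernel code: struct model_page and all
of the model_*() helpers are hypothetical stand-ins, and the threshold test
used for the fuzzy case (refcount >= GUP_PIN_COUNTING_BIAS) is an assumption
about how the bias is checked.

```c
#include <stdbool.h>

#define GUP_PIN_COUNTING_BIAS 1024

/* Hypothetical userspace stand-in for the struct page fields involved. */
struct model_page {
	int refcount;        /* models the head page->_refcount */
	int pincount;        /* models hpage_pinned_refcount (order > 1 only) */
	unsigned int order;  /* compound order; 0 for a normal page */
};

/* Mirrors the "order > 1" rule: the counter lives in the 3rd struct page. */
static bool model_pincount_available(const struct model_page *p)
{
	return p->order > 1;
}

/* An ordinary reference, as taken by get_page() or FOLL_GET. */
static void model_get(struct model_page *p)
{
	p->refcount += 1;
}

/* A FOLL_PIN reference on one (sub)page. */
static void model_pin(struct model_page *p)
{
	if (model_pincount_available(p)) {
		p->pincount += 1;  /* exact count */
		p->refcount += 1;  /* still hold a real reference */
	} else {
		p->refcount += GUP_PIN_COUNTING_BIAS;  /* fuzzy count */
	}
}

/* Undo one model_pin(). */
static void model_unpin(struct model_page *p)
{
	if (model_pincount_available(p)) {
		p->pincount -= 1;
		p->refcount -= 1;
	} else {
		p->refcount -= GUP_PIN_COUNTING_BIAS;
	}
}

/* Assumed threshold test for the fuzzy case: refcount >= bias. */
static bool model_maybe_dma_pinned(const struct model_page *p)
{
	if (model_pincount_available(p))
		return p->pincount > 0;
	return p->refcount >= GUP_PIN_COUNTING_BIAS;
}
```

In this model, an order-0 page that has merely received 1024 model_get()
calls is reported as "maybe pinned" (a false positive), while an order-9
page with the same history is reported as pinned only after a model_pin().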

============================================================
General notes about the tracking patch:

This is a prerequisite to solving the problem of proper interactions
between file-backed pages, and [R]DMA activities, as discussed in [1],
[2], [3], [4] and in a remarkable number of email threads since about
2017. :)

In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:

    get_user_pages() (sets FOLL_GET)
    put_page()

to this:

    pin_user_pages() (sets FOLL_PIN)
    unpin_user_page()

============================================================
Next steps:

* Convert more subsystems from get_user_pages() to pin_user_pages().
* Work with Ira and others to connect this all up with file system
leases.

[1] Some slow progress on get_user_pages() (Apr 2, 2019):
https://lwn.net/Articles/784574/

[2] DMA and get_user_pages() (LPC: Dec 12, 2018):
https://lwn.net/Articles/774411/

[3] The trouble with get_user_pages() (Apr 30, 2018):
https://lwn.net/Articles/753027/

[4] LWN kernel index: get_user_pages()
https://lwn.net/Kernel/Index/#Memory_management-get_user_pages

John Hubbard (12):
mm: dump_page(): better diagnostics for compound pages
mm/gup: split get_user_pages_remote() into two routines
mm/gup: pass a flags arg to __gup_device_* functions
mm: introduce page_ref_sub_return()
mm/gup: pass gup flags to two more routines
mm/gup: require FOLL_GET for get_user_pages_fast()
mm/gup: track FOLL_PIN pages
mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages
mm: dump_page(): better diagnostics for huge pinned pages
mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting
mm/gup_benchmark: support pin_user_pages() and related calls
selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN
coverage

Documentation/core-api/pin_user_pages.rst | 53 +--
include/linux/mm.h | 108 ++++-
include/linux/mm_types.h | 7 +-
include/linux/mmzone.h | 2 +
include/linux/page_ref.h | 10 +
mm/debug.c | 60 ++-
mm/gup.c | 459 ++++++++++++++++-----
mm/gup_benchmark.c | 71 +++-
mm/huge_memory.c | 29 +-
mm/hugetlb.c | 44 +-
mm/page_alloc.c | 2 +
mm/rmap.c | 6 +
mm/vmstat.c | 2 +
tools/testing/selftests/vm/gup_benchmark.c | 15 +-
tools/testing/selftests/vm/run_vmtests | 22 +
15 files changed, 715 insertions(+), 175 deletions(-)

--
2.25.0

2020-02-01 03:44:14

by John Hubbard

Subject: [PATCH v3 08/12] mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages

For huge pages (and in fact, any compound page), the
GUP_PIN_COUNTING_BIAS scheme tends to overflow too easily, because each
tail page increments the head page->_refcount by GUP_PIN_COUNTING_BIAS
(1024). That limits the number of huge pages that can be pinned.

This patch removes that limitation, by using an exact form of pin
counting for compound pages of order > 1. The "order > 1" is required
because this approach uses the 3rd struct page in the compound page, and
order 1 compound pages only have two pages, so that won't work there.

A new struct page field, hpage_pinned_refcount, has been added,
replacing a padding field in the union (so no new space is used).

This enhancement also has a useful side effect: huge pages and compound
pages (of order > 1) do not suffer from the "potential false positives"
problem that is discussed in the page_maybe_dma_pinned() comment block.
That is because these compound pages have extra space for tracking
things, so they get exact pin counts instead of overloading
page->_refcount.

Documentation/core-api/pin_user_pages.rst is updated accordingly.
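
As a rough illustration of the overflow headroom, the sketch below estimates
how many times an entire compound page can be pinned before a signed 32-bit
refcount runs out, under each scheme. This is userspace arithmetic only: the
function names are hypothetical, REFCOUNT_LIMIT is an assumed stand-in for
the range of page->_refcount, and the baseline references that a real page
also carries are ignored.

```c
#include <stdint.h>

#define GUP_PIN_COUNTING_BIAS 1024

/* Assumption: page->_refcount behaves as a signed 32-bit counter. */
#define REFCOUNT_LIMIT INT32_MAX

/*
 * Bias scheme: every subpage of the compound page adds
 * GUP_PIN_COUNTING_BIAS to the head page refcount.
 */
static int64_t max_full_pins_biased(unsigned int order)
{
	int64_t subpages = (int64_t)1 << order;
	int64_t per_pin = subpages * GUP_PIN_COUNTING_BIAS;

	return REFCOUNT_LIMIT / per_pin;
}

/*
 * Exact scheme: every subpage adds 1 to the head page refcount, and the
 * pin itself is tracked separately in hpage_pinned_refcount.
 */
static int64_t max_full_pins_exact(unsigned int order)
{
	int64_t subpages = (int64_t)1 << order;

	return REFCOUNT_LIMIT / subpages;
}
```

For a 1GB huge page made of 4 KB pages (order 18), the bias scheme leaves
room for 7 full pins in this model, versus 8191 with exact counting. For a
2MB huge page (order 9), the figures are 4095 and 4194303.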

Suggested-by: Jan Kara <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
---
Documentation/core-api/pin_user_pages.rst | 40 +++++-------
include/linux/mm.h | 26 ++++++++
include/linux/mm_types.h | 7 +-
mm/gup.c | 78 ++++++++++++++++++++---
mm/hugetlb.c | 6 ++
mm/page_alloc.c | 2 +
mm/rmap.c | 6 ++
7 files changed, 133 insertions(+), 32 deletions(-)

diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
index 9829345428f8..3f72b1ea1104 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -52,8 +52,22 @@ Which flags are set by each wrapper

For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
flags the caller provides. The caller is required to pass in a non-null struct
-pages* array, and the function then pin pages by incrementing each by a special
-value. For now, that value is +1, just like get_user_pages*().::
+pages* array, and the function then pins pages by incrementing each by a special
+value: GUP_PIN_COUNTING_BIAS.
+
+For huge pages (and in fact, any compound page of more than 2 pages), the
+GUP_PIN_COUNTING_BIAS scheme is not used. Instead, an exact form of pin counting
+is achieved, by using the 3rd struct page in the compound page. A new struct
+page field, hpage_pinned_refcount, has been added in order to support this.
+
+This approach for compound pages avoids the counting upper limit problems that
+are discussed below. Those limitations would have been aggravated severely by
+huge pages, because each tail page adds a refcount to the head page. And in
+fact, testing revealed that, without a separate hpage_pinned_refcount field,
+page overflows were seen in some huge page stress tests.
+
+This also means that huge pages and compound pages (of order > 1) do not suffer
+from the false positives problem that is mentioned below.::

Function
--------
@@ -99,27 +113,6 @@ pages:
This also leads to limitations: there are only 31-10==21 bits available for a
counter that increments 10 bits at a time.

-TODO: for 1GB and larger huge pages, this is cutting it close. That's because
-when pin_user_pages() follows such pages, it increments the head page by "1"
-(where "1" used to mean "+1" for get_user_pages(), but now means "+1024" for
-pin_user_pages()) for each tail page. So if you have a 1GB huge page:
-
-* There are 256K (18 bits) worth of 4 KB tail pages.
-* There are 21 bits available to count up via GUP_PIN_COUNTING_BIAS (that is,
- 10 bits at a time)
-* There are 21 - 18 == 3 bits available to count. Except that there aren't,
- because you need to allow for a few normal get_page() calls on the head page,
- as well. Fortunately, the approach of using addition, rather than "hard"
- bitfields, within page->_refcount, allows for sharing these bits gracefully.
- But we're still looking at about 8 references.
-
-This, however, is a missing feature more than anything else, because it's easily
-solved by addressing an obvious inefficiency in the original get_user_pages()
-approach of retrieving pages: stop treating all the pages as if they were
-PAGE_SIZE. Retrieve huge pages as huge pages. The callers need to be aware of
-this, so some work is required. Once that's in place, this limitation mostly
-disappears from view, because there will be ample refcounting range available.
-
* Callers must specifically request "dma-pinned tracking of pages". In other
words, just calling get_user_pages() will not suffice; a new set of functions,
pin_user_page() and related, must be used.
@@ -228,5 +221,6 @@ References
* `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_
* `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_
* `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_
+* `LWN kernel index: get_user_pages() <https://lwn.net/Kernel/Index/#Memory_management-get_user_pages>`_

John Hubbard, October, 2019
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ca787c606f0e..fdcd137b9981 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -770,6 +770,24 @@ static inline unsigned int compound_order(struct page *page)
return page[1].compound_order;
}

+static inline bool hpage_pincount_available(struct page *page)
+{
+ /*
+ * Can the page->hpage_pinned_refcount field be used? That field is in
+ * the 3rd page of the compound page, so the smallest (2-page) compound
+ * pages cannot support it.
+ */
+ page = compound_head(page);
+ return PageCompound(page) && compound_order(page) > 1;
+}
+
+static inline int compound_pincount(struct page *page)
+{
+ VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
+ page = compound_head(page);
+ return atomic_read(compound_pincount_ptr(page));
+}
+
static inline void set_compound_order(struct page *page, unsigned int order)
{
page[1].compound_order = order;
@@ -1084,6 +1102,11 @@ void unpin_user_pages(struct page **pages, unsigned long npages);
* refcounts, and b) all the callers of this routine are expected to be able to
* deal gracefully with a false positive.
*
+ * For huge pages, the result will be exactly correct. That's because we have
+ * more tracking data available: the 3rd struct page in the compound page is
+ * used to track the pincount (instead of using the GUP_PIN_COUNTING_BIAS
+ * scheme).
+ *
* For more information, please see Documentation/vm/pin_user_pages.rst.
*
* @page: pointer to page to be queried.
@@ -1092,6 +1115,9 @@ void unpin_user_pages(struct page **pages, unsigned long npages);
*/
static inline bool page_maybe_dma_pinned(struct page *page)
{
+ if (hpage_pincount_available(page))
+ return compound_pincount(page) > 0;
+
/*
* page_ref_count() is signed. If that refcount overflows, then
* page_ref_count() returns a negative value, and callers will avoid
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e87bb864bdb2..01e9717b8529 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -137,7 +137,7 @@ struct page {
};
struct { /* Second tail page of compound page */
unsigned long _compound_pad_1; /* compound_head */
- unsigned long _compound_pad_2;
+ atomic_t hpage_pinned_refcount;
/* For both global and memcg */
struct list_head deferred_list;
};
@@ -226,6 +226,11 @@ static inline atomic_t *compound_mapcount_ptr(struct page *page)
return &page[1].compound_mapcount;
}

+static inline atomic_t *compound_pincount_ptr(struct page *page)
+{
+ return &page[2].hpage_pinned_refcount;
+}
+
/*
* Used for sizing the vmemmap region on some architectures
*/
diff --git a/mm/gup.c b/mm/gup.c
index 6e8b773c233a..c10d0d051c5b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -29,6 +29,22 @@ struct follow_page_context {
unsigned int page_mask;
};

+static void hpage_pincount_add(struct page *page, int refs)
+{
+ VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
+ VM_BUG_ON_PAGE(page != compound_head(page), page);
+
+ atomic_add(refs, compound_pincount_ptr(page));
+}
+
+static void hpage_pincount_sub(struct page *page, int refs)
+{
+ VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
+ VM_BUG_ON_PAGE(page != compound_head(page), page);
+
+ atomic_sub(refs, compound_pincount_ptr(page));
+}
+
/*
* Return the compound head page with ref appropriately incremented,
* or NULL if that failed.
@@ -70,8 +86,25 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page,
if (flags & FOLL_GET)
return try_get_compound_head(page, refs);
else if (flags & FOLL_PIN) {
- refs *= GUP_PIN_COUNTING_BIAS;
- return try_get_compound_head(page, refs);
+ /*
+ * When pinning a compound page of order > 1 (which is what
+ * hpage_pincount_available() checks for), use an exact count to
+ * track it, via hpage_pincount_add/_sub().
+ *
+ * However, be sure to *also* increment the normal page refcount
+ * field at least once, so that the page really is pinned.
+ */
+ if (!hpage_pincount_available(page))
+ refs *= GUP_PIN_COUNTING_BIAS;
+
+ page = try_get_compound_head(page, refs);
+ if (!page)
+ return NULL;
+
+ if (hpage_pincount_available(page))
+ hpage_pincount_add(page, refs);
+
+ return page;
}

WARN_ON_ONCE(1);
@@ -106,12 +139,25 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
if (flags & FOLL_GET)
return try_get_page(page);
else if (flags & FOLL_PIN) {
+ int refs = 1;
+
page = compound_head(page);

if (WARN_ON_ONCE(page_ref_count(page) <= 0))
return false;

- page_ref_add(page, GUP_PIN_COUNTING_BIAS);
+ if (hpage_pincount_available(page))
+ hpage_pincount_add(page, 1);
+ else
+ refs = GUP_PIN_COUNTING_BIAS;
+
+ /*
+ * Similar to try_grab_compound_head(): even if using the
+ * hpage_pincount_add/_sub() routines, be sure to
+ * *also* increment the normal page refcount field at least
+ * once, so that the page really is pinned.
+ */
+ page_ref_add(page, refs);
}

return true;
@@ -120,12 +166,17 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
#ifdef CONFIG_DEV_PAGEMAP_OPS
static bool __unpin_devmap_managed_user_page(struct page *page)
{
- int count;
+ int count, refs = 1;

if (!page_is_devmap_managed(page))
return false;

- count = page_ref_sub_return(page, GUP_PIN_COUNTING_BIAS);
+ if (hpage_pincount_available(page))
+ hpage_pincount_sub(page, 1);
+ else
+ refs = GUP_PIN_COUNTING_BIAS;
+
+ count = page_ref_sub_return(page, refs);

/*
* devmap page refcounts are 1-based, rather than 0-based: if
@@ -157,6 +208,8 @@ static bool __unpin_devmap_managed_user_page(struct page *page)
*/
void unpin_user_page(struct page *page)
{
+ int refs = 1;
+
page = compound_head(page);

/*
@@ -168,7 +221,12 @@ void unpin_user_page(struct page *page)
if (__unpin_devmap_managed_user_page(page))
return;

- if (page_ref_sub_and_test(page, GUP_PIN_COUNTING_BIAS))
+ if (hpage_pincount_available(page))
+ hpage_pincount_sub(page, 1);
+ else
+ refs = GUP_PIN_COUNTING_BIAS;
+
+ if (page_ref_sub_and_test(page, refs))
__put_page(page);
}
EXPORT_SYMBOL(unpin_user_page);
@@ -2200,8 +2258,12 @@ static int record_subpages(struct page *page, unsigned long addr,

static void put_compound_head(struct page *page, int refs, unsigned int flags)
{
- if (flags & FOLL_PIN)
- refs *= GUP_PIN_COUNTING_BIAS;
+ if (flags & FOLL_PIN) {
+ if (hpage_pincount_available(page))
+ hpage_pincount_sub(page, refs);
+ else
+ refs *= GUP_PIN_COUNTING_BIAS;
+ }

VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
/*
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 487e998fd38e..07059d936f7b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1009,6 +1009,9 @@ static void destroy_compound_gigantic_page(struct page *page,
struct page *p = page + 1;

atomic_set(compound_mapcount_ptr(page), 0);
+ if (hpage_pincount_available(page))
+ atomic_set(compound_pincount_ptr(page), 0);
+
for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
clear_compound_head(p);
set_page_refcounted(p);
@@ -1287,6 +1290,9 @@ static void prep_compound_gigantic_page(struct page *page, unsigned int order)
set_compound_head(p, page);
}
atomic_set(compound_mapcount_ptr(page), -1);
+
+ if (hpage_pincount_available(page))
+ atomic_set(compound_pincount_ptr(page), 0);
}

/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 15e908ad933b..c205b912f108 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -689,6 +689,8 @@ void prep_compound_page(struct page *page, unsigned int order)
set_compound_head(p, page);
}
atomic_set(compound_mapcount_ptr(page), -1);
+ if (hpage_pincount_available(page))
+ atomic_set(compound_pincount_ptr(page), 0);
}

#ifdef CONFIG_DEBUG_PAGEALLOC
diff --git a/mm/rmap.c b/mm/rmap.c
index b3e381919835..e45b9b991e2f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1178,6 +1178,9 @@ void page_add_new_anon_rmap(struct page *page,
VM_BUG_ON_PAGE(!PageTransHuge(page), page);
/* increment count (starts at -1) */
atomic_set(compound_mapcount_ptr(page), 0);
+ if (hpage_pincount_available(page))
+ atomic_set(compound_pincount_ptr(page), 0);
+
__inc_node_page_state(page, NR_ANON_THPS);
} else {
/* Anon THP always mapped first with PMD */
@@ -1974,6 +1977,9 @@ void hugepage_add_new_anon_rmap(struct page *page,
{
BUG_ON(address < vma->vm_start || address >= vma->vm_end);
atomic_set(compound_mapcount_ptr(page), 0);
+ if (hpage_pincount_available(page))
+ atomic_set(compound_pincount_ptr(page), 0);
+
__page_set_anon_rmap(page, vma, address, 1);
}
#endif /* CONFIG_HUGETLB_PAGE */
--
2.25.0

2020-02-01 03:45:01

by John Hubbard

Subject: [PATCH v3 04/12] mm: introduce page_ref_sub_return()

An upcoming patch requires subtracting a large chunk of refcounts from
a page, and checking what the resulting refcount is. This is a little
different than the usual "check for zero refcount" that many of the
page ref functions already do. However, it is similar to a few other
routines that (like this one) are generally useful for things such as
1-based refcounting.

Add page_ref_sub_return(), that subtracts a chunk of refcounts
atomically, and returns an atomic snapshot of the result.
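
The "subtract, then look at the result" pattern can be sketched in userspace
with C11 atomics. This is only an illustration: struct model_page and
model_ref_sub_return() are hypothetical names, and the kernel's
atomic_sub_return() is modeled here with atomic_fetch_sub().

```c
#include <stdatomic.h>

/* Hypothetical userspace stand-in for a page with a refcount. */
struct model_page {
	atomic_int refcount;
};

/*
 * Subtract nr references atomically and return the resulting count.
 * atomic_fetch_sub() returns the value held *before* the subtraction,
 * so the new value is that result minus nr.
 */
static int model_ref_sub_return(struct model_page *page, int nr)
{
	return atomic_fetch_sub(&page->refcount, nr) - nr;
}
```

With 1-based refcounting, a caller would compare the returned value against
1 rather than 0 to decide whether the last user has gone away.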

Signed-off-by: John Hubbard <[email protected]>
---
include/linux/page_ref.h | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 14d14beb1f7f..b9cbe553d1e7 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -102,6 +102,16 @@ static inline void page_ref_sub(struct page *page, int nr)
__page_ref_mod(page, -nr);
}

+static inline int page_ref_sub_return(struct page *page, int nr)
+{
+ int ret = atomic_sub_return(nr, &page->_refcount);
+
+ if (page_ref_tracepoint_active(__tracepoint_page_ref_mod))
+ __page_ref_mod(page, -nr);
+
+ return ret;
+}
+
static inline void page_ref_inc(struct page *page)
{
atomic_inc(&page->_refcount);
--
2.25.0

2020-02-01 03:45:04

by John Hubbard

Subject: [PATCH v3 09/12] mm: dump_page(): better diagnostics for huge pinned pages

As part of pin_user_pages() and related API calls, pages are
"dma-pinned". For the case of compound pages of order > 1, the per-page
accounting of dma pins is accomplished via the 3rd struct page in the
compound page. In order to support debugging of any pin_user_pages()-
related problems, enhance dump_page() so as to report the pin count
in that case.

Documentation/core-api/pin_user_pages.rst is also updated accordingly.

Signed-off-by: John Hubbard <[email protected]>
---
Documentation/core-api/pin_user_pages.rst | 7 +++++
mm/debug.c | 34 +++++++++++++++++------
2 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
index 3f72b1ea1104..dd21ea140ef4 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -215,6 +215,13 @@ Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is
because there is a noticeable performance drop in unpin_user_page(), when they
are activated.

+Other diagnostics
+=================
+
+dump_page() has been enhanced slightly, to handle these new counting fields, and
+to better report on compound pages in general. Specifically, for compound pages
+with order > 1, the exact (hpage_pinned_refcount) pincount is reported.
+
References
==========

diff --git a/mm/debug.c b/mm/debug.c
index beb1c59d784b..db81b11345be 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -57,10 +57,20 @@ static void __dump_tail_page(struct page *page, int mapcount)
page, page_ref_count(page), mapcount, page->mapping,
page_to_pgoff(page));
} else {
- pr_warn("page:%px compound refcount:%d mapcount:%d mapping:%px "
- "index:%#lx compound_mapcount:%d\n",
- page, page_ref_count(head), mapcount, head->mapping,
- page_to_pgoff(head), compound_mapcount(page));
+ if (hpage_pincount_available(page))
+ pr_warn("page:%px compound refcount:%d mapcount:%d "
+ "mapping:%px index:%#lx compound_mapcount:%d "
+ "compound_pincount:%d\n",
+ page, page_ref_count(head), mapcount,
+ head->mapping, page_to_pgoff(head),
+ compound_mapcount(page),
+ compound_pincount(page));
+ else
+ pr_warn("page:%px compound refcount:%d mapcount:%d "
+ "mapping:%px index:%#lx compound_mapcount:%d\n",
+ page, page_ref_count(head), mapcount,
+ head->mapping, page_to_pgoff(head),
+ compound_mapcount(page));
}

if (page_ref_count(page) != 0)
@@ -103,10 +113,18 @@ void __dump_page(struct page *page, const char *reason)

if (PageTail(page))
__dump_tail_page(page, mapcount);
- else
- pr_warn("page:%px refcount:%d mapcount:%d mapping:%px index:%#lx\n",
- page, page_ref_count(page), mapcount,
- page->mapping, page_to_pgoff(page));
+ else {
+ if (hpage_pincount_available(page))
+ pr_warn("page:%px refcount:%d mapcount:%d mapping:%px "
+ "index:%#lx compound pincount: %d\n",
+ page, page_ref_count(page), mapcount,
+ page->mapping, page_to_pgoff(page),
+ compound_pincount(page));
+ else
+ pr_warn("page:%px refcount:%d mapcount:%d mapping:%px "
+ "index:%#lx\n", page, page_ref_count(page),
+ mapcount, page->mapping, page_to_pgoff(page));
+ }
if (PageKsm(page))
type = "ksm ";
else if (PageAnon(page))
--
2.25.0

2020-02-01 03:45:23

by John Hubbard

Subject: [PATCH v3 11/12] mm/gup_benchmark: support pin_user_pages() and related calls

Up until now, gup_benchmark supported testing of the
following kernel functions:

* get_user_pages(): via the '-U' command line option
* get_user_pages_longterm(): via the '-L' command line option
* get_user_pages_fast(): as the default (no options required)

Add test coverage for the new corresponding pin_*() functions:

* pin_user_pages_fast(): via the '-a' command line option
* pin_user_pages(): via the '-b' command line option

Also, add an option for clarity: '-u' for what is now (still) the
default choice: get_user_pages_fast().

Also, for the commands that set FOLL_PIN, verify that the pages
really are dma-pinned, via the new page_maybe_dma_pinned() routine.
Those commands are:

PIN_FAST_BENCHMARK : calls pin_user_pages_fast()
PIN_BENCHMARK : calls pin_user_pages()

In between the calls to pin_*() and unpin_user_pages(),
check each page: if page_maybe_dma_pinned() returns false, then
WARN and return.

Do this outside of the benchmark timestamps, so that it doesn't
affect reported times.

Reviewed-by: Ira Weiny <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
---
mm/gup_benchmark.c | 71 ++++++++++++++++++++--
tools/testing/selftests/vm/gup_benchmark.c | 15 ++++-
2 files changed, 80 insertions(+), 6 deletions(-)

diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
index 8dba38e79a9f..447628d0131f 100644
--- a/mm/gup_benchmark.c
+++ b/mm/gup_benchmark.c
@@ -8,6 +8,8 @@
#define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark)
#define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark)
#define GUP_BENCHMARK _IOWR('g', 3, struct gup_benchmark)
+#define PIN_FAST_BENCHMARK _IOWR('g', 4, struct gup_benchmark)
+#define PIN_BENCHMARK _IOWR('g', 5, struct gup_benchmark)

struct gup_benchmark {
__u64 get_delta_usec;
@@ -19,6 +21,48 @@ struct gup_benchmark {
__u64 expansion[10]; /* For future use */
};

+static void put_back_pages(unsigned int cmd, struct page **pages,
+ unsigned long nr_pages)
+{
+ int i;
+
+ switch (cmd) {
+ case GUP_FAST_BENCHMARK:
+ case GUP_LONGTERM_BENCHMARK:
+ case GUP_BENCHMARK:
+ for (i = 0; i < nr_pages; i++)
+ put_page(pages[i]);
+ break;
+
+ case PIN_FAST_BENCHMARK:
+ case PIN_BENCHMARK:
+ unpin_user_pages(pages, nr_pages);
+ break;
+ }
+}
+
+static void verify_dma_pinned(unsigned int cmd, struct page **pages,
+ unsigned long nr_pages)
+{
+ int i;
+ struct page *page;
+
+ switch (cmd) {
+ case PIN_FAST_BENCHMARK:
+ case PIN_BENCHMARK:
+ for (i = 0; i < nr_pages; i++) {
+ page = pages[i];
+ if (WARN(!page_maybe_dma_pinned(page),
+ "pages[%d] is NOT dma-pinned\n", i)) {
+
+ dump_page(page, "gup_benchmark failure");
+ break;
+ }
+ }
+ break;
+ }
+}
+
static int __gup_benchmark_ioctl(unsigned int cmd,
struct gup_benchmark *gup)
{
@@ -66,6 +110,14 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
nr = get_user_pages(addr, nr, gup->flags, pages + i,
NULL);
break;
+ case PIN_FAST_BENCHMARK:
+ nr = pin_user_pages_fast(addr, nr, gup->flags,
+ pages + i);
+ break;
+ case PIN_BENCHMARK:
+ nr = pin_user_pages(addr, nr, gup->flags, pages + i,
+ NULL);
+ break;
default:
kvfree(pages);
ret = -EINVAL;
@@ -78,15 +130,22 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
}
end_time = ktime_get();

+ /* Shifting the meaning of nr_pages: now it is actual number pinned: */
+ nr_pages = i;
+
gup->get_delta_usec = ktime_us_delta(end_time, start_time);
gup->size = addr - gup->addr;

+ /*
+ * Take an un-benchmark-timed moment to verify DMA pinned
+ * state: print a warning if any non-dma-pinned pages are found:
+ */
+ verify_dma_pinned(cmd, pages, nr_pages);
+
start_time = ktime_get();
- for (i = 0; i < nr_pages; i++) {
- if (!pages[i])
- break;
- put_page(pages[i]);
- }
+
+ put_back_pages(cmd, pages, nr_pages);
+
end_time = ktime_get();
gup->put_delta_usec = ktime_us_delta(end_time, start_time);

@@ -105,6 +164,8 @@ static long gup_benchmark_ioctl(struct file *filep, unsigned int cmd,
case GUP_FAST_BENCHMARK:
case GUP_LONGTERM_BENCHMARK:
case GUP_BENCHMARK:
+ case PIN_FAST_BENCHMARK:
+ case PIN_BENCHMARK:
break;
default:
return -EINVAL;
diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c
index 389327e9b30a..43b4dfe161a2 100644
--- a/tools/testing/selftests/vm/gup_benchmark.c
+++ b/tools/testing/selftests/vm/gup_benchmark.c
@@ -18,6 +18,10 @@
#define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark)
#define GUP_BENCHMARK _IOWR('g', 3, struct gup_benchmark)

+/* Similar to above, but use FOLL_PIN instead of FOLL_GET. */
+#define PIN_FAST_BENCHMARK _IOWR('g', 4, struct gup_benchmark)
+#define PIN_BENCHMARK _IOWR('g', 5, struct gup_benchmark)
+
/* Just the flags we need, copied from mm.h: */
#define FOLL_WRITE 0x01 /* check pte is writable */

@@ -40,8 +44,14 @@ int main(int argc, char **argv)
char *file = "/dev/zero";
char *p;

- while ((opt = getopt(argc, argv, "m:r:n:f:tTLUwSH")) != -1) {
+ while ((opt = getopt(argc, argv, "m:r:n:f:abtTLUuwSH")) != -1) {
switch (opt) {
+ case 'a':
+ cmd = PIN_FAST_BENCHMARK;
+ break;
+ case 'b':
+ cmd = PIN_BENCHMARK;
+ break;
case 'm':
size = atoi(optarg) * MB;
break;
@@ -63,6 +73,9 @@ int main(int argc, char **argv)
case 'U':
cmd = GUP_BENCHMARK;
break;
+ case 'u':
+ cmd = GUP_FAST_BENCHMARK;
+ break;
case 'w':
write = 1;
break;
--
2.25.0

2020-02-03 14:58:53

by Kirill A. Shutemov

Subject: Re: [PATCH v3 04/12] mm: introduce page_ref_sub_return()

On Fri, Jan 31, 2020 at 07:40:21PM -0800, John Hubbard wrote:
> An upcoming patch requires subtracting a large chunk of refcounts from
> a page, and checking what the resulting refcount is. This is a little
> different than the usual "check for zero refcount" that many of the
> page ref functions already do. However, it is similar to a few other
> routines that (like this one) are generally useful for things such as
> 1-based refcounting.
>
> Add page_ref_sub_return(), that subtracts a chunk of refcounts
> atomically, and returns an atomic snapshot of the result.
>
> Signed-off-by: John Hubbard <[email protected]>
> ---
> include/linux/page_ref.h | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
> index 14d14beb1f7f..b9cbe553d1e7 100644
> --- a/include/linux/page_ref.h
> +++ b/include/linux/page_ref.h
> @@ -102,6 +102,16 @@ static inline void page_ref_sub(struct page *page, int nr)
> __page_ref_mod(page, -nr);
> }
>
> +static inline int page_ref_sub_return(struct page *page, int nr)
> +{
> + int ret = atomic_sub_return(nr, &page->_refcount);
> +
> + if (page_ref_tracepoint_active(__tracepoint_page_ref_mod))
> + __page_ref_mod(page, -nr);

Shouldn't it be __page_ref_mod_and_return() and relevant tracepoint?

> +
> + return ret;
> +}
> +
> static inline void page_ref_inc(struct page *page)
> {
> atomic_inc(&page->_refcount);
> --
> 2.25.0
>

--
Kirill A. Shutemov

2020-02-03 15:36:30

by Kirill A. Shutemov

Subject: Re: [PATCH v3 09/12] mm: dump_page(): better diagnostics for huge pinned pages

On Fri, Jan 31, 2020 at 07:40:26PM -0800, John Hubbard wrote:
> As part of pin_user_pages() and related API calls, pages are
> "dma-pinned". For the case of compound pages of order > 1, the per-page
> accounting of dma pins is accomplished via the 3rd struct page in the
> compound page. In order to support debugging of any pin_user_pages()-
> related problems, enhance dump_page() so as to report the pin count
> in that case.
>
> Documentation/core-api/pin_user_pages.rst is also updated accordingly.
>
> Signed-off-by: John Hubbard <[email protected]>

Acked-by: Kirill A. Shutemov <[email protected]>

--
Kirill A. Shutemov

2020-02-03 15:36:30

by Kirill A. Shutemov

Subject: Re: [PATCH v3 08/12] mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages

On Fri, Jan 31, 2020 at 07:40:25PM -0800, John Hubbard wrote:
> For huge pages (and in fact, any compound page), the
> GUP_PIN_COUNTING_BIAS scheme tends to overflow too easily, each tail
> page increments the head page->_refcount by GUP_PIN_COUNTING_BIAS
> (1024). That limits the number of huge pages that can be pinned.
>
> This patch removes that limitation, by using an exact form of pin
> counting for compound pages of order > 1. The "order > 1" is required
> because this approach uses the 3rd struct page in the compound page, and
> order 1 compound pages only have two pages, so that won't work there.

Could you update the comment for the HPAGE_PMD_ORDER < 2 check in
hugepage_init() to reflect the additional user of the condition?
>
> A new struct page field, hpage_pinned_refcount, has been added,
> replacing a padding field in the union (so no new space is used).
>
> This enhancement also has a useful side effect: huge pages and compound
> pages (of order > 1) do not suffer from the "potential false positives"
> problem that is discussed in the page_dma_pinned() comment block. That
> is because these compound pages have extra space for tracking things, so
> they get exact pin counts instead of overloading page->_refcount.
>
> Documentation/core-api/pin_user_pages.rst is updated accordingly.
>
> Suggested-by: Jan Kara <[email protected]>
> Signed-off-by: John Hubbard <[email protected]>

Acked-by: Kirill A. Shutemov <[email protected]>
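
The "potential false positives" point in the quoted commit message can be illustrated with a small user-space model. This is only a sketch: struct model_page and the model_*() helpers are hypothetical names, not kernel code, and the fallback comparison against GUP_PIN_COUNTING_BIAS is an assumption about how the bias scheme decides that a page "may be" pinned.

```c
#include <stdbool.h>

#define GUP_PIN_COUNTING_BIAS 1024 /* value quoted in the commit message */

/* Hypothetical user-space model of a head page; not the kernel's struct page. */
struct model_page {
	unsigned int order;	/* 0 for a normal page */
	int refcount;		/* stands in for page->_refcount */
	int pincount;		/* stands in for hpage_pinned_refcount */
};

/* Mirrors the "order > 1" rule: only such pages have a 3rd struct page. */
static bool model_pincount_available(const struct model_page *p)
{
	return p->order > 1;
}

/* One pin: exact count plus a normal reference, or a biased reference. */
static void model_pin(struct model_page *p)
{
	if (model_pincount_available(p)) {
		p->pincount += 1;
		p->refcount += 1;
	} else {
		p->refcount += GUP_PIN_COUNTING_BIAS;
	}
}

/* Exact answer when a pincount exists; otherwise a guess from the refcount. */
static bool model_maybe_dma_pinned(const struct model_page *p)
{
	if (model_pincount_available(p))
		return p->pincount > 0;
	/* Assumed fallback: enough ordinary references look like a pin. */
	return p->refcount >= GUP_PIN_COUNTING_BIAS;
}
```

In this model, an order-0 page holding 1024 ordinary references is reported as pinned even though model_pin() was never called, while an order-9 page with the same refcount is not.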

--
Kirill A. Shutemov

2020-02-03 16:03:05

by Kirill A. Shutemov

Subject: Re: [PATCH v3 11/12] mm/gup_benchmark: support pin_user_pages() and related calls

On Fri, Jan 31, 2020 at 07:40:28PM -0800, John Hubbard wrote:
> Up until now, gup_benchmark supported testing of the
> following kernel functions:
>
> * get_user_pages(): via the '-U' command line option
> * get_user_pages_longterm(): via the '-L' command line option
> * get_user_pages_fast(): as the default (no options required)
>
> Add test coverage for the new corresponding pin_*() functions:
>
> * pin_user_pages_fast(): via the '-a' command line option
> * pin_user_pages(): via the '-b' command line option
>
> Also, add an option for clarity: '-u' for what is now (still) the
> default choice: get_user_pages_fast().
>
> Also, for the commands that set FOLL_PIN, verify that the pages
> really are dma-pinned, via the new is_dma_pinned() routine.
> Those commands are:
>
> PIN_FAST_BENCHMARK : calls pin_user_pages_fast()
> PIN_BENCHMARK : calls pin_user_pages()
>
> In between the calls to pin_*() and unpin_user_pages(),
> check each page: if page_maybe_dma_pinned() returns false, then
> WARN and return.
>
> Do this outside of the benchmark timestamps, so that it doesn't
> affect reported times.
>
> Reviewed-by: Ira Weiny <[email protected]>
> Signed-off-by: John Hubbard <[email protected]>
> ---
> mm/gup_benchmark.c | 71 ++++++++++++++++++++--
> tools/testing/selftests/vm/gup_benchmark.c | 15 ++++-
> 2 files changed, 80 insertions(+), 6 deletions(-)
>
> diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
> index 8dba38e79a9f..447628d0131f 100644
> --- a/mm/gup_benchmark.c
> +++ b/mm/gup_benchmark.c
> @@ -8,6 +8,8 @@
> #define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark)
> #define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark)
> #define GUP_BENCHMARK _IOWR('g', 3, struct gup_benchmark)
> +#define PIN_FAST_BENCHMARK _IOWR('g', 4, struct gup_benchmark)
> +#define PIN_BENCHMARK _IOWR('g', 5, struct gup_benchmark)
>
> struct gup_benchmark {
> __u64 get_delta_usec;
> @@ -19,6 +21,48 @@ struct gup_benchmark {
> __u64 expansion[10]; /* For future use */
> };
>
> +static void put_back_pages(unsigned int cmd, struct page **pages,
> + unsigned long nr_pages)
> +{
> + int i;
> +
> + switch (cmd) {
> + case GUP_FAST_BENCHMARK:
> + case GUP_LONGTERM_BENCHMARK:
> + case GUP_BENCHMARK:
> + for (i = 0; i < nr_pages; i++)

'i' is 'int' and 'nr_pages' is 'unsigned long'.
There's space for trouble :P

> + put_page(pages[i]);
> + break;
> +
> + case PIN_FAST_BENCHMARK:
> + case PIN_BENCHMARK:
> + unpin_user_pages(pages, nr_pages);
> + break;
> + }
> +}
> +
> +static void verify_dma_pinned(unsigned int cmd, struct page **pages,
> + unsigned long nr_pages)
> +{
> + int i;
> + struct page *page;
> +
> + switch (cmd) {
> + case PIN_FAST_BENCHMARK:
> + case PIN_BENCHMARK:
> + for (i = 0; i < nr_pages; i++) {

Ditto.

> + page = pages[i];
> + if (WARN(!page_maybe_dma_pinned(page),
> + "pages[%d] is NOT dma-pinned\n", i)) {
> +
> + dump_page(page, "gup_benchmark failure");
> + break;
> + }
> + }
> + break;
> + }
> +}
> +
> static int __gup_benchmark_ioctl(unsigned int cmd,
> struct gup_benchmark *gup)
> {

--
Kirill A. Shutemov

2020-02-03 17:13:36

by Jan Kara

Subject: Re: [PATCH v3 08/12] mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages

On Fri 31-01-20 19:40:25, John Hubbard wrote:
> For huge pages (and in fact, any compound page), the
> GUP_PIN_COUNTING_BIAS scheme tends to overflow too easily, each tail
> page increments the head page->_refcount by GUP_PIN_COUNTING_BIAS
> (1024). That limits the number of huge pages that can be pinned.
>
> This patch removes that limitation, by using an exact form of pin
> counting for compound pages of order > 1. The "order > 1" is required
> because this approach uses the 3rd struct page in the compound page, and
> order 1 compound pages only have two pages, so that won't work there.
>
> A new struct page field, hpage_pinned_refcount, has been added,
> replacing a padding field in the union (so no new space is used).
>
> This enhancement also has a useful side effect: huge pages and compound
> pages (of order > 1) do not suffer from the "potential false positives"
> problem that is discussed in the page_dma_pinned() comment block. That
> is because these compound pages have extra space for tracking things, so
> they get exact pin counts instead of overloading page->_refcount.
>
> Documentation/core-api/pin_user_pages.rst is updated accordingly.
>
> Suggested-by: Jan Kara <[email protected]>
> Signed-off-by: John Hubbard <[email protected]>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza
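
The overflow concern in the commit message can be checked with simple arithmetic. The sketch below is an illustration only: max_full_pins() is a hypothetical helper, it assumes a signed 32-bit page->_refcount, and it ignores the ordinary get_page() references that share the same counter.

```c
#include <limits.h>

#define GUP_PIN_COUNTING_BIAS 1024LL /* value quoted in the commit message */

/*
 * How many times can every subpage of one compound page be pinned before
 * a signed 32-bit refcount would overflow?
 *
 * exact == 0: each subpage adds GUP_PIN_COUNTING_BIAS to the head refcount.
 * exact == 1: each subpage adds 1, and the pins are tracked separately.
 */
static long long max_full_pins(long long subpages, int exact)
{
	long long per_pin = exact ? subpages : subpages * GUP_PIN_COUNTING_BIAS;

	return (long long)INT_MAX / per_pin;
}
```

For a 1GB huge page made of 4 KB pages (256K subpages), this gives 7 full pins under the bias scheme, in line with the "about 8 references" estimate in the documentation text being removed, and 8191 with exact counting.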

> ---
> Documentation/core-api/pin_user_pages.rst | 40 +++++-------
> include/linux/mm.h | 26 ++++++++
> include/linux/mm_types.h | 7 +-
> mm/gup.c | 78 ++++++++++++++++++++---
> mm/hugetlb.c | 6 ++
> mm/page_alloc.c | 2 +
> mm/rmap.c | 6 ++
> 7 files changed, 133 insertions(+), 32 deletions(-)
>
> diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
> index 9829345428f8..3f72b1ea1104 100644
> --- a/Documentation/core-api/pin_user_pages.rst
> +++ b/Documentation/core-api/pin_user_pages.rst
> @@ -52,8 +52,22 @@ Which flags are set by each wrapper
>
> For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
> flags the caller provides. The caller is required to pass in a non-null struct
> -pages* array, and the function then pin pages by incrementing each by a special
> -value. For now, that value is +1, just like get_user_pages*().::
> +pages* array, and the function then pins pages by incrementing each by a special
> +value: GUP_PIN_COUNTING_BIAS.
> +
> +For huge pages (and in fact, any compound page of more than 2 pages), the
> +GUP_PIN_COUNTING_BIAS scheme is not used. Instead, an exact form of pin counting
> +is achieved, by using the 3rd struct page in the compound page. A new struct
> +page field, hpage_pinned_refcount, has been added in order to support this.
> +
> +This approach for compound pages avoids the counting upper limit problems that
> +are discussed below. Those limitations would have been aggravated severely by
> +huge pages, because each tail page adds a refcount to the head page. And in
> +fact, testing revealed that, without a separate hpage_pinned_refcount field,
> +page overflows were seen in some huge page stress tests.
> +
> +This also means that huge pages and compound pages (of order > 1) do not suffer
> +from the false positives problem that is mentioned below.::
>
> Function
> --------
> @@ -99,27 +113,6 @@ pages:
> This also leads to limitations: there are only 31-10==21 bits available for a
> counter that increments 10 bits at a time.
>
> -TODO: for 1GB and larger huge pages, this is cutting it close. That's because
> -when pin_user_pages() follows such pages, it increments the head page by "1"
> -(where "1" used to mean "+1" for get_user_pages(), but now means "+1024" for
> -pin_user_pages()) for each tail page. So if you have a 1GB huge page:
> -
> -* There are 256K (18 bits) worth of 4 KB tail pages.
> -* There are 21 bits available to count up via GUP_PIN_COUNTING_BIAS (that is,
> - 10 bits at a time)
> -* There are 21 - 18 == 3 bits available to count. Except that there aren't,
> - because you need to allow for a few normal get_page() calls on the head page,
> - as well. Fortunately, the approach of using addition, rather than "hard"
> - bitfields, within page->_refcount, allows for sharing these bits gracefully.
> - But we're still looking at about 8 references.
> -
> -This, however, is a missing feature more than anything else, because it's easily
> -solved by addressing an obvious inefficiency in the original get_user_pages()
> -approach of retrieving pages: stop treating all the pages as if they were
> -PAGE_SIZE. Retrieve huge pages as huge pages. The callers need to be aware of
> -this, so some work is required. Once that's in place, this limitation mostly
> -disappears from view, because there will be ample refcounting range available.
> -
> * Callers must specifically request "dma-pinned tracking of pages". In other
> words, just calling get_user_pages() will not suffice; a new set of functions,
> pin_user_page() and related, must be used.
> @@ -228,5 +221,6 @@ References
> * `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_
> * `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_
> * `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_
> +* `LWN kernel index: get_user_pages() <https://lwn.net/Kernel/Index/#Memory_management-get_user_pages>`_
>
> John Hubbard, October, 2019
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ca787c606f0e..fdcd137b9981 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -770,6 +770,24 @@ static inline unsigned int compound_order(struct page *page)
> return page[1].compound_order;
> }
>
> +static inline bool hpage_pincount_available(struct page *page)
> +{
> + /*
> + * Can the page->hpage_pinned_refcount field be used? That field is in
> + * the 3rd page of the compound page, so the smallest (2-page) compound
> + * pages cannot support it.
> + */
> + page = compound_head(page);
> + return PageCompound(page) && compound_order(page) > 1;
> +}
> +
> +static inline int compound_pincount(struct page *page)
> +{
> + VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
> + page = compound_head(page);
> + return atomic_read(compound_pincount_ptr(page));
> +}
> +
> static inline void set_compound_order(struct page *page, unsigned int order)
> {
> page[1].compound_order = order;
> @@ -1084,6 +1102,11 @@ void unpin_user_pages(struct page **pages, unsigned long npages);
> * refcounts, and b) all the callers of this routine are expected to be able to
> * deal gracefully with a false positive.
> *
> + * For huge pages, the result will be exactly correct. That's because we have
> + * more tracking data available: the 3rd struct page in the compound page is
> + * used to track the pincount (instead of using the GUP_PIN_COUNTING_BIAS
> + * scheme).
> + *
> * For more information, please see Documentation/vm/pin_user_pages.rst.
> *
> * @page: pointer to page to be queried.
> @@ -1092,6 +1115,9 @@ void unpin_user_pages(struct page **pages, unsigned long npages);
> */
> static inline bool page_maybe_dma_pinned(struct page *page)
> {
> + if (hpage_pincount_available(page))
> + return compound_pincount(page) > 0;
> +
> /*
> * page_ref_count() is signed. If that refcount overflows, then
> * page_ref_count() returns a negative value, and callers will avoid
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index e87bb864bdb2..01e9717b8529 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -137,7 +137,7 @@ struct page {
> };
> struct { /* Second tail page of compound page */
> unsigned long _compound_pad_1; /* compound_head */
> - unsigned long _compound_pad_2;
> + atomic_t hpage_pinned_refcount;
> /* For both global and memcg */
> struct list_head deferred_list;
> };
> @@ -226,6 +226,11 @@ static inline atomic_t *compound_mapcount_ptr(struct page *page)
> return &page[1].compound_mapcount;
> }
>
> +static inline atomic_t *compound_pincount_ptr(struct page *page)
> +{
> + return &page[2].hpage_pinned_refcount;
> +}
> +
> /*
> * Used for sizing the vmemmap region on some architectures
> */
> diff --git a/mm/gup.c b/mm/gup.c
> index 6e8b773c233a..c10d0d051c5b 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -29,6 +29,22 @@ struct follow_page_context {
> unsigned int page_mask;
> };
>
> +static void hpage_pincount_add(struct page *page, int refs)
> +{
> + VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
> + VM_BUG_ON_PAGE(page != compound_head(page), page);
> +
> + atomic_add(refs, compound_pincount_ptr(page));
> +}
> +
> +static void hpage_pincount_sub(struct page *page, int refs)
> +{
> + VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
> + VM_BUG_ON_PAGE(page != compound_head(page), page);
> +
> + atomic_sub(refs, compound_pincount_ptr(page));
> +}
> +
> /*
> * Return the compound head page with ref appropriately incremented,
> * or NULL if that failed.
> @@ -70,8 +86,25 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page,
> if (flags & FOLL_GET)
> return try_get_compound_head(page, refs);
> else if (flags & FOLL_PIN) {
> - refs *= GUP_PIN_COUNTING_BIAS;
> - return try_get_compound_head(page, refs);
> + /*
> + * When pinning a compound page of order > 1 (which is what
> + * hpage_pincount_available() checks for), use an exact count to
> + * track it, via hpage_pincount_add/_sub().
> + *
> + * However, be sure to *also* increment the normal page refcount
> + * field at least once, so that the page really is pinned.
> + */
> + if (!hpage_pincount_available(page))
> + refs *= GUP_PIN_COUNTING_BIAS;
> +
> + page = try_get_compound_head(page, refs);
> + if (!page)
> + return NULL;
> +
> + if (hpage_pincount_available(page))
> + hpage_pincount_add(page, refs);
> +
> + return page;
> }
>
> WARN_ON_ONCE(1);
> @@ -106,12 +139,25 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
> if (flags & FOLL_GET)
> return try_get_page(page);
> else if (flags & FOLL_PIN) {
> + int refs = 1;
> +
> page = compound_head(page);
>
> if (WARN_ON_ONCE(page_ref_count(page) <= 0))
> return false;
>
> - page_ref_add(page, GUP_PIN_COUNTING_BIAS);
> + if (hpage_pincount_available(page))
> + hpage_pincount_add(page, 1);
> + else
> + refs = GUP_PIN_COUNTING_BIAS;
> +
> + /*
> + * Similar to try_grab_compound_head(): even if using the
> + * hpage_pincount_add/_sub() routines, be sure to
> + * *also* increment the normal page refcount field at least
> + * once, so that the page really is pinned.
> + */
> + page_ref_add(page, refs);
> }
>
> return true;
> @@ -120,12 +166,17 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
> #ifdef CONFIG_DEV_PAGEMAP_OPS
> static bool __unpin_devmap_managed_user_page(struct page *page)
> {
> - int count;
> + int count, refs = 1;
>
> if (!page_is_devmap_managed(page))
> return false;
>
> - count = page_ref_sub_return(page, GUP_PIN_COUNTING_BIAS);
> + if (hpage_pincount_available(page))
> + hpage_pincount_sub(page, 1);
> + else
> + refs = GUP_PIN_COUNTING_BIAS;
> +
> + count = page_ref_sub_return(page, refs);
>
> /*
> * devmap page refcounts are 1-based, rather than 0-based: if
> @@ -157,6 +208,8 @@ static bool __unpin_devmap_managed_user_page(struct page *page)
> */
> void unpin_user_page(struct page *page)
> {
> + int refs = 1;
> +
> page = compound_head(page);
>
> /*
> @@ -168,7 +221,12 @@ void unpin_user_page(struct page *page)
> if (__unpin_devmap_managed_user_page(page))
> return;
>
> - if (page_ref_sub_and_test(page, GUP_PIN_COUNTING_BIAS))
> + if (hpage_pincount_available(page))
> + hpage_pincount_sub(page, 1);
> + else
> + refs = GUP_PIN_COUNTING_BIAS;
> +
> + if (page_ref_sub_and_test(page, refs))
> __put_page(page);
> }
> EXPORT_SYMBOL(unpin_user_page);
> @@ -2200,8 +2258,12 @@ static int record_subpages(struct page *page, unsigned long addr,
>
> static void put_compound_head(struct page *page, int refs, unsigned int flags)
> {
> - if (flags & FOLL_PIN)
> - refs *= GUP_PIN_COUNTING_BIAS;
> + if (flags & FOLL_PIN) {
> + if (hpage_pincount_available(page))
> + hpage_pincount_sub(page, refs);
> + else
> + refs *= GUP_PIN_COUNTING_BIAS;
> + }
>
> VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
> /*
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 487e998fd38e..07059d936f7b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1009,6 +1009,9 @@ static void destroy_compound_gigantic_page(struct page *page,
> struct page *p = page + 1;
>
> atomic_set(compound_mapcount_ptr(page), 0);
> + if (hpage_pincount_available(page))
> + atomic_set(compound_pincount_ptr(page), 0);
> +
> for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
> clear_compound_head(p);
> set_page_refcounted(p);
> @@ -1287,6 +1290,9 @@ static void prep_compound_gigantic_page(struct page *page, unsigned int order)
> set_compound_head(p, page);
> }
> atomic_set(compound_mapcount_ptr(page), -1);
> +
> + if (hpage_pincount_available(page))
> + atomic_set(compound_pincount_ptr(page), 0);
> }
>
> /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 15e908ad933b..c205b912f108 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -689,6 +689,8 @@ void prep_compound_page(struct page *page, unsigned int order)
> set_compound_head(p, page);
> }
> atomic_set(compound_mapcount_ptr(page), -1);
> + if (hpage_pincount_available(page))
> + atomic_set(compound_pincount_ptr(page), 0);
> }
>
> #ifdef CONFIG_DEBUG_PAGEALLOC
> diff --git a/mm/rmap.c b/mm/rmap.c
> index b3e381919835..e45b9b991e2f 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1178,6 +1178,9 @@ void page_add_new_anon_rmap(struct page *page,
> VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> /* increment count (starts at -1) */
> atomic_set(compound_mapcount_ptr(page), 0);
> + if (hpage_pincount_available(page))
> + atomic_set(compound_pincount_ptr(page), 0);
> +
> __inc_node_page_state(page, NR_ANON_THPS);
> } else {
> /* Anon THP always mapped first with PMD */
> @@ -1974,6 +1977,9 @@ void hugepage_add_new_anon_rmap(struct page *page,
> {
> BUG_ON(address < vma->vm_start || address >= vma->vm_end);
> atomic_set(compound_mapcount_ptr(page), 0);
> + if (hpage_pincount_available(page))
> + atomic_set(compound_pincount_ptr(page), 0);
> +
> __page_set_anon_rmap(page, vma, address, 1);
> }
> #endif /* CONFIG_HUGETLB_PAGE */
> --
> 2.25.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-02-03 17:15:14

by Jan Kara

Subject: Re: [PATCH v3 09/12] mm: dump_page(): better diagnostics for huge pinned pages

On Fri 31-01-20 19:40:26, John Hubbard wrote:
> As part of pin_user_pages() and related API calls, pages are
> "dma-pinned". For the case of compound pages of order > 1, the per-page
> accounting of dma pins is accomplished via the 3rd struct page in the
> compound page. In order to support debugging of any pin_user_pages()-
> related problems, enhance dump_page() so as to report the pin count
> in that case.
>
> Documentation/core-api/pin_user_pages.rst is also updated accordingly.
>
> Signed-off-by: John Hubbard <[email protected]>

Looks good to me. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> Documentation/core-api/pin_user_pages.rst | 7 +++++
> mm/debug.c | 34 +++++++++++++++++------
> 2 files changed, 33 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
> index 3f72b1ea1104..dd21ea140ef4 100644
> --- a/Documentation/core-api/pin_user_pages.rst
> +++ b/Documentation/core-api/pin_user_pages.rst
> @@ -215,6 +215,13 @@ Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is
> because there is a noticeable performance drop in unpin_user_page(), when they
> are activated.
>
> +Other diagnostics
> +=================
> +
> +dump_page() has been enhanced slightly, to handle these new counting fields, and
> +to better report on compound pages in general. Specifically, for compound pages
> +with order > 1, the exact (hpage_pinned_refcount) pincount is reported.
> +
> References
> ==========
>
> diff --git a/mm/debug.c b/mm/debug.c
> index beb1c59d784b..db81b11345be 100644
> --- a/mm/debug.c
> +++ b/mm/debug.c
> @@ -57,10 +57,20 @@ static void __dump_tail_page(struct page *page, int mapcount)
> page, page_ref_count(page), mapcount, page->mapping,
> page_to_pgoff(page));
> } else {
> - pr_warn("page:%px compound refcount:%d mapcount:%d mapping:%px "
> - "index:%#lx compound_mapcount:%d\n",
> - page, page_ref_count(head), mapcount, head->mapping,
> - page_to_pgoff(head), compound_mapcount(page));
> + if (hpage_pincount_available(page))
> + pr_warn("page:%px compound refcount:%d mapcount:%d "
> + "mapping:%px index:%#lx compound_mapcount:%d "
> + "compound_pincount:%d\n",
> + page, page_ref_count(head), mapcount,
> + head->mapping, page_to_pgoff(head),
> + compound_mapcount(page),
> + compound_pincount(page));
> + else
> + pr_warn("page:%px compound refcount:%d mapcount:%d "
> + "mapping:%px index:%#lx compound_mapcount:%d\n",
> + page, page_ref_count(head), mapcount,
> + head->mapping, page_to_pgoff(head),
> + compound_mapcount(page));
> }
>
> if (page_ref_count(page) != 0)
> @@ -103,10 +113,18 @@ void __dump_page(struct page *page, const char *reason)
>
> if (PageTail(page))
> __dump_tail_page(page, mapcount);
> - else
> - pr_warn("page:%px refcount:%d mapcount:%d mapping:%px index:%#lx\n",
> - page, page_ref_count(page), mapcount,
> - page->mapping, page_to_pgoff(page));
> + else {
> + if (hpage_pincount_available(page))
> + pr_warn("page:%px refcount:%d mapcount:%d mapping:%px "
> + "index:%#lx compound pincount: %d\n",
> + page, page_ref_count(page), mapcount,
> + page->mapping, page_to_pgoff(page),
> + compound_pincount(page));
> + else
> + pr_warn("page:%px refcount:%d mapcount:%d mapping:%px "
> + "index:%#lx\n", page, page_ref_count(page),
> + mapcount, page->mapping, page_to_pgoff(page));
> + }
> if (PageKsm(page))
> type = "ksm ";
> else if (PageAnon(page))
> --
> 2.25.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-02-03 20:05:17

by John Hubbard

Subject: Re: [PATCH v3 04/12] mm: introduce page_ref_sub_return()

On 2/3/20 5:23 AM, Kirill A. Shutemov wrote:
> On Fri, Jan 31, 2020 at 07:40:21PM -0800, John Hubbard wrote:
>> An upcoming patch requires subtracting a large chunk of refcounts from
>> a page, and checking what the resulting refcount is. This is a little
>> different than the usual "check for zero refcount" that many of the
>> page ref functions already do. However, it is similar to a few other
>> routines that (like this one) are generally useful for things such as
>> 1-based refcounting.
>>
>> Add page_ref_sub_return(), that subtracts a chunk of refcounts
>> atomically, and returns an atomic snapshot of the result.
>>
>> Signed-off-by: John Hubbard <[email protected]>
>> ---
>> include/linux/page_ref.h | 10 ++++++++++
>> 1 file changed, 10 insertions(+)
>>
>> diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
>> index 14d14beb1f7f..b9cbe553d1e7 100644
>> --- a/include/linux/page_ref.h
>> +++ b/include/linux/page_ref.h
>> @@ -102,6 +102,16 @@ static inline void page_ref_sub(struct page *page, int nr)
>> __page_ref_mod(page, -nr);
>> }
>>
>> +static inline int page_ref_sub_return(struct page *page, int nr)
>> +{
>> + int ret = atomic_sub_return(nr, &page->_refcount);
>> +
>> + if (page_ref_tracepoint_active(__tracepoint_page_ref_mod))
>> + __page_ref_mod(page, -nr);
>
> Shouldn't it be __page_ref_mod_and_return() and relevant tracepoint?

Why yes, it should. I didn't even notice that that more precise function existed,
thanks for catching that. I've changed it to this for the next version of the
patchset:

static inline int page_ref_sub_return(struct page *page, int nr)
{
int ret = atomic_sub_return(nr, &page->_refcount);

if (page_ref_tracepoint_active(__tracepoint_page_ref_mod))
__page_ref_mod_and_return(page, -nr, ret);
return ret;
}
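
For readers without a kernel tree at hand, the "subtract and return the result" semantics can be sketched in user space with C11 atomics. This is not the kernel implementation: struct fake_page and fake_page_ref_sub_return() are hypothetical stand-ins, and the tracepoint call is left out.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for struct page; only the refcount is modeled. */
struct fake_page {
	atomic_int refcount;
};

/*
 * Subtract nr references and return the resulting count.
 *
 * atomic_fetch_sub() returns the value held *before* the subtraction,
 * so nr is subtracted once more to obtain the new value. The result is
 * a snapshot: other threads may change the counter right afterwards.
 */
static int fake_page_ref_sub_return(struct fake_page *page, int nr)
{
	return atomic_fetch_sub(&page->refcount, nr) - nr;
}
```

A caller doing 1-based refcounting would compare the returned value against 1 rather than against 0.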

thanks,
--
John Hubbard
NVIDIA

>
>> +
>> + return ret;
>> +}
>> +
>> static inline void page_ref_inc(struct page *page)
>> {
>> atomic_inc(&page->_refcount);
>> --
>> 2.25.0
>>
>

2020-02-03 21:18:55

by John Hubbard

Subject: Re: [PATCH v3 11/12] mm/gup_benchmark: support pin_user_pages() and related calls

On 2/3/20 5:58 AM, Kirill A. Shutemov wrote:
...
>> @@ -19,6 +21,48 @@ struct gup_benchmark {
>> __u64 expansion[10]; /* For future use */
>> };
>>
>> +static void put_back_pages(unsigned int cmd, struct page **pages,
>> + unsigned long nr_pages)
>> +{
>> + int i;
>> +
>> + switch (cmd) {
>> + case GUP_FAST_BENCHMARK:
>> + case GUP_LONGTERM_BENCHMARK:
>> + case GUP_BENCHMARK:
>> + for (i = 0; i < nr_pages; i++)
>
> 'i' is 'int' and 'nr_pages' is 'unsigned long'.
> There's space for trouble :P
>

Yes, I've changed it to "unsigned int", thanks.

>> + put_page(pages[i]);
>> + break;
>> +
>> + case PIN_FAST_BENCHMARK:
>> + case PIN_BENCHMARK:
>> + unpin_user_pages(pages, nr_pages);
>> + break;
>> + }
>> +}
>> +
>> +static void verify_dma_pinned(unsigned int cmd, struct page **pages,
>> + unsigned long nr_pages)
>> +{
>> + int i;
>> + struct page *page;
>> +
>> + switch (cmd) {
>> + case PIN_FAST_BENCHMARK:
>> + case PIN_BENCHMARK:
>> + for (i = 0; i < nr_pages; i++) {
>
> Ditto.
>

Fixed here also.

>> + page = pages[i];
>> + if (WARN(!page_maybe_dma_pinned(page),
>> + "pages[%d] is NOT dma-pinned\n", i)) {

...and changed to "pages[%u]", to match.

thanks,
--
John Hubbard
NVIDIA

>> +
>> + dump_page(page, "gup_benchmark failure");
>> + break;
>> + }
>> + }
>> + break;
>> + }
>> +}
>> +
>> static int __gup_benchmark_ioctl(unsigned int cmd,
>> struct gup_benchmark *gup)
>> {
>

2020-02-03 21:56:46

by Kirill A. Shutemov

Subject: Re: [PATCH v3 11/12] mm/gup_benchmark: support pin_user_pages() and related calls

On Mon, Feb 03, 2020 at 01:17:40PM -0800, John Hubbard wrote:
> On 2/3/20 5:58 AM, Kirill A. Shutemov wrote:
> ...
> >> @@ -19,6 +21,48 @@ struct gup_benchmark {
> >> __u64 expansion[10]; /* For future use */
> >> };
> >>
> >> +static void put_back_pages(unsigned int cmd, struct page **pages,
> >> + unsigned long nr_pages)
> >> +{
> >> + int i;
> >> +
> >> + switch (cmd) {
> >> + case GUP_FAST_BENCHMARK:
> >> + case GUP_LONGTERM_BENCHMARK:
> >> + case GUP_BENCHMARK:
> >> + for (i = 0; i < nr_pages; i++)
> >
> > 'i' is 'int' and 'nr_pages' is 'unsigned long'.
> > There's space for trouble :P
> >
>
> Yes, I've changed it to "unsigned int", thanks.

I'm confused. If nr_pages is more than UINT_MAX, this is an endless loop.
Hm?

--
Kirill A. Shutemov

2020-02-03 22:09:18

by John Hubbard

Subject: Re: [PATCH v3 11/12] mm/gup_benchmark: support pin_user_pages() and related calls

On 2/3/20 1:55 PM, Kirill A. Shutemov wrote:
> On Mon, Feb 03, 2020 at 01:17:40PM -0800, John Hubbard wrote:
>> On 2/3/20 5:58 AM, Kirill A. Shutemov wrote:
>> ...
>>>> @@ -19,6 +21,48 @@ struct gup_benchmark {
>>>> __u64 expansion[10]; /* For future use */
>>>> };
>>>>
>>>> +static void put_back_pages(unsigned int cmd, struct page **pages,
>>>> + unsigned long nr_pages)
>>>> +{
>>>> + int i;
>>>> +
>>>> + switch (cmd) {
>>>> + case GUP_FAST_BENCHMARK:
>>>> + case GUP_LONGTERM_BENCHMARK:
>>>> + case GUP_BENCHMARK:
>>>> + for (i = 0; i < nr_pages; i++)
>>>
>>> 'i' is 'int' and 'nr_pages' is 'unsigned long'.
>>> There's space for trouble :P
>>>
>>
>> Yes, I've changed it to "unsigned int", thanks.
>
> I'm confused. If nr_pages is more than UINT_MAX, this is an endless loop.
> Hm?
>

Oh, I've been afflicted with 64-bit tunnel vision. OK, make that
"unsigned long" and "%lu". yikes. :)
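
The index-width issue can be reproduced outside the kernel. The following is a user-space sketch, not the gup_benchmark code: uint_index_terminates(), count_iterations() and format_index() are hypothetical helpers written for illustration.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Would "for (unsigned int i = 0; i < nr_pages; i++)" ever finish?
 * If nr_pages exceeds UINT_MAX, i wraps back to 0 before reaching it,
 * so the condition "i < nr_pages" stays true forever.
 */
static bool uint_index_terminates(unsigned long nr_pages)
{
	return nr_pages <= UINT_MAX;
}

/* The safe form: the index has the same type as the bound. */
static unsigned long count_iterations(unsigned long nr_pages)
{
	unsigned long i, visited = 0;

	for (i = 0; i < nr_pages; i++)
		visited++;
	return visited;
}

/* "%lu" is the printf conversion that matches an unsigned long index. */
static int format_index(char *buf, size_t len, unsigned long i)
{
	return snprintf(buf, len, "pages[%lu] is NOT dma-pinned", i);
}
```

On platforms where unsigned long is wider than unsigned int (typical 64-bit Linux), uint_index_terminates() returns false for any count above UINT_MAX, which is the endless-loop case raised in the review.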

thanks,
--
John Hubbard
NVIDIA