Hi,
Here's another update after another round of reviews from Kirill and Jan.
There is a git repo and branch, for convenience in reviewing:
[email protected]:johnhubbard/linux.git track_user_pages_v5
============================================================
Changes since v4:
* Added documentation about the huge page behavior of the new
/proc/vmstat items.
* Added a missing mode_node_page_state() call to put_compound_head().
* Fixed a tracepoint call in page_ref_sub_return().
* Added a trailing underscore to a URL in pin_user_pages.rst, to fix
a broken generated link.
* Added ACKs and reviewed-by's from Jan Kara and Kirill Shutemov.
* Rebased onto today's linux.git, and
* I am experimenting here with "git format-patch --base=<commit>".
This generated the "base-commit:" tag you'll see at the end of this
cover letter. I was inspired to do so after trying out a new
get-lore-mbox.py tool (it's very nice), mentioned in a recent LWN
article (https://lwn.net/Articles/811528/ ). That tool relies on the
base-commit tag for some things.
============================================================
Changes since v3:
* Rebased onto latest linux.git
* Added ACKs and reviewed-by's from Kirill Shutemov and Jan Kara.
* /proc/vmstat:
* Renamed items, after realizing that I hate the previous names:
nr_foll_pin_requested --> nr_foll_pin_acquired
nr_foll_pin_returned --> nr_foll_pin_released
* Removed the CONFIG_DEBUG_VM guard, and collapsed away a wrapper
routine: now just calls mod_node_page_state() directly.
* Tweaked the WARN_ON_ONCE() statements in mm/hugetlb.c to be more
informative, and added comments above them as well.
* Fixed gup_benchmark: signed int --> unsigned long.
* One or two minor formatting changes.
============================================================
Changes since v2:
* Rebased onto linux.git, because the akpm tree for 5.6 has been merged.
* Split the tracking patch into even more patches, as requested.
* Merged Matthew Wilcox's dump_page() changes into mine, as part of the
first patch.
* Renamed: page_dma_pinned() --> page_maybe_dma_pinned(), in response to
Kirill Shutemov's review.
* Moved a WARN to the top of a routine, and fixed a typo in the commit
description of patch #7, also as suggested by Kirill.
============================================================
Changes since v1:
* Split the tracking patch into 6 smaller patches
* Rebased onto today's linux-next/akpm (there weren't any conflicts).
* Fixed an "unsigned int" vs. "int" problem in gup_benchmark, reported
by Nathan Chancellor. (I don't see it in my local builds, probably
because they use gcc, but an LLVM test found the mismatch.)
* Fixed a huge page pincount problem (add/subtract vs.
increment/decrement), spotted by Jan Kara.
============================================================
There is a reasonable case to be made for merging two of the patches
(patches 7 and 8), given that patch 7 provides tracking that has upper
limits on the number of pins that can be done with huge pages. Let me
know if anyone wants those merged, but unless there is some weird chance
of someone grabbing patch 7 and not patch 8, I don't really see the
need. Meanwhile, it's easier to review in this form.
Also, patch 3 has been revived. Earlier reviewers asked for it to be
merged into the tracking patch (one cannot please everyone, heh), but
now it's back out on it's own.
This activates tracking of FOLL_PIN pages. This is in support of fixing
the get_user_pages()+DMA problem described in [1]-[4].
FOLL_PIN support is now in the main linux tree. However, the
patch to use FOLL_PIN to track pages was *not* submitted, because Leon
saw an RDMA test suite failure that involved (I think) page refcount
overflows when huge pages were used.
This patch definitively solves that kind of overflow problem, by adding
an exact pincount, for compound pages (of order > 1), in the 3rd struct
page of a compound page. If available, that form of pincounting is used,
instead of the GUP_PIN_COUNTING_BIAS approach. Thanks again to Jan Kara
for that idea.
Other interesting changes:
* dump_page(): added one, or two new things to report for compound
pages: head refcount (for all compound pages), and map_pincount (for
compound pages of order > 1).
* Documentation/core-api/pin_user_pages.rst: removed the "TODO" for the
huge page refcount upper limit problems, and added notes about how it
works now. Also added a note about the dump_page() enhancements.
* Added some comments in gup.c and mm.h, to explain that there are two
ways to count pinned pages: exact (for compound pages of order > 1)
and fuzzy (GUP_PIN_COUNTING_BIAS: for all other pages).
============================================================
General notes about the tracking patch:
This is a prerequisite to solving the problem of proper interactions
between file-backed pages, and [R]DMA activities, as discussed in [1],
[2], [3], [4] and in a remarkable number of email threads since about
2017. :)
In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:
get_user_pages() (sets FOLL_GET)
put_page()
to this:
pin_user_pages() (sets FOLL_PIN)
unpin_user_page()
============================================================
Next steps:
* Convert more subsystems from get_user_pages() to pin_user_pages().
* Work with Ira and others to connect this all up with file system
leases.
[1] Some slow progress on get_user_pages() (Apr 2, 2019):
https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018):
https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018):
https://lwn.net/Articles/753027/
[4] LWN kernel index: get_user_pages()
https://lwn.net/Kernel/Index/#Memory_management-get_user_pages
John Hubbard (12):
mm: dump_page(): better diagnostics for compound pages
mm/gup: split get_user_pages_remote() into two routines
mm/gup: pass a flags arg to __gup_device_* functions
mm: introduce page_ref_sub_return()
mm/gup: pass gup flags to two more routines
mm/gup: require FOLL_GET for get_user_pages_fast()
mm/gup: track FOLL_PIN pages
mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages
mm: dump_page(): better diagnostics for huge pinned pages
mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting
mm/gup_benchmark: support pin_user_pages() and related calls
selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN
coverage
Documentation/core-api/pin_user_pages.rst | 86 ++--
include/linux/mm.h | 108 ++++-
include/linux/mm_types.h | 7 +-
include/linux/mmzone.h | 2 +
include/linux/page_ref.h | 9 +
mm/debug.c | 61 ++-
mm/gup.c | 452 ++++++++++++++++-----
mm/gup_benchmark.c | 71 +++-
mm/huge_memory.c | 29 +-
mm/hugetlb.c | 60 ++-
mm/page_alloc.c | 2 +
mm/rmap.c | 6 +
mm/vmstat.c | 2 +
tools/testing/selftests/vm/gup_benchmark.c | 15 +-
tools/testing/selftests/vm/run_vmtests | 22 +
15 files changed, 752 insertions(+), 180 deletions(-)
base-commit: 90568ecf561540fa330511e21fcd823b0c3829c6
--
2.25.0
An upcoming patch requires reusing the implementation of
get_user_pages_remote(). Split up get_user_pages_remote() into an outer
routine that checks flags, and an implementation routine that will be
reused. This makes subsequent changes much easier to understand.
There should be no change in behavior due to this patch.
Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
---
mm/gup.c | 56 +++++++++++++++++++++++++++++++++-----------------------
1 file changed, 33 insertions(+), 23 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index 1b521e0ac1de..b699500da077 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1557,6 +1557,37 @@ static __always_inline long __gup_longterm_locked(struct task_struct *tsk,
}
#endif /* CONFIG_FS_DAX || CONFIG_CMA */
+#ifdef CONFIG_MMU
+static long __get_user_pages_remote(struct task_struct *tsk,
+ struct mm_struct *mm,
+ unsigned long start, unsigned long nr_pages,
+ unsigned int gup_flags, struct page **pages,
+ struct vm_area_struct **vmas, int *locked)
+{
+ /*
+ * Parts of FOLL_LONGTERM behavior are incompatible with
+ * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
+ * vmas. However, this only comes up if locked is set, and there are
+ * callers that do request FOLL_LONGTERM, but do not set locked. So,
+ * allow what we can.
+ */
+ if (gup_flags & FOLL_LONGTERM) {
+ if (WARN_ON_ONCE(locked))
+ return -EINVAL;
+ /*
+ * This will check the vmas (even if our vmas arg is NULL)
+ * and return -ENOTSUPP if DAX isn't allowed in this case:
+ */
+ return __gup_longterm_locked(tsk, mm, start, nr_pages, pages,
+ vmas, gup_flags | FOLL_TOUCH |
+ FOLL_REMOTE);
+ }
+
+ return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
+ locked,
+ gup_flags | FOLL_TOUCH | FOLL_REMOTE);
+}
+
/*
* get_user_pages_remote() - pin user pages in memory
* @tsk: the task_struct to use for page fault accounting, or
@@ -1619,7 +1650,6 @@ static __always_inline long __gup_longterm_locked(struct task_struct *tsk,
* should use get_user_pages because it cannot pass
* FAULT_FLAG_ALLOW_RETRY to handle_mm_fault.
*/
-#ifdef CONFIG_MMU
long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, unsigned long nr_pages,
unsigned int gup_flags, struct page **pages,
@@ -1632,28 +1662,8 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
return -EINVAL;
- /*
- * Parts of FOLL_LONGTERM behavior are incompatible with
- * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
- * vmas. However, this only comes up if locked is set, and there are
- * callers that do request FOLL_LONGTERM, but do not set locked. So,
- * allow what we can.
- */
- if (gup_flags & FOLL_LONGTERM) {
- if (WARN_ON_ONCE(locked))
- return -EINVAL;
- /*
- * This will check the vmas (even if our vmas arg is NULL)
- * and return -ENOTSUPP if DAX isn't allowed in this case:
- */
- return __gup_longterm_locked(tsk, mm, start, nr_pages, pages,
- vmas, gup_flags | FOLL_TOUCH |
- FOLL_REMOTE);
- }
-
- return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
- locked,
- gup_flags | FOLL_TOUCH | FOLL_REMOTE);
+ return __get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags,
+ pages, vmas, locked);
}
EXPORT_SYMBOL(get_user_pages_remote);
--
2.25.0
In preparation for an upcoming patch, send gup flags args to two more
routines: put_compound_head(), and undo_dev_pagemap().
Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
---
mm/gup.c | 19 ++++++++++---------
1 file changed, 10 insertions(+), 9 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index 9e117998274c..e5f75e886663 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1870,6 +1870,7 @@ static inline pte_t gup_get_pte(pte_t *ptep)
#endif /* CONFIG_GUP_GET_PTE_LOW_HIGH */
static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
+ unsigned int flags,
struct page **pages)
{
while ((*nr) - nr_start) {
@@ -1909,7 +1910,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
if (unlikely(!pgmap)) {
- undo_dev_pagemap(nr, nr_start, pages);
+ undo_dev_pagemap(nr, nr_start, flags, pages);
goto pte_unmap;
}
} else if (pte_special(pte))
@@ -1974,7 +1975,7 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
pgmap = get_dev_pagemap(pfn, pgmap);
if (unlikely(!pgmap)) {
- undo_dev_pagemap(nr, nr_start, pages);
+ undo_dev_pagemap(nr, nr_start, flags, pages);
return 0;
}
SetPageReferenced(page);
@@ -2001,7 +2002,7 @@ static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
return 0;
if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
- undo_dev_pagemap(nr, nr_start, pages);
+ undo_dev_pagemap(nr, nr_start, flags, pages);
return 0;
}
return 1;
@@ -2019,7 +2020,7 @@ static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
return 0;
if (unlikely(pud_val(orig) != pud_val(*pudp))) {
- undo_dev_pagemap(nr, nr_start, pages);
+ undo_dev_pagemap(nr, nr_start, flags, pages);
return 0;
}
return 1;
@@ -2053,7 +2054,7 @@ static int record_subpages(struct page *page, unsigned long addr,
return nr;
}
-static void put_compound_head(struct page *page, int refs)
+static void put_compound_head(struct page *page, int refs, unsigned int flags)
{
VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
/*
@@ -2103,7 +2104,7 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
return 0;
if (unlikely(pte_val(pte) != pte_val(*ptep))) {
- put_compound_head(head, refs);
+ put_compound_head(head, refs, flags);
return 0;
}
@@ -2163,7 +2164,7 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
return 0;
if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
- put_compound_head(head, refs);
+ put_compound_head(head, refs, flags);
return 0;
}
@@ -2197,7 +2198,7 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
return 0;
if (unlikely(pud_val(orig) != pud_val(*pudp))) {
- put_compound_head(head, refs);
+ put_compound_head(head, refs, flags);
return 0;
}
@@ -2226,7 +2227,7 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
return 0;
if (unlikely(pgd_val(orig) != pgd_val(*pgdp))) {
- put_compound_head(head, refs);
+ put_compound_head(head, refs, flags);
return 0;
}
--
2.25.0
Now that pages are "DMA-pinned" via pin_user_page*(), and unpinned via
unpin_user_pages*(), we need some visibility into whether all of this is
working correctly.
Add two new fields to /proc/vmstat:
nr_foll_pin_acquired
nr_foll_pin_released
These are documented in Documentation/core-api/pin_user_pages.rst.
They represent the number of pages (since boot time) that have been
pinned ("nr_foll_pin_acquired") and unpinned ("nr_foll_pin_released"),
via pin_user_pages*() and unpin_user_pages*().
In the absence of long-running DMA or RDMA operations that hold pages
pinned, the above two fields will normally be equal to each other.
Also: update Documentation/core-api/pin_user_pages.rst, to remove an
earlier (now confirmed untrue) claim about a performance problem with
/proc/vmstat.
Also: updated Documentation/core-api/pin_user_pages.rst to rename the
new /proc/vmstat entries, to the names listed here.
Signed-off-by: John Hubbard <[email protected]>
---
Documentation/core-api/pin_user_pages.rst | 33 +++++++++++++++++++----
include/linux/mmzone.h | 2 ++
mm/gup.c | 14 ++++++++++
mm/vmstat.c | 2 ++
4 files changed, 46 insertions(+), 5 deletions(-)
diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
index 24641f1a1eba..2e939ff10b86 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -208,12 +208,35 @@ has the following new calls to exercise the new pin*() wrapper functions:
You can monitor how many total dma-pinned pages have been acquired and released
since the system was booted, via two new /proc/vmstat entries: ::
- /proc/vmstat/nr_foll_pin_requested
- /proc/vmstat/nr_foll_pin_requested
+ /proc/vmstat/nr_foll_pin_acquired
+ /proc/vmstat/nr_foll_pin_released
-Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is
-because there is a noticeable performance drop in unpin_user_page(), when they
-are activated.
+Under normal conditions, these two values will be equal unless there are any
+long-term [R]DMA pins in place, or during pin/unpin transitions.
+
+* nr_foll_pin_acquired: This is the number of logical pins that have been
+ acquired since the system was powered on. For huge pages, the head page is
+ pinned once for each page (head page and each tail page) within the huge page.
+ This follows the same sort of behavior that get_user_pages() uses for huge
+ pages: the head page is refcounted once for each tail or head page in the huge
+ page, when get_user_pages() is applied to a huge page.
+
+* nr_foll_pin_released: The number of logical pins that have been released since
+ the system was powered on. Note that pages are released (unpinned) on a
+ PAGE_SIZE granularity, even if the original pin was applied to a huge page.
+ Becaused of the pin count behavior described above in "nr_foll_pin_acquired",
+ the accounting balances out, so that after doing this::
+
+ pin_user_pages(huge_page);
+ for (each page in huge_page)
+ unpin_user_page(page);
+
+...the following is expected::
+
+ nr_foll_pin_released == nr_foll_pin_acquired
+
+(...unless it was already out of balance due to a long-term RDMA pin being in
+place.)
Other diagnostics
=================
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 462f6873905a..4bca42eeb439 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -243,6 +243,8 @@ enum node_stat_item {
NR_DIRTIED, /* page dirtyings since bootup */
NR_WRITTEN, /* page writings since bootup */
NR_KERNEL_MISC_RECLAIMABLE, /* reclaimable non-slab kernel pages */
+ NR_FOLL_PIN_ACQUIRED, /* via: pin_user_page(), gup flag: FOLL_PIN */
+ NR_FOLL_PIN_RELEASED, /* pages returned via unpin_user_page() */
NR_VM_NODE_STAT_ITEMS
};
diff --git a/mm/gup.c b/mm/gup.c
index 4d0d94405639..ae503c51bc7f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -86,6 +86,8 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page,
if (flags & FOLL_GET)
return try_get_compound_head(page, refs);
else if (flags & FOLL_PIN) {
+ int orig_refs = refs;
+
/*
* When pinning a compound page of order > 1 (which is what
* hpage_pincount_available() checks for), use an exact count to
@@ -104,6 +106,9 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page,
if (hpage_pincount_available(page))
hpage_pincount_add(page, refs);
+ mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_ACQUIRED,
+ orig_refs);
+
return page;
}
@@ -158,6 +163,8 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
* once, so that the page really is pinned.
*/
page_ref_add(page, refs);
+
+ mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_ACQUIRED, 1);
}
return true;
@@ -178,6 +185,7 @@ static bool __unpin_devmap_managed_user_page(struct page *page)
count = page_ref_sub_return(page, refs);
+ mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
/*
* devmap page refcounts are 1-based, rather than 0-based: if
* refcount is 1, then the page is free and the refcount is
@@ -228,6 +236,8 @@ void unpin_user_page(struct page *page)
if (page_ref_sub_and_test(page, refs))
__put_page(page);
+
+ mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
}
EXPORT_SYMBOL(unpin_user_page);
@@ -2258,6 +2268,8 @@ static int record_subpages(struct page *page, unsigned long addr,
static void put_compound_head(struct page *page, int refs, unsigned int flags)
{
+ int orig_refs = refs;
+
if (flags & FOLL_PIN) {
if (hpage_pincount_available(page))
hpage_pincount_sub(page, refs);
@@ -2273,6 +2285,8 @@ static void put_compound_head(struct page *page, int refs, unsigned int flags)
if (refs > 1)
page_ref_sub(page, refs - 1);
put_page(page);
+
+ mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, orig_refs);
}
#ifdef CONFIG_ARCH_HAS_HUGEPD
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 78d53378db99..c9c0d71f917f 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1168,6 +1168,8 @@ const char * const vmstat_text[] = {
"nr_dirtied",
"nr_written",
"nr_kernel_misc_reclaimable",
+ "nr_foll_pin_acquired",
+ "nr_foll_pin_released",
/* enum writeback_stat_item counters */
"nr_dirty_threshold",
--
2.25.0
Internal to mm/gup.c, require that get_user_pages_fast()
and __get_user_pages_fast() identify themselves, by setting
FOLL_GET. This is required in order to be able to make decisions
based on "FOLL_PIN, or FOLL_GET, or both or neither are set", in
upcoming patches.
Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
---
mm/gup.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index e5f75e886663..c8affbea2019 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2390,6 +2390,14 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
unsigned long len, end;
unsigned long flags;
int nr = 0;
+ /*
+ * Internally (within mm/gup.c), gup fast variants must set FOLL_GET,
+ * because gup fast is always a "pin with a +1 page refcount" request.
+ */
+ unsigned int gup_flags = FOLL_GET;
+
+ if (write)
+ gup_flags |= FOLL_WRITE;
start = untagged_addr(start) & PAGE_MASK;
len = (unsigned long) nr_pages << PAGE_SHIFT;
@@ -2415,7 +2423,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
gup_fast_permitted(start, end)) {
local_irq_save(flags);
- gup_pgd_range(start, end, write ? FOLL_WRITE : 0, pages, &nr);
+ gup_pgd_range(start, end, gup_flags, pages, &nr);
local_irq_restore(flags);
}
@@ -2454,7 +2462,7 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
int nr = 0, ret = 0;
if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
- FOLL_FORCE | FOLL_PIN)))
+ FOLL_FORCE | FOLL_PIN | FOLL_GET)))
return -EINVAL;
start = untagged_addr(start) & PAGE_MASK;
@@ -2521,6 +2529,13 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
return -EINVAL;
+ /*
+ * The caller may or may not have explicitly set FOLL_GET; either way is
+ * OK. However, internally (within mm/gup.c), gup fast variants must set
+ * FOLL_GET, because gup fast is always a "pin with a +1 page refcount"
+ * request.
+ */
+ gup_flags |= FOLL_GET;
return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
}
EXPORT_SYMBOL_GPL(get_user_pages_fast);
--
2.25.0
As part of pin_user_pages() and related API calls, pages are
"dma-pinned". For the case of compound pages of order > 1, the per-page
accounting of dma pins is accomplished via the 3rd struct page in the
compound page. In order to support debugging of any pin_user_pages()-
related problems, enhance dump_page() so as to report the pin count
in that case.
Documentation/core-api/pin_user_pages.rst is also updated accordingly.
Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
---
Documentation/core-api/pin_user_pages.rst | 7 +++++
mm/debug.c | 34 +++++++++++++++++------
2 files changed, 33 insertions(+), 8 deletions(-)
diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
index 7e5dd8b1b3f2..24641f1a1eba 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -215,6 +215,13 @@ Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is
because there is a noticeable performance drop in unpin_user_page(), when they
are activated.
+Other diagnostics
+=================
+
+dump_page() has been enhanced slightly, to handle these new counting fields, and
+to better report on compound pages in general. Specifically, for compound pages
+with order > 1, the exact (hpage_pinned_refcount) pincount is reported.
+
References
==========
diff --git a/mm/debug.c b/mm/debug.c
index f074077eee11..e82c878c27df 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -57,10 +57,20 @@ static void __dump_tail_page(struct page *page, int mapcount)
page, page_ref_count(page), mapcount, page->mapping,
page_to_pgoff(page));
} else {
- pr_warn("page:%px compound refcount:%d mapcount:%d mapping:%px "
- "index:%#lx compound_mapcount:%d\n",
- page, page_ref_count(head), mapcount, head->mapping,
- page_to_pgoff(head), compound_mapcount(page));
+ if (hpage_pincount_available(page))
+ pr_warn("page:%px compound refcount:%d mapcount:%d "
+ "mapping:%px index:%#lx compound_mapcount:%d "
+ "compound_pincount:%d\n",
+ page, page_ref_count(head), mapcount,
+ head->mapping, page_to_pgoff(head),
+ compound_mapcount(page),
+ compound_pincount(page));
+ else
+ pr_warn("page:%px compound refcount:%d mapcount:%d "
+ "mapping:%px index:%#lx compound_mapcount:%d\n",
+ page, page_ref_count(head), mapcount,
+ head->mapping, page_to_pgoff(head),
+ compound_mapcount(page));
}
if (page_ref_count(page) != 0) {
@@ -104,10 +114,18 @@ void __dump_page(struct page *page, const char *reason)
if (PageTail(page))
__dump_tail_page(page, mapcount);
- else
- pr_warn("page:%px refcount:%d mapcount:%d mapping:%px index:%#lx\n",
- page, page_ref_count(page), mapcount,
- page->mapping, page_to_pgoff(page));
+ else {
+ if (hpage_pincount_available(page))
+ pr_warn("page:%px refcount:%d mapcount:%d mapping:%px "
+ "index:%#lx compound pincount: %d\n",
+ page, page_ref_count(page), mapcount,
+ page->mapping, page_to_pgoff(page),
+ compound_pincount(page));
+ else
+ pr_warn("page:%px refcount:%d mapcount:%d mapping:%px "
+ "index:%#lx\n", page, page_ref_count(page),
+ mapcount, page->mapping, page_to_pgoff(page));
+ }
if (PageKsm(page))
type = "ksm ";
else if (PageAnon(page))
--
2.25.0
A compound page collects the refcount in the head page, while leaving
the refcount of each tail page at zero. Therefore, when debugging a
problem that involves compound pages, it's best to have diagnostics that
reflect that situation. However, dump_page() is oblivious to these
points.
Change dump_page() as follows:
1) For tail pages, print relevant head page information: refcount, in
particular. But only do this if the page is not corrupted so badly
that the pointer to the head page is all wrong.
2) Do a separate check to catch any (rare) cases of the tail page's
refcount being non-zero, and issue a separate, clear pr_warn() if
that ever happens.
Suggested-by: Matthew Wilcox <[email protected]>
Suggested-by: Kirill A. Shutemov <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
---
mm/debug.c | 35 +++++++++++++++++++++++++++++------
1 file changed, 29 insertions(+), 6 deletions(-)
diff --git a/mm/debug.c b/mm/debug.c
index ecccd9f17801..f074077eee11 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -42,6 +42,33 @@ const struct trace_print_flags vmaflag_names[] = {
{0, NULL}
};
+static void __dump_tail_page(struct page *page, int mapcount)
+{
+ struct page *head = compound_head(page);
+
+ if ((page < head) || (page >= head + MAX_ORDER_NR_PAGES)) {
+ /*
+ * Page is hopelessly corrupted, so limit any reporting to
+ * information about the page itself. Do not attempt to look at
+ * the head page.
+ */
+ pr_warn("page:%px refcount:%d mapcount:%d mapping:%px "
+ "index:%#lx (corrupted tail page case)\n",
+ page, page_ref_count(page), mapcount, page->mapping,
+ page_to_pgoff(page));
+ } else {
+ pr_warn("page:%px compound refcount:%d mapcount:%d mapping:%px "
+ "index:%#lx compound_mapcount:%d\n",
+ page, page_ref_count(head), mapcount, head->mapping,
+ page_to_pgoff(head), compound_mapcount(page));
+ }
+
+ if (page_ref_count(page) != 0) {
+ pr_warn("page:%px PROBLEM: non-zero refcount (==%d) on this "
+ "tail page\n", page, page_ref_count(page));
+ }
+}
+
void __dump_page(struct page *page, const char *reason)
{
struct address_space *mapping;
@@ -75,12 +102,8 @@ void __dump_page(struct page *page, const char *reason)
*/
mapcount = PageSlab(page) ? 0 : page_mapcount(page);
- if (PageCompound(page))
- pr_warn("page:%px refcount:%d mapcount:%d mapping:%px "
- "index:%#lx compound_mapcount: %d\n",
- page, page_ref_count(page), mapcount,
- page->mapping, page_to_pgoff(page),
- compound_mapcount(page));
+ if (PageTail(page))
+ __dump_tail_page(page, mapcount);
else
pr_warn("page:%px refcount:%d mapcount:%d mapping:%px index:%#lx\n",
page, page_ref_count(page), mapcount,
--
2.25.0
On Thu, Feb 06, 2020 at 07:37:33PM -0800, John Hubbard wrote:
> Now that pages are "DMA-pinned" via pin_user_page*(), and unpinned via
> unpin_user_pages*(), we need some visibility into whether all of this is
> working correctly.
>
> Add two new fields to /proc/vmstat:
>
> nr_foll_pin_acquired
> nr_foll_pin_released
>
> These are documented in Documentation/core-api/pin_user_pages.rst.
> They represent the number of pages (since boot time) that have been
> pinned ("nr_foll_pin_acquired") and unpinned ("nr_foll_pin_released"),
> via pin_user_pages*() and unpin_user_pages*().
>
> In the absence of long-running DMA or RDMA operations that hold pages
> pinned, the above two fields will normally be equal to each other.
>
> Also: update Documentation/core-api/pin_user_pages.rst, to remove an
> earlier (now confirmed untrue) claim about a performance problem with
> /proc/vmstat.
>
> Also: updated Documentation/core-api/pin_user_pages.rst to rename the
> new /proc/vmstat entries, to the names listed here.
>
> Signed-off-by: John Hubbard <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
--
Kirill A. Shutemov
On Thu, Feb 06, 2020 at 07:37:24PM -0800, John Hubbard wrote:
> A compound page collects the refcount in the head page, while leaving
> the refcount of each tail page at zero. Therefore, when debugging a
> problem that involves compound pages, it's best to have diagnostics that
> reflect that situation. However, dump_page() is oblivious to these
> points.
>
> Change dump_page() as follows:
>
> 1) For tail pages, print relevant head page information: refcount, in
> particular. But only do this if the page is not corrupted so badly
> that the pointer to the head page is all wrong.
>
> 2) Do a separate check to catch any (rare) cases of the tail page's
> refcount being non-zero, and issue a separate, clear pr_warn() if
> that ever happens.
>
> Suggested-by: Matthew Wilcox <[email protected]>
> Suggested-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> Signed-off-by: John Hubbard <[email protected]>
> ---
> mm/debug.c | 35 +++++++++++++++++++++++++++++------
> 1 file changed, 29 insertions(+), 6 deletions(-)
>
> diff --git a/mm/debug.c b/mm/debug.c
> index ecccd9f17801..f074077eee11 100644
> --- a/mm/debug.c
> +++ b/mm/debug.c
> @@ -42,6 +42,33 @@ const struct trace_print_flags vmaflag_names[] = {
> {0, NULL}
> };
>
> +static void __dump_tail_page(struct page *page, int mapcount)
> +{
> + struct page *head = compound_head(page);
> +
> + if ((page < head) || (page >= head + MAX_ORDER_NR_PAGES)) {
> + /*
> + * Page is hopelessly corrupted, so limit any reporting to
> + * information about the page itself. Do not attempt to look at
> + * the head page.
> + */
> + pr_warn("page:%px refcount:%d mapcount:%d mapping:%px "
> + "index:%#lx (corrupted tail page case)\n",
> + page, page_ref_count(page), mapcount, page->mapping,
> + page_to_pgoff(page));
> + } else {
> + pr_warn("page:%px compound refcount:%d mapcount:%d mapping:%px "
> + "index:%#lx compound_mapcount:%d\n",
> + page, page_ref_count(head), mapcount, head->mapping,
> + page_to_pgoff(head), compound_mapcount(page));
> + }
> +
> + if (page_ref_count(page) != 0) {
> + pr_warn("page:%px PROBLEM: non-zero refcount (==%d) on this "
> + "tail page\n", page, page_ref_count(page));
> + }
> +}
> +
> void __dump_page(struct page *page, const char *reason)
> {
> struct address_space *mapping;
> @@ -75,12 +102,8 @@ void __dump_page(struct page *page, const char *reason)
> */
> mapcount = PageSlab(page) ? 0 : page_mapcount(page);
>
> - if (PageCompound(page))
> - pr_warn("page:%px refcount:%d mapcount:%d mapping:%px "
> - "index:%#lx compound_mapcount: %d\n",
> - page, page_ref_count(page), mapcount,
> - page->mapping, page_to_pgoff(page),
> - compound_mapcount(page));
> + if (PageTail(page))
> + __dump_tail_page(page, mapcount);
> else
> pr_warn("page:%px refcount:%d mapcount:%d mapping:%px index:%#lx\n",
> page, page_ref_count(page), mapcount,
A definite improvement, but I think we could do better. For example,
you've changed PageCompound to PageTail here, whereas we really do want
to dump some more information for PageHead pages than the plain vanilla
order-0 page has. Another thing is that page_mapping() calls compound_head(),
so if the page is corrupted, we're going to get a funky pointer dereference.
I spent a bit of time on this reimplementation ... what do you think?
- Print the mapping pointer using %p insted of %px. The actual value of
the pointer can be read out of the raw page dump and using %p gives a
chance to correlate it to earlier printk of the mapping pointer.
- Add the order of the page for compound pages
- Dump the raw head page as well as the raw page being dumped
diff --git a/mm/debug.c b/mm/debug.c
index ecccd9f17801..0564d4cb8233 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -44,8 +44,10 @@ const struct trace_print_flags vmaflag_names[] = {
void __dump_page(struct page *page, const char *reason)
{
+ struct page *head = compound_head(page);
struct address_space *mapping;
bool page_poisoned = PagePoisoned(page);
+ bool compound = PageCompound(page);
/*
* Accessing the pageblock without the zone lock. It could change to
* "isolate" again in the meantime, but since we are just dumping the
@@ -66,25 +68,32 @@ void __dump_page(struct page *page, const char *reason)
goto hex_only;
}
- mapping = page_mapping(page);
+ if (page < head || (page >= head + MAX_ORDER_NR_PAGES)) {
+ /* Corrupt page, cannot call page_mapping */
+ mapping = page->mapping;
+ head = page;
+ compound = false;
+ } else {
+ mapping = page_mapping(page);
+ }
/*
* Avoid VM_BUG_ON() in page_mapcount().
* page->_mapcount space in struct page is used by sl[aou]b pages to
* encode own info.
*/
- mapcount = PageSlab(page) ? 0 : page_mapcount(page);
+ mapcount = PageSlab(head) ? 0 : page_mapcount(head);
- if (PageCompound(page))
- pr_warn("page:%px refcount:%d mapcount:%d mapping:%px "
- "index:%#lx compound_mapcount: %d\n",
- page, page_ref_count(page), mapcount,
+ if (compound)
+ pr_warn("page:%px head:%px refcount:%d mapcount:%d mapping:%p "
+ "index:%#lx order:%u compound_mapcount: %d\n",
+ page, head, page_ref_count(page), mapcount,
page->mapping, page_to_pgoff(page),
- compound_mapcount(page));
+ compound_order(head), compound_mapcount(page));
else
- pr_warn("page:%px refcount:%d mapcount:%d mapping:%px index:%#lx\n",
+ pr_warn("page:%px refcount:%d mapcount:%d mapping:%p index:%#lx\n",
page, page_ref_count(page), mapcount,
- page->mapping, page_to_pgoff(page));
+ mapping, page_to_pgoff(page));
if (PageKsm(page))
type = "ksm ";
else if (PageAnon(page))
@@ -106,6 +115,10 @@ void __dump_page(struct page *page, const char *reason)
print_hex_dump(KERN_WARNING, "raw: ", DUMP_PREFIX_NONE, 32,
sizeof(unsigned long), page,
sizeof(struct page), false);
+ if (!page_poisoned && compound)
+ print_hex_dump(KERN_WARNING, "head: ", DUMP_PREFIX_NONE, 32,
+ sizeof(unsigned long), head,
+ sizeof(struct page), false);
if (reason)
pr_warn("page dumped because: %s\n", reason);
On 2/7/20 9:27 AM, Matthew Wilcox wrote:
...
>
> A definite improvement, but I think we could do better. For example,
> you've changed PageCompound to PageTail here, whereas we really do want
> to dump some more information for PageHead pages than the plain vanilla
> order-0 page has. Another thing is that page_mapping() calls compound_head(),
> so if the page is corrupted, we're going to get a funky pointer dereference.
>
> I spent a bit of time on this reimplementation ... what do you think?
>
It looks fine to me. I gave it a quick spin, here's the output for a normal
and a huge page, and it has everything we want to see:
page:ffffea0010f0d640 refcount:1025 mapcount:1 mapping:0000000021857089 index:0xed
anon flags: 0x17ffe0000080036(referenced|uptodate|lru|active|swapbacked)
raw: 017ffe0000080036 ffffea0011731f08 ffffea0011730008 ffff8884777272c1
raw: 00000000000000ed 0000000000000000 0000040100000000 0000000000000000
page dumped because: testing dump_page()
page:ffffea0010ef1b80 head:ffffea0010ef0000 refcount:0 mapcount:1 mapping:00000000a8e1c7fa index:0xed order:9 compound_mapcount: 1
anon flags: 0x17ffe0000000000()
raw: 017ffe0000000000 ffffea0010ef0001 ffffea0010ef1b88 dead000000000400
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
head: 017ffe0000090036 ffffea0011734548 ffffea0010ef8008 ffff8884777271b9
head: 000000000000007f 0000000000000000 00000201ffffffff 0000000000000000
page dumped because: testing dump_page()
> - Print the mapping pointer using %p insted of %px. The actual value of
> the pointer can be read out of the raw page dump and using %p gives a
> chance to correlate it to earlier printk of the mapping pointer.
> - Add the order of the page for compound pages
> - Dump the raw head page as well as the raw page being dumped
>
> diff --git a/mm/debug.c b/mm/debug.c
> index ecccd9f17801..0564d4cb8233 100644
> --- a/mm/debug.c
> +++ b/mm/debug.c
> @@ -44,8 +44,10 @@ const struct trace_print_flags vmaflag_names[] = {
>
> void __dump_page(struct page *page, const char *reason)
> {
> + struct page *head = compound_head(page);
> struct address_space *mapping;
> bool page_poisoned = PagePoisoned(page);
> + bool compound = PageCompound(page);
> /*
> * Accessing the pageblock without the zone lock. It could change to
> * "isolate" again in the meantime, but since we are just dumping the
> @@ -66,25 +68,32 @@ void __dump_page(struct page *page, const char *reason)
> goto hex_only;
> }
>
> - mapping = page_mapping(page);
> + if (page < head || (page >= head + MAX_ORDER_NR_PAGES)) {
> + /* Corrupt page, cannot call page_mapping */
> + mapping = page->mapping;
> + head = page;
> + compound = false;
> + } else {
> + mapping = page_mapping(page);
> + }
>
> /*
> * Avoid VM_BUG_ON() in page_mapcount().
> * page->_mapcount space in struct page is used by sl[aou]b pages to
> * encode own info.
> */
> - mapcount = PageSlab(page) ? 0 : page_mapcount(page);
> + mapcount = PageSlab(head) ? 0 : page_mapcount(head);
>
> - if (PageCompound(page))
> - pr_warn("page:%px refcount:%d mapcount:%d mapping:%px "
> - "index:%#lx compound_mapcount: %d\n",
> - page, page_ref_count(page), mapcount,
> + if (compound)
> + pr_warn("page:%px head:%px refcount:%d mapcount:%d mapping:%p "
> + "index:%#lx order:%u compound_mapcount: %d\n",
> + page, head, page_ref_count(page), mapcount,
> page->mapping, page_to_pgoff(page),
> - compound_mapcount(page));
> + compound_order(head), compound_mapcount(page));
> else
> - pr_warn("page:%px refcount:%d mapcount:%d mapping:%px index:%#lx\n",
> + pr_warn("page:%px refcount:%d mapcount:%d mapping:%p index:%#lx\n",
> page, page_ref_count(page), mapcount,
> - page->mapping, page_to_pgoff(page));
> + mapping, page_to_pgoff(page));
> if (PageKsm(page))
> type = "ksm ";
> else if (PageAnon(page))
> @@ -106,6 +115,10 @@ void __dump_page(struct page *page, const char *reason)
> print_hex_dump(KERN_WARNING, "raw: ", DUMP_PREFIX_NONE, 32,
> sizeof(unsigned long), page,
> sizeof(struct page), false);
> + if (!page_poisoned && compound)
> + print_hex_dump(KERN_WARNING, "head: ", DUMP_PREFIX_NONE, 32,
> + sizeof(unsigned long), head,
> + sizeof(struct page), false);
Good thought to get the hex dump of the head page in this case, yes.
>
> if (reason)
> pr_warn("page dumped because: %s\n", reason);
>
Seeing as how I want to further enhance dump_page() slightly for this series (to
include the 3rd struct page's hpage_pincount), would you care to send this as a
formal patch that I could insert into this series, to replace patch 5?
thanks,
--
John Hubbard
NVIDIA
On 2/7/20 1:05 PM, John Hubbard wrote:
...
> Seeing as how I want to further enhance dump_page() slightly for this series (to
> include the 3rd struct page's hpage_pincount), would you care to send this as a
> formal patch that I could insert into this series, to replace patch 5?
>
ahem, make that "to replace patch 1", please. Too many 5's in the subject lines for
me, I guess. :)
thanks,
--
John Hubbard
NVIDIA
On Thu 06-02-20 19:37:33, John Hubbard wrote:
> @@ -2258,6 +2268,8 @@ static int record_subpages(struct page *page, unsigned long addr,
>
> static void put_compound_head(struct page *page, int refs, unsigned int flags)
> {
> + int orig_refs = refs;
> +
> if (flags & FOLL_PIN) {
> if (hpage_pincount_available(page))
> hpage_pincount_sub(page, refs);
> @@ -2273,6 +2285,8 @@ static void put_compound_head(struct page *page, int refs, unsigned int flags)
> if (refs > 1)
> page_ref_sub(page, refs - 1);
> put_page(page);
> +
> + mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, orig_refs);
> }
Still not quite happy about this :) Now you update NR_FOLL_PIN_RELEASED
even if 'flags' don't have FOLL_PIN set. You need to have the
mod_node_page_state() inside the "if (flags & FOLL_PIN)" branch above...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On 2/10/20 2:16 AM, Jan Kara wrote:
> On Thu 06-02-20 19:37:33, John Hubbard wrote:
>> @@ -2258,6 +2268,8 @@ static int record_subpages(struct page *page, unsigned long addr,
>>
>> static void put_compound_head(struct page *page, int refs, unsigned int flags)
>> {
>> + int orig_refs = refs;
>> +
>> if (flags & FOLL_PIN) {
>> if (hpage_pincount_available(page))
>> hpage_pincount_sub(page, refs);
>> @@ -2273,6 +2285,8 @@ static void put_compound_head(struct page *page, int refs, unsigned int flags)
>> if (refs > 1)
>> page_ref_sub(page, refs - 1);
>> put_page(page);
>> +
>> + mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, orig_refs);
>> }
>
> Still not quite happy about this :) Now you update NR_FOLL_PIN_RELEASED
> even if 'flags' don't have FOLL_PIN set. You need to have the
> mod_node_page_state() inside the "if (flags & FOLL_PIN)" branch above...
>
> Honza
Arggh, yes that's true. Thanks for catching that, will fix in v6.
thanks,
--
John Hubbard
NVIDIA