2019-11-13 04:32:53

by John Hubbard

[permalink] [raw]
Subject: [PATCH v4 00/23] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM

OK, here we go. Any VFIO and InfiniBand runtime testing from anyone is
especially welcome here.

Changes since v3:

* VFIO fix (patch 8): applied further cleanup: removed a pre-existing,
unnecessary release and reacquire of mmap_sem. Moved the DAX vma
checks from the vfio call site to gup internals, and added comments
(and commit log notes) to clarify.

* Due to the above, made a corresponding fix to
pin_longterm_pages_remote(), which was actually calling the wrong
gup internal function.

* Changed the put_user_page() comments to refer to the pin*() APIs,
rather than the get_user_pages*() APIs.

* Reverted an accidental whitespace-only change in the IB ODP code.

* Added a few more reviewed-by tags.


Changes since v2:

* Added a patch to convert IB/umem from normal gup to gup_fast(). This
is also posted separately, in order to hopefully get some runtime
testing.

* Changed the page devmap code to be a little clearer;
thanks to Jérôme for that.

* Split out the page devmap changes into a separate patch (and moved
Ira's Signed-off-by to that patch).

* Fixed my bug in IB: the ODP code does not require pin_user_pages()
semantics. Therefore, reverted the put_user_page() calls to put_page(),
and left the get_user_pages() call as-is.

* As part of the revert, I am proposing here a change directly
from put_user_pages() to release_pages(). I'd feel better if
someone agrees that this is the best way. It uses the more
efficient release_pages(), instead of put_page() in a loop,
and keeps the change to just a few characters on one line,
but OTOH it is not a pure revert.

* Loosened the FOLL_LONGTERM restrictions in the
__get_user_pages_locked() implementation, and used that in order
to fix up a VFIO bug. Thanks to Jason for that idea.

* Note the use of release_pages() in IB: is that OK?

* Added a few more WARN's and clarifying comments nearby.

* Many documentation improvements in various comments.

* Moved the new pin_user_pages.rst from Documentation/vm/ to
Documentation/core-api/ .

* Commit descriptions: added clarifying notes to the three patches
(drm/via, fs/io_uring, net/xdp) that already had put_user_page()
calls in place.

* Collected all pending Reviewed-by and Acked-by tags, from v1 and v2
email threads.

* Lots of churn from v2 --> v3, so it's possible that new bugs
sneaked in.

NOT DONE: a separate patchset is required:

* __get_user_pages_locked(): stop compensating for
buggy callers who failed to set FOLL_GET. Instead, assert
that FOLL_GET is set (and fail if it's not).

======================================================================
Original cover letter (edited to fix up the patch description numbers)

This applies cleanly to linux-next and mmotm, and also to linux.git if
linux-next's commit 20cac10710c9 ("mm/gup_benchmark: fix MAP_HUGETLB
case") is first applied there.

This provides tracking of dma-pinned pages. This is a prerequisite to
solving the larger problem of proper interactions between file-backed
pages and [R]DMA activities, as discussed in [1], [2], [3], and in
a remarkable number of email threads since about 2017. :)

A new internal gup flag, FOLL_PIN, is introduced, and thoroughly
documented in the last patch's Documentation/core-api/pin_user_pages.rst.

I believe that this will provide a good starting point for doing the
layout lease work that Ira Weiny has been working on. That's because
these new wrapper functions provide a clean, constrained, systematically
named set of functionality that, again, is required in order to even
know if a page is "dma-pinned".

In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, call
sites opt in by changing from this:

get_user_pages() (sets FOLL_GET)
put_page()

to this:

pin_user_pages() (sets FOLL_PIN)
put_user_page()

Because there are interdependencies with FOLL_LONGTERM, a similar
conversion to the FOLL_PIN one was applied. The change was from this:

get_user_pages(FOLL_LONGTERM) (also sets FOLL_GET)
put_page()

to this:

pin_longterm_pages() (sets FOLL_PIN | FOLL_LONGTERM)
put_user_page()
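
To make this concrete, here is a minimal sketch of one such call site
conversion (the surrounding driver code is hypothetical; only the
pin*() and put_user_page() calls are the APIs from this series):

    /* Before: FOLL_GET-based acquisition, released via put_page(): */
    ret = get_user_pages(start, nr_pages, FOLL_WRITE, pages, NULL);
    if (ret > 0) {
            /* ...set up and perform DMA to/from the pages... */
            for (i = 0; i < ret; i++)
                    put_page(pages[i]);
    }

    /* After: FOLL_PIN-based acquisition, released via put_user_page().
     * For long-lived pins, pin_longterm_pages() would be used instead,
     * which additionally sets FOLL_LONGTERM:
     */
    ret = pin_user_pages(start, nr_pages, FOLL_WRITE, pages, NULL);
    if (ret > 0) {
            /* ...set up and perform DMA to/from the pages... */
            for (i = 0; i < ret; i++)
                    put_user_page(pages[i]);
    }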

============================================================
Patch summary:

* Patches 1-8: refactoring and preparatory cleanup, independent fixes

* Patch 9: introduce pin_user_pages() and FOLL_PIN, with no functional
changes yet
* Patches 10-15: convert existing put_user_page() callers to use the
new pin*() calls
* Patch 16: Activate tracking of FOLL_PIN pages.
* Patches 17-19: convert FOLL_LONGTERM callers
* Patches 20-22: gup_benchmark and run_vmtests support
* Patch 23: enforce FOLL_LONGTERM as a gup-internal (only) flag

============================================================
Testing:

* I've done some overall kernel testing (LTP, and a few other goodies),
and some directed testing to exercise some of the changes. And as you
can see, gup_benchmark is enhanced to exercise this. Basically, I've been
able to runtime test the core get_user_pages() and pin_user_pages() and
related routines, but much less so on several of the call sites; those,
however, are generally just a couple of changed lines each.

Not much of the kernel is actually using this, which on one hand
reduces risk quite a lot. But on the other hand, testing coverage
is low. So I'd love it if, in particular, the InfiniBand and PowerPC
folks could do a smoke test of this series for me.

Also, my runtime testing for the call sites so far is very weak:

* io_uring: Some directed tests from liburing exercise this, and they pass.
* process_vm_access.c: A small directed test passes.
* gup_benchmark: the enhanced version hits the new gup.c code, and passes.
* infiniband (still only have crude "IB pingpong" working, on a
good day: it's not exercising my conversions at runtime...)
* VFIO: compiles (I'm vowing to set up a runtime test soon, but it's
not ready just yet)
* powerpc: it compiles...
* drm/via: compiles...
* goldfish: compiles...
* net/xdp: compiles...
* media/v4l2: compiles...

============================================================
Next:

* Get the block/bio_vec sites converted to use pin_user_pages().

* Work with Ira and Dave Chinner to weave this together with the
layout lease stuff.

============================================================

[1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/

John Hubbard (23):
mm/gup: pass flags arg to __gup_device_* functions
mm/gup: factor out duplicate code from four routines
mm/gup: move try_get_compound_head() to top, fix minor issues
mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
goldfish_pipe: rename local pin_user_pages() routine
IB/umem: use get_user_pages_fast() to pin DMA pages
media/v4l2-core: set pages dirty upon releasing DMA buffers
vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM
mm/gup: introduce pin_user_pages*() and FOLL_PIN
goldfish_pipe: convert to pin_user_pages() and put_user_page()
IB/{core,hw,umem}: set FOLL_PIN, FOLL_LONGTERM via
pin_longterm_pages*()
mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
drm/via: set FOLL_PIN via pin_user_pages_fast()
fs/io_uring: set FOLL_PIN via pin_user_pages()
net/xdp: set FOLL_PIN via pin_user_pages()
mm/gup: track FOLL_PIN pages
media/v4l2-core: pin_longterm_pages (FOLL_PIN) and put_user_page()
conversion
vfio, mm: pin_longterm_pages (FOLL_PIN) and put_user_page() conversion
powerpc: book3s64: convert to pin_longterm_pages() and put_user_page()
mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding
"1"
mm/gup_benchmark: support pin_user_pages() and related calls
selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN
coverage
mm/gup: remove support for gup(FOLL_LONGTERM)

Documentation/core-api/index.rst | 1 +
Documentation/core-api/pin_user_pages.rst | 218 +++++++
arch/powerpc/mm/book3s64/iommu_api.c | 15 +-
drivers/gpu/drm/via/via_dmablit.c | 2 +-
drivers/infiniband/core/umem.c | 17 +-
drivers/infiniband/core/umem_odp.c | 13 +-
drivers/infiniband/hw/hfi1/user_pages.c | 4 +-
drivers/infiniband/hw/mthca/mthca_memfree.c | 3 +-
drivers/infiniband/hw/qib/qib_user_pages.c | 8 +-
drivers/infiniband/hw/qib/qib_user_sdma.c | 2 +-
drivers/infiniband/hw/usnic/usnic_uiom.c | 9 +-
drivers/infiniband/sw/siw/siw_mem.c | 5 +-
drivers/media/v4l2-core/videobuf-dma-sg.c | 10 +-
drivers/platform/goldfish/goldfish_pipe.c | 35 +-
drivers/vfio/vfio_iommu_type1.c | 30 +-
fs/io_uring.c | 5 +-
include/linux/mm.h | 164 ++++-
include/linux/mmzone.h | 2 +
include/linux/page_ref.h | 10 +
mm/gup.c | 636 ++++++++++++++++----
mm/gup_benchmark.c | 87 ++-
mm/huge_memory.c | 54 +-
mm/hugetlb.c | 39 +-
mm/memremap.c | 67 +--
mm/process_vm_access.c | 28 +-
mm/vmstat.c | 2 +
net/xdp/xdp_umem.c | 4 +-
tools/testing/selftests/vm/gup_benchmark.c | 34 +-
tools/testing/selftests/vm/run_vmtests | 22 +
29 files changed, 1191 insertions(+), 335 deletions(-)
create mode 100644 Documentation/core-api/pin_user_pages.rst

--
2.24.0


2019-11-13 04:33:16

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v4 00/23] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM

On 11/12/19 8:26 PM, John Hubbard wrote:
> OK, here we go. Any VFIO and Infiniband runtime testing from anyone, is
> especially welcome here.
>

Oh, and to make that easier, there is a git repo and branch here:

git@github.com:johnhubbard/linux.git pin_user_pages_tracking_v4


thanks,
--
John Hubbard
NVIDIA

2019-11-13 04:33:31

by John Hubbard

[permalink] [raw]
Subject: [PATCH v4 17/23] media/v4l2-core: pin_longterm_pages (FOLL_PIN) and put_user_page() conversion

1. Change v4l2 from get_user_pages(FOLL_LONGTERM) to
pin_longterm_pages(), which sets both FOLL_LONGTERM and FOLL_PIN.

2. Because all FOLL_PIN-acquired pages must be released via
put_user_page(), also convert the put_page() call over to
put_user_pages_dirty_lock().

Acked-by: Hans Verkuil <[email protected]>
Reviewed-by: Ira Weiny <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
---
drivers/media/v4l2-core/videobuf-dma-sg.c | 13 +++++--------
1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 28262190c3ab..9b9c5b37bf59 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -183,12 +183,12 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma,
dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n",
data, size, dma->nr_pages);

- err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
- flags | FOLL_LONGTERM, dma->pages, NULL);
+ err = pin_longterm_pages(data & PAGE_MASK, dma->nr_pages,
+ flags, dma->pages, NULL);

if (err != dma->nr_pages) {
dma->nr_pages = (err >= 0) ? err : 0;
- dprintk(1, "get_user_pages: err=%d [%d]\n", err,
+ dprintk(1, "pin_longterm_pages: err=%d [%d]\n", err,
dma->nr_pages);
return err < 0 ? err : -EINVAL;
}
@@ -349,11 +349,8 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma)
BUG_ON(dma->sglen);

if (dma->pages) {
- for (i = 0; i < dma->nr_pages; i++) {
- if (dma->direction == DMA_FROM_DEVICE)
- set_page_dirty_lock(dma->pages[i]);
- put_page(dma->pages[i]);
- }
+ put_user_pages_dirty_lock(dma->pages, dma->nr_pages,
+ dma->direction == DMA_FROM_DEVICE);
kfree(dma->pages);
dma->pages = NULL;
}
--
2.24.0

2019-11-13 04:33:34

by John Hubbard

[permalink] [raw]
Subject: [PATCH v4 22/23] selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage

It's good to have basic unit test coverage of the new FOLL_PIN
behavior. Fortunately, the gup_benchmark unit test is extremely
fast (a few milliseconds), so adding it to the run_vmtests suite
causes no noticeable change in running time.

So, add two new invocations to run_vmtests:

1) Run gup_benchmark with normal get_user_pages().

2) Run gup_benchmark with pin_user_pages(). This is much like
the first call, except that it sets FOLL_PIN.

Running these two in quick succession also provides a visual
comparison of the running times, which is convenient.

The new invocations are fairly early in the run_vmtests script,
because with test suites, it's usually preferable to put the
shorter, faster tests first, all other things being equal.

Signed-off-by: John Hubbard <[email protected]>
---
tools/testing/selftests/vm/run_vmtests | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/tools/testing/selftests/vm/run_vmtests b/tools/testing/selftests/vm/run_vmtests
index 951c507a27f7..93e8dc9a7cad 100755
--- a/tools/testing/selftests/vm/run_vmtests
+++ b/tools/testing/selftests/vm/run_vmtests
@@ -104,6 +104,28 @@ echo "NOTE: The above hugetlb tests provide minimal coverage. Use"
echo " https://github.com/libhugetlbfs/libhugetlbfs.git for"
echo " hugetlb regression testing."

+echo "--------------------------------------------"
+echo "running 'gup_benchmark -U' (normal/slow gup)"
+echo "--------------------------------------------"
+./gup_benchmark -U
+if [ $? -ne 0 ]; then
+ echo "[FAIL]"
+ exitcode=1
+else
+ echo "[PASS]"
+fi
+
+echo "------------------------------------------"
+echo "running gup_benchmark -c (pin_user_pages)"
+echo "------------------------------------------"
+./gup_benchmark -c
+if [ $? -ne 0 ]; then
+ echo "[FAIL]"
+ exitcode=1
+else
+ echo "[PASS]"
+fi
+
echo "-------------------"
echo "running userfaultfd"
echo "-------------------"
--
2.24.0

2019-11-13 04:34:11

by John Hubbard

[permalink] [raw]
Subject: [PATCH v4 16/23] mm/gup: track FOLL_PIN pages

Add tracking of pages that were pinned via FOLL_PIN.

As mentioned in the FOLL_PIN documentation, callers who effectively set
FOLL_PIN are required to ultimately free such pages via put_user_page().
The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
for DIO and/or RDMA use".

Pages that have been pinned via FOLL_PIN are identifiable via a
new function call:

bool page_dma_pinned(struct page *page);

What to do in response to encountering such a page is left to later
patchsets. There is discussion about this in [1].

This also changes a BUG_ON() to a WARN_ON_ONCE() in follow_page_mask().
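
To illustrate the intended flow, here is a rough sketch (not part of
the patch; everything except pin_user_pages(), page_dma_pinned() and
put_user_page() is hypothetical, and error handling is elided):

    /* Pinning adds GUP_PIN_COUNTING_BIAS (1024) to the refcount of
     * each page, rather than 1:
     */
    ret = pin_user_pages(start, nr_pages, FOLL_WRITE, pages, NULL);

    /* Elsewhere (writeback, for example), a page can be queried. A
     * false positive is possible if a page has accumulated at least
     * 1024 normal references:
     */
    if (page_dma_pinned(page))
            ; /* defer, bounce, or otherwise special-case this page */

    /* Each FOLL_PIN-acquired page must be released like this: */
    for (i = 0; i < ret; i++)
            put_user_page(pages[i]);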

Suggested-by: Jan Kara <[email protected]>
Suggested-by: Jérôme Glisse <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
---
include/linux/mm.h | 74 +++++++++++----
include/linux/mmzone.h | 2 +
include/linux/page_ref.h | 10 +++
mm/gup.c | 189 +++++++++++++++++++++++++++++++++------
mm/huge_memory.c | 54 ++++++++++-
mm/hugetlb.c | 39 +++++++-
mm/vmstat.c | 2 +
7 files changed, 321 insertions(+), 49 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c351e1b0b4b7..19b3fa68a4da 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1054,6 +1054,8 @@ static inline __must_check bool try_get_page(struct page *page)
return true;
}

+__must_check bool user_page_ref_inc(struct page *page);
+
static inline void put_page(struct page *page)
{
page = compound_head(page);
@@ -1071,30 +1073,70 @@ static inline void put_page(struct page *page)
__put_page(page);
}

-/**
- * put_user_page() - release a gup-pinned page
- * @page: pointer to page to be released
+/*
+ * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
+ * the page's refcount so that two separate items are tracked: the original page
+ * reference count, and also a new count of how many get_user_pages() calls were
+ * made against the page. ("gup-pinned" is another term for the latter).
+ *
+ * With this scheme, get_user_pages() becomes special: such pages are marked
+ * as distinct from normal pages. As such, the new put_user_page() call (and
+ * its variants) must be used in order to release gup-pinned pages.
+ *
+ * Choice of value:
*
- * Pages that were pinned via either pin_user_pages*() or pin_longterm_pages*()
- * must be released via either put_user_page(), or one of the put_user_pages*()
- * routines. This is so that eventually such pages can be separately tracked and
- * uniquely handled. In particular, interactions with RDMA and filesystems need
- * special handling.
+ * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference
+ * counts with respect to get_user_pages() and put_user_page() becomes simpler,
+ * due to the fact that adding an even power of two to the page refcount has
+ * the effect of using only the upper N bits, for the code that counts up using
+ * the bias value. This means that the lower bits are left for the exclusive
+ * use of the original code that increments and decrements by one (or at least,
+ * by much smaller values than the bias value).
*
- * put_user_page() and put_page() are not interchangeable, despite this early
- * implementation that makes them look the same. put_user_page() calls must
- * be perfectly matched up with pin*() calls.
+ * Of course, once the lower bits overflow into the upper bits (and this is
+ * OK, because subtraction recovers the original values), then visual inspection
+ * no longer suffices to directly view the separate counts. However, for normal
+ * applications that don't have huge page reference counts, this won't be an
+ * issue.
+ *
+ * Locking: the lockless algorithm described in page_cache_get_speculative()
+ * and page_cache_gup_pin_speculative() provides safe operation for
+ * get_user_pages and page_mkclean and other calls that race to set up page
+ * table entries.
*/
-static inline void put_user_page(struct page *page)
-{
- put_page(page);
-}
+#define GUP_PIN_COUNTING_BIAS (1UL << 10)

+void put_user_page(struct page *page);
void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
bool make_dirty);
-
void put_user_pages(struct page **pages, unsigned long npages);

+/**
+ * page_dma_pinned() - report if a page is pinned for DMA.
+ *
+ * This function checks if a page has been pinned via a call to
+ * pin_user_pages*() or pin_longterm_pages*().
+ *
+ * The return value is partially fuzzy: false is not fuzzy, because it means
+ * "definitely not pinned for DMA", but true means "probably pinned for DMA, but
+ * possibly a false positive due to having at least GUP_PIN_COUNTING_BIAS worth
+ * of normal page references".
+ *
+ * False positives are OK, because: a) it's unlikely for a page to get that many
+ * refcounts, and b) all the callers of this routine are expected to be able to
+ * deal gracefully with a false positive.
+ *
+ * For more information, please see Documentation/vm/pin_user_pages.rst.
+ *
+ * @page: pointer to page to be queried.
+ * Return: True, if it is likely that the page has been "dma-pinned".
+ * False, if the page is definitely not dma-pinned.
+ */
+static inline bool page_dma_pinned(struct page *page)
+{
+ return (page_ref_count(compound_head(page))) >= GUP_PIN_COUNTING_BIAS;
+}
+
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
#define SECTION_IN_PAGE_FLAGS
#endif
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index bda20282746b..0485cba38d23 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -244,6 +244,8 @@ enum node_stat_item {
NR_DIRTIED, /* page dirtyings since bootup */
NR_WRITTEN, /* page writings since bootup */
NR_KERNEL_MISC_RECLAIMABLE, /* reclaimable non-slab kernel pages */
+ NR_FOLL_PIN_REQUESTED, /* via: pin_user_page(), gup flag: FOLL_PIN */
+ NR_FOLL_PIN_RETURNED, /* pages returned via put_user_page() */
NR_VM_NODE_STAT_ITEMS
};

diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 14d14beb1f7f..b9cbe553d1e7 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -102,6 +102,16 @@ static inline void page_ref_sub(struct page *page, int nr)
__page_ref_mod(page, -nr);
}

+static inline int page_ref_sub_return(struct page *page, int nr)
+{
+ int ret = atomic_sub_return(nr, &page->_refcount);
+
+ if (page_ref_tracepoint_active(__tracepoint_page_ref_mod))
+ __page_ref_mod(page, -nr);
+
+ return ret;
+}
+
static inline void page_ref_inc(struct page *page)
{
atomic_inc(&page->_refcount);
diff --git a/mm/gup.c b/mm/gup.c
index 4409e84dff51..82e7e4ce5027 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -51,6 +51,94 @@ static inline struct page *try_get_compound_head(struct page *page, int refs)
return head;
}

+#ifdef CONFIG_DEBUG_VM
+static inline void __update_proc_vmstat(struct page *page,
+ enum node_stat_item item, int count)
+{
+ mod_node_page_state(page_pgdat(page), item, count);
+}
+#else
+static inline void __update_proc_vmstat(struct page *page,
+ enum node_stat_item item, int count)
+{
+}
+#endif
+
+/**
+ * user_page_ref_inc() - mark a page as being used by get_user_pages(FOLL_PIN).
+ *
+ * @page: pointer to page to be marked
+ * Return: true for success, false for failure
+ */
+__must_check bool user_page_ref_inc(struct page *page)
+{
+ page = try_get_compound_head(page, GUP_PIN_COUNTING_BIAS);
+ if (!page)
+ return false;
+
+ __update_proc_vmstat(page, NR_FOLL_PIN_REQUESTED, 1);
+ return true;
+}
+
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+static bool __put_devmap_managed_user_page(struct page *page)
+{
+ bool is_devmap = page_is_devmap_managed(page);
+
+ if (is_devmap) {
+ int count = page_ref_sub_return(page, GUP_PIN_COUNTING_BIAS);
+
+ __update_proc_vmstat(page, NR_FOLL_PIN_RETURNED, 1);
+ /*
+ * devmap page refcounts are 1-based, rather than 0-based: if
+ * refcount is 1, then the page is free and the refcount is
+ * stable because nobody holds a reference on the page.
+ */
+ if (count == 1)
+ free_devmap_managed_page(page);
+ else if (!count)
+ __put_page(page);
+ }
+
+ return is_devmap;
+}
+#else
+static bool __put_devmap_managed_user_page(struct page *page)
+{
+ return false;
+}
+#endif /* CONFIG_DEV_PAGEMAP_OPS */
+
+/**
+ * put_user_page() - release a dma-pinned page
+ * @page: pointer to page to be released
+ *
+ * Pages that were pinned via either pin_user_pages*() or pin_longterm_pages*()
+ * must be released via either put_user_page(), or one of the put_user_pages*()
+ * routines. This is so that such pages can be separately tracked and uniquely
+ * handled. In particular, interactions with RDMA and filesystems need special
+ * handling.
+ */
+void put_user_page(struct page *page)
+{
+ page = compound_head(page);
+
+ /*
+ * For devmap managed pages we need to catch refcount transition from
+ * GUP_PIN_COUNTING_BIAS to 1, when refcount reach one it means the
+ * page is free and we need to inform the device driver through
+ * callback. See include/linux/memremap.h and HMM for details.
+ */
+ if (__put_devmap_managed_user_page(page))
+ return;
+
+ if (page_ref_sub_and_test(page, GUP_PIN_COUNTING_BIAS))
+ __put_page(page);
+
+ __update_proc_vmstat(page, NR_FOLL_PIN_RETURNED, 1);
+}
+EXPORT_SYMBOL(put_user_page);
+
/**
* put_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
* @pages: array of pages to be maybe marked dirty, and definitely released.
@@ -237,10 +325,11 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
}

page = vm_normal_page(vma, address, pte);
- if (!page && pte_devmap(pte) && (flags & FOLL_GET)) {
+ if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
/*
- * Only return device mapping pages in the FOLL_GET case since
- * they are only valid while holding the pgmap reference.
+ * Only return device mapping pages in the FOLL_GET or FOLL_PIN
+ * case since they are only valid while holding the pgmap
+ * reference.
*/
*pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap);
if (*pgmap)
@@ -283,6 +372,11 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
page = ERR_PTR(-ENOMEM);
goto out;
}
+ } else if (flags & FOLL_PIN) {
+ if (unlikely(!user_page_ref_inc(page))) {
+ page = ERR_PTR(-ENOMEM);
+ goto out;
+ }
}
if (flags & FOLL_TOUCH) {
if ((flags & FOLL_WRITE) &&
@@ -544,8 +638,8 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
/* make this handle hugepd */
page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
if (!IS_ERR(page)) {
- BUG_ON(flags & FOLL_GET);
- return page;
+ WARN_ON_ONCE(flags & (FOLL_GET | FOLL_PIN));
+ return NULL;
}

pgd = pgd_offset(mm, address);
@@ -1844,13 +1938,17 @@ static inline pte_t gup_get_pte(pte_t *ptep)
#endif /* CONFIG_GUP_GET_PTE_LOW_HIGH */

static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
+ unsigned int flags,
struct page **pages)
{
while ((*nr) - nr_start) {
struct page *page = pages[--(*nr)];

ClearPageReferenced(page);
- put_page(page);
+ if (flags & FOLL_PIN)
+ put_user_page(page);
+ else
+ put_page(page);
}
}

@@ -1883,7 +1981,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,

pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
if (unlikely(!pgmap)) {
- undo_dev_pagemap(nr, nr_start, pages);
+ undo_dev_pagemap(nr, nr_start, flags, pages);
goto pte_unmap;
}
} else if (pte_special(pte))
@@ -1892,9 +1990,15 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
page = pte_page(pte);

- head = try_get_compound_head(page, 1);
- if (!head)
- goto pte_unmap;
+ if (flags & FOLL_PIN) {
+ head = page;
+ if (unlikely(!user_page_ref_inc(head)))
+ goto pte_unmap;
+ } else {
+ head = try_get_compound_head(page, 1);
+ if (!head)
+ goto pte_unmap;
+ }

if (unlikely(pte_val(pte) != pte_val(*ptep))) {
put_page(head);
@@ -1948,12 +2052,20 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,

pgmap = get_dev_pagemap(pfn, pgmap);
if (unlikely(!pgmap)) {
- undo_dev_pagemap(nr, nr_start, pages);
+ undo_dev_pagemap(nr, nr_start, flags, pages);
return 0;
}
SetPageReferenced(page);
pages[*nr] = page;
- get_page(page);
+
+ if (flags & FOLL_PIN) {
+ if (unlikely(!user_page_ref_inc(page))) {
+ undo_dev_pagemap(nr, nr_start, flags, pages);
+ return 0;
+ }
+ } else
+ get_page(page);
+
(*nr)++;
pfn++;
} while (addr += PAGE_SIZE, addr != end);
@@ -1975,7 +2087,7 @@ static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
return 0;

if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
- undo_dev_pagemap(nr, nr_start, pages);
+ undo_dev_pagemap(nr, nr_start, flags, pages);
return 0;
}
return 1;
@@ -1993,7 +2105,7 @@ static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
return 0;

if (unlikely(pud_val(orig) != pud_val(*pudp))) {
- undo_dev_pagemap(nr, nr_start, pages);
+ undo_dev_pagemap(nr, nr_start, flags, pages);
return 0;
}
return 1;
@@ -2077,9 +2189,15 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
refs = __record_subpages(page, addr, end, pages, *nr);

- head = try_get_compound_head(head, refs);
- if (!head)
- return 0;
+ if (flags & FOLL_PIN) {
+ head = page;
+ if (unlikely(!user_page_ref_inc(head)))
+ return 0;
+ } else {
+ head = try_get_compound_head(head, refs);
+ if (!head)
+ return 0;
+ }

if (unlikely(pte_val(pte) != pte_val(*ptep))) {
put_compound_head(head, refs);
@@ -2136,9 +2255,15 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
refs = __record_subpages(page, addr, end, pages, *nr);

- head = try_get_compound_head(pmd_page(orig), refs);
- if (!head)
- return 0;
+ if (flags & FOLL_PIN) {
+ head = page;
+ if (unlikely(!user_page_ref_inc(head)))
+ return 0;
+ } else {
+ head = try_get_compound_head(pmd_page(orig), refs);
+ if (!head)
+ return 0;
+ }

if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
put_compound_head(head, refs);
@@ -2169,9 +2294,15 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
refs = __record_subpages(page, addr, end, pages, *nr);

- head = try_get_compound_head(pud_page(orig), refs);
- if (!head)
- return 0;
+ if (flags & FOLL_PIN) {
+ head = page;
+ if (unlikely(!user_page_ref_inc(head)))
+ return 0;
+ } else {
+ head = try_get_compound_head(pud_page(orig), refs);
+ if (!head)
+ return 0;
+ }

if (unlikely(pud_val(orig) != pud_val(*pudp))) {
put_compound_head(head, refs);
@@ -2197,9 +2328,15 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
page = pgd_page(orig) + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT);
refs = __record_subpages(page, addr, end, pages, *nr);

- head = try_get_compound_head(pgd_page(orig), refs);
- if (!head)
- return 0;
+ if (flags & FOLL_PIN) {
+ head = page;
+ if (unlikely(!user_page_ref_inc(head)))
+ return 0;
+ } else {
+ head = try_get_compound_head(pgd_page(orig), refs);
+ if (!head)
+ return 0;
+ }

if (unlikely(pgd_val(orig) != pgd_val(*pgdp))) {
put_compound_head(head, refs);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 13cc93785006..4010c269e9e5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -945,6 +945,11 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
*/
WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW set");

+ /* FOLL_GET and FOLL_PIN are mutually exclusive. */
+ if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
+ (FOLL_PIN | FOLL_GET)))
+ return NULL;
+
if (flags & FOLL_WRITE && !pmd_write(*pmd))
return NULL;

@@ -960,7 +965,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
* device mapped pages can only be returned if the
* caller will manage the page reference count.
*/
- if (!(flags & FOLL_GET))
+ if (!(flags & (FOLL_GET | FOLL_PIN)))
return ERR_PTR(-EEXIST);

pfn += (addr & ~PMD_MASK) >> PAGE_SHIFT;
@@ -968,7 +973,18 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
if (!*pgmap)
return ERR_PTR(-EFAULT);
page = pfn_to_page(pfn);
- get_page(page);
+
+ if (flags & FOLL_GET)
+ get_page(page);
+ else if (flags & FOLL_PIN) {
+ /*
+ * user_page_ref_inc() is not actually expected to fail here
+ * because we hold the pmd lock so no one can unmap the pmd and
+ * free the page that it points to.
+ */
+ if (unlikely(!user_page_ref_inc(page)))
+ page = ERR_PTR(-EFAULT);
+ }

return page;
}
@@ -1088,6 +1104,11 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
if (flags & FOLL_WRITE && !pud_write(*pud))
return NULL;

+ /* FOLL_GET and FOLL_PIN are mutually exclusive. */
+ if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
+ (FOLL_PIN | FOLL_GET)))
+ return NULL;
+
if (pud_present(*pud) && pud_devmap(*pud))
/* pass */;
else
@@ -1099,8 +1120,10 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
/*
* device mapped pages can only be returned if the
* caller will manage the page reference count.
+ *
+ * At least one of FOLL_GET | FOLL_PIN must be set, so assert that here:
*/
- if (!(flags & FOLL_GET))
+ if (!(flags & (FOLL_GET | FOLL_PIN)))
return ERR_PTR(-EEXIST);

pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
@@ -1108,7 +1131,18 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
if (!*pgmap)
return ERR_PTR(-EFAULT);
page = pfn_to_page(pfn);
- get_page(page);
+
+ if (flags & FOLL_GET)
+ get_page(page);
+ else if (flags & FOLL_PIN) {
+ /*
+ * user_page_ref_inc() is not actually expected to fail here
+ * because we hold the pud lock so no one can unmap the pud and
+ * free the page that it points to.
+ */
+ if (unlikely(!user_page_ref_inc(page)))
+ page = ERR_PTR(-EFAULT);
+ }

return page;
}
@@ -1522,8 +1556,20 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
skip_mlock:
page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
+
if (flags & FOLL_GET)
get_page(page);
+ else if (flags & FOLL_PIN) {
+ /*
+ * user_page_ref_inc() is not actually expected to fail here
+ * because we hold the pmd lock so no one can unmap the pmd and
+ * free the page that it points to.
+ */
+ if (unlikely(!user_page_ref_inc(page))) {
+ WARN_ON_ONCE(1);
+ page = NULL;
+ }
+ }

out:
return page;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b45a95363a84..5ee80eea25e5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4462,7 +4462,22 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
same_page:
if (pages) {
pages[i] = mem_map_offset(page, pfn_offset);
- get_page(pages[i]);
+
+ if (flags & FOLL_GET)
+ get_page(pages[i]);
+ else if (flags & FOLL_PIN) {
+ /*
+ * user_page_ref_inc() is not actually expected
+ * to fail here because we hold the ptl.
+ */
+ if (unlikely(!user_page_ref_inc(pages[i]))) {
+ spin_unlock(ptl);
+ remainder = 0;
+ err = -ENOMEM;
+ WARN_ON_ONCE(1);
+ break;
+ }
+ }
}

if (vmas)
@@ -5022,6 +5037,12 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
struct page *page = NULL;
spinlock_t *ptl;
pte_t pte;
+
+ /* FOLL_GET and FOLL_PIN are mutually exclusive. */
+ if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
+ (FOLL_PIN | FOLL_GET)))
+ return NULL;
+
retry:
ptl = pmd_lockptr(mm, pmd);
spin_lock(ptl);
@@ -5034,8 +5055,20 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pte = huge_ptep_get((pte_t *)pmd);
if (pte_present(pte)) {
page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
+
if (flags & FOLL_GET)
get_page(page);
+ else if (flags & FOLL_PIN) {
+ /*
+ * user_page_ref_inc() is not actually expected to fail
+ * here because we hold the ptl.
+ */
+ if (unlikely(!user_page_ref_inc(page))) {
+ WARN_ON_ONCE(1);
+ page = NULL;
+ goto out;
+ }
+ }
} else {
if (is_hugetlb_entry_migration(pte)) {
spin_unlock(ptl);
@@ -5056,7 +5089,7 @@ struct page * __weak
follow_huge_pud(struct mm_struct *mm, unsigned long address,
pud_t *pud, int flags)
{
- if (flags & FOLL_GET)
+ if (flags & (FOLL_GET | FOLL_PIN))
return NULL;

return pte_page(*(pte_t *)pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
@@ -5065,7 +5098,7 @@ follow_huge_pud(struct mm_struct *mm, unsigned long address,
struct page * __weak
follow_huge_pgd(struct mm_struct *mm, unsigned long address, pgd_t *pgd, int flags)
{
- if (flags & FOLL_GET)
+ if (flags & (FOLL_GET | FOLL_PIN))
return NULL;

return pte_page(*(pte_t *)pgd) + ((address & ~PGDIR_MASK) >> PAGE_SHIFT);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index a8222041bd44..fdad40ccde7b 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1167,6 +1167,8 @@ const char * const vmstat_text[] = {
"nr_dirtied",
"nr_written",
"nr_kernel_misc_reclaimable",
+ "nr_foll_pin_requested",
+ "nr_foll_pin_returned",

/* enum writeback_stat_item counters */
"nr_dirty_threshold",
--
2.24.0

2019-11-13 04:34:33

by John Hubbard

[permalink] [raw]
Subject: [PATCH v4 04/23] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

An upcoming patch changes and complicates the refcounting and
especially the "put page" aspects of it. In order to keep
everything clean, refactor the devmap page release routines:

* Rename put_devmap_managed_page() to page_is_devmap_managed(),
and limit the functionality to "read only": return a bool,
with no side effects.

* Add a new routine, put_devmap_managed_page(), to handle checking
what kind of page it is, and what kind of refcount handling it
requires.

* Rename __put_devmap_managed_page() to free_devmap_managed_page(),
and limit the functionality to unconditionally freeing a devmap
page.

This is originally based on a separate patch by Ira Weiny, which
applied to an early version of the put_user_page() experiments.
Since then, Jérôme Glisse suggested the refactoring described above.
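
For context, here is roughly how the refactored helper then fits into
the put_page() path (condensed from include/linux/mm.h; see the diff
below for the exact code):

    static inline void put_page(struct page *page)
    {
            page = compound_head(page);

            /*
             * Devmap-managed pages use 1-based refcounting and need a
             * page_free() driver callback, so they take a separate
             * release path:
             */
            if (put_devmap_managed_page(page))
                    return;

            if (put_page_testzero(page))
                    __put_page(page);
    }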

Suggested-by: Jérôme Glisse <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
---
include/linux/mm.h | 27 ++++++++++++++++---
mm/memremap.c | 67 ++++++++++++++++++++--------------------------
2 files changed, 53 insertions(+), 41 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a2adf95b3f9c..96228376139c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -967,9 +967,10 @@ static inline bool is_zone_device_page(const struct page *page)
#endif

#ifdef CONFIG_DEV_PAGEMAP_OPS
-void __put_devmap_managed_page(struct page *page);
+void free_devmap_managed_page(struct page *page);
DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-static inline bool put_devmap_managed_page(struct page *page)
+
+static inline bool page_is_devmap_managed(struct page *page)
{
if (!static_branch_unlikely(&devmap_managed_key))
return false;
@@ -978,7 +979,6 @@ static inline bool put_devmap_managed_page(struct page *page)
switch (page->pgmap->type) {
case MEMORY_DEVICE_PRIVATE:
case MEMORY_DEVICE_FS_DAX:
- __put_devmap_managed_page(page);
return true;
default:
break;
@@ -986,6 +986,27 @@ static inline bool put_devmap_managed_page(struct page *page)
return false;
}

+static inline bool put_devmap_managed_page(struct page *page)
+{
+ bool is_devmap = page_is_devmap_managed(page);
+
+ if (is_devmap) {
+ int count = page_ref_dec_return(page);
+
+ /*
+ * devmap page refcounts are 1-based, rather than 0-based: if
+ * refcount is 1, then the page is free and the refcount is
+ * stable because nobody holds a reference on the page.
+ */
+ if (count == 1)
+ free_devmap_managed_page(page);
+ else if (!count)
+ __put_page(page);
+ }
+
+ return is_devmap;
+}
+
#else /* CONFIG_DEV_PAGEMAP_OPS */
static inline bool put_devmap_managed_page(struct page *page)
{
diff --git a/mm/memremap.c b/mm/memremap.c
index 03ccbdfeb697..bc7e2a27d025 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -410,48 +410,39 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
EXPORT_SYMBOL_GPL(get_dev_pagemap);

#ifdef CONFIG_DEV_PAGEMAP_OPS
-void __put_devmap_managed_page(struct page *page)
+void free_devmap_managed_page(struct page *page)
{
- int count = page_ref_dec_return(page);
+ /* Clear Active bit in case of parallel mark_page_accessed */
+ __ClearPageActive(page);
+ __ClearPageWaiters(page);
+
+ mem_cgroup_uncharge(page);

/*
- * If refcount is 1 then page is freed and refcount is stable as nobody
- * holds a reference on the page.
+ * When a device_private page is freed, the page->mapping field
+ * may still contain a (stale) mapping value. For example, the
+ * lower bits of page->mapping may still identify the page as
+ * an anonymous page. Ultimately, this entire field is just
+ * stale and wrong, and it will cause errors if not cleared.
+ * One example is:
+ *
+ * migrate_vma_pages()
+ * migrate_vma_insert_page()
+ * page_add_new_anon_rmap()
+ * __page_set_anon_rmap()
+ * ...checks page->mapping, via PageAnon(page) call,
+ * and incorrectly concludes that the page is an
+ * anonymous page. Therefore, it incorrectly,
+ * silently fails to set up the new anon rmap.
+ *
+ * For other types of ZONE_DEVICE pages, migration is either
+ * handled differently or not done at all, so there is no need
+ * to clear page->mapping.
*/
- if (count == 1) {
- /* Clear Active bit in case of parallel mark_page_accessed */
- __ClearPageActive(page);
- __ClearPageWaiters(page);
-
- mem_cgroup_uncharge(page);
-
- /*
- * When a device_private page is freed, the page->mapping field
- * may still contain a (stale) mapping value. For example, the
- * lower bits of page->mapping may still identify the page as
- * an anonymous page. Ultimately, this entire field is just
- * stale and wrong, and it will cause errors if not cleared.
- * One example is:
- *
- * migrate_vma_pages()
- * migrate_vma_insert_page()
- * page_add_new_anon_rmap()
- * __page_set_anon_rmap()
- * ...checks page->mapping, via PageAnon(page) call,
- * and incorrectly concludes that the page is an
- * anonymous page. Therefore, it incorrectly,
- * silently fails to set up the new anon rmap.
- *
- * For other types of ZONE_DEVICE pages, migration is either
- * handled differently or not done at all, so there is no need
- * to clear page->mapping.
- */
- if (is_device_private_page(page))
- page->mapping = NULL;
+ if (is_device_private_page(page))
+ page->mapping = NULL;

- page->pgmap->ops->page_free(page);
- } else if (!count)
- __put_page(page);
+ page->pgmap->ops->page_free(page);
}
-EXPORT_SYMBOL(__put_devmap_managed_page);
+EXPORT_SYMBOL(free_devmap_managed_page);
#endif /* CONFIG_DEV_PAGEMAP_OPS */
--
2.24.0

2019-11-13 10:50:41

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v4 04/23] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

On Tue 12-11-19 20:26:51, John Hubbard wrote:
> An upcoming patch changes and complicates the refcounting and
> especially the "put page" aspects of it. In order to keep
> everything clean, refactor the devmap page release routines:
>
> * Rename put_devmap_managed_page() to page_is_devmap_managed(),
> and limit the functionality to "read only": return a bool,
> with no side effects.
>
> * Add a new routine, put_devmap_managed_page(), to handle checking
> what kind of page it is, and what kind of refcount handling it
> requires.
>
> * Rename __put_devmap_managed_page() to free_devmap_managed_page(),
> and limit the functionality to unconditionally freeing a devmap
> page.
>
> This is originally based on a separate patch by Ira Weiny, which
> applied to an early version of the put_user_page() experiments.
> Since then, Jérôme Glisse suggested the refactoring described above.
>
> Suggested-by: Jérôme Glisse <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
> Signed-off-by: John Hubbard <[email protected]>

Looks good to me. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-11-13 19:29:48

by Jérôme Glisse

[permalink] [raw]
Subject: Re: [PATCH v4 04/23] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

On Tue, Nov 12, 2019 at 08:26:51PM -0800, John Hubbard wrote:
> An upcoming patch changes and complicates the refcounting and
> especially the "put page" aspects of it. In order to keep
> everything clean, refactor the devmap page release routines:
>
> * Rename put_devmap_managed_page() to page_is_devmap_managed(),
> and limit the functionality to "read only": return a bool,
> with no side effects.
>
> * Add a new routine, put_devmap_managed_page(), to handle checking
> what kind of page it is, and what kind of refcount handling it
> requires.
>
> * Rename __put_devmap_managed_page() to free_devmap_managed_page(),
> and limit the functionality to unconditionally freeing a devmap
> page.
>
> This is originally based on a separate patch by Ira Weiny, which
> applied to an early version of the put_user_page() experiments.
> Since then, Jérôme Glisse suggested the refactoring described above.
>
> Suggested-by: Jérôme Glisse <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
> Signed-off-by: John Hubbard <[email protected]>

Reviewed-by: Jérôme Glisse <[email protected]>


2019-11-13 19:29:52

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v4 04/23] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

On Tue, Nov 12, 2019 at 8:27 PM John Hubbard <[email protected]> wrote:
>
> An upcoming patch changes and complicates the refcounting and
> especially the "put page" aspects of it. In order to keep
> everything clean, refactor the devmap page release routines:
>
> * Rename put_devmap_managed_page() to page_is_devmap_managed(),
> and limit the functionality to "read only": return a bool,
> with no side effects.
>
> * Add a new routine, put_devmap_managed_page(), to handle checking
> what kind of page it is, and what kind of refcount handling it
> requires.
>
> * Rename __put_devmap_managed_page() to free_devmap_managed_page(),
> and limit the functionality to unconditionally freeing a devmap
> page.
>
> This is originally based on a separate patch by Ira Weiny, which
> applied to an early version of the put_user_page() experiments.
> Since then, Jérôme Glisse suggested the refactoring described above.
>
> Suggested-by: Jérôme Glisse <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
> Signed-off-by: John Hubbard <[email protected]>
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 03ccbdfeb697..bc7e2a27d025 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -410,48 +410,39 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
> EXPORT_SYMBOL_GPL(get_dev_pagemap);
>
> #ifdef CONFIG_DEV_PAGEMAP_OPS
> -void __put_devmap_managed_page(struct page *page)
> +void free_devmap_managed_page(struct page *page)
> {
> - int count = page_ref_dec_return(page);
> + /* Clear Active bit in case of parallel mark_page_accessed */
> + __ClearPageActive(page);
> + __ClearPageWaiters(page);
> +
> + mem_cgroup_uncharge(page);

Ugh, when did all this HMM-specific manipulation sneak into the
generic ZONE_DEVICE path? It used to be gated by pgmap type, with its
own put_zone_device_private_page(). For example, it's certainly
unnecessary and might be broken (would need to check) to call
mem_cgroup_uncharge() on a DAX page. ZONE_DEVICE users are not a
monolith, and the HMM use case leaks pages into code paths that DAX
explicitly avoids.

2019-11-13 19:29:56

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH v4 22/23] selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage

On Tue, Nov 12, 2019 at 08:27:09PM -0800, John Hubbard wrote:
> It's good to have basic unit test coverage of the new FOLL_PIN
> behavior. Fortunately, the gup_benchmark unit test is extremely
> fast (a few milliseconds), so adding it to the run_vmtests suite
> causes no noticeable change in running time.
>
> So, add two new invocations to run_vmtests:
>
> 1) Run gup_benchmark with normal get_user_pages().
>
> 2) Run gup_benchmark with pin_user_pages(). This is much like
> the first call, except that it sets FOLL_PIN.
>
> Running these two in quick succession also provides a visual
> comparison of the running times, which is convenient.
>
> The new invocations are fairly early in the run_vmtests script,
> because with test suites, it's usually preferable to put the
> shorter, faster tests first, all other things being equal.
>
> Signed-off-by: John Hubbard <[email protected]>

Reviewed-by: Ira Weiny <[email protected]>

> ---
> tools/testing/selftests/vm/run_vmtests | 22 ++++++++++++++++++++++
> 1 file changed, 22 insertions(+)
>
> diff --git a/tools/testing/selftests/vm/run_vmtests b/tools/testing/selftests/vm/run_vmtests
> index 951c507a27f7..93e8dc9a7cad 100755
> --- a/tools/testing/selftests/vm/run_vmtests
> +++ b/tools/testing/selftests/vm/run_vmtests
> @@ -104,6 +104,28 @@ echo "NOTE: The above hugetlb tests provide minimal coverage. Use"
> echo " https://github.com/libhugetlbfs/libhugetlbfs.git for"
> echo " hugetlb regression testing."
>
> +echo "--------------------------------------------"
> +echo "running 'gup_benchmark -U' (normal/slow gup)"
> +echo "--------------------------------------------"
> +./gup_benchmark -U
> +if [ $? -ne 0 ]; then
> + echo "[FAIL]"
> + exitcode=1
> +else
> + echo "[PASS]"
> +fi
> +
> +echo "------------------------------------------"
> +echo "running gup_benchmark -c (pin_user_pages)"
> +echo "------------------------------------------"
> +./gup_benchmark -c
> +if [ $? -ne 0 ]; then
> + echo "[FAIL]"
> + exitcode=1
> +else
> + echo "[PASS]"
> +fi
> +
> echo "-------------------"
> echo "running userfaultfd"
> echo "-------------------"
> --
> 2.24.0
>
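
For readers who want to see what the two invocations actually time, here is a
rough kernel-side sketch of the two acquire/release pairings, using the
function names from this series. This is illustrative only, not the
benchmark's real code: addr and nr stand in for the user buffer start and
page count, and error handling is omitted.

	struct page **pages = kcalloc(nr, sizeof(*pages), GFP_KERNEL);
	long i, ret;

	/* gup_benchmark -U: classic get_user_pages(); each page is
	 * released individually with put_page(). */
	ret = get_user_pages(addr, nr, FOLL_WRITE, pages, NULL);
	for (i = 0; i < ret; i++)
		put_page(pages[i]);

	/* gup_benchmark -c: pin_user_pages() sets FOLL_PIN internally;
	 * the pinned pages must be released with put_user_pages(). */
	ret = pin_user_pages(addr, nr, FOLL_WRITE, pages, NULL);
	if (ret > 0)
		put_user_pages(pages, ret);

	kfree(pages);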

2019-11-13 22:01:27

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v4 04/23] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

On Wed, Nov 13, 2019 at 11:23 AM Dan Williams <[email protected]> wrote:
>
> On Tue, Nov 12, 2019 at 8:27 PM John Hubbard <[email protected]> wrote:
> >
> > An upcoming patch changes and complicates the refcounting and
> > especially the "put page" aspects of it. In order to keep
> > everything clean, refactor the devmap page release routines:
> >
> > * Rename put_devmap_managed_page() to page_is_devmap_managed(),
> > and limit the functionality to "read only": return a bool,
> > with no side effects.
> >
> > * Add a new routine, put_devmap_managed_page(), to handle checking
> > what kind of page it is, and what kind of refcount handling it
> > requires.
> >
> > * Rename __put_devmap_managed_page() to free_devmap_managed_page(),
> > and limit the functionality to unconditionally freeing a devmap
> > page.
> >
> > This is originally based on a separate patch by Ira Weiny, which
> > applied to an early version of the put_user_page() experiments.
> > Since then, Jérôme Glisse suggested the refactoring described above.
> >
> > Suggested-by: Jérôme Glisse <[email protected]>
> > Signed-off-by: Ira Weiny <[email protected]>
> > Signed-off-by: John Hubbard <[email protected]>
> > ---
> > include/linux/mm.h | 27 ++++++++++++++++---
> > mm/memremap.c | 67 ++++++++++++++++++++--------------------------
> > 2 files changed, 53 insertions(+), 41 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index a2adf95b3f9c..96228376139c 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -967,9 +967,10 @@ static inline bool is_zone_device_page(const struct page *page)
> > #endif
> >
> > #ifdef CONFIG_DEV_PAGEMAP_OPS
> > -void __put_devmap_managed_page(struct page *page);
> > +void free_devmap_managed_page(struct page *page);
> > DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
> > -static inline bool put_devmap_managed_page(struct page *page)
> > +
> > +static inline bool page_is_devmap_managed(struct page *page)
> > {
> > if (!static_branch_unlikely(&devmap_managed_key))
> > return false;
> > @@ -978,7 +979,6 @@ static inline bool put_devmap_managed_page(struct page *page)
> > switch (page->pgmap->type) {
> > case MEMORY_DEVICE_PRIVATE:
> > case MEMORY_DEVICE_FS_DAX:
> > - __put_devmap_managed_page(page);
> > return true;
> > default:
> > break;
> > @@ -986,6 +986,27 @@ static inline bool put_devmap_managed_page(struct page *page)
> > return false;
> > }
> >
> > +static inline bool put_devmap_managed_page(struct page *page)
> > +{
> > + bool is_devmap = page_is_devmap_managed(page);
> > +
> > + if (is_devmap) {
> > + int count = page_ref_dec_return(page);
> > +
> > + /*
> > + * devmap page refcounts are 1-based, rather than 0-based: if
> > + * refcount is 1, then the page is free and the refcount is
> > + * stable because nobody holds a reference on the page.
> > + */
> > + if (count == 1)
> > + free_devmap_managed_page(page);
> > + else if (!count)
> > + __put_page(page);
> > + }
> > +
> > + return is_devmap;
> > +}
> > +
> > #else /* CONFIG_DEV_PAGEMAP_OPS */
> > static inline bool put_devmap_managed_page(struct page *page)
> > {
> > diff --git a/mm/memremap.c b/mm/memremap.c
> > index 03ccbdfeb697..bc7e2a27d025 100644
> > --- a/mm/memremap.c
> > +++ b/mm/memremap.c
> > @@ -410,48 +410,39 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
> > EXPORT_SYMBOL_GPL(get_dev_pagemap);
> >
> > #ifdef CONFIG_DEV_PAGEMAP_OPS
> > -void __put_devmap_managed_page(struct page *page)
> > +void free_devmap_managed_page(struct page *page)
> > {
> > - int count = page_ref_dec_return(page);
> > + /* Clear Active bit in case of parallel mark_page_accessed */
> > + __ClearPageActive(page);
> > + __ClearPageWaiters(page);
> > +
> > + mem_cgroup_uncharge(page);
>
> Ugh, when did all this HMM specific manipulation sneak into the
> generic ZONE_DEVICE path? It used to be gated by pgmap type with its
> own put_zone_device_private_page(). For example it's certainly
> unnecessary and might be broken (would need to check) to call
> mem_cgroup_uncharge() on a DAX page. ZONE_DEVICE users are not a
> monolith and the HMM use case leaks pages into code paths that DAX
> explicitly avoids.

It's been this way for a while and I did not react previously,
apologies for that. I think __ClearPageActive, __ClearPageWaiters, and
mem_cgroup_uncharge belong behind a device-private conditional. The
history here is:

Move some, but not all HMM specifics to hmm_devmem_free():
2fa147bdbf67 mm, dev_pagemap: Do not clear ->mapping on final put

Remove the clearing of mapping since no upstream consumers needed it:
b7a523109fb5 mm: don't clear ->mapping in hmm_devmem_free

Add it back in once an upstream consumer arrived:
7ab0ad0e74f8 mm/hmm: fix ZONE_DEVICE anon page mapping reuse

We're now almost entirely free of ->page_free callbacks except for
that weird nouveau case. Can that FIXME in nouveau_dmem_page_free()
also result in killing the ->page_free() callback altogether? In the
meantime I'm proposing a cleanup like this:

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index ad8e4df1282b..4eae441f86c9 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -337,13 +337,7 @@ static void pmem_release_disk(void *__pmem)
put_disk(pmem->disk);
}

-static void pmem_pagemap_page_free(struct page *page)
-{
- wake_up_var(&page->_refcount);
-}
-
static const struct dev_pagemap_ops fsdax_pagemap_ops = {
- .page_free = pmem_pagemap_page_free,
.kill = pmem_pagemap_kill,
.cleanup = pmem_pagemap_cleanup,
};
diff --git a/mm/memremap.c b/mm/memremap.c
index 03ccbdfeb697..157edb8f7cf8 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -419,12 +419,6 @@ void __put_devmap_managed_page(struct page *page)
* holds a reference on the page.
*/
if (count == 1) {
- /* Clear Active bit in case of parallel mark_page_accessed */
- __ClearPageActive(page);
- __ClearPageWaiters(page);
-
- mem_cgroup_uncharge(page);
-
/*
* When a device_private page is freed, the page->mapping field
* may still contain a (stale) mapping value. For example, the
@@ -446,10 +440,17 @@ void __put_devmap_managed_page(struct page *page)
* handled differently or not done at all, so there is no need
* to clear page->mapping.
*/
- if (is_device_private_page(page))
- page->mapping = NULL;
+ if (is_device_private_page(page)) {
+ /* Clear Active bit in case of parallel
mark_page_accessed */
+ __ClearPageActive(page);
+ __ClearPageWaiters(page);

- page->pgmap->ops->page_free(page);
+ mem_cgroup_uncharge(page);
+
+ page->mapping = NULL;
+ page->pgmap->ops->page_free(page);
+ } else
+ wake_up_var(&page->_refcount);
} else if (!count)
__put_page(page);
}
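
For reference, the gate this cleanup relies on, is_device_private_page(),
reduces at this point to a pgmap type check, roughly (per mainline of this
era; later kernels add further config gating):

static inline bool is_device_private_page(const struct page *page)
{
	return is_zone_device_page(page) &&
		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
}

With the cleanup applied, the LRU and memcg teardown therefore runs only for
MEMORY_DEVICE_PRIVATE (HMM) pages, while fsdax pages fall through to the
wake_up_var() wakeup that pmem previously performed in its ->page_free()
callback.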

2019-11-13 22:52:05

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v4 04/23] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

On 11/13/19 2:00 PM, Dan Williams wrote:
...
>> Ugh, when did all this HMM specific manipulation sneak into the
>> generic ZONE_DEVICE path? It used to be gated by pgmap type with its
>> own put_zone_device_private_page(). For example it's certainly
>> unnecessary and might be broken (would need to check) to call
>> mem_cgroup_uncharge() on a DAX page. ZONE_DEVICE users are not a
>> monolith and the HMM use case leaks pages into code paths that DAX
>> explicitly avoids.
>
> It's been this way for a while and I did not react previously,
> apologies for that. I think __ClearPageActive, __ClearPageWaiters, and
> mem_cgroup_uncharge, belong behind a device-private conditional. The
> history here is:
>
> Move some, but not all HMM specifics to hmm_devmem_free():
> 2fa147bdbf67 mm, dev_pagemap: Do not clear ->mapping on final put
>
> Remove the clearing of mapping since no upstream consumers needed it:
> b7a523109fb5 mm: don't clear ->mapping in hmm_devmem_free
>
> Add it back in once an upstream consumer arrived:
> 7ab0ad0e74f8 mm/hmm: fix ZONE_DEVICE anon page mapping reuse
>
> We're now almost entirely free of ->page_free callbacks except for
> that weird nouveau case. Can that FIXME in nouveau_dmem_page_free()
> also result in killing the ->page_free() callback altogether? In the
> meantime I'm proposing a cleanup like this:


OK, assuming this is acceptable (no obvious problems jump out at me,
and we can also test it with HMM), then how would you like to proceed, as
far as patches go: add such a patch as part of this series here, or as a
stand-alone patch either before or after this series? Or something else?
And did you plan on sending it out as such?

Also, the diffs didn't quite make it through intact to my "git apply", so
I'm re-posting the diff in hopes that this time it survives:

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index f9f76f6ba07b..21db1ce8c0ae 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -338,13 +338,7 @@ static void pmem_release_disk(void *__pmem)
put_disk(pmem->disk);
}

-static void pmem_pagemap_page_free(struct page *page)
-{
- wake_up_var(&page->_refcount);
-}
-
static const struct dev_pagemap_ops fsdax_pagemap_ops = {
- .page_free = pmem_pagemap_page_free,
.kill = pmem_pagemap_kill,
.cleanup = pmem_pagemap_cleanup,
};
diff --git a/mm/memremap.c b/mm/memremap.c
index 03ccbdfeb697..157edb8f7cf8 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -419,12 +419,6 @@ void __put_devmap_managed_page(struct page *page)
* holds a reference on the page.
*/
if (count == 1) {
- /* Clear Active bit in case of parallel mark_page_accessed */
- __ClearPageActive(page);
- __ClearPageWaiters(page);
-
- mem_cgroup_uncharge(page);
-
/*
* When a device_private page is freed, the page->mapping field
* may still contain a (stale) mapping value. For example, the
@@ -446,10 +440,17 @@ void __put_devmap_managed_page(struct page *page)
* handled differently or not done at all, so there is no need
* to clear page->mapping.
*/
- if (is_device_private_page(page))
- page->mapping = NULL;
+ if (is_device_private_page(page)) {
+ /* Clear Active bit in case of parallel mark_page_accessed */
+ __ClearPageActive(page);
+ __ClearPageWaiters(page);

- page->pgmap->ops->page_free(page);
+ mem_cgroup_uncharge(page);
+
+ page->mapping = NULL;
+ page->pgmap->ops->page_free(page);
+ } else
+ wake_up_var(&page->_refcount);
} else if (!count)
__put_page(page);
}
--
2.24.0


thanks,
--
John Hubbard
NVIDIA

>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index ad8e4df1282b..4eae441f86c9 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -337,13 +337,7 @@ static void pmem_release_disk(void *__pmem)
> put_disk(pmem->disk);
> }
>
> -static void pmem_pagemap_page_free(struct page *page)
> -{
> - wake_up_var(&page->_refcount);
> -}
> -
> static const struct dev_pagemap_ops fsdax_pagemap_ops = {
> - .page_free = pmem_pagemap_page_free,
> .kill = pmem_pagemap_kill,
> .cleanup = pmem_pagemap_cleanup,
> };
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 03ccbdfeb697..157edb8f7cf8 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -419,12 +419,6 @@ void __put_devmap_managed_page(struct page *page)
> * holds a reference on the page.
> */
> if (count == 1) {
> - /* Clear Active bit in case of parallel mark_page_accessed */
> - __ClearPageActive(page);
> - __ClearPageWaiters(page);
> -
> - mem_cgroup_uncharge(page);
> -
> /*
> * When a device_private page is freed, the page->mapping field
> * may still contain a (stale) mapping value. For example, the
> @@ -446,10 +440,17 @@ void __put_devmap_managed_page(struct page *page)
> * handled differently or not done at all, so there is no need
> * to clear page->mapping.
> */
> - if (is_device_private_page(page))
> - page->mapping = NULL;
> + if (is_device_private_page(page)) {
> + /* Clear Active bit in case of parallel mark_page_accessed */
> + __ClearPageActive(page);
> + __ClearPageWaiters(page);
>
> - page->pgmap->ops->page_free(page);
> + mem_cgroup_uncharge(page);
> +
> + page->mapping = NULL;
> + page->pgmap->ops->page_free(page);
> + } else
> + wake_up_var(&page->_refcount);
> } else if (!count)
> __put_page(page);
> }
>

2019-11-13 22:57:07

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v4 04/23] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

On Wed, Nov 13, 2019 at 2:49 PM John Hubbard <[email protected]> wrote:
>
> On 11/13/19 2:00 PM, Dan Williams wrote:
> ...
> >> Ugh, when did all this HMM specific manipulation sneak into the
> >> generic ZONE_DEVICE path? It used to be gated by pgmap type with its
> >> own put_zone_device_private_page(). For example it's certainly
> >> unnecessary and might be broken (would need to check) to call
> >> mem_cgroup_uncharge() on a DAX page. ZONE_DEVICE users are not a
> >> monolith and the HMM use case leaks pages into code paths that DAX
> >> explicitly avoids.
> >
> > It's been this way for a while and I did not react previously,
> > apologies for that. I think __ClearPageActive, __ClearPageWaiters, and
> > mem_cgroup_uncharge belong behind a device-private conditional. The
> > history here is:
> >
> > Move some, but not all HMM specifics to hmm_devmem_free():
> > 2fa147bdbf67 mm, dev_pagemap: Do not clear ->mapping on final put
> >
> > Remove the clearing of mapping since no upstream consumers needed it:
> > b7a523109fb5 mm: don't clear ->mapping in hmm_devmem_free
> >
> > Add it back in once an upstream consumer arrived:
> > 7ab0ad0e74f8 mm/hmm: fix ZONE_DEVICE anon page mapping reuse
> >
> > We're now almost entirely free of ->page_free callbacks except for
> > that weird nouveau case. Can that FIXME in nouveau_dmem_page_free()
> > also result in killing the ->page_free() callback altogether? In the
> > meantime I'm proposing a cleanup like this:
>
>
> OK, assuming this is acceptable (no obvious problems jump out at me,
> and we can also test it with HMM), then how would you like to proceed, as
> far as patches go: add such a patch as part of this series here, or as a
> stand-alone patch either before or after this series? Or something else?
> And did you plan on sending it out as such?

I think it makes sense to include it in your series since you're
looking to refactor the implementation. I can send you one based on
current linux-next as a lead-in cleanup before the refactor. Does that
work for you?

>
> Also, the diffs didn't quite make it through intact to my "git apply", so
> I'm re-posting the diff in hopes that this time it survives:

Apologies for that. For quick "how about this" patch examples, I just
copy and paste into Gmail, and it sometimes clobbers them.

>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index f9f76f6ba07b..21db1ce8c0ae 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -338,13 +338,7 @@ static void pmem_release_disk(void *__pmem)
> put_disk(pmem->disk);
> }
>
> -static void pmem_pagemap_page_free(struct page *page)
> -{
> - wake_up_var(&page->_refcount);
> -}
> -
> static const struct dev_pagemap_ops fsdax_pagemap_ops = {
> - .page_free = pmem_pagemap_page_free,
> .kill = pmem_pagemap_kill,
> .cleanup = pmem_pagemap_cleanup,
> };
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 03ccbdfeb697..157edb8f7cf8 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -419,12 +419,6 @@ void __put_devmap_managed_page(struct page *page)
> * holds a reference on the page.
> */
> if (count == 1) {
> - /* Clear Active bit in case of parallel mark_page_accessed */
> - __ClearPageActive(page);
> - __ClearPageWaiters(page);
> -
> - mem_cgroup_uncharge(page);
> -
> /*
> * When a device_private page is freed, the page->mapping field
> * may still contain a (stale) mapping value. For example, the
> @@ -446,10 +440,17 @@ void __put_devmap_managed_page(struct page *page)
> * handled differently or not done at all, so there is no need
> * to clear page->mapping.
> */
> - if (is_device_private_page(page))
> - page->mapping = NULL;
> + if (is_device_private_page(page)) {
> + /* Clear Active bit in case of parallel mark_page_accessed */
> + __ClearPageActive(page);
> + __ClearPageWaiters(page);
>
> - page->pgmap->ops->page_free(page);
> + mem_cgroup_uncharge(page);
> +
> + page->mapping = NULL;
> + page->pgmap->ops->page_free(page);
> + } else
> + wake_up_var(&page->_refcount);
> } else if (!count)
> __put_page(page);
> }
> --
> 2.24.0
>
>
> thanks,
> --
> John Hubbard
> NVIDIA
>
> >
> > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > index ad8e4df1282b..4eae441f86c9 100644
> > --- a/drivers/nvdimm/pmem.c
> > +++ b/drivers/nvdimm/pmem.c
> > @@ -337,13 +337,7 @@ static void pmem_release_disk(void *__pmem)
> > put_disk(pmem->disk);
> > }
> >
> > -static void pmem_pagemap_page_free(struct page *page)
> > -{
> > - wake_up_var(&page->_refcount);
> > -}
> > -
> > static const struct dev_pagemap_ops fsdax_pagemap_ops = {
> > - .page_free = pmem_pagemap_page_free,
> > .kill = pmem_pagemap_kill,
> > .cleanup = pmem_pagemap_cleanup,
> > };
> > diff --git a/mm/memremap.c b/mm/memremap.c
> > index 03ccbdfeb697..157edb8f7cf8 100644
> > --- a/mm/memremap.c
> > +++ b/mm/memremap.c
> > @@ -419,12 +419,6 @@ void __put_devmap_managed_page(struct page *page)
> > * holds a reference on the page.
> > */
> > if (count == 1) {
> > - /* Clear Active bit in case of parallel mark_page_accessed */
> > - __ClearPageActive(page);
> > - __ClearPageWaiters(page);
> > -
> > - mem_cgroup_uncharge(page);
> > -
> > /*
> > * When a device_private page is freed, the page->mapping field
> > * may still contain a (stale) mapping value. For example, the
> > @@ -446,10 +440,17 @@ void __put_devmap_managed_page(struct page *page)
> > * handled differently or not done at all, so there is no need
> > * to clear page->mapping.
> > */
> > - if (is_device_private_page(page))
> > - page->mapping = NULL;
> > + if (is_device_private_page(page)) {
> > + /* Clear Active bit in case of parallel mark_page_accessed */
> > + __ClearPageActive(page);
> > + __ClearPageWaiters(page);
> >
> > - page->pgmap->ops->page_free(page);
> > + mem_cgroup_uncharge(page);
> > +
> > + page->mapping = NULL;
> > + page->pgmap->ops->page_free(page);
> > + } else
> > + wake_up_var(&page->_refcount);
> > } else if (!count)
> > __put_page(page);
> > }
> >

2019-11-13 23:01:07

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v4 04/23] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

On 11/13/19 2:55 PM, Dan Williams wrote:
> On Wed, Nov 13, 2019 at 2:49 PM John Hubbard <[email protected]> wrote:
>>
>> On 11/13/19 2:00 PM, Dan Williams wrote:
>> ...
>>>> Ugh, when did all this HMM specific manipulation sneak into the
>>>> generic ZONE_DEVICE path? It used to be gated by pgmap type with its
>>>> own put_zone_device_private_page(). For example it's certainly
>>>> unnecessary and might be broken (would need to check) to call
>>>> mem_cgroup_uncharge() on a DAX page. ZONE_DEVICE users are not a
>>>> monolith and the HMM use case leaks pages into code paths that DAX
>>>> explicitly avoids.
>>>
>>> It's been this way for a while and I did not react previously,
>>> apologies for that. I think __ClearPageActive, __ClearPageWaiters, and
>>> mem_cgroup_uncharge belong behind a device-private conditional. The
>>> history here is:
>>>
>>> Move some, but not all HMM specifics to hmm_devmem_free():
>>> 2fa147bdbf67 mm, dev_pagemap: Do not clear ->mapping on final put
>>>
>>> Remove the clearing of mapping since no upstream consumers needed it:
>>> b7a523109fb5 mm: don't clear ->mapping in hmm_devmem_free
>>>
>>> Add it back in once an upstream consumer arrived:
>>> 7ab0ad0e74f8 mm/hmm: fix ZONE_DEVICE anon page mapping reuse
>>>
>>> We're now almost entirely free of ->page_free callbacks except for
>>> that weird nouveau case. Can that FIXME in nouveau_dmem_page_free()
>>> also result in killing the ->page_free() callback altogether? In the
>>> meantime I'm proposing a cleanup like this:
>>
>>
>> OK, assuming this is acceptable (no obvious problems jump out at me,
>> and we can also test it with HMM), then how would you like to proceed, as
>> far as patches go: add such a patch as part of this series here, or as a
>> stand-alone patch either before or after this series? Or something else?
>> And did you plan on sending it out as such?
>
> I think it makes sense to include it in your series since you're
> looking to refactor the implementation. I can send you one based on
> current linux-next as a lead-in cleanup before the refactor. Does that
> work for you?
>

That would be perfect!

>>
>> Also, the diffs didn't quite make it through intact to my "git apply", so
>> I'm re-posting the diff in hopes that this time it survives:
>
> Apologies for that. For quick "how about this" patch examples, I just
> copy and paste into Gmail, and it sometimes clobbers them.
>

No problem at all, I do the same thing and *usually* it works, ha. And
as you say, it's good enough for discussions.


thanks,
--
John Hubbard
NVIDIA

2019-11-13 23:05:17

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v4 04/23] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

On Wed, Nov 13, 2019 at 02:00:06PM -0800, Dan Williams wrote:
> On Wed, Nov 13, 2019 at 11:23 AM Dan Williams <[email protected]> wrote:
> >
> > On Tue, Nov 12, 2019 at 8:27 PM John Hubbard <[email protected]> wrote:
> > >
> > > An upcoming patch changes and complicates the refcounting and
> > > especially the "put page" aspects of it. In order to keep
> > > everything clean, refactor the devmap page release routines:
> > >
> > > * Rename put_devmap_managed_page() to page_is_devmap_managed(),
> > > and limit the functionality to "read only": return a bool,
> > > with no side effects.
> > >
> > > * Add a new routine, put_devmap_managed_page(), to handle checking
> > > what kind of page it is, and what kind of refcount handling it
> > > requires.
> > >
> > > * Rename __put_devmap_managed_page() to free_devmap_managed_page(),
> > > and limit the functionality to unconditionally freeing a devmap
> > > page.
> > >
> > > This is originally based on a separate patch by Ira Weiny, which
> > > applied to an early version of the put_user_page() experiments.
> > > Since then, Jérôme Glisse suggested the refactoring described above.
> > >
> > > Suggested-by: Jérôme Glisse <[email protected]>
> > > Signed-off-by: Ira Weiny <[email protected]>
> > > Signed-off-by: John Hubbard <[email protected]>
> > > ---
> > > include/linux/mm.h | 27 ++++++++++++++++---
> > > mm/memremap.c | 67 ++++++++++++++++++++--------------------------
> > > 2 files changed, 53 insertions(+), 41 deletions(-)
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index a2adf95b3f9c..96228376139c 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -967,9 +967,10 @@ static inline bool is_zone_device_page(const struct page *page)
> > > #endif
> > >
> > > #ifdef CONFIG_DEV_PAGEMAP_OPS
> > > -void __put_devmap_managed_page(struct page *page);
> > > +void free_devmap_managed_page(struct page *page);
> > > DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
> > > -static inline bool put_devmap_managed_page(struct page *page)
> > > +
> > > +static inline bool page_is_devmap_managed(struct page *page)
> > > {
> > > if (!static_branch_unlikely(&devmap_managed_key))
> > > return false;
> > > @@ -978,7 +979,6 @@ static inline bool put_devmap_managed_page(struct page *page)
> > > switch (page->pgmap->type) {
> > > case MEMORY_DEVICE_PRIVATE:
> > > case MEMORY_DEVICE_FS_DAX:
> > > - __put_devmap_managed_page(page);
> > > return true;
> > > default:
> > > break;
> > > @@ -986,6 +986,27 @@ static inline bool put_devmap_managed_page(struct page *page)
> > > return false;
> > > }
> > >
> > > +static inline bool put_devmap_managed_page(struct page *page)
> > > +{
> > > + bool is_devmap = page_is_devmap_managed(page);
> > > +
> > > + if (is_devmap) {
> > > + int count = page_ref_dec_return(page);
> > > +
> > > + /*
> > > + * devmap page refcounts are 1-based, rather than 0-based: if
> > > + * refcount is 1, then the page is free and the refcount is
> > > + * stable because nobody holds a reference on the page.
> > > + */
> > > + if (count == 1)
> > > + free_devmap_managed_page(page);
> > > + else if (!count)
> > > + __put_page(page);
> > > + }
> > > +
> > > + return is_devmap;
> > > +}
> > > +
> > > #else /* CONFIG_DEV_PAGEMAP_OPS */
> > > static inline bool put_devmap_managed_page(struct page *page)
> > > {
> > > diff --git a/mm/memremap.c b/mm/memremap.c
> > > index 03ccbdfeb697..bc7e2a27d025 100644
> > > --- a/mm/memremap.c
> > > +++ b/mm/memremap.c
> > > @@ -410,48 +410,39 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
> > > EXPORT_SYMBOL_GPL(get_dev_pagemap);
> > >
> > > #ifdef CONFIG_DEV_PAGEMAP_OPS
> > > -void __put_devmap_managed_page(struct page *page)
> > > +void free_devmap_managed_page(struct page *page)
> > > {
> > > - int count = page_ref_dec_return(page);
> > > + /* Clear Active bit in case of parallel mark_page_accessed */
> > > + __ClearPageActive(page);
> > > + __ClearPageWaiters(page);
> > > +
> > > + mem_cgroup_uncharge(page);
> >
> > Ugh, when did all this HMM specific manipulation sneak into the
> > generic ZONE_DEVICE path? It used to be gated by pgmap type with its
> > own put_zone_device_private_page(). For example it's certainly
> > unnecessary and might be broken (would need to check) to call
> > mem_cgroup_uncharge() on a DAX page. ZONE_DEVICE users are not a
> > monolith and the HMM use case leaks pages into code paths that DAX
> > explicitly avoids.
>
> It's been this way for a while and I did not react previously,
> apologies for that. I think __ClearPageActive, __ClearPageWaiters, and
> mem_cgroup_uncharge belong behind a device-private conditional. The
> history here is:
>
> Move some, but not all HMM specifics to hmm_devmem_free():
> 2fa147bdbf67 mm, dev_pagemap: Do not clear ->mapping on final put
>
> Remove the clearing of mapping since no upstream consumers needed it:
> b7a523109fb5 mm: don't clear ->mapping in hmm_devmem_free
>
> Add it back in once an upstream consumer arrived:
> 7ab0ad0e74f8 mm/hmm: fix ZONE_DEVICE anon page mapping reuse
>
> We're now almost entirely free of ->page_free callbacks except for
> that weird nouveau case. Can that FIXME in nouveau_dmem_page_free()
> also result in killing the ->page_free() callback altogether? In the
> meantime I'm proposing a cleanup like this:

No, we need the callback; the cleanup looks good.

>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index ad8e4df1282b..4eae441f86c9 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -337,13 +337,7 @@ static void pmem_release_disk(void *__pmem)
> put_disk(pmem->disk);
> }
>
> -static void pmem_pagemap_page_free(struct page *page)
> -{
> - wake_up_var(&page->_refcount);
> -}
> -
> static const struct dev_pagemap_ops fsdax_pagemap_ops = {
> - .page_free = pmem_pagemap_page_free,
> .kill = pmem_pagemap_kill,
> .cleanup = pmem_pagemap_cleanup,
> };
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 03ccbdfeb697..157edb8f7cf8 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -419,12 +419,6 @@ void __put_devmap_managed_page(struct page *page)
> * holds a reference on the page.
> */
> if (count == 1) {
> - /* Clear Active bit in case of parallel mark_page_accessed */
> - __ClearPageActive(page);
> - __ClearPageWaiters(page);
> -
> - mem_cgroup_uncharge(page);
> -
> /*
> * When a device_private page is freed, the page->mapping field
> * may still contain a (stale) mapping value. For example, the
> @@ -446,10 +440,17 @@ void __put_devmap_managed_page(struct page *page)
> * handled differently or not done at all, so there is no need
> * to clear page->mapping.
> */
> - if (is_device_private_page(page))
> - page->mapping = NULL;
> + if (is_device_private_page(page)) {
> + /* Clear Active bit in case of parallel mark_page_accessed */
> + __ClearPageActive(page);
> + __ClearPageWaiters(page);
>
> - page->pgmap->ops->page_free(page);
> + mem_cgroup_uncharge(page);
> +
> + page->mapping = NULL;
> + page->pgmap->ops->page_free(page);
> + } else
> + wake_up_var(&page->_refcount);
> } else if (!count)
> __put_page(page);
> }
>