Subject: [PATCH v1 0/2] mm: remove extra ZONE_DEVICE struct page refcount

This patch cleans up ZONE_DEVICE page refcounting. Quoting Matthew
Wilcox, "it removes a ton of cruft from every call to put_page()"
This work was originally done by Ralph Campbell and submitted as RFC.
As of today, it has been ack by Theodore Ts'o / Darrick J. Wong, and
reviewed by Christoph Hellwig.
https://lore.kernel.org/linux-mm/[email protected]/

"MEMORY_DEVICE_COHERENT for CPU-accessible coherent device memory"
patchset depends on this patchset.
https://lore.kernel.org/linux-mm/[email protected]/

Ralph Campbell (2):
ext4/xfs: add page refcount helper
mm: remove extra ZONE_DEVICE struct page refcount

arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +-
drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
fs/dax.c | 8 +--
fs/ext4/inode.c | 5 +-
fs/fuse/dax.c | 4 +-
fs/xfs/xfs_file.c | 4 +-
include/linux/dax.h | 10 ++++
include/linux/memremap.h | 7 +--
include/linux/mm.h | 11 ----
lib/test_hmm.c | 2 +-
mm/internal.h | 8 +++
mm/memcontrol.c | 6 +--
mm/memremap.c | 69 +++++++-------------------
mm/migrate.c | 5 --
mm/page_alloc.c | 3 ++
mm/swap.c | 45 ++---------------
16 files changed, 60 insertions(+), 131 deletions(-)

--
2.32.0


Subject: [PATCH v1 1/2] ext4/xfs: add page refcount helper

From: Ralph Campbell <[email protected]>

There are several places where ZONE_DEVICE struct pages assume a reference
count == 1 means the page is idle and free. Instead of open coding this,
add a helper function to hide this detail.

Signed-off-by: Ralph Campbell <[email protected]>
Signed-off-by: Alex Sierra <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Theodore Ts'o <[email protected]>
Acked-by: Darrick J. Wong <[email protected]>
---
v3:
[AS]: rename dax_layout_is_idle_page func to dax_page_unused

v4:
[AS]: This ref count functionality was missing on fuse/dax.c.
---
fs/dax.c | 4 ++--
fs/ext4/inode.c | 5 +----
fs/fuse/dax.c | 4 +---
fs/xfs/xfs_file.c | 4 +---
include/linux/dax.h | 10 ++++++++++
5 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 62352cbcf0f4..c387d09e3e5a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -369,7 +369,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);

- WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+ WARN_ON_ONCE(trunc && !dax_page_unused(page));
WARN_ON_ONCE(page->mapping && page->mapping != mapping);
page->mapping = NULL;
page->index = 0;
@@ -383,7 +383,7 @@ static struct page *dax_busy_page(void *entry)
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);

- if (page_ref_count(page) > 1)
+ if (!dax_page_unused(page))
return page;
}
return NULL;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index fe6045a46599..05ffe6875cb1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3971,10 +3971,7 @@ int ext4_break_layouts(struct inode *inode)
if (!page)
return 0;

- error = ___wait_var_event(&page->_refcount,
- atomic_read(&page->_refcount) == 1,
- TASK_INTERRUPTIBLE, 0, 0,
- ext4_wait_dax_page(ei));
+ error = dax_wait_page(ei, page, ext4_wait_dax_page);
} while (error == 0);

return error;
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index ff99ab2a3c43..2b1f190ba78a 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -677,9 +677,7 @@ static int __fuse_dax_break_layouts(struct inode *inode, bool *retry,
return 0;

*retry = true;
- return ___wait_var_event(&page->_refcount,
- atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
- 0, 0, fuse_wait_dax_page(inode));
+ return dax_wait_page(inode, page, fuse_wait_dax_page);
}

/* dmap_end == 0 leads to unmapping of whole file */
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 396ef36dcd0a..182057281086 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -840,9 +840,7 @@ xfs_break_dax_layouts(
return 0;

*retry = true;
- return ___wait_var_event(&page->_refcount,
- atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
- 0, 0, xfs_wait_dax_page(inode));
+ return dax_wait_page(inode, page, xfs_wait_dax_page);
}

int
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b52f084aa643..8b5da1d60dbc 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -243,6 +243,16 @@ static inline bool dax_mapping(struct address_space *mapping)
return mapping->host && IS_DAX(mapping->host);
}

+static inline bool dax_page_unused(struct page *page)
+{
+ return page_ref_count(page) == 1;
+}
+
+#define dax_wait_page(_inode, _page, _wait_cb) \
+ ___wait_var_event(&(_page)->_refcount, \
+ dax_page_unused(_page), \
+ TASK_INTERRUPTIBLE, 0, 0, _wait_cb(_inode))
+
#ifdef CONFIG_DEV_DAX_HMEM_DEVICES
void hmem_register_device(int target_nid, struct resource *r);
#else
--
2.32.0

Subject: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

From: Ralph Campbell <[email protected]>

ZONE_DEVICE struct pages have an extra reference count that complicates the
code for put_page() and several places in the kernel that need to check the
reference count to see that a page is not being used (gup, compaction,
migration, etc.). Clean up the code so the reference count doesn't need to
be treated specially for ZONE_DEVICE.

Signed-off-by: Ralph Campbell <[email protected]>
Signed-off-by: Alex Sierra <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
v2:
AS: merged this patch in linux 5.11 version

v5:
AS: add condition at try_grab_page to check for the zone device type, while
page ref counter is checked less/equal to zero. In case of device zone, pages
ref counter are initialized to zero.

v7:
AS: fix condition at try_grab_page added at v5, is invalid. It supposed
to fix xfstests/generic/413 test, however, there's a known issue on
this test where DAX mapped area DIO to non-DAX expect to fail.
https://patchwork.kernel.org/project/fstests/patch/[email protected]
This condition was removed after rebase over patch series
https://lore.kernel.org/r/[email protected]
---
arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +-
drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
fs/dax.c | 4 +-
include/linux/dax.h | 2 +-
include/linux/memremap.h | 7 +--
include/linux/mm.h | 11 ----
lib/test_hmm.c | 2 +-
mm/internal.h | 8 +++
mm/memcontrol.c | 6 +--
mm/memremap.c | 69 +++++++-------------------
mm/migrate.c | 5 --
mm/page_alloc.c | 3 ++
mm/swap.c | 45 ++---------------
13 files changed, 46 insertions(+), 120 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 84e5a2dc8be5..acee67710620 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -711,7 +711,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)

dpage = pfn_to_page(uvmem_pfn);
dpage->zone_device_data = pvt;
- get_page(dpage);
+ init_page_count(dpage);
lock_page(dpage);
return dpage;
out_clear:
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 92987daa5e17..8bc7120e1216 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -324,7 +324,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
return NULL;
}

- get_page(page);
+ init_page_count(page);
lock_page(page);
return page;
}
diff --git a/fs/dax.c b/fs/dax.c
index c387d09e3e5a..1166630b7190 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -571,14 +571,14 @@ static void *grab_mapping_entry(struct xa_state *xas,

/**
* dax_layout_busy_page_range - find first pinned page in @mapping
- * @mapping: address space to scan for a page with ref count > 1
+ * @mapping: address space to scan for a page with ref count > 0
* @start: Starting offset. Page containing 'start' is included.
* @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX,
* pages from 'start' till the end of file are included.
*
* DAX requires ZONE_DEVICE mapped pages. These pages are never
* 'onlined' to the page allocator so they are considered idle when
- * page->count == 1. A filesystem uses this interface to determine if
+ * page->count == 0. A filesystem uses this interface to determine if
* any page in the mapping is busy, i.e. for DMA, or other
* get_user_pages() usages.
*
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 8b5da1d60dbc..05fc982ce153 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -245,7 +245,7 @@ static inline bool dax_mapping(struct address_space *mapping)

static inline bool dax_page_unused(struct page *page)
{
- return page_ref_count(page) == 1;
+ return page_ref_count(page) == 0;
}

#define dax_wait_page(_inode, _page, _wait_cb) \
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 45a79da89c5f..77ff5fd0685f 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -66,9 +66,10 @@ enum memory_type {

struct dev_pagemap_ops {
/*
- * Called once the page refcount reaches 1. (ZONE_DEVICE pages never
- * reach 0 refcount unless there is a refcount bug. This allows the
- * device driver to implement its own memory management.)
+ * Called once the page refcount reaches 0. The reference count
+ * should be reset to one with init_page_count(page) before reusing
+ * the page. This allows the device driver to implement its own
+ * memory management.
*/
void (*page_free)(struct page *page);

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d8f98d652164..e24c904deeec 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1220,17 +1220,6 @@ static inline void put_page(struct page *page)
{
page = compound_head(page);

- /*
- * For devmap managed pages we need to catch refcount transition from
- * 2 to 1, when refcount reach one it means the page is free and we
- * need to inform the device driver through callback. See
- * include/linux/memremap.h and HMM for details.
- */
- if (page_is_devmap_managed(page)) {
- put_devmap_managed_page(page);
- return;
- }
-
if (put_page_testzero(page))
__put_page(page);
}
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 80a78877bd93..6998f10350ea 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -561,7 +561,7 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
}

dpage->zone_device_data = rpage;
- get_page(dpage);
+ init_page_count(dpage);
lock_page(dpage);
return dpage;

diff --git a/mm/internal.h b/mm/internal.h
index e8fdb531f887..5438cceca4b9 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -667,4 +667,12 @@ int vmap_pages_range_noflush(unsigned long addr, unsigned long end,

void vunmap_range_noflush(unsigned long start, unsigned long end);

+#ifdef CONFIG_DEV_PAGEMAP_OPS
+void free_zone_device_page(struct page *page);
+#else
+static inline void free_zone_device_page(struct page *page)
+{
+}
+#endif
+
#endif /* __MM_INTERNAL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 64ada9e650a5..9a6bfb4fd36c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5350,11 +5350,7 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
*/
if (is_device_private_entry(ent)) {
page = device_private_entry_to_page(ent);
- /*
- * MEMORY_DEVICE_PRIVATE means ZONE_DEVICE page and which have
- * a refcount of 1 when free (unlike normal page)
- */
- if (!page_ref_add_unless(page, 1, 1))
+ if (!get_page_unless_zero(page))
return NULL;
return page;
}
diff --git a/mm/memremap.c b/mm/memremap.c
index 15a074ffb8d7..ab949a571e78 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -12,6 +12,7 @@
#include <linux/types.h>
#include <linux/wait_bit.h>
#include <linux/xarray.h>
+#include "internal.h"

static DEFINE_XARRAY(pgmap_array);

@@ -37,32 +38,6 @@ unsigned long memremap_compat_align(void)
EXPORT_SYMBOL_GPL(memremap_compat_align);
#endif

-#ifdef CONFIG_DEV_PAGEMAP_OPS
-DEFINE_STATIC_KEY_FALSE(devmap_managed_key);
-EXPORT_SYMBOL(devmap_managed_key);
-
-static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
-{
- if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
- pgmap->type == MEMORY_DEVICE_FS_DAX)
- static_branch_dec(&devmap_managed_key);
-}
-
-static void devmap_managed_enable_get(struct dev_pagemap *pgmap)
-{
- if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
- pgmap->type == MEMORY_DEVICE_FS_DAX)
- static_branch_inc(&devmap_managed_key);
-}
-#else
-static void devmap_managed_enable_get(struct dev_pagemap *pgmap)
-{
-}
-static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
-{
-}
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
-
static void pgmap_array_delete(struct range *range)
{
xa_store_range(&pgmap_array, PHYS_PFN(range->start), PHYS_PFN(range->end),
@@ -102,16 +77,6 @@ static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
return (range->start + range_len(range)) >> PAGE_SHIFT;
}

-static unsigned long pfn_next(unsigned long pfn)
-{
- if (pfn % 1024 == 0)
- cond_resched();
- return pfn + 1;
-}
-
-#define for_each_device_pfn(pfn, map, i) \
- for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(pfn))
-
static void dev_pagemap_kill(struct dev_pagemap *pgmap)
{
if (pgmap->ops && pgmap->ops->kill)
@@ -167,20 +132,18 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)

void memunmap_pages(struct dev_pagemap *pgmap)
{
- unsigned long pfn;
int i;

dev_pagemap_kill(pgmap);
for (i = 0; i < pgmap->nr_range; i++)
- for_each_device_pfn(pfn, pgmap, i)
- put_page(pfn_to_page(pfn));
+ percpu_ref_put_many(pgmap->ref, pfn_end(pgmap, i) -
+ pfn_first(pgmap, i));
dev_pagemap_cleanup(pgmap);

for (i = 0; i < pgmap->nr_range; i++)
pageunmap_range(pgmap, i);

WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n");
- devmap_managed_enable_put(pgmap);
}
EXPORT_SYMBOL_GPL(memunmap_pages);

@@ -382,8 +345,6 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
}
}

- devmap_managed_enable_get(pgmap);
-
/*
* Clear the pgmap nr_range as it will be incremented for each
* successfully processed range. This communicates how many
@@ -498,16 +459,9 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
EXPORT_SYMBOL_GPL(get_dev_pagemap);

#ifdef CONFIG_DEV_PAGEMAP_OPS
-void free_devmap_managed_page(struct page *page)
+static void free_device_page(struct page *page)
{
- /* notify page idle for dax */
- if (!is_device_private_page(page)) {
- wake_up_var(&page->_refcount);
- return;
- }
-
__ClearPageWaiters(page);
-
mem_cgroup_uncharge(page);

/*
@@ -534,4 +488,19 @@ void free_devmap_managed_page(struct page *page)
page->mapping = NULL;
page->pgmap->ops->page_free(page);
}
+
+void free_zone_device_page(struct page *page)
+{
+ switch (page->pgmap->type) {
+ case MEMORY_DEVICE_PRIVATE:
+ free_device_page(page);
+ return;
+ case MEMORY_DEVICE_FS_DAX:
+ /* notify page idle */
+ wake_up_var(&page->_refcount);
+ return;
+ default:
+ return;
+ }
+}
#endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/migrate.c b/mm/migrate.c
index 41ff2c9896c4..e3a10e2a1bb3 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -350,11 +350,6 @@ static int expected_page_refs(struct address_space *mapping, struct page *page)
{
int expected_count = 1;

- /*
- * Device private pages have an extra refcount as they are
- * ZONE_DEVICE pages.
- */
- expected_count += is_device_private_page(page);
if (mapping)
expected_count += thp_nr_pages(page) + page_has_private(page);

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ef2265f86b91..1ef1f733af5b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6414,6 +6414,9 @@ void __ref memmap_init_zone_device(struct zone *zone,

__init_single_page(page, pfn, zone_idx, nid);

+ /* ZONE_DEVICE pages start with a zero reference count. */
+ set_page_count(page, 0);
+
/*
* Mark page reserved as it will need to wait for onlining
* phase for it to be fully associated with a zone.
diff --git a/mm/swap.c b/mm/swap.c
index dfb48cf9c2c9..9e821f1951c5 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -114,12 +114,11 @@ static void __put_compound_page(struct page *page)
void __put_page(struct page *page)
{
if (is_zone_device_page(page)) {
- put_dev_pagemap(page->pgmap);
-
/*
* The page belongs to the device that created pgmap. Do
* not return it to page allocator.
*/
+ free_zone_device_page(page);
return;
}

@@ -917,29 +916,18 @@ void release_pages(struct page **pages, int nr)
if (is_huge_zero_page(page))
continue;

+ if (!put_page_testzero(page))
+ continue;
+
if (is_zone_device_page(page)) {
if (lruvec) {
unlock_page_lruvec_irqrestore(lruvec, flags);
lruvec = NULL;
}
- /*
- * ZONE_DEVICE pages that return 'false' from
- * page_is_devmap_managed() do not require special
- * processing, and instead, expect a call to
- * put_page_testzero().
- */
- if (page_is_devmap_managed(page)) {
- put_devmap_managed_page(page);
- continue;
- }
- if (put_page_testzero(page))
- put_dev_pagemap(page->pgmap);
+ free_zone_device_page(page);
continue;
}

- if (!put_page_testzero(page))
- continue;
-
if (PageCompound(page)) {
if (lruvec) {
unlock_page_lruvec_irqrestore(lruvec, flags);
@@ -1143,26 +1131,3 @@ void __init swap_setup(void)
* _really_ don't want to cluster much more
*/
}
-
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-void put_devmap_managed_page(struct page *page)
-{
- int count;
-
- if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
- return;
-
- count = page_ref_dec_return(page);
-
- /*
- * devmap page refcounts are 1-based, rather than 0-based: if
- * refcount is 1, then the page is free and the refcount is
- * stable because nobody holds a reference on the page.
- */
- if (count == 1)
- free_devmap_managed_page(page);
- else if (!count)
- __put_page(page);
-}
-EXPORT_SYMBOL(put_devmap_managed_page);
-#endif
--
2.32.0

2021-10-14 18:47:16

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 1/2] ext4/xfs: add page refcount helper

On Thu, Oct 14, 2021 at 10:39:27AM -0500, Alex Sierra wrote:
> From: Ralph Campbell <[email protected]>
>
> There are several places where ZONE_DEVICE struct pages assume a reference
> count == 1 means the page is idle and free. Instead of open coding this,
> add a helper function to hide this detail.
>
> Signed-off-by: Ralph Campbell <[email protected]>
> Signed-off-by: Alex Sierra <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Theodore Ts'o <[email protected]>
> Acked-by: Darrick J. Wong <[email protected]>
> ---
> v3:
> [AS]: rename dax_layout_is_idle_page func to dax_page_unused
>
> v4:
> [AS]: This ref count functionality was missing on fuse/dax.c.
> ---
> fs/dax.c | 4 ++--
> fs/ext4/inode.c | 5 +----
> fs/fuse/dax.c | 4 +---
> fs/xfs/xfs_file.c | 4 +---
> include/linux/dax.h | 10 ++++++++++
> 5 files changed, 15 insertions(+), 12 deletions(-)

Reviewed-by: Jason Gunthorpe <[email protected]>

Jason

2021-10-14 18:51:36

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v1 1/2] ext4/xfs: add page refcount helper

On Thu, Oct 14, 2021 at 10:39:27AM -0500, Alex Sierra wrote:
> From: Ralph Campbell <[email protected]>
>
> There are several places where ZONE_DEVICE struct pages assume a reference
> count == 1 means the page is idle and free. Instead of open coding this,
> add a helper function to hide this detail.
>
> Signed-off-by: Ralph Campbell <[email protected]>
> Signed-off-by: Alex Sierra <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Acked-by: Theodore Ts'o <[email protected]>
> Acked-by: Darrick J. Wong <[email protected]>

Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>

2021-10-14 18:54:34

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

On Thu, Oct 14, 2021 at 10:39:28AM -0500, Alex Sierra wrote:
> From: Ralph Campbell <[email protected]>
>
> ZONE_DEVICE struct pages have an extra reference count that complicates the
> code for put_page() and several places in the kernel that need to check the
> reference count to see that a page is not being used (gup, compaction,
> migration, etc.). Clean up the code so the reference count doesn't need to
> be treated specially for ZONE_DEVICE.
>
> Signed-off-by: Ralph Campbell <[email protected]>
> Signed-off-by: Alex Sierra <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>

Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>

2021-10-14 18:56:56

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

On Thu, Oct 14, 2021 at 10:39:28AM -0500, Alex Sierra wrote:
> From: Ralph Campbell <[email protected]>
>
> ZONE_DEVICE struct pages have an extra reference count that complicates the
> code for put_page() and several places in the kernel that need to check the
> reference count to see that a page is not being used (gup, compaction,
> migration, etc.). Clean up the code so the reference count doesn't need to
> be treated specially for ZONE_DEVICE.
>
> Signed-off-by: Ralph Campbell <[email protected]>
> Signed-off-by: Alex Sierra <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> ---
> v2:
> AS: merged this patch in linux 5.11 version
>
> v5:
> AS: add condition at try_grab_page to check for the zone device type, while
> page ref counter is checked less/equal to zero. In case of device zone, pages
> ref counter are initialized to zero.
>
> v7:
> AS: fix condition at try_grab_page added at v5, is invalid. It supposed
> to fix xfstests/generic/413 test, however, there's a known issue on
> this test where DAX mapped area DIO to non-DAX expect to fail.
> https://patchwork.kernel.org/project/fstests/patch/[email protected]
> This condition was removed after rebase over patch series
> https://lore.kernel.org/r/[email protected]
> ---
> arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +-
> drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
> fs/dax.c | 4 +-
> include/linux/dax.h | 2 +-
> include/linux/memremap.h | 7 +--
> include/linux/mm.h | 11 ----
> lib/test_hmm.c | 2 +-
> mm/internal.h | 8 +++
> mm/memcontrol.c | 6 +--
> mm/memremap.c | 69 +++++++-------------------
> mm/migrate.c | 5 --
> mm/page_alloc.c | 3 ++
> mm/swap.c | 45 ++---------------
> 13 files changed, 46 insertions(+), 120 deletions(-)

Has anyone tested this with FSDAX? Does get_user_pages() on fsdax
backed memory still work?

What refcount value does the struct pages have when they are installed
in the PTEs? Remember a 0 refcount will make all the get_user_pages()
fail.

I'm looking at the call path starting in ext4_punch_hole() and I would
expect to see something manipulating the page ref count before
the ext4_break_layouts() call path gets to the dax_page_unused() test.

All I see is we go into unmap_mapping_pages() - that would normally
put back the page references held by PTEs but insert_pfn() has this:

if (pfn_t_devmap(pfn))
entry = pte_mkdevmap(pfn_t_pte(pfn, prot));

And:

static inline pte_t pte_mkdevmap(pte_t pte)
{
return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP);
}

Which interacts with vm_normal_page():

if (pte_devmap(pte))
return NULL;

To disable that refcounting?

So... I have a feeling this will have PTEs pointing to 0 refcount
pages? Unless FSDAX is !pte_devmap which is not the case, right?

This seems further confirmed by this comment:

/*
* If we race get_user_pages_fast() here either we'll see the
* elevated page count in the iteration and wait, or
* get_user_pages_fast() will see that the page it took a reference
* against is no longer mapped in the page tables and bail to the
* get_user_pages() slow path. The slow path is protected by
* pte_lock() and pmd_lock(). New references are not taken without
* holding those locks, and unmap_mapping_pages() will not zero the
* pte or pmd without holding the respective lock, so we are
* guaranteed to either see new references or prevent new
* references from being established.
*/

Which seems to explain this scheme relies on unmap_mapping_pages() to
fence GUP_fast, not on GUP_fast observing 0 refcounts when it should
stop.

This seems like it would be properly fixed by using normal page
refcounting for PTEs - ie stop using special for these pages?

Does anyone know why devmap is pte_special anyhow?

> +void free_zone_device_page(struct page *page)
> +{
> + switch (page->pgmap->type) {
> + case MEMORY_DEVICE_PRIVATE:
> + free_device_page(page);
> + return;
> + case MEMORY_DEVICE_FS_DAX:
> + /* notify page idle */
> + wake_up_var(&page->_refcount);
> + return;

It is not for this series, but I wonder if we should just always call
ops->page_free and have free_device_page() logic in that callback for
the non-fs-dax cases?

For instance where is the mem_cgroup_charge() call to pair with the
mem_cgroup_uncharge() in free_device_page()?

Isn't cgroup charging (or not) the responsibility of the "allocator"
eg the pgmap_ops owner?

Jason

2021-10-14 19:08:19

by Ralph Campbell

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount


On 10/14/21 10:06 AM, Jason Gunthorpe wrote:
> On Thu, Oct 14, 2021 at 10:39:28AM -0500, Alex Sierra wrote:
>> From: Ralph Campbell <[email protected]>
>>
>> ZONE_DEVICE struct pages have an extra reference count that complicates the
>> code for put_page() and several places in the kernel that need to check the
>> reference count to see that a page is not being used (gup, compaction,
>> migration, etc.). Clean up the code so the reference count doesn't need to
>> be treated specially for ZONE_DEVICE.
>>
>> Signed-off-by: Ralph Campbell <[email protected]>
>> Signed-off-by: Alex Sierra <[email protected]>
>> Reviewed-by: Christoph Hellwig <[email protected]>
>> ---
>> v2:
>> AS: merged this patch in linux 5.11 version
>>
>> v5:
>> AS: add condition at try_grab_page to check for the zone device type, while
>> page ref counter is checked less/equal to zero. In case of device zone, pages
>> ref counter are initialized to zero.
>>
>> v7:
>> AS: fix condition at try_grab_page added at v5, is invalid. It supposed
>> to fix xfstests/generic/413 test, however, there's a known issue on
>> this test where DAX mapped area DIO to non-DAX expect to fail.
>> https://patchwork.kernel.org/project/fstests/patch/[email protected]
>> This condition was removed after rebase over patch series
>> https://lore.kernel.org/r/[email protected]
>> ---
>> arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +-
>> drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
>> fs/dax.c | 4 +-
>> include/linux/dax.h | 2 +-
>> include/linux/memremap.h | 7 +--
>> include/linux/mm.h | 11 ----
>> lib/test_hmm.c | 2 +-
>> mm/internal.h | 8 +++
>> mm/memcontrol.c | 6 +--
>> mm/memremap.c | 69 +++++++-------------------
>> mm/migrate.c | 5 --
>> mm/page_alloc.c | 3 ++
>> mm/swap.c | 45 ++---------------
>> 13 files changed, 46 insertions(+), 120 deletions(-)
> Has anyone tested this with FSDAX? Does get_user_pages() on fsdax
> backed memory still work?

I ran xfstests-dev using the kernel boot option to "fake" a pmem device
when I first posted this patch. The tests ran OK (or at least the same
tests passed with and without my patch). However, I could never really
convince myself the changes were "OK" for fsdax since I didn't understand
the code that well. I would still like to see a xfsdax maintainer or
expert ACK this change.

https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git

> What refcount value does the struct pages have when they are installed
> in the PTEs? Remember a 0 refcount will make all the get_user_pages()
> fail.
>
> I'm looking at the call path starting in ext4_punch_hole() and I would
> expect to see something manipulating the page ref count before
> the ext4_break_layouts() call path gets to the dax_page_unused() test.
>
> All I see is we go into unmap_mapping_pages() - that would normally
> put back the page references held by PTEs but insert_pfn() has this:
>
> if (pfn_t_devmap(pfn))
> entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
>
> And:
>
> static inline pte_t pte_mkdevmap(pte_t pte)
> {
> return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP);
> }
>
> Which interacts with vm_normal_page():
>
> if (pte_devmap(pte))
> return NULL;
>
> To disable that refcounting?
>
> So... I have a feeling this will have PTEs pointing to 0 refcount
> pages? Unless FSDAX is !pte_devmap which is not the case, right?
>
> This seems further confirmed by this comment:
>
> /*
> * If we race get_user_pages_fast() here either we'll see the
> * elevated page count in the iteration and wait, or
> * get_user_pages_fast() will see that the page it took a reference
> * against is no longer mapped in the page tables and bail to the
> * get_user_pages() slow path. The slow path is protected by
> * pte_lock() and pmd_lock(). New references are not taken without
> * holding those locks, and unmap_mapping_pages() will not zero the
> * pte or pmd without holding the respective lock, so we are
> * guaranteed to either see new references or prevent new
> * references from being established.
> */
>
> Which seems to explain this scheme relies on unmap_mapping_pages() to
> fence GUP_fast, not on GUP_fast observing 0 refcounts when it should
> stop.
>
> This seems like it would be properly fixed by using normal page
> refcounting for PTEs - ie stop using special for these pages?
>
> Does anyone know why devmap is pte_special anyhow?
>
>> +void free_zone_device_page(struct page *page)
>> +{
>> + switch (page->pgmap->type) {
>> + case MEMORY_DEVICE_PRIVATE:
>> + free_device_page(page);
>> + return;
>> + case MEMORY_DEVICE_FS_DAX:
>> + /* notify page idle */
>> + wake_up_var(&page->_refcount);
>> + return;
> It is not for this series, but I wonder if we should just always call
> ops->page_free and have free_device_page() logic in that callback for
> the non-fs-dax cases?
>
> For instance where is the mem_cgroup_charge() call to pair with the
> mem_cgroup_uncharge() in free_device_page()?
>
> Isn't cgroup charging (or not) the responsibility of the "allocator"
> eg the pgmap_ops owner?
>
> Jason

2021-10-14 19:18:58

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

On Thu, Oct 14, 2021 at 10:35:27AM -0700, Ralph Campbell wrote:

> I ran xfstests-dev using the kernel boot option to "fake" a pmem device
> when I first posted this patch. The tests ran OK (or at least the same
> tests passed with and without my patch).

Hmm. I know nothing of xfstests but

tests/generic/413

Looks kind of like it might cover this situation?

Did it run for you?

Jason

2021-10-14 22:33:29

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount


It would probably help if you cc'd Dan on this.
As far as I know he's the only person left who cares about GUP on DAX.

On Thu, Oct 14, 2021 at 02:06:34PM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 14, 2021 at 10:39:28AM -0500, Alex Sierra wrote:
> > From: Ralph Campbell <[email protected]>
> >
> > ZONE_DEVICE struct pages have an extra reference count that complicates the
> > code for put_page() and several places in the kernel that need to check the
> > reference count to see that a page is not being used (gup, compaction,
> > migration, etc.). Clean up the code so the reference count doesn't need to
> > be treated specially for ZONE_DEVICE.
> >
> > Signed-off-by: Ralph Campbell <[email protected]>
> > Signed-off-by: Alex Sierra <[email protected]>
> > Reviewed-by: Christoph Hellwig <[email protected]>
> > ---
> > v2:
> > AS: merged this patch in linux 5.11 version
> >
> > v5:
> > AS: add condition at try_grab_page to check for the zone device type, while
> > page ref counter is checked less/equal to zero. In case of device zone, pages
> > ref counter are initialized to zero.
> >
> > v7:
> > AS: fix condition at try_grab_page added at v5, is invalid. It supposed
> > to fix xfstests/generic/413 test, however, there's a known issue on
> > this test where DAX mapped area DIO to non-DAX expect to fail.
> > https://patchwork.kernel.org/project/fstests/patch/[email protected]
> > This condition was removed after rebase over patch series
> > https://lore.kernel.org/r/[email protected]
> > ---
> > arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +-
> > drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
> > fs/dax.c | 4 +-
> > include/linux/dax.h | 2 +-
> > include/linux/memremap.h | 7 +--
> > include/linux/mm.h | 11 ----
> > lib/test_hmm.c | 2 +-
> > mm/internal.h | 8 +++
> > mm/memcontrol.c | 6 +--
> > mm/memremap.c | 69 +++++++-------------------
> > mm/migrate.c | 5 --
> > mm/page_alloc.c | 3 ++
> > mm/swap.c | 45 ++---------------
> > 13 files changed, 46 insertions(+), 120 deletions(-)
>
> Has anyone tested this with FSDAX? Does get_user_pages() on fsdax
> backed memory still work?
>
> What refcount value does the struct pages have when they are installed
> in the PTEs? Remember a 0 refcount will make all the get_user_pages()
> fail.
>
> I'm looking at the call path starting in ext4_punch_hole() and I would
> expect to see something manipulating the page ref count before
> the ext4_break_layouts() call path gets to the dax_page_unused() test.
>
> All I see is we go into unmap_mapping_pages() - that would normally
> put back the page references held by PTEs but insert_pfn() has this:
>
> if (pfn_t_devmap(pfn))
> entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
>
> And:
>
> static inline pte_t pte_mkdevmap(pte_t pte)
> {
> return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP);
> }
>
> Which interacts with vm_normal_page():
>
> if (pte_devmap(pte))
> return NULL;
>
> To disable that refcounting?
>
> So... I have a feeling this will have PTEs pointing to 0 refcount
> pages? Unless FSDAX is !pte_devmap which is not the case, right?
>
> This seems further confirmed by this comment:
>
> /*
> * If we race get_user_pages_fast() here either we'll see the
> * elevated page count in the iteration and wait, or
> * get_user_pages_fast() will see that the page it took a reference
> * against is no longer mapped in the page tables and bail to the
> * get_user_pages() slow path. The slow path is protected by
> * pte_lock() and pmd_lock(). New references are not taken without
> * holding those locks, and unmap_mapping_pages() will not zero the
> * pte or pmd without holding the respective lock, so we are
> * guaranteed to either see new references or prevent new
> * references from being established.
> */
>
> Which seems to explain this scheme relies on unmap_mapping_pages() to
> fence GUP_fast, not on GUP_fast observing 0 refcounts when it should
> stop.
>
> This seems like it would be properly fixed by using normal page
> refcounting for PTEs - ie stop using special for these pages?
>
> Does anyone know why devmap is pte_special anyhow?
>
> > +void free_zone_device_page(struct page *page)
> > +{
> > + switch (page->pgmap->type) {
> > + case MEMORY_DEVICE_PRIVATE:
> > + free_device_page(page);
> > + return;
> > + case MEMORY_DEVICE_FS_DAX:
> > + /* notify page idle */
> > + wake_up_var(&page->_refcount);
> > + return;
>
> It is not for this series, but I wonder if we should just always call
> ops->page_free and have free_device_page() logic in that callback for
> the non-fs-dax cases?
>
> For instance where is the mem_cgroup_charge() call to pair with the
> mem_cgroup_uncharge() in free_device_page()?
>
> Isn't cgroup charging (or not) the responsibility of the "allocator"
> eg the pgmap_ops owner?
>
> Jason

2021-10-14 22:37:39

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

On Thu, Oct 14, 2021 at 11:45 AM Matthew Wilcox <[email protected]> wrote:
>
>
> It would probably help if you cc'd Dan on this.

Thanks.

[..]
>
> On Thu, Oct 14, 2021 at 02:06:34PM -0300, Jason Gunthorpe wrote:
> > On Thu, Oct 14, 2021 at 10:39:28AM -0500, Alex Sierra wrote:
> > > From: Ralph Campbell <[email protected]>
> > >
> > > ZONE_DEVICE struct pages have an extra reference count that complicates the
> > > code for put_page() and several places in the kernel that need to check the
> > > reference count to see that a page is not being used (gup, compaction,
> > > migration, etc.). Clean up the code so the reference count doesn't need to
> > > be treated specially for ZONE_DEVICE.
> > >
> > > Signed-off-by: Ralph Campbell <[email protected]>
> > > Signed-off-by: Alex Sierra <[email protected]>
> > > Reviewed-by: Christoph Hellwig <[email protected]>
> > > ---
> > > v2:
> > > AS: merged this patch in linux 5.11 version
> > >
> > > v5:
> > > AS: add condition at try_grab_page to check for the zone device type, while
> > > page ref counter is checked less/equal to zero. In case of device zone, pages
> > > ref counter are initialized to zero.
> > >
> > > v7:
> > > AS: fix condition at try_grab_page added at v5, is invalid. It supposed
> > > to fix xfstests/generic/413 test, however, there's a known issue on
> > > this test where DAX mapped area DIO to non-DAX expect to fail.
> > > https://patchwork.kernel.org/project/fstests/patch/[email protected]
> > > This condition was removed after rebase over patch series
> > > https://lore.kernel.org/r/[email protected]
> > > ---
> > > arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +-
> > > drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
> > > fs/dax.c | 4 +-
> > > include/linux/dax.h | 2 +-
> > > include/linux/memremap.h | 7 +--
> > > include/linux/mm.h | 11 ----
> > > lib/test_hmm.c | 2 +-
> > > mm/internal.h | 8 +++
> > > mm/memcontrol.c | 6 +--
> > > mm/memremap.c | 69 +++++++-------------------
> > > mm/migrate.c | 5 --
> > > mm/page_alloc.c | 3 ++
> > > mm/swap.c | 45 ++---------------
> > > 13 files changed, 46 insertions(+), 120 deletions(-)
> >
> > Has anyone tested this with FSDAX? Does get_user_pages() on fsdax
> > backed memory still work?
> >
> > What refcount value does the struct pages have when they are installed
> > in the PTEs? Remember a 0 refcount will make all the get_user_pages()
> > fail.
> >
> > I'm looking at the call path starting in ext4_punch_hole() and I would
> > expect to see something manipulating the page ref count before
> > the ext4_break_layouts() call path gets to the dax_page_unused() test.
> >
> > All I see is we go into unmap_mapping_pages() - that would normally
> > put back the page references held by PTEs but insert_pfn() has this:
> >
> > if (pfn_t_devmap(pfn))
> > entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
> >
> > And:
> >
> > static inline pte_t pte_mkdevmap(pte_t pte)
> > {
> > return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP);
> > }
> >
> > Which interacts with vm_normal_page():
> >
> > if (pte_devmap(pte))
> > return NULL;
> >
> > To disable that refcounting?
> >
> > So... I have a feeling this will have PTEs pointing to 0 refcount
> > pages? Unless FSDAX is !pte_devmap which is not the case, right?
> >
> > This seems further confirmed by this comment:
> >
> > /*
> > * If we race get_user_pages_fast() here either we'll see the
> > * elevated page count in the iteration and wait, or
> > * get_user_pages_fast() will see that the page it took a reference
> > * against is no longer mapped in the page tables and bail to the
> > * get_user_pages() slow path. The slow path is protected by
> > * pte_lock() and pmd_lock(). New references are not taken without
> > * holding those locks, and unmap_mapping_pages() will not zero the
> > * pte or pmd without holding the respective lock, so we are
> > * guaranteed to either see new references or prevent new
> > * references from being established.
> > */
> >
> > Which seems to explain this scheme relies on unmap_mapping_pages() to
> > fence GUP_fast, not on GUP_fast observing 0 refcounts when it should
> > stop.
> >
> > This seems like it would be properly fixed by using normal page
> > refcounting for PTEs - ie stop using special for these pages?
> >
> > Does anyone know why devmap is pte_special anyhow?

It does not need to be special as mentioned here:

https://lore.kernel.org/all/CAPcyv4iFeVDVPn6uc=aKsyUvkiu3-fK-N16iJVZQ3N8oT00hWA@mail.gmail.com/

The refcount dependencies also go away after this...

https://lore.kernel.org/all/161604050866.1463742.7759521510383551055.stgit@dwillia2-desk3.amr.corp.intel.com/

...but you can see that patches 1 and 2 in that series depend on being
able to guarantee that all mappings are invalidated when the undelying
device that owns the pgmap goes away.

For that to happen there needs to be communication back to the FS for
device-gone / failure events. That work is in progress via this
series:

https://lore.kernel.org/all/[email protected]/

So there's a path to unwind this awkwardness, but it needs some
dominoes to fall first as far as I can see. My current focus is
getting Shiyang's series unblocked.

2021-10-15 01:03:00

by Ralph Campbell

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount


On 10/14/21 11:01 AM, Jason Gunthorpe wrote:
> On Thu, Oct 14, 2021 at 10:35:27AM -0700, Ralph Campbell wrote:
>
>> I ran xfstests-dev using the kernel boot option to "fake" a pmem device
>> when I first posted this patch. The tests ran OK (or at least the same
>> tests passed with and without my patch).
> Hmm. I know nothing of xfstests but
>
> tests/generic/413
>
> Looks kind of like it might cover this situation?
>
> Did it run for you?
>
> Jason

I don't remember. I'll have to rerun the test which might take a day or two
to set up again.

2021-10-15 06:04:45

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

On Thu, Oct 14, 2021 at 12:01:14PM -0700, Dan Williams wrote:
> > > Does anyone know why devmap is pte_special anyhow?
>
> It does not need to be special as mentioned here:
>
> https://lore.kernel.org/all/CAPcyv4iFeVDVPn6uc=aKsyUvkiu3-fK-N16iJVZQ3N8oT00hWA@mail.gmail.com/

I added a remark there

Not special means more to me, it means devmap should do the refcounts
properly like normal memory pages.

It means vm_normal_page should return !NULL and it means insert_page,
not insert_pfn should be used to install them in the PTE. VMAs should
not be MIXED MAP, but normal struct page maps.

I think this change alone would fix all the refcount problems
everwhere in DAX and devmap.

> The refcount dependencies also go away after this...
>
> https://lore.kernel.org/all/161604050866.1463742.7759521510383551055.stgit@dwillia2-desk3.amr.corp.intel.com/
>
> ...but you can see that patches 1 and 2 in that series depend on being
> able to guarantee that all mappings are invalidated when the undelying
> device that owns the pgmap goes away.

If I have put everything together right this is because of what I
pointed to here. FS-DAX is installing 0 refcount pages into PTEs and
expecting that to work sanely.

This means the page map cannot be removed until all the PTEs are fully
flushed, which buggily doesn't happen because of the missing unplug.

However, this is all because nobody incrd a refcount to represent the
reference in the PTE and since this ment that 0 refcount pages were
wrongly stuffed into PTEs then devmap used the refcount == 1 hack to
unbreak GUP?

So.. Is there some reason why devmap pages are trying so hard to avoid
sane refcounting???

If the PTE itself holds the refcount (by not being special) then there
is no need for the pagemap stuff in GUP. pagemap already waits for
refs to go to 0 so the missing shootdown during nvdimm unplug will
cause pagemap to block until the address spaces are invalidated. IMHO
this is already better than the current buggy situation of allowing
continued PTE reference to memory that is now removed from the system.

> For that to happen there needs to be communication back to the FS for
> device-gone / failure events. That work is in progress via this
> series:
>
> https://lore.kernel.org/all/[email protected]/

This is fine, but I don't think it should block fixing the mm side -
the end result here still cannot be 0 ref count pages installed in
PTEs.

Fixing that does not depend on shootdown during device removal, right?

It requires holding refcounts while pages are installed into address
spaces - and this lack is a direct cause of making the PTEs all
special and using insert_pfn and MIXED_MAP.

Thanks,
Jason

2021-10-15 08:12:05

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

On Thu, Oct 14, 2021 at 4:06 PM Jason Gunthorpe <[email protected]> wrote:
>
> On Thu, Oct 14, 2021 at 12:01:14PM -0700, Dan Williams wrote:
> > > > Does anyone know why devmap is pte_special anyhow?
> >
> > It does not need to be special as mentioned here:
> >
> > https://lore.kernel.org/all/CAPcyv4iFeVDVPn6uc=aKsyUvkiu3-fK-N16iJVZQ3N8oT00hWA@mail.gmail.com/
>
> I added a remark there
>
> Not special means more to me, it means devmap should do the refcounts
> properly like normal memory pages.
>
> It means vm_normal_page should return !NULL and it means insert_page,
> not insert_pfn should be used to install them in the PTE. VMAs should
> not be MIXED MAP, but normal struct page maps.
>
> I think this change alone would fix all the refcount problems
> everwhere in DAX and devmap.
>
> > The refcount dependencies also go away after this...
> >
> > https://lore.kernel.org/all/161604050866.1463742.7759521510383551055.stgit@dwillia2-desk3.amr.corp.intel.com/
> >
> > ...but you can see that patches 1 and 2 in that series depend on being
> > able to guarantee that all mappings are invalidated when the undelying
> > device that owns the pgmap goes away.
>
> If I have put everything together right this is because of what I
> pointed to here. FS-DAX is installing 0 refcount pages into PTEs and
> expecting that to work sanely.
>
> This means the page map cannot be removed until all the PTEs are fully
> flushed, which buggily doesn't happen because of the missing unplug.
>
> However, this is all because nobody incrd a refcount to represent the
> reference in the PTE and since this ment that 0 refcount pages were
> wrongly stuffed into PTEs then devmap used the refcount == 1 hack to
> unbreak GUP?
>
> So.. Is there some reason why devmap pages are trying so hard to avoid
> sane refcounting???

I wouldn't put it that way. It's more that the original sin of
ZONE_DEVICE that sought to reuse the lru field space, by never having
a zero recount, then got layered upon and calcified in malignant ways.
In the meantime surrounding infrastructure got decrustified. Work like
the 'struct page' cleanup among other things, made it clearer and
clearer over time that the original design choice needed to be fixed.

> If the PTE itself holds the refcount (by not being special) then there
> is no need for the pagemap stuff in GUP. pagemap already waits for
> refs to go to 0 so the missing shootdown during nvdimm unplug will
> cause pagemap to block until the address spaces are invalidated. IMHO
> this is already better than the current buggy situation of allowing
> continued PTE reference to memory that is now removed from the system.
>
> > For that to happen there needs to be communication back to the FS for
> > device-gone / failure events. That work is in progress via this
> > series:
> >
> > https://lore.kernel.org/all/[email protected]/
>
> This is fine, but I don't think it should block fixing the mm side -
> the end result here still cannot be 0 ref count pages installed in
> PTEs.
>
> Fixing that does not depend on shootdown during device removal, right?
>
> It requires holding refcounts while pages are installed into address
> spaces - and this lack is a direct cause of making the PTEs all
> special and using insert_pfn and MIXED_MAP.

The MIXED_MAP and insert_pfn were a holdover from page-less DAX, but
now that we have page-available DAX, yes, we can skip the FS
notification and just rely on typical refcounting and hanging until
the FS has a chance to uninstall the PTEs. You're right, the FS
notification is an improvement to the conversion, not a requirement.

However, there still needs to be something in the gup-fast path to
indicate that GUP_LONGTERM is not possible because the PTE represents
a pfn that can not support typical page-cache behavior for truncate
which is to just disconnect the page from the file and keep the page
pinned indefinitely. I think the "no longterm" caveat would be the
only remaining utility of PTE_DEVMAP after the above conversion to use
typical page refcounts throughout DAX.

Subject: Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount


On 10/14/2021 3:57 PM, Ralph Campbell wrote:
>
> On 10/14/21 11:01 AM, Jason Gunthorpe wrote:
>> On Thu, Oct 14, 2021 at 10:35:27AM -0700, Ralph Campbell wrote:
>>
>>> I ran xfstests-dev using the kernel boot option to "fake" a pmem device
>>> when I first posted this patch. The tests ran OK (or at least the same
>>> tests passed with and without my patch).
>> Hmm. I know nothing of xfstests but
>>
>> tests/generic/413
>>
>> Looks kind of like it might cover this situation?
>>
>> Did it run for you?
>>
>> Jason
>
> I don't remember. I'll have to rerun the test which might take a day
> or two
> to set up again.
>
I just ran this generic/413 on my side using pmem fake device. It does fail.
I remember we proposed a fix on this patch before try_get_page was removed.
@@ -1186,7 +1153,7 @@ bool __must_check try_grab_page(struct page *page,
unsigned int flags);
 static inline __must_check bool try_get_page(struct page *page)
 {
        page = compound_head(page);
-       if (WARN_ON_ONCE(page_ref_count(page) <= 0))
+       if (WARN_ON_ONCE(page_ref_count(page) <
(int)!is_zone_device_page(page)))
                return false;
        page_ref_inc(page);
        return true;

Alex

2021-10-15 17:40:35

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

On Thu, Oct 14, 2021 at 10:45:52PM -0500, Sierra Guiza, Alejandro (Alex) wrote:
>
> On 10/14/2021 3:57 PM, Ralph Campbell wrote:
> >
> > On 10/14/21 11:01 AM, Jason Gunthorpe wrote:
> > > On Thu, Oct 14, 2021 at 10:35:27AM -0700, Ralph Campbell wrote:
> > >
> > > > I ran xfstests-dev using the kernel boot option to "fake" a pmem device
> > > > when I first posted this patch. The tests ran OK (or at least the same
> > > > tests passed with and without my patch).
> > > Hmm. I know nothing of xfstests but
> > >
> > > tests/generic/413
> > >
> > > Looks kind of like it might cover this situation?
> > >
> > > Did it run for you?
> > >
> > > Jason
> >
> > I don't remember. I'll have to rerun the test which might take a day or
> > two
> > to set up again.
> >
> I just ran this generic/413 on my side using pmem fake device. It does fail.
> I remember we proposed a fix on this patch before try_get_page was
> removed.

Thanks so much, that confirms I've read everything properly!

The fix is to incr the refcount at the proper times, not ignore the
broken refcount

Jason

2021-10-18 03:36:00

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

On Thu, Oct 14, 2021 at 06:37:35PM -0700, Dan Williams wrote:
> On Thu, Oct 14, 2021 at 4:06 PM Jason Gunthorpe <[email protected]> wrote:
> >
> > On Thu, Oct 14, 2021 at 12:01:14PM -0700, Dan Williams wrote:
> > > > > Does anyone know why devmap is pte_special anyhow?
> > >
> > > It does not need to be special as mentioned here:
> > >
> > > https://lore.kernel.org/all/CAPcyv4iFeVDVPn6uc=aKsyUvkiu3-fK-N16iJVZQ3N8oT00hWA@mail.gmail.com/
> >
> > I added a remark there
> >
> > Not special means more to me, it means devmap should do the refcounts
> > properly like normal memory pages.
> >
> > It means vm_normal_page should return !NULL and it means insert_page,
> > not insert_pfn should be used to install them in the PTE. VMAs should
> > not be MIXED MAP, but normal struct page maps.
> >
> > I think this change alone would fix all the refcount problems
> > everwhere in DAX and devmap.
> >
> > > The refcount dependencies also go away after this...
> > >
> > > https://lore.kernel.org/all/161604050866.1463742.7759521510383551055.stgit@dwillia2-desk3.amr.corp.intel.com/
> > >
> > > ...but you can see that patches 1 and 2 in that series depend on being
> > > able to guarantee that all mappings are invalidated when the undelying
> > > device that owns the pgmap goes away.
> >
> > If I have put everything together right this is because of what I
> > pointed to here. FS-DAX is installing 0 refcount pages into PTEs and
> > expecting that to work sanely.
> >
> > This means the page map cannot be removed until all the PTEs are fully
> > flushed, which buggily doesn't happen because of the missing unplug.
> >
> > However, this is all because nobody incrd a refcount to represent the
> > reference in the PTE and since this ment that 0 refcount pages were
> > wrongly stuffed into PTEs then devmap used the refcount == 1 hack to
> > unbreak GUP?
> >
> > So.. Is there some reason why devmap pages are trying so hard to avoid
> > sane refcounting???
>
> I wouldn't put it that way. It's more that the original sin of
> ZONE_DEVICE that sought to reuse the lru field space, by never having
> a zero recount, then got layered upon and calcified in malignant ways.
> In the meantime surrounding infrastructure got decrustified. Work like
> the 'struct page' cleanup among other things, made it clearer and
> clearer over time that the original design choice needed to be fixed.

So, there used to be some reason, but with the current code
arrangement it is not the case? This is why it looks so strange when
reading it..

AFIACT we are good on the LRU stuff now? Eg release_pages() does not
use page->lru for is_zone_device_page()?

> The MIXED_MAP and insert_pfn were a holdover from page-less DAX, but
> now that we have page-available DAX, yes, we can skip the FS
> notification and just rely on typical refcounting and hanging until
> the FS has a chance to uninstall the PTEs. You're right, the FS
> notification is an improvement to the conversion, not a requirement.

It all sounds so simple. I looked at this for a good long time and the
first major roadblock is huge pages.

The mm side is designed around THP which puts a single high order page
into the PUD/PMD such that the mm only has to juggle one page. This a
very sane and reasonable thing to do.

DAX is stuffing arrays of 4k pages into the PUD/PMDs. Aligning with
THP would make using normal refconting much simpler. I looked at
teaching the mm core to deal with page arrays - it is certainly
doable, but it is quite inefficient and ugly mm code.

So, can we fix DAX and TTM - the only uses of PUD/PMDs I could find?

Joao has a series that does this to device-dax:

https://lore.kernel.org/all/[email protected]/

TTM is kind of broken already but does have a struct page, and it is
probably already a high order one. Maybe it is OK? I will ask Thomas

That leaves FSDAX. Can this be fixed? I know nothing of filesystems or
fsdax to guess. Sounds like folios to me ..

Assuming changing FSDAX is hard.. How would DAX people feel about just
deleting the PUD/PMD support until it can be done with compound pages?

Doing so would allow fixing the lifecycle, cleaning up gup and
basically delete a huge wack of slow DAX and devmap specific code from
the mm. It also opens the door to removing the PTE flag and thus
allowing DAX on more architectures.

> However, there still needs to be something in the gup-fast path to
> indicate that GUP_LONGTERM is not possible because the PTE
> represents

It looks easy now:

1) We know the pfn has a struct page * because it isn't pfn special

2) We can get a valid ref on the struct page *:

head = try_grab_compound_head(page, 1, flags);

Holding a ref ensures that head->pgmap is valid.

3) Then check the page type directly:

if ((flags & FOLL_LONGTERM) && is_zone_device_page(head))

This tells us we can access the ZONE_DEVICE struct in the union

4) Ask what the pgmap owner wants to do:

if (head->pgmap->deny_foll_longterm)
return FAIL

Cost is only paied if FOLL_LONGTERM is given

Thanks,
Jason

2021-10-18 03:36:30

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

On Sat, Oct 16, 2021 at 12:44:50PM -0300, Jason Gunthorpe wrote:
> Assuming changing FSDAX is hard.. How would DAX people feel about just
> deleting the PUD/PMD support until it can be done with compound pages?

I think there are customers who would find that an unacceptable answer :-)

2021-10-18 03:46:12

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

On Sat, Oct 16, 2021 at 8:45 AM Jason Gunthorpe <[email protected]> wrote:
>
> On Thu, Oct 14, 2021 at 06:37:35PM -0700, Dan Williams wrote:
> > On Thu, Oct 14, 2021 at 4:06 PM Jason Gunthorpe <[email protected]> wrote:
> > >
> > > On Thu, Oct 14, 2021 at 12:01:14PM -0700, Dan Williams wrote:
> > > > > > Does anyone know why devmap is pte_special anyhow?
> > > >
> > > > It does not need to be special as mentioned here:
> > > >
> > > > https://lore.kernel.org/all/CAPcyv4iFeVDVPn6uc=aKsyUvkiu3-fK-N16iJVZQ3N8oT00hWA@mail.gmail.com/
> > >
> > > I added a remark there
> > >
> > > Not special means more to me, it means devmap should do the refcounts
> > > properly like normal memory pages.
> > >
> > > It means vm_normal_page should return !NULL and it means insert_page,
> > > not insert_pfn should be used to install them in the PTE. VMAs should
> > > not be MIXED MAP, but normal struct page maps.
> > >
> > > I think this change alone would fix all the refcount problems
> > > everwhere in DAX and devmap.
> > >
> > > > The refcount dependencies also go away after this...
> > > >
> > > > https://lore.kernel.org/all/161604050866.1463742.7759521510383551055.stgit@dwillia2-desk3.amr.corp.intel.com/
> > > >
> > > > ...but you can see that patches 1 and 2 in that series depend on being
> > > > able to guarantee that all mappings are invalidated when the undelying
> > > > device that owns the pgmap goes away.
> > >
> > > If I have put everything together right this is because of what I
> > > pointed to here. FS-DAX is installing 0 refcount pages into PTEs and
> > > expecting that to work sanely.
> > >
> > > This means the page map cannot be removed until all the PTEs are fully
> > > flushed, which buggily doesn't happen because of the missing unplug.
> > >
> > > However, this is all because nobody incrd a refcount to represent the
> > > reference in the PTE and since this ment that 0 refcount pages were
> > > wrongly stuffed into PTEs then devmap used the refcount == 1 hack to
> > > unbreak GUP?
> > >
> > > So.. Is there some reason why devmap pages are trying so hard to avoid
> > > sane refcounting???
> >
> > I wouldn't put it that way. It's more that the original sin of
> > ZONE_DEVICE that sought to reuse the lru field space, by never having
> > a zero recount, then got layered upon and calcified in malignant ways.
> > In the meantime surrounding infrastructure got decrustified. Work like
> > the 'struct page' cleanup among other things, made it clearer and
> > clearer over time that the original design choice needed to be fixed.
>
> So, there used to be some reason, but with the current code
> arrangement it is not the case? This is why it looks so strange when
> reading it..
>
> AFIACT we are good on the LRU stuff now? Eg release_pages() does not
> use page->lru for is_zone_device_page()?

Yes, the lru collision was fixed by the 'struct page' cleanup.

>
> > The MIXED_MAP and insert_pfn were a holdover from page-less DAX, but
> > now that we have page-available DAX, yes, we can skip the FS
> > notification and just rely on typical refcounting and hanging until
> > the FS has a chance to uninstall the PTEs. You're right, the FS
> > notification is an improvement to the conversion, not a requirement.
>
> It all sounds so simple. I looked at this for a good long time and the
> first major roadblock is huge pages.
>
> The mm side is designed around THP which puts a single high order page
> into the PUD/PMD such that the mm only has to juggle one page. This a
> very sane and reasonable thing to do.
>
> DAX is stuffing arrays of 4k pages into the PUD/PMDs. Aligning with
> THP would make using normal refconting much simpler. I looked at
> teaching the mm core to deal with page arrays - it is certainly
> doable, but it is quite inefficient and ugly mm code.

THP does not support PUD, and neither does FSDAX, so it's only PMDs we
need to worry about.

>
> So, can we fix DAX and TTM - the only uses of PUD/PMDs I could find?
>
> Joao has a series that does this to device-dax:
>
> https://lore.kernel.org/all/[email protected]/

That assumes there's never any need to fracture a huge page which
FSDAX could not support unless the filesystem was built with 2MB block
size.

> TTM is kind of broken already but does have a struct page, and it is
> probably already a high order one. Maybe it is OK? I will ask Thomas
>
> That leaves FSDAX. Can this be fixed? I know nothing of filesystems or
> fsdax to guess. Sounds like folios to me ..

My thought here is to assemble a compound page on the fly when
establishing the FSDAX PMD mapping.

> Assuming changing FSDAX is hard.. How would DAX people feel about just
> deleting the PUD/PMD support until it can be done with compound pages?

There are end users that would notice the PMD regression, and I think
FSDAX PMDs with proper compound page metadata is on the same order of
work as fixing the refcount.

> Doing so would allow fixing the lifecycle, cleaning up gup and
> basically delete a huge wack of slow DAX and devmap specific code from
> the mm. It also opens the door to removing the PTE flag and thus
> allowing DAX on more architectures.
>
> > However, there still needs to be something in the gup-fast path to
> > indicate that GUP_LONGTERM is not possible because the PTE
> > represents
>
> It looks easy now:
>
> 1) We know the pfn has a struct page * because it isn't pfn special
>
> 2) We can get a valid ref on the struct page *:
>
> head = try_grab_compound_head(page, 1, flags);
>
> Holding a ref ensures that head->pgmap is valid.
>
> 3) Then check the page type directly:
>
> if ((flags & FOLL_LONGTERM) && is_zone_device_page(head))
>
> This tells us we can access the ZONE_DEVICE struct in the union
>
> 4) Ask what the pgmap owner wants to do:
>
> if (head->pgmap->deny_foll_longterm)
> return FAIL

The pgmap itself does not know, but the "holder" could specify this
policy. Which is in line with the 'dax_holder_ops' concept being
introduced for reverse mapping support. I.e. when the FS claims the
dax-device it can specify at that point that it wants to forbid
longterm.

> Cost is only paied if FOLL_LONGTERM is given

Yeah, that does naturally fall out from no longer needing to take an
explicit dev_pagemap reference and assuming a page is always there.

2021-10-18 03:46:30

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

On Sat, Oct 16, 2021 at 9:39 AM Matthew Wilcox <[email protected]> wrote:
>
> On Sat, Oct 16, 2021 at 12:44:50PM -0300, Jason Gunthorpe wrote:
> > Assuming changing FSDAX is hard.. How would DAX people feel about just
> > deleting the PUD/PMD support until it can be done with compound pages?
>
> I think there are customers who would find that an unacceptable answer :-)

No, not given the number of end users that ask for help debugging PMD support.

2021-10-18 18:26:24

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

On Sun, Oct 17, 2021 at 11:35:35AM -0700, Dan Williams wrote:

> > DAX is stuffing arrays of 4k pages into the PUD/PMDs. Aligning with
> > THP would make using normal refconting much simpler. I looked at
> > teaching the mm core to deal with page arrays - it is certainly
> > doable, but it is quite inefficient and ugly mm code.
>
> THP does not support PUD, and neither does FSDAX, so it's only PMDs we
> need to worry about.

device-dax uses PUD, along with TTM, they are the only places. I'm not
sure TTM is a real place though.

> > So, can we fix DAX and TTM - the only uses of PUD/PMDs I could find?
> >
> > Joao has a series that does this to device-dax:
> >
> > https://lore.kernel.org/all/[email protected]/
>
> That assumes there's never any need to fracture a huge page which
> FSDAX could not support unless the filesystem was built with 2MB block
> size.

As I understand things, something like FSDAX post-folio should
generate maximal compound pages for extents in the page cache that are
physically contiguous.

A high order folio can be placed in any lower order in the page
tables, so we never have to fracture it, unless the underlying page
are moved around - which requires an unmap_mapping_range() cycle..

> > Assuming changing FSDAX is hard.. How would DAX people feel about just
> > deleting the PUD/PMD support until it can be done with compound pages?
>
> There are end users that would notice the PMD regression, and I think
> FSDAX PMDs with proper compound page metadata is on the same order of
> work as fixing the refcount.

Hmm, I don't know.. I sketched out the refcount stuff and the code is
OK but ugly and will add a conditional to some THP cases

On the other hand, making THP unmap cases a bit slower is probably a
net win compared to making put_page a bit slower.. Considering unmap
is already quite heavy.

> > 4) Ask what the pgmap owner wants to do:
> >
> > if (head->pgmap->deny_foll_longterm)
> > return FAIL
>
> The pgmap itself does not know, but the "holder" could specify this
> policy.

Here I imagine the thing that creates the pgmap would specify the
policy it wants. In most cases the policy is tightly coupled to what
the free function in the the provided dev_pagemap_ops does..

> Which is in line with the 'dax_holder_ops' concept being introduced
> for reverse mapping support. I.e. when the FS claims the dax-device
> it can specify at that point that it wants to forbid longterm.

Which is a reasonable refinment if we think there are cases where two
nvdim users would want different things.

Anyhow, I'm wondering on a way forward. There are many balls in the
air, all linked:
- Joao's compound page support for device_dax and more
- Alex's DEVICE_COHERENT
- The refcount normalization
- Removing the pgmap test from GUP
- Removing the need for the PUD/PMD/PTE special bit
- Removing the need for the PUD/PMD/PTE devmap bit
- Remove PUD/PMD vma_is_special
- folios for fsdax
- shootdown for fsdax

Frankly I'm leery to see more ZONE_DEVICE users crop up that depend on
the current semantics as that will only make it even harder to fix..

I think it would be good to see Joao's compound page support move
ahead..

So.. Does anyone want to work on finishing this patch series?? I can
give some guidance on how I think it should work at least

Jason

2021-10-18 19:39:44

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

On Mon, Oct 18, 2021 at 11:26 AM Jason Gunthorpe <[email protected]> wrote:
>
> On Sun, Oct 17, 2021 at 11:35:35AM -0700, Dan Williams wrote:
>
> > > DAX is stuffing arrays of 4k pages into the PUD/PMDs. Aligning with
> > > THP would make using normal refconting much simpler. I looked at
> > > teaching the mm core to deal with page arrays - it is certainly
> > > doable, but it is quite inefficient and ugly mm code.
> >
> > THP does not support PUD, and neither does FSDAX, so it's only PMDs we
> > need to worry about.
>
> device-dax uses PUD, along with TTM, they are the only places. I'm not
> sure TTM is a real place though.

I was setting device-dax aside because it can use Joao's changes to
get compound-page support.

>
> > > So, can we fix DAX and TTM - the only uses of PUD/PMDs I could find?
> > >
> > > Joao has a series that does this to device-dax:
> > >
> > > https://lore.kernel.org/all/[email protected]/
> >
> > That assumes there's never any need to fracture a huge page which
> > FSDAX could not support unless the filesystem was built with 2MB block
> > size.
>
> As I understand things, something like FSDAX post-folio should
> generate maximal compound pages for extents in the page cache that are
> physically contiguous.
>
> A high order folio can be placed in any lower order in the page
> tables, so we never have to fracture it, unless the underlying page
> are moved around - which requires an unmap_mapping_range() cycle..

That would be useful to disconnect the compound-page size from the
page-table-entry installed for the page. However, don't we need
typical compound page fracturing in the near term until folios move
ahead?

>
> > > Assuming changing FSDAX is hard.. How would DAX people feel about just
> > > deleting the PUD/PMD support until it can be done with compound pages?
> >
> > There are end users that would notice the PMD regression, and I think
> > FSDAX PMDs with proper compound page metadata is on the same order of
> > work as fixing the refcount.
>
> Hmm, I don't know.. I sketched out the refcount stuff and the code is
> OK but ugly and will add a conditional to some THP cases

That reminds me that there are several places that do:

pmd_devmap(pmd) || pmd_trans_huge(pmd)

...for the common cases where a THP and DEVMAP page are equivalent,
but there are a few places where those paths are not shared when the
THP path expects that the page came from the page allocator. So while
DEVMAP is not needed in GUP after this conversion, there still needs
to be an audit of when THP needs to be careful of DAX mappings.

> On the other hand, making THP unmap cases a bit slower is probably a
> net win compared to making put_page a bit slower.. Considering unmap
> is already quite heavy.

FSDAX eventually learned how to replace 'struct page' with xarray for
several paths, so "how hard could it be" (/famous last words) to
replace xarray with 'struct page'? I think the end result will be
cleaner, but yes, I expect some dragons in the conversion.

>
> > > 4) Ask what the pgmap owner wants to do:
> > >
> > > if (head->pgmap->deny_foll_longterm)
> > > return FAIL
> >
> > The pgmap itself does not know, but the "holder" could specify this
> > policy.
>
> Here I imagine the thing that creates the pgmap would specify the
> policy it wants. In most cases the policy is tightly coupled to what
> the free function in the the provided dev_pagemap_ops does..

The thing that creates the pgmap is the device-driver, and
device-driver does not implement truncate or reclaim. It's not until
the FS mounts that the pgmap needs to start enforcing pin lifetime
guarantees.

>
> > Which is in line with the 'dax_holder_ops' concept being introduced
> > for reverse mapping support. I.e. when the FS claims the dax-device
> > it can specify at that point that it wants to forbid longterm.
>
> Which is a reasonable refinment if we think there are cases where two
> nvdim users would want different things.
>

It's already the case that device-dax does not enforce transient pin lifetimes.

> Anyhow, I'm wondering on a way forward. There are many balls in the
> air, all linked:
> - Joao's compound page support for device_dax and more
> - Alex's DEVICE_COHERENT

I have not seen these patches.

/me notices no MAINTAINERS mention for include/linux/memremap.h

> - The refcount normalization
> - Removing the pgmap test from GUP
> - Removing the need for the PUD/PMD/PTE special bit
> - Removing the need for the PUD/PMD/PTE devmap bit

It's not clear that this anything but pure cleanup once the special
bit can be used for architectures that don't have devmap. Those same
archs presumably don't care about the THP collisions with DAX.

> - Remove PUD/PMD vma_is_special
> - folios for fsdax
> - shootdown for fsdax
>
> Frankly I'm leery to see more ZONE_DEVICE users crop up that depend on
> the current semantics as that will only make it even harder to fix..
>
> I think it would be good to see Joao's compound page support move
> ahead..
>
> So.. Does anyone want to work on finishing this patch series?? I can
> give some guidance on how I think it should work at least

Completing the DAX reflink work is in my near term goals and that
includes "shootdown for fsdax and removing the pgmap test from GUP",
but probably not in the order that "refcount normalization" folks
would prefer.

2021-10-18 23:07:33

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

On Mon, Oct 18, 2021 at 12:37:30PM -0700, Dan Williams wrote:

> > device-dax uses PUD, along with TTM, they are the only places. I'm not
> > sure TTM is a real place though.
>
> I was setting device-dax aside because it can use Joao's changes to
> get compound-page support.

Ideally, but that ideas in that patch series have been floating around
for a long time now..

> > As I understand things, something like FSDAX post-folio should
> > generate maximal compound pages for extents in the page cache that are
> > physically contiguous.
> >
> > A high order folio can be placed in any lower order in the page
> > tables, so we never have to fracture it, unless the underlying page
> > are moved around - which requires an unmap_mapping_range() cycle..
>
> That would be useful to disconnect the compound-page size from the
> page-table-entry installed for the page. However, don't we need
> typical compound page fracturing in the near term until folios move
> ahead?

I do not know, just mindful not to get ahead of Matthew

> > > There are end users that would notice the PMD regression, and I think
> > > FSDAX PMDs with proper compound page metadata is on the same order of
> > > work as fixing the refcount.
> >
> > Hmm, I don't know.. I sketched out the refcount stuff and the code is
> > OK but ugly and will add a conditional to some THP cases
>
> That reminds me that there are several places that do:
>
> pmd_devmap(pmd) || pmd_trans_huge(pmd)

I haven't tried to look at this yet. I did check that the pte_devmap()
flag can be deleted, but this is more tricky.

We have pmd_huge(), pmd_large(), pmd_devmap(), pmd_trans_huge(),
pmd_leaf(), at least

and I couldn't tell you today the subtle differences between all of
these things on every arch :\

AFAIK there should only be three case:
- pmd points to a pte table
- pmd is in the special hugetlb format
- pmd points at something described by struct page(s)

> ...for the common cases where a THP and DEVMAP page are equivalent,
> but there are a few places where those paths are not shared when the
> THP path expects that the page came from the page allocator. So while
> DEVMAP is not needed in GUP after this conversion, there still needs
> to be an audit of when THP needs to be careful of DAX mappings.

Yes, it is a tricky job to do the full work, but I think in the end,
'pmd points at something described by struct page(s)' is enough for
all code to use is_zone_device_page() instead of a PTE bit or VMA flag
to drive its logic.

> > Here I imagine the thing that creates the pgmap would specify the
> > policy it wants. In most cases the policy is tightly coupled to what
> > the free function in the the provided dev_pagemap_ops does..
>
> The thing that creates the pgmap is the device-driver, and
> device-driver does not implement truncate or reclaim. It's not until
> the FS mounts that the pgmap needs to start enforcing pin lifetime
> guarantees.

I am explaining this wrong, the immediate need is really 'should
foll_longterm fail fast-gup to the slow path' and something like the
nvdimm driver can just set that to 1 and rely on VMA flags to control
what the slow path does - as is today.

It is not as elegant as more flags in the pgmap, but it would get the
job done with minimal fuss.

Might be nice to either rely fully on VMA flags or fully on pgmap
holder flags for FOLL_LONGTERM?

> > Anyhow, I'm wondering on a way forward. There are many balls in the
> > air, all linked:
> > - Joao's compound page support for device_dax and more
> > - Alex's DEVICE_COHERENT
>
> I have not seen these patches.

It is where this series came from. As DEVICE_COHERENT is focused on
changing the migration code and, as I recall, the 1 == free thing
complicated that enough that Christoph requested it be cleaned.

> > - The refcount normalization
> > - Removing the pgmap test from GUP
> > - Removing the need for the PUD/PMD/PTE special bit
> > - Removing the need for the PUD/PMD/PTE devmap bit
>
> It's not clear that this anything but pure cleanup once the special
> bit can be used for architectures that don't have devmap. Those same
> archs presumably don't care about the THP collisions with DAX.

I understood there was some community that was interested in DAX on
other arches that don't have the PTE bits to spare, so this would be
of interest to them?

> Completing the DAX reflink work is in my near term goals and that
> includes "shootdown for fsdax and removing the pgmap test from GUP",
> but probably not in the order that "refcount normalization" folks
> would prefer.

Indeed, I don't think that will help many of the stuck items on the
list move ahead.

Jason

2021-10-19 15:15:59

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

On 10/19/21 00:06, Jason Gunthorpe wrote:
> On Mon, Oct 18, 2021 at 12:37:30PM -0700, Dan Williams wrote:
>
>>> device-dax uses PUD, along with TTM, they are the only places. I'm not
>>> sure TTM is a real place though.
>>
>> I was setting device-dax aside because it can use Joao's changes to
>> get compound-page support.
>
> Ideally, but that ideas in that patch series have been floating around
> for a long time now..
>
The current status of the series misses a Rb on patches 6,7,10,12-14.
Well, patch 8 too should now drop its tag, considering the latest
discussion.

If it helps moving things forward I could split my series further into:

1) the compound page introduction (patches 1-7) of my aforementioned series
2) vmemmap deduplication for memory gains (patches 9-14)
3) gup improvements (patch 8 and gup-slow improvements)

The reason being that item 1) is the the main dependency listed below.
And allows 2) and 3) to be parallelized. FWIW, it is almost fully reviewed
by Dan (as of v3->v4). For (1) patches 6 & 7 are on changes to
device-dax subsystem (drivers/dax/*) which still needs his Ack.

>>> Here I imagine the thing that creates the pgmap would specify the
>>> policy it wants. In most cases the policy is tightly coupled to what
>>> the free function in the the provided dev_pagemap_ops does..
>>
>> The thing that creates the pgmap is the device-driver, and
>> device-driver does not implement truncate or reclaim. It's not until
>> the FS mounts that the pgmap needs to start enforcing pin lifetime
>> guarantees.
>
> I am explaining this wrong, the immediate need is really 'should
> foll_longterm fail fast-gup to the slow path' and something like the
> nvdimm driver can just set that to 1 and rely on VMA flags to control
> what the slow path does - as is today.
>
> It is not as elegant as more flags in the pgmap, but it would get the
> job done with minimal fuss.
>
> Might be nice to either rely fully on VMA flags or fully on pgmap
> holder flags for FOLL_LONGTERM?
>

Whats the benefit between preventing longterm at start
versus only after mounting the filesystem? Or is the intended future purpose
to pass more context into an holder potential future callback e.g. nack longterm
pins on a page basis?

Maybe we can start by at least not add any flags and just prevent
FOLL_LONGTERM on fsdax -- which I guess was the original purpose of
commit 7af75561e171 ("mm/gup: add FOLL_LONGTERM capability to GUP fast").
This patch (which I can formally send) has a sketch of that (below scissors mark):

https://lore.kernel.org/linux-mm/[email protected]/

It uses pgmap->type rather than adding further fields into pgmap, given this
restriction applies only to fsdax.

... and then we could improve devmap_longterm_available(pgmap) to look at the
holder::flags or pgmap::flags should we decide that an explicit flags is required
from holder/pgmap .. as a further improvement?

Joao

2021-10-19 16:02:30

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

On Tue, Oct 19, 2021 at 04:13:34PM +0100, Joao Martins wrote:
> On 10/19/21 00:06, Jason Gunthorpe wrote:
> > On Mon, Oct 18, 2021 at 12:37:30PM -0700, Dan Williams wrote:
> >
> >>> device-dax uses PUD, along with TTM, they are the only places. I'm not
> >>> sure TTM is a real place though.
> >>
> >> I was setting device-dax aside because it can use Joao's changes to
> >> get compound-page support.
> >
> > Ideally, but that ideas in that patch series have been floating around
> > for a long time now..
> >
> The current status of the series misses a Rb on patches 6,7,10,12-14.
> Well, patch 8 too should now drop its tag, considering the latest
> discussion.
>
> If it helps moving things forward I could split my series further into:
>
> 1) the compound page introduction (patches 1-7) of my aforementioned series
> 2) vmemmap deduplication for memory gains (patches 9-14)
> 3) gup improvements (patch 8 and gup-slow improvements)

I would split it, yes..

I think we can see a general consensus that making compound_head/etc
work consistently with how THP uses it will provide value and
opportunity for optimization going forward.

> Whats the benefit between preventing longterm at start
> versus only after mounting the filesystem? Or is the intended future purpose
> to pass more context into an holder potential future callback e.g. nack longterm
> pins on a page basis?

I understood Dan's remark that the device-dax path allows
FOLL_LONGTERM and the FSDAX path does not ?

Which, IIRC, today is signaled basd on vma properties and in all cases
fast-gup is denied.

> Maybe we can start by at least not add any flags and just prevent
> FOLL_LONGTERM on fsdax -- which I guess was the original purpose of
> commit 7af75561e171 ("mm/gup: add FOLL_LONGTERM capability to GUP fast").
> This patch (which I can formally send) has a sketch of that (below scissors mark):
>
> https://lore.kernel.org/linux-mm/[email protected]/

Yes, basically, whatever test we want for 'deny fast gup foll
longterm' is fine.

Personally I'd like to see us move toward a set of flag specifying
each special behavior and not a collection of types that imply special
behaviors.

Eg we have at least:
- Block gup fast on foll_longterm
- Capture the refcount ==1 and use the pgmap free hook
(confusingly called page_is_devmap_managed())
- Always use a swap entry
- page->index/mapping are used in the usual file based way?

Probably more things..

Jason

2021-10-19 19:24:21

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

On Tue, Oct 19, 2021 at 9:02 AM Jason Gunthorpe <[email protected]> wrote:
>
> On Tue, Oct 19, 2021 at 04:13:34PM +0100, Joao Martins wrote:
> > On 10/19/21 00:06, Jason Gunthorpe wrote:
> > > On Mon, Oct 18, 2021 at 12:37:30PM -0700, Dan Williams wrote:
> > >
> > >>> device-dax uses PUD, along with TTM, they are the only places. I'm not
> > >>> sure TTM is a real place though.
> > >>
> > >> I was setting device-dax aside because it can use Joao's changes to
> > >> get compound-page support.
> > >
> > > Ideally, but that ideas in that patch series have been floating around
> > > for a long time now..
> > >
> > The current status of the series misses a Rb on patches 6,7,10,12-14.
> > Well, patch 8 too should now drop its tag, considering the latest
> > discussion.
> >
> > If it helps moving things forward I could split my series further into:
> >
> > 1) the compound page introduction (patches 1-7) of my aforementioned series
> > 2) vmemmap deduplication for memory gains (patches 9-14)
> > 3) gup improvements (patch 8 and gup-slow improvements)
>
> I would split it, yes..
>
> I think we can see a general consensus that making compound_head/etc
> work consistently with how THP uses it will provide value and
> opportunity for optimization going forward.
>
> > Whats the benefit between preventing longterm at start
> > versus only after mounting the filesystem? Or is the intended future purpose
> > to pass more context into an holder potential future callback e.g. nack longterm
> > pins on a page basis?
>
> I understood Dan's remark that the device-dax path allows
> FOLL_LONGTERM and the FSDAX path does not ?
>
> Which, IIRC, today is signaled basd on vma properties and in all cases
> fast-gup is denied.

Yeah, I forgot that 7af75561e171 eliminated any possibility of
longterm-gup-fast for device-dax, let's not disturb that status quo.

> > Maybe we can start by at least not add any flags and just prevent
> > FOLL_LONGTERM on fsdax -- which I guess was the original purpose of
> > commit 7af75561e171 ("mm/gup: add FOLL_LONGTERM capability to GUP fast").
> > This patch (which I can formally send) has a sketch of that (below scissors mark):
> >
> > https://lore.kernel.org/linux-mm/[email protected]/
>
> Yes, basically, whatever test we want for 'deny fast gup foll
> longterm' is fine.
>
> Personally I'd like to see us move toward a set of flag specifying
> each special behavior and not a collection of types that imply special
> behaviors.
>
> Eg we have at least:
> - Block gup fast on foll_longterm
> - Capture the refcount ==1 and use the pgmap free hook
> (confusingly called page_is_devmap_managed())
> - Always use a swap entry
> - page->index/mapping are used in the usual file based way?
>
> Probably more things..

Yes, agree with the principle of reducing type-implied special casing.

2021-10-20 17:08:26

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

On 10/19/21 20:21, Dan Williams wrote:
> On Tue, Oct 19, 2021 at 9:02 AM Jason Gunthorpe <[email protected]> wrote:
>>
>> On Tue, Oct 19, 2021 at 04:13:34PM +0100, Joao Martins wrote:
>>> On 10/19/21 00:06, Jason Gunthorpe wrote:
>>>> On Mon, Oct 18, 2021 at 12:37:30PM -0700, Dan Williams wrote:
>>>>
>>>>>> device-dax uses PUD, along with TTM, they are the only places. I'm not
>>>>>> sure TTM is a real place though.
>>>>>
>>>>> I was setting device-dax aside because it can use Joao's changes to
>>>>> get compound-page support.
>>>>
>>>> Ideally, but that ideas in that patch series have been floating around
>>>> for a long time now..
>>>>
>>> The current status of the series misses a Rb on patches 6,7,10,12-14.
>>> Well, patch 8 too should now drop its tag, considering the latest
>>> discussion.
>>>
>>> If it helps moving things forward I could split my series further into:
>>>
>>> 1) the compound page introduction (patches 1-7) of my aforementioned series
>>> 2) vmemmap deduplication for memory gains (patches 9-14)
>>> 3) gup improvements (patch 8 and gup-slow improvements)
>>
>> I would split it, yes..
>>
>> I think we can see a general consensus that making compound_head/etc
>> work consistently with how THP uses it will provide value and
>> opportunity for optimization going forward.
>>

I'll go do that. Meanwhile, I'll wait a couple days for Dan to review the
dax subsystem patches (6 & 7), or otherwise send them over.

>>> Whats the benefit between preventing longterm at start
>>> versus only after mounting the filesystem? Or is the intended future purpose
>>> to pass more context into an holder potential future callback e.g. nack longterm
>>> pins on a page basis?
>>
>> I understood Dan's remark that the device-dax path allows
>> FOLL_LONGTERM and the FSDAX path does not ?
>>
>> Which, IIRC, today is signaled basd on vma properties and in all cases
>> fast-gup is denied.
>
> Yeah, I forgot that 7af75561e171 eliminated any possibility of
> longterm-gup-fast for device-dax, let's not disturb that status quo.
>
I am slightly confused by this comment -- the status quo is what we are
questioning here -- And we talked about changing that in the past too
(thread below), that longterm-gup-fast was an oversight that that commit
was only applicable to fsdax. [Maybe this is just my english confusion]

>>> Maybe we can start by at least not add any flags and just prevent
>>> FOLL_LONGTERM on fsdax -- which I guess was the original purpose of
>>> commit 7af75561e171 ("mm/gup: add FOLL_LONGTERM capability to GUP fast").
>>> This patch (which I can formally send) has a sketch of that (below scissors mark):
>>>
>>> https://lore.kernel.org/linux-mm/[email protected]/
>>
>> Yes, basically, whatever test we want for 'deny fast gup foll
>> longterm' is fine.
>>
>> Personally I'd like to see us move toward a set of flag specifying
>> each special behavior and not a collection of types that imply special
>> behaviors.
>>
>> Eg we have at least:
>> - Block gup fast on foll_longterm
>> - Capture the refcount ==1 and use the pgmap free hook
>> (confusingly called page_is_devmap_managed())
>> - Always use a swap entry
>> - page->index/mapping are used in the usual file based way?
>>
>> Probably more things..
>
> Yes, agree with the principle of reducing type-implied special casing.
>
OK.

Moving from implicit devmap types to pgmap::flags is rather simple fixup.
And I suppose (respectivally) PGMAP_NO_PINF_LONGTERM, PGMAP_MANAGED_FREE_PAGE,
PGMAP_USE_SWAP_ENTRY, etc, etc.

2021-10-20 17:13:02

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

On Wed, Oct 20, 2021 at 10:09 AM Joao Martins <[email protected]> wrote:
>
> On 10/19/21 20:21, Dan Williams wrote:
> > On Tue, Oct 19, 2021 at 9:02 AM Jason Gunthorpe <[email protected]> wrote:
> >>
> >> On Tue, Oct 19, 2021 at 04:13:34PM +0100, Joao Martins wrote:
> >>> On 10/19/21 00:06, Jason Gunthorpe wrote:
> >>>> On Mon, Oct 18, 2021 at 12:37:30PM -0700, Dan Williams wrote:
> >>>>
> >>>>>> device-dax uses PUD, along with TTM, they are the only places. I'm not
> >>>>>> sure TTM is a real place though.
> >>>>>
> >>>>> I was setting device-dax aside because it can use Joao's changes to
> >>>>> get compound-page support.
> >>>>
> >>>> Ideally, but that ideas in that patch series have been floating around
> >>>> for a long time now..
> >>>>
> >>> The current status of the series misses a Rb on patches 6,7,10,12-14.
> >>> Well, patch 8 too should now drop its tag, considering the latest
> >>> discussion.
> >>>
> >>> If it helps moving things forward I could split my series further into:
> >>>
> >>> 1) the compound page introduction (patches 1-7) of my aforementioned series
> >>> 2) vmemmap deduplication for memory gains (patches 9-14)
> >>> 3) gup improvements (patch 8 and gup-slow improvements)
> >>
> >> I would split it, yes..
> >>
> >> I think we can see a general consensus that making compound_head/etc
> >> work consistently with how THP uses it will provide value and
> >> opportunity for optimization going forward.
> >>
>
> I'll go do that. Meanwhile, I'll wait a couple days for Dan to review the
> dax subsystem patches (6 & 7), or otherwise send them over.
>
> >>> Whats the benefit between preventing longterm at start
> >>> versus only after mounting the filesystem? Or is the intended future purpose
> >>> to pass more context into an holder potential future callback e.g. nack longterm
> >>> pins on a page basis?
> >>
> >> I understood Dan's remark that the device-dax path allows
> >> FOLL_LONGTERM and the FSDAX path does not ?
> >>
> >> Which, IIRC, today is signaled basd on vma properties and in all cases
> >> fast-gup is denied.
> >
> > Yeah, I forgot that 7af75561e171 eliminated any possibility of
> > longterm-gup-fast for device-dax, let's not disturb that status quo.
> >
> I am slightly confused by this comment -- the status quo is what we are
> questioning here -- And we talked about changing that in the past too
> (thread below), that longterm-gup-fast was an oversight that that commit
> was only applicable to fsdax. [Maybe this is just my english confusion]

No, you have it correct. However that "regression" has gone unnoticed,
so unless there is data showing that gup-fast on device-dax is
critical for longterm page pinning workflows I'm ok for longterm to
continue to trigger gup-slow.

2021-10-20 18:53:14

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

On 10/20/21 18:12, Dan Williams wrote:
> On Wed, Oct 20, 2021 at 10:09 AM Joao Martins <[email protected]> wrote:
>> On 10/19/21 20:21, Dan Williams wrote:
>>> On Tue, Oct 19, 2021 at 9:02 AM Jason Gunthorpe <[email protected]> wrote:
>>>> On Tue, Oct 19, 2021 at 04:13:34PM +0100, Joao Martins wrote:
>>>>> On 10/19/21 00:06, Jason Gunthorpe wrote:
>>>>>> On Mon, Oct 18, 2021 at 12:37:30PM -0700, Dan Williams wrote:
>>>>> Whats the benefit between preventing longterm at start
>>>>> versus only after mounting the filesystem? Or is the intended future purpose
>>>>> to pass more context into an holder potential future callback e.g. nack longterm
>>>>> pins on a page basis?
>>>>
>>>> I understood Dan's remark that the device-dax path allows
>>>> FOLL_LONGTERM and the FSDAX path does not ?
>>>>
>>>> Which, IIRC, today is signaled basd on vma properties and in all cases
>>>> fast-gup is denied.
>>>
>>> Yeah, I forgot that 7af75561e171 eliminated any possibility of
>>> longterm-gup-fast for device-dax, let's not disturb that status quo.
>>>
>> I am slightly confused by this comment -- the status quo is what we are
>> questioning here -- And we talked about changing that in the past too
>> (thread below), that longterm-gup-fast was an oversight that that commit
>> was only applicable to fsdax. [Maybe this is just my english confusion]
>
> No, you have it correct. However that "regression" has gone unnoticed,
> so unless there is data showing that gup-fast on device-dax is
> critical for longterm page pinning workflows I'm ok for longterm to
> continue to trigger gup-slow.
>
To be fair, it's not surprising that nobody noticed -- my general intent
was just to special-case less for device-dax. Not many places use
pin_user_pages_fast(FOLL_LONGTERM). This is only exposed on those
few cases that do try to use it (e.g. RDMA/IB, vDPA), and specifically
when the page fault occurs (that requires fallback-ing to gup-slow) at a
higher level than the amount you're pinning e.g. pinning in 2M extents on a
device-dax of 1G pagesize. Pin size is limited to a 2M extent at a time by the
users I mentioned above -- regardless of the total size of the extent you will
be pinning (i.e. 512 struct pages pointers fit one page). But even with all this,
this [FOLL_LONGTERM on pin-fast] would still go unnoticed because gup-fast
on devmap is just as slow as gup :)