Subject: [PATCH v1 00/14] Add MEMORY_DEVICE_PUBLIC for CPU-accessible coherent device memory

AMD is building a system architecture for the Frontier supercomputer
with a coherent interconnect between CPUs and GPUs. This hardware
architecture allows the CPUs to coherently access GPU device memory.
We have hardware in our labs and we are working with our partner HPE on
the BIOS, firmware and software for delivery to the DOE.

The system BIOS advertises the GPU device memory (aka VRAM) as SPM
(special purpose memory) in the UEFI system address map. The amdgpu
driver registers the memory with devmap as MEMORY_DEVICE_PUBLIC using
devm_memremap_pages.
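
For reference, when the GPU is connected to the CPU over XGMI the
registration boils down to roughly the following (condensed from the
amdgpu/amdkfd change in patch 6; error handling omitted):

    pgmap->type = MEMORY_DEVICE_PUBLIC;
    pgmap->range.start = adev->gmc.aper_base;
    pgmap->range.end = adev->gmc.aper_base + adev->gmc.aper_size - 1;
    pgmap->nr_range = 1;
    pgmap->ops = &svm_migrate_pgmap_ops;
    pgmap->owner = SVM_ADEV_PGMAP_OWNER(adev);
    r = devm_memremap_pages(adev->dev, pgmap);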

This patch series adds MEMORY_DEVICE_PUBLIC, which is similar to
MEMORY_DEVICE_GENERIC in that it can be mapped for CPU access, but adds
support for migrating this memory similar to MEMORY_DEVICE_PRIVATE. We
also included and updated two patches from Ralph Campbell (Nvidia),
which change ZONE_DEVICE reference counting as requested in previous
reviews of this patch series (see https://patchwork.freedesktop.org/series/90706/).
Finally, we extended hmm_test to cover migration of MEMORY_DEVICE_PUBLIC.

This work is based on HMM and our SVM memory manager, which recently
landed in Linux 5.14.

Alex Sierra (12):
mm: add iomem vma selection for memory migration
mm: add zone device public type memory support
drm/amdkfd: ref count init for device pages
drm/amdkfd: add SPM support for SVM
drm/amdkfd: public type as sys mem on migration to ram
mm: add public type support to migrate_vma helpers
mm: call pgmap->ops->page_free for DEVICE_PUBLIC pages
lib: test_hmm add ioctl to get zone device type
lib: test_hmm add module param for zone device type
lib: add support for device public type in test_hmm
tools: update hmm-test to support device public type
tools: update test_hmm script to support SP config

Ralph Campbell (2):
ext4/xfs: add page refcount helper
mm: remove extra ZONE_DEVICE struct page refcount

arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +-
drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 36 ++--
drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
fs/dax.c | 8 +-
fs/ext4/inode.c | 5 +-
fs/fuse/dax.c | 4 +-
fs/xfs/xfs_file.c | 4 +-
include/linux/dax.h | 10 +
include/linux/memremap.h | 15 +-
include/linux/migrate.h | 1 +
include/linux/mm.h | 19 +-
lib/test_hmm.c | 247 +++++++++++++++--------
lib/test_hmm_uapi.h | 16 ++
mm/internal.h | 8 +
mm/memcontrol.c | 6 +-
mm/memory-failure.c | 6 +-
mm/memremap.c | 70 ++-----
mm/migrate.c | 27 +--
mm/page_alloc.c | 3 +
mm/swap.c | 45 +----
tools/testing/selftests/vm/hmm-tests.c | 142 +++++++++++--
tools/testing/selftests/vm/test_hmm.sh | 20 +-
22 files changed, 443 insertions(+), 253 deletions(-)

--
2.32.0


Subject: [PATCH v1 06/14] drm/amdkfd: add SPM support for SVM

When the CPU is connected through XGMI, it has coherent
access to the VRAM resource. In that case the resource
is taken from the device's GMC aperture base instead of
requesting a free I/O memory region. This resource is
used, along with the device memory type (DEVICE_PRIVATE
or DEVICE_PUBLIC), to create the device page map region.

Signed-off-by: Alex Sierra <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
---
v7:
Removed the lookup_resource call, so exporting a symbol for this
function is no longer required. Dropped the patch "kernel: resource:
lookup_resource as exported symbol".
---
drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 29 +++++++++++++++---------
1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 47ee9a895cd2..dd245699479f 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -865,7 +865,7 @@ int svm_migrate_init(struct amdgpu_device *adev)
{
struct kfd_dev *kfddev = adev->kfd.dev;
struct dev_pagemap *pgmap;
- struct resource *res;
+ struct resource *res = NULL;
unsigned long size;
void *r;

@@ -880,19 +880,25 @@ int svm_migrate_init(struct amdgpu_device *adev)
* should remove reserved size
*/
size = ALIGN(adev->gmc.real_vram_size, 2ULL << 20);
- res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
- if (IS_ERR(res))
- return -ENOMEM;
+ if (adev->gmc.xgmi.connected_to_cpu) {
+ pgmap->range.start = adev->gmc.aper_base;
+ pgmap->range.end = adev->gmc.aper_base + adev->gmc.aper_size - 1;
+ pgmap->type = MEMORY_DEVICE_PUBLIC;
+ } else {
+ res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
+ if (IS_ERR(res))
+ return -ENOMEM;
+ pgmap->range.start = res->start;
+ pgmap->range.end = res->end;
+ pgmap->type = MEMORY_DEVICE_PRIVATE;
+ }

- pgmap->type = MEMORY_DEVICE_PRIVATE;
pgmap->nr_range = 1;
- pgmap->range.start = res->start;
- pgmap->range.end = res->end;
pgmap->ops = &svm_migrate_pgmap_ops;
pgmap->owner = SVM_ADEV_PGMAP_OWNER(adev);
- pgmap->flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
+ pgmap->flags = 0;
r = devm_memremap_pages(adev->dev, pgmap);
- if (IS_ERR(r)) {
+ if (res && IS_ERR(r)) {
pr_err("failed to register HMM device memory\n");
devm_release_mem_region(adev->dev, res->start,
res->end - res->start + 1);
@@ -914,6 +920,7 @@ void svm_migrate_fini(struct amdgpu_device *adev)
struct dev_pagemap *pgmap = &adev->kfd.dev->pgmap;

devm_memunmap_pages(adev->dev, pgmap);
- devm_release_mem_region(adev->dev, pgmap->range.start,
- pgmap->range.end - pgmap->range.start + 1);
+ if (pgmap->type == MEMORY_DEVICE_PRIVATE)
+ devm_release_mem_region(adev->dev, pgmap->range.start,
+ pgmap->range.end - pgmap->range.start + 1);
}
--
2.32.0

Subject: [PATCH v1 01/14] ext4/xfs: add page refcount helper

From: Ralph Campbell <[email protected]>

There are several places where ZONE_DEVICE struct pages assume a reference
count == 1 means the page is idle and free. Instead of open coding this,
add a helper function to hide this detail.

Signed-off-by: Ralph Campbell <[email protected]>
Signed-off-by: Alex Sierra <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
v3:
[AS]: rename dax_layout_is_idle_page func to dax_page_unused

v4:
[AS]: This ref count functionality was missing in fuse/dax.c.
---
fs/dax.c | 4 ++--
fs/ext4/inode.c | 5 +----
fs/fuse/dax.c | 4 +---
fs/xfs/xfs_file.c | 4 +---
include/linux/dax.h | 10 ++++++++++
5 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 62352cbcf0f4..c387d09e3e5a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -369,7 +369,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);

- WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+ WARN_ON_ONCE(trunc && !dax_page_unused(page));
WARN_ON_ONCE(page->mapping && page->mapping != mapping);
page->mapping = NULL;
page->index = 0;
@@ -383,7 +383,7 @@ static struct page *dax_busy_page(void *entry)
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);

- if (page_ref_count(page) > 1)
+ if (!dax_page_unused(page))
return page;
}
return NULL;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index fe6045a46599..05ffe6875cb1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3971,10 +3971,7 @@ int ext4_break_layouts(struct inode *inode)
if (!page)
return 0;

- error = ___wait_var_event(&page->_refcount,
- atomic_read(&page->_refcount) == 1,
- TASK_INTERRUPTIBLE, 0, 0,
- ext4_wait_dax_page(ei));
+ error = dax_wait_page(ei, page, ext4_wait_dax_page);
} while (error == 0);

return error;
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index ff99ab2a3c43..2b1f190ba78a 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -677,9 +677,7 @@ static int __fuse_dax_break_layouts(struct inode *inode, bool *retry,
return 0;

*retry = true;
- return ___wait_var_event(&page->_refcount,
- atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
- 0, 0, fuse_wait_dax_page(inode));
+ return dax_wait_page(inode, page, fuse_wait_dax_page);
}

/* dmap_end == 0 leads to unmapping of whole file */
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 396ef36dcd0a..182057281086 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -840,9 +840,7 @@ xfs_break_dax_layouts(
return 0;

*retry = true;
- return ___wait_var_event(&page->_refcount,
- atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
- 0, 0, xfs_wait_dax_page(inode));
+ return dax_wait_page(inode, page, xfs_wait_dax_page);
}

int
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b52f084aa643..8b5da1d60dbc 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -243,6 +243,16 @@ static inline bool dax_mapping(struct address_space *mapping)
return mapping->host && IS_DAX(mapping->host);
}

+static inline bool dax_page_unused(struct page *page)
+{
+ return page_ref_count(page) == 1;
+}
+
+#define dax_wait_page(_inode, _page, _wait_cb) \
+ ___wait_var_event(&(_page)->_refcount, \
+ dax_page_unused(_page), \
+ TASK_INTERRUPTIBLE, 0, 0, _wait_cb(_inode))
+
#ifdef CONFIG_DEV_DAX_HMEM_DEVICES
void hmem_register_device(int target_nid, struct resource *r);
#else
--
2.32.0

Subject: [PATCH v1 02/14] mm: remove extra ZONE_DEVICE struct page refcount

From: Ralph Campbell <[email protected]>

ZONE_DEVICE struct pages have an extra reference count that complicates the
code for put_page() and several places in the kernel that need to check the
reference count to see that a page is not being used (gup, compaction,
migration, etc.). Clean up the code so the reference count doesn't need to
be treated specially for ZONE_DEVICE.
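
With this change, ZONE_DEVICE pages start life with a zero refcount and
device drivers follow roughly this pattern (illustrative sketch;
device_pfn and pvt are placeholder names, see the book3s_hv_uvmem,
nouveau and test_hmm hunks below):

    dpage = pfn_to_page(device_pfn);  /* ZONE_DEVICE page, refcount == 0 */
    dpage->zone_device_data = pvt;
    init_page_count(dpage);           /* was get_page(dpage) before this patch */
    lock_page(dpage);
    /*
     * The final put_page() now reaches free_zone_device_page(), which
     * calls pgmap->ops->page_free() instead of leaving the count at 1.
     */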

Signed-off-by: Ralph Campbell <[email protected]>
Signed-off-by: Alex Sierra <[email protected]>
---
v2:
AS: merged this patch into the Linux 5.11 version

v5:
AS: added a condition in try_grab_page to check for the zone device type
when the page ref counter is less than or equal to zero. For device zone
pages, the ref counter is initialized to zero.

v7:
AS: the condition added to try_grab_page in v5 was invalid and has been
removed. It was supposed to fix the xfstests/generic/413 test, however
there is a known issue with this test where DIO from a DAX-mapped area
to non-DAX is expected to fail.
https://patchwork.kernel.org/project/fstests/patch/[email protected]
This condition was removed after rebasing on top of the patch series
https://lore.kernel.org/r/[email protected]
---
arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +-
drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
fs/dax.c | 4 +-
include/linux/dax.h | 2 +-
include/linux/memremap.h | 7 +--
include/linux/mm.h | 11 -----
lib/test_hmm.c | 2 +-
mm/internal.h | 8 +++
mm/memremap.c | 68 +++++++-------------------
mm/migrate.c | 5 --
mm/page_alloc.c | 3 ++
mm/swap.c | 45 ++---------------
12 files changed, 45 insertions(+), 114 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 84e5a2dc8be5..acee67710620 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -711,7 +711,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)

dpage = pfn_to_page(uvmem_pfn);
dpage->zone_device_data = pvt;
- get_page(dpage);
+ init_page_count(dpage);
lock_page(dpage);
return dpage;
out_clear:
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 92987daa5e17..8bc7120e1216 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -324,7 +324,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
return NULL;
}

- get_page(page);
+ init_page_count(page);
lock_page(page);
return page;
}
diff --git a/fs/dax.c b/fs/dax.c
index c387d09e3e5a..1166630b7190 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -571,14 +571,14 @@ static void *grab_mapping_entry(struct xa_state *xas,

/**
* dax_layout_busy_page_range - find first pinned page in @mapping
- * @mapping: address space to scan for a page with ref count > 1
+ * @mapping: address space to scan for a page with ref count > 0
* @start: Starting offset. Page containing 'start' is included.
* @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX,
* pages from 'start' till the end of file are included.
*
* DAX requires ZONE_DEVICE mapped pages. These pages are never
* 'onlined' to the page allocator so they are considered idle when
- * page->count == 1. A filesystem uses this interface to determine if
+ * page->count == 0. A filesystem uses this interface to determine if
* any page in the mapping is busy, i.e. for DMA, or other
* get_user_pages() usages.
*
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 8b5da1d60dbc..05fc982ce153 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -245,7 +245,7 @@ static inline bool dax_mapping(struct address_space *mapping)

static inline bool dax_page_unused(struct page *page)
{
- return page_ref_count(page) == 1;
+ return page_ref_count(page) == 0;
}

#define dax_wait_page(_inode, _page, _wait_cb) \
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 45a79da89c5f..77ff5fd0685f 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -66,9 +66,10 @@ enum memory_type {

struct dev_pagemap_ops {
/*
- * Called once the page refcount reaches 1. (ZONE_DEVICE pages never
- * reach 0 refcount unless there is a refcount bug. This allows the
- * device driver to implement its own memory management.)
+ * Called once the page refcount reaches 0. The reference count
+ * should be reset to one with init_page_count(page) before reusing
+ * the page. This allows the device driver to implement its own
+ * memory management.
*/
void (*page_free)(struct page *page);

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d8f98d652164..e24c904deeec 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1220,17 +1220,6 @@ static inline void put_page(struct page *page)
{
page = compound_head(page);

- /*
- * For devmap managed pages we need to catch refcount transition from
- * 2 to 1, when refcount reach one it means the page is free and we
- * need to inform the device driver through callback. See
- * include/linux/memremap.h and HMM for details.
- */
- if (page_is_devmap_managed(page)) {
- put_devmap_managed_page(page);
- return;
- }
-
if (put_page_testzero(page))
__put_page(page);
}
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 80a78877bd93..6998f10350ea 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -561,7 +561,7 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
}

dpage->zone_device_data = rpage;
- get_page(dpage);
+ init_page_count(dpage);
lock_page(dpage);
return dpage;

diff --git a/mm/internal.h b/mm/internal.h
index e8fdb531f887..5438cceca4b9 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -667,4 +667,12 @@ int vmap_pages_range_noflush(unsigned long addr, unsigned long end,

void vunmap_range_noflush(unsigned long start, unsigned long end);

+#ifdef CONFIG_DEV_PAGEMAP_OPS
+void free_zone_device_page(struct page *page);
+#else
+static inline void free_zone_device_page(struct page *page)
+{
+}
+#endif
+
#endif /* __MM_INTERNAL_H */
diff --git a/mm/memremap.c b/mm/memremap.c
index 15a074ffb8d7..5aa8163fd948 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -12,6 +12,7 @@
#include <linux/types.h>
#include <linux/wait_bit.h>
#include <linux/xarray.h>
+#include "internal.h"

static DEFINE_XARRAY(pgmap_array);

@@ -37,32 +38,6 @@ unsigned long memremap_compat_align(void)
EXPORT_SYMBOL_GPL(memremap_compat_align);
#endif

-#ifdef CONFIG_DEV_PAGEMAP_OPS
-DEFINE_STATIC_KEY_FALSE(devmap_managed_key);
-EXPORT_SYMBOL(devmap_managed_key);
-
-static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
-{
- if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
- pgmap->type == MEMORY_DEVICE_FS_DAX)
- static_branch_dec(&devmap_managed_key);
-}
-
-static void devmap_managed_enable_get(struct dev_pagemap *pgmap)
-{
- if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
- pgmap->type == MEMORY_DEVICE_FS_DAX)
- static_branch_inc(&devmap_managed_key);
-}
-#else
-static void devmap_managed_enable_get(struct dev_pagemap *pgmap)
-{
-}
-static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
-{
-}
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
-
static void pgmap_array_delete(struct range *range)
{
xa_store_range(&pgmap_array, PHYS_PFN(range->start), PHYS_PFN(range->end),
@@ -102,16 +77,6 @@ static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
return (range->start + range_len(range)) >> PAGE_SHIFT;
}

-static unsigned long pfn_next(unsigned long pfn)
-{
- if (pfn % 1024 == 0)
- cond_resched();
- return pfn + 1;
-}
-
-#define for_each_device_pfn(pfn, map, i) \
- for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(pfn))
-
static void dev_pagemap_kill(struct dev_pagemap *pgmap)
{
if (pgmap->ops && pgmap->ops->kill)
@@ -167,20 +132,18 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)

void memunmap_pages(struct dev_pagemap *pgmap)
{
- unsigned long pfn;
int i;

dev_pagemap_kill(pgmap);
for (i = 0; i < pgmap->nr_range; i++)
- for_each_device_pfn(pfn, pgmap, i)
- put_page(pfn_to_page(pfn));
+ percpu_ref_put_many(pgmap->ref, pfn_end(pgmap, i) -
+ pfn_first(pgmap, i));
dev_pagemap_cleanup(pgmap);

for (i = 0; i < pgmap->nr_range; i++)
pageunmap_range(pgmap, i);

WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n");
- devmap_managed_enable_put(pgmap);
}
EXPORT_SYMBOL_GPL(memunmap_pages);

@@ -382,8 +345,6 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
}
}

- devmap_managed_enable_get(pgmap);
-
/*
* Clear the pgmap nr_range as it will be incremented for each
* successfully processed range. This communicates how many
@@ -498,16 +459,10 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
EXPORT_SYMBOL_GPL(get_dev_pagemap);

#ifdef CONFIG_DEV_PAGEMAP_OPS
-void free_devmap_managed_page(struct page *page)
+static void free_device_private_page(struct page *page)
{
- /* notify page idle for dax */
- if (!is_device_private_page(page)) {
- wake_up_var(&page->_refcount);
- return;
- }

__ClearPageWaiters(page);
-
mem_cgroup_uncharge(page);

/*
@@ -534,4 +489,19 @@ void free_devmap_managed_page(struct page *page)
page->mapping = NULL;
page->pgmap->ops->page_free(page);
}
+
+void free_zone_device_page(struct page *page)
+{
+ switch (page->pgmap->type) {
+ case MEMORY_DEVICE_FS_DAX:
+ /* notify page idle */
+ wake_up_var(&page->_refcount);
+ return;
+ case MEMORY_DEVICE_PRIVATE:
+ free_device_private_page(page);
+ return;
+ default:
+ return;
+ }
+}
#endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/migrate.c b/mm/migrate.c
index 41ff2c9896c4..e3a10e2a1bb3 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -350,11 +350,6 @@ static int expected_page_refs(struct address_space *mapping, struct page *page)
{
int expected_count = 1;

- /*
- * Device private pages have an extra refcount as they are
- * ZONE_DEVICE pages.
- */
- expected_count += is_device_private_page(page);
if (mapping)
expected_count += thp_nr_pages(page) + page_has_private(page);

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ef2265f86b91..1ef1f733af5b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6414,6 +6414,9 @@ void __ref memmap_init_zone_device(struct zone *zone,

__init_single_page(page, pfn, zone_idx, nid);

+ /* ZONE_DEVICE pages start with a zero reference count. */
+ set_page_count(page, 0);
+
/*
* Mark page reserved as it will need to wait for onlining
* phase for it to be fully associated with a zone.
diff --git a/mm/swap.c b/mm/swap.c
index dfb48cf9c2c9..9e821f1951c5 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -114,12 +114,11 @@ static void __put_compound_page(struct page *page)
void __put_page(struct page *page)
{
if (is_zone_device_page(page)) {
- put_dev_pagemap(page->pgmap);
-
/*
* The page belongs to the device that created pgmap. Do
* not return it to page allocator.
*/
+ free_zone_device_page(page);
return;
}

@@ -917,29 +916,18 @@ void release_pages(struct page **pages, int nr)
if (is_huge_zero_page(page))
continue;

+ if (!put_page_testzero(page))
+ continue;
+
if (is_zone_device_page(page)) {
if (lruvec) {
unlock_page_lruvec_irqrestore(lruvec, flags);
lruvec = NULL;
}
- /*
- * ZONE_DEVICE pages that return 'false' from
- * page_is_devmap_managed() do not require special
- * processing, and instead, expect a call to
- * put_page_testzero().
- */
- if (page_is_devmap_managed(page)) {
- put_devmap_managed_page(page);
- continue;
- }
- if (put_page_testzero(page))
- put_dev_pagemap(page->pgmap);
+ free_zone_device_page(page);
continue;
}

- if (!put_page_testzero(page))
- continue;
-
if (PageCompound(page)) {
if (lruvec) {
unlock_page_lruvec_irqrestore(lruvec, flags);
@@ -1143,26 +1131,3 @@ void __init swap_setup(void)
* _really_ don't want to cluster much more
*/
}
-
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-void put_devmap_managed_page(struct page *page)
-{
- int count;
-
- if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
- return;
-
- count = page_ref_dec_return(page);
-
- /*
- * devmap page refcounts are 1-based, rather than 0-based: if
- * refcount is 1, then the page is free and the refcount is
- * stable because nobody holds a reference on the page.
- */
- if (count == 1)
- free_devmap_managed_page(page);
- else if (!count)
- __put_page(page);
-}
-EXPORT_SYMBOL(put_devmap_managed_page);
-#endif
--
2.32.0

Subject: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

Add a new MIGRATE_VMA_SELECT_IOMEM flag. It is used to migrate pages
from device memory back to system memory when that device memory is
accessible by the CPU through IOMEM access. Typically, ZONE_DEVICE
public type memory falls into this category.
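
As an illustrative sketch (the real user is the amdkfd change later in
this series; vma/start/end/pgmap_owner depend on the caller), a driver
migrating CPU-coherent device pages back to system memory selects them
with:

    migrate.vma = vma;
    migrate.start = start;
    migrate.end = end;
    migrate.pgmap_owner = pgmap_owner;
    /* instead of MIGRATE_VMA_SELECT_DEVICE_PRIVATE */
    migrate.flags = MIGRATE_VMA_SELECT_IOMEM;
    ret = migrate_vma_setup(&migrate);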

Signed-off-by: Alex Sierra <[email protected]>
---
include/linux/migrate.h | 1 +
mm/migrate.c | 3 ++-
2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 4bb4e519e3f5..6b16f417384f 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -156,6 +156,7 @@ static inline unsigned long migrate_pfn(unsigned long pfn)
enum migrate_vma_direction {
MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
+ MIGRATE_VMA_SELECT_IOMEM = 1 << 2,
};

struct migrate_vma {
diff --git a/mm/migrate.c b/mm/migrate.c
index e3a10e2a1bb3..d4ae2da99607 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2406,7 +2406,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
if (is_write_device_private_entry(entry))
mpfn |= MIGRATE_PFN_WRITE;
} else {
- if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
+ if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM) &&
+ !(migrate->flags & MIGRATE_VMA_SELECT_IOMEM))
goto next;
pfn = pte_pfn(pte);
if (is_zero_pfn(pfn)) {
--
2.32.0

Subject: [PATCH v1 07/14] drm/amdkfd: public type as sys mem on migration to ram

During migration from VRAM to RAM, public device type memory has
similar access semantics as system RAM from the CPU's point of view.
Set the migration source selection flag accordingly: for the public
type it should be MIGRATE_VMA_SELECT_IOMEM instead of DEVICE_PRIVATE.

Signed-off-by: Alex Sierra <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
---
drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index dd245699479f..618035dffc64 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -616,9 +616,12 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
migrate.vma = vma;
migrate.start = start;
migrate.end = end;
- migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
migrate.pgmap_owner = SVM_ADEV_PGMAP_OWNER(adev);

+ if (adev->gmc.xgmi.connected_to_cpu)
+ migrate.flags = MIGRATE_VMA_SELECT_IOMEM;
+ else
+ migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
size = 2 * sizeof(*migrate.src) + sizeof(uint64_t) + sizeof(dma_addr_t);
size *= npages;
buf = kvmalloc(size, GFP_KERNEL | __GFP_ZERO);
--
2.32.0

Subject: [PATCH v1 04/14] mm: add zone device public type memory support

Add MEMORY_DEVICE_PUBLIC for device memory that is cache coherent from
both the device and the CPU point of view. This is used on platforms
that have an advanced system bus (like CAPI or CCIX). Any page of a
process can be migrated to such memory. However, no one should be
allowed to pin such memory so that it can always be evicted.

Signed-off-by: Alex Sierra <[email protected]>
---
include/linux/memremap.h | 8 ++++++++
include/linux/mm.h | 8 ++++++++
mm/memcontrol.c | 6 +++---
mm/memory-failure.c | 6 +++++-
mm/memremap.c | 1 +
5 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 77ff5fd0685f..431e1b0bc949 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -39,6 +39,13 @@ struct vmem_altmap {
* A more complete discussion of unaddressable memory may be found in
* include/linux/hmm.h and Documentation/vm/hmm.rst.
*
+ * MEMORY_DEVICE_PUBLIC:
+ * Device memory that is cache coherent from the device and CPU point of view.
+ * This is used on platforms that have an advanced system bus (like CAPI or
+ * CCIX). A driver can hotplug the device memory using ZONE_DEVICE with that
+ * memory type. Any page of a process can be migrated to such memory. However,
+ * no one should be allowed to pin such memory so that it can always be evicted.
+ *
* MEMORY_DEVICE_FS_DAX:
* Host memory that has similar access semantics as System RAM i.e. DMA
* coherent and supports page pinning. In support of coordinating page
@@ -59,6 +66,7 @@ struct vmem_altmap {
enum memory_type {
/* 0 is reserved to catch uninitialized type fields */
MEMORY_DEVICE_PRIVATE = 1,
+ MEMORY_DEVICE_PUBLIC,
MEMORY_DEVICE_FS_DAX,
MEMORY_DEVICE_GENERIC,
MEMORY_DEVICE_PCI_P2PDMA,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e24c904deeec..70a932e8a2ee 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1187,6 +1187,14 @@ static inline bool is_device_private_page(const struct page *page)
page->pgmap->type == MEMORY_DEVICE_PRIVATE;
}

+static inline bool is_device_page(const struct page *page)
+{
+ return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
+ is_zone_device_page(page) &&
+ (page->pgmap->type == MEMORY_DEVICE_PRIVATE ||
+ page->pgmap->type == MEMORY_DEVICE_PUBLIC);
+}
+
static inline bool is_pci_p2pdma_page(const struct page *page)
{
return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 64ada9e650a5..1599ef1a3b03 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5530,8 +5530,8 @@ static int mem_cgroup_move_account(struct page *page,
* 2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
* target for charge migration. if @target is not NULL, the entry is stored
* in target->ent.
- * 3(MC_TARGET_DEVICE): like MC_TARGET_PAGE but page is MEMORY_DEVICE_PRIVATE
- * (so ZONE_DEVICE page and thus not on the lru).
+ * 3(MC_TARGET_DEVICE): like MC_TARGET_PAGE but page is MEMORY_DEVICE_PUBLIC
+ * or MEMORY_DEVICE_PRIVATE (so ZONE_DEVICE page and thus not on the lru).
* For now we such page is charge like a regular page would be as for all
* intent and purposes it is just special memory taking the place of a
* regular page.
@@ -5565,7 +5565,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
*/
if (page_memcg(page) == mc.from) {
ret = MC_TARGET_PAGE;
- if (is_device_private_page(page))
+ if (is_device_page(page))
ret = MC_TARGET_DEVICE;
if (target)
target->page = page;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 6f5f78885ab4..16cadbabfc99 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1373,12 +1373,16 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
goto unlock;
}

- if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
+ switch (pgmap->type) {
+ case MEMORY_DEVICE_PRIVATE:
+ case MEMORY_DEVICE_PUBLIC:
/*
* TODO: Handle HMM pages which may need coordination
* with device-side memory.
*/
goto unlock;
+ default:
+ break;
}

/*
diff --git a/mm/memremap.c b/mm/memremap.c
index 5aa8163fd948..2c8898ed006f 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -294,6 +294,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)

switch (pgmap->type) {
case MEMORY_DEVICE_PRIVATE:
+ case MEMORY_DEVICE_PUBLIC:
if (!IS_ENABLED(CONFIG_DEVICE_PRIVATE)) {
WARN(1, "Device private memory not supported\n");
return ERR_PTR(-EINVAL);
--
2.32.0

Subject: [PATCH v1 05/14] drm/amdkfd: ref count init for device pages

The ref counter of device pages is initialized to zero during
memmap_init_zone_device. The first time a new device page is allocated
to migrate data into it, its ref counter needs to be initialized to one.

Signed-off-by: Alex Sierra <[email protected]>
---
drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index dab290a4d19d..47ee9a895cd2 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -220,7 +220,7 @@ svm_migrate_get_vram_page(struct svm_range *prange, unsigned long pfn)
page = pfn_to_page(pfn);
svm_range_bo_ref(prange->svm_bo);
page->zone_device_data = prange->svm_bo;
- get_page(page);
+ init_page_count(page);
lock_page(page);
}

--
2.32.0

Subject: [PATCH v1 09/14] mm: call pgmap->ops->page_free for DEVICE_PUBLIC pages

Add the MEMORY_DEVICE_PUBLIC case to the free_zone_device_page
callback, so that device public type memory pages are freed properly.

Signed-off-by: Alex Sierra <[email protected]>
---
mm/memremap.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/memremap.c b/mm/memremap.c
index 2c8898ed006f..b9a8ed089cc6 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -460,7 +460,7 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
EXPORT_SYMBOL_GPL(get_dev_pagemap);

#ifdef CONFIG_DEV_PAGEMAP_OPS
-static void free_device_private_page(struct page *page)
+static void free_device_page(struct page *page)
{

__ClearPageWaiters(page);
@@ -494,13 +494,14 @@ static void free_device_private_page(struct page *page)
void free_zone_device_page(struct page *page)
{
switch (page->pgmap->type) {
+ case MEMORY_DEVICE_PRIVATE:
+ case MEMORY_DEVICE_PUBLIC:
+ free_device_page(page);
+ return;
case MEMORY_DEVICE_FS_DAX:
/* notify page idle */
wake_up_var(&page->_refcount);
return;
- case MEMORY_DEVICE_PRIVATE:
- free_device_private_page(page);
- return;
default:
return;
}
--
2.32.0

Subject: [PATCH v1 08/14] mm: add public type support to migrate_vma helpers

Add device public type case to migrate_vma_insert_page,
migrate_vma_pages and migrate_vma_check_page helpers.

Signed-off-by: Alex Sierra <[email protected]>
---
mm/migrate.c | 19 ++++++++++++-------
1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index d4ae2da99607..09817aded633 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2566,7 +2566,7 @@ static bool migrate_vma_check_page(struct page *page)
* FIXME proper solution is to rework migration_entry_wait() so
* it does not need to take a reference on page.
*/
- return is_device_private_page(page);
+ return is_device_page(page);
}

/* For file back page */
@@ -2855,7 +2855,7 @@ EXPORT_SYMBOL(migrate_vma_setup);
* handle_pte_fault()
* do_anonymous_page()
* to map in an anonymous zero page but the struct page will be a ZONE_DEVICE
- * private page.
+ * private or public page.
*/
static void migrate_vma_insert_page(struct migrate_vma *migrate,
unsigned long addr,
@@ -2926,10 +2926,15 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,

swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE);
entry = swp_entry_to_pte(swp_entry);
+ } else if (is_device_page(page)) {
+ entry = pte_mkold(mk_pte(page,
+ READ_ONCE(vma->vm_page_prot)));
+ if (vma->vm_flags & VM_WRITE)
+ entry = pte_mkwrite(pte_mkdirty(entry));
} else {
/*
- * For now we only support migrating to un-addressable
- * device memory.
+ * We support migrating to private and public types
+ * for device zone memory.
*/
pr_warn_once("Unsupported ZONE_DEVICE page type.\n");
goto abort;
@@ -3035,10 +3040,10 @@ void migrate_vma_pages(struct migrate_vma *migrate)
mapping = page_mapping(page);

if (is_zone_device_page(newpage)) {
- if (is_device_private_page(newpage)) {
+ if (is_device_page(newpage)) {
/*
- * For now only support private anonymous when
- * migrating to un-addressable device memory.
+ * For now only support private and public
+ * anonymous when migrating to device memory.
*/
if (mapping) {
migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
--
2.32.0

Subject: [PATCH v1 11/14] lib: test_hmm add module param for zone device type

In order to configure device public memory in test_hmm, two module
parameters must be passed, spm_addr_dev0 and spm_addr_dev1, which
correspond to the SP start addresses of the two devices. If no
parameters are passed, the private device type is configured.
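
For example (addresses are illustrative and must match the SPM ranges
actually reserved on the system, e.g. via efi_fake_mem as described in
patch 12):

    modprobe test_hmm spm_addr_dev0=0x100000000 spm_addr_dev1=0x140000000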

Signed-off-by: Alex Sierra <[email protected]>
---
v5:
Removed the devmem->pagemap.type = MEMORY_DEVICE_PRIVATE assignment in
dmirror_allocate_chunk that was unconditionally forcing pagemap.type
to MEMORY_DEVICE_PRIVATE.

v6:
Check for NULL pointers in the resource and memremap references
in dmirror_allocate_chunk.

v7:
Because the patch "kernel: resource: lookup_resource as exported
symbol" was dropped from this patch series, lookup_resource is no
longer a callable function. It was used in the public device
configuration to get the start and end addresses for the pgmap->range
struct. This information is now taken directly from the spm_addr_devX
parameters and the fixed DEVMEM_CHUNK_SIZE.
---
lib/test_hmm.c | 66 +++++++++++++++++++++++++++++++--------------
lib/test_hmm_uapi.h | 1 +
2 files changed, 47 insertions(+), 20 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 3cd91ca31dd7..ef27e355738a 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -33,6 +33,16 @@
#define DEVMEM_CHUNK_SIZE (256 * 1024 * 1024U)
#define DEVMEM_CHUNKS_RESERVE 16

+static unsigned long spm_addr_dev0;
+module_param(spm_addr_dev0, long, 0644);
+MODULE_PARM_DESC(spm_addr_dev0,
+ "Specify start address for SPM (special purpose memory) used for device 0. By setting this Generic device type will be used. Make sure spm_addr_dev1 is set too");
+
+static unsigned long spm_addr_dev1;
+module_param(spm_addr_dev1, long, 0644);
+MODULE_PARM_DESC(spm_addr_dev1,
+ "Specify start address for SPM (special purpose memory) used for device 1. By setting this Generic device type will be used. Make sure spm_addr_dev0 is set too");
+
static const struct dev_pagemap_ops dmirror_devmem_ops;
static const struct mmu_interval_notifier_ops dmirror_min_ops;
static dev_t dmirror_dev;
@@ -450,11 +460,11 @@ static int dmirror_write(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd)
return ret;
}

-static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
+static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
struct page **ppage)
{
struct dmirror_chunk *devmem;
- struct resource *res;
+ struct resource *res = NULL;
unsigned long pfn;
unsigned long pfn_first;
unsigned long pfn_last;
@@ -462,17 +472,29 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,

devmem = kzalloc(sizeof(*devmem), GFP_KERNEL);
if (!devmem)
- return false;
+ return -ENOMEM;

- res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
- "hmm_dmirror");
- if (IS_ERR(res))
- goto err_devmem;
+ if (!spm_addr_dev0 && !spm_addr_dev1) {
+ res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
+ "hmm_dmirror");
+ if (IS_ERR_OR_NULL(res))
+ goto err_devmem;
+ devmem->pagemap.range.start = res->start;
+ devmem->pagemap.range.end = res->end;
+ devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+ mdevice->zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
+ } else if (spm_addr_dev0 && spm_addr_dev1) {
+ devmem->pagemap.range.start = MINOR(mdevice->cdevice.dev) ?
+ spm_addr_dev0 :
+ spm_addr_dev1;
+ devmem->pagemap.range.end = devmem->pagemap.range.start +
+ DEVMEM_CHUNK_SIZE - 1;
+ devmem->pagemap.type = MEMORY_DEVICE_PUBLIC;
+ mdevice->zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_PUBLIC;
+ } else {
+ pr_err("Both spm_addr_dev parameters should be set\n");
+ }

- mdevice->zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
- devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
- devmem->pagemap.range.start = res->start;
- devmem->pagemap.range.end = res->end;
devmem->pagemap.nr_range = 1;
devmem->pagemap.ops = &dmirror_devmem_ops;
devmem->pagemap.owner = mdevice;
@@ -493,10 +515,14 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
mdevice->devmem_capacity = new_capacity;
mdevice->devmem_chunks = new_chunks;
}
-
ptr = memremap_pages(&devmem->pagemap, numa_node_id());
- if (IS_ERR(ptr))
+ if (IS_ERR_OR_NULL(ptr)) {
+ if (ptr)
+ ret = PTR_ERR(ptr);
+ else
+ ret = -EFAULT;
goto err_release;
+ }

devmem->mdevice = mdevice;
pfn_first = devmem->pagemap.range.start >> PAGE_SHIFT;
@@ -529,7 +555,8 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,

err_release:
mutex_unlock(&mdevice->devmem_lock);
- release_mem_region(devmem->pagemap.range.start, range_len(&devmem->pagemap.range));
+ if (res)
+ release_mem_region(devmem->pagemap.range.start, range_len(&devmem->pagemap.range));
err_devmem:
kfree(devmem);

@@ -1097,10 +1124,8 @@ static int dmirror_device_init(struct dmirror_device *mdevice, int id)
if (ret)
return ret;

- /* Build a list of free ZONE_DEVICE private struct pages */
- dmirror_allocate_chunk(mdevice, NULL);
-
- return 0;
+ /* Build a list of free ZONE_DEVICE struct pages */
+ return dmirror_allocate_chunk(mdevice, NULL);
}

static void dmirror_device_remove(struct dmirror_device *mdevice)
@@ -1113,8 +1138,9 @@ static void dmirror_device_remove(struct dmirror_device *mdevice)
mdevice->devmem_chunks[i];

memunmap_pages(&devmem->pagemap);
- release_mem_region(devmem->pagemap.range.start,
- range_len(&devmem->pagemap.range));
+ if (devmem->pagemap.type == MEMORY_DEVICE_PRIVATE)
+ release_mem_region(devmem->pagemap.range.start,
+ range_len(&devmem->pagemap.range));
kfree(devmem);
}
kfree(mdevice->devmem_chunks);
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index ee88701793d5..00259d994410 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -65,6 +65,7 @@ enum {
enum {
/* 0 is reserved to catch uninitialized type fields */
HMM_DMIRROR_MEMORY_DEVICE_PRIVATE = 1,
+ HMM_DMIRROR_MEMORY_DEVICE_PUBLIC,
};

#endif /* _LIB_TEST_HMM_UAPI_H */
--
2.32.0

Subject: [PATCH v1 10/14] lib: test_hmm add ioctl to get zone device type

Add a new ioctl command to query the zone device type. This will be
used once test_hmm adds the zone device public type.
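
A minimal userspace sketch of the query (fd is an open hmm_dmirror
device file descriptor; this matches how the selftest in patch 13 uses
it):

    struct hmm_dmirror_cmd cmd = { 0 };

    cmd.npages = 1; /* the ioctl handler sanity-checks addr/npages */
    if (ioctl(fd, HMM_DMIRROR_GET_MEM_DEV_TYPE, &cmd) == 0)
        printf("zone device type: %llu\n",
               (unsigned long long)cmd.zone_device_type);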

Signed-off-by: Alex Sierra <[email protected]>
---
lib/test_hmm.c | 15 ++++++++++++++-
lib/test_hmm_uapi.h | 7 +++++++
2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 6998f10350ea..3cd91ca31dd7 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -82,6 +82,7 @@ struct dmirror_chunk {
struct dmirror_device {
struct cdev cdevice;
struct hmm_devmem *devmem;
+ unsigned int zone_device_type;

unsigned int devmem_capacity;
unsigned int devmem_count;
@@ -468,6 +469,7 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
if (IS_ERR(res))
goto err_devmem;

+ mdevice->zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
devmem->pagemap.range.start = res->start;
devmem->pagemap.range.end = res->end;
@@ -912,6 +914,15 @@ static int dmirror_snapshot(struct dmirror *dmirror,
return ret;
}

+static int dmirror_get_device_type(struct dmirror *dmirror,
+ struct hmm_dmirror_cmd *cmd)
+{
+ mutex_lock(&dmirror->mutex);
+ cmd->zone_device_type = dmirror->mdevice->zone_device_type;
+ mutex_unlock(&dmirror->mutex);
+
+ return 0;
+}
static long dmirror_fops_unlocked_ioctl(struct file *filp,
unsigned int command,
unsigned long arg)
@@ -952,7 +963,9 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
case HMM_DMIRROR_SNAPSHOT:
ret = dmirror_snapshot(dmirror, &cmd);
break;
-
+ case HMM_DMIRROR_GET_MEM_DEV_TYPE:
+ ret = dmirror_get_device_type(dmirror, &cmd);
+ break;
default:
return -EINVAL;
}
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index 670b4ef2a5b6..ee88701793d5 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -26,6 +26,7 @@ struct hmm_dmirror_cmd {
__u64 npages;
__u64 cpages;
__u64 faults;
+ __u64 zone_device_type;
};

/* Expose the address space of the calling process through hmm device file */
@@ -33,6 +34,7 @@ struct hmm_dmirror_cmd {
#define HMM_DMIRROR_WRITE _IOWR('H', 0x01, struct hmm_dmirror_cmd)
#define HMM_DMIRROR_MIGRATE _IOWR('H', 0x02, struct hmm_dmirror_cmd)
#define HMM_DMIRROR_SNAPSHOT _IOWR('H', 0x03, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_GET_MEM_DEV_TYPE _IOWR('H', 0x04, struct hmm_dmirror_cmd)

/*
* Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
@@ -60,4 +62,9 @@ enum {
HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE = 0x30,
};

+enum {
+ /* 0 is reserved to catch uninitialized type fields */
+ HMM_DMIRROR_MEMORY_DEVICE_PRIVATE = 1,
+};
+
#endif /* _LIB_TEST_HMM_UAPI_H */
--
2.32.0

Subject: [PATCH v1 13/14] tools: update hmm-test to support device public type

Test cases such as migrate_fault and migrate_multiple were modified
to migrate explicitly from device to system memory, without the need
for page faults, when using the device public type.

The snapshot test case was updated to read the memory device type
first and, based on that, check the proper returned results. A
migrate_ping_pong test case was added to test explicit migration from
device to system memory for both private and public zone types.

Helpers to migrate from device to system memory and vice versa were
also added.

Signed-off-by: Alex Sierra <[email protected]>
---
tools/testing/selftests/vm/hmm-tests.c | 142 +++++++++++++++++++++----
1 file changed, 124 insertions(+), 18 deletions(-)

diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
index 5d1ac691b9f4..477c6283dd1b 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -44,6 +44,8 @@ struct hmm_buffer {
int fd;
uint64_t cpages;
uint64_t faults;
+ int zone_device_type;
+ bool alloc_to_devmem;
};

#define TWOMEG (1 << 21)
@@ -133,6 +135,7 @@ static int hmm_dmirror_cmd(int fd,
cmd.addr = (__u64)buffer->ptr;
cmd.ptr = (__u64)buffer->mirror;
cmd.npages = npages;
+ cmd.alloc_to_devmem = buffer->alloc_to_devmem;

for (;;) {
ret = ioctl(fd, request, &cmd);
@@ -144,6 +147,7 @@ static int hmm_dmirror_cmd(int fd,
}
buffer->cpages = cmd.cpages;
buffer->faults = cmd.faults;
+ buffer->zone_device_type = cmd.zone_device_type;

return 0;
}
@@ -211,6 +215,34 @@ static void hmm_nanosleep(unsigned int n)
nanosleep(&t, NULL);
}

+static int hmm_migrate_sys_to_dev(int fd,
+ struct hmm_buffer *buffer,
+ unsigned long npages)
+{
+ buffer->alloc_to_devmem = true;
+ return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+}
+
+static int hmm_migrate_dev_to_sys(int fd,
+ struct hmm_buffer *buffer,
+ unsigned long npages)
+{
+ buffer->alloc_to_devmem = false;
+ return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+}
+
+static int hmm_is_private_device(int fd, bool *res)
+{
+ struct hmm_buffer buffer;
+ int ret;
+
+ buffer.ptr = 0;
+ ret = hmm_dmirror_cmd(fd, HMM_DMIRROR_GET_MEM_DEV_TYPE, &buffer, 1);
+ *res = (buffer.zone_device_type == HMM_DMIRROR_MEMORY_DEVICE_PRIVATE);
+
+ return ret;
+}
+
/*
* Simple NULL test of device open/close.
*/
@@ -875,7 +907,7 @@ TEST_F(hmm, migrate)
ptr[i] = i;

/* Migrate memory to device. */
- ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, npages);

@@ -923,7 +955,7 @@ TEST_F(hmm, migrate_fault)
ptr[i] = i;

/* Migrate memory to device. */
- ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, npages);

@@ -936,7 +968,7 @@ TEST_F(hmm, migrate_fault)
ASSERT_EQ(ptr[i], i);

/* Migrate memory to the device again. */
- ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, npages);

@@ -976,7 +1008,7 @@ TEST_F(hmm, migrate_shared)
ASSERT_NE(buffer->ptr, MAP_FAILED);

/* Migrate memory to device. */
- ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
ASSERT_EQ(ret, -ENOENT);

hmm_buffer_free(buffer);
@@ -1015,7 +1047,7 @@ TEST_F(hmm2, migrate_mixed)
p = buffer->ptr;

/* Migrating a protected area should be an error. */
- ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, npages);
+ ret = hmm_migrate_sys_to_dev(self->fd1, buffer, npages);
ASSERT_EQ(ret, -EINVAL);

/* Punch a hole after the first page address. */
@@ -1023,7 +1055,7 @@ TEST_F(hmm2, migrate_mixed)
ASSERT_EQ(ret, 0);

/* We expect an error if the vma doesn't cover the range. */
- ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 3);
+ ret = hmm_migrate_sys_to_dev(self->fd1, buffer, 3);
ASSERT_EQ(ret, -EINVAL);

/* Page 2 will be a read-only zero page. */
@@ -1055,13 +1087,13 @@ TEST_F(hmm2, migrate_mixed)

/* Now try to migrate pages 2-5 to device 1. */
buffer->ptr = p + 2 * self->page_size;
- ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 4);
+ ret = hmm_migrate_sys_to_dev(self->fd1, buffer, 4);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, 4);

/* Page 5 won't be migrated to device 0 because it's on device 1. */
buffer->ptr = p + 5 * self->page_size;
- ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_MIGRATE, buffer, 1);
+ ret = hmm_migrate_sys_to_dev(self->fd0, buffer, 1);
ASSERT_EQ(ret, -ENOENT);
buffer->ptr = p;

@@ -1070,8 +1102,12 @@ TEST_F(hmm2, migrate_mixed)
}

/*
- * Migrate anonymous memory to device private memory and fault it back to system
- * memory multiple times.
+ * Migrate anonymous memory to device memory and back to system memory
+ * multiple times. In case of a private zone configuration, this is done
+ * through page faults triggered by CPU access. In case of a public zone
+ * configuration, the pages must be migrated back to system memory explicitly,
+ * because the public device zone is coherently accessible by the CPU and
+ * therefore will not generate any page fault.
*/
TEST_F(hmm, migrate_multiple)
{
@@ -1082,7 +1118,9 @@ TEST_F(hmm, migrate_multiple)
unsigned long c;
int *ptr;
int ret;
+ bool is_private;

+ ASSERT_EQ(hmm_is_private_device(self->fd, &is_private), 0);
npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
ASSERT_NE(npages, 0);
size = npages << self->page_shift;
@@ -1107,8 +1145,7 @@ TEST_F(hmm, migrate_multiple)
ptr[i] = i;

/* Migrate memory to device. */
- ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer,
- npages);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, npages);

@@ -1116,7 +1153,12 @@ TEST_F(hmm, migrate_multiple)
for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
ASSERT_EQ(ptr[i], i);

- /* Fault pages back to system memory and check them. */
+ /* Migrate back to system memory and check them. */
+ if (!is_private) {
+ ret = hmm_migrate_dev_to_sys(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ }
+
for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
ASSERT_EQ(ptr[i], i);

@@ -1261,10 +1303,12 @@ TEST_F(hmm2, snapshot)
unsigned char *m;
int ret;
int val;
+ bool is_private;

npages = 7;
size = npages << self->page_shift;

+ ASSERT_EQ(hmm_is_private_device(self->fd0, &is_private), 0);
buffer = malloc(sizeof(*buffer));
ASSERT_NE(buffer, NULL);

@@ -1312,13 +1356,13 @@ TEST_F(hmm2, snapshot)

/* Page 5 will be migrated to device 0. */
buffer->ptr = p + 5 * self->page_size;
- ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_MIGRATE, buffer, 1);
+ ret = hmm_migrate_sys_to_dev(self->fd0, buffer, 1);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, 1);

/* Page 6 will be migrated to device 1. */
buffer->ptr = p + 6 * self->page_size;
- ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 1);
+ ret = hmm_migrate_sys_to_dev(self->fd1, buffer, 1);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, 1);

@@ -1335,9 +1379,16 @@ TEST_F(hmm2, snapshot)
ASSERT_EQ(m[2], HMM_DMIRROR_PROT_ZERO | HMM_DMIRROR_PROT_READ);
ASSERT_EQ(m[3], HMM_DMIRROR_PROT_READ);
ASSERT_EQ(m[4], HMM_DMIRROR_PROT_WRITE);
- ASSERT_EQ(m[5], HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL |
- HMM_DMIRROR_PROT_WRITE);
- ASSERT_EQ(m[6], HMM_DMIRROR_PROT_NONE);
+ if (is_private) {
+ ASSERT_EQ(m[5], HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL |
+ HMM_DMIRROR_PROT_WRITE);
+ ASSERT_EQ(m[6], HMM_DMIRROR_PROT_NONE);
+ } else {
+ ASSERT_EQ(m[5], HMM_DMIRROR_PROT_DEV_PUBLIC |
+ HMM_DMIRROR_PROT_WRITE);
+ ASSERT_EQ(m[6], HMM_DMIRROR_PROT_DEV_PUBLIC |
+ HMM_DMIRROR_PROT_WRITE);
+ }

hmm_buffer_free(buffer);
}
@@ -1485,4 +1536,59 @@ TEST_F(hmm2, double_map)
hmm_buffer_free(buffer);
}

+/*
+ * Migrate anonymous memory to device memory and migrate back to system memory
+ * explicitly, without generating a page fault.
+ */
+TEST_F(hmm, migrate_ping_pong)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long i;
+ int *ptr;
+ int ret;
+
+ npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+ ASSERT_NE(npages, 0);
+ size = npages << self->page_shift;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = size;
+ buffer->mirror = malloc(size);
+ buffer->alloc_to_devmem = true;
+ ASSERT_NE(buffer->mirror, NULL);
+
+ buffer->ptr = mmap(NULL, size,
+ PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS,
+ buffer->fd, 0);
+ ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+ /* Initialize buffer in system memory. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ptr[i] = i;
+
+ /* Migrate memory to device. */
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+ /* Check what the device read. */
+ for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], i);
+
+ /* Migrate memory back to system mem. */
+ ret = hmm_migrate_dev_to_sys(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+
+ /* Check the buffer migrated back to system memory. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], i);
+
+ hmm_buffer_free(buffer);
+}
+
TEST_HARNESS_MAIN
--
2.32.0

Subject: [PATCH v1 12/14] lib: add support for device public type in test_hmm

The device public type uses device memory that is coherently accessible
by the CPU. It may show up as an SP (special purpose) memory range in
the BIOS-e820 memory enumeration. If no SP memory is supported by the
system, it can be faked by setting CONFIG_EFI_FAKE_MEMMAP.

Currently, test_hmm only supports two different SP ranges of at least
256MB in size. These can be specified with the efi_fake_mem kernel
parameter. Ex. two SP ranges of 1GB each, starting at physical
addresses 0x100000000 and 0x140000000:
efi_fake_mem=1G@0x100000000:0x40000,1G@0x140000000:0x40000

Signed-off-by: Alex Sierra <[email protected]>
---
lib/test_hmm.c | 166 +++++++++++++++++++++++++++-----------------
lib/test_hmm_uapi.h | 10 ++-
2 files changed, 113 insertions(+), 63 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index ef27e355738a..e346a48e2509 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -469,6 +469,7 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
unsigned long pfn_first;
unsigned long pfn_last;
void *ptr;
+ int ret = -ENOMEM;

devmem = kzalloc(sizeof(*devmem), GFP_KERNEL);
if (!devmem)
@@ -551,7 +552,7 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
}
spin_unlock(&mdevice->lock);

- return true;
+ return 0;

err_release:
mutex_unlock(&mdevice->devmem_lock);
@@ -560,7 +561,7 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
err_devmem:
kfree(devmem);

- return false;
+ return ret;
}

static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
@@ -569,8 +570,10 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
struct page *rpage;

/*
- * This is a fake device so we alloc real system memory to store
- * our device memory.
+ * For ZONE_DEVICE private type, this is a fake device so we alloc real
+ * system memory to store our device memory.
+ * For ZONE_DEVICE public type we use the actual dpage to store the data
+ * and ignore rpage.
*/
rpage = alloc_page(GFP_HIGHUSER);
if (!rpage)
@@ -603,7 +606,7 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
struct dmirror *dmirror)
{
struct dmirror_device *mdevice = dmirror->mdevice;
- const unsigned long *src = args->src;
+ unsigned long *src = args->src;
unsigned long *dst = args->dst;
unsigned long addr;

@@ -621,12 +624,18 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
* unallocated pte_none() or read-only zero page.
*/
spage = migrate_pfn_to_page(*src);
-
+ if (spage && is_zone_device_page(spage)) {
+ pr_debug("page already in device spage pfn: 0x%lx\n",
+ page_to_pfn(spage));
+ *src &= ~MIGRATE_PFN_MIGRATE;
+ continue;
+ }
dpage = dmirror_devmem_alloc_page(mdevice);
if (!dpage)
continue;

- rpage = dpage->zone_device_data;
+ rpage = is_device_private_page(dpage) ? dpage->zone_device_data :
+ dpage;
if (spage)
copy_highpage(rpage, spage);
else
@@ -638,8 +647,10 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
* the simulated device memory and that page holds the pointer
* to the mirror.
*/
+ rpage = dpage->zone_device_data;
rpage->zone_device_data = dmirror;
-
+ pr_debug("migrating from sys to dev pfn src: 0x%lx pfn dst: 0x%lx\n",
+ page_to_pfn(spage), page_to_pfn(dpage));
*dst = migrate_pfn(page_to_pfn(dpage)) |
MIGRATE_PFN_LOCKED;
if ((*src & MIGRATE_PFN_WRITE) ||
@@ -673,10 +684,13 @@ static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
continue;

/*
- * Store the page that holds the data so the page table
- * doesn't have to deal with ZONE_DEVICE private pages.
+ * For ZONE_DEVICE private pages we store the page that
+ * holds the data so the page table doesn't have to deal with it.
+ * For ZONE_DEVICE public pages we store the actual page, since
+ * the CPU has coherent access to the page.
*/
- entry = dpage->zone_device_data;
+ entry = is_device_private_page(dpage) ? dpage->zone_device_data :
+ dpage;
if (*dst & MIGRATE_PFN_WRITE)
entry = xa_tag_pointer(entry, DPT_XA_TAG_WRITE);
entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC);
@@ -690,6 +704,47 @@ static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
return 0;
}

+static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
+ struct dmirror *dmirror)
+{
+ unsigned long *src = args->src;
+ unsigned long *dst = args->dst;
+ unsigned long start = args->start;
+ unsigned long end = args->end;
+ unsigned long addr;
+
+ for (addr = start; addr < end; addr += PAGE_SIZE,
+ src++, dst++) {
+ struct page *dpage, *spage;
+
+ spage = migrate_pfn_to_page(*src);
+ if (!spage || !(*src & MIGRATE_PFN_MIGRATE))
+ continue;
+ if (is_device_private_page(spage)) {
+ spage = spage->zone_device_data;
+ } else {
+ pr_debug("page already in system or SPM spage pfn: 0x%lx\n",
+ page_to_pfn(spage));
+ *src &= ~MIGRATE_PFN_MIGRATE;
+ continue;
+ }
+ dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
+ if (!dpage)
+ continue;
+ pr_debug("migrating from dev to sys pfn src: 0x%lx pfn dst: 0x%lx\n",
+ page_to_pfn(spage), page_to_pfn(dpage));
+
+ lock_page(dpage);
+ xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
+ copy_highpage(dpage, spage);
+ *dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
+ if (*src & MIGRATE_PFN_WRITE)
+ *dst |= MIGRATE_PFN_WRITE;
+ }
+ return 0;
+}
+
+
static int dmirror_migrate(struct dmirror *dmirror,
struct hmm_dmirror_cmd *cmd)
{
@@ -731,33 +786,46 @@ static int dmirror_migrate(struct dmirror *dmirror,
args.start = addr;
args.end = next;
args.pgmap_owner = dmirror->mdevice;
- args.flags = MIGRATE_VMA_SELECT_SYSTEM;
+ args.flags = (!cmd->alloc_to_devmem &&
+ dmirror->mdevice->zone_device_type ==
+ HMM_DMIRROR_MEMORY_DEVICE_PRIVATE) ?
+ MIGRATE_VMA_SELECT_DEVICE_PRIVATE :
+ MIGRATE_VMA_SELECT_SYSTEM;
ret = migrate_vma_setup(&args);
if (ret)
goto out;

- dmirror_migrate_alloc_and_copy(&args, dmirror);
+ if (cmd->alloc_to_devmem) {
+ pr_debug("Migrating from sys mem to device mem\n");
+ dmirror_migrate_alloc_and_copy(&args, dmirror);
+ } else {
+ pr_debug("Migrating from device mem to sys mem\n");
+ dmirror_devmem_fault_alloc_and_copy(&args, dmirror);
+ }
migrate_vma_pages(&args);
- dmirror_migrate_finalize_and_map(&args, dmirror);
+ if (cmd->alloc_to_devmem)
+ dmirror_migrate_finalize_and_map(&args, dmirror);
migrate_vma_finalize(&args);
}
mmap_read_unlock(mm);
mmput(mm);

- /* Return the migrated data for verification. */
- ret = dmirror_bounce_init(&bounce, start, size);
- if (ret)
- return ret;
- mutex_lock(&dmirror->mutex);
- ret = dmirror_do_read(dmirror, start, end, &bounce);
- mutex_unlock(&dmirror->mutex);
- if (ret == 0) {
- if (copy_to_user(u64_to_user_ptr(cmd->ptr), bounce.ptr,
- bounce.size))
- ret = -EFAULT;
+	/* Return the migrated data for verification. Only for pages in the device zone. */
+ if (cmd->alloc_to_devmem) {
+ ret = dmirror_bounce_init(&bounce, start, size);
+ if (ret)
+ return ret;
+ mutex_lock(&dmirror->mutex);
+ ret = dmirror_do_read(dmirror, start, end, &bounce);
+ mutex_unlock(&dmirror->mutex);
+ if (ret == 0) {
+ if (copy_to_user(u64_to_user_ptr(cmd->ptr), bounce.ptr,
+ bounce.size))
+ ret = -EFAULT;
+ }
+ cmd->cpages = bounce.cpages;
+ dmirror_bounce_fini(&bounce);
}
- cmd->cpages = bounce.cpages;
- dmirror_bounce_fini(&bounce);
return ret;

out:
@@ -781,9 +849,15 @@ static void dmirror_mkentry(struct dmirror *dmirror, struct hmm_range *range,
}

page = hmm_pfn_to_page(entry);
- if (is_device_private_page(page)) {
- /* Is the page migrated to this device or some other? */
- if (dmirror->mdevice == dmirror_page_to_device(page))
+ if (is_device_page(page)) {
+ /* Is page ZONE_DEVICE public? */
+ if (!is_device_private_page(page))
+ *perm = HMM_DMIRROR_PROT_DEV_PUBLIC;
+ /*
+ * Is page ZONE_DEVICE private migrated to
+ * this device or some other?
+ */
+ else if (dmirror->mdevice == dmirror_page_to_device(page))
*perm = HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL;
else
*perm = HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE;
@@ -1030,38 +1104,6 @@ static void dmirror_devmem_free(struct page *page)
spin_unlock(&mdevice->lock);
}

-static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
- struct dmirror *dmirror)
-{
- const unsigned long *src = args->src;
- unsigned long *dst = args->dst;
- unsigned long start = args->start;
- unsigned long end = args->end;
- unsigned long addr;
-
- for (addr = start; addr < end; addr += PAGE_SIZE,
- src++, dst++) {
- struct page *dpage, *spage;
-
- spage = migrate_pfn_to_page(*src);
- if (!spage || !(*src & MIGRATE_PFN_MIGRATE))
- continue;
- spage = spage->zone_device_data;
-
- dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
- if (!dpage)
- continue;
-
- lock_page(dpage);
- xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
- copy_highpage(dpage, spage);
- *dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
- if (*src & MIGRATE_PFN_WRITE)
- *dst |= MIGRATE_PFN_WRITE;
- }
- return 0;
-}
-
static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
{
struct migrate_vma args;
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index 00259d994410..b6cb8a7d2470 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -17,8 +17,12 @@
* @addr: (in) user address the device will read/write
* @ptr: (in) user address where device data is copied to/from
* @npages: (in) number of pages to read/write
+ * @alloc_to_devmem: (in) desired allocation destination during migration.
+ * True if allocation is to device memory.
+ * False if allocation is to system memory.
* @cpages: (out) number of pages copied
* @faults: (out) number of device page faults seen
+ * @zone_device_type: (out) zone device memory type
*/
struct hmm_dmirror_cmd {
__u64 addr;
@@ -26,7 +30,8 @@ struct hmm_dmirror_cmd {
__u64 npages;
__u64 cpages;
__u64 faults;
- __u64 zone_device_type;
+ __u32 zone_device_type;
+ __u32 alloc_to_devmem;
};

/* Expose the address space of the calling process through hmm device file */
@@ -49,6 +54,8 @@ struct hmm_dmirror_cmd {
* device the ioctl() is made
* HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE: Migrated device private page on some
* other device
+ * HMM_DMIRROR_PROT_DEV_PUBLIC: Migrated device public page on the device
+ * the ioctl() is made
*/
enum {
HMM_DMIRROR_PROT_ERROR = 0xFF,
@@ -60,6 +67,7 @@ enum {
HMM_DMIRROR_PROT_ZERO = 0x10,
HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL = 0x20,
HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE = 0x30,
+ HMM_DMIRROR_PROT_DEV_PUBLIC = 0x40,
};

enum {
--
2.32.0
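
To illustrate how the new alloc_to_devmem field is meant to be used from
userspace, here is a minimal, hypothetical sketch of a selftest helper that
requests a migration into device memory through the existing
HMM_DMIRROR_MIGRATE ioctl (the helper name and surrounding scaffolding are
assumptions, not code from this series):

	#include <stdint.h>
	#include <string.h>
	#include <sys/ioctl.h>

	#include "test_hmm_uapi.h"

	/* Hypothetical helper: migrate npages starting at buf into device
	 * memory and let the driver copy the migrated data back into mirror.
	 */
	static int dmirror_migrate_to_devmem(int fd, void *buf, void *mirror,
					     unsigned long npages)
	{
		struct hmm_dmirror_cmd cmd;

		memset(&cmd, 0, sizeof(cmd));
		cmd.addr = (uintptr_t)buf;	/* user address to migrate */
		cmd.ptr = (uintptr_t)mirror;	/* buffer for the copied-back data */
		cmd.npages = npages;
		cmd.alloc_to_devmem = 1;	/* destination is device memory */

		return ioctl(fd, HMM_DMIRROR_MIGRATE, &cmd);
	}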

Subject: [PATCH v1 14/14] tools: update test_hmm script to support SP config

Add two more parameters to set the spm_addr_dev0 and spm_addr_dev1
addresses. These two parameters configure the start SP (special purpose)
addresses for each device in the test_hmm driver.
Consequently, this configures the zone device type as public.

Signed-off-by: Alex Sierra <[email protected]>
---
tools/testing/selftests/vm/test_hmm.sh | 20 +++++++++++++++++---
1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/vm/test_hmm.sh b/tools/testing/selftests/vm/test_hmm.sh
index 0647b525a625..3eeabe94399f 100755
--- a/tools/testing/selftests/vm/test_hmm.sh
+++ b/tools/testing/selftests/vm/test_hmm.sh
@@ -40,7 +40,18 @@ check_test_requirements()

load_driver()
{
- modprobe $DRIVER > /dev/null 2>&1
+ if [ $# -eq 0 ]; then
+ modprobe $DRIVER > /dev/null 2>&1
+ else
+ if [ $# -eq 2 ]; then
+ modprobe $DRIVER spm_addr_dev0=$1 spm_addr_dev1=$2 \
+ > /dev/null 2>&1
+ else
+ echo "Missing module parameters. Make sure to pass"\
+ "spm_addr_dev0 and spm_addr_dev1"
+ usage
+ fi
+ fi
if [ $? == 0 ]; then
major=$(awk "\$2==\"HMM_DMIRROR\" {print \$1}" /proc/devices)
mknod /dev/hmm_dmirror0 c $major 0
@@ -58,7 +69,7 @@ run_smoke()
{
echo "Running smoke test. Note, this test provides basic coverage."

- load_driver
+ load_driver $1 $2
$(dirname "${BASH_SOURCE[0]}")/hmm-tests
unload_driver
}
@@ -75,6 +86,9 @@ usage()
echo "# Smoke testing"
echo "./${TEST_NAME}.sh smoke"
echo
+ echo "# Smoke testing with SPM enabled"
+ echo "./${TEST_NAME}.sh smoke <spm_addr_dev0> <spm_addr_dev1>"
+ echo
exit 0
}

@@ -84,7 +98,7 @@ function run_test()
usage
else
if [ "$1" = "smoke" ]; then
- run_smoke
+ run_smoke $2 $3
else
usage
fi
--
2.32.0

2021-08-25 07:36:11

by Christoph Hellwig

Subject: Re: [PATCH v1 01/14] ext4/xfs: add page refcount helper

On Tue, Aug 24, 2021 at 10:48:15PM -0500, Alex Sierra wrote:
> Signed-off-by: Ralph Campbell <[email protected]>
> Signed-off-by: Alex Sierra <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> ---
> v3:
> [AS]: rename dax_layout_is_idle_page func to dax_page_unused
>
> v4:
> [AS]: This ref count functionality was missing on fuse/dax.c.
> ---

Not sure all tooling can cope with the two --- separators. Personally
I find these per-patch changelogs pretty annoying anyway, but others
have different opinions.

2021-08-25 07:41:23

by Christoph Hellwig

Subject: Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

On Tue, Aug 24, 2021 at 10:48:17PM -0500, Alex Sierra wrote:
> In this case, this is used to migrate pages from device memory, back to
> system memory. This particular device memory type should be accessible
> by the CPU, through IOMEM access. Typically, zone device public type
> memory falls into this category.
>
> Signed-off-by: Alex Sierra <[email protected]>
> ---
> include/linux/migrate.h | 1 +
> mm/migrate.c | 3 ++-
> 2 files changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 4bb4e519e3f5..6b16f417384f 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -156,6 +156,7 @@ static inline unsigned long migrate_pfn(unsigned long pfn)
> enum migrate_vma_direction {
> MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
> MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
> + MIGRATE_VMA_SELECT_IOMEM = 1 << 2,
> };
>
> struct migrate_vma {
> diff --git a/mm/migrate.c b/mm/migrate.c
> index e3a10e2a1bb3..d4ae2da99607 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2406,7 +2406,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> if (is_write_device_private_entry(entry))
> mpfn |= MIGRATE_PFN_WRITE;
> } else {
> - if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
> + if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM) &&
> + !(migrate->flags & MIGRATE_VMA_SELECT_IOMEM))

This makes MIGRATE_VMA_SELECT_SYSTEM and MIGRATE_VMA_SELECT_IOMEM
behave entirely identically, which is redundant. I think we need to
distinguish between the different cases here, and I think the right check
would be pfn_valid(), which should be true for system memory and
false for iomem.

Also shouldn't this be called DEVICE_PUBLIC instead of IOMEM?
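
As a rough illustration (not the posted patch), the pfn_valid() distinction
suggested above could look something like this in migrate_vma_collect_pmd(),
keeping the MIGRATE_VMA_SELECT_IOMEM name from this revision:

		} else {
			pfn = pte_pfn(pte);
			if (pfn_valid(pfn)) {
				/* Ordinary system RAM. */
				if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
					goto next;
			} else {
				/* CPU-accessible device memory without a normal memmap entry. */
				if (!(migrate->flags & MIGRATE_VMA_SELECT_IOMEM))
					goto next;
			}
			/* ... rest of the existing non-swap pte handling ... */
		}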

2021-08-25 07:46:19

by Christoph Hellwig

Subject: Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

On Tue, Aug 24, 2021 at 10:48:17PM -0500, Alex Sierra wrote:
> } else {
> - if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
> + if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM) &&
> + !(migrate->flags & MIGRATE_VMA_SELECT_IOMEM))
> goto next;
> pfn = pte_pfn(pte);
> if (is_zero_pfn(pfn)) {

.. also how is this going to work for the device public memory? That
should be pte_special() and thus fail vm_normal_page().

2021-08-25 07:48:00

by Christoph Hellwig

Subject: Re: [PATCH v1 09/14] mm: call pgmap->ops->page_free for DEVICE_PUBLIC pages

On Tue, Aug 24, 2021 at 10:48:23PM -0500, Alex Sierra wrote:
> Add MEMORY_DEVICE_PUBLIC case to free_zone_device_page callback.
> Device public type memory case is now able to free its pages properly.

This really should go into patch 4. And it might make sense to introduce
free_device_private_page directly with the free_device_page name instead
of renaming it a little later.
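
For reference, a rough sketch of how the DEVICE_PUBLIC case could be folded
into the shared free path under the naming suggested above (assuming the
free_zone_device_page() helper introduced by the refcount rework earlier in
this series; the handling of the other ZONE_DEVICE types is elided):

	static void free_device_page(struct page *page)
	{
		/* Hand the page back to the owning driver. */
		page->mapping = NULL;
		page->pgmap->ops->page_free(page);
	}

	static void free_zone_device_page(struct page *page)
	{
		switch (page->pgmap->type) {
		case MEMORY_DEVICE_PRIVATE:
		case MEMORY_DEVICE_PUBLIC:
			free_device_page(page);
			break;
		default:
			/* other ZONE_DEVICE types keep their existing handling */
			break;
		}
	}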

2021-08-25 07:48:02

by Christoph Hellwig

Subject: Re: [PATCH v1 08/14] mm: add public type support to migrate_vma helpers

This should probably be folded into patch 4.

2021-08-25 11:16:21

by Vlastimil Babka

Subject: Re: [PATCH v1 02/14] mm: remove extra ZONE_DEVICE struct page refcount

On 8/25/21 05:48, Alex Sierra wrote:
> From: Ralph Campbell <[email protected]>
>
> ZONE_DEVICE struct pages have an extra reference count that complicates the
> code for put_page() and several places in the kernel that need to check the
> reference count to see that a page is not being used (gup, compaction,
> migration, etc.). Clean up the code so the reference count doesn't need to
> be treated specially for ZONE_DEVICE.

That's certainly welcome. I just wonder what was the reason to use 1 in the
first place and why it's no longer necessary?

2021-08-25 15:27:54

by Felix Kuehling

Subject: Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

On 2021-08-24 at 11:48 p.m., Alex Sierra wrote:
> In this case, this is used to migrate pages from device memory, back to
> system memory. This particular device memory type should be accessible
> by the CPU, through IOMEM access. Typically, zone device public type
> memory falls into this category.
>
> Signed-off-by: Alex Sierra <[email protected]>
> ---
> include/linux/migrate.h | 1 +
> mm/migrate.c | 3 ++-
> 2 files changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 4bb4e519e3f5..6b16f417384f 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -156,6 +156,7 @@ static inline unsigned long migrate_pfn(unsigned long pfn)
> enum migrate_vma_direction {
> MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
> MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
> + MIGRATE_VMA_SELECT_IOMEM = 1 << 2,

How about calling this MIGRATE_VMA_SELECT_DEVICE_PUBLIC?


> };
>
> struct migrate_vma {
> diff --git a/mm/migrate.c b/mm/migrate.c
> index e3a10e2a1bb3..d4ae2da99607 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2406,7 +2406,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> if (is_write_device_private_entry(entry))
> mpfn |= MIGRATE_PFN_WRITE;
> } else {
> - if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
> + if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM) &&
> + !(migrate->flags & MIGRATE_VMA_SELECT_IOMEM))
> goto next;

For MIGRATE_VMA_SELECT_IOMEM/DEVICE_PUBLIC, I think we should ensure the
pages are ZONE_DEVICE and we should also check the page owner for
symmetry with DEVICE_PRIVATE.
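
A rough sketch of that tightened selection in migrate_vma_collect_pmd()
(flag name kept from this revision; not the posted patch):

	} else {
		struct page *page = vm_normal_page(migrate->vma, addr, pte);

		if (page && is_zone_device_page(page)) {
			/* Only collect CPU-accessible device pages owned by the caller. */
			if (!(migrate->flags & MIGRATE_VMA_SELECT_IOMEM) ||
			    page->pgmap->owner != migrate->pgmap_owner)
				goto next;
		} else {
			if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
				goto next;
		}
		/* ... rest of the existing non-swap pte handling ... */
	}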

Regards,
  Felix


> pfn = pte_pfn(pte);
> if (is_zero_pfn(pfn)) {

2021-08-25 15:33:05

by Felix Kuehling

Subject: Re: [PATCH v1 06/14] drm/amdkfd: add SPM support for SVM

On 2021-08-24 at 11:48 p.m., Alex Sierra wrote:
> When CPU is connected through XGMI, it has coherent
> access to VRAM resource. In this case that resource
> is taken from a table in the device gmc aperture base.
> This resource is used along with the device type, which could
> be DEVICE_PRIVATE or DEVICE_PUBLIC to create the device
> page map region.
>
> Signed-off-by: Alex Sierra <[email protected]>
> Reviewed-by: Felix Kuehling <[email protected]>
> ---
> v7:
> Remove lookup_resource call, so export symbol for this function
> is no longer required. Patch dropped "kernel: resource:
> lookup_resource as exported symbol"
> ---
> drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 29 +++++++++++++++---------
> 1 file changed, 18 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> index 47ee9a895cd2..dd245699479f 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> @@ -865,7 +865,7 @@ int svm_migrate_init(struct amdgpu_device *adev)
> {
> struct kfd_dev *kfddev = adev->kfd.dev;
> struct dev_pagemap *pgmap;
> - struct resource *res;
> + struct resource *res = NULL;
> unsigned long size;
> void *r;
>
> @@ -880,19 +880,25 @@ int svm_migrate_init(struct amdgpu_device *adev)
> * should remove reserved size
> */
> size = ALIGN(adev->gmc.real_vram_size, 2ULL << 20);
> - res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
> - if (IS_ERR(res))
> - return -ENOMEM;
> + if (adev->gmc.xgmi.connected_to_cpu) {
> + pgmap->range.start = adev->gmc.aper_base;
> + pgmap->range.end = adev->gmc.aper_base + adev->gmc.aper_size - 1;
> + pgmap->type = MEMORY_DEVICE_PUBLIC;
> + } else {
> + res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
> + if (IS_ERR(res))
> + return -ENOMEM;
> + pgmap->range.start = res->start;
> + pgmap->range.end = res->end;
> + pgmap->type = MEMORY_DEVICE_PRIVATE;
> + }
>
> - pgmap->type = MEMORY_DEVICE_PRIVATE;
> pgmap->nr_range = 1;
> - pgmap->range.start = res->start;
> - pgmap->range.end = res->end;
> pgmap->ops = &svm_migrate_pgmap_ops;
> pgmap->owner = SVM_ADEV_PGMAP_OWNER(adev);
> - pgmap->flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
> + pgmap->flags = 0;
> r = devm_memremap_pages(adev->dev, pgmap);
> - if (IS_ERR(r)) {
> + if (res && IS_ERR(r)) {

I think the (res && ...) condition means you only detect failures for
DEVICE_PRIVATE memory. Why are you ignoring failures for DEVICE_PUBLIC?

For DEVICE_PUBLIC you can skip devm_release_mem_region, but you still
need to detect and return the error. Also, using res as an indicator is
a bit obscure. I'd put an if (pgmap->type == MEMORY_DEVICE_PRIVATE)
before the devm_release_mem_region call.
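
A rough sketch of that error handling, based on the hunk quoted above (not
the posted patch):

	r = devm_memremap_pages(adev->dev, pgmap);
	if (IS_ERR(r)) {
		pr_err("failed to register HMM device memory\n");
		/* Only DEVICE_PRIVATE took a region from iomem_resource. */
		if (pgmap->type == MEMORY_DEVICE_PRIVATE)
			devm_release_mem_region(adev->dev, res->start,
						res->end - res->start + 1);
		return PTR_ERR(r);
	}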

Regards,
  Felix


> pr_err("failed to register HMM device memory\n");
> devm_release_mem_region(adev->dev, res->start,
> res->end - res->start + 1);
> @@ -914,6 +920,7 @@ void svm_migrate_fini(struct amdgpu_device *adev)
> struct dev_pagemap *pgmap = &adev->kfd.dev->pgmap;
>
> devm_memunmap_pages(adev->dev, pgmap);
> - devm_release_mem_region(adev->dev, pgmap->range.start,
> - pgmap->range.end - pgmap->range.start + 1);
> + if (pgmap->type == MEMORY_DEVICE_PRIVATE)
> + devm_release_mem_region(adev->dev, pgmap->range.start,
> + pgmap->range.end - pgmap->range.start + 1);
> }

2021-08-25 15:33:15

by Felix Kuehling

Subject: Re: [PATCH v1 05/14] drm/amdkfd: ref count init for device pages

On 2021-08-24 at 11:48 p.m., Alex Sierra wrote:
> Ref counter from device pages is init to zero during memmap init zone.
> The first time a new device page is allocated to migrate data into it,
> its ref counter needs to be initialized to one.
>
> Signed-off-by: Alex Sierra <[email protected]>
> ---
> drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> index dab290a4d19d..47ee9a895cd2 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> @@ -220,7 +220,7 @@ svm_migrate_get_vram_page(struct svm_range *prange, unsigned long pfn)
> page = pfn_to_page(pfn);
> svm_range_bo_ref(prange->svm_bo);
> page->zone_device_data = prange->svm_bo;
> - get_page(page);

There is an assumption here that the page refcount is 0 because the page
should be unused. I'd add a VM_BUG_ON_PAGE(page_ref_count(page), page)
here to check that assumption.
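
A sketch of svm_migrate_get_vram_page() with that assertion added, based on
the hunk quoted above (not the posted patch):

	page = pfn_to_page(pfn);
	/* A newly allocated device page must not have any references left. */
	VM_BUG_ON_PAGE(page_ref_count(page), page);
	svm_range_bo_ref(prange->svm_bo);
	page->zone_device_data = prange->svm_bo;
	init_page_count(page);
	lock_page(page);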

Regards,
  Felix


> + init_page_count(page);
> lock_page(page);
> }
>

2021-08-25 15:35:39

by Theodore Ts'o

Subject: Re: [PATCH v1 01/14] ext4/xfs: add page refcount helper

On Tue, Aug 24, 2021 at 10:48:15PM -0500, Alex Sierra wrote:
> From: Ralph Campbell <[email protected]>
>
> There are several places where ZONE_DEVICE struct pages assume a reference
> count == 1 means the page is idle and free. Instead of open coding this,
> add a helper function to hide this detail.
>
> Signed-off-by: Ralph Campbell <[email protected]>
> Signed-off-by: Alex Sierra <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>

Acked-by: Theodore Ts'o <[email protected]>

2021-08-25 16:04:21

by Darrick J. Wong

Subject: Re: [PATCH v1 01/14] ext4/xfs: add page refcount helper

On Tue, Aug 24, 2021 at 10:48:15PM -0500, Alex Sierra wrote:
> From: Ralph Campbell <[email protected]>
>
> There are several places where ZONE_DEVICE struct pages assume a reference
> count == 1 means the page is idle and free. Instead of open coding this,
> add a helper function to hide this detail.
>
> Signed-off-by: Ralph Campbell <[email protected]>
> Signed-off-by: Alex Sierra <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>

Looks fine to me,
Acked-by: Darrick J. Wong <[email protected]>

--D

> ---
> v3:
> [AS]: rename dax_layout_is_idle_page func to dax_page_unused
>
> v4:
> [AS]: This ref count functionality was missing on fuse/dax.c.
> ---
> fs/dax.c | 4 ++--
> fs/ext4/inode.c | 5 +----
> fs/fuse/dax.c | 4 +---
> fs/xfs/xfs_file.c | 4 +---
> include/linux/dax.h | 10 ++++++++++
> 5 files changed, 15 insertions(+), 12 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 62352cbcf0f4..c387d09e3e5a 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -369,7 +369,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
> for_each_mapped_pfn(entry, pfn) {
> struct page *page = pfn_to_page(pfn);
>
> - WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
> + WARN_ON_ONCE(trunc && !dax_page_unused(page));
> WARN_ON_ONCE(page->mapping && page->mapping != mapping);
> page->mapping = NULL;
> page->index = 0;
> @@ -383,7 +383,7 @@ static struct page *dax_busy_page(void *entry)
> for_each_mapped_pfn(entry, pfn) {
> struct page *page = pfn_to_page(pfn);
>
> - if (page_ref_count(page) > 1)
> + if (!dax_page_unused(page))
> return page;
> }
> return NULL;
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index fe6045a46599..05ffe6875cb1 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3971,10 +3971,7 @@ int ext4_break_layouts(struct inode *inode)
> if (!page)
> return 0;
>
> - error = ___wait_var_event(&page->_refcount,
> - atomic_read(&page->_refcount) == 1,
> - TASK_INTERRUPTIBLE, 0, 0,
> - ext4_wait_dax_page(ei));
> + error = dax_wait_page(ei, page, ext4_wait_dax_page);
> } while (error == 0);
>
> return error;
> diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
> index ff99ab2a3c43..2b1f190ba78a 100644
> --- a/fs/fuse/dax.c
> +++ b/fs/fuse/dax.c
> @@ -677,9 +677,7 @@ static int __fuse_dax_break_layouts(struct inode *inode, bool *retry,
> return 0;
>
> *retry = true;
> - return ___wait_var_event(&page->_refcount,
> - atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
> - 0, 0, fuse_wait_dax_page(inode));
> + return dax_wait_page(inode, page, fuse_wait_dax_page);
> }
>
> /* dmap_end == 0 leads to unmapping of whole file */
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 396ef36dcd0a..182057281086 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -840,9 +840,7 @@ xfs_break_dax_layouts(
> return 0;
>
> *retry = true;
> - return ___wait_var_event(&page->_refcount,
> - atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
> - 0, 0, xfs_wait_dax_page(inode));
> + return dax_wait_page(inode, page, xfs_wait_dax_page);
> }
>
> int
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index b52f084aa643..8b5da1d60dbc 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -243,6 +243,16 @@ static inline bool dax_mapping(struct address_space *mapping)
> return mapping->host && IS_DAX(mapping->host);
> }
>
> +static inline bool dax_page_unused(struct page *page)
> +{
> + return page_ref_count(page) == 1;
> +}
> +
> +#define dax_wait_page(_inode, _page, _wait_cb) \
> + ___wait_var_event(&(_page)->_refcount, \
> + dax_page_unused(_page), \
> + TASK_INTERRUPTIBLE, 0, 0, _wait_cb(_inode))
> +
> #ifdef CONFIG_DEV_DAX_HMEM_DEVICES
> void hmem_register_device(int target_nid, struct resource *r);
> #else
> --
> 2.32.0
>

2021-08-25 17:49:43

by Ralph Campbell

Subject: Re: [PATCH v1 02/14] mm: remove extra ZONE_DEVICE struct page refcount


On 8/25/21 4:15 AM, Vlastimil Babka wrote:
> On 8/25/21 05:48, Alex Sierra wrote:
>> From: Ralph Campbell <[email protected]>
>>
>> ZONE_DEVICE struct pages have an extra reference count that complicates the
>> code for put_page() and several places in the kernel that need to check the
>> reference count to see that a page is not being used (gup, compaction,
>> migration, etc.). Clean up the code so the reference count doesn't need to
>> be treated specially for ZONE_DEVICE.
> That's certainly welcome. I just wonder what was the reason to use 1 in the
> first place and why it's no longer necessary?

I'm sure this is a long story, and I don't know most of the history.
I'm guessing that ZONE_DEVICE started out with a reference count of
one since that is what most "normal" struct page pages start with.
Then put_page() is used to free newly initialized struct pages to the
slab/slob/slub page allocator.
This made it easy to call get_page()/put_page() on ZONE_DEVICE pages
since get_page() asserts that the caller has a reference.
However, most drivers that create ZONE_DEVICE struct pages just insert
a PTE into the user page tables and don't increment/decrement the
reference count. MEMORY_DEVICE_FS_DAX used the >1 to 1 reference count
transition to signal that a page was idle so that made put_page() a
bit more complex. Then MEMORY_DEVICE_PRIVATE pages were added and this
special casing of what "idle" meant got more complicated and more parts
of mm had to check for is_device_private_page().
My goal was to make ZONE_DEVICE struct pages reference counts be zero
based and allocated/freed similar to the page allocator so that more
of the mm code could be used, like THP ZONE_DEVICE pages, without special
casing ZONE_DEVICE.

Subject: Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration


On 8/25/2021 2:46 AM, Christoph Hellwig wrote:
> On Tue, Aug 24, 2021 at 10:48:17PM -0500, Alex Sierra wrote:
>> } else {
>> - if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
>> + if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM) &&
>> + !(migrate->flags & MIGRATE_VMA_SELECT_IOMEM))
>> goto next;
>> pfn = pte_pfn(pte);
>> if (is_zero_pfn(pfn)) {
> .. also how is this going to work for the device public memory? That
> should be pte_special() an thus fail vm_normal_page.
Perhaps we're missing something, as we're not doing any special marking
for the device public pfn/entries.
pfn_valid returns true, pte_special returns false and pfn_t_devmap returns
false on these pages, same as for system pages.
That's the reason vm_normal_page returns the page correctly through the
pfn_to_page helper.

Regards,
Alex S.

2021-08-26 22:29:01

by Felix Kuehling

Subject: Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

On 2021-08-25 at 2:24 p.m., Sierra Guiza, Alejandro (Alex) wrote:
>
> On 8/25/2021 2:46 AM, Christoph Hellwig wrote:
>> On Tue, Aug 24, 2021 at 10:48:17PM -0500, Alex Sierra wrote:
>>>           } else {
>>> -            if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
>>> +            if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM) &&
>>> +                !(migrate->flags & MIGRATE_VMA_SELECT_IOMEM))
>>>                   goto next;
>>>               pfn = pte_pfn(pte);
>>>               if (is_zero_pfn(pfn)) {
>> .. also how is this going to work for the device public memory?  That
>> should be pte_special() an thus fail vm_normal_page.
> Perhaps we're missing something, as we're not doing any special
> marking for the device public pfn/entries.
> pfn_valid return true, pte_special return false and pfn_t_devmap
> return false on these pages. Same as system pages.
> That's the reason vm_normal_page returns the page correctly through
> pfn_to_page helper.

Hi Christoph,

I think we're missing something here. As far as I can tell, all the work
we did first with DEVICE_GENERIC and now DEVICE_PUBLIC always used
normal pages. Are we missing something in our driver code that would
make these PTEs special? I don't understand how that can be, because
driver code is not really involved in updating the CPU mappings. Maybe
it's something we need to do in the migration helpers.

Thanks,
  Felix


>
> Regards,
> Alex S.

2021-08-27 11:27:23

by Vlastimil Babka

Subject: Re: [PATCH v1 02/14] mm: remove extra ZONE_DEVICE struct page refcount

On 8/25/21 19:49, Ralph Campbell wrote:
>
> On 8/25/21 4:15 AM, Vlastimil Babka wrote:
>> On 8/25/21 05:48, Alex Sierra wrote:
>>> From: Ralph Campbell <[email protected]>
>>>
>>> ZONE_DEVICE struct pages have an extra reference count that complicates the
>>> code for put_page() and several places in the kernel that need to check the
>>> reference count to see that a page is not being used (gup, compaction,
>>> migration, etc.). Clean up the code so the reference count doesn't need to
>>> be treated specially for ZONE_DEVICE.
>> That's certainly welcome. I just wonder what was the reason to use 1 in the
>> first place and why it's no longer necessary?
>
> I'm sure this is a long story that I don't know most of the history.
> I'm guessing that ZONE_DEVICE started out with a reference count of
> one since that is what most "normal" struct page pages start with.
> Then put_page() is used to free newly initialized struct pages to the
> slab/slob/slub page allocator.
> This made it easy to call get_page()/put_page() on ZONE_DEVICE pages
> since get_page() asserts that the caller has a reference.
> However, most drivers that create ZONE_DEVICE struct pages just insert
> a PTE into the user page tables and don't increment/decrement the
> reference count. MEMORY_DEVICE_FS_DAX used the >1 to 1 reference count
> transition to signal that a page was idle so that made put_page() a
> bit more complex. Then MEMORY_DEVICE_PRIVATE pages were added and this
> special casing of what "idle" meant got more complicated and more parts
> of mm had to check for is_device_private_page().
> My goal was to make ZONE_DEVICE struct pages reference counts be zero
> based and allocated/freed similar to the page allocator so that more
> of the mm code could be used, like THP ZONE_DEVICE pages, without special
> casing ZONE_DEVICE.

Thanks for the explanation. I was afraid there was something more fundamental
that required catching the 2->1 refcount transition; it seems that's not the
case. I agree with the simplification!

2021-08-30 08:28:38

by Christoph Hellwig

Subject: Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

On Thu, Aug 26, 2021 at 06:27:31PM -0400, Felix Kuehling wrote:
> I think we're missing something here. As far as I can tell, all the work
> we did first with DEVICE_GENERIC and now DEVICE_PUBLIC always used
> normal pages. Are we missing something in our driver code that would
> make these PTEs special? I don't understand how that can be, because
> driver code is not really involved in updating the CPU mappings. Maybe
> it's something we need to do in the migration helpers.

It looks like I'm totally misunderstanding what you are adding here
then. Why do we need any special treatment at all for memory that
has normal struct pages and is part of the direct kernel map?

2021-08-30 17:05:30

by Felix Kuehling

Subject: Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

On 2021-08-30 at 4:28 a.m., Christoph Hellwig wrote:
> On Thu, Aug 26, 2021 at 06:27:31PM -0400, Felix Kuehling wrote:
>> I think we're missing something here. As far as I can tell, all the work
>> we did first with DEVICE_GENERIC and now DEVICE_PUBLIC always used
>> normal pages. Are we missing something in our driver code that would
>> make these PTEs special? I don't understand how that can be, because
>> driver code is not really involved in updating the CPU mappings. Maybe
>> it's something we need to do in the migration helpers.
> It looks like I'm totally misunderstanding what you are adding here
> then. Why do we need any special treatment at all for memory that
> has normal struct pages and is part of the direct kernel map?

The pages are like normal memory for purposes of mapping them in CPU
page tables and for coherent access from the CPU. From an application
perspective, we want file-backed and anonymous mappings to be able to
use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
optimize performance for GPU heavy workloads while minimizing the need
to migrate data back-and-forth between system memory and device memory.

The pages are special in two ways:

1. The memory is managed not by the Linux buddy allocator, but by the
GPU driver's TTM memory manager
2. We want to migrate data in response to GPU page faults and
application hints using the migrate_vma helpers

It's the second part that we're really trying to address with this patch
series.
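
As a minimal sketch of the migrate_vma flow referred to in point 2 (variable
names assumed, error handling trimmed):

	unsigned long src_pfns[NPAGES], dst_pfns[NPAGES];	/* NPAGES: pages in [start, end) */
	struct migrate_vma args = {
		.vma		= vma,
		.start		= start,
		.end		= end,
		.src		= src_pfns,
		.dst		= dst_pfns,
		.pgmap_owner	= owner,	/* matches pgmap->owner set at registration */
		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
	};
	int ret;

	ret = migrate_vma_setup(&args);
	if (ret)
		return ret;
	/* The driver allocates destination device pages, copies the data and
	 * fills args.dst before finalizing the migration.
	 */
	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);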

Regards,
  Felix


2021-09-01 08:31:24

by Christoph Hellwig

Subject: Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

On Mon, Aug 30, 2021 at 01:04:43PM -0400, Felix Kuehling wrote:
> >> driver code is not really involved in updating the CPU mappings. Maybe
> >> it's something we need to do in the migration helpers.
> > It looks like I'm totally misunderstanding what you are adding here
> > then. Why do we need any special treatment at all for memory that
> > has normal struct pages and is part of the direct kernel map?
>
> The pages are like normal memory for purposes of mapping them in CPU
> page tables and for coherent access from the CPU.

That's the user page tables. What about the kernel direct map?
If there is a normal kernel struct page backing there really should
be no need for the pgmap.

> From an application
> perspective, we want file-backed and anonymous mappings to be able to
> use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
> optimize performance for GPU heavy workloads while minimizing the need
> to migrate data back-and-forth between system memory and device memory.

I don't really understand that part. File-backed pages are always
allocated by the file system using the pagecache helpers, that is,
using the page allocator. Anonymous memory also always comes from
the page allocator.

> The pages are special in two ways:
>
> 1. The memory is managed not by the Linux buddy allocator, but by the
> GPU driver's TTM memory manager

Why?

> 2. We want to migrate data in response to GPU page faults and
> application hints using the migrate_vma helpers

Why?

2021-09-01 15:42:24

by Felix Kuehling

Subject: Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration


On 2021-09-01 at 4:29 a.m., Christoph Hellwig wrote:
> On Mon, Aug 30, 2021 at 01:04:43PM -0400, Felix Kuehling wrote:
>>>> driver code is not really involved in updating the CPU mappings. Maybe
>>>> it's something we need to do in the migration helpers.
>>> It looks like I'm totally misunderstanding what you are adding here
>>> then. Why do we need any special treatment at all for memory that
>>> has normal struct pages and is part of the direct kernel map?
>> The pages are like normal memory for purposes of mapping them in CPU
>> page tables and for coherent access from the CPU.
> That's the user page tables. What about the kernel direct map?
> If there is a normal kernel struct page backing there really should
> be no need for the pgmap.

I'm not sure. The physical address ranges are in the UEFI system address
map as special-purpose memory. Does Linux create the struct pages and
kernel direct map for that without a pgmap call? I didn't see that last
time I went digging through that code.


>
>> From an application
>> perspective, we want file-backed and anonymous mappings to be able to
>> use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
>> optimize performance for GPU heavy workloads while minimizing the need
>> to migrate data back-and-forth between system memory and device memory.
> I don't really understand that part. file backed pages are always
> allocated by the file system using the pagecache helpers, that is
> using the page allocator. Anonymouns memory also always comes from
> the page allocator.

I'm coming at this from my experience with DEVICE_PRIVATE. Both
anonymous and file-backed pages should be migrateable to DEVICE_PRIVATE
memory by the migrate_vma_* helpers for more efficient access by our
GPU. (*) It's part of the basic premise of HMM as I understand it. I
would expect the same thing to work for DEVICE_PUBLIC memory.

(*) I believe migrating file-backed pages to DEVICE_PRIVATE doesn't
currently work, but that's something I'm hoping to fix at some point.


>
>> The pages are special in two ways:
>>
>> 1. The memory is managed not by the Linux buddy allocator, but by the
>> GPU driver's TTM memory manager
> Why?

It's a system architecture decision: based on the access latency to the
memory and the expected use cases, we do not want the GPU driver and
the Linux buddy allocator and VM subsystem competing for the same device
memory.


>
>> 2. We want to migrate data in response to GPU page faults and
>> application hints using the migrate_vma helpers
> Why?

Device memory has much higher bandwidth and much lower latency than
regular system memory for the GPU to access. It's essential for enabling
good GPU application performance. Page-based memory migration enables
good performance with more intuitive programming models such as
managed/unified memory in HIP or unified shared memory in OpenMP. We do
this on our discrete GPUs with DEVICE_PRIVATE memory.

I see DEVICE_PUBLIC as an improved version of DEVICE_PRIVATE that allows
the CPU to map the device memory coherently to minimize the need for
migrations when CPU and GPU access the same memory concurrently or
alternatingly. But we're not going as far as putting that memory
entirely under the management of the Linux memory manager and VM
subsystem. Our (and HPE's) system architects decided that this memory is
not suitable to be used like regular NUMA system memory by the Linux
memory manager.

Regards,
  Felix


2021-09-01 22:46:37

by Dave Chinner

Subject: Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote:
>
> Am 2021-09-01 um 4:29 a.m. schrieb Christoph Hellwig:
> > On Mon, Aug 30, 2021 at 01:04:43PM -0400, Felix Kuehling wrote:
> >>>> driver code is not really involved in updating the CPU mappings. Maybe
> >>>> it's something we need to do in the migration helpers.
> >>> It looks like I'm totally misunderstanding what you are adding here
> >>> then. Why do we need any special treatment at all for memory that
> >>> has normal struct pages and is part of the direct kernel map?
> >> The pages are like normal memory for purposes of mapping them in CPU
> >> page tables and for coherent access from the CPU.
> > That's the user page tables. What about the kernel direct map?
> > If there is a normal kernel struct page backing there really should
> > be no need for the pgmap.
>
> I'm not sure. The physical address ranges are in the UEFI system address
> map as special-purpose memory. Does Linux create the struct pages and
> kernel direct map for that without a pgmap call? I didn't see that last
> time I went digging through that code.
>
>
> >
> >> From an application
> >> perspective, we want file-backed and anonymous mappings to be able to
> >> use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
> >> optimize performance for GPU heavy workloads while minimizing the need
> >> to migrate data back-and-forth between system memory and device memory.
> > I don't really understand that part. file backed pages are always
> > allocated by the file system using the pagecache helpers, that is
> > using the page allocator. Anonymouns memory also always comes from
> > the page allocator.
>
> I'm coming at this from my experience with DEVICE_PRIVATE. Both
> anonymous and file-backed pages should be migrateable to DEVICE_PRIVATE
> memory by the migrate_vma_* helpers for more efficient access by our
> GPU. (*) It's part of the basic premise of HMM as I understand it. I
> would expect the same thing to work for DEVICE_PUBLIC memory.
>
> (*) I believe migrating file-backed pages to DEVICE_PRIVATE doesn't
> currently work, but that's something I'm hoping to fix at some point.

FWIW, I'd love to see the architecture documents that define how
filesystems are supposed to interact with this device private
memory. This whole "hand filesystem controlled memory to other
devices" is a minefield that is trivial to get wrong and very
difficult to fix - just look at the historical mess that RDMA
to/from file backed and/or DAX pages has been.

So, really, from my perspective as a filesystem engineer, I want to
see an actual specification for how this new memory type is going to
interact with filesystem and the page cache so everyone has some
idea of how this is going to work and can point out how it doesn't
work before code that simply doesn't work is pushed out into
production systems and then merged....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2021-09-01 23:10:30

by Felix Kuehling

Subject: Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

On 2021-09-01 6:03 p.m., Dave Chinner wrote:
> On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote:
>> Am 2021-09-01 um 4:29 a.m. schrieb Christoph Hellwig:
>>> On Mon, Aug 30, 2021 at 01:04:43PM -0400, Felix Kuehling wrote:
>>>>>> driver code is not really involved in updating the CPU mappings. Maybe
>>>>>> it's something we need to do in the migration helpers.
>>>>> It looks like I'm totally misunderstanding what you are adding here
>>>>> then. Why do we need any special treatment at all for memory that
>>>>> has normal struct pages and is part of the direct kernel map?
>>>> The pages are like normal memory for purposes of mapping them in CPU
>>>> page tables and for coherent access from the CPU.
>>> That's the user page tables. What about the kernel direct map?
>>> If there is a normal kernel struct page backing there really should
>>> be no need for the pgmap.
>> I'm not sure. The physical address ranges are in the UEFI system address
>> map as special-purpose memory. Does Linux create the struct pages and
>> kernel direct map for that without a pgmap call? I didn't see that last
>> time I went digging through that code.
>>
>>
>>>> From an application
>>>> perspective, we want file-backed and anonymous mappings to be able to
>>>> use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
>>>> optimize performance for GPU heavy workloads while minimizing the need
>>>> to migrate data back-and-forth between system memory and device memory.
>>> I don't really understand that part. file backed pages are always
>>> allocated by the file system using the pagecache helpers, that is
>>> using the page allocator. Anonymouns memory also always comes from
>>> the page allocator.
>> I'm coming at this from my experience with DEVICE_PRIVATE. Both
>> anonymous and file-backed pages should be migrateable to DEVICE_PRIVATE
>> memory by the migrate_vma_* helpers for more efficient access by our
>> GPU. (*) It's part of the basic premise of HMM as I understand it. I
>> would expect the same thing to work for DEVICE_PUBLIC memory.
>>
>> (*) I believe migrating file-backed pages to DEVICE_PRIVATE doesn't
>> currently work, but that's something I'm hoping to fix at some point.
> FWIW, I'd love to see the architecture documents that define how
> filesystems are supposed to interact with this device private
> memory. This whole "hand filesystem controlled memory to other
> devices" is a minefield that is trivial to get wrong iand very
> difficult to fix - just look at the historical mess that RDMA
> to/from file backed and/or DAX pages has been.
>
> So, really, from my perspective as a filesystem engineer, I want to
> see an actual specification for how this new memory type is going to
> interact with filesystem and the page cache so everyone has some
> idea of how this is going to work and can point out how it doesn't
> work before code that simply doesn't work is pushed out into
> production systems and then merged....

OK. To be clear, that's not part of this patch series. And I have no
authority to push anything in this part of the kernel, so you have
nothing to fear. ;)

FWIW, we already have the ability to map file-backed system memory pages
into device page tables with HMM and interval notifiers. But we cannot
currently migrate them to ZONE_DEVICE pages. Beyond that, my
understanding of how filesystems and page cache work is rather
superficial at this point. I'll keep your name in mind for when I am
ready to discuss this in more detail.

Cheers,
  Felix


>
> Cheers,
>
> Dave.

2021-09-02 01:16:57

by Dave Chinner

Subject: Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

On Wed, Sep 01, 2021 at 07:07:34PM -0400, Felix Kuehling wrote:
> On 2021-09-01 6:03 p.m., Dave Chinner wrote:
> > On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote:
> > > Am 2021-09-01 um 4:29 a.m. schrieb Christoph Hellwig:
> > > > On Mon, Aug 30, 2021 at 01:04:43PM -0400, Felix Kuehling wrote:
> > > > > > > driver code is not really involved in updating the CPU mappings. Maybe
> > > > > > > it's something we need to do in the migration helpers.
> > > > > > It looks like I'm totally misunderstanding what you are adding here
> > > > > > then. Why do we need any special treatment at all for memory that
> > > > > > has normal struct pages and is part of the direct kernel map?
> > > > > The pages are like normal memory for purposes of mapping them in CPU
> > > > > page tables and for coherent access from the CPU.
> > > > That's the user page tables. What about the kernel direct map?
> > > > If there is a normal kernel struct page backing there really should
> > > > be no need for the pgmap.
> > > I'm not sure. The physical address ranges are in the UEFI system address
> > > map as special-purpose memory. Does Linux create the struct pages and
> > > kernel direct map for that without a pgmap call? I didn't see that last
> > > time I went digging through that code.
> > >
> > >
> > > > > From an application
> > > > > perspective, we want file-backed and anonymous mappings to be able to
> > > > > use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
> > > > > optimize performance for GPU heavy workloads while minimizing the need
> > > > > to migrate data back-and-forth between system memory and device memory.
> > > > I don't really understand that part. file backed pages are always
> > > > allocated by the file system using the pagecache helpers, that is
> > > > using the page allocator. Anonymouns memory also always comes from
> > > > the page allocator.
> > > I'm coming at this from my experience with DEVICE_PRIVATE. Both
> > > anonymous and file-backed pages should be migrateable to DEVICE_PRIVATE
> > > memory by the migrate_vma_* helpers for more efficient access by our
> > > GPU. (*) It's part of the basic premise of HMM as I understand it. I
> > > would expect the same thing to work for DEVICE_PUBLIC memory.
> > >
> > > (*) I believe migrating file-backed pages to DEVICE_PRIVATE doesn't
> > > currently work, but that's something I'm hoping to fix at some point.
> > FWIW, I'd love to see the architecture documents that define how
> > filesystems are supposed to interact with this device private
> > memory. This whole "hand filesystem controlled memory to other
> > devices" is a minefield that is trivial to get wrong iand very
> > difficult to fix - just look at the historical mess that RDMA
> > to/from file backed and/or DAX pages has been.
> >
> > So, really, from my perspective as a filesystem engineer, I want to
> > see an actual specification for how this new memory type is going to
> > interact with filesystem and the page cache so everyone has some
> > idea of how this is going to work and can point out how it doesn't
> > work before code that simply doesn't work is pushed out into
> > production systems and then merged....
>
> OK. To be clear, that's not part of this patch series. And I have no
> authority to push anything in this part of the kernel, so you have nothing
> to fear. ;)

I know this isn't part of the series, but this patchset is laying
the foundation for future work that will impact subsystems that
currently have zero visibility and/or knowledge of these changes.
There must be an overall architectural plan for this functionality,
regardless of the current state of implementation. It's that overall
architectural plan I'm asking about here, because I need to
understand that before I can sanely comment on the page
cache/filesystem aspect of the proposed functionality...

> FWIW, we already have the ability to map file-backed system memory pages
> into device page tables with HMM and interval notifiers. But we cannot
> currently migrate them to ZONE_DEVICE pages.

Sure, but sharing page cache pages allocated and managed by the
filesystem is not what you are talking about. You're talking about
migrating page cache data to completely different memory allocated
by a different memory manager that the filesystems currently have no
knowledge of or have any way of interfacing with.

So I'm asking basic, fundamental questions about how these special
device based pages are going to work. How are these pages different
to normal pages - does page_lock() still guarantee exclusive access
to the page state across all hardware that can access it? If not,
what provides access serialisation for pages that are allocated in
device memory rather than CPU memory (e.g. for truncate
serialisation)? Does the hardware that owns these pages raise page
faults on the CPU when those pages are accessed/dirtied? How does
demand paging in and out of device memory work (i.e. mapping files
larger than device memory). How does IO to/from storage work - can
the filesystem build normal bios out of these device pages and issue
IO on them? Are there additional constraints on IO because p2p DMA is
needed to move the data from the storage HBA directly into/out of
the GPU memory?

I can think of lots more complex questions about how filesystems are
supposed to manage remote device memory in the page cache, but these
are just some of the basic things that make file-backed mappings
different to anonymous mappings that I need to understand before I
can make head or tail of what is being proposed here.....

> Beyond that, my understanding
> of how filesystems and page cache work is rather superficial at this point.
> I'll keep your name in mind for when I am ready to discuss this in more
> detail.

If you don't know what the bigger picture is, then who does?
Somebody built the design/architecture you are working towards, and
they had to communicate it to you somehow. I'm asking for that
information to be documented and made available to all the people these
changes might impact, not whether you personally know how it
works....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2021-09-02 08:19:55

by Christoph Hellwig

Subject: Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote:
> >>> It looks like I'm totally misunderstanding what you are adding here
> >>> then. Why do we need any special treatment at all for memory that
> >>> has normal struct pages and is part of the direct kernel map?
> >> The pages are like normal memory for purposes of mapping them in CPU
> >> page tables and for coherent access from the CPU.
> > That's the user page tables. What about the kernel direct map?
> > If there is a normal kernel struct page backing there really should
> > be no need for the pgmap.
>
> I'm not sure. The physical address ranges are in the UEFI system address
> map as special-purpose memory. Does Linux create the struct pages and
> kernel direct map for that without a pgmap call? I didn't see that last
> time I went digging through that code.

So doing some googling finds a patch from Dan that claims to hand EFI
special purpose memory to the device dax driver. But when I try to
follow the version that got merged, it looks like it is treated simply as an
MMIO region to be claimed by drivers, which would not get a struct page.

Dan, did I misunderstand how E820_TYPE_SOFT_RESERVED works?

> >> From an application
> >> perspective, we want file-backed and anonymous mappings to be able to
> >> use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
> >> optimize performance for GPU heavy workloads while minimizing the need
> >> to migrate data back-and-forth between system memory and device memory.
> > I don't really understand that part. file backed pages are always
> > allocated by the file system using the pagecache helpers, that is
> > using the page allocator. Anonymouns memory also always comes from
> > the page allocator.
>
> I'm coming at this from my experience with DEVICE_PRIVATE. Both
> anonymous and file-backed pages should be migrateable to DEVICE_PRIVATE
> memory by the migrate_vma_* helpers for more efficient access by our
> GPU. (*) It's part of the basic premise of HMM as I understand it. I
> would expect the same thing to work for DEVICE_PUBLIC memory.

Ok, so you want to migrate to and from them. Not use DEVICE_PUBLIC
for the actual page cache pages. That makes a lot more sense.

> I see DEVICE_PUBLIC as an improved version of DEVICE_PRIVATE that allows
> the CPU to map the device memory coherently to minimize the need for
> migrations when CPU and GPU access the same memory concurrently or
> alternatingly. But we're not going as far as putting that memory
> entirely under the management of the Linux memory manager and VM
> subsystem. Our (and HPE's) system architects decided that this memory is
> not suitable to be used like regular NUMA system memory by the Linux
> memory manager.

So yes. It is a Memory Mapped I/O region, which unlike the PCIe BARs
that people typically deal with is fully cache coherent. I think this
does make more sense as a description.

But to go back to what started this discussion: if these are memory
mapped I/O, pfn_valid should generally not return true for them.

And as you already pointed out in reply to Alex we need to tighten the
selection criteria one way or another.

2021-09-02 18:11:59

by Dan Williams

Subject: Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

On Thu, Sep 2, 2021 at 1:18 AM Christoph Hellwig <[email protected]> wrote:
>
> On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote:
> > >>> It looks like I'm totally misunderstanding what you are adding here
> > >>> then. Why do we need any special treatment at all for memory that
> > >>> has normal struct pages and is part of the direct kernel map?
> > >> The pages are like normal memory for purposes of mapping them in CPU
> > >> page tables and for coherent access from the CPU.
> > > That's the user page tables. What about the kernel direct map?
> > > If there is a normal kernel struct page backing there really should
> > > be no need for the pgmap.
> >
> > I'm not sure. The physical address ranges are in the UEFI system address
> > map as special-purpose memory. Does Linux create the struct pages and
> > kernel direct map for that without a pgmap call? I didn't see that last
> > time I went digging through that code.
>
> So doing some googling finds a patch from Dan that claims to hand EFI
> special purpose memory to the device dax driver. But when I try to
> follow the version that got merged it looks it is treated simply as an
> MMIO region to be claimed by drivers, which would not get a struct page.
>
> Dan, did I misunderstand how E820_TYPE_SOFT_RESERVED works?

The original implementation of "soft reserve" support depended on the
combination of the EFI special purpose memory type and the ACPI HMAT
to define the device ranges. The requirement for ACPI HMAT was relaxed
later with commit:

5ccac54f3e12 ACPI: HMAT: attach a device for each soft-reserved range

The expectation is that system software policy can then either use the
device interface, assign a portion of the reservation back to the page
allocator, or ignore the reservation altogether. Is this discussion
asking for a way to assign this memory to the GPU driver to manage?
device-dax already knows how to hand off to the page-allocator, seems
reasonable for it to be able to hand-off to another driver.

2021-09-09 04:14:03

by Felix Kuehling

Subject: Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

On 2021-09-02 at 4:18 a.m., Christoph Hellwig wrote:
> On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote:
>>>>> It looks like I'm totally misunderstanding what you are adding here
>>>>> then. Why do we need any special treatment at all for memory that
>>>>> has normal struct pages and is part of the direct kernel map?
>>>> The pages are like normal memory for purposes of mapping them in CPU
>>>> page tables and for coherent access from the CPU.
>>> That's the user page tables. What about the kernel direct map?
>>> If there is a normal kernel struct page backing there really should
>>> be no need for the pgmap.
>> I'm not sure. The physical address ranges are in the UEFI system address
>> map as special-purpose memory. Does Linux create the struct pages and
>> kernel direct map for that without a pgmap call? I didn't see that last
>> time I went digging through that code.
> So doing some googling finds a patch from Dan that claims to hand EFI
> special purpose memory to the device dax driver. But when I try to
> follow the version that got merged it looks it is treated simply as an
> MMIO region to be claimed by drivers, which would not get a struct page.
>
> Dan, did I misunderstand how E820_TYPE_SOFT_RESERVED works?
>
>>>> From an application
>>>> perspective, we want file-backed and anonymous mappings to be able to
>>>> use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
>>>> optimize performance for GPU heavy workloads while minimizing the need
>>>> to migrate data back-and-forth between system memory and device memory.
>>> I don't really understand that part. file backed pages are always
>>> allocated by the file system using the pagecache helpers, that is
>>> using the page allocator. Anonymouns memory also always comes from
>>> the page allocator.
>> I'm coming at this from my experience with DEVICE_PRIVATE. Both
>> anonymous and file-backed pages should be migrateable to DEVICE_PRIVATE
>> memory by the migrate_vma_* helpers for more efficient access by our
>> GPU. (*) It's part of the basic premise of HMM as I understand it. I
>> would expect the same thing to work for DEVICE_PUBLIC memory.
> Ok, so you want to migrate to and from them. Not use DEVICE_PUBLIC
> for the actual page cache pages. That maks a lot more sense.
>
>> I see DEVICE_PUBLIC as an improved version of DEVICE_PRIVATE that allows
>> the CPU to map the device memory coherently to minimize the need for
>> migrations when CPU and GPU access the same memory concurrently or
>> alternatingly. But we're not going as far as putting that memory
>> entirely under the management of the Linux memory manager and VM
>> subsystem. Our (and HPE's) system architects decided that this memory is
>> not suitable to be used like regular NUMA system memory by the Linux
>> memory manager.
> So yes. It is a Memory Mapped I/O region, which unlike the PCIe BARs
> that people typically deal with is fully cache coherent. I think this
> does make more sense as a description.
>
> But to go back to what started this discussion: If these are memory
> mapped I/O pfn_valid should generally not return true for them.

As I understand it, pfn_valid should be true for any pfn that's part of
the kernel's physical memory map, i.e. is returned by page_to_pfn or
works with pfn_to_page. Both the hmm_range_fault and the migrate_vma_*
APIs use pfns to refer to regular system memory and ZONE_DEVICE pages
(even DEVICE_PRIVATE). Therefore I believe pfn_valid should be true for
ZONE_DEVICE pages as well.
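
In other words, the invariant I'm relying on is roughly this (a
simplified sketch; the helper name is made up):

#include <linux/mm.h>

/*
 * A pfn handed around by the hmm_range_fault and migrate_vma APIs is
 * expected to have a struct page behind it, whether it is ordinary
 * system RAM or a ZONE_DEVICE page.
 */
static bool pfn_is_zone_device(unsigned long pfn)
{
        /* pfn_valid() is expected to be true for ZONE_DEVICE pfns too */
        return pfn_valid(pfn) && is_zone_device_page(pfn_to_page(pfn));
}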

Regards,
  Felix


>
> And as you already pointed out in reply to Alex we need to tighten the
> selection criteria one way or another.

Date: 2021-09-09 05:01:14

From: Felix Kuehling

Subject: Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

Am 2021-09-01 um 9:14 p.m. schrieb Dave Chinner:
> On Wed, Sep 01, 2021 at 07:07:34PM -0400, Felix Kuehling wrote:
>> On 2021-09-01 6:03 p.m., Dave Chinner wrote:
>>> On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote:
>>>> Am 2021-09-01 um 4:29 a.m. schrieb Christoph Hellwig:
>>>>> On Mon, Aug 30, 2021 at 01:04:43PM -0400, Felix Kuehling wrote:
>>>>>>>> driver code is not really involved in updating the CPU mappings. Maybe
>>>>>>>> it's something we need to do in the migration helpers.
>>>>>>> It looks like I'm totally misunderstanding what you are adding here
>>>>>>> then. Why do we need any special treatment at all for memory that
>>>>>>> has normal struct pages and is part of the direct kernel map?
>>>>>> The pages are like normal memory for purposes of mapping them in CPU
>>>>>> page tables and for coherent access from the CPU.
>>>>> That's the user page tables. What about the kernel direct map?
>>>>> If there is a normal kernel struct page backing there really should
>>>>> be no need for the pgmap.
>>>> I'm not sure. The physical address ranges are in the UEFI system address
>>>> map as special-purpose memory. Does Linux create the struct pages and
>>>> kernel direct map for that without a pgmap call? I didn't see that last
>>>> time I went digging through that code.
>>>>
>>>>
>>>>>> From an application
>>>>>> perspective, we want file-backed and anonymous mappings to be able to
>>>>>> use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
>>>>>> optimize performance for GPU heavy workloads while minimizing the need
>>>>>> to migrate data back-and-forth between system memory and device memory.
>>>>> I don't really understand that part. File-backed pages are always
>>>>> allocated by the file system using the pagecache helpers, that is,
>>>>> using the page allocator. Anonymous memory also always comes from
>>>>> the page allocator.
>>>> I'm coming at this from my experience with DEVICE_PRIVATE. Both
>>>> anonymous and file-backed pages should be migrateable to DEVICE_PRIVATE
>>>> memory by the migrate_vma_* helpers for more efficient access by our
>>>> GPU. (*) It's part of the basic premise of HMM as I understand it. I
>>>> would expect the same thing to work for DEVICE_PUBLIC memory.
>>>>
>>>> (*) I believe migrating file-backed pages to DEVICE_PRIVATE doesn't
>>>> currently work, but that's something I'm hoping to fix at some point.
>>> FWIW, I'd love to see the architecture documents that define how
>>> filesystems are supposed to interact with this device private
>>> memory. This whole "hand filesystem controlled memory to other
>>> devices" is a minefield that is trivial to get wrong and very
>>> difficult to fix - just look at the historical mess that RDMA
>>> to/from file backed and/or DAX pages has been.
>>>
>>> So, really, from my perspective as a filesystem engineer, I want to
>>> see an actual specification for how this new memory type is going to
>>> interact with filesystems and the page cache so everyone has some
>>> idea of how this is going to work and can point out how it doesn't
>>> work before code that simply doesn't work is pushed out into
>>> production systems and then merged....
>> OK. To be clear, that's not part of this patch series. And I have no
>> authority to push anything in this part of the kernel, so you have nothing
>> to fear. ;)
> I know this isn't part of the series, but this patchset is laying
> the foundation for future work that will impact subsystems that
> currently have zero visibility and/or knowledge of these changes.

I don't think this patchset is the foundation. The foundation is Jerome
Glisse's HMM work, which was merged 4 years ago and is used by multiple
drivers now, with the AMD GPU driver being a fairly recent addition.


> There must be an overall architectural plan for this functionality,
> regardless of the current state of implementation. It's that overall
> architectural plan I'm asking about here, because I need to
> understand that before I can sanely comment on the page
> cache/filesystem aspect of the proposed functionality...

The overall HMM and ZONE_DEVICE architecture is documented to some
extent in Documentation/vm/hmm.rst, though it may not go into the level
of detail you are looking for.


>
>> FWIW, we already have the ability to map file-backed system memory pages
>> into device page tables with HMM and interval notifiers. But we cannot
>> currently migrate them to ZONE_DEVICE pages.
> Sure, but sharing page cache pages allocated and managed by the
> filesystem is not what you are talking about. You're talking about
> migrating page cache data to completely different memory allocated
> by a different memory manager that the filesystems currently have no
> knowledge of or any way of interfacing with.

This is not part of the current patch series. It is only my intention to
look into ways to migrate file-backed pages to ZONE_DEVICE memory in the
future.


>
> So I'm asking basic, fundamental questions about how these special
> device based pages are going to work. How are these pages different
> to normal pages - does page_lock() still guarantee exclusive access
> to the page state across all hardware that can access it?

Yes. The migration API guarantees that pages are locked during the
migration. The driver code doesn't touch the page state itself. It only
uses the migrate_vma_* helpers to deal with that.

This is not new or changed by this patch series.
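
For reference, the pattern the driver follows is roughly this (a heavily
simplified sketch; device page allocation and the actual copy are
omitted, and the function is illustrative, not the amdgpu code):

#include <linux/migrate.h>
#include <linux/mm.h>

/*
 * Sketch of the migrate_vma_* pattern. The source pages are isolated
 * and locked by migrate_vma_setup(); the driver only fills in the
 * destination pfns and performs the copy in between.
 */
static int migrate_range_to_device(struct vm_area_struct *vma,
                                   unsigned long start, unsigned long end,
                                   void *pgmap_owner)
{
        unsigned long npages = (end - start) >> PAGE_SHIFT;
        struct migrate_vma migrate = {
                .vma            = vma,
                .start          = start,
                .end            = end,
                .pgmap_owner    = pgmap_owner,
                .flags          = MIGRATE_VMA_SELECT_SYSTEM,
        };
        int ret = -ENOMEM;

        migrate.src = kvcalloc(npages, sizeof(*migrate.src), GFP_KERNEL);
        migrate.dst = kvcalloc(npages, sizeof(*migrate.dst), GFP_KERNEL);
        if (!migrate.src || !migrate.dst)
                goto out;

        ret = migrate_vma_setup(&migrate);      /* collects and locks pages */
        if (ret)
                goto out;

        /* ... allocate device pages, fill migrate.dst[], copy the data ... */

        migrate_vma_pages(&migrate);            /* switch the page table entries */
        migrate_vma_finalize(&migrate);         /* unlock and release the pages */
out:
        kvfree(migrate.src);
        kvfree(migrate.dst);
        return ret;
}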


> If not,
> what provides access serialisation for pages that are allocated in
> device memory rather than CPU memory (e.g. for truncate
> serialisation)? Does the hardware that owns these pages raise page
> faults on the CPU when those pages are accessed/dirtied?

Yes. This is done by the hmm_range_fault API, which the driver calls in
order to populate its device page tables. It is synchronized with any
mapping changes through mmu_interval_notifiers.

This is not new or changed by this patch series.
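
Roughly, that pattern looks like this (a simplified sketch; the driver's
own page table locking and most error handling are trimmed):

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

/*
 * Sketch of the hmm_range_fault / mmu_interval_notifier pattern used
 * to populate device page tables and keep them synchronized with the
 * CPU page tables.
 */
static int populate_device_range(struct mmu_interval_notifier *notifier,
                                 unsigned long start, unsigned long end,
                                 unsigned long *hmm_pfns)
{
        struct hmm_range range = {
                .notifier       = notifier,
                .start          = start,
                .end            = end,
                .hmm_pfns       = hmm_pfns,
                .default_flags  = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
        };
        int ret;

        do {
                range.notifier_seq = mmu_interval_read_begin(notifier);

                mmap_read_lock(notifier->mm);
                ret = hmm_range_fault(&range);
                mmap_read_unlock(notifier->mm);
                if (ret == -EBUSY)
                        continue;       /* collided with an invalidation, retry */
                if (ret)
                        return ret;

                /* ... translate range.hmm_pfns[] into device PTEs ... */

                /* retry if the CPU mappings changed underneath us */
        } while (mmu_interval_read_retry(notifier, range.notifier_seq));

        return 0;
}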


> How does
> demand paging in and out of device memory work (i.e. mapping files
> larger than device memory).

That depends on how the device driver handles device page faults. The
AMD GPU driver can handle recoverable device page faults and update the
device page table on demand with updated pfns from hmm_range_fault.

This is not new or changed by this patch series.


> How does IO to/from storage work - can
> the filesystem build normal bios out of these device pages and issue
> IO on them?

DEVICE_PUBLIC pages, introduced by this patch series, are CPU- and
peer-accessible like normal system memory.

DEVICE_PRIVATE pages are not CPU or peer-accessible. Any CPU access to
them goes through the page fault path and triggers a
dev_pagemap_ops.migrate_to_ram callback into the AMD GPU driver, which
unmaps the memory from the GPU and migrates it back to system memory.
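
For illustration, the DEVICE_PRIVATE side is wired up roughly like this
(a simplified sketch, not the actual amdgpu code):

#include <linux/memremap.h>
#include <linux/mm.h>

/*
 * A CPU touch of a device-private PTE ends up in the migrate_to_ram()
 * callback, which migrates the data back to system memory (typically
 * with the migrate_vma_* helpers) before the fault completes.
 */
static vm_fault_t gpu_migrate_to_ram(struct vm_fault *vmf)
{
        /* ... unmap vmf->page from the GPU and migrate it back ... */
        return 0;
}

static void gpu_page_free(struct page *page)
{
        /* ... return the backing VRAM page to the driver's allocator ... */
}

static const struct dev_pagemap_ops gpu_devmem_ops = {
        .page_free      = gpu_page_free,
        .migrate_to_ram = gpu_migrate_to_ram,
};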


> Are the additional constraints on IO because p2p DMA is
> needed to move the data from the storage HBA directly into/out of
> the GPU memory?
>
> I can think of lots more complex questions about how filesystems are
> supposed to manage remote device memory in the page cache, but these
> are just some of the basic things that make file-backed mappings
> different to anonymous mappings that I need to understand before I
> can make head or tail of what is being proposed here.....
>
>> Beyond that, my understanding
>> of how filesystems and page cache work is rather superficial at this point.
>> I'll keep your name in mind for when I am ready to discuss this in more
>> detail.
> If you don't know what the bigger picture is, then who does?
> Somebody built the design/architecture you are working towards, and
> they had to communicate it to you somehow. I'm asking for that
> information to be documented and made available to all the people these
> changes might impact, not whether you personally know how it
> works....

This patch series builds on top of existing HMM work with major
contributions from several people on this thread: Jerome Glisse, Jason
Gunthorpe, Christoph Hellwig, Ralph Campbell.

Beyond the reintroduction of DEVICE_PUBLIC memory in this patch series,
I'm not looking to invent a major new design here. Immediate future work
is more about chipping away at a few remaining limitations of the
implementation, with respect to migration of file-backed pages and maybe
transparent huge pages.

Regards,
  Felix


>
> Cheers,
>
> Dave.