Subject: [PATCH v5 00/13] Support DEVICE_GENERIC memory in migrate_vma_*

v1:
AMD is building a system architecture for the Frontier supercomputer with a
coherent interconnect between CPUs and GPUs. This hardware architecture allows
the CPUs to coherently access GPU device memory. We have hardware in our labs
and we are working with our partner HPE on the BIOS, firmware and software
for delivery to the DOE.

The system BIOS advertises the GPU device memory (aka VRAM) as SPM
(special purpose memory) in the UEFI system address map. The amdgpu driver looks
it up with lookup_resource and registers it with devmap as MEMORY_DEVICE_GENERIC
using devm_memremap_pages.

Now we're trying to migrate data to and from that memory using the migrate_vma_*
helpers so we can support page-based migration in our unified memory allocations,
while also supporting CPU access to those pages.

This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages behave
correctly in the migrate_vma_* helpers. We are looking for feedback about this
approach. If we're close, what's needed to make our patches acceptable upstream?
If we're not close, any suggestions how else to achieve what we are trying to do
(i.e. page migration and coherent CPU access to VRAM)?

This work is based on HMM and our SVM memory manager that was recently upstreamed
to Dave Airlie's drm-next branch
https://cgit.freedesktop.org/drm/drm/log/?h=drm-next
On top of that we did some rework of our VRAM management for migrations to remove
some incorrect assumptions, allow partially successful migrations and GPU memory
mappings that mix pages in VRAM and system memory.
https://lore.kernel.org/dri-devel/[email protected]/T/#r996356015e295780eb50453e7dbd5d0d68b47cbc

v2:
This patch series version has merged "[RFC PATCH v3 0/2]
mm: remove extra ZONE_DEVICE struct page refcount" patch series made by
Ralph Campbell. It also applies at the top of these series, our changes
to support device generic type in migration_vma helpers.
This has been tested in systems with device memory that has coherent
access by CPU.

Also addresses the following feedback made in v1:
- Isolate in one patch kernel/resource.c modification, based
on Christoph's feedback.
- Add helpers check for generic and private type to avoid
duplicated long lines.

v3:
- Include cover letter from v1.
- Rename dax_layout_is_idle_page func to dax_page_unused in patch
ext4/xfs: add page refcount helper.

v4:
- Add support for zone device generic type in lib/test_hmm and
tool/testing/selftest/vm/hmm-tests.
- Add missing page refcount helper to fuse/dax.c. This was included in
one of Ralph Campbell's patches.

v5:
- Cosmetic changes on patches 3, 5 and 13
- Bug founded at test_hmm, remove devmem->pagemap.type = MEMORY_DEVICE_PRIVATE
at dmirror_allocate_chunk that was forcing to configure pagemap.type to
MEMORY_DEVICE_PRIVATE.
- A bug was found while running one of the xfstest (generic/413) used to
validate fs_dax device type. This was first introduced by patch: "mm: remove
extra ZONE_DEVICE struct page refcount" whic is part of these patch series.
The bug was showed as WARNING message at try_grab_page function call, due to
a page refcounter equal to zero. Part of "mm: remove extra ZONE_DEVICE struct
page refcount" changes, was to initialize page refcounter to zero. Therefore,
a special condition was added to try_grab_page on this v5, were it checks for
device zone pages too. It is included in the same patch.

This is how mm changes from these patch series have been validated:
- hmm-tests were run using device private and device generic types. This last,
just added in these patch series. efi_fake_mem was used to mimic SPM memory
for device generic.
- xfstests tool was used to validate fs-dax device type and page refcounter
changes. DAX configuration was used along with emulated Persisten Memory set as
memmap=4G!4G memmap=4G!9G. xfstests were run from ext4 and generic lists. Some
of them, did not run due to limitations in configuration. Ex. test not
supporting specific file system or DAX mode.
Only three tests failed, generic/356/357 and ext4/049. However, these failures
were consistent before and after applying these patch series.
xfstest configuration:
TEST_DEV=/dev/pmem0
TEST_DIR=/mnt/ram0
SCRATCH_DEV=/dev/pmem1
SCRATCH_MNT=/mnt/ram1
TEST_FS_MOUNT_OPTS="-o dax"
EXT_MOUNT_OPTIONS="-o dax"
MKFS_OPTIONS="-b4096"
xfstest passed list:
Ext4:
001,003,005,021,022,023,025,026,030,031,032,036,037,038,042,043,044,271,306
Generic:
1,2,3,4,5,6,7,8,9,11,12,13,14,15,16,20,21,22,23,24,25,28,29,30,31,32,33,35,37,
50,52,53,58,60,61,62,63,64,67,69,70,71,75,76,78,79,80,82,84,86,87,88,91,92,94,
96,97,98,99,103,105,112,113,114,117,120,124,126,129,130,131,135,141,169,184,
198,207,210,211,212,213,214,215,221,223,225,228,236,237,240,244,245,246,247,
248,249,255,257,258,263,277,286,294,306,307,308,309,313,315,316,318,319,337,
346,360,361,371,375,377,379,380,383,384,385,386,389,391,392,393,394,400,401,
403,404,406,409,410,411,412,413,417,420,422,423,424,425,426,427,428

Patches 1-2 Rebased Ralph Campbell's ZONE_DEVICE page refcounting patches.

Patches 4-5 are for context to show how we are looking up the SPM
memory and registering it with devmap.

Patches 3,6-8 are the changes we are trying to upstream or rework to
make them acceptable upstream.

Patches 9-13 add ZONE_DEVICE Generic type support into the hmm test.

Alex Sierra (11):
kernel: resource: lookup_resource as exported symbol
drm/amdkfd: add SPM support for SVM
drm/amdkfd: generic type as sys mem on migration to ram
include/linux/mm.h: helpers to check zone device generic type
mm: add generic type support to migrate_vma helpers
mm: call pgmap->ops->page_free for DEVICE_GENERIC pages
lib: test_hmm add ioctl to get zone device type
lib: test_hmm add module param for zone device type
lib: add support for device generic type in test_hmm
tools: update hmm-test to support device generic type
tools: update test_hmm script to support SP config

Ralph Campbell (2):
ext4/xfs: add page refcount helper
mm: remove extra ZONE_DEVICE struct page refcount

arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +-
drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 22 ++-
drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
fs/dax.c | 8 +-
fs/ext4/inode.c | 5 +-
fs/fuse/dax.c | 4 +-
fs/xfs/xfs_file.c | 4 +-
include/linux/dax.h | 10 +
include/linux/memremap.h | 7 +-
include/linux/mm.h | 54 +-----
kernel/resource.c | 1 +
lib/test_hmm.c | 231 +++++++++++++++--------
lib/test_hmm_uapi.h | 16 ++
mm/internal.h | 8 +
mm/memremap.c | 69 ++-----
mm/migrate.c | 25 +--
mm/page_alloc.c | 3 +
mm/swap.c | 45 +----
tools/testing/selftests/vm/hmm-tests.c | 142 ++++++++++++--
tools/testing/selftests/vm/test_hmm.sh | 20 +-
20 files changed, 405 insertions(+), 273 deletions(-)

--
2.32.0


Subject: [PATCH v5 05/13] drm/amdkfd: generic type as sys mem on migration to ram

Generic device type memory on VRAM to RAM migration,
has similar access as System RAM from the CPU. This flag sets
the source from the sender. Which in Generic type case,
should be set as SYSTEM.

Signed-off-by: Alex Sierra <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
---
drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 3e9315354ce4..c511932e56d7 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -664,9 +664,12 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
migrate.vma = vma;
migrate.start = start;
migrate.end = end;
- migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
migrate.pgmap_owner = SVM_ADEV_PGMAP_OWNER(adev);

+ if (adev->gmc.xgmi.connected_to_cpu)
+ migrate.flags = MIGRATE_VMA_SELECT_SYSTEM;
+ else
+ migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
size = 2 * sizeof(*migrate.src) + sizeof(uint64_t) + sizeof(dma_addr_t);
size *= npages;
buf = kvmalloc(size, GFP_KERNEL | __GFP_ZERO);
--
2.32.0

Subject: [PATCH v5 04/13] drm/amdkfd: add SPM support for SVM

When CPU is connected throug XGMI, it has coherent
access to VRAM resource. In this case that resource
is taken from a table in the device gmc aperture base.
This resource is used along with the device type, which could
be DEVICE_PRIVATE or DEVICE_GENERIC to create the device
page map region.

Signed-off-by: Alex Sierra <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
---
drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 21b86c35a4f2..3e9315354ce4 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -916,6 +916,7 @@ int svm_migrate_init(struct amdgpu_device *adev)
struct resource *res;
unsigned long size;
void *r;
+ bool xgmi_connected_to_cpu = adev->gmc.xgmi.connected_to_cpu;

/* Page migration works on Vega10 or newer */
if (kfddev->device_info->asic_family < CHIP_VEGA10)
@@ -928,17 +929,22 @@ int svm_migrate_init(struct amdgpu_device *adev)
* should remove reserved size
*/
size = ALIGN(adev->gmc.real_vram_size, 2ULL << 20);
- res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
+ if (xgmi_connected_to_cpu)
+ res = lookup_resource(&iomem_resource, adev->gmc.aper_base);
+ else
+ res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
+
if (IS_ERR(res))
return -ENOMEM;

- pgmap->type = MEMORY_DEVICE_PRIVATE;
pgmap->nr_range = 1;
pgmap->range.start = res->start;
pgmap->range.end = res->end;
+ pgmap->type = xgmi_connected_to_cpu ?
+ MEMORY_DEVICE_GENERIC : MEMORY_DEVICE_PRIVATE;
pgmap->ops = &svm_migrate_pgmap_ops;
pgmap->owner = SVM_ADEV_PGMAP_OWNER(adev);
- pgmap->flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
+ pgmap->flags = 0;
r = devm_memremap_pages(adev->dev, pgmap);
if (IS_ERR(r)) {
pr_err("failed to register HMM device memory\n");
@@ -962,6 +968,7 @@ void svm_migrate_fini(struct amdgpu_device *adev)
struct dev_pagemap *pgmap = &adev->kfd.dev->pgmap;

devm_memunmap_pages(adev->dev, pgmap);
- devm_release_mem_region(adev->dev, pgmap->range.start,
- pgmap->range.end - pgmap->range.start + 1);
+ if (pgmap->type == MEMORY_DEVICE_PRIVATE)
+ devm_release_mem_region(adev->dev, pgmap->range.start,
+ pgmap->range.end - pgmap->range.start + 1);
}
--
2.32.0

Subject: [PATCH v5 03/13] kernel: resource: lookup_resource as exported symbol

The AMD architecture for the Frontier supercomputer will
have device memory which can be coherently accessed by
the CPU. The system BIOS advertises this memory as SPM
(special purpose memory) in the UEFI system address map.

The AMDGPU driver needs to be able to lookup this resource
in order to claim it as MEMORY_DEVICE_GENERIC using
devm_memremap_pages.

Signed-off-by: Alex Sierra <[email protected]>
---
kernel/resource.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/kernel/resource.c b/kernel/resource.c
index 627e61b0c124..2d4abd395c73 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -783,6 +783,7 @@ struct resource *lookup_resource(struct resource *root, resource_size_t start)

return res;
}
+EXPORT_SYMBOL_GPL(lookup_resource);

/*
* Insert a resource into the resource tree. If successful, return NULL,
--
2.32.0

Subject: [PATCH v5 06/13] include/linux/mm.h: helpers to check zone device generic type

Two helpers added. One checks if zone device page is generic
type. The other if page is either private or generic type.

Signed-off-by: Alex Sierra <[email protected]>
---
include/linux/mm.h | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c0fcb47d7641..8853acb89b1a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1125,6 +1125,14 @@ static inline bool is_device_private_page(const struct page *page)
page->pgmap->type == MEMORY_DEVICE_PRIVATE;
}

+static inline bool is_device_page(const struct page *page)
+{
+ return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
+ is_zone_device_page(page) &&
+ (page->pgmap->type == MEMORY_DEVICE_PRIVATE ||
+ page->pgmap->type == MEMORY_DEVICE_GENERIC);
+}
+
static inline bool is_pci_p2pdma_page(const struct page *page)
{
return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
--
2.32.0

Subject: [PATCH v5 01/13] ext4/xfs: add page refcount helper

From: Ralph Campbell <[email protected]>

There are several places where ZONE_DEVICE struct pages assume a reference
count == 1 means the page is idle and free. Instead of open coding this,
add a helper function to hide this detail.

v3:
[AS]: rename dax_layout_is_idle_page func to dax_page_unused

v4:
[AS]: This ref count functionality was missing on fuse/dax.c.

Signed-off-by: Ralph Campbell <[email protected]>
Signed-off-by: Alex Sierra <[email protected]>
---
fs/dax.c | 4 ++--
fs/ext4/inode.c | 5 +----
fs/fuse/dax.c | 4 +---
fs/xfs/xfs_file.c | 4 +---
include/linux/dax.h | 10 ++++++++++
5 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 26d5dcd2d69e..4820bb511d68 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -358,7 +358,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);

- WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+ WARN_ON_ONCE(trunc && !dax_page_unused(page));
WARN_ON_ONCE(page->mapping && page->mapping != mapping);
page->mapping = NULL;
page->index = 0;
@@ -372,7 +372,7 @@ static struct page *dax_busy_page(void *entry)
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);

- if (page_ref_count(page) > 1)
+ if (!dax_page_unused(page))
return page;
}
return NULL;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c173c8405856..9ee00186412f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3972,10 +3972,7 @@ int ext4_break_layouts(struct inode *inode)
if (!page)
return 0;

- error = ___wait_var_event(&page->_refcount,
- atomic_read(&page->_refcount) == 1,
- TASK_INTERRUPTIBLE, 0, 0,
- ext4_wait_dax_page(ei));
+ error = dax_wait_page(ei, page, ext4_wait_dax_page);
} while (error == 0);

return error;
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index ff99ab2a3c43..2b1f190ba78a 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -677,9 +677,7 @@ static int __fuse_dax_break_layouts(struct inode *inode, bool *retry,
return 0;

*retry = true;
- return ___wait_var_event(&page->_refcount,
- atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
- 0, 0, fuse_wait_dax_page(inode));
+ return dax_wait_page(inode, page, fuse_wait_dax_page);
}

/* dmap_end == 0 leads to unmapping of whole file */
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 5b0f93f73837..39565fe5f817 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -782,9 +782,7 @@ xfs_break_dax_layouts(
return 0;

*retry = true;
- return ___wait_var_event(&page->_refcount,
- atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
- 0, 0, xfs_wait_dax_page(inode));
+ return dax_wait_page(inode, page, xfs_wait_dax_page);
}

int
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b52f084aa643..8b5da1d60dbc 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -243,6 +243,16 @@ static inline bool dax_mapping(struct address_space *mapping)
return mapping->host && IS_DAX(mapping->host);
}

+static inline bool dax_page_unused(struct page *page)
+{
+ return page_ref_count(page) == 1;
+}
+
+#define dax_wait_page(_inode, _page, _wait_cb) \
+ ___wait_var_event(&(_page)->_refcount, \
+ dax_page_unused(_page), \
+ TASK_INTERRUPTIBLE, 0, 0, _wait_cb(_inode))
+
#ifdef CONFIG_DEV_DAX_HMEM_DEVICES
void hmem_register_device(int target_nid, struct resource *r);
#else
--
2.32.0

Subject: [PATCH v5 02/13] mm: remove extra ZONE_DEVICE struct page refcount

From: Ralph Campbell <[email protected]>

ZONE_DEVICE struct pages have an extra reference count that complicates the
code for put_page() and several places in the kernel that need to check the
reference count to see that a page is not being used (gup, compaction,
migration, etc.). Clean up the code so the reference count doesn't need to
be treated specially for ZONE_DEVICE.

v2:
AS: merged this patch in linux 5.11 version

v5:
AS: add condition at try_grab_page to check for the zone device type, while
page ref counter is checked less/equal to zero. In case of device zone, pages
ref counter are initialized to zero.

Signed-off-by: Ralph Campbell <[email protected]>
Signed-off-by: Alex Sierra <[email protected]>
---
arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +-
drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
fs/dax.c | 4 +-
include/linux/dax.h | 2 +-
include/linux/memremap.h | 7 +--
include/linux/mm.h | 46 +----------------
lib/test_hmm.c | 2 +-
mm/internal.h | 8 +++
mm/memremap.c | 68 +++++++-------------------
mm/migrate.c | 5 --
mm/page_alloc.c | 3 ++
mm/swap.c | 45 ++---------------
12 files changed, 46 insertions(+), 148 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 84e5a2dc8be5..acee67710620 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -711,7 +711,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)

dpage = pfn_to_page(uvmem_pfn);
dpage->zone_device_data = pvt;
- get_page(dpage);
+ init_page_count(dpage);
lock_page(dpage);
return dpage;
out_clear:
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 92987daa5e17..8bc7120e1216 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -324,7 +324,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
return NULL;
}

- get_page(page);
+ init_page_count(page);
lock_page(page);
return page;
}
diff --git a/fs/dax.c b/fs/dax.c
index 4820bb511d68..8d699c8e888b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -560,14 +560,14 @@ static void *grab_mapping_entry(struct xa_state *xas,

/**
* dax_layout_busy_page_range - find first pinned page in @mapping
- * @mapping: address space to scan for a page with ref count > 1
+ * @mapping: address space to scan for a page with ref count > 0
* @start: Starting offset. Page containing 'start' is included.
* @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX,
* pages from 'start' till the end of file are included.
*
* DAX requires ZONE_DEVICE mapped pages. These pages are never
* 'onlined' to the page allocator so they are considered idle when
- * page->count == 1. A filesystem uses this interface to determine if
+ * page->count == 0. A filesystem uses this interface to determine if
* any page in the mapping is busy, i.e. for DMA, or other
* get_user_pages() usages.
*
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 8b5da1d60dbc..05fc982ce153 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -245,7 +245,7 @@ static inline bool dax_mapping(struct address_space *mapping)

static inline bool dax_page_unused(struct page *page)
{
- return page_ref_count(page) == 1;
+ return page_ref_count(page) == 0;
}

#define dax_wait_page(_inode, _page, _wait_cb) \
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 79c49e7f5c30..327f32427d21 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -66,9 +66,10 @@ enum memory_type {

struct dev_pagemap_ops {
/*
- * Called once the page refcount reaches 1. (ZONE_DEVICE pages never
- * reach 0 refcount unless there is a refcount bug. This allows the
- * device driver to implement its own memory management.)
+ * Called once the page refcount reaches 0. The reference count
+ * should be reset to one with init_page_count(page) before reusing
+ * the page. This allows the device driver to implement its own
+ * memory management.
*/
void (*page_free)(struct page *page);

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c9900aedc195..c0fcb47d7641 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1117,39 +1117,6 @@ static inline bool is_zone_device_page(const struct page *page)
}
#endif

-#ifdef CONFIG_DEV_PAGEMAP_OPS
-void free_devmap_managed_page(struct page *page);
-DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-
-static inline bool page_is_devmap_managed(struct page *page)
-{
- if (!static_branch_unlikely(&devmap_managed_key))
- return false;
- if (!is_zone_device_page(page))
- return false;
- switch (page->pgmap->type) {
- case MEMORY_DEVICE_PRIVATE:
- case MEMORY_DEVICE_FS_DAX:
- return true;
- default:
- break;
- }
- return false;
-}
-
-void put_devmap_managed_page(struct page *page);
-
-#else /* CONFIG_DEV_PAGEMAP_OPS */
-static inline bool page_is_devmap_managed(struct page *page)
-{
- return false;
-}
-
-static inline void put_devmap_managed_page(struct page *page)
-{
-}
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
-
static inline bool is_device_private_page(const struct page *page)
{
return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
@@ -1186,7 +1153,7 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags);
static inline __must_check bool try_get_page(struct page *page)
{
page = compound_head(page);
- if (WARN_ON_ONCE(page_ref_count(page) <= 0))
+ if (WARN_ON_ONCE(page_ref_count(page) < (int)!is_zone_device_page(page)))
return false;
page_ref_inc(page);
return true;
@@ -1196,17 +1163,6 @@ static inline void put_page(struct page *page)
{
page = compound_head(page);

- /*
- * For devmap managed pages we need to catch refcount transition from
- * 2 to 1, when refcount reach one it means the page is free and we
- * need to inform the device driver through callback. See
- * include/linux/memremap.h and HMM for details.
- */
- if (page_is_devmap_managed(page)) {
- put_devmap_managed_page(page);
- return;
- }
-
if (put_page_testzero(page))
__put_page(page);
}
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 80a78877bd93..6998f10350ea 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -561,7 +561,7 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
}

dpage->zone_device_data = rpage;
- get_page(dpage);
+ init_page_count(dpage);
lock_page(dpage);
return dpage;

diff --git a/mm/internal.h b/mm/internal.h
index 25d2b2439f19..d3e58746f2d2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -623,4 +623,12 @@ struct migration_target_control {
gfp_t gfp_mask;
};

+#ifdef CONFIG_DEV_PAGEMAP_OPS
+void free_zone_device_page(struct page *page);
+#else
+static inline void free_zone_device_page(struct page *page)
+{
+}
+#endif
+
#endif /* __MM_INTERNAL_H */
diff --git a/mm/memremap.c b/mm/memremap.c
index 16b2fb482da1..614b3d600e95 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -12,6 +12,7 @@
#include <linux/types.h>
#include <linux/wait_bit.h>
#include <linux/xarray.h>
+#include "internal.h"

static DEFINE_XARRAY(pgmap_array);

@@ -37,32 +38,6 @@ unsigned long memremap_compat_align(void)
EXPORT_SYMBOL_GPL(memremap_compat_align);
#endif

-#ifdef CONFIG_DEV_PAGEMAP_OPS
-DEFINE_STATIC_KEY_FALSE(devmap_managed_key);
-EXPORT_SYMBOL(devmap_managed_key);
-
-static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
-{
- if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
- pgmap->type == MEMORY_DEVICE_FS_DAX)
- static_branch_dec(&devmap_managed_key);
-}
-
-static void devmap_managed_enable_get(struct dev_pagemap *pgmap)
-{
- if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
- pgmap->type == MEMORY_DEVICE_FS_DAX)
- static_branch_inc(&devmap_managed_key);
-}
-#else
-static void devmap_managed_enable_get(struct dev_pagemap *pgmap)
-{
-}
-static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
-{
-}
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
-
static void pgmap_array_delete(struct range *range)
{
xa_store_range(&pgmap_array, PHYS_PFN(range->start), PHYS_PFN(range->end),
@@ -87,16 +62,6 @@ static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
return (range->start + range_len(range)) >> PAGE_SHIFT;
}

-static unsigned long pfn_next(unsigned long pfn)
-{
- if (pfn % 1024 == 0)
- cond_resched();
- return pfn + 1;
-}
-
-#define for_each_device_pfn(pfn, map, i) \
- for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(pfn))
-
static void dev_pagemap_kill(struct dev_pagemap *pgmap)
{
if (pgmap->ops && pgmap->ops->kill)
@@ -152,20 +117,18 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)

void memunmap_pages(struct dev_pagemap *pgmap)
{
- unsigned long pfn;
int i;

dev_pagemap_kill(pgmap);
for (i = 0; i < pgmap->nr_range; i++)
- for_each_device_pfn(pfn, pgmap, i)
- put_page(pfn_to_page(pfn));
+ percpu_ref_put_many(pgmap->ref, pfn_end(pgmap, i) -
+ pfn_first(pgmap, i));
dev_pagemap_cleanup(pgmap);

for (i = 0; i < pgmap->nr_range; i++)
pageunmap_range(pgmap, i);

WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n");
- devmap_managed_enable_put(pgmap);
}
EXPORT_SYMBOL_GPL(memunmap_pages);

@@ -361,8 +324,6 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
}
}

- devmap_managed_enable_get(pgmap);
-
/*
* Clear the pgmap nr_range as it will be incremented for each
* successfully processed range. This communicates how many
@@ -477,16 +438,10 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
EXPORT_SYMBOL_GPL(get_dev_pagemap);

#ifdef CONFIG_DEV_PAGEMAP_OPS
-void free_devmap_managed_page(struct page *page)
+static void free_device_private_page(struct page *page)
{
- /* notify page idle for dax */
- if (!is_device_private_page(page)) {
- wake_up_var(&page->_refcount);
- return;
- }

__ClearPageWaiters(page);
-
mem_cgroup_uncharge(page);

/*
@@ -513,4 +468,19 @@ void free_devmap_managed_page(struct page *page)
page->mapping = NULL;
page->pgmap->ops->page_free(page);
}
+
+void free_zone_device_page(struct page *page)
+{
+ switch (page->pgmap->type) {
+ case MEMORY_DEVICE_FS_DAX:
+ /* notify page idle */
+ wake_up_var(&page->_refcount);
+ return;
+ case MEMORY_DEVICE_PRIVATE:
+ free_device_private_page(page);
+ return;
+ default:
+ return;
+ }
+}
#endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/migrate.c b/mm/migrate.c
index 20ca887ea769..8c2430d3e77b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -376,11 +376,6 @@ static int expected_page_refs(struct address_space *mapping, struct page *page)
{
int expected_count = 1;

- /*
- * Device private pages have an extra refcount as they are
- * ZONE_DEVICE pages.
- */
- expected_count += is_device_private_page(page);
if (mapping)
expected_count += thp_nr_pages(page) + page_has_private(page);

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 519a60d5b6f7..4612c457d0b0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6210,6 +6210,9 @@ void __ref memmap_init_zone_device(struct zone *zone,

__init_single_page(page, pfn, zone_idx, nid);

+ /* ZONE_DEVICE pages start with a zero reference count. */
+ set_page_count(page, 0);
+
/*
* Mark page reserved as it will need to wait for onlining
* phase for it to be fully associated with a zone.
diff --git a/mm/swap.c b/mm/swap.c
index 2cca7141470c..0a12af814065 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -114,12 +114,11 @@ static void __put_compound_page(struct page *page)
void __put_page(struct page *page)
{
if (is_zone_device_page(page)) {
- put_dev_pagemap(page->pgmap);
-
/*
* The page belongs to the device that created pgmap. Do
* not return it to page allocator.
*/
+ free_zone_device_page(page);
return;
}

@@ -878,29 +877,18 @@ void release_pages(struct page **pages, int nr)
if (is_huge_zero_page(page))
continue;

+ if (!put_page_testzero(page))
+ continue;
+
if (is_zone_device_page(page)) {
if (lruvec) {
unlock_page_lruvec_irqrestore(lruvec, flags);
lruvec = NULL;
}
- /*
- * ZONE_DEVICE pages that return 'false' from
- * page_is_devmap_managed() do not require special
- * processing, and instead, expect a call to
- * put_page_testzero().
- */
- if (page_is_devmap_managed(page)) {
- put_devmap_managed_page(page);
- continue;
- }
- if (put_page_testzero(page))
- put_dev_pagemap(page->pgmap);
+ free_zone_device_page(page);
continue;
}

- if (!put_page_testzero(page))
- continue;
-
if (PageCompound(page)) {
if (lruvec) {
unlock_page_lruvec_irqrestore(lruvec, flags);
@@ -1142,26 +1130,3 @@ void __init swap_setup(void)
* _really_ don't want to cluster much more
*/
}
-
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-void put_devmap_managed_page(struct page *page)
-{
- int count;
-
- if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
- return;
-
- count = page_ref_dec_return(page);
-
- /*
- * devmap page refcounts are 1-based, rather than 0-based: if
- * refcount is 1, then the page is free and the refcount is
- * stable because nobody holds a reference on the page.
- */
- if (count == 1)
- free_devmap_managed_page(page);
- else if (!count)
- __put_page(page);
-}
-EXPORT_SYMBOL(put_devmap_managed_page);
-#endif
--
2.32.0

Subject: [PATCH v5 12/13] tools: update hmm-test to support device generic type

Test cases such as migrate_fault and migrate_multiple,
were modified to explicit migrate from device to sys memory
without the need of page faults, when using device generic
type.

Snapshot test case updated to read memory device type
first and based on that, get the proper returned results
migrate_ping_pong test case added to test explicit migration
from device to sys memory for both private and generic
zone types.

Helpers to migrate from device to sys memory and vicerversa
were also added.

Signed-off-by: Alex Sierra <[email protected]>
---
tools/testing/selftests/vm/hmm-tests.c | 142 +++++++++++++++++++++----
1 file changed, 124 insertions(+), 18 deletions(-)

diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
index 5d1ac691b9f4..70632b195497 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -44,6 +44,8 @@ struct hmm_buffer {
int fd;
uint64_t cpages;
uint64_t faults;
+ int zone_device_type;
+ bool alloc_to_devmem;
};

#define TWOMEG (1 << 21)
@@ -133,6 +135,7 @@ static int hmm_dmirror_cmd(int fd,
cmd.addr = (__u64)buffer->ptr;
cmd.ptr = (__u64)buffer->mirror;
cmd.npages = npages;
+ cmd.alloc_to_devmem = buffer->alloc_to_devmem;

for (;;) {
ret = ioctl(fd, request, &cmd);
@@ -144,6 +147,7 @@ static int hmm_dmirror_cmd(int fd,
}
buffer->cpages = cmd.cpages;
buffer->faults = cmd.faults;
+ buffer->zone_device_type = cmd.zone_device_type;

return 0;
}
@@ -211,6 +215,34 @@ static void hmm_nanosleep(unsigned int n)
nanosleep(&t, NULL);
}

+static int hmm_migrate_sys_to_dev(int fd,
+ struct hmm_buffer *buffer,
+ unsigned long npages)
+{
+ buffer->alloc_to_devmem = true;
+ return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+}
+
+static int hmm_migrate_dev_to_sys(int fd,
+ struct hmm_buffer *buffer,
+ unsigned long npages)
+{
+ buffer->alloc_to_devmem = false;
+ return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+}
+
+static int hmm_is_private_device(int fd, bool *res)
+{
+ struct hmm_buffer buffer;
+ int ret;
+
+ buffer.ptr = 0;
+ ret = hmm_dmirror_cmd(fd, HMM_DMIRROR_GET_MEM_DEV_TYPE, &buffer, 1);
+ *res = (buffer.zone_device_type == HMM_DMIRROR_MEMORY_DEVICE_PRIVATE);
+
+ return ret;
+}
+
/*
* Simple NULL test of device open/close.
*/
@@ -875,7 +907,7 @@ TEST_F(hmm, migrate)
ptr[i] = i;

/* Migrate memory to device. */
- ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, npages);

@@ -923,7 +955,7 @@ TEST_F(hmm, migrate_fault)
ptr[i] = i;

/* Migrate memory to device. */
- ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, npages);

@@ -936,7 +968,7 @@ TEST_F(hmm, migrate_fault)
ASSERT_EQ(ptr[i], i);

/* Migrate memory to the device again. */
- ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, npages);

@@ -976,7 +1008,7 @@ TEST_F(hmm, migrate_shared)
ASSERT_NE(buffer->ptr, MAP_FAILED);

/* Migrate memory to device. */
- ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
ASSERT_EQ(ret, -ENOENT);

hmm_buffer_free(buffer);
@@ -1015,7 +1047,7 @@ TEST_F(hmm2, migrate_mixed)
p = buffer->ptr;

/* Migrating a protected area should be an error. */
- ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, npages);
+ ret = hmm_migrate_sys_to_dev(self->fd1, buffer, npages);
ASSERT_EQ(ret, -EINVAL);

/* Punch a hole after the first page address. */
@@ -1023,7 +1055,7 @@ TEST_F(hmm2, migrate_mixed)
ASSERT_EQ(ret, 0);

/* We expect an error if the vma doesn't cover the range. */
- ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 3);
+ ret = hmm_migrate_sys_to_dev(self->fd1, buffer, 3);
ASSERT_EQ(ret, -EINVAL);

/* Page 2 will be a read-only zero page. */
@@ -1055,13 +1087,13 @@ TEST_F(hmm2, migrate_mixed)

/* Now try to migrate pages 2-5 to device 1. */
buffer->ptr = p + 2 * self->page_size;
- ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 4);
+ ret = hmm_migrate_sys_to_dev(self->fd1, buffer, 4);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, 4);

/* Page 5 won't be migrated to device 0 because it's on device 1. */
buffer->ptr = p + 5 * self->page_size;
- ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_MIGRATE, buffer, 1);
+ ret = hmm_migrate_sys_to_dev(self->fd0, buffer, 1);
ASSERT_EQ(ret, -ENOENT);
buffer->ptr = p;

@@ -1070,8 +1102,12 @@ TEST_F(hmm2, migrate_mixed)
}

/*
- * Migrate anonymous memory to device private memory and fault it back to system
- * memory multiple times.
+ * Migrate anonymous memory to device memory and back to system memory
+ * multiple times. In case of private zone configuration, this is done
+ * through fault pages accessed by CPU. In case of generic zone configuration,
+ * the pages from the device should be explicitly migrated back to system memory.
+ * The reason is Generic device zone has coherent access to CPU, therefore
+ * it will not generate any page fault.
*/
TEST_F(hmm, migrate_multiple)
{
@@ -1082,7 +1118,9 @@ TEST_F(hmm, migrate_multiple)
unsigned long c;
int *ptr;
int ret;
+ bool is_private;

+ ASSERT_EQ(hmm_is_private_device(self->fd, &is_private), 0);
npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
ASSERT_NE(npages, 0);
size = npages << self->page_shift;
@@ -1107,8 +1145,7 @@ TEST_F(hmm, migrate_multiple)
ptr[i] = i;

/* Migrate memory to device. */
- ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer,
- npages);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, npages);

@@ -1116,7 +1153,12 @@ TEST_F(hmm, migrate_multiple)
for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
ASSERT_EQ(ptr[i], i);

- /* Fault pages back to system memory and check them. */
+ /* Migrate back to system memory and check them. */
+ if (!is_private) {
+ ret = hmm_migrate_dev_to_sys(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ }
+
for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
ASSERT_EQ(ptr[i], i);

@@ -1261,10 +1303,12 @@ TEST_F(hmm2, snapshot)
unsigned char *m;
int ret;
int val;
+ bool is_private;

npages = 7;
size = npages << self->page_shift;

+ ASSERT_EQ(hmm_is_private_device(self->fd0, &is_private), 0);
buffer = malloc(sizeof(*buffer));
ASSERT_NE(buffer, NULL);

@@ -1312,13 +1356,13 @@ TEST_F(hmm2, snapshot)

/* Page 5 will be migrated to device 0. */
buffer->ptr = p + 5 * self->page_size;
- ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_MIGRATE, buffer, 1);
+ ret = hmm_migrate_sys_to_dev(self->fd0, buffer, 1);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, 1);

/* Page 6 will be migrated to device 1. */
buffer->ptr = p + 6 * self->page_size;
- ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 1);
+ ret = hmm_migrate_sys_to_dev(self->fd1, buffer, 1);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, 1);

@@ -1335,9 +1379,16 @@ TEST_F(hmm2, snapshot)
ASSERT_EQ(m[2], HMM_DMIRROR_PROT_ZERO | HMM_DMIRROR_PROT_READ);
ASSERT_EQ(m[3], HMM_DMIRROR_PROT_READ);
ASSERT_EQ(m[4], HMM_DMIRROR_PROT_WRITE);
- ASSERT_EQ(m[5], HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL |
- HMM_DMIRROR_PROT_WRITE);
- ASSERT_EQ(m[6], HMM_DMIRROR_PROT_NONE);
+ if (is_private) {
+ ASSERT_EQ(m[5], HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL |
+ HMM_DMIRROR_PROT_WRITE);
+ ASSERT_EQ(m[6], HMM_DMIRROR_PROT_NONE);
+ } else {
+ ASSERT_EQ(m[5], HMM_DMIRROR_PROT_DEV_GENERIC |
+ HMM_DMIRROR_PROT_WRITE);
+ ASSERT_EQ(m[6], HMM_DMIRROR_PROT_DEV_GENERIC |
+ HMM_DMIRROR_PROT_WRITE);
+ }

hmm_buffer_free(buffer);
}
@@ -1485,4 +1536,59 @@ TEST_F(hmm2, double_map)
hmm_buffer_free(buffer);
}

+/*
+ * Migrate anonymous memory to device memory and migrate back to system memory
+ * explicitly, without generating a page fault.
+ */
+TEST_F(hmm, migrate_ping_pong)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long i;
+ int *ptr;
+ int ret;
+
+ npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+ ASSERT_NE(npages, 0);
+ size = npages << self->page_shift;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = size;
+ buffer->mirror = malloc(size);
+ buffer->alloc_to_devmem = true;
+ ASSERT_NE(buffer->mirror, NULL);
+
+ buffer->ptr = mmap(NULL, size,
+ PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS,
+ buffer->fd, 0);
+ ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+ /* Initialize buffer in system memory. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ptr[i] = i;
+
+ /* Migrate memory to device. */
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+ /* Check what the device read. */
+ for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], i);
+
+ /* Migrate memory back to system mem. */
+ ret = hmm_migrate_dev_to_sys(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+
+ /* Check the buffer migrated back to system memory. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], i);
+
+ hmm_buffer_free(buffer);
+}
+
TEST_HARNESS_MAIN
--
2.32.0

Subject: [PATCH v5 11/13] lib: add support for device generic type in test_hmm

Device Generic type uses device memory that is coherently
accesible by the CPU. Usually, this is shown as SP
(special purpose) memory range at the BIOS-e820 memory
enumeration. If no SP memory is supported in system,
this could be faked by setting CONFIG_EFI_FAKE_MEMMAP.

Currently, test_hmm only supports two different SP ranges
of at least 256MB size. This could be specified in the
kernel parameter variable efi_fake_mem. Ex. Two SP ranges
of 1GB starting at 0x100000000 & 0x140000000 physical address.
efi_fake_mem=1G@0x100000000:0x40000,1G@0x140000000:0x40000

Signed-off-by: Alex Sierra <[email protected]>
---
lib/test_hmm.c | 170 ++++++++++++++++++++++++++++----------------
lib/test_hmm_uapi.h | 10 ++-
2 files changed, 116 insertions(+), 64 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 9283ad1f4280..52b6190fab66 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -469,6 +469,7 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
unsigned long pfn_first;
unsigned long pfn_last;
void *ptr;
+ int ret = -ENOMEM;

devmem = kzalloc(sizeof(*devmem), GFP_KERNEL);
if (!devmem)
@@ -516,8 +517,10 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
}

ptr = memremap_pages(&devmem->pagemap, numa_node_id());
- if (IS_ERR(ptr))
+ if (IS_ERR(ptr)) {
+ ret = PTR_ERR(ptr);
goto err_release;
+ }

devmem->mdevice = mdevice;
pfn_first = devmem->pagemap.range.start >> PAGE_SHIFT;
@@ -546,7 +549,7 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
}
spin_unlock(&mdevice->lock);

- return true;
+ return 0;

err_release:
mutex_unlock(&mdevice->devmem_lock);
@@ -554,7 +557,7 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
err_devmem:
kfree(devmem);

- return false;
+ return ret;
}

static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
@@ -563,8 +566,10 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
struct page *rpage;

/*
- * This is a fake device so we alloc real system memory to store
- * our device memory.
+ * For ZONE_DEVICE private type, this is a fake device so we alloc real
+ * system memory to store our device memory.
+ * For ZONE_DEVICE generic type we use the actual dpage to store the data
+ * and ignore rpage.
*/
rpage = alloc_page(GFP_HIGHUSER);
if (!rpage)
@@ -597,7 +602,7 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
struct dmirror *dmirror)
{
struct dmirror_device *mdevice = dmirror->mdevice;
- const unsigned long *src = args->src;
+ unsigned long *src = args->src;
unsigned long *dst = args->dst;
unsigned long addr;

@@ -615,12 +620,18 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
* unallocated pte_none() or read-only zero page.
*/
spage = migrate_pfn_to_page(*src);
-
+ if (spage && is_zone_device_page(spage)) {
+ pr_debug("page already in device spage pfn: 0x%lx\n",
+ page_to_pfn(spage));
+ *src &= ~MIGRATE_PFN_MIGRATE;
+ continue;
+ }
dpage = dmirror_devmem_alloc_page(mdevice);
if (!dpage)
continue;

- rpage = dpage->zone_device_data;
+ rpage = is_device_private_page(dpage) ? dpage->zone_device_data :
+ dpage;
if (spage)
copy_highpage(rpage, spage);
else
@@ -632,8 +643,10 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
* the simulated device memory and that page holds the pointer
* to the mirror.
*/
+ rpage = dpage->zone_device_data;
rpage->zone_device_data = dmirror;
-
+ pr_debug("migrating from sys to dev pfn src: 0x%lx pfn dst: 0x%lx\n",
+ page_to_pfn(spage), page_to_pfn(dpage));
*dst = migrate_pfn(page_to_pfn(dpage)) |
MIGRATE_PFN_LOCKED;
if ((*src & MIGRATE_PFN_WRITE) ||
@@ -667,10 +680,13 @@ static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
continue;

/*
- * Store the page that holds the data so the page table
- * doesn't have to deal with ZONE_DEVICE private pages.
+ * For ZONE_DEVICE private pages we store the page that
+ * holds the data so the page table doesn't have to deal it.
+ * For ZONE_DEVICE generic pages we store the actual page, since
+ * the CPU has coherent access to the page.
*/
- entry = dpage->zone_device_data;
+ entry = is_device_private_page(dpage) ? dpage->zone_device_data :
+ dpage;
if (*dst & MIGRATE_PFN_WRITE)
entry = xa_tag_pointer(entry, DPT_XA_TAG_WRITE);
entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC);
@@ -684,6 +700,47 @@ static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
return 0;
}

+static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
+ struct dmirror *dmirror)
+{
+ unsigned long *src = args->src;
+ unsigned long *dst = args->dst;
+ unsigned long start = args->start;
+ unsigned long end = args->end;
+ unsigned long addr;
+
+ for (addr = start; addr < end; addr += PAGE_SIZE,
+ src++, dst++) {
+ struct page *dpage, *spage;
+
+ spage = migrate_pfn_to_page(*src);
+ if (!spage || !(*src & MIGRATE_PFN_MIGRATE))
+ continue;
+ if (is_device_private_page(spage)) {
+ spage = spage->zone_device_data;
+ } else {
+ pr_debug("page already in system or SPM spage pfn: 0x%lx\n",
+ page_to_pfn(spage));
+ *src &= ~MIGRATE_PFN_MIGRATE;
+ continue;
+ }
+ dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
+ if (!dpage)
+ continue;
+ pr_debug("migrating from dev to sys pfn src: 0x%lx pfn dst: 0x%lx\n",
+ page_to_pfn(spage), page_to_pfn(dpage));
+
+ lock_page(dpage);
+ xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
+ copy_highpage(dpage, spage);
+ *dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
+ if (*src & MIGRATE_PFN_WRITE)
+ *dst |= MIGRATE_PFN_WRITE;
+ }
+ return 0;
+}
+
+
static int dmirror_migrate(struct dmirror *dmirror,
struct hmm_dmirror_cmd *cmd)
{
@@ -725,33 +782,46 @@ static int dmirror_migrate(struct dmirror *dmirror,
args.start = addr;
args.end = next;
args.pgmap_owner = dmirror->mdevice;
- args.flags = MIGRATE_VMA_SELECT_SYSTEM;
+ args.flags = (!cmd->alloc_to_devmem &&
+ dmirror->mdevice->zone_device_type ==
+ HMM_DMIRROR_MEMORY_DEVICE_PRIVATE) ?
+ MIGRATE_VMA_SELECT_DEVICE_PRIVATE :
+ MIGRATE_VMA_SELECT_SYSTEM;
ret = migrate_vma_setup(&args);
if (ret)
goto out;

- dmirror_migrate_alloc_and_copy(&args, dmirror);
+ if (cmd->alloc_to_devmem) {
+ pr_debug("Migrating from sys mem to device mem\n");
+ dmirror_migrate_alloc_and_copy(&args, dmirror);
+ } else {
+ pr_debug("Migrating from device mem to sys mem\n");
+ dmirror_devmem_fault_alloc_and_copy(&args, dmirror);
+ }
migrate_vma_pages(&args);
- dmirror_migrate_finalize_and_map(&args, dmirror);
+ if (cmd->alloc_to_devmem)
+ dmirror_migrate_finalize_and_map(&args, dmirror);
migrate_vma_finalize(&args);
}
mmap_read_unlock(mm);
mmput(mm);

- /* Return the migrated data for verification. */
- ret = dmirror_bounce_init(&bounce, start, size);
- if (ret)
- return ret;
- mutex_lock(&dmirror->mutex);
- ret = dmirror_do_read(dmirror, start, end, &bounce);
- mutex_unlock(&dmirror->mutex);
- if (ret == 0) {
- if (copy_to_user(u64_to_user_ptr(cmd->ptr), bounce.ptr,
- bounce.size))
- ret = -EFAULT;
+ /* Return the migrated data for verification. only for pages in device zone */
+ if (cmd->alloc_to_devmem) {
+ ret = dmirror_bounce_init(&bounce, start, size);
+ if (ret)
+ return ret;
+ mutex_lock(&dmirror->mutex);
+ ret = dmirror_do_read(dmirror, start, end, &bounce);
+ mutex_unlock(&dmirror->mutex);
+ if (ret == 0) {
+ if (copy_to_user(u64_to_user_ptr(cmd->ptr), bounce.ptr,
+ bounce.size))
+ ret = -EFAULT;
+ }
+ cmd->cpages = bounce.cpages;
+ dmirror_bounce_fini(&bounce);
}
- cmd->cpages = bounce.cpages;
- dmirror_bounce_fini(&bounce);
return ret;

out:
@@ -775,9 +845,15 @@ static void dmirror_mkentry(struct dmirror *dmirror, struct hmm_range *range,
}

page = hmm_pfn_to_page(entry);
- if (is_device_private_page(page)) {
- /* Is the page migrated to this device or some other? */
- if (dmirror->mdevice == dmirror_page_to_device(page))
+ if (is_device_page(page)) {
+ /* Is page ZONE_DEVICE generic? */
+ if (!is_device_private_page(page))
+ *perm = HMM_DMIRROR_PROT_DEV_GENERIC;
+ /*
+ * Is page ZONE_DEVICE private migrated to
+ * this device or some other?
+ */
+ else if (dmirror->mdevice == dmirror_page_to_device(page))
*perm = HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL;
else
*perm = HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE;
@@ -1024,38 +1100,6 @@ static void dmirror_devmem_free(struct page *page)
spin_unlock(&mdevice->lock);
}

-static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
- struct dmirror *dmirror)
-{
- const unsigned long *src = args->src;
- unsigned long *dst = args->dst;
- unsigned long start = args->start;
- unsigned long end = args->end;
- unsigned long addr;
-
- for (addr = start; addr < end; addr += PAGE_SIZE,
- src++, dst++) {
- struct page *dpage, *spage;
-
- spage = migrate_pfn_to_page(*src);
- if (!spage || !(*src & MIGRATE_PFN_MIGRATE))
- continue;
- spage = spage->zone_device_data;
-
- dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
- if (!dpage)
- continue;
-
- lock_page(dpage);
- xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
- copy_highpage(dpage, spage);
- *dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
- if (*src & MIGRATE_PFN_WRITE)
- *dst |= MIGRATE_PFN_WRITE;
- }
- return 0;
-}
-
static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
{
struct migrate_vma args;
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index 17a6b5059871..1f2322286fba 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -17,8 +17,12 @@
* @addr: (in) user address the device will read/write
* @ptr: (in) user address where device data is copied to/from
* @npages: (in) number of pages to read/write
+ * @alloc_to_devmem: (in) desired allocation destination during migration.
+ * True if allocation is to device memory.
+ * False if allocation is to system memory.
* @cpages: (out) number of pages copied
* @faults: (out) number of device page faults seen
+ * @zone_device_type: (out) zone device memory type
*/
struct hmm_dmirror_cmd {
__u64 addr;
@@ -26,7 +30,8 @@ struct hmm_dmirror_cmd {
__u64 npages;
__u64 cpages;
__u64 faults;
- __u64 zone_device_type;
+ __u32 zone_device_type;
+ __u32 alloc_to_devmem;
};

/* Expose the address space of the calling process through hmm device file */
@@ -49,6 +54,8 @@ struct hmm_dmirror_cmd {
* device the ioctl() is made
* HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE: Migrated device private page on some
* other device
+ * HMM_DMIRROR_PROT_DEV_GENERIC: Migrate device generic page on the device
+ * the ioctl() is made
*/
enum {
HMM_DMIRROR_PROT_ERROR = 0xFF,
@@ -60,6 +67,7 @@ enum {
HMM_DMIRROR_PROT_ZERO = 0x10,
HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL = 0x20,
HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE = 0x30,
+ HMM_DMIRROR_PROT_DEV_GENERIC = 0x40,
};

enum {
--
2.32.0

Subject: [PATCH v5 13/13] tools: update test_hmm script to support SP config

Add two more parameters to set spm_addr_dev0 & spm_addr_dev1
addresses. These two parameters configure the start SP
addresses for each device in test_hmm driver.
Consequently, this configures zone device type as generic.

Signed-off-by: Alex Sierra <[email protected]>
---
tools/testing/selftests/vm/test_hmm.sh | 20 +++++++++++++++++---
1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/vm/test_hmm.sh b/tools/testing/selftests/vm/test_hmm.sh
index 0647b525a625..3eeabe94399f 100755
--- a/tools/testing/selftests/vm/test_hmm.sh
+++ b/tools/testing/selftests/vm/test_hmm.sh
@@ -40,7 +40,18 @@ check_test_requirements()

load_driver()
{
- modprobe $DRIVER > /dev/null 2>&1
+ if [ $# -eq 0 ]; then
+ modprobe $DRIVER > /dev/null 2>&1
+ else
+ if [ $# -eq 2 ]; then
+ modprobe $DRIVER spm_addr_dev0=$1 spm_addr_dev1=$2
+ > /dev/null 2>&1
+ else
+ echo "Missing module parameters. Make sure pass"\
+ "spm_addr_dev0 and spm_addr_dev1"
+ usage
+ fi
+ fi
if [ $? == 0 ]; then
major=$(awk "\$2==\"HMM_DMIRROR\" {print \$1}" /proc/devices)
mknod /dev/hmm_dmirror0 c $major 0
@@ -58,7 +69,7 @@ run_smoke()
{
echo "Running smoke test. Note, this test provides basic coverage."

- load_driver
+ load_driver $1 $2
$(dirname "${BASH_SOURCE[0]}")/hmm-tests
unload_driver
}
@@ -75,6 +86,9 @@ usage()
echo "# Smoke testing"
echo "./${TEST_NAME}.sh smoke"
echo
+ echo "# Smoke testing with SPM enabled"
+ echo "./${TEST_NAME}.sh smoke <spm_addr_dev0> <spm_addr_dev1>"
+ echo
exit 0
}

@@ -84,7 +98,7 @@ function run_test()
usage
else
if [ "$1" = "smoke" ]; then
- run_smoke
+ run_smoke $2 $3
else
usage
fi
--
2.32.0

2021-08-12 06:53:17

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v5 00/13] Support DEVICE_GENERIC memory in migrate_vma_*

Do you have a pointer to a git branch with this series and all dependencies
to ease testing?

On Thu, Aug 12, 2021 at 01:30:47AM -0500, Alex Sierra wrote:
> v1:
> AMD is building a system architecture for the Frontier supercomputer with a
> coherent interconnect between CPUs and GPUs. This hardware architecture allows
> the CPUs to coherently access GPU device memory. We have hardware in our labs
> and we are working with our partner HPE on the BIOS, firmware and software
> for delivery to the DOE.
>
> The system BIOS advertises the GPU device memory (aka VRAM) as SPM
> (special purpose memory) in the UEFI system address map. The amdgpu driver looks
> it up with lookup_resource and registers it with devmap as MEMORY_DEVICE_GENERIC
> using devm_memremap_pages.
>
> Now we're trying to migrate data to and from that memory using the migrate_vma_*
> helpers so we can support page-based migration in our unified memory allocations,
> while also supporting CPU access to those pages.
>
> This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages behave
> correctly in the migrate_vma_* helpers. We are looking for feedback about this
> approach. If we're close, what's needed to make our patches acceptable upstream?
> If we're not close, any suggestions how else to achieve what we are trying to do
> (i.e. page migration and coherent CPU access to VRAM)?
>
> This work is based on HMM and our SVM memory manager that was recently upstreamed
> to Dave Airlie's drm-next branch
> https://cgit.freedesktop.org/drm/drm/log/?h=drm-next
> On top of that we did some rework of our VRAM management for migrations to remove
> some incorrect assumptions, allow partially successful migrations and GPU memory
> mappings that mix pages in VRAM and system memory.
> https://lore.kernel.org/dri-devel/[email protected]/T/#r996356015e295780eb50453e7dbd5d0d68b47cbc
>
> v2:
> This patch series version has merged "[RFC PATCH v3 0/2]
> mm: remove extra ZONE_DEVICE struct page refcount" patch series made by
> Ralph Campbell. It also applies at the top of these series, our changes
> to support device generic type in migration_vma helpers.
> This has been tested in systems with device memory that has coherent
> access by CPU.
>
> Also addresses the following feedback made in v1:
> - Isolate in one patch kernel/resource.c modification, based
> on Christoph's feedback.
> - Add helpers check for generic and private type to avoid
> duplicated long lines.
>
> v3:
> - Include cover letter from v1.
> - Rename dax_layout_is_idle_page func to dax_page_unused in patch
> ext4/xfs: add page refcount helper.
>
> v4:
> - Add support for zone device generic type in lib/test_hmm and
> tool/testing/selftest/vm/hmm-tests.
> - Add missing page refcount helper to fuse/dax.c. This was included in
> one of Ralph Campbell's patches.
>
> v5:
> - Cosmetic changes on patches 3, 5 and 13
> - Bug founded at test_hmm, remove devmem->pagemap.type = MEMORY_DEVICE_PRIVATE
> at dmirror_allocate_chunk that was forcing to configure pagemap.type to
> MEMORY_DEVICE_PRIVATE.
> - A bug was found while running one of the xfstest (generic/413) used to
> validate fs_dax device type. This was first introduced by patch: "mm: remove
> extra ZONE_DEVICE struct page refcount" whic is part of these patch series.
> The bug was showed as WARNING message at try_grab_page function call, due to
> a page refcounter equal to zero. Part of "mm: remove extra ZONE_DEVICE struct
> page refcount" changes, was to initialize page refcounter to zero. Therefore,
> a special condition was added to try_grab_page on this v5, were it checks for
> device zone pages too. It is included in the same patch.
>
> This is how mm changes from these patch series have been validated:
> - hmm-tests were run using device private and device generic types. This last,
> just added in these patch series. efi_fake_mem was used to mimic SPM memory
> for device generic.
> - xfstests tool was used to validate fs-dax device type and page refcounter
> changes. DAX configuration was used along with emulated Persisten Memory set as
> memmap=4G!4G memmap=4G!9G. xfstests were run from ext4 and generic lists. Some
> of them, did not run due to limitations in configuration. Ex. test not
> supporting specific file system or DAX mode.
> Only three tests failed, generic/356/357 and ext4/049. However, these failures
> were consistent before and after applying these patch series.
> xfstest configuration:
> TEST_DEV=/dev/pmem0
> TEST_DIR=/mnt/ram0
> SCRATCH_DEV=/dev/pmem1
> SCRATCH_MNT=/mnt/ram1
> TEST_FS_MOUNT_OPTS="-o dax"
> EXT_MOUNT_OPTIONS="-o dax"
> MKFS_OPTIONS="-b4096"
> xfstest passed list:
> Ext4:
> 001,003,005,021,022,023,025,026,030,031,032,036,037,038,042,043,044,271,306
> Generic:
> 1,2,3,4,5,6,7,8,9,11,12,13,14,15,16,20,21,22,23,24,25,28,29,30,31,32,33,35,37,
> 50,52,53,58,60,61,62,63,64,67,69,70,71,75,76,78,79,80,82,84,86,87,88,91,92,94,
> 96,97,98,99,103,105,112,113,114,117,120,124,126,129,130,131,135,141,169,184,
> 198,207,210,211,212,213,214,215,221,223,225,228,236,237,240,244,245,246,247,
> 248,249,255,257,258,263,277,286,294,306,307,308,309,313,315,316,318,319,337,
> 346,360,361,371,375,377,379,380,383,384,385,386,389,391,392,393,394,400,401,
> 403,404,406,409,410,411,412,413,417,420,422,423,424,425,426,427,428
>
> Patches 1-2 Rebased Ralph Campbell's ZONE_DEVICE page refcounting patches.
>
> Patches 4-5 are for context to show how we are looking up the SPM
> memory and registering it with devmap.
>
> Patches 3,6-8 are the changes we are trying to upstream or rework to
> make them acceptable upstream.
>
> Patches 9-13 add ZONE_DEVICE Generic type support into the hmm test.
>
> Alex Sierra (11):
> kernel: resource: lookup_resource as exported symbol
> drm/amdkfd: add SPM support for SVM
> drm/amdkfd: generic type as sys mem on migration to ram
> include/linux/mm.h: helpers to check zone device generic type
> mm: add generic type support to migrate_vma helpers
> mm: call pgmap->ops->page_free for DEVICE_GENERIC pages
> lib: test_hmm add ioctl to get zone device type
> lib: test_hmm add module param for zone device type
> lib: add support for device generic type in test_hmm
> tools: update hmm-test to support device generic type
> tools: update test_hmm script to support SP config
>
> Ralph Campbell (2):
> ext4/xfs: add page refcount helper
> mm: remove extra ZONE_DEVICE struct page refcount
>
> arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +-
> drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 22 ++-
> drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
> fs/dax.c | 8 +-
> fs/ext4/inode.c | 5 +-
> fs/fuse/dax.c | 4 +-
> fs/xfs/xfs_file.c | 4 +-
> include/linux/dax.h | 10 +
> include/linux/memremap.h | 7 +-
> include/linux/mm.h | 54 +-----
> kernel/resource.c | 1 +
> lib/test_hmm.c | 231 +++++++++++++++--------
> lib/test_hmm_uapi.h | 16 ++
> mm/internal.h | 8 +
> mm/memremap.c | 69 ++-----
> mm/migrate.c | 25 +--
> mm/page_alloc.c | 3 +
> mm/swap.c | 45 +----
> tools/testing/selftests/vm/hmm-tests.c | 142 ++++++++++++--
> tools/testing/selftests/vm/test_hmm.sh | 20 +-
> 20 files changed, 405 insertions(+), 273 deletions(-)
>
> --
> 2.32.0
---end quoted text---

Subject: [PATCH v5 07/13] mm: add generic type support to migrate_vma helpers

Device generic type case added for migrate_vma_pages and
migrate_vma_check_page helpers.
Both, generic and private device types have the same
conditions to decide to migrate pages from/to device
memory.

Signed-off-by: Alex Sierra <[email protected]>
---
mm/migrate.c | 20 +++++++++-----------
1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 8c2430d3e77b..7bac06ae831e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2602,7 +2602,7 @@ static bool migrate_vma_check_page(struct page *page)
* FIXME proper solution is to rework migration_entry_wait() so
* it does not need to take a reference on page.
*/
- return is_device_private_page(page);
+ return is_device_page(page);
}

/* For file back page */
@@ -2891,7 +2891,7 @@ EXPORT_SYMBOL(migrate_vma_setup);
* handle_pte_fault()
* do_anonymous_page()
* to map in an anonymous zero page but the struct page will be a ZONE_DEVICE
- * private page.
+ * private or generic page.
*/
static void migrate_vma_insert_page(struct migrate_vma *migrate,
unsigned long addr,
@@ -2956,13 +2956,11 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
*/
__SetPageUptodate(page);

- if (is_zone_device_page(page)) {
- if (is_device_private_page(page)) {
- swp_entry_t swp_entry;
+ if (is_device_private_page(page)) {
+ swp_entry_t swp_entry;

- swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE);
- entry = swp_entry_to_pte(swp_entry);
- }
+ swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE);
+ entry = swp_entry_to_pte(swp_entry);
} else {
entry = mk_pte(page, vma->vm_page_prot);
if (vma->vm_flags & VM_WRITE)
@@ -3064,10 +3062,10 @@ void migrate_vma_pages(struct migrate_vma *migrate)
mapping = page_mapping(page);

if (is_zone_device_page(newpage)) {
- if (is_device_private_page(newpage)) {
+ if (is_device_page(newpage)) {
/*
- * For now only support private anonymous when
- * migrating to un-addressable device memory.
+ * For now only support private and generic
+ * anonymous when migrating to device memory.
*/
if (mapping) {
migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
--
2.32.0

Subject: [PATCH v5 10/13] lib: test_hmm add module param for zone device type

In order to configure device generic in test_hmm, two
module parameters should be passed, which correspon to the
SP start address of each device (2) spm_addr_dev0 &
spm_addr_dev1. If no parameters are passed, private device
type is configured.

v5:
Remove devmem->pagemap.type = MEMORY_DEVICE_PRIVATE at
dmirror_allocate_chunk that was forcing to configure pagemap.type
to MEMORY_DEVICE_PRIVATE

Signed-off-by: Alex Sierra <[email protected]>
---
lib/test_hmm.c | 46 ++++++++++++++++++++++++++++++++-------------
lib/test_hmm_uapi.h | 1 +
2 files changed, 34 insertions(+), 13 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 3cd91ca31dd7..9283ad1f4280 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -33,6 +33,16 @@
#define DEVMEM_CHUNK_SIZE (256 * 1024 * 1024U)
#define DEVMEM_CHUNKS_RESERVE 16

+static unsigned long spm_addr_dev0;
+module_param(spm_addr_dev0, long, 0644);
+MODULE_PARM_DESC(spm_addr_dev0,
+ "Specify start address for SPM (special purpose memory) used for device 0. By setting this Generic device type will be used. Make sure spm_addr_dev1 is set too");
+
+static unsigned long spm_addr_dev1;
+module_param(spm_addr_dev1, long, 0644);
+MODULE_PARM_DESC(spm_addr_dev1,
+ "Specify start address for SPM (special purpose memory) used for device 1. By setting this Generic device type will be used. Make sure spm_addr_dev0 is set too");
+
static const struct dev_pagemap_ops dmirror_devmem_ops;
static const struct mmu_interval_notifier_ops dmirror_min_ops;
static dev_t dmirror_dev;
@@ -450,11 +460,11 @@ static int dmirror_write(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd)
return ret;
}

-static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
+static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
struct page **ppage)
{
struct dmirror_chunk *devmem;
- struct resource *res;
+ struct resource *res = NULL;
unsigned long pfn;
unsigned long pfn_first;
unsigned long pfn_last;
@@ -462,15 +472,26 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,

devmem = kzalloc(sizeof(*devmem), GFP_KERNEL);
if (!devmem)
- return false;
+ return -ENOMEM;
+
+ if (!spm_addr_dev0 && !spm_addr_dev1) {
+ res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
+ "hmm_dmirror");
+ devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+ mdevice->zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
+ } else if (spm_addr_dev0 && spm_addr_dev1) {
+ res = lookup_resource(&iomem_resource, MINOR(mdevice->cdevice.dev) ?
+ spm_addr_dev0 :
+ spm_addr_dev1);
+ devmem->pagemap.type = MEMORY_DEVICE_GENERIC;
+ mdevice->zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_GENERIC;
+ } else {
+ pr_err("Both spm_addr_dev parameters should be set\n");
+ }

- res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
- "hmm_dmirror");
if (IS_ERR(res))
goto err_devmem;

- mdevice->zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
- devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
devmem->pagemap.range.start = res->start;
devmem->pagemap.range.end = res->end;
devmem->pagemap.nr_range = 1;
@@ -1097,10 +1118,8 @@ static int dmirror_device_init(struct dmirror_device *mdevice, int id)
if (ret)
return ret;

- /* Build a list of free ZONE_DEVICE private struct pages */
- dmirror_allocate_chunk(mdevice, NULL);
-
- return 0;
+ /* Build a list of free ZONE_DEVICE struct pages */
+ return dmirror_allocate_chunk(mdevice, NULL);
}

static void dmirror_device_remove(struct dmirror_device *mdevice)
@@ -1113,8 +1132,9 @@ static void dmirror_device_remove(struct dmirror_device *mdevice)
mdevice->devmem_chunks[i];

memunmap_pages(&devmem->pagemap);
- release_mem_region(devmem->pagemap.range.start,
- range_len(&devmem->pagemap.range));
+ if (devmem->pagemap.type == MEMORY_DEVICE_PRIVATE)
+ release_mem_region(devmem->pagemap.range.start,
+ range_len(&devmem->pagemap.range));
kfree(devmem);
}
kfree(mdevice->devmem_chunks);
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index ee88701793d5..17a6b5059871 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -65,6 +65,7 @@ enum {
enum {
/* 0 is reserved to catch uninitialized type fields */
HMM_DMIRROR_MEMORY_DEVICE_PRIVATE = 1,
+ HMM_DMIRROR_MEMORY_DEVICE_GENERIC,
};

#endif /* _LIB_TEST_HMM_UAPI_H */
--
2.32.0

Subject: [PATCH v5 08/13] mm: call pgmap->ops->page_free for DEVICE_GENERIC pages

Add MEMORY_DEVICE_GENERIC case to free_zone_device_page callback.
Device generic type memory case is now able to free its pages properly.

Signed-off-by: Alex Sierra <[email protected]>
---
mm/memremap.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/memremap.c b/mm/memremap.c
index 614b3d600e95..6c884e2542a9 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -438,7 +438,7 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
EXPORT_SYMBOL_GPL(get_dev_pagemap);

#ifdef CONFIG_DEV_PAGEMAP_OPS
-static void free_device_private_page(struct page *page)
+static void free_device_page(struct page *page)
{

__ClearPageWaiters(page);
@@ -477,7 +477,8 @@ void free_zone_device_page(struct page *page)
wake_up_var(&page->_refcount);
return;
case MEMORY_DEVICE_PRIVATE:
- free_device_private_page(page);
+ case MEMORY_DEVICE_GENERIC:
+ free_device_page(page);
return;
default:
return;
--
2.32.0

Subject: [PATCH v5 09/13] lib: test_hmm add ioctl to get zone device type

new ioctl cmd added to query zone device type. This will be
used once the test_hmm adds zone device generic type.

Signed-off-by: Alex Sierra <[email protected]>
---
lib/test_hmm.c | 15 ++++++++++++++-
lib/test_hmm_uapi.h | 7 +++++++
2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 6998f10350ea..3cd91ca31dd7 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -82,6 +82,7 @@ struct dmirror_chunk {
struct dmirror_device {
struct cdev cdevice;
struct hmm_devmem *devmem;
+ unsigned int zone_device_type;

unsigned int devmem_capacity;
unsigned int devmem_count;
@@ -468,6 +469,7 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
if (IS_ERR(res))
goto err_devmem;

+ mdevice->zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
devmem->pagemap.range.start = res->start;
devmem->pagemap.range.end = res->end;
@@ -912,6 +914,15 @@ static int dmirror_snapshot(struct dmirror *dmirror,
return ret;
}

+static int dmirror_get_device_type(struct dmirror *dmirror,
+ struct hmm_dmirror_cmd *cmd)
+{
+ mutex_lock(&dmirror->mutex);
+ cmd->zone_device_type = dmirror->mdevice->zone_device_type;
+ mutex_unlock(&dmirror->mutex);
+
+ return 0;
+}
static long dmirror_fops_unlocked_ioctl(struct file *filp,
unsigned int command,
unsigned long arg)
@@ -952,7 +963,9 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
case HMM_DMIRROR_SNAPSHOT:
ret = dmirror_snapshot(dmirror, &cmd);
break;
-
+ case HMM_DMIRROR_GET_MEM_DEV_TYPE:
+ ret = dmirror_get_device_type(dmirror, &cmd);
+ break;
default:
return -EINVAL;
}
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index 670b4ef2a5b6..ee88701793d5 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -26,6 +26,7 @@ struct hmm_dmirror_cmd {
__u64 npages;
__u64 cpages;
__u64 faults;
+ __u64 zone_device_type;
};

/* Expose the address space of the calling process through hmm device file */
@@ -33,6 +34,7 @@ struct hmm_dmirror_cmd {
#define HMM_DMIRROR_WRITE _IOWR('H', 0x01, struct hmm_dmirror_cmd)
#define HMM_DMIRROR_MIGRATE _IOWR('H', 0x02, struct hmm_dmirror_cmd)
#define HMM_DMIRROR_SNAPSHOT _IOWR('H', 0x03, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_GET_MEM_DEV_TYPE _IOWR('H', 0x04, struct hmm_dmirror_cmd)

/*
* Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
@@ -60,4 +62,9 @@ enum {
HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE = 0x30,
};

+enum {
+ /* 0 is reserved to catch uninitialized type fields */
+ HMM_DMIRROR_MEMORY_DEVICE_PRIVATE = 1,
+};
+
#endif /* _LIB_TEST_HMM_UAPI_H */
--
2.32.0

Subject: Re: [PATCH v5 00/13] Support DEVICE_GENERIC memory in migrate_vma_*


On 8/12/2021 1:34 AM, Christoph Hellwig wrote:
> Do you have a pointer to a git branch with this series and all dependencies
> to ease testing?

Hi Christoph,

Yes, actually tomorrow we're planning to send a new version with
detailed instructions

on how to setup and run each of the tests. This will also contain a link
to download our

kernel repo with all these patches.

Regards,

Alejandro S.

>
> On Thu, Aug 12, 2021 at 01:30:47AM -0500, Alex Sierra wrote:
>> v1:
>> AMD is building a system architecture for the Frontier supercomputer with a
>> coherent interconnect between CPUs and GPUs. This hardware architecture allows
>> the CPUs to coherently access GPU device memory. We have hardware in our labs
>> and we are working with our partner HPE on the BIOS, firmware and software
>> for delivery to the DOE.
>>
>> The system BIOS advertises the GPU device memory (aka VRAM) as SPM
>> (special purpose memory) in the UEFI system address map. The amdgpu driver looks
>> it up with lookup_resource and registers it with devmap as MEMORY_DEVICE_GENERIC
>> using devm_memremap_pages.
>>
>> Now we're trying to migrate data to and from that memory using the migrate_vma_*
>> helpers so we can support page-based migration in our unified memory allocations,
>> while also supporting CPU access to those pages.
>>
>> This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages behave
>> correctly in the migrate_vma_* helpers. We are looking for feedback about this
>> approach. If we're close, what's needed to make our patches acceptable upstream?
>> If we're not close, any suggestions how else to achieve what we are trying to do
>> (i.e. page migration and coherent CPU access to VRAM)?
>>
>> This work is based on HMM and our SVM memory manager that was recently upstreamed
>> to Dave Airlie's drm-next branch
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcgit.freedesktop.org%2Fdrm%2Fdrm%2Flog%2F%3Fh%3Ddrm-next&amp;data=04%7C01%7Calex.sierra%40amd.com%7Ce4e14caf03de403b1e2208d95d5b378b%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637643468663206442%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=W7NRFXF0t1bqAzsV1RZh4MAPly26c7XPn8kCKvaj%2FMw%3D&amp;reserved=0
>> On top of that we did some rework of our VRAM management for migrations to remove
>> some incorrect assumptions, allow partially successful migrations and GPU memory
>> mappings that mix pages in VRAM and system memory.
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fdri-devel%2F20210527205606.2660-6-Felix.Kuehling%40amd.com%2FT%2F%23r996356015e295780eb50453e7dbd5d0d68b47cbc&amp;data=04%7C01%7Calex.sierra%40amd.com%7Ce4e14caf03de403b1e2208d95d5b378b%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637643468663206442%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=GZZrxedpAlYLfDgO1Y3jvpFacG313SXiUT3ErB4xO0Q%3D&amp;reserved=0
>>
>> v2:
>> This patch series version has merged "[RFC PATCH v3 0/2]
>> mm: remove extra ZONE_DEVICE struct page refcount" patch series made by
>> Ralph Campbell. It also applies at the top of these series, our changes
>> to support device generic type in migration_vma helpers.
>> This has been tested in systems with device memory that has coherent
>> access by CPU.
>>
>> Also addresses the following feedback made in v1:
>> - Isolate in one patch kernel/resource.c modification, based
>> on Christoph's feedback.
>> - Add helpers check for generic and private type to avoid
>> duplicated long lines.
>>
>> v3:
>> - Include cover letter from v1.
>> - Rename dax_layout_is_idle_page func to dax_page_unused in patch
>> ext4/xfs: add page refcount helper.
>>
>> v4:
>> - Add support for zone device generic type in lib/test_hmm and
>> tool/testing/selftest/vm/hmm-tests.
>> - Add missing page refcount helper to fuse/dax.c. This was included in
>> one of Ralph Campbell's patches.
>>
>> v5:
>> - Cosmetic changes on patches 3, 5 and 13
>> - Bug founded at test_hmm, remove devmem->pagemap.type = MEMORY_DEVICE_PRIVATE
>> at dmirror_allocate_chunk that was forcing to configure pagemap.type to
>> MEMORY_DEVICE_PRIVATE.
>> - A bug was found while running one of the xfstest (generic/413) used to
>> validate fs_dax device type. This was first introduced by patch: "mm: remove
>> extra ZONE_DEVICE struct page refcount" whic is part of these patch series.
>> The bug was showed as WARNING message at try_grab_page function call, due to
>> a page refcounter equal to zero. Part of "mm: remove extra ZONE_DEVICE struct
>> page refcount" changes, was to initialize page refcounter to zero. Therefore,
>> a special condition was added to try_grab_page on this v5, were it checks for
>> device zone pages too. It is included in the same patch.
>>
>> This is how mm changes from these patch series have been validated:
>> - hmm-tests were run using device private and device generic types. This last,
>> just added in these patch series. efi_fake_mem was used to mimic SPM memory
>> for device generic.
>> - xfstests tool was used to validate fs-dax device type and page refcounter
>> changes. DAX configuration was used along with emulated Persisten Memory set as
>> memmap=4G!4G memmap=4G!9G. xfstests were run from ext4 and generic lists. Some
>> of them, did not run due to limitations in configuration. Ex. test not
>> supporting specific file system or DAX mode.
>> Only three tests failed, generic/356/357 and ext4/049. However, these failures
>> were consistent before and after applying these patch series.
>> xfstest configuration:
>> TEST_DEV=/dev/pmem0
>> TEST_DIR=/mnt/ram0
>> SCRATCH_DEV=/dev/pmem1
>> SCRATCH_MNT=/mnt/ram1
>> TEST_FS_MOUNT_OPTS="-o dax"
>> EXT_MOUNT_OPTIONS="-o dax"
>> MKFS_OPTIONS="-b4096"
>> xfstest passed list:
>> Ext4:
>> 001,003,005,021,022,023,025,026,030,031,032,036,037,038,042,043,044,271,306
>> Generic:
>> 1,2,3,4,5,6,7,8,9,11,12,13,14,15,16,20,21,22,23,24,25,28,29,30,31,32,33,35,37,
>> 50,52,53,58,60,61,62,63,64,67,69,70,71,75,76,78,79,80,82,84,86,87,88,91,92,94,
>> 96,97,98,99,103,105,112,113,114,117,120,124,126,129,130,131,135,141,169,184,
>> 198,207,210,211,212,213,214,215,221,223,225,228,236,237,240,244,245,246,247,
>> 248,249,255,257,258,263,277,286,294,306,307,308,309,313,315,316,318,319,337,
>> 346,360,361,371,375,377,379,380,383,384,385,386,389,391,392,393,394,400,401,
>> 403,404,406,409,410,411,412,413,417,420,422,423,424,425,426,427,428
>>
>> Patches 1-2 Rebased Ralph Campbell's ZONE_DEVICE page refcounting patches.
>>
>> Patches 4-5 are for context to show how we are looking up the SPM
>> memory and registering it with devmap.
>>
>> Patches 3,6-8 are the changes we are trying to upstream or rework to
>> make them acceptable upstream.
>>
>> Patches 9-13 add ZONE_DEVICE Generic type support into the hmm test.
>>
>> Alex Sierra (11):
>> kernel: resource: lookup_resource as exported symbol
>> drm/amdkfd: add SPM support for SVM
>> drm/amdkfd: generic type as sys mem on migration to ram
>> include/linux/mm.h: helpers to check zone device generic type
>> mm: add generic type support to migrate_vma helpers
>> mm: call pgmap->ops->page_free for DEVICE_GENERIC pages
>> lib: test_hmm add ioctl to get zone device type
>> lib: test_hmm add module param for zone device type
>> lib: add support for device generic type in test_hmm
>> tools: update hmm-test to support device generic type
>> tools: update test_hmm script to support SP config
>>
>> Ralph Campbell (2):
>> ext4/xfs: add page refcount helper
>> mm: remove extra ZONE_DEVICE struct page refcount
>>
>> arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +-
>> drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 22 ++-
>> drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
>> fs/dax.c | 8 +-
>> fs/ext4/inode.c | 5 +-
>> fs/fuse/dax.c | 4 +-
>> fs/xfs/xfs_file.c | 4 +-
>> include/linux/dax.h | 10 +
>> include/linux/memremap.h | 7 +-
>> include/linux/mm.h | 54 +-----
>> kernel/resource.c | 1 +
>> lib/test_hmm.c | 231 +++++++++++++++--------
>> lib/test_hmm_uapi.h | 16 ++
>> mm/internal.h | 8 +
>> mm/memremap.c | 69 ++-----
>> mm/migrate.c | 25 +--
>> mm/page_alloc.c | 3 +
>> mm/swap.c | 45 +----
>> tools/testing/selftests/vm/hmm-tests.c | 142 ++++++++++++--
>> tools/testing/selftests/vm/test_hmm.sh | 20 +-
>> 20 files changed, 405 insertions(+), 273 deletions(-)
>>
>> --
>> 2.32.0
> ---end quoted text---