v1:
AMD is building a system architecture for the Frontier supercomputer with a
coherent interconnect between CPUs and GPUs. This hardware architecture allows
the CPUs to coherently access GPU device memory. We have hardware in our labs
and we are working with our partner HPE on the BIOS, firmware and software
for delivery to the DOE.
The system BIOS advertises the GPU device memory (aka VRAM) as SPM
(special purpose memory) in the UEFI system address map. The amdgpu driver looks
it up with lookup_resource and registers it with devmap as MEMORY_DEVICE_GENERIC
using devm_memremap_pages.
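For reference, the registration flow looks roughly like this (a minimal sketch
with placeholder names, not the actual amdgpu code; the real change is in the
"drm/amdkfd: add SPM support for SVM" patch below):

#include <linux/device.h>
#include <linux/err.h>
#include <linux/ioport.h>
#include <linux/memremap.h>

static void spm_page_free(struct page *page)
{
        /* return the page to the driver's VRAM allocator */
}

static const struct dev_pagemap_ops spm_pgmap_ops = {
        .page_free = spm_page_free,
};

static struct dev_pagemap spm_pgmap;

static int spm_register_vram(struct device *dev, resource_size_t aper_base)
{
        struct resource *res;
        void *addr;

        /* the VRAM range the system BIOS advertised as SPM in the UEFI map */
        res = lookup_resource(&iomem_resource, aper_base);
        if (!res)
                return -ENODEV;

        spm_pgmap.type = MEMORY_DEVICE_GENERIC;
        spm_pgmap.nr_range = 1;
        spm_pgmap.range.start = res->start;
        spm_pgmap.range.end = res->end;
        spm_pgmap.ops = &spm_pgmap_ops;
        spm_pgmap.owner = dev;

        /* creates struct pages for the VRAM so the CPU can access it coherently */
        addr = devm_memremap_pages(dev, &spm_pgmap);
        return IS_ERR(addr) ? PTR_ERR(addr) : 0;
}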
Now we're trying to migrate data to and from that memory using the migrate_vma_*
helpers so we can support page-based migration in our unified memory allocations,
while also supporting CPU access to those pages.
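The driver-side flow we have in mind looks roughly like this (a sketch only;
alloc_locked_vram_page() and copy_to_vram() are hypothetical placeholders for
the device-specific parts, and error handling is omitted):

#include <linux/migrate.h>
#include <linux/mm.h>
#include <linux/slab.h>

static int migrate_range_to_vram(struct vm_area_struct *vma,
                                 unsigned long start, unsigned long end,
                                 void *pgmap_owner)
{
        unsigned long npages = (end - start) >> PAGE_SHIFT;
        struct migrate_vma args = {
                .vma            = vma,
                .start          = start,
                .end            = end,
                .pgmap_owner    = pgmap_owner,
                .flags          = MIGRATE_VMA_SELECT_SYSTEM,
        };
        unsigned long i;
        int ret;

        args.src = kcalloc(npages, sizeof(*args.src), GFP_KERNEL);
        args.dst = kcalloc(npages, sizeof(*args.dst), GFP_KERNEL);

        ret = migrate_vma_setup(&args);
        if (ret)
                goto out;

        for (i = 0; i < npages; i++) {
                struct page *dpage;

                if (!(args.src[i] & MIGRATE_PFN_MIGRATE))
                        continue;
                /* hypothetical: returns a locked MEMORY_DEVICE_GENERIC page */
                dpage = alloc_locked_vram_page();
                /* hypothetical: device copy from the old system page */
                copy_to_vram(dpage, migrate_pfn_to_page(args.src[i]));
                args.dst[i] = migrate_pfn(page_to_pfn(dpage)) |
                              MIGRATE_PFN_LOCKED;
        }

        /* point the CPU page table entries at the coherent device pages */
        migrate_vma_pages(&args);
        migrate_vma_finalize(&args);
out:
        kfree(args.src);
        kfree(args.dst);
        return ret;
}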
This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages behave
correctly in the migrate_vma_* helpers. We are looking for feedback about this
approach. If we're close, what's needed to make our patches acceptable upstream?
If we're not close, any suggestions how else to achieve what we are trying to do
(i.e. page migration and coherent CPU access to VRAM)?
This work is based on HMM and our SVM memory manager that was recently upstreamed
to Dave Airlie's drm-next branch
https://lore.kernel.org/dri-devel/[email protected]/T/#r996356015e295780eb50453e7dbd5d0d68b47cbc
On top of that we did some rework of our VRAM management for migrations, to remove
some incorrect assumptions, allow partially successful migrations, and support GPU
memory mappings that mix pages in VRAM and system memory.
https://patchwork.kernel.org/project/dri-devel/list/?series=489811
v2:
This version of the patch series merges Ralph Campbell's "[RFC PATCH v3 0/2]
mm: remove extra ZONE_DEVICE struct page refcount" patch series. Our changes to
support the device generic type in the migrate_vma helpers are applied on top
of that series.
This has been tested on systems with device memory that is coherently
accessible by the CPU.
It also addresses the following feedback from v1:
- Isolate the kernel/resource.c modification in its own patch, based
on Christoph's feedback.
- Add helpers to check for the generic and private types to avoid
duplicating long lines.
v3:
- Include cover letter from v1
- Rename the dax_layout_is_idle_page function to dax_page_unused in the
"ext4/xfs: add page refcount helper" patch
Patches 1-2 are Ralph Campbell's ZONE_DEVICE page refcounting patches, rebased.
Patches 4-5 are for context to show how we are looking up the SPM
memory and registering it with devmap.
Patches 3 and 6-8 are the changes we are trying to upstream, or to rework
until they are acceptable upstream.
Alex Sierra (6):
kernel: resource: lookup_resource as exported symbol
drm/amdkfd: add SPM support for SVM
drm/amdkfd: generic type as sys mem on migration to ram
include/linux/mm.h: helpers to check zone device generic type
mm: add generic type support to migrate_vma helpers
mm: call pgmap->ops->page_free for DEVICE_GENERIC pages
Ralph Campbell (2):
ext4/xfs: add page refcount helper
mm: remove extra ZONE_DEVICE struct page refcount
arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +-
drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 15 ++++--
drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
fs/dax.c | 8 +--
fs/ext4/inode.c | 5 +-
fs/xfs/xfs_file.c | 4 +-
include/linux/dax.h | 10 ++++
include/linux/memremap.h | 7 +--
include/linux/mm.h | 52 +++---------------
kernel/resource.c | 2 +-
lib/test_hmm.c | 2 +-
mm/internal.h | 8 +++
mm/memremap.c | 69 +++++++-----------------
mm/migrate.c | 13 ++---
mm/page_alloc.c | 3 ++
mm/swap.c | 45 ++--------------
16 files changed, 83 insertions(+), 164 deletions(-)
--
2.17.1
From: Ralph Campbell <[email protected]>
ZONE_DEVICE struct pages have an extra reference count that complicates the
code for put_page() and several places in the kernel that need to check the
reference count to see that a page is not being used (gup, compaction,
migration, etc.). Clean up the code so the reference count doesn't need to
be treated specially for ZONE_DEVICE.
v2:
AS: merged this patch in linux 5.11 version
Signed-off-by: Ralph Campbell <[email protected]>
Signed-off-by: Alex Sierra <[email protected]>
---
arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +-
drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
fs/dax.c | 4 +-
include/linux/dax.h | 2 +-
include/linux/memremap.h | 7 +--
include/linux/mm.h | 44 -----------------
lib/test_hmm.c | 2 +-
mm/internal.h | 8 +++
mm/memremap.c | 68 +++++++-------------------
mm/migrate.c | 5 --
mm/page_alloc.c | 3 ++
mm/swap.c | 45 ++---------------
12 files changed, 45 insertions(+), 147 deletions(-)
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 84e5a2dc8be5..acee67710620 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -711,7 +711,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)
dpage = pfn_to_page(uvmem_pfn);
dpage->zone_device_data = pvt;
- get_page(dpage);
+ init_page_count(dpage);
lock_page(dpage);
return dpage;
out_clear:
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 92987daa5e17..8bc7120e1216 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -324,7 +324,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
return NULL;
}
- get_page(page);
+ init_page_count(page);
lock_page(page);
return page;
}
diff --git a/fs/dax.c b/fs/dax.c
index 321f4ddc6643..7b4c6b35b098 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -560,14 +560,14 @@ static void *grab_mapping_entry(struct xa_state *xas,
/**
* dax_layout_busy_page_range - find first pinned page in @mapping
- * @mapping: address space to scan for a page with ref count > 1
+ * @mapping: address space to scan for a page with ref count > 0
* @start: Starting offset. Page containing 'start' is included.
* @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX,
* pages from 'start' till the end of file are included.
*
* DAX requires ZONE_DEVICE mapped pages. These pages are never
* 'onlined' to the page allocator so they are considered idle when
- * page->count == 1. A filesystem uses this interface to determine if
+ * page->count == 0. A filesystem uses this interface to determine if
* any page in the mapping is busy, i.e. for DMA, or other
* get_user_pages() usages.
*
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 8b5da1d60dbc..05fc982ce153 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -245,7 +245,7 @@ static inline bool dax_mapping(struct address_space *mapping)
static inline bool dax_page_unused(struct page *page)
{
- return page_ref_count(page) == 1;
+ return page_ref_count(page) == 0;
}
#define dax_wait_page(_inode, _page, _wait_cb) \
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 79c49e7f5c30..327f32427d21 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -66,9 +66,10 @@ enum memory_type {
struct dev_pagemap_ops {
/*
- * Called once the page refcount reaches 1. (ZONE_DEVICE pages never
- * reach 0 refcount unless there is a refcount bug. This allows the
- * device driver to implement its own memory management.)
+ * Called once the page refcount reaches 0. The reference count
+ * should be reset to one with init_page_count(page) before reusing
+ * the page. This allows the device driver to implement its own
+ * memory management.
*/
void (*page_free)(struct page *page);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c9900aedc195..d8d79bb94be8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1117,39 +1117,6 @@ static inline bool is_zone_device_page(const struct page *page)
}
#endif
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-void free_devmap_managed_page(struct page *page);
-DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-
-static inline bool page_is_devmap_managed(struct page *page)
-{
- if (!static_branch_unlikely(&devmap_managed_key))
- return false;
- if (!is_zone_device_page(page))
- return false;
- switch (page->pgmap->type) {
- case MEMORY_DEVICE_PRIVATE:
- case MEMORY_DEVICE_FS_DAX:
- return true;
- default:
- break;
- }
- return false;
-}
-
-void put_devmap_managed_page(struct page *page);
-
-#else /* CONFIG_DEV_PAGEMAP_OPS */
-static inline bool page_is_devmap_managed(struct page *page)
-{
- return false;
-}
-
-static inline void put_devmap_managed_page(struct page *page)
-{
-}
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
-
static inline bool is_device_private_page(const struct page *page)
{
return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
@@ -1196,17 +1163,6 @@ static inline void put_page(struct page *page)
{
page = compound_head(page);
- /*
- * For devmap managed pages we need to catch refcount transition from
- * 2 to 1, when refcount reach one it means the page is free and we
- * need to inform the device driver through callback. See
- * include/linux/memremap.h and HMM for details.
- */
- if (page_is_devmap_managed(page)) {
- put_devmap_managed_page(page);
- return;
- }
-
if (put_page_testzero(page))
__put_page(page);
}
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 80a78877bd93..6998f10350ea 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -561,7 +561,7 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
}
dpage->zone_device_data = rpage;
- get_page(dpage);
+ init_page_count(dpage);
lock_page(dpage);
return dpage;
diff --git a/mm/internal.h b/mm/internal.h
index 25d2b2439f19..d3e58746f2d2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -623,4 +623,12 @@ struct migration_target_control {
gfp_t gfp_mask;
};
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+void free_zone_device_page(struct page *page);
+#else
+static inline void free_zone_device_page(struct page *page)
+{
+}
+#endif
+
#endif /* __MM_INTERNAL_H */
diff --git a/mm/memremap.c b/mm/memremap.c
index 16b2fb482da1..614b3d600e95 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -12,6 +12,7 @@
#include <linux/types.h>
#include <linux/wait_bit.h>
#include <linux/xarray.h>
+#include "internal.h"
static DEFINE_XARRAY(pgmap_array);
@@ -37,32 +38,6 @@ unsigned long memremap_compat_align(void)
EXPORT_SYMBOL_GPL(memremap_compat_align);
#endif
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-DEFINE_STATIC_KEY_FALSE(devmap_managed_key);
-EXPORT_SYMBOL(devmap_managed_key);
-
-static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
-{
- if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
- pgmap->type == MEMORY_DEVICE_FS_DAX)
- static_branch_dec(&devmap_managed_key);
-}
-
-static void devmap_managed_enable_get(struct dev_pagemap *pgmap)
-{
- if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
- pgmap->type == MEMORY_DEVICE_FS_DAX)
- static_branch_inc(&devmap_managed_key);
-}
-#else
-static void devmap_managed_enable_get(struct dev_pagemap *pgmap)
-{
-}
-static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
-{
-}
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
-
static void pgmap_array_delete(struct range *range)
{
xa_store_range(&pgmap_array, PHYS_PFN(range->start), PHYS_PFN(range->end),
@@ -87,16 +62,6 @@ static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
return (range->start + range_len(range)) >> PAGE_SHIFT;
}
-static unsigned long pfn_next(unsigned long pfn)
-{
- if (pfn % 1024 == 0)
- cond_resched();
- return pfn + 1;
-}
-
-#define for_each_device_pfn(pfn, map, i) \
- for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(pfn))
-
static void dev_pagemap_kill(struct dev_pagemap *pgmap)
{
if (pgmap->ops && pgmap->ops->kill)
@@ -152,20 +117,18 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
void memunmap_pages(struct dev_pagemap *pgmap)
{
- unsigned long pfn;
int i;
dev_pagemap_kill(pgmap);
for (i = 0; i < pgmap->nr_range; i++)
- for_each_device_pfn(pfn, pgmap, i)
- put_page(pfn_to_page(pfn));
+ percpu_ref_put_many(pgmap->ref, pfn_end(pgmap, i) -
+ pfn_first(pgmap, i));
dev_pagemap_cleanup(pgmap);
for (i = 0; i < pgmap->nr_range; i++)
pageunmap_range(pgmap, i);
WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n");
- devmap_managed_enable_put(pgmap);
}
EXPORT_SYMBOL_GPL(memunmap_pages);
@@ -361,8 +324,6 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
}
}
- devmap_managed_enable_get(pgmap);
-
/*
* Clear the pgmap nr_range as it will be incremented for each
* successfully processed range. This communicates how many
@@ -477,16 +438,10 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
EXPORT_SYMBOL_GPL(get_dev_pagemap);
#ifdef CONFIG_DEV_PAGEMAP_OPS
-void free_devmap_managed_page(struct page *page)
+static void free_device_private_page(struct page *page)
{
- /* notify page idle for dax */
- if (!is_device_private_page(page)) {
- wake_up_var(&page->_refcount);
- return;
- }
__ClearPageWaiters(page);
-
mem_cgroup_uncharge(page);
/*
@@ -513,4 +468,19 @@ void free_devmap_managed_page(struct page *page)
page->mapping = NULL;
page->pgmap->ops->page_free(page);
}
+
+void free_zone_device_page(struct page *page)
+{
+ switch (page->pgmap->type) {
+ case MEMORY_DEVICE_FS_DAX:
+ /* notify page idle */
+ wake_up_var(&page->_refcount);
+ return;
+ case MEMORY_DEVICE_PRIVATE:
+ free_device_private_page(page);
+ return;
+ default:
+ return;
+ }
+}
#endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/migrate.c b/mm/migrate.c
index 20ca887ea769..8c2430d3e77b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -376,11 +376,6 @@ static int expected_page_refs(struct address_space *mapping, struct page *page)
{
int expected_count = 1;
- /*
- * Device private pages have an extra refcount as they are
- * ZONE_DEVICE pages.
- */
- expected_count += is_device_private_page(page);
if (mapping)
expected_count += thp_nr_pages(page) + page_has_private(page);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 519a60d5b6f7..4612c457d0b0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6210,6 +6210,9 @@ void __ref memmap_init_zone_device(struct zone *zone,
__init_single_page(page, pfn, zone_idx, nid);
+ /* ZONE_DEVICE pages start with a zero reference count. */
+ set_page_count(page, 0);
+
/*
* Mark page reserved as it will need to wait for onlining
* phase for it to be fully associated with a zone.
diff --git a/mm/swap.c b/mm/swap.c
index 2cca7141470c..0a12af814065 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -114,12 +114,11 @@ static void __put_compound_page(struct page *page)
void __put_page(struct page *page)
{
if (is_zone_device_page(page)) {
- put_dev_pagemap(page->pgmap);
-
/*
* The page belongs to the device that created pgmap. Do
* not return it to page allocator.
*/
+ free_zone_device_page(page);
return;
}
@@ -878,29 +877,18 @@ void release_pages(struct page **pages, int nr)
if (is_huge_zero_page(page))
continue;
+ if (!put_page_testzero(page))
+ continue;
+
if (is_zone_device_page(page)) {
if (lruvec) {
unlock_page_lruvec_irqrestore(lruvec, flags);
lruvec = NULL;
}
- /*
- * ZONE_DEVICE pages that return 'false' from
- * page_is_devmap_managed() do not require special
- * processing, and instead, expect a call to
- * put_page_testzero().
- */
- if (page_is_devmap_managed(page)) {
- put_devmap_managed_page(page);
- continue;
- }
- if (put_page_testzero(page))
- put_dev_pagemap(page->pgmap);
+ free_zone_device_page(page);
continue;
}
- if (!put_page_testzero(page))
- continue;
-
if (PageCompound(page)) {
if (lruvec) {
unlock_page_lruvec_irqrestore(lruvec, flags);
@@ -1142,26 +1130,3 @@ void __init swap_setup(void)
* _really_ don't want to cluster much more
*/
}
-
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-void put_devmap_managed_page(struct page *page)
-{
- int count;
-
- if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
- return;
-
- count = page_ref_dec_return(page);
-
- /*
- * devmap page refcounts are 1-based, rather than 0-based: if
- * refcount is 1, then the page is free and the refcount is
- * stable because nobody holds a reference on the page.
- */
- if (count == 1)
- free_devmap_managed_page(page);
- else if (!count)
- __put_page(page);
-}
-EXPORT_SYMBOL(put_devmap_managed_page);
-#endif
--
2.17.1
The AMD architecture for the Frontier supercomputer will
have device memory which can be coherently accessed by
the CPU. The system BIOS advertises this memory as SPM
(special purpose memory) in the UEFI system address map.
The AMDGPU driver needs to be able to look up this resource
in order to claim it as MEMORY_DEVICE_GENERIC using
devm_memremap_pages.
Signed-off-by: Alex Sierra <[email protected]>
---
kernel/resource.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/resource.c b/kernel/resource.c
index 627e61b0c124..269489bb7097 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -783,7 +783,7 @@ struct resource *lookup_resource(struct resource *root, resource_size_t start)
return res;
}
-
+EXPORT_SYMBOL_GPL(lookup_resource);
/*
* Insert a resource into the resource tree. If successful, return NULL,
* otherwise return the conflicting resource (compare to __request_resource())
--
2.17.1
When the CPU is connected through XGMI, it has coherent
access to the VRAM resource. In this case the resource
is looked up in the iomem resource table, using the
device's GMC aperture base as the starting address.
This resource is used, along with the device type (which
can be DEVICE_PRIVATE or DEVICE_GENERIC), to create the
device page map region.
Signed-off-by: Alex Sierra <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
---
drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index c8ca3252cbc2..f5939449a99f 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -895,6 +895,7 @@ int svm_migrate_init(struct amdgpu_device *adev)
struct resource *res;
unsigned long size;
void *r;
+ bool xgmi_connected_to_cpu = adev->gmc.xgmi.connected_to_cpu;
/* Page migration works on Vega10 or newer */
if (kfddev->device_info->asic_family < CHIP_VEGA10)
@@ -907,17 +908,22 @@ int svm_migrate_init(struct amdgpu_device *adev)
* should remove reserved size
*/
size = ALIGN(adev->gmc.real_vram_size, 2ULL << 20);
- res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
+ if (xgmi_connected_to_cpu)
+ res = lookup_resource(&iomem_resource, adev->gmc.aper_base);
+ else
+ res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
+
if (IS_ERR(res))
return -ENOMEM;
- pgmap->type = MEMORY_DEVICE_PRIVATE;
pgmap->nr_range = 1;
pgmap->range.start = res->start;
pgmap->range.end = res->end;
+ pgmap->type = xgmi_connected_to_cpu ?
+ MEMORY_DEVICE_GENERIC : MEMORY_DEVICE_PRIVATE;
pgmap->ops = &svm_migrate_pgmap_ops;
pgmap->owner = SVM_ADEV_PGMAP_OWNER(adev);
- pgmap->flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
+ pgmap->flags = 0;
r = devm_memremap_pages(adev->dev, pgmap);
if (IS_ERR(r)) {
pr_err("failed to register HMM device memory\n");
--
2.17.1
On migration from VRAM to RAM, generic device type memory
is accessible to the CPU in the same way as system RAM.
The migrate flags select the migration source, which in
the generic type case should be set to SYSTEM.
Signed-off-by: Alex Sierra <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
---
drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index f5939449a99f..7b41006c1164 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -653,8 +653,9 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
migrate.vma = vma;
migrate.start = start;
migrate.end = end;
- migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
migrate.pgmap_owner = SVM_ADEV_PGMAP_OWNER(adev);
+ migrate.flags = adev->gmc.xgmi.connected_to_cpu ?
+ MIGRATE_VMA_SELECT_SYSTEM : MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
size = 2 * sizeof(*migrate.src) + sizeof(uint64_t) + sizeof(dma_addr_t);
size *= npages;
--
2.17.1
Add two helpers: one checks whether a zone device page is of
generic type; the other checks whether a page is of either
private or generic type.
Signed-off-by: Alex Sierra <[email protected]>
---
include/linux/mm.h | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d8d79bb94be8..f5b247a63044 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1125,6 +1125,14 @@ static inline bool is_device_private_page(const struct page *page)
page->pgmap->type == MEMORY_DEVICE_PRIVATE;
}
+static inline bool is_device_page(const struct page *page)
+{
+ return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
+ is_zone_device_page(page) &&
+ (page->pgmap->type == MEMORY_DEVICE_PRIVATE ||
+ page->pgmap->type == MEMORY_DEVICE_GENERIC);
+}
+
static inline bool is_pci_p2pdma_page(const struct page *page)
{
return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
--
2.17.1
Add a MEMORY_DEVICE_GENERIC case to the free_zone_device_page
callback, so that device generic type memory can now free its
pages properly.
Signed-off-by: Alex Sierra <[email protected]>
---
mm/memremap.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/mm/memremap.c b/mm/memremap.c
index 614b3d600e95..6c884e2542a9 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -438,7 +438,7 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
EXPORT_SYMBOL_GPL(get_dev_pagemap);
#ifdef CONFIG_DEV_PAGEMAP_OPS
-static void free_device_private_page(struct page *page)
+static void free_device_page(struct page *page)
{
__ClearPageWaiters(page);
@@ -477,7 +477,8 @@ void free_zone_device_page(struct page *page)
wake_up_var(&page->_refcount);
return;
case MEMORY_DEVICE_PRIVATE:
- free_device_private_page(page);
+ case MEMORY_DEVICE_GENERIC:
+ free_device_page(page);
return;
default:
return;
--
2.17.1
Add the device generic type case to the migrate_vma_pages and
migrate_vma_check_page helpers.
Both the generic and private device types use the same
conditions to decide whether to migrate pages to/from device
memory.
Signed-off-by: Alex Sierra <[email protected]>
---
mm/migrate.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 8c2430d3e77b..3b6aaba96fe6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2602,7 +2602,7 @@ static bool migrate_vma_check_page(struct page *page)
* FIXME proper solution is to rework migration_entry_wait() so
* it does not need to take a reference on page.
*/
- return is_device_private_page(page);
+ return is_device_page(page);
}
/* For file back page */
@@ -3064,10 +3064,10 @@ void migrate_vma_pages(struct migrate_vma *migrate)
mapping = page_mapping(page);
if (is_zone_device_page(newpage)) {
- if (is_device_private_page(newpage)) {
+ if (is_device_page(newpage)) {
/*
- * For now only support private anonymous when
- * migrating to un-addressable device memory.
+ * For now only support private and generic
+ * anonymous when migrating to device memory.
*/
if (mapping) {
migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
--
2.17.1
On 6/17/2021 10:16 AM, Alex Sierra wrote:
> v1:
> AMD is building a system architecture for the Frontier supercomputer with a
> coherent interconnect between CPUs and GPUs. This hardware architecture allows
> the CPUs to coherently access GPU device memory. We have hardware in our labs
> and we are working with our partner HPE on the BIOS, firmware and software
> for delivery to the DOE.
>
> The system BIOS advertises the GPU device memory (aka VRAM) as SPM
> (special purpose memory) in the UEFI system address map. The amdgpu driver looks
> it up with lookup_resource and registers it with devmap as MEMORY_DEVICE_GENERIC
> using devm_memremap_pages.
>
> Now we're trying to migrate data to and from that memory using the migrate_vma_*
> helpers so we can support page-based migration in our unified memory allocations,
> while also supporting CPU access to those pages.
>
> This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages behave
> correctly in the migrate_vma_* helpers. We are looking for feedback about this
> approach. If we're close, what's needed to make our patches acceptable upstream?
> If we're not close, any suggestions how else to achieve what we are trying to do
> (i.e. page migration and coherent CPU access to VRAM)?
>
> This work is based on HMM and our SVM memory manager that was recently upstreamed
> to Dave Airlie's drm-next branch
> https://lore.kernel.org/dri-devel/[email protected]/T/#r996356015e295780eb50453e7dbd5d0d68b47cbc
Corrected link:
https://cgit.freedesktop.org/drm/drm/log/?h=drm-next
Regards,
Alex Sierra
> On top of that we did some rework of our VRAM management for migrations to remove
> some incorrect assumptions, allow partially successful migrations and GPU memory
> mappings that mix pages in VRAM and system memory.
> https://patchwork.kernel.org/project/dri-devel/list/?series=489811
Corrected link:
https://lore.kernel.org/dri-devel/[email protected]/T/#r996356015e295780eb50453e7dbd5d0d68b47cbc
Regards,
Alex Sierra
>
> v2:
> This patch series version has merged "[RFC PATCH v3 0/2]
> mm: remove extra ZONE_DEVICE struct page refcount" patch series made by
> Ralph Campbell. It also applies at the top of these series, our changes
> to support device generic type in migration_vma helpers.
> This has been tested in systems with device memory that has coherent
> access by CPU.
>
> Also addresses the following feedback made in v1:
> - Isolate in one patch kernel/resource.c modification, based
> on Christoph's feedback.
> - Add helpers check for generic and private type to avoid
> duplicated long lines.
>
> v3:
> - Include cover letter from v1
> - Rename dax_layout_is_idle_page func to dax_page_unused in patch
> ext4/xfs: add page refcount helper
>
> Patches 1-2 Rebased Ralph Campbell's ZONE_DEVICE page refcounting patches
> Patches 4-5 are for context to show how we are looking up the SPM
> memory and registering it with devmap.
> Patches 3,6-8 are the changes we are trying to upstream or rework to
> make them acceptable upstream.
>
> Alex Sierra (6):
> kernel: resource: lookup_resource as exported symbol
> drm/amdkfd: add SPM support for SVM
> drm/amdkfd: generic type as sys mem on migration to ram
> include/linux/mm.h: helpers to check zone device generic type
> mm: add generic type support to migrate_vma helpers
> mm: call pgmap->ops->page_free for DEVICE_GENERIC pages
>
> Ralph Campbell (2):
> ext4/xfs: add page refcount helper
> mm: remove extra ZONE_DEVICE struct page refcount
>
> arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +-
> drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 15 ++++--
> drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
> fs/dax.c | 8 +--
> fs/ext4/inode.c | 5 +-
> fs/xfs/xfs_file.c | 4 +-
> include/linux/dax.h | 10 ++++
> include/linux/memremap.h | 7 +--
> include/linux/mm.h | 52 +++---------------
> kernel/resource.c | 2 +-
> lib/test_hmm.c | 2 +-
> mm/internal.h | 8 +++
> mm/memremap.c | 69 +++++++-----------------
> mm/migrate.c | 13 ++---
> mm/page_alloc.c | 3 ++
> mm/swap.c | 45 ++--------------
> 16 files changed, 83 insertions(+), 164 deletions(-)
>
On 6/17/21 8:16 AM, Alex Sierra wrote:
> From: Ralph Campbell <[email protected]>
>
> ZONE_DEVICE struct pages have an extra reference count that complicates the
> code for put_page() and several places in the kernel that need to check the
> reference count to see that a page is not being used (gup, compaction,
> migration, etc.). Clean up the code so the reference count doesn't need to
> be treated specially for ZONE_DEVICE.
>
> v2:
> AS: merged this patch in linux 5.11 version
>
> Signed-off-by: Ralph Campbell <[email protected]>
> Signed-off-by: Alex Sierra <[email protected]>
> ---
> arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +-
> drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
> fs/dax.c | 4 +-
> include/linux/dax.h | 2 +-
> include/linux/memremap.h | 7 +--
> include/linux/mm.h | 44 -----------------
> lib/test_hmm.c | 2 +-
> mm/internal.h | 8 +++
> mm/memremap.c | 68 +++++++-------------------
> mm/migrate.c | 5 --
> mm/page_alloc.c | 3 ++
> mm/swap.c | 45 ++---------------
> 12 files changed, 45 insertions(+), 147 deletions(-)
>
I think it is great that you are picking this up and trying to revive it.
However, I have a number of concerns about how it affects existing ZONE_DEVICE
MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_FS_DAX users and I don't see this
addressing them. For example, dev_dax_probe() allocates MEMORY_DEVICE_GENERIC
struct pages and then:
dev_dax_fault()
dev_dax_huge_fault()
__dev_dax_pte_fault()
vmf_insert_mixed()
which just inserts the PFN into the CPU page tables without increasing the page
refcount so it is zero (whereas it was one before). But using get_page() will
trigger VM_BUG_ON_PAGE() if it is enabled. There isn't any current notion of
free verses allocated for these struct pages. I suppose init_page_count()
could be called on all the struct pages in dev_dax_probe() to fix that though.
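For example, something like this (untested sketch, added after the
devm_memremap_pages() call in dev_dax_probe(), using the dev_dax->ranges[]
layout from drivers/dax/device.c):

        for (i = 0; i < dev_dax->nr_range; i++) {
                struct range *range = &dev_dax->ranges[i].range;
                unsigned long pfn;

                /* give every MEMORY_DEVICE_GENERIC page a non-zero refcount */
                for (pfn = PHYS_PFN(range->start);
                     pfn <= PHYS_PFN(range->end); pfn++)
                        init_page_count(pfn_to_page(pfn));
        }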
I'm even less clear about how to fix MEMORY_DEVICE_FS_DAX. File systems have clear
allocate and free states for backing storage but there are the complications with
the page cache references, etc. to consider. The >1 to 1 reference count seems to
be used to tell when a page is idle (no I/O, reclaim scanners) rather than free
(not allocated to any file) but I'm not 100% sure about that since I don't really
understand all the issues around why a file system needs to have a DAX mount option
besides knowing that the storage block size has to be a multiple of the page size.
On Thu, Jun 17, 2021 at 10:16:57AM -0500, Alex Sierra wrote:
> v1:
> AMD is building a system architecture for the Frontier supercomputer with a
> coherent interconnect between CPUs and GPUs. This hardware architecture allows
> the CPUs to coherently access GPU device memory. We have hardware in our labs
> and we are working with our partner HPE on the BIOS, firmware and software
> for delivery to the DOE.
>
> The system BIOS advertises the GPU device memory (aka VRAM) as SPM
> (special purpose memory) in the UEFI system address map. The amdgpu driver looks
> it up with lookup_resource and registers it with devmap as MEMORY_DEVICE_GENERIC
> using devm_memremap_pages.
>
> Now we're trying to migrate data to and from that memory using the migrate_vma_*
> helpers so we can support page-based migration in our unified memory allocations,
> while also supporting CPU access to those pages.
>
> This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages behave
> correctly in the migrate_vma_* helpers. We are looking for feedback about this
> approach. If we're close, what's needed to make our patches acceptable upstream?
> If we're not close, any suggestions how else to achieve what we are trying to do
> (i.e. page migration and coherent CPU access to VRAM)?
Is there a way we can test the codepaths touched by this patchset? It
doesn't have to be via a complete qemu simulation of the GPU device
memory, but some way of creating MEMORY_DEVICE_GENERIC subject to
migrate_vma_* helpers so we can test for regressions moving forward.
Thanks,
- Ted
On 2021-06-20 10:14 a.m., Theodore Ts'o wrote:
> On Thu, Jun 17, 2021 at 10:16:57AM -0500, Alex Sierra wrote:
>> v1:
>> AMD is building a system architecture for the Frontier supercomputer with a
>> coherent interconnect between CPUs and GPUs. This hardware architecture allows
>> the CPUs to coherently access GPU device memory. We have hardware in our labs
>> and we are working with our partner HPE on the BIOS, firmware and software
>> for delivery to the DOE.
>>
>> The system BIOS advertises the GPU device memory (aka VRAM) as SPM
>> (special purpose memory) in the UEFI system address map. The amdgpu driver looks
>> it up with lookup_resource and registers it with devmap as MEMORY_DEVICE_GENERIC
>> using devm_memremap_pages.
>>
>> Now we're trying to migrate data to and from that memory using the migrate_vma_*
>> helpers so we can support page-based migration in our unified memory allocations,
>> while also supporting CPU access to those pages.
>>
>> This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages behave
>> correctly in the migrate_vma_* helpers. We are looking for feedback about this
>> approach. If we're close, what's needed to make our patches acceptable upstream?
>> If we're not close, any suggestions how else to achieve what we are trying to do
>> (i.e. page migration and coherent CPU access to VRAM)?
> Is there a way we can test the codepaths touched by this patchset? It
> doesn't have to be via a complete qemu simulation of the GPU device
> memory, but some way of creating MEMORY_DEVICE_GENERIC subject to
> migrate_vma_* helpers so we can test for regressions moving forward.
Hi Theodore,
I can think of two ways to test the changes for MEMORY_DEVICE_GENERIC in
this patch series in a way that is reproducible without special hardware
and firmware:
For the reference counting changes we could use the dax driver with hmem
and use efi_fake_mem on the kernel command line to create some
DEVICE_GENERIC pages. I'm open to suggestions for good user mode tests
to exercise dax functionality on this type of memory.
For the migration helper changes we could modify or parametrize
lib/hmm_test.c to create DEVICE_GENERIC pages instead of DEVICE_PRIVATE.
Then run tools/testing/selftests/vm/hmm-tests.c.
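Roughly along these lines in dmirror_allocate_chunk() (sketch only;
spm_addr_dev would be a new, hypothetical module parameter pointing at the
fake SPM range created with efi_fake_mem):

        if (spm_addr_dev) {
                /* register the fake SPM range as DEVICE_GENERIC */
                devmem->pagemap.type = MEMORY_DEVICE_GENERIC;
                devmem->pagemap.range.start = spm_addr_dev;
                devmem->pagemap.range.end = spm_addr_dev + DEVMEM_CHUNK_SIZE - 1;
        } else {
                res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
                                              "hmm_dmirror");
                if (IS_ERR(res))
                        goto err_devmem;
                devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
                devmem->pagemap.range.start = res->start;
                devmem->pagemap.range.end = res->end;
        }
        devmem->pagemap.nr_range = 1;
        devmem->pagemap.ops = &dmirror_devmem_ops;
        devmem->pagemap.owner = mdevice;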
Regards,
Felix
>
> Thanks,
>
> - Ted
On Wed, Jun 23, 2021 at 05:49:55PM -0400, Felix Kuehling wrote:
> For the reference counting changes we could use the dax driver with hmem
> and use efi_fake_mem on the kernel command line to create some
> DEVICE_GENERIC pages. I'm open to suggestions for good user mode tests to
> exercise dax functionality on this type of memory.
>
> For the migration helper changes we could modify or parametrize
> lib/hmm_test.c to create DEVICE_GENERIC pages instead of DEVICE_PRIVATE.
> Then run tools/testing/selftests/vm/hmm-tests.c.
We'll also need a real in-tree user of the enhanced DEVICE_GENERIC memory.
So while the refcounting cleanups early in the series are something I'd
really like to see upstream as soon as everything is sorted out, the
actual bits that can only be used by your updated driver should wait
for that.
Am 2021-06-24 um 1:30 a.m. schrieb Christoph Hellwig:
> On Wed, Jun 23, 2021 at 05:49:55PM -0400, Felix Kuehling wrote:
>> For the reference counting changes we could use the dax driver with hmem
>> and use efi_fake_mem on the kernel command line to create some
>> DEVICE_GENERIC pages. I'm open to suggestions for good user mode tests to
>> exercise dax functionality on this type of memory.
>>
>> For the migration helper changes we could modify or parametrize
>> lib/hmm_test.c to create DEVICE_GENERIC pages instead of DEVICE_PRIVATE.
>> Then run tools/testing/selftests/vm/hmm-tests.c.
> We'll also need a real in-tree user of the enhanced DEVICE_GENERIC memory.
> So while the refcounting cleanups early in the series are something I'd
> really like to see upstream as soon as everything is sorted out, the
> actual bits that can't only be used by your updated driver should wait
> for that.
The driver changes are pretty much ready to go.
But we have a bit of a chicken-and-egg problem because those changes likely
go through different trees. The GPU driver changes will go through
drm-next, but we can't merge them there until our dependencies have been
merged there from upstream. Unless we protect everything with some #ifdef.
Regards,
Felix
Am 2021-06-17 um 3:16 p.m. schrieb Ralph Campbell:
>
> On 6/17/21 8:16 AM, Alex Sierra wrote:
>> From: Ralph Campbell <[email protected]>
>>
>> ZONE_DEVICE struct pages have an extra reference count that
>> complicates the
>> code for put_page() and several places in the kernel that need to
>> check the
>> reference count to see that a page is not being used (gup, compaction,
>> migration, etc.). Clean up the code so the reference count doesn't
>> need to
>> be treated specially for ZONE_DEVICE.
>>
>> v2:
>> AS: merged this patch in linux 5.11 version
>>
>> Signed-off-by: Ralph Campbell <[email protected]>
>> Signed-off-by: Alex Sierra <[email protected]>
>> ---
>> arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +-
>> drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
>> fs/dax.c | 4 +-
>> include/linux/dax.h | 2 +-
>> include/linux/memremap.h | 7 +--
>> include/linux/mm.h | 44 -----------------
>> lib/test_hmm.c | 2 +-
>> mm/internal.h | 8 +++
>> mm/memremap.c | 68 +++++++-------------------
>> mm/migrate.c | 5 --
>> mm/page_alloc.c | 3 ++
>> mm/swap.c | 45 ++---------------
>> 12 files changed, 45 insertions(+), 147 deletions(-)
>>
> I think it is great that you are picking this up and trying to revive it.
>
> However, I have a number of concerns about how it affects existing
> ZONE_DEVICE
> MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_FS_DAX users and I don't see this
> addressing them. For example, dev_dax_probe() allocates
> MEMORY_DEVICE_GENERIC
> struct pages and then:
> dev_dax_fault()
> dev_dax_huge_fault()
> __dev_dax_pte_fault()
> vmf_insert_mixed()
> which just inserts the PFN into the CPU page tables without increasing
> the page
> refcount so it is zero (whereas it was one before). But using
> get_page() will
> trigger VM_BUG_ON_PAGE() if it is enabled. There isn't any current
> notion of
> free verses allocated for these struct pages. I suppose init_page_count()
> could be called on all the struct pages in dev_dax_probe() to fix that
> though.
Hi Ralph,
For DEVICE_GENERIC pages free_zone_device_page doesn't do anything. So
I'm not sure what the reference counting is good for in this case.
Alex is going to add free_zone_device_page support for DEVICE_GENERIC
pages (patch 8 of this series). However, even then, it only does
anything if there is an actual call to put_page. Where would that call
come from in the dev_dax driver case?
>
> I'm even less clear about how to fix MEMORY_DEVICE_FS_DAX. File
> systems have clear
> allocate and free states for backing storage but there are the
> complications with
> the page cache references, etc. to consider. The >1 to 1 reference
> count seems to
> be used to tell when a page is idle (no I/O, reclaim scanners) rather
> than free
> (not allocated to any file) but I'm not 100% sure about that since I
> don't really
> understand all the issues around why a file system needs to have a DAX
> mount option
> besides knowing that the storage block size has to be a multiple of
> the page size.
The only thing that happens in free_zone_device_page for FS_DAX pages is
wake_up_var(&page->_refcount). I guess, whoever is waiting for this
wake-up will need to be prepared to see a refcount 0 instead of 1 now. I
see these callers that would need to be updated:
./fs/ext4/inode.c: error = ___wait_var_event(&page->_refcount,
./fs/ext4/inode.c- atomic_read(&page->_refcount) == 1,
./fs/ext4/inode.c- TASK_INTERRUPTIBLE, 0, 0,
./fs/ext4/inode.c- ext4_wait_dax_page(ei));
--
./fs/fuse/dax.c: return ___wait_var_event(&page->_refcount,
./fs/fuse/dax.c- atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
./fs/fuse/dax.c- 0, 0, fuse_wait_dax_page(inode));
--
./fs/xfs/xfs_file.c: return ___wait_var_event(&page->_refcount,
./fs/xfs/xfs_file.c- atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
./fs/xfs/xfs_file.c- 0, 0, xfs_wait_dax_page(inode));
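i.e., after this series they would need to wait for a refcount of 0 instead,
for example in the ext4 case (sketch only, using the dax_page_unused() helper
added earlier in the series):

        error = ___wait_var_event(&page->_refcount,
                                  dax_page_unused(page),
                                  TASK_INTERRUPTIBLE, 0, 0,
                                  ext4_wait_dax_page(ei));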
Regarding your page-cache comment, doesn't DAX mean that those file
pages are not in the page cache?
Regards,
Felix
On 6/28/21 9:46 AM, Felix Kuehling wrote:
> Am 2021-06-17 um 3:16 p.m. schrieb Ralph Campbell:
>> On 6/17/21 8:16 AM, Alex Sierra wrote:
>>> From: Ralph Campbell <[email protected]>
>>>
>>> ZONE_DEVICE struct pages have an extra reference count that
>>> complicates the
>>> code for put_page() and several places in the kernel that need to
>>> check the
>>> reference count to see that a page is not being used (gup, compaction,
>>> migration, etc.). Clean up the code so the reference count doesn't
>>> need to
>>> be treated specially for ZONE_DEVICE.
>>>
>>> v2:
>>> AS: merged this patch in linux 5.11 version
>>>
>>> Signed-off-by: Ralph Campbell <[email protected]>
>>> Signed-off-by: Alex Sierra <[email protected]>
>>> ---
>>> arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +-
>>> drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
>>> fs/dax.c | 4 +-
>>> include/linux/dax.h | 2 +-
>>> include/linux/memremap.h | 7 +--
>>> include/linux/mm.h | 44 -----------------
>>> lib/test_hmm.c | 2 +-
>>> mm/internal.h | 8 +++
>>> mm/memremap.c | 68 +++++++-------------------
>>> mm/migrate.c | 5 --
>>> mm/page_alloc.c | 3 ++
>>> mm/swap.c | 45 ++---------------
>>> 12 files changed, 45 insertions(+), 147 deletions(-)
>>>
>> I think it is great that you are picking this up and trying to revive it.
>>
>> However, I have a number of concerns about how it affects existing
>> ZONE_DEVICE
>> MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_FS_DAX users and I don't see this
>> addressing them. For example, dev_dax_probe() allocates
>> MEMORY_DEVICE_GENERIC
>> struct pages and then:
>> dev_dax_fault()
>> dev_dax_huge_fault()
>> __dev_dax_pte_fault()
>> vmf_insert_mixed()
>> which just inserts the PFN into the CPU page tables without increasing
>> the page
>> refcount so it is zero (whereas it was one before). But using
>> get_page() will
>> trigger VM_BUG_ON_PAGE() if it is enabled. There isn't any current
>> notion of
>> free verses allocated for these struct pages. I suppose init_page_count()
>> could be called on all the struct pages in dev_dax_probe() to fix that
>> though.
> Hi Ralph,
>
> For DEVICE_GENERIC pages free_zone_device_page doesn't do anything. So
> I'm not sure what the reference counting is good for in this case.
>
> Alex is going to add free_zone_device_page support for DEVICE_GENERIC
> pages (patch 8 of this series). However, even then, it only does
> anything if there is an actual call to put_page. Where would that call
> come from in the dev_dax driver case?
Correct, the drivers/dax/device.c driver allocates MEMORY_DEVICE_GENERIC
struct pages and doesn't seem to allocate/free the page nor increment/decrement
the reference count but it does call vmf_insert_mixed() if the /dev/file
is mmap()'ed into a user process' address space. If devm_memremap_pages()
returns the array of ZONE_DEVICE struct pages initialized with a reference
count of zero, then the CPU page tables will have a PTE/PFN that points to
a struct page with a zero reference count. This goes against the normal
expectation in the rest of the mm code that assumes a page mapped by a CPU
has a non-zero reference count.
So yes, nothing "bad" happens because put_page() isn't called but the
reference count will differ from other drivers that call vmf_insert_mixed()
or vm_insert_page() where the page was allocated with alloc_pages() or
similar.
>> I'm even less clear about how to fix MEMORY_DEVICE_FS_DAX. File
>> systems have clear
>> allocate and free states for backing storage but there are the
>> complications with
>> the page cache references, etc. to consider. The >1 to 1 reference
>> count seems to
>> be used to tell when a page is idle (no I/O, reclaim scanners) rather
>> than free
>> (not allocated to any file) but I'm not 100% sure about that since I
>> don't really
>> understand all the issues around why a file system needs to have a DAX
>> mount option
>> besides knowing that the storage block size has to be a multiple of
>> the page size.
> The only thing that happens in free_zone_device_page for FS_DAX pages is
> wake_up_var(&page->_refcount). I guess, whoever is waiting for this
> wake-up will need to be prepared to see a refcount 0 instead of 1 now. I
> see these callers that would need to be updated:
>
> ./fs/ext4/inode.c: error = ___wait_var_event(&page->_refcount,
> ./fs/ext4/inode.c- atomic_read(&page->_refcount) == 1,
> ./fs/ext4/inode.c- TASK_INTERRUPTIBLE, 0, 0,
> ./fs/ext4/inode.c- ext4_wait_dax_page(ei));
> --
> ./fs/fuse/dax.c: return ___wait_var_event(&page->_refcount,
> ./fs/fuse/dax.c- atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
> ./fs/fuse/dax.c- 0, 0, fuse_wait_dax_page(inode));
> --
> ./fs/xfs/xfs_file.c: return ___wait_var_event(&page->_refcount,
> ./fs/xfs/xfs_file.c- atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
> ./fs/xfs/xfs_file.c- 0, 0, xfs_wait_dax_page(inode));
>
> Regarding your page-cache comment, doesn't DAX mean that those file
> pages are not in the page cache?
>
> Regards,
> Felix
>
I don't really understand the FS_DAX code. I can see the __wait_var_event()
is being used when truncating or punching holes in files but I'm not
quite sure if it is using the >1 to 1 reference count to know when a
page has no "extra" references or if it means the page is actually
"free" and no longer assigned to a file.
I really think some FS_DAX expert needs to weigh in on these reference count
changes.
On Wed, Jun 23, 2021 at 05:49:55PM -0400, Felix Kuehling wrote:
>
> I can think of two ways to test the changes for MEMORY_DEVICE_GENERIC in
> this patch series in a way that is reproducible without special hardware and
> firmware:
>
> For the reference counting changes we could use the dax driver with hmem and
> use efi_fake_mem on the kernel command line to create some DEVICE_GENERIC
> pages. I'm open to suggestions for good user mode tests to exercise dax
> functionality on this type of memory.
Sorry for the thread necromancy, but now that the merge window is
past....
Today I test ext4's dax support, without having any $$$ DAX hardware,
by using the kernel command line "memmap=4G!9G:memmap=9G!14G" which
reserves memory so that creates two pmem device and then I run
xfstests with DAX enabled using qemu or using a Google Compute Engine
VM, using TEST_DEV=/dev/pmem0 and SCRATCH_DEV=/dev/pmem1.
If you can give me a recipe for what kernel configs I should enable,
and what magic kernel command line arguments to use, then I'd be able
to test your patch set with ext4.
Cheers,
- Ted
Am 2021-07-16 um 11:07 a.m. schrieb Theodore Y. Ts'o:
> On Wed, Jun 23, 2021 at 05:49:55PM -0400, Felix Kuehling wrote:
>> I can think of two ways to test the changes for MEMORY_DEVICE_GENERIC in
>> this patch series in a way that is reproducible without special hardware and
>> firmware:
>>
>> For the reference counting changes we could use the dax driver with hmem and
>> use efi_fake_mem on the kernel command line to create some DEVICE_GENERIC
>> pages. I'm open to suggestions for good user mode tests to exercise dax
>> functionality on this type of memory.
> Sorry for the thread necromancy, but now that the merge window is
> past....
No worries. Alejandro should have a new version of this series in a few
days, with updates to hmm_test and some fixes.
>
> Today I test ext4's dax support, without having any $$$ DAX hardware,
> by using the kernel command line "memmap=4G!9G:memmap=9G!14G" which
> reserves memory so that creates two pmem device and then I run
> xfstests with DAX enabled using qemu or using a Google Compute Engine
> VM, using TEST_DEV=/dev/pmem0 and SCRATCH_DEV=/dev/pmem1.
>
> If you can give me a recipe for what kernel configs I should enable,
> and what magic kernel command line arguments to use, then I'd be able
> to test your patch set with ext4.
That would be great!
Regarding kernel config options, it should be the same as what you're
using for DAX testing today. We're not changing or adding any Kconfig
options. But let me take a stab:
ZONE_DEVICE
HMM_MIRROR
MMU_NOTIFIER
DEVICE_PRIVATE (maybe not needed for your test)
FS_DAX
I'm not sure what you're looking for in terms of kernel command line,
other than the memmap options you already found. There are some more
options to run hmm_test with fake SPM (DEVICE_GENERIC) memory, but we're
already running that ourselves. That will also be in the next revision
of this patch series.
If you can run your xfstests with DAX on top of this patch series, that
would be very helpful. That's to make sure the ZONE_DEVICE page refcount
changes don't break DAX.
Regards,
Felix
>
> Cheers,
>
> - Ted
On 7/16/2021 5:14 PM, Felix Kuehling wrote:
> Am 2021-07-16 um 11:07 a.m. schrieb Theodore Y. Ts'o:
>> On Wed, Jun 23, 2021 at 05:49:55PM -0400, Felix Kuehling wrote:
>>> I can think of two ways to test the changes for MEMORY_DEVICE_GENERIC in
>>> this patch series in a way that is reproducible without special hardware and
>>> firmware:
>>>
>>> For the reference counting changes we could use the dax driver with hmem and
>>> use efi_fake_mem on the kernel command line to create some DEVICE_GENERIC
>>> pages. I'm open to suggestions for good user mode tests to exercise dax
>>> functionality on this type of memory.
>> Sorry for the thread necromancy, but now that the merge window is
>> past....
> No worries. Alejandro should have a new version of this series in a few
> days, with updates to hmm_test and some fixes.
The v4 patch series has been sent for review.
https://marc.info/?l=dri-devel&m=162654971618911&w=2
Regards,
Alex Sierra
>
>
>> Today I test ext4's dax support, without having any $$$ DAX hardware,
>> by using the kernel command line "memmap=4G!9G:memmap=9G!14G" which
>> reserves memory so that creates two pmem device and then I run
>> xfstests with DAX enabled using qemu or using a Google Compute Engine
>> VM, using TEST_DEV=/dev/pmem0 and SCRATCH_DEV=/dev/pmem1.
>>
>> If you can give me a recipe for what kernel configs I should enable,
>> and what magic kernel command line arguments to use, then I'd be able
>> to test your patch set with ext4.
> That would be great!
>
> Regarding kernel config options, it should be the same as what you're
> using for DAX testing today. We're not changing or adding any Kconfig
> options. But let me take a stab:
>
> ZONE_DEVICE
> HMM_MIRROR
> MMU_NOTIFIER
> DEVICE_PRIVATE (maybe not needed for your test)
> FS_DAX
>
> I'm not sure what you're looking for in terms of kernel command line,
> other than the memmap options you already found. There are some more
> options to run hmm_test with fake SPM (DEVICE_GENERIC) memory, but we're
> already running that ourselves. That will also be in the next revision
> of this patch series.
In order to run hmm test with generic device type enabled, set the
following:
kernel config:
EFI_FAKE_MEMMAP
RUNTIME_TESTING_MENU
TEST_HMM=m
Kernel parameters to fake SP memory. The addresses set here are based on
my system's usable memory enumerated by BIOS-e820 at boot. The test
requires two SP devices of at least 256MB.
efi_fake_mem=1G@0x100000000:0x40000,1G@0x140000000:0x40000
To run the hmm_test with the generic device type, pass the SP addresses to
the shell script.
sudo ./test_hmm.sh smoke 0x100000000 0x140000000
>
> If you can run your xfstests with DAX on top of this patch series, that
> would be very helpful. That's to make sure the ZONE_DEVICE page refcount
> changes don't break DAX.
>
> Regards,
> Felix
>
>
>> Cheers,
>>
>> - Ted
On 7/17/2021 2:54 PM, Sierra Guiza, Alejandro (Alex) wrote:
>
> On 7/16/2021 5:14 PM, Felix Kuehling wrote:
>> Am 2021-07-16 um 11:07 a.m. schrieb Theodore Y. Ts'o:
>>> On Wed, Jun 23, 2021 at 05:49:55PM -0400, Felix Kuehling wrote:
>>>> I can think of two ways to test the changes for
>>>> MEMORY_DEVICE_GENERIC in
>>>> this patch series in a way that is reproducible without special
>>>> hardware and
>>>> firmware:
>>>>
>>>> For the reference counting changes we could use the dax driver with
>>>> hmem and
>>>> use efi_fake_mem on the kernel command line to create some
>>>> DEVICE_GENERIC
>>>> pages. I'm open to suggestions for good user mode tests to exercise
>>>> dax
>>>> functionality on this type of memory.
>>> Sorry for the thread necromancy, but now that the merge window is
>>> past....
>> No worries. Alejandro should have a new version of this series in a few
>> days, with updates to hmm_test and some fixes.
>
> V4 patch series have been sent for review.
> https://marc.info/?l=dri-devel&m=162654971618911&w=2
>
> Regards,
> Alex Sierra
>
>>
>>
>>> Today I test ext4's dax support, without having any $$$ DAX hardware,
>>> by using the kernel command line "memmap=4G!9G:memmap=9G!14G" which
>>> reserves memory so that creates two pmem device and then I run
>>> xfstests with DAX enabled using qemu or using a Google Compute Engine
>>> VM, using TEST_DEV=/dev/pmem0 and SCRATCH_DEV=/dev/pmem1.
>>>
>>> If you can give me a recipe for what kernel configs I should enable,
>>> and what magic kernel command line arguments to use, then I'd be able
>>> to test your patch set with ext4.
>> That would be great!
>>
>> Regarding kernel config options, it should be the same as what you're
>> using for DAX testing today. We're not changing or adding any Kconfig
>> options. But let me take a stab:
>>
>> ZONE_DEVICE
>> HMM_MIRROR
>> MMU_NOTIFIER
>> DEVICE_PRIVATE (maybe not needed for your test)
>> FS_DAX
Hi Theodore,
I wonder if you had a chance to set the kernel configs from Felix to
enable DAX in xfstests.
I've been trying to test FS DAX on my side using virtio-fs + QEMU.
Unfortunately, I'm having some problems setting up the environment with
the guest kernel. Could you share your VM (QEMU or GCE) setup to run it
with xfstests?
Regards,
Alex S.
>>
>> I'm not sure what you're looking for in terms of kernel command line,
>> other than the memmap options you already found. There are some more
>> options to run hmm_test with fake SPM (DEVICE_GENERIC) memory, but we're
>> already running that ourselves. That will also be in the next revision
>> of this patch series.
>
> In order to run hmm test with generic device type enabled, set the
> following:
>
> kernel config:
> EFI_FAKE_MEMMAP
> RUNTIME_TESTING_MENU
> TEST_HMM=m
>
> Kernel parameters to fake SP memory. The addresses set here are based
> on my system's usable memory enumerated by BIOS-e820 at boot. The test
> requires two SP devices of at least 256MB.
> efi_fake_mem=1G@0x100000000:0x40000,1G@0x140000000:0x40000
>
> To run the hmm_test in generic device type pass the SP addresses to
> the sh script.
> sudo ./test_hmm.sh smoke 0x100000000 0x140000000
>
>>
>> If you can run your xfstests with DAX on top of this patch series, that
>> would be very helpful. That's to make sure the ZONE_DEVICE page refcount
>> changes don't break DAX.
>>
>> Regards,
>> Felix
>>
>>
>>> Cheers,
>>>
>>> - Ted
Am 2021-07-23 um 6:46 p.m. schrieb Sierra Guiza, Alejandro (Alex):
>
> On 7/17/2021 2:54 PM, Sierra Guiza, Alejandro (Alex) wrote:
>>
>> On 7/16/2021 5:14 PM, Felix Kuehling wrote:
>>> Am 2021-07-16 um 11:07 a.m. schrieb Theodore Y. Ts'o:
>>>> On Wed, Jun 23, 2021 at 05:49:55PM -0400, Felix Kuehling wrote:
>>>>> I can think of two ways to test the changes for
>>>>> MEMORY_DEVICE_GENERIC in
>>>>> this patch series in a way that is reproducible without special
>>>>> hardware and
>>>>> firmware:
>>>>>
>>>>> For the reference counting changes we could use the dax driver
>>>>> with hmem and
>>>>> use efi_fake_mem on the kernel command line to create some
>>>>> DEVICE_GENERIC
>>>>> pages. I'm open to suggestions for good user mode tests to
>>>>> exercise dax
>>>>> functionality on this type of memory.
>>>> Sorry for the thread necromancy, but now that the merge window is
>>>> past....
>>> No worries. Alejandro should have a new version of this series in a few
>>> days, with updates to hmm_test and some fixes.
>>
>> V4 patch series have been sent for review.
>> https://marc.info/?l=dri-devel&m=162654971618911&w=2
>>
>> Regards,
>> Alex Sierra
>>
>>>
>>>
>>>> Today I test ext4's dax support, without having any $$$ DAX hardware,
>>>> by using the kernel command line "memmap=4G!9G:memmap=9G!14G" which
>>>> reserves memory so that creates two pmem device and then I run
>>>> xfstests with DAX enabled using qemu or using a Google Compute Engine
>>>> VM, using TEST_DEV=/dev/pmem0 and SCRATCH_DEV=/dev/pmem1.
>>>>
>>>> If you can give me a recipe for what kernel configs I should enable,
>>>> and what magic kernel command line arguments to use, then I'd be able
>>>> to test your patch set with ext4.
>>> That would be great!
>>>
>>> Regarding kernel config options, it should be the same as what you're
>>> using for DAX testing today. We're not changing or adding any Kconfig
>>> options. But let me take a stab:
>>>
>>> ZONE_DEVICE
>>> HMM_MIRROR
>>> MMU_NOTIFIER
>>> DEVICE_PRIVATE (maybe not needed for your test)
>>> FS_DAX
> Hi Theodore,
> I wonder if you had chance to set the kernel configs from Felix to
> enable DAX in xfstests.
>
> I've been trying to test FS DAX on my side using virtio-fs + QEMU.
> Unfortunately, Im having some problems setting up the environment with
> the guest kernel. Could you share your VM (QEMU or GCE) setup to run
> it with xfstests?
>
> Regards,
> Alex S.
Hi Theodore,
Sorry to keep bugging you. I'm wondering if you've had a chance to test
FS DAX with Alex's last patch series? ([PATCH v4 00/13] Support
DEVICE_GENERIC memory in migrate_vma_*) I think other than minor
cosmetic fixes, this should be ready to merge, if it passes your tests.
Thanks,
Felix
>
>>>
>>> I'm not sure what you're looking for in terms of kernel command line,
>>> other than the memmap options you already found. There are some more
>>> options to run hmm_test with fake SPM (DEVICE_GENERIC) memory, but
>>> we're
>>> already running that ourselves. That will also be in the next revision
>>> of this patch series.
>>
>> In order to run hmm test with generic device type enabled, set the
>> following:
>>
>> kernel config:
>> EFI_FAKE_MEMMAP
>> RUNTIME_TESTING_MENU
>> TEST_HMM=m
>>
>> Kernel parameters to fake SP memory. The addresses set here are based
>> on my system's usable memory enumerated by BIOS-e820 at boot. The
>> test requires two SP devices of at least 256MB.
>> efi_fake_mem=1G@0x100000000:0x40000,1G@0x140000000:0x40000
>>
>> To run the hmm_test in generic device type pass the SP addresses to
>> the sh script.
>> sudo ./test_hmm.sh smoke 0x100000000 0x140000000
>>
>>>
>>> If you can run your xfstests with DAX on top of this patch series, that
>>> would be very helpful. That's to make sure the ZONE_DEVICE page
>>> refcount
>>> changes don't break DAX.
>>>
>>> Regards,
>>> Felix
>>>
>>>
>>>> Cheers,
>>>>
>>>> - Ted