2022-02-10 08:55:46

by Christoph Hellwig

[permalink] [raw]
Subject: start sorting out the ZONE_DEVICE refcount mess v2

Hi all,

this series removes the offset by one refcount for ZONE_DEVICE pages
that are freed back to the driver owning them, which is just device
private ones for now, but also the planned device coherent pages
and the ehanced p2p ones pending.

It does not address the fsdax pages yet, which will be attacked in a
follow on series.

Note that if we want to get the p2p series rebased on top of this
we'll need a git branch for this series. I could offer to host one.

A git tree is available here:

git://git.infradead.org/users/hch/misc.git pgmap-refcount

Gitweb:

http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/pgmap-refcount

Changes since v1:
- add a missing memremap.h include in memcontrol.c
- include rebased versions of the device coherent support and
device coherent migration support series as well as additional
cleanup patches

Diffstt:
arch/arm64/mm/mmu.c | 1
arch/powerpc/kvm/book3s_hv_uvmem.c | 1
drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 35 -
drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 1
drivers/gpu/drm/drm_cache.c | 2
drivers/gpu/drm/nouveau/nouveau_dmem.c | 3
drivers/gpu/drm/nouveau/nouveau_svm.c | 1
drivers/infiniband/core/rw.c | 1
drivers/nvdimm/pmem.h | 1
drivers/nvme/host/pci.c | 1
drivers/nvme/target/io-cmd-bdev.c | 1
fs/Kconfig | 2
fs/fuse/virtio_fs.c | 1
include/linux/hmm.h | 9
include/linux/memremap.h | 36 +
include/linux/migrate.h | 1
include/linux/mm.h | 59 --
lib/test_hmm.c | 353 ++++++++++---
lib/test_hmm_uapi.h | 22
mm/Kconfig | 7
mm/Makefile | 1
mm/gup.c | 127 +++-
mm/internal.h | 3
mm/memcontrol.c | 19
mm/memory-failure.c | 8
mm/memremap.c | 75 +-
mm/migrate.c | 763 ----------------------------
mm/migrate_device.c | 822 +++++++++++++++++++++++++++++++
mm/rmap.c | 5
mm/swap.c | 49 -
tools/testing/selftests/vm/Makefile | 2
tools/testing/selftests/vm/hmm-tests.c | 204 ++++++-
tools/testing/selftests/vm/test_hmm.sh | 24
33 files changed, 1552 insertions(+), 1088 deletions(-)


2022-02-10 08:57:50

by Christoph Hellwig

[permalink] [raw]
Subject: [PATCH 08/27] fsdax: depend on ZONE_DEVICE || FS_DAX_LIMITED

Add a depends on ZONE_DEVICE support or the s390-specific limited DAX
support, as one of the two is required at runtime for fsdax code to
actually work.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
---
fs/Kconfig | 1 +
1 file changed, 1 insertion(+)

diff --git a/fs/Kconfig b/fs/Kconfig
index e9433bbc48010a..7f2455e8e18ae2 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -48,6 +48,7 @@ config FS_DAX
bool "File system based Direct Access (DAX) support"
depends on MMU
depends on !(ARM || MIPS || SPARC)
+ depends on ZONE_DEVICE || FS_DAX_LIMITED
select FS_IOMAP
select DAX
help
--
2.30.2


2022-02-10 09:13:18

by Christoph Hellwig

[permalink] [raw]
Subject: [PATCH 06/27] mm: don't include <linux/memremap.h> in <linux/mm.h>

Move the check for the actual pgmap types that need the free at refcount
one behavior into the out of line helper, and thus avoid the need to
pull memremap.h into mm.h.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Acked-by: Felix Kuehling <[email protected]>
---
arch/arm64/mm/mmu.c | 1 +
drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 1 +
drivers/gpu/drm/drm_cache.c | 2 +-
drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 +
drivers/gpu/drm/nouveau/nouveau_svm.c | 1 +
drivers/infiniband/core/rw.c | 1 +
drivers/nvdimm/pmem.h | 1 +
drivers/nvme/host/pci.c | 1 +
drivers/nvme/target/io-cmd-bdev.c | 1 +
fs/fuse/virtio_fs.c | 1 +
include/linux/memremap.h | 18 ++++++++++++++++++
include/linux/mm.h | 20 --------------------
lib/test_hmm.c | 1 +
mm/memcontrol.c | 1 +
mm/memremap.c | 6 +++++-
15 files changed, 35 insertions(+), 22 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index acfae9b41cc8c9..580abae6c0b93f 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -17,6 +17,7 @@
#include <linux/mman.h>
#include <linux/nodemask.h>
#include <linux/memblock.h>
+#include <linux/memremap.h>
#include <linux/memory.h>
#include <linux/fs.h>
#include <linux/io.h>
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index ea68f3b3a4e9cb..6d643b4b791d87 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -25,6 +25,7 @@

#include <linux/hashtable.h>
#include <linux/mmu_notifier.h>
+#include <linux/memremap.h>
#include <linux/mutex.h>
#include <linux/types.h>
#include <linux/atomic.h>
diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
index f19d9acbe95936..50b8a088f763a6 100644
--- a/drivers/gpu/drm/drm_cache.c
+++ b/drivers/gpu/drm/drm_cache.c
@@ -27,11 +27,11 @@
/*
* Authors: Thomas Hellström <thomas-at-tungstengraphics-dot-com>
*/
-
#include <linux/dma-buf-map.h>
#include <linux/export.h>
#include <linux/highmem.h>
#include <linux/cc_platform.h>
+#include <linux/ioport.h>
#include <xen/xen.h>

#include <drm/drm_cache.h>
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index e886a3b9e08c7d..a5cdfbe32b5e54 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -39,6 +39,7 @@

#include <linux/sched/mm.h>
#include <linux/hmm.h>
+#include <linux/memremap.h>
#include <linux/migrate.h>

/*
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 266809e511e2c1..090b9b47708cca 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -35,6 +35,7 @@
#include <linux/sched/mm.h>
#include <linux/sort.h>
#include <linux/hmm.h>
+#include <linux/memremap.h>
#include <linux/rmap.h>

struct nouveau_svm {
diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index 5a3bd41b331c93..4d98f931a13ddd 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -2,6 +2,7 @@
/*
* Copyright (c) 2016 HGST, a Western Digital Company.
*/
+#include <linux/memremap.h>
#include <linux/moduleparam.h>
#include <linux/slab.h>
#include <linux/pci-p2pdma.h>
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index 59cfe13ea8a85c..1f51a23614299b 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -3,6 +3,7 @@
#define __NVDIMM_PMEM_H__
#include <linux/page-flags.h>
#include <linux/badblocks.h>
+#include <linux/memremap.h>
#include <linux/types.h>
#include <linux/pfn_t.h>
#include <linux/fs.h>
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6a99ed68091589..ab15bc72710dbe 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -15,6 +15,7 @@
#include <linux/init.h>
#include <linux/interrupt.h>
#include <linux/io.h>
+#include <linux/memremap.h>
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/mutex.h>
diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
index 70ca9dfc1771a9..a141446db1bea3 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -6,6 +6,7 @@
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
#include <linux/blkdev.h>
#include <linux/blk-integrity.h>
+#include <linux/memremap.h>
#include <linux/module.h>
#include "nvmet.h"

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 9d737904d07c0b..86b7dbb6a0d43e 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -8,6 +8,7 @@
#include <linux/dax.h>
#include <linux/pci.h>
#include <linux/pfn_t.h>
+#include <linux/memremap.h>
#include <linux/module.h>
#include <linux/virtio.h>
#include <linux/virtio_fs.h>
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 1fafcc38acbad6..514ab46f597e5c 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -1,6 +1,8 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _LINUX_MEMREMAP_H_
#define _LINUX_MEMREMAP_H_
+
+#include <linux/mm.h>
#include <linux/range.h>
#include <linux/ioport.h>
#include <linux/percpu-refcount.h>
@@ -129,6 +131,22 @@ static inline unsigned long pgmap_vmemmap_nr(struct dev_pagemap *pgmap)
return 1 << pgmap->vmemmap_shift;
}

+static inline bool is_device_private_page(const struct page *page)
+{
+ return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
+ IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
+ is_zone_device_page(page) &&
+ page->pgmap->type == MEMORY_DEVICE_PRIVATE;
+}
+
+static inline bool is_pci_p2pdma_page(const struct page *page)
+{
+ return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
+ IS_ENABLED(CONFIG_PCI_P2PDMA) &&
+ is_zone_device_page(page) &&
+ page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
+}
+
#ifdef CONFIG_ZONE_DEVICE
void *memremap_pages(struct dev_pagemap *pgmap, int nid);
void memunmap_pages(struct dev_pagemap *pgmap);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 26baadcef4556b..80fccfe31c3444 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -23,7 +23,6 @@
#include <linux/err.h>
#include <linux/page-flags.h>
#include <linux/page_ref.h>
-#include <linux/memremap.h>
#include <linux/overflow.h>
#include <linux/sizes.h>
#include <linux/sched.h>
@@ -1101,9 +1100,6 @@ static inline bool put_devmap_managed_page(struct page *page)
return false;
if (!is_zone_device_page(page))
return false;
- if (page->pgmap->type != MEMORY_DEVICE_PRIVATE &&
- page->pgmap->type != MEMORY_DEVICE_FS_DAX)
- return false;
return __put_devmap_managed_page(page);
}

@@ -1114,22 +1110,6 @@ static inline bool put_devmap_managed_page(struct page *page)
}
#endif /* CONFIG_DEV_PAGEMAP_OPS */

-static inline bool is_device_private_page(const struct page *page)
-{
- return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
- IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
- is_zone_device_page(page) &&
- page->pgmap->type == MEMORY_DEVICE_PRIVATE;
-}
-
-static inline bool is_pci_p2pdma_page(const struct page *page)
-{
- return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
- IS_ENABLED(CONFIG_PCI_P2PDMA) &&
- is_zone_device_page(page) &&
- page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
-}
-
/* 127: arbitrary random number, small enough to assemble well */
#define folio_ref_zero_or_close_to_overflow(folio) \
((unsigned int) folio_ref_count(folio) + 127u <= 127u)
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 396beee6b061d4..e5fc14ba71f33e 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -12,6 +12,7 @@
#include <linux/kernel.h>
#include <linux/cdev.h>
#include <linux/device.h>
+#include <linux/memremap.h>
#include <linux/mutex.h>
#include <linux/rwsem.h>
#include <linux/sched.h>
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 09d342c7cbd0d9..fcdd96aa4380e2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -53,6 +53,7 @@
#include <linux/fs.h>
#include <linux/seq_file.h>
#include <linux/vmpressure.h>
+#include <linux/memremap.h>
#include <linux/mm_inline.h>
#include <linux/swap_cgroup.h>
#include <linux/cpu.h>
diff --git a/mm/memremap.c b/mm/memremap.c
index f41233a67edb12..a0ece2344c2cab 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -4,7 +4,7 @@
#include <linux/io.h>
#include <linux/kasan.h>
#include <linux/memory_hotplug.h>
-#include <linux/mm.h>
+#include <linux/memremap.h>
#include <linux/pfn_t.h>
#include <linux/swap.h>
#include <linux/mmzone.h>
@@ -504,6 +504,10 @@ void free_devmap_managed_page(struct page *page)

bool __put_devmap_managed_page(struct page *page)
{
+ if (page->pgmap->type != MEMORY_DEVICE_PRIVATE &&
+ page->pgmap->type != MEMORY_DEVICE_FS_DAX)
+ return false;
+
/*
* devmap page refcounts are 1-based, rather than 0-based: if
* refcount is 1, then the page is free and the refcount is
--
2.30.2


2022-02-10 09:29:49

by Christoph Hellwig

[permalink] [raw]
Subject: [PATCH 25/27] mm: remove the vma check in migrate_vma_setup()

From: Alistair Popple <[email protected]>

migrate_vma_setup() checks that a valid vma is passed so that the page
tables can be walked to find the pfns associated with a given address
range. However in some cases the pfns are already known, such as when
migrating device coherent pages during pin_user_pages() meaning a valid
vma isn't required.

Signed-off-by: Alistair Popple <[email protected]>
Acked-by: Felix Kuehling <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
---
mm/migrate_device.c | 34 +++++++++++++++++-----------------
1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 0b295594e7626d..03e182f9fc7865 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -462,24 +462,24 @@ int migrate_vma_setup(struct migrate_vma *args)

args->start &= PAGE_MASK;
args->end &= PAGE_MASK;
- if (!args->vma || is_vm_hugetlb_page(args->vma) ||
- (args->vma->vm_flags & VM_SPECIAL) || vma_is_dax(args->vma))
- return -EINVAL;
- if (nr_pages <= 0)
- return -EINVAL;
- if (args->start < args->vma->vm_start ||
- args->start >= args->vma->vm_end)
- return -EINVAL;
- if (args->end <= args->vma->vm_start || args->end > args->vma->vm_end)
- return -EINVAL;
if (!args->src || !args->dst)
return -EINVAL;
-
- memset(args->src, 0, sizeof(*args->src) * nr_pages);
- args->cpages = 0;
- args->npages = 0;
-
- migrate_vma_collect(args);
+ if (args->vma) {
+ if (is_vm_hugetlb_page(args->vma) ||
+ (args->vma->vm_flags & VM_SPECIAL) || vma_is_dax(args->vma))
+ return -EINVAL;
+ if (args->start < args->vma->vm_start ||
+ args->start >= args->vma->vm_end)
+ return -EINVAL;
+ if (args->end <= args->vma->vm_start ||
+ args->end > args->vma->vm_end)
+ return -EINVAL;
+ memset(args->src, 0, sizeof(*args->src) * nr_pages);
+ args->cpages = 0;
+ args->npages = 0;
+
+ migrate_vma_collect(args);
+ }

if (args->cpages)
migrate_vma_unmap(args);
@@ -661,7 +661,7 @@ void migrate_vma_pages(struct migrate_vma *migrate)
continue;
}

- if (!page) {
+ if (!page && migrate->vma) {
if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE))
continue;
if (!notified) {
--
2.30.2


2022-02-10 09:34:29

by Christoph Hellwig

[permalink] [raw]
Subject: [PATCH 07/27] mm: remove the extra ZONE_DEVICE struct page refcount

ZONE_DEVICE struct pages have an extra reference count that complicates
the code for put_page() and several places in the kernel that need to
check the reference count to see that a page is not being used (gup,
compaction, migration, etc.). Clean up the code so the reference count
doesn't need to be treated specially for ZONE_DEVICE pages.

Note that this excludes the special idle page wakeup for fsdax pages,
which still happens at refcount 1. This is a separate issue and will
be sorted out later. Given that only fsdax pages require the
notifiacation when the refcount hits 1 now, the PAGEMAP_OPS Kconfig
symbol can go away and be replaced with a FS_DAX check for this hook
in the put_page fastpath.

Based on an earlier patch from Ralph Campbell <[email protected]>.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Ralph Campbell <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Acked-by: Felix Kuehling <[email protected]>
---
arch/powerpc/kvm/book3s_hv_uvmem.c | 1 -
drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 1 -
drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 -
fs/Kconfig | 1 -
include/linux/memremap.h | 12 +++--
include/linux/mm.h | 6 +--
lib/test_hmm.c | 1 -
mm/Kconfig | 4 --
mm/internal.h | 2 +
mm/memcontrol.c | 11 ++---
mm/memremap.c | 57 ++++++++----------------
mm/migrate.c | 6 ---
mm/swap.c | 16 ++-----
13 files changed, 36 insertions(+), 83 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index e414ca44839fd1..8b6438fa18fc2b 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -712,7 +712,6 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)

dpage = pfn_to_page(uvmem_pfn);
dpage->zone_device_data = pvt;
- get_page(dpage);
lock_page(dpage);
return dpage;
out_clear:
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index cb835f95a76e66..e27ca375876230 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -225,7 +225,6 @@ svm_migrate_get_vram_page(struct svm_range *prange, unsigned long pfn)
page = pfn_to_page(pfn);
svm_range_bo_ref(prange->svm_bo);
page->zone_device_data = prange->svm_bo;
- get_page(page);
lock_page(page);
}

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index a5cdfbe32b5e54..7ba66ad68a8a1e 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -326,7 +326,6 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
return NULL;
}

- get_page(page);
lock_page(page);
return page;
}
diff --git a/fs/Kconfig b/fs/Kconfig
index 6c7dc1387beb0f..e9433bbc48010a 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -48,7 +48,6 @@ config FS_DAX
bool "File system based Direct Access (DAX) support"
depends on MMU
depends on !(ARM || MIPS || SPARC)
- select DEV_PAGEMAP_OPS if (ZONE_DEVICE && !FS_DAX_LIMITED)
select FS_IOMAP
select DAX
help
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 514ab46f597e5c..d6a114dd5ea8b7 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -68,9 +68,9 @@ enum memory_type {

struct dev_pagemap_ops {
/*
- * Called once the page refcount reaches 1. (ZONE_DEVICE pages never
- * reach 0 refcount unless there is a refcount bug. This allows the
- * device driver to implement its own memory management.)
+ * Called once the page refcount reaches 0. The reference count will be
+ * reset to one by the core code after the method is called to prepare
+ * for handing out the page again.
*/
void (*page_free)(struct page *page);

@@ -133,16 +133,14 @@ static inline unsigned long pgmap_vmemmap_nr(struct dev_pagemap *pgmap)

static inline bool is_device_private_page(const struct page *page)
{
- return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
- IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
+ return IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
is_zone_device_page(page) &&
page->pgmap->type == MEMORY_DEVICE_PRIVATE;
}

static inline bool is_pci_p2pdma_page(const struct page *page)
{
- return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
- IS_ENABLED(CONFIG_PCI_P2PDMA) &&
+ return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
is_zone_device_page(page) &&
page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80fccfe31c3444..ff9f149ca2017e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1090,7 +1090,7 @@ static inline bool is_zone_movable_page(const struct page *page)
return page_zonenum(page) == ZONE_MOVABLE;
}

-#ifdef CONFIG_DEV_PAGEMAP_OPS
+#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_FS_DAX)
DECLARE_STATIC_KEY_FALSE(devmap_managed_key);

bool __put_devmap_managed_page(struct page *page);
@@ -1103,12 +1103,12 @@ static inline bool put_devmap_managed_page(struct page *page)
return __put_devmap_managed_page(page);
}

-#else /* CONFIG_DEV_PAGEMAP_OPS */
+#else /* CONFIG_ZONE_DEVICE && CONFIG_FS_DAX */
static inline bool put_devmap_managed_page(struct page *page)
{
return false;
}
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
+#endif /* CONFIG_ZONE_DEVICE && CONFIG_FS_DAX */

/* 127: arbitrary random number, small enough to assemble well */
#define folio_ref_zero_or_close_to_overflow(folio) \
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index e5fc14ba71f33e..cfe63204783918 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -566,7 +566,6 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
}

dpage->zone_device_data = rpage;
- get_page(dpage);
lock_page(dpage);
return dpage;

diff --git a/mm/Kconfig b/mm/Kconfig
index 3326ee3903f330..a1901ae6d06293 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -776,9 +776,6 @@ config ZONE_DEVICE

If FS_DAX is enabled, then say Y.

-config DEV_PAGEMAP_OPS
- bool
-
#
# Helpers to mirror range of the CPU page tables of a process into device page
# tables.
@@ -790,7 +787,6 @@ config HMM_MIRROR
config DEVICE_PRIVATE
bool "Unaddressable device memory (GPU memory, ...)"
depends on ZONE_DEVICE
- select DEV_PAGEMAP_OPS

help
Allows creation of struct pages to represent unaddressable device
diff --git a/mm/internal.h b/mm/internal.h
index d80300392a194f..a67222d17e5987 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -718,4 +718,6 @@ void vunmap_range_noflush(unsigned long start, unsigned long end);
int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
unsigned long addr, int page_nid, int *flags);

+void free_zone_device_page(struct page *page);
+
#endif /* __MM_INTERNAL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fcdd96aa4380e2..510cbfb82bb62a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5504,17 +5504,12 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
return NULL;

/*
- * Handle MEMORY_DEVICE_PRIVATE which are ZONE_DEVICE page belonging to
- * a device and because they are not accessible by CPU they are store
- * as special swap entry in the CPU page table.
+ * Handle device private pages that are not accessible by the CPU, but
+ * stored as special swap entries in the page table.
*/
if (is_device_private_entry(ent)) {
page = pfn_swap_entry_to_page(ent);
- /*
- * MEMORY_DEVICE_PRIVATE means ZONE_DEVICE page and which have
- * a refcount of 1 when free (unlike normal page)
- */
- if (!page_ref_add_unless(page, 1, 1))
+ if (!get_page_unless_zero(page))
return NULL;
return page;
}
diff --git a/mm/memremap.c b/mm/memremap.c
index a0ece2344c2cab..fef5734d5e4933 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -12,6 +12,7 @@
#include <linux/types.h>
#include <linux/wait_bit.h>
#include <linux/xarray.h>
+#include "internal.h"

static DEFINE_XARRAY(pgmap_array);

@@ -37,21 +38,19 @@ unsigned long memremap_compat_align(void)
EXPORT_SYMBOL_GPL(memremap_compat_align);
#endif

-#ifdef CONFIG_DEV_PAGEMAP_OPS
+#ifdef CONFIG_FS_DAX
DEFINE_STATIC_KEY_FALSE(devmap_managed_key);
EXPORT_SYMBOL(devmap_managed_key);

static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
{
- if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
- pgmap->type == MEMORY_DEVICE_FS_DAX)
+ if (pgmap->type == MEMORY_DEVICE_FS_DAX)
static_branch_dec(&devmap_managed_key);
}

static void devmap_managed_enable_get(struct dev_pagemap *pgmap)
{
- if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
- pgmap->type == MEMORY_DEVICE_FS_DAX)
+ if (pgmap->type == MEMORY_DEVICE_FS_DAX)
static_branch_inc(&devmap_managed_key);
}
#else
@@ -61,7 +60,7 @@ static void devmap_managed_enable_get(struct dev_pagemap *pgmap)
static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
{
}
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
+#endif /* CONFIG_FS_DAX */

static void pgmap_array_delete(struct range *range)
{
@@ -102,23 +101,12 @@ static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
return (range->start + range_len(range)) >> PAGE_SHIFT;
}

-static unsigned long pfn_next(struct dev_pagemap *pgmap, unsigned long pfn)
-{
- if (pfn % (1024 << pgmap->vmemmap_shift))
- cond_resched();
- return pfn + pgmap_vmemmap_nr(pgmap);
-}
-
static unsigned long pfn_len(struct dev_pagemap *pgmap, unsigned long range_id)
{
return (pfn_end(pgmap, range_id) -
pfn_first(pgmap, range_id)) >> pgmap->vmemmap_shift;
}

-#define for_each_device_pfn(pfn, map, i) \
- for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); \
- pfn = pfn_next(map, pfn))
-
static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
{
struct range *range = &pgmap->ranges[range_id];
@@ -147,13 +135,11 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)

void memunmap_pages(struct dev_pagemap *pgmap)
{
- unsigned long pfn;
int i;

percpu_ref_kill(&pgmap->ref);
for (i = 0; i < pgmap->nr_range; i++)
- for_each_device_pfn(pfn, pgmap, i)
- put_page(pfn_to_page(pfn));
+ percpu_ref_put_many(&pgmap->ref, pfn_len(pgmap, i));
wait_for_completion(&pgmap->done);
percpu_ref_exit(&pgmap->ref);

@@ -464,14 +450,10 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
}
EXPORT_SYMBOL_GPL(get_dev_pagemap);

-#ifdef CONFIG_DEV_PAGEMAP_OPS
-void free_devmap_managed_page(struct page *page)
+void free_zone_device_page(struct page *page)
{
- /* notify page idle for dax */
- if (!is_device_private_page(page)) {
- wake_up_var(&page->_refcount);
+ if (WARN_ON_ONCE(!is_device_private_page(page)))
return;
- }

__ClearPageWaiters(page);

@@ -500,28 +482,27 @@ void free_devmap_managed_page(struct page *page)
*/
page->mapping = NULL;
page->pgmap->ops->page_free(page);
+
+ /*
+ * Reset the page count to 1 to prepare for handing out the page again.
+ */
+ set_page_count(page, 1);
}

+#ifdef CONFIG_FS_DAX
bool __put_devmap_managed_page(struct page *page)
{
- if (page->pgmap->type != MEMORY_DEVICE_PRIVATE &&
- page->pgmap->type != MEMORY_DEVICE_FS_DAX)
+ if (page->pgmap->type != MEMORY_DEVICE_FS_DAX)
return false;

/*
- * devmap page refcounts are 1-based, rather than 0-based: if
+ * fsdax page refcounts are 1-based, rather than 0-based: if
* refcount is 1, then the page is free and the refcount is
* stable because nobody holds a reference on the page.
*/
- switch (page_ref_dec_return(page)) {
- case 1:
- free_devmap_managed_page(page);
- break;
- case 0:
- __put_page(page);
- break;
- }
+ if (page_ref_dec_return(page) == 1)
+ wake_up_var(&page->_refcount);
return true;
}
EXPORT_SYMBOL(__put_devmap_managed_page);
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
+#endif /* CONFIG_FS_DAX */
diff --git a/mm/migrate.c b/mm/migrate.c
index c7da064b4781b8..8e0370a73f8a43 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -341,14 +341,8 @@ static int expected_page_refs(struct address_space *mapping, struct page *page)
{
int expected_count = 1;

- /*
- * Device private pages have an extra refcount as they are
- * ZONE_DEVICE pages.
- */
- expected_count += is_device_private_page(page);
if (mapping)
expected_count += compound_nr(page) + page_has_private(page);
-
return expected_count;
}

diff --git a/mm/swap.c b/mm/swap.c
index 25b55c56614311..c84d6817043257 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -114,17 +114,9 @@ static void __put_compound_page(struct page *page)

void __put_page(struct page *page)
{
- if (is_zone_device_page(page)) {
- put_dev_pagemap(page->pgmap);
-
- /*
- * The page belongs to the device that created pgmap. Do
- * not return it to page allocator.
- */
- return;
- }
-
- if (unlikely(PageCompound(page)))
+ if (unlikely(is_zone_device_page(page)))
+ free_zone_device_page(page);
+ else if (unlikely(PageCompound(page)))
__put_compound_page(page);
else
__put_single_page(page);
@@ -933,7 +925,7 @@ void release_pages(struct page **pages, int nr)
if (put_devmap_managed_page(page))
continue;
if (put_page_testzero(page))
- put_dev_pagemap(page->pgmap);
+ free_zone_device_page(page);
continue;
}

--
2.30.2


2022-02-10 09:49:30

by Christoph Hellwig

[permalink] [raw]
Subject: [PATCH 01/27] mm: remove a pointless CONFIG_ZONE_DEVICE check in memremap_pages

memremap.c is only built when CONFIG_ZONE_DEVICE is set, so remove
the superflous extra check.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
---
mm/memremap.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/memremap.c b/mm/memremap.c
index 6aa5f0c2d11fda..5f04a0709e436e 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -328,8 +328,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
}
break;
case MEMORY_DEVICE_FS_DAX:
- if (!IS_ENABLED(CONFIG_ZONE_DEVICE) ||
- IS_ENABLED(CONFIG_FS_DAX_LIMITED)) {
+ if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) {
WARN(1, "File system DAX not supported\n");
return ERR_PTR(-EINVAL);
}
--
2.30.2


2022-02-10 10:21:15

by Christoph Hellwig

[permalink] [raw]
Subject: [PATCH 24/27] tools: update test_hmm script to support SP config

From: Alex Sierra <[email protected]>

Add two more parameters to set spm_addr_dev0 & spm_addr_dev1
addresses. These two parameters configure the start SP
addresses for each device in test_hmm driver.
Consequently, this configures zone device type as coherent.

Signed-off-by: Alex Sierra <[email protected]>
Acked-by: Felix Kuehling <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
---
tools/testing/selftests/vm/test_hmm.sh | 24 +++++++++++++++++++++---
1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/vm/test_hmm.sh b/tools/testing/selftests/vm/test_hmm.sh
index 0647b525a62564..539c9371e592a1 100755
--- a/tools/testing/selftests/vm/test_hmm.sh
+++ b/tools/testing/selftests/vm/test_hmm.sh
@@ -40,11 +40,26 @@ check_test_requirements()

load_driver()
{
- modprobe $DRIVER > /dev/null 2>&1
+ if [ $# -eq 0 ]; then
+ modprobe $DRIVER > /dev/null 2>&1
+ else
+ if [ $# -eq 2 ]; then
+ modprobe $DRIVER spm_addr_dev0=$1 spm_addr_dev1=$2
+ > /dev/null 2>&1
+ else
+ echo "Missing module parameters. Make sure pass"\
+ "spm_addr_dev0 and spm_addr_dev1"
+ usage
+ fi
+ fi
if [ $? == 0 ]; then
major=$(awk "\$2==\"HMM_DMIRROR\" {print \$1}" /proc/devices)
mknod /dev/hmm_dmirror0 c $major 0
mknod /dev/hmm_dmirror1 c $major 1
+ if [ $# -eq 2 ]; then
+ mknod /dev/hmm_dmirror2 c $major 2
+ mknod /dev/hmm_dmirror3 c $major 3
+ fi
fi
}

@@ -58,7 +73,7 @@ run_smoke()
{
echo "Running smoke test. Note, this test provides basic coverage."

- load_driver
+ load_driver $1 $2
$(dirname "${BASH_SOURCE[0]}")/hmm-tests
unload_driver
}
@@ -75,6 +90,9 @@ usage()
echo "# Smoke testing"
echo "./${TEST_NAME}.sh smoke"
echo
+ echo "# Smoke testing with SPM enabled"
+ echo "./${TEST_NAME}.sh smoke <spm_addr_dev0> <spm_addr_dev1>"
+ echo
exit 0
}

@@ -84,7 +102,7 @@ function run_test()
usage
else
if [ "$1" = "smoke" ]; then
- run_smoke
+ run_smoke $2 $3
else
usage
fi
--
2.30.2


2022-02-10 10:35:33

by Christoph Hellwig

[permalink] [raw]
Subject: [PATCH 27/27] tools: add hmm gup test for long term pinned device pages

From: Alex Sierra <[email protected]>

The intention is to test device coherent type pages that have been
called through get user pages with PIN_LONGTERM flag set. These pages
should get migrated back to normal system memory.

Signed-off-by: Alex Sierra <[email protected]>
Signed-off-by: Alistair Popple <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
---
tools/testing/selftests/vm/Makefile | 2 +-
tools/testing/selftests/vm/hmm-tests.c | 81 ++++++++++++++++++++++++++
2 files changed, 82 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 1607322a112c91..58c8427114f0c2 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -142,7 +142,7 @@ $(OUTPUT)/mlock-random-test $(OUTPUT)/memfd_secret: LDLIBS += -lcap

$(OUTPUT)/gup_test: ../../../../mm/gup_test.h

-$(OUTPUT)/hmm-tests: local_config.h
+$(OUTPUT)/hmm-tests: local_config.h ../../../../mm/gup_test.h

# HMM_EXTRA_LIBS may get set in local_config.mk, or it may be left empty.
$(OUTPUT)/hmm-tests: LDLIBS += $(HMM_EXTRA_LIBS)
diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
index 84ec8c4a1dc7b6..11b83a8084fee2 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -36,6 +36,7 @@
* in the usual include/uapi/... directory.
*/
#include "../../../../lib/test_hmm_uapi.h"
+#include "../../../../mm/gup_test.h"

struct hmm_buffer {
void *ptr;
@@ -60,6 +61,8 @@ enum {
#define NTIMES 10

#define ALIGN(x, a) (((x) + (a - 1)) & (~((a) - 1)))
+/* Just the flags we need, copied from mm.h: */
+#define FOLL_WRITE 0x01 /* check pte is writable */

FIXTURE(hmm)
{
@@ -1766,4 +1769,82 @@ TEST_F(hmm, exclusive_cow)
hmm_buffer_free(buffer);
}

+/*
+ * Test get user device pages through gup_test. Setting PIN_LONGTERM flag.
+ * This should trigger a migration back to system memory for both, private
+ * and coherent type pages.
+ * This test makes use of gup_test module. Make sure GUP_TEST_CONFIG is added
+ * to your configuration before you run it.
+ */
+TEST_F(hmm, hmm_gup_test)
+{
+ struct hmm_buffer *buffer;
+ struct gup_test gup;
+ int gup_fd;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long i;
+ int *ptr;
+ int ret;
+ unsigned char *m;
+
+ gup_fd = open("/sys/kernel/debug/gup_test", O_RDWR);
+ if (gup_fd == -1)
+ SKIP(return, "Skipping test, could not find gup_test driver");
+
+ npages = 4;
+ ASSERT_NE(npages, 0);
+ size = npages << self->page_shift;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = size;
+ buffer->mirror = malloc(size);
+ ASSERT_NE(buffer->mirror, NULL);
+
+ buffer->ptr = mmap(NULL, size,
+ PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS,
+ buffer->fd, 0);
+ ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+ /* Initialize buffer in system memory. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ptr[i] = i;
+
+ /* Migrate memory to device. */
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+ /* Check what the device read. */
+ for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], i);
+
+ gup.nr_pages_per_call = npages;
+ gup.addr = (unsigned long)buffer->ptr;
+ gup.gup_flags = FOLL_WRITE;
+ gup.size = size;
+ /*
+ * Calling gup_test ioctl. It will try to PIN_LONGTERM these device pages
+ * causing a migration back to system memory for both, private and coherent
+ * type pages.
+ */
+ if (ioctl(gup_fd, PIN_LONGTERM_BENCHMARK, &gup)) {
+ perror("ioctl on PIN_LONGTERM_BENCHMARK\n");
+ goto out_test;
+ }
+
+ /* Take snapshot to make sure pages have been migrated to sys memory */
+ ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_SNAPSHOT, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+ m = buffer->mirror;
+ for (i = 0; i < npages; i++)
+ ASSERT_EQ(m[i], HMM_DMIRROR_PROT_WRITE);
+out_test:
+ close(gup_fd);
+ hmm_buffer_free(buffer);
+}
TEST_HARNESS_MAIN
--
2.30.2


2022-02-10 11:30:07

by Christoph Hellwig

[permalink] [raw]
Subject: [PATCH 26/27] mm/gup: migrate device coherent pages when pinning instead of failing

From: Alistair Popple <[email protected]>

Currently any attempts to pin a device coherent page will fail. This is
because device coherent pages need to be managed by a device driver, and
pinning them would prevent a driver from migrating them off the device.

However this is no reason to fail pinning of these pages. These are
coherent and accessible from the CPU so can be migrated just like
pinning ZONE_MOVABLE pages. So instead of failing all attempts to pin
them first try migrating them out of ZONE_DEVICE.

Signed-off-by: Alistair Popple <[email protected]>
Acked-by: Felix Kuehling <[email protected]>
[hch: rebased to the split device memory checks,
moved migrate_device_page to migrate_device.c]
Signed-off-by: Christoph Hellwig <[email protected]>
---
mm/gup.c | 37 ++++++++++++++++++++++++++-----
mm/internal.h | 1 +
mm/migrate_device.c | 53 +++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 85 insertions(+), 6 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 39b23ad39a7bde..41349b685eafb4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1889,9 +1889,31 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
ret = -EFAULT;
goto unpin_pages;
}
+
+ /*
+ * Device coherent pages are managed by a driver and should not
+ * be pinned indefinitely as it prevents the driver moving the
+ * page. So when trying to pin with FOLL_LONGTERM instead try
+ * to migrate the page out of device memory.
+ */
if (is_device_coherent_page(head)) {
- ret = -EFAULT;
- goto unpin_pages;
+ WARN_ON_ONCE(PageCompound(head));
+
+ /*
+ * Migration will fail if the page is pinned, so convert
+ * the pin on the source page to a normal reference.
+ */
+ if (gup_flags & FOLL_PIN) {
+ get_page(head);
+ unpin_user_page(head);
+ }
+
+ pages[i] = migrate_device_page(head, gup_flags);
+ if (!pages[i]) {
+ ret = -EBUSY;
+ goto unpin_pages;
+ }
+ continue;
}

if (is_pinnable_page(head))
@@ -1931,10 +1953,13 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
return nr_pages;

unpin_pages:
- if (gup_flags & FOLL_PIN) {
- unpin_user_pages(pages, nr_pages);
- } else {
- for (i = 0; i < nr_pages; i++)
+ for (i = 0; i < nr_pages; i++) {
+ if (!pages[i])
+ continue;
+
+ if (gup_flags & FOLL_PIN)
+ unpin_user_page(pages[i]);
+ else
put_page(pages[i]);
}

diff --git a/mm/internal.h b/mm/internal.h
index a67222d17e5987..1bded5d7f41a9d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -719,5 +719,6 @@ int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
unsigned long addr, int page_nid, int *flags);

void free_zone_device_page(struct page *page);
+struct page *migrate_device_page(struct page *page, unsigned int gup_flags);

#endif /* __MM_INTERNAL_H */
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 03e182f9fc7865..3373b535d5c9d9 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -767,3 +767,56 @@ void migrate_vma_finalize(struct migrate_vma *migrate)
}
}
EXPORT_SYMBOL(migrate_vma_finalize);
+
+/*
+ * Migrate a device coherent page back to normal memory. The caller should have
+ * a reference on page which will be copied to the new page if migration is
+ * successful or dropped on failure.
+ */
+struct page *migrate_device_page(struct page *page, unsigned int gup_flags)
+{
+ unsigned long src_pfn, dst_pfn = 0;
+ struct migrate_vma args;
+ struct page *dpage;
+
+ lock_page(page);
+ src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE;
+ args.src = &src_pfn;
+ args.dst = &dst_pfn;
+ args.cpages = 1;
+ args.npages = 1;
+ args.vma = NULL;
+ migrate_vma_setup(&args);
+ if (!(src_pfn & MIGRATE_PFN_MIGRATE))
+ return NULL;
+
+ dpage = alloc_pages(GFP_USER | __GFP_NOWARN, 0);
+
+ /*
+ * get/pin the new page now so we don't have to retry gup after
+ * migrating. We already have a reference so this should never fail.
+ */
+ if (dpage && WARN_ON_ONCE(!try_grab_page(dpage, gup_flags))) {
+ __free_pages(dpage, 0);
+ dpage = NULL;
+ }
+
+ if (dpage) {
+ lock_page(dpage);
+ dst_pfn = migrate_pfn(page_to_pfn(dpage));
+ }
+
+ migrate_vma_pages(&args);
+ if (src_pfn & MIGRATE_PFN_MIGRATE)
+ copy_highpage(dpage, page);
+ migrate_vma_finalize(&args);
+ if (dpage && !(src_pfn & MIGRATE_PFN_MIGRATE)) {
+ if (gup_flags & FOLL_PIN)
+ unpin_user_page(dpage);
+ else
+ put_page(dpage);
+ dpage = NULL;
+ }
+
+ return dpage;
+}
--
2.30.2


2022-02-10 11:59:12

by Alistair Popple

[permalink] [raw]
Subject: Re: start sorting out the ZONE_DEVICE refcount mess v2

On Thursday, 10 February 2022 6:28:01 PM AEDT Christoph Hellwig wrote:

[...]

> Changes since v1:
> - add a missing memremap.h include in memcontrol.c
> - include rebased versions of the device coherent support and
> device coherent migration support series as well as additional
> cleanup patches

Thanks for the rebase. I will take a closer look at it tomorrow but I just
ran the hmm-tests and they are all still passing for me with this series.

> Diffstt:
> arch/arm64/mm/mmu.c | 1
> arch/powerpc/kvm/book3s_hv_uvmem.c | 1
> drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 35 -
> drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 1
> drivers/gpu/drm/drm_cache.c | 2
> drivers/gpu/drm/nouveau/nouveau_dmem.c | 3
> drivers/gpu/drm/nouveau/nouveau_svm.c | 1
> drivers/infiniband/core/rw.c | 1
> drivers/nvdimm/pmem.h | 1
> drivers/nvme/host/pci.c | 1
> drivers/nvme/target/io-cmd-bdev.c | 1
> fs/Kconfig | 2
> fs/fuse/virtio_fs.c | 1
> include/linux/hmm.h | 9
> include/linux/memremap.h | 36 +
> include/linux/migrate.h | 1
> include/linux/mm.h | 59 --
> lib/test_hmm.c | 353 ++++++++++---
> lib/test_hmm_uapi.h | 22
> mm/Kconfig | 7
> mm/Makefile | 1
> mm/gup.c | 127 +++-
> mm/internal.h | 3
> mm/memcontrol.c | 19
> mm/memory-failure.c | 8
> mm/memremap.c | 75 +-
> mm/migrate.c | 763 ----------------------------
> mm/migrate_device.c | 822 +++++++++++++++++++++++++++++++
> mm/rmap.c | 5
> mm/swap.c | 49 -
> tools/testing/selftests/vm/Makefile | 2
> tools/testing/selftests/vm/hmm-tests.c | 204 ++++++-
> tools/testing/selftests/vm/test_hmm.sh | 24
> 33 files changed, 1552 insertions(+), 1088 deletions(-)
>





2022-02-10 12:44:35

by Christoph Hellwig

[permalink] [raw]
Subject: [PATCH 02/27] mm: remove the __KERNEL__ guard from <linux/mm.h>

__KERNEL__ ifdefs don't make sense outside of include/uapi/.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
---
include/linux/mm.h | 4 ----
1 file changed, 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 213cc569b19223..7b46174989b086 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3,9 +3,6 @@
#define _LINUX_MM_H

#include <linux/errno.h>
-
-#ifdef __KERNEL__
-
#include <linux/mmdebug.h>
#include <linux/gfp.h>
#include <linux/bug.h>
@@ -3381,5 +3378,4 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
}
#endif

-#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
--
2.30.2


2022-02-10 14:08:26

by Christoph Hellwig

[permalink] [raw]
Subject: [PATCH 09/27] mm: generalize the pgmap based page_free infrastructure

Key off on the existence of ->page_free to prepare for adding support for
more pgmap types that are device managed and thus need the free callback.

Signed-off-by: Christoph Hellwig <[email protected]>
---
mm/memremap.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memremap.c b/mm/memremap.c
index fef5734d5e4933..e00ffcdba7b632 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -452,7 +452,7 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);

void free_zone_device_page(struct page *page)
{
- if (WARN_ON_ONCE(!is_device_private_page(page)))
+ if (WARN_ON_ONCE(!page->pgmap->ops || !page->pgmap->ops->page_free))
return;

__ClearPageWaiters(page);
@@ -460,7 +460,7 @@ void free_zone_device_page(struct page *page)
mem_cgroup_uncharge(page_folio(page));

/*
- * When a device_private page is freed, the page->mapping field
+ * When a device managed page is freed, the page->mapping field
* may still contain a (stale) mapping value. For example, the
* lower bits of page->mapping may still identify the page as an
* anonymous page. Ultimately, this entire field is just stale
--
2.30.2


2022-02-10 20:15:36

by Miaohe Lin

[permalink] [raw]
Subject: Re: [PATCH 01/27] mm: remove a pointless CONFIG_ZONE_DEVICE check in memremap_pages

On 2022/2/10 15:28, Christoph Hellwig wrote:
> memremap.c is only built when CONFIG_ZONE_DEVICE is set, so remove
> the superflous extra check.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> Reviewed-by: Logan Gunthorpe <[email protected]>
> Reviewed-by: Jason Gunthorpe <[email protected]>
> Reviewed-by: Chaitanya Kulkarni <[email protected]>
> Reviewed-by: Muchun Song <[email protected]>
> Reviewed-by: Dan Williams <[email protected]>
> ---
> mm/memremap.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 6aa5f0c2d11fda..5f04a0709e436e 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -328,8 +328,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
> }
> break;
> case MEMORY_DEVICE_FS_DAX:
> - if (!IS_ENABLED(CONFIG_ZONE_DEVICE) ||
> - IS_ENABLED(CONFIG_FS_DAX_LIMITED)) {
> + if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) {
> WARN(1, "File system DAX not supported\n");
> return ERR_PTR(-EINVAL);
> }
>

LGTM. Thanks.

Reviewed-by: Miaohe Lin <[email protected]>

Subject: Re: start sorting out the ZONE_DEVICE refcount mess v2

Christoph,
Thanks a lot for rebase our patches. I just ran our amdgpu hmm-tests
with this series and all passed.

Regards,
Alex Sierra

On 2/10/2022 1:28 AM, Christoph Hellwig wrote:
> Hi all,
>
> this series removes the offset by one refcount for ZONE_DEVICE pages
> that are freed back to the driver owning them, which is just device
> private ones for now, but also the planned device coherent pages
> and the ehanced p2p ones pending.
>
> It does not address the fsdax pages yet, which will be attacked in a
> follow on series.
>
> Note that if we want to get the p2p series rebased on top of this
> we'll need a git branch for this series. I could offer to host one.
>
> A git tree is available here:
>
> git://git.infradead.org/users/hch/misc.git pgmap-refcount
>
> Gitweb:
>
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/pgmap-refcount
>
> Changes since v1:
> - add a missing memremap.h include in memcontrol.c
> - include rebased versions of the device coherent support and
> device coherent migration support series as well as additional
> cleanup patches
>
> Diffstt:
> arch/arm64/mm/mmu.c | 1
> arch/powerpc/kvm/book3s_hv_uvmem.c | 1
> drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 35 -
> drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 1
> drivers/gpu/drm/drm_cache.c | 2
> drivers/gpu/drm/nouveau/nouveau_dmem.c | 3
> drivers/gpu/drm/nouveau/nouveau_svm.c | 1
> drivers/infiniband/core/rw.c | 1
> drivers/nvdimm/pmem.h | 1
> drivers/nvme/host/pci.c | 1
> drivers/nvme/target/io-cmd-bdev.c | 1
> fs/Kconfig | 2
> fs/fuse/virtio_fs.c | 1
> include/linux/hmm.h | 9
> include/linux/memremap.h | 36 +
> include/linux/migrate.h | 1
> include/linux/mm.h | 59 --
> lib/test_hmm.c | 353 ++++++++++---
> lib/test_hmm_uapi.h | 22
> mm/Kconfig | 7
> mm/Makefile | 1
> mm/gup.c | 127 +++-
> mm/internal.h | 3
> mm/memcontrol.c | 19
> mm/memory-failure.c | 8
> mm/memremap.c | 75 +-
> mm/migrate.c | 763 ----------------------------
> mm/migrate_device.c | 822 +++++++++++++++++++++++++++++++
> mm/rmap.c | 5
> mm/swap.c | 49 -
> tools/testing/selftests/vm/Makefile | 2
> tools/testing/selftests/vm/hmm-tests.c | 204 ++++++-
> tools/testing/selftests/vm/test_hmm.sh | 24
> 33 files changed, 1552 insertions(+), 1088 deletions(-)

2022-02-14 19:50:18

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 09/27] mm: generalize the pgmap based page_free infrastructure



On 2022-02-10 12:28 a.m., Christoph Hellwig wrote:
> Key off on the existence of ->page_free to prepare for adding support for
> more pgmap types that are device managed and thus need the free callback.
>
> Signed-off-by: Christoph Hellwig <[email protected]>

Great! This makes my patch simpler.

Reviewed-by: Logan Gunthorpe <[email protected]>


> ---
> mm/memremap.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memremap.c b/mm/memremap.c
> index fef5734d5e4933..e00ffcdba7b632 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -452,7 +452,7 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
>
> void free_zone_device_page(struct page *page)
> {
> - if (WARN_ON_ONCE(!is_device_private_page(page)))
> + if (WARN_ON_ONCE(!page->pgmap->ops || !page->pgmap->ops->page_free))
> return;
>
> __ClearPageWaiters(page);
> @@ -460,7 +460,7 @@ void free_zone_device_page(struct page *page)
> mem_cgroup_uncharge(page_folio(page));
>
> /*
> - * When a device_private page is freed, the page->mapping field
> + * When a device managed page is freed, the page->mapping field
> * may still contain a (stale) mapping value. For example, the
> * lower bits of page->mapping may still identify the page as an
> * anonymous page. Ultimately, this entire field is just stale
>