Subject: [PATCH v2 00/11] Add MEMORY_DEVICE_COHERENT for coherent device memory mapping

This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory
owned by a device that can be mapped into CPU page tables like
MEMORY_DEVICE_GENERIC and can also be migrated like
MEMORY_DEVICE_PRIVATE.

Christoph, the suggestion to incorporate Ralph Campbell’s refcount
cleanup patch into our hardware page migration patchset originally came
from you, but it proved impractical to do things in that order because
the refcount cleanup introduced a bug with wide-ranging structural
implications. Instead, we amended Ralph’s patch so that it could be
applied after merging the migration work. As we saw from the recent
discussion, merging the refcount work is going to take some time and
cooperation between multiple development groups, while the migration
work is ready now and is needed now. So we propose to merge this
patchset first and continue to work with Ralph and others to merge the
refcount cleanup separately, when it is ready.

This patch series is mostly self-contained except for a few places where
it needs to update other subsystems to handle the new memory type.
System stability and performance are not affected according to our
ongoing testing, including xfstests.

How it works: The system BIOS advertises the GPU device memory
(aka VRAM) as SPM (special purpose memory) in the UEFI system address
map.

The amdgpu driver registers the memory with devmap as
MEMORY_DEVICE_COHERENT using devm_memremap_pages. The initial user for
this hardware page migration capability is the Frontier supercomputer
project. This functionality is not AMD-specific. We expect other GPU
vendors to find this functionality useful, and possibly other hardware
types in the future.
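
For illustration, the registration looks roughly like this. This is a
minimal sketch, not the actual amdgpu code; the function and variable
names are made up, but the dev_pagemap fields and the
devm_memremap_pages() call show how the new memory type is meant to be
used:

  #include <linux/device.h>
  #include <linux/err.h>
  #include <linux/ioport.h>
  #include <linux/memremap.h>
  #include <linux/slab.h>

  /*
   * Register a coherent device memory range (e.g. the SPM range the
   * BIOS advertised) with devmap. The resulting struct pages are
   * CPU-addressable and can be mapped into CPU page tables like
   * MEMORY_DEVICE_GENERIC, yet remain migratable like
   * MEMORY_DEVICE_PRIVATE.
   */
  static void *register_coherent_vram(struct device *dev,
  				    struct resource *res,
  				    const struct dev_pagemap_ops *ops)
  {
  	struct dev_pagemap *pgmap;

  	pgmap = devm_kzalloc(dev, sizeof(*pgmap), GFP_KERNEL);
  	if (!pgmap)
  		return ERR_PTR(-ENOMEM);

  	pgmap->type = MEMORY_DEVICE_COHERENT;	/* new type in this series */
  	pgmap->range.start = res->start;
  	pgmap->range.end = res->end;
  	pgmap->nr_range = 1;
  	pgmap->ops = ops;	/* must provide .page_free and .migrate_to_ram */
  	pgmap->owner = dev;	/* matched against pgmap_owner in migrate_vma_setup() */

  	return devm_memremap_pages(dev, pgmap);
  }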

Our test nodes in the lab are similar to the Frontier configuration,
with 0.5 TB of system memory plus 256 GB of device memory split across
4 GPUs, all in a single coherent address space. Page migration is
expected to improve application efficiency significantly. We will
report empirical results as they become available.

We extended hmm_test to cover migration of MEMORY_DEVICE_COHERENT. This
patch set builds on HMM and our SVM memory manager already merged in
5.15.

v2:
- test_hmm is now able to create private and coherent device mirror
instances in the same driver probe. This makes the hmm test more usable:
the kernel module no longer has to be removed and reloaded for each
device type under test (private/coherent). This is done by passing the
module parameters spm_addr_dev0 & spm_addr_dev1. In this case, four
instances of device_mirror are created. The first two correspond to the
private device type, the last two to the coherent type. They can then be
easily accessed from user space through /dev/hmm_dmirror<num_device>.
Usually num_device 0 and 1 are for private, and 2 and 3 for coherent
types (a usage sketch follows this list).

- Coherent device type pages are now migrated back to system memory at
gup if they have been long-term pinned (FOLL_LONGTERM). The reason is
that such pins could eventually interfere with the device's own memory
manager. A new hmm_gup_test has been added to the hmm tests to exercise
this functionality. It makes use of the gup_test module to long-term pin
user pages that have first been migrated to device memory.

- Other patch corrections made by Felix, Alistair and Christoph.
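
For reference, a minimal sketch of what the updated test_hmm.sh script
(patch 10) does when both SP ranges are passed; the addresses are the
efi_fake_mem example values from patch 8:

  # Boot with two fake SP ranges reserved, e.g.:
  #   efi_fake_mem=1G@0x100000000:0x40000,1G@0x140000000:0x40000
  modprobe test_hmm spm_addr_dev0=0x100000000 spm_addr_dev1=0x140000000
  major=$(awk '$2=="HMM_DMIRROR" {print $1}' /proc/devices)
  mknod /dev/hmm_dmirror0 c $major 0   # private
  mknod /dev/hmm_dmirror1 c $major 1   # private
  mknod /dev/hmm_dmirror2 c $major 2   # coherent
  mknod /dev/hmm_dmirror3 c $major 3   # coherent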

Alex Sierra (11):
mm: add zone device coherent type memory support
mm: add device coherent vma selection for memory migration
mm/gup: migrate PIN_LONGTERM dev coherent pages to system
drm/amdkfd: add SPM support for SVM
drm/amdkfd: coherent type as sys mem on migration to ram
lib: test_hmm add ioctl to get zone device type
lib: test_hmm add module param for zone device type
lib: add support for device coherent type in test_hmm
tools: update hmm-test to support device coherent type
tools: update test_hmm script to support SP config
tools: add hmm gup test for long term pinned device pages

drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 34 ++-
include/linux/memremap.h | 8 +
include/linux/migrate.h | 1 +
include/linux/mm.h | 16 ++
lib/test_hmm.c | 338 +++++++++++++++++------
lib/test_hmm_uapi.h | 22 +-
mm/gup.c | 32 ++-
mm/memcontrol.c | 6 +-
mm/memory-failure.c | 8 +-
mm/memremap.c | 5 +-
mm/migrate.c | 30 +-
tools/testing/selftests/vm/Makefile | 2 +-
tools/testing/selftests/vm/hmm-tests.c | 203 ++++++++++++--
tools/testing/selftests/vm/test_hmm.sh | 24 +-
14 files changed, 585 insertions(+), 144 deletions(-)

--
2.32.0



Subject: [PATCH v2 09/11] tools: update hmm-test to support device coherent type

Test cases such as migrate_fault and migrate_multiple were modified to
migrate explicitly from device to system memory, without relying on
page faults, when using the device coherent type.

The snapshot test case was updated to read the memory device type first
and, based on that, check for the proper returned results. A
migrate_ping_pong test case was added to test explicit migration from
device to system memory for both private and coherent zone types.

Helpers to migrate from device to system memory and vice versa
were also added.

Signed-off-by: Alex Sierra <[email protected]>
---
v2:
Use FIXTURE_VARIANT to add multiple device types to the FIXTURE. This
runs all the tests for each device type (private and coherent), in case
both were created when the hmm-test driver was probed.
---
tools/testing/selftests/vm/hmm-tests.c | 122 ++++++++++++++++++++-----
1 file changed, 101 insertions(+), 21 deletions(-)

diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
index 864f126ffd78..8eb81dfba4b3 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -44,6 +44,14 @@ struct hmm_buffer {
int fd;
uint64_t cpages;
uint64_t faults;
+ int zone_device_type;
+};
+
+enum {
+ HMM_PRIVATE_DEVICE_ONE,
+ HMM_PRIVATE_DEVICE_TWO,
+ HMM_COHERENCE_DEVICE_ONE,
+ HMM_COHERENCE_DEVICE_TWO,
};

#define TWOMEG (1 << 21)
@@ -60,6 +68,21 @@ FIXTURE(hmm)
unsigned int page_shift;
};

+FIXTURE_VARIANT(hmm)
+{
+ int device_number;
+};
+
+FIXTURE_VARIANT_ADD(hmm, hmm_device_private)
+{
+ .device_number = HMM_PRIVATE_DEVICE_ONE,
+};
+
+FIXTURE_VARIANT_ADD(hmm, hmm_device_coherent)
+{
+ .device_number = HMM_COHERENCE_DEVICE_ONE,
+};
+
FIXTURE(hmm2)
{
int fd0;
@@ -68,6 +91,24 @@ FIXTURE(hmm2)
unsigned int page_shift;
};

+FIXTURE_VARIANT(hmm2)
+{
+ int device_number0;
+ int device_number1;
+};
+
+FIXTURE_VARIANT_ADD(hmm2, hmm2_device_private)
+{
+ .device_number0 = HMM_PRIVATE_DEVICE_ONE,
+ .device_number1 = HMM_PRIVATE_DEVICE_TWO,
+};
+
+FIXTURE_VARIANT_ADD(hmm2, hmm2_device_coherent)
+{
+ .device_number0 = HMM_COHERENCE_DEVICE_ONE,
+ .device_number1 = HMM_COHERENCE_DEVICE_TWO,
+};
+
static int hmm_open(int unit)
{
char pathname[HMM_PATH_MAX];
@@ -81,12 +122,19 @@ static int hmm_open(int unit)
return fd;
}

+static bool hmm_is_coherent_type(int dev_num)
+{
+ return (dev_num >= HMM_COHERENCE_DEVICE_ONE);
+}
+
FIXTURE_SETUP(hmm)
{
self->page_size = sysconf(_SC_PAGE_SIZE);
self->page_shift = ffs(self->page_size) - 1;

- self->fd = hmm_open(0);
+ self->fd = hmm_open(variant->device_number);
+ if (self->fd < 0 && hmm_is_coherent_type(variant->device_number))
+ SKIP(exit(0), "DEVICE_COHERENT not available");
ASSERT_GE(self->fd, 0);
}

@@ -95,9 +143,11 @@ FIXTURE_SETUP(hmm2)
self->page_size = sysconf(_SC_PAGE_SIZE);
self->page_shift = ffs(self->page_size) - 1;

- self->fd0 = hmm_open(0);
+ self->fd0 = hmm_open(variant->device_number0);
+ if (self->fd0 < 0 && hmm_is_coherent_type(variant->device_number0))
+ SKIP(exit(0), "DEVICE_COHERENT not available");
ASSERT_GE(self->fd0, 0);
- self->fd1 = hmm_open(1);
+ self->fd1 = hmm_open(variant->device_number1);
ASSERT_GE(self->fd1, 0);
}

@@ -144,6 +194,7 @@ static int hmm_dmirror_cmd(int fd,
}
buffer->cpages = cmd.cpages;
buffer->faults = cmd.faults;
+ buffer->zone_device_type = cmd.zone_device_type;

return 0;
}
@@ -211,6 +262,20 @@ static void hmm_nanosleep(unsigned int n)
nanosleep(&t, NULL);
}

+static int hmm_migrate_sys_to_dev(int fd,
+ struct hmm_buffer *buffer,
+ unsigned long npages)
+{
+ return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE_TO_DEV, buffer, npages);
+}
+
+static int hmm_migrate_dev_to_sys(int fd,
+ struct hmm_buffer *buffer,
+ unsigned long npages)
+{
+ return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE_TO_SYS, buffer, npages);
+}
+
/*
* Simple NULL test of device open/close.
*/
@@ -875,7 +940,7 @@ TEST_F(hmm, migrate)
ptr[i] = i;

/* Migrate memory to device. */
- ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, npages);

@@ -923,7 +988,7 @@ TEST_F(hmm, migrate_fault)
ptr[i] = i;

/* Migrate memory to device. */
- ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, npages);

@@ -936,7 +1001,7 @@ TEST_F(hmm, migrate_fault)
ASSERT_EQ(ptr[i], i);

/* Migrate memory to the device again. */
- ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, npages);

@@ -976,7 +1041,7 @@ TEST_F(hmm, migrate_shared)
ASSERT_NE(buffer->ptr, MAP_FAILED);

/* Migrate memory to device. */
- ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
ASSERT_EQ(ret, -ENOENT);

hmm_buffer_free(buffer);
@@ -1015,7 +1080,7 @@ TEST_F(hmm2, migrate_mixed)
p = buffer->ptr;

/* Migrating a protected area should be an error. */
- ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, npages);
+ ret = hmm_migrate_sys_to_dev(self->fd1, buffer, npages);
ASSERT_EQ(ret, -EINVAL);

/* Punch a hole after the first page address. */
@@ -1023,7 +1088,7 @@ TEST_F(hmm2, migrate_mixed)
ASSERT_EQ(ret, 0);

/* We expect an error if the vma doesn't cover the range. */
- ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 3);
+ ret = hmm_migrate_sys_to_dev(self->fd1, buffer, 3);
ASSERT_EQ(ret, -EINVAL);

/* Page 2 will be a read-only zero page. */
@@ -1055,13 +1120,13 @@ TEST_F(hmm2, migrate_mixed)

/* Now try to migrate pages 2-5 to device 1. */
buffer->ptr = p + 2 * self->page_size;
- ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 4);
+ ret = hmm_migrate_sys_to_dev(self->fd1, buffer, 4);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, 4);

/* Page 5 won't be migrated to device 0 because it's on device 1. */
buffer->ptr = p + 5 * self->page_size;
- ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_MIGRATE, buffer, 1);
+ ret = hmm_migrate_sys_to_dev(self->fd0, buffer, 1);
ASSERT_EQ(ret, -ENOENT);
buffer->ptr = p;

@@ -1070,8 +1135,12 @@ TEST_F(hmm2, migrate_mixed)
}

/*
- * Migrate anonymous memory to device private memory and fault it back to system
- * memory multiple times.
+ * Migrate anonymous memory to device memory and back to system memory
+ * multiple times. For the private zone configuration this happens via
+ * page faults on CPU access. For the coherent zone configuration the
+ * pages must be migrated back to system memory explicitly, because
+ * coherent device memory is coherently accessible by the CPU and
+ * therefore never generates a page fault.
*/
TEST_F(hmm, migrate_multiple)
{
@@ -1107,8 +1176,7 @@ TEST_F(hmm, migrate_multiple)
ptr[i] = i;

/* Migrate memory to device. */
- ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer,
- npages);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, npages);

@@ -1116,7 +1184,12 @@ TEST_F(hmm, migrate_multiple)
for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
ASSERT_EQ(ptr[i], i);

- /* Fault pages back to system memory and check them. */
+ /* Migrate back to system memory and check them. */
+ if (hmm_is_coherent_type(variant->device_number)) {
+ ret = hmm_migrate_dev_to_sys(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ }
+
for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
ASSERT_EQ(ptr[i], i);

@@ -1312,13 +1385,13 @@ TEST_F(hmm2, snapshot)

/* Page 5 will be migrated to device 0. */
buffer->ptr = p + 5 * self->page_size;
- ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_MIGRATE, buffer, 1);
+ ret = hmm_migrate_sys_to_dev(self->fd0, buffer, 1);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, 1);

/* Page 6 will be migrated to device 1. */
buffer->ptr = p + 6 * self->page_size;
- ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 1);
+ ret = hmm_migrate_sys_to_dev(self->fd1, buffer, 1);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, 1);

@@ -1335,9 +1408,16 @@ TEST_F(hmm2, snapshot)
ASSERT_EQ(m[2], HMM_DMIRROR_PROT_ZERO | HMM_DMIRROR_PROT_READ);
ASSERT_EQ(m[3], HMM_DMIRROR_PROT_READ);
ASSERT_EQ(m[4], HMM_DMIRROR_PROT_WRITE);
- ASSERT_EQ(m[5], HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL |
- HMM_DMIRROR_PROT_WRITE);
- ASSERT_EQ(m[6], HMM_DMIRROR_PROT_NONE);
+ if (!hmm_is_coherent_type(variant->device_number0)) {
+ ASSERT_EQ(m[5], HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL |
+ HMM_DMIRROR_PROT_WRITE);
+ ASSERT_EQ(m[6], HMM_DMIRROR_PROT_NONE);
+ } else {
+ ASSERT_EQ(m[5], HMM_DMIRROR_PROT_DEV_COHERENT_LOCAL |
+ HMM_DMIRROR_PROT_WRITE);
+ ASSERT_EQ(m[6], HMM_DMIRROR_PROT_DEV_COHERENT_REMOTE |
+ HMM_DMIRROR_PROT_WRITE);
+ }

hmm_buffer_free(buffer);
}
--
2.32.0


Subject: [PATCH v2 11/11] tools: add hmm gup test for long term pinned device pages

The intention is to test device coherent type pages that have been
pinned through get_user_pages with the PIN_LONGTERM flag set.

Signed-off-by: Alex Sierra <[email protected]>
---
tools/testing/selftests/vm/Makefile | 2 +-
tools/testing/selftests/vm/hmm-tests.c | 81 ++++++++++++++++++++++++++
2 files changed, 82 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index d9605bd10f2d..527a7bfd80bd 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -141,7 +141,7 @@ $(OUTPUT)/mlock-random-test $(OUTPUT)/memfd_secret: LDLIBS += -lcap

$(OUTPUT)/gup_test: ../../../../mm/gup_test.h

-$(OUTPUT)/hmm-tests: local_config.h
+$(OUTPUT)/hmm-tests: local_config.h ../../../../mm/gup_test.h

# HMM_EXTRA_LIBS may get set in local_config.mk, or it may be left empty.
$(OUTPUT)/hmm-tests: LDLIBS += $(HMM_EXTRA_LIBS)
diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
index 8eb81dfba4b3..9a0b7e44a674 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -36,6 +36,7 @@
* in the usual include/uapi/... directory.
*/
#include "../../../../lib/test_hmm_uapi.h"
+#include "../../../../mm/gup_test.h"

struct hmm_buffer {
void *ptr;
@@ -60,6 +61,8 @@ enum {
#define NTIMES 10

#define ALIGN(x, a) (((x) + (a - 1)) & (~((a) - 1)))
+/* Just the flags we need, copied from mm.h: */
+#define FOLL_WRITE 0x01 /* check pte is writable */

FIXTURE(hmm)
{
@@ -1723,4 +1726,82 @@ TEST_F(hmm, exclusive_cow)
hmm_buffer_free(buffer);
}

+/*
+ * Test getting user device pages through gup_test with the PIN_LONGTERM
+ * flag set. This should trigger a migration back to system memory for
+ * both private and coherent type pages.
+ * This test makes use of the gup_test module. Make sure CONFIG_GUP_TEST
+ * is enabled in your configuration before you run it.
+ */
+TEST_F(hmm, hmm_gup_test)
+{
+ struct hmm_buffer *buffer;
+ struct gup_test gup;
+ int gup_fd;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long i;
+ int *ptr;
+ int ret;
+ unsigned char *m;
+
+ gup_fd = open("/sys/kernel/debug/gup_test", O_RDWR);
+ if (gup_fd == -1)
+ SKIP(return, "Skipping test, could not find gup_test driver");
+
+ npages = 4;
+ ASSERT_NE(npages, 0);
+ size = npages << self->page_shift;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = size;
+ buffer->mirror = malloc(size);
+ ASSERT_NE(buffer->mirror, NULL);
+
+ buffer->ptr = mmap(NULL, size,
+ PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS,
+ buffer->fd, 0);
+ ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+ /* Initialize buffer in system memory. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ptr[i] = i;
+
+ /* Migrate memory to device. */
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+ /* Check what the device read. */
+ for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], i);
+
+ gup.nr_pages_per_call = npages;
+ gup.addr = (unsigned long)buffer->ptr;
+ gup.gup_flags = FOLL_WRITE;
+ gup.size = size;
+ /*
+ * Call the gup_test ioctl. It will try to PIN_LONGTERM these device
+ * pages, causing a migration back to system memory for both private
+ * and coherent type pages.
+ */
+ if (ioctl(gup_fd, PIN_LONGTERM_BENCHMARK, &gup)) {
+ perror("ioctl on PIN_LONGTERM_BENCHMARK\n");
+ goto out_test;
+ }
+
+ /* Take snapshot to make sure pages have been migrated to sys memory */
+ ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_SNAPSHOT, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+ m = buffer->mirror;
+ for (i = 0; i < npages; i++)
+ ASSERT_EQ(m[i], HMM_DMIRROR_PROT_WRITE);
+out_test:
+ close(gup_fd);
+ hmm_buffer_free(buffer);
+}
TEST_HARNESS_MAIN
--
2.32.0


Subject: [PATCH v2 03/11] mm/gup: migrate PIN_LONGTERM dev coherent pages to system

Avoid long-term pinning of coherent device type pages, since that could
interfere with the device's own memory manager.
If a caller tries to get user device coherent pages with the
PIN_LONGTERM flag set, those pages will be migrated back to system
memory.
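
For an in-kernel caller the effect looks roughly like this. This is a
minimal sketch, not code from this series (hmm_gup_test in patch 11
exercises the same path from user space through the gup_test module):

  #include <linux/mm.h>

  /*
   * Sketch: pin a user address range for long-term use. If any of the
   * pages were resident in device coherent memory, GUP migrates them
   * back to system memory before pinning, so the longterm pin never
   * holds device pages.
   */
  static int longterm_pin_user_range(unsigned long addr, int nr_pages,
  				   struct page **pages)
  {
  	int ret;

  	ret = pin_user_pages_fast(addr, nr_pages,
  				  FOLL_WRITE | FOLL_LONGTERM, pages);
  	if (ret < 0)
  		return ret;

  	/* ... use the pinned system-memory pages ... */

  	unpin_user_pages(pages, ret);
  	return 0;
  }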

Signed-off-by: Alex Sierra <[email protected]>
---
mm/gup.c | 32 ++++++++++++++++++++++++++++++--
1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 886d6148d3d0..1572eacf07f4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1689,17 +1689,37 @@ struct page *get_dump_page(unsigned long addr)
#endif /* CONFIG_ELF_CORE */

#ifdef CONFIG_MIGRATION
+static int migrate_device_page(unsigned long address,
+ struct page *page)
+{
+ struct vm_area_struct *vma = find_vma(current->mm, address);
+ struct vm_fault vmf = {
+ .vma = vma,
+ .address = address & PAGE_MASK,
+ .flags = FAULT_FLAG_USER,
+ .pgoff = linear_page_index(vma, address),
+ .gfp_mask = GFP_KERNEL,
+ .page = page,
+ };
+ if (page->pgmap && page->pgmap->ops->migrate_to_ram)
+ return page->pgmap->ops->migrate_to_ram(&vmf);
+
+ return -EBUSY;
+}
+
/*
* Check whether all pages are pinnable, if so return number of pages. If some
* pages are not pinnable, migrate them, and unpin all pages. Return zero if
* pages were migrated, or if some pages were not successfully isolated.
* Return negative error if migration fails.
*/
-static long check_and_migrate_movable_pages(unsigned long nr_pages,
+static long check_and_migrate_movable_pages(unsigned long start,
+ unsigned long nr_pages,
struct page **pages,
unsigned int gup_flags)
{
unsigned long i;
+ unsigned long page_index;
unsigned long isolation_error_count = 0;
bool drain_allow = true;
LIST_HEAD(movable_page_list);
@@ -1720,6 +1740,10 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
* If we get a movable page, since we are going to be pinning
* these entries, try to move them out if possible.
*/
+ if (is_device_page(head)) {
+ page_index = i;
+ goto unpin_pages;
+ }
if (!is_pinnable_page(head)) {
if (PageHuge(head)) {
if (!isolate_huge_page(head, &movable_page_list))
@@ -1750,12 +1774,16 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
if (list_empty(&movable_page_list) && !isolation_error_count)
return nr_pages;

+unpin_pages:
if (gup_flags & FOLL_PIN) {
unpin_user_pages(pages, nr_pages);
} else {
for (i = 0; i < nr_pages; i++)
put_page(pages[i]);
}
+ if (is_device_page(head))
+ return migrate_device_page(start + page_index * PAGE_SIZE, head);
+
if (!list_empty(&movable_page_list)) {
ret = migrate_pages(&movable_page_list, alloc_migration_target,
NULL, (unsigned long)&mtc, MIGRATE_SYNC,
@@ -1798,7 +1826,7 @@ static long __gup_longterm_locked(struct mm_struct *mm,
NULL, gup_flags);
if (rc <= 0)
break;
- rc = check_and_migrate_movable_pages(rc, pages, gup_flags);
+ rc = check_and_migrate_movable_pages(start, rc, pages, gup_flags);
} while (!rc);
memalloc_pin_restore(flags);

--
2.32.0


Subject: [PATCH v2 10/11] tools: update test_hmm script to support SP config

Add two more parameters to set the spm_addr_dev0 & spm_addr_dev1
addresses. These two parameters configure the start SP addresses for
each device in the test_hmm driver and, consequently, set the zone
device type to coherent.

Signed-off-by: Alex Sierra <[email protected]>
---
v2:
Add more mknods for the device coherent type. These are represented as
/dev/hmm_dmirror2 and /dev/hmm_dmirror3, only in case they were created
when the hmm-test driver was probed.
---
tools/testing/selftests/vm/test_hmm.sh | 24 +++++++++++++++++++++---
1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/vm/test_hmm.sh b/tools/testing/selftests/vm/test_hmm.sh
index 0647b525a625..539c9371e592 100755
--- a/tools/testing/selftests/vm/test_hmm.sh
+++ b/tools/testing/selftests/vm/test_hmm.sh
@@ -40,11 +40,26 @@ check_test_requirements()

load_driver()
{
- modprobe $DRIVER > /dev/null 2>&1
+ if [ $# -eq 0 ]; then
+ modprobe $DRIVER > /dev/null 2>&1
+ else
+ if [ $# -eq 2 ]; then
+ modprobe $DRIVER spm_addr_dev0=$1 spm_addr_dev1=$2 \
+ > /dev/null 2>&1
+ else
+ echo "Missing module parameters. Make sure pass"\
+ "spm_addr_dev0 and spm_addr_dev1"
+ usage
+ fi
+ fi
if [ $? == 0 ]; then
major=$(awk "\$2==\"HMM_DMIRROR\" {print \$1}" /proc/devices)
mknod /dev/hmm_dmirror0 c $major 0
mknod /dev/hmm_dmirror1 c $major 1
+ if [ $# -eq 2 ]; then
+ mknod /dev/hmm_dmirror2 c $major 2
+ mknod /dev/hmm_dmirror3 c $major 3
+ fi
fi
}

@@ -58,7 +73,7 @@ run_smoke()
{
echo "Running smoke test. Note, this test provides basic coverage."

- load_driver
+ load_driver $1 $2
$(dirname "${BASH_SOURCE[0]}")/hmm-tests
unload_driver
}
@@ -75,6 +90,9 @@ usage()
echo "# Smoke testing"
echo "./${TEST_NAME}.sh smoke"
echo
+ echo "# Smoke testing with SPM enabled"
+ echo "./${TEST_NAME}.sh smoke <spm_addr_dev0> <spm_addr_dev1>"
+ echo
exit 0
}

@@ -84,7 +102,7 @@ function run_test()
usage
else
if [ "$1" = "smoke" ]; then
- run_smoke
+ run_smoke $2 $3
else
usage
fi
--
2.32.0


Subject: [PATCH v2 08/11] lib: add support for device coherent type in test_hmm

Device coherent type uses device memory that is coherently accessible
by the CPU. This memory can show up as an SP (special purpose) memory
range in the BIOS-e820 memory enumeration. If no SP memory is supported
by the system, it can be faked by setting CONFIG_EFI_FAKE_MEMMAP.

Currently, test_hmm only supports two different SP ranges of at least
256MB each. These are specified via the efi_fake_mem kernel parameter.
For example, two SP ranges of 1GB each, starting at the 0x100000000 and
0x140000000 physical addresses:
efi_fake_mem=1G@0x100000000:0x40000,1G@0x140000000:0x40000

Private and coherent device mirror instances can be created in the same
driver probe. This is done by passing the module parameters
spm_addr_dev0 & spm_addr_dev1. In this case, four instances of
device_mirror are created. The first two correspond to the private
device type, the last two to the coherent type. They can then be easily
accessed from user space through /dev/hmm_dmirror<num_device>. Usually
num_device 0 and 1 are for private, and 2 and 3 for coherent types. If
no module parameters are passed, only two instances of the private type
device_mirror are created.

Signed-off-by: Alex Sierra <[email protected]>
---
lib/test_hmm.c | 252 +++++++++++++++++++++++++++++++++-----------
lib/test_hmm_uapi.h | 15 ++-
2 files changed, 198 insertions(+), 69 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 9edeff52302e..a1985226d788 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -29,11 +29,22 @@

#include "test_hmm_uapi.h"

-#define DMIRROR_NDEVICES 2
+#define DMIRROR_NDEVICES 4
#define DMIRROR_RANGE_FAULT_TIMEOUT 1000
#define DEVMEM_CHUNK_SIZE (256 * 1024 * 1024U)
#define DEVMEM_CHUNKS_RESERVE 16

+/*
+ * For device_private pages, dpage is just a dummy struct page
+ * representing a piece of device memory. dmirror_devmem_alloc_page
+ * allocates a real system memory page as backing storage to fake a
+ * real device. zone_device_data points to that backing page. But
+ * for device_coherent memory, the struct page represents real
+ * physical CPU-accessible memory that we can use directly.
+ */
+#define BACKING_PAGE(page) (is_device_private_page((page)) ? \
+ (page)->zone_device_data : (page))
+
static unsigned long spm_addr_dev0;
module_param(spm_addr_dev0, long, 0644);
MODULE_PARM_DESC(spm_addr_dev0,
@@ -122,6 +133,21 @@ static int dmirror_bounce_init(struct dmirror_bounce *bounce,
return 0;
}

+static bool dmirror_is_private_zone(struct dmirror_device *mdevice)
+{
+ return (mdevice->zone_device_type ==
+ HMM_DMIRROR_MEMORY_DEVICE_PRIVATE) ? true : false;
+}
+
+static enum migrate_vma_direction
+ dmirror_select_device(struct dmirror *dmirror)
+{
+ return (dmirror->mdevice->zone_device_type ==
+ HMM_DMIRROR_MEMORY_DEVICE_PRIVATE) ?
+ MIGRATE_VMA_SELECT_DEVICE_PRIVATE :
+ MIGRATE_VMA_SELECT_DEVICE_COHERENT;
+}
+
static void dmirror_bounce_fini(struct dmirror_bounce *bounce)
{
vfree(bounce->ptr);
@@ -572,16 +598,19 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
{
struct page *dpage = NULL;
- struct page *rpage;
+ struct page *rpage = NULL;

/*
- * This is a fake device so we alloc real system memory to store
- * our device memory.
+ * For ZONE_DEVICE private type, this is a fake device so we alloc real
+ * system memory to store our device memory.
+ * For ZONE_DEVICE coherent type we use the actual dpage to store the data
+ * and ignore rpage.
*/
- rpage = alloc_page(GFP_HIGHUSER);
- if (!rpage)
- return NULL;
-
+ if (dmirror_is_private_zone(mdevice)) {
+ rpage = alloc_page(GFP_HIGHUSER);
+ if (!rpage)
+ return NULL;
+ }
spin_lock(&mdevice->lock);

if (mdevice->free_pages) {
@@ -601,7 +630,8 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
return dpage;

error:
- __free_page(rpage);
+ if (rpage)
+ __free_page(rpage);
return NULL;
}

@@ -627,12 +657,15 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
* unallocated pte_none() or read-only zero page.
*/
spage = migrate_pfn_to_page(*src);
+ WARN(spage && is_zone_device_page(spage),
+ "page already in device spage pfn: 0x%lx\n",
+ page_to_pfn(spage));

dpage = dmirror_devmem_alloc_page(mdevice);
if (!dpage)
continue;

- rpage = dpage->zone_device_data;
+ rpage = BACKING_PAGE(dpage);
if (spage)
copy_highpage(rpage, spage);
else
@@ -646,6 +679,8 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
*/
rpage->zone_device_data = dmirror;

+ pr_debug("migrating from sys to dev pfn src: 0x%lx pfn dst: 0x%lx\n",
+ page_to_pfn(spage), page_to_pfn(dpage));
*dst = migrate_pfn(page_to_pfn(dpage)) |
MIGRATE_PFN_LOCKED;
if ((*src & MIGRATE_PFN_WRITE) ||
@@ -724,11 +759,7 @@ static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
if (!dpage)
continue;

- /*
- * Store the page that holds the data so the page table
- * doesn't have to deal with ZONE_DEVICE private pages.
- */
- entry = dpage->zone_device_data;
+ entry = BACKING_PAGE(dpage);
if (*dst & MIGRATE_PFN_WRITE)
entry = xa_tag_pointer(entry, DPT_XA_TAG_WRITE);
entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC);
@@ -808,8 +839,106 @@ static int dmirror_exclusive(struct dmirror *dmirror,
return ret;
}

-static int dmirror_migrate(struct dmirror *dmirror,
- struct hmm_dmirror_cmd *cmd)
+static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
+ struct dmirror *dmirror)
+{
+ const unsigned long *src = args->src;
+ unsigned long *dst = args->dst;
+ unsigned long start = args->start;
+ unsigned long end = args->end;
+ unsigned long addr;
+
+ for (addr = start; addr < end; addr += PAGE_SIZE,
+ src++, dst++) {
+ struct page *dpage, *spage;
+
+ spage = migrate_pfn_to_page(*src);
+ if (!spage || !(*src & MIGRATE_PFN_MIGRATE))
+ continue;
+
+ WARN_ON(!is_device_page(spage));
+ spage = BACKING_PAGE(spage);
+ dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
+ if (!dpage)
+ continue;
+ pr_debug("migrating from dev to sys pfn src: 0x%lx pfn dst: 0x%lx\n",
+ page_to_pfn(spage), page_to_pfn(dpage));
+
+ lock_page(dpage);
+ xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
+ copy_highpage(dpage, spage);
+ *dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
+ if (*src & MIGRATE_PFN_WRITE)
+ *dst |= MIGRATE_PFN_WRITE;
+ }
+ return 0;
+}
+
+static int dmirror_migrate_to_system(struct dmirror *dmirror,
+ struct hmm_dmirror_cmd *cmd)
+{
+ unsigned long start, end, addr;
+ unsigned long size = cmd->npages << PAGE_SHIFT;
+ struct mm_struct *mm = dmirror->notifier.mm;
+ struct vm_area_struct *vma;
+ unsigned long src_pfns[64];
+ unsigned long dst_pfns[64];
+ struct migrate_vma args;
+ unsigned long next;
+ int ret;
+
+ start = cmd->addr;
+ end = start + size;
+ if (end < start)
+ return -EINVAL;
+
+ /* Since the mm is for the mirrored process, get a reference first. */
+ if (!mmget_not_zero(mm))
+ return -EINVAL;
+
+ mmap_read_lock(mm);
+ for (addr = start; addr < end; addr = next) {
+ vma = find_vma(mm, addr);
+ if (!vma || addr < vma->vm_start ||
+ !(vma->vm_flags & VM_READ)) {
+ ret = -EINVAL;
+ goto out;
+ }
+ next = min(end, addr + (ARRAY_SIZE(src_pfns) << PAGE_SHIFT));
+ if (next > vma->vm_end)
+ next = vma->vm_end;
+
+ args.vma = vma;
+ args.src = src_pfns;
+ args.dst = dst_pfns;
+ args.start = addr;
+ args.end = next;
+ args.pgmap_owner = dmirror->mdevice;
+ args.flags = dmirror_select_device(dmirror);
+
+ ret = migrate_vma_setup(&args);
+ if (ret)
+ goto out;
+
+ pr_debug("Migrating from device mem to sys mem\n");
+ dmirror_devmem_fault_alloc_and_copy(&args, dmirror);
+
+ migrate_vma_pages(&args);
+ migrate_vma_finalize(&args);
+ }
+ mmap_read_unlock(mm);
+ mmput(mm);
+
+ return ret;
+
+out:
+ mmap_read_unlock(mm);
+ mmput(mm);
+ return ret;
+}
+
+static int dmirror_migrate_to_device(struct dmirror *dmirror,
+ struct hmm_dmirror_cmd *cmd)
{
unsigned long start, end, addr;
unsigned long size = cmd->npages << PAGE_SHIFT;
@@ -853,6 +982,7 @@ static int dmirror_migrate(struct dmirror *dmirror,
if (ret)
goto out;

+ pr_debug("Migrating from sys mem to device mem\n");
dmirror_migrate_alloc_and_copy(&args, dmirror);
migrate_vma_pages(&args);
dmirror_migrate_finalize_and_map(&args, dmirror);
@@ -861,7 +991,7 @@ static int dmirror_migrate(struct dmirror *dmirror,
mmap_read_unlock(mm);
mmput(mm);

- /* Return the migrated data for verification. */
+ /* Return the migrated data for verification. Only for pages in device zone. */
ret = dmirror_bounce_init(&bounce, start, size);
if (ret)
return ret;
@@ -898,12 +1028,22 @@ static void dmirror_mkentry(struct dmirror *dmirror, struct hmm_range *range,
}

page = hmm_pfn_to_page(entry);
- if (is_device_private_page(page)) {
- /* Is the page migrated to this device or some other? */
- if (dmirror->mdevice == dmirror_page_to_device(page))
+ if (is_device_page(page)) {
+ /* Is page ZONE_DEVICE coherent? */
+ if (!is_device_private_page(page)) {
+ if (dmirror->mdevice == dmirror_page_to_device(page))
+ *perm = HMM_DMIRROR_PROT_DEV_COHERENT_LOCAL;
+ else
+ *perm = HMM_DMIRROR_PROT_DEV_COHERENT_REMOTE;
+ /*
+ * Is page ZONE_DEVICE private migrated to
+ * this device or some other?
+ */
+ } else if (dmirror->mdevice == dmirror_page_to_device(page)) {
*perm = HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL;
- else
+ } else {
*perm = HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE;
+ }
} else if (is_zero_pfn(page_to_pfn(page)))
*perm = HMM_DMIRROR_PROT_ZERO;
else
@@ -1100,8 +1240,12 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
ret = dmirror_write(dmirror, &cmd);
break;

- case HMM_DMIRROR_MIGRATE:
- ret = dmirror_migrate(dmirror, &cmd);
+ case HMM_DMIRROR_MIGRATE_TO_DEV:
+ ret = dmirror_migrate_to_device(dmirror, &cmd);
+ break;
+
+ case HMM_DMIRROR_MIGRATE_TO_SYS:
+ ret = dmirror_migrate_to_system(dmirror, &cmd);
break;

case HMM_DMIRROR_EXCLUSIVE:
@@ -1142,14 +1286,13 @@ static const struct file_operations dmirror_fops = {

static void dmirror_devmem_free(struct page *page)
{
- struct page *rpage = page->zone_device_data;
+ struct page *rpage = BACKING_PAGE(page);
struct dmirror_device *mdevice;

- if (rpage)
+ if (rpage != page)
__free_page(rpage);

mdevice = dmirror_page_to_device(page);
-
spin_lock(&mdevice->lock);
mdevice->cfree++;
page->zone_device_data = mdevice->free_pages;
@@ -1157,38 +1300,6 @@ static void dmirror_devmem_free(struct page *page)
spin_unlock(&mdevice->lock);
}

-static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
- struct dmirror *dmirror)
-{
- const unsigned long *src = args->src;
- unsigned long *dst = args->dst;
- unsigned long start = args->start;
- unsigned long end = args->end;
- unsigned long addr;
-
- for (addr = start; addr < end; addr += PAGE_SIZE,
- src++, dst++) {
- struct page *dpage, *spage;
-
- spage = migrate_pfn_to_page(*src);
- if (!spage || !(*src & MIGRATE_PFN_MIGRATE))
- continue;
- spage = spage->zone_device_data;
-
- dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
- if (!dpage)
- continue;
-
- lock_page(dpage);
- xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
- copy_highpage(dpage, spage);
- *dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
- if (*src & MIGRATE_PFN_WRITE)
- *dst |= MIGRATE_PFN_WRITE;
- }
- return 0;
-}
-
static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
{
struct migrate_vma args;
@@ -1203,7 +1314,7 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
* the mirror but here we use it to hold the page for the simulated
* device memory and that page holds the pointer to the mirror.
*/
- rpage = vmf->page->zone_device_data;
+ rpage = BACKING_PAGE(vmf->page);
dmirror = rpage->zone_device_data;

/* FIXME demonstrate how we can adjust migrate range */
@@ -1213,7 +1324,7 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
args.src = &src_pfns;
args.dst = &dst_pfns;
args.pgmap_owner = dmirror->mdevice;
- args.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
+ args.flags = dmirror_select_device(dmirror);

if (migrate_vma_setup(&args))
return VM_FAULT_SIGBUS;
@@ -1279,14 +1390,26 @@ static void dmirror_device_remove(struct dmirror_device *mdevice)
static int __init hmm_dmirror_init(void)
{
int ret;
- int id;
+ int id = 0;
+ int ndevices = 0;

ret = alloc_chrdev_region(&dmirror_dev, 0, DMIRROR_NDEVICES,
"HMM_DMIRROR");
if (ret)
goto err_unreg;

- for (id = 0; id < DMIRROR_NDEVICES; id++) {
+ memset(dmirror_devices, 0, DMIRROR_NDEVICES * sizeof(dmirror_devices[0]));
+ dmirror_devices[ndevices++].zone_device_type =
+ HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
+ dmirror_devices[ndevices++].zone_device_type =
+ HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
+ if (spm_addr_dev0 && spm_addr_dev1) {
+ dmirror_devices[ndevices++].zone_device_type =
+ HMM_DMIRROR_MEMORY_DEVICE_COHERENT;
+ dmirror_devices[ndevices++].zone_device_type =
+ HMM_DMIRROR_MEMORY_DEVICE_COHERENT;
+ }
+ for (id = 0; id < ndevices; id++) {
ret = dmirror_device_init(dmirror_devices + id, id);
if (ret)
goto err_chrdev;
@@ -1308,7 +1431,8 @@ static void __exit hmm_dmirror_exit(void)
int id;

for (id = 0; id < DMIRROR_NDEVICES; id++)
- dmirror_device_remove(dmirror_devices + id);
+ if (dmirror_devices[id].zone_device_type)
+ dmirror_device_remove(dmirror_devices + id);
unregister_chrdev_region(dmirror_dev, DMIRROR_NDEVICES);
}

diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index 625f3690d086..e190b2ab6f19 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -33,11 +33,12 @@ struct hmm_dmirror_cmd {
/* Expose the address space of the calling process through hmm device file */
#define HMM_DMIRROR_READ _IOWR('H', 0x00, struct hmm_dmirror_cmd)
#define HMM_DMIRROR_WRITE _IOWR('H', 0x01, struct hmm_dmirror_cmd)
-#define HMM_DMIRROR_MIGRATE _IOWR('H', 0x02, struct hmm_dmirror_cmd)
-#define HMM_DMIRROR_SNAPSHOT _IOWR('H', 0x03, struct hmm_dmirror_cmd)
-#define HMM_DMIRROR_EXCLUSIVE _IOWR('H', 0x04, struct hmm_dmirror_cmd)
-#define HMM_DMIRROR_CHECK_EXCLUSIVE _IOWR('H', 0x05, struct hmm_dmirror_cmd)
-#define HMM_DMIRROR_GET_MEM_DEV_TYPE _IOWR('H', 0x06, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_MIGRATE_TO_DEV _IOWR('H', 0x02, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_MIGRATE_TO_SYS _IOWR('H', 0x03, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_SNAPSHOT _IOWR('H', 0x04, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_EXCLUSIVE _IOWR('H', 0x05, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_CHECK_EXCLUSIVE _IOWR('H', 0x06, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_GET_MEM_DEV_TYPE _IOWR('H', 0x07, struct hmm_dmirror_cmd)

/*
* Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
@@ -52,6 +53,8 @@ struct hmm_dmirror_cmd {
* device the ioctl() is made
* HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE: Migrated device private page on some
* other device
+ * HMM_DMIRROR_PROT_DEV_COHERENT_LOCAL: Migrated device coherent page on
+ * the device the ioctl() is made
*/
enum {
HMM_DMIRROR_PROT_ERROR = 0xFF,
@@ -63,6 +66,8 @@ enum {
HMM_DMIRROR_PROT_ZERO = 0x10,
HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL = 0x20,
HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE = 0x30,
+ HMM_DMIRROR_PROT_DEV_COHERENT_LOCAL = 0x40,
+ HMM_DMIRROR_PROT_DEV_COHERENT_REMOTE = 0x50,
};

enum {
--
2.32.0


2021-12-07 19:31:44

by Jason Gunthorpe

Subject: Re: [PATCH v2 11/11] tools: add hmm gup test for long term pinned device pages

On Mon, Dec 06, 2021 at 12:52:51PM -0600, Alex Sierra wrote:
> The intention is to test device coherent type pages that have been
> called through get user pages with PIN_LONGTERM flag set.
>
> Signed-off-by: Alex Sierra <[email protected]>
> tools/testing/selftests/vm/Makefile | 2 +-
> tools/testing/selftests/vm/hmm-tests.c | 81 ++++++++++++++++++++++++++
> 2 files changed, 82 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
> index d9605bd10f2d..527a7bfd80bd 100644
> +++ b/tools/testing/selftests/vm/Makefile
> @@ -141,7 +141,7 @@ $(OUTPUT)/mlock-random-test $(OUTPUT)/memfd_secret: LDLIBS += -lcap
>
> $(OUTPUT)/gup_test: ../../../../mm/gup_test.h
>
> -$(OUTPUT)/hmm-tests: local_config.h
> +$(OUTPUT)/hmm-tests: local_config.h ../../../../mm/gup_test.h
>
> # HMM_EXTRA_LIBS may get set in local_config.mk, or it may be left empty.
> $(OUTPUT)/hmm-tests: LDLIBS += $(HMM_EXTRA_LIBS)
> diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
> index 8eb81dfba4b3..9a0b7e44a674 100644
> +++ b/tools/testing/selftests/vm/hmm-tests.c
> @@ -36,6 +36,7 @@
> * in the usual include/uapi/... directory.
> */
> #include "../../../../lib/test_hmm_uapi.h"
> +#include "../../../../mm/gup_test.h"
>
> struct hmm_buffer {
> void *ptr;
> @@ -60,6 +61,8 @@ enum {
> #define NTIMES 10
>
> #define ALIGN(x, a) (((x) + (a - 1)) & (~((a) - 1)))
> +/* Just the flags we need, copied from mm.h: */
> +#define FOLL_WRITE 0x01 /* check pte is writable */

This is so fragile, you should have a dedicated flag here for asking
for this of PIN_LONGTERM_BENCHMARK

Jason

2021-12-08 11:32:15

by Alistair Popple

Subject: Re: [PATCH v2 03/11] mm/gup: migrate PIN_LONGTERM dev coherent pages to system

On Tuesday, 7 December 2021 5:52:43 AM AEDT Alex Sierra wrote:
> Avoid long term pinning for Coherent device type pages. This could
> interfere with their own device memory manager.
> If caller tries to get user device coherent pages with PIN_LONGTERM flag
> set, those pages will be migrated back to system memory.
>
> Signed-off-by: Alex Sierra <[email protected]>
> ---
> mm/gup.c | 32 ++++++++++++++++++++++++++++++--
> 1 file changed, 30 insertions(+), 2 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 886d6148d3d0..1572eacf07f4 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1689,17 +1689,37 @@ struct page *get_dump_page(unsigned long addr)
> #endif /* CONFIG_ELF_CORE */
>
> #ifdef CONFIG_MIGRATION
> +static int migrate_device_page(unsigned long address,
> + struct page *page)
> +{
> + struct vm_area_struct *vma = find_vma(current->mm, address);
> + struct vm_fault vmf = {
> + .vma = vma,
> + .address = address & PAGE_MASK,
> + .flags = FAULT_FLAG_USER,
> + .pgoff = linear_page_index(vma, address),
> + .gfp_mask = GFP_KERNEL,
> + .page = page,
> + };
> + if (page->pgmap && page->pgmap->ops->migrate_to_ram)
> + return page->pgmap->ops->migrate_to_ram(&vmf);

How does this synchronise against pgmap being released? As I understand things
at this point we're not holding a reference on either the page or pgmap, so
the page and therefore the pgmap may have been freed.

I think a similar problem exists for device private fault handling as well and
it has been on my list of things to fix for a while. I think the solution is to
call try_get_page(), except it doesn't work with device pages due to the whole
refcount thing. That issue is blocking a fair bit of work now so I've started
looking into it.

> +
> + return -EBUSY;
> +}
> +
> /*
> * Check whether all pages are pinnable, if so return number of pages. If some
> * pages are not pinnable, migrate them, and unpin all pages. Return zero if
> * pages were migrated, or if some pages were not successfully isolated.
> * Return negative error if migration fails.
> */
> -static long check_and_migrate_movable_pages(unsigned long nr_pages,
> +static long check_and_migrate_movable_pages(unsigned long start,
> + unsigned long nr_pages,
> struct page **pages,
> unsigned int gup_flags)
> {
> unsigned long i;
> + unsigned long page_index;
> unsigned long isolation_error_count = 0;
> bool drain_allow = true;
> LIST_HEAD(movable_page_list);
> @@ -1720,6 +1740,10 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
> * If we get a movable page, since we are going to be pinning
> * these entries, try to move them out if possible.
> */
> + if (is_device_page(head)) {
> + page_index = i;
> + goto unpin_pages;
> + }
> if (!is_pinnable_page(head)) {
> if (PageHuge(head)) {
> if (!isolate_huge_page(head, &movable_page_list))
> @@ -1750,12 +1774,16 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
> if (list_empty(&movable_page_list) && !isolation_error_count)
> return nr_pages;
>
> +unpin_pages:
> if (gup_flags & FOLL_PIN) {
> unpin_user_pages(pages, nr_pages);
> } else {
> for (i = 0; i < nr_pages; i++)
> put_page(pages[i]);
> }
> + if (is_device_page(head))
> + return migrate_device_page(start + page_index * PAGE_SIZE, head);

This isn't very optimal - if a range contains more than one device page (which
seems likely) we will have to go around the whole gup/check_and_migrate loop
once for each device page which seems unnecessary. You should be able to either
build a list or migrate them as you go through the loop. I'm also currently
looking into how to extend migrate_pages() to support device pages which might
be useful here too.

> +
> if (!list_empty(&movable_page_list)) {
> ret = migrate_pages(&movable_page_list, alloc_migration_target,
> NULL, (unsigned long)&mtc, MIGRATE_SYNC,
> @@ -1798,7 +1826,7 @@ static long __gup_longterm_locked(struct mm_struct *mm,
> NULL, gup_flags);
> if (rc <= 0)
> break;
> - rc = check_and_migrate_movable_pages(rc, pages, gup_flags);
> + rc = check_and_migrate_movable_pages(start, rc, pages, gup_flags);
> } while (!rc);
> memalloc_pin_restore(flags);
>
>





2021-12-08 13:53:49

by Jason Gunthorpe

Subject: Re: [PATCH v2 03/11] mm/gup: migrate PIN_LONGTERM dev coherent pages to system

On Wed, Dec 08, 2021 at 10:31:58PM +1100, Alistair Popple wrote:
> On Tuesday, 7 December 2021 5:52:43 AM AEDT Alex Sierra wrote:
> > Avoid long term pinning for Coherent device type pages. This could
> > interfere with their own device memory manager.
> > If caller tries to get user device coherent pages with PIN_LONGTERM flag
> > set, those pages will be migrated back to system memory.
> >
> > Signed-off-by: Alex Sierra <[email protected]>
> > mm/gup.c | 32 ++++++++++++++++++++++++++++++--
> > 1 file changed, 30 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 886d6148d3d0..1572eacf07f4 100644
> > +++ b/mm/gup.c
> > @@ -1689,17 +1689,37 @@ struct page *get_dump_page(unsigned long addr)
> > #endif /* CONFIG_ELF_CORE */
> >
> > #ifdef CONFIG_MIGRATION
> > +static int migrate_device_page(unsigned long address,
> > + struct page *page)
> > +{
> > + struct vm_area_struct *vma = find_vma(current->mm, address);
> > + struct vm_fault vmf = {
> > + .vma = vma,
> > + .address = address & PAGE_MASK,
> > + .flags = FAULT_FLAG_USER,
> > + .pgoff = linear_page_index(vma, address),
> > + .gfp_mask = GFP_KERNEL,
> > + .page = page,
> > + };
> > + if (page->pgmap && page->pgmap->ops->migrate_to_ram)
> > + return page->pgmap->ops->migrate_to_ram(&vmf);
>
> How does this synchronise against pgmap being released? As I understand things
> at this point we're not holding a reference on either the page or pgmap, so
> the page and therefore the pgmap may have been freed.

For sure, this can't keep touching the pages[] array after it unpinned
them:

> > if (gup_flags & FOLL_PIN) {
> > unpin_user_pages(pages, nr_pages);
^^^^^^^^^^^^^^^^^^^

> > } else {
> > for (i = 0; i < nr_pages; i++)
> > put_page(pages[i]);
> > }
> > + if (is_device_page(head))
> > + return migrate_device_page(start + page_index * PAGE_SIZE, head);

It was safe before this patch as isolate_lru_page(head) has a
get_page() inside.

Also, please try hard not to turn this function into goto spaghetti

> I think a similar problem exists for device private fault handling as well and
> it has been on my list of things to fix for a while. I think the solution is to
> call try_get_page(), except it doesn't work with device pages due to the whole
> refcount thing. That issue is blocking a fair bit of work now so I've started
> looking into it.

Where is this?

Jason

2021-12-08 16:58:32

by Felix Kuehling

Subject: Re: [PATCH v2 03/11] mm/gup: migrate PIN_LONGTERM dev coherent pages to system

Am 2021-12-08 um 6:31 a.m. schrieb Alistair Popple:
> On Tuesday, 7 December 2021 5:52:43 AM AEDT Alex Sierra wrote:
>> Avoid long term pinning for Coherent device type pages. This could
>> interfere with their own device memory manager.
>> If caller tries to get user device coherent pages with PIN_LONGTERM flag
>> set, those pages will be migrated back to system memory.
>>
>> Signed-off-by: Alex Sierra <[email protected]>
>> ---
>> mm/gup.c | 32 ++++++++++++++++++++++++++++++--
>> 1 file changed, 30 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 886d6148d3d0..1572eacf07f4 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -1689,17 +1689,37 @@ struct page *get_dump_page(unsigned long addr)
>> #endif /* CONFIG_ELF_CORE */
>>
>> #ifdef CONFIG_MIGRATION
>> +static int migrate_device_page(unsigned long address,
>> + struct page *page)
>> +{
>> + struct vm_area_struct *vma = find_vma(current->mm, address);
>> + struct vm_fault vmf = {
>> + .vma = vma,
>> + .address = address & PAGE_MASK,
>> + .flags = FAULT_FLAG_USER,
>> + .pgoff = linear_page_index(vma, address),
>> + .gfp_mask = GFP_KERNEL,
>> + .page = page,
>> + };
>> + if (page->pgmap && page->pgmap->ops->migrate_to_ram)
>> + return page->pgmap->ops->migrate_to_ram(&vmf);
> How does this synchronise against pgmap being released? As I understand things
> at this point we're not holding a reference on either the page or pgmap, so
> the page and therefore the pgmap may have been freed.
>
> I think a similar problem exists for device private fault handling as well and
> it has been on my list of things to fix for a while. I think the solution is to
> call try_get_page(), except it doesn't work with device pages due to the whole
> refcount thing. That issue is blocking a fair bit of work now so I've started
> looking into it.

At least the page should have been pinned by the __get_user_pages_locked
call in __gup_longterm_locked. That refcount is dropped in
check_and_migrate_movable_pages when it returns 0 or an error.


>
>> +
>> + return -EBUSY;
>> +}
>> +
>> /*
>> * Check whether all pages are pinnable, if so return number of pages. If some
>> * pages are not pinnable, migrate them, and unpin all pages. Return zero if
>> * pages were migrated, or if some pages were not successfully isolated.
>> * Return negative error if migration fails.
>> */
>> -static long check_and_migrate_movable_pages(unsigned long nr_pages,
>> +static long check_and_migrate_movable_pages(unsigned long start,
>> + unsigned long nr_pages,
>> struct page **pages,
>> unsigned int gup_flags)
>> {
>> unsigned long i;
>> + unsigned long page_index;
>> unsigned long isolation_error_count = 0;
>> bool drain_allow = true;
>> LIST_HEAD(movable_page_list);
>> @@ -1720,6 +1740,10 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>> * If we get a movable page, since we are going to be pinning
>> * these entries, try to move them out if possible.
>> */
>> + if (is_device_page(head)) {
>> + page_index = i;
>> + goto unpin_pages;
>> + }
>> if (!is_pinnable_page(head)) {
>> if (PageHuge(head)) {
>> if (!isolate_huge_page(head, &movable_page_list))
>> @@ -1750,12 +1774,16 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>> if (list_empty(&movable_page_list) && !isolation_error_count)
>> return nr_pages;
>>
>> +unpin_pages:
>> if (gup_flags & FOLL_PIN) {
>> unpin_user_pages(pages, nr_pages);
>> } else {
>> for (i = 0; i < nr_pages; i++)
>> put_page(pages[i]);
>> }
>> + if (is_device_page(head))
>> + return migrate_device_page(start + page_index * PAGE_SIZE, head);
> This isn't very optimal - if a range contains more than one device page (which
> seems likely) we will have to go around the whole gup/check_and_migrate loop
> once for each device page which seems unnecessary. You should be able to either
> build a list or migrate them as you go through the loop. I'm also currently
> looking into how to extend migrate_pages() to support device pages which might
> be useful here too.

We have to do it this way because page->pgmap->ops->migrate_to_ram can
migrate multiple pages per "CPU page fault" to amortize the cost of
migration. The AMD driver typically migrates 2MB at a time. Calling
page->pgmap->ops->migrate_to_ram for each page would probably be even
less optimal.

Regards,
  Felix


>
>> +
>> if (!list_empty(&movable_page_list)) {
>> ret = migrate_pages(&movable_page_list, alloc_migration_target,
>> NULL, (unsigned long)&mtc, MIGRATE_SYNC,
>> @@ -1798,7 +1826,7 @@ static long __gup_longterm_locked(struct mm_struct *mm,
>> NULL, gup_flags);
>> if (rc <= 0)
>> break;
>> - rc = check_and_migrate_movable_pages(rc, pages, gup_flags);
>> + rc = check_and_migrate_movable_pages(start, rc, pages, gup_flags);
>> } while (!rc);
>> memalloc_pin_restore(flags);
>>
>>
>
>

2021-12-08 17:30:17

by Felix Kuehling

Subject: Re: [PATCH v2 03/11] mm/gup: migrate PIN_LONGTERM dev coherent pages to system


Am 2021-12-08 um 11:58 a.m. schrieb Felix Kuehling:
> Am 2021-12-08 um 6:31 a.m. schrieb Alistair Popple:
>> On Tuesday, 7 December 2021 5:52:43 AM AEDT Alex Sierra wrote:
>>> Avoid long term pinning for Coherent device type pages. This could
>>> interfere with their own device memory manager.
>>> If caller tries to get user device coherent pages with PIN_LONGTERM flag
>>> set, those pages will be migrated back to system memory.
>>>
>>> Signed-off-by: Alex Sierra <[email protected]>
>>> ---
>>> mm/gup.c | 32 ++++++++++++++++++++++++++++++--
>>> 1 file changed, 30 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/gup.c b/mm/gup.c
>>> index 886d6148d3d0..1572eacf07f4 100644
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -1689,17 +1689,37 @@ struct page *get_dump_page(unsigned long addr)
>>> #endif /* CONFIG_ELF_CORE */
>>>
>>> #ifdef CONFIG_MIGRATION
>>> +static int migrate_device_page(unsigned long address,
>>> + struct page *page)
>>> +{
>>> + struct vm_area_struct *vma = find_vma(current->mm, address);
>>> + struct vm_fault vmf = {
>>> + .vma = vma,
>>> + .address = address & PAGE_MASK,
>>> + .flags = FAULT_FLAG_USER,
>>> + .pgoff = linear_page_index(vma, address),
>>> + .gfp_mask = GFP_KERNEL,
>>> + .page = page,
>>> + };
>>> + if (page->pgmap && page->pgmap->ops->migrate_to_ram)
>>> + return page->pgmap->ops->migrate_to_ram(&vmf);
>> How does this synchronise against pgmap being released? As I understand things
>> at this point we're not holding a reference on either the page or pgmap, so
>> the page and therefore the pgmap may have been freed.
>>
>> I think a similar problem exists for device private fault handling as well and
>> it has been on my list of things to fix for a while. I think the solution is to
>> call try_get_page(), except it doesn't work with device pages due to the whole
>> refcount thing. That issue is blocking a fair bit of work now so I've started
>> looking into it.
> At least the page should have been pinned by the __get_user_pages_locked
> call in __gup_longterm_locked. That refcount is dropped in
> check_and_migrate_movable_pages when it returns 0 or an error.

Never mind. We unpin the pages first. Alex, would the migration work if
we unpinned them afterwards? Also, the normal CPU page fault code path
seems to make sure the page is locked (check in pfn_swap_entry_to_page)
before calling migrate_to_ram.

Regards,
  Felix



Subject: Re: [PATCH v2 03/11] mm/gup: migrate PIN_LONGTERM dev coherent pages to system


On 12/8/2021 11:30 AM, Felix Kuehling wrote:
> Am 2021-12-08 um 11:58 a.m. schrieb Felix Kuehling:
>> Am 2021-12-08 um 6:31 a.m. schrieb Alistair Popple:
>>> On Tuesday, 7 December 2021 5:52:43 AM AEDT Alex Sierra wrote:
>>>> Avoid long term pinning for Coherent device type pages. This could
>>>> interfere with their own device memory manager.
>>>> If caller tries to get user device coherent pages with PIN_LONGTERM flag
>>>> set, those pages will be migrated back to system memory.
>>>>
>>>> Signed-off-by: Alex Sierra <[email protected]>
>>>> ---
>>>> mm/gup.c | 32 ++++++++++++++++++++++++++++++--
>>>> 1 file changed, 30 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/mm/gup.c b/mm/gup.c
>>>> index 886d6148d3d0..1572eacf07f4 100644
>>>> --- a/mm/gup.c
>>>> +++ b/mm/gup.c
>>>> @@ -1689,17 +1689,37 @@ struct page *get_dump_page(unsigned long addr)
>>>> #endif /* CONFIG_ELF_CORE */
>>>>
>>>> #ifdef CONFIG_MIGRATION
>>>> +static int migrate_device_page(unsigned long address,
>>>> + struct page *page)
>>>> +{
>>>> + struct vm_area_struct *vma = find_vma(current->mm, address);
>>>> + struct vm_fault vmf = {
>>>> + .vma = vma,
>>>> + .address = address & PAGE_MASK,
>>>> + .flags = FAULT_FLAG_USER,
>>>> + .pgoff = linear_page_index(vma, address),
>>>> + .gfp_mask = GFP_KERNEL,
>>>> + .page = page,
>>>> + };
>>>> + if (page->pgmap && page->pgmap->ops->migrate_to_ram)
>>>> + return page->pgmap->ops->migrate_to_ram(&vmf);
>>> How does this synchronise against pgmap being released? As I understand things
>>> at this point we're not holding a reference on either the page or pgmap, so
>>> the page and therefore the pgmap may have been freed.
>>>
>>> I think a similar problem exists for device private fault handling as well and
>>> it has been on my list of things to fix for a while. I think the solution is to
>>> call try_get_page(), except it doesn't work with device pages due to the whole
>>> refcount thing. That issue is blocking a fair bit of work now so I've started
>>> looking into it.
>> At least the page should have been pinned by the __get_user_pages_locked
>> call in __gup_longterm_locked. That refcount is dropped in
>> check_and_migrate_movable_pages when it returns 0 or an error.
> Never mind. We unpin the pages first. Alex, would the migration work if
> we unpinned them afterwards? Also, the normal CPU page fault code path
> seems to make sure the page is locked (check in pfn_swap_entry_to_page)
> before calling migrate_to_ram.

No, you cannot unpin after migration, due to the expected_count vs
page_count check in migrate_page_move_mapping during the migrate_page
call.
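
Roughly, a simplified paraphrase of the anonymous-page path in
migrate_page_move_mapping() (not the literal code):

	/*
	 * Migration only proceeds when the migration path's references
	 * are the only ones left on the page; a page that is still
	 * pinned has an elevated refcount, so the move fails.
	 */
	int expected_count = expected_page_refs(mapping, page) + extra_count;

	if (!mapping && page_count(page) != expected_count)
		return -EAGAIN;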

Regards,
Alex Sierra

> Regards,
>   Felix
>
>

2021-12-09 01:45:36

by Alistair Popple

Subject: Re: [PATCH v2 03/11] mm/gup: migrate PIN_LONGTERM dev coherent pages to system

On Thursday, 9 December 2021 12:53:45 AM AEDT Jason Gunthorpe wrote:
> > I think a similar problem exists for device private fault handling as well and
> > it has been on my list of things to fix for a while. I think the solution is to
> > call try_get_page(), except it doesn't work with device pages due to the whole
> > refcount thing. That issue is blocking a fair bit of work now so I've started
> > looking into it.
>
> Where is this?

Nothing posted yet. I've been going through the mailing list and the old
thread[1] to get an understanding of what is left to do. If you have any
suggestions they would be welcome.

[1] https://lore.kernel.org/all/[email protected]/




2021-12-09 02:53:26

by Jason Gunthorpe

Subject: Re: [PATCH v2 03/11] mm/gup: migrate PIN_LONGTERM dev coherent pages to system

On Thu, Dec 09, 2021 at 12:45:24PM +1100, Alistair Popple wrote:
> On Thursday, 9 December 2021 12:53:45 AM AEDT Jason Gunthorpe wrote:
> > > I think a similar problem exists for device private fault handling as well and
> > > it has been on my list of things to fix for a while. I think the solution is to
> > > call try_get_page(), except it doesn't work with device pages due to the whole
> > > refcount thing. That issue is blocking a fair bit of work now so I've started
> > > looking into it.
> >
> > Where is this?
>
> Nothing posted yet. I've been going through the mailing list and the old
> thread[1] to get an understanding of what is left to do. If you have any
> suggestions they would be welcome.

Oh, that

Joao's series here is the first step:

https://lore.kernel.org/linux-mm/[email protected]/

I already sent a patch to remove the DRM usage of PUD/PMD -
0d979509539e ("drm/ttm: remove ttm_bo_vm_insert_huge()")

Next, someone needs to change FSDAX to have a folio covering the
ZONE_DEVICE pages before it installs a PUD or PMD. I don't know
anything about FS's to know how to do this at all.

Thus all PUD/PMD entries will point at a head page or larger of a
compound. This is important because all the existing machinery for THP
assumes 1 PUD/PMD means 1 struct page to manipulate.

Then, consolidate all the duplicated code that runs when a page is
removed from a PTE/PMD/PUD etc into a function. Figure out why the
duplications are different to make them the same (I have some rough
patches for this step)

Start with PUD and have zap on PUD call the consolidated function and
make vmf_insert_pfn_pud_prot() accept a struct page not pfn and incr
the refcount. PUD is easy because there is no THP

Then do the same to PMD without breaking the THP code

Then make the PTE also incr the refcount on insert and zap

Exterminate vma_is_special_huge() along the way, there is no such
thing as a special huge VMA without a pud/pmd_special flag so all
things installed here must be struct page and not special.

Then the patches that are already posted are applicable and we can
kill the refcount == 1 stuff. No 0 ref count pages installed in page
tables.

Once all of that is done it is fairly straightforward to remove
pud/pmd/pte_devmap entirely and the pgmap stuff from gup.c

Jason

2021-12-09 10:53:28

by Alistair Popple

Subject: Re: [PATCH v2 03/11] mm/gup: migrate PIN_LONGTERM dev coherent pages to system

On Thursday, 9 December 2021 5:55:26 AM AEDT Sierra Guiza, Alejandro (Alex) wrote:
>
> On 12/8/2021 11:30 AM, Felix Kuehling wrote:
> > Am 2021-12-08 um 11:58 a.m. schrieb Felix Kuehling:
> >> Am 2021-12-08 um 6:31 a.m. schrieb Alistair Popple:
> >>> On Tuesday, 7 December 2021 5:52:43 AM AEDT Alex Sierra wrote:
> >>>> Avoid long-term pinning of coherent device type pages, as it could
> >>>> interfere with the device's own memory manager.
> >>>> If a caller tries to get user device coherent pages with the PIN_LONGTERM
> >>>> flag set, those pages will be migrated back to system memory.
> >>>>
> >>>> Signed-off-by: Alex Sierra <[email protected]>
> >>>> ---
> >>>> mm/gup.c | 32 ++++++++++++++++++++++++++++++--
> >>>> 1 file changed, 30 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/mm/gup.c b/mm/gup.c
> >>>> index 886d6148d3d0..1572eacf07f4 100644
> >>>> --- a/mm/gup.c
> >>>> +++ b/mm/gup.c
> >>>> @@ -1689,17 +1689,37 @@ struct page *get_dump_page(unsigned long addr)
> >>>> #endif /* CONFIG_ELF_CORE */
> >>>>
> >>>> #ifdef CONFIG_MIGRATION
> >>>> +static int migrate_device_page(unsigned long address,
> >>>> +			       struct page *page)
> >>>> +{
> >>>> +	struct vm_area_struct *vma = find_vma(current->mm, address);
> >>>> +	struct vm_fault vmf = {
> >>>> +		.vma = vma,
> >>>> +		.address = address & PAGE_MASK,
> >>>> +		.flags = FAULT_FLAG_USER,
> >>>> +		.pgoff = linear_page_index(vma, address),
> >>>> +		.gfp_mask = GFP_KERNEL,
> >>>> +		.page = page,
> >>>> +	};
> >>>> +	if (page->pgmap && page->pgmap->ops->migrate_to_ram)
> >>>> +		return page->pgmap->ops->migrate_to_ram(&vmf);
> >>> How does this synchronise against pgmap being released? As I understand things
> >>> at this point we're not holding a reference on either the page or pgmap, so
> >>> the page and therefore the pgmap may have been freed.
> >>>
> >>> I think a similar problem exists for device private fault handling as well and
> >>> it has been on my list of things to fix for a while. I think the solution is to
> >>> call try_get_page(), except it doesn't work with device pages due to the whole
> >>> refcount thing. That issue is blocking a fair bit of work now so I've started
> >>> looking into it.
> >> At least the page should have been pinned by the __get_user_pages_locked
> >> call in __gup_longterm_locked. That refcount is dropped in
> >> check_and_migrate_movable_pages when it returns 0 or an error.
> > Never mind. We unpin the pages first. Alex, would the migration work if
> > we unpinned them afterwards? Also, the normal CPU page fault code path
> > seems to make sure the page is locked (check in pfn_swap_entry_to_page)
> > before calling migrate_to_ram.

I don't think that's true. The check in pfn_swap_entry_to_page() is only for
migration entries:

BUG_ON(is_migration_entry(entry) && !PageLocked(p));

As this is coherent memory though why do we have to call into a device driver
to do the migration? Couldn't this all be done in the kernel?

> No, we cannot unpin after migration, due to the expected_count vs page_count
> check in migrate_page_move_mapping(), which runs during the migrate_page() call.
>
> Regards,
> Alex Sierra
>
> > Regards,
> > Felix
> >
> >
>





2021-12-09 16:30:02

by Felix Kuehling

Subject: Re: [PATCH v2 03/11] mm/gup: migrate PIN_LONGTERM dev coherent pages to system

Am 2021-12-09 um 5:53 a.m. schrieb Alistair Popple:
> On Thursday, 9 December 2021 5:55:26 AM AEDT Sierra Guiza, Alejandro (Alex) wrote:
>> On 12/8/2021 11:30 AM, Felix Kuehling wrote:
>>> Am 2021-12-08 um 11:58 a.m. schrieb Felix Kuehling:
>>>> Am 2021-12-08 um 6:31 a.m. schrieb Alistair Popple:
>>>>> On Tuesday, 7 December 2021 5:52:43 AM AEDT Alex Sierra wrote:
>>>>>> Avoid long-term pinning of coherent device type pages, as it could
>>>>>> interfere with the device's own memory manager.
>>>>>> If a caller tries to get user device coherent pages with the PIN_LONGTERM
>>>>>> flag set, those pages will be migrated back to system memory.
>>>>>>
>>>>>> Signed-off-by: Alex Sierra <[email protected]>
>>>>>> ---
>>>>>> mm/gup.c | 32 ++++++++++++++++++++++++++++++--
>>>>>> 1 file changed, 30 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/gup.c b/mm/gup.c
>>>>>> index 886d6148d3d0..1572eacf07f4 100644
>>>>>> --- a/mm/gup.c
>>>>>> +++ b/mm/gup.c
>>>>>> @@ -1689,17 +1689,37 @@ struct page *get_dump_page(unsigned long addr)
>>>>>> #endif /* CONFIG_ELF_CORE */
>>>>>>
>>>>>> #ifdef CONFIG_MIGRATION
>>>>>> +static int migrate_device_page(unsigned long address,
>>>>>> +			       struct page *page)
>>>>>> +{
>>>>>> +	struct vm_area_struct *vma = find_vma(current->mm, address);
>>>>>> +	struct vm_fault vmf = {
>>>>>> +		.vma = vma,
>>>>>> +		.address = address & PAGE_MASK,
>>>>>> +		.flags = FAULT_FLAG_USER,
>>>>>> +		.pgoff = linear_page_index(vma, address),
>>>>>> +		.gfp_mask = GFP_KERNEL,
>>>>>> +		.page = page,
>>>>>> +	};
>>>>>> +	if (page->pgmap && page->pgmap->ops->migrate_to_ram)
>>>>>> +		return page->pgmap->ops->migrate_to_ram(&vmf);
>>>>> How does this synchronise against pgmap being released? As I understand things
>>>>> at this point we're not holding a reference on either the page or pgmap, so
>>>>> the page and therefore the pgmap may have been freed.
>>>>>
>>>>> I think a similar problem exists for device private fault handling as well and
>>>>> it has been on my list of things to fix for a while. I think the solution is to
>>>>> call try_get_page(), except it doesn't work with device pages due to the whole
>>>>> refcount thing. That issue is blocking a fair bit of work now so I've started
>>>>> looking into it.
>>>> At least the page should have been pinned by the __get_user_pages_locked
>>>> call in __gup_longterm_locked. That refcount is dropped in
>>>> check_and_migrate_movable_pages when it returns 0 or an error.
>>> Never mind. We unpin the pages first. Alex, would the migration work if
>>> we unpinned them afterwards? Also, the normal CPU page fault code path
>>> seems to make sure the page is locked (check in pfn_swap_entry_to_page)
>>> before calling migrate_to_ram.
> I don't think that's true. The check in pfn_swap_entry_to_page() is only for
> migration entries:
>
> BUG_ON(is_migration_entry(entry) && !PageLocked(p));
>
> As this is coherent memory though why do we have to call into a device driver
> to do the migration? Couldn't this all be done in the kernel?

I think you're right. I hadn't thought of that mainly because I'm even
less familiar with the non-device migration code. Alex, can you give
that a try? As long as the driver still gets a page-free callback when
the device page is freed, it should work.
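
For reference, the callback in question is the page_free hook in
dev_pagemap_ops; the test driver wires it up like this (as in
lib/test_hmm.c, using the dmirror_devmem_free shown in the patch below):

	static const struct dev_pagemap_ops dmirror_devmem_ops = {
		.page_free	= dmirror_devmem_free,
		.migrate_to_ram	= dmirror_devmem_fault,
	};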

Regards,
  Felix


>
>> No, we cannot unpin after migration, due to the expected_count vs page_count
>> check in migrate_page_move_mapping(), which runs during the migrate_page() call.
>>
>> Regards,
>> Alex Sierra
>>
>>> Regards,
>>> Felix
>>>
>>>
>
>

2021-12-10 01:31:21

by Alistair Popple

Subject: Re: [PATCH v2 03/11] mm/gup: migrate PIN_LONGTERM dev coherent pages to system

On Friday, 10 December 2021 3:54:31 AM AEDT Sierra Guiza, Alejandro (Alex) wrote:
>
> On 12/9/2021 10:29 AM, Felix Kuehling wrote:
> > Am 2021-12-09 um 5:53 a.m. schrieb Alistair Popple:
> >> On Thursday, 9 December 2021 5:55:26 AM AEDT Sierra Guiza, Alejandro (Alex) wrote:
> >>> On 12/8/2021 11:30 AM, Felix Kuehling wrote:
> >>>> Am 2021-12-08 um 11:58 a.m. schrieb Felix Kuehling:
> >>>>> Am 2021-12-08 um 6:31 a.m. schrieb Alistair Popple:
> >>>>>> On Tuesday, 7 December 2021 5:52:43 AM AEDT Alex Sierra wrote:
> >>>>>>> Avoid long-term pinning of coherent device type pages, as it could
> >>>>>>> interfere with the device's own memory manager.
> >>>>>>> If a caller tries to get user device coherent pages with the PIN_LONGTERM
> >>>>>>> flag set, those pages will be migrated back to system memory.
> >>>>>>>
> >>>>>>> Signed-off-by: Alex Sierra <[email protected]>
> >>>>>>> ---
> >>>>>>> mm/gup.c | 32 ++++++++++++++++++++++++++++++--
> >>>>>>> 1 file changed, 30 insertions(+), 2 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/mm/gup.c b/mm/gup.c
> >>>>>>> index 886d6148d3d0..1572eacf07f4 100644
> >>>>>>> --- a/mm/gup.c
> >>>>>>> +++ b/mm/gup.c
> >>>>>>> @@ -1689,17 +1689,37 @@ struct page *get_dump_page(unsigned long addr)
> >>>>>>> #endif /* CONFIG_ELF_CORE */
> >>>>>>>
> >>>>>>> #ifdef CONFIG_MIGRATION
> >>>>>>> +static int migrate_device_page(unsigned long address,
> >>>>>>> +			       struct page *page)
> >>>>>>> +{
> >>>>>>> +	struct vm_area_struct *vma = find_vma(current->mm, address);
> >>>>>>> +	struct vm_fault vmf = {
> >>>>>>> +		.vma = vma,
> >>>>>>> +		.address = address & PAGE_MASK,
> >>>>>>> +		.flags = FAULT_FLAG_USER,
> >>>>>>> +		.pgoff = linear_page_index(vma, address),
> >>>>>>> +		.gfp_mask = GFP_KERNEL,
> >>>>>>> +		.page = page,
> >>>>>>> +	};
> >>>>>>> +	if (page->pgmap && page->pgmap->ops->migrate_to_ram)
> >>>>>>> +		return page->pgmap->ops->migrate_to_ram(&vmf);
> >>>>>> How does this synchronise against pgmap being released? As I understand things
> >>>>>> at this point we're not holding a reference on either the page or pgmap, so
> >>>>>> the page and therefore the pgmap may have been freed.
> >>>>>>
> >>>>>> I think a similar problem exists for device private fault handling as well and
> >>>>>> it has been on my list of things to fix for a while. I think the solution is to
> >>>>>> call try_get_page(), except it doesn't work with device pages due to the whole
> >>>>>> refcount thing. That issue is blocking a fair bit of work now so I've started
> >>>>>> looking into it.
> >>>>> At least the page should have been pinned by the __get_user_pages_locked
> >>>>> call in __gup_longterm_locked. That refcount is dropped in
> >>>>> check_and_migrate_movable_pages when it returns 0 or an error.
> >>>> Never mind. We unpin the pages first. Alex, would the migration work if
> >>>> we unpinned them afterwards? Also, the normal CPU page fault code path
> >>>> seems to make sure the page is locked (check in pfn_swap_entry_to_page)
> >>>> before calling migrate_to_ram.
> >> I don't think that's true. The check in pfn_swap_entry_to_page() is only for
> >> migration entries:
> >>
> >> BUG_ON(is_migration_entry(entry) && !PageLocked(p));
> >>
> >> As this is coherent memory though why do we have to call into a device driver
> >> to do the migration? Couldn't this all be done in the kernel?
> > I think you're right. I hadn't thought of that mainly because I'm even
> > less familiar with the non-device migration code. Alex, can you give
> > that a try? As long as the driver still gets a page-free callback when
> > the device page is freed, it should work.

Yes, you should still get the page-free callback when the migration code drops
the last page reference.

> ACK. Will do.

There is currently not really any support for migrating device pages based on
pfn. What I think is needed is something like migrate_pages(), but that API
won't work for a couple of reasons - main one being that it relies on pages
being LRU pages.

I've been working on a series to implement an equivalent of migrate_pages() for
device-private (and by extension device-coherent) pages. It might also be useful
here so I will try and get it posted as an RFC next week.
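
A purely hypothetical sketch of such a pfn-based interface, for illustration
only (nothing like this exists yet):

	/* Hypothetical: migrate device pages identified by pfn back to
	 * system memory, with no requirement that they be LRU pages. */
	int migrate_device_pages(unsigned long *src_pfns,
				 unsigned long *dst_pfns,
				 unsigned long npages);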

- Alistair

> Alex Sierra
>
> > Regards,
> > Felix
> >
> >
> >>> No, we cannot unpin after migration, due to the expected_count vs page_count
> >>> check in migrate_page_move_mapping(), which runs during the migrate_page() call.
> >>>
> >>> Regards,
> >>> Alex Sierra
> >>>
> >>>> Regards,
> >>>> Felix
> >>>>
> >>>>
> >>





2021-12-10 16:39:45

by Felix Kuehling

Subject: Re: [PATCH v2 03/11] mm/gup: migrate PIN_LONGTERM dev coherent pages to system

On 2021-12-09 8:31 p.m., Alistair Popple wrote:
> On Friday, 10 December 2021 3:54:31 AM AEDT Sierra Guiza, Alejandro (Alex) wrote:
>> On 12/9/2021 10:29 AM, Felix Kuehling wrote:
>>> Am 2021-12-09 um 5:53 a.m. schrieb Alistair Popple:
>>>> On Thursday, 9 December 2021 5:55:26 AM AEDT Sierra Guiza, Alejandro (Alex) wrote:
>>>>> On 12/8/2021 11:30 AM, Felix Kuehling wrote:
>>>>>> Am 2021-12-08 um 11:58 a.m. schrieb Felix Kuehling:
>>>>>>> Am 2021-12-08 um 6:31 a.m. schrieb Alistair Popple:
>>>>>>>> On Tuesday, 7 December 2021 5:52:43 AM AEDT Alex Sierra wrote:
>>>>>>>>> Avoid long-term pinning of coherent device type pages, as it could
>>>>>>>>> interfere with the device's own memory manager.
>>>>>>>>> If a caller tries to get user device coherent pages with the PIN_LONGTERM
>>>>>>>>> flag set, those pages will be migrated back to system memory.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Alex Sierra <[email protected]>
>>>>>>>>> ---
>>>>>>>>> mm/gup.c | 32 ++++++++++++++++++++++++++++++--
>>>>>>>>> 1 file changed, 30 insertions(+), 2 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/mm/gup.c b/mm/gup.c
>>>>>>>>> index 886d6148d3d0..1572eacf07f4 100644
>>>>>>>>> --- a/mm/gup.c
>>>>>>>>> +++ b/mm/gup.c
>>>>>>>>> @@ -1689,17 +1689,37 @@ struct page *get_dump_page(unsigned long addr)
>>>>>>>>> #endif /* CONFIG_ELF_CORE */
>>>>>>>>>
>>>>>>>>> #ifdef CONFIG_MIGRATION
>>>>>>>>> +static int migrate_device_page(unsigned long address,
>>>>>>>>> +			       struct page *page)
>>>>>>>>> +{
>>>>>>>>> +	struct vm_area_struct *vma = find_vma(current->mm, address);
>>>>>>>>> +	struct vm_fault vmf = {
>>>>>>>>> +		.vma = vma,
>>>>>>>>> +		.address = address & PAGE_MASK,
>>>>>>>>> +		.flags = FAULT_FLAG_USER,
>>>>>>>>> +		.pgoff = linear_page_index(vma, address),
>>>>>>>>> +		.gfp_mask = GFP_KERNEL,
>>>>>>>>> +		.page = page,
>>>>>>>>> +	};
>>>>>>>>> +	if (page->pgmap && page->pgmap->ops->migrate_to_ram)
>>>>>>>>> +		return page->pgmap->ops->migrate_to_ram(&vmf);
>>>>>>>> How does this synchronise against pgmap being released? As I understand things
>>>>>>>> at this point we're not holding a reference on either the page or pgmap, so
>>>>>>>> the page and therefore the pgmap may have been freed.
>>>>>>>>
>>>>>>>> I think a similar problem exists for device private fault handling as well and
>>>>>>>> it has been on my list of things to fix for a while. I think the solution is to
>>>>>>>> call try_get_page(), except it doesn't work with device pages due to the whole
>>>>>>>> refcount thing. That issue is blocking a fair bit of work now so I've started
>>>>>>>> looking into it.
>>>>>>> At least the page should have been pinned by the __get_user_pages_locked
>>>>>>> call in __gup_longterm_locked. That refcount is dropped in
>>>>>>> check_and_migrate_movable_pages when it returns 0 or an error.
>>>>>> Never mind. We unpin the pages first. Alex, would the migration work if
>>>>>> we unpinned them afterwards? Also, the normal CPU page fault code path
>>>>>> seems to make sure the page is locked (check in pfn_swap_entry_to_page)
>>>>>> before calling migrate_to_ram.
>>>> I don't think that's true. The check in pfn_swap_entry_to_page() is only for
>>>> migration entries:
>>>>
>>>> BUG_ON(is_migration_entry(entry) && !PageLocked(p));
>>>>
>>>> As this is coherent memory though why do we have to call into a device driver
>>>> to do the migration? Couldn't this all be done in the kernel?
>>> I think you're right. I hadn't thought of that mainly because I'm even
>>> less familiar with the non-device migration code. Alex, can you give
>>> that a try? As long as the driver still gets a page-free callback when
>>> the device page is freed, it should work.
> Yes, you should still get the page-free callback when the migration code drops
> the last page reference.
>
>> ACK. Will do.
> There is currently not really any support for migrating device pages based on
> pfn. What I think is needed is something like migrate_pages(), but that API
> won't work for a couple of reasons - main one being that it relies on pages
> being LRU pages.
>
> I've been working on a series to implement an equivalent of migrate_pages() for
> device-private (and by extension device-coherent) pages. It might also be useful
> here so I will try and get it posted as an RFC next week.
If we want to make progress on this patch series in the shorter term, we
could just fail get_user_pages with FOLL_LONGTERM for DEVICE_COHERENT
pages. Then add the migration support when your patch series is ready.
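
For illustration, such an interim check might look something like this (a
sketch only; is_device_coherent_page() is assumed here to be a helper this
series would provide):

	/* Sketch: refuse long-term pins on device-coherent pages instead
	 * of migrating them; the exact errno is a design choice. */
	if ((gup_flags & FOLL_LONGTERM) && is_device_coherent_page(page))
		return -EOPNOTSUPP;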

Regards,
  Felix


>
> - Alistair
>
>> Alex Sierra
>>
>>> Regards,
>>> Felix
>>>
>>>
>>>>> No, we cannot unpin after migration, due to the expected_count vs page_count
>>>>> check in migrate_page_move_mapping(), which runs during the migrate_page() call.
>>>>>
>>>>> Regards,
>>>>> Alex Sierra
>>>>>
>>>>>> Regards,
>>>>>> Felix
>>>>>>
>>>>>>
>
>

2022-01-03 20:25:15

by Liam R. Howlett

Subject: Re: [PATCH v2 08/11] lib: add support for device coherent type in test_hmm

* Alex Sierra <[email protected]> [211206 14:00]:
> Device Coherent type uses device memory that is coherently accessible by
> the CPU. This can show up as an SP (special purpose) memory range in
> the BIOS-e820 memory enumeration. If no SP memory is supported by the
> system, it can be faked by setting CONFIG_EFI_FAKE_MEMMAP.
>
> Currently, test_hmm only supports two different SP ranges of at least
> 256MB each. These can be specified via the efi_fake_mem kernel
> parameter. For example, two 1GB SP ranges starting at physical
> addresses 0x100000000 and 0x140000000:
> efi_fake_mem=1G@0x100000000:0x40000,1G@0x140000000:0x40000
>
> Private and coherent device mirror instances can be created in the same
> driver probe. This is done by passing the module parameters spm_addr_dev0 &
> spm_addr_dev1. In this case, it will create four instances of
> device_mirror. The first two correspond to private device type, the
> last two to coherent type. Then, they can be easily accessed from user
> space through /dev/hmm_mirror<num_device>. Usually num_device 0 and 1
> are for private, and 2 and 3 for coherent types. If no module
> parameters are passed, only two instances of private type device_mirror
> will be created.
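
For reference, loading the module with both coherent ranges would then look
something like this (using the addresses from the efi_fake_mem example above):

	modprobe test_hmm spm_addr_dev0=0x100000000 spm_addr_dev1=0x140000000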
>
> Signed-off-by: Alex Sierra <[email protected]>
> ---
> lib/test_hmm.c | 252 +++++++++++++++++++++++++++++++++-----------
> lib/test_hmm_uapi.h | 15 ++-
> 2 files changed, 198 insertions(+), 69 deletions(-)
>
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index 9edeff52302e..a1985226d788 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -29,11 +29,22 @@
>
> #include "test_hmm_uapi.h"
>
> -#define DMIRROR_NDEVICES 2
> +#define DMIRROR_NDEVICES 4
> #define DMIRROR_RANGE_FAULT_TIMEOUT 1000
> #define DEVMEM_CHUNK_SIZE (256 * 1024 * 1024U)
> #define DEVMEM_CHUNKS_RESERVE 16
>
> +/*
> + * For device_private pages, dpage is just a dummy struct page
> + * representing a piece of device memory. dmirror_devmem_alloc_page
> + * allocates a real system memory page as backing storage to fake a
> + * real device. zone_device_data points to that backing page. But
> + * for device_coherent memory, the struct page represents real
> + * physical CPU-accessible memory that we can use directly.
> + */
> +#define BACKING_PAGE(page) (is_device_private_page((page)) ? \
> + (page)->zone_device_data : (page))
> +
> static unsigned long spm_addr_dev0;
> module_param(spm_addr_dev0, long, 0644);
> MODULE_PARM_DESC(spm_addr_dev0,
> @@ -122,6 +133,21 @@ static int dmirror_bounce_init(struct dmirror_bounce *bounce,
> return 0;
> }
>
> +static bool dmirror_is_private_zone(struct dmirror_device *mdevice)
> +{
> + return (mdevice->zone_device_type ==
> + HMM_DMIRROR_MEMORY_DEVICE_PRIVATE) ? true : false;
> +}
> +
> +static enum migrate_vma_direction
> + dmirror_select_device(struct dmirror *dmirror)
> +{
> + return (dmirror->mdevice->zone_device_type ==
> + HMM_DMIRROR_MEMORY_DEVICE_PRIVATE) ?
> + MIGRATE_VMA_SELECT_DEVICE_PRIVATE :
> + MIGRATE_VMA_SELECT_DEVICE_COHERENT;
> +}
> +
> static void dmirror_bounce_fini(struct dmirror_bounce *bounce)
> {
> vfree(bounce->ptr);
> @@ -572,16 +598,19 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
> static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
> {
> struct page *dpage = NULL;
> - struct page *rpage;
> + struct page *rpage = NULL;
>
> /*
> - * This is a fake device so we alloc real system memory to store
> - * our device memory.
> + * For ZONE_DEVICE private type, this is a fake device so we alloc real
> + * system memory to store our device memory.
> + * For ZONE_DEVICE coherent type we use the actual dpage to store the data
> + * and ignore rpage.
> */
> - rpage = alloc_page(GFP_HIGHUSER);
> - if (!rpage)
> - return NULL;
> -
> + if (dmirror_is_private_zone(mdevice)) {
> + rpage = alloc_page(GFP_HIGHUSER);
> + if (!rpage)
> + return NULL;
> + }
> spin_lock(&mdevice->lock);
>
> if (mdevice->free_pages) {
> @@ -601,7 +630,8 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
> return dpage;
>
> error:
> - __free_page(rpage);
> + if (rpage)
> + __free_page(rpage);
> return NULL;
> }
>
> @@ -627,12 +657,15 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
> * unallocated pte_none() or read-only zero page.
> */
> spage = migrate_pfn_to_page(*src);
> + WARN(spage && is_zone_device_page(spage),
> + "page already in device spage pfn: 0x%lx\n",
> + page_to_pfn(spage));
>
> dpage = dmirror_devmem_alloc_page(mdevice);
> if (!dpage)
> continue;
>
> - rpage = dpage->zone_device_data;
> + rpage = BACKING_PAGE(dpage);
> if (spage)
> copy_highpage(rpage, spage);
> else
> @@ -646,6 +679,8 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
> */
> rpage->zone_device_data = dmirror;
>
> + pr_debug("migrating from sys to dev pfn src: 0x%lx pfn dst: 0x%lx\n",
> + page_to_pfn(spage), page_to_pfn(dpage));
> *dst = migrate_pfn(page_to_pfn(dpage)) |
> MIGRATE_PFN_LOCKED;
> if ((*src & MIGRATE_PFN_WRITE) ||
> @@ -724,11 +759,7 @@ static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
> if (!dpage)
> continue;
>
> - /*
> - * Store the page that holds the data so the page table
> - * doesn't have to deal with ZONE_DEVICE private pages.
> - */
> - entry = dpage->zone_device_data;
> + entry = BACKING_PAGE(dpage);
> if (*dst & MIGRATE_PFN_WRITE)
> entry = xa_tag_pointer(entry, DPT_XA_TAG_WRITE);
> entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC);
> @@ -808,8 +839,106 @@ static int dmirror_exclusive(struct dmirror *dmirror,
> return ret;
> }
>
> -static int dmirror_migrate(struct dmirror *dmirror,
> - struct hmm_dmirror_cmd *cmd)
> +static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
> + struct dmirror *dmirror)
> +{
> + const unsigned long *src = args->src;
> + unsigned long *dst = args->dst;
> + unsigned long start = args->start;
> + unsigned long end = args->end;
> + unsigned long addr;
> +
> + for (addr = start; addr < end; addr += PAGE_SIZE,
> + src++, dst++) {
> + struct page *dpage, *spage;
> +
> + spage = migrate_pfn_to_page(*src);
> + if (!spage || !(*src & MIGRATE_PFN_MIGRATE))
> + continue;
> +
> + WARN_ON(!is_device_page(spage));
> + spage = BACKING_PAGE(spage);
> + dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
> + if (!dpage)
> + continue;
> + pr_debug("migrating from dev to sys pfn src: 0x%lx pfn dst: 0x%lx\n",
> + page_to_pfn(spage), page_to_pfn(dpage));
> +
> + lock_page(dpage);
> + xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
> + copy_highpage(dpage, spage);
> + *dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
> + if (*src & MIGRATE_PFN_WRITE)
> + *dst |= MIGRATE_PFN_WRITE;
> + }
> + return 0;
> +}
> +
> +static int dmirror_migrate_to_system(struct dmirror *dmirror,
> + struct hmm_dmirror_cmd *cmd)
> +{
> + unsigned long start, end, addr;
> + unsigned long size = cmd->npages << PAGE_SHIFT;
> + struct mm_struct *mm = dmirror->notifier.mm;
> + struct vm_area_struct *vma;
> + unsigned long src_pfns[64];
> + unsigned long dst_pfns[64];
> + struct migrate_vma args;
> + unsigned long next;
> + int ret;
> +
> + start = cmd->addr;
> + end = start + size;
> + if (end < start)
> + return -EINVAL;
> +
> + /* Since the mm is for the mirrored process, get a reference first. */
> + if (!mmget_not_zero(mm))
> + return -EINVAL;
> +
> + mmap_read_lock(mm);
> + for (addr = start; addr < end; addr = next) {
> + vma = find_vma(mm, addr);
> + if (!vma || addr < vma->vm_start ||
> + !(vma->vm_flags & VM_READ)) {

If you use vma_lookup() instead of find_vma(), then the if statement can
be simplified.
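
Something like this (sketch):

	vma = vma_lookup(mm, addr);
	if (!vma || !(vma->vm_flags & VM_READ)) {
		ret = -EINVAL;
		goto out;
	}

vma_lookup() only returns a VMA that actually contains addr, so the
explicit addr < vma->vm_start check goes away.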

> + ret = -EINVAL;
> + goto out;
> + }
> + next = min(end, addr + (ARRAY_SIZE(src_pfns) << PAGE_SHIFT));
> + if (next > vma->vm_end)
> + next = vma->vm_end;
> +
> + args.vma = vma;
> + args.src = src_pfns;
> + args.dst = dst_pfns;
> + args.start = addr;
> + args.end = next;
> + args.pgmap_owner = dmirror->mdevice;
> + args.flags = dmirror_select_device(dmirror);
> +
> + ret = migrate_vma_setup(&args);
> + if (ret)
> + goto out;
> +
> + pr_debug("Migrating from device mem to sys mem\n");
> + dmirror_devmem_fault_alloc_and_copy(&args, dmirror);
> +
> + migrate_vma_pages(&args);
> + migrate_vma_finalize(&args);
> + }

The out label could go here instead, so the success path falls through and
the duplicated unlock/mmput exit below goes away.
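
That is, something like this (sketch; the loop body stays as posted):

	}
	out:
		/* Shared cleanup for both the success and error paths. */
		mmap_read_unlock(mm);
		mmput(mm);
		return ret;
	}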

> + mmap_read_unlock(mm);
> + mmput(mm);
> +
> + return ret;
> +
> +out:
> + mmap_read_unlock(mm);
> + mmput(mm);
> + return ret;
> +}
> +
> +static int dmirror_migrate_to_device(struct dmirror *dmirror,
> + struct hmm_dmirror_cmd *cmd)
> {
> unsigned long start, end, addr;
> unsigned long size = cmd->npages << PAGE_SHIFT;
> @@ -853,6 +982,7 @@ static int dmirror_migrate(struct dmirror *dmirror,
> if (ret)
> goto out;
>
> + pr_debug("Migrating from sys mem to device mem\n");
> dmirror_migrate_alloc_and_copy(&args, dmirror);
> migrate_vma_pages(&args);
> dmirror_migrate_finalize_and_map(&args, dmirror);
> @@ -861,7 +991,7 @@ static int dmirror_migrate(struct dmirror *dmirror,
> mmap_read_unlock(mm);
> mmput(mm);
>
> - /* Return the migrated data for verification. */
> + /* Return the migrated data for verification. only for pages in device zone */
> ret = dmirror_bounce_init(&bounce, start, size);
> if (ret)
> return ret;
> @@ -898,12 +1028,22 @@ static void dmirror_mkentry(struct dmirror *dmirror, struct hmm_range *range,
> }
>
> page = hmm_pfn_to_page(entry);
> - if (is_device_private_page(page)) {
> - /* Is the page migrated to this device or some other? */
> - if (dmirror->mdevice == dmirror_page_to_device(page))
> + if (is_device_page(page)) {
> + /* Is page ZONE_DEVICE coherent? */
> + if (!is_device_private_page(page)) {
> + if (dmirror->mdevice == dmirror_page_to_device(page))
> + *perm = HMM_DMIRROR_PROT_DEV_COHERENT_LOCAL;
> + else
> + *perm = HMM_DMIRROR_PROT_DEV_COHERENT_REMOTE;
> + /*
> + * Is page ZONE_DEVICE private migrated to
> + * this device or some other?
> + */
> + } else if (dmirror->mdevice == dmirror_page_to_device(page)) {
> *perm = HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL;
> - else
> + } else {
> *perm = HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE;
> + }
> } else if (is_zero_pfn(page_to_pfn(page)))
> *perm = HMM_DMIRROR_PROT_ZERO;
> else
> @@ -1100,8 +1240,12 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
> ret = dmirror_write(dmirror, &cmd);
> break;
>
> - case HMM_DMIRROR_MIGRATE:
> - ret = dmirror_migrate(dmirror, &cmd);
> + case HMM_DMIRROR_MIGRATE_TO_DEV:
> + ret = dmirror_migrate_to_device(dmirror, &cmd);
> + break;
> +
> + case HMM_DMIRROR_MIGRATE_TO_SYS:
> + ret = dmirror_migrate_to_system(dmirror, &cmd);
> break;
>
> case HMM_DMIRROR_EXCLUSIVE:
> @@ -1142,14 +1286,13 @@ static const struct file_operations dmirror_fops = {
>
> static void dmirror_devmem_free(struct page *page)
> {
> - struct page *rpage = page->zone_device_data;
> + struct page *rpage = BACKING_PAGE(page);
> struct dmirror_device *mdevice;
>
> - if (rpage)
> + if (rpage != page)
> __free_page(rpage);
>
> mdevice = dmirror_page_to_device(page);
> -
> spin_lock(&mdevice->lock);
> mdevice->cfree++;
> page->zone_device_data = mdevice->free_pages;
> @@ -1157,38 +1300,6 @@ static void dmirror_devmem_free(struct page *page)
> spin_unlock(&mdevice->lock);
> }
>
> -static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
> - struct dmirror *dmirror)
> -{
> - const unsigned long *src = args->src;
> - unsigned long *dst = args->dst;
> - unsigned long start = args->start;
> - unsigned long end = args->end;
> - unsigned long addr;
> -
> - for (addr = start; addr < end; addr += PAGE_SIZE,
> - src++, dst++) {
> - struct page *dpage, *spage;
> -
> - spage = migrate_pfn_to_page(*src);
> - if (!spage || !(*src & MIGRATE_PFN_MIGRATE))
> - continue;
> - spage = spage->zone_device_data;
> -
> - dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
> - if (!dpage)
> - continue;
> -
> - lock_page(dpage);
> - xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
> - copy_highpage(dpage, spage);
> - *dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
> - if (*src & MIGRATE_PFN_WRITE)
> - *dst |= MIGRATE_PFN_WRITE;
> - }
> - return 0;
> -}
> -
> static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
> {
> struct migrate_vma args;
> @@ -1203,7 +1314,7 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
> * the mirror but here we use it to hold the page for the simulated
> * device memory and that page holds the pointer to the mirror.
> */
> - rpage = vmf->page->zone_device_data;
> + rpage = BACKING_PAGE(vmf->page);
> dmirror = rpage->zone_device_data;
>
> /* FIXME demonstrate how we can adjust migrate range */
> @@ -1213,7 +1324,7 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
> args.src = &src_pfns;
> args.dst = &dst_pfns;
> args.pgmap_owner = dmirror->mdevice;
> - args.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
> + args.flags = dmirror_select_device(dmirror);
>
> if (migrate_vma_setup(&args))
> return VM_FAULT_SIGBUS;
> @@ -1279,14 +1390,26 @@ static void dmirror_device_remove(struct dmirror_device *mdevice)
> static int __init hmm_dmirror_init(void)
> {
> int ret;
> - int id;
> + int id = 0;
> + int ndevices = 0;
>
> ret = alloc_chrdev_region(&dmirror_dev, 0, DMIRROR_NDEVICES,
> "HMM_DMIRROR");
> if (ret)
> goto err_unreg;
>
> - for (id = 0; id < DMIRROR_NDEVICES; id++) {
> + memset(dmirror_devices, 0, DMIRROR_NDEVICES * sizeof(dmirror_devices[0]));
> + dmirror_devices[ndevices++].zone_device_type =
> + HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
> + dmirror_devices[ndevices++].zone_device_type =
> + HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
> + if (spm_addr_dev0 && spm_addr_dev1) {
> + dmirror_devices[ndevices++].zone_device_type =
> + HMM_DMIRROR_MEMORY_DEVICE_COHERENT;
> + dmirror_devices[ndevices++].zone_device_type =
> + HMM_DMIRROR_MEMORY_DEVICE_COHERENT;
> + }
> + for (id = 0; id < ndevices; id++) {
> ret = dmirror_device_init(dmirror_devices + id, id);
> if (ret)
> goto err_chrdev;
> @@ -1308,7 +1431,8 @@ static void __exit hmm_dmirror_exit(void)
> int id;
>
> for (id = 0; id < DMIRROR_NDEVICES; id++)
> - dmirror_device_remove(dmirror_devices + id);
> + if (dmirror_devices[id].zone_device_type)
> + dmirror_device_remove(dmirror_devices + id);
> unregister_chrdev_region(dmirror_dev, DMIRROR_NDEVICES);
> }
>
> diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
> index 625f3690d086..e190b2ab6f19 100644
> --- a/lib/test_hmm_uapi.h
> +++ b/lib/test_hmm_uapi.h
> @@ -33,11 +33,12 @@ struct hmm_dmirror_cmd {
> /* Expose the address space of the calling process through hmm device file */
> #define HMM_DMIRROR_READ _IOWR('H', 0x00, struct hmm_dmirror_cmd)
> #define HMM_DMIRROR_WRITE _IOWR('H', 0x01, struct hmm_dmirror_cmd)
> -#define HMM_DMIRROR_MIGRATE _IOWR('H', 0x02, struct hmm_dmirror_cmd)
> -#define HMM_DMIRROR_SNAPSHOT _IOWR('H', 0x03, struct hmm_dmirror_cmd)
> -#define HMM_DMIRROR_EXCLUSIVE _IOWR('H', 0x04, struct hmm_dmirror_cmd)
> -#define HMM_DMIRROR_CHECK_EXCLUSIVE _IOWR('H', 0x05, struct hmm_dmirror_cmd)
> -#define HMM_DMIRROR_GET_MEM_DEV_TYPE _IOWR('H', 0x06, struct hmm_dmirror_cmd)
> +#define HMM_DMIRROR_MIGRATE_TO_DEV _IOWR('H', 0x02, struct hmm_dmirror_cmd)
> +#define HMM_DMIRROR_MIGRATE_TO_SYS _IOWR('H', 0x03, struct hmm_dmirror_cmd)
> +#define HMM_DMIRROR_SNAPSHOT _IOWR('H', 0x04, struct hmm_dmirror_cmd)
> +#define HMM_DMIRROR_EXCLUSIVE _IOWR('H', 0x05, struct hmm_dmirror_cmd)
> +#define HMM_DMIRROR_CHECK_EXCLUSIVE _IOWR('H', 0x06, struct hmm_dmirror_cmd)
> +#define HMM_DMIRROR_GET_MEM_DEV_TYPE _IOWR('H', 0x07, struct hmm_dmirror_cmd)
>
> /*
> * Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
> @@ -52,6 +53,8 @@ struct hmm_dmirror_cmd {
> * device the ioctl() is made
> * HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE: Migrated device private page on some
> * other device
> + * HMM_DMIRROR_PROT_DEV_COHERENT: Migrate device coherent page on the device
> + * the ioctl() is made
> */
> enum {
> HMM_DMIRROR_PROT_ERROR = 0xFF,
> @@ -63,6 +66,8 @@ enum {
> HMM_DMIRROR_PROT_ZERO = 0x10,
> HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL = 0x20,
> HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE = 0x30,
> + HMM_DMIRROR_PROT_DEV_COHERENT_LOCAL = 0x40,
> + HMM_DMIRROR_PROT_DEV_COHERENT_REMOTE = 0x50,
> };
>
> enum {
> --
> 2.32.0
>
>