Subject: [PATCH v5 00/13] Add MEMORY_DEVICE_COHERENT for coherent device memory mapping

This is our MEMORY_DEVICE_COHERENT patch series rebased and updated
for the current 5.18.0 kernel.

Changes since the last version:
- Fixed problems with migration during long-term pinning in
get_user_pages
- Open coded vm_normal_lru_pages as suggested in previous code review
- Update hmm_gup_test with more get_user_pages calls, include
hmm_cow_in_device in hmm-test.

This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory
owned by a device that can be mapped into CPU page tables like
MEMORY_DEVICE_GENERIC and can also be migrated like
MEMORY_DEVICE_PRIVATE.

This patch series is mostly self-contained except for a few places where
it needs to update other subsystems to handle the new memory type.

System stability and performance are not affected according to our
ongoing testing, including xfstests.

How it works: The system BIOS advertises the GPU device memory
(aka VRAM) as SPM (special purpose memory) in the UEFI system address
map.

The amdgpu driver registers the memory with devmap as
MEMORY_DEVICE_COHERENT using devm_memremap_pages. The initial user for
this hardware page migration capability is the Frontier supercomputer
project. This functionality is not AMD-specific. We expect other GPU
vendors to find this functionality useful, and possibly other hardware
types in the future.
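
For reference, a minimal sketch of this registration step (not the
actual amdgpu code; mydev, my_pgmap_ops and the SPM range values are
placeholders) would look like:

        struct dev_pagemap *pgmap = &mydev->pgmap;
        void *r;

        pgmap->type = MEMORY_DEVICE_COHERENT;
        pgmap->range.start = spm_base;  /* SPM range from the UEFI map */
        pgmap->range.end = spm_base + spm_size - 1;
        pgmap->nr_range = 1;
        pgmap->ops = &my_pgmap_ops;     /* must provide page_free */
        pgmap->owner = mydev;           /* required for DEVICE_COHERENT */

        r = devm_memremap_pages(dev, pgmap);
        if (IS_ERR(r))
                return PTR_ERR(r);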

Our test nodes in the lab are similar to the Frontier configuration,
with 0.5 TB of system memory plus 256 GB of device memory split across
4 GPUs, all in a single coherent address space. Page migration is
expected to improve application efficiency significantly. We will
report empirical results as they become available.

Device-coherent pages encountered during get_user_pages are now
migrated back to system memory if they are being pinned long-term
(FOLL_LONGTERM). The reason is that long-term pinning would interfere
with the device memory manager that owns the device-coherent pages
(e.g. evictions in TTM). This series incorporates Alistair Popple's
patches to do this migration from pin_user_pages() calls. hmm_gup_test
has been added to hmm-test to exercise the different get_user_pages
paths.
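
As an illustration (a sketch only, not code from this series), a driver
doing a long-term pin over a range that contains device-coherent pages
now gets system memory pages back instead of an error:

        /*
         * Any device-coherent pages in the range are migrated to newly
         * allocated system memory pages before being pinned.
         */
        ret = pin_user_pages_fast(start, npages,
                                  FOLL_WRITE | FOLL_LONGTERM, pages);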

This series includes handling of device-managed anonymous pages
returned by vm_normal_pages. Although they behave like normal pages
for purposes of mapping in CPU page tables and for COW, they do not
support LRU lists, NUMA migration or THP.

We also introduce a FOLL_LRU flag that extends the same behaviour to
follow_page and related APIs, allowing callers to specify that they
expect to put the returned pages on an LRU list.
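
For instance (a sketch), a caller that intends to isolate the returned
page onto an LRU list would now do:

        /* Only return pages that can be put on an LRU list. */
        page = follow_page(vma, addr, FOLL_GET | FOLL_LRU);
        if (IS_ERR_OR_NULL(page))
                continue;       /* device-coherent mappings are skipped */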

v2:
- Rebase to latest 5.18-rc7.
- Drop patch "mm: add device coherent checker to remove migration pte"
and modify try_to_migrate_one, to let DEVICE_COHERENT pages fall
through to normal page path. Based on Alistair Popple's comment.
- Fix comment formatting.
- Reword comment in vm_normal_page about pte_devmap().
- Merge "drm/amdkfd: coherent type as sys mem on migration to ram" to
"drm/amdkfd: add SPM support for SVM".

v3:
- Rebase to latest 5.18.0.
- Patch "mm: handling Non-LRU pages returned by vm_normal_pages"
reordered.
- Add WARN_ON_ONCE for thp device coherent case.

v4:
- Rebase to latest 5.18.0
- Fix consistency between pages with the FOLL_LRU flag set and
pte_devmap at follow_page_pte.

v5:
- Remove unused zone_device_type from lib/test_hmm and
selftest/vm/hmm-test.c.

Alex Sierra (11):
mm: add zone device coherent type memory support
mm: handling Non-LRU pages returned by vm_normal_pages
mm: add device coherent vma selection for memory migration
drm/amdkfd: add SPM support for SVM
lib: test_hmm add ioctl to get zone device type
lib: test_hmm add module param for zone device type
lib: add support for device coherent type in test_hmm
tools: update hmm-test to support device coherent type
tools: update test_hmm script to support SP config
tools: add hmm gup tests for device coherent type
tools: add selftests to hmm for COW in device memory

Alistair Popple (2):
mm: remove the vma check in migrate_vma_setup()
mm/gup: migrate device coherent pages when pinning instead of failing

drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 34 ++-
fs/proc/task_mmu.c | 2 +-
include/linux/memremap.h | 19 ++
include/linux/migrate.h | 1 +
include/linux/mm.h | 3 +-
lib/test_hmm.c | 337 +++++++++++++++++------
lib/test_hmm_uapi.h | 19 +-
mm/gup.c | 53 +++-
mm/huge_memory.c | 2 +-
mm/internal.h | 1 +
mm/khugepaged.c | 9 +-
mm/ksm.c | 6 +-
mm/madvise.c | 4 +-
mm/memcontrol.c | 7 +-
mm/memory-failure.c | 8 +-
mm/memory.c | 9 +-
mm/mempolicy.c | 2 +-
mm/memremap.c | 10 +
mm/migrate.c | 4 +-
mm/migrate_device.c | 115 ++++++--
mm/mlock.c | 2 +-
mm/mprotect.c | 2 +-
mm/rmap.c | 5 +-
tools/testing/selftests/vm/hmm-tests.c | 306 ++++++++++++++++++--
tools/testing/selftests/vm/test_hmm.sh | 24 +-
25 files changed, 800 insertions(+), 184 deletions(-)

--
2.32.0



Subject: [PATCH v5 09/13] lib: add support for device coherent type in test_hmm

The device coherent type uses device memory that is coherently
accessible by the CPU. This can show up as an SP (special purpose)
memory range in the BIOS-e820 memory enumeration. If no SP memory is
supported by the system, it can be faked by setting
CONFIG_EFI_FAKE_MEMMAP.

Currently, test_hmm only supports two different SP ranges of at least
256MB each. These can be specified via the efi_fake_mem kernel
parameter. For example, two 1GB SP ranges starting at physical
addresses 0x100000000 and 0x140000000:
efi_fake_mem=1G@0x100000000:0x40000,1G@0x140000000:0x40000

Private and coherent device mirror instances can be created in the same
probe. This is done by passing the module parameters spm_addr_dev0 and
spm_addr_dev1. In this case, four instances of device_mirror are
created. The first two correspond to the private device type, the last
two to the coherent type. They can then be accessed from user space
through /dev/hmm_dmirror<num_device>. Usually num_device 0 and 1 are
for private, and 2 and 3 for coherent types. If no module parameters
are passed, only two instances of the private type device_mirror are
created.
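
For example, the first coherent instance can be opened from user space
like this (sketch):

        /* Device minors 2 and 3 are the device-coherent instances. */
        int fd = open("/dev/hmm_dmirror2", O_RDWR);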

Signed-off-by: Alex Sierra <[email protected]>
Acked-by: Felix Kuehling <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
---
lib/test_hmm.c | 253 +++++++++++++++++++++++++++++++++-----------
lib/test_hmm_uapi.h | 4 +
2 files changed, 196 insertions(+), 61 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index afb30af9f3ff..7930853e7fc5 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -32,11 +32,22 @@

#include "test_hmm_uapi.h"

-#define DMIRROR_NDEVICES 2
+#define DMIRROR_NDEVICES 4
#define DMIRROR_RANGE_FAULT_TIMEOUT 1000
#define DEVMEM_CHUNK_SIZE (256 * 1024 * 1024U)
#define DEVMEM_CHUNKS_RESERVE 16

+/*
+ * For device_private pages, dpage is just a dummy struct page
+ * representing a piece of device memory. dmirror_devmem_alloc_page
+ * allocates a real system memory page as backing storage to fake a
+ * real device. zone_device_data points to that backing page. But
+ * for device_coherent memory, the struct page represents real
+ * physical CPU-accessible memory that we can use directly.
+ */
+#define BACKING_PAGE(page) (is_device_private_page((page)) ? \
+ (page)->zone_device_data : (page))
+
static unsigned long spm_addr_dev0;
module_param(spm_addr_dev0, long, 0644);
MODULE_PARM_DESC(spm_addr_dev0,
@@ -125,6 +136,21 @@ static int dmirror_bounce_init(struct dmirror_bounce *bounce,
return 0;
}

+static bool dmirror_is_private_zone(struct dmirror_device *mdevice)
+{
+ return (mdevice->zone_device_type ==
+ HMM_DMIRROR_MEMORY_DEVICE_PRIVATE) ? true : false;
+}
+
+static enum migrate_vma_direction
+dmirror_select_device(struct dmirror *dmirror)
+{
+ return (dmirror->mdevice->zone_device_type ==
+ HMM_DMIRROR_MEMORY_DEVICE_PRIVATE) ?
+ MIGRATE_VMA_SELECT_DEVICE_PRIVATE :
+ MIGRATE_VMA_SELECT_DEVICE_COHERENT;
+}
+
static void dmirror_bounce_fini(struct dmirror_bounce *bounce)
{
vfree(bounce->ptr);
@@ -575,16 +601,19 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
{
struct page *dpage = NULL;
- struct page *rpage;
+ struct page *rpage = NULL;

/*
- * This is a fake device so we alloc real system memory to store
- * our device memory.
+ * For ZONE_DEVICE private type, this is a fake device so we allocate
+ * real system memory to store our device memory.
+ * For ZONE_DEVICE coherent type we use the actual dpage to store the
+ * data and ignore rpage.
*/
- rpage = alloc_page(GFP_HIGHUSER);
- if (!rpage)
- return NULL;
-
+ if (dmirror_is_private_zone(mdevice)) {
+ rpage = alloc_page(GFP_HIGHUSER);
+ if (!rpage)
+ return NULL;
+ }
spin_lock(&mdevice->lock);

if (mdevice->free_pages) {
@@ -603,7 +632,8 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
return dpage;

error:
- __free_page(rpage);
+ if (rpage)
+ __free_page(rpage);
return NULL;
}

@@ -629,12 +659,16 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
* unallocated pte_none() or read-only zero page.
*/
spage = migrate_pfn_to_page(*src);
+ if (WARN(spage && is_zone_device_page(spage),
+ "page already in device spage pfn: 0x%lx\n",
+ page_to_pfn(spage)))
+ continue;

dpage = dmirror_devmem_alloc_page(mdevice);
if (!dpage)
continue;

- rpage = dpage->zone_device_data;
+ rpage = BACKING_PAGE(dpage);
if (spage)
copy_highpage(rpage, spage);
else
@@ -648,6 +682,8 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
*/
rpage->zone_device_data = dmirror;

+ pr_debug("migrating from sys to dev pfn src: 0x%lx pfn dst: 0x%lx\n",
+ page_to_pfn(spage), page_to_pfn(dpage));
*dst = migrate_pfn(page_to_pfn(dpage));
if ((*src & MIGRATE_PFN_WRITE) ||
(!spage && args->vma->vm_flags & VM_WRITE))
@@ -725,11 +761,7 @@ static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
if (!dpage)
continue;

- /*
- * Store the page that holds the data so the page table
- * doesn't have to deal with ZONE_DEVICE private pages.
- */
- entry = dpage->zone_device_data;
+ entry = BACKING_PAGE(dpage);
if (*dst & MIGRATE_PFN_WRITE)
entry = xa_tag_pointer(entry, DPT_XA_TAG_WRITE);
entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC);
@@ -809,15 +841,126 @@ static int dmirror_exclusive(struct dmirror *dmirror,
return ret;
}

-static int dmirror_migrate(struct dmirror *dmirror,
- struct hmm_dmirror_cmd *cmd)
+static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
+ struct dmirror *dmirror)
+{
+ const unsigned long *src = args->src;
+ unsigned long *dst = args->dst;
+ unsigned long start = args->start;
+ unsigned long end = args->end;
+ unsigned long addr;
+
+ for (addr = start; addr < end; addr += PAGE_SIZE,
+ src++, dst++) {
+ struct page *dpage, *spage;
+
+ spage = migrate_pfn_to_page(*src);
+ if (!spage || !(*src & MIGRATE_PFN_MIGRATE))
+ continue;
+
+ if (WARN_ON(!is_device_private_page(spage) &&
+ !is_device_coherent_page(spage)))
+ continue;
+ spage = BACKING_PAGE(spage);
+ dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
+ if (!dpage)
+ continue;
+ pr_debug("migrating from dev to sys pfn src: 0x%lx pfn dst: 0x%lx\n",
+ page_to_pfn(spage), page_to_pfn(dpage));
+
+ lock_page(dpage);
+ xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
+ copy_highpage(dpage, spage);
+ *dst = migrate_pfn(page_to_pfn(dpage));
+ if (*src & MIGRATE_PFN_WRITE)
+ *dst |= MIGRATE_PFN_WRITE;
+ }
+ return 0;
+}
+
+static unsigned long
+dmirror_successful_migrated_pages(struct migrate_vma *migrate)
+{
+ unsigned long cpages = 0;
+ unsigned long i;
+
+ for (i = 0; i < migrate->npages; i++) {
+ if (migrate->src[i] & MIGRATE_PFN_VALID &&
+ migrate->src[i] & MIGRATE_PFN_MIGRATE)
+ cpages++;
+ }
+ return cpages;
+}
+
+static int dmirror_migrate_to_system(struct dmirror *dmirror,
+ struct hmm_dmirror_cmd *cmd)
{
unsigned long start, end, addr;
unsigned long size = cmd->npages << PAGE_SHIFT;
struct mm_struct *mm = dmirror->notifier.mm;
struct vm_area_struct *vma;
- unsigned long src_pfns[64];
- unsigned long dst_pfns[64];
+ unsigned long src_pfns[64] = { 0 };
+ unsigned long dst_pfns[64] = { 0 };
+ struct migrate_vma args;
+ unsigned long next;
+ int ret;
+
+ start = cmd->addr;
+ end = start + size;
+ if (end < start)
+ return -EINVAL;
+
+ /* Since the mm is for the mirrored process, get a reference first. */
+ if (!mmget_not_zero(mm))
+ return -EINVAL;
+
+ cmd->cpages = 0;
+ mmap_read_lock(mm);
+ for (addr = start; addr < end; addr = next) {
+ vma = vma_lookup(mm, addr);
+ if (!vma || !(vma->vm_flags & VM_READ)) {
+ ret = -EINVAL;
+ goto out;
+ }
+ next = min(end, addr + (ARRAY_SIZE(src_pfns) << PAGE_SHIFT));
+ if (next > vma->vm_end)
+ next = vma->vm_end;
+
+ args.vma = vma;
+ args.src = src_pfns;
+ args.dst = dst_pfns;
+ args.start = addr;
+ args.end = next;
+ args.pgmap_owner = dmirror->mdevice;
+ args.flags = dmirror_select_device(dmirror);
+
+ ret = migrate_vma_setup(&args);
+ if (ret)
+ goto out;
+
+ pr_debug("Migrating from device mem to sys mem\n");
+ dmirror_devmem_fault_alloc_and_copy(&args, dmirror);
+
+ migrate_vma_pages(&args);
+ cmd->cpages += dmirror_successful_migrated_pages(&args);
+ migrate_vma_finalize(&args);
+ }
+out:
+ mmap_read_unlock(mm);
+ mmput(mm);
+
+ return ret;
+}
+
+static int dmirror_migrate_to_device(struct dmirror *dmirror,
+ struct hmm_dmirror_cmd *cmd)
+{
+ unsigned long start, end, addr;
+ unsigned long size = cmd->npages << PAGE_SHIFT;
+ struct mm_struct *mm = dmirror->notifier.mm;
+ struct vm_area_struct *vma;
+ unsigned long src_pfns[64] = { 0 };
+ unsigned long dst_pfns[64] = { 0 };
struct dmirror_bounce bounce;
struct migrate_vma args;
unsigned long next;
@@ -854,6 +997,7 @@ static int dmirror_migrate(struct dmirror *dmirror,
if (ret)
goto out;

+ pr_debug("Migrating from sys mem to device mem\n");
dmirror_migrate_alloc_and_copy(&args, dmirror);
migrate_vma_pages(&args);
dmirror_migrate_finalize_and_map(&args, dmirror);
@@ -862,7 +1006,10 @@ static int dmirror_migrate(struct dmirror *dmirror,
mmap_read_unlock(mm);
mmput(mm);

- /* Return the migrated data for verification. */
+ /*
+ * Return the migrated data for verification.
+ * Only for pages in device zone
+ */
ret = dmirror_bounce_init(&bounce, start, size);
if (ret)
return ret;
@@ -905,6 +1052,12 @@ static void dmirror_mkentry(struct dmirror *dmirror, struct hmm_range *range,
*perm = HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL;
else
*perm = HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE;
+ } else if (is_device_coherent_page(page)) {
+ /* Is the page migrated to this device or some other? */
+ if (dmirror->mdevice == dmirror_page_to_device(page))
+ *perm = HMM_DMIRROR_PROT_DEV_COHERENT_LOCAL;
+ else
+ *perm = HMM_DMIRROR_PROT_DEV_COHERENT_REMOTE;
} else if (is_zero_pfn(page_to_pfn(page)))
*perm = HMM_DMIRROR_PROT_ZERO;
else
@@ -1092,8 +1245,12 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
ret = dmirror_write(dmirror, &cmd);
break;

- case HMM_DMIRROR_MIGRATE:
- ret = dmirror_migrate(dmirror, &cmd);
+ case HMM_DMIRROR_MIGRATE_TO_DEV:
+ ret = dmirror_migrate_to_device(dmirror, &cmd);
+ break;
+
+ case HMM_DMIRROR_MIGRATE_TO_SYS:
+ ret = dmirror_migrate_to_system(dmirror, &cmd);
break;

case HMM_DMIRROR_EXCLUSIVE:
@@ -1155,14 +1312,13 @@ static const struct file_operations dmirror_fops = {

static void dmirror_devmem_free(struct page *page)
{
- struct page *rpage = page->zone_device_data;
+ struct page *rpage = BACKING_PAGE(page);
struct dmirror_device *mdevice;

- if (rpage)
+ if (rpage != page)
__free_page(rpage);

mdevice = dmirror_page_to_device(page);
-
spin_lock(&mdevice->lock);
mdevice->cfree++;
page->zone_device_data = mdevice->free_pages;
@@ -1170,43 +1326,11 @@ static void dmirror_devmem_free(struct page *page)
spin_unlock(&mdevice->lock);
}

-static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
- struct dmirror *dmirror)
-{
- const unsigned long *src = args->src;
- unsigned long *dst = args->dst;
- unsigned long start = args->start;
- unsigned long end = args->end;
- unsigned long addr;
-
- for (addr = start; addr < end; addr += PAGE_SIZE,
- src++, dst++) {
- struct page *dpage, *spage;
-
- spage = migrate_pfn_to_page(*src);
- if (!spage || !(*src & MIGRATE_PFN_MIGRATE))
- continue;
- spage = spage->zone_device_data;
-
- dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
- if (!dpage)
- continue;
-
- lock_page(dpage);
- xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
- copy_highpage(dpage, spage);
- *dst = migrate_pfn(page_to_pfn(dpage));
- if (*src & MIGRATE_PFN_WRITE)
- *dst |= MIGRATE_PFN_WRITE;
- }
- return 0;
-}
-
static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
{
struct migrate_vma args;
- unsigned long src_pfns;
- unsigned long dst_pfns;
+ unsigned long src_pfns = 0;
+ unsigned long dst_pfns = 0;
struct page *rpage;
struct dmirror *dmirror;
vm_fault_t ret;
@@ -1226,7 +1350,7 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
args.src = &src_pfns;
args.dst = &dst_pfns;
args.pgmap_owner = dmirror->mdevice;
- args.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
+ args.flags = dmirror_select_device(dmirror);

if (migrate_vma_setup(&args))
return VM_FAULT_SIGBUS;
@@ -1305,6 +1429,12 @@ static int __init hmm_dmirror_init(void)
HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
dmirror_devices[ndevices++].zone_device_type =
HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
+ if (spm_addr_dev0 && spm_addr_dev1) {
+ dmirror_devices[ndevices++].zone_device_type =
+ HMM_DMIRROR_MEMORY_DEVICE_COHERENT;
+ dmirror_devices[ndevices++].zone_device_type =
+ HMM_DMIRROR_MEMORY_DEVICE_COHERENT;
+ }
for (id = 0; id < ndevices; id++) {
ret = dmirror_device_init(dmirror_devices + id, id);
if (ret)
@@ -1327,7 +1457,8 @@ static void __exit hmm_dmirror_exit(void)
int id;

for (id = 0; id < DMIRROR_NDEVICES; id++)
- dmirror_device_remove(dmirror_devices + id);
+ if (dmirror_devices[id].zone_device_type)
+ dmirror_device_remove(dmirror_devices + id);
unregister_chrdev_region(dmirror_dev, DMIRROR_NDEVICES);
}

diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index f700da7807c1..e31d58c9034a 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -50,6 +50,8 @@ struct hmm_dmirror_cmd {
* device the ioctl() is made
* HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE: Migrated device private page on some
* other device
+ * HMM_DMIRROR_PROT_DEV_COHERENT: Migrated device coherent page on the device
+ * the ioctl() is made
*/
enum {
HMM_DMIRROR_PROT_ERROR = 0xFF,
@@ -61,6 +63,8 @@ enum {
HMM_DMIRROR_PROT_ZERO = 0x10,
HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL = 0x20,
HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE = 0x30,
+ HMM_DMIRROR_PROT_DEV_COHERENT_LOCAL = 0x40,
+ HMM_DMIRROR_PROT_DEV_COHERENT_REMOTE = 0x50,
};

enum {
--
2.32.0


Subject: [PATCH v5 07/13] lib: test_hmm add ioctl to get zone device type

A new ioctl cmd is added to query the zone device type. This will be
used once test_hmm adds the zone device coherent type.

Signed-off-by: Alex Sierra <[email protected]>
Acked-by: Felix Kuehling <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
---
lib/test_hmm.c | 11 +++++++++--
lib/test_hmm_uapi.h | 14 ++++++++++----
2 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index cfe632047839..915ef6b5b0d4 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -87,6 +87,7 @@ struct dmirror_chunk {
struct dmirror_device {
struct cdev cdevice;
struct hmm_devmem *devmem;
+ unsigned int zone_device_type;

unsigned int devmem_capacity;
unsigned int devmem_count;
@@ -1260,14 +1261,20 @@ static void dmirror_device_remove(struct dmirror_device *mdevice)
static int __init hmm_dmirror_init(void)
{
int ret;
- int id;
+ int id = 0;
+ int ndevices = 0;

ret = alloc_chrdev_region(&dmirror_dev, 0, DMIRROR_NDEVICES,
"HMM_DMIRROR");
if (ret)
goto err_unreg;

- for (id = 0; id < DMIRROR_NDEVICES; id++) {
+ memset(dmirror_devices, 0, DMIRROR_NDEVICES * sizeof(dmirror_devices[0]));
+ dmirror_devices[ndevices++].zone_device_type =
+ HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
+ dmirror_devices[ndevices++].zone_device_type =
+ HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
+ for (id = 0; id < ndevices; id++) {
ret = dmirror_device_init(dmirror_devices + id, id);
if (ret)
goto err_chrdev;
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index f14dea5dcd06..0511af7464ee 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -31,10 +31,11 @@ struct hmm_dmirror_cmd {
/* Expose the address space of the calling process through hmm device file */
#define HMM_DMIRROR_READ _IOWR('H', 0x00, struct hmm_dmirror_cmd)
#define HMM_DMIRROR_WRITE _IOWR('H', 0x01, struct hmm_dmirror_cmd)
-#define HMM_DMIRROR_MIGRATE _IOWR('H', 0x02, struct hmm_dmirror_cmd)
-#define HMM_DMIRROR_SNAPSHOT _IOWR('H', 0x03, struct hmm_dmirror_cmd)
-#define HMM_DMIRROR_EXCLUSIVE _IOWR('H', 0x04, struct hmm_dmirror_cmd)
-#define HMM_DMIRROR_CHECK_EXCLUSIVE _IOWR('H', 0x05, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_MIGRATE_TO_DEV _IOWR('H', 0x02, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_MIGRATE_TO_SYS _IOWR('H', 0x03, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_SNAPSHOT _IOWR('H', 0x04, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_EXCLUSIVE _IOWR('H', 0x05, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_CHECK_EXCLUSIVE _IOWR('H', 0x06, struct hmm_dmirror_cmd)

/*
* Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
@@ -62,4 +63,9 @@ enum {
HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE = 0x30,
};

+enum {
+ /* 0 is reserved to catch uninitialized type fields */
+ HMM_DMIRROR_MEMORY_DEVICE_PRIVATE = 1,
+};
+
#endif /* _LIB_TEST_HMM_UAPI_H */
--
2.32.0


Subject: [PATCH v5 03/13] mm: add device coherent vma selection for memory migration

This case is used to migrate pages from device memory back to system
memory. Device coherent type memory is cache coherent from both the
device and CPU points of view.
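
A sketch of the intended use, mirroring what the test_hmm and amdkfd
patches in this series do (mydevice stands in for the pgmap owner):

        struct migrate_vma args = {
                .vma            = vma,
                .start          = start,
                .end            = end,
                .src            = src_pfns,
                .dst            = dst_pfns,
                .pgmap_owner    = mydevice,
                .flags          = MIGRATE_VMA_SELECT_DEVICE_COHERENT,
        };

        ret = migrate_vma_setup(&args);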

Signed-off-by: Alex Sierra <[email protected]>
Acked-by: Felix Kuehling <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
---
include/linux/migrate.h | 1 +
mm/migrate_device.c | 12 +++++++++---
2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 069a89e847f3..b84908debe5c 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -148,6 +148,7 @@ static inline unsigned long migrate_pfn(unsigned long pfn)
enum migrate_vma_direction {
MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
+ MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
};

struct migrate_vma {
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index a4847ad65da3..18bc6483f63a 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -148,15 +148,21 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
if (is_writable_device_private_entry(entry))
mpfn |= MIGRATE_PFN_WRITE;
} else {
- if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
- goto next;
pfn = pte_pfn(pte);
- if (is_zero_pfn(pfn)) {
+ if (is_zero_pfn(pfn) &&
+ (migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
mpfn = MIGRATE_PFN_MIGRATE;
migrate->cpages++;
goto next;
}
page = vm_normal_page(migrate->vma, addr, pte);
+ if (page && !is_zone_device_page(page) &&
+ !(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
+ goto next;
+ else if (page && is_device_coherent_page(page) &&
+ (!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_COHERENT) ||
+ page->pgmap->owner != migrate->pgmap_owner))
+ goto next;
mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
}
--
2.32.0


Subject: [PATCH v5 12/13] tools: add hmm gup tests for device coherent type

The intention is to test the hmm device coherent type under different
get_user_pages paths. Also, test gup with the FOLL_LONGTERM flag set
on device coherent pages. These pages should get migrated back to
system memory.

Signed-off-by: Alex Sierra <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
---
tools/testing/selftests/vm/hmm-tests.c | 105 +++++++++++++++++++++++++
1 file changed, 105 insertions(+)

diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
index 4b547188ec40..3295c8bf6c63 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -36,6 +36,7 @@
* in the usual include/uapi/... directory.
*/
#include "../../../../lib/test_hmm_uapi.h"
+#include "../../../../mm/gup_test.h"

struct hmm_buffer {
void *ptr;
@@ -59,6 +60,8 @@ enum {
#define NTIMES 10

#define ALIGN(x, a) (((x) + (a - 1)) & (~((a) - 1)))
+/* Just the flags we need, copied from mm.h: */
+#define FOLL_WRITE 0x01 /* check pte is writable */

FIXTURE(hmm)
{
@@ -1764,4 +1767,106 @@ TEST_F(hmm, exclusive_cow)
hmm_buffer_free(buffer);
}

+static int gup_test_exec(int gup_fd, unsigned long addr,
+ int cmd, int npages, int size)
+{
+ struct gup_test gup = {
+ .nr_pages_per_call = npages,
+ .addr = addr,
+ .gup_flags = FOLL_WRITE,
+ .size = size,
+ };
+
+ if (ioctl(gup_fd, cmd, &gup)) {
+ perror("ioctl on error\n");
+ return errno;
+ }
+
+ return 0;
+}
+
+/*
+ * Test get user device pages through gup_test. Setting PIN_LONGTERM flag.
+ * This should trigger a migration back to system memory for both, private
+ * and coherent type pages.
+ * This test makes use of gup_test module. Make sure GUP_TEST_CONFIG is added
+ * to your configuration before you run it.
+ */
+TEST_F(hmm, hmm_gup_test)
+{
+ struct hmm_buffer *buffer;
+ int gup_fd;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long i;
+ int *ptr;
+ int ret;
+ unsigned char *m;
+
+ gup_fd = open("/sys/kernel/debug/gup_test", O_RDWR);
+ if (gup_fd == -1)
+ SKIP(return, "Skipping test, could not find gup_test driver");
+
+ npages = 3;
+ size = npages << self->page_shift;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = size;
+ buffer->mirror = malloc(size);
+ ASSERT_NE(buffer->mirror, NULL);
+
+ buffer->ptr = mmap(NULL, size,
+ PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS,
+ buffer->fd, 0);
+ ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+ /* Initialize buffer in system memory. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ptr[i] = i;
+
+ /* Migrate memory to device. */
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+ /* Check what the device read. */
+ for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], i);
+
+ ASSERT_EQ(gup_test_exec(gup_fd,
+ (unsigned long)buffer->ptr,
+ GUP_BASIC_TEST, 1, self->page_size), 0);
+ ASSERT_EQ(gup_test_exec(gup_fd,
+ (unsigned long)buffer->ptr + 1 * self->page_size,
+ GUP_FAST_BENCHMARK, 1, self->page_size), 0);
+ ASSERT_EQ(gup_test_exec(gup_fd,
+ (unsigned long)buffer->ptr + 2 * self->page_size,
+ PIN_LONGTERM_BENCHMARK, 1, self->page_size), 0);
+
+ /* Take snapshot to CPU pagetables */
+ ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_SNAPSHOT, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+ m = buffer->mirror;
+ if (hmm_is_coherent_type(variant->device_number)) {
+ ASSERT_EQ(HMM_DMIRROR_PROT_DEV_COHERENT_LOCAL | HMM_DMIRROR_PROT_WRITE, m[0]);
+ ASSERT_EQ(HMM_DMIRROR_PROT_DEV_COHERENT_LOCAL | HMM_DMIRROR_PROT_WRITE, m[1]);
+ } else {
+ ASSERT_EQ(HMM_DMIRROR_PROT_WRITE, m[0]);
+ ASSERT_EQ(HMM_DMIRROR_PROT_WRITE, m[1]);
+ }
+ ASSERT_EQ(HMM_DMIRROR_PROT_WRITE, m[2]);
+ /*
+ * Check again the content on the pages. Make sure there's no
+ * corrupted data.
+ */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], i);
+
+ close(gup_fd);
+ hmm_buffer_free(buffer);
+}
TEST_HARNESS_MAIN
--
2.32.0


Subject: [PATCH v5 01/13] mm: add zone device coherent type memory support

Device memory that is cache coherent from the device and CPU points of
view. This is used on platforms that have an advanced system bus (like
CAPI or CXL). Any page of a process can be migrated to such memory.
However, no one should be allowed to pin such memory so that it can
always be evicted.
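
Generic code can then distinguish this memory type with the new
helpers, e.g. (sketch):

        if (is_device_coherent_page(page)) {
                /*
                 * CPU-coherent ZONE_DEVICE page: mappable and migratable
                 * like normal memory, but not on the LRU and never pinned
                 * long-term.
                 */
        }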

Signed-off-by: Alex Sierra <[email protected]>
Acked-by: Felix Kuehling <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
[hch: rebased ontop of the refcount changes,
removed is_dev_private_or_coherent_page]
Signed-off-by: Christoph Hellwig <[email protected]>
---
include/linux/memremap.h | 19 +++++++++++++++++++
mm/memcontrol.c | 7 ++++---
mm/memory-failure.c | 8 ++++++--
mm/memremap.c | 10 ++++++++++
mm/migrate_device.c | 16 +++++++---------
mm/rmap.c | 5 +++--
6 files changed, 49 insertions(+), 16 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 8af304f6b504..9f752ebed613 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -41,6 +41,13 @@ struct vmem_altmap {
* A more complete discussion of unaddressable memory may be found in
* include/linux/hmm.h and Documentation/vm/hmm.rst.
*
+ * MEMORY_DEVICE_COHERENT:
+ * Device memory that is cache coherent from device and CPU point of view. This
+ * is used on platforms that have an advanced system bus (like CAPI or CXL). A
+ * driver can hotplug the device memory using ZONE_DEVICE and with that memory
+ * type. Any page of a process can be migrated to such memory. However no one
+ * should be allowed to pin such memory so that it can always be evicted.
+ *
* MEMORY_DEVICE_FS_DAX:
* Host memory that has similar access semantics as System RAM i.e. DMA
* coherent and supports page pinning. In support of coordinating page
@@ -61,6 +68,7 @@ struct vmem_altmap {
enum memory_type {
/* 0 is reserved to catch uninitialized type fields */
MEMORY_DEVICE_PRIVATE = 1,
+ MEMORY_DEVICE_COHERENT,
MEMORY_DEVICE_FS_DAX,
MEMORY_DEVICE_GENERIC,
MEMORY_DEVICE_PCI_P2PDMA,
@@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct folio *folio)
return is_device_private_page(&folio->page);
}

+static inline bool is_device_coherent_page(const struct page *page)
+{
+ return is_zone_device_page(page) &&
+ page->pgmap->type == MEMORY_DEVICE_COHERENT;
+}
+
+static inline bool folio_is_device_coherent(const struct folio *folio)
+{
+ return is_device_coherent_page(&folio->page);
+}
+
static inline bool is_pci_p2pdma_page(const struct page *page)
{
return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index abec50f31fe6..93f80d7ca148 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5665,8 +5665,8 @@ static int mem_cgroup_move_account(struct page *page,
* 2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
* target for charge migration. if @target is not NULL, the entry is stored
* in target->ent.
- * 3(MC_TARGET_DEVICE): like MC_TARGET_PAGE but page is MEMORY_DEVICE_PRIVATE
- * (so ZONE_DEVICE page and thus not on the lru).
+ * 3(MC_TARGET_DEVICE): like MC_TARGET_PAGE but page is device memory and
+ * thus not on the lru.
* For now we such page is charge like a regular page would be as for all
* intent and purposes it is just special memory taking the place of a
* regular page.
@@ -5704,7 +5704,8 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
*/
if (page_memcg(page) == mc.from) {
ret = MC_TARGET_PAGE;
- if (is_device_private_page(page))
+ if (is_device_private_page(page) ||
+ is_device_coherent_page(page))
ret = MC_TARGET_DEVICE;
if (target)
target->page = page;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index b85661cbdc4a..0b6a0a01ee09 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1683,12 +1683,16 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
goto unlock;
}

- if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
+ switch (pgmap->type) {
+ case MEMORY_DEVICE_PRIVATE:
+ case MEMORY_DEVICE_COHERENT:
/*
- * TODO: Handle HMM pages which may need coordination
+ * TODO: Handle device pages which may need coordination
* with device-side memory.
*/
goto unlock;
+ default:
+ break;
}

/*
diff --git a/mm/memremap.c b/mm/memremap.c
index 2b92e97cb25b..dbd2631b3520 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -315,6 +315,16 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
return ERR_PTR(-EINVAL);
}
break;
+ case MEMORY_DEVICE_COHERENT:
+ if (!pgmap->ops->page_free) {
+ WARN(1, "Missing page_free method\n");
+ return ERR_PTR(-EINVAL);
+ }
+ if (!pgmap->owner) {
+ WARN(1, "Missing owner\n");
+ return ERR_PTR(-EINVAL);
+ }
+ break;
case MEMORY_DEVICE_FS_DAX:
if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) {
WARN(1, "File system DAX not supported\n");
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 5052093d0262..a4847ad65da3 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -518,7 +518,7 @@ EXPORT_SYMBOL(migrate_vma_setup);
* handle_pte_fault()
* do_anonymous_page()
* to map in an anonymous zero page but the struct page will be a ZONE_DEVICE
- * private page.
+ * private or coherent page.
*/
static void migrate_vma_insert_page(struct migrate_vma *migrate,
unsigned long addr,
@@ -594,11 +594,8 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
page_to_pfn(page));
entry = swp_entry_to_pte(swp_entry);
} else {
- /*
- * For now we only support migrating to un-addressable device
- * memory.
- */
- if (is_zone_device_page(page)) {
+ if (is_zone_device_page(page) &&
+ !is_device_coherent_page(page)) {
pr_warn_once("Unsupported ZONE_DEVICE page type.\n");
goto abort;
}
@@ -701,10 +698,11 @@ void migrate_vma_pages(struct migrate_vma *migrate)

mapping = page_mapping(page);

- if (is_device_private_page(newpage)) {
+ if (is_device_private_page(newpage) ||
+ is_device_coherent_page(newpage)) {
/*
- * For now only support private anonymous when migrating
- * to un-addressable device memory.
+ * For now only support anonymous memory migrating to
+ * device private or coherent memory.
*/
if (mapping) {
migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
diff --git a/mm/rmap.c b/mm/rmap.c
index 5bcb334cd6f2..04fac1af870b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1957,7 +1957,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
/* Update high watermark before we lower rss */
update_hiwater_rss(mm);

- if (folio_is_zone_device(folio)) {
+ if (folio_is_device_private(folio)) {
unsigned long pfn = folio_pfn(folio);
swp_entry_t entry;
pte_t swp_pte;
@@ -2131,7 +2131,8 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
TTU_SYNC)))
return;

- if (folio_is_zone_device(folio) && !folio_is_device_private(folio))
+ if (folio_is_zone_device(folio) &&
+ (!folio_is_device_private(folio) && !folio_is_device_coherent(folio)))
return;

/*
--
2.32.0


Subject: [PATCH v5 10/13] tools: update hmm-test to support device coherent type

Test cases such as migrate_fault and migrate_multiple were modified to
explicitly migrate from device to system memory without the need for
page faults when using the device coherent type.

The snapshot test case was updated to read the memory device type first
and, based on that, check for the proper returned results. A
migrate_ping_pong test case was added to test explicit migration from
device to system memory for both private and coherent zone types.

Helpers to migrate from device to system memory and vice versa were
also added.

Signed-off-by: Alex Sierra <[email protected]>
Acked-by: Felix Kuehling <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
---
tools/testing/selftests/vm/hmm-tests.c | 121 ++++++++++++++++++++-----
1 file changed, 100 insertions(+), 21 deletions(-)

diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
index 203323967b50..4b547188ec40 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -46,6 +46,13 @@ struct hmm_buffer {
uint64_t faults;
};

+enum {
+ HMM_PRIVATE_DEVICE_ONE,
+ HMM_PRIVATE_DEVICE_TWO,
+ HMM_COHERENCE_DEVICE_ONE,
+ HMM_COHERENCE_DEVICE_TWO,
+};
+
#define TWOMEG (1 << 21)
#define HMM_BUFFER_SIZE (1024 << 12)
#define HMM_PATH_MAX 64
@@ -60,6 +67,21 @@ FIXTURE(hmm)
unsigned int page_shift;
};

+FIXTURE_VARIANT(hmm)
+{
+ int device_number;
+};
+
+FIXTURE_VARIANT_ADD(hmm, hmm_device_private)
+{
+ .device_number = HMM_PRIVATE_DEVICE_ONE,
+};
+
+FIXTURE_VARIANT_ADD(hmm, hmm_device_coherent)
+{
+ .device_number = HMM_COHERENCE_DEVICE_ONE,
+};
+
FIXTURE(hmm2)
{
int fd0;
@@ -68,6 +90,24 @@ FIXTURE(hmm2)
unsigned int page_shift;
};

+FIXTURE_VARIANT(hmm2)
+{
+ int device_number0;
+ int device_number1;
+};
+
+FIXTURE_VARIANT_ADD(hmm2, hmm2_device_private)
+{
+ .device_number0 = HMM_PRIVATE_DEVICE_ONE,
+ .device_number1 = HMM_PRIVATE_DEVICE_TWO,
+};
+
+FIXTURE_VARIANT_ADD(hmm2, hmm2_device_coherent)
+{
+ .device_number0 = HMM_COHERENCE_DEVICE_ONE,
+ .device_number1 = HMM_COHERENCE_DEVICE_TWO,
+};
+
static int hmm_open(int unit)
{
char pathname[HMM_PATH_MAX];
@@ -81,12 +121,19 @@ static int hmm_open(int unit)
return fd;
}

+static bool hmm_is_coherent_type(int dev_num)
+{
+ return (dev_num >= HMM_COHERENCE_DEVICE_ONE);
+}
+
FIXTURE_SETUP(hmm)
{
self->page_size = sysconf(_SC_PAGE_SIZE);
self->page_shift = ffs(self->page_size) - 1;

- self->fd = hmm_open(0);
+ self->fd = hmm_open(variant->device_number);
+ if (self->fd < 0 && hmm_is_coherent_type(variant->device_number))
+ SKIP(exit(0), "DEVICE_COHERENT not available");
ASSERT_GE(self->fd, 0);
}

@@ -95,9 +142,11 @@ FIXTURE_SETUP(hmm2)
self->page_size = sysconf(_SC_PAGE_SIZE);
self->page_shift = ffs(self->page_size) - 1;

- self->fd0 = hmm_open(0);
+ self->fd0 = hmm_open(variant->device_number0);
+ if (self->fd0 < 0 && hmm_is_coherent_type(variant->device_number0))
+ SKIP(exit(0), "DEVICE_COHERENT not available");
ASSERT_GE(self->fd0, 0);
- self->fd1 = hmm_open(1);
+ self->fd1 = hmm_open(variant->device_number1);
ASSERT_GE(self->fd1, 0);
}

@@ -211,6 +260,20 @@ static void hmm_nanosleep(unsigned int n)
nanosleep(&t, NULL);
}

+static int hmm_migrate_sys_to_dev(int fd,
+ struct hmm_buffer *buffer,
+ unsigned long npages)
+{
+ return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE_TO_DEV, buffer, npages);
+}
+
+static int hmm_migrate_dev_to_sys(int fd,
+ struct hmm_buffer *buffer,
+ unsigned long npages)
+{
+ return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE_TO_SYS, buffer, npages);
+}
+
/*
* Simple NULL test of device open/close.
*/
@@ -875,7 +938,7 @@ TEST_F(hmm, migrate)
ptr[i] = i;

/* Migrate memory to device. */
- ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, npages);

@@ -923,7 +986,7 @@ TEST_F(hmm, migrate_fault)
ptr[i] = i;

/* Migrate memory to device. */
- ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, npages);

@@ -936,7 +999,7 @@ TEST_F(hmm, migrate_fault)
ASSERT_EQ(ptr[i], i);

/* Migrate memory to the device again. */
- ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, npages);

@@ -976,7 +1039,7 @@ TEST_F(hmm, migrate_shared)
ASSERT_NE(buffer->ptr, MAP_FAILED);

/* Migrate memory to device. */
- ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
ASSERT_EQ(ret, -ENOENT);

hmm_buffer_free(buffer);
@@ -1015,7 +1078,7 @@ TEST_F(hmm2, migrate_mixed)
p = buffer->ptr;

/* Migrating a protected area should be an error. */
- ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, npages);
+ ret = hmm_migrate_sys_to_dev(self->fd1, buffer, npages);
ASSERT_EQ(ret, -EINVAL);

/* Punch a hole after the first page address. */
@@ -1023,7 +1086,7 @@ TEST_F(hmm2, migrate_mixed)
ASSERT_EQ(ret, 0);

/* We expect an error if the vma doesn't cover the range. */
- ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 3);
+ ret = hmm_migrate_sys_to_dev(self->fd1, buffer, 3);
ASSERT_EQ(ret, -EINVAL);

/* Page 2 will be a read-only zero page. */
@@ -1055,13 +1118,13 @@ TEST_F(hmm2, migrate_mixed)

/* Now try to migrate pages 2-5 to device 1. */
buffer->ptr = p + 2 * self->page_size;
- ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 4);
+ ret = hmm_migrate_sys_to_dev(self->fd1, buffer, 4);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, 4);

/* Page 5 won't be migrated to device 0 because it's on device 1. */
buffer->ptr = p + 5 * self->page_size;
- ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_MIGRATE, buffer, 1);
+ ret = hmm_migrate_sys_to_dev(self->fd0, buffer, 1);
ASSERT_EQ(ret, -ENOENT);
buffer->ptr = p;

@@ -1070,8 +1133,12 @@ TEST_F(hmm2, migrate_mixed)
}

/*
- * Migrate anonymous memory to device private memory and fault it back to system
- * memory multiple times.
+ * Migrate anonymous memory to device memory and back to system memory
+ * multiple times. In case of private zone configuration, this is done
+ * through fault pages accessed by CPU. In case of coherent zone configuration,
+ * the pages from the device should be explicitly migrated back to system memory.
+ * The reason is Coherent device zone has coherent access by CPU, therefore
+ * it will not generate any page fault.
*/
TEST_F(hmm, migrate_multiple)
{
@@ -1107,8 +1174,7 @@ TEST_F(hmm, migrate_multiple)
ptr[i] = i;

/* Migrate memory to device. */
- ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer,
- npages);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, npages);

@@ -1116,7 +1182,13 @@ TEST_F(hmm, migrate_multiple)
for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
ASSERT_EQ(ptr[i], i);

- /* Fault pages back to system memory and check them. */
+ /* Migrate back to system memory and check them. */
+ if (hmm_is_coherent_type(variant->device_number)) {
+ ret = hmm_migrate_dev_to_sys(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+ }
+
for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
ASSERT_EQ(ptr[i], i);

@@ -1354,13 +1426,13 @@ TEST_F(hmm2, snapshot)

/* Page 5 will be migrated to device 0. */
buffer->ptr = p + 5 * self->page_size;
- ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_MIGRATE, buffer, 1);
+ ret = hmm_migrate_sys_to_dev(self->fd0, buffer, 1);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, 1);

/* Page 6 will be migrated to device 1. */
buffer->ptr = p + 6 * self->page_size;
- ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 1);
+ ret = hmm_migrate_sys_to_dev(self->fd1, buffer, 1);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, 1);

@@ -1377,9 +1449,16 @@ TEST_F(hmm2, snapshot)
ASSERT_EQ(m[2], HMM_DMIRROR_PROT_ZERO | HMM_DMIRROR_PROT_READ);
ASSERT_EQ(m[3], HMM_DMIRROR_PROT_READ);
ASSERT_EQ(m[4], HMM_DMIRROR_PROT_WRITE);
- ASSERT_EQ(m[5], HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL |
- HMM_DMIRROR_PROT_WRITE);
- ASSERT_EQ(m[6], HMM_DMIRROR_PROT_NONE);
+ if (!hmm_is_coherent_type(variant->device_number0)) {
+ ASSERT_EQ(m[5], HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL |
+ HMM_DMIRROR_PROT_WRITE);
+ ASSERT_EQ(m[6], HMM_DMIRROR_PROT_NONE);
+ } else {
+ ASSERT_EQ(m[5], HMM_DMIRROR_PROT_DEV_COHERENT_LOCAL |
+ HMM_DMIRROR_PROT_WRITE);
+ ASSERT_EQ(m[6], HMM_DMIRROR_PROT_DEV_COHERENT_REMOTE |
+ HMM_DMIRROR_PROT_WRITE);
+ }

hmm_buffer_free(buffer);
}
--
2.32.0


Subject: [PATCH v5 13/13] tools: add selftests to hmm for COW in device memory

The objective is to test the device migration mechanism for pages
marked as COW, for both private and coherent device types. In the case
of writing to COW private page(s), a page fault will migrate the pages
back to system memory first. Then, these pages will be duplicated. In
the case of the COW device coherent type, pages are duplicated directly
from device memory.

Signed-off-by: Alex Sierra <[email protected]>
Acked-by: Felix Kuehling <[email protected]>
---
tools/testing/selftests/vm/hmm-tests.c | 80 ++++++++++++++++++++++++++
1 file changed, 80 insertions(+)

diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
index 3295c8bf6c63..2da9d5baf339 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -1869,4 +1869,84 @@ TEST_F(hmm, hmm_gup_test)
close(gup_fd);
hmm_buffer_free(buffer);
}
+
+/*
+ * Test copy-on-write in device pages.
+ * In case of writing to COW private page(s), a page fault will migrate pages
+ * back to system memory first. Then, these pages will be duplicated. In case
+ * of COW device coherent type, pages are duplicated directly from device
+ * memory.
+ */
+TEST_F(hmm, hmm_cow_in_device)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long i;
+ int *ptr;
+ int ret;
+ unsigned char *m;
+ pid_t pid;
+ int status;
+
+ npages = 4;
+ size = npages << self->page_shift;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = size;
+ buffer->mirror = malloc(size);
+ ASSERT_NE(buffer->mirror, NULL);
+
+ buffer->ptr = mmap(NULL, size,
+ PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS,
+ buffer->fd, 0);
+ ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+ /* Initialize buffer in system memory. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ptr[i] = i;
+
+ /* Migrate memory to device. */
+
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ pid = fork();
+ if (pid == -1)
+ ASSERT_EQ(pid, 0);
+ if (!pid) {
+		/* Child process waits for SIGTERM from the parent. */
+ while (1) {
+ }
+ perror("Should not reach this\n");
+ exit(0);
+ }
+	/* Parent process writes to COW page(s) and gets a
+ * new copy in system. In case of device private pages,
+ * this write causes a migration to system mem first.
+ */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ptr[i] = i;
+
+ /* Terminate child and wait */
+ EXPECT_EQ(0, kill(pid, SIGTERM));
+ EXPECT_EQ(pid, waitpid(pid, &status, 0));
+ EXPECT_NE(0, WIFSIGNALED(status));
+ EXPECT_EQ(SIGTERM, WTERMSIG(status));
+
+ /* Take snapshot to CPU pagetables */
+ ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_SNAPSHOT, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+ m = buffer->mirror;
+ for (i = 0; i < npages; i++)
+ ASSERT_EQ(HMM_DMIRROR_PROT_WRITE, m[i]);
+
+ hmm_buffer_free(buffer);
+}
TEST_HARNESS_MAIN
--
2.32.0


Subject: [PATCH v5 08/13] lib: test_hmm add module param for zone device type

In order to configure the device coherent type in test_hmm, two module
parameters must be passed, corresponding to the SP start addresses of
the two devices: spm_addr_dev0 and spm_addr_dev1. If no parameters are
passed, the private device type is configured.
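
For example, using the two fake SPM ranges at 0x100000000 and
0x140000000 described earlier in this series:

        modprobe test_hmm spm_addr_dev0=0x100000000 spm_addr_dev1=0x140000000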

Signed-off-by: Alex Sierra <[email protected]>
Acked-by: Felix Kuehling <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
---
lib/test_hmm.c | 73 ++++++++++++++++++++++++++++++++-------------
lib/test_hmm_uapi.h | 1 +
2 files changed, 53 insertions(+), 21 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 915ef6b5b0d4..afb30af9f3ff 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -37,6 +37,16 @@
#define DEVMEM_CHUNK_SIZE (256 * 1024 * 1024U)
#define DEVMEM_CHUNKS_RESERVE 16

+static unsigned long spm_addr_dev0;
+module_param(spm_addr_dev0, long, 0644);
+MODULE_PARM_DESC(spm_addr_dev0,
+ "Specify start address for SPM (special purpose memory) used for device 0. By setting this Coherent device type will be used. Make sure spm_addr_dev1 is set too. Minimum SPM size should be DEVMEM_CHUNK_SIZE.");
+
+static unsigned long spm_addr_dev1;
+module_param(spm_addr_dev1, long, 0644);
+MODULE_PARM_DESC(spm_addr_dev1,
+ "Specify start address for SPM (special purpose memory) used for device 1. By setting this Coherent device type will be used. Make sure spm_addr_dev0 is set too. Minimum SPM size should be DEVMEM_CHUNK_SIZE.");
+
static const struct dev_pagemap_ops dmirror_devmem_ops;
static const struct mmu_interval_notifier_ops dmirror_min_ops;
static dev_t dmirror_dev;
@@ -455,28 +465,44 @@ static int dmirror_write(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd)
return ret;
}

-static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
+static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
struct page **ppage)
{
struct dmirror_chunk *devmem;
- struct resource *res;
+ struct resource *res = NULL;
unsigned long pfn;
unsigned long pfn_first;
unsigned long pfn_last;
void *ptr;
+ int ret = -ENOMEM;

devmem = kzalloc(sizeof(*devmem), GFP_KERNEL);
if (!devmem)
- return false;
+ return ret;

- res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
- "hmm_dmirror");
- if (IS_ERR(res))
+ switch (mdevice->zone_device_type) {
+ case HMM_DMIRROR_MEMORY_DEVICE_PRIVATE:
+ res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
+ "hmm_dmirror");
+ if (IS_ERR_OR_NULL(res))
+ goto err_devmem;
+ devmem->pagemap.range.start = res->start;
+ devmem->pagemap.range.end = res->end;
+ devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+ break;
+ case HMM_DMIRROR_MEMORY_DEVICE_COHERENT:
+ devmem->pagemap.range.start = (MINOR(mdevice->cdevice.dev) - 2) ?
+ spm_addr_dev0 :
+ spm_addr_dev1;
+ devmem->pagemap.range.end = devmem->pagemap.range.start +
+ DEVMEM_CHUNK_SIZE - 1;
+ devmem->pagemap.type = MEMORY_DEVICE_COHERENT;
+ break;
+ default:
+ ret = -EINVAL;
goto err_devmem;
+ }

- devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
- devmem->pagemap.range.start = res->start;
- devmem->pagemap.range.end = res->end;
devmem->pagemap.nr_range = 1;
devmem->pagemap.ops = &dmirror_devmem_ops;
devmem->pagemap.owner = mdevice;
@@ -497,10 +523,14 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
mdevice->devmem_capacity = new_capacity;
mdevice->devmem_chunks = new_chunks;
}
-
ptr = memremap_pages(&devmem->pagemap, numa_node_id());
- if (IS_ERR(ptr))
+ if (IS_ERR_OR_NULL(ptr)) {
+ if (ptr)
+ ret = PTR_ERR(ptr);
+ else
+ ret = -EFAULT;
goto err_release;
+ }

devmem->mdevice = mdevice;
pfn_first = devmem->pagemap.range.start >> PAGE_SHIFT;
@@ -529,15 +559,17 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
}
spin_unlock(&mdevice->lock);

- return true;
+ return 0;

err_release:
mutex_unlock(&mdevice->devmem_lock);
- release_mem_region(devmem->pagemap.range.start, range_len(&devmem->pagemap.range));
+ if (res && devmem->pagemap.type == MEMORY_DEVICE_PRIVATE)
+ release_mem_region(devmem->pagemap.range.start,
+ range_len(&devmem->pagemap.range));
err_devmem:
kfree(devmem);

- return false;
+ return ret;
}

static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
@@ -562,7 +594,7 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
spin_unlock(&mdevice->lock);
} else {
spin_unlock(&mdevice->lock);
- if (!dmirror_allocate_chunk(mdevice, &dpage))
+ if (dmirror_allocate_chunk(mdevice, &dpage))
goto error;
}

@@ -1232,10 +1264,8 @@ static int dmirror_device_init(struct dmirror_device *mdevice, int id)
if (ret)
return ret;

- /* Build a list of free ZONE_DEVICE private struct pages */
- dmirror_allocate_chunk(mdevice, NULL);
-
- return 0;
+ /* Build a list of free ZONE_DEVICE struct pages */
+ return dmirror_allocate_chunk(mdevice, NULL);
}

static void dmirror_device_remove(struct dmirror_device *mdevice)
@@ -1248,8 +1278,9 @@ static void dmirror_device_remove(struct dmirror_device *mdevice)
mdevice->devmem_chunks[i];

memunmap_pages(&devmem->pagemap);
- release_mem_region(devmem->pagemap.range.start,
- range_len(&devmem->pagemap.range));
+ if (devmem->pagemap.type == MEMORY_DEVICE_PRIVATE)
+ release_mem_region(devmem->pagemap.range.start,
+ range_len(&devmem->pagemap.range));
kfree(devmem);
}
kfree(mdevice->devmem_chunks);
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index 0511af7464ee..f700da7807c1 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -66,6 +66,7 @@ enum {
enum {
/* 0 is reserved to catch uninitialized type fields */
HMM_DMIRROR_MEMORY_DEVICE_PRIVATE = 1,
+ HMM_DMIRROR_MEMORY_DEVICE_COHERENT,
};

#endif /* _LIB_TEST_HMM_UAPI_H */
--
2.32.0


Subject: [PATCH v5 05/13] mm/gup: migrate device coherent pages when pinning instead of failing

From: Alistair Popple <[email protected]>

Currently any attempts to pin a device coherent page will fail. This is
because device coherent pages need to be managed by a device driver, and
pinning them would prevent a driver from migrating them off the device.

However, this is no reason to fail pinning of these pages. They are
coherent and accessible from the CPU, so they can be migrated just like
ZONE_MOVABLE pages. So instead of failing all attempts to pin them,
first try migrating them out of ZONE_DEVICE.

Signed-off-by: Alistair Popple <[email protected]>
Acked-by: Felix Kuehling <[email protected]>
[hch: rebased to the split device memory checks,
moved migrate_device_page to migrate_device.c]
Signed-off-by: Christoph Hellwig <[email protected]>
---
mm/gup.c | 47 +++++++++++++++++++++++++++++++++++-----
mm/internal.h | 1 +
mm/migrate_device.c | 53 +++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 96 insertions(+), 5 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 48b45bcc8501..e6093c31f932 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1895,9 +1895,43 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
continue;
prev_folio = folio;

- if (folio_is_pinnable(folio))
+ /*
+ * Device private pages will get faulted in during gup so it
+ * shouldn't be possible to see one here.
+ */
+ if (WARN_ON_ONCE(folio_is_device_private(folio))) {
+ ret = -EFAULT;
+ goto unpin_pages;
+ }
+
+ /*
+ * Device coherent pages are managed by a driver and should not
+ * be pinned indefinitely as it prevents the driver moving the
+ * page. So when trying to pin with FOLL_LONGTERM instead try
+ * to migrate the page out of device memory.
+ */
+ if (folio_is_device_coherent(folio)) {
+ WARN_ON_ONCE(PageCompound(&folio->page));
+
+ /*
+ * Migration will fail if the page is pinned, so convert
+ * the pin on the source page to a normal reference.
+ */
+ if (gup_flags & FOLL_PIN) {
+ get_page(&folio->page);
+ unpin_user_page(&folio->page);
+ }
+
+ pages[i] = migrate_device_page(&folio->page, gup_flags);
+ if (!pages[i]) {
+ ret = -EBUSY;
+ goto unpin_pages;
+ }
continue;
+ }

+ if (folio_is_pinnable(folio))
+ continue;
/*
* Try to move out any movable page before pinning the range.
*/
@@ -1933,10 +1967,13 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
return nr_pages;

unpin_pages:
- if (gup_flags & FOLL_PIN) {
- unpin_user_pages(pages, nr_pages);
- } else {
- for (i = 0; i < nr_pages; i++)
+ for (i = 0; i < nr_pages; i++) {
+ if (!pages[i])
+ continue;
+
+ if (gup_flags & FOLL_PIN)
+ unpin_user_page(pages[i]);
+ else
put_page(pages[i]);
}

diff --git a/mm/internal.h b/mm/internal.h
index c0f8fbe0445b..eeab4ee7a4a3 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -853,6 +853,7 @@ int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
unsigned long addr, int page_nid, int *flags);

void free_zone_device_page(struct page *page);
+struct page *migrate_device_page(struct page *page, unsigned int gup_flags);

/*
* mm/gup.c
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index cf9668376c5a..5decd26dd551 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -794,3 +794,56 @@ void migrate_vma_finalize(struct migrate_vma *migrate)
}
}
EXPORT_SYMBOL(migrate_vma_finalize);
+
+/*
+ * Migrate a device coherent page back to normal memory. The caller should have
+ * a reference on page which will be copied to the new page if migration is
+ * successful or dropped on failure.
+ */
+struct page *migrate_device_page(struct page *page, unsigned int gup_flags)
+{
+ unsigned long src_pfn, dst_pfn = 0;
+ struct migrate_vma args;
+ struct page *dpage;
+
+ lock_page(page);
+ src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE;
+ args.src = &src_pfn;
+ args.dst = &dst_pfn;
+ args.cpages = 1;
+ args.npages = 1;
+ args.vma = NULL;
+ migrate_vma_setup(&args);
+ if (!(src_pfn & MIGRATE_PFN_MIGRATE))
+ return NULL;
+
+ dpage = alloc_pages(GFP_USER | __GFP_NOWARN, 0);
+
+ /*
+ * get/pin the new page now so we don't have to retry gup after
+ * migrating. We already have a reference so this should never fail.
+ */
+ if (dpage && WARN_ON_ONCE(!try_grab_page(dpage, gup_flags))) {
+ __free_pages(dpage, 0);
+ dpage = NULL;
+ }
+
+ if (dpage) {
+ lock_page(dpage);
+ dst_pfn = migrate_pfn(page_to_pfn(dpage));
+ }
+
+ migrate_vma_pages(&args);
+ if (src_pfn & MIGRATE_PFN_MIGRATE)
+ copy_highpage(dpage, page);
+ migrate_vma_finalize(&args);
+ if (dpage && !(src_pfn & MIGRATE_PFN_MIGRATE)) {
+ if (gup_flags & FOLL_PIN)
+ unpin_user_page(dpage);
+ else
+ put_page(dpage);
+ dpage = NULL;
+ }
+
+ return dpage;
+}
--
2.32.0


Subject: [PATCH v5 06/13] drm/amdkfd: add SPM support for SVM

When the CPU is connected through XGMI, it has coherent
access to the VRAM resource. In this case the resource
is taken directly from the device's gmc aperture base.
This resource is used along with the device type, which
can be DEVICE_PRIVATE or DEVICE_COHERENT, to create the
device page map region.
Also, the MIGRATE_VMA_SELECT_DEVICE_COHERENT flag is
selected for the coherent type case during migration to
the device.

Signed-off-by: Alex Sierra <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
---
drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 34 +++++++++++++++---------
1 file changed, 21 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 997650d597ec..39b8c4710caf 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -671,13 +671,15 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
migrate.vma = vma;
migrate.start = start;
migrate.end = end;
- migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
migrate.pgmap_owner = SVM_ADEV_PGMAP_OWNER(adev);
+ if (adev->gmc.xgmi.connected_to_cpu)
+ migrate.flags = MIGRATE_VMA_SELECT_DEVICE_COHERENT;
+ else
+ migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;

buf = kvcalloc(npages,
2 * sizeof(*migrate.src) + sizeof(uint64_t) + sizeof(dma_addr_t),
GFP_KERNEL);
-
if (!buf)
goto out;

@@ -947,7 +949,7 @@ int svm_migrate_init(struct amdgpu_device *adev)
{
struct kfd_dev *kfddev = adev->kfd.dev;
struct dev_pagemap *pgmap;
- struct resource *res;
+ struct resource *res = NULL;
unsigned long size;
void *r;

@@ -962,28 +964,34 @@ int svm_migrate_init(struct amdgpu_device *adev)
* should remove reserved size
*/
size = ALIGN(adev->gmc.real_vram_size, 2ULL << 20);
- res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
- if (IS_ERR(res))
- return -ENOMEM;
+ if (adev->gmc.xgmi.connected_to_cpu) {
+ pgmap->range.start = adev->gmc.aper_base;
+ pgmap->range.end = adev->gmc.aper_base + adev->gmc.aper_size - 1;
+ pgmap->type = MEMORY_DEVICE_COHERENT;
+ } else {
+ res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
+ if (IS_ERR(res))
+ return -ENOMEM;
+ pgmap->range.start = res->start;
+ pgmap->range.end = res->end;
+ pgmap->type = MEMORY_DEVICE_PRIVATE;
+ }

- pgmap->type = MEMORY_DEVICE_PRIVATE;
pgmap->nr_range = 1;
- pgmap->range.start = res->start;
- pgmap->range.end = res->end;
pgmap->ops = &svm_migrate_pgmap_ops;
pgmap->owner = SVM_ADEV_PGMAP_OWNER(adev);
- pgmap->flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
-
+ pgmap->flags = 0;
/* Device manager releases device-specific resources, memory region and
* pgmap when driver disconnects from device.
*/
r = devm_memremap_pages(adev->dev, pgmap);
if (IS_ERR(r)) {
pr_err("failed to register HMM device memory\n");
-
+ if (pgmap->type == MEMORY_DEVICE_PRIVATE)
+ devm_release_mem_region(adev->dev, res->start,
+ resource_size(res));
/* Disable SVM support capability */
pgmap->type = 0;
- devm_release_mem_region(adev->dev, res->start, resource_size(res));
return PTR_ERR(r);
}

--
2.32.0


Subject: [PATCH v5 11/13] tools: update test_hmm script to support SP config

Add two more parameters to set the spm_addr_dev0 and spm_addr_dev1
addresses. These parameters configure the start SP addresses for each
device in the test_hmm driver and, consequently, configure the zone
device type as coherent.

Signed-off-by: Alex Sierra <[email protected]>
Acked-by: Felix Kuehling <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
---
tools/testing/selftests/vm/test_hmm.sh | 24 +++++++++++++++++++++---
1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/vm/test_hmm.sh b/tools/testing/selftests/vm/test_hmm.sh
index 0647b525a625..539c9371e592 100755
--- a/tools/testing/selftests/vm/test_hmm.sh
+++ b/tools/testing/selftests/vm/test_hmm.sh
@@ -40,11 +40,26 @@ check_test_requirements()

load_driver()
{
- modprobe $DRIVER > /dev/null 2>&1
+ if [ $# -eq 0 ]; then
+ modprobe $DRIVER > /dev/null 2>&1
+ else
+ if [ $# -eq 2 ]; then
+ modprobe $DRIVER spm_addr_dev0=$1 spm_addr_dev1=$2 \
+ > /dev/null 2>&1
+ else
+ echo "Missing module parameters. Make sure pass"\
+ "spm_addr_dev0 and spm_addr_dev1"
+ usage
+ fi
+ fi
if [ $? == 0 ]; then
major=$(awk "\$2==\"HMM_DMIRROR\" {print \$1}" /proc/devices)
mknod /dev/hmm_dmirror0 c $major 0
mknod /dev/hmm_dmirror1 c $major 1
+ if [ $# -eq 2 ]; then
+ mknod /dev/hmm_dmirror2 c $major 2
+ mknod /dev/hmm_dmirror3 c $major 3
+ fi
fi
}

@@ -58,7 +73,7 @@ run_smoke()
{
echo "Running smoke test. Note, this test provides basic coverage."

- load_driver
+ load_driver $1 $2
$(dirname "${BASH_SOURCE[0]}")/hmm-tests
unload_driver
}
@@ -75,6 +90,9 @@ usage()
echo "# Smoke testing"
echo "./${TEST_NAME}.sh smoke"
echo
+ echo "# Smoke testing with SPM enabled"
+ echo "./${TEST_NAME}.sh smoke <spm_addr_dev0> <spm_addr_dev1>"
+ echo
exit 0
}

@@ -84,7 +102,7 @@ function run_test()
usage
else
if [ "$1" = "smoke" ]; then
- run_smoke
+ run_smoke $2 $3
else
usage
fi
--
2.32.0
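
For example, a smoke run with SPM enabled might look like this (the
addresses are illustrative and must match the SP ranges reserved on
the test machine):

./test_hmm.sh smoke 0x40000000000 0x40200000000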


Subject: [PATCH v5 04/13] mm: remove the vma check in migrate_vma_setup()

From: Alistair Popple <[email protected]>

migrate_vma_setup() checks that a valid vma is passed so that the page
tables can be walked to find the pfns associated with a given address
range. However, in some cases the pfns are already known, such as when
migrating device coherent pages during pin_user_pages(), meaning a
valid vma isn't required.

Signed-off-by: Alistair Popple <[email protected]>
Acked-by: Felix Kuehling <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
---
mm/migrate_device.c | 34 +++++++++++++++++-----------------
1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 18bc6483f63a..cf9668376c5a 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -486,24 +486,24 @@ int migrate_vma_setup(struct migrate_vma *args)

args->start &= PAGE_MASK;
args->end &= PAGE_MASK;
- if (!args->vma || is_vm_hugetlb_page(args->vma) ||
- (args->vma->vm_flags & VM_SPECIAL) || vma_is_dax(args->vma))
- return -EINVAL;
- if (nr_pages <= 0)
- return -EINVAL;
- if (args->start < args->vma->vm_start ||
- args->start >= args->vma->vm_end)
- return -EINVAL;
- if (args->end <= args->vma->vm_start || args->end > args->vma->vm_end)
- return -EINVAL;
if (!args->src || !args->dst)
return -EINVAL;
-
- memset(args->src, 0, sizeof(*args->src) * nr_pages);
- args->cpages = 0;
- args->npages = 0;
-
- migrate_vma_collect(args);
+ if (args->vma) {
+ if (is_vm_hugetlb_page(args->vma) ||
+ (args->vma->vm_flags & VM_SPECIAL) || vma_is_dax(args->vma))
+ return -EINVAL;
+ if (args->start < args->vma->vm_start ||
+ args->start >= args->vma->vm_end)
+ return -EINVAL;
+ if (args->end <= args->vma->vm_start ||
+ args->end > args->vma->vm_end)
+ return -EINVAL;
+ memset(args->src, 0, sizeof(*args->src) * nr_pages);
+ args->cpages = 0;
+ args->npages = 0;
+
+ migrate_vma_collect(args);
+ }

if (args->cpages)
migrate_vma_unmap(args);
@@ -685,7 +685,7 @@ void migrate_vma_pages(struct migrate_vma *migrate)
continue;
}

- if (!page) {
+ if (!page && migrate->vma) {
if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE))
continue;
if (!notified) {
--
2.32.0
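
With the vma check relaxed, a caller that already knows the source pfn
can use migrate_vma without a vma, roughly as follows (sketch only; it
mirrors the migrate_device_page() caller added earlier in the series,
which locks the source page first):

	unsigned long src_pfn = migrate_pfn(page_to_pfn(page)) |
				MIGRATE_PFN_MIGRATE;
	unsigned long dst_pfn = 0;
	struct migrate_vma args = {
		.src	= &src_pfn,
		.dst	= &dst_pfn,
		.cpages	= 1,
		.npages	= 1,
		.vma	= NULL,	/* pfns known, no page table walk */
	};

	migrate_vma_setup(&args);	/* skips migrate_vma_collect() */
	if (!(src_pfn & MIGRATE_PFN_MIGRATE))
		return -EBUSY;	/* page could not be unmapped */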


Subject: [PATCH v5 02/13] mm: handling Non-LRU pages returned by vm_normal_pages

With DEVICE_COHERENT, we'll soon have vm_normal_pages() return
device-managed anonymous pages that are not LRU pages. Although they
behave like normal pages for purposes of mapping in CPU page tables
and for COW, they do not support LRU lists, NUMA migration or THP.

We also introduced a FOLL_LRU flag that adds the same behaviour to
follow_page and related APIs, to allow callers to specify that they
expect to put pages on an LRU list.

Signed-off-by: Alex Sierra <[email protected]>
Acked-by: Felix Kuehling <[email protected]>
---
fs/proc/task_mmu.c | 2 +-
include/linux/mm.h | 3 ++-
mm/gup.c | 6 +++++-
mm/huge_memory.c | 2 +-
mm/khugepaged.c | 9 ++++++---
mm/ksm.c | 6 +++---
mm/madvise.c | 4 ++--
mm/memory.c | 9 ++++++++-
mm/mempolicy.c | 2 +-
mm/migrate.c | 4 ++--
mm/mlock.c | 2 +-
mm/mprotect.c | 2 +-
12 files changed, 33 insertions(+), 18 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 2d04e3470d4c..2dd8c8a66924 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1792,7 +1792,7 @@ static struct page *can_gather_numa_stats(pte_t pte, struct vm_area_struct *vma,
return NULL;

page = vm_normal_page(vma, addr, pte);
- if (!page)
+ if (!page || is_zone_device_page(page))
return NULL;

if (PageReserved(page))
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bc8f326be0ce..d3f43908ff8d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -601,7 +601,7 @@ struct vm_operations_struct {
#endif
/*
* Called by vm_normal_page() for special PTEs to find the
- * page for @addr. This is useful if the default behavior
+ * page for @addr. This is useful if the default behavior
* (using pte_page()) would not find the correct page.
*/
struct page *(*find_special_page)(struct vm_area_struct *vma,
@@ -2934,6 +2934,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
#define FOLL_NUMA 0x200 /* force NUMA hinting page fault */
#define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */
+#define FOLL_LRU 0x1000 /* return only LRU (anon or page cache) */
#define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */
#define FOLL_COW 0x4000 /* internal GUP flag */
#define FOLL_ANON 0x8000 /* don't do file mappings */
diff --git a/mm/gup.c b/mm/gup.c
index 551264407624..48b45bcc8501 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -532,7 +532,11 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
}

page = vm_normal_page(vma, address, pte);
- if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
+ if ((flags & FOLL_LRU) && ((page && is_zone_device_page(page)) ||
+ (!page && pte_devmap(pte)))) {
+ page = ERR_PTR(-EEXIST);
+ goto out;
+ } else if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
/*
* Only return device mapping pages in the FOLL_GET or FOLL_PIN
* case since they are only valid while holding the pgmap
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a77c78a2b6b5..48182c8fe151 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2906,7 +2906,7 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
}

/* FOLL_DUMP to ignore special (like zero) pages */
- page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
+ page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP | FOLL_LRU);

if (IS_ERR(page))
continue;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 16be62d493cd..671ac7800e53 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -618,7 +618,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
goto out;
}
page = vm_normal_page(vma, address, pteval);
- if (unlikely(!page)) {
+ if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
result = SCAN_PAGE_NULL;
goto out;
}
@@ -1267,7 +1267,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
writable = true;

page = vm_normal_page(vma, _address, pteval);
- if (unlikely(!page)) {
+ if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
result = SCAN_PAGE_NULL;
goto out_unmap;
}
@@ -1479,7 +1479,8 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
goto abort;

page = vm_normal_page(vma, addr, *pte);
-
+ if (WARN_ON_ONCE(page && is_zone_device_page(page)))
+ page = NULL;
/*
* Note that uprobe, debugger, or MAP_PRIVATE may change the
* page table, but the new page will not be a subpage of hpage.
@@ -1497,6 +1498,8 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
if (pte_none(*pte))
continue;
page = vm_normal_page(vma, addr, *pte);
+ if (WARN_ON_ONCE(page && is_zone_device_page(page)))
+ goto abort;
page_remove_rmap(page, vma, false);
}

diff --git a/mm/ksm.c b/mm/ksm.c
index 54f78c9eecae..400790128102 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -474,7 +474,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
do {
cond_resched();
page = follow_page(vma, addr,
- FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
+ FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE | FOLL_LRU);
if (IS_ERR_OR_NULL(page))
break;
if (PageKsm(page))
@@ -559,7 +559,7 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
if (!vma)
goto out;

- page = follow_page(vma, addr, FOLL_GET);
+ page = follow_page(vma, addr, FOLL_GET | FOLL_LRU);
if (IS_ERR_OR_NULL(page))
goto out;
if (PageAnon(page)) {
@@ -2307,7 +2307,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
while (ksm_scan.address < vma->vm_end) {
if (ksm_test_exit(mm))
break;
- *page = follow_page(vma, ksm_scan.address, FOLL_GET);
+ *page = follow_page(vma, ksm_scan.address, FOLL_GET | FOLL_LRU);
if (IS_ERR_OR_NULL(*page)) {
ksm_scan.address += PAGE_SIZE;
cond_resched();
diff --git a/mm/madvise.c b/mm/madvise.c
index d7b4f2602949..e5637181de1b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -421,7 +421,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
continue;

page = vm_normal_page(vma, addr, ptent);
- if (!page)
+ if (!page || is_zone_device_page(page))
continue;

/*
@@ -639,7 +639,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
}

page = vm_normal_page(vma, addr, ptent);
- if (!page)
+ if (!page || is_zone_device_page(page))
continue;

/*
diff --git a/mm/memory.c b/mm/memory.c
index 21dadf03f089..30ecbc715e60 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -624,6 +624,13 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
if (is_zero_pfn(pfn))
return NULL;
+/*
+ * NOTE: New users of ZONE_DEVICE will not set pte_devmap() and will
+ * have refcounts incremented on their struct pages when they are
+ * inserted into PTEs, thus they are safe to return here. Legacy
+ * ZONE_DEVICE pages that set pte_devmap() do not have refcounts.
+ * Example of legacy ZONE_DEVICE is MEMORY_DEVICE_FS_DAX type in
+ * pmem or virtio_fs drivers.
+ */
if (pte_devmap(pte))
return NULL;

print_bad_pte(vma, addr, pte, NULL);
@@ -4685,7 +4692,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
pte = pte_modify(old_pte, vma->vm_page_prot);

page = vm_normal_page(vma, vmf->address, pte);
- if (!page)
+ if (!page || is_zone_device_page(page))
goto out_map;

/* TODO: handle PTE-mapped THP */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d39b01fd52fe..abc26890fc95 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -523,7 +523,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
if (!pte_present(*pte))
continue;
page = vm_normal_page(vma, addr, *pte);
- if (!page)
+ if (!page || is_zone_device_page(page))
continue;
/*
* vm_normal_page() filters out zero pages, but there might
diff --git a/mm/migrate.c b/mm/migrate.c
index e51588e95f57..f7d1b8312631 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1612,7 +1612,7 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
goto out;

/* FOLL_DUMP to ignore special (like zero) pages */
- page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
+ page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP | FOLL_LRU);

err = PTR_ERR(page);
if (IS_ERR(page))
@@ -1803,7 +1803,7 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
goto set_status;

/* FOLL_DUMP to ignore special (like zero) pages */
- page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
+ page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP | FOLL_LRU);

err = PTR_ERR(page);
if (IS_ERR(page))
diff --git a/mm/mlock.c b/mm/mlock.c
index 716caf851043..b14e929084cc 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -333,7 +333,7 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
if (!pte_present(*pte))
continue;
page = vm_normal_page(vma, addr, *pte);
- if (!page)
+ if (!page || is_zone_device_page(page))
continue;
if (PageTransCompound(page))
continue;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ba5592655ee3..e034aae2a98b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -95,7 +95,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
continue;

page = vm_normal_page(vma, addr, oldpte);
- if (!page || PageKsm(page))
+ if (!page || is_zone_device_page(page) || PageKsm(page))
continue;

/* Also skip shared copy-on-write pages */
--
2.32.0
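
To illustrate the intended FOLL_LRU calling convention, a scan loop
that needs to put pages on an LRU list would do roughly the following
(sketch only):

	page = follow_page(vma, addr, FOLL_GET | FOLL_LRU);
	if (IS_ERR_OR_NULL(page))
		continue;	/* ERR_PTR(-EEXIST) for device pages */

	/* page is anon or page cache here; LRU handling is safe */
	...
	put_page(page);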


2022-06-08 08:46:32

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v5 02/13] mm: handling Non-LRU pages returned by vm_normal_pages


I can't see any issues with this now so:

Reviewed-by: Alistair Popple <[email protected]>

Alex Sierra <[email protected]> writes:

> With DEVICE_COHERENT, we'll soon have vm_normal_pages() return
> device-managed anonymous pages that are not LRU pages. Although they
> behave like normal pages for purposes of mapping in CPU page tables
> and for COW, they do not support LRU lists, NUMA migration or THP.
>
> We also introduced a FOLL_LRU flag that adds the same behaviour to
> follow_page and related APIs, to allow callers to specify that they
> expect to put pages on an LRU list.
>
> Signed-off-by: Alex Sierra <[email protected]>
> Acked-by: Felix Kuehling <[email protected]>
> [...]

2022-06-17 02:22:22

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v5 00/13] Add MEMORY_DEVICE_COHERENT for coherent device memory mapping

On Tue, 31 May 2022 15:00:28 -0500 Alex Sierra <[email protected]> wrote:

> This is our MEMORY_DEVICE_COHERENT patch series rebased and updated
> for current 5.18.0

I plan to move this series into the non-rebasing mm-stable branch in a
few days. Unless sternly told not to do so!

2022-06-17 07:59:41

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v5 00/13] Add MEMORY_DEVICE_COHERENT for coherent device memory mapping

On 17.06.22 04:19, Andrew Morton wrote:
> On Tue, 31 May 2022 15:00:28 -0500 Alex Sierra <[email protected]> wrote:
>
>> This is our MEMORY_DEVICE_COHERENT patch series rebased and updated
>> for current 5.18.0
>
> I plan to move this series into the non-rebasing mm-stable branch in a
> few days. Unless sternly told not to do so!
>

I want to double-check some things regarding PageAnonExclusive
interaction. I'm busy, but I'll try prioritizing it.

--
Thanks,

David / dhildenb

2022-06-17 09:43:32

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v5 01/13] mm: add zone device coherent type memory support

On 31.05.22 22:00, Alex Sierra wrote:
> Device memory that is cache coherent from device and CPU point of view.
> This is used on platforms that have an advanced system bus (like CAPI
> or CXL). Any page of a process can be migrated to such memory. However,
> no one should be allowed to pin such memory so that it can always be
> evicted.
>
> Signed-off-by: Alex Sierra <[email protected]>
> Acked-by: Felix Kuehling <[email protected]>
> Reviewed-by: Alistair Popple <[email protected]>
> [hch: rebased ontop of the refcount changes,
> removed is_dev_private_or_coherent_page]
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> include/linux/memremap.h | 19 +++++++++++++++++++
> mm/memcontrol.c | 7 ++++---
> mm/memory-failure.c | 8 ++++++--
> mm/memremap.c | 10 ++++++++++
> mm/migrate_device.c | 16 +++++++---------
> mm/rmap.c | 5 +++--
> 6 files changed, 49 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 8af304f6b504..9f752ebed613 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -41,6 +41,13 @@ struct vmem_altmap {
> * A more complete discussion of unaddressable memory may be found in
> * include/linux/hmm.h and Documentation/vm/hmm.rst.
> *
> + * MEMORY_DEVICE_COHERENT:
> + * Device memory that is cache coherent from device and CPU point of view. This
> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
> + * type. Any page of a process can be migrated to such memory. However no one

Any page might not be right, I'm pretty sure. ... just thinking about special pages
like vdso, shared zeropage, ... pinned pages ...

> + * should be allowed to pin such memory so that it can always be evicted.
> + *
> * MEMORY_DEVICE_FS_DAX:
> * Host memory that has similar access semantics as System RAM i.e. DMA
> * coherent and supports page pinning. In support of coordinating page
> @@ -61,6 +68,7 @@ struct vmem_altmap {
> enum memory_type {
> /* 0 is reserved to catch uninitialized type fields */
> MEMORY_DEVICE_PRIVATE = 1,
> + MEMORY_DEVICE_COHERENT,
> MEMORY_DEVICE_FS_DAX,
> MEMORY_DEVICE_GENERIC,
> MEMORY_DEVICE_PCI_P2PDMA,
> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct folio *folio)

In general, this LGTM, and it should be correct with PageAnonExclusive I think.


However, where exactly is pinning forbidden?

--
Thanks,

David / dhildenb

2022-06-17 10:05:35

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v5 02/13] mm: handling Non-LRU pages returned by vm_normal_pages

On 31.05.22 22:00, Alex Sierra wrote:
> With DEVICE_COHERENT, we'll soon have vm_normal_pages() return
> device-managed anonymous pages that are not LRU pages. Although they
> behave like normal pages for purposes of mapping in CPU page tables
> and for COW, they do not support LRU lists, NUMA migration or THP.
>
> We also introduced a FOLL_LRU flag that adds the same behaviour to
> follow_page and related APIs, to allow callers to specify that they
> expect to put pages on an LRU list.
>
> Signed-off-by: Alex Sierra <[email protected]>
> Acked-by: Felix Kuehling <[email protected]>
> [...]
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index bc8f326be0ce..d3f43908ff8d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -601,7 +601,7 @@ struct vm_operations_struct {
> #endif
> /*
> * Called by vm_normal_page() for special PTEs to find the
> - * page for @addr. This is useful if the default behavior
> + * page for @addr. This is useful if the default behavior
> * (using pte_page()) would not find the correct page.
> */
> struct page *(*find_special_page)(struct vm_area_struct *vma,
> @@ -2934,6 +2934,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
> #define FOLL_NUMA 0x200 /* force NUMA hinting page fault */
> #define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
> #define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */
> +#define FOLL_LRU 0x1000 /* return only LRU (anon or page cache) */

Does that statement hold for special pages like the shared zeropage?

Also, this flag is only valid for in-kernel follow_page() but not for
the ordinary GUP interfaces. What are the semantics there? Is it fenced?


I really wonder if you should simply similarly teach the handful of
users of follow_page() to just special case these pages ... sounds
cleaner to me then adding flags with unclear semantics. Alternatively,
properly document what that flag is actually doing and where it applies.


I know, there was discussion on ... sorry for jumping in now, but this
doesn't look clean to me yet.

--
Thanks,

David / dhildenb

Subject: Re: [PATCH v5 01/13] mm: add zone device coherent type memory support


On 6/17/2022 4:40 AM, David Hildenbrand wrote:
> On 31.05.22 22:00, Alex Sierra wrote:
>> Device memory that is cache coherent from device and CPU point of view.
>> This is used on platforms that have an advanced system bus (like CAPI
>> or CXL). Any page of a process can be migrated to such memory. However,
>> no one should be allowed to pin such memory so that it can always be
>> evicted.
>>
>> Signed-off-by: Alex Sierra <[email protected]>
>> Acked-by: Felix Kuehling <[email protected]>
>> Reviewed-by: Alistair Popple <[email protected]>
>> [hch: rebased ontop of the refcount changes,
>> removed is_dev_private_or_coherent_page]
>> Signed-off-by: Christoph Hellwig <[email protected]>
>> [...]
>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>> * A more complete discussion of unaddressable memory may be found in
>> * include/linux/hmm.h and Documentation/vm/hmm.rst.
>> *
>> + * MEMORY_DEVICE_COHERENT:
>> + * Device memory that is cache coherent from device and CPU point of view. This
>> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
>> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
>> + * type. Any page of a process can be migrated to such memory. However no one
> Any page might not be right, I'm pretty sure. ... just thinking about special pages
> like vdso, shared zeropage, ... pinned pages ...

Hi David,

Yes, I think you're right. This type does not cover all special pages. 
I need to correct that in the cover letter.
Pinned pages are allowed as long as they're not long term pinned.

Regards,
Alex Sierra

>
>> [...]
> In general, this LGTM, and it should be correct with PageAnonExclusive I think.
>
>
> However, where exactly is pinning forbidden?

Long-term pinning is forbidden since it would interfere with the device
memory manager owning the
device-coherent pages (e.g. evictions in TTM). However, normal pinning
is allowed on this device type.

Regards,
Alex Sierra

>
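
In code terms the distinction is roughly the following (sketch, flags
illustrative):

	/* Short-term pin: allowed, the page may stay device-coherent. */
	pin_user_pages(addr, 1, FOLL_WRITE, &page, NULL);

	/*
	 * Long-term pin: also succeeds, but the page is first migrated
	 * to system memory so the device can keep evicting freely.
	 */
	pin_user_pages(addr, 1, FOLL_WRITE | FOLL_LONGTERM, &page, NULL);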

2022-06-17 17:36:46

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v5 01/13] mm: add zone device coherent type memory support

On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
>
> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>> On 31.05.22 22:00, Alex Sierra wrote:
>>> Device memory that is cache coherent from device and CPU point of view.
>>> This is used on platforms that have an advanced system bus (like CAPI
>>> or CXL). Any page of a process can be migrated to such memory. However,
>>> no one should be allowed to pin such memory so that it can always be
>>> evicted.
>>>
>>> Signed-off-by: Alex Sierra <[email protected]>
>>> Acked-by: Felix Kuehling <[email protected]>
>>> Reviewed-by: Alistair Popple <[email protected]>
>>> [hch: rebased ontop of the refcount changes,
>>> removed is_dev_private_or_coherent_page]
>>> Signed-off-by: Christoph Hellwig <[email protected]>
>>> [...]
>>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>>> * A more complete discussion of unaddressable memory may be found in
>>> * include/linux/hmm.h and Documentation/vm/hmm.rst.
>>> *
>>> + * MEMORY_DEVICE_COHERENT:
>>> + * Device memory that is cache coherent from device and CPU point of view. This
>>> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
>>> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
>>> + * type. Any page of a process can be migrated to such memory. However no one
>> Any page might not be right, I'm pretty sure. ... just thinking about special pages
>> like vdso, shared zeropage, ... pinned pages ...
>

Well, you cannot migrate long term pages, that's what I meant :)

>>
>>> [...]
>> In general, this LGTM, and it should be correct with PageAnonExclusive I think.
>>
>>
>> However, where exactly is pinning forbidden?
>
> Long-term pinning is forbidden since it would interfere with the device
> memory manager owning the
> device-coherent pages (e.g. evictions in TTM). However, normal pinning
> is allowed on this device type.

I don't see updates to folio_is_pinnable() in this patch.

So wouldn't try_grab_folio() simply pin these pages? What am I missing?

--
Thanks,

David / dhildenb

Subject: Re: [PATCH v5 01/13] mm: add zone device coherent type memory support


On 6/17/2022 12:33 PM, David Hildenbrand wrote:
> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>>> On 31.05.22 22:00, Alex Sierra wrote:
>>>> Device memory that is cache coherent from device and CPU point of view.
>>>> This is used on platforms that have an advanced system bus (like CAPI
>>>> or CXL). Any page of a process can be migrated to such memory. However,
>>>> no one should be allowed to pin such memory so that it can always be
>>>> evicted.
>>>>
>>>> Signed-off-by: Alex Sierra <[email protected]>
>>>> Acked-by: Felix Kuehling <[email protected]>
>>>> Reviewed-by: Alistair Popple <[email protected]>
>>>> [hch: rebased ontop of the refcount changes,
>>>> removed is_dev_private_or_coherent_page]
>>>> Signed-off-by: Christoph Hellwig <[email protected]>
>>>> [...]
>>>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>>>> * A more complete discussion of unaddressable memory may be found in
>>>> * include/linux/hmm.h and Documentation/vm/hmm.rst.
>>>> *
>>>> + * MEMORY_DEVICE_COHERENT:
>>>> + * Device memory that is cache coherent from device and CPU point of view. This
>>>> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
>>>> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
>>>> + * type. Any page of a process can be migrated to such memory. However no one
>>> Any page might not be right, I'm pretty sure. ... just thinking about special pages
>>> like vdso, shared zeropage, ... pinned pages ...
> Well, you cannot migrate long term pages, that's what I meant :)
>
>>>> [...]
>>> In general, this LGTM, and it should be correct with PageAnonExclusive I think.
>>>
>>>
>>> However, where exactly is pinning forbidden?
>> Long-term pinning is forbidden since it would interfere with the device
>> memory manager owning the
>> device-coherent pages (e.g. evictions in TTM). However, normal pinning
>> is allowed on this device type.
> I don't see updates to folio_is_pinnable() in this patch.
Device coherent type pages should return true here, as they are pinnable
pages.
>
> So wouldn't try_grab_folio() simply pin these pages? What am I missing?

As far as I understand, this returns NULL for long-term pinned pages.
Otherwise they get their refcount incremented.

Regards,
Alex Sierra

>

2022-06-17 21:31:08

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v5 01/13] mm: add zone device coherent type memory support

On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote:
>
> On 6/17/2022 12:33 PM, David Hildenbrand wrote:
>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>>>> On 31.05.22 22:00, Alex Sierra wrote:
>>>>> Device memory that is cache coherent from device and CPU point of view.
>>>>> This is used on platforms that have an advanced system bus (like CAPI
>>>>> or CXL). Any page of a process can be migrated to such memory. However,
>>>>> no one should be allowed to pin such memory so that it can always be
>>>>> evicted.
>>>>>
>>>>> Signed-off-by: Alex Sierra <[email protected]>
>>>>> Acked-by: Felix Kuehling <[email protected]>
>>>>> Reviewed-by: Alistair Popple <[email protected]>
>>>>> [hch: rebased ontop of the refcount changes,
>>>>> removed is_dev_private_or_coherent_page]
>>>>> Signed-off-by: Christoph Hellwig <[email protected]>
>>>>> [...]
>>>>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>>>>> * A more complete discussion of unaddressable memory may be found in
>>>>> * include/linux/hmm.h and Documentation/vm/hmm.rst.
>>>>> *
>>>>> + * MEMORY_DEVICE_COHERENT:
>>>>> + * Device memory that is cache coherent from device and CPU point of view. This
>>>>> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
>>>>> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
>>>>> + * type. Any page of a process can be migrated to such memory. However no one
>>>> Any page might not be right, I'm pretty sure. ... just thinking about special pages
>>>> like vdso, shared zeropage, ... pinned pages ...
>> Well, you cannot migrate long term pages, that's what I meant :)
>>
>>>>> [...]
>>>> In general, this LGTM, and it should be correct with PageAnonExclusive I think.
>>>>
>>>>
>>>> However, where exactly is pinning forbidden?
>>> Long-term pinning is forbidden since it would interfere with the device
>>> memory manager owning the
>>> device-coherent pages (e.g. evictions in TTM). However, normal pinning
>>> is allowed on this device type.
>> I don't see updates to folio_is_pinnable() in this patch.
> Device coherent type pages should return true here, as they are pinnable
> pages.

That function is only called for long-term pinnings in try_grab_folio().

>>
>> So wouldn't try_grab_folio() simply pin these pages? What am I missing?
>
> As far as I understand, this returns NULL for long-term pinned pages.
> Otherwise they get their refcount incremented.

I don't follow.

You're saying

a) folio_is_pinnable() returns true for device coherent pages

and that

b) device coherent pages don't get long-term pinned


Yet, the code says

struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
{
if (flags & FOLL_GET)
return try_get_folio(page, refs);
else if (flags & FOLL_PIN) {
struct folio *folio;

/*
* Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
* right zone, so fail and let the caller fall back to the slow
* path.
*/
if (unlikely((flags & FOLL_LONGTERM) &&
!is_pinnable_page(page)))
return NULL;
...
return folio;
}
}


What prevents these pages from getting long-term pinned as stated in this patch?

I am probably missing something important.

--
Thanks,

David / dhildenb
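
For reference, the flow the series relies on is roughly the following
(sketch of the 5.18 slow path; the long-term case is handled after the
pages are grabbed, not in try_grab_folio() itself):

	pin_user_pages()			/* FOLL_PIN [| FOLL_LONGTERM] */
	  __gup_longterm_locked()
	    __get_user_pages_locked()		/* coherent pages get pinned */
	    check_and_migrate_movable_pages()	/* then migrated and re-pinned
						   if FOLL_LONGTERM was set */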

2022-06-18 09:51:59

by Oded Gabbay

[permalink] [raw]
Subject: Re: [PATCH v5 01/13] mm: add zone device coherent type memory support

On Fri, Jun 17, 2022 at 8:20 PM Sierra Guiza, Alejandro (Alex)
<[email protected]> wrote:
>
>
> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
> > On 31.05.22 22:00, Alex Sierra wrote:
> >> Device memory that is cache coherent from device and CPU point of view.
> >> This is used on platforms that have an advanced system bus (like CAPI
> >> or CXL). Any page of a process can be migrated to such memory. However,
> >> no one should be allowed to pin such memory so that it can always be
> >> evicted.
> >>
> >> Signed-off-by: Alex Sierra <[email protected]>
> >> Acked-by: Felix Kuehling <[email protected]>
> >> Reviewed-by: Alistair Popple <[email protected]>
> >> [hch: rebased ontop of the refcount changes,
> >> removed is_dev_private_or_coherent_page]
> >> Signed-off-by: Christoph Hellwig <[email protected]>
> >> ---
> >> include/linux/memremap.h | 19 +++++++++++++++++++
> >> mm/memcontrol.c | 7 ++++---
> >> mm/memory-failure.c | 8 ++++++--
> >> mm/memremap.c | 10 ++++++++++
> >> mm/migrate_device.c | 16 +++++++---------
> >> mm/rmap.c | 5 +++--
> >> 6 files changed, 49 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> >> index 8af304f6b504..9f752ebed613 100644
> >> --- a/include/linux/memremap.h
> >> +++ b/include/linux/memremap.h
> >> @@ -41,6 +41,13 @@ struct vmem_altmap {
> >> * A more complete discussion of unaddressable memory may be found in
> >> * include/linux/hmm.h and Documentation/vm/hmm.rst.
> >> *
> >> + * MEMORY_DEVICE_COHERENT:
> >> + * Device memory that is cache coherent from device and CPU point of view. This
> >> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
> >> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
> >> + * type. Any page of a process can be migrated to such memory. However no one
> > Any page might not be right, I'm pretty sure. ... just thinking about special pages
> > like vdso, shared zeropage, ... pinned pages ...
>
> Hi David,
>
> Yes, I think you're right. This type does not cover all special pages.
> I need to correct that on the cover letter.
> Pinned pages are allowed as long as they're not long term pinned.
>
> Regards,
> Alex Sierra

What if I want to hotplug this device's coherent memory, but I do
*not* want the OS to migrate any page to it?
I want to fully control what resides on this memory, as I consider
this memory "expensive", i.e. I don't have a lot of it, I want to use
it for specific purposes, and I don't want the OS to start using it
when there is some memory pressure in the system.

Oded

>
> >
> >> [...]
> > In general, this LGTM, and it should be correct with PageAnonExclusive I think.
> >
> >
> > However, where exactly is pinning forbidden?
>
> Long-term pinning is forbidden since it would interfere with the device
> memory manager owning the
> device-coherent pages (e.g. evictions in TTM). However, normal pinning
> is allowed on this device type.
>
> Regards,
> Alex Sierra
>
> >

2022-06-20 00:35:10

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v5 01/13] mm: add zone device coherent type memory support


Oded Gabbay <[email protected]> writes:

> On Fri, Jun 17, 2022 at 8:20 PM Sierra Guiza, Alejandro (Alex)
> <[email protected]> wrote:
>>
>>
>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>> > On 31.05.22 22:00, Alex Sierra wrote:
>> >> Device memory that is cache coherent from device and CPU point of view.
>> >> This is used on platforms that have an advanced system bus (like CAPI
>> >> or CXL). Any page of a process can be migrated to such memory. However,
>> >> no one should be allowed to pin such memory so that it can always be
>> >> evicted.
>> >>
>> >> Signed-off-by: Alex Sierra <[email protected]>
>> >> Acked-by: Felix Kuehling <[email protected]>
>> >> Reviewed-by: Alistair Popple <[email protected]>
>> >> [hch: rebased ontop of the refcount changes,
>> >> removed is_dev_private_or_coherent_page]
>> >> Signed-off-by: Christoph Hellwig <[email protected]>
>> >> [...]
>> >> @@ -41,6 +41,13 @@ struct vmem_altmap {
>> >> * A more complete discussion of unaddressable memory may be found in
>> >> * include/linux/hmm.h and Documentation/vm/hmm.rst.
>> >> *
>> >> + * MEMORY_DEVICE_COHERENT:
>> >> + * Device memory that is cache coherent from device and CPU point of view. This
>> >> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
>> >> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
>> >> + * type. Any page of a process can be migrated to such memory. However no one
>> > Any page might not be right, I'm pretty sure. ... just thinking about special pages
>> > like vdso, shared zeropage, ... pinned pages ...
>>
>> Hi David,
>>
>> Yes, I think you're right. This type does not cover all special pages.
>> I need to correct that on the cover letter.
>> Pinned pages are allowed as long as they're not long term pinned.
>>
>> Regards,
>> Alex Sierra
>
> What if I want to hotplug this device's coherent memory, but I do
> *not* want the OS
> to migrate any page to it ?
> I want to fully-control what resides on this memory, as I can consider
> this memory
> "expensive". i.e. I don't have a lot of it, I want to use it for
> specific purposes and
> I don't want the OS to start using it when there is some memory pressure in
> the system.

This is exactly what MEMORY_DEVICE_COHERENT is for. Device coherent
pages are only allocated by a device driver and exposed to user-space by
a driver migrating pages to them with migrate_vma. The OS can't just
start using them due to memory pressure for example.

- Alistair

> Oded
>
>>
>> >
>> >> [...]
>> > In general, this LGTM, and it should be correct with PageAnonExclusive I think.
>> >
>> >
>> > However, where exactly is pinning forbidden?
>>
>> Long-term pinning is forbidden since it would interfere with the device
>> memory manager owning the
>> device-coherent pages (e.g. evictions in TTM). However, normal pinning
>> is allowed on this device type.
>>
>> Regards,
>> Alex Sierra
>>
>> >
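
A rough sketch of the driver-side flow Alistair describes (field
values are illustrative):

	/*
	 * Only a driver places pages in DEVICE_COHERENT memory, by
	 * explicitly migrating them there. The core mm never allocates
	 * from this zone on its own, e.g. under memory pressure.
	 */
	struct migrate_vma args = {
		.vma		= vma,
		.start		= start,
		.end		= end,
		.src		= src_pfns,
		.dst		= dst_pfns,
		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
		.pgmap_owner	= owner,	/* driver's pgmap owner */
	};

	if (migrate_vma_setup(&args))
		return -EFAULT;
	/* fill args.dst[] with migrate_pfn()s of driver-owned pages */
	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);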

2022-06-20 06:04:25

by Oded Gabbay

[permalink] [raw]
Subject: Re: [PATCH v5 01/13] mm: add zone device coherent type memory support

On Mon, Jun 20, 2022 at 3:33 AM Alistair Popple <[email protected]> wrote:
>
>
> Oded Gabbay <[email protected]> writes:
>
> > On Fri, Jun 17, 2022 at 8:20 PM Sierra Guiza, Alejandro (Alex)
> > <[email protected]> wrote:
> >>
> >>
> >> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
> >> > On 31.05.22 22:00, Alex Sierra wrote:
> >> >> Device memory that is cache coherent from device and CPU point of view.
> >> >> This is used on platforms that have an advanced system bus (like CAPI
> >> >> or CXL). Any page of a process can be migrated to such memory. However,
> >> >> no one should be allowed to pin such memory so that it can always be
> >> >> evicted.
> >> >>
> >> >> Signed-off-by: Alex Sierra <[email protected]>
> >> >> Acked-by: Felix Kuehling <[email protected]>
> >> >> Reviewed-by: Alistair Popple <[email protected]>
> >> >> [hch: rebased ontop of the refcount changes,
> >> >> removed is_dev_private_or_coherent_page]
> >> >> Signed-off-by: Christoph Hellwig <[email protected]>
> >> >> [...]
> >> >> @@ -41,6 +41,13 @@ struct vmem_altmap {
> >> >> * A more complete discussion of unaddressable memory may be found in
> >> >> * include/linux/hmm.h and Documentation/vm/hmm.rst.
> >> >> *
> >> >> + * MEMORY_DEVICE_COHERENT:
> >> >> + * Device memory that is cache coherent from device and CPU point of view. This
> >> >> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
> >> >> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
> >> >> + * type. Any page of a process can be migrated to such memory. However no one
> >> > Any page might not be right, I'm pretty sure. ... just thinking about special pages
> >> > like vdso, shared zeropage, ... pinned pages ...
> >>
> >> Hi David,
> >>
> >> Yes, I think you're right. This type does not cover all special pages.
> >> I need to correct that on the cover letter.
> >> Pinned pages are allowed as long as they're not long term pinned.
> >>
> >> Regards,
> >> Alex Sierra
> >
> > What if I want to hotplug this device's coherent memory, but I do *not*
> > want the OS to migrate any page to it? I want to fully control what
> > resides in this memory, as I can consider this memory "expensive", i.e.
> > I don't have a lot of it, I want to use it for specific purposes, and I
> > don't want the OS to start using it when there is some memory pressure
> > in the system.
>
> This is exactly what MEMORY_DEVICE_COHERENT is for. Device coherent
> pages are only allocated by a device driver and exposed to user-space by
> a driver migrating pages to them with migrate_vma. The OS can't just
> start using them due to memory pressure for example.
>
> - Alistair
Thanks for the explanation.

I guess the commit message confused me a bit, especially these two sentences:

"Any page of a process can be migrated to such memory. However no one should be
allowed to pin such memory so that it can always be evicted."

I read them as if the OS is free to choose which pages are migrated to this
memory, and anything is eligible for migration to that memory (and that's why
we also don't allow pinning memory there).

If we are not allowed to pin anything there, can the device driver decide to
disable any option for oversubscription of this memory area?

Let's assume the user uses this memory area for doing p2p with other CXL
devices. In that case, I wouldn't want the driver/OS to migrate pages in and
out of that memory...

So either I should let the user pin those pages, or prevent him from doing
(accidentally or not) oversubscription in this memory area.

wdyt?

> [snip]

2022-06-20 08:53:25

by Alistair Popple

Subject: Re: [PATCH v5 01/13] mm: add zone device coherent type memory support


Oded Gabbay <[email protected]> writes:

> [snip]
>
> Thanks for the explanation.
>
> I guess the commit message confused me a bit, especially these two sentences:
>
> "Any page of a process can be migrated to such memory. However no one should be
> allowed to pin such memory so that it can always be evicted."
>
> I read them as if the OS is free to choose which pages are migrated to this
> memory, and anything is eligible for migration to that memory (and that's
> why we also don't allow pinning memory there).
>
> If we are not allowed to pin anything there, can the device driver decide
> to disable any option for oversubscription of this memory area?

I'm not sure I follow your thinking on how oversubscription would work
here; however, all allocations are controlled by the driver. So if a
device's coherent memory is full, a driver would be unable to migrate
pages to that device until pages are freed by the OS due to being
unmapped, or until the driver evicts pages by migrating them back to
normal CPU memory.

Pinning of pages is allowed, and could prevent such migrations. However,
this patch series prevents device-coherent pages from being pinned
long-term (i.e. with FOLL_LONGTERM), so the driver should always be able
to evict pages eventually.
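As a sketch of that distinction (5.18-era GUP signatures; addr is assumed
to map a device-coherent page, the caller holds mmap_read_lock, and error
handling is elided):

	struct page *page;
	long ret;

	/* short-term pin: the device-coherent page is pinned in place */
	ret = pin_user_pages(addr, 1, FOLL_WRITE, &page, NULL);
	if (ret == 1)
		unpin_user_page(page);

	/*
	 * long-term pin: GUP first migrates the device-coherent page back
	 * to system RAM (patch 5/13) and pins the system-RAM copy instead,
	 * so the device memory stays evictable.
	 */
	ret = pin_user_pages(addr, 1, FOLL_WRITE | FOLL_LONGTERM, &page, NULL);
	if (ret == 1)
		unpin_user_page(page);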

> Let's assume the user uses this memory area for doing p2p with other CXL
> devices. In that case, I wouldn't want the driver/OS to migrate pages in
> and out of that memory...

The OS will not migrate pages in or out (although it may free them if no
longer required), but a driver might choose to. So at the moment it's
really up to the driver to implement what you want in this regard.

> So either I should let the user pin those pages, or prevent him from doing
> (accidentally or not) oversubscription in this memory area.

As noted above pages can be pinned, but not long-term.

- Alistair

> [snip]

2022-06-20 12:33:01

by Oded Gabbay

Subject: Re: [PATCH v5 01/13] mm: add zone device coherent type memory support

On Mon, Jun 20, 2022 at 11:50 AM Alistair Popple <[email protected]> wrote:
>
> [snip]
>
> > Let's assume the user uses this memory area for doing p2p with other CXL
> > devices. In that case, I wouldn't want the driver/OS to migrate pages in
> > and out of that memory...
>
> The OS will not migrate pages in or out (although it may free them if no
> longer required), but a driver might choose to. So at the moment it's
> really up to the driver to implement what you want in this regard.

I see.
In other words, we don't want to allow long-term pinning, but the driver
can decide not to evict pages out of that memory until they are freed.

Thanks,
Oded
> [snip]

2022-06-21 11:38:32

by Felix Kuehling

Subject: Re: [PATCH v5 01/13] mm: add zone device coherent type memory support


On 6/17/22 at 23:19, David Hildenbrand wrote:
> On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote:
>> On 6/17/2022 12:33 PM, David Hildenbrand wrote:
>>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
>>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>>>>> On 31.05.22 22:00, Alex Sierra wrote:
>>>>>> [snip]
>>>>>> + * type. Any page of a process can be migrated to such memory. However no one
>>>>> Any page might not be right, I'm pretty sure. ... just thinking about special pages
>>>>> like vdso, shared zeropage, ... pinned pages ...
>>> Well, you cannot migrate long term pages, that's what I meant :)
>>>
>>>>>> [snip]
>>>>> In general, this LGTM, and it should be correct with PageAnonExclusive I think.
>>>>>
>>>>>
>>>>> However, where exactly is pinning forbidden?
>>>> Long-term pinning is forbidden since it would interfere with the device
>>>> memory manager owning the
>>>> device-coherent pages (e.g. evictions in TTM). However, normal pinning
>>>> is allowed on this device type.
>>> I don't see updates to folio_is_pinnable() in this patch.
>> Device coherent type pages should return true here, as they are pinnable
>> pages.
> That function is only called for long-term pinnings in try_grab_folio().
>
>>> So wouldn't try_grab_folio() simply pin these pages? What am I missing?
>> As far as I understand, this returns NULL for long-term pinned pages.
>> Otherwise they get their refcount incremented.
> I don't follow.
>
> You're saying
>
> a) folio_is_pinnable() returns true for device coherent pages
>
> and that
>
> b) device coherent pages don't get long-term pinned
>
>
> Yet, the code says
>
> struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> {
> 	if (flags & FOLL_GET)
> 		return try_get_folio(page, refs);
> 	else if (flags & FOLL_PIN) {
> 		struct folio *folio;
>
> 		/*
> 		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
> 		 * right zone, so fail and let the caller fall back to the slow
> 		 * path.
> 		 */
> 		if (unlikely((flags & FOLL_LONGTERM) &&
> 			     !is_pinnable_page(page)))
> 			return NULL;
> 		...
> 		return folio;
> 	}
> }
>
>
> What prevents these pages from getting long-term pinned as stated in this patch?

Long-term pinning is handled by __gup_longterm_locked, which migrates
pages returned by __get_user_pages_locked that cannot be long-term
pinned. try_grab_folio is OK to grab the pages. Anything that can't be
long-term pinned will be migrated afterwards, and
__get_user_pages_locked will be retried. The migration of
DEVICE_COHERENT pages was implemented by Alistair in patch 5/13
("mm/gup: migrate device coherent pages when pinning instead of failing").

Regards,
  Felix


>
> I am probably missing something important.
>
P.S.: I'm on vacation and looking at a tiny screen. Hope I didn't miss
anything myself.

2022-06-21 11:38:34

by David Hildenbrand

Subject: Re: [PATCH v5 01/13] mm: add zone device coherent type memory support

On 21.06.22 13:25, Felix Kuehling wrote:
>
> [snip]
>
>> What prevents these pages from getting long-term pinned as stated in this patch?
>
> Long-term pinning is handled by __gup_longterm_locked, which migrates
> pages returned by __get_user_pages_locked that cannot be long-term
> pinned. try_grab_folio is OK to grab the pages. Anything that can't be
> long-term pinned will be migrated afterwards, and
> __get_user_pages_locked will be retried. The migration of
> DEVICE_COHERENT pages was implemented by Alistair in patch 5/13
> ("mm/gup: migrate device coherent pages when pinning instead of failing").

Thanks.

__gup_longterm_locked()->check_and_migrate_movable_pages()

Which checks folio_is_pinnable() and doesn't do anything if that returns true.

Sorry to be dense here, but I don't see how what's stated in this patch
works without adjusting folio_is_pinnable().

--
Thanks,

David / dhildenb

2022-06-21 12:23:40

by Alistair Popple

Subject: Re: [PATCH v5 01/13] mm: add zone device coherent type memory support


David Hildenbrand <[email protected]> writes:

> [snip]
>
> Thanks.
>
> __gup_longterm_locked()->check_and_migrate_movable_pages()
>
> Which checks folio_is_pinnable() and doesn't do anything if set.
>
> Sorry to be dense here, but I don't see how what's stated in this patch
> works without adjusting folio_is_pinnable().

Ugh, I think you might be right about try_grab_folio().

We didn't update folio_is_pinnable() to include device coherent pages
because device coherent pages are pinnable. It is really just
FOLL_LONGTERM that we want to prevent here.

For normal PUP that is done by my change in
check_and_migrate_movable_pages() which migrates pages being pinned with
FOLL_LONGTERM. But I think I incorrectly assumed we would take the
pte_devmap() path in gup_pte_range(), which we don't for coherent pages.
So I think the check in try_grab_folio() needs to be:

	if (unlikely((flags & FOLL_LONGTERM) &&
		     (!is_pinnable_page(page) || is_device_coherent_page(page))))

- Alistair

2022-06-21 12:26:02

by David Hildenbrand

Subject: Re: [PATCH v5 01/13] mm: add zone device coherent type memory support

On 21.06.22 13:55, Alistair Popple wrote:
> [snip]
>
> Ugh, I think you might be right about try_grab_folio().
>
> We didn't update folio_is_pinnable() to include device coherent pages
> because device coherent pages are pinnable. It is really just
> FOLL_LONGTERM that we want to prevent here.
>
> For normal PUP that is done by my change in
> check_and_migrate_movable_pages() which migrates pages being pinned with
> FOLL_LONGTERM. But I think I incorrectly assumed we would take the
> pte_devmap() path in gup_pte_range(), which we don't for coherent pages.
> So I think the check in try_grab_folio() needs to be:

I think I said it already (and I might be wrong without reading the
code), but folio_is_pinnable() is *only* called for long-term pinnings.

It should actually be called folio_is_longterm_pinnable().

That's where that check should go, no?
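For illustration, such a helper might look like this. This is a sketch of
the suggestion, not code from the series; folio_is_device_coherent() is
assumed as the folio counterpart of the is_device_coherent_page() helper
used in Alistair's proposed check:

	static inline bool folio_is_longterm_pinnable(struct folio *folio)
	{
		/* device-coherent pages must stay evictable by the driver */
		if (folio_is_device_coherent(folio))
			return false;

		/* existing movable-zone/CMA checks */
		return folio_is_pinnable(folio);
	}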

--
Thanks,

David / dhildenb

2022-06-21

by Sierra Guiza, Alejandro (Alex)

Subject: Re: [PATCH v5 01/13] mm: add zone device coherent type memory support


On 6/21/2022 7:25 AM, David Hildenbrand wrote:
> [snip]
> I think I said it already (and I might be wrong without reading the
> code), but folio_is_pinnable() is *only* called for long-term pinnings.
>
> It should actually be called folio_is_longterm_pinnable().
>
> That's where that check should go, no?

David, I think you're right. We didn't catch this since the LONGTERM gup
test we added to hmm-test only calls pin_user_pages. Apparently
try_grab_folio is called only from the fast callers (e.g.
pin_user_pages_fast/get_user_pages_fast). I have added a conditional
similar to what Alistair proposed to try_grab_folio, returning NULL on
LONGTERM && (device-coherent page || !folio_is_pinnable). A new gup test
with LONGTERM set that calls pin_user_pages_fast was also added.
Returning NULL under this condition does trigger the migration from
device to system memory.

Actually, I'm having a different problem with a call to PageAnonExclusive
from try_to_migrate_one during a page fault, in an HMM test that first
migrates pages to device private memory and then forks to mark these
pages as COW. Apparently it is hitting the first BUG,
VM_BUG_ON_PGFLAGS(!PageAnon(page), page).

Regards,
Alex Sierra

2022-06-21 16:17:16

by David Hildenbrand

Subject: Re: [PATCH v5 01/13] mm: add zone device coherent type memory support

On 21.06.22 18:08, Sierra Guiza, Alejandro (Alex) wrote:
> [snip]
>
> David, I think you're right. We didn't catch this since the LONGTERM gup
> test we added to hmm-test only calls pin_user_pages. Apparently
> try_grab_folio is called only from the fast callers (e.g.
> pin_user_pages_fast/get_user_pages_fast). I have added a conditional
> similar to what Alistair proposed to try_grab_folio, returning NULL on
> LONGTERM && (device-coherent page || !folio_is_pinnable). A new gup test
> with LONGTERM set that calls pin_user_pages_fast was also added.
> Returning NULL under this condition does trigger the migration from
> device to system memory.
>

Why can't coherent memory simply put its checks into
folio_is_pinnable()? I don't get why we have to do things differently
here.

> Actually, I'm having a different problem with a call to PageAnonExclusive
> from try_to_migrate_one during a page fault, in an HMM test that first
> migrates pages to device private memory and then forks to mark these
> pages as COW. Apparently it is hitting the first BUG,
> VM_BUG_ON_PGFLAGS(!PageAnon(page), page).

With or without this series? A backtrace would be great.

--
Thanks,

David / dhildenb

2022-06-22 00:55:49

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v5 01/13] mm: add zone device coherent type memory support


David Hildenbrand <[email protected]> writes:

> On 21.06.22 18:08, Sierra Guiza, Alejandro (Alex) wrote:
>>
>> On 6/21/2022 7:25 AM, David Hildenbrand wrote:
>>> On 21.06.22 13:55, Alistair Popple wrote:
>>>> [snip]
>>> I think I said it already (and I might be wrong without reading the
>>> code), but folio_is_pinnable() is *only* called for long-term pinnings.
>>>
>>> It should actually be called folio_is_longterm_pinnable().
>>>
>>> That's where that check should go, no?
>>
>> David, I think you're right. We didn't catch this since the LONGTERM gup
>> test we added to hmm-test only calls pin_user_pages. Apparently
>> try_grab_folio is only called from fast callers (e.g.
>> pin_user_pages_fast/get_user_pages_fast). I have added a conditional
>> similar to what Alistair proposed, returning NULL in try_grab_folio on
>> LONGTERM && (device-coherent page || !folio_is_pinnable). A new gup test
>> with LONGTERM set that calls pin_user_pages_fast was also added.
>> Returning NULL under this condition does cause the migration from
>> device to system memory.
>>
>
> Why can't coherent memory simply put its checks into
> folio_is_pinnable()? I don't get why we have to do things differently
> here.

I'd made the reasonable assumption that
folio_is_pinnable()/is_pinnable_page() were used to check whether the
folio/page is pinnable at all, regardless of FOLL_LONGTERM. Looking at
the code more closely, though, I see both are actually only used on
paths checking for FOLL_LONGTERM pinning.

So I agree - we should rename these to
folio_is_longterm_pinnable()/is_longterm_pinnable_page() and add the
check for coherent pages there. Thanks for pointing that out.
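
To make that concrete, here is a rough sketch of what I have in mind (a
sketch only, not the final patch; folio_is_device_coherent() stands in
for whatever helper this series adds for the new memory type):

	/*
	 * Sketch: make the FOLL_LONGTERM-only semantics explicit and
	 * reject device coherent folios, so that long-term pinning falls
	 * back to the slow path and migrates them to system memory first.
	 */
	static inline bool folio_is_longterm_pinnable(struct folio *folio)
	{
		/* Short-term pinning of device coherent pages stays allowed. */
		if (folio_is_device_coherent(folio))
			return false;

		/* The existing movable-zone/CMA checks remain unchanged. */
		return folio_is_pinnable(folio);
	}

That way the gup fast path (try_grab_folio()) and the slow path
(check_and_migrate_movable_pages()) take the same decision for device
coherent pages.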

- Alistair


Subject: Re: [PATCH v5 01/13] mm: add zone device coherent type memory support


On 6/21/2022 7:16 PM, Alistair Popple wrote:
> [snip]
> I'd made the reasonable assumption that
> folio_is_pinnable()/is_pinnable_page() were used to check whether the
> folio/page is pinnable at all, regardless of FOLL_LONGTERM. Looking at
> the code more closely, though, I see both are actually only used on
> paths checking for FOLL_LONGTERM pinning.
>
> So I agree - we should rename these to
> folio_is_longterm_pinnable()/is_longterm_pinnable_page() and add the
> check for coherent pages there. Thanks for pointing that out.
>
> - Alistair

Will do in the next patch series.

Regards,
Alex Sierra


Subject: Re: [PATCH v5 01/13] mm: add zone device coherent type memory support


On 6/21/2022 11:16 AM, David Hildenbrand wrote:
> [snip]
>
>> Actually, I'm having a different problem with a call to PageAnonExclusive
>> from try_to_migrate_one during a page fault, in an HMM test that first
>> migrates pages to device private memory and then forks so these pages are
>> marked for COW. Apparently it is hitting the first BUG,
>> VM_BUG_ON_PGFLAGS(!PageAnon(page), page).
> With or without this series? A backtrace would be great.

Here's the backtrace. This happens in an hmm-test added in this patch
series. However, I have tried to isolate this BUG by adding only the COW
test with private device memory. It shows up only with the following
sequence (sketched in code below): allocate anonymous memory -> migrate
it to private device memory -> fork -> try to access the parent's
anonymous memory (which is supposed to trigger a page fault and a
migration back to system memory). Just for the record, if the child is
terminated before the parent's memory is accessed, this problem is not
present.

patch name for this test: tools: add selftests to hmm for COW in device
memory
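
For reference, the failing sequence boils down to something like this
(a rough sketch only, error handling omitted; HMM_DMIRROR_MIGRATE and
struct hmm_dmirror_cmd are the existing names from lib/test_hmm_uapi.h,
and fd is an open dmirror device as in hmm-tests):

	#include <string.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#include "test_hmm_uapi.h"	/* from lib/, as hmm-tests includes it */

	static void trigger_bug(int fd, unsigned long npages, unsigned long size)
	{
		struct hmm_dmirror_cmd cmd = { 0 };
		char *buf;

		/* 1) Allocate and populate anonymous memory. */
		buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		memset(buf, 0xaa, size);

		/* 2) Migrate the pages to device private memory. */
		cmd.addr = (unsigned long)buf;
		cmd.ptr = (unsigned long)buf;
		cmd.npages = npages;
		ioctl(fd, HMM_DMIRROR_MIGRATE, &cmd);

		/* 3) Fork, so the device private pages become COW-shared. */
		if (fork() == 0) {
			pause();	/* child keeps the mapping alive */
			_exit(0);
		}

		/*
		 * 4) Parent access: page fault -> dmirror_devmem_fault() ->
		 *    migrate_vma_setup() -> try_to_migrate_one(), which is
		 *    where the crash below happens.
		 */
		buf[0] = 0x55;
	}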

[  528.727237] BUG: unable to handle page fault for address:
ffffea1fffffffc0
[  528.739585] #PF: supervisor read access in kernel mode
[  528.745324] #PF: error_code(0x0000) - not-present page
[  528.751062] PGD 44eaf2067 P4D 44eaf2067 PUD 0
[  528.756026] Oops: 0000 [#1] PREEMPT SMP NOPTI
[  528.760890] CPU: 120 PID: 18275 Comm: hmm-tests Not tainted
5.19.0-rc3-kfd-alex #257
[  528.769542] Hardware name: AMD Corporation BardPeak/BardPeak, BIOS
RTY1002BDS 09/17/2021
[  528.778579] RIP: 0010:try_to_migrate_one+0x21a/0x1000
[  528.784225] Code: f6 48 89 c8 48 2b 05 45 d1 6a 01 48 c1 f8 06 48 29
c3 48 8b 45 a8 48 c1 e3 06 48 01 cb f6 41 18 01 48 89 85 50 ff ff ff 74
0b <4c> 8b 33 49 c1 ee 11 41 83 e6 01 48 8b bd 48 ff ff ff e8 3f 99 02
[  528.805194] RSP: 0000:ffffc90003cdfaa0 EFLAGS: 00010202
[  528.811027] RAX: 00007ffff7ff4000 RBX: ffffea1fffffffc0 RCX:
ffffeaffffffffc0
[  528.818995] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
ffffc90003cdfaf8
[  528.826962] RBP: ffffc90003cdfb70 R08: 0000000000000000 R09:
0000000000000000
[  528.834930] R10: ffffc90003cdf910 R11: 0000000000000002 R12:
ffff888194450540
[  528.842899] R13: ffff888160d057c0 R14: 0000000000000000 R15:
03ffffffffffffff
[  528.850865] FS:  00007ffff7fdb740(0000) GS:ffff8883b0600000(0000)
knlGS:0000000000000000
[  528.859891] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  528.866308] CR2: ffffea1fffffffc0 CR3: 00000001562b4003 CR4:
0000000000770ee0
[  528.874275] PKRU: 55555554
[  528.877286] Call Trace:
[  528.880016]  <TASK>
[  528.882356]  ? lock_is_held_type+0xdf/0x130
[  528.887033]  rmap_walk_anon+0x167/0x410
[  528.891316]  try_to_migrate+0x90/0xd0
[  528.895405]  ? try_to_unmap_one+0xe10/0xe10
[  528.900074]  ? anon_vma_ctor+0x50/0x50
[  528.904260]  ? put_anon_vma+0x10/0x10
[  528.908347]  ? invalid_mkclean_vma+0x20/0x20
[  528.913114]  migrate_vma_setup+0x5f4/0x750
[  528.917691]  dmirror_devmem_fault+0x8c/0x250 [test_hmm]
[  528.923532]  do_swap_page+0xac0/0xe50
[  528.927623]  ? __lock_acquire+0x4b2/0x1ac0
[  528.932199]  __handle_mm_fault+0x949/0x1440
[  528.936876]  handle_mm_fault+0x13f/0x3e0
[  528.941256]  do_user_addr_fault+0x215/0x740
[  528.945928]  exc_page_fault+0x75/0x280
[  528.950115]  asm_exc_page_fault+0x27/0x30
[  528.954593] RIP: 0033:0x40366b
[  528.958001] Code: 00 48 89 85 d8 fe ff ff eb 2a 48 8b 85 d0 fe ff ff
48 8d 14 85 00 00 00 00 48 8b 85 d8 fe ff ff 48 01 d0 48 8b 95 d0 fe ff
ff <89> 10 48 83 85 d0 fe ff ff 01 48 8b 85 40 ff ff ff 48 c1 e8 02 48
[  528.978973] RSP: 002b:00007fffffffe280 EFLAGS: 00010206
[  528.984806] RAX: 00007ffff7ff4000 RBX: 0000000000000000 RCX:
0000000000000000
[  528.992774] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
00007ffff77ee968
[  529.000742] RBP: 00007fffffffe430 R08: 00007ffff7fdb740 R09:
0000000000000000
[  529.008709] R10: 00007ffff7fdba10 R11: 0000000000000246 R12:
0000000000400e30
[  529.016675] R13: 00007fffffffe630 R14: 0000000000000000 R15:
0000000000000000
[  529.024638]  </TASK>
[  529.027074] Modules linked in: test_hmm xt_conntrack xt_MASQUERADE
nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6
nf_defrag_ipv4 br_netfilter ip6table_filter ip6_tables iptable_filter
k10temp ip_tables x_tables i2c_piix4 [last unloaded: test_hmm]
[  529.053595] CR2: ffffea1fffffffc0
[  529.057296] ---[ end trace 0000000000000000 ]---
[  529.197816] RIP: 0010:try_to_migrate_one+0x21a/0x1000
[  529.197823] Code: f6 48 89 c8 48 2b 05 45 d1 6a 01 48 c1 f8 06 48 29
c3 48 8b 45 a8 48 c1 e3 06 48 01 cb f6 41 18 01 48 89 85 50 ff ff ff 74
0b <4c> 8b 33 49 c1 ee 11 41 83 e6 01 48 8b bd 48 ff ff ff e8 3f 99 02
[  529.197826] RSP: 0000:ffffc90003cdfaa0 EFLAGS: 00010202
[  529.197828] RAX: 00007ffff7ff4000 RBX: ffffea1fffffffc0 RCX:
ffffeaffffffffc0
[  529.197830] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
ffffc90003cdfaf8
[  529.197831] RBP: ffffc90003cdfb70 R08: 0000000000000000 R09:
0000000000000000
[  529.197832] R10: ffffc90003cdf910 R11: 0000000000000002 R12:
ffff888194450540
[  529.197833] R13: ffff888160d057c0 R14: 0000000000000000 R15:
03ffffffffffffff
[  529.197835] FS:  00007ffff7fdb740(0000) GS:ffff8883b0600000(0000)
knlGS:0000000000000000
[  529.197837] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  529.197839] CR2: ffffea1fffffffc0 CR3: 00000001562b4003 CR4:
0000000000770ee0
[  529.197840] PKRU: 55555554
[  529.197841] note: hmm-tests[18275] exited with preempt_count 1

Regards,
Alex Sierra

>

2022-06-23 07:57:44

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v5 01/13] mm: add zone device coherent type memory support

On 23.06.22 01:16, Sierra Guiza, Alejandro (Alex) wrote:
>
> [snip]
>
> Here's the backtrace. This happens in an hmm-test added in this patch
> series. However, I have tried to isolate this BUG by adding only the COW
> test with private device memory. It shows up only with the following
> sequence: allocate anonymous memory -> migrate it to private device
> memory -> fork -> try to access the parent's anonymous memory (which is
> supposed to trigger a page fault and a migration back to system memory).
> Just for the record, if the child is terminated before the parent's
> memory is accessed, this problem is not present.


The only usage of PageAnonExclusive() in try_to_migrate_one() is:

anon_exclusive = folio_test_anon(folio) &&
PageAnonExclusive(subpage);

Which can only possibly fail if subpage is not actually part of the folio.


I see some controversial code in the if (folio_is_zone_device(folio)) case later:

* The assignment to subpage above was computed from a
* swap PTE which results in an invalid pointer.
* Since only PAGE_SIZE pages can currently be
* migrated, just set it to page. This will need to be
* changed when hugepage migrations to device private
* memory are supported.
*/
subpage = &folio->page;

There we have our invalid pointer hint.

I don't see how it could have worked if the child quit, though? Maybe
just pure luck?


Does the following fix your issue:



From 09750c714739ef3ca317b4aec82bf20283c8fd2d Mon Sep 17 00:00:00 2001
From: David Hildenbrand <[email protected]>
Date: Thu, 23 Jun 2022 09:38:45 +0200
Subject: [PATCH] mm/rmap: fix dereferencing invalid subpage pointer in
try_to_migrate_one()

The subpage we calculate is an invalid pointer for device private pages,
because device private pages are mapped via non-present device private
entries, not ordinary present PTEs.

Let's just not compute broken pointers and fix them up later. Move the
proper assignment of the correct subpage to the beginning of the
function and assert that we really only have a single page in our folio.

This currently results in a BUG when trying to compute anon_exclusive,
because:

[ 528.727237] BUG: unable to handle page fault for address: ffffea1fffffffc0
[ 528.739585] #PF: supervisor read access in kernel mode
[ 528.745324] #PF: error_code(0x0000) - not-present page
[ 528.751062] PGD 44eaf2067 P4D 44eaf2067 PUD 0
[ 528.756026] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 528.760890] CPU: 120 PID: 18275 Comm: hmm-tests Not tainted 5.19.0-rc3-kfd-alex #257
[ 528.769542] Hardware name: AMD Corporation BardPeak/BardPeak, BIOS RTY1002BDS 09/17/2021
[ 528.778579] RIP: 0010:try_to_migrate_one+0x21a/0x1000
[ 528.784225] Code: f6 48 89 c8 48 2b 05 45 d1 6a 01 48 c1 f8 06 48 29
c3 48 8b 45 a8 48 c1 e3 06 48 01 cb f6 41 18 01 48 89 85 50 ff ff ff 74
0b <4c> 8b 33 49 c1 ee 11 41 83 e6 01 48 8b bd 48 ff ff ff e8 3f 99 02
[ 528.805194] RSP: 0000:ffffc90003cdfaa0 EFLAGS: 00010202
[ 528.811027] RAX: 00007ffff7ff4000 RBX: ffffea1fffffffc0 RCX: ffffeaffffffffc0
[ 528.818995] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffc90003cdfaf8
[ 528.826962] RBP: ffffc90003cdfb70 R08: 0000000000000000 R09: 0000000000000000
[ 528.834930] R10: ffffc90003cdf910 R11: 0000000000000002 R12: ffff888194450540
[ 528.842899] R13: ffff888160d057c0 R14: 0000000000000000 R15: 03ffffffffffffff
[ 528.850865] FS: 00007ffff7fdb740(0000) GS:ffff8883b0600000(0000) knlGS:0000000000000000
[ 528.859891] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 528.866308] CR2: ffffea1fffffffc0 CR3: 00000001562b4003 CR4: 0000000000770ee0
[ 528.874275] PKRU: 55555554
[ 528.877286] Call Trace:
[ 528.880016] <TASK>
[ 528.882356] ? lock_is_held_type+0xdf/0x130
[ 528.887033] rmap_walk_anon+0x167/0x410
[ 528.891316] try_to_migrate+0x90/0xd0
[ 528.895405] ? try_to_unmap_one+0xe10/0xe10
[ 528.900074] ? anon_vma_ctor+0x50/0x50
[ 528.904260] ? put_anon_vma+0x10/0x10
[ 528.908347] ? invalid_mkclean_vma+0x20/0x20
[ 528.913114] migrate_vma_setup+0x5f4/0x750
[ 528.917691] dmirror_devmem_fault+0x8c/0x250 [test_hmm]
[ 528.923532] do_swap_page+0xac0/0xe50
[ 528.927623] ? __lock_acquire+0x4b2/0x1ac0
[ 528.932199] __handle_mm_fault+0x949/0x1440
[ 528.936876] handle_mm_fault+0x13f/0x3e0
[ 528.941256] do_user_addr_fault+0x215/0x740
[ 528.945928] exc_page_fault+0x75/0x280
[ 528.950115] asm_exc_page_fault+0x27/0x30
[ 528.954593] RIP: 0033:0x40366b
...

Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
Reported-by: Sierra Guiza, Alejandro (Alex) <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/rmap.c | 27 +++++++++++++++++----------
1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 5bcb334cd6f2..746c05acad27 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1899,8 +1899,23 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
/* Unexpected PMD-mapped THP? */
VM_BUG_ON_FOLIO(!pvmw.pte, folio);

- subpage = folio_page(folio,
- pte_pfn(*pvmw.pte) - folio_pfn(folio));
+ if (folio_is_zone_device(folio)) {
+ /*
+ * Our PTE is a non-present device exclusive entry and
+ * calculating the subpage as for the common case would
+ * result in an invalid pointer.
+ *
+ * Since only PAGE_SIZE pages can currently be
+ * migrated, just set it to page. This will need to be
+ * changed when hugepage migrations to device private
+ * memory are supported.
+ */
+ VM_BUG_ON_FOLIO(folio_nr_pages(folio) > 1, folio);
+ subpage = &folio->page;
+ } else {
+ subpage = folio_page(folio,
+ pte_pfn(*pvmw.pte) - folio_pfn(folio));
+ }
address = pvmw.address;
anon_exclusive = folio_test_anon(folio) &&
PageAnonExclusive(subpage);
@@ -1993,15 +2008,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
/*
* No need to invalidate here it will synchronize on
* against the special swap migration pte.
- *
- * The assignment to subpage above was computed from a
- * swap PTE which results in an invalid pointer.
- * Since only PAGE_SIZE pages can currently be
- * migrated, just set it to page. This will need to be
- * changed when hugepage migrations to device private
- * memory are supported.
*/
- subpage = &folio->page;
} else if (PageHWPoison(subpage)) {
pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
if (folio_test_hugetlb(folio)) {
--
2.35.3

--
Thanks,

David / dhildenb

2022-06-23 19:20:35

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v5 01/13] mm: add zone device coherent type memory support

On 23.06.22 20:20, Sierra Guiza, Alejandro (Alex) wrote:
>
> On 6/23/2022 2:57 AM, David Hildenbrand wrote:
>> [snip]
>> Does the following fix your issue:
>
> Yes, it fixed the issue. Thanks. Should we include this patch in this
> patch series or send it separately?
>
> Regards,
> Alex Sierra

I'll send it right away "officially" so we can get it into 5.19. Can I
add your tested-by?


--
Thanks,

David / dhildenb

Subject: Re: [PATCH v5 01/13] mm: add zone device coherent type memory support


On 6/23/2022 2:57 AM, David Hildenbrand wrote:
> On 23.06.22 01:16, Sierra Guiza, Alejandro (Alex) wrote:
>> On 6/21/2022 11:16 AM, David Hildenbrand wrote:
>>> On 21.06.22 18:08, Sierra Guiza, Alejandro (Alex) wrote:
>>>> On 6/21/2022 7:25 AM, David Hildenbrand wrote:
>>>>> On 21.06.22 13:55, Alistair Popple wrote:
>>>>>> David Hildenbrand<[email protected]> writes:
>>>>>>
>>>>>>> On 21.06.22 13:25, Felix Kuehling wrote:
>>>>>>>> On 6/17/22 at 23:19, David Hildenbrand wrote:
>>>>>>>>> On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>>>>> On 6/17/2022 12:33 PM, David Hildenbrand wrote:
>>>>>>>>>>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>>>>>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>>>>>>>>>>>>> On 31.05.22 22:00, Alex Sierra wrote:
>>>>>>>>>>>>>> Device memory that is cache coherent from device and CPU point of view.
>>>>>>>>>>>>>> This is used on platforms that have an advanced system bus (like CAPI
>>>>>>>>>>>>>> or CXL). Any page of a process can be migrated to such memory. However,
>>>>>>>>>>>>>> no one should be allowed to pin such memory so that it can always be
>>>>>>>>>>>>>> evicted.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: Alex Sierra<[email protected]>
>>>>>>>>>>>>>> Acked-by: Felix Kuehling<[email protected]>
>>>>>>>>>>>>>> Reviewed-by: Alistair Popple<[email protected]>
>>>>>>>>>>>>>> [hch: rebased ontop of the refcount changes,
>>>>>>>>>>>>>> removed is_dev_private_or_coherent_page]
>>>>>>>>>>>>>> Signed-off-by: Christoph Hellwig<[email protected]>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>> include/linux/memremap.h | 19 +++++++++++++++++++
>>>>>>>>>>>>>> mm/memcontrol.c | 7 ++++---
>>>>>>>>>>>>>> mm/memory-failure.c | 8 ++++++--
>>>>>>>>>>>>>> mm/memremap.c | 10 ++++++++++
>>>>>>>>>>>>>> mm/migrate_device.c | 16 +++++++---------
>>>>>>>>>>>>>> mm/rmap.c | 5 +++--
>>>>>>>>>>>>>> 6 files changed, 49 insertions(+), 16 deletions(-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>>>>>>>>>> index 8af304f6b504..9f752ebed613 100644
>>>>>>>>>>>>>> --- a/include/linux/memremap.h
>>>>>>>>>>>>>> +++ b/include/linux/memremap.h
>>>>>>>>>>>>>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>>>>>>>>>>>>>> * A more complete discussion of unaddressable memory may be found in
>>>>>>>>>>>>>> * include/linux/hmm.h and Documentation/vm/hmm.rst.
>>>>>>>>>>>>>> *
>>>>>>>>>>>>>> + * MEMORY_DEVICE_COHERENT:
>>>>>>>>>>>>>> + * Device memory that is cache coherent from device and CPU point of view. This
>>>>>>>>>>>>>> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
>>>>>>>>>>>>>> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
>>>>>>>>>>>>>> + * type. Any page of a process can be migrated to such memory. However no one
>>>>>>>>>>>>> Any page might not be right, I'm pretty sure. ... just thinking about special pages
>>>>>>>>>>>>> like vdso, shared zeropage, ... pinned pages ...
>>>>>>>>>>> Well, you cannot migrate long term pages, that's what I meant :)
>>>>>>>>>>>
>>>>>>>>>>>>>> + * should be allowed to pin such memory so that it can always be evicted.
>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>> * MEMORY_DEVICE_FS_DAX:
>>>>>>>>>>>>>> * Host memory that has similar access semantics as System RAM i.e. DMA
>>>>>>>>>>>>>> * coherent and supports page pinning. In support of coordinating page
>>>>>>>>>>>>>> @@ -61,6 +68,7 @@ struct vmem_altmap {
>>>>>>>>>>>>>> enum memory_type {
>>>>>>>>>>>>>> /* 0 is reserved to catch uninitialized type fields */
>>>>>>>>>>>>>> MEMORY_DEVICE_PRIVATE = 1,
>>>>>>>>>>>>>> + MEMORY_DEVICE_COHERENT,
>>>>>>>>>>>>>> MEMORY_DEVICE_FS_DAX,
>>>>>>>>>>>>>> MEMORY_DEVICE_GENERIC,
>>>>>>>>>>>>>> MEMORY_DEVICE_PCI_P2PDMA,
>>>>>>>>>>>>>> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct folio *folio)
>>>>>>>>>>>>> In general, this LGTM, and it should be correct with PageAnonExclusive I think.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, where exactly is pinning forbidden?
>>>>>>>>>>>> Long-term pinning is forbidden since it would interfere with the device
>>>>>>>>>>>> memory manager owning the
>>>>>>>>>>>> device-coherent pages (e.g. evictions in TTM). However, normal pinning
>>>>>>>>>>>> is allowed on this device type.
>>>>>>>>>>> I don't see updates to folio_is_pinnable() in this patch.
>>>>>>>>>> Device coherent type pages should return true here, as they are pinnable
>>>>>>>>>> pages.
>>>>>>>>> That function is only called for long-term pinnings in try_grab_folio().
>>>>>>>>>
>>>>>>>>>>> So wouldn't try_grab_folio() simply pin these pages? What am I missing?
>>>>>>>>>>>> As far as I understand, this returns NULL for long-term
>>>>>>>>>>>> pinned pages. Otherwise they get their refcount incremented.
>>>>>>>>> I don't follow.
>>>>>>>>>
>>>>>>>>> You're saying
>>>>>>>>>
>>>>>>>>> a) folio_is_pinnable() returns true for device coherent pages
>>>>>>>>>
>>>>>>>>> and that
>>>>>>>>>
>>>>>>>>> b) device coherent pages don't get long-term pinned
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yet, the code says
>>>>>>>>>
>>>>>>>>> struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>>>>>>>>> {
>>>>>>>>> if (flags & FOLL_GET)
>>>>>>>>> return try_get_folio(page, refs);
>>>>>>>>> else if (flags & FOLL_PIN) {
>>>>>>>>> struct folio *folio;
>>>>>>>>>
>>>>>>>>> /*
>>>>>>>>> * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
>>>>>>>>> * right zone, so fail and let the caller fall back to the slow
>>>>>>>>> * path.
>>>>>>>>> */
>>>>>>>>> if (unlikely((flags & FOLL_LONGTERM) &&
>>>>>>>>> !is_pinnable_page(page)))
>>>>>>>>> return NULL;
>>>>>>>>> ...
>>>>>>>>> return folio;
>>>>>>>>> }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> What prevents these pages from getting long-term pinned as stated in this patch?
>>>>>>>> Long-term pinning is handled by __gup_longterm_locked, which migrates
>>>>>>>> pages returned by __get_user_pages_locked that cannot be long-term
>>>>>>>> pinned. try_grab_folio is OK to grab the pages. Anything that can't be
>>>>>>>> long-term pinned will be migrated afterwards, and
>>>>>>>> __get_user_pages_locked will be retried. The migration of
>>>>>>>> DEVICE_COHERENT pages was implemented by Alistair in patch 5/13
>>>>>>>> ("mm/gup: migrate device coherent pages when pinning instead of failing").
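
[Side note for the archives: the slow-path retry loop Felix describes
is roughly the following sketch, not the exact upstream code:

	do {
		/* grab the pages, device coherent ones included */
		rc = __get_user_pages_locked(...);
		if (rc <= 0)
			break;
		/* for FOLL_LONGTERM, migrate anything that must not
		   be long-term pinned and retry the gup */
		rc = check_and_migrate_movable_pages(rc, pages, gup_flags);
	} while (!rc);

so grabbing the page is fine; the migration happens afterwards.]
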
>>>>>>> Thanks.
>>>>>>>
>>>>>>> __gup_longterm_locked()->check_and_migrate_movable_pages()
>>>>>>>
>>>>>>> Which checks folio_is_pinnable() and doesn't do anything if set.
>>>>>>>
>>>>>>> Sorry to be dense here, but I don't see how what's stated in this patch
>>>>>>> works without adjusting folio_is_pinnable().
>>>>>> Ugh, I think you might be right about try_grab_folio().
>>>>>>
>>>>>> We didn't update folio_is_pinnable() to include device coherent pages
>>>>>> because device coherent pages are pinnable. It is really just
>>>>>> FOLL_LONGTERM that we want to prevent here.
>>>>>>
>>>>>> For normal PUP that is done by my change in
>>>>>> check_and_migrate_movable_pages() which migrates pages being pinned with
>>>>>> FOLL_LONGTERM. But I think I incorrectly assumed we would take the
>>>>>> pte_devmap() path in gup_pte_range(), which we don't for coherent pages.
>>>>>> So I think the check in try_grab_folio() needs to be:
>>>>> I think I said it already (and I might be wrong without reading the
>>>>> code), but folio_is_pinnable() is *only* called for long-term pinnings.
>>>>>
>>>>> It should actually be called folio_is_longterm_pinnable().
>>>>>
>>>>> That's where that check should go, no?
>>>> David, I think you're right. We didn't catch this since the LONGTERM
>>>> gup test we added to hmm-test only calls pin_user_pages. Apparently
>>>> try_grab_folio is called only from the fast callers (e.g.
>>>> pin_user_pages_fast/get_user_pages_fast). I have added a conditional,
>>>> similar to what Alistair proposed, to return NULL from try_grab_folio
>>>> on LONGTERM && (device coherent page || !folio_is_pinnable()). Also, a
>>>> new gup test with LONGTERM set that calls pin_user_pages_fast was
>>>> added. Returning NULL under this condition does cause the migration
>>>> from device to system memory.
>>>>
>>> Why can't coherent memory simply put its checks into
>>> folio_is_pinnable()? I don't get why we have to do things
>>> differently here.
>>>
>>>> Actually, I'm having a different problem with a call to
>>>> PageAnonExclusive from try_to_migrate_one during a page fault, in an
>>>> HMM test that first migrates pages to device private memory and then
>>>> forks to mark these pages COW. Apparently it is hitting the first
>>>> BUG, VM_BUG_ON_PGFLAGS(!PageAnon(page), page).
>>> With or without this series? A backtrace would be great.
>> Here's the backtrace. This happens in an hmm-test added in this patch
>> series. However, I have tried to isolate this BUG by adding just the
>> COW test with private device memory. It only reproduces as follows:
>> allocate anonymous mem -> migrate to private device memory -> fork ->
>> try to access the parent's anonymous memory (which is supposed to
>> trigger a page fault and migration to system mem). Just for the
>> record, if the child is terminated before the parent's memory is
>> accessed, this problem is not present.
>
> The only usage of PageAnonExclusive() in try_to_migrate_one() is:
>
> anon_exclusive = folio_test_anon(folio) &&
> PageAnonExclusive(subpage);
>
> Which can only possibly fail if subpage is not actually part of the folio.
>
>
> I see some controversial code in the if (folio_is_zone_device(folio)) case later:
>
> * The assignment to subpage above was computed from a
> * swap PTE which results in an invalid pointer.
> * Since only PAGE_SIZE pages can currently be
> * migrated, just set it to page. This will need to be
> * changed when hugepage migrations to device private
> * memory are supported.
> */
> subpage = &folio->page;
>
> There we have our invalid pointer hint.
>
> I don't see how it could have worked if the child quit, though? Maybe
> just pure luck?
>
>
> Does the following fix your issue:

Yes, it fixed the issue. Thanks. Should we include this patch in this
patch series or send it separately?

Regards,
Alex Sierra
>
>
>
> From 09750c714739ef3ca317b4aec82bf20283c8fd2d Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <[email protected]>
> Date: Thu, 23 Jun 2022 09:38:45 +0200
> Subject: [PATCH] mm/rmap: fix dereferencing invalid subpage pointer in
> try_to_migrate_one()
>
> The subpage we calculate is an invalid pointer for device private pages,
> because device private pages are mapped via non-present device private
> entries, not ordinary present PTEs.
>
> Let's just not compute broken pointers and fixup later. Move the proper
> assignment of the correct subpage to the beginning of the function and
> assert that we really only have a single page in our folio.
>
> This currently results in a BUG when trying to compute anon_exclusive,
> because:
>
> [ 528.727237] BUG: unable to handle page fault for address: ffffea1fffffffc0
> [ 528.739585] #PF: supervisor read access in kernel mode
> [ 528.745324] #PF: error_code(0x0000) - not-present page
> [ 528.751062] PGD 44eaf2067 P4D 44eaf2067 PUD 0
> [ 528.756026] Oops: 0000 [#1] PREEMPT SMP NOPTI
> [ 528.760890] CPU: 120 PID: 18275 Comm: hmm-tests Not tainted 5.19.0-rc3-kfd-alex #257
> [ 528.769542] Hardware name: AMD Corporation BardPeak/BardPeak, BIOS RTY1002BDS 09/17/2021
> [ 528.778579] RIP: 0010:try_to_migrate_one+0x21a/0x1000
> [ 528.784225] Code: f6 48 89 c8 48 2b 05 45 d1 6a 01 48 c1 f8 06 48 29
> c3 48 8b 45 a8 48 c1 e3 06 48 01 cb f6 41 18 01 48 89 85 50 ff ff ff 74
> 0b <4c> 8b 33 49 c1 ee 11 41 83 e6 01 48 8b bd 48 ff ff ff e8 3f 99 02
> [ 528.805194] RSP: 0000:ffffc90003cdfaa0 EFLAGS: 00010202
> [ 528.811027] RAX: 00007ffff7ff4000 RBX: ffffea1fffffffc0 RCX: ffffeaffffffffc0
> [ 528.818995] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffc90003cdfaf8
> [ 528.826962] RBP: ffffc90003cdfb70 R08: 0000000000000000 R09: 0000000000000000
> [ 528.834930] R10: ffffc90003cdf910 R11: 0000000000000002 R12: ffff888194450540
> [ 528.842899] R13: ffff888160d057c0 R14: 0000000000000000 R15: 03ffffffffffffff
> [ 528.850865] FS: 00007ffff7fdb740(0000) GS:ffff8883b0600000(0000) knlGS:0000000000000000
> [ 528.859891] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 528.866308] CR2: ffffea1fffffffc0 CR3: 00000001562b4003 CR4: 0000000000770ee0
> [ 528.874275] PKRU: 55555554
> [ 528.877286] Call Trace:
> [ 528.880016] <TASK>
> [ 528.882356] ? lock_is_held_type+0xdf/0x130
> [ 528.887033] rmap_walk_anon+0x167/0x410
> [ 528.891316] try_to_migrate+0x90/0xd0
> [ 528.895405] ? try_to_unmap_one+0xe10/0xe10
> [ 528.900074] ? anon_vma_ctor+0x50/0x50
> [ 528.904260] ? put_anon_vma+0x10/0x10
> [ 528.908347] ? invalid_mkclean_vma+0x20/0x20
> [ 528.913114] migrate_vma_setup+0x5f4/0x750
> [ 528.917691] dmirror_devmem_fault+0x8c/0x250 [test_hmm]
> [ 528.923532] do_swap_page+0xac0/0xe50
> [ 528.927623] ? __lock_acquire+0x4b2/0x1ac0
> [ 528.932199] __handle_mm_fault+0x949/0x1440
> [ 528.936876] handle_mm_fault+0x13f/0x3e0
> [ 528.941256] do_user_addr_fault+0x215/0x740
> [ 528.945928] exc_page_fault+0x75/0x280
> [ 528.950115] asm_exc_page_fault+0x27/0x30
> [ 528.954593] RIP: 0033:0x40366b
> ...
>
> Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
> Reported-by: Sierra Guiza, Alejandro (Alex) <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: "Matthew Wilcox (Oracle)" <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> mm/rmap.c | 27 +++++++++++++++++----------
> 1 file changed, 17 insertions(+), 10 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 5bcb334cd6f2..746c05acad27 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1899,8 +1899,23 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
> /* Unexpected PMD-mapped THP? */
> VM_BUG_ON_FOLIO(!pvmw.pte, folio);
>
> - subpage = folio_page(folio,
> - pte_pfn(*pvmw.pte) - folio_pfn(folio));
> + if (folio_is_zone_device(folio)) {
> + /*
> + * Our PTE is a non-present device exclusive entry and
> + * calculating the subpage as for the common case would
> + * result in an invalid pointer.
> + *
> + * Since only PAGE_SIZE pages can currently be
> + * migrated, just set it to page. This will need to be
> + * changed when hugepage migrations to device private
> + * memory are supported.
> + */
> + VM_BUG_ON_FOLIO(folio_nr_pages(folio) > 1, folio);
> + subpage = &folio->page;
> + } else {
> + subpage = folio_page(folio,
> + pte_pfn(*pvmw.pte) - folio_pfn(folio));
> + }
> address = pvmw.address;
> anon_exclusive = folio_test_anon(folio) &&
> PageAnonExclusive(subpage);
> @@ -1993,15 +2008,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
> /*
> * No need to invalidate here it will synchronize on
> * against the special swap migration pte.
> - *
> - * The assignment to subpage above was computed from a
> - * swap PTE which results in an invalid pointer.
> - * Since only PAGE_SIZE pages can currently be
> - * migrated, just set it to page. This will need to be
> - * changed when hugepage migrations to device private
> - * memory are supported.
> */
> - subpage = &folio->page;
> } else if (PageHWPoison(subpage)) {
> pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
> if (folio_test_hugetlb(folio)) {

Subject: Re: [PATCH v5 01/13] mm: add zone device coherent type memory support


On 6/23/2022 1:21 PM, David Hildenbrand wrote:
> On 23.06.22 20:20, Sierra Guiza, Alejandro (Alex) wrote:
>> On 6/23/2022 2:57 AM, David Hildenbrand wrote:
>>> On 23.06.22 01:16, Sierra Guiza, Alejandro (Alex) wrote:
>>>> On 6/21/2022 11:16 AM, David Hildenbrand wrote:
>>>>> On 21.06.22 18:08, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>> On 6/21/2022 7:25 AM, David Hildenbrand wrote:
>>>>>>> On 21.06.22 13:55, Alistair Popple wrote:
>>>>>>>> David Hildenbrand<[email protected]> writes:
>>>>>>>>
>>>>>>>>> On 21.06.22 13:25, Felix Kuehling wrote:
>>>>>>>>>> On 6/17/22 at 23:19, David Hildenbrand wrote:
>>>>>>>>>>> On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>>>>>>> On 6/17/2022 12:33 PM, David Hildenbrand wrote:
>>>>>>>>>>>>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>>>>>>>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>>>>>>>>>>>>>>> On 31.05.22 22:00, Alex Sierra wrote:
>>>>>>>>>>>>>>>> Device memory that is cache coherent from device and CPU point of view.
>>>>>>>>>>>>>>>> This is used on platforms that have an advanced system bus (like CAPI
>>>>>>>>>>>>>>>> or CXL). Any page of a process can be migrated to such memory. However,
>>>>>>>>>>>>>>>> no one should be allowed to pin such memory so that it can always be
>>>>>>>>>>>>>>>> evicted.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Signed-off-by: Alex Sierra<[email protected]>
>>>>>>>>>>>>>>>> Acked-by: Felix Kuehling<[email protected]>
>>>>>>>>>>>>>>>> Reviewed-by: Alistair Popple<[email protected]>
>>>>>>>>>>>>>>>> [hch: rebased ontop of the refcount changes,
>>>>>>>>>>>>>>>> removed is_dev_private_or_coherent_page]
>>>>>>>>>>>>>>>> Signed-off-by: Christoph Hellwig<[email protected]>
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>> include/linux/memremap.h | 19 +++++++++++++++++++
>>>>>>>>>>>>>>>> mm/memcontrol.c | 7 ++++---
>>>>>>>>>>>>>>>> mm/memory-failure.c | 8 ++++++--
>>>>>>>>>>>>>>>> mm/memremap.c | 10 ++++++++++
>>>>>>>>>>>>>>>> mm/migrate_device.c | 16 +++++++---------
>>>>>>>>>>>>>>>> mm/rmap.c | 5 +++--
>>>>>>>>>>>>>>>> 6 files changed, 49 insertions(+), 16 deletions(-)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>>>>>>>>>>>> index 8af304f6b504..9f752ebed613 100644
>>>>>>>>>>>>>>>> --- a/include/linux/memremap.h
>>>>>>>>>>>>>>>> +++ b/include/linux/memremap.h
>>>>>>>>>>>>>>>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>>>>>>>>>>>>>>>> * A more complete discussion of unaddressable memory may be found in
>>>>>>>>>>>>>>>> * include/linux/hmm.h and Documentation/vm/hmm.rst.
>>>>>>>>>>>>>>>> *
>>>>>>>>>>>>>>>> + * MEMORY_DEVICE_COHERENT:
>>>>>>>>>>>>>>>> + * Device memory that is cache coherent from device and CPU point of view. This
>>>>>>>>>>>>>>>> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
>>>>>>>>>>>>>>>> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
>>>>>>>>>>>>>>>> + * type. Any page of a process can be migrated to such memory. However no one
>>>>>>>>>>>>>>> Any page might not be right, I'm pretty sure. ... just thinking about special pages
>>>>>>>>>>>>>>> like vdso, shared zeropage, ... pinned pages ...
>>>>>>>>>>>>> Well, you cannot migrate long term pages, that's what I meant :)
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> + * should be allowed to pin such memory so that it can always be evicted.
>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>> * MEMORY_DEVICE_FS_DAX:
>>>>>>>>>>>>>>>> * Host memory that has similar access semantics as System RAM i.e. DMA
>>>>>>>>>>>>>>>> * coherent and supports page pinning. In support of coordinating page
>>>>>>>>>>>>>>>> @@ -61,6 +68,7 @@ struct vmem_altmap {
>>>>>>>>>>>>>>>> enum memory_type {
>>>>>>>>>>>>>>>> /* 0 is reserved to catch uninitialized type fields */
>>>>>>>>>>>>>>>> MEMORY_DEVICE_PRIVATE = 1,
>>>>>>>>>>>>>>>> + MEMORY_DEVICE_COHERENT,
>>>>>>>>>>>>>>>> MEMORY_DEVICE_FS_DAX,
>>>>>>>>>>>>>>>> MEMORY_DEVICE_GENERIC,
>>>>>>>>>>>>>>>> MEMORY_DEVICE_PCI_P2PDMA,
>>>>>>>>>>>>>>>> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct folio *folio)
>>>>>>>>>>>>>>> In general, this LGTM, and it should be correct with PageAnonExclusive I think.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> However, where exactly is pinning forbidden?
>>>>>>>>>>>>>> Long-term pinning is forbidden since it would interfere with the device
>>>>>>>>>>>>>> memory manager owning the
>>>>>>>>>>>>>> device-coherent pages (e.g. evictions in TTM). However, normal pinning
>>>>>>>>>>>>>> is allowed on this device type.
>>>>>>>>>>>>> I don't see updates to folio_is_pinnable() in this patch.
>>>>>>>>>>>> Device coherent type pages should return true here, as they are pinnable
>>>>>>>>>>>> pages.
>>>>>>>>>>> That function is only called for long-term pinnings in try_grab_folio().
>>>>>>>>>>>
>>>>>>>>>>>>> So wouldn't try_grab_folio() simply pin these pages? What am I missing?
>>>>>>>>>>>>>> As far as I understand, this returns NULL for long-term
>>>>>>>>>>>>>> pinned pages. Otherwise they get their refcount incremented.
>>>>>>>>>>> I don't follow.
>>>>>>>>>>>
>>>>>>>>>>> You're saying
>>>>>>>>>>>
>>>>>>>>>>> a) folio_is_pinnable() returns true for device coherent pages
>>>>>>>>>>>
>>>>>>>>>>> and that
>>>>>>>>>>>
>>>>>>>>>>> b) device coherent pages don't get long-term pinned
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Yet, the code says
>>>>>>>>>>>
>>>>>>>>>>> struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>>>>>>>>>>> {
>>>>>>>>>>> if (flags & FOLL_GET)
>>>>>>>>>>> return try_get_folio(page, refs);
>>>>>>>>>>> else if (flags & FOLL_PIN) {
>>>>>>>>>>> struct folio *folio;
>>>>>>>>>>>
>>>>>>>>>>> /*
>>>>>>>>>>> * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
>>>>>>>>>>> * right zone, so fail and let the caller fall back to the slow
>>>>>>>>>>> * path.
>>>>>>>>>>> */
>>>>>>>>>>> if (unlikely((flags & FOLL_LONGTERM) &&
>>>>>>>>>>> !is_pinnable_page(page)))
>>>>>>>>>>> return NULL;
>>>>>>>>>>> ...
>>>>>>>>>>> return folio;
>>>>>>>>>>> }
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> What prevents these pages from getting long-term pinned as stated in this patch?
>>>>>>>>>> Long-term pinning is handled by __gup_longterm_locked, which migrates
>>>>>>>>>> pages returned by __get_user_pages_locked that cannot be long-term
>>>>>>>>>> pinned. try_grab_folio is OK to grab the pages. Anything that can't be
>>>>>>>>>> long-term pinned will be migrated afterwards, and
>>>>>>>>>> __get_user_pages_locked will be retried. The migration of
>>>>>>>>>> DEVICE_COHERENT pages was implemented by Alistair in patch 5/13
>>>>>>>>>> ("mm/gup: migrate device coherent pages when pinning instead of failing").
>>>>>>>>> Thanks.
>>>>>>>>>
>>>>>>>>> __gup_longterm_locked()->check_and_migrate_movable_pages()
>>>>>>>>>
>>>>>>>>> Which checks folio_is_pinnable() and doesn't do anything if set.
>>>>>>>>>
>>>>>>>>> Sorry to be dense here, but I don't see how what's stated in this patch
>>>>>>>>> works without adjusting folio_is_pinnable().
>>>>>>>> Ugh, I think you might be right about try_grab_folio().
>>>>>>>>
>>>>>>>> We didn't update folio_is_pinnable() to include device coherent pages
>>>>>>>> because device coherent pages are pinnable. It is really just
>>>>>>>> FOLL_LONGTERM that we want to prevent here.
>>>>>>>>
>>>>>>>> For normal PUP that is done by my change in
>>>>>>>> check_and_migrate_movable_pages() which migrates pages being pinned with
>>>>>>>> FOLL_LONGTERM. But I think I incorrectly assumed we would take the
>>>>>>>> pte_devmap() path in gup_pte_range(), which we don't for coherent pages.
>>>>>>>> So I think the check in try_grab_folio() needs to be:
>>>>>>> I think I said it already (and I might be wrong without reading the
>>>>>>> code), but folio_is_pinnable() is *only* called for long-term pinnings.
>>>>>>>
>>>>>>> It should actually be called folio_is_longterm_pinnable().
>>>>>>>
>>>>>>> That's where that check should go, no?
>>>>>> David, I think you're right. We didn't catch this since the LONGTERM
>>>>>> gup test we added to hmm-test only calls pin_user_pages. Apparently
>>>>>> try_grab_folio is called only from the fast callers (e.g.
>>>>>> pin_user_pages_fast/get_user_pages_fast). I have added a conditional,
>>>>>> similar to what Alistair proposed, to return NULL from try_grab_folio
>>>>>> on LONGTERM && (device coherent page || !folio_is_pinnable()). Also, a
>>>>>> new gup test with LONGTERM set that calls pin_user_pages_fast was
>>>>>> added. Returning NULL under this condition does cause the migration
>>>>>> from device to system memory.
>>>>>>
>>>>> Why can't coherent memory simply put its checks into
>>>>> folio_is_pinnable()? I don't get why we have to do things
>>>>> differently here.
>>>>>
>>>>>> Actually, I'm having a different problem with a call to
>>>>>> PageAnonExclusive from try_to_migrate_one during a page fault, in an
>>>>>> HMM test that first migrates pages to device private memory and then
>>>>>> forks to mark these pages COW. Apparently it is hitting the first
>>>>>> BUG, VM_BUG_ON_PGFLAGS(!PageAnon(page), page).
>>>>> With or without this series? A backtrace would be great.
>>>> Here's the backtrace. This happens in an hmm-test added in this patch
>>>> series. However, I have tried to isolate this BUG by adding just the
>>>> COW test with private device memory. It only reproduces as follows:
>>>> allocate anonymous mem -> migrate to private device memory -> fork ->
>>>> try to access the parent's anonymous memory (which is supposed to
>>>> trigger a page fault and migration to system mem). Just for the
>>>> record, if the child is terminated before the parent's memory is
>>>> accessed, this problem is not present.
>>> The only usage of PageAnonExclusive() in try_to_migrate_one() is:
>>>
>>> anon_exclusive = folio_test_anon(folio) &&
>>> PageAnonExclusive(subpage);
>>>
>>> Which can only possibly fail if subpage is not actually part of the folio.
>>>
>>>
>>> I see some controversial code in the if (folio_is_zone_device(folio)) case later:
>>>
>>> * The assignment to subpage above was computed from a
>>> * swap PTE which results in an invalid pointer.
>>> * Since only PAGE_SIZE pages can currently be
>>> * migrated, just set it to page. This will need to be
>>> * changed when hugepage migrations to device private
>>> * memory are supported.
>>> */
>>> subpage = &folio->page;
>>>
>>> There we have our invalid pointer hint.
>>>
>>> I don't see how it could have worked if the child quit, though? Maybe
>>> just pure luck?
>>>
>>>
>>> Does the following fix your issue:
>> Yes, it fixed the issue. Thanks. Should we include this patch in this
>> patch series or send it separately?
>>
>> Regards,
>> Alex Sierra
> I'll send it right away "officially" so we can get it into 5.19. Can I
> add your tested-by?

Of course.

Alex Sierra
