Subject: [PATCH v6 00/10] Add MEMORY_DEVICE_COHERENT for coherent device memory mapping

This is hopefully the final version, with all patches Reviewed.
Andrew, if there are no further objections, please merge this version.

This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory
owned by a device that can be mapped into CPU page tables like
MEMORY_DEVICE_GENERIC and can also be migrated like
MEMORY_DEVICE_PRIVATE.

Christoph, the suggestion to incorporate Ralph Campbell’s refcount
cleanup patch into our hardware page migration patchset originally came
from you, but it proved impractical to do things in that order because
the refcount cleanup introduced a bug with wide ranging structural
implications. Instead, we amended Ralph’s patch so that it could be
applied after merging the migration work. As we saw from the recent
discussion, merging the refcount work is going to take some time and
cooperation between multiple development groups, while the migration
work is ready now and is needed now. So we propose to merge this
patchset first and continue to work with Ralph and others to merge the
refcount cleanup separately, when it is ready.

This patch series is mostly self-contained except for a few places where
it needs to update other subsystems to handle the new memory type.

System stability and performance are not affected according to our
ongoing testing, including xfstests.

How it works: The system BIOS advertises the GPU device memory
(aka VRAM) as SPM (special purpose memory) in the UEFI system address
map.

The amdgpu driver registers the memory with devmap as
MEMORY_DEVICE_COHERENT using devm_memremap_pages. The initial user for
this hardware page migration capability is the Frontier supercomputer
project. This functionality is not AMD-specific. We expect other GPU
vendors to find this functionality useful, and possibly other hardware
types in the future.
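
For illustration, the registration boils down to something like the
following sketch (the function name, ops structure and aperture
arguments are made up for this example; patch 4 does the real thing in
svm_migrate_init):

  /*
   * Hedged sketch only: register CPU-coherent device memory (VRAM) as
   * ZONE_DEVICE pages of type MEMORY_DEVICE_COHERENT.  Assumes
   * <linux/memremap.h>; my_pgmap_ops is a placeholder for the driver's
   * dev_pagemap_ops (it must provide ->page_free).
   */
  static int register_coherent_vram(struct device *dev,
                                    struct dev_pagemap *pgmap,
                                    u64 aper_base, u64 aper_size)
  {
          void *r;

          pgmap->type = MEMORY_DEVICE_COHERENT;
          pgmap->range.start = aper_base;
          pgmap->range.end = aper_base + aper_size - 1;
          pgmap->nr_range = 1;
          pgmap->ops = &my_pgmap_ops;     /* driver's page_free etc. */
          pgmap->owner = dev;             /* any driver-chosen cookie */
          pgmap->flags = 0;

          r = devm_memremap_pages(dev, pgmap);
          return IS_ERR(r) ? PTR_ERR(r) : 0;
  }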

Our test nodes in the lab are similar to the Frontier configuration,
with 0.5 TB of system memory plus 256 GB of device memory split across
4 GPUs, all in a single coherent address space. Page migration is
expected to improve application efficiency significantly. We will
report empirical results as they become available.

We extended hmm_test to cover migration of MEMORY_DEVICE_COHERENT. This
patch set builds on HMM and our SVM memory manager already merged in
5.15.
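
For reference, the extended selftest drives migration through explicit
ioctls. A rough, illustrative sketch of a round trip (command names
from lib/test_hmm_uapi.h as added later in this series; the helper name
and error handling here are our own):

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include "test_hmm_uapi.h"

  /* Illustrative only: migrate npages at addr to the device and back. */
  static int migrate_round_trip(int fd, void *addr, uint64_t npages)
  {
          struct hmm_dmirror_cmd cmd = {
                  .addr   = (uintptr_t)addr,
                  .npages = npages,
          };
          int ret;

          ret = ioctl(fd, HMM_DMIRROR_MIGRATE_TO_DEV, &cmd);
          if (ret)
                  return ret;
          /*
           * Coherent pages never fault back on CPU access, so the test
           * has to migrate them back to system memory explicitly.
           */
          return ioctl(fd, HMM_DMIRROR_MIGRATE_TO_SYS, &cmd);
  }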

v2:
- test_hmm is now able to create private and coherent device mirror
instances in the same driver probe. This makes the hmm test more usable,
since the kernel module no longer has to be reloaded for each device
type test (private/coherent type). This is done by passing the module
parameters spm_addr_dev0 & spm_addr_dev1. In this case, it will create
four instances of device_mirror. The first two correspond to the private
device type, the last two to the coherent type. They can then be easily
accessed from user space through /dev/hmm_dmirror<num_device>. Usually
num_device 0 and 1 are for private, and 2 and 3 for coherent types.

- Coherent device type pages at gup are now migrated back to system
memory if they have been long term pinned (FOLL_LONGTERM). The reason
is that these pages could eventually interfere with their own device
memory manager. A new hmm_gup_test has been added to hmm-test to test
this functionality. It makes use of the gup_test module to long term
pin user pages that have been migrated to device memory first.

- Other patch corrections made by Felix, Alistair and Christoph.

v3:
- Based on the last v2 feedback we got from Alistair, we've decided to
remove the migration logic for FOLL_LONGTERM coherent device type pages
at gup for now. Ideally, this should be done through the kernel mm,
instead of calling the device driver to do it. Currently, there's no
support for migrating device pages based on pfn, mainly because
migrate_pages() relies on pages being LRU pages. Alistair mentioned he
has started working on adding this migrate-device-pages logic. For now,
we fail the get_user_pages call with FOLL_LONGTERM for DEVICE_COHERENT
pages.

- Also, hmm_gup_test has been removed from hmm-test. We plan to include
it again after this migration work is ready.

- Addressed Liam Howlett's feedback changes.

v4:
- Addressed Alistair Popple's last v3 feedback.

- Use the same system entry path for coherent device pages at
migrate_vma_insert_page.

- Add coherent device type support for try_to_migrate /
try_to_migrate_one.

- Include the number of coherent device pages successfully migrated back
to system memory in test_hmm. Made the corresponding changes to hmm-test
to read/check this number.

v5:
- Rebase on 5.17-rc1.
- Addressed Alistair Popple's last v4 feedback.

v6:
- Corrections to zero page handling, to make sure these are only
migrated when MIGRATE_VMA_SELECT_SYSTEM is set.
- Added the latest Reviewed-by tags.
- Rebased to the latest master branch from repo:
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Alex Sierra (10):
mm: add zone device coherent type memory support
mm: add device coherent vma selection for memory migration
mm/gup: fail get_user_pages for LONGTERM dev coherent type
drm/amdkfd: add SPM support for SVM
drm/amdkfd: coherent type as sys mem on migration to ram
lib: test_hmm add ioctl to get zone device type
lib: test_hmm add module param for zone device type
lib: add support for device coherent type in test_hmm
tools: update hmm-test to support device coherent type
tools: update test_hmm script to support SP config

drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 34 ++-
include/linux/memremap.h | 8 +
include/linux/migrate.h | 1 +
include/linux/mm.h | 16 +
lib/test_hmm.c | 356 +++++++++++++++++------
lib/test_hmm_uapi.h | 22 +-
mm/gup.c | 7 +
mm/memcontrol.c | 6 +-
mm/memory-failure.c | 8 +-
mm/memremap.c | 14 +-
mm/migrate.c | 57 ++--
mm/rmap.c | 5 +-
tools/testing/selftests/vm/hmm-tests.c | 123 ++++++--
tools/testing/selftests/vm/test_hmm.sh | 24 +-
14 files changed, 519 insertions(+), 162 deletions(-)

--
2.32.0


Subject: [PATCH v6 04/10] drm/amdkfd: add SPM support for SVM

When the CPU is connected through XGMI, it has coherent
access to the VRAM resource. In this case the resource
is taken from the device's GMC aperture base.
This resource is used along with the device type, which can
be DEVICE_PRIVATE or DEVICE_COHERENT, to create the device
page map region.

Signed-off-by: Alex Sierra <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
---
drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 29 +++++++++++++++---------
1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index ed5385137f48..5e8d944d359e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -933,7 +933,7 @@ int svm_migrate_init(struct amdgpu_device *adev)
{
struct kfd_dev *kfddev = adev->kfd.dev;
struct dev_pagemap *pgmap;
- struct resource *res;
+ struct resource *res = NULL;
unsigned long size;
void *r;

@@ -948,28 +948,34 @@ int svm_migrate_init(struct amdgpu_device *adev)
* should remove reserved size
*/
size = ALIGN(adev->gmc.real_vram_size, 2ULL << 20);
- res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
- if (IS_ERR(res))
- return -ENOMEM;
+ if (adev->gmc.xgmi.connected_to_cpu) {
+ pgmap->range.start = adev->gmc.aper_base;
+ pgmap->range.end = adev->gmc.aper_base + adev->gmc.aper_size - 1;
+ pgmap->type = MEMORY_DEVICE_COHERENT;
+ } else {
+ res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
+ if (IS_ERR(res))
+ return -ENOMEM;
+ pgmap->range.start = res->start;
+ pgmap->range.end = res->end;
+ pgmap->type = MEMORY_DEVICE_PRIVATE;
+ }

- pgmap->type = MEMORY_DEVICE_PRIVATE;
pgmap->nr_range = 1;
- pgmap->range.start = res->start;
- pgmap->range.end = res->end;
pgmap->ops = &svm_migrate_pgmap_ops;
pgmap->owner = SVM_ADEV_PGMAP_OWNER(adev);
- pgmap->flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
-
+ pgmap->flags = 0;
/* Device manager releases device-specific resources, memory region and
* pgmap when driver disconnects from device.
*/
r = devm_memremap_pages(adev->dev, pgmap);
if (IS_ERR(r)) {
pr_err("failed to register HMM device memory\n");
-
+ if (pgmap->type == MEMORY_DEVICE_PRIVATE)
+ devm_release_mem_region(adev->dev, res->start,
+ resource_size(res));
/* Disable SVM support capability */
pgmap->type = 0;
- devm_release_mem_region(adev->dev, res->start, resource_size(res));
return PTR_ERR(r);
}

@@ -982,3 +988,4 @@ int svm_migrate_init(struct amdgpu_device *adev)

return 0;
}
+
--
2.32.0

Subject: [PATCH v6 01/10] mm: add zone device coherent type memory support

Device memory that is cache coherent from device and CPU point of view.
This is used on platforms that have an advanced system bus (like CAPI
or CXL). Any page of a process can be migrated to such memory. However,
no one should be allowed to pin such memory so that it can always be
evicted.

Signed-off-by: Alex Sierra <[email protected]>
Acked-by: Felix Kuehling <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
---
v4:
- use the same system entry path for coherent device pages at
migrate_vma_insert_page.

- Add coherent device type support for try_to_migrate /
try_to_migrate_one.
---
include/linux/memremap.h | 8 +++++++
include/linux/mm.h | 16 ++++++++++++++
mm/memcontrol.c | 6 +++---
mm/memory-failure.c | 8 +++++--
mm/memremap.c | 14 ++++++++++++-
mm/migrate.c | 45 ++++++++++++++++++++--------------------
mm/rmap.c | 5 +++--
7 files changed, 71 insertions(+), 31 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 1fafcc38acba..727b8c789193 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -39,6 +39,13 @@ struct vmem_altmap {
* A more complete discussion of unaddressable memory may be found in
* include/linux/hmm.h and Documentation/vm/hmm.rst.
*
+ * MEMORY_DEVICE_COHERENT:
+ * Device memory that is cache coherent from device and CPU point of view. This
+ * is used on platforms that have an advanced system bus (like CAPI or CXL). A
+ * driver can hotplug the device memory using ZONE_DEVICE and with that memory
+ * type. Any page of a process can be migrated to such memory. However no one
+ * should be allowed to pin such memory so that it can always be evicted.
+ *
* MEMORY_DEVICE_FS_DAX:
* Host memory that has similar access semantics as System RAM i.e. DMA
* coherent and supports page pinning. In support of coordinating page
@@ -59,6 +66,7 @@ struct vmem_altmap {
enum memory_type {
/* 0 is reserved to catch uninitialized type fields */
MEMORY_DEVICE_PRIVATE = 1,
+ MEMORY_DEVICE_COHERENT,
MEMORY_DEVICE_FS_DAX,
MEMORY_DEVICE_GENERIC,
MEMORY_DEVICE_PCI_P2PDMA,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 213cc569b192..545f41da9fe9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1106,6 +1106,7 @@ static inline bool page_is_devmap_managed(struct page *page)
return false;
switch (page->pgmap->type) {
case MEMORY_DEVICE_PRIVATE:
+ case MEMORY_DEVICE_COHERENT:
case MEMORY_DEVICE_FS_DAX:
return true;
default:
@@ -1135,6 +1136,21 @@ static inline bool is_device_private_page(const struct page *page)
page->pgmap->type == MEMORY_DEVICE_PRIVATE;
}

+static inline bool is_device_coherent_page(const struct page *page)
+{
+ return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
+ is_zone_device_page(page) &&
+ page->pgmap->type == MEMORY_DEVICE_COHERENT;
+}
+
+static inline bool is_dev_private_or_coherent_page(const struct page *page)
+{
+ return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
+ is_zone_device_page(page) &&
+ (page->pgmap->type == MEMORY_DEVICE_PRIVATE ||
+ page->pgmap->type == MEMORY_DEVICE_COHERENT);
+}
+
static inline bool is_pci_p2pdma_page(const struct page *page)
{
return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 09d342c7cbd0..0882b5b2a857 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5691,8 +5691,8 @@ static int mem_cgroup_move_account(struct page *page,
* 2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
* target for charge migration. if @target is not NULL, the entry is stored
* in target->ent.
- * 3(MC_TARGET_DEVICE): like MC_TARGET_PAGE but page is MEMORY_DEVICE_PRIVATE
- * (so ZONE_DEVICE page and thus not on the lru).
+ * 3(MC_TARGET_DEVICE): like MC_TARGET_PAGE but page is device memory and
+ * thus not on the lru.
* For now we such page is charge like a regular page would be as for all
* intent and purposes it is just special memory taking the place of a
* regular page.
@@ -5726,7 +5726,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
*/
if (page_memcg(page) == mc.from) {
ret = MC_TARGET_PAGE;
- if (is_device_private_page(page))
+ if (is_dev_private_or_coherent_page(page))
ret = MC_TARGET_DEVICE;
if (target)
target->page = page;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 97a9ed8f87a9..f498ed3ece79 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1617,12 +1617,16 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
goto unlock;
}

- if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
+ switch (pgmap->type) {
+ case MEMORY_DEVICE_PRIVATE:
+ case MEMORY_DEVICE_COHERENT:
/*
- * TODO: Handle HMM pages which may need coordination
+ * TODO: Handle device pages which may need coordination
* with device-side memory.
*/
goto unlock;
+ default:
+ break;
}

/*
diff --git a/mm/memremap.c b/mm/memremap.c
index 6aa5f0c2d11f..354c8027823f 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -44,6 +44,7 @@ EXPORT_SYMBOL(devmap_managed_key);
static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
{
if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
+ pgmap->type == MEMORY_DEVICE_COHERENT ||
pgmap->type == MEMORY_DEVICE_FS_DAX)
static_branch_dec(&devmap_managed_key);
}
@@ -51,6 +52,7 @@ static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
static void devmap_managed_enable_get(struct dev_pagemap *pgmap)
{
if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
+ pgmap->type == MEMORY_DEVICE_COHERENT ||
pgmap->type == MEMORY_DEVICE_FS_DAX)
static_branch_inc(&devmap_managed_key);
}
@@ -327,6 +329,16 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
return ERR_PTR(-EINVAL);
}
break;
+ case MEMORY_DEVICE_COHERENT:
+ if (!pgmap->ops->page_free) {
+ WARN(1, "Missing page_free method\n");
+ return ERR_PTR(-EINVAL);
+ }
+ if (!pgmap->owner) {
+ WARN(1, "Missing owner\n");
+ return ERR_PTR(-EINVAL);
+ }
+ break;
case MEMORY_DEVICE_FS_DAX:
if (!IS_ENABLED(CONFIG_ZONE_DEVICE) ||
IS_ENABLED(CONFIG_FS_DAX_LIMITED)) {
@@ -469,7 +481,7 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
void free_devmap_managed_page(struct page *page)
{
/* notify page idle for dax */
- if (!is_device_private_page(page)) {
+ if (!is_dev_private_or_coherent_page(page)) {
wake_up_var(&page->_refcount);
return;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index c7da064b4781..cd137aedcfe5 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -345,7 +345,7 @@ static int expected_page_refs(struct address_space *mapping, struct page *page)
* Device private pages have an extra refcount as they are
* ZONE_DEVICE pages.
*/
- expected_count += is_device_private_page(page);
+ expected_count += is_dev_private_or_coherent_page(page);
if (mapping)
expected_count += compound_nr(page) + page_has_private(page);

@@ -2611,7 +2611,7 @@ EXPORT_SYMBOL(migrate_vma_setup);
* handle_pte_fault()
* do_anonymous_page()
* to map in an anonymous zero page but the struct page will be a ZONE_DEVICE
- * private page.
+ * private or coherent page.
*/
static void migrate_vma_insert_page(struct migrate_vma *migrate,
unsigned long addr,
@@ -2676,25 +2676,24 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
*/
__SetPageUptodate(page);

- if (is_zone_device_page(page)) {
- if (is_device_private_page(page)) {
- swp_entry_t swp_entry;
+ if (is_device_private_page(page)) {
+ swp_entry_t swp_entry;

- if (vma->vm_flags & VM_WRITE)
- swp_entry = make_writable_device_private_entry(
- page_to_pfn(page));
- else
- swp_entry = make_readable_device_private_entry(
- page_to_pfn(page));
- entry = swp_entry_to_pte(swp_entry);
- } else {
- /*
- * For now we only support migrating to un-addressable
- * device memory.
- */
- pr_warn_once("Unsupported ZONE_DEVICE page type.\n");
- goto abort;
- }
+ if (vma->vm_flags & VM_WRITE)
+ swp_entry = make_writable_device_private_entry(
+ page_to_pfn(page));
+ else
+ swp_entry = make_readable_device_private_entry(
+ page_to_pfn(page));
+ entry = swp_entry_to_pte(swp_entry);
+ } else if (is_zone_device_page(page) &&
+ !is_device_coherent_page(page)) {
+ /*
+ * Only private and coherent device (ZONE_DEVICE)
+ * pages are supported as migration targets.
+ */
+ pr_warn_once("Unsupported ZONE_DEVICE page type.\n");
+ goto abort;
} else {
entry = mk_pte(page, vma->vm_page_prot);
if (vma->vm_flags & VM_WRITE)
@@ -2796,10 +2795,10 @@ void migrate_vma_pages(struct migrate_vma *migrate)
mapping = page_mapping(page);

if (is_zone_device_page(newpage)) {
- if (is_device_private_page(newpage)) {
+ if (is_dev_private_or_coherent_page(newpage)) {
/*
- * For now only support private anonymous when
- * migrating to un-addressable device memory.
+ * For now only support private and coherent
+ * anonymous when migrating to device memory.
*/
if (mapping) {
migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
diff --git a/mm/rmap.c b/mm/rmap.c
index 6a1e8c7f6213..f2bd5a3da72c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1835,7 +1835,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
/* Update high watermark before we lower rss */
update_hiwater_rss(mm);

- if (is_zone_device_page(page)) {
+ if (is_device_private_page(page)) {
unsigned long pfn = page_to_pfn(page);
swp_entry_t entry;
pte_t swp_pte;
@@ -1976,7 +1976,8 @@ void try_to_migrate(struct page *page, enum ttu_flags flags)
TTU_SYNC)))
return;

- if (is_zone_device_page(page) && !is_device_private_page(page))
+ if (is_zone_device_page(page) &&
+ !is_dev_private_or_coherent_page(page))
return;

/*
--
2.32.0

Subject: [PATCH v6 02/10] mm: add device coherent vma selection for memory migration

This case is used to migrate pages from device memory back to system
memory. Device coherent type memory is cache coherent from the device
and CPU point of view.
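
As an illustrative sketch only (not part of this patch), a driver would
select its own coherent pages for migration back to system memory
roughly like this; vma, the pfn arrays, the range and my_pgmap_owner
are placeholders, and the allocation/copy step is driver specific:

  /*
   * Hedged sketch: migrate this driver's coherent pages in [start, end)
   * back to system memory.
   */
  static int my_migrate_coherent_to_ram(struct vm_area_struct *vma,
                                        unsigned long start, unsigned long end,
                                        unsigned long *src_pfns,
                                        unsigned long *dst_pfns)
  {
          struct migrate_vma args = {
                  .vma         = vma,
                  .src         = src_pfns,
                  .dst         = dst_pfns,
                  .start       = start,
                  .end         = end,
                  .pgmap_owner = my_pgmap_owner,  /* owner used at memremap */
                  .flags       = MIGRATE_VMA_SELECT_DEVICE_COHERENT,
          };
          int ret;

          ret = migrate_vma_setup(&args);
          if (ret)
                  return ret;
          /* ... allocate system pages, copy the data, fill args.dst ... */
          migrate_vma_pages(&args);
          migrate_vma_finalize(&args);
          return 0;
  }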

Signed-off-by: Alex Sierra <[email protected]>
Acked-by: Felix Kuehling <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
---
v2:
Added a condition for migrations from device coherent pages.
v6:
Corrections to zero page handling, to make sure these are only
migrated when MIGRATE_VMA_SELECT_SYSTEM is set.
---
include/linux/migrate.h | 1 +
mm/migrate.c | 12 +++++++++---
2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index db96e10eb8da..66a34eae8cb6 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -130,6 +130,7 @@ static inline unsigned long migrate_pfn(unsigned long pfn)
enum migrate_vma_direction {
MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
+ MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
};

struct migrate_vma {
diff --git a/mm/migrate.c b/mm/migrate.c
index cd137aedcfe5..69c6830c47c6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2264,15 +2264,21 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
if (is_writable_device_private_entry(entry))
mpfn |= MIGRATE_PFN_WRITE;
} else {
- if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
- goto next;
pfn = pte_pfn(pte);
- if (is_zero_pfn(pfn)) {
+ if (is_zero_pfn(pfn) &&
+ (migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
mpfn = MIGRATE_PFN_MIGRATE;
migrate->cpages++;
goto next;
}
page = vm_normal_page(migrate->vma, addr, pte);
+ if (page && !is_zone_device_page(page) &&
+ !(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
+ goto next;
+ else if (page && is_device_coherent_page(page) &&
+ (!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_COHERENT) ||
+ page->pgmap->owner != migrate->pgmap_owner))
+ goto next;
mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
}
--
2.32.0

Subject: [PATCH v6 09/10] tools: update hmm-test to support device coherent type

Test cases such as migrate_fault and migrate_multiple were modified to
migrate explicitly from device to system memory, without relying on
page faults, when using the device coherent type.

The snapshot test case was updated to read the memory device type first
and, based on that, check the proper returned results. A
migrate_ping_pong test case was added to test explicit migration from
device to system memory for both private and coherent zone types.

Helpers to migrate from device to system memory and vice versa
were also added.

Signed-off-by: Alex Sierra <[email protected]>
Acked-by: Felix Kuehling <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
---
v2:
Set FIXTURE_VARIANT to add multiple device types to the FIXTURE. This
will run all the tests for each device type (private and coherent) in
case both exist when the hmm-test driver is probed.
v4:
Check for the number of pages successfully migrated from coherent
device to system at migrate_multiple test.
---
tools/testing/selftests/vm/hmm-tests.c | 123 ++++++++++++++++++++-----
1 file changed, 102 insertions(+), 21 deletions(-)

diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
index 203323967b50..84ec8c4a1dc7 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -44,6 +44,14 @@ struct hmm_buffer {
int fd;
uint64_t cpages;
uint64_t faults;
+ int zone_device_type;
+};
+
+enum {
+ HMM_PRIVATE_DEVICE_ONE,
+ HMM_PRIVATE_DEVICE_TWO,
+ HMM_COHERENCE_DEVICE_ONE,
+ HMM_COHERENCE_DEVICE_TWO,
};

#define TWOMEG (1 << 21)
@@ -60,6 +68,21 @@ FIXTURE(hmm)
unsigned int page_shift;
};

+FIXTURE_VARIANT(hmm)
+{
+ int device_number;
+};
+
+FIXTURE_VARIANT_ADD(hmm, hmm_device_private)
+{
+ .device_number = HMM_PRIVATE_DEVICE_ONE,
+};
+
+FIXTURE_VARIANT_ADD(hmm, hmm_device_coherent)
+{
+ .device_number = HMM_COHERENCE_DEVICE_ONE,
+};
+
FIXTURE(hmm2)
{
int fd0;
@@ -68,6 +91,24 @@ FIXTURE(hmm2)
unsigned int page_shift;
};

+FIXTURE_VARIANT(hmm2)
+{
+ int device_number0;
+ int device_number1;
+};
+
+FIXTURE_VARIANT_ADD(hmm2, hmm2_device_private)
+{
+ .device_number0 = HMM_PRIVATE_DEVICE_ONE,
+ .device_number1 = HMM_PRIVATE_DEVICE_TWO,
+};
+
+FIXTURE_VARIANT_ADD(hmm2, hmm2_device_coherent)
+{
+ .device_number0 = HMM_COHERENCE_DEVICE_ONE,
+ .device_number1 = HMM_COHERENCE_DEVICE_TWO,
+};
+
static int hmm_open(int unit)
{
char pathname[HMM_PATH_MAX];
@@ -81,12 +122,19 @@ static int hmm_open(int unit)
return fd;
}

+static bool hmm_is_coherent_type(int dev_num)
+{
+ return (dev_num >= HMM_COHERENCE_DEVICE_ONE);
+}
+
FIXTURE_SETUP(hmm)
{
self->page_size = sysconf(_SC_PAGE_SIZE);
self->page_shift = ffs(self->page_size) - 1;

- self->fd = hmm_open(0);
+ self->fd = hmm_open(variant->device_number);
+ if (self->fd < 0 && hmm_is_coherent_type(variant->device_number))
+ SKIP(exit(0), "DEVICE_COHERENT not available");
ASSERT_GE(self->fd, 0);
}

@@ -95,9 +143,11 @@ FIXTURE_SETUP(hmm2)
self->page_size = sysconf(_SC_PAGE_SIZE);
self->page_shift = ffs(self->page_size) - 1;

- self->fd0 = hmm_open(0);
+ self->fd0 = hmm_open(variant->device_number0);
+ if (self->fd0 < 0 && hmm_is_coherent_type(variant->device_number0))
+ SKIP(exit(0), "DEVICE_COHERENT not available");
ASSERT_GE(self->fd0, 0);
- self->fd1 = hmm_open(1);
+ self->fd1 = hmm_open(variant->device_number1);
ASSERT_GE(self->fd1, 0);
}

@@ -144,6 +194,7 @@ static int hmm_dmirror_cmd(int fd,
}
buffer->cpages = cmd.cpages;
buffer->faults = cmd.faults;
+ buffer->zone_device_type = cmd.zone_device_type;

return 0;
}
@@ -211,6 +262,20 @@ static void hmm_nanosleep(unsigned int n)
nanosleep(&t, NULL);
}

+static int hmm_migrate_sys_to_dev(int fd,
+ struct hmm_buffer *buffer,
+ unsigned long npages)
+{
+ return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE_TO_DEV, buffer, npages);
+}
+
+static int hmm_migrate_dev_to_sys(int fd,
+ struct hmm_buffer *buffer,
+ unsigned long npages)
+{
+ return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE_TO_SYS, buffer, npages);
+}
+
/*
* Simple NULL test of device open/close.
*/
@@ -875,7 +940,7 @@ TEST_F(hmm, migrate)
ptr[i] = i;

/* Migrate memory to device. */
- ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, npages);

@@ -923,7 +988,7 @@ TEST_F(hmm, migrate_fault)
ptr[i] = i;

/* Migrate memory to device. */
- ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, npages);

@@ -936,7 +1001,7 @@ TEST_F(hmm, migrate_fault)
ASSERT_EQ(ptr[i], i);

/* Migrate memory to the device again. */
- ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, npages);

@@ -976,7 +1041,7 @@ TEST_F(hmm, migrate_shared)
ASSERT_NE(buffer->ptr, MAP_FAILED);

/* Migrate memory to device. */
- ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
ASSERT_EQ(ret, -ENOENT);

hmm_buffer_free(buffer);
@@ -1015,7 +1080,7 @@ TEST_F(hmm2, migrate_mixed)
p = buffer->ptr;

/* Migrating a protected area should be an error. */
- ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, npages);
+ ret = hmm_migrate_sys_to_dev(self->fd1, buffer, npages);
ASSERT_EQ(ret, -EINVAL);

/* Punch a hole after the first page address. */
@@ -1023,7 +1088,7 @@ TEST_F(hmm2, migrate_mixed)
ASSERT_EQ(ret, 0);

/* We expect an error if the vma doesn't cover the range. */
- ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 3);
+ ret = hmm_migrate_sys_to_dev(self->fd1, buffer, 3);
ASSERT_EQ(ret, -EINVAL);

/* Page 2 will be a read-only zero page. */
@@ -1055,13 +1120,13 @@ TEST_F(hmm2, migrate_mixed)

/* Now try to migrate pages 2-5 to device 1. */
buffer->ptr = p + 2 * self->page_size;
- ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 4);
+ ret = hmm_migrate_sys_to_dev(self->fd1, buffer, 4);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, 4);

/* Page 5 won't be migrated to device 0 because it's on device 1. */
buffer->ptr = p + 5 * self->page_size;
- ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_MIGRATE, buffer, 1);
+ ret = hmm_migrate_sys_to_dev(self->fd0, buffer, 1);
ASSERT_EQ(ret, -ENOENT);
buffer->ptr = p;

@@ -1070,8 +1135,12 @@ TEST_F(hmm2, migrate_mixed)
}

/*
- * Migrate anonymous memory to device private memory and fault it back to system
- * memory multiple times.
+ * Migrate anonymous memory to device memory and back to system memory
+ * multiple times. For the private zone configuration this happens via
+ * CPU page faults. For the coherent zone configuration the pages must
+ * be migrated back to system memory explicitly, because coherent device
+ * memory is directly accessible by the CPU and therefore never
+ * generates a page fault.
*/
TEST_F(hmm, migrate_multiple)
{
@@ -1107,8 +1176,7 @@ TEST_F(hmm, migrate_multiple)
ptr[i] = i;

/* Migrate memory to device. */
- ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer,
- npages);
+ ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, npages);

@@ -1116,7 +1184,13 @@ TEST_F(hmm, migrate_multiple)
for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
ASSERT_EQ(ptr[i], i);

- /* Fault pages back to system memory and check them. */
+ /* Migrate back to system memory and check them. */
+ if (hmm_is_coherent_type(variant->device_number)) {
+ ret = hmm_migrate_dev_to_sys(self->fd, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+ }
+
for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
ASSERT_EQ(ptr[i], i);

@@ -1354,13 +1428,13 @@ TEST_F(hmm2, snapshot)

/* Page 5 will be migrated to device 0. */
buffer->ptr = p + 5 * self->page_size;
- ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_MIGRATE, buffer, 1);
+ ret = hmm_migrate_sys_to_dev(self->fd0, buffer, 1);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, 1);

/* Page 6 will be migrated to device 1. */
buffer->ptr = p + 6 * self->page_size;
- ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 1);
+ ret = hmm_migrate_sys_to_dev(self->fd1, buffer, 1);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, 1);

@@ -1377,9 +1451,16 @@ TEST_F(hmm2, snapshot)
ASSERT_EQ(m[2], HMM_DMIRROR_PROT_ZERO | HMM_DMIRROR_PROT_READ);
ASSERT_EQ(m[3], HMM_DMIRROR_PROT_READ);
ASSERT_EQ(m[4], HMM_DMIRROR_PROT_WRITE);
- ASSERT_EQ(m[5], HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL |
- HMM_DMIRROR_PROT_WRITE);
- ASSERT_EQ(m[6], HMM_DMIRROR_PROT_NONE);
+ if (!hmm_is_coherent_type(variant->device_number0)) {
+ ASSERT_EQ(m[5], HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL |
+ HMM_DMIRROR_PROT_WRITE);
+ ASSERT_EQ(m[6], HMM_DMIRROR_PROT_NONE);
+ } else {
+ ASSERT_EQ(m[5], HMM_DMIRROR_PROT_DEV_COHERENT_LOCAL |
+ HMM_DMIRROR_PROT_WRITE);
+ ASSERT_EQ(m[6], HMM_DMIRROR_PROT_DEV_COHERENT_REMOTE |
+ HMM_DMIRROR_PROT_WRITE);
+ }

hmm_buffer_free(buffer);
}
--
2.32.0

Subject: [PATCH v6 07/10] lib: test_hmm add module param for zone device type

In order to configure the device coherent type in test_hmm, two module
parameters have to be passed, spm_addr_dev0 & spm_addr_dev1, which
correspond to the SP start address of each of the two devices. If no
parameters are passed, the private device type is configured.

Signed-off-by: Alex Sierra <[email protected]>
Acked-by: Felix Kuehling <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
---
lib/test_hmm.c | 73 ++++++++++++++++++++++++++++++++-------------
lib/test_hmm_uapi.h | 1 +
2 files changed, 53 insertions(+), 21 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 556bd80ce22e..c7f8d00e7b95 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -34,6 +34,16 @@
#define DEVMEM_CHUNK_SIZE (256 * 1024 * 1024U)
#define DEVMEM_CHUNKS_RESERVE 16

+static unsigned long spm_addr_dev0;
+module_param(spm_addr_dev0, long, 0644);
+MODULE_PARM_DESC(spm_addr_dev0,
+ "Specify start address for SPM (special purpose memory) used for device 0. By setting this Coherent device type will be used. Make sure spm_addr_dev1 is set too. Minimum SPM size should be DEVMEM_CHUNK_SIZE.");
+
+static unsigned long spm_addr_dev1;
+module_param(spm_addr_dev1, long, 0644);
+MODULE_PARM_DESC(spm_addr_dev1,
+ "Specify start address for SPM (special purpose memory) used for device 1. By setting this Coherent device type will be used. Make sure spm_addr_dev0 is set too. Minimum SPM size should be DEVMEM_CHUNK_SIZE.");
+
static const struct dev_pagemap_ops dmirror_devmem_ops;
static const struct mmu_interval_notifier_ops dmirror_min_ops;
static dev_t dmirror_dev;
@@ -452,28 +462,44 @@ static int dmirror_write(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd)
return ret;
}

-static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
+static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
struct page **ppage)
{
struct dmirror_chunk *devmem;
- struct resource *res;
+ struct resource *res = NULL;
unsigned long pfn;
unsigned long pfn_first;
unsigned long pfn_last;
void *ptr;
+ int ret = -ENOMEM;

devmem = kzalloc(sizeof(*devmem), GFP_KERNEL);
if (!devmem)
- return false;
+ return ret;

- res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
- "hmm_dmirror");
- if (IS_ERR(res))
+ switch (mdevice->zone_device_type) {
+ case HMM_DMIRROR_MEMORY_DEVICE_PRIVATE:
+ res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
+ "hmm_dmirror");
+ if (IS_ERR_OR_NULL(res))
+ goto err_devmem;
+ devmem->pagemap.range.start = res->start;
+ devmem->pagemap.range.end = res->end;
+ devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+ break;
+ case HMM_DMIRROR_MEMORY_DEVICE_COHERENT:
+ devmem->pagemap.range.start = (MINOR(mdevice->cdevice.dev) - 2) ?
+ spm_addr_dev0 :
+ spm_addr_dev1;
+ devmem->pagemap.range.end = devmem->pagemap.range.start +
+ DEVMEM_CHUNK_SIZE - 1;
+ devmem->pagemap.type = MEMORY_DEVICE_COHERENT;
+ break;
+ default:
+ ret = -EINVAL;
goto err_devmem;
+ }

- devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
- devmem->pagemap.range.start = res->start;
- devmem->pagemap.range.end = res->end;
devmem->pagemap.nr_range = 1;
devmem->pagemap.ops = &dmirror_devmem_ops;
devmem->pagemap.owner = mdevice;
@@ -494,10 +520,14 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
mdevice->devmem_capacity = new_capacity;
mdevice->devmem_chunks = new_chunks;
}
-
ptr = memremap_pages(&devmem->pagemap, numa_node_id());
- if (IS_ERR(ptr))
+ if (IS_ERR_OR_NULL(ptr)) {
+ if (ptr)
+ ret = PTR_ERR(ptr);
+ else
+ ret = -EFAULT;
goto err_release;
+ }

devmem->mdevice = mdevice;
pfn_first = devmem->pagemap.range.start >> PAGE_SHIFT;
@@ -526,15 +556,17 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
}
spin_unlock(&mdevice->lock);

- return true;
+ return 0;

err_release:
mutex_unlock(&mdevice->devmem_lock);
- release_mem_region(devmem->pagemap.range.start, range_len(&devmem->pagemap.range));
+ if (res && devmem->pagemap.type == MEMORY_DEVICE_PRIVATE)
+ release_mem_region(devmem->pagemap.range.start,
+ range_len(&devmem->pagemap.range));
err_devmem:
kfree(devmem);

- return false;
+ return ret;
}

static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
@@ -559,7 +591,7 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
spin_unlock(&mdevice->lock);
} else {
spin_unlock(&mdevice->lock);
- if (!dmirror_allocate_chunk(mdevice, &dpage))
+ if (dmirror_allocate_chunk(mdevice, &dpage))
goto error;
}

@@ -1242,10 +1274,8 @@ static int dmirror_device_init(struct dmirror_device *mdevice, int id)
if (ret)
return ret;

- /* Build a list of free ZONE_DEVICE private struct pages */
- dmirror_allocate_chunk(mdevice, NULL);
-
- return 0;
+ /* Build a list of free ZONE_DEVICE struct pages */
+ return dmirror_allocate_chunk(mdevice, NULL);
}

static void dmirror_device_remove(struct dmirror_device *mdevice)
@@ -1258,8 +1288,9 @@ static void dmirror_device_remove(struct dmirror_device *mdevice)
mdevice->devmem_chunks[i];

memunmap_pages(&devmem->pagemap);
- release_mem_region(devmem->pagemap.range.start,
- range_len(&devmem->pagemap.range));
+ if (devmem->pagemap.type == MEMORY_DEVICE_PRIVATE)
+ release_mem_region(devmem->pagemap.range.start,
+ range_len(&devmem->pagemap.range));
kfree(devmem);
}
kfree(mdevice->devmem_chunks);
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index 17f842f1aa02..625f3690d086 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -68,6 +68,7 @@ enum {
enum {
/* 0 is reserved to catch uninitialized type fields */
HMM_DMIRROR_MEMORY_DEVICE_PRIVATE = 1,
+ HMM_DMIRROR_MEMORY_DEVICE_COHERENT,
};

#endif /* _LIB_TEST_HMM_UAPI_H */
--
2.32.0

Subject: [PATCH v6 05/10] drm/amdkfd: coherent type as sys mem on migration to ram

For migration from VRAM to RAM, coherent device type memory has the
same access semantics as system RAM from the CPU's point of view. The
migrate.flags field selects the source pages; in the coherent type case
it should be set to MIGRATE_VMA_SELECT_DEVICE_COHERENT.

Signed-off-by: Alex Sierra <[email protected]>
Reviewed-by: Felix Kuehling <[email protected]>
---
drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 5e8d944d359e..846ba55723fb 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -659,9 +659,12 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
migrate.vma = vma;
migrate.start = start;
migrate.end = end;
- migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
migrate.pgmap_owner = SVM_ADEV_PGMAP_OWNER(adev);

+ if (adev->gmc.xgmi.connected_to_cpu)
+ migrate.flags = MIGRATE_VMA_SELECT_DEVICE_COHERENT;
+ else
+ migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
size = 2 * sizeof(*migrate.src) + sizeof(uint64_t) + sizeof(dma_addr_t);
size *= npages;
buf = kvmalloc(size, GFP_KERNEL | __GFP_ZERO);
--
2.32.0

Subject: [PATCH v6 03/10] mm/gup: fail get_user_pages for LONGTERM dev coherent type

Avoid long term pinning for coherent device type pages, as it could
interfere with their own device memory manager. For now, we just return
an error when FOLL_LONGTERM is requested for coherent device type
pages. Eventually, these pages will be migrated to system memory
instead, once support for migrating device pages is added.
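
As an illustrative sketch (not code from this patch), an in-kernel
long-term pin of a range backed by coherent device pages is now
expected to fail; "addr" here is a placeholder for a user virtual
address backed by coherent device memory:

  struct page *pages[1];
  long ret;

  /* Hedged sketch: FOLL_LONGTERM pin of a DEVICE_COHERENT-backed page. */
  ret = pin_user_pages(addr, 1, FOLL_LONGTERM | FOLL_WRITE, pages, NULL);
  if (ret < 0)
          return ret;     /* expected: -EFAULT for coherent device pages */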

Signed-off-by: Alex Sierra <[email protected]>
Acked-by: Felix Kuehling <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
---
mm/gup.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index f0af462ac1e2..f596b932d7d7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1864,6 +1864,12 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
* If we get a movable page, since we are going to be pinning
* these entries, try to move them out if possible.
*/
+ if (is_dev_private_or_coherent_page(head)) {
+ WARN_ON_ONCE(is_device_private_page(head));
+ ret = -EFAULT;
+ goto unpin_pages;
+ }
+
if (!is_pinnable_page(head)) {
if (PageHuge(head)) {
if (!isolate_huge_page(head, &movable_page_list))
@@ -1894,6 +1900,7 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
if (list_empty(&movable_page_list) && !isolation_error_count)
return nr_pages;

+unpin_pages:
if (gup_flags & FOLL_PIN) {
unpin_user_pages(pages, nr_pages);
} else {
--
2.32.0

Subject: [PATCH v6 08/10] lib: add support for device coherent type in test_hmm

The device coherent type uses device memory that is coherently
accessible by the CPU. Such memory shows up as an SP (special purpose)
memory range in the BIOS-e820 memory enumeration. If no SP memory is
supported by the system, it can be faked by setting
CONFIG_EFI_FAKE_MEMMAP.

Currently, test_hmm only supports two different SP ranges of at least
256MB each. These can be specified through the kernel parameter
efi_fake_mem. For example, two SP ranges of 1GB starting at physical
addresses 0x100000000 and 0x140000000:
efi_fake_mem=1G@0x100000000:0x40000,1G@0x140000000:0x40000

Private and coherent device mirror instances can be created in the same
driver probe. This is done by passing the module parameters
spm_addr_dev0 & spm_addr_dev1. In this case, it will create four
instances of device_mirror. The first two correspond to the private
device type, the last two to the coherent type. They can then be easily
accessed from user space through /dev/hmm_dmirror<num_device>. Usually
num_device 0 and 1 are for private, and 2 and 3 for coherent types. If
no module parameters are passed, only two instances of the private type
device_mirror will be created.

Signed-off-by: Alex Sierra <[email protected]>
Acked-by: Felix Kuehling <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
---
v4:
Return number of coherent device pages successfully migrated to system.
This is returned at cmd->cpages.
---
lib/test_hmm.c | 260 +++++++++++++++++++++++++++++++++-----------
lib/test_hmm_uapi.h | 15 ++-
2 files changed, 205 insertions(+), 70 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index c7f8d00e7b95..dedce7908ac6 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -29,11 +29,22 @@

#include "test_hmm_uapi.h"

-#define DMIRROR_NDEVICES 2
+#define DMIRROR_NDEVICES 4
#define DMIRROR_RANGE_FAULT_TIMEOUT 1000
#define DEVMEM_CHUNK_SIZE (256 * 1024 * 1024U)
#define DEVMEM_CHUNKS_RESERVE 16

+/*
+ * For device_private pages, dpage is just a dummy struct page
+ * representing a piece of device memory. dmirror_devmem_alloc_page
+ * allocates a real system memory page as backing storage to fake a
+ * real device. zone_device_data points to that backing page. But
+ * for device_coherent memory, the struct page represents real
+ * physical CPU-accessible memory that we can use directly.
+ */
+#define BACKING_PAGE(page) (is_device_private_page((page)) ? \
+ (page)->zone_device_data : (page))
+
static unsigned long spm_addr_dev0;
module_param(spm_addr_dev0, long, 0644);
MODULE_PARM_DESC(spm_addr_dev0,
@@ -122,6 +133,21 @@ static int dmirror_bounce_init(struct dmirror_bounce *bounce,
return 0;
}

+static bool dmirror_is_private_zone(struct dmirror_device *mdevice)
+{
+ return (mdevice->zone_device_type ==
+ HMM_DMIRROR_MEMORY_DEVICE_PRIVATE) ? true : false;
+}
+
+static enum migrate_vma_direction
+ dmirror_select_device(struct dmirror *dmirror)
+{
+ return (dmirror->mdevice->zone_device_type ==
+ HMM_DMIRROR_MEMORY_DEVICE_PRIVATE) ?
+ MIGRATE_VMA_SELECT_DEVICE_PRIVATE :
+ MIGRATE_VMA_SELECT_DEVICE_COHERENT;
+}
+
static void dmirror_bounce_fini(struct dmirror_bounce *bounce)
{
vfree(bounce->ptr);
@@ -572,16 +598,19 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
{
struct page *dpage = NULL;
- struct page *rpage;
+ struct page *rpage = NULL;

/*
- * This is a fake device so we alloc real system memory to store
- * our device memory.
+ * For ZONE_DEVICE private type, this is a fake device so we alloc real
+ * system memory to store our device memory.
+ * For ZONE_DEVICE coherent type we use the actual dpage to store the data
+ * and ignore rpage.
*/
- rpage = alloc_page(GFP_HIGHUSER);
- if (!rpage)
- return NULL;
-
+ if (dmirror_is_private_zone(mdevice)) {
+ rpage = alloc_page(GFP_HIGHUSER);
+ if (!rpage)
+ return NULL;
+ }
spin_lock(&mdevice->lock);

if (mdevice->free_pages) {
@@ -601,7 +630,8 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
return dpage;

error:
- __free_page(rpage);
+ if (rpage)
+ __free_page(rpage);
return NULL;
}

@@ -627,12 +657,16 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
* unallocated pte_none() or read-only zero page.
*/
spage = migrate_pfn_to_page(*src);
+ if (WARN(spage && is_zone_device_page(spage),
+ "page already in device spage pfn: 0x%lx\n",
+ page_to_pfn(spage)))
+ continue;

dpage = dmirror_devmem_alloc_page(mdevice);
if (!dpage)
continue;

- rpage = dpage->zone_device_data;
+ rpage = BACKING_PAGE(dpage);
if (spage)
copy_highpage(rpage, spage);
else
@@ -646,6 +680,8 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
*/
rpage->zone_device_data = dmirror;

+ pr_debug("migrating from sys to dev pfn src: 0x%lx pfn dst: 0x%lx\n",
+ page_to_pfn(spage), page_to_pfn(dpage));
*dst = migrate_pfn(page_to_pfn(dpage));
if ((*src & MIGRATE_PFN_WRITE) ||
(!spage && args->vma->vm_flags & VM_WRITE))
@@ -723,11 +759,7 @@ static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
if (!dpage)
continue;

- /*
- * Store the page that holds the data so the page table
- * doesn't have to deal with ZONE_DEVICE private pages.
- */
- entry = dpage->zone_device_data;
+ entry = BACKING_PAGE(dpage);
if (*dst & MIGRATE_PFN_WRITE)
entry = xa_tag_pointer(entry, DPT_XA_TAG_WRITE);
entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC);
@@ -807,15 +839,124 @@ static int dmirror_exclusive(struct dmirror *dmirror,
return ret;
}

-static int dmirror_migrate(struct dmirror *dmirror,
- struct hmm_dmirror_cmd *cmd)
+static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
+ struct dmirror *dmirror)
+{
+ const unsigned long *src = args->src;
+ unsigned long *dst = args->dst;
+ unsigned long start = args->start;
+ unsigned long end = args->end;
+ unsigned long addr;
+
+ for (addr = start; addr < end; addr += PAGE_SIZE,
+ src++, dst++) {
+ struct page *dpage, *spage;
+
+ spage = migrate_pfn_to_page(*src);
+ if (!spage || !(*src & MIGRATE_PFN_MIGRATE))
+ continue;
+
+ if (WARN_ON(!is_dev_private_or_coherent_page(spage)))
+ continue;
+ spage = BACKING_PAGE(spage);
+ dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
+ if (!dpage)
+ continue;
+ pr_debug("migrating from dev to sys pfn src: 0x%lx pfn dst: 0x%lx\n",
+ page_to_pfn(spage), page_to_pfn(dpage));
+
+ lock_page(dpage);
+ xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
+ copy_highpage(dpage, spage);
+ *dst = migrate_pfn(page_to_pfn(dpage));
+ if (*src & MIGRATE_PFN_WRITE)
+ *dst |= MIGRATE_PFN_WRITE;
+ }
+ return 0;
+}
+
+static unsigned long dmirror_successful_migrated_pages(struct migrate_vma *migrate)
+{
+ unsigned long cpages = 0;
+ unsigned long i;
+
+ for (i = 0; i < migrate->npages; i++) {
+ if (migrate->src[i] & MIGRATE_PFN_VALID &&
+ migrate->src[i] & MIGRATE_PFN_MIGRATE)
+ cpages++;
+ }
+ return cpages;
+}
+
+static int dmirror_migrate_to_system(struct dmirror *dmirror,
+ struct hmm_dmirror_cmd *cmd)
+{
+ unsigned long start, end, addr;
+ unsigned long size = cmd->npages << PAGE_SHIFT;
+ struct mm_struct *mm = dmirror->notifier.mm;
+ struct vm_area_struct *vma;
+ unsigned long src_pfns[64] = { 0 };
+ unsigned long dst_pfns[64] = { 0 };
+ struct migrate_vma args;
+ unsigned long next;
+ int ret;
+
+ start = cmd->addr;
+ end = start + size;
+ if (end < start)
+ return -EINVAL;
+
+ /* Since the mm is for the mirrored process, get a reference first. */
+ if (!mmget_not_zero(mm))
+ return -EINVAL;
+
+ cmd->cpages = 0;
+ mmap_read_lock(mm);
+ for (addr = start; addr < end; addr = next) {
+ vma = vma_lookup(mm, addr);
+ if (!vma || !(vma->vm_flags & VM_READ)) {
+ ret = -EINVAL;
+ goto out;
+ }
+ next = min(end, addr + (ARRAY_SIZE(src_pfns) << PAGE_SHIFT));
+ if (next > vma->vm_end)
+ next = vma->vm_end;
+
+ args.vma = vma;
+ args.src = src_pfns;
+ args.dst = dst_pfns;
+ args.start = addr;
+ args.end = next;
+ args.pgmap_owner = dmirror->mdevice;
+ args.flags = dmirror_select_device(dmirror);
+
+ ret = migrate_vma_setup(&args);
+ if (ret)
+ goto out;
+
+ pr_debug("Migrating from device mem to sys mem\n");
+ dmirror_devmem_fault_alloc_and_copy(&args, dmirror);
+
+ migrate_vma_pages(&args);
+ cmd->cpages += dmirror_successful_migrated_pages(&args);
+ migrate_vma_finalize(&args);
+ }
+out:
+ mmap_read_unlock(mm);
+ mmput(mm);
+
+ return ret;
+}
+
+static int dmirror_migrate_to_device(struct dmirror *dmirror,
+ struct hmm_dmirror_cmd *cmd)
{
unsigned long start, end, addr;
unsigned long size = cmd->npages << PAGE_SHIFT;
struct mm_struct *mm = dmirror->notifier.mm;
struct vm_area_struct *vma;
- unsigned long src_pfns[64];
- unsigned long dst_pfns[64];
+ unsigned long src_pfns[64] = { 0 };
+ unsigned long dst_pfns[64] = { 0 };
struct dmirror_bounce bounce;
struct migrate_vma args;
unsigned long next;
@@ -852,6 +993,7 @@ static int dmirror_migrate(struct dmirror *dmirror,
if (ret)
goto out;

+ pr_debug("Migrating from sys mem to device mem\n");
dmirror_migrate_alloc_and_copy(&args, dmirror);
migrate_vma_pages(&args);
dmirror_migrate_finalize_and_map(&args, dmirror);
@@ -860,7 +1002,7 @@ static int dmirror_migrate(struct dmirror *dmirror,
mmap_read_unlock(mm);
mmput(mm);

- /* Return the migrated data for verification. */
+ /* Return the migrated data for verification. only for pages in device zone */
ret = dmirror_bounce_init(&bounce, start, size);
if (ret)
return ret;
@@ -897,12 +1039,22 @@ static void dmirror_mkentry(struct dmirror *dmirror, struct hmm_range *range,
}

page = hmm_pfn_to_page(entry);
- if (is_device_private_page(page)) {
- /* Is the page migrated to this device or some other? */
- if (dmirror->mdevice == dmirror_page_to_device(page))
+ if (is_dev_private_or_coherent_page(page)) {
+ /* Is page ZONE_DEVICE coherent? */
+ if (is_device_coherent_page(page)) {
+ if (dmirror->mdevice == dmirror_page_to_device(page))
+ *perm = HMM_DMIRROR_PROT_DEV_COHERENT_LOCAL;
+ else
+ *perm = HMM_DMIRROR_PROT_DEV_COHERENT_REMOTE;
+ /*
+ * Is page ZONE_DEVICE private migrated to
+ * this device or some other?
+ */
+ } else if (dmirror->mdevice == dmirror_page_to_device(page)) {
*perm = HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL;
- else
+ } else {
*perm = HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE;
+ }
} else if (is_zero_pfn(page_to_pfn(page)))
*perm = HMM_DMIRROR_PROT_ZERO;
else
@@ -1099,8 +1251,12 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
ret = dmirror_write(dmirror, &cmd);
break;

- case HMM_DMIRROR_MIGRATE:
- ret = dmirror_migrate(dmirror, &cmd);
+ case HMM_DMIRROR_MIGRATE_TO_DEV:
+ ret = dmirror_migrate_to_device(dmirror, &cmd);
+ break;
+
+ case HMM_DMIRROR_MIGRATE_TO_SYS:
+ ret = dmirror_migrate_to_system(dmirror, &cmd);
break;

case HMM_DMIRROR_EXCLUSIVE:
@@ -1165,14 +1321,13 @@ static const struct file_operations dmirror_fops = {

static void dmirror_devmem_free(struct page *page)
{
- struct page *rpage = page->zone_device_data;
+ struct page *rpage = BACKING_PAGE(page);
struct dmirror_device *mdevice;

- if (rpage)
+ if (rpage != page)
__free_page(rpage);

mdevice = dmirror_page_to_device(page);
-
spin_lock(&mdevice->lock);
mdevice->cfree++;
page->zone_device_data = mdevice->free_pages;
@@ -1180,43 +1335,11 @@ static void dmirror_devmem_free(struct page *page)
spin_unlock(&mdevice->lock);
}

-static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
- struct dmirror *dmirror)
-{
- const unsigned long *src = args->src;
- unsigned long *dst = args->dst;
- unsigned long start = args->start;
- unsigned long end = args->end;
- unsigned long addr;
-
- for (addr = start; addr < end; addr += PAGE_SIZE,
- src++, dst++) {
- struct page *dpage, *spage;
-
- spage = migrate_pfn_to_page(*src);
- if (!spage || !(*src & MIGRATE_PFN_MIGRATE))
- continue;
- spage = spage->zone_device_data;
-
- dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
- if (!dpage)
- continue;
-
- lock_page(dpage);
- xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
- copy_highpage(dpage, spage);
- *dst = migrate_pfn(page_to_pfn(dpage));
- if (*src & MIGRATE_PFN_WRITE)
- *dst |= MIGRATE_PFN_WRITE;
- }
- return 0;
-}
-
static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
{
struct migrate_vma args;
- unsigned long src_pfns;
- unsigned long dst_pfns;
+ unsigned long src_pfns = 0;
+ unsigned long dst_pfns = 0;
struct page *rpage;
struct dmirror *dmirror;
vm_fault_t ret;
@@ -1236,7 +1359,7 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
args.src = &src_pfns;
args.dst = &dst_pfns;
args.pgmap_owner = dmirror->mdevice;
- args.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
+ args.flags = dmirror_select_device(dmirror);

if (migrate_vma_setup(&args))
return VM_FAULT_SIGBUS;
@@ -1315,6 +1438,12 @@ static int __init hmm_dmirror_init(void)
HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
dmirror_devices[ndevices++].zone_device_type =
HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
+ if (spm_addr_dev0 && spm_addr_dev1) {
+ dmirror_devices[ndevices++].zone_device_type =
+ HMM_DMIRROR_MEMORY_DEVICE_COHERENT;
+ dmirror_devices[ndevices++].zone_device_type =
+ HMM_DMIRROR_MEMORY_DEVICE_COHERENT;
+ }
for (id = 0; id < ndevices; id++) {
ret = dmirror_device_init(dmirror_devices + id, id);
if (ret)
@@ -1337,7 +1466,8 @@ static void __exit hmm_dmirror_exit(void)
int id;

for (id = 0; id < DMIRROR_NDEVICES; id++)
- dmirror_device_remove(dmirror_devices + id);
+ if (dmirror_devices[id].zone_device_type)
+ dmirror_device_remove(dmirror_devices + id);
unregister_chrdev_region(dmirror_dev, DMIRROR_NDEVICES);
}

diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index 625f3690d086..e190b2ab6f19 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -33,11 +33,12 @@ struct hmm_dmirror_cmd {
/* Expose the address space of the calling process through hmm device file */
#define HMM_DMIRROR_READ _IOWR('H', 0x00, struct hmm_dmirror_cmd)
#define HMM_DMIRROR_WRITE _IOWR('H', 0x01, struct hmm_dmirror_cmd)
-#define HMM_DMIRROR_MIGRATE _IOWR('H', 0x02, struct hmm_dmirror_cmd)
-#define HMM_DMIRROR_SNAPSHOT _IOWR('H', 0x03, struct hmm_dmirror_cmd)
-#define HMM_DMIRROR_EXCLUSIVE _IOWR('H', 0x04, struct hmm_dmirror_cmd)
-#define HMM_DMIRROR_CHECK_EXCLUSIVE _IOWR('H', 0x05, struct hmm_dmirror_cmd)
-#define HMM_DMIRROR_GET_MEM_DEV_TYPE _IOWR('H', 0x06, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_MIGRATE_TO_DEV _IOWR('H', 0x02, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_MIGRATE_TO_SYS _IOWR('H', 0x03, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_SNAPSHOT _IOWR('H', 0x04, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_EXCLUSIVE _IOWR('H', 0x05, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_CHECK_EXCLUSIVE _IOWR('H', 0x06, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_GET_MEM_DEV_TYPE _IOWR('H', 0x07, struct hmm_dmirror_cmd)

/*
* Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
@@ -52,6 +53,8 @@ struct hmm_dmirror_cmd {
* device the ioctl() is made
* HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE: Migrated device private page on some
* other device
+ * HMM_DMIRROR_PROT_DEV_COHERENT: Migrated device coherent page on the device
+ * the ioctl() is made
*/
enum {
HMM_DMIRROR_PROT_ERROR = 0xFF,
@@ -63,6 +66,8 @@ enum {
HMM_DMIRROR_PROT_ZERO = 0x10,
HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL = 0x20,
HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE = 0x30,
+ HMM_DMIRROR_PROT_DEV_COHERENT_LOCAL = 0x40,
+ HMM_DMIRROR_PROT_DEV_COHERENT_REMOTE = 0x50,
};

enum {
--
2.32.0

Subject: [PATCH v6 06/10] lib: test_hmm add ioctl to get zone device type

A new ioctl command is added to query the zone device type. This will
be used once test_hmm adds the zone device coherent type.
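
For reference, a rough (illustrative) user-space snippet querying the
type through the new command; "fd" is an open test_hmm device file
descriptor, ioctl/printf headers and test_hmm_uapi.h are assumed, and
at this point in the series only the private value exists (the coherent
value is added later):

  struct hmm_dmirror_cmd cmd = { 0 };

  /* Illustrative only: ask the mirror device what memory it exposes. */
  if (ioctl(fd, HMM_DMIRROR_GET_MEM_DEV_TYPE, &cmd) == 0 &&
      cmd.zone_device_type == HMM_DMIRROR_MEMORY_DEVICE_PRIVATE)
          printf("device mirrors ZONE_DEVICE private memory\n");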

Signed-off-by: Alex Sierra <[email protected]>
Acked-by: Felix Kuehling <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
---
lib/test_hmm.c | 23 +++++++++++++++++++++--
lib/test_hmm_uapi.h | 8 ++++++++
2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 767538089a62..556bd80ce22e 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -84,6 +84,7 @@ struct dmirror_chunk {
struct dmirror_device {
struct cdev cdevice;
struct hmm_devmem *devmem;
+ unsigned int zone_device_type;

unsigned int devmem_capacity;
unsigned int devmem_count;
@@ -1024,6 +1025,15 @@ static int dmirror_snapshot(struct dmirror *dmirror,
return ret;
}

+static int dmirror_get_device_type(struct dmirror *dmirror,
+ struct hmm_dmirror_cmd *cmd)
+{
+ mutex_lock(&dmirror->mutex);
+ cmd->zone_device_type = dmirror->mdevice->zone_device_type;
+ mutex_unlock(&dmirror->mutex);
+
+ return 0;
+}
static long dmirror_fops_unlocked_ioctl(struct file *filp,
unsigned int command,
unsigned long arg)
@@ -1074,6 +1084,9 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
ret = dmirror_snapshot(dmirror, &cmd);
break;

+ case HMM_DMIRROR_GET_MEM_DEV_TYPE:
+ ret = dmirror_get_device_type(dmirror, &cmd);
+ break;
default:
return -EINVAL;
}
@@ -1258,14 +1271,20 @@ static void dmirror_device_remove(struct dmirror_device *mdevice)
static int __init hmm_dmirror_init(void)
{
int ret;
- int id;
+ int id = 0;
+ int ndevices = 0;

ret = alloc_chrdev_region(&dmirror_dev, 0, DMIRROR_NDEVICES,
"HMM_DMIRROR");
if (ret)
goto err_unreg;

- for (id = 0; id < DMIRROR_NDEVICES; id++) {
+ memset(dmirror_devices, 0, DMIRROR_NDEVICES * sizeof(dmirror_devices[0]));
+ dmirror_devices[ndevices++].zone_device_type =
+ HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
+ dmirror_devices[ndevices++].zone_device_type =
+ HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
+ for (id = 0; id < ndevices; id++) {
ret = dmirror_device_init(dmirror_devices + id, id);
if (ret)
goto err_chrdev;
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index f14dea5dcd06..17f842f1aa02 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -19,6 +19,7 @@
* @npages: (in) number of pages to read/write
* @cpages: (out) number of pages copied
* @faults: (out) number of device page faults seen
+ * @zone_device_type: (out) zone device memory type
*/
struct hmm_dmirror_cmd {
__u64 addr;
@@ -26,6 +27,7 @@ struct hmm_dmirror_cmd {
__u64 npages;
__u64 cpages;
__u64 faults;
+ __u64 zone_device_type;
};

/* Expose the address space of the calling process through hmm device file */
@@ -35,6 +37,7 @@ struct hmm_dmirror_cmd {
#define HMM_DMIRROR_SNAPSHOT _IOWR('H', 0x03, struct hmm_dmirror_cmd)
#define HMM_DMIRROR_EXCLUSIVE _IOWR('H', 0x04, struct hmm_dmirror_cmd)
#define HMM_DMIRROR_CHECK_EXCLUSIVE _IOWR('H', 0x05, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_GET_MEM_DEV_TYPE _IOWR('H', 0x06, struct hmm_dmirror_cmd)

/*
* Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
@@ -62,4 +65,9 @@ enum {
HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE = 0x30,
};

+enum {
+ /* 0 is reserved to catch uninitialized type fields */
+ HMM_DMIRROR_MEMORY_DEVICE_PRIVATE = 1,
+};
+
#endif /* _LIB_TEST_HMM_UAPI_H */
--
2.32.0

Subject: [PATCH v6 10/10] tools: update test_hmm script to support SP config

Add two more parameters to set the spm_addr_dev0 & spm_addr_dev1
addresses. These two parameters configure the SP start addresses for
each device in the test_hmm driver and, consequently, configure the
zone device type as coherent.

Signed-off-by: Alex Sierra <[email protected]>
Acked-by: Felix Kuehling <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
---
v2:
Add more mknods for device coherent type. These are represented under
/dev/hmm_mirror2 and /dev/hmm_mirror3, only if they were created when
probing the hmm-test driver.
---
tools/testing/selftests/vm/test_hmm.sh | 24 +++++++++++++++++++++---
1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/vm/test_hmm.sh b/tools/testing/selftests/vm/test_hmm.sh
index 0647b525a625..539c9371e592 100755
--- a/tools/testing/selftests/vm/test_hmm.sh
+++ b/tools/testing/selftests/vm/test_hmm.sh
@@ -40,11 +40,26 @@ check_test_requirements()
 
 load_driver()
 {
-	modprobe $DRIVER > /dev/null 2>&1
+	if [ $# -eq 0 ]; then
+		modprobe $DRIVER > /dev/null 2>&1
+	else
+		if [ $# -eq 2 ]; then
+			modprobe $DRIVER spm_addr_dev0=$1 spm_addr_dev1=$2
+				> /dev/null 2>&1
+		else
+			echo "Missing module parameters. Make sure to pass"\
+			"spm_addr_dev0 and spm_addr_dev1"
+			usage
+		fi
+	fi
 	if [ $? == 0 ]; then
 		major=$(awk "\$2==\"HMM_DMIRROR\" {print \$1}" /proc/devices)
 		mknod /dev/hmm_dmirror0 c $major 0
 		mknod /dev/hmm_dmirror1 c $major 1
+		if [ $# -eq 2 ]; then
+			mknod /dev/hmm_dmirror2 c $major 2
+			mknod /dev/hmm_dmirror3 c $major 3
+		fi
 	fi
 }
 
@@ -58,7 +73,7 @@ run_smoke()
 {
 	echo "Running smoke test. Note, this test provides basic coverage."
 
-	load_driver
+	load_driver $1 $2
 	$(dirname "${BASH_SOURCE[0]}")/hmm-tests
 	unload_driver
 }
 
@@ -75,6 +90,9 @@ usage()
 	echo "# Smoke testing"
 	echo "./${TEST_NAME}.sh smoke"
 	echo
+	echo "# Smoke testing with SPM enabled"
+	echo "./${TEST_NAME}.sh smoke <spm_addr_dev0> <spm_addr_dev1>"
+	echo
 	exit 0
 }
 
@@ -84,7 +102,7 @@ function run_test()
 		usage
 	else
		if [ "$1" = "smoke" ]; then
-			run_smoke
+			run_smoke $2 $3
 		else
 			usage
 		fi
--
2.32.0

2022-02-11 16:42:21

by David Hildenbrand

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

On 01.02.22 16:48, Alex Sierra wrote:
> Device memory that is cache coherent from device and CPU point of view.
> This is used on platforms that have an advanced system bus (like CAPI
> or CXL). Any page of a process can be migrated to such memory. However,
> no one should be allowed to pin such memory so that it can always be
> evicted.
>
> Signed-off-by: Alex Sierra <[email protected]>
> Acked-by: Felix Kuehling <[email protected]>
> Reviewed-by: Alistair Popple <[email protected]>

So, I'm currently messing with PageAnon() pages and CoW semantics ...
all these PageAnon() ZONE_DEVICE variants don't necessarily make my life
easier but I'm not sure yet if they make my life harder. I hope you can
help me understand some of that stuff.

1) What are expected CoW semantics for DEVICE_COHERENT?

I assume we'll share them just like other PageAnon() pages during fork()
readable, and the first sharer writing to them receives an "ordinary"
!ZONE_DEVICE copy.

So this would be just like DEVICE_EXCLUSIVE CoW handling I assume, just
that we don't have to go through the loop of restoring a device
exclusive entry?

2) How are these pages freed to clear/invalidate PageAnon() ?

I assume for PageAnon() ZONE_DEVICE pages we'll always free via
free_devmap_managed_page(), correct?


3) FOLL_PIN

While you write "no one should be allowed to pin such memory", patch #2
only blocks FOLL_LONGTERM. So I assume we allow ordinary FOLL_PIN and
you might want to be a bit more precise?


... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages, but can we
FOLL_PIN DEVICE_EXCLUSIVE pages? I strongly assume so?


Thanks for any information.

--
Thanks,

David / dhildenb

2022-02-11 17:48:25

by David Hildenbrand

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

On 11.02.22 17:45, Jason Gunthorpe wrote:
> On Fri, Feb 11, 2022 at 05:15:25PM +0100, David Hildenbrand wrote:
>
>> ... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages
>
> Currently the only way to get a DEVICE_PRIVATE page out of the page
> tables is via hmm_range_fault() and that doesn't manipulate any ref
> counts.

Thanks for clarifying Jason! ... and AFAIU, device exclusive entries are
essentially just pointers at ordinary PageAnon() pages. So with DEVICE
COHERENT we'll have the first PageAnon() ZONE_DEVICE pages mapped as
present in the page tables where GUP could FOLL_PIN them.


--
Thanks,

David / dhildenb

2022-02-11 18:49:29

by Jason Gunthorpe

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

On Fri, Feb 11, 2022 at 05:15:25PM +0100, David Hildenbrand wrote:

> ... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages

Currently the only way to get a DEVICE_PRIVATE page out of the page
tables is via hmm_range_fault() and that doesn't manipulate any ref
counts.

Jason

2022-02-11 21:29:51

by Felix Kuehling

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support


Am 2022-02-11 um 11:39 schrieb David Hildenbrand:
> On 11.02.22 17:15, David Hildenbrand wrote:
>> On 01.02.22 16:48, Alex Sierra wrote:
>>> Device memory that is cache coherent from device and CPU point of view.
>>> This is used on platforms that have an advanced system bus (like CAPI
>>> or CXL). Any page of a process can be migrated to such memory. However,
>>> no one should be allowed to pin such memory so that it can always be
>>> evicted.
>>>
>>> Signed-off-by: Alex Sierra <[email protected]>
>>> Acked-by: Felix Kuehling <[email protected]>
>>> Reviewed-by: Alistair Popple <[email protected]>
>> So, I'm currently messing with PageAnon() pages and CoW semantics ...
>> all these PageAnon() ZONE_DEVICE variants don't necessarily make my life
>> easier but I'm not sure yet if they make my life harder. I hope you can
>> help me understand some of that stuff.
>>
>> 1) What are expected CoW semantics for DEVICE_COHERENT?
>>
>> I assume we'll share them just like other PageAnon() pages during fork()
>> readable, and the first sharer writing to them receives an "ordinary"
>> !ZONE_DEVICE copy.
>>
>> So this would be just like DEVICE_EXCLUSIVE CoW handling I assume, just
>> that we don't have to go through the loop of restoring a device
>> exclusive entry?
>>
>> 2) How are these pages freed to clear/invalidate PageAnon() ?
>>
>> I assume for PageAnon() ZONE_DEVICE pages we'll always for via
>> free_devmap_managed_page(), correct?
>>
>>
>> 3) FOLL_PIN
>>
>> While you write "no one should be allowed to pin such memory", patch #2
>> only blocks FOLL_LONGTERM. So I assume we allow ordinary FOLL_PIN and
>> you might want to be a bit more precise?
>>
>>
>> ... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages, but can we
>> FILL_PIN DEVICE_EXCLUSIVE pages? I strongly assume so?
>>
>>
>> Thanks for any information.
>>
> (digging a bit more, I realized that device exclusive pages are not
> actually/necessarily ZONE_DEVICE pages -- so I assume DEVICE_COHERENT
> will be the actual first PageAnon() ZONE_DEVICE pages that can be
> present in a page table.)

I think DEVICE_GENERIC pages can also be mapped in the page table. In
fact, the first version of our patches attempted to add migration
support to DEVICE_GENERIC. But we were convinced to create a new
ZONE_DEVICE page type for our use case instead.

Regards,
  Felix


>

2022-02-12 10:40:31

by David Hildenbrand

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

On 11.02.22 17:15, David Hildenbrand wrote:
> On 01.02.22 16:48, Alex Sierra wrote:
>> Device memory that is cache coherent from device and CPU point of view.
>> This is used on platforms that have an advanced system bus (like CAPI
>> or CXL). Any page of a process can be migrated to such memory. However,
>> no one should be allowed to pin such memory so that it can always be
>> evicted.
>>
>> Signed-off-by: Alex Sierra <[email protected]>
>> Acked-by: Felix Kuehling <[email protected]>
>> Reviewed-by: Alistair Popple <[email protected]>
>
> So, I'm currently messing with PageAnon() pages and CoW semantics ...
> all these PageAnon() ZONE_DEVICE variants don't necessarily make my life
> easier but I'm not sure yet if they make my life harder. I hope you can
> help me understand some of that stuff.
>
> 1) What are expected CoW semantics for DEVICE_COHERENT?
>
> I assume we'll share them just like other PageAnon() pages during fork()
> readable, and the first sharer writing to them receives an "ordinary"
> !ZONE_DEVICE copy.
>
> So this would be just like DEVICE_EXCLUSIVE CoW handling I assume, just
> that we don't have to go through the loop of restoring a device
> exclusive entry?
>
> 2) How are these pages freed to clear/invalidate PageAnon() ?
>
> I assume for PageAnon() ZONE_DEVICE pages we'll always for via
> free_devmap_managed_page(), correct?
>
>
> 3) FOLL_PIN
>
> While you write "no one should be allowed to pin such memory", patch #2
> only blocks FOLL_LONGTERM. So I assume we allow ordinary FOLL_PIN and
> you might want to be a bit more precise?
>
>
> ... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages, but can we
> FILL_PIN DEVICE_EXCLUSIVE pages? I strongly assume so?
>
>
> Thanks for any information.
>

(digging a bit more, I realized that device exclusive pages are not
actually/necessarily ZONE_DEVICE pages -- so I assume DEVICE_COHERENT
will be the actual first PageAnon() ZONE_DEVICE pages that can be
present in a page table.)

--
Thanks,

David / dhildenb

2022-02-12 14:22:06

by Felix Kuehling

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support


Am 2022-02-11 um 11:15 schrieb David Hildenbrand:
> On 01.02.22 16:48, Alex Sierra wrote:
>> Device memory that is cache coherent from device and CPU point of view.
>> This is used on platforms that have an advanced system bus (like CAPI
>> or CXL). Any page of a process can be migrated to such memory. However,
>> no one should be allowed to pin such memory so that it can always be
>> evicted.
>>
>> Signed-off-by: Alex Sierra <[email protected]>
>> Acked-by: Felix Kuehling <[email protected]>
>> Reviewed-by: Alistair Popple <[email protected]>
> So, I'm currently messing with PageAnon() pages and CoW semantics ...
> all these PageAnon() ZONE_DEVICE variants don't necessarily make my life
> easier but I'm not sure yet if they make my life harder. I hope you can
> help me understand some of that stuff.
>
> 1) What are expected CoW semantics for DEVICE_COHERENT?
>
> I assume we'll share them just like other PageAnon() pages during fork()
> readable, and the first sharer writing to them receives an "ordinary"
> !ZONE_DEVICE copy.

Yes.


>
> So this would be just like DEVICE_EXCLUSIVE CoW handling I assume, just
> that we don't have to go through the loop of restoring a device
> exclusive entry?

I'm not sure how DEVICE_EXCLUSIVE pages are handled under CoW. As I
understand it, they're not really in a special memory zone like
DEVICE_COHERENT. Just a special way of mapping an ordinary page in order
to allow device-exclusive access for some time. I suspect there may even
be a possibility that a page can be both DEVICE_EXCLUSIVE and
DEVICE_COHERENT.

That said, your statement sounds correct. There is no requirement to do
anything with the new "ordinary" page after copying. What actually
happens to DEVICE_COHERENT pages on CoW is a bit convoluted:

When the page is marked as CoW, it is marked R/O in the CPU page table.
This causes an MMU notifier that invalidates the device PTE. The next
device access in the parent process causes a page fault. If that's a
write fault (usually is in our current driver), it will trigger CoW,
which means the parent process now gets a new system memory copy of the
page, while the child process keeps the DEVICE_COHERENT page. The driver
could decide to migrate the page back to a new DEVICE_COHERENT allocation.

In practice that means, "fork" basically causes all DEVICE_COHERENT
memory in the parent process to be migrated to ordinary system memory,
which is quite disruptive. What we have today results in correct
behaviour, but the performance is far from ideal.

We could probably mitigate it by making the driver better at mapping
pages R/O in the device on read faults, at the potential cost of having
to handle a second (write) fault later.


>
> 2) How are these pages freed to clear/invalidate PageAnon() ?
>
> I assume for PageAnon() ZONE_DEVICE pages we'll always for via
> free_devmap_managed_page(), correct?

Yes. The driver depends on the page->pgmap->ops->page_free callback
to free the device memory allocation backing the page.


>
>
> 3) FOLL_PIN
>
> While you write "no one should be allowed to pin such memory", patch #2
> only blocks FOLL_LONGTERM. So I assume we allow ordinary FOLL_PIN and
> you might want to be a bit more precise?

I agree. I think the paragraph was written before we fully fleshed out
the interaction with GUP, and then forgotten.


>
>
> ... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages,

Right. Trying to GUP a DEVICE_PRIVATE page causes a page fault that
migrates the page back to normal system memory (using the
page->pgmap->ops->migrate_to_ram callback). Then you pin the system
memory page.


> but can we
> FILL_PIN DEVICE_EXCLUSIVE pages? I strongly assume so?

I assume you mean DEVICE_COHERENT, not DEVICE_EXCLUSIVE? In that case
the answer is "Yes".

Regards,
  Felix


>
>
> Thanks for any information.
>

2022-02-12 19:18:01

by Jason Gunthorpe

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

On Fri, Feb 11, 2022 at 05:49:08PM +0100, David Hildenbrand wrote:
> On 11.02.22 17:45, Jason Gunthorpe wrote:
> > On Fri, Feb 11, 2022 at 05:15:25PM +0100, David Hildenbrand wrote:
> >
> >> ... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages
> >
> > Currently the only way to get a DEVICE_PRIVATE page out of the page
> > tables is via hmm_range_fault() and that doesn't manipulate any ref
> > counts.
>
> Thanks for clarifying Jason! ... and AFAIU, device exclusive entries are
> essentially just pointers at ordinary PageAnon() pages. So with DEVICE
> COHERENT we'll have the first PageAnon() ZONE_DEVICE pages mapped as
> present in the page tables where GUP could FOLL_PIN them.

This is my understanding

Though you probably understand what PageAnon means a lot better than I
do... I wonder if it really makes sense to talk about that together
with ZONE_DEVICE, which has a lot in common with filesystem-originated
pages too.

I'm not sure what AMD's plan is here: is there an expectation that a GPU
driver will somehow stuff these pages into an existing anonymous
memory VMA, or do they always come from a driver-originated VMA?

Jason

2022-02-14 08:09:24

by Alistair Popple

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

Felix Kuehling <[email protected]> writes:

> Am 2022-02-11 um 11:15 schrieb David Hildenbrand:
>> On 01.02.22 16:48, Alex Sierra wrote:
>>> Device memory that is cache coherent from device and CPU point of view.
>>> This is used on platforms that have an advanced system bus (like CAPI
>>> or CXL). Any page of a process can be migrated to such memory. However,
>>> no one should be allowed to pin such memory so that it can always be
>>> evicted.
>>>
>>> Signed-off-by: Alex Sierra <[email protected]>
>>> Acked-by: Felix Kuehling <[email protected]>
>>> Reviewed-by: Alistair Popple <[email protected]>
>> So, I’m currently messing with PageAnon() pages and CoW semantics …
>> all these PageAnon() ZONE_DEVICE variants don’t necessarily make my life
>> easier but I’m not sure yet if they make my life harder. I hope you can
>> help me understand some of that stuff.
>>
>> 1) What are expected CoW semantics for DEVICE_COHERENT?
>>
>> I assume we’ll share them just like other PageAnon() pages during fork()
>> readable, and the first sharer writing to them receives an “ordinary”
>> !ZONE_DEVICE copy.
>
> Yes.
>
>
>>
>> So this would be just like DEVICE_EXCLUSIVE CoW handling I assume, just
>> that we don’t have to go through the loop of restoring a device
>> exclusive entry?
>
> I’m not sure how DEVICE_EXCLUSIVE pages are handled under CoW. As I understand
> it, they’re not really in a special memory zone like DEVICE_COHERENT. Just a
> special way of mapping an ordinary page in order to allow device-exclusive
> access for some time. I suspect there may even be a possibility that a page can
> be both DEVICE_EXCLUSIVE and DEVICE_COHERENT.

Right - there aren’t really device exclusive pages, they are just special
non-present ptes conceptually pretty similar to migration entries. The
difference is that on CPU fault (or fork) the original entry is restored
immediately after notifying the device that it no longer has exclusive access.

As device exclusive entries can be turned into normal entries whenever required
we handle CoW by restoring the original ptes if a device exclusive entry is
encountered. This reduces the chances of introducing any subtle CoW bugs as it
just gets handled the same as any normal page table entry (because the exclusive
entries will have been removed).

> That said, your statement sounds correct. There is no requirement to do anything
> with the new “ordinary” page after copying. What actually happens to
> DEVICE_COHERENT pages on CoW is a bit convoluted:
>
> When the page is marked as CoW, it is marked R/O in the CPU page table. This
> causes an MMU notifier that invalidates the device PTE. The next device access
> in the parent process causes a page fault. If that’s a write fault (usually is
> in our current driver), it will trigger CoW, which means the parent process now
> gets a new system memory copy of the page, while the child process keeps the
> DEVICE_COHERENT page. The driver could decide to migrate the page back to a new
> DEVICE_COHERENT allocation.
>
> In practice that means, “fork” basically causes all DEVICE_COHERENT memory in
> the parent process to be migrated to ordinary system memory, which is quite
> disruptive. What we have today results in correct behaviour, but the performance
> is far from ideal.
>
> We could probably mitigate it by making the driver better at mapping pages R/O
> in the device on read faults, at the potential cost of having to handle a second
> (write) fault later.
>
>
>>
>> 2) How are these pages freed to clear/invalidate PageAnon() ?
>>
>> I assume for PageAnon() ZONE_DEVICE pages we’ll always for via
>> free_devmap_managed_page(), correct?
>
> Yes. The driver depends on the the page->pgmap->ops->page_free callback to free
> the device memory allocation backing the page.
>
>
>>
>>
>> 3) FOLL_PIN
>>
>> While you write “no one should be allowed to pin such memory”, patch #2
>> only blocks FOLL_LONGTERM. So I assume we allow ordinary FOLL_PIN and
>> you might want to be a bit more precise?
>
> I agree. I think the paragraph was written before we fully fleshed out the
> interaction with GUP, and the forgotten.
>
>
>>
>>
>> … I’m pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages,
>
> Right. Trying to GUP a DEVICE_PRIVATE page causes a page fault that migrates the
> page back to normal system memory (using the page->pgmap->ops->migrate_to_ram
> callback). Then you pin the system memory page.
>
>
>> but can we
>> FILL_PIN DEVICE_EXCLUSIVE pages? I strongly assume so?

In the case of device exclusive entries GUP/PUP will fault and restore the
original entry. It will then pin the original normal page pointed to by the
device exclusive entry.

- Alistair

>
> I assume you mean DEVICE_COHERENT, not DEVICE_EXCLUSIVE? In that case the answer
> is “Yes”.
>
> Regards,
>   Felix
>
>
>>
>>
>> Thanks for any information.
>>



2022-02-15 13:08:36

by David Hildenbrand

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

On 11.02.22 17:56, Jason Gunthorpe wrote:
> On Fri, Feb 11, 2022 at 05:49:08PM +0100, David Hildenbrand wrote:
>> On 11.02.22 17:45, Jason Gunthorpe wrote:
>>> On Fri, Feb 11, 2022 at 05:15:25PM +0100, David Hildenbrand wrote:
>>>
>>>> ... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages
>>>
>>> Currently the only way to get a DEVICE_PRIVATE page out of the page
>>> tables is via hmm_range_fault() and that doesn't manipulate any ref
>>> counts.
>>
>> Thanks for clarifying Jason! ... and AFAIU, device exclusive entries are
>> essentially just pointers at ordinary PageAnon() pages. So with DEVICE
>> COHERENT we'll have the first PageAnon() ZONE_DEVICE pages mapped as
>> present in the page tables where GUP could FOLL_PIN them.
>
> This is my understanding
>
> Though you probably understand what PageAnon means alot better than I
> do.. I wonder if it really makes sense to talk about that together
> with ZONE_DEVICE which has alot in common with filesystem originated
> pages too.

For me, PageAnon() means that modifications are visible only to the
modifying process. On actual CoW, the underlying page will get replaced
-- in the world of DEVICE_COHERENT that would mean that once you write
to a DEVICE_COHERENT you could suddenly have a !DEVICE_COHERENT page.

PageAnon() pages don't have a mapping, thus they can only be found in
MAP_ANON VMAs or in file-backed MAP_PRIVATE VMAs. They can only be
found via a page table, and not looked up via the page cache (excluding
the swap cache).

So if we have PageAnon() pages on ZONE_DEVICE, they generally have the
exact same semantics as !ZONE_DEVICE pages, but the way they "appear" in
the page tables and the allocation/freeing path differ -- I guess :)

... and as we want pinning semantics to be different we have to touch GUP.

>
> I'm not sure what AMDs plan is here, is there an expecation that a GPU
> driver will somehow stuff these pages into an existing anonymous
> memory VMA or do they always come from a driver originated VMA?

My understanding is that a driver can just decide to replace "ordinary"
PageAnon() pages e.g., in a MAP_ANON VMA by these pages. Hopefully AMD
can clarify.


--
Thanks,

David / dhildenb

2022-02-15 14:43:52

by David Hildenbrand

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

On 11.02.22 18:07, Felix Kuehling wrote:
>
> Am 2022-02-11 um 11:39 schrieb David Hildenbrand:
>> On 11.02.22 17:15, David Hildenbrand wrote:
>>> On 01.02.22 16:48, Alex Sierra wrote:
>>>> Device memory that is cache coherent from device and CPU point of view.
>>>> This is used on platforms that have an advanced system bus (like CAPI
>>>> or CXL). Any page of a process can be migrated to such memory. However,
>>>> no one should be allowed to pin such memory so that it can always be
>>>> evicted.
>>>>
>>>> Signed-off-by: Alex Sierra <[email protected]>
>>>> Acked-by: Felix Kuehling <[email protected]>
>>>> Reviewed-by: Alistair Popple <[email protected]>
>>> So, I'm currently messing with PageAnon() pages and CoW semantics ...
>>> all these PageAnon() ZONE_DEVICE variants don't necessarily make my life
>>> easier but I'm not sure yet if they make my life harder. I hope you can
>>> help me understand some of that stuff.
>>>
>>> 1) What are expected CoW semantics for DEVICE_COHERENT?
>>>
>>> I assume we'll share them just like other PageAnon() pages during fork()
>>> readable, and the first sharer writing to them receives an "ordinary"
>>> !ZONE_DEVICE copy.
>>>
>>> So this would be just like DEVICE_EXCLUSIVE CoW handling I assume, just
>>> that we don't have to go through the loop of restoring a device
>>> exclusive entry?
>>>
>>> 2) How are these pages freed to clear/invalidate PageAnon() ?
>>>
>>> I assume for PageAnon() ZONE_DEVICE pages we'll always for via
>>> free_devmap_managed_page(), correct?
>>>
>>>
>>> 3) FOLL_PIN
>>>
>>> While you write "no one should be allowed to pin such memory", patch #2
>>> only blocks FOLL_LONGTERM. So I assume we allow ordinary FOLL_PIN and
>>> you might want to be a bit more precise?
>>>
>>>
>>> ... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages, but can we
>>> FILL_PIN DEVICE_EXCLUSIVE pages? I strongly assume so?
>>>
>>>
>>> Thanks for any information.
>>>
>> (digging a bit more, I realized that device exclusive pages are not
>> actually/necessarily ZONE_DEVICE pages -- so I assume DEVICE_COHERENT
>> will be the actual first PageAnon() ZONE_DEVICE pages that can be
>> present in a page table.)
>
> I think DEVICE_GENERIC pages can also be mapped in the page table. In
> fact, the first version of our patches attempted to add migration
> support to DEVICE_GENERIC. But we were convinced to create a new
> ZONE_DEVICE page type for our use case instead.

Do you know if DEVICE_GENERIC pages would end up as PageAnon()? My
assumption was that they would be part of a special mapping.

--
Thanks,

David / dhildenb

2022-02-15 20:04:00

by Felix Kuehling

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support


On 2022-02-15 07:15, David Hildenbrand wrote:
> On 11.02.22 17:56, Jason Gunthorpe wrote:
>> On Fri, Feb 11, 2022 at 05:49:08PM +0100, David Hildenbrand wrote:
>>> On 11.02.22 17:45, Jason Gunthorpe wrote:
>>>> On Fri, Feb 11, 2022 at 05:15:25PM +0100, David Hildenbrand wrote:
>>>>
>>>>> ... I'm pretty sure we cannot FOLL_PIN DEVICE_PRIVATE pages
>>>> Currently the only way to get a DEVICE_PRIVATE page out of the page
>>>> tables is via hmm_range_fault() and that doesn't manipulate any ref
>>>> counts.
>>> Thanks for clarifying Jason! ... and AFAIU, device exclusive entries are
>>> essentially just pointers at ordinary PageAnon() pages. So with DEVICE
>>> COHERENT we'll have the first PageAnon() ZONE_DEVICE pages mapped as
>>> present in the page tables where GUP could FOLL_PIN them.
>> This is my understanding
>>
>> Though you probably understand what PageAnon means alot better than I
>> do.. I wonder if it really makes sense to talk about that together
>> with ZONE_DEVICE which has alot in common with filesystem originated
>> pages too.
> For me, PageAnon() means that modifications are visible only to the
> modifying process. On actual CoW, the underlying page will get replaced
> -- in the world of DEVICE_COHERENT that would mean that once you write
> to a DEVICE_COHERENT you could suddenly have a !DEVICE_COHERENT page.
>
> PageAnon() pages don't have a mapping, thus they can only be found in
> MAP_ANON VMAs or in MAP_SHARED VMAs with MAP_PRIVATE. They can only be
> found via a page table, and not looked up via the page cache (excluding
> the swap cache).
>
> So if we have PageAnon() pages on ZONE_DEVICE, they generally have the
> exact same semantics as !ZONE_DEVICE pages, but the way they "appear" in
> the page tables the allocation/freeing path differs -- I guess :)
>
> ... and as we want pinning semantics to be different we have to touch GUP.
>
>> I'm not sure what AMDs plan is here, is there an expecation that a GPU
>> driver will somehow stuff these pages into an existing anonymous
>> memory VMA or do they always come from a driver originated VMA?
> My understanding is that a driver can just decide to replace "ordinary"
> PageAnon() pages e.g., in a MAP_ANON VMA by these pages. Hopefully AMD
> can clarify.

Yes. DEVICE_COHERENT pages are very similar to DEVICE_PRIVATE. They are
in normal anonymous VMAs (e.g. the application called mmap(...,
MAP_ANONYMOUS, ...)). You get DEVICE_PRIVATE/COHERENT pages as a result
of the driver migrating normal anonymous pages into device memory. The
main difference is, that DEVICE_PRIVATE pages aren't host accessible,
while DEVICE_COHERENT pages are host-accessible and can be mapped in the
CPU page table or DMA-mapped by peer devices. That's why GUP needs to
deal with them.

Regards,
  Felix


>
>

2022-02-15 20:43:08

by Jason Gunthorpe

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

On Tue, Feb 15, 2022 at 07:32:09PM +0100, Christoph Hellwig wrote:
> On Tue, Feb 15, 2022 at 10:45:24AM -0400, Jason Gunthorpe wrote:
> > > Do you know if DEVICE_GENERIC pages would end up as PageAnon()? My
> > > assumption was that they would be part of a special mapping.
> >
> > We need to stop using the special PTEs and VMAs for things that have a
> > struct page. This is a mistake DAX created that must be undone.
>
> Yes, we'll get to it. Maybe we can do it for the non-DAX devmap
> ptes first given that DAX is more complicated.

Probably, I think we can check the page->pgmap type to tell the
difference.

I'm not sure how the DEVICE_GENERIC can work without this, as DAX was
made safe by using the unmap_mapping_range(), which won't work
here. Is there some other trick being used to keep track of references
inside the AMD driver?

Jason

2022-02-15 21:06:50

by Jason Gunthorpe

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

On Tue, Feb 15, 2022 at 01:16:43PM +0100, David Hildenbrand wrote:
> > fact, the first version of our patches attempted to add migration
> > support to DEVICE_GENERIC. But we were convinced to create a new
> > ZONE_DEVICE page type for our use case instead.
>
> Do you know if DEVICE_GENERIC pages would end up as PageAnon()? My
> assumption was that they would be part of a special mapping.

We need to stop using the special PTEs and VMAs for things that have a
struct page. This is a mistake DAX created that must be undone.

Jason

2022-02-15 22:40:10

by Felix Kuehling

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support


On 2022-02-15 14:41, Jason Gunthorpe wrote:
> On Tue, Feb 15, 2022 at 07:32:09PM +0100, Christoph Hellwig wrote:
>> On Tue, Feb 15, 2022 at 10:45:24AM -0400, Jason Gunthorpe wrote:
>>>> Do you know if DEVICE_GENERIC pages would end up as PageAnon()? My
>>>> assumption was that they would be part of a special mapping.
>>> We need to stop using the special PTEs and VMAs for things that have a
>>> struct page. This is a mistake DAX created that must be undone.
>> Yes, we'll get to it. Maybe we can do it for the non-DAX devmap
>> ptes first given that DAX is more complicated.
> Probably, I think we can check the page->pgmap type to tell the
> difference.
>
> I'm not sure how the DEVICE_GENERIC can work without this, as DAX was
> made safe by using the unmap_mapping_range(), which won't work
> here. Is there some other trick being used to keep track of references
> inside the AMD driver?

Not sure I'm following all the discussion about VMAs and DAX. So I may
be answering the wrong question: We treat each ZONE_DEVICE page as a
reference to the BO (buffer object) that backs the page. We increment
the BO refcount for each page we migrate into it. In the
dev_pagemap_ops.page_free callback we drop that reference. Once all
pages backed by a BO are freed, the BO refcount reaches 0 [*] and we can
free the BO allocation.

Regards,
  Felix


[*] That's a somewhat simplified view. There may be other references to
the BO, which allows us to reuse the same BO for the same virtual
address range.

>
> Jason
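
To make the lifetime Felix describes concrete, here is a heavily simplified
sketch of the callback side. The BO type and the my_*() helpers are invented
for illustration and do not match the actual amdgpu code:

#include <linux/memremap.h>
#include <linux/mm.h>

/* Illustrative stand-ins for the driver's buffer object machinery. */
struct my_bo;
void my_bo_put(struct my_bo *bo);

static void my_devm_page_free(struct page *page)
{
	/* Each migrated page holds one reference on the BO backing it;
	 * drop it here. Freeing the last page frees the BO allocation. */
	my_bo_put(page->zone_device_data);
	page->zone_device_data = NULL;
}

static const struct dev_pagemap_ops my_pgmap_ops = {
	.page_free = my_devm_page_free,
};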

2022-02-15 22:53:04

by Christoph Hellwig

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

On Tue, Feb 15, 2022 at 10:45:24AM -0400, Jason Gunthorpe wrote:
> > Do you know if DEVICE_GENERIC pages would end up as PageAnon()? My
> > assumption was that they would be part of a special mapping.
>
> We need to stop using the special PTEs and VMAs for things that have a
> struct page. This is a mistake DAX created that must be undone.

Yes, we'll get to it. Maybe we can do it for the non-DAX devmap
ptes first given that DAX is more complicated.

2022-02-16 03:34:36

by Jason Gunthorpe

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

On Wed, Feb 16, 2022 at 12:23:44PM +1100, Alistair Popple wrote:

> Device private and device coherent pages are not marked with pte_devmap and they
> are backed by a struct page. The only way of inserting them is via migrate_vma.
> The refcount is decremented in zap_pte_range() on munmap() with special handling
> for device private pages. Looking at it again though I wonder if there is any
> special treatment required in zap_pte_range() for device coherent pages given
> they count as present pages.

This is what I guessed, but we shouldn't be able to just drop
pte_devmap on these pages without any other work?? Granted it does
very little already..

I thought at least gup_fast needed to be touched or did this get
handled by scanning the page list after the fact?

Jason

2022-02-16 06:48:08

by Alistair Popple

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

Jason Gunthorpe <[email protected]> writes:

> On Tue, Feb 15, 2022 at 04:35:56PM -0500, Felix Kuehling wrote:
>>
>> On 2022-02-15 14:41, Jason Gunthorpe wrote:
>> > On Tue, Feb 15, 2022 at 07:32:09PM +0100, Christoph Hellwig wrote:
>> > > On Tue, Feb 15, 2022 at 10:45:24AM -0400, Jason Gunthorpe wrote:
>> > > > > Do you know if DEVICE_GENERIC pages would end up as PageAnon()? My
>> > > > > assumption was that they would be part of a special mapping.
>> > > > We need to stop using the special PTEs and VMAs for things that have a
>> > > > struct page. This is a mistake DAX created that must be undone.
>> > > Yes, we'll get to it. Maybe we can do it for the non-DAX devmap
>> > > ptes first given that DAX is more complicated.
>> > Probably, I think we can check the page->pgmap type to tell the
>> > difference.
>> >
>> > I'm not sure how the DEVICE_GENERIC can work without this, as DAX was
>> > made safe by using the unmap_mapping_range(), which won't work
>> > here. Is there some other trick being used to keep track of references
>> > inside the AMD driver?
>>
>> Not sure I'm following all the discussion about VMAs and DAX. So I may be
>> answering the wrong question: We treat each ZONE_DEVICE page as a reference
>> to the BO (buffer object) that backs the page. We increment the BO refcount
>> for each page we migrate into it. In the dev_pagemap_ops.page_free callback
>> we drop that reference. Once all pages backed by a BO are freed, the BO
>> refcount reaches 0 [*] and we can free the BO allocation.
>
> Userspace does
> 1) mmap(MAP_PRIVATE) to allocate anon memory
> 2) something to trigger migration to install a ZONE_DEVICE page
> 3) munmap()
>
> Who decrements the refcout on the munmap?
>
> When a ZONE_DEVICE page is installed in the PTE is supposed to be
> marked as pte_devmap and that disables all the normal page refcounting
> during munmap().

Device private and device coherent pages are not marked with pte_devmap and they
are backed by a struct page. The only way of inserting them is via migrate_vma.
The refcount is decremented in zap_pte_range() on munmap() with special handling
for device private pages. Looking at it again though I wonder if there is any
special treatment required in zap_pte_range() for device coherent pages given
they count as present pages.

> fsdax makes this work by working the refcounts backwards, the page is
> refcounted while it exists in the driver, when the driver decides to
> remove it then unmap_mapping_range() is called to purge it from all
> PTEs and then refcount is decrd. munmap/fork/etc don't change the
> refcount.

The equivalent here is for drivers to use migrate_vma to migrate the pages back
from device memory to CPU memory. In this case the refcounting is (mostly)
handled by migration code which decrements the refcount on the original source
device page during the migration.

- Alistair

> Jason
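
For context, the migrate_vma flow Alistair refers to looks roughly like this
from a driver's point of view. This is an abridged sketch with an invented
name (my_evict_range), a small fixed range and no data copy; it is not the
actual dmirror or amdgpu code:

#include <linux/migrate.h>
#include <linux/mm.h>

/* Sketch: migrate up to 64 device-private pages of one VMA back to
 * system memory. Real drivers size the pfn arrays to the range and
 * allocate/copy destination pages between setup and pages/finalize. */
static int my_evict_range(struct vm_area_struct *vma, unsigned long start,
			  unsigned long end, void *pgmap_owner)
{
	unsigned long src_pfns[64] = {}, dst_pfns[64] = {};
	struct migrate_vma args = {
		.vma		= vma,
		.start		= start,
		.end		= end,
		.src		= src_pfns,
		.dst		= dst_pfns,
		.pgmap_owner	= pgmap_owner,
		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
	};
	int ret;

	ret = migrate_vma_setup(&args);
	if (ret)
		return ret;

	/* ... allocate system pages, copy the data, fill args.dst ... */

	migrate_vma_pages(&args);
	/* Source device pages drop out of the page tables here; once their
	 * refcounts hit zero, the pgmap's page_free callback runs. */
	migrate_vma_finalize(&args);
	return 0;
}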



2022-02-16 07:05:56

by Felix Kuehling

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support


On 2022-02-15 16:47, Jason Gunthorpe wrote:
> On Tue, Feb 15, 2022 at 04:35:56PM -0500, Felix Kuehling wrote:
>> On 2022-02-15 14:41, Jason Gunthorpe wrote:
>>> On Tue, Feb 15, 2022 at 07:32:09PM +0100, Christoph Hellwig wrote:
>>>> On Tue, Feb 15, 2022 at 10:45:24AM -0400, Jason Gunthorpe wrote:
>>>>>> Do you know if DEVICE_GENERIC pages would end up as PageAnon()? My
>>>>>> assumption was that they would be part of a special mapping.
>>>>> We need to stop using the special PTEs and VMAs for things that have a
>>>>> struct page. This is a mistake DAX created that must be undone.
>>>> Yes, we'll get to it. Maybe we can do it for the non-DAX devmap
>>>> ptes first given that DAX is more complicated.
>>> Probably, I think we can check the page->pgmap type to tell the
>>> difference.
>>>
>>> I'm not sure how the DEVICE_GENERIC can work without this, as DAX was
>>> made safe by using the unmap_mapping_range(), which won't work
>>> here. Is there some other trick being used to keep track of references
>>> inside the AMD driver?
>> Not sure I'm following all the discussion about VMAs and DAX. So I may be
>> answering the wrong question: We treat each ZONE_DEVICE page as a reference
>> to the BO (buffer object) that backs the page. We increment the BO refcount
>> for each page we migrate into it. In the dev_pagemap_ops.page_free callback
>> we drop that reference. Once all pages backed by a BO are freed, the BO
>> refcount reaches 0 [*] and we can free the BO allocation.
> Userspace does
> 1) mmap(MAP_PRIVATE) to allocate anon memory
> 2) something to trigger migration to install a ZONE_DEVICE page
> 3) munmap()
>
> Who decrements the refcout on the munmap?
>
> When a ZONE_DEVICE page is installed in the PTE is supposed to be
> marked as pte_devmap and that disables all the normal page refcounting
> during munmap().
>
> fsdax makes this work by working the refcounts backwards, the page is
> refcounted while it exists in the driver, when the driver decides to
> remove it then unmap_mapping_range() is called to purge it from all
> PTEs and then refcount is decrd. munmap/fork/etc don't change the
> refcount.

Hmm, that just means, whether or not there are PTEs doesn't really
matter. It should work the same as it does for DEVICE_PRIVATE pages. I'm
not sure where DEVICE_PRIVATE page's refcounts are decremented on unmap,
TBH. But I can't find it in our driver, or in the test_hmm driver for
that matter.

Regards,
  Felix

>
> Jason

2022-02-16 07:23:49

by Jason Gunthorpe

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

On Tue, Feb 15, 2022 at 04:35:56PM -0500, Felix Kuehling wrote:
>
> On 2022-02-15 14:41, Jason Gunthorpe wrote:
> > On Tue, Feb 15, 2022 at 07:32:09PM +0100, Christoph Hellwig wrote:
> > > On Tue, Feb 15, 2022 at 10:45:24AM -0400, Jason Gunthorpe wrote:
> > > > > Do you know if DEVICE_GENERIC pages would end up as PageAnon()? My
> > > > > assumption was that they would be part of a special mapping.
> > > > We need to stop using the special PTEs and VMAs for things that have a
> > > > struct page. This is a mistake DAX created that must be undone.
> > > Yes, we'll get to it. Maybe we can do it for the non-DAX devmap
> > > ptes first given that DAX is more complicated.
> > Probably, I think we can check the page->pgmap type to tell the
> > difference.
> >
> > I'm not sure how the DEVICE_GENERIC can work without this, as DAX was
> > made safe by using the unmap_mapping_range(), which won't work
> > here. Is there some other trick being used to keep track of references
> > inside the AMD driver?
>
> Not sure I'm following all the discussion about VMAs and DAX. So I may be
> answering the wrong question: We treat each ZONE_DEVICE page as a reference
> to the BO (buffer object) that backs the page. We increment the BO refcount
> for each page we migrate into it. In the dev_pagemap_ops.page_free callback
> we drop that reference. Once all pages backed by a BO are freed, the BO
> refcount reaches 0 [*] and we can free the BO allocation.

Userspace does
1) mmap(MAP_PRIVATE) to allocate anon memory
2) something to trigger migration to install a ZONE_DEVICE page
3) munmap()

Who decrements the refcount on the munmap?

When a ZONE_DEVICE page is installed in the PTE it is supposed to be
marked as pte_devmap and that disables all the normal page refcounting
during munmap().

fsdax makes this work by working the refcounts backwards, the page is
refcounted while it exists in the driver, when the driver decides to
remove it then unmap_mapping_range() is called to purge it from all
PTEs and then the refcount is decremented. munmap/fork/etc don't change the
refcount.

Jason

2022-02-16 07:35:30

by Jason Gunthorpe

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

On Tue, Feb 15, 2022 at 05:49:07PM -0500, Felix Kuehling wrote:

> > Userspace does
> > 1) mmap(MAP_PRIVATE) to allocate anon memory
> > 2) something to trigger migration to install a ZONE_DEVICE page
> > 3) munmap()
> >
> > Who decrements the refcout on the munmap?
> >
> > When a ZONE_DEVICE page is installed in the PTE is supposed to be
> > marked as pte_devmap and that disables all the normal page refcounting
> > during munmap().
> >
> > fsdax makes this work by working the refcounts backwards, the page is
> > refcounted while it exists in the driver, when the driver decides to
> > remove it then unmap_mapping_range() is called to purge it from all
> > PTEs and then refcount is decrd. munmap/fork/etc don't change the
> > refcount.
>
> Hmm, that just means, whether or not there are PTEs doesn't really
> matter.

Yes, that is the FSDAX model

> It should work the same as it does for DEVICE_PRIVATE pages. I'm not sure
> where DEVICE_PRIVATE page's refcounts are decremented on unmap, TBH. But I
> can't find it in our driver, or in the test_hmm driver for that matter.

It is not the same as DEVICE_PRIVATE because DEVICE_PRIVATE uses swap
entries. The put_page for that case is here:

static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
struct zap_details *details)
{
[..]
if (is_device_private_entry(entry) ||
is_device_exclusive_entry(entry)) {
struct page *page = pfn_swap_entry_to_page(entry);

if (unlikely(zap_skip_check_mapping(details, page)))
continue;
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
rss[mm_counter(page)]--;

if (is_device_private_entry(entry))
page_remove_rmap(page, false);

put_page(page);

However the devmap case will return NULL from vm_normal_page() and won't
do the put_page() embedded inside the __tlb_remove_page() in the
pte_present() block in the same function.

After reflecting for a while, I think Christoph's idea is quite
good. Just make it so you don't set pte_devmap() on your pages and
then let's avoid pte_devmap for all refcount-correct ZONE_DEVICE pages.

Jason

2022-02-16 07:37:22

by Alistair Popple

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

On Wednesday, 16 February 2022 1:03:57 PM AEDT Jason Gunthorpe wrote:
> On Wed, Feb 16, 2022 at 12:23:44PM +1100, Alistair Popple wrote:
>
> > Device private and device coherent pages are not marked with pte_devmap and they
> > are backed by a struct page. The only way of inserting them is via migrate_vma.
> > The refcount is decremented in zap_pte_range() on munmap() with special handling
> > for device private pages. Looking at it again though I wonder if there is any
> > special treatment required in zap_pte_range() for device coherent pages given
> > they count as present pages.
>
> This is what I guessed, but we shouldn't be able to just drop
> pte_devmap on these pages without any other work?? Granted it does
> very little already..

Yes, I agree we need to check this more closely. For device private pages
not having pte_devmap is fine, because they are non-present swap entries so
they always get special handling in the swap entry paths but the same isn't
true for coherent device pages.

> I thought at least gup_fast needed to be touched or did this get
> handled by scanning the page list after the fact?

Right, for gup I think the only special handling required is to prevent
pinning. I had assumed that check_and_migrate_movable_pages() would still get
called for gup_fast but unless I've missed something I don't think it does.
That means gup_fast could still pin movable and coherent pages. Technically
that is ok for coherent pages, but it's undesirable.

- Alistair

> Jason
>




2022-02-16 08:34:37

by David Hildenbrand

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

On 16.02.22 03:36, Alistair Popple wrote:
> On Wednesday, 16 February 2022 1:03:57 PM AEDT Jason Gunthorpe wrote:
>> On Wed, Feb 16, 2022 at 12:23:44PM +1100, Alistair Popple wrote:
>>
>>> Device private and device coherent pages are not marked with pte_devmap and they
>>> are backed by a struct page. The only way of inserting them is via migrate_vma.
>>> The refcount is decremented in zap_pte_range() on munmap() with special handling
>>> for device private pages. Looking at it again though I wonder if there is any
>>> special treatment required in zap_pte_range() for device coherent pages given
>>> they count as present pages.
>>
>> This is what I guessed, but we shouldn't be able to just drop
>> pte_devmap on these pages without any other work?? Granted it does
>> very little already..
>
> Yes, I agree we need to check this more closely. For device private pages
> not having pte_devmap is fine, because they are non-present swap entries so
> they always get special handling in the swap entry paths but the same isn't
> true for coherent device pages.

I'm curious, what does the refcount of a PageAnon() DEVICE_COHERENT page
look like when mapped? I'd assume it's also (currently) still offset by
one, meaning, if it's mapped into a single page table it's always at
least 2.

Just a note that if my assumption is correct and if we'd have such a
page mapped R/O, do_wp_page() would always have to copy it
unconditionally and would not be able to reuse it on write faults.
(while I'm working on improving the reuse logic, I think there is also
work in progress to avoid this additional reference on some ZONE_DEVICE
stuff -- I'd assume that would include DEVICE_COHERENT ?)

>
>> I thought at least gup_fast needed to be touched or did this get
>> handled by scanning the page list after the fact?
>
> Right, for gup I think the only special handling required is to prevent
> pinning. I had assumed that check_and_migrate_movable_pages() would still get
> called for gup_fast but unless I've missed something I don't think it does.
> That means gup_fast could still pin movable and coherent pages. Technically
> that is ok for coherent pages, but it's undesirable.

We really should have the same pinning rules for GUP vs. GUP-fast.
is_pinnable_page() should be the right place for such checks (similarly
as indicated in my reply to the migration series).

--
Thanks,

David / dhildenb
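
To make the suggestion concrete, here is a sketch of what such a check could
look like. my_page_is_longterm_pinnable() is an invented name, and
is_device_coherent_page() stands in for whatever helper the series ends up
providing:

/* Sketch: one predicate used by both GUP and GUP-fast so they share
 * the same pinning rules. */
static inline bool my_page_is_longterm_pinnable(struct page *page)
{
	/* ZONE_MOVABLE and CMA pages were already excluded upstream. */
	if (is_zone_movable_page(page) || is_migrate_cma_page(page))
		return false;

	/* DEVICE_COHERENT pages must stay migratable so the device can
	 * always evict them, hence no FOLL_LONGTERM pins. */
	if (is_device_coherent_page(page))
		return false;

	return true;
}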

2022-02-16 12:37:13

by Jason Gunthorpe

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

On Wed, Feb 16, 2022 at 09:31:03AM +0100, David Hildenbrand wrote:
> On 16.02.22 03:36, Alistair Popple wrote:
> > On Wednesday, 16 February 2022 1:03:57 PM AEDT Jason Gunthorpe wrote:
> >> On Wed, Feb 16, 2022 at 12:23:44PM +1100, Alistair Popple wrote:
> >>
> >>> Device private and device coherent pages are not marked with pte_devmap and they
> >>> are backed by a struct page. The only way of inserting them is via migrate_vma.
> >>> The refcount is decremented in zap_pte_range() on munmap() with special handling
> >>> for device private pages. Looking at it again though I wonder if there is any
> >>> special treatment required in zap_pte_range() for device coherent pages given
> >>> they count as present pages.
> >>
> >> This is what I guessed, but we shouldn't be able to just drop
> >> pte_devmap on these pages without any other work?? Granted it does
> >> very little already..
> >
> > Yes, I agree we need to check this more closely. For device private pages
> > not having pte_devmap is fine, because they are non-present swap entries so
> > they always get special handling in the swap entry paths but the same isn't
> > true for coherent device pages.
>
> I'm curious, how does the refcount of a PageAnon() DEVICE_COHERENT page
> look like when mapped? I'd assume it's also (currently) still offset by
> one, meaning, if it's mapped into a single page table it's always at
> least 2.

Christoph fixed this offset by one and updated the DEVICE_COHERENT
patchset, I hope we will see that version merged.

> >> I thought at least gup_fast needed to be touched or did this get
> >> handled by scanning the page list after the fact?
> >
> > Right, for gup I think the only special handling required is to prevent
> > pinning. I had assumed that check_and_migrate_movable_pages() would still get
> > called for gup_fast but unless I've missed something I don't think it does.
> > That means gup_fast could still pin movable and coherent pages. Technically
> > that is ok for coherent pages, but it's undesirable.
>
> We really should have the same pinning rules for GUP vs. GUP-fast.
> is_pinnable_page() should be the right place for such checks (similarly
> as indicated in my reply to the migration series).

Yes, I think this is a bug too.

The other place that needs careful audit is all the callers using
vm_normal_page() - they must all be able to accept a ZONE_DEVICE page
if we don't set pte_devmap.

Jason

2022-02-16 22:56:15

by Felix Kuehling

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

Am 2022-02-15 um 21:01 schrieb Jason Gunthorpe:
> On Tue, Feb 15, 2022 at 05:49:07PM -0500, Felix Kuehling wrote:
>
>>> Userspace does
>>> 1) mmap(MAP_PRIVATE) to allocate anon memory
>>> 2) something to trigger migration to install a ZONE_DEVICE page
>>> 3) munmap()
>>>
>>> Who decrements the refcout on the munmap?
>>>
>>> When a ZONE_DEVICE page is installed in the PTE is supposed to be
>>> marked as pte_devmap and that disables all the normal page refcounting
>>> during munmap().
>>>
>>> fsdax makes this work by working the refcounts backwards, the page is
>>> refcounted while it exists in the driver, when the driver decides to
>>> remove it then unmap_mapping_range() is called to purge it from all
>>> PTEs and then refcount is decrd. munmap/fork/etc don't change the
>>> refcount.
>> Hmm, that just means, whether or not there are PTEs doesn't really
>> matter.
> Yes, that is the FSDAX model
>
>> It should work the same as it does for DEVICE_PRIVATE pages. I'm not sure
>> where DEVICE_PRIVATE page's refcounts are decremented on unmap, TBH. But I
>> can't find it in our driver, or in the test_hmm driver for that matter.
> It is not the same as DEVICE_PRIVATE because DEVICE_PRIVATE uses swap
> entries. The put_page for that case is here:
>
> static unsigned long zap_pte_range(struct mmu_gather *tlb,
> struct vm_area_struct *vma, pmd_t *pmd,
> unsigned long addr, unsigned long end,
> struct zap_details *details)
> {
> [..]
> if (is_device_private_entry(entry) ||
> is_device_exclusive_entry(entry)) {
> struct page *page = pfn_swap_entry_to_page(entry);
>
> if (unlikely(zap_skip_check_mapping(details, page)))
> continue;
> pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> rss[mm_counter(page)]--;
>
> if (is_device_private_entry(entry))
> page_remove_rmap(page, false);
>
> put_page(page);
>
> However the devmap case will return NULL from vm_normal_page() and won't
> do the put_page() embedded inside the __tlb_remove_page() in the
> pte_present() block in the same function.
>
> After reflecting for awhile, I think Christoph's idea is quite
> good. Just make it so you don't set pte_devmap() on your pages and
> then lets avoid pte_devmap for all refcount correct ZONE_DEVICE pages.

I'm not sure if pte_devmap is actually set for our DEVICE_COHERENT
pages. As far as I can tell, this comes from a bit in the pfn:

#define PFN_DEV (1ULL << (BITS_PER_LONG_LONG - 3))
#define PFN_MAP (1ULL << (BITS_PER_LONG_LONG - 4))
...
static inline bool pfn_t_devmap(pfn_t pfn)
{
const u64 flags = PFN_DEV|PFN_MAP;

return (pfn.val & flags) == flags;
}

In the case of DEVICE_COHERENT memory, the pfns correspond to real
physical memory addresses. I don't think they have those PFN_DEV|PFN_MAP
bits set.

Regards,
  Felix


>
> Jason

2022-02-17 03:49:20

by Alistair Popple

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

Jason Gunthorpe <[email protected]> writes:

> On Wed, Feb 16, 2022 at 09:31:03AM +0100, David Hildenbrand wrote:
>> On 16.02.22 03:36, Alistair Popple wrote:
>> > On Wednesday, 16 February 2022 1:03:57 PM AEDT Jason Gunthorpe wrote:
>> >> On Wed, Feb 16, 2022 at 12:23:44PM +1100, Alistair Popple wrote:
>> >>
>> >>> Device private and device coherent pages are not marked with pte_devmap and they
>> >>> are backed by a struct page. The only way of inserting them is via migrate_vma.
>> >>> The refcount is decremented in zap_pte_range() on munmap() with special handling
>> >>> for device private pages. Looking at it again though I wonder if there is any
>> >>> special treatment required in zap_pte_range() for device coherent pages given
>> >>> they count as present pages.
>> >>
>> >> This is what I guessed, but we shouldn't be able to just drop
>> >> pte_devmap on these pages without any other work?? Granted it does
>> >> very little already..
>> >
>> > Yes, I agree we need to check this more closely. For device private pages
>> > not having pte_devmap is fine, because they are non-present swap entries so
>> > they always get special handling in the swap entry paths but the same isn't
>> > true for coherent device pages.
>>
>> I'm curious, how does the refcount of a PageAnon() DEVICE_COHERENT page
>> look like when mapped? I'd assume it's also (currently) still offset by
>> one, meaning, if it's mapped into a single page table it's always at
>> least 2.
>
> Christoph fixed this offset by one and updated the DEVICE_COHERENT
> patchset, I hope we will see that version merged.
>
>> >> I thought at least gup_fast needed to be touched or did this get
>> >> handled by scanning the page list after the fact?
>> >
>> > Right, for gup I think the only special handling required is to prevent
>> > pinning. I had assumed that check_and_migrate_movable_pages() would still get
>> > called for gup_fast but unless I've missed something I don't think it does.
>> > That means gup_fast could still pin movable and coherent pages. Technically
>> > that is ok for coherent pages, but it's undesirable.
>>
>> We really should have the same pinning rules for GUP vs. GUP-fast.
>> is_pinnable_page() should be the right place for such checks (similarly
>> as indicated in my reply to the migration series).
>
> Yes, I think this is a bug too.

Agreed, I will add a fix for it to my series as I was surprised the rules for
PUP-fast were different. I can see how this happened though -
check_and_migrate_cma_pages() (the precursor to
check_and_migrate_movable_pages()) was added before PUP-fast and FOLL_LONGTERM
so I guess we just never added this check there.

- Alistair

> The other place that needs careful audit is all the callers using
> vm_normal_page() - they must all be able to accept a ZONE_DEVICE page
> if we don't set pte_devmap.
>
> Jason



2022-02-17 12:23:05

by Jason Gunthorpe

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

On Wed, Feb 16, 2022 at 11:56:51AM -0500, Felix Kuehling wrote:

> In the case of DEVICE_COHERENT memory, the pfns correspond to real physical
> memory addresses. I don't think they have those PFN_DEV|PFN_MAP bits set.

So do DAX pages. The PTE flag does several things. As this would be
the first time ZONE_DEVICE pages do not set devmap it needs a full
audit.

eg the gup_fast bug Alistair pointed at needs fixing at least.

Jason

2022-02-17 23:35:15

by Felix Kuehling

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

Am 2022-02-16 um 07:26 schrieb Jason Gunthorpe:
> The other place that needs careful audit is all the callers using
> vm_normal_page() - they must all be able to accept a ZONE_DEVICE page
> if we don't set pte_devmap.

How much code are we talking about here? A quick search finds 26
call-sites in 12 files in current master:

fs/proc/task_mmu.c
mm/hmm.c
mm/gup.c
mm/huge_memory.c (vm_normal_page_pmd)
mm/khugepaged.c
mm/madvise.c
mm/mempolicy.c
mm/memory.c
mm/mlock.c
mm/migrate.c
mm/mprotect.c
mm/memcontrol.c

I'm thinking of a more theoretical approach: instead of auditing all
users, I'd ask what invariants a vm_normal_page should have, then check
whether our DEVICE_COHERENT pages satisfy them. But maybe the concept of
a vm_normal_page isn't defined clearly enough for that.

That said, I think we (Alex and myself) made an implicit assumption from
the start, that a DEVICE_COHERENT page should behave a lot like a normal
page in terms of VMA mappings, even if we didn't know what that means in
detail.

I can now at least name some differences between DEVICE_COHERENT and
normal pages: how the memory is allocated, how data is migrated into
DEVICE_COHERENT pages and that it can't be on any LRU list (because the
lru list_head in struct page is aliased by pgmap and zone_device_data).
Maybe I'll find more differences if I keep digging.

Regards,
  Felix
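
To make the aliasing Felix mentions concrete, here is an abridged view of the
relevant union in struct page around this kernel version (most fields and
union members trimmed):

struct page {
	unsigned long flags;
	union {
		struct {	/* Page cache and anonymous pages */
			struct list_head lru;
			struct address_space *mapping;
			pgoff_t index;
			unsigned long private;
		};
		struct {	/* ZONE_DEVICE pages */
			struct dev_pagemap *pgmap;	/* overlays lru.next */
			void *zone_device_data;		/* overlays lru.prev */
			/* ... */
		};
		/* ... further union members (slab, compound tail pages, ...) */
	};
	/* ... _mapcount, _refcount, etc. */
};

Because pgmap and zone_device_data occupy the same storage as the lru
list_head, a ZONE_DEVICE page can never be placed on an LRU list.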


>
> Jason

2022-02-18 00:46:09

by Jason Gunthorpe

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

On Thu, Feb 17, 2022 at 04:12:20PM -0500, Felix Kuehling wrote:

> I'm thinking of a more theoretical approach: Instead of auditing all users,
> I'd ask, what are the invariants that a vm_normal_page should have. Then
> check, whether our DEVICE_COHERENT pages satisfy them. But maybe the concept
> of a vm_normal_page isn't defined clearly enough for that.

I would say the expectation is that only 'page cache and anon' pages
are returned - ie the first union in struct page

This is because the first file in your list I looked at:

static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
unsigned long addr, unsigned long end,
struct mm_walk *walk)

{
page = vm_normal_page(vma, addr, ptent);
[..]
if (pageout) {
if (!isolate_lru_page(page)) {

Uses the LRU field, so this is incompatible with all the other page
types.

One mitigation of this might be to formally make vm_normal_page() ==
'pte to page cache and anon page' and add a new function that is 'pte
to any struct page'

Then go through and sort callers as appropriate.

The 'pte to page cache and anon page' can detect ZONE_DEVICE by
calling is_zone_device_page() instead of pte_devmap() and then continue
to return NULL. This same trick will fix GUP_fast.

Jason

2022-02-18 01:31:48

by Alistair Popple

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

Felix Kuehling <[email protected]> writes:

> Am 2022-02-16 um 07:26 schrieb Jason Gunthorpe:
>> The other place that needs careful audit is all the callers using
>> vm_normal_page() - they must all be able to accept a ZONE_DEVICE page
>> if we don't set pte_devmap.
>
> How much code are we talking about here? A quick search finds 26 call-sites in
> 12 files in current master:
>
> fs/proc/task_mmu.c
> mm/hmm.c
> mm/gup.c
> mm/huge_memory.c (vm_normal_page_pmd)
> mm/khugepaged.c
> mm/madvise.c
> mm/mempolicy.c
> mm/memory.c
> mm/mlock.c
> mm/migrate.c
> mm/mprotect.c
> mm/memcontrol.c
>
> I'm thinking of a more theoretical approach: Instead of auditing all users, I'd
> ask, what are the invariants that a vm_normal_page should have. Then check,
> whether our DEVICE_COHERENT pages satisfy them. But maybe the concept of a
> vm_normal_page isn't defined clearly enough for that.
>
> That said, I think we (Alex and myself) made an implicit assumption from the
> start, that a DEVICE_COHERENT page should behave a lot like a normal page in
> terms of VMA mappings, even if we didn't know what that means in detail.

Yes I'm afraid I made a similar mistake when reviewing this, forgetting that
DEVICE_COHERENT pages are not LRU pages and therefore need special treatment in
some places. So for now I will have to withdraw my reviewed-by until this has
been looked at more closely, because as you note below accidentally treating
them as LRU pages leads to a bad time.

> I can now at least name some differences between DEVICE_COHERENT and normal
> pages: how the memory is allocated, how data is migrated into DEVICE_COHERENT
> pages and that it can't be on any LRU list (because the lru list_head in struct
> page is aliased by pgmap and zone_device_data). Maybe I'll find more differences
> if I keep digging.
>
> Regards,
>   Felix
>
>
>>
>> Jason



2022-02-18 19:54:40

by Felix Kuehling

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

Am 2022-02-17 um 19:19 schrieb Jason Gunthorpe:
> On Thu, Feb 17, 2022 at 04:12:20PM -0500, Felix Kuehling wrote:
>
>> I'm thinking of a more theoretical approach: Instead of auditing all users,
>> I'd ask, what are the invariants that a vm_normal_page should have. Then
>> check, whether our DEVICE_COHERENT pages satisfy them. But maybe the concept
>> of a vm_normal_page isn't defined clearly enough for that.
> I would say the expectation is that only 'page cache and anon' pages
> are returned - ie the first union in struct page
>
> This is because the first file in your list I looked at:
>
> static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> unsigned long addr, unsigned long end,
> struct mm_walk *walk)
>
> {
> page = vm_normal_page(vma, addr, ptent);
> [..]
> if (pageout) {
> if (!isolate_lru_page(page)) {
>
> Uses the LRU field, so this is incompatible with all the other page
> types.
>
> One mitigation of this might be to formally make vm_normal_page() ==
> 'pte to page cache and anon page' and add a new function that is 'pte
> to any struct page'
>
> Then go through and sort callers as appropriate.
>
> The 'pte to page cache and anon page' can detect ZONE_DEVICE by
> calling is_zone_device_page() instead of pte_devmap() and then continue
> to return NULL. This same trick will fix GUP_fast.

Sounds good to me. What about vm_normal_page_pmd? Should we remove the
pmd_devmap check from that function as well? I'm not even sure what a
huge zone_device page would look like, but maybe that's a worthwhile
future optimization for our driver.

I'd propose the function names vm_normal_page and
vm_normal_or_device_page for the two functions you described. The latter
would basically be the current vm_normal_page with the pte_devmap check
removed. vm_normal_page could be implemented as a wrapper around
vm_normal_or_device_page, which just adds the !is_zone_device_page() check.

Regards,
  Felix
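
Concretely, the wrapper shape described in this mail would look roughly like
the sketch below, using the names proposed here (not necessarily what any
final patch ends up calling them). Note the NULL check before
is_zone_device_page(): vm_normal_or_device_page() can legitimately return
NULL for special mappings and the zero page.

struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
			    pte_t pte)
{
	/* Today's vm_normal_page() body, minus the pte_devmap() bail-out. */
	struct page *page = vm_normal_or_device_page(vma, addr, pte);

	/* The LRU-only variant additionally filters out ZONE_DEVICE pages. */
	if (page && is_zone_device_page(page))
		return NULL;

	return page;
}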


>
> Jason
>

2022-02-19 18:49:34

by Jason Gunthorpe

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

On Fri, Feb 18, 2022 at 02:20:45PM -0500, Felix Kuehling wrote:
> Am 2022-02-17 um 19:19 schrieb Jason Gunthorpe:
> > On Thu, Feb 17, 2022 at 04:12:20PM -0500, Felix Kuehling wrote:
> >
> > > I'm thinking of a more theoretical approach: Instead of auditing all users,
> > > I'd ask, what are the invariants that a vm_normal_page should have. Then
> > > check, whether our DEVICE_COHERENT pages satisfy them. But maybe the concept
> > > of a vm_normal_page isn't defined clearly enough for that.
> > I would say the expectation is that only 'page cache and anon' pages
> > are returned - ie the first union in struct page
> >
> > This is because the first file in your list I looked at:
> >
> > static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> > unsigned long addr, unsigned long end,
> > struct mm_walk *walk)
> >
> > {
> > page = vm_normal_page(vma, addr, ptent);
> > [..]
> > if (pageout) {
> > if (!isolate_lru_page(page)) {
> >
> > Uses the LRU field, so this is incompatible with all the other page
> > types.
> >
> > One mitigation of this might be to formally make vm_normal_page() ==
> > 'pte to page cache and anon page' and add a new function that is 'pte
> > to any struct page'
> >
> > Then go through and sort callers as appropriate.
> >
> > The 'pte to page cache and anon page' can detect ZONE_DEVICE by
> > calling is_zone_device_page() instead of pte_devmap() and then continue
> > to return NULL. This same trick will fix GUP_fast.
>
> Sounds good to me. What about vm_normal_page_pmd? Should we remove the
> pmd_devmap check from that function as well? I'm not even sure what a huge
> zone_device page would look like, but maybe that's a worthwhile future
> optimization for our driver.

IIRC there are other problems here as PMDs are currently wired to THPs
and not general at all..

We have huge zone_device pages, it is just any compound page of
contiguous pfns - you should be aggregating any contiguous string of
logical PFNs together into a folio for performance. If the folio is
stuffed into a PMD or not is a different question..

> I'd propose the function names vm_normal_page and vm_normal_or_device_page
> for the two functions you described.

I wouldn't say device_page, it should be any type of page - though
device_page is the only other option ATM, AFAIK.

> current vm_normal_page with the pte_devmap check removed. vm_normal_page
> could be implemented as a wrapper around vm_normal_or_device_page, which
> just adds the !is_zone_device_page() check.

Yes

Jason

2022-02-20 06:03:04

by Felix Kuehling

Subject: Re: [PATCH v6 01/10] mm: add zone device coherent type memory support

Am 2022-02-18 um 14:26 schrieb Jason Gunthorpe:
> On Fri, Feb 18, 2022 at 02:20:45PM -0500, Felix Kuehling wrote:
>> Am 2022-02-17 um 19:19 schrieb Jason Gunthorpe:
>>> On Thu, Feb 17, 2022 at 04:12:20PM -0500, Felix Kuehling wrote:
>>>
>>>> I'm thinking of a more theoretical approach: Instead of auditing all users,
>>>> I'd ask, what are the invariants that a vm_normal_page should have. Then
>>>> check, whether our DEVICE_COHERENT pages satisfy them. But maybe the concept
>>>> of a vm_normal_page isn't defined clearly enough for that.
>>> I would say the expectation is that only 'page cache and anon' pages
>>> are returned - ie the first union in struct page
>>>
>>> This is because the first file in your list I looked at:
>>>
>>> static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>> unsigned long addr, unsigned long end,
>>> struct mm_walk *walk)
>>>
>>> {
>>> page = vm_normal_page(vma, addr, ptent);
>>> [..]
>>> if (pageout) {
>>> if (!isolate_lru_page(page)) {
>>>
>>> Uses the LRU field, so this is incompatible with all the other page
>>> types.
>>>
>>> One mitigation of this might be to formally make vm_normal_page() ==
>>> 'pte to page cache and anon page' and add a new function that is 'pte
>>> to any struct page'
>>>
>>> Then go through and sort callers as appropriate.
>>>
>>> The 'pte to page cache and anon page' can detect ZONE_DEVICE by
>>> calling is_zone_device_page() instead of pte_devmap() and then continue
>>> to return NULL. This same trick will fix GUP_fast.
>> Sounds good to me. What about vm_normal_page_pmd? Should we remove the
>> pmd_devmap check from that function as well? I'm not even sure what a huge
>> zone_device page would look like, but maybe that's a worthwhile future
>> optimization for our driver.
> IIRC there are other problems here as PMDs are currently wired to THPs
> and not general at all..
>
> We have huge zone_device pages, it is just any compound page of
> contiguous pfns - you should be aggregating any contiguous string of
> logical PFNs together into a folio for performance. If the folio is
> stuffed into a PMD or not is a different question..
>
>> I'd propose the function names vm_normal_page and vm_normal_or_device_page
>> for the two functions you described.
> I wouldn't say device_page, it should be any type of page - though
> device_page is the only other option ATM, AFAIK.

Ok, then how about vm_normal_lru_page and vm_normal_any_page?

Regards,
  Felix


>
>> current vm_normal_page with the pte_devmap check removed. vm_normal_page
>> could be implemented as a wrapper around vm_normal_or_device_page, which
>> just adds the !is_zone_device_page() check.
> Yes
>
> Jason

2022-02-28

by Alex Sierra

Subject: [PATCH] mm: split vm_normal_pages for LRU and non-LRU handling

DEVICE_COHERENT pages introduce a subtle distinction in the way
"normal" pages can be used by various callers throughout the kernel.
They behave like normal pages for purposes of mapping in CPU page
tables, and for COW. But they do not support LRU lists, NUMA
migration or THP. Therefore we split vm_normal_page into two
functions vm_normal_any_page and vm_normal_lru_page. The latter will
only return pages that can be put on an LRU list and that support
NUMA migration and THP.

We also introduced a FOLL_LRU flag that adds the same behaviour to
follow_page and related APIs, to allow callers to specify that they
expect to put pages on an LRU list.

Signed-off-by: Alex Sierra <[email protected]>
---
fs/proc/task_mmu.c | 12 +++++-----
include/linux/mm.h | 53 ++++++++++++++++++++++++---------------------
mm/gup.c | 10 +++++----
mm/hmm.c | 2 +-
mm/huge_memory.c | 2 +-
mm/khugepaged.c | 8 +++----
mm/ksm.c | 4 ++--
mm/madvise.c | 4 ++--
mm/memcontrol.c | 2 +-
mm/memory.c | 38 ++++++++++++++++++++++----------
mm/mempolicy.c | 4 ++--
mm/migrate.c | 2 +-
mm/migrate_device.c | 2 +-
mm/mlock.c | 6 ++---
mm/mprotect.c | 2 +-
15 files changed, 85 insertions(+), 66 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 18f8c3acbb85..4274128fbb4c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -519,7 +519,7 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
struct page *page = NULL;

if (pte_present(*pte)) {
- page = vm_normal_page(vma, addr, *pte);
+ page = vm_normal_any_page(vma, addr, *pte);
} else if (is_swap_pte(*pte)) {
swp_entry_t swpent = pte_to_swp_entry(*pte);

@@ -705,7 +705,7 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
struct page *page = NULL;

if (pte_present(*pte)) {
- page = vm_normal_page(vma, addr, *pte);
+ page = vm_normal_any_page(vma, addr, *pte);
} else if (is_swap_pte(*pte)) {
swp_entry_t swpent = pte_to_swp_entry(*pte);

@@ -1059,7 +1059,7 @@ static inline bool pte_is_pinned(struct vm_area_struct *vma, unsigned long addr,
return false;
if (likely(!test_bit(MMF_HAS_PINNED, &vma->vm_mm->flags)))
return false;
- page = vm_normal_page(vma, addr, pte);
+ page = vm_normal_any_page(vma, addr, pte);
if (!page)
return false;
return page_maybe_dma_pinned(page);
@@ -1172,7 +1172,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
if (!pte_present(ptent))
continue;

- page = vm_normal_page(vma, addr, ptent);
+ page = vm_normal_any_page(vma, addr, ptent);
if (!page)
continue;

@@ -1383,7 +1383,7 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
if (pm->show_pfn)
frame = pte_pfn(pte);
flags |= PM_PRESENT;
- page = vm_normal_page(vma, addr, pte);
+ page = vm_normal_any_page(vma, addr, pte);
if (pte_soft_dirty(pte))
flags |= PM_SOFT_DIRTY;
if (pte_uffd_wp(pte))
@@ -1761,7 +1761,7 @@ static struct page *can_gather_numa_stats(pte_t pte, struct vm_area_struct *vma,
if (!pte_present(pte))
return NULL;

- page = vm_normal_page(vma, addr, pte);
+ page = vm_normal_lru_page(vma, addr, pte);
if (!page)
return NULL;

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ff9f149ca201..8c9f87151d93 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -593,8 +593,8 @@ struct vm_operations_struct {
unsigned long addr);
#endif
/*
- * Called by vm_normal_page() for special PTEs to find the
- * page for @addr. This is useful if the default behavior
+ * Called by vm_normal_x_page() for special PTEs to find the
+ * page for @addr. This is useful if the default behavior
* (using pte_page()) would not find the correct page.
*/
struct page *(*find_special_page)(struct vm_area_struct *vma,
@@ -1781,7 +1781,9 @@ static inline bool can_do_mlock(void) { return false; }
extern int user_shm_lock(size_t, struct ucounts *);
extern void user_shm_unlock(size_t, struct ucounts *);

-struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+struct page *vm_normal_any_page(struct vm_area_struct *vma, unsigned long addr,
+ pte_t pte);
+struct page *vm_normal_lru_page(struct vm_area_struct *vma, unsigned long addr,
pte_t pte);
struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t pmd);
@@ -2880,27 +2882,28 @@ static inline vm_fault_t vmf_error(int err)
struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
unsigned int foll_flags);

-#define FOLL_WRITE 0x01 /* check pte is writable */
-#define FOLL_TOUCH 0x02 /* mark page accessed */
-#define FOLL_GET 0x04 /* do get_page on page */
-#define FOLL_DUMP 0x08 /* give error on hole if it would be zero */
-#define FOLL_FORCE 0x10 /* get_user_pages read/write w/o permission */
-#define FOLL_NOWAIT 0x20 /* if a disk transfer is needed, start the IO
- * and return without waiting upon it */
-#define FOLL_POPULATE 0x40 /* fault in pages (with FOLL_MLOCK) */
-#define FOLL_NOFAULT 0x80 /* do not fault in pages */
-#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
-#define FOLL_NUMA 0x200 /* force NUMA hinting page fault */
-#define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
-#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */
-#define FOLL_MLOCK 0x1000 /* lock present pages */
-#define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */
-#define FOLL_COW 0x4000 /* internal GUP flag */
-#define FOLL_ANON 0x8000 /* don't do file mappings */
-#define FOLL_LONGTERM 0x10000 /* mapping lifetime is indefinite: see below */
-#define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
-#define FOLL_PIN 0x40000 /* pages must be released via unpin_user_page */
-#define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */
+#define FOLL_WRITE 0x01 /* check pte is writable */
+#define FOLL_TOUCH 0x02 /* mark page accessed */
+#define FOLL_GET 0x04 /* do get_page on page */
+#define FOLL_DUMP 0x08 /* give error on hole if it would be zero */
+#define FOLL_FORCE 0x10 /* get_user_pages read/write w/o permission */
+#define FOLL_NOWAIT 0x20 /* if a disk transfer is needed, start the IO
+ * and return without waiting upon it */
+#define FOLL_POPULATE 0x40 /* fault in pages (with FOLL_MLOCK) */
+#define FOLL_NOFAULT 0x80 /* do not fault in pages */
+#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
+#define FOLL_NUMA 0x200 /* force NUMA hinting page fault */
+#define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
+#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */
+#define FOLL_MLOCK 0x1000 /* lock present pages */
+#define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */
+#define FOLL_COW 0x4000 /* internal GUP flag */
+#define FOLL_ANON 0x8000 /* don't do file mappings */
+#define FOLL_LONGTERM 0x10000 /* mapping lifetime is indefinite: see below */
+#define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
+#define FOLL_PIN 0x40000 /* pages must be released via unpin_user_page */
+#define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */
+#define FOLL_LRU 0x100000 /* return only LRU (anon or page cache) */

/*
* FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
@@ -3227,7 +3230,7 @@ extern long copy_huge_page_from_user(struct page *dst_page,
* @vma: Pointer to the struct vm_area_struct to consider
*
* Whether transhuge page-table entries are considered "special" following
- * the definition in vm_normal_page().
+ * the definition in vm_normal_x_page().
*
* Return: true if transhuge page-table entries should be considered special,
* false otherwise.
diff --git a/mm/gup.c b/mm/gup.c
index 41349b685eaf..9e172c906ded 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -539,8 +539,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
pte_unmap_unlock(ptep, ptl);
return NULL;
}
-
- page = vm_normal_page(vma, address, pte);
+ if (flags & (FOLL_MLOCK | FOLL_LRU))
+ page = vm_normal_lru_page(vma, address, pte);
+ else
+ page = vm_normal_any_page(vma, address, pte);
if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
/*
* Only return device mapping pages in the FOLL_GET or FOLL_PIN
@@ -824,7 +826,7 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma,
*
* Return: the mapped (struct page *), %NULL if no mapping exists, or
* an error pointer if there is a mapping to something not represented
- * by a page descriptor (see also vm_normal_page()).
+ * by a page descriptor (see also vm_normal_x_page()).
*/
static struct page *follow_page_mask(struct vm_area_struct *vma,
unsigned long address, unsigned int flags,
@@ -917,7 +919,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
*vma = get_gate_vma(mm);
if (!page)
goto out;
- *page = vm_normal_page(*vma, address, *pte);
+ *page = vm_normal_any_page(*vma, address, *pte);
if (!*page) {
if ((gup_flags & FOLL_DUMP) || !is_zero_pfn(pte_pfn(*pte)))
goto unmap;
diff --git a/mm/hmm.c b/mm/hmm.c
index bd56641c79d4..90c949d66712 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -300,7 +300,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
* Since each architecture defines a struct page for the zero page, just
* fall through and treat it like a normal page.
*/
- if (!vm_normal_page(walk->vma, addr, pte) &&
+ if (!vm_normal_any_page(walk->vma, addr, pte) &&
!pte_devmap(pte) &&
!is_zero_pfn(pte_pfn(pte))) {
if (hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0)) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 406a3c28c026..ea1efc825774 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2966,7 +2966,7 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
}

/* FOLL_DUMP to ignore special (like zero) pages */
- follflags = FOLL_GET | FOLL_DUMP;
+ follflags = FOLL_GET | FOLL_DUMP | FOLL_LRU;
page = follow_page(vma, addr, follflags);

if (IS_ERR(page))
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 131492fd1148..a7153db09afa 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -627,7 +627,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
result = SCAN_PTE_NON_PRESENT;
goto out;
}
- page = vm_normal_page(vma, address, pteval);
+ page = vm_normal_lru_page(vma, address, pteval);
if (unlikely(!page)) {
result = SCAN_PAGE_NULL;
goto out;
@@ -1286,7 +1286,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
if (pte_write(pteval))
writable = true;

- page = vm_normal_page(vma, _address, pteval);
+ page = vm_normal_lru_page(vma, _address, pteval);
if (unlikely(!page)) {
result = SCAN_PAGE_NULL;
goto out_unmap;
@@ -1494,7 +1494,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
if (!pte_present(*pte))
goto abort;

- page = vm_normal_page(vma, addr, *pte);
+ page = vm_normal_lru_page(vma, addr, *pte);

/*
* Note that uprobe, debugger, or MAP_PRIVATE may change the
@@ -1512,7 +1512,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)

if (pte_none(*pte))
continue;
- page = vm_normal_page(vma, addr, *pte);
+ page = vm_normal_lru_page(vma, addr, *pte);
page_remove_rmap(page, false);
}

diff --git a/mm/ksm.c b/mm/ksm.c
index c20bd4d9a0d9..352d37e44694 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -474,7 +474,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
do {
cond_resched();
page = follow_page(vma, addr,
- FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
+ FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE | FOLL_LRU);
if (IS_ERR_OR_NULL(page))
break;
if (PageKsm(page))
@@ -559,7 +559,7 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
if (!vma)
goto out;

- page = follow_page(vma, addr, FOLL_GET);
+ page = follow_page(vma, addr, FOLL_GET | FOLL_LRU);
if (IS_ERR_OR_NULL(page))
goto out;
if (PageAnon(page)) {
diff --git a/mm/madvise.c b/mm/madvise.c
index 5604064df464..1a553aad9aa3 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -439,7 +439,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
if (!pte_present(ptent))
continue;

- page = vm_normal_page(vma, addr, ptent);
+ page = vm_normal_lru_page(vma, addr, ptent);
if (!page)
continue;

@@ -649,7 +649,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
continue;
}

- page = vm_normal_page(vma, addr, ptent);
+ page = vm_normal_lru_page(vma, addr, ptent);
if (!page)
continue;

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 10259c35fde2..9677eb27dea8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5476,7 +5476,7 @@ enum mc_target_type {
static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
unsigned long addr, pte_t ptent)
{
- struct page *page = vm_normal_page(vma, addr, ptent);
+ struct page *page = vm_normal_any_page(vma, addr, ptent);

if (!page || !page_mapped(page))
return NULL;
diff --git a/mm/memory.c b/mm/memory.c
index c125c4969913..cff84e6a6c4b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -565,7 +565,7 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
}

/*
- * vm_normal_page -- This function gets the "struct page" associated with a pte.
+ * vm_normal_any_page -- This function gets the "struct page" associated with a pte.
*
* "Special" mappings do not wish to be associated with a "struct page" (either
* it doesn't exist, or it exists but they don't want to touch it). In this
@@ -606,7 +606,7 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
* PFNMAP mappings in order to support COWable mappings.
*
*/
-struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+struct page *vm_normal_any_page(struct vm_area_struct *vma, unsigned long addr,
pte_t pte)
{
unsigned long pfn = pte_pfn(pte);
@@ -620,8 +620,6 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
return NULL;
if (is_zero_pfn(pfn))
return NULL;
- if (pte_devmap(pte))
- return NULL;

print_bad_pte(vma, addr, pte, NULL);
return NULL;
@@ -661,6 +659,22 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
return pfn_to_page(pfn);
}

+/*
+ * vm_normal_lru_page -- This function gets the "struct page" associated
+ * with a pte only for page cache and anon page. These pages are LRU handled.
+ */
+struct page *vm_normal_lru_page(struct vm_area_struct *vma, unsigned long addr,
+ pte_t pte)
+{
+ struct page *page;
+
+ page = vm_normal_any_page(vma, addr, pte);
+ if (is_zone_device_page(page))
+ return NULL;
+
+ return page;
+}
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t pmd)
@@ -670,7 +684,7 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
/*
* There is no pmd_special() but there may be special pmds, e.g.
* in a direct-access (dax) mapping, so let's just replicate the
- * !CONFIG_ARCH_HAS_PTE_SPECIAL case from vm_normal_page() here.
+ * !CONFIG_ARCH_HAS_PTE_SPECIAL case from vm_normal_any_page() here.
*/
if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
if (vma->vm_flags & VM_MIXEDMAP) {
@@ -946,7 +960,7 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
pte_t pte = *src_pte;
struct page *page;

- page = vm_normal_page(src_vma, addr, pte);
+ page = vm_normal_any_page(src_vma, addr, pte);
if (page) {
int retval;

@@ -1358,7 +1372,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
if (pte_present(ptent)) {
struct page *page;

- page = vm_normal_page(vma, addr, ptent);
+ page = vm_normal_any_page(vma, addr, ptent);
if (unlikely(zap_skip_check_mapping(details, page)))
continue;
ptent = ptep_get_and_clear_full(mm, addr, pte,
@@ -2168,7 +2182,7 @@ EXPORT_SYMBOL(vmf_insert_pfn);

static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn)
{
- /* these checks mirror the abort conditions in vm_normal_page */
+ /* these checks mirror the abort conditions in vm_normal_lru_page */
if (vma->vm_flags & VM_MIXEDMAP)
return true;
if (pfn_t_devmap(pfn))
@@ -2198,7 +2212,7 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,

/*
* If we don't have pte special, then we have to use the pfn_valid()
- * based VM_MIXEDMAP scheme (see vm_normal_page), and thus we *must*
+ * based VM_MIXEDMAP scheme (see vm_normal_any_page), and thus we *must*
* refcount the page if pfn_valid is true (hence insert_page rather
* than insert_pfn). If a zero_pfn were inserted into a VM_MIXEDMAP
* without pte special, it would there be refcounted as a normal page.
@@ -2408,7 +2422,7 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
* There's a horrible special case to handle copy-on-write
* behaviour that some programs depend on. We mark the "original"
* un-COW'ed pages by matching them up with "vma->vm_pgoff".
- * See vm_normal_page() for details.
+ * See vm_normal_any_page() for details.
*/
if (is_cow_mapping(vma->vm_flags)) {
if (addr != vma->vm_start || end != vma->vm_end)
@@ -3267,7 +3281,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
mm_tlb_flush_pending(vmf->vma->vm_mm)))
flush_tlb_page(vmf->vma, vmf->address);

- vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
+ vmf->page = vm_normal_any_page(vma, vmf->address, vmf->orig_pte);
if (!vmf->page) {
/*
* VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a
@@ -4364,7 +4378,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
old_pte = ptep_get(vmf->pte);
pte = pte_modify(old_pte, vma->vm_page_prot);

- page = vm_normal_page(vma, vmf->address, pte);
+ page = vm_normal_lru_page(vma, vmf->address, pte);
if (!page)
goto out_map;

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 028e8dd82b44..9962de4981d6 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -527,11 +527,11 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
for (; addr != end; pte++, addr += PAGE_SIZE) {
if (!pte_present(*pte))
continue;
- page = vm_normal_page(vma, addr, *pte);
+ page = vm_normal_lru_page(vma, addr, *pte);
if (!page)
continue;
/*
- * vm_normal_page() filters out zero pages, but there might
+ * vm_normal_lru_page() filters out zero pages, but there might
* still be PageReserved pages to skip, perhaps in a VDSO.
*/
if (PageReserved(page))
diff --git a/mm/migrate.c b/mm/migrate.c
index c31d04b46a5e..17d049311b78 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1614,7 +1614,7 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
goto out;

/* FOLL_DUMP to ignore special (like zero) pages */
- follflags = FOLL_GET | FOLL_DUMP;
+ follflags = FOLL_GET | FOLL_DUMP | FOLL_LRU;
page = follow_page(vma, addr, follflags);

err = PTR_ERR(page);
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 3373b535d5c9..fac1b6978361 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -154,7 +154,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
migrate->cpages++;
goto next;
}
- page = vm_normal_page(migrate->vma, addr, pte);
+ page = vm_normal_any_page(migrate->vma, addr, pte);
if (page && !is_zone_device_page(page) &&
!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
goto next;
diff --git a/mm/mlock.c b/mm/mlock.c
index 8f584eddd305..52613e2f2a70 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -342,7 +342,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
* a non-TPH page already pinned and in the @pvec, and that it belongs to @zone.
*
* The rest of @pvec is filled by subsequent pages within the same pmd and same
- * zone, as long as the pte's are present and vm_normal_page() succeeds. These
+ * zone, as long as the pte's are present and vm_normal_any_page() succeeds. These
* pages also get pinned.
*
* Returns the address of the next page that should be scanned. This equals
@@ -373,7 +373,7 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
struct page *page = NULL;
pte++;
if (pte_present(*pte))
- page = vm_normal_page(vma, start, *pte);
+ page = vm_normal_lru_page(vma, start, *pte);
/*
* Break if page could not be obtained or the page's node+zone does not
* match
@@ -439,7 +439,7 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
* suits munlock very well (and if somehow an abnormal page
* has sneaked into the range, we won't oops here: great).
*/
- page = follow_page(vma, start, FOLL_GET | FOLL_DUMP);
+ page = follow_page(vma, start, FOLL_GET | FOLL_DUMP | FOLL_LRU);

if (page && !IS_ERR(page)) {
if (PageTransTail(page)) {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 0138dfcdb1d8..d236394d41d5 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -88,7 +88,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
if (pte_protnone(oldpte))
continue;

- page = vm_normal_page(vma, addr, oldpte);
+ page = vm_normal_lru_page(vma, addr, oldpte);
if (!page || PageKsm(page))
continue;

--
2.32.0

2022-03-01 01:04:52

by Felix Kuehling

Subject: Re: [PATCH] mm: split vm_normal_pages for LRU and non-LRU handling


On 2022-02-28 15:34, Alex Sierra wrote:
> DEVICE_COHERENT pages introduce a subtle distinction in the way
> "normal" pages can be used by various callers throughout the kernel.
> They behave like normal pages for purposes of mapping in CPU page
> tables, and for COW. But they do not support LRU lists, NUMA
> migration or THP.

Should have mentioned KSM here as well for completeness.


> Therefore we split vm_normal_page into two
> functions vm_normal_any_page and vm_normal_lru_page. The latter will
> only return pages that can be put on an LRU list and that support
> NUMA migration and THP.
>
> We also introduced a FOLL_LRU flag that adds the same behaviour to
> follow_page and related APIs, to allow callers to specify that they
> expect to put pages on an LRU list.
>
> Signed-off-by: Alex Sierra <[email protected]>

Acked-by: Felix Kuehling <[email protected]>

FWIW. Full disclosure, Alex and I worked on this together, but it's a
bit like the blind leading the blind. ;) It's mostly untested at this
point. Alex is working on adding tests for get_user_pages of
DEVICE_COHERENT pages without FOLL_LONGTERM to test_hmm and also a test
for COW of DEVICE_COHERENT pages.

A few more nit-picks inline.


> ---
> fs/proc/task_mmu.c | 12 +++++-----
> include/linux/mm.h | 53 ++++++++++++++++++++++++---------------------
> mm/gup.c | 10 +++++----
> mm/hmm.c | 2 +-
> mm/huge_memory.c | 2 +-
> mm/khugepaged.c | 8 +++----
> mm/ksm.c | 4 ++--
> mm/madvise.c | 4 ++--
> mm/memcontrol.c | 2 +-
> mm/memory.c | 38 ++++++++++++++++++++++----------
> mm/mempolicy.c | 4 ++--
> mm/migrate.c | 2 +-
> mm/migrate_device.c | 2 +-
> mm/mlock.c | 6 ++---
> mm/mprotect.c | 2 +-
> 15 files changed, 85 insertions(+), 66 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 18f8c3acbb85..4274128fbb4c 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -519,7 +519,7 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
> struct page *page = NULL;
>
> if (pte_present(*pte)) {
> - page = vm_normal_page(vma, addr, *pte);
> + page = vm_normal_any_page(vma, addr, *pte);
> } else if (is_swap_pte(*pte)) {
> swp_entry_t swpent = pte_to_swp_entry(*pte);
>
> @@ -705,7 +705,7 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
> struct page *page = NULL;
>
> if (pte_present(*pte)) {
> - page = vm_normal_page(vma, addr, *pte);
> + page = vm_normal_any_page(vma, addr, *pte);
> } else if (is_swap_pte(*pte)) {
> swp_entry_t swpent = pte_to_swp_entry(*pte);
>
> @@ -1059,7 +1059,7 @@ static inline bool pte_is_pinned(struct vm_area_struct *vma, unsigned long addr,
> return false;
> if (likely(!test_bit(MMF_HAS_PINNED, &vma->vm_mm->flags)))
> return false;
> - page = vm_normal_page(vma, addr, pte);
> + page = vm_normal_any_page(vma, addr, pte);
> if (!page)
> return false;
> return page_maybe_dma_pinned(page);
> @@ -1172,7 +1172,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
> if (!pte_present(ptent))
> continue;
>
> - page = vm_normal_page(vma, addr, ptent);
> + page = vm_normal_any_page(vma, addr, ptent);
> if (!page)
> continue;
>
> @@ -1383,7 +1383,7 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
> if (pm->show_pfn)
> frame = pte_pfn(pte);
> flags |= PM_PRESENT;
> - page = vm_normal_page(vma, addr, pte);
> + page = vm_normal_any_page(vma, addr, pte);
> if (pte_soft_dirty(pte))
> flags |= PM_SOFT_DIRTY;
> if (pte_uffd_wp(pte))
> @@ -1761,7 +1761,7 @@ static struct page *can_gather_numa_stats(pte_t pte, struct vm_area_struct *vma,
> if (!pte_present(pte))
> return NULL;
>
> - page = vm_normal_page(vma, addr, pte);
> + page = vm_normal_lru_page(vma, addr, pte);
> if (!page)
> return NULL;
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ff9f149ca201..8c9f87151d93 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -593,8 +593,8 @@ struct vm_operations_struct {
> unsigned long addr);
> #endif
> /*
> - * Called by vm_normal_page() for special PTEs to find the
> - * page for @addr. This is useful if the default behavior
> + * Called by vm_normal_x_page() for special PTEs to find the

I'd use vm_normal_*_page in these comments to avoid confusion, because
vm_normal_x_page looks like a valid symbol name, but no such symbol exists.


> + * page for @addr. This is useful if the default behavior
> * (using pte_page()) would not find the correct page.
> */
> struct page *(*find_special_page)(struct vm_area_struct *vma,
> @@ -1781,7 +1781,9 @@ static inline bool can_do_mlock(void) { return false; }
> extern int user_shm_lock(size_t, struct ucounts *);
> extern void user_shm_unlock(size_t, struct ucounts *);
>
> -struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> +struct page *vm_normal_any_page(struct vm_area_struct *vma, unsigned long addr,
> + pte_t pte);
> +struct page *vm_normal_lru_page(struct vm_area_struct *vma, unsigned long addr,
> pte_t pte);
> struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> pmd_t pmd);
> @@ -2880,27 +2882,28 @@ static inline vm_fault_t vmf_error(int err)
> struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
> unsigned int foll_flags);
>
> -#define FOLL_WRITE 0x01 /* check pte is writable */
> -#define FOLL_TOUCH 0x02 /* mark page accessed */
> -#define FOLL_GET 0x04 /* do get_page on page */
> -#define FOLL_DUMP 0x08 /* give error on hole if it would be zero */
> -#define FOLL_FORCE 0x10 /* get_user_pages read/write w/o permission */
> -#define FOLL_NOWAIT 0x20 /* if a disk transfer is needed, start the IO
> - * and return without waiting upon it */
> -#define FOLL_POPULATE 0x40 /* fault in pages (with FOLL_MLOCK) */
> -#define FOLL_NOFAULT 0x80 /* do not fault in pages */
> -#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
> -#define FOLL_NUMA 0x200 /* force NUMA hinting page fault */
> -#define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
> -#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */
> -#define FOLL_MLOCK 0x1000 /* lock present pages */
> -#define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */
> -#define FOLL_COW 0x4000 /* internal GUP flag */
> -#define FOLL_ANON 0x8000 /* don't do file mappings */
> -#define FOLL_LONGTERM 0x10000 /* mapping lifetime is indefinite: see below */
> -#define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
> -#define FOLL_PIN 0x40000 /* pages must be released via unpin_user_page */
> -#define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */
> +#define FOLL_WRITE 0x01 /* check pte is writable */
> +#define FOLL_TOUCH 0x02 /* mark page accessed */
> +#define FOLL_GET 0x04 /* do get_page on page */
> +#define FOLL_DUMP 0x08 /* give error on hole if it would be zero */
> +#define FOLL_FORCE 0x10 /* get_user_pages read/write w/o permission */
> +#define FOLL_NOWAIT 0x20 /* if a disk transfer is needed, start the IO
> + * and return without waiting upon it */
> +#define FOLL_POPULATE 0x40 /* fault in pages (with FOLL_MLOCK) */
> +#define FOLL_NOFAULT 0x80 /* do not fault in pages */
> +#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
> +#define FOLL_NUMA 0x200 /* force NUMA hinting page fault */
> +#define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
> +#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */
> +#define FOLL_MLOCK 0x1000 /* lock present pages */
> +#define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */
> +#define FOLL_COW 0x4000 /* internal GUP flag */
> +#define FOLL_ANON 0x8000 /* don't do file mappings */
> +#define FOLL_LONGTERM 0x10000 /* mapping lifetime is indefinite: see below */
> +#define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
> +#define FOLL_PIN 0x40000 /* pages must be released via unpin_user_page */
> +#define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */
> +#define FOLL_LRU 0x100000 /* return only LRU (anon or page cache) */
>
> /*
> * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
> @@ -3227,7 +3230,7 @@ extern long copy_huge_page_from_user(struct page *dst_page,
> * @vma: Pointer to the struct vm_area_struct to consider
> *
> * Whether transhuge page-table entries are considered "special" following
> - * the definition in vm_normal_page().
> + * the definition in vm_normal_x_page().

vm_normal_*_page


> *
> * Return: true if transhuge page-table entries should be considered special,
> * false otherwise.
> diff --git a/mm/gup.c b/mm/gup.c
> index 41349b685eaf..9e172c906ded 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -539,8 +539,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
> pte_unmap_unlock(ptep, ptl);
> return NULL;
> }
> -
> - page = vm_normal_page(vma, address, pte);
> + if (flags & (FOLL_MLOCK | FOLL_LRU))
> + page = vm_normal_lru_page(vma, address, pte);
> + else
> + page = vm_normal_any_page(vma, address, pte);
> if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
> /*
> * Only return device mapping pages in the FOLL_GET or FOLL_PIN
> @@ -824,7 +826,7 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma,
> *
> * Return: the mapped (struct page *), %NULL if no mapping exists, or
> * an error pointer if there is a mapping to something not represented
> - * by a page descriptor (see also vm_normal_page()).
> + * by a page descriptor (see also vm_normal_x_page()).

vm_normal_*_page


> */
> static struct page *follow_page_mask(struct vm_area_struct *vma,
> unsigned long address, unsigned int flags,
> @@ -917,7 +919,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
> *vma = get_gate_vma(mm);
> if (!page)
> goto out;
> - *page = vm_normal_page(*vma, address, *pte);
> + *page = vm_normal_any_page(*vma, address, *pte);
> if (!*page) {
> if ((gup_flags & FOLL_DUMP) || !is_zero_pfn(pte_pfn(*pte)))
> goto unmap;
> diff --git a/mm/hmm.c b/mm/hmm.c
> index bd56641c79d4..90c949d66712 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -300,7 +300,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
> * Since each architecture defines a struct page for the zero page, just
> * fall through and treat it like a normal page.
> */
> - if (!vm_normal_page(walk->vma, addr, pte) &&
> + if (!vm_normal_any_page(walk->vma, addr, pte) &&
> !pte_devmap(pte) &&
> !is_zero_pfn(pte_pfn(pte))) {
> if (hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0)) {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 406a3c28c026..ea1efc825774 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2966,7 +2966,7 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
> }
>
> /* FOLL_DUMP to ignore special (like zero) pages */
> - follflags = FOLL_GET | FOLL_DUMP;
> + follflags = FOLL_GET | FOLL_DUMP | FOLL_LRU;
> page = follow_page(vma, addr, follflags);
>
> if (IS_ERR(page))
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 131492fd1148..a7153db09afa 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -627,7 +627,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> result = SCAN_PTE_NON_PRESENT;
> goto out;
> }
> - page = vm_normal_page(vma, address, pteval);
> + page = vm_normal_lru_page(vma, address, pteval);
> if (unlikely(!page)) {
> result = SCAN_PAGE_NULL;
> goto out;
> @@ -1286,7 +1286,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> if (pte_write(pteval))
> writable = true;
>
> - page = vm_normal_page(vma, _address, pteval);
> + page = vm_normal_lru_page(vma, _address, pteval);
> if (unlikely(!page)) {
> result = SCAN_PAGE_NULL;
> goto out_unmap;
> @@ -1494,7 +1494,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
> if (!pte_present(*pte))
> goto abort;
>
> - page = vm_normal_page(vma, addr, *pte);
> + page = vm_normal_lru_page(vma, addr, *pte);
>
> /*
> * Note that uprobe, debugger, or MAP_PRIVATE may change the
> @@ -1512,7 +1512,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
>
> if (pte_none(*pte))
> continue;
> - page = vm_normal_page(vma, addr, *pte);
> + page = vm_normal_lru_page(vma, addr, *pte);
> page_remove_rmap(page, false);
> }
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index c20bd4d9a0d9..352d37e44694 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -474,7 +474,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
> do {
> cond_resched();
> page = follow_page(vma, addr,
> - FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
> + FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE | FOLL_LRU);
> if (IS_ERR_OR_NULL(page))
> break;
> if (PageKsm(page))
> @@ -559,7 +559,7 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
> if (!vma)
> goto out;
>
> - page = follow_page(vma, addr, FOLL_GET);
> + page = follow_page(vma, addr, FOLL_GET | FOLL_LRU);
> if (IS_ERR_OR_NULL(page))
> goto out;
> if (PageAnon(page)) {
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 5604064df464..1a553aad9aa3 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -439,7 +439,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> if (!pte_present(ptent))
> continue;
>
> - page = vm_normal_page(vma, addr, ptent);
> + page = vm_normal_lru_page(vma, addr, ptent);
> if (!page)
> continue;
>
> @@ -649,7 +649,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> continue;
> }
>
> - page = vm_normal_page(vma, addr, ptent);
> + page = vm_normal_lru_page(vma, addr, ptent);
> if (!page)
> continue;
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 10259c35fde2..9677eb27dea8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5476,7 +5476,7 @@ enum mc_target_type {
> static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
> unsigned long addr, pte_t ptent)
> {
> - struct page *page = vm_normal_page(vma, addr, ptent);
> + struct page *page = vm_normal_any_page(vma, addr, ptent);
>
> if (!page || !page_mapped(page))
> return NULL;
> diff --git a/mm/memory.c b/mm/memory.c
> index c125c4969913..cff84e6a6c4b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -565,7 +565,7 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> }
>
> /*
> - * vm_normal_page -- This function gets the "struct page" associated with a pte.
> + * vm_normal_any_page -- This function gets the "struct page" associated with a pte.
> *
> * "Special" mappings do not wish to be associated with a "struct page" (either
> * it doesn't exist, or it exists but they don't want to touch it). In this
> @@ -606,7 +606,7 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> * PFNMAP mappings in order to support COWable mappings.
> *
> */
> -struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> +struct page *vm_normal_any_page(struct vm_area_struct *vma, unsigned long addr,
> pte_t pte)
> {
> unsigned long pfn = pte_pfn(pte);
> @@ -620,8 +620,6 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> return NULL;
> if (is_zero_pfn(pfn))
> return NULL;
> - if (pte_devmap(pte))
> - return NULL;
>
> print_bad_pte(vma, addr, pte, NULL);
> return NULL;
> @@ -661,6 +659,22 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> return pfn_to_page(pfn);
> }
>
> +/*
> + * vm_normal_lru_page -- This function gets the "struct page" associated
> + * with a pte only for page cache and anon page. These pages are LRU handled.
> + */
> +struct page *vm_normal_lru_page(struct vm_area_struct *vma, unsigned long addr,
> + pte_t pte)
> +{
> + struct page *page;
> +
> + page = vm_normal_any_page(vma, addr, pte);
> + if (is_zone_device_page(page))
> + return NULL;
> +
> + return page;
> +}
> +
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> pmd_t pmd)
> @@ -670,7 +684,7 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> /*
> * There is no pmd_special() but there may be special pmds, e.g.
> * in a direct-access (dax) mapping, so let's just replicate the
> - * !CONFIG_ARCH_HAS_PTE_SPECIAL case from vm_normal_page() here.
> + * !CONFIG_ARCH_HAS_PTE_SPECIAL case from vm_normal_any_page() here.
> */
> if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
> if (vma->vm_flags & VM_MIXEDMAP) {
> @@ -946,7 +960,7 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> pte_t pte = *src_pte;
> struct page *page;
>
> - page = vm_normal_page(src_vma, addr, pte);
> + page = vm_normal_any_page(src_vma, addr, pte);
> if (page) {
> int retval;
>
> @@ -1358,7 +1372,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> if (pte_present(ptent)) {
> struct page *page;
>
> - page = vm_normal_page(vma, addr, ptent);
> + page = vm_normal_any_page(vma, addr, ptent);
> if (unlikely(zap_skip_check_mapping(details, page)))
> continue;
> ptent = ptep_get_and_clear_full(mm, addr, pte,
> @@ -2168,7 +2182,7 @@ EXPORT_SYMBOL(vmf_insert_pfn);
>
> static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn)
> {
> - /* these checks mirror the abort conditions in vm_normal_page */
> + /* these checks mirror the abort conditions in vm_normal_lru_page */
> if (vma->vm_flags & VM_MIXEDMAP)
> return true;
> if (pfn_t_devmap(pfn))

If this is to match the new vm_normal_lru_page, it should replace "if
(pfn_t_devmap(pfn))" with a check that the page is not a device page.
But for that it would have to actually look up the struct page.

I'm not sure what to do about this. __vm_insert_mixed still does
something special with devmap pages, which no longer matches
vm_normal_*_page.


> @@ -2198,7 +2212,7 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
>
> /*
> * If we don't have pte special, then we have to use the pfn_valid()
> - * based VM_MIXEDMAP scheme (see vm_normal_page), and thus we *must*
> + * based VM_MIXEDMAP scheme (see vm_normal_any_page), and thus we *must*
> * refcount the page if pfn_valid is true (hence insert_page rather
> * than insert_pfn). If a zero_pfn were inserted into a VM_MIXEDMAP
> * without pte special, it would there be refcounted as a normal page.
> @@ -2408,7 +2422,7 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
> * There's a horrible special case to handle copy-on-write
> * behaviour that some programs depend on. We mark the "original"
> * un-COW'ed pages by matching them up with "vma->vm_pgoff".
> - * See vm_normal_page() for details.
> + * See vm_normal_any_page() for details.
> */
> if (is_cow_mapping(vma->vm_flags)) {
> if (addr != vma->vm_start || end != vma->vm_end)
> @@ -3267,7 +3281,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
> mm_tlb_flush_pending(vmf->vma->vm_mm)))
> flush_tlb_page(vmf->vma, vmf->address);
>
> - vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
> + vmf->page = vm_normal_any_page(vma, vmf->address, vmf->orig_pte);
> if (!vmf->page) {
> /*
> * VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a
> @@ -4364,7 +4378,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
> old_pte = ptep_get(vmf->pte);
> pte = pte_modify(old_pte, vma->vm_page_prot);
>
> - page = vm_normal_page(vma, vmf->address, pte);
> + page = vm_normal_lru_page(vma, vmf->address, pte);
> if (!page)
> goto out_map;
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 028e8dd82b44..9962de4981d6 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -527,11 +527,11 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
> for (; addr != end; pte++, addr += PAGE_SIZE) {
> if (!pte_present(*pte))
> continue;
> - page = vm_normal_page(vma, addr, *pte);
> + page = vm_normal_lru_page(vma, addr, *pte);
> if (!page)
> continue;
> /*
> - * vm_normal_page() filters out zero pages, but there might
> + * vm_normal_lru_page() filters out zero pages, but there might
> * still be PageReserved pages to skip, perhaps in a VDSO.
> */
> if (PageReserved(page))
> diff --git a/mm/migrate.c b/mm/migrate.c
> index c31d04b46a5e..17d049311b78 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1614,7 +1614,7 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
> goto out;
>
> /* FOLL_DUMP to ignore special (like zero) pages */
> - follflags = FOLL_GET | FOLL_DUMP;
> + follflags = FOLL_GET | FOLL_DUMP | FOLL_LRU;
> page = follow_page(vma, addr, follflags);
>
> err = PTR_ERR(page);
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 3373b535d5c9..fac1b6978361 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -154,7 +154,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> migrate->cpages++;
> goto next;
> }
> - page = vm_normal_page(migrate->vma, addr, pte);
> + page = vm_normal_any_page(migrate->vma, addr, pte);
> if (page && !is_zone_device_page(page) &&
> !(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
> goto next;
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 8f584eddd305..52613e2f2a70 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -342,7 +342,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
> * a non-TPH page already pinned and in the @pvec, and that it belongs to @zone.
> *
> * The rest of @pvec is filled by subsequent pages within the same pmd and same
> - * zone, as long as the pte's are present and vm_normal_page() succeeds. These
> + * zone, as long as the pte's are present and vm_normal_any_page() succeeds. These

The comment says vm_normal_any_page. But the function uses
vm_normal_lru_page.

Regards,
  Felix


> * pages also get pinned.
> *
> * Returns the address of the next page that should be scanned. This equals
> @@ -373,7 +373,7 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
> struct page *page = NULL;
> pte++;
> if (pte_present(*pte))
> - page = vm_normal_page(vma, start, *pte);
> + page = vm_normal_lru_page(vma, start, *pte);
> /*
> * Break if page could not be obtained or the page's node+zone does not
> * match
> @@ -439,7 +439,7 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
> * suits munlock very well (and if somehow an abnormal page
> * has sneaked into the range, we won't oops here: great).
> */
> - page = follow_page(vma, start, FOLL_GET | FOLL_DUMP);
> + page = follow_page(vma, start, FOLL_GET | FOLL_DUMP | FOLL_LRU);
>
> if (page && !IS_ERR(page)) {
> if (PageTransTail(page)) {
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 0138dfcdb1d8..d236394d41d5 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -88,7 +88,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> if (pte_protnone(oldpte))
> continue;
>
> - page = vm_normal_page(vma, addr, oldpte);
> + page = vm_normal_lru_page(vma, addr, oldpte);
> if (!page || PageKsm(page))
> continue;
>

2022-03-01 11:24:09

by David Hildenbrand

Subject: Re: [PATCH] mm: split vm_normal_pages for LRU and non-LRU handling

On 28.02.22 21:34, Alex Sierra wrote:
> DEVICE_COHERENT pages introduce a subtle distinction in the way
> "normal" pages can be used by various callers throughout the kernel.
> They behave like normal pages for purposes of mapping in CPU page
> tables, and for COW. But they do not support LRU lists, NUMA
> migration or THP. Therefore we split vm_normal_page into two
> functions vm_normal_any_page and vm_normal_lru_page. The latter will
> only return pages that can be put on an LRU list and that support
> NUMA migration and THP.

Why not s/vm_normal_any_page/vm_normal_page/ and avoid code churn?

>
> We also introduced a FOLL_LRU flag that adds the same behaviour to
> follow_page and related APIs, to allow callers to specify that they
> expect to put pages on an LRU list.

[...]
> -#define FOLL_WRITE 0x01 /* check pte is writable */
> -#define FOLL_TOUCH 0x02 /* mark page accessed */
> -#define FOLL_GET 0x04 /* do get_page on page */
> -#define FOLL_DUMP 0x08 /* give error on hole if it would be zero */
> -#define FOLL_FORCE 0x10 /* get_user_pages read/write w/o permission */
> -#define FOLL_NOWAIT 0x20 /* if a disk transfer is needed, start the IO
> - * and return without waiting upon it */
> -#define FOLL_POPULATE 0x40 /* fault in pages (with FOLL_MLOCK) */
> -#define FOLL_NOFAULT 0x80 /* do not fault in pages */
> -#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
> -#define FOLL_NUMA 0x200 /* force NUMA hinting page fault */
> -#define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
> -#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */
> -#define FOLL_MLOCK 0x1000 /* lock present pages */
> -#define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */
> -#define FOLL_COW 0x4000 /* internal GUP flag */
> -#define FOLL_ANON 0x8000 /* don't do file mappings */
> -#define FOLL_LONGTERM 0x10000 /* mapping lifetime is indefinite: see below */
> -#define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
> -#define FOLL_PIN 0x40000 /* pages must be released via unpin_user_page */
> -#define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */
> +#define FOLL_WRITE 0x01 /* check pte is writable */
> +#define FOLL_TOUCH 0x02 /* mark page accessed */
> +#define FOLL_GET 0x04 /* do get_page on page */
> +#define FOLL_DUMP 0x08 /* give error on hole if it would be zero */
> +#define FOLL_FORCE 0x10 /* get_user_pages read/write w/o permission */
> +#define FOLL_NOWAIT 0x20 /* if a disk transfer is needed, start the IO
> + * and return without waiting upon it */
> +#define FOLL_POPULATE 0x40 /* fault in pages (with FOLL_MLOCK) */
> +#define FOLL_NOFAULT 0x80 /* do not fault in pages */
> +#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
> +#define FOLL_NUMA 0x200 /* force NUMA hinting page fault */
> +#define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
> +#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */
> +#define FOLL_MLOCK 0x1000 /* lock present pages */
> +#define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */
> +#define FOLL_COW 0x4000 /* internal GUP flag */
> +#define FOLL_ANON 0x8000 /* don't do file mappings */
> +#define FOLL_LONGTERM 0x10000 /* mapping lifetime is indefinite: see below */
> +#define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
> +#define FOLL_PIN 0x40000 /* pages must be released via unpin_user_page */
> +#define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */
> +#define FOLL_LRU 0x100000 /* return only LRU (anon or page cache) */
>

Can we minimize code churn, please?


> if (PageReserved(page))
> diff --git a/mm/migrate.c b/mm/migrate.c
> index c31d04b46a5e..17d049311b78 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1614,7 +1614,7 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
> goto out;
>
> /* FOLL_DUMP to ignore special (like zero) pages */
> - follflags = FOLL_GET | FOLL_DUMP;
> + follflags = FOLL_GET | FOLL_DUMP | FOLL_LRU;
> page = follow_page(vma, addr, follflags);

Why wouldn't we want to dump DEVICE_COHERENT pages? This looks wrong.


--
Thanks,

David / dhildenb

2022-03-01 16:54:06

by David Hildenbrand

Subject: Re: [PATCH] mm: split vm_normal_pages for LRU and non-LRU handling

On 01.03.22 17:30, Felix Kuehling wrote:
> On 2022-03-01 at 11:22, David Hildenbrand wrote:
>>>>> if (PageReserved(page))
>>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>>> index c31d04b46a5e..17d049311b78 100644
>>>>> --- a/mm/migrate.c
>>>>> +++ b/mm/migrate.c
>>>>> @@ -1614,7 +1614,7 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
>>>>> goto out;
>>>>>
>>>>> /* FOLL_DUMP to ignore special (like zero) pages */
>>>>> - follflags = FOLL_GET | FOLL_DUMP;
>>>>> + follflags = FOLL_GET | FOLL_DUMP | FOLL_LRU;
>>>>> page = follow_page(vma, addr, follflags);
>>>> Why wouldn't we want to dump DEVICE_COHERENT pages? This looks wrong.
>>> This function later calls isolate_lru_page, which is something you can't
>>> do with a device page.
>>>
>> Then, that code might require care instead. We most certainly don't want
>> to have random memory holes in a dump just because some anonymous memory
>> was migrated to DEVICE_COHERENT.
> I don't think this code is for core dumps. The call chain I see is
>
> SYSCALL_DEFINE6(move_pages, ...) -> kernel_move_pages -> do_pages_move
> -> add_page_for_migration
>

Ah, sorry, I got misled by FOLL_DUMP and thought we'd be in
get_dump_page(). As you said, nothing to do.

--
Thanks,

David / dhildenb

2022-03-01 17:15:24

by Felix Kuehling

Subject: Re: [PATCH] mm: split vm_normal_pages for LRU and non-LRU handling


On 2022-03-01 at 03:03, David Hildenbrand wrote:
> On 28.02.22 21:34, Alex Sierra wrote:
>> DEVICE_COHERENT pages introduce a subtle distinction in the way
>> "normal" pages can be used by various callers throughout the kernel.
>> They behave like normal pages for purposes of mapping in CPU page
>> tables, and for COW. But they do not support LRU lists, NUMA
>> migration or THP. Therefore we split vm_normal_page into two
>> functions vm_normal_any_page and vm_normal_lru_page. The latter will
>> only return pages that can be put on an LRU list and that support
>> NUMA migration and THP.
> Why not s/vm_normal_any_page/vm_normal_page/ and avoid code churn?

I don't care much, personally, what names we end up with. I'll wait for
a consensus here.
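
Whatever names we settle on, the intended relationship between the two
helpers is roughly the following (a minimal sketch only; the exact check
is my assumption here, not necessarily what the patch ends up doing):

/*
 * Sketch: vm_normal_any_page() keeps the current vm_normal_page()
 * semantics and may return device-coherent pages; vm_normal_lru_page()
 * additionally filters out anything that can never sit on an LRU list.
 */
static inline struct page *vm_normal_lru_page(struct vm_area_struct *vma,
					      unsigned long addr, pte_t pte)
{
	struct page *page = vm_normal_any_page(vma, addr, pte);

	/* Device pages (e.g. DEVICE_COHERENT) are never on an LRU list. */
	if (page && is_zone_device_page(page))
		return NULL;
	return page;
}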


>
>> We also introduced a FOLL_LRU flag that adds the same behaviour to
>> follow_page and related APIs, to allow callers to specify that they
>> expect to put pages on an LRU list.
> [...]
>> -#define FOLL_WRITE 0x01 /* check pte is writable */
>> -#define FOLL_TOUCH 0x02 /* mark page accessed */
>> -#define FOLL_GET 0x04 /* do get_page on page */
>> -#define FOLL_DUMP 0x08 /* give error on hole if it would be zero */
>> -#define FOLL_FORCE 0x10 /* get_user_pages read/write w/o permission */
>> -#define FOLL_NOWAIT 0x20 /* if a disk transfer is needed, start the IO
>> - * and return without waiting upon it */
>> -#define FOLL_POPULATE 0x40 /* fault in pages (with FOLL_MLOCK) */
>> -#define FOLL_NOFAULT 0x80 /* do not fault in pages */
>> -#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
>> -#define FOLL_NUMA 0x200 /* force NUMA hinting page fault */
>> -#define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
>> -#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */
>> -#define FOLL_MLOCK 0x1000 /* lock present pages */
>> -#define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */
>> -#define FOLL_COW 0x4000 /* internal GUP flag */
>> -#define FOLL_ANON 0x8000 /* don't do file mappings */
>> -#define FOLL_LONGTERM 0x10000 /* mapping lifetime is indefinite: see below */
>> -#define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
>> -#define FOLL_PIN 0x40000 /* pages must be released via unpin_user_page */
>> -#define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */
>> +#define FOLL_WRITE 0x01 /* check pte is writable */
>> +#define FOLL_TOUCH 0x02 /* mark page accessed */
>> +#define FOLL_GET 0x04 /* do get_page on page */
>> +#define FOLL_DUMP 0x08 /* give error on hole if it would be zero */
>> +#define FOLL_FORCE 0x10 /* get_user_pages read/write w/o permission */
>> +#define FOLL_NOWAIT 0x20 /* if a disk transfer is needed, start the IO
>> + * and return without waiting upon it */
>> +#define FOLL_POPULATE 0x40 /* fault in pages (with FOLL_MLOCK) */
>> +#define FOLL_NOFAULT 0x80 /* do not fault in pages */
>> +#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
>> +#define FOLL_NUMA 0x200 /* force NUMA hinting page fault */
>> +#define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
>> +#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */
>> +#define FOLL_MLOCK 0x1000 /* lock present pages */
>> +#define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */
>> +#define FOLL_COW 0x4000 /* internal GUP flag */
>> +#define FOLL_ANON 0x8000 /* don't do file mappings */
>> +#define FOLL_LONGTERM 0x10000 /* mapping lifetime is indefinite: see below */
>> +#define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
>> +#define FOLL_PIN 0x40000 /* pages must be released via unpin_user_page */
>> +#define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */
>> +#define FOLL_LRU 0x100000 /* return only LRU (anon or page cache) */
>>
> Can we minimize code churn, please?

OK. I guess we could leave the existing defines untouched and not
re-align them for FOLL_LRU, to avoid changing all the comments.
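
Something like the following would keep the existing lines byte-for-byte
identical and only append the new flag (sketch; the exact spacing of the
new line is a detail):

#define FOLL_FAST_ONLY	0x80000	/* gup_fast: prevent fall-back to slow gup */
#define FOLL_LRU    0x100000 /* return only LRU (anon or page cache) */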


>
>
>> if (PageReserved(page))
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index c31d04b46a5e..17d049311b78 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1614,7 +1614,7 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
>> goto out;
>>
>> /* FOLL_DUMP to ignore special (like zero) pages */
>> - follflags = FOLL_GET | FOLL_DUMP;
>> + follflags = FOLL_GET | FOLL_DUMP | FOLL_LRU;
>> page = follow_page(vma, addr, follflags);
> Why wouldn't we want to dump DEVICE_COHERENT pages? This looks wrong.

This function later calls isolate_lru_page, which is something you can't
do with a device page.
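
Roughly, the path in question looks like this with the patch applied
(simplified sketch; error handling and compound-page details elided):

	/*
	 * With FOLL_LRU, follow_page() returns NULL for device pages, so
	 * they never reach isolate_lru_page(), which only works for pages
	 * that actually sit on an LRU list.
	 */
	page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP | FOLL_LRU);
	if (IS_ERR(page) || !page)
		goto out;
	err = isolate_lru_page(page);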

Regards,
  Felix


>

2022-03-01 19:48:28

by David Hildenbrand

Subject: Re: [PATCH] mm: split vm_normal_pages for LRU and non-LRU handling


>>
>>> if (PageReserved(page))
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index c31d04b46a5e..17d049311b78 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -1614,7 +1614,7 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
>>> goto out;
>>>
>>> /* FOLL_DUMP to ignore special (like zero) pages */
>>> - follflags = FOLL_GET | FOLL_DUMP;
>>> + follflags = FOLL_GET | FOLL_DUMP | FOLL_LRU;
>>> page = follow_page(vma, addr, follflags);
>> Why wouldn't we want to dump DEVICE_COHERENT pages? This looks wrong.
>
> This function later calls isolate_lru_page, which is something you can't
> do with a device page.
>

Then, that code might require care instead. We most certainly don't want
to have random memory holes in a dump just because some anonymous memory
was migrated to DEVICE_COHERENT.

--
Thanks,

David / dhildenb

2022-03-01 23:24:51

by Felix Kuehling

Subject: Re: [PATCH] mm: split vm_normal_pages for LRU and non-LRU handling

On 2022-03-01 at 11:22, David Hildenbrand wrote:
>>>> if (PageReserved(page))
>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>> index c31d04b46a5e..17d049311b78 100644
>>>> --- a/mm/migrate.c
>>>> +++ b/mm/migrate.c
>>>> @@ -1614,7 +1614,7 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
>>>> goto out;
>>>>
>>>> /* FOLL_DUMP to ignore special (like zero) pages */
>>>> - follflags = FOLL_GET | FOLL_DUMP;
>>>> + follflags = FOLL_GET | FOLL_DUMP | FOLL_LRU;
>>>> page = follow_page(vma, addr, follflags);
>>> Why wouldn't we want to dump DEVICE_COHERENT pages? This looks wrong.
>> This function later calls isolate_lru_page, which is something you can't
>> do with a device page.
>>
> Then, that code might require care instead. We most certainly don't want
> to have random memory holes in a dump just because some anonymous memory
> was migrated to DEVICE_COHERENT.
I don't think this code is for core dumps. The call chain I see is

SYSCALL_DEFINE6(move_pages, ...) -> kernel_move_pages -> do_pages_move
-> add_page_for_migration

As far as I can tell this is for NUMA migrations.
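
For context, this is the path exercised by a userspace call like the one
below (illustrative example only; it uses the libnuma move_pages()
wrapper, so build with -lnuma):

#include <numaif.h>	/* move_pages(), MPOL_MF_MOVE */
#include <stdio.h>

int main(void)
{
	static char buf[4096] __attribute__((aligned(4096)));
	void *pages[1] = { buf };
	int nodes[1] = { 1 };	/* request migration to NUMA node 1 */
	int status[1];
	long ret;

	buf[0] = 1;	/* touch the page so it is actually populated */
	ret = move_pages(0 /* current process */, 1, pages, nodes, status,
			 MPOL_MF_MOVE);
	if (ret != 0)
		fprintf(stderr, "move_pages failed or page not moved\n");
	else
		printf("page is now on node %d\n", status[0]);
	return 0;
}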

Regards,
  Felix