2021-03-26 00:10:06

by Alistair Popple

[permalink] [raw]
Subject: [PATCH v7 0/8] Add support for SVM atomics in Nouveau

This is the seventh version of a series to add support to Nouveau for
atomic memory operations on OpenCL shared virtual memory (SVM) regions.

This version primarily improves readability of the Nouveau fault priority
calculation code along with other minor functional and cosmetic
improvements listed in the changelogs.

Exclusive device access is implemented by adding a new swap entry type
(SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry. The main
difference is that on fault the original entry is immediately restored by
the fault handler instead of waiting.

Restoring the entry triggers calls to MMU notifers which allows a device
driver to revoke the atomic access permission from the GPU prior to the CPU
finalising the entry.

Patches 1 & 2 refactor existing migration and device private entry
functions.

Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
functionality into separate functions - try_to_migrate_one() and
try_to_munlock_one(). These should not change any functionality, but any
help testing would be much appreciated as I have not been able to test
every usage of try_to_unmap_one().

Patch 5 contains the bulk of the implementation for device exclusive
memory.

Patch 6 contains some additions to the HMM selftests to ensure everything
works as expected.

Patch 7 is a cleanup for the Nouveau SVM implementation.

Patch 8 contains the implementation of atomic access for the Nouveau
driver.

This has been tested using the latest upstream Mesa userspace with a simple
OpenCL test program which checks the results of atomic GPU operations on a
SVM buffer whilst also writing to the same buffer from the CPU.

Alistair Popple (8):
mm: Remove special swap entry functions
mm/swapops: Rework swap entry manipulation code
mm/rmap: Split try_to_munlock from try_to_unmap
mm/rmap: Split migration into its own function
mm: Device exclusive memory access
mm: Selftests for exclusive device memory
nouveau/svm: Refactor nouveau_range_fault
nouveau/svm: Implement atomic SVM access

Documentation/vm/hmm.rst | 19 +-
arch/s390/mm/pgtable.c | 2 +-
drivers/gpu/drm/nouveau/include/nvif/if000c.h | 1 +
drivers/gpu/drm/nouveau/nouveau_svm.c | 156 ++++-
drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h | 1 +
.../drm/nouveau/nvkm/subdev/mmu/vmmgp100.c | 6 +
fs/proc/task_mmu.c | 23 +-
include/linux/mmu_notifier.h | 26 +-
include/linux/rmap.h | 9 +-
include/linux/swap.h | 8 +-
include/linux/swapops.h | 123 ++--
lib/test_hmm.c | 126 +++-
lib/test_hmm_uapi.h | 2 +
mm/debug_vm_pgtable.c | 12 +-
mm/hmm.c | 12 +-
mm/huge_memory.c | 45 +-
mm/hugetlb.c | 10 +-
mm/memcontrol.c | 2 +-
mm/memory.c | 128 +++-
mm/migrate.c | 51 +-
mm/mprotect.c | 18 +-
mm/page_vma_mapped.c | 15 +-
mm/rmap.c | 604 +++++++++++++++---
tools/testing/selftests/vm/hmm-tests.c | 158 +++++
24 files changed, 1282 insertions(+), 275 deletions(-)

--
2.20.1


2021-03-26 00:10:06

by Alistair Popple

[permalink] [raw]
Subject: [PATCH v7 1/8] mm: Remove special swap entry functions

Remove multiple similar inline functions for dealing with different
types of special swap entries.

Both migration and device private swap entries use the swap offset to
store a pfn. Instead of multiple inline functions to obtain a struct
page for each swap entry type use a common function
pfn_swap_entry_to_page(). Also open-code the various entry_to_pfn()
functions as this results is shorter code that is easier to understand.

Signed-off-by: Alistair Popple <[email protected]>
Reviewed-by: Ralph Campbell <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>

---

v7:
* Reworded commit message to include pfn_swap_entry_to_page()
* Added Christoph's Reviewed-by

v6:
* Removed redundant compound_page() call from inside PageLocked()
* Fixed a minor build issue for s390 reported by kernel test bot

v4:
* Added pfn_swap_entry_to_page()
* Reinstated check that migration entries point to locked pages
* Removed #define swapcache_prepare which isn't needed for CONFIG_SWAP=0
builds
---
arch/s390/mm/pgtable.c | 2 +-
fs/proc/task_mmu.c | 23 +++++---------
include/linux/swap.h | 4 +--
include/linux/swapops.h | 69 ++++++++++++++---------------------------
mm/hmm.c | 5 ++-
mm/huge_memory.c | 4 +--
mm/memcontrol.c | 2 +-
mm/memory.c | 10 +++---
mm/migrate.c | 6 ++--
mm/page_vma_mapped.c | 6 ++--
10 files changed, 50 insertions(+), 81 deletions(-)

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 18205f851c24..eec3a9d7176e 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -691,7 +691,7 @@ static void ptep_zap_swap_entry(struct mm_struct *mm, swp_entry_t entry)
if (!non_swap_entry(entry))
dec_mm_counter(mm, MM_SWAPENTS);
else if (is_migration_entry(entry)) {
- struct page *page = migration_entry_to_page(entry);
+ struct page *page = pfn_swap_entry_to_page(entry);

dec_mm_counter(mm, mm_counter(page));
}
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3cec6fbef725..08ee59d945c0 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -514,10 +514,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
} else {
mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT;
}
- } else if (is_migration_entry(swpent))
- page = migration_entry_to_page(swpent);
- else if (is_device_private_entry(swpent))
- page = device_private_entry_to_page(swpent);
+ } else if (is_pfn_swap_entry(swpent))
+ page = pfn_swap_entry_to_page(swpent);
} else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
&& pte_none(*pte))) {
page = xa_load(&vma->vm_file->f_mapping->i_pages,
@@ -549,7 +547,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
swp_entry_t entry = pmd_to_swp_entry(*pmd);

if (is_migration_entry(entry))
- page = migration_entry_to_page(entry);
+ page = pfn_swap_entry_to_page(entry);
}
if (IS_ERR_OR_NULL(page))
return;
@@ -691,10 +689,8 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
} else if (is_swap_pte(*pte)) {
swp_entry_t swpent = pte_to_swp_entry(*pte);

- if (is_migration_entry(swpent))
- page = migration_entry_to_page(swpent);
- else if (is_device_private_entry(swpent))
- page = device_private_entry_to_page(swpent);
+ if (is_pfn_swap_entry(swpent))
+ page = pfn_swap_entry_to_page(swpent);
}
if (page) {
int mapcount = page_mapcount(page);
@@ -1383,11 +1379,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
frame = swp_type(entry) |
(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
flags |= PM_SWAP;
- if (is_migration_entry(entry))
- page = migration_entry_to_page(entry);
-
- if (is_device_private_entry(entry))
- page = device_private_entry_to_page(entry);
+ if (is_pfn_swap_entry(entry))
+ page = pfn_swap_entry_to_page(entry);
}

if (page && !PageAnon(page))
@@ -1444,7 +1437,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
if (pmd_swp_soft_dirty(pmd))
flags |= PM_SOFT_DIRTY;
VM_BUG_ON(!is_pmd_migration_entry(pmd));
- page = migration_entry_to_page(entry);
+ page = pfn_swap_entry_to_page(entry);
}
#endif

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4cc6ec3bf0ab..516104b9334b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -523,8 +523,8 @@ static inline void show_swap_cache_info(void)
{
}

-#define free_swap_and_cache(e) ({(is_migration_entry(e) || is_device_private_entry(e));})
-#define swapcache_prepare(e) ({(is_migration_entry(e) || is_device_private_entry(e));})
+/* used to sanity check ptes in zap_pte_range when CONFIG_SWAP=0 */
+#define free_swap_and_cache(e) is_pfn_swap_entry(e)

static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
{
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index d9b7c9132c2f..139be8235ad2 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -121,16 +121,6 @@ static inline bool is_write_device_private_entry(swp_entry_t entry)
{
return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
}
-
-static inline unsigned long device_private_entry_to_pfn(swp_entry_t entry)
-{
- return swp_offset(entry);
-}
-
-static inline struct page *device_private_entry_to_page(swp_entry_t entry)
-{
- return pfn_to_page(swp_offset(entry));
-}
#else /* CONFIG_DEVICE_PRIVATE */
static inline swp_entry_t make_device_private_entry(struct page *page, bool write)
{
@@ -150,16 +140,6 @@ static inline bool is_write_device_private_entry(swp_entry_t entry)
{
return false;
}
-
-static inline unsigned long device_private_entry_to_pfn(swp_entry_t entry)
-{
- return 0;
-}
-
-static inline struct page *device_private_entry_to_page(swp_entry_t entry)
-{
- return NULL;
-}
#endif /* CONFIG_DEVICE_PRIVATE */

#ifdef CONFIG_MIGRATION
@@ -182,22 +162,6 @@ static inline int is_write_migration_entry(swp_entry_t entry)
return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE);
}

-static inline unsigned long migration_entry_to_pfn(swp_entry_t entry)
-{
- return swp_offset(entry);
-}
-
-static inline struct page *migration_entry_to_page(swp_entry_t entry)
-{
- struct page *p = pfn_to_page(swp_offset(entry));
- /*
- * Any use of migration entries may only occur while the
- * corresponding page is locked
- */
- BUG_ON(!PageLocked(compound_head(p)));
- return p;
-}
-
static inline void make_migration_entry_read(swp_entry_t *entry)
{
*entry = swp_entry(SWP_MIGRATION_READ, swp_offset(*entry));
@@ -217,16 +181,6 @@ static inline int is_migration_entry(swp_entry_t swp)
return 0;
}

-static inline unsigned long migration_entry_to_pfn(swp_entry_t entry)
-{
- return 0;
-}
-
-static inline struct page *migration_entry_to_page(swp_entry_t entry)
-{
- return NULL;
-}
-
static inline void make_migration_entry_read(swp_entry_t *entryp) { }
static inline void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
spinlock_t *ptl) { }
@@ -241,6 +195,29 @@ static inline int is_write_migration_entry(swp_entry_t entry)

#endif

+static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
+{
+ struct page *p = pfn_to_page(swp_offset(entry));
+
+ /*
+ * Any use of migration entries may only occur while the
+ * corresponding page is locked
+ */
+ BUG_ON(is_migration_entry(entry) && !PageLocked(p));
+
+ return p;
+}
+
+/*
+ * A pfn swap entry is a special type of swap entry that always has a pfn stored
+ * in the swap offset. They are used to represent unaddressable device memory
+ * and to restrict access to a page undergoing migration.
+ */
+static inline bool is_pfn_swap_entry(swp_entry_t entry)
+{
+ return is_migration_entry(entry) || is_device_private_entry(entry);
+}
+
struct page_vma_mapped_walk;

#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
diff --git a/mm/hmm.c b/mm/hmm.c
index 943cb2ba4442..3b2dda71d0ed 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -214,7 +214,7 @@ static inline bool hmm_is_device_private_entry(struct hmm_range *range,
swp_entry_t entry)
{
return is_device_private_entry(entry) &&
- device_private_entry_to_page(entry)->pgmap->owner ==
+ pfn_swap_entry_to_page(entry)->pgmap->owner ==
range->dev_private_owner;
}

@@ -257,8 +257,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
cpu_flags = HMM_PFN_VALID;
if (is_write_device_private_entry(entry))
cpu_flags |= HMM_PFN_WRITE;
- *hmm_pfn = device_private_entry_to_pfn(entry) |
- cpu_flags;
+ *hmm_pfn = swp_offset(entry) | cpu_flags;
return 0;
}

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 395c75111d33..a4cda8564bcf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1700,7 +1700,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,

VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
entry = pmd_to_swp_entry(orig_pmd);
- page = pfn_to_page(swp_offset(entry));
+ page = pfn_swap_entry_to_page(entry);
flush_needed = 0;
} else
WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
@@ -2108,7 +2108,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
swp_entry_t entry;

entry = pmd_to_swp_entry(old_pmd);
- page = pfn_to_page(swp_offset(entry));
+ page = pfn_swap_entry_to_page(entry);
write = is_write_migration_entry(entry);
young = false;
soft_dirty = pmd_swp_soft_dirty(old_pmd);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 845eec01ef9d..043840dbe48a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5523,7 +5523,7 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
* as special swap entry in the CPU page table.
*/
if (is_device_private_entry(ent)) {
- page = device_private_entry_to_page(ent);
+ page = pfn_swap_entry_to_page(ent);
/*
* MEMORY_DEVICE_PRIVATE means ZONE_DEVICE page and which have
* a refcount of 1 when free (unlike normal page)
diff --git a/mm/memory.c b/mm/memory.c
index c8e357627318..1c98e3c1c2de 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -730,7 +730,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
}
rss[MM_SWAPENTS]++;
} else if (is_migration_entry(entry)) {
- page = migration_entry_to_page(entry);
+ page = pfn_swap_entry_to_page(entry);

rss[mm_counter(page)]++;

@@ -749,7 +749,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
set_pte_at(src_mm, addr, src_pte, pte);
}
} else if (is_device_private_entry(entry)) {
- page = device_private_entry_to_page(entry);
+ page = pfn_swap_entry_to_page(entry);

/*
* Update rss count even for unaddressable pages, as
@@ -1286,7 +1286,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,

entry = pte_to_swp_entry(ptent);
if (is_device_private_entry(entry)) {
- struct page *page = device_private_entry_to_page(entry);
+ struct page *page = pfn_swap_entry_to_page(entry);

if (unlikely(details && details->check_mapping)) {
/*
@@ -1315,7 +1315,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
else if (is_migration_entry(entry)) {
struct page *page;

- page = migration_entry_to_page(entry);
+ page = pfn_swap_entry_to_page(entry);
rss[mm_counter(page)]--;
}
if (unlikely(!free_swap_and_cache(entry)))
@@ -3282,7 +3282,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
migration_entry_wait(vma->vm_mm, vmf->pmd,
vmf->address);
} else if (is_device_private_entry(entry)) {
- vmf->page = device_private_entry_to_page(entry);
+ vmf->page = pfn_swap_entry_to_page(entry);
ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
} else if (is_hwpoison_entry(entry)) {
ret = VM_FAULT_HWPOISON;
diff --git a/mm/migrate.c b/mm/migrate.c
index 62b81d5257aa..600978d18750 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -321,7 +321,7 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
if (!is_migration_entry(entry))
goto out;

- page = migration_entry_to_page(entry);
+ page = pfn_swap_entry_to_page(entry);

/*
* Once page cache replacement of page migration started, page_count
@@ -361,7 +361,7 @@ void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
ptl = pmd_lock(mm, pmd);
if (!is_pmd_migration_entry(*pmd))
goto unlock;
- page = migration_entry_to_page(pmd_to_swp_entry(*pmd));
+ page = pfn_swap_entry_to_page(pmd_to_swp_entry(*pmd));
if (!get_page_unless_zero(page))
goto unlock;
spin_unlock(ptl);
@@ -2443,7 +2443,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
if (!is_device_private_entry(entry))
goto next;

- page = device_private_entry_to_page(entry);
+ page = pfn_swap_entry_to_page(entry);
if (!(migrate->flags &
MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
page->pgmap->owner != migrate->pgmap_owner)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 86e3a3688d59..eed988ab2e81 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -96,7 +96,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
if (!is_migration_entry(entry))
return false;

- pfn = migration_entry_to_pfn(entry);
+ pfn = swp_offset(entry);
} else if (is_swap_pte(*pvmw->pte)) {
swp_entry_t entry;

@@ -105,7 +105,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
if (!is_device_private_entry(entry))
return false;

- pfn = device_private_entry_to_pfn(entry);
+ pfn = swp_offset(entry);
} else {
if (!pte_present(*pvmw->pte))
return false;
@@ -200,7 +200,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
if (is_migration_entry(pmd_to_swp_entry(*pvmw->pmd))) {
swp_entry_t entry = pmd_to_swp_entry(*pvmw->pmd);

- if (migration_entry_to_page(entry) != page)
+ if (pfn_swap_entry_to_page(entry) != page)
return not_found(pvmw);
return true;
}
--
2.20.1

2021-03-26 00:10:15

by Alistair Popple

[permalink] [raw]
Subject: [PATCH v7 2/8] mm/swapops: Rework swap entry manipulation code

Both migration and device private pages use special swap entries that
are manipluated by a range of inline functions. The arguments to these
are somewhat inconsitent so rework them to remove flag type arguments
and to make the arguments similar for both read and write entry
creation.

Signed-off-by: Alistair Popple <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Ralph Campbell <[email protected]>
---
include/linux/swapops.h | 56 ++++++++++++++++++++++-------------------
mm/debug_vm_pgtable.c | 12 ++++-----
mm/hmm.c | 2 +-
mm/huge_memory.c | 26 +++++++++++++------
mm/hugetlb.c | 10 +++++---
mm/memory.c | 10 +++++---
mm/migrate.c | 26 ++++++++++++++-----
mm/mprotect.c | 10 +++++---
mm/rmap.c | 10 +++++---
9 files changed, 100 insertions(+), 62 deletions(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 139be8235ad2..4dfd807ae52a 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -100,35 +100,35 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
}

#if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
-static inline swp_entry_t make_device_private_entry(struct page *page, bool write)
+static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
{
- return swp_entry(write ? SWP_DEVICE_WRITE : SWP_DEVICE_READ,
- page_to_pfn(page));
+ return swp_entry(SWP_DEVICE_READ, offset);
}

-static inline bool is_device_private_entry(swp_entry_t entry)
+static inline swp_entry_t make_writable_device_private_entry(pgoff_t offset)
{
- int type = swp_type(entry);
- return type == SWP_DEVICE_READ || type == SWP_DEVICE_WRITE;
+ return swp_entry(SWP_DEVICE_WRITE, offset);
}

-static inline void make_device_private_entry_read(swp_entry_t *entry)
+static inline bool is_device_private_entry(swp_entry_t entry)
{
- *entry = swp_entry(SWP_DEVICE_READ, swp_offset(*entry));
+ int type = swp_type(entry);
+ return type == SWP_DEVICE_READ || type == SWP_DEVICE_WRITE;
}

-static inline bool is_write_device_private_entry(swp_entry_t entry)
+static inline bool is_writable_device_private_entry(swp_entry_t entry)
{
return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
}
#else /* CONFIG_DEVICE_PRIVATE */
-static inline swp_entry_t make_device_private_entry(struct page *page, bool write)
+static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
{
return swp_entry(0, 0);
}

-static inline void make_device_private_entry_read(swp_entry_t *entry)
+static inline swp_entry_t make_writable_device_private_entry(pgoff_t offset)
{
+ return swp_entry(0, 0);
}

static inline bool is_device_private_entry(swp_entry_t entry)
@@ -136,35 +136,32 @@ static inline bool is_device_private_entry(swp_entry_t entry)
return false;
}

-static inline bool is_write_device_private_entry(swp_entry_t entry)
+static inline bool is_writable_device_private_entry(swp_entry_t entry)
{
return false;
}
#endif /* CONFIG_DEVICE_PRIVATE */

#ifdef CONFIG_MIGRATION
-static inline swp_entry_t make_migration_entry(struct page *page, int write)
-{
- BUG_ON(!PageLocked(compound_head(page)));
-
- return swp_entry(write ? SWP_MIGRATION_WRITE : SWP_MIGRATION_READ,
- page_to_pfn(page));
-}
-
static inline int is_migration_entry(swp_entry_t entry)
{
return unlikely(swp_type(entry) == SWP_MIGRATION_READ ||
swp_type(entry) == SWP_MIGRATION_WRITE);
}

-static inline int is_write_migration_entry(swp_entry_t entry)
+static inline int is_writable_migration_entry(swp_entry_t entry)
{
return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE);
}

-static inline void make_migration_entry_read(swp_entry_t *entry)
+static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
{
- *entry = swp_entry(SWP_MIGRATION_READ, swp_offset(*entry));
+ return swp_entry(SWP_MIGRATION_READ, offset);
+}
+
+static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
+{
+ return swp_entry(SWP_MIGRATION_WRITE, offset);
}

extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
@@ -174,21 +171,28 @@ extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
extern void migration_entry_wait_huge(struct vm_area_struct *vma,
struct mm_struct *mm, pte_t *pte);
#else
+static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
+{
+ return swp_entry(0, 0);
+}
+
+static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
+{
+ return swp_entry(0, 0);
+}

-#define make_migration_entry(page, write) swp_entry(0, 0)
static inline int is_migration_entry(swp_entry_t swp)
{
return 0;
}

-static inline void make_migration_entry_read(swp_entry_t *entryp) { }
static inline void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
spinlock_t *ptl) { }
static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
unsigned long address) { }
static inline void migration_entry_wait_huge(struct vm_area_struct *vma,
struct mm_struct *mm, pte_t *pte) { }
-static inline int is_write_migration_entry(swp_entry_t entry)
+static inline int is_writable_migration_entry(swp_entry_t entry)
{
return 0;
}
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index a9bd6ce1ba02..3697a80b32f8 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -817,17 +817,17 @@ static void __init swap_migration_tests(void)
* locked, otherwise it stumbles upon a BUG_ON().
*/
__SetPageLocked(page);
- swp = make_migration_entry(page, 1);
+ swp = make_writable_migration_entry(page_to_pfn(page));
WARN_ON(!is_migration_entry(swp));
- WARN_ON(!is_write_migration_entry(swp));
+ WARN_ON(!is_writable_migration_entry(swp));

- make_migration_entry_read(&swp);
+ swp = make_readable_migration_entry(swp_offset(swp));
WARN_ON(!is_migration_entry(swp));
- WARN_ON(is_write_migration_entry(swp));
+ WARN_ON(is_writable_migration_entry(swp));

- swp = make_migration_entry(page, 0);
+ swp = make_readable_migration_entry(page_to_pfn(page));
WARN_ON(!is_migration_entry(swp));
- WARN_ON(is_write_migration_entry(swp));
+ WARN_ON(is_writable_migration_entry(swp));
__ClearPageLocked(page);
__free_page(page);
}
diff --git a/mm/hmm.c b/mm/hmm.c
index 3b2dda71d0ed..11df3ca30b82 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -255,7 +255,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
*/
if (hmm_is_device_private_entry(range, entry)) {
cpu_flags = HMM_PFN_VALID;
- if (is_write_device_private_entry(entry))
+ if (is_writable_device_private_entry(entry))
cpu_flags |= HMM_PFN_WRITE;
*hmm_pfn = swp_offset(entry) | cpu_flags;
return 0;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a4cda8564bcf..89af065cea5b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1051,8 +1051,9 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
swp_entry_t entry = pmd_to_swp_entry(pmd);

VM_BUG_ON(!is_pmd_migration_entry(pmd));
- if (is_write_migration_entry(entry)) {
- make_migration_entry_read(&entry);
+ if (is_writable_migration_entry(entry)) {
+ entry = make_readable_migration_entry(
+ swp_offset(entry));
pmd = swp_entry_to_pmd(entry);
if (pmd_swp_soft_dirty(*src_pmd))
pmd = pmd_swp_mksoft_dirty(pmd);
@@ -1825,13 +1826,14 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
swp_entry_t entry = pmd_to_swp_entry(*pmd);

VM_BUG_ON(!is_pmd_migration_entry(*pmd));
- if (is_write_migration_entry(entry)) {
+ if (is_writable_migration_entry(entry)) {
pmd_t newpmd;
/*
* A protection check is difficult so
* just be safe and disable write
*/
- make_migration_entry_read(&entry);
+ entry = make_readable_migration_entry(
+ swp_offset(entry));
newpmd = swp_entry_to_pmd(entry);
if (pmd_swp_soft_dirty(*pmd))
newpmd = pmd_swp_mksoft_dirty(newpmd);
@@ -2109,7 +2111,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,

entry = pmd_to_swp_entry(old_pmd);
page = pfn_swap_entry_to_page(entry);
- write = is_write_migration_entry(entry);
+ write = is_writable_migration_entry(entry);
young = false;
soft_dirty = pmd_swp_soft_dirty(old_pmd);
uffd_wp = pmd_swp_uffd_wp(old_pmd);
@@ -2141,7 +2143,12 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
*/
if (freeze || pmd_migration) {
swp_entry_t swp_entry;
- swp_entry = make_migration_entry(page + i, write);
+ if (write)
+ swp_entry = make_writable_migration_entry(
+ page_to_pfn(page + i));
+ else
+ swp_entry = make_readable_migration_entry(
+ page_to_pfn(page + i));
entry = swp_entry_to_pte(swp_entry);
if (soft_dirty)
entry = pte_swp_mksoft_dirty(entry);
@@ -2998,7 +3005,10 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
if (pmd_dirty(pmdval))
set_page_dirty(page);
- entry = make_migration_entry(page, pmd_write(pmdval));
+ if (pmd_write(pmdval))
+ entry = make_writable_migration_entry(page_to_pfn(page));
+ else
+ entry = make_readable_migration_entry(page_to_pfn(page));
pmdswp = swp_entry_to_pmd(entry);
if (pmd_soft_dirty(pmdval))
pmdswp = pmd_swp_mksoft_dirty(pmdswp);
@@ -3024,7 +3034,7 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));
if (pmd_swp_soft_dirty(*pvmw->pmd))
pmde = pmd_mksoft_dirty(pmde);
- if (is_write_migration_entry(entry))
+ if (is_writable_migration_entry(entry))
pmde = maybe_pmd_mkwrite(pmde, vma);

flush_cache_range(vma, mmun_start, mmun_start + HPAGE_PMD_SIZE);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8fb42c6dd74b..59645169839b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3795,12 +3795,13 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
is_hugetlb_entry_hwpoisoned(entry))) {
swp_entry_t swp_entry = pte_to_swp_entry(entry);

- if (is_write_migration_entry(swp_entry) && cow) {
+ if (is_writable_migration_entry(swp_entry) && cow) {
/*
* COW mappings require pages in both
* parent and child to be set to read.
*/
- make_migration_entry_read(&swp_entry);
+ swp_entry = make_readable_migration_entry(
+ swp_offset(swp_entry));
entry = swp_entry_to_pte(swp_entry);
set_huge_swap_pte_at(src, addr, src_pte,
entry, sz);
@@ -4970,10 +4971,11 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
if (unlikely(is_hugetlb_entry_migration(pte))) {
swp_entry_t entry = pte_to_swp_entry(pte);

- if (is_write_migration_entry(entry)) {
+ if (is_writable_migration_entry(entry)) {
pte_t newpte;

- make_migration_entry_read(&entry);
+ entry = make_readable_migration_entry(
+ swp_offset(entry));
newpte = swp_entry_to_pte(entry);
set_huge_swap_pte_at(mm, address, ptep,
newpte, huge_page_size(h));
diff --git a/mm/memory.c b/mm/memory.c
index 1c98e3c1c2de..3a5705cfc891 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -734,13 +734,14 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,

rss[mm_counter(page)]++;

- if (is_write_migration_entry(entry) &&
+ if (is_writable_migration_entry(entry) &&
is_cow_mapping(vm_flags)) {
/*
* COW mappings require pages in both
* parent and child to be set to read.
*/
- make_migration_entry_read(&entry);
+ entry = make_readable_migration_entry(
+ swp_offset(entry));
pte = swp_entry_to_pte(entry);
if (pte_swp_soft_dirty(*src_pte))
pte = pte_swp_mksoft_dirty(pte);
@@ -771,9 +772,10 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
* when a device driver is involved (you cannot easily
* save and restore device driver state).
*/
- if (is_write_device_private_entry(entry) &&
+ if (is_writable_device_private_entry(entry) &&
is_cow_mapping(vm_flags)) {
- make_device_private_entry_read(&entry);
+ entry = make_readable_device_private_entry(
+ swp_offset(entry));
pte = swp_entry_to_pte(entry);
if (pte_swp_uffd_wp(*src_pte))
pte = pte_swp_mkuffd_wp(pte);
diff --git a/mm/migrate.c b/mm/migrate.c
index 600978d18750..b752543adb64 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -237,13 +237,18 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
* Recheck VMA as permissions can change since migration started
*/
entry = pte_to_swp_entry(*pvmw.pte);
- if (is_write_migration_entry(entry))
+ if (is_writable_migration_entry(entry))
pte = maybe_mkwrite(pte, vma);
else if (pte_swp_uffd_wp(*pvmw.pte))
pte = pte_mkuffd_wp(pte);

if (unlikely(is_device_private_page(new))) {
- entry = make_device_private_entry(new, pte_write(pte));
+ if (pte_write(pte))
+ entry = make_writable_device_private_entry(
+ page_to_pfn(new));
+ else
+ entry = make_readable_device_private_entry(
+ page_to_pfn(new));
pte = swp_entry_to_pte(entry);
if (pte_swp_soft_dirty(*pvmw.pte))
pte = pte_swp_mksoft_dirty(pte);
@@ -2451,7 +2456,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,

mpfn = migrate_pfn(page_to_pfn(page)) |
MIGRATE_PFN_MIGRATE;
- if (is_write_device_private_entry(entry))
+ if (is_writable_device_private_entry(entry))
mpfn |= MIGRATE_PFN_WRITE;
} else {
if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
@@ -2497,8 +2502,12 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
ptep_get_and_clear(mm, addr, ptep);

/* Setup special migration page table entry */
- entry = make_migration_entry(page, mpfn &
- MIGRATE_PFN_WRITE);
+ if (mpfn & MIGRATE_PFN_WRITE)
+ entry = make_writable_migration_entry(
+ page_to_pfn(page));
+ else
+ entry = make_readable_migration_entry(
+ page_to_pfn(page));
swp_pte = swp_entry_to_pte(entry);
if (pte_present(pte)) {
if (pte_soft_dirty(pte))
@@ -2971,7 +2980,12 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
if (is_device_private_page(page)) {
swp_entry_t swp_entry;

- swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE);
+ if (vma->vm_flags & VM_WRITE)
+ swp_entry = make_writable_device_private_entry(
+ page_to_pfn(page));
+ else
+ swp_entry = make_readable_device_private_entry(
+ page_to_pfn(page));
entry = swp_entry_to_pte(swp_entry);
}
} else {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94188df1ee55..f21b760ec809 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -143,23 +143,25 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
swp_entry_t entry = pte_to_swp_entry(oldpte);
pte_t newpte;

- if (is_write_migration_entry(entry)) {
+ if (is_writable_migration_entry(entry)) {
/*
* A protection check is difficult so
* just be safe and disable write
*/
- make_migration_entry_read(&entry);
+ entry = make_readable_migration_entry(
+ swp_offset(entry));
newpte = swp_entry_to_pte(entry);
if (pte_swp_soft_dirty(oldpte))
newpte = pte_swp_mksoft_dirty(newpte);
if (pte_swp_uffd_wp(oldpte))
newpte = pte_swp_mkuffd_wp(newpte);
- } else if (is_write_device_private_entry(entry)) {
+ } else if (is_writable_device_private_entry(entry)) {
/*
* We do not preserve soft-dirtiness. See
* copy_one_pte() for explanation.
*/
- make_device_private_entry_read(&entry);
+ entry = make_readable_device_private_entry(
+ swp_offset(entry));
newpte = swp_entry_to_pte(entry);
if (pte_swp_uffd_wp(oldpte))
newpte = pte_swp_mkuffd_wp(newpte);
diff --git a/mm/rmap.c b/mm/rmap.c
index b0fc27e77d6d..977e70803ed8 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1526,7 +1526,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
* pte. do_swap_page() will wait until the migration
* pte is removed and then restart fault handling.
*/
- entry = make_migration_entry(page, 0);
+ entry = make_readable_migration_entry(page_to_pfn(page));
swp_pte = swp_entry_to_pte(entry);

/*
@@ -1622,8 +1622,12 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
* pte. do_swap_page() will wait until the migration
* pte is removed and then restart fault handling.
*/
- entry = make_migration_entry(subpage,
- pte_write(pteval));
+ if (pte_write(pteval))
+ entry = make_writable_migration_entry(
+ page_to_pfn(subpage));
+ else
+ entry = make_readable_migration_entry(
+ page_to_pfn(subpage));
swp_pte = swp_entry_to_pte(entry);
if (pte_soft_dirty(pteval))
swp_pte = pte_swp_mksoft_dirty(swp_pte);
--
2.20.1

2021-03-26 00:10:21

by Alistair Popple

[permalink] [raw]
Subject: [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

The behaviour of try_to_unmap_one() is difficult to follow because it
performs different operations based on a fairly large set of flags used
in different combinations.

TTU_MUNLOCK is one such flag. However it is exclusively used by
try_to_munlock() which specifies no other flags. Therefore rather than
overload try_to_unmap_one() with unrelated behaviour split this out into
it's own function and remove the flag.

Signed-off-by: Alistair Popple <[email protected]>
Reviewed-by: Ralph Campbell <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>

---

v7:
* Added Christoph's Reviewed-by

v4:
* Removed redundant check for VM_LOCKED
---
include/linux/rmap.h | 1 -
mm/rmap.c | 40 ++++++++++++++++++++++++++++++++--------
2 files changed, 32 insertions(+), 9 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index def5c62c93b3..e26ac2d71346 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -87,7 +87,6 @@ struct anon_vma_chain {

enum ttu_flags {
TTU_MIGRATION = 0x1, /* migration mode */
- TTU_MUNLOCK = 0x2, /* munlock mode */

TTU_SPLIT_HUGE_PMD = 0x4, /* split huge PMD if any */
TTU_IGNORE_MLOCK = 0x8, /* ignore mlock */
diff --git a/mm/rmap.c b/mm/rmap.c
index 977e70803ed8..d02bade5245b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1405,10 +1405,6 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
struct mmu_notifier_range range;
enum ttu_flags flags = (enum ttu_flags)(long)arg;

- /* munlock has nothing to gain from examining un-locked vmas */
- if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
- return true;
-
if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
is_zone_device_page(page) && !is_device_private_page(page))
return true;
@@ -1469,8 +1465,6 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
page_vma_mapped_walk_done(&pvmw);
break;
}
- if (flags & TTU_MUNLOCK)
- continue;
}

/* Unexpected PMD-mapped THP? */
@@ -1784,6 +1778,37 @@ bool try_to_unmap(struct page *page, enum ttu_flags flags)
return !page_mapcount(page) ? true : false;
}

+static bool try_to_munlock_one(struct page *page, struct vm_area_struct *vma,
+ unsigned long address, void *arg)
+{
+ struct page_vma_mapped_walk pvmw = {
+ .page = page,
+ .vma = vma,
+ .address = address,
+ };
+
+ /* munlock has nothing to gain from examining un-locked vmas */
+ if (!(vma->vm_flags & VM_LOCKED))
+ return true;
+
+ while (page_vma_mapped_walk(&pvmw)) {
+ /* PTE-mapped THP are never mlocked */
+ if (!PageTransCompound(page)) {
+ /*
+ * Holding pte lock, we do *not* need
+ * mmap_lock here
+ */
+ mlock_vma_page(page);
+ }
+ page_vma_mapped_walk_done(&pvmw);
+
+ /* found a mlocked page, no point continuing munlock check */
+ return false;
+ }
+
+ return true;
+}
+
/**
* try_to_munlock - try to munlock a page
* @page: the page to be munlocked
@@ -1796,8 +1821,7 @@ bool try_to_unmap(struct page *page, enum ttu_flags flags)
void try_to_munlock(struct page *page)
{
struct rmap_walk_control rwc = {
- .rmap_one = try_to_unmap_one,
- .arg = (void *)TTU_MUNLOCK,
+ .rmap_one = try_to_munlock_one,
.done = page_not_mapped,
.anon_lock = page_lock_anon_vma_read,

--
2.20.1

2021-03-26 00:10:23

by Alistair Popple

[permalink] [raw]
Subject: [PATCH v7 6/8] mm: Selftests for exclusive device memory

Adds some selftests for exclusive device memory.

Signed-off-by: Alistair Popple <[email protected]>
Acked-by: Jason Gunthorpe <[email protected]>
Tested-by: Ralph Campbell <[email protected]>
Reviewed-by: Ralph Campbell <[email protected]>
---
lib/test_hmm.c | 124 +++++++++++++++++++
lib/test_hmm_uapi.h | 2 +
tools/testing/selftests/vm/hmm-tests.c | 158 +++++++++++++++++++++++++
3 files changed, 284 insertions(+)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 5c9f5a020c1d..305a9d9e2b4c 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -25,6 +25,7 @@
#include <linux/swapops.h>
#include <linux/sched/mm.h>
#include <linux/platform_device.h>
+#include <linux/rmap.h>

#include "test_hmm_uapi.h"

@@ -46,6 +47,7 @@ struct dmirror_bounce {
unsigned long cpages;
};

+#define DPT_XA_TAG_ATOMIC 1UL
#define DPT_XA_TAG_WRITE 3UL

/*
@@ -619,6 +621,54 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
}
}

+static int dmirror_check_atomic(struct dmirror *dmirror, unsigned long start,
+ unsigned long end)
+{
+ unsigned long pfn;
+
+ for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++) {
+ void *entry;
+ struct page *page;
+
+ entry = xa_load(&dmirror->pt, pfn);
+ page = xa_untag_pointer(entry);
+ if (xa_pointer_tag(entry) == DPT_XA_TAG_ATOMIC)
+ return -EPERM;
+ }
+
+ return 0;
+}
+
+static int dmirror_atomic_map(unsigned long start, unsigned long end,
+ struct page **pages, struct dmirror *dmirror)
+{
+ unsigned long pfn, mapped = 0;
+ int i;
+
+ /* Map the migrated pages into the device's page tables. */
+ mutex_lock(&dmirror->mutex);
+
+ for (i = 0, pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++, i++) {
+ void *entry;
+
+ if (!pages[i])
+ continue;
+
+ entry = pages[i];
+ entry = xa_tag_pointer(entry, DPT_XA_TAG_ATOMIC);
+ entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC);
+ if (xa_is_err(entry)) {
+ mutex_unlock(&dmirror->mutex);
+ return xa_err(entry);
+ }
+
+ mapped++;
+ }
+
+ mutex_unlock(&dmirror->mutex);
+ return mapped;
+}
+
static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
struct dmirror *dmirror)
{
@@ -661,6 +711,71 @@ static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
return 0;
}

+static int dmirror_exclusive(struct dmirror *dmirror,
+ struct hmm_dmirror_cmd *cmd)
+{
+ unsigned long start, end, addr;
+ unsigned long size = cmd->npages << PAGE_SHIFT;
+ struct mm_struct *mm = dmirror->notifier.mm;
+ struct page *pages[64];
+ struct dmirror_bounce bounce;
+ unsigned long next;
+ int ret;
+
+ start = cmd->addr;
+ end = start + size;
+ if (end < start)
+ return -EINVAL;
+
+ /* Since the mm is for the mirrored process, get a reference first. */
+ if (!mmget_not_zero(mm))
+ return -EINVAL;
+
+ mmap_read_lock(mm);
+ for (addr = start; addr < end; addr = next) {
+ int i, mapped;
+
+ if (end < addr + (ARRAY_SIZE(pages) << PAGE_SHIFT))
+ next = end;
+ else
+ next = addr + (ARRAY_SIZE(pages) << PAGE_SHIFT);
+
+ ret = make_device_exclusive_range(mm, addr, next, pages, NULL);
+ mapped = dmirror_atomic_map(addr, next, pages, dmirror);
+ for (i = 0; i < ret; i++) {
+ if (pages[i]) {
+ unlock_page(pages[i]);
+ put_page(pages[i]);
+ }
+ }
+
+ if (addr + (mapped << PAGE_SHIFT) < next) {
+ mmap_read_unlock(mm);
+ mmput(mm);
+ return -EBUSY;
+ }
+ }
+ mmap_read_unlock(mm);
+ mmput(mm);
+
+ /* Return the migrated data for verification. */
+ ret = dmirror_bounce_init(&bounce, start, size);
+ if (ret)
+ return ret;
+ mutex_lock(&dmirror->mutex);
+ ret = dmirror_do_read(dmirror, start, end, &bounce);
+ mutex_unlock(&dmirror->mutex);
+ if (ret == 0) {
+ if (copy_to_user(u64_to_user_ptr(cmd->ptr), bounce.ptr,
+ bounce.size))
+ ret = -EFAULT;
+ }
+
+ cmd->cpages = bounce.cpages;
+ dmirror_bounce_fini(&bounce);
+ return ret;
+}
+
static int dmirror_migrate(struct dmirror *dmirror,
struct hmm_dmirror_cmd *cmd)
{
@@ -949,6 +1064,15 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
ret = dmirror_migrate(dmirror, &cmd);
break;

+ case HMM_DMIRROR_EXCLUSIVE:
+ ret = dmirror_exclusive(dmirror, &cmd);
+ break;
+
+ case HMM_DMIRROR_CHECK_EXCLUSIVE:
+ ret = dmirror_check_atomic(dmirror, cmd.addr,
+ cmd.addr + (cmd.npages << PAGE_SHIFT));
+ break;
+
case HMM_DMIRROR_SNAPSHOT:
ret = dmirror_snapshot(dmirror, &cmd);
break;
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index 670b4ef2a5b6..f14dea5dcd06 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -33,6 +33,8 @@ struct hmm_dmirror_cmd {
#define HMM_DMIRROR_WRITE _IOWR('H', 0x01, struct hmm_dmirror_cmd)
#define HMM_DMIRROR_MIGRATE _IOWR('H', 0x02, struct hmm_dmirror_cmd)
#define HMM_DMIRROR_SNAPSHOT _IOWR('H', 0x03, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_EXCLUSIVE _IOWR('H', 0x04, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_CHECK_EXCLUSIVE _IOWR('H', 0x05, struct hmm_dmirror_cmd)

/*
* Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
index 5d1ac691b9f4..864f126ffd78 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -1485,4 +1485,162 @@ TEST_F(hmm2, double_map)
hmm_buffer_free(buffer);
}

+/*
+ * Basic check of exclusive faulting.
+ */
+TEST_F(hmm, exclusive)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long i;
+ int *ptr;
+ int ret;
+
+ npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+ ASSERT_NE(npages, 0);
+ size = npages << self->page_shift;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = size;
+ buffer->mirror = malloc(size);
+ ASSERT_NE(buffer->mirror, NULL);
+
+ buffer->ptr = mmap(NULL, size,
+ PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS,
+ buffer->fd, 0);
+ ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+ /* Initialize buffer in system memory. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ptr[i] = i;
+
+ /* Map memory exclusively for device access. */
+ ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_EXCLUSIVE, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /* Check what the device read. */
+ for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], i);
+
+ /* Fault pages back to system memory and check them. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i]++, i);
+
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], i+1);
+
+ /* Check atomic access revoked */
+ ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_CHECK_EXCLUSIVE, buffer, npages);
+ ASSERT_EQ(ret, 0);
+
+ hmm_buffer_free(buffer);
+}
+
+TEST_F(hmm, exclusive_mprotect)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long i;
+ int *ptr;
+ int ret;
+
+ npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+ ASSERT_NE(npages, 0);
+ size = npages << self->page_shift;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = size;
+ buffer->mirror = malloc(size);
+ ASSERT_NE(buffer->mirror, NULL);
+
+ buffer->ptr = mmap(NULL, size,
+ PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS,
+ buffer->fd, 0);
+ ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+ /* Initialize buffer in system memory. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ptr[i] = i;
+
+ /* Map memory exclusively for device access. */
+ ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_EXCLUSIVE, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /* Check what the device read. */
+ for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], i);
+
+ ret = mprotect(buffer->ptr, size, PROT_READ);
+ ASSERT_EQ(ret, 0);
+
+ /* Simulate a device writing system memory. */
+ ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages);
+ ASSERT_EQ(ret, -EPERM);
+
+ hmm_buffer_free(buffer);
+}
+
+/*
+ * Check copy-on-write works.
+ */
+TEST_F(hmm, exclusive_cow)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long i;
+ int *ptr;
+ int ret;
+
+ npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+ ASSERT_NE(npages, 0);
+ size = npages << self->page_shift;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = size;
+ buffer->mirror = malloc(size);
+ ASSERT_NE(buffer->mirror, NULL);
+
+ buffer->ptr = mmap(NULL, size,
+ PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS,
+ buffer->fd, 0);
+ ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+ /* Initialize buffer in system memory. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ptr[i] = i;
+
+ /* Map memory exclusively for device access. */
+ ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_EXCLUSIVE, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ fork();
+
+ /* Fault pages back to system memory and check them. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i]++, i);
+
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], i+1);
+
+ hmm_buffer_free(buffer);
+}
+
TEST_HARNESS_MAIN
--
2.20.1

2021-03-26 00:10:38

by Alistair Popple

[permalink] [raw]
Subject: [PATCH v7 8/8] nouveau/svm: Implement atomic SVM access

Some NVIDIA GPUs do not support direct atomic access to system memory
via PCIe. Instead this must be emulated by granting the GPU exclusive
access to the memory. This is achieved by replacing CPU page table
entries with special swap entries that fault on userspace access.

The driver then grants the GPU permission to update the page undergoing
atomic access via the GPU page tables. When CPU access to the page is
required a CPU fault is raised which calls into the device driver via
MMU notifiers to revoke the atomic access. The original page table
entries are then restored allowing CPU access to proceed.

Signed-off-by: Alistair Popple <[email protected]>

---

v7:
* Removed magic values for fault access levels
* Improved readability of fault comparison code

v4:
* Check that page table entries haven't changed before mapping on the
device
---
drivers/gpu/drm/nouveau/include/nvif/if000c.h | 1 +
drivers/gpu/drm/nouveau/nouveau_svm.c | 126 ++++++++++++++++--
drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h | 1 +
.../drm/nouveau/nvkm/subdev/mmu/vmmgp100.c | 6 +
4 files changed, 123 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/include/nvif/if000c.h b/drivers/gpu/drm/nouveau/include/nvif/if000c.h
index d6dd40f21eed..9c7ff56831c5 100644
--- a/drivers/gpu/drm/nouveau/include/nvif/if000c.h
+++ b/drivers/gpu/drm/nouveau/include/nvif/if000c.h
@@ -77,6 +77,7 @@ struct nvif_vmm_pfnmap_v0 {
#define NVIF_VMM_PFNMAP_V0_APER 0x00000000000000f0ULL
#define NVIF_VMM_PFNMAP_V0_HOST 0x0000000000000000ULL
#define NVIF_VMM_PFNMAP_V0_VRAM 0x0000000000000010ULL
+#define NVIF_VMM_PFNMAP_V0_A 0x0000000000000004ULL
#define NVIF_VMM_PFNMAP_V0_W 0x0000000000000002ULL
#define NVIF_VMM_PFNMAP_V0_V 0x0000000000000001ULL
#define NVIF_VMM_PFNMAP_V0_NONE 0x0000000000000000ULL
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index a195e48c9aee..81526d65b4e2 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -35,6 +35,7 @@
#include <linux/sched/mm.h>
#include <linux/sort.h>
#include <linux/hmm.h>
+#include <linux/rmap.h>

struct nouveau_svm {
struct nouveau_drm *drm;
@@ -67,6 +68,11 @@ struct nouveau_svm {
} buffer[1];
};

+#define FAULT_ACCESS_READ 0
+#define FAULT_ACCESS_WRITE 1
+#define FAULT_ACCESS_ATOMIC 2
+#define FAULT_ACCESS_PREFETCH 3
+
#define SVM_DBG(s,f,a...) NV_DEBUG((s)->drm, "svm: "f"\n", ##a)
#define SVM_ERR(s,f,a...) NV_WARN((s)->drm, "svm: "f"\n", ##a)

@@ -411,6 +417,24 @@ nouveau_svm_fault_cancel_fault(struct nouveau_svm *svm,
fault->client);
}

+static int
+nouveau_svm_fault_priority(u8 fault)
+{
+ switch (fault) {
+ case FAULT_ACCESS_PREFETCH:
+ return 0;
+ case FAULT_ACCESS_READ:
+ return 1;
+ case FAULT_ACCESS_WRITE:
+ return 2;
+ case FAULT_ACCESS_ATOMIC:
+ return 3;
+ default:
+ WARN_ON_ONCE(1);
+ return -1;
+ }
+}
+
static int
nouveau_svm_fault_cmp(const void *a, const void *b)
{
@@ -421,9 +445,8 @@ nouveau_svm_fault_cmp(const void *a, const void *b)
return ret;
if ((ret = (s64)fa->addr - fb->addr))
return ret;
- /*XXX: atomic? */
- return (fa->access == 0 || fa->access == 3) -
- (fb->access == 0 || fb->access == 3);
+ return nouveau_svm_fault_priority(fa->access) -
+ nouveau_svm_fault_priority(fb->access);
}

static void
@@ -487,6 +510,10 @@ static bool nouveau_svm_range_invalidate(struct mmu_interval_notifier *mni,
struct svm_notifier *sn =
container_of(mni, struct svm_notifier, notifier);

+ if (range->event == MMU_NOTIFY_EXCLUSIVE &&
+ range->owner == sn->svmm->vmm->cli->drm->dev)
+ return true;
+
/*
* serializes the update to mni->invalidate_seq done by caller and
* prevents invalidation of the PTE from progressing while HW is being
@@ -555,6 +582,71 @@ static void nouveau_hmm_convert_pfn(struct nouveau_drm *drm,
args->p.phys[0] |= NVIF_VMM_PFNMAP_V0_W;
}

+static int nouveau_atomic_range_fault(struct nouveau_svmm *svmm,
+ struct nouveau_drm *drm,
+ struct nouveau_pfnmap_args *args, u32 size,
+ struct svm_notifier *notifier)
+{
+ unsigned long timeout =
+ jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
+ struct mm_struct *mm = svmm->notifier.mm;
+ struct page *page;
+ unsigned long start = args->p.addr;
+ unsigned long notifier_seq;
+ int ret = 0;
+
+ ret = mmu_interval_notifier_insert(&notifier->notifier, mm,
+ args->p.addr, args->p.size,
+ &nouveau_svm_mni_ops);
+ if (ret)
+ return ret;
+
+ while (true) {
+ if (time_after(jiffies, timeout)) {
+ ret = -EBUSY;
+ goto out;
+ }
+
+ notifier_seq = mmu_interval_read_begin(&notifier->notifier);
+ mmap_read_lock(mm);
+ make_device_exclusive_range(mm, start, start + PAGE_SIZE,
+ &page, drm->dev);
+ mmap_read_unlock(mm);
+ if (!page) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ mutex_lock(&svmm->mutex);
+ if (!mmu_interval_read_retry(&notifier->notifier,
+ notifier_seq))
+ break;
+ mutex_unlock(&svmm->mutex);
+ }
+
+ /* Map the page on the GPU. */
+ args->p.page = 12;
+ args->p.size = PAGE_SIZE;
+ args->p.addr = start;
+ args->p.phys[0] = page_to_phys(page) |
+ NVIF_VMM_PFNMAP_V0_V |
+ NVIF_VMM_PFNMAP_V0_W |
+ NVIF_VMM_PFNMAP_V0_A |
+ NVIF_VMM_PFNMAP_V0_HOST;
+
+ svmm->vmm->vmm.object.client->super = true;
+ ret = nvif_object_ioctl(&svmm->vmm->vmm.object, args, size, NULL);
+ svmm->vmm->vmm.object.client->super = false;
+ mutex_unlock(&svmm->mutex);
+
+ unlock_page(page);
+ put_page(page);
+
+out:
+ mmu_interval_notifier_remove(&notifier->notifier);
+ return ret;
+}
+
static int nouveau_range_fault(struct nouveau_svmm *svmm,
struct nouveau_drm *drm,
struct nouveau_pfnmap_args *args, u32 size,
@@ -637,7 +729,7 @@ nouveau_svm_fault(struct nvif_notify *notify)
unsigned long hmm_flags;
u64 inst, start, limit;
int fi, fn;
- int replay = 0, ret;
+ int replay = 0, atomic = 0, ret;

/* Parse available fault buffer entries into a cache, and update
* the GET pointer so HW can reuse the entries.
@@ -718,12 +810,14 @@ nouveau_svm_fault(struct nvif_notify *notify)
/*
* Determine required permissions based on GPU fault
* access flags.
- * XXX: atomic?
*/
switch (buffer->fault[fi]->access) {
case 0: /* READ. */
hmm_flags = HMM_PFN_REQ_FAULT;
break;
+ case 2: /* ATOMIC. */
+ atomic = true;
+ break;
case 3: /* PREFETCH. */
hmm_flags = 0;
break;
@@ -739,8 +833,14 @@ nouveau_svm_fault(struct nvif_notify *notify)
}

notifier.svmm = svmm;
- ret = nouveau_range_fault(svmm, svm->drm, &args.i,
- sizeof(args), hmm_flags, &notifier);
+ if (atomic)
+ ret = nouveau_atomic_range_fault(svmm, svm->drm,
+ &args.i, sizeof(args),
+ &notifier);
+ else
+ ret = nouveau_range_fault(svmm, svm->drm, &args.i,
+ sizeof(args), hmm_flags,
+ &notifier);
mmput(mm);

limit = args.i.p.addr + args.i.p.size;
@@ -756,11 +856,15 @@ nouveau_svm_fault(struct nvif_notify *notify)
*/
if (buffer->fault[fn]->svmm != svmm ||
buffer->fault[fn]->addr >= limit ||
- (buffer->fault[fi]->access == 0 /* READ. */ &&
+ (buffer->fault[fi]->access == FAULT_ACCESS_READ &&
!(args.phys[0] & NVIF_VMM_PFNMAP_V0_V)) ||
- (buffer->fault[fi]->access != 0 /* READ. */ &&
- buffer->fault[fi]->access != 3 /* PREFETCH. */ &&
- !(args.phys[0] & NVIF_VMM_PFNMAP_V0_W)))
+ (buffer->fault[fi]->access != FAULT_ACCESS_READ &&
+ buffer->fault[fi]->access != FAULT_ACCESS_PREFETCH &&
+ !(args.phys[0] & NVIF_VMM_PFNMAP_V0_W)) ||
+ (buffer->fault[fi]->access != FAULT_ACCESS_READ &&
+ buffer->fault[fi]->access != FAULT_ACCESS_WRITE &&
+ buffer->fault[fi]->access != FAULT_ACCESS_PREFETCH &&
+ !(args.phys[0] & NVIF_VMM_PFNMAP_V0_A)))
break;
}

diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
index a2b179568970..f6188aa9171c 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
@@ -178,6 +178,7 @@ void nvkm_vmm_unmap_region(struct nvkm_vmm *, struct nvkm_vma *);
#define NVKM_VMM_PFN_APER 0x00000000000000f0ULL
#define NVKM_VMM_PFN_HOST 0x0000000000000000ULL
#define NVKM_VMM_PFN_VRAM 0x0000000000000010ULL
+#define NVKM_VMM_PFN_A 0x0000000000000004ULL
#define NVKM_VMM_PFN_W 0x0000000000000002ULL
#define NVKM_VMM_PFN_V 0x0000000000000001ULL
#define NVKM_VMM_PFN_NONE 0x0000000000000000ULL
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c
index 236db5570771..f02abd9cb4dd 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c
@@ -88,6 +88,9 @@ gp100_vmm_pgt_pfn(struct nvkm_vmm *vmm, struct nvkm_mmu_pt *pt,
if (!(*map->pfn & NVKM_VMM_PFN_W))
data |= BIT_ULL(6); /* RO. */

+ if (!(*map->pfn & NVKM_VMM_PFN_A))
+ data |= BIT_ULL(7); /* Atomic disable. */
+
if (!(*map->pfn & NVKM_VMM_PFN_VRAM)) {
addr = *map->pfn >> NVKM_VMM_PFN_ADDR_SHIFT;
addr = dma_map_page(dev, pfn_to_page(addr), 0,
@@ -322,6 +325,9 @@ gp100_vmm_pd0_pfn(struct nvkm_vmm *vmm, struct nvkm_mmu_pt *pt,
if (!(*map->pfn & NVKM_VMM_PFN_W))
data |= BIT_ULL(6); /* RO. */

+ if (!(*map->pfn & NVKM_VMM_PFN_A))
+ data |= BIT_ULL(7); /* Atomic disable. */
+
if (!(*map->pfn & NVKM_VMM_PFN_VRAM)) {
addr = *map->pfn >> NVKM_VMM_PFN_ADDR_SHIFT;
addr = dma_map_page(dev, pfn_to_page(addr), 0,
--
2.20.1

2021-03-26 00:12:16

by Alistair Popple

[permalink] [raw]
Subject: [PATCH v7 4/8] mm/rmap: Split migration into its own function

Migration is currently implemented as a mode of operation for
try_to_unmap_one() generally specified by passing the TTU_MIGRATION flag
or in the case of splitting a huge anonymous page TTU_SPLIT_FREEZE.

However it does not have much in common with the rest of the unmap
functionality of try_to_unmap_one() and thus splitting it into a
separate function reduces the complexity of try_to_unmap_one() making it
more readable.

Several simplifications can also be made in try_to_migrate_one() based
on the following observations:

- All users of TTU_MIGRATION also set TTU_IGNORE_MLOCK.
- No users of TTU_MIGRATION ever set TTU_IGNORE_HWPOISON.
- No users of TTU_MIGRATION ever set TTU_BATCH_FLUSH.

TTU_SPLIT_FREEZE is a special case of migration used when splitting an
anonymous page. This is most easily dealt with by calling the correct
function from unmap_page() in mm/huge_memory.c - either
try_to_migrate() for PageAnon or try_to_unmap().

Signed-off-by: Alistair Popple <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Ralph Campbell <[email protected]>

---

v5:
* Added comments about how PMD splitting works for migration vs.
unmapping
* Tightened up the flag check in try_to_migrate() to be explicit about
which TTU_XXX flags are supported.
---
include/linux/rmap.h | 4 +-
mm/huge_memory.c | 15 +-
mm/migrate.c | 9 +-
mm/rmap.c | 358 ++++++++++++++++++++++++++++++++-----------
4 files changed, 280 insertions(+), 106 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index e26ac2d71346..6062e0cfca2d 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -86,8 +86,6 @@ struct anon_vma_chain {
};

enum ttu_flags {
- TTU_MIGRATION = 0x1, /* migration mode */
-
TTU_SPLIT_HUGE_PMD = 0x4, /* split huge PMD if any */
TTU_IGNORE_MLOCK = 0x8, /* ignore mlock */
TTU_IGNORE_HWPOISON = 0x20, /* corrupted page is recoverable */
@@ -96,7 +94,6 @@ enum ttu_flags {
* do a final flush if necessary */
TTU_RMAP_LOCKED = 0x80, /* do not grab rmap lock:
* caller holds it */
- TTU_SPLIT_FREEZE = 0x100, /* freeze pte under splitting thp */
};

#ifdef CONFIG_MMU
@@ -193,6 +190,7 @@ static inline void page_dup_rmap(struct page *page, bool compound)
int page_referenced(struct page *, int is_locked,
struct mem_cgroup *memcg, unsigned long *vm_flags);

+bool try_to_migrate(struct page *page, enum ttu_flags flags);
bool try_to_unmap(struct page *, enum ttu_flags flags);

/* Avoid racy checks */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 89af065cea5b..eab004331b97 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2357,16 +2357,21 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,

static void unmap_page(struct page *page)
{
- enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK |
- TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
+ enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
bool unmap_success;

VM_BUG_ON_PAGE(!PageHead(page), page);

if (PageAnon(page))
- ttu_flags |= TTU_SPLIT_FREEZE;
-
- unmap_success = try_to_unmap(page, ttu_flags);
+ unmap_success = try_to_migrate(page, ttu_flags);
+ else
+ /*
+ * Don't install migration entries for file backed pages. This
+ * helps handle cases when i_size is in the middle of the page
+ * as there is no need to unmap pages beyond i_size manually.
+ */
+ unmap_success = try_to_unmap(page, ttu_flags |
+ TTU_IGNORE_MLOCK);
VM_BUG_ON_PAGE(!unmap_success, page);
}

diff --git a/mm/migrate.c b/mm/migrate.c
index b752543adb64..cc4612e2a246 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1130,7 +1130,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
/* Establish migration ptes */
VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma,
page);
- try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK);
+ try_to_migrate(page, 0);
page_was_mapped = 1;
}

@@ -1332,7 +1332,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,

if (page_mapped(hpage)) {
bool mapping_locked = false;
- enum ttu_flags ttu = TTU_MIGRATION|TTU_IGNORE_MLOCK;
+ enum ttu_flags ttu = 0;

if (!PageAnon(hpage)) {
/*
@@ -1349,7 +1349,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
ttu |= TTU_RMAP_LOCKED;
}

- try_to_unmap(hpage, ttu);
+ try_to_migrate(hpage, ttu);
page_was_mapped = 1;

if (mapping_locked)
@@ -2756,7 +2756,6 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
*/
static void migrate_vma_unmap(struct migrate_vma *migrate)
{
- int flags = TTU_MIGRATION | TTU_IGNORE_MLOCK;
const unsigned long npages = migrate->npages;
const unsigned long start = migrate->start;
unsigned long addr, i, restore = 0;
@@ -2768,7 +2767,7 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
continue;

if (page_mapped(page)) {
- try_to_unmap(page, flags);
+ try_to_migrate(page, 0);
if (page_mapped(page))
goto restore;
}
diff --git a/mm/rmap.c b/mm/rmap.c
index d02bade5245b..b540b44e299a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1405,14 +1405,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
struct mmu_notifier_range range;
enum ttu_flags flags = (enum ttu_flags)(long)arg;

- if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
- is_zone_device_page(page) && !is_device_private_page(page))
- return true;
-
- if (flags & TTU_SPLIT_HUGE_PMD) {
- split_huge_pmd_address(vma, address,
- flags & TTU_SPLIT_FREEZE, page);
- }
+ if (flags & TTU_SPLIT_HUGE_PMD)
+ split_huge_pmd_address(vma, address, false, page);

/*
* For THP, we have to assume the worse case ie pmd for invalidation.
@@ -1436,16 +1430,6 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
mmu_notifier_invalidate_range_start(&range);

while (page_vma_mapped_walk(&pvmw)) {
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
- /* PMD-mapped THP migration entry */
- if (!pvmw.pte && (flags & TTU_MIGRATION)) {
- VM_BUG_ON_PAGE(PageHuge(page) || !PageTransCompound(page), page);
-
- set_pmd_migration_entry(&pvmw, page);
- continue;
- }
-#endif
-
/*
* If the page is mlock()d, we cannot swap it out.
* If it's recently referenced (perhaps page_referenced
@@ -1507,46 +1491,6 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
}
}

- if (IS_ENABLED(CONFIG_MIGRATION) &&
- (flags & TTU_MIGRATION) &&
- is_zone_device_page(page)) {
- swp_entry_t entry;
- pte_t swp_pte;
-
- pteval = ptep_get_and_clear(mm, pvmw.address, pvmw.pte);
-
- /*
- * Store the pfn of the page in a special migration
- * pte. do_swap_page() will wait until the migration
- * pte is removed and then restart fault handling.
- */
- entry = make_readable_migration_entry(page_to_pfn(page));
- swp_pte = swp_entry_to_pte(entry);
-
- /*
- * pteval maps a zone device page and is therefore
- * a swap pte.
- */
- if (pte_swp_soft_dirty(pteval))
- swp_pte = pte_swp_mksoft_dirty(swp_pte);
- if (pte_swp_uffd_wp(pteval))
- swp_pte = pte_swp_mkuffd_wp(swp_pte);
- set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
- /*
- * No need to invalidate here it will synchronize on
- * against the special swap migration pte.
- *
- * The assignment to subpage above was computed from a
- * swap PTE which results in an invalid pointer.
- * Since only PAGE_SIZE pages can currently be
- * migrated, just set it to page. This will need to be
- * changed when hugepage migrations to device private
- * memory are supported.
- */
- subpage = page;
- goto discard;
- }
-
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
if (should_defer_flush(mm, flags)) {
@@ -1599,39 +1543,6 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
/* We have to invalidate as we cleared the pte */
mmu_notifier_invalidate_range(mm, address,
address + PAGE_SIZE);
- } else if (IS_ENABLED(CONFIG_MIGRATION) &&
- (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
- swp_entry_t entry;
- pte_t swp_pte;
-
- if (arch_unmap_one(mm, vma, address, pteval) < 0) {
- set_pte_at(mm, address, pvmw.pte, pteval);
- ret = false;
- page_vma_mapped_walk_done(&pvmw);
- break;
- }
-
- /*
- * Store the pfn of the page in a special migration
- * pte. do_swap_page() will wait until the migration
- * pte is removed and then restart fault handling.
- */
- if (pte_write(pteval))
- entry = make_writable_migration_entry(
- page_to_pfn(subpage));
- else
- entry = make_readable_migration_entry(
- page_to_pfn(subpage));
- swp_pte = swp_entry_to_pte(entry);
- if (pte_soft_dirty(pteval))
- swp_pte = pte_swp_mksoft_dirty(swp_pte);
- if (pte_uffd_wp(pteval))
- swp_pte = pte_swp_mkuffd_wp(swp_pte);
- set_pte_at(mm, address, pvmw.pte, swp_pte);
- /*
- * No need to invalidate here it will synchronize on
- * against the special swap migration pte.
- */
} else if (PageAnon(page)) {
swp_entry_t entry = { .val = page_private(subpage) };
pte_t swp_pte;
@@ -1758,6 +1669,268 @@ bool try_to_unmap(struct page *page, enum ttu_flags flags)
.anon_lock = page_lock_anon_vma_read,
};

+ if (flags & TTU_RMAP_LOCKED)
+ rmap_walk_locked(page, &rwc);
+ else
+ rmap_walk(page, &rwc);
+
+ return !page_mapcount(page) ? true : false;
+}
+
+/*
+ * @arg: enum ttu_flags will be passed to this argument.
+ *
+ * If TTU_SPLIT_HUGE_PMD is specified any PMD mappings will be split into PTEs
+ * containing migration entries. This and TTU_RMAP_LOCKED are the only supported
+ * flags.
+ */
+static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
+ unsigned long address, void *arg)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct page_vma_mapped_walk pvmw = {
+ .page = page,
+ .vma = vma,
+ .address = address,
+ };
+ pte_t pteval;
+ struct page *subpage;
+ bool ret = true;
+ struct mmu_notifier_range range;
+ enum ttu_flags flags = (enum ttu_flags)(long)arg;
+
+ if (is_zone_device_page(page) && !is_device_private_page(page))
+ return true;
+
+ /*
+ * unmap_page() in mm/huge_memory.c is the only user of migration with
+ * TTU_SPLIT_HUGE_PMD and it wants to freeze.
+ */
+ if (flags & TTU_SPLIT_HUGE_PMD)
+ split_huge_pmd_address(vma, address, true, page);
+
+ /*
+ * For THP, we have to assume the worse case ie pmd for invalidation.
+ * For hugetlb, it could be much worse if we need to do pud
+ * invalidation in the case of pmd sharing.
+ *
+ * Note that the page can not be free in this function as call of
+ * try_to_unmap() must hold a reference on the page.
+ */
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
+ address,
+ min(vma->vm_end, address + page_size(page)));
+ if (PageHuge(page)) {
+ /*
+ * If sharing is possible, start and end will be adjusted
+ * accordingly.
+ */
+ adjust_range_if_pmd_sharing_possible(vma, &range.start,
+ &range.end);
+ }
+ mmu_notifier_invalidate_range_start(&range);
+
+ while (page_vma_mapped_walk(&pvmw)) {
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+ /* PMD-mapped THP migration entry */
+ if (!pvmw.pte) {
+ VM_BUG_ON_PAGE(PageHuge(page) ||
+ !PageTransCompound(page), page);
+
+ set_pmd_migration_entry(&pvmw, page);
+ continue;
+ }
+#endif
+
+ /* Unexpected PMD-mapped THP? */
+ VM_BUG_ON_PAGE(!pvmw.pte, page);
+
+ subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
+ address = pvmw.address;
+
+ if (PageHuge(page) && !PageAnon(page)) {
+ /*
+ * To call huge_pmd_unshare, i_mmap_rwsem must be
+ * held in write mode. Caller needs to explicitly
+ * do this outside rmap routines.
+ */
+ VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
+ if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) {
+ /*
+ * huge_pmd_unshare unmapped an entire PMD
+ * page. There is no way of knowing exactly
+ * which PMDs may be cached for this mm, so
+ * we must flush them all. start/end were
+ * already adjusted above to cover this range.
+ */
+ flush_cache_range(vma, range.start, range.end);
+ flush_tlb_range(vma, range.start, range.end);
+ mmu_notifier_invalidate_range(mm, range.start,
+ range.end);
+
+ /*
+ * The ref count of the PMD page was dropped
+ * which is part of the way map counting
+ * is done for shared PMDs. Return 'true'
+ * here. When there is no other sharing,
+ * huge_pmd_unshare returns false and we will
+ * unmap the actual page and drop map count
+ * to zero.
+ */
+ page_vma_mapped_walk_done(&pvmw);
+ break;
+ }
+ }
+
+ /* Nuke the page table entry. */
+ flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
+ pteval = ptep_clear_flush(vma, address, pvmw.pte);
+
+ /* Move the dirty bit to the page. Now the pte is gone. */
+ if (pte_dirty(pteval))
+ set_page_dirty(page);
+
+ /* Update high watermark before we lower rss */
+ update_hiwater_rss(mm);
+
+ if (is_zone_device_page(page)) {
+ swp_entry_t entry;
+ pte_t swp_pte;
+
+ /*
+ * Store the pfn of the page in a special migration
+ * pte. do_swap_page() will wait until the migration
+ * pte is removed and then restart fault handling.
+ */
+ entry = make_readable_migration_entry(
+ page_to_pfn(page));
+ swp_pte = swp_entry_to_pte(entry);
+
+ /*
+ * pteval maps a zone device page and is therefore
+ * a swap pte.
+ */
+ if (pte_swp_soft_dirty(pteval))
+ swp_pte = pte_swp_mksoft_dirty(swp_pte);
+ if (pte_swp_uffd_wp(pteval))
+ swp_pte = pte_swp_mkuffd_wp(swp_pte);
+ set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
+ /*
+ * No need to invalidate here it will synchronize on
+ * against the special swap migration pte.
+ *
+ * The assignment to subpage above was computed from a
+ * swap PTE which results in an invalid pointer.
+ * Since only PAGE_SIZE pages can currently be
+ * migrated, just set it to page. This will need to be
+ * changed when hugepage migrations to device private
+ * memory are supported.
+ */
+ subpage = page;
+ } else if (PageHWPoison(page)) {
+ pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
+ if (PageHuge(page)) {
+ hugetlb_count_sub(compound_nr(page), mm);
+ set_huge_swap_pte_at(mm, address,
+ pvmw.pte, pteval,
+ vma_mmu_pagesize(vma));
+ } else {
+ dec_mm_counter(mm, mm_counter(page));
+ set_pte_at(mm, address, pvmw.pte, pteval);
+ }
+
+ } else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
+ /*
+ * The guest indicated that the page content is of no
+ * interest anymore. Simply discard the pte, vmscan
+ * will take care of the rest.
+ * A future reference will then fault in a new zero
+ * page. When userfaultfd is active, we must not drop
+ * this page though, as its main user (postcopy
+ * migration) will not expect userfaults on already
+ * copied pages.
+ */
+ dec_mm_counter(mm, mm_counter(page));
+ /* We have to invalidate as we cleared the pte */
+ mmu_notifier_invalidate_range(mm, address,
+ address + PAGE_SIZE);
+ } else {
+ swp_entry_t entry;
+ pte_t swp_pte;
+
+ if (arch_unmap_one(mm, vma, address, pteval) < 0) {
+ set_pte_at(mm, address, pvmw.pte, pteval);
+ ret = false;
+ page_vma_mapped_walk_done(&pvmw);
+ break;
+ }
+
+ /*
+ * Store the pfn of the page in a special migration
+ * pte. do_swap_page() will wait until the migration
+ * pte is removed and then restart fault handling.
+ */
+ if (pte_write(pteval))
+ entry = make_writable_migration_entry(
+ page_to_pfn(subpage));
+ else
+ entry = make_readable_migration_entry(
+ page_to_pfn(subpage));
+
+ swp_pte = swp_entry_to_pte(entry);
+ if (pte_soft_dirty(pteval))
+ swp_pte = pte_swp_mksoft_dirty(swp_pte);
+ if (pte_uffd_wp(pteval))
+ swp_pte = pte_swp_mkuffd_wp(swp_pte);
+ set_pte_at(mm, address, pvmw.pte, swp_pte);
+ /*
+ * No need to invalidate here it will synchronize on
+ * against the special swap migration pte.
+ */
+ }
+
+ /*
+ * No need to call mmu_notifier_invalidate_range() it has be
+ * done above for all cases requiring it to happen under page
+ * table lock before mmu_notifier_invalidate_range_end()
+ *
+ * See Documentation/vm/mmu_notifier.rst
+ */
+ page_remove_rmap(subpage, PageHuge(page));
+ put_page(page);
+ }
+
+ mmu_notifier_invalidate_range_end(&range);
+
+ return ret;
+}
+
+/**
+ * try_to_migrate - try to replace all page table mappings with swap entries
+ * @page: the page to replace page table entries for
+ * @flags: action and flags
+ *
+ * Tries to remove all the page table entries which are mapping this page and
+ * replace them with special swap entries. Caller must hold the page lock.
+ *
+ * If is successful, return true. Otherwise, false.
+ */
+bool try_to_migrate(struct page *page, enum ttu_flags flags)
+{
+ struct rmap_walk_control rwc = {
+ .rmap_one = try_to_migrate_one,
+ .arg = (void *)flags,
+ .done = page_not_mapped,
+ .anon_lock = page_lock_anon_vma_read,
+ };
+
+ /*
+ * Migration always ignores mlock and only supports TTU_RMAP_LOCKED and
+ * TTU_SPLIT_HUGE_PMD flags.
+ */
+ if (WARN_ON_ONCE(flags & ~(TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD)))
+ return false;
+
/*
* During exec, a temporary VMA is setup and later moved.
* The VMA is moved under the anon_vma lock but not the
@@ -1766,8 +1939,7 @@ bool try_to_unmap(struct page *page, enum ttu_flags flags)
* locking requirements of exec(), migration skips
* temporary VMAs until after exec() completes.
*/
- if ((flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))
- && !PageKsm(page) && PageAnon(page))
+ if (!PageKsm(page) && PageAnon(page))
rwc.invalid_vma = invalid_migration_vma;

if (flags & TTU_RMAP_LOCKED)
--
2.20.1

2021-03-26 00:12:19

by Alistair Popple

[permalink] [raw]
Subject: [PATCH v7 7/8] nouveau/svm: Refactor nouveau_range_fault

Call mmu_interval_notifier_insert() as part of nouveau_range_fault().
This doesn't introduce any functional change but makes it easier for a
subsequent patch to alter the behaviour of nouveau_range_fault() to
support GPU atomic operations.

Signed-off-by: Alistair Popple <[email protected]>
---
drivers/gpu/drm/nouveau/nouveau_svm.c | 34 ++++++++++++++++-----------
1 file changed, 20 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 94f841026c3b..a195e48c9aee 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -567,18 +567,27 @@ static int nouveau_range_fault(struct nouveau_svmm *svmm,
unsigned long hmm_pfns[1];
struct hmm_range range = {
.notifier = &notifier->notifier,
- .start = notifier->notifier.interval_tree.start,
- .end = notifier->notifier.interval_tree.last + 1,
.default_flags = hmm_flags,
.hmm_pfns = hmm_pfns,
.dev_private_owner = drm->dev,
};
- struct mm_struct *mm = notifier->notifier.mm;
+ struct mm_struct *mm = svmm->notifier.mm;
int ret;

+ ret = mmu_interval_notifier_insert(&notifier->notifier, mm,
+ args->p.addr, args->p.size,
+ &nouveau_svm_mni_ops);
+ if (ret)
+ return ret;
+
+ range.start = notifier->notifier.interval_tree.start;
+ range.end = notifier->notifier.interval_tree.last + 1;
+
while (true) {
- if (time_after(jiffies, timeout))
- return -EBUSY;
+ if (time_after(jiffies, timeout)) {
+ ret = -EBUSY;
+ goto out;
+ }

range.notifier_seq = mmu_interval_read_begin(range.notifier);
mmap_read_lock(mm);
@@ -587,7 +596,7 @@ static int nouveau_range_fault(struct nouveau_svmm *svmm,
if (ret) {
if (ret == -EBUSY)
continue;
- return ret;
+ goto out;
}

mutex_lock(&svmm->mutex);
@@ -606,6 +615,9 @@ static int nouveau_range_fault(struct nouveau_svmm *svmm,
svmm->vmm->vmm.object.client->super = false;
mutex_unlock(&svmm->mutex);

+out:
+ mmu_interval_notifier_remove(&notifier->notifier);
+
return ret;
}

@@ -727,14 +739,8 @@ nouveau_svm_fault(struct nvif_notify *notify)
}

notifier.svmm = svmm;
- ret = mmu_interval_notifier_insert(&notifier.notifier, mm,
- args.i.p.addr, args.i.p.size,
- &nouveau_svm_mni_ops);
- if (!ret) {
- ret = nouveau_range_fault(svmm, svm->drm, &args.i,
- sizeof(args), hmm_flags, &notifier);
- mmu_interval_notifier_remove(&notifier.notifier);
- }
+ ret = nouveau_range_fault(svmm, svm->drm, &args.i,
+ sizeof(args), hmm_flags, &notifier);
mmput(mm);

limit = args.i.p.addr + args.i.p.size;
--
2.20.1

2021-03-26 00:12:21

by Alistair Popple

[permalink] [raw]
Subject: [PATCH v7 5/8] mm: Device exclusive memory access

Some devices require exclusive write access to shared virtual
memory (SVM) ranges to perform atomic operations on that memory. This
requires CPU page tables to be updated to deny access whilst atomic
operations are occurring.

In order to do this introduce a new swap entry
type (SWP_DEVICE_EXCLUSIVE). When a SVM range needs to be marked for
exclusive access by a device all page table mappings for the particular
range are replaced with device exclusive swap entries. This causes any
CPU access to the page to result in a fault.

Faults are resovled by replacing the faulting entry with the original
mapping. This results in MMU notifiers being called which a driver uses
to update access permissions such as revoking atomic access. After
notifiers have been called the device will no longer have exclusive
access to the region.

Signed-off-by: Alistair Popple <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>

---

v7:
* Added Christoph's Reviewed-by.
* Minor cosmetic cleanups suggested by Christoph.
* Replace mmu_notifier_range_init_migrate/exclusive with
mmu_notifier_range_init_owner as suggested by Christoph.
* Replaced lock_page() with lock_page_retry() when handling faults.
* Restrict to anonymous pages for now.

v6:
* Fixed a bisectablity issue due to incorrectly applying the rename of
migrate_pgmap_owner to the wrong patches for Nouveau and hmm_test.

v5:
* Renamed range->migrate_pgmap_owner to range->owner.
* Added MMU_NOTIFY_EXCLUSIVE to allow passing of a driver cookie which
allows notifiers called as a result of make_device_exclusive_range() to
be ignored.
* Added a check to try_to_protect_one() to detect if the pages originally
returned from get_user_pages() have been unmapped or not.
* Removed check_device_exclusive_range() as it is no longer required with
the other changes.
* Documentation update.

v4:
* Add function to check that mappings are still valid and exclusive.
* s/long/unsigned long/ in make_device_exclusive_entry().
---
Documentation/vm/hmm.rst | 19 ++-
drivers/gpu/drm/nouveau/nouveau_svm.c | 2 +-
include/linux/mmu_notifier.h | 26 ++--
include/linux/rmap.h | 4 +
include/linux/swap.h | 4 +-
include/linux/swapops.h | 44 +++++-
lib/test_hmm.c | 2 +-
mm/hmm.c | 5 +
mm/memory.c | 108 ++++++++++++-
mm/migrate.c | 10 +-
mm/mprotect.c | 8 +
mm/page_vma_mapped.c | 9 +-
mm/rmap.c | 210 ++++++++++++++++++++++++++
13 files changed, 426 insertions(+), 25 deletions(-)

diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index 09e28507f5b2..a14c2938e7af 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -332,7 +332,7 @@ between device driver specific code and shared common code:
walks to fill in the ``args->src`` array with PFNs to be migrated.
The ``invalidate_range_start()`` callback is passed a
``struct mmu_notifier_range`` with the ``event`` field set to
- ``MMU_NOTIFY_MIGRATE`` and the ``migrate_pgmap_owner`` field set to
+ ``MMU_NOTIFY_MIGRATE`` and the ``owner`` field set to
the ``args->pgmap_owner`` field passed to migrate_vma_setup(). This is
allows the device driver to skip the invalidation callback and only
invalidate device private MMU mappings that are actually migrating.
@@ -405,6 +405,23 @@ between device driver specific code and shared common code:

The lock can now be released.

+Exclusive access memory
+=======================
+
+Some devices have features such as atomic PTE bits that can be used to implement
+atomic access to system memory. To support atomic operations to a shared virtual
+memory page such a device needs access to that page which is exclusive of any
+userspace access from the CPU. The ``make_device_exclusive_range()`` function
+can be used to make a memory range inaccessible from userspace.
+
+This replaces all mappings for pages in the given range with special swap
+entries. Any attempt to access the swap entry results in a fault which is
+resovled by replacing the entry with the original mapping. A driver gets
+notified that the mapping has been changed by MMU notifiers, after which point
+it will no longer have exclusive access to the page. Exclusive access is
+guranteed to last until the driver drops the page lock and page reference, at
+which point any CPU faults on the page may proceed as described.
+
Memory cgroup (memcg) and rss accounting
========================================

diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index f18bd53da052..94f841026c3b 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -265,7 +265,7 @@ nouveau_svmm_invalidate_range_start(struct mmu_notifier *mn,
* the invalidation is handled as part of the migration process.
*/
if (update->event == MMU_NOTIFY_MIGRATE &&
- update->migrate_pgmap_owner == svmm->vmm->cli->drm->dev)
+ update->owner == svmm->vmm->cli->drm->dev)
goto out;

if (limit > svmm->unmanaged.start && start < svmm->unmanaged.limit) {
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index b8200782dede..2e6068d3fb9f 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -41,7 +41,12 @@ struct mmu_interval_notifier;
*
* @MMU_NOTIFY_MIGRATE: used during migrate_vma_collect() invalidate to signal
* a device driver to possibly ignore the invalidation if the
- * migrate_pgmap_owner field matches the driver's device private pgmap owner.
+ * owner field matches the driver's device private pgmap owner.
+ *
+ * @MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no
+ * longer have exclusive access to the page. May ignore the invalidation that's
+ * part of make_device_exclusive_range() if the owner field
+ * matches the value passed to make_device_exclusive_range().
*/
enum mmu_notifier_event {
MMU_NOTIFY_UNMAP = 0,
@@ -51,6 +56,7 @@ enum mmu_notifier_event {
MMU_NOTIFY_SOFT_DIRTY,
MMU_NOTIFY_RELEASE,
MMU_NOTIFY_MIGRATE,
+ MMU_NOTIFY_EXCLUSIVE,
};

#define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
@@ -269,7 +275,7 @@ struct mmu_notifier_range {
unsigned long end;
unsigned flags;
enum mmu_notifier_event event;
- void *migrate_pgmap_owner;
+ void *owner;
};

static inline int mm_has_notifiers(struct mm_struct *mm)
@@ -521,14 +527,14 @@ static inline void mmu_notifier_range_init(struct mmu_notifier_range *range,
range->flags = flags;
}

-static inline void mmu_notifier_range_init_migrate(
- struct mmu_notifier_range *range, unsigned int flags,
+static inline void mmu_notifier_range_init_owner(
+ struct mmu_notifier_range *range,
+ enum mmu_notifier_event event, unsigned int flags,
struct vm_area_struct *vma, struct mm_struct *mm,
- unsigned long start, unsigned long end, void *pgmap)
+ unsigned long start, unsigned long end, void *owner)
{
- mmu_notifier_range_init(range, MMU_NOTIFY_MIGRATE, flags, vma, mm,
- start, end);
- range->migrate_pgmap_owner = pgmap;
+ mmu_notifier_range_init(range, event, flags, vma, mm, start, end);
+ range->owner = owner;
}

#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
@@ -655,8 +661,8 @@ static inline void _mmu_notifier_range_init(struct mmu_notifier_range *range,

#define mmu_notifier_range_init(range,event,flags,vma,mm,start,end) \
_mmu_notifier_range_init(range, start, end)
-#define mmu_notifier_range_init_migrate(range, flags, vma, mm, start, end, \
- pgmap) \
+#define mmu_notifier_range_init_owner(range, event, flags, vma, mm, start, \
+ end, owner) \
_mmu_notifier_range_init(range, start, end)

static inline bool
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 6062e0cfca2d..b207c138cbff 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -193,6 +193,10 @@ int page_referenced(struct page *, int is_locked,
bool try_to_migrate(struct page *page, enum ttu_flags flags);
bool try_to_unmap(struct page *, enum ttu_flags flags);

+int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
+ unsigned long end, struct page **pages,
+ void *arg);
+
/* Avoid racy checks */
#define PVMW_SYNC (1 << 0)
/* Look for migarion entries rather than present PTEs */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 516104b9334b..7a3c260146df 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -63,9 +63,11 @@ static inline int current_is_kswapd(void)
* to a special SWP_DEVICE_* entry.
*/
#ifdef CONFIG_DEVICE_PRIVATE
-#define SWP_DEVICE_NUM 2
+#define SWP_DEVICE_NUM 4
#define SWP_DEVICE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM)
#define SWP_DEVICE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+1)
+#define SWP_DEVICE_EXCLUSIVE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+2)
+#define SWP_DEVICE_EXCLUSIVE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+3)
#else
#define SWP_DEVICE_NUM 0
#endif
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 4dfd807ae52a..4129bd2ff9d6 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -120,6 +120,27 @@ static inline bool is_writable_device_private_entry(swp_entry_t entry)
{
return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
}
+
+static inline swp_entry_t make_readable_device_exclusive_entry(pgoff_t offset)
+{
+ return swp_entry(SWP_DEVICE_EXCLUSIVE_READ, offset);
+}
+
+static inline swp_entry_t make_writable_device_exclusive_entry(pgoff_t offset)
+{
+ return swp_entry(SWP_DEVICE_EXCLUSIVE_WRITE, offset);
+}
+
+static inline bool is_device_exclusive_entry(swp_entry_t entry)
+{
+ return swp_type(entry) == SWP_DEVICE_EXCLUSIVE_READ ||
+ swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE;
+}
+
+static inline bool is_writable_device_exclusive_entry(swp_entry_t entry)
+{
+ return unlikely(swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE);
+}
#else /* CONFIG_DEVICE_PRIVATE */
static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
{
@@ -140,6 +161,26 @@ static inline bool is_writable_device_private_entry(swp_entry_t entry)
{
return false;
}
+
+static inline swp_entry_t make_readable_device_exclusive_entry(pgoff_t offset)
+{
+ return swp_entry(0, 0);
+}
+
+static inline swp_entry_t make_writable_device_exclusive_entry(pgoff_t offset)
+{
+ return swp_entry(0, 0);
+}
+
+static inline bool is_device_exclusive_entry(swp_entry_t entry)
+{
+ return false;
+}
+
+static inline bool is_writable_device_exclusive_entry(swp_entry_t entry)
+{
+ return false;
+}
#endif /* CONFIG_DEVICE_PRIVATE */

#ifdef CONFIG_MIGRATION
@@ -219,7 +260,8 @@ static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
*/
static inline bool is_pfn_swap_entry(swp_entry_t entry)
{
- return is_migration_entry(entry) || is_device_private_entry(entry);
+ return is_migration_entry(entry) || is_device_private_entry(entry) ||
+ is_device_exclusive_entry(entry);
}

struct page_vma_mapped_walk;
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 80a78877bd93..5c9f5a020c1d 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -218,7 +218,7 @@ static bool dmirror_interval_invalidate(struct mmu_interval_notifier *mni,
* the invalidation is handled as part of the migration process.
*/
if (range->event == MMU_NOTIFY_MIGRATE &&
- range->migrate_pgmap_owner == dmirror->mdevice)
+ range->owner == dmirror->mdevice)
return true;

if (mmu_notifier_range_blockable(range))
diff --git a/mm/hmm.c b/mm/hmm.c
index 11df3ca30b82..fad6be2bf072 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -26,6 +26,8 @@
#include <linux/mmu_notifier.h>
#include <linux/memory_hotplug.h>

+#include "internal.h"
+
struct hmm_vma_walk {
struct hmm_range *range;
unsigned long last;
@@ -271,6 +273,9 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
if (!non_swap_entry(entry))
goto fault;

+ if (is_device_exclusive_entry(entry))
+ goto fault;
+
if (is_migration_entry(entry)) {
pte_unmap(ptep);
hmm_vma_walk->last = addr;
diff --git a/mm/memory.c b/mm/memory.c
index 3a5705cfc891..33d11527ef77 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -781,6 +781,27 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pte = pte_swp_mkuffd_wp(pte);
set_pte_at(src_mm, addr, src_pte, pte);
}
+ } else if (is_device_exclusive_entry(entry)) {
+ page = pfn_swap_entry_to_page(entry);
+
+ get_page(page);
+ rss[mm_counter(page)]++;
+
+ if (is_writable_device_exclusive_entry(entry) &&
+ is_cow_mapping(vm_flags)) {
+ /*
+ * COW mappings require pages in both
+ * parent and child to be set to read.
+ */
+ entry = make_readable_device_exclusive_entry(
+ swp_offset(entry));
+ pte = swp_entry_to_pte(entry);
+ if (pte_swp_soft_dirty(*src_pte))
+ pte = pte_swp_mksoft_dirty(pte);
+ if (pte_swp_uffd_wp(*src_pte))
+ pte = pte_swp_mkuffd_wp(pte);
+ set_pte_at(src_mm, addr, src_pte, pte);
+ }
}
set_pte_at(dst_mm, addr, dst_pte, pte);
return 0;
@@ -1287,7 +1308,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
}

entry = pte_to_swp_entry(ptent);
- if (is_device_private_entry(entry)) {
+ if (is_device_private_entry(entry) ||
+ is_device_exclusive_entry(entry)) {
struct page *page = pfn_swap_entry_to_page(entry);

if (unlikely(details && details->check_mapping)) {
@@ -1303,7 +1325,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,

pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
rss[mm_counter(page)]--;
- page_remove_rmap(page, false);
+
+ if (is_device_private_entry(entry))
+ page_remove_rmap(page, false);
+
put_page(page);
continue;
}
@@ -3256,6 +3281,82 @@ void unmap_mapping_range(struct address_space *mapping,
}
EXPORT_SYMBOL(unmap_mapping_range);

+static void restore_exclusive_pte(struct vm_area_struct *vma,
+ struct page *page, unsigned long address,
+ pte_t *ptep)
+{
+ pte_t pte;
+ swp_entry_t entry;
+
+ pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
+ if (pte_swp_soft_dirty(*ptep))
+ pte = pte_mksoft_dirty(pte);
+
+ entry = pte_to_swp_entry(*ptep);
+ if (pte_swp_uffd_wp(*ptep))
+ pte = pte_mkuffd_wp(pte);
+ else if (is_writable_device_exclusive_entry(entry))
+ pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+
+ set_pte_at(vma->vm_mm, address, ptep, pte);
+
+ /*
+ * No need to take a page reference as one was already
+ * created when the swap entry was made.
+ */
+ if (PageAnon(page))
+ page_add_anon_rmap(page, vma, address, false);
+ else
+ page_add_file_rmap(page, false);
+
+ if (vma->vm_flags & VM_LOCKED)
+ mlock_vma_page(page);
+
+ /*
+ * No need to invalidate - it was non-present before. However
+ * secondary CPUs may have mappings that need invalidating.
+ */
+ update_mmu_cache(vma, address, ptep);
+}
+
+/*
+ * Restore a potential device exclusive pte to a working pte entry
+ */
+static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
+{
+ struct page *page = vmf->page;
+ struct vm_area_struct *vma = vmf->vma;
+ struct page_vma_mapped_walk pvmw = {
+ .page = page,
+ .vma = vma,
+ .address = vmf->address,
+ .flags = PVMW_SYNC,
+ };
+ vm_fault_t ret = 0;
+ struct mmu_notifier_range range;
+
+ if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags))
+ return VM_FAULT_RETRY;
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
+ vmf->address & PAGE_MASK,
+ (vmf->address & PAGE_MASK) + PAGE_SIZE);
+ mmu_notifier_invalidate_range_start(&range);
+
+ while (page_vma_mapped_walk(&pvmw)) {
+ if (unlikely(!pte_same(*pvmw.pte, vmf->orig_pte))) {
+ page_vma_mapped_walk_done(&pvmw);
+ break;
+ }
+
+ restore_exclusive_pte(vma, page, pvmw.address, pvmw.pte);
+ }
+
+ unlock_page(page);
+
+ mmu_notifier_invalidate_range_end(&range);
+ return ret;
+}
+
/*
* We enter with non-exclusive mmap_lock (to exclude vma changes,
* but allow concurrent faults), and pte mapped but not yet locked.
@@ -3283,6 +3384,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (is_migration_entry(entry)) {
migration_entry_wait(vma->vm_mm, vmf->pmd,
vmf->address);
+ } else if (is_device_exclusive_entry(entry)) {
+ vmf->page = pfn_swap_entry_to_page(entry);
+ ret = remove_device_exclusive_entry(vmf);
} else if (is_device_private_entry(entry)) {
vmf->page = pfn_swap_entry_to_page(entry);
ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
diff --git a/mm/migrate.c b/mm/migrate.c
index cc4612e2a246..9cc9251d4802 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2570,8 +2570,8 @@ static void migrate_vma_collect(struct migrate_vma *migrate)
* that the registered device driver can skip invalidating device
* private page mappings that won't be migrated.
*/
- mmu_notifier_range_init_migrate(&range, 0, migrate->vma,
- migrate->vma->vm_mm, migrate->start, migrate->end,
+ mmu_notifier_range_init_owner(&range, MMU_NOTIFY_MIGRATE, 0,
+ migrate->vma, migrate->vma->vm_mm, migrate->start, migrate->end,
migrate->pgmap_owner);
mmu_notifier_invalidate_range_start(&range);

@@ -3074,9 +3074,9 @@ void migrate_vma_pages(struct migrate_vma *migrate)
if (!notified) {
notified = true;

- mmu_notifier_range_init_migrate(&range, 0,
- migrate->vma, migrate->vma->vm_mm,
- addr, migrate->end,
+ mmu_notifier_range_init_owner(&range,
+ MMU_NOTIFY_MIGRATE, 0, migrate->vma,
+ migrate->vma->vm_mm, addr, migrate->end,
migrate->pgmap_owner);
mmu_notifier_invalidate_range_start(&range);
}
diff --git a/mm/mprotect.c b/mm/mprotect.c
index f21b760ec809..c6018541ea3d 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -165,6 +165,14 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
newpte = swp_entry_to_pte(entry);
if (pte_swp_uffd_wp(oldpte))
newpte = pte_swp_mkuffd_wp(newpte);
+ } else if (is_writable_device_exclusive_entry(entry)) {
+ entry = make_readable_device_exclusive_entry(
+ swp_offset(entry));
+ newpte = swp_entry_to_pte(entry);
+ if (pte_swp_soft_dirty(oldpte))
+ newpte = pte_swp_mksoft_dirty(newpte);
+ if (pte_swp_uffd_wp(oldpte))
+ newpte = pte_swp_mkuffd_wp(newpte);
} else {
newpte = oldpte;
}
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index eed988ab2e81..29842f169219 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -41,7 +41,8 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw)

/* Handle un-addressable ZONE_DEVICE memory */
entry = pte_to_swp_entry(*pvmw->pte);
- if (!is_device_private_entry(entry))
+ if (!is_device_private_entry(entry) &&
+ !is_device_exclusive_entry(entry))
return false;
} else if (!pte_present(*pvmw->pte))
return false;
@@ -93,7 +94,8 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
return false;
entry = pte_to_swp_entry(*pvmw->pte);

- if (!is_migration_entry(entry))
+ if (!is_migration_entry(entry) &&
+ !is_device_exclusive_entry(entry))
return false;

pfn = swp_offset(entry);
@@ -102,7 +104,8 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)

/* Handle un-addressable ZONE_DEVICE memory */
entry = pte_to_swp_entry(*pvmw->pte);
- if (!is_device_private_entry(entry))
+ if (!is_device_private_entry(entry) &&
+ !is_device_exclusive_entry(entry))
return false;

pfn = swp_offset(entry);
diff --git a/mm/rmap.c b/mm/rmap.c
index b540b44e299a..b0ec88a37dab 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2005,6 +2005,216 @@ void try_to_munlock(struct page *page)
rmap_walk(page, &rwc);
}

+struct ttp_args {
+ struct mm_struct *mm;
+ unsigned long address;
+ void *arg;
+ bool valid;
+};
+
+static bool try_to_protect_one(struct page *page, struct vm_area_struct *vma,
+ unsigned long address, void *arg)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct page_vma_mapped_walk pvmw = {
+ .page = page,
+ .vma = vma,
+ .address = address,
+ };
+ struct ttp_args *ttp = arg;
+ pte_t pteval;
+ struct page *subpage;
+ bool ret = true;
+ struct mmu_notifier_range range;
+ swp_entry_t entry;
+ pte_t swp_pte;
+
+ mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
+ vma->vm_mm, address,
+ min(vma->vm_end,
+ address + page_size(page)),
+ ttp->arg);
+ if (PageHuge(page)) {
+ /*
+ * If sharing is possible, start and end will be adjusted
+ * accordingly.
+ */
+ adjust_range_if_pmd_sharing_possible(vma, &range.start,
+ &range.end);
+ }
+ mmu_notifier_invalidate_range_start(&range);
+
+ while (page_vma_mapped_walk(&pvmw)) {
+ /* Unexpected PMD-mapped THP? */
+ VM_BUG_ON_PAGE(!pvmw.pte, page);
+
+ if (!pte_present(*pvmw.pte)) {
+ ret = false;
+ page_vma_mapped_walk_done(&pvmw);
+ break;
+ }
+
+ subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
+ address = pvmw.address;
+
+ /* Nuke the page table entry. */
+ flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
+ pteval = ptep_clear_flush(vma, address, pvmw.pte);
+
+ /* Move the dirty bit to the page. Now the pte is gone. */
+ if (pte_dirty(pteval))
+ set_page_dirty(page);
+
+ /* Update high watermark before we lower rss */
+ update_hiwater_rss(mm);
+
+ if (arch_unmap_one(mm, vma, address, pteval) < 0) {
+ set_pte_at(mm, address, pvmw.pte, pteval);
+ ret = false;
+ page_vma_mapped_walk_done(&pvmw);
+ break;
+ }
+
+ /*
+ * Check that our target page is still mapped at the expected
+ * address.
+ */
+ if (ttp->mm == mm && ttp->address == address &&
+ pte_write(pteval))
+ ttp->valid = true;
+
+ /*
+ * Store the pfn of the page in a special migration
+ * pte. do_swap_page() will wait until the migration
+ * pte is removed and then restart fault handling.
+ */
+ if (pte_write(pteval))
+ entry = make_writable_device_exclusive_entry(
+ page_to_pfn(subpage));
+ else
+ entry = make_readable_device_exclusive_entry(
+ page_to_pfn(subpage));
+ swp_pte = swp_entry_to_pte(entry);
+ if (pte_soft_dirty(pteval))
+ swp_pte = pte_swp_mksoft_dirty(swp_pte);
+ if (pte_uffd_wp(pteval))
+ swp_pte = pte_swp_mkuffd_wp(swp_pte);
+
+ /* Take a reference for the swap entry */
+ get_page(page);
+ set_pte_at(mm, address, pvmw.pte, swp_pte);
+
+ page_remove_rmap(subpage, PageHuge(page));
+ put_page(page);
+ }
+
+ mmu_notifier_invalidate_range_end(&range);
+
+ return ret;
+}
+
+/**
+ * try_to_protect - try to replace all page table mappings with swap entries
+ * @page: the page to replace page table entries for
+ * @flags: action and flags
+ * @mm: the mm_struct where the page is expected to be mapped
+ * @address: address where the page is expected to be mapped
+ * @arg: passed to MMU_NOTIFY_EXCLUSIVE range notifier callbacks
+ *
+ * Tries to remove all the page table entries which are mapping this page and
+ * replace them with special swap entries to grant a device exclusive access to
+ * the page. Caller must hold the page lock.
+ *
+ * Returns false if the page is still mapped, or if it could not be unmapped
+ * from the expected address. Otherwise returns true (success).
+ */
+static bool try_to_protect(struct page *page, struct mm_struct *mm,
+ unsigned long address, void *arg)
+{
+ struct ttp_args ttp = {
+ .mm = mm,
+ .address = address,
+ .arg = arg,
+ .valid = false,
+ };
+ struct rmap_walk_control rwc = {
+ .rmap_one = try_to_protect_one,
+ .done = page_not_mapped,
+ .anon_lock = page_lock_anon_vma_read,
+ .arg = &ttp,
+ };
+
+ /*
+ * Restrict to anonymous pages for now to avoid potential writeback
+ * issues.
+ */
+ if (!PageAnon(page))
+ return false;
+
+ /*
+ * During exec, a temporary VMA is setup and later moved.
+ * The VMA is moved under the anon_vma lock but not the
+ * page tables leading to a race where migration cannot
+ * find the migration ptes. Rather than increasing the
+ * locking requirements of exec(), migration skips
+ * temporary VMAs until after exec() completes.
+ */
+ if (!PageKsm(page) && PageAnon(page))
+ rwc.invalid_vma = invalid_migration_vma;
+
+ rmap_walk(page, &rwc);
+
+ return ttp.valid && !page_mapcount(page);
+}
+
+/**
+ * make_device_exclusive_range() - Mark a range for exclusive use by a device
+ * @mm: mm_struct of assoicated target process
+ * @start: start of the region to mark for exclusive device access
+ * @end: end address of region
+ * @pages: returns the pages which were successfully marked for exclusive access
+ * @arg: passed to MMU_NOTIFY_EXCLUSIVE range notifier too allow filtering
+ *
+ * Returns: number of pages successfully marked for exclusive access
+ *
+ * This function finds ptes mapping page(s) to the given address range, locks
+ * them and replaces mappings with special swap entries preventing userspace CPU
+ * access. On fault these entries are replaced with the original mapping after
+ * calling MMU notifiers.
+ *
+ * A driver using this to program access from a device must use a mmu notifier
+ * critical section to hold a device specific lock during programming. Once
+ * programming is complete it should drop the page lock and reference after
+ * which point CPU access to the page will revoke the exclusive access.
+ */
+int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
+ unsigned long end, struct page **pages,
+ void *arg)
+{
+ unsigned long npages = (end - start) >> PAGE_SHIFT;
+ unsigned long i;
+
+ npages = get_user_pages_remote(mm, start, npages,
+ FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
+ pages, NULL, NULL);
+ for (i = 0; i < npages; i++, start += PAGE_SIZE) {
+ if (!trylock_page(pages[i])) {
+ put_page(pages[i]);
+ pages[i] = NULL;
+ continue;
+ }
+
+ if (!try_to_protect(pages[i], mm, start, arg)) {
+ unlock_page(pages[i]);
+ put_page(pages[i]);
+ pages[i] = NULL;
+ }
+ }
+
+ return npages;
+}
+EXPORT_SYMBOL_GPL(make_device_exclusive_range);
+
void __put_anon_vma(struct anon_vma *anon_vma)
{
struct anon_vma *root = anon_vma->root;
--
2.20.1

2021-03-30 18:40:23

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v7 1/8] mm: Remove special swap entry functions

On Fri, Mar 26, 2021 at 11:07:58AM +1100, Alistair Popple wrote:
> Remove multiple similar inline functions for dealing with different
> types of special swap entries.
>
> Both migration and device private swap entries use the swap offset to
> store a pfn. Instead of multiple inline functions to obtain a struct
> page for each swap entry type use a common function
> pfn_swap_entry_to_page(). Also open-code the various entry_to_pfn()
> functions as this results is shorter code that is easier to understand.
>
> Signed-off-by: Alistair Popple <[email protected]>
> Reviewed-by: Ralph Campbell <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
>
> ---
>
> v7:
> * Reworded commit message to include pfn_swap_entry_to_page()
> * Added Christoph's Reviewed-by
>
> v6:
> * Removed redundant compound_page() call from inside PageLocked()
> * Fixed a minor build issue for s390 reported by kernel test bot
>
> v4:
> * Added pfn_swap_entry_to_page()
> * Reinstated check that migration entries point to locked pages
> * Removed #define swapcache_prepare which isn't needed for CONFIG_SWAP=0
> builds
> ---
> arch/s390/mm/pgtable.c | 2 +-
> fs/proc/task_mmu.c | 23 +++++---------
> include/linux/swap.h | 4 +--
> include/linux/swapops.h | 69 ++++++++++++++---------------------------
> mm/hmm.c | 5 ++-
> mm/huge_memory.c | 4 +--
> mm/memcontrol.c | 2 +-
> mm/memory.c | 10 +++---
> mm/migrate.c | 6 ++--
> mm/page_vma_mapped.c | 6 ++--
> 10 files changed, 50 insertions(+), 81 deletions(-)

Looks good

Reviewed-by: Jason Gunthorpe <[email protected]>

> diff --git a/mm/hmm.c b/mm/hmm.c
> index 943cb2ba4442..3b2dda71d0ed 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -214,7 +214,7 @@ static inline bool hmm_is_device_private_entry(struct hmm_range *range,
> swp_entry_t entry)
> {
> return is_device_private_entry(entry) &&
> - device_private_entry_to_page(entry)->pgmap->owner ==
> + pfn_swap_entry_to_page(entry)->pgmap->owner ==
> range->dev_private_owner;
> }
>
> @@ -257,8 +257,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
> cpu_flags = HMM_PFN_VALID;
> if (is_write_device_private_entry(entry))
> cpu_flags |= HMM_PFN_WRITE;
> - *hmm_pfn = device_private_entry_to_pfn(entry) |
> - cpu_flags;
> + *hmm_pfn = swp_offset(entry) | cpu_flags;

Though swp_offset() seems poor here

Something like this seems nicer, maybe as an additional patch in this
series?

diff --git a/mm/hmm.c b/mm/hmm.c
index 943cb2ba444232..c06cbc4e3981b7 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -210,14 +210,6 @@ int hmm_vma_handle_pmd(struct mm_walk *walk, unsigned long addr,
unsigned long end, unsigned long hmm_pfns[], pmd_t pmd);
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

-static inline bool hmm_is_device_private_entry(struct hmm_range *range,
- swp_entry_t entry)
-{
- return is_device_private_entry(entry) &&
- device_private_entry_to_page(entry)->pgmap->owner ==
- range->dev_private_owner;
-}
-
static inline unsigned long pte_to_hmm_pfn_flags(struct hmm_range *range,
pte_t pte)
{
@@ -226,6 +218,32 @@ static inline unsigned long pte_to_hmm_pfn_flags(struct hmm_range *range,
return pte_write(pte) ? (HMM_PFN_VALID | HMM_PFN_WRITE) : HMM_PFN_VALID;
}

+static bool hmm_pte_handle_device_private(struct hmm_range *range, pte_t pte,
+ unsigned long *hmm_pfn)
+{
+ swp_entry_t entry = pte_to_swp_entry(pte);
+ struct page *device_page;
+ unsigned long cpu_flags;
+
+ if (is_device_private_entry(entry))
+ return false;
+
+ /*
+ * If the device private page matches the device the caller understands
+ * then return the private pfn directly. The caller must know what to do
+ * with it.
+ */
+ device_page = pfn_swap_entry_to_page(entry);
+ if (device_page->pgmap->owner != range->dev_private_owner)
+ return false;
+
+ cpu_flags = HMM_PFN_VALID;
+ if (is_write_device_private_entry(entry))
+ cpu_flags |= HMM_PFN_WRITE;
+ *hmm_pfn = page_to_pfn(device_page) | cpu_flags;
+ return true;
+}
+
static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
unsigned long end, pmd_t *pmdp, pte_t *ptep,
unsigned long *hmm_pfn)
@@ -247,20 +265,8 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
}

if (!pte_present(pte)) {
- swp_entry_t entry = pte_to_swp_entry(pte);
-
- /*
- * Never fault in device private pages, but just report
- * the PFN even if not present.
- */
- if (hmm_is_device_private_entry(range, entry)) {
- cpu_flags = HMM_PFN_VALID;
- if (is_write_device_private_entry(entry))
- cpu_flags |= HMM_PFN_WRITE;
- *hmm_pfn = device_private_entry_to_pfn(entry) |
- cpu_flags;
+ if (hmm_pte_handle_device_private(range, pte, hmm_pfn))
return 0;
- }

required_fault =
hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);

Jason

2021-03-30 18:51:01

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

On Fri, Mar 26, 2021 at 11:08:00AM +1100, Alistair Popple wrote:

> +static bool try_to_munlock_one(struct page *page, struct vm_area_struct *vma,
> + unsigned long address, void *arg)
> +{

Is this function name right?

> + struct page_vma_mapped_walk pvmw = {
> + .page = page,
> + .vma = vma,
> + .address = address,
> + };
> +
> + /* munlock has nothing to gain from examining un-locked vmas */
> + if (!(vma->vm_flags & VM_LOCKED))
> + return true;
> +
> + while (page_vma_mapped_walk(&pvmw)) {
> + /* PTE-mapped THP are never mlocked */
> + if (!PageTransCompound(page)) {
> + /*
> + * Holding pte lock, we do *not* need
> + * mmap_lock here
> + */
> + mlock_vma_page(page);

Because the only action this function seems to take is to call
*mlock*_vma_page()

> + }
> + page_vma_mapped_walk_done(&pvmw);
> +
> + /* found a mlocked page, no point continuing munlock check */
> + return false;
> + }
> +
> + return true;
> +}
> +
> /**
> * try_to_munlock - try to munlock a page
> * @page: the page to be munlocked
> @@ -1796,8 +1821,7 @@ bool try_to_unmap(struct page *page, enum ttu_flags flags)
> void try_to_munlock(struct page *page)
> {

But this is also called try_to_munlock ??

/**
* try_to_munlock - try to munlock a page
* @page: the page to be munlocked
*
* Called from munlock code. Checks all of the VMAs mapping the page
* to make sure nobody else has this page mlocked. The page will be
* returned with PG_mlocked cleared if no other vmas have it mlocked.
*/

So what clears PG_mlocked on this call path?

Something needs attention here..

Jason

2021-03-30 19:36:23

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] mm: Device exclusive memory access

On Fri, Mar 26, 2021 at 11:08:02AM +1100, Alistair Popple wrote:
> diff --git a/mm/memory.c b/mm/memory.c
> index 3a5705cfc891..33d11527ef77 100644
> +++ b/mm/memory.c
> @@ -781,6 +781,27 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> pte = pte_swp_mkuffd_wp(pte);
> set_pte_at(src_mm, addr, src_pte, pte);
> }
> + } else if (is_device_exclusive_entry(entry)) {
> + page = pfn_swap_entry_to_page(entry);
> +
> + get_page(page);
> + rss[mm_counter(page)]++;
> +
> + if (is_writable_device_exclusive_entry(entry) &&
> + is_cow_mapping(vm_flags)) {
> + /*
> + * COW mappings require pages in both
> + * parent and child to be set to read.
> + */
> + entry = make_readable_device_exclusive_entry(
> + swp_offset(entry));
> + pte = swp_entry_to_pte(entry);
> + if (pte_swp_soft_dirty(*src_pte))
> + pte = pte_swp_mksoft_dirty(pte);
> + if (pte_swp_uffd_wp(*src_pte))
> + pte = pte_swp_mkuffd_wp(pte);
> + set_pte_at(src_mm, addr, src_pte, pte);
> + }

This needs to have the same logic as we now have in
copy_present_page(). The page *is* present and we can't copy the PTE
value hidden in a swap entry if we can't copy the PTE normally.

The code should be shared because nobody is going to remember about
this corner case.

Jason

2021-03-30 22:12:06

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

On Wednesday, 31 March 2021 5:49:03 AM AEDT Jason Gunthorpe wrote:
> On Fri, Mar 26, 2021 at 11:08:00AM +1100, Alistair Popple wrote:
>
> > +static bool try_to_munlock_one(struct page *page, struct vm_area_struct
*vma,
> > + unsigned long address, void *arg)
> > +{
>
> Is this function name right?

Perhaps. This is called from try_to_munlock() hence the name, but see below
for some commentary on that naming.

> > + struct page_vma_mapped_walk pvmw = {
> > + .page = page,
> > + .vma = vma,
> > + .address = address,
> > + };
> > +
> > + /* munlock has nothing to gain from examining un-locked vmas */
> > + if (!(vma->vm_flags & VM_LOCKED))
> > + return true;
> > +
> > + while (page_vma_mapped_walk(&pvmw)) {
> > + /* PTE-mapped THP are never mlocked */
> > + if (!PageTransCompound(page)) {
> > + /*
> > + * Holding pte lock, we do *not* need
> > + * mmap_lock here
> > + */
> > + mlock_vma_page(page);
>
> Because the only action this function seems to take is to call
> *mlock*_vma_page()
>
> > + }
> > + page_vma_mapped_walk_done(&pvmw);
> > +
> > + /* found a mlocked page, no point continuing munlock check */
> > + return false;
> > + }
> > +
> > + return true;
> > +}
> > +
> > /**
> > * try_to_munlock - try to munlock a page
> > * @page: the page to be munlocked
> > @@ -1796,8 +1821,7 @@ bool try_to_unmap(struct page *page, enum ttu_flags
flags)
> > void try_to_munlock(struct page *page)
> > {
>
> But this is also called try_to_munlock ??

As far as I can tell this has always been called try_to_munlock() even though
it appears to do the opposite.

> /**
> * try_to_munlock - try to munlock a page
> * @page: the page to be munlocked
> *
> * Called from munlock code. Checks all of the VMAs mapping the page
> * to make sure nobody else has this page mlocked. The page will be
> * returned with PG_mlocked cleared if no other vmas have it mlocked.
> */

In other words it sets PG_mlocked if one or more vmas has it mlocked. So
try_to_mlock() might be a better name, except that seems to have the potential
for confusion as well because it's only called from the munlock code path and
never for mlock.

> So what clears PG_mlocked on this call path?

See munlock_vma_page(). munlock works by clearing PG_mlocked, then calling
try_to_munlock to check if any VMAs still need it locked in which case
PG_mlocked gets set again. There are no other callers of try_to_munlock().

> Something needs attention here..

I think the code is correct, but perhaps the naming could be better. Would be
interested hearing any thoughts on renaming try_to_munlock() to try_to_mlock()
as the current name appears based on the context it is called from (munlock)
rather than what it does (mlock).

- Alistair

> Jason
>




2021-03-30 22:18:55

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

On Wednesday, 31 March 2021 9:09:30 AM AEDT Alistair Popple wrote:
> On Wednesday, 31 March 2021 5:49:03 AM AEDT Jason Gunthorpe wrote:
> > On Fri, Mar 26, 2021 at 11:08:00AM +1100, Alistair Popple wrote:

<snip>

> > So what clears PG_mlocked on this call path?
>
> See munlock_vma_page(). munlock works by clearing PG_mlocked, then calling
> try_to_munlock to check if any VMAs still need it locked in which case
> PG_mlocked gets set again. There are no other callers of try_to_munlock().
>
> > Something needs attention here..
>
> I think the code is correct, but perhaps the naming could be better. Would be
> interested hearing any thoughts on renaming try_to_munlock() to try_to_mlock()
> as the current name appears based on the context it is called from (munlock)
> rather than what it does (mlock).

Actually Documentation/vm/unevictable-lru.rst contains a better suggestion:

try_to_munlock() Reverse Map Scan
---------------------------------

.. warning::
[!] TODO/FIXME: a better name might be page_mlocked() - analogous to the
page_referenced() reverse map walker.

Thoughts on renaming try_to_unlock() -> page_mlocked() and try_to_munlock_one() -> page_mlock_one()?

> - Alistair
>
> > Jason
> >
>
>
>
>
> _______________________________________________
> dri-devel mailing list
> [email protected]
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>
>




2021-03-30 22:26:30

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

On Wed, Mar 31, 2021 at 09:09:30AM +1100, Alistair Popple wrote:

> > > @@ -1796,8 +1821,7 @@ bool try_to_unmap(struct page *page, enum ttu_flags
> flags)
> > > void try_to_munlock(struct page *page)
> > > {
> >
> > But this is also called try_to_munlock ??
>
> As far as I can tell this has always been called try_to_munlock() even though
> it appears to do the opposite.

Maybe we should change it then?

> > /**
> > * try_to_munlock - try to munlock a page
> > * @page: the page to be munlocked
> > *
> > * Called from munlock code. Checks all of the VMAs mapping the page
> > * to make sure nobody else has this page mlocked. The page will be
> > * returned with PG_mlocked cleared if no other vmas have it mlocked.
> > */
>
> In other words it sets PG_mlocked if one or more vmas has it mlocked. So
> try_to_mlock() might be a better name, except that seems to have the potential
> for confusion as well because it's only called from the munlock code path and
> never for mlock.

That explanation makes more sense.. This function looks like it is
'set PG_mlocked of the page if any vm->flags has VM_LOCKED'

Maybe call it check_vm_locked or something then and reword the above
comment?

(and why is it OK to read vm->flags for this without any locking?)

> > Something needs attention here..
>
> I think the code is correct, but perhaps the naming could be better. Would be
> interested hearing any thoughts on renaming try_to_munlock() to try_to_mlock()
> as the current name appears based on the context it is called from (munlock)
> rather than what it does (mlock).

The point of this patch is to make it clearer, after all, so I'd
change something and maybe slightly clarify the comment.

Jason

2021-03-30 22:45:26

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

On 3/30/21 3:24 PM, Jason Gunthorpe wrote:
...
>> As far as I can tell this has always been called try_to_munlock() even though
>> it appears to do the opposite.
>
> Maybe we should change it then?
>
>>> /**
>>> * try_to_munlock - try to munlock a page
>>> * @page: the page to be munlocked
>>> *
>>> * Called from munlock code. Checks all of the VMAs mapping the page
>>> * to make sure nobody else has this page mlocked. The page will be
>>> * returned with PG_mlocked cleared if no other vmas have it mlocked.
>>> */
>>
>> In other words it sets PG_mlocked if one or more vmas has it mlocked. So
>> try_to_mlock() might be a better name, except that seems to have the potential
>> for confusion as well because it's only called from the munlock code path and
>> never for mlock.
>
> That explanation makes more sense.. This function looks like it is
> 'set PG_mlocked of the page if any vm->flags has VM_LOCKED'
>
> Maybe call it check_vm_locked or something then and reword the above
> comment?
>
> (and why is it OK to read vm->flags for this without any locking?)
>
>>> Something needs attention here..
>>
>> I think the code is correct, but perhaps the naming could be better. Would be
>> interested hearing any thoughts on renaming try_to_munlock() to try_to_mlock()
>> as the current name appears based on the context it is called from (munlock)
>> rather than what it does (mlock).
>
> The point of this patch is to make it clearer, after all, so I'd
> change something and maybe slightly clarify the comment.
>

I'd add that, after looking around the calling code, this is a really unhappy
pre-existing situation. Anyone reading this has to remember at which point in the
call stack the naming transitions from "do the opposite of what the name says",
to "do what the name says".

+1 for renaming "munlock*" items to "mlock*", where applicable. good grief.

Although, it seems reasonable to tack such renaming patches onto the tail end
of this series. But whatever works.

thanks,
--
John Hubbard
NVIDIA

2021-03-30 22:58:45

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

On Wednesday, 31 March 2021 9:43:19 AM AEDT John Hubbard wrote:
> On 3/30/21 3:24 PM, Jason Gunthorpe wrote:
> ...
> >> As far as I can tell this has always been called try_to_munlock() even
though
> >> it appears to do the opposite.
> >
> > Maybe we should change it then?
> >
> >>> /**
> >>> * try_to_munlock - try to munlock a page
> >>> * @page: the page to be munlocked
> >>> *
> >>> * Called from munlock code. Checks all of the VMAs mapping the page
> >>> * to make sure nobody else has this page mlocked. The page will be
> >>> * returned with PG_mlocked cleared if no other vmas have it mlocked.
> >>> */
> >>
> >> In other words it sets PG_mlocked if one or more vmas has it mlocked. So
> >> try_to_mlock() might be a better name, except that seems to have the
potential
> >> for confusion as well because it's only called from the munlock code path
and
> >> never for mlock.
> >
> > That explanation makes more sense.. This function looks like it is
> > 'set PG_mlocked of the page if any vm->flags has VM_LOCKED'
> >
> > Maybe call it check_vm_locked or something then and reword the above
> > comment?
> >
> > (and why is it OK to read vm->flags for this without any locking?)
> >
> >>> Something needs attention here..
> >>
> >> I think the code is correct, but perhaps the naming could be better.
Would be
> >> interested hearing any thoughts on renaming try_to_munlock() to
try_to_mlock()
> >> as the current name appears based on the context it is called from
(munlock)
> >> rather than what it does (mlock).
> >
> > The point of this patch is to make it clearer, after all, so I'd
> > change something and maybe slightly clarify the comment.
> >

Yep, agree with that.

> I'd add that, after looking around the calling code, this is a really
unhappy
> pre-existing situation. Anyone reading this has to remember at which point
in the
> call stack the naming transitions from "do the opposite of what the name
says",
> to "do what the name says".
>
> +1 for renaming "munlock*" items to "mlock*", where applicable. good grief.

At least the situation was weird enough to prompt further investigation :)

Renaming to mlock* doesn't feel like the right solution to me either though. I
am not sure if you saw me responding to myself earlier but I am thinking
renaming try_to_munlock() -> page_mlocked() and try_to_munlock_one() ->
page_mlock_one() might be better. Thoughts?

This is actually inspired from a suggestion in Documentation/vm/unevictable-
lru.rst which warns about this problem:

try_to_munlock() Reverse Map Scan
---------------------------------

.. warning::
[!] TODO/FIXME: a better name might be page_mlocked() - analogous to the
page_referenced() reverse map walker.

> Although, it seems reasonable to tack such renaming patches onto the tail
end
> of this series. But whatever works.

Unless anyone objects strongly I will roll the rename into this patch as there
is only one caller of try_to_munlock.

- Alistair

> thanks,
>




2021-03-31 04:03:58

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

On 3/30/21 3:56 PM, Alistair Popple wrote:
...
>> +1 for renaming "munlock*" items to "mlock*", where applicable. good grief.
>
> At least the situation was weird enough to prompt further investigation :)
>
> Renaming to mlock* doesn't feel like the right solution to me either though. I
> am not sure if you saw me responding to myself earlier but I am thinking
> renaming try_to_munlock() -> page_mlocked() and try_to_munlock_one() ->
> page_mlock_one() might be better. Thoughts?
>

Quite confused by this naming idea. Because: try_to_munlock() returns
void, so a boolean-style name such as "page_mlocked()" is already not a
good fit.

Even more important, though, is that try_to_munlock() is mlock-ing the
page, right? Is there some subtle point I'm missing? It really is doing
an mlock to the best of my knowledge here. Although the kerneldoc
comment for try_to_munlock() seems questionable too:

/**
* try_to_munlock - try to munlock a page
* @page: the page to be munlocked
*
* Called from munlock code. Checks all of the VMAs mapping the page
* to make sure nobody else has this page mlocked. The page will be
* returned with PG_mlocked cleared if no other vmas have it mlocked.
*/

...because I don't see where, in *this* routine, it clears PG_mlocked!

Obviously we agree that a routine should be named based on what it does,
rather than on who calls it. So I think that still leads to:

try_to_munlock() --> try_to_mlock()
try_to_munlock_one() --> try_to_mlock_one()

Sorry if I'm missing something really obvious.


> This is actually inspired from a suggestion in Documentation/vm/unevictable-
> lru.rst which warns about this problem:
>
> try_to_munlock() Reverse Map Scan
> ---------------------------------
>
> .. warning::
> [!] TODO/FIXME: a better name might be page_mlocked() - analogous to the
> page_referenced() reverse map walker.
>

This is actually rather bad advice! page_referenced() returns an
int-that-is-really-a-boolean, whereas try_to_munlock(), at least as it
stands now, returns void. Usually when I'm writing a TODO item, I'm in a
hurry, and I think that's what probably happened here, too. :)


>> Although, it seems reasonable to tack such renaming patches onto the tail
> end
>> of this series. But whatever works.
>
> Unless anyone objects strongly I will roll the rename into this patch as there
> is only one caller of try_to_munlock.
>
> - Alistair
>

No objections here. :)

thanks,
--
John Hubbard
NVIDIA

2021-03-31 04:12:42

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

On 3/30/21 8:56 PM, John Hubbard wrote:
> On 3/30/21 3:56 PM, Alistair Popple wrote:
> ...
>>> +1 for renaming "munlock*" items to "mlock*", where applicable. good grief.
>>
>> At least the situation was weird enough to prompt further investigation :)
>>
>> Renaming to mlock* doesn't feel like the right solution to me either though. I
>> am not sure if you saw me responding to myself earlier but I am thinking
>> renaming try_to_munlock() -> page_mlocked() and try_to_munlock_one() ->
>> page_mlock_one() might be better. Thoughts?
>>
>
> Quite confused by this naming idea. Because: try_to_munlock() returns
> void, so a boolean-style name such as "page_mlocked()" is already not a
> good fit.
>
> Even more important, though, is that try_to_munlock() is mlock-ing the
> page, right? Is there some subtle point I'm missing? It really is doing
> an mlock to the best of my knowledge here. Although the kerneldoc
> comment for try_to_munlock() seems questionable too:
>
> /**
> * try_to_munlock - try to munlock a page
> * @page: the page to be munlocked
> *
> * Called from munlock code.  Checks all of the VMAs mapping the page
> * to make sure nobody else has this page mlocked. The page will be
> * returned with PG_mlocked cleared if no other vmas have it mlocked.
> */
>
> ...because I don't see where, in *this* routine, it clears PG_mlocked!
>
> Obviously we agree that a routine should be named based on what it does,
> rather than on who calls it. So I think that still leads to:
>
>     try_to_munlock() --> try_to_mlock()
>     try_to_munlock_one() --> try_to_mlock_one()
>
> Sorry if I'm missing something really obvious.

Actually, re-reading your and Jason's earlier points in the thread, I see
that I'm *not* missing anything, and we are actually in agreement about how
the code operates. OK, good!

Also, as you point out above, maybe the "try_" prefix is not really accurate
either, given how this works. So maybe we have arrived at something like:

try_to_munlock() --> page_mlock() // or mlock_page()...
try_to_munlock_one() --> page_mlock_one()



thanks,
--
John Hubbard
NVIDIA

2021-03-31 04:18:09

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

On Wednesday, 31 March 2021 2:56:38 PM AEDT John Hubbard wrote:
> On 3/30/21 3:56 PM, Alistair Popple wrote:
> ...
> >> +1 for renaming "munlock*" items to "mlock*", where applicable. good
grief.
> >
> > At least the situation was weird enough to prompt further investigation :)
> >
> > Renaming to mlock* doesn't feel like the right solution to me either
though. I
> > am not sure if you saw me responding to myself earlier but I am thinking
> > renaming try_to_munlock() -> page_mlocked() and try_to_munlock_one() ->
> > page_mlock_one() might be better. Thoughts?
> >
>
> Quite confused by this naming idea. Because: try_to_munlock() returns
> void, so a boolean-style name such as "page_mlocked()" is already not a
> good fit.
>
> Even more important, though, is that try_to_munlock() is mlock-ing the
> page, right? Is there some subtle point I'm missing? It really is doing
> an mlock to the best of my knowledge here. Although the kerneldoc
> comment for try_to_munlock() seems questionable too:

It's mlocking the page if it turns out it still needs to be locked after
unlocking it. But I don't think you're missing anything.

> /**
> * try_to_munlock - try to munlock a page
> * @page: the page to be munlocked
> *
> * Called from munlock code. Checks all of the VMAs mapping the page
> * to make sure nobody else has this page mlocked. The page will be
> * returned with PG_mlocked cleared if no other vmas have it mlocked.
> */
>
> ...because I don't see where, in *this* routine, it clears PG_mlocked!
>
> Obviously we agree that a routine should be named based on what it does,
> rather than on who calls it. So I think that still leads to:
>
> try_to_munlock() --> try_to_mlock()
> try_to_munlock_one() --> try_to_mlock_one()
>
> Sorry if I'm missing something really obvious.

Nope, I confused things somewhat by blindly quoting the documentation whilst
forgetting that try_to_munlock() returns void rather than a bool.

> > This is actually inspired from a suggestion in Documentation/vm/
unevictable-
> > lru.rst which warns about this problem:
> >
> > try_to_munlock() Reverse Map Scan
> > ---------------------------------
> >
> > .. warning::
> > [!] TODO/FIXME: a better name might be page_mlocked() - analogous to
the
> > page_referenced() reverse map walker.
> >
>
> This is actually rather bad advice! page_referenced() returns an
> int-that-is-really-a-boolean, whereas try_to_munlock(), at least as it
> stands now, returns void. Usually when I'm writing a TODO item, I'm in a
> hurry, and I think that's what probably happened here, too. :)

So I think we're in agreement. The naming is bad and the advice in the
documentation is also questionable :-)

Thanks for the thoughts, I will re-send this with naming and documentation
updates.

> >> Although, it seems reasonable to tack such renaming patches onto the tail
> > end
> >> of this series. But whatever works.
> >
> > Unless anyone objects strongly I will roll the rename into this patch as
there
> > is only one caller of try_to_munlock.
> >
> > - Alistair
> >
>
> No objections here. :)
>
> thanks,
>




2021-03-31 11:59:54

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

On Wed, Mar 31, 2021 at 03:15:47PM +1100, Alistair Popple wrote:
> On Wednesday, 31 March 2021 2:56:38 PM AEDT John Hubbard wrote:
> > On 3/30/21 3:56 PM, Alistair Popple wrote:
> > ...
> > >> +1 for renaming "munlock*" items to "mlock*", where applicable. good
> grief.
> > >
> > > At least the situation was weird enough to prompt further investigation :)
> > >
> > > Renaming to mlock* doesn't feel like the right solution to me either
> though. I
> > > am not sure if you saw me responding to myself earlier but I am thinking
> > > renaming try_to_munlock() -> page_mlocked() and try_to_munlock_one() ->
> > > page_mlock_one() might be better. Thoughts?
> > >
> >
> > Quite confused by this naming idea. Because: try_to_munlock() returns
> > void, so a boolean-style name such as "page_mlocked()" is already not a
> > good fit.
> >
> > Even more important, though, is that try_to_munlock() is mlock-ing the
> > page, right? Is there some subtle point I'm missing? It really is doing
> > an mlock to the best of my knowledge here. Although the kerneldoc
> > comment for try_to_munlock() seems questionable too:
>
> It's mlocking the page if it turns out it still needs to be locked after
> unlocking it. But I don't think you're missing anything.

It is really searching all VMA's to see if the VMA flag is set and if
any are found then it mlocks the page.

But presenting this rountine in its simplified form raises lots of
questions:

- What locking is being used to read the VMA flag?
- Why do we need to manipulate global struct page flags under the
page table locks of a single VMA?
- Why do we need to check for huge pages inside the VMA loop, not
before going to the rmap? PageTransCompoundHead() is not sensitive to
the PTEs. (and what happens if the huge page breaks up concurrently?)
- Why do we clear the mlock bit then run around to try and set it?
Feels racey.

Jason

2021-03-31 13:01:39

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] mm: Device exclusive memory access

On Wednesday, 31 March 2021 6:32:34 AM AEDT Jason Gunthorpe wrote:
> On Fri, Mar 26, 2021 at 11:08:02AM +1100, Alistair Popple wrote:
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 3a5705cfc891..33d11527ef77 100644
> > +++ b/mm/memory.c
> > @@ -781,6 +781,27 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct
mm_struct *src_mm,
> > pte = pte_swp_mkuffd_wp(pte);
> > set_pte_at(src_mm, addr, src_pte, pte);
> > }
> > + } else if (is_device_exclusive_entry(entry)) {
> > + page = pfn_swap_entry_to_page(entry);
> > +
> > + get_page(page);
> > + rss[mm_counter(page)]++;
> > +
> > + if (is_writable_device_exclusive_entry(entry) &&
> > + is_cow_mapping(vm_flags)) {
> > + /*
> > + * COW mappings require pages in both
> > + * parent and child to be set to read.
> > + */
> > + entry = make_readable_device_exclusive_entry(
> > + swp_offset(entry));
> > + pte = swp_entry_to_pte(entry);
> > + if (pte_swp_soft_dirty(*src_pte))
> > + pte = pte_swp_mksoft_dirty(pte);
> > + if (pte_swp_uffd_wp(*src_pte))
> > + pte = pte_swp_mkuffd_wp(pte);
> > + set_pte_at(src_mm, addr, src_pte, pte);
> > + }
>
> This needs to have the same logic as we now have in
> copy_present_page(). The page *is* present and we can't copy the PTE
> value hidden in a swap entry if we can't copy the PTE normally.

You're saying we need to use copy_present_page() to make sure the split goes
the right way for pinned pages? I guess that makes sense as the split could go
either way at the moment but I should add a check to make sure this isn't used
with pinned pages anyway.

- Alistair

> The code should be shared because nobody is going to remember about
> this corner case.
>
> Jason
>




2021-03-31 13:20:29

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] mm: Device exclusive memory access

On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote:

> I guess that makes sense as the split could go either way at the
> moment but I should add a check to make sure this isn't used with
> pinned pages anyway.

Is it possible to have a pinned page under one of these things? If I
pin it before you migrate it then it remains pinned but hidden under
the swap entry?

So the special logic is needed and the pinned page has to be copied
and written as a normal pte, not dropped as a migration entry

Jason

2021-03-31 13:47:27

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] mm: Device exclusive memory access

On Thursday, 1 April 2021 12:18:54 AM AEDT Jason Gunthorpe wrote:
> On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote:
>
> > I guess that makes sense as the split could go either way at the
> > moment but I should add a check to make sure this isn't used with
> > pinned pages anyway.
>
> Is it possible to have a pinned page under one of these things? If I
> pin it before you migrate it then it remains pinned but hidden under
> the swap entry?

At the moment yes. But I had planned (and this reminded me) to add a check to
prevent marking pinned pages for exclusive access. This check was in the
original migration based implementation as I don't think it makes much sense
to allow exclusive access to pinned pages given it indicates another device is
possibly using it.

> So the special logic is needed and the pinned page has to be copied
> and written as a normal pte, not dropped as a migration entry

Yep, if we end up allowing pinned pages to exist under these then that makes
sense. Thanks for the clarification.

- Alistair

> Jason
>



2021-03-31 13:47:41

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] mm: Device exclusive memory access

On Thu, Apr 01, 2021 at 12:27:52AM +1100, Alistair Popple wrote:
> On Thursday, 1 April 2021 12:18:54 AM AEDT Jason Gunthorpe wrote:
> > On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote:
> >
> > > I guess that makes sense as the split could go either way at the
> > > moment but I should add a check to make sure this isn't used with
> > > pinned pages anyway.
> >
> > Is it possible to have a pinned page under one of these things? If I
> > pin it before you migrate it then it remains pinned but hidden under
> > the swap entry?
>
> At the moment yes. But I had planned (and this reminded me) to add a check to
> prevent marking pinned pages for exclusive access.

How do you even do that without races with GUP fast?

Jason

2021-04-01 00:47:59

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] mm: Device exclusive memory access

On Thursday, 1 April 2021 12:46:04 AM AEDT Jason Gunthorpe wrote:
> On Thu, Apr 01, 2021 at 12:27:52AM +1100, Alistair Popple wrote:
> > On Thursday, 1 April 2021 12:18:54 AM AEDT Jason Gunthorpe wrote:
> > > On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote:
> > >
> > > > I guess that makes sense as the split could go either way at the
> > > > moment but I should add a check to make sure this isn't used with
> > > > pinned pages anyway.
> > >
> > > Is it possible to have a pinned page under one of these things? If I
> > > pin it before you migrate it then it remains pinned but hidden under
> > > the swap entry?
> >
> > At the moment yes. But I had planned (and this reminded me) to add a check
to
> > prevent marking pinned pages for exclusive access.
>
> How do you even do that without races with GUP fast?

Unless I've missed something I think I've convinced myself it should be safe
to do the pin check after make_device_exclusive() has replaced all the PTEs
with exclusive entries.

GUP fast sequence:
1. Read PTE
2. Pin page
3. Check PTE
4. if PTE changed -> unpin and fallback

If make_device_exclusive() runs after (1) it will either succeed or see the
pin from (2) and fail (as desired). GUP should always see the PTE change and
fallback which will revoke the exclusive access.

- Alistair

> Jason
>




2021-04-01 00:50:11

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] mm: Device exclusive memory access

On Thu, Apr 01, 2021 at 11:45:57AM +1100, Alistair Popple wrote:
> On Thursday, 1 April 2021 12:46:04 AM AEDT Jason Gunthorpe wrote:
> > On Thu, Apr 01, 2021 at 12:27:52AM +1100, Alistair Popple wrote:
> > > On Thursday, 1 April 2021 12:18:54 AM AEDT Jason Gunthorpe wrote:
> > > > On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote:
> > > >
> > > > > I guess that makes sense as the split could go either way at the
> > > > > moment but I should add a check to make sure this isn't used with
> > > > > pinned pages anyway.
> > > >
> > > > Is it possible to have a pinned page under one of these things? If I
> > > > pin it before you migrate it then it remains pinned but hidden under
> > > > the swap entry?
> > >
> > > At the moment yes. But I had planned (and this reminded me) to add a check
> to
> > > prevent marking pinned pages for exclusive access.
> >
> > How do you even do that without races with GUP fast?
>
> Unless I've missed something I think I've convinced myself it should be safe
> to do the pin check after make_device_exclusive() has replaced all the PTEs
> with exclusive entries.
>
> GUP fast sequence:
> 1. Read PTE
> 2. Pin page
> 3. Check PTE
> 4. if PTE changed -> unpin and fallback
>
> If make_device_exclusive() runs after (1) it will either succeed or see the
> pin from (2) and fail (as desired). GUP should always see the PTE change and
> fallback which will revoke the exclusive access.

AFAICT the user can trigger fork at that instant and fork will try to
copy the desposited migration entry before it has been checked

Jason

2021-04-01 02:24:58

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] mm: Device exclusive memory access

On Thursday, 1 April 2021 11:48:13 AM AEDT Jason Gunthorpe wrote:
> On Thu, Apr 01, 2021 at 11:45:57AM +1100, Alistair Popple wrote:
> > On Thursday, 1 April 2021 12:46:04 AM AEDT Jason Gunthorpe wrote:
> > > On Thu, Apr 01, 2021 at 12:27:52AM +1100, Alistair Popple wrote:
> > > > On Thursday, 1 April 2021 12:18:54 AM AEDT Jason Gunthorpe wrote:
> > > > > On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote:
> > > > >
> > > > > > I guess that makes sense as the split could go either way at the
> > > > > > moment but I should add a check to make sure this isn't used with
> > > > > > pinned pages anyway.
> > > > >
> > > > > Is it possible to have a pinned page under one of these things? If I
> > > > > pin it before you migrate it then it remains pinned but hidden under
> > > > > the swap entry?
> > > >
> > > > At the moment yes. But I had planned (and this reminded me) to add a
check
> > to
> > > > prevent marking pinned pages for exclusive access.
> > >
> > > How do you even do that without races with GUP fast?
> >
> > Unless I've missed something I think I've convinced myself it should be
safe
> > to do the pin check after make_device_exclusive() has replaced all the
PTEs
> > with exclusive entries.
> >
> > GUP fast sequence:
> > 1. Read PTE
> > 2. Pin page
> > 3. Check PTE
> > 4. if PTE changed -> unpin and fallback
> >
> > If make_device_exclusive() runs after (1) it will either succeed or see
the
> > pin from (2) and fail (as desired). GUP should always see the PTE change
and
> > fallback which will revoke the exclusive access.
>
> AFAICT the user can trigger fork at that instant and fork will try to
> copy the desposited migration entry before it has been checked

In that case the child will get a read-only exclusive entry and eventually a
page copy via do_wp_page() and GUP will fallback (or fail in the case of fast
only) so the parent's exclusive entry will get removed before the page can be
pinned and therefore shouldn't split the wrong way.

But that is sounding rather complex, and I am not convinced I haven't missed a
corner case. It also seems like it shouldn't be necessary to copy exclusive
entries anyway. I could just remove them and restore the original entry, which
would be far simpler.

> Jason
>




2021-04-01 04:38:02

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

On Wednesday, 31 March 2021 10:57:46 PM AEDT Jason Gunthorpe wrote:
> On Wed, Mar 31, 2021 at 03:15:47PM +1100, Alistair Popple wrote:
> > On Wednesday, 31 March 2021 2:56:38 PM AEDT John Hubbard wrote:
> > > On 3/30/21 3:56 PM, Alistair Popple wrote:
> > > ...
> > > >> +1 for renaming "munlock*" items to "mlock*", where applicable. good
> > grief.
> > > >
> > > > At least the situation was weird enough to prompt further
investigation :)
> > > >
> > > > Renaming to mlock* doesn't feel like the right solution to me either
> > though. I
> > > > am not sure if you saw me responding to myself earlier but I am
thinking
> > > > renaming try_to_munlock() -> page_mlocked() and try_to_munlock_one() -
>
> > > > page_mlock_one() might be better. Thoughts?
> > > >
> > >
> > > Quite confused by this naming idea. Because: try_to_munlock() returns
> > > void, so a boolean-style name such as "page_mlocked()" is already not a
> > > good fit.
> > >
> > > Even more important, though, is that try_to_munlock() is mlock-ing the
> > > page, right? Is there some subtle point I'm missing? It really is doing
> > > an mlock to the best of my knowledge here. Although the kerneldoc
> > > comment for try_to_munlock() seems questionable too:
> >
> > It's mlocking the page if it turns out it still needs to be locked after
> > unlocking it. But I don't think you're missing anything.
>
> It is really searching all VMA's to see if the VMA flag is set and if
> any are found then it mlocks the page.
>
> But presenting this rountine in its simplified form raises lots of
> questions:
>
> - What locking is being used to read the VMA flag?
> - Why do we need to manipulate global struct page flags under the
> page table locks of a single VMA?

I was wondering that and questioned it in an earlier version of this series. I
have done some digging and the commit log for b87537d9e2fe ("mm: rmap use pte
lock not mmap_sem to set PageMlocked") provides the original justification.

It's fairly long so I won't quote it here but the summary seems to be that
among other things the combination of page lock and ptl makes this safe. I
have yet to verify if everything there still holds and is sensible, but the
last paragraph certainly is :-)

"Stopped short of separating try_to_munlock_one() from try_to_munmap_one()
on this occasion, but that's probably the sensible next step - with a
rename, given that try_to_munlock()'s business is to try to set Mlocked."

> - Why do we need to check for huge pages inside the VMA loop, not
> before going to the rmap? PageTransCompoundHead() is not sensitive to
> the PTEs. (and what happens if the huge page breaks up concurrently?)
> - Why do we clear the mlock bit then run around to try and set it?

I don't have an answer for that as I'm not (yet) across all the mlock code
paths, but I'm hoping this patch at least won't change anything.

> Feels racey.
>
> Jason
>




2021-04-01 17:53:27

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] mm: Device exclusive memory access

On Thu, Apr 01, 2021 at 01:20:05PM +1100, Alistair Popple wrote:
> On Thursday, 1 April 2021 11:48:13 AM AEDT Jason Gunthorpe wrote:
> > On Thu, Apr 01, 2021 at 11:45:57AM +1100, Alistair Popple wrote:
> > > On Thursday, 1 April 2021 12:46:04 AM AEDT Jason Gunthorpe wrote:
> > > > On Thu, Apr 01, 2021 at 12:27:52AM +1100, Alistair Popple wrote:
> > > > > On Thursday, 1 April 2021 12:18:54 AM AEDT Jason Gunthorpe wrote:
> > > > > > On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote:
> > > > > >
> > > > > > > I guess that makes sense as the split could go either way at the
> > > > > > > moment but I should add a check to make sure this isn't used with
> > > > > > > pinned pages anyway.
> > > > > >
> > > > > > Is it possible to have a pinned page under one of these things? If I
> > > > > > pin it before you migrate it then it remains pinned but hidden under
> > > > > > the swap entry?
> > > > >
> > > > > At the moment yes. But I had planned (and this reminded me) to add a
> check
> > > to
> > > > > prevent marking pinned pages for exclusive access.
> > > >
> > > > How do you even do that without races with GUP fast?
> > >
> > > Unless I've missed something I think I've convinced myself it should be
> safe
> > > to do the pin check after make_device_exclusive() has replaced all the
> PTEs
> > > with exclusive entries.
> > >
> > > GUP fast sequence:
> > > 1. Read PTE
> > > 2. Pin page
> > > 3. Check PTE
> > > 4. if PTE changed -> unpin and fallback
> > >
> > > If make_device_exclusive() runs after (1) it will either succeed or see
> the
> > > pin from (2) and fail (as desired). GUP should always see the PTE change
> and
> > > fallback which will revoke the exclusive access.
> >
> > AFAICT the user can trigger fork at that instant and fork will try to
> > copy the desposited migration entry before it has been checked
>
> In that case the child will get a read-only exclusive entry and eventually a
> page copy via do_wp_page()

Having do_wp_page() do a copy is a security bug. We closed it with the
at-fork checks.

Jason

2021-04-01 19:30:45

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

CC: Hugh Dickins

On Wed, Mar 31, 2021 at 9:37 PM Alistair Popple <[email protected]> wrote:
>
> On Wednesday, 31 March 2021 10:57:46 PM AEDT Jason Gunthorpe wrote:
> > On Wed, Mar 31, 2021 at 03:15:47PM +1100, Alistair Popple wrote:
> > > On Wednesday, 31 March 2021 2:56:38 PM AEDT John Hubbard wrote:
> > > > On 3/30/21 3:56 PM, Alistair Popple wrote:
> > > > ...
> > > > >> +1 for renaming "munlock*" items to "mlock*", where applicable. good
> > > grief.
> > > > >
> > > > > At least the situation was weird enough to prompt further
> investigation :)
> > > > >
> > > > > Renaming to mlock* doesn't feel like the right solution to me either
> > > though. I
> > > > > am not sure if you saw me responding to myself earlier but I am
> thinking
> > > > > renaming try_to_munlock() -> page_mlocked() and try_to_munlock_one() -
> >
> > > > > page_mlock_one() might be better. Thoughts?
> > > > >
> > > >
> > > > Quite confused by this naming idea. Because: try_to_munlock() returns
> > > > void, so a boolean-style name such as "page_mlocked()" is already not a
> > > > good fit.
> > > >
> > > > Even more important, though, is that try_to_munlock() is mlock-ing the
> > > > page, right? Is there some subtle point I'm missing? It really is doing
> > > > an mlock to the best of my knowledge here. Although the kerneldoc
> > > > comment for try_to_munlock() seems questionable too:
> > >
> > > It's mlocking the page if it turns out it still needs to be locked after
> > > unlocking it. But I don't think you're missing anything.
> >
> > It is really searching all VMA's to see if the VMA flag is set and if
> > any are found then it mlocks the page.
> >
> > But presenting this rountine in its simplified form raises lots of
> > questions:
> >
> > - What locking is being used to read the VMA flag?
> > - Why do we need to manipulate global struct page flags under the
> > page table locks of a single VMA?
>
> I was wondering that and questioned it in an earlier version of this series. I
> have done some digging and the commit log for b87537d9e2fe ("mm: rmap use pte
> lock not mmap_sem to set PageMlocked") provides the original justification.
>
> It's fairly long so I won't quote it here but the summary seems to be that
> among other things the combination of page lock and ptl makes this safe. I
> have yet to verify if everything there still holds and is sensible, but the
> last paragraph certainly is :-)
>
> "Stopped short of separating try_to_munlock_one() from try_to_munmap_one()
> on this occasion, but that's probably the sensible next step - with a
> rename, given that try_to_munlock()'s business is to try to set Mlocked."
>
> > - Why do we need to check for huge pages inside the VMA loop, not
> > before going to the rmap? PageTransCompoundHead() is not sensitive to
> > the PTEs. (and what happens if the huge page breaks up concurrently?)
> > - Why do we clear the mlock bit then run around to try and set it?
>
> I don't have an answer for that as I'm not (yet) across all the mlock code
> paths, but I'm hoping this patch at least won't change anything.
>

It would be good to ask the person who has the most answers?

Hugh, the thread started at
https://lore.kernel.org/dri-devel/[email protected]/