2021-04-07 20:39:11

by Alistair Popple

Subject: [PATCH v8 0/8] Add support for SVM atomics in Nouveau

This is the eighth version of a series to add support to Nouveau for atomic
memory operations on OpenCL shared virtual memory (SVM) regions.

The main change for this version is a simplification of device exclusive
entry handling. Instead of copying device exclusive entries for
copy-on-write mappings during fork, they are now removed. This is safer
because copying has subtle corner cases, particularly for pinned pages,
which should follow the same logic as copy_present_page(). Removing the
entries avoids this possibility by treating them as normal ptes.

Exclusive device access is implemented by adding a new swap entry type
(SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry. The main
difference is that on fault the original entry is immediately restored by
the fault handler instead of waiting.

Restoring the entry triggers calls to MMU notifiers which allow a device
driver to revoke the atomic access permission from the GPU prior to the CPU
finalising the entry.
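
For illustration, below is a rough sketch of the driver-side flow this
enables, modelled on the Nouveau and HMM test driver changes later in the
series. It is not taken from the series itself: my_dev, my_gpu_map() and
my_gpu_unmap() are made-up placeholders, and the mmu_interval notifier
retry loop the real driver uses to handle races is omitted.

#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <linux/rmap.h>		/* make_device_exclusive_range() */

static bool my_invalidate(struct mmu_interval_notifier *mni,
			  const struct mmu_notifier_range *range,
			  unsigned long cur_seq)
{
	/* The CPU fault handler is restoring the original entry, so
	 * revoke the GPU's atomic access before it proceeds. */
	if (range->event == MMU_NOTIFY_EXCLUSIVE && range->owner == my_dev)
		my_gpu_unmap(range->start, range->end);
	mmu_interval_set_seq(mni, cur_seq);
	return true;
}

static int my_atomic_fault(struct mm_struct *mm, unsigned long addr)
{
	struct page *page = NULL;

	mmap_read_lock(mm);
	/* Replace the CPU pte with a device exclusive entry so that any
	 * userspace access faults until the entry is restored. */
	make_device_exclusive_range(mm, addr, addr + PAGE_SIZE, &page, my_dev);
	mmap_read_unlock(mm);
	if (!page)
		return -EFAULT;

	my_gpu_map(addr, page);		/* grant the GPU atomic access */
	unlock_page(page);
	put_page(page);
	return 0;
}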

Patches 1 & 2 refactor existing migration and device private entry
functions.

Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
functionality into separate functions - try_to_migrate_one() and
try_to_munlock_one(). These should not change any functionality, but any
help testing would be much appreciated as I have not been able to test
every usage of try_to_unmap_one().

Patch 5 contains the bulk of the implementation for device exclusive
memory.

Patch 6 contains some additions to the HMM selftests to ensure everything
works as expected.

Patch 7 is a cleanup for the Nouveau SVM implementation.

Patch 8 contains the implementation of atomic access for the Nouveau
driver.

This has been tested using the latest upstream Mesa userspace with a simple
OpenCL test program which checks the results of atomic GPU operations on an
SVM buffer whilst also writing to the same buffer from the CPU.

Alistair Popple (8):
mm: Remove special swap entry functions
mm/swapops: Rework swap entry manipulation code
mm/rmap: Split try_to_munlock from try_to_unmap
mm/rmap: Split migration into its own function
mm: Device exclusive memory access
mm: Selftests for exclusive device memory
nouveau/svm: Refactor nouveau_range_fault
nouveau/svm: Implement atomic SVM access

Documentation/vm/hmm.rst | 19 +-
Documentation/vm/unevictable-lru.rst | 33 +-
arch/s390/mm/pgtable.c | 2 +-
drivers/gpu/drm/nouveau/include/nvif/if000c.h | 1 +
drivers/gpu/drm/nouveau/nouveau_svm.c | 156 ++++-
drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h | 1 +
.../drm/nouveau/nvkm/subdev/mmu/vmmgp100.c | 6 +
fs/proc/task_mmu.c | 23 +-
include/linux/mmu_notifier.h | 26 +-
include/linux/rmap.h | 11 +-
include/linux/swap.h | 8 +-
include/linux/swapops.h | 123 ++--
lib/test_hmm.c | 126 +++-
lib/test_hmm_uapi.h | 2 +
mm/debug_vm_pgtable.c | 12 +-
mm/hmm.c | 12 +-
mm/huge_memory.c | 45 +-
mm/hugetlb.c | 10 +-
mm/memcontrol.c | 2 +-
mm/memory.c | 196 +++++-
mm/migrate.c | 51 +-
mm/mlock.c | 10 +-
mm/mprotect.c | 18 +-
mm/page_vma_mapped.c | 15 +-
mm/rmap.c | 612 +++++++++++++++---
tools/testing/selftests/vm/hmm-tests.c | 158 +++++
26 files changed, 1366 insertions(+), 312 deletions(-)

--
2.20.1


2021-04-07 20:39:45

by Alistair Popple

Subject: [PATCH v8 2/8] mm/swapops: Rework swap entry manipulation code

Both migration and device private pages use special swap entries that
are manipulated by a range of inline functions. The arguments to these
are somewhat inconsistent, so rework them to remove the flag-type
arguments and to make the arguments similar for both read and write
entry creation.
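
To illustrate, the typical call-site conversion performed by this patch
looks like the following (names as used in the hunks below):

	/* Before: a flag selects read vs. write, downgrades happen in place. */
	entry = make_migration_entry(subpage, pte_write(pteval));
	make_migration_entry_read(&entry);

	/* After: separate helpers take an offset and return the new entry. */
	if (pte_write(pteval))
		entry = make_writable_migration_entry(page_to_pfn(subpage));
	else
		entry = make_readable_migration_entry(page_to_pfn(subpage));
	entry = make_readable_migration_entry(swp_offset(entry)); /* downgrade */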

Signed-off-by: Alistair Popple <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Ralph Campbell <[email protected]>
---
include/linux/swapops.h | 56 ++++++++++++++++++++++-------------------
mm/debug_vm_pgtable.c | 12 ++++-----
mm/hmm.c | 2 +-
mm/huge_memory.c | 26 +++++++++++++------
mm/hugetlb.c | 10 +++++---
mm/memory.c | 10 +++++---
mm/migrate.c | 26 ++++++++++++++-----
mm/mprotect.c | 10 +++++---
mm/rmap.c | 10 +++++---
9 files changed, 100 insertions(+), 62 deletions(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 139be8235ad2..4dfd807ae52a 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -100,35 +100,35 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
}

#if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
-static inline swp_entry_t make_device_private_entry(struct page *page, bool write)
+static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
{
- return swp_entry(write ? SWP_DEVICE_WRITE : SWP_DEVICE_READ,
- page_to_pfn(page));
+ return swp_entry(SWP_DEVICE_READ, offset);
}

-static inline bool is_device_private_entry(swp_entry_t entry)
+static inline swp_entry_t make_writable_device_private_entry(pgoff_t offset)
{
- int type = swp_type(entry);
- return type == SWP_DEVICE_READ || type == SWP_DEVICE_WRITE;
+ return swp_entry(SWP_DEVICE_WRITE, offset);
}

-static inline void make_device_private_entry_read(swp_entry_t *entry)
+static inline bool is_device_private_entry(swp_entry_t entry)
{
- *entry = swp_entry(SWP_DEVICE_READ, swp_offset(*entry));
+ int type = swp_type(entry);
+ return type == SWP_DEVICE_READ || type == SWP_DEVICE_WRITE;
}

-static inline bool is_write_device_private_entry(swp_entry_t entry)
+static inline bool is_writable_device_private_entry(swp_entry_t entry)
{
return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
}
#else /* CONFIG_DEVICE_PRIVATE */
-static inline swp_entry_t make_device_private_entry(struct page *page, bool write)
+static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
{
return swp_entry(0, 0);
}

-static inline void make_device_private_entry_read(swp_entry_t *entry)
+static inline swp_entry_t make_writable_device_private_entry(pgoff_t offset)
{
+ return swp_entry(0, 0);
}

static inline bool is_device_private_entry(swp_entry_t entry)
@@ -136,35 +136,32 @@ static inline bool is_device_private_entry(swp_entry_t entry)
return false;
}

-static inline bool is_write_device_private_entry(swp_entry_t entry)
+static inline bool is_writable_device_private_entry(swp_entry_t entry)
{
return false;
}
#endif /* CONFIG_DEVICE_PRIVATE */

#ifdef CONFIG_MIGRATION
-static inline swp_entry_t make_migration_entry(struct page *page, int write)
-{
- BUG_ON(!PageLocked(compound_head(page)));
-
- return swp_entry(write ? SWP_MIGRATION_WRITE : SWP_MIGRATION_READ,
- page_to_pfn(page));
-}
-
static inline int is_migration_entry(swp_entry_t entry)
{
return unlikely(swp_type(entry) == SWP_MIGRATION_READ ||
swp_type(entry) == SWP_MIGRATION_WRITE);
}

-static inline int is_write_migration_entry(swp_entry_t entry)
+static inline int is_writable_migration_entry(swp_entry_t entry)
{
return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE);
}

-static inline void make_migration_entry_read(swp_entry_t *entry)
+static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
{
- *entry = swp_entry(SWP_MIGRATION_READ, swp_offset(*entry));
+ return swp_entry(SWP_MIGRATION_READ, offset);
+}
+
+static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
+{
+ return swp_entry(SWP_MIGRATION_WRITE, offset);
}

extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
@@ -174,21 +171,28 @@ extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
extern void migration_entry_wait_huge(struct vm_area_struct *vma,
struct mm_struct *mm, pte_t *pte);
#else
+static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
+{
+ return swp_entry(0, 0);
+}
+
+static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
+{
+ return swp_entry(0, 0);
+}

-#define make_migration_entry(page, write) swp_entry(0, 0)
static inline int is_migration_entry(swp_entry_t swp)
{
return 0;
}

-static inline void make_migration_entry_read(swp_entry_t *entryp) { }
static inline void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
spinlock_t *ptl) { }
static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
unsigned long address) { }
static inline void migration_entry_wait_huge(struct vm_area_struct *vma,
struct mm_struct *mm, pte_t *pte) { }
-static inline int is_write_migration_entry(swp_entry_t entry)
+static inline int is_writable_migration_entry(swp_entry_t entry)
{
return 0;
}
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index a9bd6ce1ba02..3697a80b32f8 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -817,17 +817,17 @@ static void __init swap_migration_tests(void)
* locked, otherwise it stumbles upon a BUG_ON().
*/
__SetPageLocked(page);
- swp = make_migration_entry(page, 1);
+ swp = make_writable_migration_entry(page_to_pfn(page));
WARN_ON(!is_migration_entry(swp));
- WARN_ON(!is_write_migration_entry(swp));
+ WARN_ON(!is_writable_migration_entry(swp));

- make_migration_entry_read(&swp);
+ swp = make_readable_migration_entry(swp_offset(swp));
WARN_ON(!is_migration_entry(swp));
- WARN_ON(is_write_migration_entry(swp));
+ WARN_ON(is_writable_migration_entry(swp));

- swp = make_migration_entry(page, 0);
+ swp = make_readable_migration_entry(page_to_pfn(page));
WARN_ON(!is_migration_entry(swp));
- WARN_ON(is_write_migration_entry(swp));
+ WARN_ON(is_writable_migration_entry(swp));
__ClearPageLocked(page);
__free_page(page);
}
diff --git a/mm/hmm.c b/mm/hmm.c
index 3b2dda71d0ed..11df3ca30b82 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -255,7 +255,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
*/
if (hmm_is_device_private_entry(range, entry)) {
cpu_flags = HMM_PFN_VALID;
- if (is_write_device_private_entry(entry))
+ if (is_writable_device_private_entry(entry))
cpu_flags |= HMM_PFN_WRITE;
*hmm_pfn = swp_offset(entry) | cpu_flags;
return 0;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a4cda8564bcf..89af065cea5b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1051,8 +1051,9 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
swp_entry_t entry = pmd_to_swp_entry(pmd);

VM_BUG_ON(!is_pmd_migration_entry(pmd));
- if (is_write_migration_entry(entry)) {
- make_migration_entry_read(&entry);
+ if (is_writable_migration_entry(entry)) {
+ entry = make_readable_migration_entry(
+ swp_offset(entry));
pmd = swp_entry_to_pmd(entry);
if (pmd_swp_soft_dirty(*src_pmd))
pmd = pmd_swp_mksoft_dirty(pmd);
@@ -1825,13 +1826,14 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
swp_entry_t entry = pmd_to_swp_entry(*pmd);

VM_BUG_ON(!is_pmd_migration_entry(*pmd));
- if (is_write_migration_entry(entry)) {
+ if (is_writable_migration_entry(entry)) {
pmd_t newpmd;
/*
* A protection check is difficult so
* just be safe and disable write
*/
- make_migration_entry_read(&entry);
+ entry = make_readable_migration_entry(
+ swp_offset(entry));
newpmd = swp_entry_to_pmd(entry);
if (pmd_swp_soft_dirty(*pmd))
newpmd = pmd_swp_mksoft_dirty(newpmd);
@@ -2109,7 +2111,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,

entry = pmd_to_swp_entry(old_pmd);
page = pfn_swap_entry_to_page(entry);
- write = is_write_migration_entry(entry);
+ write = is_writable_migration_entry(entry);
young = false;
soft_dirty = pmd_swp_soft_dirty(old_pmd);
uffd_wp = pmd_swp_uffd_wp(old_pmd);
@@ -2141,7 +2143,12 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
*/
if (freeze || pmd_migration) {
swp_entry_t swp_entry;
- swp_entry = make_migration_entry(page + i, write);
+ if (write)
+ swp_entry = make_writable_migration_entry(
+ page_to_pfn(page + i));
+ else
+ swp_entry = make_readable_migration_entry(
+ page_to_pfn(page + i));
entry = swp_entry_to_pte(swp_entry);
if (soft_dirty)
entry = pte_swp_mksoft_dirty(entry);
@@ -2998,7 +3005,10 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
if (pmd_dirty(pmdval))
set_page_dirty(page);
- entry = make_migration_entry(page, pmd_write(pmdval));
+ if (pmd_write(pmdval))
+ entry = make_writable_migration_entry(page_to_pfn(page));
+ else
+ entry = make_readable_migration_entry(page_to_pfn(page));
pmdswp = swp_entry_to_pmd(entry);
if (pmd_soft_dirty(pmdval))
pmdswp = pmd_swp_mksoft_dirty(pmdswp);
@@ -3024,7 +3034,7 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));
if (pmd_swp_soft_dirty(*pvmw->pmd))
pmde = pmd_mksoft_dirty(pmde);
- if (is_write_migration_entry(entry))
+ if (is_writable_migration_entry(entry))
pmde = maybe_pmd_mkwrite(pmde, vma);

flush_cache_range(vma, mmun_start, mmun_start + HPAGE_PMD_SIZE);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8fb42c6dd74b..59645169839b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3795,12 +3795,13 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
is_hugetlb_entry_hwpoisoned(entry))) {
swp_entry_t swp_entry = pte_to_swp_entry(entry);

- if (is_write_migration_entry(swp_entry) && cow) {
+ if (is_writable_migration_entry(swp_entry) && cow) {
/*
* COW mappings require pages in both
* parent and child to be set to read.
*/
- make_migration_entry_read(&swp_entry);
+ swp_entry = make_readable_migration_entry(
+ swp_offset(swp_entry));
entry = swp_entry_to_pte(swp_entry);
set_huge_swap_pte_at(src, addr, src_pte,
entry, sz);
@@ -4970,10 +4971,11 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
if (unlikely(is_hugetlb_entry_migration(pte))) {
swp_entry_t entry = pte_to_swp_entry(pte);

- if (is_write_migration_entry(entry)) {
+ if (is_writable_migration_entry(entry)) {
pte_t newpte;

- make_migration_entry_read(&entry);
+ entry = make_readable_migration_entry(
+ swp_offset(entry));
newpte = swp_entry_to_pte(entry);
set_huge_swap_pte_at(mm, address, ptep,
newpte, huge_page_size(h));
diff --git a/mm/memory.c b/mm/memory.c
index 1c98e3c1c2de..3a5705cfc891 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -734,13 +734,14 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,

rss[mm_counter(page)]++;

- if (is_write_migration_entry(entry) &&
+ if (is_writable_migration_entry(entry) &&
is_cow_mapping(vm_flags)) {
/*
* COW mappings require pages in both
* parent and child to be set to read.
*/
- make_migration_entry_read(&entry);
+ entry = make_readable_migration_entry(
+ swp_offset(entry));
pte = swp_entry_to_pte(entry);
if (pte_swp_soft_dirty(*src_pte))
pte = pte_swp_mksoft_dirty(pte);
@@ -771,9 +772,10 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
* when a device driver is involved (you cannot easily
* save and restore device driver state).
*/
- if (is_write_device_private_entry(entry) &&
+ if (is_writable_device_private_entry(entry) &&
is_cow_mapping(vm_flags)) {
- make_device_private_entry_read(&entry);
+ entry = make_readable_device_private_entry(
+ swp_offset(entry));
pte = swp_entry_to_pte(entry);
if (pte_swp_uffd_wp(*src_pte))
pte = pte_swp_mkuffd_wp(pte);
diff --git a/mm/migrate.c b/mm/migrate.c
index 600978d18750..b752543adb64 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -237,13 +237,18 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
* Recheck VMA as permissions can change since migration started
*/
entry = pte_to_swp_entry(*pvmw.pte);
- if (is_write_migration_entry(entry))
+ if (is_writable_migration_entry(entry))
pte = maybe_mkwrite(pte, vma);
else if (pte_swp_uffd_wp(*pvmw.pte))
pte = pte_mkuffd_wp(pte);

if (unlikely(is_device_private_page(new))) {
- entry = make_device_private_entry(new, pte_write(pte));
+ if (pte_write(pte))
+ entry = make_writable_device_private_entry(
+ page_to_pfn(new));
+ else
+ entry = make_readable_device_private_entry(
+ page_to_pfn(new));
pte = swp_entry_to_pte(entry);
if (pte_swp_soft_dirty(*pvmw.pte))
pte = pte_swp_mksoft_dirty(pte);
@@ -2451,7 +2456,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,

mpfn = migrate_pfn(page_to_pfn(page)) |
MIGRATE_PFN_MIGRATE;
- if (is_write_device_private_entry(entry))
+ if (is_writable_device_private_entry(entry))
mpfn |= MIGRATE_PFN_WRITE;
} else {
if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
@@ -2497,8 +2502,12 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
ptep_get_and_clear(mm, addr, ptep);

/* Setup special migration page table entry */
- entry = make_migration_entry(page, mpfn &
- MIGRATE_PFN_WRITE);
+ if (mpfn & MIGRATE_PFN_WRITE)
+ entry = make_writable_migration_entry(
+ page_to_pfn(page));
+ else
+ entry = make_readable_migration_entry(
+ page_to_pfn(page));
swp_pte = swp_entry_to_pte(entry);
if (pte_present(pte)) {
if (pte_soft_dirty(pte))
@@ -2971,7 +2980,12 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
if (is_device_private_page(page)) {
swp_entry_t swp_entry;

- swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE);
+ if (vma->vm_flags & VM_WRITE)
+ swp_entry = make_writable_device_private_entry(
+ page_to_pfn(page));
+ else
+ swp_entry = make_readable_device_private_entry(
+ page_to_pfn(page));
entry = swp_entry_to_pte(swp_entry);
}
} else {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94188df1ee55..f21b760ec809 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -143,23 +143,25 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
swp_entry_t entry = pte_to_swp_entry(oldpte);
pte_t newpte;

- if (is_write_migration_entry(entry)) {
+ if (is_writable_migration_entry(entry)) {
/*
* A protection check is difficult so
* just be safe and disable write
*/
- make_migration_entry_read(&entry);
+ entry = make_readable_migration_entry(
+ swp_offset(entry));
newpte = swp_entry_to_pte(entry);
if (pte_swp_soft_dirty(oldpte))
newpte = pte_swp_mksoft_dirty(newpte);
if (pte_swp_uffd_wp(oldpte))
newpte = pte_swp_mkuffd_wp(newpte);
- } else if (is_write_device_private_entry(entry)) {
+ } else if (is_writable_device_private_entry(entry)) {
/*
* We do not preserve soft-dirtiness. See
* copy_one_pte() for explanation.
*/
- make_device_private_entry_read(&entry);
+ entry = make_readable_device_private_entry(
+ swp_offset(entry));
newpte = swp_entry_to_pte(entry);
if (pte_swp_uffd_wp(oldpte))
newpte = pte_swp_mkuffd_wp(newpte);
diff --git a/mm/rmap.c b/mm/rmap.c
index b0fc27e77d6d..977e70803ed8 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1526,7 +1526,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
* pte. do_swap_page() will wait until the migration
* pte is removed and then restart fault handling.
*/
- entry = make_migration_entry(page, 0);
+ entry = make_readable_migration_entry(page_to_pfn(page));
swp_pte = swp_entry_to_pte(entry);

/*
@@ -1622,8 +1622,12 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
* pte. do_swap_page() will wait until the migration
* pte is removed and then restart fault handling.
*/
- entry = make_migration_entry(subpage,
- pte_write(pteval));
+ if (pte_write(pteval))
+ entry = make_writable_migration_entry(
+ page_to_pfn(subpage));
+ else
+ entry = make_readable_migration_entry(
+ page_to_pfn(subpage));
swp_pte = swp_entry_to_pte(entry);
if (pte_soft_dirty(pteval))
swp_pte = pte_swp_mksoft_dirty(swp_pte);
--
2.20.1

2021-04-07 20:40:12

by Alistair Popple

Subject: [PATCH v8 8/8] nouveau/svm: Implement atomic SVM access

Some NVIDIA GPUs do not support direct atomic access to system memory
via PCIe. Instead this must be emulated by granting the GPU exclusive
access to the memory. This is achieved by replacing CPU page table
entries with special swap entries that fault on userspace access.

The driver then grants the GPU permission to update the page undergoing
atomic access via the GPU page tables. When CPU access to the page is
required, a CPU fault is raised which calls into the device driver via
MMU notifiers to revoke the atomic access. The original page table
entries are then restored, allowing CPU access to proceed.
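
For reference, the core of nouveau_atomic_range_fault() below boils down
to the following pattern (condensed; the timeout and error handling are
omitted, so this fragment is illustrative rather than compilable on its own):

again:
	notifier_seq = mmu_interval_read_begin(&notifier->notifier);
	mmap_read_lock(mm);
	make_device_exclusive_range(mm, start, start + PAGE_SIZE, &page,
				    drm->dev);
	mmap_read_unlock(mm);

	mutex_lock(&svmm->mutex);
	if (mmu_interval_read_retry(&notifier->notifier, notifier_seq)) {
		/* An invalidation raced with us; drop the lock and retry. */
		mutex_unlock(&svmm->mutex);
		goto again;
	}
	/* Safe to program the GPU page tables while holding svmm->mutex. */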

Signed-off-by: Alistair Popple <[email protected]>

---

v7:
* Removed magic values for fault access levels
* Improved readability of fault comparison code

v4:
* Check that page table entries haven't changed before mapping on the
device
---
drivers/gpu/drm/nouveau/include/nvif/if000c.h | 1 +
drivers/gpu/drm/nouveau/nouveau_svm.c | 126 ++++++++++++++++--
drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h | 1 +
.../drm/nouveau/nvkm/subdev/mmu/vmmgp100.c | 6 +
4 files changed, 123 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/include/nvif/if000c.h b/drivers/gpu/drm/nouveau/include/nvif/if000c.h
index d6dd40f21eed..9c7ff56831c5 100644
--- a/drivers/gpu/drm/nouveau/include/nvif/if000c.h
+++ b/drivers/gpu/drm/nouveau/include/nvif/if000c.h
@@ -77,6 +77,7 @@ struct nvif_vmm_pfnmap_v0 {
#define NVIF_VMM_PFNMAP_V0_APER 0x00000000000000f0ULL
#define NVIF_VMM_PFNMAP_V0_HOST 0x0000000000000000ULL
#define NVIF_VMM_PFNMAP_V0_VRAM 0x0000000000000010ULL
+#define NVIF_VMM_PFNMAP_V0_A 0x0000000000000004ULL
#define NVIF_VMM_PFNMAP_V0_W 0x0000000000000002ULL
#define NVIF_VMM_PFNMAP_V0_V 0x0000000000000001ULL
#define NVIF_VMM_PFNMAP_V0_NONE 0x0000000000000000ULL
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index a195e48c9aee..81526d65b4e2 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -35,6 +35,7 @@
#include <linux/sched/mm.h>
#include <linux/sort.h>
#include <linux/hmm.h>
+#include <linux/rmap.h>

struct nouveau_svm {
struct nouveau_drm *drm;
@@ -67,6 +68,11 @@ struct nouveau_svm {
} buffer[1];
};

+#define FAULT_ACCESS_READ 0
+#define FAULT_ACCESS_WRITE 1
+#define FAULT_ACCESS_ATOMIC 2
+#define FAULT_ACCESS_PREFETCH 3
+
#define SVM_DBG(s,f,a...) NV_DEBUG((s)->drm, "svm: "f"\n", ##a)
#define SVM_ERR(s,f,a...) NV_WARN((s)->drm, "svm: "f"\n", ##a)

@@ -411,6 +417,24 @@ nouveau_svm_fault_cancel_fault(struct nouveau_svm *svm,
fault->client);
}

+static int
+nouveau_svm_fault_priority(u8 fault)
+{
+ switch (fault) {
+ case FAULT_ACCESS_PREFETCH:
+ return 0;
+ case FAULT_ACCESS_READ:
+ return 1;
+ case FAULT_ACCESS_WRITE:
+ return 2;
+ case FAULT_ACCESS_ATOMIC:
+ return 3;
+ default:
+ WARN_ON_ONCE(1);
+ return -1;
+ }
+}
+
static int
nouveau_svm_fault_cmp(const void *a, const void *b)
{
@@ -421,9 +445,8 @@ nouveau_svm_fault_cmp(const void *a, const void *b)
return ret;
if ((ret = (s64)fa->addr - fb->addr))
return ret;
- /*XXX: atomic? */
- return (fa->access == 0 || fa->access == 3) -
- (fb->access == 0 || fb->access == 3);
+ return nouveau_svm_fault_priority(fa->access) -
+ nouveau_svm_fault_priority(fb->access);
}

static void
@@ -487,6 +510,10 @@ static bool nouveau_svm_range_invalidate(struct mmu_interval_notifier *mni,
struct svm_notifier *sn =
container_of(mni, struct svm_notifier, notifier);

+ if (range->event == MMU_NOTIFY_EXCLUSIVE &&
+ range->owner == sn->svmm->vmm->cli->drm->dev)
+ return true;
+
/*
* serializes the update to mni->invalidate_seq done by caller and
* prevents invalidation of the PTE from progressing while HW is being
@@ -555,6 +582,71 @@ static void nouveau_hmm_convert_pfn(struct nouveau_drm *drm,
args->p.phys[0] |= NVIF_VMM_PFNMAP_V0_W;
}

+static int nouveau_atomic_range_fault(struct nouveau_svmm *svmm,
+ struct nouveau_drm *drm,
+ struct nouveau_pfnmap_args *args, u32 size,
+ struct svm_notifier *notifier)
+{
+ unsigned long timeout =
+ jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
+ struct mm_struct *mm = svmm->notifier.mm;
+ struct page *page;
+ unsigned long start = args->p.addr;
+ unsigned long notifier_seq;
+ int ret = 0;
+
+ ret = mmu_interval_notifier_insert(&notifier->notifier, mm,
+ args->p.addr, args->p.size,
+ &nouveau_svm_mni_ops);
+ if (ret)
+ return ret;
+
+ while (true) {
+ if (time_after(jiffies, timeout)) {
+ ret = -EBUSY;
+ goto out;
+ }
+
+ notifier_seq = mmu_interval_read_begin(&notifier->notifier);
+ mmap_read_lock(mm);
+ make_device_exclusive_range(mm, start, start + PAGE_SIZE,
+ &page, drm->dev);
+ mmap_read_unlock(mm);
+ if (!page) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ mutex_lock(&svmm->mutex);
+ if (!mmu_interval_read_retry(&notifier->notifier,
+ notifier_seq))
+ break;
+ mutex_unlock(&svmm->mutex);
+ }
+
+ /* Map the page on the GPU. */
+ args->p.page = 12;
+ args->p.size = PAGE_SIZE;
+ args->p.addr = start;
+ args->p.phys[0] = page_to_phys(page) |
+ NVIF_VMM_PFNMAP_V0_V |
+ NVIF_VMM_PFNMAP_V0_W |
+ NVIF_VMM_PFNMAP_V0_A |
+ NVIF_VMM_PFNMAP_V0_HOST;
+
+ svmm->vmm->vmm.object.client->super = true;
+ ret = nvif_object_ioctl(&svmm->vmm->vmm.object, args, size, NULL);
+ svmm->vmm->vmm.object.client->super = false;
+ mutex_unlock(&svmm->mutex);
+
+ unlock_page(page);
+ put_page(page);
+
+out:
+ mmu_interval_notifier_remove(&notifier->notifier);
+ return ret;
+}
+
static int nouveau_range_fault(struct nouveau_svmm *svmm,
struct nouveau_drm *drm,
struct nouveau_pfnmap_args *args, u32 size,
@@ -637,7 +729,7 @@ nouveau_svm_fault(struct nvif_notify *notify)
unsigned long hmm_flags;
u64 inst, start, limit;
int fi, fn;
- int replay = 0, ret;
+ int replay = 0, atomic = 0, ret;

/* Parse available fault buffer entries into a cache, and update
* the GET pointer so HW can reuse the entries.
@@ -718,12 +810,14 @@ nouveau_svm_fault(struct nvif_notify *notify)
/*
* Determine required permissions based on GPU fault
* access flags.
- * XXX: atomic?
*/
switch (buffer->fault[fi]->access) {
case 0: /* READ. */
hmm_flags = HMM_PFN_REQ_FAULT;
break;
+ case 2: /* ATOMIC. */
+ atomic = true;
+ break;
case 3: /* PREFETCH. */
hmm_flags = 0;
break;
@@ -739,8 +833,14 @@ nouveau_svm_fault(struct nvif_notify *notify)
}

notifier.svmm = svmm;
- ret = nouveau_range_fault(svmm, svm->drm, &args.i,
- sizeof(args), hmm_flags, &notifier);
+ if (atomic)
+ ret = nouveau_atomic_range_fault(svmm, svm->drm,
+ &args.i, sizeof(args),
+ &notifier);
+ else
+ ret = nouveau_range_fault(svmm, svm->drm, &args.i,
+ sizeof(args), hmm_flags,
+ &notifier);
mmput(mm);

limit = args.i.p.addr + args.i.p.size;
@@ -756,11 +856,15 @@ nouveau_svm_fault(struct nvif_notify *notify)
*/
if (buffer->fault[fn]->svmm != svmm ||
buffer->fault[fn]->addr >= limit ||
- (buffer->fault[fi]->access == 0 /* READ. */ &&
+ (buffer->fault[fi]->access == FAULT_ACCESS_READ &&
!(args.phys[0] & NVIF_VMM_PFNMAP_V0_V)) ||
- (buffer->fault[fi]->access != 0 /* READ. */ &&
- buffer->fault[fi]->access != 3 /* PREFETCH. */ &&
- !(args.phys[0] & NVIF_VMM_PFNMAP_V0_W)))
+ (buffer->fault[fi]->access != FAULT_ACCESS_READ &&
+ buffer->fault[fi]->access != FAULT_ACCESS_PREFETCH &&
+ !(args.phys[0] & NVIF_VMM_PFNMAP_V0_W)) ||
+ (buffer->fault[fi]->access != FAULT_ACCESS_READ &&
+ buffer->fault[fi]->access != FAULT_ACCESS_WRITE &&
+ buffer->fault[fi]->access != FAULT_ACCESS_PREFETCH &&
+ !(args.phys[0] & NVIF_VMM_PFNMAP_V0_A)))
break;
}

diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
index a2b179568970..f6188aa9171c 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
@@ -178,6 +178,7 @@ void nvkm_vmm_unmap_region(struct nvkm_vmm *, struct nvkm_vma *);
#define NVKM_VMM_PFN_APER 0x00000000000000f0ULL
#define NVKM_VMM_PFN_HOST 0x0000000000000000ULL
#define NVKM_VMM_PFN_VRAM 0x0000000000000010ULL
+#define NVKM_VMM_PFN_A 0x0000000000000004ULL
#define NVKM_VMM_PFN_W 0x0000000000000002ULL
#define NVKM_VMM_PFN_V 0x0000000000000001ULL
#define NVKM_VMM_PFN_NONE 0x0000000000000000ULL
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c
index 236db5570771..f02abd9cb4dd 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c
@@ -88,6 +88,9 @@ gp100_vmm_pgt_pfn(struct nvkm_vmm *vmm, struct nvkm_mmu_pt *pt,
if (!(*map->pfn & NVKM_VMM_PFN_W))
data |= BIT_ULL(6); /* RO. */

+ if (!(*map->pfn & NVKM_VMM_PFN_A))
+ data |= BIT_ULL(7); /* Atomic disable. */
+
if (!(*map->pfn & NVKM_VMM_PFN_VRAM)) {
addr = *map->pfn >> NVKM_VMM_PFN_ADDR_SHIFT;
addr = dma_map_page(dev, pfn_to_page(addr), 0,
@@ -322,6 +325,9 @@ gp100_vmm_pd0_pfn(struct nvkm_vmm *vmm, struct nvkm_mmu_pt *pt,
if (!(*map->pfn & NVKM_VMM_PFN_W))
data |= BIT_ULL(6); /* RO. */

+ if (!(*map->pfn & NVKM_VMM_PFN_A))
+ data |= BIT_ULL(7); /* Atomic disable. */
+
if (!(*map->pfn & NVKM_VMM_PFN_VRAM)) {
addr = *map->pfn >> NVKM_VMM_PFN_ADDR_SHIFT;
addr = dma_map_page(dev, pfn_to_page(addr), 0,
--
2.20.1

2021-04-07 20:40:18

by Alistair Popple

Subject: [PATCH v8 4/8] mm/rmap: Split migration into its own function

Migration is currently implemented as a mode of operation for
try_to_unmap_one(), generally specified by passing the TTU_MIGRATION flag
or, in the case of splitting a huge anonymous page, TTU_SPLIT_FREEZE.

However it does not have much in common with the rest of the unmap
functionality of try_to_unmap_one(), so splitting it into a separate
function reduces the complexity of try_to_unmap_one() and makes it more
readable.

Several simplifications can also be made in try_to_migrate_one() based
on the following observations:

- All users of TTU_MIGRATION also set TTU_IGNORE_MLOCK.
- No users of TTU_MIGRATION ever set TTU_IGNORE_HWPOISON.
- No users of TTU_MIGRATION ever set TTU_BATCH_FLUSH.

TTU_SPLIT_FREEZE is a special case of migration used when splitting an
anonymous page. This is most easily dealt with by calling the correct
function from unmap_page() in mm/huge_memory.c - either
try_to_migrate() for PageAnon or try_to_unmap().
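
As a result, callers change along these lines (taken from the mm/migrate.c
and mm/huge_memory.c hunks below):

	/* Before: migration was a mode of try_to_unmap(). */
	try_to_unmap(page, TTU_MIGRATION | TTU_IGNORE_MLOCK);

	/* After: a dedicated entry point which always ignores mlock. */
	try_to_migrate(page, 0);

	/* unmap_page() now picks the right function when splitting a THP. */
	if (PageAnon(page))
		unmap_success = try_to_migrate(page, ttu_flags);
	else
		unmap_success = try_to_unmap(page, ttu_flags | TTU_IGNORE_MLOCK);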

Signed-off-by: Alistair Popple <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Ralph Campbell <[email protected]>

---

v5:
* Added comments about how PMD splitting works for migration vs.
unmapping
* Tightened up the flag check in try_to_migrate() to be explicit about
which TTU_XXX flags are supported.
---
include/linux/rmap.h | 4 +-
mm/huge_memory.c | 15 +-
mm/migrate.c | 9 +-
mm/rmap.c | 358 ++++++++++++++++++++++++++++++++-----------
4 files changed, 280 insertions(+), 106 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 38a746787c2f..0e25d829f742 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -86,8 +86,6 @@ struct anon_vma_chain {
};

enum ttu_flags {
- TTU_MIGRATION = 0x1, /* migration mode */
-
TTU_SPLIT_HUGE_PMD = 0x4, /* split huge PMD if any */
TTU_IGNORE_MLOCK = 0x8, /* ignore mlock */
TTU_IGNORE_HWPOISON = 0x20, /* corrupted page is recoverable */
@@ -96,7 +94,6 @@ enum ttu_flags {
* do a final flush if necessary */
TTU_RMAP_LOCKED = 0x80, /* do not grab rmap lock:
* caller holds it */
- TTU_SPLIT_FREEZE = 0x100, /* freeze pte under splitting thp */
};

#ifdef CONFIG_MMU
@@ -193,6 +190,7 @@ static inline void page_dup_rmap(struct page *page, bool compound)
int page_referenced(struct page *, int is_locked,
struct mem_cgroup *memcg, unsigned long *vm_flags);

+bool try_to_migrate(struct page *page, enum ttu_flags flags);
bool try_to_unmap(struct page *, enum ttu_flags flags);

/* Avoid racy checks */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 89af065cea5b..eab004331b97 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2357,16 +2357,21 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,

static void unmap_page(struct page *page)
{
- enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK |
- TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
+ enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
bool unmap_success;

VM_BUG_ON_PAGE(!PageHead(page), page);

if (PageAnon(page))
- ttu_flags |= TTU_SPLIT_FREEZE;
-
- unmap_success = try_to_unmap(page, ttu_flags);
+ unmap_success = try_to_migrate(page, ttu_flags);
+ else
+ /*
+ * Don't install migration entries for file backed pages. This
+ * helps handle cases when i_size is in the middle of the page
+ * as there is no need to unmap pages beyond i_size manually.
+ */
+ unmap_success = try_to_unmap(page, ttu_flags |
+ TTU_IGNORE_MLOCK);
VM_BUG_ON_PAGE(!unmap_success, page);
}

diff --git a/mm/migrate.c b/mm/migrate.c
index b752543adb64..cc4612e2a246 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1130,7 +1130,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
/* Establish migration ptes */
VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma,
page);
- try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK);
+ try_to_migrate(page, 0);
page_was_mapped = 1;
}

@@ -1332,7 +1332,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,

if (page_mapped(hpage)) {
bool mapping_locked = false;
- enum ttu_flags ttu = TTU_MIGRATION|TTU_IGNORE_MLOCK;
+ enum ttu_flags ttu = 0;

if (!PageAnon(hpage)) {
/*
@@ -1349,7 +1349,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
ttu |= TTU_RMAP_LOCKED;
}

- try_to_unmap(hpage, ttu);
+ try_to_migrate(hpage, ttu);
page_was_mapped = 1;

if (mapping_locked)
@@ -2756,7 +2756,6 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
*/
static void migrate_vma_unmap(struct migrate_vma *migrate)
{
- int flags = TTU_MIGRATION | TTU_IGNORE_MLOCK;
const unsigned long npages = migrate->npages;
const unsigned long start = migrate->start;
unsigned long addr, i, restore = 0;
@@ -2768,7 +2767,7 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
continue;

if (page_mapped(page)) {
- try_to_unmap(page, flags);
+ try_to_migrate(page, 0);
if (page_mapped(page))
goto restore;
}
diff --git a/mm/rmap.c b/mm/rmap.c
index f09d522725b9..7f91f058f1f5 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1405,14 +1405,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
struct mmu_notifier_range range;
enum ttu_flags flags = (enum ttu_flags)(long)arg;

- if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
- is_zone_device_page(page) && !is_device_private_page(page))
- return true;
-
- if (flags & TTU_SPLIT_HUGE_PMD) {
- split_huge_pmd_address(vma, address,
- flags & TTU_SPLIT_FREEZE, page);
- }
+ if (flags & TTU_SPLIT_HUGE_PMD)
+ split_huge_pmd_address(vma, address, false, page);

/*
* For THP, we have to assume the worse case ie pmd for invalidation.
@@ -1436,16 +1430,6 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
mmu_notifier_invalidate_range_start(&range);

while (page_vma_mapped_walk(&pvmw)) {
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
- /* PMD-mapped THP migration entry */
- if (!pvmw.pte && (flags & TTU_MIGRATION)) {
- VM_BUG_ON_PAGE(PageHuge(page) || !PageTransCompound(page), page);
-
- set_pmd_migration_entry(&pvmw, page);
- continue;
- }
-#endif
-
/*
* If the page is mlock()d, we cannot swap it out.
* If it's recently referenced (perhaps page_referenced
@@ -1507,46 +1491,6 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
}
}

- if (IS_ENABLED(CONFIG_MIGRATION) &&
- (flags & TTU_MIGRATION) &&
- is_zone_device_page(page)) {
- swp_entry_t entry;
- pte_t swp_pte;
-
- pteval = ptep_get_and_clear(mm, pvmw.address, pvmw.pte);
-
- /*
- * Store the pfn of the page in a special migration
- * pte. do_swap_page() will wait until the migration
- * pte is removed and then restart fault handling.
- */
- entry = make_readable_migration_entry(page_to_pfn(page));
- swp_pte = swp_entry_to_pte(entry);
-
- /*
- * pteval maps a zone device page and is therefore
- * a swap pte.
- */
- if (pte_swp_soft_dirty(pteval))
- swp_pte = pte_swp_mksoft_dirty(swp_pte);
- if (pte_swp_uffd_wp(pteval))
- swp_pte = pte_swp_mkuffd_wp(swp_pte);
- set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
- /*
- * No need to invalidate here it will synchronize on
- * against the special swap migration pte.
- *
- * The assignment to subpage above was computed from a
- * swap PTE which results in an invalid pointer.
- * Since only PAGE_SIZE pages can currently be
- * migrated, just set it to page. This will need to be
- * changed when hugepage migrations to device private
- * memory are supported.
- */
- subpage = page;
- goto discard;
- }
-
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
if (should_defer_flush(mm, flags)) {
@@ -1599,39 +1543,6 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
/* We have to invalidate as we cleared the pte */
mmu_notifier_invalidate_range(mm, address,
address + PAGE_SIZE);
- } else if (IS_ENABLED(CONFIG_MIGRATION) &&
- (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
- swp_entry_t entry;
- pte_t swp_pte;
-
- if (arch_unmap_one(mm, vma, address, pteval) < 0) {
- set_pte_at(mm, address, pvmw.pte, pteval);
- ret = false;
- page_vma_mapped_walk_done(&pvmw);
- break;
- }
-
- /*
- * Store the pfn of the page in a special migration
- * pte. do_swap_page() will wait until the migration
- * pte is removed and then restart fault handling.
- */
- if (pte_write(pteval))
- entry = make_writable_migration_entry(
- page_to_pfn(subpage));
- else
- entry = make_readable_migration_entry(
- page_to_pfn(subpage));
- swp_pte = swp_entry_to_pte(entry);
- if (pte_soft_dirty(pteval))
- swp_pte = pte_swp_mksoft_dirty(swp_pte);
- if (pte_uffd_wp(pteval))
- swp_pte = pte_swp_mkuffd_wp(swp_pte);
- set_pte_at(mm, address, pvmw.pte, swp_pte);
- /*
- * No need to invalidate here it will synchronize on
- * against the special swap migration pte.
- */
} else if (PageAnon(page)) {
swp_entry_t entry = { .val = page_private(subpage) };
pte_t swp_pte;
@@ -1758,6 +1669,268 @@ bool try_to_unmap(struct page *page, enum ttu_flags flags)
.anon_lock = page_lock_anon_vma_read,
};

+ if (flags & TTU_RMAP_LOCKED)
+ rmap_walk_locked(page, &rwc);
+ else
+ rmap_walk(page, &rwc);
+
+ return !page_mapcount(page) ? true : false;
+}
+
+/*
+ * @arg: enum ttu_flags will be passed to this argument.
+ *
+ * If TTU_SPLIT_HUGE_PMD is specified any PMD mappings will be split into PTEs
+ * containing migration entries. This and TTU_RMAP_LOCKED are the only supported
+ * flags.
+ */
+static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
+ unsigned long address, void *arg)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct page_vma_mapped_walk pvmw = {
+ .page = page,
+ .vma = vma,
+ .address = address,
+ };
+ pte_t pteval;
+ struct page *subpage;
+ bool ret = true;
+ struct mmu_notifier_range range;
+ enum ttu_flags flags = (enum ttu_flags)(long)arg;
+
+ if (is_zone_device_page(page) && !is_device_private_page(page))
+ return true;
+
+ /*
+ * unmap_page() in mm/huge_memory.c is the only user of migration with
+ * TTU_SPLIT_HUGE_PMD and it wants to freeze.
+ */
+ if (flags & TTU_SPLIT_HUGE_PMD)
+ split_huge_pmd_address(vma, address, true, page);
+
+ /*
+ * For THP, we have to assume the worse case ie pmd for invalidation.
+ * For hugetlb, it could be much worse if we need to do pud
+ * invalidation in the case of pmd sharing.
+ *
+ * Note that the page can not be free in this function as call of
+ * try_to_unmap() must hold a reference on the page.
+ */
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
+ address,
+ min(vma->vm_end, address + page_size(page)));
+ if (PageHuge(page)) {
+ /*
+ * If sharing is possible, start and end will be adjusted
+ * accordingly.
+ */
+ adjust_range_if_pmd_sharing_possible(vma, &range.start,
+ &range.end);
+ }
+ mmu_notifier_invalidate_range_start(&range);
+
+ while (page_vma_mapped_walk(&pvmw)) {
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+ /* PMD-mapped THP migration entry */
+ if (!pvmw.pte) {
+ VM_BUG_ON_PAGE(PageHuge(page) ||
+ !PageTransCompound(page), page);
+
+ set_pmd_migration_entry(&pvmw, page);
+ continue;
+ }
+#endif
+
+ /* Unexpected PMD-mapped THP? */
+ VM_BUG_ON_PAGE(!pvmw.pte, page);
+
+ subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
+ address = pvmw.address;
+
+ if (PageHuge(page) && !PageAnon(page)) {
+ /*
+ * To call huge_pmd_unshare, i_mmap_rwsem must be
+ * held in write mode. Caller needs to explicitly
+ * do this outside rmap routines.
+ */
+ VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
+ if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) {
+ /*
+ * huge_pmd_unshare unmapped an entire PMD
+ * page. There is no way of knowing exactly
+ * which PMDs may be cached for this mm, so
+ * we must flush them all. start/end were
+ * already adjusted above to cover this range.
+ */
+ flush_cache_range(vma, range.start, range.end);
+ flush_tlb_range(vma, range.start, range.end);
+ mmu_notifier_invalidate_range(mm, range.start,
+ range.end);
+
+ /*
+ * The ref count of the PMD page was dropped
+ * which is part of the way map counting
+ * is done for shared PMDs. Return 'true'
+ * here. When there is no other sharing,
+ * huge_pmd_unshare returns false and we will
+ * unmap the actual page and drop map count
+ * to zero.
+ */
+ page_vma_mapped_walk_done(&pvmw);
+ break;
+ }
+ }
+
+ /* Nuke the page table entry. */
+ flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
+ pteval = ptep_clear_flush(vma, address, pvmw.pte);
+
+ /* Move the dirty bit to the page. Now the pte is gone. */
+ if (pte_dirty(pteval))
+ set_page_dirty(page);
+
+ /* Update high watermark before we lower rss */
+ update_hiwater_rss(mm);
+
+ if (is_zone_device_page(page)) {
+ swp_entry_t entry;
+ pte_t swp_pte;
+
+ /*
+ * Store the pfn of the page in a special migration
+ * pte. do_swap_page() will wait until the migration
+ * pte is removed and then restart fault handling.
+ */
+ entry = make_readable_migration_entry(
+ page_to_pfn(page));
+ swp_pte = swp_entry_to_pte(entry);
+
+ /*
+ * pteval maps a zone device page and is therefore
+ * a swap pte.
+ */
+ if (pte_swp_soft_dirty(pteval))
+ swp_pte = pte_swp_mksoft_dirty(swp_pte);
+ if (pte_swp_uffd_wp(pteval))
+ swp_pte = pte_swp_mkuffd_wp(swp_pte);
+ set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
+ /*
+ * No need to invalidate here it will synchronize on
+ * against the special swap migration pte.
+ *
+ * The assignment to subpage above was computed from a
+ * swap PTE which results in an invalid pointer.
+ * Since only PAGE_SIZE pages can currently be
+ * migrated, just set it to page. This will need to be
+ * changed when hugepage migrations to device private
+ * memory are supported.
+ */
+ subpage = page;
+ } else if (PageHWPoison(page)) {
+ pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
+ if (PageHuge(page)) {
+ hugetlb_count_sub(compound_nr(page), mm);
+ set_huge_swap_pte_at(mm, address,
+ pvmw.pte, pteval,
+ vma_mmu_pagesize(vma));
+ } else {
+ dec_mm_counter(mm, mm_counter(page));
+ set_pte_at(mm, address, pvmw.pte, pteval);
+ }
+
+ } else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
+ /*
+ * The guest indicated that the page content is of no
+ * interest anymore. Simply discard the pte, vmscan
+ * will take care of the rest.
+ * A future reference will then fault in a new zero
+ * page. When userfaultfd is active, we must not drop
+ * this page though, as its main user (postcopy
+ * migration) will not expect userfaults on already
+ * copied pages.
+ */
+ dec_mm_counter(mm, mm_counter(page));
+ /* We have to invalidate as we cleared the pte */
+ mmu_notifier_invalidate_range(mm, address,
+ address + PAGE_SIZE);
+ } else {
+ swp_entry_t entry;
+ pte_t swp_pte;
+
+ if (arch_unmap_one(mm, vma, address, pteval) < 0) {
+ set_pte_at(mm, address, pvmw.pte, pteval);
+ ret = false;
+ page_vma_mapped_walk_done(&pvmw);
+ break;
+ }
+
+ /*
+ * Store the pfn of the page in a special migration
+ * pte. do_swap_page() will wait until the migration
+ * pte is removed and then restart fault handling.
+ */
+ if (pte_write(pteval))
+ entry = make_writable_migration_entry(
+ page_to_pfn(subpage));
+ else
+ entry = make_readable_migration_entry(
+ page_to_pfn(subpage));
+
+ swp_pte = swp_entry_to_pte(entry);
+ if (pte_soft_dirty(pteval))
+ swp_pte = pte_swp_mksoft_dirty(swp_pte);
+ if (pte_uffd_wp(pteval))
+ swp_pte = pte_swp_mkuffd_wp(swp_pte);
+ set_pte_at(mm, address, pvmw.pte, swp_pte);
+ /*
+ * No need to invalidate here it will synchronize on
+ * against the special swap migration pte.
+ */
+ }
+
+ /*
+ * No need to call mmu_notifier_invalidate_range() it has be
+ * done above for all cases requiring it to happen under page
+ * table lock before mmu_notifier_invalidate_range_end()
+ *
+ * See Documentation/vm/mmu_notifier.rst
+ */
+ page_remove_rmap(subpage, PageHuge(page));
+ put_page(page);
+ }
+
+ mmu_notifier_invalidate_range_end(&range);
+
+ return ret;
+}
+
+/**
+ * try_to_migrate - try to replace all page table mappings with swap entries
+ * @page: the page to replace page table entries for
+ * @flags: action and flags
+ *
+ * Tries to remove all the page table entries which are mapping this page and
+ * replace them with special swap entries. Caller must hold the page lock.
+ *
+ * If is successful, return true. Otherwise, false.
+ */
+bool try_to_migrate(struct page *page, enum ttu_flags flags)
+{
+ struct rmap_walk_control rwc = {
+ .rmap_one = try_to_migrate_one,
+ .arg = (void *)flags,
+ .done = page_not_mapped,
+ .anon_lock = page_lock_anon_vma_read,
+ };
+
+ /*
+ * Migration always ignores mlock and only supports TTU_RMAP_LOCKED and
+ * TTU_SPLIT_HUGE_PMD flags.
+ */
+ if (WARN_ON_ONCE(flags & ~(TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD)))
+ return false;
+
/*
* During exec, a temporary VMA is setup and later moved.
* The VMA is moved under the anon_vma lock but not the
@@ -1766,8 +1939,7 @@ bool try_to_unmap(struct page *page, enum ttu_flags flags)
* locking requirements of exec(), migration skips
* temporary VMAs until after exec() completes.
*/
- if ((flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))
- && !PageKsm(page) && PageAnon(page))
+ if (!PageKsm(page) && PageAnon(page))
rwc.invalid_vma = invalid_migration_vma;

if (flags & TTU_RMAP_LOCKED)
--
2.20.1

2021-04-07 20:40:52

by Alistair Popple

Subject: [PATCH v8 6/8] mm: Selftests for exclusive device memory

Adds some selftests for exclusive device memory.

Signed-off-by: Alistair Popple <[email protected]>
Acked-by: Jason Gunthorpe <[email protected]>
Tested-by: Ralph Campbell <[email protected]>
Reviewed-by: Ralph Campbell <[email protected]>
---
lib/test_hmm.c | 124 +++++++++++++++++++
lib/test_hmm_uapi.h | 2 +
tools/testing/selftests/vm/hmm-tests.c | 158 +++++++++++++++++++++++++
3 files changed, 284 insertions(+)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 5c9f5a020c1d..305a9d9e2b4c 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -25,6 +25,7 @@
#include <linux/swapops.h>
#include <linux/sched/mm.h>
#include <linux/platform_device.h>
+#include <linux/rmap.h>

#include "test_hmm_uapi.h"

@@ -46,6 +47,7 @@ struct dmirror_bounce {
unsigned long cpages;
};

+#define DPT_XA_TAG_ATOMIC 1UL
#define DPT_XA_TAG_WRITE 3UL

/*
@@ -619,6 +621,54 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
}
}

+static int dmirror_check_atomic(struct dmirror *dmirror, unsigned long start,
+ unsigned long end)
+{
+ unsigned long pfn;
+
+ for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++) {
+ void *entry;
+ struct page *page;
+
+ entry = xa_load(&dmirror->pt, pfn);
+ page = xa_untag_pointer(entry);
+ if (xa_pointer_tag(entry) == DPT_XA_TAG_ATOMIC)
+ return -EPERM;
+ }
+
+ return 0;
+}
+
+static int dmirror_atomic_map(unsigned long start, unsigned long end,
+ struct page **pages, struct dmirror *dmirror)
+{
+ unsigned long pfn, mapped = 0;
+ int i;
+
+ /* Map the migrated pages into the device's page tables. */
+ mutex_lock(&dmirror->mutex);
+
+ for (i = 0, pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++, i++) {
+ void *entry;
+
+ if (!pages[i])
+ continue;
+
+ entry = pages[i];
+ entry = xa_tag_pointer(entry, DPT_XA_TAG_ATOMIC);
+ entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC);
+ if (xa_is_err(entry)) {
+ mutex_unlock(&dmirror->mutex);
+ return xa_err(entry);
+ }
+
+ mapped++;
+ }
+
+ mutex_unlock(&dmirror->mutex);
+ return mapped;
+}
+
static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
struct dmirror *dmirror)
{
@@ -661,6 +711,71 @@ static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
return 0;
}

+static int dmirror_exclusive(struct dmirror *dmirror,
+ struct hmm_dmirror_cmd *cmd)
+{
+ unsigned long start, end, addr;
+ unsigned long size = cmd->npages << PAGE_SHIFT;
+ struct mm_struct *mm = dmirror->notifier.mm;
+ struct page *pages[64];
+ struct dmirror_bounce bounce;
+ unsigned long next;
+ int ret;
+
+ start = cmd->addr;
+ end = start + size;
+ if (end < start)
+ return -EINVAL;
+
+ /* Since the mm is for the mirrored process, get a reference first. */
+ if (!mmget_not_zero(mm))
+ return -EINVAL;
+
+ mmap_read_lock(mm);
+ for (addr = start; addr < end; addr = next) {
+ int i, mapped;
+
+ if (end < addr + (ARRAY_SIZE(pages) << PAGE_SHIFT))
+ next = end;
+ else
+ next = addr + (ARRAY_SIZE(pages) << PAGE_SHIFT);
+
+ ret = make_device_exclusive_range(mm, addr, next, pages, NULL);
+ mapped = dmirror_atomic_map(addr, next, pages, dmirror);
+ for (i = 0; i < ret; i++) {
+ if (pages[i]) {
+ unlock_page(pages[i]);
+ put_page(pages[i]);
+ }
+ }
+
+ if (addr + (mapped << PAGE_SHIFT) < next) {
+ mmap_read_unlock(mm);
+ mmput(mm);
+ return -EBUSY;
+ }
+ }
+ mmap_read_unlock(mm);
+ mmput(mm);
+
+ /* Return the migrated data for verification. */
+ ret = dmirror_bounce_init(&bounce, start, size);
+ if (ret)
+ return ret;
+ mutex_lock(&dmirror->mutex);
+ ret = dmirror_do_read(dmirror, start, end, &bounce);
+ mutex_unlock(&dmirror->mutex);
+ if (ret == 0) {
+ if (copy_to_user(u64_to_user_ptr(cmd->ptr), bounce.ptr,
+ bounce.size))
+ ret = -EFAULT;
+ }
+
+ cmd->cpages = bounce.cpages;
+ dmirror_bounce_fini(&bounce);
+ return ret;
+}
+
static int dmirror_migrate(struct dmirror *dmirror,
struct hmm_dmirror_cmd *cmd)
{
@@ -949,6 +1064,15 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
ret = dmirror_migrate(dmirror, &cmd);
break;

+ case HMM_DMIRROR_EXCLUSIVE:
+ ret = dmirror_exclusive(dmirror, &cmd);
+ break;
+
+ case HMM_DMIRROR_CHECK_EXCLUSIVE:
+ ret = dmirror_check_atomic(dmirror, cmd.addr,
+ cmd.addr + (cmd.npages << PAGE_SHIFT));
+ break;
+
case HMM_DMIRROR_SNAPSHOT:
ret = dmirror_snapshot(dmirror, &cmd);
break;
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index 670b4ef2a5b6..f14dea5dcd06 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -33,6 +33,8 @@ struct hmm_dmirror_cmd {
#define HMM_DMIRROR_WRITE _IOWR('H', 0x01, struct hmm_dmirror_cmd)
#define HMM_DMIRROR_MIGRATE _IOWR('H', 0x02, struct hmm_dmirror_cmd)
#define HMM_DMIRROR_SNAPSHOT _IOWR('H', 0x03, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_EXCLUSIVE _IOWR('H', 0x04, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_CHECK_EXCLUSIVE _IOWR('H', 0x05, struct hmm_dmirror_cmd)

/*
* Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
index 5d1ac691b9f4..864f126ffd78 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -1485,4 +1485,162 @@ TEST_F(hmm2, double_map)
hmm_buffer_free(buffer);
}

+/*
+ * Basic check of exclusive faulting.
+ */
+TEST_F(hmm, exclusive)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long i;
+ int *ptr;
+ int ret;
+
+ npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+ ASSERT_NE(npages, 0);
+ size = npages << self->page_shift;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = size;
+ buffer->mirror = malloc(size);
+ ASSERT_NE(buffer->mirror, NULL);
+
+ buffer->ptr = mmap(NULL, size,
+ PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS,
+ buffer->fd, 0);
+ ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+ /* Initialize buffer in system memory. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ptr[i] = i;
+
+ /* Map memory exclusively for device access. */
+ ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_EXCLUSIVE, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /* Check what the device read. */
+ for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], i);
+
+ /* Fault pages back to system memory and check them. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i]++, i);
+
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], i+1);
+
+ /* Check atomic access revoked */
+ ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_CHECK_EXCLUSIVE, buffer, npages);
+ ASSERT_EQ(ret, 0);
+
+ hmm_buffer_free(buffer);
+}
+
+TEST_F(hmm, exclusive_mprotect)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long i;
+ int *ptr;
+ int ret;
+
+ npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+ ASSERT_NE(npages, 0);
+ size = npages << self->page_shift;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = size;
+ buffer->mirror = malloc(size);
+ ASSERT_NE(buffer->mirror, NULL);
+
+ buffer->ptr = mmap(NULL, size,
+ PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS,
+ buffer->fd, 0);
+ ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+ /* Initialize buffer in system memory. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ptr[i] = i;
+
+ /* Map memory exclusively for device access. */
+ ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_EXCLUSIVE, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ /* Check what the device read. */
+ for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], i);
+
+ ret = mprotect(buffer->ptr, size, PROT_READ);
+ ASSERT_EQ(ret, 0);
+
+ /* Simulate a device writing system memory. */
+ ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages);
+ ASSERT_EQ(ret, -EPERM);
+
+ hmm_buffer_free(buffer);
+}
+
+/*
+ * Check copy-on-write works.
+ */
+TEST_F(hmm, exclusive_cow)
+{
+ struct hmm_buffer *buffer;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long i;
+ int *ptr;
+ int ret;
+
+ npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+ ASSERT_NE(npages, 0);
+ size = npages << self->page_shift;
+
+ buffer = malloc(sizeof(*buffer));
+ ASSERT_NE(buffer, NULL);
+
+ buffer->fd = -1;
+ buffer->size = size;
+ buffer->mirror = malloc(size);
+ ASSERT_NE(buffer->mirror, NULL);
+
+ buffer->ptr = mmap(NULL, size,
+ PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS,
+ buffer->fd, 0);
+ ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+ /* Initialize buffer in system memory. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ptr[i] = i;
+
+ /* Map memory exclusively for device access. */
+ ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_EXCLUSIVE, buffer, npages);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(buffer->cpages, npages);
+
+ fork();
+
+ /* Fault pages back to system memory and check them. */
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i]++, i);
+
+ for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+ ASSERT_EQ(ptr[i], i+1);
+
+ hmm_buffer_free(buffer);
+}
+
TEST_HARNESS_MAIN
--
2.20.1

2021-05-06 07:49:05

by Alistair Popple

Subject: Re: [PATCH v8 0/8] Add support for SVM atomics in Nouveau

Hi Andrew,

There is currently no outstanding feedback for this series so I am hoping it
may be considered for inclusion (or at least the mm portions - I still need
Reviews/Acks for the Nouveau bits). The main change for v8 was removing
entries on fork rather than copying them, in response to feedback from Jason,
so any follow-up comments on patch 5 would also be welcome. The series
contains a number of general clean-ups suggested by Christoph along with a
feature to temporarily make selected user page mappings write-protected.

This is needed to support OpenCL atomic operations in Nouveau on shared
virtual memory (SVM) regions allocated with the CL_MEM_SVM_ATOMICS clSVMAlloc
flag. A more complete description of the OpenCL SVM feature is available at
https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_API.html#_shared_virtual_memory .
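
For anyone wanting to try this, a minimal host-side allocation looks roughly
like the following (assuming an already-created cl_context "ctx"; the size
is arbitrary):

  #include <CL/cl.h>

  int *svm_buf = clSVMAlloc(ctx,
                            CL_MEM_READ_WRITE |
                            CL_MEM_SVM_FINE_GRAIN_BUFFER |
                            CL_MEM_SVM_ATOMICS,
                            4096, 0);
  /* Pass svm_buf to a kernel with clSetKernelArgSVMPointer() and update it
   * concurrently from the CPU; this series is what keeps the GPU's atomic
   * updates coherent with those CPU writes. */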

I have been testing this with Mesa 21.1.0 and a simple OpenCL program which
checks that GPU atomic accesses to system memory are atomic. Without this
series the test fails as there is no way of write-protecting the userspace
page mapping, which results in the device clobbering CPU writes. For
reference the test is available at https://ozlabs.org/~apopple/opencl_svm_atomics/ .

- Alistair

On Wednesday, 7 April 2021 6:42:30 PM AEST Alistair Popple wrote:
> This is the eighth version of a series to add support to Nouveau for atomic
> memory operations on OpenCL shared virtual memory (SVM) regions.
>
> The main change for this version is a simplification of device exclusive
> entry handling. Instead of copying device exclusive entries for
> copy-on-write mappings during fork, they are now removed. This is safer
> because copying has subtle corner cases, particularly for pinned pages,
> which should follow the same logic as copy_present_page(). Removing the
> entries avoids this possibility by treating them as normal ptes.
>
> Exclusive device access is implemented by adding a new swap entry type
> (SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry. The main
> difference is that on fault the original entry is immediately restored by
> the fault handler instead of waiting.
>
> Restoring the entry triggers calls to MMU notifers which allows a device
> driver to revoke the atomic access permission from the GPU prior to the CPU
> finalising the entry.
>
> Patches 1 & 2 refactor existing migration and device private entry
> functions.
>
> Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
> functionality into separate functions - try_to_migrate_one() and
> try_to_munlock_one(). These should not change any functionality, but any
> help testing would be much appreciated as I have not been able to test
> every usage of try_to_unmap_one().
>
> Patch 5 contains the bulk of the implementation for device exclusive
> memory.
>
> Patch 6 contains some additions to the HMM selftests to ensure everything
> works as expected.
>
> Patch 7 is a cleanup for the Nouveau SVM implementation.
>
> Patch 8 contains the implementation of atomic access for the Nouveau
> driver.
>
> This has been tested using the latest upstream Mesa userspace with a simple
> OpenCL test program which checks the results of atomic GPU operations on a
> SVM buffer whilst also writing to the same buffer from the CPU.
>
> Alistair Popple (8):
> mm: Remove special swap entry functions
> mm/swapops: Rework swap entry manipulation code
> mm/rmap: Split try_to_munlock from try_to_unmap
> mm/rmap: Split migration into its own function
> mm: Device exclusive memory access
> mm: Selftests for exclusive device memory
> nouveau/svm: Refactor nouveau_range_fault
> nouveau/svm: Implement atomic SVM access
>
> Documentation/vm/hmm.rst | 19 +-
> Documentation/vm/unevictable-lru.rst | 33 +-
> arch/s390/mm/pgtable.c | 2 +-
> drivers/gpu/drm/nouveau/include/nvif/if000c.h | 1 +
> drivers/gpu/drm/nouveau/nouveau_svm.c | 156 ++++-
> drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h | 1 +
> .../drm/nouveau/nvkm/subdev/mmu/vmmgp100.c | 6 +
> fs/proc/task_mmu.c | 23 +-
> include/linux/mmu_notifier.h | 26 +-
> include/linux/rmap.h | 11 +-
> include/linux/swap.h | 8 +-
> include/linux/swapops.h | 123 ++--
> lib/test_hmm.c | 126 +++-
> lib/test_hmm_uapi.h | 2 +
> mm/debug_vm_pgtable.c | 12 +-
> mm/hmm.c | 12 +-
> mm/huge_memory.c | 45 +-
> mm/hugetlb.c | 10 +-
> mm/memcontrol.c | 2 +-
> mm/memory.c | 196 +++++-
> mm/migrate.c | 51 +-
> mm/mlock.c | 10 +-
> mm/mprotect.c | 18 +-
> mm/page_vma_mapped.c | 15 +-
> mm/rmap.c | 612 +++++++++++++++---
> tools/testing/selftests/vm/hmm-tests.c | 158 +++++
> 26 files changed, 1366 insertions(+), 312 deletions(-)
>
>




2021-05-21 20:06:49

by Ben Skeggs

[permalink] [raw]
Subject: Re: [PATCH v8 8/8] nouveau/svm: Implement atomic SVM access

On Wed, 7 Apr 2021 at 18:43, Alistair Popple <[email protected]> wrote:
>
> Some NVIDIA GPUs do not support direct atomic access to system memory
> via PCIe. Instead this must be emulated by granting the GPU exclusive
> access to the memory. This is achieved by replacing CPU page table
> entries with special swap entries that fault on userspace access.
>
> The driver then grants the GPU permission to update the page undergoing
> atomic access via the GPU page tables. When CPU access to the page is
> required a CPU fault is raised which calls into the device driver via
> MMU notifiers to revoke the atomic access. The original page table
> entries are then restored allowing CPU access to proceed.
>
> Signed-off-by: Alistair Popple <[email protected]>
The Nouveau bits at least look good to me.

For patches 7/8:
Reviewed-by: Ben Skeggs <[email protected]>

>
> ---
>
> v7:
> * Removed magic values for fault access levels
> * Improved readability of fault comparison code
>
> v4:
> * Check that page table entries haven't changed before mapping on the
> device
> ---
> drivers/gpu/drm/nouveau/include/nvif/if000c.h | 1 +
> drivers/gpu/drm/nouveau/nouveau_svm.c | 126 ++++++++++++++++--
> drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h | 1 +
> .../drm/nouveau/nvkm/subdev/mmu/vmmgp100.c | 6 +
> 4 files changed, 123 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/gpu/drm/nouveau/include/nvif/if000c.h b/drivers/gpu/drm/nouveau/include/nvif/if000c.h
> index d6dd40f21eed..9c7ff56831c5 100644
> --- a/drivers/gpu/drm/nouveau/include/nvif/if000c.h
> +++ b/drivers/gpu/drm/nouveau/include/nvif/if000c.h
> @@ -77,6 +77,7 @@ struct nvif_vmm_pfnmap_v0 {
> #define NVIF_VMM_PFNMAP_V0_APER 0x00000000000000f0ULL
> #define NVIF_VMM_PFNMAP_V0_HOST 0x0000000000000000ULL
> #define NVIF_VMM_PFNMAP_V0_VRAM 0x0000000000000010ULL
> +#define NVIF_VMM_PFNMAP_V0_A 0x0000000000000004ULL
> #define NVIF_VMM_PFNMAP_V0_W 0x0000000000000002ULL
> #define NVIF_VMM_PFNMAP_V0_V 0x0000000000000001ULL
> #define NVIF_VMM_PFNMAP_V0_NONE 0x0000000000000000ULL
> diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
> index a195e48c9aee..81526d65b4e2 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_svm.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
> @@ -35,6 +35,7 @@
> #include <linux/sched/mm.h>
> #include <linux/sort.h>
> #include <linux/hmm.h>
> +#include <linux/rmap.h>
>
> struct nouveau_svm {
> struct nouveau_drm *drm;
> @@ -67,6 +68,11 @@ struct nouveau_svm {
> } buffer[1];
> };
>
> +#define FAULT_ACCESS_READ 0
> +#define FAULT_ACCESS_WRITE 1
> +#define FAULT_ACCESS_ATOMIC 2
> +#define FAULT_ACCESS_PREFETCH 3
> +
> #define SVM_DBG(s,f,a...) NV_DEBUG((s)->drm, "svm: "f"\n", ##a)
> #define SVM_ERR(s,f,a...) NV_WARN((s)->drm, "svm: "f"\n", ##a)
>
> @@ -411,6 +417,24 @@ nouveau_svm_fault_cancel_fault(struct nouveau_svm *svm,
> fault->client);
> }
>
> +static int
> +nouveau_svm_fault_priority(u8 fault)
> +{
> + switch (fault) {
> + case FAULT_ACCESS_PREFETCH:
> + return 0;
> + case FAULT_ACCESS_READ:
> + return 1;
> + case FAULT_ACCESS_WRITE:
> + return 2;
> + case FAULT_ACCESS_ATOMIC:
> + return 3;
> + default:
> + WARN_ON_ONCE(1);
> + return -1;
> + }
> +}
> +
> static int
> nouveau_svm_fault_cmp(const void *a, const void *b)
> {
> @@ -421,9 +445,8 @@ nouveau_svm_fault_cmp(const void *a, const void *b)
> return ret;
> if ((ret = (s64)fa->addr - fb->addr))
> return ret;
> - /*XXX: atomic? */
> - return (fa->access == 0 || fa->access == 3) -
> - (fb->access == 0 || fb->access == 3);
> + return nouveau_svm_fault_priority(fa->access) -
> + nouveau_svm_fault_priority(fb->access);
> }
>
> static void
> @@ -487,6 +510,10 @@ static bool nouveau_svm_range_invalidate(struct mmu_interval_notifier *mni,
> struct svm_notifier *sn =
> container_of(mni, struct svm_notifier, notifier);
>
> + if (range->event == MMU_NOTIFY_EXCLUSIVE &&
> + range->owner == sn->svmm->vmm->cli->drm->dev)
> + return true;
> +
> /*
> * serializes the update to mni->invalidate_seq done by caller and
> * prevents invalidation of the PTE from progressing while HW is being
> @@ -555,6 +582,71 @@ static void nouveau_hmm_convert_pfn(struct nouveau_drm *drm,
> args->p.phys[0] |= NVIF_VMM_PFNMAP_V0_W;
> }
>
> +static int nouveau_atomic_range_fault(struct nouveau_svmm *svmm,
> + struct nouveau_drm *drm,
> + struct nouveau_pfnmap_args *args, u32 size,
> + struct svm_notifier *notifier)
> +{
> + unsigned long timeout =
> + jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> + struct mm_struct *mm = svmm->notifier.mm;
> + struct page *page;
> + unsigned long start = args->p.addr;
> + unsigned long notifier_seq;
> + int ret = 0;
> +
> + ret = mmu_interval_notifier_insert(&notifier->notifier, mm,
> + args->p.addr, args->p.size,
> + &nouveau_svm_mni_ops);
> + if (ret)
> + return ret;
> +
> + while (true) {
> + if (time_after(jiffies, timeout)) {
> + ret = -EBUSY;
> + goto out;
> + }
> +
> + notifier_seq = mmu_interval_read_begin(&notifier->notifier);
> + mmap_read_lock(mm);
> + make_device_exclusive_range(mm, start, start + PAGE_SIZE,
> + &page, drm->dev);
> + mmap_read_unlock(mm);
> + if (!page) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + mutex_lock(&svmm->mutex);
> + if (!mmu_interval_read_retry(&notifier->notifier,
> + notifier_seq))
> + break;
> + mutex_unlock(&svmm->mutex);
> + }
> +
> + /* Map the page on the GPU. */
> + args->p.page = 12;
> + args->p.size = PAGE_SIZE;
> + args->p.addr = start;
> + args->p.phys[0] = page_to_phys(page) |
> + NVIF_VMM_PFNMAP_V0_V |
> + NVIF_VMM_PFNMAP_V0_W |
> + NVIF_VMM_PFNMAP_V0_A |
> + NVIF_VMM_PFNMAP_V0_HOST;
> +
> + svmm->vmm->vmm.object.client->super = true;
> + ret = nvif_object_ioctl(&svmm->vmm->vmm.object, args, size, NULL);
> + svmm->vmm->vmm.object.client->super = false;
> + mutex_unlock(&svmm->mutex);
> +
> + unlock_page(page);
> + put_page(page);
> +
> +out:
> + mmu_interval_notifier_remove(&notifier->notifier);
> + return ret;
> +}
> +
> static int nouveau_range_fault(struct nouveau_svmm *svmm,
> struct nouveau_drm *drm,
> struct nouveau_pfnmap_args *args, u32 size,
> @@ -637,7 +729,7 @@ nouveau_svm_fault(struct nvif_notify *notify)
> unsigned long hmm_flags;
> u64 inst, start, limit;
> int fi, fn;
> - int replay = 0, ret;
> + int replay = 0, atomic = 0, ret;
>
> /* Parse available fault buffer entries into a cache, and update
> * the GET pointer so HW can reuse the entries.
> @@ -718,12 +810,14 @@ nouveau_svm_fault(struct nvif_notify *notify)
> /*
> * Determine required permissions based on GPU fault
> * access flags.
> - * XXX: atomic?
> */
> switch (buffer->fault[fi]->access) {
> case 0: /* READ. */
> hmm_flags = HMM_PFN_REQ_FAULT;
> break;
> + case 2: /* ATOMIC. */
> + atomic = true;
> + break;
> case 3: /* PREFETCH. */
> hmm_flags = 0;
> break;
> @@ -739,8 +833,14 @@ nouveau_svm_fault(struct nvif_notify *notify)
> }
>
> notifier.svmm = svmm;
> - ret = nouveau_range_fault(svmm, svm->drm, &args.i,
> - sizeof(args), hmm_flags, &notifier);
> + if (atomic)
> + ret = nouveau_atomic_range_fault(svmm, svm->drm,
> + &args.i, sizeof(args),
> + &notifier);
> + else
> + ret = nouveau_range_fault(svmm, svm->drm, &args.i,
> + sizeof(args), hmm_flags,
> + &notifier);
> mmput(mm);
>
> limit = args.i.p.addr + args.i.p.size;
> @@ -756,11 +856,15 @@ nouveau_svm_fault(struct nvif_notify *notify)
> */
> if (buffer->fault[fn]->svmm != svmm ||
> buffer->fault[fn]->addr >= limit ||
> - (buffer->fault[fi]->access == 0 /* READ. */ &&
> + (buffer->fault[fi]->access == FAULT_ACCESS_READ &&
> !(args.phys[0] & NVIF_VMM_PFNMAP_V0_V)) ||
> - (buffer->fault[fi]->access != 0 /* READ. */ &&
> - buffer->fault[fi]->access != 3 /* PREFETCH. */ &&
> - !(args.phys[0] & NVIF_VMM_PFNMAP_V0_W)))
> + (buffer->fault[fi]->access != FAULT_ACCESS_READ &&
> + buffer->fault[fi]->access != FAULT_ACCESS_PREFETCH &&
> + !(args.phys[0] & NVIF_VMM_PFNMAP_V0_W)) ||
> + (buffer->fault[fi]->access != FAULT_ACCESS_READ &&
> + buffer->fault[fi]->access != FAULT_ACCESS_WRITE &&
> + buffer->fault[fi]->access != FAULT_ACCESS_PREFETCH &&
> + !(args.phys[0] & NVIF_VMM_PFNMAP_V0_A)))
> break;
> }
>
> diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
> index a2b179568970..f6188aa9171c 100644
> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
> @@ -178,6 +178,7 @@ void nvkm_vmm_unmap_region(struct nvkm_vmm *, struct nvkm_vma *);
> #define NVKM_VMM_PFN_APER 0x00000000000000f0ULL
> #define NVKM_VMM_PFN_HOST 0x0000000000000000ULL
> #define NVKM_VMM_PFN_VRAM 0x0000000000000010ULL
> +#define NVKM_VMM_PFN_A 0x0000000000000004ULL
> #define NVKM_VMM_PFN_W 0x0000000000000002ULL
> #define NVKM_VMM_PFN_V 0x0000000000000001ULL
> #define NVKM_VMM_PFN_NONE 0x0000000000000000ULL
> diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c
> index 236db5570771..f02abd9cb4dd 100644
> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c
> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c
> @@ -88,6 +88,9 @@ gp100_vmm_pgt_pfn(struct nvkm_vmm *vmm, struct nvkm_mmu_pt *pt,
> if (!(*map->pfn & NVKM_VMM_PFN_W))
> data |= BIT_ULL(6); /* RO. */
>
> + if (!(*map->pfn & NVKM_VMM_PFN_A))
> + data |= BIT_ULL(7); /* Atomic disable. */
> +
> if (!(*map->pfn & NVKM_VMM_PFN_VRAM)) {
> addr = *map->pfn >> NVKM_VMM_PFN_ADDR_SHIFT;
> addr = dma_map_page(dev, pfn_to_page(addr), 0,
> @@ -322,6 +325,9 @@ gp100_vmm_pd0_pfn(struct nvkm_vmm *vmm, struct nvkm_mmu_pt *pt,
> if (!(*map->pfn & NVKM_VMM_PFN_W))
> data |= BIT_ULL(6); /* RO. */
>
> + if (!(*map->pfn & NVKM_VMM_PFN_A))
> + data |= BIT_ULL(7); /* Atomic disable. */
> +
> if (!(*map->pfn & NVKM_VMM_PFN_VRAM)) {
> addr = *map->pfn >> NVKM_VMM_PFN_ADDR_SHIFT;
> addr = dma_map_page(dev, pfn_to_page(addr), 0,
> --
> 2.20.1
>