2021-01-08 17:17:36

by Will Deacon

Subject: [PATCH v2 0/3] Create 'old' ptes for faultaround mappings on arm64 with hardware access flag

Hi all,

This is version two of the series I originally posted here:

https://lore.kernel.org/r/[email protected]

The patches allow architectures to opt-in at runtime for faultaround
mappings to be created as 'old' instead of 'young'. Although there have
been previous attempts at this, they failed either because the decision
was deferred to userspace [1] or because it was done unconditionally and
shown to regress benchmarks for particular architectures [2].

The big difference in this version is that I have reworked it based on
Kirill's patch which he posted as a follow-up to the original. However,
I can't tell where we've landed on that -- Linus seemed to like it, but
Hugh was less enthusiastic. I think that my subsequent patches are an
awful lot cleaner after the rework, but I don't think I necessarily
depend on everything in there so if that patch is likely to be a
sticking point then I can try to extract the parts I need.

Feedback welcome.

Cheers,

Will

[1] https://www.spinics.net/lists/linux-mm/msg143831.html
[2] 315d09bf30c2 ("Revert "mm: make faultaround produce old ptes"")

Cc: Catalin Marinas <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Vinayak Menon <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: <[email protected]>

--->8

Kirill A. Shutemov (1):
mm: Cleanup faultaround and finish_fault() codepaths

Will Deacon (2):
mm: Allow architectures to request 'old' entries when prefaulting
arm64: mm: Implement arch_wants_old_prefaulted_pte()

 arch/arm64/include/asm/pgtable.h |  12 +-
 fs/xfs/xfs_file.c                |   6 +-
 include/linux/mm.h               |  17 ++-
 include/linux/pgtable.h          |  11 ++
 mm/filemap.c                     | 181 +++++++++++++++++++------
 mm/memory.c                      | 219 +++++++++++--------------------
6 files changed, 251 insertions(+), 195 deletions(-)

--
2.29.2.729.g45daf8777d-goog


2021-01-08 17:17:56

by Will Deacon

Subject: [PATCH v2 1/3] mm: Cleanup faultaround and finish_fault() codepaths

From: "Kirill A. Shutemov" <[email protected]>

alloc_set_pte() has two users with different requirements: in the
faultaround code, it is called from an atomic context and the PTE page
table has to be preallocated. finish_fault() can sleep and allocate the
page table as needed.

PTL locking rules are also strange, hard to follow and overkill for
finish_fault().

Let's untangle the mess. alloc_set_pte() has gone now. All locking is
explicit.

The price is some code duplication to handle huge pages in the faultaround
path, but it should be fine given the overall improvement in readability.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
---
 fs/xfs/xfs_file.c       |   6 +-
 include/linux/mm.h      |  12 ++-
 include/linux/pgtable.h |  11 +++
 mm/filemap.c            | 177 ++++++++++++++++++++---------
 mm/memory.c             | 199 ++++++++++++----------------------------
5 files changed, 213 insertions(+), 192 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 5b0f93f73837..111fe73bb8a7 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1319,17 +1319,19 @@ xfs_filemap_pfn_mkwrite(
return __xfs_filemap_fault(vmf, PE_SIZE_PTE, true);
}

-static void
+static vm_fault_t
xfs_filemap_map_pages(
struct vm_fault *vmf,
pgoff_t start_pgoff,
pgoff_t end_pgoff)
{
struct inode *inode = file_inode(vmf->vma->vm_file);
+ vm_fault_t ret;

xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
- filemap_map_pages(vmf, start_pgoff, end_pgoff);
+ ret = filemap_map_pages(vmf, start_pgoff, end_pgoff);
xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
+ return ret;
}

static const struct vm_operations_struct xfs_file_vm_ops = {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..801dd99f733c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -542,8 +542,8 @@ struct vm_fault {
* is not NULL, otherwise pmd.
*/
pgtable_t prealloc_pte; /* Pre-allocated pte page table.
- * vm_ops->map_pages() calls
- * alloc_set_pte() from atomic context.
+ * vm_ops->map_pages() sets up a page
+ * table from from atomic context.
* do_fault_around() pre-allocates
* page table to avoid allocation from
* atomic context.
@@ -578,7 +578,7 @@ struct vm_operations_struct {
vm_fault_t (*fault)(struct vm_fault *vmf);
vm_fault_t (*huge_fault)(struct vm_fault *vmf,
enum page_entry_size pe_size);
- void (*map_pages)(struct vm_fault *vmf,
+ vm_fault_t (*map_pages)(struct vm_fault *vmf,
pgoff_t start_pgoff, pgoff_t end_pgoff);
unsigned long (*pagesize)(struct vm_area_struct * area);

@@ -988,7 +988,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
return pte;
}

-vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page);
+vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
+void do_set_pte(struct vm_fault *vmf, struct page *page);
+
vm_fault_t finish_fault(struct vm_fault *vmf);
vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
#endif
@@ -2622,7 +2624,7 @@ extern void truncate_inode_pages_final(struct address_space *);

/* generic vm_area_ops exported for stackable file systems */
extern vm_fault_t filemap_fault(struct vm_fault *vmf);
-extern void filemap_map_pages(struct vm_fault *vmf,
+extern vm_fault_t filemap_map_pages(struct vm_fault *vmf,
pgoff_t start_pgoff, pgoff_t end_pgoff);
extern vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf);

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 8fcdfa52eb4b..36eb748f3c97 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1314,6 +1314,17 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
#endif
}

+/*
+ * The ordering of these checks is important for pmds with _PAGE_DEVMAP set.
+ * If we check pmd_trans_unstable() first we will trip the bad_pmd() check
+ * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly
+ * returning 1 but not before it spams dmesg with the pmd_clear_bad() output.
+ */
+static inline int pmd_devmap_trans_unstable(pmd_t *pmd)
+{
+ return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
+}
+
#ifndef CONFIG_NUMA_BALANCING
/*
* Technically a PTE can be PROTNONE even when not doing NUMA balancing but
diff --git a/mm/filemap.c b/mm/filemap.c
index 5c9d564317a5..c1f2dc89b8a7 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -42,6 +42,7 @@
#include <linux/psi.h>
#include <linux/ramfs.h>
#include <linux/page_idle.h>
+#include <asm/pgalloc.h>
#include "internal.h"

#define CREATE_TRACE_POINTS
@@ -2911,74 +2912,164 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
}
EXPORT_SYMBOL(filemap_fault);

-void filemap_map_pages(struct vm_fault *vmf,
- pgoff_t start_pgoff, pgoff_t end_pgoff)
+static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
{
- struct file *file = vmf->vma->vm_file;
+ struct mm_struct *mm = vmf->vma->vm_mm;
+
+ /* Huge page is mapped? No need to proceed. */
+ if (pmd_trans_huge(*vmf->pmd)) {
+ unlock_page(page);
+ put_page(page);
+ return true;
+ }
+
+ if (pmd_none(*vmf->pmd) && PageTransHuge(page)) {
+ vm_fault_t ret = do_set_pmd(vmf, page);
+ if (!ret) {
+ /* The page is mapped successfully, reference consumed. */
+ unlock_page(page);
+ return true;
+ }
+ }
+
+ if (pmd_none(*vmf->pmd)) {
+ vmf->ptl = pmd_lock(mm, vmf->pmd);
+ if (likely(pmd_none(*vmf->pmd))) {
+ mm_inc_nr_ptes(mm);
+ pmd_populate(mm, vmf->pmd, vmf->prealloc_pte);
+ vmf->prealloc_pte = NULL;
+ }
+ spin_unlock(vmf->ptl);
+ }
+
+ /* See comment in handle_pte_fault() */
+ if (pmd_devmap_trans_unstable(vmf->pmd)) {
+ unlock_page(page);
+ put_page(page);
+ return true;
+ }
+
+ return false;
+}
+
+static struct page *next_uptodate_page(struct page *page,
+ struct address_space *mapping,
+ struct xa_state *xas, pgoff_t end_pgoff)
+{
+ unsigned long max_idx;
+
+ do {
+ if (!page)
+ return NULL;
+ if (xas_retry(xas, page))
+ continue;
+ if (xa_is_value(page))
+ continue;
+ if (PageLocked(page))
+ continue;
+ if (!page_cache_get_speculative(page))
+ continue;
+ /* Has the page moved or been split? */
+ if (unlikely(page != xas_reload(xas)))
+ goto skip;
+ if (!PageUptodate(page) || PageReadahead(page))
+ goto skip;
+ if (PageHWPoison(page))
+ goto skip;
+ if (!trylock_page(page))
+ goto skip;
+ if (page->mapping != mapping)
+ goto unlock;
+ if (!PageUptodate(page))
+ goto unlock;
+ max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
+ if (xas->xa_index >= max_idx)
+ goto unlock;
+ return page;
+unlock:
+ unlock_page(page);
+skip:
+ put_page(page);
+ } while ((page = xas_next_entry(xas, end_pgoff)) != NULL);
+
+ return NULL;
+}
+
+static inline struct page *first_map_page(struct address_space *mapping,
+ struct xa_state *xas,
+ pgoff_t end_pgoff)
+{
+ return next_uptodate_page(xas_find(xas, end_pgoff),
+ mapping, xas, end_pgoff);
+}
+
+static inline struct page *next_map_page(struct address_space *mapping,
+ struct xa_state *xas,
+ pgoff_t end_pgoff)
+{
+ return next_uptodate_page(xas_next_entry(xas, end_pgoff),
+ mapping, xas, end_pgoff);
+}
+
+vm_fault_t filemap_map_pages(struct vm_fault *vmf,
+ pgoff_t start_pgoff, pgoff_t end_pgoff)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ struct file *file = vma->vm_file;
struct address_space *mapping = file->f_mapping;
pgoff_t last_pgoff = start_pgoff;
- unsigned long max_idx;
+ unsigned long address = vmf->address;
XA_STATE(xas, &mapping->i_pages, start_pgoff);
struct page *head, *page;
unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
+ vm_fault_t ret = 0;

rcu_read_lock();
- xas_for_each(&xas, head, end_pgoff) {
- if (xas_retry(&xas, head))
- continue;
- if (xa_is_value(head))
- goto next;
+ head = first_map_page(mapping, &xas, end_pgoff);
+ if (!head)
+ goto out;

- /*
- * Check for a locked page first, as a speculative
- * reference may adversely influence page migration.
- */
- if (PageLocked(head))
- goto next;
- if (!page_cache_get_speculative(head))
- goto next;
+ if (filemap_map_pmd(vmf, head)) {
+ ret = VM_FAULT_NOPAGE;
+ goto out;
+ }

- /* Has the page moved or been split? */
- if (unlikely(head != xas_reload(&xas)))
- goto skip;
+ vmf->address = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);
+ do {
page = find_subpage(head, xas.xa_index);
-
- if (!PageUptodate(head) ||
- PageReadahead(page) ||
- PageHWPoison(page))
- goto skip;
- if (!trylock_page(head))
- goto skip;
-
- if (head->mapping != mapping || !PageUptodate(head))
- goto unlock;
-
- max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
- if (xas.xa_index >= max_idx)
+ if (PageHWPoison(page))
goto unlock;

if (mmap_miss > 0)
mmap_miss--;

vmf->address += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
- if (vmf->pte)
- vmf->pte += xas.xa_index - last_pgoff;
+ vmf->pte += xas.xa_index - last_pgoff;
last_pgoff = xas.xa_index;
- if (alloc_set_pte(vmf, page))
+
+ if (!pte_none(*vmf->pte))
goto unlock;
+
+ do_set_pte(vmf, page);
+ /* no need to invalidate: a not-present page won't be cached */
+ update_mmu_cache(vma, vmf->address, vmf->pte);
unlock_page(head);
- goto next;
+
+ /* The fault is handled */
+ if (vmf->address == address)
+ ret = VM_FAULT_NOPAGE;
+ continue;
unlock:
unlock_page(head);
-skip:
put_page(head);
-next:
- /* Huge page is mapped? No need to proceed. */
- if (pmd_trans_huge(*vmf->pmd))
- break;
- }
+ } while ((head = next_map_page(mapping, &xas, end_pgoff)) != NULL);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+out:
rcu_read_unlock();
+ vmf->address = address;
WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
+ return ret;
}
EXPORT_SYMBOL(filemap_map_pages);

diff --git a/mm/memory.c b/mm/memory.c
index feff48e1465a..3e2fc2950ad7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3503,7 +3503,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
if (pte_alloc(vma->vm_mm, vmf->pmd))
return VM_FAULT_OOM;

- /* See the comment in pte_alloc_one_map() */
+ /* See comment in handle_pte_fault() */
if (unlikely(pmd_trans_unstable(vmf->pmd)))
return 0;

@@ -3643,66 +3643,6 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
return ret;
}

-/*
- * The ordering of these checks is important for pmds with _PAGE_DEVMAP set.
- * If we check pmd_trans_unstable() first we will trip the bad_pmd() check
- * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly
- * returning 1 but not before it spams dmesg with the pmd_clear_bad() output.
- */
-static int pmd_devmap_trans_unstable(pmd_t *pmd)
-{
- return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
-}
-
-static vm_fault_t pte_alloc_one_map(struct vm_fault *vmf)
-{
- struct vm_area_struct *vma = vmf->vma;
-
- if (!pmd_none(*vmf->pmd))
- goto map_pte;
- if (vmf->prealloc_pte) {
- vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
- if (unlikely(!pmd_none(*vmf->pmd))) {
- spin_unlock(vmf->ptl);
- goto map_pte;
- }
-
- mm_inc_nr_ptes(vma->vm_mm);
- pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
- spin_unlock(vmf->ptl);
- vmf->prealloc_pte = NULL;
- } else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) {
- return VM_FAULT_OOM;
- }
-map_pte:
- /*
- * If a huge pmd materialized under us just retry later. Use
- * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead of
- * pmd_trans_huge() to ensure the pmd didn't become pmd_trans_huge
- * under us and then back to pmd_none, as a result of MADV_DONTNEED
- * running immediately after a huge pmd fault in a different thread of
- * this mm, in turn leading to a misleading pmd_trans_huge() retval.
- * All we have to ensure is that it is a regular pmd that we can walk
- * with pte_offset_map() and we can do that through an atomic read in
- * C, which is what pmd_trans_unstable() provides.
- */
- if (pmd_devmap_trans_unstable(vmf->pmd))
- return VM_FAULT_NOPAGE;
-
- /*
- * At this point we know that our vmf->pmd points to a page of ptes
- * and it cannot become pmd_none(), pmd_devmap() or pmd_trans_huge()
- * for the duration of the fault. If a racing MADV_DONTNEED runs and
- * we zap the ptes pointed to by our vmf->pmd, the vmf->ptl will still
- * be valid and we will re-check to make sure the vmf->pte isn't
- * pte_none() under vmf->ptl protection when we return to
- * alloc_set_pte().
- */
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
- &vmf->ptl);
- return 0;
-}
-
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static void deposit_prealloc_pte(struct vm_fault *vmf)
{
@@ -3717,7 +3657,7 @@ static void deposit_prealloc_pte(struct vm_fault *vmf)
vmf->prealloc_pte = NULL;
}

-static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
+vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
{
struct vm_area_struct *vma = vmf->vma;
bool write = vmf->flags & FAULT_FLAG_WRITE;
@@ -3775,52 +3715,17 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
return ret;
}
#else
-static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
+vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
{
- BUILD_BUG();
- return 0;
+ return VM_FAULT_FALLBACK;
}
#endif

-/**
- * alloc_set_pte - setup new PTE entry for given page and add reverse page
- * mapping. If needed, the function allocates page table or use pre-allocated.
- *
- * @vmf: fault environment
- * @page: page to map
- *
- * Caller must take care of unlocking vmf->ptl, if vmf->pte is non-NULL on
- * return.
- *
- * Target users are page handler itself and implementations of
- * vm_ops->map_pages.
- *
- * Return: %0 on success, %VM_FAULT_ code in case of error.
- */
-vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
+void do_set_pte(struct vm_fault *vmf, struct page *page)
{
struct vm_area_struct *vma = vmf->vma;
bool write = vmf->flags & FAULT_FLAG_WRITE;
pte_t entry;
- vm_fault_t ret;
-
- if (pmd_none(*vmf->pmd) && PageTransCompound(page)) {
- ret = do_set_pmd(vmf, page);
- if (ret != VM_FAULT_FALLBACK)
- return ret;
- }
-
- if (!vmf->pte) {
- ret = pte_alloc_one_map(vmf);
- if (ret)
- return ret;
- }
-
- /* Re-check under ptl */
- if (unlikely(!pte_none(*vmf->pte))) {
- update_mmu_tlb(vma, vmf->address, vmf->pte);
- return VM_FAULT_NOPAGE;
- }

flush_icache_page(vma, page);
entry = mk_pte(page, vma->vm_page_prot);
@@ -3837,14 +3742,8 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
page_add_file_rmap(page, false);
}
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
-
- /* no need to invalidate: a not-present page won't be cached */
- update_mmu_cache(vma, vmf->address, vmf->pte);
-
- return 0;
}

-
/**
* finish_fault - finish page fault once we have prepared the page to fault
*
@@ -3862,12 +3761,12 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
*/
vm_fault_t finish_fault(struct vm_fault *vmf)
{
+ struct vm_area_struct *vma = vmf->vma;
struct page *page;
- vm_fault_t ret = 0;
+ vm_fault_t ret;

/* Did we COW the page? */
- if ((vmf->flags & FAULT_FLAG_WRITE) &&
- !(vmf->vma->vm_flags & VM_SHARED))
+ if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
page = vmf->cow_page;
else
page = vmf->page;
@@ -3876,12 +3775,38 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
* check even for read faults because we might have lost our CoWed
* page
*/
- if (!(vmf->vma->vm_flags & VM_SHARED))
- ret = check_stable_address_space(vmf->vma->vm_mm);
- if (!ret)
- ret = alloc_set_pte(vmf, page);
- if (vmf->pte)
- pte_unmap_unlock(vmf->pte, vmf->ptl);
+ if (!(vma->vm_flags & VM_SHARED)) {
+ ret = check_stable_address_space(vma->vm_mm);
+ if (ret)
+ return ret;
+ }
+
+ if (pmd_none(*vmf->pmd)) {
+ if (PageTransCompound(page)) {
+ ret = do_set_pmd(vmf, page);
+ if (ret != VM_FAULT_FALLBACK)
+ return ret;
+ }
+
+ if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
+ return VM_FAULT_OOM;
+ }
+
+ /* See comment in handle_pte_fault() */
+ if (pmd_devmap_trans_unstable(vmf->pmd))
+ return 0;
+
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ ret = 0;
+ /* Re-check under ptl */
+ if (likely(pte_none(*vmf->pte)))
+ do_set_pte(vmf, page);
+ else
+ ret = VM_FAULT_NOPAGE;
+
+ update_mmu_tlb(vma, vmf->address, vmf->pte);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
return ret;
}

@@ -3951,13 +3876,12 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
pgoff_t start_pgoff = vmf->pgoff;
pgoff_t end_pgoff;
int off;
- vm_fault_t ret = 0;

nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK;

- vmf->address = max(address & mask, vmf->vma->vm_start);
- off = ((address - vmf->address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
+ address = max(address & mask, vmf->vma->vm_start);
+ off = ((vmf->address - address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
start_pgoff -= off;

/*
@@ -3965,7 +3889,7 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
* the vma or nr_pages from start_pgoff, depending what is nearest.
*/
end_pgoff = start_pgoff -
- ((vmf->address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
+ ((address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
PTRS_PER_PTE - 1;
end_pgoff = min3(end_pgoff, vma_pages(vmf->vma) + vmf->vma->vm_pgoff - 1,
start_pgoff + nr_pages - 1);
@@ -3973,31 +3897,11 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
if (pmd_none(*vmf->pmd)) {
vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm);
if (!vmf->prealloc_pte)
- goto out;
+ return VM_FAULT_OOM;
smp_wmb(); /* See comment in __pte_alloc() */
}

- vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff);
-
- /* Huge page is mapped? Page fault is solved */
- if (pmd_trans_huge(*vmf->pmd)) {
- ret = VM_FAULT_NOPAGE;
- goto out;
- }
-
- /* ->map_pages() haven't done anything useful. Cold page cache? */
- if (!vmf->pte)
- goto out;
-
- /* check if the page fault is solved */
- vmf->pte -= (vmf->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT);
- if (!pte_none(*vmf->pte))
- ret = VM_FAULT_NOPAGE;
- pte_unmap_unlock(vmf->pte, vmf->ptl);
-out:
- vmf->address = address;
- vmf->pte = NULL;
- return ret;
+ return vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff);
}

static vm_fault_t do_read_fault(struct vm_fault *vmf)
@@ -4353,7 +4257,18 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
*/
vmf->pte = NULL;
} else {
- /* See comment in pte_alloc_one_map() */
+ /*
+ * If a huge pmd materialized under us just retry later. Use
+ * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead
+ * of pmd_trans_huge() to ensure the pmd didn't become
+ * pmd_trans_huge under us and then back to pmd_none, as a
+ * result of MADV_DONTNEED running immediately after a huge pmd
+ * fault in a different thread of this mm, in turn leading to a
+ * misleading pmd_trans_huge() retval. All we have to ensure is
+ * that it is a regular pmd that we can walk with
+ * pte_offset_map() and we can do that through an atomic read
+ * in C, which is what pmd_trans_unstable() provides.
+ */
if (pmd_devmap_trans_unstable(vmf->pmd))
return 0;
/*
--
2.29.2.729.g45daf8777d-goog

2021-01-08 17:18:01

by Will Deacon

Subject: [PATCH v2 3/3] arm64: mm: Implement arch_wants_old_prefaulted_pte()

On CPUs with hardware AF/DBM, initialising prefaulted PTEs as 'old'
improves vmscan behaviour and does not appear to introduce any overhead.

Implement arch_wants_old_prefaulted_pte() to return 'true' if we detect
hardware access flag support at runtime. This can be extended in future
based on MIDR matching if necessary.

Cc: Catalin Marinas <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 501562793ce2..e17b96d0e4b5 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -980,7 +980,17 @@ static inline bool arch_faults_on_old_pte(void)

return !cpu_has_hw_af();
}
-#define arch_faults_on_old_pte arch_faults_on_old_pte
+#define arch_faults_on_old_pte arch_faults_on_old_pte
+
+/*
+ * Experimentally, it's cheap to set the access flag in hardware and we
+ * benefit from prefaulting mappings as 'old' to start with.
+ */
+static inline bool arch_wants_old_prefaulted_pte(void)
+{
+ return !arch_faults_on_old_pte();
+}
+#define arch_wants_old_prefaulted_pte arch_wants_old_prefaulted_pte

#endif /* !__ASSEMBLY__ */

--
2.29.2.729.g45daf8777d-goog

2021-01-08 17:20:03

by Will Deacon

Subject: [PATCH v2 2/3] mm: Allow architectures to request 'old' entries when prefaulting

Commit 5c0a85fad949 ("mm: make faultaround produce old ptes") changed
the "faultaround" behaviour to initialise prefaulted PTEs as 'old',
since this avoids vmscan wrongly assuming that they are hot, despite
having never been explicitly accessed by userspace. The change has been
shown to benefit numerous arm64 micro-architectures (with hardware
access flag) running Android, where both application launch latency and
direct reclaim time are significantly reduced.

Unfortunately, commit 315d09bf30c2 ("Revert "mm: make faultaround produce
old ptes"") reverted the change to it being identified as the cause of a
~6% regression in unixbench on x86. Experiments on a variety of recent
arm64 micro-architectures indicate that unixbench is not affected by
the original commit, yielding a 0-1% performance improvement.

Since one size does not fit all for the initial state of prefaulted PTEs,
introduce arch_wants_old_prefaulted_pte(), which allows an architecture
to opt-in to 'old' prefaulted PTEs at runtime based on whatever criteria
it may have.

Cc: Jan Kara <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Reported-by: Vinayak Menon <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
---
 include/linux/mm.h |  5 ++++-
 mm/filemap.c       | 12 ++++++++----
 mm/memory.c        | 20 +++++++++++++++++++-
3 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 801dd99f733c..873e410d6238 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -434,6 +434,7 @@ extern pgprot_t protection_map[16];
* @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
* @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
* @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ * @FAULT_FLAG_PREFAULT: Fault was a prefault.
*
* About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
* whether we would allow page faults to retry by specifying these two
@@ -464,6 +465,7 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_REMOTE 0x80
#define FAULT_FLAG_INSTRUCTION 0x100
#define FAULT_FLAG_INTERRUPTIBLE 0x200
+#define FAULT_FLAG_PREFAULT 0x400

/*
* The default fault flags that should be used by most of the
@@ -501,7 +503,8 @@ static inline bool fault_flag_allow_retry_first(unsigned int flags)
{ FAULT_FLAG_USER, "USER" }, \
{ FAULT_FLAG_REMOTE, "REMOTE" }, \
{ FAULT_FLAG_INSTRUCTION, "INSTRUCTION" }, \
- { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }
+ { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }, \
+ { FAULT_FLAG_PREFAULT, "PREFAULT" }

/*
* vm_fault is filled by the pagefault handler and passed to the vma's
diff --git a/mm/filemap.c b/mm/filemap.c
index c1f2dc89b8a7..0fb9d1714797 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3051,14 +3051,18 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
if (!pte_none(*vmf->pte))
goto unlock;

+ /* We're about to handle the fault */
+ if (vmf->address == address) {
+ vmf->flags &= ~FAULT_FLAG_PREFAULT;
+ ret = VM_FAULT_NOPAGE;
+ } else {
+ vmf->flags |= FAULT_FLAG_PREFAULT;
+ }
+
do_set_pte(vmf, page);
/* no need to invalidate: a not-present page won't be cached */
update_mmu_cache(vma, vmf->address, vmf->pte);
unlock_page(head);
-
- /* The fault is handled */
- if (vmf->address == address)
- ret = VM_FAULT_NOPAGE;
continue;
unlock:
unlock_page(head);
diff --git a/mm/memory.c b/mm/memory.c
index 3e2fc2950ad7..f0e7c589ca9d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -134,6 +134,18 @@ static inline bool arch_faults_on_old_pte(void)
}
#endif

+#ifndef arch_wants_old_prefaulted_pte
+static inline bool arch_wants_old_prefaulted_pte(void)
+{
+ /*
+ * Transitioning a PTE from 'old' to 'young' can be expensive on
+ * some architectures, even if it's performed in hardware. By
+ * default, "false" means prefaulted entries will be 'young'.
+ */
+ return false;
+}
+#endif
+
static int __init disable_randmaps(char *s)
{
randomize_va_space = 0;
@@ -3725,11 +3737,17 @@ void do_set_pte(struct vm_fault *vmf, struct page *page)
{
struct vm_area_struct *vma = vmf->vma;
bool write = vmf->flags & FAULT_FLAG_WRITE;
+ bool prefault = vmf->flags & FAULT_FLAG_PREFAULT;
pte_t entry;

flush_icache_page(vma, page);
entry = mk_pte(page, vma->vm_page_prot);
- entry = pte_sw_mkyoung(entry);
+
+ if (prefault && arch_wants_old_prefaulted_pte())
+ entry = pte_mkold(entry);
+ else
+ entry = pte_sw_mkyoung(entry);
+
if (write)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
/* copy-on-write page */
--
2.29.2.729.g45daf8777d-goog

2021-01-08 19:37:19

by Linus Torvalds

Subject: Re: [PATCH v2 0/3] Create 'old' ptes for faultaround mappings on arm64 with hardware access flag

On Fri, Jan 8, 2021 at 9:15 AM Will Deacon <[email protected]> wrote:
>
> The big difference in this version is that I have reworked it based on
> Kirill's patch which he posted as a follow-up to the original. However,
> I can't tell where we've landed on that -- Linus seemed to like it, but
> Hugh was less enthusiastic.

Yeah, I like it, but I have to admit that it had a disturbingly high
number of small details wrong for several versions. I hope you picked
up the final version of the code.

At the same time, I do think that the "disturbingly high number of
issues" was primarily exactly _because_ the old code was so
incomprehensible, and I think the end result is much cleaner, so I
still like it.

>I think that my subsequent patches are an
> awful lot cleaner after the rework

Yeah, I think that's a side effect of "now the code really makes a lot
more sense". Your subsequent patches 2-3 certainly are much simpler
now, although I'd be inclined to add an argument to "do_set_pte()"
that has the "write" and "pretault" bits in it, instead of having to
modify the 'vmf' structure.
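
Something like the below, as a purely illustrative and untested sketch of
the calling convention I mean (the rmap and counter handling would stay
exactly as in the posted patch, only the interesting part is shown):

	void do_set_pte(struct vm_fault *vmf, struct page *page,
			bool write, bool prefault)
	{
		pte_t entry;

		flush_icache_page(vmf->vma, page);
		entry = mk_pte(page, vmf->vma->vm_page_prot);

		if (prefault && arch_wants_old_prefaulted_pte())
			entry = pte_mkold(entry);
		else
			entry = pte_sw_mkyoung(entry);

		if (write)
			entry = maybe_mkwrite(pte_mkdirty(entry), vmf->vma);

		/* ... rmap/counter handling as in the posted patch ... */

		set_pte_at(vmf->vma->vm_mm, vmf->address, vmf->pte, entry);
	}

and then filemap_map_pages() never has to touch vmf->flags at all:

	do_set_pte(vmf, page, vmf->flags & FAULT_FLAG_WRITE,
		   vmf->address != address);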

I still dislike how we basically randomly modify the information in
that 'vmf' thing.

That said, now it's just a small detail - not really objectionable,
just a "this could be cleaner, I think".

I think it was Kirill who pointed out that we sadly cannot make 'vmf'
read-only anyway, because it does also contain those pre-allocation
details etc (vmf->pte etc) that are very much about what the current
"state" of the fault is. So while I would hope it could be more
read-only than it is, my wish that it could _actually_ be 'const' is
clearly just an irrelevant dream.

Linus

2021-01-08 19:45:53

by Linus Torvalds

Subject: Re: [PATCH v2 0/3] Create 'old' ptes for faultaround mappings on arm64 with hardware access flag

On Fri, Jan 8, 2021 at 11:34 AM Linus Torvalds
<[email protected]> wrote:
>
> Yeah, I think that's a side effect of "now the code really makes a lot
> more sense". Your subsequent patches 2-3 certainly are much simpler
> now

On that note - they could be simpler still if this was just done
entirely unconditionally..

I'm taking your word for "it makes sense", but when you say

On CPUs with hardware AF/DBM, initialising prefaulted PTEs as 'old'
improves vmscan behaviour and does not appear to introduce any overhead.

in the description for patch 3, it makes me wonder how noticeable the
overhead is on the hardware that _does_ take a fault on old pte's..

IOW, it would be lovely to see numbers if you have any like that..

Both ways, actually. Because I also wonder how noticeable the vmscan
improvement is. You say there's no measurable overhead for platforms
with hardware dirty/accessed bits, but maybe there's not a lot of
measurable improvements from a more exact accessed bit either?

Linus

2021-01-11 13:34:16

by Will Deacon

Subject: Re: [PATCH v2 0/3] Create 'old' ptes for faultaround mappings on arm64 with hardware access flag

On Fri, Jan 08, 2021 at 11:34:08AM -0800, Linus Torvalds wrote:
> On Fri, Jan 8, 2021 at 9:15 AM Will Deacon <[email protected]> wrote:
> >
> > The big difference in this version is that I have reworked it based on
> > Kirill's patch which he posted as a follow-up to the original. However,
> > I can't tell where we've landed on that -- Linus seemed to like it, but
> > Hugh was less enthusiastic.
>
> Yeah, I like it, but I have to admit that it had a disturbingly high
> number of small details wrong for several versions. I hope you picked
> up the final version of the code.

I picked the version from here:

https://lore.kernel.org/r/20201229132819.najtavneutnf7ajp@box

and actually, I just noticed that willy spotted a typo in a comment, so
I'll fix that locally as well as adding the above to a 'Link:' tag for
reference.

> At the same time, I do think that the "disturbingly high number of
> issues" was primarily exactly _because_ the old code was so
> incomprehensible, and I think the end result is much cleaner, so I
> still like it.
>
> >I think that my subsequent patches are an
> > awful lot cleaner after the rework
>
> Yeah, I think that's a side effect of "now the code really makes a lot
> more sense". Your subsequent patches 2-3 certainly are much simpler
> now, although I'd be inclined to add an argument to "do_set_pte()"
> that has the "write" and "pretault" bits in it, instead of having to
> modify the 'vmf' structure.

I played with a few different ways of doing this, but I can't say I prefer
them over what I ended up posting. Having a bunch of 'bool' arguments makes
the callers hard to read and brings into question what exactly vmf->flags is
for. I also tried adding a separate 'address' parameter so that vmf->address
is always the real faulting address, but 'address' is the thing to use for
the pte (i.e. prefault is when 'address != vmf->address'). That wasn't too
bad, but it made the normal finish_fault() case look weird.
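
For the record, the 'address' variant I tried looked roughly like this
(reconstructed from memory, so treat it as a sketch rather than the exact
diff I had):

	void do_set_pte(struct vm_fault *vmf, struct page *page,
			unsigned long addr)
	{
		/* vmf->address is always the real faulting address */
		bool prefault = addr != vmf->address;
		pte_t entry;

		flush_icache_page(vmf->vma, page);
		entry = mk_pte(page, vmf->vma->vm_page_prot);

		if (prefault && arch_wants_old_prefaulted_pte())
			entry = pte_mkold(entry);
		else
			entry = pte_sw_mkyoung(entry);

		/* ... the rest as in the posted patch, but keyed off 'addr' ... */
		set_pte_at(vmf->vma->vm_mm, addr, vmf->pte, entry);
	}

which leaves finish_fault() with the slightly odd-looking:

	do_set_pte(vmf, page, vmf->address);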

So I think I'll leave it as-is and see if anybody wants to change it later
on.

Will

2021-01-11 14:07:31

by Will Deacon

Subject: Re: [PATCH v2 0/3] Create 'old' ptes for faultaround mappings on arm64 with hardware access flag

On Fri, Jan 08, 2021 at 11:42:39AM -0800, Linus Torvalds wrote:
> On Fri, Jan 8, 2021 at 11:34 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > Yeah, I think that's a side effect of "now the code really makes a lot
> > more sense". Your subsequent patches 2-3 certainly are much simpler
> > now
>
> On that note - they could be simpler still if this was just done
> entirely unconditionally..
>
> I'm taking your word for "it makes sense", but when you say
>
> On CPUs with hardware AF/DBM, initialising prefaulted PTEs as 'old'
> improves vmscan behaviour and does not appear to introduce any overhead.
>
> in the description for patch 3, it makes me wonder how noticeable the
> overhead is on the hardware that _does_ take a fault on old pte's..
>
> IOW, it would be lovely to see numbers if you have any like that..

[Vinayak -- please chime in if I miss anything here, as you've posted these
numbers before]

The initial posting in 2016 had some numbers based on a 3.18 kernel, which
didn't have support for hardware AF/DBM:

https://lore.kernel.org/lkml/[email protected]

(note that "Kirill's-fix" in the last column was a quick hack and didn't
make the faulting pte young)

So yes, for the cases we care about in Android (where the vmscan behaviour
seems to be the important thing), then this patch makes sense for
non-hardware AF/DBM CPUs too. In either case, we see ~80% reduction in
direct reclaim time according to mmtests [1] and double-digit percentage
reductions in app launch latency (some of this is mentioned in the link
above). The actual fault cost isn't especially relevant.

*However...*

For machines with lots of memory, the increased fault cost when hardware
AF/DBM is not available may well be measurable, and I suspect it would
hurt unixbench (which was the reason behind reverting this on x86 [2],
although I must admit that the diagnosis wasn't particularly satisfactory
[3]). We could run those numbers on arm64 but, due to the wide diversity of
micro-architectures we have to deal with, I would like to keep our options
open to detecting this dynamically anyway, just in case somebody builds a
CPU which struggles in this area.

Cheers,

Will

[1] https://github.com/gormanm/mmtests
[2] https://lore.kernel.org/lkml/[email protected]/
[3] https://lore.kernel.org/lkml/[email protected]/

2021-01-11 14:26:26

by Kirill A. Shutemov

Subject: Re: [PATCH v2 0/3] Create 'old' ptes for faultaround mappings on arm64 with hardware access flag

On Fri, Jan 08, 2021 at 11:34:08AM -0800, Linus Torvalds wrote:
> I still dislike how we basically randomly modify the information in
> that 'vmf' thing.

I wonder if it would be acceptable to pass down to faultaround a copy
of vmf, so it can mess with it without risking corrupting the original one?

We would need to transfer back prealloc_pte, but the rest can be safely
discarded, I believe.
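
Untested sketch of what I mean, showing only the interesting delta in
do_fault_around() (the start_pgoff/end_pgoff calculation and the page
table preallocation would stay exactly as they are today):

	static vm_fault_t do_fault_around(struct vm_fault *vmf)
	{
		struct vm_fault faf;	/* scratch copy handed to ->map_pages() */
		pgoff_t start_pgoff, end_pgoff;
		vm_fault_t ret;

		/* ... compute start_pgoff/end_pgoff and preallocate
		 * vmf->prealloc_pte exactly as we do now ... */

		faf = *vmf;
		ret = vmf->vma->vm_ops->map_pages(&faf, start_pgoff, end_pgoff);

		/* prealloc_pte is the only state worth transferring back */
		vmf->prealloc_pte = faf.prealloc_pte;
		return ret;
	}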

But it's somewhat wasteful of stack. sizeof(vmf) == 96 on x86-64.

--
Kirill A. Shutemov

2021-01-11 14:29:00

by Kirill A. Shutemov

Subject: Re: [PATCH v2 2/3] mm: Allow architectures to request 'old' entries when prefaulting

On Fri, Jan 08, 2021 at 05:15:16PM +0000, Will Deacon wrote:
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c1f2dc89b8a7..0fb9d1714797 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3051,14 +3051,18 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
> if (!pte_none(*vmf->pte))
> goto unlock;
>
> + /* We're about to handle the fault */
> + if (vmf->address == address) {
> + vmf->flags &= ~FAULT_FLAG_PREFAULT;
> + ret = VM_FAULT_NOPAGE;
> + } else {
> + vmf->flags |= FAULT_FLAG_PREFAULT;
> + }
> +

Do we need to restore the original status of the bit once we are done?

--
Kirill A. Shutemov

2021-01-11 14:29:42

by Kirill A. Shutemov

Subject: Re: [PATCH v2 1/3] mm: Cleanup faultaround and finish_fault() codepaths

On Fri, Jan 08, 2021 at 05:15:15PM +0000, Will Deacon wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> alloc_set_pte() has two users with different requirements: in the
> faultaround code, it is called from an atomic context and the PTE page
> table has to be preallocated. finish_fault() can sleep and allocate the
> page table as needed.
>
> PTL locking rules are also strange, hard to follow and overkill for
> finish_fault().
>
> Let's untangle the mess. alloc_set_pte() has gone now. All locking is
> explicit.
>
> The price is some code duplication to handle huge pages in the faultaround
> path, but it should be fine given the overall improvement in readability.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Signed-off-by: Will Deacon <[email protected]>
> ---
> fs/xfs/xfs_file.c | 6 +-
> include/linux/mm.h | 12 ++-
> include/linux/pgtable.h | 11 +++
> mm/filemap.c | 177 ++++++++++++++++++++++++++---------
> mm/memory.c | 199 ++++++++++++----------------------------
> 5 files changed, 213 insertions(+), 192 deletions(-)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 5b0f93f73837..111fe73bb8a7 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1319,17 +1319,19 @@ xfs_filemap_pfn_mkwrite(
> return __xfs_filemap_fault(vmf, PE_SIZE_PTE, true);
> }
>
> -static void
> +static vm_fault_t
> xfs_filemap_map_pages(
> struct vm_fault *vmf,
> pgoff_t start_pgoff,
> pgoff_t end_pgoff)
> {
> struct inode *inode = file_inode(vmf->vma->vm_file);
> + vm_fault_t ret;
>
> xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> - filemap_map_pages(vmf, start_pgoff, end_pgoff);
> + ret = filemap_map_pages(vmf, start_pgoff, end_pgoff);
> xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> + return ret;
> }
>
> static const struct vm_operations_struct xfs_file_vm_ops = {
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ecdf8a8cd6ae..801dd99f733c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -542,8 +542,8 @@ struct vm_fault {
> * is not NULL, otherwise pmd.
> */
> pgtable_t prealloc_pte; /* Pre-allocated pte page table.
> - * vm_ops->map_pages() calls
> - * alloc_set_pte() from atomic context.
> + * vm_ops->map_pages() sets up a page
> + * table from from atomic context.
> * do_fault_around() pre-allocates
> * page table to avoid allocation from
> * atomic context.

The typo Matthew has pointed out:

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d3d4e307fa09..358fc8616d8b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -535,7 +535,7 @@ struct vm_fault {
*/
pgtable_t prealloc_pte; /* Pre-allocated pte page table.
* vm_ops->map_pages() sets up a page
- * table from from atomic context.
+ * table from atomic context.
* do_fault_around() pre-allocates
* page table to avoid allocation from
* atomic context.
--
Kirill A. Shutemov

2021-01-11 14:31:16

by Will Deacon

Subject: Re: [PATCH v2 1/3] mm: Cleanup faultaround and finish_fault() codepaths

On Mon, Jan 11, 2021 at 05:26:20PM +0300, Kirill A. Shutemov wrote:
> On Fri, Jan 08, 2021 at 05:15:15PM +0000, Will Deacon wrote:
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index ecdf8a8cd6ae..801dd99f733c 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -542,8 +542,8 @@ struct vm_fault {
> > * is not NULL, otherwise pmd.
> > */
> > pgtable_t prealloc_pte; /* Pre-allocated pte page table.
> > - * vm_ops->map_pages() calls
> > - * alloc_set_pte() from atomic context.
> > + * vm_ops->map_pages() sets up a page
> > + * table from from atomic context.
> > * do_fault_around() pre-allocates
> > * page table to avoid allocation from
> > * atomic context.
>
> The typo Matthew has pointed out:
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index d3d4e307fa09..358fc8616d8b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -535,7 +535,7 @@ struct vm_fault {
> */
> pgtable_t prealloc_pte; /* Pre-allocated pte page table.
> * vm_ops->map_pages() sets up a page
> - * table from from atomic context.
> + * table from atomic context.
> * do_fault_around() pre-allocates
> * page table to avoid allocation from
> * atomic context.

Cheers, fixed locally and will include for v3!

Will

2021-01-11 14:41:39

by Will Deacon

Subject: Re: [PATCH v2 2/3] mm: Allow architectures to request 'old' entries when prefaulting

On Mon, Jan 11, 2021 at 05:25:33PM +0300, Kirill A. Shutemov wrote:
> On Fri, Jan 08, 2021 at 05:15:16PM +0000, Will Deacon wrote:
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index c1f2dc89b8a7..0fb9d1714797 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -3051,14 +3051,18 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
> > if (!pte_none(*vmf->pte))
> > goto unlock;
> >
> > + /* We're about to handle the fault */
> > + if (vmf->address == address) {
> > + vmf->flags &= ~FAULT_FLAG_PREFAULT;
> > + ret = VM_FAULT_NOPAGE;
> > + } else {
> > + vmf->flags |= FAULT_FLAG_PREFAULT;
> > + }
> > +
>
> Do we need to restore the original status of the bit once we are done?

I can certainly add that, although it doesn't look like we do that for
vmf->pte, so it's hard to tell what the rules are here. It certainly feels
odd to restore some fields but not others, as it looks like vmf->address
will be out-of-whack with vmf->pte when filemap_map_pages() returns. Am I
missing something?

Will

2021-01-11 14:50:56

by Kirill A. Shutemov

Subject: Re: [PATCH v2 2/3] mm: Allow architectures to request 'old' entries when prefaulting

On Mon, Jan 11, 2021 at 02:37:42PM +0000, Will Deacon wrote:
> On Mon, Jan 11, 2021 at 05:25:33PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Jan 08, 2021 at 05:15:16PM +0000, Will Deacon wrote:
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index c1f2dc89b8a7..0fb9d1714797 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -3051,14 +3051,18 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
> > > if (!pte_none(*vmf->pte))
> > > goto unlock;
> > >
> > > + /* We're about to handle the fault */
> > > + if (vmf->address == address) {
> > > + vmf->flags &= ~FAULT_FLAG_PREFAULT;
> > > + ret = VM_FAULT_NOPAGE;
> > > + } else {
> > > + vmf->flags |= FAULT_FLAG_PREFAULT;
> > > + }
> > > +
> >
> > Do we need to restore the original status of the bit once we are done?
>
> I can certainly add that, although it doesn't look like we do that for
> vmf->pte, so it's hard to tell what the rules are here. It certainly feels
> odd to restore some fields but not others, as it looks like vmf->address
> will be out-of-whack with vmf->pte when filemap_map_pages() returns. Am I
> missing something?

Unlike vmf->flags or vmf->address, vmf->pte is not going to be reused.
finish_fault() will overwrite it.

Yeah, it's messy.

--
Kirill A. Shutemov

2021-01-11 19:28:51

by Linus Torvalds

Subject: Re: [PATCH v2 0/3] Create 'old' ptes for faultaround mappings on arm64 with hardware access flag

On Mon, Jan 11, 2021 at 6:24 AM Kirill A. Shutemov <[email protected]> wrote:
>
> I wonder if it would be acceptable to pass down to faultaround a copy
> of vmf, so it can mess with it without risking corrupting the original one?

I'd almost prefer to split vmf into two parts: the 'this is the fault
info' part and the 'this is the fault handling state' part.

So the first one would be filled in by the actual page faulter (or
GUP) - and then be 'const' during the lookup, while the second one
would be set up by handle_mm_fault() and would contain that "this is
the current state of my fault state machine" and contain things like
that ->pte thing.

And then if somebody actually needs to pass in "modified fault state"
(ie that whole "I'm doing fault-around, so I'll use multiple
addresses") they'd never modify the address in the fault info, they'd
just pass the address as an explicit argument (like most cases already
do - the "change addr or flags in vmf" is actually already _fairly_
rare).
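
To be concrete, I'm hand-waving about something along these lines (the
struct names are made up and the field lists are only indicative):

	struct vm_fault_info {		/* filled in by the faulter/GUP, then const */
		struct vm_area_struct *vma;
		unsigned long address;	/* the real faulting address */
		unsigned int flags;
		pgoff_t pgoff;
	};

	struct vm_fault_state {		/* owned by the handle_mm_fault() machinery */
		pud_t *pud;
		pmd_t *pmd;
		pte_t *pte;
		spinlock_t *ptl;
		pgtable_t prealloc_pte;
		struct page *cow_page;
		struct page *page;
		pte_t orig_pte;
	};

	/* fault-around would then take its per-page address explicitly: */
	void do_set_pte(const struct vm_fault_info *info,
			struct vm_fault_state *state,
			struct page *page, unsigned long addr);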

Linus

2021-01-11 21:06:38

by Hugh Dickins

Subject: Re: [PATCH v2 0/3] Create 'old' ptes for faultaround mappings on arm64 with hardware access flag

On Mon, 11 Jan 2021, Will Deacon wrote:
> On Fri, Jan 08, 2021 at 11:34:08AM -0800, Linus Torvalds wrote:
> > On Fri, Jan 8, 2021 at 9:15 AM Will Deacon <[email protected]> wrote:
> > >
> > > The big difference in this version is that I have reworked it based on
> > > Kirill's patch which he posted as a follow-up to the original. However,
> > > I can't tell where we've landed on that -- Linus seemed to like it, but
> > > Hugh was less enthusiastic.
> >
> > Yeah, I like it, but I have to admit that it had a disturbingly high
> > number of small details wrong for several versions. I hope you picked
> > up the final version of the code.
>
> I picked the version from here:
>
> https://lore.kernel.org/r/20201229132819.najtavneutnf7ajp@box
>
> and actually, I just noticed that willy spotted a typo in a comment, so
> I'll fix that locally as well as adding the above to a 'Link:' tag for
> reference.
>
> > At the same time, I do think that the "disturbingly high number of
> > issues" was primarily exactly _because_ the old code was so
> > incomprehensible, and I think the end result is much cleaner, so I
> > still like it.

Just to report that I gave this v2 set a spin on a few (x86_64 and i386)
machines, and found nothing objectionable this time around.

And the things that I'm unenthusiastic about are exactly those details
that you and Kirill and Linus find unsatisfactory, but awkward to
eliminate: expect no new insights from me!

Hugh

2021-01-13 02:52:10

by Linus Torvalds

Subject: Re: [PATCH v2 0/3] Create 'old' ptes for faultaround mappings on arm64 with hardware access flag

On Tue, Jan 12, 2021 at 1:48 PM Will Deacon <[email protected]> wrote:
>
> > And then if somebody actually needs to pass in "modified fault state"
> > (ie that whole "I'm doing fault-around, so I'll use multiple
> > addresses") they'd never modify the address in the fault info, they'd
> > just pass the address as an explicit argument (like most cases already
> > do - the "change addr or flags in vmf" is actually already _fairly_
> > rare).
>
> Alright then, I'll take another crack at this for v3 and see how far I
> get.

But please do keep it as a separate patch, and if it turns out to be
ugly for whatever reason, treat it as just a "maybe that would help"
suggestion rather than anything stronger..

Linus

2021-01-13 04:07:17

by Will Deacon

Subject: Re: [PATCH v2 0/3] Create 'old' ptes for faultaround mappings on arm64 with hardware access flag

On Mon, Jan 11, 2021 at 01:03:29PM -0800, Hugh Dickins wrote:
> On Mon, 11 Jan 2021, Will Deacon wrote:
> > On Fri, Jan 08, 2021 at 11:34:08AM -0800, Linus Torvalds wrote:
> > > On Fri, Jan 8, 2021 at 9:15 AM Will Deacon <[email protected]> wrote:
> > > >
> > > > The big difference in this version is that I have reworked it based on
> > > > Kirill's patch which he posted as a follow-up to the original. However,
> > > > I can't tell where we've landed on that -- Linus seemed to like it, but
> > > > Hugh was less enthusiastic.
> > >
> > > Yeah, I like it, but I have to admit that it had a disturbingly high
> > > number of small details wrong for several versions. I hope you picked
> > > up the final version of the code.
> >
> > I picked the version from here:
> >
> > https://lore.kernel.org/r/20201229132819.najtavneutnf7ajp@box
> >
> > and actually, I just noticed that willy spotted a typo in a comment, so
> > I'll fix that locally as well as adding the above to a 'Link:' tag for
> > reference.
> >
> > > At the same time, I do think that the "disturbingly high number of
> > > issues" was primarily exactly _because_ the old code was so
> > > incomprehensible, and I think the end result is much cleaner, so I
> > > still like it.
>
> Just to report that I gave this v2 set a spin on a few (x86_64 and i386)
> machines, and found nothing objectionable this time around.

Thanks, Hugh.

> And the things that I'm unenthusiastic about are exactly those details
> that you and Kirill and Linus find unsatisfactory, but awkward to
> eliminate: expect no new insights from me!

Well, I'll keep you on CC for v3 -- just in case!

Will

2021-01-13 04:08:17

by Will Deacon

Subject: Re: [PATCH v2 0/3] Create 'old' ptes for faultaround mappings on arm64 with hardware access flag

On Mon, Jan 11, 2021 at 11:25:37AM -0800, Linus Torvalds wrote:
> On Mon, Jan 11, 2021 at 6:24 AM Kirill A. Shutemov <[email protected]> wrote:
> >
> > I wonder if it would be acceptable to pass down to faultaround a copy
> > of vmf, so it can mess with it without risking corrupting the original one?
>
> I'd almost prefer to split vmf into two parts: the 'this is the fault
> info' part and the 'this is the fault handling state' part.
>
> So the first one would be filled in by the actual page faulter (or
> GUP) - and then be 'const' during the lookup, while the second one
> would be set up by handle_mm_fault() and would contain that "this is
> the current state of my fault state machine" and contain things like
> that ->pte thing.
>
> And then if somebody actually needs to pass in "modified fault state"
> (ie that whole "I'm doing fault-around, so I'll use multiple
> addresses") they'd never modify the address in the fault info, they'd
> just pass the address as an explicit argument (like most cases already
> do - the "change addr or flags in vmf" is actually already _fairly_
> rare).

Alright then, I'll take another crack at this for v3 and see how far I
get.

Will