2021-04-07 13:52:31

by Michel Lespinasse

Subject: [RFC PATCH 00/37] Speculative page faults


This patchset is my take on speculative page faults (spf).
It builds on ideas that have been previously proposed by Laurent Dufour,
Peter Zijlstra and others before. While Laurent's previous proposal
was rejected around the time of LSF/MM 2019, I am hoping we can revisit
this now based on what I think is a simpler and more bisectable approach,
much improved scaling numbers in the anonymous vma case, and the Android
use case that has since emerged. I will expand on these points towards
the end of this message.


Classical page fault processing takes the mmap read lock in order to
prevent races with mmap writers. In contrast, speculative fault
processing does not take the mmap read lock, and instead verifies,
when the results of the page fault are about to get committed and
become visible to other threads, that no mmap writers have been
running concurrently with the page fault. If the check fails,
speculative updates do not get committed and the fault is retried
in the usual, non-speculative way (with the mmap read lock held).

The concurrency check is implemented using a per-mm mmap sequence count.
The counter is incremented at the beginning and end of each mmap write
operation. If the counter is initially observed to have an even value,
and has the same value later on, the observer can deduce that no mmap
writers have been running concurrently with it between those two times.
This is similar to a seqlock, except that readers never spin on the
counter value (they would instead revert to taking the mmap read lock),
and writers are allowed to sleep. One benefit of this approach is that
it requires no writer side changes, just some hooks in the mmap write
lock APIs that writers already use.
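
Roughly, the counter protocol looks as follows. This is a simplified
sketch rather than the exact implementation from commit 9: the read-side
helpers mmap_seq_read_start() / mmap_seq_read_check() do appear later in
the series, but the write-side hook names, the mm->mmap_seq field name
and the memory barriers shown here are illustrative only.

/* Write side: hooked into mmap_write_lock() / mmap_write_unlock(). */
static inline void mmap_seq_write_begin(struct mm_struct *mm)
{
        mm->mmap_seq++;         /* counter becomes odd: writer active */
        smp_wmb();
}

static inline void mmap_seq_write_end(struct mm_struct *mm)
{
        smp_wmb();
        mm->mmap_seq++;         /* counter becomes even again */
}

/* Read side: used by the speculative fault path, never spins. */
static inline unsigned long mmap_seq_read_start(struct mm_struct *mm)
{
        /* An odd value tells the caller a writer is currently active. */
        return smp_load_acquire(&mm->mmap_seq);
}

static inline bool mmap_seq_read_check(struct mm_struct *mm, unsigned long seq)
{
        /* Order the speculative reads before the recheck. */
        smp_rmb();
        return READ_ONCE(mm->mmap_seq) == seq;
}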

The first step of a speculative page fault is to look up the vma and
read its contents (currently by making a copy of the vma, though in
principle it would be sufficient to only read the vma attributes that
are used in page faults). The mmap sequence count is used to verify
that there were no mmap writers concurrent to the lookup and copy steps.
Note that walking rbtrees while there may potentially be concurrent
writers is not an entirely new idea in linux, as latched rbtrees
are already doing this. This is safe as long as the lookup is
followed by a sequence check to verify that concurrency did not
actually occur (and abort the speculative fault if it did).
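
Condensed, the speculative entry sequence looks roughly like this
(simplified sketch of the x86 do_user_addr_fault() changes; the exact
do_handle_mm_fault() argument list and the vma snapshot details are
abbreviated here):

        seq = mmap_seq_read_start(mm);
        if (seq & 1)                    /* an mmap writer is currently active */
                goto spf_abort;
        rcu_read_lock();
        vma = find_vma(mm, address);    /* lockless rbtree walk */
        if (!vma || vma->vm_start > address) {
                rcu_read_unlock();
                goto spf_abort;
        }
        pvma = *vma;                    /* snapshot the vma contents */
        rcu_read_unlock();
        if (!mmap_seq_read_check(mm, seq))
                goto spf_abort;         /* a writer raced with us */
        fault = do_handle_mm_fault(&pvma, address,
                                   flags | FAULT_FLAG_SPECULATIVE, seq, regs);
        if (fault & VM_FAULT_RETRY)
                goto spf_abort;         /* fall back to the mmap read lock path */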

The next step is to walk down the existing page table tree to find the
current pte entry. This is done with interrupts disabled to avoid
races with munmap(). Again, not an entirely new idea, as this repeats
a pattern already present in fast GUP. Similar precautions are also
taken when taking the page table lock.
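
In code, this is essentially the fast GUP pattern (simplified sketch; the
real version lives in __handle_mm_fault() and __pte_map_lock() in this
series, with more careful error handling):

        local_irq_disable();    /* keeps the page tables from being freed
                                 * under us, as in fast GUP (see mm/gup.c) */
        if (!mmap_seq_read_check(mm, seq))
                goto abort;     /* a writer raced with us */
        pmdval = READ_ONCE(*vmf.pmd);
        if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
                goto abort;     /* only speculate on existing pte page tables */
        vmf.pte = pte_offset_map(vmf.pmd, address);
        vmf.orig_pte = READ_ONCE(*vmf.pte);
        local_irq_enable();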

In the case of mapped files, additional precautions must be taken when
dereferencing vma->vm_file. The vma copy that is taken at the start of
the speculative page fault includes a pointer to the file, but does not
take a reference to it. The reference is held by the original vma.
In order to guarantee the validity of the file through the vm_ops->fault()
and/or vm_ops->map_pages() calls, these are run within an rcu read locked
code section, with the mmap sequence count verified at the start of the
section and the vma file's fput() rcu-deferred when vmas are freed.
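
Schematically (simplified from the __do_fault() / do_fault_around()
changes later in the series):

        rcu_read_lock();
        if (!mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq)) {
                rcu_read_unlock();
                return VM_FAULT_RETRY;  /* a writer raced with us */
        }
        /*
         * The sequence check proves the vma was still current when we
         * entered this rcu section, and vm_area_free() only drops its
         * vm_file reference after an rcu grace period, so vma->vm_file
         * stays valid for the duration of the call.
         */
        ret = vmf->vma->vm_ops->fault(vmf);
        rcu_read_unlock();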


The patch series applies on top of linux v5.12-rc5;
a git tree is also available:
git fetch https://github.com/lespinasse/linux.git v5.12-rc5-spf


Commits 1 to 4 are preparatory cleanups
(which I would like to push regardless of what happens to the rest)

Commits 5 and 6 introduce CONFIG_SPECULATIVE_PAGE_FAULT and let us
enable it on x86 so we can test the new code as it gets introduced.

Commits 7 and 8 extend handle_mm_fault() so it can be used for
speculative faults; initially these always abort with VM_FAULT_RETRY.

Commits 9 to 14 introduce all the basic building blocks for speculative
page faults:
- Commit 9 adds the mmap sequence count that will be used for detecting
when writers have been running concurrently with an spf attempt
(in which case the attempt will be aborted);
- Commit 10 adds RCU safe vma freeing;
- Commit 11 does a lockless VMA lookup and starts the spf handling attempt;
- (Commit 12 is a small refactor preparing for the next commit);
- Commit 13 walks down the existing page tables, carefully avoiding
races with potential writers (munmap in particular)
- Commit 14 introduces pte_map_lock() and pte_spinlock(), which attempt
to (optionally map and) lock an existing page table when it's time to
commit page fault results to it.

Commits 15 to 22 progressively implement spf for most anon vma cases.
This mostly comes down to using the pte_map_lock() and pte_spinlock()
APIs where needed, and making sure to abort speculation in unsupported
cases (mostly anon_vma allocation and userfaultfd). The work is split
in small steps so that changes can be tested soon after they are added.
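
The recurring pattern in these commits is the following (sketch, taken
from the do_anonymous_page() changes included below):

        /* Abort speculation for cases that are not handled yet. */
        if (unlikely(!vma->anon_vma)) {
                if (vmf->flags & FAULT_FLAG_SPECULATIVE)
                        return VM_FAULT_RETRY;  /* retry with the mmap lock */
                if (__anon_vma_prepare(vma))
                        goto oom;
        }

        /* Commit: map and lock the pte, revalidating the sequence count. */
        if (!pte_map_lock(vmf)) {
                ret = VM_FAULT_RETRY;
                goto release;
        }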

Commits 23 to 26 introduce the main ideas for extending spf to file
mapped vmas:
- Commit 23 adds RCU safe vma->vm_file freeing;
- Commit 24 uses RCU to ensure vma->vm_file stays valid through the
vm_ops->fault() method call;
- Commit 25 implements the speculative case for filemap_fault().
This only handles the common case where the desired page is already
cached, is not locked and where we don't need to trigger readahead.
- Commit 26 ensures we use pte_map_lock() to commit the faulted page to the
mm address space, verifying that no mmap writers ran during the fault.

Commits 27 and 28 use the same ideas to implement speculative fault-around.

Commits 29 to 32 complete spf support for most mapped file types.
A list of supported file types is introduced, and we allow speculative
execution through the new speculative file faulting paths for those files.
Note that the speculative support does not extend to shared file
mapping writes in this patchset, as this would require some support
for vm_ops->page_mkwrite() which isn't handled yet.

Commits 33 and 34 disable speculative handling for single-threaded
userspace. This is for (minor) performance tuning and is pushed
towards the end of the series to make it easier to exercise the spf
paths as they are introduced.

Commit 35 adds some extra statistics. I am not pushing for inclusion of
these, but I thought they might come in handy when discussing spf performance.

Commits 36 and 37 add spf support on the arm64 architecture. It should
be easy to add other architectures too, given access to test machines.


As Laurent's previous proposal before LSF/MM 2019 was rejected for
complexity reasons, I am hoping that some of the changes I made will
help address these concerns:

- First, the patchset is structured to be bisectable. Every few commits
the new code gets enabled, which makes for easier testing and also,
I think, should make it easier for reviewers to understand how the
various commits relate to each other.

- This patchset requires no changes to mmap writers other than instrumenting
the mmap write lock APIs.

- The fault handler operates on a stable copy of the vma, so it does not
require any special care to avoid the possibility of vma fields being
modified concurrently with it.


mmtests results on HP Z840 workstation
(2 socket Xeon E5-2690 v3 @ 2.60GHz, 24 cores / 48 threads total):
See https://www.lespinasse.org/kernel/v5.12-rc5-spf/mmtests/

Highlights from the above:

- pft results show some pretty damn good scalability. Throughput using
all 48 threads (24 cores) is 24x that of single-threaded tests, and
3.7x higher than when using the baseline kernel. This is the result
of avoiding writing into any shared cache lines, be it for mmap lock
or vma refcounting, during speculative page faults.
(Experiments show that adding a single atomic variable per vma,
which would be incremented and decremented before and after spf
copies the vma, would decrease the 48-thread throughput by 62%).

- wis-pf page_fault1 (anon THP) shows some moderate improvement.

- wis-pf page_fault2 (private file mapping write) shows some
improvement, though much less than in the pft (anon) case.
The performance is limited by LRU contention here.

- wis-pf page_fault3 (shared file mapping write) gets a bit slower.
This patchset does not implement speculative handling for this case.

- wis-mmap likes the change, even though it doesn't do many page faults.
This seems to be a side effect of rcu safe vma freeing reducing vma
reuse between threads running on separate CPUs.

- wis-malloc benefits from a mix of the anon vma and rcu effects.

- macro benchmarks are mostly performance neutral, with some small
benefit in limited cases.


Another potential motivation for this is the Android use case.
Several Android vendors have picked up the previous SPF proposal and
included it on their devices because it reduces application start-up
times, which is an important performance metric for them.

Early Android test results with this patchset (actually a backport of
it to the Android 5.10 kernel) show that it is also effective in
reducing application start times, similar to what the prior SPF
proposal did. I do not have a direct comparison available though,
as our port of Laurent's prior SPF series to the 5.10 android kernel
is showing some instabilities. SPF abort rates are in the low single
digit percentage range, and only a few of these aborts are caused by the
fact that we use just a single global counter to detect fault vs writer
concurrency.
We are seeing launch time improvements for a variety of Android packages
that use a large number of threads during startup:
Meituan: 15%
Iqiyi: 8%
Tiktok: 4%
YouTube: 3%
I expect Suren may have more Android numbers from partners in a few days.

ChromeOS measurements with the previous SPF patchset also showed
significant reduction in page fault, mmap and munmap times. We do not
have data on this new SPF proposal yet, though.


I would like any feedback on the patchset, and most particularly about:
- Has the complexity issue in previous SPF proposals been sufficiently
addressed?
- Are the performance benefits compelling enough? I think the answer is
yes for the anon case, but maybe not as clear for the file cases.
- Is the Android use case compelling enough to merge the entire patchset?
- Can we use this as a foundation for other mmap scalability work?
I hear several proposals involving the idea of RCU based fault handling,
and hope this proposal might serve as a building block for these?


Michel Lespinasse (37):
mmap locking API: mmap_lock_is_contended returns a bool
mmap locking API: name the return values
do_anonymous_page: use update_mmu_tlb()
do_anonymous_page: reduce code duplication
mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT
x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
mm: add FAULT_FLAG_SPECULATIVE flag
mm: add do_handle_mm_fault()
mm: add per-mm mmap sequence counter for speculative page fault handling.
mm: rcu safe vma freeing
x86/mm: attempt speculative mm faults first
mm: refactor __handle_mm_fault() / handle_pte_fault()
mm: implement speculative handling in __handle_mm_fault().
mm: add pte_map_lock() and pte_spinlock()
mm: implement speculative handling in do_anonymous_page()
mm: enable speculative fault handling through do_anonymous_page()
mm: implement speculative handling in do_numa_page()
mm: enable speculative fault handling in do_numa_page()
mm: implement speculative handling in wp_page_copy()
mm: implement and enable speculative fault handling in handle_pte_fault()
mm: implement speculative handling in do_swap_page()
mm: enable speculative fault handling through do_swap_page()
mm: rcu safe vma->vm_file freeing
mm: implement speculative handling in __do_fault()
mm: implement speculative handling in filemap_fault()
mm: implement speculative fault handling in finish_fault()
mm: implement speculative handling in do_fault_around()
mm: implement speculative handling in filemap_map_pages()
fs: list file types that support speculative faults.
mm: enable speculative fault handling for supported file types.
ext4: implement speculative fault handling
f2fs: implement speculative fault handling
mm: enable speculative fault handling only for multithreaded user space
mm: rcu safe vma freeing only for multithreaded user space
mm: spf statistics
arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
arm64/mm: attempt speculative mm faults first

arch/arm64/Kconfig | 1 +
arch/arm64/mm/fault.c | 52 +++
arch/x86/Kconfig | 1 +
arch/x86/mm/fault.c | 51 +++
fs/btrfs/file.c | 1 +
fs/cifs/file.c | 1 +
fs/exec.c | 1 +
fs/ext4/file.c | 1 +
fs/ext4/inode.c | 7 +-
fs/f2fs/file.c | 8 +-
fs/fuse/file.c | 1 +
fs/nfs/file.c | 1 +
fs/ubifs/file.c | 1 +
fs/vboxsf/file.c | 1 +
fs/xfs/xfs_file.c | 3 +
include/linux/mm.h | 76 +++-
include/linux/mm_types.h | 20 +-
include/linux/mmap_lock.h | 109 ++++--
include/linux/vm_event_item.h | 28 ++
include/linux/vmstat.h | 6 +
kernel/fork.c | 26 +-
mm/Kconfig | 22 ++
mm/Kconfig.debug | 7 +
mm/filemap.c | 73 +++-
mm/memory.c | 634 ++++++++++++++++++++++++----------
mm/mmap.c | 11 +-
mm/nommu.c | 6 +-
mm/vmstat.c | 28 ++
28 files changed, 942 insertions(+), 235 deletions(-)

--
2.20.1


2021-04-07 13:53:35

by Michel Lespinasse

Subject: [RFC PATCH 01/37] mmap locking API: mmap_lock_is_contended returns a bool

Change mmap_lock_is_contended to return a bool value, rather than an
int which the callers are then supposed to interpret as a bool. This
is to ensure consistency with other mmap lock API functions (such as
the trylock functions).

Signed-off-by: Michel Lespinasse <[email protected]>
---
include/linux/mmap_lock.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 0540f0156f58..4e27f755766b 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -171,9 +171,9 @@ static inline void mmap_assert_write_locked(struct mm_struct *mm)
VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
}

-static inline int mmap_lock_is_contended(struct mm_struct *mm)
+static inline bool mmap_lock_is_contended(struct mm_struct *mm)
{
- return rwsem_is_contended(&mm->mmap_lock);
+ return rwsem_is_contended(&mm->mmap_lock) != 0;
}

#endif /* _LINUX_MMAP_LOCK_H */
--
2.20.1

2021-04-07 13:53:49

by Michel Lespinasse

Subject: [RFC PATCH 33/37] mm: enable speculative fault handling only for multithreaded user space

Performance tuning: single threaded userspace does not benefit from
speculative page faults, so we turn them off to avoid any related
(small) extra overheads.

Signed-off-by: Michel Lespinasse <[email protected]>
---
arch/x86/mm/fault.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 48b86911a6df..b1a07ca82d59 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1318,6 +1318,10 @@ void do_user_addr_fault(struct pt_regs *regs,
}
#endif

+ /* Only try spf for multithreaded user space faults. */
+ if (!(flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
+ goto no_spf;
+
count_vm_event(SPF_ATTEMPT);
seq = mmap_seq_read_start(mm);
if (seq & 1)
@@ -1351,6 +1355,7 @@ void do_user_addr_fault(struct pt_regs *regs,

spf_abort:
count_vm_event(SPF_ABORT);
+no_spf:

/*
* Kernel-mode access to the user address space should only occur
--
2.20.1

2021-04-07 14:11:33

by Matthew Wilcox

Subject: Re: [RFC PATCH 33/37] mm: enable speculative fault handling only for multithreaded user space

On Tue, Apr 06, 2021 at 06:44:58PM -0700, Michel Lespinasse wrote:
> + /* Only try spf for multithreaded user space faults. */

This comment is misleading ... mm_users will also be incremented for
ptraced programs as well as programs that are having their /proc/$pid/maps
examined, etc. Maybe:

/* No need to try spf for single-threaded programs */

Also, please, can we not use an acronym for this feature? It's not a
speculative page fault. The page fault is really happening. We're
trying to handle it under RCU protection (if anything the faultaround
code is the speculative page fault code ...) This is unlocked page
fault handling, perhaps?

> + if (!(flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
> + goto no_spf;
> +
> count_vm_event(SPF_ATTEMPT);
> seq = mmap_seq_read_start(mm);
> if (seq & 1)
> @@ -1351,6 +1355,7 @@ void do_user_addr_fault(struct pt_regs *regs,
>
> spf_abort:
> count_vm_event(SPF_ABORT);
> +no_spf:
>
> /*
> * Kernel-mode access to the user address space should only occur
> --
> 2.20.1
>
>

2021-04-07 16:50:59

by Michel Lespinasse

Subject: [RFC PATCH 22/37] mm: enable speculative fault handling through do_swap_page()

Change do_swap_page() to allow speculative fault execution to proceed.

Signed-off-by: Michel Lespinasse <[email protected]>
---
mm/memory.c | 5 -----
1 file changed, 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index ab3160719bf3..6eddd7b4e89c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3340,11 +3340,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
vm_fault_t ret = 0;
void *shadow = NULL;

- if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
- pte_unmap(vmf->pte);
- return VM_FAULT_RETRY;
- }
-
#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPTION)
if (sizeof(pte_t) > sizeof(unsigned long)) {
/*
--
2.20.1

2021-04-07 16:56:21

by Michel Lespinasse

Subject: [RFC PATCH 06/37] x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT so that the speculative fault
handling code can be compiled on this architecture.

Signed-off-by: Michel Lespinasse <[email protected]>
---
arch/x86/Kconfig | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2792879d398e..a93e4ed7040e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -33,6 +33,7 @@ config X86_64
select NEED_DMA_MAP_STATE
select SWIOTLB
select ARCH_HAS_ELFCORE_COMPAT
+ select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

config FORCE_DYNAMIC_FTRACE
def_bool y
--
2.20.1

2021-04-07 16:56:21

by Michel Lespinasse

Subject: [RFC PATCH 15/37] mm: implement speculative handling in do_anonymous_page()

Change do_anonymous_page() to handle the speculative case.
This involves aborting speculative faults if they have to allocate a new
anon_vma, and using pte_map_lock() instead of pte_offset_map_lock()
to complete the page fault.

Signed-off-by: Michel Lespinasse <[email protected]>
---
mm/memory.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 3e192d5f89a6..fd84576f9c01 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3598,8 +3598,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
vma->vm_page_prot));
} else {
/* Allocate our own private page. */
- if (unlikely(anon_vma_prepare(vma)))
- goto oom;
+ if (unlikely(!vma->anon_vma)) {
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+ return VM_FAULT_RETRY;
+ if (__anon_vma_prepare(vma))
+ goto oom;
+ }
page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
if (!page)
goto oom;
@@ -3620,8 +3624,10 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
entry = pte_mkwrite(pte_mkdirty(entry));
}

- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
- &vmf->ptl);
+ if (!pte_map_lock(vmf)) {
+ ret = VM_FAULT_RETRY;
+ goto release;
+ }
if (!pte_none(*vmf->pte)) {
update_mmu_tlb(vma, vmf->address, vmf->pte);
goto unlock;
@@ -3636,6 +3642,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
pte_unmap_unlock(vmf->pte, vmf->ptl);
if (page)
put_page(page);
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+ return VM_FAULT_RETRY;
return handle_userfault(vmf, VM_UFFD_MISSING);
}

@@ -3653,6 +3661,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
return 0;
unlock:
pte_unmap_unlock(vmf->pte, vmf->ptl);
+release:
if (page)
put_page(page);
return ret;
--
2.20.1

2021-04-07 16:56:48

by Michel Lespinasse

Subject: [RFC PATCH 20/37] mm: implement and enable speculative fault handling in handle_pte_fault()

In handle_pte_fault(), allow speculative execution to proceed.

Use pte_spinlock() to validate the mmap sequence count when locking
the page table.

If speculative execution proceeds through do_wp_page(), ensure that we
end up in the wp_page_reuse() or wp_page_copy() paths, rather than
wp_pfn_shared() or wp_page_shared() (both unreachable as we only
handle anon vmas so far) or handle_userfault() (needs an explicit
abort to handle non-speculatively).

Signed-off-by: Michel Lespinasse <[email protected]>
---
mm/memory.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 547d9d0ee962..fc555fae0844 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3103,6 +3103,7 @@ static vm_fault_t wp_pfn_shared(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;

+ VM_BUG_ON(vmf->flags & FAULT_FLAG_SPECULATIVE);
if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
vm_fault_t ret;

@@ -3123,6 +3124,8 @@ static vm_fault_t wp_page_shared(struct vm_fault *vmf)
struct vm_area_struct *vma = vmf->vma;
vm_fault_t ret = VM_FAULT_WRITE;

+ VM_BUG_ON(vmf->flags & FAULT_FLAG_SPECULATIVE);
+
get_page(vmf->page);

if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
@@ -3176,6 +3179,8 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)

if (userfaultfd_pte_wp(vma, *vmf->pte)) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+ return VM_FAULT_RETRY;
return handle_userfault(vmf, VM_UFFD_WP);
}

@@ -4366,13 +4371,8 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
return do_numa_page(vmf);

- if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
- pte_unmap(vmf->pte);
+ if (!pte_spinlock(vmf))
return VM_FAULT_RETRY;
- }
-
- vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
- spin_lock(vmf->ptl);
entry = vmf->orig_pte;
if (unlikely(!pte_same(*vmf->pte, entry))) {
update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
--
2.20.1

2021-04-07 16:56:48

by Michel Lespinasse

Subject: [RFC PATCH 14/37] mm: add pte_map_lock() and pte_spinlock()

pte_map_lock() and pte_spinlock() are used by fault handlers to ensure
the pte is mapped and locked before they commit the faulted page to the
mm's address space at the end of the fault.

The functions differ in their preconditions; pte_map_lock() expects
the pte to be unmapped prior to the call, while pte_spinlock() expects
it to be already mapped.

In the speculative fault case, the functions verify, after locking the pte,
that the mmap sequence count has not changed since the start of the fault,
and thus that no mmap lock writers have been running concurrently with
the fault. After that point the page table lock serializes any further
races with concurrent mmap lock writers.

If the mmap sequence count check fails, both functions will return false
with the pte being left unmapped and unlocked.

Signed-off-by: Michel Lespinasse <[email protected]>
---
include/linux/mm.h | 34 ++++++++++++++++++++++
mm/memory.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 105 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index dee8a4833779..f26490aff514 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3183,5 +3183,39 @@ extern int sysctl_nr_trim_pages;

void mem_dump_obj(void *object);

+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+
+bool __pte_map_lock(struct vm_fault *vmf);
+
+static inline bool pte_map_lock(struct vm_fault *vmf)
+{
+ VM_BUG_ON(vmf->pte);
+ return __pte_map_lock(vmf);
+}
+
+static inline bool pte_spinlock(struct vm_fault *vmf)
+{
+ VM_BUG_ON(!vmf->pte);
+ return __pte_map_lock(vmf);
+}
+
+#else /* !CONFIG_SPECULATIVE_PAGE_FAULT */
+
+static inline bool pte_map_lock(struct vm_fault *vmf)
+{
+ vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address,
+ &vmf->ptl);
+ return true;
+}
+
+static inline bool pte_spinlock(struct vm_fault *vmf)
+{
+ vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+ spin_lock(vmf->ptl);
+ return true;
+}
+
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
diff --git a/mm/memory.c b/mm/memory.c
index a17704aac019..3e192d5f89a6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2566,6 +2566,77 @@ int apply_to_existing_page_range(struct mm_struct *mm, unsigned long addr,
}
EXPORT_SYMBOL_GPL(apply_to_existing_page_range);

+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+
+bool __pte_map_lock(struct vm_fault *vmf)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ pmd_t pmdval;
+#endif
+ pte_t *pte = vmf->pte;
+ spinlock_t *ptl;
+
+ if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
+ vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+ if (!pte)
+ vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+ spin_lock(vmf->ptl);
+ return true;
+ }
+
+ local_irq_disable();
+ if (!mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq))
+ goto fail;
+ /*
+ * The mmap sequence count check guarantees that the page
+ * tables are still valid at that point, and having IRQs
+ * disabled ensures that they stay around (see Fast GUP
+ * comment in mm/gup.c).
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ /*
+ * We check if the pmd value is still the same to ensure that there
+ * is not a huge collapse operation in progress in our back.
+ */
+ pmdval = READ_ONCE(*vmf->pmd);
+ if (!pmd_same(pmdval, vmf->orig_pmd))
+ goto fail;
+#endif
+ ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+ if (!pte)
+ pte = pte_offset_map(vmf->pmd, vmf->address);
+ /*
+ * Try locking the page table.
+ *
+ * Note that we might race against zap_pte_range() which
+ * invalidates TLBs while holding the page table lock.
+ * We still have local IRQs disabled here to prevent the
+ * page table from being reclaimed, and zap_pte_range() could
+ * thus deadlock with us if we tried using spin_lock() here.
+ *
+ * We also don't want to retry until spin_trylock() succeeds,
+ * because of the starvation potential against a stream of lockers.
+ */
+ if (unlikely(!spin_trylock(ptl)))
+ goto fail;
+ if (!mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq))
+ goto unlock_fail;
+ local_irq_enable();
+ vmf->pte = pte;
+ vmf->ptl = ptl;
+ return true;
+
+unlock_fail:
+ spin_unlock(ptl);
+fail:
+ if (pte)
+ pte_unmap(pte);
+ local_irq_enable();
+ return false;
+}
+
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
/*
* handle_pte_fault chooses page fault handler according to an entry which was
* read non-atomically. Before making any commitment, on those architectures
--
2.20.1

2021-04-07 16:57:19

by Michel Lespinasse

Subject: [RFC PATCH 19/37] mm: implement speculative handling in wp_page_copy()

Change wp_page_copy() to handle the speculative case.
This involves aborting speculative faults if they have to allocate an
anon_vma, and using pte_map_lock() instead of pte_offset_map_lock()
to complete the page fault.

Also change call sites to clear vmf->pte after unmapping the page table,
in order to satisfy pte_map_lock()'s preconditions.

Signed-off-by: Michel Lespinasse <[email protected]>
---
mm/memory.c | 31 ++++++++++++++++++++++---------
1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index eea72bd78d06..547d9d0ee962 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2911,20 +2911,27 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
pte_t entry;
int page_copied = 0;
struct mmu_notifier_range range;
+ vm_fault_t ret = VM_FAULT_OOM;

- if (unlikely(anon_vma_prepare(vma)))
- goto oom;
+ if (unlikely(!vma->anon_vma)) {
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+ ret = VM_FAULT_RETRY;
+ goto out;
+ }
+ if (__anon_vma_prepare(vma))
+ goto out;
+ }

if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
new_page = alloc_zeroed_user_highpage_movable(vma,
vmf->address);
if (!new_page)
- goto oom;
+ goto out;
} else {
new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
vmf->address);
if (!new_page)
- goto oom;
+ goto out;

if (!cow_user_page(new_page, old_page, vmf)) {
/*
@@ -2941,7 +2948,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
}

if (mem_cgroup_charge(new_page, mm, GFP_KERNEL))
- goto oom_free_new;
+ goto out_free_new;
cgroup_throttle_swaprate(new_page, GFP_KERNEL);

__SetPageUptodate(new_page);
@@ -2954,7 +2961,11 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
/*
* Re-check the pte - we dropped the lock
*/
- vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
+ if (!pte_map_lock(vmf)) {
+ ret = VM_FAULT_RETRY;
+ /* put_page() will uncharge the page */
+ goto out_free_new;
+ }
if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
if (old_page) {
if (!PageAnon(old_page)) {
@@ -3042,12 +3053,12 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
put_page(old_page);
}
return page_copied ? VM_FAULT_WRITE : 0;
-oom_free_new:
+out_free_new:
put_page(new_page);
-oom:
+out:
if (old_page)
put_page(old_page);
- return VM_FAULT_OOM;
+ return ret;
}

/**
@@ -3190,6 +3201,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
return wp_pfn_shared(vmf);

pte_unmap_unlock(vmf->pte, vmf->ptl);
+ vmf->pte = NULL;
return wp_page_copy(vmf);
}

@@ -3228,6 +3240,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
get_page(vmf->page);

pte_unmap_unlock(vmf->pte, vmf->ptl);
+ vmf->pte = NULL;
return wp_page_copy(vmf);
}

--
2.20.1

2021-04-07 16:57:19

by Michel Lespinasse

Subject: [RFC PATCH 18/37] mm: enable speculative fault handling in do_numa_page()

Change handle_pte_fault() to allow speculative fault execution to proceed
through do_numa_page().

do_swap_page() does not implement speculative execution yet, so it
needs to abort with VM_FAULT_RETRY in that case.

Signed-off-by: Michel Lespinasse <[email protected]>
---
mm/memory.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 78cc36749754..eea72bd78d06 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3346,6 +3346,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
vm_fault_t ret = 0;
void *shadow = NULL;

+ if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+ pte_unmap(vmf->pte);
+ return VM_FAULT_RETRY;
+ }
+
if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
goto out;

@@ -4342,17 +4347,17 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
return do_fault(vmf);
}

- if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
- pte_unmap(vmf->pte);
- return VM_FAULT_RETRY;
- }
-
if (!pte_present(vmf->orig_pte))
return do_swap_page(vmf);

if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
return do_numa_page(vmf);

+ if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+ pte_unmap(vmf->pte);
+ return VM_FAULT_RETRY;
+ }
+
vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
spin_lock(vmf->ptl);
entry = vmf->orig_pte;
--
2.20.1

2021-04-07 16:57:19

by Michel Lespinasse

Subject: [RFC PATCH 23/37] mm: rcu safe vma->vm_file freeing

Defer freeing of vma->vm_file when freeing vmas.
This is to allow speculative page faults in the mapped file case.

Signed-off-by: Michel Lespinasse <[email protected]>
---
fs/exec.c | 1 +
kernel/fork.c | 17 +++++++++++++++--
mm/mmap.c | 11 +++++++----
mm/nommu.c | 6 ++----
4 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 18594f11c31f..c9da73eb0f53 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -286,6 +286,7 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
mmap_write_unlock(mm);
err_free:
bprm->vma = NULL;
+ VM_BUG_ON(vma->vm_file);
vm_area_free(vma);
return err;
}
diff --git a/kernel/fork.c b/kernel/fork.c
index b6078e546114..2f20a5c5fed8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -369,19 +369,31 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
return new;
}

+static inline void ____vm_area_free(struct vm_area_struct *vma)
+{
+ if (vma->vm_file)
+ fput(vma->vm_file);
+ kmem_cache_free(vm_area_cachep, vma);
+}
+
#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
static void __vm_area_free(struct rcu_head *head)
{
struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
vm_rcu);
- kmem_cache_free(vm_area_cachep, vma);
+ ____vm_area_free(vma);
}

+#endif
+
void vm_area_free(struct vm_area_struct *vma)
{
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
call_rcu(&vma->vm_rcu, __vm_area_free);
+#else
+ ____vm_area_free(vma);
+#endif
}
-#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */

static void account_kernel_stack(struct task_struct *tsk, int account)
{
@@ -621,6 +633,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
fail_nomem_anon_vma_fork:
mpol_put(vma_policy(tmp));
fail_nomem_policy:
+ tmp->vm_file = NULL; /* prevents fput within vm_area_free() */
vm_area_free(tmp);
fail_nomem:
retval = -ENOMEM;
diff --git a/mm/mmap.c b/mm/mmap.c
index 3f287599a7a3..cc2323e243bb 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -178,9 +178,8 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
might_sleep();
if (vma->vm_ops && vma->vm_ops->close)
vma->vm_ops->close(vma);
- if (vma->vm_file)
- fput(vma->vm_file);
mpol_put(vma_policy(vma));
+ /* fput(vma->vm_file) happens in vm_area_free after an RCU delay. */
vm_area_free(vma);
return next;
}
@@ -949,7 +948,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
if (remove_next) {
if (file) {
uprobe_munmap(next, next->vm_start, next->vm_end);
- fput(file);
+ /* fput(file) happens within vm_area_free(next) */
+ VM_BUG_ON(file != next->vm_file);
}
if (next->anon_vma)
anon_vma_merge(vma, next);
@@ -1828,7 +1828,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
* fput the vma->vm_file here or we would add an extra fput for file
* and cause general protection fault ultimately.
*/
- fput(vma->vm_file);
+ /* fput happens within vm_area_free */
vm_area_free(vma);
vma = merge;
/* Update vm_flags to pick up the change. */
@@ -1907,6 +1907,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
if (vm_flags & VM_DENYWRITE)
allow_write_access(file);
free_vma:
+ VM_BUG_ON(vma->vm_file);
vm_area_free(vma);
unacct_error:
if (charged)
@@ -2779,6 +2780,7 @@ int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
out_free_mpol:
mpol_put(vma_policy(new));
out_free_vma:
+ new->vm_file = NULL; /* prevents fput within vm_area_free() */
vm_area_free(new);
return err;
}
@@ -3343,6 +3345,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
out_free_mempol:
mpol_put(vma_policy(new_vma));
out_free_vma:
+ new_vma->vm_file = NULL; /* Prevent fput within vm_area_free */
vm_area_free(new_vma);
out:
return NULL;
diff --git a/mm/nommu.c b/mm/nommu.c
index 5c9ab799c0e6..06a0dc0b913b 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -664,9 +664,8 @@ static void delete_vma(struct mm_struct *mm, struct vm_area_struct *vma)
{
if (vma->vm_ops && vma->vm_ops->close)
vma->vm_ops->close(vma);
- if (vma->vm_file)
- fput(vma->vm_file);
put_nommu_region(vma->vm_region);
+ /* fput(vma->vm_file) happens within vm_area_free() */
vm_area_free(vma);
}

@@ -1267,8 +1266,7 @@ unsigned long do_mmap(struct file *file,
if (region->vm_file)
fput(region->vm_file);
kmem_cache_free(vm_region_jar, region);
- if (vma->vm_file)
- fput(vma->vm_file);
+ /* fput(vma->vm_file) happens within vm_area_free() */
vm_area_free(vma);
return ret;

--
2.20.1

2021-04-07 16:57:40

by Michel Lespinasse

Subject: [RFC PATCH 07/37] mm: add FAULT_FLAG_SPECULATIVE flag

Define the new FAULT_FLAG_SPECULATIVE flag, which indicates when we are
attempting speculative fault handling (without holding the mmap lock).

Signed-off-by: Michel Lespinasse <[email protected]>
---
include/linux/mm.h | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8ba434287387..021fdab5b721 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -434,6 +434,7 @@ extern pgprot_t protection_map[16];
* @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
* @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
* @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ * @FAULT_FLAG_SPECULATIVE: The fault is handled without holding the mmap_sem.
*
* About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
* whether we would allow page faults to retry by specifying these two
@@ -464,6 +465,7 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_REMOTE 0x80
#define FAULT_FLAG_INSTRUCTION 0x100
#define FAULT_FLAG_INTERRUPTIBLE 0x200
+#define FAULT_FLAG_SPECULATIVE 0x400

/*
* The default fault flags that should be used by most of the
@@ -501,7 +503,8 @@ static inline bool fault_flag_allow_retry_first(unsigned int flags)
{ FAULT_FLAG_USER, "USER" }, \
{ FAULT_FLAG_REMOTE, "REMOTE" }, \
{ FAULT_FLAG_INSTRUCTION, "INSTRUCTION" }, \
- { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }
+ { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }, \
+ { FAULT_FLAG_SPECULATIVE, "SPECULATIVE" }

/*
* vm_fault is filled by the pagefault handler and passed to the vma's
--
2.20.1

2021-04-07 16:57:53

by Michel Lespinasse

Subject: [RFC PATCH 02/37] mmap locking API: name the return values

In the mmap locking API, the *_killable() functions return an error
(or 0 on success), and the *_trylock() functions return a boolean
(true on success).

Rename the return values to "int error" and "bool ok" respectively,
rather than using "ret" for both cases, which I find less readable.

Signed-off-by: Michel Lespinasse <[email protected]>
---
include/linux/mmap_lock.h | 32 ++++++++++++++++----------------
1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 4e27f755766b..8ff276a7560e 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -81,22 +81,22 @@ static inline void mmap_write_lock_nested(struct mm_struct *mm, int subclass)

static inline int mmap_write_lock_killable(struct mm_struct *mm)
{
- int ret;
+ int error;

__mmap_lock_trace_start_locking(mm, true);
- ret = down_write_killable(&mm->mmap_lock);
- __mmap_lock_trace_acquire_returned(mm, true, ret == 0);
- return ret;
+ error = down_write_killable(&mm->mmap_lock);
+ __mmap_lock_trace_acquire_returned(mm, true, !error);
+ return error;
}

static inline bool mmap_write_trylock(struct mm_struct *mm)
{
- bool ret;
+ bool ok;

__mmap_lock_trace_start_locking(mm, true);
- ret = down_write_trylock(&mm->mmap_lock) != 0;
- __mmap_lock_trace_acquire_returned(mm, true, ret);
- return ret;
+ ok = down_write_trylock(&mm->mmap_lock) != 0;
+ __mmap_lock_trace_acquire_returned(mm, true, ok);
+ return ok;
}

static inline void mmap_write_unlock(struct mm_struct *mm)
@@ -120,22 +120,22 @@ static inline void mmap_read_lock(struct mm_struct *mm)

static inline int mmap_read_lock_killable(struct mm_struct *mm)
{
- int ret;
+ int error;

__mmap_lock_trace_start_locking(mm, false);
- ret = down_read_killable(&mm->mmap_lock);
- __mmap_lock_trace_acquire_returned(mm, false, ret == 0);
- return ret;
+ error = down_read_killable(&mm->mmap_lock);
+ __mmap_lock_trace_acquire_returned(mm, false, !error);
+ return error;
}

static inline bool mmap_read_trylock(struct mm_struct *mm)
{
- bool ret;
+ bool ok;

__mmap_lock_trace_start_locking(mm, false);
- ret = down_read_trylock(&mm->mmap_lock) != 0;
- __mmap_lock_trace_acquire_returned(mm, false, ret);
- return ret;
+ ok = down_read_trylock(&mm->mmap_lock) != 0;
+ __mmap_lock_trace_acquire_returned(mm, false, ok);
+ return ok;
}

static inline void mmap_read_unlock(struct mm_struct *mm)
--
2.20.1

2021-04-07 16:58:22

by Michel Lespinasse

Subject: [RFC PATCH 29/37] fs: list file types that support speculative faults.

Add a speculative field to the vm_operations_struct, which indicates if
the associated file type supports speculative faults.

Initially this is set for files that implement fault() with filemap_fault().
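
The next commit gates speculative handling on this field; roughly (the
exact placement of the check is abbreviated here):

        /* Allow spf only for anonymous vmas and opted-in file vmas. */
        if (!vma_is_anonymous(vma) &&
            (!vma->vm_ops || !vma->vm_ops->speculative))
                goto spf_abort;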

Signed-off-by: Michel Lespinasse <[email protected]>
---
fs/btrfs/file.c | 1 +
fs/cifs/file.c | 1 +
fs/fuse/file.c | 1 +
fs/nfs/file.c | 1 +
fs/ubifs/file.c | 1 +
fs/vboxsf/file.c | 1 +
include/linux/mm.h | 7 +++++++
mm/filemap.c | 1 +
8 files changed, 14 insertions(+)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0e155f013839..b31851271e51 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2293,6 +2293,7 @@ static const struct vm_operations_struct btrfs_file_vm_ops = {
.fault = filemap_fault,
.map_pages = filemap_map_pages,
.page_mkwrite = btrfs_page_mkwrite,
+ .speculative = true,
};

static int btrfs_file_mmap(struct file *filp, struct vm_area_struct *vma)
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 042e24aad410..a0d5fbb25c62 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -4116,6 +4116,7 @@ static const struct vm_operations_struct cifs_file_vm_ops = {
.fault = filemap_fault,
.map_pages = filemap_map_pages,
.page_mkwrite = cifs_page_mkwrite,
+ .speculative = true,
};

int cifs_file_strict_mmap(struct file *file, struct vm_area_struct *vma)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 8cccecb55fb8..c4874240d157 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2354,6 +2354,7 @@ static const struct vm_operations_struct fuse_file_vm_ops = {
.fault = filemap_fault,
.map_pages = filemap_map_pages,
.page_mkwrite = fuse_page_mkwrite,
+ .speculative = true,
};

static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 16ad5050e046..e653c6bc23ca 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -588,6 +588,7 @@ static const struct vm_operations_struct nfs_file_vm_ops = {
.fault = filemap_fault,
.map_pages = filemap_map_pages,
.page_mkwrite = nfs_vm_page_mkwrite,
+ .speculative = true,
};

static int nfs_need_check_write(struct file *filp, struct inode *inode,
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index 0e4b4be3aa26..3d97f1c3e9c7 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1598,6 +1598,7 @@ static const struct vm_operations_struct ubifs_file_vm_ops = {
.fault = filemap_fault,
.map_pages = filemap_map_pages,
.page_mkwrite = ubifs_vm_page_mkwrite,
+ .speculative = true,
};

static int ubifs_file_mmap(struct file *file, struct vm_area_struct *vma)
diff --git a/fs/vboxsf/file.c b/fs/vboxsf/file.c
index c4ab5996d97a..e0a6a3af9cb9 100644
--- a/fs/vboxsf/file.c
+++ b/fs/vboxsf/file.c
@@ -146,6 +146,7 @@ static const struct vm_operations_struct vboxsf_file_vm_ops = {
.close = vboxsf_vma_close,
.fault = filemap_fault,
.map_pages = filemap_map_pages,
+ .speculative = true,
};

static int vboxsf_file_mmap(struct file *file, struct vm_area_struct *vma)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f26490aff514..b4c0c10e434e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -641,6 +641,13 @@ struct vm_operations_struct {
*/
struct page *(*find_special_page)(struct vm_area_struct *vma,
unsigned long addr);
+ /*
+ * speculative indicates that the vm_operations support
+ * speculative page faults. This allows ->fault and ->map_pages
+ * to be called with FAULT_FLAG_SPECULATIVE set; such calls will
+ * run within an rcu read locked section and with mmap lock not held.
+ */
+ bool speculative;
};

static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
diff --git a/mm/filemap.c b/mm/filemap.c
index d496771749e6..b83040c672d3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3226,6 +3226,7 @@ const struct vm_operations_struct generic_file_vm_ops = {
.fault = filemap_fault,
.map_pages = filemap_map_pages,
.page_mkwrite = filemap_page_mkwrite,
+ .speculative = true,
};

/* This is used for a general mmap of a disk file */
--
2.20.1

2021-04-07 16:58:49

by Michel Lespinasse

Subject: [RFC PATCH 27/37] mm: implement speculative handling in do_fault_around()

Call the vm_ops->map_pages method within an rcu read locked section.
In the speculative case, verify the mmap sequence lock at the start of
the section. A match guarantees that the original vma is still valid
at that time, and that the associated vma->vm_file stays valid while
the vm_ops->map_pages() method is running.

Do not test vmf->pmd in the speculative case - we only speculate when
a page table already exists, and this saves us from having to handle
synchronization around the vmf->pmd read.

Change xfs_filemap_map_pages() to account for the fact that it cannot
block anymore, as it now runs within an rcu read lock.

Signed-off-by: Michel Lespinasse <[email protected]>
---
fs/xfs/xfs_file.c | 3 +++
mm/memory.c | 22 ++++++++++++++++++++--
2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index a007ca0711d9..b360732b20ae 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1387,6 +1387,9 @@ xfs_filemap_map_pages(
struct inode *inode = file_inode(vmf->vma->vm_file);
vm_fault_t ret;

+ if (!xfs_ilock_nowait(XFS_I(inode), XFS_MMAPLOCK_SHARED))
+ return (vmf->flags & FAULT_FLAG_SPECULATIVE) ?
+ VM_FAULT_RETRY : 0;
xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
ret = filemap_map_pages(vmf, start_pgoff, end_pgoff);
xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
diff --git a/mm/memory.c b/mm/memory.c
index 13e2aaf900e5..a20e13d84145 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4012,6 +4012,7 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
pgoff_t start_pgoff = vmf->pgoff;
pgoff_t end_pgoff;
int off;
+ vm_fault_t ret;

nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK;
@@ -4030,14 +4031,31 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
end_pgoff = min3(end_pgoff, vma_pages(vmf->vma) + vmf->vma->vm_pgoff - 1,
start_pgoff + nr_pages - 1);

- if (pmd_none(*vmf->pmd)) {
+ if (!(vmf->flags & FAULT_FLAG_SPECULATIVE) &&
+ pmd_none(*vmf->pmd)) {
vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm);
if (!vmf->prealloc_pte)
return VM_FAULT_OOM;
smp_wmb(); /* See comment in __pte_alloc() */
}

- return vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff);
+ rcu_read_lock();
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+ if (!mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq)) {
+ rcu_read_unlock();
+ return VM_FAULT_RETRY;
+ }
+ /*
+ * the mmap sequence check verified that vmf->vma was still
+ * current at that point in time.
+ * The rcu read lock ensures vmf->vma->vm_file stays valid.
+ */
+ }
+#endif
+ ret = vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff);
+ rcu_read_unlock();
+ return ret;
}

static vm_fault_t do_read_fault(struct vm_fault *vmf)
--
2.20.1

2021-04-07 16:58:50

by Michel Lespinasse

Subject: [RFC PATCH 31/37] ext4: implement speculative fault handling

We just need to make sure ext4_filemap_fault() doesn't block in the
speculative case as it is called with an rcu read lock held.

Signed-off-by: Michel Lespinasse <[email protected]>
---
fs/ext4/file.c | 1 +
fs/ext4/inode.c | 7 ++++++-
2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 194f5d00fa32..546b9d4aa9d7 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -742,6 +742,7 @@ static const struct vm_operations_struct ext4_file_vm_ops = {
.fault = ext4_filemap_fault,
.map_pages = filemap_map_pages,
.page_mkwrite = ext4_page_mkwrite,
+ .speculative = true,
};

static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0948a43f1b3d..370484403c71 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -6192,7 +6192,12 @@ vm_fault_t ext4_filemap_fault(struct vm_fault *vmf)
struct inode *inode = file_inode(vmf->vma->vm_file);
vm_fault_t ret;

- down_read(&EXT4_I(inode)->i_mmap_sem);
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+ if (!down_read_trylock(&EXT4_I(inode)->i_mmap_sem))
+ return VM_FAULT_RETRY;
+ } else {
+ down_read(&EXT4_I(inode)->i_mmap_sem);
+ }
ret = filemap_fault(vmf);
up_read(&EXT4_I(inode)->i_mmap_sem);

--
2.20.1

2021-04-07 16:58:55

by Michel Lespinasse

Subject: [RFC PATCH 36/37] arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT so that the speculative fault
handling code can be compiled on this architecture.

Signed-off-by: Michel Lespinasse <[email protected]>
---
arch/arm64/Kconfig | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index e4e1b6550115..6c2bbfe3fd38 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -208,6 +208,7 @@ config ARM64
select SWIOTLB
select SYSCTL_EXCEPTION_TRACE
select THREAD_INFO_IN_TASK
+ select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
help
ARM 64-bit (AArch64) Linux support.

--
2.20.1

2021-04-07 16:59:27

by Michel Lespinasse

Subject: [RFC PATCH 32/37] f2fs: implement speculative fault handling

We just need to make sure f2fs_filemap_fault() doesn't block in the
speculative case as it is called with an rcu read lock held.

Signed-off-by: Michel Lespinasse <[email protected]>
---
fs/f2fs/file.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index d26ff2ae3f5e..c1cfdc3ec98e 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -37,7 +37,12 @@ static vm_fault_t f2fs_filemap_fault(struct vm_fault *vmf)
struct inode *inode = file_inode(vmf->vma->vm_file);
vm_fault_t ret;

- down_read(&F2FS_I(inode)->i_mmap_sem);
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+ if (!down_read_trylock(&F2FS_I(inode)->i_mmap_sem))
+ return VM_FAULT_RETRY;
+ } else {
+ down_read(&F2FS_I(inode)->i_mmap_sem);
+ }
ret = filemap_fault(vmf);
up_read(&F2FS_I(inode)->i_mmap_sem);

@@ -171,6 +176,7 @@ static const struct vm_operations_struct f2fs_file_vm_ops = {
.fault = f2fs_filemap_fault,
.map_pages = filemap_map_pages,
.page_mkwrite = f2fs_vm_page_mkwrite,
+ .speculative = true,
};

static int get_parent_ino(struct inode *inode, nid_t *pino)
--
2.20.1

2021-04-07 17:06:34

by Matthew Wilcox

Subject: Re: [RFC PATCH 27/37] mm: implement speculative handling in do_fault_around()

On Tue, Apr 06, 2021 at 06:44:52PM -0700, Michel Lespinasse wrote:
> Change xfs_filemap_map_pages() to account for the fact that it cannot
> block anymore, as it now runs within an rcu read lock.

Better to just delete xfs_filemap_map_pages() and use filemap_map_pages
directly.

2021-04-07 17:13:12

by Michel Lespinasse

Subject: [RFC PATCH 03/37] do_anonymous_page: use update_mmu_tlb()

update_mmu_tlb() can be used instead of update_mmu_cache() when the
page fault handler detects that it lost the race to another page fault.

Signed-off-by: Michel Lespinasse <[email protected]>
---
mm/memory.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 5efa07fb6cdc..8ee4bd239303 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3567,7 +3567,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
if (!pte_none(*vmf->pte)) {
- update_mmu_cache(vma, vmf->address, vmf->pte);
+ update_mmu_tlb(vma, vmf->address, vmf->pte);
goto release;
}

--
2.20.1

2021-04-07 17:13:16

by Michel Lespinasse

Subject: [RFC PATCH 04/37] do_anonymous_page: reduce code duplication

In do_anonymous_page(), we have separate cases for the zero page vs
allocating new anonymous pages. However, once the pte entry has been
computed, the rest of the handling (mapping and locking the page table,
checking that we didn't lose a race with another page fault handler, etc)
is identical between the two cases.

This change reduces the code duplication between the two cases.

Signed-off-by: Michel Lespinasse <[email protected]>
---
mm/memory.c | 85 +++++++++++++++++++++++------------------------------
1 file changed, 37 insertions(+), 48 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 8ee4bd239303..477c98bfdd9d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3495,7 +3495,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
- struct page *page;
+ struct page *page = NULL;
vm_fault_t ret = 0;
pte_t entry;

@@ -3525,77 +3525,66 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
!mm_forbids_zeropage(vma->vm_mm)) {
entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
vma->vm_page_prot));
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
- vmf->address, &vmf->ptl);
- if (!pte_none(*vmf->pte)) {
- update_mmu_tlb(vma, vmf->address, vmf->pte);
- goto unlock;
- }
- ret = check_stable_address_space(vma->vm_mm);
- if (ret)
- goto unlock;
- /* Deliver the page fault to userland, check inside PT lock */
- if (userfaultfd_missing(vma)) {
- pte_unmap_unlock(vmf->pte, vmf->ptl);
- return handle_userfault(vmf, VM_UFFD_MISSING);
- }
- goto setpte;
+ } else {
+ /* Allocate our own private page. */
+ if (unlikely(anon_vma_prepare(vma)))
+ goto oom;
+ page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
+ if (!page)
+ goto oom;
+
+ if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL))
+ goto oom_free_page;
+ cgroup_throttle_swaprate(page, GFP_KERNEL);
+
+ /*
+ * The memory barrier inside __SetPageUptodate makes sure that
+ * preceding stores to the page contents become visible before
+ * the set_pte_at() write.
+ */
+ __SetPageUptodate(page);
+
+ entry = mk_pte(page, vma->vm_page_prot);
+ if (vma->vm_flags & VM_WRITE)
+ entry = pte_mkwrite(pte_mkdirty(entry));
}

- /* Allocate our own private page. */
- if (unlikely(anon_vma_prepare(vma)))
- goto oom;
- page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
- if (!page)
- goto oom;
-
- if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL))
- goto oom_free_page;
- cgroup_throttle_swaprate(page, GFP_KERNEL);
-
- /*
- * The memory barrier inside __SetPageUptodate makes sure that
- * preceding stores to the page contents become visible before
- * the set_pte_at() write.
- */
- __SetPageUptodate(page);
-
- entry = mk_pte(page, vma->vm_page_prot);
- if (vma->vm_flags & VM_WRITE)
- entry = pte_mkwrite(pte_mkdirty(entry));
-
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
if (!pte_none(*vmf->pte)) {
update_mmu_tlb(vma, vmf->address, vmf->pte);
- goto release;
+ goto unlock;
}

ret = check_stable_address_space(vma->vm_mm);
if (ret)
- goto release;
+ goto unlock;

/* Deliver the page fault to userland, check inside PT lock */
if (userfaultfd_missing(vma)) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
- put_page(page);
+ if (page)
+ put_page(page);
return handle_userfault(vmf, VM_UFFD_MISSING);
}

- inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
- page_add_new_anon_rmap(page, vma, vmf->address, false);
- lru_cache_add_inactive_or_unevictable(page, vma);
-setpte:
+ if (page) {
+ inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
+ page_add_new_anon_rmap(page, vma, vmf->address, false);
+ lru_cache_add_inactive_or_unevictable(page, vma);
+ }
+
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);

/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, vmf->address, vmf->pte);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return 0;
unlock:
pte_unmap_unlock(vmf->pte, vmf->ptl);
+ if (page)
+ put_page(page);
return ret;
-release:
- put_page(page);
- goto unlock;
oom_free_page:
put_page(page);
oom:
--
2.20.1

2021-04-07 17:13:27

by Michel Lespinasse

Subject: [RFC PATCH 05/37] mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT

This configuration variable will be used to build the code needed to
handle speculative page faults.

This is enabled by default on supported architectures with SMP and MMU set.

The architecture support is needed since the speculative page fault handler
is called from the architecture's page faulting code, and some code has to
be added there to try speculative fault handling first.

Signed-off-by: Michel Lespinasse <[email protected]>
---
mm/Kconfig | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 24c045b24b95..322bda319dea 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -872,4 +872,26 @@ config MAPPING_DIRTY_HELPERS
config KMAP_LOCAL
bool

+config ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
+ def_bool n
+
+config SPECULATIVE_PAGE_FAULT
+ bool "Speculative page faults"
+ default y
+ depends on ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT && MMU && SMP
+ help
+ Try to handle user space page faults without holding the mmap lock.
+
+ Instead of blocking writers through the use of mmap lock,
+ the page fault handler merely verifies, at the end of the page
+ fault, that no writers have been running concurrently with it.
+
+ In high concurrency situations, the speculative fault handler
+ gains a throughput advantage by avoiding having to update the
+ mmap lock reader count.
+
+ If the check fails due to a concurrent writer, or due to hitting
+ an unsupported case, the fault handler falls back to classical
+ processing using the mmap read lock.
+
endmenu
--
2.20.1

2021-04-07 17:13:43

by Michel Lespinasse

[permalink] [raw]
Subject: [RFC PATCH 16/37] mm: enable speculative fault handling through do_anonymous_page()

In the x86 fault handler, only attempt spf if the vma is anonymous.

In do_handle_mm_fault(), let speculative page faults proceed as long
as they fall into anonymous vmas. This enables the speculative
handling code in __handle_mm_fault() and do_anonymous_page().

In handle_pte_fault(), if vmf->pte is set (the original pte was not
pte_none), catch speculative faults and return VM_FAULT_RETRY as
those cases are not implemented yet. Also assert that do_fault()
is not reached in the speculative case.

Signed-off-by: Michel Lespinasse <[email protected]>
---
arch/x86/mm/fault.c | 2 +-
mm/memory.c | 16 ++++++++++++----
2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index f8c8e325af77..fbf265f56a06 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1324,7 +1324,7 @@ void do_user_addr_fault(struct pt_regs *regs,
goto spf_abort;
rcu_read_lock();
vma = find_vma(mm, address);
- if (!vma || vma->vm_start > address) {
+ if (!vma || vma->vm_start > address || !vma_is_anonymous(vma)) {
rcu_read_unlock();
goto spf_abort;
}
diff --git a/mm/memory.c b/mm/memory.c
index fd84576f9c01..a2c5bf29f989 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4103,6 +4103,8 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
struct mm_struct *vm_mm = vma->vm_mm;
vm_fault_t ret;

+ VM_BUG_ON(vmf->flags & FAULT_FLAG_SPECULATIVE);
+
/*
* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND
*/
@@ -4340,6 +4342,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
return do_fault(vmf);
}

+ if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+ pte_unmap(vmf->pte);
+ return VM_FAULT_RETRY;
+ }
+
if (!pte_present(vmf->orig_pte))
return do_swap_page(vmf);

@@ -4668,8 +4675,7 @@ vm_fault_t do_handle_mm_fault(struct vm_area_struct *vma,
{
vm_fault_t ret;

- if (flags & FAULT_FLAG_SPECULATIVE)
- return VM_FAULT_RETRY;
+ VM_BUG_ON((flags & FAULT_FLAG_SPECULATIVE) && !vma_is_anonymous(vma));

__set_current_state(TASK_RUNNING);

@@ -4691,10 +4697,12 @@ vm_fault_t do_handle_mm_fault(struct vm_area_struct *vma,
if (flags & FAULT_FLAG_USER)
mem_cgroup_enter_user_fault();

- if (unlikely(is_vm_hugetlb_page(vma)))
+ if (unlikely(is_vm_hugetlb_page(vma))) {
+ VM_BUG_ON(flags & FAULT_FLAG_SPECULATIVE);
ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
- else
+ } else {
ret = __handle_mm_fault(vma, address, flags, seq);
+ }

if (flags & FAULT_FLAG_USER) {
mem_cgroup_exit_user_fault();
--
2.20.1

2021-04-07 17:13:44

by Michel Lespinasse

[permalink] [raw]
Subject: [RFC PATCH 17/37] mm: implement speculative handling in do_numa_page()

Change do_numa_page() to use pte_spinlock() when locking the page table,
so that the mmap sequence counter is validated in the speculative case.
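
For context, this is the calling convention relied upon here (a usage
sketch only; pte_spinlock() itself is introduced elsewhere in the series):

	if (!pte_spinlock(vmf))
		return VM_FAULT_RETRY;	/* seq re-check or lock acquisition failed */
	/* page table lock held: vmf->pte and vmf->ptl are valid from here on */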

Signed-off-by: Michel Lespinasse <[email protected]>
---
mm/memory.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index a2c5bf29f989..78cc36749754 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4181,8 +4181,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
* validation through pte_unmap_same(). It's of NUMA type but
* the pfn may be screwed if the read is non atomic.
*/
- vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
- spin_lock(vmf->ptl);
+ if (!pte_spinlock(vmf))
+ return VM_FAULT_RETRY;
if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
goto out;
--
2.20.1

2021-04-07 17:19:37

by Michel Lespinasse

[permalink] [raw]
Subject: [RFC PATCH 08/37] mm: add do_handle_mm_fault()

Add a new do_handle_mm_fault function, which extends the existing
handle_mm_fault() API by adding an mmap sequence count, to be used
in the FAULT_FLAG_SPECULATIVE case.

In the initial implementation, FAULT_FLAG_SPECULATIVE always fails
(by returning VM_FAULT_RETRY).

The existing handle_mm_fault() API is kept as a wrapper around
do_handle_mm_fault() so that we do not have to immediately update
every handle_mm_fault() call site.
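
As an illustration, the call pattern a speculative caller is expected to
use later in the series (a sketch only, not code added by this patch):

	seq = mmap_seq_read_start(mm);
	/* ... look up and copy the vma without taking the mmap lock ... */
	fault = do_handle_mm_fault(vma, address,
				   flags | FAULT_FLAG_SPECULATIVE, seq, regs);
	if (fault & VM_FAULT_RETRY) {
		/* fall back to the classical path, mmap read lock held */
	}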

Signed-off-by: Michel Lespinasse <[email protected]>
---
include/linux/mm.h | 12 +++++++++---
mm/memory.c | 10 +++++++---
2 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 021fdab5b721..d5988e78e6ab 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1724,9 +1724,15 @@ int generic_error_remove_page(struct address_space *mapping, struct page *page);
int invalidate_inode_page(struct page *page);

#ifdef CONFIG_MMU
-extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
- unsigned long address, unsigned int flags,
- struct pt_regs *regs);
+extern vm_fault_t do_handle_mm_fault(struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags,
+ unsigned long seq, struct pt_regs *regs);
+static inline vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags,
+ struct pt_regs *regs)
+{
+ return do_handle_mm_fault(vma, address, flags, 0, regs);
+}
extern int fixup_user_fault(struct mm_struct *mm,
unsigned long address, unsigned int fault_flags,
bool *unlocked);
diff --git a/mm/memory.c b/mm/memory.c
index 477c98bfdd9d..3691be1f1319 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4509,11 +4509,15 @@ static inline void mm_account_fault(struct pt_regs *regs,
* The mmap_lock may have been released depending on flags and our
* return value. See filemap_fault() and __lock_page_or_retry().
*/
-vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
- unsigned int flags, struct pt_regs *regs)
+vm_fault_t do_handle_mm_fault(struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags,
+ unsigned long seq, struct pt_regs *regs)
{
vm_fault_t ret;

+ if (flags & FAULT_FLAG_SPECULATIVE)
+ return VM_FAULT_RETRY;
+
__set_current_state(TASK_RUNNING);

count_vm_event(PGFAULT);
@@ -4555,7 +4559,7 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,

return ret;
}
-EXPORT_SYMBOL_GPL(handle_mm_fault);
+EXPORT_SYMBOL_GPL(do_handle_mm_fault);

#ifndef __PAGETABLE_P4D_FOLDED
/*
--
2.20.1

2021-04-07 17:19:49

by Michel Lespinasse

[permalink] [raw]
Subject: [RFC PATCH 24/37] mm: implement speculative handling in __do_fault()

In the speculative case, call the vm_ops->fault() method from within
an rcu read locked section, and verify the mmap sequence lock at the
start of the section. A match guarantees that the original vma is still
valid at that time, and that the associated vma->vm_file stays valid
while the vm_ops->fault() method is running.

Note that this implies that speculative faults can not sleep within
the vm_ops->fault method. We will only attempt to fetch existing pages
from the page cache during speculative faults; any miss (or prefetch)
will be handled by falling back to non-speculative fault handling.

The speculative handling case also does not preallocate page tables,
as it is always called with a pre-existing page table.
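
In outline, the speculative branch added below amounts to:

	rcu_read_lock();
	if (!mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq))
		ret = VM_FAULT_RETRY;	/* a writer ran; retry non-speculatively */
	else
		ret = vma->vm_ops->fault(vmf);	/* must not sleep here */
	rcu_read_unlock();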

Signed-off-by: Michel Lespinasse <[email protected]>
---
mm/memory.c | 63 +++++++++++++++++++++++++++++++++++------------------
1 file changed, 42 insertions(+), 21 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 6eddd7b4e89c..7139004c624d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3709,29 +3709,50 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
struct vm_area_struct *vma = vmf->vma;
vm_fault_t ret;

- /*
- * Preallocate pte before we take page_lock because this might lead to
- * deadlocks for memcg reclaim which waits for pages under writeback:
- * lock_page(A)
- * SetPageWriteback(A)
- * unlock_page(A)
- * lock_page(B)
- * lock_page(B)
- * pte_alloc_one
- * shrink_page_list
- * wait_on_page_writeback(A)
- * SetPageWriteback(B)
- * unlock_page(B)
- * # flush A, B to clear the writeback
- */
- if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
- vmf->prealloc_pte = pte_alloc_one(vma->vm_mm);
- if (!vmf->prealloc_pte)
- return VM_FAULT_OOM;
- smp_wmb(); /* See comment in __pte_alloc() */
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+ rcu_read_lock();
+ if (!mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq)) {
+ ret = VM_FAULT_RETRY;
+ } else {
+ /*
+ * The mmap sequence count check guarantees that the
+ * vma we fetched at the start of the fault was still
+ * current at that point in time. The rcu read lock
+ * ensures vmf->vma->vm_file stays valid.
+ */
+ ret = vma->vm_ops->fault(vmf);
+ }
+ rcu_read_unlock();
+ } else
+#endif
+ {
+ /*
+ * Preallocate pte before we take page_lock because
+ * this might lead to deadlocks for memcg reclaim
+ * which waits for pages under writeback:
+ * lock_page(A)
+ * SetPageWriteback(A)
+ * unlock_page(A)
+ * lock_page(B)
+ * lock_page(B)
+ * pte_alloc_one
+ * shrink_page_list
+ * wait_on_page_writeback(A)
+ * SetPageWriteback(B)
+ * unlock_page(B)
+ * # flush A, B to clear writeback
+ */
+ if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
+ vmf->prealloc_pte = pte_alloc_one(vma->vm_mm);
+ if (!vmf->prealloc_pte)
+ return VM_FAULT_OOM;
+ smp_wmb(); /* See comment in __pte_alloc() */
+ }
+
+ ret = vma->vm_ops->fault(vmf);
}

- ret = vma->vm_ops->fault(vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
VM_FAULT_DONE_COW)))
return ret;
--
2.20.1

2021-04-07 17:20:00

by Michel Lespinasse

[permalink] [raw]
Subject: [RFC PATCH 35/37] mm: spf statistics

Add a new CONFIG_SPECULATIVE_PAGE_FAULT_STATS config option,
and dump extra statistics about executed spf cases and abort reasons
when the option is set.
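
Each abort site is annotated with a dedicated event; for example, in the
x86 hunk below:

	if (!vma_can_speculate(vma, flags)) {
		rcu_read_unlock();
		count_vm_spf_event(SPF_ABORT_NO_SPECULATE);
		goto spf_abort;
	}

count_vm_spf_event() compiles to a no-op when
CONFIG_SPECULATIVE_PAGE_FAULT_STATS is not set.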

Signed-off-by: Michel Lespinasse <[email protected]>
---
arch/x86/mm/fault.c | 19 +++++++---
include/linux/mmap_lock.h | 19 +++++++++-
include/linux/vm_event_item.h | 24 ++++++++++++
include/linux/vmstat.h | 6 +++
mm/Kconfig.debug | 7 ++++
mm/memory.c | 71 ++++++++++++++++++++++++++++-------
mm/vmstat.c | 24 ++++++++++++
7 files changed, 149 insertions(+), 21 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index b1a07ca82d59..e210bbcb8bc5 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1324,22 +1324,31 @@ void do_user_addr_fault(struct pt_regs *regs,

count_vm_event(SPF_ATTEMPT);
seq = mmap_seq_read_start(mm);
- if (seq & 1)
+ if (seq & 1) {
+ count_vm_spf_event(SPF_ABORT_ODD);
goto spf_abort;
+ }
rcu_read_lock();
vma = find_vma(mm, address);
- if (!vma || vma->vm_start > address ||
- !vma_can_speculate(vma, flags)) {
+ if (!vma || vma->vm_start > address) {
rcu_read_unlock();
+ count_vm_spf_event(SPF_ABORT_UNMAPPED);
+ goto spf_abort;
+ }
+ if (!vma_can_speculate(vma, flags)) {
+ rcu_read_unlock();
+ count_vm_spf_event(SPF_ABORT_NO_SPECULATE);
goto spf_abort;
}
pvma = *vma;
rcu_read_unlock();
- if (!mmap_seq_read_check(mm, seq))
+ if (!mmap_seq_read_check(mm, seq, SPF_ABORT_VMA_COPY))
goto spf_abort;
vma = &pvma;
- if (unlikely(access_error(error_code, vma)))
+ if (unlikely(access_error(error_code, vma))) {
+ count_vm_spf_event(SPF_ABORT_ACCESS_ERROR);
goto spf_abort;
+ }
fault = do_handle_mm_fault(vma, address,
flags | FAULT_FLAG_SPECULATIVE, seq, regs);

diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 8f4eca2d0f43..98f24a9910a9 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -7,6 +7,7 @@
#include <linux/rwsem.h>
#include <linux/tracepoint-defs.h>
#include <linux/types.h>
+#include <linux/vmstat.h>

#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
#define MMAP_LOCK_SEQ_INITIALIZER(name) \
@@ -104,12 +105,26 @@ static inline unsigned long mmap_seq_read_start(struct mm_struct *mm)
return seq;
}

-static inline bool mmap_seq_read_check(struct mm_struct *mm, unsigned long seq)
+static inline bool __mmap_seq_read_check(struct mm_struct *mm,
+ unsigned long seq)
{
smp_rmb();
return seq == READ_ONCE(mm->mmap_seq);
}
-#endif
+
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT_STATS
+static inline bool mmap_seq_read_check(struct mm_struct *mm, unsigned long seq,
+ enum vm_event_item fail_event)
+{
+ if (__mmap_seq_read_check(mm, seq))
+ return true;
+ count_vm_event(fail_event);
+ return false;
+}
+#else
+#define mmap_seq_read_check(mm, seq, fail) __mmap_seq_read_check(mm, seq)
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT_STATS */
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */

static inline void mmap_write_lock(struct mm_struct *mm)
{
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index cc4f8d14e43f..6d25fd9ce4d1 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -124,6 +124,30 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
SPF_ATTEMPT,
SPF_ABORT,
+#endif
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT_STATS
+ SPF_ABORT_ODD,
+ SPF_ABORT_UNMAPPED,
+ SPF_ABORT_NO_SPECULATE,
+ SPF_ABORT_VMA_COPY,
+ SPF_ABORT_ACCESS_ERROR,
+ SPF_ABORT_PUD,
+ SPF_ABORT_PMD,
+ SPF_ABORT_ANON_VMA,
+ SPF_ABORT_PTE_MAP_LOCK_SEQ1,
+ SPF_ABORT_PTE_MAP_LOCK_PMD,
+ SPF_ABORT_PTE_MAP_LOCK_PTL,
+ SPF_ABORT_PTE_MAP_LOCK_SEQ2,
+ SPF_ABORT_USERFAULTFD,
+ SPF_ABORT_FAULT,
+ SPF_ABORT_NON_SWAP_ENTRY,
+ SPF_ABORT_SWAP_NOPAGE,
+ SPF_ATTEMPT_ANON,
+ SPF_ATTEMPT_FILE,
+ SPF_ATTEMPT_SWAP,
+ SPF_ATTEMPT_NUMA,
+ SPF_ATTEMPT_PTE,
+ SPF_ATTEMPT_WP,
#endif
NR_VM_EVENT_ITEMS
};
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 506d625163a1..34e05604a93f 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -109,6 +109,12 @@ static inline void vm_events_fold_cpu(int cpu)

#endif /* CONFIG_VM_EVENT_COUNTERS */

+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT_STATS
+#define count_vm_spf_event(x) count_vm_event(x)
+#else
+#define count_vm_spf_event(x) do {} while (0)
+#endif
+
#ifdef CONFIG_NUMA_BALANCING
#define count_vm_numa_event(x) count_vm_event(x)
#define count_vm_numa_events(x, y) count_vm_events(x, y)
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 1e73717802f8..6be8ca7950ee 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -150,3 +150,10 @@ config PTDUMP_DEBUGFS
kernel.

If in doubt, say N.
+
+config SPECULATIVE_PAGE_FAULT_STATS
+ bool "Additional statistics for speculative page faults"
+ depends on SPECULATIVE_PAGE_FAULT
+ help
+ Additional statistics for speculative page faults.
+ If in doubt, say N.
diff --git a/mm/memory.c b/mm/memory.c
index 074945faf1ab..6165d340e134 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2585,7 +2585,7 @@ bool __pte_map_lock(struct vm_fault *vmf)
}

local_irq_disable();
- if (!mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq))
+ if (!mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq, SPF_ABORT_PTE_MAP_LOCK_SEQ1))
goto fail;
/*
* The mmap sequence count check guarantees that the page
@@ -2599,8 +2599,10 @@ bool __pte_map_lock(struct vm_fault *vmf)
* is not a huge collapse operation in progress in our back.
*/
pmdval = READ_ONCE(*vmf->pmd);
- if (!pmd_same(pmdval, vmf->orig_pmd))
+ if (!pmd_same(pmdval, vmf->orig_pmd)) {
+ count_vm_spf_event(SPF_ABORT_PTE_MAP_LOCK_PMD);
goto fail;
+ }
#endif
ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
if (!pte)
@@ -2617,9 +2619,11 @@ bool __pte_map_lock(struct vm_fault *vmf)
* We also don't want to retry until spin_trylock() succeeds,
* because of the starvation potential against a stream of lockers.
*/
- if (unlikely(!spin_trylock(ptl)))
+ if (unlikely(!spin_trylock(ptl))) {
+ count_vm_spf_event(SPF_ABORT_PTE_MAP_LOCK_PTL);
goto fail;
- if (!mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq))
+ }
+ if (!mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq, SPF_ABORT_PTE_MAP_LOCK_SEQ2))
goto unlock_fail;
local_irq_enable();
vmf->pte = pte;
@@ -2891,6 +2895,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)

if (unlikely(!vma->anon_vma)) {
if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+ count_vm_spf_event(SPF_ABORT_ANON_VMA);
ret = VM_FAULT_RETRY;
goto out;
}
@@ -3153,10 +3158,15 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;

+ if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+ count_vm_spf_event(SPF_ATTEMPT_WP);
+
if (userfaultfd_pte_wp(vma, *vmf->pte)) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
- if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+ count_vm_spf_event(SPF_ABORT_USERFAULTFD);
return VM_FAULT_RETRY;
+ }
return handle_userfault(vmf, VM_UFFD_WP);
}

@@ -3340,6 +3350,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
vm_fault_t ret = 0;
void *shadow = NULL;

+ if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+ count_vm_spf_event(SPF_ATTEMPT_SWAP);
+
#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPTION)
if (sizeof(pte_t) > sizeof(unsigned long)) {
/*
@@ -3366,6 +3379,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
entry = pte_to_swp_entry(vmf->orig_pte);
if (unlikely(non_swap_entry(entry))) {
if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+ count_vm_spf_event(SPF_ABORT_NON_SWAP_ENTRY);
ret = VM_FAULT_RETRY;
} else if (is_migration_entry(entry)) {
migration_entry_wait(vma->vm_mm, vmf->pmd,
@@ -3392,6 +3406,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)

if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+ count_vm_spf_event(SPF_ABORT_SWAP_NOPAGE);
return VM_FAULT_RETRY;
}

@@ -3598,6 +3613,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
vm_fault_t ret = 0;
pte_t entry;

+ if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+ count_vm_spf_event(SPF_ATTEMPT_ANON);
+
/* File mapping without ->vm_ops ? */
if (vma->vm_flags & VM_SHARED)
return VM_FAULT_SIGBUS;
@@ -3627,8 +3645,10 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
} else {
/* Allocate our own private page. */
if (unlikely(!vma->anon_vma)) {
- if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+ count_vm_spf_event(SPF_ABORT_ANON_VMA);
return VM_FAULT_RETRY;
+ }
if (__anon_vma_prepare(vma))
goto oom;
}
@@ -3670,8 +3690,10 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
pte_unmap_unlock(vmf->pte, vmf->ptl);
if (page)
put_page(page);
- if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+ count_vm_spf_event(SPF_ABORT_USERFAULTFD);
return VM_FAULT_RETRY;
+ }
return handle_userfault(vmf, VM_UFFD_MISSING);
}

@@ -3712,7 +3734,8 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
rcu_read_lock();
- if (!mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq)) {
+ if (!mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq,
+ SPF_ABORT_FAULT)) {
ret = VM_FAULT_RETRY;
} else {
/*
@@ -4042,7 +4065,8 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
rcu_read_lock();
#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
- if (!mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq)) {
+ if (!mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq,
+ SPF_ABORT_FAULT)) {
rcu_read_unlock();
return VM_FAULT_RETRY;
}
@@ -4091,8 +4115,10 @@ static vm_fault_t do_cow_fault(struct vm_fault *vmf)
vm_fault_t ret;

if (unlikely(!vma->anon_vma)) {
- if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+ count_vm_spf_event(SPF_ABORT_ANON_VMA);
return VM_FAULT_RETRY;
+ }
if (__anon_vma_prepare(vma))
return VM_FAULT_OOM;
}
@@ -4178,6 +4204,9 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
struct mm_struct *vm_mm = vma->vm_mm;
vm_fault_t ret;

+ if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+ count_vm_spf_event(SPF_ATTEMPT_FILE);
+
/*
* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND
*/
@@ -4251,6 +4280,9 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
bool was_writable = pte_savedwrite(vmf->orig_pte);
int flags = 0;

+ if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+ count_vm_spf_event(SPF_ATTEMPT_NUMA);
+
/*
* The "pte" at this point cannot be used safely without
* validation through pte_unmap_same(). It's of NUMA type but
@@ -4423,6 +4455,9 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
return do_numa_page(vmf);

+ if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+ count_vm_spf_event(SPF_ATTEMPT_PTE);
+
if (!pte_spinlock(vmf))
return VM_FAULT_RETRY;
entry = vmf->orig_pte;
@@ -4490,20 +4525,26 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
local_irq_disable();
pgd = pgd_offset(mm, address);
pgdval = READ_ONCE(*pgd);
- if (pgd_none(pgdval) || unlikely(pgd_bad(pgdval)))
+ if (pgd_none(pgdval) || unlikely(pgd_bad(pgdval))) {
+ count_vm_spf_event(SPF_ABORT_PUD);
goto spf_fail;
+ }

p4d = p4d_offset(pgd, address);
p4dval = READ_ONCE(*p4d);
- if (p4d_none(p4dval) || unlikely(p4d_bad(p4dval)))
+ if (p4d_none(p4dval) || unlikely(p4d_bad(p4dval))) {
+ count_vm_spf_event(SPF_ABORT_PUD);
goto spf_fail;
+ }

vmf.pud = pud_offset(p4d, address);
pudval = READ_ONCE(*vmf.pud);
if (pud_none(pudval) || unlikely(pud_bad(pudval)) ||
unlikely(pud_trans_huge(pudval)) ||
- unlikely(pud_devmap(pudval)))
+ unlikely(pud_devmap(pudval))) {
+ count_vm_spf_event(SPF_ABORT_PUD);
goto spf_fail;
+ }

vmf.pmd = pmd_offset(vmf.pud, address);
vmf.orig_pmd = READ_ONCE(*vmf.pmd);
@@ -4521,8 +4562,10 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
if (unlikely(pmd_none(vmf.orig_pmd) ||
is_swap_pmd(vmf.orig_pmd) ||
pmd_trans_huge(vmf.orig_pmd) ||
- pmd_devmap(vmf.orig_pmd)))
+ pmd_devmap(vmf.orig_pmd))) {
+ count_vm_spf_event(SPF_ABORT_PMD);
goto spf_fail;
+ }

/*
* The above does not allocate/instantiate page-tables because
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 9ae1c27a549e..ac4ff4343a49 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1369,6 +1369,30 @@ const char * const vmstat_text[] = {
"spf_attempt",
"spf_abort",
#endif
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT_STATS
+ "SPF_ABORT_ODD",
+ "SPF_ABORT_UNMAPPED",
+ "SPF_ABORT_NO_SPECULATE",
+ "SPF_ABORT_VMA_COPY",
+ "SPF_ABORT_ACCESS_ERROR",
+ "SPF_ABORT_PUD",
+ "SPF_ABORT_PMD",
+ "SPF_ABORT_ANON_VMA",
+ "SPF_ABORT_PTE_MAP_LOCK_SEQ1",
+ "SPF_ABORT_PTE_MAP_LOCK_PMD",
+ "SPF_ABORT_PTE_MAP_LOCK_PTL",
+ "SPF_ABORT_PTE_MAP_LOCK_SEQ2",
+ "SPF_ABORT_USERFAULTFD",
+ "SPF_ABORT_FAULT",
+ "SPF_ABORT_NON_SWAP_ENTRY",
+ "SPF_ABORT_SWAP_NOPAGE",
+ "SPF_ATTEMPT_ANON",
+ "SPF_ATTEMPT_FILE",
+ "SPF_ATTEMPT_SWAP",
+ "SPF_ATTEMPT_NUMA",
+ "SPF_ATTEMPT_PTE",
+ "SPF_ATTEMPT_WP",
+#endif
#endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
};
#endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
--
2.20.1

2021-04-07 17:20:10

by Michel Lespinasse

[permalink] [raw]
Subject: [RFC PATCH 13/37] mm: implement speculative handling in __handle_mm_fault().

The page table tree is walked with local irqs disabled, which prevents
page table reclamation (similarly to what fast GUP does). The logic is
otherwise similar to the non-speculative path, but with additional
restrictions: in the speculative path, we do not handle huge pages or
wire new page tables.
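
Condensed, the speculative walk added below looks like this (a sketch;
error paths and the pte_none handling are omitted):

	local_irq_disable();
	/* read each level with READ_ONCE(); bail out with VM_FAULT_RETRY
	 * if any level is none/bad, or if the pud/pmd is huge or devmap */
	vmf.orig_pmd = READ_ONCE(*vmf.pmd);
	vmf.pte = pte_offset_map(vmf.pmd, address);
	vmf.orig_pte = READ_ONCE(*vmf.pte);
	local_irq_enable();
	return handle_pte_fault(&vmf);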

Signed-off-by: Michel Lespinasse <[email protected]>
---
include/linux/mm.h | 4 +++
mm/memory.c | 77 ++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 79 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d5988e78e6ab..dee8a4833779 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -525,6 +525,10 @@ struct vm_fault {
};
unsigned int flags; /* FAULT_FLAG_xxx flags
* XXX: should really be 'const' */
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ unsigned long seq;
+ pmd_t orig_pmd;
+#endif
pmd_t *pmd; /* Pointer to pmd entry matching
* the 'address' */
pud_t *pud; /* Pointer to pud entry matching
diff --git a/mm/memory.c b/mm/memory.c
index 66e7a4554c54..a17704aac019 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4307,7 +4307,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
* return value. See filemap_fault() and __lock_page_or_retry().
*/
static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
- unsigned long address, unsigned int flags)
+ unsigned long address, unsigned int flags, unsigned long seq)
{
struct vm_fault vmf = {
.vma = vma,
@@ -4322,6 +4322,79 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
p4d_t *p4d;
vm_fault_t ret;

+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ if (flags & FAULT_FLAG_SPECULATIVE) {
+ pgd_t pgdval;
+ p4d_t p4dval;
+ pud_t pudval;
+
+ vmf.seq = seq;
+
+ local_irq_disable();
+ pgd = pgd_offset(mm, address);
+ pgdval = READ_ONCE(*pgd);
+ if (pgd_none(pgdval) || unlikely(pgd_bad(pgdval)))
+ goto spf_fail;
+
+ p4d = p4d_offset(pgd, address);
+ p4dval = READ_ONCE(*p4d);
+ if (p4d_none(p4dval) || unlikely(p4d_bad(p4dval)))
+ goto spf_fail;
+
+ vmf.pud = pud_offset(p4d, address);
+ pudval = READ_ONCE(*vmf.pud);
+ if (pud_none(pudval) || unlikely(pud_bad(pudval)) ||
+ unlikely(pud_trans_huge(pudval)) ||
+ unlikely(pud_devmap(pudval)))
+ goto spf_fail;
+
+ vmf.pmd = pmd_offset(vmf.pud, address);
+ vmf.orig_pmd = READ_ONCE(*vmf.pmd);
+
+ /*
+ * pmd_none could mean that a hugepage collapse is in
+ * progress behind our back, as collapse_huge_page() marks
+ * it before invalidating the pte (which is done once the
+ * IPI has been caught by all CPUs and interrupts are
+ * disabled). For this reason we cannot handle THP in a
+ * speculative way, since we can't safely identify an
+ * in-progress collapse operation done behind our back on
+ * that PMD.
+ */
+ if (unlikely(pmd_none(vmf.orig_pmd) ||
+ is_swap_pmd(vmf.orig_pmd) ||
+ pmd_trans_huge(vmf.orig_pmd) ||
+ pmd_devmap(vmf.orig_pmd)))
+ goto spf_fail;
+
+ /*
+ * The above does not allocate/instantiate page-tables because
+ * doing so would lead to the possibility of instantiating
+ * page-tables after free_pgtables() -- and consequently
+ * leaking them.
+ *
+ * The result is that we take at least one non-speculative
+ * fault per PMD in order to instantiate it.
+ */
+
+ vmf.pte = pte_offset_map(vmf.pmd, address);
+ vmf.orig_pte = READ_ONCE(*vmf.pte);
+ barrier();
+ if (pte_none(vmf.orig_pte)) {
+ pte_unmap(vmf.pte);
+ vmf.pte = NULL;
+ }
+
+ local_irq_enable();
+
+ return handle_pte_fault(&vmf);
+
+spf_fail:
+ local_irq_enable();
+ return VM_FAULT_RETRY;
+ }
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
pgd = pgd_offset(mm, address);
p4d = p4d_alloc(mm, pgd, address);
if (!p4d)
@@ -4541,7 +4614,7 @@ vm_fault_t do_handle_mm_fault(struct vm_area_struct *vma,
if (unlikely(is_vm_hugetlb_page(vma)))
ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
else
- ret = __handle_mm_fault(vma, address, flags);
+ ret = __handle_mm_fault(vma, address, flags, seq);

if (flags & FAULT_FLAG_USER) {
mem_cgroup_exit_user_fault();
--
2.20.1

2021-04-07 17:20:10

by Michel Lespinasse

[permalink] [raw]
Subject: [RFC PATCH 34/37] mm: rcu safe vma freeing only for multithreaded user space

Performance tuning: as single-threaded userspace does not use
speculative page faults, it does not require rcu safe vma freeing.
Turn this off to avoid the related (small) extra overheads.

For multi-threaded userspace, we often see a performance benefit from
the rcu safe vma freeing - even in tests that do not have any frequent
concurrent page faults! This is because rcu safe vma freeing prevents
recently released vmas from being immediately reused in a new thread.

Signed-off-by: Michel Lespinasse <[email protected]>
---
kernel/fork.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 2f20a5c5fed8..623875e8e742 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -389,10 +389,12 @@ static void __vm_area_free(struct rcu_head *head)
void vm_area_free(struct vm_area_struct *vma)
{
#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
- call_rcu(&vma->vm_rcu, __vm_area_free);
-#else
- ____vm_area_free(vma);
+ if (atomic_read(&vma->vm_mm->mm_users) > 1) {
+ call_rcu(&vma->vm_rcu, __vm_area_free);
+ return;
+ }
#endif
+ ____vm_area_free(vma);
}

static void account_kernel_stack(struct task_struct *tsk, int account)
--
2.20.1

2021-04-07 17:20:12

by Michel Lespinasse

[permalink] [raw]
Subject: [RFC PATCH 10/37] mm: rcu safe vma freeing

This prepares for speculative page faults looking up and copying vmas
under protection of an rcu read lock, instead of the usual mmap read lock.

Signed-off-by: Michel Lespinasse <[email protected]>
---
include/linux/mm_types.h | 16 +++++++++++-----
kernel/fork.c | 11 ++++++++++-
2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 70882e628908..024970635921 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -304,12 +304,18 @@ struct vm_userfaultfd_ctx {};
struct vm_area_struct {
/* The first cache line has the info for VMA tree walking. */

- unsigned long vm_start; /* Our start address within vm_mm. */
- unsigned long vm_end; /* The first byte after our end address
- within vm_mm. */
+ union {
+ struct {
+ /* VMA covers [vm_start; vm_end) addresses within mm */
+ unsigned long vm_start, vm_end;

- /* linked list of VM areas per task, sorted by address */
- struct vm_area_struct *vm_next, *vm_prev;
+ /* linked list of VMAs per task, sorted by address */
+ struct vm_area_struct *vm_next, *vm_prev;
+ };
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ struct rcu_head vm_rcu; /* Used for deferred freeing. */
+#endif
+ };

struct rb_node vm_rb;

diff --git a/kernel/fork.c b/kernel/fork.c
index 426cd0c51f9e..b6078e546114 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -369,11 +369,20 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
return new;
}

-void vm_area_free(struct vm_area_struct *vma)
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+static void __vm_area_free(struct rcu_head *head)
{
+ struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
+ vm_rcu);
kmem_cache_free(vm_area_cachep, vma);
}

+void vm_area_free(struct vm_area_struct *vma)
+{
+ call_rcu(&vma->vm_rcu, __vm_area_free);
+}
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
static void account_kernel_stack(struct task_struct *tsk, int account)
{
void *stack = task_stack_page(tsk);
--
2.20.1

2021-04-07 17:20:15

by Michel Lespinasse

[permalink] [raw]
Subject: [RFC PATCH 26/37] mm: implement speculative fault handling in finish_fault()

In the speculative case, we want to avoid direct pmd checks (which
would require some extra synchronization to be safe), and instead rely
on pte_map_lock(), which will both lock the page table and verify that
the pmd has not changed from its initial value.
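
The contract this relies on (a usage sketch; pte_map_lock() is defined
elsewhere in the series):

	if (!pte_map_lock(vmf))
		return VM_FAULT_RETRY;	/* pmd changed or mmap seq check failed */
	/* vmf->pte is mapped and locked against the pmd seen during the walk */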

Signed-off-by: Michel Lespinasse <[email protected]>
---
mm/memory.c | 28 +++++++++++++++-------------
1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7139004c624d..13e2aaf900e5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3915,23 +3915,25 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
return ret;
}

- if (pmd_none(*vmf->pmd)) {
- if (PageTransCompound(page)) {
- ret = do_set_pmd(vmf, page);
- if (ret != VM_FAULT_FALLBACK)
- return ret;
+ if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
+ if (pmd_none(*vmf->pmd)) {
+ if (PageTransCompound(page)) {
+ ret = do_set_pmd(vmf, page);
+ if (ret != VM_FAULT_FALLBACK)
+ return ret;
+ }
+
+ if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
+ return VM_FAULT_OOM;
}

- if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
- return VM_FAULT_OOM;
+ /* See comment in __handle_mm_fault() */
+ if (pmd_devmap_trans_unstable(vmf->pmd))
+ return 0;
}

- /* See comment in __handle_mm_fault() */
- if (pmd_devmap_trans_unstable(vmf->pmd))
- return 0;
-
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
- vmf->address, &vmf->ptl);
+ if (!pte_map_lock(vmf))
+ return VM_FAULT_RETRY;
ret = 0;
/* Re-check under ptl */
if (likely(pte_none(*vmf->pte)))
--
2.20.1

2021-04-07 17:20:15

by Michel Lespinasse

[permalink] [raw]
Subject: [RFC PATCH 09/37] mm: add per-mm mmap sequence counter for speculative page fault handling.

The counter's write side is hooked into the existing mmap locking API:
mmap_write_lock() increments the counter to the next (odd) value, and
mmap_write_unlock() increments it again to the next (even) value.

The counter's speculative read side is supposed to be used as follows:

seq = mmap_seq_read_start(mm);
if (seq & 1)
goto fail;
.... speculative handling here ....
if (!mmap_seq_read_check(mm, seq))
goto fail;

This API guarantees that, if none of the "fail" tests abort
speculative execution, the speculative code section did not run
concurrently with any mmap writer.

This is very similar to a seqlock, but both the writer and speculative
readers are allowed to block. In the fail case, the speculative reader
does not spin on the sequence counter; instead it should fall back to
a different mechanism such as grabbing the mmap lock read side.
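
For symmetry with the reader-side sketch above, the write side (which the
hooks below implement) boils down to:

	mmap_write_lock(mm);	/* mmap_seq++  ->  odd: writer in progress */
	/* ... modify the address space ... */
	mmap_write_unlock(mm);	/* mmap_seq++  ->  even: no writer running */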

Signed-off-by: Michel Lespinasse <[email protected]>
---
include/linux/mm_types.h | 4 +++
include/linux/mmap_lock.h | 58 +++++++++++++++++++++++++++++++++++++--
2 files changed, 60 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6613b26a8894..70882e628908 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -461,6 +461,10 @@ struct mm_struct {
* counters
*/
struct rw_semaphore mmap_lock;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ unsigned long mmap_seq;
+#endif
+

struct list_head mmlist; /* List of maybe swapped mm's. These
* are globally strung together off
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 8ff276a7560e..8f4eca2d0f43 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -8,8 +8,16 @@
#include <linux/tracepoint-defs.h>
#include <linux/types.h>

-#define MMAP_LOCK_INITIALIZER(name) \
- .mmap_lock = __RWSEM_INITIALIZER((name).mmap_lock),
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+#define MMAP_LOCK_SEQ_INITIALIZER(name) \
+ .mmap_seq = 0,
+#else
+#define MMAP_LOCK_SEQ_INITIALIZER(name)
+#endif
+
+#define MMAP_LOCK_INITIALIZER(name) \
+ .mmap_lock = __RWSEM_INITIALIZER((name).mmap_lock), \
+ MMAP_LOCK_SEQ_INITIALIZER(name)

DECLARE_TRACEPOINT(mmap_lock_start_locking);
DECLARE_TRACEPOINT(mmap_lock_acquire_returned);
@@ -63,13 +71,52 @@ static inline void __mmap_lock_trace_released(struct mm_struct *mm, bool write)
static inline void mmap_init_lock(struct mm_struct *mm)
{
init_rwsem(&mm->mmap_lock);
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ mm->mmap_seq = 0;
+#endif
}

+static inline void __mmap_seq_write_lock(struct mm_struct *mm)
+{
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ VM_BUG_ON_MM(mm->mmap_seq & 1, mm);
+ mm->mmap_seq++;
+ smp_wmb();
+#endif
+}
+
+static inline void __mmap_seq_write_unlock(struct mm_struct *mm)
+{
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ smp_wmb();
+ mm->mmap_seq++;
+ VM_BUG_ON_MM(mm->mmap_seq & 1, mm);
+#endif
+}
+
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+static inline unsigned long mmap_seq_read_start(struct mm_struct *mm)
+{
+ unsigned long seq;
+
+ seq = READ_ONCE(mm->mmap_seq);
+ smp_rmb();
+ return seq;
+}
+
+static inline bool mmap_seq_read_check(struct mm_struct *mm, unsigned long seq)
+{
+ smp_rmb();
+ return seq == READ_ONCE(mm->mmap_seq);
+}
+#endif
+
static inline void mmap_write_lock(struct mm_struct *mm)
{
__mmap_lock_trace_start_locking(mm, true);
down_write(&mm->mmap_lock);
__mmap_lock_trace_acquire_returned(mm, true, true);
+ __mmap_seq_write_lock(mm);
}

static inline void mmap_write_lock_nested(struct mm_struct *mm, int subclass)
@@ -77,6 +124,7 @@ static inline void mmap_write_lock_nested(struct mm_struct *mm, int subclass)
__mmap_lock_trace_start_locking(mm, true);
down_write_nested(&mm->mmap_lock, subclass);
__mmap_lock_trace_acquire_returned(mm, true, true);
+ __mmap_seq_write_lock(mm);
}

static inline int mmap_write_lock_killable(struct mm_struct *mm)
@@ -86,6 +134,8 @@ static inline int mmap_write_lock_killable(struct mm_struct *mm)
__mmap_lock_trace_start_locking(mm, true);
error = down_write_killable(&mm->mmap_lock);
__mmap_lock_trace_acquire_returned(mm, true, !error);
+ if (likely(!error))
+ __mmap_seq_write_lock(mm);
return error;
}

@@ -96,17 +146,21 @@ static inline bool mmap_write_trylock(struct mm_struct *mm)
__mmap_lock_trace_start_locking(mm, true);
ok = down_write_trylock(&mm->mmap_lock) != 0;
__mmap_lock_trace_acquire_returned(mm, true, ok);
+ if (likely(ok))
+ __mmap_seq_write_lock(mm);
return ok;
}

static inline void mmap_write_unlock(struct mm_struct *mm)
{
+ __mmap_seq_write_unlock(mm);
up_write(&mm->mmap_lock);
__mmap_lock_trace_released(mm, true);
}

static inline void mmap_write_downgrade(struct mm_struct *mm)
{
+ __mmap_seq_write_unlock(mm);
downgrade_write(&mm->mmap_lock);
__mmap_lock_trace_acquire_returned(mm, false, true);
}
--
2.20.1

2021-04-07 17:20:16

by Michel Lespinasse

[permalink] [raw]
Subject: [RFC PATCH 12/37] mm: refactor __handle_mm_fault() / handle_pte_fault()

Move the code that initializes vmf->pte and vmf->orig_pte from
handle_pte_fault() to its single call site in __handle_mm_fault().

This ensures vmf->pte is now initialized together with the higher levels
of the page table hierarchy. This also prepares for speculative page fault
handling, where the entire page table walk (higher levels down to ptes)
needs special care in the speculative case.

Signed-off-by: Michel Lespinasse <[email protected]>
---
mm/memory.c | 98 ++++++++++++++++++++++++++---------------------------
1 file changed, 49 insertions(+), 49 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 3691be1f1319..66e7a4554c54 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3516,7 +3516,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
if (pte_alloc(vma->vm_mm, vmf->pmd))
return VM_FAULT_OOM;

- /* See comment in handle_pte_fault() */
+ /* See comment in __handle_mm_fault() */
if (unlikely(pmd_trans_unstable(vmf->pmd)))
return 0;

@@ -3797,7 +3797,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
return VM_FAULT_OOM;
}

- /* See comment in handle_pte_fault() */
+ /* See comment in __handle_mm_fault() */
if (pmd_devmap_trans_unstable(vmf->pmd))
return 0;

@@ -4253,53 +4253,6 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
{
pte_t entry;

- if (unlikely(pmd_none(*vmf->pmd))) {
- /*
- * Leave __pte_alloc() until later: because vm_ops->fault may
- * want to allocate huge page, and if we expose page table
- * for an instant, it will be difficult to retract from
- * concurrent faults and from rmap lookups.
- */
- vmf->pte = NULL;
- } else {
- /*
- * If a huge pmd materialized under us just retry later. Use
- * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead
- * of pmd_trans_huge() to ensure the pmd didn't become
- * pmd_trans_huge under us and then back to pmd_none, as a
- * result of MADV_DONTNEED running immediately after a huge pmd
- * fault in a different thread of this mm, in turn leading to a
- * misleading pmd_trans_huge() retval. All we have to ensure is
- * that it is a regular pmd that we can walk with
- * pte_offset_map() and we can do that through an atomic read
- * in C, which is what pmd_trans_unstable() provides.
- */
- if (pmd_devmap_trans_unstable(vmf->pmd))
- return 0;
- /*
- * A regular pmd is established and it can't morph into a huge
- * pmd from under us anymore at this point because we hold the
- * mmap_lock read mode and khugepaged takes it in write mode.
- * So now it's safe to run pte_offset_map().
- */
- vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
- vmf->orig_pte = *vmf->pte;
-
- /*
- * some architectures can have larger ptes than wordsize,
- * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
- * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic
- * accesses. The code below just needs a consistent view
- * for the ifs and we later double check anyway with the
- * ptl lock held. So here a barrier will do.
- */
- barrier();
- if (pte_none(vmf->orig_pte)) {
- pte_unmap(vmf->pte);
- vmf->pte = NULL;
- }
- }
-
if (!vmf->pte) {
if (vma_is_anonymous(vmf->vma))
return do_anonymous_page(vmf);
@@ -4439,6 +4392,53 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
}
}

+ if (unlikely(pmd_none(*vmf.pmd))) {
+ /*
+ * Leave __pte_alloc() until later: because vm_ops->fault may
+ * want to allocate huge page, and if we expose page table
+ * for an instant, it will be difficult to retract from
+ * concurrent faults and from rmap lookups.
+ */
+ vmf.pte = NULL;
+ } else {
+ /*
+ * If a huge pmd materialized under us just retry later. Use
+ * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead
+ * of pmd_trans_huge() to ensure the pmd didn't become
+ * pmd_trans_huge under us and then back to pmd_none, as a
+ * result of MADV_DONTNEED running immediately after a huge pmd
+ * fault in a different thread of this mm, in turn leading to a
+ * misleading pmd_trans_huge() retval. All we have to ensure is
+ * that it is a regular pmd that we can walk with
+ * pte_offset_map() and we can do that through an atomic read
+ * in C, which is what pmd_trans_unstable() provides.
+ */
+ if (pmd_devmap_trans_unstable(vmf.pmd))
+ return 0;
+ /*
+ * A regular pmd is established and it can't morph into a huge
+ * pmd from under us anymore at this point because we hold the
+ * mmap_lock read mode and khugepaged takes it in write mode.
+ * So now it's safe to run pte_offset_map().
+ */
+ vmf.pte = pte_offset_map(vmf.pmd, vmf.address);
+ vmf.orig_pte = *vmf.pte;
+
+ /*
+ * some architectures can have larger ptes than wordsize,
+ * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
+ * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic
+ * accesses. The code below just needs a consistent view
+ * for the ifs and we later double check anyway with the
+ * ptl lock held. So here a barrier will do.
+ */
+ barrier();
+ if (pte_none(vmf.orig_pte)) {
+ pte_unmap(vmf.pte);
+ vmf.pte = NULL;
+ }
+ }
+
return handle_pte_fault(&vmf);
}

--
2.20.1

2021-04-07 17:20:16

by Michel Lespinasse

[permalink] [raw]
Subject: [RFC PATCH 11/37] x86/mm: attempt speculative mm faults first

Attempt speculative mm fault handling first, and fall back to the
existing (non-speculative) code if that fails.

The speculative handling closely mirrors the non-speculative logic.
This includes some x86 specific bits such as the access_error() call.
This is why we chose to implement the speculative handling in arch/x86
rather than in common code.

The vma is first looked up and copied, under protection of the rcu
read lock. The mmap lock sequence count is used to verify the
integrity of the copied vma, and passed to do_handle_mm_fault() to
allow checking against races with mmap writers when finalizing the fault.

Signed-off-by: Michel Lespinasse <[email protected]>
---
arch/x86/mm/fault.c | 36 +++++++++++++++++++++++++++++++++++
include/linux/vm_event_item.h | 4 ++++
mm/vmstat.c | 4 ++++
3 files changed, 44 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index a73347e2cdfc..f8c8e325af77 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1219,6 +1219,8 @@ void do_user_addr_fault(struct pt_regs *regs,
struct mm_struct *mm;
vm_fault_t fault;
unsigned int flags = FAULT_FLAG_DEFAULT;
+ struct vm_area_struct pvma;
+ unsigned long seq;

tsk = current;
mm = tsk->mm;
@@ -1316,6 +1318,39 @@ void do_user_addr_fault(struct pt_regs *regs,
}
#endif

+ count_vm_event(SPF_ATTEMPT);
+ seq = mmap_seq_read_start(mm);
+ if (seq & 1)
+ goto spf_abort;
+ rcu_read_lock();
+ vma = find_vma(mm, address);
+ if (!vma || vma->vm_start > address) {
+ rcu_read_unlock();
+ goto spf_abort;
+ }
+ pvma = *vma;
+ rcu_read_unlock();
+ if (!mmap_seq_read_check(mm, seq))
+ goto spf_abort;
+ vma = &pvma;
+ if (unlikely(access_error(error_code, vma)))
+ goto spf_abort;
+ fault = do_handle_mm_fault(vma, address,
+ flags | FAULT_FLAG_SPECULATIVE, seq, regs);
+
+ /* Quick path to respond to signals */
+ if (fault_signal_pending(fault, regs)) {
+ if (!user_mode(regs))
+ kernelmode_fixup_or_oops(regs, error_code, address,
+ SIGBUS, BUS_ADRERR);
+ return;
+ }
+ if (!(fault & VM_FAULT_RETRY))
+ goto done;
+
+spf_abort:
+ count_vm_event(SPF_ABORT);
+
/*
* Kernel-mode access to the user address space should only occur
* on well-defined single instructions listed in the exception
@@ -1412,6 +1447,7 @@ void do_user_addr_fault(struct pt_regs *regs,
}

mmap_read_unlock(mm);
+done:
if (likely(!(fault & VM_FAULT_ERROR)))
return;

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 18e75974d4e3..cc4f8d14e43f 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -120,6 +120,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
#ifdef CONFIG_SWAP
SWAP_RA,
SWAP_RA_HIT,
+#endif
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ SPF_ATTEMPT,
+ SPF_ABORT,
#endif
NR_VM_EVENT_ITEMS
};
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 74b2c374b86c..9ae1c27a549e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1365,6 +1365,10 @@ const char * const vmstat_text[] = {
"swap_ra",
"swap_ra_hit",
#endif
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ "spf_attempt",
+ "spf_abort",
+#endif
#endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
};
#endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
--
2.20.1

2021-04-07 17:20:54

by Michel Lespinasse

[permalink] [raw]
Subject: [RFC PATCH 28/37] mm: implement speculative handling in filemap_map_pages()

In the speculative case, we know the page table already exists, and it
must be locked with pte_map_lock(). In the case where no page is found
for the given address, return VM_FAULT_RETRY which will abort the
fault before we get into the vm_ops->fault() callback. This is fine
because if filemap_map_pages does not find the page in page cache,
vm_ops->fault() will not either.

Initialize addr and last_pgoff to correspond to the pte at the original
fault address (which was mapped with pte_map_lock()), rather than the
pte at start_pgoff. The choice of initial values doesn't matter as
they will all be adjusted together before use, so they just need to be
consistent with each other, and using the original fault address and
pte allows us to reuse pte_map_lock() without any changes to it.

Signed-off-by: Michel Lespinasse <[email protected]>
---
mm/filemap.c | 27 ++++++++++++++++-----------
1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 6e8505fe5df9..d496771749e6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3136,25 +3136,31 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
struct vm_area_struct *vma = vmf->vma;
struct file *file = vma->vm_file;
struct address_space *mapping = file->f_mapping;
- pgoff_t last_pgoff = start_pgoff;
+ pgoff_t last_pgoff;
unsigned long addr;
XA_STATE(xas, &mapping->i_pages, start_pgoff);
struct page *head, *page;
unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
- vm_fault_t ret = 0;
+ vm_fault_t ret = (vmf->flags & FAULT_FLAG_SPECULATIVE) ?
+ VM_FAULT_RETRY : 0;

- rcu_read_lock();
+ /* filemap_map_pages() is called within an rcu read lock already. */
head = first_map_page(mapping, &xas, end_pgoff);
if (!head)
- goto out;
+ return ret;

- if (filemap_map_pmd(vmf, head)) {
- ret = VM_FAULT_NOPAGE;
- goto out;
+ if (!(vmf->flags & FAULT_FLAG_SPECULATIVE) &&
+ filemap_map_pmd(vmf, head))
+ return VM_FAULT_NOPAGE;
+
+ if (!pte_map_lock(vmf)) {
+ unlock_page(head);
+ put_page(head);
+ return VM_FAULT_RETRY;
}
+ addr = vmf->address;
+ last_pgoff = vmf->pgoff;

- addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
do {
page = find_subpage(head, xas.xa_index);
if (PageHWPoison(page))
@@ -3184,8 +3190,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
put_page(head);
} while ((head = next_map_page(mapping, &xas, end_pgoff)) != NULL);
pte_unmap_unlock(vmf->pte, vmf->ptl);
-out:
- rcu_read_unlock();
+ vmf->pte = NULL;
WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
return ret;
}
--
2.20.1

2021-04-07 17:23:09

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [RFC PATCH 03/37] do_anonymous_page: use update_mmu_tlb()

Hi Bibo,

You introduced this code in commit 7df676974359f back in May.
Could you check that the change is correct?

Thanks,

On Tue, Apr 06, 2021 at 06:44:28PM -0700, Michel Lespinasse wrote:
> update_mmu_tlb() can be used instead of update_mmu_cache() when the
> page fault handler detects that it lost the race to another page fault.
>
> Signed-off-by: Michel Lespinasse <[email protected]>
> ---
> mm/memory.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 5efa07fb6cdc..8ee4bd239303 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3567,7 +3567,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> &vmf->ptl);
> if (!pte_none(*vmf->pte)) {
> - update_mmu_cache(vma, vmf->address, vmf->pte);
> + update_mmu_tlb(vma, vmf->address, vmf->pte);
> goto release;
> }
>
> --
> 2.20.1
>

2021-04-07 17:23:22

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC PATCH 24/37] mm: implement speculative handling in __do_fault()

On Tue, Apr 06, 2021 at 06:44:49PM -0700, Michel Lespinasse wrote:
> In the speculative case, call the vm_ops->fault() method from within
> an rcu read locked section, and verify the mmap sequence lock at the
> start of the section. A match guarantees that the original vma is still
> valid at that time, and that the associated vma->vm_file stays valid
> while the vm_ops->fault() method is running.
>
> Note that this implies that speculative faults can not sleep within
> the vm_ops->fault method. We will only attempt to fetch existing pages
> from the page cache during speculative faults; any miss (or prefetch)
> will be handled by falling back to non-speculative fault handling.
>
> The speculative handling case also does not preallocate page tables,
> as it is always called with a pre-existing page table.

I still don't understand why you want to do this. The speculative
fault that doesn't do I/O is already here, and it's called ->map_pages
(which I see you also do later). So what's the point of this patch?

2021-04-07 17:23:23

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC PATCH 34/37] mm: rcu safe vma freeing only for multithreaded user space

On Tue, Apr 06, 2021 at 06:44:59PM -0700, Michel Lespinasse wrote:
> Performance tuning: as single-threaded userspace does not use
> speculative page faults, it does not require rcu safe vma freeing.
> Turn this off to avoid the related (small) extra overheads.
>
> For multi-threaded userspace, we often see a performance benefit from
> the rcu safe vma freeing - even in tests that do not have any frequent
> concurrent page faults! This is because rcu safe vma freeing prevents
> recently released vmas from being immediately reused in a new thread.

Why does that provide a performance benefit? Recently released
VMAs are cache-hot, and NUMA-local. I'd expect the RCU delay to be
performance-negative.

2021-04-07 17:23:51

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [RFC PATCH 24/37] mm: implement speculative handling in __do_fault()

On Wed, Apr 07, 2021 at 03:35:27AM +0100, Matthew Wilcox wrote:
> On Tue, Apr 06, 2021 at 06:44:49PM -0700, Michel Lespinasse wrote:
> > In the speculative case, call the vm_ops->fault() method from within
> > an rcu read locked section, and verify the mmap sequence lock at the
> > start of the section. A match guarantees that the original vma is still
> > valid at that time, and that the associated vma->vm_file stays valid
> > while the vm_ops->fault() method is running.
> >
> > Note that this implies that speculative faults can not sleep within
> > the vm_ops->fault method. We will only attempt to fetch existing pages
> > from the page cache during speculative faults; any miss (or prefetch)
> > will be handled by falling back to non-speculative fault handling.
> >
> > The speculative handling case also does not preallocate page tables,
> > as it is always called with a pre-existing page table.
>
> I still don't understand why you want to do this. The speculative
> fault that doesn't do I/O is already here, and it's called ->map_pages
> (which I see you also do later). So what's the point of this patch?

I have to admit I did not give much thought to which path would be
generally most common here.

The speculative vm_ops->fault path would be used:
- for private mapping write faults,
- when fault-around is disabled (probably an uncommon case in general,
but actually common at Google).

That said, I do think your point makes sense in general, especially if
this could help avoid the per-filesystem enable bit.

2021-04-07 17:23:54

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC PATCH 24/37] mm: implement speculative handling in __do_fault()

On Tue, Apr 06, 2021 at 07:53:20PM -0700, Michel Lespinasse wrote:
> On Wed, Apr 07, 2021 at 03:35:27AM +0100, Matthew Wilcox wrote:
> > On Tue, Apr 06, 2021 at 06:44:49PM -0700, Michel Lespinasse wrote:
> > > In the speculative case, call the vm_ops->fault() method from within
> > > an rcu read locked section, and verify the mmap sequence lock at the
> > > start of the section. A match guarantees that the original vma is still
> > > valid at that time, and that the associated vma->vm_file stays valid
> > > while the vm_ops->fault() method is running.
> > >
> > > Note that this implies that speculative faults can not sleep within
> > > the vm_ops->fault method. We will only attempt to fetch existing pages
> > > from the page cache during speculative faults; any miss (or prefetch)
> > > will be handled by falling back to non-speculative fault handling.
> > >
> > > The speculative handling case also does not preallocate page tables,
> > > as it is always called with a pre-existing page table.
> >
> > I still don't understand why you want to do this. The speculative
> > fault that doesn't do I/O is already here, and it's called ->map_pages
> > (which I see you also do later). So what's the point of this patch?
>
> I have to admit I did not give much thought to which path would be
> generally most common here.
>
> The speculative vm_ops->fault path would be used:
> - for private mapping write faults,
> - when fault-around is disabled (probably an uncommon case in general,
> but actually common at Google).

Why is it disabled? The PTE table is already being allocated and filled
... why not quickly check the page cache to see if there are other pages
within this 2MB range and fill in their PTEs too? Even if only one
of them is ever hit, the reduction in page faults is surely worth it.
Obviously if your workload has such non-locality that you hit only one
page in a 2MB range and then no other, it'll lose ... but then you have
a really badly designed workload!

> That said, I do think your point makes sense in general, especially if
> this could help avoid the per-filesystem enable bit.

2021-04-07 21:06:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 24/37] mm: implement speculative handling in __do_fault()

On Tue, Apr 06, 2021 at 06:44:49PM -0700, Michel Lespinasse wrote:
> In the speculative case, call the vm_ops->fault() method from within
> an rcu read locked section, and verify the mmap sequence lock at the
> start of the section. A match guarantees that the original vma is still
> valid at that time, and that the associated vma->vm_file stays valid
> while the vm_ops->fault() method is running.
>
> Note that this implies that speculative faults can not sleep within
> the vm_ops->fault method. We will only attempt to fetch existing pages
> from the page cache during speculative faults; any miss (or prefetch)
> will be handled by falling back to non-speculative fault handling.
>
> The speculative handling case also does not preallocate page tables,
> as it is always called with a pre-existing page table.

So what's wrong with SRCU? Laurent mumbled something about frequent
SRCU kthread activity being a problem; is that still so and is that
fundamentally unfixable?

Because to me it seems a much more natural solution to the whole thing.

2021-04-07 21:23:26

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH 13/37] mm: implement speculative handling in __handle_mm_fault().

On 4/6/21 6:44 PM, Michel Lespinasse wrote:
> The page table tree is walked with local irqs disabled, which prevents
> page table reclamation (similarly to what fast GUP does). The logic is
> otherwise similar to the non-speculative path, but with additional
> restrictions: in the speculative path, we do not handle huge pages or
> wire new page tables.

Not on most architectures. Quoting the actual comment in mm/gup.c:

> * Before activating this code, please be aware that the following assumptions
> * are currently made:
> *
> * *) Either MMU_GATHER_RCU_TABLE_FREE is enabled, and tlb_remove_table() is used to
> * free pages containing page tables or TLB flushing requires IPI broadcast.

On MMU_GATHER_RCU_TABLE_FREE architectures, you cannot make the
assumption that it is safe to dereference a pointer in a page table just
because irqs are off. You need RCU protection, too.

You have the same error in the cover letter.

--Andy

2021-04-07 22:00:36

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [RFC PATCH 11/37] x86/mm: attempt speculative mm faults first

On Wed, Apr 07, 2021 at 04:48:44PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 06, 2021 at 06:44:36PM -0700, Michel Lespinasse wrote:
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -1219,6 +1219,8 @@ void do_user_addr_fault(struct pt_regs *regs,
> > struct mm_struct *mm;
> > vm_fault_t fault;
> > unsigned int flags = FAULT_FLAG_DEFAULT;
> > + struct vm_area_struct pvma;
>
> That's 200 bytes on-stack... I suppose that's just about acceptable, but
> perhaps we need a comment in struct vm_area_struct to make people aware
> this thing lives on-stack and size really is an issue now.

Right, I agree that having the vma copy on-stack is not ideal.

I think what really should be done, is to copy just the attributes of
the vma that will be needed during the page fault. Things like vm_mm,
vm_page_prot, vm_flags, vm_ops, vm_pgoff, vm_file, vm_private_data,
vm_policy. We definitely do not need rbtree and rmap fields such as
vm_prev, vm_next, vm_rb, rb_subtree_gap, shared, anon_vma_chain etc...

The reason I did things this way, is because changing the entire fault
handler to use attributes stored in struct vm_fault, rather than in
the original vma, would be quite intrusive. I think it would be a
reasonable step to consider once there is agreement on the rest of the
speculative fault patch set, but it's practical doing it before then.

2021-04-07 22:00:56

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [RFC PATCH 11/37] x86/mm: attempt speculative mm faults first

On Wed, Apr 07, 2021 at 01:14:53PM -0700, Michel Lespinasse wrote:
> On Wed, Apr 07, 2021 at 04:48:44PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 06, 2021 at 06:44:36PM -0700, Michel Lespinasse wrote:
> > > --- a/arch/x86/mm/fault.c
> > > +++ b/arch/x86/mm/fault.c
> > > @@ -1219,6 +1219,8 @@ void do_user_addr_fault(struct pt_regs *regs,
> > > struct mm_struct *mm;
> > > vm_fault_t fault;
> > > unsigned int flags = FAULT_FLAG_DEFAULT;
> > > + struct vm_area_struct pvma;
> >
> > That's 200 bytes on-stack... I suppose that's just about acceptable, but
> > perhaps we need a comment in struct vm_area_struct to make people aware
> > this thing lives on-stack and size really is an issue now.
>
> Right, I agree that having the vma copy on-stack is not ideal.
>
> I think what really should be done, is to copy just the attributes of
> the vma that will be needed during the page fault. Things like vm_mm,
> vm_page_prot, vm_flags, vm_ops, vm_pgoff, vm_file, vm_private_data,
> vm_policy. We definitely do not need rbtree and rmap fields such as
> vm_prev, vm_next, vm_rb, rb_subtree_gap, shared, anon_vma_chain etc...
>
> The reason I did things this way, is because changing the entire fault
> handler to use attributes stored in struct vm_fault, rather than in
> the original vma, would be quite intrusive. I think it would be a
> reasonable step to consider once there is agreement on the rest of the
> speculative fault patch set, but it's practical doing it before then.

I meant it's NOT practical using attributes rather than a vma copy
until there is sufficient agreement to merge the patchset.
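
To illustrate the direction, the copied state could eventually shrink to
something like this (a hypothetical sketch, not part of the current
patchset; the struct name is made up, and a few extra fields such as
vm_start/vm_end would realistically be needed as well):

struct spf_vma_attrs {                  /* hypothetical, for illustration only */
        struct mm_struct *vm_mm;
        pgprot_t vm_page_prot;
        unsigned long vm_flags;
        const struct vm_operations_struct *vm_ops;
        unsigned long vm_pgoff;
        struct file *vm_file;
        void *vm_private_data;
#ifdef CONFIG_NUMA
        struct mempolicy *vm_policy;
#endif
};

That would bring the on-stack footprint well below the 200 bytes being
discussed, at the cost of touching every place in the fault path that
currently dereferences the vma directly.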

2021-04-07 22:04:23

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [RFC PATCH 09/37] mm: add per-mm mmap sequence counter for speculative page fault handling.

On Wed, Apr 07, 2021 at 04:47:34PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 06, 2021 at 06:44:34PM -0700, Michel Lespinasse wrote:
> > The counter's write side is hooked into the existing mmap locking API:
> > mmap_write_lock() increments the counter to the next (odd) value, and
> > mmap_write_unlock() increments it again to the next (even) value.
> >
> > The counter's speculative read side is supposed to be used as follows:
> >
> > seq = mmap_seq_read_start(mm);
> > if (seq & 1)
> > goto fail;
> > .... speculative handling here ....
> > if (!mmap_seq_read_check(mm, seq)
> > goto fail;
> >
> > This API guarantees that, if none of the "fail" tests abort
> > speculative execution, the speculative code section did not run
> > concurrently with any mmap writer.
>
> So this is obviously safe, but it's also super excessive. Any change,
> anywhere, will invalidate and abort a SPF.
>
> Since you make a complete copy of the vma, you could memcmp it in its
> entirety instead of this.

Yeah, there is a deliberate choice here to start with the simplest
possible approach, but this could lead to more SPF aborts than
strictly necessary.

It's not clear to me that just comparing original vs current vma
attributes would always be sufficient - I think in some cases, we may
also need to worry about attributes being changed back and forth
concurrently with the fault. However, the comparison you suggest would
be safe at least in the case where the original pte was pte_none.

At some point, I did try implementing the optimization you suggest -
if the global counter has changed, re-fetch the current vma and
compare it against the old - but this was basically no help for the
(Android) workloads I looked at. There were just too few aborts that
were caused by the global counter in the first place. Note that my
patchset only handles anon and page cache faults speculatively, so
generally there wasn't very much time for the counter to change.

But yes, this may not hold across all workloads, and maybe we'd want
to do something smarter once/if a problem workload is identified.
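
For reference, that experiment was along these lines (a rough sketch
only; spf_vma_lookup() and vma_spf_equal() are stand-ins for a lockless
re-lookup and an attribute comparison, and as noted above this is only
known to be safe when the original pte was pte_none):

        if (!mmap_seq_read_check(mm, seq)) {
                struct vm_area_struct *cur;

                /* A writer ran somewhere; did *this* vma actually change ? */
                cur = spf_vma_lookup(mm, address);
                if (!cur || !vma_spf_equal(cur, &pvma))
                        goto fail;      /* retry with the mmap read lock */
                /* Attributes look unchanged, keep going speculatively. */
        }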

2021-04-07 22:04:31

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [RFC PATCH 11/37] x86/mm: attempt speculative mm faults first

On Wed, Apr 07, 2021 at 04:35:28PM +0100, Matthew Wilcox wrote:
> On Wed, Apr 07, 2021 at 04:48:44PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 06, 2021 at 06:44:36PM -0700, Michel Lespinasse wrote:
> > > --- a/arch/x86/mm/fault.c
> > > +++ b/arch/x86/mm/fault.c
> > > @@ -1219,6 +1219,8 @@ void do_user_addr_fault(struct pt_regs *regs,
> > > struct mm_struct *mm;
> > > vm_fault_t fault;
> > > unsigned int flags = FAULT_FLAG_DEFAULT;
> > > + struct vm_area_struct pvma;
> >
> > That's 200 bytes on-stack... I suppose that's just about acceptable, but
> > perhaps we need a comment in struct vm_area_struct to make people aware
> > this thing lives on-stack and size really is an issue now.
>
> Michel's gone off and done his own thing here.

I don't think that is an entirely fair representation. First, we are
both aware of each other's work; there is no working in dark caves here.
But also, I don't even consider this patchset to be entirely my thing;
most of the main building blocks come from prior proposals before mine.

> The rest of us (Laurent, Liam & I) are working on top of the maple tree
> which shrinks vm_area_struct by five pointers, so just 160 bytes.

The idea of evaluating maple tree and speculative faults as a bundle
is actually worrying to me. I think both ideas are interesting and
worth looking into on their own, but I'm not convinced that they have
much to do with each other.

> Also, our approach doesn't involve copying VMAs in order to handle a fault.

See my other reply to Peter's message - copying VMAs is a convenient
way to reduce the size of the patchset, but it's not fundamental to
the approach at all.

2021-04-07 22:06:28

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [RFC PATCH 24/37] mm: implement speculative handling in __do_fault()

On Wed, Apr 07, 2021 at 04:40:34PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 06, 2021 at 06:44:49PM -0700, Michel Lespinasse wrote:
> > In the speculative case, call the vm_ops->fault() method from within
> > an rcu read locked section, and verify the mmap sequence lock at the
> > start of the section. A match guarantees that the original vma is still
> > valid at that time, and that the associated vma->vm_file stays valid
> > while the vm_ops->fault() method is running.
> >
> > Note that this implies that speculative faults can not sleep within
> > the vm_ops->fault method. We will only attempt to fetch existing pages
> > from the page cache during speculative faults; any miss (or prefetch)
> > will be handled by falling back to non-speculative fault handling.
> >
> > The speculative handling case also does not preallocate page tables,
> > as it is always called with a pre-existing page table.
>
> So what's wrong with SRCU ? Laurent mumbled something about frequent
> SRCU kthread activity being a problem; is that still so and is that
> fundamentally unfixable?
>
> Because to me it seems a much more natural solution to the whole thing.

The short answer is that I did not try SRCU. My thought process was,
page cache already uses an RCU read lock, I just need to expand its
scope a little.

Using SRCU might allow us to hit disk during speculative faults; OTOH
we may need to switch to a more robust validation mechanism than the
global counter to reap any benefits.
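
Concretely, the speculative __do_fault() path amounts to roughly the
following (a simplified sketch; the exact name of the vm_fault field
carrying the captured sequence count is approximate):

        rcu_read_lock();
        if (!mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq)) {
                rcu_read_unlock();
                return VM_FAULT_RETRY;  /* fall back to the non-speculative path */
        }
        ret = vmf->vma->vm_ops->fault(vmf);     /* must not sleep here */
        rcu_read_unlock();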

2021-04-07 22:07:46

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC PATCH 24/37] mm: implement speculative handling in __do_fault()

On Wed, Apr 07, 2021 at 02:20:27PM -0700, Michel Lespinasse wrote:
> On Wed, Apr 07, 2021 at 04:40:34PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 06, 2021 at 06:44:49PM -0700, Michel Lespinasse wrote:
> > > In the speculative case, call the vm_ops->fault() method from within
> > > an rcu read locked section, and verify the mmap sequence lock at the
> > > start of the section. A match guarantees that the original vma is still
> > > valid at that time, and that the associated vma->vm_file stays valid
> > > while the vm_ops->fault() method is running.
> > >
> > > Note that this implies that speculative faults can not sleep within
> > > the vm_ops->fault method. We will only attempt to fetch existing pages
> > > from the page cache during speculative faults; any miss (or prefetch)
> > > will be handled by falling back to non-speculative fault handling.
> > >
> > > The speculative handling case also does not preallocate page tables,
> > > as it is always called with a pre-existing page table.
> >
> > So what's wrong with SRCU ? Laurent mumbled something about frequent
> > SRCU kthread activity being a problem; is that still so and is that
> > fundamentally unfixable?
> >
> > Because to me it seems a much more natural solution to the whole thing.
>
> The short answer is that I did not try SRCU. My thought process was,
> page cache already uses an RCU read lock, I just need to expand its
> scope a little.
>
> Using SRCU might allow us to hit disk during speculative faults; OTOH
> we may need to switch to a more robust validation mechanism than the
> global counter to reap any benefits.

Why would you want to do I/O under SRCU?! The benefit of SRCU is that
you can allocate page tables under SRCU.

Doing I/O without any lock held already works; it just uses the file
refcount. It would be better to use a vma refcount, as I already said.

2021-04-07 22:37:51

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 09/37] mm: add per-mm mmap sequence counter for speculative page fault handling.

On Tue, Apr 06, 2021 at 06:44:34PM -0700, Michel Lespinasse wrote:
> The counter's write side is hooked into the existing mmap locking API:
> mmap_write_lock() increments the counter to the next (odd) value, and
> mmap_write_unlock() increments it again to the next (even) value.
>
> The counter's speculative read side is supposed to be used as follows:
>
> seq = mmap_seq_read_start(mm);
> if (seq & 1)
> goto fail;
> .... speculative handling here ....
> if (!mmap_seq_read_check(mm, seq)
> goto fail;
>
> This API guarantees that, if none of the "fail" tests abort
> speculative execution, the speculative code section did not run
> concurrently with any mmap writer.

So this is obviously safe, but it's also super excessive. Any change,
anywhere, will invalidate and abort a SPF.

Since you make a complete copy of the vma, you could memcmp it in its
entirety instead of this.

2021-04-07 22:38:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 11/37] x86/mm: attempt speculative mm faults first

On Tue, Apr 06, 2021 at 06:44:36PM -0700, Michel Lespinasse wrote:
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1219,6 +1219,8 @@ void do_user_addr_fault(struct pt_regs *regs,
> struct mm_struct *mm;
> vm_fault_t fault;
> unsigned int flags = FAULT_FLAG_DEFAULT;
> + struct vm_area_struct pvma;

That's 200 bytes on-stack... I suppose that's just about acceptable, but
perhaps we need a comment in struct vm_area_struct to make people aware
this thing lives on-stack and size really is an issue now.

2021-04-07 23:16:49

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC PATCH 11/37] x86/mm: attempt speculative mm faults first

On Wed, Apr 07, 2021 at 04:48:44PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 06, 2021 at 06:44:36PM -0700, Michel Lespinasse wrote:
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -1219,6 +1219,8 @@ void do_user_addr_fault(struct pt_regs *regs,
> > struct mm_struct *mm;
> > vm_fault_t fault;
> > unsigned int flags = FAULT_FLAG_DEFAULT;
> > + struct vm_area_struct pvma;
>
> That's 200 bytes on-stack... I suppose that's just about acceptable, but
> perhaps we need a comment in struct vm_area_struct to make people aware
> this thing lives on-stack and size really is an issue now.

Michel's gone off and done his own thing here.

The rest of us (Laurent, Liam & I) are working on top of the maple tree
which shrinks vm_area_struct by five pointers, so just 160 bytes.

Also, our approach doesn't involve copying VMAs in order to handle a fault.

2021-04-08 05:18:30

by kernel test robot

[permalink] [raw]
Subject: [mm] 87b1c39af4: nvml.blk_rw_mt_TEST0_check_pmem_debug.fail



Greeting,

FYI, we noticed the following commit (built with gcc-9):

commit: 87b1c39af431a20ab69bdceb9036360cf58a28e4 ("[RFC PATCH 23/37] mm: rcu safe vma->vm_file freeing")
url: https://github.com/0day-ci/linux/commits/Michel-Lespinasse/Speculative-page-faults/20210407-095517
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git a500fc918f7b8dc3dff2e6c74f3e73e856c18248

in testcase: nvml
version: nvml-x86_64-09f75f696-1_20210402
with following parameters:

test: pmem
group: blk
nr_pmem: 1
fs: ext4
mount_option: dax
bp_memmap: 32G!4G
ucode: 0x7000019



on test machine: 16 threads Intel(R) Xeon(R) CPU D-1541 @ 2.10GHz with 48G memory

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):




If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>


2021-04-07 16:01:30 ./RUNTESTS -f pmem blk_rw_mt
blk_rw_mt/TEST0: SETUP (check/pmem/debug)
[MATCHING FAILED, COMPLETE FILE (out0.log) BELOW]
blk_rw_mt/TEST0: START: blk_rw_mt
./blk_rw_mt 4096 /fs/pmem0//test_blk_rw_mt0????⠏⠍⠙⠅ɗPMDKӜ⥺????/testfile1 123 5 80
4096 block size 4096 usable blocks 100
/fs/pmem0//test_blk_rw_mt0????⠏⠍⠙⠅ɗPMDKӜ⥺????/testfile1: pmemblk_check: Resource temporarily unavailable
blk_rw_mt/TEST0: DONE

[EOF]
out0.log.match:1 blk_rw_mt$(nW)TEST0: START: blk_rw_mt
out0.log:1 blk_rw_mt/TEST0: START: blk_rw_mt
out0.log.match:2 $(nW)blk_rw_mt$(nW) 4096 $(nW)testfile1 123 5 80
out0.log:2 ./blk_rw_mt 4096 /fs/pmem0//test_blk_rw_mt0????⠏⠍⠙⠅ɗPMDKӜ⥺????/testfile1 123 5 80
out0.log.match:3 4096 block size 4096 usable blocks 100
out0.log:3 4096 block size 4096 usable blocks 100
out0.log.match:4 blk_rw_mt$(nW)TEST0: DONE
out0.log:4 /fs/pmem0//test_blk_rw_mt0????⠏⠍⠙⠅ɗPMDKӜ⥺????/testfile1: pmemblk_check: Resource temporarily unavailable
FAIL: match: out0.log.match:4 did not match pattern
RUNTESTS: stopping: blk_rw_mt/TEST0 failed, TEST=check FS=pmem BUILD=debug


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml
bin/lkp run compatible-job.yaml



---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation

Thanks,
Oliver Sang


Attachments:
(No filename) (2.51 kB)
config-5.12.0-rc2-00042-g87b1c39af431 (175.59 kB)
job-script (5.96 kB)
kmsg.xz (29.23 kB)
job.yaml (4.90 kB)
reproduce (2.22 kB)
nvml.xz (45.01 kB)

2021-04-08 07:03:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 24/37] mm: implement speculative handling in __do_fault()

On Wed, Apr 07, 2021 at 10:27:12PM +0100, Matthew Wilcox wrote:
> Doing I/O without any lock held already works; it just uses the file
> refcount. It would be better to use a vma refcount, as I already said.

The original workload that I developed SPF for (waaaay back when) was
prefaulting a single huge vma. Using a vma refcount was a total loss
because it resulted in the same cacheline contention that down_read()
was having.

As such, I'm always incredibly sad to see mention of vma refcounts.
They're fundamentally not solving the problem :/

2021-04-08 07:18:46

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC PATCH 24/37] mm: implement speculative handling in __do_fault()

On Thu, Apr 08, 2021 at 09:00:26AM +0200, Peter Zijlstra wrote:
> On Wed, Apr 07, 2021 at 10:27:12PM +0100, Matthew Wilcox wrote:
> > Doing I/O without any lock held already works; it just uses the file
> > refcount. It would be better to use a vma refcount, as I already said.
>
> The original workload that I developed SPF for (waaaay back when) was
> prefaulting a single huge vma. Using a vma refcount was a total loss
> because it resulted in the same cacheline contention that down_read()
> was having.
>
> As such, I'm always incredibly sad to see mention of vma refcounts.
> They're fundamentally not solving the problem :/

OK, let me outline my locking scheme because I think it's rather better
than Michel's. The vma refcount is the slow path.

1. take the RCU read lock
2. walk the pgd/p4d/pud/pmd
3. allocate page tables if necessary. *handwave GFP flags*.
4. walk the vma tree
5. call ->map_pages
6. take ptlock
7. insert page(s)
8. drop ptlock
if this all worked out, we're done, drop the RCU read lock and return.
9. increment vma refcount
10. drop RCU read lock
11. call ->fault
12. decrement vma refcount

Compared to today, where we bump the refcount on the file underlying the
vma, this is _better_ scalability -- different mappings of the same file
will not contend on the file's refcount.

I suspect your huge VMA was anon, and that wouldn't need a vma refcount
as faulting in new pages doesn't need to do I/O, just drop the RCU
lock, allocate and retry.
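
In pseudo-C, with the details waved away, that split looks something
like the following (an outline only, not working code: the page table
and vma tree helpers and the vma_get()/vma_put() refcount primitives are
placeholders, and VM_FAULT_RETRY stands in for "the fast path did not
resolve it"):

static vm_fault_t spf_fault_outline(struct vm_fault *vmf)
{
        struct vm_area_struct *vma;
        vm_fault_t ret;

        rcu_read_lock();                                /* 1 */
        spf_walk_or_alloc_page_tables(vmf);             /* 2-3 */
        vma = spf_vma_tree_lookup(vmf->address);        /* 4 */
        ret = vma->vm_ops->map_pages(vmf, vmf->pgoff, vmf->pgoff); /* 5-8 */
        if (!(ret & VM_FAULT_RETRY)) {
                rcu_read_unlock();      /* fast path: no refcount touched */
                return ret;
        }
        vma_get(vma);                   /* 9: refcount only on the slow path */
        rcu_read_unlock();              /* 10 */
        ret = vma->vm_ops->fault(vmf);  /* 11: may block, may do I/O */
        vma_put(vma);                   /* 12 */
        return ret;
}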

2021-04-08 07:54:50

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [RFC PATCH 34/37] mm: rcu safe vma freeing only for multithreaded user space

On Wed, Apr 07, 2021 at 03:50:06AM +0100, Matthew Wilcox wrote:
> On Tue, Apr 06, 2021 at 06:44:59PM -0700, Michel Lespinasse wrote:
> > Performance tuning: as single threaded userspace does not use
> > speculative page faults, it does not require rcu safe vma freeing.
> > Turn this off to avoid the related (small) extra overheads.
> >
> > For multi threaded userspace, we often see a performance benefit from
> > the rcu safe vma freeing - even in tests that do not have any frequent
> > concurrent page faults ! This is because rcu safe vma freeing prevents
> > recently released vmas from being immediately reused in a new thread.
>
> Why does that provide a performance benefit? Recently released
> VMAs are cache-hot, and NUMA-local. I'd expect the RCU delay to be
> performance-negative.

I only have the observation and no full explanation for it.
Just try it on wis-mmap and wis-malloc threaded cases. Of course this
all washes away when dealing with more realistic macro benchmarks.

2021-04-08 08:20:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 24/37] mm: implement speculative handling in __do_fault()

On Thu, Apr 08, 2021 at 08:13:43AM +0100, Matthew Wilcox wrote:
> On Thu, Apr 08, 2021 at 09:00:26AM +0200, Peter Zijlstra wrote:
> > On Wed, Apr 07, 2021 at 10:27:12PM +0100, Matthew Wilcox wrote:
> > > Doing I/O without any lock held already works; it just uses the file
> > > refcount. It would be better to use a vma refcount, as I already said.
> >
> > The original workload that I developed SPF for (waaaay back when) was
> > prefaulting a single huge vma. Using a vma refcount was a total loss
> > because it resulted in the same cacheline contention that down_read()
> > was having.
> >
> > As such, I'm always incredibly sad to see mention of vma refcounts.
> > They're fundamentally not solving the problem :/
>
> OK, let me outline my locking scheme because I think it's rather better
> than Michel's. The vma refcount is the slow path.
>
> 1. take the RCU read lock
> 2. walk the pgd/p4d/pud/pmd
> 3. allocate page tables if necessary. *handwave GFP flags*.

The problem with allocating page-tables was that you can race with
zap_page_range() if you're not holding mmap_sem, and as such can install
a page-table after the range has already been zapped, in which case it leaks.

IIRC that was solvable, but it did need a bit of care.

> 4. walk the vma tree
> 5. call ->map_pages

I can't remember ->map_pages().. I think that's 'new'. git-blame tells
me that's 2014, and I did the original SPF in 2010.

Yes, that looks like a useful thing to have, it does the non-blocking
part of ->fault().

I suppose the thing missing here is that if ->map_pages() does not
return a page, we have:

goto 9

> 6. take ptlock
> 7. insert page(s)
> 8. drop ptlock
> if this all worked out, we're done, drop the RCU read lock and return.

> 9. increment vma refcount
> 10. drop RCU read lock
> 11. call ->fault
> 12. decrement vma refcount

And here we do 6-8 again, right?

> Compared to today, where we bump the refcount on the file underlying the
> vma, this is _better_ scalability -- different mappings of the same file
> will not contend on the file's refcount.
>
> I suspect your huge VMA was anon, and that wouldn't need a vma refcount
> as faulting in new pages doesn't need to do I/O, just drop the RCU
> lock, allocate and retry.

IIRC yes, it was either a huge matrix setup or some database thing, I
can't remember. But the thing was, we didn't have that ->map_pages(), so
we had to call ->fault(), which can sleep, so I had to use SRCU across
the whole thing (or rather, I hacked up preemptible-rcu, because SRCU
was super primitive back then). It did kick start significant SRCU
rework IIRC. Anyway, that's all ancient history.

2021-04-08 08:39:25

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [RFC PATCH 24/37] mm: implement speculative handling in __do_fault()

On Thu, Apr 08, 2021 at 08:13:43AM +0100, Matthew Wilcox wrote:
> On Thu, Apr 08, 2021 at 09:00:26AM +0200, Peter Zijlstra wrote:
> > On Wed, Apr 07, 2021 at 10:27:12PM +0100, Matthew Wilcox wrote:
> > > Doing I/O without any lock held already works; it just uses the file
> > > refcount. It would be better to use a vma refcount, as I already said.
> >
> > The original workload that I developed SPF for (waaaay back when) was
> > prefaulting a single huge vma. Using a vma refcount was a total loss
> > because it resulted in the same cacheline contention that down_read()
> > was having.
> >
> > As such, I'm always incredibly sad to see mention of vma refcounts.
> > They're fundamentally not solving the problem :/
>
> OK, let me outline my locking scheme because I think it's rather better
> than Michel's. The vma refcount is the slow path.
>
> 1. take the RCU read lock
> 2. walk the pgd/p4d/pud/pmd
> 3. allocate page tables if necessary. *handwave GFP flags*.
> 4. walk the vma tree
> 5. call ->map_pages
> 6. take ptlock
> 7. insert page(s)
> 8. drop ptlock
> if this all worked out, we're done, drop the RCU read lock and return.
> 9. increment vma refcount
> 10. drop RCU read lock
> 11. call ->fault
> 12. decrement vma refcount

Note that most of your proposed steps seem similar in principle to mine.
Looking at the fast path (steps 1-8):
- step 2 sounds like the speculative part of __handle_mm_fault()
- (step 3 not included in my proposal)
- step 4 is basically the lookup I currently have in the arch fault handler
- step 6 sounds like the speculative part of map_pte_lock()

I have working implementations for each step, while your proposal
summarizes each as a point item. It's not clear to me what to make of it;
presumably you would be "filling in the blanks" in a different way
than I have but you are not explaining how. Are you suggesting that
the precautions taken in each step to avoid races with mmap writers
would not be necessary in your proposal ? if that is the case, what is
the alternative mechanism would you use to handle such races ?

Going back to the source of this, you suggested not copying the VMA,
what is your proposed alternative ? Do you suggest that fault handlers
should deal with the vma potentially mutating under them ? Or should
mmap writers consider vmas as immutable and copy them whenever they
want to change them ? or are you implying a locking mechanism that would
prevent mmap writers from executing while the fault is running ?

> Compared to today, where we bump the refcount on the file underlying the
> vma, this is _better_ scalability -- different mappings of the same file
> will not contend on the file's refcount.
>
> I suspect your huge VMA was anon, and that wouldn't need a vma refcount
> as faulting in new pages doesn't need to do I/O, just drop the RCU
> lock, allocate and retry.

2021-04-08 11:30:20

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC PATCH 24/37] mm: implement speculative handling in __do_fault()

On Thu, Apr 08, 2021 at 01:37:34AM -0700, Michel Lespinasse wrote:
> On Thu, Apr 08, 2021 at 08:13:43AM +0100, Matthew Wilcox wrote:
> > On Thu, Apr 08, 2021 at 09:00:26AM +0200, Peter Zijlstra wrote:
> > > On Wed, Apr 07, 2021 at 10:27:12PM +0100, Matthew Wilcox wrote:
> > > > Doing I/O without any lock held already works; it just uses the file
> > > > refcount. It would be better to use a vma refcount, as I already said.
> > >
> > > The original workload that I developed SPF for (waaaay back when) was
> > > prefaulting a single huge vma. Using a vma refcount was a total loss
> > > because it resulted in the same cacheline contention that down_read()
> > > was having.
> > >
> > > As such, I'm always incredibly sad to see mention of vma refcounts.
> > > They're fundamentally not solving the problem :/
> >
> > OK, let me outline my locking scheme because I think it's rather better
> > than Michel's. The vma refcount is the slow path.
> >
> > 1. take the RCU read lock
> > 2. walk the pgd/p4d/pud/pmd
> > 3. allocate page tables if necessary. *handwave GFP flags*.
> > 4. walk the vma tree
> > 5. call ->map_pages
> > 6. take ptlock
> > 7. insert page(s)
> > 8. drop ptlock
> > if this all worked out, we're done, drop the RCU read lock and return.
> > 9. increment vma refcount
> > 10. drop RCU read lock
> > 11. call ->fault
> > 12. decrement vma refcount
>
> Note that most of your proposed steps seem similar in principle to mine.
> Looking at the fast path (steps 1-8):
> - step 2 sounds like the speculative part of __handle_mm_fault()
> - (step 3 not included in my proposal)
> - step 4 is basically the lookup I currently have in the arch fault handler
> - step 6 sounds like the speculative part of map_pte_lock()
>
> I have working implementations for each step, while your proposal
> summarizes each as a point item. It's not clear to me what to make of it;
> presumably you would be "filling in the blanks" in a different way
> than I have, but you are not explaining how. Are you suggesting that
> the precautions taken in each step to avoid races with mmap writers
> would not be necessary in your proposal ? If that is the case, what
> alternative mechanism would you use to handle such races ?

I don't know if you noticed, I've been a little busy with memory folios.
I did tell you that on the call, but you don't seem to retain anything
I tell you on the call, so maybe I shouldn't bother calling in any more.

> Going back to the source of this, you suggested not copying the VMA;
> what is your proposed alternative ? Do you suggest that fault handlers
> should deal with the vma potentially mutating under them ? Or should
> mmap writers consider vmas as immutable and copy them whenever they
> want to change them ? Or are you implying a locking mechanism that would
> prevent mmap writers from executing while the fault is running ?

The VMA should be immutable, as I explained to you before.

2021-04-21 01:51:49

by Chinwen Chang

[permalink] [raw]
Subject: Re: [RFC PATCH 00/37] Speculative page faults

On Tue, 2021-04-06 at 18:44 -0700, Michel Lespinasse wrote:
> This patchset is my take on speculative page faults (spf).
> It builds on ideas that have been previously proposed by Laurent Dufour,
> Peter Zijlstra and others before. While Laurent's previous proposal
> was rejected around the time of LSF/MM 2019, I am hoping we can revisit
> this now based on what I think is a simpler and more bisectable approach,
> much improved scaling numbers in the anonymous vma case, and the Android
> use case that has since emerged. I will expand on these points towards
> the end of this message.
>
>
> Classical page fault processing takes the mmap read lock in order to
> prevent races with mmap writers. In contrast, speculative fault
> processing does not take the mmap read lock, and instead verifies,
> when the results of the page fault are about to get committed and
> become visible to other threads, that no mmap writers have been
> running concurrently with the page fault. If the check fails,
> speculative updates do not get committed and the fault is retried
> in the usual, non-speculative way (with the mmap read lock held).
>
> The concurrency check is implemented using a per-mm mmap sequence count.
> The counter is incremented at the beginning and end of each mmap write
> operation. If the counter is initially observed to have an even value,
> and has the same value later on, the observer can deduce that no mmap
> writers have been running concurrently with it between those two times.
> This is similar to a seqlock, except that readers never spin on the
> counter value (they would instead revert to taking the mmap read lock),
> and writers are allowed to sleep. One benefit of this approach is that
> it requires no writer side changes, just some hooks in the mmap write
> lock APIs that writers already use.
>
> The first step of a speculative page fault is to look up the vma and
> read its contents (currently by making a copy of the vma, though in
> principle it would be sufficient to only read the vma attributes that
> are used in page faults). The mmap sequence count is used to verify
> that there were no mmap writers concurrent to the lookup and copy steps.
> Note that walking rbtrees while there may potentially be concurrent
> writers is not an entirely new idea in linux, as latched rbtrees
> are already doing this. This is safe as long as the lookup is
> followed by a sequence check to verify that concurrency did not
> actually occur (and abort the speculative fault if it did).
>
> The next step is to walk down the existing page table tree to find the
> current pte entry. This is done with interrupts disabled to avoid
> races with munmap(). Again, not an entirely new idea, as this repeats
> a pattern already present in fast GUP. Similar precautions are also
> taken when taking the page table lock.
>
> In the case of mapped files, additional precautions must be taken when
> dereferencing vma->vm_file. The vma copy that is taken at the start of
> the speculative page fault includes a pointer to the file, but does not
> take a reference to it. The reference is held by the original vma.
> In order to guarantee the validity of the file through the vm_ops->fault()
> and/or vm_ops->map_pages() calls, these are run within an rcu read locked
> code section, with the mmap sequence count being verified at the start of
> the section and the vma file fput() is rcu-deferred when vmas are freed.
>
>
> The patch series applies on top of linux v5.12-rc5;
> a git tree is also available:
> git fetch https://github.com/lespinasse/linux.git v5.12-rc5-spf
>
>
> Commits 1 to 4 are preparatory cleanups
> (which I would like to push regardless of what happens to the rest)
>
> Commits 5 and 6 introduce CONFIG_SPECULATIVE_PAGE_FAULT and lets us
> enable it on x86 so we can test the new code as it gets introduced.
>
> Commits 7 and 8 extend handle_mm_fault() so it can be used for
> speculative faults; initially these always abort with VM_FAULT_RETRY.
>
> Commits 9 to 14 introduce all the basic building blocks for speculative
> page faults:
> - Commit 9 adds the mmap sequence count that will be used for detecting
> when writers have been running concurrently with an spf attempt
> (in which case the attempt will be aborted);
> - Commit 10 adds RCU safe vma freeing;
> - Commit 11 does a lockless VMA lookup and starts the spf handling attempt;
> - (Commit 12 is a small refactor preparing for the next commit);
> - Commit 13 walks down the existing page tables, carefully avoiding
> races with potential writers (munmap in particular)
> - Commit 14 introduces pte_map_lock() and pte_spinlock(), which attempt
> to (optionally map and) lock an existing page table when it's time to
> commit page fault results to it.
>
> Commits 15 to 22 progressively implement spf for most anon vma cases.
> This mostly comes down to using the pte_map_lock() and pte_spinlock()
> APIs where needed, and making sure to abort speculation in unsupported
> cases (mostly anon_vma allocation and userfaultfd). The work is split
> in small steps so that changes can be tested soon after they are added.
>
> Commits 23 to 26 introduce the main ideas for extending spf to file
> mapped vmas:
> - Commit 23 adds RCU safe vma->vm_file freeing;
> - Commit 24 uses RCU to ensure vma->vm_file stays valid through the
> vm_ops->fault() method call;
> - Commit 25 implements the speculative case for filemap_fault().
> This only handles the common case where the desired page is already
> cached, is not locked and where we don't need to trigger readahead.
> - Commit 26 ensures we use pte_map_lock() to commit the faulted page to the
> mm address space, verifying that no mmap writers ran during the fault.
>
> Commits 27 and 28 use the same ideas to implement speculative fault-around.
>
> Commits 29 to 32 complete spf support for most mapped file types.
> A list of supported file types is introduced, and we allow speculative
> execution through the new speculative file faulting paths for those files.
> Note that the speculative support does not extend to shared file
> mapping writes in this patchset, as this would require some support
> for vm_ops->page_mkwrite() which isn't handled yet.
>
> Commits 33 and 34 disable speculative handling for single-threaded
> userspace. This is for (minor) performance tuning and is pushed
> towards the end of the series to make it easier to exercise the spf
> paths as they are introduced.
>
> Commit 35 adds some extra statistics. Not pushing for inclusion on
> these but I thought it might come handy when discussing spf performance.
>
> Commits 36 and 37 add spf support on the arm64 architecture. It should
> be easy to add other architectures too, given access to test machines.
>
>
> As Laurent's previous proposal before LSF/MM 2019 was rejected for
> complexity reasons, I am hoping that some of the changes I made will
> help address these concerns:
>
> - First, the patchset is structured to be bisectable. Every few commits
> the new code gets enabled, which makes for easier testing and also,
> I think, should make it easier for reviewers to understand how the
> various commits relate to each other.
>
> - This patchset requires no changes to mmap writers other than instrumenting
> the mmap write lock APIs.
>
> - The fault handler operates on a stable copy of the vma, so it does not
> require any special care to avoid the possibility of vma fields being
> modified concurrently with it.
>
>
> mmtest results on HP Z840 workstation
> (2 socket Xeon E5-2690 v3 @ 2.60GHz, 24 cores / 48 threads total):
> See https://www.lespinasse.org/kernel/v5.12-rc5-spf/mmtests/
>
> Highlights from the above:
>
> - pft results show some pretty damn good scalability. Throughput using
> all 48 threads (24 cores) is 24x that of single-threaded tests, and
> 3.7x higher than when using the baseline kernel. This is the result
> of avoiding writing into any shared cache lines, be it for mmap lock
> or vma refcounting, during speculative page faults.
> (Experiments show that adding a single atomic variable per vma,
> which would be incremented and decremented before and after spf
> copies the vma, would decrease the 48-threads throughput by 62%).
>
> - wis-pf page_fault1 (anon THP) shows some moderate improvement.
>
> - wis-pf page_fault2 (private file mapping write) shows some
> improvement, though much less than in the pft (anon) case.
> The performance is limited by LRU contention here.
>
> - wis-pf page_fault3 (shared file mapping write) gets a bit slower.
> This patchset does not implement speculative handling for this case.
>
> - wis-mmap likes the change, even though it doesn't do much page faults.
> This seems to be a side effect of rcu safe vma freeing reducing vma
> reuse between threads running on separate CPUs.
>
> - wis-malloc benefits from a mix of the anon vma and rcu effects.
>
> - macro benchmarks are mostly performance neutral, with some small
> benefit in limited cases.
>
>
> Another potential motivation for this is the Android use case.
> Several Android vendors have picked up the previous SPF proposal and
> included it on their devices because it reduces application start-up
> times, which is an important performance metric for them.
>
> Early Android test results with this patchset (actually a backport of
> it to the Android 5.10 kernel) show that it is also effective in
> reducing application start times, similar to what the prior SPF
> proposal did. I do not have a direct comparison available though,
> as our port of Laurent's prior SPF series to the 5.10 android kernel
> is showing some instabilities. SPF abort rates are in the low single
> digit percentage range, and few of these aborts are caused by the fact
> that we only use a global counter to detect fault vs writer concurrency.
> We are seeing launch time improvements for a variety of Android packages
> that use a large number of threads during startup:
> Meituan: 15%
> Iqiyi: 8%
> Tiktok: 4%
> YouTube: 3%
> I expect Suren may have more Android numbers from partners in a few days.
>

Thanks for the effort :)

We have tried merging this SPF series into our code base and measured
the difference in application launch time. As you mentioned, it brings
great improvements, especially for applications with high thread-level
parallelism (TLP).

Here is a brief report of our experiment, FYR.

(The launch time is in ms)

app version-code w/o SPF w/ SPF improve(%)
------------------------------------------------------------------
Meituan 900140601 2427 1655 31.81
Taobao 241 1200 1170 2.5
WeChat 1360 941 922 2.02
Line 16071401 686 665 3.06

> ChromeOS measurements with the previous SPF patchset also showed
> significant reduction in page fault, mmap and munmap times. We do not
> have data on this new SPF proposal yet, though.
>
>
> I would like any feedback on the patchset, and most particularly about:
> - Has the complexity issue in previous SPF proposals been sufficiently
> addressed ?
> - Are the performance benefits compelling enough ? I think the answer is
> yes for the anon case, but maybe not as clear for the file cases.
> - Is the Android use case compelling enough to merge the entire patchset ?
> - Can we use this as a foundation for other mmap scalability work ?
> I hear several proposals involving the idea of RCU based fault handling,
> and hope this proposal might serve as a building block for these ?
>
>
> Michel Lespinasse (37):
> mmap locking API: mmap_lock_is_contended returns a bool
> mmap locking API: name the return values
> do_anonymous_page: use update_mmu_tlb()
> do_anonymous_page: reduce code duplication
> mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT
> x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
> mm: add FAULT_FLAG_SPECULATIVE flag
> mm: add do_handle_mm_fault()
> mm: add per-mm mmap sequence counter for speculative page fault handling.
> mm: rcu safe vma freeing
> x86/mm: attempt speculative mm faults first
> mm: refactor __handle_mm_fault() / handle_pte_fault()
> mm: implement speculative handling in __handle_mm_fault().
> mm: add pte_map_lock() and pte_spinlock()
> mm: implement speculative handling in do_anonymous_page()
> mm: enable speculative fault handling through do_anonymous_page()
> mm: implement speculative handling in do_numa_page()
> mm: enable speculative fault handling in do_numa_page()
> mm: implement speculative handling in wp_page_copy()
> mm: implement and enable speculative fault handling in handle_pte_fault()
> mm: implement speculative handling in do_swap_page()
> mm: enable speculative fault handling through do_swap_page()
> mm: rcu safe vma->vm_file freeing
> mm: implement speculative handling in __do_fault()
> mm: implement speculative handling in filemap_fault()
> mm: implement speculative fault handling in finish_fault()
> mm: implement speculative handling in do_fault_around()
> mm: implement speculative handling in filemap_map_pages()
> fs: list file types that support speculative faults.
> mm: enable speculative fault handling for supported file types.
> ext4: implement speculative fault handling
> f2fs: implement speculative fault handling
> mm: enable speculative fault handling only for multithreaded user space
> mm: rcu safe vma freeing only for multithreaded user space
> mm: spf statistics
> arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
> arm64/mm: attempt speculative mm faults first
>
> arch/arm64/Kconfig | 1 +
> arch/arm64/mm/fault.c | 52 +++
> arch/x86/Kconfig | 1 +
> arch/x86/mm/fault.c | 51 +++
> fs/btrfs/file.c | 1 +
> fs/cifs/file.c | 1 +
> fs/exec.c | 1 +
> fs/ext4/file.c | 1 +
> fs/ext4/inode.c | 7 +-
> fs/f2fs/file.c | 8 +-
> fs/fuse/file.c | 1 +
> fs/nfs/file.c | 1 +
> fs/ubifs/file.c | 1 +
> fs/vboxsf/file.c | 1 +
> fs/xfs/xfs_file.c | 3 +
> include/linux/mm.h | 76 +++-
> include/linux/mm_types.h | 20 +-
> include/linux/mmap_lock.h | 109 ++++--
> include/linux/vm_event_item.h | 28 ++
> include/linux/vmstat.h | 6 +
> kernel/fork.c | 26 +-
> mm/Kconfig | 22 ++
> mm/Kconfig.debug | 7 +
> mm/filemap.c | 73 +++-
> mm/memory.c | 634 ++++++++++++++++++++++++----------
> mm/mmap.c | 11 +-
> mm/nommu.c | 6 +-
> mm/vmstat.c | 28 ++
> 28 files changed, 942 insertions(+), 235 deletions(-)

2021-04-28 17:36:47

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [RFC PATCH 13/37] mm: implement speculative handling in __handle_mm_fault().

On Wed, Apr 07, 2021 at 08:36:01AM -0700, Andy Lutomirski wrote:
> On 4/6/21 6:44 PM, Michel Lespinasse wrote:
> > The page table tree is walked with local irqs disabled, which prevents
> > page table reclamation (similarly to what fast GUP does). The logic is
> > otherwise similar to the non-speculative path, but with additional
> > restrictions: in the speculative path, we do not handle huge pages or
> > wiring new pages tables.
>
> Not on most architectures. Quoting the actual comment in mm/gup.c:
>
> > * Before activating this code, please be aware that the following assumptions
> > * are currently made:
> > *
> > * *) Either MMU_GATHER_RCU_TABLE_FREE is enabled, and tlb_remove_table() is used to
> > * free pages containing page tables or TLB flushing requires IPI broadcast.
>
> On MMU_GATHER_RCU_TABLE_FREE architectures, you cannot make the
> assumption that it is safe to dereference a pointer in a page table just
> because irqs are off. You need RCU protection, too.
>
> You have the same error in the cover letter.

Hi Andy,

Thanks for your comment. At first I thought it did not matter, because we
only enable ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT on selected
architectures, and I thought MMU_GATHER_RCU_TABLE_FREE is not set on
these. But I was wrong - MMU_GATHER_RCU_TABLE_FREE is enabled on X86
with paravirt. So I took another look at fast GUP to make sure I
actually understand it.

This brings a question about lockless_pages_from_mm() - I see it
disabling interrupts, which it explains is necessary for disabling THP
splitting IPIs, but I do not see it taking an RCU read lock as would
be necessary for preventing page table freeing on
MMU_GATHER_RCU_TABLE_FREE configs. I figure local_irq_save()
indirectly takes an rcu read lock somehow ? I think this is something
I should also mention in my explanation, and I have not seen a good
description of this on the fast GUP side...

Thanks,

--
Michel "walken" Lespinasse

2021-04-28 17:41:00

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC PATCH 13/37] mm: implement speculative handling in __handle_mm_fault().

On Wed, Apr 28, 2021 at 08:13:53AM -0700, Andy Lutomirski wrote:
> On Wed, Apr 28, 2021 at 8:05 AM Michel Lespinasse <[email protected]> wrote:
> >
> > On Wed, Apr 07, 2021 at 08:36:01AM -0700, Andy Lutomirski wrote:
> > > On 4/6/21 6:44 PM, Michel Lespinasse wrote:
> > > > The page table tree is walked with local irqs disabled, which prevents
> > > > page table reclamation (similarly to what fast GUP does). The logic is
> > > > otherwise similar to the non-speculative path, but with additional
> > > > restrictions: in the speculative path, we do not handle huge pages or
> > > > wiring new pages tables.
> > >
> > > Not on most architectures. Quoting the actual comment in mm/gup.c:
> > >
> > > > * Before activating this code, please be aware that the following assumptions
> > > > * are currently made:
> > > > *
> > > > * *) Either MMU_GATHER_RCU_TABLE_FREE is enabled, and tlb_remove_table() is used to
> > > > * free pages containing page tables or TLB flushing requires IPI broadcast.
> > >
> > > On MMU_GATHER_RCU_TABLE_FREE architectures, you cannot make the
> > > assumption that it is safe to dereference a pointer in a page table just
> > > because irqs are off. You need RCU protection, too.
> > >
> > > You have the same error in the cover letter.
> >
> > Hi Andy,
> >
> > Thanks for your comment. At first I thought it did not matter, because we
> > only enable ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT on selected
> > architectures, and I thought MMU_GATHER_RCU_TABLE_FREE is not set on
> > these. But I was wrong - MMU_GATHER_RCU_TABLE_FREE is enabled on X86
> > with paravirt. So I took another look at fast GUP to make sure I
> > actually understand it.
> >
> > This brings a question about lockless_pages_from_mm() - I see it
> > disabling interrupts, which it explains is necessary for disabling THP
> > splitting IPIs, but I do not see it taking an RCU read lock as would
> > be necessary for preventing page table freeing on
> > MMU_GATHER_RCU_TABLE_FREE configs. I figure local_irq_save()
> > indirectly takes an rcu read lock somehow ? I think this is something
> > I should also mention in my explanation, and I have not seen a good
> > description of this on the fast GUP side...
>
> Sounds like a bug! That being said, based on my extremely limited
> understanding of how the common RCU modes work, local_irq_save()
> probably implies an RCU lock in at least some cases. Hi Paul!

In modern kernels, local_irq_save() does have RCU reader semantics,
meaning that synchronize_rcu() will wait for pre-existing irq-disabled
regions. It will also wait for pre-existing bh-disable, preempt-disable,
and of course rcu_read_lock() sections of code.

But don't try this in older kernels, that is, in kernels in which
synchronize_sched() is still defined!

Thanx, Paul

2021-04-28 19:22:03

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH 13/37] mm: implement speculative handling in __handle_mm_fault().

On Wed, Apr 28, 2021 at 8:05 AM Michel Lespinasse <[email protected]> wrote:
>
> On Wed, Apr 07, 2021 at 08:36:01AM -0700, Andy Lutomirski wrote:
> > On 4/6/21 6:44 PM, Michel Lespinasse wrote:
> > > The page table tree is walked with local irqs disabled, which prevents
> > > page table reclamation (similarly to what fast GUP does). The logic is
> > > otherwise similar to the non-speculative path, but with additional
> > > restrictions: in the speculative path, we do not handle huge pages or
> > > wiring new pages tables.
> >
> > Not on most architectures. Quoting the actual comment in mm/gup.c:
> >
> > > * Before activating this code, please be aware that the following assumptions
> > > * are currently made:
> > > *
> > > * *) Either MMU_GATHER_RCU_TABLE_FREE is enabled, and tlb_remove_table() is used to
> > > * free pages containing page tables or TLB flushing requires IPI broadcast.
> >
> > On MMU_GATHER_RCU_TABLE_FREE architectures, you cannot make the
> > assumption that it is safe to dereference a pointer in a page table just
> > because irqs are off. You need RCU protection, too.
> >
> > You have the same error in the cover letter.
>
> Hi Andy,
>
> Thanks for your comment. At first I thought it did not matter, because we
> only enable ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT on selected
> architectures, and I thought MMU_GATHER_RCU_TABLE_FREE is not set on
> these. But I was wrong - MMU_GATHER_RCU_TABLE_FREE is enabled on X86
> with paravirt. So I took another look at fast GUP to make sure I
> actually understand it.
>
> This brings a question about lockless_pages_from_mm() - I see it
> disabling interrupts, which it explains is necessary for disabling THP
> splitting IPIs, but I do not see it taking an RCU read lock as would
> be necessary for preventing page table freeing on
> MMU_GATHER_RCU_TABLE_FREE configs. I figure local_irq_save()
> indirectly takes an rcu read lock somehow ? I think this is something
> I should also mention in my explanation, and I have not seen a good
> description of this on the fast GUP side...

Sounds like a bug! That being said, based on my extremely limited
understanding of how the common RCU modes work, local_irq_save()
probably implies an RCU lock in at least some cases. Hi Paul!

--Andy

2021-04-29 00:03:19

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [RFC PATCH 13/37] mm: implement speculative handling in __handle_mm_fault().

On Wed, Apr 28, 2021 at 09:11:08AM -0700, Paul E. McKenney wrote:
> On Wed, Apr 28, 2021 at 08:13:53AM -0700, Andy Lutomirski wrote:
> > On Wed, Apr 28, 2021 at 8:05 AM Michel Lespinasse <[email protected]> wrote:
> > >
> > > On Wed, Apr 07, 2021 at 08:36:01AM -0700, Andy Lutomirski wrote:
> > > > On 4/6/21 6:44 PM, Michel Lespinasse wrote:
> > > > > The page table tree is walked with local irqs disabled, which prevents
> > > > > page table reclamation (similarly to what fast GUP does). The logic is
> > > > > otherwise similar to the non-speculative path, but with additional
> > > > > restrictions: in the speculative path, we do not handle huge pages or
> > > > > wiring new pages tables.
> > > >
> > > > Not on most architectures. Quoting the actual comment in mm/gup.c:
> > > >
> > > > > * Before activating this code, please be aware that the following assumptions
> > > > > * are currently made:
> > > > > *
> > > > > * *) Either MMU_GATHER_RCU_TABLE_FREE is enabled, and tlb_remove_table() is used to
> > > > > * free pages containing page tables or TLB flushing requires IPI broadcast.
> > > >
> > > > On MMU_GATHER_RCU_TABLE_FREE architectures, you cannot make the
> > > > assumption that it is safe to dereference a pointer in a page table just
> > > > because irqs are off. You need RCU protection, too.
> > > >
> > > > You have the same error in the cover letter.
> > >
> > > Hi Andy,
> > >
> > > Thanks for your comment. At first I thought it did not matter, because we
> > > only enable ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT on selected
> > > architectures, and I thought MMU_GATHER_RCU_TABLE_FREE is not set on
> > > these. But I was wrong - MMU_GATHER_RCU_TABLE_FREE is enabled on X86
> > > with paravirt. So I took another look at fast GUP to make sure I
> > > actually understand it.
> > >
> > > This brings a question about lockless_pages_from_mm() - I see it
> > > disabling interrupts, which it explains is necessary for disabling THP
> > > splitting IPIs, but I do not see it taking an RCU read lock as would
> > > be necessary for preventing page table freeing on
> > > MMU_GATHER_RCU_TABLE_FREE configs. I figure local_irq_save()
> > > indirectly takes an rcu read lock somehow ? I think this is something
> > > I should also mention in my explanation, and I have not seen a good
> > > description of this on the fast GUP side...
> >
> > Sounds like a bug! That being said, based on my extremely limited
> > understanding of how the common RCU modes work, local_irq_save()
> > probably implies an RCU lock in at least some cases. Hi Paul!
>
> In modern kernels, local_irq_save() does have RCU reader semantics,
> meaning that synchronize_rcu() will wait for pre-existing irq-disabled
> regions. It will also wait for pre-existing bh-disable, preempt-disable,
> and of course rcu_read_lock() sections of code.

Thanks Paul for confirming / clarifying this. BTW, it would be good to
add this to the rcu header files, just so people have something to
refer to when they depend on such behavior (like fast GUP
currently does).

Going back to my patch. I don't need to protect against THP splitting
here, as I'm only handling the small page case. So when
MMU_GATHER_RCU_TABLE_FREE is enabled, I *think* I could get away with
using only an rcu read lock, instead of disabling interrupts which
implicitly creates the rcu read lock. I'm not sure which way to go -
fast GUP always disables interrupts regardless of the
MMU_GATHER_RCU_TABLE_FREE setting, and I think there is a case to be
made for following the fast GUP steps rather than trying to be smarter.

Andy, do you have any opinion on this ? Or anyone else really ?

Thanks,

--
Michel "walken" Lespinasse

2021-04-29 00:06:40

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH 13/37] mm: implement speculative handling in __handle_mm_fault().

On Wed, Apr 28, 2021 at 5:02 PM Michel Lespinasse <[email protected]> wrote:
>
> On Wed, Apr 28, 2021 at 09:11:08AM -0700, Paul E. McKenney wrote:
> > On Wed, Apr 28, 2021 at 08:13:53AM -0700, Andy Lutomirski wrote:
> > > On Wed, Apr 28, 2021 at 8:05 AM Michel Lespinasse <[email protected]> wrote:
> > > >
> > > > On Wed, Apr 07, 2021 at 08:36:01AM -0700, Andy Lutomirski wrote:
> > > > > On 4/6/21 6:44 PM, Michel Lespinasse wrote:
> > > > > > The page table tree is walked with local irqs disabled, which prevents
> > > > > > page table reclamation (similarly to what fast GUP does). The logic is
> > > > > > otherwise similar to the non-speculative path, but with additional
> > > > > > restrictions: in the speculative path, we do not handle huge pages or
> > > > > > wiring new pages tables.
> > > > >
> > > > > Not on most architectures. Quoting the actual comment in mm/gup.c:
> > > > >
> > > > > > * Before activating this code, please be aware that the following assumptions
> > > > > > * are currently made:
> > > > > > *
> > > > > > * *) Either MMU_GATHER_RCU_TABLE_FREE is enabled, and tlb_remove_table() is used to
> > > > > > * free pages containing page tables or TLB flushing requires IPI broadcast.
> > > > >
> > > > > On MMU_GATHER_RCU_TABLE_FREE architectures, you cannot make the
> > > > > assumption that it is safe to dereference a pointer in a page table just
> > > > > because irqs are off. You need RCU protection, too.
> > > > >
> > > > > You have the same error in the cover letter.
> > > >
> > > > Hi Andy,
> > > >
> > > > Thanks for your comment. At first I thought it did not matter, because we
> > > > only enable ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT on selected
> > > > architectures, and I thought MMU_GATHER_RCU_TABLE_FREE is not set on
> > > > these. But I was wrong - MMU_GATHER_RCU_TABLE_FREE is enabled on X86
> > > > with paravirt. So I took another look at fast GUP to make sure I
> > > > actually understand it.
> > > >
> > > > This brings a question about lockless_pages_from_mm() - I see it
> > > > disabling interrupts, which it explains is necessary for disabling THP
> > > > splitting IPIs, but I do not see it taking an RCU read lock as would
> > > > be necessary for preventing page table freeing on
> > > > MMU_GATHER_RCU_TABLE_FREE configs. I figure local_irq_save()
> > > > indirectly takes an rcu read lock somehow ? I think this is something
> > > > I should also mention in my explanation, and I have not seen a good
> > > > description of this on the fast GUP side...
> > >
> > > Sounds like a bug! That being said, based on my extremely limited
> > > understanding of how the common RCU modes work, local_irq_save()
> > > probably implies an RCU lock in at least some cases. Hi Paul!
> >
> > In modern kernels, local_irq_save() does have RCU reader semantics,
> > meaning that synchronize_rcu() will wait for pre-existing irq-disabled
> > regions. It will also wait for pre-existing bh-disable, preempt-disable,
> > and of course rcu_read_lock() sections of code.
>
> Thanks Paul for confirming / clarifying this. BTW, it would be good to
> add this to the rcu header files, just so people have something to
> reference to when they depend on such behavior (like fast GUP
> currently does).

Or, even better, fast GUP could add an explicit RCU read lock.

>
> Going back to my patch. I don't need to protect against THP splitting
> here, as I'm only handling the small page case. So when
> MMU_GATHER_RCU_TABLE_FREE is enabled, I *think* I could get away with
> using only an rcu read lock, instead of disabling interrupts which
> implicitly creates the rcu read lock. I'm not sure which way to go -
> fast GUP always disables interrupts regardless of the
> MMU_GATHER_RCU_TABLE_FREE setting, and I think there is a case to be
> made for following the fast GUP steps rather than trying to be smarter.

How about adding some little helpers:

lockless_page_walk_begin();

lockless_page_walk_end();

these turn into RCU read locks if MMU_GATHER_RCU_TABLE_FREE and into
irqsave otherwise. And they're somewhat self-documenting.

--Andy
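
For concreteness, one possible shape for these helpers is sketched below. The
names follow the suggestion above; the bodies, the #ifdef structure, and the
use of local_irq_disable() rather than an irqsave variant are illustrative
assumptions, not code from this series.

#include <linux/irqflags.h>
#include <linux/rcupdate.h>

#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
/* Page table pages are freed via RCU on these configs, so an RCU read
 * lock is enough to keep them around during a lockless walk. */
static inline void lockless_page_walk_begin(void)
{
	rcu_read_lock();
}

static inline void lockless_page_walk_end(void)
{
	rcu_read_unlock();
}
#else
/* Otherwise page table freeing synchronizes against the TLB-flush IPI
 * broadcast, which a local irq-disabled region blocks. */
static inline void lockless_page_walk_begin(void)
{
	local_irq_disable();
}

static inline void lockless_page_walk_end(void)
{
	local_irq_enable();
}
#endif

A caller would then bracket its page table walk with
lockless_page_walk_begin()/lockless_page_walk_end() instead of open-coding
the irq or RCU handling.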

2021-04-29 15:55:17

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC PATCH 13/37] mm: implement speculative handling in __handle_mm_fault().

On Wed, Apr 28, 2021 at 05:02:25PM -0700, Michel Lespinasse wrote:
> On Wed, Apr 28, 2021 at 09:11:08AM -0700, Paul E. McKenney wrote:
> > On Wed, Apr 28, 2021 at 08:13:53AM -0700, Andy Lutomirski wrote:
> > > On Wed, Apr 28, 2021 at 8:05 AM Michel Lespinasse <[email protected]> wrote:
> > > >
> > > > On Wed, Apr 07, 2021 at 08:36:01AM -0700, Andy Lutomirski wrote:
> > > > > On 4/6/21 6:44 PM, Michel Lespinasse wrote:
> > > > > > The page table tree is walked with local irqs disabled, which prevents
> > > > > > page table reclamation (similarly to what fast GUP does). The logic is
> > > > > > otherwise similar to the non-speculative path, but with additional
> > > > > > restrictions: in the speculative path, we do not handle huge pages or
> > > > > > wiring new page tables.
> > > > >
> > > > > Not on most architectures. Quoting the actual comment in mm/gup.c:
> > > > >
> > > > > > * Before activating this code, please be aware that the following assumptions
> > > > > > * are currently made:
> > > > > > *
> > > > > > * *) Either MMU_GATHER_RCU_TABLE_FREE is enabled, and tlb_remove_table() is used to
> > > > > > * free pages containing page tables or TLB flushing requires IPI broadcast.
> > > > >
> > > > > On MMU_GATHER_RCU_TABLE_FREE architectures, you cannot make the
> > > > > assumption that it is safe to dereference a pointer in a page table just
> > > > > because irqs are off. You need RCU protection, too.
> > > > >
> > > > > You have the same error in the cover letter.
> > > >
> > > > Hi Andy,
> > > >
> > > > Thanks for your comment. At first I thought it did not matter, because we
> > > > only enable ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT on selected
> > > > architectures, and I thought MMU_GATHER_RCU_TABLE_FREE is not set on
> > > > these. But I was wrong - MMU_GATHER_RCU_TABLE_FREE is enabled on X86
> > > > with paravirt. So I took another look at fast GUP to make sure I
> > > > actually understand it.
> > > >
> > > > This brings a question about lockless_pages_from_mm() - I see it
> > > > disabling interrupts, which it explains is necessary for disabling THP
> > > > splitting IPIs, but I do not see it taking an RCU read lock as would
> > > > be necessary for preventing page table freeing on
> > > > MMU_GATHER_RCU_TABLE_FREE configs. I figure local_irq_save()
> > > > indirectly takes an rcu read lock somehow ? I think this is something
> > > > I should also mention in my explanation, and I have not seen a good
> > > > description of this on the fast GUP side...
> > >
> > > Sounds like a bug! That being said, based on my extremely limited
> > > understanding of how the common RCU modes work, local_irq_save()
> > > probably implies an RCU lock in at least some cases. Hi Paul!
> >
> > In modern kernels, local_irq_save() does have RCU reader semantics,
> > meaning that synchronize_rcu() will wait for pre-existing irq-disabled
> > regions. It will also wait for pre-existing bh-disable, preempt-disable,
> > and of course rcu_read_lock() sections of code.
>
> Thanks Paul for confirming / clarifying this. BTW, it would be good to
> add this to the rcu header files, just so people have something to
> reference to when they depend on such behavior (like fast GUP
> currently does).

There is this in the synchronize_rcu() header block comment:

* synchronize_rcu() was waiting. RCU read-side critical sections are
* delimited by rcu_read_lock() and rcu_read_unlock(), and may be nested.
* In addition, regions of code across which interrupts, preemption, or
* softirqs have been disabled also serve as RCU read-side critical
* sections. This includes hardware interrupt handlers, softirq handlers,
* and NMI handlers.

I have pulled this into a separate paragraph to increase its visibility,
and will check out other locations in comments and documentation.

Thanx, Paul
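
To make the guarantee concrete, a minimal made-up reader/updater pair relying
on it could look like the following; the names and types are invented, and the
point is only that the irq-disabled region counts as an RCU read-side critical
section on kernels with the consolidated RCU flavors.

#include <linux/irqflags.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct item {
	int val;
	struct rcu_head rcu;
};

static struct item __rcu *current_item;

/* Reader: the irq-off region is itself an RCU read-side critical
 * section, so the item cannot be freed while it is dereferenced. */
static int read_current_val(void)
{
	unsigned long flags;
	struct item *p;
	int val = -1;

	local_irq_save(flags);
	p = rcu_dereference_sched(current_item);
	if (p)
		val = p->val;
	local_irq_restore(flags);

	return val;
}

/* Updater: assumes updates are serialized externally. The RCU-deferred
 * free cannot run until the irq-off reader above has finished. */
static void replace_current_item(struct item *new_item)
{
	struct item *old = rcu_replace_pointer(current_item, new_item, true);

	if (old)
		kfree_rcu(old, rcu);
}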

> Going back to my patch. I don't need to protect against THP splitting
> here, as I'm only handling the small page case. So when
> MMU_GATHER_RCU_TABLE_FREE is enabled, I *think* I could get away with
> using only an rcu read lock, instead of disabling interrupts which
> implicitly creates the rcu read lock. I'm not sure which way to go -
> fast GUP always disables interrupts regardless of the
> MMU_GATHER_RCU_TABLE_FREE setting, and I think there is a case to be
> made for following the fast GUP steps rather than trying to be smarter.
>
> Andy, do you have any opinion on this ? Or anyone else really ?
>
> Thanks,
>
> --
> Michel "walken" Lespinasse

2021-04-29 16:15:11

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC PATCH 13/37] mm: implement speculative handling in __handle_mm_fault().

On Wed, Apr 28, 2021 at 05:05:17PM -0700, Andy Lutomirski wrote:
> On Wed, Apr 28, 2021 at 5:02 PM Michel Lespinasse <[email protected]> wrote:
> > Thanks Paul for confirming / clarifying this. BTW, it would be good to
> > add this to the rcu header files, just so people have something to
> > reference to when they depend on such behavior (like fast GUP
> > currently does).
>
> Or, even better, fast GUP could add an explicit RCU read lock.
>
> >
> > Going back to my patch. I don't need to protect against THP splitting
> > here, as I'm only handling the small page case. So when
> > MMU_GATHER_RCU_TABLE_FREE is enabled, I *think* I could get away with
> > using only an rcu read lock, instead of disabling interrupts which
> > implicitly creates the rcu read lock. I'm not sure which way to go -
> > fast GUP always disables interrupts regardless of the
> > MMU_GATHER_RCU_TABLE_FREE setting, and I think there is a case to be
> > made for following the fast GUP steps rather than trying to be smarter.
>
> How about adding some little helpers:
>
> lockless_page_walk_begin();
>
> lockless_page_walk_end();
>
> these turn into RCU read locks if MMU_GATHER_RCU_TABLE_FREE and into
> irqsave otherwise. And they're somewhat self-documenting.

One of the worst things we can do while holding a spinlock is take a
cache miss because we then delay for several thousand cycles to wait for
the cache line. That gives every other CPU a really long opportunity
to slam into the spinlock and things go downhill fast at that point.
We've even seen patches to do things like read A, take lock L, then read
A to avoid the cache miss while holding the lock.
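
As an illustration of the "read A, take lock L, then read A" trick mentioned
here, a made-up example might look like the following; the structure and field
names are invented.

#include <linux/cache.h>
#include <linux/compiler.h>
#include <linux/spinlock.h>

struct stats {
	spinlock_t lock;
	/* assume the data sits on a different cache line than the lock */
	long value ____cacheline_aligned;
};

static long stats_read_value(struct stats *s)
{
	long v;

	/* Touch the data first so any cache miss is taken before the lock
	 * is held, not while other CPUs are piling up on the spinlock.
	 * The value read here is discarded; only the locked read counts. */
	(void)READ_ONCE(s->value);

	spin_lock(&s->lock);
	v = s->value;
	spin_unlock(&s->lock);

	return v;
}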

What sort of performance effect would it have to free page tables
under RCU for all architectures? It's painful on s390 & powerpc because
different tables share the same struct page, but I have to believe that's
a solvable problem.

2021-04-29 18:07:01

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH 13/37] mm: implement speculative handling in __handle_mm_fault().



> On Apr 29, 2021, at 9:12 AM, Matthew Wilcox <[email protected]> wrote:
>
> On Wed, Apr 28, 2021 at 05:05:17PM -0700, Andy Lutomirski wrote:
>>> On Wed, Apr 28, 2021 at 5:02 PM Michel Lespinasse <[email protected]> wrote:
>>> Thanks Paul for confirming / clarifying this. BTW, it would be good to
>>> add this to the rcu header files, just so people have something to
>>> reference to when they depend on such behavior (like fast GUP
>>> currently does).
>>
>> Or, even better, fast GUP could add an explicit RCU read lock.
>>
>>>
>>> Going back to my patch. I don't need to protect against THP splitting
>>> here, as I'm only handling the small page case. So when
>>> MMU_GATHER_RCU_TABLE_FREE is enabled, I *think* I could get away with
>>> using only an rcu read lock, instead of disabling interrupts which
>>> implicitly creates the rcu read lock. I'm not sure which way to go -
>>> fast GUP always disables interrupts regardless of the
>>> MMU_GATHER_RCU_TABLE_FREE setting, and I think there is a case to be
>>> made for following the fast GUP steps rather than trying to be smarter.
>>
>> How about adding some little helpers:
>>
>> lockless_page_walk_begin();
>>
>> lockless_page_walk_end();
>>
>> these turn into RCU read locks if MMU_GATHER_RCU_TABLE_FREE and into
>> irqsave otherwise. And they're somewhat self-documenting.
>
> One of the worst things we can do while holding a spinlock is take a
> cache miss because we then delay for several thousand cycles to wait for
> the cache line. That gives every other CPU a really long opportunity
> to slam into the spinlock and things go downhill fast at that point.
> We've even seen patches to do things like read A, take lock L, then read
> A to avoid the cache miss while holding the lock.
>
> What sort of performance effect would it have to free page tables
> under RCU for all architectures? It's painful on s390 & powerpc because
> different tables share the same struct page, but I have to believe that's
> a solvable problem.

The IPI locking mechanism is entirely useless on any architecture that wants to do paravirt shootdowns, so this seems like a good strategy to me.

2021-04-29 18:35:05

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC PATCH 13/37] mm: implement speculative handling in __handle_mm_fault().

On Thu, Apr 29, 2021 at 08:52:50AM -0700, Paul E. McKenney wrote:
> On Wed, Apr 28, 2021 at 05:02:25PM -0700, Michel Lespinasse wrote:
> > On Wed, Apr 28, 2021 at 09:11:08AM -0700, Paul E. McKenney wrote:
> > > On Wed, Apr 28, 2021 at 08:13:53AM -0700, Andy Lutomirski wrote:
> > > > On Wed, Apr 28, 2021 at 8:05 AM Michel Lespinasse <[email protected]> wrote:
> > > > >
> > > > > On Wed, Apr 07, 2021 at 08:36:01AM -0700, Andy Lutomirski wrote:
> > > > > > On 4/6/21 6:44 PM, Michel Lespinasse wrote:
> > > > > > > The page table tree is walked with local irqs disabled, which prevents
> > > > > > > page table reclamation (similarly to what fast GUP does). The logic is
> > > > > > > otherwise similar to the non-speculative path, but with additional
> > > > > > > restrictions: in the speculative path, we do not handle huge pages or
> > > > > > > wiring new page tables.
> > > > > >
> > > > > > Not on most architectures. Quoting the actual comment in mm/gup.c:
> > > > > >
> > > > > > > * Before activating this code, please be aware that the following assumptions
> > > > > > > * are currently made:
> > > > > > > *
> > > > > > > * *) Either MMU_GATHER_RCU_TABLE_FREE is enabled, and tlb_remove_table() is used to
> > > > > > > * free pages containing page tables or TLB flushing requires IPI broadcast.
> > > > > >
> > > > > > On MMU_GATHER_RCU_TABLE_FREE architectures, you cannot make the
> > > > > > assumption that it is safe to dereference a pointer in a page table just
> > > > > > because irqs are off. You need RCU protection, too.
> > > > > >
> > > > > > You have the same error in the cover letter.
> > > > >
> > > > > Hi Andy,
> > > > >
> > > > > Thanks for your comment. At first I thought it did not matter, because we
> > > > > only enable ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT on selected
> > > > > architectures, and I thought MMU_GATHER_RCU_TABLE_FREE is not set on
> > > > > these. But I was wrong - MMU_GATHER_RCU_TABLE_FREE is enabled on X86
> > > > > with paravirt. So I took another look at fast GUP to make sure I
> > > > > actually understand it.
> > > > >
> > > > > This brings a question about lockless_pages_from_mm() - I see it
> > > > > disabling interrupts, which it explains is necessary for disabling THP
> > > > > splitting IPIs, but I do not see it taking an RCU read lock as would
> > > > > be necessary for preventing page table freeing on
> > > > > MMU_GATHER_RCU_TABLE_FREE configs. I figure local_irq_save()
> > > > > indirectly takes an rcu read lock somehow ? I think this is something
> > > > > I should also mention in my explanation, and I have not seen a good
> > > > > description of this on the fast GUP side...
> > > >
> > > > Sounds like a bug! That being said, based on my extremely limited
> > > > understanding of how the common RCU modes work, local_irq_save()
> > > > probably implies an RCU lock in at least some cases. Hi Paul!
> > >
> > > In modern kernels, local_irq_save() does have RCU reader semantics,
> > > meaning that synchronize_rcu() will wait for pre-existing irq-disabled
> > > regions. It will also wait for pre-existing bh-disable, preempt-disable,
> > > and of course rcu_read_lock() sections of code.
> >
> > Thanks Paul for confirming / clarifying this. BTW, it would be good to
> > add this to the rcu header files, just so people have something to
> > reference to when they depend on such behavior (like fast GUP
> > currently does).
>
> There is this in the synchronize_rcu() header block comment:
>
> * synchronize_rcu() was waiting. RCU read-side critical sections are
> * delimited by rcu_read_lock() and rcu_read_unlock(), and may be nested.
> * In addition, regions of code across which interrupts, preemption, or
> * softirqs have been disabled also serve as RCU read-side critical
> * sections. This includes hardware interrupt handlers, softirq handlers,
> * and NMI handlers.
>
> I have pulled this into a separate paragraph to increase its visibility,
> and will check out other locations in comments and documentation.

Ditto for call_rcu() and the separate paragraph.

The rcu_read_lock_bh() and rcu_read_lock_sched() header comments noted
that these act as RCU read-side critical sections, but I added similar
verbiage to rcu_dereference_bh_check() and rcu_dereference_sched_check().

Please see below for the resulting commit.

Thoughts?

Thanx, Paul

------------------------------------------------------------------------

commit 97262c64c2cf807bf06825e454c4bedd228fadfb
Author: Paul E. McKenney <[email protected]>
Date: Thu Apr 29 11:18:01 2021 -0700

rcu: Improve comments describing RCU read-side critical sections

There are a number of places that call out the fact that preempt-disable
regions of code now act as RCU read-side critical sections, where
preempt-disable regions of code include irq-disable regions of code,
bh-disable regions of code, hardirq handlers, and NMI handlers. However,
someone relying solely on (for example) the call_rcu() header comment
might well have no idea that preempt-disable regions of code have RCU
semantics.

This commit therefore updates the header comments for
call_rcu(), synchronize_rcu(), rcu_dereference_bh_check(), and
rcu_dereference_sched_check() to call out these new(ish) forms of RCU
readers.

Reported-by: Michel Lespinasse <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index a10480f2b4ef..c01b04ad64c4 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -532,7 +532,10 @@ do { \
* @p: The pointer to read, prior to dereferencing
* @c: The conditions under which the dereference will take place
*
- * This is the RCU-bh counterpart to rcu_dereference_check().
+ * This is the RCU-bh counterpart to rcu_dereference_check(). However,
+ * please note that in recent kernels, synchronize_rcu() waits for
+ * local_bh_disable() regions of code in addition to regions of code
+ * demarked by rcu_read_lock() and rcu_read_unlock().
*/
#define rcu_dereference_bh_check(p, c) \
__rcu_dereference_check((p), (c) || rcu_read_lock_bh_held(), __rcu)
@@ -543,6 +546,9 @@ do { \
* @c: The conditions under which the dereference will take place
*
* This is the RCU-sched counterpart to rcu_dereference_check().
+ * However, please note that in recent kernels, synchronize_rcu() waits
+ * for preemption-disabled regions of code in addition to regions of code
+ * demarked by rcu_read_lock() and rcu_read_unlock().
*/
#define rcu_dereference_sched_check(p, c) \
__rcu_dereference_check((p), (c) || rcu_read_lock_sched_held(), \
@@ -634,6 +640,12 @@ do { \
* sections, invocation of the corresponding RCU callback is deferred
* until after the all the other CPUs exit their critical sections.
*
+ * In recent kernels, synchronize_rcu() and call_rcu() also wait for
+ * regions of code with preemption disabled, including regions of code
+ * with interrupts or softirqs disabled. If your kernel is old enough
+ * for synchronize_sched() to be defined, only code enclosed within
+ * rcu_read_lock() and rcu_read_unlock() are guaranteed to be waited for.
+ *
* Note, however, that RCU callbacks are permitted to run concurrently
* with new RCU read-side critical sections. One way that this can happen
* is via the following sequence of events: (1) CPU 0 enters an RCU
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 9ea1d4eef1ad..0e76bf47d92b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3071,12 +3071,13 @@ __call_rcu(struct rcu_head *head, rcu_callback_t func)
* period elapses, in other words after all pre-existing RCU read-side
* critical sections have completed. However, the callback function
* might well execute concurrently with RCU read-side critical sections
- * that started after call_rcu() was invoked. RCU read-side critical
- * sections are delimited by rcu_read_lock() and rcu_read_unlock(), and
- * may be nested. In addition, regions of code across which interrupts,
- * preemption, or softirqs have been disabled also serve as RCU read-side
- * critical sections. This includes hardware interrupt handlers, softirq
- * handlers, and NMI handlers.
+ * that started after call_rcu() was invoked.
+ *
+ * RCU read-side critical sections are delimited by rcu_read_lock() and
+ * rcu_read_unlock(), and may be nested. In addition, regions of code
+ * across which interrupts, preemption, or softirqs have been disabled
+ * also serve as RCU read-side critical sections. This includes hardware
+ * interrupt handlers, softirq handlers, and NMI handlers.
*
* Note that all CPUs must agree that the grace period extended beyond
* all pre-existing RCU read-side critical section. On systems with more
@@ -3771,12 +3772,13 @@ static int rcu_blocking_is_gp(void)
* read-side critical sections have completed. Note, however, that
* upon return from synchronize_rcu(), the caller might well be executing
* concurrently with new RCU read-side critical sections that began while
- * synchronize_rcu() was waiting. RCU read-side critical sections are
- * delimited by rcu_read_lock() and rcu_read_unlock(), and may be nested.
- * In addition, regions of code across which interrupts, preemption, or
- * softirqs have been disabled also serve as RCU read-side critical
- * sections. This includes hardware interrupt handlers, softirq handlers,
- * and NMI handlers.
+ * synchronize_rcu() was waiting.
+ *
+ * RCU read-side critical sections are delimited by rcu_read_lock() and
+ * rcu_read_unlock(), and may be nested. In addition, regions of code
+ * across which interrupts, preemption, or softirqs have been disabled
+ * also serve as RCU read-side critical sections. This includes hardware
+ * interrupt handlers, softirq handlers, and NMI handlers.
*
* Note that this guarantee implies further memory-ordering guarantees.
* On systems with more than one CPU, when synchronize_rcu() returns,

2021-04-29 18:51:26

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC PATCH 13/37] mm: implement speculative handling in __handle_mm_fault().

On Thu, Apr 29, 2021 at 11:34:12AM -0700, Paul E. McKenney wrote:
> +++ b/include/linux/rcupdate.h
> @@ -532,7 +532,10 @@ do { \
> * @p: The pointer to read, prior to dereferencing
> * @c: The conditions under which the dereference will take place
> *
> - * This is the RCU-bh counterpart to rcu_dereference_check().
> + * This is the RCU-bh counterpart to rcu_dereference_check(). However,
> + * please note that in recent kernels, synchronize_rcu() waits for
> + * local_bh_disable() regions of code in addition to regions of code
> + * demarked by rcu_read_lock() and rcu_read_unlock().
> */

I've been trying to get rid of "please note that" in my own documentation
recently. It doesn't add any value. Also, "recent kernels" is going to
go stale quickly, "Since v5.8" (or whatever) is good because it lets us
know in ten years that we can just delete the reference.

So I'd make this:

* This is the RCU-bh equivalent of rcu_dereference_check(). Since v5.8,
* synchronize_rcu() waits for code with bottom halves disabled as well
* as code between rcu_read_lock() and rcu_read_unlock().

2021-04-29 19:25:11

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [RFC PATCH 13/37] mm: implement speculative handling in __handle_mm_fault().

On Thu, Apr 29, 2021 at 05:12:34PM +0100, Matthew Wilcox wrote:
> On Wed, Apr 28, 2021 at 05:05:17PM -0700, Andy Lutomirski wrote:
> > On Wed, Apr 28, 2021 at 5:02 PM Michel Lespinasse <[email protected]> wrote:
> > > Thanks Paul for confirming / clarifying this. BTW, it would be good to
> > > add this to the rcu header files, just so people have something to
> > > reference to when they depend on such behavior (like fast GUP
> > > currently does).
> >
> > Or, even better, fast GUP could add an explicit RCU read lock.
> >
> > >
> > > Going back to my patch. I don't need to protect against THP splitting
> > > here, as I'm only handling the small page case. So when
> > > MMU_GATHER_RCU_TABLE_FREE is enabled, I *think* I could get away with
> > > using only an rcu read lock, instead of disabling interrupts which
> > > implicitly creates the rcu read lock. I'm not sure which way to go -
> > > fast GUP always disables interrupts regardless of the
> > > MMU_GATHER_RCU_TABLE_FREE setting, and I think there is a case to be
> > > made for following the fast GUP steps rather than trying to be smarter.
> >
> > How about adding some little helpers:
> >
> > lockless_page_walk_begin();
> >
> > lockless_page_walk_end();
> >
> > these turn into RCU read locks if MMU_GATHER_RCU_TABLE_FREE and into
> > irqsave otherwise. And they're somewhat self-documenting.
>
> One of the worst things we can do while holding a spinlock is take a
> cache miss because we then delay for several thousand cycles to wait for
> the cache line. That gives every other CPU a really long opportunity
> to slam into the spinlock and things go downhill fast at that point.
> We've even seen patches to do things like read A, take lock L, then read
> A to avoid the cache miss while holding the lock.

I understand the effect you are describing, but I do not see how it
applies here - what cacheline are we likely to miss on when using
local_irq_disable() that we wouldn't touch if using rcu_read_lock() ?

> What sort of performance effect would it have to free page tables
> under RCU for all architectures? It's painful on s390 & powerpc because
> different tables share the same struct page, but I have to believe that's
> a solvable problem.

I agree using RCU to free page tables would be a good thing to try.
I am afraid of adding that to this patchset though, as it seems
somewhat unrelated and adds risk. IMO we are most likely to find
justification for pushing this if/when we try accessing remote mm's without
taking the mmap lock, since disabling IPIs clearly wouldn't work there.

--
Michel "walken" Lespinasse

2021-04-29 19:37:34

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC PATCH 13/37] mm: implement speculative handling in __handle_mm_fault().

On Thu, Apr 29, 2021 at 12:14:28PM -0700, Michel Lespinasse wrote:
> On Thu, Apr 29, 2021 at 05:12:34PM +0100, Matthew Wilcox wrote:
> > One of the worst things we can do while holding a spinlock is take a
> > cache miss because we then delay for several thousand cycles to wait for
> > the cache line. That gives every other CPU a really long opportunity
> > to slam into the spinlock and things go downhill fast at that point.
> > We've even seen patches to do things like read A, take lock L, then read
> > A to avoid the cache miss while holding the lock.
>
> I understand the effect you are describing, but I do not see how it
> applies here - what cacheline are we likely to miss on when using
> local_irq_disable() that we wouldn't touch if using rcu_read_lock() ?

It's the same cache lines in both cases. The difference is that in one
case we have interrupts disabled (and a spinlock held? i wasn't clear
on that) and in the other case, we just have preemption disabled.

> > What sort of performance effect would it have to free page tables
> > under RCU for all architectures? It's painful on s390 & powerpc because
> > different tables share the same struct page, but I have to believe that's
> > a solvable problem.
>
> I agree using RCU to free page tables would be a good thing to try.
> I am afraid of adding that to this patchset though, as it seems
> somewhate unrelated and adds risk. IMO we are most likely to find
> justification for pushing this if/when we try accessing remote mm's without
> taking the mmap lock, since disabling IPIs clearly wouldn't work there.

I think that needs to happen _before_ this patchset. Creating a mess and
then trying to clean it up later isn't a great way to do development.

2021-04-29 21:19:19

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [RFC PATCH 13/37] mm: implement speculative handling in __handle_mm_fault().

On Thu, Apr 29, 2021 at 11:34:12AM -0700, Paul E. McKenney wrote:
> ------------------------------------------------------------------------
>
> commit 97262c64c2cf807bf06825e454c4bedd228fadfb
> Author: Paul E. McKenney <[email protected]>
> Date: Thu Apr 29 11:18:01 2021 -0700
>
> rcu: Improve comments describing RCU read-side critical sections
>
> There are a number of places that call out the fact that preempt-disable
> regions of code now act as RCU read-side critical sections, where
> preempt-disable regions of code include irq-disable regions of code,
> bh-disable regions of code, hardirq handlers, and NMI handlers. However,
> someone relying solely on (for example) the call_rcu() header comment
> might well have no idea that preempt-disable regions of code have RCU
> semantics.
>
> This commit therefore updates the header comments for
> call_rcu(), synchronize_rcu(), rcu_dereference_bh_check(), and
> rcu_dereference_sched_check() to call out these new(ish) forms of RCU
> readers.
>
> Reported-by: Michel Lespinasse <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
>
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index a10480f2b4ef..c01b04ad64c4 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -532,7 +532,10 @@ do { \
> * @p: The pointer to read, prior to dereferencing
> * @c: The conditions under which the dereference will take place
> *
> - * This is the RCU-bh counterpart to rcu_dereference_check().
> + * This is the RCU-bh counterpart to rcu_dereference_check(). However,
> + * please note that in recent kernels, synchronize_rcu() waits for
> + * local_bh_disable() regions of code in addition to regions of code
> + * demarked by rcu_read_lock() and rcu_read_unlock().

Two things:
- "recent kernels" could be clarified, as Matthew pointed out
- The above is not 100% clear if call_rcu() also waits for
local_bh_disable() regions of code ? (you did clarify this in tree.c
but I think it's better to have that here as well)

> */
> #define rcu_dereference_bh_check(p, c) \
> __rcu_dereference_check((p), (c) || rcu_read_lock_bh_held(), __rcu)
> @@ -543,6 +546,9 @@ do { \
> * @c: The conditions under which the dereference will take place
> *
> * This is the RCU-sched counterpart to rcu_dereference_check().
> + * However, please note that in recent kernels, synchronize_rcu() waits
> + * for preemption-disabled regions of code in addition to regions of code
> + * demarked by rcu_read_lock() and rcu_read_unlock().

Same comments regarding "recent kernels" and call_rcu() here.

> */
> #define rcu_dereference_sched_check(p, c) \
> __rcu_dereference_check((p), (c) || rcu_read_lock_sched_held(), \
> @@ -634,6 +640,12 @@ do { \
> * sections, invocation of the corresponding RCU callback is deferred
> * until after the all the other CPUs exit their critical sections.
> *
> + * In recent kernels, synchronize_rcu() and call_rcu() also wait for
> + * regions of code with preemption disabled, including regions of code
> + * with interrupts or softirqs disabled. If your kernel is old enough
> + * for synchronize_sched() to be defined, only code enclosed within
> + * rcu_read_lock() and rcu_read_unlock() are guaranteed to be waited for.
> + *

Thanks, this is the quote I was looking for, and also I think it's
important for it to be in rcupdate.h rather than any .c implementation
(I think it's more natural to look at headers for this kind of stuff).

Same comment regarding "old enough" / "recent kernels" though.

> * Note, however, that RCU callbacks are permitted to run concurrently
> * with new RCU read-side critical sections. One way that this can happen
> * is via the following sequence of events: (1) CPU 0 enters an RCU
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c

The tree.c changes look fine to me.

Thanks a lot for looking into this !

--
Michel "walken" Lespinasse

2021-04-29 23:59:37

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [RFC PATCH 13/37] mm: implement speculative handling in __handle_mm_fault().

On Thu, Apr 29, 2021 at 08:34:56PM +0100, Matthew Wilcox wrote:
> On Thu, Apr 29, 2021 at 12:14:28PM -0700, Michel Lespinasse wrote:
> > On Thu, Apr 29, 2021 at 05:12:34PM +0100, Matthew Wilcox wrote:
> > > One of the worst things we can do while holding a spinlock is take a
> > > cache miss because we then delay for several thousand cycles to wait for
> > > the cache line. That gives every other CPU a really long opportunity
> > > to slam into the spinlock and things go downhill fast at that point.
> > > We've even seen patches to do things like read A, take lock L, then read
> > > A to avoid the cache miss while holding the lock.
> >
> > I understand the effect you are describing, but I do not see how it
> > applies here - what cacheline are we likely to miss on when using
> > local_irq_disable() that we wouldn't touch if using rcu_read_lock() ?
>
> It's the same cache lines in both cases. The difference is that in one
> case we have interrupts disabled (and a spinlock held? i wasn't clear
> on that) and in the other case, we just have preemption disabled.

To add some context - the problem we are trying to solve here (and a
different instance of it in the next commit) is that we are trying to
map and/or lock the page table, but need to prevent it from being
freed while we are trying to do so. Disabling interrupts or taking an
rcu read lock are two mechanisms for preventing that race, but the
structures accessed are the same in either case.
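
To make that concrete, a heavily simplified walk of the kind being protected
might look like the sketch below. It shows the irq-disable variant; on
MMU_GATHER_RCU_TABLE_FREE configs an rcu_read_lock()/rcu_read_unlock() pair
would play the same role. The function name and shape are invented for
illustration and are not the patch's code.

#include <linux/irqflags.h>
#include <linux/mm.h>
#include <linux/pgtable.h>

/* Returns true if a present pte exists at addr. Interrupts stay off
 * across the walk so the page table pages cannot be freed under us. */
static bool spf_pte_present(struct mm_struct *mm, unsigned long addr)
{
	pgd_t *pgd;
	p4d_t *p4d;
	pud_t *pud;
	pmd_t *pmd;
	pmd_t pmdval;
	pte_t *pte;
	bool present = false;

	local_irq_disable();
	pgd = pgd_offset(mm, addr);
	if (pgd_none(*pgd))
		goto out;
	p4d = p4d_offset(pgd, addr);
	if (p4d_none(*p4d))
		goto out;
	pud = pud_offset(p4d, addr);
	if (pud_none(*pud))
		goto out;
	pmd = pmd_offset(pud, addr);
	pmdval = READ_ONCE(*pmd);
	if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
		goto out;	/* the speculative path only handles small pages */
	pte = pte_offset_map(&pmdval, addr);
	present = pte_present(*pte);
	pte_unmap(pte);
out:
	local_irq_enable();
	return present;
}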

> > > What sort of performance effect would it have to free page tables
> > > under RCU for all architectures? It's painful on s390 & powerpc because
> > > different tables share the same struct page, but I have to believe that's
> > > a solvable problem.
> >
> > I agree using RCU to free page tables would be a good thing to try.
> > I am afraid of adding that to this patchset though, as it seems
> > somewhat unrelated and adds risk. IMO we are most likely to find
> > justification for pushing this if/when we try accessing remote mm's without
> > taking the mmap lock, since disabling IPIs clearly wouldn't work there.
>
> I think that needs to happen _before_ this patchset. Creating a mess and
> then trying to clean it up later isn't a great way to do development.

Agree we don't want to create a mess... but I see that as an argument for
not hastily changing the way page tables are reclaimed...

--
Michel "walken" Lespinasse

2021-05-03 03:16:27

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC PATCH 13/37] mm: implement speculative handling in __handle_mm_fault().

On Thu, Apr 29, 2021 at 07:49:08PM +0100, Matthew Wilcox wrote:
> On Thu, Apr 29, 2021 at 11:34:12AM -0700, Paul E. McKenney wrote:
> > +++ b/include/linux/rcupdate.h
> > @@ -532,7 +532,10 @@ do { \
> > * @p: The pointer to read, prior to dereferencing
> > * @c: The conditions under which the dereference will take place
> > *
> > - * This is the RCU-bh counterpart to rcu_dereference_check().
> > + * This is the RCU-bh counterpart to rcu_dereference_check(). However,
> > + * please note that in recent kernels, synchronize_rcu() waits for
> > + * local_bh_disable() regions of code in addition to regions of code
> > + * demarked by rcu_read_lock() and rcu_read_unlock().
> > */
>
> I've been trying to get rid of "please note that" in my own documentation
> recently. It doesn't add any value. Also, "recent kernels" is going to
> go stale quickly, "Since v5.8" (or whatever) is good because it lets us
> know in ten years that we can just delete the reference.
>
> So I'd make this:
>
> * This is the RCU-bh equivalent of rcu_dereference_check(). Since v5.8,
> * synchronize_rcu() waits for code with bottom halves disabled as well
> * as code between rcu_read_lock() and rcu_read_unlock().

Normally, I would be right there with you on the "less is more"
approach to writing. But in this particular case:

1. I added comments to rcu_read_lock_bh(), rcu_read_lock_sched(),
call_rcu(), and synchronize_rcu().

2. I included a section entitled "RCU flavor consolidation" in the
2019 edition of the RCU API: https://lwn.net/Articles/777036/

3. I presented on this topic at LCA:
https://www.youtube.com/watch?v=hZX1aokdNiY

4. I published a paper on this topic:
https://dl.acm.org/doi/10.1145/3319647.3325836
http://www.rdrop.com/~paulmck/RCU/rcu-exploit.2019.05.01a.pdf

All of these, even taken together, have proven to be insufficient.
This therefore does not appear to be the place to economize on words. :-/

Your point on the version (v5.0, as it turns out) is right on, and I
will make that change.

Thanx, Paul

2021-05-03 03:42:10

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC PATCH 13/37] mm: implement speculative handling in __handle_mm_fault().

On Thu, Apr 29, 2021 at 02:17:58PM -0700, Michel Lespinasse wrote:
> On Thu, Apr 29, 2021 at 11:34:12AM -0700, Paul E. McKenney wrote:
> > ------------------------------------------------------------------------
> >
> > commit 97262c64c2cf807bf06825e454c4bedd228fadfb
> > Author: Paul E. McKenney <[email protected]>
> > Date: Thu Apr 29 11:18:01 2021 -0700
> >
> > rcu: Improve comments describing RCU read-side critical sections
> >
> > There are a number of places that call out the fact that preempt-disable
> > regions of code now act as RCU read-side critical sections, where
> > preempt-disable regions of code include irq-disable regions of code,
> > bh-disable regions of code, hardirq handlers, and NMI handlers. However,
> > someone relying solely on (for example) the call_rcu() header comment
> > might well have no idea that preempt-disable regions of code have RCU
> > semantics.
> >
> > This commit therefore updates the header comments for
> > call_rcu(), synchronize_rcu(), rcu_dereference_bh_check(), and
> > rcu_dereference_sched_check() to call out these new(ish) forms of RCU
> > readers.
> >
> > Reported-by: Michel Lespinasse <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
> >
> > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > index a10480f2b4ef..c01b04ad64c4 100644
> > --- a/include/linux/rcupdate.h
> > +++ b/include/linux/rcupdate.h
> > @@ -532,7 +532,10 @@ do { \
> > * @p: The pointer to read, prior to dereferencing
> > * @c: The conditions under which the dereference will take place
> > *
> > - * This is the RCU-bh counterpart to rcu_dereference_check().
> > + * This is the RCU-bh counterpart to rcu_dereference_check(). However,
> > + * please note that in recent kernels, synchronize_rcu() waits for
> > + * local_bh_disable() regions of code in addition to regions of code
> > + * demarked by rcu_read_lock() and rcu_read_unlock().
>
> Two things:
> - "recent kernels" could be clarified, as Matthew pointed out
> - The above is not 100% clear if call_rcu() also waits for
> local_bh_disable() regions of code ? (you did clarify this in tree.c
> but I think it's better to have that here as well)

Good points, I updated both.

> > */
> > #define rcu_dereference_bh_check(p, c) \
> > __rcu_dereference_check((p), (c) || rcu_read_lock_bh_held(), __rcu)
> > @@ -543,6 +546,9 @@ do { \
> > * @c: The conditions under which the dereference will take place
> > *
> > * This is the RCU-sched counterpart to rcu_dereference_check().
> > + * However, please note that in recent kernels, synchronize_rcu() waits
> > + * for preemption-disabled regions of code in addition to regions of code
> > + * demarked by rcu_read_lock() and rcu_read_unlock().
>
> Same comments regarding "recent kernels" and call_rcu() here.

And here as well.

> > */
> > #define rcu_dereference_sched_check(p, c) \
> > __rcu_dereference_check((p), (c) || rcu_read_lock_sched_held(), \
> > @@ -634,6 +640,12 @@ do { \
> > * sections, invocation of the corresponding RCU callback is deferred
> > * until after the all the other CPUs exit their critical sections.
> > *
> > + * In recent kernels, synchronize_rcu() and call_rcu() also wait for
> > + * regions of code with preemption disabled, including regions of code
> > + * with interrupts or softirqs disabled. If your kernel is old enough
> > + * for synchronize_sched() to be defined, only code enclosed within
> > + * rcu_read_lock() and rcu_read_unlock() are guaranteed to be waited for.
> > + *
>
> Thanks, this is the quote I was looking for, and also I think it's
> important for it to be in rcupdate.h rather than any .c implementation
> (I think it's more natural to look at headers for this kind of stuff).
>
> Same comment regarding "old enough" / "recent kernels" though.
>
> > * Note, however, that RCU callbacks are permitted to run concurrently
> > * with new RCU read-side critical sections. One way that this can happen
> > * is via the following sequence of events: (1) CPU 0 enters an RCU
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>
> The tree.c changes look fine to me.

I added the version here also.

> Thanks a lot for looking into this !

And here is the updated commit. Thoughts?

Thanx, Paul

------------------------------------------------------------------------

commit cc5a0ad5aa52d26379d5cd04d0a8f0917caf7365
Author: Paul E. McKenney <[email protected]>
Date: Thu Apr 29 11:18:01 2021 -0700

rcu: Improve comments describing RCU read-side critical sections

There are a number of places that call out the fact that preempt-disable
regions of code now act as RCU read-side critical sections, where
preempt-disable regions of code include irq-disable regions of code,
bh-disable regions of code, hardirq handlers, and NMI handlers. However,
someone relying solely on (for example) the call_rcu() header comment
might well have no idea that preempt-disable regions of code have RCU
semantics.

This commit therefore updates the header comments for
call_rcu(), synchronize_rcu(), rcu_dereference_bh_check(), and
rcu_dereference_sched_check() to call out these new(ish) forms of RCU
readers.

Reported-by: Michel Lespinasse <[email protected]>
[ paulmck: Apply Matthew Wilcox and Michel Lespinasse feedback. ]
Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index a10480f2b4ef..adc2043e92db 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -532,7 +532,12 @@ do { \
* @p: The pointer to read, prior to dereferencing
* @c: The conditions under which the dereference will take place
*
- * This is the RCU-bh counterpart to rcu_dereference_check().
+ * This is the RCU-bh counterpart to rcu_dereference_check(). However,
+ * please note that starting in v5.0 kernels, vanilla RCU grace periods
+ * wait for local_bh_disable() regions of code in addition to regions of
+ * code demarked by rcu_read_lock() and rcu_read_unlock(). This means
+ * that synchronize_rcu(), call_rcu, and friends all take not only
+ * rcu_read_lock() but also rcu_read_lock_bh() into account.
*/
#define rcu_dereference_bh_check(p, c) \
__rcu_dereference_check((p), (c) || rcu_read_lock_bh_held(), __rcu)
@@ -543,6 +548,11 @@ do { \
* @c: The conditions under which the dereference will take place
*
* This is the RCU-sched counterpart to rcu_dereference_check().
+ * However, please note that starting in v5.0 kernels, vanilla RCU grace
+ * periods wait for preempt_disable() regions of code in addition to
+ * regions of code demarked by rcu_read_lock() and rcu_read_unlock().
+ * This means that synchronize_rcu(), call_rcu, and friends all take not
+ * only rcu_read_lock() but also rcu_read_lock_sched() into account.
*/
#define rcu_dereference_sched_check(p, c) \
__rcu_dereference_check((p), (c) || rcu_read_lock_sched_held(), \
@@ -634,6 +644,12 @@ do { \
* sections, invocation of the corresponding RCU callback is deferred
* until after the all the other CPUs exit their critical sections.
*
+ * In recent kernels, synchronize_rcu() and call_rcu() also wait for
+ * regions of code with preemption disabled, including regions of code
+ * with interrupts or softirqs disabled. If your kernel is old enough
+ * for synchronize_sched() to be defined, only code enclosed within
+ * rcu_read_lock() and rcu_read_unlock() are guaranteed to be waited for.
+ *
* Note, however, that RCU callbacks are permitted to run concurrently
* with new RCU read-side critical sections. One way that this can happen
* is via the following sequence of events: (1) CPU 0 enters an RCU
@@ -728,9 +744,11 @@ static inline void rcu_read_unlock(void)
/**
* rcu_read_lock_bh() - mark the beginning of an RCU-bh critical section
*
- * This is equivalent of rcu_read_lock(), but also disables softirqs.
- * Note that anything else that disables softirqs can also serve as
- * an RCU read-side critical section.
+ * This is equivalent to rcu_read_lock(), but also disables softirqs.
+ * Note that anything else that disables softirqs can also serve as an RCU
+ * read-side critical section. However, please note that this equivalence
+ * applies only to v5.0 and later. Before v5.0, rcu_read_lock() and
+ * rcu_read_lock_bh() were unrelated.
*
* Note that rcu_read_lock_bh() and the matching rcu_read_unlock_bh()
* must occur in the same context, for example, it is illegal to invoke
@@ -763,9 +781,12 @@ static inline void rcu_read_unlock_bh(void)
/**
* rcu_read_lock_sched() - mark the beginning of a RCU-sched critical section
*
- * This is equivalent of rcu_read_lock(), but disables preemption.
- * Read-side critical sections can also be introduced by anything else
- * that disables preemption, including local_irq_disable() and friends.
+ * This is equivalent to rcu_read_lock(), but also disables preemption.
+ * Read-side critical sections can also be introduced by anything else that
+ * disables preemption, including local_irq_disable() and friends. However,
+ * please note that the equivalence to rcu_read_lock() applies only to
+ * v5.0 and later. Before v5.0, rcu_read_lock() and rcu_read_lock_sched()
+ * were unrelated.
*
* Note that rcu_read_lock_sched() and the matching rcu_read_unlock_sched()
* must occur in the same context, for example, it is illegal to invoke
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 9ea1d4eef1ad..9089c23e80dc 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3071,12 +3071,14 @@ __call_rcu(struct rcu_head *head, rcu_callback_t func)
* period elapses, in other words after all pre-existing RCU read-side
* critical sections have completed. However, the callback function
* might well execute concurrently with RCU read-side critical sections
- * that started after call_rcu() was invoked. RCU read-side critical
- * sections are delimited by rcu_read_lock() and rcu_read_unlock(), and
- * may be nested. In addition, regions of code across which interrupts,
- * preemption, or softirqs have been disabled also serve as RCU read-side
- * critical sections. This includes hardware interrupt handlers, softirq
- * handlers, and NMI handlers.
+ * that started after call_rcu() was invoked.
+ *
+ * RCU read-side critical sections are delimited by rcu_read_lock()
+ * and rcu_read_unlock(), and may be nested. In addition, but only in
+ * v5.0 and later, regions of code across which interrupts, preemption,
+ * or softirqs have been disabled also serve as RCU read-side critical
+ * sections. This includes hardware interrupt handlers, softirq handlers,
+ * and NMI handlers.
*
* Note that all CPUs must agree that the grace period extended beyond
* all pre-existing RCU read-side critical section. On systems with more
@@ -3771,10 +3773,12 @@ static int rcu_blocking_is_gp(void)
* read-side critical sections have completed. Note, however, that
* upon return from synchronize_rcu(), the caller might well be executing
* concurrently with new RCU read-side critical sections that began while
- * synchronize_rcu() was waiting. RCU read-side critical sections are
- * delimited by rcu_read_lock() and rcu_read_unlock(), and may be nested.
- * In addition, regions of code across which interrupts, preemption, or
- * softirqs have been disabled also serve as RCU read-side critical
+ * synchronize_rcu() was waiting.
+ *
+ * RCU read-side critical sections are delimited by rcu_read_lock()
+ * and rcu_read_unlock(), and may be nested. In addition, but only in
+ * v5.0 and later, regions of code across which interrupts, preemption,
+ * or softirqs have been disabled also serve as RCU read-side critical
* sections. This includes hardware interrupt handlers, softirq handlers,
* and NMI handlers.
*

2021-05-03 04:40:13

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [RFC PATCH 13/37] mm: implement speculative handling in __handle_mm_fault().

On Sun, May 02, 2021 at 08:40:49PM -0700, Paul E. McKenney wrote:
> @@ -634,6 +644,12 @@ do { \
> * sections, invocation of the corresponding RCU callback is deferred
> * until after the all the other CPUs exit their critical sections.
> *
> + * In recent kernels, synchronize_rcu() and call_rcu() also wait for
> + * regions of code with preemption disabled, including regions of code
> + * with interrupts or softirqs disabled. If your kernel is old enough
> + * for synchronize_sched() to be defined, only code enclosed within
> + * rcu_read_lock() and rcu_read_unlock() are guaranteed to be waited for.
> + *
> * Note, however, that RCU callbacks are permitted to run concurrently
> * with new RCU read-side critical sections. One way that this can happen
> * is via the following sequence of events: (1) CPU 0 enters an RCU

You still have "old enough" / "recent kernels" here. But maybe it's OK
given that you added relevant version numbers elsewhere.

Everything else looks great to me.

Thanks,

--
Michel "walken" Lespinasse

2021-05-03 18:12:05

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC PATCH 13/37] mm: implement speculative handling in __handle_mm_fault().

On Sun, May 02, 2021 at 09:34:30PM -0700, Michel Lespinasse wrote:
> On Sun, May 02, 2021 at 08:40:49PM -0700, Paul E. McKenney wrote:
> > @@ -634,6 +644,12 @@ do { \
> > * sections, invocation of the corresponding RCU callback is deferred
> > * until after the all the other CPUs exit their critical sections.
> > *
> > + * In recent kernels, synchronize_rcu() and call_rcu() also wait for
> > + * regions of code with preemption disabled, including regions of code
> > + * with interrupts or softirqs disabled. If your kernel is old enough
> > + * for synchronize_sched() to be defined, only code enclosed within
> > + * rcu_read_lock() and rcu_read_unlock() are guaranteed to be waited for.
> > + *
> > * Note, however, that RCU callbacks are permitted to run concurrently
> > * with new RCU read-side critical sections. One way that this can happen
> > * is via the following sequence of events: (1) CPU 0 enters an RCU
>
> You still have "old enough" / "recent kernels" here. But maybe it's OK
> given that you added relevant version numbers elsewhere.
>
> Everything else looks great to me.

Good point! Like this?

Thanx, Paul

------------------------------------------------------------------------

commit fd8393a2a8a5ffd25d0766abb262137c36bda9f3
Author: Paul E. McKenney <[email protected]>
Date: Mon May 3 09:32:14 2021 -0700

fixup! rcu: Improve comments describing RCU read-side critical sections

Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 901ab6fa252b..323954363389 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -644,11 +644,11 @@ do { \
* sections, invocation of the corresponding RCU callback is deferred
* until after the all the other CPUs exit their critical sections.
*
- * In recent kernels, synchronize_rcu() and call_rcu() also wait for
- * regions of code with preemption disabled, including regions of code
- * with interrupts or softirqs disabled. If your kernel is old enough
- * for synchronize_sched() to be defined, only code enclosed within
- * rcu_read_lock() and rcu_read_unlock() are guaranteed to be waited for.
+ * In v5.0 and later kernels, synchronize_rcu() and call_rcu() also
+ * wait for regions of code with preemption disabled, including regions of
+ * code with interrupts or softirqs disabled. In pre-v5.0 kernels, which
+ * define synchronize_sched(), only code enclosed within rcu_read_lock()
+ * and rcu_read_unlock() are guaranteed to be waited for.
*
* Note, however, that RCU callbacks are permitted to run concurrently
* with new RCU read-side critical sections. One way that this can happen

2021-06-29 00:36:36

by Axel Rasmussen

[permalink] [raw]
Subject: Re: [RFC PATCH 00/37] Speculative page faults

First, I'm (ab?)using `git send-email` to reply to this thread, so apologies in
advance if I've messed up and the mail gets mangled in some way.

Here are some results from the "pft" microbenchmark in mmtests, run on
"large-ish" machines: one with a Skylake CPU, one with a Rome CPU. These
results compare Linus' v5.12 tag against a 5.12 base with Michel's SPF
patchset applied.

The tests were run with the following mmtests configuration:

export PFT_ITERATIONS=30
export PFT_MAPPING_SIZE=$((MEMTOTAL_BYTES/5))
export PFT_MIN_CLIENTS=1
export PFT_MAX_CLIENTS=$NUMCPUS
export PFT_USE_PROCESSES=no
export PFT_PAGESIZES=base

To explain the result formatting a bit:

Higher numbers are better (so +% means a performance improvement vs. the v5.12
base kernel).

To be explicit about the output format: our internal testing framework parses
the output of pft, and generates this comparison table for the A/B test. In
pft's raw output, the columns are as follows:

- Gigabytes of RAM used
- Number of threads
- Number of cachelines
- User time
- Sys time
- Elapsed time
- Faults per sec per CPU
- Faults per sec

So the metrics in the test report are called "pft_faults_{cpu,sec}_#". These
aggregate the rows of output from the pft test:

- "cpu" means the result is faults per sec per CPU
- "sec' means the result is faults per sec
- The number corresponds to the "number of threads" column.

Note that pft runs the test multiple times for each threadcount. So, each row in
these reports is an aggregation of all of those runs for a particular thread
count.

To summarize the results:

On the Skylake machine the SPF patchset seems to provide a scalability
improvement across the board. With a small # of threads, the SPF patchset
outperforms the v5.12 base. As the number of threads grows, the gap widens,
with SPF beating the baseline by a larger and larger %.

On the Rome machine, however, SPF outperforms baseline with small numbers of
threads, but as the threadcount increases they *converge*. Eventually (between
79 and 110 threads), baseline wins out, and SPF actually gives us *less*
performance. I don't really have a clear explanation for this.



Okay, with all that said, here are the results:

Host: 2x Intel Skylake CPU, 28 cores / 56 threads each = 56 cores / 112 threads, 192 GiB RAM

[*] KERNELS
+-----------------+
| BASE KERNEL (A) |
+-----------------+
Kernel File: upstream-5-12.tar.xz

+-----------------+
| TEST KERNEL (B) |
+-----------------+
Kernel File: spf-5-12.tar.xz


[*] TAGS
LABEL | VALUE
-------------------------------+-----------------
kernel_version | 5.12.0-smp-DEV
kernel_version_major | 5
kernel_version_minor | 12
machine_platform_arch | X86_64
machine_platform_genus | skylake
machine_total_logical_cores | 112
machine_total_physical_cores | 56
machine_total_ram_gib | 192
test_name | pft
user | axelrasmussen

Note: values are delimited by colon (:)


[*] METRICS
LABEL | COUNT | MIN | MAX | MEAN | MEDIAN | STDDEV | DIRECTION
---------------------+-------+-----------------+------------------+------------------------+------------------------+--------------------+----------------
pft_faults_cpu_1 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 565493.841 | 597051.859 | 574980.2721 | 570812.4335 | 8581.000516619562 |
(B) b864a2166b52 | 30 | 564770.188 | 593162.85 | 582501.4053333333 | 587672.1510000001 | 10230.821799571824 |
| | -0.13% | -0.65% | +1.31% | +2.95% | +19.23% | <not defined>
pft_faults_cpu_4 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 511097.313 | 525778.068 | 519169.2665333333 | 518875.51300000004 | 3309.878510268415 |
(B) b864a2166b52 | 30 | 522341.465 | 534768.283 | 528535.7364666668 | 529354.5875 | 2784.0796478571897 |
| | +2.20% | +1.71% | +1.80% | +2.02% | -15.89% | <not defined>
pft_faults_cpu_7 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 444047.293 | 477042.807 | 458029.40570000006 | 457146.5575 | 7141.073092774091 |
(B) b864a2166b52 | 30 | 465624.007 | 501500.836 | 483791.1242666667 | 483134.7105 | 8256.428471156061 |
| | +4.86% | +5.13% | +5.62% | +5.68% | +15.62% | <not defined>
pft_faults_cpu_12 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 358661.222 | 434231.404 | 401799.73736666667 | 403513.576 | 18949.321891028267 |
(B) b864a2166b52 | 30 | 418233.425 | 479956.806 | 451832.4433666666 | 455295.054 | 17026.783733301498 |
| | +16.61% | +10.53% | +12.45% | +12.83% | -10.15% | <not defined>
pft_faults_cpu_21 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 198339.171 | 258696.047 | 228804.68719999996 | 230395.9155 | 14085.607109196615 |
(B) b864a2166b52 | 30 | 323446.812 | 384591.881 | 364176.7125666667 | 367052.1725 | 14414.08907027631 |
| | +63.08% | +48.67% | +59.16% | +59.31% | +2.33% | <not defined>
pft_faults_cpu_30 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 133771.268 | 162076.635 | 152097.81180000002 | 152233.40350000001 | 6269.9142812145465 |
(B) b864a2166b52 | 30 | 179859.892 | 270190.89 | 243455.58066666668 | 242803.359 | 18445.098731373928 |
| | +34.45% | +66.71% | +60.07% | +59.49% | +194.18% | <not defined>
pft_faults_cpu_48 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 82579.061 | 104388.661 | 91568.5356333333 | 90592.1755 | 5546.777510223763 |
(B) b864a2166b52 | 30 | 123542.341 | 171553.261 | 146579.92550000004 | 145360.7165 | 12601.231672937389 |
| | +49.60% | +64.34% | +60.08% | +60.46% | +127.18% | <not defined>
pft_faults_cpu_79 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 50894.891 | 59619.003 | 55129.4893 | 54661.8725 | 2458.8216885069032 |
(B) b864a2166b52 | 30 | 91927.332 | 120933.559 | 102996.35123333332 | 102243.0855 | 7074.060753404679 |
| | +80.62% | +102.84% | +86.83% | +87.05% | +187.70% | <not defined>
pft_faults_cpu_110 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 31258.548 | 41685.942 | 36224.262500000004 | 36592.803 | 2680.4950001193283 |
(B) b864a2166b52 | 30 | 74168.507 | 94616.572 | 82115.45999999999 | 81169.6525 | 4481.52286411527 |
| | +137.27% | +126.97% | +126.69% | +121.82% | +67.19% | <not defined>
pft_faults_cpu_112 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 31130.401 | 38908.904 | 35375.443100000004 | 35829.6035 | 2290.1349177377056 |
(B) b864a2166b52 | 30 | 73258.73 | 92613.135 | 81351.41166666667 | 80899.55249999999 | 4220.631828597486 |
| | +135.33% | +138.03% | +129.97% | +125.79% | +84.30% | <not defined>
pft_faults_sec_1 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 564433.945 | 595277.365 | 573625.4990333333 | 569636.4879999999 | 8382.325888292982 |
(B) b864a2166b52 | 30 | 563737.055 | 591576.036 | 580978.3862666666 | 586013.3670000001 | 9974.96387222383 |
| | -0.12% | -0.62% | +1.28% | +2.87% | +19.00% | <not defined>
pft_faults_sec_4 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 2.029701865e+06 | 2.089334274e+06 | 2.062801138933333e+06 | 2.059596098e+06 | 14075.442079268854 |
(B) b864a2166b52 | 30 | 2.068478884e+06 | 2.12696578e+06 | 2.0998217110999995e+06 | 2.1048569074999997e+06 | 14703.254715627592 |
| | +1.91% | +1.80% | +1.79% | +2.20% | +4.46% | <not defined>
pft_faults_sec_7 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 3.072865342e+06 | 3.302000943e+06 | 3.169453882633334e+06 | 3.167175195e+06 | 50915.049503127535 |
(B) b864a2166b52 | 30 | 3.123651261e+06 | 3.48522451e+06 | 3.3531759071666673e+06 | 3.351685131e+06 | 68012.57859696122 |
| | +1.65% | +5.55% | +5.80% | +5.83% | +33.58% | <not defined>
pft_faults_sec_12 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 4.215664312e+06 | 5.12743371e+06 | 4.703126739933333e+06 | 4.728185016e+06 | 240286.8825153738 |
(B) b864a2166b52 | 30 | 4.968388093e+06 | 5.69411546e+06 | 5.359866512633331e+06 | 5.413731953e+06 | 207070.2801760229 |
| | +17.86% | +11.05% | +13.96% | +14.50% | -13.82% | <not defined>
pft_faults_sec_21 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 3.947870087e+06 | 5.151356879e+06 | 4.5692916812333325e+06 | 4.575656394e+06 | 287930.17681688163 |
(B) b864a2166b52 | 30 | 6.631669322e+06 | 8.009437176e+06 | 7.552175649799999e+06 | 7.6175255265e+06 | 317684.09702681314 |
| | +67.98% | +55.48% | +65.28% | +66.48% | +10.33% | <not defined>
pft_faults_sec_30 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 3.789153013e+06 | 4.456429949e+06 | 4.1501704604999996e+06 | 4.139623102e+06 | 158198.37530642716 |
(B) b864a2166b52 | 30 | 5.321981541e+06 | 7.794407019e+06 | 6.994074038533334e+06 | 7.011582908e+06 | 503444.05851691077 |
| | +40.45% | +74.90% | +68.52% | +69.38% | +218.24% | <not defined>
pft_faults_sec_48 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 3.496402856e+06 | 4.644547516e+06 | 3.820860616866667e+06 | 3.7869018389999997e+06 | 211076.70378028657 |
(B) b864a2166b52 | 30 | 5.699219322e+06 | 7.20952334e+06 | 6.456858787866667e+06 | 6.41562651e+06 | 450973.69897164387 |
| | +63.00% | +55.23% | +68.99% | +69.42% | +113.65% | <not defined>
pft_faults_sec_79 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 3.479105285e+06 | 4.035012999e+06 | 3.7728950527e+06 | 3.730774097e+06 | 142408.19600171916 |
(B) b864a2166b52 | 30 | 6.740820473e+06 | 8.557811658e+06 | 7.495920335099999e+06 | 7.4525865505e+06 | 458003.8944174562 |
| | +93.75% | +112.09% | +98.68% | +99.76% | +221.61% | <not defined>
pft_faults_sec_110 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 3.223198632e+06 | 4.134061429e+06 | 3.660475087133333e+06 | 3.7196672410000004e+06 | 250346.40765518686 |
(B) b864a2166b52 | 30 | 7.766066553e+06 | 1.0045628577e+07 | 8.546153794033334e+06 | 8.481194604e+06 | 479152.6121881429 |
| | +140.94% | +143.00% | +133.47% | +128.01% | +91.40% | <not defined>
pft_faults_sec_112 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 3.208729058e+06 | 3.960855868e+06 | 3.627510960199999e+06 | 3.6704841215000004e+06 | 228364.6649597633 |
(B) b864a2166b52 | 30 | 7.697393128e+06 | 9.643206206e+06 | 8.579052920500001e+06 | 8.4982216315e+06 | 429771.9693968675 |
| | +139.89% | +143.46% | +136.50% | +131.53% | +88.20% | <not defined>
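
For reference, the percentage rows in these tables are the relative
change of each test kernel (B) statistic against the corresponding
base kernel (A) statistic. A minimal Python sketch of that arithmetic
(illustrative only, not part of the report tooling):

# Reproduce the +x.xx% rows: relative delta of kernel B vs kernel A
# for each summary statistic over the 30 samples per data point.
from statistics import mean, median, stdev

def pct_delta(a, b):
    return (b - a) / a * 100.0

def summarize(samples):
    return {"min": min(samples), "max": max(samples),
            "mean": mean(samples), "median": median(samples),
            "stddev": stdev(samples)}

def compare(samples_a, samples_b):
    sa, sb = summarize(samples_a), summarize(samples_b)
    return {k: pct_delta(sa[k], sb[k]) for k in sa}

# For example, feeding the 30 pft_faults_cpu_1 samples for each kernel
# into compare() gives approximately min -0.13, max -0.65, mean +1.31,
# median +2.95, stddev +19.23 (percent), matching the first row above.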


============================================================================================================================================================


Host: 2x AMD Rome CPU, 64 cores / 128 threads each = 128 cores / 256 threads, 1 TiB RAM

[*] KERNELS
+-----------------+
| BASE KERNEL (A) |
+-----------------+
Kernel File: upstream-5-12.tar.xz

+-----------------+
| TEST KERNEL (B) |
+-----------------+
Kernel File: spf-5-12.tar.xz


[*] TAGS
LABEL | VALUE
-------------------------------+-----------------
kernel_version | 5.12.0-smp-DEV
kernel_version_major | 5
kernel_version_minor | 12
machine_platform_arch | X86_64
machine_platform_genus | rome
machine_total_logical_cores | 256
machine_total_physical_cores | 128
machine_total_ram_gib | 1024
test_name | pft
user | axelrasmussen



[*] METRICS
LABEL | COUNT | MIN | MAX | MEAN | MEDIAN | STDDEV | DIRECTION
---------------------+-------+-----------------+------------------+------------------------+------------------------+------------------------+----------------
pft_faults_cpu_1 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 1.322299227e+06 | 1.350579319e+06 | 1.3360688650999998e+06 | 1.332799986e+06 | 8036.2199341743535 |
(B) b864a2166b52 | 30 | 1.315533425e+06 | 1.335456188e+06 | 1.3275440968000002e+06 | 1.3296407555e+06 | 6298.976910285278 |
| | -0.51% | -1.12% | -0.64% | -0.24% | -21.62% | <not defined>
pft_faults_cpu_4 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 599753.078 | 835196.831 | 785973.1949666667 | 818552.9575 | 71177.98344956583 |
(B) b864a2166b52 | 30 | 1.033135922e+06 | 1.069290557e+06 | 1.0482872088333336e+06 | 1.0468814775e+06 | 10779.7490408502 |
| | +72.26% | +28.03% | +33.37% | +27.89% | -84.86% | <not defined>
pft_faults_cpu_7 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 347767.594 | 582284.169 | 508009.3081 | 532618.8404999999 | 76179.64543982298 |
(B) b864a2166b52 | 30 | 876652.745 | 963373.111 | 928708.5755333335 | 933480.6370000001 | 22305.420182136346 |
| | +152.08% | +65.45% | +82.81% | +75.26% | -70.72% | <not defined>
pft_faults_cpu_12 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 220799.208 | 305899.445 | 280923.0221 | 286852.4555 | 19497.416638332666 |
(B) b864a2166b52 | 30 | 339602.001 | 531225.324 | 468710.86046666675 | 505322.6055 | 63089.14230805185 |
| | +53.81% | +73.66% | +66.85% | +76.16% | +223.58% | <not defined>
pft_faults_cpu_21 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 127237.472 | 213103.883 | 171006.9222 | 186122.27850000001 | 28325.93727251356 |
(B) b864a2166b52 | 30 | 171981.459 | 239555.654 | 208386.11153333331 | 219433.95549999998 | 21608.36313357416 |
| | +35.17% | +12.41% | +21.86% | +17.90% | -23.72% | <not defined>
pft_faults_cpu_30 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 98673.026 | 217605.193 | 165832.00136666666 | 181021.78100000002 | 37277.826073472075 |
(B) b864a2166b52 | 30 | 124005.714 | 221043.49 | 186948.93826666664 | 203430.498 | 28825.232853924826 |
| | +25.67% | +1.58% | +12.73% | +12.38% | -22.67% | <not defined>
pft_faults_cpu_48 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 88635.374 | 201202.155 | 164099.58250000002 | 178427.289 | 31615.465775759487 |
(B) b864a2166b52 | 30 | 103850.066 | 198347.633 | 165546.73610000004 | 189675.25900000002 | 33104.97333224132 |
| | +17.17% | -1.42% | +0.88% | +6.30% | +4.71% | <not defined>
pft_faults_cpu_79 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 81102.261 | 89112.314 | 83463.4302 | 83469.707 | 1869.7214449321116 |
(B) b864a2166b52 | 30 | 83875.566 | 94160.682 | 91835.6409 | 92396.381 | 2578.3539781233358 |
| | +3.42% | +5.67% | +10.03% | +10.69% | +37.90% | <not defined>
pft_faults_cpu_110 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 77305.793 | 83501.445 | 81548.62710000001 | 81830.84099999999 | 1372.4842285878892 |
(B) b864a2166b52 | 30 | 76212.241 | 78931.508 | 78123.50329999998 | 78462.183 | 762.927436545318 |
| | -1.41% | -5.47% | -4.20% | -4.12% | -44.41% | <not defined>
pft_faults_cpu_128 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 75343.922 | 82741.042 | 80366.33486666666 | 80530.75899999999 | 1430.3656765353558 |
(B) b864a2166b52 | 30 | 72095.698 | 74123.908 | 73297.6984 | 73425.3245 | 571.842103541272 |
| | -4.31% | -10.41% | -8.80% | -8.82% | -60.02% | <not defined>
pft_faults_sec_1 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 1.317050628e+06 | 1.345243569e+06 | 1.3311298687999998e+06 | 1.3281300295000002e+06 | 8035.465831526684 |
(B) b864a2166b52 | 30 | 1.311461655e+06 | 1.330935614e+06 | 1.3231212217333335e+06 | 1.3250745915e+06 | 6193.445975030056 |
| | -0.42% | -1.06% | -0.60% | -0.23% | -22.92% | <not defined>
pft_faults_sec_4 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 2.389687474e+06 | 3.327314892e+06 | 3.1228356446666666e+06 | 3.2612643784999996e+06 | 289852.17947539146 |
(B) b864a2166b52 | 30 | 4.102518552e+06 | 4.250608035e+06 | 4.1639243518333333e+06 | 4.157235491e+06 | 41878.60201517858 |
| | +71.68% | +27.75% | +33.34% | +27.47% | -85.55% | <not defined>
pft_faults_sec_7 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 2.414012583e+06 | 4.032415905e+06 | 3.5188523367000003e+06 | 3.6881281375e+06 | 524652.8315124605 |
(B) b864a2166b52 | 30 | 5.959160786e+06 | 6.580231542e+06 | 6.329055505566668e+06 | 6.3639131035e+06 | 147685.48556391377 |
| | +146.86% | +63.18% | +79.86% | +72.55% | -71.85% | <not defined>
pft_faults_sec_12 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 2.636643985e+06 | 3.648873092e+06 | 3.352667637600001e+06 | 3.4238721715e+06 | 231685.81558700278 |
(B) b864a2166b52 | 30 | 3.95869994e+06 | 6.222621801e+06 | 5.5047526395666655e+06 | 5.925404714e+06 | 728923.8441016441 |
| | +50.14% | +70.54% | +64.19% | +73.06% | +214.62% | <not defined>
pft_faults_sec_21 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 2.659931432e+06 | 4.417172596e+06 | 3.5592916604333334e+06 | 3.8707699085e+06 | 581152.2296007784 |
(B) b864a2166b52 | 30 | 3.550370032e+06 | 4.92603416e+06 | 4.2723118866e+06 | 4.490487737e+06 | 438423.4518313975 |
| | +33.48% | +11.52% | +20.03% | +16.01% | -24.56% | <not defined>
pft_faults_sec_30 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 2.946083199e+06 | 6.423331347e+06 | 4.914002587066667e+06 | 5.3587292535e+06 | 1.0898160729269844e+06 |
(B) b864a2166b52 | 30 | 3.64123808e+06 | 6.532025228e+06 | 5.492737309800002e+06 | 5.97391926e+06 | 843563.7696236438 |
| | +23.60% | +1.69% | +11.78% | +11.48% | -22.60% | <not defined>
pft_faults_sec_48 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 4.214980049e+06 | 9.528581084e+06 | 7.7757277348e+06 | 8.448662988499999e+06 | 1.4883900190462691e+06 |
(B) b864a2166b52 | 30 | 4.84579977e+06 | 9.330363688e+06 | 7.751113585633334e+06 | 8.8988808295e+06 | 1.5801553180324829e+06 |
| | +14.97% | -2.08% | -0.32% | +5.33% | +6.17% | <not defined>
pft_faults_sec_79 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 6.348996215e+06 | 6.985977576e+06 | 6.533078342966668e+06 | 6.5315793925e+06 | 151519.46530198734 |
(B) b864a2166b52 | 30 | 6.484637033e+06 | 7.169593031e+06 | 6.984747101799999e+06 | 7.004222881e+06 | 164329.43630222118 |
| | +2.14% | +2.63% | +6.91% | +7.24% | +8.45% | <not defined>
pft_faults_sec_110 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 7.880294448e+06 | 9.09788171e+06 | 8.874888452533334e+06 | 8.9250222115e+06 | 218417.912935066 |
(B) b864a2166b52 | 30 | 7.887984615e+06 | 8.408191775e+06 | 8.3027231049666675e+06 | 8.3457714105e+06 | 104353.66169193448 |
| | +0.10% | -7.58% | -6.45% | -6.49% | -52.22% | <not defined>
pft_faults_sec_128 | | | | | | |
(A) 9f4ad9e425a1 | 30 | 8.659779012e+06 | 1.0498137566e+07 | 1.01219317397e+07 | 1.0153906594999999e+07 | 329224.7320975063 |
(B) b864a2166b52 | 30 | 8.906613971e+06 | 9.256002835e+06 | 9.113201005133333e+06 | 9.114904465e+06 | 96222.63922675727 |
| | +2.85% | -11.83% | -9.97% | -10.23% | -70.77% | <not defined>


============================================================================================================================================================


CPU cache info (getconf -a | grep CACHE):

Rome:
LEVEL1_ICACHE_SIZE 32768
LEVEL1_ICACHE_ASSOC 8
LEVEL1_ICACHE_LINESIZE 64
LEVEL1_DCACHE_SIZE 32768
LEVEL1_DCACHE_ASSOC 8
LEVEL1_DCACHE_LINESIZE 64
LEVEL2_CACHE_SIZE 524288
LEVEL2_CACHE_ASSOC 8
LEVEL2_CACHE_LINESIZE 64
LEVEL3_CACHE_SIZE 268435456
LEVEL3_CACHE_ASSOC 0
LEVEL3_CACHE_LINESIZE 64
LEVEL4_CACHE_SIZE 0
LEVEL4_CACHE_ASSOC 0
LEVEL4_CACHE_LINESIZE 0

Skylake:
LEVEL1_ICACHE_SIZE 32768
LEVEL1_ICACHE_ASSOC 8
LEVEL1_ICACHE_LINESIZE 64
LEVEL1_DCACHE_SIZE 32768
LEVEL1_DCACHE_ASSOC 8
LEVEL1_DCACHE_LINESIZE 64
LEVEL2_CACHE_SIZE 1048576
LEVEL2_CACHE_ASSOC 16
LEVEL2_CACHE_LINESIZE 64
LEVEL3_CACHE_SIZE 40370176
LEVEL3_CACHE_ASSOC 11
LEVEL3_CACHE_LINESIZE 64
LEVEL4_CACHE_SIZE 0
LEVEL4_CACHE_ASSOC 0
LEVEL4_CACHE_LINESIZE 0
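
The getconf values above are raw byte counts; as a quick illustrative
conversion (not part of the original report):

# Convert the getconf cache sizes listed above to KiB/MiB.
def human(nbytes):
    if nbytes >= 1 << 20:
        return "%g MiB" % (nbytes / (1 << 20))
    return "%g KiB" % (nbytes / (1 << 10))

# Rome:    L2 524288  -> 512 KiB, L3 268435456 -> 256 MiB
# Skylake: L2 1048576 -> 1 MiB,   L3 40370176  -> 38.5 MiB
for size in (524288, 268435456, 1048576, 40370176):
    print(size, "=", human(size))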

2021-07-21 11:48:46

by Vijayanand Jitta

[permalink] [raw]
Subject: Re: [RFC PATCH 00/37] Speculative page faults

From: Vijayanand Jitta <[email protected]>

We have tried out the SPF patch series on our code base
and have seen significant improvements in app launch
latency and in the ebizzy test results. Please find the
results below.

App launch latency (in ms):

Apps w/o spf with spf % improvement
--------------------------------------------------------
Changba 1531 1252 18.2
Taobao 1813 1506 16.9
QQlive 1966 1775 9.7


Ebizzy test (ebizzy_64 -mRTt 8):

ebizzy_64 -mRTt 8 w/o spf with spf % improvement
--------------------------------------------------------------
Avg_records_count 2455 2717 10.7
Avg_real_time 10 10 -
Avg_user_time 0.74 0.66 10.8
Avg_sys_time 12.2 11.6 4.9
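
For clarity, the % improvement figures are computed relative to the
w/o spf column: launch latency and the time metrics improve by
decreasing, while the records count improves by increasing. A small
illustrative Python check of that arithmetic (not part of the test
output):

# Percentage improvement as reported in the tables above.
def pct_improvement(without_spf, with_spf, higher_is_better=False):
    if higher_is_better:
        return (with_spf - without_spf) / without_spf * 100.0
    return (without_spf - with_spf) / without_spf * 100.0

print(round(pct_improvement(1531, 1252), 1))                         # Changba launch: 18.2
print(round(pct_improvement(2455, 2717, higher_is_better=True), 1))  # records count:  10.7
print(round(pct_improvement(0.74, 0.66), 1))                         # user time:      10.8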

So, getting this series mainlined would be of great help.

Thanks,
Vijay
--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation