This patchset is my take on speculative page faults (spf).
It builds on ideas previously proposed by Laurent Dufour,
Peter Zijlstra and others. While Laurent's previous proposal
was rejected around the time of LSF/MM 2019, I am hoping we can revisit
this now based on what I think is a simpler and more bisectable approach,
much improved scaling numbers in the anonymous vma case, and the Android
use case that has since emerged. I will expand on these points towards
the end of this message.
The patch series applies on top of linux v5.17-rc1;
a git tree is also available:
git fetch https://github.com/lespinasse/linux.git v5.17-rc1-spf-anon
I would like these patches to be considered for inclusion into v5.18.
Several Android vendors have merged Laurent Dufour's previous SPF work into
their kernel trees in order to improve application startup performance;
they want to converge on an upstream-accepted solution, and have reported
good numbers with previous versions of this patchset. Also, there is broader
interest in reducing mmap lock dependencies in critical MM paths,
and I think this patchset would be a good first step in that direction.
This patchset follows the same overall structure as the v1 proposal,
with the following differences:
- Commit 12 (mm: separate mmap locked assertion from find_vma) is new.
- The mmu notifier lock is new; this fixes a race in v1 patchset
between speculative COW faults and registering new MMU notifiers.
- Speculative handling of swap-cache pages has been removed.
- Commit 30 is new; this fixes build issues that showed up in some configs.
In principle it would also be possible to extend this work to handle
file-mapped vmas; I have pending work on such patches too, but they are
not mature enough to be submitted for inclusion at this point.
Patchset summary:
Classical page fault processing takes the mmap read lock in order to
prevent races with mmap writers. In contrast, speculative fault
processing does not take the mmap read lock, and instead verifies,
when the results of the page fault are about to get committed and
become visible to other threads, that no mmap writers have been
running concurrently with the page fault. If the check fails,
speculative updates do not get committed and the fault is retried
in the usual, non-speculative way (with the mmap read lock held).
The concurrency check is implemented using a per-mm mmap sequence count.
The counter is incremented at the beginning and end of each mmap write
operation. If the counter is initially observed to have an even value,
and has the same value later on, the observer can deduce that no mmap
writers have been running concurrently with it between those two times.
This is similar to a seqlock, except that readers never spin on the
counter value (they would instead revert to taking the mmap read lock),
and writers are allowed to sleep. One benefit of this approach is that
it requires no writer side changes, just some hooks in the mmap write
lock APIs that writers already use.
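As a rough illustration of what those hooks amount to (a minimal sketch only;
the helper names below are illustrative, but the per-mm mmap_seq field and
the even/odd convention match what commit 10 actually adds):

        /* Sketch - called from the mmap write lock / unlock paths. */
        static inline void mmap_seq_write_begin(struct mm_struct *mm)
        {
                mm->mmap_seq++;   /* counter becomes odd: writer active */
                smp_wmb();        /* order the increment before the mmap changes */
        }

        static inline void mmap_seq_write_end(struct mm_struct *mm)
        {
                smp_wmb();        /* order the mmap changes before the increment */
                mm->mmap_seq++;   /* counter becomes even again: writer done */
        }

A speculative reader samples the counter, bails out if the value is odd, does
its work, and finally re-checks that the counter has not changed.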
The first step of a speculative page fault is to look up the vma and
read its contents (currently by making a copy of the vma, though in
principle it would be sufficient to only read the vma attributes that
are used in page faults). The mmap sequence count is used to verify
that there were no mmap writers concurrent to the lookup and copy steps.
Note that walking rbtrees while there may potentially be concurrent
writers is not an entirely new idea in Linux, as latched rbtrees
are already doing this. This is safe as long as the lookup is
followed by a sequence check to verify that concurrency did not
actually occur (and abort the speculative fault if it did).
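Condensed from the arch/x86 fault handler changes later in the series (with
the abort paths and vm_event accounting trimmed), the lookup step looks
roughly like this:

        seq = mmap_seq_read_start(mm);
        if (seq & 1)
                goto spf_abort;                 /* mmap writer in progress */
        rcu_read_lock();
        vma = __find_vma(mm, address);          /* lockless lookup (commit 12) */
        if (!vma || vma->vm_start > address || !vma_is_anonymous(vma)) {
                rcu_read_unlock();
                goto spf_abort;
        }
        pvma = *vma;                            /* on-stack copy of the vma */
        rcu_read_unlock();
        if (!mmap_seq_read_check(mm, seq))      /* writer ran during the copy? */
                goto spf_abort;
        vma = &pvma;
        fault = do_handle_mm_fault(vma, address,
                                   flags | FAULT_FLAG_SPECULATIVE, seq, regs);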
The next step is to walk down the existing page table tree to find the
current pte entry. This is done with interrupts disabled to avoid
races with munmap(). Again, not an entirely new idea, as this repeats
a pattern already present in fast GUP. Similar precautions are also
taken when taking the page table lock.
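A condensed view of that walk, as implemented in __handle_mm_fault()
(commit 16), with the abort paths trimmed:

        speculative_page_walk_begin();  /* rcu_read_lock() or local_irq_disable() */
        pgd = pgd_offset(mm, address);
        pgdval = READ_ONCE(*pgd);
        if (pgd_none(pgdval) || unlikely(pgd_bad(pgdval)))
                goto spf_fail;
        /*
         * The p4d, pud and pmd levels are read the same way, with READ_ONCE()
         * and without ever allocating page tables; huge or devmap entries
         * also cause the speculative attempt to be aborted.
         */
        vmf.pmd = pmd_offset(vmf.pud, address);
        vmf.orig_pmd = READ_ONCE(*vmf.pmd);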
Breaking COW on an existing mapping may require firing MMU notifiers.
Some care is required to avoid racing with registering new notifiers.
This patchset adds a new per-cpu rwsem to handle this situation.
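The resulting pattern in the COW path, condensed from the wp_page_copy()
changes later in the series, is:

        if ((vmf->flags & FAULT_FLAG_SPECULATIVE) &&
            !mmu_notifier_trylock(mm))
                return VM_FAULT_RETRY;  /* registration in progress, fall back */
        /* ... fire the MMU_NOTIFY_CLEAR range notifiers, install the page ... */
        if (vmf->flags & FAULT_FLAG_SPECULATIVE)
                mmu_notifier_unlock(mm);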
Commits 1 to 5 are preparatory cleanups.
Commits 6 and 7 introduce CONFIG_SPECULATIVE_PAGE_FAULT and let us
enable it on x86 so we can test the new code as it gets introduced.
Commits 8 and 9 extend handle_mm_fault() so it can be used for
speculative faults; initially these always abort with VM_FAULT_RETRY.
Commits 10 to 27 progressively implement the speculative handling of
page faults. Importantly, they are structured to be bisectable:
the new code gets enabled every few commits.
- Commit 10 adds the mmap sequence count that will be used for detecting
when writers have been running concurrently with an spf attempt
(in which case the attempt will be aborted);
- Commit 11 adds RCU safe vma freeing;
- Commit 12 adds a version of find_vma that doesn't check for mmap locking;
- Commit 13 does a lockless VMA lookup and starts the spf handling attempt;
- Commit 14 introduces an API for preventing page table reclamation
(using RCU or disabling interrupts depending on build config options);
- (Commit 15 is a small refactor preparing for the next commit);
- Commit 16 walks down the existing page tables, carefully avoiding
races with potential writers (munmap in particular);
- Commit 17 introduces pte_map_lock() and pte_spinlock(), which attempt
to (optionally map and) lock an existing page table when it's time to
commit page fault results to it.
- Commits 18 to 21 implement SPF for the simplest cases
(do_anonymous_page and do_numa_page). This mostly comes down to
using the pte_map_lock() and pte_spinlock() APIs where needed,
and making sure to abort speculation in unsupported cases
(mostly anon_vma allocation and userfaultfd).
- Commits 22 to 25 add a new mmu_notifier_lock;
- Commits 26 and 27 implement some additional SPF cases, using the new
mmu_notifier_lock for the COW cases.
Commits 28 and 29 disable speculative handling for single threaded
userspace. This is for (minor) performance tuning and is pushed
towards the end of the series to make it easier to exercise the spf
paths as they are introduced.
Commits 30 and 31 add some extra statistics.
Commits 32 to 35 add spf support on the arm64 and powerpc architectures.
Michel Lespinasse (34):
mm: export dump_mm
mmap locking API: mmap_lock_is_contended returns a bool
mmap locking API: name the return values
do_anonymous_page: use update_mmu_tlb()
do_anonymous_page: reduce code duplication
mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT
x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
mm: add FAULT_FLAG_SPECULATIVE flag
mm: add do_handle_mm_fault()
mm: add per-mm mmap sequence counter for speculative page fault handling.
mm: rcu safe vma freeing
mm: separate mmap locked assertion from find_vma
x86/mm: attempt speculative mm faults first
mm: add speculative_page_walk_begin() and speculative_page_walk_end()
mm: refactor __handle_mm_fault() / handle_pte_fault()
mm: implement speculative handling in __handle_mm_fault().
mm: add pte_map_lock() and pte_spinlock()
mm: implement speculative handling in do_anonymous_page()
mm: enable speculative fault handling through do_anonymous_page()
mm: implement speculative handling in do_numa_page()
mm: enable speculative fault handling in do_numa_page()
mm: add mmu_notifier_lock
mm: write lock mmu_notifier_lock when registering mmu notifiers
mm: add mmu_notifier_trylock() and mmu_notifier_unlock()
mm: implement speculative handling in wp_page_copy()
mm: implement and enable speculative fault handling in handle_pte_fault()
mm: disable speculative faults for single threaded user space
mm: disable rcu safe vma freeing for single threaded user space
mm: create new include/linux/vm_event.h header file
mm: anon spf statistics
arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
arm64/mm: attempt speculative mm faults first
powerpc/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
powerpc/mm: attempt speculative mm faults first
Suren Baghdasaryan (1):
percpu-rwsem: enable percpu_sem destruction in atomic context
arch/arm64/Kconfig | 1 +
arch/arm64/mm/fault.c | 62 ++++
arch/powerpc/Kconfig | 1 +
arch/powerpc/mm/fault.c | 64 ++++
arch/x86/Kconfig | 1 +
arch/x86/mm/fault.c | 63 ++++
drivers/gpu/drm/i915/i915_gpu_error.c | 4 +-
include/linux/mm.h | 68 +++-
include/linux/mm_types.h | 33 +-
include/linux/mmap_lock.h | 109 ++++--
include/linux/mmu_notifier.h | 52 ++-
include/linux/percpu-rwsem.h | 13 +-
include/linux/vm_event.h | 111 ++++++
include/linux/vm_event_item.h | 25 ++
include/linux/vmstat.h | 95 +-----
kernel/fork.c | 18 +-
kernel/locking/percpu-rwsem.c | 32 ++
mm/Kconfig | 22 ++
mm/Kconfig.debug | 7 +
mm/debug.c | 1 +
mm/memory.c | 474 +++++++++++++++++++-------
mm/mmap.c | 13 +-
mm/vmstat.c | 25 ++
23 files changed, 1040 insertions(+), 254 deletions(-)
create mode 100644 include/linux/vm_event.h
--
2.20.1
Speculative page faults will use these to protect against races with
page table reclamation.
This could always be handled by disabling local IRQs as the fast GUP
code does; however, speculative page faults do not need to protect
against races with THP page splitting, so a weaker rcu read lock is
sufficient in the MMU_GATHER_RCU_TABLE_FREE case.
Signed-off-by: Michel Lespinasse <[email protected]>
---
mm/memory.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c
index aa24cd8c06e9..663952d14bad 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2725,6 +2725,28 @@ int apply_to_existing_page_range(struct mm_struct *mm, unsigned long addr,
}
EXPORT_SYMBOL_GPL(apply_to_existing_page_range);
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+
+/*
+ * speculative_page_walk_begin() ... speculative_page_walk_end() protects
+ * against races with page table reclamation.
+ *
+ * This is similar to what fast GUP does, but fast GUP also needs to
+ * protect against races with THP page splitting, so it always needs
+ * to disable interrupts.
+ * Speculative page faults only need to protect against page table reclamation,
+ * so rcu_read_lock() is sufficient in the MMU_GATHER_RCU_TABLE_FREE case.
+ */
+#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
+#define speculative_page_walk_begin() rcu_read_lock()
+#define speculative_page_walk_end() rcu_read_unlock()
+#else
+#define speculative_page_walk_begin() local_irq_disable()
+#define speculative_page_walk_end() local_irq_enable()
+#endif
+
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
/*
* handle_pte_fault chooses page fault handler according to an entry which was
* read non-atomically. Before making any commitment, on those architectures
--
2.20.1
Change mmap_lock_is_contended to return a bool value, rather than an
int which the callers are then supposed to interpret as a bool. This
is to ensure consistency with other mmap lock API functions (such as
the trylock functions).
Signed-off-by: Michel Lespinasse <[email protected]>
---
include/linux/mmap_lock.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 96e113e23d04..db9785e11274 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -162,9 +162,9 @@ static inline void mmap_assert_write_locked(struct mm_struct *mm)
VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
}
-static inline int mmap_lock_is_contended(struct mm_struct *mm)
+static inline bool mmap_lock_is_contended(struct mm_struct *mm)
{
- return rwsem_is_contended(&mm->mmap_lock);
+ return rwsem_is_contended(&mm->mmap_lock) != 0;
}
#endif /* _LINUX_MMAP_LOCK_H */
--
2.20.1
Add a new do_handle_mm_fault function, which extends the existing
handle_mm_fault() API by adding an mmap sequence count, to be used
in the FAULT_FLAG_SPECULATIVE case.
In the initial implementation, FAULT_FLAG_SPECULATIVE always fails
(by returning VM_FAULT_RETRY).
The existing handle_mm_fault() API is kept as a wrapper around
do_handle_mm_fault() so that we do not have to immediately update
every handle_mm_fault() call site.
Signed-off-by: Michel Lespinasse <[email protected]>
---
include/linux/mm.h | 12 +++++++++---
mm/memory.c | 10 +++++++---
2 files changed, 16 insertions(+), 6 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7f7aa3f0a396..4600dbb98cef 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1851,9 +1851,15 @@ int generic_error_remove_page(struct address_space *mapping, struct page *page);
int invalidate_inode_page(struct page *page);
#ifdef CONFIG_MMU
-extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
- unsigned long address, unsigned int flags,
- struct pt_regs *regs);
+extern vm_fault_t do_handle_mm_fault(struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags,
+ unsigned long seq, struct pt_regs *regs);
+static inline vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags,
+ struct pt_regs *regs)
+{
+ return do_handle_mm_fault(vma, address, flags, 0, regs);
+}
extern int fixup_user_fault(struct mm_struct *mm,
unsigned long address, unsigned int fault_flags,
bool *unlocked);
diff --git a/mm/memory.c b/mm/memory.c
index f83e06b1dafb..aa24cd8c06e9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4761,11 +4761,15 @@ static inline void mm_account_fault(struct pt_regs *regs,
* The mmap_lock may have been released depending on flags and our
* return value. See filemap_fault() and __folio_lock_or_retry().
*/
-vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
- unsigned int flags, struct pt_regs *regs)
+vm_fault_t do_handle_mm_fault(struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags,
+ unsigned long seq, struct pt_regs *regs)
{
vm_fault_t ret;
+ if (flags & FAULT_FLAG_SPECULATIVE)
+ return VM_FAULT_RETRY;
+
__set_current_state(TASK_RUNNING);
count_vm_event(PGFAULT);
@@ -4807,7 +4811,7 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
return ret;
}
-EXPORT_SYMBOL_GPL(handle_mm_fault);
+EXPORT_SYMBOL_GPL(do_handle_mm_fault);
#ifndef __PAGETABLE_P4D_FOLDED
/*
--
2.20.1
Attempt speculative mm fault handling first, and fall back to the
existing (non-speculative) code if that fails.
The speculative handling closely mirrors the non-speculative logic.
This includes some x86-specific bits such as the access_error() call.
This is why we chose to implement the speculative handling in arch/x86
rather than in common code.
The vma is first looked up and copied, under protection of the rcu
read lock. The mmap lock sequence count is used to verify the
integrity of the copied vma, and passed to do_handle_mm_fault() to
allow checking against races with mmap writers when finalizing the fault.
Signed-off-by: Michel Lespinasse <[email protected]>
---
arch/x86/mm/fault.c | 44 +++++++++++++++++++++++++++++++++++
include/linux/mm_types.h | 5 ++++
include/linux/vm_event_item.h | 4 ++++
mm/vmstat.c | 4 ++++
4 files changed, 57 insertions(+)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index d0074c6ed31a..99b0a358154e 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1226,6 +1226,10 @@ void do_user_addr_fault(struct pt_regs *regs,
struct mm_struct *mm;
vm_fault_t fault;
unsigned int flags = FAULT_FLAG_DEFAULT;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ struct vm_area_struct pvma;
+ unsigned long seq;
+#endif
tsk = current;
mm = tsk->mm;
@@ -1323,6 +1327,43 @@ void do_user_addr_fault(struct pt_regs *regs,
}
#endif
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ count_vm_event(SPF_ATTEMPT);
+ seq = mmap_seq_read_start(mm);
+ if (seq & 1)
+ goto spf_abort;
+ rcu_read_lock();
+ vma = __find_vma(mm, address);
+ if (!vma || vma->vm_start > address) {
+ rcu_read_unlock();
+ goto spf_abort;
+ }
+ pvma = *vma;
+ rcu_read_unlock();
+ if (!mmap_seq_read_check(mm, seq))
+ goto spf_abort;
+ vma = &pvma;
+ if (unlikely(access_error(error_code, vma)))
+ goto spf_abort;
+ fault = do_handle_mm_fault(vma, address,
+ flags | FAULT_FLAG_SPECULATIVE, seq, regs);
+
+ if (!(fault & VM_FAULT_RETRY))
+ goto done;
+
+ /* Quick path to respond to signals */
+ if (fault_signal_pending(fault, regs)) {
+ if (!user_mode(regs))
+ kernelmode_fixup_or_oops(regs, error_code, address,
+ SIGBUS, BUS_ADRERR,
+ ARCH_DEFAULT_PKEY);
+ return;
+ }
+
+spf_abort:
+ count_vm_event(SPF_ABORT);
+#endif
+
/*
* Kernel-mode access to the user address space should only occur
* on well-defined single instructions listed in the exception
@@ -1419,6 +1460,9 @@ void do_user_addr_fault(struct pt_regs *regs,
}
mmap_read_unlock(mm);
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+done:
+#endif
if (likely(!(fault & VM_FAULT_ERROR)))
return;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b6678578a729..305f05d2a4bc 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -370,6 +370,11 @@ struct anon_vma_name {
* per VM-area/task. A VM area is any part of the process virtual memory
* space that has a special rule for the page-fault handlers (ie a shared
* library, the executable area etc).
+ *
+ * Note that speculative page faults make an on-stack copy of the VMA,
+ * so the structure size matters.
+ * (TODO - it would be preferable to copy only the required vma attributes
+ * rather than the entire vma).
*/
struct vm_area_struct {
/* The first cache line has the info for VMA tree walking. */
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 7b2363388bfa..f00b3e36ff39 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -133,6 +133,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
#ifdef CONFIG_X86
DIRECT_MAP_LEVEL2_SPLIT,
DIRECT_MAP_LEVEL3_SPLIT,
+#endif
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ SPF_ATTEMPT,
+ SPF_ABORT,
#endif
NR_VM_EVENT_ITEMS
};
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4057372745d0..dbb0160e5558 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1390,6 +1390,10 @@ const char * const vmstat_text[] = {
"direct_map_level2_splits",
"direct_map_level3_splits",
#endif
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ "spf_attempt",
+ "spf_abort",
+#endif
#endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
};
#endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
--
2.20.1
Define the new FAULT_FLAG_SPECULATIVE flag, which indicates when we are
attempting speculative fault handling (without holding the mmap lock).
Signed-off-by: Michel Lespinasse <[email protected]>
---
include/linux/mm.h | 3 ++-
include/linux/mm_types.h | 2 ++
2 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e1a84b1e6787..7f7aa3f0a396 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -461,7 +461,8 @@ static inline bool fault_flag_allow_retry_first(enum fault_flag flags)
{ FAULT_FLAG_USER, "USER" }, \
{ FAULT_FLAG_REMOTE, "REMOTE" }, \
{ FAULT_FLAG_INSTRUCTION, "INSTRUCTION" }, \
- { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }
+ { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }, \
+ { FAULT_FLAG_SPECULATIVE, "SPECULATIVE" }
/*
* vm_fault is filled by the pagefault handler and passed to the vma's
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 9db36dc5d4cf..0ae3bf854aad 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -790,6 +790,7 @@ typedef struct {
* @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
* @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
* @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ * @FAULT_FLAG_SPECULATIVE: The fault is handled without holding the mmap lock.
*
* About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
* whether we would allow page faults to retry by specifying these two
@@ -821,6 +822,7 @@ enum fault_flag {
FAULT_FLAG_REMOTE = 1 << 7,
FAULT_FLAG_INSTRUCTION = 1 << 8,
FAULT_FLAG_INTERRUPTIBLE = 1 << 9,
+ FAULT_FLAG_SPECULATIVE = 1 << 10,
};
#endif /* _LINUX_MM_TYPES_H */
--
2.20.1
update_mmu_tlb() can be used instead of update_mmu_cache() when the
page fault handler detects that it lost the race to another page fault.
It looks like this one call was missed in
https://patchwork.kernel.org/project/linux-mips/patch/[email protected]
after Andrew asked to replace all update_mmu_cache() calls with an alias
in the previous version of this patch here:
https://patchwork.kernel.org/project/linux-mips/patch/[email protected]/#23374625
Signed-off-by: Michel Lespinasse <[email protected]>
---
mm/memory.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/memory.c b/mm/memory.c
index c125c4969913..cd9432df3a27 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3799,7 +3799,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
if (!pte_none(*vmf->pte)) {
- update_mmu_cache(vma, vmf->address, vmf->pte);
+ update_mmu_tlb(vma, vmf->address, vmf->pte);
goto release;
}
--
2.20.1
Change do_anonymous_page() to handle the speculative case.
This involves aborting speculative faults if they have to allocate a new
anon_vma, and using pte_map_lock() instead of pte_offset_map_lock()
to complete the page fault.
Signed-off-by: Michel Lespinasse <[email protected]>
---
mm/memory.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 1ce837e47395..8d036140634d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3846,8 +3846,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
vma->vm_page_prot));
} else {
/* Allocate our own private page. */
- if (unlikely(anon_vma_prepare(vma)))
- goto oom;
+ if (unlikely(!vma->anon_vma)) {
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+ return VM_FAULT_RETRY;
+ if (__anon_vma_prepare(vma))
+ goto oom;
+ }
page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
if (!page)
goto oom;
@@ -3869,8 +3873,10 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
entry = pte_mkwrite(pte_mkdirty(entry));
}
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
- &vmf->ptl);
+ if (!pte_map_lock(vmf)) {
+ ret = VM_FAULT_RETRY;
+ goto release;
+ }
if (!pte_none(*vmf->pte)) {
update_mmu_tlb(vma, vmf->address, vmf->pte);
goto unlock;
@@ -3885,6 +3891,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
pte_unmap_unlock(vmf->pte, vmf->ptl);
if (page)
put_page(page);
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+ return VM_FAULT_RETRY;
return handle_userfault(vmf, VM_UFFD_MISSING);
}
@@ -3902,6 +3910,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
return 0;
unlock:
pte_unmap_unlock(vmf->pte, vmf->ptl);
+release:
if (page)
put_page(page);
return ret;
--
2.20.1
Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT so that the speculative fault
handling code can be compiled on this architecture.
Signed-off-by: Michel Lespinasse <[email protected]>
---
arch/x86/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ebe8fc76949a..378bc33bac54 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -34,6 +34,7 @@ config X86_64
select SWIOTLB
select ARCH_HAS_ELFCORE_COMPAT
select ZONE_DMA32
+ select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
config FORCE_DYNAMIC_FTRACE
def_bool y
--
2.20.1
Change do_numa_page() to use pte_spinlock() when locking the page table,
so that the mmap sequence counter will be validated in the speculative case.
Signed-off-by: Michel Lespinasse <[email protected]>
---
mm/memory.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 74b51aae8166..083e015ff194 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4441,8 +4441,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
* validation through pte_unmap_same(). It's of NUMA type but
* the pfn may be screwed if the read is non atomic.
*/
- vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
- spin_lock(vmf->ptl);
+ if (!pte_spinlock(vmf))
+ return VM_FAULT_RETRY;
if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
goto out;
--
2.20.1
These new functions are to be used when firing MMU notifications
without holding any of the mmap or rmap locks, as is the case with
speculative page fault handlers.
Signed-off-by: Michel Lespinasse <[email protected]>
---
include/linux/mmu_notifier.h | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index ace76fe91c0c..d0430410fdd8 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -772,4 +772,29 @@ static inline void mmu_notifier_synchronize(void)
#endif /* CONFIG_MMU_NOTIFIER */
+#if defined(CONFIG_MMU_NOTIFIER) && defined(CONFIG_SPECULATIVE_PAGE_FAULT)
+
+static inline bool mmu_notifier_trylock(struct mm_struct *mm)
+{
+ return percpu_down_read_trylock(mm->mmu_notifier_lock);
+}
+
+static inline void mmu_notifier_unlock(struct mm_struct *mm)
+{
+ percpu_up_read(mm->mmu_notifier_lock);
+}
+
+#else
+
+static inline bool mmu_notifier_trylock(struct mm_struct *mm)
+{
+ return true;
+}
+
+static inline void mmu_notifier_unlock(struct mm_struct *mm)
+{
+}
+
+#endif
+
#endif /* _LINUX_MMU_NOTIFIER_H */
--
2.20.1
Introduce mmu_notifier_lock as a per-mm percpu_rw_semaphore,
as well as the code to initialize and destroy it together with the mm.
This lock will be used to prevent races between mmu_notifier_register()
and speculative fault handlers that need to fire MMU notifications
without holding any of the mmap or rmap locks.
Signed-off-by: Michel Lespinasse <[email protected]>
---
include/linux/mm_types.h | 6 +++++-
include/linux/mmu_notifier.h | 27 +++++++++++++++++++++++++--
kernel/fork.c | 3 ++-
3 files changed, 32 insertions(+), 4 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 305f05d2a4bc..f77e2dec038d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -462,6 +462,7 @@ struct vm_area_struct {
} __randomize_layout;
struct kioctx_table;
+struct percpu_rw_semaphore;
struct mm_struct {
struct {
struct vm_area_struct *mmap; /* list of VMAs */
@@ -608,7 +609,10 @@ struct mm_struct {
struct file __rcu *exe_file;
#ifdef CONFIG_MMU_NOTIFIER
struct mmu_notifier_subscriptions *notifier_subscriptions;
-#endif
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ struct percpu_rw_semaphore *mmu_notifier_lock;
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+#endif /* CONFIG_MMU_NOTIFIER */
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
pgtable_t pmd_huge_pte; /* protected by page_table_lock */
#endif
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 45fc2c81e370..ace76fe91c0c 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -6,6 +6,8 @@
#include <linux/spinlock.h>
#include <linux/mm_types.h>
#include <linux/mmap_lock.h>
+#include <linux/percpu-rwsem.h>
+#include <linux/slab.h>
#include <linux/srcu.h>
#include <linux/interval_tree.h>
@@ -499,15 +501,35 @@ static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
__mmu_notifier_invalidate_range(mm, start, end);
}
-static inline void mmu_notifier_subscriptions_init(struct mm_struct *mm)
+static inline bool mmu_notifier_subscriptions_init(struct mm_struct *mm)
{
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ mm->mmu_notifier_lock = kzalloc(sizeof(struct percpu_rw_semaphore), GFP_KERNEL);
+ if (!mm->mmu_notifier_lock)
+ return false;
+ if (percpu_init_rwsem(mm->mmu_notifier_lock)) {
+ kfree(mm->mmu_notifier_lock);
+ return false;
+ }
+#endif
+
mm->notifier_subscriptions = NULL;
+ return true;
}
static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
{
if (mm_has_notifiers(mm))
__mmu_notifier_subscriptions_destroy(mm);
+
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ if (!in_atomic()) {
+ percpu_free_rwsem(mm->mmu_notifier_lock);
+ kfree(mm->mmu_notifier_lock);
+ } else {
+ percpu_rwsem_async_destroy(mm->mmu_notifier_lock);
+ }
+#endif
}
@@ -724,8 +746,9 @@ static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
{
}
-static inline void mmu_notifier_subscriptions_init(struct mm_struct *mm)
+static inline bool mmu_notifier_subscriptions_init(struct mm_struct *mm)
{
+ return true;
}
static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
diff --git a/kernel/fork.c b/kernel/fork.c
index 2e5f2e8de31a..db92e42d0087 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1069,7 +1069,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
mm_init_owner(mm, p);
mm_init_pasid(mm);
RCU_INIT_POINTER(mm->exe_file, NULL);
- mmu_notifier_subscriptions_init(mm);
+ if (!mmu_notifier_subscriptions_init(mm))
+ goto fail_nopgd;
init_tlb_flush_pending(mm);
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
mm->pmd_huge_pte = NULL;
--
2.20.1
Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT so that the speculative fault
handling code can be compiled on this architecture.
Signed-off-by: Michel Lespinasse <[email protected]>
---
arch/arm64/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 6978140edfa4..e764329b11a7 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -222,6 +222,7 @@ config ARM64
select THREAD_INFO_IN_TASK
select HAVE_ARCH_USERFAULTFD_MINOR if USERFAULTFD
select TRACE_IRQFLAGS_SUPPORT
+ select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
help
ARM 64-bit (AArch64) Linux support.
--
2.20.1
Performance tuning: as single threaded userspace does not use
speculative page faults, it does not require rcu safe vma freeing.
Turn this off to avoid the related (small) extra overheads.
For multi-threaded userspace, we often see a performance benefit from
the rcu safe vma freeing, even in tests that do not have any frequent
concurrent page faults! This is because rcu safe vma freeing prevents
recently released vmas from being immediately reused in a new thread.
Signed-off-by: Michel Lespinasse <[email protected]>
---
kernel/fork.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index db92e42d0087..34600fe86743 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -384,10 +384,12 @@ void vm_area_free(struct vm_area_struct *vma)
{
free_vma_anon_name(vma);
#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
- call_rcu(&vma->vm_rcu, __vm_area_free);
-#else
- kmem_cache_free(vm_area_cachep, vma);
+ if (atomic_read(&vma->vm_mm->mm_users) > 1) {
+ call_rcu(&vma->vm_rcu, __vm_area_free);
+ return;
+ }
#endif
+ kmem_cache_free(vm_area_cachep, vma);
}
static void account_kernel_stack(struct task_struct *tsk, int account)
--
2.20.1
Attempt speculative mm fault handling first, and fall back to the
existing (non-speculative) code if that fails.
This follows the structure of the x86 speculative fault handling code,
but with some minor arch differences, such as the way that the
access_pkey_error() case is handled.
Signed-off-by: Michel Lespinasse <[email protected]>
---
arch/powerpc/mm/fault.c | 64 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 64 insertions(+)
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index eb8ecd7343a9..3f039504e8fd 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -395,6 +395,10 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
int is_write = page_fault_is_write(error_code);
vm_fault_t fault, major = 0;
bool kprobe_fault = kprobe_page_fault(regs, 11);
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ struct vm_area_struct pvma;
+ unsigned long seq;
+#endif
if (unlikely(debugger_fault_handler(regs) || kprobe_fault))
return 0;
@@ -451,6 +455,63 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
if (is_exec)
flags |= FAULT_FLAG_INSTRUCTION;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ /*
+ * No need to try speculative faults for kernel or
+ * single threaded user space.
+ */
+ if (!(flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
+ goto no_spf;
+
+ count_vm_event(SPF_ATTEMPT);
+ seq = mmap_seq_read_start(mm);
+ if (seq & 1) {
+ count_vm_spf_event(SPF_ABORT_ODD);
+ goto spf_abort;
+ }
+ rcu_read_lock();
+ vma = __find_vma(mm, address);
+ if (!vma || vma->vm_start > address) {
+ rcu_read_unlock();
+ count_vm_spf_event(SPF_ABORT_UNMAPPED);
+ goto spf_abort;
+ }
+ if (!vma_is_anonymous(vma)) {
+ rcu_read_unlock();
+ count_vm_spf_event(SPF_ABORT_NO_SPECULATE);
+ goto spf_abort;
+ }
+ pvma = *vma;
+ rcu_read_unlock();
+ if (!mmap_seq_read_check(mm, seq, SPF_ABORT_VMA_COPY))
+ goto spf_abort;
+ vma = &pvma;
+#ifdef CONFIG_PPC_MEM_KEYS
+ if (unlikely(access_pkey_error(is_write, is_exec,
+ (error_code & DSISR_KEYFAULT), vma))) {
+ count_vm_spf_event(SPF_ABORT_ACCESS_ERROR);
+ goto spf_abort;
+ }
+#endif /* CONFIG_PPC_MEM_KEYS */
+ if (unlikely(access_error(is_write, is_exec, vma))) {
+ count_vm_spf_event(SPF_ABORT_ACCESS_ERROR);
+ goto spf_abort;
+ }
+ fault = do_handle_mm_fault(vma, address,
+ flags | FAULT_FLAG_SPECULATIVE, seq, regs);
+ major |= fault & VM_FAULT_MAJOR;
+
+ if (fault_signal_pending(fault, regs))
+ return user_mode(regs) ? 0 : SIGBUS;
+ if (!(fault & VM_FAULT_RETRY))
+ goto done;
+
+spf_abort:
+ count_vm_event(SPF_ABORT);
+no_spf:
+
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
/* When running in the kernel we expect faults to occur only to
* addresses in user space. All other faults represent errors in the
* kernel and should generate an OOPS. Unfortunately, in the case of an
@@ -522,6 +583,9 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
}
mmap_read_unlock(current->mm);
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+done:
+#endif
if (unlikely(fault & VM_FAULT_ERROR))
return mm_fault_error(regs, address, fault);
--
2.20.1
Add a new CONFIG_SPECULATIVE_PAGE_FAULT_STATS config option,
and dump extra statistics about executed spf cases and abort reasons
when the option is set.
Signed-off-by: Michel Lespinasse <[email protected]>
---
arch/x86/mm/fault.c | 18 ++++++++---
include/linux/mmap_lock.h | 19 ++++++++++--
include/linux/vm_event.h | 6 ++++
include/linux/vm_event_item.h | 21 +++++++++++++
mm/Kconfig.debug | 7 +++++
mm/memory.c | 56 ++++++++++++++++++++++++++++-------
mm/vmstat.c | 21 +++++++++++++
7 files changed, 131 insertions(+), 17 deletions(-)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index d6f8d4967c49..a5a19561c319 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1337,21 +1337,31 @@ void do_user_addr_fault(struct pt_regs *regs,
count_vm_event(SPF_ATTEMPT);
seq = mmap_seq_read_start(mm);
- if (seq & 1)
+ if (seq & 1) {
+ count_vm_spf_event(SPF_ABORT_ODD);
goto spf_abort;
+ }
rcu_read_lock();
vma = __find_vma(mm, address);
- if (!vma || vma->vm_start > address || !vma_is_anonymous(vma)) {
+ if (!vma || vma->vm_start > address) {
rcu_read_unlock();
+ count_vm_spf_event(SPF_ABORT_UNMAPPED);
+ goto spf_abort;
+ }
+ if (!vma_is_anonymous(vma)) {
+ rcu_read_unlock();
+ count_vm_spf_event(SPF_ABORT_NO_SPECULATE);
goto spf_abort;
}
pvma = *vma;
rcu_read_unlock();
- if (!mmap_seq_read_check(mm, seq))
+ if (!mmap_seq_read_check(mm, seq, SPF_ABORT_VMA_COPY))
goto spf_abort;
vma = &pvma;
- if (unlikely(access_error(error_code, vma)))
+ if (unlikely(access_error(error_code, vma))) {
+ count_vm_spf_event(SPF_ABORT_ACCESS_ERROR);
goto spf_abort;
+ }
fault = do_handle_mm_fault(vma, address,
flags | FAULT_FLAG_SPECULATIVE, seq, regs);
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index a2459eb15a33..747805ce07b8 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -7,6 +7,7 @@
#include <linux/rwsem.h>
#include <linux/tracepoint-defs.h>
#include <linux/types.h>
+#include <linux/vm_event.h>
#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
#define MMAP_LOCK_SEQ_INITIALIZER(name) \
@@ -104,12 +105,26 @@ static inline unsigned long mmap_seq_read_start(struct mm_struct *mm)
return seq;
}
-static inline bool mmap_seq_read_check(struct mm_struct *mm, unsigned long seq)
+static inline bool __mmap_seq_read_check(struct mm_struct *mm,
+ unsigned long seq)
{
smp_rmb();
return seq == READ_ONCE(mm->mmap_seq);
}
-#endif
+
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT_STATS
+static inline bool mmap_seq_read_check(struct mm_struct *mm, unsigned long seq,
+ enum vm_event_item fail_event)
+{
+ if (__mmap_seq_read_check(mm, seq))
+ return true;
+ count_vm_event(fail_event);
+ return false;
+}
+#else
+#define mmap_seq_read_check(mm, seq, fail) __mmap_seq_read_check(mm, seq)
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT_STATS */
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
static inline void mmap_write_lock(struct mm_struct *mm)
{
diff --git a/include/linux/vm_event.h b/include/linux/vm_event.h
index b3ae108a3841..689a21387dad 100644
--- a/include/linux/vm_event.h
+++ b/include/linux/vm_event.h
@@ -77,6 +77,12 @@ static inline void vm_events_fold_cpu(int cpu)
#endif /* CONFIG_VM_EVENT_COUNTERS */
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT_STATS
+#define count_vm_spf_event(x) count_vm_event(x)
+#else
+#define count_vm_spf_event(x) do {} while (0)
+#endif
+
#ifdef CONFIG_NUMA_BALANCING
#define count_vm_numa_event(x) count_vm_event(x)
#define count_vm_numa_events(x, y) count_vm_events(x, y)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index f00b3e36ff39..0390b81b1e71 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -137,6 +137,27 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
SPF_ATTEMPT,
SPF_ABORT,
+#endif
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT_STATS
+ SPF_ABORT_ODD,
+ SPF_ABORT_UNMAPPED,
+ SPF_ABORT_NO_SPECULATE,
+ SPF_ABORT_VMA_COPY,
+ SPF_ABORT_ACCESS_ERROR,
+ SPF_ABORT_PUD,
+ SPF_ABORT_PMD,
+ SPF_ABORT_ANON_VMA,
+ SPF_ABORT_PTE_MAP_LOCK_SEQ1,
+ SPF_ABORT_PTE_MAP_LOCK_PMD,
+ SPF_ABORT_PTE_MAP_LOCK_PTL,
+ SPF_ABORT_PTE_MAP_LOCK_SEQ2,
+ SPF_ABORT_USERFAULTFD,
+ SPF_ABORT_FAULT,
+ SPF_ABORT_SWAP,
+ SPF_ATTEMPT_ANON,
+ SPF_ATTEMPT_NUMA,
+ SPF_ATTEMPT_PTE,
+ SPF_ATTEMPT_WP,
#endif
NR_VM_EVENT_ITEMS
};
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 5bd5bb097252..73b61cc95562 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -174,3 +174,10 @@ config PTDUMP_DEBUGFS
kernel.
If in doubt, say N.
+
+config SPECULATIVE_PAGE_FAULT_STATS
+ bool "Additional statistics for speculative page faults"
+ depends on SPECULATIVE_PAGE_FAULT
+ help
+ Additional statistics for speculative page faults.
+ If in doubt, say N.
diff --git a/mm/memory.c b/mm/memory.c
index 7f8dbd729dce..a5754309eaae 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2762,7 +2762,8 @@ bool __pte_map_lock(struct vm_fault *vmf)
}
speculative_page_walk_begin();
- if (!mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq))
+ if (!mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq,
+ SPF_ABORT_PTE_MAP_LOCK_SEQ1))
goto fail;
/*
* The mmap sequence count check guarantees that the page
@@ -2775,8 +2776,10 @@ bool __pte_map_lock(struct vm_fault *vmf)
* is not a huge collapse operation in progress in our back.
*/
pmdval = READ_ONCE(*vmf->pmd);
- if (!pmd_same(pmdval, vmf->orig_pmd))
+ if (!pmd_same(pmdval, vmf->orig_pmd)) {
+ count_vm_spf_event(SPF_ABORT_PTE_MAP_LOCK_PMD);
goto fail;
+ }
#endif
ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
if (!pte)
@@ -2793,9 +2796,12 @@ bool __pte_map_lock(struct vm_fault *vmf)
* We also don't want to retry until spin_trylock() succeeds,
* because of the starvation potential against a stream of lockers.
*/
- if (unlikely(!spin_trylock(ptl)))
+ if (unlikely(!spin_trylock(ptl))) {
+ count_vm_spf_event(SPF_ABORT_PTE_MAP_LOCK_PTL);
goto fail;
- if (!mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq))
+ }
+ if (!mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq,
+ SPF_ABORT_PTE_MAP_LOCK_SEQ2))
goto unlock_fail;
speculative_page_walk_end();
vmf->pte = pte;
@@ -3091,6 +3097,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
if (unlikely(!vma->anon_vma)) {
if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+ count_vm_spf_event(SPF_ABORT_ANON_VMA);
ret = VM_FAULT_RETRY;
goto out;
}
@@ -3367,10 +3374,15 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+ count_vm_spf_event(SPF_ATTEMPT_WP);
+
if (userfaultfd_pte_wp(vma, *vmf->pte)) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
- if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+ count_vm_spf_event(SPF_ABORT_USERFAULTFD);
return VM_FAULT_RETRY;
+ }
return handle_userfault(vmf, VM_UFFD_WP);
}
@@ -3620,6 +3632,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
pte_unmap(vmf->pte);
+ count_vm_spf_event(SPF_ABORT_SWAP);
return VM_FAULT_RETRY;
}
@@ -3852,6 +3865,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
vm_fault_t ret = 0;
pte_t entry;
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+ count_vm_spf_event(SPF_ATTEMPT_ANON);
+
/* File mapping without ->vm_ops ? */
if (vma->vm_flags & VM_SHARED)
return VM_FAULT_SIGBUS;
@@ -3881,8 +3897,10 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
} else {
/* Allocate our own private page. */
if (unlikely(!vma->anon_vma)) {
- if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+ count_vm_spf_event(SPF_ABORT_ANON_VMA);
return VM_FAULT_RETRY;
+ }
if (__anon_vma_prepare(vma))
goto oom;
}
@@ -3925,8 +3943,10 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
pte_unmap_unlock(vmf->pte, vmf->ptl);
if (page)
put_page(page);
- if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+ count_vm_spf_event(SPF_ABORT_USERFAULTFD);
return VM_FAULT_RETRY;
+ }
return handle_userfault(vmf, VM_UFFD_MISSING);
}
@@ -4470,6 +4490,9 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
bool was_writable = pte_savedwrite(vmf->orig_pte);
int flags = 0;
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+ count_vm_spf_event(SPF_ATTEMPT_NUMA);
+
/*
* The "pte" at this point cannot be used safely without
* validation through pte_unmap_same(). It's of NUMA type but
@@ -4651,6 +4674,9 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
return do_numa_page(vmf);
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+ count_vm_spf_event(SPF_ATTEMPT_PTE);
+
if (!pte_spinlock(vmf))
return VM_FAULT_RETRY;
entry = vmf->orig_pte;
@@ -4718,20 +4744,26 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
speculative_page_walk_begin();
pgd = pgd_offset(mm, address);
pgdval = READ_ONCE(*pgd);
- if (pgd_none(pgdval) || unlikely(pgd_bad(pgdval)))
+ if (pgd_none(pgdval) || unlikely(pgd_bad(pgdval))) {
+ count_vm_spf_event(SPF_ABORT_PUD);
goto spf_fail;
+ }
p4d = p4d_offset(pgd, address);
p4dval = READ_ONCE(*p4d);
- if (p4d_none(p4dval) || unlikely(p4d_bad(p4dval)))
+ if (p4d_none(p4dval) || unlikely(p4d_bad(p4dval))) {
+ count_vm_spf_event(SPF_ABORT_PUD);
goto spf_fail;
+ }
vmf.pud = pud_offset(p4d, address);
pudval = READ_ONCE(*vmf.pud);
if (pud_none(pudval) || unlikely(pud_bad(pudval)) ||
unlikely(pud_trans_huge(pudval)) ||
- unlikely(pud_devmap(pudval)))
+ unlikely(pud_devmap(pudval))) {
+ count_vm_spf_event(SPF_ABORT_PUD);
goto spf_fail;
+ }
vmf.pmd = pmd_offset(vmf.pud, address);
vmf.orig_pmd = READ_ONCE(*vmf.pmd);
@@ -4749,8 +4781,10 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
if (unlikely(pmd_none(vmf.orig_pmd) ||
is_swap_pmd(vmf.orig_pmd) ||
pmd_trans_huge(vmf.orig_pmd) ||
- pmd_devmap(vmf.orig_pmd)))
+ pmd_devmap(vmf.orig_pmd))) {
+ count_vm_spf_event(SPF_ABORT_PMD);
goto spf_fail;
+ }
/*
* The above does not allocate/instantiate page-tables because
diff --git a/mm/vmstat.c b/mm/vmstat.c
index dbb0160e5558..20ac17cf582a 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1394,6 +1394,27 @@ const char * const vmstat_text[] = {
"spf_attempt",
"spf_abort",
#endif
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT_STATS
+ "SPF_ABORT_ODD",
+ "SPF_ABORT_UNMAPPED",
+ "SPF_ABORT_NO_SPECULATE",
+ "SPF_ABORT_VMA_COPY",
+ "SPF_ABORT_ACCESS_ERROR",
+ "SPF_ABORT_PUD",
+ "SPF_ABORT_PMD",
+ "SPF_ABORT_ANON_VMA",
+ "SPF_ABORT_PTE_MAP_LOCK_SEQ1",
+ "SPF_ABORT_PTE_MAP_LOCK_PMD",
+ "SPF_ABORT_PTE_MAP_LOCK_PTL",
+ "SPF_ABORT_PTE_MAP_LOCK_SEQ2",
+ "SPF_ABORT_USERFAULTFD",
+ "SPF_ABORT_FAULT",
+ "SPF_ABORT_SWAP",
+ "SPF_ATTEMPT_ANON",
+ "SPF_ATTEMPT_NUMA",
+ "SPF_ATTEMPT_PTE",
+ "SPF_ATTEMPT_WP",
+#endif
#endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
};
#endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
--
2.20.1
Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT so that the speculative fault
handling code can be compiled on this architecture.
Signed-off-by: Michel Lespinasse <[email protected]>
---
arch/powerpc/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index b779603978e1..5f82bc7eee0b 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -144,6 +144,7 @@ config PPC
select ARCH_STACKWALK
select ARCH_SUPPORTS_ATOMIC_RMW
select ARCH_SUPPORTS_DEBUG_PAGEALLOC if PPC_BOOK3S || PPC_8xx || 40x
+ select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT if PPC_BOOK3S_64
select ARCH_USE_BUILTIN_BSWAP
select ARCH_USE_CMPXCHG_LOCKREF if PPC64
select ARCH_USE_MEMTEST
--
2.20.1
From: Suren Baghdasaryan <[email protected]>
Calling percpu_free_rwsem in atomic context results in a "scheduling while
atomic" bug being triggered:
BUG: scheduling while atomic: klogd/158/0x00000002
...
__schedule_bug+0x191/0x290
schedule_debug+0x97/0x180
__schedule+0xdc/0xba0
schedule+0xda/0x250
schedule_timeout+0x92/0x2d0
__wait_for_common+0x25b/0x430
wait_for_completion+0x1f/0x30
rcu_barrier+0x440/0x4f0
rcu_sync_dtor+0xaa/0x190
percpu_free_rwsem+0x41/0x80
Introduce a percpu_rwsem_async_destroy() function to perform semaphore
destruction in a worker thread.
Signed-off-by: Suren Baghdasaryan <[email protected]>
Signed-off-by: Michel Lespinasse <[email protected]>
---
include/linux/percpu-rwsem.h | 13 ++++++++++++-
kernel/locking/percpu-rwsem.c | 32 ++++++++++++++++++++++++++++++++
2 files changed, 44 insertions(+), 1 deletion(-)
diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 5fda40f97fe9..bf1668fc9c5e 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -13,7 +13,14 @@ struct percpu_rw_semaphore {
struct rcu_sync rss;
unsigned int __percpu *read_count;
struct rcuwait writer;
- wait_queue_head_t waiters;
+ /*
+ * destroy_list_entry is used during object destruction when waiters
+ * can't be used, therefore reusing the same space.
+ */
+ union {
+ wait_queue_head_t waiters;
+ struct list_head destroy_list_entry;
+ };
atomic_t block;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map dep_map;
@@ -127,8 +134,12 @@ extern void percpu_up_write(struct percpu_rw_semaphore *);
extern int __percpu_init_rwsem(struct percpu_rw_semaphore *,
const char *, struct lock_class_key *);
+/* Can't be called in atomic context. */
extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
+/* Invokes percpu_free_rwsem and frees the semaphore from a worker thread. */
+extern void percpu_rwsem_async_destroy(struct percpu_rw_semaphore *sem);
+
#define percpu_init_rwsem(sem) \
({ \
static struct lock_class_key rwsem_key; \
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index 70a32a576f3f..a3d37bf83c60 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -7,6 +7,7 @@
#include <linux/rcupdate.h>
#include <linux/sched.h>
#include <linux/sched/task.h>
+#include <linux/slab.h>
#include <linux/errno.h>
int __percpu_init_rwsem(struct percpu_rw_semaphore *sem,
@@ -268,3 +269,34 @@ void percpu_up_write(struct percpu_rw_semaphore *sem)
rcu_sync_exit(&sem->rss);
}
EXPORT_SYMBOL_GPL(percpu_up_write);
+
+static LIST_HEAD(destroy_list);
+static DEFINE_SPINLOCK(destroy_list_lock);
+
+static void destroy_list_workfn(struct work_struct *work)
+{
+ struct percpu_rw_semaphore *sem, *sem2;
+ LIST_HEAD(to_destroy);
+
+ spin_lock(&destroy_list_lock);
+ list_splice_init(&destroy_list, &to_destroy);
+ spin_unlock(&destroy_list_lock);
+
+ if (list_empty(&to_destroy))
+ return;
+
+ list_for_each_entry_safe(sem, sem2, &to_destroy, destroy_list_entry) {
+ percpu_free_rwsem(sem);
+ kfree(sem);
+ }
+}
+
+static DECLARE_WORK(destroy_list_work, destroy_list_workfn);
+
+void percpu_rwsem_async_destroy(struct percpu_rw_semaphore *sem)
+{
+ spin_lock(&destroy_list_lock);
+ list_add_tail(&sem->destroy_list_entry, &destroy_list);
+ spin_unlock(&destroy_list_lock);
+ schedule_work(&destroy_list_work);
+}
--
2.20.1
In the mmap locking API, the *_killable() functions return an error
(or 0 on success), and the *_trylock() functions return a boolean
(true on success).
Rename the return values to "int error" and "bool ok" respectively,
rather than using "ret" for both cases, which I find less readable.
Signed-off-by: Michel Lespinasse <[email protected]>
---
include/linux/mmap_lock.h | 32 ++++++++++++++++----------------
1 file changed, 16 insertions(+), 16 deletions(-)
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index db9785e11274..1b14468183d7 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -81,22 +81,22 @@ static inline void mmap_write_lock_nested(struct mm_struct *mm, int subclass)
static inline int mmap_write_lock_killable(struct mm_struct *mm)
{
- int ret;
+ int error;
__mmap_lock_trace_start_locking(mm, true);
- ret = down_write_killable(&mm->mmap_lock);
- __mmap_lock_trace_acquire_returned(mm, true, ret == 0);
- return ret;
+ error = down_write_killable(&mm->mmap_lock);
+ __mmap_lock_trace_acquire_returned(mm, true, !error);
+ return error;
}
static inline bool mmap_write_trylock(struct mm_struct *mm)
{
- bool ret;
+ bool ok;
__mmap_lock_trace_start_locking(mm, true);
- ret = down_write_trylock(&mm->mmap_lock) != 0;
- __mmap_lock_trace_acquire_returned(mm, true, ret);
- return ret;
+ ok = down_write_trylock(&mm->mmap_lock) != 0;
+ __mmap_lock_trace_acquire_returned(mm, true, ok);
+ return ok;
}
static inline void mmap_write_unlock(struct mm_struct *mm)
@@ -120,22 +120,22 @@ static inline void mmap_read_lock(struct mm_struct *mm)
static inline int mmap_read_lock_killable(struct mm_struct *mm)
{
- int ret;
+ int error;
__mmap_lock_trace_start_locking(mm, false);
- ret = down_read_killable(&mm->mmap_lock);
- __mmap_lock_trace_acquire_returned(mm, false, ret == 0);
- return ret;
+ error = down_read_killable(&mm->mmap_lock);
+ __mmap_lock_trace_acquire_returned(mm, false, !error);
+ return error;
}
static inline bool mmap_read_trylock(struct mm_struct *mm)
{
- bool ret;
+ bool ok;
__mmap_lock_trace_start_locking(mm, false);
- ret = down_read_trylock(&mm->mmap_lock) != 0;
- __mmap_lock_trace_acquire_returned(mm, false, ret);
- return ret;
+ ok = down_read_trylock(&mm->mmap_lock) != 0;
+ __mmap_lock_trace_acquire_returned(mm, false, ok);
+ return ok;
}
static inline void mmap_read_unlock(struct mm_struct *mm)
--
2.20.1
Change wp_page_copy() to handle the speculative case. This involves
aborting speculative faults if they have to allocate an anon_vma,
read-locking the mmu_notifier_lock to avoid races with
mmu_notifier_register(), and using pte_map_lock() instead of
pte_offset_map_lock() to complete the page fault.
Also change call sites to clear vmf->pte after unmapping the page table,
in order to satisfy pte_map_lock()'s preconditions.
Signed-off-by: Michel Lespinasse <[email protected]>
---
mm/memory.c | 42 +++++++++++++++++++++++++++++++++---------
1 file changed, 33 insertions(+), 9 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 73b1a328b797..fd8984d89109 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3087,20 +3087,27 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
pte_t entry;
int page_copied = 0;
struct mmu_notifier_range range;
+ vm_fault_t ret = VM_FAULT_OOM;
- if (unlikely(anon_vma_prepare(vma)))
- goto oom;
+ if (unlikely(!vma->anon_vma)) {
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+ ret = VM_FAULT_RETRY;
+ goto out;
+ }
+ if (__anon_vma_prepare(vma))
+ goto out;
+ }
if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
new_page = alloc_zeroed_user_highpage_movable(vma,
vmf->address);
if (!new_page)
- goto oom;
+ goto out;
} else {
new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
vmf->address);
if (!new_page)
- goto oom;
+ goto out;
if (!cow_user_page(new_page, old_page, vmf)) {
/*
@@ -3117,11 +3124,16 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
}
if (mem_cgroup_charge(page_folio(new_page), mm, GFP_KERNEL))
- goto oom_free_new;
+ goto out_free_new;
cgroup_throttle_swaprate(new_page, GFP_KERNEL);
__SetPageUptodate(new_page);
+ if ((vmf->flags & FAULT_FLAG_SPECULATIVE) &&
+ !mmu_notifier_trylock(mm)) {
+ ret = VM_FAULT_RETRY;
+ goto out_free_new;
+ }
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm,
vmf->address & PAGE_MASK,
(vmf->address & PAGE_MASK) + PAGE_SIZE);
@@ -3130,7 +3142,11 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
/*
* Re-check the pte - we dropped the lock
*/
- vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
+ if (!pte_map_lock(vmf)) {
+ ret = VM_FAULT_RETRY;
+ /* put_page() will uncharge the page */
+ goto out_notify;
+ }
if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
if (old_page) {
if (!PageAnon(old_page)) {
@@ -3205,6 +3221,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
* the above ptep_clear_flush_notify() did already call it.
*/
mmu_notifier_invalidate_range_only_end(&range);
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+ mmu_notifier_unlock(mm);
if (old_page) {
/*
* Don't let another task, with possibly unlocked vma,
@@ -3221,12 +3239,16 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
put_page(old_page);
}
return page_copied ? VM_FAULT_WRITE : 0;
-oom_free_new:
+out_notify:
+ mmu_notifier_invalidate_range_only_end(&range);
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+ mmu_notifier_unlock(mm);
+out_free_new:
put_page(new_page);
-oom:
+out:
if (old_page)
put_page(old_page);
- return VM_FAULT_OOM;
+ return ret;
}
/**
@@ -3369,6 +3391,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
return wp_pfn_shared(vmf);
pte_unmap_unlock(vmf->pte, vmf->ptl);
+ vmf->pte = NULL;
return wp_page_copy(vmf);
}
@@ -3407,6 +3430,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
get_page(vmf->page);
pte_unmap_unlock(vmf->pte, vmf->ptl);
+ vmf->pte = NULL;
return wp_page_copy(vmf);
}
--
2.20.1
In the x86 fault handler, only attempt spf if the vma is anonymous.
In do_handle_mm_fault(), let speculative page faults proceed as long
as they fall into anonymous vmas. This enables the speculative
handling code in __handle_mm_fault() and do_anonymous_page().
In handle_pte_fault(), if vmf->pte is set (the original pte was not
pte_none), catch speculative faults and return VM_FAULT_RETRY as
those cases are not implemented yet. Also assert that do_fault()
is not reached in the speculative case.
Signed-off-by: Michel Lespinasse <[email protected]>
---
arch/x86/mm/fault.c | 2 +-
mm/memory.c | 16 ++++++++++++----
2 files changed, 13 insertions(+), 5 deletions(-)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 99b0a358154e..6ba109413396 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1334,7 +1334,7 @@ void do_user_addr_fault(struct pt_regs *regs,
goto spf_abort;
rcu_read_lock();
vma = __find_vma(mm, address);
- if (!vma || vma->vm_start > address) {
+ if (!vma || vma->vm_start > address || !vma_is_anonymous(vma)) {
rcu_read_unlock();
goto spf_abort;
}
diff --git a/mm/memory.c b/mm/memory.c
index 8d036140634d..74b51aae8166 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4365,6 +4365,8 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
struct mm_struct *vm_mm = vma->vm_mm;
vm_fault_t ret;
+ VM_BUG_ON(vmf->flags & FAULT_FLAG_SPECULATIVE);
+
/*
* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND
*/
@@ -4609,6 +4611,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
return do_fault(vmf);
}
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+ pte_unmap(vmf->pte);
+ return VM_FAULT_RETRY;
+ }
+
if (!pte_present(vmf->orig_pte))
return do_swap_page(vmf);
@@ -4937,8 +4944,7 @@ vm_fault_t do_handle_mm_fault(struct vm_area_struct *vma,
{
vm_fault_t ret;
- if (flags & FAULT_FLAG_SPECULATIVE)
- return VM_FAULT_RETRY;
+ VM_BUG_ON((flags & FAULT_FLAG_SPECULATIVE) && !vma_is_anonymous(vma));
__set_current_state(TASK_RUNNING);
@@ -4960,10 +4966,12 @@ vm_fault_t do_handle_mm_fault(struct vm_area_struct *vma,
if (flags & FAULT_FLAG_USER)
mem_cgroup_enter_user_fault();
- if (unlikely(is_vm_hugetlb_page(vma)))
+ if (unlikely(is_vm_hugetlb_page(vma))) {
+ VM_BUG_ON(flags & FAULT_FLAG_SPECULATIVE);
ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
- else
+ } else {
ret = __handle_mm_fault(vma, address, flags, seq);
+ }
if (flags & FAULT_FLAG_USER) {
mem_cgroup_exit_user_fault();
--
2.20.1
In handle_pte_fault(), allow speculative fault handling to proceed.
Use pte_spinlock() to validate the mmap sequence count when locking
the page table.
If speculative execution proceeds through do_wp_page(), ensure that we
end up in the wp_page_reuse() or wp_page_copy() paths, rather than
wp_pfn_shared() or wp_page_shared() (both unreachable as we only
handle anon vmas so far) or handle_userfault() (which needs an explicit
abort so the fault can be retried non-speculatively).
Signed-off-by: Michel Lespinasse <[email protected]>
---
mm/memory.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index fd8984d89109..7f8dbd729dce 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3293,6 +3293,7 @@ static vm_fault_t wp_pfn_shared(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
+ VM_BUG_ON(vmf->flags & FAULT_FLAG_SPECULATIVE);
if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
vm_fault_t ret;
@@ -3313,6 +3314,8 @@ static vm_fault_t wp_page_shared(struct vm_fault *vmf)
struct vm_area_struct *vma = vmf->vma;
vm_fault_t ret = VM_FAULT_WRITE;
+ VM_BUG_ON(vmf->flags & FAULT_FLAG_SPECULATIVE);
+
get_page(vmf->page);
if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
@@ -3366,6 +3369,8 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
if (userfaultfd_pte_wp(vma, *vmf->pte)) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+ return VM_FAULT_RETRY;
return handle_userfault(vmf, VM_UFFD_WP);
}
@@ -4646,13 +4651,8 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
return do_numa_page(vmf);
- if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
- pte_unmap(vmf->pte);
+ if (!pte_spinlock(vmf))
return VM_FAULT_RETRY;
- }
-
- vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
- spin_lock(vmf->ptl);
entry = vmf->orig_pte;
if (unlikely(!pte_same(*vmf->pte, entry))) {
update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
--
2.20.1
pte_map_lock() and pte_spinlock() are used by fault handlers to ensure
the pte is mapped and locked before they commit the faulted page to the
mm's address space at the end of the fault.
The functions differ in their preconditions; pte_map_lock() expects
the pte to be unmapped prior to the call, while pte_spinlock() expects
it to be already mapped.
In the speculative fault case, the functions verify, after locking the pte,
that the mmap sequence count has not changed since the start of the fault,
and thus that no mmap lock writers have been running concurrently with
the fault. After that point the page table lock serializes any further
races with concurrent mmap lock writers.
If the mmap sequence count check fails, both functions will return false
with the pte being left unmapped and unlocked.
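For illustration, a typical call site in a fault handler ends up following
this pattern (a sketch only; the actual call sites are converted in later
patches of this series):
	if (!pte_map_lock(vmf))
		return VM_FAULT_RETRY;	/* seq check failed; retry with mmap read lock */
	/* ... prepare/modify the pte under vmf->ptl ... */
	set_pte_at(vmf->vma->vm_mm, vmf->address, vmf->pte, entry);
	pte_unmap_unlock(vmf->pte, vmf->ptl);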
Signed-off-by: Michel Lespinasse <[email protected]>
---
include/linux/mm.h | 38 ++++++++++++++++++++++++++
mm/memory.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 104 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2e2122bd3da3..7f1083fb94e0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3394,5 +3394,43 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
}
#endif
+#ifdef CONFIG_MMU
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+
+bool __pte_map_lock(struct vm_fault *vmf);
+
+static inline bool pte_map_lock(struct vm_fault *vmf)
+{
+ VM_BUG_ON(vmf->pte);
+ return __pte_map_lock(vmf);
+}
+
+static inline bool pte_spinlock(struct vm_fault *vmf)
+{
+ VM_BUG_ON(!vmf->pte);
+ return __pte_map_lock(vmf);
+}
+
+#else /* !CONFIG_SPECULATIVE_PAGE_FAULT */
+
+#define pte_map_lock(__vmf) \
+({ \
+ struct vm_fault *vmf = __vmf; \
+ vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, \
+ vmf->address, &vmf->ptl); \
+ true; \
+})
+
+#define pte_spinlock(__vmf) \
+({ \
+ struct vm_fault *vmf = __vmf; \
+ vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd); \
+ spin_lock(vmf->ptl); \
+ true; \
+})
+
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+#endif /* CONFIG_MMU */
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
diff --git a/mm/memory.c b/mm/memory.c
index d0db10bd5bee..1ce837e47395 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2745,6 +2745,72 @@ EXPORT_SYMBOL_GPL(apply_to_existing_page_range);
#define speculative_page_walk_end() local_irq_enable()
#endif
+bool __pte_map_lock(struct vm_fault *vmf)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ pmd_t pmdval;
+#endif
+ pte_t *pte = vmf->pte;
+ spinlock_t *ptl;
+
+ if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
+ vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+ if (!pte)
+ vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+ spin_lock(vmf->ptl);
+ return true;
+ }
+
+ speculative_page_walk_begin();
+ if (!mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq))
+ goto fail;
+ /*
+ * The mmap sequence count check guarantees that the page
+ * tables are still valid at that point, and
+ * speculative_page_walk_begin() ensures that they stay around.
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ /*
+ * We check if the pmd value is still the same to ensure that there
+ * is not a huge collapse operation in progress behind our back.
+ */
+ pmdval = READ_ONCE(*vmf->pmd);
+ if (!pmd_same(pmdval, vmf->orig_pmd))
+ goto fail;
+#endif
+ ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+ if (!pte)
+ pte = pte_offset_map(vmf->pmd, vmf->address);
+ /*
+ * Try locking the page table.
+ *
+ * Note that we might race against zap_pte_range() which
+ * invalidates TLBs while holding the page table lock.
+ * We are still under the speculative_page_walk_begin() section,
+ * and zap_pte_range() could thus deadlock with us if we tried
+ * using spin_lock() here.
+ *
+ * We also don't want to retry until spin_trylock() succeeds,
+ * because of the starvation potential against a stream of lockers.
+ */
+ if (unlikely(!spin_trylock(ptl)))
+ goto fail;
+ if (!mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq))
+ goto unlock_fail;
+ speculative_page_walk_end();
+ vmf->pte = pte;
+ vmf->ptl = ptl;
+ return true;
+
+unlock_fail:
+ spin_unlock(ptl);
+fail:
+ if (pte)
+ pte_unmap(pte);
+ speculative_page_walk_end();
+ return false;
+}
+
#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
/*
--
2.20.1
Change handle_pte_fault() to allow speculative fault execution to proceed
through do_numa_page().
do_swap_page() does not implement speculative execution yet, so it
needs to abort with VM_FAULT_RETRY in that case.
Signed-off-by: Michel Lespinasse <[email protected]>
---
mm/memory.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 083e015ff194..73b1a328b797 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3589,6 +3589,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
vm_fault_t ret = 0;
void *shadow = NULL;
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+ pte_unmap(vmf->pte);
+ return VM_FAULT_RETRY;
+ }
+
if (!pte_unmap_same(vmf))
goto out;
@@ -4611,17 +4616,17 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
return do_fault(vmf);
}
- if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
- pte_unmap(vmf->pte);
- return VM_FAULT_RETRY;
- }
-
if (!pte_present(vmf->orig_pte))
return do_swap_page(vmf);
if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
return do_numa_page(vmf);
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+ pte_unmap(vmf->pte);
+ return VM_FAULT_RETRY;
+ }
+
vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
spin_lock(vmf->ptl);
entry = vmf->orig_pte;
--
2.20.1
Change mm_take_all_locks to also take the mmu_notifier_lock.
Note that mm_take_all_locks is called from mmu_notifier_register() only.
Signed-off-by: Michel Lespinasse <[email protected]>
---
mm/mmap.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/mm/mmap.c b/mm/mmap.c
index b09a2c875507..a67c3600d995 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3592,6 +3592,10 @@ int mm_take_all_locks(struct mm_struct *mm)
mutex_lock(&mm_all_locks_mutex);
+#if defined(CONFIG_MMU_NOTIFIER) && defined(CONFIG_SPECULATIVE_PAGE_FAULT)
+ percpu_down_write(mm->mmu_notifier_lock);
+#endif
+
for (vma = mm->mmap; vma; vma = vma->vm_next) {
if (signal_pending(current))
goto out_unlock;
@@ -3679,6 +3683,10 @@ void mm_drop_all_locks(struct mm_struct *mm)
vm_unlock_mapping(vma->vm_file->f_mapping);
}
+#if defined(CONFIG_MMU_NOTIFIER) && defined(CONFIG_SPECULATIVE_PAGE_FAULT)
+ percpu_up_write(mm->mmu_notifier_lock);
+#endif
+
mutex_unlock(&mm_all_locks_mutex);
}
--
2.20.1
This configuration variable will be used to build the code needed to
handle speculative page faults.
It is enabled by default on supported architectures when SMP and MMU are enabled.
The architecture support is needed since the speculative page fault handler
is called from the architecture's page faulting code, and some code has to
be added there to try speculative fault handling first.
Signed-off-by: Michel Lespinasse <[email protected]>
---
mm/Kconfig | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/mm/Kconfig b/mm/Kconfig
index 3326ee3903f3..d304fca0f293 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -894,4 +894,26 @@ config ANON_VMA_NAME
source "mm/damon/Kconfig"
+config ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
+ def_bool n
+
+config SPECULATIVE_PAGE_FAULT
+ bool "Speculative page faults"
+ default y
+ depends on ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT && MMU && SMP
+ help
+ Try to handle user space page faults without holding the mmap lock.
+
+ Instead of blocking writers through the use of mmap lock,
+ the page fault handler merely verifies, at the end of the page
+ fault, that no writers have been running concurrently with it.
+
+ In high concurrency situations, the speculative fault handler
+ gains a throughput advantage by avoiding having to update the
+ mmap lock reader count.
+
+ If the check fails due to a concurrent writer, or due to hitting
+ an unsupported case, the fault handler falls back to classical
+ processing using the mmap read lock.
+
endmenu
--
2.20.1
In do_anonymous_page(), we have separate cases for the zero page vs.
allocating new anonymous pages. However, once the pte entry has been
computed, the rest of the handling (mapping and locking the page table,
checking that we didn't lose a race with another page fault handler, etc.)
is identical between the two cases.
This change reduces the code duplication between the two cases.
Signed-off-by: Michel Lespinasse <[email protected]>
---
mm/memory.c | 87 +++++++++++++++++++++++------------------------------
1 file changed, 38 insertions(+), 49 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index cd9432df3a27..f83e06b1dafb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3726,7 +3726,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
- struct page *page;
+ struct page *page = NULL;
vm_fault_t ret = 0;
pte_t entry;
@@ -3756,78 +3756,67 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
!mm_forbids_zeropage(vma->vm_mm)) {
entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
vma->vm_page_prot));
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
- vmf->address, &vmf->ptl);
- if (!pte_none(*vmf->pte)) {
- update_mmu_tlb(vma, vmf->address, vmf->pte);
- goto unlock;
- }
- ret = check_stable_address_space(vma->vm_mm);
- if (ret)
- goto unlock;
- /* Deliver the page fault to userland, check inside PT lock */
- if (userfaultfd_missing(vma)) {
- pte_unmap_unlock(vmf->pte, vmf->ptl);
- return handle_userfault(vmf, VM_UFFD_MISSING);
- }
- goto setpte;
+ } else {
+ /* Allocate our own private page. */
+ if (unlikely(anon_vma_prepare(vma)))
+ goto oom;
+ page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
+ if (!page)
+ goto oom;
+
+ if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL))
+ goto oom_free_page;
+ cgroup_throttle_swaprate(page, GFP_KERNEL);
+
+ /*
+ * The memory barrier inside __SetPageUptodate makes sure that
+ * preceding stores to the page contents become visible before
+ * the set_pte_at() write.
+ */
+ __SetPageUptodate(page);
+
+ entry = mk_pte(page, vma->vm_page_prot);
+ entry = pte_sw_mkyoung(entry);
+ if (vma->vm_flags & VM_WRITE)
+ entry = pte_mkwrite(pte_mkdirty(entry));
}
- /* Allocate our own private page. */
- if (unlikely(anon_vma_prepare(vma)))
- goto oom;
- page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
- if (!page)
- goto oom;
-
- if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL))
- goto oom_free_page;
- cgroup_throttle_swaprate(page, GFP_KERNEL);
-
- /*
- * The memory barrier inside __SetPageUptodate makes sure that
- * preceding stores to the page contents become visible before
- * the set_pte_at() write.
- */
- __SetPageUptodate(page);
-
- entry = mk_pte(page, vma->vm_page_prot);
- entry = pte_sw_mkyoung(entry);
- if (vma->vm_flags & VM_WRITE)
- entry = pte_mkwrite(pte_mkdirty(entry));
-
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
if (!pte_none(*vmf->pte)) {
update_mmu_tlb(vma, vmf->address, vmf->pte);
- goto release;
+ goto unlock;
}
ret = check_stable_address_space(vma->vm_mm);
if (ret)
- goto release;
+ goto unlock;
/* Deliver the page fault to userland, check inside PT lock */
if (userfaultfd_missing(vma)) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
- put_page(page);
+ if (page)
+ put_page(page);
return handle_userfault(vmf, VM_UFFD_MISSING);
}
- inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
- page_add_new_anon_rmap(page, vma, vmf->address, false);
- lru_cache_add_inactive_or_unevictable(page, vma);
-setpte:
+ if (page) {
+ inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
+ page_add_new_anon_rmap(page, vma, vmf->address, false);
+ lru_cache_add_inactive_or_unevictable(page, vma);
+ }
+
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, vmf->address, vmf->pte);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return 0;
unlock:
pte_unmap_unlock(vmf->pte, vmf->ptl);
+ if (page)
+ put_page(page);
return ret;
-release:
- put_page(page);
- goto unlock;
oom_free_page:
put_page(page);
oom:
--
2.20.1
Attempt speculative mm fault handling first, and fall back to the
existing (non-speculative) code if that fails.
This follows the lines of the x86 speculative fault handling code,
but with some minor arch differences such as the way that the
VM_FAULT_BADACCESS case is handled.
Signed-off-by: Michel Lespinasse <[email protected]>
---
arch/arm64/mm/fault.c | 62 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 62 insertions(+)
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 77341b160aca..2598795f4e70 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -25,6 +25,7 @@
#include <linux/perf_event.h>
#include <linux/preempt.h>
#include <linux/hugetlb.h>
+#include <linux/vm_event_item.h>
#include <asm/acpi.h>
#include <asm/bug.h>
@@ -524,6 +525,11 @@ static int __kprobes do_page_fault(unsigned long far, unsigned int esr,
unsigned long vm_flags;
unsigned int mm_flags = FAULT_FLAG_DEFAULT;
unsigned long addr = untagged_addr(far);
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ struct vm_area_struct *vma;
+ struct vm_area_struct pvma;
+ unsigned long seq;
+#endif
if (kprobe_page_fault(regs, esr))
return 0;
@@ -574,6 +580,59 @@ static int __kprobes do_page_fault(unsigned long far, unsigned int esr,
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ /*
+ * No need to try speculative faults for kernel or
+ * single threaded user space.
+ */
+ if (!(mm_flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
+ goto no_spf;
+
+ count_vm_event(SPF_ATTEMPT);
+ seq = mmap_seq_read_start(mm);
+ if (seq & 1) {
+ count_vm_spf_event(SPF_ABORT_ODD);
+ goto spf_abort;
+ }
+ rcu_read_lock();
+ vma = __find_vma(mm, addr);
+ if (!vma || vma->vm_start > addr) {
+ rcu_read_unlock();
+ count_vm_spf_event(SPF_ABORT_UNMAPPED);
+ goto spf_abort;
+ }
+ if (!vma_is_anonymous(vma)) {
+ rcu_read_unlock();
+ count_vm_spf_event(SPF_ABORT_NO_SPECULATE);
+ goto spf_abort;
+ }
+ pvma = *vma;
+ rcu_read_unlock();
+ if (!mmap_seq_read_check(mm, seq, SPF_ABORT_VMA_COPY))
+ goto spf_abort;
+ vma = &pvma;
+ if (!(vma->vm_flags & vm_flags)) {
+ count_vm_spf_event(SPF_ABORT_ACCESS_ERROR);
+ goto spf_abort;
+ }
+ fault = do_handle_mm_fault(vma, addr & PAGE_MASK,
+ mm_flags | FAULT_FLAG_SPECULATIVE, seq, regs);
+
+ /* Quick path to respond to signals */
+ if (fault_signal_pending(fault, regs)) {
+ if (!user_mode(regs))
+ goto no_context;
+ return 0;
+ }
+ if (!(fault & VM_FAULT_RETRY))
+ goto done;
+
+spf_abort:
+ count_vm_event(SPF_ABORT);
+no_spf:
+
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
/*
* As per x86, we may deadlock here. However, since the kernel only
* validly references user space from well defined areas of the code,
@@ -612,6 +671,9 @@ static int __kprobes do_page_fault(unsigned long far, unsigned int esr,
goto retry;
}
mmap_read_unlock(mm);
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+done:
+#endif
/*
* Handle the "normal" (no error) case first.
--
2.20.1
The speculative path calls speculative_page_walk_begin() before walking
the page table tree to prevent page table reclamation. The logic is
otherwise similar to the non-speculative path, but with additional
restrictions: in the speculative path, we do not handle huge pages or
wire in new page tables.
Signed-off-by: Michel Lespinasse <[email protected]>
---
include/linux/mm.h | 6 ++++
mm/memory.c | 77 ++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 81 insertions(+), 2 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6f7712179503..2e2122bd3da3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -483,6 +483,10 @@ struct vm_fault {
};
enum fault_flag flags; /* FAULT_FLAG_xxx flags
* XXX: should really be 'const' */
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ unsigned long seq;
+ pmd_t orig_pmd;
+#endif
pmd_t *pmd; /* Pointer to pmd entry matching
* the 'address' */
pud_t *pud; /* Pointer to pud entry matching
@@ -490,9 +494,11 @@ struct vm_fault {
*/
union {
pte_t orig_pte; /* Value of PTE at the time of fault */
+#ifndef CONFIG_SPECULATIVE_PAGE_FAULT
pmd_t orig_pmd; /* Value of PMD at the time of fault,
* used by PMD fault only.
*/
+#endif
};
struct page *cow_page; /* Page handler may use for COW fault */
diff --git a/mm/memory.c b/mm/memory.c
index 37a4b92bd4bf..d0db10bd5bee 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4581,7 +4581,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
* return value. See filemap_fault() and __folio_lock_or_retry().
*/
static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
- unsigned long address, unsigned int flags)
+ unsigned long address, unsigned int flags, unsigned long seq)
{
struct vm_fault vmf = {
.vma = vma,
@@ -4596,6 +4596,79 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
p4d_t *p4d;
vm_fault_t ret;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ if (flags & FAULT_FLAG_SPECULATIVE) {
+ pgd_t pgdval;
+ p4d_t p4dval;
+ pud_t pudval;
+
+ vmf.seq = seq;
+
+ speculative_page_walk_begin();
+ pgd = pgd_offset(mm, address);
+ pgdval = READ_ONCE(*pgd);
+ if (pgd_none(pgdval) || unlikely(pgd_bad(pgdval)))
+ goto spf_fail;
+
+ p4d = p4d_offset(pgd, address);
+ p4dval = READ_ONCE(*p4d);
+ if (p4d_none(p4dval) || unlikely(p4d_bad(p4dval)))
+ goto spf_fail;
+
+ vmf.pud = pud_offset(p4d, address);
+ pudval = READ_ONCE(*vmf.pud);
+ if (pud_none(pudval) || unlikely(pud_bad(pudval)) ||
+ unlikely(pud_trans_huge(pudval)) ||
+ unlikely(pud_devmap(pudval)))
+ goto spf_fail;
+
+ vmf.pmd = pmd_offset(vmf.pud, address);
+ vmf.orig_pmd = READ_ONCE(*vmf.pmd);
+
+ /*
+ * pmd_none could mean that a hugepage collapse is in
+ * progress behind our back, as collapse_huge_page() marks
+ * it before invalidating the pte (which is done once the
+ * IPI is caught by all CPUs and we have interrupts
+ * disabled). For this reason we cannot handle THP in a
+ * speculative way, since we can't safely identify an
+ * in-progress collapse operation done behind our back on
+ * that PMD.
+ */
+ if (unlikely(pmd_none(vmf.orig_pmd) ||
+ is_swap_pmd(vmf.orig_pmd) ||
+ pmd_trans_huge(vmf.orig_pmd) ||
+ pmd_devmap(vmf.orig_pmd)))
+ goto spf_fail;
+
+ /*
+ * The above does not allocate/instantiate page-tables because
+ * doing so would lead to the possibility of instantiating
+ * page-tables after free_pgtables() -- and consequently
+ * leaking them.
+ *
+ * The result is that we take at least one non-speculative
+ * fault per PMD in order to instantiate it.
+ */
+
+ vmf.pte = pte_offset_map(vmf.pmd, address);
+ vmf.orig_pte = READ_ONCE(*vmf.pte);
+ barrier();
+ if (pte_none(vmf.orig_pte)) {
+ pte_unmap(vmf.pte);
+ vmf.pte = NULL;
+ }
+
+ speculative_page_walk_end();
+
+ return handle_pte_fault(&vmf);
+
+ spf_fail:
+ speculative_page_walk_end();
+ return VM_FAULT_RETRY;
+ }
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
pgd = pgd_offset(mm, address);
p4d = p4d_alloc(mm, pgd, address);
if (!p4d)
@@ -4815,7 +4888,7 @@ vm_fault_t do_handle_mm_fault(struct vm_area_struct *vma,
if (unlikely(is_vm_hugetlb_page(vma)))
ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
else
- ret = __handle_mm_fault(vma, address, flags);
+ ret = __handle_mm_fault(vma, address, flags, seq);
if (flags & FAULT_FLAG_USER) {
mem_cgroup_exit_user_fault();
--
2.20.1
Split off the definitions necessary to update event counters from vmstat.h
into a new vm_event.h file.
The rationale is to allow header files included from mm.h to update
counter events. vmstat.h cannot be included from such header files,
because it refers to page_pgdat(), which is only defined further down
in mm.h, and thus results in compile errors. vm_event.h does not refer
to page_pgdat() and thus does not result in such errors.
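As a sketch of the intended use (the helper below is hypothetical, for
illustration only), a header that is itself included from mm.h can now
update counters directly:
	#include <linux/vm_event.h>

	static inline void note_spf_attempt(void)
	{
		count_vm_event(SPF_ATTEMPT);	/* no page_pgdat() dependency */
	}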
Signed-off-by: Michel Lespinasse <[email protected]>
---
include/linux/vm_event.h | 105 +++++++++++++++++++++++++++++++++++++++
include/linux/vmstat.h | 95 +----------------------------------
2 files changed, 106 insertions(+), 94 deletions(-)
create mode 100644 include/linux/vm_event.h
diff --git a/include/linux/vm_event.h b/include/linux/vm_event.h
new file mode 100644
index 000000000000..b3ae108a3841
--- /dev/null
+++ b/include/linux/vm_event.h
@@ -0,0 +1,105 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_VM_EVENT_H
+#define _LINUX_VM_EVENT_H
+
+#include <linux/types.h>
+#include <linux/percpu.h>
+#include <linux/mmzone.h>
+#include <linux/vm_event_item.h>
+#include <linux/atomic.h>
+
+#ifdef CONFIG_VM_EVENT_COUNTERS
+/*
+ * Light weight per cpu counter implementation.
+ *
+ * Counters should only be incremented and no critical kernel component
+ * should rely on the counter values.
+ *
+ * Counters are handled completely inline. On many platforms the code
+ * generated will simply be the increment of a global address.
+ */
+
+struct vm_event_state {
+ unsigned long event[NR_VM_EVENT_ITEMS];
+};
+
+DECLARE_PER_CPU(struct vm_event_state, vm_event_states);
+
+/*
+ * vm counters are allowed to be racy. Use raw_cpu_ops to avoid the
+ * local_irq_disable overhead.
+ */
+static inline void __count_vm_event(enum vm_event_item item)
+{
+ raw_cpu_inc(vm_event_states.event[item]);
+}
+
+static inline void count_vm_event(enum vm_event_item item)
+{
+ this_cpu_inc(vm_event_states.event[item]);
+}
+
+static inline void __count_vm_events(enum vm_event_item item, long delta)
+{
+ raw_cpu_add(vm_event_states.event[item], delta);
+}
+
+static inline void count_vm_events(enum vm_event_item item, long delta)
+{
+ this_cpu_add(vm_event_states.event[item], delta);
+}
+
+extern void all_vm_events(unsigned long *);
+
+extern void vm_events_fold_cpu(int cpu);
+
+#else
+
+/* Disable counters */
+static inline void count_vm_event(enum vm_event_item item)
+{
+}
+static inline void count_vm_events(enum vm_event_item item, long delta)
+{
+}
+static inline void __count_vm_event(enum vm_event_item item)
+{
+}
+static inline void __count_vm_events(enum vm_event_item item, long delta)
+{
+}
+static inline void all_vm_events(unsigned long *ret)
+{
+}
+static inline void vm_events_fold_cpu(int cpu)
+{
+}
+
+#endif /* CONFIG_VM_EVENT_COUNTERS */
+
+#ifdef CONFIG_NUMA_BALANCING
+#define count_vm_numa_event(x) count_vm_event(x)
+#define count_vm_numa_events(x, y) count_vm_events(x, y)
+#else
+#define count_vm_numa_event(x) do {} while (0)
+#define count_vm_numa_events(x, y) do { (void)(y); } while (0)
+#endif /* CONFIG_NUMA_BALANCING */
+
+#ifdef CONFIG_DEBUG_TLBFLUSH
+#define count_vm_tlb_event(x) count_vm_event(x)
+#define count_vm_tlb_events(x, y) count_vm_events(x, y)
+#else
+#define count_vm_tlb_event(x) do {} while (0)
+#define count_vm_tlb_events(x, y) do { (void)(y); } while (0)
+#endif
+
+#ifdef CONFIG_DEBUG_VM_VMACACHE
+#define count_vm_vmacache_event(x) count_vm_event(x)
+#else
+#define count_vm_vmacache_event(x) do {} while (0)
+#endif
+
+#define __count_zid_vm_events(item, zid, delta) \
+ __count_vm_events(item##_NORMAL - ZONE_NORMAL + zid, delta)
+
+#endif /* _LINUX_VM_EVENT_H */
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index bfe38869498d..7c3c892ce89a 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -6,6 +6,7 @@
#include <linux/percpu.h>
#include <linux/mmzone.h>
#include <linux/vm_event_item.h>
+#include <linux/vm_event.h>
#include <linux/atomic.h>
#include <linux/static_key.h>
#include <linux/mmdebug.h>
@@ -40,100 +41,6 @@ enum writeback_stat_item {
NR_VM_WRITEBACK_STAT_ITEMS,
};
-#ifdef CONFIG_VM_EVENT_COUNTERS
-/*
- * Light weight per cpu counter implementation.
- *
- * Counters should only be incremented and no critical kernel component
- * should rely on the counter values.
- *
- * Counters are handled completely inline. On many platforms the code
- * generated will simply be the increment of a global address.
- */
-
-struct vm_event_state {
- unsigned long event[NR_VM_EVENT_ITEMS];
-};
-
-DECLARE_PER_CPU(struct vm_event_state, vm_event_states);
-
-/*
- * vm counters are allowed to be racy. Use raw_cpu_ops to avoid the
- * local_irq_disable overhead.
- */
-static inline void __count_vm_event(enum vm_event_item item)
-{
- raw_cpu_inc(vm_event_states.event[item]);
-}
-
-static inline void count_vm_event(enum vm_event_item item)
-{
- this_cpu_inc(vm_event_states.event[item]);
-}
-
-static inline void __count_vm_events(enum vm_event_item item, long delta)
-{
- raw_cpu_add(vm_event_states.event[item], delta);
-}
-
-static inline void count_vm_events(enum vm_event_item item, long delta)
-{
- this_cpu_add(vm_event_states.event[item], delta);
-}
-
-extern void all_vm_events(unsigned long *);
-
-extern void vm_events_fold_cpu(int cpu);
-
-#else
-
-/* Disable counters */
-static inline void count_vm_event(enum vm_event_item item)
-{
-}
-static inline void count_vm_events(enum vm_event_item item, long delta)
-{
-}
-static inline void __count_vm_event(enum vm_event_item item)
-{
-}
-static inline void __count_vm_events(enum vm_event_item item, long delta)
-{
-}
-static inline void all_vm_events(unsigned long *ret)
-{
-}
-static inline void vm_events_fold_cpu(int cpu)
-{
-}
-
-#endif /* CONFIG_VM_EVENT_COUNTERS */
-
-#ifdef CONFIG_NUMA_BALANCING
-#define count_vm_numa_event(x) count_vm_event(x)
-#define count_vm_numa_events(x, y) count_vm_events(x, y)
-#else
-#define count_vm_numa_event(x) do {} while (0)
-#define count_vm_numa_events(x, y) do { (void)(y); } while (0)
-#endif /* CONFIG_NUMA_BALANCING */
-
-#ifdef CONFIG_DEBUG_TLBFLUSH
-#define count_vm_tlb_event(x) count_vm_event(x)
-#define count_vm_tlb_events(x, y) count_vm_events(x, y)
-#else
-#define count_vm_tlb_event(x) do {} while (0)
-#define count_vm_tlb_events(x, y) do { (void)(y); } while (0)
-#endif
-
-#ifdef CONFIG_DEBUG_VM_VMACACHE
-#define count_vm_vmacache_event(x) count_vm_event(x)
-#else
-#define count_vm_vmacache_event(x) do {} while (0)
-#endif
-
-#define __count_zid_vm_events(item, zid, delta) \
- __count_vm_events(item##_NORMAL - ZONE_NORMAL + zid, delta)
-
/*
* Zone and node-based page accounting with per cpu differentials.
*/
--
2.20.1
Performance tuning: single threaded userspace does not benefit from
speculative page faults, so we turn them off to avoid any related
(small) extra overheads.
Signed-off-by: Michel Lespinasse <[email protected]>
---
arch/x86/mm/fault.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 6ba109413396..d6f8d4967c49 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1328,6 +1328,13 @@ void do_user_addr_fault(struct pt_regs *regs,
#endif
#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ /*
+ * No need to try speculative faults for kernel or
+ * single threaded user space.
+ */
+ if (!(flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
+ goto no_spf;
+
count_vm_event(SPF_ATTEMPT);
seq = mmap_seq_read_start(mm);
if (seq & 1)
@@ -1362,7 +1369,9 @@ void do_user_addr_fault(struct pt_regs *regs,
spf_abort:
count_vm_event(SPF_ABORT);
-#endif
+no_spf:
+
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
/*
* Kernel-mode access to the user address space should only occur
--
2.20.1
Hi Michel,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on linus/master]
[also build test WARNING on v5.17-rc1 next-20220128]
[cannot apply to tip/x86/mm arm64/for-next/core powerpc/next hnaz-mm/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Michel-Lespinasse/Speculative-page-faults/20220128-212122
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 145d9b498fc827b79c1260b4caa29a8e59d4c2b9
config: arm-vt8500_v6_v7_defconfig (https://download.01.org/0day-ci/archive/20220129/[email protected]/config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project 33b45ee44b1f32ffdbc995e6fec806271b4b3ba4)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install arm cross compiling tool for clang build
# apt-get install binutils-arm-linux-gnueabi
# https://github.com/0day-ci/linux/commit/fa5331bae2e49ce86eff959390b451b7401f9156
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Michel-Lespinasse/Speculative-page-faults/20220128-212122
git checkout fa5331bae2e49ce86eff959390b451b7401f9156
# save the config file to linux build tree
mkdir build_dir
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>
All warnings (new ones prefixed by >>):
>> mm/memory.c:3876:20: warning: variable 'vmf' is uninitialized when used within its own initialization [-Wuninitialized]
if (!pte_map_lock(vmf)) {
~~~~~~~~~~~~~^~~~
include/linux/mm.h:3418:25: note: expanded from macro 'pte_map_lock'
struct vm_fault *vmf = __vmf; \
~~~ ^~~~~
1 warning generated.
vim +/vmf +3876 mm/memory.c
3808
3809 /*
3810 * We enter with non-exclusive mmap_lock (to exclude vma changes,
3811 * but allow concurrent faults), and pte mapped but not yet locked.
3812 * We return with mmap_lock still held, but pte unmapped and unlocked.
3813 */
3814 static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
3815 {
3816 struct vm_area_struct *vma = vmf->vma;
3817 struct page *page = NULL;
3818 vm_fault_t ret = 0;
3819 pte_t entry;
3820
3821 /* File mapping without ->vm_ops ? */
3822 if (vma->vm_flags & VM_SHARED)
3823 return VM_FAULT_SIGBUS;
3824
3825 /*
3826 * Use pte_alloc() instead of pte_alloc_map(). We can't run
3827 * pte_offset_map() on pmds where a huge pmd might be created
3828 * from a different thread.
3829 *
3830 * pte_alloc_map() is safe to use under mmap_write_lock(mm) or when
3831 * parallel threads are excluded by other means.
3832 *
3833 * Here we only have mmap_read_lock(mm).
3834 */
3835 if (pte_alloc(vma->vm_mm, vmf->pmd))
3836 return VM_FAULT_OOM;
3837
3838 /* See comment in __handle_mm_fault() */
3839 if (unlikely(pmd_trans_unstable(vmf->pmd)))
3840 return 0;
3841
3842 /* Use the zero-page for reads */
3843 if (!(vmf->flags & FAULT_FLAG_WRITE) &&
3844 !mm_forbids_zeropage(vma->vm_mm)) {
3845 entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
3846 vma->vm_page_prot));
3847 } else {
3848 /* Allocate our own private page. */
3849 if (unlikely(!vma->anon_vma)) {
3850 if (vmf->flags & FAULT_FLAG_SPECULATIVE)
3851 return VM_FAULT_RETRY;
3852 if (__anon_vma_prepare(vma))
3853 goto oom;
3854 }
3855 page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
3856 if (!page)
3857 goto oom;
3858
3859 if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL))
3860 goto oom_free_page;
3861 cgroup_throttle_swaprate(page, GFP_KERNEL);
3862
3863 /*
3864 * The memory barrier inside __SetPageUptodate makes sure that
3865 * preceding stores to the page contents become visible before
3866 * the set_pte_at() write.
3867 */
3868 __SetPageUptodate(page);
3869
3870 entry = mk_pte(page, vma->vm_page_prot);
3871 entry = pte_sw_mkyoung(entry);
3872 if (vma->vm_flags & VM_WRITE)
3873 entry = pte_mkwrite(pte_mkdirty(entry));
3874 }
3875
> 3876 if (!pte_map_lock(vmf)) {
3877 ret = VM_FAULT_RETRY;
3878 goto release;
3879 }
3880 if (!pte_none(*vmf->pte)) {
3881 update_mmu_tlb(vma, vmf->address, vmf->pte);
3882 goto unlock;
3883 }
3884
3885 ret = check_stable_address_space(vma->vm_mm);
3886 if (ret)
3887 goto unlock;
3888
3889 /* Deliver the page fault to userland, check inside PT lock */
3890 if (userfaultfd_missing(vma)) {
3891 pte_unmap_unlock(vmf->pte, vmf->ptl);
3892 if (page)
3893 put_page(page);
3894 if (vmf->flags & FAULT_FLAG_SPECULATIVE)
3895 return VM_FAULT_RETRY;
3896 return handle_userfault(vmf, VM_UFFD_MISSING);
3897 }
3898
3899 if (page) {
3900 inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
3901 page_add_new_anon_rmap(page, vma, vmf->address, false);
3902 lru_cache_add_inactive_or_unevictable(page, vma);
3903 }
3904
3905 set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
3906
3907 /* No need to invalidate - it was non-present before */
3908 update_mmu_cache(vma, vmf->address, vmf->pte);
3909 pte_unmap_unlock(vmf->pte, vmf->ptl);
3910 return 0;
3911 unlock:
3912 pte_unmap_unlock(vmf->pte, vmf->ptl);
3913 release:
3914 if (page)
3915 put_page(page);
3916 return ret;
3917 oom_free_page:
3918 put_page(page);
3919 oom:
3920 return VM_FAULT_OOM;
3921 }
3922
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]
On Sat, Jan 29, 2022 at 05:03:53AM +0800, kernel test robot wrote:
> >> mm/memory.c:3876:20: warning: variable 'vmf' is uninitialized when used within its own initialization [-Wuninitialized]
> if (!pte_map_lock(vmf)) {
> ~~~~~~~~~~~~~^~~~
> include/linux/mm.h:3418:25: note: expanded from macro 'pte_map_lock'
> struct vm_fault *vmf = __vmf; \
> ~~~ ^~~~~
> 1 warning generated.
Ah, that's interesting - this works with gcc, but breaks with clang.
The following amended patch should fix this:
(I only added underscores to the pte_map_lock and pte_spinlock macros)
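To see why clang warns (reduced illustration, not actual kernel code): when
the macro argument is a variable that is itself named vmf, the old macro
body expands into a self-initialization that shadows the caller's pointer:
	/* caller, e.g. in do_anonymous_page(struct vm_fault *vmf): */
	if (!pte_map_lock(vmf)) {
		...

	/* old !CONFIG_SPECULATIVE_PAGE_FAULT macro body, after expansion: */
	({
		struct vm_fault *vmf = vmf;	/* new 'vmf' initialized with itself */
		vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
					       vmf->address, &vmf->ptl);
		true;
	})
The extra underscores keep the macro-local name from colliding with the
caller's variable.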
------------------------------------ 8< ---------------------------------
mm: add pte_map_lock() and pte_spinlock()
pte_map_lock() and pte_spinlock() are used by fault handlers to ensure
the pte is mapped and locked before they commit the faulted page to the
mm's address space at the end of the fault.
The functions differ in their preconditions; pte_map_lock() expects
the pte to be unmapped prior to the call, while pte_spinlock() expects
it to be already mapped.
In the speculative fault case, the functions verify, after locking the pte,
that the mmap sequence count has not changed since the start of the fault,
and thus that no mmap lock writers have been running concurrently with
the fault. After that point the page table lock serializes any further
races with concurrent mmap lock writers.
If the mmap sequence count check fails, both functions will return false
with the pte being left unmapped and unlocked.
Signed-off-by: Michel Lespinasse <[email protected]>
---
include/linux/mm.h | 38 ++++++++++++++++++++++++++
mm/memory.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 104 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2e2122bd3da3..80894db6f01a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3394,5 +3394,43 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
}
#endif
+#ifdef CONFIG_MMU
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+
+bool __pte_map_lock(struct vm_fault *vmf);
+
+static inline bool pte_map_lock(struct vm_fault *vmf)
+{
+ VM_BUG_ON(vmf->pte);
+ return __pte_map_lock(vmf);
+}
+
+static inline bool pte_spinlock(struct vm_fault *vmf)
+{
+ VM_BUG_ON(!vmf->pte);
+ return __pte_map_lock(vmf);
+}
+
+#else /* !CONFIG_SPECULATIVE_PAGE_FAULT */
+
+#define pte_map_lock(____vmf) \
+({ \
+ struct vm_fault *__vmf = ____vmf; \
+ __vmf->pte = pte_offset_map_lock(__vmf->vma->vm_mm, __vmf->pmd, \
+ __vmf->address, &__vmf->ptl); \
+ true; \
+})
+
+#define pte_spinlock(____vmf) \
+({ \
+ struct vm_fault *__vmf = ____vmf; \
+ __vmf->ptl = pte_lockptr(__vmf->vma->vm_mm, __vmf->pmd); \
+ spin_lock(__vmf->ptl); \
+ true; \
+})
+
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+#endif /* CONFIG_MMU */
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
diff --git a/mm/memory.c b/mm/memory.c
index d0db10bd5bee..1ce837e47395 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2745,6 +2745,72 @@ EXPORT_SYMBOL_GPL(apply_to_existing_page_range);
#define speculative_page_walk_end() local_irq_enable()
#endif
+bool __pte_map_lock(struct vm_fault *vmf)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ pmd_t pmdval;
+#endif
+ pte_t *pte = vmf->pte;
+ spinlock_t *ptl;
+
+ if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
+ vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+ if (!pte)
+ vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+ spin_lock(vmf->ptl);
+ return true;
+ }
+
+ speculative_page_walk_begin();
+ if (!mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq))
+ goto fail;
+ /*
+ * The mmap sequence count check guarantees that the page
+ * tables are still valid at that point, and
+ * speculative_page_walk_begin() ensures that they stay around.
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ /*
+ * We check if the pmd value is still the same to ensure that there
+ * is not a huge collapse operation in progress behind our back.
+ */
+ pmdval = READ_ONCE(*vmf->pmd);
+ if (!pmd_same(pmdval, vmf->orig_pmd))
+ goto fail;
+#endif
+ ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+ if (!pte)
+ pte = pte_offset_map(vmf->pmd, vmf->address);
+ /*
+ * Try locking the page table.
+ *
+ * Note that we might race against zap_pte_range() which
+ * invalidates TLBs while holding the page table lock.
+ * We are still under the speculative_page_walk_begin() section,
+ * and zap_pte_range() could thus deadlock with us if we tried
+ * using spin_lock() here.
+ *
+ * We also don't want to retry until spin_trylock() succeeds,
+ * because of the starvation potential against a stream of lockers.
+ */
+ if (unlikely(!spin_trylock(ptl)))
+ goto fail;
+ if (!mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq))
+ goto unlock_fail;
+ speculative_page_walk_end();
+ vmf->pte = pte;
+ vmf->ptl = ptl;
+ return true;
+
+unlock_fail:
+ spin_unlock(ptl);
+fail:
+ if (pte)
+ pte_unmap(pte);
+ speculative_page_walk_end();
+ return false;
+}
+
#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
/*
--
2.20.1
Greeting,
FYI, we noticed the following commit (built with clang-14):
commit: fa5331bae2e49ce86eff959390b451b7401f9156 ("[PATCH v2 18/35] mm: implement speculative handling in do_anonymous_page()")
url: https://github.com/0day-ci/linux/commits/Michel-Lespinasse/Speculative-page-faults/20220128-212122
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 145d9b498fc827b79c1260b4caa29a8e59d4c2b9
patch link: https://lore.kernel.org/linux-mm/[email protected]
in testcase: boot
on test machine: qemu-system-x86_64 -enable-kvm -cpu Icelake-Server -smp 4 -m 16G
caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
+------------------------------------------+------------+------------+
| | b19284b7ad | fa5331bae2 |
+------------------------------------------+------------+------------+
| canonical_address#:#[##] | 0 | 10 |
| RIP:__handle_mm_fault | 0 | 10 |
| Kernel_panic-not_syncing:Fatal_exception | 0 | 10 |
+------------------------------------------+------------+------------+
If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>
[ 331.159834][ T1] rtc-test rtc-test.2: registered as rtc3
[ 331.161803][ T1] sdhci: Secure Digital Host Controller Interface driver
[ 331.162959][ T1] sdhci: Copyright(c) Pierre Ossman
[ 331.165687][ T1] sdhci-pltfm: SDHCI platform and OF driver helper
[ 331.168206][ T1] leds_apu: No PC Engines APUv1 board detected. For APUv2,3 support, enable CONFIG_PCENGINES_APU2
[ 331.179298][ T61] general protection fault, probably for non-canonical address 0xf555515555555555: 0000 [#1] KASAN PTI
[ 331.180173][ T61] KASAN: maybe wild-memory-access in range [0xaaaaaaaaaaaaaaa8-0xaaaaaaaaaaaaaaaf]
[ 331.180173][ T61] CPU: 0 PID: 61 Comm: kworker/u2:1 Not tainted 5.17.0-rc1-00248-gfa5331bae2e4 #1 48e2d12faa7f614111ba8a377c1a6d47b436f5c7
[ 331.180173][ T61] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 331.180173][ T61] RIP: 0010:__handle_mm_fault (memory.c:?)
[ 331.180173][ T61] Code: 0c 00 4c 89 f0 48 83 c8 42 41 f6 04 24 02 49 0f 44 c6 48 89 45 c0 48 b8 55 55 55 55 55 51 55 f5 49 bf aa aa aa aa aa aa aa aa <80> 38 00 74 08 4c 89 ff e8 43 2e 0c 00 49 8b 1f 48 83 c3 40 48 89
All code
========
0: 0c 00 or $0x0,%al
2: 4c 89 f0 mov %r14,%rax
5: 48 83 c8 42 or $0x42,%rax
9: 41 f6 04 24 02 testb $0x2,(%r12)
e: 49 0f 44 c6 cmove %r14,%rax
12: 48 89 45 c0 mov %rax,-0x40(%rbp)
16: 48 b8 55 55 55 55 55 movabs $0xf555515555555555,%rax
1d: 51 55 f5
20: 49 bf aa aa aa aa aa movabs $0xaaaaaaaaaaaaaaaa,%r15
27: aa aa aa
2a:* 80 38 00 cmpb $0x0,(%rax) <-- trapping instruction
2d: 74 08 je 0x37
2f: 4c 89 ff mov %r15,%rdi
32: e8 43 2e 0c 00 callq 0xc2e7a
37: 49 8b 1f mov (%r15),%rbx
3a: 48 83 c3 40 add $0x40,%rbx
3e: 48 rex.W
3f: 89 .byte 0x89
Code starting with the faulting instruction
===========================================
0: 80 38 00 cmpb $0x0,(%rax)
3: 74 08 je 0xd
5: 4c 89 ff mov %r15,%rdi
8: e8 43 2e 0c 00 callq 0xc2e50
d: 49 8b 1f mov (%r15),%rbx
10: 48 83 c3 40 add $0x40,%rbx
14: 48 rex.W
15: 89 .byte 0x89
[ 331.180173][ T61] RSP: 0000:ffffc9000101fab0 EFLAGS: 00010202
[ 331.180173][ T61] RAX: f555515555555555 RBX: 00000003ed304000 RCX: 0000000000000000
[ 331.180173][ T61] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff96587108
[ 331.180173][ T61] RBP: ffffc9000101fbd0 R08: dffffc0000000000 R09: fffff94001f69821
[ 331.180173][ T61] R10: dffff54001f69822 R11: 1ffffd4001f69820 R12: ffff88815ece4058
[ 331.180173][ T61] R13: 1ffff1102bd9c80b R14: 80000003ed304025 R15: aaaaaaaaaaaaaaaa
[ 331.180173][ T61] FS: 0000000000000000(0000) GS:ffffffff95883000(0000) knlGS:0000000000000000
[ 331.180173][ T61] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 331.180173][ T61] CR2: ffff88843ffff000 CR3: 000000001a636001 CR4: 0000000000170eb0
[ 331.180173][ T61] Call Trace:
[ 331.180173][ T61] <TASK>
[ 331.180173][ T61] do_handle_mm_fault (??:?)
[ 331.180173][ T61] __get_user_pages (gup.c:?)
[ 331.180173][ T61] __get_user_pages_remote (gup.c:?)
[ 331.180173][ T61] get_user_pages_remote (??:?)
To reproduce:
# build kernel
cd linux
cp config-5.17.0-rc1-00248-gfa5331bae2e4 .config
make HOSTCC=clang-14 CC=clang-14 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage modules
make HOSTCC=clang-14 CC=clang-14 ARCH=x86_64 INSTALL_MOD_PATH=<mod-install-dir> modules_install
cd <mod-install-dir>
find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email
# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.
---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation
Thanks,
Oliver Sang
On Sun, Jan 30, 2022 at 10:54:49AM +0800, kernel test robot wrote:
> +------------------------------------------+------------+------------+
> | | b19284b7ad | fa5331bae2 |
> +------------------------------------------+------------+------------+
> | canonical_address#:#[##] | 0 | 10 |
> | RIP:__handle_mm_fault | 0 | 10 |
> | Kernel_panic-not_syncing:Fatal_exception | 0 | 10 |
> +------------------------------------------+------------+------------+
Thanks. I think this might be the direct result of the uninitialized
vmf variable as previously reported in
https://lore.kernel.org/linux-mm/[email protected]/
On Fri, Jan 28, 2022 at 05:10:04AM -0800, Michel Lespinasse wrote:
> Attempt speculative mm fault handling first, and fall back to the
> existing (non-speculative) code if that fails.
>
> This follows the lines of the x86 speculative fault handling code,
> but with some minor arch differences such as the way that the
> VM_FAULT_BADACCESS case is handled.
>
> Signed-off-by: Michel Lespinasse <[email protected]>
> ---
> arch/arm64/mm/fault.c | 62 +++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 62 insertions(+)
>
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 77341b160aca..2598795f4e70 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -25,6 +25,7 @@
> #include <linux/perf_event.h>
> #include <linux/preempt.h>
> #include <linux/hugetlb.h>
> +#include <linux/vm_event_item.h>
>
> #include <asm/acpi.h>
> #include <asm/bug.h>
> @@ -524,6 +525,11 @@ static int __kprobes do_page_fault(unsigned long far, unsigned int esr,
> unsigned long vm_flags;
> unsigned int mm_flags = FAULT_FLAG_DEFAULT;
> unsigned long addr = untagged_addr(far);
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + struct vm_area_struct *vma;
> + struct vm_area_struct pvma;
> + unsigned long seq;
> +#endif
>
> if (kprobe_page_fault(regs, esr))
> return 0;
> @@ -574,6 +580,59 @@ static int __kprobes do_page_fault(unsigned long far, unsigned int esr,
>
> perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
>
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + /*
> + * No need to try speculative faults for kernel or
> + * single threaded user space.
> + */
> + if (!(mm_flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
> + goto no_spf;
> +
> + count_vm_event(SPF_ATTEMPT);
> + seq = mmap_seq_read_start(mm);
> + if (seq & 1) {
> + count_vm_spf_event(SPF_ABORT_ODD);
> + goto spf_abort;
> + }
> + rcu_read_lock();
> + vma = __find_vma(mm, addr);
> + if (!vma || vma->vm_start > addr) {
> + rcu_read_unlock();
> + count_vm_spf_event(SPF_ABORT_UNMAPPED);
> + goto spf_abort;
> + }
> + if (!vma_is_anonymous(vma)) {
> + rcu_read_unlock();
> + count_vm_spf_event(SPF_ABORT_NO_SPECULATE);
> + goto spf_abort;
> + }
> + pvma = *vma;
> + rcu_read_unlock();
> + if (!mmap_seq_read_check(mm, seq, SPF_ABORT_VMA_COPY))
> + goto spf_abort;
> + vma = &pvma;
> + if (!(vma->vm_flags & vm_flags)) {
> + count_vm_spf_event(SPF_ABORT_ACCESS_ERROR);
> + goto spf_abort;
> + }
> + fault = do_handle_mm_fault(vma, addr & PAGE_MASK,
> + mm_flags | FAULT_FLAG_SPECULATIVE, seq, regs);
> +
> + /* Quick path to respond to signals */
> + if (fault_signal_pending(fault, regs)) {
> + if (!user_mode(regs))
> + goto no_context;
> + return 0;
> + }
> + if (!(fault & VM_FAULT_RETRY))
> + goto done;
> +
> +spf_abort:
> + count_vm_event(SPF_ABORT);
> +no_spf:
> +
> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
The speculative page fault implementation here (and for PowerPC as well)
looks very similar to x86. Can we factor it out rather than copy 3 (or
more) times?
> +
> /*
> * As per x86, we may deadlock here. However, since the kernel only
> * validly references user space from well defined areas of the code,
> @@ -612,6 +671,9 @@ static int __kprobes do_page_fault(unsigned long far, unsigned int esr,
> goto retry;
> }
> mmap_read_unlock(mm);
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +done:
> +#endif
>
> /*
> * Handle the "normal" (no error) case first.
> --
> 2.20.1
>
>
--
Sincerely yours,
Mike.
On Sun, Jan 30, 2022 at 11:13:26AM +0200, Mike Rapoport wrote:
> The speculative page fault implementation here (and for PowerPC as well)
> looks very similar to x86. Can we factor it out rather than copy 3 (or
> more) times?
In each arch, the speculative code was written along the lines of the
existing non-speculative code, so that behavior would be unchanged
when speculation succeeds.
Now each arch's existing, non-speculative code paths are quite similar,
but they do have small differences as to how they implement various
permission checks, protection keys and the like. The same small
differences end up being reflected in the new speculative code paths.
I agree it would be nice if this code could be unified between archs,
but IMO this should start with the existing non-speculative code -
I don't think it would make sense to try unifying the new speculative
code while trying to follow the behavior of the non-unified old
non-speculative code paths...
On 28.01.22 14:09, Michel Lespinasse wrote:
Hi Michel,
> This patchset is my take on speculative page faults (spf).
> It builds on ideas that have been previously proposed by Laurent Dufour,
> Peter Zijlstra and others before. While Laurent's previous proposal
> was rejected around the time of LSF/MM 2019, I am hoping we can revisit
> this now based on what I think is a simpler and more bisectable approach,
> much improved scaling numbers in the anonymous vma case, and the Android
> use case that has since emerged. I will expand on these points towards
> the end of this message.
>
> The patch series applies on top of linux v5.17-rc1;
> a git tree is also available:
> git fetch https://github.com/lespinasse/linux.git v5.17-rc1-spf-anon
>
> I would like these patches to be considered for inclusion into v5.18.
Just a general note: we certainly need (much more) review. And I think
we'll have to make a decision if the maintenance effort + complexity
will be worth the benefit.
> Several android vendors are using Laurent Dufour's previous SPF work into
> their kernel tree in order to improve application startup performance,
> want to converge to an upstream accepted solution, and have reported good
> numbers with previous versions of this patchset. Also, there is a broader
> interest into reducing mmap lock dependencies in critical MM paths,
> and I think this patchset would be a good first step in that direction.
>
>
> This patchset follows the same overall structure as the v1 proposal,
> with the following differences:
> - Commit 12 (mm: separate mmap locked assertion from find_vma) is new.
> - The mmu notifier lock is new; this fixes a race in v1 patchset
> between speculative COW faults and registering new MMU notifiers.
> - Speculative handling of swap-cache pages has been removed.
> - Commit 30 is new; this fixes build issues that showed in some configs.
>
>
> In principle it would also be possible to extend this work for handling
> file mapped vmas; I have pending work on such patches too but they are
> not mature enough to be submitted for inclusion at this point.
>
I'd have expected a performance evaluation at this point, to highlight
the possible benefit and any potential downsides.
>
> Patchset summary:
>
> Classical page fault processing takes the mmap read lock in order to
> prevent races with mmap writers. In contrast, speculative fault
> processing does not take the mmap read lock, and instead verifies,
> when the results of the page fault are about to get committed and
> become visible to other threads, that no mmap writers have been
> running concurrently with the page fault. If the check fails,
> speculative updates do not get committed and the fault is retried
> in the usual, non-speculative way (with the mmap read lock held).
>
> The concurrency check is implemented using a per-mm mmap sequence count.
> The counter is incremented at the beginning and end of each mmap write
> operation. If the counter is initially observed to have an even value,
> and has the same value later on, the observer can deduce that no mmap
> writers have been running concurrently with it between those two times.
> This is similar to a seqlock, except that readers never spin on the
> counter value (they would instead revert to taking the mmap read lock),
> and writers are allowed to sleep. One benefit of this approach is that
> it requires no writer side changes, just some hooks in the mmap write
> lock APIs that writers already use.
>
> The first step of a speculative page fault is to look up the vma and
> read its contents (currently by making a copy of the vma, though in
> principle it would be sufficient to only read the vma attributes that
> are used in page faults). The mmap sequence count is used to verify
> that there were no mmap writers concurrent to the lookup and copy steps.
> Note that walking rbtrees while there may potentially be concurrent
> writers is not an entirely new idea in linux, as latched rbtrees
> are already doing this. This is safe as long as the lookup is
> followed by a sequence check to verify that concurrency did not
> actually occur (and abort the speculative fault if it did).
>
> The next step is to walk down the existing page table tree to find the
> current pte entry. This is done with interrupts disabled to avoid
> races with munmap(). Again, not an entirely new idea, as this repeats
> a pattern already present in fast GUP. Similar precautions are also
> taken when taking the page table lock.
>
> Breaking COW on an existing mapping may require firing MMU notifiers.
> Some care is required to avoid racing with registering new notifiers.
> This patchset adds a new per-cpu rwsem to handle this situation.
I have to admit that this sounds complicated and possibly dangerous to me.
Here is one of my concerns, I hope you can clarify:
GUP-fast only ever walks page tables and doesn't actually modify any
page table state; in particular, it does not take page table locks, which
might not reside in the memmap directly but in auxiliary data. It works
because we only ever drop the last reference to a page table (to free it)
after we have synchronized against GUP-fast, either via an IPI or
synchronize_rcu(), as GUP-fast disables interrupts.
I'd assume that taking page table locks on page tables that might no
longer be spanned by a VMA because of concurrent page table
deconstruction is dangerous:
On munmap(), we do the VMA update under mmap_lock in write mode, then
remove the page tables under mmap_lock in read mode.
Let's take a look at free_pte_range() on x86:
free_pte_range()
-> pte_free_tlb()
   -> tlb_flush_pmd_range()
      -> __tlb_adjust_range()
         /* Doesn't actually flush but only updates the tlb range */
   -> __pte_free_tlb()
      -> ___pte_free_tlb()
         -> pgtable_pte_page_dtor()
            -> ptlock_free()
               /* page table lock was freed */
         -> paravirt_tlb_remove_table()
            -> tlb_remove_page()
               -> tlb_remove_page_size()
                  -> __tlb_remove_page_size()
                     /* Page added to TLB batch flushing+freeing */
The later tlb_flush_mmu(), via tlb_flush_mmu_free()->tlb_table_flush(),
will then free the page tables, after synchronizing against GUP-fast. But
at that point we have already deconstructed the page tables.
So just reading your summary here, what prevents your approach from taking
a page table lock that races against page table lock freeing? I cannot
see how a seqcount would help.
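Spelled out as an interleaving (just a sketch of the scenario I have in
mind, based on the chain above):
	CPU 0 (speculative fault)          CPU 1 (munmap)
	-------------------------          --------------
	ptl = pte_lockptr(mm, pmd);
	                                   free_pte_range()
	                                     ...
	                                       -> ptlock_free()
	                                          /* lock storage freed */
	spin_trylock(ptl);                 /* may touch freed memory */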
IIUC, with what you propose we cannot easily have auxiliary data for a
page table, at least not via the current pgtable_pte_page_dtor(), including
page table locks, which is a drawback (and possibly a BUG in your current
code?) at least for me. But I only read the cover letter, so I might be
missing something important :)
--
Thanks,
David / dhildenb
* Michel Lespinasse <[email protected]> [220128 08:10]:
> In the mmap locking API, the *_killable() functions return an error
> (or 0 on success), and the *_trylock() functions return a boolean
> (true on success).
>
> Rename the return values "int error" and "bool ok", respectively,
> rather than using "ret" for both cases which I find less readable.
Would it be better to add function documentation regarding the return
types? I think changing the variables does help, but putting a block
with Return: <what's above> would work best.
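Something along these lines, as a sketch of what I mean (wording is
illustrative only, not a finished comment):
	/**
	 * mmap_write_lock_killable() - acquire the mmap lock for writing, killable
	 * @mm: the mm_struct to lock
	 *
	 * Return: 0 on success, or -EINTR if a fatal signal arrived while
	 * waiting for the lock.
	 */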
>
> Signed-off-by: Michel Lespinasse <[email protected]>
> ---
> include/linux/mmap_lock.h | 32 ++++++++++++++++----------------
> 1 file changed, 16 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> index db9785e11274..1b14468183d7 100644
> --- a/include/linux/mmap_lock.h
> +++ b/include/linux/mmap_lock.h
> @@ -81,22 +81,22 @@ static inline void mmap_write_lock_nested(struct mm_struct *mm, int subclass)
>
> static inline int mmap_write_lock_killable(struct mm_struct *mm)
> {
> - int ret;
> + int error;
>
> __mmap_lock_trace_start_locking(mm, true);
> - ret = down_write_killable(&mm->mmap_lock);
> - __mmap_lock_trace_acquire_returned(mm, true, ret == 0);
> - return ret;
> + error = down_write_killable(&mm->mmap_lock);
> + __mmap_lock_trace_acquire_returned(mm, true, !error);
> + return error;
> }
>
> static inline bool mmap_write_trylock(struct mm_struct *mm)
> {
> - bool ret;
> + bool ok;
>
> __mmap_lock_trace_start_locking(mm, true);
> - ret = down_write_trylock(&mm->mmap_lock) != 0;
> - __mmap_lock_trace_acquire_returned(mm, true, ret);
> - return ret;
> + ok = down_write_trylock(&mm->mmap_lock) != 0;
> + __mmap_lock_trace_acquire_returned(mm, true, ok);
> + return ok;
> }
>
> static inline void mmap_write_unlock(struct mm_struct *mm)
> @@ -120,22 +120,22 @@ static inline void mmap_read_lock(struct mm_struct *mm)
>
> static inline int mmap_read_lock_killable(struct mm_struct *mm)
> {
> - int ret;
> + int error;
>
> __mmap_lock_trace_start_locking(mm, false);
> - ret = down_read_killable(&mm->mmap_lock);
> - __mmap_lock_trace_acquire_returned(mm, false, ret == 0);
> - return ret;
> + error = down_read_killable(&mm->mmap_lock);
> + __mmap_lock_trace_acquire_returned(mm, false, !error);
> + return error;
> }
>
> static inline bool mmap_read_trylock(struct mm_struct *mm)
> {
> - bool ret;
> + bool ok;
>
> __mmap_lock_trace_start_locking(mm, false);
> - ret = down_read_trylock(&mm->mmap_lock) != 0;
> - __mmap_lock_trace_acquire_returned(mm, false, ret);
> - return ret;
> + ok = down_read_trylock(&mm->mmap_lock) != 0;
> + __mmap_lock_trace_acquire_returned(mm, false, ok);
> + return ok;
> }
>
> static inline void mmap_read_unlock(struct mm_struct *mm)
> --
> 2.20.1
>
On Mon, Jan 31, 2022 at 1:56 AM David Hildenbrand <[email protected]> wrote:
>
> On 28.01.22 14:09, Michel Lespinasse wrote:
>
> Hi Michel,
>
> > This patchset is my take on speculative page faults (spf).
> > It builds on ideas that have been previously proposed by Laurent Dufour,
> > Peter Zijlstra and others before. While Laurent's previous proposal
> > was rejected around the time of LSF/MM 2019, I am hoping we can revisit
> > this now based on what I think is a simpler and more bisectable approach,
> > much improved scaling numbers in the anonymous vma case, and the Android
> > use case that has since emerged. I will expand on these points towards
> > the end of this message.
> >
> > The patch series applies on top of linux v5.17-rc1;
> > a git tree is also available:
> > git fetch https://github.com/lespinasse/linux.git v5.17-rc1-spf-anon
> >
> > I would like these patches to be considered for inclusion into v5.18.
>
> Just a general note: we certainly need (much more) review. And I think
> we'll have to make a decision if the maintenance effort + complexity
> will be worth the benefit.
>
> > Several android vendors are using Laurent Dufour's previous SPF work into
> > their kernel tree in order to improve application startup performance,
> > want to converge to an upstream accepted solution, and have reported good
> > numbers with previous versions of this patchset. Also, there is a broader
> > interest into reducing mmap lock dependencies in critical MM paths,
> > and I think this patchset would be a good first step in that direction.
> >
> >
> > This patchset follows the same overall structure as the v1 proposal,
> > with the following differences:
> > - Commit 12 (mm: separate mmap locked assertion from find_vma) is new.
> > - The mmu notifier lock is new; this fixes a race in v1 patchset
> > between speculative COW faults and registering new MMU notifiers.
> > - Speculative handling of swap-cache pages has been removed.
> > - Commit 30 is new; this fixes build issues that showed in some configs.
> >
> >
> > In principle it would also be possible to extend this work for handling
> > file mapped vmas; I have pending work on such patches too but they are
> > not mature enough to be submitted for inclusion at this point.
> >
>
> I'd have expected a performance evaluation at this point, to highlight
> the possible benefits and any downsides.
Hi David,
In Android, we and several Android vendors reported application startup
time improvements (a critical metric in the Android world) on the previous
SPF posting.
My test results were included in the cover letter:
https://lore.kernel.org/lkml/[email protected]/T/#m23c5cb33b1a04979c792db6ddd7e3245e5f86bcb
Android vendors reported their results on the same thread:
https://lore.kernel.org/lkml/[email protected]/T/#m8eb304b67c9a33388e2fe4448a04a74879120b34
https://lore.kernel.org/lkml/[email protected]/T/#maaa58f7072732e5a2a77fe9f65dd3e444c2aed04
And Axel ran pft (pagefault test) benchmarks on server class machines
with results reported here:
https://lore.kernel.org/lkml/[email protected]/T/#mc3965e87a702c67909a078a67f8f7964d707b2e0
The Android performance team recently reported a case where a
low-end device was having visible performance issues and, after
applying SPF, the device became usable. I'm CC'ing Tim Murray from that
team to provide more information if possible.
As a side-note, an older version of SPF has been used for several
years on Android and many vendors specifically requested us to include
it in our kernels. It is currently maintained in Android Common Kernel
as an out-of-tree patchset, and getting it upstream would be huge for
us, both in terms of getting more testing in a wider ecosystem and in
terms of maintenance effort.
Thanks,
Suren.
>
> >
> > Patchset summary:
> >
> > Classical page fault processing takes the mmap read lock in order to
> > prevent races with mmap writers. In contrast, speculative fault
> > processing does not take the mmap read lock, and instead verifies,
> > when the results of the page fault are about to get committed and
> > become visible to other threads, that no mmap writers have been
> > running concurrently with the page fault. If the check fails,
> > speculative updates do not get committed and the fault is retried
> > in the usual, non-speculative way (with the mmap read lock held).
> >
> > The concurrency check is implemented using a per-mm mmap sequence count.
> > The counter is incremented at the beginning and end of each mmap write
> > operation. If the counter is initially observed to have an even value,
> > and has the same value later on, the observer can deduce that no mmap
> > writers have been running concurrently with it between those two times.
> > This is similar to a seqlock, except that readers never spin on the
> > counter value (they would instead revert to taking the mmap read lock),
> > and writers are allowed to sleep. One benefit of this approach is that
> > it requires no writer side changes, just some hooks in the mmap write
> > lock APIs that writers already use.
> >
> > The first step of a speculative page fault is to look up the vma and
> > read its contents (currently by making a copy of the vma, though in
> > principle it would be sufficient to only read the vma attributes that
> > are used in page faults). The mmap sequence count is used to verify
> > that there were no mmap writers concurrent to the lookup and copy steps.
> > Note that walking rbtrees while there may potentially be concurrent
> > writers is not an entirely new idea in linux, as latched rbtrees
> > are already doing this. This is safe as long as the lookup is
> > followed by a sequence check to verify that concurrency did not
> > actually occur (and abort the speculative fault if it did).
> >
> > The next step is to walk down the existing page table tree to find the
> > current pte entry. This is done with interrupts disabled to avoid
> > races with munmap(). Again, not an entirely new idea, as this repeats
> > a pattern already present in fast GUP. Similar precautions are also
> > taken when taking the page table lock.
> >
> > Breaking COW on an existing mapping may require firing MMU notifiers.
> > Some care is required to avoid racing with registering new notifiers.
> > This patchset adds a new per-cpu rwsem to handle this situation.
>
> I have to admit that this sounds complicated and possibly dangerous to me.
>
>
> Here is one of my concerns, I hope you can clarify:
>
> GUP-fast only ever walks page tables and doesn't actually modify any
> page table state; in particular, it doesn't take page table locks, which
> might not reside in the memmap directly but in auxiliary data. It works
> because we only ever drop the last reference to a page table (to free it)
> after we have synchronized against GUP-fast either via an IPI or
> synchronize_rcu(), as GUP-fast disables interrupts.
>
>
> I'd assume that taking page table locks on page tables that might no
> longer be spanned by a VMA because of concurrent page table
> deconstruction is dangerous:
>
>
> On munmap(), we do the VMA update under mmap_lock in write mode, and then
> remove the page tables under mmap_lock in read mode.
>
> Let's take a look at free_pte_range() on x86:
>
> free_pte_range()
> -> pte_free_tlb()
> -> tlb_flush_pmd_range()
> -> __tlb_adjust_range()
> /* Doesn't actually flush but only updates the tlb range */
> -> __pte_free_tlb()
> -> ___pte_free_tlb()
> -> pgtable_pte_page_dtor()
> -> ptlock_free()
> /* page table lock was freed */
> -> paravirt_tlb_remove_table()
> -> tlb_remove_page()
> -> tlb_remove_page_size()
> -> __tlb_remove_page_size()
> /* Page added to TLB batch flushing+freeing */
>
> The later tlb_flush_mmu() via tlb_flush_mmu_free()->tlb_table_flush()
> will then free the page tables, after synchronizing against GUP-fast. But
> at that point we have already deconstructed the page tables.
>
> So just reading your summary here, what prevents your approach from taking
> a page table lock that races against the page table lock being freed? I
> cannot see how a seqcount would help.
>
>
> IIUC, with what you propose we cannot easily have auxiliary data for a
> page table, at least not via the current pgtable_pte_page_dtor(), including
> page table locks, which is a drawback (and currently possibly a BUG in your
> code?) at least for me. But I only read the cover letter, so I might be
> missing something important :)
>
> --
> Thanks,
>
> David / dhildenb
>
On Sat, Jan 29, 2022 at 4:13 AM Hillf Danton <[email protected]> wrote:
>
> On Fri, 28 Jan 2022 05:09:53 -0800 Michel Lespinasse wrote:
> > +
> > +static LIST_HEAD(destroy_list);
> > +static DEFINE_SPINLOCK(destroy_list_lock);
>
> static bool destroyer_running;
>
> > +
> > +static void destroy_list_workfn(struct work_struct *work)
> > +{
> > + struct percpu_rw_semaphore *sem, *sem2;
> > + LIST_HEAD(to_destroy);
> > +
>
> again:
>
> > + spin_lock(&destroy_list_lock);
>
> if (list_empty(&destroy_list)) {
> destroyer_running = false;
> spin_unlock(&destroy_list_lock);
> return;
> }
> destroyer_running = true;
>
> > + list_splice_init(&destroy_list, &to_destroy);
> > + spin_unlock(&destroy_list_lock);
> > +
> > + if (list_empty(&to_destroy))
> > + return;
> > +
> > + list_for_each_entry_safe(sem, sem2, &to_destroy, destroy_list_entry) {
>
> list_del(&sem->destroy_list_entry);
>
> > + percpu_free_rwsem(sem);
> > + kfree(sem);
> > + }
>
> goto again;
> > +}
> > +
> > +static DECLARE_WORK(destroy_list_work, destroy_list_workfn);
> > +
> > +void percpu_rwsem_async_destroy(struct percpu_rw_semaphore *sem)
> > +{
> > + spin_lock(&destroy_list_lock);
> > + list_add_tail(&sem->destroy_list_entry, &destroy_list);
> > + spin_unlock(&destroy_list_lock);
> > + schedule_work(&destroy_list_work);
>
> Nits
> spin_lock(&destroy_list_lock);
> 1/ /* LIFO */
> list_add(&sem->destroy_list_entry, &destroy_list);
> 2/ /* spawn worker if it is idle */
> if (!destroyer_running)
> 3/ /* this is not critical work */
> queue_work(system_unbound_wq, &destroy_list_work);
> spin_unlock(&destroy_list_lock);
Thanks for the review! Just to clarify, are you suggesting
simplifications to the current patch or do you see a functional issue?
> > +}
> > --
> > 2.20.1
>
On Fri, 28 Jan 2022 05:09:31 -0800 Michel Lespinasse <[email protected]> wrote:
> Patchset summary:
>
> Classical page fault processing takes the mmap read lock in order to
> prevent races with mmap writers. In contrast, speculative fault
> processing does not take the mmap read lock, and instead verifies,
> when the results of the page fault are about to get committed and
> become visible to other threads, that no mmap writers have been
> running concurrently with the page fault. If the check fails,
> speculative updates do not get committed and the fault is retried
> in the usual, non-speculative way (with the mmap read lock held).
>
> The concurrency check is implemented using a per-mm mmap sequence count.
> The counter is incremented at the beginning and end of each mmap write
> operation. If the counter is initially observed to have an even value,
> and has the same value later on, the observer can deduce that no mmap
> writers have been running concurrently with it between those two times.
> This is similar to a seqlock, except that readers never spin on the
> counter value (they would instead revert to taking the mmap read lock),
> and writers are allowed to sleep. One benefit of this approach is that
> it requires no writer side changes, just some hooks in the mmap write
> lock APIs that writers already use.
>
> The first step of a speculative page fault is to look up the vma and
> read its contents (currently by making a copy of the vma, though in
> principle it would be sufficient to only read the vma attributes that
> are used in page faults). The mmap sequence count is used to verify
> that there were no mmap writers concurrent to the lookup and copy steps.
> Note that walking rbtrees while there may potentially be concurrent
> writers is not an entirely new idea in linux, as latched rbtrees
> are already doing this. This is safe as long as the lookup is
> followed by a sequence check to verify that concurrency did not
> actually occur (and abort the speculative fault if it did).
I'm surprised that descending the rbtree locklessly doesn't flat-out
oops the kernel. How are we assured that every pointer which is
encountered actually points at the right thing? Against things
which tear that tree down?
> The next step is to walk down the existing page table tree to find the
> current pte entry. This is done with interrupts disabled to avoid
> races with munmap().
Sebastian, could you please comment on this from the CONFIG_PREEMPT_RT
point of view?
> Again, not an entirely new idea, as this repeats
> a pattern already present in fast GUP. Similar precautions are also
> taken when taking the page table lock.
>
> Breaking COW on an existing mapping may require firing MMU notifiers.
> Some care is required to avoid racing with registering new notifiers.
> This patchset adds a new per-cpu rwsem to handle this situation.
On Mon, Jan 31, 2022 at 05:14:34PM -0800, Andrew Morton wrote:
> On Fri, 28 Jan 2022 05:09:31 -0800 Michel Lespinasse <[email protected]> wrote:
> > The first step of a speculative page fault is to look up the vma and
> > read its contents (currently by making a copy of the vma, though in
> > principle it would be sufficient to only read the vma attributes that
> > are used in page faults). The mmap sequence count is used to verify
> > that there were no mmap writers concurrent to the lookup and copy steps.
> > Note that walking rbtrees while there may potentially be concurrent
> > writers is not an entirely new idea in linux, as latched rbtrees
> > are already doing this. This is safe as long as the lookup is
> > followed by a sequence check to verify that concurrency did not
> > actually occur (and abort the speculative fault if it did).
>
> I'm surprised that descending the rbtree locklessly doesn't flat-out
> oops the kernel. How are we assured that every pointer which is
> encountered actually points at the right thing? Against things
> which tear that tree down?
It doesn't necessarily point at the _right_ thing. You may get
entirely the wrong node in the tree if you race with a modification,
but, as Michel says, you check the seqcount before you even look at
the VMA (and if the seqcount indicates a modification, you throw away
the result and fall back to the locked version). The rbtree always
points to other rbtree nodes, so you aren't going to walk into some
completely wrong data structure.
> > The next step is to walk down the existing page table tree to find the
> > current pte entry. This is done with interrupts disabled to avoid
> > races with munmap().
>
> Sebastian, could you please comment on this from the CONFIG_PREEMPT_RT
> point of view?
I am not a fan of this approach. For other reasons, I think we want to
switch to RCU-freed page tables, and then we can walk the page tables
with the RCU lock held. Some architectures already RCU-free the page
tables, so I think it's just a matter of converting the rest.
On Mon, Jan 31, 2022 at 12:07:29AM -0800, Michel Lespinasse wrote:
> On Sun, Jan 30, 2022 at 11:13:26AM +0200, Mike Rapoport wrote:
> > The speculative page fault implementation here (and for PowerPC as well)
> > looks very similar to x86. Can we factor it out rather than copy 3 (or
> > more) times?
>
> In each arch, the speculative code was written along the lines of the
> existing non-speculative code, so that behavior would be unchanged
> when speculation succeeds.
>
> Now each arch's existing, non-speculative code paths are quite similar,
> but they do have small differences as to how they implement various
> permission checks, protection keys and the like. The same small
> differences end up being reflected in the new speculative code paths.
>
> I agree it would be nice if this code could be unified between archs,
> but IMO this should start with the existing non-speculative code -
> I don't think it would make sense to try unifying the new speculative
> code while trying to follow the behavior of the non-unified old
> non-speculative code paths...
Then maybe this unification can be done as the ground work for the
speculative page fault handling?
--
Sincerely yours,
Mike.
* Michel Lespinasse <[email protected]> [220128 08:10]:
> Attempt speculative mm fault handling first, and fall back to the
> existing (non-speculative) code if that fails.
>
> The speculative handling closely mirrors the non-speculative logic.
> This includes some x86 specific bits such as the access_error() call.
> This is why we chose to implement the speculative handling in arch/x86
> rather than in common code.
>
> The vma is first looked up and copied, under protection of the rcu
> read lock. The mmap lock sequence count is used to verify the
> integrity of the copied vma, and passed to do_handle_mm_fault() to
> allow checking against races with mmap writers when finalizing the fault.
>
> Signed-off-by: Michel Lespinasse <[email protected]>
> ---
> arch/x86/mm/fault.c | 44 +++++++++++++++++++++++++++++++++++
> include/linux/mm_types.h | 5 ++++
> include/linux/vm_event_item.h | 4 ++++
> mm/vmstat.c | 4 ++++
> 4 files changed, 57 insertions(+)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index d0074c6ed31a..99b0a358154e 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1226,6 +1226,10 @@ void do_user_addr_fault(struct pt_regs *regs,
> struct mm_struct *mm;
> vm_fault_t fault;
> unsigned int flags = FAULT_FLAG_DEFAULT;
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + struct vm_area_struct pvma;
> + unsigned long seq;
> +#endif
>
> tsk = current;
> mm = tsk->mm;
> @@ -1323,6 +1327,43 @@ void do_user_addr_fault(struct pt_regs *regs,
> }
> #endif
>
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + count_vm_event(SPF_ATTEMPT);
> + seq = mmap_seq_read_start(mm);
> + if (seq & 1)
> + goto spf_abort;
> + rcu_read_lock();
> + vma = __find_vma(mm, address);
> + if (!vma || vma->vm_start > address) {
This fits the vma_lookup() pattern - although you will have to work
around the locking issue still. This is the same for the other
platforms too; they fit the pattern also.
> + rcu_read_unlock();
> + goto spf_abort;
> + }
> + pvma = *vma;
> + rcu_read_unlock();
> + if (!mmap_seq_read_check(mm, seq))
> + goto spf_abort;
> + vma = &pvma;
> + if (unlikely(access_error(error_code, vma)))
> + goto spf_abort;
> + fault = do_handle_mm_fault(vma, address,
> + flags | FAULT_FLAG_SPECULATIVE, seq, regs);
> +
> + if (!(fault & VM_FAULT_RETRY))
> + goto done;
> +
> + /* Quick path to respond to signals */
> + if (fault_signal_pending(fault, regs)) {
> + if (!user_mode(regs))
> + kernelmode_fixup_or_oops(regs, error_code, address,
> + SIGBUS, BUS_ADRERR,
> + ARCH_DEFAULT_PKEY);
> + return;
> + }
> +
> +spf_abort:
> + count_vm_event(SPF_ABORT);
> +#endif
> +
> /*
> * Kernel-mode access to the user address space should only occur
> * on well-defined single instructions listed in the exception
> @@ -1419,6 +1460,9 @@ void do_user_addr_fault(struct pt_regs *regs,
> }
>
> mmap_read_unlock(mm);
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +done:
> +#endif
> if (likely(!(fault & VM_FAULT_ERROR)))
> return;
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index b6678578a729..305f05d2a4bc 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -370,6 +370,11 @@ struct anon_vma_name {
> * per VM-area/task. A VM area is any part of the process virtual memory
> * space that has a special rule for the page-fault handlers (ie a shared
> * library, the executable area etc).
> + *
> + * Note that speculative page faults make an on-stack copy of the VMA,
> + * so the structure size matters.
> + * (TODO - it would be preferable to copy only the required vma attributes
> + * rather than the entire vma).
> */
> struct vm_area_struct {
> /* The first cache line has the info for VMA tree walking. */
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 7b2363388bfa..f00b3e36ff39 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -133,6 +133,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> #ifdef CONFIG_X86
> DIRECT_MAP_LEVEL2_SPLIT,
> DIRECT_MAP_LEVEL3_SPLIT,
> +#endif
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + SPF_ATTEMPT,
> + SPF_ABORT,
> #endif
> NR_VM_EVENT_ITEMS
> };
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4057372745d0..dbb0160e5558 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1390,6 +1390,10 @@ const char * const vmstat_text[] = {
> "direct_map_level2_splits",
> "direct_map_level3_splits",
> #endif
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + "spf_attempt",
> + "spf_abort",
> +#endif
> #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
> };
> #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
> --
> 2.20.1
>
On 2022-01-31 17:14:34 [-0800], Andrew Morton wrote:
> On Fri, 28 Jan 2022 05:09:31 -0800 Michel Lespinasse <[email protected]> wrote:
> > The next step is to walk down the existing page table tree to find the
> > current pte entry. This is done with interrupts disabled to avoid
> > races with munmap().
>
> Sebastian, could you please comment on this from the CONFIG_PREEMPT_RT
> point of view?
I applied the series on top of RT and gave it a shot. Nothing out of the
ordinary happened so that is good.
From browsing through the code:
- speculative_page_walk_begin() seems to disable interrupts.
There is a spin_trylock() invocation in that area. That is okay since
it is never invoked from in_IRQ(). But there should not be any regular
spin_lock() in such a section.
- We do have a seqcount API. So instead of mmap_seq_read_start() one
could use raw_read_seqcount(). The lockdep bits would also check if
the associated lock (in this case mmap_lock) is held in the write
  path (a rough sketch of that mapping is below).
- The read side (mmap_seq_read_start()) does not attempt to stabilize
the counter (waiting for even) which is good. Otherwise special care
would be needed ;)
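A rough sketch of that seqcount mapping, assuming mm->mmap_seq were a
seqcount_t (names here are just for illustration, this is not a patch):

static inline unsigned mmap_seq_read_start(struct mm_struct *mm)
{
	/* Raw read: may return an odd value while a writer is active. */
	return raw_read_seqcount(&mm->mmap_seq);
}

static inline bool mmap_seq_read_check(struct mm_struct *mm, unsigned seq)
{
	/* True if no mmap writer ran since mmap_seq_read_start(). */
	return !read_seqcount_retry(&mm->mmap_seq, seq);
}

The speculative path would still bail out early when the start value is
odd. Getting the lockdep association with mmap_lock would additionally
need a seqcount variant tied to that lock; I have not checked whether a
suitable one exists for rwsems.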
Sebastian
On Tue, Feb 01, 2022 at 10:58:03AM +0200, Mike Rapoport wrote:
> On Mon, Jan 31, 2022 at 12:07:29AM -0800, Michel Lespinasse wrote:
> > On Sun, Jan 30, 2022 at 11:13:26AM +0200, Mike Rapoport wrote:
> > > The speculative page fault implementation here (and for PowerPC as well)
> > > looks very similar to x86. Can we factor it out rather than copy 3 (or
> > > more) times?
> >
> > In each arch, the speculative code was written along the lines of the
> > existing non-speculative code, so that behavior would be unchanged
> > when speculation succeeds.
> >
> > Now each arch's existing, non-speculative code paths are quite similar,
> > but they do have small differences as to how they implement various
> > permission checks, protection keys and the like. The same small
> > differences end up being reflected in the new speculative code paths.
> >
> > I agree it would be nice if this code could be unified between archs,
> > but IMO this should start with the existing non-speculative code -
> > I don't think it would make sense to try unifying the new speculative
> > code while trying to follow the behavior of the non-unified old
> > non-speculative code paths...
>
> Then maybe this unification can be done as the ground work for the
> speculative page fault handling?
I feel like this is quite unrelated, and that introducing such
artificial dependencies is a bad work habit we have here in linux MM...
That said, unifying the PF code between archs would be an interesting
project on its own. The way I see it, there could be a unified page
fault handler, with some arch specific parts defined as inline
functions. I can see myself making an x86/arm64/powerpc initial
proposal if there is enough interest for it, but I'm not sure how
extending it to more exotic archs would go - I think this would have
to involve arch maintainers at least for testing purposes, and I'm not
sure if they'd have any bandwidth for such a project...
--
Michel "walken" Lespinasse
On Mon, Jan 31, 2022 at 6:10 PM Hillf Danton <[email protected]> wrote:
>
> On Mon, 31 Jan 2022 10:04:16 -0800 Suren Baghdasaryan wrote:
> > On Sat, Jan 29, 2022 at 4:13 AM Hillf Danton wrote:
> > >
> > > On Fri, 28 Jan 2022 05:09:53 -0800 Michel Lespinasse wrote:
> > > > +
> > > > +static LIST_HEAD(destroy_list);
> > > > +static DEFINE_SPINLOCK(destroy_list_lock);
> > >
> > > static bool destroyer_running;
> > >
> > > > +
> > > > +static void destroy_list_workfn(struct work_struct *work)
> > > > +{
> > > > + struct percpu_rw_semaphore *sem, *sem2;
> > > > + LIST_HEAD(to_destroy);
> > > > +
> > >
> > > again:
> > >
> > > > + spin_lock(&destroy_list_lock);
> > >
> > > if (list_empty(&destroy_list)) {
> > > destroyer_running = false;
> > > spin_unlock(&destroy_list_lock);
> > > return;
> > > }
> > > destroyer_running = true;
> > >
> > > > + list_splice_init(&destroy_list, &to_destroy);
> > > > + spin_unlock(&destroy_list_lock);
> > > > +
> > > > + if (list_empty(&to_destroy))
> > > > + return;
> > > > +
> > > > + list_for_each_entry_safe(sem, sem2, &to_destroy, destroy_list_entry) {
> > >
> > > list_del(&sem->destroy_list_entry);
> > >
> > > > + percpu_free_rwsem(sem);
> > > > + kfree(sem);
> > > > + }
> > >
> > > goto again;
> > > > +}
> > > > +
> > > > +static DECLARE_WORK(destroy_list_work, destroy_list_workfn);
> > > > +
> > > > +void percpu_rwsem_async_destroy(struct percpu_rw_semaphore *sem)
> > > > +{
> > > > + spin_lock(&destroy_list_lock);
> > > > + list_add_tail(&sem->destroy_list_entry, &destroy_list);
> > > > + spin_unlock(&destroy_list_lock);
> > > > + schedule_work(&destroy_list_work);
> > >
> > > Nits
> > > spin_lock(&destroy_list_lock);
> > > 1/ /* LIFO */
> > > list_add(&sem->destroy_list_entry, &destroy_list);
> > > 2/ /* spawn worker if it is idle */
> > > if (!destroyer_running)
> > > 3/ /* this is not critical work */
> > > queue_work(system_unbound_wq, &destroy_list_work);
> > > spin_unlock(&destroy_list_lock);
> >
> > Thanks for the review! Just to clarify, are you suggesting
> > simplifications to the current patch or do you see a functional issue?
>
> Apart from the nits that can be safely ignored in usual spins, I wonder if
> the async destroy can be used in the contexts wrt raw_spin_lock.
>
> Hillf
>
> raw_spin_lock_irq(&foo->lock);
> ...
> percpu_rwsem_async_destroy(*sem);
> ...
> raw_spin_unlock_irq(&foo->lock);
Sorry for the delay. Are you concerned about the use of spin_lock()
inside percpu_rwsem_async_destroy() which would become a sleeping lock
in case of PREEMPT_RT? If so, we can use raw_spin_lock() when locking
destroy_list_lock. Please confirm. Thanks!
>
On Mon, Jan 31, 2022 at 04:17:43PM +0000, Liam Howlett wrote:
> * Michel Lespinasse <[email protected]> [220128 08:10]:
> > In the mmap locking API, the *_killable() functions return an error
> > (or 0 on success), and the *_trylock() functions return a boolean
> > (true on success).
> >
> > Rename the return values "int error" and "bool ok", respectively,
> > rather than using "ret" for both cases which I find less readable.
>
> Would it be better to add function documentation in regards to return
> types? I think changing the variables does help, but putting a block
> with Return: <what's above> would work best.
That would work, I guess. I'm not sure what it says about our general
coding style that such a comment would kinda stick out like a sore thumb
compared to the rest of the file, or to similar include files
(say, other lock definitions). I don't care very strongly either way.
--
Michel "walken" Lespinasse
On Tue, Feb 01, 2022 at 05:16:43PM +0000, Liam Howlett wrote:
> > + vma = __find_vma(mm, address);
> > + if (!vma || vma->vm_start > address) {
>
> This fits the vma_lookup() pattern - although you will have to work
> around the locking issue still. This is the same for the other
> platforms too; they fit the pattern also.
In this case, I think it's just as well to follow the lines of the
non-speculative path, which itself can't use vma_lookup() because it
needs to handle the stack expansion case...
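For reference, the non-speculative path does roughly the following
(paraphrased, not a verbatim quote of arch/x86/mm/fault.c):

	vma = find_vma(mm, address);
	if (unlikely(!vma)) {
		bad_area(regs, error_code, address);
		return;
	}
	if (unlikely(vma->vm_start > address)) {
		if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
			bad_area(regs, error_code, address);
			return;
		}
		if (unlikely(expand_stack(vma, address))) {
			bad_area(regs, error_code, address);
			return;
		}
	}
	/* good area: proceed to handle_mm_fault() with the mmap read lock held */

vma_lookup() returns NULL whenever vma->vm_start > address, so it cannot
express the VM_GROWSDOWN / expand_stack() case above.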
--
Michel "walken" Lespinasse
On Tue, Feb 01, 2022 at 02:20:39AM +0000, Matthew Wilcox wrote:
> On Mon, Jan 31, 2022 at 05:14:34PM -0800, Andrew Morton wrote:
> > On Fri, 28 Jan 2022 05:09:31 -0800 Michel Lespinasse <[email protected]> wrote:
> > > The next step is to walk down the existing page table tree to find the
> > > current pte entry. This is done with interrupts disabled to avoid
> > > races with munmap().
> >
> > Sebastian, could you please comment on this from the CONFIG_PREEMPT_RT
> > point of view?
>
> I am not a fan of this approach. For other reasons, I think we want to
> switch to RCU-freed page tables, and then we can walk the page tables
> with the RCU lock held. Some architectures already RCU-free the page
> tables, so I think it's just a matter of converting the rest.
Note - I have no problem with switching to RCU-freed page tables
everywhere when and if we end up needing to. I just don't see that
this need comes from the SPF patchset, so I don't think this should
be introduced as an artificial dependency.
--
Michel "walken" Lespinasse
On Mon, Feb 7, 2022 at 4:21 PM Hillf Danton <[email protected]> wrote:
>
> On Mon, 7 Feb 2022 11:31:38 -0800 Suren Baghdasaryan wrote:
> >
> > Sorry for the delay. Are you concerned about the use of spin_lock()
> > inside percpu_rwsem_async_destroy() which would become a sleeping lock
> > in case of PREEMPT_RT? If so, we can use raw_spin_lock() when locking
> > destroy_list_lock. Please confirm. Thanks!
>
> Yes, please replace the spin lock with the raw version, which fits
> more scenarios.
Thanks for confirmation! I'll rework my patch and will send it to
Michel to include in the next version of his patchset.
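Roughly, the reworked helpers would look like this (sketch only; compared
to the current patch, the only functional change is the raw spinlock):

static LIST_HEAD(destroy_list);
static DEFINE_RAW_SPINLOCK(destroy_list_lock);

static void destroy_list_workfn(struct work_struct *work)
{
	struct percpu_rw_semaphore *sem, *sem2;
	LIST_HEAD(to_destroy);

	raw_spin_lock(&destroy_list_lock);
	list_splice_init(&destroy_list, &to_destroy);
	raw_spin_unlock(&destroy_list_lock);

	if (list_empty(&to_destroy))
		return;

	/* Freeing the rwsems may sleep, which is fine in a workqueue. */
	list_for_each_entry_safe(sem, sem2, &to_destroy, destroy_list_entry) {
		percpu_free_rwsem(sem);
		kfree(sem);
	}
}

static DECLARE_WORK(destroy_list_work, destroy_list_workfn);

void percpu_rwsem_async_destroy(struct percpu_rw_semaphore *sem)
{
	raw_spin_lock(&destroy_list_lock);
	list_add_tail(&sem->destroy_list_entry, &destroy_list);
	raw_spin_unlock(&destroy_list_lock);
	schedule_work(&destroy_list_work);
}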
>
> Hillf
>
On Mon, Feb 07, 2022 at 09:39:19AM -0800, Michel Lespinasse wrote:
> On Tue, Feb 01, 2022 at 10:58:03AM +0200, Mike Rapoport wrote:
> > On Mon, Jan 31, 2022 at 12:07:29AM -0800, Michel Lespinasse wrote:
> > > On Sun, Jan 30, 2022 at 11:13:26AM +0200, Mike Rapoport wrote:
> > > > The speculative page fault implementation here (and for PowerPC as well)
> > > > looks very similar to x86. Can we factor it out rather than copy 3 (or
> > > > more) times?
> > >
> > > In each arch, the speculative code was written along the lines of the
> > > existing non-speculative code, so that behavior would be unchanged
> > > when speculation succeeds.
> > >
> > > Now each arch's existing, non-speculative code paths are quite similar,
> > > but they do have small differences as to how they implement various
> > > permission checks, protection keys and the like. The same small
> > > differences end up being reflected in the new speculative code paths.
> > >
> > > I agree it would be nice if this code could be unified between archs,
> > > but IMO this should start with the existing non-speculative code -
> > > I don't think it would make sense to try unifying the new speculative
> > > code while trying to follow the behavior of the non-unified old
> > > non-speculative code paths...
> >
> > Then maybe this unification can be done as the ground work for the
> > speculative page fault handling?
>
> I feel like this is quite unrelated, and that introducing such
> artificial dependencies is a bad work habit we have here in linux MM...
The reduction of the code duplication in page fault handlers per se is
indeed not very related to SPF work, but since the SPF patches increase
the code duplication, I believe that the refactoring that prevents this
additional code duplication is related and is in scope of this work.
> That said, unifying the PF code between archs would be an interesting
> project on its own. The way I see it, there could be a unified page
> fault handler, with some arch specific parts defined as inline
> functions. I can see myself making an x86/arm64/powerpc initial
> proposal if there is enough interest for it, but I'm not sure how
> extending it to more exotic archs would go - I think this would have
> to involve arch maintainers at least for testing purposes, and I'm not
> sure if they'd have any bandwidth for such a project...
There is no need to convert all architectures and surely not at once.
The parts of the page fault handler that are shared by several architectures
can live under #ifdef ARCH_WANTS_GENERIC_PAGE_FAULT or something like this.
> --
> Michel "walken" Lespinasse
--
Sincerely yours,
Mike.
On Fri, Jan 28, 2022 at 05:09:31AM -0800, Michel Lespinasse wrote:
> This patchset is my take on speculative page faults (spf).
> It builds on ideas that have been previously proposed by Laurent Dufour,
> Peter Zijlstra and others before. While Laurent's previous proposal
> was rejected around the time of LSF/MM 2019, I am hoping we can revisit
> this now based on what I think is a simpler and more bisectable approach,
> much improved scaling numbers in the anonymous vma case, and the Android
> use case that has since emerged. I will expand on these points towards
> the end of this message.
>
> The patch series applies on top of linux v5.17-rc1;
> a git tree is also available:
> git fetch https://github.com/lespinasse/linux.git v5.17-rc1-spf-anon
>
> I would like these patches to be considered for inclusion into v5.18.
> Several android vendors are using Laurent Dufour's previous SPF work into
> their kernel tree in order to improve application startup performance,
> want to converge to an upstream accepted solution, and have reported good
> numbers with previous versions of this patchset. Also, there is a broader
> interest into reducing mmap lock dependencies in critical MM paths,
> and I think this patchset would be a good first step in that direction.
>
I think there is a serious lack of performance data here. The only
performance point offered is the Android Application Startup case.
Unfortunately, that benefit may be specific to the Zygote process that
preloads classes that may be required and listens for new applications to
start. I suspect the benefit wouldn't apply to most Linux distributions
and even JVM-based workloads are not primarily constrained by the startup
cost. Improving application start up costs is not great justification
for this level of code complexity even though I recognise why it is a
key performance indicator for Android given that startup times affect
the user experience.
Laurent's original work was partially motivated by the performance of
a proprietary application. While I cannot replicate a full production
workload as that can only be done by the company, I could do a basic
evaluation commonly conducted on standalone systems. It was extremely
fault intensive with SPF success rates greater than 96% but almost no
change in actual performance. It's perfectly possible that the application
has changed since SPF was first proposed. The developers did spend a fair
amount of effort at making the application NUMA-aware and reusing memory
more aggressively to avoid faults. It's still very fault intensive but
does not appear to suffer from parallel memory operations, judging from
the data.
On my own tests, the only preliminary test that was a clear winner
was will-it-scale using threads for the page-fault workloads and
page-fault-test for threads. To be fair, the increases there are dramatic
with a high success rate of speculative faults.
pft timings
5.17.0-rc3 5.17.0-rc3
vanilla mm-spfault-v2r1
Amean elapsed-1 32.66 ( 0.00%) 32.77 * -0.36%*
Amean elapsed-4 9.17 ( 0.00%) 8.89 * 3.07%*
Amean elapsed-7 5.53 ( 0.00%) 5.26 * 4.95%*
Amean elapsed-12 4.13 ( 0.00%) 3.50 * 15.16%*
Amean elapsed-21 3.93 ( 0.00%) 2.79 * 29.03%*
Amean elapsed-30 4.02 ( 0.00%) 2.94 * 26.79%*
Amean elapsed-48 4.37 ( 0.00%) 2.83 * 35.24%*
Amean elapsed-79 4.13 ( 0.00%) 2.17 * 47.36%*
Amean elapsed-80 4.12 ( 0.00%) 2.13 * 48.22%*
Ops SPFault Attempt 0.00 4734439786.00
Ops SPFault Abort 0.00 9360014.00
Ops SPFault Success 0.00 99.80
This is the ideal case for SPF but not very realistic. Interestingly,
ebizzy barely benefitted even though it's threaded because it's not
guaranteed to be address space modification intensive.
Hackbench took a performance hit between 0-5% depending on the exact
configuration and machine used. It is threaded and had high SPF abort rates
(up to 50%). It's not a great example but it shows at least one example
where SPF hurts more than it helps and there may be other applications
that are harmed by having to retry faults.
The scope of SPF is narrow relative to the much older discussion of
breaking up mmap_sem. The only time SPF benefits is when faults are racing
against parallel memory address updates holding mmap_sem for write.
That requires a threaded application that is both intense in terms of
address space updates and fault intensive. That is much narrower than
threaded applications that are address space update intensive (e.g.
using mprotect to avoid accidentally leaking data, mapping data files
for IO etc). Have we examples of realistic applications that meet all the
criteria of "threaded", "address-space intensive" and "fault intensive"
that are common enough to justify the complexity?
Admittedly, I initially just threw this series at a collection of
workloads that simply stress the allocator because it stresses faults as
a side-effect but most of them did not match the criteria for "threaded
application that is both address space update intensive and fault
intensive". I'm struggling to think of good examples although redis
is a possibility. HPC workloads like NPB parallelised with OpenMP is a
possibility but I looked at some old results and while it does trap faults,
the vast majority are related to NUMA balancing. The other ones I normally
consider for scaling purposes are process orientated and not threads.
On the patches themselves, I'm not sure the optimisation for ignoring SPF
is guaranteed to work as mm_users could be temporarily elevated although
probably not enough to matter. I also think patch 5 stands on its own and
could be sent separately. For the others, I didn't read them in sufficient
depth but noted that the level of similar logic between speculative
and non-speculative paths could be a maintenance headache to keep the
speculative and !speculative rules in sync. I didn't see obvious problems
as such but I still think the complexity is high for a corner case.
--
Mel Gorman
SUSE Labs
On Wed, Feb 23, 2022 at 8:11 AM Mel Gorman <[email protected]> wrote:
>
> On Fri, Jan 28, 2022 at 05:09:31AM -0800, Michel Lespinasse wrote:
> > This patchset is my take on speculative page faults (spf).
> > It builds on ideas that have been previously proposed by Laurent Dufour,
> > Peter Zijlstra and others before. While Laurent's previous proposal
> > was rejected around the time of LSF/MM 2019, I am hoping we can revisit
> > this now based on what I think is a simpler and more bisectable approach,
> > much improved scaling numbers in the anonymous vma case, and the Android
> > use case that has since emerged. I will expand on these points towards
> > the end of this message.
> >
> > The patch series applies on top of linux v5.17-rc1;
> > a git tree is also available:
> > git fetch https://github.com/lespinasse/linux.git v5.17-rc1-spf-anon
> >
> > I would like these patches to be considered for inclusion into v5.18.
> > Several android vendors are using Laurent Dufour's previous SPF work into
> > their kernel tree in order to improve application startup performance,
> > want to converge to an upstream accepted solution, and have reported good
> > numbers with previous versions of this patchset. Also, there is a broader
> > interest into reducing mmap lock dependencies in critical MM paths,
> > and I think this patchset would be a good first step in that direction.
> >
>
> I think there is a serious lack of performance data here. The only
> performance point offered is the Android Application Startup case.
> Unfortunately, that benefit may be specific to the Zygote process that
> preloads classes that may be required and listens for new applications to
> start. I suspect the benefit wouldn't apply to most Linux distributions
> and even JVM-based workloads are not primarily constrained by the startup
> cost. Improving application start up costs is not great justification
> for this level of code complexity even though I recognise why it is a
> key performance indicator for Android given that startup times affect
> the user experience.
>
> Laurent's original work was partially motivated by the performance of
> a proprietary application. While I cannot replicate a full production
> workload as that can only be done by the company, I could do a basic
> evaluation commonly conducted on standalone systems. It was extremely
> fault intensive with SPF success rates greater than 96% but almost no
> change in actual performance. It's perfectly possible that the application
> has changed since SPF was first proposed. The developers did spend a fair
> amount of effort at making the application NUMA-aware and reusing memory
> more aggressively to avoid faults. It's still very fault intensive but
> does not appear to suffer from parallel memory operations, judging from
> the data.
>
> On my own tests, the only preliminary test that was a clear winner
> was will-it-scale using threads for the page-fault workloads and
> page-fault-test for threads. To be fair, the increases there are dramatic
> with a high success rate of speculative faults.
>
> pft timings
> 5.17.0-rc3 5.17.0-rc3
> vanilla mm-spfault-v2r1
> Amean elapsed-1 32.66 ( 0.00%) 32.77 * -0.36%*
> Amean elapsed-4 9.17 ( 0.00%) 8.89 * 3.07%*
> Amean elapsed-7 5.53 ( 0.00%) 5.26 * 4.95%*
> Amean elapsed-12 4.13 ( 0.00%) 3.50 * 15.16%*
> Amean elapsed-21 3.93 ( 0.00%) 2.79 * 29.03%*
> Amean elapsed-30 4.02 ( 0.00%) 2.94 * 26.79%*
> Amean elapsed-48 4.37 ( 0.00%) 2.83 * 35.24%*
> Amean elapsed-79 4.13 ( 0.00%) 2.17 * 47.36%*
> Amean elapsed-80 4.12 ( 0.00%) 2.13 * 48.22%*
>
> Ops SPFault Attempt 0.00 4734439786.00
> Ops SPFault Abort 0.00 9360014.00
> Ops SPFault Success 0.00 99.80
>
> This is the ideal case for SPF but not very realistic. Interestingly,
> ebizzy barely benefitted even though it's threaded because it's not
> guaranteed to be address space modification intensive.
>
> Hackbench took a performance hit between 0-5% depending on the exact
> configuration and machine used. It is threaded and had high SPF abort rates
> (up to 50%). It's not a great example but it shows at least one example
> where SPF hurts more than it help and there may be other applications
> that are harmed by having to retry faults.
>
> The scope of SPF is narrow relative to the much older discussion of
> breaking up mmap_sem. The only time SPF benefits is when faults are racing
> against parallel memory address updates holding mmap_sem for write.
> That requires a threaded application that is both intense in terms of
> address space updates and fault intensive. That is much narrower than
> threaded applications that are address space update intensive (e.g.
> using mprotect to avoid accidentally leaking data, mapping data files
> for IO etc). Have we examples of realistic applications that meet all the
> criteria of "threaded", "address-space intensive" and "fault intensive"
> that are common enough to justify the complexity?
>
> Admittedly, I initially just threw this series at a collection of
> workloads that simply stress the allocator because it stresses faults as
> a side-effect but most of them did not match the criteria for "threaded
> application that is both address space update intensive and fault
> intensive". I'm struggling to think of good examples although redis
> is a possibility. HPC workloads like NPB parallelised with OpenMP is a
> possibility but I looked at some old results and while it does trap faults,
> the vast majority are related to NUMA balancing. The other ones I normally
> consider for scaling purposes are process orientated and not threads.
>
> On the patches themselves, I'm not sure the optimisation for ignoring SPF
> is guaranteed to work as mm_users could be temporarily elevated although
> probably not enough to matter. I also think patch 5 stands on its own and
> could be sent separately. For the others, I didn't read them in sufficient
> depth but noted that the level of similar logic between speculative
> and non-speculative paths could be a maintenance headache to keep the
> speculative and !speculative rules in sync. I didn't see obvious problems
> as such but I still think the complexity is high for a corner case.
Hi Mel,
Thank you for taking your time to analyze SPF effects on different
workloads. Your feedback drove me to look into the reasons Android
benefits from this patchset. What we know is that apps which benefit
the most are the ones with high number of threads (~100) and when I
strace'd one of these apps I can see that each thread mmaps several
areas upon startup (Stack and Thread-local storage (TLS), thread
signal stack, indirect ref table).
So, I created a simple test that spawns a given number of threads,
each thread mmapping and faulting-in a given number of vmas with a
given number of pages in each one. Each thread records the time it
takes to mmap the vmas and fault-in the pages and the test reports the
total and the average times measured. You can find my test program
here: https://github.com/surenbaghdasaryan/spf_test/blob/main/spf_test.c
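The core of it is a per-thread loop along the lines of the following
simplified sketch (this is an illustration of what is described above,
not the actual spf_test.c):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

static int nr_vmas, nr_pages;
static long page_size;

/* Each thread maps nr_vmas areas and touches nr_pages pages in each. */
static void *thread_fn(void *arg)
{
	long long *elapsed_ns = arg;
	struct timespec start, end;
	int i, j;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < nr_vmas; i++) {
		char *p = mmap(NULL, nr_pages * page_size,
			       PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			exit(1);
		}
		for (j = 0; j < nr_pages; j++)
			p[j * page_size] = 1;	/* fault the page in */
		/* areas are intentionally left mapped; we only time the above */
	}
	clock_gettime(CLOCK_MONOTONIC, &end);
	*elapsed_ns = (end.tv_sec - start.tv_sec) * 1000000000LL +
		      (end.tv_nsec - start.tv_nsec);
	return NULL;
}

int main(int argc, char **argv)
{
	int nr_threads = argc > 1 ? atoi(argv[1]) : 100;
	pthread_t *threads;
	long long *times, total = 0;
	int i;

	nr_vmas = argc > 2 ? atoi(argv[2]) : 2;
	nr_pages = argc > 3 ? atoi(argv[3]) : 10;
	page_size = sysconf(_SC_PAGESIZE);

	threads = calloc(nr_threads, sizeof(*threads));
	times = calloc(nr_threads, sizeof(*times));

	for (i = 0; i < nr_threads; i++)
		pthread_create(&threads[i], NULL, thread_fn, &times[i]);
	for (i = 0; i < nr_threads; i++) {
		pthread_join(threads[i], NULL);
		total += times[i];
	}
	printf("avg time per thread: %lld ns\n", total / nr_threads);
	return 0;
}

(Build with gcc -O2 -pthread; the arguments mirror the cmdlines below:
threads, vmas, pages per vma.)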
I ran a number of tests on my Pixel 6 and SPF shows quite positive
results even with a small number of vmas and pages. Couple examples:
100 threads, 2 vmas, 10 pages (cmdline: spf_test 100 2 10)
Baseline avg time: 1,889,398.01ns
SPF avg time: 327,299.36ns
Improvement: 83%
100 threads, 10 vmas, 2 pages (cmdline: spf_test 100 10 2)
Baseline avg time: 1,234,861.48ns
SPF avg time: 800,392.82ns
Improvement: 35%
100 threads, 10 vmas, 10 pages (cmdline: spf_test 100 10 10)
Baseline avg time: 12,199,939.04ns
SPF avg time: 3,223,206.41ns
Improvement: 74%
100 threads, 30 vmas, 30 pages (cmdline: spf_test 100 30 30)
Baseline avg time: 255,827,268.16ns
SPF avg time: 41,538,348.47ns
Improvement: 84%
To minimize the noise, the test setup was to run with the same
parameters several hundred times and take the average across runs.
I think this test represents an example of what you were describing as
a "threaded application that is both address space update intensive
and fault intensive" because mmaps modify the address space with
page-faults happening in parallel. We can call it an artificial
workload but it does not strike me as something very unusual. I can
imagine other systems apart from Android which could spawn multiple
threads with each thread mapping some memory area to work with and
using that area immediately.
Thanks,
Suren.
>
> --
> Mel Gorman
> SUSE Labs
On Fri, Jan 28, 2022 at 05:09:54AM -0800, Michel Lespinasse wrote:
> Introduce mmu_notifier_lock as a per-mm percpu_rw_semaphore,
> as well as the code to initialize and destroy it together with the mm.
>
> This lock will be used to prevent races between mmu_notifier_register()
> and speculative fault handlers that need to fire MMU notifications
> without holding any of the mmap or rmap locks.
>
> Signed-off-by: Michel Lespinasse <[email protected]>
> ---
> include/linux/mm_types.h | 6 +++++-
> include/linux/mmu_notifier.h | 27 +++++++++++++++++++++++++--
> kernel/fork.c | 3 ++-
> 3 files changed, 32 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 305f05d2a4bc..f77e2dec038d 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -462,6 +462,7 @@ struct vm_area_struct {
> } __randomize_layout;
>
> struct kioctx_table;
> +struct percpu_rw_semaphore;
> struct mm_struct {
> struct {
> struct vm_area_struct *mmap; /* list of VMAs */
> @@ -608,7 +609,10 @@ struct mm_struct {
> struct file __rcu *exe_file;
> #ifdef CONFIG_MMU_NOTIFIER
> struct mmu_notifier_subscriptions *notifier_subscriptions;
> -#endif
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + struct percpu_rw_semaphore *mmu_notifier_lock;
> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
> +#endif /* CONFIG_MMU_NOTIFIER */
> #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
> pgtable_t pmd_huge_pte; /* protected by page_table_lock */
> #endif
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 45fc2c81e370..ace76fe91c0c 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -6,6 +6,8 @@
> #include <linux/spinlock.h>
> #include <linux/mm_types.h>
> #include <linux/mmap_lock.h>
> +#include <linux/percpu-rwsem.h>
> +#include <linux/slab.h>
> #include <linux/srcu.h>
> #include <linux/interval_tree.h>
>
> @@ -499,15 +501,35 @@ static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
> __mmu_notifier_invalidate_range(mm, start, end);
> }
>
> -static inline void mmu_notifier_subscriptions_init(struct mm_struct *mm)
> +static inline bool mmu_notifier_subscriptions_init(struct mm_struct *mm)
> {
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + mm->mmu_notifier_lock = kzalloc(sizeof(struct percpu_rw_semaphore), GFP_KERNEL);
> + if (!mm->mmu_notifier_lock)
> + return false;
> + if (percpu_init_rwsem(mm->mmu_notifier_lock)) {
> + kfree(mm->mmu_notifier_lock);
> + return false;
> + }
> +#endif
> +
> mm->notifier_subscriptions = NULL;
> + return true;
> }
>
> static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
> {
> if (mm_has_notifiers(mm))
> __mmu_notifier_subscriptions_destroy(mm);
> +
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + if (!in_atomic()) {
> + percpu_free_rwsem(mm->mmu_notifier_lock);
> + kfree(mm->mmu_notifier_lock);
> + } else {
> + percpu_rwsem_async_destroy(mm->mmu_notifier_lock);
> + }
> +#endif
> }
>
We have received a bug report from our customer running the Android GKI
kernel android-13-5.15 branch, where this series is included. As the
callstack [1] indicates, the non-atomic test itself is not sufficient to
free the percpu rwsem.
The scenario deduced from the callstack:
- context switch on CPU#0 from 'A' to idle. idle thread took A's mm
- 'A' later ran on another CPU and exited. A's mm still has a reference.
- Now CPU#0 is being hotplugged out. As part of this, idle thread's
mm is switched (in idle_task_exit()) but its active_mm freeing is
deferred to finish_cpu() which gets called later from the control processor
(the thread which initiated the CPU hotplug). Please see the reasoning
on why mmdrop() is not called in idle_task_exit() at
commit bf2c59fce4074('sched/core: Fix illegal RCU from offline CPUs')
- Now when finish_cpu() tries to call percpu_free_rwsem() directly (since
we are not in an atomic path, but in the hotplug path where cpus_write_lock()
has been taken), we hit the deadlock.
I am not sure if there is a clean way other than freeing the per-cpu
rwsemaphore asynchronously all the time.
[1]
-001|context_switch(inline)
-001|__schedule()
-002|__preempt_count_sub(inline)
-002|schedule()
-003|_raw_spin_unlock_irq(inline)
-003|spin_unlock_irq(inline)
-003|percpu_rwsem_wait()
-004|__preempt_count_add(inline)
-004|__percpu_down_read()
-005|percpu_down_read(inline)
-005|cpus_read_lock() // trying to get cpu_hotplug_lock again
-006|rcu_barrier()
-007|rcu_sync_dtor()
-008|mmu_notifier_subscriptions_destroy(inline)
-008|__mmdrop()
-009|mmdrop(inline)
-009|finish_cpu()
-010|cpuhp_invoke_callback()
-011|cpuhp_invoke_callback_range(inline)
-011|cpuhp_down_callbacks()
-012|_cpu_down() // acquired cpu_hotplug_lock (write lock)
Thanks,
Pavan
On Wed, Jul 27, 2022 at 12:34 AM Pavan Kondeti
<[email protected]> wrote:
>
> On Fri, Jan 28, 2022 at 05:09:54AM -0800, Michel Lespinasse wrote:
> > Introduce mmu_notifier_lock as a per-mm percpu_rw_semaphore,
> > as well as the code to initialize and destroy it together with the mm.
> >
> > This lock will be used to prevent races between mmu_notifier_register()
> > and speculative fault handlers that need to fire MMU notifications
> > without holding any of the mmap or rmap locks.
> >
> > Signed-off-by: Michel Lespinasse <[email protected]>
> > ---
> > include/linux/mm_types.h | 6 +++++-
> > include/linux/mmu_notifier.h | 27 +++++++++++++++++++++++++--
> > kernel/fork.c | 3 ++-
> > 3 files changed, 32 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 305f05d2a4bc..f77e2dec038d 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -462,6 +462,7 @@ struct vm_area_struct {
> > } __randomize_layout;
> >
> > struct kioctx_table;
> > +struct percpu_rw_semaphore;
> > struct mm_struct {
> > struct {
> > struct vm_area_struct *mmap; /* list of VMAs */
> > @@ -608,7 +609,10 @@ struct mm_struct {
> > struct file __rcu *exe_file;
> > #ifdef CONFIG_MMU_NOTIFIER
> > struct mmu_notifier_subscriptions *notifier_subscriptions;
> > -#endif
> > +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> > + struct percpu_rw_semaphore *mmu_notifier_lock;
> > +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
> > +#endif /* CONFIG_MMU_NOTIFIER */
> > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
> > pgtable_t pmd_huge_pte; /* protected by page_table_lock */
> > #endif
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index 45fc2c81e370..ace76fe91c0c 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -6,6 +6,8 @@
> > #include <linux/spinlock.h>
> > #include <linux/mm_types.h>
> > #include <linux/mmap_lock.h>
> > +#include <linux/percpu-rwsem.h>
> > +#include <linux/slab.h>
> > #include <linux/srcu.h>
> > #include <linux/interval_tree.h>
> >
> > @@ -499,15 +501,35 @@ static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
> > __mmu_notifier_invalidate_range(mm, start, end);
> > }
> >
> > -static inline void mmu_notifier_subscriptions_init(struct mm_struct *mm)
> > +static inline bool mmu_notifier_subscriptions_init(struct mm_struct *mm)
> > {
> > +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> > + mm->mmu_notifier_lock = kzalloc(sizeof(struct percpu_rw_semaphore), GFP_KERNEL);
> > + if (!mm->mmu_notifier_lock)
> > + return false;
> > + if (percpu_init_rwsem(mm->mmu_notifier_lock)) {
> > + kfree(mm->mmu_notifier_lock);
> > + return false;
> > + }
> > +#endif
> > +
> > mm->notifier_subscriptions = NULL;
> > + return true;
> > }
> >
> > static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
> > {
> > if (mm_has_notifiers(mm))
> > __mmu_notifier_subscriptions_destroy(mm);
> > +
> > +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> > + if (!in_atomic()) {
> > + percpu_free_rwsem(mm->mmu_notifier_lock);
> > + kfree(mm->mmu_notifier_lock);
> > + } else {
> > + percpu_rwsem_async_destroy(mm->mmu_notifier_lock);
> > + }
> > +#endif
> > }
> >
>
> We have received a bug report from our customer running the Android GKI
> kernel android-13-5.15 branch, where this series is included. As the
> callstack [1] indicates, the non-atomic test itself is not sufficient to
> free the percpu rwsem.
>
> The scenario deduced from the callstack:
>
> - context switch on CPU#0 from 'A' to idle. idle thread took A's mm
>
> - 'A' later ran on another CPU and exited. A's mm still has a reference.
>
> - Now CPU#0 is being hotplugged out. As part of this, idle thread's
> mm is switched (in idle_task_exit()) but its active_mm freeing is
> deferred to finish_cpu() which gets called later from the control processor
> (the thread which initiated the CPU hotplug). Please see the reasoning
> on why mmdrop() is not called in idle_task_exit() at
> commit bf2c59fce4074('sched/core: Fix illegal RCU from offline CPUs')
>
> - Now when finish_cpu() tries to call percpu_free_rwsem() directly (since
> we are not in an atomic path, but in the hotplug path where cpus_write_lock()
> has been taken), we hit the deadlock.
>
> I am not sure if there is a clean way other than freeing the per-cpu
> rwsemaphore asynchronously all the time.
Thanks for reporting this issue, Pavan. I think your suggestion of
doing unconditional async destruction of mmu_notifier_lock would be
fine here. percpu_rwsem_async_destroy has a bit of an overhead to
schedule that work, but I don't think the exit path is performance-critical
enough for that to matter. Michel, WDYT?
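Concretely, I think it would boil down to something like this (sketch,
untested):

static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
{
	if (mm_has_notifiers(mm))
		__mmu_notifier_subscriptions_destroy(mm);

#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
	/*
	 * Always defer the teardown to the worker: in_atomic() cannot tell
	 * us whether the caller holds cpu_hotplug_lock, as finish_cpu()
	 * does, and the worker already does percpu_free_rwsem() + kfree().
	 */
	percpu_rwsem_async_destroy(mm->mmu_notifier_lock);
#endif
}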
>
> [1]
>
> -001|context_switch(inline)
> -001|__schedule()
> -002|__preempt_count_sub(inline)
> -002|schedule()
> -003|_raw_spin_unlock_irq(inline)
> -003|spin_unlock_irq(inline)
> -003|percpu_rwsem_wait()
> -004|__preempt_count_add(inline)
> -004|__percpu_down_read()
> -005|percpu_down_read(inline)
> -005|cpus_read_lock() // trying to get cpu_hotplug_lock again
> -006|rcu_barrier()
> -007|rcu_sync_dtor()
> -008|mmu_notifier_subscriptions_destroy(inline)
> -008|__mmdrop()
> -009|mmdrop(inline)
> -009|finish_cpu()
> -010|cpuhp_invoke_callback()
> -011|cpuhp_invoke_callback_range(inline)
> -011|cpuhp_down_callbacks()
> -012|_cpu_down() // acquired cpu_hotplug_lock (write lock)
>
> Thanks,
> Pavan
>