by Michel Lespinasse

[permalink] [raw]

Subject: Re: [PATCH v2 18/35] mm: implement speculative handling in do_anonymous_page()

On Sat, Jan 29, 2022 at 05:03:53AM +0800, kernel test robot wrote:
> >> mm/memory.c:3876:20: warning: variable 'vmf' is uninitialized when used within its own initialization [-Wuninitialized]
> if (!pte_map_lock(vmf)) {
> ~~~~~~~~~~~~~^~~~
> include/linux/mm.h:3418:25: note: expanded from macro 'pte_map_lock'
> struct vm_fault *vmf = __vmf; \
> ~~~ ^~~~~
> 1 warning generated.

Ah, that's interesting - this works with gcc, but breaks with clang.

The following amended patch should fix this:
(I only added underscores to the pte_map_lock and pte_spinlock macros)

------------------------------------ 8< ---------------------------------

mm: add pte_map_lock() and pte_spinlock()

pte_map_lock() and pte_spinlock() are used by fault handlers to ensure
the pte is mapped and locked before they commit the faulted page to the
mm's address space at the end of the fault.

The functions differ in their preconditions; pte_map_lock() expects
the pte to be unmapped prior to the call, while pte_spinlock() expects
it to be already mapped.

In the speculative fault case, the functions verify, after locking the pte,
that the mmap sequence count has not changed since the start of the fault,
and thus that no mmap lock writers have been running concurrently with
the fault. After that point the page table lock serializes any further
races with concurrent mmap lock writers.

If the mmap sequence count check fails, both functions will return false
with the pte being left unmapped and unlocked.

Signed-off-by: Michel Lespinasse <[email protected]>
---
include/linux/mm.h | 38 ++++++++++++++++++++++++++
mm/memory.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 104 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2e2122bd3da3..80894db6f01a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3394,5 +3394,43 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
}
#endif

+#ifdef CONFIG_MMU
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+
+bool __pte_map_lock(struct vm_fault *vmf);
+
+static inline bool pte_map_lock(struct vm_fault *vmf)
+{
+ VM_BUG_ON(vmf->pte);
+ return __pte_map_lock(vmf);
+}
+
+static inline bool pte_spinlock(struct vm_fault *vmf)
+{
+ VM_BUG_ON(!vmf->pte);
+ return __pte_map_lock(vmf);
+}
+
+#else /* !CONFIG_SPECULATIVE_PAGE_FAULT */
+
+#define pte_map_lock(____vmf) \
+({ \
+ struct vm_fault *__vmf = ____vmf; \
+ __vmf->pte = pte_offset_map_lock(__vmf->vma->vm_mm, __vmf->pmd, \
+ __vmf->address, &__vmf->ptl); \
+ true; \
+})
+
+#define pte_spinlock(____vmf) \
+({ \
+ struct vm_fault *__vmf = ____vmf; \
+ __vmf->ptl = pte_lockptr(__vmf->vma->vm_mm, __vmf->pmd); \
+ spin_lock(__vmf->ptl); \
+ true; \
+})
+
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+#endif /* CONFIG_MMU */
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
diff --git a/mm/memory.c b/mm/memory.c
index d0db10bd5bee..1ce837e47395 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2745,6 +2745,72 @@ EXPORT_SYMBOL_GPL(apply_to_existing_page_range);
#define speculative_page_walk_end() local_irq_enable()
#endif

+bool __pte_map_lock(struct vm_fault *vmf)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ pmd_t pmdval;
+#endif
+ pte_t *pte = vmf->pte;
+ spinlock_t *ptl;
+
+ if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
+ vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+ if (!pte)
+ vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+ spin_lock(vmf->ptl);
+ return true;
+ }
+
+ speculative_page_walk_begin();
+ if (!mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq))
+ goto fail;
+ /*
+ * The mmap sequence count check guarantees that the page
+ * tables are still valid at that point, and
+ * speculative_page_walk_begin() ensures that they stay around.
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ /*
+ * We check if the pmd value is still the same to ensure that there
+ * is not a huge collapse operation in progress in our back.
+ */
+ pmdval = READ_ONCE(*vmf->pmd);
+ if (!pmd_same(pmdval, vmf->orig_pmd))
+ goto fail;
+#endif
+ ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+ if (!pte)
+ pte = pte_offset_map(vmf->pmd, vmf->address);
+ /*
+ * Try locking the page table.
+ *
+ * Note that we might race against zap_pte_range() which
+ * invalidates TLBs while holding the page table lock.
+ * We are still under the speculative_page_walk_begin() section,
+ * and zap_pte_range() could thus deadlock with us if we tried
+ * using spin_lock() here.
+ *
+ * We also don't want to retry until spin_trylock() succeeds,
+ * because of the starvation potential against a stream of lockers.
+ */
+ if (unlikely(!spin_trylock(ptl)))
+ goto fail;
+ if (!mmap_seq_read_check(vmf->vma->vm_mm, vmf->seq))
+ goto unlock_fail;
+ speculative_page_walk_end();
+ vmf->pte = pte;
+ vmf->ptl = ptl;
+ return true;
+
+unlock_fail:
+ spin_unlock(ptl);
+fail:
+ if (pte)
+ pte_unmap(pte);
+ speculative_page_walk_end();
+ return false;
+}
+
#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */

/*
--
2.20.1

2022-02-01 09:36:35

On Wed, Jul 27, 2022 at 12:34 AM Pavan Kondeti
<[email protected]> wrote:
>
> On Fri, Jan 28, 2022 at 05:09:54AM -0800, Michel Lespinasse wrote:
> > Introduce mmu_notifier_lock as a per-mm percpu_rw_semaphore,
> > as well as the code to initialize and destroy it together with the mm.
> >
> > This lock will be used to prevent races between mmu_notifier_register()
> > and speculative fault handlers that need to fire MMU notifications
> > without holding any of the mmap or rmap locks.
> >
> > Signed-off-by: Michel Lespinasse <[email protected]>
> > ---
> > include/linux/mm_types.h | 6 +++++-
> > include/linux/mmu_notifier.h | 27 +++++++++++++++++++++++++--
> > kernel/fork.c | 3 ++-
> > 3 files changed, 32 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 305f05d2a4bc..f77e2dec038d 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -462,6 +462,7 @@ struct vm_area_struct {
> > } __randomize_layout;
> >
> > struct kioctx_table;
> > +struct percpu_rw_semaphore;
> > struct mm_struct {
> > struct {
> > struct vm_area_struct *mmap; /* list of VMAs */
> > @@ -608,7 +609,10 @@ struct mm_struct {
> > struct file __rcu *exe_file;
> > #ifdef CONFIG_MMU_NOTIFIER
> > struct mmu_notifier_subscriptions *notifier_subscriptions;
> > -#endif
> > +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> > + struct percpu_rw_semaphore *mmu_notifier_lock;
> > +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
> > +#endif /* CONFIG_MMU_NOTIFIER */
> > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
> > pgtable_t pmd_huge_pte; /* protected by page_table_lock */
> > #endif
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index 45fc2c81e370..ace76fe91c0c 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -6,6 +6,8 @@
> > #include <linux/spinlock.h>
> > #include <linux/mm_types.h>
> > #include <linux/mmap_lock.h>
> > +#include <linux/percpu-rwsem.h>
> > +#include <linux/slab.h>
> > #include <linux/srcu.h>
> > #include <linux/interval_tree.h>
> >
> > @@ -499,15 +501,35 @@ static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
> > __mmu_notifier_invalidate_range(mm, start, end);
> > }
> >
> > -static inline void mmu_notifier_subscriptions_init(struct mm_struct *mm)
> > +static inline bool mmu_notifier_subscriptions_init(struct mm_struct *mm)
> > {
> > +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> > + mm->mmu_notifier_lock = kzalloc(sizeof(struct percpu_rw_semaphore), GFP_KERNEL);
> > + if (!mm->mmu_notifier_lock)
> > + return false;
> > + if (percpu_init_rwsem(mm->mmu_notifier_lock)) {
> > + kfree(mm->mmu_notifier_lock);
> > + return false;
> > + }
> > +#endif
> > +
> > mm->notifier_subscriptions = NULL;
> > + return true;
> > }
> >
> > static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
> > {
> > if (mm_has_notifiers(mm))
> > __mmu_notifier_subscriptions_destroy(mm);
> > +
> > +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> > + if (!in_atomic()) {
> > + percpu_free_rwsem(mm->mmu_notifier_lock);
> > + kfree(mm->mmu_notifier_lock);
> > + } else {
> > + percpu_rwsem_async_destroy(mm->mmu_notifier_lock);
> > + }
> > +#endif
> > }
> >
>
> We have received a bug report from our customer running Android GKI kernel
> android-13-5.15 branch where this series is included. As the callstack [1]
> indicates, the non-atomic test it self is not sufficient to free the percpu
> rwsem.
>
> The scenario deduced from the callstack:
>
> - context switch on CPU#0 from 'A' to idle. idle thread took A's mm
>
> - 'A' later ran on another CPU and exited. A's mm has still reference.
>
> - Now CPU#0 is being hotplugged out. As part of this, idle thread's
> mm is switched (in idle_task_exit()) but its active_mm freeing is
> deferred to finish_cpu() which gets called later from the control processor
> (the thread which initiated the CPU hotplug). Please see the reasoning
> on why mmdrop() is not called in idle_task_exit() at
> commit bf2c59fce4074('sched/core: Fix illegal RCU from offline CPUs')
>
> - Now when finish_cpu() tries call percpu_free_rwsem() directly since we are
> not in atomic path but hotplug path where cpus_write_lock() called is causing
> the deadlock.
>
> I am not sure if there is a clean way other than freeing the per-cpu
> rwsemaphore asynchronously all the time.

Thanks for reporting this issue, Pavan. I think your suggestion of
doing unconditional async destruction of mmu_notifier_lock would be
fine here. percpu_rwsem_async_destroy has a bit of an overhead to
schedule that work but I don't think the exit path is too performance
critical to suffer from that. Michel, WDYT?

>
> [1]
>
> -001|context_switch(inline)
> -001|__schedule()
> -002|__preempt_count_sub(inline)
> -002|schedule()
> -003|_raw_spin_unlock_irq(inline)
> -003|spin_unlock_irq(inline)
> -003|percpu_rwsem_wait()
> -004|__preempt_count_add(inline)
> -004|__percpu_down_read()
> -005|percpu_down_read(inline)
> -005|cpus_read_lock() // trying to get cpu_hotplug_lock again
> -006|rcu_barrier()
> -007|rcu_sync_dtor()
> -008|mmu_notifier_subscriptions_destroy(inline)
> -008|__mmdrop()
> -009|mmdrop(inline)
> -009|finish_cpu()
> -010|cpuhp_invoke_callback()
> -011|cpuhp_invoke_callback_range(inline)
> -011|cpuhp_down_callbacks()
> -012|_cpu_down() // acquired cpu_hotplug_lock (write lock)
>
> Thanks,
> Pavan
>