2024-02-15 12:18:30

by Ryan Roberts

Subject: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

This is an RFC for a series that aims to reduce the cost and complexity of
ptep_get_lockless() for arm64 when supporting transparent contpte mappings [1].
The approach came from discussion with Mark and David [2].

It introduces a new helper, ptep_get_lockless_norecency(), which allows the
access and dirty bits in the returned pte to be incorrect. This relaxation
permits arm64's implementation to just read the single target pte, and avoids
having to iterate over the full contpte block to gather the access and dirty
bits, for the contpte case.

It turns out that none of the call sites using ptep_get_lockless() require
accurate access and dirty bit information, so we can convert those sites as
well, although a couple of places need care (see patches 2 and 3).
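
For reference, the read-then-recheck pattern that those call sites end up with
looks roughly like this (a sketch only, following the models described in
patches 2 and 3):

	orig_pte = ptep_get_lockless_norecency(ptep)
	...
	<lock>
	if (!pte_same_norecency(orig_pte, ptep_get(ptep)))
		// RACE!
	...
	<unlock>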

Arguably patch 3 is a bit fragile, given the wide accessibility of
vmf->orig_pte. So it might make sense to drop this patch and stick to using
ptep_get_lockless() in the page fault path. I'm keen to hear opinions.

I've chosen the name "recency" because it's shortish and somewhat descriptive,
and is already used in a couple of places to mean similar things (see MGLRU and
DAMON). I'm open to other names if anyone has better ideas.

If consensus is that this approach is generally acceptable, I intend to create a
series in the future to do a similar thing with ptep_get() -> ptep_get_norecency().

---
This series applies on top of [1].

[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/

Thanks,
Ryan

Ryan Roberts (4):
mm: Introduce ptep_get_lockless_norecency()
mm/gup: Use ptep_get_lockless_norecency()
mm/memory: Use ptep_get_lockless_norecency() for orig_pte
arm64/mm: Override ptep_get_lockless_norecency()

arch/arm64/include/asm/pgtable.h | 6 ++++
include/linux/pgtable.h | 55 ++++++++++++++++++++++++++++--
kernel/events/core.c | 2 +-
mm/gup.c | 7 ++--
mm/hugetlb.c | 2 +-
mm/khugepaged.c | 2 +-
mm/memory.c | 57 ++++++++++++++++++++------------
mm/swap_state.c | 2 +-
mm/swapfile.c | 2 +-
9 files changed, 102 insertions(+), 33 deletions(-)

--
2.25.1



2024-02-15 12:18:43

by Ryan Roberts

Subject: [RFC PATCH v1 1/4] mm: Introduce ptep_get_lockless_norecency()

With the introduction of contpte mapping support for arm64, that
architecture's implementation of ptep_get_lockless() has become very
complex due to the need to gather access and dirty bits from across all
of the ptes in the contpte block. This requires careful implementation
to ensure the returned value is consistent (because it's not possible to
read all ptes atomically), but even in the common case when there is no
racing modification, we have to read all ptes, which gives an ~O(n^2)
cost if the core-mm is iterating over a range, and performing a
ptep_get_lockless() on each pte.

Solve this by introducing ptep_get_lockless_norecency(), which does not
make any guarantees about access and dirty bits. Therefore it can simply
read the single target pte.

At the same time, convert all call sites that previously used
ptep_get_lockless() but don't care about access and dirty state.

We may want to do something similar for ptep_get() (i.e.
ptep_get_norecency()) in the future; it doesn't suffer from the consistency
problem because the PTL serializes it with any modifications, but it does
suffer the same O(n^2) cost.

Signed-off-by: Ryan Roberts <[email protected]>
---
include/linux/pgtable.h | 37 ++++++++++++++++++++++++++++++++++---
kernel/events/core.c | 2 +-
mm/hugetlb.c | 2 +-
mm/khugepaged.c | 2 +-
mm/memory.c | 2 +-
mm/swap_state.c | 2 +-
mm/swapfile.c | 2 +-
7 files changed, 40 insertions(+), 9 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index a36cf4e124b0..9dd40fdbd825 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -528,16 +528,47 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
#endif /* CONFIG_PGTABLE_LEVELS > 2 */
#endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */

-/*
- * We require that the PTE can be read atomically.
- */
#ifndef ptep_get_lockless
+/**
+ * ptep_get_lockless - Get a pte without holding the page table lock. Young and
+ * dirty bits are guaranteed to accurately reflect the state
+ * of the pte at the time of the call.
+ * @ptep: Page table pointer for pte to get.
+ *
+ * If young and dirty information is not required, use
+ * ptep_get_lockless_norecency() which can be faster on some architectures.
+ *
+ * May be overridden by the architecture; otherwise, implemented using
+ * ptep_get(), on the assumption that it is atomic.
+ *
+ * Context: Any.
+ */
static inline pte_t ptep_get_lockless(pte_t *ptep)
{
return ptep_get(ptep);
}
#endif

+#ifndef ptep_get_lockless_norecency
+/**
+ * ptep_get_lockless_norecency - Get a pte without holding the page table lock.
+ * Young and dirty bits may not be accurate.
+ * @ptep: Page table pointer for pte to get.
+ *
+ * Prefer this over ptep_get_lockless() when young and dirty information is not
+ * required since it can be faster on some architectures.
+ *
+ * May be overridden by the architecture; otherwise, implemented using the more
+ * precise ptep_get_lockless().
+ *
+ * Context: Any.
+ */
+static inline pte_t ptep_get_lockless_norecency(pte_t *ptep)
+{
+ return ptep_get_lockless(ptep);
+}
+#endif
+
#ifndef pmdp_get_lockless
static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
{
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f0f0f71213a1..27991312d635 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7583,7 +7583,7 @@ static u64 perf_get_pgtable_size(struct mm_struct *mm, unsigned long addr)
if (!ptep)
goto again;

- pte = ptep_get_lockless(ptep);
+ pte = ptep_get_lockless_norecency(ptep);
if (pte_present(pte))
size = pte_leaf_size(pte);
pte_unmap(ptep);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 68283e54c899..41dc44eb8454 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7517,7 +7517,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
}

if (pte) {
- pte_t pteval = ptep_get_lockless(pte);
+ pte_t pteval = ptep_get_lockless_norecency(pte);

BUG_ON(pte_present(pteval) && !pte_huge(pteval));
}
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2771fc043b3b..1a6c9ed8237a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1019,7 +1019,7 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
}
}

- vmf.orig_pte = ptep_get_lockless(pte);
+ vmf.orig_pte = ptep_get_lockless_norecency(pte);
if (!is_swap_pte(vmf.orig_pte))
continue;

diff --git a/mm/memory.c b/mm/memory.c
index 4dd8e35b593a..8e65fa1884f1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4353,7 +4353,7 @@ static bool pte_range_none(pte_t *pte, int nr_pages)
int i;

for (i = 0; i < nr_pages; i++) {
- if (!pte_none(ptep_get_lockless(pte + i)))
+ if (!pte_none(ptep_get_lockless_norecency(pte + i)))
return false;
}

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 2f540748f7c0..061c6c16c7ff 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -837,7 +837,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
if (!pte)
break;
}
- pentry = ptep_get_lockless(pte);
+ pentry = ptep_get_lockless_norecency(pte);
if (!is_swap_pte(pentry))
continue;
entry = pte_to_swp_entry(pentry);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index d1bd8d1e17bd..c414dd238814 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1857,7 +1857,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
break;
}

- ptent = ptep_get_lockless(pte);
+ ptent = ptep_get_lockless_norecency(pte);

if (!is_swap_pte(ptent))
continue;
--
2.25.1


2024-02-15 12:18:53

by Ryan Roberts

Subject: [RFC PATCH v1 2/4] mm/gup: Use ptep_get_lockless_norecency()

GUP needs to read ptes locklessly, so it uses ptep_get_lockless().
However, the returned access and dirty bits are unimportant, so let's
switch over to ptep_get_lockless_norecency().

The wrinkle is that gup needs to check that the pte hasn't changed once
it has pinned the folio following this model:

	pte = ptep_get_lockless_norecency(ptep)
	...
	if (!pte_same(pte, ptep_get_lockless(ptep)))
		// RACE!
	...

And now that pte may not contain correct access and dirty information,
the pte_same() comparison could spuriously fail. So let's introduce a
new pte_same_norecency() helper which will ignore the access and dirty
bits when doing the comparison.

Note that previously, ptep_get() was being used for the comparison; this
is technically incorrect because the PTL is not held. I've also
converted the comparison to use the preferred pmd_same() helper instead
of doing a raw value comparison.

As a side-effect, this new approach removes the possibility of
concurrent read/write to the page causing a spurious fast gup failure,
because the access and dirty bits are no longer used in the comparison.

Signed-off-by: Ryan Roberts <[email protected]>
---
include/linux/pgtable.h | 18 ++++++++++++++++++
mm/gup.c | 7 ++++---
2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 9dd40fdbd825..8123affa8baf 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -936,6 +936,24 @@ static inline int pte_same(pte_t pte_a, pte_t pte_b)
}
#endif

+/**
+ * pte_same_norecency - Compare pte_a and pte_b, ignoring young and dirty bits,
+ * if the ptes are present.
+ *
+ * @pte_a: First pte to compare.
+ * @pte_b: Second pte to compare.
+ *
+ * Returns 1 if the ptes match, else 0.
+ */
+static inline int pte_same_norecency(pte_t pte_a, pte_t pte_b)
+{
+ if (pte_present(pte_a))
+ pte_a = pte_mkold(pte_mkclean(pte_a));
+ if (pte_present(pte_b))
+ pte_b = pte_mkold(pte_mkclean(pte_b));
+ return pte_same(pte_a, pte_b);
+}
+
#ifndef __HAVE_ARCH_PTE_UNUSED
/*
* Some architectures provide facilities to virtualization guests
diff --git a/mm/gup.c b/mm/gup.c
index df83182ec72d..0f96d0a5ec09 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2576,7 +2576,7 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
if (!ptep)
return 0;
do {
- pte_t pte = ptep_get_lockless(ptep);
+ pte_t pte = ptep_get_lockless_norecency(ptep);
struct page *page;
struct folio *folio;

@@ -2617,8 +2617,9 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
goto pte_unmap;
}

- if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
- unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
+ if (unlikely(!pmd_same(pmd, *pmdp)) ||
+ unlikely(!pte_same_norecency(pte,
+ ptep_get_lockless_norecency(ptep)))) {
gup_put_folio(folio, 1, flags);
goto pte_unmap;
}
--
2.25.1


2024-02-15 12:19:09

by Ryan Roberts

Subject: [RFC PATCH v1 3/4] mm/memory: Use ptep_get_lockless_norecency() for orig_pte

Let's convert handle_pte_fault()'s use of ptep_get_lockless() to
ptep_get_lockless_norecency() when saving orig_pte.

There are a number of places that follow this model:

	orig_pte = ptep_get_lockless(ptep)
	...
	<lock>
	if (!pte_same(orig_pte, ptep_get(ptep)))
		// RACE!
	...
	<unlock>

So we need to be careful to convert all of those to use
pte_same_norecency() so that the access and dirty bits are excluded from
the comparison.

Additionally there are a couple of places that genuinely rely on the
access and dirty bits of orig_pte, but with some careful refactoring, we
can use ptep_get() once we are holding the lock to achieve equivalent
logic.

Signed-off-by: Ryan Roberts <[email protected]>
---
mm/memory.c | 55 +++++++++++++++++++++++++++++++++--------------------
1 file changed, 34 insertions(+), 21 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 8e65fa1884f1..075245314ec3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3014,7 +3014,7 @@ static inline int pte_unmap_same(struct vm_fault *vmf)
#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPTION)
if (sizeof(pte_t) > sizeof(unsigned long)) {
spin_lock(vmf->ptl);
- same = pte_same(ptep_get(vmf->pte), vmf->orig_pte);
+ same = pte_same_norecency(ptep_get(vmf->pte), vmf->orig_pte);
spin_unlock(vmf->ptl);
}
#endif
@@ -3062,11 +3062,14 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
* take a double page fault, so mark it accessed here.
*/
vmf->pte = NULL;
- if (!arch_has_hw_pte_young() && !pte_young(vmf->orig_pte)) {
+ if (!arch_has_hw_pte_young()) {
pte_t entry;

vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
- if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
+ if (likely(vmf->pte))
+ entry = ptep_get(vmf->pte);
+ if (unlikely(!vmf->pte ||
+ !pte_same_norecency(entry, vmf->orig_pte))) {
/*
* Other thread has already handled the fault
* and update local tlb only
@@ -3077,9 +3080,11 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
goto pte_unlock;
}

- entry = pte_mkyoung(vmf->orig_pte);
- if (ptep_set_access_flags(vma, addr, vmf->pte, entry, 0))
- update_mmu_cache_range(vmf, vma, addr, vmf->pte, 1);
+ if (!pte_young(entry)) {
+ entry = pte_mkyoung(entry);
+ if (ptep_set_access_flags(vma, addr, vmf->pte, entry, 0))
+ update_mmu_cache_range(vmf, vma, addr, vmf->pte, 1);
+ }
}

/*
@@ -3094,7 +3099,8 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,

/* Re-validate under PTL if the page is still mapped */
vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
- if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
+ if (unlikely(!vmf->pte ||
+ !pte_same_norecency(ptep_get(vmf->pte), vmf->orig_pte))) {
/* The PTE changed under us, update local tlb */
if (vmf->pte)
update_mmu_tlb(vma, addr, vmf->pte);
@@ -3369,14 +3375,17 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
* Re-check the pte - we dropped the lock
*/
vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
- if (likely(vmf->pte && pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
+ if (likely(vmf->pte))
+ entry = ptep_get(vmf->pte);
+ if (likely(vmf->pte && pte_same_norecency(entry, vmf->orig_pte))) {
if (old_folio) {
if (!folio_test_anon(old_folio)) {
dec_mm_counter(mm, mm_counter_file(old_folio));
inc_mm_counter(mm, MM_ANONPAGES);
}
} else {
- ksm_might_unmap_zero_page(mm, vmf->orig_pte);
+ /* Needs dirty bit so can't use vmf->orig_pte. */
+ ksm_might_unmap_zero_page(mm, entry);
inc_mm_counter(mm, MM_ANONPAGES);
}
flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
@@ -3494,7 +3503,7 @@ static vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf, struct folio *folio
* We might have raced with another page fault while we released the
* pte_offset_map_lock.
*/
- if (!pte_same(ptep_get(vmf->pte), vmf->orig_pte)) {
+ if (!pte_same_norecency(ptep_get(vmf->pte), vmf->orig_pte)) {
update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
pte_unmap_unlock(vmf->pte, vmf->ptl);
return VM_FAULT_NOPAGE;
@@ -3883,7 +3892,8 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)

vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
- if (likely(vmf->pte && pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
+ if (likely(vmf->pte &&
+ pte_same_norecency(ptep_get(vmf->pte), vmf->orig_pte)))
restore_exclusive_pte(vma, vmf->page, vmf->address, vmf->pte);

if (vmf->pte)
@@ -3928,7 +3938,7 @@ static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
* quickly from a PTE_MARKER_UFFD_WP into PTE_MARKER_POISONED.
* So is_pte_marker() check is not enough to safely drop the pte.
*/
- if (pte_same(vmf->orig_pte, ptep_get(vmf->pte)))
+ if (pte_same_norecency(vmf->orig_pte, ptep_get(vmf->pte)))
pte_clear(vmf->vma->vm_mm, vmf->address, vmf->pte);
pte_unmap_unlock(vmf->pte, vmf->ptl);
return 0;
@@ -4028,8 +4038,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
if (unlikely(!vmf->pte ||
- !pte_same(ptep_get(vmf->pte),
- vmf->orig_pte)))
+ !pte_same_norecency(ptep_get(vmf->pte),
+ vmf->orig_pte)))
goto unlock;

/*
@@ -4117,7 +4127,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
if (likely(vmf->pte &&
- pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
+ pte_same_norecency(ptep_get(vmf->pte),
+ vmf->orig_pte)))
ret = VM_FAULT_OOM;
goto unlock;
}
@@ -4187,7 +4198,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
- if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
+ if (unlikely(!vmf->pte ||
+ !pte_same_norecency(ptep_get(vmf->pte), vmf->orig_pte)))
goto out_nomap;

if (unlikely(!folio_test_uptodate(folio))) {
@@ -4747,7 +4759,7 @@ void set_pte_range(struct vm_fault *vmf, struct folio *folio,
static bool vmf_pte_changed(struct vm_fault *vmf)
{
if (vmf->flags & FAULT_FLAG_ORIG_PTE_VALID)
- return !pte_same(ptep_get(vmf->pte), vmf->orig_pte);
+ return !pte_same_norecency(ptep_get(vmf->pte), vmf->orig_pte);

return !pte_none(ptep_get(vmf->pte));
}
@@ -5125,7 +5137,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
* the pfn may be screwed if the read is non atomic.
*/
spin_lock(vmf->ptl);
- if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
+ if (unlikely(!pte_same_norecency(ptep_get(vmf->pte), vmf->orig_pte))) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
goto out;
}
@@ -5197,7 +5209,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
vmf->address, &vmf->ptl);
if (unlikely(!vmf->pte))
goto out;
- if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
+ if (unlikely(!pte_same_norecency(ptep_get(vmf->pte),
+ vmf->orig_pte))) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
goto out;
}
@@ -5343,7 +5356,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
vmf->address, &vmf->ptl);
if (unlikely(!vmf->pte))
return 0;
- vmf->orig_pte = ptep_get_lockless(vmf->pte);
+ vmf->orig_pte = ptep_get_lockless_norecency(vmf->pte);
vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;

if (pte_none(vmf->orig_pte)) {
@@ -5363,7 +5376,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)

spin_lock(vmf->ptl);
entry = vmf->orig_pte;
- if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) {
+ if (unlikely(!pte_same_norecency(ptep_get(vmf->pte), entry))) {
update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
goto unlock;
}
--
2.25.1


2024-02-15 12:19:19

by Ryan Roberts

Subject: [RFC PATCH v1 4/4] arm64/mm: Override ptep_get_lockless_norecency()

Override ptep_get_lockless_norecency() when CONFIG_ARM64_CONTPTE is
enabled. Because this API doesn't require the access and dirty bits to
be accurate, for the contpte case, we can avoid reading all ptes in the
contpte block to collect those bits, in contrast to ptep_get_lockless().

Signed-off-by: Ryan Roberts <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 401087e8a43d..c0e4ccf74714 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1287,6 +1287,12 @@ static inline pte_t ptep_get_lockless(pte_t *ptep)
return contpte_ptep_get_lockless(ptep);
}

+#define ptep_get_lockless_norecency ptep_get_lockless_norecency
+static inline pte_t ptep_get_lockless_norecency(pte_t *ptep)
+{
+ return __ptep_get(ptep);
+}
+
static inline void set_pte(pte_t *ptep, pte_t pte)
{
/*
--
2.25.1


2024-03-26 16:18:01

by David Hildenbrand

Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

On 15.02.24 13:17, Ryan Roberts wrote:
> This is an RFC for a series that aims to reduce the cost and complexity of
> ptep_get_lockless() for arm64 when supporting transparent contpte mappings [1].
> The approach came from discussion with Mark and David [2].
>
> It introduces a new helper, ptep_get_lockless_norecency(), which allows the
> access and dirty bits in the returned pte to be incorrect. This relaxation
> permits arm64's implementation to just read the single target pte, and avoids
> having to iterate over the full contpte block to gather the access and dirty
> bits, for the contpte case.
>
> It turns out that none of the call sites using ptep_get_lockless() require
> accurate access and dirty bit information, so we can also convert those sites.
> Although a couple of places need care (see patches 2 and 3).
>
> Arguably patch 3 is a bit fragile, given the wide accessibility of
> vmf->orig_pte. So it might make sense to drop this patch and stick to using
> ptep_get_lockless() in the page fault path. I'm keen to hear opinions.

Yes. Especially as we have these pte_same() checks that might just fail
now because of wrong accessed/dirty bits?

Likely, we just want to read "the real deal" on both sides of the
pte_same() handling.

>
> I've chosen the name "recency" because it's shortish and somewhat descriptive,
> and is alredy used in a couple of places to mean similar things (see mglru and
> damon). I'm open to other names if anyone has better ideas.

Not a native speaker; works for me.

>
> If concensus is that this approach is generally acceptable, I intend to create a
> series in future to do a similar thing with ptep_get() -> ptep_get_norecency().

Yes, sounds good to me.

--
Cheers,

David / dhildenb


2024-03-26 16:27:25

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH v1 1/4] mm: Introduce ptep_get_lockless_norecency()

On 15.02.24 13:17, Ryan Roberts wrote:
> With the introduction of contpte mapping support for arm64, that
> architecture's implementation of ptep_get_lockless() has become very
> complex due to the need to gather access and dirty bits from across all
> of the ptes in the contpte block. This requires careful implementation
> to ensure the returned value is consistent (because its not possible to
> read all ptes atomically), but even in the common case when there is no
> racing modification, we have to read all ptes, which gives an ~O(n^2)
> cost if the core-mm is iterating over a range, and performing a
> ptep_get_lockless() on each pte.
>
> Solve this by introducing ptep_get_lockless_norecency(), which does not
> make any guarantees about access and dirty bits. Therefore it can simply
> read the single target pte.
>
> At the same time, convert all call sites that previously used
> ptep_get_lockless() but don't care about access and dirty state.
>

I'd probably split that part off.

> We may want to do something similar for ptep_get() (i.e.
> ptep_get_norecency()) in future; it doesn't suffer from the consistency
> problem because the PTL serializes it with any modifications, but does
> suffer the same O(n^2) cost.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> include/linux/pgtable.h | 37 ++++++++++++++++++++++++++++++++++---
> kernel/events/core.c | 2 +-
> mm/hugetlb.c | 2 +-
> mm/khugepaged.c | 2 +-
> mm/memory.c | 2 +-
> mm/swap_state.c | 2 +-
> mm/swapfile.c | 2 +-
> 7 files changed, 40 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index a36cf4e124b0..9dd40fdbd825 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -528,16 +528,47 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
> #endif /* CONFIG_PGTABLE_LEVELS > 2 */
> #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */
>
> -/*
> - * We require that the PTE can be read atomically.
> - */
> #ifndef ptep_get_lockless
> +/**
> + * ptep_get_lockless - Get a pte without holding the page table lock. Young and
> + * dirty bits are guaranteed to accurately reflect the state
> + * of the pte at the time of the call.
> + * @ptep: Page table pointer for pte to get.
> + *
> + * If young and dirty information is not required, use
> + * ptep_get_lockless_norecency() which can be faster on some architectures.
> + *
> + * May be overridden by the architecture; otherwise, implemented using
> + * ptep_get(), on the assumption that it is atomic.
> + *
> + * Context: Any.
> + */

I think we usually say "Any context.". But I would just do it like idr.h:

"Any context. It is safe to call this function without locking in your
code."

... but is this true? We really want to say "without page table lock".
Because there must be some way to prevent against concurrent page table
freeing. For example, GUP-fast disables IRQs, whereby page table freeing
code frees using RCU.

> static inline pte_t ptep_get_lockless(pte_t *ptep)
> {
> return ptep_get(ptep);
> }
> #endif
>
> +#ifndef ptep_get_lockless_norecency
> +/**
> + * ptep_get_lockless_norecency - Get a pte without holding the page table lock.
> + * Young and dirty bits may not be accurate.
> + * @ptep: Page table pointer for pte to get.
> + *
> + * Prefer this over ptep_get_lockless() when young and dirty information is not
> + * required since it can be faster on some architectures.
> + *
> + * May be overridden by the architecture; otherwise, implemented using the more
> + * precise ptep_get_lockless().
> + *
> + * Context: Any.

Same comment.

> + */
> +static inline pte_t ptep_get_lockless_norecency(pte_t *ptep)
> +{
> + return ptep_get_lockless(ptep);
> +}
> +#endif

[...]

> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 68283e54c899..41dc44eb8454 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -7517,7 +7517,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> }
>
> if (pte) {
> - pte_t pteval = ptep_get_lockless(pte);
> + pte_t pteval = ptep_get_lockless_norecency(pte);
>
> BUG_ON(pte_present(pteval) && !pte_huge(pteval));
> }
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 2771fc043b3b..1a6c9ed8237a 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1019,7 +1019,7 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
> }
> }
>
> - vmf.orig_pte = ptep_get_lockless(pte);
> + vmf.orig_pte = ptep_get_lockless_norecency(pte);
> if (!is_swap_pte(vmf.orig_pte))
> continue;


Hm, I think you mentioned that we want to be careful with vmf.orig_pte.

--
Cheers,

David / dhildenb


2024-03-26 16:32:27

by Ryan Roberts

Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

On 26/03/2024 16:17, David Hildenbrand wrote:
> On 15.02.24 13:17, Ryan Roberts wrote:
>> This is an RFC for a series that aims to reduce the cost and complexity of
>> ptep_get_lockless() for arm64 when supporting transparent contpte mappings [1].
>> The approach came from discussion with Mark and David [2].
>>
>> It introduces a new helper, ptep_get_lockless_norecency(), which allows the
>> access and dirty bits in the returned pte to be incorrect. This relaxation
>> permits arm64's implementation to just read the single target pte, and avoids
>> having to iterate over the full contpte block to gather the access and dirty
>> bits, for the contpte case.
>>
>> It turns out that none of the call sites using ptep_get_lockless() require
>> accurate access and dirty bit information, so we can also convert those sites.
>> Although a couple of places need care (see patches 2 and 3).
>>
>> Arguably patch 3 is a bit fragile, given the wide accessibility of
>> vmf->orig_pte. So it might make sense to drop this patch and stick to using
>> ptep_get_lockless() in the page fault path. I'm keen to hear opinions.
>
> Yes. Especially as we have these pte_same() checks that might just fail now
> because of wrong accessed/dirty bits?

Which pte_same() checks are you referring to? I've changed them all to
pte_same_norecency() which ignores the access/dirty bits when doing the comparison.

>
> Likely, we just want to read "the real deal" on both sides of the pte_same()
> handling.

Sorry, I'm not sure I understand? You mean read the full pte including
access/dirty? That's the same as dropping the patch, right? Of course if we do
that, we still have to keep ptep_get_lockless() around for this case. In an
ideal world we would convert everything over to ptep_get_lockless_norecency()
and delete ptep_get_lockless() to remove the ugliness from arm64.

>
>>
>> I've chosen the name "recency" because it's shortish and somewhat descriptive,
>> and is alredy used in a couple of places to mean similar things (see mglru and
>> damon). I'm open to other names if anyone has better ideas.
>
> Not a native speaker; works for me.
>
>>
>> If concensus is that this approach is generally acceptable, I intend to create a
>> series in future to do a similar thing with ptep_get() -> ptep_get_norecency().
>
> Yes, sounds good to me.
>


2024-03-26 16:33:21

by David Hildenbrand

Subject: Re: [RFC PATCH v1 2/4] mm/gup: Use ptep_get_lockless_norecency()

On 15.02.24 13:17, Ryan Roberts wrote:
> Gup needs to read ptes locklessly, so it uses ptep_get_lockless().
> However, the returned access and dirty bits are unimportant so let's
> switch over to ptep_get_lockless_norecency().
>
> The wrinkle is that gup needs to check that the pte hasn't changed once
> it has pinned the folio following this model:
>
> pte = ptep_get_lockless_norecency(ptep)
> ...
> if (!pte_same(pte, ptep_get_lockless(ptep)))
> // RACE!
> ...
>
> And now that pte may not contain correct access and dirty information,
> the pte_same() comparison could spuriously fail. So let's introduce a
> new pte_same_norecency() helper which will ignore the access and dirty
> bits when doing the comparison.
>
> Note that previously, ptep_get() was being used for the comparison; this
> is technically incorrect because the PTL is not held. I've also
> converted the comparison to use the preferred pmd_same() helper instead
> of doing a raw value comparison.
>
> As a side-effect, this new approach removes the possibility of
> concurrent read/write to the page causing a spurious fast gup failure,
> because the access and dirty bits are no longer used in the comparison.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---

[...]

> #ifndef __HAVE_ARCH_PTE_UNUSED
> /*
> * Some architectures provide facilities to virtualization guests
> diff --git a/mm/gup.c b/mm/gup.c
> index df83182ec72d..0f96d0a5ec09 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2576,7 +2576,7 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
> if (!ptep)
> return 0;
> do {
> - pte_t pte = ptep_get_lockless(ptep);
> + pte_t pte = ptep_get_lockless_norecency(ptep);
> struct page *page;
> struct folio *folio;
>
> @@ -2617,8 +2617,9 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
> goto pte_unmap;
> }
>
> - if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
> - unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
> + if (unlikely(!pmd_same(pmd, *pmdp)) ||
> + unlikely(!pte_same_norecency(pte,
> + ptep_get_lockless_norecency(ptep)))) {
> gup_put_folio(folio, 1, flags);
> goto pte_unmap;

We pass the pte into pte_access_permitted(). It would be good to mention
that you checked all implementations.

Acked-by: David Hildenbrand <[email protected]>

--
Cheers,

David / dhildenb


2024-03-26 16:35:27

by David Hildenbrand

Subject: Re: [RFC PATCH v1 4/4] arm64/mm: Override ptep_get_lockless_norecency()

On 15.02.24 13:17, Ryan Roberts wrote:
> Override ptep_get_lockless_norecency() when CONFIG_ARM64_CONTPTE is
> enabled. Because this API doesn't require the access and dirty bits to
> be accurate, for the contpte case, we can avoid reading all ptes in the
> contpte block to collect those bits, in contrast to ptep_get_lockless().
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> arch/arm64/include/asm/pgtable.h | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 401087e8a43d..c0e4ccf74714 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1287,6 +1287,12 @@ static inline pte_t ptep_get_lockless(pte_t *ptep)
> return contpte_ptep_get_lockless(ptep);
> }
>
> +#define ptep_get_lockless_norecency ptep_get_lockless_norecency
> +static inline pte_t ptep_get_lockless_norecency(pte_t *ptep)
> +{
> + return __ptep_get(ptep);
> +}
> +
> static inline void set_pte(pte_t *ptep, pte_t pte)
> {
> /*
> --
> 2.25.1
>

Reviewed-by: David Hildenbrand <[email protected]>

--
Cheers,

David / dhildenb


2024-03-26 16:36:58

by David Hildenbrand

Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

On 26.03.24 17:31, Ryan Roberts wrote:
> On 26/03/2024 16:17, David Hildenbrand wrote:
>> On 15.02.24 13:17, Ryan Roberts wrote:
>>> This is an RFC for a series that aims to reduce the cost and complexity of
>>> ptep_get_lockless() for arm64 when supporting transparent contpte mappings [1].
>>> The approach came from discussion with Mark and David [2].
>>>
>>> It introduces a new helper, ptep_get_lockless_norecency(), which allows the
>>> access and dirty bits in the returned pte to be incorrect. This relaxation
>>> permits arm64's implementation to just read the single target pte, and avoids
>>> having to iterate over the full contpte block to gather the access and dirty
>>> bits, for the contpte case.
>>>
>>> It turns out that none of the call sites using ptep_get_lockless() require
>>> accurate access and dirty bit information, so we can also convert those sites.
>>> Although a couple of places need care (see patches 2 and 3).
>>>
>>> Arguably patch 3 is a bit fragile, given the wide accessibility of
>>> vmf->orig_pte. So it might make sense to drop this patch and stick to using
>>> ptep_get_lockless() in the page fault path. I'm keen to hear opinions.
>>
>> Yes. Especially as we have these pte_same() checks that might just fail now
>> because of wrong accessed/dirty bits?
>
> Which pte_same() checks are you referring to? I've changed them all to
> pte_same_norecency() which ignores the access/dirty bits when doing the comparison.

I'm reading the patches just now. So I stumbled over that just after I
wrote that, so I was missing that part from the description here.

>
>>
>> Likely, we just want to read "the real deal" on both sides of the pte_same()
>> handling.
>
> Sorry I'm not sure I understand? You mean read the full pte including
> access/dirty? That's the same as dropping the patch, right? Of course if we do
> that, we still have to keep pte_get_lockless() around for this case. In an ideal
> world we would convert everything over to ptep_get_lockless_norecency() and
> delete ptep_get_lockless() to remove the ugliness from arm64.

Yes, agreed. Patch #3 does not look too crazy and it wouldn't really
affect any architecture.

I do wonder if pte_same_norecency() should be defined per architecture
and the default would be pte_same(). So we could avoid the mkold etc on
all other architectures.

--
Cheers,

David / dhildenb


2024-03-26 16:39:37

by Ryan Roberts

Subject: Re: [RFC PATCH v1 1/4] mm: Introduce ptep_get_lockless_norecency()

On 26/03/2024 16:27, David Hildenbrand wrote:
> On 15.02.24 13:17, Ryan Roberts wrote:
>> With the introduction of contpte mapping support for arm64, that
>> architecture's implementation of ptep_get_lockless() has become very
>> complex due to the need to gather access and dirty bits from across all
>> of the ptes in the contpte block. This requires careful implementation
>> to ensure the returned value is consistent (because its not possible to
>> read all ptes atomically), but even in the common case when there is no
>> racing modification, we have to read all ptes, which gives an ~O(n^2)
>> cost if the core-mm is iterating over a range, and performing a
>> ptep_get_lockless() on each pte.
>>
>> Solve this by introducing ptep_get_lockless_norecency(), which does not
>> make any guarantees about access and dirty bits. Therefore it can simply
>> read the single target pte.
>>
>> At the same time, convert all call sites that previously used
>> ptep_get_lockless() but don't care about access and dirty state.
>>
>
> I'd probably split that part off.

I thought the general guidance was to introduce new APIs in the same patch they
are first used in? If I split this off, I'll have one patch for a new (unused)
API, then another for the first users.

>
>> We may want to do something similar for ptep_get() (i.e.
>> ptep_get_norecency()) in future; it doesn't suffer from the consistency
>> problem because the PTL serializes it with any modifications, but does
>> suffer the same O(n^2) cost.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> ---
>>   include/linux/pgtable.h | 37 ++++++++++++++++++++++++++++++++++---
>>   kernel/events/core.c    |  2 +-
>>   mm/hugetlb.c            |  2 +-
>>   mm/khugepaged.c         |  2 +-
>>   mm/memory.c             |  2 +-
>>   mm/swap_state.c         |  2 +-
>>   mm/swapfile.c           |  2 +-
>>   7 files changed, 40 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index a36cf4e124b0..9dd40fdbd825 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -528,16 +528,47 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
>>   #endif /* CONFIG_PGTABLE_LEVELS > 2 */
>>   #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */
>>
>> -/*
>> - * We require that the PTE can be read atomically.
>> - */
>>   #ifndef ptep_get_lockless
>> +/**
>> + * ptep_get_lockless - Get a pte without holding the page table lock. Young and
>> + *                     dirty bits are guaranteed to accurately reflect the state
>> + *                     of the pte at the time of the call.
>> + * @ptep: Page table pointer for pte to get.
>> + *
>> + * If young and dirty information is not required, use
>> + * ptep_get_lockless_norecency() which can be faster on some architectures.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented using
>> + * ptep_get(), on the assumption that it is atomic.
>> + *
>> + * Context: Any.
>> + */
>
> I think we usually say "Any context.". But I would just do it like idr.h:
>
> "Any context. It is safe to call this function without locking in your code."
>
> ... but is this true? We really want to say "without page table lock". Because
> there must be some way to prevent against concurrent page table freeing. For
> example, GUP-fast disables IRQs, whereby page table freeing code frees using RCU.

How about:

"
Context: Any context that guarantees the page table can't be freed
concurrently. The page table lock is not required.
"

>
>>   static inline pte_t ptep_get_lockless(pte_t *ptep)
>>   {
>>       return ptep_get(ptep);
>>   }
>>   #endif
>>
>> +#ifndef ptep_get_lockless_norecency
>> +/**
>> + * ptep_get_lockless_norecency - Get a pte without holding the page table lock.
>> + *                 Young and dirty bits may not be accurate.
>> + * @ptep: Page table pointer for pte to get.
>> + *
>> + * Prefer this over ptep_get_lockless() when young and dirty information is not
>> + * required since it can be faster on some architectures.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented using the more
>> + * precise ptep_get_lockless().
>> + *
>> + * Context: Any.
>
> Same comment.
>
>> + */
>> +static inline pte_t ptep_get_lockless_norecency(pte_t *ptep)
>> +{
>> +    return ptep_get_lockless(ptep);
>> +}
>> +#endif
>
> [...]
>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 68283e54c899..41dc44eb8454 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -7517,7 +7517,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct
>> vm_area_struct *vma,
>>       }
>>
>>       if (pte) {
>> -        pte_t pteval = ptep_get_lockless(pte);
>> +        pte_t pteval = ptep_get_lockless_norecency(pte);
>>
>>           BUG_ON(pte_present(pteval) && !pte_huge(pteval));
>>       }
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 2771fc043b3b..1a6c9ed8237a 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1019,7 +1019,7 @@ static int __collapse_huge_page_swapin(struct mm_struct
>> *mm,
>>               }
>>           }
>>
>> -        vmf.orig_pte = ptep_get_lockless(pte);
>> +        vmf.orig_pte = ptep_get_lockless_norecency(pte);
>>           if (!is_swap_pte(vmf.orig_pte))
>>               continue;
>
>
> Hm, I think you mentioned that we want to be careful with vmf.orig_pte.

Yeah, good point. So I guess this should move to patch 3 (which may be dropped -
TBD)?

2024-03-26 16:49:51

by Ryan Roberts

Subject: Re: [RFC PATCH v1 2/4] mm/gup: Use ptep_get_lockless_norecency()

On 26/03/2024 16:30, David Hildenbrand wrote:
> On 15.02.24 13:17, Ryan Roberts wrote:
>> Gup needs to read ptes locklessly, so it uses ptep_get_lockless().
>> However, the returned access and dirty bits are unimportant so let's
>> switch over to ptep_get_lockless_norecency().
>>
>> The wrinkle is that gup needs to check that the pte hasn't changed once
>> it has pinned the folio following this model:
>>
>>      pte = ptep_get_lockless_norecency(ptep)
>>      ...
>>      if (!pte_same(pte, ptep_get_lockless(ptep)))
>>              // RACE!
>>      ...
>>
>> And now that pte may not contain correct access and dirty information,
>> the pte_same() comparison could spuriously fail. So let's introduce a
>> new pte_same_norecency() helper which will ignore the access and dirty
>> bits when doing the comparison.
>>
>> Note that previously, ptep_get() was being used for the comparison; this
>> is technically incorrect because the PTL is not held. I've also
>> converted the comparison to use the preferred pmd_same() helper instead
>> of doing a raw value comparison.
>>
>> As a side-effect, this new approach removes the possibility of
>> concurrent read/write to the page causing a spurious fast gup failure,
>> because the access and dirty bits are no longer used in the comparison.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> ---
>
> [...]
>
>>   #ifndef __HAVE_ARCH_PTE_UNUSED
>>   /*
>>    * Some architectures provide facilities to virtualization guests
>> diff --git a/mm/gup.c b/mm/gup.c
>> index df83182ec72d..0f96d0a5ec09 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -2576,7 +2576,7 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp,
>> unsigned long addr,
>>       if (!ptep)
>>           return 0;
>>       do {
>> -        pte_t pte = ptep_get_lockless(ptep);
>> +        pte_t pte = ptep_get_lockless_norecency(ptep);
>>           struct page *page;
>>           struct folio *folio;
>>
>> @@ -2617,8 +2617,9 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp,
>> unsigned long addr,
>>               goto pte_unmap;
>>           }
>>
>> -        if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
>> -            unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
>> +        if (unlikely(!pmd_same(pmd, *pmdp)) ||
>> +            unlikely(!pte_same_norecency(pte,
>> +                    ptep_get_lockless_norecency(ptep)))) {
>>               gup_put_folio(folio, 1, flags);
>>               goto pte_unmap;
>
> We pass the pte into pte_access_permitted(). It would be good to mention that
> you checked all implementations.

TBH, I hadn't; I decided that since "inaccurate access/dirty bits" were only
possible on arm64, only arm64's implementation needed checking. But given
your comment, I just had a quick look at all the implementations. I didn't spot
any problems where any of them needs the access/dirty bits. I'll add this to
the commit log.

>
> Acked-by: David Hildenbrand <[email protected]>
>


2024-03-26 17:02:53

by Ryan Roberts

Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

On 26/03/2024 16:34, David Hildenbrand wrote:
> On 26.03.24 17:31, Ryan Roberts wrote:
>> On 26/03/2024 16:17, David Hildenbrand wrote:
>>> On 15.02.24 13:17, Ryan Roberts wrote:
>>>> This is an RFC for a series that aims to reduce the cost and complexity of
>>>> ptep_get_lockless() for arm64 when supporting transparent contpte mappings [1].
>>>> The approach came from discussion with Mark and David [2].
>>>>
>>>> It introduces a new helper, ptep_get_lockless_norecency(), which allows the
>>>> access and dirty bits in the returned pte to be incorrect. This relaxation
>>>> permits arm64's implementation to just read the single target pte, and avoids
>>>> having to iterate over the full contpte block to gather the access and dirty
>>>> bits, for the contpte case.
>>>>
>>>> It turns out that none of the call sites using ptep_get_lockless() require
>>>> accurate access and dirty bit information, so we can also convert those sites.
>>>> Although a couple of places need care (see patches 2 and 3).
>>>>
>>>> Arguably patch 3 is a bit fragile, given the wide accessibility of
>>>> vmf->orig_pte. So it might make sense to drop this patch and stick to using
>>>> ptep_get_lockless() in the page fault path. I'm keen to hear opinions.
>>>
>>> Yes. Especially as we have these pte_same() checks that might just fail now
>>> because of wrong accessed/dirty bits?
>>
>> Which pte_same() checks are you referring to? I've changed them all to
>> pte_same_norecency() which ignores the access/dirty bits when doing the
>> comparison.
>
> I'm reading the patches just now. So I stumbled over that just after I wrote
> that, so I was missing that part from the description here.
>
>>
>>>
>>> Likely, we just want to read "the real deal" on both sides of the pte_same()
>>> handling.
>>
>> Sorry I'm not sure I understand? You mean read the full pte including
>> access/dirty? That's the same as dropping the patch, right? Of course if we do
>> that, we still have to keep pte_get_lockless() around for this case. In an ideal
>> world we would convert everything over to ptep_get_lockless_norecency() and
>> delete ptep_get_lockless() to remove the ugliness from arm64.
>
> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect any
> architecture.
>
> I do wonder if pte_same_norecency() should be defined per architecture and the
> default would be pte_same(). So we could avoid the mkold etc on all other
> architectures.

Wouldn't that break its semantics? The "norecency" of
ptep_get_lockless_norecency() means "recency information in the returned pte may
be incorrect". But the "norecency" of pte_same_norecency() means "ignore the
access and dirty bits when you do the comparison".

I think you could only do the optimization you describe if you required that
pte_same_norecency() would only be given values returned by
ptep_get_lockless_norecency() (or ptep_get_norecency()). Even then, it's not
quite the same; if a page is accessed between the gets, one will return true and
the other false.



2024-03-26 17:04:45

by David Hildenbrand

Subject: Re: [RFC PATCH v1 3/4] mm/memory: Use ptep_get_lockless_norecency() for orig_pte

On 15.02.24 13:17, Ryan Roberts wrote:
> Let's convert handle_pte_fault()'s use of ptep_get_lockless() to
> ptep_get_lockless_norecency() to save orig_pte.
>
> There are a number of places that follow this model:
>
> orig_pte = ptep_get_lockless(ptep)
> ...
> <lock>
> if (!pte_same(orig_pte, ptep_get(ptep)))
> // RACE!
> ...
> <unlock>
>
> So we need to be careful to convert all of those to use
> pte_same_norecency() so that the access and dirty bits are excluded from
> the comparison.
>
> Additionally there are a couple of places that genuinely rely on the
> access and dirty bits of orig_pte, but with some careful refactoring, we
> can use ptep_get() once we are holding the lock to achieve equivalent
> logic.

We really should document that changed behavior somewhere where it can
be easily found: that orig_pte might have incomplete/stale
accessed/dirty information.


> @@ -5343,7 +5356,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> vmf->address, &vmf->ptl);
> if (unlikely(!vmf->pte))
> return 0;
> - vmf->orig_pte = ptep_get_lockless(vmf->pte);
> + vmf->orig_pte = ptep_get_lockless_norecency(vmf->pte);
> vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
>
> if (pte_none(vmf->orig_pte)) {
> @@ -5363,7 +5376,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>
> spin_lock(vmf->ptl);
> entry = vmf->orig_pte;
> - if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) {
> + if (unlikely(!pte_same_norecency(ptep_get(vmf->pte), entry))) {
> update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
> goto unlock;

I was wondering about the following:

Assume the PTE is not dirty.

Thread 1 does

vmf->orig_pte = ptep_get_lockless_norecency(vmf->pte)
/* not dirty */

/* Now, thread 2 ends up setting the PTE dirty under PT lock. */

spin_lock(vmf->ptl);
entry = vmf->orig_pte;
if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) {
...
}
..
entry = pte_mkyoung(entry);
if (ptep_set_access_flags(vmf->vma, ...)
..
pte_unmap_unlock(vmf->pte, vmf->ptl);


Generic ptep_set_access_flags() will do another pte_same() check and
realize "hey, there was a change!" let's update the PTE!

set_pte_at(vma->vm_mm, address, ptep, entry);

would overwrite the dirty bit set by thread 2.

--
Cheers,

David / dhildenb


2024-03-26 17:07:15

by David Hildenbrand

Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

>>>>
>>>> Likely, we just want to read "the real deal" on both sides of the pte_same()
>>>> handling.
>>>
>>> Sorry I'm not sure I understand? You mean read the full pte including
>>> access/dirty? That's the same as dropping the patch, right? Of course if we do
>>> that, we still have to keep pte_get_lockless() around for this case. In an ideal
>>> world we would convert everything over to ptep_get_lockless_norecency() and
>>> delete ptep_get_lockless() to remove the ugliness from arm64.
>>
>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect any
>> architecture.
>>
>> I do wonder if pte_same_norecency() should be defined per architecture and the
>> default would be pte_same(). So we could avoid the mkold etc on all other
>> architectures.
>
> Wouldn't that break it's semantics? The "norecency" of
> ptep_get_lockless_norecency() means "recency information in the returned pte may
> be incorrect". But the "norecency" of pte_same_norecency() means "ignore the
> access and dirty bits when you do the comparison".

My idea was that ptep_get_lockless_norecency() would return the actual
result on these architectures. So e.g., on x86, there would be no actual
change in generated code.

But yes, the documentation of these functions would have to be improved.

Now I wonder if ptep_get_lockless_norecency() should actively clear
dirty/accessed bits to more easily find any actual issues where the bits
still matter ...
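
E.g. a hypothetical debug-oriented default along these lines (only a sketch,
not a proposal):

static inline pte_t ptep_get_lockless_norecency(pte_t *ptep)
{
	pte_t pte = ptep_get_lockless(ptep);

	/* Deliberately strip recency so callers that still rely on it show up. */
	if (pte_present(pte))
		pte = pte_mkold(pte_mkclean(pte));
	return pte;
}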

>
> I think you could only do the optimization you describe if you required that
> pte_same_norecency() would only be given values returned by
> ptep_get_lockless_norecency() (or ptep_get_norecency()). Even then, its not
> quite the same; if a page is accessed between gets one will return true and the
> other false.

Right.

--
Cheers,

David / dhildenb


2024-03-26 17:27:51

by Ryan Roberts

Subject: Re: [RFC PATCH v1 3/4] mm/memory: Use ptep_get_lockless_norecency() for orig_pte

On 26/03/2024 17:02, David Hildenbrand wrote:
> On 15.02.24 13:17, Ryan Roberts wrote:
>> Let's convert handle_pte_fault()'s use of ptep_get_lockless() to
>> ptep_get_lockless_norecency() to save orig_pte.
>>
>> There are a number of places that follow this model:
>>
>>      orig_pte = ptep_get_lockless(ptep)
>>      ...
>>      <lock>
>>      if (!pte_same(orig_pte, ptep_get(ptep)))
>>              // RACE!
>>      ...
>>      <unlock>
>>
>> So we need to be careful to convert all of those to use
>> pte_same_norecency() so that the access and dirty bits are excluded from
>> the comparison.
>>
>> Additionally there are a couple of places that genuinely rely on the
>> access and dirty bits of orig_pte, but with some careful refactoring, we
>> can use ptep_get() once we are holding the lock to achieve equivalent
>> logic.
>
> We really should document that changed behavior somewhere where it can be easily
> found: that orig_pte might have incomplete/stale accessed/dirty information.

I could add it to the orig_pte definition in the `struct vm_fault`?

>
>
>> @@ -5343,7 +5356,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>>                            vmf->address, &vmf->ptl);
>>           if (unlikely(!vmf->pte))
>>               return 0;
>> -        vmf->orig_pte = ptep_get_lockless(vmf->pte);
>> +        vmf->orig_pte = ptep_get_lockless_norecency(vmf->pte);
>>           vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
>>
>>           if (pte_none(vmf->orig_pte)) {
>> @@ -5363,7 +5376,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>>
>>       spin_lock(vmf->ptl);
>>       entry = vmf->orig_pte;
>> -    if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) {
>> +    if (unlikely(!pte_same_norecency(ptep_get(vmf->pte), entry))) {
>>           update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
>>           goto unlock;
>
> I was wondering about the following:
>
> Assume the PTE is not dirty.
>
> Thread 1 does

Sorry, I'm not sure what threads have to do with this? How is the vmf shared
between threads? What have I misunderstood...

>
> vmf->orig_pte = ptep_get_lockless_norecency(vmf->pte)
> /* not dirty */
>
> /* Now, thread 2 ends up setting the PTE dirty under PT lock. */
>
> spin_lock(vmf->ptl);
> entry = vmf->orig_pte;
> if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) {
>     ...
> }
> ...
> entry = pte_mkyoung(entry);

Do you mean pte_mkdirty() here? You're talking about dirty everywhere else.

> if (ptep_set_access_flags(vmf->vma, ...)
> ...
> pte_unmap_unlock(vmf->pte, vmf->ptl);
>
>
> Generic ptep_set_access_flags() will do another pte_same() check and realize
> "hey, there was a change!" let's update the PTE!
>
> set_pte_at(vma->vm_mm, address, ptep, entry);

This is called from the generic ptep_set_access_flags() in your example, right?

>
> would overwrite the dirty bit set by thread 2.

I'm not really sure what you are getting at... Is your concern that there is a
race where the page could become dirty in the meantime and it now gets lost? I
think that's why arm64 overrides ptep_set_access_flags(); since the HW can
update access/dirty, we have to deal with the races.


2024-03-26 17:33:03

by Ryan Roberts

Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

On 26/03/2024 17:04, David Hildenbrand wrote:
>>>>>
>>>>> Likely, we just want to read "the real deal" on both sides of the pte_same()
>>>>> handling.
>>>>
>>>> Sorry I'm not sure I understand? You mean read the full pte including
>>>> access/dirty? That's the same as dropping the patch, right? Of course if we do
>>>> that, we still have to keep pte_get_lockless() around for this case. In an
>>>> ideal
>>>> world we would convert everything over to ptep_get_lockless_norecency() and
>>>> delete ptep_get_lockless() to remove the ugliness from arm64.
>>>
>>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect any
>>> architecture.
>>>
>>> I do wonder if pte_same_norecency() should be defined per architecture and the
>>> default would be pte_same(). So we could avoid the mkold etc on all other
>>> architectures.
>>
>> Wouldn't that break it's semantics? The "norecency" of
>> ptep_get_lockless_norecency() means "recency information in the returned pte may
>> be incorrect". But the "norecency" of pte_same_norecency() means "ignore the
>> access and dirty bits when you do the comparison".
>
> My idea was that ptep_get_lockless_norecency() would return the actual result on
> these architectures. So e.g., on x86, there would be no actual change in
> generated code.

I think this is a bad plan... You'll end up with subtle differences between
architectures.

>
> But yes, the documentation of these functions would have to be improved.
>
> Now I wonder if ptep_get_lockless_norecency() should actively clear
> dirty/accessed bits to more easily find any actual issues where the bits still
> matter ...

I did a version that took that approach, but decided it was not as good as this
way. Now, for the life of me, I can't remember my reasoning.

>
>>
>> I think you could only do the optimization you describe if you required that
>> pte_same_norecency() would only be given values returned by
>> ptep_get_lockless_norecency() (or ptep_get_norecency()). Even then, its not
>> quite the same; if a page is accessed between gets one will return true and the
>> other false.
>
> Right.
>


2024-03-26 17:52:37

by Ryan Roberts

Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

On 26/03/2024 17:39, David Hildenbrand wrote:
> On 26.03.24 18:32, Ryan Roberts wrote:
>> On 26/03/2024 17:04, David Hildenbrand wrote:
>>>>>>>
>>>>>>> Likely, we just want to read "the real deal" on both sides of the pte_same()
>>>>>>> handling.
>>>>>>
>>>>>> Sorry I'm not sure I understand? You mean read the full pte including
>>>>>> access/dirty? That's the same as dropping the patch, right? Of course if
>>>>>> we do
>>>>>> that, we still have to keep pte_get_lockless() around for this case. In an
>>>>>> ideal
>>>>>> world we would convert everything over to ptep_get_lockless_norecency() and
>>>>>> delete ptep_get_lockless() to remove the ugliness from arm64.
>>>>>
>>>>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect
>>>>> any
>>>>> architecture.
>>>>>
>>>>> I do wonder if pte_same_norecency() should be defined per architecture and the
>>>>> default would be pte_same(). So we could avoid the mkold etc on all other
>>>>> architectures.
>>>>
>>>> Wouldn't that break it's semantics? The "norecency" of
>>>> ptep_get_lockless_norecency() means "recency information in the returned pte
>>>> may
>>>> be incorrect". But the "norecency" of pte_same_norecency() means "ignore the
>>>> access and dirty bits when you do the comparison".
>>>
>>> My idea was that ptep_get_lockless_norecency() would return the actual result on
>>> these architectures. So e.g., on x86, there would be no actual change in
>>> generated code.
>>
>> I think this is a bad plan... You'll end up with subtle differences between
>> architectures.
>>
>>>
>>> But yes, the documentation of these functions would have to be improved.
>>>
>>> Now I wonder if ptep_get_lockless_norecency() should actively clear
>>> dirty/accessed bits to more easily find any actual issues where the bits still
>>> matter ...
>>
>> I did a version that took that approach. Decided it was not as good as this way
>> though. Now for the life of me, I can't remember my reasoning.
>
> Maybe because there are some code paths that check accessed/dirty without
> "correctness" implications? For example, if the PTE is already dirty, no need to
> set it dirty etc?

I think I decided I was penalizing the architectures that don't care because all
their ptep_get_norecency() and ptep_get_lockless_norecency() need to explicitly
clear access/dirty. And I would have needed ptep_get_norecency() from day 1 so
that I could feed its result into pte_same().
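
(For reference, the dropped variant would have been roughly this -- a
hypothetical sketch, not the code I actually had:)

static inline pte_t ptep_get_norecency(pte_t *ptep)
{
        /* Actively strip the recency bits so nobody can rely on them. */
        return pte_mkold(pte_mkclean(ptep_get(ptep)));
}

static inline pte_t ptep_get_lockless_norecency(pte_t *ptep)
{
        return pte_mkold(pte_mkclean(ptep_get_lockless(ptep)));
}

i.e. every architecture pays for the pte_mkold()/pte_mkclean() on every call,
even those whose ptep_get() is already a single atomic read.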


2024-03-26 17:59:23

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH v1 3/4] mm/memory: Use ptep_get_lockless_norecency() for orig_pte

>>>>
>>>> vmf->orig_pte = ptep_get_lockless_norecency(vmf->pte)
>>>> /* not dirty */
>>>>
>>>> /* Now, thread 2 ends up setting the PTE dirty under PT lock. */
>
> Ahh, this comment about thread 2 is not referring to the code immediately below
> it. It all makes much more sense now. :)

Sorry :)

>
>>>>
>>>> spin_lock(vmf->ptl);
>>>> entry = vmf->orig_pte;
>>>> if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) {
>>>>      ...
>>>> }
>>>> ...
>>>> entry = pte_mkyoung(entry);
>>>
>>> Do you mean pte_mkdirty() here? You're talking about dirty everywhere else.
>>
>> No, that is just thread 1 seeing "oh, nothing to do" and then goes ahead and
>> unconditionally does that in handle_pte_fault().
>>
>>>
>>>> if (ptep_set_access_flags(vmf->vma, ...)
>>>> ...
>>>> pte_unmap_unlock(vmf->pte, vmf->ptl);
>>>>
>>>>
>>>> Generic ptep_set_access_flags() will do another pte_same() check and realize
>>>> "hey, there was a change!" let's update the PTE!
>>>>
>>>> set_pte_at(vma->vm_mm, address, ptep, entry);
>>>
>>> This is called from the generic ptep_set_access_flags() in your example, right?
>>>
>>
>> Yes.
>>
>>>>
>>>> would overwrite the dirty bit set by thread 2.
>>>
>>> I'm not really sure what you are getting at... Is your concern that there is a
>>> race where the page could become dirty in the meantime and it now gets lost? I
>>> think that's why arm64 overrides ptep_set_access_flags(); since the hw can
>>> update access/dirty we have to deal with the races.
>>
>> My concern is that your patch can in subtle ways lead to use losing PTE dirty
>> bits on architectures that don't have the HW-managed dirty bit. They do exist ;)
>
> But I think the example you give can already happen today? Thread 1 reads
> orig_pte = ptep_get_lockless(). So that's already racy, if thread 2 is going to
> set dirty just after the get, then thread 1 is going to set the PTE back to (a
> modified version of) orig_pte. Isn't it already broken?

No, because the pte_same() check under PTL would have detected it, and
we would have backed out. And I think the problem comes to life when we
convert pte_same()->pte_same_norecency(), because we fail to protect PTE
access/dirty changes that happened under PTL from another thread.

But could be I am missing something :)

>> Arm64 should be fine in that regard.
>>
>
> There is plenty of arm64 HW that doesn't do HW access/dirty update. But our
> ptep_set_access_flags() can always deal with a racing update, even if that
> update originates from SW.
>
> Why do I have the feeling you're about to explain (very patiently) exactly why
> I'm wrong?... :)

heh ... or you'll tell me (very patiently) why I am wrong :)

--
Cheers,

David / dhildenb


2024-03-26 18:04:08

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH v1 3/4] mm/memory: Use ptep_get_lockless_norecency() for orig_pte

On 26.03.24 18:27, Ryan Roberts wrote:
> On 26/03/2024 17:02, David Hildenbrand wrote:
>> On 15.02.24 13:17, Ryan Roberts wrote:
>>> Let's convert handle_pte_fault()'s use of ptep_get_lockless() to
>>> ptep_get_lockless_norecency() to save orig_pte.
>>>
>>> There are a number of places that follow this model:
>>>
>>>      orig_pte = ptep_get_lockless(ptep)
>>>      ...
>>>      <lock>
>>>      if (!pte_same(orig_pte, ptep_get(ptep)))
>>>              // RACE!
>>>      ...
>>>      <unlock>
>>>
>>> So we need to be careful to convert all of those to use
>>> pte_same_norecency() so that the access and dirty bits are excluded from
>>> the comparison.
>>>
>>> Additionally there are a couple of places that genuinely rely on the
>>> access and dirty bits of orig_pte, but with some careful refactoring, we
>>> can use ptep_get() once we are holding the lock to achieve equivalent
>>> logic.
>>
>> We really should document that changed behavior somewhere where it can be easily
>> found: that orig_pte might have incomplete/stale accessed/dirty information.
>
> I could add it to the orig_pte definition in the `struct vm_fault`?
>
>>
>>
>>> @@ -5343,7 +5356,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>>>                            vmf->address, &vmf->ptl);
>>>           if (unlikely(!vmf->pte))
>>>               return 0;
>>> -        vmf->orig_pte = ptep_get_lockless(vmf->pte);
>>> +        vmf->orig_pte = ptep_get_lockless_norecency(vmf->pte);
>>>           vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
>>>
>>>           if (pte_none(vmf->orig_pte)) {
>>> @@ -5363,7 +5376,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>>>
>>>       spin_lock(vmf->ptl);
>>>       entry = vmf->orig_pte;
>>> -    if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) {
>>> +    if (unlikely(!pte_same_norecency(ptep_get(vmf->pte), entry))) {
>>>           update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
>>>           goto unlock;
>>
>> I was wondering about the following:
>>
>> Assume the PTE is not dirty.
>>
>> Thread 1 does
>
> Sorry not sure what threads have to do with this? How is the vmf shared between
> threads? What have I misunderstood...

Assume we have a HW that does not have HW-managed access/dirty bits. One
that ends up using generic ptep_set_access_flags(). Access/dirty bits
are always updated under PT lock.

Then, imagine two threads. One is in the fault path here; another
thread performs some other magic that sets the PTE dirty under the PTL.

>
>>
>> vmf->orig_pte = ptep_get_lockless_norecency(vmf->pte)
>> /* not dirty */
>>
>> /* Now, thread 2 ends up setting the PTE dirty under PT lock. */
>>
>> spin_lock(vmf->ptl);
>> entry = vmf->orig_pte;
>> if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) {
>>     ...
>> }
>> ...
>> entry = pte_mkyoung(entry);
>
> Do you mean pte_mkdirty() here? You're talking about dirty everywhere else.

No, that is just thread 1 seeing "oh, nothing to do"; it then goes ahead
and unconditionally does that in handle_pte_fault().

>
>> if (ptep_set_access_flags(vmf->vma, ...)
>> ...
>> pte_unmap_unlock(vmf->pte, vmf->ptl);
>>
>>
>> Generic ptep_set_access_flags() will do another pte_same() check and realize
>> "hey, there was a change!" let's update the PTE!
>>
>> set_pte_at(vma->vm_mm, address, ptep, entry);
>
> This is called from the generic ptep_set_access_flags() in your example, right?
>

Yes.

>>
>> would overwrite the dirty bit set by thread 2.
>
> I'm not really sure what you are getting at... Is your concern that there is a
> race where the page could become dirty in the meantime and it now gets lost? I
> think that's why arm64 overrides ptep_set_access_flags(); since the hw can
> update access/dirty we have to deal with the races.

My concern is that your patch can in subtle ways lead to us losing PTE
dirty bits on architectures that don't have the HW-managed dirty bit.
They do exist ;)
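
To make it concrete, the generic helper is roughly this shape (simplified,
from memory -- see mm/pgtable-generic.c for the real thing):

int ptep_set_access_flags(struct vm_area_struct *vma, unsigned long address,
                          pte_t *ptep, pte_t entry, int dirty)
{
        int changed = !pte_same(ptep_get(ptep), entry);

        if (changed) {
                /* Writes 'entry' back wholesale, including its recency bits. */
                set_pte_at(vma->vm_mm, address, ptep, entry);
                flush_tlb_fix_spurious_fault(vma, address, ptep);
        }
        return changed;
}

If 'entry' started life as a norecency read that never saw the dirty bit, the
pte_same() check above reports a change and set_pte_at() writes the stale
entry back wholesale.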

Arm64 should be fine in that regard.

--
Cheers,

David / dhildenb


2024-03-26 18:04:33

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

On 26.03.24 18:32, Ryan Roberts wrote:
> On 26/03/2024 17:04, David Hildenbrand wrote:
>>>>>>
>>>>>> Likely, we just want to read "the real deal" on both sides of the pte_same()
>>>>>> handling.
>>>>>
>>>>> Sorry I'm not sure I understand? You mean read the full pte including
>>>>> access/dirty? That's the same as dropping the patch, right? Of course if we do
>>>>> that, we still have to keep pte_get_lockless() around for this case. In an
>>>>> ideal
>>>>> world we would convert everything over to ptep_get_lockless_norecency() and
>>>>> delete ptep_get_lockless() to remove the ugliness from arm64.
>>>>
>>>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect any
>>>> architecture.
>>>>
>>>> I do wonder if pte_same_norecency() should be defined per architecture and the
>>>> default would be pte_same(). So we could avoid the mkold etc on all other
>>>> architectures.
>>>
>>> Wouldn't that break it's semantics? The "norecency" of
>>> ptep_get_lockless_norecency() means "recency information in the returned pte may
>>> be incorrect". But the "norecency" of pte_same_norecency() means "ignore the
>>> access and dirty bits when you do the comparison".
>>
>> My idea was that ptep_get_lockless_norecency() would return the actual result on
>> these architectures. So e.g., on x86, there would be no actual change in
>> generated code.
>
> I think this is a bad plan... You'll end up with subtle differences between
> architectures.
>
>>
>> But yes, the documentation of these functions would have to be improved.
>>
>> Now I wonder if ptep_get_lockless_norecency() should actively clear
>> dirty/accessed bits to more easily find any actual issues where the bits still
>> matter ...
>
> I did a version that took that approach. Decided it was not as good as this way
> though. Now for the life of me, I can't remember my reasoning.

Maybe because there are some code paths that check accessed/dirty
without "correctness" implications? For example, if the PTE is already
dirty, no need to set it dirty etc?

--
Cheers,

David / dhildenb


2024-03-26 18:10:25

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH v1 3/4] mm/memory: Use ptep_get_lockless_norecency() for orig_pte

On 26/03/2024 17:38, David Hildenbrand wrote:
> On 26.03.24 18:27, Ryan Roberts wrote:
>> On 26/03/2024 17:02, David Hildenbrand wrote:
>>> On 15.02.24 13:17, Ryan Roberts wrote:
>>>> Let's convert handle_pte_fault()'s use of ptep_get_lockless() to
>>>> ptep_get_lockless_norecency() to save orig_pte.
>>>>
>>>> There are a number of places that follow this model:
>>>>
>>>>       orig_pte = ptep_get_lockless(ptep)
>>>>       ...
>>>>       <lock>
>>>>       if (!pte_same(orig_pte, ptep_get(ptep)))
>>>>               // RACE!
>>>>       ...
>>>>       <unlock>
>>>>
>>>> So we need to be careful to convert all of those to use
>>>> pte_same_norecency() so that the access and dirty bits are excluded from
>>>> the comparison.
>>>>
>>>> Additionally there are a couple of places that genuinely rely on the
>>>> access and dirty bits of orig_pte, but with some careful refactoring, we
>>>> can use ptep_get() once we are holding the lock to achieve equivalent
>>>> logic.
>>>
>>> We really should document that changed behavior somewhere where it can be easily
>>> found: that orig_pte might have incomplete/stale accessed/dirty information.
>>
>> I could add it to the orig_pte definition in the `struct vm_fault`?
>>
>>>
>>>
>>>> @@ -5343,7 +5356,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>>>>                             vmf->address, &vmf->ptl);
>>>>            if (unlikely(!vmf->pte))
>>>>                return 0;
>>>> -        vmf->orig_pte = ptep_get_lockless(vmf->pte);
>>>> +        vmf->orig_pte = ptep_get_lockless_norecency(vmf->pte);
>>>>            vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
>>>>
>>>>            if (pte_none(vmf->orig_pte)) {
>>>> @@ -5363,7 +5376,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>>>>
>>>>        spin_lock(vmf->ptl);
>>>>        entry = vmf->orig_pte;
>>>> -    if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) {
>>>> +    if (unlikely(!pte_same_norecency(ptep_get(vmf->pte), entry))) {
>>>>            update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
>>>>            goto unlock;
>>>
>>> I was wondering about the following:
>>>
>>> Assume the PTE is not dirty.
>>>
>>> Thread 1 does
>>
>> Sorry not sure what threads have to do with this? How is the vmf shared between
>> threads? What have I misunderstood...
>
> Assume we have a HW that does not have HW-managed access/dirty bits. One that
> ends up using generic ptep_set_access_flags(). Access/dirty bits are always
> updated under PT lock.
>
> Then, imagine two threads. One is the the fault path here. another thread
> performs some other magic that sets the PTE dirty under PTL.
>
>>
>>>
>>> vmf->orig_pte = ptep_get_lockless_norecency(vmf->pte)
>>> /* not dirty */
>>>
>>> /* Now, thread 2 ends up setting the PTE dirty under PT lock. */

Ahh, this comment about thread 2 is not referring to the code immediately below
it. It all makes much more sense now. :)

>>>
>>> spin_lock(vmf->ptl);
>>> entry = vmf->orig_pte;
>>> if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) {
>>>      ...
>>> }
>>> ...
>>> entry = pte_mkyoung(entry);
>>
>> Do you mean pte_mkdirty() here? You're talking about dirty everywhere else.
>
> No, that is just thread 1 seeing "oh, nothing to do" and then goes ahead and
> unconditionally does that in handle_pte_fault().
>
>>
>>> if (ptep_set_access_flags(vmf->vma, ...)
>>> ...
>>> pte_unmap_unlock(vmf->pte, vmf->ptl);
>>>
>>>
>>> Generic ptep_set_access_flags() will do another pte_same() check and realize
>>> "hey, there was a change!" let's update the PTE!
>>>
>>> set_pte_at(vma->vm_mm, address, ptep, entry);
>>
>> This is called from the generic ptep_set_access_flags() in your example, right?
>>
>
> Yes.
>
>>>
>>> would overwrite the dirty bit set by thread 2.
>>
>> I'm not really sure what you are getting at... Is your concern that there is a
>> race where the page could become dirty in the meantime and it now gets lost? I
>> think that's why arm64 overrides ptep_set_access_flags(); since the hw can
>> update access/dirty we have to deal with the races.
>
> My concern is that your patch can in subtle ways lead to use losing PTE dirty
> bits on architectures that don't have the HW-managed dirty bit. They do exist ;)

But I think the example you give can already happen today? Thread 1 reads
orig_pte = ptep_get_lockless(). So that's already racy: if thread 2 is going to
set dirty just after the get, then thread 1 is going to set the PTE back to (a
modified version of) orig_pte. Isn't it already broken?


>
> Arm64 should be fine in that regard.
>

There is plenty of arm64 HW that doesn't do HW access/dirty update. But our
ptep_set_access_flags() can always deal with a racing update, even if that
update originates from SW.

Why do I have the feeling you're about to explain (very patiently) exactly why
I'm wrong?... :)


2024-03-27 09:30:21

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH v1 1/4] mm: Introduce ptep_get_lockless_norecency()

On 26.03.24 17:39, Ryan Roberts wrote:
> On 26/03/2024 16:27, David Hildenbrand wrote:
>> On 15.02.24 13:17, Ryan Roberts wrote:
>>> With the introduction of contpte mapping support for arm64, that
>>> architecture's implementation of ptep_get_lockless() has become very
>>> complex due to the need to gather access and dirty bits from across all
>>> of the ptes in the contpte block. This requires careful implementation
>>> to ensure the returned value is consistent (because its not possible to
>>> read all ptes atomically), but even in the common case when there is no
>>> racing modification, we have to read all ptes, which gives an ~O(n^2)
>>> cost if the core-mm is iterating over a range, and performing a
>>> ptep_get_lockless() on each pte.
>>>
>>> Solve this by introducing ptep_get_lockless_norecency(), which does not
>>> make any guarantees about access and dirty bits. Therefore it can simply
>>> read the single target pte.
>>>
>>> At the same time, convert all call sites that previously used
>>> ptep_get_lockless() but don't care about access and dirty state.
>>>
>>
>> I'd probably split that part off.
>
> I thought the general guidance was to introduce new APIs in same patch they are
> first used in? If I split this off, I'll have one patch for a new (unused) API,
> then another for the first users.

I don't know what exact guidance there is, but I tend to leave "non-trivial
changes" to separate patches.

Some of the changes here are rather trivial (mm/hugetlb.c), and I agree
that we can perform them here.

At least the "vmf.orig_pte" looked "non-trivial" to me, thus my comment.

>
>>
>>> We may want to do something similar for ptep_get() (i.e.
>>> ptep_get_norecency()) in future; it doesn't suffer from the consistency
>>> problem because the PTL serializes it with any modifications, but does
>>> suffer the same O(n^2) cost.
>>>
>>> Signed-off-by: Ryan Roberts <[email protected]>
>>> ---
>>>   include/linux/pgtable.h | 37 ++++++++++++++++++++++++++++++++++---
>>>   kernel/events/core.c    |  2 +-
>>>   mm/hugetlb.c            |  2 +-
>>>   mm/khugepaged.c         |  2 +-
>>>   mm/memory.c             |  2 +-
>>>   mm/swap_state.c         |  2 +-
>>>   mm/swapfile.c           |  2 +-
>>>   7 files changed, 40 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index a36cf4e124b0..9dd40fdbd825 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -528,16 +528,47 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
>>>   #endif /* CONFIG_PGTABLE_LEVELS > 2 */
>>>   #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */
>>>
>>> -/*
>>> - * We require that the PTE can be read atomically.
>>> - */
>>>   #ifndef ptep_get_lockless
>>> +/**
>>> + * ptep_get_lockless - Get a pte without holding the page table lock. Young and
>>> + *                     dirty bits are guaranteed to accurately reflect the state
>>> + *                     of the pte at the time of the call.
>>> + * @ptep: Page table pointer for pte to get.
>>> + *
>>> + * If young and dirty information is not required, use
>>> + * ptep_get_lockless_norecency() which can be faster on some architectures.
>>> + *
>>> + * May be overridden by the architecture; otherwise, implemented using
>>> + * ptep_get(), on the assumption that it is atomic.
>>> + *
>>> + * Context: Any.
>>> + */
>>
>> I think we usually say "Any context.". But I would just do it like idr.h:
>>
>> "Any context. It is safe to call this function without locking in your code."
>>
>> ... but is this true? We really want to say "without page table lock". Because
>> there must be some way to prevent against concurrent page table freeing. For
>> example, GUP-fast disables IRQs, whereby page table freeing code frees using RCU.
>
> How about:
>
> "
> Context: Any context that guarrantees the page table can't be freed

s/guarrantees/guarantees/

> concurrently. The page table lock is not required.
> "
>

Sounds good.

>>
>>>   static inline pte_t ptep_get_lockless(pte_t *ptep)
>>>   {
>>>       return ptep_get(ptep);
>>>   }
>>>   #endif
>>>
>>> +#ifndef ptep_get_lockless_norecency
>>> +/**
>>> + * ptep_get_lockless_norecency - Get a pte without holding the page table lock.
>>> + *                 Young and dirty bits may not be accurate.
>>> + * @ptep: Page table pointer for pte to get.
>>> + *
>>> + * Prefer this over ptep_get_lockless() when young and dirty information is not
>>> + * required since it can be faster on some architectures.
>>> + *
>>> + * May be overridden by the architecture; otherwise, implemented using the more
>>> + * precise ptep_get_lockless().
>>> + *
>>> + * Context: Any.
>>
>> Same comment.
>>
>>> + */
>>> +static inline pte_t ptep_get_lockless_norecency(pte_t *ptep)
>>> +{
>>> +    return ptep_get_lockless(ptep);
>>> +}
>>> +#endif
>>
>> [...]
>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 68283e54c899..41dc44eb8454 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -7517,7 +7517,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct
>>> vm_area_struct *vma,
>>>       }
>>>
>>>       if (pte) {
>>> -        pte_t pteval = ptep_get_lockless(pte);
>>> +        pte_t pteval = ptep_get_lockless_norecency(pte);
>>>
>>>           BUG_ON(pte_present(pteval) && !pte_huge(pteval));
>>>       }
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 2771fc043b3b..1a6c9ed8237a 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -1019,7 +1019,7 @@ static int __collapse_huge_page_swapin(struct mm_struct
>>> *mm,
>>>               }
>>>           }
>>>
>>> -        vmf.orig_pte = ptep_get_lockless(pte);
>>> +        vmf.orig_pte = ptep_get_lockless_norecency(pte);
>>>           if (!is_swap_pte(vmf.orig_pte))
>>>               continue;
>>
>>
>> Hm, I think you mentioned that we want to be careful with vmf.orig_pte.
>
> Yeah good point. So I guess this should move to patch 3 (which may be dropped -
> tbd)?
>

Yes. Or a separate one where you explain in detail why do_swap_page()
can handle it just fine.

--
Cheers,

David / dhildenb


2024-03-27 09:36:03

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

On 26.03.24 18:51, Ryan Roberts wrote:
> On 26/03/2024 17:39, David Hildenbrand wrote:
>> On 26.03.24 18:32, Ryan Roberts wrote:
>>> On 26/03/2024 17:04, David Hildenbrand wrote:
>>>>>>>>
>>>>>>>> Likely, we just want to read "the real deal" on both sides of the pte_same()
>>>>>>>> handling.
>>>>>>>
>>>>>>> Sorry I'm not sure I understand? You mean read the full pte including
>>>>>>> access/dirty? That's the same as dropping the patch, right? Of course if
>>>>>>> we do
>>>>>>> that, we still have to keep pte_get_lockless() around for this case. In an
>>>>>>> ideal
>>>>>>> world we would convert everything over to ptep_get_lockless_norecency() and
>>>>>>> delete ptep_get_lockless() to remove the ugliness from arm64.
>>>>>>
>>>>>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect
>>>>>> any
>>>>>> architecture.
>>>>>>
>>>>>> I do wonder if pte_same_norecency() should be defined per architecture and the
>>>>>> default would be pte_same(). So we could avoid the mkold etc on all other
>>>>>> architectures.
>>>>>
>>>>> Wouldn't that break it's semantics? The "norecency" of
>>>>> ptep_get_lockless_norecency() means "recency information in the returned pte
>>>>> may
>>>>> be incorrect". But the "norecency" of pte_same_norecency() means "ignore the
>>>>> access and dirty bits when you do the comparison".
>>>>
>>>> My idea was that ptep_get_lockless_norecency() would return the actual result on
>>>> these architectures. So e.g., on x86, there would be no actual change in
>>>> generated code.
>>>
>>> I think this is a bad plan... You'll end up with subtle differences between
>>> architectures.
>>>
>>>>
>>>> But yes, the documentation of these functions would have to be improved.
>>>>
>>>> Now I wonder if ptep_get_lockless_norecency() should actively clear
>>>> dirty/accessed bits to more easily find any actual issues where the bits still
>>>> matter ...
>>>
>>> I did a version that took that approach. Decided it was not as good as this way
>>> though. Now for the life of me, I can't remember my reasoning.
>>
>> Maybe because there are some code paths that check accessed/dirty without
>> "correctness" implications? For example, if the PTE is already dirty, no need to
>> set it dirty etc?
>
> I think I decided I was penalizing the architectures that don't care because all
> their ptep_get_norecency() and ptep_get_lockless_norecency() need to explicitly
> clear access/dirty. And I would have needed ptep_get_norecency() from day 1 so
> that I could feed its result into pte_same().

True. With ptep_get_norecency() you're also penalizing other
architectures. Hence my original thought about making the behavior
arch-specific, but then the arch has to make sure that the combination of
ptep_get_lockless_norecency()+pte_same_norecency() is right.

So if an arch decides to ignore bits in ptep_get_lockless_norecency(), it
must make sure to also ignore them in pte_same_norecency(), and must be
able to handle access/dirty bit changes differently.

Maybe one could have one variant for "hw-managed access/dirty" vs. "sw
managed accessed or dirty". Only the former would end up ignoring stuff
here, the latter would not.

But again, just some random thoughts on how this affects other
architectures and how we could avoid it. The issue I describe in patch
#3 would be gone if pte_same_norecency() would just do a pte_same()
check on other architectures -- and that would make it easier to sell :)

--
Cheers,

David / dhildenb


2024-03-27 09:54:35

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH v1 3/4] mm/memory: Use ptep_get_lockless_norecency() for orig_pte

On 26/03/2024 17:58, David Hildenbrand wrote:
>>>>>
>>>>> vmf->orig_pte = ptep_get_lockless_norecency(vmf->pte)
>>>>> /* not dirty */
>>>>>
>>>>> /* Now, thread 2 ends up setting the PTE dirty under PT lock. */
>>
>> Ahh, this comment about thread 2 is not referring to the code immediately below
>> it. It all makes much more sense now. :)
>
> Sorry :)
>
>>
>>>>>
>>>>> spin_lock(vmf->ptl);
>>>>> entry = vmf->orig_pte;
>>>>> if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) {
>>>>>       ...
>>>>> }
>>>>> ...
>>>>> entry = pte_mkyoung(entry);
>>>>
>>>> Do you mean pte_mkdirty() here? You're talking about dirty everywhere else.
>>>
>>> No, that is just thread 1 seeing "oh, nothing to do" and then goes ahead and
>>> unconditionally does that in handle_pte_fault().
>>>
>>>>
>>>>> if (ptep_set_access_flags(vmf->vma, ...)
>>>>> ...
>>>>> pte_unmap_unlock(vmf->pte, vmf->ptl);
>>>>>
>>>>>
>>>>> Generic ptep_set_access_flags() will do another pte_same() check and realize
>>>>> "hey, there was a change!" let's update the PTE!
>>>>>
>>>>> set_pte_at(vma->vm_mm, address, ptep, entry);
>>>>
>>>> This is called from the generic ptep_set_access_flags() in your example, right?
>>>>
>>>
>>> Yes.
>>>
>>>>>
>>>>> would overwrite the dirty bit set by thread 2.
>>>>
>>>> I'm not really sure what you are getting at... Is your concern that there is a
>>>> race where the page could become dirty in the meantime and it now gets lost? I
>>>> think that's why arm64 overrides ptep_set_access_flags(); since the hw can
>>>> update access/dirty we have to deal with the races.
>>>
>>> My concern is that your patch can in subtle ways lead to use losing PTE dirty
>>> bits on architectures that don't have the HW-managed dirty bit. They do exist ;)
>>
>> But I think the example you give can already happen today? Thread 1 reads
>> orig_pte = ptep_get_lockless(). So that's already racy, if thread 2 is going to
>> set dirty just after the get, then thread 1 is going to set the PTE back to (a
>> modified version of) orig_pte. Isn't it already broken?
>
> No, because the pte_same() check under PTL would have detected it, and we would
> have backed out. And I think the problem comes to live when we convert
> pte_same()->pte_same_norecency(), because we fail to protect PTE access/dirty
> changes that happend under PTL from another thread.

Ahh yep. Got it. I absolutely knew that you would be correct, but I still walked
right into it!

I think one could argue that the generic ptep_set_access_flags() is not
implementing its own spec:

"
.. Only sets the access flags (dirty, accessed), as well as write permission.
Furthermore, we know it always gets set to a "more permissive" setting ...
"

Surely it should be folding the access and dirty bits from *ptep into entry if
they are set?
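
Hypothetically something like this -- not proposing it as a patch, just to
illustrate what I mean by folding (again simplified from the generic helper):

int ptep_set_access_flags(struct vm_area_struct *vma, unsigned long address,
                          pte_t *ptep, pte_t entry, int dirty)
{
        pte_t old = ptep_get(ptep);
        int changed;

        /* Fold any access/dirty bits already set in the live pte into entry. */
        if (pte_young(old))
                entry = pte_mkyoung(entry);
        if (pte_dirty(old))
                entry = pte_mkdirty(entry);

        changed = !pte_same(old, entry);
        if (changed) {
                set_pte_at(vma->vm_mm, address, ptep, entry);
                flush_tlb_fix_spurious_fault(vma, address, ptep);
        }
        return changed;
}
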

Regardless, I think this example proves that it's fragile and subtle. I'm not
really sure how to fix it more generally/robustly. Any thoughts? If not, perhaps
we are better off keeping ptep_get_lockless() around and only using
ptep_get_lockless_norecency() for the really obviously correct cases?


>
> But could be I am missing something :)
>
>>> Arm64 should be fine in that regard.
>>>
>>
>> There is plenty of arm64 HW that doesn't do HW access/dirty update. But our
>> ptep_set_access_flags() can always deal with a racing update, even if that
>> update originates from SW.
>>
>> Why do I have the feeling you're about to explain (very patiently) exactly why
>> I'm wrong?... :)
>
> heh ... or you'll tell me (vary patiently) why I am wrong :)
>


2024-03-27 09:58:22

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH v1 1/4] mm: Introduce ptep_get_lockless_norecency()

On 27/03/2024 09:28, David Hildenbrand wrote:
> On 26.03.24 17:39, Ryan Roberts wrote:
>> On 26/03/2024 16:27, David Hildenbrand wrote:
>>> On 15.02.24 13:17, Ryan Roberts wrote:
>>>> With the introduction of contpte mapping support for arm64, that
>>>> architecture's implementation of ptep_get_lockless() has become very
>>>> complex due to the need to gather access and dirty bits from across all
>>>> of the ptes in the contpte block. This requires careful implementation
>>>> to ensure the returned value is consistent (because its not possible to
>>>> read all ptes atomically), but even in the common case when there is no
>>>> racing modification, we have to read all ptes, which gives an ~O(n^2)
>>>> cost if the core-mm is iterating over a range, and performing a
>>>> ptep_get_lockless() on each pte.
>>>>
>>>> Solve this by introducing ptep_get_lockless_norecency(), which does not
>>>> make any guarantees about access and dirty bits. Therefore it can simply
>>>> read the single target pte.
>>>>
>>>> At the same time, convert all call sites that previously used
>>>> ptep_get_lockless() but don't care about access and dirty state.
>>>>
>>>
>>> I'd probably split that part off.
>>
>> I thought the general guidance was to introduce new APIs in same patch they are
>> first used in? If I split this off, I'll have one patch for a new (unused) API,
>> then another for the first users.
>
> I don't know what exact guidance there is, but I tend to leave "non trivial
> changes" to separate patches.
>
> Some of the changes here are rather trivial (mm/hugetlb.c), and I agree that we
> can perform them here.
>
> At least the "vmf.orig_pte" looked "non-trivial" to me, thus my comment.

got it.


>
>>
>>>
>>>> We may want to do something similar for ptep_get() (i.e.
>>>> ptep_get_norecency()) in future; it doesn't suffer from the consistency
>>>> problem because the PTL serializes it with any modifications, but does
>>>> suffer the same O(n^2) cost.
>>>>
>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>> ---
>>>>    include/linux/pgtable.h | 37 ++++++++++++++++++++++++++++++++++---
>>>>    kernel/events/core.c    |  2 +-
>>>>    mm/hugetlb.c            |  2 +-
>>>>    mm/khugepaged.c         |  2 +-
>>>>    mm/memory.c             |  2 +-
>>>>    mm/swap_state.c         |  2 +-
>>>>    mm/swapfile.c           |  2 +-
>>>>    7 files changed, 40 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>> index a36cf4e124b0..9dd40fdbd825 100644
>>>> --- a/include/linux/pgtable.h
>>>> +++ b/include/linux/pgtable.h
>>>> @@ -528,16 +528,47 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
>>>>    #endif /* CONFIG_PGTABLE_LEVELS > 2 */
>>>>    #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */
>>>>
>>>> -/*
>>>> - * We require that the PTE can be read atomically.
>>>> - */
>>>>    #ifndef ptep_get_lockless
>>>> +/**
>>>> + * ptep_get_lockless - Get a pte without holding the page table lock. Young
>>>> and
>>>> + *                     dirty bits are guaranteed to accurately reflect the
>>>> state
>>>> + *                     of the pte at the time of the call.
>>>> + * @ptep: Page table pointer for pte to get.
>>>> + *
>>>> + * If young and dirty information is not required, use
>>>> + * ptep_get_lockless_norecency() which can be faster on some architectures.
>>>> + *
>>>> + * May be overridden by the architecture; otherwise, implemented using
>>>> + * ptep_get(), on the assumption that it is atomic.
>>>> + *
>>>> + * Context: Any.
>>>> + */
>>>
>>> I think we usually say "Any context.". But I would just do it like idr.h:
>>>
>>> "Any context. It is safe to call this function without locking in your code."
>>>
>>> ... but is this true? We really want to say "without page table lock". Because
>>> there must be some way to prevent against concurrent page table freeing. For
>>> example, GUP-fast disables IRQs, whereby page table freeing code frees using
>>> RCU.
>>
>> How about:
>>
>> "
>> Context: Any context that guarrantees the page table can't be freed
>
> s/guarrantees/guarantees/
>
>> concurrently. The page table lock is not required.
>> "
>>
>
> Sounds good.

Great!

>
>>>
>>>>    static inline pte_t ptep_get_lockless(pte_t *ptep)
>>>>    {
>>>>        return ptep_get(ptep);
>>>>    }
>>>>    #endif
>>>>
>>>> +#ifndef ptep_get_lockless_norecency
>>>> +/**
>>>> + * ptep_get_lockless_norecency - Get a pte without holding the page table
>>>> lock.
>>>> + *                 Young and dirty bits may not be accurate.
>>>> + * @ptep: Page table pointer for pte to get.
>>>> + *
>>>> + * Prefer this over ptep_get_lockless() when young and dirty information is
>>>> not
>>>> + * required since it can be faster on some architectures.
>>>> + *
>>>> + * May be overridden by the architecture; otherwise, implemented using the
>>>> more
>>>> + * precise ptep_get_lockless().
>>>> + *
>>>> + * Context: Any.
>>>
>>> Same comment.
>>>
>>>> + */
>>>> +static inline pte_t ptep_get_lockless_norecency(pte_t *ptep)
>>>> +{
>>>> +    return ptep_get_lockless(ptep);
>>>> +}
>>>> +#endif
>>>
>>> [...]
>>>
>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>> index 68283e54c899..41dc44eb8454 100644
>>>> --- a/mm/hugetlb.c
>>>> +++ b/mm/hugetlb.c
>>>> @@ -7517,7 +7517,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct
>>>> vm_area_struct *vma,
>>>>        }
>>>>
>>>>        if (pte) {
>>>> -        pte_t pteval = ptep_get_lockless(pte);
>>>> +        pte_t pteval = ptep_get_lockless_norecency(pte);
>>>>
>>>>            BUG_ON(pte_present(pteval) && !pte_huge(pteval));
>>>>        }
>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>> index 2771fc043b3b..1a6c9ed8237a 100644
>>>> --- a/mm/khugepaged.c
>>>> +++ b/mm/khugepaged.c
>>>> @@ -1019,7 +1019,7 @@ static int __collapse_huge_page_swapin(struct mm_struct
>>>> *mm,
>>>>                }
>>>>            }
>>>>
>>>> -        vmf.orig_pte = ptep_get_lockless(pte);
>>>> +        vmf.orig_pte = ptep_get_lockless_norecency(pte);
>>>>            if (!is_swap_pte(vmf.orig_pte))
>>>>                continue;
>>>
>>>
>>> Hm, I think you mentioned that we want to be careful with vmf.orig_pte.
>>
>> Yeah good point. So I guess this should move to patch 3 (which may be dropped -
>> tbd)?
>>
>
> Yes. Or a separate one where you explain in detail why do_swap_page() can handle
> it just fine.

Ahh no wait - I remember now; the reason I believe this is a "trivial" case is
because we only leak vmf.orig_pte to the rest of the world if it's a swap entry.
And if it's a swap entry, then ptep_get_lockless_norecency() is equivalent to
ptep_get_lockless() - the pte is not present so there are no access or dirty
bits. So I think this can stay here?
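
In other words (paraphrasing the __collapse_huge_page_swapin() loop from
memory, so details may be slightly off):

        vmf.orig_pte = ptep_get_lockless_norecency(pte);
        if (!is_swap_pte(vmf.orig_pte))
                continue;       /* orig_pte is discarded; recency bits irrelevant */

        /*
         * Only non-present swap entries get past the check above, and a
         * non-present pte has no access/dirty bits for the norecency read
         * to get wrong, so do_swap_page() sees exactly what it would have
         * seen with ptep_get_lockless().
         */
        ret = do_swap_page(&vmf);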


2024-03-27 10:01:13

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

On 27/03/2024 09:34, David Hildenbrand wrote:
> On 26.03.24 18:51, Ryan Roberts wrote:
>> On 26/03/2024 17:39, David Hildenbrand wrote:
>>> On 26.03.24 18:32, Ryan Roberts wrote:
>>>> On 26/03/2024 17:04, David Hildenbrand wrote:
>>>>>>>>>
>>>>>>>>> Likely, we just want to read "the real deal" on both sides of the
>>>>>>>>> pte_same()
>>>>>>>>> handling.
>>>>>>>>
>>>>>>>> Sorry I'm not sure I understand? You mean read the full pte including
>>>>>>>> access/dirty? That's the same as dropping the patch, right? Of course if
>>>>>>>> we do
>>>>>>>> that, we still have to keep pte_get_lockless() around for this case. In an
>>>>>>>> ideal
>>>>>>>> world we would convert everything over to ptep_get_lockless_norecency() and
>>>>>>>> delete ptep_get_lockless() to remove the ugliness from arm64.
>>>>>>>
>>>>>>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect
>>>>>>> any
>>>>>>> architecture.
>>>>>>>
>>>>>>> I do wonder if pte_same_norecency() should be defined per architecture
>>>>>>> and the
>>>>>>> default would be pte_same(). So we could avoid the mkold etc on all other
>>>>>>> architectures.
>>>>>>
>>>>>> Wouldn't that break it's semantics? The "norecency" of
>>>>>> ptep_get_lockless_norecency() means "recency information in the returned pte
>>>>>> may
>>>>>> be incorrect". But the "norecency" of pte_same_norecency() means "ignore the
>>>>>> access and dirty bits when you do the comparison".
>>>>>
>>>>> My idea was that ptep_get_lockless_norecency() would return the actual
>>>>> result on
>>>>> these architectures. So e.g., on x86, there would be no actual change in
>>>>> generated code.
>>>>
>>>> I think this is a bad plan... You'll end up with subtle differences between
>>>> architectures.
>>>>
>>>>>
>>>>> But yes, the documentation of these functions would have to be improved.
>>>>>
>>>>> Now I wonder if ptep_get_lockless_norecency() should actively clear
>>>>> dirty/accessed bits to more easily find any actual issues where the bits still
>>>>> matter ...
>>>>
>>>> I did a version that took that approach. Decided it was not as good as this way
>>>> though. Now for the life of me, I can't remember my reasoning.
>>>
>>> Maybe because there are some code paths that check accessed/dirty without
>>> "correctness" implications? For example, if the PTE is already dirty, no need to
>>> set it dirty etc?
>>
>> I think I decided I was penalizing the architectures that don't care because all
>> their ptep_get_norecency() and ptep_get_lockless_norecency() need to explicitly
>> clear access/dirty. And I would have needed ptep_get_norecency() from day 1 so
>> that I could feed its result into pte_same().
>
> True. With ptep_get_norecency() you're also penalizing other architectures.
> Therefore my original thought about making the behavior arch-specific, but the
> arch has to make sure to get the combination of
> ptep_get_lockless_norecency()+ptep_same_norecency() is right.
>
> So if an arch decide to ignore bits in ptep_get_lockless_norecency(), it must
> make sure to also ignore them in ptep_same_norecency(), and must be able to
> handle access/dirty bit changes differently.
>
> Maybe one could have one variant for "hw-managed access/dirty" vs. "sw managed
> accessed or dirty". Only the former would end up ignoring stuff here, the latter
> would not.
>
> But again, just some random thoughts how this affects other architectures and
> how we could avoid it. The issue I describe in patch #3 would be gone if
> ptep_same_norecency() would just do a ptep_same() check on other architectures
> -- and would make it easier to sell :)

Perhaps - let me chew on that for a bit. It doesn't feel as easy as you suggest
to me. But I can't put my finger on why...



2024-03-27 17:02:53

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH v1 1/4] mm: Introduce ptep_get_lockless_norecency()

>>>>
>>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>>> index 68283e54c899..41dc44eb8454 100644
>>>>> --- a/mm/hugetlb.c
>>>>> +++ b/mm/hugetlb.c
>>>>> @@ -7517,7 +7517,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct
>>>>> vm_area_struct *vma,
>>>>>        }
>>>>>
>>>>>        if (pte) {
>>>>> -        pte_t pteval = ptep_get_lockless(pte);
>>>>> +        pte_t pteval = ptep_get_lockless_norecency(pte);
>>>>>
>>>>>            BUG_ON(pte_present(pteval) && !pte_huge(pteval));
>>>>>        }
>>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>>> index 2771fc043b3b..1a6c9ed8237a 100644
>>>>> --- a/mm/khugepaged.c
>>>>> +++ b/mm/khugepaged.c
>>>>> @@ -1019,7 +1019,7 @@ static int __collapse_huge_page_swapin(struct mm_struct
>>>>> *mm,
>>>>>                }
>>>>>            }
>>>>>
>>>>> -        vmf.orig_pte = ptep_get_lockless(pte);
>>>>> +        vmf.orig_pte = ptep_get_lockless_norecency(pte);
>>>>>            if (!is_swap_pte(vmf.orig_pte))
>>>>>                continue;
>>>>
>>>>
>>>> Hm, I think you mentioned that we want to be careful with vmf.orig_pte.
>>>
>>> Yeah good point. So I guess this should move to patch 3 (which may be dropped -
>>> tbd)?
>>>
>>
>> Yes. Or a separate one where you explain in detail why do_swap_page() can handle
>> it just fine.
>
> Ahh no wait - I remember now; the reason I believe this is a "trivial" case is
> because we only leak vmf.orig_pte to the rest of the world if its a swap entry.

Ugh, yes!

--
Cheers,

David / dhildenb


2024-03-27 17:06:16

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH v1 3/4] mm/memory: Use ptep_get_lockless_norecency() for orig_pte

On 27.03.24 10:51, Ryan Roberts wrote:
> On 26/03/2024 17:58, David Hildenbrand wrote:
>>>>>>
>>>>>> vmf->orig_pte = ptep_get_lockless_norecency(vmf->pte)
>>>>>> /* not dirty */
>>>>>>
>>>>>> /* Now, thread 2 ends up setting the PTE dirty under PT lock. */
>>>
>>> Ahh, this comment about thread 2 is not referring to the code immediately below
>>> it. It all makes much more sense now. :)
>>
>> Sorry :)
>>
>>>
>>>>>>
>>>>>> spin_lock(vmf->ptl);
>>>>>> entry = vmf->orig_pte;
>>>>>> if (unlikely(!pte_same(ptep_get(vmf->pte), entry))) {
>>>>>>       ...
>>>>>> }
>>>>>> ...
>>>>>> entry = pte_mkyoung(entry);
>>>>>
>>>>> Do you mean pte_mkdirty() here? You're talking about dirty everywhere else.
>>>>
>>>> No, that is just thread 1 seeing "oh, nothing to do" and then goes ahead and
>>>> unconditionally does that in handle_pte_fault().
>>>>
>>>>>
>>>>>> if (ptep_set_access_flags(vmf->vma, ...)
>>>>>> ...
>>>>>> pte_unmap_unlock(vmf->pte, vmf->ptl);
>>>>>>
>>>>>>
>>>>>> Generic ptep_set_access_flags() will do another pte_same() check and realize
>>>>>> "hey, there was a change!" let's update the PTE!
>>>>>>
>>>>>> set_pte_at(vma->vm_mm, address, ptep, entry);
>>>>>
>>>>> This is called from the generic ptep_set_access_flags() in your example, right?
>>>>>
>>>>
>>>> Yes.
>>>>
>>>>>>
>>>>>> would overwrite the dirty bit set by thread 2.
>>>>>
>>>>> I'm not really sure what you are getting at... Is your concern that there is a
>>>>> race where the page could become dirty in the meantime and it now gets lost? I
>>>>> think that's why arm64 overrides ptep_set_access_flags(); since the hw can
>>>>> update access/dirty we have to deal with the races.
>>>>
>>>> My concern is that your patch can in subtle ways lead to use losing PTE dirty
>>>> bits on architectures that don't have the HW-managed dirty bit. They do exist ;)
>>>
>>> But I think the example you give can already happen today? Thread 1 reads
>>> orig_pte = ptep_get_lockless(). So that's already racy, if thread 2 is going to
>>> set dirty just after the get, then thread 1 is going to set the PTE back to (a
>>> modified version of) orig_pte. Isn't it already broken?
>>
>> No, because the pte_same() check under PTL would have detected it, and we would
>> have backed out. And I think the problem comes to live when we convert
>> pte_same()->pte_same_norecency(), because we fail to protect PTE access/dirty
>> changes that happend under PTL from another thread.
>
> Ahh yep. Got it. I absolutely knew that you would be correct, but I still walked
> right into it!
>
> I think one could argue that the generic ptep_set_access_flags() is not
> implementing its own spec:
>
> "
> ... Only sets the access flags (dirty, accessed), as well as write permission.
> Furthermore, we know it always gets set to a "more permissive" setting ...
> "
>
> Surely it should be folding the access and dirty bits from *ptep into entry if
> they are set?

Likely yes. Unless it's also used to clear access/dirty (don't think so,
and would not be documented).

But the simplification made sense for now, because you previously
checked pte_same(), and nobody can modify the pte concurrently.

>
> Regardless, I think this example proves that its fragile and subtle. I'm not
> really sure how to fix it more generally/robustly. Any thoughts? If not perhaps
> we are better off keeping ptep_get_lockless() around and only using
> ptep_get_lockless_norecency() for the really obviously correct cases?

Maybe one of the "sources of problems" is that we have a
ptep_get_lockless_norecency() call followed by a pte_same() check, like
done here.

Not the source of all problems I believe, though ...

--
Cheers,

David / dhildenb


2024-04-03 12:59:21

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

On 27/03/2024 09:34, David Hildenbrand wrote:
> On 26.03.24 18:51, Ryan Roberts wrote:
>> On 26/03/2024 17:39, David Hildenbrand wrote:
>>> On 26.03.24 18:32, Ryan Roberts wrote:
>>>> On 26/03/2024 17:04, David Hildenbrand wrote:
>>>>>>>>>
>>>>>>>>> Likely, we just want to read "the real deal" on both sides of the
>>>>>>>>> pte_same()
>>>>>>>>> handling.
>>>>>>>>
>>>>>>>> Sorry I'm not sure I understand? You mean read the full pte including
>>>>>>>> access/dirty? That's the same as dropping the patch, right? Of course if
>>>>>>>> we do
>>>>>>>> that, we still have to keep pte_get_lockless() around for this case. In an
>>>>>>>> ideal
>>>>>>>> world we would convert everything over to ptep_get_lockless_norecency() and
>>>>>>>> delete ptep_get_lockless() to remove the ugliness from arm64.
>>>>>>>
>>>>>>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect
>>>>>>> any
>>>>>>> architecture.
>>>>>>>
>>>>>>> I do wonder if pte_same_norecency() should be defined per architecture
>>>>>>> and the
>>>>>>> default would be pte_same(). So we could avoid the mkold etc on all other
>>>>>>> architectures.
>>>>>>
>>>>>> Wouldn't that break it's semantics? The "norecency" of
>>>>>> ptep_get_lockless_norecency() means "recency information in the returned pte
>>>>>> may
>>>>>> be incorrect". But the "norecency" of pte_same_norecency() means "ignore the
>>>>>> access and dirty bits when you do the comparison".
>>>>>
>>>>> My idea was that ptep_get_lockless_norecency() would return the actual
>>>>> result on
>>>>> these architectures. So e.g., on x86, there would be no actual change in
>>>>> generated code.
>>>>
>>>> I think this is a bad plan... You'll end up with subtle differences between
>>>> architectures.
>>>>
>>>>>
>>>>> But yes, the documentation of these functions would have to be improved.
>>>>>
>>>>> Now I wonder if ptep_get_lockless_norecency() should actively clear
>>>>> dirty/accessed bits to more easily find any actual issues where the bits still
>>>>> matter ...
>>>>
>>>> I did a version that took that approach. Decided it was not as good as this way
>>>> though. Now for the life of me, I can't remember my reasoning.
>>>
>>> Maybe because there are some code paths that check accessed/dirty without
>>> "correctness" implications? For example, if the PTE is already dirty, no need to
>>> set it dirty etc?
>>
>> I think I decided I was penalizing the architectures that don't care because all
>> their ptep_get_norecency() and ptep_get_lockless_norecency() need to explicitly
>> clear access/dirty. And I would have needed ptep_get_norecency() from day 1 so
>> that I could feed its result into pte_same().
>
> True. With ptep_get_norecency() you're also penalizing other architectures.
> Therefore my original thought about making the behavior arch-specific, but the
> arch has to make sure to get the combination of
> ptep_get_lockless_norecency()+ptep_same_norecency() is right.
>
> So if an arch decide to ignore bits in ptep_get_lockless_norecency(), it must
> make sure to also ignore them in ptep_same_norecency(), and must be able to
> handle access/dirty bit changes differently.
>
> Maybe one could have one variant for "hw-managed access/dirty" vs. "sw managed
> accessed or dirty". Only the former would end up ignoring stuff here, the latter
> would not.
>
> But again, just some random thoughts how this affects other architectures and
> how we could avoid it. The issue I describe in patch #3 would be gone if
> ptep_same_norecency() would just do a ptep_same() check on other architectures
> -- and would make it easier to sell :)
>

I've been thinking some more about this. I think your proposal is the following:


// ARM64
ptep_get_lockless_norecency()
{
    - returned access/dirty may be incorrect
    - returned access/dirty may be differently incorrect between 2 calls
}
pte_same_norecency()
{
    - ignore access/dirty when doing comparison
}
ptep_set_access_flags(ptep, pte)
{
    - must not assume access/dirty in pte are "more permissive" than
      access/dirty in *ptep
    - must only set access/dirty in *ptep, never clear
}


// Other arches: no change to generated code
ptep_get_lockless_norecency()
{
    return ptep_get_lockless();
}
pte_same_norecency()
{
    return pte_same();
}
ptep_set_access_flags(ptep, pte)
{
    - may assume access/dirty in pte are "more permissive" than access/dirty
      in *ptep
    - if no HW access/dirty updates, "*ptep = pte" always results in "more
      permissive" change
}

An arch either specializes all 3 or none of them.
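
In C, the non-specializing defaults would just be pass-throughs; sketch only --
the ptep_get_lockless_norecency() fallback is what patch 1 already has, and the
pte_same_norecency() fallback is your pass-through suggestion rather than the
mkold/mkclean version from patch 3:

#ifndef ptep_get_lockless_norecency
static inline pte_t ptep_get_lockless_norecency(pte_t *ptep)
{
        return ptep_get_lockless(ptep);
}
#endif

#ifndef pte_same_norecency
static inline int pte_same_norecency(pte_t pte_a, pte_t pte_b)
{
        return pte_same(pte_a, pte_b);
}
#endif
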

This would allow us to get rid of ptep_get_lockless().

And it addresses the bug you found with ptep_set_access_flags().


BUT, I still have a nagging feeling that there are likely to be other similar
problems caused by ignoring access/dirty during pte_same_norecency(). I can't
convince myself that it's definitely all safe and robust.

So I'm leaning towards dropping patch 3 and therefore keeping
ptep_get_lockless() around.

Let me know if you have any insight that might help me change my mind :)

Thanks,
Ryan


2024-04-08 08:36:36

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

On 03.04.24 14:59, Ryan Roberts wrote:
> On 27/03/2024 09:34, David Hildenbrand wrote:
>> On 26.03.24 18:51, Ryan Roberts wrote:
>>> On 26/03/2024 17:39, David Hildenbrand wrote:
>>>> On 26.03.24 18:32, Ryan Roberts wrote:
>>>>> On 26/03/2024 17:04, David Hildenbrand wrote:
>>>>>>>>>>
>>>>>>>>>> Likely, we just want to read "the real deal" on both sides of the
>>>>>>>>>> pte_same()
>>>>>>>>>> handling.
>>>>>>>>>
>>>>>>>>> Sorry I'm not sure I understand? You mean read the full pte including
>>>>>>>>> access/dirty? That's the same as dropping the patch, right? Of course if
>>>>>>>>> we do
>>>>>>>>> that, we still have to keep pte_get_lockless() around for this case. In an
>>>>>>>>> ideal
>>>>>>>>> world we would convert everything over to ptep_get_lockless_norecency() and
>>>>>>>>> delete ptep_get_lockless() to remove the ugliness from arm64.
>>>>>>>>
>>>>>>>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect
>>>>>>>> any
>>>>>>>> architecture.
>>>>>>>>
>>>>>>>> I do wonder if pte_same_norecency() should be defined per architecture
>>>>>>>> and the
>>>>>>>> default would be pte_same(). So we could avoid the mkold etc on all other
>>>>>>>> architectures.
>>>>>>>
>>>>>>> Wouldn't that break it's semantics? The "norecency" of
>>>>>>> ptep_get_lockless_norecency() means "recency information in the returned pte
>>>>>>> may
>>>>>>> be incorrect". But the "norecency" of pte_same_norecency() means "ignore the
>>>>>>> access and dirty bits when you do the comparison".
>>>>>>
>>>>>> My idea was that ptep_get_lockless_norecency() would return the actual
>>>>>> result on
>>>>>> these architectures. So e.g., on x86, there would be no actual change in
>>>>>> generated code.
>>>>>
>>>>> I think this is a bad plan... You'll end up with subtle differences between
>>>>> architectures.
>>>>>
>>>>>>
>>>>>> But yes, the documentation of these functions would have to be improved.
>>>>>>
>>>>>> Now I wonder if ptep_get_lockless_norecency() should actively clear
>>>>>> dirty/accessed bits to more easily find any actual issues where the bits still
>>>>>> matter ...
>>>>>
>>>>> I did a version that took that approach. Decided it was not as good as this way
>>>>> though. Now for the life of me, I can't remember my reasoning.
>>>>
>>>> Maybe because there are some code paths that check accessed/dirty without
>>>> "correctness" implications? For example, if the PTE is already dirty, no need to
>>>> set it dirty etc?
>>>
>>> I think I decided I was penalizing the architectures that don't care because all
>>> their ptep_get_norecency() and ptep_get_lockless_norecency() need to explicitly
>>> clear access/dirty. And I would have needed ptep_get_norecency() from day 1 so
>>> that I could feed its result into pte_same().
>>
>> True. With ptep_get_norecency() you're also penalizing other architectures.
>> Therefore my original thought about making the behavior arch-specific, but the
>> arch has to make sure to get the combination of
>> ptep_get_lockless_norecency()+ptep_same_norecency() is right.
>>
>> So if an arch decide to ignore bits in ptep_get_lockless_norecency(), it must
>> make sure to also ignore them in ptep_same_norecency(), and must be able to
>> handle access/dirty bit changes differently.
>>
>> Maybe one could have one variant for "hw-managed access/dirty" vs. "sw managed
>> accessed or dirty". Only the former would end up ignoring stuff here, the latter
>> would not.
>>
>> But again, just some random thoughts how this affects other architectures and
>> how we could avoid it. The issue I describe in patch #3 would be gone if
>> ptep_same_norecency() would just do a ptep_same() check on other architectures
>> -- and would make it easier to sell :)
>>
>
> I've been thinking some more about this. I think your proposal is the following:
>
>
> // ARM64
> ptep_get_lockless_norecency()
> {
> - returned access/dirty may be incorrect
> - returned access/dirty may be differently incorrect between 2 calls
> }
> pte_same_norecency()
> {
> - ignore access/dirty when doing comparison
> }
> ptep_set_access_flags(ptep, pte)
> {
> - must not assume access/dirty in pte are "more permissive" than
> access/dirty in *ptep
> - must only set access/dirty in *ptep, never clear
> }
>
>
> // Other arches: no change to generated code
> ptep_get_lockless_norecency()
> {
> return ptep_get_lockless();
> }
> pte_same_norecency()
> {
> return pte_same();
> }
> ptep_set_access_flags(ptep, pte)
> {
> - may assume access/dirty in pte are "more permissive" than access/dirty
> in *ptep
> - if no HW access/dirty updates, "*ptep = pte" always results in "more
> permissive" change
> }
>
> An arch either specializes all 3 or none of them.
>
> This would allow us to get rid of ptep_get_lockless().
>
> And it addresses the bug you found with ptep_set_access_flags().
>
>
> BUT, I still have a nagging feeling that there are likely to be other similar
> problems caused by ignoring access/dirty during pte_same_norecency(). I can't
> convince myself that its definitely all safe and robust.

Right, we'd have to identify the other possible cases and document what
an arch + common code must stick to in order to make it work.

Some rules would be: if an arch implements ptep_get_lockless_norecency():

(1) Passing the result from ptep_get_lockless_norecency() to pte_same()
is wrong.
(2) Checking pte_young()/pte_old()/pte_dirty()/pte_clean() after
ptep_get_lockless_norecency() is very likely wrong.
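
E.g. rule (1) as a hypothetical caller (ptl/backout are just placeholders):

        pte_t orig = ptep_get_lockless_norecency(ptep);
        ...
        spin_lock(ptl);
        /* Wrong: orig may lack access/dirty bits that are set in *ptep, so
         * pte_same() can report a change that never happened. */
        if (!pte_same(ptep_get(ptep), orig))
                goto backout;
        /* Intended pairing: recency bits are excluded from the comparison. */
        if (!pte_same_norecency(ptep_get(ptep), orig))
                goto backout;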


>
> So I'm leaning towards dropping patch 3 and therefore keeping
> ptep_get_lockless() around.
>
> Let me know if you have any insight that might help me change my mind :)

I'm wondering if it would help if we could find a better name (or
concept) for "norecency" here, that expresses that only on some archs
we'd have that fuzzy handling.

Keeping ptep_get_lockless() around for now sounds like the best alternative.

--
Cheers,

David / dhildenb


2024-04-09 17:08:38

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

On 08/04/2024 09:36, David Hildenbrand wrote:
> On 03.04.24 14:59, Ryan Roberts wrote:
>> On 27/03/2024 09:34, David Hildenbrand wrote:
>>> On 26.03.24 18:51, Ryan Roberts wrote:
>>>> On 26/03/2024 17:39, David Hildenbrand wrote:
>>>>> On 26.03.24 18:32, Ryan Roberts wrote:
>>>>>> On 26/03/2024 17:04, David Hildenbrand wrote:
>>>>>>>>>>>
>>>>>>>>>>> Likely, we just want to read "the real deal" on both sides of the
>>>>>>>>>>> pte_same()
>>>>>>>>>>> handling.
>>>>>>>>>>
>>>>>>>>>> Sorry I'm not sure I understand? You mean read the full pte including
>>>>>>>>>> access/dirty? That's the same as dropping the patch, right? Of course if
>>>>>>>>>> we do
>>>>>>>>>> that, we still have to keep pte_get_lockless() around for this case.
>>>>>>>>>> In an
>>>>>>>>>> ideal
>>>>>>>>>> world we would convert everything over to
>>>>>>>>>> ptep_get_lockless_norecency() and
>>>>>>>>>> delete ptep_get_lockless() to remove the ugliness from arm64.
>>>>>>>>>
>>>>>>>>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really
>>>>>>>>> affect
>>>>>>>>> any
>>>>>>>>> architecture.
>>>>>>>>>
>>>>>>>>> I do wonder if pte_same_norecency() should be defined per architecture
>>>>>>>>> and the
>>>>>>>>> default would be pte_same(). So we could avoid the mkold etc on all other
>>>>>>>>> architectures.
>>>>>>>>
>>>>>>>> Wouldn't that break it's semantics? The "norecency" of
>>>>>>>> ptep_get_lockless_norecency() means "recency information in the returned
>>>>>>>> pte
>>>>>>>> may
>>>>>>>> be incorrect". But the "norecency" of pte_same_norecency() means "ignore
>>>>>>>> the
>>>>>>>> access and dirty bits when you do the comparison".
>>>>>>>
>>>>>>> My idea was that ptep_get_lockless_norecency() would return the actual
>>>>>>> result on
>>>>>>> these architectures. So e.g., on x86, there would be no actual change in
>>>>>>> generated code.
>>>>>>
>>>>>> I think this is a bad plan... You'll end up with subtle differences between
>>>>>> architectures.
>>>>>>
>>>>>>>
>>>>>>> But yes, the documentation of these functions would have to be improved.
>>>>>>>
>>>>>>> Now I wonder if ptep_get_lockless_norecency() should actively clear
>>>>>>> dirty/accessed bits to more easily find any actual issues where the bits
>>>>>>> still
>>>>>>> matter ...
>>>>>>
>>>>>> I did a version that took that approach. Decided it was not as good as
>>>>>> this way
>>>>>> though. Now for the life of me, I can't remember my reasoning.
>>>>>
>>>>> Maybe because there are some code paths that check accessed/dirty without
>>>>> "correctness" implications? For example, if the PTE is already dirty, no
>>>>> need to
>>>>> set it dirty etc?
>>>>
>>>> I think I decided I was penalizing the architectures that don't care because
>>>> all
>>>> their ptep_get_norecency() and ptep_get_lockless_norecency() need to explicitly
>>>> clear access/dirty. And I would have needed ptep_get_norecency() from day 1 so
>>>> that I could feed its result into pte_same().
>>>
>>> True. With ptep_get_norecency() you're also penalizing other architectures.
>>> Therefore my original thought about making the behavior arch-specific, but the
>>> arch has to make sure to get the combination of
>>> ptep_get_lockless_norecency()+ptep_same_norecency() is right.
>>>
>>> So if an arch decide to ignore bits in ptep_get_lockless_norecency(), it must
>>> make sure to also ignore them in ptep_same_norecency(), and must be able to
>>> handle access/dirty bit changes differently.
>>>
>>> Maybe one could have one variant for "hw-managed access/dirty" vs. "sw managed
>>> accessed or dirty". Only the former would end up ignoring stuff here, the latter
>>> would not.
>>>
>>> But again, just some random thoughts how this affects other architectures and
>>> how we could avoid it. The issue I describe in patch #3 would be gone if
>>> ptep_same_norecency() would just do a ptep_same() check on other architectures
>>> -- and would make it easier to sell :)
>>>
>>
>> I've been thinking some more about this. I think your proposal is the following:
>>
>>
>> // ARM64
>> ptep_get_lockless_norecency()
>> {
>>     - returned access/dirty may be incorrect
>>     - returned access/dirty may be differently incorrect between 2 calls
>> }
>> pte_same_norecency()
>> {
>>     - ignore access/dirty when doing comparison
>> }
>> ptep_set_access_flags(ptep, pte)
>> {
>>     - must not assume access/dirty in pte are "more permissive" than
>>       access/dirty in *ptep
>>     - must only set access/dirty in *ptep, never clear
>> }
>>
>>
>> // Other arches: no change to generated code
>> ptep_get_lockless_norecency()
>> {
>>     return ptep_get_lockless();
>> }
>> pte_same_norecency()
>> {
>>     return pte_same();
>> }
>> ptep_set_access_flags(ptep, pte)
>> {
>>     - may assume access/dirty in pte are "more permissive" than access/dirty
>>       in *ptep
>>     - if no HW access/dirty updates, "*ptep = pte" always results in "more
>>       permissive" change
>> }
>>
>> An arch either specializes all 3 or none of them.
>>
>> This would allow us to get rid of ptep_get_lockless().
>>
>> And it addresses the bug you found with ptep_set_access_flags().
>>
>>
>> BUT, I still have a nagging feeling that there are likely to be other similar
>> problems caused by ignoring access/dirty during pte_same_norecency(). I can't
>> convince myself that its definitely all safe and robust.
>
> Right, we'd have to identify the other possible cases and document what an arch
> + common code must stick to to make it work.
>
> Some rules would be: if an arch implements ptep_get_lockless_norecency():
>
> (1) Passing the result from ptep_get_lockless_norecency() to pte_same()
>     is wrong.
> (2) Checking pte_young()/pte_old/pte_dirty()/pte_clean() after
>     ptep_get_lockless_norecency() is very likely wrong.
>
>
>>
>> So I'm leaning towards dropping patch 3 and therefore keeping
>> ptep_get_lockless() around.
>>
>> Let me know if you have any insight that might help me change my mind :)
>
> I'm wondering if it would help if we could find a better name (or concept) for
> "norecency" here, that expresses that only on some archs we'd have that fuzzy
> handling.
>
> Keeping ptep_get_lockless() around for now sounds like the best alternative.

Separately to this I've been playing with an idea to add support for uffd-wp and
soft-dirty SW PTE bits for arm64; it boils down to keeping the SW bits in
separate storage, linked from the ptdesc. And we have some constant HW PTE bits
that we can remove and replace with those SW bits so we can keep the pte_t the
same size and abstract it all with ptep_get() and set_ptes().
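
Purely to illustrate the idea (all names here are hypothetical; nothing
like this exists today):

/*
 * Hypothetical: look up the SW-only bits (uffd-wp, soft-dirty) for this
 * pte in side storage hung off the ptdesc; stubbed out for the sketch.
 */
static unsigned long sw_pte_bits_of(pte_t *ptep)
{
        return 0;
}

static inline pte_t sketch_ptep_get(pte_t *ptep)
{
        pte_t pte = READ_ONCE(*ptep);                   /* HW bits: load 1 */
        unsigned long sw = sw_pte_bits_of(ptep);        /* SW bits: load 2 */

        /* fold the SW bits into the spare (constant) positions of the pte */
        return __pte(pte_val(pte) | sw);
}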

It was all looking straightforward until I got to ptep_get_lockless(). Now that
there are 2 separate locations for PTE bits, I can't read it all atomically.

So I've been looking at all this again, and getting myself even more confused.

I believe there are 2 classes of ptep_get_lockless() caller:

1) vmf->orig_pte = ptep_get_lockless(vmf->pte); in handle_pte_fault()
2) everyone else

--

(1) doesn't really care if orig_pte is consistent or not. It just wants to read
a value, do some speculative work based on that value, then lock the PTL, and
check the value hasn't changed. If it has changed, it backs out. So we don't
actually need any "lockless" guarantees here; we could just use ptep_get().

In fact, prior to Hugh's commit 26e1a0c3277d ("mm: use pmdp_get_lockless()
without surplus barrier()"), we had this:

vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
- vmf->orig_pte = *vmf->pte;
+ vmf->orig_pte = ptep_get_lockless(vmf->pte);
vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;

- /*
- * some architectures can have larger ptes than wordsize,
- * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
- * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic
- * accesses. The code below just needs a consistent view
- * for the ifs and we later double check anyway with the
- * ptl lock held. So here a barrier will do.
- */
- barrier();
if (pte_none(vmf->orig_pte)) {
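
Either way, the shape of the pattern is roughly this (sketch only,
glossing over the details of the real handle_pte_fault()):

        vmf->orig_pte = ptep_get_lockless(vmf->pte);    /* may be stale */

        /* ... speculative work based on vmf->orig_pte ... */

        vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
        spin_lock(vmf->ptl);
        if (!pte_same(ptep_get(vmf->pte), vmf->orig_pte)) {
                /* raced with an update (or read garbage): back out */
                pte_unmap_unlock(vmf->pte, vmf->ptl);
                return 0;
        }
        /* the value was (still) good; safe to proceed under the PTL */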

--

(2) All the other users require that a subset of the pte fields are
self-consistent; specifically they don't care about access, dirty, uffd-wp or
soft-dirty. arm64 can guarantee that all the other bits are self-consistent
just by calling ptep_get().

--

So, I'm making the bold claim that it was never necessary to specialize
pte_get_lockless() on arm64, and I *think* we could just delete it so that
ptep_get_lockless() resolves to ptep_get() on arm64. That solves the original
aim without needing to introduce "norecency" variants.

Additionally I propose documenting ptep_get_lockless() to describe the set of
fields that are guaranteed to be self-consistent and the remaining fields which
are self-consistent only on a best-effort basis.
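
Something along these lines (wording purely illustrative):

/*
 * ptep_get_lockless() - read a pte without holding the PTL.
 *
 * All fields in the returned pte, other than access and dirty, are
 * guaranteed to be read self-consistently. Access and dirty information
 * is best-effort only and may be inconsistent with the rest of the pte
 * if there is a racing modification.
 */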

Could it be this easy? My head is hurting...

Thanks,
Ryan


2024-04-10 20:09:52

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

[...]

Skipping the ptdesc stuff we discussed offline, to not get distracted. :)

This stuff is killing me, sorry for the lengthy reply ...

>
> So I've been looking at all this again, and getting myself even more confused.
>
> I believe there are 2 classes of ptep_get_lockless() caller:
>
> 1) vmf->orig_pte = ptep_get_lockless(vmf->pte); in handle_pte_fault()
> 2) everyone else

Likely only completely lockless page table walkers where we *cannot*
recheck under PTL is special. Essentially where we disable interrupts
and rely on TLB flushes to sync against concurrent changes.

Let's take a look where ptep_get_lockless() comes from:

commit 2a4a06da8a4b93dd189171eed7a99fffd38f42f3
Author: Peter Zijlstra <[email protected]>
Date: Fri Nov 13 11:41:40 2020 +0100

mm/gup: Provide gup_get_pte() more generic

In order to write another lockless page-table walker, we need
gup_get_pte() exposed. While doing that, rename it to
ptep_get_lockless() to match the existing ptep_get() naming.


GUP-fast, when we were still relying on TLB flushes to sync against
GUP-fast.

"With get_user_pages_fast(), we walk down the pagetables without taking
any locks. For this we would like to load the pointers atomically, but
sometimes that is not possible (e.g. without expensive cmpxchg8b on
x86_32 PAE). What we do have is the guarantee that a PTE will only
either go from not present to present, or present to not present or both
-- it will not switch to a completely different present page without a
TLB flush in between; something hat we are blocking by holding
interrupts off."

Later, we added support for GUP-fast that introduced the !TLB variant:

commit 2667f50e8b81457fcb4a3dbe6aff3e81ea009e13
Author: Steve Capper <[email protected]>
Date: Thu Oct 9 15:29:14 2014 -0700

mm: introduce a general RCU get_user_pages_fast()

With the pattern

/*
* In the line below we are assuming that the pte can be read
* atomically. If this is not the case for your architecture,
* please wrap this in a helper function!
*
* for an example see gup_get_pte in arch/x86/mm/gup.c
*/
pte_t pte = ACCESS_ONCE(*ptep);
..
if (unlikely(pte_val(pte) != pte_val(*ptep))) {
..


Whereby the mentioned arch/x86/mm/gup.c code did a straight pte_t pte =
gup_get_pte(ptep) without any re-reading of PTEs. The PTE that was read
was required to be sane, thus the lengthy comment above
ptep_get_lockless() that talks about TLB flushes.

The comment above ptep_get_lockless() for CONFIG_GUP_GET_PXX_LOW_HIGH is
still full of details about TLB flushes syncing against GUP-fast. But as
you note, we use it even in contexts where we don't disable interrupts
and the TLB flush can't help.
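
For reference, that variant is the split low/high read with a retry loop,
roughly (quoting from memory, trimmed):

static inline pte_t ptep_get_lockless(pte_t *ptep)
{
        pte_t pte;

        do {
                pte.pte_low = ptep->pte_low;
                smp_rmb();
                pte.pte_high = ptep->pte_high;
                smp_rmb();
        } while (unlikely(pte.pte_low != ptep->pte_low));

        return pte;
}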

We don't disable interrupts during page faults ... so most of the things
described in ptep_get_lockless() don't really apply.

That's also the reason why ...
>
> vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
> - vmf->orig_pte = *vmf->pte;
> + vmf->orig_pte = ptep_get_lockless(vmf->pte);
> vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
>
> - /*
> - * some architectures can have larger ptes than wordsize,
> - * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
> - * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic
> - * accesses. The code below just needs a consistent view
> - * for the ifs and we later double check anyway with the
> - * ptl lock held. So here a barrier will do.
> - */
> - barrier();
> if (pte_none(vmf->orig_pte)) {

.. that code was in place. We would possibly read garbage PTEs, but
would recheck *under PTL* (where the PTE can no longer change) that what
we read wasn't garbage and didn't change.

>
> --
>
> (2) All the other users require that a subset of the pte fields are
> self-consistent; specifically they don't care about access, dirty, uffd-wp or
> soft-dirty. arm64 can guarrantee that all the other bits are self-consistent
> just by calling ptep_get().

Let's focus on access+dirty for now ;)

>
> --
>
> So, I'm making the bold claim that it was never neccessary to specialize
> pte_get_lockless() on arm64, and I *think* we could just delete it so that
> ptep_get_lockless() resolves to ptep_get() on arm64. That solves the original
> aim without needing to introduce "norecency" variants.
>
> Additionally I propose documenting ptep_get_lockless() to describe the set of
> fields that are guarranteed to be self-consistent and the remaining fields which
> are self-consistent only with best-effort.
>
> Could it be this easy? My head is hurting...

I think what has to happen is:

(1) pte_get_lockless() must return the same value as ptep_get() as long
as there are no races. No removal/addition of access/dirty bits etc.

(2) Lockless page table walkers that later verify under the PTL can
handle serious "garbage PTEs". This is our page fault handler.

(3) Lockless page table walkers that cannot verify under PTL cannot
handle arbitrary garbage PTEs. This is GUP-fast. Two options:

(3a) pte_get_lockless() can atomically read the PTE: We re-check later
if the atomically-read PTE is still unchanged (without PTL). No IPI for
TLB flushes required. This is the common case. HW might concurrently set
access/dirty bits, so we can race with that. But we don't read garbage.

(3b) pte_get_lockless() cannot atomically read the PTE: We need special
magic to read somewhat-sane PTEs and need IPIs during TLB flushes to
sync against serious PTE changes (e.g., present -> present). This is
weird x86-PAE.


If ptep_get() on arm64 can do (1), (2) and (3a), we might be good.

My head is hurting ...

--
Cheers,

David / dhildenb


2024-04-11 09:56:15

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

On 10/04/2024 21:09, David Hildenbrand wrote:
> [...]
>
> Skipping the ptdesc stuff we discussed offline, to not get distracted. :)
>
> This stuff is killing me, sorry for the lengthy reply ...

No problem - thanks for taking the time to think it through and reply with such
clarity. :)

>
>>
>> So I've been looking at all this again, and getting myself even more confused.
>>
>> I believe there are 2 classes of ptep_get_lockless() caller:
>>
>> 1) vmf->orig_pte = ptep_get_lockless(vmf->pte); in handle_pte_fault()
>> 2) everyone else
>
> Likely only completely lockless page table walkers where we *cannot* recheck
> under PTL is special. Essentially where we disable interrupts and rely on TLB
> flushes to sync against concurrent changes.

Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and
"lockless walkers that never take the PTL".

Detail: the part about disabling interrupts and TLB flush syncing is
arch-specific. That's not how arm64 does it (the hw broadcasts the TLBIs). But
you make that clear further down.

>
> Let's take a look where ptep_get_lockless() comes from:
>
> commit 2a4a06da8a4b93dd189171eed7a99fffd38f42f3
> Author: Peter Zijlstra <[email protected]>
> Date:   Fri Nov 13 11:41:40 2020 +0100
>
>     mm/gup: Provide gup_get_pte() more generic
>
>     In order to write another lockless page-table walker, we need
>     gup_get_pte() exposed. While doing that, rename it to
>     ptep_get_lockless() to match the existing ptep_get() naming.
>
>
> GUP-fast, when we were still relying on TLB flushes to sync against GUP-fast.
>
> "With get_user_pages_fast(), we walk down the pagetables without taking any
> locks.  For this we would like to load the pointers atomically, but sometimes
> that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE).  What we
> do have is the guarantee that a PTE will only either go from not present to
> present, or present to not present or both -- it will not switch to a completely
> different present page without a TLB flush in between; something hat we are
> blocking by holding interrupts off."
>
> Later, we added support for GUP-fast that introduced the !TLB variant:
>
> commit 2667f50e8b81457fcb4a3dbe6aff3e81ea009e13
> Author: Steve Capper <[email protected]>
> Date:   Thu Oct 9 15:29:14 2014 -0700
>
>     mm: introduce a general RCU get_user_pages_fast()
>
> With the pattern
>
> /*
>  * In the line below we are assuming that the pte can be read
>  * atomically. If this is not the case for your architecture,
>  * please wrap this in a helper function!
>  *
>  * for an example see gup_get_pte in arch/x86/mm/gup.c
>  */
> pte_t pte = ACCESS_ONCE(*ptep);
> ...
> if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> ...
>
>
> Whereby the mentioned arch/x86/mm/gup.c code did a straight pte_t pte =
> gup_get_pte(ptep) without any re-reading of PTEs. The PTE that was read was
> required to be sane, this the lengthy comment above ptep_get_lockless() that
> talks about TLB flushes.
>
> The comment above ptep_get_lockless() for CONFIG_GUP_GET_PXX_LOW_HIGH is still
> full of details about TLB flushes syncing against GUP-fast. But as you note, we
> use it even in contexts where we don't disable interrupts and the TLB flush
> can't help.
>
> We don't disable interrupts during page faults ... so most of the things
> described in ptep_get_lockless() don't really apply.
>
> That's also the reason why ...
>>
>>                  vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
>> -               vmf->orig_pte = *vmf->pte;
>> +               vmf->orig_pte = ptep_get_lockless(vmf->pte);
>>                  vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
>>
>> -               /*
>> -                * some architectures can have larger ptes than wordsize,
>> -                * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
>> -                * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic
>> -                * accesses.  The code below just needs a consistent view
>> -                * for the ifs and we later double check anyway with the
>> -                * ptl lock held. So here a barrier will do.
>> -                */
>> -               barrier();
>>                  if (pte_none(vmf->orig_pte)) {
>
> ... that code was in place. We would possibly read garbage PTEs, but would
> recheck *under PTL* (where the PTE can no longer change) that what we read
> wasn't garbage and didn't change.

Agreed.

>
>>
>> --
>>
>> (2) All the other users require that a subset of the pte fields are
>> self-consistent; specifically they don't care about access, dirty, uffd-wp or
>> soft-dirty. arm64 can guarrantee that all the other bits are self-consistent
>> just by calling ptep_get().
>
> Let's focus on access+dirty for now ;)
>
>>
>> --
>>
>> So, I'm making the bold claim that it was never neccessary to specialize
>> pte_get_lockless() on arm64, and I *think* we could just delete it so that
>> ptep_get_lockless() resolves to ptep_get() on arm64. That solves the original
>> aim without needing to introduce "norecency" variants.
>>
>> Additionally I propose documenting ptep_get_lockless() to describe the set of
>> fields that are guarranteed to be self-consistent and the remaining fields which
>> are self-consistent only with best-effort.
>>
>> Could it be this easy? My head is hurting...
>
> I think what has to happen is:
>
> (1) pte_get_lockless() must return the same value as ptep_get() as long as there
> are no races. No removal/addition of access/dirty bits etc.

Today's arm64 ptep_get() guarantees this.

>
> (2) Lockless page table walkers that later verify under the PTL can handle
> serious "garbage PTEs". This is our page fault handler.

This isn't really a property of ptep_get_lockless(); it's a statement about a
class of users. I agree with the statement.

>
> (3) Lockless page table walkers that cannot verify under PTL cannot handle
> arbitrary garbage PTEs. This is GUP-fast. Two options:
>
> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if the
> atomically-read PTE is still unchanged (without PTL). No IPI for TLB flushes
> required. This is the common case. HW might concurrently set access/dirty bits,
> so we can race with that. But we don't read garbage.

Today's arm64 ptep_get() cannot guarantee that the access/dirty bits are
consistent for contpte ptes. That's the bit that complicates the current
ptep_get_lockless() implementation.

But the point I was trying to make is that GUP-fast does not actually care about
*all* the fields being consistent (e.g. access/dirty). So we could spec
pte_get_lockless() to say that "all fields in the returned pte are guaranteed
to be self-consistent except for access and dirty information, which may be
inconsistent if a racing modification occurred".

This could mean that the access/dirty state *does* change for a given page while
GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I *think*
that failing to detect this is benign.

Aside: GUP-fast currently rechecks the pte originally obtained with
ptep_get_lockless(), using ptep_get(). Is that correct? ptep_get() must conform
to (1), so either it returns the same pte or it returns a different pte or
garbage. But that garbage could just happen to be the same as the originally
obtained pte. So in that case, it would have a false match. I think this needs
to be changed to ptep_get_lockless()?
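
The check I mean is roughly this (paraphrasing gup_pte_range(), error
paths trimmed):

        pte_t pte = ptep_get_lockless(ptep);

        /* ... grab a reference on the folio ... */

        /* the recheck in question -- currently done with ptep_get() */
        if (unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
                gup_put_folio(folio, 1, flags);
                goto pte_unmap;
        }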

>
> (3b) pte_get_lockless() cannot atomically read the PTE: We need special magic to
> read somewhat-sane PTEs and need IPIs during TLB flushes to sync against serious
> PTE changes (e.g., present -> present). This is weird x86-PAE.
>
>
> If ptep_get() on arm64 can do (1), (2) and (3a), we might be good.
>
> My head is hurting ...
>


2024-04-12 20:17:28

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

>
> Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and
> "lockless walkers that never take the PTL".
>
> Detail: the part about disabling interrupts and TLB flush syncing is
> arch-specifc. That's not how arm64 does it (the hw broadcasts the TLBIs). But
> you make that clear further down.

Yes, but disabling interrupts is also required for RCU-freeing of page
tables such that they can be walked safely. The TLB flush IPI is
arch-specific and indeed to sync against PTE invalidation (before
generic GUP-fast).
[...]

>>>
>>> Could it be this easy? My head is hurting...
>>
>> I think what has to happen is:
>>
>> (1) pte_get_lockless() must return the same value as ptep_get() as long as there
>> are no races. No removal/addition of access/dirty bits etc.
>
> Today's arm64 ptep_get() guarantees this.
>
>>
>> (2) Lockless page table walkers that later verify under the PTL can handle
>> serious "garbage PTEs". This is our page fault handler.
>
> This isn't really a property of a ptep_get_lockless(); its a statement about a
> class of users. I agree with the statement.

Yes. That's a requirement for the user of ptep_get_lockless(), such as
page fault handlers. Well, mostly "not GUP".

>
>>
>> (3) Lockless page table walkers that cannot verify under PTL cannot handle
>> arbitrary garbage PTEs. This is GUP-fast. Two options:
>>
>> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if the
>> atomically-read PTE is still unchanged (without PTL). No IPI for TLB flushes
>> required. This is the common case. HW might concurrently set access/dirty bits,
>> so we can race with that. But we don't read garbage.
>
> Today's arm64 ptep_get() cannot garantee that the access/dirty bits are
> consistent for contpte ptes. That's the bit that complicates the current
> ptep_get_lockless() implementation.
>
> But the point I was trying to make is that GUP-fast does not actually care about
> *all* the fields being consistent (e.g. access/dirty). So we could spec
> pte_get_lockless() to say that "all fields in the returned pte are guarranteed
> to be self-consistent except for access and dirty information, which may be
> inconsistent if a racing modification occured".

We *might*, in the future, have KVM want to check that a PTE is dirty,
such that we can only allow dirty PTEs to be writable in a secondary
MMU. That's not there yet, but it's one thing I was discussing on the
list recently. Buried in:

https://lkml.kernel.org/r/[email protected]

We wouldn't care about racing modifications, as long as MMU notifiers
will properly notify us when the PTE would lose its dirty bits.

But getting false-positive dirty bits would be problematic.

>
> This could mean that the access/dirty state *does* change for a given page while
> GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I *think*
> that failing to detect this is benign.

I mean, HW could just set the dirty/access bit immediately after the
check. So if HW concurrently sets the bit and we don't observe that
change when we recheck, I think that would be perfectly fine.

>
> Aside: GUP-fast currently rechecks the pte originally obtained with
> ptep_get_lockless(), using ptep_get(). Is that correct? ptep_get() must conform
> to (1), so either it returns the same pte or it returns a different pte or
> garbage. But that garbage could just happen to be the same as the originally
> obtained pte. So in that case, it would have a false match. I think this needs
> to be changed to ptep_get_lockless()?

I *think* it's fine, because the case where it would make a difference
(x86-PAE) still requires the TLB flush IPI to sync against PTE changes,
and that check would likely be wrong in one way or the other. So for
x86-pae, that check is just moot either way.

That's my theory, at least.

(but this "let's fake-read atomically although we don't, but let's do
like we could in some specific circumstances" is really hard to get)

I was wondering a while ago if we are missing a memory barrier before
the check, but I think the one from obtaining the page reference gets
the job done (at least that's what I remember).

--
Cheers,

David / dhildenb


2024-04-15 09:29:08

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

On 12/04/2024 21:16, David Hildenbrand wrote:
>>
>> Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and
>> "lockless walkers that never take the PTL".
>>
>> Detail: the part about disabling interrupts and TLB flush syncing is
>> arch-specifc. That's not how arm64 does it (the hw broadcasts the TLBIs). But
>> you make that clear further down.
>
> Yes, but disabling interrupts is also required for RCU-freeing of page tables
> such that they can be walked safely. The TLB flush IPI is arch-specific and
> indeed to sync against PTE invalidation (before generic GUP-fast).
> [...]
>
>>>>
>>>> Could it be this easy? My head is hurting...
>>>
>>> I think what has to happen is:
>>>
>>> (1) pte_get_lockless() must return the same value as ptep_get() as long as there
>>> are no races. No removal/addition of access/dirty bits etc.
>>
>> Today's arm64 ptep_get() guarantees this.
>>
>>>
>>> (2) Lockless page table walkers that later verify under the PTL can handle
>>> serious "garbage PTEs". This is our page fault handler.
>>
>> This isn't really a property of a ptep_get_lockless(); its a statement about a
>> class of users. I agree with the statement.
>
> Yes. That's a requirement for the user of ptep_get_lockless(), such as page
> fault handlers. Well, mostly "not GUP".
>
>>
>>>
>>> (3) Lockless page table walkers that cannot verify under PTL cannot handle
>>> arbitrary garbage PTEs. This is GUP-fast. Two options:
>>>
>>> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if the
>>> atomically-read PTE is still unchanged (without PTL). No IPI for TLB flushes
>>> required. This is the common case. HW might concurrently set access/dirty bits,
>>> so we can race with that. But we don't read garbage.
>>
>> Today's arm64 ptep_get() cannot garantee that the access/dirty bits are
>> consistent for contpte ptes. That's the bit that complicates the current
>> ptep_get_lockless() implementation.
>>
>> But the point I was trying to make is that GUP-fast does not actually care about
>> *all* the fields being consistent (e.g. access/dirty). So we could spec
>> pte_get_lockless() to say that "all fields in the returned pte are guarranteed
>> to be self-consistent except for access and dirty information, which may be
>> inconsistent if a racing modification occured".
>
> We *might* have KVM in the future want to check that a PTE is dirty, such that
> we can only allow dirty PTEs to be writable in a secondary MMU. That's not there
> yet, but one thing I was discussing on the list recently. Burried in:
>
> https://lkml.kernel.org/r/[email protected]
>
> We wouldn't care about racing modifications, as long as MMU notifiers will
> properly notify us when the PTE would lose its dirty bits.
>
> But getting false-positive dirty bits would be problematic.
>
>>
>> This could mean that the access/dirty state *does* change for a given page while
>> GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I *think*
>> that failing to detect this is benign.
>
> I mean, HW could just set the dirty/access bit immediately after the check. So
> if HW concurrently sets the bit and we don't observe that change when we
> recheck, I think that would be perfectly fine.

Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or
soft-dirty or uffd-wp).

But if you don't want to change the ptep_get_lockless() spec to explicitly allow
this (because you have the KVM use case where false-positive dirty is
problematic), then I think we are stuck with ptep_get_lockless() as implemented
for arm64 today.

>
>>
>> Aside: GUP-fast currently rechecks the pte originally obtained with
>> ptep_get_lockless(), using ptep_get(). Is that correct? ptep_get() must conform
>> to (1), so either it returns the same pte or it returns a different pte or
>> garbage. But that garbage could just happen to be the same as the originally
>> obtained pte. So in that case, it would have a false match. I think this needs
>> to be changed to ptep_get_lockless()?
>
> I *think* it's fine, because the case where it would make a difference (x86-PAE)
> still requires the TLB flush IPI to sync against PTE changes, and that check
> would likely be wrong in one way or the other. So for x86-pae, that check is
> just moot either way.
>
> That my theory, at least.
>
> (but this "let's fake-read atomically although we don't, but let's do like we
> could in some specific circumstances" is really hard to get)
>
> I was wondering a while ago if we are missing a memory barrier before the checl,
> but I think the one from obtaining the page reference gets the job done (at
> least that's what I remember).
>


2024-04-15 10:58:12

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

On 15.04.24 11:28, Ryan Roberts wrote:
> On 12/04/2024 21:16, David Hildenbrand wrote:
>>>
>>> Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and
>>> "lockless walkers that never take the PTL".
>>>
>>> Detail: the part about disabling interrupts and TLB flush syncing is
>>> arch-specifc. That's not how arm64 does it (the hw broadcasts the TLBIs). But
>>> you make that clear further down.
>>
>> Yes, but disabling interrupts is also required for RCU-freeing of page tables
>> such that they can be walked safely. The TLB flush IPI is arch-specific and
>> indeed to sync against PTE invalidation (before generic GUP-fast).
>> [...]
>>
>>>>>
>>>>> Could it be this easy? My head is hurting...
>>>>
>>>> I think what has to happen is:
>>>>
>>>> (1) pte_get_lockless() must return the same value as ptep_get() as long as there
>>>> are no races. No removal/addition of access/dirty bits etc.
>>>
>>> Today's arm64 ptep_get() guarantees this.
>>>
>>>>
>>>> (2) Lockless page table walkers that later verify under the PTL can handle
>>>> serious "garbage PTEs". This is our page fault handler.
>>>
>>> This isn't really a property of a ptep_get_lockless(); its a statement about a
>>> class of users. I agree with the statement.
>>
>> Yes. That's a requirement for the user of ptep_get_lockless(), such as page
>> fault handlers. Well, mostly "not GUP".
>>
>>>
>>>>
>>>> (3) Lockless page table walkers that cannot verify under PTL cannot handle
>>>> arbitrary garbage PTEs. This is GUP-fast. Two options:
>>>>
>>>> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if the
>>>> atomically-read PTE is still unchanged (without PTL). No IPI for TLB flushes
>>>> required. This is the common case. HW might concurrently set access/dirty bits,
>>>> so we can race with that. But we don't read garbage.
>>>
>>> Today's arm64 ptep_get() cannot garantee that the access/dirty bits are
>>> consistent for contpte ptes. That's the bit that complicates the current
>>> ptep_get_lockless() implementation.
>>>
>>> But the point I was trying to make is that GUP-fast does not actually care about
>>> *all* the fields being consistent (e.g. access/dirty). So we could spec
>>> pte_get_lockless() to say that "all fields in the returned pte are guarranteed
>>> to be self-consistent except for access and dirty information, which may be
>>> inconsistent if a racing modification occured".
>>
>> We *might* have KVM in the future want to check that a PTE is dirty, such that
>> we can only allow dirty PTEs to be writable in a secondary MMU. That's not there
>> yet, but one thing I was discussing on the list recently. Burried in:
>>
>> https://lkml.kernel.org/r/[email protected]
>>
>> We wouldn't care about racing modifications, as long as MMU notifiers will
>> properly notify us when the PTE would lose its dirty bits.
>>
>> But getting false-positive dirty bits would be problematic.
>>
>>>
>>> This could mean that the access/dirty state *does* change for a given page while
>>> GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I *think*
>>> that failing to detect this is benign.
>>
>> I mean, HW could just set the dirty/access bit immediately after the check. So
>> if HW concurrently sets the bit and we don't observe that change when we
>> recheck, I think that would be perfectly fine.
>
> Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or
> soft-dirty or uffd-wp).
>
> But if you don't want to change the ptep_get_lockless() spec to explicitly allow
> this (because you have the KVM use case where false-positive dirty is
> problematic), then I think we are stuck with ptep_get_lockless() as implemented
> for arm64 today.

At least regarding the dirty bit, we'd have to guarantee that if
ptep_get_lockless() returns a false-positive dirty bit, that the PTE
recheck would be able to catch that.

Would that be possible?

--
Cheers,

David / dhildenb


2024-04-15 13:47:55

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

On 15/04/2024 11:57, David Hildenbrand wrote:
> On 15.04.24 11:28, Ryan Roberts wrote:
>> On 12/04/2024 21:16, David Hildenbrand wrote:
>>>>
>>>> Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and
>>>> "lockless walkers that never take the PTL".
>>>>
>>>> Detail: the part about disabling interrupts and TLB flush syncing is
>>>> arch-specifc. That's not how arm64 does it (the hw broadcasts the TLBIs). But
>>>> you make that clear further down.
>>>
>>> Yes, but disabling interrupts is also required for RCU-freeing of page tables
>>> such that they can be walked safely. The TLB flush IPI is arch-specific and
>>> indeed to sync against PTE invalidation (before generic GUP-fast).
>>> [...]
>>>
>>>>>>
>>>>>> Could it be this easy? My head is hurting...
>>>>>
>>>>> I think what has to happen is:
>>>>>
>>>>> (1) pte_get_lockless() must return the same value as ptep_get() as long as
>>>>> there
>>>>> are no races. No removal/addition of access/dirty bits etc.
>>>>
>>>> Today's arm64 ptep_get() guarantees this.
>>>>
>>>>>
>>>>> (2) Lockless page table walkers that later verify under the PTL can handle
>>>>> serious "garbage PTEs". This is our page fault handler.
>>>>
>>>> This isn't really a property of a ptep_get_lockless(); its a statement about a
>>>> class of users. I agree with the statement.
>>>
>>> Yes. That's a requirement for the user of ptep_get_lockless(), such as page
>>> fault handlers. Well, mostly "not GUP".
>>>
>>>>
>>>>>
>>>>> (3) Lockless page table walkers that cannot verify under PTL cannot handle
>>>>> arbitrary garbage PTEs. This is GUP-fast. Two options:
>>>>>
>>>>> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if the
>>>>> atomically-read PTE is still unchanged (without PTL). No IPI for TLB flushes
>>>>> required. This is the common case. HW might concurrently set access/dirty
>>>>> bits,
>>>>> so we can race with that. But we don't read garbage.
>>>>
>>>> Today's arm64 ptep_get() cannot garantee that the access/dirty bits are
>>>> consistent for contpte ptes. That's the bit that complicates the current
>>>> ptep_get_lockless() implementation.
>>>>
>>>> But the point I was trying to make is that GUP-fast does not actually care
>>>> about
>>>> *all* the fields being consistent (e.g. access/dirty). So we could spec
>>>> pte_get_lockless() to say that "all fields in the returned pte are guarranteed
>>>> to be self-consistent except for access and dirty information, which may be
>>>> inconsistent if a racing modification occured".
>>>
>>> We *might* have KVM in the future want to check that a PTE is dirty, such that
>>> we can only allow dirty PTEs to be writable in a secondary MMU. That's not there
>>> yet, but one thing I was discussing on the list recently. Burried in:
>>>
>>> https://lkml.kernel.org/r/[email protected]
>>>
>>> We wouldn't care about racing modifications, as long as MMU notifiers will
>>> properly notify us when the PTE would lose its dirty bits.
>>>
>>> But getting false-positive dirty bits would be problematic.
>>>
>>>>
>>>> This could mean that the access/dirty state *does* change for a given page
>>>> while
>>>> GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I *think*
>>>> that failing to detect this is benign.
>>>
>>> I mean, HW could just set the dirty/access bit immediately after the check. So
>>> if HW concurrently sets the bit and we don't observe that change when we
>>> recheck, I think that would be perfectly fine.
>>
>> Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or
>> soft-dirty or uffd-wp).
>>
>> But if you don't want to change the ptep_get_lockless() spec to explicitly allow
>> this (because you have the KVM use case where false-positive dirty is
>> problematic), then I think we are stuck with ptep_get_lockless() as implemented
>> for arm64 today.
>
> At least regarding the dirty bit, we'd have to guarantee that if
> ptep_get_lockless() returns a false-positive dirty bit, that the PTE recheck
> would be able to catch that.
>
> Would that be possible?

Hmm maybe. My head hurts. Let me try to work through some examples...


Let's imagine for this example that a contpte block is 4 PTEs. Let's say PTEs 0,
1, 2 and 3 initially contpte-map an order-2 mTHP, FolioA. The dirty state is
stored in PTE0 for the contpte block, and it is dirty.

Now let's say there are 2 racing threads:

- ThreadA is doing a GUP-fast for PTE3
- ThreadB is remapping order-0 FolioB at PTE0

(ptep_get_lockless() below is actually arm64's ptep_get() for the sake of the
example - today's arm64 ptep_get_lockless() can handle the below correctly).

ThreadA                    ThreadB
=======                    =======

gup_pte_range()
  pte1 = ptep_get_lockless(PTE3)
    READ_ONCE(PTE3)
                   mmap(PTE0)
                     clear_pte(PTE0)
                       unfold(PTE0 - PTE3)
                         WRITE_ONCE(PTE0, 0)
                         WRITE_ONCE(PTE1, 0)
                         WRITE_ONCE(PTE2, 0)
    READ_ONCE(PTE0) (for a/d) << CLEAN!!
    READ_ONCE(PTE1) (for a/d)
    READ_ONCE(PTE2) (for a/d)
    READ_ONCE(PTE3) (for a/d)
  <do speculative work with pte1 content>
  pte2 = ptep_get_lockless(PTE3)
    READ_ONCE(PTE3)
    READ_ONCE(PTE0) (for a/d)
    READ_ONCE(PTE1) (for a/d)
    READ_ONCE(PTE2) (for a/d)
    READ_ONCE(PTE3) (for a/d)
  true = pte_same(pte1, pte2)
                         WRITE_ONCE(PTE3, 0)
                         TLBI
                         WRITE_ONCE(PTE0, <orig & ~CONT>)
                         WRITE_ONCE(PTE1, <orig & ~CONT>)
                         WRITE_ONCE(PTE2, <orig & ~CONT>)
                         WRITE_ONCE(PTE3, <orig & ~CONT>)
                       WRITE_ONCE(PTE0, 0)
                     set_pte_at(PTE0, <new>)

This example shows how a *false-negative* can be returned for the dirty state,
which isn't detected by the check.

I've been unable to come up with an example where a *false-positive* can be
returned for dirty state without the second ptep_get_lockless() noticing. In
this second example, let's assume everything is the same except FolioA is
initially clean:

ThreadA                    ThreadB
=======                    =======

gup_pte_range()
  pte1 = ptep_get_lockless(PTE3)
    READ_ONCE(PTE3)
                   mmap(PTE0)
                     clear_pte(PTE0)
                       unfold(PTE0 - PTE3)
                         WRITE_ONCE(PTE0, 0)
                         WRITE_ONCE(PTE1, 0)
                         WRITE_ONCE(PTE2, 0)
                         WRITE_ONCE(PTE3, 0)
                         TLBI
                         WRITE_ONCE(PTE0, <orig & ~CONT>)
                         WRITE_ONCE(PTE1, <orig & ~CONT>)
                         WRITE_ONCE(PTE2, <orig & ~CONT>)
                         WRITE_ONCE(PTE3, <orig & ~CONT>)
                       WRITE_ONCE(PTE0, 0)
                     set_pte_at(PTE0, <new>)
                   write to FolioB - HW sets PTE0's dirty
    READ_ONCE(PTE0) (for a/d) << DIRTY!!
    READ_ONCE(PTE1) (for a/d)
    READ_ONCE(PTE2) (for a/d)
    READ_ONCE(PTE3) (for a/d)
  <do speculative work with pte1 content>
  pte2 = ptep_get_lockless(PTE3)
    READ_ONCE(PTE3)           << BUT THIS IS FOR FolioB
    READ_ONCE(PTE0) (for a/d)
    READ_ONCE(PTE1) (for a/d)
    READ_ONCE(PTE2) (for a/d)
    READ_ONCE(PTE3) (for a/d)
  false = pte_same(pte1, pte2) << So this fails

The only way I can see a false-positive not being caught in the second example is
if ThreadB subsequently remaps the original folio, so you have an ABA scenario.
But these lockless table walkers are already susceptible to that.

I think all the same arguments can be extended to the access bit.


For me this is all getting rather subtle and difficult to reason about and even
harder to spec in a comprehensible way. The best I could come up with is:

"All fields in the returned pte are guarranteed to be self-consistent except for
access and dirty information, which may be inconsistent if a racing modification
occured. Additionally it is guranteed that false-positive access and/or dirty
information is not possible if 2 calls are made and both ptes are the same. Only
false-negative access and/or dirty information is possible in this scenario."

which is starting to sound bonkers. Personally I think we are better off at this
point, just keeping today's arm64 ptep_get_lockless().

Thanks,
Ryan


2024-04-15 14:23:49

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

On 15.04.24 15:30, Ryan Roberts wrote:
> On 15/04/2024 11:57, David Hildenbrand wrote:
>> On 15.04.24 11:28, Ryan Roberts wrote:
>>> On 12/04/2024 21:16, David Hildenbrand wrote:
>>>>>
>>>>> Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and
>>>>> "lockless walkers that never take the PTL".
>>>>>
>>>>> Detail: the part about disabling interrupts and TLB flush syncing is
>>>>> arch-specifc. That's not how arm64 does it (the hw broadcasts the TLBIs). But
>>>>> you make that clear further down.
>>>>
>>>> Yes, but disabling interrupts is also required for RCU-freeing of page tables
>>>> such that they can be walked safely. The TLB flush IPI is arch-specific and
>>>> indeed to sync against PTE invalidation (before generic GUP-fast).
>>>> [...]
>>>>
>>>>>>>
>>>>>>> Could it be this easy? My head is hurting...
>>>>>>
>>>>>> I think what has to happen is:
>>>>>>
>>>>>> (1) pte_get_lockless() must return the same value as ptep_get() as long as
>>>>>> there
>>>>>> are no races. No removal/addition of access/dirty bits etc.
>>>>>
>>>>> Today's arm64 ptep_get() guarantees this.
>>>>>
>>>>>>
>>>>>> (2) Lockless page table walkers that later verify under the PTL can handle
>>>>>> serious "garbage PTEs". This is our page fault handler.
>>>>>
>>>>> This isn't really a property of a ptep_get_lockless(); its a statement about a
>>>>> class of users. I agree with the statement.
>>>>
>>>> Yes. That's a requirement for the user of ptep_get_lockless(), such as page
>>>> fault handlers. Well, mostly "not GUP".
>>>>
>>>>>
>>>>>>
>>>>>> (3) Lockless page table walkers that cannot verify under PTL cannot handle
>>>>>> arbitrary garbage PTEs. This is GUP-fast. Two options:
>>>>>>
>>>>>> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if the
>>>>>> atomically-read PTE is still unchanged (without PTL). No IPI for TLB flushes
>>>>>> required. This is the common case. HW might concurrently set access/dirty
>>>>>> bits,
>>>>>> so we can race with that. But we don't read garbage.
>>>>>
>>>>> Today's arm64 ptep_get() cannot garantee that the access/dirty bits are
>>>>> consistent for contpte ptes. That's the bit that complicates the current
>>>>> ptep_get_lockless() implementation.
>>>>>
>>>>> But the point I was trying to make is that GUP-fast does not actually care
>>>>> about
>>>>> *all* the fields being consistent (e.g. access/dirty). So we could spec
>>>>> pte_get_lockless() to say that "all fields in the returned pte are guarranteed
>>>>> to be self-consistent except for access and dirty information, which may be
>>>>> inconsistent if a racing modification occured".
>>>>
>>>> We *might* have KVM in the future want to check that a PTE is dirty, such that
>>>> we can only allow dirty PTEs to be writable in a secondary MMU. That's not there
>>>> yet, but one thing I was discussing on the list recently. Burried in:
>>>>
>>>> https://lkml.kernel.org/r/[email protected]
>>>>
>>>> We wouldn't care about racing modifications, as long as MMU notifiers will
>>>> properly notify us when the PTE would lose its dirty bits.
>>>>
>>>> But getting false-positive dirty bits would be problematic.
>>>>
>>>>>
>>>>> This could mean that the access/dirty state *does* change for a given page
>>>>> while
>>>>> GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I *think*
>>>>> that failing to detect this is benign.
>>>>
>>>> I mean, HW could just set the dirty/access bit immediately after the check. So
>>>> if HW concurrently sets the bit and we don't observe that change when we
>>>> recheck, I think that would be perfectly fine.
>>>
>>> Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or
>>> soft-dirty or uffd-wp).
>>>
>>> But if you don't want to change the ptep_get_lockless() spec to explicitly allow
>>> this (because you have the KVM use case where false-positive dirty is
>>> problematic), then I think we are stuck with ptep_get_lockless() as implemented
>>> for arm64 today.
>>
>> At least regarding the dirty bit, we'd have to guarantee that if
>> ptep_get_lockless() returns a false-positive dirty bit, that the PTE recheck
>> would be able to catch that.
>>
>> Would that be possible?
>
> Hmm maybe. My head hurts. Let me try to work through some examples...
>
>
> Let's imagine for this example, a contpte block is 4 PTEs. Lat's say PTEs 0, 1,
> 2 and 3 initially contpte-map order-2 mTHP, FolioA. The dirty state is stored in
> PTE0 for the contpte block, and it is dirty.
>
> Now let's say there are 2 racing threads:
>
> - ThreadA is doing a GUP-fast for PTE3
> - ThreadB is remapping order-0 FolioB at PTE0
>
> (ptep_get_lockless() below is actaully arm64's ptep_get() for the sake of the
> example - today's arm64 ptep_get_lockless() can handle the below correctly).
>
> ThreadA ThreadB
> ======= =======
>
> gup_pte_range()
> pte1 = ptep_get_lockless(PTE3)
> READ_ONCE(PTE3)
> mmap(PTE0)
> clear_pte(PTE0)
> unfold(PTE0 - PTE3)
> WRITE_ONCE(PTE0, 0)
> WRITE_ONCE(PTE1, 0)
> WRITE_ONCE(PTE2, 0)
> READ_ONCE(PTE0) (for a/d) << CLEAN!!
> READ_ONCE(PTE1) (for a/d)
> READ_ONCE(PTE2) (for a/d)
> READ_ONCE(PTE3) (for a/d)
> <do speculative work with pte1 content>
> pte2 = ptep_get_lockless(PTE3)
> READ_ONCE(PTE3)
> READ_ONCE(PTE0) (for a/d)
> READ_ONCE(PTE1) (for a/d)
> READ_ONCE(PTE2) (for a/d)
> READ_ONCE(PTE3) (for a/d)
> true = pte_same(pte1, pte2)
> WRITE_ONCE(PTE3, 0)
> TLBI
> WRITE_ONCE(PTE0, <orig & ~CONT>)
> WRITE_ONCE(PTE1, <orig & ~CONT>)
> WRITE_ONCE(PTE2, <orig & ~CONT>)
> WRITE_ONCE(PTE3, <orig & ~CONT>)
> WRITE_ONCE(PTE0, 0)
> set_pte_at(PTE0, <new>)
>
> This example shows how a *false-negative* can be returned for the dirty state,
> which isn't detected by the check.
>
> I've been unable to come up with an example where a *false-positive* can be
> returned for dirty state without the second ptep_get_lockless() noticing. In
> this second example, let's assume everything is the same execpt FolioA is
> initially clean:
>
> ThreadA ThreadB
> ======= =======
>
> gup_pte_range()
> pte1 = ptep_get_lockless(PTE3)
> READ_ONCE(PTE3)
> mmap(PTE0)
> clear_pte(PTE0)
> unfold(PTE0 - PTE3)
> WRITE_ONCE(PTE0, 0)
> WRITE_ONCE(PTE1, 0)
> WRITE_ONCE(PTE2, 0)
> WRITE_ONCE(PTE3, 0)
> TLBI
> WRITE_ONCE(PTE0, <orig & ~CONT>)
> WRITE_ONCE(PTE1, <orig & ~CONT>)
> WRITE_ONCE(PTE2, <orig & ~CONT>)
> WRITE_ONCE(PTE3, <orig & ~CONT>)
> WRITE_ONCE(PTE0, 0)
> set_pte_at(PTE0, <new>)
> write to FolioB - HW sets PTE0's dirty
> READ_ONCE(PTE0) (for a/d) << DIRTY!!
> READ_ONCE(PTE1) (for a/d)
> READ_ONCE(PTE2) (for a/d)
> READ_ONCE(PTE3) (for a/d)
> <do speculative work with pte1 content>
> pte2 = ptep_get_lockless(PTE3)
> READ_ONCE(PTE3) << BUT THIS IS FOR FolioB
> READ_ONCE(PTE0) (for a/d)
> READ_ONCE(PTE1) (for a/d)
> READ_ONCE(PTE2) (for a/d)
> READ_ONCE(PTE3) (for a/d)
> false = pte_same(pte1, pte2) << So this fails
>
> The only way I can see false-positive not being caught in the second example is
> if ThreadB subseuently remaps the original folio, so you have an ABA scenario.
> But these lockless table walkers are already suseptible to that.
>
> I think all the same arguments can be extended to the access bit.
>
>
> For me this is all getting rather subtle and difficult to reason about and even
> harder to spec in a comprehensible way. The best I could come up with is:
>
> "All fields in the returned pte are guarranteed to be self-consistent except for
> access and dirty information, which may be inconsistent if a racing modification
> occured. Additionally it is guranteed that false-positive access and/or dirty
> information is not possible if 2 calls are made and both ptes are the same. Only
> false-negative access and/or dirty information is possible in this scenario."
>
> which is starting to sound bonkers. Personally I think we are better off at this
> point, just keeping today's arm64 ptep_get_lockless().

Remind me again, does arm64 perform an IPI broadcast during a TLB flush
that would sync against GUP-fast?

--
Cheers,

David / dhildenb


2024-04-15 14:37:29

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

On 15/04/2024 15:23, David Hildenbrand wrote:
> On 15.04.24 15:30, Ryan Roberts wrote:
>> On 15/04/2024 11:57, David Hildenbrand wrote:
>>> On 15.04.24 11:28, Ryan Roberts wrote:
>>>> On 12/04/2024 21:16, David Hildenbrand wrote:
>>>>>>
>>>>>> Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and
>>>>>> "lockless walkers that never take the PTL".
>>>>>>
>>>>>> Detail: the part about disabling interrupts and TLB flush syncing is
>>>>>> arch-specifc. That's not how arm64 does it (the hw broadcasts the TLBIs). But
>>>>>> you make that clear further down.
>>>>>
>>>>> Yes, but disabling interrupts is also required for RCU-freeing of page tables
>>>>> such that they can be walked safely. The TLB flush IPI is arch-specific and
>>>>> indeed to sync against PTE invalidation (before generic GUP-fast).
>>>>> [...]
>>>>>
>>>>>>>>
>>>>>>>> Could it be this easy? My head is hurting...
>>>>>>>
>>>>>>> I think what has to happen is:
>>>>>>>
>>>>>>> (1) pte_get_lockless() must return the same value as ptep_get() as long as
>>>>>>> there
>>>>>>> are no races. No removal/addition of access/dirty bits etc.
>>>>>>
>>>>>> Today's arm64 ptep_get() guarantees this.
>>>>>>
>>>>>>>
>>>>>>> (2) Lockless page table walkers that later verify under the PTL can handle
>>>>>>> serious "garbage PTEs". This is our page fault handler.
>>>>>>
>>>>>> This isn't really a property of a ptep_get_lockless(); its a statement
>>>>>> about a
>>>>>> class of users. I agree with the statement.
>>>>>
>>>>> Yes. That's a requirement for the user of ptep_get_lockless(), such as page
>>>>> fault handlers. Well, mostly "not GUP".
>>>>>
>>>>>>
>>>>>>>
>>>>>>> (3) Lockless page table walkers that cannot verify under PTL cannot handle
>>>>>>> arbitrary garbage PTEs. This is GUP-fast. Two options:
>>>>>>>
>>>>>>> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if
>>>>>>> the
>>>>>>> atomically-read PTE is still unchanged (without PTL). No IPI for TLB flushes
>>>>>>> required. This is the common case. HW might concurrently set access/dirty
>>>>>>> bits,
>>>>>>> so we can race with that. But we don't read garbage.
>>>>>>
>>>>>> Today's arm64 ptep_get() cannot garantee that the access/dirty bits are
>>>>>> consistent for contpte ptes. That's the bit that complicates the current
>>>>>> ptep_get_lockless() implementation.
>>>>>>
>>>>>> But the point I was trying to make is that GUP-fast does not actually care
>>>>>> about
>>>>>> *all* the fields being consistent (e.g. access/dirty). So we could spec
>>>>>> pte_get_lockless() to say that "all fields in the returned pte are
>>>>>> guarranteed
>>>>>> to be self-consistent except for access and dirty information, which may be
>>>>>> inconsistent if a racing modification occured".
>>>>>
>>>>> We *might* have KVM in the future want to check that a PTE is dirty, such that
>>>>> we can only allow dirty PTEs to be writable in a secondary MMU. That's not
>>>>> there
>>>>> yet, but one thing I was discussing on the list recently. Burried in:
>>>>>
>>>>> https://lkml.kernel.org/r/[email protected]
>>>>>
>>>>> We wouldn't care about racing modifications, as long as MMU notifiers will
>>>>> properly notify us when the PTE would lose its dirty bits.
>>>>>
>>>>> But getting false-positive dirty bits would be problematic.
>>>>>
>>>>>>
>>>>>> This could mean that the access/dirty state *does* change for a given page
>>>>>> while
>>>>>> GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I *think*
>>>>>> that failing to detect this is benign.
>>>>>
>>>>> I mean, HW could just set the dirty/access bit immediately after the check. So
>>>>> if HW concurrently sets the bit and we don't observe that change when we
>>>>> recheck, I think that would be perfectly fine.
>>>>
>>>> Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or
>>>> soft-dirty or uffd-wp).
>>>>
>>>> But if you don't want to change the ptep_get_lockless() spec to explicitly
>>>> allow
>>>> this (because you have the KVM use case where false-positive dirty is
>>>> problematic), then I think we are stuck with ptep_get_lockless() as implemented
>>>> for arm64 today.
>>>
>>> At least regarding the dirty bit, we'd have to guarantee that if
>>> ptep_get_lockless() returns a false-positive dirty bit, that the PTE recheck
>>> would be able to catch that.
>>>
>>> Would that be possible?
>>
>> Hmm maybe. My head hurts. Let me try to work through some examples...
>>
>>
>> Let's imagine for this example, a contpte block is 4 PTEs. Lat's say PTEs 0, 1,
>> 2 and 3 initially contpte-map order-2 mTHP, FolioA. The dirty state is stored in
>> PTE0 for the contpte block, and it is dirty.
>>
>> Now let's say there are 2 racing threads:
>>
>>    - ThreadA is doing a GUP-fast for PTE3
>>    - ThreadB is remapping order-0 FolioB at PTE0
>>
>> (ptep_get_lockless() below is actaully arm64's ptep_get() for the sake of the
>> example - today's arm64 ptep_get_lockless() can handle the below correctly).
>>
>> ThreadA                    ThreadB
>> =======                    =======
>>
>> gup_pte_range()
>>    pte1 = ptep_get_lockless(PTE3)
>>      READ_ONCE(PTE3)
>>                     mmap(PTE0)
>>                       clear_pte(PTE0)
>>                         unfold(PTE0 - PTE3)
>>                           WRITE_ONCE(PTE0, 0)
>>                           WRITE_ONCE(PTE1, 0)
>>                           WRITE_ONCE(PTE2, 0)
>>      READ_ONCE(PTE0) (for a/d) << CLEAN!!
>>      READ_ONCE(PTE1) (for a/d)
>>      READ_ONCE(PTE2) (for a/d)
>>      READ_ONCE(PTE3) (for a/d)
>>    <do speculative work with pte1 content>
>>    pte2 = ptep_get_lockless(PTE3)
>>      READ_ONCE(PTE3)
>>      READ_ONCE(PTE0) (for a/d)
>>      READ_ONCE(PTE1) (for a/d)
>>      READ_ONCE(PTE2) (for a/d)
>>      READ_ONCE(PTE3) (for a/d)
>>    true = pte_same(pte1, pte2)
>>                           WRITE_ONCE(PTE3, 0)
>>                           TLBI
>>                           WRITE_ONCE(PTE0, <orig & ~CONT>)
>>                           WRITE_ONCE(PTE1, <orig & ~CONT>)
>>                           WRITE_ONCE(PTE2, <orig & ~CONT>)
>>                           WRITE_ONCE(PTE3, <orig & ~CONT>)
>>                         WRITE_ONCE(PTE0, 0)
>>                       set_pte_at(PTE0, <new>)
>>
>> This example shows how a *false-negative* can be returned for the dirty state,
>> which isn't detected by the check.
>>
>> I've been unable to come up with an example where a *false-positive* can be
>> returned for dirty state without the second ptep_get_lockless() noticing. In
>> this second example, let's assume everything is the same execpt FolioA is
>> initially clean:
>>
>> ThreadA                    ThreadB
>> =======                    =======
>>
>> gup_pte_range()
>>    pte1 = ptep_get_lockless(PTE3)
>>      READ_ONCE(PTE3)
>>                     mmap(PTE0)
>>                       clear_pte(PTE0)
>>                         unfold(PTE0 - PTE3)
>>                           WRITE_ONCE(PTE0, 0)
>>                           WRITE_ONCE(PTE1, 0)
>>                           WRITE_ONCE(PTE2, 0)
>>                           WRITE_ONCE(PTE3, 0)
>>                           TLBI
>>                           WRITE_ONCE(PTE0, <orig & ~CONT>)
>>                           WRITE_ONCE(PTE1, <orig & ~CONT>)
>>                           WRITE_ONCE(PTE2, <orig & ~CONT>)
>>                           WRITE_ONCE(PTE3, <orig & ~CONT>)
>>                         WRITE_ONCE(PTE0, 0)
>>                       set_pte_at(PTE0, <new>)
>>                     write to FolioB - HW sets PTE0's dirty
>>      READ_ONCE(PTE0) (for a/d) << DIRTY!!
>>      READ_ONCE(PTE1) (for a/d)
>>      READ_ONCE(PTE2) (for a/d)
>>      READ_ONCE(PTE3) (for a/d)
>>    <do speculative work with pte1 content>
>>    pte2 = ptep_get_lockless(PTE3)
>>      READ_ONCE(PTE3)           << BUT THIS IS FOR FolioB
>>      READ_ONCE(PTE0) (for a/d)
>>      READ_ONCE(PTE1) (for a/d)
>>      READ_ONCE(PTE2) (for a/d)
>>      READ_ONCE(PTE3) (for a/d)
>>    false = pte_same(pte1, pte2) << So this fails
>>
>> The only way I can see false-positive not being caught in the second example is
>> if ThreadB subseuently remaps the original folio, so you have an ABA scenario.
>> But these lockless table walkers are already suseptible to that.
>>
>> I think all the same arguments can be extended to the access bit.
>>
>>
>> For me this is all getting rather subtle and difficult to reason about and even
>> harder to spec in a comprehensible way. The best I could come up with is:
>>
>> "All fields in the returned pte are guarranteed to be self-consistent except for
>> access and dirty information, which may be inconsistent if a racing modification
>> occured. Additionally it is guranteed that false-positive access and/or dirty
>> information is not possible if 2 calls are made and both ptes are the same. Only
>> false-negative access and/or dirty information is possible in this scenario."
>>
>> which is starting to sound bonkers. Personally I think we are better off at this
>> point, just keeping today's arm64 ptep_get_lockless().
>
> Remind me again, does arm64 perform an IPI broadcast during a TLB flush that
> would sync against GUP-fast?

No, the broadcast is in HW. There is no IPI.



2024-04-15 14:58:57

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

On 15.04.24 16:34, Ryan Roberts wrote:
> On 15/04/2024 15:23, David Hildenbrand wrote:
>> On 15.04.24 15:30, Ryan Roberts wrote:
>>> On 15/04/2024 11:57, David Hildenbrand wrote:
>>>> On 15.04.24 11:28, Ryan Roberts wrote:
>>>>> On 12/04/2024 21:16, David Hildenbrand wrote:
>>>>>>>
>>>>>>> Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and
>>>>>>> "lockless walkers that never take the PTL".
>>>>>>>
>>>>>>> Detail: the part about disabling interrupts and TLB flush syncing is
>>>>>>> arch-specifc. That's not how arm64 does it (the hw broadcasts the TLBIs). But
>>>>>>> you make that clear further down.
>>>>>>
>>>>>> Yes, but disabling interrupts is also required for RCU-freeing of page tables
>>>>>> such that they can be walked safely. The TLB flush IPI is arch-specific and
>>>>>> indeed to sync against PTE invalidation (before generic GUP-fast).
>>>>>> [...]
>>>>>>
>>>>>>>>>
>>>>>>>>> Could it be this easy? My head is hurting...
>>>>>>>>
>>>>>>>> I think what has to happen is:
>>>>>>>>
>>>>>>>> (1) pte_get_lockless() must return the same value as ptep_get() as long as
>>>>>>>> there
>>>>>>>> are no races. No removal/addition of access/dirty bits etc.
>>>>>>>
>>>>>>> Today's arm64 ptep_get() guarantees this.
>>>>>>>
>>>>>>>>
>>>>>>>> (2) Lockless page table walkers that later verify under the PTL can handle
>>>>>>>> serious "garbage PTEs". This is our page fault handler.
>>>>>>>
>>>>>>> This isn't really a property of a ptep_get_lockless(); its a statement
>>>>>>> about a
>>>>>>> class of users. I agree with the statement.
>>>>>>
>>>>>> Yes. That's a requirement for the user of ptep_get_lockless(), such as page
>>>>>> fault handlers. Well, mostly "not GUP".
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> (3) Lockless page table walkers that cannot verify under PTL cannot handle
>>>>>>>> arbitrary garbage PTEs. This is GUP-fast. Two options:
>>>>>>>>
>>>>>>>> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if
>>>>>>>> the
>>>>>>>> atomically-read PTE is still unchanged (without PTL). No IPI for TLB flushes
>>>>>>>> required. This is the common case. HW might concurrently set access/dirty
>>>>>>>> bits,
>>>>>>>> so we can race with that. But we don't read garbage.
>>>>>>>
>>>>>>> Today's arm64 ptep_get() cannot garantee that the access/dirty bits are
>>>>>>> consistent for contpte ptes. That's the bit that complicates the current
>>>>>>> ptep_get_lockless() implementation.
>>>>>>>
>>>>>>> But the point I was trying to make is that GUP-fast does not actually care
>>>>>>> about
>>>>>>> *all* the fields being consistent (e.g. access/dirty). So we could spec
>>>>>>> pte_get_lockless() to say that "all fields in the returned pte are
>>>>>>> guarranteed
>>>>>>> to be self-consistent except for access and dirty information, which may be
>>>>>>> inconsistent if a racing modification occured".
>>>>>>
>>>>>> We *might* have KVM in the future want to check that a PTE is dirty, such that
>>>>>> we can only allow dirty PTEs to be writable in a secondary MMU. That's not
>>>>>> there
>>>>>> yet, but one thing I was discussing on the list recently. Burried in:
>>>>>>
>>>>>> https://lkml.kernel.org/r/[email protected]
>>>>>>
>>>>>> We wouldn't care about racing modifications, as long as MMU notifiers will
>>>>>> properly notify us when the PTE would lose its dirty bits.
>>>>>>
>>>>>> But getting false-positive dirty bits would be problematic.
>>>>>>
>>>>>>>
>>>>>>> This could mean that the access/dirty state *does* change for a given page
>>>>>>> while
>>>>>>> GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I *think*
>>>>>>> that failing to detect this is benign.
>>>>>>
>>>>>> I mean, HW could just set the dirty/access bit immediately after the check. So
>>>>>> if HW concurrently sets the bit and we don't observe that change when we
>>>>>> recheck, I think that would be perfectly fine.
>>>>>
>>>>> Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or
>>>>> soft-dirty or uffd-wp).
>>>>>
>>>>> But if you don't want to change the ptep_get_lockless() spec to explicitly
>>>>> allow
>>>>> this (because you have the KVM use case where false-positive dirty is
>>>>> problematic), then I think we are stuck with ptep_get_lockless() as implemented
>>>>> for arm64 today.
>>>>
>>>> At least regarding the dirty bit, we'd have to guarantee that if
>>>> ptep_get_lockless() returns a false-positive dirty bit, that the PTE recheck
>>>> would be able to catch that.
>>>>
>>>> Would that be possible?
>>>
>>> Hmm maybe. My head hurts. Let me try to work through some examples...
>>>
>>>
>>> Let's imagine for this example, a contpte block is 4 PTEs. Lat's say PTEs 0, 1,
>>> 2 and 3 initially contpte-map order-2 mTHP, FolioA. The dirty state is stored in
>>> PTE0 for the contpte block, and it is dirty.
>>>
>>> Now let's say there are 2 racing threads:
>>>
>>>    - ThreadA is doing a GUP-fast for PTE3
>>>    - ThreadB is remapping order-0 FolioB at PTE0
>>>
>>> (ptep_get_lockless() below is actaully arm64's ptep_get() for the sake of the
>>> example - today's arm64 ptep_get_lockless() can handle the below correctly).
>>>
>>> ThreadA                    ThreadB
>>> =======                    =======
>>>
>>> gup_pte_range()
>>>    pte1 = ptep_get_lockless(PTE3)
>>>      READ_ONCE(PTE3)
>>>                     mmap(PTE0)
>>>                       clear_pte(PTE0)
>>>                         unfold(PTE0 - PTE3)
>>>                           WRITE_ONCE(PTE0, 0)
>>>                           WRITE_ONCE(PTE1, 0)
>>>                           WRITE_ONCE(PTE2, 0)
>>>      READ_ONCE(PTE0) (for a/d) << CLEAN!!
>>>      READ_ONCE(PTE1) (for a/d)
>>>      READ_ONCE(PTE2) (for a/d)
>>>      READ_ONCE(PTE3) (for a/d)
>>>    <do speculative work with pte1 content>
>>>    pte2 = ptep_get_lockless(PTE3)
>>>      READ_ONCE(PTE3)
>>>      READ_ONCE(PTE0) (for a/d)
>>>      READ_ONCE(PTE1) (for a/d)
>>>      READ_ONCE(PTE2) (for a/d)
>>>      READ_ONCE(PTE3) (for a/d)
>>>    true = pte_same(pte1, pte2)
>>>                           WRITE_ONCE(PTE3, 0)
>>>                           TLBI
>>>                           WRITE_ONCE(PTE0, <orig & ~CONT>)
>>>                           WRITE_ONCE(PTE1, <orig & ~CONT>)
>>>                           WRITE_ONCE(PTE2, <orig & ~CONT>)
>>>                           WRITE_ONCE(PTE3, <orig & ~CONT>)
>>>                         WRITE_ONCE(PTE0, 0)
>>>                       set_pte_at(PTE0, <new>)
>>>
>>> This example shows how a *false-negative* can be returned for the dirty state,
>>> which isn't detected by the check.
>>>
>>> I've been unable to come up with an example where a *false-positive* can be
>>> returned for dirty state without the second ptep_get_lockless() noticing. In
>>> this second example, let's assume everything is the same execpt FolioA is
>>> initially clean:
>>>
>>> ThreadA                    ThreadB
>>> =======                    =======
>>>
>>> gup_pte_range()
>>>    pte1 = ptep_get_lockless(PTE3)
>>>      READ_ONCE(PTE3)
>>>                     mmap(PTE0)
>>>                       clear_pte(PTE0)
>>>                         unfold(PTE0 - PTE3)
>>>                           WRITE_ONCE(PTE0, 0)
>>>                           WRITE_ONCE(PTE1, 0)
>>>                           WRITE_ONCE(PTE2, 0)
>>>                           WRITE_ONCE(PTE3, 0)
>>>                           TLBI
>>>                           WRITE_ONCE(PTE0, <orig & ~CONT>)
>>>                           WRITE_ONCE(PTE1, <orig & ~CONT>)
>>>                           WRITE_ONCE(PTE2, <orig & ~CONT>)
>>>                           WRITE_ONCE(PTE3, <orig & ~CONT>)
>>>                         WRITE_ONCE(PTE0, 0)
>>>                       set_pte_at(PTE0, <new>)
>>>                     write to FolioB - HW sets PTE0's dirty
>>>      READ_ONCE(PTE0) (for a/d) << DIRTY!!
>>>      READ_ONCE(PTE1) (for a/d)
>>>      READ_ONCE(PTE2) (for a/d)
>>>      READ_ONCE(PTE3) (for a/d)
>>>    <do speculative work with pte1 content>
>>>    pte2 = ptep_get_lockless(PTE3)
>>>      READ_ONCE(PTE3)           << BUT THIS IS FOR FolioB
>>>      READ_ONCE(PTE0) (for a/d)
>>>      READ_ONCE(PTE1) (for a/d)
>>>      READ_ONCE(PTE2) (for a/d)
>>>      READ_ONCE(PTE3) (for a/d)
>>>    false = pte_same(pte1, pte2) << So this fails
>>>
>>> The only way I can see false-positive not being caught in the second example is
>>> if ThreadB subseuently remaps the original folio, so you have an ABA scenario.
>>> But these lockless table walkers are already suseptible to that.
>>>
>>> I think all the same arguments can be extended to the access bit.
>>>
>>>
>>> For me this is all getting rather subtle and difficult to reason about and even
>>> harder to spec in a comprehensible way. The best I could come up with is:
>>>
>>> "All fields in the returned pte are guarranteed to be self-consistent except for
>>> access and dirty information, which may be inconsistent if a racing modification
>>> occured. Additionally it is guranteed that false-positive access and/or dirty
>>> information is not possible if 2 calls are made and both ptes are the same. Only
>>> false-negative access and/or dirty information is possible in this scenario."
>>>
>>> which is starting to sound bonkers. Personally I think we are better off at this
>>> point, just keeping today's arm64 ptep_get_lockless().
>>
>> Remind me again, does arm64 perform an IPI broadcast during a TLB flush that
>> would sync against GUP-fast?
>
> No, the broadcast is in HW. There is no IPI.

Okay ...

I agree that the semantics are a bit weird, but if we could get rid of
ptep_get_lockless() on arm64, that would also be nice.


Something I've been thinking of ... just to share what I've had in mind.
The two types of users we currently have are:

(1) ptep_get_lockless() followed by ptep_get() check under PTL.

(2) ptep_get_lockless() followed by ptep_get() check without PTL.

What if we had the following instead:

(1) ptep_get_lockless() followed by ptep_get() check under PTL.

(2) ptep_get_gup_fast() followed by ptep_get_gup_fast() check without
PTL.

And on arm64 let

(1) ptep_get_lockless() be ptep_get()

(2) ptep_get_gup_fast() be __ptep_get().
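
In other words, roughly the following on the arm64 side (only a sketch to
show the shape; ptep_get() and __ptep_get() are the existing helpers from
the contpte series, the wrapper functions themselves are illustrative):

static inline pte_t ptep_get_lockless(pte_t *ptep)
{
	/* (1): full value, including the collected contpte a/d bits */
	return ptep_get(ptep);
}

static inline pte_t ptep_get_gup_fast(pte_t *ptep)
{
	/* (2): one atomic read; a/d bits of the other cont-PTEs ignored */
	return __ptep_get(ptep);
}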

That would mean that (2) would not care if another cont-pte is dirty,
because we don't collect access+dirty bits. That way, we avoid any races
with concurrent unfolding etc. The only "problematic" thing is that
pte_mkdirty() -> set_ptes() would have to set all cont-PTEs dirty, even
if one of them is already dirty.

--
Cheers,

David / dhildenb


2024-04-15 15:17:54

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

On 15/04/2024 15:58, David Hildenbrand wrote:
> On 15.04.24 16:34, Ryan Roberts wrote:
>> On 15/04/2024 15:23, David Hildenbrand wrote:
>>> On 15.04.24 15:30, Ryan Roberts wrote:
>>>> On 15/04/2024 11:57, David Hildenbrand wrote:
>>>>> On 15.04.24 11:28, Ryan Roberts wrote:
>>>>>> On 12/04/2024 21:16, David Hildenbrand wrote:
>>>>>>>>
>>>>>>>> Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and
>>>>>>>> "lockless walkers that never take the PTL".
>>>>>>>>
>>>>>>>> Detail: the part about disabling interrupts and TLB flush syncing is
>>>>>>>> arch-specifc. That's not how arm64 does it (the hw broadcasts the
>>>>>>>> TLBIs). But
>>>>>>>> you make that clear further down.
>>>>>>>
>>>>>>> Yes, but disabling interrupts is also required for RCU-freeing of page
>>>>>>> tables
>>>>>>> such that they can be walked safely. The TLB flush IPI is arch-specific and
>>>>>>> indeed to sync against PTE invalidation (before generic GUP-fast).
>>>>>>> [...]
>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Could it be this easy? My head is hurting...
>>>>>>>>>
>>>>>>>>> I think what has to happen is:
>>>>>>>>>
>>>>>>>>> (1) pte_get_lockless() must return the same value as ptep_get() as long as
>>>>>>>>> there
>>>>>>>>> are no races. No removal/addition of access/dirty bits etc.
>>>>>>>>
>>>>>>>> Today's arm64 ptep_get() guarantees this.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> (2) Lockless page table walkers that later verify under the PTL can handle
>>>>>>>>> serious "garbage PTEs". This is our page fault handler.
>>>>>>>>
>>>>>>>> This isn't really a property of a ptep_get_lockless(); its a statement
>>>>>>>> about a
>>>>>>>> class of users. I agree with the statement.
>>>>>>>
>>>>>>> Yes. That's a requirement for the user of ptep_get_lockless(), such as page
>>>>>>> fault handlers. Well, mostly "not GUP".
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> (3) Lockless page table walkers that cannot verify under PTL cannot handle
>>>>>>>>> arbitrary garbage PTEs. This is GUP-fast. Two options:
>>>>>>>>>
>>>>>>>>> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if
>>>>>>>>> the
>>>>>>>>> atomically-read PTE is still unchanged (without PTL). No IPI for TLB
>>>>>>>>> flushes
>>>>>>>>> required. This is the common case. HW might concurrently set access/dirty
>>>>>>>>> bits,
>>>>>>>>> so we can race with that. But we don't read garbage.
>>>>>>>>
>>>>>>>> Today's arm64 ptep_get() cannot garantee that the access/dirty bits are
>>>>>>>> consistent for contpte ptes. That's the bit that complicates the current
>>>>>>>> ptep_get_lockless() implementation.
>>>>>>>>
>>>>>>>> But the point I was trying to make is that GUP-fast does not actually care
>>>>>>>> about
>>>>>>>> *all* the fields being consistent (e.g. access/dirty). So we could spec
>>>>>>>> pte_get_lockless() to say that "all fields in the returned pte are
>>>>>>>> guarranteed
>>>>>>>> to be self-consistent except for access and dirty information, which may be
>>>>>>>> inconsistent if a racing modification occured".
>>>>>>>
>>>>>>> We *might* have KVM in the future want to check that a PTE is dirty, such
>>>>>>> that
>>>>>>> we can only allow dirty PTEs to be writable in a secondary MMU. That's not
>>>>>>> there
>>>>>>> yet, but one thing I was discussing on the list recently. Burried in:
>>>>>>>
>>>>>>> https://lkml.kernel.org/r/[email protected]
>>>>>>>
>>>>>>> We wouldn't care about racing modifications, as long as MMU notifiers will
>>>>>>> properly notify us when the PTE would lose its dirty bits.
>>>>>>>
>>>>>>> But getting false-positive dirty bits would be problematic.
>>>>>>>
>>>>>>>>
>>>>>>>> This could mean that the access/dirty state *does* change for a given page
>>>>>>>> while
>>>>>>>> GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I
>>>>>>>> *think*
>>>>>>>> that failing to detect this is benign.
>>>>>>>
>>>>>>> I mean, HW could just set the dirty/access bit immediately after the
>>>>>>> check. So
>>>>>>> if HW concurrently sets the bit and we don't observe that change when we
>>>>>>> recheck, I think that would be perfectly fine.
>>>>>>
>>>>>> Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or
>>>>>> soft-dirty or uffd-wp).
>>>>>>
>>>>>> But if you don't want to change the ptep_get_lockless() spec to explicitly
>>>>>> allow
>>>>>> this (because you have the KVM use case where false-positive dirty is
>>>>>> problematic), then I think we are stuck with ptep_get_lockless() as
>>>>>> implemented
>>>>>> for arm64 today.
>>>>>
>>>>> At least regarding the dirty bit, we'd have to guarantee that if
>>>>> ptep_get_lockless() returns a false-positive dirty bit, that the PTE recheck
>>>>> would be able to catch that.
>>>>>
>>>>> Would that be possible?
>>>>
>>>> Hmm maybe. My head hurts. Let me try to work through some examples...
>>>>
>>>>
>>>> Let's imagine for this example, a contpte block is 4 PTEs. Lat's say PTEs 0, 1,
>>>> 2 and 3 initially contpte-map order-2 mTHP, FolioA. The dirty state is
>>>> stored in
>>>> PTE0 for the contpte block, and it is dirty.
>>>>
>>>> Now let's say there are 2 racing threads:
>>>>
>>>>     - ThreadA is doing a GUP-fast for PTE3
>>>>     - ThreadB is remapping order-0 FolioB at PTE0
>>>>
>>>> (ptep_get_lockless() below is actaully arm64's ptep_get() for the sake of the
>>>> example - today's arm64 ptep_get_lockless() can handle the below correctly).
>>>>
>>>> ThreadA                    ThreadB
>>>> =======                    =======
>>>>
>>>> gup_pte_range()
>>>>     pte1 = ptep_get_lockless(PTE3)
>>>>       READ_ONCE(PTE3)
>>>>                      mmap(PTE0)
>>>>                        clear_pte(PTE0)
>>>>                          unfold(PTE0 - PTE3)
>>>>                            WRITE_ONCE(PTE0, 0)
>>>>                            WRITE_ONCE(PTE1, 0)
>>>>                            WRITE_ONCE(PTE2, 0)
>>>>       READ_ONCE(PTE0) (for a/d) << CLEAN!!
>>>>       READ_ONCE(PTE1) (for a/d)
>>>>       READ_ONCE(PTE2) (for a/d)
>>>>       READ_ONCE(PTE3) (for a/d)
>>>>     <do speculative work with pte1 content>
>>>>     pte2 = ptep_get_lockless(PTE3)
>>>>       READ_ONCE(PTE3)
>>>>       READ_ONCE(PTE0) (for a/d)
>>>>       READ_ONCE(PTE1) (for a/d)
>>>>       READ_ONCE(PTE2) (for a/d)
>>>>       READ_ONCE(PTE3) (for a/d)
>>>>     true = pte_same(pte1, pte2)
>>>>                            WRITE_ONCE(PTE3, 0)
>>>>                            TLBI
>>>>                            WRITE_ONCE(PTE0, <orig & ~CONT>)
>>>>                            WRITE_ONCE(PTE1, <orig & ~CONT>)
>>>>                            WRITE_ONCE(PTE2, <orig & ~CONT>)
>>>>                            WRITE_ONCE(PTE3, <orig & ~CONT>)
>>>>                          WRITE_ONCE(PTE0, 0)
>>>>                        set_pte_at(PTE0, <new>)
>>>>
>>>> This example shows how a *false-negative* can be returned for the dirty state,
>>>> which isn't detected by the check.
>>>>
>>>> I've been unable to come up with an example where a *false-positive* can be
>>>> returned for dirty state without the second ptep_get_lockless() noticing. In
>>>> this second example, let's assume everything is the same execpt FolioA is
>>>> initially clean:
>>>>
>>>> ThreadA                    ThreadB
>>>> =======                    =======
>>>>
>>>> gup_pte_range()
>>>>     pte1 = ptep_get_lockless(PTE3)
>>>>       READ_ONCE(PTE3)
>>>>                      mmap(PTE0)
>>>>                        clear_pte(PTE0)
>>>>                          unfold(PTE0 - PTE3)
>>>>                            WRITE_ONCE(PTE0, 0)
>>>>                            WRITE_ONCE(PTE1, 0)
>>>>                            WRITE_ONCE(PTE2, 0)
>>>>                            WRITE_ONCE(PTE3, 0)
>>>>                            TLBI
>>>>                            WRITE_ONCE(PTE0, <orig & ~CONT>)
>>>>                            WRITE_ONCE(PTE1, <orig & ~CONT>)
>>>>                            WRITE_ONCE(PTE2, <orig & ~CONT>)
>>>>                            WRITE_ONCE(PTE3, <orig & ~CONT>)
>>>>                          WRITE_ONCE(PTE0, 0)
>>>>                        set_pte_at(PTE0, <new>)
>>>>                      write to FolioB - HW sets PTE0's dirty
>>>>       READ_ONCE(PTE0) (for a/d) << DIRTY!!
>>>>       READ_ONCE(PTE1) (for a/d)
>>>>       READ_ONCE(PTE2) (for a/d)
>>>>       READ_ONCE(PTE3) (for a/d)
>>>>     <do speculative work with pte1 content>
>>>>     pte2 = ptep_get_lockless(PTE3)
>>>>       READ_ONCE(PTE3)           << BUT THIS IS FOR FolioB
>>>>       READ_ONCE(PTE0) (for a/d)
>>>>       READ_ONCE(PTE1) (for a/d)
>>>>       READ_ONCE(PTE2) (for a/d)
>>>>       READ_ONCE(PTE3) (for a/d)
>>>>     false = pte_same(pte1, pte2) << So this fails
>>>>
>>>> The only way I can see false-positive not being caught in the second example is
>>>> if ThreadB subseuently remaps the original folio, so you have an ABA scenario.
>>>> But these lockless table walkers are already suseptible to that.
>>>>
>>>> I think all the same arguments can be extended to the access bit.
>>>>
>>>>
>>>> For me this is all getting rather subtle and difficult to reason about and even
>>>> harder to spec in a comprehensible way. The best I could come up with is:
>>>>
>>>> "All fields in the returned pte are guarranteed to be self-consistent except
>>>> for
>>>> access and dirty information, which may be inconsistent if a racing
>>>> modification
>>>> occured. Additionally it is guranteed that false-positive access and/or dirty
>>>> information is not possible if 2 calls are made and both ptes are the same.
>>>> Only
>>>> false-negative access and/or dirty information is possible in this scenario."
>>>>
>>>> which is starting to sound bonkers. Personally I think we are better off at
>>>> this
>>>> point, just keeping today's arm64 ptep_get_lockless().
>>>
>>> Remind me again, does arm64 perform an IPI broadcast during a TLB flush that
>>> would sync against GUP-fast?
>>
>> No, the broadcast is in HW. There is no IPI.
>
> Okay ...
>
> I agree that the semantics are a bit weird, but if we could get rid of
> ptep_get_lockless() on arm64, that would also be nice.
>
>
> Something I've been thinking of ... just to share what I've had in mind. The two
> types of users we currently have are:
>
> (1) ptep_get_lockless() followed by ptep_get() check under PTL.
>
> (2) ptep_get_lockless() followed by ptep_get() check without PTL.
>
> What if we had the following instead:
>
> (1) ptep_get_lockless() followed by ptep_get() check under PTL.
>
> (2) ptep_get_gup_fast() followed by ptep_get_gup_fast() check without
>     PTL.
>
> And on arm64 let
>
> (1) ptep_get_lockless() be ptep_get()
>
> (2) ptep_get_gup_fast() be __ptep_get().
>
> That would mean, that (2) would not care if another cont-pte is dirty, because
> we don't collect access+dirty bits. That way, we avoid any races with concurrent
> unfolding etc. The only "problamtic" thing is that pte_mkdirty() -> set_ptes()
> would have to set all cont-PTEs dirty, even if any of these already is dirty.

I don't think the "problematic" thing is actually a problem; set_ptes() will
always set the dirty bit to the same value for all ptes it covers (and if you do
set_ptes() on a partial contpte block, it will be unfolded first). Although I
suspect I've misunderstood what you meant there...

The potential problem I see with this is that the Arm ARM doesn't specify which
PTE of a contpte block the HW stores a/d in. So the HW _could_ update them
randomly and this could spuriously increase your check failure rate. In reality
I believe most implementations will update the PTE for the address that caused
the TLB to be populated. But in some cases, you could have eviction (due to
pressure or explicit invalidation) followed by re-population due to faulting on
a different page of the contpte block. In this case you would see this type of
problem too.

But ultimately, isn't this basically equivalent to ptep_get_lockless() returning
potentially false-negatives for access and dirty? Just with a much higher chance
of getting a false-negative. How is this helping?



2024-04-15 15:22:47

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

On 15.04.24 17:17, Ryan Roberts wrote:
> On 15/04/2024 15:58, David Hildenbrand wrote:
>> On 15.04.24 16:34, Ryan Roberts wrote:
>>> On 15/04/2024 15:23, David Hildenbrand wrote:
>>>> On 15.04.24 15:30, Ryan Roberts wrote:
>>>>> On 15/04/2024 11:57, David Hildenbrand wrote:
>>>>>> On 15.04.24 11:28, Ryan Roberts wrote:
>>>>>>> On 12/04/2024 21:16, David Hildenbrand wrote:
>>>>>>>>>
>>>>>>>>> Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and
>>>>>>>>> "lockless walkers that never take the PTL".
>>>>>>>>>
>>>>>>>>> Detail: the part about disabling interrupts and TLB flush syncing is
>>>>>>>>> arch-specifc. That's not how arm64 does it (the hw broadcasts the
>>>>>>>>> TLBIs). But
>>>>>>>>> you make that clear further down.
>>>>>>>>
>>>>>>>> Yes, but disabling interrupts is also required for RCU-freeing of page
>>>>>>>> tables
>>>>>>>> such that they can be walked safely. The TLB flush IPI is arch-specific and
>>>>>>>> indeed to sync against PTE invalidation (before generic GUP-fast).
>>>>>>>> [...]
>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Could it be this easy? My head is hurting...
>>>>>>>>>>
>>>>>>>>>> I think what has to happen is:
>>>>>>>>>>
>>>>>>>>>> (1) pte_get_lockless() must return the same value as ptep_get() as long as
>>>>>>>>>> there
>>>>>>>>>> are no races. No removal/addition of access/dirty bits etc.
>>>>>>>>>
>>>>>>>>> Today's arm64 ptep_get() guarantees this.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> (2) Lockless page table walkers that later verify under the PTL can handle
>>>>>>>>>> serious "garbage PTEs". This is our page fault handler.
>>>>>>>>>
>>>>>>>>> This isn't really a property of a ptep_get_lockless(); its a statement
>>>>>>>>> about a
>>>>>>>>> class of users. I agree with the statement.
>>>>>>>>
>>>>>>>> Yes. That's a requirement for the user of ptep_get_lockless(), such as page
>>>>>>>> fault handlers. Well, mostly "not GUP".
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> (3) Lockless page table walkers that cannot verify under PTL cannot handle
>>>>>>>>>> arbitrary garbage PTEs. This is GUP-fast. Two options:
>>>>>>>>>>
>>>>>>>>>> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if
>>>>>>>>>> the
>>>>>>>>>> atomically-read PTE is still unchanged (without PTL). No IPI for TLB
>>>>>>>>>> flushes
>>>>>>>>>> required. This is the common case. HW might concurrently set access/dirty
>>>>>>>>>> bits,
>>>>>>>>>> so we can race with that. But we don't read garbage.
>>>>>>>>>
>>>>>>>>> Today's arm64 ptep_get() cannot garantee that the access/dirty bits are
>>>>>>>>> consistent for contpte ptes. That's the bit that complicates the current
>>>>>>>>> ptep_get_lockless() implementation.
>>>>>>>>>
>>>>>>>>> But the point I was trying to make is that GUP-fast does not actually care
>>>>>>>>> about
>>>>>>>>> *all* the fields being consistent (e.g. access/dirty). So we could spec
>>>>>>>>> pte_get_lockless() to say that "all fields in the returned pte are
>>>>>>>>> guarranteed
>>>>>>>>> to be self-consistent except for access and dirty information, which may be
>>>>>>>>> inconsistent if a racing modification occured".
>>>>>>>>
>>>>>>>> We *might* have KVM in the future want to check that a PTE is dirty, such
>>>>>>>> that
>>>>>>>> we can only allow dirty PTEs to be writable in a secondary MMU. That's not
>>>>>>>> there
>>>>>>>> yet, but one thing I was discussing on the list recently. Burried in:
>>>>>>>>
>>>>>>>> https://lkml.kernel.org/r/[email protected]
>>>>>>>>
>>>>>>>> We wouldn't care about racing modifications, as long as MMU notifiers will
>>>>>>>> properly notify us when the PTE would lose its dirty bits.
>>>>>>>>
>>>>>>>> But getting false-positive dirty bits would be problematic.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> This could mean that the access/dirty state *does* change for a given page
>>>>>>>>> while
>>>>>>>>> GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I
>>>>>>>>> *think*
>>>>>>>>> that failing to detect this is benign.
>>>>>>>>
>>>>>>>> I mean, HW could just set the dirty/access bit immediately after the
>>>>>>>> check. So
>>>>>>>> if HW concurrently sets the bit and we don't observe that change when we
>>>>>>>> recheck, I think that would be perfectly fine.
>>>>>>>
>>>>>>> Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or
>>>>>>> soft-dirty or uffd-wp).
>>>>>>>
>>>>>>> But if you don't want to change the ptep_get_lockless() spec to explicitly
>>>>>>> allow
>>>>>>> this (because you have the KVM use case where false-positive dirty is
>>>>>>> problematic), then I think we are stuck with ptep_get_lockless() as
>>>>>>> implemented
>>>>>>> for arm64 today.
>>>>>>
>>>>>> At least regarding the dirty bit, we'd have to guarantee that if
>>>>>> ptep_get_lockless() returns a false-positive dirty bit, that the PTE recheck
>>>>>> would be able to catch that.
>>>>>>
>>>>>> Would that be possible?
>>>>>
>>>>> Hmm maybe. My head hurts. Let me try to work through some examples...
>>>>>
>>>>>
>>>>> Let's imagine for this example, a contpte block is 4 PTEs. Lat's say PTEs 0, 1,
>>>>> 2 and 3 initially contpte-map order-2 mTHP, FolioA. The dirty state is
>>>>> stored in
>>>>> PTE0 for the contpte block, and it is dirty.
>>>>>
>>>>> Now let's say there are 2 racing threads:
>>>>>
>>>>>     - ThreadA is doing a GUP-fast for PTE3
>>>>>     - ThreadB is remapping order-0 FolioB at PTE0
>>>>>
>>>>> (ptep_get_lockless() below is actaully arm64's ptep_get() for the sake of the
>>>>> example - today's arm64 ptep_get_lockless() can handle the below correctly).
>>>>>
>>>>> ThreadA                    ThreadB
>>>>> =======                    =======
>>>>>
>>>>> gup_pte_range()
>>>>>     pte1 = ptep_get_lockless(PTE3)
>>>>>       READ_ONCE(PTE3)
>>>>>                      mmap(PTE0)
>>>>>                        clear_pte(PTE0)
>>>>>                          unfold(PTE0 - PTE3)
>>>>>                            WRITE_ONCE(PTE0, 0)
>>>>>                            WRITE_ONCE(PTE1, 0)
>>>>>                            WRITE_ONCE(PTE2, 0)
>>>>>       READ_ONCE(PTE0) (for a/d) << CLEAN!!
>>>>>       READ_ONCE(PTE1) (for a/d)
>>>>>       READ_ONCE(PTE2) (for a/d)
>>>>>       READ_ONCE(PTE3) (for a/d)
>>>>>     <do speculative work with pte1 content>
>>>>>     pte2 = ptep_get_lockless(PTE3)
>>>>>       READ_ONCE(PTE3)
>>>>>       READ_ONCE(PTE0) (for a/d)
>>>>>       READ_ONCE(PTE1) (for a/d)
>>>>>       READ_ONCE(PTE2) (for a/d)
>>>>>       READ_ONCE(PTE3) (for a/d)
>>>>>     true = pte_same(pte1, pte2)
>>>>>                            WRITE_ONCE(PTE3, 0)
>>>>>                            TLBI
>>>>>                            WRITE_ONCE(PTE0, <orig & ~CONT>)
>>>>>                            WRITE_ONCE(PTE1, <orig & ~CONT>)
>>>>>                            WRITE_ONCE(PTE2, <orig & ~CONT>)
>>>>>                            WRITE_ONCE(PTE3, <orig & ~CONT>)
>>>>>                          WRITE_ONCE(PTE0, 0)
>>>>>                        set_pte_at(PTE0, <new>)
>>>>>
>>>>> This example shows how a *false-negative* can be returned for the dirty state,
>>>>> which isn't detected by the check.
>>>>>
>>>>> I've been unable to come up with an example where a *false-positive* can be
>>>>> returned for dirty state without the second ptep_get_lockless() noticing. In
>>>>> this second example, let's assume everything is the same execpt FolioA is
>>>>> initially clean:
>>>>>
>>>>> ThreadA                    ThreadB
>>>>> =======                    =======
>>>>>
>>>>> gup_pte_range()
>>>>>     pte1 = ptep_get_lockless(PTE3)
>>>>>       READ_ONCE(PTE3)
>>>>>                      mmap(PTE0)
>>>>>                        clear_pte(PTE0)
>>>>>                          unfold(PTE0 - PTE3)
>>>>>                            WRITE_ONCE(PTE0, 0)
>>>>>                            WRITE_ONCE(PTE1, 0)
>>>>>                            WRITE_ONCE(PTE2, 0)
>>>>>                            WRITE_ONCE(PTE3, 0)
>>>>>                            TLBI
>>>>>                            WRITE_ONCE(PTE0, <orig & ~CONT>)
>>>>>                            WRITE_ONCE(PTE1, <orig & ~CONT>)
>>>>>                            WRITE_ONCE(PTE2, <orig & ~CONT>)
>>>>>                            WRITE_ONCE(PTE3, <orig & ~CONT>)
>>>>>                          WRITE_ONCE(PTE0, 0)
>>>>>                        set_pte_at(PTE0, <new>)
>>>>>                      write to FolioB - HW sets PTE0's dirty
>>>>>       READ_ONCE(PTE0) (for a/d) << DIRTY!!
>>>>>       READ_ONCE(PTE1) (for a/d)
>>>>>       READ_ONCE(PTE2) (for a/d)
>>>>>       READ_ONCE(PTE3) (for a/d)
>>>>>     <do speculative work with pte1 content>
>>>>>     pte2 = ptep_get_lockless(PTE3)
>>>>>       READ_ONCE(PTE3)           << BUT THIS IS FOR FolioB
>>>>>       READ_ONCE(PTE0) (for a/d)
>>>>>       READ_ONCE(PTE1) (for a/d)
>>>>>       READ_ONCE(PTE2) (for a/d)
>>>>>       READ_ONCE(PTE3) (for a/d)
>>>>>     false = pte_same(pte1, pte2) << So this fails
>>>>>
>>>>> The only way I can see false-positive not being caught in the second example is
>>>>> if ThreadB subseuently remaps the original folio, so you have an ABA scenario.
>>>>> But these lockless table walkers are already suseptible to that.
>>>>>
>>>>> I think all the same arguments can be extended to the access bit.
>>>>>
>>>>>
>>>>> For me this is all getting rather subtle and difficult to reason about and even
>>>>> harder to spec in a comprehensible way. The best I could come up with is:
>>>>>
>>>>> "All fields in the returned pte are guarranteed to be self-consistent except
>>>>> for
>>>>> access and dirty information, which may be inconsistent if a racing
>>>>> modification
>>>>> occured. Additionally it is guranteed that false-positive access and/or dirty
>>>>> information is not possible if 2 calls are made and both ptes are the same.
>>>>> Only
>>>>> false-negative access and/or dirty information is possible in this scenario."
>>>>>
>>>>> which is starting to sound bonkers. Personally I think we are better off at
>>>>> this
>>>>> point, just keeping today's arm64 ptep_get_lockless().
>>>>
>>>> Remind me again, does arm64 perform an IPI broadcast during a TLB flush that
>>>> would sync against GUP-fast?
>>>
>>> No, the broadcast is in HW. There is no IPI.
>>
>> Okay ...
>>
>> I agree that the semantics are a bit weird, but if we could get rid of
>> ptep_get_lockless() on arm64, that would also be nice.
>>
>>
>> Something I've been thinking of ... just to share what I've had in mind. The two
>> types of users we currently have are:
>>
>> (1) ptep_get_lockless() followed by ptep_get() check under PTL.
>>
>> (2) ptep_get_lockless() followed by ptep_get() check without PTL.
>>
>> What if we had the following instead:
>>
>> (1) ptep_get_lockless() followed by ptep_get() check under PTL.
>>
>> (2) ptep_get_gup_fast() followed by ptep_get_gup_fast() check without
>>     PTL.
>>
>> And on arm64 let
>>
>> (1) ptep_get_lockless() be ptep_get()
>>
>> (2) ptep_get_gup_fast() be __ptep_get().
>>
>> That would mean, that (2) would not care if another cont-pte is dirty, because
>> we don't collect access+dirty bits. That way, we avoid any races with concurrent
>> unfolding etc. The only "problamtic" thing is that pte_mkdirty() -> set_ptes()
>> would have to set all cont-PTEs dirty, even if any of these already is dirty.
>
> I don't think the "problematic" thing is actually a problem; set_ptes() will
> always set the dirty bit to the same value for all ptes it covers (and if you do
> set_ptes() on a partial contpte block, it will be unfolded first). Although I
> suspect I've misunderstood what you meant there...

It's more code like the following that I am concerned about.

if (pte_dirty()) {
    /* Great, nothing to do */
} else {
    pte_mkdirty();
    set_ptes();
    ...
}

>
> The potential problem I see with this is that the Arm ARM doesn't specify which
> PTE of a contpte block the HW stores a/d in. So the HW _could_ update them
> randomly and this could spuriously increase your check failure rate. In reality
> I believe most implementations will update the PTE for the address that caused
> the TLB to be populated. But in some cases, you could have eviction (due to
> pressure or explicit invalidation) followed by re-population due to faulting on
> a different page of the contpte block. In this case you would see this type of
> problem too.
>
> But ultimately, isn't this basically equivalent to ptep_get_lockless() returning
> potentially false-negatives for access and dirty? Just with a much higher chance
> of getting a false-negative. How is this helping?

You are performing an atomic read like GUP-fast wants you to. So there
are no races to worry about like on other architectures: HW might *set*
the dirty bit concurrently, but that's just fine.
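
For reference, the pattern GUP-fast relies on is roughly the following
(heavily simplified from mm/gup.c's gup_pte_range(); the folio grabbing
and error paths are elided). With the proposal, both reads would simply
become ptep_get_gup_fast():

	pte_t pte = ptep_get_lockless(ptep);

	/* ... try_grab_folio() on the page that pte points at ... */

	/*
	 * Recheck without the PTL: if the PTE changed underneath us,
	 * drop the reference and bail out.
	 */
	if (unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
		gup_put_folio(folio, 1, flags);
		goto pte_unmap;
	}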

The whole races you describe with concurrent folding/unfolding/ ... are
irrelevant.

To me that sounds ... much simpler ;) But again, just something I've
been thinking about.

The reuse of pte_get_lockless() outside GUP code might not have been the
wisest choice.

--
Cheers,

David / dhildenb


2024-04-15 15:53:47

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

On 15/04/2024 16:22, David Hildenbrand wrote:
> On 15.04.24 17:17, Ryan Roberts wrote:
>> On 15/04/2024 15:58, David Hildenbrand wrote:
>>> On 15.04.24 16:34, Ryan Roberts wrote:
>>>> On 15/04/2024 15:23, David Hildenbrand wrote:
>>>>> On 15.04.24 15:30, Ryan Roberts wrote:
>>>>>> On 15/04/2024 11:57, David Hildenbrand wrote:
>>>>>>> On 15.04.24 11:28, Ryan Roberts wrote:
>>>>>>>> On 12/04/2024 21:16, David Hildenbrand wrote:
>>>>>>>>>>
>>>>>>>>>> Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and
>>>>>>>>>> "lockless walkers that never take the PTL".
>>>>>>>>>>
>>>>>>>>>> Detail: the part about disabling interrupts and TLB flush syncing is
>>>>>>>>>> arch-specifc. That's not how arm64 does it (the hw broadcasts the
>>>>>>>>>> TLBIs). But
>>>>>>>>>> you make that clear further down.
>>>>>>>>>
>>>>>>>>> Yes, but disabling interrupts is also required for RCU-freeing of page
>>>>>>>>> tables
>>>>>>>>> such that they can be walked safely. The TLB flush IPI is arch-specific
>>>>>>>>> and
>>>>>>>>> indeed to sync against PTE invalidation (before generic GUP-fast).
>>>>>>>>> [...]
>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Could it be this easy? My head is hurting...
>>>>>>>>>>>
>>>>>>>>>>> I think what has to happen is:
>>>>>>>>>>>
>>>>>>>>>>> (1) pte_get_lockless() must return the same value as ptep_get() as
>>>>>>>>>>> long as
>>>>>>>>>>> there
>>>>>>>>>>> are no races. No removal/addition of access/dirty bits etc.
>>>>>>>>>>
>>>>>>>>>> Today's arm64 ptep_get() guarantees this.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> (2) Lockless page table walkers that later verify under the PTL can
>>>>>>>>>>> handle
>>>>>>>>>>> serious "garbage PTEs". This is our page fault handler.
>>>>>>>>>>
>>>>>>>>>> This isn't really a property of a ptep_get_lockless(); its a statement
>>>>>>>>>> about a
>>>>>>>>>> class of users. I agree with the statement.
>>>>>>>>>
>>>>>>>>> Yes. That's a requirement for the user of ptep_get_lockless(), such as
>>>>>>>>> page
>>>>>>>>> fault handlers. Well, mostly "not GUP".
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> (3) Lockless page table walkers that cannot verify under PTL cannot
>>>>>>>>>>> handle
>>>>>>>>>>> arbitrary garbage PTEs. This is GUP-fast. Two options:
>>>>>>>>>>>
>>>>>>>>>>> (3a) pte_get_lockless() can atomically read the PTE: We re-check
>>>>>>>>>>> later if
>>>>>>>>>>> the
>>>>>>>>>>> atomically-read PTE is still unchanged (without PTL). No IPI for TLB
>>>>>>>>>>> flushes
>>>>>>>>>>> required. This is the common case. HW might concurrently set
>>>>>>>>>>> access/dirty
>>>>>>>>>>> bits,
>>>>>>>>>>> so we can race with that. But we don't read garbage.
>>>>>>>>>>
>>>>>>>>>> Today's arm64 ptep_get() cannot garantee that the access/dirty bits are
>>>>>>>>>> consistent for contpte ptes. That's the bit that complicates the current
>>>>>>>>>> ptep_get_lockless() implementation.
>>>>>>>>>>
>>>>>>>>>> But the point I was trying to make is that GUP-fast does not actually
>>>>>>>>>> care
>>>>>>>>>> about
>>>>>>>>>> *all* the fields being consistent (e.g. access/dirty). So we could spec
>>>>>>>>>> pte_get_lockless() to say that "all fields in the returned pte are
>>>>>>>>>> guarranteed
>>>>>>>>>> to be self-consistent except for access and dirty information, which
>>>>>>>>>> may be
>>>>>>>>>> inconsistent if a racing modification occured".
>>>>>>>>>
>>>>>>>>> We *might* have KVM in the future want to check that a PTE is dirty, such
>>>>>>>>> that
>>>>>>>>> we can only allow dirty PTEs to be writable in a secondary MMU. That's not
>>>>>>>>> there
>>>>>>>>> yet, but one thing I was discussing on the list recently. Burried in:
>>>>>>>>>
>>>>>>>>> https://lkml.kernel.org/r/[email protected]
>>>>>>>>>
>>>>>>>>> We wouldn't care about racing modifications, as long as MMU notifiers will
>>>>>>>>> properly notify us when the PTE would lose its dirty bits.
>>>>>>>>>
>>>>>>>>> But getting false-positive dirty bits would be problematic.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This could mean that the access/dirty state *does* change for a given
>>>>>>>>>> page
>>>>>>>>>> while
>>>>>>>>>> GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I
>>>>>>>>>> *think*
>>>>>>>>>> that failing to detect this is benign.
>>>>>>>>>
>>>>>>>>> I mean, HW could just set the dirty/access bit immediately after the
>>>>>>>>> check. So
>>>>>>>>> if HW concurrently sets the bit and we don't observe that change when we
>>>>>>>>> recheck, I think that would be perfectly fine.
>>>>>>>>
>>>>>>>> Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or
>>>>>>>> soft-dirty or uffd-wp).
>>>>>>>>
>>>>>>>> But if you don't want to change the ptep_get_lockless() spec to explicitly
>>>>>>>> allow
>>>>>>>> this (because you have the KVM use case where false-positive dirty is
>>>>>>>> problematic), then I think we are stuck with ptep_get_lockless() as
>>>>>>>> implemented
>>>>>>>> for arm64 today.
>>>>>>>
>>>>>>> At least regarding the dirty bit, we'd have to guarantee that if
>>>>>>> ptep_get_lockless() returns a false-positive dirty bit, that the PTE recheck
>>>>>>> would be able to catch that.
>>>>>>>
>>>>>>> Would that be possible?
>>>>>>
>>>>>> Hmm maybe. My head hurts. Let me try to work through some examples...
>>>>>>
>>>>>>
>>>>>> Let's imagine for this example, a contpte block is 4 PTEs. Lat's say PTEs
>>>>>> 0, 1,
>>>>>> 2 and 3 initially contpte-map order-2 mTHP, FolioA. The dirty state is
>>>>>> stored in
>>>>>> PTE0 for the contpte block, and it is dirty.
>>>>>>
>>>>>> Now let's say there are 2 racing threads:
>>>>>>
>>>>>>      - ThreadA is doing a GUP-fast for PTE3
>>>>>>      - ThreadB is remapping order-0 FolioB at PTE0
>>>>>>
>>>>>> (ptep_get_lockless() below is actaully arm64's ptep_get() for the sake of the
>>>>>> example - today's arm64 ptep_get_lockless() can handle the below correctly).
>>>>>>
>>>>>> ThreadA                    ThreadB
>>>>>> =======                    =======
>>>>>>
>>>>>> gup_pte_range()
>>>>>>      pte1 = ptep_get_lockless(PTE3)
>>>>>>        READ_ONCE(PTE3)
>>>>>>                       mmap(PTE0)
>>>>>>                         clear_pte(PTE0)
>>>>>>                           unfold(PTE0 - PTE3)
>>>>>>                             WRITE_ONCE(PTE0, 0)
>>>>>>                             WRITE_ONCE(PTE1, 0)
>>>>>>                             WRITE_ONCE(PTE2, 0)
>>>>>>        READ_ONCE(PTE0) (for a/d) << CLEAN!!
>>>>>>        READ_ONCE(PTE1) (for a/d)
>>>>>>        READ_ONCE(PTE2) (for a/d)
>>>>>>        READ_ONCE(PTE3) (for a/d)
>>>>>>      <do speculative work with pte1 content>
>>>>>>      pte2 = ptep_get_lockless(PTE3)
>>>>>>        READ_ONCE(PTE3)
>>>>>>        READ_ONCE(PTE0) (for a/d)
>>>>>>        READ_ONCE(PTE1) (for a/d)
>>>>>>        READ_ONCE(PTE2) (for a/d)
>>>>>>        READ_ONCE(PTE3) (for a/d)
>>>>>>      true = pte_same(pte1, pte2)
>>>>>>                             WRITE_ONCE(PTE3, 0)
>>>>>>                             TLBI
>>>>>>                             WRITE_ONCE(PTE0, <orig & ~CONT>)
>>>>>>                             WRITE_ONCE(PTE1, <orig & ~CONT>)
>>>>>>                             WRITE_ONCE(PTE2, <orig & ~CONT>)
>>>>>>                             WRITE_ONCE(PTE3, <orig & ~CONT>)
>>>>>>                           WRITE_ONCE(PTE0, 0)
>>>>>>                         set_pte_at(PTE0, <new>)
>>>>>>
>>>>>> This example shows how a *false-negative* can be returned for the dirty
>>>>>> state,
>>>>>> which isn't detected by the check.
>>>>>>
>>>>>> I've been unable to come up with an example where a *false-positive* can be
>>>>>> returned for dirty state without the second ptep_get_lockless() noticing. In
>>>>>> this second example, let's assume everything is the same execpt FolioA is
>>>>>> initially clean:
>>>>>>
>>>>>> ThreadA                    ThreadB
>>>>>> =======                    =======
>>>>>>
>>>>>> gup_pte_range()
>>>>>>      pte1 = ptep_get_lockless(PTE3)
>>>>>>        READ_ONCE(PTE3)
>>>>>>                       mmap(PTE0)
>>>>>>                         clear_pte(PTE0)
>>>>>>                           unfold(PTE0 - PTE3)
>>>>>>                             WRITE_ONCE(PTE0, 0)
>>>>>>                             WRITE_ONCE(PTE1, 0)
>>>>>>                             WRITE_ONCE(PTE2, 0)
>>>>>>                             WRITE_ONCE(PTE3, 0)
>>>>>>                             TLBI
>>>>>>                             WRITE_ONCE(PTE0, <orig & ~CONT>)
>>>>>>                             WRITE_ONCE(PTE1, <orig & ~CONT>)
>>>>>>                             WRITE_ONCE(PTE2, <orig & ~CONT>)
>>>>>>                             WRITE_ONCE(PTE3, <orig & ~CONT>)
>>>>>>                           WRITE_ONCE(PTE0, 0)
>>>>>>                         set_pte_at(PTE0, <new>)
>>>>>>                       write to FolioB - HW sets PTE0's dirty
>>>>>>        READ_ONCE(PTE0) (for a/d) << DIRTY!!
>>>>>>        READ_ONCE(PTE1) (for a/d)
>>>>>>        READ_ONCE(PTE2) (for a/d)
>>>>>>        READ_ONCE(PTE3) (for a/d)
>>>>>>      <do speculative work with pte1 content>
>>>>>>      pte2 = ptep_get_lockless(PTE3)
>>>>>>        READ_ONCE(PTE3)           << BUT THIS IS FOR FolioB
>>>>>>        READ_ONCE(PTE0) (for a/d)
>>>>>>        READ_ONCE(PTE1) (for a/d)
>>>>>>        READ_ONCE(PTE2) (for a/d)
>>>>>>        READ_ONCE(PTE3) (for a/d)
>>>>>>      false = pte_same(pte1, pte2) << So this fails
>>>>>>
>>>>>> The only way I can see false-positive not being caught in the second
>>>>>> example is
>>>>>> if ThreadB subseuently remaps the original folio, so you have an ABA
>>>>>> scenario.
>>>>>> But these lockless table walkers are already suseptible to that.
>>>>>>
>>>>>> I think all the same arguments can be extended to the access bit.
>>>>>>
>>>>>>
>>>>>> For me this is all getting rather subtle and difficult to reason about and
>>>>>> even
>>>>>> harder to spec in a comprehensible way. The best I could come up with is:
>>>>>>
>>>>>> "All fields in the returned pte are guarranteed to be self-consistent except
>>>>>> for
>>>>>> access and dirty information, which may be inconsistent if a racing
>>>>>> modification
>>>>>> occured. Additionally it is guranteed that false-positive access and/or dirty
>>>>>> information is not possible if 2 calls are made and both ptes are the same.
>>>>>> Only
>>>>>> false-negative access and/or dirty information is possible in this scenario."
>>>>>>
>>>>>> which is starting to sound bonkers. Personally I think we are better off at
>>>>>> this
>>>>>> point, just keeping today's arm64 ptep_get_lockless().
>>>>>
>>>>> Remind me again, does arm64 perform an IPI broadcast during a TLB flush that
>>>>> would sync against GUP-fast?
>>>>
>>>> No, the broadcast is in HW. There is no IPI.
>>>
>>> Okay ...
>>>
>>> I agree that the semantics are a bit weird, but if we could get rid of
>>> ptep_get_lockless() on arm64, that would also be nice.
>>>
>>>
>>> Something I've been thinking of ... just to share what I've had in mind. The two
>>> types of users we currently have are:
>>>
>>> (1) ptep_get_lockless() followed by ptep_get() check under PTL.
>>>
>>> (2) ptep_get_lockless() followed by ptep_get() check without PTL.
>>>
>>> What if we had the following instead:
>>>
>>> (1) ptep_get_lockless() followed by ptep_get() check under PTL.
>>>
>>> (2) ptep_get_gup_fast() followed by ptep_get_gup_fast() check without
>>>      PTL.
>>>
>>> And on arm64 let
>>>
>>> (1) ptep_get_lockless() be ptep_get()
>>>
>>> (2) ptep_get_gup_fast() be __ptep_get().
>>>
>>> That would mean, that (2) would not care if another cont-pte is dirty, because
>>> we don't collect access+dirty bits. That way, we avoid any races with concurrent
>>> unfolding etc. The only "problamtic" thing is that pte_mkdirty() -> set_ptes()
>>> would have to set all cont-PTEs dirty, even if any of these already is dirty.
>>
>> I don't think the "problematic" thing is actually a problem; set_ptes() will
>> always set the dirty bit to the same value for all ptes it covers (and if you do
>> set_ptes() on a partial contpte block, it will be unfolded first). Although I
>> suspect I've misunderstood what you meant there...
>
> It's more code like that following that I am concerned about.
>
> if (pte_dirty()) {
>     /* Great, nothing to do */
> } else {
>     pte_mkdirty();
>     set_ptes();
>     ...
> }

OK I see, so you're worried about unnecessary unfolding that the false-negative
dirty reporting could cause? I think the best solution there would be for the
core to use the clear_young_dirty_ptes(CYDP_CLEAR_DIRTY) API that Lance adds in
his series at [1]. That would avoid any unfolding and just dirty all contpte
block(s) touched by the range.

[1] https://lore.kernel.org/linux-mm/[email protected]/

>
>>
>> The potential problem I see with this is that the Arm ARM doesn't specify which
>> PTE of a contpte block the HW stores a/d in. So the HW _could_ update them
>> randomly and this could spuriously increase your check failure rate. In reality
>> I believe most implementations will update the PTE for the address that caused
>> the TLB to be populated. But in some cases, you could have eviction (due to
>> pressure or explicit invalidation) followed by re-population due to faulting on
>> a different page of the contpte block. In this case you would see this type of
>> problem too.
>>
>> But ultimately, isn't this basically equivalent to ptep_get_lockless() returning
>> potentially false-negatives for access and dirty? Just with a much higher chance
>> of getting a false-negative. How is this helping?
>
> You are performing an atomic read like GUP-fast wants you to. So there are no
> races to worry about like on other architectures: HW might *set* the dirty bit
> concurrently, but that's just fine.

But you can still see false-negatives for access and dirty...

>
> The whole races you describe with concurrent folding/unfolding/ ... are irrelevant.

And I think I convinced myself that you will only see false-negatives with
today's arm64 ptep_get(). But an order of magnitude fewer than with your
proposal (assuming 16 ptes per contpte block, and the a/d bits are in one of those).

>
> To me that sounds ... much simpler ;) But again, just something I've been
> thinking about.

OK so this approach upgrades my "I'm fairly sure we never see false-positives"
to "we definitely never see false-positives". But it certainly increases the
quantity of false-negatives.

>
> The reuse of pte_get_lockless() outside GUP code might not have been the wisest
> choice.
>

If you want to go down the ptep_get_gup_fast() route, you've still got to be
able to spec it, and I think it will land pretty close to my most recent stab at
respec'ing ptep_get_lockless() a couple of replies up on this thread.

Where would your proposal leave the KVM use case? If you call it
ptep_get_gup_fast() presumably you wouldn't want to use it for KVM? So it would
be left with ptep_get()...

Sorry this thread is getting so long. Just to summarise, I think there are
currently 3 solutions on the table:

- ptep_get_lockless() remains as is
- ptep_get_lockless() wraps ptep_get()
- ptep_get_lockless() wraps __ptep_get() (and gets a gup_fast rename)

Based on discussion so far, that's also the order of my preference.

Perhaps it's useful to enumerate why we dislike the current ptep_get_lockless()?
I think it's just the potential for looping in the face of concurrent modifications?
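
For reference, today's arm64 ptep_get_lockless() (contpte_ptep_get_lockless()
in the contpte series) is shaped roughly like the sketch below (an
illustration only, not the actual code):

static pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
{
	pte_t orig_pte, pte, *ptep;
	int i;

retry:
	orig_pte = __ptep_get(orig_ptep);
	if (!pte_valid_cont(orig_pte))
		return orig_pte;

	ptep = contpte_align_down(orig_ptep);

	for (i = 0; i < CONT_PTES; i++, ptep++) {
		pte = __ptep_get(ptep);

		/*
		 * The real code also checks that the pfn and pgprot are
		 * still consistent with orig_pte; any mismatch means a
		 * racing modification.
		 */
		if (!pte_valid_cont(pte))
			goto retry;

		if (pte_dirty(pte))
			orig_pte = pte_mkdirty(orig_pte);
		if (pte_young(pte))
			orig_pte = pte_mkyoung(orig_pte);
	}

	return orig_pte;
}

Any racing modification observed while gathering the a/d bits sends it back
to retry, and that restart is the looping I mean.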


2024-04-15 16:02:45

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

>>> The potential problem I see with this is that the Arm ARM doesn't specify which
>>> PTE of a contpte block the HW stores a/d in. So the HW _could_ update them
>>> randomly and this could spuriously increase your check failure rate. In reality
>>> I believe most implementations will update the PTE for the address that caused
>>> the TLB to be populated. But in some cases, you could have eviction (due to
>>> pressure or explicit invalidation) followed by re-population due to faulting on
>>> a different page of the contpte block. In this case you would see this type of
>>> problem too.
>>>
>>> But ultimately, isn't this basically equivalent to ptep_get_lockless() returning
>>> potentially false-negatives for access and dirty? Just with a much higher chance
>>> of getting a false-negative. How is this helping?
>>
>> You are performing an atomic read like GUP-fast wants you to. So there are no
>> races to worry about like on other architectures: HW might *set* the dirty bit
>> concurrently, but that's just fine.
>
> But you can still see false-negatives for access and dirty...

Yes.

>
>>
>> The whole races you describe with concurrent folding/unfolding/ ... are irrelevant.
>
> And I think I convinced myself that you will only see false-negatives with
> today's arm64 ptep_get(). But an order or magnitude fewer than with your
> proposal (assuming 16 ptes per contpte block, and the a/d bits are in one of those).
>
>>
>> To me that sounds ... much simpler ;) But again, just something I've been
>> thinking about.
>
> OK so this approach upgrades my "I'm fairly sure we never see false-positives"
> to "we definitely never see false-positives". But it certainly increases the
> quantity of false-negatives.

Yes.

>
>>
>> The reuse of pte_get_lockless() outside GUP code might not have been the wisest
>> choice.
>>
>
> If you want to go down the ptep_get_gup_fast() route, you've still got to be
> able to spec it, and I think it will land pretty close to my most recent stab at
> respec'ing ptep_get_lockless() a couple of replies up on this thread.
>
> Where would your proposal leave the KVM use case? If you call it
> ptep_get_gup_fast() presumably you wouldn't want to use it for KVM? So it would
> be left with ptep_get()...

It's using GUP-fast.

>
> Sorry this thread is getting so long. Just to summarise, I think there are
> currently 3 solutions on the table:
>
> - ptep_get_lockless() remains as is
> - ptep_get_lockless() wraps ptep_get()
> - ptep_get_lockless() wraps __ptep_get() (and gets a gup_fast rename)
>
> Based on discussion so far, that's also the order of my preference.

(1) seems like the easiest thing to do.

>
> Perhaps it's useful to enumerate why we dislike the current ptep_get_lockless()?

Well, you sent that patch series with "that aims to reduce the cost and
complexity of ptep_get_lockless() for arm64". (2) and (3) would achieve
that. :)

--
Cheers,

David / dhildenb


2024-04-23 10:15:27

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

Hi David,

Sorry for the slow reply on this; it was due to a combination of thinking a bit
more about the options here and being out on holiday.


On 15/04/2024 17:02, David Hildenbrand wrote:
>>>> The potential problem I see with this is that the Arm ARM doesn't specify which
>>>> PTE of a contpte block the HW stores a/d in. So the HW _could_ update them
>>>> randomly and this could spuriously increase your check failure rate. In reality
>>>> I believe most implementations will update the PTE for the address that caused
>>>> the TLB to be populated. But in some cases, you could have eviction (due to
>>>> pressure or explicit invalidation) followed by re-population due to faulting on
>>>> a different page of the contpte block. In this case you would see this type of
>>>> problem too.
>>>>
>>>> But ultimately, isn't this basically equivalent to ptep_get_lockless()
>>>> returning
>>>> potentially false-negatives for access and dirty? Just with a much higher
>>>> chance
>>>> of getting a false-negative. How is this helping?
>>>
>>> You are performing an atomic read like GUP-fast wants you to. So there are no
>>> races to worry about like on other architectures: HW might *set* the dirty bit
>>> concurrently, but that's just fine.
>>
>> But you can still see false-negatives for access and dirty...
>
> Yes.
>
>>
>>>
>>> The whole races you describe with concurrent folding/unfolding/ ... are
>>> irrelevant.
>>
>> And I think I convinced myself that you will only see false-negatives with
>> today's arm64 ptep_get(). But an order of magnitude fewer than with your
>> proposal (assuming 16 ptes per contpte block, and the a/d bits are in one of
>> those).
>>
>>>
>>> To me that sounds ... much simpler ;) But again, just something I've been
>>> thinking about.
>>
>> OK so this approach upgrades my "I'm fairly sure we never see false-positives"
>> to "we definitely never see false-positives". But it certainly increases the
>> quantity of false-negatives.
>
> Yes.
>
>>
>>>
>>> The reuse of pte_get_lockless() outside GUP code might not have been the wisest
>>> choice.
>>>
>>
>> If you want to go down the ptep_get_gup_fast() route, you've still got to be
>> able to spec it, and I think it will land pretty close to my most recent stab at
>> respec'ing ptep_get_lockless() a couple of replies up on this thread.
>>
>> Where would your proposal leave the KVM use case? If you call it
>> ptep_get_gup_fast() presumably you wouldn't want to use it for KVM? So it would
>> be left with ptep_get()...
>
> It's using GUP-fast.
>
>>
>> Sorry this thread is getting so long. Just to summarise, I think there are
>> currently 3 solutions on the table:
>>
>>    - ptep_get_lockless() remains as is
>>    - ptep_get_lockless() wraps ptep_get()
>>    - ptep_get_lockless() wraps __ptep_get() (and gets a gup_fast rename)
>>
>> Based on discussion so far, that's also the order of my preference.
>
> (1) seems like the easiest thing to do.

Yes, I'm very much in favour of easy.

>
>>
>> Perhaps it's useful to enumerate why we dislike the current ptep_get_lockless()?
>
> Well, you sent that patch series with "that aims to reduce the cost and
> complexity of ptep_get_lockless() for arm64". (2) and (3) would achieve that. :)

Touche! I'd half forgotten that we were having this conversation in the context
of this series!

I guess your ptep_get_gup_fast() approach is very similar to
ptep_get_lockless_norecency()... So we are back to the beginning :)

But ultimately I've come to the conclusion that it is easy to reason about the
current arm64 ptep_get_lockless() implementation and see that it's correct. The
other options both have their drawbacks.

Yes, there is a loop in the current implementation that would be nice to get rid
of, but I don't think it is really any worse than the cmpxchg loops we already
have in other helpers.
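
For comparison, the kind of cmpxchg loop I mean looks roughly like the
existing test-and-clear-young helper (quoting the shape from memory, so
treat the details as a sketch rather than the exact code):

static inline int __ptep_test_and_clear_young(struct vm_area_struct *vma,
					      unsigned long address,
					      pte_t *ptep)
{
	pte_t old_pte, pte;

	pte = __ptep_get(ptep);
	do {
		old_pte = pte;
		pte = pte_mkold(pte);
		/* retry until no concurrent modification races with us */
		pte_val(pte) = cmpxchg_relaxed(&pte_val(*ptep),
					       pte_val(old_pte), pte_val(pte));
	} while (pte_val(pte) != pte_val(old_pte));

	return pte_young(old_pte);
}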

I'm not planning to pursue this any further. Thanks for the useful discussion
(as always).



2024-04-23 10:19:00

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

On 23.04.24 12:15, Ryan Roberts wrote:
> Hi David,
>
> Sorry for the slow reply on this; it was due to a combination of thinking a bit
> more about the options here and being out on holiday.
>

No worries, there are things more important in life than
ptep_get_lockless() :D

>> (1) seems like the easiest thing to do.
>
> Yes, I'm very much in favour of easy.
>
>>
>>>
>>> Perhaps it's useful to enumerate why we dislike the current ptep_get_lockless()?
>>
>> Well, you sent that patch series with "that aims to reduce the cost and
>> complexity of ptep_get_lockless() for arm64". (2) and (3) would achieve that. :)
>
> Touche! I'd half forgotten that we were having this conversation in the context
> of this series!
>
> I guess your ptep_get_gup_fast() approach is very similar to
> ptep_get_lockless_norecency()... So we are back to the beginning :)

Except that it would be limited to GUP-fast :)

>
> But ultimately I've come to the conclusion that it is easy to reason about the
> current arm64 ptep_get_lockless() implementation and see that it's correct. The
> other options both have their drawbacks.

Yes.

>
> Yes, there is a loop in the current implementation that would be nice to get rid
> of, but I don't think it is really any worse than the cmpxchg loops we already
> have in other helpers.
>
> I'm not planning to pursue this any further. Thanks for the useful discussion
> (as always).

Makes sense to me. Let's leave it as is for the time being. (And also see
if a GUP-fast user that needs precise dirty/accessed actually gets real.)

--
Cheers,

David / dhildenb