2023-06-20 07:48:55

by Hugh Dickins

Subject: [PATCH v2 00/12] mm: free retracted page table by RCU

Here is v2 of the third series of patches to mm (and a few architectures),
based on v6.4-rc5 with the preceding two series applied: in which khugepaged
takes advantage of pte_offset_map[_lock]() allowing for pmd transitions.
Differences from v1 are noted patch by patch below.

This follows on from the v2 "arch: allow pte_offset_map[_lock]() to fail"
https://lore.kernel.org/linux-mm/[email protected]/
series of 23 posted on 2023-06-08 (and now in mm-stable - thank you),
and the v2 "mm: allow pte_offset_map[_lock]() to fail"
https://lore.kernel.org/linux-mm/[email protected]/
series of 32 posted on 2023-06-08 (and now in mm-stable - thank you),
and replaces the v1 "mm: free retracted page table by RCU"
https://lore.kernel.org/linux-mm/[email protected]/
series of 12 posted on 2023-05-28 (which was bad on powerpc and s390).

The first two series were "independent": neither depended for build or
correctness on the other, but both series had to be in before this third
series could be added, for its changes to take effect; and it would probably
be best to hold this series back until the following release, since it might
now reveal missed imbalances which the first series hoped to fix.

What is it all about? Some mmap_lock avoidance i.e. latency reduction.
Initially just for the case of collapsing shmem or file pages to THPs:
the usefulness of MADV_COLLAPSE on shmem is being limited by that
mmap_write_lock it currently requires.

Likely to be relied upon later in other contexts e.g. freeing of
empty page tables (but that's not work I'm doing). mmap_write_lock
avoidance when collapsing to anon THPs? Perhaps, but again that's not
work I've done: a quick attempt was not as easy as the shmem/file case.

These changes (though of course not these exact patches) have been in
Google's data centre kernel for three years now: we do rely upon them.

Based on the preceding two series over v6.4-rc5, or any v6.4-rc; and
almost good on current mm-everything or current linux-next - just one
patch conflicts, the 10/12: I'll reply to that one with its
mm-everything or linux-next equivalent (ptent replacing *pte).

01/12 mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s
v2: same as v1
02/12 mm/pgtable: add PAE safety to __pte_offset_map()
v2: rename to pmdp_get_lockless_start/end() per Matthew;
so use inlines without _irq_save(flags) macro oddity;
add pmdp_get_lockless_sync() for use later in 09/12.
03/12 arm: adjust_pte() use pte_offset_map_nolock()
v2: same as v1
04/12 powerpc: assert_pte_locked() use pte_offset_map_nolock()
v2: same as v1
05/12 powerpc: add pte_free_defer() for pgtables sharing page
v2: fix rcu_head usage to cope with concurrent deferrals;
add para to commit message explaining rcu_head issue.
06/12 sparc: add pte_free_defer() for pte_t *pgtable_t
v2: use page_address() instead of less common page_to_virt();
add para to commit message explaining simple conversion;
changed title since sparc64 pgtables do not share a page.
07/12 s390: add pte_free_defer() for pgtables sharing page
v2: complete rewrite, integrated with s390's existing pgtable
management; temporarily using a global mm_pgtable_list_lock,
to be restored to per-mm spinlock in a later followup patch.
08/12 mm/pgtable: add pte_free_defer() for pgtable as page
v2: add comment on rcu_head to "Page table pages", per JannH
09/12 mm/khugepaged: retract_page_tables() without mmap or vma lock
v2: repeat checks under ptl because UFFD, per PeterX and JannH;
bring back mmu_notifier calls for PMD, per JannH and Jason;
pmdp_get_lockless_sync() to issue missing interrupt if PAE.
10/12 mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
v2: first check VMA, in case page tables torn down, per JannH;
pmdp_get_lockless_sync() to issue missing interrupt if PAE;
moved mmu_notifier after step 1, reworked final goto labels.
11/12 mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps()
v2: same as v1
12/12 mm: delete mmap_write_trylock() and vma_try_start_write()
v2: same as v1

arch/arm/mm/fault-armv.c | 3 +-
arch/powerpc/include/asm/pgalloc.h | 4 +
arch/powerpc/mm/pgtable-frag.c | 51 ++++
arch/powerpc/mm/pgtable.c | 16 +-
arch/s390/include/asm/pgalloc.h | 4 +
arch/s390/mm/pgalloc.c | 205 +++++++++----
arch/sparc/include/asm/pgalloc_64.h | 4 +
arch/sparc/mm/init_64.c | 16 +
include/linux/mm.h | 17 --
include/linux/mm_types.h | 6 +-
include/linux/mmap_lock.h | 10 -
include/linux/pgtable.h | 10 +-
mm/khugepaged.c | 481 +++++++++++-------------------
mm/pgtable-generic.c | 53 +++-
14 files changed, 467 insertions(+), 413 deletions(-)

Hugh


2023-06-20 07:49:34

by Hugh Dickins

Subject: [PATCH v2 01/12] mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s

Before putting them to use (several commits later), add rcu_read_lock()
to pte_offset_map(), and rcu_read_unlock() to pte_unmap(). Make this a
separate commit, since it risks exposing imbalances: prior commits have
fixed all the known imbalances, but we may find some have been missed.
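
For callers the contract is unchanged, but it now has teeth: every successful
pte_offset_map() or pte_offset_map_lock() must be balanced by exactly one
pte_unmap() or pte_unmap_unlock(), including on error paths. A minimal
userspace sketch of that pairing (the fake_* names and the plain nesting
counter are stand-ins for illustration, not kernel code):

#include <assert.h>
#include <stdio.h>

static int rcu_nesting;				/* models RCU read-side depth */

static int *fake_pte_offset_map(int *pmd)	/* NULL if the "pmd" is gone */
{
	rcu_nesting++;				/* rcu_read_lock() */
	if (!pmd) {
		rcu_nesting--;			/* rcu_read_unlock() on failure */
		return NULL;
	}
	return pmd;
}

static void fake_pte_unmap(int *pte)
{
	(void)pte;
	rcu_nesting--;				/* rcu_read_unlock() */
}

int main(void)
{
	int pmdval = 42;
	int *pte = fake_pte_offset_map(&pmdval);

	if (pte)
		fake_pte_unmap(pte);

	pte = fake_pte_offset_map(NULL);	/* failure path balances itself */
	assert(!pte);
	assert(rcu_nesting == 0);		/* an imbalance would trip here */
	printf("map/unmap balanced\n");
	return 0;
}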

Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/pgtable.h | 4 ++--
mm/pgtable-generic.c | 4 ++--
2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index a1326e61d7ee..8b0fc7fdc46f 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -99,7 +99,7 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
((pte_t *)kmap_local_page(pmd_page(*(pmd))) + pte_index((address)))
#define pte_unmap(pte) do { \
kunmap_local((pte)); \
- /* rcu_read_unlock() to be added later */ \
+ rcu_read_unlock(); \
} while (0)
#else
static inline pte_t *__pte_map(pmd_t *pmd, unsigned long address)
@@ -108,7 +108,7 @@ static inline pte_t *__pte_map(pmd_t *pmd, unsigned long address)
}
static inline void pte_unmap(pte_t *pte)
{
- /* rcu_read_unlock() to be added later */
+ rcu_read_unlock();
}
#endif

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index c7ab18a5fb77..674671835631 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -236,7 +236,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
{
pmd_t pmdval;

- /* rcu_read_lock() to be added later */
+ rcu_read_lock();
pmdval = pmdp_get_lockless(pmd);
if (pmdvalp)
*pmdvalp = pmdval;
@@ -250,7 +250,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
}
return __pte_map(&pmdval, addr);
nomap:
- /* rcu_read_unlock() to be added later */
+ rcu_read_unlock();
return NULL;
}

--
2.35.3


2023-06-20 07:50:25

by Hugh Dickins

Subject: [PATCH v2 02/12] mm/pgtable: add PAE safety to __pte_offset_map()

There is a faint risk that __pte_offset_map(), on a 32-bit architecture
with a 64-bit pmd_t e.g. x86-32 with CONFIG_X86_PAE=y, would succeed on
a pmdval assembled from a pmd_low and a pmd_high which never belonged
together: their combination not pointing to a page table at all, perhaps
not even a valid pfn. pmdp_get_lockless() is not enough to prevent that.

Guard against that (on such configs) by local_irq_save() blocking TLB
flush between present updates, as linux/pgtable.h suggests. It's only
needed around the pmdp_get_lockless() in __pte_offset_map(): a race when
__pte_offset_map_lock() repeats the pmdp_get_lockless() after getting the
lock, would just send it back to __pte_offset_map() again.

Complement this pmdp_get_lockless_start() and pmdp_get_lockless_end(),
used only locally in __pte_offset_map(), with a pmdp_get_lockless_sync()
synonym for tlb_remove_table_sync_one(): to send the necessary interrupt
at the right moment on those configs which do not already send it.

CONFIG_GUP_GET_PXX_LOW_HIGH is enabled when required by mips, sh and x86.
It is not enabled by arm-32 CONFIG_ARM_LPAE: my understanding is that
Will Deacon's 2020 enhancements to READ_ONCE() are sufficient for arm.
It is not enabled by arc, but its pmd_t is 32-bit even when pte_t 64-bit.

Limit the IRQ disablement to CONFIG_HIGHPTE? Perhaps, but would need a
little more work, to retry if pmd_low good for page table, but pmd_high
non-zero from THP (and that might be making x86-specific assumptions).
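
For illustration of the hazard (not of the fix, since userspace cannot block
the TLB-flush interrupt), a small pthreads sketch, in which pmd_low/pmd_high
are plain illustrative variables: the writer always stores matching halves,
yet a reader loading low then high can observe halves from two different
updates. Build with -pthread.

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

static _Atomic uint32_t pmd_low, pmd_high;	/* two halves of a fake pmd */
static _Atomic int stop;

static void *writer(void *arg)
{
	for (uint32_t v = 1; !atomic_load(&stop); v++) {
		/* update the halves separately, as 32-bit stores would */
		atomic_store_explicit(&pmd_low, v, memory_order_relaxed);
		atomic_store_explicit(&pmd_high, v, memory_order_relaxed);
	}
	return NULL;
}

int main(void)
{
	pthread_t t;
	unsigned long torn = 0;

	pthread_create(&t, NULL, writer, NULL);
	for (int i = 0; i < 10 * 1000 * 1000; i++) {
		uint32_t lo = atomic_load_explicit(&pmd_low, memory_order_relaxed);
		uint32_t hi = atomic_load_explicit(&pmd_high, memory_order_relaxed);
		if (lo != hi)	/* halves were never written unequal: a torn read */
			torn++;
	}
	atomic_store(&stop, 1);
	pthread_join(t, NULL);
	printf("observed %lu mismatched low/high pairs\n", torn);
	return 0;
}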

Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/pgtable.h | 4 ++++
mm/pgtable-generic.c | 29 +++++++++++++++++++++++++++++
2 files changed, 33 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 8b0fc7fdc46f..525f1782b466 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -390,6 +390,7 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
return pmd;
}
#define pmdp_get_lockless pmdp_get_lockless
+#define pmdp_get_lockless_sync() tlb_remove_table_sync_one()
#endif /* CONFIG_PGTABLE_LEVELS > 2 */
#endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */

@@ -408,6 +409,9 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
{
return pmdp_get(pmdp);
}
+static inline void pmdp_get_lockless_sync(void)
+{
+}
#endif

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 674671835631..5e85a625ab30 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -232,12 +232,41 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
#endif
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

+#if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \
+ (defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RCU))
+/*
+ * See the comment above ptep_get_lockless() in include/linux/pgtable.h:
+ * the barriers in pmdp_get_lockless() cannot guarantee that the value in
+ * pmd_high actually belongs with the value in pmd_low; but holding interrupts
+ * off blocks the TLB flush between present updates, which guarantees that a
+ * successful __pte_offset_map() points to a page from matched halves.
+ */
+static unsigned long pmdp_get_lockless_start(void)
+{
+ unsigned long irqflags;
+
+ local_irq_save(irqflags);
+ return irqflags;
+}
+static void pmdp_get_lockless_end(unsigned long irqflags)
+{
+ local_irq_restore(irqflags);
+}
+#else
+static unsigned long pmdp_get_lockless_start(void) { return 0; }
+static void pmdp_get_lockless_end(unsigned long irqflags) { }
+#endif
+
pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
{
+ unsigned long irqflags;
pmd_t pmdval;

rcu_read_lock();
+ irqflags = pmdp_get_lockless_start();
pmdval = pmdp_get_lockless(pmd);
+ pmdp_get_lockless_end(irqflags);
+
if (pmdvalp)
*pmdvalp = pmdval;
if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
--
2.35.3


2023-06-20 07:50:59

by Hugh Dickins

Subject: [PATCH v2 03/12] arm: adjust_pte() use pte_offset_map_nolock()

Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
in adjust_pte(): because it gives the not-locked ptl for precisely that
pte, which the caller can then safely lock; whereas pte_lockptr() is not
so tightly coupled, because it dereferences the pmd pointer again.

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/arm/mm/fault-armv.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/arm/mm/fault-armv.c b/arch/arm/mm/fault-armv.c
index ca5302b0b7ee..7cb125497976 100644
--- a/arch/arm/mm/fault-armv.c
+++ b/arch/arm/mm/fault-armv.c
@@ -117,11 +117,10 @@ static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
* must use the nested version. This also means we need to
* open-code the spin-locking.
*/
- pte = pte_offset_map(pmd, address);
+ pte = pte_offset_map_nolock(vma->vm_mm, pmd, address, &ptl);
if (!pte)
return 0;

- ptl = pte_lockptr(vma->vm_mm, pmd);
do_pte_lock(ptl);

ret = do_adjust_pte(vma, address, pfn, pte);
--
2.35.3


2023-06-20 07:52:51

by Hugh Dickins

Subject: [PATCH v2 04/12] powerpc: assert_pte_locked() use pte_offset_map_nolock()

Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
in assert_pte_locked(). BUG if pte_offset_map_nolock() fails: this is
stricter than the previous implementation, which skipped when pmd_none()
(with a comment on khugepaged collapse transitions): but wouldn't we want
to know, if an assert_pte_locked() caller can be racing such transitions?

This mod might cause new crashes: which either expose my ignorance, or
indicate issues to be fixed, or limit the usage of assert_pte_locked().

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/powerpc/mm/pgtable.c | 16 ++++++----------
1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index cb2dcdb18f8e..16b061af86d7 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -311,6 +311,8 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
+ pte_t *pte;
+ spinlock_t *ptl;

if (mm == &init_mm)
return;
@@ -321,16 +323,10 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
pud = pud_offset(p4d, addr);
BUG_ON(pud_none(*pud));
pmd = pmd_offset(pud, addr);
- /*
- * khugepaged to collapse normal pages to hugepage, first set
- * pmd to none to force page fault/gup to take mmap_lock. After
- * pmd is set to none, we do a pte_clear which does this assertion
- * so if we find pmd none, return.
- */
- if (pmd_none(*pmd))
- return;
- BUG_ON(!pmd_present(*pmd));
- assert_spin_locked(pte_lockptr(mm, pmd));
+ pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
+ BUG_ON(!pte);
+ assert_spin_locked(ptl);
+ pte_unmap(pte);
}
#endif /* CONFIG_DEBUG_VM */

--
2.35.3


2023-06-20 08:12:36

by Hugh Dickins

Subject: [PATCH v2 06/12] sparc: add pte_free_defer() for pte_t *pgtable_t

Add sparc-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon. This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

sparc32 supports pagetables sharing a page, but does not support THP;
sparc64 supports THP, but does not support pagetables sharing a page.
So the sparc-specific pte_free_defer() is as simple as the generic one,
except for converting between pte_t *pgtable_t and struct page *.

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/sparc/include/asm/pgalloc_64.h | 4 ++++
arch/sparc/mm/init_64.c | 16 ++++++++++++++++
2 files changed, 20 insertions(+)

diff --git a/arch/sparc/include/asm/pgalloc_64.h b/arch/sparc/include/asm/pgalloc_64.h
index 7b5561d17ab1..caa7632be4c2 100644
--- a/arch/sparc/include/asm/pgalloc_64.h
+++ b/arch/sparc/include/asm/pgalloc_64.h
@@ -65,6 +65,10 @@ pgtable_t pte_alloc_one(struct mm_struct *mm);
void pte_free_kernel(struct mm_struct *mm, pte_t *pte);
void pte_free(struct mm_struct *mm, pgtable_t ptepage);

+/* arch use pte_free_defer() implementation in arch/sparc/mm/init_64.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
#define pmd_populate_kernel(MM, PMD, PTE) pmd_set(MM, PMD, PTE)
#define pmd_populate(MM, PMD, PTE) pmd_set(MM, PMD, PTE)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 04f9db0c3111..0d7fd793924c 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2930,6 +2930,22 @@ void pgtable_free(void *table, bool is_page)
}

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static void pte_free_now(struct rcu_head *head)
+{
+ struct page *page;
+
+ page = container_of(head, struct page, rcu_head);
+ __pte_free((pgtable_t)page_address(page));
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+ struct page *page;
+
+ page = virt_to_page(pgtable);
+ call_rcu(&page->rcu_head, pte_free_now);
+}
+
void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd)
{
--
2.35.3


2023-06-20 08:14:41

by Hugh Dickins

Subject: [PATCH v2 12/12] mm: delete mmap_write_trylock() and vma_try_start_write()

mmap_write_trylock() and vma_try_start_write() were added just for
khugepaged, but now it has no use for them: delete.

Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/mm.h | 17 -----------------
include/linux/mmap_lock.h | 10 ----------
2 files changed, 27 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3c2e56980853..9b24f8fbf899 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -690,21 +690,6 @@ static inline void vma_start_write(struct vm_area_struct *vma)
up_write(&vma->vm_lock->lock);
}

-static inline bool vma_try_start_write(struct vm_area_struct *vma)
-{
- int mm_lock_seq;
-
- if (__is_vma_write_locked(vma, &mm_lock_seq))
- return true;
-
- if (!down_write_trylock(&vma->vm_lock->lock))
- return false;
-
- vma->vm_lock_seq = mm_lock_seq;
- up_write(&vma->vm_lock->lock);
- return true;
-}
-
static inline void vma_assert_write_locked(struct vm_area_struct *vma)
{
int mm_lock_seq;
@@ -730,8 +715,6 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
{ return false; }
static inline void vma_end_read(struct vm_area_struct *vma) {}
static inline void vma_start_write(struct vm_area_struct *vma) {}
-static inline bool vma_try_start_write(struct vm_area_struct *vma)
- { return true; }
static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
static inline void vma_mark_detached(struct vm_area_struct *vma,
bool detached) {}
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index aab8f1b28d26..d1191f02c7fa 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -112,16 +112,6 @@ static inline int mmap_write_lock_killable(struct mm_struct *mm)
return ret;
}

-static inline bool mmap_write_trylock(struct mm_struct *mm)
-{
- bool ret;
-
- __mmap_lock_trace_start_locking(mm, true);
- ret = down_write_trylock(&mm->mmap_lock) != 0;
- __mmap_lock_trace_acquire_returned(mm, true, ret);
- return ret;
-}
-
static inline void mmap_write_unlock(struct mm_struct *mm)
{
__mmap_lock_trace_released(mm, true);
--
2.35.3


2023-06-20 08:14:55

by Hugh Dickins

Subject: [PATCH v2 05/12] powerpc: add pte_free_defer() for pgtables sharing page

Add powerpc-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon. This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This is awkward because the struct page contains only one rcu_head, but
that page may be shared between PTE_FRAG_NR pagetables, each wanting to
use the rcu_head at the same time: account concurrent deferrals with a
heightened refcount, only the first making use of the rcu_head, but
re-deferring if more deferrals arrived during its grace period.
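
To make that arithmetic concrete, a single-threaded userspace sketch of the
scheme (plain ints in place of atomic_t, a flag in place of the queued
rcu_head; pte_free_defer_sim() and rcu_callback() are illustrative stand-ins,
not the patch's code):

#include <stdio.h>

#define PTE_FREE_DEFERRED 0x10000	/* beyond any PTE_FRAG_NR */

static int refcount;	/* models page->pt_frag_refcount: +1 per live fragment */
static int rcu_queued;	/* models whether page->rcu_head is queued */

static void pte_free_defer_sim(void)
{
	refcount += PTE_FREE_DEFERRED - 1;	/* this fragment's 1 -> DEFERRED */
	if (refcount < 2 * PTE_FREE_DEFERRED) {
		rcu_queued = 1;		/* only the first deferral queues rcu_head */
		printf("queued rcu_head, refcount=%#x\n", (unsigned)refcount);
	} else {
		printf("rcu_head busy, refcount=%#x records the extra deferral\n",
		       (unsigned)refcount);
	}
}

static void rcu_callback(void)		/* a grace period has elapsed */
{
	rcu_queued = 0;
	refcount -= PTE_FREE_DEFERRED - 1;	/* this deferral back to a 1 */
	if (refcount < PTE_FREE_DEFERRED) {
		if (--refcount == 0)	/* pte_fragment_free() of this fragment */
			printf("page freed\n");
		return;
	}
	/* more deferrals arrived meanwhile: fold them all into one DEFERRED */
	while (refcount >= 2 * PTE_FREE_DEFERRED)
		refcount -= PTE_FREE_DEFERRED;
	refcount--;			/* fold this fragment's 1 in as well */
	rcu_queued = 1;			/* re-defer for one more grace period */
	printf("re-deferred, refcount=%#x\n", (unsigned)refcount);
}

int main(void)
{
	refcount = 2;		/* two fragments of the page still in use */
	pte_free_defer_sim();	/* one CPU defers its fragment */
	pte_free_defer_sim();	/* another defers the other: rcu_head busy */
	rcu_callback();		/* folds the late deferral, requeues */
	rcu_callback();		/* second grace period: page freed */
	return 0;
}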

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/powerpc/include/asm/pgalloc.h | 4 +++
arch/powerpc/mm/pgtable-frag.c | 51 ++++++++++++++++++++++++++++++
2 files changed, 55 insertions(+)

diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h
index 3360cad78ace..3a971e2a8c73 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -45,6 +45,10 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
pte_fragment_free((unsigned long *)ptepage, 0);
}

+/* arch use pte_free_defer() implementation in arch/powerpc/mm/pgtable-frag.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
/*
* Functions that deal with pagetables that could be at any level of
* the table need to be passed an "index_size" so they know how to
diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index 20652daa1d7e..e4f58c5fc2ac 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -120,3 +120,54 @@ void pte_fragment_free(unsigned long *table, int kernel)
__free_page(page);
}
}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define PTE_FREE_DEFERRED 0x10000 /* beyond any PTE_FRAG_NR */
+
+static void pte_free_now(struct rcu_head *head)
+{
+ struct page *page;
+ int refcount;
+
+ page = container_of(head, struct page, rcu_head);
+ refcount = atomic_sub_return(PTE_FREE_DEFERRED - 1,
+ &page->pt_frag_refcount);
+ if (refcount < PTE_FREE_DEFERRED) {
+ pte_fragment_free((unsigned long *)page_address(page), 0);
+ return;
+ }
+ /*
+ * One page may be shared between PTE_FRAG_NR pagetables.
+ * At least one more call to pte_free_defer() came in while we
+ * were already deferring, so the free must be deferred again;
+ * but just for one grace period, however many calls came in.
+ */
+ while (refcount >= PTE_FREE_DEFERRED + PTE_FREE_DEFERRED) {
+ refcount = atomic_sub_return(PTE_FREE_DEFERRED,
+ &page->pt_frag_refcount);
+ }
+ /* Remove that refcount of 1 left for fragment freeing above */
+ atomic_dec(&page->pt_frag_refcount);
+ call_rcu(&page->rcu_head, pte_free_now);
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+ struct page *page;
+
+ page = virt_to_page(pgtable);
+ /*
+ * One page may be shared between PTE_FRAG_NR pagetables: only queue
+ * it once for freeing, but note whenever the free must be deferred.
+ *
+ * (This would be much simpler if the struct page had an rcu_head for
+ * each fragment, or if we could allocate a separate array for that.)
+ *
+ * Convert our refcount of 1 to a refcount of PTE_FREE_DEFERRED, and
+ * proceed to call_rcu() only when the rcu_head is not already in use.
+ */
+ if (atomic_add_return(PTE_FREE_DEFERRED - 1, &page->pt_frag_refcount) <
+ PTE_FREE_DEFERRED + PTE_FREE_DEFERRED)
+ call_rcu(&page->rcu_head, pte_free_now);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
--
2.35.3


2023-06-20 08:24:13

by Hugh Dickins

Subject: [PATCH v2 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock

Simplify shmem and file THP collapse's retract_page_tables(), and relax
its locking: to improve its success rate and to lessen impact on others.

Instead of its MADV_COLLAPSE case doing set_huge_pmd() at target_addr of
target_mm, leave that part of the work to madvise_collapse() calling
collapse_pte_mapped_thp() afterwards: just adjust collapse_file()'s
result code to arrange for that. That spares retract_page_tables() four
arguments; and since it will be successful in retracting all of the page
tables expected of it, no need to track and return a result code itself.

It needs i_mmap_lock_read(mapping) for traversing the vma interval tree,
but it does not need i_mmap_lock_write() for that: page_vma_mapped_walk()
allows for pte_offset_map_lock() etc to fail, and uses pmd_lock() for
THPs. retract_page_tables() just needs to use those same spinlocks to
exclude it briefly, while transitioning pmd from page table to none: so
restore its use of pmd_lock() inside of which pte lock is nested.

Users of pte_offset_map_lock() etc all now allow for them to fail:
so retract_page_tables() now has no use for mmap_write_trylock() or
vma_try_start_write(). In common with rmap and page_vma_mapped_walk(),
it does not even need the mmap_read_lock().

But those users do expect the page table to remain a good page table,
until they unlock and rcu_read_unlock(): so the page table cannot be
freed immediately, but rather by the recently added pte_free_defer().

Use the (usually a no-op) pmdp_get_lockless_sync() to send an interrupt
when PAE, and pmdp_collapse_flush() did not already do so: to make sure
that the start,pmdp_get_lockless(),end sequence in __pte_offset_map()
cannot pick up a pmd entry with mismatched pmd_low and pmd_high.

retract_page_tables() can be enhanced to replace_page_tables(), which
inserts the final huge pmd without mmap lock: going through an invalid
state instead of pmd_none() followed by fault. But that enhancement
does raise some more questions: leave it until a later release.
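
Reduced to a userspace sketch, the locking shape is: take the pmd lock, nest
the pte lock under it only when it is a distinct lock, clear the pmd while
both are held, then defer the actual freeing. Here pthread mutexes stand in
for the kernel spinlocks and the final free() stands in for pte_free_defer();
retract() and struct fake_pagetable are illustrative only:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct fake_pagetable { int entries[512]; };

static pthread_mutex_t pml = PTHREAD_MUTEX_INITIALIZER;		/* pmd_lock() */
static pthread_mutex_t split_ptl = PTHREAD_MUTEX_INITIALIZER;	/* per-pt lock */
static struct fake_pagetable *pmd;	/* "pmd entry": points to the page table */

static void retract(int split_ptlocks)
{
	pthread_mutex_t *ptl = split_ptlocks ? &split_ptl : &pml;
	struct fake_pagetable *pgtable;

	pthread_mutex_lock(&pml);
	if (ptl != &pml)		/* nest pte lock under pmd lock */
		pthread_mutex_lock(ptl);

	pgtable = pmd;			/* pmdp_collapse_flush(): pmd -> none */
	pmd = NULL;

	if (ptl != &pml)
		pthread_mutex_unlock(ptl);
	pthread_mutex_unlock(&pml);

	/* the kernel frees only after a grace period, via pte_free_defer(),
	 * so lockless walkers still see a page table rather than junk */
	free(pgtable);
}

int main(void)
{
	pmd = calloc(1, sizeof(*pmd));
	retract(1);
	printf("pmd is now %p\n", (void *)pmd);
	return 0;
}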

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/khugepaged.c | 184 ++++++++++++++++++++----------------------------
1 file changed, 75 insertions(+), 109 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1083f0e38a07..f7a0f7673127 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1617,9 +1617,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
break;
case SCAN_PMD_NONE:
/*
- * In MADV_COLLAPSE path, possible race with khugepaged where
- * all pte entries have been removed and pmd cleared. If so,
- * skip all the pte checks and just update the pmd mapping.
+ * All pte entries have been removed and pmd cleared.
+ * Skip all the pte checks and just update the pmd mapping.
*/
goto maybe_install_pmd;
default:
@@ -1748,123 +1747,88 @@ static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_sl
mmap_write_unlock(mm);
}

-static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
- struct mm_struct *target_mm,
- unsigned long target_addr, struct page *hpage,
- struct collapse_control *cc)
+static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
{
struct vm_area_struct *vma;
- int target_result = SCAN_FAIL;

- i_mmap_lock_write(mapping);
+ i_mmap_lock_read(mapping);
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
- int result = SCAN_FAIL;
- struct mm_struct *mm = NULL;
- unsigned long addr = 0;
- pmd_t *pmd;
- bool is_target = false;
+ struct mmu_notifier_range range;
+ struct mm_struct *mm;
+ unsigned long addr;
+ pmd_t *pmd, pgt_pmd;
+ spinlock_t *pml;
+ spinlock_t *ptl;
+ bool skipped_uffd = false;

/*
* Check vma->anon_vma to exclude MAP_PRIVATE mappings that
- * got written to. These VMAs are likely not worth investing
- * mmap_write_lock(mm) as PMD-mapping is likely to be split
- * later.
- *
- * Note that vma->anon_vma check is racy: it can be set up after
- * the check but before we took mmap_lock by the fault path.
- * But page lock would prevent establishing any new ptes of the
- * page, so we are safe.
- *
- * An alternative would be drop the check, but check that page
- * table is clear before calling pmdp_collapse_flush() under
- * ptl. It has higher chance to recover THP for the VMA, but
- * has higher cost too. It would also probably require locking
- * the anon_vma.
+ * got written to. These VMAs are likely not worth removing
+ * page tables from, as PMD-mapping is likely to be split later.
*/
- if (READ_ONCE(vma->anon_vma)) {
- result = SCAN_PAGE_ANON;
- goto next;
- }
+ if (READ_ONCE(vma->anon_vma))
+ continue;
+
addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
if (addr & ~HPAGE_PMD_MASK ||
- vma->vm_end < addr + HPAGE_PMD_SIZE) {
- result = SCAN_VMA_CHECK;
- goto next;
- }
- mm = vma->vm_mm;
- is_target = mm == target_mm && addr == target_addr;
- result = find_pmd_or_thp_or_none(mm, addr, &pmd);
- if (result != SCAN_SUCCEED)
- goto next;
- /*
- * We need exclusive mmap_lock to retract page table.
- *
- * We use trylock due to lock inversion: we need to acquire
- * mmap_lock while holding page lock. Fault path does it in
- * reverse order. Trylock is a way to avoid deadlock.
- *
- * Also, it's not MADV_COLLAPSE's job to collapse other
- * mappings - let khugepaged take care of them later.
- */
- result = SCAN_PTE_MAPPED_HUGEPAGE;
- if ((cc->is_khugepaged || is_target) &&
- mmap_write_trylock(mm)) {
- /* trylock for the same lock inversion as above */
- if (!vma_try_start_write(vma))
- goto unlock_next;
-
- /*
- * Re-check whether we have an ->anon_vma, because
- * collapse_and_free_pmd() requires that either no
- * ->anon_vma exists or the anon_vma is locked.
- * We already checked ->anon_vma above, but that check
- * is racy because ->anon_vma can be populated under the
- * mmap lock in read mode.
- */
- if (vma->anon_vma) {
- result = SCAN_PAGE_ANON;
- goto unlock_next;
- }
- /*
- * When a vma is registered with uffd-wp, we can't
- * recycle the pmd pgtable because there can be pte
- * markers installed. Skip it only, so the rest mm/vma
- * can still have the same file mapped hugely, however
- * it'll always mapped in small page size for uffd-wp
- * registered ranges.
- */
- if (hpage_collapse_test_exit(mm)) {
- result = SCAN_ANY_PROCESS;
- goto unlock_next;
- }
- if (userfaultfd_wp(vma)) {
- result = SCAN_PTE_UFFD_WP;
- goto unlock_next;
- }
- collapse_and_free_pmd(mm, vma, addr, pmd);
- if (!cc->is_khugepaged && is_target)
- result = set_huge_pmd(vma, addr, pmd, hpage);
- else
- result = SCAN_SUCCEED;
-
-unlock_next:
- mmap_write_unlock(mm);
- goto next;
- }
- /*
- * Calling context will handle target mm/addr. Otherwise, let
- * khugepaged try again later.
- */
- if (!is_target) {
- khugepaged_add_pte_mapped_thp(mm, addr);
+ vma->vm_end < addr + HPAGE_PMD_SIZE)
continue;
+
+ mm = vma->vm_mm;
+ if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
+ continue;
+
+ if (hpage_collapse_test_exit(mm))
+ continue;
+ /*
+ * When a vma is registered with uffd-wp, we cannot recycle
+ * the page table because there may be pte markers installed.
+ * Other vmas can still have the same file mapped hugely, but
+ * skip this one: it will always be mapped in small page size
+ * for uffd-wp registered ranges.
+ */
+ if (userfaultfd_wp(vma))
+ continue;
+
+ /* PTEs were notified when unmapped; but now for the PMD? */
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
+ addr, addr + HPAGE_PMD_SIZE);
+ mmu_notifier_invalidate_range_start(&range);
+
+ pml = pmd_lock(mm, pmd);
+ ptl = pte_lockptr(mm, pmd);
+ if (ptl != pml)
+ spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+
+ /*
+ * Huge page lock is still held, so normally the page table
+ * must remain empty; and we have already skipped anon_vma
+ * and userfaultfd_wp() vmas. But since the mmap_lock is not
+ * held, it is still possible for a racing userfaultfd_ioctl()
+ * to have inserted ptes or markers. Now that we hold ptlock,
+ * repeating the anon_vma check protects from one category,
+ * and repeating the userfaultfd_wp() check from another.
+ */
+ if (unlikely(vma->anon_vma || userfaultfd_wp(vma))) {
+ skipped_uffd = true;
+ } else {
+ pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
+ pmdp_get_lockless_sync();
+ }
+
+ if (ptl != pml)
+ spin_unlock(ptl);
+ spin_unlock(pml);
+
+ mmu_notifier_invalidate_range_end(&range);
+
+ if (!skipped_uffd) {
+ mm_dec_nr_ptes(mm);
+ page_table_check_pte_clear_range(mm, addr, pgt_pmd);
+ pte_free_defer(mm, pmd_pgtable(pgt_pmd));
}
-next:
- if (is_target)
- target_result = result;
}
- i_mmap_unlock_write(mapping);
- return target_result;
+ i_mmap_unlock_read(mapping);
}

/**
@@ -2261,9 +2225,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,

/*
* Remove pte page tables, so we can re-fault the page as huge.
+ * If MADV_COLLAPSE, adjust result to call collapse_pte_mapped_thp().
*/
- result = retract_page_tables(mapping, start, mm, addr, hpage,
- cc);
+ retract_page_tables(mapping, start);
+ if (cc && !cc->is_khugepaged)
+ result = SCAN_PTE_MAPPED_HUGEPAGE;
unlock_page(hpage);

/*
--
2.35.3


2023-06-20 08:30:03

by Hugh Dickins

Subject: [PATCH v2 10/12] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()

Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
It does need mmap_read_lock(), but it does not need mmap_write_lock(),
nor vma_start_write() nor i_mmap lock nor anon_vma lock. All racing
paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.

Follow the pattern in retract_page_tables(); and using pte_free_defer()
removes most of the need for tlb_remove_table_sync_one() here; but call
pmdp_get_lockless_sync() to use it in the PAE case.

First check the VMA, in case page tables are being torn down: from JannH.
Confirm the preliminary find_pmd_or_thp_or_none() once page lock has been
acquired and the page looks suitable: from then on its state is stable.

However, collapse_pte_mapped_thp() was doing something others don't:
freeing a page table still containing "valid" entries. i_mmap lock did
stop a racing truncate from double-freeing those pages, but we prefer
collapse_pte_mapped_thp() to clear the entries as usual. Their TLB
flush can wait until the pmdp_collapse_flush() which follows, but the
mmu_notifier_invalidate_range_start() has to be done earlier.

Do the "step 1" checking loop without mmu_notifier: it wouldn't be good
for khugepaged to keep on repeatedly invalidating a range which is then
found unsuitable e.g. contains COWs. "step 2", which does the clearing,
must then be more careful (after dropping ptl to do mmu_notifier), with
abort prepared to correct the accounting like "step 3". But with those
entries now cleared, "step 4" (after dropping ptl to do pmd_lock) is kept
safe by the huge page lock, which stops new PTEs from being faulted in.
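
A reduced single-threaded sketch of that step 1 / step 2 shape, with a racing
change injected by hand at the point where ptl is dropped for the
mmu_notifier (ptes[], hpage_refs, scan_only() and clear_all() are
illustrative models, not the kernel code):

#include <stdbool.h>
#include <stdio.h>

#define NR 8
static int ptes[NR];		/* fake page table: entry i should map subpage i */
static int hpage_refs = 100;	/* models the huge page's refcount */

static bool scan_only(void)	/* step 1: check only, entries left intact */
{
	for (int i = 0; i < NR; i++)
		if (ptes[i] != i)
			return false;
	return true;
}

static int clear_all(void)	/* step 2: re-check each entry before clearing */
{
	int nr_ptes = 0;

	for (int i = 0; i < NR; i++) {
		if (ptes[i] != i)
			goto abort;	/* COWed/swapped while unlocked */
		ptes[i] = -1;		/* pte_clear(): cleared entries just refault */
		nr_ptes++;
	}
	hpage_refs -= nr_ptes;		/* step 3 accounting on success */
	return nr_ptes;
abort:
	/* like the kernel's abort: flush TLB (not modelled) and still
	 * account for the entries already cleared, rather than undo them */
	hpage_refs -= nr_ptes;
	return -1;
}

int main(void)
{
	int ret;

	for (int i = 0; i < NR; i++)
		ptes[i] = i;

	if (!scan_only())
		return 1;
	ptes[3] = 1000;		/* simulate a change while ptl was dropped */

	ret = clear_all();
	printf("step 2: %d, hpage_refs now %d\n", ret, hpage_refs);
	return 0;
}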

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/khugepaged.c | 172 ++++++++++++++++++++++--------------------------
1 file changed, 77 insertions(+), 95 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index f7a0f7673127..060ac8789a1e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1485,7 +1485,7 @@ static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
return ret;
}

-/* hpage must be locked, and mmap_lock must be held in write */
+/* hpage must be locked, and mmap_lock must be held */
static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmdp, struct page *hpage)
{
@@ -1497,7 +1497,7 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
};

VM_BUG_ON(!PageTransHuge(hpage));
- mmap_assert_write_locked(vma->vm_mm);
+ mmap_assert_locked(vma->vm_mm);

if (do_set_pmd(&vmf, hpage))
return SCAN_FAIL;
@@ -1506,48 +1506,6 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
return SCAN_SUCCEED;
}

-/*
- * A note about locking:
- * Trying to take the page table spinlocks would be useless here because those
- * are only used to synchronize:
- *
- * - modifying terminal entries (ones that point to a data page, not to another
- * page table)
- * - installing *new* non-terminal entries
- *
- * Instead, we need roughly the same kind of protection as free_pgtables() or
- * mm_take_all_locks() (but only for a single VMA):
- * The mmap lock together with this VMA's rmap locks covers all paths towards
- * the page table entries we're messing with here, except for hardware page
- * table walks and lockless_pages_from_mm().
- */
-static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long addr, pmd_t *pmdp)
-{
- pmd_t pmd;
- struct mmu_notifier_range range;
-
- mmap_assert_write_locked(mm);
- if (vma->vm_file)
- lockdep_assert_held_write(&vma->vm_file->f_mapping->i_mmap_rwsem);
- /*
- * All anon_vmas attached to the VMA have the same root and are
- * therefore locked by the same lock.
- */
- if (vma->anon_vma)
- lockdep_assert_held_write(&vma->anon_vma->root->rwsem);
-
- mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr,
- addr + HPAGE_PMD_SIZE);
- mmu_notifier_invalidate_range_start(&range);
- pmd = pmdp_collapse_flush(vma, addr, pmdp);
- tlb_remove_table_sync_one();
- mmu_notifier_invalidate_range_end(&range);
- mm_dec_nr_ptes(mm);
- page_table_check_pte_clear_range(mm, addr, pmd);
- pte_free(mm, pmd_pgtable(pmd));
-}
-
/**
* collapse_pte_mapped_thp - Try to collapse a pte-mapped THP for mm at
* address haddr.
@@ -1563,26 +1521,29 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
bool install_pmd)
{
+ struct mmu_notifier_range range;
+ bool notified = false;
unsigned long haddr = addr & HPAGE_PMD_MASK;
struct vm_area_struct *vma = vma_lookup(mm, haddr);
struct page *hpage;
pte_t *start_pte, *pte;
- pmd_t *pmd;
- spinlock_t *ptl;
- int count = 0, result = SCAN_FAIL;
+ pmd_t *pmd, pgt_pmd;
+ spinlock_t *pml, *ptl;
+ int nr_ptes = 0, result = SCAN_FAIL;
int i;

- mmap_assert_write_locked(mm);
+ mmap_assert_locked(mm);
+
+ /* First check VMA found, in case page tables are being torn down */
+ if (!vma || !vma->vm_file ||
+ !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
+ return SCAN_VMA_CHECK;

/* Fast check before locking page if already PMD-mapped */
result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
if (result == SCAN_PMD_MAPPED)
return result;

- if (!vma || !vma->vm_file ||
- !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
- return SCAN_VMA_CHECK;
-
/*
* If we are here, we've succeeded in replacing all the native pages
* in the page cache with a single hugepage. If a mm were to fault-in
@@ -1612,6 +1573,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
goto drop_hpage;
}

+ result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
switch (result) {
case SCAN_SUCCEED:
break;
@@ -1625,27 +1587,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
goto drop_hpage;
}

- /* Lock the vma before taking i_mmap and page table locks */
- vma_start_write(vma);
-
- /*
- * We need to lock the mapping so that from here on, only GUP-fast and
- * hardware page walks can access the parts of the page tables that
- * we're operating on.
- * See collapse_and_free_pmd().
- */
- i_mmap_lock_write(vma->vm_file->f_mapping);
-
- /*
- * This spinlock should be unnecessary: Nobody else should be accessing
- * the page tables under spinlock protection here, only
- * lockless_pages_from_mm() and the hardware page walker can access page
- * tables while all the high-level locks are held in write mode.
- */
result = SCAN_FAIL;
start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
- if (!start_pte)
- goto drop_immap;
+ if (!start_pte) /* mmap_lock + page lock should prevent this */
+ goto drop_hpage;

/* step 1: check all mapped PTEs are to the right huge page */
for (i = 0, addr = haddr, pte = start_pte;
@@ -1671,57 +1616,94 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
*/
if (hpage + i != page)
goto abort;
- count++;
}

- /* step 2: adjust rmap */
+ pte_unmap_unlock(start_pte, ptl);
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
+ haddr, haddr + HPAGE_PMD_SIZE);
+ mmu_notifier_invalidate_range_start(&range);
+ notified = true;
+ start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
+ if (!start_pte) /* mmap_lock + page lock should prevent this */
+ goto abort;
+
+ /* step 2: clear page table and adjust rmap */
for (i = 0, addr = haddr, pte = start_pte;
i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
struct page *page;

if (pte_none(*pte))
continue;
- page = vm_normal_page(vma, addr, *pte);
- if (WARN_ON_ONCE(page && is_zone_device_page(page)))
+ /*
+ * We dropped ptl after the first scan, to do the mmu_notifier:
+ * page lock stops more PTEs of the hpage being faulted in, but
+ * does not stop write faults COWing anon copies from existing
+ * PTEs; and does not stop those being swapped out or migrated.
+ */
+ if (!pte_present(*pte)) {
+ result = SCAN_PTE_NON_PRESENT;
goto abort;
+ }
+ page = vm_normal_page(vma, addr, *pte);
+ if (hpage + i != page)
+ goto abort;
+
+ /*
+ * Must clear entry, or a racing truncate may re-remove it.
+ * TLB flush can be left until pmdp_collapse_flush() does it.
+ * PTE dirty? Shmem page is already dirty; file is read-only.
+ */
+ pte_clear(mm, addr, pte);
page_remove_rmap(page, vma, false);
+ nr_ptes++;
}

pte_unmap_unlock(start_pte, ptl);

/* step 3: set proper refcount and mm_counters. */
- if (count) {
- page_ref_sub(hpage, count);
- add_mm_counter(vma->vm_mm, mm_counter_file(hpage), -count);
+ if (nr_ptes) {
+ page_ref_sub(hpage, nr_ptes);
+ add_mm_counter(mm, mm_counter_file(hpage), -nr_ptes);
}

- /* step 4: remove pte entries */
- /* we make no change to anon, but protect concurrent anon page lookup */
- if (vma->anon_vma)
- anon_vma_lock_write(vma->anon_vma);
+ /* step 4: remove page table */

- collapse_and_free_pmd(mm, vma, haddr, pmd);
+ /* Huge page lock is still held, so page table must remain empty */
+ pml = pmd_lock(mm, pmd);
+ if (ptl != pml)
+ spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+ pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
+ pmdp_get_lockless_sync();
+ if (ptl != pml)
+ spin_unlock(ptl);
+ spin_unlock(pml);

- if (vma->anon_vma)
- anon_vma_unlock_write(vma->anon_vma);
- i_mmap_unlock_write(vma->vm_file->f_mapping);
+ mmu_notifier_invalidate_range_end(&range);
+
+ mm_dec_nr_ptes(mm);
+ page_table_check_pte_clear_range(mm, haddr, pgt_pmd);
+ pte_free_defer(mm, pmd_pgtable(pgt_pmd));

maybe_install_pmd:
/* step 5: install pmd entry */
result = install_pmd
? set_huge_pmd(vma, haddr, pmd, hpage)
: SCAN_SUCCEED;
-
+ goto drop_hpage;
+abort:
+ if (nr_ptes) {
+ flush_tlb_mm(mm);
+ page_ref_sub(hpage, nr_ptes);
+ add_mm_counter(mm, mm_counter_file(hpage), -nr_ptes);
+ }
+ if (start_pte)
+ pte_unmap_unlock(start_pte, ptl);
+ if (notified)
+ mmu_notifier_invalidate_range_end(&range);
drop_hpage:
unlock_page(hpage);
put_page(hpage);
return result;
-
-abort:
- pte_unmap_unlock(start_pte, ptl);
-drop_immap:
- i_mmap_unlock_write(vma->vm_file->f_mapping);
- goto drop_hpage;
}

static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_slot)
@@ -2857,9 +2839,9 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
case SCAN_PTE_MAPPED_HUGEPAGE:
BUG_ON(mmap_locked);
BUG_ON(*prev);
- mmap_write_lock(mm);
+ mmap_read_lock(mm);
result = collapse_pte_mapped_thp(mm, addr, true);
- mmap_write_unlock(mm);
+ mmap_locked = true;
goto handle_result;
/* Whitelisted set of results where continuing OK */
case SCAN_PMD_NULL:
--
2.35.3


2023-06-20 08:34:59

by Hugh Dickins

Subject: [PATCH v2 08/12] mm/pgtable: add pte_free_defer() for pgtable as page

Add the generic pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon. This version
suits all those architectures which use an unfragmented page for one page
table (none of whose pte_free()s use the mm arg which was passed to it).
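
The shape is the familiar one: embed the callback head in the object, and
recover the object with container_of() in the callback. A self-contained
userspace sketch of just that shape (fake_call_rcu(), run_deferred() and
struct fake_page are stand-ins for call_rcu(), the grace period and
struct page):

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct cb_head {
	void (*func)(struct cb_head *head);
	struct cb_head *next;
};

struct fake_page {
	unsigned long flags;
	struct cb_head rcu_head;	/* overlays other fields in the kernel */
};

static struct cb_head *pending;		/* stand-in for the RCU callback queue */

static void fake_call_rcu(struct cb_head *head, void (*func)(struct cb_head *))
{
	head->func = func;
	head->next = pending;
	pending = head;
}

static void run_deferred(void)		/* stand-in for "grace period elapsed" */
{
	while (pending) {
		struct cb_head *head = pending;

		pending = head->next;
		head->func(head);
	}
}

static void pte_free_now(struct cb_head *head)
{
	struct fake_page *page = container_of(head, struct fake_page, rcu_head);

	printf("freeing page table page %p\n", (void *)page);
	free(page);
}

int main(void)
{
	struct fake_page *page = calloc(1, sizeof(*page));

	fake_call_rcu(&page->rcu_head, pte_free_now);	/* pte_free_defer() */
	run_deferred();
	return 0;
}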

Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/mm_types.h | 4 ++++
include/linux/pgtable.h | 2 ++
mm/pgtable-generic.c | 20 ++++++++++++++++++++
3 files changed, 26 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 1667a1bdb8a8..09335fa28c41 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -144,6 +144,10 @@ struct page {
struct { /* Page table pages */
unsigned long _pt_pad_1; /* compound_head */
pgtable_t pmd_huge_pte; /* protected by page->ptl */
+ /*
+ * A PTE page table page might be freed by use of
+ * rcu_head: which overlays those two fields above.
+ */
unsigned long _pt_pad_2; /* mapping */
union {
struct mm_struct *pt_mm; /* x86 pgd, s390 */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 525f1782b466..d18d3e963967 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -112,6 +112,8 @@ static inline void pte_unmap(pte_t *pte)
}
#endif

+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
/* Find an entry in the second-level page table.. */
#ifndef pmd_offset
static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 5e85a625ab30..ab3741064bb8 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -13,6 +13,7 @@
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/mm_inline.h>
+#include <asm/pgalloc.h>
#include <asm/tlb.h>

/*
@@ -230,6 +231,25 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
return pmd;
}
#endif
+
+/* arch define pte_free_defer in asm/pgalloc.h for its own implementation */
+#ifndef pte_free_defer
+static void pte_free_now(struct rcu_head *head)
+{
+ struct page *page;
+
+ page = container_of(head, struct page, rcu_head);
+ pte_free(NULL /* mm not passed and not used */, (pgtable_t)page);
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+ struct page *page;
+
+ page = pgtable;
+ call_rcu(&page->rcu_head, pte_free_now);
+}
+#endif /* pte_free_defer */
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

#if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \
--
2.35.3


2023-06-20 08:35:33

by Hugh Dickins

Subject: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

Add s390-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon. This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This version is more complicated than others: because s390 fits two 2K
page tables into one 4K page (so page->rcu_head must be shared between
both halves), and already uses page->lru (which page->rcu_head overlays)
to list any free halves; with clever management by page->_refcount bits.

Build upon the existing management, adjusted to follow a new rule: that
a page is not linked to mm_context_t::pgtable_list while either half is
pending free, by either tlb_remove_table() or pte_free_defer(); but is
afterwards either relinked to the list (if other half is allocated), or
freed (if other half is free): by __tlb_remove_table() in both cases.

This rule ensures that page->lru is no longer in use while page->rcu_head
may be needed for use by pte_free_defer(). And a fortuitous byproduct of
following this rule is that page_table_free() no longer needs its curious
two-step manipulation of _refcount - read commit c2c224932fd0 ("s390/mm:
fix 2KB pgtable release race") for what to think of there. But it does
not solve the problem that two halves may need rcu_head at the same time.

For that, add HHead bits between s390's AAllocated and PPending bits in
the upper byte of page->_refcount: then the second pte_free_defer() can
see that rcu_head is already in use, and the RCU callee pte_free_half()
can see that it needs to make a further call_rcu() for that other half.

page_table_alloc() sets the page->pt_mm field, so __tlb_remove_table()
knows where to link the freed half while its other half is allocated.
But linking to the list needs mm->context.lock: and although AA bit set
guarantees that pt_mm must still be valid, it does not guarantee that mm
is still valid an instant later: so acquiring mm->context.lock would not
be safe. For now, use a static global mm_pgtable_list_lock instead:
then a soon-to-follow commit will split it per-mm as before (probably by
using a SLAB_TYPESAFE_BY_RCU structure for the list head and its lock);
and update the commentary on the pgtable_list.
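
As a reading aid for the _refcount manipulations which follow, a small
standalone decoder of that upper-byte layout (decode() is purely
illustrative, not part of the patch):

#include <stdio.h>

static void decode(unsigned int refcount)
{
	unsigned int mask = (refcount >> 24) & 0xffU;

	printf("upper byte %#04x:%s%s%s%s%s%s\n", mask,
	       mask & 0x01U ? " lower-allocated" : "",
	       mask & 0x02U ? " upper-allocated" : "",
	       mask & 0x04U ? " lower-pending-rcu" : "",
	       mask & 0x08U ? " upper-pending-rcu" : "",
	       mask & 0x10U ? " lower-pending-any" : "",
	       mask & 0x20U ? " upper-pending-any" : "");
}

int main(void)
{
	decode(0x03U << 24);	/* both halves allocated: page off the list */
	decode(0x14U << 24);	/* lower half pending free via rcu_head, upper free */
	decode(0x16U << 24);	/* upper allocated, lower pending free via rcu_head */
	return 0;
}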

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/s390/include/asm/pgalloc.h | 4 +
arch/s390/mm/pgalloc.c | 205 +++++++++++++++++++++++---------
include/linux/mm_types.h | 2 +-
3 files changed, 154 insertions(+), 57 deletions(-)

diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
index 17eb618f1348..89a9d5ef94f8 100644
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -143,6 +143,10 @@ static inline void pmd_populate(struct mm_struct *mm,
#define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
#define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)

+/* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
void vmem_map_init(void);
void *vmem_crst_alloc(unsigned long val);
pte_t *vmem_pte_alloc(void);
diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index 66ab68db9842..11983a3ff95a 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -159,6 +159,11 @@ void page_table_free_pgste(struct page *page)

#endif /* CONFIG_PGSTE */

+/*
+ * Temporarily use a global spinlock instead of mm->context.lock.
+ * This will be replaced by a per-mm spinlock in a followup commit.
+ */
+static DEFINE_SPINLOCK(mm_pgtable_list_lock);
/*
* A 2KB-pgtable is either upper or lower half of a normal page.
* The second half of the page may be unused or used as another
@@ -172,7 +177,7 @@ void page_table_free_pgste(struct page *page)
* When a parent page gets fully allocated it contains 2KB-pgtables in both
* upper and lower halves and is removed from mm_context_t::pgtable_list.
*
- * When 2KB-pgtable is freed from to fully allocated parent page that
+ * When 2KB-pgtable is freed from the fully allocated parent page that
* page turns partially allocated and added to mm_context_t::pgtable_list.
*
* If 2KB-pgtable is freed from the partially allocated parent page that
@@ -182,16 +187,24 @@ void page_table_free_pgste(struct page *page)
* As follows from the above, no unallocated or fully allocated parent
* pages are contained in mm_context_t::pgtable_list.
*
+ * NOTE NOTE NOTE: The commentary above and below has not yet been updated:
+ * the new rule is that a page is not linked to mm_context_t::pgtable_list
+ * while either half is pending free by any method; but afterwards is
+ * either relinked to it, or freed, by __tlb_remove_table(). This allows
+ * pte_free_defer() to use the page->rcu_head (which overlays page->lru).
+ *
* The upper byte (bits 24-31) of the parent page _refcount is used
* for tracking contained 2KB-pgtables and has the following format:
*
- * PP AA
- * 01234567 upper byte (bits 24-31) of struct page::_refcount
- * || ||
- * || |+--- upper 2KB-pgtable is allocated
- * || +---- lower 2KB-pgtable is allocated
- * |+------- upper 2KB-pgtable is pending for removal
- * +-------- lower 2KB-pgtable is pending for removal
+ * PPHHAA
+ * 76543210 upper byte (bits 24-31) of struct page::_refcount
+ * ||||||
+ * |||||+--- lower 2KB-pgtable is allocated
+ * ||||+---- upper 2KB-pgtable is allocated
+ * |||+----- lower 2KB-pgtable is pending free by page->rcu_head
+ * ||+------ upper 2KB-pgtable is pending free by page->rcu_head
+ * |+------- lower 2KB-pgtable is pending free by any method
+ * +-------- upper 2KB-pgtable is pending free by any method
*
* (See commit 620b4e903179 ("s390: use _refcount for pgtables") on why
* using _refcount is possible).
@@ -200,7 +213,7 @@ void page_table_free_pgste(struct page *page)
* The parent page is either:
* - added to mm_context_t::pgtable_list in case the second half of the
* parent page is still unallocated;
- * - removed from mm_context_t::pgtable_list in case both hales of the
+ * - removed from mm_context_t::pgtable_list in case both halves of the
* parent page are allocated;
* These operations are protected with mm_context_t::lock.
*
@@ -239,32 +252,22 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
/* Try to get a fragment of a 4K page as a 2K page table */
if (!mm_alloc_pgste(mm)) {
table = NULL;
- spin_lock_bh(&mm->context.lock);
+ spin_lock_bh(&mm_pgtable_list_lock);
if (!list_empty(&mm->context.pgtable_list)) {
page = list_first_entry(&mm->context.pgtable_list,
struct page, lru);
mask = atomic_read(&page->_refcount) >> 24;
- /*
- * The pending removal bits must also be checked.
- * Failure to do so might lead to an impossible
- * value of (i.e 0x13 or 0x23) written to _refcount.
- * Such values violate the assumption that pending and
- * allocation bits are mutually exclusive, and the rest
- * of the code unrails as result. That could lead to
- * a whole bunch of races and corruptions.
- */
- mask = (mask | (mask >> 4)) & 0x03U;
- if (mask != 0x03U) {
- table = (unsigned long *) page_to_virt(page);
- bit = mask & 1; /* =1 -> second 2K */
- if (bit)
- table += PTRS_PER_PTE;
- atomic_xor_bits(&page->_refcount,
- 0x01U << (bit + 24));
- list_del(&page->lru);
- }
+ /* Cannot be on this list if either half pending free */
+ WARN_ON_ONCE(mask & ~0x03U);
+ /* One or other half must be available, but not both */
+ WARN_ON_ONCE(mask == 0x00U || mask == 0x03U);
+ table = (unsigned long *)page_to_virt(page);
+ bit = mask & 0x01U; /* =1 -> second 2K available */
+ table += bit * PTRS_PER_PTE;
+ atomic_xor_bits(&page->_refcount, 0x01U << (bit + 24));
+ list_del(&page->lru);
}
- spin_unlock_bh(&mm->context.lock);
+ spin_unlock_bh(&mm_pgtable_list_lock);
if (table)
return table;
}
@@ -278,6 +281,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
}
arch_set_page_dat(page, 0);
/* Initialize page table */
+ page->pt_mm = mm;
table = (unsigned long *) page_to_virt(page);
if (mm_alloc_pgste(mm)) {
/* Return 4K page table with PGSTEs */
@@ -288,14 +292,14 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
/* Return the first 2K fragment of the page */
atomic_xor_bits(&page->_refcount, 0x01U << 24);
memset64((u64 *)table, _PAGE_INVALID, 2 * PTRS_PER_PTE);
- spin_lock_bh(&mm->context.lock);
+ spin_lock_bh(&mm_pgtable_list_lock);
list_add(&page->lru, &mm->context.pgtable_list);
- spin_unlock_bh(&mm->context.lock);
+ spin_unlock_bh(&mm_pgtable_list_lock);
}
return table;
}

-static void page_table_release_check(struct page *page, void *table,
+static void page_table_release_check(struct page *page, unsigned long *table,
unsigned int half, unsigned int mask)
{
char msg[128];
@@ -317,21 +321,18 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
if (!mm_alloc_pgste(mm)) {
/* Free 2K page table fragment of a 4K page */
bit = ((unsigned long) table & ~PAGE_MASK)/(PTRS_PER_PTE*sizeof(pte_t));
- spin_lock_bh(&mm->context.lock);
+ spin_lock_bh(&mm_pgtable_list_lock);
/*
- * Mark the page for delayed release. The actual release
- * will happen outside of the critical section from this
- * function or from __tlb_remove_table()
+ * Mark the page for release. The actual release will happen
+ * below from this function, or later from __tlb_remove_table().
*/
- mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
+ mask = atomic_xor_bits(&page->_refcount, 0x01U << (bit + 24));
mask >>= 24;
- if (mask & 0x03U)
+ if (mask & 0x03U) /* other half is allocated */
list_add(&page->lru, &mm->context.pgtable_list);
- else
+ else if (!(mask & 0x30U)) /* other half not pending */
list_del(&page->lru);
- spin_unlock_bh(&mm->context.lock);
- mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
- mask >>= 24;
+ spin_unlock_bh(&mm_pgtable_list_lock);
if (mask != 0x00U)
return;
half = 0x01U << bit;
@@ -362,19 +363,17 @@ void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
return;
}
bit = ((unsigned long) table & ~PAGE_MASK) / (PTRS_PER_PTE*sizeof(pte_t));
- spin_lock_bh(&mm->context.lock);
+ spin_lock_bh(&mm_pgtable_list_lock);
/*
- * Mark the page for delayed release. The actual release will happen
- * outside of the critical section from __tlb_remove_table() or from
- * page_table_free()
+ * Mark the page for delayed release.
+ * The actual release will happen later, from __tlb_remove_table().
*/
mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
mask >>= 24;
- if (mask & 0x03U)
- list_add_tail(&page->lru, &mm->context.pgtable_list);
- else
+ /* Other half not allocated? Other half not already pending free? */
+ if ((mask & 0x03U) == 0x00U && (mask & 0x30U) != 0x30U)
list_del(&page->lru);
- spin_unlock_bh(&mm->context.lock);
+ spin_unlock_bh(&mm_pgtable_list_lock);
table = (unsigned long *) ((unsigned long) table | (0x01U << bit));
tlb_remove_table(tlb, table);
}
@@ -382,17 +381,40 @@ void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
void __tlb_remove_table(void *_table)
{
unsigned int mask = (unsigned long) _table & 0x03U, half = mask;
- void *table = (void *)((unsigned long) _table ^ mask);
+ unsigned long *table = (unsigned long *)((unsigned long) _table ^ mask);
struct page *page = virt_to_page(table);

switch (half) {
case 0x00U: /* pmd, pud, or p4d */
- free_pages((unsigned long)table, CRST_ALLOC_ORDER);
+ __free_pages(page, CRST_ALLOC_ORDER);
return;
case 0x01U: /* lower 2K of a 4K page table */
- case 0x02U: /* higher 2K of a 4K page table */
- mask = atomic_xor_bits(&page->_refcount, mask << (4 + 24));
- mask >>= 24;
+ case 0x02U: /* upper 2K of a 4K page table */
+ /*
+ * If the other half is marked as allocated, page->pt_mm must
+ * still be valid, page->rcu_head no longer in use so page->lru
+ * good for use, so now make the freed half available for reuse.
+ * But be wary of races with that other half being freed.
+ */
+ if (atomic_read(&page->_refcount) & (0x03U << 24)) {
+ struct mm_struct *mm = page->pt_mm;
+ /*
+ * It is safe to use page->pt_mm when the other half
+ * is seen allocated while holding pgtable_list lock;
+ * but how will it be safe to acquire that spinlock?
+ * Global mm_pgtable_list_lock is safe and easy for
+ * now, then a followup commit will split it per-mm.
+ */
+ spin_lock_bh(&mm_pgtable_list_lock);
+ mask = atomic_xor_bits(&page->_refcount, mask << 28);
+ mask >>= 24;
+ if (mask & 0x03U)
+ list_add(&page->lru, &mm->context.pgtable_list);
+ spin_unlock_bh(&mm_pgtable_list_lock);
+ } else {
+ mask = atomic_xor_bits(&page->_refcount, mask << 28);
+ mask >>= 24;
+ }
if (mask != 0x00U)
return;
break;
@@ -407,6 +429,77 @@ void __tlb_remove_table(void *_table)
__free_page(page);
}

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static void pte_free_now0(struct rcu_head *head);
+static void pte_free_now1(struct rcu_head *head);
+
+static void pte_free_pgste(struct rcu_head *head)
+{
+ unsigned long *table;
+ struct page *page;
+
+ page = container_of(head, struct page, rcu_head);
+ table = (unsigned long *)page_to_virt(page);
+ table = (unsigned long *)((unsigned long)table | 0x03U);
+ __tlb_remove_table(table);
+}
+
+static void pte_free_half(struct rcu_head *head, unsigned int bit)
+{
+ unsigned long *table;
+ struct page *page;
+ unsigned int mask;
+
+ page = container_of(head, struct page, rcu_head);
+ mask = atomic_xor_bits(&page->_refcount, 0x04U << (bit + 24));
+
+ table = (unsigned long *)page_to_virt(page);
+ table += bit * PTRS_PER_PTE;
+ table = (unsigned long *)((unsigned long)table | (0x01U << bit));
+ __tlb_remove_table(table);
+
+ /* If pte_free_defer() of the other half came in, queue it now */
+ if (mask & 0x0CU)
+ call_rcu(&page->rcu_head, bit ? pte_free_now0 : pte_free_now1);
+}
+
+static void pte_free_now0(struct rcu_head *head)
+{
+ pte_free_half(head, 0);
+}
+
+static void pte_free_now1(struct rcu_head *head)
+{
+ pte_free_half(head, 1);
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+ unsigned int bit, mask;
+ struct page *page;
+
+ page = virt_to_page(pgtable);
+ if (mm_alloc_pgste(mm)) {
+ call_rcu(&page->rcu_head, pte_free_pgste);
+ return;
+ }
+ bit = ((unsigned long)pgtable & ~PAGE_MASK) /
+ (PTRS_PER_PTE * sizeof(pte_t));
+
+ spin_lock_bh(&mm_pgtable_list_lock);
+ mask = atomic_xor_bits(&page->_refcount, 0x15U << (bit + 24));
+ mask >>= 24;
+ /* Other half not allocated? Other half not already pending free? */
+ if ((mask & 0x03U) == 0x00U && (mask & 0x30U) != 0x30U)
+ list_del(&page->lru);
+ spin_unlock_bh(&mm_pgtable_list_lock);
+
+ /* Do not relink on rcu_head if other half already linked on rcu_head */
+ if ((mask & 0x0CU) != 0x0CU)
+ call_rcu(&page->rcu_head, bit ? pte_free_now1 : pte_free_now0);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
/*
* Base infrastructure required to generate basic asces, region, segment,
* and page tables that do not make use of enhanced features like EDAT1.
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 306a3d1a0fa6..1667a1bdb8a8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -146,7 +146,7 @@ struct page {
pgtable_t pmd_huge_pte; /* protected by page->ptl */
unsigned long _pt_pad_2; /* mapping */
union {
- struct mm_struct *pt_mm; /* x86 pgds only */
+ struct mm_struct *pt_mm; /* x86 pgd, s390 */
atomic_t pt_frag_refcount; /* powerpc */
};
#if ALLOC_SPLIT_PTLOCKS
--
2.35.3
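
To help trace the HH-bit arbitration added above, here is a minimal standalone
sketch (plain userspace C, written for illustration only: an ordinary integer
stands in for the upper byte of page->_refcount, the list and lock handling is
omitted, and all function names are invented). It mimics the 0x15U and 0x04U
xor masks used by pte_free_defer() and pte_free_half() in the patch:

#include <stdio.h>

static unsigned int refbyte = 0x03U;	/* both AA bits set: both 2K halves allocated */

static unsigned int xor_bits(unsigned int bits)
{
	refbyte ^= bits;	/* stands in for atomic_xor_bits(), which returns the new value */
	return refbyte;
}

/* what pte_free_defer() does to the bits for one 2K half */
static void defer_half(unsigned int bit)
{
	unsigned int mask = xor_bits(0x15U << bit);	/* toggle A, H and P for this half */

	if ((mask & 0x0CU) != 0x0CU)
		printf("half %u: links page->rcu_head via call_rcu()\n", bit);
	else
		printf("half %u: rcu_head already in use, leaves its H bit set\n", bit);
}

/* what pte_free_half() does when the RCU callback for one half runs */
static void rcu_callback_half(unsigned int bit)
{
	unsigned int mask = xor_bits(0x04U << bit);	/* clear this half's H bit */

	printf("half %u freed; ", bit);
	if (mask & 0x0CU)
		printf("other half still pending, re-queues its callback\n");
	else
		printf("no other half pending on rcu_head\n");
}

int main(void)
{
	defer_half(0);		/* first deferral wins the rcu_head */
	defer_half(1);		/* second deferral only marks its H bit */
	rcu_callback_half(0);	/* after a grace period: frees half 0, re-queues half 1 */
	rcu_callback_half(1);	/* after another grace period: frees half 1 */
	return 0;
}

Running it shows the sequence the patch relies on: only the first
pte_free_defer() links page->rcu_head, and the RCU callback for that half
re-queues the other half if its H bit was set in the meantime.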


2023-06-20 08:55:49

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH v2 11/12] mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps()

Now that retract_page_tables() can retract page tables reliably, without
depending on trylocks, delete all the apparatus for khugepaged to try
again later: khugepaged_collapse_pte_mapped_thps() etc; and free up the
per-mm memory which was set aside for that in the khugepaged_mm_slot.

But one part of that is worth keeping: when hpage_collapse_scan_file()
found SCAN_PTE_MAPPED_HUGEPAGE, that address was noted in the mm_slot
to be tried for retraction later - catching, for example, page tables
where a reversible mprotect() of a portion had required splitting the
pmd, but now it can be recollapsed. Call collapse_pte_mapped_thp()
directly in this case (why was it deferred before? I assume an issue
with needing mmap_lock for write, but now it's only needed for read).

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/khugepaged.c | 125 +++++++-----------------------------------------
1 file changed, 16 insertions(+), 109 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 060ac8789a1e..06c659e6a89e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -92,8 +92,6 @@ static __read_mostly DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);

static struct kmem_cache *mm_slot_cache __read_mostly;

-#define MAX_PTE_MAPPED_THP 8
-
struct collapse_control {
bool is_khugepaged;

@@ -107,15 +105,9 @@ struct collapse_control {
/**
* struct khugepaged_mm_slot - khugepaged information per mm that is being scanned
* @slot: hash lookup from mm to mm_slot
- * @nr_pte_mapped_thp: number of pte mapped THP
- * @pte_mapped_thp: address array corresponding pte mapped THP
*/
struct khugepaged_mm_slot {
struct mm_slot slot;
-
- /* pte-mapped THP in this mm */
- int nr_pte_mapped_thp;
- unsigned long pte_mapped_thp[MAX_PTE_MAPPED_THP];
};

/**
@@ -1441,50 +1433,6 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
}

#ifdef CONFIG_SHMEM
-/*
- * Notify khugepaged that given addr of the mm is pte-mapped THP. Then
- * khugepaged should try to collapse the page table.
- *
- * Note that following race exists:
- * (1) khugepaged calls khugepaged_collapse_pte_mapped_thps() for mm_struct A,
- * emptying the A's ->pte_mapped_thp[] array.
- * (2) MADV_COLLAPSE collapses some file extent with target mm_struct B, and
- * retract_page_tables() finds a VMA in mm_struct A mapping the same extent
- * (at virtual address X) and adds an entry (for X) into mm_struct A's
- * ->pte-mapped_thp[] array.
- * (3) khugepaged calls khugepaged_collapse_scan_file() for mm_struct A at X,
- * sees a pte-mapped THP (SCAN_PTE_MAPPED_HUGEPAGE) and adds an entry
- * (for X) into mm_struct A's ->pte-mapped_thp[] array.
- * Thus, it's possible the same address is added multiple times for the same
- * mm_struct. Should this happen, we'll simply attempt
- * collapse_pte_mapped_thp() multiple times for the same address, under the same
- * exclusive mmap_lock, and assuming the first call is successful, subsequent
- * attempts will return quickly (without grabbing any additional locks) when
- * a huge pmd is found in find_pmd_or_thp_or_none(). Since this is a cheap
- * check, and since this is a rare occurrence, the cost of preventing this
- * "multiple-add" is thought to be more expensive than just handling it, should
- * it occur.
- */
-static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
- unsigned long addr)
-{
- struct khugepaged_mm_slot *mm_slot;
- struct mm_slot *slot;
- bool ret = false;
-
- VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
-
- spin_lock(&khugepaged_mm_lock);
- slot = mm_slot_lookup(mm_slots_hash, mm);
- mm_slot = mm_slot_entry(slot, struct khugepaged_mm_slot, slot);
- if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP)) {
- mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr;
- ret = true;
- }
- spin_unlock(&khugepaged_mm_lock);
- return ret;
-}
-
/* hpage must be locked, and mmap_lock must be held */
static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmdp, struct page *hpage)
@@ -1706,29 +1654,6 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
return result;
}

-static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_slot)
-{
- struct mm_slot *slot = &mm_slot->slot;
- struct mm_struct *mm = slot->mm;
- int i;
-
- if (likely(mm_slot->nr_pte_mapped_thp == 0))
- return;
-
- if (!mmap_write_trylock(mm))
- return;
-
- if (unlikely(hpage_collapse_test_exit(mm)))
- goto out;
-
- for (i = 0; i < mm_slot->nr_pte_mapped_thp; i++)
- collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i], false);
-
-out:
- mm_slot->nr_pte_mapped_thp = 0;
- mmap_write_unlock(mm);
-}
-
static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
{
struct vm_area_struct *vma;
@@ -2372,16 +2297,6 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
{
BUILD_BUG();
}
-
-static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_slot)
-{
-}
-
-static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
- unsigned long addr)
-{
- return false;
-}
#endif

static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
@@ -2411,7 +2326,6 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
khugepaged_scan.mm_slot = mm_slot;
}
spin_unlock(&khugepaged_mm_lock);
- khugepaged_collapse_pte_mapped_thps(mm_slot);

mm = slot->mm;
/*
@@ -2464,36 +2378,29 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
khugepaged_scan.address);

mmap_read_unlock(mm);
- *result = hpage_collapse_scan_file(mm,
- khugepaged_scan.address,
- file, pgoff, cc);
mmap_locked = false;
+ *result = hpage_collapse_scan_file(mm,
+ khugepaged_scan.address, file, pgoff, cc);
+ if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
+ mmap_read_lock(mm);
+ mmap_locked = true;
+ if (hpage_collapse_test_exit(mm)) {
+ fput(file);
+ goto breakouterloop;
+ }
+ *result = collapse_pte_mapped_thp(mm,
+ khugepaged_scan.address, false);
+ if (*result == SCAN_PMD_MAPPED)
+ *result = SCAN_SUCCEED;
+ }
fput(file);
} else {
*result = hpage_collapse_scan_pmd(mm, vma,
- khugepaged_scan.address,
- &mmap_locked,
- cc);
+ khugepaged_scan.address, &mmap_locked, cc);
}
- switch (*result) {
- case SCAN_PTE_MAPPED_HUGEPAGE: {
- pmd_t *pmd;

- *result = find_pmd_or_thp_or_none(mm,
- khugepaged_scan.address,
- &pmd);
- if (*result != SCAN_SUCCEED)
- break;
- if (!khugepaged_add_pte_mapped_thp(mm,
- khugepaged_scan.address))
- break;
- } fallthrough;
- case SCAN_SUCCEED:
+ if (*result == SCAN_SUCCEED)
++khugepaged_pages_collapsed;
- break;
- default:
- break;
- }

/* move to next address */
khugepaged_scan.address += HPAGE_PMD_SIZE;
--
2.35.3


2023-06-20 12:14:28

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 05/12] powerpc: add pte_free_defer() for pgtables sharing page

On Tue, Jun 20, 2023 at 12:47:54AM -0700, Hugh Dickins wrote:
> Add powerpc-specific pte_free_defer(), to call pte_free() via call_rcu().
> pte_free_defer() will be called inside khugepaged's retract_page_tables()
> loop, where allocating extra memory cannot be relied upon. This precedes
> the generic version to avoid build breakage from incompatible pgtable_t.
>
> This is awkward because the struct page contains only one rcu_head, but
> that page may be shared between PTE_FRAG_NR pagetables, each wanting to
> use the rcu_head at the same time: account concurrent deferrals with a
> heightened refcount, only the first making use of the rcu_head, but
> re-deferring if more deferrals arrived during its grace period.

You didn't answer my question why we can't just move the rcu to the
actual free page?

Since PPC doesn't recycle the frags, we don't need to carefully RCU
free each frag, we just need to RCU free the entire page when it
becomes eventually free?

Jason

2023-06-20 20:14:16

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v2 05/12] powerpc: add pte_free_defer() for pgtables sharing page

On Tue, 20 Jun 2023, Jason Gunthorpe wrote:
> On Tue, Jun 20, 2023 at 12:47:54AM -0700, Hugh Dickins wrote:
> > Add powerpc-specific pte_free_defer(), to call pte_free() via call_rcu().
> > pte_free_defer() will be called inside khugepaged's retract_page_tables()
> > loop, where allocating extra memory cannot be relied upon. This precedes
> > the generic version to avoid build breakage from incompatible pgtable_t.
> >
> > This is awkward because the struct page contains only one rcu_head, but
> > that page may be shared between PTE_FRAG_NR pagetables, each wanting to
> > use the rcu_head at the same time: account concurrent deferrals with a
> > heightened refcount, only the first making use of the rcu_head, but
> > re-deferring if more deferrals arrived during its grace period.
>
> You didn't answer my question why we can't just move the rcu to the
> actual free page?

I thought that I had answered it, perhaps not to your satisfaction:

https://lore.kernel.org/linux-mm/[email protected]/

My conclusion then was:
Not very good reasons: good enough, or can you supply a better patch?

Hugh

>
> Since PPC doesn't recycle the frags, we don't need to carefully RCU
> free each frag, we just need to RCU free the entire page when it
> becomes eventually free?
>
> Jason

2023-06-21 00:15:19

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 05/12] powerpc: add pte_free_defer() for pgtables sharing page

On Tue, Jun 20, 2023 at 12:54:25PM -0700, Hugh Dickins wrote:
> On Tue, 20 Jun 2023, Jason Gunthorpe wrote:
> > On Tue, Jun 20, 2023 at 12:47:54AM -0700, Hugh Dickins wrote:
> > > Add powerpc-specific pte_free_defer(), to call pte_free() via call_rcu().
> > > pte_free_defer() will be called inside khugepaged's retract_page_tables()
> > > loop, where allocating extra memory cannot be relied upon. This precedes
> > > the generic version to avoid build breakage from incompatible pgtable_t.
> > >
> > > This is awkward because the struct page contains only one rcu_head, but
> > > that page may be shared between PTE_FRAG_NR pagetables, each wanting to
> > > use the rcu_head at the same time: account concurrent deferrals with a
> > > heightened refcount, only the first making use of the rcu_head, but
> > > re-deferring if more deferrals arrived during its grace period.
> >
> > You didn't answer my question why we can't just move the rcu to the
> > actual free page?
>
> I thought that I had answered it, perhaps not to your satisfaction:
>
> https://lore.kernel.org/linux-mm/[email protected]/
>
> My conclusion then was:
> Not very good reasons: good enough, or can you supply a better patch?

Oh, I guess I didn't read that email as answering the question..

I was saying to make pte_fragment_free() unconditionally do the
RCU. It is the only thing that uses the page->rcu_head, and it means
PPC would double RCU the final free on the TLB path, but that is
probably OK for now. This means pte_free_defer() won't do anything
special on PPC as PPC will always RCU free these things; this addresses
the defer concern too, I think. Overall it is easier to reason about.

I looked at fixing the TLB stuff to avoid the double rcu but quickly
got scared that ppc was using a kmem_cache to allocate other page
table sizes so there is not a reliable struct page to get a rcu_head
from. This looks like the main challenge for ppc... We'd have to teach
the tlb code to not do its own RCU stuff for table levels that the
arch is already RCU freeing - and that won't get us to full RCU
freeing on PPC.

Anyhow, this is a full version of what I was thinking:

diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index 20652daa1d7e3a..b5dcd0f27fc115 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -106,6 +106,21 @@ pte_t *pte_fragment_alloc(struct mm_struct *mm, int kernel)
return __alloc_for_ptecache(mm, kernel);
}

+static void pgtable_free_cb(struct rcu_head *head)
+{
+ struct page *page = container_of(head, struct page, rcu_head);
+
+ pgtable_pte_page_dtor(page);
+ __free_page(page);
+}
+
+static void pgtable_free_cb_kernel(struct rcu_head *head)
+{
+ struct page *page = container_of(head, struct page, rcu_head);
+
+ __free_page(page);
+}
+
void pte_fragment_free(unsigned long *table, int kernel)
{
struct page *page = virt_to_page(table);
@@ -115,8 +130,13 @@ void pte_fragment_free(unsigned long *table, int kernel)

BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
if (atomic_dec_and_test(&page->pt_frag_refcount)) {
+ /*
+ * Always RCU free pagetable memory. rcu_head overlaps with lru
+ * which is no longer in use by the time the table is freed.
+ */
if (!kernel)
- pgtable_pte_page_dtor(page);
- __free_page(page);
+ call_rcu(&page->rcu_head, pgtable_free_cb);
+ else
+ call_rcu(&page->rcu_head, pgtable_free_cb_kernel);
}
}

2023-06-22 03:08:46

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v2 05/12] powerpc: add pte_free_defer() for pgtables sharing page

On Tue, 20 Jun 2023, Jason Gunthorpe wrote:
> On Tue, Jun 20, 2023 at 12:54:25PM -0700, Hugh Dickins wrote:
> > On Tue, 20 Jun 2023, Jason Gunthorpe wrote:
> > > On Tue, Jun 20, 2023 at 12:47:54AM -0700, Hugh Dickins wrote:
> > > > Add powerpc-specific pte_free_defer(), to call pte_free() via call_rcu().
> > > > pte_free_defer() will be called inside khugepaged's retract_page_tables()
> > > > loop, where allocating extra memory cannot be relied upon. This precedes
> > > > the generic version to avoid build breakage from incompatible pgtable_t.
> > > >
> > > > This is awkward because the struct page contains only one rcu_head, but
> > > > that page may be shared between PTE_FRAG_NR pagetables, each wanting to
> > > > use the rcu_head at the same time: account concurrent deferrals with a
> > > > heightened refcount, only the first making use of the rcu_head, but
> > > > re-deferring if more deferrals arrived during its grace period.
> > >
> > > You didn't answer my question why we can't just move the rcu to the
> > > actual free page?
> >
> > I thought that I had answered it, perhaps not to your satisfaction:
> >
> > https://lore.kernel.org/linux-mm/[email protected]/
> >
> > My conclusion then was:
> > Not very good reasons: good enough, or can you supply a better patch?
>
> Oh, I guess I didn't read that email as answering the question..
>
> I was saying to make pte_fragment_free() unconditionally do the
> RCU. It is the only thing that uses the page->rcu_head, and it means
> PPC would double RCU the final free on the TLB path, but that is
> probably OK for now. This means pte_free_defer() won't do anything
> special on PPC as PPC will always RCU free these things; this addresses
> the defer concern too, I think. Overall it is easier to reason about.
>
> I looked at fixing the TLB stuff to avoid the double rcu but quickly
> got scared that ppc was using a kmem_cache to allocate other page
> table sizes so there is not a reliable struct page to get a rcu_head
> from. This looks like the main challenge for ppc... We'd have to teach
> the tlb code to not do its own RCU stuff for table levels that the
> arch is already RCU freeing - and that won't get us to full RCU
> freeing on PPC.

Sorry for being so dense all along: yes, your way is unquestionably
much better than mine. I guess I must have been obsessive about
keeping pte_free_defer()+pte_free_now() "on the outside", as they
were on x86, and never perceived how much easier it is with a small
tweak inside pte_fragment_free(); and never reconsidered it since.

But I'm not so keen on the double-RCU, extending this call_rcu() to
all the normal cases, while still leaving the TLB batching in place:
here is the replacement patch I'd prefer us to go forward with now.

Many thanks!

[PATCH v3 05/12] powerpc: add pte_free_defer() for pgtables sharing page

Add powerpc-specific pte_free_defer(), to free table page via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon. This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This is awkward because the struct page contains only one rcu_head, but
that page may be shared between PTE_FRAG_NR pagetables, each wanting to
use the rcu_head at the same time. But powerpc never reuses a fragment
once it has been freed: so mark the page Active in pte_free_defer(),
before calling pte_fragment_free() directly; and there call_rcu() to
pte_free_now() when last fragment is freed and the page is PageActive.

Suggested-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
---
arch/powerpc/include/asm/pgalloc.h | 4 ++++
arch/powerpc/mm/pgtable-frag.c | 29 ++++++++++++++++++++++++++---
2 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h
index 3360cad78ace..3a971e2a8c73 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -45,6 +45,10 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
pte_fragment_free((unsigned long *)ptepage, 0);
}

+/* arch use pte_free_defer() implementation in arch/powerpc/mm/pgtable-frag.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
/*
* Functions that deal with pagetables that could be at any level of
* the table need to be passed an "index_size" so they know how to
diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index 20652daa1d7e..0c6b68130025 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -106,6 +106,15 @@ pte_t *pte_fragment_alloc(struct mm_struct *mm, int kernel)
return __alloc_for_ptecache(mm, kernel);
}

+static void pte_free_now(struct rcu_head *head)
+{
+ struct page *page;
+
+ page = container_of(head, struct page, rcu_head);
+ pgtable_pte_page_dtor(page);
+ __free_page(page);
+}
+
void pte_fragment_free(unsigned long *table, int kernel)
{
struct page *page = virt_to_page(table);
@@ -115,8 +124,22 @@ void pte_fragment_free(unsigned long *table, int kernel)

BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
if (atomic_dec_and_test(&page->pt_frag_refcount)) {
- if (!kernel)
- pgtable_pte_page_dtor(page);
- __free_page(page);
+ if (kernel)
+ __free_page(page);
+ else if (TestClearPageActive(page))
+ call_rcu(&page->rcu_head, pte_free_now);
+ else
+ pte_free_now(&page->rcu_head);
}
}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+ struct page *page;
+
+ page = virt_to_page(pgtable);
+ SetPageActive(page);
+ pte_fragment_free((unsigned long *)pgtable, 0);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
--
2.35.3


2023-06-27 17:41:18

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 05/12] powerpc: add pte_free_defer() for pgtables sharing page

On Wed, Jun 21, 2023 at 07:36:11PM -0700, Hugh Dickins wrote:
> [PATCH v3 05/12] powerpc: add pte_free_defer() for pgtables sharing page
>
> Add powerpc-specific pte_free_defer(), to free table page via call_rcu().
> pte_free_defer() will be called inside khugepaged's retract_page_tables()
> loop, where allocating extra memory cannot be relied upon. This precedes
> the generic version to avoid build breakage from incompatible pgtable_t.
>
> This is awkward because the struct page contains only one rcu_head, but
> that page may be shared between PTE_FRAG_NR pagetables, each wanting to
> use the rcu_head at the same time. But powerpc never reuses a fragment
> once it has been freed: so mark the page Active in pte_free_defer(),
> before calling pte_fragment_free() directly; and there call_rcu() to
> pte_free_now() when last fragment is freed and the page is PageActive.
>
> Suggested-by: Jason Gunthorpe <[email protected]>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> arch/powerpc/include/asm/pgalloc.h | 4 ++++
> arch/powerpc/mm/pgtable-frag.c | 29 ++++++++++++++++++++++++++---
> 2 files changed, 30 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h
> index 3360cad78ace..3a971e2a8c73 100644
> --- a/arch/powerpc/include/asm/pgalloc.h
> +++ b/arch/powerpc/include/asm/pgalloc.h
> @@ -45,6 +45,10 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
> pte_fragment_free((unsigned long *)ptepage, 0);
> }
>
> +/* arch use pte_free_defer() implementation in arch/powerpc/mm/pgtable-frag.c */
> +#define pte_free_defer pte_free_defer
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
> +
> /*
> * Functions that deal with pagetables that could be at any level of
> * the table need to be passed an "index_size" so they know how to
> diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
> index 20652daa1d7e..0c6b68130025 100644
> --- a/arch/powerpc/mm/pgtable-frag.c
> +++ b/arch/powerpc/mm/pgtable-frag.c
> @@ -106,6 +106,15 @@ pte_t *pte_fragment_alloc(struct mm_struct *mm, int kernel)
> return __alloc_for_ptecache(mm, kernel);
> }
>
> +static void pte_free_now(struct rcu_head *head)
> +{
> + struct page *page;
> +
> + page = container_of(head, struct page, rcu_head);
> + pgtable_pte_page_dtor(page);
> + __free_page(page);
> +}
> +
> void pte_fragment_free(unsigned long *table, int kernel)
> {
> struct page *page = virt_to_page(table);
> @@ -115,8 +124,22 @@ void pte_fragment_free(unsigned long *table, int kernel)
>
> BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
> if (atomic_dec_and_test(&page->pt_frag_refcount)) {
> - if (!kernel)
> - pgtable_pte_page_dtor(page);
> - __free_page(page);
> + if (kernel)
> + __free_page(page);
> + else if (TestClearPageActive(page))
> + call_rcu(&page->rcu_head, pte_free_now);
> + else
> + pte_free_now(&page->rcu_head);
> }
> }
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> +{
> + struct page *page;
> +
> + page = virt_to_page(pgtable);
> + SetPageActive(page);
> + pte_fragment_free((unsigned long *)pgtable, 0);
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

Yes, this makes sense to me, very simple..

I always forget these details, but atomic_dec_and_test() is a release?
So the SetPageActive is guaranteed to be visible in another thread
that reaches 0?

Thanks,
Jason

2023-06-27 21:15:44

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v2 05/12] powerpc: add pte_free_defer() for pgtables sharing page

On Tue, 27 Jun 2023, Jason Gunthorpe wrote:
> On Wed, Jun 21, 2023 at 07:36:11PM -0700, Hugh Dickins wrote:
> > [PATCH v3 05/12] powerpc: add pte_free_defer() for pgtables sharing page
...
> Yes, this makes sense to me, very simple..
>
> I always forget these details, but atomic_dec_and_test() is a release?
> So the SetPageActive is guaranteed to be visible in another thread
> that reaches 0?

Yes, that's my understanding - so the TestClearPageActive adds more
to the guarantee than is actually needed.

"release": you speak the modern language, whereas I haven't advanced
from olden-days barriers: atomic_dec_and_test() meets atomic_t.txt's
- RMW operations that have a return value are fully ordered;
which a quick skim of present-day memory-barriers.txt suggests would
imply both ACQUIRE and RELEASE. Please correct me if I'm mistaken!

Hugh
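
For anyone wanting to see that ordering claim outside the kernel, here is a
small userspace analogy using C11 atomics (all names invented for the demo;
atomic_fetch_sub() with its default seq_cst ordering plays the role of the
fully ordered atomic_dec_and_test(), and a relaxed flag store plays the role
of SetPageActive()):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int frag_refcount = 2;	/* two "fragments" share one page */
static atomic_bool page_active;		/* stands in for PageActive */

static void put_fragment(const char *who, int set_active)
{
	if (set_active)		/* like SetPageActive() before the final put */
		atomic_store_explicit(&page_active, 1, memory_order_relaxed);
	/* fully ordered RMW: the store above is visible to whoever hits zero */
	if (atomic_fetch_sub(&frag_refcount, 1) == 1) {
		int active = atomic_load_explicit(&page_active,
						  memory_order_relaxed);
		printf("%s frees the page, sees active=%d\n", who, active);
	}
}

static void *other_thread(void *arg)
{
	(void)arg;
	put_fragment("other", 0);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, other_thread, NULL);
	put_fragment("main", 1);	/* sets the flag before its decrement */
	pthread_join(t, NULL);
	return 0;
}

Whichever thread sees the count reach zero is guaranteed to see active=1: the
earlier decrement acts as a release and the final one as an acquire on the
same atomic, which is the same property being relied on above for
SetPageActive() before pte_fragment_free().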

2023-06-28 19:27:48

by Gerald Schaefer

[permalink] [raw]
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

On Tue, 20 Jun 2023 00:51:19 -0700 (PDT)
Hugh Dickins <[email protected]> wrote:

> Add s390-specific pte_free_defer(), to call pte_free() via call_rcu().
> pte_free_defer() will be called inside khugepaged's retract_page_tables()
> loop, where allocating extra memory cannot be relied upon. This precedes
> the generic version to avoid build breakage from incompatible pgtable_t.
>
> This version is more complicated than others: because s390 fits two 2K
> page tables into one 4K page (so page->rcu_head must be shared between
> both halves), and already uses page->lru (which page->rcu_head overlays)
> to list any free halves; with clever management by page->_refcount bits.
>
> Build upon the existing management, adjusted to follow a new rule: that
> a page is not linked to mm_context_t::pgtable_list while either half is
> pending free, by either tlb_remove_table() or pte_free_defer(); but is
> afterwards either relinked to the list (if other half is allocated), or
> freed (if other half is free): by __tlb_remove_table() in both cases.
>
> This rule ensures that page->lru is no longer in use while page->rcu_head
> may be needed for use by pte_free_defer(). And a fortuitous byproduct of
> following this rule is that page_table_free() no longer needs its curious
> two-step manipulation of _refcount - read commit c2c224932fd0 ("s390/mm:
> fix 2KB pgtable release race") for what to think of there. But it does
> not solve the problem that two halves may need rcu_head at the same time.
>
> For that, add HHead bits between s390's AAllocated and PPending bits in
> the upper byte of page->_refcount: then the second pte_free_defer() can
> see that rcu_head is already in use, and the RCU callee pte_free_half()
> can see that it needs to make a further call_rcu() for that other half.
>
> page_table_alloc() set the page->pt_mm field, so __tlb_remove_table()
> knows where to link the freed half while its other half is allocated.
> But linking to the list needs mm->context.lock: and although AA bit set
> guarantees that pt_mm must still be valid, it does not guarantee that mm
> is still valid an instant later: so acquiring mm->context.lock would not
> be safe. For now, use a static global mm_pgtable_list_lock instead:
> then a soon-to-follow commit will split it per-mm as before (probably by
> using a SLAB_TYPESAFE_BY_RCU structure for the list head and its lock);
> and update the commentary on the pgtable_list.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> arch/s390/include/asm/pgalloc.h | 4 +
> arch/s390/mm/pgalloc.c | 205 +++++++++++++++++++++++---------
> include/linux/mm_types.h | 2 +-
> 3 files changed, 154 insertions(+), 57 deletions(-)

As discussed in the other thread, we would rather go with less complexity,
possibly switching to an approach w/o the list and fragment re-use in the
future. For now, as a first step in that direction, we can try with not
adding fragments back only for pte_free_defer(). Here is an adjusted
version of your patch, copying most of your pte_free_defer() logic and
also description, tested with LTP and all three of your patch series applied:

Add s390-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon. This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This version is more complicated than others: because s390 fits two 2K
page tables into one 4K page (so page->rcu_head must be shared between
both halves), and already uses page->lru (which page->rcu_head overlays)
to list any free halves; with clever management by page->_refcount bits.

Build upon the existing management, adjusted to follow a new rule: that
a page is never added back to the list in pte_free_defer(). It is only
removed from the list when currently listed, i.e. when the other fragment
is not allocated. This introduces some asymmetry compared to the other
page table freeing paths, and in particular a list_del() for such pages
must be avoided there. Use page->pt_frag_refcount to keep track of the
list status, and check that before doing list_del() in any freeing path.

Other paths would also not add back such pages to the list, if the other
fragment happens to be freed in such a path at the same time, because
they would observe cleared AA bits.

This rule ensures that page->lru is no longer in use while page->rcu_head
may be needed for use by pte_free_defer(). But it does not solve the problem
that two halves may need rcu_head at the same time.

For that, add HHead bits between s390's AAllocated and PPending bits in
the upper byte of page->_refcount: then the second pte_free_defer() can
see that rcu_head is already in use, and the RCU callee pte_free_half()
can see that it needs to make a further call_rcu() for that other half.

Not adding back unallocated fragments to the list in pte_free_defer()
can result in wasting some amount of memory for pagetables, depending
on how long the allocated fragment will stay in use. In practice, this
effect is expected to be insignificant, and not to justify a far more
complex approach, which might allow adding the fragments back later
in __tlb_remove_table(), where we might not have a stable mm any more.

Signed-off-by: Gerald Schaefer <[email protected]>
---
arch/s390/include/asm/pgalloc.h | 4 +
arch/s390/mm/pgalloc.c | 136 +++++++++++++++++++++++++++++++++++++---
2 files changed, 132 insertions(+), 8 deletions(-)

--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -143,6 +143,10 @@ static inline void pmd_populate(struct m
#define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
#define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)

+/* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
void vmem_map_init(void);
void *vmem_crst_alloc(unsigned long val);
pte_t *vmem_pte_alloc(void);
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -185,11 +185,13 @@ void page_table_free_pgste(struct page *
* The upper byte (bits 24-31) of the parent page _refcount is used
* for tracking contained 2KB-pgtables and has the following format:
*
- * PP AA
+ * PPHHAA
* 01234567 upper byte (bits 24-31) of struct page::_refcount
- * || ||
- * || |+--- upper 2KB-pgtable is allocated
- * || +---- lower 2KB-pgtable is allocated
+ * ||||||
+ * |||||+--- upper 2KB-pgtable is allocated
+ * ||||+---- lower 2KB-pgtable is allocated
+ * |||+----- upper 2KB-pgtable is pending free by page->rcu_head
+ * ||+------ lower 2KB-pgtable is pending free by page->rcu_head
* |+------- upper 2KB-pgtable is pending for removal
* +-------- lower 2KB-pgtable is pending for removal
*
@@ -229,6 +231,17 @@ void page_table_free_pgste(struct page *
* logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
* while the PP bits are never used, nor such a page is added to or removed
* from mm_context_t::pgtable_list.
+ *
+ * The HH bits are used to prevent double use of page->rcu_head in
+ * pte_free_defer(), when both 2K pagetables inside a page happen to get
+ * freed by that path at the same time.
+ *
+ * pte_free_defer() also cannot add 2K fragments back to the list, because
+ * page->rcu_head overlays with page->lru. This introduces some asymmetry
+ * compared to the other pagetable freeing paths, and the missing list_add()
+ * in pte_free_defer() could result in incorrect list_del(). Therefore, track
+ * the list status of a page with page->pt_frag_refcount, and check that
+ * before doing list_del() in any freeing path.
*/
unsigned long *page_table_alloc(struct mm_struct *mm)
{
@@ -262,6 +275,7 @@ unsigned long *page_table_alloc(struct m
atomic_xor_bits(&page->_refcount,
0x01U << (bit + 24));
list_del(&page->lru);
+ atomic_set(&page->pt_frag_refcount, 0);
}
}
spin_unlock_bh(&mm->context.lock);
@@ -290,6 +304,7 @@ unsigned long *page_table_alloc(struct m
memset64((u64 *)table, _PAGE_INVALID, 2 * PTRS_PER_PTE);
spin_lock_bh(&mm->context.lock);
list_add(&page->lru, &mm->context.pgtable_list);
+ atomic_set(&page->pt_frag_refcount, 1);
spin_unlock_bh(&mm->context.lock);
}
return table;
@@ -325,13 +340,24 @@ void page_table_free(struct mm_struct *m
*/
mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
mask >>= 24;
- if (mask & 0x03U)
+ if (mask & 0x03U) {
+ /*
+ * Other half is allocated, add to list
+ */
list_add(&page->lru, &mm->context.pgtable_list);
- else
+ atomic_set(&page->pt_frag_refcount, 1);
+ } else if (atomic_read(&page->pt_frag_refcount)) {
+ /*
+ * Other half is not allocated, and page is on the list,
+ * remove from list
+ */
list_del(&page->lru);
+ atomic_set(&page->pt_frag_refcount, 0);
+ }
spin_unlock_bh(&mm->context.lock);
mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
mask >>= 24;
+ /* Return if other half is allocated, or delayed release pending */
if (mask != 0x00U)
return;
half = 0x01U << bit;
@@ -370,10 +396,22 @@ void page_table_free_rcu(struct mmu_gath
*/
mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
mask >>= 24;
- if (mask & 0x03U)
+ if (mask & 0x03U) {
+ /*
+ * Other half is allocated, add to end of list, as this
+ * will not immediately be re-usable because it is marked
+ * for delayed release
+ */
list_add_tail(&page->lru, &mm->context.pgtable_list);
- else
+ atomic_set(&page->pt_frag_refcount, 1);
+ } else if (atomic_read(&page->pt_frag_refcount)) {
+ /*
+ * Other half is not allocated, and page is on the list,
+ * remove from list
+ */
list_del(&page->lru);
+ atomic_set(&page->pt_frag_refcount, 0);
+ }
spin_unlock_bh(&mm->context.lock);
table = (unsigned long *) ((unsigned long) table | (0x01U << bit));
tlb_remove_table(tlb, table);
@@ -407,6 +445,88 @@ void __tlb_remove_table(void *_table)
__free_page(page);
}

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static void pte_free_now0(struct rcu_head *head);
+static void pte_free_now1(struct rcu_head *head);
+
+static void pte_free_pgste(struct rcu_head *head)
+{
+ unsigned long *table;
+ struct page *page;
+
+ page = container_of(head, struct page, rcu_head);
+ table = (unsigned long *)page_to_virt(page);
+ table = (unsigned long *)((unsigned long)table | 0x03U);
+ __tlb_remove_table(table);
+}
+
+static void pte_free_half(struct rcu_head *head, unsigned int bit)
+{
+ unsigned long *table;
+ struct page *page;
+ unsigned int mask;
+
+ page = container_of(head, struct page, rcu_head);
+ mask = atomic_xor_bits(&page->_refcount, 0x04U << (bit + 24));
+
+ table = (unsigned long *)page_to_virt(page);
+ table += bit * PTRS_PER_PTE;
+ table = (unsigned long *)((unsigned long)table | (0x01U << bit));
+ __tlb_remove_table(table);
+
+ /* If pte_free_defer() of the other half came in, queue it now */
+ if (mask & 0x0CU)
+ call_rcu(&page->rcu_head, bit ? pte_free_now0 : pte_free_now1);
+}
+
+static void pte_free_now0(struct rcu_head *head)
+{
+ pte_free_half(head, 0);
+}
+
+static void pte_free_now1(struct rcu_head *head)
+{
+ pte_free_half(head, 1);
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+ unsigned int bit, mask;
+ struct page *page;
+
+ page = virt_to_page(pgtable);
+ if (mm_alloc_pgste(mm)) {
+ /*
+ * TODO: Do we need gmap_unlink(mm, pgtable, addr), like in
+ * page_table_free_rcu()?
+ * If yes -> need addr parameter here, like in pte_free_tlb().
+ */
+ call_rcu(&page->rcu_head, pte_free_pgste);
+ return;
+ }
+ bit = ((unsigned long)pgtable & ~PAGE_MASK) / (PTRS_PER_PTE * sizeof(pte_t));
+
+ spin_lock_bh(&mm->context.lock);
+ mask = atomic_xor_bits(&page->_refcount, 0x15U << (bit + 24));
+ mask >>= 24;
+ if ((mask & 0x03U) == 0x00U && atomic_read(&page->pt_frag_refcount)) {
+ /*
+ * Other half is not allocated, page is on the list,
+ * remove from list
+ */
+ list_del(&page->lru);
+ atomic_set(&page->pt_frag_refcount, 0);
+ }
+ /* Page must not be on the list, so rcu_head can be used */
+ BUG_ON(atomic_read(&page->pt_frag_refcount));
+ spin_unlock_bh(&mm->context.lock);
+
+ /* Do not relink on rcu_head if other half already linked on rcu_head */
+ if ((mask & 0x0CU) != 0x0CU)
+ call_rcu(&page->rcu_head, bit ? pte_free_now1 : pte_free_now0);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
/*
* Base infrastructure required to generate basic asces, region, segment,
* and page tables that do not make use of enhanced features like EDAT1.

2023-06-29 05:19:25

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

On Wed, 28 Jun 2023, Gerald Schaefer wrote:
>
> As discussed in the other thread, we would rather go with less complexity,
> possibly switching to an approach w/o the list and fragment re-use in the
> future. For now, as a first step in that direction, we can try with not
> adding fragments back only for pte_free_defer(). Here is an adjusted
> version of your patch, copying most of your pte_free_defer() logic and
> also description, tested with LTP and all three of your patch series applied:

Thanks, Gerald: I don't mind abandoning my 13/12 SLAB_TYPESAFE_BY_RCU
patch (posted with fewer Cc's to the s390 list last week), and switching
to your simpler who-cares-if-we-sometimes-don't-make-maximal-use-of-page
patch.

But I didn't get deep enough into it today to confirm it - and disappointed
that you've found it necessary to play with pt_frag_refcount in addition to
_refcount and HH bits. No real problem with that, but my instinct says it
should be simpler.

Tomorrow...
Hugh

2023-06-29 14:18:42

by Alexander Gordeev

[permalink] [raw]
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

On Wed, Jun 28, 2023 at 09:16:24PM +0200, Gerald Schaefer wrote:
> On Tue, 20 Jun 2023 00:51:19 -0700 (PDT)
> Hugh Dickins <[email protected]> wrote:

Hi Gerald, Hugh!

...
> @@ -407,6 +445,88 @@ void __tlb_remove_table(void *_table)
> __free_page(page);
> }
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static void pte_free_now0(struct rcu_head *head);
> +static void pte_free_now1(struct rcu_head *head);

What about pte_free_lower() / pte_free_upper()?

...
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> +{
> + unsigned int bit, mask;
> + struct page *page;
> +
> + page = virt_to_page(pgtable);
> + if (mm_alloc_pgste(mm)) {
> + /*
> + * TODO: Do we need gmap_unlink(mm, pgtable, addr), like in
> + * page_table_free_rcu()?
> + * If yes -> need addr parameter here, like in pte_free_tlb().
> + */
> + call_rcu(&page->rcu_head, pte_free_pgste);
> + return;
> +}
> + bit = ((unsigned long)pgtable & ~PAGE_MASK) / (PTRS_PER_PTE * sizeof(pte_t));
> +
> + spin_lock_bh(&mm->context.lock);
> + mask = atomic_xor_bits(&page->_refcount, 0x15U << (bit + 24));

This makes the bit logic increasingly complicated to me.

What if instead we set the rule "one bit at a time only"?
That means an atomic group bit flip is only allowed between
pairs of bits, namely:

bit flip initiated from
----------- ----------------------------------------
P <- A page_table_free(), page_table_free_rcu()
H <- A pte_free_defer()
P <- H pte_free_half()

In the current model the P bit could be on together with the H
bit simultaneously. That actually brings nothing to the
equation.

Besides, this check in page_table_alloc() (while still
correct) makes one (well, me) wonder "what about HH bits?":

mask = (mask | (mask >> 4)) & 0x03U;
if (mask != 0x03U) {
...
}

By contrast, with the "one bit at a time only" policy, each
of the three bits effectively indicates which state a page
half is currently in.

Thanks!
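
For readers trying to follow the masks being debated here, a tiny standalone
sketch of the arithmetic (plain C, with an ordinary integer standing in for
the upper byte of page->_refcount; illustration only, not a proposal):

#include <stdio.h>

static void show(const char *what, unsigned int byte)
{
	printf("%-24s %#04x  AA=%#x HH=%#x PP=%#x\n", what, byte,
	       byte & 0x03U, byte & 0x0CU, byte & 0x30U);
}

int main(void)
{
	unsigned int byte = 0x01U;	/* only the first 2K half is allocated */

	show("initial", byte);

	/* page_table_free[_rcu]() on half 0 would xor 0x11U: toggle A and P
	 * (shown from the same initial state, not committed) */
	show("0x11U << 0 (free path)", byte ^ (0x11U << 0));

	/* pte_free_defer() on half 0 xors 0x15U: also toggles the H bit */
	byte ^= 0x15U << 0;
	show("0x15U << 0 (defer path)", byte);

	/* the page_table_alloc() check quoted above folds PP onto AA: */
	printf("(byte | (byte >> 4)) & 0x03U = %#x\n",
	       (byte | (byte >> 4)) & 0x03U);
	return 0;
}

With only A0 set initially, the 0x15U xor leaves A clear and H0+P0 set; the
folded check still reports that half as busy (via P0), and the H bits never
reach the AA positions in that fold, which is exactly the "what about HH
bits?" question raised above.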

2023-06-29 15:27:17

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

On Wed, Jun 28, 2023 at 10:08:08PM -0700, Hugh Dickins wrote:
> On Wed, 28 Jun 2023, Gerald Schaefer wrote:
> >
> > As discussed in the other thread, we would rather go with less complexity,
> > possibly switching to an approach w/o the list and fragment re-use in the
> > future. For now, as a first step in that direction, we can try with not
> > adding fragments back only for pte_free_defer(). Here is an adjusted
> > version of your patch, copying most of your pte_free_defer() logic and
> > also description, tested with LTP and all three of your patch series applied:
>
> Thanks, Gerald: I don't mind abandoning my 13/12 SLAB_TYPESAFE_BY_RCU
> patch (posted with fewer Cc's to the s390 list last week), and switching
> to your simpler who-cares-if-we-sometimes-don't-make-maximal-use-of-page
> patch.
>
> But I didn't get deep enough into it today to confirm it - and disappointed
> that you've found it necessary to play with pt_frag_refcount in addition to
> _refcount and HH bits. No real problem with that, but my instinct says it
> should be simpler.

Is there any reason it should be any different at all from what PPC is
doing?

I still think the right thing to do here is make the PPC code common
(with Hugh's proposed RCU modification) and just use it in both
arches....

Jason

2023-06-29 16:27:16

by Gerald Schaefer

[permalink] [raw]
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

On Thu, 29 Jun 2023 12:22:24 -0300
Jason Gunthorpe <[email protected]> wrote:

> On Wed, Jun 28, 2023 at 10:08:08PM -0700, Hugh Dickins wrote:
> > On Wed, 28 Jun 2023, Gerald Schaefer wrote:
> > >
> > > As discussed in the other thread, we would rather go with less complexity,
> > > possibly switching to an approach w/o the list and fragment re-use in the
> > > future. For now, as a first step in that direction, we can try with not
> > > adding fragments back only for pte_free_defer(). Here is an adjusted
> > > version of your patch, copying most of your pte_free_defer() logic and
> > > also description, tested with LTP and all three of your patch series applied:
> >
> > Thanks, Gerald: I don't mind abandoning my 13/12 SLAB_TYPESAFE_BY_RCU
> > patch (posted with fewer Cc's to the s390 list last week), and switching
> > to your simpler who-cares-if-we-sometimes-don't-make-maximal-use-of-page
> > patch.
> >
> > But I didn't get deep enough into it today to confirm it - and disappointed
> > that you've found it necessary to play with pt_frag_refcount in addition to
> > _refcount and HH bits. No real problem with that, but my instinct says it
> > should be simpler.

Yes, I also found it a bit awkward, but it seemed "good and simple enough",
to have something to go forward with, while my instinct was in line with yours.

>
> Is there any reason it should be any different at all from what PPC is
> doing?
>
> I still think the right thing to do here is make the PPC code common
> (with Hugh's proposed RCU modification) and just use it in both
> arches....

With the current approach, we would not add back fragments _only_ for
the new pte_free_defer() path, while keeping our cleverness for the other
paths. Not having a good overview of the negative impact wrt potential
memory waste, I would rather take small steps, if possible.

If we later switch to never adding back fragments, of course we should
try to be in line with PPC implementation.

2023-06-29 16:27:22

by Gerald Schaefer

[permalink] [raw]
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

On Thu, 29 Jun 2023 15:59:07 +0200
Alexander Gordeev <[email protected]> wrote:

> On Wed, Jun 28, 2023 at 09:16:24PM +0200, Gerald Schaefer wrote:
> > On Tue, 20 Jun 2023 00:51:19 -0700 (PDT)
> > Hugh Dickins <[email protected]> wrote:
>
> Hi Gerald, Hugh!
>
> ...
> > @@ -407,6 +445,88 @@ void __tlb_remove_table(void *_table)
> > __free_page(page);
> > }
> >
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +static void pte_free_now0(struct rcu_head *head);
> > +static void pte_free_now1(struct rcu_head *head);
>
> What about pte_free_lower() / pte_free_upper()?

I actually like the 0/1 better; I always get confused about what exactly we
mean by "lower / upper" in our code and comments. Is it the first
or second half? With 0/1 it is immediately clear to me.

>
> ...
> > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > +{
> > + unsigned int bit, mask;
> > + struct page *page;
> > +
> > + page = virt_to_page(pgtable);
> > + if (mm_alloc_pgste(mm)) {
> > + /*
> > + * TODO: Do we need gmap_unlink(mm, pgtable, addr), like in
> > + * page_table_free_rcu()?
> > + * If yes -> need addr parameter here, like in pte_free_tlb().
> > + */
> > + call_rcu(&page->rcu_head, pte_free_pgste);
> > + return;
> > +}
> > + bit = ((unsigned long)pgtable & ~PAGE_MASK) / (PTRS_PER_PTE * sizeof(pte_t));
> > +
> > + spin_lock_bh(&mm->context.lock);
> > + mask = atomic_xor_bits(&page->_refcount, 0x15U << (bit + 24));
>
> This makes the bit logic increasingly complicated to me.

I think it is well in line with existing code in page_table_free[_rcu].
Only instead of doing xor with 0x11U, it does xor with 0x15U to also
switch on the H bit while at it.

>
> What if instead we set the rule "one bit at a time only"?
> That means an atomic group bit flip is only allowed between
> pairs of bits, namely:
>
> bit flip initiated from
> ----------- ----------------------------------------
> P <- A page_table_free(), page_table_free_rcu()
> H <- A pte_free_defer()
> P <- H pte_free_half()
>
> In the current model the P bit could be on together with the H
> bit simultaneously. That actually brings nothing to the
> equation.

P bit has to be set at the latest when __tlb_remove_table() gets
called, because there it is checked / cleared. It might be possible
to not set it in pte_free_defer() already, but only later in
pte_free_half() RCU callback, before calling __tlb_remove_table().
But that would not be in line any more with existing code, where it
is already set before scheduling the RCU callback.

Therefore, I would rather stick to the current approach, unless
you see some bug in it.

>
> Besides, this check in page_table_alloc() (while still
> correct) makes one (well, me) wonder "what about HH bits?":
>
> mask = (mask | (mask >> 4)) & 0x03U;
> if (mask != 0x03U) {
> ...
> }

Without adding fragments back to the list, it is not necessary
to check any H bits in page_table_alloc(), or so I hope. Actually,
I like that aspect most, i.e. we have as little impact on current
code as possible.

And H bits are only relevant for preventing double use of rcu_head,
which is what they were designed for, and only the new code has
to care about them.

2023-06-30 06:35:38

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

On Thu, 29 Jun 2023, Gerald Schaefer wrote:
> On Thu, 29 Jun 2023 12:22:24 -0300
> Jason Gunthorpe <[email protected]> wrote:
> > On Wed, Jun 28, 2023 at 10:08:08PM -0700, Hugh Dickins wrote:
> > > On Wed, 28 Jun 2023, Gerald Schaefer wrote:
> > > >
> > > > As discussed in the other thread, we would rather go with less complexity,
> > > > possibly switching to an approach w/o the list and fragment re-use in the
> > > > future. For now, as a first step in that direction, we can try with not
> > > > adding fragments back only for pte_free_defer(). Here is an adjusted
> > > > version of your patch, copying most of your pte_free_defer() logic and
> > > > also description, tested with LTP and all three of your patch series applied:
> > >
> > > Thanks, Gerald: I don't mind abandoning my 13/12 SLAB_TYPESAFE_BY_RCU
> > > patch (posted with fewer Cc's to the s390 list last week), and switching
> > > to your simpler who-cares-if-we-sometimes-don't-make-maximal-use-of-page
> > > patch.
> > >
> > > But I didn't get deep enough into it today to confirm it - and disappointed
> > > that you've found it necessary to play with pt_frag_refcount in addition to
> > > _refcount and HH bits. No real problem with that, but my instinct says it
> > > should be simpler.
>
> Yes, I also found it a bit awkward, but it seemed "good and simple enough",
> to have something to go forward with, while my instinct was in line with yours.
>
> >
> > Is there any reason it should be any different at all from what PPC is
> > doing?
> >
> > I still think the right thing to do here is make the PPC code common
> > (with Hugh's proposed RCU modification) and just use it in both
> > arches....
>
> With the current approach, we would not add back fragments _only_ for
> the new pte_free_defer() path, while keeping our cleverness for the other
> paths. Not having a good overview of the negative impact wrt potential
> memory waste, I would rather take small steps, if possible.
>
> If we later switch to never adding back fragments, of course we should
> try to be in line with PPC implementation.

I find myself half-agreeing with everyone.

I agree with Gerald that s390 should keep close to what it is already
doing (except for adding pte_free_defer()): that changing its strategy
and implementation to be much more like powerpc, is a job for some other
occasion (and would depend on gathering data about how well each does).

But I agree with Jason that the powerpc solution we ended up with cut
out a lot of unnecessary complication: it shifts the RCU delay from
when pte_free_defer() is called, to when the shared page comes to be
freed; which may be a lot later, and might not be welcome in a common
path, but is quite okay for the uncommon pte_free_defer().

And I agree with Alexander that pte_free_lower() and pte_free_upper()
are better names than pte_free_now0() and pte_free_now1(): I was going
to make that change, except all those functions disappear if we follow
Jason's advice and switch the call_rcu() to when freeing the page.

(Lower and upper seem unambiguous to me: Gerald, does your confusion
come just from the way they are shown the wrong way round in the PP AA
diagram? I corrected that in my patch, but you reverted it in yours.)

I've grown to dislike the (ab)use of pt_frag_refcount even more, to the
extent that I've not even tried to verify it; but I think I do get the
point now, that we need further info than just PPHHAA to know whether
the page is on the list or not. But I think that if we move where the
call_rcu() is done, then the page can stay on or off the list by same
rules as before (but need to check HH bits along with PP when deciding
whether to allocate, and whether to list_add_tail() when freeing).

So, starting from Gerald's but cutting it down, I was working on the
patch which follows those ideas. But have run out of puff for tonight,
and would just waste all our time (again) if I sent anything out now.

Hugh
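
Purely as an arithmetic illustration of "check HH bits along with PP": this is
not from any posted patch, just one hypothetical way the existing folding
check could be extended, shown with plain integers:

#include <stdio.h>

int main(void)
{
	/* some upper-byte states: A0; H0; A1|P0; A1|H1|P1 */
	unsigned int examples[] = { 0x01U, 0x04U, 0x12U, 0x2AU };

	for (unsigned int i = 0; i < 4; i++) {
		unsigned int m = examples[i];
		/* fold PP (>>4) and, hypothetically, HH (>>2) onto AA */
		unsigned int busy = (m | (m >> 2) | (m >> 4)) & 0x03U;

		printf("byte %#04x -> busy halves %#x\n", m, busy);
	}
	return 0;
}

A half would then count as unavailable if it is allocated, pending removal,
or pending free on rcu_head - the kind of combined test alluded to above for
the allocation path.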

2023-06-30 14:27:44

by Claudio Imbrenda

[permalink] [raw]
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

On Tue, 20 Jun 2023 00:51:19 -0700 (PDT)
Hugh Dickins <[email protected]> wrote:

[...]

> @@ -407,6 +429,77 @@ void __tlb_remove_table(void *_table)
> __free_page(page);
> }
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static void pte_free_now0(struct rcu_head *head);
> +static void pte_free_now1(struct rcu_head *head);
> +
> +static void pte_free_pgste(struct rcu_head *head)
> +{
> + unsigned long *table;
> + struct page *page;
> +
> + page = container_of(head, struct page, rcu_head);
> + table = (unsigned long *)page_to_virt(page);
> + table = (unsigned long *)((unsigned long)table | 0x03U);
> + __tlb_remove_table(table);
> +}
> +
> +static void pte_free_half(struct rcu_head *head, unsigned int bit)
> +{
> + unsigned long *table;
> + struct page *page;
> + unsigned int mask;
> +
> + page = container_of(head, struct page, rcu_head);
> + mask = atomic_xor_bits(&page->_refcount, 0x04U << (bit + 24));
> +
> + table = (unsigned long *)page_to_virt(page);
> + table += bit * PTRS_PER_PTE;
> + table = (unsigned long *)((unsigned long)table | (0x01U << bit));
> + __tlb_remove_table(table);
> +
> + /* If pte_free_defer() of the other half came in, queue it now */
> + if (mask & 0x0CU)
> + call_rcu(&page->rcu_head, bit ? pte_free_now0 : pte_free_now1);
> +}
> +
> +static void pte_free_now0(struct rcu_head *head)
> +{
> + pte_free_half(head, 0);
> +}
> +
> +static void pte_free_now1(struct rcu_head *head)
> +{
> + pte_free_half(head, 1);
> +}
> +
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> +{
> + unsigned int bit, mask;
> + struct page *page;
> +
> + page = virt_to_page(pgtable);
> + if (mm_alloc_pgste(mm)) {
> + call_rcu(&page->rcu_head, pte_free_pgste);

so is this now going to be used to free page tables
instead of page_table_free_rcu?

or will it be used instead of page_table_free?

this is actually quite important for KVM on s390

> + return;
> + }
> + bit = ((unsigned long)pgtable & ~PAGE_MASK) /
> + (PTRS_PER_PTE * sizeof(pte_t));
> +
> + spin_lock_bh(&mm_pgtable_list_lock);
> + mask = atomic_xor_bits(&page->_refcount, 0x15U << (bit + 24));
> + mask >>= 24;
> + /* Other half not allocated? Other half not already pending free? */
> + if ((mask & 0x03U) == 0x00U && (mask & 0x30U) != 0x30U)
> + list_del(&page->lru);
> + spin_unlock_bh(&mm_pgtable_list_lock);
> +
> + /* Do not relink on rcu_head if other half already linked on rcu_head */
> + if ((mask & 0x0CU) != 0x0CU)
> + call_rcu(&page->rcu_head, bit ? pte_free_now1 : pte_free_now0);
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> /*
> * Base infrastructure required to generate basic asces, region, segment,
> * and page tables that do not make use of enhanced features like EDAT1.
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 306a3d1a0fa6..1667a1bdb8a8 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -146,7 +146,7 @@ struct page {
> pgtable_t pmd_huge_pte; /* protected by page->ptl */
> unsigned long _pt_pad_2; /* mapping */
> union {
> - struct mm_struct *pt_mm; /* x86 pgds only */
> + struct mm_struct *pt_mm; /* x86 pgd, s390 */
> atomic_t pt_frag_refcount; /* powerpc */
> };
> #if ALLOC_SPLIT_PTLOCKS


2023-06-30 15:54:39

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

On Fri, 30 Jun 2023, Claudio Imbrenda wrote:
> On Tue, 20 Jun 2023 00:51:19 -0700 (PDT)
> Hugh Dickins <[email protected]> wrote:
>
> [...]
>
> > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > +{
> > + unsigned int bit, mask;
> > + struct page *page;
> > +
> > + page = virt_to_page(pgtable);
> > + if (mm_alloc_pgste(mm)) {
> > + call_rcu(&page->rcu_head, pte_free_pgste);
>
> so is this now going to be used to free page tables
> instead of page_table_free_rcu?

No.

All pte_free_defer() is being used for (in this series; and any future
use beyond this series will have to undertake its own evaluations) is
for the case of removing an empty page table, which used to map a group
of PTE mappings of a file, in order to make way for one PMD mapping of
the huge page which those scattered pages have now been gathered into.
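
Roughly, the eventual call site in retract_page_tables() looks like this
(a sketch only, locals and ordering approximated from the 09-12/12 patches):

	spinlock_t *pml;
	pmd_t pgt_pmd;

	pml = pmd_lock(mm, pmd);		/* page table found empty of ptes */
	pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
	spin_unlock(pml);
	mm_dec_nr_ptes(mm);
	/* the detached page table may still be read under RCU: free it later */
	pte_free_defer(mm, pmd_pgtable(pgt_pmd));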

You're worried by that mm_alloc_pgste() block: it's something I didn't
have at all in my first draft, then I thought that perhaps the pgste
case might be able to come this way, so it seemed stupid to leave out
the handling for it.

I hope that you're implying that should be dead code here? Perhaps,
that the pgste case corresponds to the case in s390 where THPs are
absolutely forbidden? That would be good news for us.

Gerald, in his version of this block, added a comment asking:
/*
* TODO: Do we need gmap_unlink(mm, pgtable, addr), like in
* page_table_free_rcu()?
* If yes -> need addr parameter here, like in pte_free_tlb().
*/
Do you have the answer to that? Neither of us could work it out.

>
> or will it be used instead of page_table_free?

Not always; but yes, this case of removing a page table used
page_table_free() before; but now, with the lighter locking, needs
to keep the page table valid until the RCU grace period expires.
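
For a picture of what the grace period is protecting against: a lockless
walker doing roughly the following may still have its hands on the old table
(sketch; pte_offset_map() takes rcu_read_lock() internally, per 01/12):

	pte_t *pte;
	pte_t entry;

	pte = pte_offset_map(pmd, addr);	/* NULL if the pmd changed underneath */
	if (pte) {
		entry = ptep_get(pte);		/* may read from the retracted table */
		pte_unmap(pte);			/* drops rcu_read_lock() */
	}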

>
> this is actually quite important for KVM on s390

None of us are wanting to break KVM on s390: your guidance appreciated!

Thanks,
Hugh

2023-06-30 16:42:09

by Claudio Imbrenda

[permalink] [raw]
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

On Fri, 30 Jun 2023 08:28:54 -0700 (PDT)
Hugh Dickins <[email protected]> wrote:

> On Fri, 30 Jun 2023, Claudio Imbrenda wrote:
> > On Tue, 20 Jun 2023 00:51:19 -0700 (PDT)
> > Hugh Dickins <[email protected]> wrote:
> >
> > [...]
> >
> > > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > > +{
> > > + unsigned int bit, mask;
> > > + struct page *page;
> > > +
> > > + page = virt_to_page(pgtable);
> > > + if (mm_alloc_pgste(mm)) {
> > > + call_rcu(&page->rcu_head, pte_free_pgste);
> >
> > so is this now going to be used to free page tables
> > instead of page_table_free_rcu?
>
> No.
>
> All pte_free_defer() is being used for (in this series; and any future
> use beyond this series will have to undertake its own evaluations) is
> for the case of removing an empty page table, which used to map a group
> of PTE mappings of a file, in order to make way for one PMD mapping of
> the huge page which those scattered pages have now been gathered into.
>
> You're worried by that mm_alloc_pgste() block: it's something I didn't

actually no, but thanks for bringing it up :D

> have at all in my first draft, then I thought that perhaps the pgste
> case might be able to come this way, so it seemed stupid to leave out
> the handling for it.
>
> I hope that you're implying that should be dead code here? Perhaps,
> that the pgste case corresponds to the case in s390 where THPs are
> absolutely forbidden? That would be good news for us.
>
> Gerald, in his version of this block, added a comment asking:
> /*
> * TODO: Do we need gmap_unlink(mm, pgtable, addr), like in
> * page_table_free_rcu()?
> * If yes -> need addr parameter here, like in pte_free_tlb().
> */
> Do you have the answer to that? Neither of us could work it out.

this is the thing I'm worried about; removing a page table that was
used to map a guest will leave dangling pointers in the gmap that will
cause memory corruption (I actually ran into that problem myself for
another patchseries).

gmap_unlink() is needed to clean up the pointers before they become
dangling (and also potentially do some TLB purging as needed)

the point here is: we need that only for page_table_free_rcu(); all
other users of page_table_free() cannot act on guest page tables
(because we don't allow THP for KVM guests). and that is why
page_table_free() does not do gmap_unlink() currently.
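
(from memory, the pgste branch of page_table_free_rcu() is roughly:

	if (mm_alloc_pgste(mm)) {
		/* guest-visible table: detach its gmap backlink before freeing */
		gmap_unlink(mm, table, vmaddr);
		table = (unsigned long *)((unsigned long)table | 0x03U);
		tlb_remove_table(tlb, table);
		return;
	}

whereas page_table_free() and pte_free_defer() are given no vmaddr, so they
could not call gmap_unlink() even if they wanted to)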

>
> >
> > or will it be used instead of page_table_free?
>
> Not always; but yes, this case of removing a page table used
> page_table_free() before; but now, with the lighter locking, needs
> to keep the page table valid until the RCU grace period expires.

so if I understand correctly your code will, sometimes, under some
circumstances, replace what page_table_free() does, but it will never
replace page_table_free_rcu()?

because in that case there would be no issues

>
> >
> > this is actually quite important for KVM on s390
>
> None of us are wanting to break KVM on s390: your guidance appreciated!
>
> Thanks,
> Hugh


2023-06-30 19:31:19

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

On Fri, 30 Jun 2023, Claudio Imbrenda wrote:
> On Fri, 30 Jun 2023 08:28:54 -0700 (PDT)
> Hugh Dickins <[email protected]> wrote:
> > On Fri, 30 Jun 2023, Claudio Imbrenda wrote:
> > > On Tue, 20 Jun 2023 00:51:19 -0700 (PDT)
> > > Hugh Dickins <[email protected]> wrote:
> > >
> > > [...]
> > >
> > > > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > > > +{
> > > > + unsigned int bit, mask;
> > > > + struct page *page;
> > > > +
> > > > + page = virt_to_page(pgtable);
> > > > + if (mm_alloc_pgste(mm)) {
> > > > + call_rcu(&page->rcu_head, pte_free_pgste);
> > >
> > > so is this now going to be used to free page tables
> > > instead of page_table_free_rcu?
> >
> > No.
> >
> > All pte_free_defer() is being used for (in this series; and any future
> > use beyond this series will have to undertake its own evaluations) is
> > for the case of removing an empty page table, which used to map a group
> > of PTE mappings of a file, in order to make way for one PMD mapping of
> > the huge page which those scattered pages have now been gathered into.
> >
> > You're worried by that mm_alloc_pgste() block: it's something I didn't
>
> actually no, but thanks for bringing it up :D
>
> > have at all in my first draft, then I thought that perhaps the pgste
> > case might be able to come this way, so it seemed stupid to leave out
> > the handling for it.
> >
> > I hope that you're implying that should be dead code here? Perhaps,
> > that the pgste case corresponds to the case in s390 where THPs are
> > absolutely forbidden? That would be good news for us.
> >
> > Gerald, in his version of this block, added a comment asking:
> > /*
> > * TODO: Do we need gmap_unlink(mm, pgtable, addr), like in
> > * page_table_free_rcu()?
> > * If yes -> need addr parameter here, like in pte_free_tlb().
> > */
> > Do you have the answer to that? Neither of us could work it out.
>
> this is the thing I'm worried about; removing a page table that was
> used to map a guest will leave dangling pointers in the gmap that will
> cause memory corruption (I actually ran into that problem myself for
> another patchseries).
>
> gmap_unlink() is needed to clean up the pointers before they become
> dangling (and also potentially do some TLB purging as needed)

That's something I would have expected to be handled already via
mmu_notifiers, rather than buried inside the page table freeing.

If s390 is the only architecture to go that way, and could instead do
it via mmu_notifiers, then I think that will be more easily supported
in the long term.

But I'm writing from a position of very great ignorance: advising
KVM on s390 is many dimensions away from what I'm capable of.

>
> the point here is: we need that only for page_table_free_rcu(); all
> other users of page_table_free() cannot act on guest page tables

I might be wrong, but I think that most users of page_table_free()
are merely freeing a page table which had to be allocated up front,
but was then found unnecessary (maybe a racing task already inserted
one): page tables which were never exposed to actual use.

> (because we don't allow THP for KVM guests). and that is why
> page_table_free() does not do gmap_unlink() currently.

But THP collapse does (or did before this series) use it to free a
page table which had been exposed to use. The fact that s390 does
not allow THP for KVM guests makes page_table_free(), and this new
pte_free_defer(), safe for that; but it feels dangerously coincidental.

It's easy to imagine a future change being made, which would stumble
over this issue. I have imagined that pte_free_defer() will be useful
in future, in the freeing of empty page tables: but s390 may pose a
problem there - though perhaps no more of a problem than additionally
needing to pass a virtual address down the stack.

>
> >
> > >
> > > or will it be used instead of page_table_free?
> >
> > Not always; but yes, this case of removing a page table used
> > page_table_free() before; but now, with the lighter locking, needs
> > to keep the page table valid until the RCU grace period expires.
>
> so if I understand correctly your code will, sometimes, under some
> circumstances, replace what page_table_free() does, but it will never
> replace page_table_free_rcu()?
>
> because in that case there would be no issues

Yes, thanks for confirming: we have no issue here at present, but may
do if use of pte_free_defer() is extended to other contexts in future.

Would it be appropriate to add a WARN_ON_ONCE around that
> > > > + if (mm_alloc_pgste(mm)) {
in pte_free_defer()?

I ask that somewhat rhetorically: that block disappears in the later
version I was working on last night (and will return to shortly), in
which pte_free_defer() just sets a bit and calls page_table_free().
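
That later version amounts to something like this (a sketch, with the
suggested WARN added):

	void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
	{
		struct page *page = virt_to_page(pgtable);

		/* no pgste mm expected here: THP collapse not done for KVM guests */
		WARN_ON_ONCE(mm_alloc_pgste(mm));
		SetPageActive(page);	/* tell page_table_free() to defer to RCU */
		page_table_free(mm, (unsigned long *)pgtable);
	}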

But I'd like to understand the possibilities better: does mm_alloc_pgste()
correspond 1:1 to KVM guest on s390, or does it cover several different
possibilities of which KVM guest is one, or am I just confused to be
thinking there's any relationship?

Thanks,
Hugh

>
> >
> > >
> > > this is actually quite important for KVM on s390
> >
> > None of us are wanting to break KVM on s390: your guidance appreciated!
> >
> > Thanks,
> > Hugh

2023-07-02 05:48:24

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

On Thu, 29 Jun 2023, Hugh Dickins wrote:
>
> I've grown to dislike the (ab)use of pt_frag_refcount even more, to the
> extent that I've not even tried to verify it; but I think I do get the
> point now, that we need further info than just PPHHAA to know whether
> the page is on the list or not. But I think that if we move where the
> call_rcu() is done, then the page can stay on or off the list by same
> rules as before (but need to check HH bits along with PP when deciding
> whether to allocate, and whether to list_add_tail() when freeing).

No, not quite the same rules as before: I came to realize that using
list_add_tail() for the HH pages would be liable to put a page on the
list which forever blocked reuse of PP list_add_tail() pages after it
(could be solved by a list_move() somewhere, but we have agreed to
prefer simplicity).

I've dropped the HH bits, I'm using PageActive like we did on powerpc,
I've dropped most of the pte_free_*() helpers, and list_del_init() is
an easier way of dealing with those "is it on the list" questions.
I expect that we shall be close to reaching agreement on...

[PATCH v? 07/12] s390: add pte_free_defer() for pgtables sharing page

Add s390-specific pte_free_defer(), to free table page via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon. This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This version is more complicated than others: because s390 fits two 2K
page tables into one 4K page (so page->rcu_head must be shared between
both halves), and already uses page->lru (which page->rcu_head overlays)
to list any free halves; with clever management by page->_refcount bits.

Build upon the existing management, adjusted to follow a new rule: that
a page is never on the free list if pte_free_defer() was used on either
half (marked by PageActive). And for simplicity, delay calling RCU until
both halves are freed.

Not adding back unallocated fragments to the list in pte_free_defer()
can result in wasting some amount of memory for pagetables, depending
on how long the allocated fragment will stay in use. In practice, this
effect is expected to be insignificant, and not justify a far more
complex approach, which might allow to add the fragments back later
in __tlb_remove_table(), where we might not have a stable mm any more.

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/s390/include/asm/pgalloc.h | 4 ++
arch/s390/mm/pgalloc.c | 75 +++++++++++++++++++++++++++------
2 files changed, 67 insertions(+), 12 deletions(-)

diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
index 17eb618f1348..89a9d5ef94f8 100644
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -143,6 +143,10 @@ static inline void pmd_populate(struct mm_struct *mm,
#define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
#define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)

+/* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
void vmem_map_init(void);
void *vmem_crst_alloc(unsigned long val);
pte_t *vmem_pte_alloc(void);
diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index 66ab68db9842..fd0c4312da16 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
* logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
* while the PP bits are never used, nor such a page is added to or removed
* from mm_context_t::pgtable_list.
+ *
+ * pte_free_defer() overrides those rules: it takes the page off pgtable_list,
+ * and prevents both 2K fragments from being reused. pte_free_defer() has to
+ * guarantee that its pgtable cannot be reused before the RCU grace period
+ * has elapsed (which page_table_free_rcu() does not actually guarantee).
+ * But for simplicity, because page->rcu_head overlays page->lru, and because
+ * the RCU callback might not be called before the mm_context_t has been freed,
+ * pte_free_defer() in this implementation prevents both fragments from being
+ * reused, and delays making the call to RCU until both fragments are freed.
*/
unsigned long *page_table_alloc(struct mm_struct *mm)
{
@@ -261,7 +270,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
table += PTRS_PER_PTE;
atomic_xor_bits(&page->_refcount,
0x01U << (bit + 24));
- list_del(&page->lru);
+ list_del_init(&page->lru);
}
}
spin_unlock_bh(&mm->context.lock);
@@ -281,6 +290,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
table = (unsigned long *) page_to_virt(page);
if (mm_alloc_pgste(mm)) {
/* Return 4K page table with PGSTEs */
+ INIT_LIST_HEAD(&page->lru);
atomic_xor_bits(&page->_refcount, 0x03U << 24);
memset64((u64 *)table, _PAGE_INVALID, PTRS_PER_PTE);
memset64((u64 *)table + PTRS_PER_PTE, 0, PTRS_PER_PTE);
@@ -300,7 +310,9 @@ static void page_table_release_check(struct page *page, void *table,
{
char msg[128];

- if (!IS_ENABLED(CONFIG_DEBUG_VM) || !mask)
+ if (!IS_ENABLED(CONFIG_DEBUG_VM))
+ return;
+ if (!mask && list_empty(&page->lru))
return;
snprintf(msg, sizeof(msg),
"Invalid pgtable %p release half 0x%02x mask 0x%02x",
@@ -308,6 +320,15 @@ static void page_table_release_check(struct page *page, void *table,
dump_page(page, msg);
}

+static void pte_free_now(struct rcu_head *head)
+{
+ struct page *page;
+
+ page = container_of(head, struct page, rcu_head);
+ pgtable_pte_page_dtor(page);
+ __free_page(page);
+}
+
void page_table_free(struct mm_struct *mm, unsigned long *table)
{
unsigned int mask, bit, half;
@@ -325,10 +346,17 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
*/
mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
mask >>= 24;
- if (mask & 0x03U)
+ if ((mask & 0x03U) && !PageActive(page)) {
+ /*
+ * Other half is allocated, and neither half has had
+ * its free deferred: add page to head of list, to make
+ * this freed half available for immediate reuse.
+ */
list_add(&page->lru, &mm->context.pgtable_list);
- else
- list_del(&page->lru);
+ } else {
+ /* If page is on list, now remove it. */
+ list_del_init(&page->lru);
+ }
spin_unlock_bh(&mm->context.lock);
mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
mask >>= 24;
@@ -342,8 +370,10 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
}

page_table_release_check(page, table, half, mask);
- pgtable_pte_page_dtor(page);
- __free_page(page);
+ if (TestClearPageActive(page))
+ call_rcu(&page->rcu_head, pte_free_now);
+ else
+ pte_free_now(&page->rcu_head);
}

void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
@@ -370,10 +400,18 @@ void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
*/
mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
mask >>= 24;
- if (mask & 0x03U)
+ if ((mask & 0x03U) && !PageActive(page)) {
+ /*
+ * Other half is allocated, and neither half has had
+ * its free deferred: add page to end of list, to make
+ * this freed half available for reuse once its pending
+ * bit has been cleared by __tlb_remove_table().
+ */
list_add_tail(&page->lru, &mm->context.pgtable_list);
- else
- list_del(&page->lru);
+ } else {
+ /* If page is on list, now remove it. */
+ list_del_init(&page->lru);
+ }
spin_unlock_bh(&mm->context.lock);
table = (unsigned long *) ((unsigned long) table | (0x01U << bit));
tlb_remove_table(tlb, table);
@@ -403,10 +441,23 @@ void __tlb_remove_table(void *_table)
}

page_table_release_check(page, table, half, mask);
- pgtable_pte_page_dtor(page);
- __free_page(page);
+ if (TestClearPageActive(page))
+ call_rcu(&page->rcu_head, pte_free_now);
+ else
+ pte_free_now(&page->rcu_head);
}

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+ struct page *page;
+
+ page = virt_to_page(pgtable);
+ SetPageActive(page);
+ page_table_free(mm, (unsigned long *)pgtable);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
/*
* Base infrastructure required to generate basic asces, region, segment,
* and page tables that do not make use of enhanced features like EDAT1.
--
2.35.3


2023-07-03 11:18:22

by Claudio Imbrenda

[permalink] [raw]
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

On Fri, 30 Jun 2023 12:22:43 -0700 (PDT)
Hugh Dickins <[email protected]> wrote:

[...]

> That's something I would have expected to be handled already via
> mmu_notifiers, rather than buried inside the page table freeing.
>
> If s390 is the only architecture to go that way, and could instead do
> it via mmu_notifiers, then I think that will be more easily supported
> in the long term.

I am very well aware of that, and in fact I've been working on
exactly that for some time already. But it's a very complex minefield
and therefore I'm proceeding *very* carefully. It will take quite some
time before anything comes out.

>
> But I'm writing from a position of very great ignorance: advising
> KVM on s390 is many dimensions away from what I'm capable of.

fair enough, but in this case it doesn't mean that you are not right :)

>
> >
> > the point here is: we need that only for page_table_free_rcu(); all
> > other users of page_table_free() cannot act on guest page tables
>
> I might be wrong, but I think that most users of page_table_free()
> are merely freeing a page table which had to be allocated up front,
> but was then found unnecessary (maybe a racing task already inserted
> one): page tables which were never exposed to actual use.

that was my impression as well

> > (because we don't allow THP for KVM guests). and that is why
> > page_table_free() does not do gmap_unlink() currently.
>
> But THP collapse does (or did before this series) use it to free a
> page table which had been exposed to use. The fact that s390 does

that is also my understanding

> not allow THP for KVM guests makes page_table_free(), and this new
> pte_free_defer(), safe for that; but it feels dangerously coincidental.

not really; my guess is that we _intentionally_ did not do anything
there, because we knew we did not need it, knowing well that
we would need it once we would want to support THP for guests.
so not a coincidence, but a conscious decision based, I guess, on
touching as little code as needed.

>
> It's easy to imagine a future change being made, which would stumble
> over this issue. I have imagined that pte_free_defer() will be useful
> in future, in the freeing of empty page tables: but s390 may pose a
> problem there - though perhaps no more of a problem than additionally
> needing to pass a virtual address down the stack.

yeah it can always be fixed later if we need to

>
> >
> > >
> > > >
> > > > or will it be used instead of page_table_free?
> > >
> > > Not always; but yes, this case of removing a page table used
> > > page_table_free() before; but now, with the lighter locking, needs
> > > to keep the page table valid until the RCU grace period expires.
> >
> > so if I understand correctly your code will, sometimes, under some
> > circumstances, replace what page_table_free() does, but it will never
> > replace page_table_free_rcu()?
> >
> > because in that case there would be no issues
>
> Yes, thanks for confirming: we have no issue here at present, but may
> do if use of pte_free_defer() is extended to other contexts in future.
>
> Would it be appropriate to add a WARN_ON_ONCE around that
> > > > > + if (mm_alloc_pgste(mm)) {
> in pte_free_defer()?

that's actually not a bad idea. should never happen, but... that's the
whole point of a WARN_ON after all

>
> I ask that somewhat rhetorically: that block disappears in the later
> version I was working on last night (and will return to shortly), in
> which pte_free_defer() just sets a bit and calls page_table_free().
>
> But I'd like to understand the possibilities better: does mm_alloc_pgste()
> correspond 1:1 to KVM guest on s390, or does it cover several different
> possibilities of which KVM guest is one, or am I just confused to be
> thinking there's any relationship?

this is... historically complicated (because of course it is)

in theory any process can allocate PGSTEs by having the right bits in
the ELF header (that's how QEMU does it currently). And QEMU will have
PGSTEs allocated even if it does not actually start any guests.

Then we have the vm.allocate_pgste sysctl knob; once enabled, it will
cause all processes to have PGSTEs allocated. This is how we handled
PGSTEs before we switched to ELF header bits.

So in summary: in __practice__ yes, only QEMU will have PGSTEs. But in
theory anything is possible and allowed.

>
> Thanks,
> Hugh
>
> >
> > >
> > > >
> > > > this is actually quite important for KVM on s390
> > >
> > > None of us are wanting to break KVM on s390: your guidance appreciated!
> > >
> > > Thanks,
> > > Hugh


2023-07-03 16:35:57

by Gerald Schaefer

[permalink] [raw]
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

On Thu, 29 Jun 2023 23:00:07 -0700 (PDT)
Hugh Dickins <[email protected]> wrote:

> On Thu, 29 Jun 2023, Gerald Schaefer wrote:
> > On Thu, 29 Jun 2023 12:22:24 -0300
> > Jason Gunthorpe <[email protected]> wrote:
> > > On Wed, Jun 28, 2023 at 10:08:08PM -0700, Hugh Dickins wrote:
> > > > On Wed, 28 Jun 2023, Gerald Schaefer wrote:
> > > > >
> > > > > As discussed in the other thread, we would rather go with less complexity,
> > > > > possibly switching to an approach w/o the list and fragment re-use in the
> > > > > future. For now, as a first step in that direction, we can try with not
> > > > > adding fragments back only for pte_free_defer(). Here is an adjusted
> > > > > version of your patch, copying most of your pte_free_defer() logic and
> > > > > also description, tested with LTP and all three of your patch series applied:
> > > >
> > > > Thanks, Gerald: I don't mind abandoning my 13/12 SLAB_TYPESAFE_BY_RCU
> > > > patch (posted with fewer Cc's to the s390 list last week), and switching
> > > > to your simpler who-cares-if-we-sometimes-don't-make-maximal-use-of-page
> > > > patch.
> > > >
> > > > But I didn't get deep enough into it today to confirm it - and disappointed
> > > > that you've found it necessary to play with pt_frag_refcount in addition to
> > > > _refcount and HH bits. No real problem with that, but my instinct says it
> > > > should be simpler.
> >
> > Yes, I also found it a bit awkward, but it seemed "good and simple enough",
> > to have something to go forward with, while my instinct was in line with yours.
> >
> > >
> > > Is there any reason it should be any different at all from what PPC is
> > > doing?
> > >
> > > I still think the right thing to do here is make the PPC code common
> > > (with Hugh's proposed RCU modification) and just use it in both
> > > arches....
> >
> > With the current approach, we would not add back fragments _only_ for
> > the new pte_free_defer() path, while keeping our cleverness for the other
> > paths. Not having a good overview of the negative impact wrt potential
> > memory waste, I would rather take small steps, if possible.
> >
> > If we later switch to never adding back fragments, of course we should
> > try to be in line with PPC implementation.
>
> I find myself half-agreeing with everyone.
>
> I agree with Gerald that s390 should keep close to what it is already
> doing (except for adding pte_free_defer()): that changing its strategy
> and implementation to be much more like powerpc, is a job for some other
> occasion (and would depend on gathering data about how well each does).
>
> But I agree with Jason that the powerpc solution we ended up with cut
> out a lot of unnecessary complication: it shifts the RCU delay from
> when pte_free_defer() is called, to when the shared page comes to be
> freed; which may be a lot later, and might not be welcome in a common
> path, but is quite okay for the uncommon pte_free_defer().

Ok, I guess I must admit that I completely ignored the latest progress in
the powerpc thread, and therefore was not up-to-date. Still had the older
approach in mind, where you also checked for pt_frag_refcount to avoid
double call_rcu().

The new approach sounds very reasonable, and I also like your latest
s390 patch from a first glance. Need to get more up-to-date with PageActive
and maybe also powerpc approach, and give this some proper review tomorrow.

>
> And I agree with Alexander that pte_free_lower() and pte_free_upper()
> are better names than pte_free_now0() and pte_free_now1(): I was going
> to make that change, except all those functions disappear if we follow
> Jason's advice and switch the call_rcu() to when freeing the page.
>
> (Lower and upper seem unambiguous to me: Gerald, does your confusion
> come just from the way they are shown the wrong way round in the PP AA
> diagram? I corrected that in my patch, but you reverted it in yours.)

Ah yes, that could well be, and unfortunately I did not notice that you
fixed that in the comment. I only saw that you "fixed" the bit numbering
from 01234567 to 76543210, which I think is wrong on big-endian s390,
and therefore I simply removed that complete hunk.

But thanks a lot for pointing to that! We will certainly want to fix that
comment in a later patch, to reduce some or maybe all of the (at least
my) upper/lower confusion.

2023-07-03 21:44:09

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

On Mon, Jul 03, 2023 at 01:00:13PM +0200, Claudio Imbrenda wrote:
> On Fri, 30 Jun 2023 12:22:43 -0700 (PDT)
> Hugh Dickins <[email protected]> wrote:
>
> [...]
>
> > That's something I would have expected to be handled already via
> > mmu_notifiers, rather than buried inside the page table freeing.
> >
> > If s390 is the only architecture to go that way, and could instead do
> > it via mmu_notifiers, then I think that will be more easily supported
> > in the long term.
>
> I am very well aware of that, and in fact I've been working on
> exactly that for some time already. But it's a very complex minefield
> and therefore I'm proceeding *very* carefully. It will take quite some
> time before anything comes out.

Yes +1 on this please, creating your own arch cross-connect with KVM
in the page table freers is really horrible.

Jason

2023-07-04 13:48:30

by Alexander Gordeev

[permalink] [raw]
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

On Sat, Jul 01, 2023 at 09:32:38PM -0700, Hugh Dickins wrote:
> On Thu, 29 Jun 2023, Hugh Dickins wrote:

Hi Hugh,

...
> No, not quite the same rules as before: I came to realize that using
> list_add_tail() for the HH pages would be liable to put a page on the
> list which forever blocked reuse of PP list_add_tail() pages after it
> (could be solved by a list_move() somewhere, but we have agreed to
> prefer simplicity).

Just to make things more clear for me: do I understand correctly that this
was an attempt to add HH fragments to pgtable_list from pte_free_defer()?

Thanks!

2023-07-04 15:38:56

by Gerald Schaefer

[permalink] [raw]
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

On Sat, 1 Jul 2023 21:32:38 -0700 (PDT)
Hugh Dickins <[email protected]> wrote:

> On Thu, 29 Jun 2023, Hugh Dickins wrote:
> >
> > I've grown to dislike the (ab)use of pt_frag_refcount even more, to the
> > extent that I've not even tried to verify it; but I think I do get the
> > point now, that we need further info than just PPHHAA to know whether
> > the page is on the list or not. But I think that if we move where the
> > call_rcu() is done, then the page can stay on or off the list by same
> > rules as before (but need to check HH bits along with PP when deciding
> > whether to allocate, and whether to list_add_tail() when freeing).
>
> No, not quite the same rules as before: I came to realize that using
> list_add_tail() for the HH pages would be liable to put a page on the
> list which forever blocked reuse of PP list_add_tail() pages after it
> (could be solved by a list_move() somewhere, but we have agreed to
> prefer simplicity).
>
> I've dropped the HH bits, I'm using PageActive like we did on powerpc,
> I've dropped most of the pte_free_*() helpers, and list_del_init() is
> an easier way of dealing with those "is it on the list" questions.
> I expect that we shall be close to reaching agreement on...

This looks really nice, almost too good and easy to be true. I did not
find any obvious flaw, just some comments below. It also survived LTP
without any visible havoc, so I guess this approach is the best so far.

>
> [PATCH v? 07/12] s390: add pte_free_defer() for pgtables sharing page
>
> Add s390-specific pte_free_defer(), to free table page via call_rcu().
> pte_free_defer() will be called inside khugepaged's retract_page_tables()
> loop, where allocating extra memory cannot be relied upon. This precedes
> the generic version to avoid build breakage from incompatible pgtable_t.
>
> This version is more complicated than others: because s390 fits two 2K
> page tables into one 4K page (so page->rcu_head must be shared between
> both halves), and already uses page->lru (which page->rcu_head overlays)
> to list any free halves; with clever management by page->_refcount bits.
>
> Build upon the existing management, adjusted to follow a new rule: that
> a page is never on the free list if pte_free_defer() was used on either
> half (marked by PageActive). And for simplicity, delay calling RCU until
> both halves are freed.
>
> Not adding back unallocated fragments to the list in pte_free_defer()
> can result in wasting some amount of memory for pagetables, depending
> on how long the allocated fragment will stay in use. In practice, this
> effect is expected to be insignificant, and not justify a far more
> complex approach, which might allow to add the fragments back later
> in __tlb_remove_table(), where we might not have a stable mm any more.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> arch/s390/include/asm/pgalloc.h | 4 ++
> arch/s390/mm/pgalloc.c | 75 +++++++++++++++++++++++++++------
> 2 files changed, 67 insertions(+), 12 deletions(-)
>
> diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
> index 17eb618f1348..89a9d5ef94f8 100644
> --- a/arch/s390/include/asm/pgalloc.h
> +++ b/arch/s390/include/asm/pgalloc.h
> @@ -143,6 +143,10 @@ static inline void pmd_populate(struct mm_struct *mm,
> #define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
> #define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)
>
> +/* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */
> +#define pte_free_defer pte_free_defer
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
> +
> void vmem_map_init(void);
> void *vmem_crst_alloc(unsigned long val);
> pte_t *vmem_pte_alloc(void);
> diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
> index 66ab68db9842..fd0c4312da16 100644
> --- a/arch/s390/mm/pgalloc.c
> +++ b/arch/s390/mm/pgalloc.c
> @@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
> * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
> * while the PP bits are never used, nor such a page is added to or removed
> * from mm_context_t::pgtable_list.
> + *
> + * pte_free_defer() overrides those rules: it takes the page off pgtable_list,
> + * and prevents both 2K fragments from being reused. pte_free_defer() has to
> + * guarantee that its pgtable cannot be reused before the RCU grace period
> + * has elapsed (which page_table_free_rcu() does not actually guarantee).

Hmm, I think page_table_free_rcu() has to guarantee the same, i.e. not
allow reuse before grace period elapsed. And I hope that it does so, by
setting the PP bits, which would be noticed in page_table_alloc(), in
case the page would be seen there.

Unlike pte_free_defer(), page_table_free_rcu() would add pages back to the
end of the list, and so they could be seen in page_table_alloc(), but they
should not be reused before grace period elapsed and __tlb_remove_table()
cleared the PP bits, as far as I understand.

So what exactly do you mean with "which page_table_free_rcu() does not actually
guarantee"?

> + * But for simplicity, because page->rcu_head overlays page->lru, and because
> + * the RCU callback might not be called before the mm_context_t has been freed,
> + * pte_free_defer() in this implementation prevents both fragments from being
> + * reused, and delays making the call to RCU until both fragments are freed.
> */
> unsigned long *page_table_alloc(struct mm_struct *mm)
> {
> @@ -261,7 +270,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
> table += PTRS_PER_PTE;
> atomic_xor_bits(&page->_refcount,
> 0x01U << (bit + 24));
> - list_del(&page->lru);
> + list_del_init(&page->lru);
> }
> }
> spin_unlock_bh(&mm->context.lock);
> @@ -281,6 +290,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
> table = (unsigned long *) page_to_virt(page);
> if (mm_alloc_pgste(mm)) {
> /* Return 4K page table with PGSTEs */
> + INIT_LIST_HEAD(&page->lru);
> atomic_xor_bits(&page->_refcount, 0x03U << 24);
> memset64((u64 *)table, _PAGE_INVALID, PTRS_PER_PTE);
> memset64((u64 *)table + PTRS_PER_PTE, 0, PTRS_PER_PTE);
> @@ -300,7 +310,9 @@ static void page_table_release_check(struct page *page, void *table,
> {
> char msg[128];
>
> - if (!IS_ENABLED(CONFIG_DEBUG_VM) || !mask)
> + if (!IS_ENABLED(CONFIG_DEBUG_VM))
> + return;
> + if (!mask && list_empty(&page->lru))
> return;
> snprintf(msg, sizeof(msg),
> "Invalid pgtable %p release half 0x%02x mask 0x%02x",
> @@ -308,6 +320,15 @@ static void page_table_release_check(struct page *page, void *table,
> dump_page(page, msg);
> }
>
> +static void pte_free_now(struct rcu_head *head)
> +{
> + struct page *page;
> +
> + page = container_of(head, struct page, rcu_head);
> + pgtable_pte_page_dtor(page);
> + __free_page(page);
> +}
> +
> void page_table_free(struct mm_struct *mm, unsigned long *table)
> {
> unsigned int mask, bit, half;
> @@ -325,10 +346,17 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
> */
> mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
> mask >>= 24;
> - if (mask & 0x03U)
> + if ((mask & 0x03U) && !PageActive(page)) {
> + /*
> + * Other half is allocated, and neither half has had
> + * its free deferred: add page to head of list, to make
> + * this freed half available for immediate reuse.
> + */
> list_add(&page->lru, &mm->context.pgtable_list);
> - else
> - list_del(&page->lru);
> + } else {
> + /* If page is on list, now remove it. */
> + list_del_init(&page->lru);
> + }

Ok, we might end up with some unnecessary list_del_init() here, e.g. if
other half is still allocated, when called from pte_free_defer() on a
fully allocated page, which was not on the list (and with PageActive, and
(mask & 0x03U) true).
Not sure if adding an additional mask check to the else path would be
needed, but it seems that list_del_init() should also be able to handle
this.

Same thought applies to the similar logic in page_table_free_rcu()
below.

> spin_unlock_bh(&mm->context.lock);
> mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
> mask >>= 24;
> @@ -342,8 +370,10 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
> }
>
> page_table_release_check(page, table, half, mask);
> - pgtable_pte_page_dtor(page);
> - __free_page(page);
> + if (TestClearPageActive(page))
> + call_rcu(&page->rcu_head, pte_free_now);
> + else
> + pte_free_now(&page->rcu_head);

This ClearPageActive, and the similar thing in __tlb_remove_table() below,
worries me a bit, because it is done outside the spin_lock. It "feels" like
there could be some race with the PageActive checks inside the spin_lock,
but when drawing some pictures, I could not find any such scenario yet.
Also, our existing spin_lock is probably not supposed to protect against
PageActive changes anyway, right?

> }
>
> void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
> @@ -370,10 +400,18 @@ void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
> */
> mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
> mask >>= 24;
> - if (mask & 0x03U)
> + if ((mask & 0x03U) && !PageActive(page)) {
> + /*
> + * Other half is allocated, and neither half has had
> + * its free deferred: add page to end of list, to make
> + * this freed half available for reuse once its pending
> + * bit has been cleared by __tlb_remove_table().
> + */
> list_add_tail(&page->lru, &mm->context.pgtable_list);
> - else
> - list_del(&page->lru);
> + } else {
> + /* If page is on list, now remove it. */
> + list_del_init(&page->lru);
> + }
> spin_unlock_bh(&mm->context.lock);
> table = (unsigned long *) ((unsigned long) table | (0x01U << bit));
> tlb_remove_table(tlb, table);
> @@ -403,10 +441,23 @@ void __tlb_remove_table(void *_table)
> }
>
> page_table_release_check(page, table, half, mask);
> - pgtable_pte_page_dtor(page);
> - __free_page(page);
> + if (TestClearPageActive(page))
> + call_rcu(&page->rcu_head, pte_free_now);
> + else
> + pte_free_now(&page->rcu_head);
> }
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> +{
> + struct page *page;
> +
> + page = virt_to_page(pgtable);
> + SetPageActive(page);
> + page_table_free(mm, (unsigned long *)pgtable);
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> /*
> * Base infrastructure required to generate basic asces, region, segment,
> * and page tables that do not make use of enhanced features like EDAT1.


2023-07-04 16:14:29

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

On Tue, 4 Jul 2023, Alexander Gordeev wrote:
> On Sat, Jul 01, 2023 at 09:32:38PM -0700, Hugh Dickins wrote:
> > On Thu, 29 Jun 2023, Hugh Dickins wrote:
>
> Hi Hugh,
>
> ...
> > No, not quite the same rules as before: I came to realize that using
> > list_add_tail() for the HH pages would be liable to put a page on the
> > list which forever blocked reuse of PP list_add_tail() pages after it
> > (could be solved by a list_move() somewhere, but we have agreed to
> > prefer simplicity).
>
> Just to make things more clear for me: do I understand correctly that this
> was an attempt to add HH fragments to pgtable_list from pte_free_defer()?

Yes, from page_table_free() called from pte_free_defer(): I had claimed
they could be put on the list (or not) without needing to consider their
HH-ness, apart from wanting to list_add_tail() rather than list_add() them.

But then realized that this category of list_add_tail() pages would block
access to the others.

But I think I was mistaken then to say "could be solved by a list_move()
somewhere"; because "somewhere" would have had to be __tlb_remove_table()
when it removes PP-bits, which would bring us back to the issues of
getting a spinlock from an mm which might already be freed.

Hugh

2023-07-04 17:11:15

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

On Tue, 4 Jul 2023, Gerald Schaefer wrote:
> On Sat, 1 Jul 2023 21:32:38 -0700 (PDT)
> Hugh Dickins <[email protected]> wrote:
> > On Thu, 29 Jun 2023, Hugh Dickins wrote:
> > >
> > > I've grown to dislike the (ab)use of pt_frag_refcount even more, to the
> > > extent that I've not even tried to verify it; but I think I do get the
> > > point now, that we need further info than just PPHHAA to know whether
> > > the page is on the list or not. But I think that if we move where the
> > > call_rcu() is done, then the page can stay on or off the list by same
> > > rules as before (but need to check HH bits along with PP when deciding
> > > whether to allocate, and whether to list_add_tail() when freeing).
> >
> > No, not quite the same rules as before: I came to realize that using
> > list_add_tail() for the HH pages would be liable to put a page on the
> > list which forever blocked reuse of PP list_add_tail() pages after it
> > (could be solved by a list_move() somewhere, but we have agreed to
> > prefer simplicity).
> >
> > I've dropped the HH bits, I'm using PageActive like we did on powerpc,
> > I've dropped most of the pte_free_*() helpers, and list_del_init() is
> > an easier way of dealing with those "is it on the list" questions.
> > I expect that we shall be close to reaching agreement on...
>
> This looks really nice, almost too good and easy to be true. I did not
> find any obvious flaw, just some comments below. It also survived LTP
> without any visible havoc, so I guess this approach is the best so far.

Phew! I'm of course glad to hear this: thanks for your efforts on it.

...
> > --- a/arch/s390/mm/pgalloc.c
> > +++ b/arch/s390/mm/pgalloc.c
> > @@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
> > * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
> > * while the PP bits are never used, nor such a page is added to or removed
> > * from mm_context_t::pgtable_list.
> > + *
> > + * pte_free_defer() overrides those rules: it takes the page off pgtable_list,
> > + * and prevents both 2K fragments from being reused. pte_free_defer() has to
> > + * guarantee that its pgtable cannot be reused before the RCU grace period
> > + * has elapsed (which page_table_free_rcu() does not actually guarantee).
>
> Hmm, I think page_table_free_rcu() has to guarantee the same, i.e. not
> allow reuse before grace period elapsed. And I hope that it does so, by
> setting the PP bits, which would be noticed in page_table_alloc(), in
> case the page would be seen there.
>
> Unlike pte_free_defer(), page_table_free_rcu() would add pages back to the
> end of the list, and so they could be seen in page_table_alloc(), but they
> should not be reused before grace period elapsed and __tlb_remove_table()
> cleared the PP bits, as far as I understand.
>
> So what exactly do you mean with "which page_table_free_rcu() does not actually
> guarantee"?

I'll answer without locating and re-reading what Jason explained earlier,
perhaps in a separate thread, about pseudo-RCU-ness in tlb_remove_table():
he may have explained it better. And without working out again all the
MMU_GATHER #defines, and which of them do and do not apply to s390 here.

The detail that sticks in my mind is the fallback in tlb_remove_table()
in mm/mmu_gather.c: if its __get_free_page(GFP_NOWAIT) fails, it cannot
batch the tables for freeing by RCU, and resorts instead to an immediate
TLB flush (I think: that again involves chasing definitions) followed by
tlb_remove_table_sync_one() - which just delivers an interrupt to each CPU,
and is commented:
/*
* This isn't an RCU grace period and hence the page-tables cannot be
* assumed to be actually RCU-freed.
*
* It is however sufficient for software page-table walkers that rely on
* IRQ disabling.
*/
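
The surrounding code, from memory and simplified (with tlb_remove_table_one()
inlined), is roughly:

	struct mmu_table_batch **batch = &tlb->batch;

	if (*batch == NULL) {
		*batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT);
		if (*batch == NULL) {
			tlb_table_invalidate(tlb);	/* immediate TLB flush */
			tlb_remove_table_sync_one();	/* IPI to each CPU: not an RCU grace period */
			__tlb_remove_table(table);	/* table freed right away */
			return;
		}
		(*batch)->nr = 0;
	}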

Whether that's good for your PP pages or not, I've given no thought:
I've just taken it on trust that what s390 has working today is good.

If that __get_free_page(GFP_NOWAIT) fallback instead used call_rcu(),
then I would not have written "(which page_table_free_rcu() does not
actually guarantee)". But it cannot use call_rcu() because it does
not have an rcu_head to work with - it's in some generic code, and
there is no MMU_GATHER_CAN_USE_PAGE_RCU_HEAD for architectures to set.

And Jason would have much preferred us to address the issue from that
angle; but not only would doing so destroy my sanity, I'd also destroy
20 architectures' TLB-flushing, unbuilt and untested, in the attempt.

...
> > @@ -325,10 +346,17 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
> > */
> > mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
> > mask >>= 24;
> > - if (mask & 0x03U)
> > + if ((mask & 0x03U) && !PageActive(page)) {
> > + /*
> > + * Other half is allocated, and neither half has had
> > + * its free deferred: add page to head of list, to make
> > + * this freed half available for immediate reuse.
> > + */
> > list_add(&page->lru, &mm->context.pgtable_list);
> > - else
> > - list_del(&page->lru);
> > + } else {
> > + /* If page is on list, now remove it. */
> > + list_del_init(&page->lru);
> > + }
>
> Ok, we might end up with some unnecessary list_del_init() here, e.g. if
> other half is still allocated, when called from pte_free_defer() on a
> fully allocated page, which was not on the list (and with PageActive, and
> (mask & 0x03U) true).
> Not sure if adding an additional mask check to the else path would be
> needed, but it seems that list_del_init() should also be able to handle
> this.

list_del_init() is very cheap in the unnecessary case: the cachelines
required are already there. You don't want a flag to say whether to
call it or not, it is already the efficient approach.
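
(For reference, minus the debug checks, list_del_init() is just:

	static inline void list_del_init(struct list_head *entry)
	{
		__list_del_entry(entry);	/* unlinking a self-pointing entry is a no-op */
		INIT_LIST_HEAD(entry);		/* leave it self-pointing, safe to repeat */
	}

which is why it only requires page->lru to have been initialized, never
that the page is actually on the list.)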

(But you were right not to use it in your pt_frag_refcount version,
because there we were still trying to do the call_rcu() per fragment
rather than per page, so page->lru could have been on the RCU queue.)

>
> Same thought applies to the similar logic in page_table_free_rcu()
> below.
>
> > spin_unlock_bh(&mm->context.lock);
> > mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
> > mask >>= 24;
> > @@ -342,8 +370,10 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
> > }
> >
> > page_table_release_check(page, table, half, mask);
> > - pgtable_pte_page_dtor(page);
> > - __free_page(page);
> > + if (TestClearPageActive(page))
> > + call_rcu(&page->rcu_head, pte_free_now);
> > + else
> > + pte_free_now(&page->rcu_head);
>
> This ClearPageActive, and the similar thing in __tlb_remove_table() below,
> worries me a bit, because it is done outside the spin_lock. It "feels" like
> there could be some race with the PageActive checks inside the spin_lock,
> but when drawing some pictures, I could not find any such scenario yet.
> Also, our existing spin_lock is probably not supposed to protect against
> PageActive changes anyway, right?

Here (and similarly in __tlb_remove_table()) is where we are about to free
the page table page: both of the fragments have already been released,
there is nobody left who could be racing against us to set PageActive.

I chose PageActive for its name, not for any special behaviour of that
flag: nothing else could be setting or clearing it while we own the page.

Hugh

2023-07-05 06:59:48

by Alexander Gordeev

[permalink] [raw]
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

On Sat, Jul 01, 2023 at 09:32:38PM -0700, Hugh Dickins wrote:
> On Thu, 29 Jun 2023, Hugh Dickins wrote:

Hi Hugh,

...

> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> +{
> + struct page *page;

If I got your and Claudio conversation right, you were going to add
here WARN_ON_ONCE() in case of mm_alloc_pgste(mm)?

> + page = virt_to_page(pgtable);
> + SetPageActive(page);
> + page_table_free(mm, (unsigned long *)pgtable);
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> /*
> * Base infrastructure required to generate basic asces, region, segment,
> * and page tables that do not make use of enhanced features like EDAT1.

Thanks!

2023-07-05 13:21:55

by Gerald Schaefer

[permalink] [raw]
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

On Tue, 4 Jul 2023 10:03:57 -0700 (PDT)
Hugh Dickins <[email protected]> wrote:

> On Tue, 4 Jul 2023, Gerald Schaefer wrote:
> > On Sat, 1 Jul 2023 21:32:38 -0700 (PDT)
> > Hugh Dickins <[email protected]> wrote:
> > > On Thu, 29 Jun 2023, Hugh Dickins wrote:
> > > >
> > > > I've grown to dislike the (ab)use of pt_frag_refcount even more, to the
> > > > extent that I've not even tried to verify it; but I think I do get the
> > > > point now, that we need further info than just PPHHAA to know whether
> > > > the page is on the list or not. But I think that if we move where the
> > > > call_rcu() is done, then the page can stay on or off the list by same
> > > > rules as before (but need to check HH bits along with PP when deciding
> > > > whether to allocate, and whether to list_add_tail() when freeing).
> > >
> > > No, not quite the same rules as before: I came to realize that using
> > > list_add_tail() for the HH pages would be liable to put a page on the
> > > list which forever blocked reuse of PP list_add_tail() pages after it
> > > (could be solved by a list_move() somewhere, but we have agreed to
> > > prefer simplicity).
> > >
> > > I've dropped the HH bits, I'm using PageActive like we did on powerpc,
> > > I've dropped most of the pte_free_*() helpers, and list_del_init() is
> > > an easier way of dealing with those "is it on the list" questions.
> > > I expect that we shall be close to reaching agreement on...
> >
> > This looks really nice, almost too good and easy to be true. I did not
> > find any obvious flaw, just some comments below. It also survived LTP
> > without any visible havoc, so I guess this approach is the best so far.
>
> Phew! I'm of course glad to hear this: thanks for your efforts on it.
>
> ...
> > > --- a/arch/s390/mm/pgalloc.c
> > > +++ b/arch/s390/mm/pgalloc.c
> > > @@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
> > > * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
> > > * while the PP bits are never used, nor such a page is added to or removed
> > > * from mm_context_t::pgtable_list.
> > > + *
> > > + * pte_free_defer() overrides those rules: it takes the page off pgtable_list,
> > > + * and prevents both 2K fragments from being reused. pte_free_defer() has to
> > > + * guarantee that its pgtable cannot be reused before the RCU grace period
> > > + * has elapsed (which page_table_free_rcu() does not actually guarantee).
> >
> > Hmm, I think page_table_free_rcu() has to guarantee the same, i.e. not
> > allow reuse before grace period elapsed. And I hope that it does so, by
> > setting the PP bits, which would be noticed in page_table_alloc(), in
> > case the page would be seen there.
> >
> > Unlike pte_free_defer(), page_table_free_rcu() would add pages back to the
> > end of the list, and so they could be seen in page_table_alloc(), but they
> > should not be reused before grace period elapsed and __tlb_remove_table()
> > cleared the PP bits, as far as I understand.
> >
> > So what exactly do you mean with "which page_table_free_rcu() does not actually
> > guarantee"?
>
> I'll answer without locating and re-reading what Jason explained earlier,
> perhaps in a separate thread, about pseudo-RCU-ness in tlb_remove_table():
> he may have explained it better. And without working out again all the
> MMU_GATHER #defines, and which of them do and do not apply to s390 here.
>
> The detail that sticks in my mind is the fallback in tlb_remove_table()

Ah ok, I was aware of that "semi-RCU" fallback logic in tlb_remove_table(),
but that is rather a generic issue, and not s390-specific. I thought you
meant some s390-oddity here, of which we have a lot, unfortunately...
Of course, we call tlb_remove_table() from our page_table_free_rcu(), so
I guess you could say that page_table_free_rcu() cannot guarantee what
tlb_remove_table() cannot guarantee.

Maybe change to "which page_table_free_rcu() does not actually guarantee,
by calling tlb_remove_table()", to make it clear that this is not a problem
of page_table_free_rcu() itself.

> in mm/mmu_gather.c: if its __get_free_page(GFP_NOWAIT) fails, it cannot
> batch the tables for freeing by RCU, and resorts instead to an immediate
> TLB flush (I think: that again involves chasing definitions) followed by
> tlb_remove_table_sync_one() - which just delivers an interrupt to each CPU,
> and is commented:
> /*
> * This isn't an RCU grace period and hence the page-tables cannot be
> * assumed to be actually RCU-freed.
> *
> * It is however sufficient for software page-table walkers that rely on
> * IRQ disabling.
> */
>
> Whether that's good for your PP pages or not, I've given no thought:
> I've just taken it on trust that what s390 has working today is good.

Yes, we should be fine with that, current code can be trusted :-)

>
> If that __get_free_page(GFP_NOWAIT) fallback instead used call_rcu(),
> then I would not have written "(which page_table_free_rcu() does not
> actually guarantee)". But it cannot use call_rcu() because it does
> not have an rcu_head to work with - it's in some generic code, and
> there is no MMU_GATHER_CAN_USE_PAGE_RCU_HEAD for architectures to set.
>
> And Jason would have much preferred us to address the issue from that
> angle; but not only would doing so destroy my sanity, I'd also destroy
> 20 architectures' TLB-flushing, unbuilt and untested, in the attempt.

Oh yes, if your changes had allowed getting rid of that "semi RCU"
logic, that would really have been a major boost in popularity, I guess. But
it probably is as it is because it is not so easily fixed...

>
> ...
> > > @@ -325,10 +346,17 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
> > > */
> > > mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
> > > mask >>= 24;
> > > - if (mask & 0x03U)
> > > + if ((mask & 0x03U) && !PageActive(page)) {
> > > + /*
> > > + * Other half is allocated, and neither half has had
> > > + * its free deferred: add page to head of list, to make
> > > + * this freed half available for immediate reuse.
> > > + */
> > > list_add(&page->lru, &mm->context.pgtable_list);
> > > - else
> > > - list_del(&page->lru);
> > > + } else {
> > > + /* If page is on list, now remove it. */
> > > + list_del_init(&page->lru);
> > > + }
> >
> > Ok, we might end up with some unnecessary list_del_init() here, e.g. if
> > other half is still allocated, when called from pte_free_defer() on a
> > fully allocated page, which was not on the list (and with PageActive, and
> > (mask & 0x03U) true).
> > Not sure if adding an additional mask check to the else path would be
> > needed, but it seems that list_del_init() should also be able to handle
> > this.
>
> list_del_init() is very cheap in the unnecessary case: the cachelines
> required are already there. You don't want a flag to say whether to
> call it or not, it is already the efficient approach.

Yes, I also see no functional issue here. Just thought that the extra
write could be avoided, e.g. by checking for list_empty() or mask first.
But I guess that is simply the benefit of list_del_init(), that you
don't have to check, at least if it is guaranteed that rcu_head is
never in use here.

Then maybe adjust the comment, because now it makes you wonder: when
you read (and understand) the code, you see that this list_del_init()
might also be called for pages not on the list.

>
> (But you were right not to use it in your pt_frag_refcount version,
> because there we were still trying to do the call_rcu() per fragment
> rather than per page, so page->lru could have been on the RCU queue.)

That is actually the one thing I am still trying to figure out, by drawing
pictures, i.e. whether we really, really never end up here in list_del_init(),
while using rcu_head, e.g. by racing PageActive.

>
> >
> > Same thought applies to the similar logic in page_table_free_rcu()
> > below.
> >
> > > spin_unlock_bh(&mm->context.lock);
> > > mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
> > > mask >>= 24;
> > > @@ -342,8 +370,10 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
> > > }
> > >
> > > page_table_release_check(page, table, half, mask);
> > > - pgtable_pte_page_dtor(page);
> > > - __free_page(page);
> > > + if (TestClearPageActive(page))
> > > + call_rcu(&page->rcu_head, pte_free_now);
> > > + else
> > > + pte_free_now(&page->rcu_head);
> >
> > This ClearPageActive, and the similar thing in __tlb_remove_table() below,
> > worries me a bit, because it is done outside the spin_lock. It "feels" like
> > there could be some race with the PageActive checks inside the spin_lock,
> > but when drawing some pictures, I could not find any such scenario yet.
> > Also, our existing spin_lock is probably not supposed to protect against
> > PageActive changes anyway, right?
>
> Here (and similarly in __tlb_remove_table()) is where we are about to free
> the page table page: both of the fragments have already been released,
> there is nobody left who could be racing against us to set PageActive.

Yes, that is what makes this approach so nice, i.e. no more checking
for HH bits or worrying about double call_rcu(), simply do the freeing
whenever the page is ready. At least in theory, still drawing pictures :-)

But this really looks very good to me, and also works with LTP no worse
than the other approaches.
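
To sum up the approach for anyone joining the thread here: the hunks above
add up to roughly the control flow below, condensed into one place. The
AA/PP/HH mask bookkeeping is elided and replaced by the made-up helpers
other_half_allocated() and whole_page_free(), so treat this purely as a
sketch of the flow, not as the actual arch/s390/mm/pgalloc.c code:

void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
{
        struct page *page = virt_to_page(pgtable);

        /* Mark the page: its final free must go through RCU. */
        SetPageActive(page);
        page_table_free(mm, (unsigned long *)pgtable);
}

void page_table_free(struct mm_struct *mm, unsigned long *table)
{
        struct page *page = virt_to_page(table);

        spin_lock_bh(&mm->context.lock);
        /* other_half_allocated() stands in for the real mask checks. */
        if (other_half_allocated(page) && !PageActive(page))
                list_add(&page->lru, &mm->context.pgtable_list);
        else
                list_del_init(&page->lru);      /* harmless if not on the list */
        spin_unlock_bh(&mm->context.lock);

        /* whole_page_free() stands in for the remaining mask bookkeeping. */
        if (!whole_page_free(page))
                return;

        if (TestClearPageActive(page))
                call_rcu(&page->rcu_head, pte_free_now);
        else
                pte_free_now(&page->rcu_head);
}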

2023-07-06 01:07:44

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

On Wed, 5 Jul 2023, Alexander Gordeev wrote:
> On Sat, Jul 01, 2023 at 09:32:38PM -0700, Hugh Dickins wrote:
> > On Thu, 29 Jun 2023, Hugh Dickins wrote:
>
> Hi Hugh,
>
> ...
>
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > +{
> > + struct page *page;
>
> If I got your and Claudio conversation right, you were going to add
> here WARN_ON_ONCE() in case of mm_alloc_pgste(mm)?

Well, Claudio approved, so I would have put it in, if we had stuck with
that version which had "if (mm_alloc_pgste(mm)) {" in pte_free_defer();
but once that went away, it became somewhat irrelevant... to me anyway.

But I don't mind adding it here, in the v3 I'll post when -rc1 is out,
if it might help you guys - there is some point, since pte_free_defer()
is a route which can usefully check for such a case, without confusion
from harmless traffic from immediate frees of just-in-case allocations.

But don't expect it to catch all such cases (if they exist): another
category of s390 page_table_free()s comes from the PageAnon
zap_deposited_table() in zap_huge_pmd(): those tables might or might
not have been exposed to userspace at some time in the past.

I'll add the WARN_ON_ONCE in pte_free_defer() (after checking that
WARN_ON_ONCE is the one we want - I get confused by all the different
flavours of WARN, and have to check the header file each time to be
sure of the syntax and semantics): but be aware that it won't be
checking all potential cases.

Hugh

>
> > + page = virt_to_page(pgtable);
> > + SetPageActive(page);
> > + page_table_free(mm, (unsigned long *)pgtable);
> > +}
> > +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> > +
> > /*
> > * Base infrastructure required to generate basic asces, region, segment,
> > * and page tables that do not make use of enhanced features like EDAT1.
>
> Thanks!

2023-07-06 01:29:59

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

On Wed, 5 Jul 2023, Gerald Schaefer wrote:
> On Tue, 4 Jul 2023 10:03:57 -0700 (PDT)
> Hugh Dickins <[email protected]> wrote:
> > On Tue, 4 Jul 2023, Gerald Schaefer wrote:
> > > On Sat, 1 Jul 2023 21:32:38 -0700 (PDT)
> > > Hugh Dickins <[email protected]> wrote:
> > > > On Thu, 29 Jun 2023, Hugh Dickins wrote:
> > ...
> > > > --- a/arch/s390/mm/pgalloc.c
> > > > +++ b/arch/s390/mm/pgalloc.c
> > > > @@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
> > > > * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
> > > > * while the PP bits are never used, nor such a page is added to or removed
> > > > * from mm_context_t::pgtable_list.
> > > > + *
> > > > + * pte_free_defer() overrides those rules: it takes the page off pgtable_list,
> > > > + * and prevents both 2K fragments from being reused. pte_free_defer() has to
> > > > + * guarantee that its pgtable cannot be reused before the RCU grace period
> > > > + * has elapsed (which page_table_free_rcu() does not actually guarantee).
> > >
> > > Hmm, I think page_table_free_rcu() has to guarantee the same, i.e. not
> > > allow reuse before grace period elapsed. And I hope that it does so, by
> > > setting the PP bits, which would be noticed in page_table_alloc(), in
> > > case the page would be seen there.
> > >
> > > Unlike pte_free_defer(), page_table_free_rcu() would add pages back to the
> > > end of the list, and so they could be seen in page_table_alloc(), but they
> > > should not be reused before grace period elapsed and __tlb_remove_table()
> > > cleared the PP bits, as far as I understand.
> > >
> > > So what exactly do you mean with "which page_table_free_rcu() does not actually
> > > guarantee"?
> >
> > I'll answer without locating and re-reading what Jason explained earlier,
> > perhaps in a separate thread, about pseudo-RCU-ness in tlb_remove_table():
> > he may have explained it better. And without working out again all the
> > MMU_GATHER #defines, and which of them do and do not apply to s390 here.
> >
> > The detail that sticks in my mind is the fallback in tlb_remove_table()
>
> Ah ok, I was aware of that "semi-RCU" fallback logic in tlb_remove_table(),
> but that is rather a generic issue, and not s390-specific.

Yes.

> I thought you
> meant some s390-oddity here, of which we have a lot, unfortunately...
> Of course, we call tlb_remove_table() from our page_table_free_rcu(), so
> I guess you could say that page_table_free_rcu() cannot guarantee what
> tlb_remove_table() cannot guarantee.
>
> Maybe change to "which page_table_free_rcu() does not actually guarantee,
> by calling tlb_remove_table()", to make it clear that this is not a problem
> of page_table_free_rcu() itself.

Okay - I'll rephrase slightly to avoid being sued by s390's lawyers :-)

>
> > in mm/mmu_gather.c: if its __get_free_page(GFP_NOWAIT) fails, it cannot
> > batch the tables for freeing by RCU, and resorts instead to an immediate
> > TLB flush (I think: that again involves chasing definitions) followed by
> > tlb_remove_table_sync_one() - which just delivers an interrupt to each CPU,
> > and is commented:
> > /*
> > * This isn't an RCU grace period and hence the page-tables cannot be
> > * assumed to be actually RCU-freed.
> > *
> > * It is however sufficient for software page-table walkers that rely on
> > * IRQ disabling.
> > */
> >
> > Whether that's good for your PP pages or not, I've given no thought:
> > I've just taken it on trust that what s390 has working today is good.
>
> Yes, we should be fine with that, current code can be trusted :-)

Glad to hear it :-) Yes, I think it's not actually relying on the "rcu"
implied by the function name.

>
> >
> > If that __get_free_page(GFP_NOWAIT) fallback instead used call_rcu(),
> > then I would not have written "(which page_table_free_rcu() does not
> > actually guarantee)". But it cannot use call_rcu() because it does
> > not have an rcu_head to work with - it's in some generic code, and
> > there is no MMU_GATHER_CAN_USE_PAGE_RCU_HEAD for architectures to set.
> >
> > And Jason would have much preferred us to address the issue from that
> > angle; but not only would doing so destroy my sanity, I'd also destroy
> 20 architectures' TLB-flushing, unbuilt and untested, in the attempt.
>
> Oh yes, if your changes would have allowed getting rid of that "semi RCU"
> logic, that would really be a major boost in popularity, I guess. But
> it probably is as it is, because it is not so easily fixed...

I'm hoping that this series might help stir someone else to get into that.

>
> >
> > ...
> > > > @@ -325,10 +346,17 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
> > > > */
> > > > mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
> > > > mask >>= 24;
> > > > - if (mask & 0x03U)
> > > > + if ((mask & 0x03U) && !PageActive(page)) {
> > > > + /*
> > > > + * Other half is allocated, and neither half has had
> > > > + * its free deferred: add page to head of list, to make
> > > > + * this freed half available for immediate reuse.
> > > > + */
> > > > list_add(&page->lru, &mm->context.pgtable_list);
> > > > - else
> > > > - list_del(&page->lru);
> > > > + } else {
> > > > + /* If page is on list, now remove it. */
> > > > + list_del_init(&page->lru);
> > > > + }
> > >
> > > Ok, we might end up with some unnecessary list_del_init() here, e.g. if
> > > other half is still allocated, when called from pte_free_defer() on a
> > > fully allocated page, which was not on the list (and with PageActive, and
> > > (mask & 0x03U) true).
> > > Not sure if adding an additional mask check to the else path would be
> > > needed, but it seems that list_del_init() should also be able to handle
> > > this.
> >
> > list_del_init() is very cheap in the unnecessary case: the cachelines
> > required are already there. You don't want a flag to say whether to
> > call it or not, it is already the efficient approach.
>
> Yes, I also see no functional issue here. Just thought that the extra
> write could be avoided, e.g. by checking for list_empty() or mask first.
> But I guess that is simply the benefit of list_del_init(), that you
> don't have to check, at least if it is guaranteed that rcu_head is
> never in use here.
>
> Then maybe adjust the comment, because now it makes you wonder: when
> you read (and understand) the code, you see that this list_del_init()
> might also be called for pages not on the list.

Sorry, I don't understand what clarification you're asking for there.
I thought
/* If page is on list, now remove it. */
list_del_init(&page->lru);
was a good enough comment.

(I certainly don't want to enumerate the cases when it is or is not
already on the list there, that would be misery; but I don't think
that's the adjustment you were asking for either.)

>
> >
> > (But you were right not to use it in your pt_frag_refcount version,
> > because there we were still trying to do the call_rcu() per fragment
> > rather than per page, so page->lru could have been on the RCU queue.)
>
> That is actually the one thing I am still trying to figure out, by drawing
> pictures, i.e. whether we really, really never end up here in list_del_init(),
> while using rcu_head, e.g. by racing PageActive.

There is no race with PageActive being seen when the table page is
finally to be freed (by RCU or not). But there is definitely a harmless
race with pte_free_defer()er of other half setting PageActive an instant
after page_table_free() checked PageActive here. So maybe this
page_table_free() does a list_add(), which the racer then list_del_init()s
when it gets the mm->context.lock; or maybe they both list_del_init().
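
(If it helps with the picture-drawing, here is a toy user-space model of that
interleaving: the fragment accounting replaced by a plain counter, PageActive
by an atomic flag, mm->context.lock by a mutex. It is only a model, not the
s390 code, but it does show that the final free always takes the RCU path,
whichever way the race goes:)

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t context_lock = PTHREAD_MUTEX_INITIALIZER;
static int halves_in_use = 2;           /* both 2K fragments start out allocated */
static atomic_bool page_active;         /* stands in for PageActive */

static void toy_page_table_free(bool defer)
{
        bool last;

        if (defer)                      /* pte_free_defer(): SetPageActive first */
                atomic_store(&page_active, true);

        pthread_mutex_lock(&context_lock);
        halves_in_use--;
        if (halves_in_use > 0 && !atomic_load(&page_active)) {
                /* would list_add(&page->lru, &mm->context.pgtable_list) */
        } else {
                /* would list_del_init(&page->lru), perhaps unnecessarily */
        }
        last = (halves_in_use == 0);
        pthread_mutex_unlock(&context_lock);

        if (!last)
                return;
        /* Final free: only one thread gets here, and nobody can set the flag now. */
        if (atomic_exchange(&page_active, false))
                puts("call_rcu(&page->rcu_head, pte_free_now)");
        else
                puts("pte_free_now(&page->rcu_head)");
}

static void *free_one_half(void *arg)
{
        (void)arg;
        toy_page_table_free(false);     /* plain page_table_free() of one half */
        return NULL;
}

static void *defer_other_half(void *arg)
{
        (void)arg;
        toy_page_table_free(true);      /* pte_free_defer() racer on the other half */
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, free_one_half, NULL);
        pthread_create(&b, NULL, defer_other_half, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;       /* always prints the call_rcu line, never the immediate one */
}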

>
> >
> > >
> > > Same thought applies to the similar logic in page_table_free_rcu()
> > > below.
> > >
> > > > spin_unlock_bh(&mm->context.lock);
> > > > mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
> > > > mask >>= 24;
> > > > @@ -342,8 +370,10 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
> > > > }
> > > >
> > > > page_table_release_check(page, table, half, mask);
> > > > - pgtable_pte_page_dtor(page);
> > > > - __free_page(page);
> > > > + if (TestClearPageActive(page))
> > > > + call_rcu(&page->rcu_head, pte_free_now);
> > > > + else
> > > > + pte_free_now(&page->rcu_head);
> > >
> > > This ClearPageActive, and the similar thing in __tlb_remove_table() below,
> > > worries me a bit, because it is done outside the spin_lock. It "feels" like
> > > there could be some race with the PageActive checks inside the spin_lock,
> > > but when drawing some pictures, I could not find any such scenario yet.
> > > Also, our existing spin_lock is probably not supposed to protect against
> > > PageActive changes anyway, right?
> >
> > Here (and similarly in __tlb_remove_table()) is where we are about to free
> > the page table page: both of the fragments have already been released,
> > there is nobody left who could be racing against us to set PageActive.
>
> Yes, that is what makes this approach so nice, i.e. no more checking
> for HH bits or worrying about double call_rcu(), simply do the freeing
> whenever the page is ready. At least in theory, still drawing pictures :-)

Please do keep drawing: and perhaps you can sell them afterwards :-)

>
> But this really looks very good to me, and also works with LTP no worse
> than the other approaches.

Great, thanks for all your help Gerald.

Hugh

2023-07-06 15:23:22

by Gerald Schaefer

[permalink] [raw]
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

On Wed, 5 Jul 2023 18:20:21 -0700 (PDT)
Hugh Dickins <[email protected]> wrote:

> On Wed, 5 Jul 2023, Gerald Schaefer wrote:
> > On Tue, 4 Jul 2023 10:03:57 -0700 (PDT)
> > Hugh Dickins <[email protected]> wrote:
> > > On Tue, 4 Jul 2023, Gerald Schaefer wrote:
> > > > On Sat, 1 Jul 2023 21:32:38 -0700 (PDT)
> > > > Hugh Dickins <[email protected]> wrote:
> > > > > On Thu, 29 Jun 2023, Hugh Dickins wrote:
> > > ...
> > > > > --- a/arch/s390/mm/pgalloc.c
> > > > > +++ b/arch/s390/mm/pgalloc.c
> > > > > @@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
> > > > > * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
> > > > > * while the PP bits are never used, nor such a page is added to or removed
> > > > > * from mm_context_t::pgtable_list.
> > > > > + *
> > > > > + * pte_free_defer() overrides those rules: it takes the page off pgtable_list,
> > > > > + * and prevents both 2K fragments from being reused. pte_free_defer() has to
> > > > > + * guarantee that its pgtable cannot be reused before the RCU grace period
> > > > > + * has elapsed (which page_table_free_rcu() does not actually guarantee).
> > > >
> > > > Hmm, I think page_table_free_rcu() has to guarantee the same, i.e. not
> > > > allow reuse before grace period elapsed. And I hope that it does so, by
> > > > setting the PP bits, which would be noticed in page_table_alloc(), in
> > > > case the page would be seen there.
> > > >
> > > > Unlike pte_free_defer(), page_table_free_rcu() would add pages back to the
> > > > end of the list, and so they could be seen in page_table_alloc(), but they
> > > > should not be reused before grace period elapsed and __tlb_remove_table()
> > > > cleared the PP bits, as far as I understand.
> > > >
> > > > So what exactly do you mean with "which page_table_free_rcu() does not actually
> > > > guarantee"?
> > >
> > > I'll answer without locating and re-reading what Jason explained earlier,
> > > perhaps in a separate thread, about pseudo-RCU-ness in tlb_remove_table():
> > > he may have explained it better. And without working out again all the
> > > MMU_GATHER #defines, and which of them do and do not apply to s390 here.
> > >
> > > The detail that sticks in my mind is the fallback in tlb_remove_table()
> >
> > Ah ok, I was aware of that "semi-RCU" fallback logic in tlb_remove_table(),
> > but that is rather a generic issue, and not s390-specific.
>
> Yes.
>
> > I thought you
> > meant some s390-oddity here, of which we have a lot, unfortunately...
> > Of course, we call tlb_remove_table() from our page_table_free_rcu(), so
> > I guess you could say that page_table_free_rcu() cannot guarantee what
> > tlb_remove_table() cannot guarantee.
> >
> > Maybe change to "which page_table_free_rcu() does not actually guarantee,
> > by calling tlb_remove_table()", to make it clear that this is not a problem
> > of page_table_free_rcu() itself.
>
> Okay - I'll rephrase slightly to avoid being sued by s390's lawyers :-)
>
> >
> > > in mm/mmu_gather.c: if its __get_free_page(GFP_NOWAIT) fails, it cannot
> > > batch the tables for freeing by RCU, and resorts instead to an immediate
> > > TLB flush (I think: that again involves chasing definitions) followed by
> > > tlb_remove_table_sync_one() - which just delivers an interrupt to each CPU,
> > > and is commented:
> > > /*
> > > * This isn't an RCU grace period and hence the page-tables cannot be
> > > * assumed to be actually RCU-freed.
> > > *
> > > * It is however sufficient for software page-table walkers that rely on
> > > * IRQ disabling.
> > > */
> > >
> > > Whether that's good for your PP pages or not, I've given no thought:
> > > I've just taken it on trust that what s390 has working today is good.
> >
> > Yes, we should be fine with that, current code can be trusted :-)
>
> Glad to hear it :-) Yes, I think it's not actually relying on the "rcu"
> implied by the function name.

Ah ok, now I get it. Never noticed that naming it "free_rcu" could be
misleading. It is only ever called from pte_free_tlb(), so always in that
"semi-RCU" context. If you just look at the name, you could expect this
to always free pagetables by RCU, which would be exactly what you need
for pte_free_defer(), and which of course cannot be guaranteed by our
page_table_free_rcu().

IOW, exactly what your comment says, and now I think it is actually fine
as it is :-)

I guess I am a bit lamebrained this week, due to early shift and not
enough sleep...

>
> >
> > >
> > > If that __get_free_page(GFP_NOWAIT) fallback instead used call_rcu(),
> > > then I would not have written "(which page_table_free_rcu() does not
> > > actually guarantee)". But it cannot use call_rcu() because it does
> > > not have an rcu_head to work with - it's in some generic code, and
> > > there is no MMU_GATHER_CAN_USE_PAGE_RCU_HEAD for architectures to set.
> > >
> > > And Jason would have much preferred us to address the issue from that
> > > angle; but not only would doing so destroy my sanity, I'd also destroy
> > > 20 architectures' TLB-flushing, unbuilt and untested, in the attempt.
> >
> > Oh yes, if your changes would have allowed getting rid of that "semi RCU"
> > logic, that would really be a major boost in popularity, I guess. But
> > it probably is as it is, because it is not so easily fixed...
>
> I'm hoping that this series might help stir someone else to get into that.
>
> >
> > >
> > > ...
> > > > > @@ -325,10 +346,17 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
> > > > > */
> > > > > mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
> > > > > mask >>= 24;
> > > > > - if (mask & 0x03U)
> > > > > + if ((mask & 0x03U) && !PageActive(page)) {
> > > > > + /*
> > > > > + * Other half is allocated, and neither half has had
> > > > > + * its free deferred: add page to head of list, to make
> > > > > + * this freed half available for immediate reuse.
> > > > > + */
> > > > > list_add(&page->lru, &mm->context.pgtable_list);
> > > > > - else
> > > > > - list_del(&page->lru);
> > > > > + } else {
> > > > > + /* If page is on list, now remove it. */
> > > > > + list_del_init(&page->lru);
> > > > > + }
> > > >
> > > > Ok, we might end up with some unnecessary list_del_init() here, e.g. if
> > > > other half is still allocated, when called from pte_free_defer() on a
> > > > fully allocated page, which was not on the list (and with PageActive, and
> > > > (mask & 0x03U) true).
> > > > Not sure if adding an additional mask check to the else path would be
> > > > needed, but it seems that list_del_init() should also be able to handle
> > > > this.
> > >
> > > list_del_init() is very cheap in the unnecessary case: the cachelines
> > > required are already there. You don't want a flag to say whether to
> > > call it or not, it is already the efficient approach.
> >
> > Yes, I also see no functional issue here. Just thought that the extra
> > write could be avoided, e.g. by checking for list_empty() or mask first.
> > But I guess that is simply the benefit of list_del_init(), that you
> > don't have to check, at least if it is guaranteed that rcu_head is
> > never in use here.
> >
> > Then maybe adjust the comment, because now it makes you wonder: when
> > you read (and understand) the code, you see that this list_del_init()
> > might also be called for pages not on the list.
>
> Sorry, I don't understand what clarification you're asking for there.
> I thought
> /* If page is on list, now remove it. */
> list_del_init(&page->lru);
> was a good enough comment.
>
> (I certainly don't want to enumerate the cases when it is or is not
> already on the list there, that would be misery; but I don't think
> that's the adjustment you were asking for either.)

I was misled by the comment saying "If page is on the list", in an
else path where we also end up for pages not on the list any more.
I guess I would have added something like "it is also ok to do
list_del_init() here for pages not on the list". But thinking again,
that would probably just be a reminder of how list_del_init() works,
which should be obvious anyway, at least for people with enough sleep.
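
As an aside, for anyone who wants to convince themselves without opening
include/linux/list.h: here is a throwaway user-space model of that behaviour,
a minimal re-implementation for illustration only, not the kernel code itself.
The caveat from earlier in the thread still applies, i.e. this is only safe
while page->lru / rcu_head is not otherwise in use.

#include <assert.h>

struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h; h->prev = h; }

static void list_add(struct list_head *new, struct list_head *head)
{
        new->next = head->next;
        new->prev = head;
        head->next->prev = new;
        head->next = new;
}

static void list_del_init(struct list_head *entry)
{
        /* Unlink (a no-op when entry already points at itself) ... */
        entry->prev->next = entry->next;
        entry->next->prev = entry->prev;
        /* ... then re-initialize, so a later list_del_init() stays harmless. */
        INIT_LIST_HEAD(entry);
}

int main(void)
{
        struct list_head list, page_lru;

        INIT_LIST_HEAD(&list);
        INIT_LIST_HEAD(&page_lru);

        list_add(&page_lru, &list);
        list_del_init(&page_lru);       /* page now off the list, self-pointing */
        list_del_init(&page_lru);       /* "unnecessary" call: same two writes, no harm */
        assert(list.next == &list && list.prev == &list);
        assert(page_lru.next == &page_lru && page_lru.prev == &page_lru);
        return 0;
}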

>
> >
> > >
> > > (But you were right not to use it in your pt_frag_refcount version,
> > > because there we were still trying to do the call_rcu() per fragment
> > > rather than per page, so page->lru could have been on the RCU queue.)
> >
> > That is actually the one thing I still try to figure out, by drawing
> > pictures, i.e. if we really really never end up here on list_del_init(),
> > while using rcu_head, e.g. by racing PageActive.
>
> There is no race with PageActive being seen when the table page is
> finally to be freed (by RCU or not). But there is definitely a harmless
> race with pte_free_defer()er of other half setting PageActive an instant
> after page_table_free() checked PageActive here. So maybe this
> page_table_free() does a list_add(), which the racer then list_del_init()s
> when it gets the mm->context.lock; or maybe they both list_del_init().

Agree.

Since none of my remarks on the comments seem valid or strictly necessary
any more, and I also could not find functional issues, I think you can add
this patch as a new version for 07/12. And I can now give you this:

Reviewed-by: Gerald Schaefer <[email protected]>

2023-07-06 20:07:05

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

On Thu, 6 Jul 2023, Gerald Schaefer wrote:
>
> Since none of my remarks on the comments seem valid or strictly necessary
> any more, and I also could not find functional issues, I think you can add
> this patch as a new version for 07/12. And I can now give you this:
>
> Reviewed-by: Gerald Schaefer <[email protected]>

Great, thanks a lot Gerald.
The one change I'm making to it is then this merged in:

--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -455,6 +455,11 @@ void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
page = virt_to_page(pgtable);
SetPageActive(page);
page_table_free(mm, (unsigned long *)pgtable);
+ /*
+ * page_table_free() does not do the pgste gmap_unlink() which
+ * page_table_free_rcu() does: warn us if pgste ever reaches here.
+ */
+ WARN_ON_ONCE(mm_alloc_pgste(mm));
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
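
(Assembling the earlier hunk together with this one, purely for readability,
the resulting function should then read roughly as follows:)

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
{
        struct page *page;

        page = virt_to_page(pgtable);
        SetPageActive(page);
        page_table_free(mm, (unsigned long *)pgtable);
        /*
         * page_table_free() does not do the pgste gmap_unlink() which
         * page_table_free_rcu() does: warn us if pgste ever reaches here.
         */
        WARN_ON_ONCE(mm_alloc_pgste(mm));
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */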


2023-07-07 15:08:37

by Gerald Schaefer

[permalink] [raw]
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

On Wed, 5 Jul 2023 17:52:40 -0700 (PDT)
Hugh Dickins <[email protected]> wrote:

> On Wed, 5 Jul 2023, Alexander Gordeev wrote:
> > On Sat, Jul 01, 2023 at 09:32:38PM -0700, Hugh Dickins wrote:
> > > On Thu, 29 Jun 2023, Hugh Dickins wrote:
> >
> > Hi Hugh,
> >
> > ...
> >
> > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > > +{
> > > + struct page *page;
> >
> > If I got your and Claudio conversation right, you were going to add
> > here WARN_ON_ONCE() in case of mm_alloc_pgste(mm)?

Good point, thanks Alexander for noticing!

>
> Well, Claudio approved, so I would have put it in, if we had stuck with
> that version which had "if (mm_alloc_pgste(mm)) {" in pte_free_defer();
> but once that went away, it became somewhat irrelevant... to me anyway.
>
> But I don't mind adding it here, in the v3 I'll post when -rc1 is out,
> if it might help you guys - there is some point, since pte_free_defer()
> is a route which can usefully check for such a case, without confusion
> from harmless traffic from immediate frees of just-in-case allocations.
>
> But don't expect it to catch all such cases (if they exist): another
> category of s390 page_table_free()s comes from the PageAnon
> zap_deposited_table() in zap_huge_pmd(): those tables might or might
> not have been exposed to userspace at some time in the past.

Right, after THP collapse, the previously active PTE table would be
deposited in this case, and then later freed in zap_deposited_table().
I guess we need to be very careful if THP was ever enabled for KVM
guests.

>
> I'll add the WARN_ON_ONCE in pte_free_defer() (after checking that
> WARN_ON_ONCE is the one we want - I get confused by all the different
> flavours of WARN, and have to check the header file each time to be
> sure of the syntax and semantics): but be aware that it won't be
> checking all potential cases.

Thanks, looks good.

2023-07-10 17:43:48

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

On Wed, Jul 05, 2023 at 02:55:16PM +0200, Gerald Schaefer wrote:

> Ah ok, I was aware of that "semi-RCU" fallback logic in tlb_remove_table(),
> but that is rather a generic issue, and not s390-specific. I thought you
> meant some s390-oddity here, of which we have a lot, unfortunately...
> Of course, we call tlb_remove_table() from our page_table_free_rcu(), so
> I guess you could say that page_table_free_rcu() cannot guarantee what
> tlb_remove_table() cannot guarantee.

The issue is that the arches don't provide a reliable way to RCU-free
things, so the core code creates an RCU situation using the MMU batch,
with the non-RCU-compatible IPI fallback. So it isn't actually RCU: it
is IPI, but optimized with RCU in some cases.

When Hugh introduces a reliable way to RCU-free stuff, we could fall
back to that in the TLB code instead of invoking the synchronize_rcu().

For lots of arches, S390 included after this series, this would be
pretty easy.

What I see now as the big trouble is that this series only addresses
PTE RCU'ness, and making all the other levels RCUable would be much
harder on some arches like power.

In short, we could create a CONFIG_ARCH_RCU_SAFE_PAGEWALK and it could
be done on a lot of arches quite simply, but at least not power. Which
makes me wonder about the value, but maybe it could shame power into
doing something...

However, calling things 'page_table_free_rcu()' when it doesn't
actually always do RCU, but IPI-optimized RCU, is an unfortunate name :(
As long as you never assume it does RCU anywhere else, and don't use
rcu_read_lock(), it is fine :)

The corner case is narrow: you have to OOM the TLB batching before you
lose the RCU optimization of the IPI. Then you can notice that
rcu_read_lock() doesn't actually protect against concurrent free.

Jason
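
For anyone who has not chased those definitions: the batching path being
referred to lives in mm/mmu_gather.c under CONFIG_MMU_GATHER_RCU_TABLE_FREE,
and goes roughly as below. This is a paraphrase from memory of the v6.4
code, trimmed and re-commented for this discussion, so please check the real
source rather than quoting it from here.

static void tlb_remove_table_one(void *table)
{
        tlb_remove_table_sync_one();    /* IPI every CPU: not an RCU grace period */
        __tlb_remove_table(table);      /* arch hook frees the table right away */
}

void tlb_remove_table(struct mmu_gather *tlb, void *table)
{
        struct mmu_table_batch **batch = &tlb->batch;

        if (*batch == NULL) {
                *batch = (struct mmu_table_batch *)
                        __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
                if (*batch == NULL) {
                        /* Cannot batch: flush, then fall back to the IPI scheme */
                        tlb_table_invalidate(tlb);
                        tlb_remove_table_one(table);
                        return;
                }
                (*batch)->nr = 0;
        }

        (*batch)->tables[(*batch)->nr++] = table;
        if ((*batch)->nr == MAX_TABLE_BATCH)
                tlb_table_flush(tlb);   /* hands the full batch over to call_rcu() */
}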