2023-07-12 04:32:51

by Hugh Dickins

Subject: [PATCH v3 00/13] mm: free retracted page table by RCU

Here is v3 of the series of patches to mm (and a few architectures), based
on v6.5-rc1 which includes the preceding two series (thank you!): in which
khugepaged takes advantage of pte_offset_map[_lock]() allowing for pmd
transitions. Differences from v1 and v2 are noted patch by patch below.

This replaces the v2 "mm: free retracted page table by RCU"
https://lore.kernel.org/linux-mm/[email protected]/
series of 12 posted on 2023-06-20.

What is it all about? Some mmap_lock avoidance i.e. latency reduction.
Initially just for the case of collapsing shmem or file pages to THPs:
the usefulness of MADV_COLLAPSE on shmem is being limited by that
mmap_write_lock it currently requires.

Likely to be relied upon later in other contexts e.g. freeing of
empty page tables (but that's not work I'm doing). mmap_write_lock
avoidance when collapsing to anon THPs? Perhaps, but again that's not
work I've done: a quick attempt was not as easy as the shmem/file case.

These changes (though of course not these exact patches) have been in
Google's data centre kernel for three years now: we do rely upon them.

Based on v6.5-rc1; and almost good on current mm-unstable or current
linux-next - just one patch conflicts, the 12/13: I'll reply to that
one with its mm-unstable or linux-next equivalent (vma_assert_locked()
has been added next to where vma_try_start_write() is being removed).

01/13 mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s
v3: same as v1
02/13 mm/pgtable: add PAE safety to __pte_offset_map()
v3: same as v2
v2: rename to pmdp_get_lockless_start/end() per Matthew;
so use inlines without _irq_save(flags) macro oddity;
add pmdp_get_lockless_sync() for use later in 09/13.
03/13 arm: adjust_pte() use pte_offset_map_nolock()
v3: same as v1
04/13 powerpc: assert_pte_locked() use pte_offset_map_nolock()
v3: same as v1
05/13 powerpc: add pte_free_defer() for pgtables sharing page
v3: much simpler version, following suggestion by Jason
v2: fix rcu_head usage to cope with concurrent deferrals;
add para to commit message explaining rcu_head issue.
06/13 sparc: add pte_free_defer() for pte_t *pgtable_t
v3: same as v2
v2: use page_address() instead of less common page_to_virt();
add para to commit message explaining simple conversion;
changed title since sparc64 pgtables do not share page.
07/13 s390: add pte_free_defer() for pgtables sharing page
v3: much simpler version, following suggestion by Gerald
v2: complete rewrite, integrated with s390's existing pgtable
management; temporarily using a global mm_pgtable_list_lock,
to be restored to per-mm spinlock in a later followup patch.
08/13 mm/pgtable: add pte_free_defer() for pgtable as page
v3: same as v2
v2: add comment on rcu_head to "Page table pages", per JannH
09/13 mm/khugepaged: retract_page_tables() without mmap or vma lock
v3: same as v2
v2: repeat checks under ptl because UFFD, per PeterX and JannH;
bring back mmu_notifier calls for PMD, per JannH and Jason;
pmdp_get_lockless_sync() to issue missing interrupt if PAE.
10/13 mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
v3: updated to using ptent instead of *pte
v2: first check VMA, in case page tables torn down, per JannH;
pmdp_get_lockless_sync() to issue missing interrupt if PAE;
moved mmu_notifier after step 1, reworked final goto labels.
11/13 mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps()
v3: rediffed
v2: same as v1
12/13 mm: delete mmap_write_trylock() and vma_try_start_write()
v3: rediffed (different diff needed for mm-unstable or linux-next)
v2: same as v1
13/13 mm/pgtable: notes on pte_offset_map[_lock]()
v3: new: JannH asked for more helpful comment, this is my attempt;
could be moved to be the first in the series.

arch/arm/mm/fault-armv.c            |   3 +-
arch/powerpc/include/asm/pgalloc.h  |   4 +
arch/powerpc/mm/pgtable-frag.c      |  29 +-
arch/powerpc/mm/pgtable.c           |  16 +-
arch/s390/include/asm/pgalloc.h     |   4 +
arch/s390/mm/pgalloc.c              |  80 ++++-
arch/sparc/include/asm/pgalloc_64.h |   4 +
arch/sparc/mm/init_64.c             |  16 +
include/linux/mm.h                  |  17 --
include/linux/mm_types.h            |   4 +
include/linux/mmap_lock.h           |  10 -
include/linux/pgtable.h             |  10 +-
mm/khugepaged.c                     | 481 +++++++++++-------------------
mm/pgtable-generic.c                |  97 +++++-
14 files changed, 404 insertions(+), 371 deletions(-)

Hugh


2023-07-12 04:44:57

by Hugh Dickins

Subject: [PATCH v3 06/13] sparc: add pte_free_defer() for pte_t *pgtable_t

Add sparc-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon. This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

sparc32 supports pagetables sharing a page, but does not support THP;
sparc64 supports THP, but does not support pagetables sharing a page.
So the sparc-specific pte_free_defer() is as simple as the generic one,
except for converting between pte_t *pgtable_t and struct page *.

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/sparc/include/asm/pgalloc_64.h | 4 ++++
arch/sparc/mm/init_64.c | 16 ++++++++++++++++
2 files changed, 20 insertions(+)

diff --git a/arch/sparc/include/asm/pgalloc_64.h b/arch/sparc/include/asm/pgalloc_64.h
index 7b5561d17ab1..caa7632be4c2 100644
--- a/arch/sparc/include/asm/pgalloc_64.h
+++ b/arch/sparc/include/asm/pgalloc_64.h
@@ -65,6 +65,10 @@ pgtable_t pte_alloc_one(struct mm_struct *mm);
void pte_free_kernel(struct mm_struct *mm, pte_t *pte);
void pte_free(struct mm_struct *mm, pgtable_t ptepage);

+/* arch use pte_free_defer() implementation in arch/sparc/mm/init_64.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
#define pmd_populate_kernel(MM, PMD, PTE) pmd_set(MM, PMD, PTE)
#define pmd_populate(MM, PMD, PTE) pmd_set(MM, PMD, PTE)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 04f9db0c3111..0d7fd793924c 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2930,6 +2930,22 @@ void pgtable_free(void *table, bool is_page)
}

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static void pte_free_now(struct rcu_head *head)
+{
+ struct page *page;
+
+ page = container_of(head, struct page, rcu_head);
+ __pte_free((pgtable_t)page_address(page));
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+ struct page *page;
+
+ page = virt_to_page(pgtable);
+ call_rcu(&page->rcu_head, pte_free_now);
+}
+
void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd)
{
--
2.35.3


2023-07-12 04:45:22

by Hugh Dickins

Subject: [PATCH v3 03/13] arm: adjust_pte() use pte_offset_map_nolock()

Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
in adjust_pte(): because it gives the not-locked ptl for precisely that
pte, which the caller can then safely lock; whereas pte_lockptr() is not
so tightly coupled, because it dereferences the pmd pointer again.

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/arm/mm/fault-armv.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/arm/mm/fault-armv.c b/arch/arm/mm/fault-armv.c
index ca5302b0b7ee..7cb125497976 100644
--- a/arch/arm/mm/fault-armv.c
+++ b/arch/arm/mm/fault-armv.c
@@ -117,11 +117,10 @@ static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
* must use the nested version. This also means we need to
* open-code the spin-locking.
*/
- pte = pte_offset_map(pmd, address);
+ pte = pte_offset_map_nolock(vma->vm_mm, pmd, address, &ptl);
if (!pte)
return 0;

- ptl = pte_lockptr(vma->vm_mm, pmd);
do_pte_lock(ptl);

ret = do_adjust_pte(vma, address, pfn, pte);
--
2.35.3


2023-07-12 04:45:43

by Hugh Dickins

Subject: [PATCH v3 02/13] mm/pgtable: add PAE safety to __pte_offset_map()

There is a faint risk that __pte_offset_map(), on a 32-bit architecture
with a 64-bit pmd_t e.g. x86-32 with CONFIG_X86_PAE=y, would succeed on
a pmdval assembled from a pmd_low and a pmd_high which never belonged
together: their combination not pointing to a page table at all, perhaps
not even a valid pfn. pmdp_get_lockless() is not enough to prevent that.

Guard against that (on such configs) by local_irq_save() blocking TLB
flush between present updates, as linux/pgtable.h suggests. It's only
needed around the pmdp_get_lockless() in __pte_offset_map(): a race when
__pte_offset_map_lock() repeats the pmdp_get_lockless() after getting the
lock would just send it back to __pte_offset_map() again.

Complement this pair, pmdp_get_lockless_start() and pmdp_get_lockless_end(),
used only locally in __pte_offset_map(), with a pmdp_get_lockless_sync()
synonym for tlb_remove_table_sync_one(): to send the necessary interrupt
at the right moment on those configs which do not already send it.

CONFIG_GUP_GET_PXX_LOW_HIGH is enabled when required by mips, sh and x86.
It is not enabled by arm-32 CONFIG_ARM_LPAE: my understanding is that
Will Deacon's 2020 enhancements to READ_ONCE() are sufficient for arm.
It is not enabled by arc, but its pmd_t is 32-bit even when pte_t is 64-bit.

Limit the IRQ disablement to CONFIG_HIGHPTE? Perhaps, but that would need a
little more work: to retry if pmd_low looks good for a page table, but pmd_high
is non-zero from a THP (and that might be making x86-specific assumptions).
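
For reference (a condensed view of the __pte_offset_map() hunk below, not a
separate change), the guarded read becomes:

	rcu_read_lock();
	irqflags = pmdp_get_lockless_start();	/* local_irq_save() on such configs */
	pmdval = pmdp_get_lockless(pmd);
	pmdp_get_lockless_end(irqflags);	/* local_irq_restore(); no-op elsewhere */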

Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/pgtable.h | 4 ++++
mm/pgtable-generic.c | 29 +++++++++++++++++++++++++++++
2 files changed, 33 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5134edcec668..7f2db400f653 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -390,6 +390,7 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
return pmd;
}
#define pmdp_get_lockless pmdp_get_lockless
+#define pmdp_get_lockless_sync() tlb_remove_table_sync_one()
#endif /* CONFIG_PGTABLE_LEVELS > 2 */
#endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */

@@ -408,6 +409,9 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
{
return pmdp_get(pmdp);
}
+static inline void pmdp_get_lockless_sync(void)
+{
+}
#endif

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 400e5a045848..b9a0c2137cc1 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -232,12 +232,41 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
#endif
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

+#if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \
+ (defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RCU))
+/*
+ * See the comment above ptep_get_lockless() in include/linux/pgtable.h:
+ * the barriers in pmdp_get_lockless() cannot guarantee that the value in
+ * pmd_high actually belongs with the value in pmd_low; but holding interrupts
+ * off blocks the TLB flush between present updates, which guarantees that a
+ * successful __pte_offset_map() points to a page from matched halves.
+ */
+static unsigned long pmdp_get_lockless_start(void)
+{
+ unsigned long irqflags;
+
+ local_irq_save(irqflags);
+ return irqflags;
+}
+static void pmdp_get_lockless_end(unsigned long irqflags)
+{
+ local_irq_restore(irqflags);
+}
+#else
+static unsigned long pmdp_get_lockless_start(void) { return 0; }
+static void pmdp_get_lockless_end(unsigned long irqflags) { }
+#endif
+
pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
{
+ unsigned long irqflags;
pmd_t pmdval;

rcu_read_lock();
+ irqflags = pmdp_get_lockless_start();
pmdval = pmdp_get_lockless(pmd);
+ pmdp_get_lockless_end(irqflags);
+
if (pmdvalp)
*pmdvalp = pmdval;
if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
--
2.35.3


2023-07-12 04:47:01

by Hugh Dickins

Subject: [PATCH v3 08/13] mm/pgtable: add pte_free_defer() for pgtable as page

Add the generic pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon. This version
suits all those architectures which use an unfragmented page for one page
table (none of whose pte_free()s use the mm arg which was passed to it).
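
For illustration (not part of this patch), the intended caller-side use, as
retract_page_tables() does later in this series, is simply to substitute the
deferred free for an immediate pte_free() once the pmd entry has been cleared:

	pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
	/* ... drop the page table locks, end the mmu notifier range ... */
	mm_dec_nr_ptes(mm);
	pte_free_defer(mm, pmd_pgtable(pgt_pmd));	/* pte_free() after RCU grace period */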

Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/mm_types.h | 4 ++++
include/linux/pgtable.h | 2 ++
mm/pgtable-generic.c | 20 ++++++++++++++++++++
3 files changed, 26 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index de10fc797c8e..17a7868f00bd 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -144,6 +144,10 @@ struct page {
struct { /* Page table pages */
unsigned long _pt_pad_1; /* compound_head */
pgtable_t pmd_huge_pte; /* protected by page->ptl */
+ /*
+ * A PTE page table page might be freed by use of
+ * rcu_head: which overlays those two fields above.
+ */
unsigned long _pt_pad_2; /* mapping */
union {
struct mm_struct *pt_mm; /* x86 pgds only */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 7f2db400f653..9fa34be65159 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -112,6 +112,8 @@ static inline void pte_unmap(pte_t *pte)
}
#endif

+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
/* Find an entry in the second-level page table.. */
#ifndef pmd_offset
static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index b9a0c2137cc1..fa9d4d084291 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -13,6 +13,7 @@
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/mm_inline.h>
+#include <asm/pgalloc.h>
#include <asm/tlb.h>

/*
@@ -230,6 +231,25 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
return pmd;
}
#endif
+
+/* arch define pte_free_defer in asm/pgalloc.h for its own implementation */
+#ifndef pte_free_defer
+static void pte_free_now(struct rcu_head *head)
+{
+ struct page *page;
+
+ page = container_of(head, struct page, rcu_head);
+ pte_free(NULL /* mm not passed and not used */, (pgtable_t)page);
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+ struct page *page;
+
+ page = pgtable;
+ call_rcu(&page->rcu_head, pte_free_now);
+}
+#endif /* pte_free_defer */
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

#if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \
--
2.35.3


2023-07-12 04:49:25

by Hugh Dickins

Subject: [PATCH v3 04/13] powerpc: assert_pte_locked() use pte_offset_map_nolock()

Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
in assert_pte_locked(). BUG if pte_offset_map_nolock() fails: this is
stricter than the previous implementation, which skipped when pmd_none()
(with a comment on khugepaged collapse transitions): but wouldn't we want
to know, if an assert_pte_locked() caller can be racing such transitions?

This mod might cause new crashes: which either expose my ignorance, or
indicate issues to be fixed, or limit the usage of assert_pte_locked().

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/powerpc/mm/pgtable.c | 16 ++++++----------
1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index cb2dcdb18f8e..16b061af86d7 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -311,6 +311,8 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
+ pte_t *pte;
+ spinlock_t *ptl;

if (mm == &init_mm)
return;
@@ -321,16 +323,10 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
pud = pud_offset(p4d, addr);
BUG_ON(pud_none(*pud));
pmd = pmd_offset(pud, addr);
- /*
- * khugepaged to collapse normal pages to hugepage, first set
- * pmd to none to force page fault/gup to take mmap_lock. After
- * pmd is set to none, we do a pte_clear which does this assertion
- * so if we find pmd none, return.
- */
- if (pmd_none(*pmd))
- return;
- BUG_ON(!pmd_present(*pmd));
- assert_spin_locked(pte_lockptr(mm, pmd));
+ pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
+ BUG_ON(!pte);
+ assert_spin_locked(ptl);
+ pte_unmap(pte);
}
#endif /* CONFIG_DEBUG_VM */

--
2.35.3


2023-07-12 04:49:43

by Hugh Dickins

Subject: [PATCH v3 01/13] mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s

Before putting them to use (several commits later), add rcu_read_lock()
to pte_offset_map(), and rcu_read_unlock() to pte_unmap(). Make this a
separate commit, since it risks exposing imbalances: prior commits have
fixed all the known imbalances, but we may find some have been missed.
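
For illustration (not part of the patch), the balanced pairing now expected of
callers is simply:

	pte = pte_offset_map(pmd, addr);	/* takes rcu_read_lock() */
	if (!pte)
		return;			/* no page table at *pmd */
	/* ... lockless examination of the ptes ... */
	pte_unmap(pte);			/* does rcu_read_unlock() */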

Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/pgtable.h | 4 ++--
mm/pgtable-generic.c | 4 ++--
2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5063b482e34f..5134edcec668 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -99,7 +99,7 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
((pte_t *)kmap_local_page(pmd_page(*(pmd))) + pte_index((address)))
#define pte_unmap(pte) do { \
kunmap_local((pte)); \
- /* rcu_read_unlock() to be added later */ \
+ rcu_read_unlock(); \
} while (0)
#else
static inline pte_t *__pte_map(pmd_t *pmd, unsigned long address)
@@ -108,7 +108,7 @@ static inline pte_t *__pte_map(pmd_t *pmd, unsigned long address)
}
static inline void pte_unmap(pte_t *pte)
{
- /* rcu_read_unlock() to be added later */
+ rcu_read_unlock();
}
#endif

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 4d454953046f..400e5a045848 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -236,7 +236,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
{
pmd_t pmdval;

- /* rcu_read_lock() to be added later */
+ rcu_read_lock();
pmdval = pmdp_get_lockless(pmd);
if (pmdvalp)
*pmdvalp = pmdval;
@@ -250,7 +250,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
}
return __pte_map(&pmdval, addr);
nomap:
- /* rcu_read_unlock() to be added later */
+ rcu_read_unlock();
return NULL;
}

--
2.35.3


2023-07-12 04:58:04

by Hugh Dickins

Subject: [PATCH v3 11/13] mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps()

Now that retract_page_tables() can retract page tables reliably, without
depending on trylocks, delete all the apparatus for khugepaged to try
again later: khugepaged_collapse_pte_mapped_thps() etc; and free up the
per-mm memory which was set aside for that in the khugepaged_mm_slot.

But one part of that is worth keeping: when hpage_collapse_scan_file()
found SCAN_PTE_MAPPED_HUGEPAGE, that address was noted in the mm_slot
to be tried for retraction later - catching, for example, page tables
where a reversible mprotect() of a portion had required splitting the
pmd, but now it can be recollapsed. Call collapse_pte_mapped_thp()
directly in this case (why was it deferred before? I assume an issue
with needing mmap_lock for write, but now it's only needed for read).

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/khugepaged.c | 125 +++++++-------------------------------------------
1 file changed, 16 insertions(+), 109 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 46986eb4eebb..7c7aaddbe130 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -92,8 +92,6 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);

static struct kmem_cache *mm_slot_cache __read_mostly;

-#define MAX_PTE_MAPPED_THP 8
-
struct collapse_control {
bool is_khugepaged;

@@ -107,15 +105,9 @@ struct collapse_control {
/**
* struct khugepaged_mm_slot - khugepaged information per mm that is being scanned
* @slot: hash lookup from mm to mm_slot
- * @nr_pte_mapped_thp: number of pte mapped THP
- * @pte_mapped_thp: address array corresponding pte mapped THP
*/
struct khugepaged_mm_slot {
struct mm_slot slot;
-
- /* pte-mapped THP in this mm */
- int nr_pte_mapped_thp;
- unsigned long pte_mapped_thp[MAX_PTE_MAPPED_THP];
};

/**
@@ -1439,50 +1431,6 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
}

#ifdef CONFIG_SHMEM
-/*
- * Notify khugepaged that given addr of the mm is pte-mapped THP. Then
- * khugepaged should try to collapse the page table.
- *
- * Note that following race exists:
- * (1) khugepaged calls khugepaged_collapse_pte_mapped_thps() for mm_struct A,
- * emptying the A's ->pte_mapped_thp[] array.
- * (2) MADV_COLLAPSE collapses some file extent with target mm_struct B, and
- * retract_page_tables() finds a VMA in mm_struct A mapping the same extent
- * (at virtual address X) and adds an entry (for X) into mm_struct A's
- * ->pte-mapped_thp[] array.
- * (3) khugepaged calls khugepaged_collapse_scan_file() for mm_struct A at X,
- * sees a pte-mapped THP (SCAN_PTE_MAPPED_HUGEPAGE) and adds an entry
- * (for X) into mm_struct A's ->pte-mapped_thp[] array.
- * Thus, it's possible the same address is added multiple times for the same
- * mm_struct. Should this happen, we'll simply attempt
- * collapse_pte_mapped_thp() multiple times for the same address, under the same
- * exclusive mmap_lock, and assuming the first call is successful, subsequent
- * attempts will return quickly (without grabbing any additional locks) when
- * a huge pmd is found in find_pmd_or_thp_or_none(). Since this is a cheap
- * check, and since this is a rare occurrence, the cost of preventing this
- * "multiple-add" is thought to be more expensive than just handling it, should
- * it occur.
- */
-static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
- unsigned long addr)
-{
- struct khugepaged_mm_slot *mm_slot;
- struct mm_slot *slot;
- bool ret = false;
-
- VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
-
- spin_lock(&khugepaged_mm_lock);
- slot = mm_slot_lookup(mm_slots_hash, mm);
- mm_slot = mm_slot_entry(slot, struct khugepaged_mm_slot, slot);
- if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP)) {
- mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr;
- ret = true;
- }
- spin_unlock(&khugepaged_mm_lock);
- return ret;
-}
-
/* hpage must be locked, and mmap_lock must be held */
static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmdp, struct page *hpage)
@@ -1706,29 +1654,6 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
return result;
}

-static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_slot)
-{
- struct mm_slot *slot = &mm_slot->slot;
- struct mm_struct *mm = slot->mm;
- int i;
-
- if (likely(mm_slot->nr_pte_mapped_thp == 0))
- return;
-
- if (!mmap_write_trylock(mm))
- return;
-
- if (unlikely(hpage_collapse_test_exit(mm)))
- goto out;
-
- for (i = 0; i < mm_slot->nr_pte_mapped_thp; i++)
- collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i], false);
-
-out:
- mm_slot->nr_pte_mapped_thp = 0;
- mmap_write_unlock(mm);
-}
-
static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
{
struct vm_area_struct *vma;
@@ -2370,16 +2295,6 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
{
BUILD_BUG();
}
-
-static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_slot)
-{
-}
-
-static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
- unsigned long addr)
-{
- return false;
-}
#endif

static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
@@ -2409,7 +2324,6 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
khugepaged_scan.mm_slot = mm_slot;
}
spin_unlock(&khugepaged_mm_lock);
- khugepaged_collapse_pte_mapped_thps(mm_slot);

mm = slot->mm;
/*
@@ -2462,36 +2376,29 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
khugepaged_scan.address);

mmap_read_unlock(mm);
- *result = hpage_collapse_scan_file(mm,
- khugepaged_scan.address,
- file, pgoff, cc);
mmap_locked = false;
+ *result = hpage_collapse_scan_file(mm,
+ khugepaged_scan.address, file, pgoff, cc);
+ if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
+ mmap_read_lock(mm);
+ mmap_locked = true;
+ if (hpage_collapse_test_exit(mm)) {
+ fput(file);
+ goto breakouterloop;
+ }
+ *result = collapse_pte_mapped_thp(mm,
+ khugepaged_scan.address, false);
+ if (*result == SCAN_PMD_MAPPED)
+ *result = SCAN_SUCCEED;
+ }
fput(file);
} else {
*result = hpage_collapse_scan_pmd(mm, vma,
- khugepaged_scan.address,
- &mmap_locked,
- cc);
+ khugepaged_scan.address, &mmap_locked, cc);
}
- switch (*result) {
- case SCAN_PTE_MAPPED_HUGEPAGE: {
- pmd_t *pmd;

- *result = find_pmd_or_thp_or_none(mm,
- khugepaged_scan.address,
- &pmd);
- if (*result != SCAN_SUCCEED)
- break;
- if (!khugepaged_add_pte_mapped_thp(mm,
- khugepaged_scan.address))
- break;
- } fallthrough;
- case SCAN_SUCCEED:
+ if (*result == SCAN_SUCCEED)
++khugepaged_pages_collapsed;
- break;
- default:
- break;
- }

/* move to next address */
khugepaged_scan.address += HPAGE_PMD_SIZE;
--
2.35.3


2023-07-12 05:02:31

by Hugh Dickins

Subject: [PATCH v3 05/13] powerpc: add pte_free_defer() for pgtables sharing page

Add powerpc-specific pte_free_defer(), to free table page via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon. This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This is awkward because the struct page contains only one rcu_head, but
that page may be shared between PTE_FRAG_NR pagetables, each wanting to
use the rcu_head at the same time. But powerpc never reuses a fragment
once it has been freed: so mark the page Active in pte_free_defer(),
before calling pte_fragment_free() directly; and there use call_rcu() to
pte_free_now() when the last fragment is freed and the page is PageActive.
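
Condensed from the hunks below (shown only to summarize, not a further change),
the resulting flow is:

	/* pte_free_defer(): */
	SetPageActive(page);		/* never reuse: free whole page via RCU */
	pte_fragment_free((unsigned long *)pgtable, 0);

	/* pte_fragment_free(), when pt_frag_refcount drops to zero: */
	if (TestClearPageActive(page))
		call_rcu(&page->rcu_head, pte_free_now);
	else
		pte_free_now(&page->rcu_head);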

Suggested-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
---
arch/powerpc/include/asm/pgalloc.h | 4 ++++
arch/powerpc/mm/pgtable-frag.c | 29 ++++++++++++++++++++++++++---
2 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h
index 3360cad78ace..3a971e2a8c73 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -45,6 +45,10 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
pte_fragment_free((unsigned long *)ptepage, 0);
}

+/* arch use pte_free_defer() implementation in arch/powerpc/mm/pgtable-frag.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
/*
* Functions that deal with pagetables that could be at any level of
* the table need to be passed an "index_size" so they know how to
diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index 20652daa1d7e..0c6b68130025 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -106,6 +106,15 @@ pte_t *pte_fragment_alloc(struct mm_struct *mm, int kernel)
return __alloc_for_ptecache(mm, kernel);
}

+static void pte_free_now(struct rcu_head *head)
+{
+ struct page *page;
+
+ page = container_of(head, struct page, rcu_head);
+ pgtable_pte_page_dtor(page);
+ __free_page(page);
+}
+
void pte_fragment_free(unsigned long *table, int kernel)
{
struct page *page = virt_to_page(table);
@@ -115,8 +124,22 @@ void pte_fragment_free(unsigned long *table, int kernel)

BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
if (atomic_dec_and_test(&page->pt_frag_refcount)) {
- if (!kernel)
- pgtable_pte_page_dtor(page);
- __free_page(page);
+ if (kernel)
+ __free_page(page);
+ else if (TestClearPageActive(page))
+ call_rcu(&page->rcu_head, pte_free_now);
+ else
+ pte_free_now(&page->rcu_head);
}
}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+ struct page *page;
+
+ page = virt_to_page(pgtable);
+ SetPageActive(page);
+ pte_fragment_free((unsigned long *)pgtable, 0);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
--
2.35.3


2023-07-12 05:08:25

by Hugh Dickins

Subject: [PATCH v3 09/13] mm/khugepaged: retract_page_tables() without mmap or vma lock

Simplify shmem and file THP collapse's retract_page_tables(), and relax
its locking: to improve its success rate and to lessen impact on others.

Instead of its MADV_COLLAPSE case doing set_huge_pmd() at target_addr of
target_mm, leave that part of the work to madvise_collapse() calling
collapse_pte_mapped_thp() afterwards: just adjust collapse_file()'s
result code to arrange for that. That spares retract_page_tables() four
arguments; and since it will be successful in retracting all of the page
tables expected of it, no need to track and return a result code itself.

It needs i_mmap_lock_read(mapping) for traversing the vma interval tree,
but it does not need i_mmap_lock_write() for that: page_vma_mapped_walk()
allows for pte_offset_map_lock() etc to fail, and uses pmd_lock() for
THPs. retract_page_tables() just needs to use those same spinlocks to
exclude it briefly, while transitioning pmd from page table to none: so
restore its use of pmd_lock() inside of which pte lock is nested.

Users of pte_offset_map_lock() etc all now allow for them to fail:
so retract_page_tables() now has no use for mmap_write_trylock() or
vma_try_start_write(). In common with rmap and page_vma_mapped_walk(),
it does not even need the mmap_read_lock().

But those users do expect the page table to remain a good page table,
until they unlock and rcu_read_unlock(): so the page table cannot be
freed immediately, but rather by the recently added pte_free_defer().

Use the (usually a no-op) pmdp_get_lockless_sync() to send an interrupt
on PAE configs where pmdp_collapse_flush() did not already do so: to make sure
that the start,pmdp_get_lockless(),end sequence in __pte_offset_map()
cannot pick up a pmd entry with mismatched pmd_low and pmd_high.

retract_page_tables() can be enhanced to replace_page_tables(), which
inserts the final huge pmd without mmap lock: going through an invalid
state instead of pmd_none() followed by fault. But that enhancement
does raise some more questions: leave it until a later release.
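
Condensed from the patch below (a summary, not an additional change), the
per-vma sequence in retract_page_tables() becomes:

	mmu_notifier_invalidate_range_start(&range);
	pml = pmd_lock(mm, pmd);
	ptl = pte_lockptr(mm, pmd);
	if (ptl != pml)
		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
	/* recheck vma->anon_vma and userfaultfd_wp(vma) under the locks */
	pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
	pmdp_get_lockless_sync();
	/* unlock ptl and pml, mmu_notifier_invalidate_range_end(); and if
	 * the rechecks passed: */
	mm_dec_nr_ptes(mm);
	page_table_check_pte_clear_range(mm, addr, pgt_pmd);
	pte_free_defer(mm, pmd_pgtable(pgt_pmd));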

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/khugepaged.c | 184 ++++++++++++++++++++------------------------------
1 file changed, 75 insertions(+), 109 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 78c8d5d8b628..3bb05147961b 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1615,9 +1615,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
break;
case SCAN_PMD_NONE:
/*
- * In MADV_COLLAPSE path, possible race with khugepaged where
- * all pte entries have been removed and pmd cleared. If so,
- * skip all the pte checks and just update the pmd mapping.
+ * All pte entries have been removed and pmd cleared.
+ * Skip all the pte checks and just update the pmd mapping.
*/
goto maybe_install_pmd;
default:
@@ -1748,123 +1747,88 @@ static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_sl
mmap_write_unlock(mm);
}

-static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
- struct mm_struct *target_mm,
- unsigned long target_addr, struct page *hpage,
- struct collapse_control *cc)
+static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
{
struct vm_area_struct *vma;
- int target_result = SCAN_FAIL;

- i_mmap_lock_write(mapping);
+ i_mmap_lock_read(mapping);
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
- int result = SCAN_FAIL;
- struct mm_struct *mm = NULL;
- unsigned long addr = 0;
- pmd_t *pmd;
- bool is_target = false;
+ struct mmu_notifier_range range;
+ struct mm_struct *mm;
+ unsigned long addr;
+ pmd_t *pmd, pgt_pmd;
+ spinlock_t *pml;
+ spinlock_t *ptl;
+ bool skipped_uffd = false;

/*
* Check vma->anon_vma to exclude MAP_PRIVATE mappings that
- * got written to. These VMAs are likely not worth investing
- * mmap_write_lock(mm) as PMD-mapping is likely to be split
- * later.
- *
- * Note that vma->anon_vma check is racy: it can be set up after
- * the check but before we took mmap_lock by the fault path.
- * But page lock would prevent establishing any new ptes of the
- * page, so we are safe.
- *
- * An alternative would be drop the check, but check that page
- * table is clear before calling pmdp_collapse_flush() under
- * ptl. It has higher chance to recover THP for the VMA, but
- * has higher cost too. It would also probably require locking
- * the anon_vma.
+ * got written to. These VMAs are likely not worth removing
+ * page tables from, as PMD-mapping is likely to be split later.
*/
- if (READ_ONCE(vma->anon_vma)) {
- result = SCAN_PAGE_ANON;
- goto next;
- }
+ if (READ_ONCE(vma->anon_vma))
+ continue;
+
addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
if (addr & ~HPAGE_PMD_MASK ||
- vma->vm_end < addr + HPAGE_PMD_SIZE) {
- result = SCAN_VMA_CHECK;
- goto next;
- }
- mm = vma->vm_mm;
- is_target = mm == target_mm && addr == target_addr;
- result = find_pmd_or_thp_or_none(mm, addr, &pmd);
- if (result != SCAN_SUCCEED)
- goto next;
- /*
- * We need exclusive mmap_lock to retract page table.
- *
- * We use trylock due to lock inversion: we need to acquire
- * mmap_lock while holding page lock. Fault path does it in
- * reverse order. Trylock is a way to avoid deadlock.
- *
- * Also, it's not MADV_COLLAPSE's job to collapse other
- * mappings - let khugepaged take care of them later.
- */
- result = SCAN_PTE_MAPPED_HUGEPAGE;
- if ((cc->is_khugepaged || is_target) &&
- mmap_write_trylock(mm)) {
- /* trylock for the same lock inversion as above */
- if (!vma_try_start_write(vma))
- goto unlock_next;
-
- /*
- * Re-check whether we have an ->anon_vma, because
- * collapse_and_free_pmd() requires that either no
- * ->anon_vma exists or the anon_vma is locked.
- * We already checked ->anon_vma above, but that check
- * is racy because ->anon_vma can be populated under the
- * mmap lock in read mode.
- */
- if (vma->anon_vma) {
- result = SCAN_PAGE_ANON;
- goto unlock_next;
- }
- /*
- * When a vma is registered with uffd-wp, we can't
- * recycle the pmd pgtable because there can be pte
- * markers installed. Skip it only, so the rest mm/vma
- * can still have the same file mapped hugely, however
- * it'll always mapped in small page size for uffd-wp
- * registered ranges.
- */
- if (hpage_collapse_test_exit(mm)) {
- result = SCAN_ANY_PROCESS;
- goto unlock_next;
- }
- if (userfaultfd_wp(vma)) {
- result = SCAN_PTE_UFFD_WP;
- goto unlock_next;
- }
- collapse_and_free_pmd(mm, vma, addr, pmd);
- if (!cc->is_khugepaged && is_target)
- result = set_huge_pmd(vma, addr, pmd, hpage);
- else
- result = SCAN_SUCCEED;
-
-unlock_next:
- mmap_write_unlock(mm);
- goto next;
- }
- /*
- * Calling context will handle target mm/addr. Otherwise, let
- * khugepaged try again later.
- */
- if (!is_target) {
- khugepaged_add_pte_mapped_thp(mm, addr);
+ vma->vm_end < addr + HPAGE_PMD_SIZE)
continue;
+
+ mm = vma->vm_mm;
+ if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
+ continue;
+
+ if (hpage_collapse_test_exit(mm))
+ continue;
+ /*
+ * When a vma is registered with uffd-wp, we cannot recycle
+ * the page table because there may be pte markers installed.
+ * Other vmas can still have the same file mapped hugely, but
+ * skip this one: it will always be mapped in small page size
+ * for uffd-wp registered ranges.
+ */
+ if (userfaultfd_wp(vma))
+ continue;
+
+ /* PTEs were notified when unmapped; but now for the PMD? */
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
+ addr, addr + HPAGE_PMD_SIZE);
+ mmu_notifier_invalidate_range_start(&range);
+
+ pml = pmd_lock(mm, pmd);
+ ptl = pte_lockptr(mm, pmd);
+ if (ptl != pml)
+ spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+
+ /*
+ * Huge page lock is still held, so normally the page table
+ * must remain empty; and we have already skipped anon_vma
+ * and userfaultfd_wp() vmas. But since the mmap_lock is not
+ * held, it is still possible for a racing userfaultfd_ioctl()
+ * to have inserted ptes or markers. Now that we hold ptlock,
+ * repeating the anon_vma check protects from one category,
+ * and repeating the userfaultfd_wp() check from another.
+ */
+ if (unlikely(vma->anon_vma || userfaultfd_wp(vma))) {
+ skipped_uffd = true;
+ } else {
+ pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
+ pmdp_get_lockless_sync();
+ }
+
+ if (ptl != pml)
+ spin_unlock(ptl);
+ spin_unlock(pml);
+
+ mmu_notifier_invalidate_range_end(&range);
+
+ if (!skipped_uffd) {
+ mm_dec_nr_ptes(mm);
+ page_table_check_pte_clear_range(mm, addr, pgt_pmd);
+ pte_free_defer(mm, pmd_pgtable(pgt_pmd));
}
-next:
- if (is_target)
- target_result = result;
}
- i_mmap_unlock_write(mapping);
- return target_result;
+ i_mmap_unlock_read(mapping);
}

/**
@@ -2259,9 +2223,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,

/*
* Remove pte page tables, so we can re-fault the page as huge.
+ * If MADV_COLLAPSE, adjust result to call collapse_pte_mapped_thp().
*/
- result = retract_page_tables(mapping, start, mm, addr, hpage,
- cc);
+ retract_page_tables(mapping, start);
+ if (cc && !cc->is_khugepaged)
+ result = SCAN_PTE_MAPPED_HUGEPAGE;
unlock_page(hpage);

/*
--
2.35.3


2023-07-12 05:16:53

by Hugh Dickins

Subject: [PATCH v3 07/13] s390: add pte_free_defer() for pgtables sharing page

Add s390-specific pte_free_defer(), to free table page via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon. This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This version is more complicated than others: because s390 fits two 2K
page tables into one 4K page (so page->rcu_head must be shared between
both halves), and already uses page->lru (which page->rcu_head overlays)
to list any free halves; with clever management by page->_refcount bits.

Build upon the existing management, adjusted to follow a new rule: that
a page is never on the free list if pte_free_defer() was used on either
half (marked by PageActive). And for simplicity, delay calling RCU until
both halves are freed.

Not adding back unallocated fragments to the list in pte_free_defer()
can result in wasting some amount of memory for pagetables, depending
on how long the allocated fragment will stay in use. In practice, this
effect is expected to be insignificant, and not justify a far more
complex approach, which might allow to add the fragments back later
in __tlb_remove_table(), where we might not have a stable mm any more.
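
Condensed from the hunks below (a summary only), the deferral flow mirrors the
powerpc 05/13 patch, with the extra rule that neither 2K half goes back on
pgtable_list once PageActive is set:

	/* pte_free_defer(): */
	SetPageActive(page);
	page_table_free(mm, (unsigned long *)pgtable);

	/* page_table_free() or __tlb_remove_table(), once both halves are free: */
	if (TestClearPageActive(page))
		call_rcu(&page->rcu_head, pte_free_now);
	else
		pte_free_now(&page->rcu_head);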

Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Gerald Schaefer <[email protected]>
---
arch/s390/include/asm/pgalloc.h | 4 ++
arch/s390/mm/pgalloc.c | 80 +++++++++++++++++++++++++++++------
2 files changed, 72 insertions(+), 12 deletions(-)

diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
index 17eb618f1348..89a9d5ef94f8 100644
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -143,6 +143,10 @@ static inline void pmd_populate(struct mm_struct *mm,
#define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
#define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)

+/* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
void vmem_map_init(void);
void *vmem_crst_alloc(unsigned long val);
pte_t *vmem_pte_alloc(void);
diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index 66ab68db9842..760b4ace475e 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
* logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
* while the PP bits are never used, nor such a page is added to or removed
* from mm_context_t::pgtable_list.
+ *
+ * pte_free_defer() overrides those rules: it takes the page off pgtable_list,
+ * and prevents both 2K fragments from being reused. pte_free_defer() has to
+ * guarantee that its pgtable cannot be reused before the RCU grace period
+ * has elapsed (which page_table_free_rcu() does not actually guarantee).
+ * But for simplicity, because page->rcu_head overlays page->lru, and because
+ * the RCU callback might not be called before the mm_context_t has been freed,
+ * pte_free_defer() in this implementation prevents both fragments from being
+ * reused, and delays making the call to RCU until both fragments are freed.
*/
unsigned long *page_table_alloc(struct mm_struct *mm)
{
@@ -261,7 +270,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
table += PTRS_PER_PTE;
atomic_xor_bits(&page->_refcount,
0x01U << (bit + 24));
- list_del(&page->lru);
+ list_del_init(&page->lru);
}
}
spin_unlock_bh(&mm->context.lock);
@@ -281,6 +290,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
table = (unsigned long *) page_to_virt(page);
if (mm_alloc_pgste(mm)) {
/* Return 4K page table with PGSTEs */
+ INIT_LIST_HEAD(&page->lru);
atomic_xor_bits(&page->_refcount, 0x03U << 24);
memset64((u64 *)table, _PAGE_INVALID, PTRS_PER_PTE);
memset64((u64 *)table + PTRS_PER_PTE, 0, PTRS_PER_PTE);
@@ -300,7 +310,9 @@ static void page_table_release_check(struct page *page, void *table,
{
char msg[128];

- if (!IS_ENABLED(CONFIG_DEBUG_VM) || !mask)
+ if (!IS_ENABLED(CONFIG_DEBUG_VM))
+ return;
+ if (!mask && list_empty(&page->lru))
return;
snprintf(msg, sizeof(msg),
"Invalid pgtable %p release half 0x%02x mask 0x%02x",
@@ -308,6 +320,15 @@ static void page_table_release_check(struct page *page, void *table,
dump_page(page, msg);
}

+static void pte_free_now(struct rcu_head *head)
+{
+ struct page *page;
+
+ page = container_of(head, struct page, rcu_head);
+ pgtable_pte_page_dtor(page);
+ __free_page(page);
+}
+
void page_table_free(struct mm_struct *mm, unsigned long *table)
{
unsigned int mask, bit, half;
@@ -325,10 +346,17 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
*/
mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
mask >>= 24;
- if (mask & 0x03U)
+ if ((mask & 0x03U) && !PageActive(page)) {
+ /*
+ * Other half is allocated, and neither half has had
+ * its free deferred: add page to head of list, to make
+ * this freed half available for immediate reuse.
+ */
list_add(&page->lru, &mm->context.pgtable_list);
- else
- list_del(&page->lru);
+ } else {
+ /* If page is on list, now remove it. */
+ list_del_init(&page->lru);
+ }
spin_unlock_bh(&mm->context.lock);
mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
mask >>= 24;
@@ -342,8 +370,10 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
}

page_table_release_check(page, table, half, mask);
- pgtable_pte_page_dtor(page);
- __free_page(page);
+ if (TestClearPageActive(page))
+ call_rcu(&page->rcu_head, pte_free_now);
+ else
+ pte_free_now(&page->rcu_head);
}

void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
@@ -370,10 +400,18 @@ void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
*/
mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
mask >>= 24;
- if (mask & 0x03U)
+ if ((mask & 0x03U) && !PageActive(page)) {
+ /*
+ * Other half is allocated, and neither half has had
+ * its free deferred: add page to end of list, to make
+ * this freed half available for reuse once its pending
+ * bit has been cleared by __tlb_remove_table().
+ */
list_add_tail(&page->lru, &mm->context.pgtable_list);
- else
- list_del(&page->lru);
+ } else {
+ /* If page is on list, now remove it. */
+ list_del_init(&page->lru);
+ }
spin_unlock_bh(&mm->context.lock);
table = (unsigned long *) ((unsigned long) table | (0x01U << bit));
tlb_remove_table(tlb, table);
@@ -403,10 +441,28 @@ void __tlb_remove_table(void *_table)
}

page_table_release_check(page, table, half, mask);
- pgtable_pte_page_dtor(page);
- __free_page(page);
+ if (TestClearPageActive(page))
+ call_rcu(&page->rcu_head, pte_free_now);
+ else
+ pte_free_now(&page->rcu_head);
}

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+ struct page *page;
+
+ page = virt_to_page(pgtable);
+ SetPageActive(page);
+ page_table_free(mm, (unsigned long *)pgtable);
+ /*
+ * page_table_free() does not do the pgste gmap_unlink() which
+ * page_table_free_rcu() does: warn us if pgste ever reaches here.
+ */
+ WARN_ON_ONCE(mm_alloc_pgste(mm));
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
/*
* Base infrastructure required to generate basic asces, region, segment,
* and page tables that do not make use of enhanced features like EDAT1.
--
2.35.3


2023-07-12 05:19:40

by Hugh Dickins

Subject: [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()

Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
It does need mmap_read_lock(), but it does not need mmap_write_lock(),
nor vma_start_write() nor i_mmap lock nor anon_vma lock. All racing
paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.

Follow the pattern in retract_page_tables(); and using pte_free_defer()
removes most of the need for tlb_remove_table_sync_one() here; but call
pmdp_get_lockless_sync() to use it in the PAE case.

First check the VMA, in case page tables are being torn down: from JannH.
Confirm the preliminary find_pmd_or_thp_or_none() once page lock has been
acquired and the page looks suitable: from then on its state is stable.

However, collapse_pte_mapped_thp() was doing something others don't:
freeing a page table still containing "valid" entries. i_mmap lock did
stop a racing truncate from double-freeing those pages, but we prefer
collapse_pte_mapped_thp() to clear the entries as usual. Their TLB
flush can wait until the pmdp_collapse_flush() which follows, but the
mmu_notifier_invalidate_range_start() has to be done earlier.

Do the "step 1" checking loop without mmu_notifier: it wouldn't be good
for khugepaged to keep on repeatedly invalidating a range which is then
found unsuitable e.g. contains COWs. "step 2", which does the clearing,
must then be more careful (after dropping ptl to do mmu_notifier), with
abort prepared to correct the accounting like "step 3". But with those
entries now cleared, "step 4" (after dropping ptl to do pmd_lock) is kept
safe by the huge page lock, which stops new PTEs from being faulted in.
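
Condensed from the patch below (a summary, not an additional change), the
resulting "step 4" is:

	pml = pmd_lock(mm, pmd);	/* huge page lock keeps the table empty */
	if (ptl != pml)
		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
	pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
	pmdp_get_lockless_sync();
	if (ptl != pml)
		spin_unlock(ptl);
	spin_unlock(pml);
	mmu_notifier_invalidate_range_end(&range);
	mm_dec_nr_ptes(mm);
	page_table_check_pte_clear_range(mm, haddr, pgt_pmd);
	pte_free_defer(mm, pmd_pgtable(pgt_pmd));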

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/khugepaged.c | 172 ++++++++++++++++++++++----------------------------
1 file changed, 77 insertions(+), 95 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 3bb05147961b..46986eb4eebb 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1483,7 +1483,7 @@ static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
return ret;
}

-/* hpage must be locked, and mmap_lock must be held in write */
+/* hpage must be locked, and mmap_lock must be held */
static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmdp, struct page *hpage)
{
@@ -1495,7 +1495,7 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
};

VM_BUG_ON(!PageTransHuge(hpage));
- mmap_assert_write_locked(vma->vm_mm);
+ mmap_assert_locked(vma->vm_mm);

if (do_set_pmd(&vmf, hpage))
return SCAN_FAIL;
@@ -1504,48 +1504,6 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
return SCAN_SUCCEED;
}

-/*
- * A note about locking:
- * Trying to take the page table spinlocks would be useless here because those
- * are only used to synchronize:
- *
- * - modifying terminal entries (ones that point to a data page, not to another
- * page table)
- * - installing *new* non-terminal entries
- *
- * Instead, we need roughly the same kind of protection as free_pgtables() or
- * mm_take_all_locks() (but only for a single VMA):
- * The mmap lock together with this VMA's rmap locks covers all paths towards
- * the page table entries we're messing with here, except for hardware page
- * table walks and lockless_pages_from_mm().
- */
-static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long addr, pmd_t *pmdp)
-{
- pmd_t pmd;
- struct mmu_notifier_range range;
-
- mmap_assert_write_locked(mm);
- if (vma->vm_file)
- lockdep_assert_held_write(&vma->vm_file->f_mapping->i_mmap_rwsem);
- /*
- * All anon_vmas attached to the VMA have the same root and are
- * therefore locked by the same lock.
- */
- if (vma->anon_vma)
- lockdep_assert_held_write(&vma->anon_vma->root->rwsem);
-
- mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr,
- addr + HPAGE_PMD_SIZE);
- mmu_notifier_invalidate_range_start(&range);
- pmd = pmdp_collapse_flush(vma, addr, pmdp);
- tlb_remove_table_sync_one();
- mmu_notifier_invalidate_range_end(&range);
- mm_dec_nr_ptes(mm);
- page_table_check_pte_clear_range(mm, addr, pmd);
- pte_free(mm, pmd_pgtable(pmd));
-}
-
/**
* collapse_pte_mapped_thp - Try to collapse a pte-mapped THP for mm at
* address haddr.
@@ -1561,26 +1519,29 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
bool install_pmd)
{
+ struct mmu_notifier_range range;
+ bool notified = false;
unsigned long haddr = addr & HPAGE_PMD_MASK;
struct vm_area_struct *vma = vma_lookup(mm, haddr);
struct page *hpage;
pte_t *start_pte, *pte;
- pmd_t *pmd;
- spinlock_t *ptl;
- int count = 0, result = SCAN_FAIL;
+ pmd_t *pmd, pgt_pmd;
+ spinlock_t *pml, *ptl;
+ int nr_ptes = 0, result = SCAN_FAIL;
int i;

- mmap_assert_write_locked(mm);
+ mmap_assert_locked(mm);
+
+ /* First check VMA found, in case page tables are being torn down */
+ if (!vma || !vma->vm_file ||
+ !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
+ return SCAN_VMA_CHECK;

/* Fast check before locking page if already PMD-mapped */
result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
if (result == SCAN_PMD_MAPPED)
return result;

- if (!vma || !vma->vm_file ||
- !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
- return SCAN_VMA_CHECK;
-
/*
* If we are here, we've succeeded in replacing all the native pages
* in the page cache with a single hugepage. If a mm were to fault-in
@@ -1610,6 +1571,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
goto drop_hpage;
}

+ result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
switch (result) {
case SCAN_SUCCEED:
break;
@@ -1623,27 +1585,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
goto drop_hpage;
}

- /* Lock the vma before taking i_mmap and page table locks */
- vma_start_write(vma);
-
- /*
- * We need to lock the mapping so that from here on, only GUP-fast and
- * hardware page walks can access the parts of the page tables that
- * we're operating on.
- * See collapse_and_free_pmd().
- */
- i_mmap_lock_write(vma->vm_file->f_mapping);
-
- /*
- * This spinlock should be unnecessary: Nobody else should be accessing
- * the page tables under spinlock protection here, only
- * lockless_pages_from_mm() and the hardware page walker can access page
- * tables while all the high-level locks are held in write mode.
- */
result = SCAN_FAIL;
start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
- if (!start_pte)
- goto drop_immap;
+ if (!start_pte) /* mmap_lock + page lock should prevent this */
+ goto drop_hpage;

/* step 1: check all mapped PTEs are to the right huge page */
for (i = 0, addr = haddr, pte = start_pte;
@@ -1670,10 +1615,18 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
*/
if (hpage + i != page)
goto abort;
- count++;
}

- /* step 2: adjust rmap */
+ pte_unmap_unlock(start_pte, ptl);
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
+ haddr, haddr + HPAGE_PMD_SIZE);
+ mmu_notifier_invalidate_range_start(&range);
+ notified = true;
+ start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
+ if (!start_pte) /* mmap_lock + page lock should prevent this */
+ goto abort;
+
+ /* step 2: clear page table and adjust rmap */
for (i = 0, addr = haddr, pte = start_pte;
i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
struct page *page;
@@ -1681,47 +1634,76 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,

if (pte_none(ptent))
continue;
- page = vm_normal_page(vma, addr, ptent);
- if (WARN_ON_ONCE(page && is_zone_device_page(page)))
+ /*
+ * We dropped ptl after the first scan, to do the mmu_notifier:
+ * page lock stops more PTEs of the hpage being faulted in, but
+ * does not stop write faults COWing anon copies from existing
+ * PTEs; and does not stop those being swapped out or migrated.
+ */
+ if (!pte_present(ptent)) {
+ result = SCAN_PTE_NON_PRESENT;
goto abort;
+ }
+ page = vm_normal_page(vma, addr, ptent);
+ if (hpage + i != page)
+ goto abort;
+
+ /*
+ * Must clear entry, or a racing truncate may re-remove it.
+ * TLB flush can be left until pmdp_collapse_flush() does it.
+ * PTE dirty? Shmem page is already dirty; file is read-only.
+ */
+ pte_clear(mm, addr, pte);
page_remove_rmap(page, vma, false);
+ nr_ptes++;
}

pte_unmap_unlock(start_pte, ptl);

/* step 3: set proper refcount and mm_counters. */
- if (count) {
- page_ref_sub(hpage, count);
- add_mm_counter(vma->vm_mm, mm_counter_file(hpage), -count);
+ if (nr_ptes) {
+ page_ref_sub(hpage, nr_ptes);
+ add_mm_counter(mm, mm_counter_file(hpage), -nr_ptes);
}

- /* step 4: remove pte entries */
- /* we make no change to anon, but protect concurrent anon page lookup */
- if (vma->anon_vma)
- anon_vma_lock_write(vma->anon_vma);
+ /* step 4: remove page table */

- collapse_and_free_pmd(mm, vma, haddr, pmd);
+ /* Huge page lock is still held, so page table must remain empty */
+ pml = pmd_lock(mm, pmd);
+ if (ptl != pml)
+ spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+ pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
+ pmdp_get_lockless_sync();
+ if (ptl != pml)
+ spin_unlock(ptl);
+ spin_unlock(pml);

- if (vma->anon_vma)
- anon_vma_unlock_write(vma->anon_vma);
- i_mmap_unlock_write(vma->vm_file->f_mapping);
+ mmu_notifier_invalidate_range_end(&range);
+
+ mm_dec_nr_ptes(mm);
+ page_table_check_pte_clear_range(mm, haddr, pgt_pmd);
+ pte_free_defer(mm, pmd_pgtable(pgt_pmd));

maybe_install_pmd:
/* step 5: install pmd entry */
result = install_pmd
? set_huge_pmd(vma, haddr, pmd, hpage)
: SCAN_SUCCEED;
-
+ goto drop_hpage;
+abort:
+ if (nr_ptes) {
+ flush_tlb_mm(mm);
+ page_ref_sub(hpage, nr_ptes);
+ add_mm_counter(mm, mm_counter_file(hpage), -nr_ptes);
+ }
+ if (start_pte)
+ pte_unmap_unlock(start_pte, ptl);
+ if (notified)
+ mmu_notifier_invalidate_range_end(&range);
drop_hpage:
unlock_page(hpage);
put_page(hpage);
return result;
-
-abort:
- pte_unmap_unlock(start_pte, ptl);
-drop_immap:
- i_mmap_unlock_write(vma->vm_file->f_mapping);
- goto drop_hpage;
}

static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_slot)
@@ -2855,9 +2837,9 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
case SCAN_PTE_MAPPED_HUGEPAGE:
BUG_ON(mmap_locked);
BUG_ON(*prev);
- mmap_write_lock(mm);
+ mmap_read_lock(mm);
result = collapse_pte_mapped_thp(mm, addr, true);
- mmap_write_unlock(mm);
+ mmap_locked = true;
goto handle_result;
/* Whitelisted set of results where continuing OK */
case SCAN_PMD_NULL:
--
2.35.3


2023-07-12 05:41:56

by Hugh Dickins

Subject: [PATCH v3 12/13] mm: delete mmap_write_trylock() and vma_try_start_write()

mmap_write_trylock() and vma_try_start_write() were added just for
khugepaged, but now it has no use for them: delete.

Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/mm.h | 17 -----------------
include/linux/mmap_lock.h | 10 ----------
2 files changed, 27 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2dd73e4f3d8e..b7b45be616ad 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -692,21 +692,6 @@ static inline void vma_start_write(struct vm_area_struct *vma)
up_write(&vma->vm_lock->lock);
}

-static inline bool vma_try_start_write(struct vm_area_struct *vma)
-{
- int mm_lock_seq;
-
- if (__is_vma_write_locked(vma, &mm_lock_seq))
- return true;
-
- if (!down_write_trylock(&vma->vm_lock->lock))
- return false;
-
- vma->vm_lock_seq = mm_lock_seq;
- up_write(&vma->vm_lock->lock);
- return true;
-}
-
static inline void vma_assert_write_locked(struct vm_area_struct *vma)
{
int mm_lock_seq;
@@ -731,8 +716,6 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
{ return false; }
static inline void vma_end_read(struct vm_area_struct *vma) {}
static inline void vma_start_write(struct vm_area_struct *vma) {}
-static inline bool vma_try_start_write(struct vm_area_struct *vma)
- { return true; }
static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
static inline void vma_mark_detached(struct vm_area_struct *vma,
bool detached) {}
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index aab8f1b28d26..d1191f02c7fa 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -112,16 +112,6 @@ static inline int mmap_write_lock_killable(struct mm_struct *mm)
return ret;
}

-static inline bool mmap_write_trylock(struct mm_struct *mm)
-{
- bool ret;
-
- __mmap_lock_trace_start_locking(mm, true);
- ret = down_write_trylock(&mm->mmap_lock) != 0;
- __mmap_lock_trace_acquire_returned(mm, true, ret);
- return ret;
-}
-
static inline void mmap_write_unlock(struct mm_struct *mm)
{
__mmap_lock_trace_released(mm, true);
--
2.35.3


2023-07-12 06:07:36

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH v3 13/13] mm/pgtable: notes on pte_offset_map[_lock]()

Add a block of comments on pte_offset_map_lock(), pte_offset_map() and
pte_offset_map_nolock() to mm/pgtable-generic.c, to help explain them.

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/pgtable-generic.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 44 insertions(+)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index fa9d4d084291..4fcd959dcc4d 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -315,6 +315,50 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
return pte;
}

+/*
+ * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal implementation
+ * __pte_offset_map_lock() below, is usually called with the pmd pointer for
+ * addr, reached by walking down the mm's pgd, p4d, pud for addr: either while
+ * holding mmap_lock or vma lock for read or for write; or in truncate or rmap
+ * context, while holding file's i_mmap_lock or anon_vma lock for read (or for
+ * write). In a few cases, it may be used with pmd pointing to a pmd_t already
+ * copied to or constructed on the stack.
+ *
+ * When successful, it returns the pte pointer for addr, with its page table
+ * kmapped if necessary (when CONFIG_HIGHPTE), and locked against concurrent
+ * modification by software, with a pointer to that spinlock in ptlp (in some
+ * configs mm->page_table_lock, in SPLIT_PTLOCK configs a spinlock in table's
+ * struct page). pte_unmap_unlock(pte, ptl) to unlock and unmap afterwards.
+ *
+ * But it is unsuccessful, returning NULL with *ptlp unchanged, if there is no
+ * page table at *pmd: if, for example, the page table has just been removed,
+ * or replaced by the huge pmd of a THP. (When successful, *pmd is rechecked
+ * after acquiring the ptlock, and retried internally if it changed: so that a
+ * page table can be safely removed or replaced by THP while holding its lock.)
+ *
+ * pte_offset_map(pmd, addr), and its internal helper __pte_offset_map() above,
+ * just returns the pte pointer for addr, its page table kmapped if necessary;
+ * or NULL if there is no page table at *pmd. It does not attempt to lock the
+ * page table, so cannot normally be used when the page table is to be updated,
+ * or when entries read must be stable. But it does take rcu_read_lock(): so
+ * that even when page table is racily removed, it remains a valid though empty
+ * and disconnected table. Until pte_unmap(pte) unmaps and rcu_read_unlock()s
+ * afterwards.
+ *
+ * pte_offset_map_nolock(mm, pmd, addr, ptlp), above, is like pte_offset_map();
+ * but when successful, it also outputs a pointer to the spinlock in ptlp - as
+ * pte_offset_map_lock() does, but in this case without locking it. This helps
+ * the caller to avoid a later pte_lockptr(mm, *pmd), which might by that time
+ * act on a changed *pmd: pte_offset_map_nolock() provides the correct spinlock
+ * pointer for the page table that it returns. In principle, the caller should
+ * recheck *pmd once the lock is taken; in practice, no callsite needs that -
+ * either the mmap_lock for write, or pte_same() check on contents, is enough.
+ *
+ * Note that free_pgtables(), used after unmapping detached vmas, or when
+ * exiting the whole mm, does not take page table lock before freeing a page
+ * table, and may not use RCU at all: "outsiders" like khugepaged should avoid
+ * pte_offset_map() and co once the vma is detached from mm or mm_users is zero.
+ */
pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
unsigned long addr, spinlock_t **ptlp)
{
--
2.35.3
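
To make the calling convention above concrete, here is a minimal sketch of a
caller - illustration only, not part of the patch: walk_one_pte() is a
hypothetical helper, while pte_offset_map_lock(), ptep_get() and
pte_unmap_unlock() are the interfaces the comment block describes.

/* Hypothetical caller, for illustration only. */
static int walk_one_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
{
	spinlock_t *ptl;
	pte_t *pte;
	pte_t entry;

	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (!pte)
		return -EAGAIN;	/* no page table at *pmd: removed, or replaced by THP */

	/* *pmd was rechecked under ptl, so the page table cannot go away now */
	entry = ptep_get(pte);
	if (pte_none(entry)) {
		/* nothing mapped at addr; a real caller would act on entry here */
	}

	pte_unmap_unlock(pte, ptl);	/* unlock, and kunmap if CONFIG_HIGHPTE */
	return 0;
}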


2023-07-13 05:23:56

by Alexander Gordeev

[permalink] [raw]
Subject: Re: [PATCH v3 07/13] s390: add pte_free_defer() for pgtables sharing page

On Tue, Jul 11, 2023 at 09:38:35PM -0700, Hugh Dickins wrote:
> Add s390-specific pte_free_defer(), to free table page via call_rcu().
> pte_free_defer() will be called inside khugepaged's retract_page_tables()
> loop, where allocating extra memory cannot be relied upon. This precedes
> the generic version to avoid build breakage from incompatible pgtable_t.
>
> This version is more complicated than others: because s390 fits two 2K
> page tables into one 4K page (so page->rcu_head must be shared between
> both halves), and already uses page->lru (which page->rcu_head overlays)
> to list any free halves; with clever management by page->_refcount bits.
>
> Build upon the existing management, adjusted to follow a new rule: that
> a page is never on the free list if pte_free_defer() was used on either
> half (marked by PageActive). And for simplicity, delay calling RCU until
> both halves are freed.
>
> Not adding back unallocated fragments to the list in pte_free_defer()
> can result in wasting some amount of memory for pagetables, depending
> on how long the allocated fragment will stay in use. In practice, this
> effect is expected to be insignificant, and not justify a far more
> complex approach, which might allow to add the fragments back later
> in __tlb_remove_table(), where we might not have a stable mm any more.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> Reviewed-by: Gerald Schaefer <[email protected]>
> ---
> arch/s390/include/asm/pgalloc.h | 4 ++
> arch/s390/mm/pgalloc.c | 80 +++++++++++++++++++++++++++++------
> 2 files changed, 72 insertions(+), 12 deletions(-)
>
> diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
> index 17eb618f1348..89a9d5ef94f8 100644
> --- a/arch/s390/include/asm/pgalloc.h
> +++ b/arch/s390/include/asm/pgalloc.h
> @@ -143,6 +143,10 @@ static inline void pmd_populate(struct mm_struct *mm,
> #define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
> #define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)
>
> +/* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */
> +#define pte_free_defer pte_free_defer
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
> +
> void vmem_map_init(void);
> void *vmem_crst_alloc(unsigned long val);
> pte_t *vmem_pte_alloc(void);
> diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
> index 66ab68db9842..760b4ace475e 100644
> --- a/arch/s390/mm/pgalloc.c
> +++ b/arch/s390/mm/pgalloc.c
> @@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
> * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
> * while the PP bits are never used, nor such a page is added to or removed
> * from mm_context_t::pgtable_list.
> + *
> + * pte_free_defer() overrides those rules: it takes the page off pgtable_list,
> + * and prevents both 2K fragments from being reused. pte_free_defer() has to
> + * guarantee that its pgtable cannot be reused before the RCU grace period
> + * has elapsed (which page_table_free_rcu() does not actually guarantee).
> + * But for simplicity, because page->rcu_head overlays page->lru, and because
> + * the RCU callback might not be called before the mm_context_t has been freed,
> + * pte_free_defer() in this implementation prevents both fragments from being
> + * reused, and delays making the call to RCU until both fragments are freed.
> */
> unsigned long *page_table_alloc(struct mm_struct *mm)
> {
> @@ -261,7 +270,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
> table += PTRS_PER_PTE;
> atomic_xor_bits(&page->_refcount,
> 0x01U << (bit + 24));
> - list_del(&page->lru);
> + list_del_init(&page->lru);
> }
> }
> spin_unlock_bh(&mm->context.lock);
> @@ -281,6 +290,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
> table = (unsigned long *) page_to_virt(page);
> if (mm_alloc_pgste(mm)) {
> /* Return 4K page table with PGSTEs */
> + INIT_LIST_HEAD(&page->lru);
> atomic_xor_bits(&page->_refcount, 0x03U << 24);
> memset64((u64 *)table, _PAGE_INVALID, PTRS_PER_PTE);
> memset64((u64 *)table + PTRS_PER_PTE, 0, PTRS_PER_PTE);
> @@ -300,7 +310,9 @@ static void page_table_release_check(struct page *page, void *table,
> {
> char msg[128];
>
> - if (!IS_ENABLED(CONFIG_DEBUG_VM) || !mask)
> + if (!IS_ENABLED(CONFIG_DEBUG_VM))
> + return;
> + if (!mask && list_empty(&page->lru))
> return;
> snprintf(msg, sizeof(msg),
> "Invalid pgtable %p release half 0x%02x mask 0x%02x",
> @@ -308,6 +320,15 @@ static void page_table_release_check(struct page *page, void *table,
> dump_page(page, msg);
> }
>
> +static void pte_free_now(struct rcu_head *head)
> +{
> + struct page *page;
> +
> + page = container_of(head, struct page, rcu_head);
> + pgtable_pte_page_dtor(page);
> + __free_page(page);
> +}
> +
> void page_table_free(struct mm_struct *mm, unsigned long *table)
> {
> unsigned int mask, bit, half;
> @@ -325,10 +346,17 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
> */
> mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
> mask >>= 24;
> - if (mask & 0x03U)
> + if ((mask & 0x03U) && !PageActive(page)) {
> + /*
> + * Other half is allocated, and neither half has had
> + * its free deferred: add page to head of list, to make
> + * this freed half available for immediate reuse.
> + */
> list_add(&page->lru, &mm->context.pgtable_list);
> - else
> - list_del(&page->lru);
> + } else {
> + /* If page is on list, now remove it. */
> + list_del_init(&page->lru);
> + }
> spin_unlock_bh(&mm->context.lock);
> mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
> mask >>= 24;
> @@ -342,8 +370,10 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
> }
>
> page_table_release_check(page, table, half, mask);
> - pgtable_pte_page_dtor(page);
> - __free_page(page);
> + if (TestClearPageActive(page))
> + call_rcu(&page->rcu_head, pte_free_now);
> + else
> + pte_free_now(&page->rcu_head);
> }
>
> void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
> @@ -370,10 +400,18 @@ void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
> */
> mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
> mask >>= 24;
> - if (mask & 0x03U)
> + if ((mask & 0x03U) && !PageActive(page)) {
> + /*
> + * Other half is allocated, and neither half has had
> + * its free deferred: add page to end of list, to make
> + * this freed half available for reuse once its pending
> + * bit has been cleared by __tlb_remove_table().
> + */
> list_add_tail(&page->lru, &mm->context.pgtable_list);
> - else
> - list_del(&page->lru);
> + } else {
> + /* If page is on list, now remove it. */
> + list_del_init(&page->lru);
> + }
> spin_unlock_bh(&mm->context.lock);
> table = (unsigned long *) ((unsigned long) table | (0x01U << bit));
> tlb_remove_table(tlb, table);
> @@ -403,10 +441,28 @@ void __tlb_remove_table(void *_table)
> }
>
> page_table_release_check(page, table, half, mask);
> - pgtable_pte_page_dtor(page);
> - __free_page(page);
> + if (TestClearPageActive(page))
> + call_rcu(&page->rcu_head, pte_free_now);
> + else
> + pte_free_now(&page->rcu_head);
> }
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> +{
> + struct page *page;
> +
> + page = virt_to_page(pgtable);
> + SetPageActive(page);
> + page_table_free(mm, (unsigned long *)pgtable);
> + /*
> + * page_table_free() does not do the pgste gmap_unlink() which
> + * page_table_free_rcu() does: warn us if pgste ever reaches here.
> + */
> + WARN_ON_ONCE(mm_alloc_pgste(mm));
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> /*
> * Base infrastructure required to generate basic asces, region, segment,
> * and page tables that do not make use of enhanced features like EDAT1.

Tested-by: Alexander Gordeev <[email protected]>
Acked-by: Alexander Gordeev <[email protected]>

2023-07-18 11:20:40

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH v3 04/13] powerpc: assert_pte_locked() use pte_offset_map_nolock()

Hugh Dickins <[email protected]> writes:

> Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
> in assert_pte_locked(). BUG if pte_offset_map_nolock() fails: this is
> stricter than the previous implementation, which skipped when pmd_none()
> (with a comment on khugepaged collapse transitions): but wouldn't we want
> to know, if an assert_pte_locked() caller can be racing such transitions?
>

The reason we had that pmd_none check there was to handle khugepaged. In
case of khugepaged we do pmdp_collapse_flush and then do a ptep_clear.
ppc64 had the assert_pte_locked check inside that ptep_clear.

_pmd = pmdp_collapse_flush(vma, address, pmd);
..
ptep_clear()
-> assert_pte_locked()
---> pmd_none
-----> BUG


The problem is how assert_pte_locked() verifies whether we are holding
ptl. It does that by walking the page table again, and in this specific
case, by the time we call the function we already had cleared pmd.
>
> This mod might cause new crashes: which either expose my ignorance, or
> indicate issues to be fixed, or limit the usage of assert_pte_locked().
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> arch/powerpc/mm/pgtable.c | 16 ++++++----------
> 1 file changed, 6 insertions(+), 10 deletions(-)
>
> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
> index cb2dcdb18f8e..16b061af86d7 100644
> --- a/arch/powerpc/mm/pgtable.c
> +++ b/arch/powerpc/mm/pgtable.c
> @@ -311,6 +311,8 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
> p4d_t *p4d;
> pud_t *pud;
> pmd_t *pmd;
> + pte_t *pte;
> + spinlock_t *ptl;
>
> if (mm == &init_mm)
> return;
> @@ -321,16 +323,10 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
> pud = pud_offset(p4d, addr);
> BUG_ON(pud_none(*pud));
> pmd = pmd_offset(pud, addr);
> - /*
> - * khugepaged to collapse normal pages to hugepage, first set
> - * pmd to none to force page fault/gup to take mmap_lock. After
> - * pmd is set to none, we do a pte_clear which does this assertion
> - * so if we find pmd none, return.
> - */
> - if (pmd_none(*pmd))
> - return;
> - BUG_ON(!pmd_present(*pmd));
> - assert_spin_locked(pte_lockptr(mm, pmd));
> + pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
> + BUG_ON(!pte);
> + assert_spin_locked(ptl);
> + pte_unmap(pte);
> }
> #endif /* CONFIG_DEBUG_VM */
>
> --
> 2.35.3

2023-07-19 05:16:57

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v3 04/13] powerpc: assert_pte_locked() use pte_offset_map_nolock()

On Tue, 18 Jul 2023, Aneesh Kumar K.V wrote:
> Hugh Dickins <[email protected]> writes:
>
> > Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
> > in assert_pte_locked(). BUG if pte_offset_map_nolock() fails: this is
> > stricter than the previous implementation, which skipped when pmd_none()
> > (with a comment on khugepaged collapse transitions): but wouldn't we want
> > to know, if an assert_pte_locked() caller can be racing such transitions?
> >
>
> The reason we had that pmd_none check there was to handle khugepaged. In
> case of khugepaged we do pmdp_collapse_flush and then do a ptep_clear.
> ppc64 had the assert_pte_locked check inside that ptep_clear.
>
> _pmd = pmdp_collapse_flush(vma, address, pmd);
> ..
> ptep_clear()
> -> assert_pte_locked()
> ---> pmd_none
> -----> BUG
>
>
> The problem is how assert_pte_locked() verifies whether we are holding
> ptl. It does that by walking the page table again, and in this specific
> case, by the time we call the function we already had cleared pmd.

Aneesh, please clarify, I've spent hours on this.

From all your use of past tense ("had"), I thought you were Acking my
patch; but now, after looking again at v3.11 source and today's,
I think you are NAKing my patch in its present form.

You are pointing out that anon THP's __collapse_huge_page_copy_succeeded()
uses ptep_clear() at a point after pmdp_collapse_flush() already cleared
*pmd, so my patch now leads that one use of assert_pte_locked() to BUG.
Is that your point?

I can easily restore that khugepaged comment (which had appeared to me
out of date at the time, but now looks still relevant) and pmd_none(*pmd)
check: but please clarify.

Thanks,
Hugh

> >
> > This mod might cause new crashes: which either expose my ignorance, or
> > indicate issues to be fixed, or limit the usage of assert_pte_locked().
> >
> > Signed-off-by: Hugh Dickins <[email protected]>
> > ---
> > arch/powerpc/mm/pgtable.c | 16 ++++++----------
> > 1 file changed, 6 insertions(+), 10 deletions(-)
> >
> > diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
> > index cb2dcdb18f8e..16b061af86d7 100644
> > --- a/arch/powerpc/mm/pgtable.c
> > +++ b/arch/powerpc/mm/pgtable.c
> > @@ -311,6 +311,8 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
> > p4d_t *p4d;
> > pud_t *pud;
> > pmd_t *pmd;
> > + pte_t *pte;
> > + spinlock_t *ptl;
> >
> > if (mm == &init_mm)
> > return;
> > @@ -321,16 +323,10 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
> > pud = pud_offset(p4d, addr);
> > BUG_ON(pud_none(*pud));
> > pmd = pmd_offset(pud, addr);
> > - /*
> > - * khugepaged to collapse normal pages to hugepage, first set
> > - * pmd to none to force page fault/gup to take mmap_lock. After
> > - * pmd is set to none, we do a pte_clear which does this assertion
> > - * so if we find pmd none, return.
> > - */
> > - if (pmd_none(*pmd))
> > - return;
> > - BUG_ON(!pmd_present(*pmd));
> > - assert_spin_locked(pte_lockptr(mm, pmd));
> > + pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
> > + BUG_ON(!pte);
> > + assert_spin_locked(ptl);
> > + pte_unmap(pte);
> > }
> > #endif /* CONFIG_DEBUG_VM */
> >
> > --
> > 2.35.3

2023-07-19 05:35:00

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH v3 04/13] powerpc: assert_pte_locked() use pte_offset_map_nolock()

On 7/19/23 10:34 AM, Hugh Dickins wrote:
> On Tue, 18 Jul 2023, Aneesh Kumar K.V wrote:
>> Hugh Dickins <[email protected]> writes:
>>
>>> Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
>>> in assert_pte_locked(). BUG if pte_offset_map_nolock() fails: this is
>>> stricter than the previous implementation, which skipped when pmd_none()
>>> (with a comment on khugepaged collapse transitions): but wouldn't we want
>>> to know, if an assert_pte_locked() caller can be racing such transitions?
>>>
>>
>> The reason we had that pmd_none check there was to handle khugepaged. In
>> case of khugepaged we do pmdp_collapse_flush and then do a ptep_clear.
>> ppc64 had the assert_pte_locked check inside that ptep_clear.
>>
>> _pmd = pmdp_collapse_flush(vma, address, pmd);
>> ..
>> ptep_clear()
>> -> assert_pte_locked()
>> ---> pmd_none
>> -----> BUG
>>
>>
>> The problem is how assert_pte_locked() verifies whether we are holding
>> ptl. It does that by walking the page table again, and in this specific
>> case, by the time we call the function we already had cleared pmd.
>
> Aneesh, please clarify, I've spent hours on this.
>
> From all your use of past tense ("had"), I thought you were Acking my
> patch; but now, after looking again at v3.11 source and today's,
> I think you are NAKing my patch in its present form.
>

Sorry for the confusion my reply created.

> You are pointing out that anon THP's __collapse_huge_page_copy_succeeded()
> uses ptep_clear() at a point after pmdp_collapse_flush() already cleared
> *pmd, so my patch now leads that one use of assert_pte_locked() to BUG.
> Is that your point?
>

Yes. I haven't tested this yet to verify that it is indeed hitting that BUG.
But a code inspection tells me we will hit that BUG on powerpc because of
the above details.

> I can easily restore that khugepaged comment (which had appeared to me
> out of date at the time, but now looks still relevant) and pmd_none(*pmd)
> check: but please clarify.
>

That is correct. If we add that pmd_none check back we should be good here.


-aneesh

2023-07-19 14:41:19

by Claudio Imbrenda

[permalink] [raw]
Subject: Re: [PATCH v3 07/13] s390: add pte_free_defer() for pgtables sharing page

On Tue, 11 Jul 2023 21:38:35 -0700 (PDT)
Hugh Dickins <[email protected]> wrote:

[...]

> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> +{
> + struct page *page;
> +
> + page = virt_to_page(pgtable);
> + SetPageActive(page);
> + page_table_free(mm, (unsigned long *)pgtable);
> + /*
> + * page_table_free() does not do the pgste gmap_unlink() which
> + * page_table_free_rcu() does: warn us if pgste ever reaches here.
> + */
> + WARN_ON_ONCE(mm_alloc_pgste(mm));

It seems I have overlooked something when we previously discussed
this...

mm_alloc_pgste() is true for all processes that have PGSTEs, not only
for processes that can run guests.

There are two ways to enable PGSTEs: an ELF header bit, and a sysctl
knob.

The ELF bit is only used by qemu; it enables PGSTE allocation only for
that single process. This is a strong indication that the process wants
to run guests.

The sysctl knob enables PGSTE allocation for every process in the system
from that moment on. In that case, the WARN_ON_ONCE would be triggered
unnecessarily.

There is, however, another way to check whether a process is actually
__using__ the PGSTEs, i.e. whether the process is actually capable of
running guests.

Confusingly, the name of that function is mm_has_pgste(). This confused
me as well, which is why I didn't notice it when we discussed this
previously :)


In short: can you please use mm_has_pgste() instead of mm_alloc_pgste()
in the WARN_ON_ONCE?

> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> /*
> * Base infrastructure required to generate basic asces, region, segment,
> * and page tables that do not make use of enhanced features like EDAT1.
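
As a side note illustrating the distinction drawn above - a sketch only, not
kernel source: mm_alloc_pgste(), mm_has_pgste() and WARN_ON_ONCE() are the
real names referred to in this thread, but the wrapper function below is
hypothetical.

/* Hypothetical wrapper, purely illustrative. */
static void pgste_warn_example(struct mm_struct *mm)
{
	/*
	 * mm_alloc_pgste(mm): true whenever PGSTE allocation is enabled,
	 * either per process via the ELF header bit or system-wide via
	 * the sysctl knob - so it may be true for processes that never
	 * run guests, and would make the warning fire when not needed.
	 *
	 * mm_has_pgste(mm): true only when the process actually uses
	 * PGSTEs, i.e. is capable of running guests - the condition
	 * worth warning about in pte_free_defer().
	 */
	WARN_ON_ONCE(mm_has_pgste(mm));
}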


2023-07-21 14:10:55

by Jay Patel

[permalink] [raw]
Subject: Re: [PATCH v3 04/13] powerpc: assert_pte_locked() use pte_offset_map_nolock()

On Jul 19 2023, Aneesh Kumar K V wrote:
> On 7/19/23 10:34 AM, Hugh Dickins wrote:
> > On Tue, 18 Jul 2023, Aneesh Kumar K.V wrote:
> >> Hugh Dickins <[email protected]> writes:
> >>
> >>> Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
> >>> in assert_pte_locked(). BUG if pte_offset_map_nolock() fails: this is
> >>> stricter than the previous implementation, which skipped when pmd_none()
> >>> (with a comment on khugepaged collapse transitions): but wouldn't we want
> >>> to know, if an assert_pte_locked() caller can be racing such transitions?
> >>>
> >>
> >> The reason we had that pmd_none check there was to handle khugepaged. In
> >> case of khugepaged we do pmdp_collapse_flush and then do a ptep_clear.
> >> ppc64 had the assert_pte_locked check inside that ptep_clear.
> >>
> >> _pmd = pmdp_collapse_flush(vma, address, pmd);
> >> ..
> >> ptep_clear()
> >> -> assert_pte_locked()
> >> ---> pmd_none
> >> -----> BUG
> >>
> >>
> >> The problem is how assert_pte_locked() verifies whether we are holding
> >> ptl. It does that by walking the page table again, and in this specific
> >> case, by the time we call the function we already had cleared pmd.
> >
> > Aneesh, please clarify, I've spent hours on this.
> >
> > From all your use of past tense ("had"), I thought you were Acking my
> > patch; but now, after looking again at v3.11 source and today's,
> > I think you are NAKing my patch in its present form.
> >
>
> Sorry for the confusion my reply created.
>
> > You are pointing out that anon THP's __collapse_huge_page_copy_succeeded()
> > uses ptep_clear() at a point after pmdp_collapse_flush() already cleared
> > *pmd, so my patch now leads that one use of assert_pte_locked() to BUG.
> > Is that your point?
> >
>
> Yes. I haven't tested this yet to verify that it is indeed hitting that BUG.
> But a code inspection tells me we will hit that BUG on powerpc because of
> the above details.
>
Hi Aneesh,

After testing it, I can confirm that it encountered a BUG on powerpc.
Log report is attached.

Thanks,
Jay Patel
> > I can easily restore that khugepaged comment (which had appeared to me
> > out of date at the time, but now looks still relevant) and pmd_none(*pmd)
> > check: but please clarify.
> >
>
> That is correct. If we add that pmd_none check back we should be good here.
>
>
> -aneesh


Attachments:
(No filename) (2.37 kB)
report.txt (3.67 kB)

2023-07-23 22:35:43

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH v3 04/13 fix] powerpc: assert_pte_locked() use pte_offset_map_nolock(): fix

Aneesh points out that assert_pte_locked() still needs the pmd_none()
check, to stop crashing in khugepaged: restore that comment and check.

Andrew, when merging with original commit, please edit its 1st para to:

Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
in assert_pte_locked(). BUG if pte_offset_map_nolock() fails.

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/powerpc/mm/pgtable.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 16b061af86d7..a3dcdb2d5b4b 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -323,6 +323,14 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
pud = pud_offset(p4d, addr);
BUG_ON(pud_none(*pud));
pmd = pmd_offset(pud, addr);
+ /*
+ * khugepaged to collapse normal pages to hugepage, first set
+ * pmd to none to force page fault/gup to take mmap_lock. After
+ * pmd is set to none, we do a pte_clear which does this assertion
+ * so if we find pmd none, return.
+ */
+ if (pmd_none(*pmd))
+ return;
pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
BUG_ON(!pte);
assert_spin_locked(ptl);
--
2.35.3


2023-07-23 22:38:06

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH v3 10/13 fix] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock(): fix

madvise_collapse() setting "mmap_locked = true" after calling
collapse_pte_mapped_thp() looked good but was wrong. If the loop then
moves on to the next extent, mmap_locked assures it that "vma" has been
revalidated under mmap_lock, which was not the case: and led to UAFs,
crashes in __fput() or task_work_run(), even collapse_file()'s
VM_BUG_ON(start & (HPAGE_PMD_NR - 1)) - all detected by syzbot.

(collapse_pte_mapped_thp() does validate the vma that it works on:
but it's not passed in as an argument, collapse_pte_mapped_thp() finds
the vma for mm and addr by itself - which may by this time have changed
from the vma saved in madvise_collapse().)

Reported-by: [email protected]
Closes: https://lore.kernel.org/lkml/[email protected]/
Reported-by: [email protected]
Closes: https://lore.kernel.org/linux-mm/[email protected]/
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/khugepaged.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6bad69c0e4bd..1c773db26e88 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2747,7 +2747,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
BUG_ON(*prev);
mmap_read_lock(mm);
result = collapse_pte_mapped_thp(mm, addr, true);
- mmap_locked = true;
+ mmap_read_unlock(mm);
goto handle_result;
/* Whitelisted set of results where continuing OK */
case SCAN_PMD_NULL:
--
2.35.3


2023-07-23 23:11:08

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH v3 11/13 fix] mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps(): fix

Though not yet detected by syzbot, this commit was making the same
mistake with mmap_locked as the previous commit: fix that.

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/khugepaged.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1c773db26e88..41913730db4c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2380,19 +2380,17 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
mmap_locked = false;
*result = hpage_collapse_scan_file(mm,
khugepaged_scan.address, file, pgoff, cc);
+ fput(file);
if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
mmap_read_lock(mm);
- mmap_locked = true;
- if (hpage_collapse_test_exit(mm)) {
- fput(file);
+ if (hpage_collapse_test_exit(mm))
goto breakouterloop;
- }
*result = collapse_pte_mapped_thp(mm,
khugepaged_scan.address, false);
if (*result == SCAN_PMD_MAPPED)
*result = SCAN_SUCCEED;
+ mmap_read_unlock(mm);
}
- fput(file);
} else {
*result = hpage_collapse_scan_pmd(mm, vma,
khugepaged_scan.address, &mmap_locked, cc);
--
2.35.3


2023-07-23 23:47:01

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH v3 07/13 fix] s390: add pte_free_defer() for pgtables sharing page: fix

Claudio finds warning on mm_has_pgste() more useful than on mm_alloc_pgste().

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/s390/mm/pgalloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index 760b4ace475e..d7374add7820 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -459,7 +459,7 @@ void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
* page_table_free() does not do the pgste gmap_unlink() which
* page_table_free_rcu() does: warn us if pgste ever reaches here.
*/
- WARN_ON_ONCE(mm_alloc_pgste(mm));
+ WARN_ON_ONCE(mm_has_pgste(mm));
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

--
2.35.3


2023-08-03 10:55:01

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()

Hi,

On 2023/7/12 12:42, Hugh Dickins wrote:
> Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
> It does need mmap_read_lock(), but it does not need mmap_write_lock(),
> nor vma_start_write() nor i_mmap lock nor anon_vma lock. All racing
> paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.
>
> Follow the pattern in retract_page_tables(); and using pte_free_defer()
> removes most of the need for tlb_remove_table_sync_one() here; but call
> pmdp_get_lockless_sync() to use it in the PAE case.
>
> First check the VMA, in case page tables are being torn down: from JannH.
> Confirm the preliminary find_pmd_or_thp_or_none() once page lock has been
> acquired and the page looks suitable: from then on its state is stable.
>
> However, collapse_pte_mapped_thp() was doing something others don't:
> freeing a page table still containing "valid" entries. i_mmap lock did
> stop a racing truncate from double-freeing those pages, but we prefer
> collapse_pte_mapped_thp() to clear the entries as usual. Their TLB
> flush can wait until the pmdp_collapse_flush() which follows, but the
> mmu_notifier_invalidate_range_start() has to be done earlier.
>
> Do the "step 1" checking loop without mmu_notifier: it wouldn't be good
> for khugepaged to keep on repeatedly invalidating a range which is then
> found unsuitable e.g. contains COWs. "step 2", which does the clearing,
> must then be more careful (after dropping ptl to do mmu_notifier), with
> abort prepared to correct the accounting like "step 3". But with those
> entries now cleared, "step 4" (after dropping ptl to do pmd_lock) is kept
> safe by the huge page lock, which stops new PTEs from being faulted in.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> mm/khugepaged.c | 172 ++++++++++++++++++++++----------------------------
> 1 file changed, 77 insertions(+), 95 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 3bb05147961b..46986eb4eebb 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1483,7 +1483,7 @@ static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
> return ret;
> }
>
> -/* hpage must be locked, and mmap_lock must be held in write */
> +/* hpage must be locked, and mmap_lock must be held */
> static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
> pmd_t *pmdp, struct page *hpage)
> {
> @@ -1495,7 +1495,7 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
> };
>
> VM_BUG_ON(!PageTransHuge(hpage));
> - mmap_assert_write_locked(vma->vm_mm);
> + mmap_assert_locked(vma->vm_mm);
>
> if (do_set_pmd(&vmf, hpage))
> return SCAN_FAIL;
> @@ -1504,48 +1504,6 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
> return SCAN_SUCCEED;
> }
>
> -/*
> - * A note about locking:
> - * Trying to take the page table spinlocks would be useless here because those
> - * are only used to synchronize:
> - *
> - * - modifying terminal entries (ones that point to a data page, not to another
> - * page table)
> - * - installing *new* non-terminal entries
> - *
> - * Instead, we need roughly the same kind of protection as free_pgtables() or
> - * mm_take_all_locks() (but only for a single VMA):
> - * The mmap lock together with this VMA's rmap locks covers all paths towards
> - * the page table entries we're messing with here, except for hardware page
> - * table walks and lockless_pages_from_mm().
> - */
> -static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
> - unsigned long addr, pmd_t *pmdp)
> -{
> - pmd_t pmd;
> - struct mmu_notifier_range range;
> -
> - mmap_assert_write_locked(mm);
> - if (vma->vm_file)
> - lockdep_assert_held_write(&vma->vm_file->f_mapping->i_mmap_rwsem);
> - /*
> - * All anon_vmas attached to the VMA have the same root and are
> - * therefore locked by the same lock.
> - */
> - if (vma->anon_vma)
> - lockdep_assert_held_write(&vma->anon_vma->root->rwsem);
> -
> - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr,
> - addr + HPAGE_PMD_SIZE);
> - mmu_notifier_invalidate_range_start(&range);
> - pmd = pmdp_collapse_flush(vma, addr, pmdp);
> - tlb_remove_table_sync_one();
> - mmu_notifier_invalidate_range_end(&range);
> - mm_dec_nr_ptes(mm);
> - page_table_check_pte_clear_range(mm, addr, pmd);
> - pte_free(mm, pmd_pgtable(pmd));
> -}
> -
> /**
> * collapse_pte_mapped_thp - Try to collapse a pte-mapped THP for mm at
> * address haddr.
> @@ -1561,26 +1519,29 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
> int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> bool install_pmd)
> {
> + struct mmu_notifier_range range;
> + bool notified = false;
> unsigned long haddr = addr & HPAGE_PMD_MASK;
> struct vm_area_struct *vma = vma_lookup(mm, haddr);
> struct page *hpage;
> pte_t *start_pte, *pte;
> - pmd_t *pmd;
> - spinlock_t *ptl;
> - int count = 0, result = SCAN_FAIL;
> + pmd_t *pmd, pgt_pmd;
> + spinlock_t *pml, *ptl;
> + int nr_ptes = 0, result = SCAN_FAIL;
> int i;
>
> - mmap_assert_write_locked(mm);
> + mmap_assert_locked(mm);
> +
> + /* First check VMA found, in case page tables are being torn down */
> + if (!vma || !vma->vm_file ||
> + !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
> + return SCAN_VMA_CHECK;
>
> /* Fast check before locking page if already PMD-mapped */
> result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
> if (result == SCAN_PMD_MAPPED)
> return result;
>
> - if (!vma || !vma->vm_file ||
> - !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
> - return SCAN_VMA_CHECK;
> -
> /*
> * If we are here, we've succeeded in replacing all the native pages
> * in the page cache with a single hugepage. If a mm were to fault-in
> @@ -1610,6 +1571,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> goto drop_hpage;
> }
>
> + result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
> switch (result) {
> case SCAN_SUCCEED:
> break;
> @@ -1623,27 +1585,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> goto drop_hpage;
> }
>
> - /* Lock the vma before taking i_mmap and page table locks */
> - vma_start_write(vma);
> -
> - /*
> - * We need to lock the mapping so that from here on, only GUP-fast and
> - * hardware page walks can access the parts of the page tables that
> - * we're operating on.
> - * See collapse_and_free_pmd().
> - */
> - i_mmap_lock_write(vma->vm_file->f_mapping);
> -
> - /*
> - * This spinlock should be unnecessary: Nobody else should be accessing
> - * the page tables under spinlock protection here, only
> - * lockless_pages_from_mm() and the hardware page walker can access page
> - * tables while all the high-level locks are held in write mode.
> - */
> result = SCAN_FAIL;
> start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
> - if (!start_pte)
> - goto drop_immap;
> + if (!start_pte) /* mmap_lock + page lock should prevent this */
> + goto drop_hpage;
>
> /* step 1: check all mapped PTEs are to the right huge page */
> for (i = 0, addr = haddr, pte = start_pte;
> @@ -1670,10 +1615,18 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> */
> if (hpage + i != page)
> goto abort;
> - count++;
> }
>
> - /* step 2: adjust rmap */
> + pte_unmap_unlock(start_pte, ptl);
> + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
> + haddr, haddr + HPAGE_PMD_SIZE);
> + mmu_notifier_invalidate_range_start(&range);
> + notified = true;
> + start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
> + if (!start_pte) /* mmap_lock + page lock should prevent this */
> + goto abort;
> +
> + /* step 2: clear page table and adjust rmap */
> for (i = 0, addr = haddr, pte = start_pte;
> i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
> struct page *page;
> @@ -1681,47 +1634,76 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>
> if (pte_none(ptent))
> continue;
> - page = vm_normal_page(vma, addr, ptent);
> - if (WARN_ON_ONCE(page && is_zone_device_page(page)))
> + /*
> + * We dropped ptl after the first scan, to do the mmu_notifier:
> + * page lock stops more PTEs of the hpage being faulted in, but
> + * does not stop write faults COWing anon copies from existing
> + * PTEs; and does not stop those being swapped out or migrated.
> + */
> + if (!pte_present(ptent)) {
> + result = SCAN_PTE_NON_PRESENT;
> goto abort;
> + }
> + page = vm_normal_page(vma, addr, ptent);
> + if (hpage + i != page)
> + goto abort;
> +
> + /*
> + * Must clear entry, or a racing truncate may re-remove it.
> + * TLB flush can be left until pmdp_collapse_flush() does it.
> + * PTE dirty? Shmem page is already dirty; file is read-only.
> + */
> + pte_clear(mm, addr, pte);

This is not a non-present PTE entry, so we should call ptep_clear() to let
page_table_check track the PTE clearing operation, right? Otherwise it
may lead to false positives?

Thanks,
Qi

> page_remove_rmap(page, vma, false);
> + nr_ptes++;
> }
>
> pte_unmap_unlock(start_pte, ptl);
>
> /* step 3: set proper refcount and mm_counters. */
> - if (count) {
> - page_ref_sub(hpage, count);
> - add_mm_counter(vma->vm_mm, mm_counter_file(hpage), -count);
> + if (nr_ptes) {
> + page_ref_sub(hpage, nr_ptes);
> + add_mm_counter(mm, mm_counter_file(hpage), -nr_ptes);
> }
>
> - /* step 4: remove pte entries */
> - /* we make no change to anon, but protect concurrent anon page lookup */
> - if (vma->anon_vma)
> - anon_vma_lock_write(vma->anon_vma);
> + /* step 4: remove page table */
>
> - collapse_and_free_pmd(mm, vma, haddr, pmd);
> + /* Huge page lock is still held, so page table must remain empty */
> + pml = pmd_lock(mm, pmd);
> + if (ptl != pml)
> + spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> + pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
> + pmdp_get_lockless_sync();
> + if (ptl != pml)
> + spin_unlock(ptl);
> + spin_unlock(pml);
>
> - if (vma->anon_vma)
> - anon_vma_unlock_write(vma->anon_vma);
> - i_mmap_unlock_write(vma->vm_file->f_mapping);
> + mmu_notifier_invalidate_range_end(&range);
> +
> + mm_dec_nr_ptes(mm);
> + page_table_check_pte_clear_range(mm, haddr, pgt_pmd);
> + pte_free_defer(mm, pmd_pgtable(pgt_pmd));
>
> maybe_install_pmd:
> /* step 5: install pmd entry */
> result = install_pmd
> ? set_huge_pmd(vma, haddr, pmd, hpage)
> : SCAN_SUCCEED;
> -
> + goto drop_hpage;
> +abort:
> + if (nr_ptes) {
> + flush_tlb_mm(mm);
> + page_ref_sub(hpage, nr_ptes);
> + add_mm_counter(mm, mm_counter_file(hpage), -nr_ptes);
> + }
> + if (start_pte)
> + pte_unmap_unlock(start_pte, ptl);
> + if (notified)
> + mmu_notifier_invalidate_range_end(&range);
> drop_hpage:
> unlock_page(hpage);
> put_page(hpage);
> return result;
> -
> -abort:
> - pte_unmap_unlock(start_pte, ptl);
> -drop_immap:
> - i_mmap_unlock_write(vma->vm_file->f_mapping);
> - goto drop_hpage;
> }
>
> static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_slot)
> @@ -2855,9 +2837,9 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> case SCAN_PTE_MAPPED_HUGEPAGE:
> BUG_ON(mmap_locked);
> BUG_ON(*prev);
> - mmap_write_lock(mm);
> + mmap_read_lock(mm);
> result = collapse_pte_mapped_thp(mm, addr, true);
> - mmap_write_unlock(mm);
> + mmap_locked = true;
> goto handle_result;
> /* Whitelisted set of results where continuing OK */
> case SCAN_PMD_NULL:

2023-08-06 05:07:12

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH v3 10/13 fix2] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock(): fix2

Use ptep_clear() instead of pte_clear(): when CONFIG_PAGE_TABLE_CHECK=y,
ptep_clear() adds some accounting, missing which would cause a BUG later.

Signed-off-by: Hugh Dickins <[email protected]>
Reported-by: Qi Zheng <[email protected]>
Closes: https://lore.kernel.org/linux-mm/[email protected]/
---
mm/khugepaged.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index bb76a5d454de..78fc1a24a1cc 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1603,7 +1603,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
* TLB flush can be left until pmdp_collapse_flush() does it.
* PTE dirty? Shmem page is already dirty; file is read-only.
*/
- pte_clear(mm, addr, pte);
+ ptep_clear(mm, addr, pte);
page_remove_rmap(page, vma, false);
nr_ptes++;
}
--
2.35.3


2023-08-06 05:28:57

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()

On Thu, 3 Aug 2023, Qi Zheng wrote:
> On 2023/7/12 12:42, Hugh Dickins wrote:
> > Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
> > It does need mmap_read_lock(), but it does not need mmap_write_lock(),
> > nor vma_start_write() nor i_mmap lock nor anon_vma lock. All racing
> > paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.
...
> > @@ -1681,47 +1634,76 @@ int collapse_pte_mapped_thp(struct mm_struct *mm,
> > unsigned long addr,
> >
> > if (pte_none(ptent))
> > continue;
> > - page = vm_normal_page(vma, addr, ptent);
> > - if (WARN_ON_ONCE(page && is_zone_device_page(page)))
> > + /*
> > + * We dropped ptl after the first scan, to do the
> > mmu_notifier:
> > + * page lock stops more PTEs of the hpage being faulted in,
> > but
> > + * does not stop write faults COWing anon copies from existing
> > + * PTEs; and does not stop those being swapped out or
> > migrated.
> > + */
> > + if (!pte_present(ptent)) {
> > + result = SCAN_PTE_NON_PRESENT;
> > goto abort;
> > + }
> > + page = vm_normal_page(vma, addr, ptent);
> > + if (hpage + i != page)
> > + goto abort;
> > +
> > + /*
> > + * Must clear entry, or a racing truncate may re-remove it.
> > + * TLB flush can be left until pmdp_collapse_flush() does it.
> > + * PTE dirty? Shmem page is already dirty; file is read-only.
> > + */
> > + pte_clear(mm, addr, pte);
>
> This is not a non-present PTE entry, so we should call ptep_clear() to let
> page_table_check track the PTE clearing operation, right? Otherwise it
> may lead to false positives?

You are right: thanks a lot for catching that: fix patch follows.

Hugh

2023-08-07 02:54:58

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()



On 2023/8/6 11:55, Hugh Dickins wrote:
> On Thu, 3 Aug 2023, Qi Zheng wrote:
>> On 2023/7/12 12:42, Hugh Dickins wrote:
>>> Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
>>> It does need mmap_read_lock(), but it does not need mmap_write_lock(),
>>> nor vma_start_write() nor i_mmap lock nor anon_vma lock. All racing
>>> paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.
> ...
>>> @@ -1681,47 +1634,76 @@ int collapse_pte_mapped_thp(struct mm_struct *mm,
>>> unsigned long addr,
>>>
>>> if (pte_none(ptent))
>>> continue;
>>> - page = vm_normal_page(vma, addr, ptent);
>>> - if (WARN_ON_ONCE(page && is_zone_device_page(page)))
>>> + /*
>>> + * We dropped ptl after the first scan, to do the
>>> mmu_notifier:
>>> + * page lock stops more PTEs of the hpage being faulted in,
>>> but
>>> + * does not stop write faults COWing anon copies from existing
>>> + * PTEs; and does not stop those being swapped out or
>>> migrated.
>>> + */
>>> + if (!pte_present(ptent)) {
>>> + result = SCAN_PTE_NON_PRESENT;
>>> goto abort;
>>> + }
>>> + page = vm_normal_page(vma, addr, ptent);
>>> + if (hpage + i != page)
>>> + goto abort;
>>> +
>>> + /*
>>> + * Must clear entry, or a racing truncate may re-remove it.
>>> + * TLB flush can be left until pmdp_collapse_flush() does it.
>>> + * PTE dirty? Shmem page is already dirty; file is read-only.
>>> + */
>>> + pte_clear(mm, addr, pte);
>>
>> This is not a non-present PTE entry, so we should call ptep_clear() to let
>> page_table_check track the PTE clearing operation, right? Otherwise it
>> may lead to false positives?
>
> You are right: thanks a lot for catching that: fix patch follows.

With fix patch:

Reviewed-by: Qi Zheng <[email protected]>

Thanks.

>
> Hugh