Based on latest mm-unstable (96aa38b69507).
This can be seen as a follow-up series to Mike's recent hugetlb vma lock
series for pmd unsharing, so this series also depends on that one.
Hopefully this series can make it a more complete resolution for pmd
unsharing.
PS: so far no one strongly ACKed this, let me keep the RFC tag. But I
think I'm already more confident than many of the RFCs I posted.
PS2: there're a lot of changes comparing to rfcv1, so I'm just not adding
the changelog. The whole idea is still the same, though.
Problem
=======
huge_pte_offset() is a major helper used by hugetlb code paths to walk a
hugetlb pgtable. It's used mostly everywhere since that's needed even
before taking the pgtable lock.
huge_pte_offset() is always called with mmap lock held with either read or
write.
For normal memory types that's far enough, since any pgtable removal
requires mmap write lock (e.g. munmap or mm destructions). However hugetlb
has the pmd unshare feature, it means not only the pgtable page can be gone
from under us when we're doing a walking, but also the pgtable page we're
walking (even after unshared, in this case it can only be the huge PUD page
which contains 512 huge pmd entries, with the vma VM_SHARED mapped). It's
possible because even though freeing the pgtable page requires mmap write
lock, it doesn't help us when we're walking on another mm's pgtable, so
it's still on risk even if we're with the current->mm's mmap lock.
The recent work from Mike on vma lock can resolve most of this already.
It's achieved by forbidden pmd unsharing during the lock being taken, so no
further risk of the pgtable page being freed. It means if we can take the
vma lock around all huge_pte_offset() callers it'll be safe.
There're already a bunch of them that we did as per the latest mm-unstable,
but also quite a few others that we didn't for various reasons. E.g. it
may not be applicable for not-allow-to-sleep contexts like FOLL_NOWAIT.
Or, huge_pmd_share() is actually a tricky user of huge_pte_offset(),
because even if we took the vma lock, we're walking on another mm's vma!
Taking vma lock for all the vmas are probably not gonna work.
I have totally no report showing that I can trigger such a race, but from
code wise I never see anything that stops the race from happening. This
series is trying to resolve that problem.
Resolution
==========
What this patch proposed, besides using the vma lock, is that we can also
use other ways to protect the pgtable page from being freed from under us
in huge_pte_offset() context. The idea is kind of similar to RCU fast-gup.
Note that fast-gup is very safe regarding pmd unsharing even before vma
lock, because fast-gup relies on RCU to protect walking any pgtable page,
including another mm's. So fast-gup will never hit a freed page even if
pmd sharing is possible.
To apply similar same idea to huge_pte_offset(), it means with proper RCU
protection the pte_t* pointer returned from huge_pte_offset() can also be
always safe to access and de-reference, along with the pgtable lock that
was bound to the pgtable page. Note that RCU will only work to protect
pgtables if MMU_GATHER_RCU_TABLE_FREE=y. For the rest we need to disable
irq. Of course, the whole locking idea is not needed if pmd sharing is not
possible at all, or, on private hugetlb mappings.
Patch Layout
============
Patch 1-3: cleanup, or dependency of the follow up patches
Patch 4: the core patch to introduce hugetlb walker lock
Patch 5-11: each patch resolves one possible race condition
Patch 12: introduce hugetlb_walk() to replace huge_pte_offset()
Tests
=====
Only lightly tested on hugetlb kselftests including uffd.
Comments welcomed, thanks.
Peter Xu (12):
mm/hugetlb: Let vma_offset_start() to return start
mm/hugetlb: Move swap entry handling into vma lock for fault
mm/hugetlb: Don't wait for migration entry during follow page
mm/hugetlb: Add pgtable walker lock
mm/hugetlb: Make userfaultfd_huge_must_wait() safe to pmd unshare
mm/hugetlb: Protect huge_pmd_share() with walker lock
mm/hugetlb: Use hugetlb walker lock in hugetlb_follow_page_mask()
mm/hugetlb: Use hugetlb walker lock in follow_hugetlb_page()
mm/hugetlb: Use hugetlb walker lock in hugetlb_vma_maps_page()
mm/hugetlb: Use hugetlb walker lock in walk_hugetlb_range()
mm/hugetlb: Use hugetlb walker lock in page_vma_mapped_walk()
mm/hugetlb: Introduce hugetlb_walk()
arch/s390/mm/gmap.c | 2 +
fs/hugetlbfs/inode.c | 41 +++++++-------
fs/proc/task_mmu.c | 2 +
fs/userfaultfd.c | 24 ++++++---
include/linux/hugetlb.h | 112 +++++++++++++++++++++++++++++++++++++++
include/linux/pagewalk.h | 9 +++-
include/linux/rmap.h | 4 ++
mm/hugetlb.c | 97 +++++++++++++++++----------------
mm/page_vma_mapped.c | 7 ++-
mm/pagewalk.c | 6 +--
10 files changed, 224 insertions(+), 80 deletions(-)
--
2.37.3
Hugetlb walker lock makes sure the pte_t* won't go away from under us.
Signed-off-by: Peter Xu <[email protected]>
---
mm/hugetlb.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5ef883184885..fc49e3ca8acd 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6222,9 +6222,10 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
if (WARN_ON_ONCE(flags & FOLL_PIN))
return NULL;
+ hugetlb_walker_lock();
pte = huge_pte_offset(mm, haddr, huge_page_size(h));
if (!pte)
- return NULL;
+ goto out_unlock;
ptl = huge_pte_lock(h, mm, pte);
entry = huge_ptep_get(pte);
@@ -6247,6 +6248,8 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
}
out:
spin_unlock(ptl);
+out_unlock:
+ hugetlb_walker_unlock();
return page;
}
--
2.37.3
Hugetlb walker lock makes sure the pte_t* won't go away from under us. One
thing to mention is there're two hugetlb_entry() users that can yield the
thread within hugetlb_entry(), that'll need to add unlock/lock pair around
the yield, meanwhile document hugetlb_entry() explaining the lock for
sleepable hugetlb_entry()s.
Signed-off-by: Peter Xu <[email protected]>
---
arch/s390/mm/gmap.c | 2 ++
fs/proc/task_mmu.c | 2 ++
include/linux/pagewalk.h | 9 ++++++++-
mm/pagewalk.c | 2 ++
4 files changed, 14 insertions(+), 1 deletion(-)
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 02d15c8dc92e..fb2938e8d1c7 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2644,7 +2644,9 @@ static int __s390_enable_skey_hugetlb(pte_t *pte, unsigned long addr,
end = start + HPAGE_SIZE - 1;
__storage_key_init_range(start, end);
set_bit(PG_arch_1, &page->flags);
+ hugetlb_walker_unlock();
cond_resched();
+ hugetlb_walker_lock();
return 0;
}
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 89338950afd3..ed750a52e60b 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1612,7 +1612,9 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
frame++;
}
+ hugetlb_walker_unlock();
cond_resched();
+ hugetlb_walker_lock();
return err;
}
diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index 959f52e5867d..7fa3724c6eb5 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -21,7 +21,14 @@ struct mm_walk;
* depth is -1 if not known, 0:PGD, 1:P4D, 2:PUD, 3:PMD.
* Any folded depths (where PTRS_PER_P?D is equal to 1)
* are skipped.
- * @hugetlb_entry: if set, called for each hugetlb entry
+ * @hugetlb_entry: if set, called for each hugetlb entry. Note that
+ * each pte_t* is protected by hugetlb_walker_lock(),
+ * and the lock does not allow sleep. If explicit
+ * sleep in the entry fn needed, the caller needs to
+ * release the lock (hugetlb_walker_unlock()), then
+ * relock it (hugetlb_walker_lock()) before return.
+ * After the unlock, the pte_t* may become invalid
+ * anytime so cannot be accessed anymore.
* @test_walk: caller specific callback function to determine whether
* we walk over the current vma or not. Returning 0 means
* "do page table walk over the current vma", returning
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 7f1c9b274906..abf310011ab1 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -302,6 +302,7 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
const struct mm_walk_ops *ops = walk->ops;
int err = 0;
+ hugetlb_walker_lock();
do {
next = hugetlb_entry_end(h, addr, end);
pte = huge_pte_offset(walk->mm, addr & hmask, sz);
@@ -314,6 +315,7 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
if (err)
break;
} while (addr = next, addr != end);
+ hugetlb_walker_unlock();
return err;
}
--
2.37.3
That's what the code does with !hugetlb pages, so we should logically do
the same for hugetlb, so migration entry will also be treated as no page.
This is probably also the last piece in follow_page code that may sleep,
the last one should be removed in cf994dd8af27 ("mm/gup: remove
FOLL_MIGRATION", 2022-11-16).
Signed-off-by: Peter Xu <[email protected]>
---
mm/hugetlb.c | 11 -----------
1 file changed, 11 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 62ff3fc51d4e..61a1fa678172 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6222,7 +6222,6 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
if (WARN_ON_ONCE(flags & FOLL_PIN))
return NULL;
-retry:
pte = huge_pte_offset(mm, haddr, huge_page_size(h));
if (!pte)
return NULL;
@@ -6245,16 +6244,6 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
page = NULL;
goto out;
}
- } else {
- if (is_hugetlb_entry_migration(entry)) {
- spin_unlock(ptl);
- __migration_entry_wait_huge(pte, ptl);
- goto retry;
- }
- /*
- * hwpoisoned entry is treated as no_page_table in
- * follow_page_mask().
- */
}
out:
spin_unlock(ptl);
--
2.37.3
Hugetlb walker lock makes sure the pte_t* won't go away from under us.
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/rmap.h | 4 ++++
mm/page_vma_mapped.c | 5 ++++-
2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 011a7530dc76..94d25a67db2b 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -13,6 +13,7 @@
#include <linux/highmem.h>
#include <linux/pagemap.h>
#include <linux/memremap.h>
+#include <linux/hugetlb.h>
/*
* The anon_vma heads a list of private "related" vmas, to scan if
@@ -408,6 +409,9 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
pte_unmap(pvmw->pte);
if (pvmw->ptl)
spin_unlock(pvmw->ptl);
+ /* This needs to be after unlock of the spinlock */
+ if (is_vm_hugetlb_page(pvmw->vma))
+ hugetlb_walker_unlock();
}
bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw);
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 93e13fc17d3c..5ac8a89130f6 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -169,10 +169,13 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
if (pvmw->pte)
return not_found(pvmw);
+ hugetlb_walker_lock();
/* when pud is not present, pte will be NULL */
pvmw->pte = huge_pte_offset(mm, pvmw->address, size);
- if (!pvmw->pte)
+ if (!pvmw->pte) {
+ hugetlb_walker_unlock();
return false;
+ }
pvmw->ptl = huge_pte_lock(hstate, mm, pvmw->pte);
if (!check_pte(pvmw))
--
2.37.3
huge_pte_offset() is the main walker function for hugetlb pgtables. The
name is not really representing what it does, though.
Instead of renaming it, introduce a wrapper function called hugetlb_walk()
which will use huge_pte_offset() inside. Assert on the locks so when
walking the pgtable.
Note, the vma lock assertion will be a no-op for private mappings.
Signed-off-by: Peter Xu <[email protected]>
---
fs/hugetlbfs/inode.c | 4 +---
fs/userfaultfd.c | 6 ++----
include/linux/hugetlb.h | 13 +++++++++++++
mm/hugetlb.c | 28 ++++++++++++++--------------
mm/page_vma_mapped.c | 2 +-
mm/pagewalk.c | 4 +---
6 files changed, 32 insertions(+), 25 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 265508981ba1..ed7934015290 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -390,9 +390,7 @@ static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
bool result = false;
hugetlb_walker_lock();
- ptep = huge_pte_offset(vma->vm_mm, addr,
- huge_page_size(hstate_vma(vma)));
-
+ ptep = hugetlb_walk(vma, addr, huge_page_size(hstate_vma(vma)));
if (!ptep)
goto out;
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index a602f008dde5..f31fe1a9f4c5 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -237,14 +237,12 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
unsigned long flags,
unsigned long reason)
{
- struct mm_struct *mm = ctx->mm;
pte_t *ptep, pte;
bool ret = true;
- mmap_assert_locked(mm);
-
- ptep = huge_pte_offset(mm, address, vma_mmu_pagesize(vma));
+ mmap_assert_locked(ctx->mm);
+ ptep = hugetlb_walk(vma, address, vma_mmu_pagesize(vma));
if (!ptep)
goto out;
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8f85ad0d5bdb..e75b1ffb93a5 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1296,4 +1296,17 @@ bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr);
#define flush_hugetlb_tlb_range(vma, addr, end) flush_tlb_range(vma, addr, end)
#endif
+/*
+ * Safe version of huge_pte_offset() to check the locks. See comments
+ * above huge_pte_offset().
+ */
+static inline pte_t *
+hugetlb_walk(struct vm_area_struct *vma, unsigned long addr, unsigned long sz)
+{
+ if (!hugetlb_walker_locked())
+ hugetlb_vma_assert_locked(vma);
+
+ return huge_pte_offset(vma->vm_mm, addr, sz);
+}
+
#endif /* _LINUX_HUGETLB_H */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e81af6a46c59..6c77ae7a3d94 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4814,7 +4814,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
} else {
/*
* For shared mappings the vma lock must be held before
- * calling huge_pte_offset in the src vma. Otherwise, the
+ * calling hugetlb_walk() in the src vma. Otherwise, the
* returned ptep could go away if part of a shared pmd and
* another thread calls huge_pmd_unshare.
*/
@@ -4824,7 +4824,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
last_addr_mask = hugetlb_mask_last_page(h);
for (addr = src_vma->vm_start; addr < src_vma->vm_end; addr += sz) {
spinlock_t *src_ptl, *dst_ptl;
- src_pte = huge_pte_offset(src, addr, sz);
+ src_pte = hugetlb_walk(src_vma, addr, sz);
if (!src_pte) {
addr |= last_addr_mask;
continue;
@@ -5028,7 +5028,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
hugetlb_vma_lock_write(vma);
i_mmap_lock_write(mapping);
for (; old_addr < old_end; old_addr += sz, new_addr += sz) {
- src_pte = huge_pte_offset(mm, old_addr, sz);
+ src_pte = hugetlb_walk(vma, old_addr, sz);
if (!src_pte) {
old_addr |= last_addr_mask;
new_addr |= last_addr_mask;
@@ -5091,7 +5091,7 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
last_addr_mask = hugetlb_mask_last_page(h);
address = start;
for (; address < end; address += sz) {
- ptep = huge_pte_offset(mm, address, sz);
+ ptep = hugetlb_walk(vma, address, sz);
if (!ptep) {
address |= last_addr_mask;
continue;
@@ -5404,7 +5404,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
mutex_lock(&hugetlb_fault_mutex_table[hash]);
hugetlb_vma_lock_read(vma);
spin_lock(ptl);
- ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
+ ptep = hugetlb_walk(vma, haddr, huge_page_size(h));
if (likely(ptep &&
pte_same(huge_ptep_get(ptep), pte)))
goto retry_avoidcopy;
@@ -5442,7 +5442,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
* before the page tables are altered
*/
spin_lock(ptl);
- ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
+ ptep = hugetlb_walk(vma, haddr, huge_page_size(h));
if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
/* Break COW or unshare */
huge_ptep_clear_flush(vma, haddr, ptep);
@@ -5839,7 +5839,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* until finished with ptep. This prevents huge_pmd_unshare from
* being called elsewhere and making the ptep no longer valid.
*
- * ptep could have already be assigned via huge_pte_offset. That
+ * ptep could have already be assigned via hugetlb_walk(). That
* is OK, as huge_pte_alloc will return the same value unless
* something has changed.
*/
@@ -6223,7 +6223,7 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
return NULL;
hugetlb_walker_lock();
- pte = huge_pte_offset(mm, haddr, huge_page_size(h));
+ pte = hugetlb_walk(vma, haddr, huge_page_size(h));
if (!pte)
goto out_unlock;
@@ -6288,8 +6288,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
*
* Note that page table lock is not held when pte is null.
*/
- pte = huge_pte_offset(mm, vaddr & huge_page_mask(h),
- huge_page_size(h));
+ pte = hugetlb_walk(vma, vaddr & huge_page_mask(h),
+ huge_page_size(h));
if (pte)
ptl = huge_pte_lock(h, mm, pte);
absent = !pte || huge_pte_none(huge_ptep_get(pte));
@@ -6481,7 +6481,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
last_addr_mask = hugetlb_mask_last_page(h);
for (; address < end; address += psize) {
spinlock_t *ptl;
- ptep = huge_pte_offset(mm, address, psize);
+ ptep = hugetlb_walk(vma, address, psize);
if (!ptep) {
address |= last_addr_mask;
continue;
@@ -7037,8 +7037,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
saddr = page_table_shareable(svma, vma, addr, idx);
if (saddr) {
- spte = huge_pte_offset(svma->vm_mm, saddr,
- vma_mmu_pagesize(svma));
+ spte = hugetlb_walk(svma, saddr,
+ vma_mmu_pagesize(svma));
/*
* When page ref==0, it means it's probably being
* freed; continue with the next one.
@@ -7400,7 +7400,7 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
hugetlb_vma_lock_write(vma);
i_mmap_lock_write(vma->vm_file->f_mapping);
for (address = start; address < end; address += PUD_SIZE) {
- ptep = huge_pte_offset(mm, address, sz);
+ ptep = hugetlb_walk(vma, address, sz);
if (!ptep)
continue;
ptl = huge_pte_lock(h, mm, ptep);
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 5ac8a89130f6..72d72cd73c8f 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -171,7 +171,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
hugetlb_walker_lock();
/* when pud is not present, pte will be NULL */
- pvmw->pte = huge_pte_offset(mm, pvmw->address, size);
+ pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
if (!pvmw->pte) {
hugetlb_walker_unlock();
return false;
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index abf310011ab1..469e60b1b096 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -305,13 +305,11 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
hugetlb_walker_lock();
do {
next = hugetlb_entry_end(h, addr, end);
- pte = huge_pte_offset(walk->mm, addr & hmask, sz);
-
+ pte = hugetlb_walk(vma, addr & hmask, sz);
if (pte)
err = ops->hugetlb_entry(pte, hmask, addr, next, walk);
else if (ops->pte_hole)
err = ops->pte_hole(addr, next, -1, walk);
-
if (err)
break;
} while (addr = next, addr != end);
--
2.37.3
Hugetlb walker lock makes sure the pte_t* won't go away from under us.
Some trick is used to release the walker lock slightly earlier when we
found present pte.
Signed-off-by: Peter Xu <[email protected]>
---
mm/hugetlb.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index fc49e3ca8acd..e81af6a46c59 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6280,6 +6280,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
break;
}
+ hugetlb_walker_lock();
/*
* Some archs (sparc64, sh*) have multiple pte_ts to
* each hugepage. We have to make sure we get the
@@ -6304,6 +6305,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
!hugetlbfs_pagecache_present(h, vma, vaddr)) {
if (pte)
spin_unlock(ptl);
+ hugetlb_walker_unlock();
remainder = 0;
break;
}
@@ -6325,6 +6327,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (pte)
spin_unlock(ptl);
+ hugetlb_walker_unlock();
+
if (flags & FOLL_WRITE)
fault_flags |= FAULT_FLAG_WRITE;
else if (unshare)
@@ -6367,6 +6371,15 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
continue;
}
+ /*
+ * When reach here, it means the pteval is not absent, so
+ * anyone who wants to free and invalidate the pgtable page
+ * (aka, pte*) should need to first unmap the entries which
+ * relies on the pgtable lock. Since we're holding it,
+ * we're safe even without the walker lock anymore.
+ */
+ hugetlb_walker_unlock();
+
pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
page = pte_page(huge_ptep_get(pte));
--
2.37.3
Hugetlb walker lock makes sure the pte_t* won't go away from under us.
Signed-off-by: Peter Xu <[email protected]>
---
fs/hugetlbfs/inode.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index fdb16246f46e..265508981ba1 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -387,21 +387,24 @@ static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
unsigned long addr, struct page *page)
{
pte_t *ptep, pte;
+ bool result = false;
+ hugetlb_walker_lock();
ptep = huge_pte_offset(vma->vm_mm, addr,
huge_page_size(hstate_vma(vma)));
if (!ptep)
- return false;
+ goto out;
pte = huge_ptep_get(ptep);
if (huge_pte_none(pte) || !pte_present(pte))
- return false;
+ goto out;
if (pte_page(pte) == page)
- return true;
-
- return false;
+ result = true;
+out:
+ hugetlb_walker_unlock();
+ return result;
}
/*
--
2.37.3
huge_pmd_share() is normally called with vma lock held already, it lets us
feel like we don't really need the walker lock. But that's not true,
because we only took the vma lock for "current" vma, but not all the vma
pgtables we're going to walk upon.
Taking each vma lock may lead to deadlock and hard to order. The safe
approach is just to use the walker lock which guarantees the pgtable page
being alive, then we should use get_page_unless_zero() rather than
get_page() only, to make sure the pgtable page is not being freed at the
meantime.
Signed-off-by: Peter Xu <[email protected]>
---
mm/hugetlb.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 61a1fa678172..5ef883184885 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7008,6 +7008,13 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
spinlock_t *ptl;
i_mmap_lock_read(mapping);
+
+ /*
+ * NOTE: even if we've got the vma read lock, here we still need to
+ * take the walker lock, because we're not walking the current vma,
+ * but some other mm's!
+ */
+ hugetlb_walker_lock();
vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
if (svma == vma)
continue;
@@ -7016,12 +7023,15 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
if (saddr) {
spte = huge_pte_offset(svma->vm_mm, saddr,
vma_mmu_pagesize(svma));
- if (spte) {
- get_page(virt_to_page(spte));
+ /*
+ * When page ref==0, it means it's probably being
+ * freed; continue with the next one.
+ */
+ if (spte && get_page_unless_zero(virt_to_page(spte)))
break;
- }
}
}
+ hugetlb_walker_unlock();
if (!spte)
goto out;
--
2.37.3
On 18.11.22 02:10, Peter Xu wrote:
> Based on latest mm-unstable (96aa38b69507).
>
> This can be seen as a follow-up series to Mike's recent hugetlb vma lock
> series for pmd unsharing, so this series also depends on that one.
> Hopefully this series can make it a more complete resolution for pmd
> unsharing.
>
> PS: so far no one strongly ACKed this, let me keep the RFC tag. But I
> think I'm already more confident than many of the RFCs I posted.
>
> PS2: there're a lot of changes comparing to rfcv1, so I'm just not adding
> the changelog. The whole idea is still the same, though.
>
> Problem
> =======
>
> huge_pte_offset() is a major helper used by hugetlb code paths to walk a
> hugetlb pgtable. It's used mostly everywhere since that's needed even
> before taking the pgtable lock.
>
> huge_pte_offset() is always called with mmap lock held with either read or
> write.
>
> For normal memory types that's far enough, since any pgtable removal
> requires mmap write lock (e.g. munmap or mm destructions). However hugetlb
> has the pmd unshare feature, it means not only the pgtable page can be gone
> from under us when we're doing a walking, but also the pgtable page we're
> walking (even after unshared, in this case it can only be the huge PUD page
> which contains 512 huge pmd entries, with the vma VM_SHARED mapped). It's
> possible because even though freeing the pgtable page requires mmap write
> lock, it doesn't help us when we're walking on another mm's pgtable, so
> it's still on risk even if we're with the current->mm's mmap lock.
>
> The recent work from Mike on vma lock can resolve most of this already.
> It's achieved by forbidden pmd unsharing during the lock being taken, so no
> further risk of the pgtable page being freed. It means if we can take the
> vma lock around all huge_pte_offset() callers it'll be safe.
>
> There're already a bunch of them that we did as per the latest mm-unstable,
> but also quite a few others that we didn't for various reasons. E.g. it
> may not be applicable for not-allow-to-sleep contexts like FOLL_NOWAIT.
> Or, huge_pmd_share() is actually a tricky user of huge_pte_offset(),
> because even if we took the vma lock, we're walking on another mm's vma!
> Taking vma lock for all the vmas are probably not gonna work.
>
> I have totally no report showing that I can trigger such a race, but from
> code wise I never see anything that stops the race from happening. This
> series is trying to resolve that problem.
Let me try understand the basic problem first:
hugetlb walks page tables semi-lockless: while we hold the mmap lock, we
don't grab the page table locks. That's very hugetlb specific handling
and I assume hugetlb uses different mechanisms to sync against
MADV_DONTNEED, concurrent page fault s... but that's no news. hugetlb is
weird in many ways :)
So, IIUC, you want a mechanism to synchronize against PMD unsharing.
Can't we use some very basic locking for that?
Using RCU / disabling local irqs seems a bit excessive because we *are*
holding the mmap lock and only care about concurrent unsharing
--
Thanks,
David / dhildenb
Hi, David,
Thanks for taking a look.
On Wed, Nov 23, 2022 at 10:40:40AM +0100, David Hildenbrand wrote:
> On 18.11.22 02:10, Peter Xu wrote:
> > Based on latest mm-unstable (96aa38b69507).
> >
> > This can be seen as a follow-up series to Mike's recent hugetlb vma lock
> > series for pmd unsharing, so this series also depends on that one.
> > Hopefully this series can make it a more complete resolution for pmd
> > unsharing.
> >
> > PS: so far no one strongly ACKed this, let me keep the RFC tag. But I
> > think I'm already more confident than many of the RFCs I posted.
> >
> > PS2: there're a lot of changes comparing to rfcv1, so I'm just not adding
> > the changelog. The whole idea is still the same, though.
> >
> > Problem
> > =======
> >
> > huge_pte_offset() is a major helper used by hugetlb code paths to walk a
> > hugetlb pgtable. It's used mostly everywhere since that's needed even
> > before taking the pgtable lock.
> >
> > huge_pte_offset() is always called with mmap lock held with either read or
> > write.
> >
> > For normal memory types that's far enough, since any pgtable removal
> > requires mmap write lock (e.g. munmap or mm destructions). However hugetlb
> > has the pmd unshare feature, it means not only the pgtable page can be gone
> > from under us when we're doing a walking, but also the pgtable page we're
> > walking (even after unshared, in this case it can only be the huge PUD page
> > which contains 512 huge pmd entries, with the vma VM_SHARED mapped). It's
> > possible because even though freeing the pgtable page requires mmap write
> > lock, it doesn't help us when we're walking on another mm's pgtable, so
> > it's still on risk even if we're with the current->mm's mmap lock.
> >
> > The recent work from Mike on vma lock can resolve most of this already.
> > It's achieved by forbidden pmd unsharing during the lock being taken, so no
> > further risk of the pgtable page being freed. It means if we can take the
> > vma lock around all huge_pte_offset() callers it'll be safe.
[1]
> >
> > There're already a bunch of them that we did as per the latest mm-unstable,
> > but also quite a few others that we didn't for various reasons. E.g. it
> > may not be applicable for not-allow-to-sleep contexts like FOLL_NOWAIT.
> > Or, huge_pmd_share() is actually a tricky user of huge_pte_offset(),
[2]
> > because even if we took the vma lock, we're walking on another mm's vma!
> > Taking vma lock for all the vmas are probably not gonna work.
> >
> > I have totally no report showing that I can trigger such a race, but from
> > code wise I never see anything that stops the race from happening. This
> > series is trying to resolve that problem.
>
> Let me try understand the basic problem first:
>
> hugetlb walks page tables semi-lockless: while we hold the mmap lock, we
> don't grab the page table locks. That's very hugetlb specific handling and I
> assume hugetlb uses different mechanisms to sync against MADV_DONTNEED,
> concurrent page fault s... but that's no news. hugetlb is weird in many ways
> :)
>
> So, IIUC, you want a mechanism to synchronize against PMD unsharing. Can't
> we use some very basic locking for that?
Yes we can in most cases. Please refer to above paragraph [1] where I
referred Mike's recent work on vma lock. That's the basic locking we need
so far to protect pmd unsharing. I'll attach the link too in the next
post, which is here:
https://lore.kernel.org/r/[email protected]
>
> Using RCU / disabling local irqs seems a bit excessive because we *are*
> holding the mmap lock and only care about concurrent unsharing
The series wanted to address where the vma lock is not easy to take. It
originates from when I was reading Mike's other patch, I forgot why I did
that but I just noticed there's some code path that we may not want to take
a sleepable lock, e.g. in follow page code.
The other one is huge_pmd_share() where we may have the mmap lock for
current mm but we're fundamentally walking another mm. It'll be tricky to
take a sleepable lock in such condition too.
I mentioned these cases in the other paragraph above [2]. Let me try to
expand that in my next post too.
It's debatable whether all the rest places can only work with either RCU or
irq disabled, but the idea is at least it should speed up those paths when
we still can. Here, irqoff might be a bit heavy, but RCU lock should be
always superior to vma lock when possible, the payoff is we may still see
stale pgtable data (since unsharing can still happen in parallel), while
that can be completely avoided when we take the vma lock.
Thanks,
--
Peter Xu
On 11/23/22 10:09, Peter Xu wrote:
> On Wed, Nov 23, 2022 at 10:40:40AM +0100, David Hildenbrand wrote:
> > Let me try understand the basic problem first:
> >
> > hugetlb walks page tables semi-lockless: while we hold the mmap lock, we
> > don't grab the page table locks. That's very hugetlb specific handling and I
> > assume hugetlb uses different mechanisms to sync against MADV_DONTNEED,
> > concurrent page fault s... but that's no news. hugetlb is weird in many ways
> > :)
> >
> > So, IIUC, you want a mechanism to synchronize against PMD unsharing. Can't
> > we use some very basic locking for that?
>
> Yes we can in most cases. Please refer to above paragraph [1] where I
> referred Mike's recent work on vma lock. That's the basic locking we need
> so far to protect pmd unsharing. I'll attach the link too in the next
> post, which is here:
>
> https://lore.kernel.org/r/[email protected]
>
> >
> > Using RCU / disabling local irqs seems a bit excessive because we *are*
> > holding the mmap lock and only care about concurrent unsharing
>
> The series wanted to address where the vma lock is not easy to take. It
> originates from when I was reading Mike's other patch, I forgot why I did
> that but I just noticed there's some code path that we may not want to take
> a sleepable lock, e.g. in follow page code.
Yes, it was the patch suggested by David,
https://lore.kernel.org/linux-mm/[email protected]/
The issue was that FOLL_NOWAIT could be passed into follow_page_mask. If so,
then we do not want potentially sleep on the mutex.
Since you both are on this thread, I thought of/noticed a related issue. In
follow_hugetlb_page, it looks like we can call hugetlb_fault if FOLL_NOWAIT
is set. hugetlb_fault certainly has the potential for sleeping. Is this also
a similar issue?
--
Mike Kravetz
> The other one is huge_pmd_share() where we may have the mmap lock for
> current mm but we're fundamentally walking another mm. It'll be tricky to
> take a sleepable lock in such condition too.
>
> I mentioned these cases in the other paragraph above [2]. Let me try to
> expand that in my next post too.
>
> It's debatable whether all the rest places can only work with either RCU or
> irq disabled, but the idea is at least it should speed up those paths when
> we still can. Here, irqoff might be a bit heavy, but RCU lock should be
> always superior to vma lock when possible, the payoff is we may still see
> stale pgtable data (since unsharing can still happen in parallel), while
> that can be completely avoided when we take the vma lock.
>
> Thanks,
>
> --
> Peter Xu
>
On Wed, Nov 23, 2022 at 10:21:30AM -0800, Mike Kravetz wrote:
> On 11/23/22 10:09, Peter Xu wrote:
> > On Wed, Nov 23, 2022 at 10:40:40AM +0100, David Hildenbrand wrote:
> > > Let me try understand the basic problem first:
> > >
> > > hugetlb walks page tables semi-lockless: while we hold the mmap lock, we
> > > don't grab the page table locks. That's very hugetlb specific handling and I
> > > assume hugetlb uses different mechanisms to sync against MADV_DONTNEED,
> > > concurrent page fault s... but that's no news. hugetlb is weird in many ways
> > > :)
> > >
> > > So, IIUC, you want a mechanism to synchronize against PMD unsharing. Can't
> > > we use some very basic locking for that?
> >
> > Yes we can in most cases. Please refer to above paragraph [1] where I
> > referred Mike's recent work on vma lock. That's the basic locking we need
> > so far to protect pmd unsharing. I'll attach the link too in the next
> > post, which is here:
> >
> > https://lore.kernel.org/r/[email protected]
> >
> > >
> > > Using RCU / disabling local irqs seems a bit excessive because we *are*
> > > holding the mmap lock and only care about concurrent unsharing
> >
> > The series wanted to address where the vma lock is not easy to take. It
> > originates from when I was reading Mike's other patch, I forgot why I did
> > that but I just noticed there's some code path that we may not want to take
> > a sleepable lock, e.g. in follow page code.
>
> Yes, it was the patch suggested by David,
>
> https://lore.kernel.org/linux-mm/[email protected]/
>
> The issue was that FOLL_NOWAIT could be passed into follow_page_mask. If so,
> then we do not want potentially sleep on the mutex.
>
> Since you both are on this thread, I thought of/noticed a related issue. In
> follow_hugetlb_page, it looks like we can call hugetlb_fault if FOLL_NOWAIT
> is set. hugetlb_fault certainly has the potential for sleeping. Is this also
> a similar issue?
Yeah maybe the clean way to do this is when FAULT_FLAG_RETRY_NOWAIT is set
we should always try to not sleep at all.
But maybe that's also not urgently needed. So far I don't see any real
non-sleepable caller of it exists - the only one (kvm) can actually sleep..
It's definitely not wanted, as kvm only attach NOWAIT for an async fault,
so ideally any wait should be offloaded into async threads. Now with the
hugetlb code being able to sleep with NOWAIT, the waiting time will be
accounted to real fault time of vcpu and partly invalidate async page fault
handling. Said that, it also means no immediate fault would trigger either.
It's just that for the pmd unshare we can start to at least use non-sleep
version of the locks.
Now I'm more concerned with huge_pmd_share(), which seems to have no good
option but only the RCU approach.
One other thing I noticed is I cannot quickly figure out whether
follow_hugetlb_page() is needed anymore, since follow_page_mask() seems to
be also fine with walking hugetlb pgtables.
follow_hugetlb_page() can be traced back to the git initial commit, I had a
feeling that the old version of follow_page_mask() doesn't support hugetlb,
but now after it's supported maybe we can drop follow_hugetlb_page() as a
whole?
--
Peter Xu
On 23.11.22 19:56, Peter Xu wrote:
> On Wed, Nov 23, 2022 at 10:21:30AM -0800, Mike Kravetz wrote:
>> On 11/23/22 10:09, Peter Xu wrote:
>>> On Wed, Nov 23, 2022 at 10:40:40AM +0100, David Hildenbrand wrote:
>>>> Let me try understand the basic problem first:
>>>>
>>>> hugetlb walks page tables semi-lockless: while we hold the mmap lock, we
>>>> don't grab the page table locks. That's very hugetlb specific handling and I
>>>> assume hugetlb uses different mechanisms to sync against MADV_DONTNEED,
>>>> concurrent page fault s... but that's no news. hugetlb is weird in many ways
>>>> :)
>>>>
>>>> So, IIUC, you want a mechanism to synchronize against PMD unsharing. Can't
>>>> we use some very basic locking for that?
>>>
>>> Yes we can in most cases. Please refer to above paragraph [1] where I
>>> referred Mike's recent work on vma lock. That's the basic locking we need
>>> so far to protect pmd unsharing. I'll attach the link too in the next
>>> post, which is here:
>>>
>>> https://lore.kernel.org/r/[email protected]
>>>
>>>>
>>>> Using RCU / disabling local irqs seems a bit excessive because we *are*
>>>> holding the mmap lock and only care about concurrent unsharing
>>>
>>> The series wanted to address where the vma lock is not easy to take. It
>>> originates from when I was reading Mike's other patch, I forgot why I did
>>> that but I just noticed there's some code path that we may not want to take
>>> a sleepable lock, e.g. in follow page code.
>>
>> Yes, it was the patch suggested by David,
>>
>> https://lore.kernel.org/linux-mm/[email protected]/
>>
>> The issue was that FOLL_NOWAIT could be passed into follow_page_mask. If so,
>> then we do not want potentially sleep on the mutex.
>>
>> Since you both are on this thread, I thought of/noticed a related issue. In
>> follow_hugetlb_page, it looks like we can call hugetlb_fault if FOLL_NOWAIT
>> is set. hugetlb_fault certainly has the potential for sleeping. Is this also
>> a similar issue?
>
> Yeah maybe the clean way to do this is when FAULT_FLAG_RETRY_NOWAIT is set
> we should always try to not sleep at all.
hva_to_pfn_slow() that sets FOLL_NOWAIT calls get_user_pages_unlocked(),
which will just do a straight mmap_read_lock().
The interpretation of FOLL_NOWAIT should not be "don't take any
sleepable locks" but instead more like "don't wait for a page to get
swapped in".
#define FOLL_NOWAIT 0x20 /* if a disk transfer is needed, start the IO
I did not read the full replies yet (sorry, busy hacking :) ) but *any*
code path that already takes the mmap_read_lock() can just take whatever
other lock we want -- IMHO. No need to over-complicate our code trying
to avoid locks in that case.
--
Thanks,
David / dhildenb
On 23.11.22 16:09, Peter Xu wrote:
> Hi, David,
>
> Thanks for taking a look.
>
> On Wed, Nov 23, 2022 at 10:40:40AM +0100, David Hildenbrand wrote:
>> On 18.11.22 02:10, Peter Xu wrote:
>>> Based on latest mm-unstable (96aa38b69507).
>>>
>>> This can be seen as a follow-up series to Mike's recent hugetlb vma lock
>>> series for pmd unsharing, so this series also depends on that one.
>>> Hopefully this series can make it a more complete resolution for pmd
>>> unsharing.
>>>
>>> PS: so far no one strongly ACKed this, let me keep the RFC tag. But I
>>> think I'm already more confident than many of the RFCs I posted.
>>>
>>> PS2: there're a lot of changes comparing to rfcv1, so I'm just not adding
>>> the changelog. The whole idea is still the same, though.
>>>
>>> Problem
>>> =======
>>>
>>> huge_pte_offset() is a major helper used by hugetlb code paths to walk a
>>> hugetlb pgtable. It's used mostly everywhere since that's needed even
>>> before taking the pgtable lock.
>>>
>>> huge_pte_offset() is always called with mmap lock held with either read or
>>> write.
>>>
>>> For normal memory types that's far enough, since any pgtable removal
>>> requires mmap write lock (e.g. munmap or mm destructions). However hugetlb
>>> has the pmd unshare feature, it means not only the pgtable page can be gone
>>> from under us when we're doing a walking, but also the pgtable page we're
>>> walking (even after unshared, in this case it can only be the huge PUD page
>>> which contains 512 huge pmd entries, with the vma VM_SHARED mapped). It's
>>> possible because even though freeing the pgtable page requires mmap write
>>> lock, it doesn't help us when we're walking on another mm's pgtable, so
>>> it's still on risk even if we're with the current->mm's mmap lock.
>>>
>>> The recent work from Mike on vma lock can resolve most of this already.
>>> It's achieved by forbidden pmd unsharing during the lock being taken, so no
>>> further risk of the pgtable page being freed. It means if we can take the
>>> vma lock around all huge_pte_offset() callers it'll be safe.
>
> [1]
>
>>>
>>> There're already a bunch of them that we did as per the latest mm-unstable,
>>> but also quite a few others that we didn't for various reasons. E.g. it
>>> may not be applicable for not-allow-to-sleep contexts like FOLL_NOWAIT.
>>> Or, huge_pmd_share() is actually a tricky user of huge_pte_offset(),
>
> [2]
>
>>> because even if we took the vma lock, we're walking on another mm's vma!
>>> Taking vma lock for all the vmas are probably not gonna work.
>>>
>>> I have totally no report showing that I can trigger such a race, but from
>>> code wise I never see anything that stops the race from happening. This
>>> series is trying to resolve that problem.
>>
>> Let me try understand the basic problem first:
>>
>> hugetlb walks page tables semi-lockless: while we hold the mmap lock, we
>> don't grab the page table locks. That's very hugetlb specific handling and I
>> assume hugetlb uses different mechanisms to sync against MADV_DONTNEED,
>> concurrent page fault s... but that's no news. hugetlb is weird in many ways
>> :)
>>
>> So, IIUC, you want a mechanism to synchronize against PMD unsharing. Can't
>> we use some very basic locking for that?
>
Sorry for the delay, finally found time to look into this again. :)
> Yes we can in most cases. Please refer to above paragraph [1] where I
> referred Mike's recent work on vma lock. That's the basic locking we need
> so far to protect pmd unsharing. I'll attach the link too in the next
> post, which is here:
>
> https://lore.kernel.org/r/[email protected]
>
>>
>> Using RCU / disabling local irqs seems a bit excessive because we *are*
>> holding the mmap lock and only care about concurrent unsharing
>
> The series wanted to address where the vma lock is not easy to take. It
> originates from when I was reading Mike's other patch, I forgot why I did
> that but I just noticed there's some code path that we may not want to take
> a sleepable lock, e.g. in follow page code.
As I stated, whenever we already take the (expensive) mmap lock, the
least thing we should have to worry about is taking another sleepable
lock IMHO. Please correct me if I'm wrong.
>
> The other one is huge_pmd_share() where we may have the mmap lock for
> current mm but we're fundamentally walking another mm. It'll be tricky to
> take a sleepable lock in such condition too.
We're already grabbing the i_mmap_lock_read(mapping), and the VMAs are
should be stable in that interval tree IIUC. So I wonder if taking VMA
locks would really be problematic here. Anything obvious I am missing?
>
> I mentioned these cases in the other paragraph above [2]. Let me try to
> expand that in my next post too.
That would be great. I yet have to dedicate more time to understand all
that complexity.
>
> It's debatable whether all the rest places can only work with either RCU or
> irq disabled, but the idea is at least it should speed up those paths when
> we still can. Here, irqoff might be a bit heavy, but RCU lock should be
> always superior to vma lock when possible, the payoff is we may still see
> stale pgtable data (since unsharing can still happen in parallel), while
> that can be completely avoided when we take the vma lock.
IRQ disabled is frowned upon by RT folks, that's why I'd like to
understand if this is really required. Also, adding RCU to an already
complex mechanism doesn't necessarily make it easier :)
Let me dedicate some time today to dig into some details.
--
Thanks,
David / dhildenb
On Fri, Nov 25, 2022 at 10:43:43AM +0100, David Hildenbrand wrote:
> On 23.11.22 16:09, Peter Xu wrote:
> > Hi, David,
> >
> > Thanks for taking a look.
> >
> > On Wed, Nov 23, 2022 at 10:40:40AM +0100, David Hildenbrand wrote:
> > > On 18.11.22 02:10, Peter Xu wrote:
> > > > Based on latest mm-unstable (96aa38b69507).
> > > >
> > > > This can be seen as a follow-up series to Mike's recent hugetlb vma lock
> > > > series for pmd unsharing, so this series also depends on that one.
> > > > Hopefully this series can make it a more complete resolution for pmd
> > > > unsharing.
> > > >
> > > > PS: so far no one strongly ACKed this, let me keep the RFC tag. But I
> > > > think I'm already more confident than many of the RFCs I posted.
> > > >
> > > > PS2: there're a lot of changes comparing to rfcv1, so I'm just not adding
> > > > the changelog. The whole idea is still the same, though.
> > > >
> > > > Problem
> > > > =======
> > > >
> > > > huge_pte_offset() is a major helper used by hugetlb code paths to walk a
> > > > hugetlb pgtable. It's used mostly everywhere since that's needed even
> > > > before taking the pgtable lock.
> > > >
> > > > huge_pte_offset() is always called with mmap lock held with either read or
> > > > write.
> > > >
> > > > For normal memory types that's far enough, since any pgtable removal
> > > > requires mmap write lock (e.g. munmap or mm destructions). However hugetlb
> > > > has the pmd unshare feature, it means not only the pgtable page can be gone
> > > > from under us when we're doing a walking, but also the pgtable page we're
> > > > walking (even after unshared, in this case it can only be the huge PUD page
> > > > which contains 512 huge pmd entries, with the vma VM_SHARED mapped). It's
> > > > possible because even though freeing the pgtable page requires mmap write
> > > > lock, it doesn't help us when we're walking on another mm's pgtable, so
> > > > it's still on risk even if we're with the current->mm's mmap lock.
> > > >
> > > > The recent work from Mike on vma lock can resolve most of this already.
> > > > It's achieved by forbidden pmd unsharing during the lock being taken, so no
> > > > further risk of the pgtable page being freed. It means if we can take the
> > > > vma lock around all huge_pte_offset() callers it'll be safe.
> >
> > [1]
> >
> > > >
> > > > There're already a bunch of them that we did as per the latest mm-unstable,
> > > > but also quite a few others that we didn't for various reasons. E.g. it
> > > > may not be applicable for not-allow-to-sleep contexts like FOLL_NOWAIT.
> > > > Or, huge_pmd_share() is actually a tricky user of huge_pte_offset(),
> >
> > [2]
> >
> > > > because even if we took the vma lock, we're walking on another mm's vma!
> > > > Taking vma lock for all the vmas are probably not gonna work.
> > > >
> > > > I have totally no report showing that I can trigger such a race, but from
> > > > code wise I never see anything that stops the race from happening. This
> > > > series is trying to resolve that problem.
> > >
> > > Let me try understand the basic problem first:
> > >
> > > hugetlb walks page tables semi-lockless: while we hold the mmap lock, we
> > > don't grab the page table locks. That's very hugetlb specific handling and I
> > > assume hugetlb uses different mechanisms to sync against MADV_DONTNEED,
> > > concurrent page fault s... but that's no news. hugetlb is weird in many ways
> > > :)
> > >
> > > So, IIUC, you want a mechanism to synchronize against PMD unsharing. Can't
> > > we use some very basic locking for that?
> >
>
> Sorry for the delay, finally found time to look into this again. :)
>
> > Yes we can in most cases. Please refer to above paragraph [1] where I
> > referred Mike's recent work on vma lock. That's the basic locking we need
> > so far to protect pmd unsharing. I'll attach the link too in the next
> > post, which is here:
> >
> > https://lore.kernel.org/r/[email protected]
> >
> > >
> > > Using RCU / disabling local irqs seems a bit excessive because we *are*
> > > holding the mmap lock and only care about concurrent unsharing
> >
> > The series wanted to address where the vma lock is not easy to take. It
> > originates from when I was reading Mike's other patch, I forgot why I did
> > that but I just noticed there's some code path that we may not want to take
> > a sleepable lock, e.g. in follow page code.
>
> As I stated, whenever we already take the (expensive) mmap lock, the least
> thing we should have to worry about is taking another sleepable lock IMHO.
> Please correct me if I'm wrong.
Yes that's not a major concern. But I still think the follow page path
should sleep as less as possible. For example, non-hugetlb doesn't sleep
now. If with RCU lock we may do it lockless, then why not?
The same thing to patch 3 of this patchset - I would think it beneficial to
have even without a new lock type introduced, because it still makes the
follow page path cleaner, and have the hugetlb and non-hugetlb match.
>
> >
> > The other one is huge_pmd_share() where we may have the mmap lock for
> > current mm but we're fundamentally walking another mm. It'll be tricky to
> > take a sleepable lock in such condition too.
>
> We're already grabbing the i_mmap_lock_read(mapping), and the VMAs are
> should be stable in that interval tree IIUC. So I wonder if taking VMA locks
> would really be problematic here. Anything obvious I am missing?
No, I think you're right, and I found that myself just yesterday when I was
writting a reproducer. huge_pmd_share() is safe here, so at least that
patch in this patchset can be dropped.
>
> >
> > I mentioned these cases in the other paragraph above [2]. Let me try to
> > expand that in my next post too.
>
> That would be great. I yet have to dedicate more time to understand all that
> complexity.
>
> >
> > It's debatable whether all the rest places can only work with either RCU or
> > irq disabled, but the idea is at least it should speed up those paths when
> > we still can. Here, irqoff might be a bit heavy, but RCU lock should be
> > always superior to vma lock when possible, the payoff is we may still see
> > stale pgtable data (since unsharing can still happen in parallel), while
> > that can be completely avoided when we take the vma lock.
>
> IRQ disabled is frowned upon by RT folks, that's why I'd like to understand
> if this is really required. Also, adding RCU to an already complex mechanism
> doesn't necessarily make it easier :)
I've posted it before, let me copy that over:
arch/arm64/Kconfig: select MMU_GATHER_RCU_TABLE_FREE
arch/x86/Kconfig: select MMU_GATHER_RCU_TABLE_FREE if PARAVIRT
arch/arm64/Kconfig: select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
arch/riscv/Kconfig: select ARCH_WANT_HUGE_PMD_SHARE if 64BIT
arch/x86/Kconfig: select ARCH_WANT_HUGE_PMD_SHARE
The irqoff thing is definitely unfortunate, but that's only happening with
riscv or x86 when !PARAVIRT. I would suppose PARAVIRT a common
config.. then it's majorly for riscv only. If someday riscv can support
rcu table free we can remove it and enforce rcu table free for pmd sharing
if wanted.
Said that, the "new lock" part is definitely not the core of the patchset
(even if it might read like that..). The core part is to first identify
the issue on overlooked usage of huge_pte_offset() and how to make it
always safe.
Let me also update a few more things when at it.
Since it'll be very hard to reproduce the race discussed in this series, I
didn't try to write a reproducer until yesterday. I'll need some kernel
delays to trigger that, only if so I can trigger some use-after-free. So I
think problem confirmed. The rest is how to resolve it, and whether the
vma lock is good enough.
One other thing to mention is I overlooked one important thing on the huge
pgtable lock, which is actually not protected by RCU (as it's part of
pmd_free()). IOW if a hugetlb walker that wants to do huge_pte_lock()
after huge_pte_offset() it won't be guarded by RCU, hence the new lock
won't easily work for them. That's another thing very unfortunate. I'm
not sure whether it's okay to move that part into the page free rcu
callback, but definitely needs more thoughts as pmd_free() is an arch API.
Debatably irqoff might work if the arch needs IPI for tlb flush so irqoff
may protect both the pgtable page but also the huge_pte_lock(), but I don't
think it's wise either to enlarge the irqoff to generic archs, and also I
think "whether tlb flush requires IPI" is arch-specific, that makes this
over complicated too and not necessary.
So firstly, I think I need to rework the patchset so more places will need
to take the vma lock (where pgtable lock needed). I tend to keep the RCU
lock because it's lighter at least to !riscv, if so it'll look like this
for the rule to use huge_pte_offset():
/*
* huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE.
* Returns the pte_t* if found, or NULL if the address is not mapped.
*
* IMPORTANT: we should normally not directly call this function, instead
* this is only a common interface to implement arch-specific walker.
* Please consider using the hugetlb_walk() helper to make sure of the
* correct locking is satisfied.
*
* Since this function will walk all the pgtable pages (including not only
* high-level pgtable page, but also PUD entry that can be unshared
* concurrently for VM_SHARED), the caller of this function should be
* responsible of its thread safety. One can follow this rule:
*
* (1) For private mappings: pmd unsharing is not possible, so it'll
* always be safe if we're with the mmap sem for either read or write.
* This is normally always the case, IOW we don't need to do anything
* special.
*
* (2) For shared mappings: pmd unsharing is possible (so the PUD-ranged
* pgtable page can go away from under us! It can be done by a pmd
* unshare with a follow up munmap() on the other process), then we
* need either:
*
* (2.1) hugetlb vma lock read or write held, to make sure pmd unshare
* won't happen upon the range (it also makes sure the pte_t we
* read is the right and stable one), or,
*
* (2.2) hugetlb mapping i_mmap_rwsem lock held read or write, to make
* sure even if unshare happened the racy unmap() will wait until
* i_mmap_rwsem is released, or,
*
* (2.3) pgtable walker lock, to make sure even pmd unsharing happened,
* the old shared PUD page won't get freed from under us. In
* this case, the pteval can be obsolete, but at least it's still
* always safe to access the page (e.g., de-referencing pte_t*
* would not cause use-after-free). Note, it's not safe to
* access pgtable lock with this lock. If huge_pte_lock()
* needed, look for (2.1) or (2.2).
*
* Option (2.3) is the lightest, but it means pmd unshare can still happen
* so the pte got from pgtable walk can be stalled; also only page data is
* safe to access not others (e.g. pgtable lock). Option (2.1) is the
* safest, which guarantees pte stability until the vma lock released,
* however heavier than (2.3).
*/
Where hugetlb_walk() will be a new wrapper I'll introduce just to make sure
lock protections:
static inline pte_t *
hugetlb_walk(struct vm_area_struct *vma, unsigned long addr, unsigned long sz)
{
#ifdef CONFIG_LOCKDEP
/* lockdep_is_held() only defined with CONFIG_LOCKDEP */
if (vma->vm_flags & VM_MAYSHARE) {
struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
/*
* Here, taking any of the three locks will guarantee
* safety of hugetlb pgtable walk.
*
* For more information on the locking rules of using
* huge_pte_offset(), please see the comment above
* huge_pte_offset() in the header file.
*/
WARN_ON_ONCE(!hugetlb_walker_locked() &&
!lockdep_is_held(&vma_lock->rw_sema) &&
!lockdep_is_held(
&vma->vm_file->f_mapping->i_mmap_rwsem));
}
#endif
return huge_pte_offset(vma->vm_mm, addr, sz);
}
--
Peter Xu