2022-12-07 20:36:01

by Peter Xu

[permalink] [raw]
Subject: [PATCH v2 00/10] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare

Based on latest mm-unstable (dbccafc6e992).

This can be seen as a follow-up series to Mike's recent hugetlb vma lock
series for pmd unsharing, but majorly covering safe use of huge_pte_offset.

This version should have all the issues resolved regarding v1.

Old versions:

rfcv1: https://lore.kernel.org/r/[email protected]
rfcv2: https://lore.kernel.org/r/[email protected]
v1: https://lore.kernel.org/r/[email protected]

Changelog for v1->v2:
- Added a-bs and r-bs
- Dropped the old patch 9, replacing it with a new document update patch 10
- Patch 4
- Dropped obsolete comment in hugetlb_fault() regarding ptep [Mike]
- Patch 8
- Release vma lock and retake for three special cases, where two of them
can proactively yield (s390 skey, pagemap) and the 3rd one (HMM) can
call handle_mm_fault nested, so taking the vma lock nested.
- More comments
- Patch 9
- Fix compilation error when !HUGETLB_PAGE

Problem
=======

huge_pte_offset() is a major helper used by hugetlb code paths to walk a
hugetlb pgtable. It's used mostly everywhere since that's needed even
before taking the pgtable lock.

huge_pte_offset() is always called with mmap lock held with either read or
write. It was assumed to be safe but it's actually not. One race
condition can easily trigger by: (1) firstly trigger pmd share on a memory
range, (2) do huge_pte_offset() on the range, then at the meantime, (3)
another thread unshare the pmd range, and the pgtable page is prone to lost
if the other shared process wants to free it completely (by either munmap
or exit mm).

The recent work from Mike on vma lock can resolve most of this already.
It's achieved by forbidden pmd unsharing during the lock being taken, so no
further risk of the pgtable page being freed. It means if we can take the
vma lock around all huge_pte_offset() callers it'll be safe.

There're already a bunch of them that we did as per the latest mm-unstable,
but also quite a few others that we didn't for various reasons especially
on huge_pte_offset() usage.

One more thing to mention is that besides the vma lock, i_mmap_rwsem can
also be used to protect the pgtable page (along with its pgtable lock) from
being freed from under us. IOW, huge_pte_offset() callers need to either
hold the vma lock or i_mmap_rwsem to safely walk the pgtables.

A reproducer of such problem, based on hugetlb GUP (NOTE: since the race is
very hard to trigger, one needs to apply another kernel delay patch too,
see below):

======8<=======
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <linux/memfd.h>
#include <assert.h>
#include <pthread.h>

#define MSIZE (1UL << 30) /* 1GB */
#define PSIZE (2UL << 20) /* 2MB */

#define HOLD_SEC (1)

int pipefd[2];
void *buf;

void *do_map(int fd)
{
unsigned char *tmpbuf, *p;
int ret;

ret = posix_memalign((void **)&tmpbuf, MSIZE, MSIZE);
if (ret) {
perror("posix_memalign() failed");
return NULL;
}

tmpbuf = mmap(tmpbuf, MSIZE, PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_FIXED, fd, 0);
if (tmpbuf == MAP_FAILED) {
perror("mmap() failed");
return NULL;
}
printf("mmap() -> %p\n", tmpbuf);

for (p = tmpbuf; p < tmpbuf + MSIZE; p += PSIZE) {
*p = 1;
}

return tmpbuf;
}

void do_unmap(void *buf)
{
munmap(buf, MSIZE);
}

void proc2(int fd)
{
unsigned char c;

buf = do_map(fd);
if (!buf)
return;

read(pipefd[0], &c, 1);
/*
* This frees the shared pgtable page, causing use-after-free in
* proc1_thread1 when soft walking hugetlb pgtable.
*/
do_unmap(buf);

printf("Proc2 quitting\n");
}

void *proc1_thread1(void *data)
{
/*
* Trigger follow-page on 1st 2m page. Kernel hack patch needed to
* withhold this procedure for easier reproduce.
*/
madvise(buf, PSIZE, MADV_POPULATE_WRITE);
printf("Proc1-thread1 quitting\n");
return NULL;
}

void *proc1_thread2(void *data)
{
unsigned char c;

/* Wait a while until proc1_thread1() start to wait */
sleep(0.5);
/* Trigger pmd unshare */
madvise(buf, PSIZE, MADV_DONTNEED);
/* Kick off proc2 to release the pgtable */
write(pipefd[1], &c, 1);

printf("Proc1-thread2 quitting\n");
return NULL;
}

void proc1(int fd)
{
pthread_t tid1, tid2;
int ret;

buf = do_map(fd);
if (!buf)
return;

ret = pthread_create(&tid1, NULL, proc1_thread1, NULL);
assert(ret == 0);
ret = pthread_create(&tid2, NULL, proc1_thread2, NULL);
assert(ret == 0);

/* Kick the child to share the PUD entry */
pthread_join(tid1, NULL);
pthread_join(tid2, NULL);

do_unmap(buf);
}

int main(void)
{
int fd, ret;

fd = memfd_create("test-huge", MFD_HUGETLB | MFD_HUGE_2MB);
if (fd < 0) {
perror("open failed");
return -1;
}

ret = ftruncate(fd, MSIZE);
if (ret) {
perror("ftruncate() failed");
return -1;
}

ret = pipe(pipefd);
if (ret) {
perror("pipe() failed");
return -1;
}

if (fork()) {
proc1(fd);
} else {
proc2(fd);
}

close(pipefd[0]);
close(pipefd[1]);
close(fd);

return 0;
}
======8<=======

The kernel patch needed to present such a race so it'll trigger 100%:

======8<=======
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9d97c9a2a15d..f8d99dad5004 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -38,6 +38,7 @@
#include <asm/page.h>
#include <asm/pgalloc.h>
#include <asm/tlb.h>
+#include <asm/delay.h>

#include <linux/io.h>
#include <linux/hugetlb.h>
@@ -6290,6 +6291,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
bool unshare = false;
int absent;
struct page *page;
+ unsigned long c = 0;

/*
* If we have a pending SIGKILL, don't keep faulting pages and
@@ -6309,6 +6311,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
*/
pte = huge_pte_offset(mm, vaddr & huge_page_mask(h),
huge_page_size(h));
+
+ pr_info("%s: withhold 1 sec...\n", __func__);
+ for (c = 0; c < 100; c++) {
+ udelay(10000);
+ }
+ pr_info("%s: withhold 1 sec...done\n", __func__);
+
if (pte)
ptl = huge_pte_lock(h, mm, pte);
absent = !pte || huge_pte_none(huge_ptep_get(pte));
======8<=======

It'll trigger use-after-free of the pgtable spinlock:

======8<=======
[ 16.959907] follow_hugetlb_page: withhold 1 sec...
[ 17.960315] follow_hugetlb_page: withhold 1 sec...done
[ 17.960550] ------------[ cut here ]------------
[ 17.960742] DEBUG_LOCKS_WARN_ON(1)
[ 17.960756] WARNING: CPU: 3 PID: 542 at kernel/locking/lockdep.c:231 __lock_acquire+0x955/0x1fa0
[ 17.961264] Modules linked in:
[ 17.961394] CPU: 3 PID: 542 Comm: hugetlb-pmd-sha Not tainted 6.1.0-rc4-peterx+ #46
[ 17.961704] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 17.962266] RIP: 0010:__lock_acquire+0x955/0x1fa0
[ 17.962516] Code: c0 0f 84 5f fe ff ff 44 8b 1d 0f 9a 29 02 45 85 db 0f 85 4f fe ff ff 48 c7 c6 75 50 83 82 48 c7 c7 1b 4b 7d 82 e8 d3 22 d8 00 <0f> 0b 31 c0 4c 8b 54 24 08 4c 8b 04 24 e9
[ 17.963494] RSP: 0018:ffffc90000e4fba8 EFLAGS: 00010096
[ 17.963704] RAX: 0000000000000016 RBX: fffffffffd3925a8 RCX: 0000000000000000
[ 17.963989] RDX: 0000000000000002 RSI: ffffffff82863ccf RDI: 00000000ffffffff
[ 17.964276] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffc90000e4fa58
[ 17.964557] R10: 0000000000000003 R11: ffffffff83162688 R12: 0000000000000000
[ 17.964839] R13: 0000000000000001 R14: ffff888105eac748 R15: 0000000000000001
[ 17.965123] FS: 00007f17c0a00640(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000
[ 17.965443] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 17.965672] CR2: 00007f17c09ffef8 CR3: 000000010c87a005 CR4: 0000000000770ee0
[ 17.965956] PKRU: 55555554
[ 17.966068] Call Trace:
[ 17.966172] <TASK>
[ 17.966268] ? tick_nohz_tick_stopped+0x12/0x30
[ 17.966455] lock_acquire+0xbf/0x2b0
[ 17.966603] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.966799] ? _printk+0x48/0x4e
[ 17.966934] _raw_spin_lock+0x2f/0x40
[ 17.967087] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.967285] follow_hugetlb_page.cold+0x75/0x5c4
[ 17.967473] __get_user_pages+0xbb/0x620
[ 17.967635] faultin_vma_page_range+0x9a/0x100
[ 17.967817] madvise_vma_behavior+0x3c0/0xbd0
[ 17.967998] ? mas_prev+0x11/0x290
[ 17.968141] ? find_vma_prev+0x5e/0xa0
[ 17.968304] ? madvise_vma_anon_name+0x70/0x70
[ 17.968486] madvise_walk_vmas+0xa9/0x120
[ 17.968650] do_madvise.part.0+0xfa/0x270
[ 17.968813] __x64_sys_madvise+0x5a/0x70
[ 17.968974] do_syscall_64+0x37/0x90
[ 17.969123] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 17.969329] RIP: 0033:0x7f1840f0efdb
[ 17.969477] Code: c3 66 0f 1f 44 00 00 48 8b 15 39 6e 0e 00 f7 d8 64 89 02 b8 ff ff ff ff eb bc 0f 1f 44 00 00 f3 0f 1e fa b8 1c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 0d 68
[ 17.970205] RSP: 002b:00007f17c09ffe38 EFLAGS: 00000202 ORIG_RAX: 000000000000001c
[ 17.970504] RAX: ffffffffffffffda RBX: 00007f17c0a00640 RCX: 00007f1840f0efdb
[ 17.970786] RDX: 0000000000000017 RSI: 0000000000200000 RDI: 00007f1800000000
[ 17.971068] RBP: 00007f17c09ffe50 R08: 0000000000000000 R09: 00007ffd3954164f
[ 17.971353] R10: 00007f1840e10348 R11: 0000000000000202 R12: ffffffffffffff80
[ 17.971709] R13: 0000000000000000 R14: 00007ffd39541550 R15: 00007f17c0200000
[ 17.972083] </TASK>
[ 17.972199] irq event stamp: 2353
[ 17.972372] hardirqs last enabled at (2353): [<ffffffff8117fe4e>] __up_console_sem+0x5e/0x70
[ 17.972869] hardirqs last disabled at (2352): [<ffffffff8117fe33>] __up_console_sem+0x43/0x70
[ 17.973365] softirqs last enabled at (2330): [<ffffffff810f763d>] __irq_exit_rcu+0xed/0x160
[ 17.973857] softirqs last disabled at (2323): [<ffffffff810f763d>] __irq_exit_rcu+0xed/0x160
[ 17.974341] ---[ end trace 0000000000000000 ]---
[ 17.974614] BUG: kernel NULL pointer dereference, address: 00000000000000b8
[ 17.975012] #PF: supervisor read access in kernel mode
[ 17.975314] #PF: error_code(0x0000) - not-present page
[ 17.975615] PGD 103f7b067 P4D 103f7b067 PUD 106cd7067 PMD 0
[ 17.975943] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 17.976197] CPU: 3 PID: 542 Comm: hugetlb-pmd-sha Tainted: G W 6.1.0-rc4-peterx+ #46
[ 17.976712] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 17.977370] RIP: 0010:__lock_acquire+0x190/0x1fa0
[ 17.977655] Code: 98 00 00 00 41 89 46 24 81 e2 ff 1f 00 00 48 0f a3 15 e4 ba dd 02 0f 83 ff 05 00 00 48 8d 04 52 48 c1 e0 06 48 05 c0 d2 f4 83 <44> 0f b6 a0 b8 00 00 00 41 0f b7 46 20 6f
[ 17.979170] RSP: 0018:ffffc90000e4fba8 EFLAGS: 00010046
[ 17.979787] RAX: 0000000000000000 RBX: fffffffffd3925a8 RCX: 0000000000000000
[ 17.980838] RDX: 0000000000000002 RSI: ffffffff82863ccf RDI: 00000000ffffffff
[ 17.982048] RBP: 0000000000000000 R08: ffff888105eac720 R09: ffffc90000e4fa58
[ 17.982892] R10: ffff888105eab900 R11: ffffffff83162688 R12: 0000000000000000
[ 17.983771] R13: 0000000000000001 R14: ffff888105eac748 R15: 0000000000000001
[ 17.984815] FS: 00007f17c0a00640(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000
[ 17.985924] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 17.986265] CR2: 00000000000000b8 CR3: 000000010c87a005 CR4: 0000000000770ee0
[ 17.986674] PKRU: 55555554
[ 17.986832] Call Trace:
[ 17.987012] <TASK>
[ 17.987266] ? tick_nohz_tick_stopped+0x12/0x30
[ 17.987770] lock_acquire+0xbf/0x2b0
[ 17.988118] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.988575] ? _printk+0x48/0x4e
[ 17.988889] _raw_spin_lock+0x2f/0x40
[ 17.989243] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.989687] follow_hugetlb_page.cold+0x75/0x5c4
[ 17.990119] __get_user_pages+0xbb/0x620
[ 17.990500] faultin_vma_page_range+0x9a/0x100
[ 17.990928] madvise_vma_behavior+0x3c0/0xbd0
[ 17.991354] ? mas_prev+0x11/0x290
[ 17.991678] ? find_vma_prev+0x5e/0xa0
[ 17.992024] ? madvise_vma_anon_name+0x70/0x70
[ 17.992421] madvise_walk_vmas+0xa9/0x120
[ 17.992793] do_madvise.part.0+0xfa/0x270
[ 17.993166] __x64_sys_madvise+0x5a/0x70
[ 17.993539] do_syscall_64+0x37/0x90
[ 17.993879] entry_SYSCALL_64_after_hwframe+0x63/0xcd
======8<=======

Resolution
==========

This patchset protects all the huge_pte_offset() callers to also take the
vma lock properly.

Patch Layout
============

Patch 1-2: cleanup, or dependency of the follow up patches
Patch 3: before fixing, document huge_pte_offset() on lock required
Patch 4-8: each patch resolves one possible race condition
Patch 9: introduce hugetlb_walk() to replace huge_pte_offset()
Patch 10: document why page_vma_mapped_walk() is safe

Tests
=====

The series is verified with the above reproducer so the race cannot
trigger anymore. It also passes all hugetlb kselftests.

Comments welcomed, thanks.

Peter Xu (10):
mm/hugetlb: Let vma_offset_start() to return start
mm/hugetlb: Don't wait for migration entry during follow page
mm/hugetlb: Document huge_pte_offset usage
mm/hugetlb: Move swap entry handling into vma lock when faulted
mm/hugetlb: Make userfaultfd_huge_must_wait() safe to pmd unshare
mm/hugetlb: Make hugetlb_follow_page_mask() safe to pmd unshare
mm/hugetlb: Make follow_hugetlb_page() safe to pmd unshare
mm/hugetlb: Make walk_hugetlb_range() safe to pmd unshare
mm/hugetlb: Introduce hugetlb_walk()
mm/hugetlb: Document why page_vma_mapped_walk() is safe to walk

arch/s390/mm/gmap.c | 2 +
fs/hugetlbfs/inode.c | 28 ++++++-------
fs/proc/task_mmu.c | 2 +
fs/userfaultfd.c | 24 +++++++----
include/linux/hugetlb.h | 71 +++++++++++++++++++++++++++++++
include/linux/pagewalk.h | 11 ++++-
include/linux/swapops.h | 6 ++-
mm/hmm.c | 15 ++++++-
mm/hugetlb.c | 91 +++++++++++++++++-----------------------
mm/migrate.c | 25 +++++++++--
mm/page_vma_mapped.c | 12 ++++--
mm/pagewalk.c | 6 +--
12 files changed, 204 insertions(+), 89 deletions(-)

--
2.37.3


2022-12-07 20:36:24

by Peter Xu

[permalink] [raw]
Subject: [PATCH v2 04/10] mm/hugetlb: Move swap entry handling into vma lock when faulted

In hugetlb_fault(), there used to have a special path to handle swap entry
at the entrance using huge_pte_offset(). That's unsafe because
huge_pte_offset() for a pmd sharable range can access freed pgtables if
without any lock to protect the pgtable from being freed after pmd unshare.

Here the simplest solution to make it safe is to move the swap handling to
be after the vma lock being held. We may need to take the fault mutex on
either migration or hwpoison entries now (also the vma lock, but that's
really needed), however neither of them is hot path.

Note that the vma lock cannot be released in hugetlb_fault() when the
migration entry is detected, because in migration_entry_wait_huge() the
pgtable page will be used again (by taking the pgtable lock), so that also
need to be protected by the vma lock. Modify migration_entry_wait_huge()
so that it must be called with vma read lock held, and properly release the
lock in __migration_entry_wait_huge().

Reviewed-by: Mike Kravetz <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/swapops.h | 6 ++++--
mm/hugetlb.c | 36 +++++++++++++++---------------------
mm/migrate.c | 25 +++++++++++++++++++++----
3 files changed, 40 insertions(+), 27 deletions(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index a70b5c3a68d7..b134c5eb75cb 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -337,7 +337,8 @@ extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
unsigned long address);
#ifdef CONFIG_HUGETLB_PAGE
-extern void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl);
+extern void __migration_entry_wait_huge(struct vm_area_struct *vma,
+ pte_t *ptep, spinlock_t *ptl);
extern void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte);
#endif /* CONFIG_HUGETLB_PAGE */
#else /* CONFIG_MIGRATION */
@@ -366,7 +367,8 @@ static inline void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
unsigned long address) { }
#ifdef CONFIG_HUGETLB_PAGE
-static inline void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl) { }
+static inline void __migration_entry_wait_huge(struct vm_area_struct *vma,
+ pte_t *ptep, spinlock_t *ptl) { }
static inline void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte) { }
#endif /* CONFIG_HUGETLB_PAGE */
static inline int is_writable_migration_entry(swp_entry_t entry)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c8a6673fe5b4..49f73677a418 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5824,22 +5824,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
int need_wait_lock = 0;
unsigned long haddr = address & huge_page_mask(h);

- ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
- if (ptep) {
- /*
- * Since we hold no locks, ptep could be stale. That is
- * OK as we are only making decisions based on content and
- * not actually modifying content here.
- */
- entry = huge_ptep_get(ptep);
- if (unlikely(is_hugetlb_entry_migration(entry))) {
- migration_entry_wait_huge(vma, ptep);
- return 0;
- } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
- return VM_FAULT_HWPOISON_LARGE |
- VM_FAULT_SET_HINDEX(hstate_index(h));
- }
-
/*
* Serialize hugepage allocation and instantiation, so that we don't
* get spurious allocation failures if two CPUs race to instantiate
@@ -5854,10 +5838,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* Acquire vma lock before calling huge_pte_alloc and hold
* until finished with ptep. This prevents huge_pmd_unshare from
* being called elsewhere and making the ptep no longer valid.
- *
- * ptep could have already be assigned via huge_pte_offset. That
- * is OK, as huge_pte_alloc will return the same value unless
- * something has changed.
*/
hugetlb_vma_lock_read(vma);
ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
@@ -5886,8 +5866,22 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* fault, and is_hugetlb_entry_(migration|hwpoisoned) check will
* properly handle it.
*/
- if (!pte_present(entry))
+ if (!pte_present(entry)) {
+ if (unlikely(is_hugetlb_entry_migration(entry))) {
+ /*
+ * Release fault lock first because the vma lock is
+ * needed to guard the huge_pte_lockptr() later in
+ * migration_entry_wait_huge(). The vma lock will
+ * be released there.
+ */
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ migration_entry_wait_huge(vma, ptep);
+ return 0;
+ } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
+ ret = VM_FAULT_HWPOISON_LARGE |
+ VM_FAULT_SET_HINDEX(hstate_index(h));
goto out_mutex;
+ }

/*
* If we are going to COW/unshare the mapping later, we examine the
diff --git a/mm/migrate.c b/mm/migrate.c
index 48584b032ea9..d14f1f3ab073 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -333,24 +333,41 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
}

#ifdef CONFIG_HUGETLB_PAGE
-void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl)
+void __migration_entry_wait_huge(struct vm_area_struct *vma,
+ pte_t *ptep, spinlock_t *ptl)
{
pte_t pte;

+ /*
+ * The vma read lock must be taken, which will be released before
+ * the function returns. It makes sure the pgtable page (along
+ * with its spin lock) not be freed in parallel.
+ */
+ hugetlb_vma_assert_locked(vma);
+
spin_lock(ptl);
pte = huge_ptep_get(ptep);

- if (unlikely(!is_hugetlb_entry_migration(pte)))
+ if (unlikely(!is_hugetlb_entry_migration(pte))) {
spin_unlock(ptl);
- else
+ hugetlb_vma_unlock_read(vma);
+ } else {
+ /*
+ * If migration entry existed, safe to release vma lock
+ * here because the pgtable page won't be freed without the
+ * pgtable lock released. See comment right above pgtable
+ * lock release in migration_entry_wait_on_locked().
+ */
+ hugetlb_vma_unlock_read(vma);
migration_entry_wait_on_locked(pte_to_swp_entry(pte), NULL, ptl);
+ }
}

void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte)
{
spinlock_t *ptl = huge_pte_lockptr(hstate_vma(vma), vma->vm_mm, pte);

- __migration_entry_wait_huge(pte, ptl);
+ __migration_entry_wait_huge(vma, pte, ptl);
}
#endif

--
2.37.3

2022-12-07 20:36:37

by Peter Xu

[permalink] [raw]
Subject: [PATCH v2 06/10] mm/hugetlb: Make hugetlb_follow_page_mask() safe to pmd unshare

Since hugetlb_follow_page_mask() walks the pgtable, it needs the vma lock
to make sure the pgtable page will not be freed concurrently.

Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/hugetlb.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 49f73677a418..3fbbd599d015 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6226,9 +6226,10 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
if (WARN_ON_ONCE(flags & FOLL_PIN))
return NULL;

+ hugetlb_vma_lock_read(vma);
pte = huge_pte_offset(mm, haddr, huge_page_size(h));
if (!pte)
- return NULL;
+ goto out_unlock;

ptl = huge_pte_lock(h, mm, pte);
entry = huge_ptep_get(pte);
@@ -6251,6 +6252,8 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
}
out:
spin_unlock(ptl);
+out_unlock:
+ hugetlb_vma_unlock_read(vma);
return page;
}

--
2.37.3

2022-12-07 20:37:05

by Peter Xu

[permalink] [raw]
Subject: [PATCH v2 07/10] mm/hugetlb: Make follow_hugetlb_page() safe to pmd unshare

Since follow_hugetlb_page() walks the pgtable, it needs the vma lock
to make sure the pgtable page will not be freed concurrently.

Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/hugetlb.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3fbbd599d015..f42399522805 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6284,6 +6284,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
break;
}

+ hugetlb_vma_lock_read(vma);
/*
* Some archs (sparc64, sh*) have multiple pte_ts to
* each hugepage. We have to make sure we get the
@@ -6308,6 +6309,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
!hugetlbfs_pagecache_present(h, vma, vaddr)) {
if (pte)
spin_unlock(ptl);
+ hugetlb_vma_unlock_read(vma);
remainder = 0;
break;
}
@@ -6329,6 +6331,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,

if (pte)
spin_unlock(ptl);
+ hugetlb_vma_unlock_read(vma);
+
if (flags & FOLL_WRITE)
fault_flags |= FAULT_FLAG_WRITE;
else if (unshare)
@@ -6388,6 +6392,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
remainder -= pages_per_huge_page(h);
i += pages_per_huge_page(h);
spin_unlock(ptl);
+ hugetlb_vma_unlock_read(vma);
continue;
}

@@ -6415,6 +6420,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (WARN_ON_ONCE(!try_grab_folio(pages[i], refs,
flags))) {
spin_unlock(ptl);
+ hugetlb_vma_unlock_read(vma);
remainder = 0;
err = -ENOMEM;
break;
@@ -6426,6 +6432,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
i += refs;

spin_unlock(ptl);
+ hugetlb_vma_unlock_read(vma);
}
*nr_pages = remainder;
/*
--
2.37.3

2022-12-07 20:37:18

by Peter Xu

[permalink] [raw]
Subject: [PATCH v2 05/10] mm/hugetlb: Make userfaultfd_huge_must_wait() safe to pmd unshare

We can take the hugetlb walker lock, here taking vma lock directly.

Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
fs/userfaultfd.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 07c81ab3fd4d..a602f008dde5 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -376,7 +376,8 @@ static inline unsigned int userfaultfd_get_blocking_state(unsigned int flags)
*/
vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
{
- struct mm_struct *mm = vmf->vma->vm_mm;
+ struct vm_area_struct *vma = vmf->vma;
+ struct mm_struct *mm = vma->vm_mm;
struct userfaultfd_ctx *ctx;
struct userfaultfd_wait_queue uwq;
vm_fault_t ret = VM_FAULT_SIGBUS;
@@ -403,7 +404,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
*/
mmap_assert_locked(mm);

- ctx = vmf->vma->vm_userfaultfd_ctx.ctx;
+ ctx = vma->vm_userfaultfd_ctx.ctx;
if (!ctx)
goto out;

@@ -493,6 +494,13 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)

blocking_state = userfaultfd_get_blocking_state(vmf->flags);

+ /*
+ * This stablizes pgtable for hugetlb on e.g. pmd unsharing. Need
+ * to be before setting current state.
+ */
+ if (is_vm_hugetlb_page(vma))
+ hugetlb_vma_lock_read(vma);
+
spin_lock_irq(&ctx->fault_pending_wqh.lock);
/*
* After the __add_wait_queue the uwq is visible to userland
@@ -507,13 +515,15 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
set_current_state(blocking_state);
spin_unlock_irq(&ctx->fault_pending_wqh.lock);

- if (!is_vm_hugetlb_page(vmf->vma))
+ if (!is_vm_hugetlb_page(vma))
must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
reason);
else
- must_wait = userfaultfd_huge_must_wait(ctx, vmf->vma,
+ must_wait = userfaultfd_huge_must_wait(ctx, vma,
vmf->address,
vmf->flags, reason);
+ if (is_vm_hugetlb_page(vma))
+ hugetlb_vma_unlock_read(vma);
mmap_read_unlock(mm);

if (likely(must_wait && !READ_ONCE(ctx->released))) {
--
2.37.3

2022-12-07 20:37:35

by Peter Xu

[permalink] [raw]
Subject: [PATCH v2 08/10] mm/hugetlb: Make walk_hugetlb_range() safe to pmd unshare

Since walk_hugetlb_range() walks the pgtable, it needs the vma lock
to make sure the pgtable page will not be freed concurrently.

Reviewed-by: Mike Kravetz <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
arch/s390/mm/gmap.c | 2 ++
fs/proc/task_mmu.c | 2 ++
include/linux/pagewalk.h | 11 ++++++++++-
mm/hmm.c | 15 ++++++++++++++-
mm/pagewalk.c | 2 ++
5 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 8947451ae021..292a54c490d4 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2643,7 +2643,9 @@ static int __s390_enable_skey_hugetlb(pte_t *pte, unsigned long addr,
end = start + HPAGE_SIZE - 1;
__storage_key_init_range(start, end);
set_bit(PG_arch_1, &page->flags);
+ hugetlb_vma_unlock_read(walk->vma);
cond_resched();
+ hugetlb_vma_lock_read(walk->vma);
return 0;
}

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e35a0398db63..cf3887fb2905 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1613,7 +1613,9 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
frame++;
}

+ hugetlb_vma_unlock_read(walk->vma);
cond_resched();
+ hugetlb_vma_lock_read(walk->vma);

return err;
}
diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index 959f52e5867d..27a6df448ee5 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -21,7 +21,16 @@ struct mm_walk;
* depth is -1 if not known, 0:PGD, 1:P4D, 2:PUD, 3:PMD.
* Any folded depths (where PTRS_PER_P?D is equal to 1)
* are skipped.
- * @hugetlb_entry: if set, called for each hugetlb entry
+ * @hugetlb_entry: if set, called for each hugetlb entry. This hook
+ * function is called with the vma lock held, in order to
+ * protect against a concurrent freeing of the pte_t* or
+ * the ptl. In some cases, the hook function needs to drop
+ * and retake the vma lock in order to avoid deadlocks
+ * while calling other functions. In such cases the hook
+ * function must either refrain from accessing the pte or
+ * ptl after dropping the vma lock, or else revalidate
+ * those items after re-acquiring the vma lock and before
+ * accessing them.
* @test_walk: caller specific callback function to determine whether
* we walk over the current vma or not. Returning 0 means
* "do page table walk over the current vma", returning
diff --git a/mm/hmm.c b/mm/hmm.c
index 3850fb625dda..796de6866089 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -493,8 +493,21 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
required_fault =
hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, cpu_flags);
if (required_fault) {
+ int ret;
+
spin_unlock(ptl);
- return hmm_vma_fault(addr, end, required_fault, walk);
+ hugetlb_vma_unlock_read(vma);
+ /*
+ * Avoid deadlock: drop the vma lock before calling
+ * hmm_vma_fault(), which will itself potentially take and
+ * drop the vma lock. This is also correct from a
+ * protection point of view, because there is no further
+ * use here of either pte or ptl after dropping the vma
+ * lock.
+ */
+ ret = hmm_vma_fault(addr, end, required_fault, walk);
+ hugetlb_vma_lock_read(vma);
+ return ret;
}

pfn = pte_pfn(entry) + ((start & ~hmask) >> PAGE_SHIFT);
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 7f1c9b274906..d98564a7be57 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -302,6 +302,7 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
const struct mm_walk_ops *ops = walk->ops;
int err = 0;

+ hugetlb_vma_lock_read(vma);
do {
next = hugetlb_entry_end(h, addr, end);
pte = huge_pte_offset(walk->mm, addr & hmask, sz);
@@ -314,6 +315,7 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
if (err)
break;
} while (addr = next, addr != end);
+ hugetlb_vma_unlock_read(vma);

return err;
}
--
2.37.3

2022-12-07 20:41:01

by Peter Xu

[permalink] [raw]
Subject: [PATCH v2 10/10] mm/hugetlb: Document why page_vma_mapped_walk() is safe to walk

Taking vma lock here is not needed for now because all potential hugetlb
walkers here should have i_mmap_rwsem held. Document the fact.

Reviewed-by: Mike Kravetz <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/page_vma_mapped.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index e97b2e23bd28..2e59a0419d22 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -168,8 +168,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
/* The only possible mapping was handled on last iteration */
if (pvmw->pte)
return not_found(pvmw);
-
- /* when pud is not present, pte will be NULL */
+ /*
+ * NOTE: we don't need explicit lock here to walk the
+ * hugetlb pgtable because either (1) potential callers of
+ * hugetlb pvmw currently holds i_mmap_rwsem, or (2) the
+ * caller will not walk a hugetlb vma (e.g. ksm or uprobe).
+ * When one day this rule breaks, one will get a warning
+ * in hugetlb_walk(), and then we'll figure out what to do.
+ */
pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
if (!pvmw->pte)
return false;
--
2.37.3

2022-12-07 20:41:55

by Peter Xu

[permalink] [raw]
Subject: [PATCH v2 09/10] mm/hugetlb: Introduce hugetlb_walk()

huge_pte_offset() is the main walker function for hugetlb pgtables. The
name is not really representing what it does, though.

Instead of renaming it, introduce a wrapper function called hugetlb_walk()
which will use huge_pte_offset() inside. Assert on the locks when walking
the pgtable.

Note, the vma lock assertion will be a no-op for private mappings.

Signed-off-by: Peter Xu <[email protected]>
---
fs/hugetlbfs/inode.c | 4 +---
fs/userfaultfd.c | 6 ++----
include/linux/hugetlb.h | 39 +++++++++++++++++++++++++++++++++++++++
mm/hugetlb.c | 32 +++++++++++++-------------------
mm/page_vma_mapped.c | 2 +-
mm/pagewalk.c | 4 +---
6 files changed, 57 insertions(+), 30 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index fdb16246f46e..48f1a8ad2243 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -388,9 +388,7 @@ static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
{
pte_t *ptep, pte;

- ptep = huge_pte_offset(vma->vm_mm, addr,
- huge_page_size(hstate_vma(vma)));
-
+ ptep = hugetlb_walk(vma, addr, huge_page_size(hstate_vma(vma)));
if (!ptep)
return false;

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index a602f008dde5..f31fe1a9f4c5 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -237,14 +237,12 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
unsigned long flags,
unsigned long reason)
{
- struct mm_struct *mm = ctx->mm;
pte_t *ptep, pte;
bool ret = true;

- mmap_assert_locked(mm);
-
- ptep = huge_pte_offset(mm, address, vma_mmu_pagesize(vma));
+ mmap_assert_locked(ctx->mm);

+ ptep = hugetlb_walk(vma, address, vma_mmu_pagesize(vma));
if (!ptep)
goto out;

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 81efd9b9baa2..1c20cbbf3d22 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -2,6 +2,7 @@
#ifndef _LINUX_HUGETLB_H
#define _LINUX_HUGETLB_H

+#include <linux/mm.h>
#include <linux/mm_types.h>
#include <linux/mmdebug.h>
#include <linux/fs.h>
@@ -196,6 +197,11 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
* huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE.
* Returns the pte_t* if found, or NULL if the address is not mapped.
*
+ * IMPORTANT: we should normally not directly call this function, instead
+ * this is only a common interface to implement arch-specific walker.
+ * Please consider using the hugetlb_walk() helper to make sure the
+ * correct locking is satisfied.
+ *
* Since this function will walk all the pgtable pages (including not only
* high-level pgtable page, but also PUD entry that can be unshared
* concurrently for VM_SHARED), the caller of this function should be
@@ -1229,4 +1235,37 @@ bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr);
#define flush_hugetlb_tlb_range(vma, addr, end) flush_tlb_range(vma, addr, end)
#endif

+static inline bool
+__vma_shareable_flags_pmd(struct vm_area_struct *vma)
+{
+ return vma->vm_flags & (VM_MAYSHARE | VM_SHARED) &&
+ vma->vm_private_data;
+}
+
+/*
+ * Safe version of huge_pte_offset() to check the locks. See comments
+ * above huge_pte_offset().
+ */
+static inline pte_t *
+hugetlb_walk(struct vm_area_struct *vma, unsigned long addr, unsigned long sz)
+{
+#if defined(CONFIG_HUGETLB_PAGE) && \
+ defined(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) && defined(CONFIG_LOCKDEP)
+ struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+
+ /*
+ * If pmd sharing possible, locking needed to safely walk the
+ * hugetlb pgtables. More information can be found at the comment
+ * above huge_pte_offset() in the same file.
+ *
+ * NOTE: lockdep_is_held() is only defined with CONFIG_LOCKDEP.
+ */
+ if (__vma_shareable_flags_pmd(vma))
+ WARN_ON_ONCE(!lockdep_is_held(&vma_lock->rw_sema) &&
+ !lockdep_is_held(
+ &vma->vm_file->f_mapping->i_mmap_rwsem));
+#endif
+ return huge_pte_offset(vma->vm_mm, addr, sz);
+}
+
#endif /* _LINUX_HUGETLB_H */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f42399522805..e3500c087893 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4814,7 +4814,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
} else {
/*
* For shared mappings the vma lock must be held before
- * calling huge_pte_offset in the src vma. Otherwise, the
+ * calling hugetlb_walk() in the src vma. Otherwise, the
* returned ptep could go away if part of a shared pmd and
* another thread calls huge_pmd_unshare.
*/
@@ -4824,7 +4824,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
last_addr_mask = hugetlb_mask_last_page(h);
for (addr = src_vma->vm_start; addr < src_vma->vm_end; addr += sz) {
spinlock_t *src_ptl, *dst_ptl;
- src_pte = huge_pte_offset(src, addr, sz);
+ src_pte = hugetlb_walk(src_vma, addr, sz);
if (!src_pte) {
addr |= last_addr_mask;
continue;
@@ -5028,7 +5028,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
hugetlb_vma_lock_write(vma);
i_mmap_lock_write(mapping);
for (; old_addr < old_end; old_addr += sz, new_addr += sz) {
- src_pte = huge_pte_offset(mm, old_addr, sz);
+ src_pte = hugetlb_walk(vma, old_addr, sz);
if (!src_pte) {
old_addr |= last_addr_mask;
new_addr |= last_addr_mask;
@@ -5091,7 +5091,7 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
last_addr_mask = hugetlb_mask_last_page(h);
address = start;
for (; address < end; address += sz) {
- ptep = huge_pte_offset(mm, address, sz);
+ ptep = hugetlb_walk(vma, address, sz);
if (!ptep) {
address |= last_addr_mask;
continue;
@@ -5404,7 +5404,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
mutex_lock(&hugetlb_fault_mutex_table[hash]);
hugetlb_vma_lock_read(vma);
spin_lock(ptl);
- ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
+ ptep = hugetlb_walk(vma, haddr, huge_page_size(h));
if (likely(ptep &&
pte_same(huge_ptep_get(ptep), pte)))
goto retry_avoidcopy;
@@ -5442,7 +5442,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
* before the page tables are altered
*/
spin_lock(ptl);
- ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
+ ptep = hugetlb_walk(vma, haddr, huge_page_size(h));
if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
/* Break COW or unshare */
huge_ptep_clear_flush(vma, haddr, ptep);
@@ -6227,7 +6227,7 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
return NULL;

hugetlb_vma_lock_read(vma);
- pte = huge_pte_offset(mm, haddr, huge_page_size(h));
+ pte = hugetlb_walk(vma, haddr, huge_page_size(h));
if (!pte)
goto out_unlock;

@@ -6292,8 +6292,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
*
* Note that page table lock is not held when pte is null.
*/
- pte = huge_pte_offset(mm, vaddr & huge_page_mask(h),
- huge_page_size(h));
+ pte = hugetlb_walk(vma, vaddr & huge_page_mask(h),
+ huge_page_size(h));
if (pte)
ptl = huge_pte_lock(h, mm, pte);
absent = !pte || huge_pte_none(huge_ptep_get(pte));
@@ -6479,7 +6479,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
last_addr_mask = hugetlb_mask_last_page(h);
for (; address < end; address += psize) {
spinlock_t *ptl;
- ptep = huge_pte_offset(mm, address, psize);
+ ptep = hugetlb_walk(vma, address, psize);
if (!ptep) {
address |= last_addr_mask;
continue;
@@ -6857,12 +6857,6 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
*end = ALIGN(*end, PUD_SIZE);
}

-static bool __vma_shareable_flags_pmd(struct vm_area_struct *vma)
-{
- return vma->vm_flags & (VM_MAYSHARE | VM_SHARED) &&
- vma->vm_private_data;
-}
-
void hugetlb_vma_lock_read(struct vm_area_struct *vma)
{
if (__vma_shareable_flags_pmd(vma)) {
@@ -7028,8 +7022,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,

saddr = page_table_shareable(svma, vma, addr, idx);
if (saddr) {
- spte = huge_pte_offset(svma->vm_mm, saddr,
- vma_mmu_pagesize(svma));
+ spte = hugetlb_walk(svma, saddr,
+ vma_mmu_pagesize(svma));
if (spte) {
get_page(virt_to_page(spte));
break;
@@ -7387,7 +7381,7 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
hugetlb_vma_lock_write(vma);
i_mmap_lock_write(vma->vm_file->f_mapping);
for (address = start; address < end; address += PUD_SIZE) {
- ptep = huge_pte_offset(mm, address, sz);
+ ptep = hugetlb_walk(vma, address, sz);
if (!ptep)
continue;
ptl = huge_pte_lock(h, mm, ptep);
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 93e13fc17d3c..e97b2e23bd28 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -170,7 +170,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
return not_found(pvmw);

/* when pud is not present, pte will be NULL */
- pvmw->pte = huge_pte_offset(mm, pvmw->address, size);
+ pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
if (!pvmw->pte)
return false;

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index d98564a7be57..cb23f8a15c13 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -305,13 +305,11 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
hugetlb_vma_lock_read(vma);
do {
next = hugetlb_entry_end(h, addr, end);
- pte = huge_pte_offset(walk->mm, addr & hmask, sz);
-
+ pte = hugetlb_walk(vma, addr & hmask, sz);
if (pte)
err = ops->hugetlb_entry(pte, hmask, addr, next, walk);
else if (ops->pte_hole)
err = ops->pte_hole(addr, next, -1, walk);
-
if (err)
break;
} while (addr = next, addr != end);
--
2.37.3

2022-12-07 20:45:58

by Peter Xu

Subject: [PATCH v2 01/10] mm/hugetlb: Let vma_offset_start() to return start

Despite its name, vma_offset_start() does not return "the start address of
the range" but rather the offset to add to the vma->vm_start address.

Make it return the real start virtual address instead. This also helps all
the callers, because whenever the return value is used, it is ultimately
added to vma->vm_start anyway.
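
The arithmetic can be sketched in a small userspace model (PAGE_SHIFT and
the struct below are simplified stand-ins for the kernel types): with the
new semantics, the helper returns the virtual address directly rather than
a relative offset:

```c
#include <assert.h>

#define PAGE_SHIFT 12	/* simplified: 4K pages */

/* minimal stand-in for the fields vma_offset_start() uses */
struct vma {
	unsigned long vm_start;	/* mapping start vaddr */
	unsigned long vm_pgoff;	/* file page offset of vm_start */
};

/* New semantics: return the start vaddr, not an offset */
static unsigned long vma_offset_start(const struct vma *vma, unsigned long start)
{
	unsigned long offset = 0;

	if (vma->vm_pgoff < start)
		offset = (start - vma->vm_pgoff) << PAGE_SHIFT;

	return vma->vm_start + offset;
}
```

Callers can then pass the result straight to unmap_hugepage_range() and
friends without adding vma->vm_start themselves.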

Reviewed-by: Mike Kravetz <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
fs/hugetlbfs/inode.c | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 790d2727141a..fdb16246f46e 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -412,10 +412,12 @@ static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
*/
static unsigned long vma_offset_start(struct vm_area_struct *vma, pgoff_t start)
{
+ unsigned long offset = 0;
+
if (vma->vm_pgoff < start)
- return (start - vma->vm_pgoff) << PAGE_SHIFT;
- else
- return 0;
+ offset = (start - vma->vm_pgoff) << PAGE_SHIFT;
+
+ return vma->vm_start + offset;
}

static unsigned long vma_offset_end(struct vm_area_struct *vma, pgoff_t end)
@@ -457,7 +459,7 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
v_start = vma_offset_start(vma, start);
v_end = vma_offset_end(vma, end);

- if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
+ if (!hugetlb_vma_maps_page(vma, v_start, page))
continue;

if (!hugetlb_vma_trylock_write(vma)) {
@@ -473,8 +475,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
break;
}

- unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
- NULL, ZAP_FLAG_DROP_MARKER);
+ unmap_hugepage_range(vma, v_start, v_end, NULL,
+ ZAP_FLAG_DROP_MARKER);
hugetlb_vma_unlock_write(vma);
}

@@ -507,10 +509,9 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
*/
v_start = vma_offset_start(vma, start);
v_end = vma_offset_end(vma, end);
- if (hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
- unmap_hugepage_range(vma, vma->vm_start + v_start,
- v_end, NULL,
- ZAP_FLAG_DROP_MARKER);
+ if (hugetlb_vma_maps_page(vma, v_start, page))
+ unmap_hugepage_range(vma, v_start, v_end, NULL,
+ ZAP_FLAG_DROP_MARKER);

kref_put(&vma_lock->refs, hugetlb_vma_lock_release);
hugetlb_vma_unlock_write(vma);
@@ -540,8 +541,7 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
v_start = vma_offset_start(vma, start);
v_end = vma_offset_end(vma, end);

- unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
- NULL, zap_flags);
+ unmap_hugepage_range(vma, v_start, v_end, NULL, zap_flags);

/*
* Note that vma lock only exists for shared/non-private
--
2.37.3

2022-12-07 20:46:40

by Peter Xu

Subject: [PATCH v2 02/10] mm/hugetlb: Don't wait for migration entry during follow page

That's what the code does with !hugetlb pages, so we should logically do
the same for hugetlb: a migration entry will then also be treated as no
page.

This is probably also the last piece of follow_page code that may sleep;
the previous one was removed in cf994dd8af27 ("mm/gup: remove
FOLL_MIGRATION", 2022-11-16).

Reviewed-by: Mike Kravetz <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/hugetlb.c | 11 -----------
1 file changed, 11 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1088f2f41c88..c8a6673fe5b4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6232,7 +6232,6 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
if (WARN_ON_ONCE(flags & FOLL_PIN))
return NULL;

-retry:
pte = huge_pte_offset(mm, haddr, huge_page_size(h));
if (!pte)
return NULL;
@@ -6255,16 +6254,6 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
page = NULL;
goto out;
}
- } else {
- if (is_hugetlb_entry_migration(entry)) {
- spin_unlock(ptl);
- __migration_entry_wait_huge(pte, ptl);
- goto retry;
- }
- /*
- * hwpoisoned entry is treated as no_page_table in
- * follow_page_mask().
- */
}
out:
spin_unlock(ptl);
--
2.37.3

2022-12-07 20:47:29

by Peter Xu

Subject: [PATCH v2 03/10] mm/hugetlb: Document huge_pte_offset usage

huge_pte_offset() is potentially a pgtable walker, looking up pte_t* for a
hugetlb address.

Normally, it's always safe to walk a generic pgtable as long as we're with
the mmap lock held for either read or write, because that guarantees the
pgtable pages will always be valid during the process.

But that's not true for hugetlbfs, especially for shared mappings:
hugetlbfs pgtables can be freed by pmd unsharing, which means that even
with the mmap lock held for the current mm, the PMD pgtable page can
still go away from under us if pmd unsharing happens during the walk.

So we have two ways to make it safe even for a shared mapping:

(1) If we're with the hugetlb vma lock held for either read/write, it's
okay because pmd unshare cannot happen at all.

(2) If we're with the i_mmap_rwsem lock held for either read/write, it's
okay because even if pmd unshare can happen, the pgtable page cannot
be freed from under us.

Document it.
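
As a rough illustration of rule (2.2), here is a toy userspace model
(hypothetical names, not kernel code) of a shared PUD-level pgtable page
protected by a refcount: the final put from the racy unmap() is deferred
while a walker "holds" the modeled i_mmap_rwsem, so the page cannot be
freed from under the walker:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: a pgtable page shared by several mappings via a refcount */
struct shared_pmd_page {
	int refs;
	bool freed;
};

static bool i_mmap_rwsem_held;	/* walker side: models holding i_mmap_rwsem */
static bool unmap_pending;	/* final put deferred until the rwsem drops */

/* models huge_pmd_unshare() + the follow-up free on the last ref */
static void pmd_unshare(struct shared_pmd_page *p)
{
	if (--p->refs > 0)
		return;
	if (i_mmap_rwsem_held)
		unmap_pending = true;	/* racy unmap() waits for the rwsem */
	else
		p->freed = true;
}

/* walker releases the rwsem; any deferred free happens now */
static void i_mmap_unlock(struct shared_pmd_page *p)
{
	i_mmap_rwsem_held = false;
	if (unmap_pending) {
		p->freed = true;
		unmap_pending = false;
	}
}
```

While the walker holds the modeled rwsem, even dropping the last reference
leaves the page accessible; the free only completes after the unlock.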

Signed-off-by: Peter Xu <[email protected]>
---
include/linux/hugetlb.h | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 551834cd5299..81efd9b9baa2 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -192,6 +192,38 @@ extern struct list_head huge_boot_pages;

pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, unsigned long sz);
+/*
+ * huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE.
+ * Returns the pte_t* if found, or NULL if the address is not mapped.
+ *
+ * Since this function will walk all the pgtable pages (including not only
+ * high-level pgtable page, but also PUD entry that can be unshared
+ * concurrently for VM_SHARED), the caller of this function should be
+ * responsible of its thread safety. One can follow this rule:
+ *
+ * (1) For private mappings: pmd unsharing is not possible, so it'll
+ * always be safe if we're with the mmap sem for either read or write.
+ * This is normally always the case, IOW we don't need to do anything
+ * special.
+ *
+ * (2) For shared mappings: pmd unsharing is possible (so the PUD-ranged
+ * pgtable page can go away from under us! It can be done by a pmd
+ * unshare with a follow up munmap() on the other process), then we
+ * need either:
+ *
+ * (2.1) hugetlb vma lock read or write held, to make sure pmd unshare
+ * won't happen upon the range (it also makes sure the pte_t we
+ * read is the right and stable one), or,
+ *
+ * (2.2) hugetlb mapping i_mmap_rwsem lock held read or write, to make
+ * sure even if unshare happened the racy unmap() will wait until
+ * i_mmap_rwsem is released.
+ *
+ * Option (2.1) is the safest, which guarantees pte stability from pmd
+ * sharing pov, until the vma lock released. Option (2.2) doesn't protect
+ * a concurrent pmd unshare, but it makes sure the pgtable page is safe to
+ * access.
+ */
pte_t *huge_pte_offset(struct mm_struct *mm,
unsigned long addr, unsigned long sz);
unsigned long hugetlb_mask_last_page(struct hstate *h);
--
2.37.3

2022-12-07 21:09:33

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v2 08/10] mm/hugetlb: Make walk_hugetlb_range() safe to pmd unshare

On 12/7/22 12:30, Peter Xu wrote:
> Since walk_hugetlb_range() walks the pgtable, it needs the vma lock
> to make sure the pgtable page will not be freed concurrently.
>
> Reviewed-by: Mike Kravetz <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> arch/s390/mm/gmap.c | 2 ++
> fs/proc/task_mmu.c | 2 ++
> include/linux/pagewalk.h | 11 ++++++++++-
> mm/hmm.c | 15 ++++++++++++++-
> mm/pagewalk.c | 2 ++
> 5 files changed, 30 insertions(+), 2 deletions(-)

Reviewed-by: John Hubbard <[email protected]>


thanks,
--
John Hubbard
NVIDIA

>
> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> index 8947451ae021..292a54c490d4 100644
> --- a/arch/s390/mm/gmap.c
> +++ b/arch/s390/mm/gmap.c
> @@ -2643,7 +2643,9 @@ static int __s390_enable_skey_hugetlb(pte_t *pte, unsigned long addr,
> end = start + HPAGE_SIZE - 1;
> __storage_key_init_range(start, end);
> set_bit(PG_arch_1, &page->flags);
> + hugetlb_vma_unlock_read(walk->vma);
> cond_resched();
> + hugetlb_vma_lock_read(walk->vma);
> return 0;
> }
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index e35a0398db63..cf3887fb2905 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1613,7 +1613,9 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
> frame++;
> }
>
> + hugetlb_vma_unlock_read(walk->vma);
> cond_resched();
> + hugetlb_vma_lock_read(walk->vma);
>
> return err;
> }
> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
> index 959f52e5867d..27a6df448ee5 100644
> --- a/include/linux/pagewalk.h
> +++ b/include/linux/pagewalk.h
> @@ -21,7 +21,16 @@ struct mm_walk;
> * depth is -1 if not known, 0:PGD, 1:P4D, 2:PUD, 3:PMD.
> * Any folded depths (where PTRS_PER_P?D is equal to 1)
> * are skipped.
> - * @hugetlb_entry: if set, called for each hugetlb entry
> + * @hugetlb_entry: if set, called for each hugetlb entry. This hook
> + * function is called with the vma lock held, in order to
> + * protect against a concurrent freeing of the pte_t* or
> + * the ptl. In some cases, the hook function needs to drop
> + * and retake the vma lock in order to avoid deadlocks
> + * while calling other functions. In such cases the hook
> + * function must either refrain from accessing the pte or
> + * ptl after dropping the vma lock, or else revalidate
> + * those items after re-acquiring the vma lock and before
> + * accessing them.
> * @test_walk: caller specific callback function to determine whether
> * we walk over the current vma or not. Returning 0 means
> * "do page table walk over the current vma", returning
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 3850fb625dda..796de6866089 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -493,8 +493,21 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
> required_fault =
> hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, cpu_flags);
> if (required_fault) {
> + int ret;
> +
> spin_unlock(ptl);
> - return hmm_vma_fault(addr, end, required_fault, walk);
> + hugetlb_vma_unlock_read(vma);
> + /*
> + * Avoid deadlock: drop the vma lock before calling
> + * hmm_vma_fault(), which will itself potentially take and
> + * drop the vma lock. This is also correct from a
> + * protection point of view, because there is no further
> + * use here of either pte or ptl after dropping the vma
> + * lock.
> + */
> + ret = hmm_vma_fault(addr, end, required_fault, walk);
> + hugetlb_vma_lock_read(vma);
> + return ret;
> }
>
> pfn = pte_pfn(entry) + ((start & ~hmask) >> PAGE_SHIFT);
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 7f1c9b274906..d98564a7be57 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -302,6 +302,7 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
> const struct mm_walk_ops *ops = walk->ops;
> int err = 0;
>
> + hugetlb_vma_lock_read(vma);
> do {
> next = hugetlb_entry_end(h, addr, end);
> pte = huge_pte_offset(walk->mm, addr & hmask, sz);
> @@ -314,6 +315,7 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
> if (err)
> break;
> } while (addr = next, addr != end);
> + hugetlb_vma_unlock_read(vma);
>
> return err;
> }

2022-12-07 22:02:28

by John Hubbard

Subject: Re: [PATCH v2 03/10] mm/hugetlb: Document huge_pte_offset usage

On 12/7/22 12:30, Peter Xu wrote:
> huge_pte_offset() is potentially a pgtable walker, looking up pte_t* for a
> hugetlb address.
>
> Normally, it's always safe to walk a generic pgtable as long as we're with
> the mmap lock held for either read or write, because that guarantees the
> pgtable pages will always be valid during the process.
>
> But that's not true for hugetlbfs, especially for shared mappings:
> hugetlbfs pgtables can be freed by pmd unsharing, which means that even
> with the mmap lock held for the current mm, the PMD pgtable page can
> still go away from under us if pmd unsharing happens during the walk.
>
> So we have two ways to make it safe even for a shared mapping:
>
> (1) If we're with the hugetlb vma lock held for either read/write, it's
> okay because pmd unshare cannot happen at all.
>
> (2) If we're with the i_mmap_rwsem lock held for either read/write, it's
> okay because even if pmd unshare can happen, the pgtable page cannot
> be freed from under us.
>
> Document it.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> include/linux/hugetlb.h | 32 ++++++++++++++++++++++++++++++++
> 1 file changed, 32 insertions(+)

Looks good, with a couple of minor wording tweaks below that you might
consider folding in, but either way,

Reviewed-by: John Hubbard <[email protected]>

>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 551834cd5299..81efd9b9baa2 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -192,6 +192,38 @@ extern struct list_head huge_boot_pages;
>
> pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long addr, unsigned long sz);
> +/*
> + * huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE.
> + * Returns the pte_t* if found, or NULL if the address is not mapped.
> + *
> + * Since this function will walk all the pgtable pages (including not only
> + * high-level pgtable page, but also PUD entry that can be unshared
> + * concurrently for VM_SHARED), the caller of this function should be
> + * responsible of its thread safety. One can follow this rule:

"responsible for"

> + *
> + * (1) For private mappings: pmd unsharing is not possible, so it'll
> + * always be safe if we're with the mmap sem for either read or write.

mmap sem is sooo two years ago! :)

> + * This is normally always the case, IOW we don't need to do anything

"normally always" hurts my sense of logic. And "IOW" is for typing very quickly
in chats or email, not for long term documentation that is written rarely
and read many times.

> + * special.

So putting all that together, maybe:

* (1) For private mappings: pmd unsharing is not possible, so holding the
* mmap_lock for either read or write is sufficient. Most callers already
* hold the mmap_lock, so normally, no special action is required.


thanks,
--
John Hubbard
NVIDIA

> + *
> + * (2) For shared mappings: pmd unsharing is possible (so the PUD-ranged
> + * pgtable page can go away from under us! It can be done by a pmd
> + * unshare with a follow up munmap() on the other process), then we
> + * need either:
> + *
> + * (2.1) hugetlb vma lock read or write held, to make sure pmd unshare
> + * won't happen upon the range (it also makes sure the pte_t we
> + * read is the right and stable one), or,
> + *
> + * (2.2) hugetlb mapping i_mmap_rwsem lock held read or write, to make
> + * sure even if unshare happened the racy unmap() will wait until
> + * i_mmap_rwsem is released.
> + *
> + * Option (2.1) is the safest, which guarantees pte stability from pmd
> + * sharing pov, until the vma lock released. Option (2.2) doesn't protect
> + * a concurrent pmd unshare, but it makes sure the pgtable page is safe to
> + * access.
> + */
> pte_t *huge_pte_offset(struct mm_struct *mm,
> unsigned long addr, unsigned long sz);
> unsigned long hugetlb_mask_last_page(struct hstate *h);

2022-12-07 22:14:32

by John Hubbard

Subject: Re: [PATCH v2 01/10] mm/hugetlb: Let vma_offset_start() to return start

On 12/7/22 12:30, Peter Xu wrote:
> Even though vma_offset_start() is named like that, it's not returning "the
> start address of the range" but rather the offset we should use to offset
> the vma->vm_start address.
>
> Make it return the real value of the start vaddr, and it also helps for all
> the callers because whenever the retval is used, it'll be ultimately added
> into the vma->vm_start anyway, so it's better.
>
> Reviewed-by: Mike Kravetz <[email protected]>
> Reviewed-by: David Hildenbrand <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> fs/hugetlbfs/inode.c | 24 ++++++++++++------------
> 1 file changed, 12 insertions(+), 12 deletions(-)

This is a nice refinement.

Reviewed-by: John Hubbard <[email protected]>


thanks,
--
John Hubbard
NVIDIA

>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 790d2727141a..fdb16246f46e 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -412,10 +412,12 @@ static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
> */
> static unsigned long vma_offset_start(struct vm_area_struct *vma, pgoff_t start)
> {
> + unsigned long offset = 0;
> +
> if (vma->vm_pgoff < start)
> - return (start - vma->vm_pgoff) << PAGE_SHIFT;
> - else
> - return 0;
> + offset = (start - vma->vm_pgoff) << PAGE_SHIFT;
> +
> + return vma->vm_start + offset;
> }
>
> static unsigned long vma_offset_end(struct vm_area_struct *vma, pgoff_t end)
> @@ -457,7 +459,7 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> v_start = vma_offset_start(vma, start);
> v_end = vma_offset_end(vma, end);
>
> - if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
> + if (!hugetlb_vma_maps_page(vma, v_start, page))
> continue;
>
> if (!hugetlb_vma_trylock_write(vma)) {
> @@ -473,8 +475,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> break;
> }
>
> - unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
> - NULL, ZAP_FLAG_DROP_MARKER);
> + unmap_hugepage_range(vma, v_start, v_end, NULL,
> + ZAP_FLAG_DROP_MARKER);
> hugetlb_vma_unlock_write(vma);
> }
>
> @@ -507,10 +509,9 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> */
> v_start = vma_offset_start(vma, start);
> v_end = vma_offset_end(vma, end);
> - if (hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
> - unmap_hugepage_range(vma, vma->vm_start + v_start,
> - v_end, NULL,
> - ZAP_FLAG_DROP_MARKER);
> + if (hugetlb_vma_maps_page(vma, v_start, page))
> + unmap_hugepage_range(vma, v_start, v_end, NULL,
> + ZAP_FLAG_DROP_MARKER);
>
> kref_put(&vma_lock->refs, hugetlb_vma_lock_release);
> hugetlb_vma_unlock_write(vma);
> @@ -540,8 +541,7 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
> v_start = vma_offset_start(vma, start);
> v_end = vma_offset_end(vma, end);
>
> - unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
> - NULL, zap_flags);
> + unmap_hugepage_range(vma, v_start, v_end, NULL, zap_flags);
>
> /*
> * Note that vma lock only exists for shared/non-private

2022-12-07 22:15:14

by John Hubbard

Subject: Re: [PATCH v2 02/10] mm/hugetlb: Don't wait for migration entry during follow page

On 12/7/22 12:30, Peter Xu wrote:
> That's what the code does with !hugetlb pages, so we should logically do
> the same for hugetlb: a migration entry will then also be treated as no page.

This reasoning makes good sense to me. I looked again at the follow_page*()
paths and just double-checked that this is accurate, and it is.

>
> This is probably also the last piece in follow_page code that may sleep,
> the last one should be removed in cf994dd8af27 ("mm/gup: remove
> FOLL_MIGRATION", 2022-11-16).
>
> Reviewed-by: Mike Kravetz <[email protected]>
> Reviewed-by: David Hildenbrand <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> mm/hugetlb.c | 11 -----------
> 1 file changed, 11 deletions(-)

Reviewed-by: John Hubbard <[email protected]>

thanks,
--
John Hubbard
NVIDIA

>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 1088f2f41c88..c8a6673fe5b4 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6232,7 +6232,6 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> if (WARN_ON_ONCE(flags & FOLL_PIN))
> return NULL;
>
> -retry:
> pte = huge_pte_offset(mm, haddr, huge_page_size(h));
> if (!pte)
> return NULL;
> @@ -6255,16 +6254,6 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> page = NULL;
> goto out;
> }
> - } else {
> - if (is_hugetlb_entry_migration(entry)) {
> - spin_unlock(ptl);
> - __migration_entry_wait_huge(pte, ptl);
> - goto retry;
> - }
> - /*
> - * hwpoisoned entry is treated as no_page_table in
> - * follow_page_mask().
> - */
> }
> out:
> spin_unlock(ptl);

2022-12-07 22:54:03

by John Hubbard

Subject: Re: [PATCH v2 04/10] mm/hugetlb: Move swap entry handling into vma lock when faulted

On 12/7/22 12:30, Peter Xu wrote:
> In hugetlb_fault(), there used to be a special path to handle swap entries
> at the entrance using huge_pte_offset(). That's unsafe because
> huge_pte_offset() for a pmd sharable range can access freed pgtables
> without any lock to protect the pgtable from being freed after pmd unshare.
>
> Here the simplest solution to make it safe is to move the swap handling to
> after the vma lock is held. We may now need to take the fault mutex on
> either migration or hwpoison entries (also the vma lock, but that's
> really needed anyway); however, neither of them is a hot path.
>
> Note that the vma lock cannot be released in hugetlb_fault() when the
> migration entry is detected, because in migration_entry_wait_huge() the
> pgtable page will be used again (by taking the pgtable lock), so that also
> needs to be protected by the vma lock. Modify migration_entry_wait_huge()
> so that it must be called with vma read lock held, and properly release the
> lock in __migration_entry_wait_huge().
>
> Reviewed-by: Mike Kravetz <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> include/linux/swapops.h | 6 ++++--
> mm/hugetlb.c | 36 +++++++++++++++---------------------
> mm/migrate.c | 25 +++++++++++++++++++++----
> 3 files changed, 40 insertions(+), 27 deletions(-)
>
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index a70b5c3a68d7..b134c5eb75cb 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -337,7 +337,8 @@ extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
> extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> unsigned long address);
> #ifdef CONFIG_HUGETLB_PAGE
> -extern void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl);
> +extern void __migration_entry_wait_huge(struct vm_area_struct *vma,
> + pte_t *ptep, spinlock_t *ptl);
> extern void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte);
> #endif /* CONFIG_HUGETLB_PAGE */
> #else /* CONFIG_MIGRATION */
> @@ -366,7 +367,8 @@ static inline void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
> static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> unsigned long address) { }
> #ifdef CONFIG_HUGETLB_PAGE
> -static inline void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl) { }
> +static inline void __migration_entry_wait_huge(struct vm_area_struct *vma,
> + pte_t *ptep, spinlock_t *ptl) { }
> static inline void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte) { }
> #endif /* CONFIG_HUGETLB_PAGE */
> static inline int is_writable_migration_entry(swp_entry_t entry)
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index c8a6673fe5b4..49f73677a418 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5824,22 +5824,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> int need_wait_lock = 0;
> unsigned long haddr = address & huge_page_mask(h);
>
> - ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
> - if (ptep) {
> - /*
> - * Since we hold no locks, ptep could be stale. That is
> - * OK as we are only making decisions based on content and
> - * not actually modifying content here.
> - */
> - entry = huge_ptep_get(ptep);
> - if (unlikely(is_hugetlb_entry_migration(entry))) {
> - migration_entry_wait_huge(vma, ptep);
> - return 0;
> - } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
> - return VM_FAULT_HWPOISON_LARGE |
> - VM_FAULT_SET_HINDEX(hstate_index(h));
> - }
> -
> /*
> * Serialize hugepage allocation and instantiation, so that we don't
> * get spurious allocation failures if two CPUs race to instantiate
> @@ -5854,10 +5838,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> * Acquire vma lock before calling huge_pte_alloc and hold
> * until finished with ptep. This prevents huge_pmd_unshare from
> * being called elsewhere and making the ptep no longer valid.
> - *
> - * ptep could have already be assigned via huge_pte_offset. That
> - * is OK, as huge_pte_alloc will return the same value unless
> - * something has changed.
> */
> hugetlb_vma_lock_read(vma);
> ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
> @@ -5886,8 +5866,22 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> * fault, and is_hugetlb_entry_(migration|hwpoisoned) check will
> * properly handle it.
> */
> - if (!pte_present(entry))
> + if (!pte_present(entry)) {
> + if (unlikely(is_hugetlb_entry_migration(entry))) {
> + /*
> + * Release fault lock first because the vma lock is
> + * needed to guard the huge_pte_lockptr() later in
> + * migration_entry_wait_huge(). The vma lock will
> + * be released there.
> + */
> + mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> + migration_entry_wait_huge(vma, ptep);
> + return 0;

Oh, but now (and also one other, pre-existing case, above)
hugetlb_fault() is returning with the vma lock held. This is in contrast
with most of the rest of the function, which takes great care to release
locks before returning.

Which makes this new case really quite irregular and makes the overall
locking harder to follow. It would be ideal to avoid doing this! But at
the very least, there should be a little note above hugetlb_fault(),
explaining this deviation and how it fits in with the locking rules.

Do we really have to structure it this way, though?

thanks,
--
John Hubbard
NVIDIA

> + } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
> + ret = VM_FAULT_HWPOISON_LARGE |
> + VM_FAULT_SET_HINDEX(hstate_index(h));
> goto out_mutex;
> + }
>
> /*
> * If we are going to COW/unshare the mapping later, we examine the
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 48584b032ea9..d14f1f3ab073 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -333,24 +333,41 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> }
>
> #ifdef CONFIG_HUGETLB_PAGE
> -void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl)
> +void __migration_entry_wait_huge(struct vm_area_struct *vma,
> + pte_t *ptep, spinlock_t *ptl)
> {
> pte_t pte;
>
> + /*
> +	 * The vma read lock must be held; it will be released before
> +	 * this function returns.  It ensures the pgtable page (along
> +	 * with its spin lock) is not freed in parallel.
> + */
> + hugetlb_vma_assert_locked(vma);
> +
> spin_lock(ptl);
> pte = huge_ptep_get(ptep);
>
> - if (unlikely(!is_hugetlb_entry_migration(pte)))
> + if (unlikely(!is_hugetlb_entry_migration(pte))) {
> spin_unlock(ptl);
> - else
> + hugetlb_vma_unlock_read(vma);
> + } else {
> + /*
> +		 * If a migration entry exists, it is safe to release the
> +		 * vma lock here because the pgtable page won't be freed
> +		 * until the pgtable lock is released.  See the comment
> +		 * right above the pgtable lock release in
> +		 * migration_entry_wait_on_locked().
> + */
> + hugetlb_vma_unlock_read(vma);
> migration_entry_wait_on_locked(pte_to_swp_entry(pte), NULL, ptl);
> + }
> }
>
> void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte)
> {
> spinlock_t *ptl = huge_pte_lockptr(hstate_vma(vma), vma->vm_mm, pte);
>
> - __migration_entry_wait_huge(pte, ptl);
> + __migration_entry_wait_huge(vma, pte, ptl);
> }
> #endif
>

2022-12-07 23:09:17

by Peter Xu

Subject: Re: [PATCH v2 04/10] mm/hugetlb: Move swap entry handling into vma lock when faulted

Hi, John,

Firstly, thanks for taking a look at the whole set.

On Wed, Dec 07, 2022 at 02:36:21PM -0800, John Hubbard wrote:
> > @@ -5886,8 +5866,22 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > * fault, and is_hugetlb_entry_(migration|hwpoisoned) check will
> > * properly handle it.
> > */
> > - if (!pte_present(entry))
> > + if (!pte_present(entry)) {
> > + if (unlikely(is_hugetlb_entry_migration(entry))) {
> > + /*
> > + * Release fault lock first because the vma lock is
> > + * needed to guard the huge_pte_lockptr() later in
> > + * migration_entry_wait_huge(). The vma lock will
> > + * be released there.
> > + */
> > + mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> > + migration_entry_wait_huge(vma, ptep);
> > + return 0;
>
> Oh, but now (and also one other, pre-existing case, above)
> hugetlb_fault() is returning with the vma lock held.

Note that here migration_entry_wait_huge() will release it.

Sorry, it's definitely not as straightforward, but it's also something I
couldn't find a better solution for, because we need the vma lock to
protect the spinlock, which is used deep in the migration code path.

That's also why I added a rich comment above, including "The vma lock
will be released there", which covers exactly this.
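To make the handoff easier to follow, here is a condensed sketch
(simplified from the patch, not the literal code) of who owns the vma
lock on the migration path:

```
hugetlb_vma_lock_read(vma);		/* taken early in hugetlb_fault() */
ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
...
if (unlikely(is_hugetlb_entry_migration(entry))) {
	/* fault mutex dropped first; vma lock intentionally kept */
	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
	/* callee releases the vma lock before actually waiting */
	migration_entry_wait_huge(vma, ptep);
	return 0;
}
```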

> This is in contrast
> with most of the rest of the function, which takes great care to release
> locks before returning.
>
> Which makes this new case really quite irregular and makes the overall
> locking harder to follow. It would be ideal to avoid doing this! But at
> the very least, there should be a little note above hugetlb_fault(),
> explaining this deviation and how it fits in with the locking rules.
>
> Do we really have to structure it this way, though?

--
Peter Xu

2022-12-07 23:12:12

by Mike Kravetz

Subject: Re: [PATCH v2 09/10] mm/hugetlb: Introduce hugetlb_walk()

On 12/07/22 15:31, Peter Xu wrote:
> huge_pte_offset() is the main walker function for hugetlb pgtables. The
> name is not really representing what it does, though.
>
> Instead of renaming it, introduce a wrapper function called hugetlb_walk()
> which will use huge_pte_offset() inside. Assert on the locks when walking
> the pgtable.
>
> Note, the vma lock assertion will be a no-op for private mappings.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> fs/hugetlbfs/inode.c | 4 +---
> fs/userfaultfd.c | 6 ++----
> include/linux/hugetlb.h | 39 +++++++++++++++++++++++++++++++++++++++
> mm/hugetlb.c | 32 +++++++++++++-------------------
> mm/page_vma_mapped.c | 2 +-
> mm/pagewalk.c | 4 +---
> 6 files changed, 57 insertions(+), 30 deletions(-)

Thanks! I like the lockdep checks in hugetlb_walk.

Reviewed-by: Mike Kravetz <[email protected]>

--
Mike Kravetz

2022-12-07 23:31:15

by John Hubbard

Subject: Re: [PATCH v2 04/10] mm/hugetlb: Move swap entry handling into vma lock when faulted

On 12/7/22 14:43, Peter Xu wrote:
> Note that here migration_entry_wait_huge() will release it.
>
> Sorry, it's definitely not as straightforward, but it's also something I
> couldn't find a better solution for, because we need the vma lock to
> protect the spinlock, which is used deep in the migration code path.
>
> That's also why I added a rich comment above, including "The vma lock
> will be released there", which covers exactly this.
>

Yes, OK,

Reviewed-by: John Hubbard <[email protected]>

...and here is some fancy documentation polishing (incremental on top of this
specific patch) if you would like to fold it in, optional but it makes me
happier:


diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 49f73677a418..e3bbd4869f68 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5809,6 +5809,10 @@ u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx)
}
#endif

+/*
+ * There are a few special cases in which this function returns while still
+ * holding locks. Those are noted inline.
+ */
vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, unsigned int flags)
{
@@ -5851,8 +5855,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
/* PTE markers should be handled the same way as none pte */
if (huge_pte_none_mostly(entry))
/*
- * hugetlb_no_page will drop vma lock and hugetlb fault
- * mutex internally, which make us return immediately.
+ * hugetlb_no_page() will release both the vma lock and the
+ * hugetlb fault mutex, so just return directly from that:
*/
return hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
entry, flags);
@@ -5869,10 +5873,11 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (!pte_present(entry)) {
if (unlikely(is_hugetlb_entry_migration(entry))) {
/*
- * Release fault lock first because the vma lock is
- * needed to guard the huge_pte_lockptr() later in
- * migration_entry_wait_huge(). The vma lock will
- * be released there.
+ * Release the hugetlb fault lock now, but retain the
+ * vma lock, because it is needed to guard the
+ * huge_pte_lockptr() later in
+ * migration_entry_wait_huge(). The vma lock will be
+ * released there.
*/
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
migration_entry_wait_huge(vma, ptep);
diff --git a/mm/migrate.c b/mm/migrate.c
index d14f1f3ab073..a31df628b938 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -333,16 +333,18 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
}

#ifdef CONFIG_HUGETLB_PAGE
+
+/*
+ * The vma read lock must be held upon entry. Holding that lock prevents either
+ * the pte or the ptl from being freed.
+ *
+ * This function will release the vma lock before returning.
+ */
void __migration_entry_wait_huge(struct vm_area_struct *vma,
pte_t *ptep, spinlock_t *ptl)
{
pte_t pte;

- /*
- * The vma read lock must be taken, which will be released before
- * the function returns. It makes sure the pgtable page (along
- * with its spin lock) not be freed in parallel.
- */
hugetlb_vma_assert_locked(vma);

spin_lock(ptl);


thanks,
--
John Hubbard
NVIDIA

2022-12-07 23:48:53

by John Hubbard

Subject: Re: [PATCH v2 05/10] mm/hugetlb: Make userfaultfd_huge_must_wait() safe to pmd unshare

On 12/7/22 12:30, Peter Xu wrote:
> We can take the hugetlb walker lock, here taking vma lock directly.
>
> Reviewed-by: David Hildenbrand <[email protected]>
> Reviewed-by: Mike Kravetz <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> fs/userfaultfd.c | 18 ++++++++++++++----
> 1 file changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 07c81ab3fd4d..a602f008dde5 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -376,7 +376,8 @@ static inline unsigned int userfaultfd_get_blocking_state(unsigned int flags)
> */
> vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> {
> - struct mm_struct *mm = vmf->vma->vm_mm;
> + struct vm_area_struct *vma = vmf->vma;
> + struct mm_struct *mm = vma->vm_mm;
> struct userfaultfd_ctx *ctx;
> struct userfaultfd_wait_queue uwq;
> vm_fault_t ret = VM_FAULT_SIGBUS;
> @@ -403,7 +404,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> */
> mmap_assert_locked(mm);
>
> - ctx = vmf->vma->vm_userfaultfd_ctx.ctx;
> + ctx = vma->vm_userfaultfd_ctx.ctx;
> if (!ctx)
> goto out;
>
> @@ -493,6 +494,13 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>
> blocking_state = userfaultfd_get_blocking_state(vmf->flags);
>
> + /*
> > +	/*
> + * to be before setting current state.
> + */

Looking at this code, I am not able to come up with a reason for why the
vma lock/unlock placement is exactly where it is. It looks quite arbitrary.

Why not, for example, take and drop the vma lock within
userfaultfd_huge_must_wait()? That makes more sense to me, but I'm not familiar
with userfaultfd so of course I'm missing something.

But the comment above certainly doesn't supply that something.


thanks,
--
John Hubbard
NVIDIA

> + if (is_vm_hugetlb_page(vma))
> + hugetlb_vma_lock_read(vma);
> +
> spin_lock_irq(&ctx->fault_pending_wqh.lock);
> /*
> * After the __add_wait_queue the uwq is visible to userland
> @@ -507,13 +515,15 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> set_current_state(blocking_state);
> spin_unlock_irq(&ctx->fault_pending_wqh.lock);
>
> - if (!is_vm_hugetlb_page(vmf->vma))
> + if (!is_vm_hugetlb_page(vma))
> must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
> reason);
> else
> - must_wait = userfaultfd_huge_must_wait(ctx, vmf->vma,
> + must_wait = userfaultfd_huge_must_wait(ctx, vma,
> vmf->address,
> vmf->flags, reason);
> + if (is_vm_hugetlb_page(vma))
> + hugetlb_vma_unlock_read(vma);
> mmap_read_unlock(mm);
>
> if (likely(must_wait && !READ_ONCE(ctx->released))) {

2022-12-07 23:50:50

by John Hubbard

Subject: Re: [PATCH v2 07/10] mm/hugetlb: Make follow_hugetlb_page() safe to pmd unshare

On 12/7/22 12:30, Peter Xu wrote:
> Since follow_hugetlb_page() walks the pgtable, it needs the vma lock
> to make sure the pgtable page will not be freed concurrently.
>
> Acked-by: David Hildenbrand <[email protected]>
> Reviewed-by: Mike Kravetz <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> mm/hugetlb.c | 7 +++++++
> 1 file changed, 7 insertions(+)

Reviewed-by: John Hubbard <[email protected]>


thanks,
--
John Hubbard
NVIDIA

>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3fbbd599d015..f42399522805 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6284,6 +6284,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> break;
> }
>
> + hugetlb_vma_lock_read(vma);
> /*
> * Some archs (sparc64, sh*) have multiple pte_ts to
> * each hugepage. We have to make sure we get the
> @@ -6308,6 +6309,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> !hugetlbfs_pagecache_present(h, vma, vaddr)) {
> if (pte)
> spin_unlock(ptl);
> + hugetlb_vma_unlock_read(vma);
> remainder = 0;
> break;
> }
> @@ -6329,6 +6331,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>
> if (pte)
> spin_unlock(ptl);
> + hugetlb_vma_unlock_read(vma);
> +
> if (flags & FOLL_WRITE)
> fault_flags |= FAULT_FLAG_WRITE;
> else if (unshare)
> @@ -6388,6 +6392,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> remainder -= pages_per_huge_page(h);
> i += pages_per_huge_page(h);
> spin_unlock(ptl);
> + hugetlb_vma_unlock_read(vma);
> continue;
> }
>
> @@ -6415,6 +6420,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> if (WARN_ON_ONCE(!try_grab_folio(pages[i], refs,
> flags))) {
> spin_unlock(ptl);
> + hugetlb_vma_unlock_read(vma);
> remainder = 0;
> err = -ENOMEM;
> break;
> @@ -6426,6 +6432,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> i += refs;
>
> spin_unlock(ptl);
> + hugetlb_vma_unlock_read(vma);
> }
> *nr_pages = remainder;
> /*

2022-12-07 23:53:20

by John Hubbard

Subject: Re: [PATCH v2 06/10] mm/hugetlb: Make hugetlb_follow_page_mask() safe to pmd unshare

On 12/7/22 12:30, Peter Xu wrote:
> Since hugetlb_follow_page_mask() walks the pgtable, it needs the vma lock
> to make sure the pgtable page will not be freed concurrently.
>
> Acked-by: David Hildenbrand <[email protected]>
> Reviewed-by: Mike Kravetz <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> mm/hugetlb.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)

Reviewed-by: John Hubbard <[email protected]>


thanks,
--
John Hubbard
NVIDIA

>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 49f73677a418..3fbbd599d015 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6226,9 +6226,10 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> if (WARN_ON_ONCE(flags & FOLL_PIN))
> return NULL;
>
> + hugetlb_vma_lock_read(vma);
> pte = huge_pte_offset(mm, haddr, huge_page_size(h));
> if (!pte)
> - return NULL;
> + goto out_unlock;
>
> ptl = huge_pte_lock(h, mm, pte);
> entry = huge_ptep_get(pte);
> @@ -6251,6 +6252,8 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> }
> out:
> spin_unlock(ptl);
> +out_unlock:
> + hugetlb_vma_unlock_read(vma);
> return page;
> }
>

2022-12-08 00:20:35

by Peter Xu

Subject: Re: [PATCH v2 05/10] mm/hugetlb: Make userfaultfd_huge_must_wait() safe to pmd unshare

On Wed, Dec 07, 2022 at 03:19:55PM -0800, John Hubbard wrote:
> On 12/7/22 12:30, Peter Xu wrote:
> > We can take the hugetlb walker lock, here taking vma lock directly.
> >
> > Reviewed-by: David Hildenbrand <[email protected]>
> > Reviewed-by: Mike Kravetz <[email protected]>
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > fs/userfaultfd.c | 18 ++++++++++++++----
> > 1 file changed, 14 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index 07c81ab3fd4d..a602f008dde5 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -376,7 +376,8 @@ static inline unsigned int userfaultfd_get_blocking_state(unsigned int flags)
> > */
> > vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> > {
> > - struct mm_struct *mm = vmf->vma->vm_mm;
> > + struct vm_area_struct *vma = vmf->vma;
> > + struct mm_struct *mm = vma->vm_mm;
> > struct userfaultfd_ctx *ctx;
> > struct userfaultfd_wait_queue uwq;
> > vm_fault_t ret = VM_FAULT_SIGBUS;
> > @@ -403,7 +404,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> > */
> > mmap_assert_locked(mm);
> > - ctx = vmf->vma->vm_userfaultfd_ctx.ctx;
> > + ctx = vma->vm_userfaultfd_ctx.ctx;
> > if (!ctx)
> > goto out;
> > @@ -493,6 +494,13 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> > blocking_state = userfaultfd_get_blocking_state(vmf->flags);
> > + /*
> > > + * This stabilizes pgtable for hugetlb on e.g. pmd unsharing. Need
> > + * to be before setting current state.
> > + */
>
> Looking at this code, I am not able to come up with a reason for why the
> vma lock/unlock placement is exactly where it is. It looks quite arbitrary.
>
> Why not, for example, take and drop the vma lock within
> userfaultfd_huge_must_wait()? That makes more sense to me, but I'm not familiar
> with userfaultfd so of course I'm missing something.
>
> But the comment above certainly doesn't supply that something.

The part that matters in the comment is "need to be before setting current
state".

blocking_state = userfaultfd_get_blocking_state(vmf->flags);
if (is_vm_hugetlb_page(vma))
hugetlb_vma_lock_read(vma);
set_current_state(blocking_state);

down_read() can sleep and also modify the task state; we cannot take the
lock after that point, because otherwise the task state will be messed up.
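In other words (a schematic contrast, not the actual code):

```
/* Wrong: down_read() inside hugetlb_vma_lock_read() may sleep and
 * overwrite the task state we just set:
 */
set_current_state(blocking_state);
hugetlb_vma_lock_read(vma);	/* may sleep -> wait state clobbered */

/* Right (what the patch does): take the sleepable lock first, then
 * set the wait state:
 */
hugetlb_vma_lock_read(vma);
set_current_state(blocking_state);
```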

--
Peter Xu

2022-12-08 00:28:43

by John Hubbard

Subject: Re: [PATCH v2 05/10] mm/hugetlb: Make userfaultfd_huge_must_wait() safe to pmd unshare

On 12/7/22 15:44, Peter Xu wrote:
> The part that matters in the comment is "need to be before setting current
> state".
>
> blocking_state = userfaultfd_get_blocking_state(vmf->flags);
> if (is_vm_hugetlb_page(vma))
> hugetlb_vma_lock_read(vma);
> set_current_state(blocking_state);
>
> down_read() can sleep and also modify the task state; we cannot take the
> lock after that point, because otherwise the task state will be messed up.
>

OK, how about:

/*
* Take the vma lock now, in order to safely call
* userfaultfd_huge_must_wait() a little bit later. Because acquiring
* the (sleepable) vma lock potentially modifies the current task state,
* that must be before explicitly calling set_current_state().
*/

Other than that, the patch looks good, so:

Reviewed-by: John Hubbard <[email protected]>

thanks,
--
John Hubbard
NVIDIA

2022-12-08 00:29:54

by John Hubbard

Subject: Re: [PATCH v2 09/10] mm/hugetlb: Introduce hugetlb_walk()

On 12/7/22 12:31, Peter Xu wrote:
> huge_pte_offset() is the main walker function for hugetlb pgtables. The
> name is not really representing what it does, though.
>
> Instead of renaming it, introduce a wrapper function called hugetlb_walk()
> which will use huge_pte_offset() inside. Assert on the locks when walking
> the pgtable.
>
> Note, the vma lock assertion will be a no-op for private mappings.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> fs/hugetlbfs/inode.c | 4 +---
> fs/userfaultfd.c | 6 ++----
> include/linux/hugetlb.h | 39 +++++++++++++++++++++++++++++++++++++++
> mm/hugetlb.c | 32 +++++++++++++-------------------
> mm/page_vma_mapped.c | 2 +-
> mm/pagewalk.c | 4 +---
> 6 files changed, 57 insertions(+), 30 deletions(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index fdb16246f46e..48f1a8ad2243 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -388,9 +388,7 @@ static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
> {
> pte_t *ptep, pte;
>
> - ptep = huge_pte_offset(vma->vm_mm, addr,
> - huge_page_size(hstate_vma(vma)));
> -
> + ptep = hugetlb_walk(vma, addr, huge_page_size(hstate_vma(vma)));
> if (!ptep)
> return false;
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index a602f008dde5..f31fe1a9f4c5 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -237,14 +237,12 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> unsigned long flags,
> unsigned long reason)
> {
> - struct mm_struct *mm = ctx->mm;
> pte_t *ptep, pte;
> bool ret = true;
>
> - mmap_assert_locked(mm);
> -
> - ptep = huge_pte_offset(mm, address, vma_mmu_pagesize(vma));
> + mmap_assert_locked(ctx->mm);
>
> + ptep = hugetlb_walk(vma, address, vma_mmu_pagesize(vma));
> if (!ptep)
> goto out;
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 81efd9b9baa2..1c20cbbf3d22 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -2,6 +2,7 @@
> #ifndef _LINUX_HUGETLB_H
> #define _LINUX_HUGETLB_H
>
> +#include <linux/mm.h>
> #include <linux/mm_types.h>
> #include <linux/mmdebug.h>
> #include <linux/fs.h>
> @@ -196,6 +197,11 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> * huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE.
> * Returns the pte_t* if found, or NULL if the address is not mapped.
> *
> + * IMPORTANT: we should normally not call this function directly;
> + * it is only a common interface for implementing arch-specific
> + * walkers. Please consider using the hugetlb_walk() helper to make
> + * sure the correct locking is satisfied.

Or:

"Please use hugetlb_walk() instead, because that will attempt to verify
the locking for you."

> + *
> * Since this function will walk all the pgtable pages (including not only
> * high-level pgtable page, but also PUD entry that can be unshared
> * concurrently for VM_SHARED), the caller of this function should be
> @@ -1229,4 +1235,37 @@ bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr);
> #define flush_hugetlb_tlb_range(vma, addr, end) flush_tlb_range(vma, addr, end)
> #endif
>
> +static inline bool
> +__vma_shareable_flags_pmd(struct vm_area_struct *vma)
> +{
> + return vma->vm_flags & (VM_MAYSHARE | VM_SHARED) &&
> + vma->vm_private_data;
> +}
> +
> +/*
> + * Safe version of huge_pte_offset() to check the locks. See comments
> + * above huge_pte_offset().
> + */

It is odd to say that functionA() is a safe version of functionB(), if the
names are completely different.

At this point, it is very clear that huge_pte_offset() should be renamed.
I'd suggest something like one of these:

__hugetlb_walk()
hugetlb_walk_raw()

> +static inline pte_t *
> +hugetlb_walk(struct vm_area_struct *vma, unsigned long addr, unsigned long sz)
> +{
> +#if defined(CONFIG_HUGETLB_PAGE) && \
> + defined(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) && defined(CONFIG_LOCKDEP)
> + struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
> +
> + /*
> + * If pmd sharing possible, locking needed to safely walk the
> + * hugetlb pgtables. More information can be found at the comment
> + * above huge_pte_offset() in the same file.
> + *
> + * NOTE: lockdep_is_held() is only defined with CONFIG_LOCKDEP.
> + */
> + if (__vma_shareable_flags_pmd(vma))
> + WARN_ON_ONCE(!lockdep_is_held(&vma_lock->rw_sema) &&
> + !lockdep_is_held(
> + &vma->vm_file->f_mapping->i_mmap_rwsem));
> +#endif
> + return huge_pte_offset(vma->vm_mm, addr, sz);
> +}

Let's please not slice up C functions with ifdefs. Instead, stick to the
standard approach of

#ifdef X
functionC()
{
...implementation
}
#else
functionC()
{
...simpler or shorter or stub implementation
}
#endif
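Applied to this helper, that pattern might look like the following
(a hypothetical restructure, helper name invented here; the lockdep
assertion is lifted out so only the assertion, not the walk itself, is
conditionally compiled):

```
#if defined(CONFIG_HUGETLB_PAGE) && \
	defined(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) && defined(CONFIG_LOCKDEP)
static inline void hugetlb_walk_assert_locked(struct vm_area_struct *vma)
{
	struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

	if (__vma_shareable_flags_pmd(vma))
		WARN_ON_ONCE(!lockdep_is_held(&vma_lock->rw_sema) &&
			     !lockdep_is_held(
				&vma->vm_file->f_mapping->i_mmap_rwsem));
}
#else
static inline void hugetlb_walk_assert_locked(struct vm_area_struct *vma) { }
#endif

static inline pte_t *
hugetlb_walk(struct vm_area_struct *vma, unsigned long addr, unsigned long sz)
{
	hugetlb_walk_assert_locked(vma);
	return huge_pte_offset(vma->vm_mm, addr, sz);
}
```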

> +
> #endif /* _LINUX_HUGETLB_H */
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f42399522805..e3500c087893 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4814,7 +4814,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> } else {
> /*
> * For shared mappings the vma lock must be held before
> - * calling huge_pte_offset in the src vma. Otherwise, the
> + * calling hugetlb_walk() in the src vma. Otherwise, the
> * returned ptep could go away if part of a shared pmd and
> * another thread calls huge_pmd_unshare.
> */
> @@ -4824,7 +4824,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> last_addr_mask = hugetlb_mask_last_page(h);
> for (addr = src_vma->vm_start; addr < src_vma->vm_end; addr += sz) {
> spinlock_t *src_ptl, *dst_ptl;
> - src_pte = huge_pte_offset(src, addr, sz);
> + src_pte = hugetlb_walk(src_vma, addr, sz);
> if (!src_pte) {
> addr |= last_addr_mask;
> continue;
> @@ -5028,7 +5028,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
> hugetlb_vma_lock_write(vma);
> i_mmap_lock_write(mapping);
> for (; old_addr < old_end; old_addr += sz, new_addr += sz) {
> - src_pte = huge_pte_offset(mm, old_addr, sz);
> + src_pte = hugetlb_walk(vma, old_addr, sz);
> if (!src_pte) {
> old_addr |= last_addr_mask;
> new_addr |= last_addr_mask;
> @@ -5091,7 +5091,7 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
> last_addr_mask = hugetlb_mask_last_page(h);
> address = start;
> for (; address < end; address += sz) {
> - ptep = huge_pte_offset(mm, address, sz);
> + ptep = hugetlb_walk(vma, address, sz);
> if (!ptep) {
> address |= last_addr_mask;
> continue;
> @@ -5404,7 +5404,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
> mutex_lock(&hugetlb_fault_mutex_table[hash]);
> hugetlb_vma_lock_read(vma);
> spin_lock(ptl);
> - ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
> + ptep = hugetlb_walk(vma, haddr, huge_page_size(h));
> if (likely(ptep &&
> pte_same(huge_ptep_get(ptep), pte)))
> goto retry_avoidcopy;
> @@ -5442,7 +5442,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
> * before the page tables are altered
> */
> spin_lock(ptl);
> - ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
> + ptep = hugetlb_walk(vma, haddr, huge_page_size(h));
> if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
> /* Break COW or unshare */
> huge_ptep_clear_flush(vma, haddr, ptep);
> @@ -6227,7 +6227,7 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> return NULL;
>
> hugetlb_vma_lock_read(vma);
> - pte = huge_pte_offset(mm, haddr, huge_page_size(h));
> + pte = hugetlb_walk(vma, haddr, huge_page_size(h));
> if (!pte)
> goto out_unlock;
>
> @@ -6292,8 +6292,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> *
> * Note that page table lock is not held when pte is null.
> */
> - pte = huge_pte_offset(mm, vaddr & huge_page_mask(h),
> - huge_page_size(h));
> + pte = hugetlb_walk(vma, vaddr & huge_page_mask(h),
> + huge_page_size(h));
> if (pte)
> ptl = huge_pte_lock(h, mm, pte);
> absent = !pte || huge_pte_none(huge_ptep_get(pte));
> @@ -6479,7 +6479,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> last_addr_mask = hugetlb_mask_last_page(h);
> for (; address < end; address += psize) {
> spinlock_t *ptl;
> - ptep = huge_pte_offset(mm, address, psize);
> + ptep = hugetlb_walk(vma, address, psize);
> if (!ptep) {
> address |= last_addr_mask;
> continue;
> @@ -6857,12 +6857,6 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> *end = ALIGN(*end, PUD_SIZE);
> }
>
> -static bool __vma_shareable_flags_pmd(struct vm_area_struct *vma)
> -{
> - return vma->vm_flags & (VM_MAYSHARE | VM_SHARED) &&
> - vma->vm_private_data;
> -}
> -
> void hugetlb_vma_lock_read(struct vm_area_struct *vma)
> {
> if (__vma_shareable_flags_pmd(vma)) {
> @@ -7028,8 +7022,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
>
> saddr = page_table_shareable(svma, vma, addr, idx);
> if (saddr) {
> - spte = huge_pte_offset(svma->vm_mm, saddr,
> - vma_mmu_pagesize(svma));
> + spte = hugetlb_walk(svma, saddr,
> + vma_mmu_pagesize(svma));
> if (spte) {
> get_page(virt_to_page(spte));
> break;
> @@ -7387,7 +7381,7 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
> hugetlb_vma_lock_write(vma);
> i_mmap_lock_write(vma->vm_file->f_mapping);
> for (address = start; address < end; address += PUD_SIZE) {
> - ptep = huge_pte_offset(mm, address, sz);
> + ptep = hugetlb_walk(vma, address, sz);
> if (!ptep)
> continue;
> ptl = huge_pte_lock(h, mm, ptep);
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index 93e13fc17d3c..e97b2e23bd28 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -170,7 +170,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> return not_found(pvmw);
>
> /* when pud is not present, pte will be NULL */
> - pvmw->pte = huge_pte_offset(mm, pvmw->address, size);
> + pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
> if (!pvmw->pte)
> return false;
>
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index d98564a7be57..cb23f8a15c13 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -305,13 +305,11 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
> hugetlb_vma_lock_read(vma);
> do {
> next = hugetlb_entry_end(h, addr, end);
> - pte = huge_pte_offset(walk->mm, addr & hmask, sz);
> -
> + pte = hugetlb_walk(vma, addr & hmask, sz);
> if (pte)
> err = ops->hugetlb_entry(pte, hmask, addr, next, walk);
> else if (ops->pte_hole)
> err = ops->pte_hole(addr, next, -1, walk);
> -
> if (err)
> break;
> } while (addr = next, addr != end);


2022-12-08 00:32:05

by John Hubbard

Subject: Re: [PATCH v2 10/10] mm/hugetlb: Document why page_vma_mapped_walk() is safe to walk

On 12/7/22 12:31, Peter Xu wrote:
> Taking vma lock here is not needed for now because all potential hugetlb
> walkers here should have i_mmap_rwsem held. Document the fact.
>
> Reviewed-by: Mike Kravetz <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> mm/page_vma_mapped.c | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index e97b2e23bd28..2e59a0419d22 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -168,8 +168,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> /* The only possible mapping was handled on last iteration */
> if (pvmw->pte)
> return not_found(pvmw);
> -
> - /* when pud is not present, pte will be NULL */
> + /*
> + * NOTE: we don't need explicit lock here to walk the
> + * hugetlb pgtable because either (1) potential callers of
> +		 * hugetlb pvmw currently hold i_mmap_rwsem, or (2) the
> + * caller will not walk a hugetlb vma (e.g. ksm or uprobe).
> + * When one day this rule breaks, one will get a warning
> + * in hugetlb_walk(), and then we'll figure out what to do.
> + */

Confused. Is this documentation actually intended to refer to hugetlb_walk()
itself, or just this call site? If the former, then let's move it over
to be right before hugetlb_walk().

> pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
> if (!pvmw->pte)
> return false;

thanks,
--
John Hubbard
NVIDIA

2022-12-08 13:45:05

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 10/10] mm/hugetlb: Document why page_vma_mapped_walk() is safe to walk

On 07.12.22 21:31, Peter Xu wrote:
> Taking vma lock here is not needed for now because all potential hugetlb
> walkers here should have i_mmap_rwsem held. Document the fact.
>
> Reviewed-by: Mike Kravetz <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> mm/page_vma_mapped.c | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index e97b2e23bd28..2e59a0419d22 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -168,8 +168,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> /* The only possible mapping was handled on last iteration */
> if (pvmw->pte)
> return not_found(pvmw);
> -
> - /* when pud is not present, pte will be NULL */
> + /*
> + * NOTE: we don't need explicit lock here to walk the
> + * hugetlb pgtable because either (1) potential callers of
> + * hugetlb pvmw currently holds i_mmap_rwsem, or (2) the
> + * caller will not walk a hugetlb vma (e.g. ksm or uprobe).
> + * When one day this rule breaks, one will get a warning
> + * in hugetlb_walk(), and then we'll figure out what to do.
> + */
> pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
> if (!pvmw->pte)
> return false;

Would it make sense to squash that into the previous commit?

--
Thanks,

David / dhildenb

2022-12-08 14:20:47

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 08/10] mm/hugetlb: Make walk_hugetlb_range() safe to pmd unshare

On 07.12.22 21:30, Peter Xu wrote:
> Since walk_hugetlb_range() walks the pgtable, it needs the vma lock
> to make sure the pgtable page will not be freed concurrently.
>
> Reviewed-by: Mike Kravetz <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> arch/s390/mm/gmap.c | 2 ++
> fs/proc/task_mmu.c | 2 ++
> include/linux/pagewalk.h | 11 ++++++++++-
> mm/hmm.c | 15 ++++++++++++++-
> mm/pagewalk.c | 2 ++
> 5 files changed, 30 insertions(+), 2 deletions(-)
>
> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> index 8947451ae021..292a54c490d4 100644
> --- a/arch/s390/mm/gmap.c
> +++ b/arch/s390/mm/gmap.c
> @@ -2643,7 +2643,9 @@ static int __s390_enable_skey_hugetlb(pte_t *pte, unsigned long addr,
> end = start + HPAGE_SIZE - 1;
> __storage_key_init_range(start, end);
> set_bit(PG_arch_1, &page->flags);
> + hugetlb_vma_unlock_read(walk->vma);
> cond_resched();
> + hugetlb_vma_lock_read(walk->vma);
> return 0;
> }
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index e35a0398db63..cf3887fb2905 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1613,7 +1613,9 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
> frame++;
> }
>
> + hugetlb_vma_unlock_read(walk->vma);
> cond_resched();
> + hugetlb_vma_lock_read(walk->vma);

We already hold the mmap_lock and reschedule. Even without the
cond_resched() we might happily reschedule on a preemptive kernel. So
I'm not sure if this optimization is strictly required or even helpful
in practice here.

In the worst case, concurrent unsharing would have to wait.
For example, s390_enable_skey() is called at most once for a VM, for
most VMs it gets never even called.

Or am I missing something important?

--
Thanks,

David / dhildenb

2022-12-08 14:22:18

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 03/10] mm/hugetlb: Document huge_pte_offset usage

On 07.12.22 21:49, John Hubbard wrote:
> On 12/7/22 12:30, Peter Xu wrote:
>> huge_pte_offset() is potentially a pgtable walker, looking up pte_t* for a
>> hugetlb address.
>>
>> Normally, it's always safe to walk a generic pgtable as long as we're with
>> the mmap lock held for either read or write, because that guarantees the
>> pgtable pages will always be valid during the process.
>>
>> But it's not true for hugetlbfs, especially shared: hugetlbfs can have its
>> pgtable freed by pmd unsharing, it means that even with mmap lock held for
>> current mm, the PMD pgtable page can still go away from under us if pmd
>> unsharing is possible during the walk.
>>
>> So we have two ways to make it safe even for a shared mapping:
>>
>> (1) If we're with the hugetlb vma lock held for either read/write, it's
>> okay because pmd unshare cannot happen at all.
>>
>> (2) If we're with the i_mmap_rwsem lock held for either read/write, it's
>> okay because even if pmd unshare can happen, the pgtable page cannot
>> be freed from under us.
>>
>> Document it.
>>
>> Signed-off-by: Peter Xu <[email protected]>
>> ---
>> include/linux/hugetlb.h | 32 ++++++++++++++++++++++++++++++++
>> 1 file changed, 32 insertions(+)
>
> Looks good, with a couple of minor wording tweaks below that you might
> consider folding in, but either way,
>
> Reviewed-by: John Hubbard <[email protected]>
>
>>
>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>> index 551834cd5299..81efd9b9baa2 100644
>> --- a/include/linux/hugetlb.h
>> +++ b/include/linux/hugetlb.h
>> @@ -192,6 +192,38 @@ extern struct list_head huge_boot_pages;
>>
>> pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>> unsigned long addr, unsigned long sz);
>> +/*
>> + * huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE.
>> + * Returns the pte_t* if found, or NULL if the address is not mapped.
>> + *
>> + * Since this function will walk all the pgtable pages (including not only
>> + * high-level pgtable page, but also PUD entry that can be unshared
>> + * concurrently for VM_SHARED), the caller of this function should be
>> + * responsible of its thread safety. One can follow this rule:
>
> "responsible for"
>
>> + *
>> + * (1) For private mappings: pmd unsharing is not possible, so it'll
>> + * always be safe if we're with the mmap sem for either read or write.
>
> mmap sem is sooo two years ago! :)
>
>> + * This is normally always the case, IOW we don't need to do anything
>
> "normally always" hurts my sense of logic. And "IOW" is for typing very quickly
> in chats or email, not for long term documentation that is written rarely
> and read many times.
>
>> + * special.
>
> So putting all that together, maybe:
>
> * (1) For private mappings: pmd unsharing is not possible, so holding the
> * mmap_lock for either read or write is sufficient. Most callers already
> * hold the mmap_lock, so normally, no special action is required.

With that,

Reviewed-by: David Hildenbrand <[email protected]>

--
Thanks,

David / dhildenb

2022-12-08 20:58:00

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v2 04/10] mm/hugetlb: Move swap entry handling into vma lock when faulted

On Wed, Dec 07, 2022 at 03:05:42PM -0800, John Hubbard wrote:
> On 12/7/22 14:43, Peter Xu wrote:
> > Note that here migration_entry_wait_huge() will release it.
> >
> > Sorry it's definitely not as straightforward, but this is also something I
> > didn't come up with a better solution, because we need the vma lock to
> > protect the spinlock, which is used in deep code path of the migration
> > code.
> >
> > That's also why I added a rich comment above, and there's "The vma lock
> > will be released there" which is just for that.
> >
>
> Yes, OK,
>
> Reviewed-by: John Hubbard <[email protected]>
>
> ...and here is some fancy documentation polishing (incremental on top of this
> specific patch) if you would like to fold it in, optional but it makes me
> happier:
>
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 49f73677a418..e3bbd4869f68 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5809,6 +5809,10 @@ u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx)
> }
> #endif
> +/*
> + * There are a few special cases in which this function returns while still
> + * holding locks. Those are noted inline.
> + */

This is not true, I think? It always releases all the locks.

> vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long address, unsigned int flags)
> {
> @@ -5851,8 +5855,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> /* PTE markers should be handled the same way as none pte */
> if (huge_pte_none_mostly(entry))
> /*
> - * hugetlb_no_page will drop vma lock and hugetlb fault
> - * mutex internally, which make us return immediately.
> + * hugetlb_no_page() will release both the vma lock and the
> + * hugetlb fault mutex, so just return directly from that:

I'm probably not gonna touch this part because it's not part of the
patch. The rest I can do.

I'll also apply the comment changes elsewhere if I don't speak up - in most
cases they all look good to me.

Thanks,

> */
> return hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
> entry, flags);
> @@ -5869,10 +5873,11 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> if (!pte_present(entry)) {
> if (unlikely(is_hugetlb_entry_migration(entry))) {
> /*
> - * Release fault lock first because the vma lock is
> - * needed to guard the huge_pte_lockptr() later in
> - * migration_entry_wait_huge(). The vma lock will
> - * be released there.
> + * Release the hugetlb fault lock now, but retain the
> + * vma lock, because it is needed to guard the
> + * huge_pte_lockptr() later in
> + * migration_entry_wait_huge(). The vma lock will be
> + * released there.
> */
> mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> migration_entry_wait_huge(vma, ptep);
> diff --git a/mm/migrate.c b/mm/migrate.c
> index d14f1f3ab073..a31df628b938 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -333,16 +333,18 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> }
> #ifdef CONFIG_HUGETLB_PAGE
> +
> +/*
> + * The vma read lock must be held upon entry. Holding that lock prevents either
> + * the pte or the ptl from being freed.
> + *
> + * This function will release the vma lock before returning.
> + */
> void __migration_entry_wait_huge(struct vm_area_struct *vma,
> pte_t *ptep, spinlock_t *ptl)
> {
> pte_t pte;
> - /*
> - * The vma read lock must be taken, which will be released before
> - * the function returns. It makes sure the pgtable page (along
> - * with its spin lock) not be freed in parallel.
> - */
> hugetlb_vma_assert_locked(vma);
> spin_lock(ptl);
>
>
> thanks,
> --
> John Hubbard
> NVIDIA
>

--
Peter Xu

2022-12-08 20:58:30

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v2 04/10] mm/hugetlb: Move swap entry handling into vma lock when faulted

On 12/8/22 12:28, Peter Xu wrote:
>> +/*
>> + * There are a few special cases in which this function returns while still
>> + * holding locks. Those are noted inline.
>> + */
>
> This is not true, I think? It always releases all the locks.

Agreed. It just looks that way at first, and I guess it was getting
late when I wrote that part. :)

thanks,
--
John Hubbard
NVIDIA

2022-12-08 21:07:33

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v2 08/10] mm/hugetlb: Make walk_hugetlb_range() safe to pmd unshare

On Thu, Dec 08, 2022 at 02:14:42PM +0100, David Hildenbrand wrote:
> On 07.12.22 21:30, Peter Xu wrote:
> > Since walk_hugetlb_range() walks the pgtable, it needs the vma lock
> > to make sure the pgtable page will not be freed concurrently.
> >
> > Reviewed-by: Mike Kravetz <[email protected]>
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > arch/s390/mm/gmap.c | 2 ++
> > fs/proc/task_mmu.c | 2 ++
> > include/linux/pagewalk.h | 11 ++++++++++-
> > mm/hmm.c | 15 ++++++++++++++-
> > mm/pagewalk.c | 2 ++
> > 5 files changed, 30 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> > index 8947451ae021..292a54c490d4 100644
> > --- a/arch/s390/mm/gmap.c
> > +++ b/arch/s390/mm/gmap.c
> > @@ -2643,7 +2643,9 @@ static int __s390_enable_skey_hugetlb(pte_t *pte, unsigned long addr,
> > end = start + HPAGE_SIZE - 1;
> > __storage_key_init_range(start, end);
> > set_bit(PG_arch_1, &page->flags);
> > + hugetlb_vma_unlock_read(walk->vma);
> > cond_resched();
> > + hugetlb_vma_lock_read(walk->vma);
> > return 0;
> > }
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index e35a0398db63..cf3887fb2905 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -1613,7 +1613,9 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
> > frame++;
> > }
> > + hugetlb_vma_unlock_read(walk->vma);
> > cond_resched();
> > + hugetlb_vma_lock_read(walk->vma);
>
> We already hold the mmap_lock and reschedule. Even without the
> cond_resched() we might happily reschedule on a preemptive kernel. So I'm
> not sure if this optimization is strictly required or even helpful in
> practice here.

It's just low hanging fruit if we need that complexity anyway.

That's also why I didn't do that for v1 (where I missed hmm special case,
though..), but I think since we'll need that anyway, we'd better release
the vma lock if we can easily do so.

mmap_lock is just more special because it needs more work in the caller to
release (e.g. vma invalidations). Otherwise I'm happy dropping that too.

>
> In the worst case, concurrent unsharing would have to wait.
> For example, s390_enable_skey() is called at most once for a VM, for most
> VMs it gets never even called.
>
> Or am I missing something important?

Nothing important. I just don't see why we need to strictly follow the
same release rule of mmap_lock here when talking about vma lock.

In short - if we can drop a lock earlier before sleep, why not?

I tend to just keep it as-is, but let me know if you have further thoughts
or concerns.

--
Peter Xu

2022-12-08 21:07:59

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v2 09/10] mm/hugetlb: Introduce hugetlb_walk()

On Wed, Dec 07, 2022 at 04:12:31PM -0800, John Hubbard wrote:
> On 12/7/22 12:31, Peter Xu wrote:
> > huge_pte_offset() is the main walker function for hugetlb pgtables. The
> > name is not really representing what it does, though.
> >
> > Instead of renaming it, introduce a wrapper function called hugetlb_walk()
> > which will use huge_pte_offset() inside. Assert on the locks when walking
> > the pgtable.
> >
> > Note, the vma lock assertion will be a no-op for private mappings.
> >
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > fs/hugetlbfs/inode.c | 4 +---
> > fs/userfaultfd.c | 6 ++----
> > include/linux/hugetlb.h | 39 +++++++++++++++++++++++++++++++++++++++
> > mm/hugetlb.c | 32 +++++++++++++-------------------
> > mm/page_vma_mapped.c | 2 +-
> > mm/pagewalk.c | 4 +---
> > 6 files changed, 57 insertions(+), 30 deletions(-)
> >
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index fdb16246f46e..48f1a8ad2243 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -388,9 +388,7 @@ static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
> > {
> > pte_t *ptep, pte;
> > - ptep = huge_pte_offset(vma->vm_mm, addr,
> > - huge_page_size(hstate_vma(vma)));
> > -
> > + ptep = hugetlb_walk(vma, addr, huge_page_size(hstate_vma(vma)));
> > if (!ptep)
> > return false;
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index a602f008dde5..f31fe1a9f4c5 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -237,14 +237,12 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> > unsigned long flags,
> > unsigned long reason)
> > {
> > - struct mm_struct *mm = ctx->mm;
> > pte_t *ptep, pte;
> > bool ret = true;
> > - mmap_assert_locked(mm);
> > -
> > - ptep = huge_pte_offset(mm, address, vma_mmu_pagesize(vma));
> > + mmap_assert_locked(ctx->mm);
> > + ptep = hugetlb_walk(vma, address, vma_mmu_pagesize(vma));
> > if (!ptep)
> > goto out;
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 81efd9b9baa2..1c20cbbf3d22 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -2,6 +2,7 @@
> > #ifndef _LINUX_HUGETLB_H
> > #define _LINUX_HUGETLB_H
> > +#include <linux/mm.h>
> > #include <linux/mm_types.h>
> > #include <linux/mmdebug.h>
> > #include <linux/fs.h>
> > @@ -196,6 +197,11 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> > * huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE.
> > * Returns the pte_t* if found, or NULL if the address is not mapped.
> > *
> > + * IMPORTANT: we should normally not directly call this function, instead
> > + * this is only a common interface to implement arch-specific walker.
> > + * Please consider using the hugetlb_walk() helper to make sure of the
> > + * correct locking is satisfied.
>
> Or:
>
> "Please use hugetlb_walk() instead, because that will attempt to verify
> the locking for you."

Ok.

>
> > + *
> > * Since this function will walk all the pgtable pages (including not only
> > * high-level pgtable page, but also PUD entry that can be unshared
> > * concurrently for VM_SHARED), the caller of this function should be
> > @@ -1229,4 +1235,37 @@ bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr);
> > #define flush_hugetlb_tlb_range(vma, addr, end) flush_tlb_range(vma, addr, end)
> > #endif
> > +static inline bool
> > +__vma_shareable_flags_pmd(struct vm_area_struct *vma)
> > +{
> > + return vma->vm_flags & (VM_MAYSHARE | VM_SHARED) &&
> > + vma->vm_private_data;
> > +}
> > +
> > +/*
> > + * Safe version of huge_pte_offset() to check the locks. See comments
> > + * above huge_pte_offset().
> > + */
>
> It is odd to say that functionA() is a safe version of functionB(), if the
> names are completely different.
>
> At this point, it is very clear that huge_pte_offset() should be renamed.
> I'd suggest something like one of these:
>
> __hugetlb_walk()
> hugetlb_walk_raw()

We can.

I haven't so far, though: partly because huge_pte_offset() has been an
arch API for years (I didn't want to touch more arch code than
necessary), and partly because hugetlb_walk(), not huge_pte_offset(),
will be the interface going forward.

Actually it's good if hugetlb_walk() is the only walker people can find
by name when they want to walk a hugetlb pgtable. :)

So totally makes sense to do so, but I don't strongly feel like doing it in
this patchset if you're okay with it.

>
> > +static inline pte_t *
> > +hugetlb_walk(struct vm_area_struct *vma, unsigned long addr, unsigned long sz)
> > +{
> > +#if defined(CONFIG_HUGETLB_PAGE) && \
> > + defined(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) && defined(CONFIG_LOCKDEP)
> > + struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
> > +
> > + /*
> > + * If pmd sharing possible, locking needed to safely walk the
> > + * hugetlb pgtables. More information can be found at the comment
> > + * above huge_pte_offset() in the same file.
> > + *
> > + * NOTE: lockdep_is_held() is only defined with CONFIG_LOCKDEP.
> > + */
> > + if (__vma_shareable_flags_pmd(vma))
> > + WARN_ON_ONCE(!lockdep_is_held(&vma_lock->rw_sema) &&
> > + !lockdep_is_held(
> > + &vma->vm_file->f_mapping->i_mmap_rwsem));
> > +#endif
> > + return huge_pte_offset(vma->vm_mm, addr, sz);
> > +}
>
> Let's please not slice up C functions with ifdefs. Instead, stick to the
> standard approach of
>
> #ifdef X
> functionC()
> {
> ...implementation
> }
> #else
> functionC()
> {
> ...simpler or shorter or stub implementation
> }

Personally I like the slicing because it clearly shows the difference
with and without the macros defined.

Thanks,

--
Peter Xu

2022-12-08 21:18:56

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v2 10/10] mm/hugetlb: Document why page_vma_mapped_walk() is safe to walk

On Wed, Dec 07, 2022 at 04:16:11PM -0800, John Hubbard wrote:
> On 12/7/22 12:31, Peter Xu wrote:
> > Taking vma lock here is not needed for now because all potential hugetlb
> > walkers here should have i_mmap_rwsem held. Document the fact.
> >
> > Reviewed-by: Mike Kravetz <[email protected]>
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > mm/page_vma_mapped.c | 10 ++++++++--
> > 1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > index e97b2e23bd28..2e59a0419d22 100644
> > --- a/mm/page_vma_mapped.c
> > +++ b/mm/page_vma_mapped.c
> > @@ -168,8 +168,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> > /* The only possible mapping was handled on last iteration */
> > if (pvmw->pte)
> > return not_found(pvmw);
> > -
> > - /* when pud is not present, pte will be NULL */
> > + /*
> > + * NOTE: we don't need explicit lock here to walk the
> > + * hugetlb pgtable because either (1) potential callers of
> > + * hugetlb pvmw currently holds i_mmap_rwsem, or (2) the
> > + * caller will not walk a hugetlb vma (e.g. ksm or uprobe).
> > + * When one day this rule breaks, one will get a warning
> > + * in hugetlb_walk(), and then we'll figure out what to do.
> > + */
>
> Confused. Is this documentation actually intended to refer to hugetlb_walk()
> itself, or just this call site? If the former, then let's move it over
> to be right before hugetlb_walk().

It is for this specific code path not hugetlb_walk().

The "holds i_mmap_rwsem" here is a true statement (not a requirement)
because PVMW rmap walkers always have that. It satisfies
hugetlb_walk()'s requirements already, even without holding the vma
lock.

--
Peter Xu

2022-12-08 21:54:22

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v2 10/10] mm/hugetlb: Document why page_vma_mapped_walk() is safe to walk

On Thu, Dec 08, 2022 at 02:16:03PM +0100, David Hildenbrand wrote:
> On 07.12.22 21:31, Peter Xu wrote:
> > Taking vma lock here is not needed for now because all potential hugetlb
> > walkers here should have i_mmap_rwsem held. Document the fact.
> >
> > Reviewed-by: Mike Kravetz <[email protected]>
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > mm/page_vma_mapped.c | 10 ++++++++--
> > 1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > index e97b2e23bd28..2e59a0419d22 100644
> > --- a/mm/page_vma_mapped.c
> > +++ b/mm/page_vma_mapped.c
> > @@ -168,8 +168,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> > /* The only possible mapping was handled on last iteration */
> > if (pvmw->pte)
> > return not_found(pvmw);
> > -
> > - /* when pud is not present, pte will be NULL */
> > + /*
> > + * NOTE: we don't need explicit lock here to walk the
> > + * hugetlb pgtable because either (1) potential callers of
> > + * hugetlb pvmw currently holds i_mmap_rwsem, or (2) the
> > + * caller will not walk a hugetlb vma (e.g. ksm or uprobe).
> > + * When one day this rule breaks, one will get a warning
> > + * in hugetlb_walk(), and then we'll figure out what to do.
> > + */
> > pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
> > if (!pvmw->pte)
> > return false;
>
> Would it make sense to squash that into the previous commit?

Sure thing.

--
Peter Xu

2022-12-08 21:56:23

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v2 09/10] mm/hugetlb: Introduce hugetlb_walk()

On 12/8/22 13:01, Peter Xu wrote:
>> At this point, it is very clear that huge_pte_offset() should be renamed.
>> I'd suggest something like one of these:
>>
>> __hugetlb_walk()
>> hugetlb_walk_raw()
>
> We can.
>
> Not only because that's an arch api for years (didn't want to touch more
> arch code unless necessary), but also since we have hugetlb_walk() that'll
> be the future interface not huge_pte_offset().
>
> Actually it's good when that's the only thing people can find from its name
> when they want to have a huge pgtable walk. :)
>
> So totally makes sense to do so, but I don't strongly feel like doing it in
> this patchset if you're okay with it.
>

Sounds good.

>>
>>> +static inline pte_t *
>>> +hugetlb_walk(struct vm_area_struct *vma, unsigned long addr, unsigned long sz)
>>> +{
>>> +#if defined(CONFIG_HUGETLB_PAGE) && \
>>> + defined(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) && defined(CONFIG_LOCKDEP)
>>> + struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
>>> +
>>> + /*
>>> + * If pmd sharing possible, locking needed to safely walk the
>>> + * hugetlb pgtables. More information can be found at the comment
>>> + * above huge_pte_offset() in the same file.
>>> + *
>>> + * NOTE: lockdep_is_held() is only defined with CONFIG_LOCKDEP.
>>> + */
>>> + if (__vma_shareable_flags_pmd(vma))
>>> + WARN_ON_ONCE(!lockdep_is_held(&vma_lock->rw_sema) &&
>>> + !lockdep_is_held(
>>> + &vma->vm_file->f_mapping->i_mmap_rwsem));
>>> +#endif
>>> + return huge_pte_offset(vma->vm_mm, addr, sz);
>>> +}
>>
>> Let's please not slice up C functions with ifdefs. Instead, stick to the
>> standard approach of
>>
>> #ifdef X
>> functionC()
>> {
>> ...implementation
>> }
>> #else
>> functionC()
>> {
>> ...simpler or shorter or stub implementation
>> }
>
> Personally I like the slicing because it clearly tells what's the
> difference with/without the macros defined.
>

Ha, I think you have a higher tolerance for complexity on the screen.
The fact that you can see more of that complexity at once is what
slows down human readers.

So when someone is working through the code, if they can look at one
config at a time, that's shorter and cleaner. This is why folks (I'm
very much not the only one) have this common pattern.

However, of course I won't insist here, as there are clearly preferences
in both directions. And the code is still small in either form in this
case so really a non-issue.

thanks,
--
John Hubbard
NVIDIA

2022-12-08 22:08:44

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v2 08/10] mm/hugetlb: Make walk_hugetlb_range() safe to pmd unshare

On Thu, Dec 08, 2022 at 03:47:26PM -0500, Peter Xu wrote:
> On Thu, Dec 08, 2022 at 02:14:42PM +0100, David Hildenbrand wrote:
> > On 07.12.22 21:30, Peter Xu wrote:
> > > Since walk_hugetlb_range() walks the pgtable, it needs the vma lock
> > > to make sure the pgtable page will not be freed concurrently.
> > >
> > > Reviewed-by: Mike Kravetz <[email protected]>
> > > Signed-off-by: Peter Xu <[email protected]>
> > > ---
> > > arch/s390/mm/gmap.c | 2 ++
> > > fs/proc/task_mmu.c | 2 ++
> > > include/linux/pagewalk.h | 11 ++++++++++-
> > > mm/hmm.c | 15 ++++++++++++++-
> > > mm/pagewalk.c | 2 ++
> > > 5 files changed, 30 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> > > index 8947451ae021..292a54c490d4 100644
> > > --- a/arch/s390/mm/gmap.c
> > > +++ b/arch/s390/mm/gmap.c
> > > @@ -2643,7 +2643,9 @@ static int __s390_enable_skey_hugetlb(pte_t *pte, unsigned long addr,
> > > end = start + HPAGE_SIZE - 1;
> > > __storage_key_init_range(start, end);
> > > set_bit(PG_arch_1, &page->flags);
> > > + hugetlb_vma_unlock_read(walk->vma);
> > > cond_resched();
> > > + hugetlb_vma_lock_read(walk->vma);
> > > return 0;
> > > }
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index e35a0398db63..cf3887fb2905 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -1613,7 +1613,9 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
> > > frame++;
> > > }
> > > + hugetlb_vma_unlock_read(walk->vma);
> > > cond_resched();
> > > + hugetlb_vma_lock_read(walk->vma);
> >
> > We already hold the mmap_lock and reschedule. Even without the
> > cond_resched() we might happily reschedule on a preemptive kernel. So I'm
> > not sure if this optimization is strictly required or even helpful in
> > practice here.
>
> It's just low hanging fruit if we need that complexity anyway.
>
> That's also why I didn't do that for v1 (where I missed hmm special case,
> though..), but I think since we'll need that anyway, we'd better release
> the vma lock if we can easily do so.
>
> mmap_lock is just more special because it needs more work in the caller to
> release (e.g. vma invalidations). Otherwise I'm happy dropping that too.
>
> >
> > In the worst case, concurrent unsharing would have to wait.
> > For example, s390_enable_skey() is called at most once for a VM, for most
> > VMs it gets never even called.
> >
> > Or am I missing something important?
>
> Nothing important. I just don't see why we need to strictly follow the
> same release rule of mmap_lock here when talking about vma lock.
>
> In short - if we can drop a lock earlier before sleep, why not?
>
> I tend to just keep it as-is, but let me know if you have further thoughts
> or concerns.

One thing I can do better here is:

- cond_resched();
+
+ if (need_resched()) {
+ hugetlb_vma_unlock_read(walk->vma);
+ cond_resched();
+ hugetlb_vma_lock_read(walk->vma);
+ }
+

It's a pity we don't have an rwsem version of cond_resched_rwlock_read(),
or this would be even cleaner.

--
Peter Xu

2022-12-08 22:35:19

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v2 10/10] mm/hugetlb: Document why page_vma_mapped_walk() is safe to walk

On 12/8/22 13:05, Peter Xu wrote:
>>> + /*
>>> + * NOTE: we don't need explicit lock here to walk the
>>> + * hugetlb pgtable because either (1) potential callers of
>>> + * hugetlb pvmw currently holds i_mmap_rwsem, or (2) the
>>> + * caller will not walk a hugetlb vma (e.g. ksm or uprobe).
>>> + * When one day this rule breaks, one will get a warning
>>> + * in hugetlb_walk(), and then we'll figure out what to do.
>>> + */
>>
>> Confused. Is this documentation actually intended to refer to hugetlb_walk()
>> itself, or just this call site? If the former, then let's move it over
>> to be right before hugetlb_walk().
>
> It is for this specific code path not hugetlb_walk().
>
> The "holds i_mmap_rwsem" here is a true statement (not requirement) because
> PVMW rmap walkers always have that. That satisfies with hugetlb_walk()
> requirements already even without holding the vma lock.
>

It's really hard to understand. Do you have a few extra words to explain it?
I can help with actual comment wording perhaps, but I am still a bit in
the dark as to the actual meaning. :)

thanks,
--
John Hubbard
NVIDIA

2022-12-08 22:36:29

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v2 10/10] mm/hugetlb: Document why page_vma_mapped_walk() is safe to walk

On Thu, Dec 08, 2022 at 01:54:27PM -0800, John Hubbard wrote:
> On 12/8/22 13:05, Peter Xu wrote:
> > > > + /*
> > > > + * NOTE: we don't need explicit lock here to walk the
> > > > + * hugetlb pgtable because either (1) potential callers of
> > > > + * hugetlb pvmw currently holds i_mmap_rwsem, or (2) the
> > > > + * caller will not walk a hugetlb vma (e.g. ksm or uprobe).
> > > > + * When one day this rule breaks, one will get a warning
> > > > + * in hugetlb_walk(), and then we'll figure out what to do.
> > > > + */
> > >
> > > Confused. Is this documentation actually intended to refer to hugetlb_walk()
> > > itself, or just this call site? If the former, then let's move it over
> > > to be right before hugetlb_walk().
> >
> > It is for this specific code path not hugetlb_walk().
> >
> > The "holds i_mmap_rwsem" here is a true statement (not requirement) because
> > PVMW rmap walkers always have that. That satisfies with hugetlb_walk()
> > requirements already even without holding the vma lock.
> >
>
> It's really hard to understand. Do you have a few extra words to explain it?
> I can help with actual comment wording perhaps, but I am still a bit in
> the dark as to the actual meaning. :)

Firstly, this patch (to be squashed into the previous one) is trying to
document why page_vma_mapped_walk() does not need to take any further
lock before calling hugetlb_walk().

To call hugetlb_walk() we need either of the locks listed below (in either
read or write mode), according to the rules we setup for it in patch 3:

(1) hugetlb vma lock
(2) i_mmap_rwsem lock

page_vma_mapped_walk() is called in below sites across the kernel:

__replace_page[179] if (!page_vma_mapped_walk(&pvmw))
__damon_pa_mkold[24] while (page_vma_mapped_walk(&pvmw)) {
__damon_pa_young[97] while (page_vma_mapped_walk(&pvmw)) {
write_protect_page[1065] if (!page_vma_mapped_walk(&pvmw))
remove_migration_pte[179] while (page_vma_mapped_walk(&pvmw)) {
page_idle_clear_pte_refs_one[56] while (page_vma_mapped_walk(&pvmw)) {
page_mapped_in_vma[318] if (!page_vma_mapped_walk(&pvmw))
folio_referenced_one[813] while (page_vma_mapped_walk(&pvmw)) {
page_vma_mkclean_one[958] while (page_vma_mapped_walk(pvmw)) {
try_to_unmap_one[1506] while (page_vma_mapped_walk(&pvmw)) {
try_to_migrate_one[1881] while (page_vma_mapped_walk(&pvmw)) {
page_make_device_exclusive_one[2205] while (page_vma_mapped_walk(&pvmw)) {

If we group them, we can see that most of them happen during an rmap walk
(i.e., they come from a higher rmap_walk() stack); they are:

__damon_pa_mkold[24] while (page_vma_mapped_walk(&pvmw)) {
__damon_pa_young[97] while (page_vma_mapped_walk(&pvmw)) {
remove_migration_pte[179] while (page_vma_mapped_walk(&pvmw)) {
page_idle_clear_pte_refs_one[56] while (page_vma_mapped_walk(&pvmw)) {
page_mapped_in_vma[318] if (!page_vma_mapped_walk(&pvmw))
folio_referenced_one[813] while (page_vma_mapped_walk(&pvmw)) {
page_vma_mkclean_one[958] while (page_vma_mapped_walk(pvmw)) {
try_to_unmap_one[1506] while (page_vma_mapped_walk(&pvmw)) {
try_to_migrate_one[1881] while (page_vma_mapped_walk(&pvmw)) {
page_make_device_exclusive_one[2205] while (page_vma_mapped_walk(&pvmw)) {

Let's call it case (A).

We have another two special cases that are not part of an rmap walk; they
are:

write_protect_page[1065] if (!page_vma_mapped_walk(&pvmw))
__replace_page[179] if (!page_vma_mapped_walk(&pvmw))

Let's call it case (B).

Case (A) is always safe because it always takes the i_mmap_rwsem lock in
read mode. That's done in rmap_walk_file(), where:

if (!locked) {
if (i_mmap_trylock_read(mapping))
goto lookup;

if (rwc->try_lock) {
rwc->contended = true;
return;
}

i_mmap_lock_read(mapping);
}

If locked==true it means the caller already holds the lock, so no need to
take it. This justifies that all callers reaching here from rmap_walk() on
a hugetlb vma are already safe to call hugetlb_walk() according to its rule.

Case (B) covers two call sites, one in the KSM path and one in the uprobe
path, and neither path (afaict) can have a hugetlb vma involved. IOW, the
branch

if (unlikely(is_vm_hugetlb_page(vma))) {

in page_vma_mapped_walk() should simply never trigger.

A condensed summary of the above will become the comment.

Hope it explains. Thanks.

--
Peter Xu

2022-12-08 23:37:01

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v2 09/10] mm/hugetlb: Introduce hugetlb_walk()

On Thu, Dec 08, 2022 at 01:50:17PM -0800, John Hubbard wrote:
> However, of course I won't insist here, as there are clearly preferences
> in both directions. And the code is still small in either form in this
> case so really a non-issue.

Thanks, John.

--
Peter Xu

2022-12-09 00:52:28

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v2 10/10] mm/hugetlb: Document why page_vma_mapped_walk() is safe to walk

On Thu, Dec 08, 2022 at 04:24:19PM -0800, John Hubbard wrote:
> It does! And now for the comment, I think you'll find that this suffices:
>
> /*
> * All callers that get here will already hold the i_mmap_rwsem.
> * Therefore, no additional locks need to be taken before
> * calling hugetlb_walk().
> */
>
> ...which, considering all the data above, is probably the mother of
> all summaries. :) But really, it's all that people need to know here, and
> it's readily understandable without wondering what KSM has to do with this,
> for example.

I'm okay with the change. :)

I think what I'll do is I'll move part of the original one into commit
message, and take the new version in the code.

Thanks,

--
Peter Xu

2022-12-09 01:23:36

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v2 10/10] mm/hugetlb: Document why page_vma_mapped_walk() is safe to walk

On 12/8/22 14:21, Peter Xu wrote:
>
> Firstly, this patch (to be squashed into the previous one) is trying to
> document why page_vma_mapped_walk() does not need to take any further lock
> before calling hugetlb_walk().
>
> To call hugetlb_walk() we need either of the locks listed below (in either
> read or write mode), according to the rules we set up for it in patch 3:
>
> (1) hugetlb vma lock
> (2) i_mmap_rwsem lock
>
> page_vma_mapped_walk() is called from the sites below across the kernel:
>
> __replace_page[179] if (!page_vma_mapped_walk(&pvmw))
> __damon_pa_mkold[24] while (page_vma_mapped_walk(&pvmw)) {
> __damon_pa_young[97] while (page_vma_mapped_walk(&pvmw)) {
> write_protect_page[1065] if (!page_vma_mapped_walk(&pvmw))
> remove_migration_pte[179] while (page_vma_mapped_walk(&pvmw)) {
> page_idle_clear_pte_refs_one[56] while (page_vma_mapped_walk(&pvmw)) {
> page_mapped_in_vma[318] if (!page_vma_mapped_walk(&pvmw))
> folio_referenced_one[813] while (page_vma_mapped_walk(&pvmw)) {
> page_vma_mkclean_one[958] while (page_vma_mapped_walk(pvmw)) {
> try_to_unmap_one[1506] while (page_vma_mapped_walk(&pvmw)) {
> try_to_migrate_one[1881] while (page_vma_mapped_walk(&pvmw)) {
> page_make_device_exclusive_one[2205] while (page_vma_mapped_walk(&pvmw)) {
>
> If we group them, we can see that most of them happen during an rmap walk
> (i.e., they come from a higher rmap_walk() stack); they are:
>
> __damon_pa_mkold[24] while (page_vma_mapped_walk(&pvmw)) {
> __damon_pa_young[97] while (page_vma_mapped_walk(&pvmw)) {
> remove_migration_pte[179] while (page_vma_mapped_walk(&pvmw)) {
> page_idle_clear_pte_refs_one[56] while (page_vma_mapped_walk(&pvmw)) {
> page_mapped_in_vma[318] if (!page_vma_mapped_walk(&pvmw))
> folio_referenced_one[813] while (page_vma_mapped_walk(&pvmw)) {
> page_vma_mkclean_one[958] while (page_vma_mapped_walk(pvmw)) {
> try_to_unmap_one[1506] while (page_vma_mapped_walk(&pvmw)) {
> try_to_migrate_one[1881] while (page_vma_mapped_walk(&pvmw)) {
> page_make_device_exclusive_one[2205] while (page_vma_mapped_walk(&pvmw)) {
>
> Let's call it case (A).
>
> We have another two special cases that are not part of an rmap walk; they
> are:
>
> write_protect_page[1065] if (!page_vma_mapped_walk(&pvmw))
> __replace_page[179] if (!page_vma_mapped_walk(&pvmw))
>
> Let's call it case (B).
>
> Case (A) is always safe because it always takes the i_mmap_rwsem lock in
> read mode. That's done in rmap_walk_file(), where:
>
> if (!locked) {
> if (i_mmap_trylock_read(mapping))
> goto lookup;
>
> if (rwc->try_lock) {
> rwc->contended = true;
> return;
> }
>
> i_mmap_lock_read(mapping);
> }
>
> If locked==true it means the caller already holds the lock, so no need to
> take it. This justifies that all callers reaching here from rmap_walk() on
> a hugetlb vma are already safe to call hugetlb_walk() according to its rule.
>
> Case (B) covers two call sites, one in the KSM path and one in the uprobe
> path, and neither path (afaict) can have a hugetlb vma involved. IOW, the
> branch
>
> if (unlikely(is_vm_hugetlb_page(vma))) {
>
> in page_vma_mapped_walk() should simply never trigger.
>
> A condensed summary of the above will become the comment.
>
> Hope it explains. Thanks.
>

It does! And now for the comment, I think you'll find that this suffices:

/*
* All callers that get here will already hold the i_mmap_rwsem.
* Therefore, no additional locks need to be taken before
* calling hugetlb_walk().
*/

...which, considering all the data above, is probably the mother of
all summaries. :) But really, it's all that people need to know here, and
it's readily understandable without wondering what KSM has to do with this,
for example.


thanks,
--
John Hubbard
NVIDIA

2022-12-09 11:01:35

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 08/10] mm/hugetlb: Make walk_hugetlb_range() safe to pmd unshare

On 08.12.22 21:47, Peter Xu wrote:
> On Thu, Dec 08, 2022 at 02:14:42PM +0100, David Hildenbrand wrote:
>> On 07.12.22 21:30, Peter Xu wrote:
>>> Since walk_hugetlb_range() walks the pgtable, it needs the vma lock
>>> to make sure the pgtable page will not be freed concurrently.
>>>
>>> Reviewed-by: Mike Kravetz <[email protected]>
>>> Signed-off-by: Peter Xu <[email protected]>
>>> ---
>>> arch/s390/mm/gmap.c | 2 ++
>>> fs/proc/task_mmu.c | 2 ++
>>> include/linux/pagewalk.h | 11 ++++++++++-
>>> mm/hmm.c | 15 ++++++++++++++-
>>> mm/pagewalk.c | 2 ++
>>> 5 files changed, 30 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
>>> index 8947451ae021..292a54c490d4 100644
>>> --- a/arch/s390/mm/gmap.c
>>> +++ b/arch/s390/mm/gmap.c
>>> @@ -2643,7 +2643,9 @@ static int __s390_enable_skey_hugetlb(pte_t *pte, unsigned long addr,
>>> end = start + HPAGE_SIZE - 1;
>>> __storage_key_init_range(start, end);
>>> set_bit(PG_arch_1, &page->flags);
>>> + hugetlb_vma_unlock_read(walk->vma);
>>> cond_resched();
>>> + hugetlb_vma_lock_read(walk->vma);
>>> return 0;
>>> }
>>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>>> index e35a0398db63..cf3887fb2905 100644
>>> --- a/fs/proc/task_mmu.c
>>> +++ b/fs/proc/task_mmu.c
>>> @@ -1613,7 +1613,9 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
>>> frame++;
>>> }
>>> + hugetlb_vma_unlock_read(walk->vma);
>>> cond_resched();
>>> + hugetlb_vma_lock_read(walk->vma);
>>
>> We already hold the mmap_lock and reschedule. Even without the
>> cond_resched() we might happily reschedule on a preemptive kernel. So I'm
>> not sure if this optimization is strictly required or even helpful in
>> practice here.
>
> It's just low hanging fruit if we need that complexity anyway.
>
> That's also why I didn't do that for v1 (where I missed hmm special case,
> though..), but I think since we'll need that anyway, we'd better release
> the vma lock if we can easily do so.
>
> mmap_lock is just more special because it needs more work in the caller to
> release (e.g. vma invalidations). Otherwise I'm happy dropping that too.
>
>>
>> In the worst case, concurrent unsharing would have to wait.
>> For example, s390_enable_skey() is called at most once for a VM, for most
>> VMs it gets never even called.
>>
>> Or am I missing something important?
>
> Nothing important. I just don't see why we need to strictly follow the
> same release rule of mmap_lock here when talking about vma lock.
>
> In short - if we can drop a lock earlier before sleep, why not?
>
> I tend to just keep it as-is, but let me know if you have further thoughts
> or concerns.

To me this looks like a possibly unnecessary optimization, requiring
additional code. IMHO, the possibly unnecessary unlock+relock makes that
code harder to follow.

For such cases, it would be good to have any evidence that it really helps.

--
Thanks,

David / dhildenb

2022-12-09 14:56:34

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v2 08/10] mm/hugetlb: Make walk_hugetlb_range() safe to pmd unshare

On Fri, Dec 09, 2022 at 11:24:55AM +0100, David Hildenbrand wrote:
> For such cases, it would be good to have any evidence that it really helps.

I don't know much about the s390 path, but if a process has a large hugetlb
vma, even MADV_DONTNEED will be blocked for an arbitrarily long time if
there's another process or thread scanning pagemap for this vma.

Would this justify a bit?

It's just that the vma lock is taken for write far more often than
mmap_lock is, I think; meanwhile all we need here is to release the lock
and retake it, nothing else. It doesn't make things overcomplicated, IMO.

No strong opinion from my side; as I said, to me it's really low-hanging
fruit. If you still think that doesn't justify it, and Mike doesn't have a
preference either, I can just drop it for later.

--
Peter Xu

2022-12-09 15:49:22

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 08/10] mm/hugetlb: Make walk_hugetlb_range() safe to pmd unshare

On 09.12.22 15:39, Peter Xu wrote:
> On Fri, Dec 09, 2022 at 11:24:55AM +0100, David Hildenbrand wrote:
>> For such cases, it would be good to have any evidence that it really helps.
>
> I don't know much on the s390 path, but if a process has a large hugetlb
> vma, even MADV_DONTNEED will be blocked for whatever long time if there's
> another process or thread scanning pagemap for this vma.
>
> Would this justify a bit?

I get your point. But that raises the question of whether we should
voluntarily drop the VMA lock in the caller every now and then on such
large VMAs, and maybe even move the cond_resched() into the common page
walker, if you get what I mean?

On a preemptible kernel you could reschedule just before you drop the
lock and call cond_resched() ... hmm

No strong opinion here, it just looked a bit weird to optimize for a
cond_resched() if we might just reschedule either way even without the
cond_resched().

--
Thanks,

David / dhildenb

2022-12-09 17:32:15

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v2 08/10] mm/hugetlb: Make walk_hugetlb_range() safe to pmd unshare

On Fri, Dec 09, 2022 at 04:18:11PM +0100, David Hildenbrand wrote:
> On 09.12.22 15:39, Peter Xu wrote:
> > On Fri, Dec 09, 2022 at 11:24:55AM +0100, David Hildenbrand wrote:
> > > For such cases, it would be good to have any evidence that it really helps.
> >
> > I don't know much on the s390 path, but if a process has a large hugetlb
> > vma, even MADV_DONTNEED will be blocked for whatever long time if there's
> > another process or thread scanning pagemap for this vma.
> >
> > Would this justify a bit?
>
> I get your point. But that raises the question of whether we should
> voluntarily drop the VMA lock in the caller every now and then on such large
> VMAs, and maybe even move the cond_resched() into the common page walker, if
> you get what I mean?
>
> On a preemptible kernel you could reschedule just before you drop the lock
> and call cond_resched() ... hmm

On full preempt the old cond_resched() doesn't work anyway. IIUC most
distros should now be using dynamic preempt, which defaults to voluntary
mode, and this change should help there.

>
> No strong opinion here, it just looked a bit weird to optimize for a
> cond_resched() if we might just reschedule either way even without the
> cond_resched().

It's really not the core of the patchset. Let me drop it for now.

--
Peter Xu