2022-11-29 19:49:17

by Peter Xu

Subject: [PATCH 00/10] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare

Based on latest mm-unstable (9ed079378408).

This can be seen as a follow-up series to Mike's recent hugetlb vma lock
series for pmd unsharing, mainly covering safe use of huge_pte_offset().

Compared to the previous rfcv2, the major change is that I dropped the new
pgtable lock and only use the vma lock for locking.  The major reason is
that I overlooked that the pgtable lock is not protected by RCU:
__pmd_free_tlb() frees the pgtable lock before e.g. call_rcu() on
RCU_TABLE_FREE archs.  OTOH, many of the huge_pte_offset() call sites do
need to take the pgtable lock, which means the valid users for the new RCU
lock would be very limited.

It's possible that in the future we can rework the pgtable free path to
only free the pgtable lock after the RCU grace period (move
pgtable_pmd_page_dtor() to be within tlb_remove_table_rcu()); then the RCU
lock would make more sense.  For now, keep it simple by fixing the races
first.

Since this version attaches a reproducer (below) and removes the RCU
(especially the fallback irq-off) solution, the RFC tag is dropped.

Old versions:

rfcv1: https://lore.kernel.org/r/[email protected]
rfcv2: https://lore.kernel.org/r/[email protected]

Problem
=======

huge_pte_offset() is a major helper used by hugetlb code paths to walk a
hugetlb pgtable.  It's used almost everywhere, since it's needed even
before taking the pgtable lock.
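
For reference, the generic version of the walker looks roughly like below
(arch-specific implementations differ; this is a simplified sketch only,
to show that the walk dereferences into pgtable pages with no locking of
its own):

	pte_t *huge_pte_offset(struct mm_struct *mm,
			       unsigned long addr, unsigned long sz)
	{
		pgd_t *pgd = pgd_offset(mm, addr);
		p4d_t *p4d;
		pud_t *pud;

		if (!pgd_present(*pgd))
			return NULL;
		p4d = p4d_offset(pgd, addr);
		if (!p4d_present(*p4d))
			return NULL;
		pud = pud_offset(p4d, addr);
		if (sz == PUD_SIZE)
			/* must be pud huge, non-present or none */
			return (pte_t *)pud;
		if (!pud_present(*pud))
			return NULL;
		/* must be pmd huge, non-present or none */
		return (pte_t *)pmd_offset(pud, addr);
	}

Note that the returned pointer points into a pmd (or pud) pgtable page,
which for VM_SHARED hugetlb may be shared with other processes via pmd
sharing.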

huge_pte_offset() is always called with the mmap lock held, either for
read or for write.  It was assumed to be safe, but it's actually not.  One
race condition can easily be triggered by: (1) first trigger pmd sharing on
a memory range, (2) do huge_pte_offset() on the range, while in the
meantime (3) another thread unshares the pmd range, after which the pgtable
page is prone to be lost if the other sharing process frees it completely
(by either munmap or exiting the mm).
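
Schematically, with process #A (threads #A1 and #A2) sharing a pmd pgtable
page with process #B, one bad interleaving looks like this (a sketch):

    #A1 (pgtable walker)          #A2 / #B
    --------------------          --------
    pte = huge_pte_offset()
      (pte points into the
       shared pgtable page)
                                  #A2: huge_pmd_unshare() drops #A's ref
                                  #B:  munmap()/exit drops the last ref,
                                       so the pgtable page is freed
    huge_pte_lock()
      -> use-after-free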

The recent work from Mike on the vma lock can resolve most of this already.
It's achieved by forbidding pmd unsharing while the lock is taken, so
there's no further risk of the pgtable page being freed.  It means that if
we can take the vma lock around all huge_pte_offset() callers, it'll be
safe.

A bunch of callers already do so as of the latest mm-unstable, but quite a
few others still don't, for various reasons, especially around how
huge_pte_offset() is used.

One more thing to mention is that besides the vma lock, i_mmap_rwsem can
also be used to protect the pgtable page (along with its pgtable lock) from
being freed from under us.  IOW, huge_pte_offset() callers need to hold
either the vma lock or i_mmap_rwsem to safely walk the pgtables.
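
In pseudo-C, the safe pattern that the series converges on is (a sketch
only; error handling omitted, and the vma lock below can be replaced by
i_mmap_rwsem):

	hugetlb_vma_lock_read(vma);	/* blocks pmd unshare */
	pte = huge_pte_offset(mm, addr, huge_page_size(h));
	if (pte) {
		ptl = huge_pte_lock(h, mm, pte);
		/* ... read or modify the entry safely ... */
		spin_unlock(ptl);
	}
	hugetlb_vma_unlock_read(vma);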

A reproducer of the problem, based on hugetlb GUP (NOTE: since the race is
very hard to hit, one needs to apply an extra kernel delay patch too; see
below):

======8<=======
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <linux/memfd.h>
#include <assert.h>
#include <pthread.h>

#define MSIZE (1UL << 30) /* 1GB */
#define PSIZE (2UL << 20) /* 2MB */

#define HOLD_SEC (1)

int pipefd[2];
void *buf;

void *do_map(int fd)
{
	unsigned char *tmpbuf, *p;
	int ret;

	/* Reserve a MSIZE-aligned region, then map the hugetlb fd over it */
	ret = posix_memalign((void **)&tmpbuf, MSIZE, MSIZE);
	if (ret) {
		perror("posix_memalign() failed");
		return NULL;
	}

	tmpbuf = mmap(tmpbuf, MSIZE, PROT_READ | PROT_WRITE,
		      MAP_SHARED | MAP_FIXED, fd, 0);
	if (tmpbuf == MAP_FAILED) {
		perror("mmap() failed");
		return NULL;
	}
	printf("mmap() -> %p\n", tmpbuf);

	/* Touch one byte in every 2M page to fault them all in */
	for (p = tmpbuf; p < tmpbuf + MSIZE; p += PSIZE) {
		*p = 1;
	}

	return tmpbuf;
}

void do_unmap(void *buf)
{
	munmap(buf, MSIZE);
}

void proc2(int fd)
{
	unsigned char c;

	buf = do_map(fd);
	if (!buf)
		return;

	read(pipefd[0], &c, 1);
	/*
	 * This frees the shared pgtable page, causing use-after-free in
	 * proc1_thread1 when soft walking hugetlb pgtable.
	 */
	do_unmap(buf);

	printf("Proc2 quitting\n");
}

void *proc1_thread1(void *data)
{
	/*
	 * Trigger follow-page on the 1st 2M page.  The kernel hack patch is
	 * needed to withhold this procedure for easier reproduction.
	 */
	madvise(buf, PSIZE, MADV_POPULATE_WRITE);
	printf("Proc1-thread1 quitting\n");
	return NULL;
}

void *proc1_thread2(void *data)
{
	unsigned char c = 0;

	/* Wait a while (~0.5s) until proc1_thread1() starts to wait */
	usleep(500 * 1000);
	/* Trigger pmd unshare */
	madvise(buf, PSIZE, MADV_DONTNEED);
	/* Kick off proc2 to release the pgtable */
	write(pipefd[1], &c, 1);

	printf("Proc1-thread2 quitting\n");
	return NULL;
}

void proc1(int fd)
{
	pthread_t tid1, tid2;
	int ret;

	buf = do_map(fd);
	if (!buf)
		return;

	ret = pthread_create(&tid1, NULL, proc1_thread1, NULL);
	assert(ret == 0);
	ret = pthread_create(&tid2, NULL, proc1_thread2, NULL);
	assert(ret == 0);

	/* Wait for both threads to finish */
	pthread_join(tid1, NULL);
	pthread_join(tid2, NULL);

	do_unmap(buf);
}

int main(void)
{
	int fd, ret;

	fd = memfd_create("test-huge", MFD_HUGETLB | MFD_HUGE_2MB);
	if (fd < 0) {
		perror("memfd_create() failed");
		return -1;
	}

	ret = ftruncate(fd, MSIZE);
	if (ret) {
		perror("ftruncate() failed");
		return -1;
	}

	ret = pipe(pipefd);
	if (ret) {
		perror("pipe() failed");
		return -1;
	}

	if (fork()) {
		proc1(fd);
	} else {
		proc2(fd);
	}

	close(pipefd[0]);
	close(pipefd[1]);
	close(fd);

	return 0;
}
======8<=======

The kernel delay patch needed to widen the race window so that it triggers
100% of the time:

======8<=======
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9d97c9a2a15d..f8d99dad5004 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -38,6 +38,7 @@
 #include <asm/page.h>
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
+#include <asm/delay.h>
 
 #include <linux/io.h>
 #include <linux/hugetlb.h>
@@ -6290,6 +6291,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		bool unshare = false;
 		int absent;
 		struct page *page;
+		unsigned long c = 0;
 
 		/*
 		 * If we have a pending SIGKILL, don't keep faulting pages and
@@ -6309,6 +6311,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 */
 		pte = huge_pte_offset(mm, vaddr & huge_page_mask(h),
 				      huge_page_size(h));
+
+		pr_info("%s: withhold 1 sec...\n", __func__);
+		for (c = 0; c < 100; c++) {
+			udelay(10000);
+		}
+		pr_info("%s: withhold 1 sec...done\n", __func__);
+
 		if (pte)
 			ptl = huge_pte_lock(h, mm, pte);
 		absent = !pte || huge_pte_none(huge_ptep_get(pte));
======8<=======

It'll trigger use-after-free of the pgtable spinlock:

======8<=======
[ 16.959907] follow_hugetlb_page: withhold 1 sec...
[ 17.960315] follow_hugetlb_page: withhold 1 sec...done
[ 17.960550] ------------[ cut here ]------------
[ 17.960742] DEBUG_LOCKS_WARN_ON(1)
[ 17.960756] WARNING: CPU: 3 PID: 542 at kernel/locking/lockdep.c:231 __lock_acquire+0x955/0x1fa0
[ 17.961264] Modules linked in:
[ 17.961394] CPU: 3 PID: 542 Comm: hugetlb-pmd-sha Not tainted 6.1.0-rc4-peterx+ #46
[ 17.961704] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 17.962266] RIP: 0010:__lock_acquire+0x955/0x1fa0
[ 17.962516] Code: c0 0f 84 5f fe ff ff 44 8b 1d 0f 9a 29 02 45 85 db 0f 85 4f fe ff ff 48 c7 c6 75 50 83 82 48 c7 c7 1b 4b 7d 82 e8 d3 22 d8 00 <0f> 0b 31 c0 4c 8b 54 24 08 4c 8b 04 24 e9
[ 17.963494] RSP: 0018:ffffc90000e4fba8 EFLAGS: 00010096
[ 17.963704] RAX: 0000000000000016 RBX: fffffffffd3925a8 RCX: 0000000000000000
[ 17.963989] RDX: 0000000000000002 RSI: ffffffff82863ccf RDI: 00000000ffffffff
[ 17.964276] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffc90000e4fa58
[ 17.964557] R10: 0000000000000003 R11: ffffffff83162688 R12: 0000000000000000
[ 17.964839] R13: 0000000000000001 R14: ffff888105eac748 R15: 0000000000000001
[ 17.965123] FS: 00007f17c0a00640(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000
[ 17.965443] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 17.965672] CR2: 00007f17c09ffef8 CR3: 000000010c87a005 CR4: 0000000000770ee0
[ 17.965956] PKRU: 55555554
[ 17.966068] Call Trace:
[ 17.966172] <TASK>
[ 17.966268] ? tick_nohz_tick_stopped+0x12/0x30
[ 17.966455] lock_acquire+0xbf/0x2b0
[ 17.966603] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.966799] ? _printk+0x48/0x4e
[ 17.966934] _raw_spin_lock+0x2f/0x40
[ 17.967087] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.967285] follow_hugetlb_page.cold+0x75/0x5c4
[ 17.967473] __get_user_pages+0xbb/0x620
[ 17.967635] faultin_vma_page_range+0x9a/0x100
[ 17.967817] madvise_vma_behavior+0x3c0/0xbd0
[ 17.967998] ? mas_prev+0x11/0x290
[ 17.968141] ? find_vma_prev+0x5e/0xa0
[ 17.968304] ? madvise_vma_anon_name+0x70/0x70
[ 17.968486] madvise_walk_vmas+0xa9/0x120
[ 17.968650] do_madvise.part.0+0xfa/0x270
[ 17.968813] __x64_sys_madvise+0x5a/0x70
[ 17.968974] do_syscall_64+0x37/0x90
[ 17.969123] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 17.969329] RIP: 0033:0x7f1840f0efdb
[ 17.969477] Code: c3 66 0f 1f 44 00 00 48 8b 15 39 6e 0e 00 f7 d8 64 89 02 b8 ff ff ff ff eb bc 0f 1f 44 00 00 f3 0f 1e fa b8 1c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 0d 68
[ 17.970205] RSP: 002b:00007f17c09ffe38 EFLAGS: 00000202 ORIG_RAX: 000000000000001c
[ 17.970504] RAX: ffffffffffffffda RBX: 00007f17c0a00640 RCX: 00007f1840f0efdb
[ 17.970786] RDX: 0000000000000017 RSI: 0000000000200000 RDI: 00007f1800000000
[ 17.971068] RBP: 00007f17c09ffe50 R08: 0000000000000000 R09: 00007ffd3954164f
[ 17.971353] R10: 00007f1840e10348 R11: 0000000000000202 R12: ffffffffffffff80
[ 17.971709] R13: 0000000000000000 R14: 00007ffd39541550 R15: 00007f17c0200000
[ 17.972083] </TASK>
[ 17.972199] irq event stamp: 2353
[ 17.972372] hardirqs last enabled at (2353): [<ffffffff8117fe4e>] __up_console_sem+0x5e/0x70
[ 17.972869] hardirqs last disabled at (2352): [<ffffffff8117fe33>] __up_console_sem+0x43/0x70
[ 17.973365] softirqs last enabled at (2330): [<ffffffff810f763d>] __irq_exit_rcu+0xed/0x160
[ 17.973857] softirqs last disabled at (2323): [<ffffffff810f763d>] __irq_exit_rcu+0xed/0x160
[ 17.974341] ---[ end trace 0000000000000000 ]---
[ 17.974614] BUG: kernel NULL pointer dereference, address: 00000000000000b8
[ 17.975012] #PF: supervisor read access in kernel mode
[ 17.975314] #PF: error_code(0x0000) - not-present page
[ 17.975615] PGD 103f7b067 P4D 103f7b067 PUD 106cd7067 PMD 0
[ 17.975943] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 17.976197] CPU: 3 PID: 542 Comm: hugetlb-pmd-sha Tainted: G W 6.1.0-rc4-peterx+ #46
[ 17.976712] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 17.977370] RIP: 0010:__lock_acquire+0x190/0x1fa0
[ 17.977655] Code: 98 00 00 00 41 89 46 24 81 e2 ff 1f 00 00 48 0f a3 15 e4 ba dd 02 0f 83 ff 05 00 00 48 8d 04 52 48 c1 e0 06 48 05 c0 d2 f4 83 <44> 0f b6 a0 b8 00 00 00 41 0f b7 46 20 6f
[ 17.979170] RSP: 0018:ffffc90000e4fba8 EFLAGS: 00010046
[ 17.979787] RAX: 0000000000000000 RBX: fffffffffd3925a8 RCX: 0000000000000000
[ 17.980838] RDX: 0000000000000002 RSI: ffffffff82863ccf RDI: 00000000ffffffff
[ 17.982048] RBP: 0000000000000000 R08: ffff888105eac720 R09: ffffc90000e4fa58
[ 17.982892] R10: ffff888105eab900 R11: ffffffff83162688 R12: 0000000000000000
[ 17.983771] R13: 0000000000000001 R14: ffff888105eac748 R15: 0000000000000001
[ 17.984815] FS: 00007f17c0a00640(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000
[ 17.985924] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 17.986265] CR2: 00000000000000b8 CR3: 000000010c87a005 CR4: 0000000000770ee0
[ 17.986674] PKRU: 55555554
[ 17.986832] Call Trace:
[ 17.987012] <TASK>
[ 17.987266] ? tick_nohz_tick_stopped+0x12/0x30
[ 17.987770] lock_acquire+0xbf/0x2b0
[ 17.988118] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.988575] ? _printk+0x48/0x4e
[ 17.988889] _raw_spin_lock+0x2f/0x40
[ 17.989243] ? follow_hugetlb_page.cold+0x75/0x5c4
[ 17.989687] follow_hugetlb_page.cold+0x75/0x5c4
[ 17.990119] __get_user_pages+0xbb/0x620
[ 17.990500] faultin_vma_page_range+0x9a/0x100
[ 17.990928] madvise_vma_behavior+0x3c0/0xbd0
[ 17.991354] ? mas_prev+0x11/0x290
[ 17.991678] ? find_vma_prev+0x5e/0xa0
[ 17.992024] ? madvise_vma_anon_name+0x70/0x70
[ 17.992421] madvise_walk_vmas+0xa9/0x120
[ 17.992793] do_madvise.part.0+0xfa/0x270
[ 17.993166] __x64_sys_madvise+0x5a/0x70
[ 17.993539] do_syscall_64+0x37/0x90
[ 17.993879] entry_SYSCALL_64_after_hwframe+0x63/0xcd
======8<=======

Resolution
==========

This patchset fixes the problem by making all the huge_pte_offset() callers
take the vma lock properly.

Patch Layout
============

Patch 1-2: cleanups, or dependencies of the follow-up patches
Patch 3: before fixing, document the locking requirements of huge_pte_offset()
Patch 4-9: each patch resolves one possible race condition
Patch 10: introduce hugetlb_walk() to replace huge_pte_offset()

Tests
=====

The series is verified with the above reproducer: the race can no longer be
triggered.  It also passes all hugetlb kselftests.

Comments welcome, thanks.

Peter Xu (10):
mm/hugetlb: Let vma_offset_start() to return start
mm/hugetlb: Don't wait for migration entry during follow page
mm/hugetlb: Document huge_pte_offset usage
mm/hugetlb: Move swap entry handling into vma lock when faulted
mm/hugetlb: Make userfaultfd_huge_must_wait() safe to pmd unshare
mm/hugetlb: Make hugetlb_follow_page_mask() safe to pmd unshare
mm/hugetlb: Make follow_hugetlb_page() safe to pmd unshare
mm/hugetlb: Make walk_hugetlb_range() safe to pmd unshare
mm/hugetlb: Make page_vma_mapped_walk() safe to pmd unshare
mm/hugetlb: Introduce hugetlb_walk()

fs/hugetlbfs/inode.c | 28 ++++++-------
fs/userfaultfd.c | 24 +++++++----
include/linux/hugetlb.h | 69 ++++++++++++++++++++++++++++++++
include/linux/rmap.h | 4 ++
include/linux/swapops.h | 6 ++-
mm/hugetlb.c | 89 ++++++++++++++++++-----------------------
mm/migrate.c | 25 ++++++++++--
mm/page_vma_mapped.c | 7 +++-
mm/pagewalk.c | 6 +--
9 files changed, 175 insertions(+), 83 deletions(-)

--
2.37.3


2022-11-29 19:49:18

by Peter Xu

Subject: [PATCH 01/10] mm/hugetlb: Let vma_offset_start() to return start

Even though vma_offset_start() is named like that, it's not returning "the
start address of the range" but rather the offset we should add to
vma->vm_start.

Make it return the real start vaddr instead.  This also helps all the
callers, because whenever the retval is used, it's ultimately added to
vma->vm_start anyway, so it's better.
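
As a concrete example (illustrative numbers only): with
vma->vm_start == 0x40000000, vma->vm_pgoff == 10 and start == 12,

	old retval: (12 - 10) << PAGE_SHIFT    == 0x2000
	new retval: vma->vm_start + 0x2000     == 0x40002000

i.e. the new retval is exactly what every caller used to compute by hand.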

Reviewed-by: Mike Kravetz <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
fs/hugetlbfs/inode.c | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 790d2727141a..fdb16246f46e 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -412,10 +412,12 @@ static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
*/
static unsigned long vma_offset_start(struct vm_area_struct *vma, pgoff_t start)
{
+ unsigned long offset = 0;
+
if (vma->vm_pgoff < start)
- return (start - vma->vm_pgoff) << PAGE_SHIFT;
- else
- return 0;
+ offset = (start - vma->vm_pgoff) << PAGE_SHIFT;
+
+ return vma->vm_start + offset;
}

static unsigned long vma_offset_end(struct vm_area_struct *vma, pgoff_t end)
@@ -457,7 +459,7 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
v_start = vma_offset_start(vma, start);
v_end = vma_offset_end(vma, end);

- if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
+ if (!hugetlb_vma_maps_page(vma, v_start, page))
continue;

if (!hugetlb_vma_trylock_write(vma)) {
@@ -473,8 +475,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
break;
}

- unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
- NULL, ZAP_FLAG_DROP_MARKER);
+ unmap_hugepage_range(vma, v_start, v_end, NULL,
+ ZAP_FLAG_DROP_MARKER);
hugetlb_vma_unlock_write(vma);
}

@@ -507,10 +509,9 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
*/
v_start = vma_offset_start(vma, start);
v_end = vma_offset_end(vma, end);
- if (hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
- unmap_hugepage_range(vma, vma->vm_start + v_start,
- v_end, NULL,
- ZAP_FLAG_DROP_MARKER);
+ if (hugetlb_vma_maps_page(vma, v_start, page))
+ unmap_hugepage_range(vma, v_start, v_end, NULL,
+ ZAP_FLAG_DROP_MARKER);

kref_put(&vma_lock->refs, hugetlb_vma_lock_release);
hugetlb_vma_unlock_write(vma);
@@ -540,8 +541,7 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
v_start = vma_offset_start(vma, start);
v_end = vma_offset_end(vma, end);

- unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
- NULL, zap_flags);
+ unmap_hugepage_range(vma, v_start, v_end, NULL, zap_flags);

/*
* Note that vma lock only exists for shared/non-private
--
2.37.3

2022-11-29 19:49:25

by Peter Xu

Subject: [PATCH 02/10] mm/hugetlb: Don't wait for migration entry during follow page

That's what the code does with !hugetlb pages, so we should logically do
the same for hugetlb, so that a migration entry will also be treated as no
page.

This is probably also the last piece of follow_page code that may sleep;
the previous one was removed in cf994dd8af27 ("mm/gup: remove
FOLL_MIGRATION", 2022-11-16).

Signed-off-by: Peter Xu <[email protected]>
---
mm/hugetlb.c | 11 -----------
1 file changed, 11 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9d97c9a2a15d..dfe677fadaf8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6234,7 +6234,6 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
if (WARN_ON_ONCE(flags & FOLL_PIN))
return NULL;

-retry:
pte = huge_pte_offset(mm, haddr, huge_page_size(h));
if (!pte)
return NULL;
@@ -6257,16 +6256,6 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
page = NULL;
goto out;
}
- } else {
- if (is_hugetlb_entry_migration(entry)) {
- spin_unlock(ptl);
- __migration_entry_wait_huge(pte, ptl);
- goto retry;
- }
- /*
- * hwpoisoned entry is treated as no_page_table in
- * follow_page_mask().
- */
}
out:
spin_unlock(ptl);
--
2.37.3

2022-11-29 19:49:31

by Peter Xu

Subject: [PATCH 04/10] mm/hugetlb: Move swap entry handling into vma lock when faulted

In hugetlb_fault(), there used to be a special path to handle swap entries
at the entrance using huge_pte_offset().  That's unsafe because
huge_pte_offset() for a pmd-sharable range can access freed pgtables if
there's no lock to protect the pgtable from being freed after pmd unshare.

Here the simplest solution to make it safe is to move the swap handling to
be after the vma lock is held.  We may need to take the fault mutex on
either migration or hwpoison entries now (and also the vma lock, but that
one is really needed); however, neither of them is a hot path.

Note that the vma lock cannot be released in hugetlb_fault() when the
migration entry is detected, because in migration_entry_wait_huge() the
pgtable page will be used again (by taking the pgtable lock), so that also
needs to be protected by the vma lock.  Modify migration_entry_wait_huge()
so that it must be called with the vma read lock held, and properly release
the lock in __migration_entry_wait_huge().
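
As a sketch, the resulting flow in the migration-entry path becomes
(simplified; see the diff below for the real thing):

	hugetlb_fault()
		mutex_lock(&hugetlb_fault_mutex_table[hash]);
		hugetlb_vma_lock_read(vma);
		ptep = huge_pte_alloc(...);
		entry = huge_ptep_get(ptep);
		if (!pte_present(entry) && is_hugetlb_entry_migration(entry)) {
			/* Drop the fault mutex first; keep the vma lock */
			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
			/* Takes the pgtable lock, releases the vma lock */
			migration_entry_wait_huge(vma, ptep);
			return 0;
		}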

Signed-off-by: Peter Xu <[email protected]>
---
include/linux/swapops.h | 6 ++++--
mm/hugetlb.c | 32 +++++++++++++++-----------------
mm/migrate.c | 25 +++++++++++++++++++++----
3 files changed, 40 insertions(+), 23 deletions(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 27ade4f22abb..09b22b169a71 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -335,7 +335,8 @@ extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
unsigned long address);
#ifdef CONFIG_HUGETLB_PAGE
-extern void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl);
+extern void __migration_entry_wait_huge(struct vm_area_struct *vma,
+ pte_t *ptep, spinlock_t *ptl);
extern void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte);
#endif /* CONFIG_HUGETLB_PAGE */
#else /* CONFIG_MIGRATION */
@@ -364,7 +365,8 @@ static inline void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
unsigned long address) { }
#ifdef CONFIG_HUGETLB_PAGE
-static inline void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl) { }
+static inline void __migration_entry_wait_huge(struct vm_area_struct *vma,
+ pte_t *ptep, spinlock_t *ptl) { }
static inline void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte) { }
#endif /* CONFIG_HUGETLB_PAGE */
static inline int is_writable_migration_entry(swp_entry_t entry)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dfe677fadaf8..776e34ccf029 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5826,22 +5826,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
int need_wait_lock = 0;
unsigned long haddr = address & huge_page_mask(h);

- ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
- if (ptep) {
- /*
- * Since we hold no locks, ptep could be stale. That is
- * OK as we are only making decisions based on content and
- * not actually modifying content here.
- */
- entry = huge_ptep_get(ptep);
- if (unlikely(is_hugetlb_entry_migration(entry))) {
- migration_entry_wait_huge(vma, ptep);
- return 0;
- } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
- return VM_FAULT_HWPOISON_LARGE |
- VM_FAULT_SET_HINDEX(hstate_index(h));
- }
-
/*
* Serialize hugepage allocation and instantiation, so that we don't
* get spurious allocation failures if two CPUs race to instantiate
@@ -5888,8 +5872,22 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* fault, and is_hugetlb_entry_(migration|hwpoisoned) check will
* properly handle it.
*/
- if (!pte_present(entry))
+ if (!pte_present(entry)) {
+ if (unlikely(is_hugetlb_entry_migration(entry))) {
+ /*
+ * Release fault lock first because the vma lock is
+ * needed to guard the huge_pte_lockptr() later in
+ * migration_entry_wait_huge(). The vma lock will
+ * be released there.
+ */
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ migration_entry_wait_huge(vma, ptep);
+ return 0;
+ } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
+ ret = VM_FAULT_HWPOISON_LARGE |
+ VM_FAULT_SET_HINDEX(hstate_index(h));
goto out_mutex;
+ }

/*
* If we are going to COW/unshare the mapping later, we examine the
diff --git a/mm/migrate.c b/mm/migrate.c
index 267ad0d073ae..c13c828d34f3 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -326,24 +326,41 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
}

#ifdef CONFIG_HUGETLB_PAGE
-void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl)
+void __migration_entry_wait_huge(struct vm_area_struct *vma,
+ pte_t *ptep, spinlock_t *ptl)
{
pte_t pte;

+ /*
+ * The vma read lock must be taken, which will be released before
+ * the function returns. It makes sure the pgtable page (along
+ * with its spin lock) will not be freed in parallel.
+ */
+ hugetlb_vma_assert_locked(vma);
+
spin_lock(ptl);
pte = huge_ptep_get(ptep);

- if (unlikely(!is_hugetlb_entry_migration(pte)))
+ if (unlikely(!is_hugetlb_entry_migration(pte))) {
spin_unlock(ptl);
- else
+ hugetlb_vma_unlock_read(vma);
+ } else {
+ /*
+ * If migration entry existed, safe to release vma lock
+ * here because the pgtable page won't be freed without the
+ * pgtable lock released. See comment right above pgtable
+ * lock release in migration_entry_wait_on_locked().
+ */
+ hugetlb_vma_unlock_read(vma);
migration_entry_wait_on_locked(pte_to_swp_entry(pte), NULL, ptl);
+ }
}

void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte)
{
spinlock_t *ptl = huge_pte_lockptr(hstate_vma(vma), vma->vm_mm, pte);

- __migration_entry_wait_huge(pte, ptl);
+ __migration_entry_wait_huge(vma, pte, ptl);
}
#endif

--
2.37.3

2022-11-29 19:49:51

by Peter Xu

Subject: [PATCH 06/10] mm/hugetlb: Make hugetlb_follow_page_mask() safe to pmd unshare

Since hugetlb_follow_page_mask() walks the pgtable, it needs the vma lock
to make sure the pgtable page will not be freed concurrently.

Signed-off-by: Peter Xu <[email protected]>
---
mm/hugetlb.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 776e34ccf029..d6bb1d22f1c4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6232,9 +6232,10 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
if (WARN_ON_ONCE(flags & FOLL_PIN))
return NULL;

+ hugetlb_vma_lock_read(vma);
pte = huge_pte_offset(mm, haddr, huge_page_size(h));
if (!pte)
- return NULL;
+ goto out_unlock;

ptl = huge_pte_lock(h, mm, pte);
entry = huge_ptep_get(pte);
@@ -6257,6 +6258,8 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
}
out:
spin_unlock(ptl);
+out_unlock:
+ hugetlb_vma_unlock_read(vma);
return page;
}

--
2.37.3

2022-11-29 19:50:00

by Peter Xu

Subject: [PATCH 10/10] mm/hugetlb: Introduce hugetlb_walk()

huge_pte_offset() is the main walker function for hugetlb pgtables.  The
name doesn't really represent what it does, though.

Instead of renaming it, introduce a wrapper function called hugetlb_walk()
which uses huge_pte_offset() inside, and assert on the locks when walking
the pgtable.

Note, the vma lock assertion will be a no-op for private mappings.
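
The conversion at the call sites is mechanical; schematically:

	/* Before: the caller's locking is not verified */
	pte = huge_pte_offset(vma->vm_mm, addr, huge_page_size(h));

	/* After: warns (with lockdep) unless the vma lock or i_mmap_rwsem is held */
	pte = hugetlb_walk(vma, addr, huge_page_size(h));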

Signed-off-by: Peter Xu <[email protected]>
---
fs/hugetlbfs/inode.c | 4 +---
fs/userfaultfd.c | 6 ++----
include/linux/hugetlb.h | 37 +++++++++++++++++++++++++++++++++++++
mm/hugetlb.c | 34 ++++++++++++++--------------------
mm/page_vma_mapped.c | 2 +-
mm/pagewalk.c | 4 +---
6 files changed, 56 insertions(+), 31 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index fdb16246f46e..48f1a8ad2243 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -388,9 +388,7 @@ static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
{
pte_t *ptep, pte;

- ptep = huge_pte_offset(vma->vm_mm, addr,
- huge_page_size(hstate_vma(vma)));
-
+ ptep = hugetlb_walk(vma, addr, huge_page_size(hstate_vma(vma)));
if (!ptep)
return false;

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index a602f008dde5..f31fe1a9f4c5 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -237,14 +237,12 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
unsigned long flags,
unsigned long reason)
{
- struct mm_struct *mm = ctx->mm;
pte_t *ptep, pte;
bool ret = true;

- mmap_assert_locked(mm);
-
- ptep = huge_pte_offset(mm, address, vma_mmu_pagesize(vma));
+ mmap_assert_locked(ctx->mm);

+ ptep = hugetlb_walk(vma, address, vma_mmu_pagesize(vma));
if (!ptep)
goto out;

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 81efd9b9baa2..1a51c45fdf2e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -196,6 +196,11 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
* huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE.
* Returns the pte_t* if found, or NULL if the address is not mapped.
*
+ * IMPORTANT: we should normally not call this function directly; it is
+ * only a common interface for implementing arch-specific walkers.
+ * Please consider using the hugetlb_walk() helper to make sure the
+ * correct locking is satisfied.
+ *
* Since this function will walk all the pgtable pages (including not only
* high-level pgtable page, but also PUD entry that can be unshared
* concurrently for VM_SHARED), the caller of this function should be
@@ -1229,4 +1234,36 @@ bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr);
#define flush_hugetlb_tlb_range(vma, addr, end) flush_tlb_range(vma, addr, end)
#endif

+static inline bool
+__vma_shareable_flags_pmd(struct vm_area_struct *vma)
+{
+ return vma->vm_flags & (VM_MAYSHARE | VM_SHARED) &&
+ vma->vm_private_data;
+}
+
+/*
+ * Safe version of huge_pte_offset() to check the locks. See comments
+ * above huge_pte_offset().
+ */
+static inline pte_t *
+hugetlb_walk(struct vm_area_struct *vma, unsigned long addr, unsigned long sz)
+{
+#if defined(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) && defined(CONFIG_LOCKDEP)
+ struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+
+ /*
+ * If pmd sharing possible, locking needed to safely walk the
+ * hugetlb pgtables. More information can be found at the comment
+ * above huge_pte_offset() in the same file.
+ *
+ * NOTE: lockdep_is_held() is only defined with CONFIG_LOCKDEP.
+ */
+ if (__vma_shareable_flags_pmd(vma))
+ WARN_ON_ONCE(!lockdep_is_held(&vma_lock->rw_sema) &&
+ !lockdep_is_held(
+ &vma->vm_file->f_mapping->i_mmap_rwsem));
+#endif
+ return huge_pte_offset(vma->vm_mm, addr, sz);
+}
+
#endif /* _LINUX_HUGETLB_H */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index df645a5824e3..05867e82b467 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4816,7 +4816,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
} else {
/*
* For shared mappings the vma lock must be held before
- * calling huge_pte_offset in the src vma. Otherwise, the
+ * calling hugetlb_walk() in the src vma. Otherwise, the
* returned ptep could go away if part of a shared pmd and
* another thread calls huge_pmd_unshare.
*/
@@ -4826,7 +4826,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
last_addr_mask = hugetlb_mask_last_page(h);
for (addr = src_vma->vm_start; addr < src_vma->vm_end; addr += sz) {
spinlock_t *src_ptl, *dst_ptl;
- src_pte = huge_pte_offset(src, addr, sz);
+ src_pte = hugetlb_walk(src_vma, addr, sz);
if (!src_pte) {
addr |= last_addr_mask;
continue;
@@ -5030,7 +5030,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
hugetlb_vma_lock_write(vma);
i_mmap_lock_write(mapping);
for (; old_addr < old_end; old_addr += sz, new_addr += sz) {
- src_pte = huge_pte_offset(mm, old_addr, sz);
+ src_pte = hugetlb_walk(vma, old_addr, sz);
if (!src_pte) {
old_addr |= last_addr_mask;
new_addr |= last_addr_mask;
@@ -5093,7 +5093,7 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
last_addr_mask = hugetlb_mask_last_page(h);
address = start;
for (; address < end; address += sz) {
- ptep = huge_pte_offset(mm, address, sz);
+ ptep = hugetlb_walk(vma, address, sz);
if (!ptep) {
address |= last_addr_mask;
continue;
@@ -5406,7 +5406,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
mutex_lock(&hugetlb_fault_mutex_table[hash]);
hugetlb_vma_lock_read(vma);
spin_lock(ptl);
- ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
+ ptep = hugetlb_walk(vma, haddr, huge_page_size(h));
if (likely(ptep &&
pte_same(huge_ptep_get(ptep), pte)))
goto retry_avoidcopy;
@@ -5444,7 +5444,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
* before the page tables are altered
*/
spin_lock(ptl);
- ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
+ ptep = hugetlb_walk(vma, haddr, huge_page_size(h));
if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
/* Break COW or unshare */
huge_ptep_clear_flush(vma, haddr, ptep);
@@ -5841,7 +5841,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* until finished with ptep. This prevents huge_pmd_unshare from
* being called elsewhere and making the ptep no longer valid.
*
- * ptep could have already be assigned via huge_pte_offset. That
+ * ptep could have already be assigned via hugetlb_walk(). That
* is OK, as huge_pte_alloc will return the same value unless
* something has changed.
*/
@@ -6233,7 +6233,7 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
return NULL;

hugetlb_vma_lock_read(vma);
- pte = huge_pte_offset(mm, haddr, huge_page_size(h));
+ pte = hugetlb_walk(vma, haddr, huge_page_size(h));
if (!pte)
goto out_unlock;

@@ -6298,8 +6298,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
*
* Note that page table lock is not held when pte is null.
*/
- pte = huge_pte_offset(mm, vaddr & huge_page_mask(h),
- huge_page_size(h));
+ pte = hugetlb_walk(vma, vaddr & huge_page_mask(h),
+ huge_page_size(h));
if (pte)
ptl = huge_pte_lock(h, mm, pte);
absent = !pte || huge_pte_none(huge_ptep_get(pte));
@@ -6485,7 +6485,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
last_addr_mask = hugetlb_mask_last_page(h);
for (; address < end; address += psize) {
spinlock_t *ptl;
- ptep = huge_pte_offset(mm, address, psize);
+ ptep = hugetlb_walk(vma, address, psize);
if (!ptep) {
address |= last_addr_mask;
continue;
@@ -6863,12 +6863,6 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
*end = ALIGN(*end, PUD_SIZE);
}

-static bool __vma_shareable_flags_pmd(struct vm_area_struct *vma)
-{
- return vma->vm_flags & (VM_MAYSHARE | VM_SHARED) &&
- vma->vm_private_data;
-}
-
void hugetlb_vma_lock_read(struct vm_area_struct *vma)
{
if (__vma_shareable_flags_pmd(vma)) {
@@ -7034,8 +7028,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,

saddr = page_table_shareable(svma, vma, addr, idx);
if (saddr) {
- spte = huge_pte_offset(svma->vm_mm, saddr,
- vma_mmu_pagesize(svma));
+ spte = hugetlb_walk(svma, saddr,
+ vma_mmu_pagesize(svma));
if (spte) {
get_page(virt_to_page(spte));
break;
@@ -7394,7 +7388,7 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
hugetlb_vma_lock_write(vma);
i_mmap_lock_write(vma->vm_file->f_mapping);
for (address = start; address < end; address += PUD_SIZE) {
- ptep = huge_pte_offset(mm, address, sz);
+ ptep = hugetlb_walk(vma, address, sz);
if (!ptep)
continue;
ptl = huge_pte_lock(h, mm, ptep);
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index f94ec78b54ff..bb782dea4b42 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -171,7 +171,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)

hugetlb_vma_lock_read(vma);
/* when pud is not present, pte will be NULL */
- pvmw->pte = huge_pte_offset(mm, pvmw->address, size);
+ pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
if (!pvmw->pte) {
hugetlb_vma_unlock_read(vma);
return false;
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index d98564a7be57..cb23f8a15c13 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -305,13 +305,11 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
hugetlb_vma_lock_read(vma);
do {
next = hugetlb_entry_end(h, addr, end);
- pte = huge_pte_offset(walk->mm, addr & hmask, sz);
-
+ pte = hugetlb_walk(vma, addr & hmask, sz);
if (pte)
err = ops->hugetlb_entry(pte, hmask, addr, next, walk);
else if (ops->pte_hole)
err = ops->pte_hole(addr, next, -1, walk);
-
if (err)
break;
} while (addr = next, addr != end);
--
2.37.3

2022-11-29 20:17:23

by Peter Xu

Subject: [PATCH 07/10] mm/hugetlb: Make follow_hugetlb_page() safe to pmd unshare

Since follow_hugetlb_page() walks the pgtable, it needs the vma lock
to make sure the pgtable page will not be freed concurrently.

Signed-off-by: Peter Xu <[email protected]>
---
mm/hugetlb.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d6bb1d22f1c4..df645a5824e3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6290,6 +6290,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
break;
}

+ hugetlb_vma_lock_read(vma);
/*
* Some archs (sparc64, sh*) have multiple pte_ts to
* each hugepage. We have to make sure we get the
@@ -6314,6 +6315,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
!hugetlbfs_pagecache_present(h, vma, vaddr)) {
if (pte)
spin_unlock(ptl);
+ hugetlb_vma_unlock_read(vma);
remainder = 0;
break;
}
@@ -6335,6 +6337,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,

if (pte)
spin_unlock(ptl);
+ hugetlb_vma_unlock_read(vma);
+
if (flags & FOLL_WRITE)
fault_flags |= FAULT_FLAG_WRITE;
else if (unshare)
@@ -6394,6 +6398,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
remainder -= pages_per_huge_page(h);
i += pages_per_huge_page(h);
spin_unlock(ptl);
+ hugetlb_vma_unlock_read(vma);
continue;
}

@@ -6421,6 +6426,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (WARN_ON_ONCE(!try_grab_folio(pages[i], refs,
flags))) {
spin_unlock(ptl);
+ hugetlb_vma_unlock_read(vma);
remainder = 0;
err = -ENOMEM;
break;
@@ -6432,6 +6438,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
i += refs;

spin_unlock(ptl);
+ hugetlb_vma_unlock_read(vma);
}
*nr_pages = remainder;
/*
--
2.37.3

2022-11-29 20:18:46

by Peter Xu

Subject: [PATCH 05/10] mm/hugetlb: Make userfaultfd_huge_must_wait() safe to pmd unshare

We can take the hugetlb walker lock; here, take the vma lock directly.

Signed-off-by: Peter Xu <[email protected]>
---
fs/userfaultfd.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 07c81ab3fd4d..a602f008dde5 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -376,7 +376,8 @@ static inline unsigned int userfaultfd_get_blocking_state(unsigned int flags)
*/
vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
{
- struct mm_struct *mm = vmf->vma->vm_mm;
+ struct vm_area_struct *vma = vmf->vma;
+ struct mm_struct *mm = vma->vm_mm;
struct userfaultfd_ctx *ctx;
struct userfaultfd_wait_queue uwq;
vm_fault_t ret = VM_FAULT_SIGBUS;
@@ -403,7 +404,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
*/
mmap_assert_locked(mm);

- ctx = vmf->vma->vm_userfaultfd_ctx.ctx;
+ ctx = vma->vm_userfaultfd_ctx.ctx;
if (!ctx)
goto out;

@@ -493,6 +494,13 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)

blocking_state = userfaultfd_get_blocking_state(vmf->flags);

+ /*
+ * This stabilizes pgtable for hugetlb on e.g. pmd unsharing. Need
+ * to be before setting current state.
+ */
+ if (is_vm_hugetlb_page(vma))
+ hugetlb_vma_lock_read(vma);
+
spin_lock_irq(&ctx->fault_pending_wqh.lock);
/*
* After the __add_wait_queue the uwq is visible to userland
@@ -507,13 +515,15 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
set_current_state(blocking_state);
spin_unlock_irq(&ctx->fault_pending_wqh.lock);

- if (!is_vm_hugetlb_page(vmf->vma))
+ if (!is_vm_hugetlb_page(vma))
must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
reason);
else
- must_wait = userfaultfd_huge_must_wait(ctx, vmf->vma,
+ must_wait = userfaultfd_huge_must_wait(ctx, vma,
vmf->address,
vmf->flags, reason);
+ if (is_vm_hugetlb_page(vma))
+ hugetlb_vma_unlock_read(vma);
mmap_read_unlock(mm);

if (likely(must_wait && !READ_ONCE(ctx->released))) {
--
2.37.3

2022-11-29 21:08:04

by Andrew Morton

Subject: Re: [PATCH 00/10] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare

On Tue, 29 Nov 2022 14:35:16 -0500 Peter Xu <[email protected]> wrote:

> [ 17.975943] Oops: 0000 [#1] PREEMPT SMP NOPTI

Do we know which kernel versions are affected here?

2022-11-29 21:08:08

by Andrew Morton

Subject: Re: [PATCH 00/10] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare

On Tue, 29 Nov 2022 14:35:16 -0500 Peter Xu <[email protected]> wrote:

> Based on latest mm-unstable (9ed079378408).
>
> This can be seen as a follow-up series to Mike's recent hugetlb vma lock
> series for pmd unsharing, mainly covering safe use of huge_pte_offset().

We're at -rc7 (a -rc8 appears probable this time) and I'm looking to
settle down and stabilize things...

>
> ...
>
> huge_pte_offset() is always called with the mmap lock held, either for
> read or for write. It was assumed to be safe, but it's actually not. One
> race condition can easily be triggered by: (1) first trigger pmd sharing
> on a memory range, (2) do huge_pte_offset() on the range, while in the
> meantime (3) another thread unshares the pmd range, after which the
> pgtable page is prone to be lost if the other sharing process frees it
> completely (by either munmap or exiting the mm).

That sounds like a hard-to-hit memory leak, but what we have here is a
user-triggerable use-after-free and an oops. Ugh.

Could people please prioritize the review and testing of this patchset?

2022-11-29 21:56:58

by Peter Xu

Subject: Re: [PATCH 00/10] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare

On Tue, Nov 29, 2022 at 12:51:17PM -0800, Andrew Morton wrote:
> On Tue, 29 Nov 2022 14:35:16 -0500 Peter Xu <[email protected]> wrote:
>
> > [ 17.975943] Oops: 0000 [#1] PREEMPT SMP NOPTI
>
> Do we know which kernel versions are affected here?

Since the lockless walk in huge_pte_offset() has existed since the initial
git commit, the issue should date back to when we introduced pmd sharing
for hugetlb, which means any kernel after commit 39dde65c9940 ("[PATCH]
shared page table for hugetlb page", 2006-12-07) should be prone to it, at
least for x86 (as pmd sharing was initially proposed only for x86).

--
Peter Xu

2022-11-29 21:58:23

by Andrew Morton

Subject: Re: [PATCH 00/10] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare

On Tue, 29 Nov 2022 16:19:52 -0500 Peter Xu <[email protected]> wrote:

> On Tue, Nov 29, 2022 at 12:49:44PM -0800, Andrew Morton wrote:
> > On Tue, 29 Nov 2022 14:35:16 -0500 Peter Xu <[email protected]> wrote:
> >
> > > Based on latest mm-unstable (9ed079378408).
> > >
> > > This can be seen as a follow-up series to Mike's recent hugetlb vma lock
> > > series for pmd unsharing, mainly covering safe use of huge_pte_offset().
> >
> > We're at -rc7 (a -rc8 appears probable this time) and I'm looking to
> > settle down and stabilize things...
>
> I targeted this series at the next release, not the current one, because
> there's no known report of it to my knowledge.
>
> The reproducer needs explicit kernel delays to trigger, as mentioned in
> the cover letter. So far I haven't tried to reproduce it with a generic
> kernel, only verified the existence of the problem.

OK, thanks, I missed that.

I'll give the series a run in -next for a couple of days then I'll pull
it out until after the next Linus merge window, so it can't invalidate
testing heading into that merge window.

2022-11-29 22:46:43

by Peter Xu

Subject: Re: [PATCH 00/10] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare

Hi, Andrew,

On Tue, Nov 29, 2022 at 12:49:44PM -0800, Andrew Morton wrote:
> On Tue, 29 Nov 2022 14:35:16 -0500 Peter Xu <[email protected]> wrote:
>
> > Based on latest mm-unstable (9ed079378408).
> >
> > This can be seen as a follow-up series to Mike's recent hugetlb vma lock
> > series for pmd unsharing, mainly covering safe use of huge_pte_offset().
>
> We're at -rc7 (a -rc8 appears probable this time) and I'm looking to
> settle down and stabilize things...

I targeted this series at the next release, not the current one, because
there's no known report of it to my knowledge.

The reproducer needs explicit kernel delays to trigger, as mentioned in the
cover letter.  So far I haven't tried to reproduce it with a generic
kernel, only verified the existence of the problem.

>
> >
> > ...
> >
> > huge_pte_offset() is always called with the mmap lock held, either for
> > read or for write. It was assumed to be safe, but it's actually not. One
> > race condition can easily be triggered by: (1) first trigger pmd sharing
> > on a memory range, (2) do huge_pte_offset() on the range, while in the
> > meantime (3) another thread unshares the pmd range, after which the
> > pgtable page is prone to be lost if the other sharing process frees it
> > completely (by either munmap or exiting the mm).
>
> That sounds like a hard-to-hit memory leak, but what we have here is a
> user-triggerable use-after-free and an oops. Ugh.

IIUC it's not a leak; it's just that huge_pte_offset() can walk the
(pmd-shared) pgtable page and also try to take the pgtable lock even though
the page may already have been freed in parallel, hence accessing either
the page or the pgtable lock after the pgtable page got freed.

E.g., the 1st warning was triggered by:

static inline struct lock_class *hlock_class(struct held_lock *hlock)
{
unsigned int class_idx = hlock->class_idx;

/* Don't re-read hlock->class_idx, can't use READ_ONCE() on bitfield */
barrier();

if (!test_bit(class_idx, lock_classes_in_use)) {
/*
* Someone passed in garbage, we give up.
*/
DEBUG_LOCKS_WARN_ON(1); <----------------------------
return NULL;
}
...
}

I think it's because the spin lock got freed along with the pgtable page,
so when we try to take the pgtable lock we see a weird lock state in the
dep_map, as the lock pointer is not valid at all.
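
For reference, the lock is looked up roughly like this
(huge_pte_lockptr(), slightly simplified); for PMD-sized hugetlb pages the
spinlock is looked up via the pmd pgtable page itself, so it goes away
together with that page:

	static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
						   struct mm_struct *mm, pte_t *pte)
	{
		if (huge_page_size(h) == PMD_SIZE)
			return pmd_lockptr(mm, (pmd_t *) pte);
		VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
		return &mm->page_table_lock;
	}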

--
Peter Xu

2022-11-30 05:38:34

by Eric Biggers

Subject: Re: [PATCH 10/10] mm/hugetlb: Introduce hugetlb_walk()

On Tue, Nov 29, 2022 at 02:35:26PM -0500, Peter Xu wrote:
> +static inline pte_t *
> +hugetlb_walk(struct vm_area_struct *vma, unsigned long addr, unsigned long sz)
> +{
> +#if defined(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) && defined(CONFIG_LOCKDEP)
> + struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
> +
> + /*
> + * If pmd sharing possible, locking needed to safely walk the
> + * hugetlb pgtables. More information can be found at the comment
> + * above huge_pte_offset() in the same file.
> + *
> + * NOTE: lockdep_is_held() is only defined with CONFIG_LOCKDEP.
> + */
> + if (__vma_shareable_flags_pmd(vma))
> + WARN_ON_ONCE(!lockdep_is_held(&vma_lock->rw_sema) &&
> + !lockdep_is_held(
> + &vma->vm_file->f_mapping->i_mmap_rwsem));
> +#endif

FYI, in next-20221130 there is a compile error here due to this commit:

In file included from security/commoncap.c:19:
./include/linux/hugetlb.h:1262:42: error: incomplete definition of type 'struct hugetlb_vma_lock'
WARN_ON_ONCE(!lockdep_is_held(&vma_lock->rw_sema) &&
~~~~~~~~^
./include/linux/lockdep.h:286:47: note: expanded from macro 'lockdep_is_held'
#define lockdep_is_held(lock) lock_is_held(&(lock)->dep_map)
^~~~

- Eric

2022-11-30 05:41:05

by Mike Kravetz

Subject: Re: [PATCH 02/10] mm/hugetlb: Don't wait for migration entry during follow page

On 11/29/22 14:35, Peter Xu wrote:
> That's what the code does with !hugetlb pages, so we should logically do
> the same for hugetlb, so that a migration entry will also be treated as
> no page.

Makes sense. I like the consistency.

> This is probably also the last piece of follow_page code that may sleep;
> the previous one was removed in cf994dd8af27 ("mm/gup: remove
> FOLL_MIGRATION", 2022-11-16).
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> mm/hugetlb.c | 11 -----------
> 1 file changed, 11 deletions(-)

Thanks!

Reviewed-by: Mike Kravetz <[email protected]>
--
Mike Kravetz

2022-11-30 10:25:31

by David Hildenbrand

Subject: Re: [PATCH 02/10] mm/hugetlb: Don't wait for migration entry during follow page

On 29.11.22 20:35, Peter Xu wrote:
> That's what the code does with !hugetlb pages, so we should logically do
> the same for hugetlb, so that a migration entry will also be treated as
> no page.
>
> This is probably also the last piece of follow_page code that may sleep;
> the previous one was removed in cf994dd8af27 ("mm/gup: remove
> FOLL_MIGRATION", 2022-11-16).

We only have a handful of follow_page() calls left. Most relevant for
hugetlb should only be the ones from mm/migrate.c.

So this is now just consistent with ordinary pages/THP.

Happy to see that gone.

Reviewed-by: David Hildenbrand <[email protected]>

--
Thanks,

David / dhildenb

2022-11-30 10:41:22

by David Hildenbrand

Subject: Re: [PATCH 00/10] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare

On 29.11.22 20:35, Peter Xu wrote:
> Based on latest mm-unstable (9ed079378408).
>
> This can be seen as a follow-up series to Mike's recent hugetlb vma lock
> series for pmd unsharing, mainly covering safe use of huge_pte_offset().
>
> Compared to the previous rfcv2, the major change is that I dropped the new
> pgtable lock and only use the vma lock for locking. The major reason is
> that I overlooked that the pgtable lock is not protected by RCU:
> __pmd_free_tlb() frees the pgtable lock before e.g. call_rcu() on
> RCU_TABLE_FREE archs. OTOH, many of the huge_pte_offset() call sites do
> need to take the pgtable lock, which means the valid users for the new
> RCU lock would be very limited.

Thanks.

>
> It's possible that in the future we can rework the pgtable free path to
> only free the pgtable lock after the RCU grace period (move
> pgtable_pmd_page_dtor() to be within tlb_remove_table_rcu()); then the
> RCU lock would make more sense. For now, keep it simple by fixing the
> races first.

Good.

>
> Since this version attaches a reproducer (below) and removes the RCU
> (especially the fallback irq-off) solution, the RFC tag is dropped.

Very nice, thanks.

>
> Old versions:
>
> rfcv1: https://lore.kernel.org/r/[email protected]
> rfcv2: https://lore.kernel.org/r/[email protected]
>
> Problem
> =======
>
> huge_pte_offset() is a major helper used by hugetlb code paths to walk a
> hugetlb pgtable. It's used almost everywhere, since it's needed even
> before taking the pgtable lock.
>
> huge_pte_offset() is always called with the mmap lock held, either for
> read or for write. It was assumed to be safe, but it's actually not. One
> race condition can easily be triggered by: (1) first trigger pmd sharing
> on a memory range, (2) do huge_pte_offset() on the range, while in the
> meantime (3) another thread unshares the pmd range, after which the
> pgtable page is prone to be lost if the other sharing process frees it
> completely (by either munmap or exiting the mm).

So just that I understand correctly:

Two processes, #A and #B, share a page table. Process #A runs two
threads, #A1 and #A2.

#A1 walks that shared page table (using huge_pte_offset()), for example,
to resolve a page fault. Concurrently, #A2 triggers unsharing of that
page table (replacing it by a private page table), for example, using
munmap().

So #A1 will eventually read/write the shared page table while we're
placing a private page table. Which would be fine (assuming no unsharing
would be required by #A1), however, if #B also concurrently drops its
reference to the shared page table, the shared page table could
essentially get freed while #A1 is still walking it.

I suspect, looking at the reproducer, that the page table destructor
was called. Will the page table also actually get freed already? IOW,
could #A1 be reading/writing a freed page?


>
> The recent work from Mike on the vma lock can resolve most of this
> already. It's achieved by forbidding pmd unsharing while the lock is
> taken, so there's no further risk of the pgtable page being freed. It
> means that if we can take the vma lock around all huge_pte_offset()
> callers, it'll be safe.

Agreed.

>
> A bunch of callers already do so as of the latest mm-unstable, but quite
> a few others still don't, for various reasons, especially around how
> huge_pte_offset() is used.
>
> One more thing to mention is that besides the vma lock, i_mmap_rwsem can
> also be used to protect the pgtable page (along with its pgtable lock)
> from being freed from under us. IOW, huge_pte_offset() callers need to
> hold either the vma lock or i_mmap_rwsem to safely walk the pgtables.
>
> A reproducer of the problem, based on hugetlb GUP (NOTE: since the race
> is very hard to hit, one needs to apply an extra kernel delay patch too;
> see below):

Thanks,

David / dhildenb

2022-11-30 10:41:35

by David Hildenbrand

Subject: Re: [PATCH 01/10] mm/hugetlb: Let vma_offset_start() to return start

On 29.11.22 20:35, Peter Xu wrote:
> Even though vma_offset_start() is named like that, it's not returning "the
> start address of the range" but rather the offset we should add to
> vma->vm_start.
>
> Make it return the real start vaddr instead. This also helps all the
> callers, because whenever the retval is used, it's ultimately added to
> vma->vm_start anyway, so it's better.

... the function name is a bit unfortunate ("offset" what?, "start" what?).

But it now matches the semantics of vma_offset_end() and the actual
"v_start" variables.

Reviewed-by: David Hildenbrand <[email protected]>

--
Thanks,

David / dhildenb

2022-11-30 16:11:22

by Peter Xu

Subject: Re: [PATCH 10/10] mm/hugetlb: Introduce hugetlb_walk()

On Tue, Nov 29, 2022 at 09:18:08PM -0800, Eric Biggers wrote:
> On Tue, Nov 29, 2022 at 02:35:26PM -0500, Peter Xu wrote:
> > +static inline pte_t *
> > +hugetlb_walk(struct vm_area_struct *vma, unsigned long addr, unsigned long sz)
> > +{
> > +#if defined(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) && defined(CONFIG_LOCKDEP)
> > + struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
> > +
> > + /*
> > + * If pmd sharing possible, locking needed to safely walk the
> > + * hugetlb pgtables. More information can be found at the comment
> > + * above huge_pte_offset() in the same file.
> > + *
> > + * NOTE: lockdep_is_held() is only defined with CONFIG_LOCKDEP.
> > + */
> > + if (__vma_shareable_flags_pmd(vma))
> > + WARN_ON_ONCE(!lockdep_is_held(&vma_lock->rw_sema) &&
> > + !lockdep_is_held(
> > + &vma->vm_file->f_mapping->i_mmap_rwsem));
> > +#endif
>
> FYI, in next-20221130 there is a compile error here due to this commit:
>
> In file included from security/commoncap.c:19:
> ./include/linux/hugetlb.h:1262:42: error: incomplete definition of type 'struct hugetlb_vma_lock'
> WARN_ON_ONCE(!lockdep_is_held(&vma_lock->rw_sema) &&
> ~~~~~~~~^
> ./include/linux/lockdep.h:286:47: note: expanded from macro 'lockdep_is_held'
> #define lockdep_is_held(lock) lock_is_held(&(lock)->dep_map)
> ^~~~

This probably means the config has:

CONFIG_HUGETLB_PAGE=n
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y

And I'm surprised we don't already have a dependency making
ARCH_WANT_HUGE_PMD_SHARE depend on HUGETLB_PAGE.  Mike, what do you think?

I've also attached a quick fix for this patch to be squashed in. Hope it
works.
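
A plausible shape for such a fixup (a sketch only; the attached patch
itself is not inlined in this archive) is to additionally require
CONFIG_HUGETLB_PAGE, so that struct hugetlb_vma_lock is fully defined
wherever the lockdep check is compiled:

	-#if defined(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) && defined(CONFIG_LOCKDEP)
	+#if defined(CONFIG_HUGETLB_PAGE) && \
	+	defined(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) && defined(CONFIG_LOCKDEP)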

Thanks,

--
Peter Xu


Attachments:
0001-fixup-mm-hugetlb-Introduce-hugetlb_walk.patch (995.00 B)

2022-11-30 16:24:20

by David Hildenbrand

Subject: Re: [PATCH 05/10] mm/hugetlb: Make userfaultfd_huge_must_wait() safe to pmd unshare

On 29.11.22 20:35, Peter Xu wrote:
> We can take the hugetlb walker lock; here, take the vma lock directly.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---

Reviewed-by: David Hildenbrand <[email protected]>

--
Thanks,

David / dhildenb

2022-11-30 16:54:10

by David Hildenbrand

Subject: Re: [PATCH 07/10] mm/hugetlb: Make follow_hugetlb_page() safe to pmd unshare

On 29.11.22 20:35, Peter Xu wrote:
> Since follow_hugetlb_page() walks the pgtable, it needs the vma lock
> to make sure the pgtable page will not be freed concurrently.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---


Acked-by: David Hildenbrand <[email protected]>

--
Thanks,

David / dhildenb

2022-11-30 16:55:21

by David Hildenbrand

Subject: Re: [PATCH 06/10] mm/hugetlb: Make hugetlb_follow_page_mask() safe to pmd unshare

On 29.11.22 20:35, Peter Xu wrote:
> Since hugetlb_follow_page_mask() walks the pgtable, it needs the vma lock
> to make sure the pgtable page will not be freed concurrently.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---

Acked-by: David Hildenbrand <[email protected]>

--
Thanks,

David / dhildenb

2022-11-30 17:38:32

by Peter Xu

Subject: Re: [PATCH 00/10] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare

On Wed, Nov 30, 2022 at 10:46:24AM +0100, David Hildenbrand wrote:
> > huge_pte_offset() is always called with the mmap lock held, either for
> > read or for write. It was assumed to be safe, but it's actually not. One
> > race condition can easily be triggered by: (1) first trigger pmd sharing
> > on a memory range, (2) do huge_pte_offset() on the range, while in the
> > meantime (3) another thread unshares the pmd range, after which the
> > pgtable page is prone to be lost if the other sharing process frees it
> > completely (by either munmap or exiting the mm).
>
> So just that I understand correctly:
>
> Two processes, #A and #B, share a page table. Process #A runs two threads,
> #A1 and #A2.
>
> #A1 walks that shared page table (using huge_pte_offset()), for example, to
> resolve a page fault. Concurrently, #A2 triggers unsharing of that page
> table (replacing it by a private page table),

Not yet replacing it, just unsharing.

If the replacement happened we shouldn't trigger a bug either because
huge_pte_offset() will return the private pgtable page instead.

> for example, using munmap().

munmap() may not work because it needs the mmap lock, so it'll wait until
#A1 completes its huge_pte_offset() walk and releases the mmap read lock.

Many other things can trigger an unshare, though.  In the reproducer I used
MADV_DONTNEED.

>
> So #A1 will eventually read/write the shared page table while we're
> placing a private page table. Which would be fine (assuming no unsharing
> would be required by #A1), however, if #B also concurrently drops its
> reference to the shared page table, the shared page table could
> essentially get freed while #A1 is still walking it.
>
> I suspect, looking at the reproducer, that the page table destructor
> was called. Will the page table also actually get freed already? IOW,
> could #A1 be reading/writing a freed page?

With the existing code base, I think it could.

With the RCU lock it couldn't, but since the pgtable lock is freed even if
the page is not, we'd still hit weird issues when accessing the lock.

And with the vma lock it should be all safe.
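
For reference, the call-site pattern the series converges on looks roughly
like the below (a sketch only, not verbatim from any single caller):

	/*
	 * Forbid huge_pmd_unshare(), so neither the pgtable page nor its
	 * pgtable lock can be freed from under us.
	 */
	hugetlb_vma_lock_read(vma);
	ptep = hugetlb_walk(vma, addr, huge_page_size(h));
	if (ptep) {
		ptl = huge_pte_lock(h, mm, ptep);
		/* ... read or modify the pte safely ... */
		spin_unlock(ptl);
	}
	hugetlb_vma_unlock_read(vma);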

--
Peter Xu

2022-12-05 22:47:24

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 04/10] mm/hugetlb: Move swap entry handling into vma lock when faulted

On 11/29/22 14:35, Peter Xu wrote:
> In hugetlb_fault(), there used to be a special path to handle swap entries
> at the entrance using huge_pte_offset(). That's unsafe because
> huge_pte_offset() for a pmd sharable range can access freed pgtables
> without any lock to protect the pgtable from being freed after pmd unshare.

Thanks. That special path has been there for a long time. I should have
given it more thought when adding the vma lock. However, I was only
considering the 'stale' pte case as opposed to the freed one.

> Here the simplest solution to make it safe is to move the swap handling
> to after the vma lock is held. We may need to take the fault mutex on
> either migration or hwpoison entries now (also the vma lock, but that's
> really needed); however, neither of them is a hot path.
>
> Note that the vma lock cannot be released in hugetlb_fault() when a
> migration entry is detected, because in migration_entry_wait_huge() the
> pgtable page will be used again (by taking the pgtable lock), so that also
> needs to be protected by the vma lock. Modify migration_entry_wait_huge()
> so that it must be called with the vma read lock held, and properly release
> the lock in __migration_entry_wait_huge().
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> include/linux/swapops.h | 6 ++++--
> mm/hugetlb.c | 32 +++++++++++++++-----------------
> mm/migrate.c | 25 +++++++++++++++++++++----
> 3 files changed, 40 insertions(+), 23 deletions(-)

Looks good with one small change noted below,

Reviewed-by: Mike Kravetz <[email protected]>

>
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 27ade4f22abb..09b22b169a71 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -335,7 +335,8 @@ extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
> extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> unsigned long address);
> #ifdef CONFIG_HUGETLB_PAGE
> -extern void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl);
> +extern void __migration_entry_wait_huge(struct vm_area_struct *vma,
> + pte_t *ptep, spinlock_t *ptl);
> extern void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte);
> #endif /* CONFIG_HUGETLB_PAGE */
> #else /* CONFIG_MIGRATION */
> @@ -364,7 +365,8 @@ static inline void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
> static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> unsigned long address) { }
> #ifdef CONFIG_HUGETLB_PAGE
> -static inline void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl) { }
> +static inline void __migration_entry_wait_huge(struct vm_area_struct *vma,
> + pte_t *ptep, spinlock_t *ptl) { }
> static inline void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte) { }
> #endif /* CONFIG_HUGETLB_PAGE */
> static inline int is_writable_migration_entry(swp_entry_t entry)
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index dfe677fadaf8..776e34ccf029 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5826,22 +5826,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> int need_wait_lock = 0;
> unsigned long haddr = address & huge_page_mask(h);
>
> - ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
> - if (ptep) {
> - /*
> - * Since we hold no locks, ptep could be stale. That is
> - * OK as we are only making decisions based on content and
> - * not actually modifying content here.
> - */
> - entry = huge_ptep_get(ptep);
> - if (unlikely(is_hugetlb_entry_migration(entry))) {
> - migration_entry_wait_huge(vma, ptep);
> - return 0;
> - } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
> - return VM_FAULT_HWPOISON_LARGE |
> - VM_FAULT_SET_HINDEX(hstate_index(h));
> - }
> -

Before acquiring the vma_lock, there is this comment:

/*
* Acquire vma lock before calling huge_pte_alloc and hold
* until finished with ptep. This prevents huge_pmd_unshare from
* being called elsewhere and making the ptep no longer valid.
*
* ptep could have already be assigned via hugetlb_walk(). That
* is OK, as huge_pte_alloc will return the same value unless
* something has changed.
*/

The second sentence in that comment about ptep being already assigned no
longer applies and can be deleted.
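
I.e., trimmed down to:

	/*
	 * Acquire vma lock before calling huge_pte_alloc and hold
	 * until finished with ptep. This prevents huge_pmd_unshare from
	 * being called elsewhere and making the ptep no longer valid.
	 */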

--
Mike Kravetz

2022-12-05 22:48:32

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 05/10] mm/hugetlb: Make userfaultfd_huge_must_wait() safe to pmd unshare

On 11/29/22 14:35, Peter Xu wrote:
> We can take the hugetlb walker lock, here taking the vma lock directly.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> fs/userfaultfd.c | 18 ++++++++++++++----
> 1 file changed, 14 insertions(+), 4 deletions(-)

Looks good.

Reviewed-by: Mike Kravetz <[email protected]>
--
Mike Kravetz

2022-12-05 23:06:38

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 06/10] mm/hugetlb: Make hugetlb_follow_page_mask() safe to pmd unshare

On 11/29/22 14:35, Peter Xu wrote:
> Since hugetlb_follow_page_mask() walks the pgtable, it needs the vma lock
> to make sure the pgtable page will not be freed concurrently.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> mm/hugetlb.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)

Thanks!

Reviewed-by: Mike Kravetz <[email protected]>

--
Mike Kravetz

2022-12-05 23:07:41

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 07/10] mm/hugetlb: Make follow_hugetlb_page() safe to pmd unshare

On 11/29/22 14:35, Peter Xu wrote:
> Since follow_hugetlb_page() walks the pgtable, it needs the vma lock
> to make sure the pgtable page will not be freed concurrently.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> mm/hugetlb.c | 7 +++++++
> 1 file changed, 7 insertions(+)

Thanks! The flow of that routine is difficult to follow.

Reviewed-by: Mike Kravetz <[email protected]>
--
Mike Kravetz

2022-12-06 00:19:06

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 04/10] mm/hugetlb: Move swap entry handling into vma lock when faulted

On Mon, Dec 05, 2022 at 02:14:38PM -0800, Mike Kravetz wrote:
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index dfe677fadaf8..776e34ccf029 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -5826,22 +5826,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > int need_wait_lock = 0;
> > unsigned long haddr = address & huge_page_mask(h);
> >
> > - ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
> > - if (ptep) {
> > - /*
> > - * Since we hold no locks, ptep could be stale. That is
> > - * OK as we are only making decisions based on content and
> > - * not actually modifying content here.
> > - */
> > - entry = huge_ptep_get(ptep);
> > - if (unlikely(is_hugetlb_entry_migration(entry))) {
> > - migration_entry_wait_huge(vma, ptep);
> > - return 0;
> > - } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
> > - return VM_FAULT_HWPOISON_LARGE |
> > - VM_FAULT_SET_HINDEX(hstate_index(h));
> > - }
> > -
>
> Before acquiring the vma_lock, there is this comment:
>
> /*
> * Acquire vma lock before calling huge_pte_alloc and hold
> * until finished with ptep. This prevents huge_pmd_unshare from
> * being called elsewhere and making the ptep no longer valid.
> *
> * ptep could have already be assigned via hugetlb_walk(). That
> * is OK, as huge_pte_alloc will return the same value unless
> * something has changed.
> */
>
> The second sentence in that comment about ptep being already assigned no
> longer applies and can be deleted.

Correct, this can be removed.

One thing to mention is that the last patch of the series, when introducing
hugetlb_walk(), has an inline touch-up (s/pte_offset/walk/) to the comment
above, but I saw that Andrew has already done the right thing with the fixup
in his local tree, so I think we're all good.

Thanks!

--
Peter Xu

2022-12-06 01:22:12

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 10/10] mm/hugetlb: Introduce hugetlb_walk()

On 11/30/22 10:37, Peter Xu wrote:
> On Tue, Nov 29, 2022 at 09:18:08PM -0800, Eric Biggers wrote:
> > On Tue, Nov 29, 2022 at 02:35:26PM -0500, Peter Xu wrote:
> > > +static inline pte_t *
> > > +hugetlb_walk(struct vm_area_struct *vma, unsigned long addr, unsigned long sz)
> > > +{
> > > +#if defined(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) && defined(CONFIG_LOCKDEP)
> > > + struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
> > > +
> > > + /*
> > > + * If pmd sharing possible, locking needed to safely walk the
> > > + * hugetlb pgtables. More information can be found at the comment
> > > + * above huge_pte_offset() in the same file.
> > > + *
> > > + * NOTE: lockdep_is_held() is only defined with CONFIG_LOCKDEP.
> > > + */
> > > + if (__vma_shareable_flags_pmd(vma))
> > > + WARN_ON_ONCE(!lockdep_is_held(&vma_lock->rw_sema) &&
> > > + !lockdep_is_held(
> > > + &vma->vm_file->f_mapping->i_mmap_rwsem));
> > > +#endif
> >
> > FYI, in next-20221130 there is a compile error here due to this commit:
> >
> > In file included from security/commoncap.c:19:
> > ./include/linux/hugetlb.h:1262:42: error: incomplete definition of type 'struct hugetlb_vma_lock'
> > WARN_ON_ONCE(!lockdep_is_held(&vma_lock->rw_sema) &&
> > ~~~~~~~~^
> > ./include/linux/lockdep.h:286:47: note: expanded from macro 'lockdep_is_held'
> > #define lockdep_is_held(lock) lock_is_held(&(lock)->dep_map)
> > ^~~~
>
> This probably means the config has:
>
> CONFIG_HUGETLB_PAGE=n
> CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
>
> And I'm surprised ARCH_WANT_HUGE_PMD_SHARE doesn't already depend on
> HUGETLB_PAGE. Mike, what do you think?

Yes, ARCH_WANT_HUGE_PMD_SHARE should probably depend on HUGETLB_PAGE as it
makes no sense without it. There is also ARCH_WANT_GENERAL_HUGETLB that
seems to fall into the same category.
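
An untested sketch of what that dependency could look like (note the symbol
is selected by the arch Kconfigs, so the selects may need auditing too):

	config ARCH_WANT_HUGE_PMD_SHARE
		bool
		depends on HUGETLB_PAGE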


>
> I've also attached a quick fix for this patch to be squashed in. Hope it
> works.

Let's go with this quick fix for now, since the only issue is in the new
code, and spend more time on the config dependencies later.

For the original patch and the quick fix,

Reviewed-by: Mike Kravetz <[email protected]>
--
Mike Kravetz

>
> Thanks,
>
> --
> Peter Xu

> From 9787a7f5492ca251fcce5c09bd7e4a80ac157726 Mon Sep 17 00:00:00 2001
> From: Peter Xu <[email protected]>
> Date: Wed, 30 Nov 2022 10:33:44 -0500
> Subject: [PATCH] fixup! mm/hugetlb: Introduce hugetlb_walk()
> Content-type: text/plain
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> include/linux/hugetlb.h | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 1a51c45fdf2e..ec2a1f93b12d 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -1248,7 +1248,8 @@ __vma_shareable_flags_pmd(struct vm_area_struct *vma)
> static inline pte_t *
> hugetlb_walk(struct vm_area_struct *vma, unsigned long addr, unsigned long sz)
> {
> -#if defined(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) && defined(CONFIG_LOCKDEP)
> +#if defined(CONFIG_HUGETLB_PAGE) && \
> + defined(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) && defined(CONFIG_LOCKDEP)
> struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
>
> /*
> --
> 2.37.3
>