Currently there is a bug where a huge page can be stolen, and when the
original owner later tries to fault it in, the process receives a SIGBUS.
You can achieve that by:
1) Creating a single page
echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
2) mmap() the page above with MAP_HUGETLB into (void *ptr1).
* This will mark the page as reserved
3) touch the page, which causes a page fault and allocates the page
* This will move the page out of the free list.
 * It will also unreserve the page, since there are no more free
 pages
4) madvise(MADV_DONTNEED) the page
* This will free the page, but not mark it as reserved.
5) Allocate a secondary page with mmap(MAP_HUGETLB) into (void *ptr2).
 * It should fail, since there is no available page left.
 * But, since the page above is not reserved, this mmap() succeeds.
6) Faulting at ptr1 will cause a SIGBUS
* it will try to allocate a huge page, but there is none
available
A full reproducer is in selftest. See
https://lore.kernel.org/all/[email protected]/
Fix this by restoring the reserved page if necessary. If the page being
unmapped has HPAGE_RESV_OWNER set, and needs a reservation, set the
restore_reserve flag, which will move the page from free to reserved.
Suggested-by: Rik van Riel <[email protected]>
Signed-off-by: Breno Leitao <[email protected]>
---
mm/hugetlb.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ed1581b670d4..fa2c17767e44 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5677,6 +5677,16 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
hugetlb_count_sub(pages_per_huge_page(h), mm);
hugetlb_remove_rmap(page_folio(page));
+ if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
+ vma_needs_reservation(h, vma, start)) {
+ /*
+ * Restore the reservation if needed, otherwise the
+ * backing page could be stolen by someone.
+ */
+ folio_set_hugetlb_restore_reserve(page_folio(page));
+ vma_add_reservation(h, vma, address);
+ }
+
spin_unlock(ptl);
tlb_remove_page_size(tlb, page, huge_page_size(h));
/*
--
2.34.1
hugetlb_madv_vs_map selftest was not part of the mm test-suite since we
didn't have a fix for the problem it found.
Now that the problem is already fixed (see previous commit), let's
enable this selftest in the default test-suite.
Signed-off-by: Breno Leitao <[email protected]>
---
tools/testing/selftests/mm/run_vmtests.sh | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index a5e6ba8d3579..f41e1978e4d4 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -256,6 +256,7 @@ nr_hugepages_tmp=$(cat /proc/sys/vm/nr_hugepages)
# For this test, we need one and just one huge page
echo 1 > /proc/sys/vm/nr_hugepages
CATEGORY="hugetlb" run_test ./hugetlb_fault_after_madv
+CATEGORY="hugetlb" run_test ./hugetlb_madv_vs_map
# Restore the previous number of huge pages, since further tests rely on it
echo "$nr_hugepages_tmp" > /proc/sys/vm/nr_hugepages
--
2.34.1
On 17/01/2024 17:10, Breno Leitao wrote:
> Currently there is a bug that a huge page could be stolen, and when the
> original owner tries to fault in it, it causes a page fault.
>
> You can achieve that by:
> 1) Creating a single page
> echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>
> 2) mmap() the page above with MAP_HUGETLB into (void *ptr1).
> * This will mark the page as reserved
> 3) touch the page, which causes a page fault and allocates the page
> * This will move the page out of the free list.
> * It will also unreserved the page, since there is no more free
> page
> 4) madvise(MADV_DONTNEED) the page
> * This will free the page, but not mark it as reserved.
> 5) Allocate a secondary page with mmap(MAP_HUGETLB) into (void *ptr2).
> * it should fail, but, since there is no more available page.
> * But, since the page above is not reserved, this mmap() succeed.
> 6) Faulting at ptr1 will cause a SIGBUS
> * it will try to allocate a huge page, but there is none
> available
>
> A full reproducer is in selftest. See
> https://lore.kernel.org/all/[email protected]/
>
> Fix this by restoring the reserved page if necessary. If the page being
> unmapped has HPAGE_RESV_OWNER set, and needs a reservation, set the
> restore_reserve flag, which will move the page from free to reserved.
>
> Suggested-by: Rik van Riel <[email protected]>
> Signed-off-by: Breno Leitao <[email protected]>
> ---
> mm/hugetlb.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ed1581b670d4..fa2c17767e44 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5677,6 +5677,16 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
> hugetlb_count_sub(pages_per_huge_page(h), mm);
> hugetlb_remove_rmap(page_folio(page));
>
> + if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
> + vma_needs_reservation(h, vma, start)) {
> + /*
> + * Restore the reservation if needed, otherwise the
> + * backing page could be stolen by someone.
> + */
> + folio_set_hugetlb_restore_reserve(page_folio(page));
> + vma_add_reservation(h, vma, address);
> + }
> +
> spin_unlock(ptl);
> tlb_remove_page_size(tlb, page, huge_page_size(h));
> /*
Hi Breno,
I'm seeing a kernel bug fire when running the "map_hugetlb" mm selftest against latest mm-unstable. Bisect tells me this patch is culprit. I'm running on arm64 with defconfig plus the following:
# Workaround mm-unstable build errors.
/scripts/config --file ${param:builddir}/.config --enable ANON_VMA_NAME
# Squashfs for snaps, xfs for large file folios.
/scripts/config --enable CONFIG_SQUASHFS_LZ4
/scripts/config --enable CONFIG_SQUASHFS_LZO
/scripts/config --enable CONFIG_SQUASHFS_XZ
/scripts/config --enable CONFIG_SQUASHFS_ZSTD
/scripts/config --enable CONFIG_XFS_FS
# Useful trace features (on for Ubuntu configs).
/scripts/config --enable CONFIG_FTRACE
/scripts/config --enable CONFIG_FUNCTION_TRACER
/scripts/config --enable CONFIG_KPROBES
/scripts/config --enable CONFIG_HIST_TRIGGERS
/scripts/config --enable CONFIG_FTRACE_SYSCALLS
# For general mm debug.
/scripts/config --enable CONFIG_DEBUG_VM
/scripts/config --enable CONFIG_DEBUG_VM_MAPLE_TREE
/scripts/config --enable CONFIG_DEBUG_VM_RB
/scripts/config --enable CONFIG_DEBUG_VM_PGFLAGS
/scripts/config --enable CONFIG_DEBUG_VM_PGTABLE
/scripts/config --enable CONFIG_PAGE_TABLE_CHECK
# For mm selftests.
/scripts/config --enable CONFIG_USERFAULTFD
/scripts/config --enable CONFIG_TEST_VMALLOC
/scripts/config --enable CONFIG_GUP_TEST
Steps to reproduce:
$ sudo ./map_hugetlb
git bisect log:
git bisect start
# good: [6613476e225e090cc9aad49be7fa504e290dd33d] Linux 6.8-rc1
git bisect good 6613476e225e090cc9aad49be7fa504e290dd33d
# bad: [ba34798948c09778ff707415fd52daa591bd361a] mm: add swappiness= arg to memory.reclaim
git bisect bad ba34798948c09778ff707415fd52daa591bd361a
# bad: [69d0e75874638fe852ed7004666e068cf9e2148d] mm: list_lru: disable memcg_aware when cgroup.memory is set to "nokmem"
git bisect bad 69d0e75874638fe852ed7004666e068cf9e2148d
# good: [38c61fca93058635b533ad927c1d6529757424d3] mm: huge_memory: don't force huge page alignment on 32 bit
git bisect good 38c61fca93058635b533ad927c1d6529757424d3
# bad: [552225e31fdc31315d06c5c7f59b8dbc78e45d18] mm: vmalloc: rename adjust_va_to_fit_type() function
git bisect bad 552225e31fdc31315d06c5c7f59b8dbc78e45d18
# bad: [184c8db9112ee2370c2d17efab76a6c2aad061bc] s390/sclp: remove unhandled memory notifier type
git bisect bad 184c8db9112ee2370c2d17efab76a6c2aad061bc
# bad: [7ccaa21733f6a0fefdf978da0ee08a459c8441ca] mm/cma: fix placement of trace_cma_alloc_start/finish
git bisect bad 7ccaa21733f6a0fefdf978da0ee08a459c8441ca
# bad: [be7a8289b10f82365b9a209d9db9e4aaab1a86d4] selftests/mm: run_vmtests.sh: add hugetlb_madv_vs_map
git bisect bad be7a8289b10f82365b9a209d9db9e4aaab1a86d4
# bad: [947b031634e7af3d265275c26ec17e2f96fdb5a1] mm/hugetlb: restore the reservation if needed
git bisect bad 947b031634e7af3d265275c26ec17e2f96fdb5a1
# first bad commit: [947b031634e7af3d265275c26ec17e2f96fdb5a1] mm/hugetlb: restore the reservation if needed
BUG:
[ 11.959447] kernel BUG at mm/hugetlb.c:1129!
[ 11.959628] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
[ 11.959870] Modules linked in:
[ 11.959975] CPU: 0 PID: 1211 Comm: map_hugetlb Not tainted 6.8.0-rc1-00029-g947b031634e7 #11
[ 11.960272] Hardware name: linux,dummy-virt (DT)
[ 11.960434] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[ 11.960680] pc : resv_map_release+0xa8/0xb8
[ 11.960895] lr : resv_map_release+0x6c/0xb8
[ 11.961041] sp : ffff80008863bb20
[ 11.961159] x29: ffff80008863bb20 x28: 0000000000000000 x27: 0000000000000001
[ 11.961500] x26: ffff77b80e91a704 x25: 0000000000000001 x24: ffff77b80e91a708
[ 11.961740] x23: ffff77b80e91a700 x22: 0000000000000080 x21: 0000000000000000
[ 11.961992] x20: ffff77b80e91a720 x19: ffff77b80e91a720 x18: 0000fffff7fecfff
[ 11.962239] x17: 0000000000000000 x16: 1fffeef701697601 x15: 0000000000000001
[ 11.962484] x14: ffffffffffffffff x13: 0000000000000000 x12: ffffc8c6bc3b3cd0
[ 11.962729] x11: 0000000000000001 x10: 0000000000000000 x9 : ffffc8c6bad5245c
[ 11.962971] x8 : ffff80008863bae0 x7 : 0000000000000000 x6 : ffff77b801966900
[ 11.963285] x5 : ffffc8c6bad6b7e4 x4 : fffffddee1613ea0 x3 : 0000000000400010
[ 11.963629] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 000000000000007f
[ 11.963945] Call trace:
[ 11.964070] resv_map_release+0xa8/0xb8
[ 11.964255] hugetlb_vm_op_close+0x1a0/0x1f0
[ 11.964431] remove_vma+0x34/0x88
[ 11.964604] do_vmi_align_munmap+0x4b8/0x5f0
[ 11.964792] do_vmi_munmap+0xb4/0x138
[ 11.964936] __vm_munmap+0xa8/0x188
[ 11.965074] __arm64_sys_munmap+0x28/0x40
[ 11.965237] invoke_syscall+0x50/0x128
[ 11.965388] el0_svc_common.constprop.0+0x48/0xf0
[ 11.965573] do_el0_svc+0x24/0x38
[ 11.965705] el0_svc+0x34/0xb8
[ 11.965862] el0t_64_sync_handler+0x100/0x130
[ 11.966032] el0t_64_sync+0x190/0x198
[ 11.966174] Code: a8c47bfd d50323bf d65f03c0 a9025bf5 (d4210000)
[ 11.966410] ---[ end trace 0000000000000000 ]---
[ 11.966594] note: map_hugetlb[1211] exited with irqs disabled
[ 11.966874] note: map_hugetlb[1211] exited with preempt_count 1
[ 11.967915] ------------[ cut here ]------------
[ 11.968165] WARNING: CPU: 0 PID: 0 at kernel/context_tracking.c:128 ct_kernel_exit.constprop.0+0x108/0x120
[ 11.968560] Modules linked in:
[ 11.968651] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G D 6.8.0-rc1-00029-g947b031634e7 #11
[ 11.969235] Hardware name: linux,dummy-virt (DT)
[ 11.970045] pstate: 214003c5 (nzCv DAIF +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[ 11.970941] pc : ct_kernel_exit.constprop.0+0x108/0x120
[ 11.971241] lr : ct_idle_enter+0x10/0x20
[ 11.971397] sp : ffffc8c6bd7f3d40
[ 11.971527] x29: ffffc8c6bd7f3d40 x28: 000000004224d0ac x27: 0000000000000000
[ 11.971811] x26: 0000000000000000 x25: ffffc8c6bd804980 x24: 0000000000000000
[ 11.972279] x23: 0000000000000000 x22: ffffc8c6bd804980 x21: ffffc8c6bd7f9ae0
[ 11.973023] x20: ffffc8c6bd7f9a20 x19: ffff77b83ff8dce0 x18: 0000000000000001
[ 11.973856] x17: 0000000000000000 x16: 1fffeef701697721 x15: 0000fffff7fed000
[ 11.975799] x14: ffff80008863b688 x13: 0000000000000000 x12: ffffc8c6bc3b3cd0
[ 11.976202] x11: 0000000000000001 x10: fc206bc54f751bff x9 : ffffc8c6bab7f564
[ 11.976461] x8 : ffffc8c6bd805a58 x7 : 00000000000001e5 x6 : 00000000119b8986
[ 11.976722] x5 : 4000000000000002 x4 : ffffaef18328b000 x3 : ffffc8c6bd7f3d40
[ 11.976985] x2 : ffffc8c6bcd02ce0 x1 : 4000000000000000 x0 : ffffc8c6bcd02ce0
[ 11.977245] Call trace:
[ 11.977338] ct_kernel_exit.constprop.0+0x108/0x120
[ 11.977520] ct_idle_enter+0x10/0x20
[ 11.977654] default_idle_call+0x3c/0x170
[ 11.977804] do_idle+0x218/0x278
[ 11.977942] cpu_startup_entry+0x40/0x50
[ 11.978087] rest_init+0xf0/0xf8
[ 11.978209] arch_call_rest_init+0x18/0x20
[ 11.978391] start_kernel+0x524/0x670
[ 11.978528] __primary_switched+0xb8/0xc8
[ 11.978681] ---[ end trace 0000000000000000 ]---
Thanks,
Ryan
On Wed, Jan 24, 2024 at 09:22:03AM +0000, Ryan Roberts wrote:
> On 17/01/2024 17:10, Breno Leitao wrote:
> > Currently there is a bug that a huge page could be stolen, and when the
> > original owner tries to fault in it, it causes a page fault.
> >
> > You can achieve that by:
> > 1) Creating a single page
> > echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> >
> > 2) mmap() the page above with MAP_HUGETLB into (void *ptr1).
> > * This will mark the page as reserved
> > 3) touch the page, which causes a page fault and allocates the page
> > * This will move the page out of the free list.
> > * It will also unreserved the page, since there is no more free
> > page
> > 4) madvise(MADV_DONTNEED) the page
> > * This will free the page, but not mark it as reserved.
> > 5) Allocate a secondary page with mmap(MAP_HUGETLB) into (void *ptr2).
> > * it should fail, but, since there is no more available page.
> > * But, since the page above is not reserved, this mmap() succeed.
> > 6) Faulting at ptr1 will cause a SIGBUS
> > * it will try to allocate a huge page, but there is none
> > available
> >
> > A full reproducer is in selftest. See
> > https://lore.kernel.org/all/[email protected]/
> >
> > Fix this by restoring the reserved page if necessary. If the page being
> > unmapped has HPAGE_RESV_OWNER set, and needs a reservation, set the
> > restore_reserve flag, which will move the page from free to reserved.
> >
> > Suggested-by: Rik van Riel <[email protected]>
> > Signed-off-by: Breno Leitao <[email protected]>
> > ---
> > mm/hugetlb.c | 10 ++++++++++
> > 1 file changed, 10 insertions(+)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index ed1581b670d4..fa2c17767e44 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -5677,6 +5677,16 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
> > hugetlb_count_sub(pages_per_huge_page(h), mm);
> > hugetlb_remove_rmap(page_folio(page));
> >
> > + if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
> > + vma_needs_reservation(h, vma, start)) {
> > + /*
> > + * Restore the reservation if needed, otherwise the
> > + * backing page could be stolen by someone.
> > + */
> > + folio_set_hugetlb_restore_reserve(page_folio(page));
> > + vma_add_reservation(h, vma, address);
> > + }
> > +
> > spin_unlock(ptl);
> > tlb_remove_page_size(tlb, page, huge_page_size(h));
> > /*
>
> Hi Breno,
>
> I'm seeing a kernel bug fire when running the "map_hugetlb" mm selftest against latest mm-unstable. Bisect tells me this patch is culprit. I'm running on arm64 with defconfig plus the following:
Hello Ryan,
Thanks for the heads-up. I was able to reproduce the problem, and I am
working on a solution.
Thanks,