2022-11-11 23:42:11

by Mike Kravetz

Subject: [PATCH v9 0/3] fix hugetlb MADV_DONTNEED vma_lock handling

This series addresses the issue first reported in [1], and fully
described in patch 3. While exploring solutions to this issue,
related problems with mmu notification calls were discovered. The
first two patches address those issues.

Previous discussions suggested further cleanup by removing the
routine zap_page_range. This is possible because zap_page_range_single
is now exported, and all callers of zap_page_range pass ranges entirely
within a single vma. This work will be done in a later patch so as not
to distract from this bug fix.
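
For reference, that later cleanup would mostly be mechanical conversions of
the remaining zap_page_range callers along these lines (a sketch only, not
part of this series; it mirrors what patch 1 does for madvise):

	/* before: the range is already known to lie within 'vma' */
	zap_page_range(vma, start, end - start);

	/* after: use the single-vma variant; no special zap_details needed */
	zap_page_range_single(vma, start, end - start, NULL);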

[1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/

Mike Kravetz (3):
madvise: use zap_page_range_single for madvise dontneed
hugetlb: remove duplicate mmu notifications
hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing

include/linux/mm.h | 29 +++++++++++++++++++++--------
mm/hugetlb.c | 45 +++++++++++++++++++++++++--------------------
mm/madvise.c | 6 +++---
mm/memory.c | 25 ++++++++++++-------------
4 files changed, 61 insertions(+), 44 deletions(-)

--
2.37.3



2022-11-11 23:42:46

by Mike Kravetz

Subject: [PATCH v9 1/3] madvise: use zap_page_range_single for madvise dontneed

Expose the routine zap_page_range_single to zap a range within a single
vma. The madvise routine madvise_dontneed_single_vma can use this
routine as it explicitly operates on a single vma. Also, update the mmu
notification range in zap_page_range_single to take hugetlb pmd sharing
into account. This is required as MADV_DONTNEED supports hugetlb vmas.
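
As a rough illustration (a sketch only, not the patch itself): the notifier
range may be widened for worst-case pmd sharing, while the unmap itself stays
within the range the caller asked for:

	range.start = address;
	range.end = address + size;
	if (is_vm_hugetlb_page(vma))
		/* may round the range outward to cover a shared pmd page */
		adjust_range_if_pmd_sharing_possible(vma, &range.start,
						     &range.end);
	mmu_notifier_invalidate_range_start(&range);
	/* still unmap only 'address' to 'address + size' */
	unmap_single_vma(&tlb, vma, address, address + size, details);
	mmu_notifier_invalidate_range_end(&range);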

Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
Signed-off-by: Mike Kravetz <[email protected]>
Reported-by: Wei Chen <[email protected]>
Cc: <[email protected]>
---
include/linux/mm.h | 27 +++++++++++++++++++--------
mm/madvise.c | 6 +++---
mm/memory.c | 23 +++++++++++------------
3 files changed, 33 insertions(+), 23 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e8fc35edaee0..9e7cad65dfde 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1881,6 +1881,23 @@ static void __maybe_unused show_free_areas(unsigned int flags, nodemask_t *nodem
__show_free_areas(flags, nodemask, MAX_NR_ZONES - 1);
}

+/*
+ * Parameter block passed down to zap_pte_range in exceptional cases.
+ */
+struct zap_details {
+ struct folio *single_folio; /* Locked folio to be unmapped */
+ bool even_cows; /* Zap COWed private pages too? */
+ zap_flags_t zap_flags; /* Extra flags for zapping */
+};
+
+/*
+ * Whether to drop the pte markers, for example, the uffd-wp information for
+ * file-backed memory. This should only be specified when we will completely
+ * drop the page in the mm, either by truncation or unmapping of the vma. By
+ * default, the flag is not set.
+ */
+#define ZAP_FLAG_DROP_MARKER ((__force zap_flags_t) BIT(0))
+
#ifdef CONFIG_MMU
extern bool can_do_mlock(void);
#else
@@ -1898,6 +1915,8 @@ void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
unsigned long size);
void zap_page_range(struct vm_area_struct *vma, unsigned long address,
unsigned long size);
+void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
+ unsigned long size, struct zap_details *details);
void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
struct vm_area_struct *start_vma, unsigned long start,
unsigned long end);
@@ -3529,12 +3548,4 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
}
#endif

-/*
- * Whether to drop the pte markers, for example, the uffd-wp information for
- * file-backed memory. This should only be specified when we will completely
- * drop the page in the mm, either by truncation or unmapping of the vma. By
- * default, the flag is not set.
- */
-#define ZAP_FLAG_DROP_MARKER ((__force zap_flags_t) BIT(0))
-
#endif /* _LINUX_MM_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index 68a23104687f..b2f1860a353e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -785,8 +785,8 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
* Application no longer needs these pages. If the pages are dirty,
* it's OK to just throw them away. The app will be more careful about
* data it wants to keep. Be sure to free swap resources too. The
- * zap_page_range call sets things up for shrink_active_list to actually free
- * these pages later if no one else has touched them in the meantime,
+ * zap_page_range_single call sets things up for shrink_active_list to actually
+ * free these pages later if no one else has touched them in the meantime,
* although we could add these pages to a global reuse list for
* shrink_active_list to pick up before reclaiming other pages.
*
@@ -803,7 +803,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
- zap_page_range(vma, start, end - start);
+ zap_page_range_single(vma, start, end - start, NULL);
return 0;
}

diff --git a/mm/memory.c b/mm/memory.c
index 98ddb91df9a7..ebdbd395cfad 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1294,15 +1294,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
return ret;
}

-/*
- * Parameter block passed down to zap_pte_range in exceptional cases.
- */
-struct zap_details {
- struct folio *single_folio; /* Locked folio to be unmapped */
- bool even_cows; /* Zap COWed private pages too? */
- zap_flags_t zap_flags; /* Extra flags for zapping */
-};
-
/* Whether we should zap all COWed (private) pages too */
static inline bool should_zap_cows(struct zap_details *details)
{
@@ -1736,19 +1727,27 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
*
* The range must fit into one VMA.
*/
-static void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
+void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
unsigned long size, struct zap_details *details)
{
+ unsigned long end = address + size;
struct mmu_notifier_range range;
struct mmu_gather tlb;

lru_add_drain();
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
- address, address + size);
+ address, end);
+ if (is_vm_hugetlb_page(vma))
+ adjust_range_if_pmd_sharing_possible(vma, &range.start,
+ &range.end);
tlb_gather_mmu(&tlb, vma->vm_mm);
update_hiwater_rss(vma->vm_mm);
mmu_notifier_invalidate_range_start(&range);
- unmap_single_vma(&tlb, vma, address, range.end, details);
+ /*
+ * unmap 'address-end' not 'range.start-range.end' as range
+ * could have been expanded for hugetlb pmd sharing.
+ */
+ unmap_single_vma(&tlb, vma, address, end, details);
mmu_notifier_invalidate_range_end(&range);
tlb_finish_mmu(&tlb);
}
--
2.37.3


2022-11-12 00:54:31

by Mike Kravetz

Subject: [PATCH v9 2/3] hugetlb: remove duplicate mmu notifications

The common hugetlb unmap routine __unmap_hugepage_range performs mmu
notification calls. However, in the case where __unmap_hugepage_range
is called via __unmap_hugepage_range_final, mmu notification calls are
performed earlier in other calling routines.

Remove mmu notification calls from __unmap_hugepage_range. Add
notification calls to the only other caller: unmap_hugepage_range.
unmap_hugepage_range is called for truncation and hole punch, so
change the notification type from UNMAP to CLEAR, as this is more appropriate.
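
The resulting notification responsibility, sketched as call paths
(illustrative only):

	/*
	 * __unmap_hugepage_range_final()    notifier start/end already done
	 *   -> __unmap_hugepage_range()     earlier by its callers
	 *
	 * unmap_hugepage_range()            now does MMU_NOTIFY_CLEAR
	 *   -> __unmap_hugepage_range()     start/end itself (truncate and
	 *                                   hole punch paths)
	 */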

Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
Signed-off-by: Mike Kravetz <[email protected]>
Reported-by: Wei Chen <[email protected]>
Cc: <[email protected]>
---
mm/hugetlb.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9d765364231e..205c67c6787a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5074,7 +5074,6 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
struct page *page;
struct hstate *h = hstate_vma(vma);
unsigned long sz = huge_page_size(h);
- struct mmu_notifier_range range;
unsigned long last_addr_mask;
bool force_flush = false;

@@ -5089,13 +5088,6 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
tlb_change_page_size(tlb, sz);
tlb_start_vma(tlb, vma);

- /*
- * If sharing possible, alert mmu notifiers of worst case.
- */
- mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, mm, start,
- end);
- adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
- mmu_notifier_invalidate_range_start(&range);
last_addr_mask = hugetlb_mask_last_page(h);
address = start;
for (; address < end; address += sz) {
@@ -5180,7 +5172,6 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
if (ref_page)
break;
}
- mmu_notifier_invalidate_range_end(&range);
tlb_end_vma(tlb, vma);

/*
@@ -5208,6 +5199,7 @@ void __unmap_hugepage_range_final(struct mmu_gather *tlb,
hugetlb_vma_lock_write(vma);
i_mmap_lock_write(vma->vm_file->f_mapping);

+ /* mmu notification performed in caller */
__unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);

/*
@@ -5227,10 +5219,18 @@ void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
unsigned long end, struct page *ref_page,
zap_flags_t zap_flags)
{
+ struct mmu_notifier_range range;
struct mmu_gather tlb;

+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
+ start, end);
+ adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
+ mmu_notifier_invalidate_range_start(&range);
tlb_gather_mmu(&tlb, vma->vm_mm);
+
__unmap_hugepage_range(&tlb, vma, start, end, ref_page, zap_flags);
+
+ mmu_notifier_invalidate_range_end(&range);
tlb_finish_mmu(&tlb);
}

--
2.37.3


2022-11-12 20:07:57

by Nadav Amit

Subject: Re: [PATCH v9 0/3] fix hugetlb MADV_DONTNEED vma_lock handling

On Nov 11, 2022, at 3:26 PM, Mike Kravetz <[email protected]> wrote:

> This series addresses the issue first reported in [1], and fully
> described in patch 3. While exploring solutions to this issue,
> related problems with mmu notification calls were discovered. The
> first two patches address those issues.
>
> Previous discussions suggested further cleanup by removing the
> routine zap_page_range. This is possible because zap_page_range_single
> is now exported, and all callers of zap_page_range pass ranges entirely
> within a single vma. This work will be done in a later patch so as not
> to distract from this bug fix.
>
> [1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/
>
> Mike Kravetz (3):
> madvise: use zap_page_range_single for madvise dontneed
> hugetlb: remove duplicate mmu notifications
> hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing
>
> include/linux/mm.h | 29 +++++++++++++++++++++--------
> mm/hugetlb.c | 45 +++++++++++++++++++++++++--------------------
> mm/madvise.c | 6 +++---
> mm/memory.c | 25 ++++++++++++-------------
> 4 files changed, 61 insertions(+), 44 deletions(-)

With my limited knowledge of hugetlbfs, it all looks good.

Having said that - 2 random thoughts:

1. It is more intuitive to me to have
mmu_notifier_invalidate_range_{start|end}() next to tlb_{start|end}_vma().
I think these two should one day be combined into a single function, which
could also call adjust_range_if_pmd_sharing_possible() as needed (a rough
sketch of that idea follows below).

2. If you still have a concern about exposing zap_details, as you did in the
past (not that I care), consider putting zap_details and
zap_page_range_single() in mm/internal.h.
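
For point 1, something along these lines, purely as a hypothetical sketch
(the helper names are invented here and are not an existing kernel API):

	static void tlb_notify_start_vma(struct mmu_gather *tlb,
					 struct vm_area_struct *vma,
					 struct mmu_notifier_range *range,
					 unsigned long start, unsigned long end)
	{
		mmu_notifier_range_init(range, MMU_NOTIFY_CLEAR, 0, vma,
					vma->vm_mm, start, end);
		if (is_vm_hugetlb_page(vma))
			adjust_range_if_pmd_sharing_possible(vma,
					&range->start, &range->end);
		mmu_notifier_invalidate_range_start(range);
		tlb_start_vma(tlb, vma);
	}

	static void tlb_notify_end_vma(struct mmu_gather *tlb,
				       struct vm_area_struct *vma,
				       struct mmu_notifier_range *range)
	{
		tlb_end_vma(tlb, vma);
		mmu_notifier_invalidate_range_end(range);
	}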

Thanks,
Nadav

2022-11-14 00:58:56

by Peter Xu

Subject: Re: [PATCH v9 0/3] fix hugetlb MADV_DONTNEED vma_lock handling

On Fri, Nov 11, 2022 at 03:26:25PM -0800, Mike Kravetz wrote:
> This series addresses the issue first reported in [1], and fully
> described in patch 3. While exploring solutions to this issue,
> related problems with mmu notification calls were discovered. The
> first two patches address those issues.
>
> Previous discussions suggested further cleanup by removing the
> routine zap_page_range. This is possible because zap_page_range_single
> is now exported, and all callers of zap_page_range pass ranges entirely
> within a single vma. This work will be done in a later patch so as not
> to distract from this bug fix.
>
> [1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/
>
> Mike Kravetz (3):
> madvise: use zap_page_range_single for madvise dontneed
> hugetlb: remove duplicate mmu notifications
> hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing

Acked-by: Peter Xu <[email protected]>

--
Peter Xu


2022-11-14 09:47:52

by David Hildenbrand

Subject: Re: [PATCH v9 1/3] madvise: use zap_page_range_single for madvise dontneed

On 12.11.22 00:26, Mike Kravetz wrote:
> Expose the routine zap_page_range_single to zap a range within a single
> vma. The madvise routine madvise_dontneed_single_vma can use this
> routine as it explicitly operates on a single vma. Also, update the mmu
> notification range in zap_page_range_single to take hugetlb pmd sharing
> into account. This is required as MADV_DONTNEED supports hugetlb vmas.
>
> Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
> Signed-off-by: Mike Kravetz <[email protected]>
> Reported-by: Wei Chen <[email protected]>
> Cc: <[email protected]>


[...]

>
> -/*
> - * Parameter block passed down to zap_pte_range in exceptional cases.
> - */
> -struct zap_details {
> - struct folio *single_folio; /* Locked folio to be unmapped */
> - bool even_cows; /* Zap COWed private pages too? */
> - zap_flags_t zap_flags; /* Extra flags for zapping */
> -};
> -
> /* Whether we should zap all COWed (private) pages too */
> static inline bool should_zap_cows(struct zap_details *details)
> {
> @@ -1736,19 +1727,27 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
> *
> * The range must fit into one VMA.
> */
> -static void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> +void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> unsigned long size, struct zap_details *details)
> {
> + unsigned long end = address + size;

Could make that const.

> struct mmu_notifier_range range;
> struct mmu_gather tlb;
>
> lru_add_drain();
> mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> - address, address + size);
> + address, end);
> + if (is_vm_hugetlb_page(vma))
> + adjust_range_if_pmd_sharing_possible(vma, &range.start,
> + &range.end);
> tlb_gather_mmu(&tlb, vma->vm_mm);
> update_hiwater_rss(vma->vm_mm);
> mmu_notifier_invalidate_range_start(&range);
> - unmap_single_vma(&tlb, vma, address, range.end, details);
> + /*
> + * unmap 'address-end' not 'range.start-range.end' as range
> + * could have been expanded for hugetlb pmd sharing.
> + */
> + unmap_single_vma(&tlb, vma, address, end, details);
> mmu_notifier_invalidate_range_end(&range);
> tlb_finish_mmu(&tlb);
> }


Acked-by: David Hildenbrand <[email protected]>

--
Thanks,

David / dhildenb


2022-11-14 09:48:39

by David Hildenbrand

Subject: Re: [PATCH v9 2/3] hugetlb: remove duplicate mmu notifications

On 12.11.22 00:26, Mike Kravetz wrote:
> The common hugetlb unmap routine __unmap_hugepage_range performs mmu
> notification calls. However, in the case where __unmap_hugepage_range
> is called via __unmap_hugepage_range_final, mmu notification calls are
> performed earlier in other calling routines.
>
> Remove mmu notification calls from __unmap_hugepage_range. Add
> notification calls to the only other caller: unmap_hugepage_range.
> unmap_hugepage_range is called for truncation and hole punch, so
> change notification type from UNMAP to CLEAR as this is more appropriate.
>
> Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
> Signed-off-by: Mike Kravetz <[email protected]>
> Reported-by: Wei Chen <[email protected]>
> Cc: <[email protected]>

Why exactly do we care about stable backports here? What's the
user-visible impact?

--
Thanks,

David / dhildenb


2022-11-14 19:48:53

by Mike Kravetz

Subject: Re: [PATCH v9 2/3] hugetlb: remove duplicate mmu notifications

On 11/14/22 10:06, David Hildenbrand wrote:
> On 12.11.22 00:26, Mike Kravetz wrote:
> > The common hugetlb unmap routine __unmap_hugepage_range performs mmu
> > notification calls. However, in the case where __unmap_hugepage_range
> > is called via __unmap_hugepage_range_final, mmu notification calls are
> > performed earlier in other calling routines.
> >
> > Remove mmu notification calls from __unmap_hugepage_range. Add
> > notification calls to the only other caller: unmap_hugepage_range.
> > unmap_hugepage_range is called for truncation and hole punch, so
> > change notification type from UNMAP to CLEAR as this is more appropriate.
> >
> > Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
> > Signed-off-by: Mike Kravetz <[email protected]>
> > Reported-by: Wei Chen <[email protected]>
> > Cc: <[email protected]>
>
> Why exactly do we care about stable backports here? What's the user-visible
> impact?

I do not believe the duplicate notification calls have a user-visible impact;
they have existed for a long time without anyone noticing.

The duplication was simply noticed and cleaned up while working on this issue,
and the issue can be fixed without this change, unless someone really thinks
it needs to go to stable as well.

I will move this patch to the end of the series and drop the Fixes/Cc stable
tags. I will send out a new version later today, as I want to do another
round of testing first.
--
Mike Kravetz