LinuxLists.cc - [PATCH v2 0/3] Add hugetlb MADV

2022-02-04 15:25:18

Subject: [PATCH v2 0/3] Add hugetlb MADV_DONTNEED support

Userfaultfd selftests for hugetlb does not perform UFFD_EVENT_REMAP
testing. However, mremap support was recently added in commit
550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed
vma"). While attempting to enable mremap support in the test, it was
discovered that the mremap test indirectly depends on MADV_DONTNEED.

madvise does not allow MADV_DONTNEED for hugetlb mappings. However,
that is primarily due to the check in can_madv_lru_vma(). By simply
removing the check and adding huge page alignment, MADV_DONTNEED can
be made to work for hugetlb mappings.

Do note that there is no compelling use case for adding this support.
This was discussed in the RFC [1]. However, adding support makes sense
as it is fairly trivial and brings hugetlb functionality more in line
with 'normal' memory.

After enabling support, add selftest for MADV_DONTNEED as well as
MADV_REMOVE. Then update userfaultfd selftest.

v1 -> v2
- Use is_vm_hugetlb_page() instead of open coding vma hugetlb check.
- Add new test to .gitignore and use meaningful symbolic names (#define)
for constants used in test. Shuah
- Updated help text in userfaultfd test and modified run_vmtests to not
pass in a file for userfaultfd hugetlb test. Axel
- Added Reviewed-by for selftest patches.

RFC -> v1
- Fixed alignment issues when calling zap_page_range. Naoya
- Added checks for invalid arguments and misalignment to selftest.

[1] https://lore.kernel.org/linux-mm/[email protected]/

Mike Kravetz (3):
mm: enable MADV_DONTNEED for hugetlb mappings
selftests/vm: add hugetlb madvise MADV_DONTNEED MADV_REMOVE test
userfaultfd/selftests: enable huegtlb remap and remove event testing

mm/madvise.c | 24 +-
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 1 +
tools/testing/selftests/vm/hugetlb-madvise.c | 413 +++++++++++++++++++
tools/testing/selftests/vm/run_vmtests.sh | 15 +-
tools/testing/selftests/vm/userfaultfd.c | 69 ++--
6 files changed, 485 insertions(+), 38 deletions(-)
create mode 100644 tools/testing/selftests/vm/hugetlb-madvise.c

--
2.34.1

2022-02-07 14:43:56

by Mike Kravetz

[permalink] [raw]

Subject: [PATCH v2 1/3] mm: enable MADV_DONTNEED for hugetlb mappings

MADV_DONTNEED is currently disabled for hugetlb mappings. This
certainly makes sense in shared file mappings as the pagecache maintains
a reference to the page and it will never be freed. However, it could
be useful to unmap and free pages in private mappings.

The only thing preventing MADV_DONTNEED from working on hugetlb mappings
is a check in can_madv_lru_vma(). To allow support for hugetlb mappings
create and use a new routine madvise_dontneed_free_valid_vma() that will
allow hugetlb mappings. Also, before calling zap_page_range in the
DONTNEED case align start and size to huge page size for hugetlb vmas.
madvise only requires PAGE_SIZE alignment, but the hugetlb unmap routine
requires huge page size alignment.

Signed-off-by: Mike Kravetz <[email protected]>
---
mm/madvise.c | 24 ++++++++++++++++++++++--
1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 5604064df464..7ae891e030a4 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -796,10 +796,30 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
+ /*
+ * start and size (end - start) must be huge page size aligned
+ * for hugetlb vmas.
+ */
+ if (is_vm_hugetlb_page(vma)) {
+ struct hstate *h = hstate_vma(vma);
+
+ start = ALIGN_DOWN(start, huge_page_size(h));
+ end = ALIGN(end, huge_page_size(h));
+ }
+
zap_page_range(vma, start, end - start);
return 0;
}

+static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma,
+ int behavior)
+{
+ if (is_vm_hugetlb_page(vma))
+ return behavior == MADV_DONTNEED;
+ else
+ return can_madv_lru_vma(vma);
+}
+
static long madvise_dontneed_free(struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start, unsigned long end,
@@ -808,7 +828,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
struct mm_struct *mm = vma->vm_mm;

*prev = vma;
- if (!can_madv_lru_vma(vma))
+ if (!madvise_dontneed_free_valid_vma(vma, behavior))
return -EINVAL;

if (!userfaultfd_remove(vma, start, end)) {
@@ -830,7 +850,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
*/
return -ENOMEM;
}
- if (!can_madv_lru_vma(vma))
+ if (!madvise_dontneed_free_valid_vma(vma, behavior))
return -EINVAL;
if (end > vma->vm_end) {
/*
--
2.34.1

2022-02-10 05:51:55

by Peter Xu

[permalink] [raw]

Subject: Re: [PATCH v2 1/3] mm: enable MADV_DONTNEED for hugetlb mappings

(Sorry for the late comment)

On Tue, Feb 01, 2022 at 05:40:32PM -0800, Mike Kravetz wrote:
> MADV_DONTNEED is currently disabled for hugetlb mappings. This
> certainly makes sense in shared file mappings as the pagecache maintains
> a reference to the page and it will never be freed. However, it could
> be useful to unmap and free pages in private mappings.
>
> The only thing preventing MADV_DONTNEED from working on hugetlb mappings
> is a check in can_madv_lru_vma(). To allow support for hugetlb mappings
> create and use a new routine madvise_dontneed_free_valid_vma() that will
> allow hugetlb mappings. Also, before calling zap_page_range in the
> DONTNEED case align start and size to huge page size for hugetlb vmas.
> madvise only requires PAGE_SIZE alignment, but the hugetlb unmap routine
> requires huge page size alignment.
>
> Signed-off-by: Mike Kravetz <[email protected]>
> ---
> mm/madvise.c | 24 ++++++++++++++++++++++--
> 1 file changed, 22 insertions(+), 2 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 5604064df464..7ae891e030a4 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -796,10 +796,30 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
> static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> unsigned long start, unsigned long end)
> {
> + /*
> + * start and size (end - start) must be huge page size aligned
> + * for hugetlb vmas.
> + */
> + if (is_vm_hugetlb_page(vma)) {
> + struct hstate *h = hstate_vma(vma);
> +
> + start = ALIGN_DOWN(start, huge_page_size(h));
> + end = ALIGN(end, huge_page_size(h));
> + }
> +

Maybe check the alignment before userfaultfd_remove()? Otherwise there'll be a
fake message generated to the tracer.

> zap_page_range(vma, start, end - start);
> return 0;
> }
>
> +static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma,
> + int behavior)
> +{
> + if (is_vm_hugetlb_page(vma))
> + return behavior == MADV_DONTNEED;
> + else
> + return can_madv_lru_vma(vma);
> +}

can_madv_lru_vma() will check hugetlb again which looks a bit weird. Would it
look better to write it as:

madvise_dontneed_free_valid_vma()
{
return !(vma->vm_flags & (VM_LOCKED|VM_PFNMAP));
}

can_madv_lru_vma()
{
return madvise_dontneed_free_valid_vma() && !is_vm_hugetlb_page(vma);
}

?

Another use case of DONTNEED upon hugetlbfs could be uffd-minor, because afaiu
this is the only api that can force strip the hugetlb mapped pgtable without
losing pagecache data.

Thanks,

--
Peter Xu

2022-02-11 03:51:09

by Mike Kravetz

[permalink] [raw]

Subject: Re: [PATCH v2 1/3] mm: enable MADV_DONTNEED for hugetlb mappings

On 2/9/22 19:21, Peter Xu wrote:
> (Sorry for the late comment)

Thanks for taking a look.

>
> On Tue, Feb 01, 2022 at 05:40:32PM -0800, Mike Kravetz wrote:
>> MADV_DONTNEED is currently disabled for hugetlb mappings. This
>> certainly makes sense in shared file mappings as the pagecache maintains
>> a reference to the page and it will never be freed. However, it could
>> be useful to unmap and free pages in private mappings.
>>
>> The only thing preventing MADV_DONTNEED from working on hugetlb mappings
>> is a check in can_madv_lru_vma(). To allow support for hugetlb mappings
>> create and use a new routine madvise_dontneed_free_valid_vma() that will
>> allow hugetlb mappings. Also, before calling zap_page_range in the
>> DONTNEED case align start and size to huge page size for hugetlb vmas.
>> madvise only requires PAGE_SIZE alignment, but the hugetlb unmap routine
>> requires huge page size alignment.
>>
>> Signed-off-by: Mike Kravetz <[email protected]>
>> ---
>> mm/madvise.c | 24 ++++++++++++++++++++++--
>> 1 file changed, 22 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 5604064df464..7ae891e030a4 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -796,10 +796,30 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
>> static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
>> unsigned long start, unsigned long end)
>> {
>> + /*
>> + * start and size (end - start) must be huge page size aligned
>> + * for hugetlb vmas.
>> + */
>> + if (is_vm_hugetlb_page(vma)) {
>> + struct hstate *h = hstate_vma(vma);
>> +
>> + start = ALIGN_DOWN(start, huge_page_size(h));
>> + end = ALIGN(end, huge_page_size(h));
>> + }
>> +
>
> Maybe check the alignment before userfaultfd_remove()? Otherwise there'll be a
> fake message generated to the tracer.

Yes, we should pass the aligned addresses to userfaultfd_remove. We will
also need to potentially align again after the call.

>
>> zap_page_range(vma, start, end - start);
>> return 0;
>> }
>>
>> +static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma,
>> + int behavior)
>> +{
>> + if (is_vm_hugetlb_page(vma))
>> + return behavior == MADV_DONTNEED;
>> + else
>> + return can_madv_lru_vma(vma);
>> +}
>
> can_madv_lru_vma() will check hugetlb again which looks a bit weird. Would it
> look better to write it as:
>
> madvise_dontneed_free_valid_vma()
> {
> return !(vma->vm_flags & (VM_LOCKED|VM_PFNMAP));
> }
>
> can_madv_lru_vma()
> {
> return madvise_dontneed_free_valid_vma() && !is_vm_hugetlb_page(vma);
> }
>
> ?

Yes, that would look better.

>
> Another use case of DONTNEED upon hugetlbfs could be uffd-minor, because afaiu
> this is the only api that can force strip the hugetlb mapped pgtable without
> losing pagecache data.
>

Correct. However, I do not know if uffd-minor users would ever want to
do this. Perhaps?
--
Mike Kravetz

2022-02-11 21:23:04

by Peter Xu

[permalink] [raw]

Subject: Re: [PATCH v2 1/3] mm: enable MADV_DONTNEED for hugetlb mappings

On Thu, Feb 10, 2022 at 01:36:57PM -0800, Mike Kravetz wrote:
> > Another use case of DONTNEED upon hugetlbfs could be uffd-minor, because afaiu
> > this is the only api that can force strip the hugetlb mapped pgtable without
> > losing pagecache data.
>
> Correct. However, I do not know if uffd-minor users would ever want to
> do this. Perhaps?

My understanding is before this patch uffd-minor upon hugetlbfs requires the
huge file to be mapped twice, one to populate the content, then we'll be able
to trap MINOR faults via the other mapping. Or we could munmap() the range and
remap it again on the same file offset to drop the pgtables, I think. But that
sounds tricky. MINOR faults only works with pgtables dropped.

With DONTNEED upon hugetlbfs we can rely on one single mapping of the file,
because we can explicitly drop the pgtables of hugetlbfs files without any
other tricks.

However I have no real use case of it. Initially I thought it could be useful
for QEMU because QEMU migration routine is run with the same mm context with
the hypervisor, so by default is doesn't have two mappings of the same guest
memory. If QEMU wants to leverage minor faults, DONTNEED could help.

However when I was measuring bitmap transfer (assuming that's what minor fault
could help with qemu's postcopy) there some months ago I found it's not as slow
as I thought at all.. Either I could have missed something, or we're facing
different problems with what it is when uffd minor is firstly proposed by Axel.

This is probably too out of topic, though.. Let me go back..

Said that, one thing I'm not sure about DONTNEED on hugetlb is whether this
could further abuse DONTNEED, as the original POSIX definition is as simple as:

The application expects that it will not access the specified address range
in the near future.

Linux did it by tearing down pgtable, which looks okay so far. It could be a
bit more weird to apply it to hugetlbfs because from its definition it's a hint
to page reclaims, however hugetlbfs is not a target of page reclaim, neither is
it LRU-aware. It goes further into some MADV_ZAP styled syscall.

I think it could still be fine as posix doesn't define that behavior
specifically on hugetlb so it can be defined by Linux, but not sure whether
there can be other implications.

Thanks,

--
Peter Xu

2022-02-12 10:22:57

by Mike Kravetz

[permalink] [raw]

Subject: Re: [PATCH v2 1/3] mm: enable MADV_DONTNEED for hugetlb mappings

On 2/11/22 11:08, Axel Rasmussen wrote:
> On Thu, Feb 10, 2022 at 6:29 PM Peter Xu <[email protected]> wrote:
>>
>> On Thu, Feb 10, 2022 at 01:36:57PM -0800, Mike Kravetz wrote:
>>>> Another use case of DONTNEED upon hugetlbfs could be uffd-minor, because afaiu
>>>> this is the only api that can force strip the hugetlb mapped pgtable without
>>>> losing pagecache data.
>>>
>>> Correct. However, I do not know if uffd-minor users would ever want to
>>> do this. Perhaps?
>
> I talked with some colleagues, and I didn't come up with any
> production *requirement* for it, but it may be a convenience in some
> cases (make certain code cleaner, e.g. not having to unmap-and-remap
> to tear down page tables as Peter mentioned). I think Peter's
> assessment below is right.
>
>>
>> My understanding is before this patch uffd-minor upon hugetlbfs requires the
>> huge file to be mapped twice, one to populate the content, then we'll be able
>> to trap MINOR faults via the other mapping. Or we could munmap() the range and
>> remap it again on the same file offset to drop the pgtables, I think. But that
>> sounds tricky. MINOR faults only works with pgtables dropped.
>>
>> With DONTNEED upon hugetlbfs we can rely on one single mapping of the file,
>> because we can explicitly drop the pgtables of hugetlbfs files without any
>> other tricks.
>>
>> However I have no real use case of it. Initially I thought it could be useful
>> for QEMU because QEMU migration routine is run with the same mm context with
>> the hypervisor, so by default is doesn't have two mappings of the same guest
>> memory. If QEMU wants to leverage minor faults, DONTNEED could help.).
>>
>> However when I was measuring bitmap transfer (assuming that's what minor fault
>> could help with qemu's postcopy) there some months ago I found it's not as slow
>> as I thought at all.. Either I could have missed something, or we're facing
>> different problems with what it is when uffd minor is firstly proposed by Axel.
>
> Re: the bitmap, that matters most on machines with lots of RAM. For
> example, GCE offers some VMs with up to 12 *TB* of RAM
> (https://cloud.google.com/compute/docs/memory-optimized-machines), I
> think with this size of machine we see a significant benefit, as it
> may take some significant time for the bitmap to arrive over the
> network.
>
> But I think that's a bit of an edge case, most machines are not that
> big. :) I think the benefit is more often seen just in avoiding
> copies. E.g. if we find a page is already up-to-date after precopy, we
> just install PTEs, no copying or page allocation needed. And even when
> we have to go fetch a page over the network, one can imagine an RDMA
> setup where we can avoid any copies/allocations at all even in that
> case. I suppose this also has a bigger effect on larger machines, e.g.
> ones that are backed by 1G pages instead of 4k.
>

Thanks Peter and Axel!

As mentioned, my primary motivation was to simply clean up the userfaultfd
selftest. Glad to see there might be more use cases. If we can simplify
other code as in the case of userfaultfd selftest, that would be a win.

--
Mike Kravetz

2022-02-14 19:42:26

by Axel Rasmussen

[permalink] [raw]

Subject: Re: [PATCH v2 1/3] mm: enable MADV_DONTNEED for hugetlb mappings

On Thu, Feb 10, 2022 at 6:29 PM Peter Xu <[email protected]> wrote:
>
> On Thu, Feb 10, 2022 at 01:36:57PM -0800, Mike Kravetz wrote:
> > > Another use case of DONTNEED upon hugetlbfs could be uffd-minor, because afaiu
> > > this is the only api that can force strip the hugetlb mapped pgtable without
> > > losing pagecache data.
> >
> > Correct. However, I do not know if uffd-minor users would ever want to
> > do this. Perhaps?

I talked with some colleagues, and I didn't come up with any
production *requirement* for it, but it may be a convenience in some
cases (make certain code cleaner, e.g. not having to unmap-and-remap
to tear down page tables as Peter mentioned). I think Peter's
assessment below is right.

>
> My understanding is before this patch uffd-minor upon hugetlbfs requires the
> huge file to be mapped twice, one to populate the content, then we'll be able
> to trap MINOR faults via the other mapping. Or we could munmap() the range and
> remap it again on the same file offset to drop the pgtables, I think. But that
> sounds tricky. MINOR faults only works with pgtables dropped.
>
> With DONTNEED upon hugetlbfs we can rely on one single mapping of the file,
> because we can explicitly drop the pgtables of hugetlbfs files without any
> other tricks.
>
> However I have no real use case of it. Initially I thought it could be useful
> for QEMU because QEMU migration routine is run with the same mm context with
> the hypervisor, so by default is doesn't have two mappings of the same guest
> memory. If QEMU wants to leverage minor faults, DONTNEED could help.).
>
> However when I was measuring bitmap transfer (assuming that's what minor fault
> could help with qemu's postcopy) there some months ago I found it's not as slow
> as I thought at all.. Either I could have missed something, or we're facing
> different problems with what it is when uffd minor is firstly proposed by Axel.

Re: the bitmap, that matters most on machines with lots of RAM. For
example, GCE offers some VMs with up to 12 *TB* of RAM
(https://cloud.google.com/compute/docs/memory-optimized-machines), I
think with this size of machine we see a significant benefit, as it
may take some significant time for the bitmap to arrive over the
network.

But I think that's a bit of an edge case, most machines are not that
big. :) I think the benefit is more often seen just in avoiding
copies. E.g. if we find a page is already up-to-date after precopy, we
just install PTEs, no copying or page allocation needed. And even when
we have to go fetch a page over the network, one can imagine an RDMA
setup where we can avoid any copies/allocations at all even in that
case. I suppose this also has a bigger effect on larger machines, e.g.
ones that are backed by 1G pages instead of 4k.

>
> This is probably too out of topic, though.. Let me go back..
>
> Said that, one thing I'm not sure about DONTNEED on hugetlb is whether this
> could further abuse DONTNEED, as the original POSIX definition is as simple as:
>
> The application expects that it will not access the specified address range
> in the near future.
>
> Linux did it by tearing down pgtable, which looks okay so far. It could be a
> bit more weird to apply it to hugetlbfs because from its definition it's a hint
> to page reclaims, however hugetlbfs is not a target of page reclaim, neither is
> it LRU-aware. It goes further into some MADV_ZAP styled syscall.
>
> I think it could still be fine as posix doesn't define that behavior
> specifically on hugetlb so it can be defined by Linux, but not sure whether
> there can be other implications.
>
> Thanks,
>
> --
> Peter Xu
>