2023-12-14 22:35:04

by Yang Shi

Subject: [RESEND PATCH] mm: align larger anonymous mappings on THP boundaries

From: Rik van Riel <[email protected]>

Align larger anonymous memory mappings on THP boundaries by going through
thp_get_unmapped_area if THPs are enabled for the current process.

With this patch, larger anonymous mappings are now THP aligned. When a
malloc library allocates a 2MB or larger arena, that arena can now be
mapped with THPs right from the start, which can result in better TLB hit
rates and execution time.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Rik van Riel <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christopher Lameter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
This patch was applied to v6.1, but was reverted due to a regression
report. However it turned out the regression was not due to this patch.
I pinged Andrew to reapply this patch, but it may have been forgotten. This
patch helps promote THP, so I rebased it onto the latest mm-unstable.


mm/mmap.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index 9d780f415be3..dd25a2aa94f7 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2232,6 +2232,9 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 		 */
 		pgoff = 0;
 		get_area = shmem_get_unmapped_area;
+	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+		/* Ensures that larger anonymous mappings are THP aligned. */
+		get_area = thp_get_unmapped_area;
 	}
 
 	addr = get_area(file, addr, len, pgoff, flags);
--
2.41.0
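
A minimal user-space sketch of the intended effect (assuming THPs are enabled
for the process and the kernel includes this patch): a 2MB-or-larger anonymous
mapping should come back 2MB (PMD) aligned.

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 4UL << 20;	/* 4MB, larger than a 2MB PMD */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	/* With THP alignment, the start address is a multiple of 2MB. */
	printf("addr %p %s 2MB aligned\n", p,
	       ((unsigned long)p & ((2UL << 20) - 1)) ? "is NOT" : "is");
	munmap(p, len);
	return 0;
}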



2024-01-20 12:05:03

by Ryan Roberts

Subject: Re: [RESEND PATCH] mm: align larger anonymous mappings on THP boundaries

On 14/12/2023 22:34, Yang Shi wrote:
> From: Rik van Riel <[email protected]>
>
> Align larger anonymous memory mappings on THP boundaries by going through
> thp_get_unmapped_area if THPs are enabled for the current process.
>
> With this patch, larger anonymous mappings are now THP aligned. When a
> malloc library allocates a 2MB or larger arena, that arena can now be
> mapped with THPs right from the start, which can result in better TLB hit
> rates and execution time.
>
> Link: https://lkml.kernel.org/r/[email protected]
> Signed-off-by: Rik van Riel <[email protected]>
> Reviewed-by: Yang Shi <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Christopher Lameter <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
> This patch was applied to v6.1, but was reverted due to a regression
> report. However it turned out the regression was not due to this patch.
> I ping'ed Andrew to reapply this patch, Andrew may forget it. This
> patch helps promote THP, so I rebased it onto the latest mm-unstable.

Hi Yang,

I'm not sure what regression you are referring to above, but I'm seeing a
performance regression in the virtual_address_range mm selftest on arm64, caused
by this patch (which is now in v6.7).

I see two problems when running the test: 1) it takes much longer to execute, and
2) the test fails. Both are related:

The (first part of the) test allocates as many 1GB anonymous blocks as it can in
the low 256TB of address space, passing NULL as the addr hint to mmap. Before
this patch, all allocations abutted each other and were contained in a single, merged VMA.
However, after this patch, each allocation is in its own VMA, and there is a 2M
gap between each VMA. This causes 2 problems: 1) mmap becomes MUCH slower
because there are so many VMAs to check to find a new 1G gap. 2) It fails once
it hits the VMA limit (/proc/sys/vm/max_map_count). Hitting this limit then
causes a subsequent calloc() to fail, which causes the test to fail.
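
For reference, the first part of the test is roughly the following loop (a
simplified sketch, not the actual selftest source):

#include <stdio.h>
#include <sys/mman.h>

#define SZ_1G	(1UL << 30)

int main(void)
{
	unsigned long count = 0;

	/* Keep asking for 1GB anonymous blocks with a NULL hint. */
	while (mmap(NULL, SZ_1G, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0) != MAP_FAILED)
		count++;

	/*
	 * Before the patch these blocks merge into one VMA; after it, each
	 * block sits in its own VMA with a 2M gap, so max_map_count is hit
	 * long before the low 256TB of address space is exhausted.
	 */
	printf("mapped %lu 1GB blocks\n", count);
	return 0;
}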

Looking at the code, I think the problem is that arm64 selects
ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT. But __thp_get_unmapped_area() allocates
len+2M then always aligns to the bottom of the discovered gap. That causes the
2M hole. As far as I can see, x86 allocates bottom up, so you don't get a hole.
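
To make the arithmetic concrete, here is the top-down case with illustrative
numbers (a simplified model of the allocator; P is a hypothetical, 2MB-aligned
start of the VMA immediately above the gap):

#include <stdio.h>

int main(void)
{
	unsigned long P = 0x8000000000UL;	/* low end of the VMA above */
	unsigned long len = 1UL << 30;		/* 1GB request */
	unsigned long pad = 2UL << 20;		/* 2MB of padding */

	/* The allocator returns the bottom of a len+2M gap ending at P. */
	unsigned long ret = P - (len + pad);

	/*
	 * __thp_get_unmapped_area() then rounds ret up to a 2MB boundary;
	 * here ret is already aligned, so the mapping ends at P - 2MB and
	 * a 2MB hole is left below the VMA above.
	 */
	printf("mapping [%#lx, %#lx), hole of %#lx below %#lx\n",
	       ret, ret + len, P - (ret + len), P);
	return 0;
}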

I'm not quite sure what the fix is - perhaps __thp_get_unmapped_area() should be
implemented around vm_unmapped_area(), which can manage the alignment more
intelligently?

But until/unless someone comes along with a fix, I think this patch should be
reverted.

Thanks,
Ryan


>
>
> mm/mmap.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 9d780f415be3..dd25a2aa94f7 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2232,6 +2232,9 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
> */
> pgoff = 0;
> get_area = shmem_get_unmapped_area;
> + } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> + /* Ensures that larger anonymous mappings are THP aligned. */
> + get_area = thp_get_unmapped_area;
> }
>
> addr = get_area(file, addr, len, pgoff, flags);


2024-01-20 12:13:41

by Ryan Roberts

Subject: Re: [RESEND PATCH] mm: align larger anonymous mappings on THP boundaries

On 20/01/2024 12:04, Ryan Roberts wrote:
> On 14/12/2023 22:34, Yang Shi wrote:
>> From: Rik van Riel <[email protected]>
>>
>> Align larger anonymous memory mappings on THP boundaries by going through
>> thp_get_unmapped_area if THPs are enabled for the current process.
>>
>> With this patch, larger anonymous mappings are now THP aligned. When a
>> malloc library allocates a 2MB or larger arena, that arena can now be
>> mapped with THPs right from the start, which can result in better TLB hit
>> rates and execution time.
>>
>> Link: https://lkml.kernel.org/r/[email protected]
>> Signed-off-by: Rik van Riel <[email protected]>
>> Reviewed-by: Yang Shi <[email protected]>
>> Cc: Matthew Wilcox <[email protected]>
>> Cc: Christopher Lameter <[email protected]>
>> Signed-off-by: Andrew Morton <[email protected]>
>> ---
>> This patch was applied to v6.1, but was reverted due to a regression
>> report. However it turned out the regression was not due to this patch.
>> I ping'ed Andrew to reapply this patch, Andrew may forget it. This
>> patch helps promote THP, so I rebased it onto the latest mm-unstable.
>
> Hi Yang,
>
> I'm not sure what regression you are referring to above, but I'm seeing a
> performance regression in the virtual_address_range mm selftest on arm64, caused
> by this patch (which is now in v6.7).
>
> I see 2 problems when running the test; 1) it takes much longer to execute, and
> 2) the test fails. Both are related:
>
> The (first part of the) test allocates as many 1GB anonymous blocks as it can in
> the low 256TB of address space, passing NULL as the addr hint to mmap. Before
> this patch, all allocations were abutted and contained in a single, merged VMA.
> However, after this patch, each allocation is in its own VMA, and there is a 2M
> gap between each VMA. This causes 2 problems: 1) mmap becomes MUCH slower
> because there are so many VMAs to check to find a new 1G gap. 2) It fails once
> it hits the VMA limit (/proc/sys/vm/max_map_count). Hitting this limit then
> causes a subsequent calloc() to fail, which causes the test to fail.
>
> Looking at the code, I think the problem is that arm64 selects
> ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT. But __thp_get_unmapped_area() allocates
> len+2M then always aligns to the bottom of the discovered gap. That causes the
> 2M hole. As far as I can see, x86 allocates bottom up, so you don't get a hole.
>
> I'm not quite sure what the fix is - perhaps __thp_get_unmapped_area() should be
> implemented around vm_unmapped_area(), which can manage the alignment more
> intelligently?
>
> But until/unless someone comes along with a fix, I think this patch should be
> reverted.

Looks like this patch is also the cause of `ksm_tests -H -s 100` starting to
fail on arm64. I haven't looked in detail, but it passes without the change and
fails with it. So this should definitely be reverted, I think.


>
> Thanks,
> Ryan
>
>
>>
>>
>> mm/mmap.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index 9d780f415be3..dd25a2aa94f7 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -2232,6 +2232,9 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
>> */
>> pgoff = 0;
>> get_area = shmem_get_unmapped_area;
>> + } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
>> + /* Ensures that larger anonymous mappings are THP aligned. */
>> + get_area = thp_get_unmapped_area;
>> }
>>
>> addr = get_area(file, addr, len, pgoff, flags);
>


2024-01-20 16:40:02

by Matthew Wilcox

Subject: Re: [RESEND PATCH] mm: align larger anonymous mappings on THP boundaries

On Sat, Jan 20, 2024 at 12:04:27PM +0000, Ryan Roberts wrote:
> However, after this patch, each allocation is in its own VMA, and there is a 2M
> gap between each VMA. This causes 2 problems: 1) mmap becomes MUCH slower
> because there are so many VMAs to check to find a new 1G gap. 2) It fails once
> it hits the VMA limit (/proc/sys/vm/max_map_count). Hitting this limit then
> causes a subsequent calloc() to fail, which causes the test to fail.
>
> Looking at the code, I think the problem is that arm64 selects
> ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT. But __thp_get_unmapped_area() allocates
> len+2M then always aligns to the bottom of the discovered gap. That causes the
> 2M hole. As far as I can see, x86 allocates bottom up, so you don't get a hole.

As a quick hack, perhaps
#ifdef ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
take-the-top-half
#else
current-take-bottom-half-code
#endif

?

2024-01-22 11:42:54

by Ryan Roberts

Subject: Re: [RESEND PATCH] mm: align larger anonymous mappings on THP boundaries

On 20/01/2024 16:39, Matthew Wilcox wrote:
> On Sat, Jan 20, 2024 at 12:04:27PM +0000, Ryan Roberts wrote:
>> However, after this patch, each allocation is in its own VMA, and there is a 2M
>> gap between each VMA. This causes 2 problems: 1) mmap becomes MUCH slower
>> because there are so many VMAs to check to find a new 1G gap. 2) It fails once
>> it hits the VMA limit (/proc/sys/vm/max_map_count). Hitting this limit then
>> causes a subsequent calloc() to fail, which causes the test to fail.
>>
>> Looking at the code, I think the problem is that arm64 selects
>> ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT. But __thp_get_unmapped_area() allocates
>> len+2M then always aligns to the bottom of the discovered gap. That causes the
>> 2M hole. As far as I can see, x86 allocates bottom up, so you don't get a hole.
>
> As a quick hack, perhaps
> #ifdef ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
> take-the-top-half
> #else
> current-take-bottom-half-code
> #endif
>
> ?

There is a general problem though that there is a trade-off between abutting
VMAs, and aligning them to PMD boundaries. This patch has decided that in
general the latter is preferable. The case I'm hitting is special though, in
that both requirements could be achieved but currently are not.

The below fixes it, but I feel like there should be some bitwise magic that
would give the correct answer without the conditional - but my head is gone and
I can't see it. Any thoughts?

Beyond this, though, there is also a latent bug where the offset provided to
mmap() is carried all the way through to the get_unmapped_area()
implementation, even for MAP_ANONYMOUS - I'm pretty sure we should be
force-zeroing it for MAP_ANONYMOUS? Certainly before this change, for arches
that use the default get_unmapped_area(), any non-zero offset would not have
been used. But this change starts using it, which is incorrect. That said, there
are some arches that override the default get_unmapped_area() and do use the
offset. So I'm not sure if this is a bug or a feature that user space can pass
an arbitrary value to the implementation for anon memory??

Finally, the second test failure I reported (ksm_tests) is actually caused by a
bug in the test code, but provoked by this change. So I'll send out a fix for
the test code separately.


diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4f542444a91f..68ac54117c77 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -632,7 +632,7 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
 {
 	loff_t off_end = off + len;
 	loff_t off_align = round_up(off, size);
-	unsigned long len_pad, ret;
+	unsigned long len_pad, ret, off_sub;
 
 	if (off_end <= off_align || (off_end - off_align) < size)
 		return 0;
@@ -658,7 +658,13 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
 	if (ret == addr)
 		return addr;
 
-	ret += (off - ret) & (size - 1);
+	off_sub = (off - ret) & (size - 1);
+
+	if (current->mm->get_unmapped_area == arch_get_unmapped_area_topdown &&
+	    !off_sub)
+		return ret + size;
+
+	ret += off_sub;
 	return ret;
 }
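
With off == 0 and a top-down gap whose bottom already lands 2MB aligned
(off_sub == 0), the current code keeps ret as-is and leaves a 2MB hole between
the top of the new mapping and the VMA above it; the new branch returns
ret + size instead, so the mapping is still 2MB aligned but now abuts the VMA
above, which is the abutting behaviour the virtual_address_range test relies on.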

2024-01-22 19:47:29

by Yang Shi

Subject: Re: [RESEND PATCH] mm: align larger anonymous mappings on THP boundaries

On Mon, Jan 22, 2024 at 3:37 AM Ryan Roberts <[email protected]> wrote:
>
> On 20/01/2024 16:39, Matthew Wilcox wrote:
> > On Sat, Jan 20, 2024 at 12:04:27PM +0000, Ryan Roberts wrote:
> >> However, after this patch, each allocation is in its own VMA, and there is a 2M
> >> gap between each VMA. This causes 2 problems: 1) mmap becomes MUCH slower
> >> because there are so many VMAs to check to find a new 1G gap. 2) It fails once
> >> it hits the VMA limit (/proc/sys/vm/max_map_count). Hitting this limit then
> >> causes a subsequent calloc() to fail, which causes the test to fail.
> >>
> >> Looking at the code, I think the problem is that arm64 selects
> >> ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT. But __thp_get_unmapped_area() allocates
> >> len+2M then always aligns to the bottom of the discovered gap. That causes the
> >> 2M hole. As far as I can see, x86 allocates bottom up, so you don't get a hole.
> >
> > As a quick hack, perhaps
> > #ifdef ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
> > take-the-top-half
> > #else
> > current-take-bottom-half-code
> > #endif
> >
> > ?

Thanks for the suggestion. It makes sense to me. Doing the alignment
needs to take this into account.

>
> There is a general problem though that there is a trade-off between abutting
> VMAs, and aligning them to PMD boundaries. This patch has decided that in
> general the latter is preferable. The case I'm hitting is special though, in
> that both requirements could be achieved but currently are not.
>
> The below fixes it, but I feel like there should be some bitwise magic that
> would give the correct answer without the conditional - but my head is gone and
> I can't see it. Any thoughts?

Thanks Ryan for the patch. TBH I didn't see any bitwise magic without
the conditional either.

>
> Beyond this, though, there is also a latent bug where the offset provided to
> mmap() is carried all the way through to the get_unmapped_area()
> impelementation, even for MAP_ANONYMOUS - I'm pretty sure we should be
> force-zeroing it for MAP_ANONYMOUS? Certainly before this change, for arches
> that use the default get_unmapped_area(), any non-zero offset would not have
> been used. But this change starts using it, which is incorrect. That said, there
> are some arches that override the default get_unmapped_area() and do use the
> offset. So I'm not sure if this is a bug or a feature that user space can pass
> an arbitrary value to the implementation for anon memory??

Thanks for noticing this. If I read the code correctly, the pgoff is used
by some arches to work around VIPT caches, and it looks like it is for
shared mappings only (I just checked arm and mips). And I believe
everybody assumes 0 should be used when doing an anonymous mapping. The
offset should have nothing to do with seeking a proper unmapped virtual
area. But the pgoff does make sense for file THP due to the alignment
requirements. I think it should be zeroed for anonymous mappings,
like:

diff --git a/mm/mmap.c b/mm/mmap.c
index 2ff79b1d1564..a9ed353ce627 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1830,6 +1830,7 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 		pgoff = 0;
 		get_area = shmem_get_unmapped_area;
 	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+		pgoff = 0;
 		/* Ensures that larger anonymous mappings are THP aligned. */
 		get_area = thp_get_unmapped_area;
 	}

>
> Finally, the second test failure I reported (ksm_tests) is actually caused by a
> bug in the test code, but provoked by this change. So I'll send out a fix for
> the test code separately.
>
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 4f542444a91f..68ac54117c77 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -632,7 +632,7 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
> {
> loff_t off_end = off + len;
> loff_t off_align = round_up(off, size);
> - unsigned long len_pad, ret;
> + unsigned long len_pad, ret, off_sub;
>
> if (off_end <= off_align || (off_end - off_align) < size)
> return 0;
> @@ -658,7 +658,13 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
> if (ret == addr)
> return addr;
>
> - ret += (off - ret) & (size - 1);
> + off_sub = (off - ret) & (size - 1);
> +
> + if (current->mm->get_unmapped_area == arch_get_unmapped_area_topdown &&
> + !off_sub)
> + return ret + size;
> +
> + ret += off_sub;
> return ret;
> }

I didn't spot any problem; would you please come up with a formal patch?

2024-01-22 20:20:49

by Yang Shi

Subject: Re: [RESEND PATCH] mm: align larger anonymous mappings on THP boundaries

On Mon, Jan 22, 2024 at 3:37 AM Ryan Roberts <[email protected]> wrote:
>
> On 20/01/2024 16:39, Matthew Wilcox wrote:
> > On Sat, Jan 20, 2024 at 12:04:27PM +0000, Ryan Roberts wrote:
> >> However, after this patch, each allocation is in its own VMA, and there is a 2M
> >> gap between each VMA. This causes 2 problems: 1) mmap becomes MUCH slower
> >> because there are so many VMAs to check to find a new 1G gap. 2) It fails once
> >> it hits the VMA limit (/proc/sys/vm/max_map_count). Hitting this limit then
> >> causes a subsequent calloc() to fail, which causes the test to fail.
> >>
> >> Looking at the code, I think the problem is that arm64 selects
> >> ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT. But __thp_get_unmapped_area() allocates
> >> len+2M then always aligns to the bottom of the discovered gap. That causes the
> >> 2M hole. As far as I can see, x86 allocates bottom up, so you don't get a hole.
> >
> > As a quick hack, perhaps
> > #ifdef ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
> > take-the-top-half
> > #else
> > current-take-bottom-half-code
> > #endif
> >
> > ?
>
> There is a general problem though that there is a trade-off between abutting
> VMAs, and aligning them to PMD boundaries. This patch has decided that in
> general the latter is preferable. The case I'm hitting is special though, in
> that both requirements could be achieved but currently are not.
>
> The below fixes it, but I feel like there should be some bitwise magic that
> would give the correct answer without the conditional - but my head is gone and
> I can't see it. Any thoughts?
>
> Beyond this, though, there is also a latent bug where the offset provided to
> mmap() is carried all the way through to the get_unmapped_area()
> impelementation, even for MAP_ANONYMOUS - I'm pretty sure we should be
> force-zeroing it for MAP_ANONYMOUS? Certainly before this change, for arches
> that use the default get_unmapped_area(), any non-zero offset would not have
> been used. But this change starts using it, which is incorrect. That said, there
> are some arches that override the default get_unmapped_area() and do use the
> offset. So I'm not sure if this is a bug or a feature that user space can pass
> an arbitrary value to the implementation for anon memory??
>
> Finally, the second test failure I reported (ksm_tests) is actually caused by a
> bug in the test code, but provoked by this change. So I'll send out a fix for
> the test code separately.

Thanks for figuring this out.

>
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 4f542444a91f..68ac54117c77 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -632,7 +632,7 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
> {
> loff_t off_end = off + len;
> loff_t off_align = round_up(off, size);
> - unsigned long len_pad, ret;
> + unsigned long len_pad, ret, off_sub;
>
> if (off_end <= off_align || (off_end - off_align) < size)
> return 0;
> @@ -658,7 +658,13 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
> if (ret == addr)
> return addr;
>
> - ret += (off - ret) & (size - 1);
> + off_sub = (off - ret) & (size - 1);
> +
> + if (current->mm->get_unmapped_area == arch_get_unmapped_area_topdown &&
> + !off_sub)
> + return ret + size;
> +
> + ret += off_sub;
> return ret;
> }

2024-01-23 09:45:05

by Ryan Roberts

Subject: Re: [RESEND PATCH] mm: align larger anonymous mappings on THP boundaries

On 22/01/2024 19:43, Yang Shi wrote:
> On Mon, Jan 22, 2024 at 3:37 AM Ryan Roberts <[email protected]> wrote:
>>
>> On 20/01/2024 16:39, Matthew Wilcox wrote:
>>> On Sat, Jan 20, 2024 at 12:04:27PM +0000, Ryan Roberts wrote:
>>>> However, after this patch, each allocation is in its own VMA, and there is a 2M
>>>> gap between each VMA. This causes 2 problems: 1) mmap becomes MUCH slower
>>>> because there are so many VMAs to check to find a new 1G gap. 2) It fails once
>>>> it hits the VMA limit (/proc/sys/vm/max_map_count). Hitting this limit then
>>>> causes a subsequent calloc() to fail, which causes the test to fail.
>>>>
>>>> Looking at the code, I think the problem is that arm64 selects
>>>> ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT. But __thp_get_unmapped_area() allocates
>>>> len+2M then always aligns to the bottom of the discovered gap. That causes the
>>>> 2M hole. As far as I can see, x86 allocates bottom up, so you don't get a hole.
>>>
>>> As a quick hack, perhaps
>>> #ifdef ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
>>> take-the-top-half
>>> #else
>>> current-take-bottom-half-code
>>> #endif
>>>
>>> ?
>
> Thanks for the suggestion. It makes sense to me. Doing the alignment
> needs to take into account this.
>
>>
>> There is a general problem though that there is a trade-off between abutting
>> VMAs, and aligning them to PMD boundaries. This patch has decided that in
>> general the latter is preferable. The case I'm hitting is special though, in
>> that both requirements could be achieved but currently are not.
>>
>> The below fixes it, but I feel like there should be some bitwise magic that
>> would give the correct answer without the conditional - but my head is gone and
>> I can't see it. Any thoughts?
>
> Thanks Ryan for the patch. TBH I didn't see a bitwise magic without
> the conditional either.
>
>>
>> Beyond this, though, there is also a latent bug where the offset provided to
>> mmap() is carried all the way through to the get_unmapped_area()
>> impelementation, even for MAP_ANONYMOUS - I'm pretty sure we should be
>> force-zeroing it for MAP_ANONYMOUS? Certainly before this change, for arches
>> that use the default get_unmapped_area(), any non-zero offset would not have
>> been used. But this change starts using it, which is incorrect. That said, there
>> are some arches that override the default get_unmapped_area() and do use the
>> offset. So I'm not sure if this is a bug or a feature that user space can pass
>> an arbitrary value to the implementation for anon memory??
>
> Thanks for noticing this. If I read the code correctly, the pgoff used
> by some arches to workaround VIPT caches, and it looks like it is for
> shared mapping only (just checked arm and mips). And I believe
> everybody assumes 0 should be used when doing anonymous mapping. The
> offset should have nothing to do with seeking proper unmapped virtual
> area. But the pgoff does make sense for file THP due to the alignment
> requirements. I think it should be zero'ed for anonymous mappings,
> like:
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 2ff79b1d1564..a9ed353ce627 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1830,6 +1830,7 @@ get_unmapped_area(struct file *file, unsigned
> long addr, unsigned long len,
> pgoff = 0;
> get_area = shmem_get_unmapped_area;
> } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> + pgoff = 0;
> /* Ensures that larger anonymous mappings are THP aligned. */
> get_area = thp_get_unmapped_area;
> }

I think it would be cleaner to just zero pgoff if file==NULL; then it covers the
shared case, the THP case, and the non-THP case properly. I'll prepare a
separate patch for this.


>
>>
>> Finally, the second test failure I reported (ksm_tests) is actually caused by a
>> bug in the test code, but provoked by this change. So I'll send out a fix for
>> the test code separately.
>>
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 4f542444a91f..68ac54117c77 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -632,7 +632,7 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
>> {
>> loff_t off_end = off + len;
>> loff_t off_align = round_up(off, size);
>> - unsigned long len_pad, ret;
>> + unsigned long len_pad, ret, off_sub;
>>
>> if (off_end <= off_align || (off_end - off_align) < size)
>> return 0;
>> @@ -658,7 +658,13 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
>> if (ret == addr)
>> return addr;
>>
>> - ret += (off - ret) & (size - 1);
>> + off_sub = (off - ret) & (size - 1);
>> +
>> + if (current->mm->get_unmapped_area == arch_get_unmapped_area_topdown &&
>> + !off_sub)
>> + return ret + size;
>> +
>> + ret += off_sub;
>> return ret;
>> }
>
> I didn't spot any problem, would you please come up with a formal patch?

Yeah, I'll aim to post today.



2024-01-23 17:22:46

by Yang Shi

Subject: Re: [RESEND PATCH] mm: align larger anonymous mappings on THP boundaries

On Tue, Jan 23, 2024 at 1:41 AM Ryan Roberts <[email protected]> wrote:
>
> On 22/01/2024 19:43, Yang Shi wrote:
> > On Mon, Jan 22, 2024 at 3:37 AM Ryan Roberts <[email protected]> wrote:
> >>
> >> On 20/01/2024 16:39, Matthew Wilcox wrote:
> >>> On Sat, Jan 20, 2024 at 12:04:27PM +0000, Ryan Roberts wrote:
> >>>> However, after this patch, each allocation is in its own VMA, and there is a 2M
> >>>> gap between each VMA. This causes 2 problems: 1) mmap becomes MUCH slower
> >>>> because there are so many VMAs to check to find a new 1G gap. 2) It fails once
> >>>> it hits the VMA limit (/proc/sys/vm/max_map_count). Hitting this limit then
> >>>> causes a subsequent calloc() to fail, which causes the test to fail.
> >>>>
> >>>> Looking at the code, I think the problem is that arm64 selects
> >>>> ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT. But __thp_get_unmapped_area() allocates
> >>>> len+2M then always aligns to the bottom of the discovered gap. That causes the
> >>>> 2M hole. As far as I can see, x86 allocates bottom up, so you don't get a hole.
> >>>
> >>> As a quick hack, perhaps
> >>> #ifdef ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
> >>> take-the-top-half
> >>> #else
> >>> current-take-bottom-half-code
> >>> #endif
> >>>
> >>> ?
> >
> > Thanks for the suggestion. It makes sense to me. Doing the alignment
> > needs to take into account this.
> >
> >>
> >> There is a general problem though that there is a trade-off between abutting
> >> VMAs, and aligning them to PMD boundaries. This patch has decided that in
> >> general the latter is preferable. The case I'm hitting is special though, in
> >> that both requirements could be achieved but currently are not.
> >>
> >> The below fixes it, but I feel like there should be some bitwise magic that
> >> would give the correct answer without the conditional - but my head is gone and
> >> I can't see it. Any thoughts?
> >
> > Thanks Ryan for the patch. TBH I didn't see a bitwise magic without
> > the conditional either.
> >
> >>
> >> Beyond this, though, there is also a latent bug where the offset provided to
> >> mmap() is carried all the way through to the get_unmapped_area()
> >> impelementation, even for MAP_ANONYMOUS - I'm pretty sure we should be
> >> force-zeroing it for MAP_ANONYMOUS? Certainly before this change, for arches
> >> that use the default get_unmapped_area(), any non-zero offset would not have
> >> been used. But this change starts using it, which is incorrect. That said, there
> >> are some arches that override the default get_unmapped_area() and do use the
> >> offset. So I'm not sure if this is a bug or a feature that user space can pass
> >> an arbitrary value to the implementation for anon memory??
> >
> > Thanks for noticing this. If I read the code correctly, the pgoff used
> > by some arches to workaround VIPT caches, and it looks like it is for
> > shared mapping only (just checked arm and mips). And I believe
> > everybody assumes 0 should be used when doing anonymous mapping. The
> > offset should have nothing to do with seeking proper unmapped virtual
> > area. But the pgoff does make sense for file THP due to the alignment
> > requirements. I think it should be zero'ed for anonymous mappings,
> > like:
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 2ff79b1d1564..a9ed353ce627 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1830,6 +1830,7 @@ get_unmapped_area(struct file *file, unsigned
> > long addr, unsigned long len,
> > pgoff = 0;
> > get_area = shmem_get_unmapped_area;
> > } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> > + pgoff = 0;
> > /* Ensures that larger anonymous mappings are THP aligned. */
> > get_area = thp_get_unmapped_area;
> > }
>
> I think it would be cleaner to just zero pgoff if file==NULL, then it covers the
> shared case, the THP case, and the non-THP case properly. I'll prepare a
> separate patch for this.

IIUC I don't think this is OK for those arches which have to
work around VIPT caches, since MAP_ANONYMOUS | MAP_SHARED with a NULL file
pointer is a common case for creating a tmpfs mapping. For example,
arm's arch_get_unmapped_area() has:

	if (aliasing)
		do_align = filp || (flags & MAP_SHARED);

The pgoff is needed if do_align is true. So we should just zero pgoff
iff !file && !MAP_SHARED, like my patch does; we can move the zeroing
to a better place.

>
>
> >
> >>
> >> Finally, the second test failure I reported (ksm_tests) is actually caused by a
> >> bug in the test code, but provoked by this change. So I'll send out a fix for
> >> the test code separately.
> >>
> >>
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index 4f542444a91f..68ac54117c77 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -632,7 +632,7 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
> >> {
> >> loff_t off_end = off + len;
> >> loff_t off_align = round_up(off, size);
> >> - unsigned long len_pad, ret;
> >> + unsigned long len_pad, ret, off_sub;
> >>
> >> if (off_end <= off_align || (off_end - off_align) < size)
> >> return 0;
> >> @@ -658,7 +658,13 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
> >> if (ret == addr)
> >> return addr;
> >>
> >> - ret += (off - ret) & (size - 1);
> >> + off_sub = (off - ret) & (size - 1);
> >> +
> >> + if (current->mm->get_unmapped_area == arch_get_unmapped_area_topdown &&
> >> + !off_sub)
> >> + return ret + size;
> >> +
> >> + ret += off_sub;
> >> return ret;
> >> }
> >
> > I didn't spot any problem, would you please come up with a formal patch?
>
> Yeah, I'll aim to post today.

Thanks!

>
>

2024-01-23 17:38:55

by Ryan Roberts

Subject: Re: [RESEND PATCH] mm: align larger anonymous mappings on THP boundaries

On 23/01/2024 17:14, Yang Shi wrote:
> On Tue, Jan 23, 2024 at 1:41 AM Ryan Roberts <[email protected]> wrote:
>>
>> On 22/01/2024 19:43, Yang Shi wrote:
>>> On Mon, Jan 22, 2024 at 3:37 AM Ryan Roberts <[email protected]> wrote:
>>>>
>>>> On 20/01/2024 16:39, Matthew Wilcox wrote:
>>>>> On Sat, Jan 20, 2024 at 12:04:27PM +0000, Ryan Roberts wrote:
>>>>>> However, after this patch, each allocation is in its own VMA, and there is a 2M
>>>>>> gap between each VMA. This causes 2 problems: 1) mmap becomes MUCH slower
>>>>>> because there are so many VMAs to check to find a new 1G gap. 2) It fails once
>>>>>> it hits the VMA limit (/proc/sys/vm/max_map_count). Hitting this limit then
>>>>>> causes a subsequent calloc() to fail, which causes the test to fail.
>>>>>>
>>>>>> Looking at the code, I think the problem is that arm64 selects
>>>>>> ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT. But __thp_get_unmapped_area() allocates
>>>>>> len+2M then always aligns to the bottom of the discovered gap. That causes the
>>>>>> 2M hole. As far as I can see, x86 allocates bottom up, so you don't get a hole.
>>>>>
>>>>> As a quick hack, perhaps
>>>>> #ifdef ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
>>>>> take-the-top-half
>>>>> #else
>>>>> current-take-bottom-half-code
>>>>> #endif
>>>>>
>>>>> ?
>>>
>>> Thanks for the suggestion. It makes sense to me. Doing the alignment
>>> needs to take into account this.
>>>
>>>>
>>>> There is a general problem though that there is a trade-off between abutting
>>>> VMAs, and aligning them to PMD boundaries. This patch has decided that in
>>>> general the latter is preferable. The case I'm hitting is special though, in
>>>> that both requirements could be achieved but currently are not.
>>>>
>>>> The below fixes it, but I feel like there should be some bitwise magic that
>>>> would give the correct answer without the conditional - but my head is gone and
>>>> I can't see it. Any thoughts?
>>>
>>> Thanks Ryan for the patch. TBH I didn't see a bitwise magic without
>>> the conditional either.
>>>
>>>>
>>>> Beyond this, though, there is also a latent bug where the offset provided to
>>>> mmap() is carried all the way through to the get_unmapped_area()
>>>> impelementation, even for MAP_ANONYMOUS - I'm pretty sure we should be
>>>> force-zeroing it for MAP_ANONYMOUS? Certainly before this change, for arches
>>>> that use the default get_unmapped_area(), any non-zero offset would not have
>>>> been used. But this change starts using it, which is incorrect. That said, there
>>>> are some arches that override the default get_unmapped_area() and do use the
>>>> offset. So I'm not sure if this is a bug or a feature that user space can pass
>>>> an arbitrary value to the implementation for anon memory??
>>>
>>> Thanks for noticing this. If I read the code correctly, the pgoff used
>>> by some arches to workaround VIPT caches, and it looks like it is for
>>> shared mapping only (just checked arm and mips). And I believe
>>> everybody assumes 0 should be used when doing anonymous mapping. The
>>> offset should have nothing to do with seeking proper unmapped virtual
>>> area. But the pgoff does make sense for file THP due to the alignment
>>> requirements. I think it should be zero'ed for anonymous mappings,
>>> like:
>>>
>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>> index 2ff79b1d1564..a9ed353ce627 100644
>>> --- a/mm/mmap.c
>>> +++ b/mm/mmap.c
>>> @@ -1830,6 +1830,7 @@ get_unmapped_area(struct file *file, unsigned
>>> long addr, unsigned long len,
>>> pgoff = 0;
>>> get_area = shmem_get_unmapped_area;
>>> } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
>>> + pgoff = 0;
>>> /* Ensures that larger anonymous mappings are THP aligned. */
>>> get_area = thp_get_unmapped_area;
>>> }
>>
>> I think it would be cleaner to just zero pgoff if file==NULL, then it covers the
>> shared case, the THP case, and the non-THP case properly. I'll prepare a
>> separate patch for this.
>
> IIUC I don't think this is ok for those arches which have to
> workaround VIPT cache since MAP_ANONYMOUS | MAP_SHARED with NULL file
> pointer is a common case for creating tmpfs mapping. For example,
> arm's arch_get_unmapped_area() has:
>
> if (aliasing)
> do_align = filp || (flags & MAP_SHARED);
>
> The pgoff is needed if do_align is true. So we should just zero pgoff
> iff !file && !MAP_SHARED like what my patch does, we can move the
> zeroing to a better place.

We crossed streams - I sent out the patch just as you sent this. My patch is
implemented as I proposed.

I'm not sure I agree with what you are saying. The mmap man page says this:

The contents of a file mapping (as opposed to an anonymous mapping; see
MAP_ANONYMOUS below), are initialized using length bytes starting at offset
offset in the file (or other object) referred to by the file descriptor fd.

So that implies offset is only relevant when a file is provided. It then goes on
to say:

MAP_ANONYMOUS
The mapping is not backed by any file; its contents are initialized to zero.
The fd argument is ignored; however, some implementations require fd to be -1
if MAP_ANONYMOUS (or MAP_ANON) is specified, and portable applications should
ensure this. The offset argument should be zero.

So users are expected to pass offset=0 when mapping anon memory, for both shared
and private cases.

In fact, in the line above where you made your proposed change, pgoff is also
being zeroed for the (!file && (flags & MAP_SHARED)) case.
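
For reference, the shape of that change is roughly the following (a sketch
only, not the posted patch):

	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
		/* Ensures that larger anonymous mappings are THP aligned. */
		get_area = thp_get_unmapped_area;
	}

	/* Anonymous mappings always use a zero offset. */
	if (!file)
		pgoff = 0;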


>
>>
>>
>>>
>>>>
>>>> Finally, the second test failure I reported (ksm_tests) is actually caused by a
>>>> bug in the test code, but provoked by this change. So I'll send out a fix for
>>>> the test code separately.
>>>>
>>>>
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 4f542444a91f..68ac54117c77 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -632,7 +632,7 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
>>>> {
>>>> loff_t off_end = off + len;
>>>> loff_t off_align = round_up(off, size);
>>>> - unsigned long len_pad, ret;
>>>> + unsigned long len_pad, ret, off_sub;
>>>>
>>>> if (off_end <= off_align || (off_end - off_align) < size)
>>>> return 0;
>>>> @@ -658,7 +658,13 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
>>>> if (ret == addr)
>>>> return addr;
>>>>
>>>> - ret += (off - ret) & (size - 1);
>>>> + off_sub = (off - ret) & (size - 1);
>>>> +
>>>> + if (current->mm->get_unmapped_area == arch_get_unmapped_area_topdown &&
>>>> + !off_sub)
>>>> + return ret + size;
>>>> +
>>>> + ret += off_sub;
>>>> return ret;
>>>> }
>>>
>>> I didn't spot any problem, would you please come up with a formal patch?
>>
>> Yeah, I'll aim to post today.
>
> Thanks!
>
>>
>>


2024-01-23 17:53:24

by Yang Shi

Subject: Re: [RESEND PATCH] mm: align larger anonymous mappings on THP boundaries

On Tue, Jan 23, 2024 at 9:14 AM Yang Shi <[email protected]> wrote:
>
> On Tue, Jan 23, 2024 at 1:41 AM Ryan Roberts <[email protected]> wrote:
> >
> > On 22/01/2024 19:43, Yang Shi wrote:
> > > On Mon, Jan 22, 2024 at 3:37 AM Ryan Roberts <[email protected]> wrote:
> > >>
> > >> On 20/01/2024 16:39, Matthew Wilcox wrote:
> > >>> On Sat, Jan 20, 2024 at 12:04:27PM +0000, Ryan Roberts wrote:
> > >>>> However, after this patch, each allocation is in its own VMA, and there is a 2M
> > >>>> gap between each VMA. This causes 2 problems: 1) mmap becomes MUCH slower
> > >>>> because there are so many VMAs to check to find a new 1G gap. 2) It fails once
> > >>>> it hits the VMA limit (/proc/sys/vm/max_map_count). Hitting this limit then
> > >>>> causes a subsequent calloc() to fail, which causes the test to fail.
> > >>>>
> > >>>> Looking at the code, I think the problem is that arm64 selects
> > >>>> ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT. But __thp_get_unmapped_area() allocates
> > >>>> len+2M then always aligns to the bottom of the discovered gap. That causes the
> > >>>> 2M hole. As far as I can see, x86 allocates bottom up, so you don't get a hole.
> > >>>
> > >>> As a quick hack, perhaps
> > >>> #ifdef ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
> > >>> take-the-top-half
> > >>> #else
> > >>> current-take-bottom-half-code
> > >>> #endif
> > >>>
> > >>> ?
> > >
> > > Thanks for the suggestion. It makes sense to me. Doing the alignment
> > > needs to take into account this.
> > >
> > >>
> > >> There is a general problem though that there is a trade-off between abutting
> > >> VMAs, and aligning them to PMD boundaries. This patch has decided that in
> > >> general the latter is preferable. The case I'm hitting is special though, in
> > >> that both requirements could be achieved but currently are not.
> > >>
> > >> The below fixes it, but I feel like there should be some bitwise magic that
> > >> would give the correct answer without the conditional - but my head is gone and
> > >> I can't see it. Any thoughts?
> > >
> > > Thanks Ryan for the patch. TBH I didn't see a bitwise magic without
> > > the conditional either.
> > >
> > >>
> > >> Beyond this, though, there is also a latent bug where the offset provided to
> > >> mmap() is carried all the way through to the get_unmapped_area()
> > >> impelementation, even for MAP_ANONYMOUS - I'm pretty sure we should be
> > >> force-zeroing it for MAP_ANONYMOUS? Certainly before this change, for arches
> > >> that use the default get_unmapped_area(), any non-zero offset would not have
> > >> been used. But this change starts using it, which is incorrect. That said, there
> > >> are some arches that override the default get_unmapped_area() and do use the
> > >> offset. So I'm not sure if this is a bug or a feature that user space can pass
> > >> an arbitrary value to the implementation for anon memory??
> > >
> > > Thanks for noticing this. If I read the code correctly, the pgoff used
> > > by some arches to workaround VIPT caches, and it looks like it is for
> > > shared mapping only (just checked arm and mips). And I believe
> > > everybody assumes 0 should be used when doing anonymous mapping. The
> > > offset should have nothing to do with seeking proper unmapped virtual
> > > area. But the pgoff does make sense for file THP due to the alignment
> > > requirements. I think it should be zero'ed for anonymous mappings,
> > > like:
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 2ff79b1d1564..a9ed353ce627 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -1830,6 +1830,7 @@ get_unmapped_area(struct file *file, unsigned
> > > long addr, unsigned long len,
> > > pgoff = 0;
> > > get_area = shmem_get_unmapped_area;
> > > } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> > > + pgoff = 0;
> > > /* Ensures that larger anonymous mappings are THP aligned. */
> > > get_area = thp_get_unmapped_area;
> > > }
> >
> > I think it would be cleaner to just zero pgoff if file==NULL, then it covers the
> > shared case, the THP case, and the non-THP case properly. I'll prepare a
> > separate patch for this.
>
> IIUC I don't think this is ok for those arches which have to
> workaround VIPT cache since MAP_ANONYMOUS | MAP_SHARED with NULL file
> pointer is a common case for creating tmpfs mapping. For example,
> arm's arch_get_unmapped_area() has:
>
> if (aliasing)
> do_align = filp || (flags & MAP_SHARED);
>
> The pgoff is needed if do_align is true. So we should just zero pgoff
> iff !file && !MAP_SHARED like what my patch does, we can move the
> zeroing to a better place.

Rethinking this... zeroing pgoff when file is NULL should be OK since a
MAP_ANONYMOUS | MAP_SHARED mapping should typically have a zero offset.
I'm not aware of any use case with a non-zero offset, or at least not a sane one...

>
> >
> >
> > >
> > >>
> > >> Finally, the second test failure I reported (ksm_tests) is actually caused by a
> > >> bug in the test code, but provoked by this change. So I'll send out a fix for
> > >> the test code separately.
> > >>
> > >>
> > >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > >> index 4f542444a91f..68ac54117c77 100644
> > >> --- a/mm/huge_memory.c
> > >> +++ b/mm/huge_memory.c
> > >> @@ -632,7 +632,7 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
> > >> {
> > >> loff_t off_end = off + len;
> > >> loff_t off_align = round_up(off, size);
> > >> - unsigned long len_pad, ret;
> > >> + unsigned long len_pad, ret, off_sub;
> > >>
> > >> if (off_end <= off_align || (off_end - off_align) < size)
> > >> return 0;
> > >> @@ -658,7 +658,13 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
> > >> if (ret == addr)
> > >> return addr;
> > >>
> > >> - ret += (off - ret) & (size - 1);
> > >> + off_sub = (off - ret) & (size - 1);
> > >> +
> > >> + if (current->mm->get_unmapped_area == arch_get_unmapped_area_topdown &&
> > >> + !off_sub)
> > >> + return ret + size;
> > >> +
> > >> + ret += off_sub;
> > >> return ret;
> > >> }
> > >
> > > I didn't spot any problem, would you please come up with a formal patch?
> >
> > Yeah, I'll aim to post today.
>
> Thanks!
>
> >
> >

2024-01-23 17:54:15

by Yang Shi

Subject: Re: [RESEND PATCH] mm: align larger anonymous mappings on THP boundaries

On Tue, Jan 23, 2024 at 9:26 AM Ryan Roberts <[email protected]> wrote:
>
> On 23/01/2024 17:14, Yang Shi wrote:
> > On Tue, Jan 23, 2024 at 1:41 AM Ryan Roberts <[email protected]> wrote:
> >>
> >> On 22/01/2024 19:43, Yang Shi wrote:
> >>> On Mon, Jan 22, 2024 at 3:37 AM Ryan Roberts <[email protected]> wrote:
> >>>>
> >>>> On 20/01/2024 16:39, Matthew Wilcox wrote:
> >>>>> On Sat, Jan 20, 2024 at 12:04:27PM +0000, Ryan Roberts wrote:
> >>>>>> However, after this patch, each allocation is in its own VMA, and there is a 2M
> >>>>>> gap between each VMA. This causes 2 problems: 1) mmap becomes MUCH slower
> >>>>>> because there are so many VMAs to check to find a new 1G gap. 2) It fails once
> >>>>>> it hits the VMA limit (/proc/sys/vm/max_map_count). Hitting this limit then
> >>>>>> causes a subsequent calloc() to fail, which causes the test to fail.
> >>>>>>
> >>>>>> Looking at the code, I think the problem is that arm64 selects
> >>>>>> ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT. But __thp_get_unmapped_area() allocates
> >>>>>> len+2M then always aligns to the bottom of the discovered gap. That causes the
> >>>>>> 2M hole. As far as I can see, x86 allocates bottom up, so you don't get a hole.
> >>>>>
> >>>>> As a quick hack, perhaps
> >>>>> #ifdef ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
> >>>>> take-the-top-half
> >>>>> #else
> >>>>> current-take-bottom-half-code
> >>>>> #endif
> >>>>>
> >>>>> ?
> >>>
> >>> Thanks for the suggestion. It makes sense to me. Doing the alignment
> >>> needs to take into account this.
> >>>
> >>>>
> >>>> There is a general problem though that there is a trade-off between abutting
> >>>> VMAs, and aligning them to PMD boundaries. This patch has decided that in
> >>>> general the latter is preferable. The case I'm hitting is special though, in
> >>>> that both requirements could be achieved but currently are not.
> >>>>
> >>>> The below fixes it, but I feel like there should be some bitwise magic that
> >>>> would give the correct answer without the conditional - but my head is gone and
> >>>> I can't see it. Any thoughts?
> >>>
> >>> Thanks Ryan for the patch. TBH I didn't see a bitwise magic without
> >>> the conditional either.
> >>>
> >>>>
> >>>> Beyond this, though, there is also a latent bug where the offset provided to
> >>>> mmap() is carried all the way through to the get_unmapped_area()
> >>>> impelementation, even for MAP_ANONYMOUS - I'm pretty sure we should be
> >>>> force-zeroing it for MAP_ANONYMOUS? Certainly before this change, for arches
> >>>> that use the default get_unmapped_area(), any non-zero offset would not have
> >>>> been used. But this change starts using it, which is incorrect. That said, there
> >>>> are some arches that override the default get_unmapped_area() and do use the
> >>>> offset. So I'm not sure if this is a bug or a feature that user space can pass
> >>>> an arbitrary value to the implementation for anon memory??
> >>>
> >>> Thanks for noticing this. If I read the code correctly, the pgoff used
> >>> by some arches to workaround VIPT caches, and it looks like it is for
> >>> shared mapping only (just checked arm and mips). And I believe
> >>> everybody assumes 0 should be used when doing anonymous mapping. The
> >>> offset should have nothing to do with seeking proper unmapped virtual
> >>> area. But the pgoff does make sense for file THP due to the alignment
> >>> requirements. I think it should be zero'ed for anonymous mappings,
> >>> like:
> >>>
> >>> diff --git a/mm/mmap.c b/mm/mmap.c
> >>> index 2ff79b1d1564..a9ed353ce627 100644
> >>> --- a/mm/mmap.c
> >>> +++ b/mm/mmap.c
> >>> @@ -1830,6 +1830,7 @@ get_unmapped_area(struct file *file, unsigned
> >>> long addr, unsigned long len,
> >>> pgoff = 0;
> >>> get_area = shmem_get_unmapped_area;
> >>> } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> >>> + pgoff = 0;
> >>> /* Ensures that larger anonymous mappings are THP aligned. */
> >>> get_area = thp_get_unmapped_area;
> >>> }
> >>
> >> I think it would be cleaner to just zero pgoff if file==NULL, then it covers the
> >> shared case, the THP case, and the non-THP case properly. I'll prepare a
> >> separate patch for this.
> >
> > IIUC I don't think this is ok for those arches which have to
> > workaround VIPT cache since MAP_ANONYMOUS | MAP_SHARED with NULL file
> > pointer is a common case for creating tmpfs mapping. For example,
> > arm's arch_get_unmapped_area() has:
> >
> > if (aliasing)
> > do_align = filp || (flags & MAP_SHARED);
> >
> > The pgoff is needed if do_align is true. So we should just zero pgoff
> > iff !file && !MAP_SHARED like what my patch does, we can move the
> > zeroing to a better place.
>
> We crossed streams - I sent out the patch just as you sent this. My patch is
> implemented as I proposed.

We crossed again :-)

>
> I'm not sure I agree with what you are saying. The mmap man page says this:
>
> The contents of a file mapping (as opposed to an anonymous mapping; see
> MAP_ANONYMOUS below), are initialized using length bytes starting at offset
> offset in the file (or other object) referred to by the file descriptor fd.
>
> So that implies offset is only relavent when a file is provided. It then goes on
> to say:
>
> MAP_ANONYMOUS
> The mapping is not backed by any file; its contents are initialized to zero.
> The fd argument is ignored; however, some implementations require fd to be -1
> if MAP_ANONYMOUS (or MAP_ANON) is specified, and portable applications should
> ensure this. The offset argument should be zero.
>
> So users are expected to pass offset=0 when mapping anon memory, for both shared
> and private cases.
>
> Infact, in the line above where you made your proposed change, pgoff is also
> being zeroed for the (!file && (flags & MAP_SHARED)) case.

Yeah, rethinking led me to the same conclusion.

>
>
> >
> >>
> >>
> >>>
> >>>>
> >>>> Finally, the second test failure I reported (ksm_tests) is actually caused by a
> >>>> bug in the test code, but provoked by this change. So I'll send out a fix for
> >>>> the test code separately.
> >>>>
> >>>>
> >>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >>>> index 4f542444a91f..68ac54117c77 100644
> >>>> --- a/mm/huge_memory.c
> >>>> +++ b/mm/huge_memory.c
> >>>> @@ -632,7 +632,7 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
> >>>> {
> >>>> loff_t off_end = off + len;
> >>>> loff_t off_align = round_up(off, size);
> >>>> - unsigned long len_pad, ret;
> >>>> + unsigned long len_pad, ret, off_sub;
> >>>>
> >>>> if (off_end <= off_align || (off_end - off_align) < size)
> >>>> return 0;
> >>>> @@ -658,7 +658,13 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
> >>>> if (ret == addr)
> >>>> return addr;
> >>>>
> >>>> - ret += (off - ret) & (size - 1);
> >>>> + off_sub = (off - ret) & (size - 1);
> >>>> +
> >>>> + if (current->mm->get_unmapped_area == arch_get_unmapped_area_topdown &&
> >>>> + !off_sub)
> >>>> + return ret + size;
> >>>> +
> >>>> + ret += off_sub;
> >>>> return ret;
> >>>> }
> >>>
> >>> I didn't spot any problem, would you please come up with a formal patch?
> >>
> >> Yeah, I'll aim to post today.
> >
> > Thanks!
> >
> >>
> >>
>

2024-05-07 10:08:25

by Ryan Roberts

Subject: Re: [RESEND PATCH] mm: align larger anonymous mappings on THP boundaries

On 07/05/2024 09:25, Kefeng Wang wrote:
> Hi Ryan, Yang and all,
>
> We see another regression on arm64 (no issue on x86) when testing memory
> latency with lmbench,
>
> ./lat_mem_rd -P 1 512M 128

Do you know exactly what this test is doing?

>
> memory latency (smaller is better)
>
> MiB     6.9-rc7    6.9-rc7+revert

And what exactly have you reverted? I'm guessing just commit efa7df3e3bb5 ("mm:
align larger anonymous mappings on THP boundaries")?

> 0.00049    1.539     1.539
> 0.00098    1.539     1.539
> 0.00195    1.539     1.539
> 0.00293    1.539     1.539
> 0.00391    1.539     1.539
> 0.00586    1.539     1.539
> 0.00781    1.539     1.539
> 0.01172    1.539     1.539
> 0.01562    1.539     1.539
> 0.02344    1.539     1.539
> 0.03125    1.539     1.539
> 0.04688    1.539     1.539
> 0.0625    1.540     1.540
> 0.09375    3.634     3.086

So the first regression is for 96K - I'm guessing that's the mmap size? That
size shouldn't even be affected by this patch, apart from a few adds and a
compare which determines that the size is too small for PMD alignment.

> 0.125   3.874     3.175
> 0.1875  3.544     3.288
> 0.25    3.556     3.461
> 0.375   3.641     3.644
> 0.5     4.125     3.851
> 0.75    4.968     4.323
> 1       5.143     4.686
> 1.5     5.309     4.957
> 2       5.370     5.116
> 3       5.430     5.471
> 4       5.457     5.671
> 6       6.100     6.170
> 8       6.496     6.468
>
> -----------------------
> * L1 cache = 8M; there are no big changes below 8M *
> * but the latency reduces a lot from L2 when this patch is reverted *
>
> 12      6.917     6.840
> 16      7.268     7.077
> 24      7.536     7.345
> 32      10.723     9.421
> 48      14.220     11.350
> 64      16.253     12.189
> 96      14.494     12.507
> 128     14.630     12.560
> 192     15.402     12.967
> 256     16.178     12.957
> 384     15.177     13.346
> 512     15.235     13.233
>
> After quickly checking the smaps, I don't find any clues. Any suggestions?

Without knowing exactly what the test does, it's difficult to know what to
suggest. If you want to try something semi-randomly, it might be useful to rule
out the arm64 contpte feature. I don't see how that would be interacting here if
mTHP is disabled (is it?). But it's new for 6.9 and arm64 only. Disable with
ARM64_CONTPTE (needs EXPERT) at compile time.
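
If it helps, the corresponding config fragment would look roughly like this
(assuming the Kconfig symbol matches the option name above):

CONFIG_EXPERT=y
# CONFIG_ARM64_CONTPTE is not set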

>
> Thanks.
>
> On 2024/1/24 1:26, Ryan Roberts wrote:
>> On 23/01/2024 17:14, Yang Shi wrote:
>>> On Tue, Jan 23, 2024 at 1:41 AM Ryan Roberts <[email protected]> wrote:
>>>>
>>>> On 22/01/2024 19:43, Yang Shi wrote:
>>>>> On Mon, Jan 22, 2024 at 3:37 AM Ryan Roberts <[email protected]> wrote:
>>>>>>
>>>>>> On 20/01/2024 16:39, Matthew Wilcox wrote:
>>>>>>> On Sat, Jan 20, 2024 at 12:04:27PM +0000, Ryan Roberts wrote:
>>>>>>>> However, after this patch, each allocation is in its own VMA, and there
>>>>>>>> is a 2M
>>>>>>>> gap between each VMA. This causes 2 problems: 1) mmap becomes MUCH slower
>>>>>>>> because there are so many VMAs to check to find a new 1G gap. 2) It
>>>>>>>> fails once
>>>>>>>> it hits the VMA limit (/proc/sys/vm/max_map_count). Hitting this limit then
>>>>>>>> causes a subsequent calloc() to fail, which causes the test to fail.
>>>>>>>>
>>>>>>>> Looking at the code, I think the problem is that arm64 selects
>>>>>>>> ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT. But __thp_get_unmapped_area()
>>>>>>>> allocates
>>>>>>>> len+2M then always aligns to the bottom of the discovered gap. That
>>>>>>>> causes the
>>>>>>>> 2M hole. As far as I can see, x86 allocates bottom up, so you don't get
>>>>>>>> a hole.
>>>>>>>
>>>>>>> As a quick hack, perhaps
>>>>>>> #ifdef ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
>>>>>>> take-the-top-half
>>>>>>> #else
>>>>>>> current-take-bottom-half-code
>>>>>>> #endif
>>>>>>>
>>>>>>> ?
>>>>>
>>>>> Thanks for the suggestion. It makes sense to me. Doing the alignment
>>>>> needs to take into account this.
>>>>>
>>>>>>
>>>>>> There is a general problem though that there is a trade-off between abutting
>>>>>> VMAs, and aligning them to PMD boundaries. This patch has decided that in
>>>>>> general the latter is preferable. The case I'm hitting is special though, in
>>>>>> that both requirements could be achieved but currently are not.
>>>>>>
>>>>>> The below fixes it, but I feel like there should be some bitwise magic that
>>>>>> would give the correct answer without the conditional - but my head is
>>>>>> gone and
>>>>>> I can't see it. Any thoughts?
>>>>>
>>>>> Thanks Ryan for the patch. TBH I didn't see a bitwise magic without
>>>>> the conditional either.
>>>>>
>>>>>>
>>>>>> Beyond this, though, there is also a latent bug where the offset provided to
>>>>>> mmap() is carried all the way through to the get_unmapped_area()
>>>>>> impelementation, even for MAP_ANONYMOUS - I'm pretty sure we should be
>>>>>> force-zeroing it for MAP_ANONYMOUS? Certainly before this change, for arches
>>>>>> that use the default get_unmapped_area(), any non-zero offset would not have
>>>>>> been used. But this change starts using it, which is incorrect. That said,
>>>>>> there
>>>>>> are some arches that override the default get_unmapped_area() and do use the
>>>>>> offset. So I'm not sure if this is a bug or a feature that user space can
>>>>>> pass
>>>>>> an arbitrary value to the implementation for anon memory??
>>>>>
>>>>> Thanks for noticing this. If I read the code correctly, the pgoff is used
>>>>> by some arches to work around VIPT caches, and it looks like it is for
>>>>> shared mappings only (just checked arm and mips). And I believe
>>>>> everybody assumes 0 should be used when doing an anonymous mapping. The
>>>>> offset should have nothing to do with finding a proper unmapped virtual
>>>>> area. But the pgoff does make sense for file THP due to the alignment
>>>>> requirements. I think it should be zero'ed for anonymous mappings,
>>>>> like:
>>>>>
>>>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>>>> index 2ff79b1d1564..a9ed353ce627 100644
>>>>> --- a/mm/mmap.c
>>>>> +++ b/mm/mmap.c
>>>>> @@ -1830,6 +1830,7 @@ get_unmapped_area(struct file *file, unsigned
>>>>> long addr, unsigned long len,
>>>>>                  pgoff = 0;
>>>>>                  get_area = shmem_get_unmapped_area;
>>>>>          } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
>>>>> +               pgoff = 0;
>>>>>                  /* Ensures that larger anonymous mappings are THP aligned. */
>>>>>                  get_area = thp_get_unmapped_area;
>>>>>          }
>>>>
>>>> I think it would be cleaner to just zero pgoff if file==NULL, then it covers
>>>> the
>>>> shared case, the THP case, and the non-THP case properly. I'll prepare a
>>>> separate patch for this.
>>>
>>> IIUC I don't think this is OK for those arches which have to
>>> work around VIPT caches, since MAP_ANONYMOUS | MAP_SHARED with a NULL file
>>> pointer is a common case for creating tmpfs mappings. For example,
>>> arm's arch_get_unmapped_area() has:
>>>
>>> if (aliasing)
>>>          do_align = filp || (flags & MAP_SHARED);
>>>
>>> The pgoff is needed if do_align is true. So we should just zero pgoff
>>> iff !file && !MAP_SHARED, like what my patch does; we can move the
>>> zeroing to a better place.
>>
>> We crossed streams - I sent out the patch just as you sent this. My patch is
>> implemented as I proposed.
>>
>> I'm not sure I agree with what you are saying. The mmap man page says this:
>>
>>    The  contents  of  a file mapping (as opposed to an anonymous mapping; see
>>    MAP_ANONYMOUS below), are initialized using length bytes starting at offset
>>    offset in the file (or other object) referred to by the file descriptor fd.
>>
>> So that implies offset is only relevant when a file is provided. It then goes on
>> to say:
>>
>>    MAP_ANONYMOUS
>>    The mapping is not backed by any file; its contents are initialized to zero.
>>    The fd argument is ignored; however, some implementations require fd to be -1
>>    if MAP_ANONYMOUS (or MAP_ANON) is specified, and portable applications should
>>    ensure this. The offset argument should be zero.
>>
>> So users are expected to pass offset=0 when mapping anon memory, for both shared
>> and private cases.
>>
>> In fact, in the line above where you made your proposed change, pgoff is also
>> being zeroed for the (!file && (flags & MAP_SHARED)) case.
>>
>>
>>>
>>>>
>>>>
>>>>>
>>>>>>
>>>>>> Finally, the second test failure I reported (ksm_tests) is actually caused
>>>>>> by a
>>>>>> bug in the test code, but provoked by this change. So I'll send out a fix for
>>>>>> the test code separately.
>>>>>>
>>>>>>
>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>> index 4f542444a91f..68ac54117c77 100644
>>>>>> --- a/mm/huge_memory.c
>>>>>> +++ b/mm/huge_memory.c
>>>>>> @@ -632,7 +632,7 @@ static unsigned long __thp_get_unmapped_area(struct
>>>>>> file *filp,
>>>>>>   {
>>>>>>          loff_t off_end = off + len;
>>>>>>          loff_t off_align = round_up(off, size);
>>>>>> -       unsigned long len_pad, ret;
>>>>>> +       unsigned long len_pad, ret, off_sub;
>>>>>>
>>>>>>          if (off_end <= off_align || (off_end - off_align) < size)
>>>>>>                  return 0;
>>>>>> @@ -658,7 +658,13 @@ static unsigned long __thp_get_unmapped_area(struct
>>>>>> file *filp,
>>>>>>          if (ret == addr)
>>>>>>                  return addr;
>>>>>>
>>>>>> -       ret += (off - ret) & (size - 1);
>>>>>> +       off_sub = (off - ret) & (size - 1);
>>>>>> +
>>>>>> +       if (current->mm->get_unmapped_area ==
>>>>>> arch_get_unmapped_area_topdown &&
>>>>>> +           !off_sub)
>>>>>> +               return ret + size;
>>>>>> +
>>>>>> +       ret += off_sub;
>>>>>>          return ret;
>>>>>>   }
>>>>>
>>>>> I didn't spot any problem, would you please come up with a formal patch?
>>>>
>>>> Yeah, I'll aim to post today.
>>>
>>> Thanks!
>>>
>>>>
>>>>
>>
>>


2024-05-07 11:13:34

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RESEND PATCH] mm: align larger anonymous mappings on THP boundaries


> https://github.com/intel/lmbench/blob/master/src/lat_mem_rd.c#L95
>
>> suggest. If you want to try something semi-randomly; it might be useful to rule
>> out the arm64 contpte feature. I don't see how that would be interacting here if
>> mTHP is disabled (is it?). But its new for 6.9 and arm64 only. Disable with
>> ARM64_CONTPTE (needs EXPERT) at compile time.
> I didn't enable mTHP, so it should not be related to ARM64_CONTPTE,
> but I will have a try.

cont-pte can get active if we're just lucky when allocating pages in the
right order, correct Ryan?

--
Cheers,

David / dhildenb


2024-05-07 11:28:13

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RESEND PATCH] mm: align larger anonymous mappings on THP boundaries

On 07/05/2024 12:14, Ryan Roberts wrote:
> On 07/05/2024 12:13, David Hildenbrand wrote:
>>
>>> https://github.com/intel/lmbench/blob/master/src/lat_mem_rd.c#L95
>>>
>>>> suggest. If you want to try something semi-randomly; it might be useful to rule
>>>> out the arm64 contpte feature. I don't see how that would be interacting here if
>>>> mTHP is disabled (is it?). But its new for 6.9 and arm64 only. Disable with
>>>> ARM64_CONTPTE (needs EXPERT) at compile time.
>>> I didn't enable mTHP, so it should not be related to ARM64_CONTPTE,
>>> but I will have a try.
>>
>> cont-pte can get active if we're just lucky when allocating pages in the right
>> order, correct Ryan?
>
> No it shouldn't do; it requires the pages to be in the same folio.
>

That said, if we got lucky in allocating the "right" pages, then we will end up
doing an extra function call and a bit of maths for every 16 PTEs in order to
figure out that the span is not contained by a single folio, before backing out
of an attempt to fold. That would probably be just about measurable.

But the regression doesn't kick in until 96K, which is the step after 64K. I'd
expect to see the regression on 64K too if that was the issue. The L1 cache is
64K, so I suspect it could be something related to the cache?
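
If it does turn out to be cache related, it might be worth dumping the cache
hierarchy on the test machine to see which level the 64K->96K knee lines up
with. A minimal python sketch using sysfs cacheinfo (assuming it is populated
on that platform):

from pathlib import Path

# Print level, type and size for each cache visible to CPU0,
# e.g. "L1 Data: 64K", "L2 Unified: 1024K", ...
for idx in sorted(Path("/sys/devices/system/cpu/cpu0/cache").glob("index*")):
    level = (idx / "level").read_text().strip()
    ctype = (idx / "type").read_text().strip()
    size = (idx / "size").read_text().strip()
    print(f"L{level} {ctype}: {size}")

That would also let us sanity-check the "L1 cache = 8M" figure reported earlier
in the thread.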

2024-05-07 11:35:31

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RESEND PATCH] mm: align larger anonymous mappings on THP boundaries

On 07.05.24 13:26, Ryan Roberts wrote:
> On 07/05/2024 12:14, Ryan Roberts wrote:
>> On 07/05/2024 12:13, David Hildenbrand wrote:
>>>
>>>> https://github.com/intel/lmbench/blob/master/src/lat_mem_rd.c#L95
>>>>
>>>>> suggest. If you want to try something semi-randomly; it might be useful to rule
>>>>> out the arm64 contpte feature. I don't see how that would be interacting here if
>>>>> mTHP is disabled (is it?). But its new for 6.9 and arm64 only. Disable with
>>>>> ARM64_CONTPTE (needs EXPERT) at compile time.
>>>> I didn't enable mTHP, so it should not be related to ARM64_CONTPTE,
>>>> but I will have a try.
>>>
>>> cont-pte can get active if we're just lucky when allocating pages in the right
>>> order, correct Ryan?
>>
>> No it shouldn't do; it requires the pages to be in the same folio.

Ah, my memory comes back. That's also important for folio_pte_batch() to
currently work as expected I think. We could change that, though, and
let cont-pte batch across folios.

>>
>
> That said, if we got lucky in allocating the "right" pages, then we will end up
> doing an extra function call and a bit of maths for every 16 PTEs in order to
> figure out that the span is not contained by a single folio, before backing out
> of an attempt to fold. That would probably be just about measurable.
>
> But the regression doesn't kick in until 96K, which is the step after 64K. I'd
> expect to see the regression on 64K too if that was the issue. The L1 cache is
> 64K, so I suspect it could be something related to the cache?
>

--
Cheers,

David / dhildenb


2024-05-07 11:42:32

by Ryan Roberts

[permalink] [raw]
Subject: Re: [RESEND PATCH] mm: align larger anonymous mappings on THP boundaries

On 07/05/2024 12:13, David Hildenbrand wrote:
>
>> https://github.com/intel/lmbench/blob/master/src/lat_mem_rd.c#L95
>>
>>> suggest. If you want to try something semi-randomly; it might be useful to rule
>>> out the arm64 contpte feature. I don't see how that would be interacting here if
>>> mTHP is disabled (is it?). But its new for 6.9 and arm64 only. Disable with
>>> ARM64_CONTPTE (needs EXPERT) at compile time.
>> I didn't enable mTHP, so it should not be related to ARM64_CONTPTE,
>> but I will have a try.
>
> cont-pte can get active if we're just lucky when allocating pages in the right
> order, correct Ryan?

No it shouldn't do; it requires the pages to be in the same folio.