2017-03-15 16:51:15

by Avi Kivity

Subject: MAP_POPULATE vs. MADV_HUGEPAGES

A user is trying to allocate 1TB of anonymous memory in parallel on 48
cores (4 NUMA nodes). The kernel ends up spinning in
isolate_freepages_block().


I thought to help it along by using MAP_POPULATE, but then my
MADV_HUGEPAGE won't be seen until after mmap() completes, with pages
already populated. Are MAP_POPULATE and MADV_HUGEPAGE mutually exclusive?


Is my only option to serialize those memory allocations, and fault in
those pages manually? Or perhaps use mlock()?
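
For reference, the "fault in those pages manually" variant can be sketched
as below. This is only an illustration of the ordering (mmap() without
MAP_POPULATE, then madvise(), then touching), not code from the
application; the function name map_hugepage_prefault() is made up, and the
sketch assumes Linux with glibc. THP can only back the 2MB-aligned parts of
the range, which the sketch does not try to control.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map `len` bytes of anonymous memory, ask for THP
 * before any page exists, then fault the range in by hand.
 * Returns the mapping, or NULL if mmap() fails. */
static void *map_hugepage_prefault(size_t len)
{
    /* No MAP_POPULATE here, so nothing is populated yet. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* The hint is advisory; a failure (e.g. THP not available) is ignored. */
    (void)madvise(p, len, MADV_HUGEPAGE);

    /* Touch one byte per base page so every page gets faulted in. */
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    volatile unsigned char *c = p;
    for (size_t off = 0; off < len; off += page)
        c[off] = 0;

    return p;
}
```

Because the madvise() call precedes the first touch, the faults are taken
on a VMA that already carries the hint, which is exactly what MAP_POPULATE
cannot offer.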


2017-03-16 12:34:53

by Michal Hocko

Subject: Re: MAP_POPULATE vs. MADV_HUGEPAGES

On Wed 15-03-17 18:50:32, Avi Kivity wrote:
> A user is trying to allocate 1TB of anonymous memory in parallel on 48 cores
> (4 NUMA nodes). The kernel ends up spinning in isolate_freepages_block().

Which kernel version is that? What is the THP defrag mode
(/sys/kernel/mm/transparent_hugepage/defrag)?
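
The file lists all modes on one line and marks the active one with square
brackets. A small sketch of reading it from a program follows; the helper
names active_setting() and read_thp_setting() are made up for
illustration, and the set of modes listed differs between kernel versions.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy the bracketed (active) token of `line` into
 * `out`. Returns 0 on success, -1 if there is no "[...]" token or `out`
 * is too small. */
static int active_setting(const char *line, char *out, size_t outlen)
{
    const char *open = strchr(line, '[');
    const char *close = open ? strchr(open, ']') : NULL;
    if (close == NULL)
        return -1;

    size_t n = (size_t)(close - open) - 1;  /* token length without brackets */
    if (n + 1 > outlen)
        return -1;

    memcpy(out, open + 1, n);
    out[n] = '\0';
    return 0;
}

/* Hypothetical helper: read the first line of a sysfs file such as
 * /sys/kernel/mm/transparent_hugepage/defrag and extract the active mode. */
static int read_thp_setting(const char *path, char *out, size_t outlen)
{
    char line[256];
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    char *got = fgets(line, sizeof line, f);
    fclose(f);
    return got ? active_setting(line, out, outlen) : -1;
}
```

Calling read_thp_setting() on the defrag file and on the sibling enabled
file gives the two values asked about in this thread.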

> I thought to help it along by using MAP_POPULATE, but then my MADV_HUGEPAGE
> won't be seen until after mmap() completes, with pages already populated.
> Are MAP_POPULATE and MADV_HUGEPAGE mutually exclusive?

Why do you need MADV_HUGEPAGE?

> Is my only option to serialize those memory allocations, and fault in those
> pages manually? Or perhaps use mlock()?

I am still not 100% sure I see what you are trying to achieve, though.
So you do not want all those processes to contend inside compaction,
while still allocating as many huge pages as possible?

--
Michal Hocko
SUSE Labs

2017-03-16 13:27:36

by Avi Kivity

Subject: Re: MAP_POPULATE vs. MADV_HUGEPAGES



On 03/16/2017 02:34 PM, Michal Hocko wrote:
> On Wed 15-03-17 18:50:32, Avi Kivity wrote:
>> A user is trying to allocate 1TB of anonymous memory in parallel on 48 cores
>> (4 NUMA nodes). The kernel ends up spinning in isolate_freepages_block().
> Which kernel version is that?

A good question; it was 3.10.something-el.something. The user mentioned
above updated to 4.4, and the problem was gone, so it looks like it is a
Red Hat specific problem. I would really like the 3.10.something kernel
to handle this workload well, but I understand that's not this list's
concern.

> What is the THP defrag mode
> (/sys/kernel/mm/transparent_hugepage/defrag)?

The default (always).

>
>> I thought to help it along by using MAP_POPULATE, but then my MADV_HUGEPAGE
>> won't be seen until after mmap() completes, with pages already populated.
>> Are MAP_POPULATE and MADV_HUGEPAGE mutually exclusive?
> Why do you need MADV_HUGEPAGE?

So that I get huge pages even if transparent_hugepage/enabled=madvise.
I'm allocating almost all of the memory of that machine to be used as a
giant cache, so I want it backed by hugepages.

>
>> Is my only option to serialize those memory allocations, and fault in those
>> pages manually? Or perhaps use mlock()?
> I am still not 100% sure I see what you are trying to achieve, though.
> So you do not want all those processes to contend inside compaction,
> while still allocating as many huge pages as possible?

Since the process starts with all of that memory free, there should not
be any compaction going on (or perhaps very minimal eviction/movement of
a few pages here and there). And since it's fixed in later kernels, it
looks like the contention was not really mandated by the workload, just
an artifact of the implementation.

To explain the workload again, the process starts, clones as many
threads as there are logical processors, and each of those threads
mmap()s (and mbind()s) a chunk of memory and then proceeds to touch it.
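
A stripped-down sketch of that pattern, for anyone who wants to reproduce
it, might look like the following. It is not the application's code: the
names struct chunk, map_and_touch() and populate_in_parallel() are
invented here, the thread count is capped at an arbitrary 256, and the
mbind() step is left out because it needs libnuma's <numaif.h>. Build with
-pthread.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

enum { MAX_THREADS = 256 };  /* arbitrary cap for this sketch */

struct chunk {
    size_t len;   /* bytes this thread maps */
    void *addr;   /* result: mapping address, or NULL on failure */
};

/* Thread body: map a chunk, request THP, then touch every base page. */
static void *map_and_touch(void *arg)
{
    struct chunk *c = arg;
    void *p = mmap(NULL, c->len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    (void)madvise(p, c->len, MADV_HUGEPAGE);  /* advisory, errors ignored */
    /* The real workload would mbind() the chunk to a NUMA node here. */

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    volatile unsigned char *b = p;
    for (size_t off = 0; off < c->len; off += page)
        b[off] = 1;

    c->addr = p;
    return NULL;
}

/* Start one thread per chunk and wait for all of them.
 * Returns the number of chunks that were mapped and touched. */
static int populate_in_parallel(struct chunk *chunks, int n)
{
    pthread_t tids[MAX_THREADS];
    int started = 0, ok = 0;

    if (n > MAX_THREADS)
        n = MAX_THREADS;
    for (; started < n; started++) {
        chunks[started].addr = NULL;
        if (pthread_create(&tids[started], NULL, map_and_touch,
                           &chunks[started]) != 0)
            break;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
        if (chunks[i].addr != NULL)
            ok++;
    }
    return ok;
}
```

All threads fault concurrently, so with defrag=always every huge page
fault may enter direct compaction at the same time.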

2017-03-16 14:48:48

by Michal Hocko

Subject: Re: MAP_POPULATE vs. MADV_HUGEPAGES

On Thu 16-03-17 15:26:54, Avi Kivity wrote:
>
>
> On 03/16/2017 02:34 PM, Michal Hocko wrote:
> >On Wed 15-03-17 18:50:32, Avi Kivity wrote:
> >>A user is trying to allocate 1TB of anonymous memory in parallel on 48 cores
> >>(4 NUMA nodes). The kernel ends up spinning in isolate_freepages_block().
> >Which kernel version is that?
>
> A good question; it was 3.10.something-el.something. The user mentioned
> above updated to 4.4, and the problem was gone, so it looks like it is a Red
> Hat specific problem. I would really like the 3.10.something kernel to
> handle this workload well, but I understand that's not this list's concern.
>
> >What is the THP defrag mode
> >(/sys/kernel/mm/transparent_hugepage/defrag)?
>
> The default (always).

The default has changed since then because the THP fault latencies were
just too large. Currently we only allow madvised VMAs to stall, and
even then we try hard to back off sooner rather than later. See
444eb2a449ef ("mm: thp: set THP defrag by default to madvise and add a
stall-free defrag option") merged in 4.4

> >>I thought to help it along by using MAP_POPULATE, but then my MADV_HUGEPAGE
> >>won't be seen until after mmap() completes, with pages already populated.
> >>Are MAP_POPULATE and MADV_HUGEPAGE mutually exclusive?
> >Why do you need MADV_HUGEPAGE?
>
> So that I get huge pages even if transparent_hugepage/enabled=madvise. I'm
> allocating almost all of the memory of that machine to be used as a giant
> cache, so I want it backed by hugepages.

Is there any strong reason to not use hugetlb then? You probably want
that memory reclaimable, right?

> >>Is my only option to serialize those memory allocations, and fault in those
> >>pages manually? Or perhaps use mlock()?
> >I am still not 100% sure I see what you are trying to achieve, though.
> >So you do not want all those processes to contend inside compaction,
> >while still allocating as many huge pages as possible?
>
> Since the process starts with all of that memory free, there should not be
> any compaction going on (or perhaps very minimal eviction/movement of a few
> pages here and there). And since it's fixed in later kernels, it looks like
> the contention was not really mandated by the workload, just an artifact of
> the implementation.

It is possible. A lot has changed since 3.10 times.

> To explain the workload again, the process starts, clones as many threads as
> there are logical processors, and each of those threads mmap()s (and
> mbind()s) a chunk of memory and then proceeds to touch it.

--
Michal Hocko
SUSE Labs

2017-03-16 14:58:10

by Avi Kivity

Subject: Re: MAP_POPULATE vs. MADV_HUGEPAGES



On 03/16/2017 04:48 PM, Michal Hocko wrote:
> On Thu 16-03-17 15:26:54, Avi Kivity wrote:
>>
>> On 03/16/2017 02:34 PM, Michal Hocko wrote:
>>> On Wed 15-03-17 18:50:32, Avi Kivity wrote:
>>>> A user is trying to allocate 1TB of anonymous memory in parallel on 48 cores
>>>> (4 NUMA nodes). The kernel ends up spinning in isolate_freepages_block().
>>> Which kernel version is that?
>> A good question; it was 3.10.something-el.something. The user mentioned
>> above updated to 4.4, and the problem was gone, so it looks like it is a Red
>> Hat specific problem. I would really like the 3.10.something kernel to
>> handle this workload well, but I understand that's not this list's concern.
>>
>>> What is the THP defrag mode
>>> (/sys/kernel/mm/transparent_hugepage/defrag)?
>> The default (always).
> The default has changed since then because the THP fault latencies were
> just too large. Currently we only allow madvised VMAs to stall, and
> even then we try hard to back off sooner rather than later. See
> 444eb2a449ef ("mm: thp: set THP defrag by default to madvise and add a
> stall-free defrag option") merged in 4.4

I see, thanks. So the 4.4 behavior is better mostly due to not trying
so hard.

>
>>>> I thought to help it along by using MAP_POPULATE, but then my MADV_HUGEPAGE
>>>> won't be seen until after mmap() completes, with pages already populated.
>>>> Are MAP_POPULATE and MADV_HUGEPAGE mutually exclusive?
>>> Why do you need MADV_HUGEPAGE?
>> So that I get huge pages even if transparent_hugepage/enabled=madvise. I'm
>> allocating almost all of the memory of that machine to be used as a giant
>> cache, so I want it backed by hugepages.
> Is there any strong reason to not use hugetlb then? You probably want
> that memory reclaimable, right?

Did you mean hugetlbfs? It's a pain to configure, and often requires a
reboot.

We support it via an option, but we prefer the user's first experience
with the application not to be "configure this kernel parameter and reboot".

We don't particularly need that memory to be reclaimable (and in fact we
have an option to mlock() it; if it gets swapped, application
performance tanks).
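
Roughly, the two options combine as in the sketch below. This is an
illustration rather than our actual code: the function name map_cache()
and its parameters are made up, and it assumes Linux with glibc. It tries
an anonymous MAP_HUGETLB mapping first (which only succeeds if huge pages
were reserved beforehand), falls back to THP otherwise, and optionally
mlock()s the result.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: map `len` bytes for the cache.
 * *used_hugetlb is set to 1 if the hugetlb mapping succeeded, 0 if the
 * THP fallback was used. If `lock` is nonzero the range is mlock()ed.
 * Returns the mapping, or NULL on failure. */
static void *map_cache(size_t len, int lock, int *used_hugetlb)
{
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;

    /* Fails if no huge pages are reserved, so a fallback is needed. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   flags | MAP_HUGETLB, -1, 0);
    *used_hugetlb = (p != MAP_FAILED);

    if (p == MAP_FAILED) {
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, -1, 0);
        if (p == MAP_FAILED)
            return NULL;
        (void)madvise(p, len, MADV_HUGEPAGE);  /* advisory, errors ignored */
    }

    /* mlock() also faults the range in; it can fail on RLIMIT_MEMLOCK. */
    if (lock && mlock(p, len) != 0) {
        munmap(p, len);
        return NULL;
    }
    return p;
}
```

The fallback path is what lets the application start without any prior
kernel configuration.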

>
>>>> Is my only option to serialize those memory allocations, and fault in those
>>>> pages manually? Or perhaps use mlock()?
>>> I am still not 100% sure I see what you are trying to achieve, though.
>>> So you do not want all those processes to contend inside compaction,
>>> while still allocating as many huge pages as possible?
>> Since the process starts with all of that memory free, there should not be
>> any compaction going on (or perhaps very minimal eviction/movement of a few
>> pages here and there). And since it's fixed in later kernels, it looks like
>> the contention was not really mandated by the workload, just an artifact of
>> the implementation.
> It is possible. A lot has changed since 3.10 times.

Like the default behavior :).

2017-03-16 15:01:58

by Michal Hocko

Subject: Re: MAP_POPULATE vs. MADV_HUGEPAGES

On Thu 16-03-17 16:56:34, Avi Kivity wrote:
> On 03/16/2017 04:48 PM, Michal Hocko wrote:
> >On Thu 16-03-17 15:26:54, Avi Kivity wrote:
[...]
> >>>What is the THP defrag mode
> >>>(/sys/kernel/mm/transparent_hugepage/defrag)?
> >>The default (always).
> >The default has changed since then because the THP fault latencies were
> >just too large. Currently we only allow madvised VMAs to stall, and
> >even then we try hard to back off sooner rather than later. See
> >444eb2a449ef ("mm: thp: set THP defrag by default to madvise and add a
> >stall-free defrag option") merged in 4.4
>
> I see, thanks. So the 4.4 behavior is better mostly due to not trying so
> hard.

Please note there were many other patches in the compaction code as
well.
--
Michal Hocko
SUSE Labs