LinuxLists.cc - Re: [PATCH v2] mm: Reduce memory bloat with THP

2018-01-19 12:51:17

Subject: Re: [PATCH v2] mm: Reduce memory bloat with THP

On Thu 18-01-18 15:33:16, Nitin Gupta wrote:
> From: Nitin Gupta <[email protected]>
>
> Currently, if the THP enabled policy is "always", or the mode
> is "madvise" and a region is marked as MADV_HUGEPAGE, a hugepage
> is allocated on a page fault if the pud or pmd is empty. This
> yields the best VA translation performance, but increases memory
> consumption if some small page ranges within the huge page are
> never accessed.

Yes, this is true but hardly unexpected for MADV_HUGEPAGE or THP always
users.

> An alternate behavior for such page faults is to install a
> hugepage only when a region is actually found to be (almost)
> fully mapped and active. This is a compromise between
> translation performance and memory consumption. Currently there
> is no way for an application to choose this compromise for the
> page fault conditions above.

Is that really true? We have /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none.
This is not reflected during the PF, of course, but you can control the
behavior there as well, either by the global setting or by a per-process
prctl.
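
To make the khugepaged knob concrete, here is a minimal sketch (Python is used only for illustration) that reads max_ptes_none and converts it into the smallest number of already-mapped base pages that still lets khugepaged collapse a PMD-sized range. It assumes 512 base pages per huge page (4 KiB pages and 2 MiB THP, as on x86-64) and ignores khugepaged's other checks; the helper names are invented for this example.

```python
from pathlib import Path

# Sysfs knob named in the message above.
MAX_PTES_NONE = Path("/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none")

# Assumption: 4 KiB base pages and 2 MiB huge pages (x86-64).
HPAGE_PMD_NR = 512


def read_max_ptes_none(path=MAX_PTES_NONE):
    """Return the current setting, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def min_mapped_pages(max_ptes_none, pages_per_hugepage=HPAGE_PMD_NR):
    """Fewest already-mapped base pages that still permit a collapse.

    khugepaged tolerates at most `max_ptes_none` empty PTEs in the range,
    so the rest must already be mapped.
    """
    if not 0 <= max_ptes_none < pages_per_hugepage:
        raise ValueError("max_ptes_none must be in [0, pages_per_hugepage)")
    return pages_per_hugepage - max_ptes_none


if __name__ == "__main__":
    current = read_max_ptes_none()
    if current is None:
        print("max_ptes_none not readable (no THP support or not Linux?)")
    else:
        print(f"max_ptes_none={current}: collapse needs "
              f">= {min_mapped_pages(current)} mapped base pages")
```

With a setting of 511, a single mapped base page is enough for a collapse; with 0, all 512 must already be mapped.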

> With this change, whenever an application issues MADV_DONTNEED on a
> memory region, the region is marked as "space-efficient". For such
> regions, a hugepage is not immediately allocated on first write.

Kirill didn't like it in the previous version and I do not like this
either. You are adding a very subtle side effect which might be completely
unexpected. Consider a userspace memory allocator which uses MADV_DONTNEED
to free up unused memory. Now you have basically put it out of THP
usage.

If memory is really scarce, then we have MADV_NOHUGEPAGE.
--
Michal Hocko
SUSE Labs

2018-01-24 20:33:38

by Nitin Gupta

Subject: Re: [PATCH v2] mm: Reduce memory bloat with THP

On 1/19/18 4:49 AM, Michal Hocko wrote:
> On Thu 18-01-18 15:33:16, Nitin Gupta wrote:
>> From: Nitin Gupta <[email protected]>
>>
>> Currently, if the THP enabled policy is "always", or the mode
>> is "madvise" and a region is marked as MADV_HUGEPAGE, a hugepage
>> is allocated on a page fault if the pud or pmd is empty. This
>> yields the best VA translation performance, but increases memory
>> consumption if some small page ranges within the huge page are
>> never accessed.
>
> Yes, this is true but hardly unexpected for MADV_HUGEPAGE or THP always
> users.
>

Yes, allocating a hugepage on first touch is the current behavior for
the above two cases. However, I see issues with this current behavior.
Firstly, THP=always mode is often too aggressive/wasteful to be useful
for realistic workloads. For THP=madvise, users may want to back the
active parts of a memory region with hugepages while avoiding aggressive
hugepage allocation on first touch. Or, they may really want the current
behavior.

With this patch, users would have the option to pick what behavior they
want by passing hints to the kernel in the form of MADV_HUGEPAGE and
MADV_DONTNEED madvise calls.

>> An alternate behavior for such page faults is to install a
>> hugepage only when a region is actually found to be (almost)
>> fully mapped and active. This is a compromise between
>> translation performance and memory consumption. Currently there
>> is no way for an application to choose this compromise for the
>> page fault conditions above.
>
> Is that really true? We have /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
> This is not reflected during the PF of course but you can control the
> behavior there as well. Either by the global setting or a per proces
> prctl.
>

I think this part of the patch description needs some rewording. This patch
is meant to change *only* the page fault behavior.

Once pages are installed, khugepaged does its job as usual, using
max_ptes_none and other config values. I'm not trying to change any
khugepaged behavior here.

>> With this change, whenever an application issues MADV_DONTNEED on a
>> memory region, the region is marked as "space-efficient". For such
>> regions, a hugepage is not immediately allocated on first write.
>
> Kirill didn't like it in the previous version and I do not like this
> either. You are adding a very subtle side effect which might completely
> unexpected. Consider userspace memory allocator which uses MADV_DONTNEED
> to free up unused memory. Now you have put it out of THP usage
> basically.
>

Userspace may want a region to be considered by khugepaged while opting
out of hugepage allocation on first touch. Asking userspace memory
allocators to track and reclaim unused parts of a THP-allocated
hugepage does not seem right, as the kernel can use simple userspace
hints to avoid allocating extra memory in the first place.

I agree that this patch is adding a subtle side effect which may take
some applications by surprise. However, I often see the opposite too:
for many workloads, disabling THP is the first advice given, as this
aggressive allocation of hugepages on first touch is unexpected and too
wasteful. For example:

1) Disabling THP for TokuDB (Storage engine for MySQL, MariaDB)
http://www.chriscalender.com/disabling-transparent-hugepages-for-tokudb/

2) Disable THP on MongoDB
https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/

3) Disable THP for Couchbase Server
https://blog.couchbase.com/often-overlooked-linux-os-tweaks/

4) Redis
http://antirez.com/news/84

> If the memory is used really scarce then we have MADV_NOHUGEPAGE.
>

It's not really about memory scarcity but a more efficient use of it.
Applications may want hugepage benefits without requiring any changes to
app code which is what THP is supposed to provide, while still avoiding
memory bloat.

-Nitin

2018-01-25 00:48:28

by Zi Yan

Subject: Re: [PATCH v2] mm: Reduce memory bloat with THP

>
>>> With this change, whenever an application issues MADV_DONTNEED on a
>>> memory region, the region is marked as "space-efficient". For such
>>> regions, a hugepage is not immediately allocated on first write.
>>
>> Kirill didn't like it in the previous version and I do not like this
>> either. You are adding a very subtle side effect which might completely
>> unexpected. Consider userspace memory allocator which uses MADV_DONTNEED
>> to free up unused memory. Now you have put it out of THP usage
>> basically.
>>
>
> Userpsace may want a region to be considered by khugepaged while opting
> out of hugepage allocation on first touch. Asking userspace memory
> allocators to have to track and reclaim unused parts of a THP allocated
> hugepage does not seems right, as the kernel can use simple userspace
> hints to avoid allocating extra memory in the first place.
>
> I agree that this patch is adding a subtle side-effect which may take
> some applications by surprise. However, I often see the opposite too:
> for many workloads, disabling THP is the first advise as this aggressive
> allocation of hugepages on first touch is unexpected and is too
> wasteful. For e.g.:
>
> 1) Disabling THP for TokuDB (Storage engine for MySQL, MariaDB)
> http://www.chriscalender.com/disabling-transparent-hugepages-for-tokudb/
>
> 2) Disable THP on MongoDB
> https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/
>
> 3) Disable THP for Couchbase Server
> https://blog.couchbase.com/often-overlooked-linux-os-tweaks/
>
> 4) Redis
> http://antirez.com/news/84
>
>
>> If the memory is used really scarce then we have MADV_NOHUGEPAGE.
>>
>
> It's not really about memory scarcity but a more efficient use of it.
> Applications may want hugepage benefits without requiring any changes to
> app code which is what THP is supposed to provide, while still avoiding
> memory bloat.
>

I read these links and found that there are mainly two complaints:
1. THP causes latency spikes, because direct compaction slows down THP allocation.
2. THP bloats the memory footprint when jemalloc uses MADV_DONTNEED to return memory
ranges smaller than the THP size and fails because of THP.

The first complaint is not related to this patch.

For the second one, at least with recent kernels, MADV_DONTNEED splits THPs and returns
the memory range you specified in madvise(). Am I missing anything?
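
The userspace-visible part of this can be checked with a small sketch (Python 3.8+, used here only for illustration). It maps 2 MiB of private anonymous memory, dirties it, and releases a sub-range with MADV_DONTNEED. The 2 MiB size and the function name are assumptions of this example, and because the mapping is not guaranteed to be 2 MiB aligned, the kernel may or may not back it with a THP. What the sketch can show is that the released range reads back as zeros while the rest keeps its data; it does not observe the THP split itself.

```python
import mmap

HUGE_PAGE_SIZE = 2 * 1024 * 1024  # assumption: PMD-sized huge page on x86-64
BASE_PAGE_SIZE = mmap.PAGESIZE


def release_subrange(offset_pages=1, npages=1):
    """Dirty a 2 MiB private anonymous mapping, then MADV_DONTNEED a part of it.

    Returns (released, kept): the bytes read back from the released range
    and the bytes in front of it.
    """
    region = mmap.mmap(-1, HUGE_PAGE_SIZE,
                       flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        # Ask for THP backing where the constant and the kernel support it.
        advice = getattr(mmap, "MADV_HUGEPAGE", None)
        if advice is not None:
            try:
                region.madvise(advice)
            except OSError:
                pass  # kernel built without THP; continue with base pages

        region[:] = b"\xff" * HUGE_PAGE_SIZE  # touch every page

        start = offset_pages * BASE_PAGE_SIZE
        length = npages * BASE_PAGE_SIZE
        region.madvise(mmap.MADV_DONTNEED, start, length)

        released = region[start:start + length]
        kept = region[0:start]
        return released, kept
    finally:
        region.close()


if __name__ == "__main__":
    released, kept = release_subrange()
    print("released range is zero-filled:", released == bytes(len(released)))
    print("preceding range is intact:", kept == b"\xff" * len(kept))
```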

--
Best Regards,
Yan Zi

2018-01-25 09:59:52

by Michal Hocko

Subject: Re: [PATCH v2] mm: Reduce memory bloat with THP

On Fri 19-01-18 12:59:17, Nitin Gupta wrote:
> On 1/19/18 4:49 AM, Michal Hocko wrote:
> > On Thu 18-01-18 15:33:16, Nitin Gupta wrote:
> >> From: Nitin Gupta <[email protected]>
> >>
> >> Currently, if the THP enabled policy is "always", or the mode
> >> is "madvise" and a region is marked as MADV_HUGEPAGE, a hugepage
> >> is allocated on a page fault if the pud or pmd is empty. This
> >> yields the best VA translation performance, but increases memory
> >> consumption if some small page ranges within the huge page are
> >> never accessed.
> >
> > Yes, this is true but hardly unexpected for MADV_HUGEPAGE or THP always
> > users.
> >
>
> Yes, allocating hugepage on first touch is the current behavior for
> above two cases. However, I see issues with this current behavior.
> Firstly, THP=always mode is often too aggressive/wasteful to be useful
> for any realistic workloads. For THP=madvise, users may want to back
> active parts of memory region with hugepages while avoiding aggressive
> hugepage allocation on first touch. Or, they may really want the current
> behavior.

Then they should use THP=never and rely on khugepaged to compact
madvise regions. This will avoid the first-touch problem, and you can also
control how large a portion of the THP has to be mapped already.

> With this patch, users would have the option to pick what behavior they
> want by passing hints to the kernel in the form of MADV_HUGEPAGE and
> MADV_DONTNEED madvise calls.

more on this below

[...]
> >> With this change, whenever an application issues MADV_DONTNEED on a
> >> memory region, the region is marked as "space-efficient". For such
> >> regions, a hugepage is not immediately allocated on first write.
> >
> > Kirill didn't like it in the previous version and I do not like this
> > either. You are adding a very subtle side effect which might completely
> > unexpected. Consider userspace memory allocator which uses MADV_DONTNEED
> > to free up unused memory. Now you have put it out of THP usage
> > basically.
> >
>
> Userpsace may want a region to be considered by khugepaged while opting
> out of hugepage allocation on first touch. Asking userspace memory
> allocators to have to track and reclaim unused parts of a THP allocated
> hugepage does not seems right, as the kernel can use simple userspace
> hints to avoid allocating extra memory in the first place.

Yes. This is in sync with what I wrote. Allocators shouldn't care and
that is why MADV_DONTNEED with a side effect is simply wrong.

> I agree that this patch is adding a subtle side-effect which may take
> some applications by surprise. However, I often see the opposite too:
> for many workloads, disabling THP is the first advise as this aggressive
> allocation of hugepages on first touch is unexpected and is too
> wasteful. For e.g.:

Ohh, absolutely. And that is why we have changed the default upstream in
commit 444eb2a449ef ("mm: thp: set THP defrag by default to madvise and add a
stall-free defrag option").
--
Michal Hocko
SUSE Labs

2018-01-25 19:44:31

by Nitin Gupta

Subject: Re: [PATCH v2] mm: Reduce memory bloat with THP

On 01/24/2018 04:47 PM, Zi Yan wrote:
>>>> With this change, whenever an application issues MADV_DONTNEED on a
>>>> memory region, the region is marked as "space-efficient". For such
>>>> regions, a hugepage is not immediately allocated on first write.
>>> Kirill didn't like it in the previous version and I do not like this
>>> either. You are adding a very subtle side effect which might completely
>>> unexpected. Consider userspace memory allocator which uses MADV_DONTNEED
>>> to free up unused memory. Now you have put it out of THP usage
>>> basically.
>>>
>> Userpsace may want a region to be considered by khugepaged while opting
>> out of hugepage allocation on first touch. Asking userspace memory
>> allocators to have to track and reclaim unused parts of a THP allocated
>> hugepage does not seems right, as the kernel can use simple userspace
>> hints to avoid allocating extra memory in the first place.
>>
>> I agree that this patch is adding a subtle side-effect which may take
>> some applications by surprise. However, I often see the opposite too:
>> for many workloads, disabling THP is the first advise as this aggressive
>> allocation of hugepages on first touch is unexpected and is too
>> wasteful. For e.g.:
>>
>> 1) Disabling THP for TokuDB (Storage engine for MySQL, MariaDB)
>> http://www.chriscalender.com/disabling-transparent-hugepages-for-tokudb/
>>
>> 2) Disable THP on MongoDB
>> https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/
>>
>> 3) Disable THP for Couchbase Server
>> https://blog.couchbase.com/often-overlooked-linux-os-tweaks/
>>
>> 4) Redis
>> http://antirez.com/news/84
>>
>>
>>> If the memory is used really scarce then we have MADV_NOHUGEPAGE.
>>>
>> It's not really about memory scarcity but a more efficient use of it.
>> Applications may want hugepage benefits without requiring any changes to
>> app code which is what THP is supposed to provide, while still avoiding
>> memory bloat.
>>
> I read these links and find that there are mainly two complains:
> 1. THP causes latency spikes, because direction compaction slows down THP allocation,
> 2. THP bloats memory footprint when jemalloc uses MADV_DONTNEED to return memory ranges smaller than
> THP size and fails because of THP.
>
> The first complain is not related to this patch.

I'm trying to address many different THP issues and memory bloat is
first among them.
> For second one, at least with recent kernels, MADV_DONTNEED splits THPs and returns the memory range you
> specified in madvise(). Am I missing anything?
>

Yes, MADV_DONTNEED splits THPs and releases the requested range, but
this does not solve the issue of the aggressive
alloc-hugepage-on-first-touch policy of THP=madvise on MADV_HUGEPAGE
regions. Sure, some workloads may prefer that policy, but for
applications that don't, this patch gives them an option to hint to the
kernel to go for gradual hugepage promotion via khugepaged only (and
not on first touch).

It's not good if an application has to track which parts of its
(implicitly allocated) hugepage are in use and which sub-parts are free
so that it can issue MADV_DONTNEED calls on them. This approach really
does not make THP "transparent" and requires a lot of mm tracking code
in userspace.

Nitin

2018-01-25 21:14:57

by Mel Gorman

Subject: Re: [PATCH v2] mm: Reduce memory bloat with THP

On Thu, Jan 25, 2018 at 11:41:03AM -0800, Nitin Gupta wrote:
> >> It's not really about memory scarcity but a more efficient use of it.
> >> Applications may want hugepage benefits without requiring any changes to
> >> app code which is what THP is supposed to provide, while still avoiding
> >> memory bloat.
> >>
> > I read these links and find that there are mainly two complains:
> > 1. THP causes latency spikes, because direction compaction slows down THP allocation,
> > 2. THP bloats memory footprint when jemalloc uses MADV_DONTNEED to return memory ranges smaller than
> > THP size and fails because of THP.
> >
> > The first complain is not related to this patch.
>
> I'm trying to address many different THP issues and memory bloat is
> first among them.

Expecting userspace to get this right is probably going to go sideways.
It'll be screwed up and be sub-optimal or have odd semantics for existing
madvise flags. The fact is that an application may not even know if it's
going to be sparsely using memory in advance if it's a computation load
modelling from unknown input data.

I suggest you read the old Talluri paper "Surpassing the TLB Performance
of Superpages with Less Operating System Support" and pay attention to
Section 4. There it discusses a page reservation scheme whereby on fault
a naturally aligned set of base pages is reserved and only one correctly
placed base page is inserted at the faulting address. It was tied into
a hypothetical piece of hardware that doesn't exist to give best-effort
support for superpages, so it does not directly help you, but the initial
idea is sound. There are holes in the paper from today's perspective but
it was written in the 90's.

From there, read "Transparent operating system support for superpages"
by Navarro, particularly chapter 4 paying attention to the parts where
it talks about opportunism and promotion threshold.

Superficially, it goes like this:

1. On fault, reserve a THP in the allocator and use one base page that
is correctly-aligned for the faulting address. By correctly-aligned,
I mean that you use a base page whose offset would be naturally contiguous
if it ever was part of a huge page.
2. On subsequent faults, attempt to use a base page that is naturally
aligned to be part of a THP.
3. When a "threshold" of base pages has been inserted, allocate the remaining
pages and promote it to a THP.
4. If there is memory pressure, spill "reserved" pages into the main
allocation pool and lose the opportunity to promote (which will need
khugepaged to recover).

By definition, a promotion threshold of 1 would be the existing scheme
of allocating a THP on the first fault and some users will want that. It
also should be the default to avoid unexpected overhead. For workloads
where memory is being sparsely addressed and the increased overhead of
THP is unwelcome, the threshold should be tuned higher with a maximum
possible value of HPAGE_PMD_NR.
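
The four steps and the threshold rule can be modelled in a few lines. What follows is a toy userspace model in Python, not kernel code: the class and method names are invented for illustration, HPAGE_PMD_NR is assumed to be 512 (4 KiB base pages, 2 MiB THP), and recovery by khugepaged after a spill is deliberately left out.

```python
HPAGE_PMD_NR = 512  # assumption: base pages per PMD-sized huge page on x86-64


class Reservation:
    """Toy model of one naturally aligned, huge-page-sized reservation."""

    def __init__(self, threshold):
        if not 1 <= threshold <= HPAGE_PMD_NR:
            raise ValueError("threshold must be in [1, HPAGE_PMD_NR]")
        self.threshold = threshold
        self.mapped = set()    # indexes of base pages inserted so far
        self.promoted = False  # True once the range is backed by a THP
        self.spilled = False   # True once reserved pages were given back

    def resident_pages(self):
        """Base pages of memory currently consumed by this range."""
        return HPAGE_PMD_NR if self.promoted else len(self.mapped)

    def fault(self, index):
        """Handle a fault on base page `index`; return resident pages afterwards."""
        if not 0 <= index < HPAGE_PMD_NR:
            raise IndexError("index outside the reserved range")
        if not self.promoted:
            # Steps 1 and 2: insert the correctly aligned base page.
            self.mapped.add(index)
            # Step 3: promote once enough distinct base pages are present.
            if not self.spilled and len(self.mapped) >= self.threshold:
                self.promoted = True
        return self.resident_pages()

    def spill(self):
        """Step 4: under memory pressure, release the unused reserved pages."""
        if not self.promoted:
            self.spilled = True


if __name__ == "__main__":
    for threshold in (1, 64, HPAGE_PMD_NR):
        r = Reservation(threshold)
        for i in range(32):  # a workload that touches only 32 base pages
            r.fault(i)
        print(f"threshold={threshold}: resident base pages = {r.resident_pages()}")
```

In this model a threshold of 1 reproduces today's first-touch behavior, while a sparse workload under a higher threshold only pays for the base pages it actually touched.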

It's non-trivial to do this because at minimum a page fault has to check
if there is a potential promotion candidate by checking the PTEs around
the faulting address, searching for a correctly-aligned base page that is
already inserted. If there is, then check if the correctly aligned base
page for the current faulting address is free and, if so, use it. It'll
also then need to check the remaining PTEs to see if the promotion
threshold has been reached and, if so, promote it to a THP (or else teach
khugepaged to do an in-place promotion if possible). In other words,
implementing the promotion threshold is both hard and not free.

However, if it did exist then the only tunable would be the "promotion
threshold" and applications would not need any special awareness of their
address space.

--
Mel Gorman
SUSE Labs

2018-01-25 22:31:45

by Andrea Arcangeli

Subject: Re: [PATCH v2] mm: Reduce memory bloat with THP

On Thu, Jan 25, 2018 at 11:41:03AM -0800, Nitin Gupta wrote:
> I'm trying to address many different THP issues and memory bloat is
> first among them.

You quoted redis in an earlier email, but the redis issue has nothing to
do with MADV_DONTNEED.

I can quickly explain the redis issue.

Redis uses fork() to create a readonly copy of the memory to do
snapshotting in the child, while parent still writes to the memory.

THP CoWs in the parent are higher latency than 4k CoWs. They also take
more memory, but that's secondary; in fact, the maximum waste of memory
in this model will reach the same worst case (x2) with 4k CoWs
too, no difference.

The problem is the copy-user there: it adds latency and wastes CPU.
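
A back-of-the-envelope sketch of that copy cost, in Python for illustration only. It assumes each first write after fork() copies exactly one CoW granule (4 KiB or 2 MiB); the function name and the example write pattern are made up.

```python
KIB = 1024
MIB = 1024 * KIB


def cow_bytes_copied(write_offsets, granule):
    """Bytes copied when the first write to each granule triggers one CoW copy."""
    touched = {offset // granule for offset in write_offsets}
    return len(touched) * granule


if __name__ == "__main__":
    # 100 small writes in the parent, each landing in a different 2 MiB range.
    sparse_writes = [i * 2 * MIB for i in range(100)]
    print("4k CoW copies :", cow_bytes_copied(sparse_writes, 4 * KIB) // KIB, "KiB")
    print("THP CoW copies:", cow_bytes_copied(sparse_writes, 2 * MIB) // MIB, "MiB")

    # Worst case: every 4k page of an 8 MiB region is written before the
    # child finishes. Both granularities end up copying the whole region.
    dense_writes = range(0, 8 * MIB, 4 * KIB)
    print("dense, 4k :", cow_bytes_copied(dense_writes, 4 * KIB) // MIB, "MiB")
    print("dense, THP:", cow_bytes_copied(dense_writes, 2 * MIB) // MIB, "MiB")
```

For the sparse pattern the THP case copies 512 times more data per fault, which is where the extra latency and CPU go, while the dense pattern shows why the x2 worst case is the same either way.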

Redis can simply use userfaultfd WP mode once it is upstream, and
then it will use 4k granularity, as the granularity of the writeprotect
userfaults is up to userland to decide.

The main benefit is that it can avoid the worst-case degradation of using
x2 physical memory (disabling THP makes zero difference in that
regard: if storage is very slow, x2 physical memory can still be used
if very unlucky). It can throttle the WP writes (anon CoW cannot
throttle), and it can avoid fork() altogether so that it shares the same
pagetables. It can also put the "user-CoWed" pages (in the fault
handler) at the front of the write queue, to be written first, using a
ring buffer for the CoWed 4k pages, to keep memory utilization even
lower even though THP stays on at all times for all pages that didn't get
a CoW yet. This will be an optimal snapshot method, much better than
fork(), no matter whether 4k pages or THP are backing the memory.

In short, MADV_DONTNEED has nothing to do with redis. If mysql gets an
improvement, surely you can post a benchmark instead of URLs.

If you want low memory usage at the cost of potentially slower
performance overall, you should use transparent_hugepage=madvise.

The cases where THP is not a good tradeoff are generally related to
lower performance in copy-user or the higher cost of compaction if the
app is only ever doing short-lived allocations.

If you post a reproducible benchmark with a real-life app that gets an
improvement with whatever change you're doing, it'll be possible to
evaluate it.

Thanks,
Andrea

2018-01-25 22:41:23

by Andrea Arcangeli

Subject: Re: [PATCH v2] mm: Reduce memory bloat with THP

On Thu, Jan 25, 2018 at 10:58:32AM +0100, Michal Hocko wrote:
> Ohh, absolutely. And that is why we have changed the default in upstream
> 444eb2a449ef ("mm: thp: set THP defrag by default to madvise and add a
> stall-free defrag option")

Agreed, that direct compaction change should already address the cases
quoted in the other URLs.

One of the URLs is about using fork() to snapshot a nosql db state;
that one can't be helped by the above commit, but it's still unrelated
to MADV_DONTNEED or memory bloat.

It would be possible to fully fix the use of fork() for snapshotting
without userfaultfd WP mode, by just adding an madvise that forces 4k
CoWs on top of 2M THP and calling it in the parent that keeps writing
to memory while the child is writing the readonly copy to disk. But I
believe userfaultfd WP will be far better, as it provides so many
other advantages (i.e. avoiding fork() in the first place and using
pthread_create, being able to throttle on I/O and limit the max memory
usage to something less than x2 RAM without the risk of triggering the
OOM killer, having a ring that is written immediately to keep the mem
utilization low, etc.).

Thanks,
Andrea

2018-02-01 01:14:39

by Nitin Gupta

Subject: Re: [PATCH v2] mm: Reduce memory bloat with THP

On 01/25/2018 01:13 PM, Mel Gorman wrote:
> On Thu, Jan 25, 2018 at 11:41:03AM -0800, Nitin Gupta wrote:
>>>> It's not really about memory scarcity but a more efficient use of it.
>>>> Applications may want hugepage benefits without requiring any changes to
>>>> app code which is what THP is supposed to provide, while still avoiding
>>>> memory bloat.
>>>>
>>> I read these links and find that there are mainly two complains:
>>> 1. THP causes latency spikes, because direction compaction slows down THP allocation,
>>> 2. THP bloats memory footprint when jemalloc uses MADV_DONTNEED to return memory ranges smaller than
>>> THP size and fails because of THP.
>>>
>>> The first complain is not related to this patch.
>>
>> I'm trying to address many different THP issues and memory bloat is
>> first among them.
>
> Expecting userspace to get this right is probably going to go sideways.
> It'll be screwed up and be sub-optimal or have odd semantics for existing
> madvise flags. The fact is that an application may not even know if it's
> going to be sparsely using memory in advance if it's a computation load
> modelling from unknown input data.
>
> I suggest you read the old Talluri paper "Superpassing the TLB Performance
> of Superpages with Less Operating System Support" and pay attention to
> Section 4. There it discusses a page reservation scheme whereby on fault
> a naturally aligned set of base pages are reserved and only one correctly
> placed base page is inserted into the faulting address. It was tied into
> a hypothetical piece of hardware that doesn't exist to give best-effort
> support for superpages so it does not directly help you but the initial
> idea is sound. There are holes in the paper from todays perspective but
> it was written in the 90's.
>
> From there, read "Transparent operating system support for superpages"
> by Navarro, particularly chapter 4 paying attention to the parts where
> it talks about opportunism and promotion threshold.
>
> Superficially, it goes like this
>
> 1. On fault, reserve a THP in the allocator and use one base page that
> is correctly-aligned for the faulting addresses. By correctly-aligned,
> I mean that you use base page whose offset would be naturally contiguous
> if it ever was part of a huge page.
> 2. On subsequent faults, attempt to use a base page that is naturally
> aligned to be a THP
> 3. When a "threshold" of base pages are inserted, allocate the remaining
> pages and promote it to a THP
> 4. If there is memory pressure, spill "reserved" pages into the main
> allocation pool and lose the opportunity to promote (which will need
> khugepaged to recover)
>
> By definition, a promotion threshold of 1 would be the existing scheme
> of allocation a THP on the first fault and some users will want that. It
> also should be the default to avoid unexpected overhead. For workloads
> where memory is being sparsely addressed and the increased overhead of
> THP is unwelcome then the threshold should be tuned higher with a maximum
> possible value of HPAGE_PMD_NR.
>
> It's non-trivial to do this because at minimum a page fault has to check
> if there is a potential promotion candidate by checking the PTEs around
> the faulting address searching for a correctly-aligned base page that is
> already inserted. If there is, then check if the correctly aligned base
> page for the current faulting address is free and if so use it. It'll
> also then need to check the remaining PTEs to see if both the promotion
> threshold has been reached and if so, promote it to a THP (or else teach
> khugepaged to do an in-place promotion if possible). In other words,
> implementing the promotion threshold is both hard and it's not free.
>
> However, if it did exist then the only tunable would be the "promotion
> threshold" and applications would not need any special awareness of their
> address space.
>

I went through both references you mentioned and I really like the
idea of reservation-based hugepage allocation. Navarro also extends
the idea to allow multiple hugepage sizes to be used (as supported by
the underlying hardware), which was next in order of what I wanted to do in
THP.

So, please ignore this patch; I will work towards implementing the
ideas in these papers.

Thanks for the feedback.

Nitin

2018-02-01 10:11:02

by Mel Gorman

Subject: Re: [PATCH v2] mm: Reduce memory bloat with THP

On Wed, Jan 31, 2018 at 05:09:48PM -0800, Nitin Gupta wrote:
> >
> > It's non-trivial to do this because at minimum a page fault has to check
> > if there is a potential promotion candidate by checking the PTEs around
> > the faulting address searching for a correctly-aligned base page that is
> > already inserted. If there is, then check if the correctly aligned base
> > page for the current faulting address is free and if so use it. It'll
> > also then need to check the remaining PTEs to see if both the promotion
> > threshold has been reached and if so, promote it to a THP (or else teach
> > khugepaged to do an in-place promotion if possible). In other words,
> > implementing the promotion threshold is both hard and it's not free.
> >
> > However, if it did exist then the only tunable would be the "promotion
> > threshold" and applications would not need any special awareness of their
> > address space.
> >
>
> I went through both references you mentioned and I really like the
> idea of reservation-based hugepage allocation. Navarro also extends
> the idea to allow multiple hugepage sizes to be used (as support by
> underlying hardware) which was next in order of what I wanted to do in
> THP.
>

Don't sweat too much about the multiple page size part. At the time Navarro
was writing, it was expected that hardware would support multiple page
sizes with fine granularity (e.g. what Itanium did). Just covering the PMD
huge page size would go a long way towards balancing memory consumption
and huge page usage.

> So, please ignore this patch and I would work towards implementing
> ideas in these papers.
>
> Thanks for the feedback.
>

My pleasure.

--
Mel Gorman
SUSE Labs

2018-02-01 10:28:31

by Kirill A. Shutemov

Subject: Re: [PATCH v2] mm: Reduce memory bloat with THP

On Thu, Jan 25, 2018 at 09:13:03PM +0000, Mel Gorman wrote:
> On Thu, Jan 25, 2018 at 11:41:03AM -0800, Nitin Gupta wrote:
> > >> It's not really about memory scarcity but a more efficient use of it.
> > >> Applications may want hugepage benefits without requiring any changes to
> > >> app code which is what THP is supposed to provide, while still avoiding
> > >> memory bloat.
> > >>
> > > I read these links and find that there are mainly two complains:
> > > 1. THP causes latency spikes, because direction compaction slows down THP allocation,
> > > 2. THP bloats memory footprint when jemalloc uses MADV_DONTNEED to return memory ranges smaller than
> > > THP size and fails because of THP.
> > >
> > > The first complain is not related to this patch.
> >
> > I'm trying to address many different THP issues and memory bloat is
> > first among them.
>
> Expecting userspace to get this right is probably going to go sideways.
> It'll be screwed up and be sub-optimal or have odd semantics for existing
> madvise flags. The fact is that an application may not even know if it's
> going to be sparsely using memory in advance if it's a computation load
> modelling from unknown input data.
>
> I suggest you read the old Talluri paper "Superpassing the TLB Performance
> of Superpages with Less Operating System Support" and pay attention to
> Section 4. There it discusses a page reservation scheme whereby on fault
> a naturally aligned set of base pages are reserved and only one correctly
> placed base page is inserted into the faulting address. It was tied into
> a hypothetical piece of hardware that doesn't exist to give best-effort
> support for superpages so it does not directly help you but the initial
> idea is sound. There are holes in the paper from todays perspective but
> it was written in the 90's.
>
> From there, read "Transparent operating system support for superpages"
> by Navarro, particularly chapter 4 paying attention to the parts where
> it talks about opportunism and promotion threshold.
>
> Superficially, it goes like this
>
> 1. On fault, reserve a THP in the allocator and use one base page that
> is correctly-aligned for the faulting addresses. By correctly-aligned,
> I mean that you use base page whose offset would be naturally contiguous
> if it ever was part of a huge page.
> 2. On subsequent faults, attempt to use a base page that is naturally
> aligned to be a THP
> 3. When a "threshold" of base pages are inserted, allocate the remaining
> pages and promote it to a THP
> 4. If there is memory pressure, spill "reserved" pages into the main
> allocation pool and lose the opportunity to promote (which will need
> khugepaged to recover)
>
> By definition, a promotion threshold of 1 would be the existing scheme
> of allocation a THP on the first fault and some users will want that. It
> also should be the default to avoid unexpected overhead. For workloads
> where memory is being sparsely addressed and the increased overhead of
> THP is unwelcome then the threshold should be tuned higher with a maximum
> possible value of HPAGE_PMD_NR.
>
> It's non-trivial to do this because at minimum a page fault has to check
> if there is a potential promotion candidate by checking the PTEs around
> the faulting address searching for a correctly-aligned base page that is
> already inserted. If there is, then check if the correctly aligned base
> page for the current faulting address is free and if so use it. It'll
> also then need to check the remaining PTEs to see if both the promotion
> threshold has been reached and if so, promote it to a THP (or else teach
> khugepaged to do an in-place promotion if possible). In other words,
> implementing the promotion threshold is both hard and it's not free.

"not free" is understatement.

Converting a PTE page table to a PMD would require down_write(mmap_sem).
Doing it from within the page fault path would also mean that we need to drop
the down_read(mmap_sem) we hold, re-acquire it with down_write(), find the vma
again and re-validate that nothing changed in the meantime...

That's an interesting exercise, but I'm skeptical it would result in anything
practical.

--
Kirill A. Shutemov

2018-02-01 10:47:10

by Mel Gorman

Subject: Re: [PATCH v2] mm: Reduce memory bloat with THP

On Thu, Feb 01, 2018 at 01:27:30PM +0300, Kirill A. Shutemov wrote:
> > It's non-trivial to do this because at minimum a page fault has to check
> > if there is a potential promotion candidate by checking the PTEs around
> > the faulting address searching for a correctly-aligned base page that is
> > already inserted. If there is, then check if the correctly aligned base
> > page for the current faulting address is free and if so use it. It'll
> > also then need to check the remaining PTEs to see if both the promotion
> > threshold has been reached and if so, promote it to a THP (or else teach
> > khugepaged to do an in-place promotion if possible). In other words,
> > implementing the promotion threshold is both hard and it's not free.
>
> "not free" is understatement.
>
> Converting PTE page table to PMD would require down_write(mmap_sem).
> Doing it from within page fault path would also mean that we need to drop
> down_read(mmap) we hold, re-aquaire it with down_write(), find the vma again
> and re-validate that nothing changed in meanwhile...
>
> That's an interesting exercise, but I'm skeptical it would result in anything
> practical.
>

The details are painful but we're somewhat caught between a rock and a
hard place for workloads that sparsely reference memory and want to avoid
excessive memory usage. Given that the cost will be high, it may need to
dynamically detect what the promotion threshold is -- default high and
reduce it on a per-task basis if promotions are frequent.

Either way, expecting applications to get it right with hints is the road
to hell paved with good intentions. If they were able to get this right,
they would be using prctl(PR_SET_THP_DISABLE) already.
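
For reference, the existing opt-out looks roughly like this from userspace. This is a minimal sketch that calls prctl(2) through ctypes (Python only for illustration); the option numbers 41 and 42 are the values of PR_SET_THP_DISABLE and PR_GET_THP_DISABLE in <linux/prctl.h>, and the wrapper names are invented for this example. It is Linux-only and raises OSError elsewhere or if the kernel rejects the call.

```python
import ctypes
import os
import sys

# Option values from <linux/prctl.h>.
PR_SET_THP_DISABLE = 41
PR_GET_THP_DISABLE = 42


def _prctl(option, arg2=0):
    """Call prctl(option, arg2, 0, 0, 0) and raise OSError on failure."""
    if not sys.platform.startswith("linux"):
        raise OSError("prctl() is Linux-specific")
    libc = ctypes.CDLL(None, use_errno=True)
    ul = ctypes.c_ulong
    # The unused arguments must be zero for both THP options.
    result = libc.prctl(ctypes.c_int(option), ul(arg2), ul(0), ul(0), ul(0))
    if result < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return result


def set_thp_disabled(disabled=True):
    """Set or clear the per-process "THP disable" flag."""
    _prctl(PR_SET_THP_DISABLE, 1 if disabled else 0)


def thp_disabled():
    """Return True if the "THP disable" flag is set for this process."""
    return bool(_prctl(PR_GET_THP_DISABLE))


if __name__ == "__main__":
    try:
        set_thp_disabled(True)
        print("THP disabled for this process:", thp_disabled())
    except OSError as exc:
        print("could not change the THP setting:", exc)
```

The flag is inherited by children created with fork() and preserved across execve(), so a small wrapper program can set it before launching an unmodified application.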

--
Mel Gorman
SUSE Labs