by Jason Gunthorpe

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote:
> From: Zi Yan <[email protected]>
>
> Hi all,
>
> This patchset adds support for 1GB THP on x86_64. It is on top of
> v5.9-rc2-mmots-2020-08-25-21-13.
>
> 1GB THP is more flexible for reducing translation overhead and increasing the
> performance of applications with large memory footprint without application
> changes compared to hugetlb.
>
> Design
> =======
>
> 1GB THP implementation looks similar to existing THP code except for some new
> designs for the additional page table level.
>
> 1. Page table deposit and withdraw using a new pagechain data structure:
> instead of one PTE page table page, 1GB THP requires 513 page table pages
> (one PMD page table page and 512 PTE page table pages) to be deposited
> at the page allocation time, so that we can split the page later. Currently,
> the page table deposit is using ->lru, thus only one page can be deposited.
> A new pagechain data structure is added to enable multi-page deposit.
>
> 2. Triple mapped 1GB THP: 1GB THP can be mapped by a combination of PUD, PMD,
> and PTE entries. Mixing PUD and PTE mapping can be achieved with existing
> PageDoubleMap mechanism. To add PMD mapping, PMDPageInPUD and
> sub_compound_mapcount are introduced. PMDPageInPUD is the 512-aligned base
> page in a 1GB THP and sub_compound_mapcount counts the PMD mapping by using
> page[N*512 + 3].compound_mapcount.
>
> 3. Using CMA allocation for 1GB THP: instead of bumping MAX_ORDER, it is more sane
> to use something less intrusive. So all 1GB THPs are allocated from reserved
> CMA areas shared with hugetlb. At page splitting time, the bitmap for the 1GB
> THP is cleared as the resulting pages can be freed via normal page free path.
> We can fall back to alloc_contig_pages for 1GB THP if necessary.
>
>
> Patch Organization
> =======
>
> Patch 01 adds the new pagechain data structure.
>
> Patches 02 to 13 add 1GB THP support in various places.
>
> Patch 14 tries to use alloc_contig_pages for 1GB THP allocation.
>
> Patch 15 moves hugetlb_cma reservation to cma.c and renames it to hugepage_cma.
>
> Patch 16 uses hugepage_cma reservation for 1GB THP allocation.
>
>
> Any suggestions and comments are welcome.
>
>
> Zi Yan (16):
> mm: add pagechain container for storing multiple pages.
> mm: thp: 1GB anonymous page implementation.
> mm: proc: add 1GB THP kpageflag.
> mm: thp: 1GB THP copy on write implementation.
> mm: thp: handling 1GB THP reference bit.
> mm: thp: add 1GB THP split_huge_pud_page() function.
> mm: stats: make smap stats understand PUD THPs.
> mm: page_vma_walk: teach it about PMD-mapped PUD THP.
> mm: thp: 1GB THP support in try_to_unmap().
> mm: thp: split 1GB THPs at page reclaim.
> mm: thp: 1GB THP follow_p*d_page() support.
> mm: support 1GB THP pagemap support.
> mm: thp: add a knob to enable/disable 1GB THPs.
> mm: page_alloc: >=MAX_ORDER pages allocation an deallocation.
> hugetlb: cma: move cma reserve function to cma.c.
> mm: thp: use cma reservation for pud thp allocation.

Surprised this doesn't touch mm/pagewalk.c ?

Jason

2020-09-02 18:48:30
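
The numbers in the quoted cover letter (513 deposited page table pages, and
the page[N*512 + 3] slot used for sub_compound_mapcount) follow from the
x86_64 page table geometry. The sketch below only reproduces that arithmetic
in userspace C. It is not code from the patchset: the function and macro
names are made up for illustration, and it assumes 4kB base pages with 512
entries per page table.

```c
#include <assert.h>

/* Assumed x86_64 geometry: 4kB base pages, 512 entries per page table. */
#define ENTRIES_PER_TABLE 512UL
#define BASE_PAGE_SIZE    4096UL

/* Size covered by one PUD entry: 512 PMD entries x 512 PTE entries x 4kB. */
static unsigned long pud_thp_size_bytes(void)
{
    return BASE_PAGE_SIZE * ENTRIES_PER_TABLE * ENTRIES_PER_TABLE;
}

/*
 * Page table pages that have to be deposited so that a PUD mapping can
 * later be split all the way down: one PMD table, plus one PTE table for
 * each of its 512 entries.
 */
static unsigned long deposit_pages_for_pud_thp(void)
{
    return 1UL + ENTRIES_PER_TABLE;
}

/*
 * Index of the subpage whose compound_mapcount field is reused as
 * sub_compound_mapcount for the N-th 2MB chunk of the 1GB page,
 * following the page[N*512 + 3] rule quoted in the cover letter.
 */
static unsigned long sub_compound_mapcount_index(unsigned long n)
{
    assert(n < ENTRIES_PER_TABLE); /* only 512 PMD-sized chunks exist */
    return n * ENTRIES_PER_TABLE + 3;
}
```

With these assumptions, deposit_pages_for_pud_thp() gives the 513 pages
mentioned in the cover letter, which is why a single ->lru link is not enough
to hold the deposit.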

On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote:

> A global knob is insufficient. 1G pages will become a very precious
> resource as it requires a pre-allocation (reservation). So it really
> has to be an opt-in and the question is whether there is also some sort
> of access control needed.

The 1GB pages do not require that much in the way of
pre-allocation. The memory can be obtained through CMA,
which means it can be used for movable 4kB and 2MB
allocations when not being used for 1GB pages.

That makes it relatively easy to set aside some fraction
of system memory in every system for 1GB and movable
allocations, and use it in whatever way is needed
depending on what workload(s) end up running on a system.

--
All Rights Reversed.

2020-09-08 20:19:43
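
To get a feel for what "some fraction of system memory" means in terms of 1GB
pages, here is a small userspace sketch of the sizing arithmetic. It is only
an illustration: cma_pool_bytes() and max_1gb_pages() are hypothetical
helpers, not kernel functions, and the fractions (such as the 1/3 or 1/2 that
come up later in this thread) are inputs rather than recommendations.

```c
#include <assert.h>
#include <stdint.h>

#define GIB (1ULL << 30)

/*
 * Hypothetical helper: size of a shared CMA pool given as the fraction
 * num/den of total memory, rounded down to a whole number of 1GB units
 * so that every byte of the pool could back a 1GB page.
 */
static uint64_t cma_pool_bytes(uint64_t total_bytes, unsigned int num,
                               unsigned int den)
{
    assert(den != 0 && num <= den);
    uint64_t pool = total_bytes / den * num;
    return pool - (pool % GIB);
}

/* Upper bound on the number of 1GB pages the pool could hold at once. */
static uint64_t max_1gb_pages(uint64_t pool_bytes)
{
    return pool_bytes / GIB;
}
```

For a 64GB machine, a 1/3 pool rounds down to 21GB and a 1/2 pool is 32GB.
The point made above is that this memory is not lost when no 1GB pages are in
use, since movable 4kB and 2MB allocations can still be placed in it.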

On Thu, 2020-09-10 at 09:32 +0200, Michal Hocko wrote:
> [Cc Vlastimil and Mel - the whole email thread starts
> http://lkml.kernel.org/r/[email protected]
> but this particular subthread has diverged a bit and you might find it
> interesting]
>
> On Wed 09-09-20 15:43:55, David Hildenbrand wrote:
> >
> > I am not sure I like the trend towards CMA that we are seeing, reserving
> > huge buffers for specific users (and eventually even doing it
> > automatically).
> >
> > What we actually want is ZONE_MOVABLE with relaxed guarantees, such that
> > anybody who requires large, unmovable allocations can use it.
> >
> > I once played with the idea of having ZONE_PREFER_MOVABLE, which
> > a) Is the primary choice for movable allocations
> > b) Is allowed to contain unmovable allocations (esp., gigantic pages)
> > c) Is the fallback for ZONE_NORMAL for unmovable allocations, instead of
> > running out of memory
>
> I might be missing something but how can this work longterm? Or put in
> another words why would this work any better than existing fragmentation
> avoidance techniques that page allocator implements already -

One big difference is reclaim. If ZONE_NORMAL runs low on
free memory, page reclaim would kick in and evict some
movable/reclaimable things, to free up more space for
unmovable allocations.

The current fragmentation avoidance techniques don't do
things like reclaim, or proactively migrating movable
pages out of unmovable page blocks to prevent unmovable
allocations in currently movable page blocks.

> My suspicion is that a separate zone would work in a similar fashion. As
> long as there is a lot of free memory then zone will be effectively
> MOVABLE. Similar applies to normal zone when unmovable allocations are
> in minority. As long as the Normal zone gets full of unmovable objects
> they start overflowing to ZONE_PREFER_MOVABLE and it will resemble page
> block stealing when unmovable objects start spreading over movable page
> blocks.

You are right, with the difference being reclaim and/or
migration, which could make a real difference in limiting
the number of pageblocks that have unmovable allocations.

--
All Rights Reversed.

2020-09-10 14:46:21
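
The difference described above (reclaim in ZONE_NORMAL before unmovable
allocations spill over) can be written down as a toy policy. The sketch below
is a userspace model only, under stated assumptions: ZONE_PREFER_MOVABLE was
an idea floated in this thread and not an existing zone, and every type and
function here (zone_state, pick_zone() and so on) is hypothetical. It encodes
points a) to c) of the quoted proposal plus a reclaim pass before the
fallback.

```c
#include <assert.h>

enum zone_id { ZONE_NORMAL_ID, ZONE_PREFER_MOVABLE_ID, ZONE_NONE_ID };
enum alloc_kind { ALLOC_MOVABLE, ALLOC_UNMOVABLE };

/* Toy per-zone accounting, in pages. */
struct zone_state {
    unsigned long free_pages;
    unsigned long reclaimable_pages; /* movable/reclaimable, can be evicted */
};

/* Take nr pages from the zone if it has them; return 1 on success. */
static int take(struct zone_state *z, unsigned long nr)
{
    if (z->free_pages < nr)
        return 0;
    z->free_pages -= nr;
    return 1;
}

/* Evict reclaimable pages until nr pages are free or nothing is left. */
static void reclaim(struct zone_state *z, unsigned long nr)
{
    unsigned long want = nr > z->free_pages ? nr - z->free_pages : 0;
    unsigned long got = want < z->reclaimable_pages ? want
                                                    : z->reclaimable_pages;
    z->reclaimable_pages -= got;
    z->free_pages += got;
}

static enum zone_id pick_zone(enum alloc_kind kind, struct zone_state *normal,
                              struct zone_state *prefer_movable,
                              unsigned long nr)
{
    if (kind == ALLOC_MOVABLE) {
        /* a) the prefer-movable zone is the first choice for movable pages */
        if (take(prefer_movable, nr))
            return ZONE_PREFER_MOVABLE_ID;
        return take(normal, nr) ? ZONE_NORMAL_ID : ZONE_NONE_ID;
    }
    /* Unmovable: stay in the normal zone, reclaiming there first. */
    if (take(normal, nr))
        return ZONE_NORMAL_ID;
    reclaim(normal, nr);
    if (take(normal, nr))
        return ZONE_NORMAL_ID;
    /* c) only then fall back instead of running out of memory */
    return take(prefer_movable, nr) ? ZONE_PREFER_MOVABLE_ID : ZONE_NONE_ID;
}
```

In this model an unmovable request only lands in the prefer-movable zone once
the normal zone has nothing left to reclaim. Whether a real implementation
would behave this way is exactly what is being debated here.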

On 10 Sep 2020, at 9:32, Rik van Riel wrote:

> On Thu, 2020-09-10 at 09:32 +0200, Michal Hocko wrote:
>> [Cc Vlastimil and Mel - the whole email thread starts
>> http://lkml.kernel.org/r/[email protected]
>> but this particular subthread has diverged a bit and you might find it
>> interesting]
>>
>> On Wed 09-09-20 15:43:55, David Hildenbrand wrote:
>>>
>>> I am not sure I like the trend towards CMA that we are seeing, reserving
>>> huge buffers for specific users (and eventually even doing it
>>> automatically).
>>>
>>> What we actually want is ZONE_MOVABLE with relaxed guarantees, such that
>>> anybody who requires large, unmovable allocations can use it.
>>>
>>> I once played with the idea of having ZONE_PREFER_MOVABLE, which
>>> a) Is the primary choice for movable allocations
>>> b) Is allowed to contain unmovable allocations (esp., gigantic pages)
>>> c) Is the fallback for ZONE_NORMAL for unmovable allocations, instead of
>>> running out of memory
>>
>> I might be missing something but how can this work longterm? Or put in
>> another words why would this work any better than existing fragmentation
>> avoidance techniques that page allocator implements already -
>
> One big difference is reclaim. If ZONE_NORMAL runs low on
> free memory, page reclaim would kick in and evict some
> movable/reclaimable things, to free up more space for
> unmovable allocations.
>
> The current fragmentation avoidance techniques don't do
> things like reclaim, or proactively migrating movable
> pages out of unmovable page blocks to prevent unmovable
> allocations in currently movable page blocks.

Isn’t Mel Gorman’s watermark boost patch[1] (merged about a year ago)
doing what you are describing?

[1]https://lore.kernel.org/linux-mm/[email protected]/

—
Best Regards,
Yan Zi

2020-09-10 21:27:49

by Zi Yan

Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64

On 10 Sep 2020, at 4:27, David Hildenbrand wrote:

> On 10.09.20 09:32, Michal Hocko wrote:
>> [Cc Vlastimil and Mel - the whole email thread starts
>> http://lkml.kernel.org/r/[email protected]
>> but this particular subthread has diverged a bit and you might find it
>> interesting]
>>
>> On Wed 09-09-20 15:43:55, David Hildenbrand wrote:
>>> On 09.09.20 15:19, Rik van Riel wrote:
>>>> On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote:
>>>>> On Tue 08-09-20 10:41:10, Rik van Riel wrote:
>>>>>> On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote:
>>>>>>
>>>>>>> A global knob is insufficient. 1G pages will become a very precious
>>>>>>> resource as it requires a pre-allocation (reservation). So it really
>>>>>>> has to be an opt-in and the question is whether there is also some
>>>>>>> sort of access control needed.
>>>>>>
>>>>>> The 1GB pages do not require that much in the way of
>>>>>> pre-allocation. The memory can be obtained through CMA,
>>>>>> which means it can be used for movable 4kB and 2MB
>>>>>> allocations when not being used for 1GB pages.
>>>>>
>>>>> That CMA has to be pre-reserved, right? That requires a
>>>>> configuration.
>>>>
>>>> To some extent, yes.
>>>>
>>>> However, because that pool can be used for movable 4kB and 2MB
>>>> pages as well as for 1GB pages, it would be easy to just set
>>>> the size of that pool to e.g. 1/3 or even 1/2 of memory for every
>>>> system.
>>>>
>>>> It isn't like the pool needs to be the exact right size. We
>>>> just need to avoid the "highmem problem" of having too little
>>>> memory for kernel allocations.
>>>>
>>>
>>> I am not sure I like the trend towards CMA that we are seeing, reserving
>>> huge buffers for specific users (and eventually even doing it
>>> automatically).
>>>
>>> What we actually want is ZONE_MOVABLE with relaxed guarantees, such that
>>> anybody who requires large, unmovable allocations can use it.
>>>
>>> I once played with the idea of having ZONE_PREFER_MOVABLE, which
>>> a) Is the primary choice for movable allocations
>>> b) Is allowed to contain unmovable allocations (esp., gigantic pages)
>>> c) Is the fallback for ZONE_NORMAL for unmovable allocations, instead of
>>> running out of memory
>>
>> I might be missing something but how can this work longterm? Or put in
>> another words why would this work any better than existing fragmentation
>> avoidance techniques that page allocator implements already - movability
>> grouping etc. Please note that I am not deeply familiar with those but
>> my high level understanding is that we already try hard to not mix
>> movable and unmovable objects in same page blocks as much as we can.
>
> Note that we group in pageblock granularity, which avoids fragmentation
> on a pageblock level, not on anything bigger than that. Especially
> MAX_ORDER - 1 pages (e.g., on x86-64) and gigantic pages.
>
> So once you run for some time on a system (especially thinking about
> page shuffling *within* a zone), trying to allocate a gigantic page will
> simply always fail - even if you always had plenty of free memory in
> your single zone.
>
>>
>> My suspicion is that a separate zone would work in a similar fashion. As
>> long as there is a lot of free memory then zone will be effectively
>> MOVABLE. Similar applies to normal zone when unmovable allocations are
>
> Note the difference to MOVABLE: if you really want, you *can* put
> unmovable allocations into that zone. So you can happily allocate gigantic
> pages from it. Or anything else you like. As the name suggests "prefer
> movable allocations".
>
>> in minority. As long as the Normal zone gets full of unmovable objects
>> they start overflowing to ZONE_PREFER_MOVABLE and it will resemble page
>> block stealing when unmovable objects start spreading over movable page
>> blocks.
>
> Right, the long-term goal would be
> 1. To limit the chance of that happening. (e.g., size it in a way that's
> safe for 99.9% of all setups, resize dynamically on demand)
> 2. To limit the physical area where that is happening (e.g., find lowest
> possible pageblock etc.). That's more tricky but I consider this a pure
> optimization on top.
>
> As long as we stay in safe zone boundaries you get a benefit in most
> scenarios. As soon as we would have a (temporary) workload that would
> require more unmovable allocations we would fallback to polluting some
> pageblocks only.

The idea would work well until unmovable pages begin to overflow into
ZONE_PREFER_MOVABLE, or until we move the boundary of ZONE_PREFER_MOVABLE to
avoid that overflow. The issue comes from the lifetime of the unmovable
pages. Since some long-lived ones can sit around the boundary, there is no
guarantee that ZONE_PREFER_MOVABLE can grow back even if other unmovable
pages are deallocated. Ultimately, ZONE_PREFER_MOVABLE would shrink to a
small size and the situation would be back to what we have now.

OK. I have a stupid question here. Why not just grow pageblock to a larger
size, like 1GB? The fragmentation caused by unmovable pages would then be at
a larger granularity, but it would be less likely for unmovable pages to be
allocated in a movable pageblock, since after one pageblock steal the kernel
has a whole 1GB pageblock for them. If other kinds of pageblocks run out,
movable and reclaimable pages can fall back to unmovable pageblocks.
What am I missing here?
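
To make the granularity argument concrete, the sketch below counts how many
pageblocks contain at least one unmovable page for a given set of page frame
numbers, once with 2MB pageblocks and once with 1GB pageblocks. It is a
userspace illustration only, assuming 4kB base pages (so order 9 is 2MB and
order 18 is 1GB); count_polluted_blocks() is a hypothetical helper and the
input must be sorted in ascending order. It says nothing about how the
allocator would actually place pages with a larger pageblock size.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed orders with 4kB base pages: 2^9 pages = 2MB, 2^18 pages = 1GB. */
#define ORDER_2MB 9U
#define ORDER_1GB 18U

/*
 * Count the distinct blocks of size 2^order pages that contain at least
 * one of the given unmovable page frame numbers. The pfn array must be
 * sorted in ascending order so that equal blocks are adjacent.
 */
static size_t count_polluted_blocks(const unsigned long *sorted_pfns,
                                    size_t n, unsigned int order)
{
    size_t count = 0;
    unsigned long last_block = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned long block = sorted_pfns[i] >> order;

        if (i == 0 || block != last_block) {
            count++;
            last_block = block;
        }
    }
    return count;
}
```

The same few unmovable pages can pollute many 2MB pageblocks spread over
several 1GB regions, while counting at 1GB granularity shows how many gigantic
page candidates are actually lost.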

Thanks.

—
Best Regards,
Yan Zi
