2023-08-23 15:22:45

by Alexandru Elisei

[permalink] [raw]
Subject: [PATCH RFC 06/37] mm: page_alloc: Allocate from movable pcp lists only if ALLOC_FROM_METADATA

pcp lists keep MIGRATE_METADATA pages on the MIGRATE_MOVABLE list. Make
sure pages from the movable list are allocated only when the
ALLOC_FROM_METADATA alloc flag is set, as otherwise the page allocator
could end up allocating a metadata page when that page cannot be used.

__alloc_pages_bulk() sidesteps rmqueue() and calls __rmqueue_pcplist()
directly. Add a check for the flag before calling __rmqueue_pcplist(), and
fallback to __alloc_pages() if the check is false.

Note that CMA isn't a problem for __alloc_pages_bulk(): an allocation can
always use CMA pages if the requested migratetype is MIGRATE_MOVABLE, which
is not the case with MIGRATE_METADATA pages.

Signed-off-by: Alexandru Elisei <[email protected]>
---
mm/page_alloc.c | 21 +++++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 829134a4dfa8..a693e23c4733 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2845,11 +2845,16 @@ struct page *rmqueue(struct zone *preferred_zone,

if (likely(pcp_allowed_order(order))) {
/*
- * MIGRATE_MOVABLE pcplist could have the pages on CMA area and
- * we need to skip it when CMA area isn't allowed.
+ * PCP lists keep MIGRATE_CMA/MIGRATE_METADATA pages on the same
+ * movable list. Make sure it's allowed to allocate both type of
+ * pages before allocating from the movable list.
*/
- if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
- migratetype != MIGRATE_MOVABLE) {
+ bool movable_allowed = (!IS_ENABLED(CONFIG_CMA) ||
+ (alloc_flags & ALLOC_CMA)) &&
+ (!IS_ENABLED(CONFIG_MEMORY_METADATA) ||
+ (alloc_flags & ALLOC_FROM_METADATA));
+
+ if (migratetype != MIGRATE_MOVABLE || movable_allowed) {
page = rmqueue_pcplist(preferred_zone, zone, order,
migratetype, alloc_flags);
if (likely(page))
@@ -4388,6 +4393,14 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
goto out;
gfp = alloc_gfp;

+ /*
+ * pcp lists puts MIGRATE_METADATA on the MIGRATE_MOVABLE list, don't
+ * use pcp if allocating metadata pages is not allowed.
+ */
+ if (metadata_storage_enabled() && ac.migratetype == MIGRATE_MOVABLE &&
+ !(alloc_flags & ALLOC_FROM_METADATA))
+ goto failed;
+
/* Find an allowed local zone that meets the low watermark. */
for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, ac.highest_zoneidx, ac.nodemask) {
unsigned long mark;
--
2.41.0



2023-10-16 12:41:36

by Alexandru Elisei

[permalink] [raw]
Subject: Re: [PATCH RFC 06/37] mm: page_alloc: Allocate from movable pcp lists only if ALLOC_FROM_METADATA

Hi,

On Thu, Oct 12, 2023 at 10:25:11AM +0900, Hyesoo Yu wrote:
> On Wed, Aug 23, 2023 at 02:13:19PM +0100, Alexandru Elisei wrote:
> > pcp lists keep MIGRATE_METADATA pages on the MIGRATE_MOVABLE list. Make
> > sure pages from the movable list are allocated only when the
> > ALLOC_FROM_METADATA alloc flag is set, as otherwise the page allocator
> > could end up allocating a metadata page when that page cannot be used.
> >
> > __alloc_pages_bulk() sidesteps rmqueue() and calls __rmqueue_pcplist()
> > directly. Add a check for the flag before calling __rmqueue_pcplist(), and
> > fallback to __alloc_pages() if the check is false.
> >
> > Note that CMA isn't a problem for __alloc_pages_bulk(): an allocation can
> > always use CMA pages if the requested migratetype is MIGRATE_MOVABLE, which
> > is not the case with MIGRATE_METADATA pages.
> >
> > Signed-off-by: Alexandru Elisei <[email protected]>
> > ---
> > mm/page_alloc.c | 21 +++++++++++++++++----
> > 1 file changed, 17 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 829134a4dfa8..a693e23c4733 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2845,11 +2845,16 @@ struct page *rmqueue(struct zone *preferred_zone,
> >
> > if (likely(pcp_allowed_order(order))) {
> > /*
> > - * MIGRATE_MOVABLE pcplist could have the pages on CMA area and
> > - * we need to skip it when CMA area isn't allowed.
> > + * PCP lists keep MIGRATE_CMA/MIGRATE_METADATA pages on the same
> > + * movable list. Make sure it's allowed to allocate both type of
> > + * pages before allocating from the movable list.
> > */
> > - if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
> > - migratetype != MIGRATE_MOVABLE) {
> > + bool movable_allowed = (!IS_ENABLED(CONFIG_CMA) ||
> > + (alloc_flags & ALLOC_CMA)) &&
> > + (!IS_ENABLED(CONFIG_MEMORY_METADATA) ||
> > + (alloc_flags & ALLOC_FROM_METADATA));
> > +
> > + if (migratetype != MIGRATE_MOVABLE || movable_allowed) {
>
> Hi!
>
> I don't think it would be effcient when the majority of movable pages
> do not use GFP_TAGGED.
>
> Metadata pages have a low probability of being in the pcp list
> because metadata pages is bypassed when freeing pages.
>
> The allocation performance of most movable pages is likely to decrease
> if only the request with ALLOC_FROM_METADATA could be allocated.

You're right, I hadn't considered that.

>
> How about not including metadata pages in the pcp list at all ?

Sounds reasonable, I will keep it in mind for the next iteration of the
series.

Thanks,
Alex

>
> Thanks,
> Hyesoo Yu.
>
> > page = rmqueue_pcplist(preferred_zone, zone, order,
> > migratetype, alloc_flags);
> > if (likely(page))
> > @@ -4388,6 +4393,14 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> > goto out;
> > gfp = alloc_gfp;
> >
> > + /*
> > + * pcp lists puts MIGRATE_METADATA on the MIGRATE_MOVABLE list, don't
> > + * use pcp if allocating metadata pages is not allowed.
> > + */
> > + if (metadata_storage_enabled() && ac.migratetype == MIGRATE_MOVABLE &&
> > + !(alloc_flags & ALLOC_FROM_METADATA))
> > + goto failed;
> > +
> > /* Find an allowed local zone that meets the low watermark. */
> > for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, ac.highest_zoneidx, ac.nodemask) {
> > unsigned long mark;
> > --
> > 2.41.0
> >
> >


2023-10-17 10:26:59

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH RFC 06/37] mm: page_alloc: Allocate from movable pcp lists only if ALLOC_FROM_METADATA

On Mon, Oct 16, 2023 at 01:41:15PM +0100, Alexandru Elisei wrote:
> On Thu, Oct 12, 2023 at 10:25:11AM +0900, Hyesoo Yu wrote:
> > I don't think it would be effcient when the majority of movable pages
> > do not use GFP_TAGGED.
> >
> > Metadata pages have a low probability of being in the pcp list
> > because metadata pages is bypassed when freeing pages.
> >
> > The allocation performance of most movable pages is likely to decrease
> > if only the request with ALLOC_FROM_METADATA could be allocated.
>
> You're right, I hadn't considered that.
>
> >
> > How about not including metadata pages in the pcp list at all ?
>
> Sounds reasonable, I will keep it in mind for the next iteration of the
> series.

BTW, I suggest for the next iteration we drop MIGRATE_METADATA, only use
CMA and assume that the tag storage itself supports tagging. Hopefully
it makes the patches a bit simpler.

--
Catalin

2023-10-23 10:51:24

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH RFC 06/37] mm: page_alloc: Allocate from movable pcp lists only if ALLOC_FROM_METADATA

On Mon, Oct 23, 2023 at 04:16:56PM +0900, Hyesoo Yu wrote:
> On Tue, Oct 17, 2023 at 11:26:36AM +0100, Catalin Marinas wrote:
> > BTW, I suggest for the next iteration we drop MIGRATE_METADATA, only use
> > CMA and assume that the tag storage itself supports tagging. Hopefully
> > it makes the patches a bit simpler.
>
> I am curious about the plan for the next iteration.

Alex is working on it.

> Does tag storage itself supports tagging? Will the following version be unusable
> if the hardware does not support it? The document of google said that
> "If this memory is itself mapped as Tagged Normal (which should not happen!)
> then tag updates on it either raise a fault or do nothing, but never change the
> contents of any other page."
> (https://github.com/google/sanitizers/blob/master/mte-dynamic-carveout/spec.md)
>
> The support of H/W is very welcome because it is good to make the patches simpler.
> But if H/W doesn't support it, Can't the new solution be used?

AFAIK on the current interconnects this is supported but the offsets
will need to be configured by firmware in such a way that a tag access
to the tag carve-out range still points to physical RAM, otherwise, as
per Google's doc, you can get some unexpected behaviour.

Let's take a simplified example, we have:

phys_addr - physical address, linearised, starting from 0
ram_size - the size of RAM (also corresponds to the end of PA+1)

A typical configuration is to designate the top 1/32 of RAM for tags:

tag_offset = ram_size - ram_size / 32
tag_carveout_start = tag_offset

The tag address for a given phys_addr is calculated as:

tag_addr = phys_addr / 32 + tag_offset

To keep things simple, we reserve the top 1/(32*32) of the RAM as tag
storage for the main/reusable tag carveout.

tag_carveout2_start = tag_carveout_start / 32 + tag_offset

This gives us the end of the first reusable carveout:

tag_carveout_end = tag_carveout2_start - 1

and this way in Linux we can have (inclusive ranges):

0..(tag_carveout_start-1): normal memory, data only
tag_carveout_start..tag_carveout_end: CMA, reused as tags or data
tag_carveout2_start..(ram_size-1): reserved for tags (not touched by the OS)

For this to work, we need the last page in the first carveout to have
a tag storage within RAM. And, of course, all of the above need to be at
least 4K aligned.

The simple configuration of 1/(32*32) of RAM for the second carveout is
sufficient but not fully utilised. We could be a bit more efficient to
gain a few more pages. Apart from the page alignment requirements, the
only strict requirement we need is:

tag_carverout2_end < ram_size

where tag_carveout2_end is the tag storage corresponding to the end of
the main/reusable carveout, just before tag_carveout2_start:

tag_carveout2_end = tag_carveout_end / 32 + tag_offset

Assuming that my on-paper substitutions are correct, the inequality
above becomes:

tag_offset < (1024 * ram_size + 32) / 1057

and tag_offset is a multiple of PAGE_SIZE * 32 (so that the
tag_carveout2_start is a multiple of PAGE_SIZE).

As a concrete example, for 16GB of RAM starting from 0:

tag_offset = 0x3e0060000
tag_carverout2_start = 0x3ff063000

Without the optimal placement, the default tag_offset of top 1/32 of RAM
would have been:

tag_offset = 0x3e0000000
tag_carveou2_start = 0x3ff000000

so an extra 396KB gained with optimal placement (out of 16G, not sure
it's worth).

One can put the calculations in some python script to get the optimal
tag offset in case I got something wrong on paper.

--
Catalin

2023-10-23 11:56:35

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 06/37] mm: page_alloc: Allocate from movable pcp lists only if ALLOC_FROM_METADATA

On 23.10.23 12:50, Catalin Marinas wrote:
> On Mon, Oct 23, 2023 at 04:16:56PM +0900, Hyesoo Yu wrote:
>> On Tue, Oct 17, 2023 at 11:26:36AM +0100, Catalin Marinas wrote:
>>> BTW, I suggest for the next iteration we drop MIGRATE_METADATA, only use
>>> CMA and assume that the tag storage itself supports tagging. Hopefully
>>> it makes the patches a bit simpler.
>>
>> I am curious about the plan for the next iteration.
>
> Alex is working on it.
>
>> Does tag storage itself supports tagging? Will the following version be unusable
>> if the hardware does not support it? The document of google said that
>> "If this memory is itself mapped as Tagged Normal (which should not happen!)
>> then tag updates on it either raise a fault or do nothing, but never change the
>> contents of any other page."
>> (https://github.com/google/sanitizers/blob/master/mte-dynamic-carveout/spec.md)
>>
>> The support of H/W is very welcome because it is good to make the patches simpler.
>> But if H/W doesn't support it, Can't the new solution be used?
>
> AFAIK on the current interconnects this is supported but the offsets
> will need to be configured by firmware in such a way that a tag access
> to the tag carve-out range still points to physical RAM, otherwise, as
> per Google's doc, you can get some unexpected behaviour.
>
> Let's take a simplified example, we have:
>
> phys_addr - physical address, linearised, starting from 0
> ram_size - the size of RAM (also corresponds to the end of PA+1)
>
> A typical configuration is to designate the top 1/32 of RAM for tags:
>
> tag_offset = ram_size - ram_size / 32
> tag_carveout_start = tag_offset
>
> The tag address for a given phys_addr is calculated as:
>
> tag_addr = phys_addr / 32 + tag_offset
>
> To keep things simple, we reserve the top 1/(32*32) of the RAM as tag
> storage for the main/reusable tag carveout.
>
> tag_carveout2_start = tag_carveout_start / 32 + tag_offset
>
> This gives us the end of the first reusable carveout:
>
> tag_carveout_end = tag_carveout2_start - 1
>
> and this way in Linux we can have (inclusive ranges):
>
> 0..(tag_carveout_start-1): normal memory, data only
> tag_carveout_start..tag_carveout_end: CMA, reused as tags or data
> tag_carveout2_start..(ram_size-1): reserved for tags (not touched by the OS)
>
> For this to work, we need the last page in the first carveout to have
> a tag storage within RAM. And, of course, all of the above need to be at
> least 4K aligned.
>
> The simple configuration of 1/(32*32) of RAM for the second carveout is
> sufficient but not fully utilised. We could be a bit more efficient to
> gain a few more pages. Apart from the page alignment requirements, the
> only strict requirement we need is:
>
> tag_carverout2_end < ram_size
>
> where tag_carveout2_end is the tag storage corresponding to the end of
> the main/reusable carveout, just before tag_carveout2_start:
>
> tag_carveout2_end = tag_carveout_end / 32 + tag_offset
>
> Assuming that my on-paper substitutions are correct, the inequality
> above becomes:
>
> tag_offset < (1024 * ram_size + 32) / 1057
>
> and tag_offset is a multiple of PAGE_SIZE * 32 (so that the
> tag_carveout2_start is a multiple of PAGE_SIZE).
>
> As a concrete example, for 16GB of RAM starting from 0:
>
> tag_offset = 0x3e0060000
> tag_carverout2_start = 0x3ff063000
>
> Without the optimal placement, the default tag_offset of top 1/32 of RAM
> would have been:
>
> tag_offset = 0x3e0000000
> tag_carveou2_start = 0x3ff000000
>
> so an extra 396KB gained with optimal placement (out of 16G, not sure
> it's worth).
>
> One can put the calculations in some python script to get the optimal
> tag offset in case I got something wrong on paper.

I followed what you are saying, but I didn't quite read the following
clearly stated in your calculations: Using this model, how much memory
would you be able to reuse, and how much not?

I suspect you would *not* be able to reuse "1/(32*32)" [second
carve-out] but be able to reuse "1/32 - 1/(32*32)" [first carve-out] or
am I completely off?

Further, (just thinking about it) I assume you've taken care of the
condition that memory cannot self-host it's own tag memory. So that
cannot happen in the model proposed here, right?

Anyhow, happy to see that we might be able to make it work just by
mostly reusing existing CMA.

--
Cheers,

David / dhildenb

2023-10-23 17:08:41

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH RFC 06/37] mm: page_alloc: Allocate from movable pcp lists only if ALLOC_FROM_METADATA

On Mon, Oct 23, 2023 at 01:55:12PM +0200, David Hildenbrand wrote:
> On 23.10.23 12:50, Catalin Marinas wrote:
> > On Mon, Oct 23, 2023 at 04:16:56PM +0900, Hyesoo Yu wrote:
> > > Does tag storage itself supports tagging? Will the following version be unusable
> > > if the hardware does not support it? The document of google said that
> > > "If this memory is itself mapped as Tagged Normal (which should not happen!)
> > > then tag updates on it either raise a fault or do nothing, but never change the
> > > contents of any other page."
> > > (https://github.com/google/sanitizers/blob/master/mte-dynamic-carveout/spec.md)
> > >
> > > The support of H/W is very welcome because it is good to make the patches simpler.
> > > But if H/W doesn't support it, Can't the new solution be used?
> >
> > AFAIK on the current interconnects this is supported but the offsets
> > will need to be configured by firmware in such a way that a tag access
> > to the tag carve-out range still points to physical RAM, otherwise, as
> > per Google's doc, you can get some unexpected behaviour.
[...]
> I followed what you are saying, but I didn't quite read the following
> clearly stated in your calculations: Using this model, how much memory would
> you be able to reuse, and how much not?
>
> I suspect you would *not* be able to reuse "1/(32*32)" [second carve-out]
> but be able to reuse "1/32 - 1/(32*32)" [first carve-out] or am I completely
> off?

That's correct. In theory, from the hardware perspective, we could even
go recursively to the third/fourth etc. carveout until the last one is a
single page but I'd rather not complicate things further.

> Further, (just thinking about it) I assume you've taken care of the
> condition that memory cannot self-host it's own tag memory. So that cannot
> happen in the model proposed here, right?

I don't fully understand what you mean. The tags for the first data
range (0 .. ram_size * 31/32) are stored in the first tag carveout.
That's where we'll need CMA. For the tag carveout, when hosting data
pages as tagged, the tags go in the second carveout which is fully
reserved (still TBD but possibly the firmware won't even tell the kernel
about it).

--
Catalin

2023-10-23 17:23:33

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 06/37] mm: page_alloc: Allocate from movable pcp lists only if ALLOC_FROM_METADATA

On 23.10.23 19:08, Catalin Marinas wrote:
> On Mon, Oct 23, 2023 at 01:55:12PM +0200, David Hildenbrand wrote:
>> On 23.10.23 12:50, Catalin Marinas wrote:
>>> On Mon, Oct 23, 2023 at 04:16:56PM +0900, Hyesoo Yu wrote:
>>>> Does tag storage itself supports tagging? Will the following version be unusable
>>>> if the hardware does not support it? The document of google said that
>>>> "If this memory is itself mapped as Tagged Normal (which should not happen!)
>>>> then tag updates on it either raise a fault or do nothing, but never change the
>>>> contents of any other page."
>>>> (https://github.com/google/sanitizers/blob/master/mte-dynamic-carveout/spec.md)
>>>>
>>>> The support of H/W is very welcome because it is good to make the patches simpler.
>>>> But if H/W doesn't support it, Can't the new solution be used?
>>>
>>> AFAIK on the current interconnects this is supported but the offsets
>>> will need to be configured by firmware in such a way that a tag access
>>> to the tag carve-out range still points to physical RAM, otherwise, as
>>> per Google's doc, you can get some unexpected behaviour.
> [...]
>> I followed what you are saying, but I didn't quite read the following
>> clearly stated in your calculations: Using this model, how much memory would
>> you be able to reuse, and how much not?
>>
>> I suspect you would *not* be able to reuse "1/(32*32)" [second carve-out]
>> but be able to reuse "1/32 - 1/(32*32)" [first carve-out] or am I completely
>> off?
>
> That's correct. In theory, from the hardware perspective, we could even
> go recursively to the third/fourth etc. carveout until the last one is a
> single page but I'd rather not complicate things further.
>
>> Further, (just thinking about it) I assume you've taken care of the
>> condition that memory cannot self-host it's own tag memory. So that cannot
>> happen in the model proposed here, right?
>
> I don't fully understand what you mean. The tags for the first data
> range (0 .. ram_size * 31/32) are stored in the first tag carveout.
> That's where we'll need CMA. For the tag carveout, when hosting data
> pages as tagged, the tags go in the second carveout which is fully
> reserved (still TBD but possibly the firmware won't even tell the kernel
> about it).

You got my cryptic question right: you make sure that the tag for the
first carveout go to the second carveout.

Sounds very good, thanks.

--
Cheers,

David / dhildenb