2023-11-16 23:09:04

by David Hildenbrand

Subject: Re: [PATCH 0/8] implement "memmap on memory" feature on s390

On 14.11.23 19:02, Sumanth Korikkar wrote:
> Hi All,
>
> The patch series implements "memmap on memory" feature on s390 and
> provides the necessary fixes for it.

Thinking about this, one thing that makes s390x different from all the
other architectures in this series is the altmap handling.

I'm curious, why is that even required?

A memmap that is not marked as online in the section should not be
touched by anybody (except memory onlining code :) ). And if we do, it's
usually a BUG because that memmap might contain garbage/be poisoned or
completely stale, so we might want to track that down and fix it in any
case.

So what speaks against just leaving add_memory() populate the memmap
from the altmap? Then, also the page tables for the memmap are already
in place when onlining memory.


Then, adding two new notifier calls, one at the start of memory_block_online()
called something like MEM_PREPARE_ONLINE and one at the end of
memory_block_offline() called something like MEM_FINISH_OFFLINE, is still
suboptimal, but that's where standby memory could be
activated/deactivated, without messing with the altmap.
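
Roughly like this (completely untested sketch just to illustrate; the
notifier names and the exact plumbing are not final):

static int memory_block_online(struct memory_block *mem)
{
	unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
	struct memory_notify arg = {
		.start_pfn = start_pfn,
		.nr_pages = nr_pages,
	};
	int ret;

	/* New: let s390x activate standby memory before anyone touches it. */
	ret = memory_notify(MEM_PREPARE_ONLINE, &arg);
	ret = notifier_to_errno(ret);
	if (ret)
		return ret;

	/* ... existing onlining logic stays unchanged ... */
}

... and the counterpart, a MEM_FINISH_OFFLINE notification as the very
last step of memory_block_offline(), after the memmap is no longer in use.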

That way, the only s390x specific thing is that the memmap that should
not be touched by anybody is actually inaccessible, and you'd
activate/deactivate simply from the new notifier calls just the way we
used to do.

It's still all worse than just adding/removing memory properly, using a
proper interface -- where you could alloc/free an actual memmap when the
altmap is not desired. But I know that people don't want to spend time
just doing it cleanly from scratch.

--
Cheers,

David / dhildenb


2023-11-17 13:01:19

by Gerald Schaefer

Subject: Re: [PATCH 0/8] implement "memmap on memory" feature on s390

On Fri, 17 Nov 2023 00:08:31 +0100
David Hildenbrand <[email protected]> wrote:

> On 14.11.23 19:02, Sumanth Korikkar wrote:
> > Hi All,
> >
> > The patch series implements "memmap on memory" feature on s390 and
> > provides the necessary fixes for it.
>
> Thinking about this, one thing that makes s390x different from all the
> other architectures in this series is the altmap handling.
>
> I'm curious, why is that even required?
>
> A memmap that is not marked as online in the section should not be
> touched by anybody (except memory onlining code :) ). And if we do, it's
> usually a BUG because that memmap might contain garbage/be poisoned or
> completely stale, so we might want to track that down and fix it in any
> case.
>
> So what speaks against just leaving add_memory() populate the memmap
> from the altmap? Then, also the page tables for the memmap are already
> in place when onlining memory.

Good question, I am not 100% sure if we ran into bugs, or simply assumed
that it is not OK to call __add_pages() when the memory for the altmap
is not accessible.

Maybe there is also already a common code bug with that; s390 might be
special, but that is often also good for finding bugs in common code ...

> Then, adding two new notifier calls, one at the start of memory_block_online()
> called something like MEM_PREPARE_ONLINE and one at the end of
> memory_block_offline() called something like MEM_FINISH_OFFLINE, is still
> suboptimal, but that's where standby memory could be
> activated/deactivated, without messing with the altmap.
>
> That way, the only s390x specific thing is that the memmap that should
> not be touched by anybody is actually inaccessible, and you'd
> activate/deactivate simply from the new notifier calls just the way we
> used to do.
>
> It's still all worse than just adding/removing memory properly, using a
> proper interface -- where you could alloc/free an actual memmap when the
> altmap is not desired. But I know that people don't want to spend time
> just doing it cleanly from scratch.

Yes, sometimes they need to be forced to do that :-)

So, we'll look into defining a "proper interface", and treat patches 1-3
separately as bug fixes? Especially patch 3 might be interesting for arm:
if they do not have ZONE_DEVICE but still use the functions, they might
end up with the no-op version, not really freeing any memory.
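
(For reference, without CONFIG_ZONE_DEVICE the stubs in
include/linux/memremap.h look roughly like this, so the freeing
silently does nothing:)

static inline unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
{
	/* No pages reserved for the memmap. */
	return 0;
}

static inline void vmem_altmap_free(struct vmem_altmap *altmap,
				    unsigned long nr_pfns)
{
	/* No-op: the altmap-backed memmap is never given back. */
}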

2023-11-17 13:57:17

by Sumanth Korikkar

Subject: Re: [PATCH 0/8] implement "memmap on memory" feature on s390

On Fri, Nov 17, 2023 at 12:08:31AM +0100, David Hildenbrand wrote:
> On 14.11.23 19:02, Sumanth Korikkar wrote:
> > Hi All,
> >
> > The patch series implements "memmap on memory" feature on s390 and
> > provides the necessary fixes for it.
>
> Thinking about this, one thing that makes s390x different from all the other
> architectures in this series is the altmap handling.
>
> I'm curious, why is that even required?
>
> A memmap that is not marked as online in the section should not be touched
> by anybody (except memory onlining code :) ). And if we do, it's usually a
> BUG because that memmap might contain garbage/be poisoned or completely
> stale, so we might want to track that down and fix it in any case.
>
> So what speaks against just leaving add_memory() populate the memmap from
> the altmap? Then, also the page tables for the memmap are already in place
> when onlining memory.
>

We do have page_init_poison() in sparse_add_section(), which would then
have to be handled later, not in add_pages().
>
> Then, adding two new notifier calls, one at the start of memory_block_online()
> called something like MEM_PREPARE_ONLINE and one at the end of memory_block_offline()
> called something like MEM_FINISH_OFFLINE, is still suboptimal, but that's
> where standby memory could be activated/deactivated, without messing with
> the altmap.
>
> That way, the only s390x specific thing is that the memmap that should not
> be touched by anybody is actually inaccessible, and you'd
> activate/deactivate simply from the new notifier calls just the way we used
> to do.
ok.

Thanks

2023-11-17 15:37:57

by David Hildenbrand

Subject: Re: [PATCH 0/8] implement "memmap on memory" feature on s390

On 17.11.23 14:00, Gerald Schaefer wrote:
> On Fri, 17 Nov 2023 00:08:31 +0100
> David Hildenbrand <[email protected]> wrote:
>
>> On 14.11.23 19:02, Sumanth Korikkar wrote:
>>> Hi All,
>>>
>>> The patch series implements "memmap on memory" feature on s390 and
>>> provides the necessary fixes for it.
>>
>> Thinking about this, one thing that makes s390x different from all the
>> other architectures in this series is the altmap handling.
>>
>> I'm curious, why is that even required?
>>
>> A memmap that is not marked as online in the section should not be
>> touched by anybody (except memory onlining code :) ). And if we do, it's
>> usually a BUG because that memmap might contain garbage/be poisoned or
>> completely stale, so we might want to track that down and fix it in any
>> case.
>>
>> So what speaks against just leaving add_memory() populate the memmap
>> from the altmap? Then, also the page tables for the memmap are already
>> in place when onlining memory.
>
> Good question, I am not 100% sure if we ran into bugs, or simply assumed
> that it is not OK to call __add_pages() when the memory for the altmap
> is not accessible.

I mean, we create the direct map even though nobody should access that
memory, so maybe we can simply map the altmap even though nobody should
access that memory.

As I said, then, even the page tables for the altmap are allocated
already and memory onlining likely doesn't need any allocation anymore
(except when kasan is enabled or some other memory notifiers have special
demands).

Certainly simpler :)

>
> Maybe there is also already a common code bug with that, s390 might be
> special but that is often also good for finding bugs in common code ...

If it's only the page_init_poison() as noted by Sumanth, we could
disable that on s390x with an altmap some way or the other; should be
possible.

I mean, you effectively have your own poisoning if the altmap is
effectively inaccessible and makes your CPU angry on access :)

Last but not least, support for an inaccessible altmap might come in
handy for virtio-mem eventually, and make altmap support eventually
simpler. So added bonus points.

>
>> Then, adding two new notifier calls, one at the start of memory_block_online()
>> called something like MEM_PREPARE_ONLINE and one at the end of
>> memory_block_offline() called something like MEM_FINISH_OFFLINE, is still
>> suboptimal, but that's where standby memory could be
>> activated/deactivated, without messing with the altmap.
>>
>> That way, the only s390x specific thing is that the memmap that should
>> not be touched by anybody is actually inaccessible, and you'd
>> activate/deactivate simply from the new notifier calls just the way we
>> used to do.
>>
>> It's still all worse than just adding/removing memory properly, using a
>> proper interface -- where you could alloc/free an actual memmap when the
>> altmap is not desired. But I know that people don't want to spend time
>> just doing it cleanly from scratch.
>
> Yes, sometimes they need to be forced to do that :-)

I certainly won't force you if we can just keep the __add_pages() calls
as is; having an altmap that is inaccessible but fully prepared sounds
reasonable to me.

I can see how this gives an immediate benefit to existing s390x
installations without being too hacky and without taking a long time to
settle.

But I'd strongly suggest evaluating a new interface long-term.

>
> So, we'll look into defining a "proper interface", and treat patches 1-3
> separately as bug fixes? Especially patch 3 might be interesting for arm:
> if they do not have ZONE_DEVICE but still use the functions, they might
> end up with the no-op version, not really freeing any memory.

It might make sense to

1) Send the first 3 out separately
2) Look into a simple variant that leaves __add_pages() calls alone and
only adds the new MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers --
well, and deals with an inaccessible altmap, like the
page_init_poison() issue while the altmap might be inaccessible.
3) Look into a proper interface to add/remove memory instead of relying
on online/offline.

2) is certainly an improvement and might be desired in some cases. 3) is
more powerful (e.g., where you don't want an altmap because of
fragmentation) and future proof.

I suspect there will be installations where an altmap is undesired: it
fragments your address space with unmovable (memmap) allocations.
Currently, runtime allocations of gigantic pages are affected. Long-term
other large allocations (if we ever see very large THP) will be affected.

For that reason, we want to either support variable-sized memory blocks
long-term, or simulate that by "grouping" memory blocks that share the
same altmap located on the first memory blocks in that group: but
onlining one block forces onlining of the whole group.

On s390x, which adds all memory ahead of time, it's hard to decide
what the right granularity should be, and sudden changes in
online/offline behavior might be quite "surprising" for users.
The user can give better hints when adding/removing memory explicitly.

--
Cheers,

David / dhildenb

2023-11-17 15:38:08

by David Hildenbrand

Subject: Re: [PATCH 0/8] implement "memmap on memory" feature on s390

On 17.11.23 14:56, Sumanth Korikkar wrote:
> On Fri, Nov 17, 2023 at 12:08:31AM +0100, David Hildenbrand wrote:
>> On 14.11.23 19:02, Sumanth Korikkar wrote:
>>> Hi All,
>>>
>>> The patch series implements "memmap on memory" feature on s390 and
>>> provides the necessary fixes for it.
>>
>> Thinking about this, one thing that makes s390x different from all the other
>> architectures in this series is the altmap handling.
>>
>> I'm curious, why is that even required?
>>
>> A memmap that is not marked as online in the section should not be touched
>> by anybody (except memory onlining code :) ). And if we do, it's usually a
>> BUG because that memmap might contain garbage/be poisoned or completely
>> stale, so we might want to track that down and fix it in any case.
>>
>> So what speaks against just leaving add_memory() populate the memmap from
>> the altmap? Then, also the page tables for the memmap are already in place
>> when onlining memory.
>>
>
> We do have page_init_poison() in sparse_add_section(), which would then
> have to be handled later, not in add_pages().

Was that all, or did you stumble over other things?

--
Cheers,

David / dhildenb

2023-11-17 19:48:05

by Sumanth Korikkar

Subject: Re: [PATCH 0/8] implement "memmap on memory" feature on s390

On Fri, Nov 17, 2023 at 04:37:29PM +0100, David Hildenbrand wrote:
>
> It might make sense to
>
> 1) Send the first 3 out separately

Ok sure, I will send the first 3 patches out separately as bug fixes,
with your feedback applied.

> 2) Look into a simple variant that leaves __add_pages() calls alone and
> only adds the new MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers --
> well, and deals with an inaccessible altmap, like the
> page_init_poison() issue while the altmap might be inaccessible.

Thanks for the valuable feedback.

I quickly tried disabling page_init_poison() and removing the hack in
arch_add_memory() and arch_remove_memory(), and also used the new
MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers. The testing results so
far look promising: everything seems to work and no issues have been found.

I will also double check if there are any other memmap accesses in the
add_pages() phase.

We will go with this approach for now, i.e. with the notifiers you
suggested and the __add_pages() change.

Do you have any suggestions on how we could check for an inaccessible altmap?

> 3) Look into a proper interface to add/remove memory instead of relying
> on online/offline.

Agreed, for the long term.
>
> 2) is certainly an improvement and might be desired in some cases. 3) is
> more powerful (e.g., where you don't want an altmap because of
> fragmentation) and future proof.
>
> I suspect there will be installations where an altmap is undesired: it
> fragments your address space with unmovable (memmap) allocations. Currently,
> runtime allocations of gigantic pages are affected. Long-term other large
> allocations (if we ever see very large THP) will be affected.
>
> For that reason, we want to either support variable-sized memory blocks
> long-term, or simulate that by "grouping" memory blocks that share the same
> altmap located on the first memory blocks in that group: but onlining one
> block forces onlining of the whole group.
>
> On s390x, which adds all memory ahead of time, it's hard to decide what
> the right granularity should be, and sudden changes in online/offline
> behavior might be quite "surprising" for users. The user can give better
> hints when adding/removing memory explicitly.

Thanks for providing insights and details.

2023-11-21 13:14:01

by Sumanth Korikkar

Subject: Re: [PATCH 0/8] implement "memmap on memory" feature on s390

On Fri, Nov 17, 2023 at 04:37:29PM +0100, David Hildenbrand wrote:
> >
> > Maybe there is also already a common code bug with that; s390 might be
> > special, but that is often also good for finding bugs in common code ...
>
> If it's only the page_init_poison() as noted by Sumanth, we could disable
> that on s390x with an altmap one way or the other; should be possible.
>
> I mean, you effectively have your own poisoning if the altmap is effectively
> inaccessible and makes your CPU angry on access :)
>
> Last but not least, support for an inaccessible altmap might come in handy
> for virtio-mem eventually, and make altmap support eventually simpler. So
> added bonus points.

We tried out two possibilities for dealing with vmemmap altmap inaccessibility.
Approach 1: Add MHP_ALTMAP_INACCESSIBLE flag and pass it in add_memory()

diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c
index 075094ca59b4..ab2dfcc7e9e4 100644
--- a/drivers/s390/char/sclp_cmd.c
+++ b/drivers/s390/char/sclp_cmd.c
@@ -358,6 +358,13 @@ static int sclp_mem_notifier(struct notifier_block *nb,
* buddy allocator later.
*/
__arch_set_page_nodat((void *)__va(start), memory_block->altmap->free);
+ /*
+ * Poison the struct pages after the memory block is accessible.
+ * This is needed only for the altmap. Without an altmap, the struct
+ * pages are poisoned in sparse_add_section().
+ */
+ if (memory_block->altmap->inaccessible)
+ page_init_poison(pfn_to_page(arg->start_pfn), memory_block->altmap->free);
break;
case MEM_FINISH_OFFLINE:
sclp_mem_change_state(start, size, 0);
@@ -412,7 +419,7 @@ static void __init add_memory_merged(u16 rn)
goto skip_add;
for (addr = start; addr < start + size; addr += block_size)
add_memory(0, addr, block_size,
- MACHINE_HAS_EDAT1 ? MHP_MEMMAP_ON_MEMORY : MHP_NONE);
+ MACHINE_HAS_EDAT1 ? MHP_MEMMAP_ON_MEMORY|MHP_ALTMAP_INACCESSIBLE : MHP_NONE);
skip_add:
first_rn = rn;
num = 1;
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 7d2076583494..5c70707e706f 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -106,6 +106,11 @@ typedef int __bitwise mhp_t;
* implies the node id (nid).
*/
#define MHP_NID_IS_MGID ((__force mhp_t)BIT(2))
+/*
+ * Mark memmap on memory (struct pages array) as inaccessible during memory
+ * hotplug addition phase.
+ */
+#define MHP_ALTMAP_INACCESSIBLE ((__force mhp_t)BIT(3))

/*
* Extended parameters for memory hotplug:
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 744c830f4b13..9837f3e6fb95 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -25,6 +25,7 @@ struct vmem_altmap {
unsigned long free;
unsigned long align;
unsigned long alloc;
+ bool inaccessible;
};

/*
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7a5fc89a8652..d8299853cdcc 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1439,6 +1439,8 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
if (mhp_supports_memmap_on_memory(size)) {
mhp_altmap.free = memory_block_memmap_on_memory_pages();
+ if (mhp_flags & MHP_ALTMAP_INACCESSIBLE)
+ mhp_altmap.inaccessible = true;
params.altmap = kmalloc(sizeof(struct vmem_altmap), GFP_KERNEL);
if (!params.altmap) {
ret = -ENOMEM;
diff --git a/mm/sparse.c b/mm/sparse.c
index 77d91e565045..3991c717b769 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -907,7 +907,8 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
* Poison uninitialized struct pages in order to catch invalid flags
* combinations.
*/
- page_init_poison(memmap, sizeof(struct page) * nr_pages);
+ if (!altmap || !altmap->inaccessible)
+ page_init_poison(memmap, sizeof(struct page) * nr_pages);

ms = __nr_to_section(section_nr);
set_section_nid(section_nr, nid);


Approach 2:
===========
Shouldn't the kasan zero shadow mapping be performed first, before
accessing/initializing the memmap via page_init_poison()? If that is
true, then it is a problem for all architectures and could be
fixed like:


diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7a5fc89a8652..eb3975740537 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1093,6 +1093,7 @@ int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
if (ret)
return ret;

+ page_init_poison(pfn_to_page(pfn), sizeof(struct page) * nr_pages);
move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);

for (i = 0; i < nr_pages; i++)
diff --git a/mm/sparse.c b/mm/sparse.c
index 77d91e565045..4ddf53f52075 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -906,8 +906,11 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
/*
* Poison uninitialized struct pages in order to catch invalid flags
* combinations.
+ * For altmap, do this later when onlining the memory, as it might
+ * not be accessible at this point.
*/
- page_init_poison(memmap, sizeof(struct page) * nr_pages);
+ if (!altmap)
+ page_init_poison(memmap, sizeof(struct page) * nr_pages);

ms = __nr_to_section(section_nr);
set_section_nid(section_nr, nid);



Also, if this approach is taken, should page_init_poison() be performed
with cond_resched() as mentioned in commit d33695b16a9f
("mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone()") ?

Opinions?

Thank you

2023-11-21 13:21:57

by Sumanth Korikkar

Subject: Re: [PATCH 0/8] implement "memmap on memory" feature on s390

On Tue, Nov 21, 2023 at 02:13:22PM +0100, Sumanth Korikkar wrote:
> Approach 2:
> ===========
> Shouldn't the kasan zero shadow mapping be performed first, before
> accessing/initializing the memmap via page_init_poison()? If that is
> true, then it is a problem for all architectures and could be
> fixed like:
>
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 7a5fc89a8652..eb3975740537 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1093,6 +1093,7 @@ int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
> if (ret)
> return ret;
>
> + page_init_poison(pfn_to_page(pfn), sizeof(struct page) * nr_pages);
> move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);
>
> for (i = 0; i < nr_pages; i++)
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 77d91e565045..4ddf53f52075 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -906,8 +906,11 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
> /*
> * Poison uninitialized struct pages in order to catch invalid flags
> * combinations.
> + * For altmap, do this later when onlining the memory, as it might
> + * not be accessible at this point.
> */
> - page_init_poison(memmap, sizeof(struct page) * nr_pages);
> + if (!altmap)
> + page_init_poison(memmap, sizeof(struct page) * nr_pages);
>
> ms = __nr_to_section(section_nr);
> set_section_nid(section_nr, nid);
>
>
>
> Also, if this approach is taken, should page_init_poison() be performed
> with cond_resched() as mentioned in commit d33695b16a9f
> ("mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone()") ?

Sorry, wrong commit id.

should page_init_poison() be performed with cond_resched() as mentioned in
commit b7e3debdd040 ("mm/memory_hotplug.c: fix false softlockup
during pfn range removal") ?

Thanks
>
> Opinions?
>
> Thank you

2023-11-21 14:43:09

by David Hildenbrand

Subject: Re: [PATCH 0/8] implement "memmap on memory" feature on s390

On 21.11.23 14:21, Sumanth Korikkar wrote:
> On Tue, Nov 21, 2023 at 02:13:22PM +0100, Sumanth Korikkar wrote:
>> Approach 2:
>> ===========
>> Shouldn't the kasan zero shadow mapping be performed first, before
>> accessing/initializing the memmap via page_init_poison()? If that is
>> true, then it is a problem for all architectures and could be
>> fixed like:
>>
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 7a5fc89a8652..eb3975740537 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1093,6 +1093,7 @@ int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
>> if (ret)
>> return ret;
>>
>> + page_init_poison(pfn_to_page(pfn), sizeof(struct page) * nr_pages);
>> move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);
>>
>> for (i = 0; i < nr_pages; i++)
>> diff --git a/mm/sparse.c b/mm/sparse.c
>> index 77d91e565045..4ddf53f52075 100644
>> --- a/mm/sparse.c
>> +++ b/mm/sparse.c
>> @@ -906,8 +906,11 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
>> /*
>> * Poison uninitialized struct pages in order to catch invalid flags
>> * combinations.
>> + * For altmap, do this later when onlining the memory, as it might
>> + * not be accessible at this point.
>> */
>> - page_init_poison(memmap, sizeof(struct page) * nr_pages);
>> + if (!altmap)
>> + page_init_poison(memmap, sizeof(struct page) * nr_pages);
>>
>> ms = __nr_to_section(section_nr);
>> set_section_nid(section_nr, nid);
>>
>>
>>
>> Also, if this approach is taken, should page_init_poison() be performed
>> with cond_resched() as mentioned in commit d33695b16a9f
>> ("mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone()") ?
>
> Sorry, wrong commit id.
>
> should page_init_poison() be performed with cond_resched() as mentioned in
> commit b7e3debdd040 ("mm/memory_hotplug.c: fix false softlockup
> during pfn range removal") ?

I think people are currently looking into removing all those cond_resched() calls:

https://lore.kernel.org/all/[email protected]/T/#mda52da685a142bec9607625386b0b660e5470abe
--
Cheers,

David / dhildenb

2023-11-21 19:25:19

by David Hildenbrand

Subject: Re: [PATCH 0/8] implement "memmap on memory" feature on s390

On 21.11.23 14:13, Sumanth Korikkar wrote:
> On Fri, Nov 17, 2023 at 04:37:29PM +0100, David Hildenbrand wrote:
>>>
>>> Maybe there is also already a common code bug with that; s390 might be
>>> special, but that is often also good for finding bugs in common code ...
>>
>> If it's only the page_init_poison() as noted by Sumanth, we could disable
>> that on s390x with an altmap one way or the other; should be possible.
>>
>> I mean, you effectively have your own poisoning if the altmap is effectively
>> inaccessible and makes your CPU angry on access :)
>>
>> Last but not least, support for an inaccessible altmap might come in handy
>> for virtio-mem eventually, and make altmap support eventually simpler. So
>> added bonus points.
>
> We tried out two possibilities for dealing with vmemmap altmap inaccessibility.
> Approach 1: Add MHP_ALTMAP_INACCESSIBLE flag and pass it in add_memory()
>
> diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c
> index 075094ca59b4..ab2dfcc7e9e4 100644
> --- a/drivers/s390/char/sclp_cmd.c
> +++ b/drivers/s390/char/sclp_cmd.c
> @@ -358,6 +358,13 @@ static int sclp_mem_notifier(struct notifier_block *nb,
> * buddy allocator later.
> */
> __arch_set_page_nodat((void *)__va(start), memory_block->altmap->free);
> + /*
> + * Poison the struct pages after the memory block is accessible.
> + * This is needed only for the altmap. Without an altmap, the struct
> + * pages are poisoned in sparse_add_section().
> + */
> + if (memory_block->altmap->inaccessible)
> + page_init_poison(pfn_to_page(arg->start_pfn), memory_block->altmap->free);

See below; it's likely best if the core does that.

> break;
> case MEM_FINISH_OFFLINE:
> sclp_mem_change_state(start, size, 0);
> @@ -412,7 +419,7 @@ static void __init add_memory_merged(u16 rn)
> goto skip_add;
> for (addr = start; addr < start + size; addr += block_size)
> add_memory(0, addr, block_size,
> - MACHINE_HAS_EDAT1 ? MHP_MEMMAP_ON_MEMORY : MHP_NONE);
> + MACHINE_HAS_EDAT1 ? MHP_MEMMAP_ON_MEMORY|MHP_ALTMAP_INACCESSIBLE : MHP_NONE);
> skip_add:
> first_rn = rn;
> num = 1;
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 7d2076583494..5c70707e706f 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -106,6 +106,11 @@ typedef int __bitwise mhp_t;
> * implies the node id (nid).
> */
> #define MHP_NID_IS_MGID ((__force mhp_t)BIT(2))
> +/*
> + * Mark memmap on memory (struct pages array) as inaccessible during memory
> + * hotplug addition phase.
> + */
> +#define MHP_ALTMAP_INACCESSIBLE ((__force mhp_t)BIT(3))

If we go that path, maybe rather call it MHP_OFFLINE_INACCESSIBLE and document
how this relates to/interacts with the memory notifier callbacks and the
altmap.

Then, we can logically abstract this from altmap handling. Simply, the
memory should not be read/written before the memory notifier was called.

In the core, you can do the poisoning for the altmap in that case after
calling the notifier, probably in mhp_init_memmap_on_memory() as you do
below.
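
Something like this (untested sketch; how the flag gets passed down --
via the altmap or the memory block -- is TBD):

int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
			      struct zone *zone, bool inaccessible)
{
	int ret;

	ret = kasan_add_zero_shadow(__va(PFN_PHYS(pfn)), PFN_PHYS(nr_pages));
	if (ret)
		return ret;

	/*
	 * The MEM_PREPARE_ONLINE notifier already ran at this point, so
	 * the memmap is accessible; do the poisoning that was skipped in
	 * sparse_add_section().
	 */
	if (inaccessible)
		page_init_poison(pfn_to_page(pfn),
				 sizeof(struct page) * nr_pages);

	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);

	/* ... remainder unchanged ... */
}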

>
> /*
> * Extended parameters for memory hotplug:
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 744c830f4b13..9837f3e6fb95 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -25,6 +25,7 @@ struct vmem_altmap {
> unsigned long free;
> unsigned long align;
> unsigned long alloc;
> + bool inaccessible;

We should then likely remember that information for the memory block,
not the altmap.
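
E.g., something like this (sketch only; field name not final):

/* include/linux/memory.h */
struct memory_block {
	unsigned long start_section_nr;
	unsigned long state;
	/* ... */
	/*
	 * The memory (including the memmap) is inaccessible while the
	 * block is offline: nobody may read/write it before the
	 * MEM_PREPARE_ONLINE notifier made it accessible.
	 */
	bool inaccessible;
};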

[...]

>
>
> Approach 2:
> ===========
> Shouldn't the kasan zero shadow mapping be performed first, before
> accessing/initializing the memmap via page_init_poison()? If that is

Likely the existing way is fine. The first access to the poisoned memmap should
be a write, not a read. But I didn't look into the details.

Can you try reproducing?

> true, then it is a problem for all architectures and should could be
> fixed like:
>
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 7a5fc89a8652..eb3975740537 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1093,6 +1093,7 @@ int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
> if (ret)
> return ret;
>
> + page_init_poison(pfn_to_page(pfn), sizeof(struct page) * nr_pages);
> move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);
>
> for (i = 0; i < nr_pages; i++)
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 77d91e565045..4ddf53f52075 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -906,8 +906,11 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
> /*
> * Poison uninitialized struct pages in order to catch invalid flags
> * combinations.
> + * For altmap, do this later when onlining the memory, as it might
> + * not be accessible at this point.
> */
> - page_init_poison(memmap, sizeof(struct page) * nr_pages);
> + if (!altmap)
> + page_init_poison(memmap, sizeof(struct page) * nr_pages);
>
> ms = __nr_to_section(section_nr);
> set_section_nid(section_nr, nid);
>

That's too generic when it comes to other altmap users, especially DAX
or when the altmap is accessible while memory is offlining, and we want
to catch that.

>
>
> Also, if this approach is taken, should page_init_poison() be performed
> with cond_resched() as mentioned in commit d33695b16a9f
> ("mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone()") ?

Likely as soon as possible after it is accessible :)


--
Cheers,

David / dhildenb

2023-11-22 11:46:55

by Sumanth Korikkar

[permalink] [raw]
Subject: Re: [PATCH 0/8] implement "memmap on memory" feature on s390

On Tue, Nov 21, 2023 at 08:24:48PM +0100, David Hildenbrand wrote:
> > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> > index 7d2076583494..5c70707e706f 100644
> > --- a/include/linux/memory_hotplug.h
> > +++ b/include/linux/memory_hotplug.h
> > @@ -106,6 +106,11 @@ typedef int __bitwise mhp_t;
> > * implies the node id (nid).
> > */
> > #define MHP_NID_IS_MGID ((__force mhp_t)BIT(2))
> > +/*
> > + * Mark memmap on memory (struct pages array) as inaccessible during memory
> > + * hotplug addition phase.
> > + */
> > +#define MHP_ALTMAP_INACCESSIBLE ((__force mhp_t)BIT(3))
>
> If we go that path, maybe rather call it MHP_OFFLINE_INACCESSIBLE and document how
> this relates to/interacts with the memory notifier callbacks and the altmap.
>
> Then, we can logically abstract this from altmap handling. Simply, the
> memory should not be read/written before the memory notifier was called.
>
> In the core, you can do the poisoning for the altmap in that case after
> calling the notifier, probably in mhp_init_memmap_on_memory() as you do
> below.
ok, sure. Thanks.
>
> > /*
> > * Extended parameters for memory hotplug:
> > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > index 744c830f4b13..9837f3e6fb95 100644
> > --- a/include/linux/memremap.h
> > +++ b/include/linux/memremap.h
> > @@ -25,6 +25,7 @@ struct vmem_altmap {
> > unsigned long free;
> > unsigned long align;
> > unsigned long alloc;
> > + bool inaccessible;
>
> We should then likely remember that information for the memory block, not
> the altmap.

I tried using an inaccessible field in memory_block and observed that
the memory block is only created after arch_add_memory() succeeds.
Hence, the checks for inaccessible memory in sparse_add_section()
(regarding page poisoning) still rely on the mhp_flags conveyed through
add_memory(). The memory_block inaccessible state is then set in
create_memory_block_devices().
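
For reference, the call order is roughly:

add_memory()
  -> add_memory_resource()
       -> arch_add_memory()             /* sparse_add_section() runs here;
                                           no memory_block exists yet */
       -> create_memory_block_devices() /* block created; its inaccessible
                                           state can be set only from here on */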

I hope this approach is fine.

On the other hand, relying on the inaccessible flag in vmem_altmap could
make the checks in arch_add_memory() and other functions easier, I suppose.

>
> [...]
>
> >
> >
> > Approach 2:
> > ===========
> > Shouldn't the kasan zero shadow mapping be performed first, before
> > accessing/initializing the memmap via page_init_poison()? If that is
>
> Likely the existing way is fine. The first access to the poisoned memmap should be a
> write, not a read. But I didn't look into the details.
>
> Can you try reproducing?
>

Executing page_init_poison() right before kasan_add_zero_shadow() in the
context of altmap usage did not result in a crash when I attempted to
reproduce it.

However, with altmap + page_init_poison() within sparse_add_section(), a
crash occurs on our arch, as previously discussed in this thread.

Thank you