Hi,
back in March [1] you recommended 53cdc1cb29e8
("drivers/base/memory.c: indicate all memory blocks as removable") to be
backported to stable trees, and that has led to a more general discussion
about the current state of pfn walkers wrt. uninitialized pmem struct
pages. We haven't settled on any specific solution for that except for a
general sentiment that pfn_to_online_page should be able to catch those.
I might have missed follow-ups on that, but I do not think we have
landed on any actual solution in the end. Have I just missed any follow-ups?
Is anybody working on that?
Also is there any reference explaining what those struct pages are and
why we cannot initialize them? I am sorry if this has been explained to
me but I really cannot find that in my notes anywhere. Most pmem pages
should be initialized via memmap_init_zone_device, right?
I am asking mostly because we are starting to see these issues in
production, and while the only visible problem so far is a crash when
reading sysfs (the removable file), I am worried we are just lucky that
no other pfn walker stumbles over this.
[1] http://lkml.kernel.org/r/CAPcyv4jpdaNvJ67SkjyUJLBnBnXXQv686BiVW042g03FUmWLXw@mail.gmail.com
--
Michal Hocko
SUSE Labs
On 04.01.21 11:03, Michal Hocko wrote:
> Hi,
> back in March [1] you recommended 53cdc1cb29e8
> ("drivers/base/memory.c: indicate all memory blocks as removable") to be
> backported to stable trees, and that has led to a more general discussion
> about the current state of pfn walkers wrt. uninitialized pmem struct
> pages. We haven't settled on any specific solution for that except for a
> general sentiment that pfn_to_online_page should be able to catch those.
> I might have missed follow-ups on that, but I do not think we have
> landed on any actual solution in the end. Have I just missed any follow-ups?
Thanks for raising this issue. It's still on my list of "broken and
non-trivial to fix" things.
So, as far as I recall, we still have the following two issues remaining:
1. pfn_to_online_page() false positives
The semantics of pfn_to_online_page() were broken with sub-section
hot-add in corner cases.
If we have ZONE_DEVICE hot-added memory that overlaps in a section with
boot memory, this memory section will contain parts of ZONE_DEVICE memory
and parts of !ZONE_DEVICE memory. This can happen in sub-section
granularity (2MB IIRC). pfn_to_online_page() will succeed on ZONE_DEVICE
memory parts as the whole section is marked as online. Bad.
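For reference, pfn_to_online_page() currently only consults the
section-level online bit - roughly this (mm/memory_hotplug.c, quoting
from memory):

struct page *pfn_to_online_page(unsigned long pfn)
{
        unsigned long nr = pfn_to_section_nr(pfn);

        if (nr < NR_MEM_SECTIONS && online_section_nr(nr))
                return pfn_to_page(pfn);
        return NULL;
}

So any pfn within a section that contains some online memory passes,
ZONE_DEVICE or not.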
One instance where this is still an issue is
mm/memory-failure.c:memory_failure() and
mm/memory-failure.c:soft_offline_page(). I thought for a while about
"fixing" these, but to me it felt like fixing pfn_to_online_page() is
actually the right approach.
But worse, before ZONE_DEVICE hot-add
1. The whole section memmap does already exist (early sections always
have a full memmap for the whole section)
2. The whole section memmap is initialized (although eventually with
dummy node/zone 0/0 for memory holes until that part is fixed) and might
be accessed by pfn walkers.
So when hotadding ZONE_DEVICE we are modifying already existing and
visible memmaps. Bad.
2. Deferred init of ZONE_DEVICE ranges
memmap_init_zone_device() runs after the ZONE_DEVICE zone was resized
and outside the memhp lock. I did not follow if the use of
get_dev_pagemap() actually makes sure that memmap_init_zone_device() in
pagemap_range() actually completed. I don't think it does.
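The problematic ordering, condensed from mm/memremap.c:pagemap_range()
(quoting from memory, error handling and details trimmed):

        mem_hotplug_begin();
        ...
        move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
                               PHYS_PFN(range->start),
                               PHYS_PFN(range_len(range)), params->altmap,
                               MIGRATE_MOVABLE);
        mem_hotplug_done();

        /* zone already resized, lock already dropped, memmap not yet initialized */
        memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
                                PHYS_PFN(range->start),
                                PHYS_PFN(range_len(range)), pgmap);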
>
> Is anybody working on that?
>
I believe Dan mentioned somewhere that he wants to see a real instance
of this producing a BUG before actually moving forward with a fix. I
might be wrong.
We might tackle 1. by:
a) [worked-around] doing get_dev_pagemap() before pfn_to_online_page() -
just plain ugly (see the sketch after this list).
b) [worked-around] using a sub-section online-map and extending
pfn_to_online_page(). IMHO ugly to fix this corner case.
c) [worked-around] extending pfn_to_online_page() by zone checks. IMHO racy.
d) fixed by not allowing ZONE_DEVICE and !ZONE_DEVICE within a single
section. In the worst case, don't add partially present sections that
have big holes in the beginning/end. Like, if there is a 128MB section
with 126MB of memory followed by a 2MB hole, don't add that memory to
Linux with CONFIG_ZONE_DEVICE. Might turn some memory unusable, but
well, it would be the price to pay for simplicity. Being able to hotadd
PMEM is more important than using each and every last MB of memory.
e) decrease the section size and drop sub-section hotadd support. As
sub-section hotadd is 2MB and MAX_ORDER - 1 corresponds to 4MB, this is
mostly impossible. Worse on aarch64 with 64k base pages - 1024MB
sections and MAX_ORDER - 1 corresponds to 512MB. I think this is not feasible.
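To make a) concrete, an untested sketch of how pfn_to_online_page() could
filter out driver-managed pfns (helper names as in current code, logic mine):

struct page *pfn_to_online_page(unsigned long pfn)
{
        unsigned long nr = pfn_to_section_nr(pfn);
        struct dev_pagemap *pgmap;

        if (nr >= NR_MEM_SECTIONS || !online_section_nr(nr))
                return NULL;

        /* ZONE_DEVICE pfns are managed by a driver and never online */
        pgmap = get_dev_pagemap(pfn, NULL);
        put_dev_pagemap(pgmap);
        if (pgmap)
                return NULL;

        return pfn_to_page(pfn);
}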
We might tackle 2. by making sure that get_dev_pagemap() only succeeds
after memmap_init_zone_device() finished. As I said, not sure if that is
already done.
> Also is there any reference explaining what those struct pages are and
> why we cannot initialize them? I am sorry if this has been explained to
> me but I really cannot find that in my notes anywhere. Most pmem pages
> should be initialized via memmap_init_zone_device, right?
Long story short, IMHO we have to
a) fix pfn_to_online_page() to only succeed for !ZONE_DEVICE memory.
b) fix get_dev_pagemap() to only succeed if memmap_init_zone_device()
completed. (again, unless this already works as desired)
Dealing with races (memory offlining racing with pfn_to_online_page())
is another story and stuff for the future. Another level of complexity :)
> I am asking mostly because we are starting to see these issues in
> production, and while the only visible problem so far is a crash when
> reading sysfs (the removable file), I am worried we are just lucky that
> no other pfn walker stumbles over this.
Which exact issue are you seeing? The one tackled by "[PATCH v1]
drivers/base/memory.c: indicate all memory blocks as removable"?
--
Thanks,
David / dhildenb
On Mon 04-01-21 11:45:39, David Hildenbrand wrote:
> On 04.01.21 11:03, Michal Hocko wrote:
> > Hi,
> > back in March [1] you recommended 53cdc1cb29e8
> > ("drivers/base/memory.c: indicate all memory blocks as removable") to be
> > backported to stable trees, and that has led to a more general discussion
> > about the current state of pfn walkers wrt. uninitialized pmem struct
> > pages. We haven't settled on any specific solution for that except for a
> > general sentiment that pfn_to_online_page should be able to catch those.
> > I might have missed follow-ups on that, but I do not think we have
> > landed on any actual solution in the end. Have I just missed any follow-ups?
>
> Thanks for raising this issue. It's still on my list of "broken and
> non-trivial to fix" things.
>
> So, as far as I recall, we still have the following two issues remaining:
>
> 1. pfn_to_online_page() false positives
>
> The semantics of pfn_to_online_page() were broken with sub-section
> hot-add in corner cases.
>
> If we have ZONE_DEVICE hot-added memory that overlaps in a section with
> boot memory, this memory section will contain parts of ZONE_DEVICE memory
> and parts of !ZONE_DEVICE memory. This can happen in sub-section
> granularity (2MB IIRC). pfn_to_online_page() will succeed on ZONE_DEVICE
> memory parts as the whole section is marked as online. Bad.
OK, I was not aware of this problem. Anyway, those pages should still be
allocated and should retain their last state. I would have
to double check, but this shouldn't be harmful. Or what would be an
actual problem?
> One instance where this is still an issue is
> mm/memory-failure.c:memory_failure() and
> mm/memory-failure.c:soft_offline_page(). I thought for a while about
> "fixing" these, but to me it felt like fixing pfn_to_online_page() is
> actually the right approach.
>
> But worse, before ZONE_DEVICE hot-add
> 1. The whole section memmap does already exist (early sections always
> have a full memmap for the whole section)
> 2. The whole section memmap is initialized (although eventually with
> dummy node/zone 0/0 for memory holes until that part is fixed) and might
> be accessed by pfn walkers.
>
> So when hotadding ZONE_DEVICE we are modifying already existing and
> visible memmaps. Bad.
Could you elaborate please?
> 2. Deferred init of ZONE_DEVICE ranges
>
> memmap_init_zone_device() runs after the ZONE_DEVICE zone was resized
> and outside the memhp lock. I did not follow if the use of
> get_dev_pagemap() actually makes sure that memmap_init_zone_device() in
> pagemap_range() actually completed. I don't think it does.
So a pfn walker can see an uninitialized struct page for a while, right?
The problem that I have encountered is that some zone device pages are
not initialized at all. That sounds like a different issue from those 2 above.
I am having a hard time tracking what kind of pages those are and why we
cannot initialize their zone/node and make them reserved at least.
> > Is anybody working on that?
> >
>
> I believe Dan mentioned somewhere that he wants to see a real instance
> of this producing a BUG before actually moving forward with a fix. I
> might be wrong.
We have seen reports about those uninitialized struct pages on our 5.3
based kernels. Backporting 53cdc1cb29e8 helped for the particular report
but I still consider it a workaround rather than a fix. I do not have
any reports for other pfn walkers but we might be just lucky, and I will
sleep better if I do not have to rely on luck.
[...]
I will think about your proposed solutions after I manage to get through
my email backlog.
Thanks!
--
Michal Hocko
SUSE Labs
On 04.01.21 15:26, Michal Hocko wrote:
> On Mon 04-01-21 11:45:39, David Hildenbrand wrote:
>> On 04.01.21 11:03, Michal Hocko wrote:
>>> Hi,
>>> back in March [1] you recommended 53cdc1cb29e8
>>> ("drivers/base/memory.c: indicate all memory blocks as removable") to be
>>> backported to stable trees, and that has led to a more general discussion
>>> about the current state of pfn walkers wrt. uninitialized pmem struct
>>> pages. We haven't settled on any specific solution for that except for a
>>> general sentiment that pfn_to_online_page should be able to catch those.
>>> I might have missed follow-ups on that, but I do not think we have
>>> landed on any actual solution in the end. Have I just missed any follow-ups?
>>
>> Thanks for raising this issue. It's still on my list of "broken and
>> non-trivial to fix" things.
>>
>> So, as far as I recall, we still have the following two issues remaining:
>>
>> 1. pfn_to_online_page() false positives
>>
>> The semantics of pfn_to_online_page() were broken with sub-section
>> hot-add in corner cases.
>>
>> If we have ZONE_DEVICE hot-added memory that overlaps in a section with
>> boot memory, this memory section will contain parts of ZONE_DEVICE memory
>> and parts of !ZONE_DEVICE memory. This can happen in sub-section
>> granularity (2MB IIRC). pfn_to_online_page() will succeed on ZONE_DEVICE
>> memory parts as the whole section is marked as online. Bad.
>
> OK, I was not aware of this problem. Anyway, those pages should still be
> allocated and should retain their last state. I would have
> to double check, but this shouldn't be harmful. Or what would be an
> actual problem?
>
>> One instance where this is still an issue is
>> mm/memory-failure.c:memory_failure() and
>> mm/memory-failure.c:soft_offline_page(). I thought for a while about
>> "fixing" these, but to me it felt like fixing pfn_to_online_page() is
>> actually the right approach.
>>
>> But worse, before ZONE_DEVICE hot-add
>> 1. The whole section memmap does already exist (early sections always
>> have a full memmap for the whole section)
>> 2. The whole section memmap is initialized (although eventually with
>> dummy node/zone 0/0 for memory holes until that part is fixed) and might
>> be accessed by pfn walkers.
>>
>> So when hotadding ZONE_DEVICE we are modifying already existing and
>> visible memmaps. Bad.
>
> Could you elaborate please?
Simplistic example: Assume you have a VM with 64MB on x86-64.
We need exactly one memory section (-> one memory block device). We
allocate the memmap for a full section - an "early section". So we have
a memmap for 128MB, while 64MB are actually in use, the other 64MB is
initialized (like a memory hole). pfn_to_online_page() would return a
valid struct page for the whole section memmap.
The remaining 64MB can later be used for hot-adding ZONE_DEVICE memory,
essentially re-initializing that part of the already-existing memmap.
See pfn_valid():
/*
* Traditionally early sections always returned pfn_valid() for
* the entire section-sized span.
*/
return early_section(ms) || pfn_section_valid(ms, pfn);
Depending on the memory layout of the system, a pfn walker might just be
about to stumble over this range getting re-initialized.
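For completeness, the generic pfn_valid() around that line reads roughly
like this (include/linux/mmzone.h):

static inline int pfn_valid(unsigned long pfn)
{
        struct mem_section *ms;

        if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
                return 0;
        ms = __nr_to_section(pfn_to_section_nr(pfn));
        if (!valid_section(ms))
                return 0;
        /*
         * Traditionally early sections always returned pfn_valid() for
         * the entire section-sized span.
         */
        return early_section(ms) || pfn_section_valid(ms, pfn);
}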
>
>> 2. Deferred init of ZONE_DEVICE ranges
>>
>> memmap_init_zone_device() runs after the ZONE_DEVICE zone was resized
>> and outside the memhp lock. I did not follow if the use of
>> get_dev_pagemap() actually makes sure that memmap_init_zone_device() in
>> pagemap_range() actually completed. I don't think it does.
>
> So a pfn walker can see an uninitialized struct page for a while, right?
>
> The problem that I have encountered is that some zone device pages are
> not initialized at all. That sounds like a different issue from those 2 above.
> I am having a hard time tracking what kind of pages those are and why we
> cannot initialize their zone/node and make them reserved at least.
And you are sure that these are in fact ZONE_DEVICE pages? Not memory
holes e.g., tackled by
commit 4b094b7851bf4bf551ad456195d3f26e1c03bd74
Author: David Hildenbrand <[email protected]>
Date: Mon Feb 3 17:33:55 2020 -0800
mm/page_alloc.c: initialize memmap of unavailable memory directly
commit e822969cab48b786b64246aad1a3ba2a774f5d23
Author: David Hildenbrand <[email protected]>
Date: Mon Feb 3 17:33:48 2020 -0800
mm/page_alloc.c: fix uninitialized memmaps on a partially populated
last section
(note there is currently an upstream discussion on improving this
initialization, especially getting better node/zone information, mostly
involving Andrea and Mike - but it only changes "how" these parts are
initialized, not "if" or "when")
---
However, I do remember a discussion regarding "reserved altmap space"
ZONE_DEVICE ranges, and whether to initialize them or leave them
uninitialized. See comment in
commit 77e080e7680e1e615587352f70c87b9e98126d03
Author: Aneesh Kumar K.V <[email protected]>
Date: Fri Oct 18 20:19:39 2019 -0700
mm/memunmap: don't access uninitialized memmap in memunmap_pages()
"With an altmap, the memmap falling into the reserved altmap space are
not initialized and, therefore, contain a garbage NID and a garbage zone.".
I think the issue is that the ZONE_DEVICE pages that *host* the memmap
of other pages might be left uninitialized.
Like pfn_to_page(VIRT_TO_PFN(pfn_to_page(zone_device_pfn))), which falls
onto ZONE_DEVICE with an altmap, could be uninitialized. This is very
similar to Oscar's vmemmap-on-hotadded-memory approach, however,
there we implicitly initialize the memmap of these pages just by the way
the vmemmap is placed at the beginning of the memory block.
If altmap-reserved space is placed partially into an early section that
is marked as online (issue 1. I described), we have the same issue as
1., just a little harder to fix :)
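For reference, the altmap bookkeeping looks like this
(include/linux/memremap.h, quoting from memory; note the kernel's own
comment about where pfn_to_page() becomes valid):

struct vmem_altmap {
        const unsigned long base_pfn;
        const unsigned long end_pfn;
        const unsigned long reserve;
        unsigned long free;
        unsigned long align;
        unsigned long alloc;
};

static inline unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
{
        /* number of pfns from base where pfn_to_page() is valid */
        return altmap->reserve + altmap->free;
}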
>
>>> Is anybody working on that?
>>>
>>
>> I believe Dan mentioned somewhere that he wants to see a real instance
>> of this producing a BUG before actually moving forward with a fix. I
>> might be wrong.
>
> We have seen reports about those uninitialized struct pages on our 5.3
> based kernels. Backporting 53cdc1cb29e8 helped for the particular report
> but I still consider it a workaround rather than a fix. I do not have
> any reports for other pfn walkers but we might be just lucky, and I will
> sleep better if I do not have to rely on luck.
Yeah, I think we are now at 3 different but related problems.
--
Thanks,
David / dhildenb
On Mon 04-01-21 15:51:35, David Hildenbrand wrote:
> On 04.01.21 15:26, Michal Hocko wrote:
> > On Mon 04-01-21 11:45:39, David Hildenbrand wrote:
[...]
> >> One instance where this is still an issue is
> >> mm/memory-failure.c:memory_failure() and
> >> mm/memory-failure.c:soft_offline_page(). I thought for a while about
> >> "fixing" these, but to me it felt like fixing pfn_to_online_page() is
> >> actually the right approach.
> >>
> >> But worse, before ZONE_DEVICE hot-add
> >> 1. The whole section memmap does already exist (early sections always
> >> have a full memmap for the whole section)
> >> 2. The whole section memmap is initialized (although eventually with
> >> dummy node/zone 0/0 for memory holes until that part is fixed) and might
> >> be accessed by pfn walkers.
> >>
> >> So when hotadding ZONE_DEVICE we are modifying already existing and
> >> visible memmaps. Bad.
> >
> > Could you elaborate please?
>
> Simplistic example: Assume you have a VM with 64MB on x86-64.
>
> We need exactly one memory section (-> one memory block device). We
> allocate the memmap for a full section - an "early section". So we have
> a memmap for 128MB, while 64MB are actually in use, the other 64MB is
> initialized (like a memory hole). pfn_to_online_page() would return a
> valid struct page for the whole section memmap.
>
> The remaining 64MB can later be used for hot-adding ZONE_DEVICE memory,
> essentially re-initializing that part of the already-existing memmap.
>
> See pfn_valid():
>
> /*
> * Traditionally early sections always returned pfn_valid() for
> * the entire section-sized span.
> */
> return early_section(ms) || pfn_section_valid(ms, pfn);
>
>
> Depending on the memory layout of the system, a pfn walker might just be
> about to stumble over this range getting re-initialized.
Right. But as long as pfn walkers are not synchronized with the memory
hotplug this is a general problem with any struct page. Whether it
belongs to pmem or regular memory, no?
> >> 2. Deferred init of ZONE_DEVICE ranges
> >>
> >> memmap_init_zone_device() runs after the ZONE_DEVICE zone was resized
> >> and outside the memhp lock. I did not follow if the use of
> >> get_dev_pagemap() actually makes sure that memmap_init_zone_device() in
> >> pagemap_range() actually completed. I don't think it does.
> >
> > So a pfn walker can see an uninitialized struct page for a while, right?
> >
> > The problem that I have encountered is that some zone device pages are
> > not initialized at all. That sounds like a different issue from those 2 above.
> > I am having a hard time tracking what kind of pages those are and why we
> > cannot initialize their zone/node and make them reserved at least.
>
> And you are sure that these are in fact ZONE_DEVICE pages? Not memory
> holes e.g., tackled by
Well, the physical address matches the pmem range so I believe this is
the case.
[...]
> However, I do remember a discussion regarding "reserved altmap space"
> ZONE_DEVICE ranges, and whether to initialize them or leave them
> uninitialized. See comment in
>
> commit 77e080e7680e1e615587352f70c87b9e98126d03
> Author: Aneesh Kumar K.V <[email protected]>
> Date: Fri Oct 18 20:19:39 2019 -0700
>
> mm/memunmap: don't access uninitialized memmap in memunmap_pages()
yes, the reserved altmap space sounds like it might be it.
> "With an altmap, the memmap falling into the reserved altmap space are
> not initialized and, therefore, contain a garbage NID and a garbage zone.".
>
> I think the issue is that the ZONE_DEVICE pages that *host* the memmap
> of other pages might be left uninitialized.
>
> Like pfn_to_page(VIRT_TO_PFN(pfn_to_page(zone_device_pfn))), which falls
> onto ZONE_DEVICE with an altmap, could be uninitialized. This is very
> similar to Oscar's vmemmap-on-hotadded-memory approach, however,
> there we implicitly initialize the memmap of these pages just by the way
> the vmemmap is placed at the beginning of the memory block.
>
> If altmap-reserved space is placed partially into an early section that
> is marked as online (issue 1. I described), we have the same issue as
> 1., just a little harder to fix :)
Would it be possible to iterate over the reserved space and initialize
Node/zones at least?
Thanks!
--
Michal Hocko
SUSE Labs
On 04.01.21 16:10, Michal Hocko wrote:
> On Mon 04-01-21 15:51:35, David Hildenbrand wrote:
>> On 04.01.21 15:26, Michal Hocko wrote:
>>> On Mon 04-01-21 11:45:39, David Hildenbrand wrote:
> [...]
>>>> One instance where this is still an issue is
>>>> mm/memory-failure.c:memory_failure() and
>>>> mm/memory-failure.c:soft_offline_page(). I thought for a while about
>>>> "fixing" these, but to me it felt like fixing pfn_to_online_page() is
>>>> actually the right approach.
>>>>
>>>> But worse, before ZONE_DEVICE hot-add
>>>> 1. The whole section memmap does already exist (early sections always
>>>> have a full memmap for the whole section)
>>>> 2. The whole section memmap is initialized (although eventually with
>>>> dummy node/zone 0/0 for memory holes until that part is fixed) and might
>>>> be accessed by pfn walkers.
>>>>
>>>> So when hotadding ZONE_DEVICE we are modifying already existing and
>>>> visible memmaps. Bad.
>>>
>>> Could you elaborate please?
>>
>> Simplistic example: Assume you have a VM with 64MB on x86-64.
>>
>> We need exactly one memory section (-> one memory block device). We
>> allocate the memmap for a full section - an "early section". So we have
>> a memmap for 128MB, while 64MB are actually in use, the other 64MB is
>> initialized (like a memory hole). pfn_to_online_page() would return a
>> valid struct page for the whole section memmap.
>>
>> The remaining 64MB can later be used for hot-adding ZONE_DEVICE memory,
>> essentially re-initializing that part of the already-existing memmap.
>>
>> See pfn_valid():
>>
>> /*
>> * Traditionally early sections always returned pfn_valid() for
>> * the entire section-sized span.
>> */
>> return early_section(ms) || pfn_section_valid(ms, pfn);
>>
>>
>> Depending on the memory layout of the system, a pfn walker might just be
>> about to stumble over this range getting re-initialized.
>
> Right. But as long as pfn walkers are not synchronized with the memory
> hotplug this is a general problem with any struct page. Whether it
> belongs to pmem or regular memory, no?
Yes, however in this case even the memory hotplug lock does not help.
But yes, related issues.
>
>>>> 2. Deferred init of ZONE_DEVICE ranges
>>>>
>>>> memmap_init_zone_device() runs after the ZONE_DEVICE zone was resized
>>>> and outside the memhp lock. I did not follow if the use of
>>>> get_dev_pagemap() actually makes sure that memmap_init_zone_device() in
>>>> pagemap_range() actually completed. I don't think it does.
>>>
>>> So a pfn walker can see an uninitialized struct page for a while, right?
>>>
>>> The problem that I have encountered is that some zone device pages are
>>> not initialized at all. That sounds like a different issue from those 2 above.
>>> I am having a hard time tracking what kind of pages those are and why we
>>> cannot initialize their zone/node and make them reserved at least.
>>
>> And you are sure that these are in fact ZONE_DEVICE pages? Not memory
>> holes e.g., tackled by
>
> Well, the physical address matches the pmem range so I believe this is
> the case.
>
> [...]
>> However, I do remember a discussion regarding "reserved altmap space"
>> ZONE_DEVICE ranges, and whether to initialize them or leave them
>> uninitialized. See comment in
>>
>> commit 77e080e7680e1e615587352f70c87b9e98126d03
>> Author: Aneesh Kumar K.V <[email protected]>
>> Date: Fri Oct 18 20:19:39 2019 -0700
>>
>> mm/memunmap: don't access uninitialized memmap in memunmap_pages()
>
> yes, the reserved altmap space sounds like it might be it.
[...]
> Would it be possible to iterate over the reserved space and initialize
> Node/zones at least?
Right, I was confused by the terminology. We actually initialize the
pages used for memory mapping in
move_pfn_range_to_zone()->memmap_init_zone(). But we seem to exclude the
"reserved space" - I think for good reason.
I think the issue is that this "reserved space" might actually get
overridden by something else later, as it won't be used as a memmap, but
just to store "anything".
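IIRC the relevant clamping sits in memmap_init_zone() - only the pages
that will host the memmap get initialized there, and the "reserve" part
is skipped entirely (quoting mm/page_alloc.c from memory):

        /*
         * Honor reservation requested by the driver for this ZONE_DEVICE
         * memory. We limit the total number of pages to initialize to just
         * those that might contain the memory mapping. We will defer the
         * ZONE_DEVICE page initialization until after we have released
         * the hotplug lock.
         */
        if (zone == ZONE_DEVICE) {
                if (!altmap)
                        return;

                if (start_pfn == altmap->base_pfn)
                        start_pfn += altmap->reserve;
                end_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
        }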
Do the physical addresses you see fall into the same section as boot
memory? Or what's around these addresses?
--
Thanks,
David / dhildenb
On Mon 04-01-21 16:15:23, David Hildenbrand wrote:
> On 04.01.21 16:10, Michal Hocko wrote:
[...]
> Do the physical addresses you see fall into the same section as boot
> memory? Or what's around these addresses?
Yes, I am getting garbage for the first struct page belonging to the
pmem section [1]
[ 0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x603fffffff]
[ 0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 0x6060000000-0x11d5fffffff] non-volatile
The pfn without the initialized struct page is 0x6060000. This is the
first pfn in a section.
---
[1] it shares the same memory block with the regular memory as those
are in the same 2G range.
--
Michal Hocko
SUSE Labs
On 04.01.21 16:33, Michal Hocko wrote:
> On Mon 04-01-21 16:15:23, David Hildenbrand wrote:
>> On 04.01.21 16:10, Michal Hocko wrote:
> [...]
>> Do the physical addresses you see fall into the same section as boot
>> memory? Or what's around these addresses?
>
> Yes, I am getting garbage for the first struct page belonging to the
> pmem section [1]
> [ 0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x603fffffff]
> [ 0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 0x6060000000-0x11d5fffffff] non-volatile
>
> The pfn without the initialized struct page is 0x6060000. This is the
> first pfn in a section.
Okay, so we're not dealing with the "early section" mess I described,
different story.
Due to [1], is_mem_section_removable() called
pfn_to_page(PHYS_PFN(0x6060000)). page_zone(page) made it crash, as not
initialized.
Let's assume this is indeed a reserved pfn in the altmap. What's the
actual address of the memmap?
I do wonder what hosts pfn_to_page(PHYS_PFN(0x6060000)) - is it actually
part of the actual altmap (i.e. > 0x6060000) or maybe even self-hosted?
If it's not self-hosted, initializing the relevant memmaps should work
just fine I guess. Otherwise things get more complicated.
--
Thanks,
David / dhildenb
On 04.01.21 16:43, David Hildenbrand wrote:
> On 04.01.21 16:33, Michal Hocko wrote:
>> On Mon 04-01-21 16:15:23, David Hildenbrand wrote:
>>> On 04.01.21 16:10, Michal Hocko wrote:
>> [...]
>>> Do the physical addresses you see fall into the same section as boot
>>> memory? Or what's around these addresses?
>>
>> Yes, I am getting garbage for the first struct page belonging to the
>> pmem section [1]
>> [ 0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x603fffffff]
>> [ 0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 0x6060000000-0x11d5fffffff] non-volatile
>>
>> The pfn without the initialized struct page is 0x6060000. This is the
>> first pfn in a section.
>
> Okay, so we're not dealing with the "early section" mess I described,
> different story.
>
> Due to [1], is_mem_section_removable() called
> pfn_to_page(PHYS_PFN(0x6060000)). page_zone(page) made it crash, as not
> initialized.
>
> Let's assume this is indeed a reserved pfn in the altmap. What's the
> actual address of the memmap?
>
> I do wonder what hosts pfn_to_page(PHYS_PFN(0x6060000)) - is it actually
> part of the actual altmap (i.e. > 0x6060000) or maybe even self-hosted?
>
> If it's not self-hosted, initializing the relevant memmaps should work
> just fine I guess. Otherwise things get more complicated.
Oh, I forgot: pfn_to_online_page() should at least in your example make
sure other pfn walkers are safe. It was just an issue of
is_mem_section_removable().
--
Thanks,
David / dhildenb
On Mon 04-01-21 16:43:49, David Hildenbrand wrote:
> On 04.01.21 16:33, Michal Hocko wrote:
> > On Mon 04-01-21 16:15:23, David Hildenbrand wrote:
> >> On 04.01.21 16:10, Michal Hocko wrote:
> > [...]
> >> Do the physical addresses you see fall into the same section as boot
> >> memory? Or what's around these addresses?
> >
> > Yes, I am getting garbage for the first struct page belonging to the
> > pmem section [1]
> > [ 0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x603fffffff]
> > [ 0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 0x6060000000-0x11d5fffffff] non-volatile
> >
> > The pfn without the initialized struct page is 0x6060000. This is the
> > first pfn in a section.
>
> Okay, so we're not dealing with the "early section" mess I described,
> different story.
>
> Due to [1], is_mem_section_removable() called
> pfn_to_page(PHYS_PFN(0x6060000)). page_zone(page) made it crash, as not
> initialized.
Exactly!
> Let's assume this is indeed a reserved pfn in the altmap. What's the
> actual address of the memmap?
Not sure what exactly you are asking for but crash says
crash> kmem -p 6060000
PAGE PHYSICAL MAPPING INDEX CNT FLAGS
fffff8c600181800 6060000 0 0 0 fffffc0000000
> I do wonder what hosts pfn_to_page(PHYS_PFN(0x6060000)) - is it actually
> part of the actual altmap (i.e. > 0x6060000) or maybe even self-hosted?
I am not really familiar with pmem so I would need more assistance
here. I've tried this (a shot in the dark):
crash> struct page.pgmap fffff8c600181800
pgmap = 0xfffff8c600181808
crash> struct -x dev_pagemap 0xfffff8c600181808
struct dev_pagemap {
altmap = {
base_pfn = 0xfffff8c600181808,
end_pfn = 0xfffff8c600181808,
reserve = 0x0,
free = 0x0,
align = 0x0,
alloc = 0xffffffff
},
res = {
start = 0x0,
end = 0xfffffc0000000,
name = 0xfffff8c600181848 "H\030\030",
flags = 0xfffff8c600181848,
desc = 0x0,
parent = 0x0,
sibling = 0x0,
child = 0xffffffff
},
ref = 0x0,
internal_ref = {
count = {
counter = 0xfffffc0000000
},
percpu_count_ptr = 0xfffff8c600181888,
release = 0xfffff8c600181888,
confirm_switch = 0x0,
force_atomic = 0x0,
allow_reinit = 0x0,
rcu = {
next = 0x0,
func = 0xffffffff
}
},
done = {
done = 0x0,
wait = {
lock = {
{
rlock = {
raw_lock = {
{
val = {
counter = 0xc0000000
},
{
locked = 0x0,
pending = 0x0
},
{
locked_pending = 0x0,
tail = 0xc000
}
}
}
}
}
},
head = {
next = 0xfffff8c6001818c8,
prev = 0xfffff8c6001818c8
}
}
},
dev = 0x0,
type = 0,
flags = 0x0,
ops = 0x0
}
Not sure whether this is of any use.
> If it's not self-hosted, initializing the relevant memmaps should work
> just fine I guess. Otherwise things get more complicated.
--
Michal Hocko
SUSE Labs
>> Let's assume this is indeed a reserved pfn in the altmap. What's the
>> actual address of the memmap?
>
> Not sure what exactly you are asking for but crash says
> crash> kmem -p 6060000
> PAGE PHYSICAL MAPPING INDEX CNT FLAGS
> fffff8c600181800 6060000 0 0 0 fffffc0000000
^ this looks like it was somewhat initialized. All flags zero, nid/zone
set to -1 (wild guess) and thus the crash? weird
>
>> I do wonder what hosts pfn_to_page(PHYS_PFN(0x6060000)) - is it actually
>> part of the actual altmap (i.e. > 0x6060000) or maybe even self-hosted?
>
> I am not really familiar with pmem so I would need more assistance
> here. I've tried this (a shot in the dark):
> crash> struct page.pgmap fffff8c600181800
> pgmap = 0xfffff8c600181808
That's weird. If the memmap is at fffff8c600181800, why should the pgmap
be at an offset of 8 bytes from there? The "pgmap" field is actually at
an offset of 8 bytes within the memmap ...
Assuming the memmap is not actually ZONE_DEVICE, fffff8c600181800 really
only contains garbage, including the pgmap pointer :(
> crash> struct -x dev_pagemap 0xfffff8c600181808
> struct dev_pagemap {
> altmap = {
> base_pfn = 0xfffff8c600181808,
> end_pfn = 0xfffff8c600181808,
^ because this is very weird
> reserve = 0x0,
^ and this indicates nothing was actually reserved.
--
Thanks,
David / dhildenb
On Mon, Jan 4, 2021 at 2:45 AM David Hildenbrand <[email protected]> wrote:
>
> On 04.01.21 11:03, Michal Hocko wrote:
> > Hi,
> > back in March [1] you recommended 53cdc1cb29e8
> > ("drivers/base/memory.c: indicate all memory blocks as removable") to be
> > backported to stable trees, and that has led to a more general discussion
> > about the current state of pfn walkers wrt. uninitialized pmem struct
> > pages. We haven't settled on any specific solution for that except for a
> > general sentiment that pfn_to_online_page should be able to catch those.
> > I might have missed follow-ups on that, but I do not think we have
> > landed on any actual solution in the end. Have I just missed any follow-ups?
>
> Thanks for raising this issue. It's still on my list of "broken and
> non-trivial to fix" things.
>
> So, as far as I recall, we still have the following two issues remaining:
>
> 1. pfn_to_online_page() false positives
>
> The semantics of pfn_to_online_page() were broken with sub-section
> hot-add in corner cases.
The motivation for that backport was for pre-subsection kernels. When
onlining pmem that collides with the same section as System-RAM we may
have a situation like:
|-- SECTION -- |
|-- System RAM -- PMEM -- |
|-- pfn_valid() -- PMEM metadata -- |
So problem 0. is just System RAM + PMEM collisions independent of
sub-section support.
>
> If we have ZONE_DEVICE hot-added memory that overlaps in a section with
> boot memory, this memory section will contain parts of ZONE_DEVICE memory
> and parts of !ZONE_DEVICE memory. This can happen in sub-section
> granularity (2MB IIRC). pfn_to_online_page() will succeed on ZONE_DEVICE
> memory parts as the whole section is marked as online. Bad.
>
> One instance where this is still an issue is
> mm/memory-failure.c:memory_failure() and
> mm/memory-failure.c:soft_offline_page(). I thought for a while about
> "fixing" these, but to me it felt like fixing pfn_to_online_page() is
> actually the right approach.
This is complicated by MEMORY_DEVICE_PRIVATE that I believe wants to
say "yes" to pfn_to_online_page().
>
> But worse, before ZONE_DEVICE hot-add
> 1. The whole section memmap does already exist (early sections always
> have a full memmap for the whole section)
> 2. The whole section memmap is initialized (although eventually with
> dummy node/zone 0/0 for memory holes until that part is fixed) and might
> be accessed by pfn walkers.
>
> So when hotadding ZONE_DEVICE we are modifying already existing and
> visible memmaps. Bad.
Where does the rewrite of a dummy page entry matter in practice? It
would certainly be exceedingly Bad if in-use 'struct page' instances
were rewritten. You're only alleging the former, correct?
>
>
> 2. Deferred init of ZONE_DEVICE ranges
>
> memmap_init_zone_device() runs after the ZONE_DEVICE zone was resized
> and outside the memhp lock. I did not follow if the use of
> get_dev_pagemap() actually makes sure that memmap_init_zone_device() in
> pagemap_range() actually completed. I don't think it does.
>
> >
> > Is anybody working on that?
> >
>
> I believe Dan mentioned somewhere that he wants to see a real instance
> of this producing a BUG before actually moving forward with a fix. I
> might be wrong.
I think I'm missing an argument for the user-visible effects of the
"Bad." statements above. I think soft_offline_page() is a candidate
for a local fix because mm/memory-failure.c already has a significant
amount of page-type specific knowledge. So teaching it "yes" for
MEMORY_DEVICE_PRIVATE-ZONE_DEVICE and "no" for other ZONE_DEVICE seems
ok to me.
>
>
> We might tackle 1. by:
>
> a) [worked-around] doing get_dev_pagemap() before pfn_to_online_page() -
> just plain ugly.
>
> b) [worked-around] using a sub-section online-map and extending
> pfn_to_online_page(). IMHO ugly to fix this corner case.
>
> c) [worked-around] extending pfn_to_online_page() by zone checks. IMHO racy.
>
> d) fixed by not allowing ZONE_DEVICE and !ZONE_DEVICE within a single
> section. In the worst case, don't add partially present sections that
> have big holes in the beginning/end. Like, if there is a 128MB section
> with 126MB of memory followed by a 2MB hole, don't add that memory to
> Linux with CONFIG_ZONE_DEVICE. Might turn some memory unusable, but
> well, it would be the price to pay for simplicity. Being able to hotadd
> PMEM is more important than using each and every last MB of memory.
The collisions that are out there in the wild are 64MB System RAM
followed by 64MB of PMEM. If you're suggesting reducing System RAM
that collides with PMEM that's a consideration. It can't go the other
way because there are deployed configurations that have persistent
data there. Reducing System RAM runs into the problem of how early
does the kernel know that it's bordering ZONE_DEVICE. It's not just
PMEM, it's also EFI_MEMORY_SP (Linux Soft Reserved) memory.
> e) decrease the section size and drop sub-section hotadd support. As
> sub-section hotadd is 2MB and MAX_ORDER - 1 corresponds to 4MB, this is
> mostly impossible. Worse on aarch64 with 64k base pages - 1024MB
> sections and MAX_ORDER - 1 corresponds to 512MB. I think this is not feasible.
f) Teach the places where pfn_to_online_page() gives the wrong answer
how to handle ZONE_DEVICE. Like:
a78986aae9b2 KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
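That commit added a helper along these lines (quoting from memory):

static bool kvm_is_zone_device_pfn(kvm_pfn_t pfn)
{
        /*
         * The metadata used by is_zone_device_page() is only valid once
         * the page has been pinned, e.g. by get_user_pages(). WARN if
         * the page_count() is zero to catch bad usage of this helper.
         */
        if (!pfn_valid(pfn) || WARN_ON_ONCE(!page_count(pfn_to_page(pfn))))
                return false;

        return is_zone_device_page(pfn_to_page(pfn));
}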
On Mon, Jan 4, 2021 at 7:59 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 04-01-21 16:43:49, David Hildenbrand wrote:
> > On 04.01.21 16:33, Michal Hocko wrote:
> > > On Mon 04-01-21 16:15:23, David Hildenbrand wrote:
> > >> On 04.01.21 16:10, Michal Hocko wrote:
> > > [...]
> > >> Do the physical addresses you see fall into the same section as boot
> > >> memory? Or what's around these addresses?
> > >
> > > Yes, I am getting garbage for the first struct page belonging to the
> > > pmem section [1]
> > > [ 0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x603fffffff]
> > > [ 0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 0x6060000000-0x11d5fffffff] non-volatile
> > >
> > > The pfn without the initialized struct page is 0x6060000. This is the
> > > first pfn in a section.
> >
> > Okay, so we're not dealing with the "early section" mess I described,
> > different story.
> >
> > Due to [1], is_mem_section_removable() called
> > pfn_to_page(PHYS_PFN(0x6060000)). page_zone(page) made it crash, as not
> > initialized.
>
> Exactly!
>
> > Let's assume this is indeed a reserved pfn in the altmap. What's the
> > actual address of the memmap?
>
> Not sure what exactly you are asking for but crash says
> crash> kmem -p 6060000
> PAGE PHYSICAL MAPPING INDEX CNT FLAGS
> fffff8c600181800 6060000 0 0 0 fffffc0000000
>
> > I do wonder what hosts pfn_to_page(PHYS_PFN(0x6060000)) - is it actually
> > part of the actual altmap (i.e. > 0x6060000) or maybe even self-hosted?
>
> I am not really familiar with pmem so I would need more assistance
> here. I've tried this (a shot in the dark):
> crash> struct page.pgmap fffff8c600181800
> pgmap = 0xfffff8c600181808
Does /proc/iomem show an active namespace in the range? You should be
able to skip ahead to the first pfn in that namespace to find the
first dev_pagemap. I would have expected pfn_to_online_page() to have
saved you here. This address range is section aligned.
On Mon 04-01-21 21:33:06, Dan Williams wrote:
> On Mon, Jan 4, 2021 at 7:59 AM Michal Hocko <[email protected]> wrote:
[...]
> > Not sure what exactly you are asking for but crash says
> > crash> kmem -p 6060000
> > PAGE PHYSICAL MAPPING INDEX CNT FLAGS
> > fffff8c600181800 6060000 0 0 0 fffffc0000000
> >
> > > I do wonder what hosts pfn_to_page(PHYS_PFN(0x6060000)) - is it actually
> > > part of the actual altmap (i.e. > 0x6060000) or maybe even self-hosted?
> >
> > I am not really familiar with pmem so I would need more assistance
> > here. I've tried this (a shot in the dark):
> > crash> struct page.pgmap fffff8c600181800
> > pgmap = 0xfffff8c600181808
>
> Does /proc/iomem show an active namespace in the range?
Any tips on how to dig that out from the crash dump?
> You should be
> able to skip ahead to the first pfn in that namespace to find the
> first dev_pagemap. I would have expected pfn_to_online_page() to have
> saved you here. This address range is section aligned.
Well, the affected code in this case was
end_pfn = min(start_pfn + nr_pages,
zone_end_pfn(page_zone(pfn_to_page(start_pfn))));
where start_pfn was the first pfn of a memory section. This code was
completely unaware of zone device or dev_pagemap, like most other pfn
walkers. It just wanted to get bounds for the zone but it stumbled over
uninitialized zone/node.
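For reference, page_zone() is just (include/linux/mm.h):

static inline struct zone *page_zone(const struct page *page)
{
        return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
}

so with garbage page->flags both the node lookup and the zone index are
garbage, hence the crash.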
--
Michal Hocko
SUSE Labs
On Mon 04-01-21 17:30:48, David Hildenbrand wrote:
> >> Let's assume this is indeed a reserved pfn in the altmap. What's the
> >> actual address of the memmap?
> >
> > Not sure what exactly you are asking for but crash says
> > crash> kmem -p 6060000
> > PAGE PHYSICAL MAPPING INDEX CNT FLAGS
> > fffff8c600181800 6060000 0 0 0 fffffc0000000
>
> ^ this looks like it was somewhat initialized. All flags zero, nid/zone
> set to -1 (wild guess) and thus the crash? weird
Yes that made me scratch my head as well.
> >> I do wonder what hosts pfn_to_page(PHYS_PFN(0x6060000)) - is it actually
> >> part of the actual altmap (i.e. > 0x6060000) or maybe even self-hosted?
> >
> > I am not really familiar with pmem so I would need more assistance
> > here. I've tried this (a shot in the dark):
> > crash> struct page.pgmap fffff8c600181800
> > pgmap = 0xfffff8c600181808
>
> That's weird. If the memmap is at fffff8c600181800, why should the pgmap
> be at an offset of 8 bytes from there? The "pgmap" field is actually at
> an offset of 8 bytes within the memmap ...
I hadn't noticed this is an offset into struct page. This is indeed
weird because the structure is much larger than struct page. But maybe
it reuses storage of several struct pages in a row. Or maybe it is not
pgmap but something else that shares the offset in the structure.
--
Michal Hocko
SUSE Labs
On Mon 04-01-21 21:17:43, Dan Williams wrote:
> On Mon, Jan 4, 2021 at 2:45 AM David Hildenbrand <[email protected]> wrote:
[...]
> > I believe Dan mentioned somewhere that he wants to see a real instance
> > of this producing a BUG before actually moving forward with a fix. I
> > might be wrong.
>
> I think I'm missing an argument for the user-visible effects of the
> "Bad." statements above. I think soft_offline_page() is a candidate
> for a local fix because mm/memory-failure.c already has a significant
> amount of page-type specific knowledge. So teaching it "yes" for
> MEMORY_DEVICE_PRIVATE-ZONE_DEVICE and "no" for other ZONE_DEVICE seems
> ok to me.
I believe we do not want to teach _every_ pfn walker about zone device
pages. This would be quite error prone, especially when a missing check
could lead to silently broken data or a BUG_ON with debugging enabled
(which is not the case for many production users). Or are we talking
about different bugs here?
--
Michal Hocko
SUSE Labs
On Mon 04-01-21 16:44:52, David Hildenbrand wrote:
> On 04.01.21 16:43, David Hildenbrand wrote:
> > On 04.01.21 16:33, Michal Hocko wrote:
> >> On Mon 04-01-21 16:15:23, David Hildenbrand wrote:
> >>> On 04.01.21 16:10, Michal Hocko wrote:
> >> [...]
> >>> Do the physical addresses you see fall into the same section as boot
> >>> memory? Or what's around these addresses?
> >>
> >> Yes, I am getting garbage for the first struct page belonging to the
> >> pmem section [1]
> >> [ 0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x603fffffff]
> >> [ 0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 0x6060000000-0x11d5fffffff] non-volatile
> >>
> >> The pfn without the initialized struct page is 0x6060000. This is the
> >> first pfn in a section.
> >
> > Okay, so we're not dealing with the "early section" mess I described,
> > different story.
> >
> > Due to [1], is_mem_section_removable() called
> > pfn_to_page(PHYS_PFN(0x6060000)). page_zone(page) made it crash, as not
> > initialized.
> >
> > Let's assume this is indeed a reserved pfn in the altmap. What's the
> > actual address of the memmap?
> >
> > I do wonder what hosts pfn_to_page(PHYS_PFN(0x6060000)) - is it actually
> > part of the actual altmap (i.e. > 0x6060000) or maybe even self-hosted?
> >
> > If it's not self-hosted, initializing the relevant memmaps should work
> > just fine I guess. Otherwise things get more complicated.
>
> Oh, I forgot: pfn_to_online_page() should at least in your example make
> sure other pfn walkers are safe. It was just an issue of
> is_mem_section_removable().
Hmm, I suspect you are right. I hadn't put this together, thanks! The memory
section is indeed marked offline so pfn_to_online_page would bail
out:
crash> p (0x6060000>>15)
$3 = 3084
crash> p mem_section[3084/128][3084 & 127]
$4 = {
section_mem_map = 18446736128020054019,
usage = 0xffff902dcf956680,
page_ext = 0x0,
pad = 0
}
crash> p 18446736128020054019 & (1UL<<2)
$5 = 0
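For reference, bit 2 is SECTION_IS_ONLINE; the flag bits encoded in
section_mem_map are (include/linux/mmzone.h):

#define SECTION_MARKED_PRESENT  (1UL<<0)
#define SECTION_HAS_MEM_MAP     (1UL<<1)
#define SECTION_IS_ONLINE       (1UL<<2)
#define SECTION_IS_EARLY        (1UL<<3)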
That makes it considerably less of a problem than I thought!
Thanks David!
--
Michal Hocko
SUSE Labs
On Tue 05-01-21 09:01:00, Michal Hocko wrote:
> On Mon 04-01-21 16:44:52, David Hildenbrand wrote:
> > On 04.01.21 16:43, David Hildenbrand wrote:
> > > On 04.01.21 16:33, Michal Hocko wrote:
> > >> On Mon 04-01-21 16:15:23, David Hildenbrand wrote:
> > >>> On 04.01.21 16:10, Michal Hocko wrote:
> > >> [...]
> > >>> Do the physical addresses you see fall into the same section as boot
> > >>> memory? Or what's around these addresses?
> > >>
> > >> Yes, I am getting garbage for the first struct page belonging to the
> > >> pmem section [1]
> > >> [ 0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x603fffffff]
> > >> [ 0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 0x6060000000-0x11d5fffffff] non-volatile
> > >>
> > >> The pfn without the initialized struct page is 0x6060000. This is the
> > >> first pfn in a section.
> > >
> > > Okay, so we're not dealing with the "early section" mess I described,
> > > different story.
> > >
> > > Due to [1], is_mem_section_removable() called
> > > pfn_to_page(PHYS_PFN(0x6060000)). page_zone(page) made it crash, as not
> > > initialized.
> > >
> > > Let's assume this is indeed a reserved pfn in the altmap. What's the
> > > actual address of the memmap?
> > >
> > > I do wonder what hosts pfn_to_page(PHYS_PFN(0x6060000)) - is it actually
> > > part of the actual altmap (i.e. > 0x6060000) or maybe even self-hosted?
> > >
> > > If it's not self-hosted, initializing the relevant memmaps should work
> > > just fine I guess. Otherwise things get more complicated.
> >
> > Oh, I forgot: pfn_to_online_page() should at least in your example make
> > sure other pfn walkers are safe. It was just an issue of
> > is_mem_section_removable().
>
> Hmm, I suspect you are right. I hadn't put this together, thanks! The memory
> section is indeed marked offline so pfn_to_online_page would bail
> out:
> crash> p (0x6060000>>15)
> $3 = 3084
> crash> p mem_section[3084/128][3084 & 127]
> $4 = {
> section_mem_map = 18446736128020054019,
> usage = 0xffff902dcf956680,
> page_ext = 0x0,
> pad = 0
> }
> crash> p 18446736128020054019 & (1UL<<2)
> $5 = 0
>
> That makes it considerably less of a problem than I thought!
Forgot to add that those who for some reason are running kernels without
53cdc1cb29e8 ("drivers/base/memory.c: indicate all memory blocks as
removable") can fix the crash with the following simple patch.
Index: linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
===================================================================
--- linux-5.3-users_mhocko_SLE15-SP2_for-next.orig/drivers/base/memory.c
+++ linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
@@ -152,9 +152,14 @@ static ssize_t removable_show(struct dev
goto out;
for (i = 0; i < sections_per_block; i++) {
- if (!present_section_nr(mem->start_section_nr + i))
+ unsigned long nr = mem->start_section_nr + i;
+ if (!present_section_nr(nr))
continue;
- pfn = section_nr_to_pfn(mem->start_section_nr + i);
+ if (!online_section_nr()) {
+ ret = 0;
+ break;
+ }
+ pfn = section_nr_to_pfn(nr);
ret &= is_mem_section_removable(pfn, PAGES_PER_SECTION);
}
--
Michal Hocko
SUSE Labs
On Tue, Jan 5, 2021 at 12:17 AM Michal Hocko <[email protected]> wrote:
>
> On Tue 05-01-21 09:01:00, Michal Hocko wrote:
> > On Mon 04-01-21 16:44:52, David Hildenbrand wrote:
> > > On 04.01.21 16:43, David Hildenbrand wrote:
> > > > On 04.01.21 16:33, Michal Hocko wrote:
> > > >> On Mon 04-01-21 16:15:23, David Hildenbrand wrote:
> > > >>> On 04.01.21 16:10, Michal Hocko wrote:
> > > >> [...]
> > > >>> Do the physical addresses you see fall into the same section as boot
> > > >>> memory? Or what's around these addresses?
> > > >>
> > > >> Yes, I am getting garbage for the first struct page belonging to the
> > > >> pmem section [1]
> > > >> [ 0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x603fffffff]
> > > >> [ 0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 0x6060000000-0x11d5fffffff] non-volatile
> > > >>
> > > >> The pfn without the initialized struct page is 0x6060000. This is the
> > > >> first pfn in a section.
> > > >
> > > > Okay, so we're not dealing with the "early section" mess I described,
> > > > different story.
> > > >
> > > > Due to [1], is_mem_section_removable() called
> > > > pfn_to_page(PHYS_PFN(0x6060000)). page_zone(page) made it crash, as not
> > > > initialized.
> > > >
> > > > Let's assume this is indeed a reserved pfn in the altmap. What's the
> > > > actual address of the memmap?
> > > >
> > > > I do wonder what hosts pfn_to_page(PHYS_PFN(0x6060000)) - is it actually
> > > > part of the actual altmap (i.e. > 0x6060000) or maybe even self-hosted?
> > > >
> > > > If it's not self-hosted, initializing the relevant memmaps should work
> > > > just fine I guess. Otherwise things get more complicated.
> > >
> > > Oh, I forgot: pfn_to_online_page() should at least in your example make
> > > sure other pfn walkers are safe. It was just an issue of
> > > is_mem_section_removable().
> >
> > Hmm, I suspect you are right. I hadn't put this together, thanks! The memory
> > section is indeed marked offline so pfn_to_online_page would bail
> > out:
> > crash> p (0x6060000>>15)
> > $3 = 3084
> > crash> p mem_section[3084/128][3084 & 127]
> > $4 = {
> > section_mem_map = 18446736128020054019,
> > usage = 0xffff902dcf956680,
> > page_ext = 0x0,
> > pad = 0
> > }
> > crash> p 18446736128020054019 & (1UL<<2)
> > $5 = 0
> >
> > That makes it considerably less of a problem than I thought!
>
> Forgot to add that those who for some reason are running kernels without
> 53cdc1cb29e8 ("drivers/base/memory.c: indicate all memory blocks as
> removable") can fix the crash with the following simple patch.
>
> Index: linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
> ===================================================================
> --- linux-5.3-users_mhocko_SLE15-SP2_for-next.orig/drivers/base/memory.c
> +++ linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
> @@ -152,9 +152,14 @@ static ssize_t removable_show(struct dev
> goto out;
>
> for (i = 0; i < sections_per_block; i++) {
> - if (!present_section_nr(mem->start_section_nr + i))
> + unsigned long nr = mem->start_section_nr + i;
> + if (!present_section_nr(nr))
> continue;
> - pfn = section_nr_to_pfn(mem->start_section_nr + i);
> + if (!online_section_nr()) {
I assume that's online_section_nr(nr) in the version that compiles?
This makes sense because the memory block size is larger than the
section size. I suspect you have 1GB memory block size on this system,
but since the System RAM and PMEM collide at a 512MB alignment in a
memory block you end up walking the back end of the last 512MB of the
System RAM memory block and run into the offline PMEM section.
So, I don't think it's pfn_to_online_page that necessarily needs to
know how to disambiguate each page, it's things that walk sections and
memory blocks and expect them to be consistent over the span.
On Tue 05-01-21 00:27:34, Dan Williams wrote:
> On Tue, Jan 5, 2021 at 12:17 AM Michal Hocko <[email protected]> wrote:
> >
> > On Tue 05-01-21 09:01:00, Michal Hocko wrote:
> > > On Mon 04-01-21 16:44:52, David Hildenbrand wrote:
> > > > On 04.01.21 16:43, David Hildenbrand wrote:
> > > > > On 04.01.21 16:33, Michal Hocko wrote:
> > > > >> On Mon 04-01-21 16:15:23, David Hildenbrand wrote:
> > > > >>> On 04.01.21 16:10, Michal Hocko wrote:
> > > > >> [...]
> > > > >>> Do the physical addresses you see fall into the same section as boot
> > > > >>> memory? Or what's around these addresses?
> > > > >>
> > > > >> Yes, I am getting garbage for the first struct page belonging to the
> > > > >> pmem section [1]
> > > > >> [ 0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x603fffffff]
> > > > >> [ 0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 0x6060000000-0x11d5fffffff] non-volatile
> > > > >>
> > > > >> The pfn without the initialized struct page is 0x6060000. This is the
> > > > >> first pfn in a section.
> > > > >
> > > > > Okay, so we're not dealing with the "early section" mess I described,
> > > > > different story.
> > > > >
> > > > > Due to [1], is_mem_section_removable() called
> > > > > pfn_to_page(PHYS_PFN(0x6060000)). page_zone(page) made it crash, as not
> > > > > initialized.
> > > > >
> > > > > Let's assume this is indeed a reserved pfn in the altmap. What's the
> > > > > actual address of the memmap?
> > > > >
> > > > > I do wonder what hosts pfn_to_page(PHYS_PFN(0x6060000)) - is it actually
> > > > > part of the actual altmap (i.e. > 0x6060000) or maybe even self-hosted?
> > > > >
> > > > > If it's not self-hosted, initializing the relevant memmaps should work
> > > > > just fine I guess. Otherwise things get more complicated.
> > > >
> > > > Oh, I forgot: pfn_to_online_page() should at least in your example make
> > > > sure other pfn walkers are safe. It was just an issue of
> > > > is_mem_section_removable().
> > >
> > > Hmm, I suspect you are right. I hadn't put this together, thanks! The memory
> > > section is indeed marked offline so pfn_to_online_page would bail
> > > out:
> > > crash> p (0x6060000>>15)
> > > $3 = 3084
> > > crash> p mem_section[3084/128][3084 & 127]
> > > $4 = {
> > > section_mem_map = 18446736128020054019,
> > > usage = 0xffff902dcf956680,
> > > page_ext = 0x0,
> > > pad = 0
> > > }
> > > crash> p 18446736128020054019 & (1UL<<2)
> > > $5 = 0
> > >
> > > That makes it considerably less of a problem than I thought!
> >
> > Forgot to add that those who for some reason are running kernels without
> > 53cdc1cb29e8 ("drivers/base/memory.c: indicate all memory blocks as
> > removable") can fix the crash with the following simple patch.
> >
> > Index: linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
> > ===================================================================
> > --- linux-5.3-users_mhocko_SLE15-SP2_for-next.orig/drivers/base/memory.c
> > +++ linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
> > @@ -152,9 +152,14 @@ static ssize_t removable_show(struct dev
> > goto out;
> >
> > for (i = 0; i < sections_per_block; i++) {
> > - if (!present_section_nr(mem->start_section_nr + i))
> > + unsigned long nr = mem->start_section_nr + i;
> > + if (!present_section_nr(nr))
> > continue;
> > - pfn = section_nr_to_pfn(mem->start_section_nr + i);
> > + if (!online_section_nr()) {
>
> I assume that's online_section_nr(nr) in the version that compiles?
Yup.
> This makes sense because the memory block size is larger than the
> section size. I suspect you have 1GB memory block size on this system,
> but since the System RAM and PMEM collide at a 512MB alignment in a
> memory block you end up walking the back end of the last 512MB of the
> System RAM memory block and run into the offline PMEM section.
Sections are 128MB and memory blocks are 2GB on this system.
> So, I don't think it's pfn_to_online_page that necessarily needs to
> know how to disambiguate each page, it's things that walk sections and
> memory blocks and expect them to be consistent over the span.
Well, memory hotplug code is hard wired to the sparse memory model, so in
this particular case asking about the section is ok. But pfn walkers
shouldn't really care and should only rely on pfn_to_online_page. That will
do the right thing here, so we are good as long as the section is marked
properly. But this would become a problem as soon as the uninitialized
pages were sharing the same memory section, as David pointed out.
pfn_to_online_page would then return something containing garbage. So we
should still think of a way to either initialize all those pages or make
sure pfn_to_online_page recognizes them. The former is preferred IMHO.
--
Michal Hocko
SUSE Labs
On Tue 05-01-21 00:57:43, Dan Williams wrote:
> On Tue, Jan 5, 2021 at 12:42 AM Michal Hocko <[email protected]> wrote:
> >
> > On Tue 05-01-21 00:27:34, Dan Williams wrote:
> > > On Tue, Jan 5, 2021 at 12:17 AM Michal Hocko <[email protected]> wrote:
> > > >
> > > > On Tue 05-01-21 09:01:00, Michal Hocko wrote:
> > > > > On Mon 04-01-21 16:44:52, David Hildenbrand wrote:
> > > > > > On 04.01.21 16:43, David Hildenbrand wrote:
> > > > > > > On 04.01.21 16:33, Michal Hocko wrote:
> > > > > > >> On Mon 04-01-21 16:15:23, David Hildenbrand wrote:
> > > > > > >>> On 04.01.21 16:10, Michal Hocko wrote:
> > > > > > >> [...]
> > > > > > >>> Do the physical addresses you see fall into the same section as boot
> > > > > > >>> memory? Or what's around these addresses?
> > > > > > >>
> > > > > > >> Yes, I am getting garbage for the first struct page belonging to the
> > > > > > >> pmem section [1]
> > > > > > >> [ 0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x603fffffff]
> > > > > > >> [ 0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 0x6060000000-0x11d5fffffff] non-volatile
> > > > > > >>
> > > > > > >> The pfn without the initialized struct page is 0x6060000. This is the
> > > > > > >> first pfn in a section.
> > > > > > >
> > > > > > > Okay, so we're not dealing with the "early section" mess I described,
> > > > > > > different story.
> > > > > > >
> > > > > > > Due to [1], is_mem_section_removable() called
> > > > > > > pfn_to_page(PHYS_PFN(0x6060000)). page_zone(page) made it crash, as not
> > > > > > > initialized.
> > > > > > >
> > > > > > > Let's assume this is indeed a reserved pfn in the altmap. What's the
> > > > > > > actual address of the memmap?
> > > > > > >
> > > > > > > I do wonder what hosts pfn_to_page(PHYS_PFN(0x6060000)) - is it actually
> > > > > > > part of the actual altmap (i.e. > 0x6060000) or maybe even self-hosted?
> > > > > > >
> > > > > > > If it's not self-hosted, initializing the relevant memmaps should work
> > > > > > > just fine I guess. Otherwise things get more complicated.
> > > > > >
> > > > > > Oh, I forgot: pfn_to_online_page() should at least in your example make
> > > > > > sure other pfn walkers are safe. It was just an issue of
> > > > > > is_mem_section_removable().
> > > > >
> > > > > Hmm, I suspect you are right. I haven't put this together, thanks! The memory
> > > > > section is indeed marked offline so pfn_to_online_page would bail
> > > > > out:
> > > > > crash> p (0x6060000>>15)
> > > > > $3 = 3084
> > > > > crash> p mem_section[3084/128][3084 & 127]
> > > > > $4 = {
> > > > > section_mem_map = 18446736128020054019,
> > > > > usage = 0xffff902dcf956680,
> > > > > page_ext = 0x0,
> > > > > pad = 0
> > > > > }
> > > > > crash> p 18446736128020054019 & (1UL<<2)
> > > > > $5 = 0
> > > > >
> > > > > That makes it considerably less of a problem than I thought!
> > > >
> > > > Forgot to add that those who are running kernels without 53cdc1cb29e8
> > > > ("drivers/base/memory.c: indicate all memory blocks as removable") for
> > > > some reason can fix the crash by the following simple patch.
> > > >
> > > > Index: linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
> > > > ===================================================================
> > > > --- linux-5.3-users_mhocko_SLE15-SP2_for-next.orig/drivers/base/memory.c
> > > > +++ linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
> > > > @@ -152,9 +152,14 @@ static ssize_t removable_show(struct dev
> > > > goto out;
> > > >
> > > > for (i = 0; i < sections_per_block; i++) {
> > > > - if (!present_section_nr(mem->start_section_nr + i))
> > > > + unsigned long nr = mem->start_section_nr + i;
> > > > + if (!present_section_nr(nr))
> > > > continue;
> > > > - pfn = section_nr_to_pfn(mem->start_section_nr + i);
> > > > + if (!online_section_nr()) {
> > >
> > > I assume that's online_section_nr(nr) in the version that compiles?
> >
> > Yup.
> >
> > > This makes sense because the memory block size is larger than the
> > > section size. I suspect you have 1GB memory block size on this system,
> > > but since the System RAM and PMEM collide at a 512MB alignment in a
> > > memory block you end up walking the back end of the last 512MB of the
> > > System RAM memory block and run into the offline PMEM section.
> >
> > Sections are 128MB and memory blocks are 2GB on this system.
> >
> > > So, I don't think it's pfn_to_online_page that necessarily needs to
> > > know how to disambiguate each page, it's things that walk sections and
> > > memory blocks and expect them to be consistent over the span.
> >
> > Well, memory hotplug code is hard-wired to the sparse memory model, so in
> > this particular case asking about the section is ok. But pfn walkers
> > shouldn't really care and should only rely on pfn_to_online_page. That will
> > do the right thing here. So we are good as long as the section is marked
> > properly. But this would become a problem as soon as the uninitialized
> > pages were sharing the same memory section, as David pointed out.
> > pfn_to_online_page would then return something containing garbage. So we
> > should still think of a way to either initialize all those pages or make
> > sure pfn_to_online_page recognizes them. The former is preferred IMHO.
>
> The former would not have saved the crash in this case because
> pfn_to_online_page() is not used in v5.3:removable_show() that I can
> see, nor some of the other paths that might walk pfns and do the wrong
> thing with ZONE_DEVICE.
If the page was initialized properly, and by that I mean also having it
reserved, then the old code would have properly reported it as not
removable.
> However, I do think pfn_to_online_page() should be reliable, and I
> prefer to just brute force add a section flag to indicate whether the
> section might be ZONE_DEVICE polluted and fallback to the
> get_dev_pagemap() slow-path in that case.
Do we have some spare room to hold that flag in a section?
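If I read include/linux/mmzone.h correctly, the low bits of
mem_section::section_mem_map are currently used as below, so one more bit
still looks available - but please double-check:

	#define SECTION_MARKED_PRESENT	(1UL<<0)
	#define SECTION_HAS_MEM_MAP	(1UL<<1)
	#define SECTION_IS_ONLINE	(1UL<<2)	/* the bit tested in crash above */
	#define SECTION_IS_EARLY	(1UL<<3)
	/* bit 4 would still be free for a "ZONE_DEVICE polluted" flag */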
> ...but it would still require hunting to find the places where
> pfn_to_online_page() is missing for assumptions like this crash which
> assumed memblock-online + section-present == section-online.
Yes, but most users should be using pfn_to_online_page already.
--
Michal Hocko
SUSE Labs
On 05.01.21 10:05, Michal Hocko wrote:
> On Tue 05-01-21 00:57:43, Dan Williams wrote:
>> On Tue, Jan 5, 2021 at 12:42 AM Michal Hocko <[email protected]> wrote:
>>>
>>> On Tue 05-01-21 00:27:34, Dan Williams wrote:
>>>> On Tue, Jan 5, 2021 at 12:17 AM Michal Hocko <[email protected]> wrote:
>>>>>
>>>>> On Tue 05-01-21 09:01:00, Michal Hocko wrote:
>>>>>> On Mon 04-01-21 16:44:52, David Hildenbrand wrote:
>>>>>>> On 04.01.21 16:43, David Hildenbrand wrote:
>>>>>>>> On 04.01.21 16:33, Michal Hocko wrote:
>>>>>>>>> On Mon 04-01-21 16:15:23, David Hildenbrand wrote:
>>>>>>>>>> On 04.01.21 16:10, Michal Hocko wrote:
>>>>>>>>> [...]
>>>>>>>>>> Do the physical addresses you see fall into the same section as boot
>>>>>>>>>> memory? Or what's around these addresses?
>>>>>>>>>
>>>>>>>>> Yes, I am getting garbage for the first struct page belonging to the
>>>>>>>>> pmem section [1]
>>>>>>>>> [ 0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x603fffffff]
>>>>>>>>> [ 0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 0x6060000000-0x11d5fffffff] non-volatile
>>>>>>>>>
>>>>>>>>> The pfn without the initialized struct page is 0x6060000. This is the
>>>>>>>>> first pfn in a section.
>>>>>>>>
>>>>>>>> Okay, so we're not dealing with the "early section" mess I described,
>>>>>>>> different story.
>>>>>>>>
>>>>>>>> Due to [1], is_mem_section_removable() called
>>>>>>>> pfn_to_page(PHYS_PFN(0x6060000)). page_zone(page) made it crash, as not
>>>>>>>> initialized.
>>>>>>>>
>>>>>>>> Let's assume this is indeed a reserved pfn in the altmap. What's the
>>>>>>>> actual address of the memmap?
>>>>>>>>
>>>>>>>> I do wonder what hosts pfn_to_page(PHYS_PFN(0x6060000)) - is it actually
>>>>>>>> part of the actual altmap (i.e. > 0x6060000) or maybe even self-hosted?
>>>>>>>>
>>>>>>>> If it's not self-hosted, initializing the relevant memmaps should work
>>>>>>>> just fine I guess. Otherwise things get more complicated.
>>>>>>>
>>>>>>> Oh, I forgot: pfn_to_online_page() should at least in your example make
>>>>>>> sure other pfn walkers are safe. It was just an issue of
>>>>>>> is_mem_section_removable().
>>>>>>
>>>>>> Hmm, I suspect you are right. I haven't put this together, thanks! The memory
>>>>>> section is indeed marked offline so pfn_to_online_page would bail
>>>>>> out:
>>>>>> crash> p (0x6060000>>15)
>>>>>> $3 = 3084
>>>>>> crash> p mem_section[3084/128][3084 & 127]
>>>>>> $4 = {
>>>>>> section_mem_map = 18446736128020054019,
>>>>>> usage = 0xffff902dcf956680,
>>>>>> page_ext = 0x0,
>>>>>> pad = 0
>>>>>> }
>>>>>> crash> p 18446736128020054019 & (1UL<<2)
>>>>>> $5 = 0
>>>>>>
>>>>>> That makes it considerably less of a problem than I thought!
>>>>>
>>>>> Forgot to add that those who are running kernels without 53cdc1cb29e8
>>>>> ("drivers/base/memory.c: indicate all memory blocks as removable") for
>>>>> some reason can fix the crash by the following simple patch.
>>>>>
>>>>> Index: linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
>>>>> ===================================================================
>>>>> --- linux-5.3-users_mhocko_SLE15-SP2_for-next.orig/drivers/base/memory.c
>>>>> +++ linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
>>>>> @@ -152,9 +152,14 @@ static ssize_t removable_show(struct dev
>>>>> goto out;
>>>>>
>>>>> for (i = 0; i < sections_per_block; i++) {
>>>>> - if (!present_section_nr(mem->start_section_nr + i))
>>>>> + unsigned long nr = mem->start_section_nr + i;
>>>>> + if (!present_section_nr(nr))
>>>>> continue;
>>>>> - pfn = section_nr_to_pfn(mem->start_section_nr + i);
>>>>> + if (!online_section_nr()) {
>>>>
>>>> I assume that's online_section_nr(nr) in the version that compiles?
>>>
>>> Yup.
>>>
>>>> This makes sense because the memory block size is larger than the
>>>> section size. I suspect you have 1GB memory block size on this system,
>>>> but since the System RAM and PMEM collide at a 512MB alignment in a
>>>> memory block you end up walking the back end of the last 512MB of the
>>>> System RAM memory block and run into the offline PMEM section.
>>>
>>> Sections are 128MB and memory blocks are 2GB on this system.
>>>
>>>> So, I don't think it's pfn_to_online_page that necessarily needs to
>>>> know how to disambiguate each page, it's things that walk sections and
>>>> memory blocks and expect them to be consistent over the span.
>>>
>>> Well, memory hotplug code is hard-wired to the sparse memory model, so in
>>> this particular case asking about the section is ok. But pfn walkers
>>> shouldn't really care and should only rely on pfn_to_online_page. That will
>>> do the right thing here. So we are good as long as the section is marked
>>> properly. But this would become a problem as soon as the uninitialized
>>> pages were sharing the same memory section, as David pointed out.
>>> pfn_to_online_page would then return something containing garbage. So we
>>> should still think of a way to either initialize all those pages or make
>>> sure pfn_to_online_page recognizes them. The former is preferred IMHO.
>>
>> The former would not have saved the crash in this case because
>> pfn_to_online_page() is not used in v5.3:removable_show() that I can
>> see, nor some of the other paths that might walk pfns and do the wrong
>> thing with ZONE_DEVICE.
>
> If the page was initialized properly, and by that I mean also having it
> reserved, then the old code would have properly reported it as not
> removable.
>
>> However, I do think pfn_to_online_page() should be reliable, and I
>> prefer to just brute force add a section flag to indicate whether the
>> section might be ZONE_DEVICE polluted and fallback to the
>> get_dev_pagemap() slow-path in that case.
>
> Do we have some spare room to hold that flag in a section?
>
>> ...but it would still require hunting to find the places where
>> pfn_to_online_page() is missing for assumptions like this crash which
>> assumed memblock-online + section-present == section-online.
>
> Yes, but most users should be using pfn_to_online_page already.
>
Quite honestly, let's not hack around this issue and just fix it
properly - make pfn_to_online_page() only ever return an initialized,
online (buddy) page, just as documented.
What speaks against not adding early sections when they have a
relevant hole at the end, besides wasting a few MB? Relevant setups most
probably just don't care. Print a warning.
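Something like the following when registering early sections - a
hand-waving sketch, section_has_trailing_hole() is made up and would have
to be implemented:

	/* sketch: refuse partially present early sections with ZONE_DEVICE */
	if (IS_ENABLED(CONFIG_ZONE_DEVICE) && early_section(ms) &&
	    section_has_trailing_hole(section_nr)) {
		pr_warn("not adding section %lu with a trailing hole\n",
			section_nr);
		continue;	/* waste a couple of MB instead */
	}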
--
Thanks,
David / dhildenb
On 05.01.21 08:50, Michal Hocko wrote:
> On Mon 04-01-21 21:17:43, Dan Williams wrote:
>> On Mon, Jan 4, 2021 at 2:45 AM David Hildenbrand <[email protected]> wrote:
> [...]
>>> I believe Dan mentioned somewhere that he wants to see a real instance
>>> of this producing a BUG before actually moving forward with a fix. I
>>> might be wrong.
>>
>> I think I'm missing an argument for the user-visible effects of the
>> "Bad." statements above. I think soft_offline_page() is a candidate
>> for a local fix because mm/memory-failure.c already has a significant
>> amount of page-type specific knowledge. So teaching it "yes" for
>> MEMORY_DEVICE_PRIVATE-ZONE_DEVICE and "no" for other ZONE_DEVICE seems
>> ok to me.
>
> I believe we do not want to teach _every_ pfn walker about zone device
> pages. This would be quite error-prone, especially when a missing check
> could lead to silently corrupted data or a BUG_ON with debugging enabled
> (which is not the case for many production users). Or are we talking
> about different bugs here?
I'd like us to stick to the documentation, e.g., include/linux/mmzone.h
"
pfn_valid() is meant to be able to tell if a given PFN has valid memmap
associated with it or not. This means that a struct page exists for this
pfn. The caller cannot assume the page is fully initialized in general.
Hotplugable pages might not have been onlined yet. pfn_to_online_page()
will ensure the struct page is fully online and initialized. Special
pages (e.g. ZONE_DEVICE) are never onlined and should be treated
accordingly.
"
--
Thanks,
David / dhildenb
On Tue 05-01-21 10:13:49, David Hildenbrand wrote:
> On 05.01.21 10:05, Michal Hocko wrote:
> > On Tue 05-01-21 00:57:43, Dan Williams wrote:
> >> On Tue, Jan 5, 2021 at 12:42 AM Michal Hocko <[email protected]> wrote:
> >>>
> >>> On Tue 05-01-21 00:27:34, Dan Williams wrote:
> >>>> On Tue, Jan 5, 2021 at 12:17 AM Michal Hocko <[email protected]> wrote:
> >>>>>
> >>>>> On Tue 05-01-21 09:01:00, Michal Hocko wrote:
> >>>>>> On Mon 04-01-21 16:44:52, David Hildenbrand wrote:
> >>>>>>> On 04.01.21 16:43, David Hildenbrand wrote:
> >>>>>>>> On 04.01.21 16:33, Michal Hocko wrote:
> >>>>>>>>> On Mon 04-01-21 16:15:23, David Hildenbrand wrote:
> >>>>>>>>>> On 04.01.21 16:10, Michal Hocko wrote:
> >>>>>>>>> [...]
> >>>>>>>>>> Do the physical addresses you see fall into the same section as boot
> >>>>>>>>>> memory? Or what's around these addresses?
> >>>>>>>>>
> >>>>>>>>> Yes, I am getting garbage for the first struct page belonging to the
> >>>>>>>>> pmem section [1]
> >>>>>>>>> [ 0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x603fffffff]
> >>>>>>>>> [ 0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 0x6060000000-0x11d5fffffff] non-volatile
> >>>>>>>>>
> >>>>>>>>> The pfn without the initialized struct page is 0x6060000. This is the
> >>>>>>>>> first pfn in a section.
> >>>>>>>>
> >>>>>>>> Okay, so we're not dealing with the "early section" mess I described,
> >>>>>>>> different story.
> >>>>>>>>
> >>>>>>>> Due to [1], is_mem_section_removable() called
> >>>>>>>> pfn_to_page(PHYS_PFN(0x6060000)). page_zone(page) made it crash, as not
> >>>>>>>> initialized.
> >>>>>>>>
> >>>>>>>> Let's assume this is indeed a reserved pfn in the altmap. What's the
> >>>>>>>> actual address of the memmap?
> >>>>>>>>
> >>>>>>>> I do wonder what hosts pfn_to_page(PHYS_PFN(0x6060000)) - is it actually
> >>>>>>>> part of the actual altmap (i.e. > 0x6060000) or maybe even self-hosted?
> >>>>>>>>
> >>>>>>>> If it's not self-hosted, initializing the relevant memmaps should work
> >>>>>>>> just fine I guess. Otherwise things get more complicated.
> >>>>>>>
> >>>>>>> Oh, I forgot: pfn_to_online_page() should at least in your example make
> >>>>>>> sure other pfn walkers are safe. It was just an issue of
> >>>>>>> is_mem_section_removable().
> >>>>>>
> >>>>>> Hmm, I suspect you are right. I haven't put this together, thanks! The memory
> >>>>>> section is indeed marked offline so pfn_to_online_page would bail
> >>>>>> out:
> >>>>>> crash> p (0x6060000>>15)
> >>>>>> $3 = 3084
> >>>>>> crash> p mem_section[3084/128][3084 & 127]
> >>>>>> $4 = {
> >>>>>> section_mem_map = 18446736128020054019,
> >>>>>> usage = 0xffff902dcf956680,
> >>>>>> page_ext = 0x0,
> >>>>>> pad = 0
> >>>>>> }
> >>>>>> crash> p 18446736128020054019 & (1UL<<2)
> >>>>>> $5 = 0
> >>>>>>
> >>>>>> That makes it considerably less of a problem than I thought!
> >>>>>
> >>>>> Forgot to add that those who are running kernels without 53cdc1cb29e8
> >>>>> ("drivers/base/memory.c: indicate all memory blocks as removable") for
> >>>>> some reason can fix the crash by the following simple patch.
> >>>>>
> >>>>> Index: linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
> >>>>> ===================================================================
> >>>>> --- linux-5.3-users_mhocko_SLE15-SP2_for-next.orig/drivers/base/memory.c
> >>>>> +++ linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
> >>>>> @@ -152,9 +152,14 @@ static ssize_t removable_show(struct dev
> >>>>> goto out;
> >>>>>
> >>>>> for (i = 0; i < sections_per_block; i++) {
> >>>>> - if (!present_section_nr(mem->start_section_nr + i))
> >>>>> + unsigned long nr = mem->start_section_nr + i;
> >>>>> + if (!present_section_nr(nr))
> >>>>> continue;
> >>>>> - pfn = section_nr_to_pfn(mem->start_section_nr + i);
> >>>>> + if (!online_section_nr()) {
> >>>>
> >>>> I assume that's online_section_nr(nr) in the version that compiles?
> >>>
> >>> Yup.
> >>>
> >>>> This makes sense because the memory block size is larger than the
> >>>> section size. I suspect you have 1GB memory block size on this system,
> >>>> but since the System RAM and PMEM collide at a 512MB alignment in a
> >>>> memory block you end up walking the back end of the last 512MB of the
> >>>> System RAM memory block and run into the offline PMEM section.
> >>>
> >>> Sections are 128MB and memory blocks are 2GB on this system.
> >>>
> >>>> So, I don't think it's pfn_to_online_page that necessarily needs to
> >>>> know how to disambiguate each page, it's things that walk sections and
> >>>> memory blocks and expect them to be consistent over the span.
> >>>
> >>> Well, memory hotplug code is hard-wired to the sparse memory model, so in
> >>> this particular case asking about the section is ok. But pfn walkers
> >>> shouldn't really care and should only rely on pfn_to_online_page. That will
> >>> do the right thing here. So we are good as long as the section is marked
> >>> properly. But this would become a problem as soon as the uninitialized
> >>> pages were sharing the same memory section, as David pointed out.
> >>> pfn_to_online_page would then return something containing garbage. So we
> >>> should still think of a way to either initialize all those pages or make
> >>> sure pfn_to_online_page recognizes them. The former is preferred IMHO.
> >>
> >> The former would not have saved the crash in this case because
> >> pfn_to_online_page() is not used in v5.3:removable_show() that I can
> >> see, nor some of the other paths that might walk pfns and do the wrong
> >> thing with ZONE_DEVICE.
> >
> > If the page was initialized properly, and by that I mean also having it
> > reserved, then the old code would have properly reported it as not
> > removable.
> >
> >> However, I do think pfn_to_online_page() should be reliable, and I
> >> prefer to just brute force add a section flag to indicate whether the
> >> section might be ZONE_DEVICE polluted and fallback to the
> >> get_dev_pagemap() slow-path in that case.
> >
> > Do we have some spare room to hold that flag in a section?
> >
> >> ...but it would still require hunting to find the places where
> >> pfn_to_online_page() is missing for assumptions like this crash which
> >> assumed memblock-online + section-present == section-online.
> >
> > Yes, but most users should be using pfn_to_online_page already.
> >
>
> Quite honestly, let's not hack around this issue and just fix it
> properly - make pfn_to_online_page() only ever return an initialized,
> online (buddy) page, just as documented.
Just to make sure we are on the same page. You are agreeing with Dan
that pfn_to_online_page should check for zone device pages? Ideally in a
slow path.
> What speaks against not adding early sections when they have a
> relevant hole at the end, besides wasting a few MB? Relevant setups most
> probably just don't care. Print a warning.
If we can reasonably avoid shared sections then I do not object.
Sacrificing a few megs of memory should be fine in most cases. It is not
like reasonable systems would have many hybrid sections.
--
Michal Hocko
SUSE Labs
On 05.01.21 06:17, Dan Williams wrote:
> On Mon, Jan 4, 2021 at 2:45 AM David Hildenbrand <[email protected]> wrote:
>>
>> On 04.01.21 11:03, Michal Hocko wrote:
>>> Hi,
>>> back in March [1] you have recommended 53cdc1cb29e8
>>> ("drivers/base/memory.c: indicate all memory blocks as removable") to be
>>> backported to stable trees and that has led to a more general discussion
>>> about the current state of pfn walkers wrt. uninitialized pmem struct
>>> pages. We haven't concluded any specific solution for that except for a
>>> general sentiment that pfn_to_online_page should be able to catch those.
>>> I might have missed any follow ups on that but I do not think we have
>>> landed on any actual solution in the end. Have I just missed any followups?
>>
>> Thanks for raising this issue. It's still on my list of "broken and
>> non-trivial to fix" things.
>>
>> So, as far as I recall, we still have the following two issues remaining:
>>
>> 1. pfn_to_online_page() false positives
>>
>> The semantics of pfn_to_online_page() were broken with sub-section
>> hot-add in corner cases.
>
> The motivation for that backport was for pre-subsection kernels. When
> onlining pmem that collides with the same section as System-RAM we may
> have a situation like:
>
> |-- SECTION -- |
> |-- System RAM -- PMEM -- |
> |-- pfn_valid() -- PMEM metadata -- |
>
> So problem 0. is just System RAM + PMEM collisions independent of
> sub-section support.
IIRC, you were not able to hot-add the PMEM device before sub-section
support (which was bad and fixed by sub-section hot-add). How was this a
problem before?
>
>>
>> If we have ZONE_DEVICE hot-added memory that overlaps in a section with
>> boot memory, this memory section will contain parts ZONE_DEVICE memory
>> and parts !ZONE_DEVICE memory. This can happen in sub-section
>> granularity (2MB IIRC). pfn_to_online_page() will succeed on ZONE_DEVICE
>> memory parts as the whole section is marked as online. Bad.
>>
>> One instance where this is still an issue is
>> mm/memory-failure.c:memory_failure() and
>> mm/memory-failure.c:soft_offline_page(). I thought for a while about
>> "fixing" these, but to me it felt like fixing pfn_to_online_page() is
>> actually the right approach.
>
> This is complicated by MEMORY_DEVICE_PRIVATE that I believe wants to
> say "yes" to pfn_to_online_page().
>
>>
>> But worse, before ZONE_DEVICE hot-add
>> 1. The whole section memmap does already exist (early sections always
>> have a full memmap for the whole section)
>> 2. The whole section memmap is initialized (although eventually with
>> dummy node/zone 0/0 for memory holes until that part is fixed) and might
>> be accessed by pfn walkers.
>>
>> So when hotadding ZONE_DEVICE we are modifying already existing and
>> visible memmaps. Bad.
>
> Where does the rewrite of a dummy page entry matter in practice? It
> would certainly be exceedingly Bad if in-use 'struct page' instances
> were rewritten. You're only alleging the former, correct?
Yes, just another piece of the puzzle.
>> 2. Deferred init of ZONE_DEVICE ranges
>>
>> memmap_init_zone_device() runs after the ZONE_DEVICE zone was resized
>> and outside the memhp lock. I did not follow if the use of
>> get_dev_pagemap() actually makes sure that memmap_init_zone_device() in
>> pagemap_range() actually completed. I don't think it does.
>>
>>>
>>> Is anybody working on that?
>>>
>>
>> I believe Dan mentioned somewhere that he wants to see a real instance
>> of this producing a BUG before actually moving forward with a fix. I
>> might be wrong.
>
> I think I'm missing an argument for the user-visible effects of the
> "Bad." statements above. I think soft_offline_page() is a candidate
> for a local fix because mm/memory-failure.c already has a significant
> amount of page-type specific knowledge. So teaching it "yes" for
> MEMORY_DEVICE_PRIVATE-ZONE_DEVICE and "no" for other ZONE_DEVICE seems
> ok to me.
I am not completely against working around the issues we have in the code,
but nobody has stumbled over them yet. IMHO it's just error-prone code and
handling we have here that will bite us in the long term. E.g., any pfn
walker/code that sticks to the current documentation of
pfn_to_online_page().
I am not sure it's the right thing to do for
MEMORY_DEVICE_PRIVATE-ZONE_DEVICE; that requires more discussion.
>> We might tackle 1. by:
[...]
>> d) fixed by not allowing ZONE_DEVICE and !ZONE_DEVICE within a single
>> section. In the worst case, don't add partially present sections that
>> have big holes in the beginning/end. Like, if there is a 128MB section
>> with 126MB of memory followed by a 2MB hole, don't add that memory to
>> Linux with CONFIG_ZONE_DEVICE. Might turn some memory unusable, but
>> well, it would be the price to pay for simplicity. Being able to hotadd
>> PMEM is more important than using each and every last MB of memory.
>
> The collisions that are out there in the wild are 64MB System RAM
> followed by 64MB of PMEM. If you're suggesting reducing System RAM
> that collides with PMEM that's a consideration. It can't go the other
> way because there are deployed configurations that have persistent
> data there. Reducing System RAM runs into the problem of how early
> does the kernel know that it's bordering ZONE_DEVICE. It's not just
> PMEM, it's also EFI_MEMORY_SP (Linux Soft Reserved) memory.
Yeah, obviously the first one. Being able to add+use PMEM is more
important than using each and every last MB of main memory.
I wonder if we can just stop adding any system RAM like
[ Memory Section ]
[ RAM ] [ Hole ]
When there could be the possibility that the hole might actually be
PMEM. (e.g., with CONFIG_ZONE_DEVICE and it being the last section in a
sequence of sections, not just a tiny hole)
--
Thanks,
David / dhildenb
On 05.01.21 10:25, Michal Hocko wrote:
> On Tue 05-01-21 10:13:49, David Hildenbrand wrote:
>> On 05.01.21 10:05, Michal Hocko wrote:
>>> On Tue 05-01-21 00:57:43, Dan Williams wrote:
>>>> On Tue, Jan 5, 2021 at 12:42 AM Michal Hocko <[email protected]> wrote:
>>>>>
>>>>> On Tue 05-01-21 00:27:34, Dan Williams wrote:
>>>>>> On Tue, Jan 5, 2021 at 12:17 AM Michal Hocko <[email protected]> wrote:
>>>>>>>
>>>>>>> On Tue 05-01-21 09:01:00, Michal Hocko wrote:
>>>>>>>> On Mon 04-01-21 16:44:52, David Hildenbrand wrote:
>>>>>>>>> On 04.01.21 16:43, David Hildenbrand wrote:
>>>>>>>>>> On 04.01.21 16:33, Michal Hocko wrote:
>>>>>>>>>>> On Mon 04-01-21 16:15:23, David Hildenbrand wrote:
>>>>>>>>>>>> On 04.01.21 16:10, Michal Hocko wrote:
>>>>>>>>>>> [...]
>>>>>>>>>>>> Do the physical addresses you see fall into the same section as boot
>>>>>>>>>>>> memory? Or what's around these addresses?
>>>>>>>>>>>
>>>>>>>>>>> Yes, I am getting garbage for the first struct page belonging to the
>>>>>>>>>>> pmem section [1]
>>>>>>>>>>> [ 0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x603fffffff]
>>>>>>>>>>> [ 0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 0x6060000000-0x11d5fffffff] non-volatile
>>>>>>>>>>>
>>>>>>>>>>> The pfn without the initialized struct page is 0x6060000. This is the
>>>>>>>>>>> first pfn in a section.
>>>>>>>>>>
>>>>>>>>>> Okay, so we're not dealing with the "early section" mess I described,
>>>>>>>>>> different story.
>>>>>>>>>>
>>>>>>>>>> Due to [1], is_mem_section_removable() called
>>>>>>>>>> pfn_to_page(PHYS_PFN(0x6060000)). page_zone(page) made it crash, as not
>>>>>>>>>> initialized.
>>>>>>>>>>
>>>>>>>>>> Let's assume this is indeed a reserved pfn in the altmap. What's the
>>>>>>>>>> actual address of the memmap?
>>>>>>>>>>
>>>>>>>>>> I do wonder what hosts pfn_to_page(PHYS_PFN(0x6060000)) - is it actually
>>>>>>>>>> part of the actual altmap (i.e. > 0x6060000) or maybe even self-hosted?
>>>>>>>>>>
>>>>>>>>>> If it's not self-hosted, initializing the relevant memmaps should work
>>>>>>>>>> just fine I guess. Otherwise things get more complicated.
>>>>>>>>>
>>>>>>>>> Oh, I forgot: pfn_to_online_page() should at least in your example make
>>>>>>>>> sure other pfn walkers are safe. It was just an issue of
>>>>>>>>> is_mem_section_removable().
>>>>>>>>
>>>>>>>> Hmm, I suspect you are right. I haven't put this together, thanks! The memory
>>>>>>>> section is indeed marked offline so pfn_to_online_page would bail
>>>>>>>> out:
>>>>>>>> crash> p (0x6060000>>15)
>>>>>>>> $3 = 3084
>>>>>>>> crash> p mem_section[3084/128][3084 & 127]
>>>>>>>> $4 = {
>>>>>>>> section_mem_map = 18446736128020054019,
>>>>>>>> usage = 0xffff902dcf956680,
>>>>>>>> page_ext = 0x0,
>>>>>>>> pad = 0
>>>>>>>> }
>>>>>>>> crash> p 18446736128020054019 & (1UL<<2)
>>>>>>>> $5 = 0
>>>>>>>>
>>>>>>>> That makes it considerably less of a problem than I thought!
>>>>>>>
>>>>>>> Forgot to add that those who are running kernels without 53cdc1cb29e8
>>>>>>> ("drivers/base/memory.c: indicate all memory blocks as removable") for
>>>>>>> some reason can fix the crash by the following simple patch.
>>>>>>>
>>>>>>> Index: linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
>>>>>>> ===================================================================
>>>>>>> --- linux-5.3-users_mhocko_SLE15-SP2_for-next.orig/drivers/base/memory.c
>>>>>>> +++ linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
>>>>>>> @@ -152,9 +152,14 @@ static ssize_t removable_show(struct dev
>>>>>>> goto out;
>>>>>>>
>>>>>>> for (i = 0; i < sections_per_block; i++) {
>>>>>>> - if (!present_section_nr(mem->start_section_nr + i))
>>>>>>> + unsigned long nr = mem->start_section_nr + i;
>>>>>>> + if (!present_section_nr(nr))
>>>>>>> continue;
>>>>>>> - pfn = section_nr_to_pfn(mem->start_section_nr + i);
>>>>>>> + if (!online_section_nr()) {
>>>>>>
>>>>>> I assume that's online_section_nr(nr) in the version that compiles?
>>>>>
>>>>> Yup.
>>>>>
>>>>>> This makes sense because the memory block size is larger than the
>>>>>> section size. I suspect you have 1GB memory block size on this system,
>>>>>> but since the System RAM and PMEM collide at a 512MB alignment in a
>>>>>> memory block you end up walking the back end of the last 512MB of the
>>>>>> System RAM memory block and run into the offline PMEM section.
>>>>>
>>>>> Sections are 128MB and memory blocks are 2GB on this system.
>>>>>
>>>>>> So, I don't think it's pfn_to_online_page that necessarily needs to
>>>>>> know how to disambiguate each page, it's things that walk sections and
>>>>>> memory blocks and expect them to be consistent over the span.
>>>>>
>>>>> Well, memory hotplug code is hard-wired to the sparse memory model, so in
>>>>> this particular case asking about the section is ok. But pfn walkers
>>>>> shouldn't really care and should only rely on pfn_to_online_page. That will
>>>>> do the right thing here. So we are good as long as the section is marked
>>>>> properly. But this would become a problem as soon as the uninitialized
>>>>> pages were sharing the same memory section, as David pointed out.
>>>>> pfn_to_online_page would then return something containing garbage. So we
>>>>> should still think of a way to either initialize all those pages or make
>>>>> sure pfn_to_online_page recognizes them. The former is preferred IMHO.
>>>>
>>>> The former would not have saved the crash in this case because
>>>> pfn_to_online_page() is not used in v5.3:removable_show() that I can
>>>> see, nor some of the other paths that might walk pfns and do the wrong
>>>> thing with ZONE_DEVICE.
>>>
>>> If the page was initialized properly, and by that I mean also having it
>>> reserved, then the old code would have properly reported it as not
>>> removable.
>>>
>>>> However, I do think pfn_to_online_page() should be reliable, and I
>>>> prefer to just brute force add a section flag to indicate whether the
>>>> section might be ZONE_DEVICE polluted and fallback to the
>>>> get_dev_pagemap() slow-path in that case.
>>>
>>> Do we have some spare room to hold that flag in a section?
>>>
>>>> ...but it would still require hunting to find the places where
>>>> pfn_to_online_page() is missing for assumptions like this crash which
>>>> assumed memblock-online + section-present == section-online.
>>>
>>> Yes, but most users should be using pfn_to_online_page already.
>>>
>>
>> Quite honestly, let's not hack around this issue and just fix it
>> properly - make pfn_to_online_page() only ever return an initialized,
>> online (buddy) page, just as documented.
>
> Just to make sure we are on the same page. You are agreeing with Dan
> that pfn_to_online_page should check for zone device pages? Ideally in a
> slow path.
The most important part for me is that pfn_to_online_page() behaves as
documented. How that is implemented is a secondary concern. The easier,
the better (e.g., just avoid the corner-case (!) issue we discovered
completely).
--
Thanks,
David / dhildenb
On Tue, Jan 5, 2021 at 1:25 AM David Hildenbrand <[email protected]> wrote:
>
> On 05.01.21 06:17, Dan Williams wrote:
> > On Mon, Jan 4, 2021 at 2:45 AM David Hildenbrand <[email protected]> wrote:
> >>
> >> On 04.01.21 11:03, Michal Hocko wrote:
> >>> Hi,
> >>> back in March [1] you have recommended 53cdc1cb29e8
> >>> ("drivers/base/memory.c: indicate all memory blocks as removable") to be
> >>> backported to stable trees and that has led to a more general discussion
> >>> about the current state of pfn walkers wrt. uninitialized pmem struct
> >>> pages. We haven't concluded any specific solution for that except for a
> >>> general sentiment that pfn_to_online_page should be able to catch those.
> >>> I might have missed any follow ups on that but I do not think we have
> >>> landed on any actual solution in the end. Have I just missed any followups?
> >>
> >> Thanks for raising this issue. It's still on my list of "broken and
> >> non-trivial to fix" things.
> >>
> >> So, as far as I recall, we still have the following two issues remaining:
> >>
> >> 1. pfn_to_online_page() false positives
> >>
> >> The semantics of pfn_to_online_page() were broken with sub-section
> >> hot-add in corner cases.
> >
> > The motivation for that backport was for pre-subsection kernels. When
> > onlining pmem that collides with the same section as System-RAM we may
> > have a situation like:
> >
> > |-- SECTION -- |
> > |-- System RAM -- PMEM -- |
> > |-- pfn_valid() -- PMEM metadata -- |
> >
> > So problem 0. is just System RAM + PMEM collisions independent of
> > sub-section support.
>
> IIRC, you were not able to hot-add the PMEM device before sub-section
> support (which was bad and fixed by sub-section hot-add). How was this a
> problem before?
The details are in the subsection changelog but the tl;dr is that I
clumsily hacked around it by throwing away PMEM, but that breaks when
PMEM shifts around in physical address space from one boot to the
next. The moving around is on 64MB granularity, so subsection hotplug
lets those shifts be covered and enables smaller granularity /
alignment support for smaller things like p2pdma mappings.
>
> >
> >>
> >> If we have ZONE_DEVICE hot-added memory that overlaps in a section with
> >> boot memory, this memory section will contain parts ZONE_DEVICE memory
> >> and parts !ZONE_DEVICE memory. This can happen in sub-section
> >> granularity (2MB IIRC). pfn_to_online_page() will succeed on ZONE_DEVICE
> >> memory parts as the whole section is marked as online. Bad.
> >>
> >> One instance where this is still an issue is
> >> mm/memory-failure.c:memory_failure() and
> >> mm/memory-failure.c:soft_offline_page(). I thought for a while about
> >> "fixing" these, but to me it felt like fixing pfn_to_online_page() is
> >> actually the right approach.
> >
> > This is complicated by MEMORY_DEVICE_PRIVATE that I believe wants to
> > say "yes" to pfn_to_online_page().
> >
> >>
> >> But worse, before ZONE_DEVICE hot-add
> >> 1. The whole section memmap does already exist (early sections always
> >> have a full memmap for the whole section)
> >> 2. The whole section memmap is initialized (although eventually with
> >> dummy node/zone 0/0 for memory holes until that part is fixed) and might
> >> be accessed by pfn walkers.
> >>
> >> So when hotadding ZONE_DEVICE we are modifying already existing and
> >> visible memmaps. Bad.
> >
> > Where does the rewrite of a dummy page entry matter in practice? It
> > would certainly be exceedingly Bad if in-use 'struct page' instances
> > were rewritten. You're only alleging the former, correct?
>
> Yes, just another piece of the puzzle.
>
> >> 2. Deferred init of ZONE_DEVICE ranges
> >>
> >> memmap_init_zone_device() runs after the ZONE_DEVICE zone was resized
> >> and outside the memhp lock. I did not follow if the use of
> >> get_dev_pagemap() actually makes sure that memmap_init_zone_device() in
> >> pagemap_range() actually completed. I don't think it does.
> >>
> >>>
> >>> Is anybody working on that?
> >>>
> >>
> >> I believe Dan mentioned somewhere that he wants to see a real instance
> >> of this producing a BUG before actually moving forward with a fix. I
> >> might be wrong.
> >
> > I think I'm missing an argument for the user-visible effects of the
> > "Bad." statements above. I think soft_offline_page() is a candidate
> > for a local fix because mm/memory-failure.c already has a significant
> > amount of page-type specific knowledge. So teaching it "yes" for
> > MEMORY_DEVICE_PRIVATE-ZONE_DEVICE and "no" for other ZONE_DEVICE seems
> > ok to me.
>
> I am not completely against working around the issues we have in the code,
> but nobody has stumbled over them yet. IMHO it's just error-prone code and
> handling we have here that will bite us in the long term. E.g., any pfn
> walker/code that sticks to the current documentation of
> pfn_to_online_page().
>
> I am not sure it's the right thing to do for
> MEMORY_DEVICE_PRIVATE-ZONE_DEVICE; that requires more discussion.
>
> >> We might tackle 1. by:
>
> [...]
> >> d) fixed by not allowing ZONE_DEVICE and !ZONE_DEVICE within a single
> >> section. In the worst case, don't add partially present sections that
> >> have big holes in the beginning/end. Like, if there is a 128MB section
> >> with 126MB of memory followed by a 2MB hole, don't add that memory to
> >> Linux with CONFIG_ZONE_DEVICE. Might turn some memory unusable, but
> >> well, it would be the price to pay for simplicity. Being able to hotadd
> >> PMEM is more important than using each and every last MB of memory.
> >
> > The collisions that are out there in the wild are 64MB System RAM
> > followed by 64MB of PMEM. If you're suggesting reducing System RAM
> > that collides with PMEM that's a consideration. It can't go the other
> > way because there are deployed configurations that have persistent
> > data there. Reducing System RAM runs into the problem of how early
> > does the kernel know that it's bordering ZONE_DEVICE. It's not just
> > PMEM, it's also EFI_MEMORY_SP (Linux Soft Reserved) memory.
>
> Yeah, obviously the first one. Being able to add+use PMEM is more
> important than using each and every last MB of main memory.
>
> I wonder if we can just stop adding any system RAM like
>
> [ Memory Section ]
> [ RAM ] [ Hole ]
>
> When there could be the possibility that the hole might actually be
> PMEM. (e.g., with CONFIG_ZONE_DEVICE and it being the last section in a
> sequence of sections, not just a tiny hole)
I like the simplicity of it... I worry that the capacity loss
regression is easy to notice by looking at the output of free(1) from
one kernel to the next and someone screams.
>> Yeah, obviously the first one. Being able to add+use PMEM is more
>> important than using each and every last MB of main memory.
>>
>> I wonder if we can just stop adding any system RAM like
>>
>> [ Memory Section ]
>> [ RAM ] [ Hole ]
>>
>> When there could be the possibility that the hole might actually be
>> PMEM. (e.g., with CONFIG_ZONE_DEVICE and it being the last section in a
>> sequence of sections, not just a tiny hole)
>
> I like the simplicity of it... I worry that the capacity loss
> regression is easy to notice by looking at the output of free(1) from
> one kernel to the next and someone screams.
Well, you can always make it configurable and then simply fail to add
PMEM later if impossible (trying to sub-section hot-add into early
section). It's in the hands of the sysadmin then ("max out system ram"
vs. "support any PMEM device that could eventually be there at
runtime"). Distros would go for the second.
I agree that it's not optimal, but sometimes simplicity has to win.
--
Thanks,
David / dhildenb
On 05.01.21 08:44, Michal Hocko wrote:
> On Mon 04-01-21 17:30:48, David Hildenbrand wrote:
>>>> Let's assume this is indeed a reserved pfn in the altmap. What's the
>>>> actual address of the memmap?
>>>
>>> Not sure what exactly you are asking for but crash says
>>> crash> kmem -p 6060000
>>> PAGE PHYSICAL MAPPING INDEX CNT FLAGS
>>> fffff8c600181800 6060000 0 0 0 fffffc0000000
>>
>> ^ this looks like it was somewhat initialized. All flags zero, nid/zone
>> set to -1 (wild guess) and thus the crash? weird
>
> Yes that made me scratch my head as well.
My best guess would be that this is indeed part of the altmap reserve
with a memmap on the altmap. Maybe these values are just leftovers on
the PMEM device (e.g., from older kernels)?
In the kernel, I cannot find anything that would initialize node/zone to
-1 only.
@Dan: Will get_dev_pagemap() return a valid dev_pagemap for pfns falling
into the altmap reserved region? I think we should also initialize the
memmap of the altmap reserved region somehow.
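I.e., would a probe like this (just a sketch) also catch the reserved
pfns:

	/* sketch: does some device pagemap claim this pfn? */
	struct dev_pagemap *pgmap = get_dev_pagemap(pfn, NULL);

	if (pgmap)
		put_dev_pagemap(pgmap);
	/* pgmap != NULL would mean the pfn belongs to a ZONE_DEVICE range */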
--
Thanks,
David / dhildenb
On Tue, Jan 5, 2021 at 1:37 AM David Hildenbrand <[email protected]> wrote:
>
> >> Yeah, obviously the first one. Being able to add+use PMEM is more
> >> important than using each and every last MB of main memory.
> >>
> >> I wonder if we can just stop adding any system RAM like
> >>
> >> [ Memory Section ]
> >> [ RAM ] [ Hole ]
> >>
> >> When there could be the possibility that the hole might actually be
> >> PMEM. (e.g., with CONFIG_ZONE_DEVICE and it being the last section in a
> >> sequence of sections, not just a tiny hole)
> >
> > I like the simplicity of it... I worry that the capacity loss
> > regression is easy to notice by looking at the output of free(1) from
> > one kernel to the next and someone screams.
>
> Well, you can always make it configurable and then simply fail to add
> PMEM later if impossible (trying to sub-section hot-add into early
> section). It's in the hands of the sysadmin then ("max out system ram"
> vs. "support any PMEM device that could eventually be there at
> runtime"). Distros would go for the second.
>
> I agree that it's not optimal, but sometimes simplicity has to win.
Here's where we left it last time, open to pfn_to_online_page hacks...
https://lore.kernel.org/linux-mm/CAPcyv4ivq=EPUePXiX2ErcVyF7+dV9Yv215Oue7X_Y2X_Jfw8Q@mail.gmail.com
I don't think a slow-path flag in the mem-section is too onerous, but
I'll withhold judgement until I have the patch I'm thinking of
in-hand. Let me give it a shot, you can always nack the final result.
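Roughly what I have in mind - an untested sketch, and
section_taints_zone_device() plus the flag behind it do not exist yet, the
names are made up:

	struct page *pfn_to_online_page(unsigned long pfn)
	{
		unsigned long nr = pfn_to_section_nr(pfn);

		if (nr >= NR_MEM_SECTIONS || !online_section_nr(nr) ||
		    !pfn_valid_within(pfn))
			return NULL;
		if (section_taints_zone_device(nr)) {
			/* slow path: section also hosts ZONE_DEVICE pfns */
			struct dev_pagemap *pgmap = get_dev_pagemap(pfn, NULL);

			put_dev_pagemap(pgmap);
			if (pgmap)
				return NULL;
		}
		return pfn_to_page(pfn);
	}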
On 05.01.21 10:56, Dan Williams wrote:
> On Tue, Jan 5, 2021 at 1:37 AM David Hildenbrand <[email protected]> wrote:
>>
>>>> Yeah, obviously the first one. Being able to add+use PMEM is more
>>>> important than using each and every last MB of main memory.
>>>>
>>>> I wonder if we can just stop adding any system RAM like
>>>>
>>>> [ Memory Section ]
>>>> [ RAM ] [ Hole ]
>>>>
>>>> When there could be the possibility that the hole might actually be
>>>> PMEM. (e.g., with CONFIG_ZONE_DEVICE and it being the last section in a
>>>> sequence of sections, not just a tiny hole)
>>>
>>> I like the simplicity of it... I worry that the capacity loss
>>> regression is easy to notice by looking at the output of free(1) from
>>> one kernel to the next and someone screams.
>>
>> Well, you can always make it configurable and then simply fail to add
>> PMEM later if impossible (trying to sub-section hot-add into early
>> section). It's in the hands of the sysadmin then ("max out system ram"
>> vs. "support any PMEM device that could eventually be there at
>> runtime"). Distros would go for the second.
>>
>> I agree that it's not optimal, but sometimes simplicity has to win.
>
> Here's where we left it last time, open to pfn_to_online_page hacks...
>
> https://lore.kernel.org/linux-mm/CAPcyv4ivq=EPUePXiX2ErcVyF7+dV9Yv215Oue7X_Y2X_Jfw8Q@mail.gmail.com
>
Yeah, I recall. That's why I favor simple approaches right now - less
brain power to waste ;)
> I don't think a slow-path flag in the mem-section is too onerous, but
> I'll withhold judgement until I have the patch I'm thinking of
> in-hand. Let me give it a shot, you can always nack the final result.
Sure!
--
Thanks,
David / dhildenb
On Tue, Jan 5, 2021 at 12:42 AM Michal Hocko <[email protected]> wrote:
>
> On Tue 05-01-21 00:27:34, Dan Williams wrote:
> > On Tue, Jan 5, 2021 at 12:17 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Tue 05-01-21 09:01:00, Michal Hocko wrote:
> > > > On Mon 04-01-21 16:44:52, David Hildenbrand wrote:
> > > > > On 04.01.21 16:43, David Hildenbrand wrote:
> > > > > > On 04.01.21 16:33, Michal Hocko wrote:
> > > > > >> On Mon 04-01-21 16:15:23, David Hildenbrand wrote:
> > > > > >>> On 04.01.21 16:10, Michal Hocko wrote:
> > > > > >> [...]
> > > > > >>> Do the physical addresses you see fall into the same section as boot
> > > > > >>> memory? Or what's around these addresses?
> > > > > >>
> > > > > >> Yes, I am getting garbage for the first struct page belonging to the
> > > > > >> pmem section [1]
> > > > > >> [ 0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x603fffffff]
> > > > > >> [ 0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 0x6060000000-0x11d5fffffff] non-volatile
> > > > > >>
> > > > > >> The pfn without the initialized struct page is 0x6060000. This is the
> > > > > >> first pfn in a section.
> > > > > >
> > > > > > Okay, so we're not dealing with the "early section" mess I described,
> > > > > > different story.
> > > > > >
> > > > > > Due to [1], is_mem_section_removable() called
> > > > > > pfn_to_page(PHYS_PFN(0x6060000)). page_zone(page) made it crash, as not
> > > > > > initialized.
> > > > > >
> > > > > > Let's assume this is indeed a reserved pfn in the altmap. What's the
> > > > > > actual address of the memmap?
> > > > > >
> > > > > > I do wonder what hosts pfn_to_page(PHYS_PFN(0x6060000)) - is it actually
> > > > > > part of the actual altmap (i.e. > 0x6060000) or maybe even self-hosted?
> > > > > >
> > > > > > If it's not self-hosted, initializing the relevant memmaps should work
> > > > > > just fine I guess. Otherwise things get more complicated.
> > > > >
> > > > > Oh, I forgot: pfn_to_online_page() should at least in your example make
> > > > > sure other pfn walkers are safe. It was just an issue of
> > > > > is_mem_section_removable().
> > > >
> > > > Hmm, I suspect you are right. I haven't put this together, thanks! The memory
> > > > section is indeed marked offline so pfn_to_online_page would bail
> > > > out:
> > > > crash> p (0x6060000>>15)
> > > > $3 = 3084
> > > > crash> p mem_section[3084/128][3084 & 127]
> > > > $4 = {
> > > > section_mem_map = 18446736128020054019,
> > > > usage = 0xffff902dcf956680,
> > > > page_ext = 0x0,
> > > > pad = 0
> > > > }
> > > > crash> p 18446736128020054019 & (1UL<<2)
> > > > $5 = 0
> > > >
> > > > That makes it considerably less of a problem than I thought!
> > >
> > > Forgot to add that those who are running kernels without 53cdc1cb29e8
> > > ("drivers/base/memory.c: indicate all memory blocks as removable") for
> > > some reason can fix the crash by the following simple patch.
> > >
> > > Index: linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
> > > ===================================================================
> > > --- linux-5.3-users_mhocko_SLE15-SP2_for-next.orig/drivers/base/memory.c
> > > +++ linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
> > > @@ -152,9 +152,14 @@ static ssize_t removable_show(struct dev
> > > goto out;
> > >
> > > for (i = 0; i < sections_per_block; i++) {
> > > - if (!present_section_nr(mem->start_section_nr + i))
> > > + unsigned long nr = mem->start_section_nr + i;
> > > + if (!present_section_nr(nr))
> > > continue;
> > > - pfn = section_nr_to_pfn(mem->start_section_nr + i);
> > > + if (!online_section_nr()) {
> >
> > I assume that's online_section_nr(nr) in the version that compiles?
>
> Yup.
>
> > This makes sense because the memory block size is larger than the
> > section size. I suspect you have 1GB memory block size on this system,
> > but since the System RAM and PMEM collide at a 512MB alignment in a
> > memory block you end up walking the back end of the last 512MB of the
> > System RAM memory block and run into the offline PMEM section.
>
> Sections are 128MB and memory blocks are 2GB on this system.
>
> > So, I don't think it's pfn_to_online_page that necessarily needs to
> > know how to disambiguate each page, it's things that walk sections and
> > memory blocks and expect them to be consistent over the span.
>
> Well, memory hotplug code is hard-wired to the sparse memory model, so in
> this particular case asking about the section is ok. But pfn walkers
> shouldn't really care and should only rely on pfn_to_online_page. That will
> do the right thing here. So we are good as long as the section is marked
> properly. But this would become a problem as soon as the uninitialized
> pages were sharing the same memory section, as David pointed out.
> pfn_to_online_page would then return something containing garbage. So we
> should still think of a way to either initialize all those pages or make
> sure pfn_to_online_page recognizes them. The former is preferred IMHO.
The former would not have saved the crash in this case because
pfn_to_online_page() is not used in v5.3:removable_show() that I can
see, nor some of the other paths that might walk pfns and do the wrong
thing with ZONE_DEVICE.
However, I do think pfn_to_online_page() should be reliable, and I
prefer to just brute force add a section flag to indicate whether the
section might be ZONE_DEVICE polluted and fallback to the
get_dev_pagemap() slow-path in that case.
...but it would still require hunting to find the places where
pfn_to_online_page() is missing for assumptions like this crash which
assumed memblock-online + section-present == section-online.