This patch refers to the older thread:
https://marc.info/?t=153658306400001&r=1&w=2
As suggested by Michal Hocko, instead of adjusting memory_hotplug paths,
I have changed memmap_init_zone() to initialize struct pages beyond the
zone end (if zone end is not aligned with the section boundary).
Mikhail Zaslonko (1):
mm, memory_hotplug: Initialize struct pages for the full memory
section
mm/page_alloc.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
--
2.16.4
If memory end is not aligned with the sparse memory section boundary, the
mapping of such a section is only partly initialized. This may lead to
VM_BUG_ON due to uninitialized struct page access from
is_mem_section_removable() or test_pages_in_a_zone() function triggered by
memory_hotplug sysfs handlers:
Here are the panic examples:
CONFIG_DEBUG_VM=y
CONFIG_DEBUG_VM_PGFLAGS=y
kernel parameter mem=2050M
--------------------------
page:000003d082008000 is uninitialized and poisoned
page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
Call Trace:
([<0000000000385b26>] test_pages_in_a_zone+0xde/0x160)
[<00000000008f15c4>] show_valid_zones+0x5c/0x190
[<00000000008cf9c4>] dev_attr_show+0x34/0x70
[<0000000000463ad0>] sysfs_kf_seq_show+0xc8/0x148
[<00000000003e4194>] seq_read+0x204/0x480
[<00000000003b53ea>] __vfs_read+0x32/0x178
[<00000000003b55b2>] vfs_read+0x82/0x138
[<00000000003b5be2>] ksys_read+0x5a/0xb0
[<0000000000b86ba0>] system_call+0xdc/0x2d8
Last Breaking-Event-Address:
[<0000000000385b26>] test_pages_in_a_zone+0xde/0x160
Kernel panic - not syncing: Fatal exception: panic_on_oops
kernel parameter mem=3075M
--------------------------
page:000003d08300c000 is uninitialized and poisoned
page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
Call Trace:
([<000000000038596c>] is_mem_section_removable+0xb4/0x190)
[<00000000008f12fa>] show_mem_removable+0x9a/0xd8
[<00000000008cf9c4>] dev_attr_show+0x34/0x70
[<0000000000463ad0>] sysfs_kf_seq_show+0xc8/0x148
[<00000000003e4194>] seq_read+0x204/0x480
[<00000000003b53ea>] __vfs_read+0x32/0x178
[<00000000003b55b2>] vfs_read+0x82/0x138
[<00000000003b5be2>] ksys_read+0x5a/0xb0
[<0000000000b86ba0>] system_call+0xdc/0x2d8
Last Breaking-Event-Address:
[<000000000038596c>] is_mem_section_removable+0xb4/0x190
Kernel panic - not syncing: Fatal exception: panic_on_oops
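For illustration (not part of the patch): assuming 4KB pages and 256MB sparse
sections (SECTION_SIZE_BITS = 28, as on s390), the mem=2050M case works out as
in this standalone userspace sketch:

#include <stdio.h>

int main(void)
{
        unsigned long pages_per_section = (256UL << 20) / 4096;  /* 65536 */
        unsigned long end_pfn = (2050UL << 20) / 4096;            /* 524800 */
        unsigned long section_start = end_pfn & ~(pages_per_section - 1);
        unsigned long section_end = section_start + pages_per_section;

        /*
         * The whole last section [2048M, 2304M) is present, so pfn_valid()
         * succeeds for all of it, but memmap_init_zone() stops at end_pfn
         * and leaves the struct pages of the tail poisoned.
         */
        printf("initialized:   pfn %lu .. %lu\n", section_start, end_pfn - 1);
        printf("left poisoned: pfn %lu .. %lu\n", end_pfn, section_end - 1);
        return 0;
}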
Fix the problem by initializing the last memory section of each zone
in memmap_init_zone() till the very end, even if it goes beyond the zone
end.
Signed-off-by: Mikhail Zaslonko <[email protected]>
Reviewed-by: Gerald Schaefer <[email protected]>
Cc: <[email protected]>
---
mm/page_alloc.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2ec9cc407216..e2afdb2dc2c5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5542,6 +5542,18 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
cond_resched();
}
}
+#ifdef CONFIG_SPARSEMEM
+ /*
+ * If the zone does not span the rest of the section then
+ * we should at least initialize those pages. Otherwise we
+ * could blow up on a poisoned page in some paths which depend
+ * on full sections being initialized (e.g. memory hotplug).
+ */
+ while (end_pfn % PAGES_PER_SECTION) {
+ __init_single_page(pfn_to_page(end_pfn), end_pfn, zone, nid);
+ end_pfn++;
+ }
+#endif
}
#ifdef CONFIG_ZONE_DEVICE
--
2.16.4
On Wed, Dec 12, 2018 at 06:27:12PM +0100, Mikhail Zaslonko wrote:
>If memory end is not aligned with the sparse memory section boundary, the
>mapping of such a section is only partly initialized. This may lead to
>VM_BUG_ON due to uninitialized struct page access from
>is_mem_section_removable() or test_pages_in_a_zone() function triggered by
>memory_hotplug sysfs handlers:
>
>[...]
>
>Fix the problem by initializing the last memory section of each zone
>in memmap_init_zone() till the very end, even if it goes beyond the zone
>end.
>
>Signed-off-by: Mikhail Zaslonko <[email protected]>
>Reviewed-by: Gerald Schaefer <[email protected]>
>Cc: <[email protected]>
>---
> mm/page_alloc.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
>diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>index 2ec9cc407216..e2afdb2dc2c5 100644
>--- a/mm/page_alloc.c
>+++ b/mm/page_alloc.c
>@@ -5542,6 +5542,18 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> cond_resched();
> }
> }
>+#ifdef CONFIG_SPARSEMEM
>+ /*
>+ * If the zone does not span the rest of the section then
>+ * we should at least initialize those pages. Otherwise we
>+ * could blow up on a poisoned page in some paths which depend
>+ * on full sections being initialized (e.g. memory hotplug).
>+ */
>+ while (end_pfn % PAGES_PER_SECTION) {
>+ __init_single_page(pfn_to_page(end_pfn), end_pfn, zone, nid);
>+ end_pfn++;
>+ }
>+#endif
> }
What will happen if end_pfn is PAGES_PER_SECTION aligned, but the section
contains an invalid pfn? For example, a page that is skipped in the for loop
of memmap_init_zone()?
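To be concrete, the boot-time branch of that loop looks roughly like this
(paraphrased; recent kernels have a couple more checks there):

        for (pfn = start_pfn; pfn < end_pfn; pfn++) {
                if (context == MEMMAP_EARLY) {
                        /* holes in the boot-time mem_map[] are skipped here */
                        if (!early_pfn_valid(pfn))
                                continue;
                        if (!early_pfn_in_nid(pfn, nid))
                                continue;
                }
                __init_single_page(pfn_to_page(pfn), pfn, zone, nid);
                /* migratetype setup, cond_resched(), ... */
        }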
>
> #ifdef CONFIG_ZONE_DEVICE
>--
>2.16.4
--
Wei Yang
Help you, Help me
On Wed 12-12-18 18:27:12, Mikhail Zaslonko wrote:
> If memory end is not aligned with the sparse memory section boundary, the
> mapping of such a section is only partly initialized. This may lead to
> VM_BUG_ON due to uninitialized struct page access from
> is_mem_section_removable() or test_pages_in_a_zone() function triggered by
> memory_hotplug sysfs handlers:
>
> [...]
>
> Fix the problem by initializing the last memory section of each zone
> in memmap_init_zone() till the very end, even if it goes beyond the zone
> end.
>
> Signed-off-by: Mikhail Zaslonko <[email protected]>
> Reviewed-by: Gerald Schaefer <[email protected]>
> Cc: <[email protected]>
This has always been a problem AFAIU. It just went unnoticed because we
used to zero memmaps during allocation before f7f99100d8d9 ("mm: stop
zeroing memory during allocation in vmemmap") and so the above test
would simply skip these ranges as belonging to zone 0 or provide
garbage.
So I guess we mostly care about post-f7f99100d8d9 kernels and therefore
Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
Other than that looks good to me.
Acked-by: Michal Hocko <[email protected]>
Thanks!
> ---
> mm/page_alloc.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2ec9cc407216..e2afdb2dc2c5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5542,6 +5542,18 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> cond_resched();
> }
> }
> +#ifdef CONFIG_SPARSEMEM
> + /*
> + * If the zone does not span the rest of the section then
> + * we should at least initialize those pages. Otherwise we
> + * could blow up on a poisoned page in some paths which depend
> + * on full sections being initialized (e.g. memory hotplug).
> + */
> + while (end_pfn % PAGES_PER_SECTION) {
> + __init_single_page(pfn_to_page(end_pfn), end_pfn, zone, nid);
> + end_pfn++;
> + }
> +#endif
> }
>
> #ifdef CONFIG_ZONE_DEVICE
> --
> 2.16.4
--
Michal Hocko
SUSE Labs
On Thu, Dec 13, 2018 at 01:37:16PM +0100, Zaslonko Mikhail wrote:
>On 13.12.2018 04:46, Wei Yang wrote:
>> On Wed, Dec 12, 2018 at 06:27:12PM +0100, Mikhail Zaslonko wrote:
>>> If memory end is not aligned with the sparse memory section boundary, the
>>> mapping of such a section is only partly initialized. This may lead to
>>> VM_BUG_ON due to uninitialized struct page access from
>>> is_mem_section_removable() or test_pages_in_a_zone() function triggered by
>>> memory_hotplug sysfs handlers:
>>>
>>> [...]
>>>
>>> Fix the problem by initializing the last memory section of each zone
>>> in memmap_init_zone() till the very end, even if it goes beyond the zone
>>> end.
>>>
>>> Signed-off-by: Mikhail Zaslonko <[email protected]>
>>> Reviewed-by: Gerald Schaefer <[email protected]>
>>> Cc: <[email protected]>
>>> ---
>>> mm/page_alloc.c | 12 ++++++++++++
>>> 1 file changed, 12 insertions(+)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 2ec9cc407216..e2afdb2dc2c5 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -5542,6 +5542,18 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>>> cond_resched();
>>> }
>>> }
>>> +#ifdef CONFIG_SPARSEMEM
>>> + /*
>>> + * If the zone does not span the rest of the section then
>>> + * we should at least initialize those pages. Otherwise we
>>> + * could blow up on a poisoned page in some paths which depend
>>> + * on full sections being initialized (e.g. memory hotplug).
>>> + */
>>> + while (end_pfn % PAGES_PER_SECTION) {
>>> + __init_single_page(pfn_to_page(end_pfn), end_pfn, zone, nid);
>>> + end_pfn++;
>>> + }
>>> +#endif
>>> }
>>
>> What will happen if the end_pfn is PAGES_PER_SECTION aligned, but there
>> is invalid pfn? For example, the page is skipped in the for loop of
>> memmap_init_zone()?
>>
>
>If the end_pfn is PAGES_PER_SECTION aligned, we do not do any extra
>initialization.
>If the page is skipped in the loop, it will remain uninitialized,
>whatever the reason for skipping it (e.g. a memory hole). The behavior has
>not been changed here.
>
I may not have described my question clearly.
If the page is skipped, then it stays uninitialized. Would such a page
then trigger the same call trace when the removable sysfs file is read?
>>>
>>> #ifdef CONFIG_ZONE_DEVICE
>>> --
>>> 2.16.4
>>
--
Wei Yang
Help you, Help me
[Your From address seems to have a typo (linux.bm.com) - fixed]
On Fri 14-12-18 10:33:55, Zaslonko Mikhail wrote:
[...]
> Yes, it might still trigger PF_POISONED_CHECK if the first page
> of the pageblock is left uninitialized (poisoned).
> But in order to cover these exceptional cases we would need to
> adjust memory_hotplug sysfs handler functions with similar
> checks (as in the for loop of memmap_init_zone()). And I guess
> that is what we were trying to avoid (adding special cases to
> memory_hotplug paths).
is_mem_section_removable should test pfn_valid_within at least.
But that would require some care because next_active_pageblock expects
aligned pages. Ble, this code is just horrible. I would just remove it
altogether. I strongly suspect that nobody is using it for anything
reasonable anyway. The only reliable way to check whether a block is
removable is to remove it. Everything else is just racy.
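Something like the following, completely untested and just to illustrate where
such a check would have to go (the loop is paraphrased from
mm/memory_hotplug.c):

        bool is_mem_section_removable(unsigned long start_pfn, unsigned long nr_pages)
        {
                struct page *page = pfn_to_page(start_pfn);
                struct page *end_page = page + nr_pages;

                for (; page < end_page; page = next_active_pageblock(page)) {
                        /* skip memmap that was never initialized (holes) */
                        if (!pfn_valid_within(page_to_pfn(page)))
                                continue;
                        if (!is_pageblock_removable_nolock(page))
                                return false;
                        cond_resched();
                }
                return true;
        }

and next_active_pageblock() still expects a pageblock-aligned page, which is
the part that needs care.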
--
Michal Hocko
SUSE Labs
On 12.12.18 18:27, Mikhail Zaslonko wrote:
> If memory end is not aligned with the sparse memory section boundary, the
> mapping of such a section is only partly initialized. This may lead to
> VM_BUG_ON due to uninitialized struct page access from
> is_mem_section_removable() or test_pages_in_a_zone() function triggered by
> memory_hotplug sysfs handlers:
>
> [...]
>
> Fix the problem by initializing the last memory section of each zone
> in memmap_init_zone() till the very end, even if it goes beyond the zone
> end.
>
> Signed-off-by: Mikhail Zaslonko <[email protected]>
> Reviewed-by: Gerald Schaefer <[email protected]>
> Cc: <[email protected]>
> ---
> mm/page_alloc.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2ec9cc407216..e2afdb2dc2c5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5542,6 +5542,18 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> cond_resched();
> }
> }
> +#ifdef CONFIG_SPARSEMEM
> + /*
> + * If the zone does not span the rest of the section then
> + * we should at least initialize those pages. Otherwise we
> + * could blow up on a poisoned page in some paths which depend
> + * on full sections being initialized (e.g. memory hotplug).
> + */
> + while (end_pfn % PAGES_PER_SECTION) {
> + __init_single_page(pfn_to_page(end_pfn), end_pfn, zone, nid);
> + end_pfn++;
This page will not be marked as PG_reserved - although it is a physical
memory gap. Do we care?
> + }
> +#endif
> }
>
> #ifdef CONFIG_ZONE_DEVICE
>
--
Thanks,
David / dhildenb
On 14.12.18 16:22, David Hildenbrand wrote:
> On 12.12.18 18:27, Mikhail Zaslonko wrote:
>> If memory end is not aligned with the sparse memory section boundary, the
>> mapping of such a section is only partly initialized. This may lead to
>> VM_BUG_ON due to uninitialized struct page access from
>> is_mem_section_removable() or test_pages_in_a_zone() function triggered by
>> memory_hotplug sysfs handlers:
>>
>> [...]
>>
>> Fix the problem by initializing the last memory section of each zone
>> in memmap_init_zone() till the very end, even if it goes beyond the zone
>> end.
>>
>> Signed-off-by: Mikhail Zaslonko <[email protected]>
>> Reviewed-by: Gerald Schaefer <[email protected]>
>> Cc: <[email protected]>
>> ---
>> mm/page_alloc.c | 12 ++++++++++++
>> 1 file changed, 12 insertions(+)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 2ec9cc407216..e2afdb2dc2c5 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -5542,6 +5542,18 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>> cond_resched();
>> }
>> }
>> +#ifdef CONFIG_SPARSEMEM
>> + /*
>> + * If the zone does not span the rest of the section then
>> + * we should at least initialize those pages. Otherwise we
>> + * could blow up on a poisoned page in some paths which depend
>> + * on full sections being initialized (e.g. memory hotplug).
>> + */
>> + while (end_pfn % PAGES_PER_SECTION) {
>> + __init_single_page(pfn_to_page(end_pfn), end_pfn, zone, nid);
>> + end_pfn++;
>
> This page will not be marked as PG_reserved - although it is a physical
> memory gap. Do we care?
>
Hm, or do we even have any idea what this is (e.g. could it also be
something not a gap)?
For physical memory gaps within a section, architectures usually exclude
that memory from getting passed to e.g. the page allocator by
memblock_reserve().
Before handing all free pages to the page allocator, all such reserved
memblocks will be marked reserved.
But this here seems to be different. We don't have a previous
memblock_reserve(), because otherwise these pages would have properly
been initialized already when marking them reserved.
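The marking I mean happens in reserve_bootmem_region(), which is called for
every reserved memblock region before the free pages are released to the buddy
allocator; roughly (paraphrased from mm/page_alloc.c):

        void __meminit reserve_bootmem_region(phys_addr_t start, phys_addr_t end)
        {
                unsigned long start_pfn = PFN_DOWN(start);
                unsigned long end_pfn = PFN_UP(end);

                for (; start_pfn < end_pfn; start_pfn++) {
                        if (pfn_valid(start_pfn)) {
                                struct page *page = pfn_to_page(start_pfn);

                                init_reserved_page(start_pfn);
                                /* Avoid false-positive PageTail() */
                                INIT_LIST_HEAD(&page->lru);
                                SetPageReserved(page);
                        }
                }
        }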
--
Thanks,
David / dhildenb
On Fri, 14 Dec 2018 16:49:14 +0100
David Hildenbrand <[email protected]> wrote:
> On 14.12.18 16:22, David Hildenbrand wrote:
> > On 12.12.18 18:27, Mikhail Zaslonko wrote:
> >> If memory end is not aligned with the sparse memory section boundary, the
> >> mapping of such a section is only partly initialized. This may lead to
> >> VM_BUG_ON due to uninitialized struct page access from
> >> is_mem_section_removable() or test_pages_in_a_zone() function triggered by
> >> memory_hotplug sysfs handlers:
> >>
> >> [...]
> >>
> >> Fix the problem by initializing the last memory section of each zone
> >> in memmap_init_zone() till the very end, even if it goes beyond the zone
> >> end.
> >>
> >> Signed-off-by: Mikhail Zaslonko <[email protected]>
> >> Reviewed-by: Gerald Schaefer <[email protected]>
> >> Cc: <[email protected]>
> >> ---
> >> mm/page_alloc.c | 12 ++++++++++++
> >> 1 file changed, 12 insertions(+)
> >>
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index 2ec9cc407216..e2afdb2dc2c5 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -5542,6 +5542,18 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> >> cond_resched();
> >> }
> >> }
> >> +#ifdef CONFIG_SPARSEMEM
> >> + /*
> >> + * If the zone does not span the rest of the section then
> >> + * we should at least initialize those pages. Otherwise we
> >> + * could blow up on a poisoned page in some paths which depend
> >> + * on full sections being initialized (e.g. memory hotplug).
> >> + */
> >> + while (end_pfn % PAGES_PER_SECTION) {
> >> + __init_single_page(pfn_to_page(end_pfn), end_pfn, zone, nid);
> >> + end_pfn++;
> >
> > This page will not be marked as PG_reserved - although it is a physical
> > memory gap. Do we care?
> >
>
> Hm, or do we even have any idea what this is (e.g. could it also be
> something not a gap)?
In the "mem=" restriction scenario it would be a gap, and probably fall
into the PG_reserved categorization from your recent patch:
* - Pages falling into physical memory gaps - not IORESOURCE_SYSRAM. Trying
* to read/write these pages might end badly. Don't touch!
Not sure if it could be something else. In theory, if it is possible to have
a scenario where memory zones are not section-aligned, then this
end_pfn % PAGES_PER_SECTION part could be part of another zone. But then it
should not matter if the pages get pre-initialized here, with or w/o
PG_reserved, because they should later be properly initialized in their zone.
So marking them as PG_reserved sounds right, especially in the light of your
current PG_reserved clean-up.
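In the loop added by this patch that would just be one extra line, e.g.
(untested):

        while (end_pfn % PAGES_PER_SECTION) {
                __init_single_page(pfn_to_page(end_pfn), end_pfn, zone, nid);
                __SetPageReserved(pfn_to_page(end_pfn));
                end_pfn++;
        }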
>
> For physical memory gaps within a section, architectures usually exclude
> that memory from getting passed to e.g. the page allocator by
> memblock_reserve().
>
> Before handing all free pages to the page allocator, all such reserved
> memblocks will be marked reserved.
>
> But this here seems to be different. We don't have a previous
> memblock_reserve(), because otherwise these pages would have properly
> been initialized already when marking them reserved.
Not sure how memblock_reserve() and struct page initialization are
related, but at least on s390 there is a memblock_reserve() on the range
in question in setup_arch() -> reserve_memory_end(). However, in this
"mem=" scenario, the range is also removed later with memblock_remove()
in setup_memory_end(), because it is beyond memory_end.
On Fri, Dec 14, 2018 at 11:19:59AM +0100, Michal Hocko wrote:
>[Your From address seems to have a typo (linux.bm.com) - fixed]
>
>On Fri 14-12-18 10:33:55, Zaslonko Mikhail wrote:
>[...]
>> Yes, it might still trigger PF_POISONED_CHECK if the first page
>> of the pageblock is left uninitialized (poisoned).
>> But in order to cover these exceptional cases we would need to
>> adjust memory_hotplug sysfs handler functions with similar
>> checks (as in the for loop of memmap_init_zone()). And I guess
>> that is what we were trying to avoid (adding special cases to
>> memory_hotplug paths).
>
>is_mem_section_removable should test pfn_valid_within at least.
>But that would require some care because next_active_pageblock expects
>aligned pages. Ble, this code is just horrible. I would just remove it
>altogether. I strongly suspect that nobody is using it for anything
>reasonable anyway. The only reliable way to check whether a block is
>removable is to remove it. Everything else is just racy.
>
Sounds reasonable.
The result returned from the removable sysfs file is transient anyway. If no
user relies on it, removing it is the better way.
>--
>Michal Hocko
>SUSE Labs
--
Wei Yang
Help you, Help me
On 14.12.18 20:23, Gerald Schaefer wrote:
> On Fri, 14 Dec 2018 16:49:14 +0100
> David Hildenbrand <[email protected]> wrote:
>
>> On 14.12.18 16:22, David Hildenbrand wrote:
>>> On 12.12.18 18:27, Mikhail Zaslonko wrote:
>>>> If memory end is not aligned with the sparse memory section boundary, the
>>>> mapping of such a section is only partly initialized. This may lead to
>>>> VM_BUG_ON due to uninitialized struct page access from
>>>> is_mem_section_removable() or test_pages_in_a_zone() function triggered by
>>>> memory_hotplug sysfs handlers:
>>>>
>>>> [...]
>>>>
>>>> Fix the problem by initializing the last memory section of each zone
>>>> in memmap_init_zone() till the very end, even if it goes beyond the zone
>>>> end.
>>>>
>>>> Signed-off-by: Mikhail Zaslonko <[email protected]>
>>>> Reviewed-by: Gerald Schaefer <[email protected]>
>>>> Cc: <[email protected]>
>>>> ---
>>>> mm/page_alloc.c | 12 ++++++++++++
>>>> 1 file changed, 12 insertions(+)
>>>>
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index 2ec9cc407216..e2afdb2dc2c5 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -5542,6 +5542,18 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>>>> cond_resched();
>>>> }
>>>> }
>>>> +#ifdef CONFIG_SPARSEMEM
>>>> + /*
>>>> + * If the zone does not span the rest of the section then
>>>> + * we should at least initialize those pages. Otherwise we
>>>> + * could blow up on a poisoned page in some paths which depend
>>>> + * on full sections being initialized (e.g. memory hotplug).
>>>> + */
>>>> + while (end_pfn % PAGES_PER_SECTION) {
>>>> + __init_single_page(pfn_to_page(end_pfn), end_pfn, zone, nid);
>>>> + end_pfn++;
>>>
>>> This page will not be marked as PG_reserved - although it is a physical
>>> memory gap. Do we care?
>>>
>>
>> Hm, or do we even have any idea what this is (e.g. could it also be
>> something not a gap)?
>
> In the "mem=" restriction scenario it would be a gap, and probably fall
> into the PG_reserved categorization from your recent patch:
> * - Pages falling into physical memory gaps - not IORESOURCE_SYSRAM. Trying
> * to read/write these pages might end badly. Don't touch!
>
> Not sure if it could be something else. In theory, if it is possible to have
> a scenario where memory zones are not section-aligned, then this
> end_pfn % PAGES_PER_SECTION part could be part of another zone. But then it
> should not matter if the pages get pre-initialized here, with or w/o
> PG_reseved, because they should later be properly initialized in their zone.
>
> So marking them as PG_reserved sounds right, especially in the light of your
> current PG_reserved clean-up.
>
>>
>> For physical memory gaps within a section, architectures usually exclude
>> that memory from getting passed to e.g. the page allocator by
>> memblock_reserve().
>>
>> Before handing all free pages to the page allocator, all such reserved
>> memblocks will be marked reserved.
>>
>> But this here seems to be different. We don't have a previous
>> memblock_reserve(), because otherwise these pages would have properly
>> been initialized already when marking them reserved.
>
> Not sure how memblock_reserve() and struct page initialization are
> related, but at least on s390 there is a memblock_reserve() on the range
> in question in setup_arch() -> reserve_memory_end(). However, in this
> "mem=" scenario, the range is also removed later with memblock_remove()
> in setup_memory_end(), because it is beyond memory_end.
>
I am wondering if we should fix this on the memblock level instead.
Something like, before handing memory over to the page allocator, add
memory as reserved up to the last section boundary. Or even when setting
the physical memory limit (mem= scenario).
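Completely untested, but I am thinking of something along these lines (whether
a plain memblock_reserve() would be enough, or the range would also have to be
added as memory, is exactly the open question):

        phys_addr_t end = memblock_end_of_DRAM();
        phys_addr_t aligned_end = ALIGN(end, 1UL << SECTION_SIZE_BITS);

        if (aligned_end != end) {
                /* back the partial last section and keep it away from the buddy */
                memblock_add(end, aligned_end - end);
                memblock_reserve(end, aligned_end - end);
        }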
--
Thanks,
David / dhildenb
On Mon 17-12-18 10:38:32, David Hildenbrand wrote:
[...]
> I am wondering if we should fix this on the memblock level instead than.
> Something like, before handing memory over to the page allocator, add
> memory as reserved up to the last section boundary. Or even when setting
> the physical memory limit (mem= scenario).
Memory initialization is spread over several places and that makes it
really hard to grasp and maintain. I do not really see why we should
make memblock even more special. We do initialize the section worth of
memory here so it sounds like a proper place to quirk for incomplete
sections.
--
Michal Hocko
SUSE Labs
On 17.12.18 13:28, Michal Hocko wrote:
> On Mon 17-12-18 10:38:32, David Hildenbrand wrote:
> [...]
>> I am wondering if we should fix this on the memblock level instead than.
>> Something like, before handing memory over to the page allocator, add
>> memory as reserved up to the last section boundary. Or even when setting
>> the physical memory limit (mem= scenario).
>
> Memory initialization is spread over several places and that makes it
> really hard to grasp and maintain. I do not really see why we should
> make memblock even more special. We do intialize the section worth of
> memory here so it sounds like a proper place to quirk for incomplete
> sections.
>
True as well. The reason I am asking is that memblock usually takes
care of physical memory holes.
--
Thanks,
David / dhildenb
On Mon 17-12-18 14:29:04, David Hildenbrand wrote:
> On 17.12.18 13:28, Michal Hocko wrote:
> > On Mon 17-12-18 10:38:32, David Hildenbrand wrote:
> > [...]
> >> I am wondering if we should fix this on the memblock level instead than.
> >> Something like, before handing memory over to the page allocator, add
> >> memory as reserved up to the last section boundary. Or even when setting
> >> the physical memory limit (mem= scenario).
> >
> > Memory initialization is spread over several places and that makes it
> > really hard to grasp and maintain. I do not really see why we should
> > make memblock even more special. We do intialize the section worth of
> > memory here so it sounds like a proper place to quirk for incomplete
> > sections.
> >
>
> True as well. The reason I am asking is, that memblock usually takes
> care of physical memory holes.
Yes and no. It only reflects existing memory ranges (so yes it skips
over holes) and then it provides an API that platform/arch code can
abuse to cut holes into existing ranges.
--
Michal Hocko
SUSE Labs