On ThunderX systems with certain memory configurations we see the
following BUG_ON():
kernel BUG at mm/page_alloc.c:1848!
This happens for some configs with 64k page size enabled. The BUG_ON()
checks whether the start and end page of a memmap range belong to the
same zone.
The BUG_ON() check fails if a memory zone contains NOMAP regions. In
this case the node information of those pages is not initialized. This
causes an inconsistency of the page links with wrong zone and node
information for those pages. NOMAP pages from node 1 still point to the
mem zone from node 0 and have the wrong nid assigned.
The reason for the mis-configuration is a change in pfn_valid() which
reports pages marked NOMAP as invalid:
68709f45385a arm64: only consider memblocks with NOMAP cleared for linear mapping
This causes pages marked as nomap to no longer be reassigned to the new
zone in memmap_init_zone() by calling __init_single_pfn().
Fix this by restoring the old behavior of pfn_valid(), which uses
memblock_is_memory(). Also change users of pfn_valid() in arm64 code
to use memblock_is_map_memory() where necessary. This only affects
code in ioremap.c. The code in mmu.c can still use the new version of
pfn_valid().
As a consequence, pfn_valid() cannot be used to check if a physical
page is RAM. It just checks if there is an underlying memmap with a
valid struct page. Moreover, for performance reasons the whole memmap
(with pageblock_nr_pages number of pages) has valid pfns (SPARSEMEM
config). The memory range is extended to fit the alignment of the
memmap. Thus, pfn_valid() may return true for pfns that do not map to
physical memory. Those pages are simply not reported to the mm, they
are neither marked reserved nor added to the list of free pages. Other
functions such as page_is_ram() or memblock_is_map_memory() must be
used to check for memory and whether the page can be mapped with the
linear mapping.
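For illustration only (this sketch is not part of the patch, and
phys_to_linear_if_ram() is a made-up name), a caller that wants to
access memory through the linear mapping would now combine both checks
roughly like this:

#include <linux/memblock.h>
#include <linux/mm.h>

/* sketch: return the linear-map alias of phys, or NULL if it is not
 * plain, mapped RAM */
static void *phys_to_linear_if_ram(phys_addr_t phys)
{
	/* a struct page may exist even for holes or NOMAP ranges ... */
	if (!pfn_valid(__phys_to_pfn(phys)))
		return NULL;

	/* ... so only use the linear alias for mapped memory */
	if (!memblock_is_map_memory(phys))
		return NULL;

	return (void *)__phys_to_virt(phys);
}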
Since NOMAP memory ranges may need to be mapped with different memory
attributes (e.g. read-only or non-caching) we cannot use the linear
mapping here. The use of memblock_is_memory() in pfn_valid() does not
break this behaviour. Since commit:
e7cd190385d1 arm64: mark reserved memblock regions explicitly in iomem
NOMAP memory resources are no longer marked as system RAM
(IORESOURCE_SYSTEM_RAM). Now page_is_ram() and region_intersects() (see
memremap()) do not detect NOMAP memory as system RAM, and NOMAP memory
is not added to the linear mapping as system RAM is.
v2:
* Added Ack
* updated description to reflect the discussion
Acked-by: Ard Biesheuvel <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
---
arch/arm64/mm/init.c | 2 +-
arch/arm64/mm/ioremap.c | 5 +++--
2 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 212c4d1e2f26..166911f4a2e6 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -147,7 +147,7 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)
#ifdef CONFIG_HAVE_ARCH_PFN_VALID
int pfn_valid(unsigned long pfn)
{
- return memblock_is_map_memory(pfn << PAGE_SHIFT);
+ return memblock_is_memory(pfn << PAGE_SHIFT);
}
EXPORT_SYMBOL(pfn_valid);
#endif
diff --git a/arch/arm64/mm/ioremap.c b/arch/arm64/mm/ioremap.c
index 01e88c8bcab0..c17c220b0c48 100644
--- a/arch/arm64/mm/ioremap.c
+++ b/arch/arm64/mm/ioremap.c
@@ -21,6 +21,7 @@
*/
#include <linux/export.h>
+#include <linux/memblock.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <linux/io.h>
@@ -55,7 +56,7 @@ static void __iomem *__ioremap_caller(phys_addr_t phys_addr, size_t size,
/*
* Don't allow RAM to be mapped.
*/
- if (WARN_ON(pfn_valid(__phys_to_pfn(phys_addr))))
+ if (WARN_ON(memblock_is_map_memory(phys_addr)))
return NULL;
area = get_vm_area_caller(size, VM_IOREMAP, caller);
@@ -96,7 +97,7 @@ EXPORT_SYMBOL(__iounmap);
void __iomem *ioremap_cache(phys_addr_t phys_addr, size_t size)
{
/* For normal memory we already have a cacheable mapping. */
- if (pfn_valid(__phys_to_pfn(phys_addr)))
+ if (memblock_is_map_memory(phys_addr))
return (void __iomem *)__phys_to_virt(phys_addr);
return __ioremap_caller(phys_addr, size, __pgprot(PROT_NORMAL),
--
2.1.4
On Wed, Nov 30, 2016 at 07:21:31PM +0100, Robert Richter wrote:
> On ThunderX systems with certain memory configurations we see the
> following BUG_ON():
>
> kernel BUG at mm/page_alloc.c:1848!
>
> This happens for some configs with 64k page size enabled. The BUG_ON()
> checks if start and end page of a memmap range belongs to the same
> zone.
>
> The BUG_ON() check fails if a memory zone contains NOMAP regions. In
> this case the node information of those pages is not initialized. This
> causes an inconsistency of the page links with wrong zone and node
> information for that pages. NOMAP pages from node 1 still point to the
> mem zone from node 0 and have the wrong nid assigned.
>
> The reason for the mis-configuration is a change in pfn_valid() which
> reports pages marked NOMAP as invalid:
>
> 68709f45385a arm64: only consider memblocks with NOMAP cleared for linear mapping
>
> This causes pages marked as nomap being no long reassigned to the new
> zone in memmap_init_zone() by calling __init_single_pfn().
>
> Fixing this by restoring the old behavior of pfn_valid() to use
> memblock_is_memory(). Also changing users of pfn_valid() in arm64 code
> to use memblock_is_map_memory() where necessary. This only affects
> code in ioremap.c. The code in mmu.c still can use the new version of
> pfn_valid().
>
> As a consequence, pfn_valid() can not be used to check if a physical
> page is RAM. It just checks if there is an underlying memmap with a
> valid struct page. Moreover, for performance reasons the whole memmap
> (with pageblock_nr_pages number of pages) has valid pfns (SPARSEMEM
> config). The memory range is extended to fit the alignment of the
> memmap. Thus, pfn_valid() may return true for pfns that do not map to
> physical memory. Those pages are simply not reported to the mm, they
> are not marked reserved nor added to the list of free pages. Other
> functions such a page_is_ram() or memblock_is_map_ memory() must be
> used to check for memory and if the page can be mapped with the linear
> mapping.
>
> Since NOMAP mem ranges may need to be mapped with different mem
> attributes (e.g. read-only or non-caching) we can not use linear
> mapping here. The use of memblock_is_memory() in pfn_valid() may not
> break this behaviour. Since commit:
>
> e7cd190385d1 arm64: mark reserved memblock regions explicitly in iomem
>
> NOMAP mem resources are no longer marked as system RAM (IORESOURCE_
> SYSTEM_RAM). Now page_is_ram() and region_intersects() (see
> memremap()) do not detect NOMAP mem as system ram and NOMAP mem is not
> added to the linear mapping as system RAM is.
>
> v2:
>
> * Added Ack
> * updated description to reflect the discussion
>
> Acked-by: Ard Biesheuvel <[email protected]>
> Signed-off-by: Robert Richter <[email protected]>
> ---
> arch/arm64/mm/init.c | 2 +-
> arch/arm64/mm/ioremap.c | 5 +++--
> 2 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 212c4d1e2f26..166911f4a2e6 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -147,7 +147,7 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)
> #ifdef CONFIG_HAVE_ARCH_PFN_VALID
> int pfn_valid(unsigned long pfn)
> {
> - return memblock_is_map_memory(pfn << PAGE_SHIFT);
> + return memblock_is_memory(pfn << PAGE_SHIFT);
> }
> EXPORT_SYMBOL(pfn_valid);
> #endif
> diff --git a/arch/arm64/mm/ioremap.c b/arch/arm64/mm/ioremap.c
> index 01e88c8bcab0..c17c220b0c48 100644
> --- a/arch/arm64/mm/ioremap.c
> +++ b/arch/arm64/mm/ioremap.c
> @@ -21,6 +21,7 @@
> */
>
> #include <linux/export.h>
> +#include <linux/memblock.h>
> #include <linux/mm.h>
> #include <linux/vmalloc.h>
> #include <linux/io.h>
> @@ -55,7 +56,7 @@ static void __iomem *__ioremap_caller(phys_addr_t phys_addr, size_t size,
> /*
> * Don't allow RAM to be mapped.
> */
> - if (WARN_ON(pfn_valid(__phys_to_pfn(phys_addr))))
> + if (WARN_ON(memblock_is_map_memory(phys_addr)))
> return NULL;
>
> area = get_vm_area_caller(size, VM_IOREMAP, caller);
> @@ -96,7 +97,7 @@ EXPORT_SYMBOL(__iounmap);
> void __iomem *ioremap_cache(phys_addr_t phys_addr, size_t size)
> {
> /* For normal memory we already have a cacheable mapping. */
> - if (pfn_valid(__phys_to_pfn(phys_addr)))
> + if (memblock_is_map_memory(phys_addr))
> return (void __iomem *)__phys_to_virt(phys_addr);
Thanks for sending out the new patch. Whilst I'm still a bit worried about
changing pfn_valid like this, I guess we'll just have to fix up any callers
which suffer from this change.
Acked-by: Will Deacon <[email protected]>
I'd like to see this sit in -next for a bit before we send it further.
Will
Hi Robert, Will,
On 01/12/16 16:45, Will Deacon wrote:
> On Wed, Nov 30, 2016 at 07:21:31PM +0100, Robert Richter wrote:
>> On ThunderX systems with certain memory configurations we see the
>> following BUG_ON():
>>
>> kernel BUG at mm/page_alloc.c:1848!
>>
>> This happens for some configs with 64k page size enabled. The BUG_ON()
>> checks if start and end page of a memmap range belongs to the same
>> zone.
>>
>> The BUG_ON() check fails if a memory zone contains NOMAP regions. In
>> this case the node information of those pages is not initialized. This
>> causes an inconsistency of the page links with wrong zone and node
>> information for that pages. NOMAP pages from node 1 still point to the
>> mem zone from node 0 and have the wrong nid assigned.
>>
>> The reason for the mis-configuration is a change in pfn_valid() which
>> reports pages marked NOMAP as invalid:
>>
>> 68709f45385a arm64: only consider memblocks with NOMAP cleared for linear mapping
>>
>> This causes pages marked as nomap being no long reassigned to the new
>> zone in memmap_init_zone() by calling __init_single_pfn().
>>
>> Fixing this by restoring the old behavior of pfn_valid() to use
>> memblock_is_memory(). Also changing users of pfn_valid() in arm64 code
>> to use memblock_is_map_memory() where necessary. This only affects
>> code in ioremap.c. The code in mmu.c still can use the new version of
>> pfn_valid().
>>
>> As a consequence, pfn_valid() can not be used to check if a physical
>> page is RAM. It just checks if there is an underlying memmap with a
>> valid struct page. Moreover, for performance reasons the whole memmap
>> (with pageblock_nr_pages number of pages) has valid pfns (SPARSEMEM
>> config). The memory range is extended to fit the alignment of the
>> memmap. Thus, pfn_valid() may return true for pfns that do not map to
>> physical memory. Those pages are simply not reported to the mm, they
>> are not marked reserved nor added to the list of free pages. Other
>> functions such a page_is_ram() or memblock_is_map_ memory() must be
>> used to check for memory and if the page can be mapped with the linear
>> mapping.
[...]
> Thanks for sending out the new patch. Whilst I'm still a bit worried about
> changing pfn_valid like this, I guess we'll just have to fix up any callers
> which suffer from this change.
Hibernate's core code falls foul of this. This patch causes a panic when copying
memory to build the 'image'[0].
saveable_page() in kernel/power/snapshot.c broadly assumes that pfn_valid()
pages can be accessed.
Fortunately the core code exposes pfn_is_nosave() which we can extend to catch
'nomap' pages, but only if they are also marked as PageReserved().
Are there any side-effects of marking all the nomap regions with
mark_page_reserved()? (it doesn't appear to be the case today).
Patches incoming...
Thanks,
James
[0] panic trace
root@juno-r1:~# echo disk > /sys/power/state
[ 56.914184] PM: Syncing filesystems ... [ 56.918853] done.
[ 56.920826] Freezing user space processes ... (elapsed 0.001 seconds) done.
[ 56.930383] PM: Preallocating image memory... done (allocated 97481 pages)
[ 60.566084] PM: Allocated 389924 kbytes in 3.62 seconds (107.71 MB/s)
[ 60.572576] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[ 60.604877] PM: freeze of devices complete after 23.146 msecs
[ 60.611230] PM: late freeze of devices complete after 0.578 msecs
[ 60.618609] PM: noirq freeze of devices complete after 1.247 msecs
[ 60.624833] Disabling non-boot CPUs ...
[ 60.649112] CPU1: shutdown
[ 60.651823] psci: CPU1 killed.
[ 60.701055] CPU2: shutdown
[ 60.703766] psci: CPU2 killed.
[ 60.745002] IRQ11 no longer affine to CPU3
[ 60.745043] CPU3: shutdown
[ 60.751890] psci: CPU3 killed.
[ 60.784966] CPU4: shutdown
[ 60.787676] psci: CPU4 killed.
[ 60.824916] IRQ8 no longer affine to CPU5
[ 60.824920] IRQ9 no longer affine to CPU5
[ 60.824927] IRQ18 no longer affine to CPU5
[ 60.824931] IRQ20 no longer affine to CPU5
[ 60.824951] CPU5: shutdown
[ 60.843975] psci: CPU5 killed.
[ 60.857989] PM: Creating hibernation image:
[ 60.857989] PM: Need to copy 96285 pages
[ 60.857989] Unable to handle kernel paging request at virtual address
ffff8000794a0000
[ 60.857989] pgd = ffff800975190000
[ 60.857989] [ffff8000794a0000] *pgd=0000000000000000[ 60.857989]
[ 60.857989] Internal error: Oops: 96000007 [#1] PREEMPT SMP
[ 60.857989] Modules linked in:
[ 60.857989] CPU: 0 PID: 2366 Comm: bash Not tainted
4.9.0-rc7-00001-gecf7c47af54d #6346
[ 60.857989] Hardware name: ARM Juno development board (r1) (DT)
[ 60.857989] task: ffff8009766d3200 task.stack: ffff800975fec000
[ 60.857989] PC is at swsusp_save+0x250/0x2c8
[ 60.857989] LR is at swsusp_save+0x214/0x2c8
[ 60.857989] pc : [<ffff000008100bd0>] lr : [<ffff000008100b94>] pstate: 200003c5
[ 60.857989] sp : ffff800975fefb50
[ 60.857989] x29: ffff800975fefb50 x28: ffff800975fec000
[ 60.857989] x27: ffff0000088c2000 x26: 0000000000000040
[ 60.857989] x25: 00000000000f94a0 x24: ffff000008e437e8
[ 60.857989] x23: 000000000001781d x22: ffff000008bee000
[ 60.857989] x21: ffff000008e437f8 x20: ffff000008e43878
[ 60.857989] x19: ffff7e0000000000 x18: 0000000000000006
[ 60.857989] x17: 0000000000000000 x16: 00000000000005d0
[ 60.857989] x15: ffff000008e43e95 x14: 00000000000001d9
[ 60.857989] x13: 0000000000000001 x12: ffff7e0000000000
[ 60.857989] x11: ffff7e0025ffffc0 x10: 0000000025ffffc0
[ 60.857989] x9 : 000000000000012f x8 : ffff80096cd04ce0
[ 60.857989] x7 : 0000000000978000 x6 : 0000000000000076
[ 60.857989] x5 : fffffffffffffff8 x4 : 0000000000080000
[ 60.857989] x3 : ffff8000794a0000 x2 : ffff800959d83000
[ 60.857989] x1 : 0000000000000000 x0 : ffff7e0001e52800
[ 60.857989] Process bash (pid: 2366, stack limit = 0xffff800975fec020)
[ 60.857989] Stack: (0xffff800975fefb50 to 0xffff800975ff0000)
[ 60.857989] Call trace:
[ 60.857989] [<ffff000008100bd0>] swsusp_save+0x250/0x2c8
[ 60.857989] [<ffff0000080936ec>] swsusp_arch_suspend+0xb4/0x100
[ 60.857989] [<ffff0000080fe670>] hibernation_snapshot+0x278/0x318
[ 60.857989] [<ffff0000080fef10>] hibernate+0x1d0/0x268
[ 60.857989] [<ffff0000080fc954>] state_store+0xdc/0x100
[ 60.857989] [<ffff00000838419c>] kobj_attr_store+0x14/0x28
[ 60.857989] [<ffff00000825be68>] sysfs_kf_write+0x48/0x58
[ 60.857989] [<ffff00000825b1f8>] kernfs_fop_write+0xb0/0x1d8
[ 60.857989] [<ffff0000081e2ddc>] __vfs_write+0x1c/0x110
[ 60.857989] [<ffff0000081e3bd8>] vfs_write+0xa0/0x1b8
[ 60.857989] [<ffff0000081e4f44>] SyS_write+0x44/0xa0
[ 60.857989] [<ffff000008082ef0>] el0_svc_naked+0x24/0x28
[ 60.857989] Code: d37ae442 d37ae463 b2514042 b2514063 (f8636820)
[ 60.857989] ---[ end trace d0265b757c9dd571 ]---
[ 60.857989] ------------[ cut here ]------------
James,
On 01.12.16 17:26:55, James Morse wrote:
> On 01/12/16 16:45, Will Deacon wrote:
> > Thanks for sending out the new patch. Whilst I'm still a bit worried about
> > changing pfn_valid like this, I guess we'll just have to fix up any callers
> > which suffer from this change.
>
> Hibernate's core code falls foul of this. This patch causes a panic when copying
> memory to build the 'image'[0].
> saveable_page() in kernel/power/snapshot.c broadly assumes that pfn_valid()
> pages can be accessed.
>
> Fortunately the core code exposes pfn_is_nosave() which we can extend to catch
> 'nomap' pages, but only if they are also marked as PageReserved().
>
> Are there any side-effects of marking all the nomap regions with
> mark_page_reserved()? (it doesn't appear to be the case today).
Reserving the page adds it to the memory management, which is what we
would like to avoid for NOMAP pages. I don't believe we should do
this. Since NOMAP is to some degree now core functionality, I would
rather implement pfn_is_nomap(), which defaults to pfn_valid() but
calls memblock_is_nomap() for arm64, or does something equivalent.
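Roughly along these lines (untested sketch, and only one possible
reading of the suggestion; since memblock_is_nomap() operates on
regions rather than addresses, the arm64 version is expressed with the
existing address-based memblock helpers):

#include <linux/memblock.h>
#include <linux/mm.h>

/* generic fallback (sketch): no NOMAP concept, so only pfns without a
 * valid memmap are treated as off-limits */
int __weak pfn_is_nomap(unsigned long pfn)
{
	return !pfn_valid(pfn);
}

/* possible arm64 override: memory known to memblock but kept out of
 * the linear mapping */
int pfn_is_nomap(unsigned long pfn)
{
	phys_addr_t addr = (phys_addr_t)pfn << PAGE_SHIFT;

	return memblock_is_memory(addr) && !memblock_is_map_memory(addr);
}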
The question arises what to do with that memory at all. There could be
mappings by the kernel, e.g. of ACPI tables. We can't assume the memory
regions still come out the same from the BIOS during resume. Do we
need to save that memory? I can't answer that as I don't know much
about hibernation yet.
Thanks,
-Robert
Hi Robert,
On 02/12/16 07:11, Robert Richter wrote:
> On 01.12.16 17:26:55, James Morse wrote:
>> On 01/12/16 16:45, Will Deacon wrote:
>>> Thanks for sending out the new patch. Whilst I'm still a bit worried about
>>> changing pfn_valid like this, I guess we'll just have to fix up any callers
>>> which suffer from this change.
>>
>> Hibernate's core code falls foul of this. This patch causes a panic when copying
>> memory to build the 'image'[0].
>> saveable_page() in kernel/power/snapshot.c broadly assumes that pfn_valid()
>> pages can be accessed.
>>
>> Fortunately the core code exposes pfn_is_nosave() which we can extend to catch
>> 'nomap' pages, but only if they are also marked as PageReserved().
>>
>> Are there any side-effects of marking all the nomap regions with
>> mark_page_reserved()? (it doesn't appear to be the case today).
>
> Reserving the page adds it to the memory management which is what we
> would like to avoid for NOMAP pages. I don't believe we should do
> this. Since NOMAP is to some degree now core functionality I would
> rather implement pfn_is_nomap() that defaults to pfn_is_valid() but
> calls memblock_is_nomap() for arm64 or does something equivalent.
I thought the adjust_managed_page_count() code was just fiddling with some
counters. I will change it to call SetPageReserved() directly which will just
set the bit in struct page's flags. I will post these shortly as the 'fixes' way
of solving the hibernate fallout.
I guess any arch that uses memblock nomap needs core code to take account of it,
but at the moment that is just arm/arm64. If we are adding new pfn_is_ calls, we
could try and clean up pfn_valid() users to use pfn_is_memory(), pfn_is_mapped()
or pfn_has_memmap(). Part of the problem is 'valid' means different things to
different people.
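For example, something like this (sketch only; these helpers are just
the names suggested above, expressed with the existing arm64/memblock
primitives):

#include <linux/memblock.h>
#include <linux/mm.h>

/* "this pfn is RAM", whether or not it is in the linear map */
static inline bool pfn_is_memory(unsigned long pfn)
{
	return memblock_is_memory((phys_addr_t)pfn << PAGE_SHIFT);
}

/* "this pfn is RAM and part of the linear mapping" */
static inline bool pfn_is_mapped(unsigned long pfn)
{
	return memblock_is_map_memory((phys_addr_t)pfn << PAGE_SHIFT);
}

/* "a struct page exists for this pfn" */
static inline bool pfn_has_memmap(unsigned long pfn)
{
	return pfn_valid(pfn);
}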
> The question arises what to do with that mem at all. There could be
> mappings by the kernel, e.g. of acpi tables. We can't assume the mem
> regions still come out the same from the BIOS during resume.
Unfortunately we have to assume this. If the firmware reserved regions move
around in memory we can't resume from hibernate. Other OSes also require this
not to happen. ([0] 'firmware memory requirements')
Hibernate core code checks that the number of pages of kernel memory is the
same before trying to resume. If you just move the allocations around, this
will panic during resume as the resume kernel will have surprising holes in
its linear map.
x86 recently grew an MD5sum check of the e820 memory map, I intend to do the
same for memblock.
The theory is this would only happen if you change the hardware in some way, and
that otherwise the firmware is entirely deterministic...
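Something roughly like this (sketch only; crc32 stands in here for the
MD5 check x86 uses for e820, and only the memblock.memory ranges are
hashed):

#include <linux/crc32.h>
#include <linux/memblock.h>

/* checksum the memory layout so resume can detect a changed map */
static u32 memblock_layout_crc(void)
{
	struct memblock_region *region;
	u32 crc = 0;

	for_each_memblock(memory, region) {
		crc = crc32(crc, &region->base, sizeof(region->base));
		crc = crc32(crc, &region->size, sizeof(region->size));
		crc = crc32(crc, &region->flags, sizeof(region->flags));
	}

	return crc;
}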
> Do we
> need to save the mem? I can't answer that as I don't know much about
> hibernation yet.
Hibernate only saves/restores the linear map and CPU state. We expect firmware
to put equivalent data in the same places for its nomap regions. If the region
belongs to a device, it's up to the device driver to tidy up. (It has
freeze/thaw/resume callbacks to do this).
Thanks,
James
[0]
https://msdn.microsoft.com/en-gb/windows/hardware/commercialize/manufacture/desktop/uefi-requirements-boot-time-runtime-hibernation-state--s4
Patch "arm64: mm: Fix memmap to be initialized for the entire section"
changes pfn_valid() in a way that breaks hibernate. These patches fix
hibernate, and provided struct page's are allocated for nomap pages,
can be applied before [0].
Hibernate core code belives 'valid' to mean "I can access this". It
uses pfn_valid() to test the page if the page is 'valid'.
pfn_valid() needs to be changed so that all struct pages in a numa
node have the same node-id. Currently 'nomap' pages are skipped, and
retain their pre-numa node-ids, which leads to a later BUG_ON().
These patches make hibernate's saveable_page() take its escape route
via 'if (PageReserved(page) && pfn_is_nosave(pfn))'.
[0] https://lkml.org/lkml/2016/11/30/566
James Morse (2):
arm64: mm: Mark nomap regions with the PG_reserved flag
arm64: hibernate: report nomap regions as being pfn_nosave
arch/arm64/kernel/hibernate.c | 6 +++++-
arch/arm64/mm/init.c | 14 ++++++++++++++
2 files changed, 19 insertions(+), 1 deletion(-)
--
2.10.1
pfn_valid() needs to be changed so that all struct pages in a numa
node have the same node-id. Currently 'nomap' pages are skipped, and
retain their pre-numa node-ids, which leads to a later BUG_ON().
Once this change happens, hibernate's core code will try to
save/restore the nomap pages.
Add the memblock nomap regions to the ranges reported as being
'pfn_nosave' to the hibernate core code. This only works if all
pages in the nomap region are also marked with PG_reserved.
Signed-off-by: James Morse <[email protected]>
---
arch/arm64/kernel/hibernate.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index d55a7b09959b..9e901658a123 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -17,6 +17,7 @@
#define pr_fmt(x) "hibernate: " x
#include <linux/cpu.h>
#include <linux/kvm_host.h>
+#include <linux/memblock.h>
#include <linux/mm.h>
#include <linux/pm.h>
#include <linux/sched.h>
@@ -105,7 +106,10 @@ int pfn_is_nosave(unsigned long pfn)
unsigned long nosave_begin_pfn = virt_to_pfn(&__nosave_begin);
unsigned long nosave_end_pfn = virt_to_pfn(&__nosave_end - 1);
- return (pfn >= nosave_begin_pfn) && (pfn <= nosave_end_pfn);
+ if ((pfn >= nosave_begin_pfn) && (pfn <= nosave_end_pfn))
+ return 1;
+
+ return !memblock_is_map_memory(pfn << PAGE_SHIFT);
}
void notrace save_processor_state(void)
--
2.10.1
linux/page-flags.h has a flag for describing special pages:
> PG_reserved is set for special pages, which can never be swapped out.
> Some of them might not even exist (eg empty_bad_page)...
memblock nomap pages fall into the 'might not even exist' category,
so set this bit on all the struct pages in the nomap regions.
This gives pfn walkers the necessary hint that the page might not
be accessible, allowing pfn_valid()'s meaning to change slightly.
Signed-off-by: James Morse <[email protected]>
---
arch/arm64/mm/init.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 166911f4a2e6..5da9ff7d20f5 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -31,6 +31,7 @@
#include <linux/memblock.h>
#include <linux/sort.h>
#include <linux/of_fdt.h>
+#include <linux/page-flags.h>
#include <linux/dma-mapping.h>
#include <linux/dma-contiguous.h>
#include <linux/efi.h>
@@ -401,6 +402,8 @@ static void __init free_unused_memmap(void)
*/
void __init mem_init(void)
{
+ struct memblock_region *region;
+
if (swiotlb_force || max_pfn > (arm64_dma_phys_limit >> PAGE_SHIFT))
swiotlb_init(1);
@@ -479,6 +482,17 @@ void __init mem_init(void)
*/
sysctl_overcommit_memory = OVERCOMMIT_ALWAYS;
}
+
+ /* Mark struct pages for the memblock:nomap regions as reserved */
+ for_each_memblock(memory, region) {
+ u64 pfn;
+ u64 start_pfn = memblock_region_memory_base_pfn(region);
+ u64 end_pfn = memblock_region_memory_end_pfn(region);
+
+ if (memblock_is_nomap(region))
+ for (pfn = start_pfn; pfn < end_pfn; pfn++)
+ SetPageReserved(pfn_to_page(pfn));
+ }
}
void free_initmem(void)
--
2.10.1
On 2 December 2016 at 14:49, James Morse <[email protected]> wrote:
> Patch "arm64: mm: Fix memmap to be initialized for the entire section"
> changes pfn_valid() in a way that breaks hibernate. These patches fix
> hibernate, and provided struct page's are allocated for nomap pages,
> can be applied before [0].
>
> Hibernate core code belives 'valid' to mean "I can access this". It
> uses pfn_valid() to test the page if the page is 'valid'.
>
> pfn_valid() needs to be changed so that all struct pages in a numa
> node have the same node-id. Currently 'nomap' pages are skipped, and
> retain their pre-numa node-ids, which leads to a later BUG_ON().
>
> These patches make hibernate's savable_page() take its escape route
> via 'if (PageReserved(page) && pfn_is_nosave(pfn))'.
>
This makes me feel slightly uneasy. Robert makes a convincing point,
but I wonder if we can expect more fallout from the ambiguity of
pfn_valid(). Now we are not only forced to assign non-existing (as far
as the OS is concerned) pages to the correct NUMA node, we also need
to set certain page flags.
On Mon, Dec 05, 2016 at 03:42:14PM +0000, Ard Biesheuvel wrote:
> On 2 December 2016 at 14:49, James Morse <[email protected]> wrote:
> > Patch "arm64: mm: Fix memmap to be initialized for the entire section"
> > changes pfn_valid() in a way that breaks hibernate. These patches fix
> > hibernate, and provided struct page's are allocated for nomap pages,
> > can be applied before [0].
> >
> > Hibernate core code belives 'valid' to mean "I can access this". It
> > uses pfn_valid() to test the page if the page is 'valid'.
> >
> > pfn_valid() needs to be changed so that all struct pages in a numa
> > node have the same node-id. Currently 'nomap' pages are skipped, and
> > retain their pre-numa node-ids, which leads to a later BUG_ON().
> >
> > These patches make hibernate's savable_page() take its escape route
> > via 'if (PageReserved(page) && pfn_is_nosave(pfn))'.
> >
>
> This makes me feel slightly uneasy. Robert makes a convincing point,
> but I wonder if we can expect more fallout from the ambiguity of
> pfn_valid(). Now we are not only forced to assign non-existing (as far
> as the OS is concerned) pages to the correct NUMA node, we also need
> to set certain page flags.
Yes, I really don't know how to proceed here. Playing whack-a-mole with
pfn_valid() users doesn't sound like an improvement on the current
situation to me.
Robert -- if we leave pfn_valid() as it is, would a point-hack to
memmap_init_zone help, or do you anticipate other problems?
Will
On 06.12.16 17:38:11, Will Deacon wrote:
> On Mon, Dec 05, 2016 at 03:42:14PM +0000, Ard Biesheuvel wrote:
> > On 2 December 2016 at 14:49, James Morse <[email protected]> wrote:
> > > Patch "arm64: mm: Fix memmap to be initialized for the entire section"
> > > changes pfn_valid() in a way that breaks hibernate. These patches fix
> > > hibernate, and provided struct page's are allocated for nomap pages,
> > > can be applied before [0].
> > >
> > > Hibernate core code belives 'valid' to mean "I can access this". It
> > > uses pfn_valid() to test the page if the page is 'valid'.
> > >
> > > pfn_valid() needs to be changed so that all struct pages in a numa
> > > node have the same node-id. Currently 'nomap' pages are skipped, and
> > > retain their pre-numa node-ids, which leads to a later BUG_ON().
> > >
> > > These patches make hibernate's savable_page() take its escape route
> > > via 'if (PageReserved(page) && pfn_is_nosave(pfn))'.
> > >
> >
> > This makes me feel slightly uneasy. Robert makes a convincing point,
> > but I wonder if we can expect more fallout from the ambiguity of
> > pfn_valid(). Now we are not only forced to assign non-existing (as far
> > as the OS is concerned) pages to the correct NUMA node, we also need
> > to set certain page flags.
>
> Yes, I really don't know how to proceed here. Playing whack-a-mole with
> pfn_valid() users doesn't sounds like an improvement on the current
> situation to me.
>
> Robert -- if we leave pfn_valid() as it is, would a point-hack to
> memmap_init_zone help, or do you anticipate other problems?
I would suggest fixing the hibernation code, as I commented before, to
use pfn_is_nosave(), which defaults to pfn_valid() but uses
memblock_is_nomap() for arm64. Let's just fix it and see whether any
other issues arise. I will try to send a patch for this by tomorrow.
I am also going to see how early_pfn_valid() could be redirected to
use memblock_is_nomap() on arm64. That would "quick fix" the problem,
though I would rather go further with the current solution.
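Something like the following (untested sketch; the generic definition
in include/linux/mmzone.h would need to allow an arch override, and
memblock_is_memory() is used here because memblock_is_nomap() operates
on regions rather than addresses):

/* arm64 "quick fix" idea: let memmap_init_zone() also initialise
 * struct pages for NOMAP memory, without changing pfn_valid() */
#define early_pfn_valid(pfn) \
	memblock_is_memory((phys_addr_t)(pfn) << PAGE_SHIFT)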
-Robert
On Wed, Dec 07, 2016 at 10:06:38AM +0100, Robert Richter wrote:
> On 06.12.16 17:38:11, Will Deacon wrote:
> > On Mon, Dec 05, 2016 at 03:42:14PM +0000, Ard Biesheuvel wrote:
> > > On 2 December 2016 at 14:49, James Morse <[email protected]> wrote:
> > > > Patch "arm64: mm: Fix memmap to be initialized for the entire section"
> > > > changes pfn_valid() in a way that breaks hibernate. These patches fix
> > > > hibernate, and provided struct page's are allocated for nomap pages,
> > > > can be applied before [0].
> > > >
> > > > Hibernate core code belives 'valid' to mean "I can access this". It
> > > > uses pfn_valid() to test the page if the page is 'valid'.
> > > >
> > > > pfn_valid() needs to be changed so that all struct pages in a numa
> > > > node have the same node-id. Currently 'nomap' pages are skipped, and
> > > > retain their pre-numa node-ids, which leads to a later BUG_ON().
> > > >
> > > > These patches make hibernate's savable_page() take its escape route
> > > > via 'if (PageReserved(page) && pfn_is_nosave(pfn))'.
> > > >
> > >
> > > This makes me feel slightly uneasy. Robert makes a convincing point,
> > > but I wonder if we can expect more fallout from the ambiguity of
> > > pfn_valid(). Now we are not only forced to assign non-existing (as far
> > > as the OS is concerned) pages to the correct NUMA node, we also need
> > > to set certain page flags.
> >
> > Yes, I really don't know how to proceed here. Playing whack-a-mole with
> > pfn_valid() users doesn't sounds like an improvement on the current
> > situation to me.
> >
> > Robert -- if we leave pfn_valid() as it is, would a point-hack to
> > memmap_init_zone help, or do you anticipate other problems?
>
> I would suggest to fix the hibernation code as I commented on before
> to use pfn_is_nosave() that defaults to pfn_valid() but uses memblock_
> is_nomap() for arm64. Let's just fix it and see if no other issues
> arise. I am trying to send a patch for this until tomorrow.
I'd rather not use mainline as a guinea pig like this, since I'd be very
surprised if other places don't break given the scope for different
interpretations of pfn_valid.
> I am also going to see how early_pfn_valid() could be redirected to
> use memblock_is_nomap() on arm64. That would "quick fix" the problem,
> though I rather prefer to go further with the current solution.
I don't like either of them, but early_pfn_valid is easier to revert so
let's go with that.
Will
On 9 December 2016 at 12:14, Yisheng Xie <[email protected]> wrote:
> Hi Robert,
> We have merged your patch to 4.9.0-rc8, however we still meet the similar problem
> on our D05 board:
>
To be clear: does this issue also occur on D05 *without* the patch?
> -----------------
> [ 5.081971] ------------[ cut here ]------------
> [ 5.086668] kernel BUG at mm/page_alloc.c:1863!
> [ 5.091281] Internal error: Oops - BUG: 0 [#1] SMP
> [ 5.096159] Modules linked in:
> [ 5.099265] CPU: 61 PID: 1 Comm: swapper/0 Not tainted 4.9.0-rc8+ #58
> [ 5.105822] Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon D05 UEFI 16.08 RC1 12/08/2016
> [ 5.114860] task: fffffe13f23d7400 task.stack: fffffe13f66f8000
> [ 5.120903] PC is at move_freepages+0x170/0x180
> [ 5.125514] LR is at move_freepages_block+0xa8/0xb8
> [ 5.130479] pc : [<fffffc00081e9588>] lr : [<fffffc00081e9640>] pstate: 200000c5
> [ 5.138008] sp : fffffe13f66fb910
> [ 5.141375] x29: fffffe13f66fb910 x28: 00000000ffffff80
> [ 5.146784] x27: fffffdff800b0000 x26: fffffe13fbf62b90
> [ 5.152193] x25: 000000000000000a x24: fffffdff800b0020
> [ 5.157598] x23: 0000000000000000 x22: fffffdff800b0000
> [ 5.163006] x21: fffffdff800fffc0 x20: fffffe13fbf62680
> [ 5.168412] x19: fffffdff80080000 x18: 000000004c4d6d26
> [ 5.173820] x17: 0000000000000000 x16: 0000000000000000
> [ 5.179226] x15: 000000005c589429 x14: 0000000000000000
> [ 5.184634] x13: 0000000000000000 x12: 00000000fe2ce6e0
> [ 5.190039] x11: 00000000bb5b525b x10: 00000000bb48417b
> [ 5.195446] x9 : 0000000000000068 x8 : 0000000000000000
> [ 5.200854] x7 : fffffe13fbff2680 x6 : 0000000000000001
> [ 5.206260] x5 : fffffc0008e12328 x4 : 0000000000000000
> [ 5.211667] x3 : fffffe13fbf62680 x2 : 0000000000000000
> [ 5.217073] x1 : fffffe13fbff2680 x0 : fffffe13fbf62680
> [...]
> [ 5.768991] [<fffffc00081e9588>] move_freepages+0x170/0x180
> [ 5.774664] [<fffffc00081e9640>] move_freepages_block+0xa8/0xb8
> [ 5.780691] [<fffffc00081e9bbc>] __rmqueue+0x494/0x5f0
> [ 5.785921] [<fffffc00081eb10c>] get_page_from_freelist+0x5ec/0xb58
> [ 5.792302] [<fffffc00081ebc4c>] __alloc_pages_nodemask+0x144/0xd08
> [ 5.798687] [<fffffc0008240514>] alloc_page_interleave+0x64/0xc0
> [ 5.804801] [<fffffc0008240b20>] alloc_pages_current+0x108/0x168
> [ 5.810920] [<fffffc0008c75410>] atomic_pool_init+0x78/0x1cc
> [ 5.816680] [<fffffc0008c755a0>] arm64_dma_init+0x3c/0x44
> [ 5.822180] [<fffffc0008082d94>] do_one_initcall+0x44/0x138
> [ 5.827853] [<fffffc0008c70d54>] kernel_init_freeable+0x1ec/0x28c
> [ 5.834060] [<fffffc00088a79f0>] kernel_init+0x18/0x110
> [ 5.839378] [<fffffc0008082b30>] ret_from_fork+0x10/0x20
> [ 5.844785] Code: 9108a021 9400afc5 d4210000 d503201f (d4210000)
> [ 5.851024] ---[ end trace 3382df1ae82057db ]---
>
>
> On 2016/12/1 2:21, Robert Richter wrote:
>> On ThunderX systems with certain memory configurations we see the
>> following BUG_ON():
>>
>> kernel BUG at mm/page_alloc.c:1848!
>>
>> This happens for some configs with 64k page size enabled. The BUG_ON()
>> checks if start and end page of a memmap range belongs to the same
>> zone.
>>
>> The BUG_ON() check fails if a memory zone contains NOMAP regions. In
>> this case the node information of those pages is not initialized. This
>> causes an inconsistency of the page links with wrong zone and node
>> information for that pages. NOMAP pages from node 1 still point to the
>> mem zone from node 0 and have the wrong nid assigned.
>>
>> The reason for the mis-configuration is a change in pfn_valid() which
>> reports pages marked NOMAP as invalid:
>>
>> 68709f45385a arm64: only consider memblocks with NOMAP cleared for linear mapping
>>
>> This causes pages marked as nomap being no long reassigned to the new
>> zone in memmap_init_zone() by calling __init_single_pfn().
>>
>> Fixing this by restoring the old behavior of pfn_valid() to use
>> memblock_is_memory(). Also changing users of pfn_valid() in arm64 code
>> to use memblock_is_map_memory() where necessary. This only affects
>> code in ioremap.c. The code in mmu.c still can use the new version of
>> pfn_valid().
>>
>> As a consequence, pfn_valid() can not be used to check if a physical
>> page is RAM. It just checks if there is an underlying memmap with a
>> valid struct page. Moreover, for performance reasons the whole memmap
>> (with pageblock_nr_pages number of pages) has valid pfns (SPARSEMEM
>> config). The memory range is extended to fit the alignment of the
>> memmap. Thus, pfn_valid() may return true for pfns that do not map to
>> physical memory. Those pages are simply not reported to the mm, they
>> are not marked reserved nor added to the list of free pages. Other
>> functions such a page_is_ram() or memblock_is_map_ memory() must be
>> used to check for memory and if the page can be mapped with the linear
>> mapping.
>>
>> Since NOMAP mem ranges may need to be mapped with different mem
>> attributes (e.g. read-only or non-caching) we can not use linear
>> mapping here. The use of memblock_is_memory() in pfn_valid() may not
>> break this behaviour. Since commit:
>>
>> e7cd190385d1 arm64: mark reserved memblock regions explicitly in iomem
>>
>> NOMAP mem resources are no longer marked as system RAM (IORESOURCE_
>> SYSTEM_RAM). Now page_is_ram() and region_intersects() (see
>> memremap()) do not detect NOMAP mem as system ram and NOMAP mem is not
>> added to the linear mapping as system RAM is.
>>
>> v2:
>>
>> * Added Ack
>> * updated description to reflect the discussion
>>
>> Acked-by: Ard Biesheuvel <[email protected]>
>> Signed-off-by: Robert Richter <[email protected]>
>> ---
>> arch/arm64/mm/init.c | 2 +-
>> arch/arm64/mm/ioremap.c | 5 +++--
>> 2 files changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
>> index 212c4d1e2f26..166911f4a2e6 100644
>> --- a/arch/arm64/mm/init.c
>> +++ b/arch/arm64/mm/init.c
>> @@ -147,7 +147,7 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)
>> #ifdef CONFIG_HAVE_ARCH_PFN_VALID
>> int pfn_valid(unsigned long pfn)
>> {
>> - return memblock_is_map_memory(pfn << PAGE_SHIFT);
>> + return memblock_is_memory(pfn << PAGE_SHIFT);
>> }
>> EXPORT_SYMBOL(pfn_valid);
>> #endif
>> diff --git a/arch/arm64/mm/ioremap.c b/arch/arm64/mm/ioremap.c
>> index 01e88c8bcab0..c17c220b0c48 100644
>> --- a/arch/arm64/mm/ioremap.c
>> +++ b/arch/arm64/mm/ioremap.c
>> @@ -21,6 +21,7 @@
>> */
>>
>> #include <linux/export.h>
>> +#include <linux/memblock.h>
>> #include <linux/mm.h>
>> #include <linux/vmalloc.h>
>> #include <linux/io.h>
>> @@ -55,7 +56,7 @@ static void __iomem *__ioremap_caller(phys_addr_t phys_addr, size_t size,
>> /*
>> * Don't allow RAM to be mapped.
>> */
>> - if (WARN_ON(pfn_valid(__phys_to_pfn(phys_addr))))
>> + if (WARN_ON(memblock_is_map_memory(phys_addr)))
>> return NULL;
>>
>> area = get_vm_area_caller(size, VM_IOREMAP, caller);
>> @@ -96,7 +97,7 @@ EXPORT_SYMBOL(__iounmap);
>> void __iomem *ioremap_cache(phys_addr_t phys_addr, size_t size)
>> {
>> /* For normal memory we already have a cacheable mapping. */
>> - if (pfn_valid(__phys_to_pfn(phys_addr)))
>> + if (memblock_is_map_memory(phys_addr))
>> return (void __iomem *)__phys_to_virt(phys_addr);
>>
>> return __ioremap_caller(phys_addr, size, __pgprot(PROT_NORMAL),
>>
>
Hi Robert,
We have merged your patch into 4.9.0-rc8, however we still meet a similar
problem on our D05 board:
Thanks,
Yisheng Xie
-----------------
[ 5.081971] ------------[ cut here ]------------
[ 5.086668] kernel BUG at mm/page_alloc.c:1863!
[ 5.091281] Internal error: Oops - BUG: 0 [#1] SMP
[ 5.096159] Modules linked in:
[ 5.099265] CPU: 61 PID: 1 Comm: swapper/0 Not tainted 4.9.0-rc8+ #58
[ 5.105822] Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon D05 UEFI 16.08 RC1 12/08/2016
[ 5.114860] task: fffffe13f23d7400 task.stack: fffffe13f66f8000
[ 5.120903] PC is at move_freepages+0x170/0x180
[ 5.125514] LR is at move_freepages_block+0xa8/0xb8
[ 5.130479] pc : [<fffffc00081e9588>] lr : [<fffffc00081e9640>] pstate: 200000c5
[ 5.138008] sp : fffffe13f66fb910
[ 5.141375] x29: fffffe13f66fb910 x28: 00000000ffffff80
[ 5.146784] x27: fffffdff800b0000 x26: fffffe13fbf62b90
[ 5.152193] x25: 000000000000000a x24: fffffdff800b0020
[ 5.157598] x23: 0000000000000000 x22: fffffdff800b0000
[ 5.163006] x21: fffffdff800fffc0 x20: fffffe13fbf62680
[ 5.168412] x19: fffffdff80080000 x18: 000000004c4d6d26
[ 5.173820] x17: 0000000000000000 x16: 0000000000000000
[ 5.179226] x15: 000000005c589429 x14: 0000000000000000
[ 5.184634] x13: 0000000000000000 x12: 00000000fe2ce6e0
[ 5.190039] x11: 00000000bb5b525b x10: 00000000bb48417b
[ 5.195446] x9 : 0000000000000068 x8 : 0000000000000000
[ 5.200854] x7 : fffffe13fbff2680 x6 : 0000000000000001
[ 5.206260] x5 : fffffc0008e12328 x4 : 0000000000000000
[ 5.211667] x3 : fffffe13fbf62680 x2 : 0000000000000000
[ 5.217073] x1 : fffffe13fbff2680 x0 : fffffe13fbf62680
[...]
[ 5.768991] [<fffffc00081e9588>] move_freepages+0x170/0x180
[ 5.774664] [<fffffc00081e9640>] move_freepages_block+0xa8/0xb8
[ 5.780691] [<fffffc00081e9bbc>] __rmqueue+0x494/0x5f0
[ 5.785921] [<fffffc00081eb10c>] get_page_from_freelist+0x5ec/0xb58
[ 5.792302] [<fffffc00081ebc4c>] __alloc_pages_nodemask+0x144/0xd08
[ 5.798687] [<fffffc0008240514>] alloc_page_interleave+0x64/0xc0
[ 5.804801] [<fffffc0008240b20>] alloc_pages_current+0x108/0x168
[ 5.810920] [<fffffc0008c75410>] atomic_pool_init+0x78/0x1cc
[ 5.816680] [<fffffc0008c755a0>] arm64_dma_init+0x3c/0x44
[ 5.822180] [<fffffc0008082d94>] do_one_initcall+0x44/0x138
[ 5.827853] [<fffffc0008c70d54>] kernel_init_freeable+0x1ec/0x28c
[ 5.834060] [<fffffc00088a79f0>] kernel_init+0x18/0x110
[ 5.839378] [<fffffc0008082b30>] ret_from_fork+0x10/0x20
[ 5.844785] Code: 9108a021 9400afc5 d4210000 d503201f (d4210000)
[ 5.851024] ---[ end trace 3382df1ae82057db ]---
On 2016/12/1 2:21, Robert Richter wrote:
> On ThunderX systems with certain memory configurations we see the
> following BUG_ON():
>
> kernel BUG at mm/page_alloc.c:1848!
>
> This happens for some configs with 64k page size enabled. The BUG_ON()
> checks if start and end page of a memmap range belongs to the same
> zone.
>
> The BUG_ON() check fails if a memory zone contains NOMAP regions. In
> this case the node information of those pages is not initialized. This
> causes an inconsistency of the page links with wrong zone and node
> information for that pages. NOMAP pages from node 1 still point to the
> mem zone from node 0 and have the wrong nid assigned.
>
> The reason for the mis-configuration is a change in pfn_valid() which
> reports pages marked NOMAP as invalid:
>
> 68709f45385a arm64: only consider memblocks with NOMAP cleared for linear mapping
>
> This causes pages marked as nomap being no long reassigned to the new
> zone in memmap_init_zone() by calling __init_single_pfn().
>
> Fixing this by restoring the old behavior of pfn_valid() to use
> memblock_is_memory(). Also changing users of pfn_valid() in arm64 code
> to use memblock_is_map_memory() where necessary. This only affects
> code in ioremap.c. The code in mmu.c still can use the new version of
> pfn_valid().
>
> As a consequence, pfn_valid() can not be used to check if a physical
> page is RAM. It just checks if there is an underlying memmap with a
> valid struct page. Moreover, for performance reasons the whole memmap
> (with pageblock_nr_pages number of pages) has valid pfns (SPARSEMEM
> config). The memory range is extended to fit the alignment of the
> memmap. Thus, pfn_valid() may return true for pfns that do not map to
> physical memory. Those pages are simply not reported to the mm, they
> are not marked reserved nor added to the list of free pages. Other
> functions such a page_is_ram() or memblock_is_map_ memory() must be
> used to check for memory and if the page can be mapped with the linear
> mapping.
>
> Since NOMAP mem ranges may need to be mapped with different mem
> attributes (e.g. read-only or non-caching) we can not use linear
> mapping here. The use of memblock_is_memory() in pfn_valid() may not
> break this behaviour. Since commit:
>
> e7cd190385d1 arm64: mark reserved memblock regions explicitly in iomem
>
> NOMAP mem resources are no longer marked as system RAM (IORESOURCE_
> SYSTEM_RAM). Now page_is_ram() and region_intersects() (see
> memremap()) do not detect NOMAP mem as system ram and NOMAP mem is not
> added to the linear mapping as system RAM is.
>
> v2:
>
> * Added Ack
> * updated description to reflect the discussion
>
> Acked-by: Ard Biesheuvel <[email protected]>
> Signed-off-by: Robert Richter <[email protected]>
> ---
> arch/arm64/mm/init.c | 2 +-
> arch/arm64/mm/ioremap.c | 5 +++--
> 2 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 212c4d1e2f26..166911f4a2e6 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -147,7 +147,7 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)
> #ifdef CONFIG_HAVE_ARCH_PFN_VALID
> int pfn_valid(unsigned long pfn)
> {
> - return memblock_is_map_memory(pfn << PAGE_SHIFT);
> + return memblock_is_memory(pfn << PAGE_SHIFT);
> }
> EXPORT_SYMBOL(pfn_valid);
> #endif
> diff --git a/arch/arm64/mm/ioremap.c b/arch/arm64/mm/ioremap.c
> index 01e88c8bcab0..c17c220b0c48 100644
> --- a/arch/arm64/mm/ioremap.c
> +++ b/arch/arm64/mm/ioremap.c
> @@ -21,6 +21,7 @@
> */
>
> #include <linux/export.h>
> +#include <linux/memblock.h>
> #include <linux/mm.h>
> #include <linux/vmalloc.h>
> #include <linux/io.h>
> @@ -55,7 +56,7 @@ static void __iomem *__ioremap_caller(phys_addr_t phys_addr, size_t size,
> /*
> * Don't allow RAM to be mapped.
> */
> - if (WARN_ON(pfn_valid(__phys_to_pfn(phys_addr))))
> + if (WARN_ON(memblock_is_map_memory(phys_addr)))
> return NULL;
>
> area = get_vm_area_caller(size, VM_IOREMAP, caller);
> @@ -96,7 +97,7 @@ EXPORT_SYMBOL(__iounmap);
> void __iomem *ioremap_cache(phys_addr_t phys_addr, size_t size)
> {
> /* For normal memory we already have a cacheable mapping. */
> - if (pfn_valid(__phys_to_pfn(phys_addr)))
> + if (memblock_is_map_memory(phys_addr))
> return (void __iomem *)__phys_to_virt(phys_addr);
>
> return __ioremap_caller(phys_addr, size, __pgprot(PROT_NORMAL),
>
On 12/09/2016 08:19 PM, Ard Biesheuvel wrote:
> On 9 December 2016 at 12:14, Yisheng Xie <[email protected]> wrote:
>> Hi Robert,
>> We have merged your patch to 4.9.0-rc8, however we still meet the similar problem
>> on our D05 board:
>>
>
> To be clear: does this issue also occur on D05 *without* the patch?
It boots ok on D05 without this patch.
But I think the problem Robert described in the commit message is
still there, just not triggered during boot. We met this problem
when running LTP memory stress tests and hit the same BUG_ON().
Hanjun
On 2016/12/9 20:23, Hanjun Guo wrote:
> On 12/09/2016 08:19 PM, Ard Biesheuvel wrote:
>> On 9 December 2016 at 12:14, Yisheng Xie <[email protected]> wrote:
>>> Hi Robert,
>>> We have merged your patch to 4.9.0-rc8, however we still meet the similar problem
>>> on our D05 board:
>>>
>>
>> To be clear: does this issue also occur on D05 *without* the patch?
>
> It boots ok on D05 without this patch.
>
> But I think the problem Robert described in the commit message is
> still there, just not triggered in the boot. we met this problem
> when having LTP stress memory test and hit the same BUG_ON.
>
That's right. In D05's case, when we trigger the BUG_ON() at:
1863 VM_BUG_ON(page_zone(start_page) != page_zone(end_page));
the end_page is not nomap, but BIOS-reserved. Here is the log I got:
[ 0.000000] efi: 0x00003fc00000-0x00003fffffff [Reserved | | | | | | | | | | | | ]
[...]
[ 5.081443] move_freepages: phys(start_page: 0x20000000,end_page:0x3fff0000)
For invalid pages, their zone and node information is not initialized,
and there is a risk of triggering the BUG_ON(), so I have a silly
question: why not just change the BUG_ON():
-----------
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6de9440..af199b8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1860,12 +1860,13 @@ int move_freepages(struct zone *zone,
* Remove at a later date when no bug reports exist related to
* grouping pages by mobility
*/
- VM_BUG_ON(page_zone(start_page) != page_zone(end_page));
+ VM_BUG_ON(early_pfn_valid(page_to_pfn(start_page)) && early_pfn_valid(page_to_pfn(end_page)) &&
+ page_zone(start_page) != page_zone(end_page));
#endif
for (page = start_page; page <= end_page;) {
/* Make sure we are not inadvertently changing nodes */
- VM_BUG_ON_PAGE(page_to_nid(page) != zone_to_nid(zone), page);
+ VM_BUG_ON_PAGE(early_pfn_valid(page_to_pfn(page)) && (page_to_nid(page) != zone_to_nid(zone)), page);
if (!pfn_valid_within(page_to_pfn(page))) {
page++;
Thanks,
Yisheng Xie
> Thanks
> Hanjun
>
> .
>
On 09.12.16 20:14:24, Yisheng Xie wrote:
> We have merged your patch to 4.9.0-rc8, however we still meet the similar problem
> on our D05 board:
I assume you can reliably trigger the bug. Can you add some debug
messages that show the pfn number, node id and memory range? Also, add
these kernel parameters:
memblock=debug efi=debug loglevel=8 mminit_loglevel=4
We need the memory layout detected by EFI and the memblock regions.
These should then be in the full boot log.
Thanks,
-Robert
> -----------------
> [ 5.081971] ------------[ cut here ]------------
> [ 5.086668] kernel BUG at mm/page_alloc.c:1863!
> [ 5.091281] Internal error: Oops - BUG: 0 [#1] SMP
> [ 5.096159] Modules linked in:
> [ 5.099265] CPU: 61 PID: 1 Comm: swapper/0 Not tainted 4.9.0-rc8+ #58
> [ 5.105822] Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon D05 UEFI 16.08 RC1 12/08/2016
> [ 5.114860] task: fffffe13f23d7400 task.stack: fffffe13f66f8000
> [ 5.120903] PC is at move_freepages+0x170/0x180
> [ 5.125514] LR is at move_freepages_block+0xa8/0xb8
> [ 5.130479] pc : [<fffffc00081e9588>] lr : [<fffffc00081e9640>] pstate: 200000c5
> [ 5.138008] sp : fffffe13f66fb910
> [ 5.141375] x29: fffffe13f66fb910 x28: 00000000ffffff80
> [ 5.146784] x27: fffffdff800b0000 x26: fffffe13fbf62b90
> [ 5.152193] x25: 000000000000000a x24: fffffdff800b0020
> [ 5.157598] x23: 0000000000000000 x22: fffffdff800b0000
> [ 5.163006] x21: fffffdff800fffc0 x20: fffffe13fbf62680
> [ 5.168412] x19: fffffdff80080000 x18: 000000004c4d6d26
> [ 5.173820] x17: 0000000000000000 x16: 0000000000000000
> [ 5.179226] x15: 000000005c589429 x14: 0000000000000000
> [ 5.184634] x13: 0000000000000000 x12: 00000000fe2ce6e0
> [ 5.190039] x11: 00000000bb5b525b x10: 00000000bb48417b
> [ 5.195446] x9 : 0000000000000068 x8 : 0000000000000000
> [ 5.200854] x7 : fffffe13fbff2680 x6 : 0000000000000001
> [ 5.206260] x5 : fffffc0008e12328 x4 : 0000000000000000
> [ 5.211667] x3 : fffffe13fbf62680 x2 : 0000000000000000
> [ 5.217073] x1 : fffffe13fbff2680 x0 : fffffe13fbf62680
> [...]
> [ 5.768991] [<fffffc00081e9588>] move_freepages+0x170/0x180
> [ 5.774664] [<fffffc00081e9640>] move_freepages_block+0xa8/0xb8
> [ 5.780691] [<fffffc00081e9bbc>] __rmqueue+0x494/0x5f0
> [ 5.785921] [<fffffc00081eb10c>] get_page_from_freelist+0x5ec/0xb58
> [ 5.792302] [<fffffc00081ebc4c>] __alloc_pages_nodemask+0x144/0xd08
> [ 5.798687] [<fffffc0008240514>] alloc_page_interleave+0x64/0xc0
> [ 5.804801] [<fffffc0008240b20>] alloc_pages_current+0x108/0x168
> [ 5.810920] [<fffffc0008c75410>] atomic_pool_init+0x78/0x1cc
> [ 5.816680] [<fffffc0008c755a0>] arm64_dma_init+0x3c/0x44
> [ 5.822180] [<fffffc0008082d94>] do_one_initcall+0x44/0x138
> [ 5.827853] [<fffffc0008c70d54>] kernel_init_freeable+0x1ec/0x28c
> [ 5.834060] [<fffffc00088a79f0>] kernel_init+0x18/0x110
> [ 5.839378] [<fffffc0008082b30>] ret_from_fork+0x10/0x20
> [ 5.844785] Code: 9108a021 9400afc5 d4210000 d503201f (d4210000)
> [ 5.851024] ---[ end trace 3382df1ae82057db ]---
On 09.12.16 21:15:12, Yisheng Xie wrote:
> For invalid pages, their zone and node information is not initialized, and it
> do have risk to trigger the BUG_ON, so I have a silly question,
> why not just change the BUG_ON:
We need to get the page handling correct. Modifying the BUG_ON() just
hides that something is wrong.
-Robert
> -----------
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6de9440..af199b8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1860,12 +1860,13 @@ int move_freepages(struct zone *zone,
> * Remove at a later date when no bug reports exist related to
> * grouping pages by mobility
> */
> - VM_BUG_ON(page_zone(start_page) != page_zone(end_page));
> + VM_BUG_ON(early_pfn_valid(start_page) && early_pfn_valid(end_page) &&
> + page_zone(start_page) != page_zone(end_page));
> #endif
>
> for (page = start_page; page <= end_page;) {
> /* Make sure we are not inadvertently changing nodes */
> - VM_BUG_ON_PAGE(page_to_nid(page) != zone_to_nid(zone), page);
> + VM_BUG_ON_PAGE(early_pfn_valid(page) && (page_to_nid(page) != zone_to_nid(zone)), page);
>
> if (!pfn_valid_within(page_to_pfn(page))) {
> page++;
On 9 December 2016 at 14:52, Robert Richter <[email protected]> wrote:
> On 09.12.16 21:15:12, Yisheng Xie wrote:
>> For invalid pages, their zone and node information is not initialized, and it
>> do have risk to trigger the BUG_ON, so I have a silly question,
>> why not just change the BUG_ON:
>
> We need to get the page handling correct. Modifying the BUG_ON() just
> hides that something is wrong.
>
Actually, I think this is a reasonable question. We are trying very
hard to work around the BUG_ON(), which arguably does something wrong
by calling page_to_nid() on a struct page without checking if it is a
valid page.
Looking at commit 344c790e3821 ("mm: make setup_zone_migrate_reserve()
aware of overlapping nodes"), the BUG_ON() has a specific purpose
related to adjacent zones.
What will go wrong if we ignore this check?
>
>> -----------
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 6de9440..af199b8 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1860,12 +1860,13 @@ int move_freepages(struct zone *zone,
>> * Remove at a later date when no bug reports exist related to
>> * grouping pages by mobility
>> */
>> - VM_BUG_ON(page_zone(start_page) != page_zone(end_page));
>> + VM_BUG_ON(early_pfn_valid(start_page) && early_pfn_valid(end_page) &&
>> + page_zone(start_page) != page_zone(end_page));
>> #endif
>>
>> for (page = start_page; page <= end_page;) {
>> /* Make sure we are not inadvertently changing nodes */
>> - VM_BUG_ON_PAGE(page_to_nid(page) != zone_to_nid(zone), page);
>> + VM_BUG_ON_PAGE(early_pfn_valid(page) && (page_to_nid(page) != zone_to_nid(zone)), page);
>>
>> if (!pfn_valid_within(page_to_pfn(page))) {
>> page++;
>