This fixes the issue reported by Robert Richter where the fact that
the node id of struct pages covered by NOMAP regions is not initialized,
triggering a VM_BUG_ON() in the mm code.
I know that this approach is the least preferred option by Robert, but it
has been used successfully in the downstream Linaro Enterprise kernel,
running on HiSilicon D05, which suffered from the same issue as Cavium
ThunderX where it was originally reported.
Given that the other proposed solutions either fail to solve the issue
completely, or cause regressions in other code (hibernate), I think this
issue is appropriate for merging now, and backported to -stable. If there
are performance concerns, we can try to improve on this solution, which
could include reverting patch #2 altogether, for all I care.
Patch #1 fixes a bug in the generic mm code where a struct page is
dereferenced before pfn_valid() is called. This should probably go to
stable regardless of where the arm64 discussion goes.
Patch #2 enables CONFIG_HOLES_IN_ZONE for arm64 numa, causing the kernel
to no longer assume that all pages in a zone have valid struct pages
associated with them.
Ard Biesheuvel (2):
mm: don't dereference struct page fields of invalid pages
arm64: mm: enable CONFIG_HOLES_IN_ZONE for NUMA
arch/arm64/Kconfig | 4 ++++
mm/page_alloc.c | 6 +++---
2 files changed, 7 insertions(+), 3 deletions(-)
--
2.7.4
The VM_BUG_ON() check in move_freepages() checks whether the node
id of a page matches the node id of its zone. However, it does this
before having checked whether the struct page pointer refers to a
valid struct page to begin with. This is guaranteed in most cases,
but may not be the case if CONFIG_HOLES_IN_ZONE=y.
So reorder the VM_BUG_ON() with the pfn_valid_within() check.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
mm/page_alloc.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f64e7bcb43b7..4e298e31fa86 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1864,14 +1864,14 @@ int move_freepages(struct zone *zone,
#endif
for (page = start_page; page <= end_page;) {
- /* Make sure we are not inadvertently changing nodes */
- VM_BUG_ON_PAGE(page_to_nid(page) != zone_to_nid(zone), page);
-
if (!pfn_valid_within(page_to_pfn(page))) {
page++;
continue;
}
+ /* Make sure we are not inadvertently changing nodes */
+ VM_BUG_ON_PAGE(page_to_nid(page) != zone_to_nid(zone), page);
+
if (!PageBuddy(page)) {
page++;
continue;
--
2.7.4
The NUMA code may get confused by the presence of NOMAP regions within
zones, resulting in spurious BUG() checks where the node id deviates
from the containing zone's node id.
Since the kernel has no business reasoning about node ids of pages it
does not own in the first place, enable CONFIG_HOLES_IN_ZONE to ensure
that such pages are disregarded.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/Kconfig | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 111742126897..0472afe64d55 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -614,6 +614,10 @@ config NEED_PER_CPU_EMBED_FIRST_CHUNK
def_bool y
depends on NUMA
+config HOLES_IN_ZONE
+ def_bool y
+ depends on NUMA
+
source kernel/Kconfig.preempt
source kernel/Kconfig.hz
--
2.7.4
I was going to do some measurements but my kernel crashes now with a
page fault in efi_rtc_probe():
[ 21.663393] Unable to handle kernel paging request at virtual address 20251000
[ 21.663396] pgd = ffff000009090000
[ 21.663401] [20251000] *pgd=0000010ffff90003
[ 21.663402] , *pud=0000010ffff90003
[ 21.663404] , *pmd=0000000fdc030003
[ 21.663405] , *pte=00e8832000250707
The sparsemem config requires the whole section to be initialized.
Your patches do not address this.
On 14.12.16 09:11:47, Ard Biesheuvel wrote:
> +config HOLES_IN_ZONE
> + def_bool y
> + depends on NUMA
This enables pfn_valid_within() for arm64 and causes the check for
each page of a section. The arm64 implementation of pfn_valid() is
already expensive (traversing memblock areas). Now, this is increased
by a factor of 2^18 for 4k page size (16384 for 64k). We need to
initialize the whole section to avoid that.
-Robert
[ 21.663393] Unable to handle kernel paging request at virtual address 20251000
[ 21.663396] pgd = ffff000009090000
[ 21.663401] [20251000] *pgd=0000010ffff90003
[ 21.663402] , *pud=0000010ffff90003
[ 21.663404] , *pmd=0000000fdc030003
[ 21.663405] , *pte=00e8832000250707
[ 21.663405]
[ 21.663411] Internal error: Oops: 96000047 [#1] SMP
[ 21.663416] Modules linked in:
[ 21.663425] CPU: 49 PID: 1 Comm: swapper/0 Tainted: G W 4.9.0.0.vanilla10-00002-g429605e9ab0a #1
[ 21.663426] Hardware name: http://www.cavium.com ThunderX CRB-2S/ThunderX CRB-2S, BIOS 0.3 Sep 13 2016
[ 21.663429] task: ffff800feee6bc00 task.stack: ffff800fec050000
[ 21.663433] PC is at 0x201ff820
[ 21.663434] LR is at 0x201fdfc0
[ 21.663435] pc : [<00000000201ff820>] lr : [<00000000201fdfc0>] pstate: 20000045
[ 21.663437] sp : ffff800fec053b70
[ 21.663440] x29: ffff800fec053bc0 x28: 0000000000000000
[ 21.663443] x27: ffff000008ce3e08 x26: ffff000008c52568
[ 21.663445] x25: ffff000008bf045c x24: ffff000008bdb828
[ 21.663448] x23: 0000000000000000 x22: 0000000000000040
[ 21.663451] x21: ffff800fec053bb8 x20: 0000000020251000
[ 21.663453] x19: ffff800fec053c20 x18: 0000000000000000
[ 21.663456] x17: 0000000000000000 x16: 00000000bbb67a65
[ 21.663459] x15: ffffffffffffffff x14: ffff810016ea291c
[ 21.663461] x13: ffff810016ea2181 x12: 0000000000000030
[ 21.663464] x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f
[ 21.663467] x9 : feff716475687163 x8 : ffffffffffffffff
[ 21.663469] x7 : 83f0680000000000 x6 : 0000000000000000
[ 21.663472] x5 : ffff800fc187aab9 x4 : 0002000000000000
[ 21.663474] x3 : ffff800fec053bb8 x2 : 0000000000000000
[ 21.663477] x1 : 83f0680000000000 x0 : 0000000020251000
[ 21.663478]
[ 21.663479] Process swapper/0 (pid: 1, stack limit = 0xffff800fec050020)
...
[ 21.663605] [<00000000201ff820>] 0x201ff820
[ 21.663617] [<ffff000008c3eef4>] efi_rtc_probe+0x24/0x78
[ 21.663625] [<ffff000008586c88>] platform_drv_probe+0x60/0xc8
[ 21.663636] [<ffff0000085845d4>] driver_probe_device+0x26c/0x420
[ 21.663639] [<ffff0000085848ac>] __driver_attach+0x124/0x128
[ 21.663642] [<ffff000008581e08>] bus_for_each_dev+0x70/0xb0
[ 21.663644] [<ffff000008583c30>] driver_attach+0x30/0x40
[ 21.663647] [<ffff000008583668>] bus_add_driver+0x200/0x2b8
[ 21.663650] [<ffff000008585430>] driver_register+0x68/0x100
[ 21.663652] [<ffff000008586e3c>] __platform_driver_probe+0x84/0x128
[ 21.663654] [<ffff000008c3eec8>] efi_rtc_driver_init+0x20/0x28
[ 21.663658] [<ffff000008082d94>] do_one_initcall+0x44/0x138
[ 21.663665] [<ffff000008bf0d0c>] kernel_init_freeable+0x1ac/0x24c
[ 21.663673] [<ffff00000885e7a0>] kernel_init+0x18/0x110
[ 21.663675] [<ffff000008082b30>] ret_from_fork+0x10/0x20
[ 21.663679] Code: f9400000 d5033d9f d65f03c0 d5033e9f (f9000001)
[ 21.663688] ---[ end trace e420ef9636e3c9b2 ]---
[ 21.663711] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[ 21.663711]
[ 21.663713] SMP: stopping secondary CPUs
[ 21.670234] Kernel Offset: disabled
[ 21.670235] Memory Limit: none
[ 22.681333] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
On 15 December 2016 at 15:39, Robert Richter <[email protected]> wrote:
> I was going to do some measurements but my kernel crashes now with a
> page fault in efi_rtc_probe():
>
> [ 21.663393] Unable to handle kernel paging request at virtual address 20251000
> [ 21.663396] pgd = ffff000009090000
> [ 21.663401] [20251000] *pgd=0000010ffff90003
> [ 21.663402] , *pud=0000010ffff90003
> [ 21.663404] , *pmd=0000000fdc030003
> [ 21.663405] , *pte=00e8832000250707
>
> The sparsemem config requires the whole section to be initialized.
> Your patches do not address this.
>
96000047 is a third level translation fault, and the PTE address has
RES0 bits set. I don't see how this is related to sparsemem, could you
explain?
> On 14.12.16 09:11:47, Ard Biesheuvel wrote:
>> +config HOLES_IN_ZONE
>> + def_bool y
>> + depends on NUMA
>
> This enables pfn_valid_within() for arm64 and causes the check for
> each page of a section. The arm64 implementation of pfn_valid() is
> already expensive (traversing memblock areas). Now, this is increased
> by a factor of 2^18 for 4k page size (16384 for 64k). We need to
> initialize the whole section to avoid that.
>
I know that. But if you want something for -stable, we should have
something that is correct first, and only then care about the
performance hit (if there is one)
Hi Robert,
On 2016/12/15 23:39, Robert Richter wrote:
> I was going to do some measurements but my kernel crashes now with a
> page fault in efi_rtc_probe():
>
> [ 21.663393] Unable to handle kernel paging request at virtual address 20251000
> [ 21.663396] pgd = ffff000009090000
> [ 21.663401] [20251000] *pgd=0000010ffff90003
> [ 21.663402] , *pud=0000010ffff90003
> [ 21.663404] , *pmd=0000000fdc030003
> [ 21.663405] , *pte=00e8832000250707
>
> The sparsemem config requires the whole section to be initialized.
> Your patches do not address this.
This patch set is running properly on D05, both the boot and
LTP MM stress test are ok, seems it's a different configuration
of memory mappings in firmware, just a stupid question, which
part is related to this problem, is it only the Reserved memory?
Thanks
Hanjun
On 15.12.16 16:07:26, Ard Biesheuvel wrote:
> On 15 December 2016 at 15:39, Robert Richter <[email protected]> wrote:
> > I was going to do some measurements but my kernel crashes now with a
> > page fault in efi_rtc_probe():
> >
> > [ 21.663393] Unable to handle kernel paging request at virtual address 20251000
> > [ 21.663396] pgd = ffff000009090000
> > [ 21.663401] [20251000] *pgd=0000010ffff90003
> > [ 21.663402] , *pud=0000010ffff90003
> > [ 21.663404] , *pmd=0000000fdc030003
> > [ 21.663405] , *pte=00e8832000250707
> >
> > The sparsemem config requires the whole section to be initialized.
> > Your patches do not address this.
> >
>
> 96000047 is a third level translation fault, and the PTE address has
> RES0 bits set. I don't see how this is related to sparsemem, could you
> explain?
When initializing the whole section it works. Maybe it uncovers
another bug. Did not yet start debugging this.
>
> > On 14.12.16 09:11:47, Ard Biesheuvel wrote:
> >> +config HOLES_IN_ZONE
> >> + def_bool y
> >> + depends on NUMA
> >
> > This enables pfn_valid_within() for arm64 and causes the check for
> > each page of a section. The arm64 implementation of pfn_valid() is
> > already expensive (traversing memblock areas). Now, this is increased
> > by a factor of 2^18 for 4k page size (16384 for 64k). We need to
> > initialize the whole section to avoid that.
> >
>
> I know that. But if you want something for -stable, we should have
> something that is correct first, and only then care about the
> performance hit (if there is one)
I would prefer to check for a performance penalty *before* we put it
into stable. There is nor risk at all with the patch I am proposing.
See: https://lkml.org/lkml/2016/12/16/412
-Robert
On 16.12.16 09:57:20, Hanjun Guo wrote:
> Hi Robert,
>
> On 2016/12/15 23:39, Robert Richter wrote:
> >I was going to do some measurements but my kernel crashes now with a
> >page fault in efi_rtc_probe():
> >
> >[ 21.663393] Unable to handle kernel paging request at virtual address 20251000
> >[ 21.663396] pgd = ffff000009090000
> >[ 21.663401] [20251000] *pgd=0000010ffff90003
> >[ 21.663402] , *pud=0000010ffff90003
> >[ 21.663404] , *pmd=0000000fdc030003
> >[ 21.663405] , *pte=00e8832000250707
> >
> >The sparsemem config requires the whole section to be initialized.
> >Your patches do not address this.
>
> This patch set is running properly on D05, both the boot and
> LTP MM stress test are ok, seems it's a different configuration
> of memory mappings in firmware, just a stupid question, which
> part is related to this problem, is it only the Reserved memory?
The problem are efi reserved regions that are no longer reserved but
marked as nomap pages. Those are excluded from page initialization
causing parts of a memory section not being initialized.
-Robert