2021-01-27 21:47:48

by Lukasz Majczak

Subject: PROBLEM: Crash after mm: fix initialization of struct page for holes in memory layout

Crash after mm: fix initialization of struct page for holes in memory layout

Hi,
I was trying to run v5.11-rc5 on my Samsung Chromebook Pro (Caroline),
but I've noticed it crashes - unfortunately it seems to happen at
a very early stage - no output to the console nor to the screen, so I
have started a bisect (between 5.11-rc4 - which works just fine - and
5.11-rc5).
The bisect result points to:

d3921cb8be29 mm: fix initialization of struct page for holes in memory layout

Reproduction is just to build and load the kernel.

In case it helps, I am attaching:
- /proc/cpuinfo (from healthy system):
https://gist.github.com/semihalf-majczak-lukasz/3517867bf39f07377c1a785b64a97066
- my .config file (for a broken system):
https://gist.github.com/semihalf-majczak-lukasz/584b329f1bf3e43b53efe8e18b5da33c

If there is anything I could add/do/test to help fix this please let me know.

Best regards
Lukasz


2021-01-27 22:00:25

by Lukasz Majczak

Subject: Re: PROBLEM: Crash after mm: fix initialization of struct page for holes in memory layout

Hi Mike,

Actually I have a serial console attached (via servo device), but
there is no output :( and also the reboot/crash is very fast/immediate
after power on.

Best regards
Lukasz


2021-01-27 22:00:26

by Mike Rapoport

Subject: Re: PROBLEM: Crash after mm: fix initialization of struct page for holes in memory layout

Hi Lukasz,

On Wed, Jan 27, 2021 at 10:22:29AM +0100, Łukasz Majczak wrote:
> Crash after mm: fix initialization of struct page for holes in memory layout
>
> Hi,
> I was trying to run v5.11-rc5 on my Samsung Chromebook Pro (Caroline),
> but I've noticed it crashes - unfortunately it seems to happen at
> a very early stage - no output to the console nor to the screen, so I
> have started a bisect (between 5.11-rc4 - which works just fine - and
> 5.11-rc5).
> The bisect result points to:
>
> d3921cb8be29 mm: fix initialization of struct page for holes in memory layout
>
> Reproduction is just to build and load the kernel.
>
> In case it helps, I am attaching:
> - /proc/cpuinfo (from healthy system):
> https://gist.github.com/semihalf-majczak-lukasz/3517867bf39f07377c1a785b64a97066
> - my .config file (for a broken system):
> https://gist.github.com/semihalf-majczak-lukasz/584b329f1bf3e43b53efe8e18b5da33c
>
> If there is anything I could add/do/test to help fix this please let me know.

Chris Wilson also reported boot failures on several Chromebooks:

https://lore.kernel.org/lkml/[email protected]

I presume serial console is not an option, so if you could boot with
earlyprintk=vga and see if there is anything useful printed on the screen
it would be really helpful.

> Best regards
> Lukasz

--
Sincerely yours,
Mike.

2021-01-27 23:50:39

by Mike Rapoport

Subject: Re: PROBLEM: Crash after mm: fix initialization of struct page for holes in memory layout

On Wed, Jan 27, 2021 at 11:08:17AM +0100, Łukasz Majczak wrote:
> Hi Mike,
>
> Actually I have a serial console attached (via servo device), but
> there is no output :( and also the reboot/crash is very fast/immediate
> after power on.

If you boot with earlyprintk=serial are there any messages?


--
Sincerely yours,
Mike.

2021-01-27 23:55:23

by Lukasz Majczak

Subject: Re: PROBLEM: Crash after mm: fix initialization of struct page for holes in memory layout

Unfortunately nothing :( my current kernel command line contains:
console=ttyS0,115200n8 debug earlyprintk=serial loglevel=7

I was thinking about using earlycon, but it seems to be blocked.
(I think the lack of earlycon might be related to Chromebook HW
security design. There is an EC controller which is a part of AP ->
serial chain as kernel messages are considered sensitive from a
security standpoint.)

Best regards,
Lukasz


2021-01-27 23:58:02

by Lukasz Majczak

Subject: Re: PROBLEM: Crash after mm: fix initialization of struct page for holes in memory layout

Hi Mike,

I have started bisecting your patch and I have figured out that there
might be something wrong with clamping - with these lines commented out
it started to work.
The full log (with logs from the patch below) can be found here:
https://gist.github.com/semihalf-majczak-lukasz/3cecbab0ddc59a6c3ce11ddc29645725
It's fresh - I haven't analyzed it yet, just sharing in the hope it will help.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eed54ce26ad1..9f4468c413a1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7093,9 +7093,11 @@ static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn,
 	zone_spfn = arch_zone_lowest_possible_pfn[zone];
 	zone_epfn = arch_zone_highest_possible_pfn[zone];

-	spfn = clamp(spfn, zone_spfn, zone_epfn);
-	epfn = clamp(epfn, zone_spfn, zone_epfn);
-
+	//spfn = clamp(spfn, zone_spfn, zone_epfn);
+	//epfn = clamp(epfn, zone_spfn, zone_epfn);
+	pr_info("LMA DBG: zone_spfn: %llx, zone_epfn %llx\n", zone_spfn, zone_epfn);
+	pr_info("LMA DBG: spfn: %llx, epfn %llx\n", spfn, epfn);
+	pr_info("LMA DBG: clamp_spfn: %llx, clamp_epfn %llx\n", clamp(spfn, zone_spfn, zone_epfn), clamp(epfn, zone_spfn, zone_epfn));
 	for (pfn = spfn; pfn < epfn; pfn++) {
 		if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
 			pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)

Best regards,
Lukasz



2021-01-28 00:13:30

by Mike Rapoport

Subject: Re: PROBLEM: Crash after mm: fix initialization of struct page for holes in memory layout

Hi Lukasz,

On Wed, Jan 27, 2021 at 02:15:53PM +0100, Łukasz Majczak wrote:
> Hi Mike,
>
> I have started bisecting your patch and I have figured out that there
> might be something wrong with clamping - with these lines commented out
> it started to work.
> The full log (with logs from the patch below) can be found here:
> https://gist.github.com/semihalf-majczak-lukasz/3cecbab0ddc59a6c3ce11ddc29645725
> It's fresh - I haven't analyzed it yet, just sharing in the hope it will help.

Thanks, that helps!

The first page is never considered by the kernel as memory and so
arch_zone_lowest_possible_pfn[ZONE_DMA] is set to 0x1000. As a result,
init_unavailable_mem() skips pfn 0 and then __SetPageReserved(page) in
reserve_bootmem_region() panics because the struct page for pfn 0 remains
poisoned.

Can you please try the below patch on top of v5.11-rc5?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 783913e41f65..3ce9ef238dfc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7083,10 +7083,11 @@ void __init free_area_init_memoryless_node(int nid)
 static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn,
 					 int zone, int nid)
 {
-	unsigned long pfn, zone_spfn, zone_epfn;
+	unsigned long pfn, zone_spfn = 0, zone_epfn;
 	u64 pgcnt = 0;

-	zone_spfn = arch_zone_lowest_possible_pfn[zone];
+	if (zone > 0)
+		zone_spfn = arch_zone_highest_possible_pfn[zone - 1];
 	zone_epfn = arch_zone_highest_possible_pfn[zone];

 	spfn = clamp(spfn, zone_spfn, zone_epfn);
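
For anyone following the thread, here is a minimal user-space sketch of the clamping behaviour described above. It is not the kernel code: clamp_ul() and count_initialized() are hypothetical stand-ins for the kernel's clamp() macro and the pfn loop in init_unavailable_range(), and the caller chooses the zone bounds.

```c
#include <assert.h>

/* Simplified stand-in for the kernel's clamp() macro (assumption). */
static unsigned long clamp_ul(unsigned long val, unsigned long lo,
			      unsigned long hi)
{
	if (val < lo)
		return lo;
	if (val > hi)
		return hi;
	return val;
}

/*
 * Hypothetical model of the loop: clamp the hole [spfn, epfn) to the
 * zone bounds and count the pfns that would get their struct page
 * initialized. *covers_pfn0 is set to 1 if pfn 0 is among them.
 */
static unsigned long count_initialized(unsigned long spfn, unsigned long epfn,
				       unsigned long zone_spfn,
				       unsigned long zone_epfn,
				       int *covers_pfn0)
{
	unsigned long pfn, count = 0;

	assert(zone_spfn <= zone_epfn);

	spfn = clamp_ul(spfn, zone_spfn, zone_epfn);
	epfn = clamp_ul(epfn, zone_spfn, zone_epfn);

	*covers_pfn0 = 0;
	for (pfn = spfn; pfn < epfn; pfn++) {
		if (pfn == 0)
			*covers_pfn0 = 1;
		count++;
	}
	return count;
}
```

With a hole covering only pfn 0, i.e. the range [0, 1), a zone lower bound of 1 clamps both ends to 1 and the loop never runs, so pfn 0 is left untouched. With a lower bound of 0, as the patch above uses for the first zone, the same call visits pfn 0 exactly once.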



--
Sincerely yours,
Mike.

2021-01-28 01:38:34

by Lukasz Majczak

Subject: Re: PROBLEM: Crash after mm: fix initialization of struct page for holes in memory layout

Hi Mike,

Great! It seems to work well - I have built a vanilla kernel v5.11-rc5
with your patch and it boots properly.
The full log is available here:
https://gist.github.com/semihalf-majczak-lukasz/ea89bf52f6fad7907a18d1870e7ce9bd

Best regards,
Lukasz


2021-01-28 02:49:06

by Baoquan He

Subject: Re: PROBLEM: Crash after mm: fix initialization of struct page for holes in memory layout

On 01/27/21 at 08:26pm, Mike Rapoport wrote:
> Hi Lukasz,
>
> On Wed, Jan 27, 2021 at 02:15:53PM +0100, Łukasz Majczak wrote:
> > Hi Mike,
> >
> > I have started bisecting your patch and I have figured out that there
> > might be something wrong with clamping - with these lines commented out
> > it started to work.
> > The full log (with logs from the patch below) can be found here:
> > https://gist.github.com/semihalf-majczak-lukasz/3cecbab0ddc59a6c3ce11ddc29645725
> > It's fresh - I haven't analyzed it yet, just sharing in the hope it will help.
>
> Thanks, that helps!
>
> The first page is never considered by the kernel as memory and so
> arch_zone_lowest_possible_pfn[ZONE_DMA] is set to 0x1000. As a result,
> init_unavailable_mem() skips pfn 0 and then __SetPageReserved(page) in
> reserve_bootmem_region() panics because the struct page for pfn 0 remains
> poisoned.

It's a great finding and a quick fix. Previously I tested my cleanup
patches based on Mike's commit 9ebeee59af4cdd4d ("mm: fix initialization
of struct page for holes in memory layout") on a hardware system and
didn't hit this crash. But this crash seems to be always reproducible,
so I wonder why I didn't reproduce it.


2021-01-28 09:36:41

by Mike Rapoport

Subject: Re: PROBLEM: Crash after mm: fix initialization of struct page for holes in memory layout

On Thu, Jan 28, 2021 at 10:45:49AM +0800, Baoquan He wrote:
> On 01/27/21 at 08:26pm, Mike Rapoport wrote:
> > Hi Lukasz,
> >
> > On Wed, Jan 27, 2021 at 02:15:53PM +0100, Łukasz Majczak wrote:
> > > Hi Mike,
> > >
> > > I have started bisecting your patch and I have figured out that there
> > > might be something wrong with clamping - with these lines commented out
> > > it started to work.
> > > The full log (with logs from the patch below) can be found here:
> > > https://gist.github.com/semihalf-majczak-lukasz/3cecbab0ddc59a6c3ce11ddc29645725
> > > It's fresh - I haven't analyzed it yet, just sharing in the hope it will help.
> >
> > Thanks, that helps!
> >
> > The first page is never considered by the kernel as memory and so
> > arch_zone_lowest_possible_pfn[ZONE_DMA] is set to pfn 1 (physical address
> > 0x1000). As a result, init_unavailable_range() skips pfn 0 and then
> > __SetPageReserved(page) in reserve_bootmem_region() panics because the
> > struct page for pfn 0 remains poisoned.
>
> It's a great finding and quick fix.

Unfortunately it's only a partial fix, as it does not address the problem of
pfn 0 being outside any zone. Its struct page gets linked to ZONE_DMA in
init_unavailable_range(), but zones[ZONE_DMA]->zone_start_pfn is 1.

I'm now looking at how to fix this as well; hopefully I'll have a patch Real
Soon (tm) :)
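
As a rough illustration of the clamping issue discussed in this thread, here
is a userspace sketch (not the kernel code). clamp_ul() and pfns_covered() are
hypothetical helpers standing in for the kernel's clamp() macro and for the
range computation in init_unavailable_range(); the ZONE_DMA upper bound of
0x1000 (16 MiB with 4 KiB pages) is an assumption about a typical x86 layout.

```c
/* Hypothetical stand-in for the kernel's clamp() macro. */
unsigned long clamp_ul(unsigned long v, unsigned long lo, unsigned long hi)
{
	if (v < lo)
		return lo;
	if (v > hi)
		return hi;
	return v;
}

/*
 * Number of pfns of the hole [spfn, epfn) that remain after clamping the
 * hole to the zone boundaries [zone_spfn, zone_epfn].
 */
unsigned long pfns_covered(unsigned long spfn, unsigned long epfn,
			   unsigned long zone_spfn, unsigned long zone_epfn)
{
	spfn = clamp_ul(spfn, zone_spfn, zone_epfn);
	epfn = clamp_ul(epfn, zone_spfn, zone_epfn);
	return epfn - spfn;
}
```

With the hole [0, 1) and a lower bound of 1 (the lowest usable pfn),
pfns_covered(0, 1, 1, 0x1000) yields 0, i.e. pfn 0 is never initialized.
With a lower bound of 0 for the first zone, pfns_covered(0, 1, 0, 0x1000)
yields 1 and pfn 0 is covered.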

> Previously I tested my cleanup patches based on Mike's commit
> 9ebeee59af4cdd4d ("mm: fix initialization of struct page for holes in
> memory layout") on a hardware system and didn't hit this crash. But this
> crash seems to be always reproducible, so I wonder why I didn't
> reproduce it.

This crash is reproducible on systems that do not report pfn 0 as usable,
e.g. on the Chromebook Lukasz is using it is reported as 'type 16':

[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] type 16
[ 0.000000] BIOS-e820: [mem 0x0000000000001000-0x000000000009ffff] usable

On my laptop and on a bunch of other systems I have, it is usable:

[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009cfff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009d000-0x000000000009ffff] reserved


> >
> > Can you please try the below patch on top of v5.11-rc5?
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 783913e41f65..3ce9ef238dfc 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -7083,10 +7083,11 @@ void __init free_area_init_memoryless_node(int nid)
> >  static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn,
> >  					  int zone, int nid)
> >  {
> > -	unsigned long pfn, zone_spfn, zone_epfn;
> > +	unsigned long pfn, zone_spfn = 0, zone_epfn;
> >  	u64 pgcnt = 0;
> >
> > -	zone_spfn = arch_zone_lowest_possible_pfn[zone];
> > +	if (zone > 0)
> > +		zone_spfn = arch_zone_highest_possible_pfn[zone - 1];
> >  	zone_epfn = arch_zone_highest_possible_pfn[zone];
> >
> >  	spfn = clamp(spfn, zone_spfn, zone_epfn);
> >
> >
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index eed54ce26ad1..9f4468c413a1 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -7093,9 +7093,11 @@ static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn,
> > >  	zone_spfn = arch_zone_lowest_possible_pfn[zone];
> > >  	zone_epfn = arch_zone_highest_possible_pfn[zone];
> > >
> > > -	spfn = clamp(spfn, zone_spfn, zone_epfn);
> > > -	epfn = clamp(epfn, zone_spfn, zone_epfn);
> > > -
> > > +	//spfn = clamp(spfn, zone_spfn, zone_epfn);
> > > +	//epfn = clamp(epfn, zone_spfn, zone_epfn);
> > > +	pr_info("LMA DBG: zone_spfn: %llx, zone_epfn %llx\n", zone_spfn, zone_epfn);
> > > +	pr_info("LMA DBG: spfn: %llx, epfn %llx\n", spfn, epfn);
> > > +	pr_info("LMA DBG: clamp_spfn: %llx, clamp_epfn %llx\n", clamp(spfn, zone_spfn, zone_epfn), clamp(epfn, zone_spfn, zone_epfn));
> > >  	for (pfn = spfn; pfn < epfn; pfn++) {
> > >  		if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
> > >  			pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
> > >
> > > Best regards,
> > > Lukasz
> > >
> > >
> > > śr., 27 sty 2021 o 13:15 Łukasz Majczak <[email protected]> napisał(a):
> > > >
> > > > Unfortunately nothing :( my current kernel command line contains:
> > > > console=ttyS0,115200n8 debug earlyprintk=serial loglevel=7
> > > >
> > > > I was thinking about using earlycon, but it seems to be blocked.
> > > > (I think the lack of earlycon might be related to Chromebook HW
> > > > security design. There is an EC controller which is a part of AP ->
> > > > serial chain as kernel messages are considered sensitive from a
> > > > security standpoint.)
> > > >
> > > > Best regards,
> > > > Lukasz
> > > >
> > > > śr., 27 sty 2021 o 12:19 Mike Rapoport <[email protected]> napisał(a):
> > > > >
> > > > > On Wed, Jan 27, 2021 at 11:08:17AM +0100, Łukasz Majczak wrote:
> > > > > > Hi Mike,
> > > > > >
> > > > > > Actually I have a serial console attached (via servo device), but
> > > > > > there is no output :( and also the reboot/crash is very fast/immediate
> > > > > > after power on.
> > > > >
> > > > > If you boot with earlyprintk=serial are there any messages?
> > > > >
> > > > > > Best regards
> > > > > > Lukasz
> > > > > >
> > > > > > śr., 27 sty 2021 o 11:05 Mike Rapoport <[email protected]> napisał(a):
> > > > > > >
> > > > > > > Hi Lukasz,
> > > > > > >
> > > > > > > On Wed, Jan 27, 2021 at 10:22:29AM +0100, Łukasz Majczak wrote:
> > > > > > > > Crash after mm: fix initialization of struct page for holes in memory layout
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > > I was trying to run v5.11-rc5 on my Samsung Chromebook Pro (Caroline),
> > > > > > > > but I've noticed it has crashed - unfortunately it seems to happen at
> > > > > > > > a very early stage - No output to the console nor to the screen, so I
> > > > > > > > > have started a bisect (between 5.11-rc4 - which works just fine - and
> > > > > > > > > 5.11-rc5),
> > > > > > > > > bisect results point to:
> > > > > > > >
> > > > > > > > d3921cb8be29 mm: fix initialization of struct page for holes in memory layout
> > > > > > > >
> > > > > > > > Reproduction is just to build and load the kernel.
> > > > > > > >
> > > > > > > > > In case it helps, I am attaching:
> > > > > > > > - /proc/cpuinfo (from healthy system):
> > > > > > > > https://gist.github.com/semihalf-majczak-lukasz/3517867bf39f07377c1a785b64a97066
> > > > > > > > - my .config file (for a broken system):
> > > > > > > > https://gist.github.com/semihalf-majczak-lukasz/584b329f1bf3e43b53efe8e18b5da33c
> > > > > > > >
> > > > > > > > If there is anything I could add/do/test to help fix this please let me know.
> > > > > > >
> > > > > > > Chris Wilson also reported boot failures on several Chromebooks:
> > > > > > >
> > > > > > > https://lore.kernel.org/lkml/[email protected]
> > > > > > >
> > > > > > > I presume serial console is not an option, so if you could boot with
> > > > > > > earlyprintk=vga and see if there is anything useful printed on the screen
> > > > > > > it would be really helpful.
> > > > > > >
> > > > > > > > Best regards
> > > > > > > > Lukasz
> > > > > > >
> > > > > > > --
> > > > > > > Sincerely yours,
> > > > > > > Mike.
> > > > >
> > > > > --
> > > > > Sincerely yours,
> > > > > Mike.
> >
> > --
> > Sincerely yours,
> > Mike.
> >
>

--
Sincerely yours,
Mike.