2019-01-25 04:51:02

by Linus Torvalds

[permalink] [raw]
Subject: Re: kernel panic due to https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2830bf6f05fb3e05bc4743274b806c821807a684

[ Just adding a lot of other people to the cc ]

Robert, could you add a dmesg of a successful boot to that bugzilla,
or just as an attachement in email to this group of people..

This looks to be with the Fedora kernel config. Two people reporting
it, it looks like similar machines.

I assume it's some odd memory sizing detail that happens to trigger a
particular case.

I absolutely *hate* those "let's lazily clear 'struct page' array"
patches. They've caused problems before, and I'm not convinced the
pain has been worth it. Maybe we should revert them (again) and
promise to never ever take things like that again? Andrew?

Linus

On Fri, Jan 25, 2019 at 3:21 AM robert shteynfeld
<[email protected]> wrote:
>
> Hi,
>
> It looks like the above commit is causing an early kernel panic during boot on my system starting with kernel 4.19.13. Here's the redhat bug regarding this: https://bugzilla.redhat.com/show_bug.cgi?id=1666948.
>
> Cheers,
>
> Robert


2019-01-25 07:38:11

by Michal Hocko

[permalink] [raw]
Subject: Re: kernel panic due to https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2830bf6f05fb3e05bc4743274b806c821807a684

On Fri 25-01-19 17:48:32, Linus Torvalds wrote:
> [ Just adding a lot of other people to the cc ]
>
> Robert, could you add a dmesg of a successful boot to that bugzilla,
> or just as an attachement in email to this group of people..
>
> This looks to be with the Fedora kernel config. Two people reporting
> it, it looks like similar machines.
>
> I assume it's some odd memory sizing detail that happens to trigger a
> particular case.

Quite possible.

> I absolutely *hate* those "let's lazily clear 'struct page' array"
> patches. They've caused problems before, and I'm not convinced the
> pain has been worth it. Maybe we should revert them (again) and
> promise to never ever take things like that again? Andrew?

The performance numbers were pretty dramatic on the other hand. This was
especially seen for very large NVDIMMs initialization when a userspace
was timing out without these applied.

I am certainly not very happy about regressions which we still do see. I
was worried about that early when reviewing these patches because it is
really hard to find all those weird places which simply happened to work
even when broken before. E.g. unitialized struct pages were simply
with zeroed and that means that they seemed to belong to node zero and
zone DMA and nothing really blown up. With the poisoning in place we
have an explicit VM_BUG_ON and Fedora kernels do enable VM debugging so
those problems are visible.

I still think we should chase after those issue and fix them regardless.
Lazy initialization revert will not solve those problems. It will just
paper over them.
--
Michal Hocko
SUSE Labs

2019-01-25 08:19:55

by Michal Hocko

[permalink] [raw]
Subject: Re: kernel panic due to https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2830bf6f05fb3e05bc4743274b806c821807a684

On Fri 25-01-19 08:37:04, Michal Hocko wrote:
> On Fri 25-01-19 17:48:32, Linus Torvalds wrote:
> > [ Just adding a lot of other people to the cc ]
> >
> > Robert, could you add a dmesg of a successful boot to that bugzilla,
> > or just as an attachement in email to this group of people..
> >
> > This looks to be with the Fedora kernel config. Two people reporting
> > it, it looks like similar machines.
> >
> > I assume it's some odd memory sizing detail that happens to trigger a
> > particular case.
>
> Quite possible.

Forgot to ask. Can we get a dmesg with 2830bf6f05fb ("mm,
memory_hotplug: initialize struct pages for the full memory section")
reverted and memblock=debug kernel command line parameter?
--
Michal Hocko
SUSE Labs

2019-01-25 08:31:53

by Michal Hocko

[permalink] [raw]
Subject: Re: kernel panic due to https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2830bf6f05fb3e05bc4743274b806c821807a684

On Fri 25-01-19 09:19:24, Michal Hocko wrote:
> On Fri 25-01-19 08:37:04, Michal Hocko wrote:
> > On Fri 25-01-19 17:48:32, Linus Torvalds wrote:
> > > [ Just adding a lot of other people to the cc ]
> > >
> > > Robert, could you add a dmesg of a successful boot to that bugzilla,
> > > or just as an attachement in email to this group of people..
> > >
> > > This looks to be with the Fedora kernel config. Two people reporting
> > > it, it looks like similar machines.
> > >
> > > I assume it's some odd memory sizing detail that happens to trigger a
> > > particular case.
> >
> > Quite possible.
>
> Forgot to ask. Can we get a dmesg with 2830bf6f05fb ("mm,
> memory_hotplug: initialize struct pages for the full memory section")
> reverted and memblock=debug kernel command line parameter?

And one more thing which I have overlook until now and it is not really
clear to me. One of th comments says
: The relevant part was:
: kernel bug at mm/page_alloc.c=790

I suppose this is 4.19 stable kernel because that would be
VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);

in __free_one_page. I do not really see how 2830bf6f05fb could make any
difference here. It simply zeroes out the rest of the mem section and
that is guaranteed to be allocated because we do not do subsections. The
above VM_BUG_ON says that we start allocating an unaligned pfn for its
order.

Or are there two issues reported in that bug?
--
Michal Hocko
SUSE Labs

2019-01-25 15:52:58

by robert shteynfeld

[permalink] [raw]
Subject: Re: kernel panic due to https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2830bf6f05fb3e05bc4743274b806c821807a684

The person who pointed to mm/page_alloc.c commits likely causing the
issue did not have time to build a patched/reverted kernel to confirm
his hypothesis. When I tried backing out the two separate commits he
suggested, the first commit (ie. the one in the subject) was the one
that when backed out fixed the boot issue. Reverting the second one
had no effect.

On Fri, Jan 25, 2019 at 3:29 AM Michal Hocko <[email protected]> wrote:
>
> On Fri 25-01-19 09:19:24, Michal Hocko wrote:
> > On Fri 25-01-19 08:37:04, Michal Hocko wrote:
> > > On Fri 25-01-19 17:48:32, Linus Torvalds wrote:
> > > > [ Just adding a lot of other people to the cc ]
> > > >
> > > > Robert, could you add a dmesg of a successful boot to that bugzilla,
> > > > or just as an attachement in email to this group of people..
> > > >
> > > > This looks to be with the Fedora kernel config. Two people reporting
> > > > it, it looks like similar machines.
> > > >
> > > > I assume it's some odd memory sizing detail that happens to trigger a
> > > > particular case.
> > >
> > > Quite possible.
> >
> > Forgot to ask. Can we get a dmesg with 2830bf6f05fb ("mm,
> > memory_hotplug: initialize struct pages for the full memory section")
> > reverted and memblock=debug kernel command line parameter?
>
> And one more thing which I have overlook until now and it is not really
> clear to me. One of th comments says
> : The relevant part was:
> : kernel bug at mm/page_alloc.c=790
>
> I suppose this is 4.19 stable kernel because that would be
> VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
>
> in __free_one_page. I do not really see how 2830bf6f05fb could make any
> difference here. It simply zeroes out the rest of the mem section and
> that is guaranteed to be allocated because we do not do subsections. The
> above VM_BUG_ON says that we start allocating an unaligned pfn for its
> order.
>
> Or are there two issues reported in that bug?
> --
> Michal Hocko
> SUSE Labs

2019-01-25 15:58:55

by Michal Hocko

[permalink] [raw]
Subject: Re: kernel panic due to https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2830bf6f05fb3e05bc4743274b806c821807a684

On Fri 25-01-19 10:52:06, robert shteynfeld wrote:
> The person who pointed to mm/page_alloc.c commits likely causing the
> issue did not have time to build a patched/reverted kernel to confirm
> his hypothesis. When I tried backing out the two separate commits he
> suggested, the first commit (ie. the one in the subject) was the one
> that when backed out fixed the boot issue. Reverting the second one
> had no effect.

OK, so can you provide the full dmesg (with revert) along wth
memblock=debug kernel parameter. Also would it be possible to configure
serial or net console to get the actual panic report please?
--
Michal Hocko
SUSE Labs

2019-01-25 16:17:09

by robert shteynfeld

[permalink] [raw]
Subject: Re: kernel panic due to https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2830bf6f05fb3e05bc4743274b806c821807a684

Attached is the dmesg from patched kernel.

I'm not an expert at debugging the kernel, obviously. I tried setting
up a serial console before without much luck as part of this debugging
session. I'll try again when I have some time, but I couldn't get to
the exact message before -- it scrolled by too quickly (and boot_delay
froze the output).

On Fri, Jan 25, 2019 at 10:58 AM Michal Hocko <[email protected]> wrote:
>
> On Fri 25-01-19 10:52:06, robert shteynfeld wrote:
> > The person who pointed to mm/page_alloc.c commits likely causing the
> > issue did not have time to build a patched/reverted kernel to confirm
> > his hypothesis. When I tried backing out the two separate commits he
> > suggested, the first commit (ie. the one in the subject) was the one
> > that when backed out fixed the boot issue. Reverting the second one
> > had no effect.
>
> OK, so can you provide the full dmesg (with revert) along wth
> memblock=debug kernel parameter. Also would it be possible to configure
> serial or net console to get the actual panic report please?
> --
> Michal Hocko
> SUSE Labs


Attachments:
dmesg.good.boot (106.11 kB)

2019-01-25 16:41:27

by Michal Hocko

[permalink] [raw]
Subject: Re: kernel panic due to https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2830bf6f05fb3e05bc4743274b806c821807a684

On Fri 25-01-19 11:16:30, robert shteynfeld wrote:
> Attached is the dmesg from patched kernel.

Your Node1 physical memory range precedes Node0 which is quite unusual
but it shouldn't be a huge problem on its own. But memory ranges are
not aligned to the memory section

[ 0.286954] Early memory node ranges
[ 0.286955] node 1: [mem 0x0000000000001000-0x0000000000090fff]
[ 0.286955] node 1: [mem 0x0000000000100000-0x00000000dbdf8fff]
[ 0.286956] node 1: [mem 0x0000000100000000-0x0000001423ffffff]
[ 0.286956] node 0: [mem 0x0000001424000000-0x0000002023ffffff]

As you can see the last pfn for the node1 is inside the section and
Node0 starts right after. This is quite unusual as well. If for no other
reasons then the memmap of those struct pages will be remote for one or
the other. Actually I am not even sure we can handle that properly
because we do expect 1:1 mapping between sections and nodes.

Now it also makes some sense why 2830bf6f05fb ("mm, memory_hotplug:
initialize struct pages for the full memory section") made any
difference. We simply write over a potentially initialized struct page
and blow up on that. I strongly suspect that the commit just uncovered
a pre-existing problem. Let me think what we can do about that.

> I'm not an expert at debugging the kernel, obviously. I tried setting
> up a serial console before without much luck as part of this debugging
> session.

Ubuntu has a nice howto for netconsole configuration
https://wiki.ubuntu.com/Kernel/Netconsole. It is quite important to get
the actual failure.
--
Michal Hocko
SUSE Labs

2019-01-25 16:52:42

by robert shteynfeld

[permalink] [raw]
Subject: Re: kernel panic due to https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2830bf6f05fb3e05bc4743274b806c821807a684

Could the unusual memory config be due to one empty DIMM slot on my
motherboard? I have 9 slots, but only 8 x 16G filled. The 6th slot
on the motherboard is empty -- which is a valid config according to
the manual.

On Fri, Jan 25, 2019 at 11:39 AM Michal Hocko <[email protected]> wrote:
>
> On Fri 25-01-19 11:16:30, robert shteynfeld wrote:
> > Attached is the dmesg from patched kernel.
>
> Your Node1 physical memory range precedes Node0 which is quite unusual
> but it shouldn't be a huge problem on its own. But memory ranges are
> not aligned to the memory section
>
> [ 0.286954] Early memory node ranges
> [ 0.286955] node 1: [mem 0x0000000000001000-0x0000000000090fff]
> [ 0.286955] node 1: [mem 0x0000000000100000-0x00000000dbdf8fff]
> [ 0.286956] node 1: [mem 0x0000000100000000-0x0000001423ffffff]
> [ 0.286956] node 0: [mem 0x0000001424000000-0x0000002023ffffff]
>
> As you can see the last pfn for the node1 is inside the section and
> Node0 starts right after. This is quite unusual as well. If for no other
> reasons then the memmap of those struct pages will be remote for one or
> the other. Actually I am not even sure we can handle that properly
> because we do expect 1:1 mapping between sections and nodes.
>
> Now it also makes some sense why 2830bf6f05fb ("mm, memory_hotplug:
> initialize struct pages for the full memory section") made any
> difference. We simply write over a potentially initialized struct page
> and blow up on that. I strongly suspect that the commit just uncovered
> a pre-existing problem. Let me think what we can do about that.
>
> > I'm not an expert at debugging the kernel, obviously. I tried setting
> > up a serial console before without much luck as part of this debugging
> > session.
>
> Ubuntu has a nice howto for netconsole configuration
> https://wiki.ubuntu.com/Kernel/Netconsole. It is quite important to get
> the actual failure.
> --
> Michal Hocko
> SUSE Labs

2019-01-25 17:34:57

by Michal Hocko

[permalink] [raw]
Subject: Re: kernel panic due to https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2830bf6f05fb3e05bc4743274b806c821807a684

On Fri 25-01-19 17:39:38, Michal Hocko wrote:
> On Fri 25-01-19 11:16:30, robert shteynfeld wrote:
> > Attached is the dmesg from patched kernel.
>
> Your Node1 physical memory range precedes Node0 which is quite unusual
> but it shouldn't be a huge problem on its own. But memory ranges are
> not aligned to the memory section
>
> [ 0.286954] Early memory node ranges
> [ 0.286955] node 1: [mem 0x0000000000001000-0x0000000000090fff]
> [ 0.286955] node 1: [mem 0x0000000000100000-0x00000000dbdf8fff]
> [ 0.286956] node 1: [mem 0x0000000100000000-0x0000001423ffffff]
> [ 0.286956] node 0: [mem 0x0000001424000000-0x0000002023ffffff]
>
> As you can see the last pfn for the node1 is inside the section and
> Node0 starts right after. This is quite unusual as well. If for no other
> reasons then the memmap of those struct pages will be remote for one or
> the other. Actually I am not even sure we can handle that properly
> because we do expect 1:1 mapping between sections and nodes.
>
> Now it also makes some sense why 2830bf6f05fb ("mm, memory_hotplug:
> initialize struct pages for the full memory section") made any
> difference. We simply write over a potentially initialized struct page
> and blow up on that. I strongly suspect that the commit just uncovered
> a pre-existing problem. Let me think what we can do about that.

Appart from force aligning node's start the only other option is to
revert 2830bf6f05fb and handling the underlying issue in the hotplug
code. I really wanted to prevent that because memory hotplug assumes
sections to be in a single node at way too many places. Maybe somebody
has a more clever idea though.
--
Michal Hocko
SUSE Labs

2019-01-25 18:17:37

by Michal Hocko

[permalink] [raw]
Subject: Re: kernel panic due to https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2830bf6f05fb3e05bc4743274b806c821807a684

On Fri 25-01-19 18:33:15, Michal Hocko wrote:
> On Fri 25-01-19 17:39:38, Michal Hocko wrote:
> > On Fri 25-01-19 11:16:30, robert shteynfeld wrote:
> > > Attached is the dmesg from patched kernel.
> >
> > Your Node1 physical memory range precedes Node0 which is quite unusual
> > but it shouldn't be a huge problem on its own. But memory ranges are
> > not aligned to the memory section
> >
> > [ 0.286954] Early memory node ranges
> > [ 0.286955] node 1: [mem 0x0000000000001000-0x0000000000090fff]
> > [ 0.286955] node 1: [mem 0x0000000000100000-0x00000000dbdf8fff]
> > [ 0.286956] node 1: [mem 0x0000000100000000-0x0000001423ffffff]
> > [ 0.286956] node 0: [mem 0x0000001424000000-0x0000002023ffffff]
> >
> > As you can see the last pfn for the node1 is inside the section and
> > Node0 starts right after. This is quite unusual as well. If for no other
> > reasons then the memmap of those struct pages will be remote for one or
> > the other. Actually I am not even sure we can handle that properly
> > because we do expect 1:1 mapping between sections and nodes.
> >
> > Now it also makes some sense why 2830bf6f05fb ("mm, memory_hotplug:
> > initialize struct pages for the full memory section") made any
> > difference. We simply write over a potentially initialized struct page
> > and blow up on that. I strongly suspect that the commit just uncovered
> > a pre-existing problem. Let me think what we can do about that.
>
> Appart from force aligning node's start the only other option is to
> revert 2830bf6f05fb and handling the underlying issue in the hotplug
> code.

We cannot really align because we have things like ZONE_DMA starting at
0x1000 and who knows what else. So let's go with the revert. Hutplug
simply needs a larger surgery to get rid of the PAGES_PER_SECTION
inherent assumptions.

Linus, could you take the revert please?

From 817b18d3db36a6900ca9043af8c1416c56358be3 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Fri, 25 Jan 2019 19:08:58 +0100
Subject: [PATCH] Revert "mm, memory_hotplug: initialize struct pages for the
full memory section"

This reverts commit 2830bf6f05fb3e05bc4743274b806c821807a684.

The underlying assumption that one sparse section belongs into a single
numa node doesn't hold really. Robert Shteynfeld has reported a boot
failure. The boot log was not captured but his memory layout is as
follows:
[ 0.286954] Early memory node ranges
[ 0.286955] node 1: [mem 0x0000000000001000-0x0000000000090fff]
[ 0.286955] node 1: [mem 0x0000000000100000-0x00000000dbdf8fff]
[ 0.286956] node 1: [mem 0x0000000100000000-0x0000001423ffffff]
[ 0.286956] node 0: [mem 0x0000001424000000-0x0000002023ffffff]

This means that node0 starts in the middle of a memory section which is
also in node1. memmap_init_zone tries to initialize padding of a section
even when it is outside of the given pfn range because there are code
paths (e.g. memory hotplug) which assume that the full worth of memory
section is always initialized. In this particular case, though, such a
range is already intialized and most likely already managed by the page
allocator. Scribbling over those pages corrupts the internal state and
likely blows up when any of those pages gets used.

Reported-by: Robert Shteynfeld <[email protected]>
Fixes: 2830bf6f05fb ("mm, memory_hotplug: initialize struct pages for the full memory section")
Cc: stable
Signed-off-by: Michal Hocko <[email protected]>
---
mm/page_alloc.c | 12 ------------
1 file changed, 12 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d295c9bc01a8..35fdde041f5c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5701,18 +5701,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
cond_resched();
}
}
-#ifdef CONFIG_SPARSEMEM
- /*
- * If the zone does not span the rest of the section then
- * we should at least initialize those pages. Otherwise we
- * could blow up on a poisoned page in some paths which depend
- * on full sections being initialized (e.g. memory hotplug).
- */
- while (end_pfn % PAGES_PER_SECTION) {
- __init_single_page(pfn_to_page(end_pfn), end_pfn, zone, nid);
- end_pfn++;
- }
-#endif
}

#ifdef CONFIG_ZONE_DEVICE
--
2.20.1

--
Michal Hocko
SUSE Labs

2019-01-28 06:37:43

by Mikhail Gavrilov

[permalink] [raw]
Subject: Re: kernel panic due to https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2830bf6f05fb3e05bc4743274b806c821807a684

> Linus, could you take the revert please?
>
> From 817b18d3db36a6900ca9043af8c1416c56358be3 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Fri, 25 Jan 2019 19:08:58 +0100
> Subject: [PATCH] Revert "mm, memory_hotplug: initialize struct pages for the
> full memory section"
>
> This reverts commit 2830bf6f05fb3e05bc4743274b806c821807a684.
>
> The underlying assumption that one sparse section belongs into a single
> numa node doesn't hold really. Robert Shteynfeld has reported a boot
> failure. The boot log was not captured but his memory layout is as
> follows:
> [ 0.286954] Early memory node ranges
> [ 0.286955] node 1: [mem 0x0000000000001000-0x0000000000090fff]
> [ 0.286955] node 1: [mem 0x0000000000100000-0x00000000dbdf8fff]
> [ 0.286956] node 1: [mem 0x0000000100000000-0x0000001423ffffff]
> [ 0.286956] node 0: [mem 0x0000001424000000-0x0000002023ffffff]
>
> This means that node0 starts in the middle of a memory section which is
> also in node1. memmap_init_zone tries to initialize padding of a section
> even when it is outside of the given pfn range because there are code
> paths (e.g. memory hotplug) which assume that the full worth of memory
> section is always initialized. In this particular case, though, such a
> range is already intialized and most likely already managed by the page
> allocator. Scribbling over those pages corrupts the internal state and
> likely blows up when any of those pages gets used.
>
> Reported-by: Robert Shteynfeld <[email protected]>
> Fixes: 2830bf6f05fb ("mm, memory_hotplug: initialize struct pages for the full memory section")
> Cc: stable
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> mm/page_alloc.c | 12 ------------
> 1 file changed, 12 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d295c9bc01a8..35fdde041f5c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5701,18 +5701,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> cond_resched();
> }
> }
> -#ifdef CONFIG_SPARSEMEM
> - /*
> - * If the zone does not span the rest of the section then
> - * we should at least initialize those pages. Otherwise we
> - * could blow up on a poisoned page in some paths which depend
> - * on full sections being initialized (e.g. memory hotplug).
> - */
> - while (end_pfn % PAGES_PER_SECTION) {
> - __init_single_page(pfn_to_page(end_pfn), end_pfn, zone, nid);
> - end_pfn++;
> - }
> -#endif
> }
>
> #ifdef CONFIG_ZONE_DEVICE

Michal, I suppose that revert the commit
2830bf6f05fb3e05bc4743274b806c821807a68 are return my issue
https://marc.info/?l=linux-mm&m=154499704718428
Are any other better approach would be proposed for fixing my issue?

--
Best Regards,
Mike Gavrilov.

2019-01-28 06:43:47

by Michal Hocko

[permalink] [raw]
Subject: Re: kernel panic due to https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2830bf6f05fb3e05bc4743274b806c821807a684

On Mon 28-01-19 11:37:00, Mikhail Gavrilov wrote:
> > Linus, could you take the revert please?
> >
> > From 817b18d3db36a6900ca9043af8c1416c56358be3 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <[email protected]>
> > Date: Fri, 25 Jan 2019 19:08:58 +0100
> > Subject: [PATCH] Revert "mm, memory_hotplug: initialize struct pages for the
> > full memory section"
> >
> > This reverts commit 2830bf6f05fb3e05bc4743274b806c821807a684.
> >
> > The underlying assumption that one sparse section belongs into a single
> > numa node doesn't hold really. Robert Shteynfeld has reported a boot
> > failure. The boot log was not captured but his memory layout is as
> > follows:
> > [ 0.286954] Early memory node ranges
> > [ 0.286955] node 1: [mem 0x0000000000001000-0x0000000000090fff]
> > [ 0.286955] node 1: [mem 0x0000000000100000-0x00000000dbdf8fff]
> > [ 0.286956] node 1: [mem 0x0000000100000000-0x0000001423ffffff]
> > [ 0.286956] node 0: [mem 0x0000001424000000-0x0000002023ffffff]
> >
> > This means that node0 starts in the middle of a memory section which is
> > also in node1. memmap_init_zone tries to initialize padding of a section
> > even when it is outside of the given pfn range because there are code
> > paths (e.g. memory hotplug) which assume that the full worth of memory
> > section is always initialized. In this particular case, though, such a
> > range is already intialized and most likely already managed by the page
> > allocator. Scribbling over those pages corrupts the internal state and
> > likely blows up when any of those pages gets used.
> >
> > Reported-by: Robert Shteynfeld <[email protected]>
> > Fixes: 2830bf6f05fb ("mm, memory_hotplug: initialize struct pages for the full memory section")
> > Cc: stable
> > Signed-off-by: Michal Hocko <[email protected]>
> > ---
> > mm/page_alloc.c | 12 ------------
> > 1 file changed, 12 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index d295c9bc01a8..35fdde041f5c 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -5701,18 +5701,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> > cond_resched();
> > }
> > }
> > -#ifdef CONFIG_SPARSEMEM
> > - /*
> > - * If the zone does not span the rest of the section then
> > - * we should at least initialize those pages. Otherwise we
> > - * could blow up on a poisoned page in some paths which depend
> > - * on full sections being initialized (e.g. memory hotplug).
> > - */
> > - while (end_pfn % PAGES_PER_SECTION) {
> > - __init_single_page(pfn_to_page(end_pfn), end_pfn, zone, nid);
> > - end_pfn++;
> > - }
> > -#endif
> > }
> >
> > #ifdef CONFIG_ZONE_DEVICE
>
> Michal, I suppose that revert the commit
> 2830bf6f05fb3e05bc4743274b806c821807a68 are return my issue
> https://marc.info/?l=linux-mm&m=154499704718428
> Are any other better approach would be proposed for fixing my issue?

As I've said above. We will need to remove the hardcoded
PAGES_PER_SECTION assumption from the hotplug code. Mikhail had a patch
which was dealing with the two specific sysfs file handlers and I will
build on top of that.

--
Michal Hocko
SUSE Labs

2019-01-28 11:44:41

by Michal Hocko

[permalink] [raw]
Subject: Re: kernel panic due to https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2830bf6f05fb3e05bc4743274b806c821807a684

On Fri 25-01-19 19:15:49, Michal Hocko wrote:
> On Fri 25-01-19 18:33:15, Michal Hocko wrote:
> > On Fri 25-01-19 17:39:38, Michal Hocko wrote:
> > > On Fri 25-01-19 11:16:30, robert shteynfeld wrote:
> > > > Attached is the dmesg from patched kernel.
> > >
> > > Your Node1 physical memory range precedes Node0 which is quite unusual
> > > but it shouldn't be a huge problem on its own. But memory ranges are
> > > not aligned to the memory section
> > >
> > > [ 0.286954] Early memory node ranges
> > > [ 0.286955] node 1: [mem 0x0000000000001000-0x0000000000090fff]
> > > [ 0.286955] node 1: [mem 0x0000000000100000-0x00000000dbdf8fff]
> > > [ 0.286956] node 1: [mem 0x0000000100000000-0x0000001423ffffff]
> > > [ 0.286956] node 0: [mem 0x0000001424000000-0x0000002023ffffff]
> > >
> > > As you can see the last pfn for the node1 is inside the section and
> > > Node0 starts right after. This is quite unusual as well. If for no other
> > > reasons then the memmap of those struct pages will be remote for one or
> > > the other. Actually I am not even sure we can handle that properly
> > > because we do expect 1:1 mapping between sections and nodes.
> > >
> > > Now it also makes some sense why 2830bf6f05fb ("mm, memory_hotplug:
> > > initialize struct pages for the full memory section") made any
> > > difference. We simply write over a potentially initialized struct page
> > > and blow up on that. I strongly suspect that the commit just uncovered
> > > a pre-existing problem. Let me think what we can do about that.
> >
> > Appart from force aligning node's start the only other option is to
> > revert 2830bf6f05fb and handling the underlying issue in the hotplug
> > code.
>
> We cannot really align because we have things like ZONE_DMA starting at
> 0x1000 and who knows what else. So let's go with the revert. Hutplug
> simply needs a larger surgery to get rid of the PAGES_PER_SECTION
> inherent assumptions.
>
> Linus, could you take the revert please?

or should I post the patch as a reply to make your life easier?

> From 817b18d3db36a6900ca9043af8c1416c56358be3 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Fri, 25 Jan 2019 19:08:58 +0100
> Subject: [PATCH] Revert "mm, memory_hotplug: initialize struct pages for the
> full memory section"
>
> This reverts commit 2830bf6f05fb3e05bc4743274b806c821807a684.
>
> The underlying assumption that one sparse section belongs into a single
> numa node doesn't hold really. Robert Shteynfeld has reported a boot
> failure. The boot log was not captured but his memory layout is as
> follows:
> [ 0.286954] Early memory node ranges
> [ 0.286955] node 1: [mem 0x0000000000001000-0x0000000000090fff]
> [ 0.286955] node 1: [mem 0x0000000000100000-0x00000000dbdf8fff]
> [ 0.286956] node 1: [mem 0x0000000100000000-0x0000001423ffffff]
> [ 0.286956] node 0: [mem 0x0000001424000000-0x0000002023ffffff]
>
> This means that node0 starts in the middle of a memory section which is
> also in node1. memmap_init_zone tries to initialize padding of a section
> even when it is outside of the given pfn range because there are code
> paths (e.g. memory hotplug) which assume that the full worth of memory
> section is always initialized. In this particular case, though, such a
> range is already intialized and most likely already managed by the page
> allocator. Scribbling over those pages corrupts the internal state and
> likely blows up when any of those pages gets used.
>
> Reported-by: Robert Shteynfeld <[email protected]>
> Fixes: 2830bf6f05fb ("mm, memory_hotplug: initialize struct pages for the full memory section")
> Cc: stable
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> mm/page_alloc.c | 12 ------------
> 1 file changed, 12 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d295c9bc01a8..35fdde041f5c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5701,18 +5701,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> cond_resched();
> }
> }
> -#ifdef CONFIG_SPARSEMEM
> - /*
> - * If the zone does not span the rest of the section then
> - * we should at least initialize those pages. Otherwise we
> - * could blow up on a poisoned page in some paths which depend
> - * on full sections being initialized (e.g. memory hotplug).
> - */
> - while (end_pfn % PAGES_PER_SECTION) {
> - __init_single_page(pfn_to_page(end_pfn), end_pfn, zone, nid);
> - end_pfn++;
> - }
> -#endif
> }
>
> #ifdef CONFIG_ZONE_DEVICE
> --
> 2.20.1
>
> --
> Michal Hocko
> SUSE Labs

--
Michal Hocko
SUSE Labs

2019-01-28 18:43:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: kernel panic due to https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2830bf6f05fb3e05bc4743274b806c821807a684

On Mon, Jan 28, 2019 at 3:44 AM Michal Hocko <[email protected]> wrote:
>
> >
> > Linus, could you take the revert please?
>
> or should I post the patch as a reply to make your life easier?

I've taken that revert now, but yes, in general, it's much better if a
patch is sent separately from the discussion as a very explicit patch
(ie subject line has that usual "[PATCH] ..." format etc) simply
because I get too much email and I can often miss things otherwise.
Particularly inside a thread where the developers are discussing
things, I go "ah, good, things are still being discussed" without
looking any closer into the details.

That's particularly true when traveling - I only noticed this one on
this second reminder email.. I'm home now, but Friday, when you sent
it originally, was 45 hours long for me with 24 hours on airplanes and
airports, so I had entirely missed the original email back then.

Thanks,

Linus

2019-01-28 19:03:14

by Michal Hocko

[permalink] [raw]
Subject: Re: kernel panic due to https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2830bf6f05fb3e05bc4743274b806c821807a684

On Mon 28-01-19 10:42:25, Linus Torvalds wrote:
> On Mon, Jan 28, 2019 at 3:44 AM Michal Hocko <[email protected]> wrote:
> >
> > >
> > > Linus, could you take the revert please?
> >
> > or should I post the patch as a reply to make your life easier?
>
> I've taken that revert now, but yes, in general, it's much better if a
> patch is sent separately from the discussion as a very explicit patch
> (ie subject line has that usual "[PATCH] ..." format etc) simply
> because I get too much email and I can often miss things otherwise.
> Particularly inside a thread where the developers are discussing
> things, I go "ah, good, things are still being discussed" without
> looking any closer into the details.

Sure, I usually do it that way. I was just awaiting some review feedback
on the changelog first.

> That's particularly true when traveling - I only noticed this one on
> this second reminder email.. I'm home now, but Friday, when you sent
> it originally, was 45 hours long for me with 24 hours on airplanes and
> airports, so I had entirely missed the original email back then.

No problem. Andrew has already queued it as well so I guess he simply
drops it when it reaches your tree.

Thanks!
--
Michal Hocko
SUSE Labs