2018-12-19 02:22:30

by Qian Cai

[permalink] [raw]
Subject: [PATCH] mm: skip checking poison pattern for page_to_nid()

Kernel panic with page_owner=on

CONFIG_DEBUG_VM_PGFLAGS=y
PAGE_OWNER=y
NODE_NOT_IN_PAGE_FLAGS=n

This is due to f165b378bbd (mm: uninitialized struct page poisoning
sanity checking) shoots itself in the foot.

[ 11.917212] page:ffffea0004200000 is uninitialized and poisoned
[ 11.917220] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
ffffffffffffffff
[ 11.921745] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
ffffffffffffffff
[ 11.924523] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
[ 11.926498] page_owner info is not active (free page?)
[ 12.329560] kernel BUG at include/linux/mm.h:990!
[ 12.337632] RIP: 0010:init_page_owner+0x486/0x520

At first,

start_kernel
setup_arch
pagetable_init
paging_init
sparse_init
sparse_init_nid
sparse_buffer_init
memblock_virt_alloc_try_nid_raw

It poisons all the allocated pages there.

memset(ptr, PAGE_POISON_PATTERN, size)

Later,

page_ext_init
invoke_init_callbacks
init_section_page_ext
init_page_owner
init_early_allocated_pages
init_zones_in_node
init_pages_in_zone
lookup_page_ext
page_to_nid
PF_POISONED_CHECK <--- panic here.

This because all allocated pages are not initialized until later.

init_pages_in_zone
__set_page_owner_handle

Fixes: f165b378bbd (mm: uninitialized struct page poisoning
sanity checking)
Signed-off-by: Qian Cai <[email protected]>
---
include/linux/mm.h | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5411de93a363..f083f366ea90 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -985,9 +985,7 @@ extern int page_to_nid(const struct page *page);
#else
static inline int page_to_nid(const struct page *page)
{
- struct page *p = (struct page *)page;
-
- return (PF_POISONED_CHECK(p)->flags >> NODES_PGSHIFT) & NODES_MASK;
+ return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
}
#endif

--
2.17.2 (Apple Git-113)



2018-12-19 10:21:21

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm: skip checking poison pattern for page_to_nid()

On Tue 18-12-18 20:57:32, Qian Cai wrote:
[...]
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5411de93a363..f083f366ea90 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -985,9 +985,7 @@ extern int page_to_nid(const struct page *page);
> #else
> static inline int page_to_nid(const struct page *page)
> {
> - struct page *p = (struct page *)page;
> -
> - return (PF_POISONED_CHECK(p)->flags >> NODES_PGSHIFT) & NODES_MASK;
> + return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
> }
> #endif

I didn't get to think about a proper fix but this is clearly worng. If
the page is still poisoned then flags are clearly bogus and the node you
get is a garbage as well. Have you actually tested this patch?
--
Michal Hocko
SUSE Labs

2018-12-19 15:37:40

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH] mm: skip checking poison pattern for page_to_nid()

On 12/19/18 5:20 AM, Michal Hocko wrote:
> On Tue 18-12-18 20:57:32, Qian Cai wrote:
> [...]
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 5411de93a363..f083f366ea90 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -985,9 +985,7 @@ extern int page_to_nid(const struct page *page);
>> #else
>> static inline int page_to_nid(const struct page *page)
>> {
>> - struct page *p = (struct page *)page;
>> -
>> - return (PF_POISONED_CHECK(p)->flags >> NODES_PGSHIFT) & NODES_MASK;
>> + return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
>> }
>> #endif
>
> I didn't get to think about a proper fix but this is clearly worng. If
> the page is still poisoned then flags are clearly bogus and the node you
> get is a garbage as well. Have you actually tested this patch?
>

Yes, I did notice that after running for a while triggering some UBSAN
out-of-bounds access warnings. I am still trying to figure out how those
uninitialized page flags survived though after

mm_init
mem_init
memblock_free_all
init_single_page()

2018-12-20 06:51:10

by Qian Cai

[permalink] [raw]
Subject: [PATCH] mm/page_owner: fix for deferred struct page init

When booting a system with "page_owner=on",

start_kernel
page_ext_init
invoke_init_callbacks
init_section_page_ext
init_page_owner
init_early_allocated_pages
init_zones_in_node
init_pages_in_zone
lookup_page_ext
page_to_nid

The issue here is that page_to_nid() will not work since some page
flags have no node information until later in page_alloc_init_late() due
to DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds
access with an invalid nid.

[ 8.666047] UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50
[ 8.672603] index 7 is out of range for type 'zone [5]'

Also, kernel will panic since flags were poisoned earlier with,

CONFIG_DEBUG_VM_PGFLAGS=y
CONFIG_NODE_NOT_IN_PAGE_FLAGS=n

start_kernel
setup_arch
pagetable_init
paging_init
sparse_init
sparse_init_nid
memblock_alloc_try_nid_raw

Although later it tries to set page flags for pages in reserved bootmem
regions,

mm_init
mem_init
memblock_free_all
free_low_memory_core_early
reserve_bootmem_region

there could still have some freed pages from the page allocator but yet
to be initialized due to DEFERRED_STRUCT_PAGE_INIT. It have already been
dealt with a bit in page_ext_init().

* Take into account DEFERRED_STRUCT_PAGE_INIT.
*/
if (early_pfn_to_nid(pfn) != nid)
continue;

However it did not handle it well in init_pages_in_zone() which end up
calling page_to_nid().

[ 11.917212] page:ffffea0004200000 is uninitialized and poisoned
[ 11.917220] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
ffffffffffffffff
[ 11.921745] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
ffffffffffffffff
[ 11.924523] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
[ 11.926498] page_owner info is not active (free page?)
[ 12.329560] kernel BUG at include/linux/mm.h:990!
[ 12.337632] RIP: 0010:init_page_owner+0x486/0x520

Since init_pages_in_zone() has already had the node information, there
is no need to call page_to_nid() at all during the page ext lookup, and
also replace calls that could incorrectly checked for poisoned page
structs.

It ends up wasting some memory to allocate page ext for those already
freed pages, but there is no sane ways to tell those freed pages apart
from uninitialized valid pages due to DEFERRED_STRUCT_PAGE_INIT. It
looks quite reasonable on an arm64 server though.

allocated 83230720 bytes of page_ext
Node 0, zone DMA32: page owner found early allocated 0 pages
Node 0, zone Normal: page owner found early allocated 2048214 pages
Node 1, zone Normal: page owner found early allocated 2080641 pages

Used more memory on a x86_64 server.

allocated 334233600 bytes of page_ext
Node 0, zone DMA: page owner found early allocated 2 pages
Node 0, zone DMA32: page owner found early allocated 24303 pages
Node 0, zone Normal: page owner found early allocated 7545357 pages
Node 1, zone Normal: page owner found early allocated 8331279 pages

Finally, rename get_entry() to get_ext_entry(), so it can be exported
without a naming collision.

Signed-off-by: Qian Cai <[email protected]>
---
include/linux/page_ext.h | 6 ++++++
mm/page_ext.c | 8 ++++----
mm/page_owner.c | 39 ++++++++++++++++++++++++++++++++-------
3 files changed, 42 insertions(+), 11 deletions(-)

diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index f84f167ec04c..e95cb6198014 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -51,6 +51,7 @@ static inline void page_ext_init(void)
#endif

struct page_ext *lookup_page_ext(const struct page *page);
+struct page_ext *get_ext_entry(void *base, unsigned long index);

#else /* !CONFIG_PAGE_EXTENSION */
struct page_ext;
@@ -64,6 +65,11 @@ static inline struct page_ext *lookup_page_ext(const struct page *page)
return NULL;
}

+static inline struct page_ext *get_ext_entry(void *base, unsigned long index)
+{
+ return NULL;
+}
+
static inline void page_ext_init(void)
{
}
diff --git a/mm/page_ext.c b/mm/page_ext.c
index ae44f7adbe07..3cd8f0c13057 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -107,7 +107,7 @@ static unsigned long get_entry_size(void)
return sizeof(struct page_ext) + extra_mem;
}

-static inline struct page_ext *get_entry(void *base, unsigned long index)
+struct page_ext *get_ext_entry(void *base, unsigned long index)
{
return base + get_entry_size() * index;
}
@@ -137,7 +137,7 @@ struct page_ext *lookup_page_ext(const struct page *page)
return NULL;
index = pfn - round_down(node_start_pfn(page_to_nid(page)),
MAX_ORDER_NR_PAGES);
- return get_entry(base, index);
+ return get_ext_entry(base, index);
}

static int __init alloc_node_page_ext(int nid)
@@ -207,7 +207,7 @@ struct page_ext *lookup_page_ext(const struct page *page)
*/
if (!section->page_ext)
return NULL;
- return get_entry(section->page_ext, pfn);
+ return get_ext_entry(section->page_ext, pfn);
}

static void *__meminit alloc_page_ext(size_t size, int nid)
@@ -285,7 +285,7 @@ static void __free_page_ext(unsigned long pfn)
ms = __pfn_to_section(pfn);
if (!ms || !ms->page_ext)
return;
- base = get_entry(ms->page_ext, pfn);
+ base = get_ext_entry(ms->page_ext, pfn);
free_page_ext(base);
ms->page_ext = NULL;
}
diff --git a/mm/page_owner.c b/mm/page_owner.c
index 87bc0dfdb52b..c27712c9a764 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -531,6 +531,7 @@ static void init_pages_in_zone(pg_data_t *pgdat, struct zone *zone)
unsigned long pfn = zone->zone_start_pfn;
unsigned long end_pfn = zone_end_pfn(zone);
unsigned long count = 0;
+ struct page_ext *base;

/*
* Walk the zone in pageblock_nr_pages steps. If a page block spans
@@ -555,11 +556,11 @@ static void init_pages_in_zone(pg_data_t *pgdat, struct zone *zone)
if (!pfn_valid_within(pfn))
continue;

- page = pfn_to_page(pfn);
-
- if (page_zone(page) != zone)
+ if (pfn < zone->zone_start_pfn || pfn >= end_pfn)
continue;

+ page = pfn_to_page(pfn);
+
/*
* To avoid having to grab zone->lock, be a little
* careful when reading buddy page order. The only
@@ -575,13 +576,37 @@ static void init_pages_in_zone(pg_data_t *pgdat, struct zone *zone)
continue;
}

- if (PageReserved(page))
+#ifdef CONFIG_SPARSEMEM
+ base = __pfn_to_section(pfn)->page_ext;
+#else
+ base = pgdat->node_page_ext;
+#endif
+ /*
+ * The sanity checks the page allocator does upon
+ * freeing a page can reach here before the page_ext
+ * arrays are allocated when feeding a range of pages to
+ * the allocator for the first time during bootup or
+ * memory hotplug.
+ */
+ if (unlikely(!base))
continue;

- page_ext = lookup_page_ext(page);
- if (unlikely(!page_ext))
+ /*
+ * Those pages reached here might had already been freed
+ * due to the deferred struct page init.
+ */
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+ if (pfn < pgdat->first_deferred_pfn)
+#endif
+ if (PageReserved(page))
continue;
-
+#ifdef CONFIG_SPARSEMEM
+ page_ext = get_ext_entry(base, pfn);
+#else
+ page_ext = get_ext_entry(base, pfn -
+ round_down(pgdat->node_start_pfn,
+ MAX_ORDER_NR_PAGES));
+#endif
/* Maybe overlapping zone */
if (test_bit(PAGE_EXT_OWNER, &page_ext->flags))
continue;
--
2.17.2 (Apple Git-113)


2018-12-20 09:25:35

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm/page_owner: fix for deferred struct page init

On Thu 20-12-18 01:03:03, Qian Cai wrote:
> When booting a system with "page_owner=on",
>
> start_kernel
> page_ext_init
> invoke_init_callbacks
> init_section_page_ext
> init_page_owner
> init_early_allocated_pages
> init_zones_in_node
> init_pages_in_zone
> lookup_page_ext
> page_to_nid
>
> The issue here is that page_to_nid() will not work since some page
> flags have no node information until later in page_alloc_init_late() due
> to DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds
> access with an invalid nid.
>
> [ 8.666047] UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50
> [ 8.672603] index 7 is out of range for type 'zone [5]'
>
> Also, kernel will panic since flags were poisoned earlier with,
>
> CONFIG_DEBUG_VM_PGFLAGS=y
> CONFIG_NODE_NOT_IN_PAGE_FLAGS=n
>
> start_kernel
> setup_arch
> pagetable_init
> paging_init
> sparse_init
> sparse_init_nid
> memblock_alloc_try_nid_raw
>
> Although later it tries to set page flags for pages in reserved bootmem
> regions,
>
> mm_init
> mem_init
> memblock_free_all
> free_low_memory_core_early
> reserve_bootmem_region
>
> there could still have some freed pages from the page allocator but yet
> to be initialized due to DEFERRED_STRUCT_PAGE_INIT. It have already been
> dealt with a bit in page_ext_init().

Is there any reason why we cannot postpone page_ext initialization to
after all the memory is initialized?
--
Michal Hocko
SUSE Labs

2018-12-21 06:05:29

by Qian Cai

[permalink] [raw]
Subject: [PATCH v2] mm/page_owner: fix for deferred struct page init

When booting a system with "page_owner=on",

start_kernel
page_ext_init
invoke_init_callbacks
init_section_page_ext
init_page_owner
init_early_allocated_pages
init_zones_in_node
init_pages_in_zone
lookup_page_ext
page_to_nid

The issue here is that page_to_nid() will not work since some page
flags have no node information until later in page_alloc_init_late() due
to DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds
access with an invalid nid.

[ 8.666047] UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50
[ 8.672603] index 7 is out of range for type 'zone [5]'

Also, kernel will panic since flags were poisoned earlier with,

CONFIG_DEBUG_VM_PGFLAGS=y
CONFIG_NODE_NOT_IN_PAGE_FLAGS=n

start_kernel
setup_arch
pagetable_init
paging_init
sparse_init
sparse_init_nid
memblock_alloc_try_nid_raw

Although later it tries to set page flags for pages in reserved bootmem
regions,

mm_init
mem_init
memblock_free_all
free_low_memory_core_early
reserve_bootmem_region

there could still have some freed pages from the page allocator but yet
to be initialized due to DEFERRED_STRUCT_PAGE_INIT. It have already been
dealt with a bit in page_ext_init().

* Take into account DEFERRED_STRUCT_PAGE_INIT.
*/
if (early_pfn_to_nid(pfn) != nid)
continue;

However it did not handle it well in init_pages_in_zone() which end up
calling page_to_nid().

[ 11.917212] page:ffffea0004200000 is uninitialized and poisoned
[ 11.917220] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
ffffffffffffffff
[ 11.921745] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
ffffffffffffffff
[ 11.924523] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
[ 11.926498] page_owner info is not active (free page?)
[ 12.329560] kernel BUG at include/linux/mm.h:990!
[ 12.337632] RIP: 0010:init_page_owner+0x486/0x520

Since there is no other routines depend on page_ext_init() in
start_kernel() and no real benefit to call it so early, just move it
after page_alloc_init_late() to ensure that there is no deferred pages
need to de dealt with.

Suggested-by: Michal Hocko <[email protected]>
Signed-off-by: Qian Cai <[email protected]>
---

v2: postpone init_page_ext() to after page_alloc_init_late().

init/main.c | 2 +-
mm/page_ext.c | 3 +--
2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/init/main.c b/init/main.c
index 2b7b7fe173c9..1aeb062b2cb7 100644
--- a/init/main.c
+++ b/init/main.c
@@ -696,7 +696,6 @@ asmlinkage __visible void __init start_kernel(void)
initrd_start = 0;
}
#endif
- page_ext_init();
kmemleak_init();
setup_per_cpu_pageset();
numa_policy_init();
@@ -1147,6 +1146,7 @@ static noinline void __init kernel_init_freeable(void)
sched_init_smp();

page_alloc_init_late();
+ page_ext_init();

do_basic_setup();

diff --git a/mm/page_ext.c b/mm/page_ext.c
index ae44f7adbe07..d76fd51e312a 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -399,9 +399,8 @@ void __init page_ext_init(void)
* -------------pfn-------------->
* N0 | N1 | N2 | N0 | N1 | N2|....
*
- * Take into account DEFERRED_STRUCT_PAGE_INIT.
*/
- if (early_pfn_to_nid(pfn) != nid)
+ if (pfn_to_nid(pfn) != nid)
continue;
if (init_section_page_ext(pfn, nid))
goto oom;
--
2.17.2 (Apple Git-113)


2018-12-21 07:30:47

by William Kucharski

[permalink] [raw]
Subject: Re: [PATCH v3] mm/page_owner: fix for deferred struct page init



> On Dec 20, 2018, at 1:31 PM, Qian Cai <[email protected]> wrote:
>
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index ae44f7adbe07..d76fd51e312a 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -399,9 +399,8 @@ void __init page_ext_init(void)
> * -------------pfn-------------->
> * N0 | N1 | N2 | N0 | N1 | N2|....
> *
> - * Take into account DEFERRED_STRUCT_PAGE_INIT.
> */
> - if (early_pfn_to_nid(pfn) != nid)
> + if (pfn_to_nid(pfn) != nid)
> continue;
> if (init_section_page_ext(pfn, nid))
> goto oom;
> --
> 2.17.2 (Apple Git-113)
>

Is there any danger in the fact that in the CONFIG_NUMA case in mmzone.h (around line 1261), pfn_to_nid() calls page_to_nid(), possibly causing the same issue seen in v2?


2018-12-21 07:32:12

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH v3] mm/page_owner: fix for deferred struct page init

On Thu, 2018-12-20 at 14:00 -0700, William Kucharski wrote:
> > On Dec 20, 2018, at 1:31 PM, Qian Cai <[email protected]> wrote:
> >
> > diff --git a/mm/page_ext.c b/mm/page_ext.c
> > index ae44f7adbe07..d76fd51e312a 100644
> > --- a/mm/page_ext.c
> > +++ b/mm/page_ext.c
> > @@ -399,9 +399,8 @@ void __init page_ext_init(void)
> >  * -------------pfn-------------->
> >  * N0 | N1 | N2 | N0 | N1 | N2|....
> >  *
> > -  * Take into account DEFERRED_STRUCT_PAGE_INIT.
> >  */
> > - if (early_pfn_to_nid(pfn) != nid)
> > + if (pfn_to_nid(pfn) != nid)
> > continue;
> > if (init_section_page_ext(pfn, nid))
> > goto oom;
> > -- 
> > 2.17.2 (Apple Git-113)
> >
>
> Is there any danger in the fact that in the CONFIG_NUMA case in mmzone.h
> (around line 1261), pfn_to_nid() calls page_to_nid(), possibly causing the
> same issue seen in v2?
>

No. If CONFIG_DEFERRED_STRUCT_PAGE_INIT=y, page_ext_init() is called after
page_alloc_init_late() where all the memory has already been initialized,
so page_to_nid() will work then.

2018-12-21 10:34:31

by Qian Cai

[permalink] [raw]
Subject: [PATCH v3] mm/page_owner: fix for deferred struct page init

When booting a system with "page_owner=on",

start_kernel
page_ext_init
invoke_init_callbacks
init_section_page_ext
init_page_owner
init_early_allocated_pages
init_zones_in_node
init_pages_in_zone
lookup_page_ext
page_to_nid

The issue here is that page_to_nid() will not work since some page
flags have no node information until later in page_alloc_init_late() due
to DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds
access with an invalid nid.

[ 8.666047] UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50
[ 8.672603] index 7 is out of range for type 'zone [5]'

Also, kernel will panic since flags were poisoned earlier with,

CONFIG_DEBUG_VM_PGFLAGS=y
CONFIG_NODE_NOT_IN_PAGE_FLAGS=n

start_kernel
setup_arch
pagetable_init
paging_init
sparse_init
sparse_init_nid
memblock_alloc_try_nid_raw

Although later it tries to set page flags for pages in reserved bootmem
regions,

mm_init
mem_init
memblock_free_all
free_low_memory_core_early
reserve_bootmem_region

there could still have some freed pages from the page allocator but yet
to be initialized due to DEFERRED_STRUCT_PAGE_INIT. It have already been
dealt with a bit in page_ext_init().

* Take into account DEFERRED_STRUCT_PAGE_INIT.
*/
if (early_pfn_to_nid(pfn) != nid)
continue;

However, it did not handle it well in init_pages_in_zone() which end up
calling page_to_nid().

[ 11.917212] page:ffffea0004200000 is uninitialized and poisoned
[ 11.917220] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
ffffffffffffffff
[ 11.921745] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
ffffffffffffffff
[ 11.924523] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
[ 11.926498] page_owner info is not active (free page?)
[ 12.329560] kernel BUG at include/linux/mm.h:990!
[ 12.337632] RIP: 0010:init_page_owner+0x486/0x520

Since there is no other routines depend on page_ext_init() in
start_kernel(), just move it after page_alloc_init_late() to ensure that
there is no deferred pages need to de dealt with. If deselected
DEFERRED_STRUCT_PAGE_INIT, it is still better to call page_ext_init()
earlier, so page owner could catch more early page allocation call
sites. This gives us a good compromise between catching good and bad
call sites (See the v1 patch [1]) in case of DEFERRED_STRUCT_PAGE_INIT.

[1] https://lore.kernel.org/lkml/[email protected]/

Fixes: fe53ca54270 (mm: use early_pfn_to_nid in page_ext_init)
Suggested-by: Michal Hocko <[email protected]>
Signed-off-by: Qian Cai <[email protected]>
---

v3: still call page_ext_init() earlier if DEFERRED_STRUCT_PAGE_INIT=n.

v2: postpone page_ext_init() to after page_alloc_init_late().

init/main.c | 5 +++++
mm/page_ext.c | 3 +--
2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/init/main.c b/init/main.c
index 2b7b7fe173c9..5d9904370f76 100644
--- a/init/main.c
+++ b/init/main.c
@@ -696,7 +696,9 @@ asmlinkage __visible void __init start_kernel(void)
initrd_start = 0;
}
#endif
+#ifndef CONFIG_DEFERRED_STRUCT_PAGE_INIT
page_ext_init();
+#endif
kmemleak_init();
setup_per_cpu_pageset();
numa_policy_init();
@@ -1147,6 +1149,9 @@ static noinline void __init kernel_init_freeable(void)
sched_init_smp();

page_alloc_init_late();
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+ page_ext_init();
+#endif

do_basic_setup();

diff --git a/mm/page_ext.c b/mm/page_ext.c
index ae44f7adbe07..d76fd51e312a 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -399,9 +399,8 @@ void __init page_ext_init(void)
* -------------pfn-------------->
* N0 | N1 | N2 | N0 | N1 | N2|....
*
- * Take into account DEFERRED_STRUCT_PAGE_INIT.
*/
- if (early_pfn_to_nid(pfn) != nid)
+ if (pfn_to_nid(pfn) != nid)
continue;
if (init_section_page_ext(pfn, nid))
goto oom;
--
2.17.2 (Apple Git-113)


2019-01-03 16:01:52

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v3] mm/page_owner: fix for deferred struct page init

On Thu 20-12-18 15:31:56, Qian Cai wrote:
> When booting a system with "page_owner=on",
>
> start_kernel
> page_ext_init
> invoke_init_callbacks
> init_section_page_ext
> init_page_owner
> init_early_allocated_pages
> init_zones_in_node
> init_pages_in_zone
> lookup_page_ext
> page_to_nid
>
> The issue here is that page_to_nid() will not work since some page
> flags have no node information until later in page_alloc_init_late() due
> to DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds
> access with an invalid nid.
>
> [ 8.666047] UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50
> [ 8.672603] index 7 is out of range for type 'zone [5]'
>
> Also, kernel will panic since flags were poisoned earlier with,
>
> CONFIG_DEBUG_VM_PGFLAGS=y
> CONFIG_NODE_NOT_IN_PAGE_FLAGS=n
>
> start_kernel
> setup_arch
> pagetable_init
> paging_init
> sparse_init
> sparse_init_nid
> memblock_alloc_try_nid_raw
>
> Although later it tries to set page flags for pages in reserved bootmem
> regions,
>
> mm_init
> mem_init
> memblock_free_all
> free_low_memory_core_early
> reserve_bootmem_region
>
> there could still have some freed pages from the page allocator but yet
> to be initialized due to DEFERRED_STRUCT_PAGE_INIT. It have already been
> dealt with a bit in page_ext_init().
>
> * Take into account DEFERRED_STRUCT_PAGE_INIT.
> */
> if (early_pfn_to_nid(pfn) != nid)
> continue;
>
> However, it did not handle it well in init_pages_in_zone() which end up
> calling page_to_nid().
>
> [ 11.917212] page:ffffea0004200000 is uninitialized and poisoned
> [ 11.917220] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
> ffffffffffffffff
> [ 11.921745] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
> ffffffffffffffff
> [ 11.924523] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> [ 11.926498] page_owner info is not active (free page?)
> [ 12.329560] kernel BUG at include/linux/mm.h:990!
> [ 12.337632] RIP: 0010:init_page_owner+0x486/0x520
>
> Since there is no other routines depend on page_ext_init() in
> start_kernel(), just move it after page_alloc_init_late() to ensure that
> there is no deferred pages need to de dealt with. If deselected
> DEFERRED_STRUCT_PAGE_INIT, it is still better to call page_ext_init()
> earlier, so page owner could catch more early page allocation call
> sites. This gives us a good compromise between catching good and bad
> call sites (See the v1 patch [1]) in case of DEFERRED_STRUCT_PAGE_INIT.
>
> [1] https://lore.kernel.org/lkml/[email protected]/
>
> Fixes: fe53ca54270 (mm: use early_pfn_to_nid in page_ext_init)
> Suggested-by: Michal Hocko <[email protected]>
> Signed-off-by: Qian Cai <[email protected]>
> ---
>
> v3: still call page_ext_init() earlier if DEFERRED_STRUCT_PAGE_INIT=n.
>
> v2: postpone page_ext_init() to after page_alloc_init_late().
>
> init/main.c | 5 +++++
> mm/page_ext.c | 3 +--
> 2 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/init/main.c b/init/main.c
> index 2b7b7fe173c9..5d9904370f76 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -696,7 +696,9 @@ asmlinkage __visible void __init start_kernel(void)
> initrd_start = 0;
> }
> #endif
> +#ifndef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> page_ext_init();
> +#endif
> kmemleak_init();
> setup_per_cpu_pageset();
> numa_policy_init();
> @@ -1147,6 +1149,9 @@ static noinline void __init kernel_init_freeable(void)
> sched_init_smp();
>
> page_alloc_init_late();
> +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> + page_ext_init();
> +#endif
>
> do_basic_setup();

Is this really necessary? Why cannot we simply postpone page_ext_init
unconditioanally?

>
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index ae44f7adbe07..d76fd51e312a 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -399,9 +399,8 @@ void __init page_ext_init(void)
> * -------------pfn-------------->
> * N0 | N1 | N2 | N0 | N1 | N2|....
> *
> - * Take into account DEFERRED_STRUCT_PAGE_INIT.
> */
> - if (early_pfn_to_nid(pfn) != nid)
> + if (pfn_to_nid(pfn) != nid)
> continue;
> if (init_section_page_ext(pfn, nid))
> goto oom;

Also this doesn't seem to be related, right?
--
Michal Hocko
SUSE Labs

2019-01-03 22:19:18

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH v3] mm/page_owner: fix for deferred struct page init

On 1/3/19 6:51 AM, Michal Hocko wrote:
> On Thu 20-12-18 15:31:56, Qian Cai wrote:
>> When booting a system with "page_owner=on",
>>
>> start_kernel
>> page_ext_init
>> invoke_init_callbacks
>> init_section_page_ext
>> init_page_owner
>> init_early_allocated_pages
>> init_zones_in_node
>> init_pages_in_zone
>> lookup_page_ext
>> page_to_nid
>>
>> The issue here is that page_to_nid() will not work since some page
>> flags have no node information until later in page_alloc_init_late() due
>> to DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds
>> access with an invalid nid.
>>
>> [ 8.666047] UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50
>> [ 8.672603] index 7 is out of range for type 'zone [5]'
>>
>> Also, kernel will panic since flags were poisoned earlier with,
>>
>> CONFIG_DEBUG_VM_PGFLAGS=y
>> CONFIG_NODE_NOT_IN_PAGE_FLAGS=n
>>
>> start_kernel
>> setup_arch
>> pagetable_init
>> paging_init
>> sparse_init
>> sparse_init_nid
>> memblock_alloc_try_nid_raw
>>
>> Although later it tries to set page flags for pages in reserved bootmem
>> regions,
>>
>> mm_init
>> mem_init
>> memblock_free_all
>> free_low_memory_core_early
>> reserve_bootmem_region
>>
>> there could still have some freed pages from the page allocator but yet
>> to be initialized due to DEFERRED_STRUCT_PAGE_INIT. It have already been
>> dealt with a bit in page_ext_init().
>>
>> * Take into account DEFERRED_STRUCT_PAGE_INIT.
>> */
>> if (early_pfn_to_nid(pfn) != nid)
>> continue;
>>
>> However, it did not handle it well in init_pages_in_zone() which end up
>> calling page_to_nid().
>>
>> [ 11.917212] page:ffffea0004200000 is uninitialized and poisoned
>> [ 11.917220] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
>> ffffffffffffffff
>> [ 11.921745] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
>> ffffffffffffffff
>> [ 11.924523] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>> [ 11.926498] page_owner info is not active (free page?)
>> [ 12.329560] kernel BUG at include/linux/mm.h:990!
>> [ 12.337632] RIP: 0010:init_page_owner+0x486/0x520
>>
>> Since there is no other routines depend on page_ext_init() in
>> start_kernel(), just move it after page_alloc_init_late() to ensure that
>> there is no deferred pages need to de dealt with. If deselected
>> DEFERRED_STRUCT_PAGE_INIT, it is still better to call page_ext_init()
>> earlier, so page owner could catch more early page allocation call
>> sites. This gives us a good compromise between catching good and bad
>> call sites (See the v1 patch [1]) in case of DEFERRED_STRUCT_PAGE_INIT.
>>
>> [1] https://lore.kernel.org/lkml/[email protected]/
>>
>> Fixes: fe53ca54270 (mm: use early_pfn_to_nid in page_ext_init)
>> Suggested-by: Michal Hocko <[email protected]>
>> Signed-off-by: Qian Cai <[email protected]>
>> ---
>>
>> v3: still call page_ext_init() earlier if DEFERRED_STRUCT_PAGE_INIT=n.
>>
>> v2: postpone page_ext_init() to after page_alloc_init_late().
>>
>> init/main.c | 5 +++++
>> mm/page_ext.c | 3 +--
>> 2 files changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/init/main.c b/init/main.c
>> index 2b7b7fe173c9..5d9904370f76 100644
>> --- a/init/main.c
>> +++ b/init/main.c
>> @@ -696,7 +696,9 @@ asmlinkage __visible void __init start_kernel(void)
>> initrd_start = 0;
>> }
>> #endif
>> +#ifndef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>> page_ext_init();
>> +#endif
>> kmemleak_init();
>> setup_per_cpu_pageset();
>> numa_policy_init();
>> @@ -1147,6 +1149,9 @@ static noinline void __init kernel_init_freeable(void)
>> sched_init_smp();
>>
>> page_alloc_init_late();
>> +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>> + page_ext_init();
>> +#endif
>>
>> do_basic_setup();
>
> Is this really necessary? Why cannot we simply postpone page_ext_init
> unconditioanally?

As mentioned above, "If deselected DEFERRED_STRUCT_PAGE_INIT, it is still better
to call page_ext_init() earlier, so page owner could catch more early page
allocation call sites."

>
>>
>> diff --git a/mm/page_ext.c b/mm/page_ext.c
>> index ae44f7adbe07..d76fd51e312a 100644
>> --- a/mm/page_ext.c
>> +++ b/mm/page_ext.c
>> @@ -399,9 +399,8 @@ void __init page_ext_init(void)
>> * -------------pfn-------------->
>> * N0 | N1 | N2 | N0 | N1 | N2|....
>> *
>> - * Take into account DEFERRED_STRUCT_PAGE_INIT.
>> */
>> - if (early_pfn_to_nid(pfn) != nid)
>> + if (pfn_to_nid(pfn) != nid)
>> continue;
>> if (init_section_page_ext(pfn, nid))
>> goto oom;
>
> Also this doesn't seem to be related, right?

No, it is related. Because of this patch, page_ext_init() is called after all
the memory has already been initialized,
so no longer necessary to call early_pfn_to_nid().

2019-01-03 23:30:59

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v3] mm/page_owner: fix for deferred struct page init

On Thu 03-01-19 11:38:31, Qian Cai wrote:
> On 1/3/19 6:51 AM, Michal Hocko wrote:
> > On Thu 20-12-18 15:31:56, Qian Cai wrote:
> >> When booting a system with "page_owner=on",
> >>
> >> start_kernel
> >> page_ext_init
> >> invoke_init_callbacks
> >> init_section_page_ext
> >> init_page_owner
> >> init_early_allocated_pages
> >> init_zones_in_node
> >> init_pages_in_zone
> >> lookup_page_ext
> >> page_to_nid
> >>
> >> The issue here is that page_to_nid() will not work since some page
> >> flags have no node information until later in page_alloc_init_late() due
> >> to DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds
> >> access with an invalid nid.
> >>
> >> [ 8.666047] UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50
> >> [ 8.672603] index 7 is out of range for type 'zone [5]'
> >>
> >> Also, kernel will panic since flags were poisoned earlier with,
> >>
> >> CONFIG_DEBUG_VM_PGFLAGS=y
> >> CONFIG_NODE_NOT_IN_PAGE_FLAGS=n
> >>
> >> start_kernel
> >> setup_arch
> >> pagetable_init
> >> paging_init
> >> sparse_init
> >> sparse_init_nid
> >> memblock_alloc_try_nid_raw
> >>
> >> Although later it tries to set page flags for pages in reserved bootmem
> >> regions,
> >>
> >> mm_init
> >> mem_init
> >> memblock_free_all
> >> free_low_memory_core_early
> >> reserve_bootmem_region
> >>
> >> there could still have some freed pages from the page allocator but yet
> >> to be initialized due to DEFERRED_STRUCT_PAGE_INIT. It have already been
> >> dealt with a bit in page_ext_init().
> >>
> >> * Take into account DEFERRED_STRUCT_PAGE_INIT.
> >> */
> >> if (early_pfn_to_nid(pfn) != nid)
> >> continue;
> >>
> >> However, it did not handle it well in init_pages_in_zone() which end up
> >> calling page_to_nid().
> >>
> >> [ 11.917212] page:ffffea0004200000 is uninitialized and poisoned
> >> [ 11.917220] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
> >> ffffffffffffffff
> >> [ 11.921745] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
> >> ffffffffffffffff
> >> [ 11.924523] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> >> [ 11.926498] page_owner info is not active (free page?)
> >> [ 12.329560] kernel BUG at include/linux/mm.h:990!
> >> [ 12.337632] RIP: 0010:init_page_owner+0x486/0x520
> >>
> >> Since there is no other routines depend on page_ext_init() in
> >> start_kernel(), just move it after page_alloc_init_late() to ensure that
> >> there is no deferred pages need to de dealt with. If deselected
> >> DEFERRED_STRUCT_PAGE_INIT, it is still better to call page_ext_init()
> >> earlier, so page owner could catch more early page allocation call
> >> sites. This gives us a good compromise between catching good and bad
> >> call sites (See the v1 patch [1]) in case of DEFERRED_STRUCT_PAGE_INIT.
> >>
> >> [1] https://lore.kernel.org/lkml/[email protected]/
> >>
> >> Fixes: fe53ca54270 (mm: use early_pfn_to_nid in page_ext_init)
> >> Suggested-by: Michal Hocko <[email protected]>
> >> Signed-off-by: Qian Cai <[email protected]>
> >> ---
> >>
> >> v3: still call page_ext_init() earlier if DEFERRED_STRUCT_PAGE_INIT=n.
> >>
> >> v2: postpone page_ext_init() to after page_alloc_init_late().
> >>
> >> init/main.c | 5 +++++
> >> mm/page_ext.c | 3 +--
> >> 2 files changed, 6 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/init/main.c b/init/main.c
> >> index 2b7b7fe173c9..5d9904370f76 100644
> >> --- a/init/main.c
> >> +++ b/init/main.c
> >> @@ -696,7 +696,9 @@ asmlinkage __visible void __init start_kernel(void)
> >> initrd_start = 0;
> >> }
> >> #endif
> >> +#ifndef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> >> page_ext_init();
> >> +#endif
> >> kmemleak_init();
> >> setup_per_cpu_pageset();
> >> numa_policy_init();
> >> @@ -1147,6 +1149,9 @@ static noinline void __init kernel_init_freeable(void)
> >> sched_init_smp();
> >>
> >> page_alloc_init_late();
> >> +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> >> + page_ext_init();
> >> +#endif
> >>
> >> do_basic_setup();
> >
> > Is this really necessary? Why cannot we simply postpone page_ext_init
> > unconditioanally?
>
> As mentioned above, "If deselected DEFERRED_STRUCT_PAGE_INIT, it is still better
> to call page_ext_init() earlier, so page owner could catch more early page
> allocation call sites."

Do you have any numbers to show how many allocation are we losing that
way? In other words, do we care enough to create an ugly code?

> >> diff --git a/mm/page_ext.c b/mm/page_ext.c
> >> index ae44f7adbe07..d76fd51e312a 100644
> >> --- a/mm/page_ext.c
> >> +++ b/mm/page_ext.c
> >> @@ -399,9 +399,8 @@ void __init page_ext_init(void)
> >> * -------------pfn-------------->
> >> * N0 | N1 | N2 | N0 | N1 | N2|....
> >> *
> >> - * Take into account DEFERRED_STRUCT_PAGE_INIT.
> >> */
> >> - if (early_pfn_to_nid(pfn) != nid)
> >> + if (pfn_to_nid(pfn) != nid)
> >> continue;
> >> if (init_section_page_ext(pfn, nid))
> >> goto oom;
> >
> > Also this doesn't seem to be related, right?
>
> No, it is related. Because of this patch, page_ext_init() is called after all
> the memory has already been initialized,
> so no longer necessary to call early_pfn_to_nid().

Yes, but it looks like a follow up cleanup/optimization to me.
--
Michal Hocko
SUSE Labs

2019-01-03 23:39:35

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH v3] mm/page_owner: fix for deferred struct page init

On 1/3/19 11:59 AM, Michal Hocko wrote:
>> As mentioned above, "If deselected DEFERRED_STRUCT_PAGE_INIT, it is still better
>> to call page_ext_init() earlier, so page owner could catch more early page
>> allocation call sites."
>
> Do you have any numbers to show how many allocation are we losing that
> way? In other words, do we care enough to create an ugly code?

Well, I don't have any numbers, but I read that Joonsoo did not really like to
defer page_ext_init() unconditionally.

"because deferring page_ext_init() would make page owner which uses page_ext
miss some early page allocation callsites. Although it already miss some early
page allocation callsites, we don't need to miss more."

https://lore.kernel.org/lkml/20160524053714.GB32186@js1304-P5Q-DELUXE/

>>>> diff --git a/mm/page_ext.c b/mm/page_ext.c
>>>> index ae44f7adbe07..d76fd51e312a 100644
>>>> --- a/mm/page_ext.c
>>>> +++ b/mm/page_ext.c
>>>> @@ -399,9 +399,8 @@ void __init page_ext_init(void)
>>>> * -------------pfn-------------->
>>>> * N0 | N1 | N2 | N0 | N1 | N2|....
>>>> *
>>>> - * Take into account DEFERRED_STRUCT_PAGE_INIT.
>>>> */
>>>> - if (early_pfn_to_nid(pfn) != nid)
>>>> + if (pfn_to_nid(pfn) != nid)
>>>> continue;
>>>> if (init_section_page_ext(pfn, nid))
>>>> goto oom;
>>>
>>> Also this doesn't seem to be related, right?
>>
>> No, it is related. Because of this patch, page_ext_init() is called after all
>> the memory has already been initialized,
>> so no longer necessary to call early_pfn_to_nid().
>
> Yes, but it looks like a follow up cleanup/optimization to me.

That early_pfn_to_nid() was introduced in fe53ca54270 (mm: use early_pfn_to_nid
in page_ext_init) which also messed up the order of page_ext_init() in
start_kernel(), so this patch basically revert that commit.

2019-01-03 23:58:51

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v3] mm/page_owner: fix for deferred struct page init

On Thu 03-01-19 12:38:59, Qian Cai wrote:
> On 1/3/19 11:59 AM, Michal Hocko wrote:
> >> As mentioned above, "If deselected DEFERRED_STRUCT_PAGE_INIT, it is still better
> >> to call page_ext_init() earlier, so page owner could catch more early page
> >> allocation call sites."
> >
> > Do you have any numbers to show how many allocation are we losing that
> > way? In other words, do we care enough to create an ugly code?
>
> Well, I don't have any numbers, but I read that Joonsoo did not really like to
> defer page_ext_init() unconditionally.
>
> "because deferring page_ext_init() would make page owner which uses page_ext
> miss some early page allocation callsites. Although it already miss some early
> page allocation callsites, we don't need to miss more."

This is quite unspecific.

> https://lore.kernel.org/lkml/20160524053714.GB32186@js1304-P5Q-DELUXE/
>
> >>>> diff --git a/mm/page_ext.c b/mm/page_ext.c
> >>>> index ae44f7adbe07..d76fd51e312a 100644
> >>>> --- a/mm/page_ext.c
> >>>> +++ b/mm/page_ext.c
> >>>> @@ -399,9 +399,8 @@ void __init page_ext_init(void)
> >>>> * -------------pfn-------------->
> >>>> * N0 | N1 | N2 | N0 | N1 | N2|....
> >>>> *
> >>>> - * Take into account DEFERRED_STRUCT_PAGE_INIT.
> >>>> */
> >>>> - if (early_pfn_to_nid(pfn) != nid)
> >>>> + if (pfn_to_nid(pfn) != nid)
> >>>> continue;
> >>>> if (init_section_page_ext(pfn, nid))
> >>>> goto oom;
> >>>
> >>> Also this doesn't seem to be related, right?
> >>
> >> No, it is related. Because of this patch, page_ext_init() is called after all
> >> the memory has already been initialized,
> >> so no longer necessary to call early_pfn_to_nid().
> >
> > Yes, but it looks like a follow up cleanup/optimization to me.
>
> That early_pfn_to_nid() was introduced in fe53ca54270 (mm: use early_pfn_to_nid
> in page_ext_init) which also messed up the order of page_ext_init() in
> start_kernel(), so this patch basically revert that commit.

So can we make the revert with an explanation that the patch was wrong?
If we want to make hacks to catch more objects to be tracked then it
would be great to have some numbers in hands.
--
Michal Hocko
SUSE Labs

2019-01-04 00:26:17

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH v3] mm/page_owner: fix for deferred struct page init

On 1/3/19 2:07 PM, Michal Hocko wrote> So can we make the revert with an
explanation that the patch was wrong?
> If we want to make hacks to catch more objects to be tracked then it
> would be great to have some numbers in hands.

Well, those numbers are subject to change depends on future start_kernel()
order. Right now, there are many functions could be caught earlier by page owner.

kmemleak_init();
debug_objects_mem_init();
setup_per_cpu_pageset();
numa_policy_init();
acpi_early_init();
if (late_time_init)
late_time_init();
sched_clock_init();
calibrate_delay();
pid_idr_init();
anon_vma_init();
#ifdef CONFIG_X86
if (efi_enabled(EFI_RUNTIME_SERVICES))
efi_enter_virtual_mode();
#endif
thread_stack_cache_init();
cred_init();
fork_init();
proc_caches_init();
uts_ns_init();
buffer_init();
key_init();
security_init();
dbg_late_init();
vfs_caches_init();
pagecache_init();
signals_init();
seq_file_init();
proc_root_init();
nsfs_init();
cpuset_init();
cgroup_init();
taskstats_init_early();
delayacct_init();

check_bugs();

acpi_subsystem_init();
arch_post_acpi_subsys_init();
sfi_init_late();

if (efi_enabled(EFI_RUNTIME_SERVICES)) {
efi_free_boot_services();

rcu_scheduler_starting();
/*
* Wait until kthreadd is all set-up.
*/
wait_for_completion(&kthreadd_done);

/* Now the scheduler is fully set up and can do blocking allocations */
gfp_allowed_mask = __GFP_BITS_MASK;

/*
* init can allocate pages on any node
*/
set_mems_allowed(node_states[N_MEMORY]);

cad_pid = task_pid(current);

smp_prepare_cpus(setup_max_cpus);

workqueue_init();

init_mm_internals();

do_pre_smp_initcalls();
lockup_detector_init();

smp_init();
sched_init_smp();

2019-01-04 00:36:27

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v3] mm/page_owner: fix for deferred struct page init

On Thu 03-01-19 14:53:47, Qian Cai wrote:
> On 1/3/19 2:07 PM, Michal Hocko wrote> So can we make the revert with an
> explanation that the patch was wrong?
> > If we want to make hacks to catch more objects to be tracked then it
> > would be great to have some numbers in hands.
>
> Well, those numbers are subject to change depends on future start_kernel()
> order. Right now, there are many functions could be caught earlier by page owner.
>
> kmemleak_init();
[...]
> sched_init_smp();

The kernel source dump will not tell us much of course. A ball park
number whether we are talking about dozen, hundreds or thousands of
allocations would tell us something at least, doesn't it.

Handwaving that it might help us some is not particurarly useful. We are
already losing some allocations already. Does it matter? Well, that
depends, sometimes we do want to catch an owner of particular page and
it is sad to find nothing. But how many times have you or somebody else
encountered that in practice. That is exactly a useful information to
judge an ugly ifdefery in the code. See my point?

--
Michal Hocko
SUSE Labs

2019-01-04 03:28:12

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH v3] mm/page_owner: fix for deferred struct page init

On 1/3/19 3:22 PM, Michal Hocko wrote:
> On Thu 03-01-19 14:53:47, Qian Cai wrote:
>> On 1/3/19 2:07 PM, Michal Hocko wrote> So can we make the revert with an
>> explanation that the patch was wrong?
>>> If we want to make hacks to catch more objects to be tracked then it
>>> would be great to have some numbers in hands.
>>
>> Well, those numbers are subject to change depends on future start_kernel()
>> order. Right now, there are many functions could be caught earlier by page owner.
>>
>> kmemleak_init();
> [...]
>> sched_init_smp();
>
> The kernel source dump will not tell us much of course. A ball park
> number whether we are talking about dozen, hundreds or thousands of
> allocations would tell us something at least, doesn't it.
>
> Handwaving that it might help us some is not particurarly useful. We are
> already losing some allocations already. Does it matter? Well, that
> depends, sometimes we do want to catch an owner of particular page and
> it is sad to find nothing. But how many times have you or somebody else
> encountered that in practice. That is exactly a useful information to
> judge an ugly ifdefery in the code. See my point?

Here is the number without DEFERRED_STRUCT_PAGE_INIT.

== page_ext_init() after page_alloc_init_late() ==
Node 0, zone DMA: page owner found early allocated 0 pages
Node 0, zone DMA32: page owner found early allocated 7009 pages
Node 0, zone Normal: page owner found early allocated 85827 pages
Node 4, zone Normal: page owner found early allocated 75063 pages

== page_ext_init() before kmemleak_init() ==
Node 0, zone DMA: page owner found early allocated 0 pages
Node 0, zone DMA32: page owner found early allocated 6654 pages
Node 0, zone Normal: page owner found early allocated 41907 pages
Node 4, zone Normal: page owner found early allocated 41356 pages

So, it told us that it will miss tens of thousands of early page allocation call
sites.

2019-01-04 16:50:11

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v3] mm/page_owner: fix for deferred struct page init

On Thu 03-01-19 17:22:29, Qian Cai wrote:
> On 1/3/19 3:22 PM, Michal Hocko wrote:
> > On Thu 03-01-19 14:53:47, Qian Cai wrote:
> >> On 1/3/19 2:07 PM, Michal Hocko wrote> So can we make the revert with an
> >> explanation that the patch was wrong?
> >>> If we want to make hacks to catch more objects to be tracked then it
> >>> would be great to have some numbers in hands.
> >>
> >> Well, those numbers are subject to change depends on future start_kernel()
> >> order. Right now, there are many functions could be caught earlier by page owner.
> >>
> >> kmemleak_init();
> > [...]
> >> sched_init_smp();
> >
> > The kernel source dump will not tell us much of course. A ball park
> > number whether we are talking about dozen, hundreds or thousands of
> > allocations would tell us something at least, doesn't it.
> >
> > Handwaving that it might help us some is not particurarly useful. We are
> > already losing some allocations already. Does it matter? Well, that
> > depends, sometimes we do want to catch an owner of particular page and
> > it is sad to find nothing. But how many times have you or somebody else
> > encountered that in practice. That is exactly a useful information to
> > judge an ugly ifdefery in the code. See my point?
>
> Here is the number without DEFERRED_STRUCT_PAGE_INIT.
>
> == page_ext_init() after page_alloc_init_late() ==
> Node 0, zone DMA: page owner found early allocated 0 pages
> Node 0, zone DMA32: page owner found early allocated 7009 pages
> Node 0, zone Normal: page owner found early allocated 85827 pages
> Node 4, zone Normal: page owner found early allocated 75063 pages
>
> == page_ext_init() before kmemleak_init() ==
> Node 0, zone DMA: page owner found early allocated 0 pages
> Node 0, zone DMA32: page owner found early allocated 6654 pages
> Node 0, zone Normal: page owner found early allocated 41907 pages
> Node 4, zone Normal: page owner found early allocated 41356 pages
>
> So, it told us that it will miss tens of thousands of early page allocation call
> sites.

This is an answer for the first part of the question (how much). The
second is _do_we_care_?

--
Michal Hocko
SUSE Labs

2019-01-04 17:25:20

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v3] mm/page_owner: fix for deferred struct page init

On Fri 04-01-19 10:01:40, Qian Cai wrote:
> On 1/4/19 8:09 AM, Michal Hocko wrote:
> >> Here is the number without DEFERRED_STRUCT_PAGE_INIT.
> >>
> >> == page_ext_init() after page_alloc_init_late() ==
> >> Node 0, zone DMA: page owner found early allocated 0 pages
> >> Node 0, zone DMA32: page owner found early allocated 7009 pages
> >> Node 0, zone Normal: page owner found early allocated 85827 pages
> >> Node 4, zone Normal: page owner found early allocated 75063 pages
> >>
> >> == page_ext_init() before kmemleak_init() ==
> >> Node 0, zone DMA: page owner found early allocated 0 pages
> >> Node 0, zone DMA32: page owner found early allocated 6654 pages
> >> Node 0, zone Normal: page owner found early allocated 41907 pages
> >> Node 4, zone Normal: page owner found early allocated 41356 pages
> >>
> >> So, it told us that it will miss tens of thousands of early page allocation call
> >> sites.
> >
> > This is an answer for the first part of the question (how much). The
> > second is _do_we_care_?
>
> Well, the purpose of this simple "ugly" ifdef is to avoid a regression for the
> existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected that would
> start to miss tens of thousands early page allocation call sites.

I am pretty sure we will hear about that when that happens. And act
accordingly.

> The other option I can think of to not hurt your eyes is to rewrite the whole
> page_ext_init(), init_page_owner(), init_debug_guardpage() to use all early
> functions, so it can work in both with DEFERRED_STRUCT_PAGE_INIT=y and without.
> However, I have a hard-time to convince myself it is a sensible thing to do.

Or simply make the page_owner initialization only touch the already
initialized memory. Have you explored that option as well?

Look, I am trying to push for a clean solution. Hacks I have seen so
far are not convincing. You have identified a regression and as such I
would consider the most straightforward to revert the buggy commit. If
you want to improve the situation then I would suggest to think about
something that is more robust than ifdefed hacks. It might be more work
but it also might be a better thing long term.

If you think I am asking too much then you are free to ignore my
opinion. I am not a maintainer of the page_owner code.
--
Michal Hocko
SUSE Labs

2019-01-04 17:26:06

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH v3] mm/page_owner: fix for deferred struct page init

On 1/4/19 10:17 AM, Michal Hocko wrote:
> On Fri 04-01-19 10:01:40, Qian Cai wrote:
>> On 1/4/19 8:09 AM, Michal Hocko wrote:
>>>> Here is the number without DEFERRED_STRUCT_PAGE_INIT.
>>>>
>>>> == page_ext_init() after page_alloc_init_late() ==
>>>> Node 0, zone DMA: page owner found early allocated 0 pages
>>>> Node 0, zone DMA32: page owner found early allocated 7009 pages
>>>> Node 0, zone Normal: page owner found early allocated 85827 pages
>>>> Node 4, zone Normal: page owner found early allocated 75063 pages
>>>>
>>>> == page_ext_init() before kmemleak_init() ==
>>>> Node 0, zone DMA: page owner found early allocated 0 pages
>>>> Node 0, zone DMA32: page owner found early allocated 6654 pages
>>>> Node 0, zone Normal: page owner found early allocated 41907 pages
>>>> Node 4, zone Normal: page owner found early allocated 41356 pages
>>>>
>>>> So, it told us that it will miss tens of thousands of early page allocation call
>>>> sites.
>>>
>>> This is an answer for the first part of the question (how much). The
>>> second is _do_we_care_?
>>
>> Well, the purpose of this simple "ugly" ifdef is to avoid a regression for the
>> existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected that would
>> start to miss tens of thousands early page allocation call sites.
>
> I am pretty sure we will hear about that when that happens. And act
> accordingly.
>
>> The other option I can think of to not hurt your eyes is to rewrite the whole
>> page_ext_init(), init_page_owner(), init_debug_guardpage() to use all early
>> functions, so it can work in both with DEFERRED_STRUCT_PAGE_INIT=y and without.
>> However, I have a hard-time to convince myself it is a sensible thing to do.
>
> Or simply make the page_owner initialization only touch the already
> initialized memory. Have you explored that option as well?

Yes, a proof-of-concept version is v1 where ends up with more ifdefs due to
dealing with all the low-level details,

https://lore.kernel.org/lkml/[email protected]/



2019-01-04 17:30:24

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH v3] mm/page_owner: fix for deferred struct page init

On 1/4/19 8:09 AM, Michal Hocko wrote:
>> Here is the number without DEFERRED_STRUCT_PAGE_INIT.
>>
>> == page_ext_init() after page_alloc_init_late() ==
>> Node 0, zone DMA: page owner found early allocated 0 pages
>> Node 0, zone DMA32: page owner found early allocated 7009 pages
>> Node 0, zone Normal: page owner found early allocated 85827 pages
>> Node 4, zone Normal: page owner found early allocated 75063 pages
>>
>> == page_ext_init() before kmemleak_init() ==
>> Node 0, zone DMA: page owner found early allocated 0 pages
>> Node 0, zone DMA32: page owner found early allocated 6654 pages
>> Node 0, zone Normal: page owner found early allocated 41907 pages
>> Node 4, zone Normal: page owner found early allocated 41356 pages
>>
>> So, it told us that it will miss tens of thousands of early page allocation call
>> sites.
>
> This is an answer for the first part of the question (how much). The
> second is _do_we_care_?

Well, the purpose of this simple "ugly" ifdef is to avoid a regression for the
existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected that would
start to miss tens of thousands early page allocation call sites.

The other option I can think of to not hurt your eyes is to rewrite the whole
page_ext_init(), init_page_owner(), init_debug_guardpage() to use all early
functions, so it can work in both with DEFERRED_STRUCT_PAGE_INIT=y and without.
However, I have a hard-time to convince myself it is a sensible thing to do.

2019-01-04 17:30:40

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v3] mm/page_owner: fix for deferred struct page init

On Fri 04-01-19 10:25:12, Qian Cai wrote:
> On 1/4/19 10:17 AM, Michal Hocko wrote:
> > On Fri 04-01-19 10:01:40, Qian Cai wrote:
> >> On 1/4/19 8:09 AM, Michal Hocko wrote:
> >>>> Here is the number without DEFERRED_STRUCT_PAGE_INIT.
> >>>>
> >>>> == page_ext_init() after page_alloc_init_late() ==
> >>>> Node 0, zone DMA: page owner found early allocated 0 pages
> >>>> Node 0, zone DMA32: page owner found early allocated 7009 pages
> >>>> Node 0, zone Normal: page owner found early allocated 85827 pages
> >>>> Node 4, zone Normal: page owner found early allocated 75063 pages
> >>>>
> >>>> == page_ext_init() before kmemleak_init() ==
> >>>> Node 0, zone DMA: page owner found early allocated 0 pages
> >>>> Node 0, zone DMA32: page owner found early allocated 6654 pages
> >>>> Node 0, zone Normal: page owner found early allocated 41907 pages
> >>>> Node 4, zone Normal: page owner found early allocated 41356 pages
> >>>>
> >>>> So, it told us that it will miss tens of thousands of early page allocation call
> >>>> sites.
> >>>
> >>> This is an answer for the first part of the question (how much). The
> >>> second is _do_we_care_?
> >>
> >> Well, the purpose of this simple "ugly" ifdef is to avoid a regression for the
> >> existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected that would
> >> start to miss tens of thousands early page allocation call sites.
> >
> > I am pretty sure we will hear about that when that happens. And act
> > accordingly.
> >
> >> The other option I can think of to not hurt your eyes is to rewrite the whole
> >> page_ext_init(), init_page_owner(), init_debug_guardpage() to use all early
> >> functions, so it can work in both with DEFERRED_STRUCT_PAGE_INIT=y and without.
> >> However, I have a hard-time to convince myself it is a sensible thing to do.
> >
> > Or simply make the page_owner initialization only touch the already
> > initialized memory. Have you explored that option as well?
>
> Yes, a proof-of-concept version is v1 where ends up with more ifdefs due to
> dealing with all the low-level details,
>
> https://lore.kernel.org/lkml/[email protected]/

That is obviously not what I've had in mind. We have __init_single_page
which initializes a single struct page. Is there any way to hook
page_ext initialization there?

--
Michal Hocko
SUSE Labs

2019-01-04 20:19:39

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH v3] mm/page_owner: fix for deferred struct page init

On 1/4/19 10:32 AM, Michal Hocko wrote:
> On Fri 04-01-19 10:25:12, Qian Cai wrote:
>> On 1/4/19 10:17 AM, Michal Hocko wrote:
>>> On Fri 04-01-19 10:01:40, Qian Cai wrote:
>>>> On 1/4/19 8:09 AM, Michal Hocko wrote:
>>>>>> Here is the number without DEFERRED_STRUCT_PAGE_INIT.
>>>>>>
>>>>>> == page_ext_init() after page_alloc_init_late() ==
>>>>>> Node 0, zone DMA: page owner found early allocated 0 pages
>>>>>> Node 0, zone DMA32: page owner found early allocated 7009 pages
>>>>>> Node 0, zone Normal: page owner found early allocated 85827 pages
>>>>>> Node 4, zone Normal: page owner found early allocated 75063 pages
>>>>>>
>>>>>> == page_ext_init() before kmemleak_init() ==
>>>>>> Node 0, zone DMA: page owner found early allocated 0 pages
>>>>>> Node 0, zone DMA32: page owner found early allocated 6654 pages
>>>>>> Node 0, zone Normal: page owner found early allocated 41907 pages
>>>>>> Node 4, zone Normal: page owner found early allocated 41356 pages
>>>>>>
>>>>>> So, it told us that it will miss tens of thousands of early page allocation call
>>>>>> sites.
>>>>>
>>>>> This is an answer for the first part of the question (how much). The
>>>>> second is _do_we_care_?
>>>>
>>>> Well, the purpose of this simple "ugly" ifdef is to avoid a regression for the
>>>> existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected that would
>>>> start to miss tens of thousands early page allocation call sites.
>>>
>>> I am pretty sure we will hear about that when that happens. And act
>>> accordingly.
>>>
>>>> The other option I can think of to not hurt your eyes is to rewrite the whole
>>>> page_ext_init(), init_page_owner(), init_debug_guardpage() to use all early
>>>> functions, so it can work in both with DEFERRED_STRUCT_PAGE_INIT=y and without.
>>>> However, I have a hard-time to convince myself it is a sensible thing to do.
>>>
>>> Or simply make the page_owner initialization only touch the already
>>> initialized memory. Have you explored that option as well?
>>
>> Yes, a proof-of-concept version is v1 where ends up with more ifdefs due to
>> dealing with all the low-level details,
>>
>> https://lore.kernel.org/lkml/[email protected]/
>
> That is obviously not what I've had in mind. We have __init_single_page
> which initializes a single struct page. Is there any way to hook
> page_ext initialization there?

Well, the current design is,

(1) allocate page_ext physical-contiguous pages for all
nodes.
(2) page owner gathers all early allocation pages.
(3) page owner is fully operational.

It may be possible to move (1) as early as possible just after vmalloc_init() in
start_kernel() because it depends on vmalloc(), but it still needs to call (2)
as there have had many early allocation pages already.

Node 0, zone DMA: page owner found early allocated 0 pages
Node 0, zone DMA32: page owner found early allocated 6654 pages
Node 0, zone Normal: page owner found early allocated 40972 pages
Node 4, zone Normal: page owner found early allocated 40968 pages

But, (2) still depends on DEFERRED_STRUCT_PAGE_INIT. To get ride of this
dependency, it may be possible to add some checking in __init_single_page():

- if not done (1), save those pages to an array and defer
page_owner early_handle initialization until done (1).

Though, I can't see any really benefit of this approach apart from "beautify"
the code.

2019-01-07 20:12:00

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v3] mm/page_owner: fix for deferred struct page init

On Fri 04-01-19 15:18:08, Qian Cai wrote:
[...]
> Though, I can't see any really benefit of this approach apart from "beautify"

This is not about beautifying! This is about making the code long term
maintainable. As you can see it is just too easy to break it with the
current scheme. And that is bad especially when the code is broken
because of an optimization.

--
Michal Hocko
SUSE Labs

2019-01-08 01:54:31

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH v3] mm/page_owner: fix for deferred struct page init



On 1/7/19 1:43 PM, Michal Hocko wrote:
> On Fri 04-01-19 15:18:08, Qian Cai wrote:
> [...]
>> Though, I can't see any really benefit of this approach apart from "beautify"
>
> This is not about beautifying! This is about making the code long term
> maintainable. As you can see it is just too easy to break it with the
> current scheme. And that is bad especially when the code is broken
> because of an optimization.
>

Understood, but the code is now fixed. If there is something fundamentally
broken in the future, it may be a good time then to create a looks like
hundred-line cleanup patch for long-term maintenance at the same time to fix
real bugs.

2019-01-08 08:21:58

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v3] mm/page_owner: fix for deferred struct page init

On Mon 07-01-19 20:53:08, Qian Cai wrote:
>
>
> On 1/7/19 1:43 PM, Michal Hocko wrote:
> > On Fri 04-01-19 15:18:08, Qian Cai wrote:
> > [...]
> >> Though, I can't see any really benefit of this approach apart from "beautify"
> >
> > This is not about beautifying! This is about making the code long term
> > maintainable. As you can see it is just too easy to break it with the
> > current scheme. And that is bad especially when the code is broken
> > because of an optimization.
> >
>
> Understood, but the code is now fixed. If there is something fundamentally
> broken in the future, it may be a good time then to create a looks like
> hundred-line cleanup patch for long-term maintenance at the same time to fix
> real bugs.

Yeah, so revert = fix and redisign the thing to make the code more
robust longterm + allow to catch more allocation. I really fail to see
why this has to be repeated several times in this thread. Really.

--
Michal Hocko
SUSE Labs

2019-01-08 13:20:32

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH v3] mm/page_owner: fix for deferred struct page init

On Tue, 2019-01-08 at 09:20 +0100, Michal Hocko wrote:
> On Mon 07-01-19 20:53:08, Qian Cai wrote:
> >
> >
> > On 1/7/19 1:43 PM, Michal Hocko wrote:
> > > On Fri 04-01-19 15:18:08, Qian Cai wrote:
> > > [...]
> > > > Though, I can't see any really benefit of this approach apart from
> > > > "beautify"
> > >
> > > This is not about beautifying! This is about making the code long term
> > > maintainable. As you can see it is just too easy to break it with the
> > > current scheme. And that is bad especially when the code is broken
> > > because of an optimization.
> > >
> >
> > Understood, but the code is now fixed. If there is something fundamentally
> > broken in the future, it may be a good time then to create a looks like
> > hundred-line cleanup patch for long-term maintenance at the same time to fix
> > real bugs.
>
> Yeah, so revert = fix and redisign the thing to make the code more
> robust longterm + allow to catch more allocation. I really fail to see
> why this has to be repeated several times in this thread. Really.
>

Again, this will introduce a immediately regression (arguably small) that
existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected that would
start to miss tens of thousands early page allocation call sites.

I think the disagreement comes from that you want to deal with this passively
rather than proactively that you said "I am pretty sure we will hear about that
when that happens. And act accordingly", but I think it is better to fix it now
rather than later with a 4-line ifdef which you don't like.

I suppose someone else needs to make a judgment call for this as we are in a
"you can't convince me and I can't convince you" situation right now.

2019-01-08 22:06:33

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v3] mm/page_owner: fix for deferred struct page init


It's unclear (to me) where we stand with this patch. Shold we proceed
with v3 for now, or is something else planned?


From: Qian Cai <[email protected]>
Subject: mm/page_owner: fix for deferred struct page init

When booting a system with "page_owner=on",

start_kernel
page_ext_init
invoke_init_callbacks
init_section_page_ext
init_page_owner
init_early_allocated_pages
init_zones_in_node
init_pages_in_zone
lookup_page_ext
page_to_nid

The issue here is that page_to_nid() will not work since some page
flags have no node information until later in page_alloc_init_late() due
to DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds
access with an invalid nid.

[ 8.666047] UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50
[ 8.672603] index 7 is out of range for type 'zone [5]'

Also, kernel will panic since flags were poisoned earlier with,

CONFIG_DEBUG_VM_PGFLAGS=y
CONFIG_NODE_NOT_IN_PAGE_FLAGS=n

start_kernel
setup_arch
pagetable_init
paging_init
sparse_init
sparse_init_nid
memblock_alloc_try_nid_raw

Although later it tries to set page flags for pages in reserved bootmem
regions,

mm_init
mem_init
memblock_free_all
free_low_memory_core_early
reserve_bootmem_region

there could still have some freed pages from the page allocator but yet
to be initialized due to DEFERRED_STRUCT_PAGE_INIT. It have already been
dealt with a bit in page_ext_init().

* Take into account DEFERRED_STRUCT_PAGE_INIT.
*/
if (early_pfn_to_nid(pfn) != nid)
continue;

However, it did not handle it well in init_pages_in_zone() which end up
calling page_to_nid().

[ 11.917212] page:ffffea0004200000 is uninitialized and poisoned
[ 11.917220] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
ffffffffffffffff
[ 11.921745] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
ffffffffffffffff
[ 11.924523] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
[ 11.926498] page_owner info is not active (free page?)
[ 12.329560] kernel BUG at include/linux/mm.h:990!
[ 12.337632] RIP: 0010:init_page_owner+0x486/0x520

Since there is no other routines depend on page_ext_init() in
start_kernel(), just move it after page_alloc_init_late() to ensure that
there is no deferred pages need to de dealt with. If deselected
DEFERRED_STRUCT_PAGE_INIT, it is still better to call page_ext_init()
earlier, so page owner could catch more early page allocation call
sites. This gives us a good compromise between catching good and bad
call sites (See the v1 patch [1]) in case of DEFERRED_STRUCT_PAGE_INIT.

[1] https://lore.kernel.org/lkml/[email protected]/

Link: http://lkml.kernel.org/r/[email protected]
Fixes: fe53ca54270 (mm: use early_pfn_to_nid in page_ext_init)
Signed-off-by: Qian Cai <[email protected]>
Suggested-by: Michal Hocko <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

init/main.c | 5 +++++
mm/page_ext.c | 3 +--
2 files changed, 6 insertions(+), 2 deletions(-)

--- a/init/main.c~mm-page_owner-fix-for-deferred-struct-page-init
+++ a/init/main.c
@@ -695,7 +695,9 @@ asmlinkage __visible void __init start_k
initrd_start = 0;
}
#endif
+#ifndef CONFIG_DEFERRED_STRUCT_PAGE_INIT
page_ext_init();
+#endif
kmemleak_init();
setup_per_cpu_pageset();
numa_policy_init();
@@ -1131,6 +1133,9 @@ static noinline void __init kernel_init_
sched_init_smp();

page_alloc_init_late();
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+ page_ext_init();
+#endif

do_basic_setup();

--- a/mm/page_ext.c~mm-page_owner-fix-for-deferred-struct-page-init
+++ a/mm/page_ext.c
@@ -399,9 +399,8 @@ void __init page_ext_init(void)
* -------------pfn-------------->
* N0 | N1 | N2 | N0 | N1 | N2|....
*
- * Take into account DEFERRED_STRUCT_PAGE_INIT.
*/
- if (early_pfn_to_nid(pfn) != nid)
+ if (pfn_to_nid(pfn) != nid)
continue;
if (init_section_page_ext(pfn, nid))
goto oom;
_



2019-01-08 22:57:03

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH v3] mm/page_owner: fix for deferred struct page init



On 1/8/19 5:02 PM, Andrew Morton wrote:
>
> It's unclear (to me) where we stand with this patch. Shold we proceed
> with v3 for now, or is something else planned?

I don't have anything else plan for this right now. Michal particular don't like
that 4-line ifdef which supposes to avoid an immediately regression (arguably
small) that existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected
that would start to miss tens of thousands early page allocation call sites. So,
feel free to choose v2 of this which has no ifdef if you agree with Michal too.
I am fine either way.


2019-01-08 23:32:30

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v3] mm/page_owner: fix for deferred struct page init

On Tue 08-01-19 17:13:41, Qian Cai wrote:
>
>
> On 1/8/19 5:02 PM, Andrew Morton wrote:
> >
> > It's unclear (to me) where we stand with this patch. Shold we proceed
> > with v3 for now, or is something else planned?
>
> I don't have anything else plan for this right now. Michal particular don't like
> that 4-line ifdef which supposes to avoid an immediately regression (arguably
> small) that existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected
> that would start to miss tens of thousands early page allocation call sites. So,
> feel free to choose v2 of this which has no ifdef if you agree with Michal too.
> I am fine either way.

Yes I would prefer to revert the faulty commit (fe53ca54270) and work on
a more robust page_owner initialization to cover earlier allocations on
top of that.

Not that I would insist but this would be a more straightforward
approach and I hope it will result in a better long term maintainable
code in the end.

--
Michal Hocko
SUSE Labs

2019-01-09 07:36:28

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2] mm/page_owner: fix for deferred struct page init

On Thu 20-12-18 13:50:31, Qian Cai wrote:
> When booting a system with "page_owner=on",
>
> start_kernel
> page_ext_init
> invoke_init_callbacks
> init_section_page_ext
> init_page_owner
> init_early_allocated_pages
> init_zones_in_node
> init_pages_in_zone
> lookup_page_ext
> page_to_nid
>
> The issue here is that page_to_nid() will not work since some page
> flags have no node information until later in page_alloc_init_late() due
> to DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds
> access with an invalid nid.
>
> [ 8.666047] UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50
> [ 8.672603] index 7 is out of range for type 'zone [5]'
>
> Also, kernel will panic since flags were poisoned earlier with,
>
> CONFIG_DEBUG_VM_PGFLAGS=y
> CONFIG_NODE_NOT_IN_PAGE_FLAGS=n
>
> start_kernel
> setup_arch
> pagetable_init
> paging_init
> sparse_init
> sparse_init_nid
> memblock_alloc_try_nid_raw
>
> Although later it tries to set page flags for pages in reserved bootmem
> regions,
>
> mm_init
> mem_init
> memblock_free_all
> free_low_memory_core_early
> reserve_bootmem_region
>
> there could still have some freed pages from the page allocator but yet
> to be initialized due to DEFERRED_STRUCT_PAGE_INIT. It have already been
> dealt with a bit in page_ext_init().
>
> * Take into account DEFERRED_STRUCT_PAGE_INIT.
> */
> if (early_pfn_to_nid(pfn) != nid)
> continue;
>
> However it did not handle it well in init_pages_in_zone() which end up
> calling page_to_nid().
>
> [ 11.917212] page:ffffea0004200000 is uninitialized and poisoned
> [ 11.917220] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
> ffffffffffffffff
> [ 11.921745] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
> ffffffffffffffff
> [ 11.924523] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> [ 11.926498] page_owner info is not active (free page?)
> [ 12.329560] kernel BUG at include/linux/mm.h:990!
> [ 12.337632] RIP: 0010:init_page_owner+0x486/0x520
>
> Since there is no other routines depend on page_ext_init() in
> start_kernel() and no real benefit to call it so early, just move it
> after page_alloc_init_late() to ensure that there is no deferred pages
> need to de dealt with.

The last paragraph should be updated. I would propose something like
this.

"
This means that assumptions behind fe53ca54270a ("mm: use
early_pfn_to_nid in page_ext_init") are incomplete. Therefore revert the
commit for now. A proper way to move the page_owner initialization to
sooner is to hook into memmap initialization.
"
>

Fixes: fe53ca54270a ("mm: use early_pfn_to_nid in page_ext_init")

> Suggested-by: Michal Hocko <[email protected]>
> Signed-off-by: Qian Cai <[email protected]>

Acked-by: Michal Hocko <[email protected]>

> ---
>
> v2: postpone init_page_ext() to after page_alloc_init_late().
>
> init/main.c | 2 +-
> mm/page_ext.c | 3 +--
> 2 files changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/init/main.c b/init/main.c
> index 2b7b7fe173c9..1aeb062b2cb7 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -696,7 +696,6 @@ asmlinkage __visible void __init start_kernel(void)
> initrd_start = 0;
> }
> #endif
> - page_ext_init();
> kmemleak_init();
> setup_per_cpu_pageset();
> numa_policy_init();
> @@ -1147,6 +1146,7 @@ static noinline void __init kernel_init_freeable(void)
> sched_init_smp();
>
> page_alloc_init_late();
> + page_ext_init();
>
> do_basic_setup();
>
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index ae44f7adbe07..d76fd51e312a 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -399,9 +399,8 @@ void __init page_ext_init(void)
> * -------------pfn-------------->
> * N0 | N1 | N2 | N0 | N1 | N2|....
> *
> - * Take into account DEFERRED_STRUCT_PAGE_INIT.
> */
> - if (early_pfn_to_nid(pfn) != nid)
> + if (pfn_to_nid(pfn) != nid)
> continue;
> if (init_section_page_ext(pfn, nid))
> goto oom;
> --
> 2.17.2 (Apple Git-113)

--
Michal Hocko
SUSE Labs

2019-01-16 11:13:57

by Qian Cai

[permalink] [raw]
Subject: [PATCH] Revert "mm: use early_pfn_to_nid in page_ext_init"

This reverts commit fe53ca54270a757f0a28ee6bf3a54d952b550ed0.

When booting a system with "page_owner=on",

start_kernel
page_ext_init
invoke_init_callbacks
init_section_page_ext
init_page_owner
init_early_allocated_pages
init_zones_in_node
init_pages_in_zone
lookup_page_ext
page_to_nid

The issue here is that page_to_nid() will not work since some page
flags have no node information until later in page_alloc_init_late() due
to DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds
access with an invalid nid.

[ 8.666047] UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50
[ 8.672603] index 7 is out of range for type 'zone [5]'

Also, kernel will panic since flags were poisoned earlier with,

CONFIG_DEBUG_VM_PGFLAGS=y
CONFIG_NODE_NOT_IN_PAGE_FLAGS=n

start_kernel
setup_arch
pagetable_init
paging_init
sparse_init
sparse_init_nid
memblock_alloc_try_nid_raw

It did not handle it well in init_pages_in_zone() which ends up calling
page_to_nid().

page:ffffea0004200000 is uninitialized and poisoned
raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
page_owner info is not active (free page?)
kernel BUG at include/linux/mm.h:990!
RIP: 0010:init_page_owner+0x486/0x520

This means that assumptions behind commit fe53ca54270a ("mm: use
early_pfn_to_nid in page_ext_init") are incomplete. Therefore, revert
the commit for now. A proper way to move the page_owner initialization
to sooner is to hook into memmap initialization.

Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Qian Cai <[email protected]>
---
init/main.c | 3 ++-
mm/page_ext.c | 4 +---
2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/init/main.c b/init/main.c
index e2e80ca3165a..c86a1c8f19f4 100644
--- a/init/main.c
+++ b/init/main.c
@@ -695,7 +695,6 @@ asmlinkage __visible void __init start_kernel(void)
initrd_start = 0;
}
#endif
- page_ext_init();
kmemleak_init();
setup_per_cpu_pageset();
numa_policy_init();
@@ -1131,6 +1130,8 @@ static noinline void __init kernel_init_freeable(void)
sched_init_smp();

page_alloc_init_late();
+ /* Initialize page ext after all struct pages are initialized. */
+ page_ext_init();

do_basic_setup();

diff --git a/mm/page_ext.c b/mm/page_ext.c
index ae44f7adbe07..8c78b8d45117 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -398,10 +398,8 @@ void __init page_ext_init(void)
* We know some arch can have a nodes layout such as
* -------------pfn-------------->
* N0 | N1 | N2 | N0 | N1 | N2|....
- *
- * Take into account DEFERRED_STRUCT_PAGE_INIT.
*/
- if (early_pfn_to_nid(pfn) != nid)
+ if (pfn_to_nid(pfn) != nid)
continue;
if (init_section_page_ext(pfn, nid))
goto oom;
--
2.17.2 (Apple Git-113)