2021-04-27 08:35:20

by Wang Wensheng

[permalink] [raw]
Subject: [PATCH] mm/sparse: Fix flags overlap in section_mem_map

The section_mem_map member of struct mem_section stores some flags and
the address of struct page array of the mem_section.

Additionally the node id of the mem_section is stored during early boot,
where the struct page array has not been allocated. In other words, the
higher bits of section_mem_map are used for two purpose, and the node id
should be clear properly after the early boot.

Currently the node id field is overlapped with the flag field and cannot
be clear properly. That overlapped bits would then be treated as
mem_section flags and may lead to unexpected side effects.

Define SECTION_NID_SHIFT using order_base_2 to ensure that the node id
field always locates after flags field. That's why the overlap occurs -
forgetting to increase SECTION_NID_SHIFT when adding new mem_section
flag.

Fixes: 326e1b8f83a4 ("mm/sparsemem: introduce a SECTION_IS_EARLY flag")
Signed-off-by: Wang Wensheng <[email protected]>
---
include/linux/mmzone.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 47946ce..b01694d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1325,7 +1325,7 @@ extern size_t mem_section_usage_size(void);
#define SECTION_TAINT_ZONE_DEVICE (1UL<<4)
#define SECTION_MAP_LAST_BIT (1UL<<5)
#define SECTION_MAP_MASK (~(SECTION_MAP_LAST_BIT-1))
-#define SECTION_NID_SHIFT 3
+#define SECTION_NID_SHIFT order_base_2(SECTION_MAP_LAST_BIT)

static inline struct page *__section_mem_map_addr(struct mem_section *section)
{
--
2.9.4


2021-04-27 09:06:46

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm/sparse: Fix flags overlap in section_mem_map

On 27.04.21 10:30, Wang Wensheng wrote:
> The section_mem_map member of struct mem_section stores some flags and
> the address of struct page array of the mem_section.
>
> Additionally the node id of the mem_section is stored during early boot,
> where the struct page array has not been allocated. In other words, the
> higher bits of section_mem_map are used for two purpose, and the node id
> should be clear properly after the early boot.
>
> Currently the node id field is overlapped with the flag field and cannot
> be clear properly. That overlapped bits would then be treated as
> mem_section flags and may lead to unexpected side effects.
>
> Define SECTION_NID_SHIFT using order_base_2 to ensure that the node id
> field always locates after flags field. That's why the overlap occurs -
> forgetting to increase SECTION_NID_SHIFT when adding new mem_section
> flag.
>
> Fixes: 326e1b8f83a4 ("mm/sparsemem: introduce a SECTION_IS_EARLY flag")
> Signed-off-by: Wang Wensheng <[email protected]>
> ---
> include/linux/mmzone.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 47946ce..b01694d 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1325,7 +1325,7 @@ extern size_t mem_section_usage_size(void);
> #define SECTION_TAINT_ZONE_DEVICE (1UL<<4)
> #define SECTION_MAP_LAST_BIT (1UL<<5)
> #define SECTION_MAP_MASK (~(SECTION_MAP_LAST_BIT-1))
> -#define SECTION_NID_SHIFT 3
> +#define SECTION_NID_SHIFT order_base_2(SECTION_MAP_LAST_BIT)
>
> static inline struct page *__section_mem_map_addr(struct mem_section *section)
> {
>

Well, all sections around during boot that have an early NID are early
... so it's not an issue with SECTION_IS_EARLY, no? I mean, it's ugly,
but not broken.

But it's an issue with SECTION_TAINT_ZONE_DEVICE, AFAIKT.
sparse_init_one_section() would leave the bit set if the nid happens to
have that bit set (e.g., node 2,3). It's semi-broken then, because we
force all pfn_to_online_page() through the slow path.


That whole section flag setting code is fragile.

--
Thanks,

David / dhildenb

Subject: Re: [PATCH] mm/sparse: Fix flags overlap in section_mem_map

On Tue, Apr 27, 2021 at 11:05:17AM +0200, David Hildenbrand wrote:
> On 27.04.21 10:30, Wang Wensheng wrote:
> > The section_mem_map member of struct mem_section stores some flags and
> > the address of struct page array of the mem_section.
> >
> > Additionally the node id of the mem_section is stored during early boot,
> > where the struct page array has not been allocated. In other words, the
> > higher bits of section_mem_map are used for two purpose, and the node id
> > should be clear properly after the early boot.
> >
> > Currently the node id field is overlapped with the flag field and cannot
> > be clear properly. That overlapped bits would then be treated as
> > mem_section flags and may lead to unexpected side effects.
> >
> > Define SECTION_NID_SHIFT using order_base_2 to ensure that the node id
> > field always locates after flags field. That's why the overlap occurs -
> > forgetting to increase SECTION_NID_SHIFT when adding new mem_section
> > flag.
> >
> > Fixes: 326e1b8f83a4 ("mm/sparsemem: introduce a SECTION_IS_EARLY flag")
> > Signed-off-by: Wang Wensheng <[email protected]>
> > ---
> > include/linux/mmzone.h | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 47946ce..b01694d 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -1325,7 +1325,7 @@ extern size_t mem_section_usage_size(void);
> > #define SECTION_TAINT_ZONE_DEVICE (1UL<<4)
> > #define SECTION_MAP_LAST_BIT (1UL<<5)
> > #define SECTION_MAP_MASK (~(SECTION_MAP_LAST_BIT-1))
> > -#define SECTION_NID_SHIFT 3
> > +#define SECTION_NID_SHIFT order_base_2(SECTION_MAP_LAST_BIT)
> > static inline struct page *__section_mem_map_addr(struct mem_section *section)
> > {
> >
>
> Well, all sections around during boot that have an early NID are early ...
> so it's not an issue with SECTION_IS_EARLY, no? I mean, it's ugly, but not
> broken.
>
> But it's an issue with SECTION_TAINT_ZONE_DEVICE, AFAIKT.
> sparse_init_one_section() would leave the bit set if the nid happens to have
> that bit set (e.g., node 2,3). It's semi-broken then, because we force all
> pfn_to_online_page() through the slow path.
>
>
> That whole section flag setting code is fragile.

Hi Wensheng, David,

This patch seems not accepted or updated yet, so what's going on?

We recently saw the exact issue on testing crash utilities with latest
kernels on 4 NUMA node system. SECTION_TAINT_ZONE_DEVICE seems to be
set wrongly, and makedumpfile could fail due to this. So we need a fix.

I thought of another approach like below before finding this thread,
so if you are fine, I'd like to submit a patch with this. And if you
like going with order_base_2() approach, I'm glad to ack this patch.

--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1358,14 +1358,15 @@ extern size_t mem_section_usage_size(void);
* which results in PFN_SECTION_SHIFT equal 6.
* To sum it up, at least 6 bits are available.
*/
+#define SECTION_MAP_LAST_SHIFT 5
#define SECTION_MARKED_PRESENT (1UL<<0)
#define SECTION_HAS_MEM_MAP (1UL<<1)
#define SECTION_IS_ONLINE (1UL<<2)
#define SECTION_IS_EARLY (1UL<<3)
#define SECTION_TAINT_ZONE_DEVICE (1UL<<4)
-#define SECTION_MAP_LAST_BIT (1UL<<5)
+#define SECTION_MAP_LAST_BIT (1UL<<SECTION_MAP_LAST_SHIFT)
#define SECTION_MAP_MASK (~(SECTION_MAP_LAST_BIT-1))
-#define SECTION_NID_SHIFT 3
+#define SECTION_NID_SHIFT SECTION_MAP_LAST_SHIFT

static inline struct page *__section_mem_map_addr(struct mem_section *section)
{

Thanks,
Naoya Horiguchi

2021-06-25 21:24:32

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH] mm/sparse: Fix flags overlap in section_mem_map

On Wed, Jun 23, 2021 at 4:10 PM HORIGUCHI NAOYA(堀口 直也)
<[email protected]> wrote:
>
> On Tue, Apr 27, 2021 at 11:05:17AM +0200, David Hildenbrand wrote:
> > On 27.04.21 10:30, Wang Wensheng wrote:
> > > The section_mem_map member of struct mem_section stores some flags and
> > > the address of struct page array of the mem_section.
> > >
> > > Additionally the node id of the mem_section is stored during early boot,
> > > where the struct page array has not been allocated. In other words, the
> > > higher bits of section_mem_map are used for two purpose, and the node id
> > > should be clear properly after the early boot.
> > >
> > > Currently the node id field is overlapped with the flag field and cannot
> > > be clear properly. That overlapped bits would then be treated as
> > > mem_section flags and may lead to unexpected side effects.
> > >
> > > Define SECTION_NID_SHIFT using order_base_2 to ensure that the node id
> > > field always locates after flags field. That's why the overlap occurs -
> > > forgetting to increase SECTION_NID_SHIFT when adding new mem_section
> > > flag.
> > >
> > > Fixes: 326e1b8f83a4 ("mm/sparsemem: introduce a SECTION_IS_EARLY flag")
> > > Signed-off-by: Wang Wensheng <[email protected]>
> > > ---
> > > include/linux/mmzone.h | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > index 47946ce..b01694d 100644
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -1325,7 +1325,7 @@ extern size_t mem_section_usage_size(void);
> > > #define SECTION_TAINT_ZONE_DEVICE (1UL<<4)
> > > #define SECTION_MAP_LAST_BIT (1UL<<5)
> > > #define SECTION_MAP_MASK (~(SECTION_MAP_LAST_BIT-1))
> > > -#define SECTION_NID_SHIFT 3
> > > +#define SECTION_NID_SHIFT order_base_2(SECTION_MAP_LAST_BIT)
> > > static inline struct page *__section_mem_map_addr(struct mem_section *section)
> > > {
> > >
> >
> > Well, all sections around during boot that have an early NID are early ...
> > so it's not an issue with SECTION_IS_EARLY, no? I mean, it's ugly, but not
> > broken.
> >
> > But it's an issue with SECTION_TAINT_ZONE_DEVICE, AFAIKT.
> > sparse_init_one_section() would leave the bit set if the nid happens to have
> > that bit set (e.g., node 2,3). It's semi-broken then, because we force all
> > pfn_to_online_page() through the slow path.
> >
> >
> > That whole section flag setting code is fragile.
>
> Hi Wensheng, David,
>
> This patch seems not accepted or updated yet, so what's going on?
>
> We recently saw the exact issue on testing crash utilities with latest
> kernels on 4 NUMA node system. SECTION_TAINT_ZONE_DEVICE seems to be
> set wrongly, and makedumpfile could fail due to this. So we need a fix.
>
> I thought of another approach like below before finding this thread,
> so if you are fine, I'd like to submit a patch with this. And if you
> like going with order_base_2() approach, I'm glad to ack this patch.
>
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1358,14 +1358,15 @@ extern size_t mem_section_usage_size(void);
> * which results in PFN_SECTION_SHIFT equal 6.
> * To sum it up, at least 6 bits are available.
> */
> +#define SECTION_MAP_LAST_SHIFT 5
> #define SECTION_MARKED_PRESENT (1UL<<0)
> #define SECTION_HAS_MEM_MAP (1UL<<1)
> #define SECTION_IS_ONLINE (1UL<<2)
> #define SECTION_IS_EARLY (1UL<<3)
> #define SECTION_TAINT_ZONE_DEVICE (1UL<<4)
> -#define SECTION_MAP_LAST_BIT (1UL<<5)
> +#define SECTION_MAP_LAST_BIT (1UL<<SECTION_MAP_LAST_SHIFT)
> #define SECTION_MAP_MASK (~(SECTION_MAP_LAST_BIT-1))
> -#define SECTION_NID_SHIFT 3
> +#define SECTION_NID_SHIFT SECTION_MAP_LAST_SHIFT

Rather than make it dynamic, why not just make it 6 directly since
that matches the comment about the maximum number of flags available.

2021-06-28 07:08:04

by Naoya Horiguchi

[permalink] [raw]
Subject: [PATCH] mm/sparse: set SECTION_NID_SHIFT to 6

On Fri, Jun 25, 2021 at 02:23:05PM -0700, Dan Williams wrote:
> On Wed, Jun 23, 2021 at 4:10 PM HORIGUCHI NAOYA(堀口 直也)
> <[email protected]> wrote:
> >
> > On Tue, Apr 27, 2021 at 11:05:17AM +0200, David Hildenbrand wrote:
> > > On 27.04.21 10:30, Wang Wensheng wrote:
> > > > The section_mem_map member of struct mem_section stores some flags and
> > > > the address of struct page array of the mem_section.
> > > >
> > > > Additionally the node id of the mem_section is stored during early boot,
> > > > where the struct page array has not been allocated. In other words, the
> > > > higher bits of section_mem_map are used for two purpose, and the node id
> > > > should be clear properly after the early boot.
> > > >
> > > > Currently the node id field is overlapped with the flag field and cannot
> > > > be clear properly. That overlapped bits would then be treated as
> > > > mem_section flags and may lead to unexpected side effects.
> > > >
> > > > Define SECTION_NID_SHIFT using order_base_2 to ensure that the node id
> > > > field always locates after flags field. That's why the overlap occurs -
> > > > forgetting to increase SECTION_NID_SHIFT when adding new mem_section
> > > > flag.
> > > >
> > > > Fixes: 326e1b8f83a4 ("mm/sparsemem: introduce a SECTION_IS_EARLY flag")
> > > > Signed-off-by: Wang Wensheng <[email protected]>
> > > > ---
> > > > include/linux/mmzone.h | 2 +-
> > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > > index 47946ce..b01694d 100644
> > > > --- a/include/linux/mmzone.h
> > > > +++ b/include/linux/mmzone.h
> > > > @@ -1325,7 +1325,7 @@ extern size_t mem_section_usage_size(void);
> > > > #define SECTION_TAINT_ZONE_DEVICE (1UL<<4)
> > > > #define SECTION_MAP_LAST_BIT (1UL<<5)
> > > > #define SECTION_MAP_MASK (~(SECTION_MAP_LAST_BIT-1))
> > > > -#define SECTION_NID_SHIFT 3
> > > > +#define SECTION_NID_SHIFT order_base_2(SECTION_MAP_LAST_BIT)
> > > > static inline struct page *__section_mem_map_addr(struct mem_section *section)
> > > > {
> > > >
> > >
> > > Well, all sections around during boot that have an early NID are early ...
> > > so it's not an issue with SECTION_IS_EARLY, no? I mean, it's ugly, but not
> > > broken.
> > >
> > > But it's an issue with SECTION_TAINT_ZONE_DEVICE, AFAIKT.
> > > sparse_init_one_section() would leave the bit set if the nid happens to have
> > > that bit set (e.g., node 2,3). It's semi-broken then, because we force all
> > > pfn_to_online_page() through the slow path.
> > >
> > >
> > > That whole section flag setting code is fragile.
> >
> > Hi Wensheng, David,
> >
> > This patch seems not accepted or updated yet, so what's going on?
> >
> > We recently saw the exact issue on testing crash utilities with latest
> > kernels on 4 NUMA node system. SECTION_TAINT_ZONE_DEVICE seems to be
> > set wrongly, and makedumpfile could fail due to this. So we need a fix.
> >
> > I thought of another approach like below before finding this thread,
> > so if you are fine, I'd like to submit a patch with this. And if you
> > like going with order_base_2() approach, I'm glad to ack this patch.
> >
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -1358,14 +1358,15 @@ extern size_t mem_section_usage_size(void);
> > * which results in PFN_SECTION_SHIFT equal 6.
> > * To sum it up, at least 6 bits are available.
> > */
> > +#define SECTION_MAP_LAST_SHIFT 5
> > #define SECTION_MARKED_PRESENT (1UL<<0)
> > #define SECTION_HAS_MEM_MAP (1UL<<1)
> > #define SECTION_IS_ONLINE (1UL<<2)
> > #define SECTION_IS_EARLY (1UL<<3)
> > #define SECTION_TAINT_ZONE_DEVICE (1UL<<4)
> > -#define SECTION_MAP_LAST_BIT (1UL<<5)
> > +#define SECTION_MAP_LAST_BIT (1UL<<SECTION_MAP_LAST_SHIFT)
> > #define SECTION_MAP_MASK (~(SECTION_MAP_LAST_BIT-1))
> > -#define SECTION_NID_SHIFT 3
> > +#define SECTION_NID_SHIFT SECTION_MAP_LAST_SHIFT
>
> Rather than make it dynamic, why not just make it 6 directly since
> that matches the comment about the maximum number of flags available.

Sounds nice to me, so here's a patch. Could you review this?

Thanks,
Naoya Horiguchi
---
From a146c9f12ae8985c8985a5861330f7528cd14fe8 Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi <[email protected]>
Date: Mon, 28 Jun 2021 15:50:37 +0900
Subject: [PATCH] mm/sparse: set SECTION_NID_SHIFT to 6

Hagio-san reported that crash utility can see bit 4 in section_mem_map
(SECTION_TAINT_ZONE_DEVICE) to be set, even if we do not use any
ZONE_DEVICE ilke pmem or HMM. This problem could break crash-related
toolsets and/or other memory analysis tools.

The root cause is that SECTION_NID_SHIFT is incorrectly set to 3,
while we use lower 5 bits for SECTION_* flags. So bit 3 and 4 can be
overlapped by sub-field for early NID, and bit 4 is unexpectedly set
on (for example) NUMA node id is 2 or 3.

To fix it, set SECTION_NID_SHIFT to 6 which is the minimum number of
available bits of section flag field.

[1]: https://github.com/crash-utility/crash/commit/0b5435e10161345cf713ed447a155a611a1b408b

Fixes: 1f90a3477df3 ("mm: teach pfn_to_online_page() about ZONE_DEVICE section collisions")
Cc: [email protected] # v5.12+
Reported-by: Kazuhito Hagio <[email protected]>
Suggested-by: Dan Williams <[email protected]>
Signed-off-by: Naoya Horiguchi <[email protected]>
---
include/linux/mmzone.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fcb535560028..d6aa2a196aeb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1357,6 +1357,7 @@ extern size_t mem_section_usage_size(void);
* worst combination is powerpc with 256k pages,
* which results in PFN_SECTION_SHIFT equal 6.
* To sum it up, at least 6 bits are available.
+ * SECTION_NID_SHIFT is set to 6 based on this fact.
*/
#define SECTION_MARKED_PRESENT (1UL<<0)
#define SECTION_HAS_MEM_MAP (1UL<<1)
@@ -1365,7 +1366,7 @@ extern size_t mem_section_usage_size(void);
#define SECTION_TAINT_ZONE_DEVICE (1UL<<4)
#define SECTION_MAP_LAST_BIT (1UL<<5)
#define SECTION_MAP_MASK (~(SECTION_MAP_LAST_BIT-1))
-#define SECTION_NID_SHIFT 3
+#define SECTION_NID_SHIFT 6

static inline struct page *__section_mem_map_addr(struct mem_section *section)
{
--
2.25.1

2021-07-06 08:37:52

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm/sparse: set SECTION_NID_SHIFT to 6

> Sounds nice to me, so here's a patch. Could you review this?
>

Hi,

sorry for the late reply, I was on vacation. Please send it as a proper
stand-alone patch next time, such that it

1. won't get silently ignored by reviewers/maintainers within a thread
2. Can easily get picked up/tested

Some minor comments below.

> Thanks,
> Naoya Horiguchi
> ---
> From a146c9f12ae8985c8985a5861330f7528cd14fe8 Mon Sep 17 00:00:00 2001
> From: Naoya Horiguchi <[email protected]>
> Date: Mon, 28 Jun 2021 15:50:37 +0900
> Subject: [PATCH] mm/sparse: set SECTION_NID_SHIFT to 6
>
> Hagio-san reported that crash utility can see bit 4 in section_mem_map
> (SECTION_TAINT_ZONE_DEVICE) to be set, even if we do not use any
> ZONE_DEVICE ilke pmem or HMM. This problem could break crash-related

s/ilke/like/

> toolsets and/or other memory analysis tools.
>

I'd rephrase this to "Having SECTION_TAINT_ZONE_DEVICE set for wrong
sections forces pfn_to_online_page() through the slow path, but doesn't
actually break the kernel. However, it can break crash-related toolsets."

However, I am not sure why it actually breaks crash? crash would have to
implement the same slow-path check and would have to double-check the
sub-section present map. Then, it should just work like
pfn_to_online_page() and not have a real issue. What am I missing?

> The root cause is that SECTION_NID_SHIFT is incorrectly set to 3,
> while we use lower 5 bits for SECTION_* flags. So bit 3 and 4 can be
> overlapped by sub-field for early NID, and bit 4 is unexpectedly set
> on (for example) NUMA node id is 2 or 3.
>
> To fix it, set SECTION_NID_SHIFT to 6 which is the minimum number of
> available bits of section flag field.
>
> [1]: https://github.com/crash-utility/crash/commit/0b5435e10161345cf713ed447a155a611a1b408b

[1] is never referenced

>
> Fixes: 1f90a3477df3 ("mm: teach pfn_to_online_page() about ZONE_DEVICE section collisions")
> Cc: [email protected] # v5.12+

^ I am not really convinced that this is a stable fix. It forces
something through the slow path, but the kernel itself is not broken, no?

> Reported-by: Kazuhito Hagio <[email protected]>
> Suggested-by: Dan Williams <[email protected]>
> Signed-off-by: Naoya Horiguchi <[email protected]>
> ---
> include/linux/mmzone.h | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index fcb535560028..d6aa2a196aeb 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1357,6 +1357,7 @@ extern size_t mem_section_usage_size(void);
> * worst combination is powerpc with 256k pages,
> * which results in PFN_SECTION_SHIFT equal 6.
> * To sum it up, at least 6 bits are available.
> + * SECTION_NID_SHIFT is set to 6 based on this fact.

I'd drop that comment or rephrase to ("once this changes, don't forget
to adjust SECTION_NID_SHIFT")

> */
> #define SECTION_MARKED_PRESENT (1UL<<0)
> #define SECTION_HAS_MEM_MAP (1UL<<1)
> @@ -1365,7 +1366,7 @@ extern size_t mem_section_usage_size(void);
> #define SECTION_TAINT_ZONE_DEVICE (1UL<<4)
> #define SECTION_MAP_LAST_BIT (1UL<<5)
> #define SECTION_MAP_MASK (~(SECTION_MAP_LAST_BIT-1))
> -#define SECTION_NID_SHIFT 3
> +#define SECTION_NID_SHIFT 6
>
> static inline struct page *__section_mem_map_addr(struct mem_section *section)
> {
>

Change itself looks correct to me.

Acked-by: David Hildenbrand <[email protected]>

--
Thanks,

David / dhildenb

2021-07-07 04:55:03

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH] mm/sparse: set SECTION_NID_SHIFT to 6

On Tue, Jul 06, 2021 at 10:36:36AM +0200, David Hildenbrand wrote:
> > Sounds nice to me, so here's a patch. Could you review this?
> >
>
> Hi,
>
> sorry for the late reply, I was on vacation. Please send it as a proper
> stand-alone patch next time, such that it
>
> 1. won't get silently ignored by reviewers/maintainers within a thread
> 2. Can easily get picked up/tested

Sorry, I'll send in a separate thread next time.

>
> Some minor comments below.
>
> > Thanks,
> > Naoya Horiguchi
> > ---
> > From a146c9f12ae8985c8985a5861330f7528cd14fe8 Mon Sep 17 00:00:00 2001
> > From: Naoya Horiguchi <[email protected]>
> > Date: Mon, 28 Jun 2021 15:50:37 +0900
> > Subject: [PATCH] mm/sparse: set SECTION_NID_SHIFT to 6
> >
> > Hagio-san reported that crash utility can see bit 4 in section_mem_map
> > (SECTION_TAINT_ZONE_DEVICE) to be set, even if we do not use any
> > ZONE_DEVICE ilke pmem or HMM. This problem could break crash-related
>
> s/ilke/like/
>
> > toolsets and/or other memory analysis tools.
> >
>
> I'd rephrase this to "Having SECTION_TAINT_ZONE_DEVICE set for wrong
> sections forces pfn_to_online_page() through the slow path, but doesn't
> actually break the kernel. However, it can break crash-related toolsets."
>
> However, I am not sure why it actually breaks crash? crash would have to
> implement the same slow-path check and would have to double-check the
> sub-section present map. Then, it should just work like pfn_to_online_page()
> and not have a real issue. What am I missing?

Because a kdump generation tool (makedumpfile in my mind) uses this field
to calculate the physical address to read. So wrong bits mean wrong address,
so kdump generation can fail. We already avoid this by fixing makedumpfile not
to check lower 5 bits, so it's not critial for us.

> > The root cause is that SECTION_NID_SHIFT is incorrectly set to 3,
> > while we use lower 5 bits for SECTION_* flags. So bit 3 and 4 can be
> > overlapped by sub-field for early NID, and bit 4 is unexpectedly set
> > on (for example) NUMA node id is 2 or 3.
> >
> > To fix it, set SECTION_NID_SHIFT to 6 which is the minimum number of
> > available bits of section flag field.
> >
> > [1]: https://github.com/crash-utility/crash/commit/0b5435e10161345cf713ed447a155a611a1b408b
>
> [1] is never referenced

I dropped this.

>
> >
> > Fixes: 1f90a3477df3 ("mm: teach pfn_to_online_page() about ZONE_DEVICE section collisions")
> > Cc: [email protected] # v5.12+
>
> ^ I am not really convinced that this is a stable fix. It forces something
> through the slow path, but the kernel itself is not broken, no?

No, it's not broken, but just suboptimal.
So I drop the CC to stable.

> > Reported-by: Kazuhito Hagio <[email protected]>
> > Suggested-by: Dan Williams <[email protected]>
> > Signed-off-by: Naoya Horiguchi <[email protected]>
> > ---
> > include/linux/mmzone.h | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index fcb535560028..d6aa2a196aeb 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -1357,6 +1357,7 @@ extern size_t mem_section_usage_size(void);
> > * worst combination is powerpc with 256k pages,
> > * which results in PFN_SECTION_SHIFT equal 6.
> > * To sum it up, at least 6 bits are available.
> > + * SECTION_NID_SHIFT is set to 6 based on this fact.
>
> I'd drop that comment or rephrase to ("once this changes, don't forget to
> adjust SECTION_NID_SHIFT")

OK, I drop the comment.

> > */
> > #define SECTION_MARKED_PRESENT (1UL<<0)
> > #define SECTION_HAS_MEM_MAP (1UL<<1)
> > @@ -1365,7 +1366,7 @@ extern size_t mem_section_usage_size(void);
> > #define SECTION_TAINT_ZONE_DEVICE (1UL<<4)
> > #define SECTION_MAP_LAST_BIT (1UL<<5)
> > #define SECTION_MAP_MASK (~(SECTION_MAP_LAST_BIT-1))
> > -#define SECTION_NID_SHIFT 3
> > +#define SECTION_NID_SHIFT 6
> > static inline struct page *__section_mem_map_addr(struct mem_section *section)
> > {
> >
>
> Change itself looks correct to me.
>
> Acked-by: David Hildenbrand <[email protected]>

Thank you, I'll post the update soon.

- Naoya Horiguchi