2021-01-14 01:06:28

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 0/5] mm: Fix pfn_to_online_page() with respect to ZONE_DEVICE

Changes since v3 [1]:
- Switch to "if (IS_ENABLED(CONFIG_HAVE_ARCH_PFN_VALID) &&
!pfn_valid(pfn))" (David)
- Finish collecting reviewed-bys across all patches in the series
- Drop the libnvdimm fixup, to be merged through nvdimm.git not -mm

[1]: http://lore.kernel.org/r/161052331545.1805594.2356512831689786960.stgit@dwillia2-desk3.amr.corp.intel.com

---

Andrew,

All patches in this series have been reviewed and the kbuild-robot
reports a build-success over 172 configs. They pass an updated version
of the nvdimm unit tests to exercise corner cases of
pfn_to_online_page() and get_dev_pagemap() [2], and apply cleanly to
current -next.

Please apply, thanks.

[2]: http://lore.kernel.org/r/161052209289.1804207.11599120961607513911.stgit@dwillia2-desk3.amr.corp.intel.com

---

Michal reminds that the discussion about how to ensure pfn-walkers do
not get confused by ZONE_DEVICE pages never resolved. A pfn-walker that
uses pfn_to_online_page() may inadvertently translate a pfn as online
and in the page allocator, when it is offline managed by a ZONE_DEVICE
mapping (details in Patch 3: ("mm: Teach pfn_to_online_page() about
ZONE_DEVICE section collisions")).

The 2 proposals under consideration are teach pfn_to_online_page() to be
precise in the presence of mixed-zone sections, or teach the memory-add
code to drop the System RAM associated with ZONE_DEVICE collisions. In
order to not regress memory capacity by a few 10s to 100s of MiB the
approach taken in this set is to add precision to pfn_to_online_page().

In the course of validating pfn_to_online_page() a couple other fixes
fell out:

1/ soft_offline_page() fails to drop the reference taken in the
madvise(..., MADV_SOFT_OFFLINE) case.

2/ memory_failure() uses get_dev_pagemap() to lookup ZONE_DEVICE pages,
however that mapping may contain data pages and metadata raw pfns.
Introduce pgmap_pfn_valid() to delineate the 2 types and fail the
handling of raw metadata pfns.

---

Dan Williams (5):
mm: Move pfn_to_online_page() out of line
mm: Teach pfn_to_online_page() to consider subsection validity
mm: Teach pfn_to_online_page() about ZONE_DEVICE section collisions
mm: Fix page reference leak in soft_offline_page()
mm: Fix memory_failure() handling of dax-namespace metadata


include/linux/memory_hotplug.h | 17 +---------
include/linux/memremap.h | 6 +++
include/linux/mmzone.h | 22 +++++++++----
mm/memory-failure.c | 26 +++++++++++++--
mm/memory_hotplug.c | 69 ++++++++++++++++++++++++++++++++++++++++
mm/memremap.c | 15 +++++++++
6 files changed, 128 insertions(+), 27 deletions(-)


2021-01-14 01:07:59

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 1/5] mm: Move pfn_to_online_page() out of line

pfn_to_online_page() is already too large to be a macro or an inline
function. In anticipation of further logic changes / growth, move it out
of line.

No functional change, just code movement.

Reported-by: Michal Hocko <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
include/linux/memory_hotplug.h | 17 +----------------
mm/memory_hotplug.c | 16 ++++++++++++++++
2 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 15acce5ab106..3d99de0db2dd 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -16,22 +16,7 @@ struct resource;
struct vmem_altmap;

#ifdef CONFIG_MEMORY_HOTPLUG
-/*
- * Return page for the valid pfn only if the page is online. All pfn
- * walkers which rely on the fully initialized page->flags and others
- * should use this rather than pfn_valid && pfn_to_page
- */
-#define pfn_to_online_page(pfn) \
-({ \
- struct page *___page = NULL; \
- unsigned long ___pfn = pfn; \
- unsigned long ___nr = pfn_to_section_nr(___pfn); \
- \
- if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr) && \
- pfn_valid_within(___pfn)) \
- ___page = pfn_to_page(___pfn); \
- ___page; \
-})
+struct page *pfn_to_online_page(unsigned long pfn);

/*
* Types for free bootmem stored in page->lru.next. These have to be in
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f9d57b9be8c7..55a69d4396e7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -300,6 +300,22 @@ static int check_hotplug_memory_addressable(unsigned long pfn,
return 0;
}

+/*
+ * Return page for the valid pfn only if the page is online. All pfn
+ * walkers which rely on the fully initialized page->flags and others
+ * should use this rather than pfn_valid && pfn_to_page
+ */
+struct page *pfn_to_online_page(unsigned long pfn)
+{
+ unsigned long nr = pfn_to_section_nr(pfn);
+
+ if (nr < NR_MEM_SECTIONS && online_section_nr(nr) &&
+ pfn_valid_within(pfn))
+ return pfn_to_page(pfn);
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(pfn_to_online_page);
+
/*
* Reasonably generic function for adding memory. It is
* expected that archs that support memory hotplug will

2021-01-14 01:49:02

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 5/5] mm: Fix memory_failure() handling of dax-namespace metadata

Given 'struct dev_pagemap' spans both data pages and metadata pages be
careful to consult the altmap if present to delineate metadata. In fact
the pfn_first() helper already identifies the first valid data pfn, so
export that helper for other code paths via pgmap_pfn_valid().

Other usage of get_dev_pagemap() are not a concern because those are
operating on known data pfns having been looked up by get_user_pages().
I.e. metadata pfns are never user mapped.

Fixes: 6100e34b2526 ("mm, memory_failure: Teach memory_failure() about dev_pagemap pages")
Cc: Naoya Horiguchi <[email protected]>
Cc: Andrew Morton <[email protected]>
Reported-by: David Hildenbrand <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
include/linux/memremap.h | 6 ++++++
mm/memory-failure.c | 6 ++++++
mm/memremap.c | 15 +++++++++++++++
3 files changed, 27 insertions(+)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 79c49e7f5c30..f5b464daeeca 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -137,6 +137,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap);
struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
struct dev_pagemap *pgmap);
+bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);

unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
@@ -165,6 +166,11 @@ static inline struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
return NULL;
}

+static inline bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn)
+{
+ return false;
+}
+
static inline unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
{
return 0;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 78b173c7190c..541569cb4a99 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1308,6 +1308,12 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
*/
put_page(page);

+ /* device metadata space is not recoverable */
+ if (!pgmap_pfn_valid(pgmap, pfn)) {
+ rc = -ENXIO;
+ goto out;
+ }
+
/*
* Prevent the inode from being freed while we are interrogating
* the address_space, typically this would be handled by
diff --git a/mm/memremap.c b/mm/memremap.c
index 16b2fb482da1..2455bac89506 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -80,6 +80,21 @@ static unsigned long pfn_first(struct dev_pagemap *pgmap, int range_id)
return pfn + vmem_altmap_offset(pgmap_altmap(pgmap));
}

+bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn)
+{
+ int i;
+
+ for (i = 0; i < pgmap->nr_range; i++) {
+ struct range *range = &pgmap->ranges[i];
+
+ if (pfn >= PHYS_PFN(range->start) &&
+ pfn <= PHYS_PFN(range->end))
+ return pfn >= pfn_first(pgmap, i);
+ }
+
+ return false;
+}
+
static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
{
const struct range *range = &pgmap->ranges[range_id];

2021-01-14 01:49:08

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 2/5] mm: Teach pfn_to_online_page() to consider subsection validity

pfn_section_valid() determines pfn validity on subsection granularity
where pfn_valid() may be limited to coarse section granularity.
Explicitly validate subsections after pfn_valid() succeeds.

Fixes: b13bc35193d9 ("mm/hotplug: invalid PFNs from pfn_to_online_page()")
Cc: Qian Cai <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Reported-by: David Hildenbrand <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
mm/memory_hotplug.c | 23 +++++++++++++++++++----
1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 55a69d4396e7..d0c81f7a3347 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -308,11 +308,26 @@ static int check_hotplug_memory_addressable(unsigned long pfn,
struct page *pfn_to_online_page(unsigned long pfn)
{
unsigned long nr = pfn_to_section_nr(pfn);
+ struct mem_section *ms;
+
+ if (nr >= NR_MEM_SECTIONS)
+ return NULL;
+
+ ms = __nr_to_section(nr);
+ if (!online_section(ms))
+ return NULL;
+
+ /*
+ * Save some code text when online_section() +
+ * pfn_section_valid() are sufficient.
+ */
+ if (IS_ENABLED(CONFIG_HAVE_ARCH_PFN_VALID) && !pfn_valid(pfn))
+ return NULL;
+
+ if (!pfn_section_valid(ms, pfn))
+ return NULL;

- if (nr < NR_MEM_SECTIONS && online_section_nr(nr) &&
- pfn_valid_within(pfn))
- return pfn_to_page(pfn);
- return NULL;
+ return pfn_to_page(pfn);
}
EXPORT_SYMBOL_GPL(pfn_to_online_page);


2021-01-14 01:49:36

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 4/5] mm: Fix page reference leak in soft_offline_page()

The conversion to move pfn_to_online_page() internal to
soft_offline_page() missed that the get_user_pages() reference taken by
the madvise() path needs to be dropped when pfn_to_online_page() fails.
Note the direct sysfs-path to soft_offline_page() does not perform a
get_user_pages() lookup.

When soft_offline_page() is handed a pfn_valid() &&
!pfn_to_online_page() pfn the kernel hangs at dax-device shutdown due to
a leaked reference.

Fixes: feec24a6139d ("mm, soft-offline: convert parameter to pfn")
Cc: Andrew Morton <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Michal Hocko <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
mm/memory-failure.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 5a38e9eade94..78b173c7190c 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1885,6 +1885,12 @@ static int soft_offline_free_page(struct page *page)
return rc;
}

+static void put_ref_page(struct page *page)
+{
+ if (page)
+ put_page(page);
+}
+
/**
* soft_offline_page - Soft offline a page.
* @pfn: pfn to soft-offline
@@ -1910,20 +1916,26 @@ static int soft_offline_free_page(struct page *page)
int soft_offline_page(unsigned long pfn, int flags)
{
int ret;
- struct page *page;
bool try_again = true;
+ struct page *page, *ref_page = NULL;
+
+ WARN_ON_ONCE(!pfn_valid(pfn) && (flags & MF_COUNT_INCREASED));

if (!pfn_valid(pfn))
return -ENXIO;
+ if (flags & MF_COUNT_INCREASED)
+ ref_page = pfn_to_page(pfn);
+
/* Only online pages can be soft-offlined (esp., not ZONE_DEVICE). */
page = pfn_to_online_page(pfn);
- if (!page)
+ if (!page) {
+ put_ref_page(ref_page);
return -EIO;
+ }

if (PageHWPoison(page)) {
pr_info("%s: %#lx page already poisoned\n", __func__, pfn);
- if (flags & MF_COUNT_INCREASED)
- put_page(page);
+ put_ref_page(ref_page);
return 0;
}


2021-01-14 01:50:24

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 3/5] mm: Teach pfn_to_online_page() about ZONE_DEVICE section collisions

While pfn_to_online_page() is able to determine pfn_valid() at
subsection granularity it is not able to reliably determine if a given
pfn is also online if the section is mixes ZONE_{NORMAL,MOVABLE} with
ZONE_DEVICE. This means that pfn_to_online_page() may return invalid
@page objects. For example with a memory map like:

100000000-1fbffffff : System RAM
142000000-143002e16 : Kernel code
143200000-143713fff : Kernel rodata
143800000-143b15b7f : Kernel data
144227000-144ffffff : Kernel bss
1fc000000-2fbffffff : Persistent Memory (legacy)
1fc000000-2fbffffff : namespace0.0

This command:

echo 0x1fc000000 > /sys/devices/system/memory/soft_offline_page

...succeeds when it should fail. When it succeeds it touches
an uninitialized page and may crash or cause other damage (see
dissolve_free_huge_page()).

While the memory map above is contrived via the memmap=ss!nn kernel
command line option, the collision happens in practice on shipping
platforms. The memory controller resources that decode spans of
physical address space are a limited resource. One technique
platform-firmware uses to conserve those resources is to share a decoder
across 2 devices to keep the address range contiguous. Unfortunately the
unit of operation of a decoder is 64MiB while the Linux section size is
128MiB. This results in situations where, without subsection hotplug
memory mappings with different lifetimes collide into one object that
can only express one lifetime.

Update move_pfn_range_to_zone() to flag (SECTION_TAINT_ZONE_DEVICE) a
section that mixes ZONE_DEVICE pfns with other online pfns. With
SECTION_TAINT_ZONE_DEVICE to delineate, pfn_to_online_page() can fall
back to a slow-path check for ZONE_DEVICE pfns in an online section. In
the fast path online_section() for a full ZONE_DEVICE section returns
false.

Because the collision case is rare, and for simplicity, the
SECTION_TAINT_ZONE_DEVICE flag is never cleared once set.

Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
Cc: Andrew Morton <[email protected]>
Reported-by: Michal Hocko <[email protected]>
Reported-by: David Hildenbrand <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
include/linux/mmzone.h | 22 +++++++++++++++-------
mm/memory_hotplug.c | 38 ++++++++++++++++++++++++++++++++++++++
2 files changed, 53 insertions(+), 7 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b593316bff3d..0b5c44f730b4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1273,13 +1273,14 @@ extern size_t mem_section_usage_size(void);
* which results in PFN_SECTION_SHIFT equal 6.
* To sum it up, at least 6 bits are available.
*/
-#define SECTION_MARKED_PRESENT (1UL<<0)
-#define SECTION_HAS_MEM_MAP (1UL<<1)
-#define SECTION_IS_ONLINE (1UL<<2)
-#define SECTION_IS_EARLY (1UL<<3)
-#define SECTION_MAP_LAST_BIT (1UL<<4)
-#define SECTION_MAP_MASK (~(SECTION_MAP_LAST_BIT-1))
-#define SECTION_NID_SHIFT 3
+#define SECTION_MARKED_PRESENT (1UL<<0)
+#define SECTION_HAS_MEM_MAP (1UL<<1)
+#define SECTION_IS_ONLINE (1UL<<2)
+#define SECTION_IS_EARLY (1UL<<3)
+#define SECTION_TAINT_ZONE_DEVICE (1UL<<4)
+#define SECTION_MAP_LAST_BIT (1UL<<5)
+#define SECTION_MAP_MASK (~(SECTION_MAP_LAST_BIT-1))
+#define SECTION_NID_SHIFT 3

static inline struct page *__section_mem_map_addr(struct mem_section *section)
{
@@ -1318,6 +1319,13 @@ static inline int online_section(struct mem_section *section)
return (section && (section->section_mem_map & SECTION_IS_ONLINE));
}

+static inline int online_device_section(struct mem_section *section)
+{
+ unsigned long flags = SECTION_IS_ONLINE | SECTION_TAINT_ZONE_DEVICE;
+
+ return section && ((section->section_mem_map & flags) == flags);
+}
+
static inline int online_section_nr(unsigned long nr)
{
return online_section(__nr_to_section(nr));
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d0c81f7a3347..c78a1bef561b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -308,6 +308,7 @@ static int check_hotplug_memory_addressable(unsigned long pfn,
struct page *pfn_to_online_page(unsigned long pfn)
{
unsigned long nr = pfn_to_section_nr(pfn);
+ struct dev_pagemap *pgmap;
struct mem_section *ms;

if (nr >= NR_MEM_SECTIONS)
@@ -327,6 +328,22 @@ struct page *pfn_to_online_page(unsigned long pfn)
if (!pfn_section_valid(ms, pfn))
return NULL;

+ if (!online_device_section(ms))
+ return pfn_to_page(pfn);
+
+ /*
+ * Slowpath: when ZONE_DEVICE collides with
+ * ZONE_{NORMAL,MOVABLE} within the same section some pfns in
+ * the section may be 'offline' but 'valid'. Only
+ * get_dev_pagemap() can determine sub-section online status.
+ */
+ pgmap = get_dev_pagemap(pfn, NULL);
+ put_dev_pagemap(pgmap);
+
+ /* The presence of a pgmap indicates ZONE_DEVICE offline pfn */
+ if (pgmap)
+ return NULL;
+
return pfn_to_page(pfn);
}
EXPORT_SYMBOL_GPL(pfn_to_online_page);
@@ -709,6 +726,14 @@ static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned lon
pgdat->node_spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - pgdat->node_start_pfn;

}
+
+static void section_taint_zone_device(unsigned long pfn)
+{
+ struct mem_section *ms = __pfn_to_section(pfn);
+
+ ms->section_mem_map |= SECTION_TAINT_ZONE_DEVICE;
+}
+
/*
* Associate the pfn range with the given zone, initializing the memmaps
* and resizing the pgdat/zone data to span the added pages. After this
@@ -738,6 +763,19 @@ void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
resize_pgdat_range(pgdat, start_pfn, nr_pages);
pgdat_resize_unlock(pgdat, &flags);

+ /*
+ * Subsection population requires care in pfn_to_online_page().
+ * Set the taint to enable the slow path detection of
+ * ZONE_DEVICE pages in an otherwise ZONE_{NORMAL,MOVABLE}
+ * section.
+ */
+ if (zone_idx(zone) == ZONE_DEVICE) {
+ if (!IS_ALIGNED(start_pfn, PAGES_PER_SECTION))
+ section_taint_zone_device(start_pfn);
+ if (!IS_ALIGNED(start_pfn + nr_pages, PAGES_PER_SECTION))
+ section_taint_zone_device(start_pfn + nr_pages);
+ }
+
/*
* TODO now we have a visible range of pages which are not associated
* with their zone properly. Not nice but set_pfnblock_flags_mask

Subject: Re: [PATCH v4 4/5] mm: Fix page reference leak in soft_offline_page()

On Wed, Jan 13, 2021 at 04:43:32PM -0800, Dan Williams wrote:
> The conversion to move pfn_to_online_page() internal to
> soft_offline_page() missed that the get_user_pages() reference taken by
> the madvise() path needs to be dropped when pfn_to_online_page() fails.
> Note the direct sysfs-path to soft_offline_page() does not perform a
> get_user_pages() lookup.
>
> When soft_offline_page() is handed a pfn_valid() &&
> !pfn_to_online_page() pfn the kernel hangs at dax-device shutdown due to
> a leaked reference.
>
> Fixes: feec24a6139d ("mm, soft-offline: convert parameter to pfn")
> Cc: Andrew Morton <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Reviewed-by: David Hildenbrand <[email protected]>
> Reviewed-by: Oscar Salvador <[email protected]>
> Cc: <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>

I'm OK if we don't have any other better approach, but the proposed changes
make code a little messy, and I feel that get_user_pages() might be the
right place to fix. Is get_user_pages() expected to return struct page with
holding refcount for offline valid pages? I thought that such pages are
only used by drivers for dax-devices, but that might be wrong. Can I ask for
a little more explanation from this perspective?

Thanks,
Naoya Horiguchi

> ---
> mm/memory-failure.c | 20 ++++++++++++++++----
> 1 file changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 5a38e9eade94..78b173c7190c 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1885,6 +1885,12 @@ static int soft_offline_free_page(struct page *page)
> return rc;
> }
>
> +static void put_ref_page(struct page *page)
> +{
> + if (page)
> + put_page(page);
> +}
> +
> /**
> * soft_offline_page - Soft offline a page.
> * @pfn: pfn to soft-offline
> @@ -1910,20 +1916,26 @@ static int soft_offline_free_page(struct page *page)
> int soft_offline_page(unsigned long pfn, int flags)
> {
> int ret;
> - struct page *page;
> bool try_again = true;
> + struct page *page, *ref_page = NULL;
> +
> + WARN_ON_ONCE(!pfn_valid(pfn) && (flags & MF_COUNT_INCREASED));
>
> if (!pfn_valid(pfn))
> return -ENXIO;
> + if (flags & MF_COUNT_INCREASED)
> + ref_page = pfn_to_page(pfn);
> +
> /* Only online pages can be soft-offlined (esp., not ZONE_DEVICE). */
> page = pfn_to_online_page(pfn);
> - if (!page)
> + if (!page) {
> + put_ref_page(ref_page);
> return -EIO;
> + }
>
> if (PageHWPoison(page)) {
> pr_info("%s: %#lx page already poisoned\n", __func__, pfn);
> - if (flags & MF_COUNT_INCREASED)
> - put_page(page);
> + put_ref_page(ref_page);
> return 0;
> }
>
>

Subject: Re: [PATCH v4 5/5] mm: Fix memory_failure() handling of dax-namespace metadata

On Wed, Jan 13, 2021 at 04:43:37PM -0800, Dan Williams wrote:
> Given 'struct dev_pagemap' spans both data pages and metadata pages be
> careful to consult the altmap if present to delineate metadata. In fact
> the pfn_first() helper already identifies the first valid data pfn, so
> export that helper for other code paths via pgmap_pfn_valid().
>
> Other usage of get_dev_pagemap() are not a concern because those are
> operating on known data pfns having been looked up by get_user_pages().
> I.e. metadata pfns are never user mapped.
>
> Fixes: 6100e34b2526 ("mm, memory_failure: Teach memory_failure() about dev_pagemap pages")
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Reported-by: David Hildenbrand <[email protected]>
> Reviewed-by: David Hildenbrand <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>

Looks fine to me.

Reviewed-by: Naoya Horiguchi <[email protected]>

2021-01-14 06:21:58

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v4 4/5] mm: Fix page reference leak in soft_offline_page()

On Wed, Jan 13, 2021 at 5:50 PM HORIGUCHI NAOYA(堀口 直也)
<[email protected]> wrote:
>
> On Wed, Jan 13, 2021 at 04:43:32PM -0800, Dan Williams wrote:
> > The conversion to move pfn_to_online_page() internal to
> > soft_offline_page() missed that the get_user_pages() reference taken by
> > the madvise() path needs to be dropped when pfn_to_online_page() fails.
> > Note the direct sysfs-path to soft_offline_page() does not perform a
> > get_user_pages() lookup.
> >
> > When soft_offline_page() is handed a pfn_valid() &&
> > !pfn_to_online_page() pfn the kernel hangs at dax-device shutdown due to
> > a leaked reference.
> >
> > Fixes: feec24a6139d ("mm, soft-offline: convert parameter to pfn")
> > Cc: Andrew Morton <[email protected]>
> > Cc: Naoya Horiguchi <[email protected]>
> > Cc: Michal Hocko <[email protected]>
> > Reviewed-by: David Hildenbrand <[email protected]>
> > Reviewed-by: Oscar Salvador <[email protected]>
> > Cc: <[email protected]>
> > Signed-off-by: Dan Williams <[email protected]>
>
> I'm OK if we don't have any other better approach, but the proposed changes
> make code a little messy, and I feel that get_user_pages() might be the
> right place to fix. Is get_user_pages() expected to return struct page with
> holding refcount for offline valid pages? I thought that such pages are
> only used by drivers for dax-devices, but that might be wrong. Can I ask for
> a little more explanation from this perspective?

The motivation for ZONE_DEVICE is to allow get_user_pages() for "offline" pfns.

soft_offline_page() wants to only operate on "online" pfns.

get_user_pages() on a dax-mapping returns an "offline" ZONE_DEVICE page.

When pfn_to_online_page() fails the get_user_pages() reference needs
to be dropped.

To be honest I dislike the entire flags based scheme for communicating
the fact that page reference obtained by madvise needs to be dropped
later. I'd rather pass a non-NULL 'struct page *' than redo
pfn_to_page() conversions in the leaf functions, but that's a much
larger change.

2021-01-14 06:32:10

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH v4 4/5] mm: Fix page reference leak in soft_offline_page()

On 2021-01-14 07:18, Dan Williams wrote:
> To be honest I dislike the entire flags based scheme for communicating
> the fact that page reference obtained by madvise needs to be dropped
> later. I'd rather pass a non-NULL 'struct page *' than redo
> pfn_to_page() conversions in the leaf functions, but that's a much
> larger change.

We tried to remove that flag in the past but for different reasons.
I will have another look to see what can be done.

Thanks

--
Oscar Salvador
SUSE L3

Subject: Re: [PATCH v4 4/5] mm: Fix page reference leak in soft_offline_page()

On Wed, Jan 13, 2021 at 10:18:09PM -0800, Dan Williams wrote:
> On Wed, Jan 13, 2021 at 5:50 PM HORIGUCHI NAOYA($BKY8}!!D>Li(B)
> <[email protected]> wrote:
> >
> > On Wed, Jan 13, 2021 at 04:43:32PM -0800, Dan Williams wrote:
> > > The conversion to move pfn_to_online_page() internal to
> > > soft_offline_page() missed that the get_user_pages() reference taken by
> > > the madvise() path needs to be dropped when pfn_to_online_page() fails.
> > > Note the direct sysfs-path to soft_offline_page() does not perform a
> > > get_user_pages() lookup.
> > >
> > > When soft_offline_page() is handed a pfn_valid() &&
> > > !pfn_to_online_page() pfn the kernel hangs at dax-device shutdown due to
> > > a leaked reference.
> > >
> > > Fixes: feec24a6139d ("mm, soft-offline: convert parameter to pfn")
> > > Cc: Andrew Morton <[email protected]>
> > > Cc: Naoya Horiguchi <[email protected]>
> > > Cc: Michal Hocko <[email protected]>
> > > Reviewed-by: David Hildenbrand <[email protected]>
> > > Reviewed-by: Oscar Salvador <[email protected]>
> > > Cc: <[email protected]>
> > > Signed-off-by: Dan Williams <[email protected]>
> >
> > I'm OK if we don't have any other better approach, but the proposed changes
> > make code a little messy, and I feel that get_user_pages() might be the
> > right place to fix. Is get_user_pages() expected to return struct page with
> > holding refcount for offline valid pages? I thought that such pages are
> > only used by drivers for dax-devices, but that might be wrong. Can I ask for
> > a little more explanation from this perspective?
>
> The motivation for ZONE_DEVICE is to allow get_user_pages() for "offline" pfns.

I missed this point, thank you.

>
> soft_offline_page() wants to only operate on "online" pfns.

Right.

>
> get_user_pages() on a dax-mapping returns an "offline" ZONE_DEVICE page.

OK.

>
> When pfn_to_online_page() fails the get_user_pages() reference needs
> to be dropped.

The background is clear to me, and I agree with this patch.

Reviewed-by: Naoya Horiguchi <[email protected]>

>
> To be honest I dislike the entire flags based scheme for communicating
> the fact that page reference obtained by madvise needs to be dropped
> later. I'd rather pass a non-NULL 'struct page *' than redo
> pfn_to_page() conversions in the leaf functions, but that's a much
> larger change.

As Oscar mentions in another email, removing MF_COUNT_INCREASED was
recently discussed and rejected. I think that if pfn-page conversion
layer is somehow factored out from soft/hard offline handler, code would
be more readable/maintainable.

Thanks,
Naoya Horiguchi

2021-01-17 22:06:07

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v4 4/5] mm: Fix page reference leak in soft_offline_page()

On Wed, 13 Jan 2021 16:43:32 -0800 Dan Williams <[email protected]> wrote:

> The conversion to move pfn_to_online_page() internal to
> soft_offline_page() missed that the get_user_pages() reference taken by
> the madvise() path needs to be dropped when pfn_to_online_page() fails.
> Note the direct sysfs-path to soft_offline_page() does not perform a
> get_user_pages() lookup.
>
> When soft_offline_page() is handed a pfn_valid() &&
> !pfn_to_online_page() pfn the kernel hangs at dax-device shutdown due to
> a leaked reference.
>
> Fixes: feec24a6139d ("mm, soft-offline: convert parameter to pfn")
> Cc: Andrew Morton <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Reviewed-by: David Hildenbrand <[email protected]>
> Reviewed-by: Oscar Salvador <[email protected]>
> Cc: <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>

A cc:stable patch in the middle is awkward. I'll make this a
standalone patch for merging into mainline soon (for 5.11) and shall
turn the rest into a 4-patch series, OK?

2021-01-17 23:49:21

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v4 4/5] mm: Fix page reference leak in soft_offline_page()

On Sun, Jan 17, 2021 at 2:01 PM Andrew Morton <[email protected]> wrote:
>
> On Wed, 13 Jan 2021 16:43:32 -0800 Dan Williams <[email protected]> wrote:
>
> > The conversion to move pfn_to_online_page() internal to
> > soft_offline_page() missed that the get_user_pages() reference taken by
> > the madvise() path needs to be dropped when pfn_to_online_page() fails.
> > Note the direct sysfs-path to soft_offline_page() does not perform a
> > get_user_pages() lookup.
> >
> > When soft_offline_page() is handed a pfn_valid() &&
> > !pfn_to_online_page() pfn the kernel hangs at dax-device shutdown due to
> > a leaked reference.
> >
> > Fixes: feec24a6139d ("mm, soft-offline: convert parameter to pfn")
> > Cc: Andrew Morton <[email protected]>
> > Cc: Naoya Horiguchi <[email protected]>
> > Cc: Michal Hocko <[email protected]>
> > Reviewed-by: David Hildenbrand <[email protected]>
> > Reviewed-by: Oscar Salvador <[email protected]>
> > Cc: <[email protected]>
> > Signed-off-by: Dan Williams <[email protected]>
>
> A cc:stable patch in the middle is awkward. I'll make this a
> standalone patch for merging into mainline soon (for 5.11) and shall
> turn the rest into a 4-patch series, OK?

Sounds good to me.

2021-01-20 10:46:59

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v4 1/5] mm: Move pfn_to_online_page() out of line

On Wed 13-01-21 16:43:16, Dan Williams wrote:
> pfn_to_online_page() is already too large to be a macro or an inline
> function. In anticipation of further logic changes / growth, move it out
> of line.
>
> No functional change, just code movement.
>
> Reported-by: Michal Hocko <[email protected]>

I am not sure what r-b refers to. I suspect it is the overal problem
rather than this particular patch. Not that I care much but it might get
confusing because I do not remember ever complaining about
pfn_to_online_page to be too large.

> Reviewed-by: David Hildenbrand <[email protected]>
> Reviewed-by: Oscar Salvador <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>

I do not remember any hot path which depends on pfn walk and a function
call would be clearly visible. If this ever turn out to be a problem we
can make it inline and push the heavy lifting out of line.

Acked-by: Michal Hocko <[email protected]>

Thanks!

> ---
> include/linux/memory_hotplug.h | 17 +----------------
> mm/memory_hotplug.c | 16 ++++++++++++++++
> 2 files changed, 17 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 15acce5ab106..3d99de0db2dd 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -16,22 +16,7 @@ struct resource;
> struct vmem_altmap;
>
> #ifdef CONFIG_MEMORY_HOTPLUG
> -/*
> - * Return page for the valid pfn only if the page is online. All pfn
> - * walkers which rely on the fully initialized page->flags and others
> - * should use this rather than pfn_valid && pfn_to_page
> - */
> -#define pfn_to_online_page(pfn) \
> -({ \
> - struct page *___page = NULL; \
> - unsigned long ___pfn = pfn; \
> - unsigned long ___nr = pfn_to_section_nr(___pfn); \
> - \
> - if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr) && \
> - pfn_valid_within(___pfn)) \
> - ___page = pfn_to_page(___pfn); \
> - ___page; \
> -})
> +struct page *pfn_to_online_page(unsigned long pfn);
>
> /*
> * Types for free bootmem stored in page->lru.next. These have to be in
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index f9d57b9be8c7..55a69d4396e7 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -300,6 +300,22 @@ static int check_hotplug_memory_addressable(unsigned long pfn,
> return 0;
> }
>
> +/*
> + * Return page for the valid pfn only if the page is online. All pfn
> + * walkers which rely on the fully initialized page->flags and others
> + * should use this rather than pfn_valid && pfn_to_page
> + */
> +struct page *pfn_to_online_page(unsigned long pfn)
> +{
> + unsigned long nr = pfn_to_section_nr(pfn);
> +
> + if (nr < NR_MEM_SECTIONS && online_section_nr(nr) &&
> + pfn_valid_within(pfn))
> + return pfn_to_page(pfn);
> + return NULL;
> +}
> +EXPORT_SYMBOL_GPL(pfn_to_online_page);
> +
> /*
> * Reasonably generic function for adding memory. It is
> * expected that archs that support memory hotplug will

--
Michal Hocko
SUSE Labs

2021-01-20 12:04:15

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] mm: Teach pfn_to_online_page() to consider subsection validity

On Wed 13-01-21 16:43:21, Dan Williams wrote:
> pfn_section_valid() determines pfn validity on subsection granularity
> where pfn_valid() may be limited to coarse section granularity.
> Explicitly validate subsections after pfn_valid() succeeds.

The changelog is not really clear on the underlying problem. It would
benefit from some clarification. What about something like this?
"
pfn_to_online_page is primarily used to filter out offline or fully
uninitialized pages. pfn_valid resp. online_section_nr have a coarse
per memory section granularity. If a section shared with a partially
offline memory (e.g. part of ZONE_DEVICE) then pfn_to_online_page would
lead to a false positive on some pfns. Fix this by adding
pfn_section_valid check which is subsection aware.
"

>
> Fixes: b13bc35193d9 ("mm/hotplug: invalid PFNs from pfn_to_online_page()")
> Cc: Qian Cai <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Reported-by: David Hildenbrand <[email protected]>
> Reviewed-by: David Hildenbrand <[email protected]>
> Reviewed-by: Oscar Salvador <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>

With that feel free to add
Acked-by: Michal Hocko <[email protected]>

> ---
> mm/memory_hotplug.c | 23 +++++++++++++++++++----
> 1 file changed, 19 insertions(+), 4 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 55a69d4396e7..d0c81f7a3347 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -308,11 +308,26 @@ static int check_hotplug_memory_addressable(unsigned long pfn,
> struct page *pfn_to_online_page(unsigned long pfn)
> {
> unsigned long nr = pfn_to_section_nr(pfn);
> + struct mem_section *ms;
> +
> + if (nr >= NR_MEM_SECTIONS)
> + return NULL;
> +
> + ms = __nr_to_section(nr);
> + if (!online_section(ms))
> + return NULL;
> +
> + /*
> + * Save some code text when online_section() +
> + * pfn_section_valid() are sufficient.
> + */
> + if (IS_ENABLED(CONFIG_HAVE_ARCH_PFN_VALID) && !pfn_valid(pfn))
> + return NULL;
> +
> + if (!pfn_section_valid(ms, pfn))
> + return NULL;
>
> - if (nr < NR_MEM_SECTIONS && online_section_nr(nr) &&
> - pfn_valid_within(pfn))
> - return pfn_to_page(pfn);
> - return NULL;
> + return pfn_to_page(pfn);
> }
> EXPORT_SYMBOL_GPL(pfn_to_online_page);
>
>

--
Michal Hocko
SUSE Labs

2021-01-20 12:05:50

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v4 3/5] mm: Teach pfn_to_online_page() about ZONE_DEVICE section collisions

On Wed 13-01-21 16:43:26, Dan Williams wrote:
> While pfn_to_online_page() is able to determine pfn_valid() at
> subsection granularity it is not able to reliably determine if a given
> pfn is also online if the section is mixes ZONE_{NORMAL,MOVABLE} with
> ZONE_DEVICE. This means that pfn_to_online_page() may return invalid
> @page objects. For example with a memory map like:
>
> 100000000-1fbffffff : System RAM
> 142000000-143002e16 : Kernel code
> 143200000-143713fff : Kernel rodata
> 143800000-143b15b7f : Kernel data
> 144227000-144ffffff : Kernel bss
> 1fc000000-2fbffffff : Persistent Memory (legacy)
> 1fc000000-2fbffffff : namespace0.0
>
> This command:
>
> echo 0x1fc000000 > /sys/devices/system/memory/soft_offline_page
>
> ...succeeds when it should fail. When it succeeds it touches
> an uninitialized page and may crash or cause other damage (see
> dissolve_free_huge_page()).
>
> While the memory map above is contrived via the memmap=ss!nn kernel
> command line option, the collision happens in practice on shipping
> platforms. The memory controller resources that decode spans of
> physical address space are a limited resource. One technique
> platform-firmware uses to conserve those resources is to share a decoder
> across 2 devices to keep the address range contiguous. Unfortunately the
> unit of operation of a decoder is 64MiB while the Linux section size is
> 128MiB. This results in situations where, without subsection hotplug
> memory mappings with different lifetimes collide into one object that
> can only express one lifetime.

Thank you this is a very useful insight to have in the changelog.

> Update move_pfn_range_to_zone() to flag (SECTION_TAINT_ZONE_DEVICE) a
> section that mixes ZONE_DEVICE pfns with other online pfns. With
> SECTION_TAINT_ZONE_DEVICE to delineate, pfn_to_online_page() can fall
> back to a slow-path check for ZONE_DEVICE pfns in an online section. In
> the fast path online_section() for a full ZONE_DEVICE section returns
> false.
>
> Because the collision case is rare, and for simplicity, the
> SECTION_TAINT_ZONE_DEVICE flag is never cleared once set.
>
> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
> Cc: Andrew Morton <[email protected]>
> Reported-by: Michal Hocko <[email protected]>
> Reported-by: David Hildenbrand <[email protected]>
> Reviewed-by: David Hildenbrand <[email protected]>
> Reviewed-by: Oscar Salvador <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>

Acked-by: Michal Hocko <[email protected]>

I do not want to bikeshed but online_device_section is quite confusing.
device_mixed_section would sound like a better name to me.
--
Michal Hocko
SUSE Labs