2021-12-06 18:47:36

by Minchan Kim

Subject: [RFC] mm: introduce page pinner

The contiguous memory allocation fails if one of the pages in the
requested range has an unexpected elevated reference count, since the
VM cannot migrate the page out. It's a very common pattern for CMA
allocation failure. The temporary elevated page refcount can come from
various places, and it's really hard to chase down who held the
refcount at that time, which is the vital information for debugging
the allocation failure.

This patch introduces the page pinner to keep track of who pinned the
page that caused the CMA allocation failure. It works as follows: once
the VM finds a non-migrated page after attempting migration during
contiguous allocation, it marks the page, and every page-put operation
on that page from then on emits an event trace. Since every page-put
is paired with a page-get, the page-put event trace helps to deduce
where the paired page-get originated.

The reason the feature tracks page-put instead of page-get is that
the VM cannot predict when page migration will fail, so it would have
to record every page-get on every migratable page just to dump that
information at failure time. Considering that backtraces are the vital
information and that a page's get/put is one of the hottest paths,
that is far too heavy an approach. Thus, to minimize runtime overhead,
this feature adds a new PAGE_EXT_PINNER flag under the PAGE_EXT
debugging option to mark a migration-failed page, and only tracks
page-put operations on that page after the failure.

usage:

trace_dir="/sys/kernel/tracing"
echo 1 > $trace_dir/events/page_pinner/enable
echo 1 > $trace_dir/options/stacktrace
..
run workload
..
..

cat $trace_dir/trace

<...>-498 [006] .... 33306.301621: page_pinner_failure: pfn=0x9f0bb0 flags=uptodate|lru|swapbacked count=1 mapcount=0 mapping=00000000aec7812a mt=5
<...>-498 [006] .... 33306.301625: <stack trace>
=> __page_pinner_failure
=> test_pages_isolated
=> alloc_contig_range
=> cma_alloc
=> cma_heap_allocate
=> dma_heap_ioctl
=> __arm64_sys_ioctl
=> el0_svc_common
=> do_el0_svc
=> el0_svc
=> el0_sync_handler
=> el0_sync
<...>-24965 [001] .... 33306.392836: page_pinner_put: pfn=0x9f0bb0 flags=uptodate|lru|swapbacked count=0 mapcount=0 mapping=00000000aec7812a mt=5
<...>-24965 [001] .... 33306.392846: <stack trace>
=> __page_pinner_put
=> release_pages
=> free_pages_and_swap_cache
=> tlb_flush_mmu_free
=> tlb_flush_mmu
=> zap_pte_range
=> unmap_page_range
=> unmap_vmas
=> exit_mmap
=> __mmput
=> mmput
=> exit_mm
=> do_exit
=> do_group_exit
=> get_signal
=> do_signal
=> do_notify_resume
=> work_pending
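
On a busy system the trace interleaves events for many pages. To follow
a single failed page, filter the trace by the pfn reported in the
page_pinner_failure event, for example:

grep 'pfn=0x9f0bb0' $trace_dir/trace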

Signed-off-by: Minchan Kim <[email protected]>
---
PagePinner is named after PageOwner, since I wanted to keep track of
page refcount holders. Feel free to suggest better names.
Actually, I had alloc_contig_failure tracker as a candidate.

include/linux/mm.h | 7 ++-
include/linux/page_ext.h | 3 +
include/linux/page_pinner.h | 47 ++++++++++++++++
include/trace/events/page_pinner.h | 60 ++++++++++++++++++++
mm/Kconfig.debug | 13 +++++
mm/Makefile | 1 +
mm/page_alloc.c | 3 +
mm/page_ext.c | 4 ++
mm/page_isolation.c | 3 +
mm/page_pinner.c | 90 ++++++++++++++++++++++++++++++
10 files changed, 230 insertions(+), 1 deletion(-)
create mode 100644 include/linux/page_pinner.h
create mode 100644 include/trace/events/page_pinner.h
create mode 100644 mm/page_pinner.c

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 73a52aba448f..a640cae593f9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -26,6 +26,7 @@
#include <linux/err.h>
#include <linux/page-flags.h>
#include <linux/page_ref.h>
+#include <linux/page_pinner.h>
#include <linux/memremap.h>
#include <linux/overflow.h>
#include <linux/sizes.h>
@@ -744,8 +745,12 @@ struct inode;
*/
static inline int put_page_testzero(struct page *page)
{
+ int ret;
+
VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
- return page_ref_dec_and_test(page);
+ ret = page_ref_dec_and_test(page);
+ page_pinner_put(page);
+ return ret;
}

/*
diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index fabb2e1e087f..561d8458dc5a 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -23,6 +23,9 @@ enum page_ext_flags {
PAGE_EXT_YOUNG,
PAGE_EXT_IDLE,
#endif
+#if defined(CONFIG_PAGE_PINNER)
+ PAGE_EXT_PINNER,
+#endif
};

/*
diff --git a/include/linux/page_pinner.h b/include/linux/page_pinner.h
new file mode 100644
index 000000000000..3f93a753b8e0
--- /dev/null
+++ b/include/linux/page_pinner.h
@@ -0,0 +1,47 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __LINUX_PAGE_PINNER_H
+#define __LINUX_PAGE_PINNER_H
+
+#include <linux/jump_label.h>
+
+#ifdef CONFIG_PAGE_PINNER
+extern struct static_key_false page_pinner_inited;
+extern struct page_ext_operations page_pinner_ops;
+
+void __page_pinner_failure(struct page *page);
+void __page_pinner_put(struct page *page);
+void __reset_page_pinner(struct page *page, unsigned int order);
+
+static inline void reset_page_pinner(struct page *page, unsigned int order)
+{
+ if (static_branch_unlikely(&page_pinner_inited))
+ __reset_page_pinner(page, order);
+}
+
+static inline void page_pinner_failure(struct page *page)
+{
+ if (!static_branch_unlikely(&page_pinner_inited))
+ return;
+
+ __page_pinner_failure(page);
+}
+
+static inline void page_pinner_put(struct page *page)
+{
+ if (!static_branch_unlikely(&page_pinner_inited))
+ return;
+
+ __page_pinner_put(page);
+}
+#else
+static inline void reset_page_pinner(struct page *page, unsigned int order)
+{
+}
+static inline void page_pinner_failure(struct page *page)
+{
+}
+static inline void page_pinner_put(struct page *page)
+{
+}
+#endif /* CONFIG_PAGE_PINNER */
+#endif /* __LINUX_PAGE_PINNER_H */
diff --git a/include/trace/events/page_pinner.h b/include/trace/events/page_pinner.h
new file mode 100644
index 000000000000..69ccd5c30f66
--- /dev/null
+++ b/include/trace/events/page_pinner.h
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM page_pinner
+
+#if !defined(_TRACE_PAGE_PINNER_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_PAGE_PINNER_H
+
+#include <linux/types.h>
+#include <linux/tracepoint.h>
+#include <trace/events/mmflags.h>
+
+DECLARE_EVENT_CLASS(page_pinner_template,
+
+ TP_PROTO(struct page *page),
+
+ TP_ARGS(page),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, pfn)
+ __field(unsigned long, flags)
+ __field(int, count)
+ __field(int, mapcount)
+ __field(void *, mapping)
+ __field(int, mt)
+ ),
+
+ TP_fast_assign(
+ __entry->pfn = page_to_pfn(page);
+ __entry->flags = page->flags;
+ __entry->count = page_ref_count(page);
+ __entry->mapcount = page_mapcount(page);
+ __entry->mapping = page->mapping;
+ __entry->mt = get_pageblock_migratetype(page);
+ ),
+
+ TP_printk("pfn=0x%lx flags=%s count=%d mapcount=%d mapping=%p mt=%d",
+ __entry->pfn,
+ show_page_flags(__entry->flags & ((1UL << NR_PAGEFLAGS) - 1)),
+ __entry->count,
+ __entry->mapcount, __entry->mapping, __entry->mt)
+);
+
+DEFINE_EVENT(page_pinner_template, page_pinner_failure,
+
+ TP_PROTO(struct page *page),
+
+ TP_ARGS(page)
+);
+
+DEFINE_EVENT(page_pinner_template, page_pinner_put,
+
+ TP_PROTO(struct page *page),
+
+ TP_ARGS(page)
+);
+
+#endif /* _TRACE_PAGE_PINNER_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 1e73717802f8..0ad4a3b8f4eb 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -62,6 +62,19 @@ config PAGE_OWNER

If unsure, say N.

+config PAGE_PINNER
+ bool "Track page pinner"
+ select PAGE_EXTENSION
+ depends on DEBUG_KERNEL && TRACEPOINTS
+ help
+ This keeps track of what call chain is the pinner of a page, may
+ help to find contiguous page allocation failure. Even if you include
+ this feature in your build, it is disabled by default. You should
+ pass "page_pinner=on" to boot parameter in order to enable it. Eats
+ a fair amount of memory if enabled.
+
+ If unsure, say N.
+
config PAGE_POISONING
bool "Poison pages after freeing"
help
diff --git a/mm/Makefile b/mm/Makefile
index fc60a40ce954..0c9b78b15070 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -102,6 +102,7 @@ obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
obj-$(CONFIG_DEBUG_RODATA_TEST) += rodata_test.o
obj-$(CONFIG_DEBUG_VM_PGTABLE) += debug_vm_pgtable.o
obj-$(CONFIG_PAGE_OWNER) += page_owner.o
+obj-$(CONFIG_PAGE_PINNER) += page_pinner.o
obj-$(CONFIG_CLEANCACHE) += cleancache.o
obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
obj-$(CONFIG_ZPOOL) += zpool.o
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f41a5e990ac0..6e3a6f875a40 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -63,6 +63,7 @@
#include <linux/sched/rt.h>
#include <linux/sched/mm.h>
#include <linux/page_owner.h>
+#include <linux/page_pinner.h>
#include <linux/kthread.h>
#include <linux/memcontrol.h>
#include <linux/ftrace.h>
@@ -1299,6 +1300,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
if (memcg_kmem_enabled() && PageMemcgKmem(page))
__memcg_kmem_uncharge_page(page, order);
reset_page_owner(page, order);
+ reset_page_pinner(page, order);
return false;
}

@@ -1338,6 +1340,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
page_cpupid_reset_last(page);
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
reset_page_owner(page, order);
+ reset_page_pinner(page, order);

if (!PageHighMem(page)) {
debug_check_no_locks_freed(page_address(page),
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 2a52fd9ed464..0dafe968b212 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -8,6 +8,7 @@
#include <linux/kmemleak.h>
#include <linux/page_owner.h>
#include <linux/page_idle.h>
+#include <linux/page_pinner.h>

/*
* struct page extension
@@ -75,6 +76,9 @@ static struct page_ext_operations *page_ext_ops[] = {
#if defined(CONFIG_PAGE_IDLE_FLAG) && !defined(CONFIG_64BIT)
&page_idle_ops,
#endif
+#ifdef CONFIG_PAGE_PINNER
+ &page_pinner_ops,
+#endif
};

unsigned long page_ext_size = sizeof(struct page_ext);
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index a95c2c6562d0..a9ddea1c9166 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -9,6 +9,7 @@
#include <linux/memory.h>
#include <linux/hugetlb.h>
#include <linux/page_owner.h>
+#include <linux/page_pinner.h>
#include <linux/migrate.h>
#include "internal.h"

@@ -310,6 +311,8 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,

out:
trace_test_pages_isolated(start_pfn, end_pfn, pfn);
+ if (ret < 0)
+ page_pinner_failure(pfn_to_page(pfn));

return ret;
}
diff --git a/mm/page_pinner.c b/mm/page_pinner.c
new file mode 100644
index 000000000000..300a90647557
--- /dev/null
+++ b/mm/page_pinner.c
@@ -0,0 +1,90 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/mm.h>
+#include <linux/page_pinner.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/page_pinner.h>
+
+static bool page_pinner_enabled;
+DEFINE_STATIC_KEY_FALSE(page_pinner_inited);
+EXPORT_SYMBOL(page_pinner_inited);
+
+static int __init early_page_pinner_param(char *buf)
+{
+ return kstrtobool(buf, &page_pinner_enabled);
+}
+early_param("page_pinner", early_page_pinner_param);
+
+static bool need_page_pinner(void)
+{
+ return page_pinner_enabled;
+}
+
+static void init_page_pinner(void)
+{
+ if (!page_pinner_enabled)
+ return;
+
+ static_branch_enable(&page_pinner_inited);
+}
+
+struct page_ext_operations page_pinner_ops = {
+ .need = need_page_pinner,
+ .init = init_page_pinner,
+};
+
+void __reset_page_pinner(struct page *page, unsigned int order)
+{
+ struct page_ext *page_ext;
+ int i;
+
+ page_ext = lookup_page_ext(page);
+ if (unlikely(!page_ext))
+ return;
+
+ for (i = 0; i < (1 << order); i++) {
+ if (!test_bit(PAGE_EXT_PINNER, &page_ext->flags))
+ break;
+
+ clear_bit(PAGE_EXT_PINNER, &page_ext->flags);
+ page_ext = page_ext_next(page_ext);
+ }
+}
+
+void __page_pinner_failure(struct page *page)
+{
+ struct page_ext *page_ext = lookup_page_ext(page);
+
+ if (unlikely(!page_ext))
+ return;
+
+ trace_page_pinner_failure(page);
+ test_and_set_bit(PAGE_EXT_PINNER, &page_ext->flags);
+}
+
+void __page_pinner_put(struct page *page)
+{
+ struct page_ext *page_ext = lookup_page_ext(page);
+
+ if (unlikely(!page_ext))
+ return;
+
+ if (!test_bit(PAGE_EXT_PINNER, &page_ext->flags))
+ return;
+
+ trace_page_pinner_put(page);
+}
+EXPORT_SYMBOL(__page_pinner_put);
+
+
+static int __init page_pinner_init(void)
+{
+ if (!static_branch_unlikely(&page_pinner_inited)) {
+ pr_info("page_pinner is disabled\n");
+ return 0;
+ }
+
+ pr_info("page_pinner is enabled\n");
+ return 0;
+}
+late_initcall(page_pinner_init)
--
2.34.1.400.ga245620fadb-goog



2021-12-08 12:13:14

by 胡玮文

Subject: Re: [RFC] mm: introduce page pinner

On Mon, Dec 06, 2021 at 10:47:30AM -0800, Minchan Kim wrote:
> The contiguous memory allocation fails if one of the pages in the
> requested range has an unexpected elevated reference count, since the
> VM cannot migrate the page out. It's a very common pattern for CMA
> allocation failure. The temporary elevated page refcount can come from
> various places, and it's really hard to chase down who held the
> refcount at that time, which is the vital information for debugging
> the allocation failure.

Hi Minchan,

I'm a newbie here. We are debugging a problem where every CPU core is
doing compaction but making no progress because of unexpected page
refcounts. I'm interested in your approach, but this patch seems to
cover only the CMA allocation path. Could it be extended to debug
migration failures during compaction? I'm not familiar with the kernel
codebase; here is my untested thought:

diff --git a/mm/migrate.c b/mm/migrate.c
index cf25b00f03c8..85dacbca8fa0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -46,6 +46,7 @@
#include <linux/mmu_notifier.h>
#include <linux/page_idle.h>
#include <linux/page_owner.h>
+#include <linux/page_pinner.h>
#include <linux/sched/mm.h>
#include <linux/ptrace.h>
#include <linux/oom.h>
@@ -388,8 +389,10 @@ int folio_migrate_mapping(struct address_space *mapping,

if (!mapping) {
/* Anonymous page without mapping */
- if (folio_ref_count(folio) != expected_count)
+ if (folio_ref_count(folio) != expected_count) {
+ page_pinner_failure(&folio->page);
return -EAGAIN;
+ }

/* No turning back from here */
newfolio->index = folio->index;
@@ -406,6 +409,7 @@ int folio_migrate_mapping(struct address_space *mapping,
xas_lock_irq(&xas);
if (!folio_ref_freeze(folio, expected_count)) {
xas_unlock_irq(&xas);
+ page_pinner_failure(&folio->page);
return -EAGAIN;
}

I'm not sure what to do with the new folio; it seems using folio->page
in new code is not correct.

< snip>

> diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
> index 1e73717802f8..0ad4a3b8f4eb 100644
> --- a/mm/Kconfig.debug
> +++ b/mm/Kconfig.debug
> @@ -62,6 +62,19 @@ config PAGE_OWNER
>
> If unsure, say N.
>
> +config PAGE_PINNER
> + bool "Track page pinner"
> + select PAGE_EXTENSION
> + depends on DEBUG_KERNEL && TRACEPOINTS
> + help
> + This keeps track of what call chain is the pinner of a page, may
> + help to find contiguos page allocation failure. Even if you include
> + this feature in your build, it is disabled by default. You should
> + pass "page_pinner=on" to boot parameter in order to enable it. Eats
> + a fair amount of memory if enabled.

I'm a bit confused. It seems page pinner does not allocate any
additional memory when you enable it via the boot parameter, so the
description seems inaccurate.

< snip>

More info about my compaction issue:

This call stack returns -EAGAIN in 99.9% of cases on the problematic host
(Ubuntu 20.04 with kernel 5.11.0-40):

migrate_page_move_mapping (now folio_migrate_mapping) <- returns -EAGAIN
migrate_page
fallback_migrate_page
move_to_new_page
migrate_pages
compact_zone
compact_zone_order
try_to_compact_pages
__alloc_pages_direct_compact
__alloc_pages_slowpath.constprop.0
__alloc_pages_nodemask
alloc_pages_vma
do_huge_pmd_anonymous_page
__handle_mm_fault
handle_mm_fault
do_user_addr_fault
exc_page_fault
asm_exc_page_fault

The offending pages are from shm, allocated by mmap() with MAP_SHARED by a
machine learning program. They may be related to NVIDIA CUDA, but I want
to confirm this and make improvements if possible.

When the issue reproduces, a single page fault that triggers a sync
compaction can take tens of seconds. Then all 40 CPU threads are doing
compaction, and the application runs several orders of magnitude slower.

Disabling sync compaction is a workaround (the default is "madvise"):

echo never > /sys/kernel/mm/transparent_hugepage/defrag
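
For reference, direct compaction activity can be watched via the
compact_* counters (compact_stall, compact_fail, ...) in /proc/vmstat:

grep compact /proc/vmstat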

Previously I asked for help at https://lore.kernel.org/linux-mm/[email protected]/
Now I have more information but still cannot pinpoint the root cause.

Thanks,
Hu Weiwen


2021-12-08 14:24:21

by Matthew Wilcox

Subject: Re: [RFC] mm: introduce page pinner

On Wed, Dec 08, 2021 at 07:54:35PM +0800, 胡玮文 wrote:
> @@ -406,6 +409,7 @@ int folio_migrate_mapping(struct address_space *mapping,
> xas_lock_irq(&xas);
> if (!folio_ref_freeze(folio, expected_count)) {
> xas_unlock_irq(&xas);
> + page_pinner_failure(&folio->page);
> return -EAGAIN;
> }
>
> I'm not sure what to do with the new folio; it seems using folio->page
> in new code is not correct.

It's a transitional measure. Once everything that should be converted
to folios has been converted, it can go away.

Tail pages don't have their own refcounts, so the page pinner should
actually be the folio pinner. I hadn't bothered to point this out
because we're in the early days of the folio transition and it'd be
unreasonable to criticise Minchan for not converting his patch from
pages to folios.
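
As a purely illustrative sketch (not part of the patch), a folio-aware
put hook would resolve a tail page to its head before looking up the
page_ext, along these lines:

void __page_pinner_put(struct page *page)
{
	/* tail pages share the folio's refcount, so track the head page */
	struct page *head = compound_head(page);
	struct page_ext *page_ext = lookup_page_ext(head);

	if (unlikely(!page_ext))
		return;

	if (!test_bit(PAGE_EXT_PINNER, &page_ext->flags))
		return;

	trace_page_pinner_put(head);
}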

2021-12-08 18:42:42

by Minchan Kim

Subject: Re: [RFC] mm: introduce page pinner

On Wed, Dec 08, 2021 at 07:54:35PM +0800, 胡玮文 wrote:
> On Mon, Dec 06, 2021 at 10:47:30AM -0800, Minchan Kim wrote:
> > The contiguous memory allocation fails if one of the pages in the
> > requested range has an unexpected elevated reference count, since the
> > VM cannot migrate the page out. It's a very common pattern for CMA
> > allocation failure. The temporary elevated page refcount can come from
> > various places, and it's really hard to chase down who held the
> > refcount at that time, which is the vital information for debugging
> > the allocation failure.

Hi,

Please don't cut down the original Cc list without a special reason.

>
> Hi Minchan,
>
> I'm a newbie here. We are debugging a problem where every CPU core is
> doing compaction but making no progress because of unexpected page
> refcounts. I'm interested in your approach, but this patch seems to
> cover only the CMA allocation path. Could it be extended to debug
> migration failures during compaction? I'm not familiar with the kernel
> codebase; here is my untested thought:

Compaction failures will produce a lot of events, which I wanted to
avoid on my system, but I think your case is reasonable if you don't
mind the large volume of events.

>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index cf25b00f03c8..85dacbca8fa0 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -46,6 +46,7 @@
> #include <linux/mmu_notifier.h>
> #include <linux/page_idle.h>
> #include <linux/page_owner.h>
> +#include <linux/page_pinner.h>
> #include <linux/sched/mm.h>
> #include <linux/ptrace.h>
> #include <linux/oom.h>
> @@ -388,8 +389,10 @@ int folio_migrate_mapping(struct address_space *mapping,
>
> if (!mapping) {
> /* Anonymous page without mapping */
> - if (folio_ref_count(folio) != expected_count)
> + if (folio_ref_count(folio) != expected_count) {
> + page_pinner_failure(&folio->page);
> return -EAGAIN;
> + }
>
> /* No turning back from here */
> newfolio->index = folio->index;
> @@ -406,6 +409,7 @@ int folio_migrate_mapping(struct address_space *mapping,
> xas_lock_irq(&xas);
> if (!folio_ref_freeze(folio, expected_count)) {
> xas_unlock_irq(&xas);
> + page_pinner_failure(&folio->page);
> return -EAGAIN;
> }
>
> I'm not sure what to do with the new folio; it seems using folio->page
> in new code is not correct.

If you want to cover compaction only, maybe this one:

diff --git a/mm/compaction.c b/mm/compaction.c
index bfc93da1c2c7..7bfbf7205fb8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2400,6 +2400,11 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
/* All pages were either migrated or will be released */
cc->nr_migratepages = 0;
if (err) {
+ struct page *failed_page;
+
+ list_for_each_entry(failed_page, &cc->migratepages, lru)
+ page_pinner_failure(failed_page);
+
putback_movable_pages(&cc->migratepages);
/*
* migrate_pages() may return -ENOMEM when scanners meet

However, for that case, I'd want to introduce a filter option such as
a failure reason:

page_pinner_failure(pfn, reason)

That way, I could keep getting only CMA allocation failure events, and
not compaction failures.
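
For illustration, one shape such a filter could take (the enum, mask,
and knob are made up for this sketch, not part of the patch):

enum pp_reason {
	PP_REASON_CMA,
	PP_REASON_COMPACTION,
};

/* e.g. a boot/tracefs knob selecting which failure reasons to mark */
static unsigned long pp_reason_mask = 1UL << PP_REASON_CMA;

static inline void page_pinner_failure(struct page *page,
				       enum pp_reason reason)
{
	if (!static_branch_unlikely(&page_pinner_inited))
		return;

	if (!(pp_reason_mask & (1UL << reason)))
		return;

	__page_pinner_failure(page);
}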

< snip>

> > diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
> > index 1e73717802f8..0ad4a3b8f4eb 100644
> > --- a/mm/Kconfig.debug
> > +++ b/mm/Kconfig.debug
> > @@ -62,6 +62,19 @@ config PAGE_OWNER
> >
> > If unsure, say N.
> >
> > +config PAGE_PINNER
> > + bool "Track page pinner"
> > + select PAGE_EXTENSION
> > + depends on DEBUG_KERNEL && TRACEPOINTS
> > + help
> > + This keeps track of what call chain is the pinner of a page, may
> > + help to find contiguous page allocation failure. Even if you include
> > + this feature in your build, it is disabled by default. You should
> > + pass "page_pinner=on" to boot parameter in order to enable it. Eats
> > + a fair amount of memory if enabled.
>
> I'm a bit confused. It seems page pinner does not allocate any
> additional memory when you enable it via the boot parameter, so the
> description seems inaccurate.

It will allocate page_ext descriptors, so it does consume memory.

< snip>

> More info about my compaction issue:
>
> This call stack returns -EAGAIN in 99.9% of cases on the problematic host
> (Ubuntu 20.04 with kernel 5.11.0-40):
>
> migrate_page_move_mapping (now folio_migrate_mapping) <- returns -EAGAIN
> migrate_page
> fallback_migrate_page
> move_to_new_page
> migrate_pages
> compact_zone
> compact_zone_order
> try_to_compact_pages
> __alloc_pages_direct_compact
> __alloc_pages_slowpath.constprop.0
> __alloc_pages_nodemask
> alloc_pages_vma
> do_huge_pmd_anonymous_page
> __handle_mm_fault
> handle_mm_fault
> do_user_addr_fault
> exc_page_fault
> asm_exc_page_fault
>
> The offending pages are from shm, allocated by mmap() with MAP_SHARED by a
> machine learning program. They may be related to NVIDIA CUDA, but I want
> to confirm this and make improvements if possible.

So you are suspecting that some kernel driver holds an additional
refcount via get_user_pages or the page get API?
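
For context, the usual way a driver ends up holding such a long-lived
refcount is the get_user_pages family; a minimal driver-side sketch
(illustrative only):

struct page *page;

/* pin one user page for device access; this takes a page refcount */
if (get_user_pages_fast(user_addr, 1, FOLL_WRITE, &page) == 1) {
	/* ... device uses the page; migration keeps failing ... */
	put_page(page);	/* this put is what page_pinner_put traces */
}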

< snip>

2021-12-12 07:21:40

by John Hubbard

Subject: Re: [RFC] mm: introduce page pinner

On 12/10/21 01:54, 胡玮文 wrote:
...
>> So you are suspecting that some kernel driver holds an additional
>> refcount via get_user_pages or the page get API?
>
> Yes. By using the trace events in this patch, I have confirmed it is the
> nvidia kernel module that holds the refcount. I got a stacktrace like this
> (from "perf script"):
>
> cuda-EvtHandlr 31023 [000] 3244.976411: page_pinner:page_pinner_put: pfn=0x13e473 flags=0x8001e count=0 mapcount=0 mapping=(nil) mt=1
> ffffffff82511be4 __page_pinner_put+0x54 (/lib/modules/5.15.6+/build/vmlinux)
> ffffffff82511be4 __page_pinner_put+0x54 (/lib/modules/5.15.6+/build/vmlinux)
> ffffffffc0b71e1f os_unlock_user_pages+0xbf ([nvidia])

The corresponding call to os_unlock_user_pages() is os_lock_user_pages(). And
os_lock_user_pages() does call get_user_pages().

This is part of normal operation for many CUDA (and OpenCL) programs: "host memory"
(host == CPU, device == GPU) is pinned, and GPU page tables are set up to point to it.

If your program calls cudaHostRegister() [1], then that will in turn work its way
down to os_lock_user_pages(), and if the program is still running on the GPU, then
it's reasonable for those pages to still be pinned. This is a very common pattern
for some programs, especially those that have tuned their access patterns and
know that most accesses are from the CPU side, with possibly rare accesses from
the GPU.
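
For reference, a minimal sketch of that pattern from the application
side (purely illustrative):

#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
	size_t size = 1 << 20;
	void *buf = malloc(size);

	/*
	 * Pins the pages backing buf; the driver takes page refcounts
	 * (via os_lock_user_pages() -> get_user_pages()), so the kernel
	 * cannot migrate them until cudaHostUnregister() releases them.
	 */
	cudaHostRegister(buf, size, cudaHostRegisterDefault);

	/* ... launch GPU work that accesses buf ... */

	cudaHostUnregister(buf);	/* drops the page pins */
	free(buf);
	return 0;
}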

> ffffffffc14a4546 _nv032165rm+0x96 ([nvidia])
>
> Still not much information. NVIDIA does not want me to debug its module.
> Maybe the only thing I can do is to report this to NVIDIA.
>

...or hope that someone here, maybe even from NVIDIA, can help! :)

Let me know if there are further questions, and if they are outside of the linux-mm
area, we can take it up in an off-list email thread if necessary.


[1] cudaHostRegister():
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1ge8d5c17670f16ac4fc8fcb4181cb490c


thanks,
--
John Hubbard
NVIDIA

2021-12-13 01:56:45

by John Hubbard

Subject: Re: [RFC] mm: introduce page pinner

On 12/6/21 10:47, Minchan Kim wrote:
> The contiguous memory allocation fails if one of the pages in the
> requested range has an unexpected elevated reference count, since the
> VM cannot migrate the page out. It's a very common pattern for CMA
> allocation failure. The temporary elevated page refcount can come from
> various places, and it's really hard to chase down who held the
> refcount at that time, which is the vital information for debugging
> the allocation failure.
>
> This patch introduces the page pinner to keep track of who pinned the
> page that caused the CMA allocation failure. It works as follows: once
> the VM finds a non-migrated page after attempting migration during
> contiguous allocation, it marks the page, and every page-put operation
> on that page from then on emits an event trace. Since every page-put
> is paired with a page-get, the page-put event trace helps to deduce
> where the paired page-get originated.
>
> The reason the feature tracks page-put instead of page-get is that
> the VM cannot predict when page migration will fail, so it would have
> to record every page-get on every migratable page just to dump that
> information at failure time. Considering that backtraces are the vital
> information and that a page's get/put is one of the hottest paths,
> that is far too heavy an approach. Thus, to minimize runtime overhead,
> this feature adds a new PAGE_EXT_PINNER flag under the PAGE_EXT
> debugging option to mark a migration-failed page, and only tracks
> page-put operations on that page after the failure.

Hi Minchan,

This looks very useful, so I have a bunch of hopefully very
easy-to-deal-with naming and documentation comments that are intended to
tighten it up and get it ready for merging.

Starting with the subject line and commit description: rather than
nitpick on these, I've studied the patch and written up a proposed
replacement for the subject line and the lines above this comment.
Please see if you like it:


mm: introduce page pin reporter

A Contiguous Memory Allocator (CMA) allocation can fail if any page
within the requested range has an elevated refcount (a pinned page).

Debugging such failures is difficult, because the struct pages only show
a combined refcount, and do not show the callstacks or backtraces of the
code that acquired each refcount. So the source of the page pins remains
a mystery, at the time of CMA failure.

In order to solve this without adding too much overhead, just do nothing
most of the time, which is pretty low overhead. :) However, once a CMA
failure occurs, then mark the page (this requires a pointer's worth of
space in struct page, but it uses page extensions to get that), and
start tracing the subsequent put_page() calls. As the program finishes
up, each page pin will be undone, and traced with a backtrace. The
programmer reads the trace output and sees the list of all page pinning
code paths.

This will consume an additional 8 bytes per 4KB page, or an additional
0.2% of RAM. In addition to the storage space, it will have some
performance cost, due to increasing the size of struct page so that it
is greater than the cacheline size (or multiples thereof) of popular
(x86, ...) CPUs.


< snip>

> Signed-off-by: Minchan Kim <[email protected]>
> ---
> PagePinner is named after PageOwner, since I wanted to keep track of
> page refcount holders. Feel free to suggest better names.


I understand how you arrived at the name, and it makes sense from that
perspective. However, from an "I just read this code" perspective, it
sounds like:

a) This is a tool to pin pages, and

b) It has a key API call to help report when it *fails* to pin pages.

...both of which are completely wrong statements, of course. :)

So, I'd recommend renaming:

page_pinner --> page_pin_owner (and all variations of the name)

page_pinner_failure --> report_page_pinners


< snip>

> diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
> index 1e73717802f8..0ad4a3b8f4eb 100644
> --- a/mm/Kconfig.debug
> +++ b/mm/Kconfig.debug
> @@ -62,6 +62,19 @@ config PAGE_OWNER
>
> If unsure, say N.
>
> +config PAGE_PINNER
> + bool "Track page pinner"
> + select PAGE_EXTENSION
> + depends on DEBUG_KERNEL && TRACEPOINTS
> + help
> + This keeps track of what call chain is the pinner of a page, may
> + help to find contiguous page allocation failure. Even if you include
> + this feature in your build, it is disabled by default. You should
> + pass "page_pinner=on" to boot parameter in order to enable it. Eats
> + a fair amount of memory if enabled.


We can do a *lot* better in documenting this than "a fair amount of
memory". How about something more like this (borrowing from the updated
commit description):

This keeps track of what call chain is the pinner of a page. That may
help to debug Contiguous Memory Allocator (CMA) allocation failures.
Even if you include this feature in your build, it is disabled by
default. In order to enable the feature, you must pass
"page_pinner=on" as a boot parameter.

When enabled, this will consume an additional 8 bytes per 4KB page, or
an additional 0.2% of RAM. In addition to the storage space, it will
have some performance cost, due to increasing the size of struct page
so that it is greater than the cacheline size (or multiples thereof)
of popular (x86, ...) CPUs.


thanks,
--
John Hubbard
NVIDIA

2021-12-13 22:58:56

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC] mm: introduce page pinner

On Fri, Dec 10, 2021 at 05:54:24PM +0800, 胡玮文 wrote:
> On Wed, Dec 08, 2021 at 10:42:37AM -0800, Minchan Kim wrote:
> > On Wed, Dec 08, 2021 at 07:54:35PM +0800, 胡玮文 wrote:
> > > On Mon, Dec 06, 2021 at 10:47:30AM -0800, Minchan Kim wrote:
> > > > The contiguous memory allocation fails if one of the pages in
> > > > requested range has unexpected elevated reference count since
> > > > VM couldn't migrate the page out. It's very common pattern for
> > > > CMA allocation failure. The temporal elevated page refcount
> > > > could happen from various places and it's really hard to chase
> > > > who held the temporal page refcount at that time, which is the
> > > > vital information to debug the allocation failure.
> >
> > Hi,
> >
> > Please don't cut down original Cc list without special reason.
>
> Sorry, my school SMTP server does not allow that many recipients. I have
> changed to Outlook.
>
> > > Hi Minchan,
> > >
> > > I'm a newbie here. We are debugging a problem where every CPU core is doing
> > > compaction but making no progress, because of the unexpected page refcount. I'm
> > > interested in your approach, but this patch seems only to cover the CMA
> > > allocation path. So could it be extended to debugging migrate failure during
> > > compaction? I'm not familiar with the kernel codebase; here is my untested
> > > thought:
> >
> > The compaction failure will produce a lot of events I wanted to avoid
> > in my system, but I think your case is reasonable if you don't
> > mind the large number of events.
> >
> > >
> > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > index cf25b00f03c8..85dacbca8fa0 100644
> > > --- a/mm/migrate.c
> > > +++ b/mm/migrate.c
> > > @@ -46,6 +46,7 @@
> > > #include <linux/mmu_notifier.h>
> > > #include <linux/page_idle.h>
> > > #include <linux/page_owner.h>
> > > +#include <linux/page_pinner.h>
> > > #include <linux/sched/mm.h>
> > > #include <linux/ptrace.h>
> > > #include <linux/oom.h>
> > > @@ -388,8 +389,10 @@ int folio_migrate_mapping(struct address_space *mapping,
> > >
> > > if (!mapping) {
> > > /* Anonymous page without mapping */
> > > - if (folio_ref_count(folio) != expected_count)
> > > + if (folio_ref_count(folio) != expected_count) {
> > > + page_pinner_failure(&folio->page);
> > > return -EAGAIN;
> > > + }
> > >
> > > /* No turning back from here */
> > > newfolio->index = folio->index;
> > > @@ -406,6 +409,7 @@ int folio_migrate_mapping(struct address_space *mapping,
> > > xas_lock_irq(&xas);
> > > if (!folio_ref_freeze(folio, expected_count)) {
> > > xas_unlock_irq(&xas);
> > > + page_pinner_failure(&folio->page);
> > > return -EAGAIN;
> > > }
> > >
> > > I'm not sure what to do with the new folio; it seems using folio->page in new
> > > code is not correct.
>
> I tested the above proposed patch; it works in my case. But it produces a lot of
> redundant page_pinner_put events. Before the true pinner reveals itself, the traced
> pages go through multiple get/put cycles. Besides, when passed to
> page_pinner_failure(), the "count" is 3 in my case, and any of the 3 holders could
> be the pinner of interest. I think this is hard to avoid, and we can just let the
> users distinguish which one is the pinner of interest. Maybe we need some docs about
> this.

If you run the experiment a bit more, you can find which callsites are the
most interesting, because the other two are usually common paths like LRU
draining or rmap for page migration. Maybe you don't want to pay attention
to them. If they are too noisy, we may optionally introduce some annotation
to not produce events on such callsites. However, let's not make things too
complicated from the beginning.
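
Just to illustrate the idea (nothing below exists in the patch; the
per-CPU flag and the quiet_begin/end helpers are made up, and preemption
handling is ignored for brevity), such an annotation could gate the event
emission like this:

static DEFINE_PER_CPU(bool, page_pinner_quiet);	/* hypothetical */

/* known-noisy callsites (e.g. LRU draining) wrap themselves with these */
static inline void page_pinner_quiet_begin(void)
{
	this_cpu_write(page_pinner_quiet, true);
}

static inline void page_pinner_quiet_end(void)
{
	this_cpu_write(page_pinner_quiet, false);
}

static inline void page_pinner_put(struct page *page)
{
	if (!static_branch_unlikely(&page_pinner_inited))
		return;
	/* skip events from sections annotated as uninteresting */
	if (this_cpu_read(page_pinner_quiet))
		return;
	__page_pinner_put(page);
}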

>
> > If you want to cover compaction only, maybe this one:
> >
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index bfc93da1c2c7..7bfbf7205fb8 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -2400,6 +2400,11 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
> > /* All pages were either migrated or will be released */
> > cc->nr_migratepages = 0;
> > if (err) {
> > + struct page *failed_page;
> > +
> > + list_for_each_entry(failed_page, &cc->migratepages, lru)
> > + page_pinner_failure(failed_page);
> > +
> > putback_movable_pages(&cc->migratepages);
> > /*
> > * migrate_pages() may return -ENOMEM when scanners meet
>
> Maybe we should put the page_pinner_failure() calls as close to the real
> refcount check as possible, to avoid potential races and losing some
> page_pinner_put events? Besides, migrate_pages() will retry up to 10 times, and I
> imagine that someone may want to find out who is causing the retries. And migration
> may fail for a number of reasons, not only an unexpected refcount.

Yeah, I encountered a couple of different failure reasons when I chased down
CMA issues (e.g., LRU isolation failure), but I wanted to focus on the page
refcount issue with this page_pinner. I thought each subsystem (CMA,
compaction, memory-hotplug, or other migration users) would have different
requirements/strategies, so they would provide their own trace events
for other failure purposes unless it's related to page pinning.

And the page_pinner is already racy since it doesn't take any lock
to mark the failure. So I thought it's not a problem if the trace event
doesn't trigger close to the real refcount check. Since the work
is based on deduction, the user should piece it together. Yeah, instead,
we could warn about the race issue in the documentation for the user.
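
Something like this wording, perhaps (just a draft):

  Note: page_pinner marks a migration-failed page without taking any
  lock, so a page-put that races with the failure event may be missed.
  The reported put traces are hints for deducing the pinner, not an
  exhaustive list.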

>
> I imagine that enabling page pinner for migration scenarios other than compaction
> could be helpful for others.

It's not too hard to add a new tracepoint in the general migrate_pages(),
so I want to leave it for only the alloc_contig user and compaction until
we hear use cases from real users.

>
> > However, for that case, I want to introduce some filter options, like
> > a failure reason(?)
> >
> > page_pinner_failure(pfn, reason)
> >
> > So, I could keep getting only CMA allocation failure events, not
> > compaction failure.
>
> This sounds like a good idea to me. But how can we implement the filter? Can we reuse the
> trace event filter? i.e., if the page_pinner_failure event is filtered out, then
> we don't set the PAGE_EXT_PINNER flag and effectively also filter the
> corresponding page_pinner_put event out. I can't see whether it is possible now.
> trace_page_pinner_failure() returns void, so it seems we cannot know whether the
> event got through.

We can introduce multiple failure-report entry points, something like
mm/debug_page_ref.c does:

/* one wrapper per failure site, so each tracepoint can be enabled alone */
void page_pinner_alloc_contig_failure(struct page *page)
{
	if (page_pinner_tracepoint_active(pp_fail_alloc_contig))
		__page_pinner_alloc_contig_failure(page);
}

void __page_pinner_alloc_contig_failure(struct page *page)
{
	trace_pp_alloc_contig_failure(page);
}

void page_pinner_compaction_failure(struct page *page)
{
	if (page_pinner_tracepoint_active(pp_fail_compaction))
		__page_pinner_compaction_failure(page);
}

void __page_pinner_compaction_failure(struct page *page)
{
	trace_pp_compaction_failure(page);
}

So the admin can enable only the events of interest.
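
With that, enabling only the CMA failure path through tracefs would look
something like this (the event names come from the sketch above, so they
are illustrative only):

  echo 1 > /sys/kernel/tracing/events/page_pinner/pp_alloc_contig_failure/enable
  echo 1 > /sys/kernel/tracing/events/page_pinner/page_pinner_put/enable

while the compaction failure event simply stays disabled.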

>
> If this is not possible, we may need to allocate additional space to store the
> reason for each traced page, and also pass the reason to trace_page_pinner_put().
>

<snip>

> > > > diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
> > > > index 1e73717802f8..0ad4a3b8f4eb 100644
> > > > --- a/mm/Kconfig.debug
> > > > +++ b/mm/Kconfig.debug
> > > > @@ -62,6 +62,19 @@ config PAGE_OWNER
> > > >
> > > > If unsure, say N.
> > > >
> > > > +config PAGE_PINNER
> > > > + bool "Track page pinner"
> > > > + select PAGE_EXTENSION
> > > > + depends on DEBUG_KERNEL && TRACEPOINTS
> > > > + help
> > > > + This keeps track of what call chain is the pinner of a page, may
> > > > + help to find contiguos page allocation failure. Even if you include
> > > > + this feature in your build, it is disabled by default. You should
> > > > + pass "page_pinner=on" to boot parameter in order to enable it. Eats
> > > > + a fair amount of memory if enabled.
> > >
> > > I'm a bit confused. It seems page pinner does not allocate any additional
> > > memory if you enable it via the boot parameter. So the description seems inaccurate.
> >
> > It will allocate page_ext descriptors, so it consumes memory.
>
> Thanks, I see. So it is 8 bytes for each 4KB page. Not much, I think.

Yup, but for some people it's not too small either, considering how often
the problem happens. ;-)
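
(To put a number on it: 8 bytes per 4KiB page is 1/512 of RAM, about 0.2%,
so a 16GiB device pays 16GiB / 512 = 32MiB of page_ext storage just to keep
the feature armed.)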

> > > More info about my compaction issue:
> > >
> > > This call stack returns -EAGAIN in 99.9% cases on the problematic host
> > > (Ubuntu 20.04 with kernel 5.11.0-40):
> > >
> > > migrate_page_move_mapping (now folio_migrate_mapping) <- returns -EAGAIN
> > > migrate_page
> > > fallback_migrate_page
> > > move_to_new_page
> > > migrate_pages
> > > compact_zone
> > > compact_zone_order
> > > try_to_compact_pages
> > > __alloc_pages_direct_compact
> > > __alloc_pages_slowpath.constprop.0
> > > __alloc_pages_nodemask
> > > alloc_pages_vma
> > > do_huge_pmd_anonymous_page
> > > __handle_mm_fault
> > > handle_mm_fault
> > > do_user_addr_fault
> > > exc_page_fault
> > > asm_exc_page_fault
> > >
> > > The offending pages are from shm, allocated by mmap() with MAP_SHARED by a
> > > machine learning program. They may have relationships with NVIDIA CUDA, but I
> > > want to confirm this, and make improvements if possible.
> >
> > So you are suspecting some kernel driver holds an additional refcount
> > taken via get_user_pages or the page get API?
>
> Yes. By using the trace events in this patch, I have confirmed it is the nvidia
> kernel module that holds the refcount. I got a stacktrace like this (from
> "perf script"):
>
> cuda-EvtHandlr 31023 [000] 3244.976411: page_pinner:page_pinner_put: pfn=0x13e473 flags=0x8001e count=0 mapcount=0 mapping=(nil) mt=1
> ffffffff82511be4 __page_pinner_put+0x54 (/lib/modules/5.15.6+/build/vmlinux)
> ffffffff82511be4 __page_pinner_put+0x54 (/lib/modules/5.15.6+/build/vmlinux)
> ffffffffc0b71e1f os_unlock_user_pages+0xbf ([nvidia])
> ffffffffc14a4546 _nv032165rm+0x96 ([nvidia])
>
> Still not much information. NVIDIA does not want me to debug its module. Maybe
> the only thing I can do is report this to NVIDIA.

I'm glad that you found the problem using page_pinner and that the trace
could help push them to dig into it. Looks like that is already
happening.
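
By the way, since these are ordinary tracepoints, a callchain capture like
the one you pasted can be taken with plain perf, something along the lines
of:

  perf record -e page_pinner:page_pinner_put -g -a -- sleep 10
  perf script

adjusting the event list, scope and duration as needed.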

>
> > > When the issue reproduces, a single page fault that triggers a sync compaction
> > > can take tens of seconds. Then all 40 CPU threads are doing compaction, and
> > > the application runs several orders of magnitude slower.
> > >
> > > Disabling sync compaction is a workaround (the default is "madvise"):
> > >
> > > echo never > /sys/kernel/mm/transparent_hugepage/defrag
> > >
> > > Previously I asked for help at https://lore.kernel.org/linux-mm/[email protected]/
> > > Now I have more information but still cannot pinpoint the root cause.
> > >
> > > Thanks,
> > > Hu Weiwen

Thanks for commenting and sharing your experience, Hu.

2021-12-13 23:10:19

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC] mm: introduce page pinner

On Sun, Dec 12, 2021 at 05:56:35PM -0800, John Hubbard wrote:
> On 12/6/21 10:47, Minchan Kim wrote:
> > The contiguous memory allocation fails if one of the pages in
> > requested range has unexpected elevated reference count since
> > VM couldn't migrate the page out. It's very common pattern for
> > CMA allocation failure. The temporal elevated page refcount
> > could happen from various places and it's really hard to chase
> > who held the temporal page refcount at that time, which is the
> > vital information to debug the allocation failure.
> >
> > This patch introduces page pinner to keep track of Page Pinner
> > who caused the CMA allocation failure. How page pinner work is
> > once VM found the non-migrated page after trying migration
> > during contiguos allocation, it marks the page and every page-put
> > operation on the page since then will have event trace. Since
> > page-put is always with page-get, the page-put event trace helps
> > to deduce where the pair page-get originated from.
> >
> > The reason why the feature tracks page-put instead of page-get
> > indirectly is that since VM couldn't expect when page migration
> > fails, it should keep track of every page-get for migratable page
> > to dump information at failure. Considering backtrace as vitial
> > information as well as page's get/put is one of hottest path,
> > it's too heavy approach. Thus, to minimize runtime overhead,
> > this feature adds a new PAGE_EXT_PINNER flag under PAGE_EXT
> > debugging option to indicate migration-failed page and only
> > tracks every page-put operation for the page since the failure.
>

Hi John,

> Hi Minchan,
>
> This looks very useful, so I have a bunch of hopefully very
> easy-to-deal-with naming and documentation comments that are intended to
> tighten it up and get it ready for merging.
>
> Starting with the subject line and commit description: rather than
> nitpick on these, I've studied the patch and written up a proposed
> replacement for the subject line and the lines above this comment.
> Please see if you like it:
>
>
> mm: introduce page pin reporter
>
> A Contiguous Memory Allocator (CMA) allocation can fail if any page
> within the requested range has an elevated refcount (a pinned page).
>
> Debugging such failures is difficult, because the struct pages only show
> a combined refcount, and do not show the callstacks or backtraces of the
> code that acquired each refcount. So the source of the page pins remains
> a mystery, at the time of CMA failure.
>
> In order to solve this without adding too much overhead, just do nothing
> most of the time, which is pretty low overhead. :) However, once a CMA
> failure occurs, then mark the page (this requires a pointer's worth of
> space in struct page, but it uses page extensions to get that), and
> start tracing the subsequent put_page() calls. As the program finishes
> up, each page pin will be undone, and traced with a backtrace. The
> programmer reads the trace output and sees the list of all page pinning
> code paths.
>
> This will consume an additional 8 bytes per 4KB page, or an additional
> 0.2% of RAM. In addition to the storage space, it will have some
> performance cost, due to increasing the size of struct page so that it
> is greater than the cacheline size (or multiples thereof) of popular
> (x86, ...) CPUs.

Perfect. Let me take your wording. In the next version, I might append
a little bit more, and I hope you will correct it if I screw up
with poor wording again. :) Thanks, John!

>
>
> >
> > usage:
> >
> > trace_dir="/sys/kernel/tracing"
> > echo 1 > $trace_dir/events/page_pinner/enable
> > echo 1 > $trace_dir/options/stacktrace
> > ..
> > run workload
> > ..
> > ..
> >
> > cat $trace_dir/trace
> >
> > <...>-498 [006] .... 33306.301621: page_pinner_failure: pfn=0x9f0bb0 flags=uptodate|lru|swapbacked count=1 mapcount=0 mapping=00000000aec7812a mt=5
> > <...>-498 [006] .... 33306.301625: <stack trace>
> > => __page_pinner_failure
> > => test_pages_isolated
> > => alloc_contig_range
> > => cma_alloc
> > => cma_heap_allocate
> > => dma_heap_ioctl
> > => __arm64_sys_ioctl
> > => el0_svc_common
> > => do_el0_svc
> > => el0_svc
> > => el0_sync_handler
> > => el0_sync
> > <...>-24965 [001] .... 33306.392836: page_pinner_put: pfn=0x9f0bb0 flags=uptodate|lru|swapbacked count=0 mapcount=0 mapping=00000000aec7812a mt=5
> > <...>-24965 [001] .... 33306.392846: <stack trace>
> > => __page_pinner_put
> > => release_pages
> > => free_pages_and_swap_cache
> > => tlb_flush_mmu_free
> > => tlb_flush_mmu
> > => zap_pte_range
> > => unmap_page_range
> > => unmap_vmas
> > => exit_mmap
> > => __mmput
> > => mmput
> > => exit_mm
> > => do_exit
> > => do_group_exit
> > => get_signal
> > => do_signal
> > => do_notify_resume
> > => work_pending
> >
> > Signed-off-by: Minchan Kim <[email protected]>
> > ---
> > The PagePinner named after PageOwner since I wanted to keep track of
> > page refcount holder. Feel free to suggest better names.
>
>
> I understand how you arrived at the name, and it makes sense from that
> perspective. However, from an "I just read this code" perspective, it
> sounds like:
>
> a) This is a tool to pin pages, and
>
> b) It has a key API call to help report when it *fails* to pin pages.
>
> ...both of which are completely wrong statements, of course. :)

Fair enough. I thought about it too but couldn't come up with a better name.

>
> So, I'd recommend renaming:
>
> page_pinner --> page_pin_owner (and all variations of the name)
>
> page_pinner_failure --> report_page_pinners

That's surely a better name.

>
>
> > Actually, I had alloc_contig_failure tracker as a candidate.
> >
> > include/linux/mm.h | 7 ++-
> > include/linux/page_ext.h | 3 +
> > include/linux/page_pinner.h | 47 ++++++++++++++++
> > include/trace/events/page_pinner.h | 60 ++++++++++++++++++++
> > mm/Kconfig.debug | 13 +++++
> > mm/Makefile | 1 +
> > mm/page_alloc.c | 3 +
> > mm/page_ext.c | 4 ++
> > mm/page_isolation.c | 3 +
> > mm/page_pinner.c | 90 ++++++++++++++++++++++++++++++
> > 10 files changed, 230 insertions(+), 1 deletion(-)
> > create mode 100644 include/linux/page_pinner.h
> > create mode 100644 include/trace/events/page_pinner.h
> > create mode 100644 mm/page_pinner.c
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 73a52aba448f..a640cae593f9 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -26,6 +26,7 @@
> > #include <linux/err.h>
> > #include <linux/page-flags.h>
> > #include <linux/page_ref.h>
> > +#include <linux/page_pinner.h>
> > #include <linux/memremap.h>
> > #include <linux/overflow.h>
> > #include <linux/sizes.h>
> > @@ -744,8 +745,12 @@ struct inode;
> > */
> > static inline int put_page_testzero(struct page *page)
> > {
> > + int ret;
> > +
> > VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
> > - return page_ref_dec_and_test(page);
> > + ret = page_ref_dec_and_test(page);
> > + page_pinner_put(page);
> > + return ret;
> > }
> > /*
> > diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
> > index fabb2e1e087f..561d8458dc5a 100644
> > --- a/include/linux/page_ext.h
> > +++ b/include/linux/page_ext.h
> > @@ -23,6 +23,9 @@ enum page_ext_flags {
> > PAGE_EXT_YOUNG,
> > PAGE_EXT_IDLE,
> > #endif
> > +#if defined(CONFIG_PAGE_PINNER)
> > + PAGE_EXT_PINNER,
> > +#endif
> > };
> > /*
> > diff --git a/include/linux/page_pinner.h b/include/linux/page_pinner.h
> > new file mode 100644
> > index 000000000000..3f93a753b8e0
> > --- /dev/null
> > +++ b/include/linux/page_pinner.h
> > @@ -0,0 +1,47 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef __LINUX_PAGE_PINNER_H
> > +#define __LINUX_PAGE_PINNER_H
> > +
> > +#include <linux/jump_label.h>
> > +
> > +#ifdef CONFIG_PAGE_PINNER
> > +extern struct static_key_false page_pinner_inited;
> > +extern struct page_ext_operations page_pinner_ops;
> > +
> > +void __page_pinner_failure(struct page *page);
> > +void __page_pinner_put(struct page *page);
> > +void __reset_page_pinner(struct page *page, unsigned int order);
> > +
> > +static inline void reset_page_pinner(struct page *page, unsigned int order)
> > +{
> > + if (static_branch_unlikely(&page_pinner_inited))
> > + __reset_page_pinner(page, order);
> > +}
> > +
> > +static inline void page_pinner_failure(struct page *page)
> > +{
> > + if (!static_branch_unlikely(&page_pinner_inited))
> > + return;
> > +
> > + __page_pinner_failure(page);
> > +}
> > +
> > +static inline void page_pinner_put(struct page *page)
> > +{
> > + if (!static_branch_unlikely(&page_pinner_inited))
> > + return;
> > +
> > + __page_pinner_put(page);
> > +}
> > +#else
> > +static inline void reset_page_pinner(struct page *page, unsigned int order)
> > +{
> > +}
> > +static inline void page_pinner_failure(struct page *page)
> > +{
> > +}
> > +static inline void page_pinner_put(struct page *page)
> > +{
> > +}
> > +#endif /* CONFIG_PAGE_PINNER */
> > +#endif /* __LINUX_PAGE_PINNER_H */
> > diff --git a/include/trace/events/page_pinner.h b/include/trace/events/page_pinner.h
> > new file mode 100644
> > index 000000000000..69ccd5c30f66
> > --- /dev/null
> > +++ b/include/trace/events/page_pinner.h
> > @@ -0,0 +1,60 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#undef TRACE_SYSTEM
> > +#define TRACE_SYSTEM page_pinner
> > +
> > +#if !defined(_TRACE_PAGE_PINNER_H) || defined(TRACE_HEADER_MULTI_READ)
> > +#define _TRACE_PAGE_PINNER_H
> > +
> > +#include <linux/types.h>
> > +#include <linux/tracepoint.h>
> > +#include <trace/events/mmflags.h>
> > +
> > +DECLARE_EVENT_CLASS(page_pinner_template,
> > +
> > + TP_PROTO(struct page *page),
> > +
> > + TP_ARGS(page),
> > +
> > + TP_STRUCT__entry(
> > + __field(unsigned long, pfn)
> > + __field(unsigned long, flags)
> > + __field(int, count)
> > + __field(int, mapcount)
> > + __field(void *, mapping)
> > + __field(int, mt)
> > + ),
> > +
> > + TP_fast_assign(
> > + __entry->pfn = page_to_pfn(page);
> > + __entry->flags = page->flags;
> > + __entry->count = page_ref_count(page);
> > + __entry->mapcount = page_mapcount(page);
> > + __entry->mapping = page->mapping;
> > + __entry->mt = get_pageblock_migratetype(page);
> > + ),
> > +
> > + TP_printk("pfn=0x%lx flags=%s count=%d mapcount=%d mapping=%p mt=%d",
> > + __entry->pfn,
> > + show_page_flags(__entry->flags & ((1UL << NR_PAGEFLAGS) - 1)),
> > + __entry->count,
> > + __entry->mapcount, __entry->mapping, __entry->mt)
> > +);
> > +
> > +DEFINE_EVENT(page_pinner_template, page_pinner_failure,
> > +
> > + TP_PROTO(struct page *page),
> > +
> > + TP_ARGS(page)
> > +);
> > +
> > +DEFINE_EVENT(page_pinner_template, page_pinner_put,
> > +
> > + TP_PROTO(struct page *page),
> > +
> > + TP_ARGS(page)
> > +);
> > +
> > +#endif /* _TRACE_PAGE_PINNER_H */
> > +
> > +/* This part must be outside protection */
> > +#include <trace/define_trace.h>
> > diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
> > index 1e73717802f8..0ad4a3b8f4eb 100644
> > --- a/mm/Kconfig.debug
> > +++ b/mm/Kconfig.debug
> > @@ -62,6 +62,19 @@ config PAGE_OWNER
> > If unsure, say N.
> > +config PAGE_PINNER
> > + bool "Track page pinner"
> > + select PAGE_EXTENSION
> > + depends on DEBUG_KERNEL && TRACEPOINTS
> > + help
> > + This keeps track of what call chain is the pinner of a page, may
> > + help to find contiguos page allocation failure. Even if you include
> > + this feature in your build, it is disabled by default. You should
> > + pass "page_pinner=on" to boot parameter in order to enable it. Eats
> > + a fair amount of memory if enabled.
>
>
> We can do a *lot* better in documenting this than "a fair amount of
> memory". How about something more like this (borrowing from the updated
> commit description):
>
> This keeps track of what call chain is the pinner of a page. That may
> help to debug Contiguous Memory Allocator (CMA) allocation failures.

https://lore.kernel.org/lkml/YbfQJ0eTkkImUQ%[email protected]/

I need to rephrase this one to cover other sites, not only CMA, since another
reviewer also wants to see the failures from compaction.
If you are interested, please chime in on the thread.

> Even if you include this feature in your build, it is disabled by
> default. In order to enable the feature, you must pass
> "page_pinner=on" as a boot parameter.
>
> When enabled, this will consume an additional 8 bytes per 4KB page, or
> an additional 0.2% of RAM. In addition to the storage space, it will
> have some performance cost, due to increasing the size of struct page
> so that it is greater than the cacheline size (or multiples thereof)
> of popular (x86, ...) CPUs.


Definitely, it looks much better.
Thanks for the review, John!

2021-12-14 18:27:53

by John Hubbard

[permalink] [raw]
Subject: Re: [RFC] mm: introduce page pinner

On 12/13/21 15:10, Minchan Kim wrote:
>>> @@ -62,6 +62,19 @@ config PAGE_OWNER
>>> If unsure, say N.
>>> +config PAGE_PINNER
>>> + bool "Track page pinner"
>>> + select PAGE_EXTENSION
>>> + depends on DEBUG_KERNEL && TRACEPOINTS
>>> + help
>>> + This keeps track of what call chain is the pinner of a page, may
>>> + help to find contiguos page allocation failure. Even if you include
>>> + this feature in your build, it is disabled by default. You should
>>> + pass "page_pinner=on" to boot parameter in order to enable it. Eats
>>> + a fair amount of memory if enabled.
>>
>>
>> We can do a *lot* better in documenting this than "a fair amount of
>> memory". How about something more like this (borrowing from the updated
>> commit description):
>>
>> This keeps track of what call chain is the pinner of a page. That may
>> help to debug Contiguous Memory Allocator (CMA) allocation failures.
>
> https://lore.kernel.org/lkml/YbfQJ0eTkkImUQ%[email protected]/
>
> I need to rephrase this one to cover other sites, not only CMA, since another
> reviewer also wants to see the failures from compaction.
> If you are interested, please chime in on the thread.
>

Actually yes, this feature is potentially a general-purpose tracker of page
pinners. It's not inherently CMA-related. One could use it for other cases, such
as any of the NUMA or HMM migration cases. HMM's migration between CPU and
devices fails if pages are pinned, for example.


thanks,
--
John Hubbard
NVIDIA