Major Changes from v1
* patch 1: add overall design description in code comment
* patch 6: make page owner more accurate for alloc_pages_exact() and
compaction/CMA
* patch 7: handle early allocated pages for page owner
When we debug something, we'd like to insert some information into
every page. For this purpose, we sometimes modify struct page itself.
But this has drawbacks. First, it requires a recompile. This makes us
hesitate to use the powerful debug feature, so the development process
is slowed down. Second, sometimes it is impossible to rebuild the kernel
due to third-party module dependencies. Third, system behaviour can
differ significantly after a recompile, because the size of struct page
changes greatly and this structure is accessed by every part of the
kernel. Keeping it as it is makes it easier to reproduce an erroneous
situation.

To overcome these drawbacks, we can extend struct page in another place
rather than in struct page itself. Until now, memcg used this technique,
but memcg has now decided to embed its variable in struct page itself,
and its code to extend struct page has been removed. I'd like to use
this code to develop debug features, so this series resurrects it.

With it resurrected, this patchset makes two debugging features
boot-time configurable. As mentioned above, a rebuild has serious
problems. Making the features boot-time configurable mitigates these
problems with only marginal computational overhead. One of the features,
page_owner, isn't in mainline now, but it is really useful and has been
in the mm tree for a long time. I think it's time to upstream this
feature.
Any comments are more than welcome.
This patchset is based on next-20141106 + my two patches related to
debug-pagealloc [1].
[1]: https://lkml.org/lkml/2014/11/7/78
Joonsoo Kim (7):
mm/page_ext: resurrect struct page extending code for debugging
mm/debug-pagealloc: prepare boottime configurable on/off
mm/debug-pagealloc: make debug-pagealloc boottime configurable
mm/nommu: use alloc_pages_exact() rather than its own implementation
stacktrace: introduce snprint_stack_trace for buffer output
mm/page_owner: keep track of page owners
mm/page_owner: correct owner information for early allocated pages
Documentation/kernel-parameters.txt | 14 ++
arch/powerpc/mm/hash_utils_64.c | 2 +-
arch/powerpc/mm/pgtable_32.c | 2 +-
arch/s390/mm/pageattr.c | 2 +-
arch/sparc/mm/init_64.c | 2 +-
arch/x86/mm/pageattr.c | 2 +-
include/linux/mm.h | 36 +++-
include/linux/mm_types.h | 4 -
include/linux/mmzone.h | 12 ++
include/linux/page-debug-flags.h | 32 ---
include/linux/page_ext.h | 84 ++++++++
include/linux/page_owner.h | 19 ++
include/linux/stacktrace.h | 3 +
init/main.c | 7 +
kernel/stacktrace.c | 24 +++
lib/Kconfig.debug | 13 ++
mm/Kconfig.debug | 10 +
mm/Makefile | 2 +
mm/debug-pagealloc.c | 45 +++-
mm/nommu.c | 33 +--
mm/page_alloc.c | 67 +++++-
mm/page_ext.c | 393 +++++++++++++++++++++++++++++++++++
mm/page_owner.c | 313 ++++++++++++++++++++++++++++
mm/vmstat.c | 93 +++++++++
tools/vm/Makefile | 4 +-
tools/vm/page_owner_sort.c | 144 +++++++++++++
26 files changed, 1286 insertions(+), 76 deletions(-)
delete mode 100644 include/linux/page-debug-flags.h
create mode 100644 include/linux/page_ext.h
create mode 100644 include/linux/page_owner.h
create mode 100644 mm/page_ext.c
create mode 100644 mm/page_owner.c
create mode 100644 tools/vm/page_owner_sort.c
--
1.7.9.5
Currently, stacktrace only has a function for console output.
page_owner, which will be introduced in a following patch, needs to
print the stacktrace output into a buffer in its own output format,
so a new function, snprint_stack_trace(), is needed.
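For illustration, a caller would use it roughly like this (a minimal
sketch; the buffer size and the skip value are arbitrary here, and the
real user only arrives with the page_owner patch):

    char buf[256];
    unsigned long entries[8];
    struct stack_trace trace = {
        .nr_entries = 0,
        .max_entries = ARRAY_SIZE(entries),
        .entries = entries,
        .skip = 0,
    };
    int len;

    save_stack_trace(&trace);
    /* format the saved trace into buf instead of printing to the console */
    len = snprint_stack_trace(buf, sizeof(buf), &trace, 0);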
Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/stacktrace.h | 3 +++
kernel/stacktrace.c | 24 ++++++++++++++++++++++++
2 files changed, 27 insertions(+)
diff --git a/include/linux/stacktrace.h b/include/linux/stacktrace.h
index 115b570..5948c67 100644
--- a/include/linux/stacktrace.h
+++ b/include/linux/stacktrace.h
@@ -20,6 +20,8 @@ extern void save_stack_trace_tsk(struct task_struct *tsk,
struct stack_trace *trace);
extern void print_stack_trace(struct stack_trace *trace, int spaces);
+extern int snprint_stack_trace(char *buf, int buf_len,
+ struct stack_trace *trace, int spaces);
#ifdef CONFIG_USER_STACKTRACE_SUPPORT
extern void save_stack_trace_user(struct stack_trace *trace);
@@ -32,6 +34,7 @@ extern void save_stack_trace_user(struct stack_trace *trace);
# define save_stack_trace_tsk(tsk, trace) do { } while (0)
# define save_stack_trace_user(trace) do { } while (0)
# define print_stack_trace(trace, spaces) do { } while (0)
+# define snprint_stack_trace(buf, len, trace, spaces) do { } while (0)
#endif
#endif
diff --git a/kernel/stacktrace.c b/kernel/stacktrace.c
index 00fe55c..61088ff 100644
--- a/kernel/stacktrace.c
+++ b/kernel/stacktrace.c
@@ -25,6 +25,30 @@ void print_stack_trace(struct stack_trace *trace, int spaces)
}
EXPORT_SYMBOL_GPL(print_stack_trace);
+int snprint_stack_trace(char *buf, int buf_len, struct stack_trace *trace,
+ int spaces)
+{
+ int i, printed;
+ unsigned long ip;
+ int ret = 0;
+
+ if (WARN_ON(!trace->entries))
+ return 0;
+
+ for (i = 0; i < trace->nr_entries && buf_len; i++) {
+ ip = trace->entries[i];
+ printed = snprintf(buf, buf_len, "%*c[<%p>] %pS\n",
+ 1 + spaces, ' ', (void *) ip, (void *) ip);
+
+ buf_len -= printed;
+ ret += printed;
+ buf += printed;
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(snprint_stack_trace);
+
/*
* Architectures that do not implement save_stack_trace_tsk or
* save_stack_trace_regs get this weak alias and a once-per-bootup warning
--
1.7.9.5
This is the page owner tracking code, which was introduced quite a
while ago. It has been resident in Andrew's tree, but nobody tried to
upstream it, so it has remained as is. Our company uses this feature
actively to debug memory leaks and to find memory hoggers, so I decided
to upstream it.

This functionality helps us to know who allocated each page. When
allocating a page, we store some information about the allocation in
extra memory. Later, if we need to know the status of all pages, we can
retrieve and analyze it from this stored information.

In the previous version of this feature, the extra memory was statically
defined in struct page, but in this version it is allocated outside of
struct page. This enables us to turn the feature on/off at boot time
without considerable memory waste.
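Concretely, the per-page record kept by this patch boils down to the
fields added to struct page_ext below; this is just a restatement of
the page_ext.h hunk for easier reading:

    struct page_ext {
        unsigned long flags;            /* PAGE_EXT_OWNER set when owner info is valid */
    #ifdef CONFIG_PAGE_OWNER
        unsigned int order;             /* allocation order */
        gfp_t gfp_mask;                 /* gfp mask used for the allocation */
        struct stack_trace trace;       /* allocation call chain */
        unsigned long trace_entries[8]; /* storage backing the trace */
    #endif
    };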
Although we already have tracepoints for tracing page allocation/free,
using them to analyze page owners is rather complex. We need to enlarge
the trace buffer to prevent it from being overwritten until the
userspace program is launched. And the launched program continually
dumps out the trace buffer for later analysis, which perturbs system
behaviour more than just keeping the information in memory, so it is
bad for debugging.

Moreover, we can use the page_owner feature for various further
purposes. For example, we can use it for the fragmentation statistics
implemented in this patch. I also plan to implement a CMA failure
debugging feature on top of this interface.

I'd like to give credit to all the developers who contributed to this
feature, but it's not easy because I don't know the exact history.
Sorry about that. Below are the people who have a "Signed-off-by" in
the patches in Andrew's tree.
Contributor:
Alexander Nyberg <[email protected]>
Mel Gorman <[email protected]>
Dave Hansen <[email protected]>
Minchan Kim <[email protected]>
Michal Nazarewicz <[email protected]>
Andrew Morton <[email protected]>
Jungsoo Son <[email protected]>
v2: Call set_page_owner() in more places than v1. This corrects page
owner information for memory from alloc_pages_exact() and
compaction/CMA.
Signed-off-by: Joonsoo Kim <[email protected]>
---
Documentation/kernel-parameters.txt | 6 +
include/linux/page_ext.h | 10 ++
include/linux/page_owner.h | 19 +++
lib/Kconfig.debug | 13 ++
mm/Makefile | 1 +
mm/page_alloc.c | 11 +-
mm/page_ext.c | 4 +
mm/page_owner.c | 224 +++++++++++++++++++++++++++++++++++
mm/vmstat.c | 93 +++++++++++++++
tools/vm/Makefile | 4 +-
tools/vm/page_owner_sort.c | 144 ++++++++++++++++++++++
11 files changed, 526 insertions(+), 3 deletions(-)
create mode 100644 include/linux/page_owner.h
create mode 100644 mm/page_owner.c
create mode 100644 tools/vm/page_owner_sort.c
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index b5ac055..3632005 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -884,6 +884,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
MTRR settings. This parameter disables that behavior,
possibly causing your machine to run very slowly.
+ disable_page_owner
+ [KNL] Disable storing the information about who
+ requested the page. We can avoid allocating a huge
+ chunk of memory if page owner is the only client of
+ page extension and this parameter is specified.
+
disable_timer_pin_1 [X86]
Disable PIN 1 of APIC timer
Can be useful to work around chipset bugs.
diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index 61c0f05..d2a2c84 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -1,6 +1,9 @@
#ifndef __LINUX_PAGE_EXT_H
#define __LINUX_PAGE_EXT_H
+#include <linux/types.h>
+#include <linux/stacktrace.h>
+
struct pglist_data;
struct page_ext_operations {
bool (*need)(void);
@@ -22,6 +25,7 @@ struct page_ext_operations {
enum page_ext_flags {
PAGE_EXT_DEBUG_POISON, /* Page is poisoned */
PAGE_EXT_DEBUG_GUARD,
+ PAGE_EXT_OWNER,
};
/*
@@ -33,6 +37,12 @@ enum page_ext_flags {
*/
struct page_ext {
unsigned long flags;
+#ifdef CONFIG_PAGE_OWNER
+ unsigned int order;
+ gfp_t gfp_mask;
+ struct stack_trace trace;
+ unsigned long trace_entries[8];
+#endif
};
extern void pgdat_page_ext_init(struct pglist_data *pgdat);
diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h
new file mode 100644
index 0000000..a2e15c3
--- /dev/null
+++ b/include/linux/page_owner.h
@@ -0,0 +1,19 @@
+#ifndef __LINUX_PAGE_OWNER_H
+#define __LINUX_PAGE_OWNER_H
+
+#ifdef CONFIG_PAGE_OWNER
+extern struct page_ext_operations page_owner_ops;
+extern void reset_page_owner(struct page *page, unsigned int order);
+extern void set_page_owner(struct page *page,
+ unsigned int order, gfp_t gfp_mask);
+#else
+static inline void reset_page_owner(struct page *page, unsigned int order)
+{
+}
+static inline void set_page_owner(struct page *page,
+ unsigned int order, gfp_t gfp_mask)
+{
+}
+
+#endif /* CONFIG_PAGE_OWNER */
+#endif /* __LINUX_PAGE_OWNER_H */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index c078a76..4b03d14 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -227,6 +227,19 @@ config UNUSED_SYMBOLS
you really need it, and what the merge plan to the mainline kernel for
your module is.
+config PAGE_OWNER
+ bool "Track page owner"
+ depends on DEBUG_KERNEL && STACKTRACE_SUPPORT
+ select DEBUG_FS
+ select STACKTRACE
+ select PAGE_EXTENSION
+ help
+ This keeps track of what call chain is the owner of a page; it may
+ help to find bare alloc_page(s) leaks. Eats a fair amount of memory.
+ See tools/vm/page_owner_sort.c for the user-space helper.
+
+ If unsure, say N.
+
config DEBUG_FS
bool "Debug Filesystem"
help
diff --git a/mm/Makefile b/mm/Makefile
index 0b7a784..3548460 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -63,6 +63,7 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
+obj-$(CONFIG_PAGE_OWNER) += page_owner.o
obj-$(CONFIG_CLEANCACHE) += cleancache.o
obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
obj-$(CONFIG_ZPOOL) += zpool.o
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4eea173..f1968d7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -59,6 +59,7 @@
#include <linux/page_ext.h>
#include <linux/hugetlb.h>
#include <linux/sched/rt.h>
+#include <linux/page_owner.h>
#include <asm/sections.h>
#include <asm/tlbflush.h>
@@ -810,6 +811,8 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
if (bad)
return false;
+ reset_page_owner(page, order);
+
if (!PageHighMem(page)) {
debug_check_no_locks_freed(page_address(page),
PAGE_SIZE << order);
@@ -985,6 +988,8 @@ static int prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags)
if (order && (gfp_flags & __GFP_COMP))
prep_compound_page(page, order);
+ set_page_owner(page, order, gfp_flags);
+
return 0;
}
@@ -1557,8 +1562,11 @@ void split_page(struct page *page, unsigned int order)
split_page(virt_to_page(page[0].shadow), order);
#endif
- for (i = 1; i < (1 << order); i++)
+ set_page_owner(page, 0, 0);
+ for (i = 1; i < (1 << order); i++) {
set_page_refcounted(page + i);
+ set_page_owner(page + i, 0, 0);
+ }
}
EXPORT_SYMBOL_GPL(split_page);
@@ -1598,6 +1606,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
}
}
+ set_page_owner(page, order, 0);
return 1UL << order;
}
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 439f2f5..4edecea 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -5,6 +5,7 @@
#include <linux/memory.h>
#include <linux/vmalloc.h>
#include <linux/kmemleak.h>
+#include <linux/page_owner.h>
/*
* struct page extension
@@ -55,6 +56,9 @@ static struct page_ext_operations *page_ext_ops[] = {
#ifdef CONFIG_PAGE_POISONING
&page_poisoning_ops,
#endif
+#ifdef CONFIG_PAGE_OWNER
+ &page_owner_ops,
+#endif
};
static unsigned long total_usage;
diff --git a/mm/page_owner.c b/mm/page_owner.c
new file mode 100644
index 0000000..f75dfe0
--- /dev/null
+++ b/mm/page_owner.c
@@ -0,0 +1,224 @@
+#include <linux/debugfs.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/bootmem.h>
+#include <linux/stacktrace.h>
+#include <linux/page_owner.h>
+#include "internal.h"
+
+static bool page_owner_disabled;
+static bool page_owner_enabled __read_mostly;
+
+static int early_disable_page_owner(char *buf)
+{
+ page_owner_disabled = true;
+
+ return 0;
+}
+early_param("disable_page_owner", early_disable_page_owner);
+
+static bool need_page_owner(void)
+{
+ if (page_owner_disabled)
+ return false;
+
+ return true;
+}
+
+static void init_page_owner(void)
+{
+ if (page_owner_disabled)
+ return;
+
+ page_owner_enabled = true;
+}
+
+struct page_ext_operations page_owner_ops = {
+ .need = need_page_owner,
+ .init = init_page_owner,
+};
+
+void reset_page_owner(struct page *page, unsigned int order)
+{
+ int i;
+ struct page_ext *page_ext;
+
+ if (!page_owner_enabled)
+ return;
+
+ for (i = 0; i < (1 << order); i++) {
+ page_ext = lookup_page_ext(page + i);
+ __clear_bit(PAGE_EXT_OWNER, &page_ext->flags);
+ }
+}
+
+void set_page_owner(struct page *page, unsigned int order, gfp_t gfp_mask)
+{
+ struct page_ext *page_ext;
+ struct stack_trace *trace;
+
+ if (!page_owner_enabled)
+ return;
+
+ page_ext = lookup_page_ext(page);
+
+ trace = &page_ext->trace;
+ trace->nr_entries = 0;
+ trace->max_entries = ARRAY_SIZE(page_ext->trace_entries);
+ trace->entries = &page_ext->trace_entries[0];
+ trace->skip = 3;
+ save_stack_trace(&page_ext->trace);
+
+ page_ext->order = order;
+ page_ext->gfp_mask = gfp_mask;
+
+ __set_bit(PAGE_EXT_OWNER, &page_ext->flags);
+}
+
+static ssize_t
+print_page_owner(char __user *buf, size_t count, unsigned long pfn,
+ struct page *page, struct page_ext *page_ext)
+{
+ int ret;
+ int pageblock_mt, page_mt;
+ char *kbuf;
+
+ kbuf = kmalloc(count, GFP_KERNEL);
+ if (!kbuf)
+ return -ENOMEM;
+
+ ret = snprintf(kbuf, count,
+ "Page allocated via order %u, mask 0x%x\n",
+ page_ext->order, page_ext->gfp_mask);
+
+ if (ret >= count)
+ goto err;
+
+ /* Print information relevant to grouping pages by mobility */
+ pageblock_mt = get_pfnblock_migratetype(page, pfn);
+ page_mt = gfpflags_to_migratetype(page_ext->gfp_mask);
+ ret += snprintf(kbuf + ret, count - ret,
+ "PFN %lu Block %lu type %d %s Flags %s%s%s%s%s%s%s%s%s%s%s%s\n",
+ pfn,
+ pfn >> pageblock_order,
+ pageblock_mt,
+ pageblock_mt != page_mt ? "Fallback" : " ",
+ PageLocked(page) ? "K" : " ",
+ PageError(page) ? "E" : " ",
+ PageReferenced(page) ? "R" : " ",
+ PageUptodate(page) ? "U" : " ",
+ PageDirty(page) ? "D" : " ",
+ PageLRU(page) ? "L" : " ",
+ PageActive(page) ? "A" : " ",
+ PageSlab(page) ? "S" : " ",
+ PageWriteback(page) ? "W" : " ",
+ PageCompound(page) ? "C" : " ",
+ PageSwapCache(page) ? "B" : " ",
+ PageMappedToDisk(page) ? "M" : " ");
+
+ if (ret >= count)
+ goto err;
+
+ ret += snprint_stack_trace(kbuf + ret, count - ret,
+ &page_ext->trace, 0);
+ if (ret >= count)
+ goto err;
+
+ ret += snprintf(kbuf + ret, count - ret, "\n");
+ if (ret >= count)
+ goto err;
+
+ if (copy_to_user(buf, kbuf, ret))
+ ret = -EFAULT;
+
+ kfree(kbuf);
+ return ret;
+
+err:
+ kfree(kbuf);
+ return -ENOMEM;
+}
+
+static ssize_t
+read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos)
+{
+ unsigned long pfn;
+ struct page *page;
+ struct page_ext *page_ext;
+
+ if (!page_owner_enabled)
+ return -EINVAL;
+
+ page = NULL;
+ pfn = min_low_pfn + *ppos;
+
+ /* Find a valid PFN or the start of a MAX_ORDER_NR_PAGES area */
+ while (!pfn_valid(pfn) && (pfn & (MAX_ORDER_NR_PAGES - 1)) != 0)
+ pfn++;
+
+ drain_all_pages(NULL);
+
+ /* Find an allocated page */
+ for (; pfn < max_pfn; pfn++) {
+ /*
+ * If the new page is in a new MAX_ORDER_NR_PAGES area,
+ * validate the area as existing, skip it if not
+ */
+ if ((pfn & (MAX_ORDER_NR_PAGES - 1)) == 0 && !pfn_valid(pfn)) {
+ pfn += MAX_ORDER_NR_PAGES - 1;
+ continue;
+ }
+
+ /* Check for holes within a MAX_ORDER area */
+ if (!pfn_valid_within(pfn))
+ continue;
+
+ page = pfn_to_page(pfn);
+ if (PageBuddy(page)) {
+ unsigned long freepage_order = page_order_unsafe(page);
+
+ if (freepage_order < MAX_ORDER)
+ pfn += (1UL << freepage_order) - 1;
+ continue;
+ }
+
+ page_ext = lookup_page_ext(page);
+
+ /*
+ * Pages allocated before initialization of page_owner are
+ * non-buddy and have no page_owner info.
+ */
+ if (!test_bit(PAGE_EXT_OWNER, &page_ext->flags))
+ continue;
+
+ /* Record the next PFN to read in the file offset */
+ *ppos = (pfn - min_low_pfn) + 1;
+
+ return print_page_owner(buf, count, pfn, page, page_ext);
+ }
+
+ return 0;
+}
+
+static const struct file_operations proc_page_owner_operations = {
+ .read = read_page_owner,
+};
+
+static int __init pageowner_init(void)
+{
+ struct dentry *dentry;
+
+ if (!page_owner_enabled) {
+ pr_info("page_owner is disabled\n");
+ return 0;
+ }
+
+ dentry = debugfs_create_file("page_owner", S_IRUSR, NULL,
+ NULL, &proc_page_owner_operations);
+ if (IS_ERR(dentry))
+ return PTR_ERR(dentry);
+
+ return 0;
+}
+module_init(pageowner_init)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1b12d39..c2f0c58 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -22,6 +22,7 @@
#include <linux/writeback.h>
#include <linux/compaction.h>
#include <linux/mm_inline.h>
+#include <linux/page_ext.h>
#include "internal.h"
@@ -1017,6 +1018,97 @@ static int pagetypeinfo_showblockcount(struct seq_file *m, void *arg)
return 0;
}
+#ifdef CONFIG_PAGE_OWNER
+static void pagetypeinfo_showmixedcount_print(struct seq_file *m,
+ pg_data_t *pgdat,
+ struct zone *zone)
+{
+ struct page *page;
+ struct page_ext *page_ext;
+ unsigned long pfn = zone->zone_start_pfn, block_end_pfn;
+ unsigned long end_pfn = pfn + zone->spanned_pages;
+ unsigned long count[MIGRATE_TYPES] = { 0, };
+ int pageblock_mt, page_mt;
+ int i;
+
+ /* Scan block by block. First and last block may be incomplete */
+ pfn = zone->zone_start_pfn;
+
+ /*
+ * Walk the zone in pageblock_nr_pages steps. If a page block spans
+ * a zone boundary, it will be double counted between zones. This does
+ * not matter as the mixed block count will still be correct
+ */
+ for (; pfn < end_pfn; ) {
+ if (!pfn_valid(pfn)) {
+ pfn = ALIGN(pfn + 1, MAX_ORDER_NR_PAGES);
+ continue;
+ }
+
+ block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
+ block_end_pfn = min(block_end_pfn, end_pfn);
+
+ page = pfn_to_page(pfn);
+ pageblock_mt = get_pfnblock_migratetype(page, pfn);
+
+ for (; pfn < block_end_pfn; pfn++) {
+ if (!pfn_valid_within(pfn))
+ continue;
+
+ page = pfn_to_page(pfn);
+ if (PageBuddy(page)) {
+ pfn += (1UL << page_order(page)) - 1;
+ continue;
+ }
+
+ page_ext = lookup_page_ext(page);
+
+ /* We can't track early allocated pages */
+ if (!test_bit(PAGE_EXT_OWNER, &page_ext->flags))
+ continue;
+
+ page_mt = gfpflags_to_migratetype(page_ext->gfp_mask);
+ if (pageblock_mt != page_mt) {
+ if (is_migrate_cma(pageblock_mt))
+ count[MIGRATE_MOVABLE]++;
+ else
+ count[pageblock_mt]++;
+ break;
+ }
+ pfn += (1UL << page_ext->order) - 1;
+ }
+ }
+
+ /* Print counts */
+ seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
+ for (i = 0; i < MIGRATE_TYPES; i++)
+ seq_printf(m, "%12lu ", count[i]);
+ seq_putc(m, '\n');
+}
+#endif /* CONFIG_PAGE_OWNER */
+
+/*
+ * Print out the number of pageblocks for each migratetype that contain pages
+ * of other types. This gives an indication of how well fallbacks are being
+ * contained by rmqueue_fallback(). It requires information from PAGE_OWNER
+ * to determine what is going on
+ */
+static void pagetypeinfo_showmixedcount(struct seq_file *m, pg_data_t *pgdat)
+{
+#ifdef CONFIG_PAGE_OWNER
+ int mtype;
+
+ drain_all_pages(NULL);
+
+ seq_printf(m, "\n%-23s", "Number of mixed blocks ");
+ for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
+ seq_printf(m, "%12s ", migratetype_names[mtype]);
+ seq_putc(m, '\n');
+
+ walk_zones_in_node(m, pgdat, pagetypeinfo_showmixedcount_print);
+#endif /* CONFIG_PAGE_OWNER */
+}
+
/*
* This prints out statistics in relation to grouping pages by mobility.
* It is expensive to collect so do not constantly read the file.
@@ -1034,6 +1126,7 @@ static int pagetypeinfo_show(struct seq_file *m, void *arg)
seq_putc(m, '\n');
pagetypeinfo_showfree(m, pgdat);
pagetypeinfo_showblockcount(m, pgdat);
+ pagetypeinfo_showmixedcount(m, pgdat);
return 0;
}
diff --git a/tools/vm/Makefile b/tools/vm/Makefile
index 3d907da..ac884b6 100644
--- a/tools/vm/Makefile
+++ b/tools/vm/Makefile
@@ -1,6 +1,6 @@
# Makefile for vm tools
#
-TARGETS=page-types slabinfo
+TARGETS=page-types slabinfo page_owner_sort
LIB_DIR = ../lib/api
LIBS = $(LIB_DIR)/libapikfs.a
@@ -18,5 +18,5 @@ $(LIBS):
$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)
clean:
- $(RM) page-types slabinfo
+ $(RM) page-types slabinfo page_owner_sort
make -C $(LIB_DIR) clean
diff --git a/tools/vm/page_owner_sort.c b/tools/vm/page_owner_sort.c
new file mode 100644
index 0000000..77147b4
--- /dev/null
+++ b/tools/vm/page_owner_sort.c
@@ -0,0 +1,144 @@
+/*
+ * User-space helper to sort the output of /sys/kernel/debug/page_owner
+ *
+ * Example use:
+ * cat /sys/kernel/debug/page_owner > page_owner_full.txt
+ * grep -v ^PFN page_owner_full.txt > page_owner.txt
+ * ./sort page_owner.txt sorted_page_owner.txt
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <string.h>
+
+struct block_list {
+ char *txt;
+ int len;
+ int num;
+};
+
+
+static struct block_list *list;
+static int list_size;
+static int max_size;
+
+struct block_list *block_head;
+
+int read_block(char *buf, int buf_size, FILE *fin)
+{
+ char *curr = buf, *const buf_end = buf + buf_size;
+
+ while (buf_end - curr > 1 && fgets(curr, buf_end - curr, fin)) {
+ if (*curr == '\n') /* empty line */
+ return curr - buf;
+ curr += strlen(curr);
+ }
+
+ return -1; /* EOF or no space left in buf. */
+}
+
+static int compare_txt(const void *p1, const void *p2)
+{
+ const struct block_list *l1 = p1, *l2 = p2;
+
+ return strcmp(l1->txt, l2->txt);
+}
+
+static int compare_num(const void *p1, const void *p2)
+{
+ const struct block_list *l1 = p1, *l2 = p2;
+
+ return l2->num - l1->num;
+}
+
+static void add_list(char *buf, int len)
+{
+ if (list_size != 0 &&
+ len == list[list_size-1].len &&
+ memcmp(buf, list[list_size-1].txt, len) == 0) {
+ list[list_size-1].num++;
+ return;
+ }
+ if (list_size == max_size) {
+ printf("max_size too small??\n");
+ exit(1);
+ }
+ list[list_size].txt = malloc(len+1);
+ list[list_size].len = len;
+ list[list_size].num = 1;
+ memcpy(list[list_size].txt, buf, len);
+ list[list_size].txt[len] = 0;
+ list_size++;
+ if (list_size % 1000 == 0) {
+ printf("loaded %d\r", list_size);
+ fflush(stdout);
+ }
+}
+
+#define BUF_SIZE 1024
+
+int main(int argc, char **argv)
+{
+ FILE *fin, *fout;
+ char buf[BUF_SIZE];
+ int ret, i, count;
+ struct block_list *list2;
+ struct stat st;
+
+ if (argc < 3) {
+ printf("Usage: ./program <input> <output>\n");
+ perror("open: ");
+ exit(1);
+ }
+
+ fin = fopen(argv[1], "r");
+ fout = fopen(argv[2], "w");
+ if (!fin || !fout) {
+ printf("Usage: ./program <input> <output>\n");
+ perror("open: ");
+ exit(1);
+ }
+
+ fstat(fileno(fin), &st);
+ max_size = st.st_size / 100; /* hack ... */
+
+ list = malloc(max_size * sizeof(*list));
+
+ for ( ; ; ) {
+ ret = read_block(buf, BUF_SIZE, fin);
+ if (ret < 0)
+ break;
+
+ add_list(buf, ret);
+ }
+
+ printf("loaded %d\n", list_size);
+
+ printf("sorting ....\n");
+
+ qsort(list, list_size, sizeof(list[0]), compare_txt);
+
+ list2 = malloc(sizeof(*list) * list_size);
+
+ printf("culling\n");
+
+ for (i = count = 0; i < list_size; i++) {
+ if (count == 0 ||
+ strcmp(list2[count-1].txt, list[i].txt) != 0) {
+ list2[count++] = list[i];
+ } else {
+ list2[count-1].num += list[i].num;
+ }
+ }
+
+ qsort(list2, count, sizeof(list[0]), compare_num);
+
+ for (i = 0; i < count; i++)
+ fprintf(fout, "%d times:\n%s\n", list2[i].num, list2[i].txt);
+
+ return 0;
+}
--
1.7.9.5
The extended memory used to store page owner information is initialized
some time after the page allocator starts. Until that initialization,
many pages can be allocated and they will have no owner information.
This makes debugging with page owner harder, so some fixup is helpful.

This patch fixes up the situation by setting fake owner information
immediately after page extension is initialized. The information doesn't
identify the right owner, but, at least, it can tell more accurately
whether a page is allocated or not.

In my testing, this patch catches 13343 early allocated pages, although
they are mostly allocated by the page extension feature itself. In any
case, after that there is no page left that is allocated but has no
page owner flag set.
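With this applied, the boot log gains one line per populated zone, for
example (the zone name and count are from my test machine and will
differ elsewhere):

    Node 0, zone   Normal: page owner found early allocated 13343 pages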
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/page_owner.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 91 insertions(+), 2 deletions(-)
diff --git a/mm/page_owner.c b/mm/page_owner.c
index f75dfe0..2ea8d02 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -10,6 +10,8 @@
static bool page_owner_disabled;
static bool page_owner_enabled __read_mostly;
+static void init_early_allocated_pages(void);
+
static int early_disable_page_owner(char *buf)
{
page_owner_disabled = true;
@@ -32,6 +34,7 @@ static void init_page_owner(void)
return;
page_owner_enabled = true;
+ init_early_allocated_pages();
}
struct page_ext_operations page_owner_ops = {
@@ -186,8 +189,8 @@ read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos)
page_ext = lookup_page_ext(page);
/*
- * Pages allocated before initialization of page_owner are
- * non-buddy and have no page_owner info.
+ * Some pages could be missed by concurrent allocation or free,
+ * because we don't hold the zone lock.
*/
if (!test_bit(PAGE_EXT_OWNER, &page_ext->flags))
continue;
@@ -201,6 +204,92 @@ read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos)
return 0;
}
+static void init_pages_in_zone(pg_data_t *pgdat, struct zone *zone)
+{
+ struct page *page;
+ struct page_ext *page_ext;
+ unsigned long pfn = zone->zone_start_pfn, block_end_pfn;
+ unsigned long end_pfn = pfn + zone->spanned_pages;
+ unsigned long count = 0;
+
+ /* Scan block by block. First and last block may be incomplete */
+ pfn = zone->zone_start_pfn;
+
+ /*
+ * Walk the zone in pageblock_nr_pages steps. If a page block spans
+ * a zone boundary, it will be double counted between zones. This does
+ * not matter as the mixed block count will still be correct
+ */
+ for (; pfn < end_pfn; ) {
+ if (!pfn_valid(pfn)) {
+ pfn = ALIGN(pfn + 1, MAX_ORDER_NR_PAGES);
+ continue;
+ }
+
+ block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
+ block_end_pfn = min(block_end_pfn, end_pfn);
+
+ page = pfn_to_page(pfn);
+
+ for (; pfn < block_end_pfn; pfn++) {
+ if (!pfn_valid_within(pfn))
+ continue;
+
+ page = pfn_to_page(pfn);
+
+ /*
+ * We are safe to check the buddy flag and order, because
+ * this is the init stage and only a single thread runs.
+ */
+ if (PageBuddy(page)) {
+ pfn += (1UL << page_order(page)) - 1;
+ continue;
+ }
+
+ if (PageReserved(page))
+ continue;
+
+ page_ext = lookup_page_ext(page);
+
+ /* Maybe an overlapping zone */
+ if (test_bit(PAGE_EXT_OWNER, &page_ext->flags))
+ continue;
+
+ /* Found early allocated page */
+ set_page_owner(page, 0, 0);
+ count++;
+ }
+ }
+
+ pr_info("Node %d, zone %8s: page owner found early allocated %lu pages\n",
+ pgdat->node_id, zone->name, count);
+}
+
+static void init_zones_in_node(pg_data_t *pgdat)
+{
+ struct zone *zone;
+ struct zone *node_zones = pgdat->node_zones;
+ unsigned long flags;
+
+ for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
+ if (!populated_zone(zone))
+ continue;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ init_pages_in_zone(pgdat, zone);
+ spin_unlock_irqrestore(&zone->lock, flags);
+ }
+}
+
+static void init_early_allocated_pages(void)
+{
+ pg_data_t *pgdat;
+
+ drain_all_pages(NULL);
+ for_each_online_pgdat(pgdat)
+ init_zones_in_node(pgdat);
+}
+
static const struct file_operations proc_page_owner_operations = {
.read = read_page_owner,
};
--
1.7.9.5
do_mmap_private() in nommu.c tries to allocate physically contiguous
pages of arbitrary size in some cases, and we now have a good
abstraction that does exactly the same thing, alloc_pages_exact(). So
change it to use that. There is no functional change.

This is a preparation step for supporting the page owner feature
accurately.
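For reference, alloc_pages_exact() allocates a power-of-two block
internally and frees the unused tail pages, so the caller gets exactly
the requested length (rounded up to PAGE_SIZE) of physically contiguous
memory. A minimal sketch of the call pattern, with a hypothetical len:

    void *base = alloc_pages_exact(len, GFP_KERNEL);

    if (!base)
        return -ENOMEM;
    /* ... use the buffer ... */
    free_pages_exact(base, len);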
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/nommu.c | 33 +++++++++++----------------------
1 file changed, 11 insertions(+), 22 deletions(-)
diff --git a/mm/nommu.c b/mm/nommu.c
index 2266a34..1b87c17 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1149,8 +1149,7 @@ static int do_mmap_private(struct vm_area_struct *vma,
unsigned long len,
unsigned long capabilities)
{
- struct page *pages;
- unsigned long total, point, n;
+ unsigned long total, point;
void *base;
int ret, order;
@@ -1182,33 +1181,23 @@ static int do_mmap_private(struct vm_area_struct *vma,
order = get_order(len);
kdebug("alloc order %d for %lx", order, len);
- pages = alloc_pages(GFP_KERNEL, order);
- if (!pages)
- goto enomem;
-
total = 1 << order;
- atomic_long_add(total, &mmap_pages_allocated);
-
point = len >> PAGE_SHIFT;
- /* we allocated a power-of-2 sized page set, so we may want to trim off
- * the excess */
+ /* we don't want to allocate a power-of-2 sized page set */
if (sysctl_nr_trim_pages && total - point >= sysctl_nr_trim_pages) {
- while (total > point) {
- order = ilog2(total - point);
- n = 1 << order;
- kdebug("shave %lu/%lu @%lu", n, total - point, total);
- atomic_long_sub(n, &mmap_pages_allocated);
- total -= n;
- set_page_refcounted(pages + total);
- __free_pages(pages + total, order);
- }
+ total = point;
+ kdebug("try to alloc exact %lu pages", total);
+ base = alloc_pages_exact(len, GFP_KERNEL);
+ } else {
+ base = __get_free_pages(GFP_KERNEL, order);
}
- for (point = 1; point < total; point++)
- set_page_refcounted(&pages[point]);
+ if (!base)
+ goto enomem;
+
+ atomic_long_add(total, &mmap_pages_allocated);
- base = page_address(pages);
region->vm_flags = vma->vm_flags |= VM_MAPPED_COPY;
region->vm_start = (unsigned long) base;
region->vm_end = region->vm_start + len;
--
1.7.9.5
Until now, debug-pagealloc has needed extra flags in struct page, so we
have to recompile the whole source code when we decide to use it. This
is really painful, because recompiling takes time and sometimes a
rebuild is not possible due to third-party modules depending on struct
page. So we can't use this good feature in many cases.

Now we have the page extension feature, which allows us to keep extra
flags outside of struct page. This gets rid of the third-party module
issue mentioned above, and it also allows us to determine at boot time
whether extra memory for page extension is needed. With these
properties, we can disable debug-pagealloc at boot time, with only low
computational overhead, in a kernel built with CONFIG_DEBUG_PAGEALLOC.
This will help our development process greatly.

This patch is the preparation step to achieve the above goal.
debug-pagealloc originally uses an extra field of struct page, but
after this patch it will use a field of struct page_ext. Because memory
for page_ext is allocated later than the page allocator is initialized
with CONFIG_SPARSEMEM, we must disable the debug-pagealloc feature
temporarily until page_ext is initialized. This patch implements that.
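The conversion pattern is the same everywhere: instead of testing a bit
in page->debug_flags, look up the page_ext entry first and test the bit
there, guarded by the new enabled check. For example, mirroring the
page_is_guard() change in the mm.h hunk below:

    static inline bool page_is_guard(struct page *page)
    {
        struct page_ext *page_ext;

        if (!debug_guardpage_enabled())
            return false;

        page_ext = lookup_page_ext(page);
        return test_bit(PAGE_EXT_DEBUG_GUARD, &page_ext->flags);
    }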
v2: fix compile error on CONFIG_PAGE_POISONING
Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/mm.h | 19 ++++++++++++++++++-
include/linux/mm_types.h | 4 ----
include/linux/page-debug-flags.h | 32 --------------------------------
include/linux/page_ext.h | 15 +++++++++++++++
mm/Kconfig.debug | 1 +
mm/debug-pagealloc.c | 37 +++++++++++++++++++++++++++++++++----
mm/page_alloc.c | 38 +++++++++++++++++++++++++++++++++++---
mm/page_ext.c | 4 ++++
8 files changed, 106 insertions(+), 44 deletions(-)
delete mode 100644 include/linux/page-debug-flags.h
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b922a16..5a8d4d4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -19,6 +19,7 @@
#include <linux/bit_spinlock.h>
#include <linux/shrinker.h>
#include <linux/resource.h>
+#include <linux/page_ext.h>
struct mempolicy;
struct anon_vma;
@@ -2149,20 +2150,36 @@ extern void copy_user_huge_page(struct page *dst, struct page *src,
unsigned int pages_per_huge_page);
#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
+extern struct page_ext_operations debug_guardpage_ops;
+extern struct page_ext_operations page_poisoning_ops;
+
#ifdef CONFIG_DEBUG_PAGEALLOC
extern unsigned int _debug_guardpage_minorder;
+extern bool _debug_guardpage_enabled;
static inline unsigned int debug_guardpage_minorder(void)
{
return _debug_guardpage_minorder;
}
+static inline bool debug_guardpage_enabled(void)
+{
+ return _debug_guardpage_enabled;
+}
+
static inline bool page_is_guard(struct page *page)
{
- return test_bit(PAGE_DEBUG_FLAG_GUARD, &page->debug_flags);
+ struct page_ext *page_ext;
+
+ if (!debug_guardpage_enabled())
+ return false;
+
+ page_ext = lookup_page_ext(page);
+ return test_bit(PAGE_EXT_DEBUG_GUARD, &page_ext->flags);
}
#else
static inline unsigned int debug_guardpage_minorder(void) { return 0; }
+static inline bool debug_guardpage_enabled(void) { return false; }
static inline bool page_is_guard(struct page *page) { return false; }
#endif /* CONFIG_DEBUG_PAGEALLOC */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 33a8acf..c7b22e7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -10,7 +10,6 @@
#include <linux/rwsem.h>
#include <linux/completion.h>
#include <linux/cpumask.h>
-#include <linux/page-debug-flags.h>
#include <linux/uprobes.h>
#include <linux/page-flags-layout.h>
#include <asm/page.h>
@@ -186,9 +185,6 @@ struct page {
void *virtual; /* Kernel virtual address (NULL if
not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */
-#ifdef CONFIG_WANT_PAGE_DEBUG_FLAGS
- unsigned long debug_flags; /* Use atomic bitops on this */
-#endif
#ifdef CONFIG_KMEMCHECK
/*
diff --git a/include/linux/page-debug-flags.h b/include/linux/page-debug-flags.h
deleted file mode 100644
index 22691f61..0000000
--- a/include/linux/page-debug-flags.h
+++ /dev/null
@@ -1,32 +0,0 @@
-#ifndef LINUX_PAGE_DEBUG_FLAGS_H
-#define LINUX_PAGE_DEBUG_FLAGS_H
-
-/*
- * page->debug_flags bits:
- *
- * PAGE_DEBUG_FLAG_POISON is set for poisoned pages. This is used to
- * implement generic debug pagealloc feature. The pages are filled with
- * poison patterns and set this flag after free_pages(). The poisoned
- * pages are verified whether the patterns are not corrupted and clear
- * the flag before alloc_pages().
- */
-
-enum page_debug_flags {
- PAGE_DEBUG_FLAG_POISON, /* Page is poisoned */
- PAGE_DEBUG_FLAG_GUARD,
-};
-
-/*
- * Ensure that CONFIG_WANT_PAGE_DEBUG_FLAGS reliably
- * gets turned off when no debug features are enabling it!
- */
-
-#ifdef CONFIG_WANT_PAGE_DEBUG_FLAGS
-#if !defined(CONFIG_PAGE_POISONING) && \
- !defined(CONFIG_PAGE_GUARD) \
-/* && !defined(CONFIG_PAGE_DEBUG_SOMETHING_ELSE) && ... */
-#error WANT_PAGE_DEBUG_FLAGS is turned on with no debug features!
-#endif
-#endif /* CONFIG_WANT_PAGE_DEBUG_FLAGS */
-
-#endif /* LINUX_PAGE_DEBUG_FLAGS_H */
diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index 2ccc8b4..61c0f05 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -10,6 +10,21 @@ struct page_ext_operations {
#ifdef CONFIG_PAGE_EXTENSION
/*
+ * page_ext->flags bits:
+ *
+ * PAGE_EXT_DEBUG_POISON is set for poisoned pages. This is used to
+ * implement generic debug pagealloc feature. The pages are filled with
+ * poison patterns and set this flag after free_pages(). The poisoned
+ * pages are verified whether the patterns are not corrupted and clear
+ * the flag before alloc_pages().
+ */
+
+enum page_ext_flags {
+ PAGE_EXT_DEBUG_POISON, /* Page is poisoned */
+ PAGE_EXT_DEBUG_GUARD,
+};
+
+/*
* Page Extension can be considered as an extended mem_map.
* A page_ext page is associated with every page descriptor. The
* page_ext helps us add more information about the page.
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 1ba81c7..56badfc 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -12,6 +12,7 @@ config DEBUG_PAGEALLOC
depends on DEBUG_KERNEL
depends on !HIBERNATION || ARCH_SUPPORTS_DEBUG_PAGEALLOC && !PPC && !SPARC
depends on !KMEMCHECK
+ select PAGE_EXTENSION
select PAGE_POISONING if !ARCH_SUPPORTS_DEBUG_PAGEALLOC
select PAGE_GUARD if ARCH_SUPPORTS_DEBUG_PAGEALLOC
---help---
diff --git a/mm/debug-pagealloc.c b/mm/debug-pagealloc.c
index 789ff70..0072f2c 100644
--- a/mm/debug-pagealloc.c
+++ b/mm/debug-pagealloc.c
@@ -2,23 +2,49 @@
#include <linux/string.h>
#include <linux/mm.h>
#include <linux/highmem.h>
-#include <linux/page-debug-flags.h>
+#include <linux/page_ext.h>
#include <linux/poison.h>
#include <linux/ratelimit.h>
+static bool page_poisoning_enabled __read_mostly;
+
+static bool need_page_poisoning(void)
+{
+ return true;
+}
+
+static void init_page_poisoning(void)
+{
+ page_poisoning_enabled = true;
+}
+
+struct page_ext_operations page_poisoning_ops = {
+ .need = need_page_poisoning,
+ .init = init_page_poisoning,
+};
+
static inline void set_page_poison(struct page *page)
{
- __set_bit(PAGE_DEBUG_FLAG_POISON, &page->debug_flags);
+ struct page_ext *page_ext;
+
+ page_ext = lookup_page_ext(page);
+ __set_bit(PAGE_EXT_DEBUG_POISON, &page_ext->flags);
}
static inline void clear_page_poison(struct page *page)
{
- __clear_bit(PAGE_DEBUG_FLAG_POISON, &page->debug_flags);
+ struct page_ext *page_ext;
+
+ page_ext = lookup_page_ext(page);
+ __clear_bit(PAGE_EXT_DEBUG_POISON, &page_ext->flags);
}
static inline bool page_poison(struct page *page)
{
- return test_bit(PAGE_DEBUG_FLAG_POISON, &page->debug_flags);
+ struct page_ext *page_ext;
+
+ page_ext = lookup_page_ext(page);
+ return test_bit(PAGE_EXT_DEBUG_POISON, &page_ext->flags);
}
static void poison_page(struct page *page)
@@ -95,6 +121,9 @@ static void unpoison_pages(struct page *page, int n)
void kernel_map_pages(struct page *page, int numpages, int enable)
{
+ if (!page_poisoning_enabled)
+ return;
+
if (enable)
unpoison_pages(page, numpages);
else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c91f449..7534733 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -56,7 +56,7 @@
#include <linux/prefetch.h>
#include <linux/mm_inline.h>
#include <linux/migrate.h>
-#include <linux/page-debug-flags.h>
+#include <linux/page_ext.h>
#include <linux/hugetlb.h>
#include <linux/sched/rt.h>
@@ -426,6 +426,22 @@ static inline void prep_zero_page(struct page *page, unsigned int order,
#ifdef CONFIG_DEBUG_PAGEALLOC
unsigned int _debug_guardpage_minorder;
+bool _debug_guardpage_enabled __read_mostly;
+
+static bool need_debug_guardpage(void)
+{
+ return true;
+}
+
+static void init_debug_guardpage(void)
+{
+ _debug_guardpage_enabled = true;
+}
+
+struct page_ext_operations debug_guardpage_ops = {
+ .need = need_debug_guardpage,
+ .init = init_debug_guardpage,
+};
static int __init debug_guardpage_minorder_setup(char *buf)
{
@@ -444,7 +460,14 @@ __setup("debug_guardpage_minorder=", debug_guardpage_minorder_setup);
static inline void set_page_guard(struct zone *zone, struct page *page,
unsigned int order, int migratetype)
{
- __set_bit(PAGE_DEBUG_FLAG_GUARD, &page->debug_flags);
+ struct page_ext *page_ext;
+
+ if (!debug_guardpage_enabled())
+ return;
+
+ page_ext = lookup_page_ext(page);
+ __set_bit(PAGE_EXT_DEBUG_GUARD, &page_ext->flags);
+
INIT_LIST_HEAD(&page->lru);
set_page_private(page, order);
/* Guard pages are not available for any usage */
@@ -454,12 +477,20 @@ static inline void set_page_guard(struct zone *zone, struct page *page,
static inline void clear_page_guard(struct zone *zone, struct page *page,
unsigned int order, int migratetype)
{
- __clear_bit(PAGE_DEBUG_FLAG_GUARD, &page->debug_flags);
+ struct page_ext *page_ext;
+
+ if (!debug_guardpage_enabled())
+ return;
+
+ page_ext = lookup_page_ext(page);
+ __clear_bit(PAGE_EXT_DEBUG_GUARD, &page_ext->flags);
+
set_page_private(page, 0);
if (!is_migrate_isolate(migratetype))
__mod_zone_freepage_state(zone, (1 << order), migratetype);
}
#else
+struct page_ext_operations debug_guardpage_ops = { NULL, };
static inline void set_page_guard(struct zone *zone, struct page *page,
unsigned int order, int migratetype) {}
static inline void clear_page_guard(struct zone *zone, struct page *page,
@@ -870,6 +901,7 @@ static inline void expand(struct zone *zone, struct page *page,
VM_BUG_ON_PAGE(bad_range(zone, &page[size]), &page[size]);
if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC) &&
+ debug_guardpage_enabled() &&
high < debug_guardpage_minorder()) {
/*
* Mark as guard pages (or page), that will allow to
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 41e62e2..439f2f5 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -51,6 +51,10 @@
*/
static struct page_ext_operations *page_ext_ops[] = {
+ &debug_guardpage_ops,
+#ifdef CONFIG_PAGE_POISONING
+ &page_poisoning_ops,
+#endif
};
static unsigned long total_usage;
--
1.7.9.5
Now we have prepared everything needed to avoid using debug-pagealloc
at boot time. So introduce a new kernel parameter that disables
debug-pagealloc at boot time, and make the related functions become
no-ops in that case.

The only non-intuitive part is the change to the guard page functions.
Because guard pages are effective only if debug-pagealloc is enabled,
turning them off together with debug-pagealloc is the reasonable thing
to do.

v2: make debug-pagealloc boot-time configurable for page poisoning, too
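For example, booting a CONFIG_DEBUG_PAGEALLOC=y kernel with

    disable_debug_pagealloc

on the kernel command line makes kernel_map_pages() return immediately
and makes the guard page / page poisoning need callbacks return false,
so no page_ext memory is allocated for them (this is the behaviour
implemented in the hunks below).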
Signed-off-by: Joonsoo Kim <[email protected]>
---
Documentation/kernel-parameters.txt | 8 ++++++++
arch/powerpc/mm/hash_utils_64.c | 2 +-
arch/powerpc/mm/pgtable_32.c | 2 +-
arch/s390/mm/pageattr.c | 2 +-
arch/sparc/mm/init_64.c | 2 +-
arch/x86/mm/pageattr.c | 2 +-
include/linux/mm.h | 17 ++++++++++++++++-
mm/debug-pagealloc.c | 8 +++++++-
mm/page_alloc.c | 16 ++++++++++++++++
9 files changed, 52 insertions(+), 7 deletions(-)
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 3c5a178..b5ac055 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -858,6 +858,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
causing system reset or hang due to sending
INIT from AP to BSP.
+ disable_debug_pagealloc
+ [KNL] When CONFIG_DEBUG_PAGEALLOC is set, this
+ parameter allows the user to disable it at boot time.
+ With this parameter, we can avoid allocating a huge
+ chunk of memory for debug pagealloc, and the system
+ will then work mostly the same as a kernel built
+ without CONFIG_DEBUG_PAGEALLOC.
+
disable_ddw [PPC/PSERIES]
Disable Dynamic DMA Window support. Use this if
to workaround buggy firmware.
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index d5339a3..57b9c23 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1432,7 +1432,7 @@ static void kernel_unmap_linear_page(unsigned long vaddr, unsigned long lmi)
mmu_kernel_ssize, 0);
}
-void kernel_map_pages(struct page *page, int numpages, int enable)
+void __kernel_map_pages(struct page *page, int numpages, int enable)
{
unsigned long flags, vaddr, lmi;
int i;
diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c
index cf11342..b98aac6 100644
--- a/arch/powerpc/mm/pgtable_32.c
+++ b/arch/powerpc/mm/pgtable_32.c
@@ -430,7 +430,7 @@ static int change_page_attr(struct page *page, int numpages, pgprot_t prot)
}
-void kernel_map_pages(struct page *page, int numpages, int enable)
+void __kernel_map_pages(struct page *page, int numpages, int enable)
{
if (PageHighMem(page))
return;
diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
index 3fef3b2..426c9d4 100644
--- a/arch/s390/mm/pageattr.c
+++ b/arch/s390/mm/pageattr.c
@@ -120,7 +120,7 @@ static void ipte_range(pte_t *pte, unsigned long address, int nr)
}
}
-void kernel_map_pages(struct page *page, int numpages, int enable)
+void __kernel_map_pages(struct page *page, int numpages, int enable)
{
unsigned long address;
int nr, i, j;
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 2d91c62..3ea267c 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -1621,7 +1621,7 @@ static void __init kernel_physical_mapping_init(void)
}
#ifdef CONFIG_DEBUG_PAGEALLOC
-void kernel_map_pages(struct page *page, int numpages, int enable)
+void __kernel_map_pages(struct page *page, int numpages, int enable)
{
unsigned long phys_start = page_to_pfn(page) << PAGE_SHIFT;
unsigned long phys_end = phys_start + (numpages * PAGE_SIZE);
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 36de293..4d304e1 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -1801,7 +1801,7 @@ static int __set_pages_np(struct page *page, int numpages)
return __change_page_attr_set_clr(&cpa, 0);
}
-void kernel_map_pages(struct page *page, int numpages, int enable)
+void __kernel_map_pages(struct page *page, int numpages, int enable)
{
if (PageHighMem(page))
return;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5a8d4d4..5dc11e7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2055,7 +2055,22 @@ static inline void vm_stat_account(struct mm_struct *mm,
#endif /* CONFIG_PROC_FS */
#ifdef CONFIG_DEBUG_PAGEALLOC
-extern void kernel_map_pages(struct page *page, int numpages, int enable);
+extern bool _debug_pagealloc_enabled;
+extern void __kernel_map_pages(struct page *page, int numpages, int enable);
+
+static inline bool debug_pagealloc_enabled(void)
+{
+ return _debug_pagealloc_enabled;
+}
+
+static inline void
+kernel_map_pages(struct page *page, int numpages, int enable)
+{
+ if (!debug_pagealloc_enabled())
+ return;
+
+ __kernel_map_pages(page, numpages, enable);
+}
#ifdef CONFIG_HIBERNATION
extern bool kernel_page_present(struct page *page);
#endif /* CONFIG_HIBERNATION */
diff --git a/mm/debug-pagealloc.c b/mm/debug-pagealloc.c
index 0072f2c..5bf5906 100644
--- a/mm/debug-pagealloc.c
+++ b/mm/debug-pagealloc.c
@@ -10,11 +10,17 @@ static bool page_poisoning_enabled __read_mostly;
static bool need_page_poisoning(void)
{
+ if (!debug_pagealloc_enabled())
+ return false;
+
return true;
}
static void init_page_poisoning(void)
{
+ if (!debug_pagealloc_enabled())
+ return;
+
page_poisoning_enabled = true;
}
@@ -119,7 +125,7 @@ static void unpoison_pages(struct page *page, int n)
unpoison_page(page + i);
}
-void kernel_map_pages(struct page *page, int numpages, int enable)
+void __kernel_map_pages(struct page *page, int numpages, int enable)
{
if (!page_poisoning_enabled)
return;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7534733..4eea173 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -426,15 +426,31 @@ static inline void prep_zero_page(struct page *page, unsigned int order,
#ifdef CONFIG_DEBUG_PAGEALLOC
unsigned int _debug_guardpage_minorder;
+bool _debug_pagealloc_enabled __read_mostly = true;
bool _debug_guardpage_enabled __read_mostly;
+static int __init early_disable_debug_pagealloc(char *buf)
+{
+ _debug_pagealloc_enabled = false;
+
+ return 0;
+}
+early_param("disable_debug_pagealloc", early_disable_debug_pagealloc);
+
static bool need_debug_guardpage(void)
{
+ /* If we don't use debug_pagealloc, we don't need guard page */
+ if (!debug_pagealloc_enabled())
+ return false;
+
return true;
}
static void init_debug_guardpage(void)
{
+ if (!debug_pagealloc_enabled())
+ return;
+
_debug_guardpage_enabled = true;
}
--
1.7.9.5
When we debug something, we'd like to insert some information into
every page. For this purpose, we sometimes modify struct page itself.
But this has drawbacks. First, it requires a recompile. This makes us
hesitate to use the powerful debug feature, so the development process
is slowed down. Second, sometimes it is impossible to rebuild the kernel
due to third-party module dependencies. Third, system behaviour can
differ significantly after a recompile, because the size of struct page
changes greatly and this structure is accessed by every part of the
kernel. Keeping it as it is makes it easier to reproduce an erroneous
situation.

This feature is intended to overcome the above-mentioned problems. It
allocates memory for extended data per page in a certain place rather
than in struct page itself. This memory can be accessed by the accessor
functions provided by this code. During the boot process, it checks
whether allocating a huge chunk of memory is needed or not. If not, it
avoids allocating memory at all. With this advantage, we can include
this feature in the kernel by default and can avoid rebuilds, solving
the related problems.

Until now, memcg used this technique, but memcg has now decided to
embed its variable in struct page itself, and its code to extend struct
page has been removed. I'd like to use this code to develop debug
features, so this patch resurrects it.

To help these things work well, this patch introduces two callbacks for
clients, as sketched below. One is the need callback, which is mandatory
if the user wants to avoid useless memory allocation at boot time. The
other is the optional init callback, which is used to do proper
initialization after memory is allocated. A detailed explanation of the
purpose of these functions is in the code comment; please refer to it.

Everything else is completely the same as the previous extension code
in memcg.
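As an illustration, a client would describe itself to the page
extension core roughly like this (hypothetical names; the real users
added later in the series, debug_guardpage_ops and page_owner_ops,
follow the same shape and are listed in the page_ext_ops[] array):

    /* hypothetical client; not part of this patch */
    static bool foo_enabled __read_mostly = true;

    static bool need_foo(void)
    {
        /* return false when this feature is disabled for this boot */
        return foo_enabled;
    }

    static void init_foo(void)
    {
        /* runs once the page_ext memory is fully set up */
    }

    struct page_ext_operations foo_ops = {
        .need = need_foo,
        .init = init_foo,
    };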
v2:
describe the overall design at the top of the page extension code.
add more description to the commit message.
Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/mmzone.h | 12 ++
include/linux/page_ext.h | 59 +++++++
init/main.c | 7 +
mm/Kconfig.debug | 9 ++
mm/Makefile | 1 +
mm/page_alloc.c | 2 +
mm/page_ext.c | 385 ++++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 475 insertions(+)
create mode 100644 include/linux/page_ext.h
create mode 100644 mm/page_ext.c
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3879d76..2f0856d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -722,6 +722,9 @@ typedef struct pglist_data {
int nr_zones;
#ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
struct page *node_mem_map;
+#ifdef CONFIG_PAGE_EXTENSION
+ struct page_ext *node_page_ext;
+#endif
#endif
#ifndef CONFIG_NO_BOOTMEM
struct bootmem_data *bdata;
@@ -1075,6 +1078,7 @@ static inline unsigned long early_pfn_to_nid(unsigned long pfn)
#define SECTION_ALIGN_DOWN(pfn) ((pfn) & PAGE_SECTION_MASK)
struct page;
+struct page_ext;
struct mem_section {
/*
* This is, logically, a pointer to an array of struct
@@ -1092,6 +1096,14 @@ struct mem_section {
/* See declaration of similar field in struct zone */
unsigned long *pageblock_flags;
+#ifdef CONFIG_PAGE_EXTENSION
+ /*
+ * If !SPARSEMEM, pgdat doesn't have page_ext pointer. We use
+ * section. (see page_ext.h about this.)
+ */
+ struct page_ext *page_ext;
+ unsigned long pad;
+#endif
/*
* WARNING: mem_section must be a power-of-2 in size for the
* calculation and use of SECTION_ROOT_MASK to make sense.
diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
new file mode 100644
index 0000000..2ccc8b4
--- /dev/null
+++ b/include/linux/page_ext.h
@@ -0,0 +1,59 @@
+#ifndef __LINUX_PAGE_EXT_H
+#define __LINUX_PAGE_EXT_H
+
+struct pglist_data;
+struct page_ext_operations {
+ bool (*need)(void);
+ void (*init)(void);
+};
+
+#ifdef CONFIG_PAGE_EXTENSION
+
+/*
+ * Page Extension can be considered as an extended mem_map.
+ * A page_ext page is associated with every page descriptor. The
+ * page_ext helps us add more information about the page.
+ * All page_ext are allocated at boot or memory hotplug event,
+ * then the page_ext for pfn always exists.
+ */
+struct page_ext {
+ unsigned long flags;
+};
+
+extern void pgdat_page_ext_init(struct pglist_data *pgdat);
+
+#ifdef CONFIG_SPARSEMEM
+static inline void page_ext_init_flatmem(void)
+{
+}
+extern void page_ext_init(void);
+#else
+extern void page_ext_init_flatmem(void);
+static inline void page_ext_init(void)
+{
+}
+#endif
+
+struct page_ext *lookup_page_ext(struct page *page);
+
+#else /* !CONFIG_PAGE_EXTENSION */
+struct page_ext;
+
+static inline void pgdat_page_ext_init(struct pglist_data *pgdat)
+{
+}
+
+static inline struct page_ext *lookup_page_ext(struct page *page)
+{
+ return NULL;
+}
+
+static inline void page_ext_init(void)
+{
+}
+
+static inline void page_ext_init_flatmem(void)
+{
+}
+#endif /* CONFIG_PAGE_EXTENSION */
+#endif /* __LINUX_PAGE_EXT_H */
diff --git a/init/main.c b/init/main.c
index 235aafb..c60a246 100644
--- a/init/main.c
+++ b/init/main.c
@@ -51,6 +51,7 @@
#include <linux/mempolicy.h>
#include <linux/key.h>
#include <linux/buffer_head.h>
+#include <linux/page_ext.h>
#include <linux/debug_locks.h>
#include <linux/debugobjects.h>
#include <linux/lockdep.h>
@@ -484,6 +485,11 @@ void __init __weak thread_info_cache_init(void)
*/
static void __init mm_init(void)
{
+ /*
+ * page_ext requires contiguous pages,
+ * bigger than MAX_ORDER unless SPARSEMEM.
+ */
+ page_ext_init_flatmem();
mem_init();
kmem_cache_init();
percpu_init_late();
@@ -621,6 +627,7 @@ asmlinkage __visible void __init start_kernel(void)
initrd_start = 0;
}
#endif
+ page_ext_init();
debug_objects_mem_init();
kmemleak_init();
setup_per_cpu_pageset();
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 4b24432..1ba81c7 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -1,3 +1,12 @@
+config PAGE_EXTENSION
+ bool "Extend memmap on extra space for more information on page"
+ ---help---
+ Extend memmap on extra space for more information on page. This
+ could be used for debugging features that need to insert an extra
+ field for every page. This extension enables us to save memory
+ by not allocating this extra memory according to the boot-time
+ configuration.
+
config DEBUG_PAGEALLOC
bool "Debug page memory allocations"
depends on DEBUG_KERNEL
diff --git a/mm/Makefile b/mm/Makefile
index 9c4371d..0b7a784 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -71,3 +71,4 @@ obj-$(CONFIG_ZSMALLOC) += zsmalloc.o
obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
obj-$(CONFIG_CMA) += cma.o
obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
+obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c0dbede..c91f449 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -48,6 +48,7 @@
#include <linux/backing-dev.h>
#include <linux/fault-inject.h>
#include <linux/page-isolation.h>
+#include <linux/page_ext.h>
#include <linux/debugobjects.h>
#include <linux/kmemleak.h>
#include <linux/compaction.h>
@@ -4857,6 +4858,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
#endif
init_waitqueue_head(&pgdat->kswapd_wait);
init_waitqueue_head(&pgdat->pfmemalloc_wait);
+ pgdat_page_ext_init(pgdat);
for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
diff --git a/mm/page_ext.c b/mm/page_ext.c
new file mode 100644
index 0000000..41e62e2
--- /dev/null
+++ b/mm/page_ext.c
@@ -0,0 +1,385 @@
+#include <linux/mm.h>
+#include <linux/mmzone.h>
+#include <linux/bootmem.h>
+#include <linux/page_ext.h>
+#include <linux/memory.h>
+#include <linux/vmalloc.h>
+#include <linux/kmemleak.h>
+
+/*
+ * struct page extension
+ *
+ * This is the feature to manage memory for extended data per page.
+ *
+ * Until now, we had to modify struct page itself to store extra data per page.
+ * This requires rebuilding the kernel and is a really time-consuming process.
+ * And sometimes a rebuild is impossible due to third-party module dependencies.
+ * Finally, enlarging struct page could cause unwanted system behaviour changes.
+ *
+ * This feature is intended to overcome the above-mentioned problems. It
+ * allocates memory for extended data per page in a certain place rather than
+ * in struct page itself. This memory can be accessed by the accessor
+ * functions provided by this code. During the boot process, it checks whether
+ * allocating a huge chunk of memory is needed or not. If not, it avoids
+ * allocating memory at all. With this advantage, we can include this feature
+ * in the kernel by default and can avoid rebuilds, solving the related problems.
+ *
+ * To help these things work well, there are two callbacks for clients. One
+ * is the need callback, which is mandatory if the user wants to avoid useless
+ * memory allocation at boot time. The other is the optional init callback,
+ * which is used to do proper initialization after memory is allocated.
+ *
+ * The need callback is used to decide whether extended memory allocation is
+ * needed or not. Sometimes users want to deactivate some features in this
+ * boot and the extra memory would be unnecessary. In this case, to avoid
+ * allocating a huge chunk of memory, each client expresses its need for
+ * extra memory through the need callback. If one of the need callbacks
+ * returns true, it means that someone needs extra memory, so the
+ * page extension core should allocate memory for page extension. If
+ * none of the need callbacks return true, memory isn't needed at all in this
+ * boot and the page extension core can skip allocating memory. As a result,
+ * no memory is wasted.
+ *
+ * The init callback is used to do proper initialization after page extension
+ * is completely initialized. In a sparse memory system, the extra memory is
+ * allocated some time later than memmap is allocated. In other words, the
+ * lifetime of the memory for page extension isn't the same as that of memmap
+ * for struct page. Therefore, clients can't store extra data until page
+ * extension is initialized, even if pages are allocated and used freely. This
+ * could cause an inadequate state of the extra data per page, so, to prevent
+ * it, a client can utilize this callback to initialize it correctly.
+ */
+
+static struct page_ext_operations *page_ext_ops[] = {
+};
+
+static unsigned long total_usage;
+
+static bool __init invoke_need_callbacks(void)
+{
+ int i;
+ int entries = ARRAY_SIZE(page_ext_ops);
+
+ for (i = 0; i < entries; i++) {
+ if (page_ext_ops[i]->need && page_ext_ops[i]->need())
+ return true;
+ }
+
+ return false;
+}
+
+static void __init invoke_init_callbacks(void)
+{
+ int i;
+ int entries = sizeof(page_ext_ops) / sizeof(page_ext_ops[0]);
+
+ for (i = 0; i < entries; i++) {
+ if (page_ext_ops[i]->init)
+ page_ext_ops[i]->init();
+ }
+}
+
+#if !defined(CONFIG_SPARSEMEM)
+
+
+void __meminit pgdat_page_ext_init(struct pglist_data *pgdat)
+{
+ pgdat->node_page_ext = NULL;
+}
+
+struct page_ext *lookup_page_ext(struct page *page)
+{
+ unsigned long pfn = page_to_pfn(page);
+ unsigned long offset;
+ struct page_ext *base;
+
+ base = NODE_DATA(page_to_nid(page))->node_page_ext;
+#ifdef CONFIG_DEBUG_VM
+ /*
+ * The sanity checks the page allocator does upon freeing a
+ * page can reach here before the page_ext arrays are
+ * allocated when feeding a range of pages to the allocator
+ * for the first time during bootup or memory hotplug.
+ */
+ if (unlikely(!base))
+ return NULL;
+#endif
+ offset = pfn - NODE_DATA(page_to_nid(page))->node_start_pfn;
+ return base + offset;
+}
+
+static int __init alloc_node_page_ext(int nid)
+{
+ struct page_ext *base;
+ unsigned long table_size;
+ unsigned long nr_pages;
+
+ nr_pages = NODE_DATA(nid)->node_spanned_pages;
+ if (!nr_pages)
+ return 0;
+
+ table_size = sizeof(struct page_ext) * nr_pages;
+
+ base = memblock_virt_alloc_try_nid_nopanic(
+ table_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
+ BOOTMEM_ALLOC_ACCESSIBLE, nid);
+ if (!base)
+ return -ENOMEM;
+ NODE_DATA(nid)->node_page_ext = base;
+ total_usage += table_size;
+ return 0;
+}
+
+void __init page_ext_init_flatmem(void)
+{
+
+ int nid, fail;
+
+	if (!invoke_need_callbacks())
+ return;
+
+ for_each_online_node(nid) {
+ fail = alloc_node_page_ext(nid);
+ if (fail)
+ goto fail;
+ }
+ pr_info("allocated %ld bytes of page_ext\n", total_usage);
+ invoke_init_callbacks();
+ return;
+
+fail:
+ pr_crit("allocation of page_ext failed.\n");
+ panic("Out of memory");
+}
+
+#else /* CONFIG_SPARSEMEM */
+
+struct page_ext *lookup_page_ext(struct page *page)
+{
+ unsigned long pfn = page_to_pfn(page);
+ struct mem_section *section = __pfn_to_section(pfn);
+#ifdef CONFIG_DEBUG_VM
+ /*
+ * The sanity checks the page allocator does upon freeing a
+ * page can reach here before the page_ext arrays are
+ * allocated when feeding a range of pages to the allocator
+ * for the first time during bootup or memory hotplug.
+ */
+ if (!section->page_ext)
+ return NULL;
+#endif
+ return section->page_ext + pfn;
+}
+
+static void *__meminit alloc_page_ext(size_t size, int nid)
+{
+ gfp_t flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN;
+ void *addr = NULL;
+
+ addr = alloc_pages_exact_nid(nid, size, flags);
+ if (addr) {
+ kmemleak_alloc(addr, size, 1, flags);
+ return addr;
+ }
+
+ if (node_state(nid, N_HIGH_MEMORY))
+ addr = vzalloc_node(size, nid);
+ else
+ addr = vzalloc(size);
+
+ return addr;
+}
+
+static int __meminit init_section_page_ext(unsigned long pfn, int nid)
+{
+ struct mem_section *section;
+ struct page_ext *base;
+ unsigned long table_size;
+
+ section = __pfn_to_section(pfn);
+
+ if (section->page_ext)
+ return 0;
+
+ table_size = sizeof(struct page_ext) * PAGES_PER_SECTION;
+ base = alloc_page_ext(table_size, nid);
+
+ /*
+ * The value stored in section->page_ext is (base - pfn)
+ * and it does not point to the memory block allocated above,
+ * causing kmemleak false positives.
+ */
+ kmemleak_not_leak(base);
+
+ if (!base) {
+ pr_err("page ext allocation failure\n");
+ return -ENOMEM;
+ }
+
+ /*
+ * The passed "pfn" may not be aligned to SECTION. For the calculation
+ * we need to apply a mask.
+ */
+ pfn &= PAGE_SECTION_MASK;
+ section->page_ext = base - pfn;
+ total_usage += table_size;
+ return 0;
+}
+#ifdef CONFIG_MEMORY_HOTPLUG
+static void free_page_ext(void *addr)
+{
+ if (is_vmalloc_addr(addr)) {
+ vfree(addr);
+ } else {
+ struct page *page = virt_to_page(addr);
+ size_t table_size =
+ sizeof(struct page_ext) * PAGES_PER_SECTION;
+
+ BUG_ON(PageReserved(page));
+ free_pages_exact(addr, table_size);
+ }
+}
+
+static void __free_page_ext(unsigned long pfn)
+{
+ struct mem_section *ms;
+ struct page_ext *base;
+
+ ms = __pfn_to_section(pfn);
+ if (!ms || !ms->page_ext)
+ return;
+ base = ms->page_ext + pfn;
+ free_page_ext(base);
+ ms->page_ext = NULL;
+}
+
+static int __meminit online_page_ext(unsigned long start_pfn,
+ unsigned long nr_pages,
+ int nid)
+{
+ unsigned long start, end, pfn;
+ int fail = 0;
+
+ start = SECTION_ALIGN_DOWN(start_pfn);
+ end = SECTION_ALIGN_UP(start_pfn + nr_pages);
+
+ if (nid == -1) {
+ /*
+ * In this case, "nid" already exists and contains valid memory.
+ * "start_pfn" passed to us is a pfn which is an arg for
+ * online__pages(), and start_pfn should exist.
+ */
+ nid = pfn_to_nid(start_pfn);
+ VM_BUG_ON(!node_state(nid, N_ONLINE));
+ }
+
+ for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) {
+ if (!pfn_present(pfn))
+ continue;
+ fail = init_section_page_ext(pfn, nid);
+ }
+ if (!fail)
+ return 0;
+
+ /* rollback */
+ for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
+ __free_page_ext(pfn);
+
+ return -ENOMEM;
+}
+
+static int __meminit offline_page_ext(unsigned long start_pfn,
+ unsigned long nr_pages, int nid)
+{
+ unsigned long start, end, pfn;
+
+ start = SECTION_ALIGN_DOWN(start_pfn);
+ end = SECTION_ALIGN_UP(start_pfn + nr_pages);
+
+ for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
+ __free_page_ext(pfn);
+ return 0;
+
+}
+
+static int __meminit page_ext_callback(struct notifier_block *self,
+ unsigned long action, void *arg)
+{
+ struct memory_notify *mn = arg;
+ int ret = 0;
+
+ switch (action) {
+ case MEM_GOING_ONLINE:
+ ret = online_page_ext(mn->start_pfn,
+ mn->nr_pages, mn->status_change_nid);
+ break;
+ case MEM_OFFLINE:
+ offline_page_ext(mn->start_pfn,
+ mn->nr_pages, mn->status_change_nid);
+ break;
+ case MEM_CANCEL_ONLINE:
+ offline_page_ext(mn->start_pfn,
+ mn->nr_pages, mn->status_change_nid);
+ break;
+ case MEM_GOING_OFFLINE:
+ break;
+ case MEM_ONLINE:
+ case MEM_CANCEL_OFFLINE:
+ break;
+ }
+
+ return notifier_from_errno(ret);
+}
+
+#endif
+
+void __init page_ext_init(void)
+{
+ unsigned long pfn;
+ int nid;
+
+ if (!invoke_need_callbacks())
+ return;
+
+ for_each_node_state(nid, N_MEMORY) {
+ unsigned long start_pfn, end_pfn;
+
+ start_pfn = node_start_pfn(nid);
+ end_pfn = node_end_pfn(nid);
+ /*
+ * start_pfn and end_pfn may not be aligned to SECTION and the
+ * page->flags of out of node pages are not initialized. So we
+ * scan [start_pfn, the biggest section's pfn < end_pfn) here.
+ */
+ for (pfn = start_pfn; pfn < end_pfn;
+ pfn = ALIGN(pfn + 1, PAGES_PER_SECTION)) {
+
+ if (!pfn_valid(pfn))
+ continue;
+ /*
+ * Nodes's pfns can be overlapping.
+ * We know some arch can have a nodes layout such as
+ * -------------pfn-------------->
+ * N0 | N1 | N2 | N0 | N1 | N2|....
+ */
+ if (pfn_to_nid(pfn) != nid)
+ continue;
+ if (init_section_page_ext(pfn, nid))
+ goto oom;
+ }
+ }
+ hotplug_memory_notifier(page_ext_callback, 0);
+ pr_info("allocated %ld bytes of page_ext\n", total_usage);
+ invoke_init_callbacks();
+ return;
+
+oom:
+ panic("Out of memory");
+}
+
+void __meminit pgdat_page_ext_init(struct pglist_data *pgdat)
+{
+}
+
+#endif
+
--
1.7.9.5
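For illustration, a minimal sketch of a hypothetical client of this interface
follows, assuming only the two callbacks (->need and ->init) used by
invoke_need_callbacks()/invoke_init_callbacks() above; the "my_debug" names
and the boot parameter are made up and merely stand in for a real debug
feature such as page_owner:

static bool my_debug_requested __initdata;

/* Hypothetical boot parameter that requests the extra per-page data. */
static int __init setup_my_debug(char *str)
{
	my_debug_requested = true;
	return 1;
}
__setup("my_debug", setup_my_debug);

/* need callback: decide at boot whether the extended memory is wanted. */
static bool __init need_my_debug(void)
{
	return my_debug_requested;
}

/* init callback: runs once page_ext is completely initialized. */
static void __init init_my_debug(void)
{
	/* e.g. fix up pages that were allocated before page_ext was ready */
}

struct page_ext_operations my_debug_ops = {
	.need = need_my_debug,
	.init = init_my_debug,
};

/*
 * The client would then be listed in page_ext_ops[] in mm/page_ext.c:
 *
 *	static struct page_ext_operations *page_ext_ops[] = {
 *		&my_debug_ops,
 *	};
 */

If no listed client returns true from its need callback, page_ext_init() and
page_ext_init_flatmem() return early and no extra memory is allocated for
that boot.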
On Fri, 21 Nov 2014 17:14:00 +0900 Joonsoo Kim <[email protected]> wrote:
> When we debug something, we'd like to insert some information to
> every page. For this purpose, we sometimes modify struct page itself.
> But, this has drawbacks. First, it requires re-compile. This makes us
> hesitate to use the powerful debug feature so development process is
> slowed down. And, second, sometimes it is impossible to rebuild the kernel
> due to third party module dependency. At third, system behaviour would be
> largely different after re-compile, because it changes size of struct
> page greatly and this structure is accessed by every part of kernel.
> Keeping this as it is would be better to reproduce errornous situation.
>
> This feature is intended to overcome above mentioned problems. This feature
> allocates memory for extended data per page in certain place rather than
> the struct page itself. This memory can be accessed by the accessor
> functions provided by this code. During the boot process, it checks whether
> allocation of huge chunk of memory is needed or not. If not, it avoids
> allocating memory at all. With this advantage, we can include this feature
> into the kernel in default and can avoid rebuild and solve related problems.
>
> Until now, memcg uses this technique. But, now, memcg decides to embed
> their variable to struct page itself and it's code to extend struct page
> has been removed. I'd like to use this code to develop debug feature,
> so this patch resurrect it.
>
> To help these things to work well, this patch introduces two callbacks
> for clients. One is the need callback which is mandatory if user wants
> to avoid useless memory allocation at boot-time. The other is optional,
> init callback, which is used to do proper initialization after memory
> is allocated. Detailed explanation about purpose of these functions is
> in code comment. Please refer it.
>
> Others are completely same with previous extension code in memcg.
>
> ...
>
> +static bool __init invoke_need_callbacks(void)
> +{
> + int i;
> + int entries = ARRAY_SIZE(page_ext_ops);
> +
> + for (i = 0; i < entries; i++) {
> + if (page_ext_ops[i]->need && page_ext_ops[i]->need())
> + return true;
> + }
> +
> + return false;
> +}
> +
> +static void __init invoke_init_callbacks(void)
> +{
> + int i;
> + int entries = sizeof(page_ext_ops) / sizeof(page_ext_ops[0]);
ARRAY_SIZE()
> + for (i = 0; i < entries; i++) {
> + if (page_ext_ops[i]->init)
> + page_ext_ops[i]->init();
> + }
> +}
> +
>
> ...
>
> +void __init page_ext_init_flatmem(void)
> +{
> +
> + int nid, fail;
> +
> +	if (!invoke_need_callbacks())
> + return;
> +
> + for_each_online_node(nid) {
> + fail = alloc_node_page_ext(nid);
> + if (fail)
> + goto fail;
> + }
> + pr_info("allocated %ld bytes of page_ext\n", total_usage);
> + invoke_init_callbacks();
> + return;
> +
> +fail:
> + pr_crit("allocation of page_ext failed.\n");
> + panic("Out of memory");
Did we really need to panic the machine? The situation should be
pretty easily recoverable by disabling the clients. I guess it's OK as
long as page_ext is being used for kernel developer debug things.
> +}
> +
We'll need this to fix the build. I'll queue it up.
From: Andrew Morton <[email protected]>
Subject: include/linux/kmemleak.h: needs slab.h
include/linux/kmemleak.h: In function 'kmemleak_alloc_recursive':
include/linux/kmemleak.h:43: error: 'SLAB_NOLEAKTRACE' undeclared (first use in this function)
--- a/include/linux/kmemleak.h~include-linux-kmemleakh-needs-slabh
+++ a/include/linux/kmemleak.h
@@ -21,6 +21,8 @@
#ifndef __KMEMLEAK_H
#define __KMEMLEAK_H
+#include <linux/slab.h>
+
#ifdef CONFIG_DEBUG_KMEMLEAK
extern void kmemleak_init(void) __ref;
And here are a couple of tweaks for this patch:
From: Andrew Morton <[email protected]>
Subject: mm-page_ext-resurrect-struct-page-extending-code-for-debugging-fix
use ARRAY_SIZE, clean up 80-col tricks
--- a/mm/page_ext.c~mm-page_ext-resurrect-struct-page-extending-code-for-debugging-fix
+++ a/mm/page_ext.c
@@ -71,7 +71,7 @@ static bool __init invoke_need_callbacks
static void __init invoke_init_callbacks(void)
{
int i;
- int entries = sizeof(page_ext_ops) / sizeof(page_ext_ops[0]);
+ int entries = ARRAY_SIZE(page_ext_ops);
for (i = 0; i < entries; i++) {
if (page_ext_ops[i]->init)
@@ -81,7 +81,6 @@ static void __init invoke_init_callbacks
#if !defined(CONFIG_SPARSEMEM)
-
void __meminit pgdat_page_ext_init(struct pglist_data *pgdat)
{
pgdat->node_page_ext = NULL;
@@ -232,8 +231,9 @@ static void free_page_ext(void *addr)
vfree(addr);
} else {
struct page *page = virt_to_page(addr);
- size_t table_size =
- sizeof(struct page_ext) * PAGES_PER_SECTION;
+ size_t table_size;
+
+ table_size = sizeof(struct page_ext) * PAGES_PER_SECTION;
BUG_ON(PageReserved(page));
free_pages_exact(addr, table_size);
On Fri, 21 Nov 2014 17:14:04 +0900 Joonsoo Kim <[email protected]> wrote:
> Current stacktrace only have the function for console output.
> page_owner that will be introduced in following patch needs to print
> the output of stacktrace into the buffer for our own output format
> so a new function, snprint_stack_trace(), is needed.
>
> ...
>
> --- a/include/linux/stacktrace.h
> +++ b/include/linux/stacktrace.h
> @@ -20,6 +20,8 @@ extern void save_stack_trace_tsk(struct task_struct *tsk,
> struct stack_trace *trace);
>
> extern void print_stack_trace(struct stack_trace *trace, int spaces);
> +extern int snprint_stack_trace(char *buf, int buf_len,
> + struct stack_trace *trace, int spaces);
>
> #ifdef CONFIG_USER_STACKTRACE_SUPPORT
> extern void save_stack_trace_user(struct stack_trace *trace);
> @@ -32,6 +34,7 @@ extern void save_stack_trace_user(struct stack_trace *trace);
> # define save_stack_trace_tsk(tsk, trace) do { } while (0)
> # define save_stack_trace_user(trace) do { } while (0)
> # define print_stack_trace(trace, spaces) do { } while (0)
> +# define snprint_stack_trace(buf, len, trace, spaces) do { } while (0)
Doing this with macros instead of C functions is pretty crappy - it
defeats typechecking and can lead to unused-var warnings when the
feature is disabled.
Fixing this might not be practical if struct stack_trace isn't
available, dunno.
> --- a/kernel/stacktrace.c
> +++ b/kernel/stacktrace.c
> @@ -25,6 +25,30 @@ void print_stack_trace(struct stack_trace *trace, int spaces)
> }
> EXPORT_SYMBOL_GPL(print_stack_trace);
>
> +int snprint_stack_trace(char *buf, int buf_len, struct stack_trace *trace,
> + int spaces)
> +{
> + int i, printed;
> + unsigned long ip;
> + int ret = 0;
> +
> + if (WARN_ON(!trace->entries))
> + return 0;
> +
> + for (i = 0; i < trace->nr_entries && buf_len; i++) {
> + ip = trace->entries[i];
> + printed = snprintf(buf, buf_len, "%*c[<%p>] %pS\n",
> + 1 + spaces, ' ', (void *) ip, (void *) ip);
> +
> + buf_len -= printed;
> + ret += printed;
> + buf += printed;
> + }
> +
> + return ret;
> +}
I'm not liking this much. The behaviour when the output buffer is too
small is scary. snprintf() will return "the number of characters which
would be generated for the given input", so local variable `buf_len'
will go negative and we pass a negative int into snprintf()'s `size_t
size'. snprintf() says "goody, lots and lots of buffer!" and your
machine crashes.
buf_len should be a size_t and snprint_stack_trace() will need to be
changed to handle this.
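One way the loop could be hardened along those lines is sketched below, with
buf_len made a size_t and the return value following the usual snprintf
convention (the total length that would have been generated); this is only an
illustration of the idea, not the actual follow-up patch:

int snprint_stack_trace(char *buf, size_t size,
			struct stack_trace *trace, int spaces)
{
	int i;
	int generated;
	int total = 0;

	if (WARN_ON(!trace->entries))
		return 0;

	for (i = 0; i < trace->nr_entries; i++) {
		generated = snprintf(buf, size, "%*c[<%p>] %pS\n",
				     1 + spaces, ' ',
				     (void *)trace->entries[i],
				     (void *)trace->entries[i]);

		total += generated;

		/* Assume that "generated" is never negative. */
		if (generated >= size) {
			/* Buffer exhausted: keep counting, stop writing. */
			buf += size;
			size = 0;
		} else {
			buf += generated;
			size -= generated;
		}
	}

	return total;
}

A caller can then detect truncation by comparing the return value with the
size of the buffer it passed in.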
On Fri, 21 Nov 2014 17:14:05 +0900 Joonsoo Kim <[email protected]> wrote:
> This is the page owner tracking code which is introduced
> so far ago. It is resident on Andrew's tree, though, nobody
> tried to upstream so it remain as is. Our company uses this feature
> actively to debug memory leak or to find a memory hogger so
> I decide to upstream this feature.
>
> This functionality help us to know who allocates the page.
> When allocating a page, we store some information about
> allocation in extra memory. Later, if we need to know
> status of all pages, we can get and analyze it from this stored
> information.
>
> In previous version of this feature, extra memory is statically defined
> in struct page, but, in this version, extra memory is allocated outside
> of struct page. It enables us to turn on/off this feature at boottime
> without considerable memory waste.
>
> Although we already have tracepoint for tracing page allocation/free,
> using it to analyze page owner is rather complex. We need to enlarge
> the trace buffer for preventing overlapping until userspace program
> launched. And, launched program continually dump out the trace buffer
> for later analysis and it would change system behaviour with more
> possibility rather than just keeping it in memory, so bad for debug.
>
> Moreover, we can use page_owner feature further for various purposes.
> For example, we can use it for fragmentation statistics implemented in
> this patch. And, I also plan to implement some CMA failure debugging
> feature using this interface.
>
> I'd like to give the credit for all developers contributed this feature,
> but, it's not easy because I don't know exact history. Sorry about that.
> Below is people who has "Signed-off-by" in the patches in Andrew's tree.
>
> ...
>
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -884,6 +884,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> MTRR settings. This parameter disables that behavior,
> possibly causing your machine to run very slowly.
>
> + disable_page_owner
> + [KNL] Disable to store the information who requests
> + the page.
How about "Disable storage of the information about who allocated each
page".
It seems odd that we have a disable flag. Wouldn't it be less
surprising to disable it by default and only enable if the boot option
is provided?
What is the overhead of page_owner if it is runtime-disabled, btw?
Will it be feasible for lots of people to just leave it enabled in
config and to only turn it on when they want to use it? That would be
nice. Please add a paragraph on this point to the changelog and the
yet-to-be-written documentation.
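For comparison, an opt-in scheme along those lines could look like the sketch
below; the "page_owner=on" parameter name and the identifiers are hypothetical
and are only meant to make the default-off behaviour concrete:

static bool page_owner_requested __initdata;

static int __init early_page_owner_param(char *buf)
{
	if (!buf)
		return -EINVAL;

	if (strcmp(buf, "on") == 0)
		page_owner_requested = true;

	return 0;
}
early_param("page_owner", early_page_owner_param);

/* page_ext need callback: stays off unless explicitly requested at boot. */
static bool __init need_page_owner(void)
{
	return page_owner_requested;
}

With a default-off need callback like this, leaving the feature compiled in
costs no extra memory at boot unless the option is actually given.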
On Fri, 21 Nov 2014 17:14:06 +0900 Joonsoo Kim <[email protected]> wrote:
> Extended memory to store page owner information is initialized some time
> later than that page allocator starts. Until initialization, many pages
> can be allocated and they have no owner information. This make debugging
> using page owner harder, so some fixup will be helpful.
>
> This patch fix up this situation by setting fake owner information
> immediately after page extension is initialized. Information doesn't
> tell the right owner, but, at least, it can tell whether page is
> allocated or not, more correctly.
>
> On my testing, this patch catches 13343 early allocated pages, although
> they are mostly allocated from page extension feature. Anyway, after then,
> there is no page left that it is allocated and has no page owner flag.
We really should have a Documentation/vm/page_owner.txt which explains
all this stuff, provides examples, etc.
On Fri, Nov 21, 2014 at 03:37:31PM -0800, Andrew Morton wrote:
> On Fri, 21 Nov 2014 17:14:00 +0900 Joonsoo Kim <[email protected]> wrote:
>
> > When we debug something, we'd like to insert some information to
> > every page. For this purpose, we sometimes modify struct page itself.
> > But, this has drawbacks. First, it requires re-compile. This makes us
> > hesitate to use the powerful debug feature so development process is
> > slowed down. And, second, sometimes it is impossible to rebuild the kernel
> > due to third party module dependency. At third, system behaviour would be
> > largely different after re-compile, because it changes size of struct
> > page greatly and this structure is accessed by every part of kernel.
> > Keeping this as it is would be better to reproduce errornous situation.
> >
> > This feature is intended to overcome above mentioned problems. This feature
> > allocates memory for extended data per page in certain place rather than
> > the struct page itself. This memory can be accessed by the accessor
> > functions provided by this code. During the boot process, it checks whether
> > allocation of huge chunk of memory is needed or not. If not, it avoids
> > allocating memory at all. With this advantage, we can include this feature
> > into the kernel in default and can avoid rebuild and solve related problems.
> >
> > Until now, memcg uses this technique. But, now, memcg decides to embed
> > their variable to struct page itself and it's code to extend struct page
> > has been removed. I'd like to use this code to develop debug feature,
> > so this patch resurrect it.
> >
> > To help these things to work well, this patch introduces two callbacks
> > for clients. One is the need callback which is mandatory if user wants
> > to avoid useless memory allocation at boot-time. The other is optional,
> > init callback, which is used to do proper initialization after memory
> > is allocated. Detailed explanation about purpose of these functions is
> > in code comment. Please refer it.
> >
> > Others are completely same with previous extension code in memcg.
> >
> > ...
> >
> > +static bool __init invoke_need_callbacks(void)
> > +{
> > + int i;
> > + int entries = ARRAY_SIZE(page_ext_ops);
> > +
> > + for (i = 0; i < entries; i++) {
> > + if (page_ext_ops[i]->need && page_ext_ops[i]->need())
> > + return true;
> > + }
> > +
> > + return false;
> > +}
> > +
> > +static void __init invoke_init_callbacks(void)
> > +{
> > + int i;
> > + int entries = sizeof(page_ext_ops) / sizeof(page_ext_ops[0]);
>
> ARRAY_SIZE()
Oops... Sorry. I will fix it.
>
> > + for (i = 0; i < entries; i++) {
> > + if (page_ext_ops[i]->init)
> > + page_ext_ops[i]->init();
> > + }
> > +}
> > +
> >
> > ...
> >
> > +void __init page_ext_init_flatmem(void)
> > +{
> > +
> > + int nid, fail;
> > +
> > +	if (!invoke_need_callbacks())
> > + return;
> > +
> > + for_each_online_node(nid) {
> > + fail = alloc_node_page_ext(nid);
> > + if (fail)
> > + goto fail;
> > + }
> > + pr_info("allocated %ld bytes of page_ext\n", total_usage);
> > + invoke_init_callbacks();
> > + return;
> > +
> > +fail:
> > + pr_crit("allocation of page_ext failed.\n");
> > + panic("Out of memory");
>
> Did we really need to panic the machine? The situation should be
> pretty easily recoverable by disabling the clients. I guess it's OK as
> long as page_ext is being used for kernel developer debug things.
I think that panic() would be better. If the feature is disabled silently
or with only some printk output, the user can't easily notice that situation
and will go ahead with the real debugging work. That would waste the user's
time, so panic() looks better to me.
> > +}
> > +
>
> We'll need this to fix the build. I'll queue it up.
Thank you!
>
>
> From: Andrew Morton <[email protected]>
> Subject: include/linux/kmemleak.h: needs slab.h
>
> include/linux/kmemleak.h: In function 'kmemleak_alloc_recursive':
> include/linux/kmemleak.h:43: error: 'SLAB_NOLEAKTRACE' undeclared (first use in this function)
>
> --- a/include/linux/kmemleak.h~include-linux-kmemleakh-needs-slabh
> +++ a/include/linux/kmemleak.h
> @@ -21,6 +21,8 @@
> #ifndef __KMEMLEAK_H
> #define __KMEMLEAK_H
>
> +#include <linux/slab.h>
> +
> #ifdef CONFIG_DEBUG_KMEMLEAK
>
> extern void kmemleak_init(void) __ref;
>
>
>
> And here are a couple of tweaks for this patch:
Okay. I will include the changes below in the next spin.
Thanks.
>
> From: Andrew Morton <[email protected]>
> Subject: mm-page_ext-resurrect-struct-page-extending-code-for-debugging-fix
>
> use ARRAY_SIZE, clean up 80-col tricks
>
> --- a/mm/page_ext.c~mm-page_ext-resurrect-struct-page-extending-code-for-debugging-fix
> +++ a/mm/page_ext.c
> @@ -71,7 +71,7 @@ static bool __init invoke_need_callbacks
> static void __init invoke_init_callbacks(void)
> {
> int i;
> - int entries = sizeof(page_ext_ops) / sizeof(page_ext_ops[0]);
> + int entries = ARRAY_SIZE(page_ext_ops);
>
> for (i = 0; i < entries; i++) {
> if (page_ext_ops[i]->init)
> @@ -81,7 +81,6 @@ static void __init invoke_init_callbacks
>
> #if !defined(CONFIG_SPARSEMEM)
>
> -
> void __meminit pgdat_page_ext_init(struct pglist_data *pgdat)
> {
> pgdat->node_page_ext = NULL;
> @@ -232,8 +231,9 @@ static void free_page_ext(void *addr)
> vfree(addr);
> } else {
> struct page *page = virt_to_page(addr);
> - size_t table_size =
> - sizeof(struct page_ext) * PAGES_PER_SECTION;
> + size_t table_size;
> +
> + table_size = sizeof(struct page_ext) * PAGES_PER_SECTION;
>
> BUG_ON(PageReserved(page));
> free_pages_exact(addr, table_size);
>
On Fri, Nov 21, 2014 at 03:37:59PM -0800, Andrew Morton wrote:
> On Fri, 21 Nov 2014 17:14:04 +0900 Joonsoo Kim <[email protected]> wrote:
>
> > Current stacktrace only have the function for console output.
> > page_owner that will be introduced in following patch needs to print
> > the output of stacktrace into the buffer for our own output format
> > so a new function, snprint_stack_trace(), is needed.
> >
> > ...
> >
> > --- a/include/linux/stacktrace.h
> > +++ b/include/linux/stacktrace.h
> > @@ -20,6 +20,8 @@ extern void save_stack_trace_tsk(struct task_struct *tsk,
> > struct stack_trace *trace);
> >
> > extern void print_stack_trace(struct stack_trace *trace, int spaces);
> > +extern int snprint_stack_trace(char *buf, int buf_len,
> > + struct stack_trace *trace, int spaces);
> >
> > #ifdef CONFIG_USER_STACKTRACE_SUPPORT
> > extern void save_stack_trace_user(struct stack_trace *trace);
> > @@ -32,6 +34,7 @@ extern void save_stack_trace_user(struct stack_trace *trace);
> > # define save_stack_trace_tsk(tsk, trace) do { } while (0)
> > # define save_stack_trace_user(trace) do { } while (0)
> > # define print_stack_trace(trace, spaces) do { } while (0)
> > +# define snprint_stack_trace(buf, len, trace, spaces) do { } while (0)
>
> Doing this with macros instead of C functions is pretty crappy - it
> defeats typechecking and can lead to unused-var warnings when the
> feature is disabled.
>
> Fixing this might not be practical if struct stack_trace isn't
> available, dunno.
struct stack_trace is defined only if CONFIG_STACKTRACE, and
most call sites seem to be compiled only if CONFIG_STACKTRACE.
I guess that removing all of them would work fine, but, dunno. :)
>
> > --- a/kernel/stacktrace.c
> > +++ b/kernel/stacktrace.c
> > @@ -25,6 +25,30 @@ void print_stack_trace(struct stack_trace *trace, int spaces)
> > }
> > EXPORT_SYMBOL_GPL(print_stack_trace);
> >
> > +int snprint_stack_trace(char *buf, int buf_len, struct stack_trace *trace,
> > + int spaces)
> > +{
> > + int i, printed;
> > + unsigned long ip;
> > + int ret = 0;
> > +
> > + if (WARN_ON(!trace->entries))
> > + return 0;
> > +
> > + for (i = 0; i < trace->nr_entries && buf_len; i++) {
> > + ip = trace->entries[i];
> > + printed = snprintf(buf, buf_len, "%*c[<%p>] %pS\n",
> > + 1 + spaces, ' ', (void *) ip, (void *) ip);
> > +
> > + buf_len -= printed;
> > + ret += printed;
> > + buf += printed;
> > + }
> > +
> > + return ret;
> > +}
>
> I'm not liking this much. The behaviour when the output buffer is too
> small is scary. snprintf() will return "the number of characters which
> would be generated for the given input", so local variable `buf_len'
> will go negative and we pass a negative int into snprintf()'s `size_t
> size'. snprintf() says "goody, lots and lots of buffer!" and your
> machine crashes.
>
> buf_len should be a size_t and snprint_stack_trace() will need to be
> changed to handle this.
Okay. I will fix the overflow problem. And the current implementation doesn't
comply with the snprint* functions' semantics, which return the generated
string length rather than the printed string length. I will fix that, too.
Thanks.
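If the macros are turned into stubs, a forward declaration should be enough
for the !CONFIG_STACKTRACE case, since the stubs only take a pointer to the
(then incomplete) type; a rough sketch of what the header could do under that
assumption, with the remaining save_stack_trace* helpers handled the same way:

#else	/* !CONFIG_STACKTRACE */

struct stack_trace;

static inline void save_stack_trace(struct stack_trace *trace) { }
static inline void print_stack_trace(struct stack_trace *trace, int spaces) { }
static inline int snprint_stack_trace(char *buf, size_t buf_len,
				      struct stack_trace *trace, int spaces)
{
	return 0;
}

#endif	/* CONFIG_STACKTRACE */

This keeps typechecking and avoids the unused-variable warnings that the
do-nothing macros can cause.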
On Fri, Nov 21, 2014 at 03:38:32PM -0800, Andrew Morton wrote:
> On Fri, 21 Nov 2014 17:14:05 +0900 Joonsoo Kim <[email protected]> wrote:
>
> > This is the page owner tracking code which is introduced
> > so far ago. It is resident on Andrew's tree, though, nobody
> > tried to upstream so it remain as is. Our company uses this feature
> > actively to debug memory leak or to find a memory hogger so
> > I decide to upstream this feature.
> >
> > This functionality help us to know who allocates the page.
> > When allocating a page, we store some information about
> > allocation in extra memory. Later, if we need to know
> > status of all pages, we can get and analyze it from this stored
> > information.
> >
> > In previous version of this feature, extra memory is statically defined
> > in struct page, but, in this version, extra memory is allocated outside
> > of struct page. It enables us to turn on/off this feature at boottime
> > without considerable memory waste.
> >
> > Although we already have tracepoint for tracing page allocation/free,
> > using it to analyze page owner is rather complex. We need to enlarge
> > the trace buffer for preventing overlapping until userspace program
> > launched. And, launched program continually dump out the trace buffer
> > for later analysis and it would change system behaviour with more
> > possibility rather than just keeping it in memory, so bad for debug.
> >
> > Moreover, we can use page_owner feature further for various purposes.
> > For example, we can use it for fragmentation statistics implemented in
> > this patch. And, I also plan to implement some CMA failure debugging
> > feature using this interface.
> >
> > I'd like to give the credit for all developers contributed this feature,
> > but, it's not easy because I don't know exact history. Sorry about that.
> > Below is people who has "Signed-off-by" in the patches in Andrew's tree.
> >
> > ...
> >
> > --- a/Documentation/kernel-parameters.txt
> > +++ b/Documentation/kernel-parameters.txt
> > @@ -884,6 +884,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> > MTRR settings. This parameter disables that behavior,
> > possibly causing your machine to run very slowly.
> >
> > + disable_page_owner
> > + [KNL] Disable to store the information who requests
> > + the page.
>
> How about "Disable storage of the information about who allocated each
> page".
>
> It seems odd that we have a disable flag. Wouldn't it be less
> surprising to disable it by default and only enable if the boot option
> is provided?
Okay. Will do.
>
> What is the overhead of page_owner if it is runtime-disabled, btw?
> Will it be feasible for lots of people to just leave it enabled in
> config and to only turn it on when they want to use it? That would be
> nice. Please add a paragraph on this point to the changelog and the
> yet-to-be-written documentation.
- Without page owner
text data bss dec hex filename
40662 1493 644 42799 a72f mm/page_alloc.o
- With page owner
text data bss dec hex filename
40892 1493 644 43029 a815 mm/page_alloc.o
1427 24 8 1459 5b3 mm/page_ext.o
2722 50 0 2772 ad4 mm/page_owner.o
Roughly 4KB of code is added in total. No additional runtime memory is needed
when it is runtime-disabled. page_alloc.o is about 230 bytes bigger than the
disabled one, per the sizes above.
Page owner adds two 'if' statements to the allocator hotpath and two 'if'
statements to the coldpath. When runtime-disabled, allocation performance is
not affected by these few unlikely branches.
Will write this to yet-to-be-written documentation.
Thanks.
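To make the hotpath cost concrete, the runtime-disabled guard can be as small
as the sketch below; page_owner_inited and __set_page_owner() are placeholder
names here, standing in for whatever the page_owner patch actually provides:

extern bool page_owner_inited;
extern void __set_page_owner(struct page *page, unsigned int order,
			     gfp_t gfp_mask);

/* Called from the page allocator hot path. */
static inline void set_page_owner(struct page *page, unsigned int order,
				  gfp_t gfp_mask)
{
	/*
	 * Runtime-disabled case: a single well-predicted branch and an
	 * immediate return; no page_ext lookup, no stack saving.
	 */
	if (likely(!page_owner_inited))
		return;

	__set_page_owner(page, order, gfp_mask);
}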
On Fri, Nov 21, 2014 at 03:38:41PM -0800, Andrew Morton wrote:
> On Fri, 21 Nov 2014 17:14:06 +0900 Joonsoo Kim <[email protected]> wrote:
>
> > Extended memory to store page owner information is initialized some time
> > later than that page allocator starts. Until initialization, many pages
> > can be allocated and they have no owner information. This make debugging
> > using page owner harder, so some fixup will be helpful.
> >
> > This patch fix up this situation by setting fake owner information
> > immediately after page extension is initialized. Information doesn't
> > tell the right owner, but, at least, it can tell whether page is
> > allocated or not, more correctly.
> >
> > On my testing, this patch catches 13343 early allocated pages, although
> > they are mostly allocated from page extension feature. Anyway, after then,
> > there is no page left that it is allocated and has no page owner flag.
>
> We really should have a Documentation/vm/page_owner.txt which explains
> all this stuff, provides examples, etc.
Okay. Will do in next spin.
Thanks.