2014-11-24 08:12:38

by Joonsoo Kim

Subject: [PATCH v3 0/8] Resurrect and use struct page extension for some debugging features

Major Changes from v2
* patch 5: fix a potential buffer overflow in snprint_stack_trace()
return the generated string length from snprint_stack_trace()
* patch 6: disabled by default; add a boot option to enable it
* patch 8 (new): documentation for page owner

Major Changes from v1
* patch 1: add an overall design description in a code comment
* patch 6: make page owner more accurate for alloc_pages_exact() and
compaction/CMA
* patch 7: handle early allocated pages for page owner

When we debug something, we'd like to attach some information to
every page. For this purpose, we sometimes modify struct page itself.
But this has drawbacks. First, it requires a re-compile, which makes us
hesitate to use a powerful debug feature, so the development process is
slowed down. Second, it is sometimes impossible to rebuild the kernel
due to third party module dependencies. Third, system behaviour can be
largely different after a re-compile, because it greatly changes the size
of struct page, and this structure is accessed by every part of the kernel.
Keeping struct page as it is makes it easier to reproduce the erroneous situation.

To overcome these drawbacks, we can store the extension of struct page in
another place rather than in struct page itself. Until now, memcg used this
technique, but memcg has since decided to embed its variable in struct page
itself, and its code to extend struct page has been removed. I'd like to use
this code to develop debugging features, so this series resurrects it.

With it resurrected, this patchset makes two debugging features boot-time
configurable. As mentioned above, requiring a rebuild has serious problems,
and making the features boot-time configurable mitigates them at marginal
computational overhead. One of the features, page_owner, isn't in mainline
yet, but it is really useful and has been in the mm tree for a long time.
I think it's time to upstream this feature.

Any comments are more than welcome.

This patchset is based on next-20141106 + my two patches related to
debug-pagealloc [1].

[1]: https://lkml.org/lkml/2014/11/7/78

Joonsoo Kim (8):
mm/page_ext: resurrect struct page extending code for debugging
mm/debug-pagealloc: prepare boottime configurable on/off
mm/debug-pagealloc: make debug-pagealloc boottime configurable
mm/nommu: use alloc_pages_exact() rather than its own implementation
stacktrace: introduce snprint_stack_trace for buffer output
mm/page_owner: keep track of page owners
mm/page_owner: correct owner information for early allocated pages
Documentation: add new page_owner document

Documentation/kernel-parameters.txt | 14 ++
Documentation/vm/page_owner.txt | 81 +++++++
arch/powerpc/mm/hash_utils_64.c | 2 +-
arch/powerpc/mm/pgtable_32.c | 2 +-
arch/s390/mm/pageattr.c | 2 +-
arch/sparc/mm/init_64.c | 2 +-
arch/x86/mm/pageattr.c | 2 +-
include/linux/mm.h | 36 +++-
include/linux/mm_types.h | 4 -
include/linux/mmzone.h | 12 ++
include/linux/page-debug-flags.h | 32 ---
include/linux/page_ext.h | 84 ++++++++
include/linux/page_owner.h | 38 ++++
include/linux/stacktrace.h | 5 +
init/main.c | 7 +
kernel/stacktrace.c | 32 +++
lib/Kconfig.debug | 16 ++
mm/Kconfig.debug | 10 +
mm/Makefile | 2 +
mm/debug-pagealloc.c | 45 +++-
mm/nommu.c | 33 +--
mm/page_alloc.c | 67 +++++-
mm/page_ext.c | 394 +++++++++++++++++++++++++++++++++++
mm/page_owner.c | 311 +++++++++++++++++++++++++++
mm/vmstat.c | 101 +++++++++
tools/vm/Makefile | 4 +-
tools/vm/page_owner_sort.c | 144 +++++++++++++
27 files changed, 1406 insertions(+), 76 deletions(-)
create mode 100644 Documentation/vm/page_owner.txt
delete mode 100644 include/linux/page-debug-flags.h
create mode 100644 include/linux/page_ext.h
create mode 100644 include/linux/page_owner.h
create mode 100644 mm/page_ext.c
create mode 100644 mm/page_owner.c
create mode 100644 tools/vm/page_owner_sort.c

--
1.7.9.5


2014-11-24 08:12:48

by Joonsoo Kim

Subject: [PATCH v3 2/8] mm/debug-pagealloc: prepare boottime configurable on/off

Until now, debug-pagealloc has needed extra flags in struct page, so we had
to recompile the whole kernel whenever we decided to use it. This is really
painful, because recompiling takes some time and sometimes a rebuild is not
possible due to third party modules that depend on struct page.
So we couldn't use this good feature in many cases.

Now we have the page extension feature, which allows us to keep extra
flags outside of struct page. This gets rid of the third party module
issue mentioned above, and it lets us decide at boot time whether extra
memory for the page extension is needed. With these properties, a kernel
built with CONFIG_DEBUG_PAGEALLOC can avoid using debug-pagealloc at boot
time with low computational overhead. This will help our development
process greatly.

This patch is the preparation step to achieve the above goal.
debug-pagealloc originally used an extra field of struct page; after this
patch, it will use a field of struct page_ext. Because the memory for
page_ext is allocated later than the page allocator is initialized under
CONFIG_SPARSEMEM, the debug-pagealloc feature must be disabled temporarily
until page_ext is initialized. This patch implements that.
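
As a rough illustration of the accessor pattern this change moves to (a
minimal sketch, not taken from the patch; the flag name page_poisoning_ready
is hypothetical), a client consults its boot-time state before dereferencing
page_ext, because the page_ext memory does not exist until after the page
allocator is up:

#include <linux/mm.h>
#include <linux/page_ext.h>

static bool page_poisoning_ready;       /* would be set from the ->init() callback */

static inline bool page_is_poisoned(struct page *page)
{
        struct page_ext *page_ext;

        if (!page_poisoning_ready)      /* page_ext not allocated yet */
                return false;

        page_ext = lookup_page_ext(page);
        return test_bit(PAGE_EXT_DEBUG_POISON, &page_ext->flags);
}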

v2: fix compile error on CONFIG_PAGE_POISONING

Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/mm.h | 19 ++++++++++++++++++-
include/linux/mm_types.h | 4 ----
include/linux/page-debug-flags.h | 32 --------------------------------
include/linux/page_ext.h | 15 +++++++++++++++
mm/Kconfig.debug | 1 +
mm/debug-pagealloc.c | 37 +++++++++++++++++++++++++++++++++----
mm/page_alloc.c | 38 +++++++++++++++++++++++++++++++++++---
mm/page_ext.c | 4 ++++
8 files changed, 106 insertions(+), 44 deletions(-)
delete mode 100644 include/linux/page-debug-flags.h

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b922a16..5a8d4d4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -19,6 +19,7 @@
#include <linux/bit_spinlock.h>
#include <linux/shrinker.h>
#include <linux/resource.h>
+#include <linux/page_ext.h>

struct mempolicy;
struct anon_vma;
@@ -2149,20 +2150,36 @@ extern void copy_user_huge_page(struct page *dst, struct page *src,
unsigned int pages_per_huge_page);
#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */

+extern struct page_ext_operations debug_guardpage_ops;
+extern struct page_ext_operations page_poisoning_ops;
+
#ifdef CONFIG_DEBUG_PAGEALLOC
extern unsigned int _debug_guardpage_minorder;
+extern bool _debug_guardpage_enabled;

static inline unsigned int debug_guardpage_minorder(void)
{
return _debug_guardpage_minorder;
}

+static inline bool debug_guardpage_enabled(void)
+{
+ return _debug_guardpage_enabled;
+}
+
static inline bool page_is_guard(struct page *page)
{
- return test_bit(PAGE_DEBUG_FLAG_GUARD, &page->debug_flags);
+ struct page_ext *page_ext;
+
+ if (!debug_guardpage_enabled())
+ return false;
+
+ page_ext = lookup_page_ext(page);
+ return test_bit(PAGE_EXT_DEBUG_GUARD, &page_ext->flags);
}
#else
static inline unsigned int debug_guardpage_minorder(void) { return 0; }
+static inline bool debug_guardpage_enabled(void) { return false; }
static inline bool page_is_guard(struct page *page) { return false; }
#endif /* CONFIG_DEBUG_PAGEALLOC */

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 33a8acf..c7b22e7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -10,7 +10,6 @@
#include <linux/rwsem.h>
#include <linux/completion.h>
#include <linux/cpumask.h>
-#include <linux/page-debug-flags.h>
#include <linux/uprobes.h>
#include <linux/page-flags-layout.h>
#include <asm/page.h>
@@ -186,9 +185,6 @@ struct page {
void *virtual; /* Kernel virtual address (NULL if
not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */
-#ifdef CONFIG_WANT_PAGE_DEBUG_FLAGS
- unsigned long debug_flags; /* Use atomic bitops on this */
-#endif

#ifdef CONFIG_KMEMCHECK
/*
diff --git a/include/linux/page-debug-flags.h b/include/linux/page-debug-flags.h
deleted file mode 100644
index 22691f61..0000000
--- a/include/linux/page-debug-flags.h
+++ /dev/null
@@ -1,32 +0,0 @@
-#ifndef LINUX_PAGE_DEBUG_FLAGS_H
-#define LINUX_PAGE_DEBUG_FLAGS_H
-
-/*
- * page->debug_flags bits:
- *
- * PAGE_DEBUG_FLAG_POISON is set for poisoned pages. This is used to
- * implement generic debug pagealloc feature. The pages are filled with
- * poison patterns and set this flag after free_pages(). The poisoned
- * pages are verified whether the patterns are not corrupted and clear
- * the flag before alloc_pages().
- */
-
-enum page_debug_flags {
- PAGE_DEBUG_FLAG_POISON, /* Page is poisoned */
- PAGE_DEBUG_FLAG_GUARD,
-};
-
-/*
- * Ensure that CONFIG_WANT_PAGE_DEBUG_FLAGS reliably
- * gets turned off when no debug features are enabling it!
- */
-
-#ifdef CONFIG_WANT_PAGE_DEBUG_FLAGS
-#if !defined(CONFIG_PAGE_POISONING) && \
- !defined(CONFIG_PAGE_GUARD) \
-/* && !defined(CONFIG_PAGE_DEBUG_SOMETHING_ELSE) && ... */
-#error WANT_PAGE_DEBUG_FLAGS is turned on with no debug features!
-#endif
-#endif /* CONFIG_WANT_PAGE_DEBUG_FLAGS */
-
-#endif /* LINUX_PAGE_DEBUG_FLAGS_H */
diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index 2ccc8b4..61c0f05 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -10,6 +10,21 @@ struct page_ext_operations {
#ifdef CONFIG_PAGE_EXTENSION

/*
+ * page_ext->flags bits:
+ *
+ * PAGE_EXT_DEBUG_POISON is set for poisoned pages. This is used to
+ * implement generic debug pagealloc feature. The pages are filled with
+ * poison patterns and set this flag after free_pages(). The poisoned
+ * pages are verified whether the patterns are not corrupted and clear
+ * the flag before alloc_pages().
+ */
+
+enum page_ext_flags {
+ PAGE_EXT_DEBUG_POISON, /* Page is poisoned */
+ PAGE_EXT_DEBUG_GUARD,
+};
+
+/*
* Page Extension can be considered as an extended mem_map.
* A page_ext page is associated with every page descriptor. The
* page_ext helps us add more information about the page.
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 1ba81c7..56badfc 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -12,6 +12,7 @@ config DEBUG_PAGEALLOC
depends on DEBUG_KERNEL
depends on !HIBERNATION || ARCH_SUPPORTS_DEBUG_PAGEALLOC && !PPC && !SPARC
depends on !KMEMCHECK
+ select PAGE_EXTENSION
select PAGE_POISONING if !ARCH_SUPPORTS_DEBUG_PAGEALLOC
select PAGE_GUARD if ARCH_SUPPORTS_DEBUG_PAGEALLOC
---help---
diff --git a/mm/debug-pagealloc.c b/mm/debug-pagealloc.c
index 789ff70..0072f2c 100644
--- a/mm/debug-pagealloc.c
+++ b/mm/debug-pagealloc.c
@@ -2,23 +2,49 @@
#include <linux/string.h>
#include <linux/mm.h>
#include <linux/highmem.h>
-#include <linux/page-debug-flags.h>
+#include <linux/page_ext.h>
#include <linux/poison.h>
#include <linux/ratelimit.h>

+static bool page_poisoning_enabled __read_mostly;
+
+static bool need_page_poisoning(void)
+{
+ return true;
+}
+
+static void init_page_poisoning(void)
+{
+ page_poisoning_enabled = true;
+}
+
+struct page_ext_operations page_poisoning_ops = {
+ .need = need_page_poisoning,
+ .init = init_page_poisoning,
+};
+
static inline void set_page_poison(struct page *page)
{
- __set_bit(PAGE_DEBUG_FLAG_POISON, &page->debug_flags);
+ struct page_ext *page_ext;
+
+ page_ext = lookup_page_ext(page);
+ __set_bit(PAGE_EXT_DEBUG_POISON, &page_ext->flags);
}

static inline void clear_page_poison(struct page *page)
{
- __clear_bit(PAGE_DEBUG_FLAG_POISON, &page->debug_flags);
+ struct page_ext *page_ext;
+
+ page_ext = lookup_page_ext(page);
+ __clear_bit(PAGE_EXT_DEBUG_POISON, &page_ext->flags);
}

static inline bool page_poison(struct page *page)
{
- return test_bit(PAGE_DEBUG_FLAG_POISON, &page->debug_flags);
+ struct page_ext *page_ext;
+
+ page_ext = lookup_page_ext(page);
+ return test_bit(PAGE_EXT_DEBUG_POISON, &page_ext->flags);
}

static void poison_page(struct page *page)
@@ -95,6 +121,9 @@ static void unpoison_pages(struct page *page, int n)

void kernel_map_pages(struct page *page, int numpages, int enable)
{
+ if (!page_poisoning_enabled)
+ return;
+
if (enable)
unpoison_pages(page, numpages);
else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c91f449..7534733 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -56,7 +56,7 @@
#include <linux/prefetch.h>
#include <linux/mm_inline.h>
#include <linux/migrate.h>
-#include <linux/page-debug-flags.h>
+#include <linux/page_ext.h>
#include <linux/hugetlb.h>
#include <linux/sched/rt.h>

@@ -426,6 +426,22 @@ static inline void prep_zero_page(struct page *page, unsigned int order,

#ifdef CONFIG_DEBUG_PAGEALLOC
unsigned int _debug_guardpage_minorder;
+bool _debug_guardpage_enabled __read_mostly;
+
+static bool need_debug_guardpage(void)
+{
+ return true;
+}
+
+static void init_debug_guardpage(void)
+{
+ _debug_guardpage_enabled = true;
+}
+
+struct page_ext_operations debug_guardpage_ops = {
+ .need = need_debug_guardpage,
+ .init = init_debug_guardpage,
+};

static int __init debug_guardpage_minorder_setup(char *buf)
{
@@ -444,7 +460,14 @@ __setup("debug_guardpage_minorder=", debug_guardpage_minorder_setup);
static inline void set_page_guard(struct zone *zone, struct page *page,
unsigned int order, int migratetype)
{
- __set_bit(PAGE_DEBUG_FLAG_GUARD, &page->debug_flags);
+ struct page_ext *page_ext;
+
+ if (!debug_guardpage_enabled())
+ return;
+
+ page_ext = lookup_page_ext(page);
+ __set_bit(PAGE_EXT_DEBUG_GUARD, &page_ext->flags);
+
INIT_LIST_HEAD(&page->lru);
set_page_private(page, order);
/* Guard pages are not available for any usage */
@@ -454,12 +477,20 @@ static inline void set_page_guard(struct zone *zone, struct page *page,
static inline void clear_page_guard(struct zone *zone, struct page *page,
unsigned int order, int migratetype)
{
- __clear_bit(PAGE_DEBUG_FLAG_GUARD, &page->debug_flags);
+ struct page_ext *page_ext;
+
+ if (!debug_guardpage_enabled())
+ return;
+
+ page_ext = lookup_page_ext(page);
+ __clear_bit(PAGE_EXT_DEBUG_GUARD, &page_ext->flags);
+
set_page_private(page, 0);
if (!is_migrate_isolate(migratetype))
__mod_zone_freepage_state(zone, (1 << order), migratetype);
}
#else
+struct page_ext_operations debug_guardpage_ops = { NULL, };
static inline void set_page_guard(struct zone *zone, struct page *page,
unsigned int order, int migratetype) {}
static inline void clear_page_guard(struct zone *zone, struct page *page,
@@ -870,6 +901,7 @@ static inline void expand(struct zone *zone, struct page *page,
VM_BUG_ON_PAGE(bad_range(zone, &page[size]), &page[size]);

if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC) &&
+ debug_guardpage_enabled() &&
high < debug_guardpage_minorder()) {
/*
* Mark as guard pages (or page), that will allow to
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 8b3a97a..ede4d1e 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -51,6 +51,10 @@
*/

static struct page_ext_operations *page_ext_ops[] = {
+ &debug_guardpage_ops,
+#ifdef CONFIG_PAGE_POISONING
+ &page_poisoning_ops,
+#endif
};

static unsigned long total_usage;
--
1.7.9.5

2014-11-24 08:12:47

by Joonsoo Kim

Subject: [PATCH v3 4/8] mm/nommu: use alloc_pages_exact() rather than its own implementation

do_mmap_private() in nommu.c tries to allocate physically contiguous pages
of arbitrary size in some cases, and we now have a good abstraction that
does exactly the same thing, alloc_pages_exact(). So, change the code to use it.

There is no functional change.
This is a preparation step for supporting the page owner feature accurately.
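
For reference, a minimal sketch (illustrative only, not part of the patch) of
why alloc_pages_exact() fits here: alloc_pages() always hands back a
power-of-two block that the caller must trim by hand, while
alloc_pages_exact() frees the unwanted tail pages internally and returns a
pointer sized to the request:

#include <linux/gfp.h>
#include <linux/mm.h>

static void *grab_five_pages(void)
{
        /* 5 pages requested; alloc_pages() would round up to order 3, i.e. 8 pages */
        void *buf = alloc_pages_exact(5 * PAGE_SIZE, GFP_KERNEL);

        if (!buf)
                return NULL;

        /* ... use buf; release later with free_pages_exact(buf, 5 * PAGE_SIZE) ... */
        return buf;
}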

Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/nommu.c | 33 +++++++++++----------------------
1 file changed, 11 insertions(+), 22 deletions(-)

diff --git a/mm/nommu.c b/mm/nommu.c
index 2266a34..1b87c17 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1149,8 +1149,7 @@ static int do_mmap_private(struct vm_area_struct *vma,
unsigned long len,
unsigned long capabilities)
{
- struct page *pages;
- unsigned long total, point, n;
+ unsigned long total, point;
void *base;
int ret, order;

@@ -1182,33 +1181,23 @@ static int do_mmap_private(struct vm_area_struct *vma,
order = get_order(len);
kdebug("alloc order %d for %lx", order, len);

- pages = alloc_pages(GFP_KERNEL, order);
- if (!pages)
- goto enomem;
-
total = 1 << order;
- atomic_long_add(total, &mmap_pages_allocated);
-
point = len >> PAGE_SHIFT;

- /* we allocated a power-of-2 sized page set, so we may want to trim off
- * the excess */
+ /* we don't want to allocate a power-of-2 sized page set */
if (sysctl_nr_trim_pages && total - point >= sysctl_nr_trim_pages) {
- while (total > point) {
- order = ilog2(total - point);
- n = 1 << order;
- kdebug("shave %lu/%lu @%lu", n, total - point, total);
- atomic_long_sub(n, &mmap_pages_allocated);
- total -= n;
- set_page_refcounted(pages + total);
- __free_pages(pages + total, order);
- }
+ total = point;
+ kdebug("try to alloc exact %lu pages", total);
+ base = alloc_pages_exact(len, GFP_KERNEL);
+ } else {
+ base = __get_free_pages(GFP_KERNEL, order);
}

- for (point = 1; point < total; point++)
- set_page_refcounted(&pages[point]);
+ if (!base)
+ goto enomem;
+
+ atomic_long_add(total, &mmap_pages_allocated);

- base = page_address(pages);
region->vm_flags = vma->vm_flags |= VM_MAPPED_COPY;
region->vm_start = (unsigned long) base;
region->vm_end = region->vm_start + len;
--
1.7.9.5

2014-11-24 08:12:45

by Joonsoo Kim

Subject: [PATCH v3 8/8] Documentation: add new page_owner document

page owner is for tracking who allocated each page.
This document explains what the page owner feature is and what its
merits are. A simple HOW-TO is also included. See the document
for detailed information.

Signed-off-by: Joonsoo Kim <[email protected]>
---
Documentation/vm/page_owner.txt | 81 +++++++++++++++++++++++++++++++++++++++
1 file changed, 81 insertions(+)
create mode 100644 Documentation/vm/page_owner.txt

diff --git a/Documentation/vm/page_owner.txt b/Documentation/vm/page_owner.txt
new file mode 100644
index 0000000..8f3ce9b
--- /dev/null
+++ b/Documentation/vm/page_owner.txt
@@ -0,0 +1,81 @@
+page owner: Tracking who allocated each page
+-----------------------------------------------------------
+
+* Introduction
+
+page owner is for tracking who allocated each page.
+It can be used to debug memory leaks or to find a memory hogger.
+When an allocation happens, information about the allocation, such as the
+call stack and the order of the pages, is stored for each page.
+When we need to know the status of all pages, we can get and analyze
+this information.
+
+Although we already have tracepoints for tracing page allocation/free,
+using them to analyze who allocated each page is rather complex. We need
+to enlarge the trace buffer to prevent it from being overwritten before the
+userspace program is launched, and that program must continually dump out
+the trace buffer for later analysis, which is more likely to change system
+behaviour than just keeping the information in memory, so it is bad for debugging.
+
+page owner can also be used for other purposes. For example, accurate
+fragmentation statistics can be obtained through the gfp flag information
+of each page. This is already implemented and activated if page owner is
+enabled. Other usages are more than welcome.
+
+page owner is disabled by default. So, if you'd like to use it, you need
+to add "page_owner=on" to your boot cmdline. If the kernel is built
+with page owner but page owner is disabled at runtime because the boot
+option wasn't given, the runtime overhead is marginal. When disabled at
+runtime it doesn't require memory to store owner information, so there is
+no runtime memory overhead. And page owner inserts just two unlikely
+branches into the page allocator hotpath; if they return false, allocation
+is done just as in a kernel without page owner, so these two unlikely
+branches should not affect allocation performance. The following is the
+kernel's code size change due to this facility.
+
+- Without page owner
+ text data bss dec hex filename
+ 40662 1493 644 42799 a72f mm/page_alloc.o
+
+- With page owner
+ text data bss dec hex filename
+ 40892 1493 644 43029 a815 mm/page_alloc.o
+ 1427 24 8 1459 5b3 mm/page_ext.o
+ 2722 50 0 2772 ad4 mm/page_owner.o
+
+Although roughly 4 KB of code is added in total, page_alloc.o increases by
+only 230 bytes and only half of that is in the hotpath. Building the kernel
+with page owner and turning it on only when needed is a great option for
+debugging kernel memory problems.
+
+There is one caveat caused by an implementation detail. page owner stores
+information in memory from the struct page extension. On sparse memory
+systems this memory is initialized some time after the page allocator
+starts, so until that initialization many pages can be allocated and
+they will have no owner information. To fix this up, these early allocated
+pages are investigated and marked as allocated during the initialization
+phase. Although this doesn't mean they have the right owner information,
+at least we can tell whether a page is allocated or not, more accurately.
+On a 2GB memory x86-64 VM box, 13343 early allocated pages are caught
+and marked this way, although they are mostly allocated by the struct
+page extension feature itself. After that, no page is left in an
+untracked state.
+
+* Usage
+
+1) Build user-space helper
+ cd tools/vm
+ make page_owner_sort
+
+2) Enable page owner
+ Add "page_owner=on" to boot cmdline.
+
+3) Do the job what you want to debug
+
+4) Analyze information from page owner
+ cat /sys/kernel/debug/page_owner > page_owner_full.txt
+ grep -v ^PFN page_owner_full.txt > page_owner.txt
+ ./page_owner_sort page_owner.txt sorted_page_owner.txt
+
+ See the result about who allocated each page
+ in the sorted_page_owner.txt.
--
1.7.9.5

2014-11-24 08:13:37

by Joonsoo Kim

Subject: [PATCH v3 5/8] stacktrace: introduce snprint_stack_trace for buffer output

Currently, stacktrace only has functions for console output.
page_owner, which will be introduced in a following patch, needs to print
the stack trace into a buffer in its own output format,
so a new function, snprint_stack_trace(), is needed.
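
A minimal usage sketch (illustrative, not taken from this series) of the new
helper, mirroring how page_owner uses it in a later patch:

#include <linux/kernel.h>
#include <linux/stacktrace.h>

static void show_current_stack(void)
{
        unsigned long entries[8];
        char buf[256];
        struct stack_trace trace = {
                .nr_entries     = 0,
                .max_entries    = ARRAY_SIZE(entries),
                .entries        = entries,
                .skip           = 0,
        };

        save_stack_trace(&trace);
        /* Returns the generated length, snprintf()-style, even when truncated. */
        snprint_stack_trace(buf, sizeof(buf), &trace, 0);
        pr_info("%s", buf);
}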

v3:
fix a potential buffer overflow case
make snprint_stack_trace() return the generated string length rather than
the printed string length, following the snprintf()-style semantics.

Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/stacktrace.h | 5 +++++
kernel/stacktrace.c | 32 ++++++++++++++++++++++++++++++++
2 files changed, 37 insertions(+)

diff --git a/include/linux/stacktrace.h b/include/linux/stacktrace.h
index 115b570..669045a 100644
--- a/include/linux/stacktrace.h
+++ b/include/linux/stacktrace.h
@@ -1,6 +1,8 @@
#ifndef __LINUX_STACKTRACE_H
#define __LINUX_STACKTRACE_H

+#include <linux/types.h>
+
struct task_struct;
struct pt_regs;

@@ -20,6 +22,8 @@ extern void save_stack_trace_tsk(struct task_struct *tsk,
struct stack_trace *trace);

extern void print_stack_trace(struct stack_trace *trace, int spaces);
+extern int snprint_stack_trace(char *buf, size_t size,
+ struct stack_trace *trace, int spaces);

#ifdef CONFIG_USER_STACKTRACE_SUPPORT
extern void save_stack_trace_user(struct stack_trace *trace);
@@ -32,6 +36,7 @@ extern void save_stack_trace_user(struct stack_trace *trace);
# define save_stack_trace_tsk(tsk, trace) do { } while (0)
# define save_stack_trace_user(trace) do { } while (0)
# define print_stack_trace(trace, spaces) do { } while (0)
+# define snprint_stack_trace(buf, size, trace, spaces) do { } while (0)
#endif

#endif
diff --git a/kernel/stacktrace.c b/kernel/stacktrace.c
index 00fe55c..b6e4c16 100644
--- a/kernel/stacktrace.c
+++ b/kernel/stacktrace.c
@@ -25,6 +25,38 @@ void print_stack_trace(struct stack_trace *trace, int spaces)
}
EXPORT_SYMBOL_GPL(print_stack_trace);

+int snprint_stack_trace(char *buf, size_t size,
+ struct stack_trace *trace, int spaces)
+{
+ int i;
+ unsigned long ip;
+ int generated;
+ int total = 0;
+
+ if (WARN_ON(!trace->entries))
+ return 0;
+
+ for (i = 0; i < trace->nr_entries; i++) {
+ ip = trace->entries[i];
+ generated = snprintf(buf, size, "%*c[<%p>] %pS\n",
+ 1 + spaces, ' ', (void *) ip, (void *) ip);
+
+ total += generated;
+
+ /* Assume that generated isn't a negative number */
+ if (generated >= size) {
+ buf += size;
+ size = 0;
+ } else {
+ buf += generated;
+ size -= generated;
+ }
+ }
+
+ return total;
+}
+EXPORT_SYMBOL_GPL(snprint_stack_trace);
+
/*
* Architectures that do not implement save_stack_trace_tsk or
* save_stack_trace_regs get this weak alias and a once-per-bootup warning
--
1.7.9.5

2014-11-24 08:13:34

by Joonsoo Kim

Subject: [PATCH v3 6/8] mm/page_owner: keep track of page owners

This is the page owner tracking code, which was introduced
long ago. It has resided in Andrew's tree, but nobody tried to
upstream it, so it has remained as is. Our company uses this feature
actively to debug memory leaks and to find memory hoggers, so
I decided to upstream this feature.

This functionality helps us to know who allocated each page.
When a page is allocated, we store some information about
the allocation in extra memory. Later, if we need to know the
status of all pages, we can get and analyze it from this stored
information.
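
As an illustrative sketch (not part of the patch; the helper name is
hypothetical), other kernel code could query that stored state roughly like
this once page owner is initialized:

#include <linux/mm.h>
#include <linux/page_ext.h>
#include <linux/page_owner.h>

static bool page_has_owner_info(struct page *page)
{
        struct page_ext *page_ext;

        if (!page_owner_inited)         /* boot option absent or not yet initialized */
                return false;

        page_ext = lookup_page_ext(page);
        return test_bit(PAGE_EXT_OWNER, &page_ext->flags);
}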

In the previous version of this feature, the extra memory was statically
defined in struct page, but, in this version, the extra memory is allocated
outside of struct page. This enables us to turn the feature on/off at boot
time without considerable memory waste.

Although we already have tracepoints for tracing page allocation/free,
using them to analyze page owners is rather complex. We need to enlarge
the trace buffer to prevent it from being overwritten before the userspace
program is launched, and that program must continually dump out the trace
buffer for later analysis, which is more likely to change system behaviour
than just keeping the information in memory, so it is bad for debugging.

Moreover, we can use the page_owner feature for further purposes.
For example, we can use it for the fragmentation statistics implemented in
this patch. And I also plan to implement some CMA failure debugging
features using this interface.
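
As a usage note (illustrative): once page owner is initialized, the
fragmentation statistics added here appear as an extra "Number of mixed
blocks" row per zone in the existing pagetypeinfo file, e.g.:

  cat /proc/pagetypeinfo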

I'd like to give credit to all the developers who contributed to this
feature, but that's not easy because I don't know the exact history. Sorry
about that. Below are the people who have "Signed-off-by" in the patches in
Andrew's tree.

Contributor:
Alexander Nyberg <[email protected]>
Mel Gorman <[email protected]>
Dave Hansen <[email protected]>
Minchan Kim <[email protected]>
Michal Nazarewicz <[email protected]>
Andrew Morton <[email protected]>
Jungsoo Son <[email protected]>

v3: Make the feature disabled by default.
Change the boot parameter from a disabling switch to an enabling one.
Inline the check of whether page owner is initialized or not
to minimize runtime overhead when disabled.
Fix an infinite loop condition in the fragmentation statistics.
Disable fragmentation statistics if page owner isn't initialized.

v2: Call set_page_owner() in more places than in v1. This corrects page
owner information for memory from alloc_pages_exact() and compaction/CMA.

Signed-off-by: Joonsoo Kim <[email protected]>
---
Documentation/kernel-parameters.txt | 6 +
include/linux/page_ext.h | 10 ++
include/linux/page_owner.h | 38 ++++++
lib/Kconfig.debug | 16 +++
mm/Makefile | 1 +
mm/page_alloc.c | 11 +-
mm/page_ext.c | 4 +
mm/page_owner.c | 222 +++++++++++++++++++++++++++++++++++
mm/vmstat.c | 101 ++++++++++++++++
tools/vm/Makefile | 4 +-
tools/vm/page_owner_sort.c | 144 +++++++++++++++++++++++
11 files changed, 554 insertions(+), 3 deletions(-)
create mode 100644 include/linux/page_owner.h
create mode 100644 mm/page_owner.c
create mode 100644 tools/vm/page_owner_sort.c

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index b5ac055..c8c4446 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2515,6 +2515,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
OSS [HW,OSS]
See Documentation/sound/oss/oss-parameters.txt

+ page_owner= [KNL] Boot-time page_owner enabling option.
+ Storage of the information about who allocated
+ each page is disabled by default. With this switch,
+ we can turn it on.
+ on: enable the feature
+
panic= [KNL] Kernel behaviour on panic: delay <timeout>
timeout > 0: seconds before rebooting
timeout = 0: wait forever
diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index 61c0f05..d2a2c84 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -1,6 +1,9 @@
#ifndef __LINUX_PAGE_EXT_H
#define __LINUX_PAGE_EXT_H

+#include <linux/types.h>
+#include <linux/stacktrace.h>
+
struct pglist_data;
struct page_ext_operations {
bool (*need)(void);
@@ -22,6 +25,7 @@ struct page_ext_operations {
enum page_ext_flags {
PAGE_EXT_DEBUG_POISON, /* Page is poisoned */
PAGE_EXT_DEBUG_GUARD,
+ PAGE_EXT_OWNER,
};

/*
@@ -33,6 +37,12 @@ enum page_ext_flags {
*/
struct page_ext {
unsigned long flags;
+#ifdef CONFIG_PAGE_OWNER
+ unsigned int order;
+ gfp_t gfp_mask;
+ struct stack_trace trace;
+ unsigned long trace_entries[8];
+#endif
};

extern void pgdat_page_ext_init(struct pglist_data *pgdat);
diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h
new file mode 100644
index 0000000..b48c347
--- /dev/null
+++ b/include/linux/page_owner.h
@@ -0,0 +1,38 @@
+#ifndef __LINUX_PAGE_OWNER_H
+#define __LINUX_PAGE_OWNER_H
+
+#ifdef CONFIG_PAGE_OWNER
+extern bool page_owner_inited;
+extern struct page_ext_operations page_owner_ops;
+
+extern void __reset_page_owner(struct page *page, unsigned int order);
+extern void __set_page_owner(struct page *page,
+ unsigned int order, gfp_t gfp_mask);
+
+static inline void reset_page_owner(struct page *page, unsigned int order)
+{
+ if (likely(!page_owner_inited))
+ return;
+
+ __reset_page_owner(page, order);
+}
+
+static inline void set_page_owner(struct page *page,
+ unsigned int order, gfp_t gfp_mask)
+{
+ if (likely(!page_owner_inited))
+ return;
+
+ __set_page_owner(page, order, gfp_mask);
+}
+#else
+static inline void reset_page_owner(struct page *page, unsigned int order)
+{
+}
+static inline void set_page_owner(struct page *page,
+ unsigned int order, gfp_t gfp_mask)
+{
+}
+
+#endif /* CONFIG_PAGE_OWNER */
+#endif /* __LINUX_PAGE_OWNER_H */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index c078a76..8864e90 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -227,6 +227,22 @@ config UNUSED_SYMBOLS
you really need it, and what the merge plan to the mainline kernel for
your module is.

+config PAGE_OWNER
+ bool "Track page owner"
+ depends on DEBUG_KERNEL && STACKTRACE_SUPPORT
+ select DEBUG_FS
+ select STACKTRACE
+ select PAGE_EXTENSION
+ help
+ This keeps track of which call chain is the owner of a page; it may
+ help to find bare alloc_page(s) leaks. Even if you include this
+ feature in your build, it is disabled by default. You should pass
+ "page_owner=on" as a boot parameter in order to enable it. It eats
+ a fair amount of memory if enabled. See tools/vm/page_owner_sort.c
+ for the user-space helper.
+
+ If unsure, say N.
+
config DEBUG_FS
bool "Debug Filesystem"
help
diff --git a/mm/Makefile b/mm/Makefile
index 0b7a784..3548460 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -63,6 +63,7 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
+obj-$(CONFIG_PAGE_OWNER) += page_owner.o
obj-$(CONFIG_CLEANCACHE) += cleancache.o
obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
obj-$(CONFIG_ZPOOL) += zpool.o
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4eea173..f1968d7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -59,6 +59,7 @@
#include <linux/page_ext.h>
#include <linux/hugetlb.h>
#include <linux/sched/rt.h>
+#include <linux/page_owner.h>

#include <asm/sections.h>
#include <asm/tlbflush.h>
@@ -810,6 +811,8 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
if (bad)
return false;

+ reset_page_owner(page, order);
+
if (!PageHighMem(page)) {
debug_check_no_locks_freed(page_address(page),
PAGE_SIZE << order);
@@ -985,6 +988,8 @@ static int prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags)
if (order && (gfp_flags & __GFP_COMP))
prep_compound_page(page, order);

+ set_page_owner(page, order, gfp_flags);
+
return 0;
}

@@ -1557,8 +1562,11 @@ void split_page(struct page *page, unsigned int order)
split_page(virt_to_page(page[0].shadow), order);
#endif

- for (i = 1; i < (1 << order); i++)
+ set_page_owner(page, 0, 0);
+ for (i = 1; i < (1 << order); i++) {
set_page_refcounted(page + i);
+ set_page_owner(page + i, 0, 0);
+ }
}
EXPORT_SYMBOL_GPL(split_page);

@@ -1598,6 +1606,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
}
}

+ set_page_owner(page, order, 0);
return 1UL << order;
}

diff --git a/mm/page_ext.c b/mm/page_ext.c
index ede4d1e..ce86485 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -5,6 +5,7 @@
#include <linux/memory.h>
#include <linux/vmalloc.h>
#include <linux/kmemleak.h>
+#include <linux/page_owner.h>

/*
* struct page extension
@@ -55,6 +56,9 @@ static struct page_ext_operations *page_ext_ops[] = {
#ifdef CONFIG_PAGE_POISONING
&page_poisoning_ops,
#endif
+#ifdef CONFIG_PAGE_OWNER
+ &page_owner_ops,
+#endif
};

static unsigned long total_usage;
diff --git a/mm/page_owner.c b/mm/page_owner.c
new file mode 100644
index 0000000..85eec7e
--- /dev/null
+++ b/mm/page_owner.c
@@ -0,0 +1,222 @@
+#include <linux/debugfs.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/bootmem.h>
+#include <linux/stacktrace.h>
+#include <linux/page_owner.h>
+#include "internal.h"
+
+static bool page_owner_disabled = true;
+bool page_owner_inited __read_mostly;
+
+static int early_page_owner_param(char *buf)
+{
+ if (!buf)
+ return -EINVAL;
+
+ if (strcmp(buf, "on") == 0)
+ page_owner_disabled = false;
+
+ return 0;
+}
+early_param("page_owner", early_page_owner_param);
+
+static bool need_page_owner(void)
+{
+ if (page_owner_disabled)
+ return false;
+
+ return true;
+}
+
+static void init_page_owner(void)
+{
+ if (page_owner_disabled)
+ return;
+
+ page_owner_inited = true;
+}
+
+struct page_ext_operations page_owner_ops = {
+ .need = need_page_owner,
+ .init = init_page_owner,
+};
+
+void __reset_page_owner(struct page *page, unsigned int order)
+{
+ int i;
+ struct page_ext *page_ext;
+
+ for (i = 0; i < (1 << order); i++) {
+ page_ext = lookup_page_ext(page + i);
+ __clear_bit(PAGE_EXT_OWNER, &page_ext->flags);
+ }
+}
+
+void __set_page_owner(struct page *page, unsigned int order, gfp_t gfp_mask)
+{
+ struct page_ext *page_ext;
+ struct stack_trace *trace;
+
+ page_ext = lookup_page_ext(page);
+
+ trace = &page_ext->trace;
+ trace->nr_entries = 0;
+ trace->max_entries = ARRAY_SIZE(page_ext->trace_entries);
+ trace->entries = &page_ext->trace_entries[0];
+ trace->skip = 3;
+ save_stack_trace(&page_ext->trace);
+
+ page_ext->order = order;
+ page_ext->gfp_mask = gfp_mask;
+
+ __set_bit(PAGE_EXT_OWNER, &page_ext->flags);
+}
+
+static ssize_t
+print_page_owner(char __user *buf, size_t count, unsigned long pfn,
+ struct page *page, struct page_ext *page_ext)
+{
+ int ret;
+ int pageblock_mt, page_mt;
+ char *kbuf;
+
+ kbuf = kmalloc(count, GFP_KERNEL);
+ if (!kbuf)
+ return -ENOMEM;
+
+ ret = snprintf(kbuf, count,
+ "Page allocated via order %u, mask 0x%x\n",
+ page_ext->order, page_ext->gfp_mask);
+
+ if (ret >= count)
+ goto err;
+
+ /* Print information relevant to grouping pages by mobility */
+ pageblock_mt = get_pfnblock_migratetype(page, pfn);
+ page_mt = gfpflags_to_migratetype(page_ext->gfp_mask);
+ ret += snprintf(kbuf + ret, count - ret,
+ "PFN %lu Block %lu type %d %s Flags %s%s%s%s%s%s%s%s%s%s%s%s\n",
+ pfn,
+ pfn >> pageblock_order,
+ pageblock_mt,
+ pageblock_mt != page_mt ? "Fallback" : " ",
+ PageLocked(page) ? "K" : " ",
+ PageError(page) ? "E" : " ",
+ PageReferenced(page) ? "R" : " ",
+ PageUptodate(page) ? "U" : " ",
+ PageDirty(page) ? "D" : " ",
+ PageLRU(page) ? "L" : " ",
+ PageActive(page) ? "A" : " ",
+ PageSlab(page) ? "S" : " ",
+ PageWriteback(page) ? "W" : " ",
+ PageCompound(page) ? "C" : " ",
+ PageSwapCache(page) ? "B" : " ",
+ PageMappedToDisk(page) ? "M" : " ");
+
+ if (ret >= count)
+ goto err;
+
+ ret += snprint_stack_trace(kbuf + ret, count - ret,
+ &page_ext->trace, 0);
+ if (ret >= count)
+ goto err;
+
+ ret += snprintf(kbuf + ret, count - ret, "\n");
+ if (ret >= count)
+ goto err;
+
+ if (copy_to_user(buf, kbuf, ret))
+ ret = -EFAULT;
+
+ kfree(kbuf);
+ return ret;
+
+err:
+ kfree(kbuf);
+ return -ENOMEM;
+}
+
+static ssize_t
+read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos)
+{
+ unsigned long pfn;
+ struct page *page;
+ struct page_ext *page_ext;
+
+ if (!page_owner_inited)
+ return -EINVAL;
+
+ page = NULL;
+ pfn = min_low_pfn + *ppos;
+
+ /* Find a valid PFN or the start of a MAX_ORDER_NR_PAGES area */
+ while (!pfn_valid(pfn) && (pfn & (MAX_ORDER_NR_PAGES - 1)) != 0)
+ pfn++;
+
+ drain_all_pages(NULL);
+
+ /* Find an allocated page */
+ for (; pfn < max_pfn; pfn++) {
+ /*
+ * If the new page is in a new MAX_ORDER_NR_PAGES area,
+ * validate the area as existing, skip it if not
+ */
+ if ((pfn & (MAX_ORDER_NR_PAGES - 1)) == 0 && !pfn_valid(pfn)) {
+ pfn += MAX_ORDER_NR_PAGES - 1;
+ continue;
+ }
+
+ /* Check for holes within a MAX_ORDER area */
+ if (!pfn_valid_within(pfn))
+ continue;
+
+ page = pfn_to_page(pfn);
+ if (PageBuddy(page)) {
+ unsigned long freepage_order = page_order_unsafe(page);
+
+ if (freepage_order < MAX_ORDER)
+ pfn += (1UL << freepage_order) - 1;
+ continue;
+ }
+
+ page_ext = lookup_page_ext(page);
+
+ /*
+ * Pages allocated before initialization of page_owner are
+ * non-buddy and have no page_owner info.
+ */
+ if (!test_bit(PAGE_EXT_OWNER, &page_ext->flags))
+ continue;
+
+ /* Record the next PFN to read in the file offset */
+ *ppos = (pfn - min_low_pfn) + 1;
+
+ return print_page_owner(buf, count, pfn, page, page_ext);
+ }
+
+ return 0;
+}
+
+static const struct file_operations proc_page_owner_operations = {
+ .read = read_page_owner,
+};
+
+static int __init pageowner_init(void)
+{
+ struct dentry *dentry;
+
+ if (!page_owner_inited) {
+ pr_info("page_owner is disabled\n");
+ return 0;
+ }
+
+ dentry = debugfs_create_file("page_owner", S_IRUSR, NULL,
+ NULL, &proc_page_owner_operations);
+ if (IS_ERR(dentry))
+ return PTR_ERR(dentry);
+
+ return 0;
+}
+module_init(pageowner_init)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1b12d39..b090e9e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -22,6 +22,8 @@
#include <linux/writeback.h>
#include <linux/compaction.h>
#include <linux/mm_inline.h>
+#include <linux/page_ext.h>
+#include <linux/page_owner.h>

#include "internal.h"

@@ -1017,6 +1019,104 @@ static int pagetypeinfo_showblockcount(struct seq_file *m, void *arg)
return 0;
}

+#ifdef CONFIG_PAGE_OWNER
+static void pagetypeinfo_showmixedcount_print(struct seq_file *m,
+ pg_data_t *pgdat,
+ struct zone *zone)
+{
+ struct page *page;
+ struct page_ext *page_ext;
+ unsigned long pfn = zone->zone_start_pfn, block_end_pfn;
+ unsigned long end_pfn = pfn + zone->spanned_pages;
+ unsigned long count[MIGRATE_TYPES] = { 0, };
+ int pageblock_mt, page_mt;
+ int i;
+
+ /* Scan block by block. First and last block may be incomplete */
+ pfn = zone->zone_start_pfn;
+
+ /*
+ * Walk the zone in pageblock_nr_pages steps. If a page block spans
+ * a zone boundary, it will be double counted between zones. This does
+ * not matter as the mixed block count will still be correct
+ */
+ for (; pfn < end_pfn; ) {
+ if (!pfn_valid(pfn)) {
+ pfn = ALIGN(pfn + 1, MAX_ORDER_NR_PAGES);
+ continue;
+ }
+
+ block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
+ block_end_pfn = min(block_end_pfn, end_pfn);
+
+ page = pfn_to_page(pfn);
+ pageblock_mt = get_pfnblock_migratetype(page, pfn);
+
+ for (; pfn < block_end_pfn; pfn++) {
+ if (!pfn_valid_within(pfn))
+ continue;
+
+ page = pfn_to_page(pfn);
+ if (PageBuddy(page)) {
+ pfn += (1UL << page_order(page)) - 1;
+ continue;
+ }
+
+ if (PageReserved(page))
+ continue;
+
+ page_ext = lookup_page_ext(page);
+
+ if (!test_bit(PAGE_EXT_OWNER, &page_ext->flags))
+ continue;
+
+ page_mt = gfpflags_to_migratetype(page_ext->gfp_mask);
+ if (pageblock_mt != page_mt) {
+ if (is_migrate_cma(pageblock_mt))
+ count[MIGRATE_MOVABLE]++;
+ else
+ count[pageblock_mt]++;
+
+ pfn = block_end_pfn;
+ break;
+ }
+ pfn += (1UL << page_ext->order) - 1;
+ }
+ }
+
+ /* Print counts */
+ seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
+ for (i = 0; i < MIGRATE_TYPES; i++)
+ seq_printf(m, "%12lu ", count[i]);
+ seq_putc(m, '\n');
+}
+#endif /* CONFIG_PAGE_OWNER */
+
+/*
+ * Print out the number of pageblocks for each migratetype that contain pages
+ * of other types. This gives an indication of how well fallbacks are being
+ * contained by rmqueue_fallback(). It requires information from PAGE_OWNER
+ * to determine what is going on
+ */
+static void pagetypeinfo_showmixedcount(struct seq_file *m, pg_data_t *pgdat)
+{
+#ifdef CONFIG_PAGE_OWNER
+ int mtype;
+
+ if (!page_owner_inited)
+ return;
+
+ drain_all_pages(NULL);
+
+ seq_printf(m, "\n%-23s", "Number of mixed blocks ");
+ for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
+ seq_printf(m, "%12s ", migratetype_names[mtype]);
+ seq_putc(m, '\n');
+
+ walk_zones_in_node(m, pgdat, pagetypeinfo_showmixedcount_print);
+#endif /* CONFIG_PAGE_OWNER */
+}
+
/*
* This prints out statistics in relation to grouping pages by mobility.
* It is expensive to collect so do not constantly read the file.
@@ -1034,6 +1134,7 @@ static int pagetypeinfo_show(struct seq_file *m, void *arg)
seq_putc(m, '\n');
pagetypeinfo_showfree(m, pgdat);
pagetypeinfo_showblockcount(m, pgdat);
+ pagetypeinfo_showmixedcount(m, pgdat);

return 0;
}
diff --git a/tools/vm/Makefile b/tools/vm/Makefile
index 3d907da..ac884b6 100644
--- a/tools/vm/Makefile
+++ b/tools/vm/Makefile
@@ -1,6 +1,6 @@
# Makefile for vm tools
#
-TARGETS=page-types slabinfo
+TARGETS=page-types slabinfo page_owner_sort

LIB_DIR = ../lib/api
LIBS = $(LIB_DIR)/libapikfs.a
@@ -18,5 +18,5 @@ $(LIBS):
$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)

clean:
- $(RM) page-types slabinfo
+ $(RM) page-types slabinfo page_owner_sort
make -C $(LIB_DIR) clean
diff --git a/tools/vm/page_owner_sort.c b/tools/vm/page_owner_sort.c
new file mode 100644
index 0000000..77147b4
--- /dev/null
+++ b/tools/vm/page_owner_sort.c
@@ -0,0 +1,144 @@
+/*
+ * User-space helper to sort the output of /sys/kernel/debug/page_owner
+ *
+ * Example use:
+ * cat /sys/kernel/debug/page_owner > page_owner_full.txt
+ * grep -v ^PFN page_owner_full.txt > page_owner.txt
+ * ./sort page_owner.txt sorted_page_owner.txt
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <string.h>
+
+struct block_list {
+ char *txt;
+ int len;
+ int num;
+};
+
+
+static struct block_list *list;
+static int list_size;
+static int max_size;
+
+struct block_list *block_head;
+
+int read_block(char *buf, int buf_size, FILE *fin)
+{
+ char *curr = buf, *const buf_end = buf + buf_size;
+
+ while (buf_end - curr > 1 && fgets(curr, buf_end - curr, fin)) {
+ if (*curr == '\n') /* empty line */
+ return curr - buf;
+ curr += strlen(curr);
+ }
+
+ return -1; /* EOF or no space left in buf. */
+}
+
+static int compare_txt(const void *p1, const void *p2)
+{
+ const struct block_list *l1 = p1, *l2 = p2;
+
+ return strcmp(l1->txt, l2->txt);
+}
+
+static int compare_num(const void *p1, const void *p2)
+{
+ const struct block_list *l1 = p1, *l2 = p2;
+
+ return l2->num - l1->num;
+}
+
+static void add_list(char *buf, int len)
+{
+ if (list_size != 0 &&
+ len == list[list_size-1].len &&
+ memcmp(buf, list[list_size-1].txt, len) == 0) {
+ list[list_size-1].num++;
+ return;
+ }
+ if (list_size == max_size) {
+ printf("max_size too small??\n");
+ exit(1);
+ }
+ list[list_size].txt = malloc(len+1);
+ list[list_size].len = len;
+ list[list_size].num = 1;
+ memcpy(list[list_size].txt, buf, len);
+ list[list_size].txt[len] = 0;
+ list_size++;
+ if (list_size % 1000 == 0) {
+ printf("loaded %d\r", list_size);
+ fflush(stdout);
+ }
+}
+
+#define BUF_SIZE 1024
+
+int main(int argc, char **argv)
+{
+ FILE *fin, *fout;
+ char buf[BUF_SIZE];
+ int ret, i, count;
+ struct block_list *list2;
+ struct stat st;
+
+ if (argc < 3) {
+ printf("Usage: ./program <input> <output>\n");
+ perror("open: ");
+ exit(1);
+ }
+
+ fin = fopen(argv[1], "r");
+ fout = fopen(argv[2], "w");
+ if (!fin || !fout) {
+ printf("Usage: ./program <input> <output>\n");
+ perror("open: ");
+ exit(1);
+ }
+
+ fstat(fileno(fin), &st);
+ max_size = st.st_size / 100; /* hack ... */
+
+ list = malloc(max_size * sizeof(*list));
+
+ for ( ; ; ) {
+ ret = read_block(buf, BUF_SIZE, fin);
+ if (ret < 0)
+ break;
+
+ add_list(buf, ret);
+ }
+
+ printf("loaded %d\n", list_size);
+
+ printf("sorting ....\n");
+
+ qsort(list, list_size, sizeof(list[0]), compare_txt);
+
+ list2 = malloc(sizeof(*list) * list_size);
+
+ printf("culling\n");
+
+ for (i = count = 0; i < list_size; i++) {
+ if (count == 0 ||
+ strcmp(list2[count-1].txt, list[i].txt) != 0) {
+ list2[count++] = list[i];
+ } else {
+ list2[count-1].num += list[i].num;
+ }
+ }
+
+ qsort(list2, count, sizeof(list[0]), compare_num);
+
+ for (i = 0; i < count; i++)
+ fprintf(fout, "%d times:\n%s\n", list2[i].num, list2[i].txt);
+
+ return 0;
+}
--
1.7.9.5

2014-11-24 08:13:33

by Joonsoo Kim

Subject: [PATCH v3 3/8] mm/debug-pagealloc: make debug-pagealloc boottime configurable

Now we have prepared to avoid using debug-pagealloc at boot time. So
introduce a new kernel parameter to disable debug-pagealloc at boot time,
and make the related functions be disabled in this case.

The only non-intuitive part is the change to the guard page functions.
Because guard pages are effective only if debug-pagealloc is enabled,
turning them off along with debug-pagealloc is the reasonable thing to do.
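
For example (illustrative; the rest of the command line is hypothetical), a
kernel built with CONFIG_DEBUG_PAGEALLOC can skip the overhead by appending
the new parameter to the boot command line:

  root=/dev/sda1 ro quiet disable_debug_pagealloc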

v2: make debug-pagealloc boot-time configurable for page poisoning, too

Signed-off-by: Joonsoo Kim <[email protected]>
---
Documentation/kernel-parameters.txt | 8 ++++++++
arch/powerpc/mm/hash_utils_64.c | 2 +-
arch/powerpc/mm/pgtable_32.c | 2 +-
arch/s390/mm/pageattr.c | 2 +-
arch/sparc/mm/init_64.c | 2 +-
arch/x86/mm/pageattr.c | 2 +-
include/linux/mm.h | 17 ++++++++++++++++-
mm/debug-pagealloc.c | 8 +++++++-
mm/page_alloc.c | 16 ++++++++++++++++
9 files changed, 52 insertions(+), 7 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 3c5a178..b5ac055 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -858,6 +858,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
causing system reset or hang due to sending
INIT from AP to BSP.

+ disable_debug_pagealloc
+ [KNL] When CONFIG_DEBUG_PAGEALLOC is set, this
+ parameter allows the user to disable it at boot time.
+ With this parameter, we can avoid allocating a huge
+ chunk of memory for debug pagealloc, and then
+ the system will work mostly the same as a kernel
+ built without CONFIG_DEBUG_PAGEALLOC.
+
disable_ddw [PPC/PSERIES]
Disable Dynamic DMA Window support. Use this if
to workaround buggy firmware.
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index d5339a3..57b9c23 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1432,7 +1432,7 @@ static void kernel_unmap_linear_page(unsigned long vaddr, unsigned long lmi)
mmu_kernel_ssize, 0);
}

-void kernel_map_pages(struct page *page, int numpages, int enable)
+void __kernel_map_pages(struct page *page, int numpages, int enable)
{
unsigned long flags, vaddr, lmi;
int i;
diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c
index cf11342..b98aac6 100644
--- a/arch/powerpc/mm/pgtable_32.c
+++ b/arch/powerpc/mm/pgtable_32.c
@@ -430,7 +430,7 @@ static int change_page_attr(struct page *page, int numpages, pgprot_t prot)
}


-void kernel_map_pages(struct page *page, int numpages, int enable)
+void __kernel_map_pages(struct page *page, int numpages, int enable)
{
if (PageHighMem(page))
return;
diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
index 3fef3b2..426c9d4 100644
--- a/arch/s390/mm/pageattr.c
+++ b/arch/s390/mm/pageattr.c
@@ -120,7 +120,7 @@ static void ipte_range(pte_t *pte, unsigned long address, int nr)
}
}

-void kernel_map_pages(struct page *page, int numpages, int enable)
+void __kernel_map_pages(struct page *page, int numpages, int enable)
{
unsigned long address;
int nr, i, j;
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 2d91c62..3ea267c 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -1621,7 +1621,7 @@ static void __init kernel_physical_mapping_init(void)
}

#ifdef CONFIG_DEBUG_PAGEALLOC
-void kernel_map_pages(struct page *page, int numpages, int enable)
+void __kernel_map_pages(struct page *page, int numpages, int enable)
{
unsigned long phys_start = page_to_pfn(page) << PAGE_SHIFT;
unsigned long phys_end = phys_start + (numpages * PAGE_SIZE);
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 36de293..4d304e1 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -1801,7 +1801,7 @@ static int __set_pages_np(struct page *page, int numpages)
return __change_page_attr_set_clr(&cpa, 0);
}

-void kernel_map_pages(struct page *page, int numpages, int enable)
+void __kernel_map_pages(struct page *page, int numpages, int enable)
{
if (PageHighMem(page))
return;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5a8d4d4..5dc11e7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2055,7 +2055,22 @@ static inline void vm_stat_account(struct mm_struct *mm,
#endif /* CONFIG_PROC_FS */

#ifdef CONFIG_DEBUG_PAGEALLOC
-extern void kernel_map_pages(struct page *page, int numpages, int enable);
+extern bool _debug_pagealloc_enabled;
+extern void __kernel_map_pages(struct page *page, int numpages, int enable);
+
+static inline bool debug_pagealloc_enabled(void)
+{
+ return _debug_pagealloc_enabled;
+}
+
+static inline void
+kernel_map_pages(struct page *page, int numpages, int enable)
+{
+ if (!debug_pagealloc_enabled())
+ return;
+
+ __kernel_map_pages(page, numpages, enable);
+}
#ifdef CONFIG_HIBERNATION
extern bool kernel_page_present(struct page *page);
#endif /* CONFIG_HIBERNATION */
diff --git a/mm/debug-pagealloc.c b/mm/debug-pagealloc.c
index 0072f2c..5bf5906 100644
--- a/mm/debug-pagealloc.c
+++ b/mm/debug-pagealloc.c
@@ -10,11 +10,17 @@ static bool page_poisoning_enabled __read_mostly;

static bool need_page_poisoning(void)
{
+ if (!debug_pagealloc_enabled())
+ return false;
+
return true;
}

static void init_page_poisoning(void)
{
+ if (!debug_pagealloc_enabled())
+ return;
+
page_poisoning_enabled = true;
}

@@ -119,7 +125,7 @@ static void unpoison_pages(struct page *page, int n)
unpoison_page(page + i);
}

-void kernel_map_pages(struct page *page, int numpages, int enable)
+void __kernel_map_pages(struct page *page, int numpages, int enable)
{
if (!page_poisoning_enabled)
return;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7534733..4eea173 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -426,15 +426,31 @@ static inline void prep_zero_page(struct page *page, unsigned int order,

#ifdef CONFIG_DEBUG_PAGEALLOC
unsigned int _debug_guardpage_minorder;
+bool _debug_pagealloc_enabled __read_mostly = true;
bool _debug_guardpage_enabled __read_mostly;

+static int __init early_disable_debug_pagealloc(char *buf)
+{
+ _debug_pagealloc_enabled = false;
+
+ return 0;
+}
+early_param("disable_debug_pagealloc", early_disable_debug_pagealloc);
+
static bool need_debug_guardpage(void)
{
+ /* If we don't use debug_pagealloc, we don't need guard page */
+ if (!debug_pagealloc_enabled())
+ return false;
+
return true;
}

static void init_debug_guardpage(void)
{
+ if (!debug_pagealloc_enabled())
+ return;
+
_debug_guardpage_enabled = true;
}

--
1.7.9.5

2014-11-24 08:13:32

by Joonsoo Kim

Subject: [PATCH v3 7/8] mm/page_owner: correct owner information for early allocated pages

The extended memory used to store page owner information is initialized
some time after the page allocator starts. Until that initialization, many
pages can be allocated and they will have no owner information. This makes
debugging with page owner harder, so some fixup is helpful.

This patch fixes up this situation by setting fake owner information
immediately after the page extension is initialized. The information
doesn't identify the right owner but, at least, it tells whether a page is
allocated or not, more correctly.

In my testing, this patch catches 13343 early allocated pages, although
they are mostly allocated by the page extension feature itself. After that,
there is no page left that is allocated but has no page owner flag.

Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/page_owner.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 91 insertions(+), 2 deletions(-)

diff --git a/mm/page_owner.c b/mm/page_owner.c
index 85eec7e..9ab4a9b 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -10,6 +10,8 @@
static bool page_owner_disabled = true;
bool page_owner_inited __read_mostly;

+static void init_early_allocated_pages(void);
+
static int early_page_owner_param(char *buf)
{
if (!buf)
@@ -36,6 +38,7 @@ static void init_page_owner(void)
return;

page_owner_inited = true;
+ init_early_allocated_pages();
}

struct page_ext_operations page_owner_ops = {
@@ -184,8 +187,8 @@ read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos)
page_ext = lookup_page_ext(page);

/*
- * Pages allocated before initialization of page_owner are
- * non-buddy and have no page_owner info.
+ * Some pages could be missed by concurrent allocation or free,
+ * because we don't hold the zone lock.
*/
if (!test_bit(PAGE_EXT_OWNER, &page_ext->flags))
continue;
@@ -199,6 +202,92 @@ read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos)
return 0;
}

+static void init_pages_in_zone(pg_data_t *pgdat, struct zone *zone)
+{
+ struct page *page;
+ struct page_ext *page_ext;
+ unsigned long pfn = zone->zone_start_pfn, block_end_pfn;
+ unsigned long end_pfn = pfn + zone->spanned_pages;
+ unsigned long count = 0;
+
+ /* Scan block by block. First and last block may be incomplete */
+ pfn = zone->zone_start_pfn;
+
+ /*
+ * Walk the zone in pageblock_nr_pages steps. If a page block spans
+ * a zone boundary, it will be double counted between zones. This does
+ * not matter as the mixed block count will still be correct
+ */
+ for (; pfn < end_pfn; ) {
+ if (!pfn_valid(pfn)) {
+ pfn = ALIGN(pfn + 1, MAX_ORDER_NR_PAGES);
+ continue;
+ }
+
+ block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
+ block_end_pfn = min(block_end_pfn, end_pfn);
+
+ page = pfn_to_page(pfn);
+
+ for (; pfn < block_end_pfn; pfn++) {
+ if (!pfn_valid_within(pfn))
+ continue;
+
+ page = pfn_to_page(pfn);
+
+ /*
+ * We are safe to check buddy flag and order, because
+ * this is init stage and only single thread runs.
+ */
+ if (PageBuddy(page)) {
+ pfn += (1UL << page_order(page)) - 1;
+ continue;
+ }
+
+ if (PageReserved(page))
+ continue;
+
+ page_ext = lookup_page_ext(page);
+
+ /* Maybe overlapping zone */
+ if (test_bit(PAGE_EXT_OWNER, &page_ext->flags))
+ continue;
+
+ /* Found early allocated page */
+ set_page_owner(page, 0, 0);
+ count++;
+ }
+ }
+
+ pr_info("Node %d, zone %8s: page owner found early allocated %lu pages\n",
+ pgdat->node_id, zone->name, count);
+}
+
+static void init_zones_in_node(pg_data_t *pgdat)
+{
+ struct zone *zone;
+ struct zone *node_zones = pgdat->node_zones;
+ unsigned long flags;
+
+ for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
+ if (!populated_zone(zone))
+ continue;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ init_pages_in_zone(pgdat, zone);
+ spin_unlock_irqrestore(&zone->lock, flags);
+ }
+}
+
+static void init_early_allocated_pages(void)
+{
+ pg_data_t *pgdat;
+
+ drain_all_pages(NULL);
+ for_each_online_pgdat(pgdat)
+ init_zones_in_node(pgdat);
+}
+
static const struct file_operations proc_page_owner_operations = {
.read = read_page_owner,
};
--
1.7.9.5

2014-11-24 08:15:22

by Joonsoo Kim

Subject: [PATCH v3 1/8] mm/page_ext: resurrect struct page extending code for debugging

When we debug something, we'd like to attach some information to
every page. For this purpose, we sometimes modify struct page itself.
But this has drawbacks. First, it requires a re-compile, which makes us
hesitate to use a powerful debug feature, so the development process is
slowed down. Second, it is sometimes impossible to rebuild the kernel
due to third party module dependencies. Third, system behaviour can be
largely different after a re-compile, because it greatly changes the size
of struct page, and this structure is accessed by every part of the kernel.
Keeping struct page as it is makes it easier to reproduce the erroneous situation.

This feature is intended to overcome the problems mentioned above. It
allocates memory for the extended data of each page somewhere other than
struct page itself, and that memory is accessed through the accessor
functions provided by this code. During the boot process, it checks whether
allocating the huge chunk of memory is needed or not; if not, it avoids
allocating memory at all. With this advantage, we can include this feature
in the kernel by default and avoid rebuilds and the problems related to them.

Until now, memcg uses this technique. But, now, memcg decides to embed
their variable to struct page itself and it's code to extend struct page
has been removed. I'd like to use this code to develop debug feature,
so this patch resurrect it.

To make these things work well, this patch introduces two callbacks for
clients. One is the need callback, which is mandatory if the user wants to
avoid useless memory allocation at boot time. The other, the init callback,
is optional and is used to do proper initialization after memory is
allocated. A detailed explanation of the purpose of these functions is in
the code comment; please refer to it.

Everything else is the same as the previous extension code in memcg.
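
As a rough illustration of the resulting client API (the "foo_debug" names
below are hypothetical and not part of this patch), a user of the need/init
callbacks could look something like this:

/*
 * Hypothetical page_ext client, for illustration only. A real client
 * would also be listed in the page_ext_ops[] array in mm/page_ext.c.
 */
#include <linux/init.h>
#include <linux/page_ext.h>

static bool foo_debug_enabled;

static int __init early_foo_debug(char *buf)
{
	foo_debug_enabled = true;
	return 0;
}
early_param("foo_debug", early_foo_debug);

/* Ask for the extra per-page memory only if the feature was enabled. */
static bool need_foo_debug(void)
{
	return foo_debug_enabled;
}

/* Called once the page_ext memory exists; set up per-page state here. */
static void init_foo_debug(void)
{
}

struct page_ext_operations foo_debug_ops = {
	.need = need_foo_debug,
	.init = init_foo_debug,
};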

v3:
minor fix for readable code

v2:
describe overall design at the top of the page extension code.
add more description on commit message.

Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/mmzone.h | 12 ++
include/linux/page_ext.h | 59 +++++++
init/main.c | 7 +
mm/Kconfig.debug | 9 ++
mm/Makefile | 1 +
mm/page_alloc.c | 2 +
mm/page_ext.c | 386 ++++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 476 insertions(+)
create mode 100644 include/linux/page_ext.h
create mode 100644 mm/page_ext.c

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3879d76..2f0856d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -722,6 +722,9 @@ typedef struct pglist_data {
int nr_zones;
#ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
struct page *node_mem_map;
+#ifdef CONFIG_PAGE_EXTENSION
+ struct page_ext *node_page_ext;
+#endif
#endif
#ifndef CONFIG_NO_BOOTMEM
struct bootmem_data *bdata;
@@ -1075,6 +1078,7 @@ static inline unsigned long early_pfn_to_nid(unsigned long pfn)
#define SECTION_ALIGN_DOWN(pfn) ((pfn) & PAGE_SECTION_MASK)

struct page;
+struct page_ext;
struct mem_section {
/*
* This is, logically, a pointer to an array of struct
@@ -1092,6 +1096,14 @@ struct mem_section {

/* See declaration of similar field in struct zone */
unsigned long *pageblock_flags;
+#ifdef CONFIG_PAGE_EXTENSION
+ /*
+ * If SPARSEMEM, pgdat doesn't have page_ext pointer. We use
+ * section. (see page_ext.h about this.)
+ */
+ struct page_ext *page_ext;
+ unsigned long pad;
+#endif
/*
* WARNING: mem_section must be a power-of-2 in size for the
* calculation and use of SECTION_ROOT_MASK to make sense.
diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
new file mode 100644
index 0000000..2ccc8b4
--- /dev/null
+++ b/include/linux/page_ext.h
@@ -0,0 +1,59 @@
+#ifndef __LINUX_PAGE_EXT_H
+#define __LINUX_PAGE_EXT_H
+
+struct pglist_data;
+struct page_ext_operations {
+ bool (*need)(void);
+ void (*init)(void);
+};
+
+#ifdef CONFIG_PAGE_EXTENSION
+
+/*
+ * Page Extension can be considered as an extended mem_map.
+ * A page_ext page is associated with every page descriptor. The
+ * page_ext helps us add more information about the page.
+ * All page_ext are allocated at boot or memory hotplug event,
+ * then the page_ext for pfn always exists.
+ */
+struct page_ext {
+ unsigned long flags;
+};
+
+extern void pgdat_page_ext_init(struct pglist_data *pgdat);
+
+#ifdef CONFIG_SPARSEMEM
+static inline void page_ext_init_flatmem(void)
+{
+}
+extern void page_ext_init(void);
+#else
+extern void page_ext_init_flatmem(void);
+static inline void page_ext_init(void)
+{
+}
+#endif
+
+struct page_ext *lookup_page_ext(struct page *page);
+
+#else /* !CONFIG_PAGE_EXTENSION */
+struct page_ext;
+
+static inline void pgdat_page_ext_init(struct pglist_data *pgdat)
+{
+}
+
+static inline struct page_ext *lookup_page_ext(struct page *page)
+{
+ return NULL;
+}
+
+static inline void page_ext_init(void)
+{
+}
+
+static inline void page_ext_init_flatmem(void)
+{
+}
+#endif /* CONFIG_PAGE_EXTENSION */
+#endif /* __LINUX_PAGE_EXT_H */
diff --git a/init/main.c b/init/main.c
index 235aafb..c60a246 100644
--- a/init/main.c
+++ b/init/main.c
@@ -51,6 +51,7 @@
#include <linux/mempolicy.h>
#include <linux/key.h>
#include <linux/buffer_head.h>
+#include <linux/page_ext.h>
#include <linux/debug_locks.h>
#include <linux/debugobjects.h>
#include <linux/lockdep.h>
@@ -484,6 +485,11 @@ void __init __weak thread_info_cache_init(void)
*/
static void __init mm_init(void)
{
+ /*
+ * page_ext requires contiguous pages,
+ * bigger than MAX_ORDER unless SPARSEMEM.
+ */
+ page_ext_init_flatmem();
mem_init();
kmem_cache_init();
percpu_init_late();
@@ -621,6 +627,7 @@ asmlinkage __visible void __init start_kernel(void)
initrd_start = 0;
}
#endif
+ page_ext_init();
debug_objects_mem_init();
kmemleak_init();
setup_per_cpu_pageset();
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 4b24432..1ba81c7 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -1,3 +1,12 @@
+config PAGE_EXTENSION
+ bool "Extend memmap on extra space for more information on page"
+ ---help---
+ Extend memmap on extra space for more information on page. This
+ could be used for debugging features that need to insert extra
+ field for every page. This extension enables us to save memory
+ by not allocating this extra memory according to boottime
+ configuration.
+
config DEBUG_PAGEALLOC
bool "Debug page memory allocations"
depends on DEBUG_KERNEL
diff --git a/mm/Makefile b/mm/Makefile
index 9c4371d..0b7a784 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -71,3 +71,4 @@ obj-$(CONFIG_ZSMALLOC) += zsmalloc.o
obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
obj-$(CONFIG_CMA) += cma.o
obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
+obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c0dbede..c91f449 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -48,6 +48,7 @@
#include <linux/backing-dev.h>
#include <linux/fault-inject.h>
#include <linux/page-isolation.h>
+#include <linux/page_ext.h>
#include <linux/debugobjects.h>
#include <linux/kmemleak.h>
#include <linux/compaction.h>
@@ -4857,6 +4858,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
#endif
init_waitqueue_head(&pgdat->kswapd_wait);
init_waitqueue_head(&pgdat->pfmemalloc_wait);
+ pgdat_page_ext_init(pgdat);

for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
diff --git a/mm/page_ext.c b/mm/page_ext.c
new file mode 100644
index 0000000..8b3a97a
--- /dev/null
+++ b/mm/page_ext.c
@@ -0,0 +1,386 @@
+#include <linux/mm.h>
+#include <linux/mmzone.h>
+#include <linux/bootmem.h>
+#include <linux/page_ext.h>
+#include <linux/memory.h>
+#include <linux/vmalloc.h>
+#include <linux/kmemleak.h>
+
+/*
+ * struct page extension
+ *
+ * This is the feature to manage memory for extended data per page.
+ *
+ * Until now, we had to modify struct page itself to store extra data per page.
+ * This requires rebuilding the kernel, which is a really time-consuming process.
+ * And, sometimes, a rebuild is impossible due to third-party module dependencies.
+ * Finally, enlarging struct page could cause unwanted changes in system behaviour.
+ *
+ * This feature is intended to overcome above mentioned problems. This feature
+ * allocates memory for extended data per page in certain place rather than
+ * the struct page itself. This memory can be accessed by the accessor
+ * functions provided by this code. During the boot process, it checks whether
+ * allocation of huge chunk of memory is needed or not. If not, it avoids
+ * allocating memory at all. With this advantage, we can include this feature
+ * into the kernel in default and can avoid rebuild and solve related problems.
+ *
+ * To help these things to work well, there are two callbacks for clients. One
+ * is the need callback which is mandatory if user wants to avoid useless
+ * memory allocation at boot-time. The other is optional, init callback, which
+ * is used to do proper initialization after memory is allocated.
+ *
+ * The need callback is used to decide whether extended memory allocation is
+ * needed or not. Sometimes users want to deactivate some features in this
+ * boot and the extra memory would be unnecessary. In this case, to avoid
+ * allocating a huge chunk of memory, each client expresses its need for
+ * extra memory through the need callback. If any of the need callbacks
+ * returns true, it means that someone needs extra memory, so the
+ * page extension core should allocate memory for page extension. If
+ * none of the need callbacks return true, memory isn't needed at all in
+ * this boot and the page extension core can skip the allocation. As a
+ * result, no memory is wasted.
+ *
+ * The init callback is used to do proper initialization after page extension
+ * is completely initialized. On sparse memory systems, the extra memory is
+ * allocated some time later than the memmap, so the lifetime of the memory
+ * for page extension isn't the same as that of the memmap for struct page.
+ * Therefore, clients can't store extra data until page extension is
+ * initialized, even if pages are already allocated and used freely. This
+ * could leave the extra data per page in an inadequate state, so, to prevent
+ * it, a client can use this callback to initialize that state correctly.
+ */
+
+static struct page_ext_operations *page_ext_ops[] = {
+};
+
+static unsigned long total_usage;
+
+static bool __init invoke_need_callbacks(void)
+{
+ int i;
+ int entries = ARRAY_SIZE(page_ext_ops);
+
+ for (i = 0; i < entries; i++) {
+ if (page_ext_ops[i]->need && page_ext_ops[i]->need())
+ return true;
+ }
+
+ return false;
+}
+
+static void __init invoke_init_callbacks(void)
+{
+ int i;
+ int entries = ARRAY_SIZE(page_ext_ops);
+
+ for (i = 0; i < entries; i++) {
+ if (page_ext_ops[i]->init)
+ page_ext_ops[i]->init();
+ }
+}
+
+#if !defined(CONFIG_SPARSEMEM)
+
+
+void __meminit pgdat_page_ext_init(struct pglist_data *pgdat)
+{
+ pgdat->node_page_ext = NULL;
+}
+
+struct page_ext *lookup_page_ext(struct page *page)
+{
+ unsigned long pfn = page_to_pfn(page);
+ unsigned long offset;
+ struct page_ext *base;
+
+ base = NODE_DATA(page_to_nid(page))->node_page_ext;
+#ifdef CONFIG_DEBUG_VM
+ /*
+ * The sanity checks the page allocator does upon freeing a
+ * page can reach here before the page_ext arrays are
+ * allocated when feeding a range of pages to the allocator
+ * for the first time during bootup or memory hotplug.
+ */
+ if (unlikely(!base))
+ return NULL;
+#endif
+ offset = pfn - NODE_DATA(page_to_nid(page))->node_start_pfn;
+ return base + offset;
+}
+
+static int __init alloc_node_page_ext(int nid)
+{
+ struct page_ext *base;
+ unsigned long table_size;
+ unsigned long nr_pages;
+
+ nr_pages = NODE_DATA(nid)->node_spanned_pages;
+ if (!nr_pages)
+ return 0;
+
+ table_size = sizeof(struct page_ext) * nr_pages;
+
+ base = memblock_virt_alloc_try_nid_nopanic(
+ table_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
+ BOOTMEM_ALLOC_ACCESSIBLE, nid);
+ if (!base)
+ return -ENOMEM;
+ NODE_DATA(nid)->node_page_ext = base;
+ total_usage += table_size;
+ return 0;
+}
+
+void __init page_ext_init_flatmem(void)
+{
+
+ int nid, fail;
+
+ if (!invoke_need_callbacks())
+ return;
+
+ for_each_online_node(nid) {
+ fail = alloc_node_page_ext(nid);
+ if (fail)
+ goto fail;
+ }
+ pr_info("allocated %ld bytes of page_ext\n", total_usage);
+ invoke_init_callbacks();
+ return;
+
+fail:
+ pr_crit("allocation of page_ext failed.\n");
+ panic("Out of memory");
+}
+
+#else /* CONFIG_FLAT_NODE_MEM_MAP */
+
+struct page_ext *lookup_page_ext(struct page *page)
+{
+ unsigned long pfn = page_to_pfn(page);
+ struct mem_section *section = __pfn_to_section(pfn);
+#ifdef CONFIG_DEBUG_VM
+ /*
+ * The sanity checks the page allocator does upon freeing a
+ * page can reach here before the page_ext arrays are
+ * allocated when feeding a range of pages to the allocator
+ * for the first time during bootup or memory hotplug.
+ */
+ if (!section->page_ext)
+ return NULL;
+#endif
+ return section->page_ext + pfn;
+}
+
+static void *__meminit alloc_page_ext(size_t size, int nid)
+{
+ gfp_t flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN;
+ void *addr = NULL;
+
+ addr = alloc_pages_exact_nid(nid, size, flags);
+ if (addr) {
+ kmemleak_alloc(addr, size, 1, flags);
+ return addr;
+ }
+
+ if (node_state(nid, N_HIGH_MEMORY))
+ addr = vzalloc_node(size, nid);
+ else
+ addr = vzalloc(size);
+
+ return addr;
+}
+
+static int __meminit init_section_page_ext(unsigned long pfn, int nid)
+{
+ struct mem_section *section;
+ struct page_ext *base;
+ unsigned long table_size;
+
+ section = __pfn_to_section(pfn);
+
+ if (section->page_ext)
+ return 0;
+
+ table_size = sizeof(struct page_ext) * PAGES_PER_SECTION;
+ base = alloc_page_ext(table_size, nid);
+
+ /*
+ * The value stored in section->page_ext is (base - pfn)
+ * and it does not point to the memory block allocated above,
+ * causing kmemleak false positives.
+ */
+ kmemleak_not_leak(base);
+
+ if (!base) {
+ pr_err("page ext allocation failure\n");
+ return -ENOMEM;
+ }
+
+ /*
+ * The passed "pfn" may not be aligned to SECTION. For the calculation
+ * we need to apply a mask.
+ */
+ pfn &= PAGE_SECTION_MASK;
+ section->page_ext = base - pfn;
+ total_usage += table_size;
+ return 0;
+}
+#ifdef CONFIG_MEMORY_HOTPLUG
+static void free_page_ext(void *addr)
+{
+ if (is_vmalloc_addr(addr)) {
+ vfree(addr);
+ } else {
+ struct page *page = virt_to_page(addr);
+ size_t table_size;
+
+ table_size = sizeof(struct page_ext) * PAGES_PER_SECTION;
+
+ BUG_ON(PageReserved(page));
+ free_pages_exact(addr, table_size);
+ }
+}
+
+static void __free_page_ext(unsigned long pfn)
+{
+ struct mem_section *ms;
+ struct page_ext *base;
+
+ ms = __pfn_to_section(pfn);
+ if (!ms || !ms->page_ext)
+ return;
+ base = ms->page_ext + pfn;
+ free_page_ext(base);
+ ms->page_ext = NULL;
+}
+
+static int __meminit online_page_ext(unsigned long start_pfn,
+ unsigned long nr_pages,
+ int nid)
+{
+ unsigned long start, end, pfn;
+ int fail = 0;
+
+ start = SECTION_ALIGN_DOWN(start_pfn);
+ end = SECTION_ALIGN_UP(start_pfn + nr_pages);
+
+ if (nid == -1) {
+ /*
+ * In this case, "nid" already exists and contains valid memory.
+ * "start_pfn" passed to us is a pfn which is an arg for
+ * online__pages(), and start_pfn should exist.
+ */
+ nid = pfn_to_nid(start_pfn);
+ VM_BUG_ON(!node_state(nid, N_ONLINE));
+ }
+
+ for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) {
+ if (!pfn_present(pfn))
+ continue;
+ fail = init_section_page_ext(pfn, nid);
+ }
+ if (!fail)
+ return 0;
+
+ /* rollback */
+ for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
+ __free_page_ext(pfn);
+
+ return -ENOMEM;
+}
+
+static int __meminit offline_page_ext(unsigned long start_pfn,
+ unsigned long nr_pages, int nid)
+{
+ unsigned long start, end, pfn;
+
+ start = SECTION_ALIGN_DOWN(start_pfn);
+ end = SECTION_ALIGN_UP(start_pfn + nr_pages);
+
+ for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
+ __free_page_ext(pfn);
+ return 0;
+
+}
+
+static int __meminit page_ext_callback(struct notifier_block *self,
+ unsigned long action, void *arg)
+{
+ struct memory_notify *mn = arg;
+ int ret = 0;
+
+ switch (action) {
+ case MEM_GOING_ONLINE:
+ ret = online_page_ext(mn->start_pfn,
+ mn->nr_pages, mn->status_change_nid);
+ break;
+ case MEM_OFFLINE:
+ offline_page_ext(mn->start_pfn,
+ mn->nr_pages, mn->status_change_nid);
+ break;
+ case MEM_CANCEL_ONLINE:
+ offline_page_ext(mn->start_pfn,
+ mn->nr_pages, mn->status_change_nid);
+ break;
+ case MEM_GOING_OFFLINE:
+ break;
+ case MEM_ONLINE:
+ case MEM_CANCEL_OFFLINE:
+ break;
+ }
+
+ return notifier_from_errno(ret);
+}
+
+#endif
+
+void __init page_ext_init(void)
+{
+ unsigned long pfn;
+ int nid;
+
+ if (!invoke_need_callbacks())
+ return;
+
+ for_each_node_state(nid, N_MEMORY) {
+ unsigned long start_pfn, end_pfn;
+
+ start_pfn = node_start_pfn(nid);
+ end_pfn = node_end_pfn(nid);
+ /*
+ * start_pfn and end_pfn may not be aligned to SECTION and the
+ * page->flags of out of node pages are not initialized. So we
+ * scan [start_pfn, the biggest section's pfn < end_pfn) here.
+ */
+ for (pfn = start_pfn; pfn < end_pfn;
+ pfn = ALIGN(pfn + 1, PAGES_PER_SECTION)) {
+
+ if (!pfn_valid(pfn))
+ continue;
+ /*
+ * Nodes's pfns can be overlapping.
+ * We know some arch can have a nodes layout such as
+ * -------------pfn-------------->
+ * N0 | N1 | N2 | N0 | N1 | N2|....
+ */
+ if (pfn_to_nid(pfn) != nid)
+ continue;
+ if (init_section_page_ext(pfn, nid))
+ goto oom;
+ }
+ }
+ hotplug_memory_notifier(page_ext_callback, 0);
+ pr_info("allocated %ld bytes of page_ext\n", total_usage);
+ invoke_init_callbacks();
+ return;
+
+oom:
+ panic("Out of memory");
+}
+
+void __meminit pgdat_page_ext_init(struct pglist_data *pgdat)
+{
+}
+
+#endif
+
--
1.7.9.5

2014-11-24 22:55:22

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v3 3/8] mm/debug-pagealloc: make debug-pagealloc boottime configurable

On Mon, 24 Nov 2014 17:15:21 +0900 Joonsoo Kim <[email protected]> wrote:

> Now, we have prepared to avoid using debug-pagealloc in boottime. So
> introduce new kernel-parameter to disable debug-pagealloc in boottime,
> and makes related functions to be disabled in this case.
>
> Only non-intuitive part is change of guard page functions. Because
> guard page is effective only if debug-pagealloc is enabled, turning off
> according to debug-pagealloc is reasonable thing to do.
>
> ...
>
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -858,6 +858,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> causing system reset or hang due to sending
> INIT from AP to BSP.
>
> + disable_debug_pagealloc
> + [KNL] When CONFIG_DEBUG_PAGEALLOC is set, this
> + parameter allows user to disable it at boot time.
> + With this parameter, we can avoid allocating huge
> + chunk of memory for debug pagealloc and then
> + the system will work mostly same with the kernel
> + built without CONFIG_DEBUG_PAGEALLOC.
> +

Weren't we going to make this default to "off", require a boot option
to turn debug_pagealloc on?

2014-11-24 22:57:32

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v3 5/8] stacktrace: introduce snprint_stack_trace for buffer output

On Mon, 24 Nov 2014 17:15:23 +0900 Joonsoo Kim <[email protected]> wrote:

> Current stacktrace only have the function for console output.
> page_owner that will be introduced in following patch needs to print
> the output of stacktrace into the buffer for our own output format
> so so new function, snprint_stack_trace(), is needed.
>
> ...
>
> +int snprint_stack_trace(char *buf, size_t size,
> + struct stack_trace *trace, int spaces)
> +{
> + int i;
> + unsigned long ip;
> + int generated;
> + int total = 0;
> +
> + if (WARN_ON(!trace->entries))
> + return 0;
> +
> + for (i = 0; i < trace->nr_entries; i++) {
> + ip = trace->entries[i];
> + generated = snprintf(buf, size, "%*c[<%p>] %pS\n",
> + 1 + spaces, ' ', (void *) ip, (void *) ip);
> +
> + total += generated;
> +
> + /* Assume that generated isn't a negative number */
> + if (generated >= size) {
> + buf += size;
> + size = 0;

Seems strange to keep looping around doing nothing. Would it be better
to `break' here?

> + } else {
> + buf += generated;
> + size -= generated;
> + }
> + }
> +
> + return total;
> +}
> +EXPORT_SYMBOL_GPL(snprint_stack_trace);
> +

2014-11-24 23:39:42

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH v3 3/8] mm/debug-pagealloc: make debug-pagealloc boottime configurable

On Mon, Nov 24, 2014 at 02:55:42PM -0800, Andrew Morton wrote:
> On Mon, 24 Nov 2014 17:15:21 +0900 Joonsoo Kim <[email protected]> wrote:
>
> > Now, we have prepared to avoid using debug-pagealloc in boottime. So
> > introduce new kernel-parameter to disable debug-pagealloc in boottime,
> > and makes related functions to be disabled in this case.
> >
> > Only non-intuitive part is change of guard page functions. Because
> > guard page is effective only if debug-pagealloc is enabled, turning off
> > according to debug-pagealloc is reasonable thing to do.
> >
> > ...
> >
> > --- a/Documentation/kernel-parameters.txt
> > +++ b/Documentation/kernel-parameters.txt
> > @@ -858,6 +858,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> > causing system reset or hang due to sending
> > INIT from AP to BSP.
> >
> > + disable_debug_pagealloc
> > + [KNL] When CONFIG_DEBUG_PAGEALLOC is set, this
> > + parameter allows user to disable it at boot time.
> > + With this parameter, we can avoid allocating huge
> > + chunk of memory for debug pagealloc and then
> > + the system will work mostly same with the kernel
> > + built without CONFIG_DEBUG_PAGEALLOC.
> > +
>
> Weren't we going to make this default to "off", require a boot option
> to turn debug_pagealloc on?

Hello, Andrew.

I'm afraid that changing the default to "off" could confuse some old users,
who would expect it to default to "on". But it is just a debug feature, so
it may be no problem. If you prefer to change the default, I will rework
this patch. Please let me know your decision.

Thanks.

2014-11-24 23:43:52

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH v3 5/8] stacktrace: introduce snprint_stack_trace for buffer output

On Mon, Nov 24, 2014 at 02:57:52PM -0800, Andrew Morton wrote:
> On Mon, 24 Nov 2014 17:15:23 +0900 Joonsoo Kim <[email protected]> wrote:
>
> > Current stacktrace only have the function for console output.
> > page_owner that will be introduced in following patch needs to print
> > the output of stacktrace into the buffer for our own output format
> > so so new function, snprint_stack_trace(), is needed.
> >
> > ...
> >
> > +int snprint_stack_trace(char *buf, size_t size,
> > + struct stack_trace *trace, int spaces)
> > +{
> > + int i;
> > + unsigned long ip;
> > + int generated;
> > + int total = 0;
> > +
> > + if (WARN_ON(!trace->entries))
> > + return 0;
> > +
> > + for (i = 0; i < trace->nr_entries; i++) {
> > + ip = trace->entries[i];
> > + generated = snprintf(buf, size, "%*c[<%p>] %pS\n",
> > + 1 + spaces, ' ', (void *) ip, (void *) ip);
> > +
> > + total += generated;
> > +
> > + /* Assume that generated isn't a negative number */
> > + if (generated >= size) {
> > + buf += size;
> > + size = 0;
>
> Seems strange to keep looping around doing nothing. Would it be better
> to `break' here?

generated is added to total in each iteration even when size is 0, so by
keeping the loop going, snprint_stack_trace() can return the accurate
length of the generated string (just as snprintf() reports the length it
would have written).
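
As a rough sketch of the intended caller pattern (the helper, buffer size
and messages below are illustrative, not from the patch):

#include <linux/printk.h>
#include <linux/stacktrace.h>

/* "trace" is assumed to have been filled by save_stack_trace() already. */
static void report_trace(struct stack_trace *trace)
{
	char buf[256];
	int need;

	/* Like snprintf(), the return value is the length the whole trace
	 * would need, so truncation can be detected by the caller. */
	need = snprint_stack_trace(buf, sizeof(buf), trace, 0);
	if (need >= (int)sizeof(buf))
		pr_warn("stack trace truncated, %d bytes needed\n", need);
	else
		pr_info("%s", buf);
}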

Thanks.

2014-11-26 20:49:39

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v3 3/8] mm/debug-pagealloc: make debug-pagealloc boottime configurable

On Tue, 25 Nov 2014 08:42:37 +0900 Joonsoo Kim <[email protected]> wrote:

> On Mon, Nov 24, 2014 at 02:55:42PM -0800, Andrew Morton wrote:
> > On Mon, 24 Nov 2014 17:15:21 +0900 Joonsoo Kim <[email protected]> wrote:
> >
> > > Now, we have prepared to avoid using debug-pagealloc in boottime. So
> > > introduce new kernel-parameter to disable debug-pagealloc in boottime,
> > > and makes related functions to be disabled in this case.
> > >
> > > Only non-intuitive part is change of guard page functions. Because
> > > guard page is effective only if debug-pagealloc is enabled, turning off
> > > according to debug-pagealloc is reasonable thing to do.
> > >
> > > ...
> > >
> > > --- a/Documentation/kernel-parameters.txt
> > > +++ b/Documentation/kernel-parameters.txt
> > > @@ -858,6 +858,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> > > causing system reset or hang due to sending
> > > INIT from AP to BSP.
> > >
> > > + disable_debug_pagealloc
> > > + [KNL] When CONFIG_DEBUG_PAGEALLOC is set, this
> > > + parameter allows user to disable it at boot time.
> > > + With this parameter, we can avoid allocating huge
> > > + chunk of memory for debug pagealloc and then
> > > + the system will work mostly same with the kernel
> > > + built without CONFIG_DEBUG_PAGEALLOC.
> > > +
> >
> > Weren't we going to make this default to "off", require a boot option
> > to turn debug_pagealloc on?
>
> Hello, Andrew.
>
> I'm afraid that changing default to "off" confuses some old users.
> They would expect that it is default "on". But, it is just debug
> feature, so, it may be no problem. If you prefer to change default, I
> will rework this patch. Please let me know your decision.

I suspect the number of "old users" is one ;)

I think it would be better to default to off - that's the typical
behaviour for debug features, for good reasons.

2014-11-27 05:07:55

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH v3 3/8] mm/debug-pagealloc: make debug-pagealloc boottime configurable

On Wed, Nov 26, 2014 at 12:49:36PM -0800, Andrew Morton wrote:
> On Tue, 25 Nov 2014 08:42:37 +0900 Joonsoo Kim <[email protected]> wrote:
>
> > On Mon, Nov 24, 2014 at 02:55:42PM -0800, Andrew Morton wrote:
> > > On Mon, 24 Nov 2014 17:15:21 +0900 Joonsoo Kim <[email protected]> wrote:
> > >
> > > > Now, we have prepared to avoid using debug-pagealloc in boottime. So
> > > > introduce new kernel-parameter to disable debug-pagealloc in boottime,
> > > > and makes related functions to be disabled in this case.
> > > >
> > > > Only non-intuitive part is change of guard page functions. Because
> > > > guard page is effective only if debug-pagealloc is enabled, turning off
> > > > according to debug-pagealloc is reasonable thing to do.
> > > >
> > > > ...
> > > >
> > > > --- a/Documentation/kernel-parameters.txt
> > > > +++ b/Documentation/kernel-parameters.txt
> > > > @@ -858,6 +858,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> > > > causing system reset or hang due to sending
> > > > INIT from AP to BSP.
> > > >
> > > > + disable_debug_pagealloc
> > > > + [KNL] When CONFIG_DEBUG_PAGEALLOC is set, this
> > > > + parameter allows user to disable it at boot time.
> > > > + With this parameter, we can avoid allocating huge
> > > > + chunk of memory for debug pagealloc and then
> > > > + the system will work mostly same with the kernel
> > > > + built without CONFIG_DEBUG_PAGEALLOC.
> > > > +
> > >
> > > Weren't we going to make this default to "off", require a boot option
> > > to turn debug_pagealloc on?
> >
> > Hello, Andrew.
> >
> > I'm afraid that changing default to "off" confuses some old users.
> > They would expect that it is default "on". But, it is just debug
> > feature, so, it may be no problem. If you prefer to change default, I
> > will rework this patch. Please let me know your decision.
>
> I suspect the number of "old users" is one ;)
>
> I think it would be better to default to off - that's the typical
> behaviour for debug features, for good reasons.

Okay.
Here goes the patch.

------------->8----------------
>From 799eba44b2d071d55e4444399be63f1af0aff204 Mon Sep 17 00:00:00 2001
From: Joonsoo Kim <[email protected]>
Date: Wed, 5 Nov 2014 16:05:14 +0900
Subject: [PATCH v4] mm/debug-pagealloc: make debug-pagealloc boottime
configurable

Now we have prepared to avoid using debug-pagealloc at boot time. So, make
it disabled by default and introduce a new kernel parameter to enable
debug-pagealloc at boot time. This is a debugging feature, so disabled by
default is the more useful choice.

The only non-intuitive part is the change to the guard page functions.
Because guard pages are effective only if debug-pagealloc is enabled,
turning them on/off according to debug-pagealloc is a reasonable thing to do.
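
As a usage illustration (not part of the patch): a kernel built with
CONFIG_DEBUG_PAGEALLOC=y now behaves mostly like one built without it,
unless it is booted with

	debug_pagealloc=on

on the kernel command line, in which case kernel_map_pages() stops being a
no-op and the unmap-after-free checks become active.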

v2: makes debug-pagealloc boottime configurable for page poisoning, too
v4: makes debug-pagealloc disabled in default

Signed-off-by: Joonsoo Kim <[email protected]>
---
Documentation/kernel-parameters.txt | 9 +++++++++
arch/powerpc/mm/hash_utils_64.c | 2 +-
arch/powerpc/mm/pgtable_32.c | 2 +-
arch/s390/mm/pageattr.c | 2 +-
arch/sparc/mm/init_64.c | 2 +-
arch/x86/mm/pageattr.c | 2 +-
include/linux/mm.h | 17 ++++++++++++++++-
mm/debug-pagealloc.c | 8 +++++++-
mm/page_alloc.c | 20 ++++++++++++++++++++
9 files changed, 57 insertions(+), 7 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 3c5a178..e35bfcc 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -829,6 +829,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
CONFIG_DEBUG_PAGEALLOC, hence this option will not help
tracking down these problems.

+ debug_pagealloc=
+ [KNL] When CONFIG_DEBUG_PAGEALLOC is set, this
+ parameter enables the feature at boot time. It is
+ disabled by default. We can avoid allocating a huge
+ chunk of memory for debug pagealloc if we don't enable
+ it at boot time, and the system will work mostly the
+ same as a kernel built without CONFIG_DEBUG_PAGEALLOC.
+ on: enable the feature
+
debugpat [X86] Enable PAT debugging

decnet.addr= [HW,NET]
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index d5339a3..57b9c23 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1432,7 +1432,7 @@ static void kernel_unmap_linear_page(unsigned long vaddr, unsigned long lmi)
mmu_kernel_ssize, 0);
}

-void kernel_map_pages(struct page *page, int numpages, int enable)
+void __kernel_map_pages(struct page *page, int numpages, int enable)
{
unsigned long flags, vaddr, lmi;
int i;
diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c
index cf11342..b98aac6 100644
--- a/arch/powerpc/mm/pgtable_32.c
+++ b/arch/powerpc/mm/pgtable_32.c
@@ -430,7 +430,7 @@ static int change_page_attr(struct page *page, int numpages, pgprot_t prot)
}


-void kernel_map_pages(struct page *page, int numpages, int enable)
+void __kernel_map_pages(struct page *page, int numpages, int enable)
{
if (PageHighMem(page))
return;
diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
index 3fef3b2..426c9d4 100644
--- a/arch/s390/mm/pageattr.c
+++ b/arch/s390/mm/pageattr.c
@@ -120,7 +120,7 @@ static void ipte_range(pte_t *pte, unsigned long address, int nr)
}
}

-void kernel_map_pages(struct page *page, int numpages, int enable)
+void __kernel_map_pages(struct page *page, int numpages, int enable)
{
unsigned long address;
int nr, i, j;
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 2d91c62..3ea267c 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -1621,7 +1621,7 @@ static void __init kernel_physical_mapping_init(void)
}

#ifdef CONFIG_DEBUG_PAGEALLOC
-void kernel_map_pages(struct page *page, int numpages, int enable)
+void __kernel_map_pages(struct page *page, int numpages, int enable)
{
unsigned long phys_start = page_to_pfn(page) << PAGE_SHIFT;
unsigned long phys_end = phys_start + (numpages * PAGE_SIZE);
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 36de293..4d304e1 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -1801,7 +1801,7 @@ static int __set_pages_np(struct page *page, int numpages)
return __change_page_attr_set_clr(&cpa, 0);
}

-void kernel_map_pages(struct page *page, int numpages, int enable)
+void __kernel_map_pages(struct page *page, int numpages, int enable)
{
if (PageHighMem(page))
return;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5a8d4d4..5dc11e7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2055,7 +2055,22 @@ static inline void vm_stat_account(struct mm_struct *mm,
#endif /* CONFIG_PROC_FS */

#ifdef CONFIG_DEBUG_PAGEALLOC
-extern void kernel_map_pages(struct page *page, int numpages, int enable);
+extern bool _debug_pagealloc_enabled;
+extern void __kernel_map_pages(struct page *page, int numpages, int enable);
+
+static inline bool debug_pagealloc_enabled(void)
+{
+ return _debug_pagealloc_enabled;
+}
+
+static inline void
+kernel_map_pages(struct page *page, int numpages, int enable)
+{
+ if (!debug_pagealloc_enabled())
+ return;
+
+ __kernel_map_pages(page, numpages, enable);
+}
#ifdef CONFIG_HIBERNATION
extern bool kernel_page_present(struct page *page);
#endif /* CONFIG_HIBERNATION */
diff --git a/mm/debug-pagealloc.c b/mm/debug-pagealloc.c
index 0072f2c..5bf5906 100644
--- a/mm/debug-pagealloc.c
+++ b/mm/debug-pagealloc.c
@@ -10,11 +10,17 @@ static bool page_poisoning_enabled __read_mostly;

static bool need_page_poisoning(void)
{
+ if (!debug_pagealloc_enabled())
+ return false;
+
return true;
}

static void init_page_poisoning(void)
{
+ if (!debug_pagealloc_enabled())
+ return;
+
page_poisoning_enabled = true;
}

@@ -119,7 +125,7 @@ static void unpoison_pages(struct page *page, int n)
unpoison_page(page + i);
}

-void kernel_map_pages(struct page *page, int numpages, int enable)
+void __kernel_map_pages(struct page *page, int numpages, int enable)
{
if (!page_poisoning_enabled)
return;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7534733..10310ad 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -426,15 +426,35 @@ static inline void prep_zero_page(struct page *page, unsigned int order,

#ifdef CONFIG_DEBUG_PAGEALLOC
unsigned int _debug_guardpage_minorder;
+bool _debug_pagealloc_enabled __read_mostly;
bool _debug_guardpage_enabled __read_mostly;

+static int __init early_debug_pagealloc(char *buf)
+{
+ if (!buf)
+ return -EINVAL;
+
+ if (strcmp(buf, "on") == 0)
+ _debug_pagealloc_enabled = true;
+
+ return 0;
+}
+early_param("debug_pagealloc", early_debug_pagealloc);
+
static bool need_debug_guardpage(void)
{
+ /* If we don't use debug_pagealloc, we don't need guard page */
+ if (!debug_pagealloc_enabled())
+ return false;
+
return true;
}

static void init_debug_guardpage(void)
{
+ if (!debug_pagealloc_enabled())
+ return;
+
_debug_guardpage_enabled = true;
}

--
1.7.9.5

2014-11-27 12:35:44

by Paul Bolle

[permalink] [raw]
Subject: Re: [PATCH v3 2/8] mm/debug-pagealloc: prepare boottime configurable on/off

Joonsoo,

On Mon, 2014-11-24 at 17:15 +0900, Joonsoo Kim wrote:
> Until now, debug-pagealloc needs extra flags in struct page, so we need
> to recompile whole source code when we decide to use it. This is really
> painful, because it takes some time to recompile and sometimes rebuild is
> not possible due to third party module depending on struct page.
> So, we can't use this good feature in many cases.
>
> Now, we have the page extension feature that allows us to insert
> extra flags to outside of struct page. This gets rid of third party module
> issue mentioned above. And, this allows us to determine if we need extra
> memory for this page extension in boottime. With these property, we can
> avoid using debug-pagealloc in boottime with low computational overhead
> in the kernel built with CONFIG_DEBUG_PAGEALLOC. This will help our
> development process greatly.
>
> This patch is the preparation step to achive above goal. debug-pagealloc
> originally uses extra field of struct page, but, after this patch, it
> will use field of struct page_ext. Because memory for page_ext is
> allocated later than initialization of page allocator in CONFIG_SPARSEMEM,
> we should disable debug-pagealloc feature temporarily until initialization
> of page_ext. This patch implements this.
>
> v2: fix compile error on CONFIG_PAGE_POISONING
>
> Signed-off-by: Joonsoo Kim <[email protected]>

This patch is included in today's linux-next (ie, next-20141127) as
commit 1e491e9be4c9 ("mm/debug-pagealloc: prepare boottime configurable
on/off").

> [...]
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 33a8acf..c7b22e7 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -10,7 +10,6 @@
> #include <linux/rwsem.h>
> #include <linux/completion.h>
> #include <linux/cpumask.h>
> -#include <linux/page-debug-flags.h>
> #include <linux/uprobes.h>
> #include <linux/page-flags-layout.h>
> #include <asm/page.h>
> @@ -186,9 +185,6 @@ struct page {
> void *virtual; /* Kernel virtual address (NULL if
> not kmapped, ie. highmem) */
> #endif /* WANT_PAGE_VIRTUAL */
> -#ifdef CONFIG_WANT_PAGE_DEBUG_FLAGS
> - unsigned long debug_flags; /* Use atomic bitops on this */
> -#endif
>
> #ifdef CONFIG_KMEMCHECK
> /*
> diff --git a/include/linux/page-debug-flags.h b/include/linux/page-debug-flags.h
> deleted file mode 100644
> index 22691f61..0000000
> --- a/include/linux/page-debug-flags.h
> +++ /dev/null
> @@ -1,32 +0,0 @@
> -#ifndef LINUX_PAGE_DEBUG_FLAGS_H
> -#define LINUX_PAGE_DEBUG_FLAGS_H
> -
> -/*
> - * page->debug_flags bits:
> - *
> - * PAGE_DEBUG_FLAG_POISON is set for poisoned pages. This is used to
> - * implement generic debug pagealloc feature. The pages are filled with
> - * poison patterns and set this flag after free_pages(). The poisoned
> - * pages are verified whether the patterns are not corrupted and clear
> - * the flag before alloc_pages().
> - */
> -
> -enum page_debug_flags {
> - PAGE_DEBUG_FLAG_POISON, /* Page is poisoned */
> - PAGE_DEBUG_FLAG_GUARD,
> -};
> -
> -/*
> - * Ensure that CONFIG_WANT_PAGE_DEBUG_FLAGS reliably
> - * gets turned off when no debug features are enabling it!
> - */
> -
> -#ifdef CONFIG_WANT_PAGE_DEBUG_FLAGS
> -#if !defined(CONFIG_PAGE_POISONING) && \
> - !defined(CONFIG_PAGE_GUARD) \
> -/* && !defined(CONFIG_PAGE_DEBUG_SOMETHING_ELSE) && ... */
> -#error WANT_PAGE_DEBUG_FLAGS is turned on with no debug features!
> -#endif
> -#endif /* CONFIG_WANT_PAGE_DEBUG_FLAGS */
> -
> -#endif /* LINUX_PAGE_DEBUG_FLAGS_H */

This removes all uses of CONFIG_WANT_PAGE_DEBUG_FLAGS and
CONFIG_PAGE_GUARD. So the Kconfig symbols WANT_PAGE_DEBUG_FLAGS and
PAGE_GUARD are now unused.

Should I submit the trivial patch to remove these symbols or is a patch
that does that queued already?


Paul Bolle

2014-11-28 07:32:16

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH v3 2/8] mm/debug-pagealloc: prepare boottime configurable on/off

On Thu, Nov 27, 2014 at 01:35:39PM +0100, Paul Bolle wrote:
> Joonsoo,
>
> On Mon, 2014-11-24 at 17:15 +0900, Joonsoo Kim wrote:
> > Until now, debug-pagealloc needs extra flags in struct page, so we need
> > to recompile whole source code when we decide to use it. This is really
> > painful, because it takes some time to recompile and sometimes rebuild is
> > not possible due to third party module depending on struct page.
> > So, we can't use this good feature in many cases.
> >
> > Now, we have the page extension feature that allows us to insert
> > extra flags to outside of struct page. This gets rid of third party module
> > issue mentioned above. And, this allows us to determine if we need extra
> > memory for this page extension in boottime. With these property, we can
> > avoid using debug-pagealloc in boottime with low computational overhead
> > in the kernel built with CONFIG_DEBUG_PAGEALLOC. This will help our
> > development process greatly.
> >
> > This patch is the preparation step to achive above goal. debug-pagealloc
> > originally uses extra field of struct page, but, after this patch, it
> > will use field of struct page_ext. Because memory for page_ext is
> > allocated later than initialization of page allocator in CONFIG_SPARSEMEM,
> > we should disable debug-pagealloc feature temporarily until initialization
> > of page_ext. This patch implements this.
> >
> > v2: fix compile error on CONFIG_PAGE_POISONING
> >
> > Signed-off-by: Joonsoo Kim <[email protected]>
>
> This patch is included in today's linux-next (ie, next-2o0141127) as
> commit 1e491e9be4c9 ("mm/debug-pagealloc: prepare boottime configurable
> on/off").
>
> > [...]
> >
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 33a8acf..c7b22e7 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -10,7 +10,6 @@
> > #include <linux/rwsem.h>
> > #include <linux/completion.h>
> > #include <linux/cpumask.h>
> > -#include <linux/page-debug-flags.h>
> > #include <linux/uprobes.h>
> > #include <linux/page-flags-layout.h>
> > #include <asm/page.h>
> > @@ -186,9 +185,6 @@ struct page {
> > void *virtual; /* Kernel virtual address (NULL if
> > not kmapped, ie. highmem) */
> > #endif /* WANT_PAGE_VIRTUAL */
> > -#ifdef CONFIG_WANT_PAGE_DEBUG_FLAGS
> > - unsigned long debug_flags; /* Use atomic bitops on this */
> > -#endif
> >
> > #ifdef CONFIG_KMEMCHECK
> > /*
> > diff --git a/include/linux/page-debug-flags.h b/include/linux/page-debug-flags.h
> > deleted file mode 100644
> > index 22691f61..0000000
> > --- a/include/linux/page-debug-flags.h
> > +++ /dev/null
> > @@ -1,32 +0,0 @@
> > -#ifndef LINUX_PAGE_DEBUG_FLAGS_H
> > -#define LINUX_PAGE_DEBUG_FLAGS_H
> > -
> > -/*
> > - * page->debug_flags bits:
> > - *
> > - * PAGE_DEBUG_FLAG_POISON is set for poisoned pages. This is used to
> > - * implement generic debug pagealloc feature. The pages are filled with
> > - * poison patterns and set this flag after free_pages(). The poisoned
> > - * pages are verified whether the patterns are not corrupted and clear
> > - * the flag before alloc_pages().
> > - */
> > -
> > -enum page_debug_flags {
> > - PAGE_DEBUG_FLAG_POISON, /* Page is poisoned */
> > - PAGE_DEBUG_FLAG_GUARD,
> > -};
> > -
> > -/*
> > - * Ensure that CONFIG_WANT_PAGE_DEBUG_FLAGS reliably
> > - * gets turned off when no debug features are enabling it!
> > - */
> > -
> > -#ifdef CONFIG_WANT_PAGE_DEBUG_FLAGS
> > -#if !defined(CONFIG_PAGE_POISONING) && \
> > - !defined(CONFIG_PAGE_GUARD) \
> > -/* && !defined(CONFIG_PAGE_DEBUG_SOMETHING_ELSE) && ... */
> > -#error WANT_PAGE_DEBUG_FLAGS is turned on with no debug features!
> > -#endif
> > -#endif /* CONFIG_WANT_PAGE_DEBUG_FLAGS */
> > -
> > -#endif /* LINUX_PAGE_DEBUG_FLAGS_H */
>
> This remove all uses of CONFIG_WANT_PAGE_DEBUG_FLAGS and
> CONFIG_PAGE_GUARD. So the Kconfig symbols WANT_PAGE_DEBUG_FLAGS and
> PAGE_GUARD are now unused.
>
> Should I submit the trivial patch to remove these symbols or is a patch
> that does that queued already?

Hello, Paul.

Thanks for spotting this.
I attach the patch. :)

Andrew,
Could you kindly fold this into the patch in your tree?

Thanks.

------------------->8---------------
>From a33c480160904cc93333807a448960151ac4c534 Mon Sep 17 00:00:00 2001
From: Joonsoo Kim <[email protected]>
Date: Fri, 28 Nov 2014 16:05:32 +0900
Subject: [PATCH] mm/debug_pagealloc: remove obsolete Kconfig options

These are obsolete since the commit "mm/debug-pagealloc: prepare boottime
configurable on/off" was merged. So, remove them.

[[email protected]: find obsolete Kconfig options]
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/Kconfig.debug | 9 ---------
1 file changed, 9 deletions(-)

diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 56badfc..957d3da 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -14,7 +14,6 @@ config DEBUG_PAGEALLOC
depends on !KMEMCHECK
select PAGE_EXTENSION
select PAGE_POISONING if !ARCH_SUPPORTS_DEBUG_PAGEALLOC
- select PAGE_GUARD if ARCH_SUPPORTS_DEBUG_PAGEALLOC
---help---
Unmap pages from the kernel linear mapping after free_pages().
This results in a large slowdown, but helps to find certain types
@@ -27,13 +26,5 @@ config DEBUG_PAGEALLOC
that would result in incorrect warnings of memory corruption after
a resume because free pages are not saved to the suspend image.

-config WANT_PAGE_DEBUG_FLAGS
- bool
-
config PAGE_POISONING
bool
- select WANT_PAGE_DEBUG_FLAGS
-
-config PAGE_GUARD
- bool
- select WANT_PAGE_DEBUG_FLAGS
--
1.7.9.5

2014-12-03 01:15:21

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH v3 1/8] mm/page_ext: resurrect struct page extending code for debugging

On Mon, Nov 24, 2014 at 05:15:19PM +0900, Joonsoo Kim wrote:
> When we debug something, we'd like to insert some information to
> every page. For this purpose, we sometimes modify struct page itself.
> But, this has drawbacks. First, it requires re-compile. This makes us
> hesitate to use the powerful debug feature so development process is
> slowed down. And, second, sometimes it is impossible to rebuild the kernel
> due to third party module dependency. At third, system behaviour would be
> largely different after re-compile, because it changes size of struct
> page greatly and this structure is accessed by every part of kernel.
> Keeping this as it is would be better to reproduce errornous situation.
>
> This feature is intended to overcome above mentioned problems. This feature
> allocates memory for extended data per page in certain place rather than
> the struct page itself. This memory can be accessed by the accessor
> functions provided by this code. During the boot process, it checks whether
> allocation of huge chunk of memory is needed or not. If not, it avoids
> allocating memory at all. With this advantage, we can include this feature
> into the kernel in default and can avoid rebuild and solve related problems.
>
> Until now, memcg uses this technique. But, now, memcg decides to embed
> their variable to struct page itself and it's code to extend struct page
> has been removed. I'd like to use this code to develop debug feature,
> so this patch resurrect it.
>
> To help these things to work well, this patch introduces two callbacks
> for clients. One is the need callback which is mandatory if user wants
> to avoid useless memory allocation at boot-time. The other is optional,
> init callback, which is used to do proper initialization after memory
> is allocated. Detailed explanation about purpose of these functions is
> in code comment. Please refer it.
>
> Others are completely same with previous extension code in memcg.
>
> v3:
> minor fix for readable code
>
> v2:
> describe overall design at the top of the page extension code.
> add more description on commit message.
>
> Signed-off-by: Joonsoo Kim <[email protected]>

Hello, Andrew.

Could you fold the following fix into the merged patch?
It fixes the problem on !CONFIG_SPARSEMEM which was reported by the
0day kernel testing robot.

https://lkml.org/lkml/2014/11/28/123

Thanks.


------->8----------
>From 8436eaa754208d08f065e6317c7a16c7dfe1a766 Mon Sep 17 00:00:00 2001
From: Joonsoo Kim <[email protected]>
Date: Wed, 3 Dec 2014 09:48:16 +0900
Subject: [PATCH] mm/page_ext: reserve more space in case of unaligned node
range

When the page allocator's buddy algorithm checks a buddy's status, the
checked page could be in an invalid range. In this case, lookup_page_ext()
will return an invalid address, and dereferencing it causes a problem. For
example, if node_start_pfn is 1 and the page with pfn 1 is freed to the
page allocator, page_is_buddy() would check the page with pfn 0. In the
page_ext code, the offset would be calculated as pfn - node_start_pfn, so
0 - 1 = -1. This causes the following problem reported by Fengguang.

[ 0.480155] BUG: unable to handle kernel paging request at d26bdffc
[ 0.481566] IP: [<c110bc7a>] free_one_page+0x31a/0x3e0
[ 0.482801] *pdpt = 0000000001866001 *pde = 0000000012584067 *pte = 80000000126bd060
[ 0.483333] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
snip...
[ 0.483333] Call Trace:
[ 0.483333] [<c110bdec>] __free_pages_ok+0xac/0xf0
[ 0.483333] [<c110c769>] __free_pages+0x19/0x30
[ 0.483333] [<c1144ca5>] kfree+0x75/0xf0
[ 0.483333] [<c111b595>] ? kvfree+0x45/0x50
[ 0.483333] [<c111b595>] kvfree+0x45/0x50
[ 0.483333] [<c134bb73>] rhashtable_expand+0x1b3/0x1e0
[ 0.483333] [<c17fc9f9>] test_rht_init+0x173/0x2e8
[ 0.483333] [<c134b750>] ? jhash2+0xe0/0xe0
[ 0.483333] [<c134b790>] ? rhashtable_hashfn+0x20/0x20
[ 0.483333] [<c134b7b0>] ? rht_grow_above_75+0x20/0x20
[ 0.483333] [<c134b7d0>] ? rht_shrink_below_30+0x20/0x20
[ 0.483333] [<c134b750>] ? jhash2+0xe0/0xe0
[ 0.483333] [<c134b790>] ? rhashtable_hashfn+0x20/0x20
[ 0.483333] [<c134b7b0>] ? rht_grow_above_75+0x20/0x20
[ 0.483333] [<c134b7d0>] ? rht_shrink_below_30+0x20/0x20
[ 0.483333] [<c17fc886>] ? test_rht_lookup+0x8f/0x8f
[ 0.483333] [<c1000486>] do_one_initcall+0xc6/0x210
[ 0.483333] [<c17fc886>] ? test_rht_lookup+0x8f/0x8f
[ 0.483333] [<c17d0505>] ? repair_env_string+0x12/0x54
[ 0.483333] [<c17d0cf3>] kernel_init_freeable+0x193/0x213
[ 0.483333] [<c1512500>] kernel_init+0x10/0xf0
[ 0.483333] [<c151c5c1>] ret_from_kernel_thread+0x21/0x30
[ 0.483333] [<c15124f0>] ? rest_init+0xb0/0xb0
snip...
[ 0.483333] EIP: [<c110bc7a>] free_one_page+0x31a/0x3e0 SS:ESP 0068:c0041de0
[ 0.483333] CR2: 00000000d26bdffc
[ 0.483333] ---[ end trace 7648e12f817ef2ad ]---

This case is already handled for struct page by taking the alignment of
node_start_pfn into account, so this patch follows the same approach to
fix the situation.
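
As a rough worked example (taking MAX_ORDER_NR_PAGES as 1024 purely for
illustration):

	/*
	 * node_start_pfn == 1; the buddy check for the page at pfn 1
	 * looks at pfn 0.
	 *
	 *   before: offset = 0 - node_start_pfn       = -1  (out of range)
	 *   after:  offset = 0 - round_down(1, 1024)  =  0  (in range; the
	 *           extra MAX_ORDER_NR_PAGES entries reserved in
	 *           alloc_node_page_ext() keep the upper end valid as well)
	 */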

Reported-by: Fengguang Wu <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/page_ext.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/mm/page_ext.c b/mm/page_ext.c
index ce86485..184f3ef 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -112,7 +112,8 @@ struct page_ext *lookup_page_ext(struct page *page)
if (unlikely(!base))
return NULL;
#endif
- offset = pfn - NODE_DATA(page_to_nid(page))->node_start_pfn;
+ offset = pfn - round_down(node_start_pfn(page_to_nid(page)),
+ MAX_ORDER_NR_PAGES);
return base + offset;
}

@@ -126,6 +127,15 @@ static int __init alloc_node_page_ext(int nid)
if (!nr_pages)
return 0;

+ /*
+ * Need extra space if the node range is not aligned with
+ * MAX_ORDER_NR_PAGES. When the page allocator's buddy algorithm
+ * checks a buddy's status, the checked pfn could fall outside the
+ * exact node range.
+ */
+ if (!IS_ALIGNED(node_start_pfn(nid), MAX_ORDER_NR_PAGES) ||
+ !IS_ALIGNED(node_end_pfn(nid), MAX_ORDER_NR_PAGES))
+ nr_pages += MAX_ORDER_NR_PAGES;
+
table_size = sizeof(struct page_ext) * nr_pages;

base = memblock_virt_alloc_try_nid_nopanic(
--
1.7.9.5

2014-12-03 13:24:20

by Chintan Pandya

[permalink] [raw]
Subject: Re: [PATCH v3 6/8] mm/page_owner: keep track of page owners

Hi Kim,

This is really useful stuff that you are doing, and the runtime
allocation for storing the page owner stack is a good call.

Along with that, we also use an extended version of the original page_owner
patch. The extension is to store the stack trace at the time the page is
freed. That indeed eats up space (roughly double the original page_owner),
but it helps in debugging some crucial issues, like illegitimate frees and
finding leaked pages (if we store their timestamps). The same has been
useful in finding double-free cases in drivers. But we never got a chance
to upstream that. Now that these patches are being discussed again, do you
think it would be a good idea to integrate it into the same series of
patches?
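
A purely illustrative sketch of the extension described above (the free_*
fields are hypothetical and not part of any posted patch; struct stack_trace
and gfp_t come from the headers already included by patch 6/8):

struct page_ext {
	unsigned long flags;
#ifdef CONFIG_PAGE_OWNER
	unsigned int order;
	gfp_t gfp_mask;
	struct stack_trace trace;
	unsigned long trace_entries[8];
	/* hypothetical: also record the stack of the last free */
	struct stack_trace free_trace;
	unsigned long free_trace_entries[8];
#endif
};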

Thanks,


On 11/24/2014 01:45 PM, Joonsoo Kim wrote:
> This is the page owner tracking code which is introduced
> so far ago. It is resident on Andrew's tree, though, nobody
> tried to upstream so it remain as is. Our company uses this feature
> actively to debug memory leak or to find a memory hogger so
> I decide to upstream this feature.
>
> This functionality help us to know who allocates the page.
> When allocating a page, we store some information about
> allocation in extra memory. Later, if we need to know
> status of all pages, we can get and analyze it from this stored
> information.
>
> In previous version of this feature, extra memory is statically defined
> in struct page, but, in this version, extra memory is allocated outside
> of struct page. It enables us to turn on/off this feature at boottime
> without considerable memory waste.
>
> Although we already have tracepoint for tracing page allocation/free,
> using it to analyze page owner is rather complex. We need to enlarge
> the trace buffer for preventing overlapping until userspace program
> launched. And, launched program continually dump out the trace buffer
> for later analysis and it would change system behaviour with more
> possibility rather than just keeping it in memory, so bad for debug.
>
> Moreover, we can use page_owner feature further for various purposes.
> For example, we can use it for fragmentation statistics implemented in
> this patch. And, I also plan to implement some CMA failure debugging
> feature using this interface.
>
> I'd like to give the credit for all developers contributed this feature,
> but, it's not easy because I don't know exact history. Sorry about that.
> Below is people who has "Signed-off-by" in the patches in Andrew's tree.
>
> Contributor:
> Alexander Nyberg<[email protected]>
> Mel Gorman<[email protected]>
> Dave Hansen<[email protected]>
> Minchan Kim<[email protected]>
> Michal Nazarewicz<[email protected]>
> Andrew Morton<[email protected]>
> Jungsoo Son<[email protected]>
>
> v3: Make the default behaviour disabled.
> Change boot parameter from disabling switch to enabling one.
> Inline a check whether page owner is initialized or not
> to minimize runtime overhead when disabled.
> Fix infinite loop condition in fragmentation statistics
> Disable fragmentation statistics if page owner isn't initialized.
>
> v2: Do set_page_owner() more places than v1. This corrects page owner
> information of memory for alloc_pages_exact() and compaction/CMA.
>
> Signed-off-by: Joonsoo Kim<[email protected]>
> ---
> Documentation/kernel-parameters.txt | 6 +
> include/linux/page_ext.h | 10 ++
> include/linux/page_owner.h | 38 ++++++
> lib/Kconfig.debug | 16 +++
> mm/Makefile | 1 +
> mm/page_alloc.c | 11 +-
> mm/page_ext.c | 4 +
> mm/page_owner.c | 222 +++++++++++++++++++++++++++++++++++
> mm/vmstat.c | 101 ++++++++++++++++
> tools/vm/Makefile | 4 +-
> tools/vm/page_owner_sort.c | 144 +++++++++++++++++++++++
> 11 files changed, 554 insertions(+), 3 deletions(-)
> create mode 100644 include/linux/page_owner.h
> create mode 100644 mm/page_owner.c
> create mode 100644 tools/vm/page_owner_sort.c
>
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index b5ac055..c8c4446 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -2515,6 +2515,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> OSS [HW,OSS]
> See Documentation/sound/oss/oss-parameters.txt
>
> + page_owner= [KNL] Boot-time page_owner enabling option.
> + Storage of the information about who allocated
> + each page is disabled in default. With this switch,
> + we can turn it on.
> + on: enable the feature
> +
> panic= [KNL] Kernel behaviour on panic: delay<timeout>
> timeout> 0: seconds before rebooting
> timeout = 0: wait forever
> diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
> index 61c0f05..d2a2c84 100644
> --- a/include/linux/page_ext.h
> +++ b/include/linux/page_ext.h
> @@ -1,6 +1,9 @@
> #ifndef __LINUX_PAGE_EXT_H
> #define __LINUX_PAGE_EXT_H
>
> +#include<linux/types.h>
> +#include<linux/stacktrace.h>
> +
> struct pglist_data;
> struct page_ext_operations {
> bool (*need)(void);
> @@ -22,6 +25,7 @@ struct page_ext_operations {
> enum page_ext_flags {
> PAGE_EXT_DEBUG_POISON, /* Page is poisoned */
> PAGE_EXT_DEBUG_GUARD,
> + PAGE_EXT_OWNER,
> };
>
> /*
> @@ -33,6 +37,12 @@ enum page_ext_flags {
> */
> struct page_ext {
> unsigned long flags;
> +#ifdef CONFIG_PAGE_OWNER
> + unsigned int order;
> + gfp_t gfp_mask;
> + struct stack_trace trace;
> + unsigned long trace_entries[8];
> +#endif
> };
>
> extern void pgdat_page_ext_init(struct pglist_data *pgdat);
> diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h
> new file mode 100644
> index 0000000..b48c347
> --- /dev/null
> +++ b/include/linux/page_owner.h
> @@ -0,0 +1,38 @@
> +#ifndef __LINUX_PAGE_OWNER_H
> +#define __LINUX_PAGE_OWNER_H
> +
> +#ifdef CONFIG_PAGE_OWNER
> +extern bool page_owner_inited;
> +extern struct page_ext_operations page_owner_ops;
> +
> +extern void __reset_page_owner(struct page *page, unsigned int order);
> +extern void __set_page_owner(struct page *page,
> + unsigned int order, gfp_t gfp_mask);
> +
> +static inline void reset_page_owner(struct page *page, unsigned int order)
> +{
> + if (likely(!page_owner_inited))
> + return;
> +
> + __reset_page_owner(page, order);
> +}
> +
> +static inline void set_page_owner(struct page *page,
> + unsigned int order, gfp_t gfp_mask)
> +{
> + if (likely(!page_owner_inited))
> + return;
> +
> + __set_page_owner(page, order, gfp_mask);
> +}
> +#else
> +static inline void reset_page_owner(struct page *page, unsigned int order)
> +{
> +}
> +static inline void set_page_owner(struct page *page,
> + unsigned int order, gfp_t gfp_mask)
> +{
> +}
> +
> +#endif /* CONFIG_PAGE_OWNER */
> +#endif /* __LINUX_PAGE_OWNER_H */
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index c078a76..8864e90 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -227,6 +227,22 @@ config UNUSED_SYMBOLS
> you really need it, and what the merge plan to the mainline kernel for
> your module is.
>
> +config PAGE_OWNER
> + bool "Track page owner"
> + depends on DEBUG_KERNEL && STACKTRACE_SUPPORT
> + select DEBUG_FS
> + select STACKTRACE
> + select PAGE_EXTENSION
> + help
> + This keeps track of which call chain is the owner of a page; it may
> + help to find bare alloc_page(s) leaks. Even if you include this
> + feature in your build, it is disabled by default. You should pass
> + "page_owner=on" as a boot parameter in order to enable it. It eats
> + a fair amount of memory if enabled. See tools/vm/page_owner_sort.c
> + for a user-space helper.
> +
> + If unsure, say N.
> +
> config DEBUG_FS
> bool "Debug Filesystem"
> help
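
As a quick aside (not part of the patch itself), using the page_owner feature
described in the PAGE_OWNER help text above takes both the compile-time option
and the boot parameter, roughly:

    # .config fragment; DEBUG_FS, STACKTRACE and PAGE_EXTENSION are selected
    # automatically by the "select" lines above
    CONFIG_PAGE_OWNER=y

    # appended to the kernel command line by the bootloader
    page_owner=on

The debugfs paths used later in this patch assume debugfs is mounted at
/sys/kernel/debug.
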
> diff --git a/mm/Makefile b/mm/Makefile
> index 0b7a784..3548460 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -63,6 +63,7 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
> obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
> obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
> obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
> +obj-$(CONFIG_PAGE_OWNER) += page_owner.o
> obj-$(CONFIG_CLEANCACHE) += cleancache.o
> obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
> obj-$(CONFIG_ZPOOL) += zpool.o
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4eea173..f1968d7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -59,6 +59,7 @@
> #include <linux/page_ext.h>
> #include <linux/hugetlb.h>
> #include <linux/sched/rt.h>
> +#include <linux/page_owner.h>
>
> #include <asm/sections.h>
> #include <asm/tlbflush.h>
> @@ -810,6 +811,8 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
> if (bad)
> return false;
>
> + reset_page_owner(page, order);
> +
> if (!PageHighMem(page)) {
> debug_check_no_locks_freed(page_address(page),
> PAGE_SIZE << order);
> @@ -985,6 +988,8 @@ static int prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags)
> if (order && (gfp_flags & __GFP_COMP))
> prep_compound_page(page, order);
>
> + set_page_owner(page, order, gfp_flags);
> +
> return 0;
> }
>
> @@ -1557,8 +1562,11 @@ void split_page(struct page *page, unsigned int order)
> split_page(virt_to_page(page[0].shadow), order);
> #endif
>
> - for (i = 1; i < (1 << order); i++)
> + set_page_owner(page, 0, 0);
> + for (i = 1; i < (1 << order); i++) {
> set_page_refcounted(page + i);
> + set_page_owner(page + i, 0, 0);
> + }
> }
> EXPORT_SYMBOL_GPL(split_page);
>
> @@ -1598,6 +1606,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
> }
> }
>
> + set_page_owner(page, order, 0);
> return 1UL << order;
> }
>
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index ede4d1e..ce86485 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -5,6 +5,7 @@
> #include <linux/memory.h>
> #include <linux/vmalloc.h>
> #include <linux/kmemleak.h>
> +#include <linux/page_owner.h>
>
> /*
> * struct page extension
> @@ -55,6 +56,9 @@ static struct page_ext_operations *page_ext_ops[] = {
> #ifdef CONFIG_PAGE_POISONING
> &page_poisoning_ops,
> #endif
> +#ifdef CONFIG_PAGE_OWNER
> + &page_owner_ops,
> +#endif
> };
>
> static unsigned long total_usage;
> diff --git a/mm/page_owner.c b/mm/page_owner.c
> new file mode 100644
> index 0000000..85eec7e
> --- /dev/null
> +++ b/mm/page_owner.c
> @@ -0,0 +1,222 @@
> +#include <linux/debugfs.h>
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/bootmem.h>
> +#include <linux/stacktrace.h>
> +#include <linux/page_owner.h>
> +#include "internal.h"
> +
> +static bool page_owner_disabled = true;
> +bool page_owner_inited __read_mostly;
> +
> +static int early_page_owner_param(char *buf)
> +{
> + if (!buf)
> + return -EINVAL;
> +
> + if (strcmp(buf, "on") == 0)
> + page_owner_disabled = false;
> +
> + return 0;
> +}
> +early_param("page_owner", early_page_owner_param);
> +
> +static bool need_page_owner(void)
> +{
> + if (page_owner_disabled)
> + return false;
> +
> + return true;
> +}
> +
> +static void init_page_owner(void)
> +{
> + if (page_owner_disabled)
> + return;
> +
> + page_owner_inited = true;
> +}
> +
> +struct page_ext_operations page_owner_ops = {
> + .need = need_page_owner,
> + .init = init_page_owner,
> +};
> +
> +void __reset_page_owner(struct page *page, unsigned int order)
> +{
> + int i;
> + struct page_ext *page_ext;
> +
> + for (i = 0; i < (1 << order); i++) {
> + page_ext = lookup_page_ext(page + i);
> + __clear_bit(PAGE_EXT_OWNER, &page_ext->flags);
> + }
> +}
> +
> +void __set_page_owner(struct page *page, unsigned int order, gfp_t gfp_mask)
> +{
> + struct page_ext *page_ext;
> + struct stack_trace *trace;
> +
> + page_ext = lookup_page_ext(page);
> +
> + trace = &page_ext->trace;
> + trace->nr_entries = 0;
> + trace->max_entries = ARRAY_SIZE(page_ext->trace_entries);
> + trace->entries = &page_ext->trace_entries[0];
> + trace->skip = 3;
> + save_stack_trace(&page_ext->trace);
> +
> + page_ext->order = order;
> + page_ext->gfp_mask = gfp_mask;
> +
> + __set_bit(PAGE_EXT_OWNER, &page_ext->flags);
> +}
> +
> +static ssize_t
> +print_page_owner(char __user *buf, size_t count, unsigned long pfn,
> + struct page *page, struct page_ext *page_ext)
> +{
> + int ret;
> + int pageblock_mt, page_mt;
> + char *kbuf;
> +
> + kbuf = kmalloc(count, GFP_KERNEL);
> + if (!kbuf)
> + return -ENOMEM;
> +
> + ret = snprintf(kbuf, count,
> + "Page allocated via order %u, mask 0x%x\n",
> + page_ext->order, page_ext->gfp_mask);
> +
> + if (ret >= count)
> + goto err;
> +
> + /* Print information relevant to grouping pages by mobility */
> + pageblock_mt = get_pfnblock_migratetype(page, pfn);
> + page_mt = gfpflags_to_migratetype(page_ext->gfp_mask);
> + ret += snprintf(kbuf + ret, count - ret,
> + "PFN %lu Block %lu type %d %s Flags %s%s%s%s%s%s%s%s%s%s%s%s\n",
> + pfn,
> + pfn >> pageblock_order,
> + pageblock_mt,
> + pageblock_mt != page_mt ? "Fallback" : " ",
> + PageLocked(page) ? "K" : " ",
> + PageError(page) ? "E" : " ",
> + PageReferenced(page) ? "R" : " ",
> + PageUptodate(page) ? "U" : " ",
> + PageDirty(page) ? "D" : " ",
> + PageLRU(page) ? "L" : " ",
> + PageActive(page) ? "A" : " ",
> + PageSlab(page) ? "S" : " ",
> + PageWriteback(page) ? "W" : " ",
> + PageCompound(page) ? "C" : " ",
> + PageSwapCache(page) ? "B" : " ",
> + PageMappedToDisk(page) ? "M" : " ");
> +
> + if (ret >= count)
> + goto err;
> +
> + ret += snprint_stack_trace(kbuf + ret, count - ret,
> + &page_ext->trace, 0);
> + if (ret >= count)
> + goto err;
> +
> + ret += snprintf(kbuf + ret, count - ret, "\n");
> + if (ret >= count)
> + goto err;
> +
> + if (copy_to_user(buf, kbuf, ret))
> + ret = -EFAULT;
> +
> + kfree(kbuf);
> + return ret;
> +
> +err:
> + kfree(kbuf);
> + return -ENOMEM;
> +}
> +
> +static ssize_t
> +read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos)
> +{
> + unsigned long pfn;
> + struct page *page;
> + struct page_ext *page_ext;
> +
> + if (!page_owner_inited)
> + return -EINVAL;
> +
> + page = NULL;
> + pfn = min_low_pfn + *ppos;
> +
> + /* Find a valid PFN or the start of a MAX_ORDER_NR_PAGES area */
> + while (!pfn_valid(pfn) && (pfn & (MAX_ORDER_NR_PAGES - 1)) != 0)
> + pfn++;
> +
> + drain_all_pages(NULL);
> +
> + /* Find an allocated page */
> + for (; pfn < max_pfn; pfn++) {
> + /*
> + * If the new page is in a new MAX_ORDER_NR_PAGES area,
> + * validate the area as existing, skip it if not
> + */
> + if ((pfn & (MAX_ORDER_NR_PAGES - 1)) == 0 && !pfn_valid(pfn)) {
> + pfn += MAX_ORDER_NR_PAGES - 1;
> + continue;
> + }
> +
> + /* Check for holes within a MAX_ORDER area */
> + if (!pfn_valid_within(pfn))
> + continue;
> +
> + page = pfn_to_page(pfn);
> + if (PageBuddy(page)) {
> + unsigned long freepage_order = page_order_unsafe(page);
> +
> + if (freepage_order < MAX_ORDER)
> + pfn += (1UL << freepage_order) - 1;
> + continue;
> + }
> +
> + page_ext = lookup_page_ext(page);
> +
> + /*
> + * Pages allocated before initialization of page_owner are
> + * non-buddy and have no page_owner info.
> + */
> + if (!test_bit(PAGE_EXT_OWNER, &page_ext->flags))
> + continue;
> +
> + /* Record the next PFN to read in the file offset */
> + *ppos = (pfn - min_low_pfn) + 1;
> +
> + return print_page_owner(buf, count, pfn, page, page_ext);
> + }
> +
> + return 0;
> +}
> +
> +static const struct file_operations proc_page_owner_operations = {
> + .read = read_page_owner,
> +};
> +
> +static int __init pageowner_init(void)
> +{
> + struct dentry *dentry;
> +
> + if (!page_owner_inited) {
> + pr_info("page_owner is disabled\n");
> + return 0;
> + }
> +
> + dentry = debugfs_create_file("page_owner", S_IRUSR, NULL,
> + NULL, &proc_page_owner_operations);
> + if (IS_ERR(dentry))
> + return PTR_ERR(dentry);
> +
> + return 0;
> +}
> +module_init(pageowner_init)
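
For illustration only (this is not part of the patch), a minimal userspace
reader for the debugfs file created above could look like the sketch below.
It assumes debugfs is mounted at /sys/kernel/debug; read_page_owner() above
returns one allocated page's record per read() call:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            char buf[4096];
            ssize_t len;
            int fd = open("/sys/kernel/debug/page_owner", O_RDONLY);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }

            /* each read() yields order, gfp mask and stack trace of one page */
            while ((len = read(fd, buf, sizeof(buf))) > 0)
                    fwrite(buf, 1, len, stdout);

            close(fd);
            return 0;
    }

In practice "cat /sys/kernel/debug/page_owner" does the same job; the
page_owner_sort helper added later in this patch post-processes that output.
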
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 1b12d39..b090e9e 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -22,6 +22,8 @@
> #include <linux/writeback.h>
> #include <linux/compaction.h>
> #include <linux/mm_inline.h>
> +#include <linux/page_ext.h>
> +#include <linux/page_owner.h>
>
> #include "internal.h"
>
> @@ -1017,6 +1019,104 @@ static int pagetypeinfo_showblockcount(struct seq_file *m, void *arg)
> return 0;
> }
>
> +#ifdef CONFIG_PAGE_OWNER
> +static void pagetypeinfo_showmixedcount_print(struct seq_file *m,
> + pg_data_t *pgdat,
> + struct zone *zone)
> +{
> + struct page *page;
> + struct page_ext *page_ext;
> + unsigned long pfn = zone->zone_start_pfn, block_end_pfn;
> + unsigned long end_pfn = pfn + zone->spanned_pages;
> + unsigned long count[MIGRATE_TYPES] = { 0, };
> + int pageblock_mt, page_mt;
> + int i;
> +
> + /* Scan block by block. First and last block may be incomplete */
> + pfn = zone->zone_start_pfn;
> +
> + /*
> + * Walk the zone in pageblock_nr_pages steps. If a page block spans
> + * a zone boundary, it will be double counted between zones. This does
> + * not matter as the mixed block count will still be correct
> + */
> + for (; pfn < end_pfn; ) {
> + if (!pfn_valid(pfn)) {
> + pfn = ALIGN(pfn + 1, MAX_ORDER_NR_PAGES);
> + continue;
> + }
> +
> + block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
> + block_end_pfn = min(block_end_pfn, end_pfn);
> +
> + page = pfn_to_page(pfn);
> + pageblock_mt = get_pfnblock_migratetype(page, pfn);
> +
> + for (; pfn < block_end_pfn; pfn++) {
> + if (!pfn_valid_within(pfn))
> + continue;
> +
> + page = pfn_to_page(pfn);
> + if (PageBuddy(page)) {
> + pfn += (1UL << page_order(page)) - 1;
> + continue;
> + }
> +
> + if (PageReserved(page))
> + continue;
> +
> + page_ext = lookup_page_ext(page);
> +
> + if (!test_bit(PAGE_EXT_OWNER, &page_ext->flags))
> + continue;
> +
> + page_mt = gfpflags_to_migratetype(page_ext->gfp_mask);
> + if (pageblock_mt != page_mt) {
> + if (is_migrate_cma(pageblock_mt))
> + count[MIGRATE_MOVABLE]++;
> + else
> + count[pageblock_mt]++;
> +
> + pfn = block_end_pfn;
> + break;
> + }
> + pfn += (1UL << page_ext->order) - 1;
> + }
> + }
> +
> + /* Print counts */
> + seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
> + for (i = 0; i < MIGRATE_TYPES; i++)
> + seq_printf(m, "%12lu ", count[i]);
> + seq_putc(m, '\n');
> +}
> +#endif /* CONFIG_PAGE_OWNER */
> +
> +/*
> + * Print out the number of pageblocks for each migratetype that contain pages
> + * of other types. This gives an indication of how well fallbacks are being
> + * contained by rmqueue_fallback(). It requires information from PAGE_OWNER
> + * to determine what is going on
> + */
> +static void pagetypeinfo_showmixedcount(struct seq_file *m, pg_data_t *pgdat)
> +{
> +#ifdef CONFIG_PAGE_OWNER
> + int mtype;
> +
> + if (!page_owner_inited)
> + return;
> +
> + drain_all_pages(NULL);
> +
> + seq_printf(m, "\n%-23s", "Number of mixed blocks ");
> + for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
> + seq_printf(m, "%12s ", migratetype_names[mtype]);
> + seq_putc(m, '\n');
> +
> + walk_zones_in_node(m, pgdat, pagetypeinfo_showmixedcount_print);
> +#endif /* CONFIG_PAGE_OWNER */
> +}
> +
> /*
> * This prints out statistics in relation to grouping pages by mobility.
> * It is expensive to collect so do not constantly read the file.
> @@ -1034,6 +1134,7 @@ static int pagetypeinfo_show(struct seq_file *m, void *arg)
> seq_putc(m, '\n');
> pagetypeinfo_showfree(m, pgdat);
> pagetypeinfo_showblockcount(m, pgdat);
> + pagetypeinfo_showmixedcount(m, pgdat);
>
> return 0;
> }
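
To give a feel for the result (hypothetical numbers; the layout follows the
seq_printf() calls above), the new row and per-zone counts in
/proc/pagetypeinfo look roughly like:

    Number of mixed blocks     Unmovable  Reclaimable      Movable      Reserve          CMA      Isolate
    Node 0, zone    DMA32              2            1            0            0            0            0
    Node 0, zone   Normal              5            0            0            0            0            0

A non-zero count means that pageblocks of that migratetype contain pages which
were allocated with a different migratetype, i.e. fallbacks have polluted them.
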
> diff --git a/tools/vm/Makefile b/tools/vm/Makefile
> index 3d907da..ac884b6 100644
> --- a/tools/vm/Makefile
> +++ b/tools/vm/Makefile
> @@ -1,6 +1,6 @@
> # Makefile for vm tools
> #
> -TARGETS=page-types slabinfo
> +TARGETS=page-types slabinfo page_owner_sort
>
> LIB_DIR = ../lib/api
> LIBS = $(LIB_DIR)/libapikfs.a
> @@ -18,5 +18,5 @@ $(LIBS):
> $(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)
>
> clean:
> - $(RM) page-types slabinfo
> + $(RM) page-types slabinfo page_owner_sort
> make -C $(LIB_DIR) clean
> diff --git a/tools/vm/page_owner_sort.c b/tools/vm/page_owner_sort.c
> new file mode 100644
> index 0000000..77147b4
> --- /dev/null
> +++ b/tools/vm/page_owner_sort.c
> @@ -0,0 +1,144 @@
> +/*
> + * User-space helper to sort the output of /sys/kernel/debug/page_owner
> + *
> + * Example use:
> + * cat /sys/kernel/debug/page_owner > page_owner_full.txt
> + * grep -v ^PFN page_owner_full.txt > page_owner.txt
> + * ./sort page_owner.txt sorted_page_owner.txt
> +*/
> +
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +#include <fcntl.h>
> +#include <unistd.h>
> +#include <string.h>
> +
> +struct block_list {
> + char *txt;
> + int len;
> + int num;
> +};
> +
> +
> +static struct block_list *list;
> +static int list_size;
> +static int max_size;
> +
> +struct block_list *block_head;
> +
> +int read_block(char *buf, int buf_size, FILE *fin)
> +{
> + char *curr = buf, *const buf_end = buf + buf_size;
> +
> + while (buf_end - curr > 1 && fgets(curr, buf_end - curr, fin)) {
> + if (*curr == '\n') /* empty line */
> + return curr - buf;
> + curr += strlen(curr);
> + }
> +
> + return -1; /* EOF or no space left in buf. */
> +}
> +
> +static int compare_txt(const void *p1, const void *p2)
> +{
> + const struct block_list *l1 = p1, *l2 = p2;
> +
> + return strcmp(l1->txt, l2->txt);
> +}
> +
> +static int compare_num(const void *p1, const void *p2)
> +{
> + const struct block_list *l1 = p1, *l2 = p2;
> +
> + return l2->num - l1->num;
> +}
> +
> +static void add_list(char *buf, int len)
> +{
> + if (list_size != 0 &&
> + len == list[list_size-1].len &&
> + memcmp(buf, list[list_size-1].txt, len) == 0) {
> + list[list_size-1].num++;
> + return;
> + }
> + if (list_size == max_size) {
> + printf("max_size too small??\n");
> + exit(1);
> + }
> + list[list_size].txt = malloc(len+1);
> + list[list_size].len = len;
> + list[list_size].num = 1;
> + memcpy(list[list_size].txt, buf, len);
> + list[list_size].txt[len] = 0;
> + list_size++;
> + if (list_size % 1000 == 0) {
> + printf("loaded %d\r", list_size);
> + fflush(stdout);
> + }
> +}
> +
> +#define BUF_SIZE 1024
> +
> +int main(int argc, char **argv)
> +{
> + FILE *fin, *fout;
> + char buf[BUF_SIZE];
> + int ret, i, count;
> + struct block_list *list2;
> + struct stat st;
> +
> + if (argc < 3) {
> + printf("Usage: ./program <input> <output>\n");
> + perror("open: ");
> + exit(1);
> + }
> +
> + fin = fopen(argv[1], "r");
> + fout = fopen(argv[2], "w");
> + if (!fin || !fout) {
> + printf("Usage: ./program<input> <output>\n");
> + perror("open: ");
> + exit(1);
> + }
> +
> + fstat(fileno(fin), &st);
> + max_size = st.st_size / 100; /* hack ... */
> +
> + list = malloc(max_size * sizeof(*list));
> +
> + for ( ; ; ) {
> + ret = read_block(buf, BUF_SIZE, fin);
> + if (ret < 0)
> + break;
> +
> + add_list(buf, ret);
> + }
> +
> + printf("loaded %d\n", list_size);
> +
> + printf("sorting ....\n");
> +
> + qsort(list, list_size, sizeof(list[0]), compare_txt);
> +
> + list2 = malloc(sizeof(*list) * list_size);
> +
> + printf("culling\n");
> +
> + for (i = count = 0; i < list_size; i++) {
> + if (count == 0 ||
> + strcmp(list2[count-1].txt, list[i].txt) != 0) {
> + list2[count++] = list[i];
> + } else {
> + list2[count-1].num += list[i].num;
> + }
> + }
> +
> + qsort(list2, count, sizeof(list[0]), compare_num);
> +
> + for (i = 0; i < count; i++)
> + fprintf(fout, "%d times:\n%s\n", list2[i].num, list2[i].txt);
> +
> + return 0;
> +}


--
Chintan Pandya

QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation

2014-12-04 06:52:52

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH v3 6/8] mm/page_owner: keep track of page owners

On Wed, Dec 03, 2014 at 06:54:08PM +0530, Chintan Pandya wrote:
> Hi Kim,

Hello, Chintan.

>
> This is really useful stuff that you are doing. And the runtime
> allocation for storing page owner stack is a good call.
>
> Along with that, we also use an extended version of the original
> page_owner patch. The extension is to store the stack trace at the time
> of freeing the page. That will indeed eat up space (just double the
> original page_owner), but it helps in debugging some crucial issues,
> like illegitimate frees and finding leaked pages (if we

Sounds really interesting. I hope to see it.

> store their time stamps), etc. The same has been useful in finding
> double-free cases in drivers. But we never got a chance to
> upstream that. Now that these patches are being discussed again, do
> you think it would be a good idea to integrate it into the same series
> of patches?

Good to hear. I think that you can upstream it separately. If you send
the patch, I will review it and help to merge it.

Thanks.
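
For reference, a purely hypothetical sketch of the extension Chintan describes
(names and the config option are illustrative, not from any posted patch) would
add a second trace to struct page_ext and fill it from the free path, mirroring
__set_page_owner():

    /* hypothetical additions to struct page_ext */
    #ifdef CONFIG_PAGE_OWNER_FREE_TRACK
            struct stack_trace free_trace;
            unsigned long free_trace_entries[8];
            unsigned long free_jiffies;     /* time stamp of the last free */
    #endif

    /* hypothetical hook, called from free_pages_prepare() next to reset_page_owner() */
    static void set_page_free_owner(struct page *page)
    {
            struct page_ext *page_ext = lookup_page_ext(page);

            page_ext->free_trace.nr_entries = 0;
            page_ext->free_trace.max_entries = ARRAY_SIZE(page_ext->free_trace_entries);
            page_ext->free_trace.entries = &page_ext->free_trace_entries[0];
            page_ext->free_trace.skip = 3;
            save_stack_trace(&page_ext->free_trace);
            page_ext->free_jiffies = jiffies;
    }

Comparing the allocation-time and free-time traces (plus the time stamp) is
what would make double frees and long-lived leaked pages visible, as described
above.
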

2014-12-22 09:10:47

by Paul Bolle

[permalink] [raw]
Subject: Re: [PATCH v3 2/8] mm/debug-pagealloc: prepare boottime configurable on/off

Hi Joonsoo,

On Fri, 2014-11-28 at 16:35 +0900, Joonsoo Kim wrote:
> Hello, Paul.
>
> Thanks for spotting this.
> I attach the patch. :)
>
> Andrew,
> Could you kindly fold this into the patch in your tree?
>
> Thanks.
>
> ------------------->8---------------
> From a33c480160904cc93333807a448960151ac4c534 Mon Sep 17 00:00:00 2001
> From: Joonsoo Kim <[email protected]>
> Date: Fri, 28 Nov 2014 16:05:32 +0900
> Subject: [PATCH] mm/debug_pagealloc: remove obsolete Kconfig options
>
> These are obsolete since the commit "mm/debug-pagealloc: prepare boottime
> configurable" was merged, so remove them.
>
> [[email protected]: find obsolete Kconfig options]
> Signed-off-by: Joonsoo Kim <[email protected]>
> ---
> mm/Kconfig.debug | 9 ---------
> 1 file changed, 9 deletions(-)
>
> diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
> index 56badfc..957d3da 100644
> --- a/mm/Kconfig.debug
> +++ b/mm/Kconfig.debug
> @@ -14,7 +14,6 @@ config DEBUG_PAGEALLOC
> depends on !KMEMCHECK
> select PAGE_EXTENSION
> select PAGE_POISONING if !ARCH_SUPPORTS_DEBUG_PAGEALLOC
> - select PAGE_GUARD if ARCH_SUPPORTS_DEBUG_PAGEALLOC
> ---help---
> Unmap pages from the kernel linear mapping after free_pages().
> This results in a large slowdown, but helps to find certain types
> @@ -27,13 +26,5 @@ config DEBUG_PAGEALLOC
> that would result in incorrect warnings of memory corruption after
> a resume because free pages are not saved to the suspend image.
>
> -config WANT_PAGE_DEBUG_FLAGS
> - bool
> -
> config PAGE_POISONING
> bool
> - select WANT_PAGE_DEBUG_FLAGS
> -
> -config PAGE_GUARD
> - bool
> - select WANT_PAGE_DEBUG_FLAGS

This patch didn't make it into v3.19-rc1. And I think it never entered
linux-next. Did this fall through the cracks or was there some other
issue with this patch?

Thanks,


Paul Bolle

2014-12-23 04:52:12

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH v3 2/8] mm/debug-pagealloc: prepare boottime configurable on/off

On Mon, Dec 22, 2014 at 10:10:42AM +0100, Paul Bolle wrote:
> Hi Joonsoo,
>
> On Fri, 2014-11-28 at 16:35 +0900, Joonsoo Kim wrote:
> > Hello, Paul.
> >
> > Thanks for spotting this.
> > I attach the patch. :)
> >
> > Andrew,
> > Could you kindly fold this into the patch in your tree?
> >
> > Thanks.
> >
> > ------------------->8---------------
> > From a33c480160904cc93333807a448960151ac4c534 Mon Sep 17 00:00:00 2001
> > From: Joonsoo Kim <[email protected]>
> > Date: Fri, 28 Nov 2014 16:05:32 +0900
> > Subject: [PATCH] mm/debug_pagealloc: remove obsolete Kconfig options
> >
> > These are obsolete since the commit "mm/debug-pagealloc: prepare boottime
> > configurable" was merged, so remove them.
> >
> > [[email protected]: find obsolete Kconfig options]
> > Signed-off-by: Joonsoo Kim <[email protected]>
> > ---
> > mm/Kconfig.debug | 9 ---------
> > 1 file changed, 9 deletions(-)
> >
> > diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
> > index 56badfc..957d3da 100644
> > --- a/mm/Kconfig.debug
> > +++ b/mm/Kconfig.debug
> > @@ -14,7 +14,6 @@ config DEBUG_PAGEALLOC
> > depends on !KMEMCHECK
> > select PAGE_EXTENSION
> > select PAGE_POISONING if !ARCH_SUPPORTS_DEBUG_PAGEALLOC
> > - select PAGE_GUARD if ARCH_SUPPORTS_DEBUG_PAGEALLOC
> > ---help---
> > Unmap pages from the kernel linear mapping after free_pages().
> > This results in a large slowdown, but helps to find certain types
> > @@ -27,13 +26,5 @@ config DEBUG_PAGEALLOC
> > that would result in incorrect warnings of memory corruption after
> > a resume because free pages are not saved to the suspend image.
> >
> > -config WANT_PAGE_DEBUG_FLAGS
> > - bool
> > -
> > config PAGE_POISONING
> > bool
> > - select WANT_PAGE_DEBUG_FLAGS
> > -
> > -config PAGE_GUARD
> > - bool
> > - select WANT_PAGE_DEBUG_FLAGS
>
> This patch didn't make it into v3.19-rc1. And I think it never entered
> linux-next. Did this fall through the cracks or was there some other
> issue with this patch?

Hello,

I guess it was just missed.
I re-sent the patch to Andrew a while ago.

Thank you for reporting.

Thanks.